
EACL 2012

13th Conference of the European Chapter of the


Association for Computational Linguistics

Proceedings of the Conference

April 23-27, 2012
Avignon, France

© 2012 The Association for Computational Linguistics

ISBN 978-1-937284-19-0

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)


209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
acl@aclweb.org

Preface: General Chair

Welcome to EACL 2012, the 13th Conference of the European Chapter of the Association for
Computational Linguistics. We are happy that despite strong competition from other Computational
Linguistics events and economic turmoil in many European countries, this EACL is comparable to the
successful previous ones, both in terms of the number of papers submitted and in terms of attendance. We
have a strong scientific program, including ten workshops, four tutorials, a demos session and a student
research workshop. I am convinced that you will appreciate our program.

What does a General Chair at EACL have to do? Not much, it turns out. My job was to act as a liaison
between the local organizing team, the scientific committees, and the EACL board, and to give advice
when needed. Looking back at the thousands of e-mails I was copied on reminded me of the Jerome K.
Jerome quote: "I like work. I can sit and look at it for hours." It has been an enjoyable experience to
cooperate with the many people who made this conference happen, and to see them work. I have learned
a lot from them.

The Program Committee at an ACL conference is a trained army of Area Chairs, Program Committee
members, and additional reviewers. Mirella Lapata and Lluís Màrquez commanded this particular one.
It is thanks to the voluntary peer reviewing work, year after year, of this large group of people, formed by
the top researchers in our field, that you will find a high-quality program. It is thanks to Mirella and Lluís
that you will not only find the quality we expect from EACL, but also innovation, coherence, breadth,
and depth. I can't thank them enough for their work on all aspects of the scientific program and for their
advice on virtually any other aspect of the organization. Many thanks also to Regina Barzilay, Raymond
Mooney, and Martin Cooke for agreeing to present invited lectures, thereby increasing the appeal of
this event even more.

As in previous years, the selection of the workshops of all ACL conferences in the same year is
coordinated in a single committee. For EACL, Kristiina Jokinen and Alessandro Moschitti collaborated
with the NAACL and ACL chairs in reviewing and selecting the workshops. As EACL is the first
conference of the three, they had to initiate the call for proposals and activate their colleagues long
before they were planning to. Thanks to their professionalism and efficiency, the process went very
smoothly, and the resulting workshops program reflects the diversity and maturity of the field. For
even more variation during the first two days of the conference, we also have a strong tutorial program.
Tutorial Chairs Lieve Macken and Eneko Agirre managed to attract an impressive list of high-quality
submissions and performed a thorough and thoughtful review and selection. It is truly a pity only four
could be accommodated in the program, but their quality and timeliness are inspiring. Many thanks to
Kristiina, Alessandro, Lieve, and Eneko for making this important part of the scientific program such a
success.

As in previous editions of EACL, the Student Research Workshop was organized by the student members
of the EACL board: Pierre Lison, Mattias Nilsson, and Marta Recasens, with help from faculty advisor
Laurence Danlos. Their task was a huge one: to organize a mini-conference within the conference.
This included finding reviewers, selecting papers, setting up a program for the student session, finding
mentors for the accepted papers, selecting a best paper award, ... The amount of work they did cannot
be overestimated, and the result is brilliant. Thank you! To round off the scientific program, we
have stimulating demonstration sessions, selected and coordinated by Demonstrations Chair Frédérique
Segond. Thank you for showing so clearly the rapid progress application-oriented computational
linguistics is making.

Thanks also to Gertjan van Noord and Caroline Sporleder for accepting the role of coordinators of the
mentoring service. In the end, they didn't have to assign mentors, but it is important that such a service
is available when needed.

For EACL 2012 we decided to switch to digital proceedings only. They were available before the
conference from the website, during the conference on the memory stick you received with your
registration material, and afterwards from the website and the ACL Anthology. An exception was made
for the tutorial notes, which are available to participants on paper as well. I warned the Publications
Chairs, Adrià de Gispert and Fabrice Lefèvre, beforehand that theirs was probably the most demanding
and stressful task of the conference: making sure that huge volumes of material from so many sources are
available in time and in the right format, incorporating last minute corrections, and handling unavoidable
glitches in the publications software. It is a formidable task, but they completed it without flinching. We
all owe them our gratitude.

EACL seems to follow economic crises; let us hope this does not become a habit. Both the previous
conference in 2009 and the current one took place in grim economic times. Being a Sponsorship Chair
is not a happy occasion in such times. Nevertheless, both the international ACL Sponsorship Committee
(with Massimiliano Ciaramita as EACL member) and the local Sponsorship Chairs (Éric SanJuan and
Stéphane Huet) left no stone unturned looking for sponsors. We would have ended up in a much worse
financial situation if it hadn't been for their efforts. Thank you! And of course also many thanks to our
sponsors who, despite the economic situation, decided to help us financially with the conference. I am
convinced their investment will be rewarded.

Organizing large conferences like this is a complex undertaking, even with the help of extensive material
(the ACL conference handbook). Whenever in doubt, I have had the opportunity to interact with the
EACL Board, and occasionally with the ACL Board and with Priscilla Rasmussen. This has always been
a pleasure. I have learned that the people running our associations are dedicated, know everything, and
never sleep.

Last but not least, the local organizing team has had to carry the largest burden in the organization. The
sheer number of tasks and actions the local organizers of a conference like EACL have to assume is
astonishing. Marc El-Bèze has been a wonderful chair and his team (Frédéric Béchet, Yann Fernandez,
Stéphane Huet, Tania Jiménez, Fabrice Lefèvre, Georges Linarès, Alexis Nasr, Éric SanJuan, and Iria
Da Cunha) has done outstanding work. It is impossible to even begin listing the many tasks they had to
fulfill to make this a top conference. I am very grateful for all the work they put into the event and for
the stress-free and friendly cooperation. I am also grateful for the support of the University of Avignon.

I hope you will have many fond memories of EACL 2012, organized in these stunning surroundings
in Avignon, both about the exciting scientific program and about the superb social program and local
arrangements.

Walter Daelemans
General Chair
March 2012

Preface: Program Chairs

We are delighted to present you with this volume containing the papers accepted for presentation at
the 13th Conference of the European Chapter of the Association for Computational Linguistics, held in
Avignon, France, from April 23 to April 27, 2012.

EACL 2012 received 326 submissions. We were able to accept 85 papers in total (an acceptance rate
of 26%). 48 of the papers (14.7%) were accepted for oral presentation, and 34 (10.4%) for poster
presentation. One oral paper was subsequently withdrawn after acceptance. The papers were selected
by a program committee of 28 area chairs, from Asia, Europe, and North America, assisted by a panel
of 471 reviewers. Each submission was reviewed by three reviewers, who were furthermore encouraged
to discuss any divergences they might have, and the papers in each area were ranked by the area chairs.
The final selection was made by the program co-chairs after an independent check of all reviews and
discussions with the area chairs.

This year EACL introduced an author response period. Authors were able to read and respond to the
reviews of their paper before the program committee made a final decision. They were asked to correct
factual errors in the reviews and answer questions raised in the reviewers' comments. The intention was
to help produce more accurate reviews. In some cases, reviewers changed their scores in view of the
authors' response, and the area chairs read all responses carefully prior to making recommendations for
acceptance. Another new feature was to allow authors to include optional supplementary material in
addition to the paper itself (e.g., code, data sets, and resources). Finally, in an attempt to eliminate any
bias from the reviewing process we put in place a double-blind reviewing system where the identity of
the authors was not revealed to the area chairs.

After the program was selected, each of the area chairs was asked to nominate the best paper from his
or her area, or to explicitly decline to nominate any. This resulted in several nominations, out of which
three stood out and were considered in more detail by a dedicated committee chaired by Stephen
Clark. This independent committee selected the best paper of the conference, which will also be awarded
a prize sponsored by Google. The best paper and the other two finalists will be presented in plenary
sessions at the conference.

In addition to the main conference program, EACL 2012 will feature the now traditional Student
Research Workshop, 10 workshops, 4 tutorials and a demo session with 21 presentations. We are also
fortunate to have three invited speakers, Martin Cooke, Ikerbasque (Basque Foundation for Science),
Regina Barzilay, Massachusetts Institute of Technology, and Raymond Mooney, University of Texas at
Austin. Martin Cooke will speak about "Speech Communication in the Wild", Regina Barzilay will
discuss the topic of "Learning to Behave by Reading", and Raymond Mooney will present on "Learning
Language from Perceptual Context".

First and foremost, we would like to thank the authors who submitted their work to EACL. The sheer
number of submissions reflects how broad and active our field is. We are deeply indebted to the area
chairs and the reviewers for their hard work. They enabled us to select an exciting program and to
provide valuable feedback to the authors. We are grateful to our invited speakers who graciously agreed
to give talks at EACL. Additional thanks to the Publications Chairs, Adrià de Gispert and Fabrice
Lefèvre, who put this volume together. We are grateful to Rich Gerber and the START team, who
always responded to our questions quickly, and helped us manage the large number of submissions
smoothly. Thanks are due to the local organizing committee chair, Marc El-Bèze, for his cooperation
with us over many organisational issues. We are also grateful to the Student Research Workshop chairs,
Pierre Lison, Mattias Nilsson, and Marta Recasens, and the NAACL-HLT (Srinivas Bangalore, Eric
Fosler-Lussier and Ellen Riloff) and ACL (Chin-Yew Lin and Miles Osborne) program chairs for their
smooth collaboration in the handling of double submissions. Last but not least, we are indebted to the

General Chair, Walter Daelemans, for his guidance and support throughout the whole process.

We hope you enjoy the conference!

Mirella Lapata and Lluís Màrquez

EACL 2012 Program Chairs

Organizing Committee

General Chair:
Walter Daelemans, University of Antwerp, Belgium

Programme Committee Chairs:


Mirella Lapata, University of Edinburgh, UK
Lluís Màrquez, Universitat Politècnica de Catalunya, Spain

Area Chairs:
Katja Filippova, Google
Min-Yen Kan, National University of Singapore
Charles Sutton, University of Edinburgh
Ivan Titov, Saarland University
Xavier Carreras, Universitat Politècnica de Catalunya (UPC)
Kenji Sagae, University of Southern California
Kallirroi Georgila, Institute for Creative Technologies, University of Southern California
Michael Strube, HITS gGmbH
Pascale Fung, The Hong Kong University of Science and Technology
Bing Liu, University of Illinois at Chicago
Theresa Wilson, Johns Hopkins University
David McClosky, Stanford University
Sebastian Riedel, University of Massachusetts
Phil Blunsom, University of Oxford
Mikel L. Forcada, Universitat d'Alacant
Christof Monz, University of Amsterdam
Sharon Goldwater, University of Edinburgh
Richard Wicentowski, Swarthmore College
Patrick Pantel, Microsoft Research
Hiroya Takamura, Tokyo Institute of Technology
Alexander Koller, University of Potsdam
Sebastian Padó, Universität Heidelberg
Maarten de Rijke, University of Amsterdam
Julio Gonzalo, UNED
Lori Levin, Carnegie Mellon University
Piek Vossen, VU University Amsterdam
Afra Alishahi, Tilburg University, The Netherlands
John Hale, Cornell University

Workshop Committee chairs:


Kristiina Jokinen, University of Helsinki, Finland
Alessandro Moschitti, University of Trento, Italy

Tutorials Committee chairs:


Eneko Agirre, University of the Basque Country, Spain
Lieve Macken, University College Ghent, Belgium

Student research workshop chairs:
Pierre Lison, University of Oslo, Norway
Mattias Nilsson, Uppsala University, Sweden
Marta Recasens, University of Barcelona, Spain

Student research workshop faculty advisor:


Laurence Danlos, Université Paris 7

System Demonstrations Committee:


Frédérique Segond

Publications Committee:
Adrià de Gispert, University of Cambridge, UK
Fabrice Lefèvre, University of Avignon, France

Sponsorship Committee:
Massimiliano Ciaramita

Mentoring service:
Caroline Sporleder, Saarland University, Germany
Gertjan van Noord, University of Groningen, The Netherlands

Local Organising Committee:


Marc El-Bèze (Chair), University of Avignon, France
Frédéric Béchet (Publicity chair), University Aix-Marseille 2, France
Yann Fernandez, University of Avignon, France
Stéphane Huet (Exhibits local chair), University of Avignon, France
Tania Jiménez, University of Avignon, France
Fabrice Lefèvre, University of Avignon, France
Georges Linarès, University of Avignon, France
Alexis Nasr, University Aix-Marseille 2, France
Éric SanJuan (Sponsorship local chair), University of Avignon, France
Iria Da Cunha, Pompeu Fabra University, Spain

Table of Contents

Speech Communication in the Wild


Martin Cooke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Power-Law Distributions for Paraphrases Extracted from Bilingual Corpora


Spyros Martzoukos and Christof Monz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

A Bayesian Approach to Unsupervised Semantic Role Induction


Ivan Titov and Alexandre Klementiev . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

Entailment above the word level in distributional semantics


Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do and Chung-chieh Shan . . . . . . . . . . . . . . . . . . . 23

Evaluating Distributional Models of Semantics for Syntactically Invariant Inference


Jackie Chi Kit Cheung and Gerald Penn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

Cross-Framework Evaluation for Statistical Parsing


Reut Tsarfaty, Joakim Nivre and Evelina Andersson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

Dependency Parsing of Hungarian: Baseline Results and Challenges


Richard Farkas, Veronika Vincze and Helmut Schmid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

Dependency Parsing with Undirected Graphs


Carlos Gómez-Rodríguez and Daniel Fernández-González . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

The Best of Both Worlds - A Graph-based Completion Model for Transition-based Parsers


Bernd Bohnet and Jonas Kuhn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

Answer Sentence Retrieval by Matching Dependency Paths acquired from Question/Answer Sentence
Pairs
Michael Kaisser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

Can Click Patterns across Users' Query Logs Predict Answers to Definition Questions?
Alejandro Figueroa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

Adaptation of Statistical Machine Translation Model for Cross-Lingual Information Retrieval in a Ser-
vice Context
Vassilina Nikoulina, Bogomil Kovachev, Nikolaos Lagos and Christof Monz . . . . . . . . . . . . . . . . 109

Computing Lattice BLEU Oracle Scores for Machine Translation


Artem Sokolov, Guillaume Wisniewski and Francois Yvon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

Toward Statistical Machine Translation without Parallel Corpora


Alexandre Klementiev, Ann Irvine, Chris Callison-Burch and David Yarowsky . . . . . . . . . . . . . . 130

Character-Based Pivot Translation for Under-Resourced Languages and Domains


Jorg Tiedemann . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

Does more data always yield better translations?


Guillem Gasco, Martha-Alicia Rocha, German Sanchis-Trilles, Jesus Andres-Ferrer and Francisco
Casacuberta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

Recall-Oriented Learning of Named Entities in Arabic Wikipedia


Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer and Noah A. Smith . . . . 162

Tree Representations in Probabilistic Models for Extended Named Entities Detection
Marco Dinarelli and Sophie Rosset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

When Did that Happen? Linking Events and Relations to Timestamps


Dirk Hovy, James Fan, Alfio Gliozzo, Siddharth Patwardhan and Christopher Welty . . . . . . . . . . 185

Compensating for Annotation Errors in Training a Relation Extractor


Bonan Min and Ralph Grishman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

Incorporating Lexical Priors into Topic Models


Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

DualSum: a Topic-Model based approach for update summarization


Jean-Yves Delort and Enrique Alfonseca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

Large-Margin Learning of Submodular Summarization Models


Ruben Sipos, Pannaga Shivaswamy and Thorsten Joachims . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their
Meanings
Tom Kwiatkowski, Sharon Goldwater, Luke Zettlemoyer and Mark Steedman . . . . . . . . . . . . . . . 234

Active learning for interactive machine translation


Jesús González-Rubio, Daniel Ortiz-Martínez and Francisco Casacuberta . . . . . . . . . . . . . . . . . . . 245

Adapting Translation Models to Translationese Improves SMT


Gennadi Lembersky, Noam Ordan and Shuly Wintner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255

Aspectual Type and Temporal Relation Classification


Francisco Costa and Antonio Branco . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266

Automatic generation of short informative sentiment summaries


Andrea Glaser and Hinrich Schutze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

Bootstrapped Training of Event Extraction Classifiers


Ruihong Huang and Ellen Riloff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286

Bootstrapping Events and Relations from Text


Ting Liu and Tomek Strzalkowski . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296

CLex: A Lexicon for Exploring Color, Concept and Emotion Associations in Language
Svitlana Volkova, William B. Dolan and Theresa Wilson. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .306

Extending the Entity-based Coherence Model with Multiple Ranks


Vanessa Wei Feng and Graeme Hirst . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315

Generalization Methods for In-Domain and Cross-Domain Opinion Holder Extraction


Michael Wiegand and Dietrich Klakow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325

Skip N-grams and Ranking Functions for Predicting Script Events


Bram Jans, Steven Bethard, Ivan Vulic and Marie-Francine Moens . . . . . . . . . . . . . . . . . . . . . . . . . 336

The Problem with Kappa


David Martin Ward Powers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345

User Edits Classification Using Document Revision Histories
Amit Bronner and Christof Monz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356

User Participation Prediction in Online Forums


Zhonghua Qu and Yang Liu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367

Inferring Selectional Preferences from Part-Of-Speech N-grams


Hyeju Jang and Jack Mostow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377

WebCAGe - A Web-Harvested Corpus Annotated with GermaNet Senses


Verena Henrich, Erhard Hinrichs and Tatiana Vodolazova . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387

Learning to Behave by Reading


Regina Barzilay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397

Lexical surprisal as a general predictor of reading time


Irene Fernandez Monsalve, Stefan L. Frank and Gabriella Vigliocco . . . . . . . . . . . . . . . . . . . . . . . . 398

Spectral Learning for Non-Deterministic Dependency Parsing


Franco M. Luque, Ariadna Quattoni, Borja Balle and Xavier Carreras . . . . . . . . . . . . . . . . . . . . . . . 409

Combining Tree Structures, Flat Features and Patterns for Biomedical Relation Extraction
Md. Faisal Mahbub Chowdhury and Alberto Lavelli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420

Coordination Structure Analysis using Dual Decomposition


Atsushi Hanamoto, Takuya Matsuzaki and Jun'ichi Tsujii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430

Cutting the Long Tail: Hybrid Language Models for Translation Style Adaptation
Arianna Bisazza and Marcello Federico . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439

Detecting Highly Confident Word Translations from Comparable Corpora without Any Prior Knowledge
Ivan Vulic and Marie-Francine Moens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449

Efficient parsing with Linear Context-Free Rewriting Systems


Andreas van Cranenburgh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460

Evaluating language understanding accuracy with respect to objective outcomes in a dialogue system
Myroslava O. Dzikovska, Peter Bell, Amy Isard and Johanna D. Moore . . . . . . . . . . . . . . . . . . . . . 471

Experimenting with Distant Supervision for Emotion Classification


Matthew Purver and Stuart Battersby . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482

Feature-Rich Part-of-speech Tagging for Morphologically Complex Languages: Application to Bulgarian
Georgi Georgiev, Valentin Zhikov, Kiril Simov, Petya Osenova and Preslav Nakov . . . . . . . . . . . 492

Instance-Driven Attachment of Semantic Annotations over Conceptual Hierarchies


Janara Christensen and Marius Pasca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503

Joint Satisfaction of Syntactic and Pragmatic Constraints Improves Incremental Spoken Language Un-
derstanding
Andreas Peldszus, Okko Buß, Timo Baumann and David Schlangen . . . . . . . . . . . . . . . . . . . . . . . 514

Learning How to Conjugate the Romanian Verb. Rules for Regular and Partially Irregular Verbs
Liviu P. Dinu, Vlad Niculae and Octavia-Maria Sulea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524

Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History
Torsten Zesch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529

Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation
Rico Sennrich . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539

Subcat-LMF: Fleshing out a standardized format for subcategorization frame interoperability


Judith Eckle-Kohler and Iryna Gurevych . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550

The effect of domain and text type on text prediction quality


Suzan Verberne, Antal van den Bosch, Helmer Strik and Lou Boves . . . . . . . . . . . . . . . . . . . . . . . . 561

The Impact of Spelling Errors on Patent Search


Benno Stein, Dennis Hoppe and Tim Gollub . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570

UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF


Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer
and Christian Wirth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580

Word Sense Induction for Novel Sense Detection


Jey Han Lau, Paul Cook, Diana McCarthy, David Newman and Timothy Baldwin . . . . . . . . . . . . 591

Learning Language from Perceptual Context


Raymond Mooney . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602

Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter
Micol Marchetti-Bowick and Nathanael Chambers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603

Learning from evolving data streams: online triage of bug reports


Grzegorz Chrupala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613

Towards a model of formal and informal address in English


Manaal Faruqui and Sebastian Pado . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623

Character-based kernels for novelistic plot structure


Micha Elsner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634

Smart Paradigms and the Predictability and Complexity of Inflectional Morphology


Gregoire Detrez and Aarne Ranta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645

Probabilistic Hierarchical Clustering of Morphological Paradigms


Burcu Can and Suresh Manandhar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654

Modeling Inflection and Word-Formation in SMT


Alexander Fraser, Marion Weller, Aoife Cahill and Fabienne Cap . . . . . . . . . . . . . . . . . . . . . . . . . . . 664

Identifying Broken Plurals, Irregular Gender, and Rationality in Arabic Text


Sarah Alkuhlani and Nizar Habash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675

Framework of Semantic Role Assignment based on Extended Lexical Conceptual Structure: Comparison
with VerbNet and FrameNet
Yuichiroh Matsubayashi, Yusuke Miyao and Akiko Aizawa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686

Unsupervised Detection of Downward-Entailing Operators By Maximizing Classification Certainty


Jackie Chi Kit Cheung and Gerald Penn. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .696

Elliphant: Improved Automatic Detection of Zero Subjects and Impersonal Constructions in Spanish
Luz Rello, Ricardo Baeza-Yates and Ruslan Mitkov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706

Validation of sub-sentential paraphrases acquired from parallel monolingual corpora


Houda Bouamor, Aurelien Max and Anne Vilnat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716

Determining the placement of German verbs in English-to-German SMT


Anita Gojun and Alexander Fraser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726

Syntax-Based Word Ordering Incorporating a Large-Scale Language Model


Yue Zhang, Graeme Blackwood and Stephen Clark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736

Midge: Generating Image Descriptions From Computer Vision Detections


Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Xufeng Han, Alyssa
Mensch, Alex Berg, Tamara Berg and Hal Daume III . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747

Generation of landmark-based navigation instructions from open-source data


Markus Drager and Alexander Koller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757

To what extent does sentence-internal realisation reflect discourse context? A study on word order
Sina Zarrieß, Aoife Cahill and Jonas Kuhn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767

Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages


Oliver Ferschke, Iryna Gurevych and Yevgen Chebotar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 777

An Unsupervised Dynamic Bayesian Network Approach to Measuring Speech Style Accommodation


Mahaveer Jain, John McDonough, Gahgene Gweon, Bhiksha Raj and Carolyn Penstein Rose . 787

Learning the Fine-Grained Information Status of Discourse Entities


Altaf Rahman and Vincent Ng . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798

Composing extended top-down tree transducers


Aurelie Lagoutte, Fabienne Braune, Daniel Quernheim and Andreas Maletti . . . . . . . . . . . . . . . . . 808

Structural and Topical Dimensions in Multi-Task Patent Translation


Katharina Waeschle and Stefan Riezler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818

Not as Awful as it Seems: Explaining German Case through Computational Experiments in Fluid Con-
struction Grammar
Remi van Trijp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 829

Managing Uncertainty in Semantic Tagging


Silvie Cinková, Martin Holub and Vincent Kríž . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 840

Parallel and Nested Decomposition for Factoid Questions


Aditya Kalyanpur, Siddharth Patwardhan, Branimir Boguraev, Jennifer Chu-Carroll and Adam
Lally . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 851

Conference Program

Wednesday April 25, 2012

(8:45) Session 1: Plenary Session

9:00 Speech Communication in the Wild


Martin Cooke

(10:30) Session 2a: Semantics

10:30 Power-Law Distributions for Paraphrases Extracted from Bilingual Corpora


Spyros Martzoukos and Christof Monz

10:55 A Bayesian Approach to Unsupervised Semantic Role Induction


Ivan Titov and Alexandre Klementiev

11:20 Entailment above the word level in distributional semantics


Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do and Chung-chieh Shan

11:45 Evaluating Distributional Models of Semantics for Syntactically Invariant Inference


Jackie Chi Kit Cheung and Gerald Penn

(10:30) Session 2b: Parsing

10:30 Cross-Framework Evaluation for Statistical Parsing


Reut Tsarfaty, Joakim Nivre and Evelina Andersson

10:55 Dependency Parsing of Hungarian: Baseline Results and Challenges


Richard Farkas, Veronika Vincze and Helmut Schmid

11:20 Dependency Parsing with Undirected Graphs


Carlos Gómez-Rodríguez and Daniel Fernández-González

11:45 The Best of Both Worlds - A Graph-based Completion Model for Transition-based Parsers
Bernd Bohnet and Jonas Kuhn

Wednesday April 25, 2012 (continued)

(10:30) Session 2c: QA and IR

10:30 Answer Sentence Retrieval by Matching Dependency Paths acquired from Ques-
tion/Answer Sentence Pairs
Michael Kaisser

10:55 Can Click Patterns across Users' Query Logs Predict Answers to Definition Questions?
Alejandro Figueroa

11:20 Adaptation of Statistical Machine Translation Model for Cross-Lingual Information Re-
trieval in a Service Context
Vassilina Nikoulina, Bogomil Kovachev, Nikolaos Lagos and Christof Monz

(14:00) Session 3a: Machine Translation

14:00 Computing Lattice BLEU Oracle Scores for Machine Translation


Artem Sokolov, Guillaume Wisniewski and Francois Yvon

14:25 Toward Statistical Machine Translation without Parallel Corpora


Alexandre Klementiev, Ann Irvine, Chris Callison-Burch and David Yarowsky

14:50 Character-Based Pivot Translation for Under-Resourced Languages and Domains


Jorg Tiedemann

15:15 Does more data always yield better translations?


Guillem Gasco, Martha-Alicia Rocha, German Sanchis-Trilles, Jesus Andres-Ferrer and
Francisco Casacuberta

(14:00) Session 3b: Information Extraction

14:00 Recall-Oriented Learning of Named Entities in Arabic Wikipedia


Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer and Noah A. Smith

14:25 Tree Representations in Probabilistic Models for Extended Named Entities Detection
Marco Dinarelli and Sophie Rosset

14:50 When Did that Happen? Linking Events and Relations to Timestamps
Dirk Hovy, James Fan, Alfio Gliozzo, Siddharth Patwardhan and Christopher Welty

Wednesday April 25, 2012 (continued)

15:15 Compensating for Annotation Errors in Training a Relation Extractor


Bonan Min and Ralph Grishman

(14:00) Session 3c: Machine Learning and Summarization

14:00 Incorporating Lexical Priors into Topic Models


Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa

14:25 DualSum: a Topic-Model based approach for update summarization


Jean-Yves Delort and Enrique Alfonseca

14:50 Large-Margin Learning of Submodular Summarization Models


Ruben Sipos, Pannaga Shivaswamy and Thorsten Joachims

(16:10) Session 4: Posters (1) and Demos (1)

16:10 A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utter-
ances and their Meanings
Tom Kwiatkowski, Sharon Goldwater, Luke Zettlemoyer and Mark Steedman

16:10 Active learning for interactive machine translation


Jesús González-Rubio, Daniel Ortiz-Martínez and Francisco Casacuberta

16:10 Adapting Translation Models to Translationese Improves SMT


Gennadi Lembersky, Noam Ordan and Shuly Wintner

16:10 Aspectual Type and Temporal Relation Classification


Francisco Costa and Antonio Branco

16:10 Automatic generation of short informative sentiment summaries


Andrea Glaser and Hinrich Schutze

16:10 Bootstrapped Training of Event Extraction Classifiers


Ruihong Huang and Ellen Riloff

16:10 Bootstrapping Events and Relations from Text


Ting Liu and Tomek Strzalkowski

Wednesday April 25, 2012 (continued)

16:10 CLex: A Lexicon for Exploring Color, Concept and Emotion Associations in Language
Svitlana Volkova, William B. Dolan and Theresa Wilson

16:10 Extending the Entity-based Coherence Model with Multiple Ranks


Vanessa Wei Feng and Graeme Hirst

16:10 Generalization Methods for In-Domain and Cross-Domain Opinion Holder Extraction
Michael Wiegand and Dietrich Klakow

16:10 Skip N-grams and Ranking Functions for Predicting Script Events
Bram Jans, Steven Bethard, Ivan Vulic and Marie-Francine Moens

16:10 The Problem with Kappa


David Martin Ward Powers

16:10 User Edits Classification Using Document Revision Histories


Amit Bronner and Christof Monz

16:10 User Participation Prediction in Online Forums


Zhonghua Qu and Yang Liu

16:10 Inferring Selectional Preferences from Part-Of-Speech N-grams


Hyeju Jang and Jack Mostow

16:10 WebCAGe - A Web-Harvested Corpus Annotated with GermaNet Senses


Verena Henrich, Erhard Hinrichs and Tatiana Vodolazova

Thursday April 26, 2012

(9:00) Session 5: Plenary Session

9:00 Learning to Behave by Reading


Regina Barzilay

(10:30) Session 6a: Student Workshop

(10:30) Session 6b: Student Workshop

(10:30) Session 6c: Student Workshop

(14:00) Session 7: EACL business meeting

(14:50) Session 8: Plenary Session

14:50 Lexical surprisal as a general predictor of reading time


Irene Fernandez Monsalve, Stefan L. Frank and Gabriella Vigliocco

15:15 Spectral Learning for Non-Deterministic Dependency Parsing


Franco M. Luque, Ariadna Quattoni, Borja Balle and Xavier Carreras

(16:10) Session 9: Posters (2) and Demos (2)

16:10 Combining Tree Structures, Flat Features and Patterns for Biomedical Relation Extraction
Md. Faisal Mahbub Chowdhury and Alberto Lavelli

16:10 Coordination Structure Analysis using Dual Decomposition


Atsushi Hanamoto, Takuya Matsuzaki and Jun'ichi Tsujii

16:10 Cutting the Long Tail: Hybrid Language Models for Translation Style Adaptation
Arianna Bisazza and Marcello Federico

16:10 Detecting Highly Confident Word Translations from Comparable Corpora without Any
Prior Knowledge
Ivan Vulic and Marie-Francine Moens

Thursday April 26, 2012 (continued)

16:10 Efficient parsing with Linear Context-Free Rewriting Systems


Andreas van Cranenburgh

16:10 Evaluating language understanding accuracy with respect to objective outcomes in a dia-
logue system
Myroslava O. Dzikovska, Peter Bell, Amy Isard and Johanna D. Moore

16:10 Experimenting with Distant Supervision for Emotion Classification


Matthew Purver and Stuart Battersby

16:10 Feature-Rich Part-of-speech Tagging for Morphologically Complex Languages: Application to Bulgarian
Georgi Georgiev, Valentin Zhikov, Kiril Simov, Petya Osenova and Preslav Nakov

16:10 Instance-Driven Attachment of Semantic Annotations over Conceptual Hierarchies


Janara Christensen and Marius Pasca

16:10 Joint Satisfaction of Syntactic and Pragmatic Constraints Improves Incremental Spoken
Language Understanding
Andreas Peldszus, Okko Buß, Timo Baumann and David Schlangen

16:10 Learning How to Conjugate the Romanian Verb. Rules for Regular and Partially Irregular
Verbs
Liviu P. Dinu, Vlad Niculae and Octavia-Maria Sulea

16:10 Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revi-
sion History
Torsten Zesch

16:10 Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine
Translation
Rico Sennrich

16:10 Subcat-LMF: Fleshing out a standardized format for subcategorization frame interoper-
ability
Judith Eckle-Kohler and Iryna Gurevych

16:10 The effect of domain and text type on text prediction quality
Suzan Verberne, Antal van den Bosch, Helmer Strik and Lou Boves

16:10 The Impact of Spelling Errors on Patent Search


Benno Stein, Dennis Hoppe and Tim Gollub

Thursday April 26, 2012 (continued)

16:10 UBY - A Large-Scale Unified Lexical-Semantic Resource Based on LMF


Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian
M. Meyer and Christian Wirth

16:10 Word Sense Induction for Novel Sense Detection


Jey Han Lau, Paul Cook, Diana McCarthy, David Newman and Timothy Baldwin

Friday April 27, 2012

(9:00) Session 10: Plenary Session

9:00 Learning Language from Perceptual Context


Raymond Mooney

(10:30) Session 11a: Data Mining and Discourse

10:30 Learning for Microblogs with Distant Supervision: Political Forecasting with Twitter
Micol Marchetti-Bowick and Nathanael Chambers

10:55 Learning from evolving data streams: online triage of bug reports
Grzegorz Chrupala

11:20 Towards a model of formal and informal address in English


Manaal Faruqui and Sebastian Pado

11:45 Character-based kernels for novelistic plot structure


Micha Elsner

Friday April 27, 2012 (continued)

(10:30) Session 11b: Morphology

10:30 Smart Paradigms and the Predictability and Complexity of Inflectional Morphology
Gregoire Detrez and Aarne Ranta

10:55 Probabilistic Hierarchical Clustering of Morphological Paradigms


Burcu Can and Suresh Manandhar

11:20 Modeling Inflection and Word-Formation in SMT


Alexander Fraser, Marion Weller, Aoife Cahill and Fabienne Cap

11:45 Identifying Broken Plurals, Irregular Gender, and Rationality in Arabic Text
Sarah Alkuhlani and Nizar Habash

(10:30) Session 11c: Semantics

10:30 Framework of Semantic Role Assignment based on Extended Lexical Conceptual Struc-
ture: Comparison with VerbNet and FrameNet
Yuichiroh Matsubayashi, Yusuke Miyao and Akiko Aizawa

10:55 Unsupervised Detection of Downward-Entailing Operators By Maximizing Classification Certainty
Jackie Chi Kit Cheung and Gerald Penn

11:20 Elliphant: Improved Automatic Detection of Zero Subjects and Impersonal Constructions
in Spanish
Luz Rello, Ricardo Baeza-Yates and Ruslan Mitkov

11:45 Validation of sub-sentential paraphrases acquired from parallel monolingual corpora


Houda Bouamor, Aurelien Max and Anne Vilnat

Friday April 27, 2012 (continued)

(14:00) Session 12a: Generation and Word Ordering

14:00 Determining the placement of German verbs in English-to-German SMT


Anita Gojun and Alexander Fraser

14:25 Syntax-Based Word Ordering Incorporating a Large-Scale Language Model


Yue Zhang, Graeme Blackwood and Stephen Clark

14:50 Midge: Generating Image Descriptions From Computer Vision Detections


Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Xufeng Han,
Alyssa Mensch, Alex Berg, Tamara Berg and Hal Daume III

15:15 Generation of landmark-based navigation instructions from open-source data


Markus Drager and Alexander Koller

(14:00) Session 12b: Discourse and Dialogue

14:00 To what extent does sentence-internal realisation reflect discourse context? A study on
word order
Sina Zarrieß, Aoife Cahill and Jonas Kuhn

14:25 Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages
Oliver Ferschke, Iryna Gurevych and Yevgen Chebotar

14:50 An Unsupervised Dynamic Bayesian Network Approach to Measuring Speech Style Ac-
commodation
Mahaveer Jain, John McDonough, Gahgene Gweon, Bhiksha Raj and Carolyn Penstein
Rose

15:15 Learning the Fine-Grained Information Status of Discourse Entities


Altaf Rahman and Vincent Ng

Friday April 27, 2012 (continued)

(14:00) Session 12c: Parsing and MT

14:00 Composing extended top-down tree transducers


Aurelie Lagoutte, Fabienne Braune, Daniel Quernheim and Andreas Maletti

14:25 Structural and Topical Dimensions in Multi-Task Patent Translation


Katharina Waeschle and Stefan Riezler

14:50 Not as Awful as it Seems: Explaining German Case through Computational Experiments
in Fluid Construction Grammar
Remi van Trijp

(15:45) Session 13: Plenary Session

15:45 Managing Uncertainty in Semantic Tagging


Silvie Cinková, Martin Holub and Vincent Kríž

Parallel and Nested Decomposition for Factoid Questions


Aditya Kalyanpur, Siddharth Patwardhan, Branimir Boguraev, Jennifer Chu-Carroll and
Adam Lally

Speech Communication in the Wild

Martin Cooke
Language and Speech Laboratory
University of the Basque Country
Ikerbasque (Basque Science Foundation)
m.cooke@ikerbasque.org

Abstract

Much of what we know about speech perception comes from laboratory studies with clean, canonical
speech, ideal listeners and artificial tasks. But how do interlocutors manage to communicate effec-
tively in the seemingly less-than-ideal conditions of everyday listening, which frequently involve try-
ing to make sense of speech while listening in a non-native language, or in the presence of competing
sound sources, or while multitasking? In this talk I'll examine the effect of real-world conditions on
speech perception and quantify the contributions made by factors such as binaural hearing, visual in-
formation and prior knowledge to speech communication in noise. I'll present a computational model
which trades stimulus-related cues with information from learnt speech models, and examine how
well it handles both energetic and informational masking in a two-sentence separation task. Speech
communication also involves listening-while-talking. In the final part of the talk I'll describe some
ways in which speakers might be making communication easier for their interlocutors, and demon-
strate the application of these principles to improving the intelligibility of natural and synthetic speech
in adverse conditions.

Power-Law Distributions for Paraphrases Extracted from Bilingual
Corpora

Spyros Martzoukos Christof Monz


Informatics Institute, University of Amsterdam
Science Park 904, 1098 XH Amsterdam, The Netherlands
{s.martzoukos, c.monz}@uva.nl

Abstract was shown to outperform pivoting with syntac-


tic information, when multiple phrase-tables are
We describe a novel method that extracts used. In SMT, extracted paraphrases with asso-
paraphrases from a bitext, for both the
ciated pivot-based (Callison-Burch et al., 2006;
source and target languages. In order
to reduce the search space, we decom- Onishi et al., 2010) and cluster-based (Kuhn et
pose the phrase-table into sub-phrase-tables al., 2010) probabilities have been found to im-
and construct separate clusters for source prove the quality of translation. Pivoting has also
and target phrases. We convert the clus- been employed in the extraction of syntactic para-
ters into graphs, add smoothing/syntactic- phrases, which are a mixture of phrases and non-
information-carrier vertices, and compute terminals (Zhao et al., 2008; Ganitkevitch et al.,
the similarity between phrases with a ran-
2011).
dom walk-based measure, the commute
time. The resulting phrase-paraphrase We develop a method for extracting para-
probabilities are built upon the conversion phrases from a bitext for both the source and tar-
of the commute times into artificial co- get languages. Emphasis is placed on the qual-
occurrence counts with a novel technique. ity of the phrase-paraphrase probabilities as well
The co-occurrence count distribution be-
as on providing a stepping stone for extracting
longs to the power-law family.
syntactic paraphrases with equally reliable prob-
abilities. In line with previous work, our method
1 Introduction depends on the connectivity of the phrase-table,
Paraphrase extraction has emerged as an impor- but the resulting construction treats each side sep-
tant problem in NLP. Currently, there exists an arately, which can potentially be benefited from
abundance of methods for extracting paraphrases additional monolingual data.
from monolingual, comparable and bilingual cor- The initial problem in harvesting paraphrases
pora (Madnani and Dorr, 2010; Androutsopou- from a phrase-table is the identification of the
los and Malakasiotis, 2010); we focus on the lat- search space. Previous work has relied on breadth
ter and specifically on the phrase-table that is ex- first search from the query phrase with a depth
tracted from a bitext during the training stage of of 2 (pivoting) and 6 (KB). The former can be
Statistical Machine Translation (SMT). Bannard too restrictive and the latter can lead to excessive
and Callison-Burch (2005) introduced the pivot- noise contamination when taking shallow syntac-
ing approach, which relies on a 2-step transition tic information features into account. Instead, we
from a phrase, via its translations, to a paraphrase choose to cluster the phrase-table into separate
candidate. By incorporating the syntactic struc- source and target clusters and in order to make this
ture of phrases (Callison-Burch, 2005), the qual- task computationally feasible, we decompose the
ity of the paraphrases extracted with pivoting can phrase-table into sub-phrase-tables. We propose
be improved. Kok and Brockett (2010) (hence- a novel heuristic algorithm for the decomposition
forth KB) used a random walk framework to de- of the phrase-table (Section 2.1), and use a well-
termine the similarity between phrases, which established co-clustering algorithm for clustering

2
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 211,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
each sub-phrase-table (Section 2.2). done by identifying all vertices such that, upon
The underlying connectivity of the source removal, the component becomes disconnected.
and target clusters gives rise to a natural graph Such vertices are called articulation points or cut-
representation for each cluster (Section 3.1). vertices. Cut-vertices of high connectivity degree
The vertices of the graphs consist of phrases are removed from the giant component (see Sec-
and features with a dual smoothing/syntactic- tion 4.1). For the remaining vertices of the giant
information-carrier role. The latter allow (a) re- component, new components are identified and
distribution of the mass for phrases with no appro- we proceed iteratively, while keeping track of the
priate paraphrases and (b) the extraction of syn- cut-vertices that are removed at each iteration, un-
tactic paraphrases. The proximity among vertices til the size of the largest component is less than a
of a graph is measured by means of a random walk certain threshold (see Section 4.1).
distance measure, the commute time (Aldous and Note that at each iteration, when removing cut-
Fill, 2001). This measure is known to perform vertices from a giant component, the resulting col-
well in identifying similar words on the graph of lection of components may include graphs con-
WordNet (Rao et al., 2008) and a related measure, sisting of a single vertex. We refer to such ver-
the hitting time is known to perform well in har- tices as residues. They are excluded from the re-
vesting paraphrases on a graph constructed from sulting collection and are considered for separate
multiple phrase-tables (KB). treatment, as explained later in this section.
Generally in NLP, power-law distributions are The cut-vertices need to be inserted appropri-
typically encountered in the collection of counts ately back to the components: Starting from the
during the training stage. The distances of Sec- last iteration step, the respective cut-vertices are
tion 3.1 are converted into artificial co-occurrence added to all the components of P0 which they
counts with a novel technique (Section 3.2). Al- used to glue together; this process is performed
though they need not be integers, the main chal- iteratively, until there are no more cut-vertices to
lenge is the type of the underlying distributions; add. By addition of a cut-vertex to a component,
it should ideally emulate the resulting count dis- we mean the re-establishment of edges between
tributions from the phrase extraction stage of a the former and other vertices of the latter. The
monolingual parallel corpus (Dolan et al., 2004). result is a collection of components whose total
These counts give rise to the desired probability number of unique vertices is less than the number
distributions by means of relative frequencies. of vertices of the initial giant component P0 .
These remaining vertices are the residues. We
2 Sub-phrase-tables & Clustering then construct the graph R which consists of
the residues together with all their translations
2.1 Extracting Connected Components
(even those that are included in components of
For the decomposition of the phrase-table into the above collection) and then identify its compo-
sub-phrase-tables it is convenient to view the nents {R0 , ..., Rm }. It turns out, that the largest
phrase-table as an undirected, unweighted graph component, say R0 , is giant and we repeat the de-
P with the vertex set being the source and target composition process that was performed on P0 .
phrases and the edge set being the phrase-table en- This results in a new collection of components
tries. For the rest of this section, we do not distin- as well as new residues: The components need
guish between source and target phrases, i.e. both to be pruned (see Section 4.1) and the residues
types are treated equally as vertices of P . When give rise to a new graph R0 which is constructed
referring to the size of a graph, we mean the num- in the same way as R. We proceed iteratively until
ber of vertices it contains. the number of residues stops changing. For each
A trivial initial decomposition of P is achieved remaining residue u, we identify its translations,
by identifying all its connected components (com- and for each translation v we identify the largest
ponents for brevity), i.e. the mutually disjoint component of which v is a member and add u to
connected subgraphs, {P0 , P1 , ..., Pn }. It turns that component.
out (see Section 4.1) that the largest component, The final result is a collection C = D F,
say P0 , is of significant size. We call P0 giant where D is the collection of components emerg-
and it needs to be further decomposed. This is ing from the entire iterative decomposition of P0

3
The final result is a collection C = D ∪ F, where D is the collection of components emerging from the entire iterative decomposition of P0 and R, and F = {P1, ..., Pn}. Figure 1 shows the decomposition of a connected graph G0; for simplicity we assume that only one cut-vertex is removed at each iteration and that ties are resolved arbitrarily. In Figure 2 the residue graph is constructed and its two components are identified. The iterative insertion of the cut-vertices is also depicted. The resulting two components, together with those from R, form the collection D for G0.

[Figure 1: The decomposition of G0 with vertices s_i and t_j. The cut-vertex of the i-th iteration is denoted by c_i, and r collects the residues after each iteration. The task is completed in Figure 2.]

[Figure 2: Top: Residue graph with its components (no further decomposition is required). Bottom: Adding cut-vertices back to their components.]

The addition of cut-vertices into multiple components, as well as the construction method of the residue-based graph R, can yield occurrences of a vertex in multiple components in D. We exploit this property in two ways:

(a) In order to mitigate the risk of excessive decomposition (which implies a greater risk of good paraphrases ending up in different components), as well as to reduce the size of D, a conservative merging algorithm for components is employed. Suppose that the elements of D are ranked according to size in ascending order as D = {D1, ..., Dk, Dk+1, ..., D|D|}, where |Di| does not exceed some threshold (see Section 4.1) for i = 1, ..., k. Each component Di with i ∈ {1, ..., k} is examined as follows: for each vertex of Di the number of its occurrences in D is inspected; this is done in order to identify an appropriate vertex b to act as a bridge between Di and other components of which b is a member. Note that translations of a vertex b with a smaller number of occurrences in D are less likely to capture their full spectrum of paraphrases. We thus choose a vertex b from Di with the smallest number of occurrences in D, resolving ties arbitrarily, and proceed with merging Di with the largest component, say Dj with j ∈ {1, ..., |D| − 1}, of which b is also a member. The resulting merged component Dj' contains all vertices and edges of Di and Dj, plus new edges formed according to the rule: if u is a vertex of Di, v is a vertex of Dj, and (u, v) is a phrase-table entry, then (u, v) is an edge in Dj'. As long as no connected component has identified Di as the component with which it should be merged, Di is then deleted from the collection D.

(b) We define an idf-inspired measure for each phrase pair (x, x') of the same type (source or target) as

\mathrm{idf}(x, x') = \frac{1}{\log |D|} \log\!\left( \frac{2\, c(x, x')\, |D|}{c(x) + c(x')} \right), \qquad (1)

where c(x, x') is the number of components in which the phrases x and x' co-occur, and equivalently for c(·). This measure is used for pruning paraphrase candidates; its use is explained in Section 3.1. Note that idf(x, x') ∈ [0, 1].

The merging process and the idf measure are irrelevant for phrases belonging to the components of F, since the vertex set of each component of F is mutually disjoint with the vertex set of any other component in C.
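A direct transcription of eqn. (1) is straightforward. The sketch below is illustrative only: the component-occurrence counts are assumed to be available as plain dictionaries, and the guard for pairs that never co-occur is an added convention rather than part of the original definition.

```python
import math

def idf(x, x_prime, cooc, occ, num_components):
    """idf-inspired measure of eqn. (1) for two phrases of the same type.

    cooc[(x, x')]  -- number of components in which x and x' co-occur
    occ[x]         -- number of components in which x occurs
    num_components -- |D|, the size of the collection of components
    """
    c_pair = cooc.get((x, x_prime), 0) or cooc.get((x_prime, x), 0)
    if c_pair == 0:
        return 0.0                     # guard: the pair never co-occurs
    ratio = 2.0 * c_pair * num_components / (occ[x] + occ[x_prime])
    return math.log(ratio) / math.log(num_components)
```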
2.2 Clustering Connected Components

The aim of this subsection is to generate separate clusters for the source and target phrases of each sub-phrase-table (component) C ∈ C. For this purpose the Information-Theoretic Co-Clustering (ITC) algorithm (Dhillon et al., 2003) is employed, a general, principled clustering algorithm that generates hard clusters (i.e. every element belongs to exactly one cluster) of two interdependent quantities and is known to perform well on high-dimensional and sparse data. In our case, the interdependent quantities are the source and target phrases, and the sparse data is the phrase-table.

ITC is a search algorithm similar to K-means, in the sense that a cost function is minimized at each iteration step and the numbers of clusters for both quantities are meta-parameters. The number of clusters is set to the most conservative initialization for both source and target phrases, namely to as many clusters as there are phrases. At each iteration, new clusters are constructed based on the identification of the argmin of the cost function for each phrase, which gradually reduces the number of clusters.

We observe that conservative choices for the meta-parameters often result in good paraphrases being in different clusters. To overcome this problem, the hard clusters are converted into soft clusters (i.e. an element may belong to several clusters): one step before the stopping criterion is met, we modify the algorithm so that instead of assigning a phrase to the cluster with the smallest cost we select the bottom-X clusters ranked by cost. Additionally, only a certain number of phrases is chosen for soft clustering. Both selections are done conservatively, with criteria based on the properties of the cost functions.

The formation of clusters leads to a natural refinement of the idf measure defined in eqn. (1): the quantity c(x, x') is redefined as the number of components in which the phrases x and x' co-occur in at least one cluster.

3 Monolingual Graphs & Counts

We proceed with converting the clusters into directed, weighted graphs and then extract paraphrases for both the source and target side. For brevity we explain the process restricted to the source clusters of a sub-phrase-table, but the same method applies to the target side and to all sub-phrase-tables in the collection C.

3.1 Monolingual graphs

Each source cluster is converted into a graph G as follows: the vertex set consists of the phrases of the cluster, and an edge between s and s' exists if (a) s and s' have at least one translation from the same target cluster, and (b) idf(s, s') is greater than some threshold (see Section 4.1). If two phrases satisfy condition (b) and have translations in more than one common target cluster, a distinct edge is established for each such cluster. All edges are bi-directional, with distinct weights for the two directions.

Figure 3 depicts an example of such a construction; a link between a phrase s_i and a target cluster implies the existence of at least one translation for s_i in that cluster. We are not interested in the target phrases and they are thus not shown. For simplicity we assume that condition (b) is always satisfied and the extracted graph contains the maximum possible edges. Observe that phrases s3 and s4 have two edges connecting them (due to target clusters Tc and Td) and that the target cluster Ta is irrelevant to the construction of the graph, since s1 is the only phrase with translations in it. This conversion of a source cluster into a graph G results in the formation of subgraphs in G, where each subgraph is generated by a target cluster. In general, if condition (b) is not always satisfied, then G need not be connected, and each connected component is treated as a distinct graph.

[Figure 3: Top: A source cluster containing phrases s1, ..., s8 and the associated target clusters Ta, ..., Tf. Bottom: The extracted graph from the source cluster. All edges are bi-directional.]
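The edge-construction rule of Section 3.1 can be sketched as follows. This is an illustration under assumptions: networkx is used again, target_clusters_of maps each source phrase to the ids of the target clusters containing its translations, idf_score stands for the refined measure of eqn. (1), and delta names the (here unspecified) threshold of condition (b). Parallel edges for multiple shared target clusters are folded into a single edge attribute rather than kept as distinct edges.

```python
import itertools
import networkx as nx

def build_source_graph(source_phrases, target_clusters_of, idf_score, delta):
    """Construct the directed graph G for one source cluster.

    target_clusters_of[s] -- set of target-cluster ids in which s has a translation
    idf_score(s, s')      -- the (refined) idf measure of eqn. (1)
    delta                 -- threshold of condition (b), cf. Section 4.1
    """
    G = nx.DiGraph()
    G.add_nodes_from(source_phrases)
    for s, s2 in itertools.combinations(source_phrases, 2):
        shared = target_clusters_of[s] & target_clusters_of[s2]
        # condition (a): a common target cluster; condition (b): idf above threshold
        if shared and idf_score(s, s2) > delta:
            G.add_edge(s, s2, clusters=sorted(shared))
            G.add_edge(s2, s, clusters=sorted(shared))
    return G
```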
Analogous to KB, we introduce feature vertices to G: for each phrase vertex s, its part-of-speech (POS) tag sequence and its stem sequence are identified and inserted into G as new vertices with bi-directional weighted edges connected to s. If phrase vertices s and s' have the same POS tag sequence, then they are connected to the same POS tag feature vertex, and similarly for stem feature vertices. See Figure 4 for an example. Note that we do not allow edges between POS tag and stem feature vertices. The purpose of the feature vertices, unlike KB, is primarily smoothing and secondarily the identification of paraphrases with the same syntactic information; this will become clear in the description of the computation of weights.

[Figure 4: Adding feature vertices to the extracted graph (has) ↔ (owns) ↔ (i have) ↔ (i had). Phrase, POS tag feature and stem feature vertices are drawn in circles, dotted rectangles and solid rectangles respectively. All edges are bi-directional.]

The set of all phrase vertices that are adjacent to s is written as Γ(s) and referred to as the neighborhood of s. Let n(s, t) denote the co-occurrence count of a phrase-table entry (s, t) (Koehn, 2009). We define the strength of s in the subgraph generated by cluster T as

n(s; T) = \sum_{t \in T} n(s, t), \qquad (2)

which is simply a partial occurrence count for s. We proceed with computing weights for all edges of G:

Phrase↔phrase weights: Inspired by the notion of preferential attachment (Yule, 1925), which is known to produce power-law weight distributions for evolving weighted networks (Barrat et al., 2004), we set the weight of a directed edge from s to s' to be proportional to the strengths of s' in all subgraphs in which both s and s' are members. Thus, in the random walk framework, s is more likely to visit a stronger (more reliable) neighbor. If T_{s,s'} = {T | s and s' coexist in the subgraph generated by T}, then the weight w(s → s') of the directed edge from s to s' is given by

w(s \to s') = \sum_{T \in T_{s,s'}} n(s'; T), \qquad (3)

if s' ∈ Γ(s), and 0 otherwise.

Phrase↔feature weights: As mentioned above, feature vertices have the dual role of carrying syntactic information and smoothing. From eqn. (3) it can be deduced that, if for a phrase s the amount of its outgoing weight is close to the amount of its incoming weight, then this is an indication that at least a significant part of its neighborhood is reliable; the larger the strengths, the more certain the indication. Otherwise, either s or a significant part of its neighborhood is unreliable. The amount of weight from s to its feature vertices should depend on this observation, and we thus let

net(s) = \sum_{s' \in \Gamma(s)} \left( w(s \to s') - w(s' \to s) \right) + \epsilon, \qquad (4)

where ε prevents net(s) from becoming 0 (see Section 4.1). The net weight of a phrase vertex s is distributed over its feature vertices as

w(s \to f_X) = \langle w(s \to s') \rangle + net(s), \qquad (5)

where the first summand is the average weight from s to its neighboring phrase vertices and X = POS, STEM. If s has multiple POS tag sequences, we distribute the weight of eqn. (5) relative to the co-occurrences of s with the respective POS tag feature vertices. The quantity ⟨w(s → s')⟩ accounts for the basic smoothing and is augmented by a value net(s) that measures the reliability of s's neighborhood; the more unreliable the neighborhood, the larger the net weight and thus the larger the overall weights to the feature vertices.

The choice for the opposite direction is trivial:

w(f_X \to s) = \frac{1}{|\{s' : (f_X, s') \text{ is an edge}\}|}, \qquad (6)

where X = POS, STEM. Note the effect of eqns. (4)–(6) in the case where the neighborhood of s has unreliable strengths: in a random walk the feature vertices of s will be preferred and the resulting similarities between s and other phrase vertices will be small, as desired. Nonetheless, if the syntactic information is the same as that of another phrase vertex in G, then the paraphrases will still be captured.
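The strength and weight definitions of eqns. (2)–(5) translate almost one-to-one into code. The sketch below is illustrative only; it assumes the phrase-table counts n and the already computed phrase-phrase weights w are plain dictionaries, takes epsilon as given, and covers only the case of a single POS tag sequence per phrase (the paper redistributes the weight when there are several).

```python
def strength(s, T, n):
    """Eqn. (2): partial occurrence count of phrase s in the subgraph of cluster T.

    n[(s, t)] -- phrase-table co-occurrence count of the entry (s, t)."""
    return sum(n.get((s, t), 0) for t in T)

def phrase_phrase_weight(s, s2, shared_clusters, n):
    """Eqn. (3): weight of the directed edge s -> s2 over the shared target clusters."""
    return sum(strength(s2, T, n) for T in shared_clusters)

def feature_weight(s, neighbours, w, epsilon):
    """Eqns. (4)-(5): weight from s to each of its feature vertices.

    w[(a, b)] -- directed phrase-phrase weights already computed with eqn. (3)."""
    net = sum(w[(s, s2)] - w[(s2, s)] for s2 in neighbours) + epsilon
    avg = sum(w[(s, s2)] for s2 in neighbours) / len(neighbours)
    return avg + net      # same value for the POS-tag and the stem feature vertex
```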
The transition probability from any vertex u to any other vertex v in G, i.e., the probability of hopping from u to v in one step, is given by

p(u \to v) = \frac{w(u \to v)}{\sum_{v'} w(u \to v')}, \qquad (7)

where we sum over all vertices adjacent to u in G. We can thus compute the similarity between any two vertices u and v in G by their commute time, i.e., the expected number of steps in a round trip of a random walk from u to v and then back to u, which is denoted by κ(u, v) (see Section 4.1 for the method of computation of κ). Since κ(u, v) is a distance measure, the smaller its value, the more similar u and v are.

3.2 Counts

We convert the distance κ(u, v) of a vertex pair u, v in a graph G into a co-occurrence count nG(u, v) with a novel technique: in order to assess the quality of the pair u, v with respect to G, we compare κ(u, v) with κ(u, x) and κ(v, x) for all other vertices x in G. We thus consider the average distance of u to the vertices of G other than v, and similarly for v. This quantity is denoted by κ(u; v) and κ(v; u) respectively, and by definition it is given by

\kappa(i; j) = \sum_{x \in G,\, x \neq j} \kappa(i, x)\, p_G(x \mid i), \qquad (8)

where p_G(x|i) ≡ p(x|G, i) is an as yet unknown probability distribution with respect to G. The quantity (κ(u; v) + κ(v; u))/2 can then be viewed as the average distance of the pair u, v to the rest of the graph G. The co-occurrence count of u and v in G is thus defined by

n_G(u, v) = \frac{\kappa(u; v) + \kappa(v; u)}{2\, \kappa(u, v)}. \qquad (9)

In order to calculate the probabilities p_G(·|·) we employ the following heuristic: starting with a uniform distribution p_G^{(0)}(·|·) at timestep t = 0, we iterate

\kappa^{(t)}(i; j) = \sum_{x \in G,\, x \neq j} \kappa(i, x)\, p_G^{(t)}(x \mid i), \qquad (10)

n_G^{(t)}(u, v) = \frac{\kappa^{(t)}(u; v) + \kappa^{(t)}(v; u)}{2\, \kappa(u, v)}, \qquad (11)

p_G^{(t+1)}(v \mid u) = \frac{n_G^{(t)}(u, v)}{\sum_{x \in G} n_G^{(t)}(u, x)}, \qquad (12)

for all pairs of vertices u, v in G until convergence. Experimentally, we find that convergence is always achieved. After the execution of this iterative process we divide each count by the smallest count in order to achieve a lower bound of 1.

A pair u, v may appear in multiple graphs in the same sub-phrase-table C. The total co-occurrence count of u and v in C and the associated conditional probabilities are thus given by

n_C(u, v) = \sum_{G \in C} n_G(u, v), \qquad (13)

p_C(v \mid u) = \frac{n_C(u, v)}{\sum_{x \in C} n_C(u, x)}. \qquad (14)

A pair u, v may also appear in multiple sub-phrase-tables, and for the calculation of the final count n(u, v) we need to average over the associated counts from all sub-phrase-tables. Moreover, we have to take into account the type of the vertices: for the simplest case, where both u and v represent phrase vertices, their expected count is, by definition, given by

n(s, s') = \sum_{C} n_C(s, s')\, p(C \mid s, s'). \qquad (15)

On the other hand, if at least one of u or v is a feature vertex, then we have to consider the phrase vertex that generates this feature: suppose that u is the phrase vertex s = "acquire" and v the POS tag vertex f = NN, and that they co-occur in two sub-phrase-tables C and C' with positive counts n_C(s, f) and n_{C'}(s, f) respectively; the feature vertex f is generated by the phrase vertex "ownership" in C and by "possession" in C'. In that case, an interpolation of the counts n_C(s, f) and n_{C'}(s, f) as in eqn. (15) would be incorrect, and a direct sum n_C(s, f) + n_{C'}(s, f) would provide the true count. As a result we have

n(s, f) = \sum_{s'} \sum_{C} n_C(s, f(s'))\, p(C \mid s, f(s')), \qquad (16)

where the first summation is over all phrase vertices s' such that f(s') = f. With a similar argument we can write

n(f, f') = \sum_{s, s'} \sum_{C} n_C(f(s), f(s'))\, p(C \mid f(s), f(s')). \qquad (17)
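The fixed-point iteration of eqns. (10)–(12) can be written compactly with numpy. The function below is a sketch under assumptions: K is a dense commute-time matrix with zero diagonal (small graphs only, as in Section 4.1), the tolerance and iteration cap are illustrative, and the final rescaling implements the division by the smallest count mentioned above.

```python
import numpy as np

def counts_from_commute_times(K, tol=1e-6, max_iter=200):
    """Turn a commute-time matrix K (|V| x |V|, zero diagonal) into
    co-occurrence counts n_G(u, v) by iterating eqns. (10)-(12)."""
    V = K.shape[0]
    p = np.full((V, V), 1.0 / (V - 1))         # uniform p_G^(0)(x | i), x != i
    np.fill_diagonal(p, 0.0)
    safe_K = np.where(K > 0, K, np.inf)         # avoid dividing by the zero diagonal
    for _ in range(max_iter):
        # eqn. (10): kappa^(t)(i; j) = sum_{x != j} kappa(i, x) p^(t)(x | i)
        row_sum = (K * p).sum(axis=1, keepdims=True)
        kappa = row_sum - K * p
        # eqn. (11)
        n = (kappa + kappa.T) / (2.0 * safe_K)
        # eqn. (12): row-normalise the counts to obtain p^(t+1)(v | u)
        p_new = n / n.sum(axis=1, keepdims=True)
        if np.abs(p_new - p).max() < tol:
            p = p_new
            break
        p = p_new
    return n / n[n > 0].min()                   # the smallest count becomes 1
```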
For the interpolants, from standard probability we 7
10
find 6
5
10
10
pC (v|u)p(C|u) P0
p(C|u, v) = P 0
, (18) 5
10
C 0 pC 0 (v|u)p(C |u) 4
10

size
0
where the probabilities p(C|u) can be computed 3
10
10 0
10 10
2 4
10
6
10
by considering the likelihood function
2
10
N
Y N X
Y 1
10
`(u) = p(xi |u) = pC (xi |u)p(C|u)
0
i=1 i=1 C 10 0 2 4 6
10 10 10 10
rank
and by maximizing the average log-likelihood
1
N log `(u), where N is the total number of ver- Figure 5: Log-log plot of ranked components ac-
tices with which u co-occurs with positive counts cording to their size (number of source and target
in all sub-phrase-tables. phrases) for: (a) Components extracted from P .
Finally, the desired probability distributions are 1-1 components are not shown. (b) Components
given by the relative frequencies extracted from the decomposition of P0 .

n(u, v)
p(v|u) = P , (19)
x n(u, x)
In the components emerging from the decompo-
for all pairs of vertices u, v. sition of R0 , we observe an excessive number
of cut-vertices. Note that vertices that consist
4 Experiments
these components can be of two types: i) for-
4.1 Setup mer residues, i.e., residues that emerged from the
The data for building the phrase-table P decomposition of P0 , and ii) other vertices of
is drawn from DE-EN bitexts crawled from P0 . Cut-vertices can be of either type. For each
www.project-syndicate.org, which is component, we remove cut-vertices that are not
a standard resource provider for the WMT translations of the former residues of that com-
campaigns (News Commentary bitexts, see, ponent. Following this pruning strategy, the de-
e.g. (Callison-Burch et al., 2007) ). The filtered generacy of excessive cut-vertices does not reap-
bitext consists of 125K sentences; word align- pear in the subsequent iterations of decompos-
ment was performed running GIZA++ in both di- ing components generated by new residues, but
rections and generating the symmetric alignments the emergence of two giant components was ob-
using the grow-diag-final-and heuristics. The served: One consisting mostly of source type ver-
resulting P has 7.7M entries, 30% of which are tices and one of target type vertices. Without go-
1-1, i.e. entries (s, t) that satisfy p(s|t) = ing into further details, the algorithm can extend
p(t|s) = 1. These entries are irrelevant for para- to multiple giant components straightforwardly.
phrase harvesting for both the baseline and our For the merging process of the collection D we
method, and are thus excluded from the process. set = 5000, to avoid the emergence of a giant
The initial giant component P0 contains 1.7M component. The sizes of the resulting sub-phrase-
vertices (Figure 5), of which 30% become tables are shown in Figure 6. For the ITC algo-
residues and are used to construct R. At each it- rithm we use the smoothing technique discussed
eration of the decomposition of a giant compo- in (Dhillon and Guan, 2003) with = 106 .
nent, we remove the top 0.5% size cut-vertices For the monolingual graphs, we set = 0.65
ranked by degree of connectivity, where size is and discard graphs with more than 20 phrase ver-
the number of vertices of the giant component and tices, as they contain mostly noise. Thus, the sizes
set = 2500 as the stopping criterion. The latter of the graphs allow us to use analytical methods
choice is appropriate for the subsequent step of to compute the commute times: For a graph G,
co-clustering the components, for both time com- we form the transition matrix P , whose entries
plexity and performance of the ITC algorithm. P (u, v) are given by eqn. (7), and the fundamen-

10
6 the graph. Figure 8 depicts the new graph, where
the lengths of the edges represent the magnitude
5
10
before merging of commute times. Observe that the quality of
after merging
10
4 the probabilities is preserved but the counts are
inflated, as required.
size

3
10
In general, if a source phrase vertex s has at
10
2
least one translation t such that n(s, t) 3, then a
1 triplet (is , f (is ), g(is )) is added to the graph as in
10
Figure 8. The inflation vertex is establishes edges
0
10 0
10 10
2
10
4
10
6 with all other phrase and inflation vertices in the
rank
graph and weights are computed as in Section 3.1.
The pipeline remains the same up to eqn. (13),
Figure 6: Log-log plot of ranked sub-phrase-
where all counts that include inflation vertices are
tables according to their size (number of source
ignored.
and target phrases).

f a f b
tal matrix (Grinstead and Snell, 2006; Boley et al.,
a b
2011) Z = (I P + 1 T )1 , where I is the iden-
tity matrix, 1 denotes the vector of all ones and g a g b
is the vector of stationary probabilities (Aldous
and Fill, 2001) which is such that T P = T
and T 1 = 1 and can be computed as in (Hunter, na , b = 2.0 p ba = .20
2000). The commute time between any vertices u na , f a = 2.6 p f aa = .27
na , g a = 2.6 p g aa = .27
and v in G is then given by (Grinstead and Snell,
na , f b = 1.3 p f ba = .13
2006)
na , g b = 1.3 p g ba = .13
(u, v) = (Z(v, v) Z(u, v))/(v) +
Figure 7: Top: A graph with source phrase ver-
+ (Z(u, u) Z(v, u))/(u). (20)
tices a and b, both of strength 40, with accom-
panying distinct POS sequence vertices f () and
For the parameter of eqn. (4), an appropriate
stem sequence vertices g(). Bottom: The result-
choice is  = |(s)| + 1; for reliable neighbor-
ing co-occurrence counts and conditional proba-
hoods, this quantity is insignificant. POS tags and
bilities for a.
lemmata are generated with TreeTagger1 .
Figure 7 depicts the most basic type of graph
that can be extracted from a cluster; it includes
two source phrase vertices a, b, of different syn-
tactic information. Suppose that both a and g a g i a
b are highly reliable with strengths n(a; T ) = f a ia f i a
a
n(b; T ) = 40, for some target cluster T . The re- b ib
sulting conditional probabilities adequately repre- f b f i b
sent the proximity of the involved vertices. On g b g i b
the other hand, the range of the co-occurrence
na , b = 11.3 p ba = .22
counts is not compatible with that of the strengths.
na , f a = 13.5 p f aa = .26
This is because i) there are no phrase vertices with p g aa = .26
na , g a = 13.5
small strength in the graph, and ii) eqn. (9) is es- na , f b = 6.7 p f ba = .13
sentially a comparison between a pair of vertices na , g b = 6.7 p g ba = .13
and the rest of the graph. To overcome this prob-
lem inflation vertices ia and ib of strength 1 with
Figure 8: The inflated version of Figure 7.
accompanying feature vertices are introduced to
1
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

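For completeness, a numpy sketch of the commute-time computation of eqn. (20): the transition matrix is obtained by row-normalising the edge weights as in eqn. (7), the stationary distribution is taken as the dominant left eigenvector of P, and Z is the fundamental matrix. This is an illustration, not the authors' code, and it assumes every vertex has at least one outgoing edge.

```python
import numpy as np

def commute_times(W):
    """Commute times kappa(u, v) following eqn. (20).

    W[u, v] holds the edge weight w(u -> v); rows are normalised into the
    transition matrix of eqn. (7)."""
    P = W / W.sum(axis=1, keepdims=True)
    V = P.shape[0]
    # stationary distribution pi: left eigenvector of P with eigenvalue 1
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi = pi / pi.sum()
    # fundamental matrix Z = (I - P + 1 pi^T)^{-1}
    Z = np.linalg.inv(np.eye(V) - P + np.outer(np.ones(V), pi))
    kappa = np.empty((V, V))
    for u in range(V):
        for v in range(V):
            kappa[u, v] = (Z[v, v] - Z[u, v]) / pi[v] + (Z[u, u] - Z[v, u]) / pi[u]
    return kappa
```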
4.2 Results Lenient MEP Strict MEP
Method
Our method generates conditional probabilities @1 @5 @10 @1 @5 @10
for any pair chosen from {phrase, POS sequence, Baseline .58 .47 .41 .43 .33 .28
stem sequence}, but for this evaluation we restrict Graphs .72 .61 .52 .53 .40 .33
ourselves to phrase pairs. For a phrase s, the qual-
Table 1: Mean Expected Precision (MEP) at k un-
ity of a paraphrase s0 is assessed by
der lenient and strict evaluation criteria.
P (s0 |s) p(s0 |s) + p(f1 (s0 )|s) + p(f2 (s0 )|s),
(21)
by eqns. (15)(17)) for all vertices u and v, be-
where f1 (s0 ) and f2 (s0 ) denote the POS tag se-
longs to the power-law family (Figure 9). This is
quence and stem sequence of s0 respectively. All
evidence that the monolingual graphs can simu-
three summands of eqn. (21) are computed from
late the phrase extraction process of a monolin-
eqn. (19). The baseline is given by pivoting (Ban-
gual parallel corpus. Intuitively, we may think of
nard and Callison-Burch, 2005),
X the German side of the DEEN parallel corpus as
P (s0 |s) = p(t|s)p(s0 |t), (22) the English approximation to a ENEN par-
t allel corpus, and the monolingual graphs as the
where p(t|s) and p(s0 |t) are the phrase-based rel- word alignment process.
ative frequencies of the translation model.
5
We select 150 phrases (an equal number for 10

unigrams, bigrams and trigrams), for which we


4
expect to see paraphrases, and keep the top-10 10
cooccurrence count

paraphrases for each phrase, ranked by the above


3
measures. We follow (Kok and Brockett, 2010; 10
Metzler et al., 2011) in the evaluation of the ex-
2
tracted paraphrases: Each phrase-paraphrase pair 10
is manually annotated with the following options:
0) Different meaning; 1) (i) Same meaning, but 1
10
potential replacement of the phrase with the para-
phrase in a sentence ruins the grammatical struc- 0
10 0 2 4 6 8
10 10 10 10 10
ture of the sentence. (ii) Tokens of the paraphrase rank

are morphological inflections of the phrases to-


kens. 2) Same meaning. Although useful for SMT Figure 9: Log-log plot of ranked pairs of English
purposes, super/substrings of are annotated with vertices according to their counts
0 to achieve an objective evaluation.
Both methods are evaluated in terms of the
Mean Expected Precision (MEP) at k; the Ex- 5 Conclusions & Future Work
pected Precision for each selected phrase P s at We have described a new method that harvests
rank k is computed by Es [p@k] = k1 ki=1 pi ,
paraphrases from a bitext, generates artificial
where pi is the proportion of positive annotations
co-occurrence counts for any pair chosen from
for item i. The P desired metric is thus given by
1 {phrase, POS sequence, stem sequence}, and po-
MEP@k = 150 s Es [p@k]. The contribution tentially identifies patterns for the syntactic infor-
to pi can be restricted to perfect paraphrases only,
mation of the phrases. The quality of the para-
which leads to a strict strategy for harvesting para-
phrases ranked lists outperforms that of a stan-
phrases. Table 1 summarizes the results of our
dard baseline. The quality of the resulting condi-
evaluation and
tional probabilities is promising and will be eval-
uated implicitly via an application to SMT.
we deduce that our method can lead to improve- This research was funded by the European
ments over the baseline. Commission through the CoSyne project FP7-
An important accomplishment of our method ICT- 4-248531.
is that the distribution of counts n(u, v), (as given

References

David Aldous and James A. Fill. 2001. Reversible Markov Chains and Random Walks on Graphs. http://www.stat.berkeley.edu/~aldous/RWG/book.html

Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A Survey of Paraphrasing and Textual Entailment Methods. Journal of Artificial Intelligence Research, 38:135–187.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with Bilingual Parallel Corpora. Proc. ACL, pp. 597–604.

Alain Barrat, Marc Barthélemy, and Alessandro Vespignani. 2004. Modeling the Evolution of Weighted Networks. Phys. Rev. Lett., 92.

Daniel Boley, Gyan Ranjan, and Zhi-Li Zhang. 2011. Commute Times for a Directed Graph using an Asymmetric Laplacian. Linear Algebra and its Applications, Issue 2, pp. 224–242.

Chris Callison-Burch. 2008. Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. Proc. EMNLP, pp. 196–205.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) Evaluation of Machine Translation. Proc. Workshop on Statistical Machine Translation, pp. 136–158.

Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved statistical machine translation using paraphrases. Proc. HLT/NAACL, pp. 17–24.

Inderjit S. Dhillon and Yuqiang Guan. 2003. Information Theoretic Clustering of Sparse Co-Occurrence Data. Proc. IEEE Int'l Conf. Data Mining, pp. 517–520.

Inderjit S. Dhillon, Subramanyam Mallela, and Dharmendra S. Modha. 2003. Information-Theoretic Co-clustering. Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 89–98.

William Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. Proc. COLING, pp. 350–356.

Juri Ganitkevitch, Chris Callison-Burch, Courtney Napoles, and Benjamin Van Durme. 2011. Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation. Proc. EMNLP, pp. 1168–1179.

Charles Grinstead and Laurie Snell. 2006. Introduction to Probability. Second ed., American Mathematical Society.

Jeffrey J. Hunter. 2000. A Survey of Generalized Inverses and their Use in Stochastic Modelling. Res. Lett. Inf. Math. Sci., Vol. 1, pp. 25–36.

Philipp Koehn. 2009. Statistical Machine Translation. Cambridge University Press, Cambridge, UK.

Stanley Kok and Chris Brockett. 2010. Hitting the Right Paraphrases in Good Time. Proc. NAACL, pp. 145–153.

Roland Kuhn, Boxing Chen, George Foster, and Evan Stratford. 2010. Phrase Clustering for Smoothing TM Probabilities: or, how to Extract Paraphrases from Phrase Tables. Proc. COLING, pp. 608–616.

Nitin Madnani and Bonnie Dorr. 2010. Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods. Computational Linguistics, 36(3):341–388.

Donald Metzler, Eduard Hovy, and Chunliang Zhang. 2011. An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques. Proc. ACL: Short Papers, pp. 546–551.

Takashi Onishi, Masao Utiyama, and Eiichiro Sumita. 2010. Paraphrase Lattice for Statistical Machine Translation. Proc. ACL: Short Papers, pp. 1–5.

Delip Rao, David Yarowsky, and Chris Callison-Burch. 2008. Affinity Measures based on the Graph Laplacian. Proc. Textgraphs Workshop on Graph-based Algorithms for NLP at COLING, pp. 41–48.

George U. Yule. 1925. A Mathematical Theory of Evolution, based on the Conclusions of Dr. J. C. Willis, F.R.S. Philos. Trans. R. Soc. London, B 213, pp. 21–87.

Shiqi Zhao, Haifeng Wang, Ting Liu, and Sheng Li. 2008. Pivot Approach for Extracting Paraphrase Patterns from Bilingual Corpora. Proc. ACL, pp. 780–788.
A Bayesian Approach to Unsupervised Semantic Role Induction

Ivan Titov Alexandre Klementiev


Saarland University
Saarbrücken, Germany
{titov|aklement}@mmci.uni-saarland.de

Abstract Mary always takes an agent role (A0) for the pred-
icate open, and door is always a patient (A1).
We introduce two Bayesian models for un- SRL representations have many potential appli-
supervised semantic role labeling (SRL) cations in natural language processing and have
task. The models treat SRL as clustering
recently been shown to be beneficial in question
of syntactic signatures of arguments with
clusters corresponding to semantic roles. answering (Shen and Lapata, 2007; Kaisser and
The first model induces these clusterings Webber, 2007), textual entailment (Sammons et
independently for each predicate, exploit- al., 2009), machine translation (Wu and Fung,
ing the Chinese Restaurant Process (CRP) 2009; Liu and Gildea, 2010; Wu et al., 2011; Gao
as a prior. In a more refined hierarchical and Vogel, 2011), and dialogue systems (Basili et
model, we inject the intuition that the clus- al., 2009; van der Plas et al., 2011), among others.
terings are similar across different predi-
Though syntactic representations are often predic-
cates, even though they are not necessar-
ily identical. This intuition is encoded as tive of semantic roles (Levin, 1993), the interface
a distance-dependent CRP with a distance between syntactic and semantic representations is
between two syntactic signatures indicating far from trivial. The lack of simple determinis-
how likely they are to correspond to a single tic rules for mapping syntax to shallow semantics
semantic role. These distances are automat- motivates the use of statistical methods.
ically induced within the model and shared
across predicates. Both models achieve Although current statistical approaches have
state-of-the-art results when evaluated on been successful in predicting shallow seman-
PropBank, with the coupled model consis- tic representations, they typically require large
tently outperforming the factored counter- amounts of annotated data to estimate model pa-
part in all experimental set-ups. rameters. These resources are scarce and ex-
pensive to create, and even the largest of them
1 Introduction have low coverage (Palmer and Sporleder, 2010).
Moreover, these models are domain-specific, and
Semantic role labeling (SRL) (Gildea and Juraf- their performance drops substantially when they
sky, 2002), a shallow semantic parsing task, has are used in a new domain (Pradhan et al., 2008).
recently attracted a lot of attention in the com- Such domain specificity is arguably unavoidable
putational linguistic community (Carreras and for a semantic analyzer, as even the definitions
Marquez, 2005; Surdeanu et al., 2008; Hajic et of semantic roles are typically predicate specific,
al., 2009). The task involves prediction of predi- and different domains can have radically different
cate argument structure, i.e. both identification of distributions of predicates (and their senses). The
arguments as well as assignment of labels accord- necessity for a large amounts of human-annotated
ing to their underlying semantic role. For exam- data for every language and domain is one of the
ple, in the following sentences: major obstacles to the wide-spread adoption of se-
mantic role representations.
(a) [A0 Mary] opened [A1 the door].
These challenges motivate the need for unsu-
(b) [A0 Mary] is expected to open [A1 the door].
pervised methods which, instead of relying on la-
(c) [A1 The door] opened. beled data, can exploit large amounts of unlabeled
(d) [A1 The door] was opened [A0 by Mary]. texts. In this paper, we propose simple and effi-

cient hierarchical Bayesian models for this task. parses, and with gold and automatically identified
It is natural to split the SRL task into two arguments).
stages: the identification of arguments (the iden- Both models admit efficient inference: the es-
tification stage) and the assignment of semantic timation time on the Penn Treebank WSJ corpus
roles (the labeling stage). In this and in much does not exceed 30 minutes on a single proces-
of the previous work on unsupervised techniques, sor and the inference algorithm is highly paral-
the focus is on the labeling stage. Identification, lelizable, reducing inference time down to sev-
though an important problem, can be tackled with eral minutes on multiple processors. This sug-
heuristics (Lang and Lapata, 2011a; Grenager and gests that the models scale to much larger corpora,
Manning, 2006) or, potentially, by using a super- which is an important property for a successful
vised classifier trained on a small amount of data. unsupervised learning method, as unlabeled data
We follow (Lang and Lapata, 2011a), and regard is abundant.
the labeling stage as clustering of syntactic sig- The rest of the paper is structured as follows.
natures of argument realizations for every predi- Section 2 begins with a definition of the seman-
cate. In our first model, as in most of the previous tic role labeling task and discuss some specifics
work on unsupervised SRL, we define an indepen- of the unsupervised setting. In Section 3, we de-
dent model for each predicate. We use the Chi- scribe CRPs and dd-CRPs, the key components
nese Restaurant Process (CRP) (Ferguson, 1973) of our models. In Sections 4 6, we describe
as a prior for the clustering of syntactic signatures. our factored and coupled models and the infer-
The resulting model achieves state-of-the-art re- ence method. Section 7 provides both evaluation
sults, substantially outperforming previous meth- and analysis. Finally, additional related work is
ods evaluated in the same setting. presented in Section 8.
In the first model, for each predicate we inde-
pendently induce a linking between syntax and se-
2 Task Definition
mantics, encoded as a clustering of syntactic sig- In this work, instead of assuming the availabil-
natures. The clustering implicitly defines the set ity of role annotated data, we rely only on auto-
of permissible alternations, or changes in the syn- matically generated syntactic dependency graphs.
tactic realization of the argument structure of the While we cannot expect that syntactic structure
verb. Though different verbs admit different alter- can trivially map to a semantic representation
nations, some alternations are shared across mul- (Palmer et al., 2005)1 , we can use syntactic cues
tiple verbs and are very frequent (e.g., passiviza- to help us in both stages of unsupervised SRL.
tion, example sentences (a) vs. (d), or dativiza- Before defining our task, let us consider the two
tion: John gave a book to Mary vs. John gave stages separately.
Mary a book) (Levin, 1993). Therefore, it is nat- In the argument identification stage, we imple-
ural to assume that the clusterings should be sim- ment a heuristic proposed in (Lang and Lapata,
ilar, though not identical, across verbs. 2011a) comprised of a list of 8 rules, which use
Our second model encodes this intuition by re- nonlexicalized properties of syntactic paths be-
placing the CRP prior for each predicate with tween a predicate and a candidate argument to it-
a distance-dependent CRP (dd-CRP) prior (Blei eratively discard non-arguments from the list of
and Frazier, 2011) shared across predicates. The all words in a sentence. Note that inducing these
distance between two syntactic signatures en- rules for a new language would require some lin-
codes how likely they are to correspond to a sin- guistic expertise. One alternative may be to an-
gle semantic role. Unlike most of the previous notate a small number of arguments and train a
work exploiting distance-dependent CRPs (Blei classifier with nonlexicalized features instead.
and Frazier, 2011; Socher et al., 2011; Duan et al., In the argument labeling stage, semantic roles
2007), we do not encode prior or external knowl- are represented by clusters of arguments, and la-
edge in the distance function but rather induce it beling a particular argument corresponds to decid-
automatically within our Bayesian model. The ing on its role cluster. However, instead of deal-
coupled dd-CRP model consistently outperforms 1
Although it provides a strong baseline which is diffi-
the factored CRP counterpart across all the experi- cult to beat (Grenager and Manning, 2006; Lang and Lapata,
mental settings (with gold and predicted syntactic 2010; Lang and Lapata, 2011a).

ing with argument occurrences directly, we rep- for describing CRPs is assignment of tables to
resent them as predicate specific syntactic signa- restaurant customers. Assume a restaurant with a
tures, and refer to them as argument keys. This sequence of tables, and customers who walk into
representation aids our models in inducing high the restaurant one at a time and choose a table to
purity clusters (of argument keys) while reducing join. The first customer to enter is assigned the
their granularity. We follow (Lang and Lapata, first table. Suppose that when a client number i
2011a) and use the following syntactic features to enters the restaurant, i 1 customers are sitting
form the argument key representation: at each of the k (1, . . . , K) tables occupied so
Active or passive verb voice (ACT/PASS). far. The new customer is then either seated at one
Nk
Argument position relative to predicate of the K tables with probability i1+ , where Nk
(LEFT/RIGHT). is the number customers already sitting at table
k, or assigned to a new table with the probability
Syntactic relation to its governor.
i1+ . The concentration parameter encodes
Preposition used for argument realization. the granularity of the drawn partitions: the larger
In the example sentences in Section 1, the argu- , the larger the expected number of occupied ta-
ment keys for candidate arguments Mary for sen- bles. Though it is convenient to describe CRP in a
tences (a) and (d) would be ACT:LEFT:SBJ and sequential manner, the probability of a seating ar-
PASS:RIGHT:LGS->by,2 respectively. While rangement is invariant of the order of customers
aiming to increase the purity of argument key arrival, i.e. the process is exchangeable. In our
clusters, this particular representation will not al- factored model, we use CRPs as a prior for clus-
ways produce a good match: e.g. the door in tering argument keys, as we explain in Section 4.
sentence (c) will have the same key as Mary in Often CRP is used as a part of the Dirich-
sentence (a). Increasing the expressiveness of the let Process mixture model where each subset in
argument key representation by flagging intransi- the partition (each table) selects a parameter (a
tive constructions would distinguish that pair of meal) from some base distribution over parame-
arguments. However, we keep this particular rep- ters. This parameter is then used to generate all
resentation, in part to compare with the previous data points corresponding to customers assigned
work. to the table. The Dirichlet processes (DP) are
In this work, we treat the unsupervised seman- closely connected to CRPs: instead of choosing
tic role labeling task as clustering of argument meals for customers through the described gener-
keys. Thus, argument occurrences in the corpus ative story, one can equivalently draw a distribu-
whose keys are clustered together are assigned the tion G over meals from DP and then draw a meal
same semantic role. Note that some adjunct-like for every customer from G. We refer the reader
modifier arguments are already explicitly repre- to Teh (2010) for details on CRPs and DPs. In
sented in syntax and thus do not need to be clus- our method, we use DPs to model distributions of
tered (modifiers AM-TMP, AM-MNR, AM-LOC, and arguments for every role.
AM-DIR are encoded as syntactic relations TMP, In order to clarify how similarities between
MNR, LOC, and DIR, respectively (Surdeanu et al., customers can be integrated in the generative pro-
2008)); instead we directly use the syntactic labels cess, we start by reformulating the traditional
as semantic roles. CRP in an equivalent form so that distance-
dependent CRP (dd-CRP) can be seen as its gen-
3 Traditional and Distance-dependent eralization. Instead of selecting a table for each
CRPs customer as described above, one can equiva-
The central components of our non-parametric lently assume that a customer i chooses one of
Bayesian models are the Chinese Restaurant Pro- the previous customers ci as a partner with prob-
cesses (CRPs) and the closely related Dirichlet 1
ability i1+ and sits at the same table, or occu-
Processes (DPs) (Ferguson, 1973). pies a new table with the probability i1+
. The
CRPs define probability distributions over par- transitive closure of this seating-with relation de-
titions of a set of objects. An intuitive metaphor termines the partition.
2
LGS denotes a logical subject in a passive construction A generalization of this view leads to the defini-
(Surdeanu et al., 2008). tion of the distance-dependent CRP. In dd-CRPs,

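A minimal sketch of the Chinese Restaurant Process seating scheme described above; the concentration parameter alpha, the seed, and the function name are illustrative.

```python
import random

def crp_partition(num_customers, alpha, seed=0):
    """Sample a partition from a Chinese Restaurant Process prior.

    Customer i joins table k with probability N_k / (i - 1 + alpha) and opens
    a new table with probability alpha / (i - 1 + alpha)."""
    rng = random.Random(seed)
    tables = []                        # tables[k] = list of customers at table k
    for i in range(1, num_customers + 1):
        weights = [len(t) for t in tables] + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append([i])         # open a new table
        else:
            tables[k].append(i)
    return tables
```

Because the process is exchangeable, the probability of the resulting partition does not depend on the order in which customers are seated.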
a customer i chooses a partner ci = j with Our model associates two distributions with
the probability proportional to some non-negative each predicate: one governs the selection of argu-
score di,j (di,j = dj,i ) which encodes a similarity ment fillers for each semantic role, and the other
between the two customers.3 More formally, models (and penalizes) duplicate occurrence of
roles. Each predicate occurrence is generated in-

di,j , i 6= j
p(ci = j|D, ) (1) dependently given these distributions. Let us de-
, i = j
scribe the model by first defining how the set of
where D is the entire similarity graph. This pro- model parameters and an argument key clustering
cess lacks the exchangeability property of the tra-
are drawn, and then explaining the generation of
ditional CRP but efficient approximate inference
individual predicate and argument instances. The
with dd-CRP is possible with Gibbs sampling.
generative story is formally presented in Figure 1.
For more details on inference with dd-CRPs, we
We start by generating a partition of argument
refer the reader to Blei and Frazier (2011).
keys Bp with each subset r Bp representing
Though in previous work dd-CRP was used ei-
a single semantic role. The partitions are drawn
ther to encode prior knowledge (Blei and Fra-
from CRP() (see the Factored model section of
zier, 2011) or other external information (Socher
Figure 1) independently for each predicate. The
et al., 2011), we treat D as a latent variable
crucial part of the model is the set of selectional
drawn from some prior distribution over weighted
preference parameters p,r , the distributions of ar-
graphs. This view provides a powerful approach
guments x for each role r of predicate p. We
for coupling a family of distinct but similar clus-
represent arguments by their syntactic heads,4 or
terings: the family of clusterings can be drawn by
more specifically, by either their lemmas or word
first choosing a similarity graph D for the entire
clusters assigned to the head by an external clus-
family and then re-using D to generate each of the
tering algorithm, as we will discuss in more detail
clusterings independently of each other as defined
in Section 7.5 For the agent role A0 of the pred-
by equation (1). In Section 5, we explain how we
icate open, for example, this distribution would
use this formalism to encode relatedness between
assign most of the probability mass to arguments
argument key clusterings for different predicates.
denoting sentient beings, whereas the distribution
4 Factored Model for the patient role A1 would concentrate on ar-
guments representing openable things (doors,
In this section we describe the factored method boxes, books, etc).
which models each predicate independently. In In order to encode the assumption about sparse-
Section 2 we defined our task as clustering of ar- ness of the distributions p,r , we draw them from
gument keys, where each cluster corresponds to a the DP prior DP (, H (A) ) with a small concen-
semantic role. If an argument key k is assigned tration parameter , the base probability distribu-
to a role r (k r), all of its occurrences are la- tion H (A) is just the normalized frequencies of ar-
beled r. guments in the corpus. The geometric distribution
Our Bayesian model encodes two common as- p,r is used to model the number of times a role
sumptions about semantic roles. First, we enforce r appears with a given predicate occurrence. The
the selectional restriction assumption: we assume decision whether to generate at least one role r is
that the distribution over potential argument fillers drawn from the uniform Bernoulli distribution. If
is sparse for every role, implying that peaky dis- 0 is drawn then the semantic role is not realized
tributions of arguments for each role r are pre- for the given occurrence, otherwise the number
ferred to flat distributions. Second, each role nor- of additional roles r is drawn from the geometric
mally appears at most once per predicate occur- distribution Geom(p,r ). The Beta priors over
rence. Our inference will search for a clustering
4
which meets the above requirements to the maxi- For prepositional phrases, we take as head the head noun
of the object noun phrase as it encodes crucial lexical infor-
mal extent. mation. However, the preposition is not ignored but rather
3
It may be more standard to use a decay function f : encoded in the corresponding argument key, as explained
R R and choose a partner with the probability propor- in Section 2.
5
tional to f (di,j ). However, the two forms are equivalent Alternatively, the clustering of arguments could be in-
and using scores di,j directly is more convenient for our in- duced within the model, as done in (Titov and Klementiev,
duction purposes. 2011).

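The duplicate-role mechanism described above (a fair Bernoulli draw deciding whether a role is realised at all, followed by geometrically many further occurrences) can be sketched as follows. This is illustrative only: phi_r is treated here as the continuation probability, which is an assumption about the exact parameterisation of Geom(phi_{p,r}), and the function name is invented.

```python
import random

def sample_role_count(phi_r, rng=None):
    """Number of times role r is realised for one predicate occurrence."""
    rng = rng or random.Random()
    if rng.random() < 0.5:           # uniform Bernoulli: role not realised at all
        return 0
    count = 1
    while rng.random() < phi_r:      # geometric continuation, cf. Geom(phi_r)
        count += 1
    return count
```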
Clustering of argument keys: nations for a predicate. E.g., passivization can be
Factored model: roughly represented with the clustering of the key
for each predicate p = 1, 2, . . . : ACT:LEFT:SBJ with PASS:RIGHT:LGS->by
Bp CRP () [partition of arg keys] and ACT:RIGHT:OBJ with PASS:LEFT:SBJ.
Coupled model: The set of permissible alternations is predicate-
D N onInf orm [similarity graph] specific,6 but nevertheless they arguably repre-
for each predicate p = 1, 2, . . . : sent a small subset of all clusterings of argu-
Bp dd-CRP (, D) [partition of arg keys] ment keys. Also, some alternations are more
likely to be applicable to a verb than others: for
Parameters:
example, passivization and dativization alterna-
for each predicate p = 1, 2, . . . : tions are both fairly frequent, whereas, locative-
for each role r Bp :
p,r DP (, H (A) ) [distrib of arg fillers]
preposition-drop alternation (Mary climbed up the
p,r Beta(0 , 1 ) [geom distr for dup roles] mountain vs. Mary climbed the mountain) is less
common and applicable only to several classes
Data Generation: of predicates representing motion (Levin, 1993).
for each predicate p = 1, 2, . . . : We represent this observation by quantifying how
for each occurrence l of p: likely a pair of keys is to be clustered. These
for every role r Bp : scores (di,j for every pair of argument keys i and
if [n U nif (0, 1)] = 1: [role appears at least once]
j) are induced automatically within the model,
GenArgument(p, r) [draw one arg]
while [n p,r ] = 1: [continue generation] and treated as latent variables shared across pred-
GenArgument(p, r) [draw more args] icates. Intuitively, if data for several predicates
strongly suggests that two argument keys should
GenArgument(p, r):
kp,r U nif (1, . . . , |r|) [draw arg key]
be clustered (e.g., there is a large overlap be-
xp,r p,r [draw arg filler] tween argument fillers for the two keys) then the
posterior will indicate that di,j is expected to be
greater for the pair {i, j} than for some other pair
Figure 1: Generative stories for the factored and cou-
{i0 , j 0 } for which the evidence is less clear. Con-
pled models.
sequently, argument keys i and j will be clustered
even for predicates without strong evidence for
can indicate the preference towards generating at such a clustering, whereas i0 and j 0 will not.
most one argument for each role. For example, One argument against coupling predicates may
it would express the preference that a predicate stem from the fact that we are using unlabeled
open typically appears with a single agent and a data and may be able to obtain sufficient amount
single patient arguments. of learning material even for less frequent pred-
Now, when parameters and argument key clus- icates. This may be a valid observation, but an-
terings are chosen, we can summarize the re- other rationale for sharing this similarity structure
mainder of the generative story as follows. We is the hypothesis that alternations may be easier
begin by independently drawing occurrences for to detect for some predicates than for others. For
each predicate. For each predicate role we in- example, argument key clustering of predicates
dependently decide on the number of role occur- with very restrictive selectional restrictions on ar-
rences. Then we generate each of the arguments gument fillers is presumably easier than clustering
(see GenArgument) by generating an argument for predicates with less restrictive and overlap-
key kp,r uniformly from the set of argument keys ping selectional restriction, as compactness of se-
assigned to the cluster r, and finally choosing its lectional preferences is a central assumption driv-
filler xp,r , where the filler is either a lemma or a ing unsupervised learning of semantic roles. E.g.,
word cluster corresponding to the syntactic head predicates change and defrost belong to the same
of the argument. Levin class (change-of-state verbs) and therefore
admit similar alternations. However, the set of po-
5 Coupled Model tential patients of defrost is sufficiently restricted,
As we argued in Section 1, clusterings of argu- 6
Or, at least specific to a class of predicates (Levin,
ment keys implicitly encode the pattern of alter- 1993).

whereas the selectional restrictions for the patient key implies some computations for all its occur-
of change are far less specific and they overlap rences in the corpus. Instead of more complex
with selectional restrictions for the agent role, fur- MAP search algorithms (see, e.g., (Daume III,
ther complicating the clustering induction task. 2007)), we use a greedy procedure where we start
This observation suggests that sharing clustering with each argument key assigned to an individual
preferences across verbs is likely to help even if cluster, and then iteratively try to merge clusters.
the unlabeled data is plentiful for every predicate. Each move involves (1) choosing an argument key
More formally, we generate scores di,j , or and (2) deciding on a cluster to reassign it to. This
equivalently, the full labeled graph D with ver- is done by considering all clusters (including cre-
tices corresponding to argument keys and edges ating a new one) and choosing the most probable
weighted with the similarity scores, from a prior. one.
In our experiments we use a non-informative prior Instead of choosing argument keys randomly at
which factorizes over pairs (i.e. edges of the the first stage, we order them by corpus frequency.
graph D), though more powerful alternatives can This ordering is beneficial as getting clustering
be considered. Then we use it, in a dd-CRP(, right for frequent argument keys is more impor-
D), to generate clusterings of argument keys for tant and the corresponding decisions should be
every predicate. The rest of the generative story is made earlier.7 We used a single iteration in our
the same as for the factored model. The part rele- experiments, as we have not noticed any benefit
vant to this model is shown in the Coupled model from using multiple iterations.
section of Figure 1.
6.2 Similarity Graph Induction
Note that this approach does not assume that
the frequencies of syntactic patterns correspond- In the coupled model, clusterings for different
ing to alternations are similar, and a large value predicates are statistically dependent, as the simi-
for di,j does not necessarily mean that the corre- larity structure D is latent and shared across pred-
sponding syntactic frames i and j are very fre- icates. Consequently, a more complex inference
quent in a corpus. What it indicates is that a large procedure is needed. For simplicity here and in
number of different predicates undergo the corre- our experiments, we use the non-informative prior
sponding alternation; the frequency of the alterna- distribution over D which assigns the same prior
tion is a different matter. We believe that this is an probability to every possible weight di,j for every
important point, as we do not make a restricting pair {i, j}.
assumption that an alternation has the same dis- Recall that the dd-CRP prior is defined in terms
tributional properties for all verbs which undergo of customers choosing other customers to sit with.
this alternation. For the moment, let us assume that this relation
among argument keys is known, that is, every ar-
6 Inference gument key k for predicate p has chosen an argu-
ment key cp,k to sit with. We can compute the
An inference algorithm for an unsupervised MAP estimate for all di,j by maximizing the ob-
model should be efficient enough to handle vast jective:
amounts of unlabeled data, as it can easily be ob- X X dk,cp,k
tained and is likely to improve results. We use arg max log P ,
di,j , i6=j p k0 K p dk,k0
a simple approximate inference algorithm based kK p
on greedy MAP search. We start by discussing where K p is the set of all argument keys for the
MAP search for argument key clustering with the predicate p. We slightly abuse the notation by us-
factored model and then discuss its extension ap- ing di,i to denote the concentration parameter
plicable to the coupled model. in the previous expression. Note that we also as-
sume that similarities are symmetric, di,j = dj,i .
6.1 Role Induction If the set of argument keys K p would be the same
For the factored model, semantic roles for every for every predicate, then the optimal di,j would
predicate are induced independently. Neverthe- 7
This idea has been explored before for shallow semantic
less, search for a MAP clustering can be expen- representations (Lang and Lapata, 2011a; Titov and Klemen-
sive, as even a move involving a single argument tiev, 2011).

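The greedy MAP search of Section 6.1 amounts to a single frequency-ordered pass in which each argument key is reassigned to the most probable cluster (possibly a new one). The sketch below is illustrative: log_prob stands for the model score of a clustering, which is not implemented here, and frequency for the corpus counts of the keys.

```python
def greedy_role_induction(arg_keys, frequency, log_prob):
    """One-pass greedy MAP search over clusterings of argument keys."""
    clusters = [{k} for k in arg_keys]                    # start: singletons
    for key in sorted(arg_keys, key=lambda k: frequency[k], reverse=True):
        # detach the key from its current cluster
        for c in clusters:
            c.discard(key)
        clusters = [c for c in clusters if c]
        # try every existing cluster plus a fresh one, keep the most probable
        best_score, best_idx = float("-inf"), None
        for idx in range(len(clusters) + 1):
            trial = [set(c) for c in clusters] + [set()]
            trial[idx].add(key)
            score = log_prob([c for c in trial if c])
            if score > best_score:
                best_score, best_idx = score, idx
        if best_idx == len(clusters):
            clusters.append({key})
        else:
            clusters[best_idx].add(key)
    return clusters
```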
be proportional to the number of times either i se- rior with Naive Bayes tends to be overconfident
lects j as a partner, or j chooses i as a partner.8 due to violated conditional independence assump-
This no longer holds if the sets are different, but tions (Rennie, 2001). The same behavior is ob-
the solution can be found efficiently using a nu- served here: the shared prior does not have suf-
meric optimization strategy; we use the gradient ficient effect on frequent predicates.10 Though
descent algorithm. different techniques have been developed to dis-
We do not learn the concentration parameter count the over-confidence (Kolcz and Chowdhury,
, as it is used in our model to indicate the de- 2005), we use the most basic one: we raise the
sired granularity of semantic roles, but instead likelihood term in power T1 , where the parameter
only learn di,j (i 6= j). However, just learning T is chosen empirically.
the concentration parameter would not be suffi-
cient as the effective concentration can be reduced 7 Empirical Evaluation
or increased arbitrarily by scaling all the similar- 7.1 Data and Evaluation
ities di,j (i 6= j) at once, as follows from expres-
We keep the general setup of (Lang and Lapata,
sion (1). Instead, we enforce the normalization
2011a), to evaluate our models and compare them
constraint on the similarities di,j . We ensure that
to the current state of the art. We run all of our
the prior probability of choosing itself as a part-
experiments on the standard CoNLL 2008 shared
ner, averaged over predicates, is the same as it
task (Surdeanu et al., 2008) version of Penn Tree-
would be with uniform di,j (di,j = 1 for every
bank WSJ and PropBank. In addition to gold
key pair {i, j}, i 6= j). This roughly says that
dependency analyses and gold PropBank annota-
we want to preserve the same granularity of clus-
tions, it has dependency structures generated au-
tering as it was with the uniform similarities. We
tomatically by the MaltParser (Nivre et al., 2007).
accomplish this normalization in a post-hoc fash-
We vary our experimental setup as follows:
ion
P by P dividing the weightsPafter optimization by
We evaluate our models on gold and auto-
p k,k0 K p , k0 6=k dk,k0 / p |K p |(|K p | 1).
matically generated parses, and use either
If D is fixed, partners for every predicate p and
gold PropBank annotations or the heuristic
every k can be found using virtually the same al-
from Section 2 to identify arguments, result-
gorithm as in Section 6.1: the only difference is
ing in four experimental regimes.
that, instead of a cluster, each argument key itera-
tively chooses a partner. In order to reduce the sparsity of predicate
Though, in practice, both the choice of partners argument fillers we consider replacing lem-
and the similarity graphs are latent, we can use an mas of their syntactic heads with word clus-
iterative approach to obtain a joint MAP estimate ters induced by a clustering algorithm as a
of ck (for every k) and the similarity graph D by preprocessing step. In particular, we use
alternating the two steps.9 Brown (Br) clustering (Brown et al., 1992)
Notice that the resulting algorithm is again induced over RCV1 corpus (Turian et al.,
highly parallelizable: the graph induction stage 2010). Although the clustering is hierarchi-
is fast, and induction of the seat-with relation cal, we only use a cluster at the lowest level
(i.e. clustering argument keys) is factorizable over of the hierarchy for each word.
predicates. We use the purity (PU) and collocation (CO) met-
One shortcoming of this approach is typical rics as well as their harmonic mean (F1) to mea-
for generative models with multiple features: sure the quality of the resulting clusters. Purity
when such a model predicts a latent variable, it measures the degree to which each cluster con-
tends to ignore the prior class distribution and re- tains arguments sharing the same gold role:
$\mathrm{PU} = \frac{1}{N} \sum_i \max_j |G_j \cap C_i|$
lies solely on features. This behavior is due to the over-simplifying independence assumptions.
It is well known, for instance, that the poste-
where if Ci is the set of arguments in the i-th in-
duced cluster, Gj is the set of arguments in the jth gold cluster, and N is the total number of arguments. Collocation evaluates the degree to which arguments with the same gold roles are assigned to a single cluster. It is computed as follows:

$\mathrm{CO} = \frac{1}{N} \sum_j \max_i |G_j \cap C_i|$

We compute the aggregate PU, CO, and F1 scores over all predicates in the same way as (Lang and Lapata, 2011a) by weighting the scores of each predicate by the number of its argument occurrences.

8 Note that weights di,j are invariant under rescaling when the rescaling is also applied to the concentration parameter.
9 In practice, two iterations were sufficient.
10 The coupled model without discounting still outperforms the factored counterpart in our experiments.

Table 1: Argument clustering performance with gold argument identification. Bold-face is used to highlight the best F1 scores.

             gold parses         auto parses
             PU    CO    F1      PU    CO    F1
LLogistic    79.5  76.5  78.0    77.9  74.4  76.2
SplitMerge   88.7  73.0  80.1    86.5  69.8  77.3
GraphPart    88.6  70.7  78.6    87.4  65.9  75.2
Factored     88.1  77.1  82.2    85.1  71.8  77.9
Coupled      89.3  76.6  82.5    86.7  71.2  78.2
Factored+Br  86.8  78.8  82.6    83.8  74.1  78.6
Coupled+Br   88.7  78.1  83.0    86.2  72.7  78.8
SyntF        81.6  77.5  79.5    77.1  70.9  73.9

Note that since our goal is to evalu-
ate the clustering algorithms, we do not include
incorrectly identified arguments (i.e. mistakes tion stage, and minimize the noise due to auto-
made by the heuristic defined in Section 2) when matic syntactic annotations. All four variants of
computing these metrics. the models we propose substantially outperform
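A small sketch of how these metrics can be computed per predicate (illustrative Python; induced clusters and gold roles are represented as sets of argument occurrences, and the aggregate scores additionally weight each predicate by its argument count, as described above):

```python
def purity_collocation_f1(induced, gold):
    """induced: list of sets (induced clusters C_i); gold: list of sets (gold roles G_j);
    each set contains the argument occurrences of one predicate."""
    n = sum(len(c) for c in induced)                       # N: total number of arguments
    pu = sum(max(len(c & g) for g in gold) for c in induced) / n
    co = sum(max(len(g & c) for c in induced) for g in gold) / n
    f1 = 2 * pu * co / (pu + co) if pu + co else 0.0
    return pu, co, f1

# Toy example with two induced clusters and two gold roles:
print(purity_collocation_f1([{"a1", "a2", "a3"}, {"a4", "a5"}],
                            [{"a1", "a2"}, {"a3", "a4", "a5"}]))
```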
We evaluate both factored and coupled models other models: the coupled model with Brown
proposed in this work with and without Brown clustering of argument fillers (Coupled+Br) beats
word clustering of argument fillers (Factored, the previous best model SplitMerge by 2.9% F1
Coupled, Factored+Br, Coupled+Br). Our mod- score. As mentioned in Section 2, our approach
els are robust to parameter settings, they were specifically does not cluster some of the modifier
tuned (to an order of magnitude) on the develop- arguments. In order to verify that this and argu-
ment set and were the same for all model variants: ment filler clustering were not the only aspects
= 1.e-3, = 1.e-3, 0 = 1.e-3, 1 = 1.e-10, of our approach contributing to performance im-
T = 5. Although they can be induced within the provements, we also evaluated our coupled model
model, we set them by hand to indicate granular- without Brown clustering and treating modifiers
ity preferences. We compare our results with the as regular arguments. The model achieves 89.2%
following alternative approaches. The syntactic purity, 74.0% collocation, and 80.9% F1 scores,
function baseline (SyntF) simply clusters predi- still substantially outperforming all of the alter-
cate arguments according to the dependency re- native approaches. Replacing gold parses with
lation to their head. Following (Lang and Lapata, MaltParser analyses we see a similar trend, where
2010), we allocate a cluster for each of 20 most Coupled+Br outperforms the best alternative ap-
frequent relations in the CoNLL dataset and one proach SplitMerge by 1.5%.
cluster for all other relations. We also compare 7.2.2 Automatic Arguments
our performance with the Latent Logistic classifi- Results are summarized in Table 2.11 The
cation (Lang and Lapata, 2010), Split-Merge clus- precision and recall of our re-implementation of
tering (Lang and Lapata, 2011a), and Graph Parti- the argument identification heuristic described in
tioning (Lang and Lapata, 2011b) approaches (la- Section 2 on gold parses were 87.7% and 88.0%,
beled LLogistic, SplitMerge, and GraphPart, re- respectively, and do not quite match 88.1% and
spectively) which achieve the current best unsu- 87.9% reported in (Lang and Lapata, 2011a).
pervised SRL results in this setting. Since we could not reproduce their argument
7.2 Results identification stage exactly, we are omitting their
results for the two regimes, instead including the
7.2.1 Gold Arguments results for our two best models Factored+Br and
Experimental results are summarized in Ta- Coupled+Br. We see a similar trend, where the
ble 1. We begin by comparing our models to the coupled system consistently outperforms its fac-
three existing clustering approaches on gold syn- tored counterpart, achieving 85.8% and 83.9% F1
tactic parses, and using gold PropBank annota- 11
Note, that the scores are computed on correctly iden-
tions to identify predicate arguments. In this set of tified arguments only, and tend to be higher in these ex-
experiments we measure the relative performance periments probably because the complex arguments get dis-
of argument clustering, removing the identifica- carded by the heuristic.

Table 2: Argument clustering performance with automatic argument identification.

             gold parses         auto parses
             PU    CO    F1      PU    CO    F1
Factored+Br  87.8  82.9  85.3    85.8  81.1  83.4
Coupled+Br   89.2  82.6  85.8    87.4  80.7  83.9
SyntF        83.5  81.4  82.4    81.4  79.1  80.2

balizations of relations (Lin and Pantel, 2001; Banko et al., 2007). Early unsupervised approaches to the SRL problem include the work by Swier and Stevenson (2004), where the VerbNet verb lexicon was used to guide unsupervised learning, and a generative model of Grenager and Manning (2006) which exploits linguistic priors
on syntactic-semantic interface.
for gold and MaltParser analyses, respectively.
More recently, the role induction problem has
We observe that consistently through the four been studied in Lang and Lapata (2010) where
regimes, sharing of alternations between predi- it has been reformulated as a problem of detect-
cates captured by the coupled model outperforms ing alterations and mapping non-standard link-
the factored version, and that reducing the argu- ings to the canonical ones. Later, Lang and La-
ment filler sparsity with clustering also has a sub- pata (2011a) proposed an algorithmic approach
stantial positive effect. Due to the space con- to clustering argument signatures which achieves
straints we are not able to present detailed anal- higher accuracy and outperforms the syntactic
ysis of the induced similarity graph D, however, baseline. In Lang and Lapata (2011b), the role
argument-key pairs with the highest induced sim- induction problem is formulated as a graph parti-
ilarity encode, among other things, passivization, tioning problem: each vertex in the graph corre-
benefactive alternations, near-interchangeability sponds to a predicate occurrence and edges repre-
of some subordinating conjunctions and preposi- sent lexical and syntactic similarities between the
tions (e.g., if and whether), as well as, restoring occurrences. Unsupervised induction of seman-
some of the unnecessary splits introduced by the tics has also been studied in Poon and Domin-
argument key definition (e.g., semantic roles for gos (2009) and Titov and Klementiev (2010) but
adverbials do not normally depend on whether the the induced representations are not entirely com-
construction is passive or active). patible with the PropBank-style annotations and
they have been evaluated only on a question an-
8 Related Work
swering task for the biomedical domain. Also, the
Most of SRL research has focused on the super- related task of unsupervised argument identifica-
vised setting (Carreras and Marquez, 2005; Sur- tion was considered in Abend et al. (2009).
deanu et al., 2008), however, lack of annotated re-
sources for most languages and insufficient cover- 9 Conclusions
age provided by the existing resources motivates
In this work we introduced two Bayesian models
the need for using unlabeled data or other forms
for unsupervised role induction. They treat the
of weak supervision. This work includes methods
task as a family of related clustering problems,
based on graph alignment between labeled and
one for each predicate. The first factored model
unlabeled data (Furstenau and Lapata, 2009), us-
induces each clustering independently, whereas
ing unlabeled data to improve lexical generaliza-
the second model couples them by exploiting a
tion (Deschacht and Moens, 2009), and projection
novel technique for sharing clustering preferences
of annotation across languages (Pado and Lapata,
across a family of clusterings. Both methods
2009; van der Plas et al., 2011). Semi-supervised
achieve state-of-the-art results with the coupled
and weakly-supervised techniques have also been
model outperforming the factored counterpart in
explored for other types of semantic representa-
all regimes.
tions but these studies have mostly focused on re-
stricted domains (Kate and Mooney, 2007; Liang Acknowledgements
et al., 2009; Titov and Kozhevnikov, 2010; Gold-
The authors acknowledge the support of the MMCI
wasser et al., 2011; Liang et al., 2011).
Cluster of Excellence, and thank Hagen Furstenau,
Unsupervised learning has been one of the cen- Mikhail Kozhevnikov, Alexis Palmer, Manfred Pinkal,
tral paradigms for the closely-related area of re- Caroline Sporleder and the anonymous reviewers for
lation extraction, where several techniques have their suggestions, and Joel Lang for answering ques-
been proposed to cluster semantically similar ver- tions about their methods and data.

References

Omri Abend, Roi Reichart, and Ari Rappoport. 2009. Unsupervised argument identification for semantic role labeling. In ACL-IJCNLP.
Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In IJCAI.
Roberto Basili, Diego De Cao, Danilo Croce, Bonaventura Coppola, and Alessandro Moschitti. 2009. Cross-language frame semantics transfer in bilingual corpora. In CICLING.
David M. Blei and Peter Frazier. 2011. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12:2461–2488.
Peter F. Brown, Vincent Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models for natural language. Computational Linguistics, 18(4):467–479.
Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In CoNLL.
Hal Daumé III. 2007. Fast search for Dirichlet process mixture models. In AISTATS.
Koen Deschacht and Marie-Francine Moens. 2009. Semi-supervised semantic role labeling using the Latent Words Language Model. In EMNLP.
Jason Duan, Michele Guindani, and Alan Gelfand. 2007. Generalized spatial Dirichlet process models. Biometrika, 94:809–825.
Thomas S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230.
Hagen Fürstenau and Mirella Lapata. 2009. Graph alignment for semi-supervised semantic role labeling. In EMNLP.
Qin Gao and Stephan Vogel. 2011. Corpus expansion for statistical machine translation with semantic role label substitution rules. In ACL:HLT.
Daniel Gildea and Daniel Jurafsky. 2002. Automatic labelling of semantic roles. Computational Linguistics, 28(3):245–288.
Dan Goldwasser, Roi Reichart, James Clarke, and Dan Roth. 2011. Confidence driven unsupervised semantic parsing. In ACL.
Trond Grenager and Christopher Manning. 2006. Unsupervised discovery of a statistical verb lexicon. In EMNLP.
Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL-2009), June 4-5.
Michael Kaisser and Bonnie Webber. 2007. Question answering based on semantic roles. In ACL Workshop on Deep Linguistic Processing.
Rohit J. Kate and Raymond J. Mooney. 2007. Learning language semantics from ambiguous supervision. In AAAI.
Aleksander Kolcz and Abdur Chowdhury. 2005. Discounting over-confidence of naive Bayes in high-recall text classification. In ECML.
Joel Lang and Mirella Lapata. 2010. Unsupervised induction of semantic roles. In ACL.
Joel Lang and Mirella Lapata. 2011a. Unsupervised semantic role induction via split-merge clustering. In ACL.
Joel Lang and Mirella Lapata. 2011b. Unsupervised semantic role induction with graph partitioning. In EMNLP.
Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.
Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In ACL-IJCNLP.
Percy Liang, Michael Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In ACL: HLT.
Dekang Lin and Patrick Pantel. 2001. DIRT – discovery of inference rules from text. In KDD.
Ding Liu and Daniel Gildea. 2010. Semantic role features for machine translation. In Coling.
J. Nivre, J. Hall, S. Kübler, R. McDonald, J. Nilsson, S. Riedel, and D. Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In EMNLP-CoNLL.
Sebastian Padó and Mirella Lapata. 2009. Cross-lingual annotation projection for semantic roles. Journal of Artificial Intelligence Research, 36:307–340.
Alexis Palmer and Caroline Sporleder. 2010. Evaluating FrameNet-style semantic parsing: the role of coverage gaps in FrameNet. In COLING.
M. Palmer, D. Gildea, and P. Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.
Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In EMNLP.
Sameer Pradhan, Wayne Ward, and James H. Martin. 2008. Towards robust semantic role labeling. Computational Linguistics, 34:289–310.
Jason Rennie. 2001. Improving multi-class text classification with Naive Bayes. Technical Report AITR-2001-004, MIT.
M. Sammons, V. Vydiswaran, T. Vieira, N. Johri, M. Chang, D. Goldwasser, V. Srikumar, G. Kundu, Y. Tu, K. Small, J. Rule, Q. Do, and D. Roth. 2009. Relation alignment for textual entailment recognition. In Text Analysis Conference (TAC).

Dan Shen and Mirella Lapata. 2007. Using semantic
roles to improve question answering. In EMNLP.
Richard Socher, Andrew Maas, and Christopher Man-
ning. 2011. Spectral Chinese restaurant processes:
Nonparametric clustering based on similarities. In
AISTATS.
Mihai Surdeanu, Richard Johansson, Adam Meyers,
Lluís Màrquez, and Joakim Nivre. 2008. The
CoNLL-2008 shared task on joint parsing of syn-
tactic and semantic dependencies. In CoNLL 2008:
Shared Task.
Richard Swier and Suzanne Stevenson. 2004. Unsu-
pervised semantic role labelling. In EMNLP.
Yee Whye Teh. 2010. Dirichlet processes. In Ency-
clopedia of Machine Learning. Springer.
Ivan Titov and Alexandre Klementiev. 2011. A
Bayesian model for unsupervised semantic parsing.
In ACL.
Ivan Titov and Mikhail Kozhevnikov. 2010.
Bootstrapping semantic analyzers from non-
contradictory texts. In ACL.
Joseph Turian, Lev Ratinov, and Yoshua Bengio.
2010. Word representations: A simple and general
method for semi-supervised learning. In ACL.
Lonneke van der Plas, Paola Merlo, and James Hen-
derson. 2011. Scaling up automatic cross-lingual
semantic role annotation. In ACL.
Dekai Wu and Pascale Fung. 2009. Semantic roles for
SMT: A hybrid two-pass model. In NAACL.
Dekai Wu, Marianna Apidianaki, Marine Carpuat, and
Lucia Specia, editors. 2011. Proc. of Fifth Work-
shop on Syntax, Semantics and Structure in Statisti-
cal Translation. ACL.

Entailment above the word level in distributional semantics

Marco Baroni and Raffaella Bernardi
University of Trento
name.surname@unitn.it

Ngoc-Quynh Do
Free University of Bozen-Bolzano
quynhdtn.hut@gmail.com

Chung-chieh Shan
Cornell University and University of Tsukuba
ccshan@post.harvard.edu

Abstract lexical domain. On the other hand, FS has pro-


vided sophisticated models of sentence meaning,
We introduce two ways to detect entail- but it has been largely limited to hand-coded mod-
ment using distributional semantic repre- els that do not scale up to real-life challenges by
sentations of phrases. Our first experiment learning from data.
shows that the entailment relation between
adjective-noun constructions and their head Given these complementary strengths, we nat-
nouns (big cat |= cat), once represented as urally ask if DS and FS can address each others
semantic vector pairs, generalizes to lexical limitations. Two recent strands of research are
entailment among nouns (dog |= animal). bringing DS closer to meeting core FS chal-
Our second experiment shows that a classi- lenges. One strand attempts to model compo-
fier fed semantic vector pairs can similarly sitionality with DS methods, representing both
generalize the entailment relation among primitive and composed linguistic expressions
quantifier phrases (many dogs|=some dogs)
to entailment involving unseen quantifiers
as distributional vectors (Baroni and Zamparelli,
(all cats|=several cats). Moreover, nominal 2010; Grefenstette and Sadrzadeh, 2011; Gue-
and quantifier phrase entailment appears to vara, 2010; Mitchell and Lapata, 2010). The
be cued by different distributional corre- other strand attempts to reformulate FSs notion
lates, as predicted by the type-based view of logical inference in terms that DS can cap-
of entailment in formal semantics. ture (Erk, 2009; Geffet and Dagan, 2005; Kotler-
man et al., 2010; Zhitomirsky-Geffet and Dagan,
2010). In keeping with the lexical emphasis of
1 Introduction
DS, this strand has focused on inference at the
Distributional semantics (DS) approximates lin- word level, or lexical entailment, that is, discover-
guistic meaning with vectors summarizing the ing from distributional vectors of hyponyms (dog)
contexts where expressions occur. The success that they entail their hypernyms (animal).
of DS in lexical semantics has validated the hy- This paper brings these two strands of research
pothesis that semantically similar expressions oc- together by demonstrating two ways in which the
cur in similar contexts (Landauer and Dumais, distributional vectors of composite expressions
1997; Lund and Burgess, 1996; Sahlgren, 2006; bear on inference. Here we focus on phrasal vec-
Schutze, 1997; Turney and Pantel, 2010). For- tors harvested directly from the corpus rather than
mal semantics (FS) represents linguistic mean- obtained compositionally. In a first experiment,
ings as symbolic formulas and assemble them via we exploit the entailment properties of a class
composition rules. FS has successfully modeled of composite expressions, namely adjective-noun
quantification and captured inferential relations constructions (ANs), to harvest training data for
between phrases and between sentences (Mon- an entailment recognizer. The recognizer is then
tague, 1970; Thomason, 1974; Heim and Kratzer, successfully applied to detect lexical entailment.
1998). The strengths of DS and FS have been In short, since almost all ANs entail the noun they
complementary to date: On one hand, DS has in- contain (red car entails car), the distributional
duced large-scale semantic representations from vectors of AN-N pairs can train a classifier to de-
corpora, but it has been largely limited to the tect noun pairs that stand in the same relation (dog

entails animal). With almost no manual effort, 2 Background
we achieve performance nearly identical with the
state-of-the-art balAPinc measure that Kotlerman 2.1 Distributional semantics above the word
et al. (2010) crafted, which detects feature inclu- level
sion between the two nouns occurrence contexts. DS models such as LSA (Landauer and Dumais,
1997) and HAL (Lund and Burgess, 1996) ap-
Our second experiment goes beyond lexical in-
proximate the meaning of a word by a vector that
ference. We look at phrases built from a quanti-
summarizes its distribution in a corpus, for exam-
fying determiner1 and a noun (QNs) and use their
ple by counting co-occurrences of the word with
distributional vectors to recognize entailment re-
other words. Since semantically similar words
lations of the form many dogs |= some dogs, be-
tend to share similar contexts, DS has been very
tween two QNs sharing the same noun. It turns
successful in tasks that require quantifying se-
out that a classifier trained on a set of Q1 N |= Q2 N
mantic similarity among words, such as synonym
pairs can recognize entailment in pairs with a new
detection and concept clustering (Turney and Pan-
quantifier configuration. For example, we can
tel, 2010).
train on many dogs |= some dogs then correctly
predict all cats|=several cats. Interestingly, on the Recently, there has been a flurry of interest
QN entailment task, neither our classifier trained in DS to model meaning composition: How can
on AN-N pairs nor the balAPinc method beat we derive the DS representation of a composite
baseline methods. This suggests that our success- phrase from that of its constituents? Although the
ful QN classifiers tap into vector properties be- general focus in the area is to perform algebraic
yond such relations as feature inclusion that those operations on word semantic vectors (Mitchell
methods for nominal entailment rely upon. and Lapata, 2010), some researchers have also di-
rectly examined the corpus contexts of phrases.
Together, our experiments show that corpus- For example, Baldwin et al. (2003) studied vec-
harvested DS representations of composite ex- tor extraction for phrases because they were inter-
pressions such as ANs and QNs contain suffi- ested in the decomposability of multiword expres-
cient information to capture and generalize their sions. Baroni and Zamparelli (2010) and Gue-
inference patterns. This result brings DS closer vara (2010) look at corpus-harvested phrase vec-
to the central concerns of FS. In particular, the tors to learn composition functions that should de-
QN study is the first to our knowledge to show rive such composite vectors automatically. Ba-
that DS vectors capture semantic properties not roni and Zamparelli, in particular, showed qual-
only of content words, but of an important class of itatively that directly corpus-harvested vectors for
function words (quantifying determiners) deeply AN constructions are meaningful; for example,
studied in FS but of little interest until now in DS. the vector of young husband has nearest neigh-
Besides these theoretical implications, our re- bors small son, small daughter and mistress. Fol-
sults are of practical import. First, our AN study lowing up on this approach, we show here quanti-
presents a novel, practical method for detect- tatively that corpus-harvested AN vectors are also
ing lexical entailment that reaches state-of-the- useful for detecting entailment. We find moreover
art performance with little or no manual interven- distributional vectors informative and useful not
tion. Lexical entailment is in turn fundamental only for phrases made of content words (such as
for constructing ontologies and other lexical re- ANs) but also for phrases containing functional
sources (Buitelaar and Cimiano, 2008). Second, elements, namely quantifying determiners.
our QN study demonstrates that phrasal entail-
2.2 Entailment from formal to distributional
ment can be automatically detected and thus paves
semantics
the way to apply DS to advanced NLP tasks such
as recognizing textual entailment (Dagan et al., Entailment in FS To characterize the condi-
2009). tions under which a sentence is true, FS begins
with the lexical meanings of the words in the sen-
tence and builds up the meanings of larger and
1
In the sequel we will simply refer to a quantifying de- larger phrases until it arrives at the meaning of the
terminer as a quantifier. whole sentence. The meanings throughout this

compositional process inhabit a variety of seman- for phrasal entailment in a way that can be cap-
tic domains, depending on the syntactic category tured and generalized to unseen phrase pairs.
of the expressions: typically, a sentence denotes a Rather recently, the study of sentential entail-
truth value (true or false) or truth conditions, ment has taken an empirical turn, thanks to the de-
a noun such as cat denotes a set of entities, and a velopment of benchmarks for entailment systems.
quantifier phrase (QP) such as all cats denotes a The FS definition of entailment has been modified
set of sets of entities. by taking common sense into account. Instead of
The entailment relation (|=) is a core notion of a relation from the truth of the consequent to the
logic: it holds between one or more sentences and truth of the antecedent in any circumstance, the
a sentence such that it cannot be that the former applied view looks at entailment in terms of plau-
(antecedent) are true and the latter (consequent) sibility: |= if a human who reads (and trusts)
is false. FS extends this notion from formal-logic would most likely infer that is also true. En-
sentences to natural-language expressions. By as- tailment systems have been compared under this
signing meanings to parts of a sentence, FS allows new perspective in various evaluation campaigns,
defining entailment not only among sentences but the best known being the Recognizing Textual En-
also among words and phrases. Each semantic tailment (RTE) initiative (Dagan et al., 2009).
domain A has its own entailment relation |=A . Most RTE systems are based on advanced NLP
The entailment relation |=S among sentences is components, machine learning techniques, and/or
the logical notion just described, whereas the en- syntactic transformations (Zanzotto et al., 2007;
tailment relations |=N and |=QP among nouns Kouleykov and Magnini, 2005). A few systems
and quantifier phrases are the inclusion relations exploit deep FS analysis (Bos and Markert, 2006;
among sets of entities and sets of sets of entities Chambers et al., 2007). In particular, the FS re-
respectively. Our results in Section 5 show that sults about QP properties that affect entailment
DS needs to treat |=N and |=QP differently as well. have been exploited by Chambers et al, who com-
plement a core broad-coverage system with a Nat-
Empirical, corpus-based perspectives on en- ural Logic module to trade lower recall for higher
tailment Until recently, the corpus-based re- precision. For instance, they exploit the mono-
search tradition has studied entailment mostly at tonicity properties of no that cause the follow-
the word level, with applied goals such as clas- ing reversal in entailment direction: some bee-
sifying lexical relations and building taxonomic tles |= some insects but no insects |= no beetles.
WordNet-like resources automatically. The most To investigate entailment step by step, we ad-
popular approach, first adopted by Hearst (1992), dress here a much simpler and clearer type of
extracts lexical relations from patterns in large entailment than the more complex notion taken
corpora. For instance, from the pattern N1 such up by the RTE community. While RTE is out-
as N2 one learns that N2 |= N1 (from insects such side our present scope, we do focus on QP entail-
as beetles, derive beetles |= insects). Several stud- ment as Natural Logic does. However, our eval-
ies have refined and extended this approach (Pan- uation differs from Chambers et al.s, since we
tel and Ravichandran, 2004; Snow et al., 2005; rely on general-purpose DS vectors as our only
Snow et al., 2006; Turney, 2008). resource, and we look at phrase pairs with differ-
While empirically very successful, the pattern- ent quantifiers but the same noun. For instance,
based method is mostly limited to single content we aim to predict that all beetles |= many beetles
words (or frequent content-word phrases). We are but few beetles 6|= all beetles. QPs, of course, have
interested in entailment between phrases, where it many well-known semantic properties besides en-
is not obvious how to use lexico-syntactic patterns tailment; we leave their analysis to future study.
and cope with data sparsity. For instance, it seems
hard to find a pattern that frequently connects one Entailment in DS Erk (2009) suggests that it
QP to another it entails, as in all beetles PATTERN may not be possible to induce lexical entailment
many beetles. Hence, we aim to find a more gen- directly from a vector space representation, but it
eral method and investigate whether DS vectors is possible to encode the relation in this space af-
(whether corpus-harvested or compositionally de- ter it has been derived through other means. On
rived) encode the information needed to account the other hand, recent studies (Geffet and Dagan,

2005; Kotlerman et al., 2010; Weeds et al., 2004) into pointwise mutual information (PMI) scores
have pursued the intuition that entailment is the (Church and Hanks, 1990). The result of this step
asymmetric ability of one term to substitute for is a sparse matrix (with both positive and negative
another. For example, baseball contexts are also entries) with 48K rows (one per phrase of interest)
sport contexts but not vice versa, hence baseball and 27K columns (one per content word).
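As an illustration of converting co-occurrence counts into PMI scores (a dense toy matrix is used here purely for readability; the actual matrix is sparse, and cells with zero counts are simply left at zero):

```python
import numpy as np

def pmi_matrix(counts):
    """counts[i, j]: co-occurrence count of phrase i with context word j.

    Returns log P(i, j) / (P(i) P(j)); cells with zero counts are left at 0."""
    total = counts.sum()
    p_ij = counts / total
    p_i = p_ij.sum(axis=1, keepdims=True)
    p_j = p_ij.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i * p_j))
    pmi[counts == 0] = 0.0
    return pmi

counts = np.array([[10.0, 0.0, 3.0], [2.0, 8.0, 1.0]])
print(pmi_matrix(counts))
```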
is narrower than sport and baseball |= sport. On
this view, entailment between vectors corresponds 3.2 The AN |= N data set
to inclusion of contexts or features, and can be To characterize entailment between nouns using
captured by asymmetric measures of distribution their semantic vectors, we need data exemplifying
similarity. In particular, Kotlerman et al. (2010) which noun entails which. This section introduces
carefully crafted the balAPinc measure (see Sec- one cheap way to collect such a training data set
tion 3.5 below). We adopt this measure because exploiting semantic vectors for composed expres-
it has been shown to outperform others in several sions, namely AN sequences. We rely on the lin-
tasks that require lexical entailment information. guistic fact that ANs share a syntactic category
Like Kotlerman et al., we want to capture the and semantic type with plain common nouns (big
entailment relation between vectors of features. cat shares syntactic category and semantic type
However, we are interested in entailment not only with cat). Furthermore, most adjectives are re-
between words but also between phrases, and we strictive in the sense that, for every noun N, the
ask whether the DS view of entailment as fea- AN sequence entails the N alone (every big cat
ture inclusion, which captures entailment between is a cat). From a distributional point of view, the
nouns, also captures entailment between QPs. To vector for an N should by construction include the
this end, we complement balAPinc with a more information in the vector for an AN, given that the
flexible supervised classifier. contexts where the AN occurs are a subset of the
contexts where the N occurs (cat occurs in all the
3 Data and methods contexts where big cat occurs). This ideal inclu-
sion suggests that the DS notion of lexical entail-
3.1 Semantic space
ment as feature inclusion (see Section 2.2 above)
We construct distributional semantic vectors from should be reflected in the AN |= N pattern.
the 2.83-billion-token concatenation of the British Because most ANs entail their head Ns, we can
National Corpus (http://www.natcorp. create positive examples of AN |= N without any
ox.ac.uk/), WackyPedia and ukWaC (http: manual inspection of the corpus: simply pair up
//wacky.sslmit.unibo.it/). We tok- the semantic vectors of ANs and Ns. Furthermore,
enize and POS-tag this corpus, then lemmatize because an AN usually does not entail another N,
it with TreeTagger (Schmid, 1995) to merge sin- we can create negative examples (AN1 6|= N2 ) just
gular and plural instances of words and phrases by randomly permuting the Ns. Of course, such
(some dogs is mapped to some dog). unsupervised data would be slightly noisy, espe-
We process the corpus in two steps to compute cially because some of the most frequent adjec-
semantic vectors representing our phrases of in- tives are not restrictive.
terest. We use phrases of interest as a general To collect cleaner data and to be sure that we
term to refer to both multiword phrases and sin- are really examining the phenomenon of entail-
gle words, and more precisely to: those AN and ment, we took a mere few moments of man-
QN sequences that are in the data sets (see next ual effort to select the 256 restrictive adjectives
subsections), the adjectives, quantifiers and nouns from the most frequent 300 adjectives in the cor-
contained in those sequences, and the most fre- pus. We then took the Cartesian product of these
quent (9.8K) nouns and (8.1K) adjectives in the 256 adjectives with the 200 concrete nouns in the
corpus. The first step is to count the content BLESS data set (Baroni and Lenci, 2011). Those
words (more precisely, the most frequent 9.8K nouns were chosen to avoid highly polysemous
nouns, 8.1K adjectives, and 9.6K verbs in the cor- words. From the Cartesian product, we obtain a
pus) that occur in the same sentence as phrases total of 1246 AN sequences, such as big cat, that
of interest. In the second step, following standard occur more than 100 times in the corpus. These
practice, the co-occurrence counts are converted AN sequences encompass 190 of the 256 adjec-

tives and 128 of the 200 nouns. Quantifier pair Instances Correct
The process results in 1246 positive instances all |= some 1054 1044 (99%)
of AN |= N entailment, which we use as training all |= several 557 550 (99%)
data. To create a comparable amount of negative each |= some 656 647 (99%)
data, we randomly permuted the nouns in the pos- all |= many 873 772 (88%)
itive instances to obtain pairs of AN1 6|= N2 (e.g., much |= some 248 217 (88%)
big cat 6|= dog). We manually double-checked that every |= many 460 400 (87%)
all positive and negative examples are correctly many |= some 951 822 (86%)
all |= most 465 393 (85%)
classified (2 of 1246 negative instances were re-
several |= some 580 439 (76%)
moved, leaving 1244 negative training examples). both |= some 573 322 (56%)
many |= several 594 113 (19%)
3.3 The lexical entailment N1 |= N2 data set most |= many 463 84 (18%)
For testing data, we first listed all WordNet nouns both |= either 63 1 (2%)
in our corpus, then extracted hyponym-hypernym Subtotal 7537 5804 (77%)
chains linking the first synsets of these nouns. For some 6|= every 484 481 (99%)
example, pope is found to entail leader because several 6|= all 557 553 (99%)
WordNet contains the chain pope spiritual several 6|= every 378 375 (99%)
leader leader. Eliminating the 20 hypernyms some 6|= all 1054 1043 (99%)
many 6|= every 460 452 (98%)
with more than 180 hyponyms (mostly very ab-
some 6|= each 656 640 (98%)
stract nouns such as entity, object, and quality) few 6|= all 157 153 (97%)
yields 9734 hyponym-hypernym pairs, encom- many 6|= all 873 843 (97%)
passing 6402 nouns. Manually double-checking both 6|= most 369 347 (94%)
these pairs leaves us with 1385 positive instances several 6|= few 143 134 (94%)
of N1 |= N2 entailment. both 6|= many 541 397 (73%)
We created the negative instances of again 1385 many 6|= most 463 300 (65%)
either 6|= both 63 39 (62%)
pairs by inverting 33% of the positive instances
many 6|= no 714 369 (52%)
(from pope|=leader to leader6|=pope), and by ran- some 6|= many 951 468 (49%)
domly shuffling the words across the positive in- few 6|= many 161 33 (20%)
stances. We also manually double-checked these both 6|= several 431 63 (15%)
pairs to make sure that they are not hyponym- Subtotal 8455 6690 (79%)
hypernym pairs. Total 15992 12494 (78%)
3.4 The Q1 N |= Q2 N data set Table 1: Entailing and non-entailing quantifier pairs
We study 12 quantifiers: all, both, each, either, with number of instances per pair (Section 3.4) and
every, few, many, most, much, no, several, some. SVMpair-out performance breakdown (Section 5).
We took the Cartesian product of these quantifiers
with the 6402 WordNet nouns described in Sec-
rise to an instance of entailment (Q1 N |= Q2 N if
tion 3.3. From this Cartesian product, we obtain
Q1 |= Q2 ; example: many dogs |= several dogs) or
a total of 28926 QN sequences, such as every cat,
non-entailment (Q1 N6|=Q2 N if Q1 6|=Q2 ; example:
that occur at least 100 times in the corpus. These
many dogs6|=most dogs). The number of QN pairs
are our QN phrases of interest to which the proce-
that each quantifier pair gives rise to in this way is
dure in Section 3.1 assigns a semantic vector.
listed in the second column of Table 1. As shown
Also, from the set of quantifier pairs (Q1 , Q2 )
there, we have a total of 7537 positive instances
where Q1 6= Q2 , we identified 13 clear cases
and 8455 negative instances of QN entailment.
where Q1 |=Q2 and 17 clear cases where Q1 6|=Q2 .
These 30 cases are listed in the first column of
3.5 Classification methods
Table 1. For each of these 30 quantifier pairs
(Q1 , Q2 ), we enumerate those WordNet nouns N We consider two methods to classify candidate
such that semantic vectors are available for both pairs as entailing or non-entailing, the balAPinc
Q1 N and Q2 N (that is, both sequences occur in measure of Kotlerman et al. (2010) and a standard
at least 100 times). Each such noun then gives Support Vector Machine (SVM) classifier.

27
balAPinc As discussed in Section 2.2, balAP- To adapt balAPinc to recognize entailment, we
inc is optimized to capture a relation of feature must select a threshold t above which we classify
inclusion between the narrower (entailing) and a pair as entailing. In the experiments below, we
broader (entailed) terms, while capturing other in- explore two approaches. In balAPincupper , we op-
tuitions about the relative relevance of features. timize the threshold directly on the test data, by
balAPinc averages two terms, APinc and LIN. setting t to maximize the F-measure on the test
APinc is given by: set. This gives us an upper bound on how well bal-
$\mathrm{APinc}(u \models v) = \frac{\sum_{r=1}^{|F_u|} P(r) \cdot rel'(f_r)}{|F_u|}$
APinc could perform on the test set (but note that optimizing F does not necessarily translate into a good accuracy performance, as clearly illustrated
APinc is a version of the Average Precision by Table 3 below). In balAPincAN |= N , we use the
measure from Information Retrieval tailored to AN |= N data set as training data and pick the t
lexical inclusion. Given vectors Fu and Fv rep- that maximizes F on this training set.
resenting the dimensions with positive PMI val- We use the balAPinc measure as a refer-
ues in the semantic vectors of the candidate pair ence point because, on the evidence provided by
u |= v, the idea is that we want the features (that Kotlerman et al., it is the state of the art in various
is, vector dimensions) that have larger values in tasks related to lexical entailment. We recognize
Fu to also have large values in Fv (the opposite however that it is somewhat complex and specifi-
does not matter because it is u that should be in- cally tuned to capturing the relation of feature in-
cluded in v, not vice versa). The Fu features are clusion. Consequently, we also experiment with
ranked according to their PMI value so that fr a more flexible classifier, which can detect other
is the feature in Fu with rank r, i.e., r-th high- systematic properties of vectors in an entailment
est PMI. Then the sum of the product of the two relation. We present this classifier next.
terms P (r) and rel0 (fr ) across the features in Fu
is computed. The first term is the precision at r, SVM Support vector machines are widely used
which is higher when highly ranked u features are high-performance discriminative classifiers that
present in Fv as well. The relevance term rel0 (fr ) find the hyperplane providing the best separation
is higher when the feature fr in Fu also appears between negative and positive instances (Cristian-
in Fv with a high rank. (See Kotlerman et al. for ini and Shawe-Taylor, 2000). Our SVM classifiers
how P (r) and rel0 (fr ) are computed.) The result- are trained and tested using Weka 3 and LIBSVM
ing score is normalized by dividing by the entail- 2.8 (Chang and Lin, 2011). We use the default
ing vector size |Fu | (in accordance with the idea polynomial kernel ((u v/600)3 ) with  (tolerance
that having more v features should not hurt be- of termination criterion) set to 1.6. This value was
cause the u features should be included in the v tuned on the AN|=N data set, which we never use
features, not vice versa). for testing. In the same initial tuning experiments
To balance the potentially excessive asymmetry on the AN |= N data set, SVM outperformed deci-
of APinc towards the features of the antecedent, sion trees, naive Bayes, and k-nearest neighbors.
Kotlerman et al. average it with LIN, the widely We feed each potential entailment pair to SVM
used symmetric measure of distributional similar- by concatenating the two vectors representing the
ity proposed by Lin (1998): antecedent and consequent expressions.2 How-
$\mathrm{LIN}(u, v) = \frac{\sum_{f \in F_u \cap F_v} [w_u(f) + w_v(f)]}{\sum_{f \in F_u} w_u(f) + \sum_{f \in F_v} w_v(f)}$
ness, we reduce the dimensionality of the semantic vectors to 300 columns using Singular Value
LIN essentially measures feature vector overlap. Decomposition (SVD) before feeding them to the
The positive PMI values wu (f ) and wv (f ) of a classifier.3 Because the SVD-reduced semantic
feature f in Fu and Fv are summed across those 2
We have tried also to represent a pair by subtracting and
features that are positive in both vectors, normal- by dividing the two vectors. The concatenation operation
izing by the cumulative positive PMI mass in both gave more successful results.
3
vectors. Finally, balAPinc is the geometric aver- To keep a manageable parameter space, we picked 300
age of APinc and LIN: columns without tuning. This is the best value reported in
many earlier studies, including classic LSA. Since SVD
$\mathrm{balAPinc}(u \models v) = \sqrt{\mathrm{APinc}(u \models v) \cdot \mathrm{LIN}(u, v)}$
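The three quantities can be sketched compactly as follows (illustrative Python; vectors are represented as dictionaries from features to positive PMI weights, and the rank-based relevance term follows Kotlerman et al. (2010)):

```python
import math

def balapinc(u, v):
    """u, v: dicts mapping features to positive PMI weights
    for the antecedent (u) and the consequent (v)."""
    fu = sorted(u, key=u.get, reverse=True)                       # F_u ranked by PMI
    fv_rank = {f: r for r, f in
               enumerate(sorted(v, key=v.get, reverse=True), 1)}  # ranks within F_v
    included, apinc_sum = 0, 0.0
    for r, f in enumerate(fu, 1):
        if f in fv_rank:                                          # feature also in F_v
            included += 1
            rel = 1.0 - fv_rank[f] / (len(fv_rank) + 1)           # rel'(f_r)
        else:
            rel = 0.0
        apinc_sum += (included / r) * rel                         # P(r) * rel'(f_r)
    apinc = apinc_sum / len(fu)
    shared = set(u) & set(v)
    lin = (sum(u[f] + v[f] for f in shared) /
           (sum(u.values()) + sum(v.values())))                   # LIN(u, v)
    return math.sqrt(apinc * lin)                                 # geometric average

print(balapinc({"bat": 3.0, "league": 1.0},
               {"bat": 2.5, "match": 2.0, "league": 1.0}))
```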

                   P     R     F     Accuracy (95% C.I.)
SVMupper           88.6  88.6  88.5  88.6 (87.3–89.7)
balAPinc_AN|=N     65.2  87.5  74.7  70.4 (68.7–72.1)
balAPinc_upper     64.4  90.0  75.1  70.1 (68.4–71.8)
SVM_AN|=N          69.3  69.3  69.3  69.3 (67.6–71.0)
cos(N1, N2)        57.7  57.6  57.5  57.6 (55.8–59.5)
fq(N1) < fq(N2)    52.1  52.1  51.8  53.3 (51.4–55.2)

vectors occupy a 300-dimensional space, the entailment pairs occupy a 600-dimensional space. An SVM with a polynomial kernel takes into account not only individual input features but also their interactions (Manning et al., 2008, chapter 15). Thus, our classifier can capture not just properties of individual dimensions of the antecedent and consequent pairs, but also properties of their combinations (e.g., the product of the first dimensions of the antecedent and the consequent). We conjecture that this property of SVMs is fundamental to their success at detecting entailment, where relations between the antecedent and the
Table 2: Detecting lexical entailment. Results ranked by accuracy and expressed as percentages. 95% confidence intervals around accuracy calculated by bino-
mial exact tests.
consequent should matter more than their inde-
pendent characteristics.
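A schematic version of this classifier, using scikit-learn as a stand-in for the Weka/LIBSVM setup described above (the kernel (u · v/600)^3 corresponds to a degree-3 polynomial kernel with gamma = 1/600 and zero offset; function and variable names are illustrative, not the authors' implementation):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

def train_entailment_svm(phrase_matrix, pairs, labels):
    """phrase_matrix: (n_phrases, n_contexts) PMI matrix;
    pairs: list of (antecedent_row, consequent_row) index pairs;
    labels: 1 for entailing pairs, 0 otherwise."""
    svd = TruncatedSVD(n_components=300, random_state=0)
    phrase_vecs = svd.fit_transform(phrase_matrix)          # 300-dim phrase vectors
    left = phrase_vecs[[i for i, _ in pairs]]
    right = phrase_vecs[[j for _, j in pairs]]
    X = np.hstack([left, right])                             # 600-dim pair vectors
    # (u . v / 600)^3  ==  polynomial kernel, degree 3, gamma = 1/600, coef0 = 0
    clf = SVC(kernel="poly", degree=3, gamma=1.0 / 600, coef0=0.0, tol=1.6)
    clf.fit(X, labels)
    return svd, clf
```

Subtracting or dividing the two vectors could be swapped in for the concatenation, but, as noted in the footnote above, concatenation gave the more successful results.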
accuracy on the test set, which is balanced be-
4 Predicting lexical entailment from tween positive and negative instances. Interest-
AN |= N evidence ingly, the balAPinc decision thresholds tuned on
Since the contexts of AN must be a subset of the the AN |= N set and on the test data are very
contexts of N, semantic vectors harvested from close (0.26 vs. 0.24), resulting in very similar per-
AN phrases and their head Ns are by construc- formance for balAPincAN |= N and balAPincupper .
tion in an inclusion relation. The first experiment This suggests that the relation captured by bal-
shows that these vectors constitute excellent train- APinc on the phrasal entailment training data is
ing data to discover entailment between nouns. indeed the same that the measure captures when
This suggests that the vector pairs representing applied to lexical entailment data.
entailment between nouns are also in an inclusion The success of this first experiment shows that
relation, supporting the conjectures of Kotlerman the entailment relation present in the distribu-
et al. (2010) and others. tional representation of AN phrases and their
Table 2 reports the results we obtained with head Ns transfers to lexical entailment (entailment
balAPincupper , balAPincAN |= N (Section 3.5) and among Ns). Most importantly, this result demon-
SVMAN |= N (the SVM classifier trained on the strates that the semantic vectors of composite ex-
AN |= N data). As an upper bound for meth- pressions (such as ANs) are useful for lexical en-
ods that generalize from AN |= N, we also re- tailment. Moreover, the result is in accordance
port the performance of SVM trained with 10-fold with the view of FS, that ANs and Ns have the
cross-validation on the N1 |= N2 data themselves same semantic type, and thus they enter entail-
(SVMupper ). Finally, we tried two baseline classi- ment relations of the same kind. Finally, the hy-
fiers. The first baseline (fq(N1 ) < fq(N2 )) guesses pothesis that entailment among nouns is reflected
entailment if the first word is less frequent than by distributional inclusion among their semantic
the second. The second (cos(N1 , N2 )) applies a vectors (Kotlerman et al., 2010) is supported both
threshold (determined on the test set) to the co- by the successful generalization of the SVM clas-
sine similarity of the pair. The results of these sifier trained on AN |= N pairs and by the good
baselines shown in Table 2 use SVD; those with- performance of the balAPinc measure.
out SVD are similar. Both baselines outperformed 5 Generalizing QN entailment
more trivial methods such as random guessing or
fixed response, but they performed significantly The second study is somewhat more ambitious,
worse than SVM and balAPinc. as it aims to capture and generalize the entailment
Both methods that generalize entailment from relation between QPs (of shape QN) using only
AN |= N to N1 |= N2 perform well, with 70% the corpus-harvested semantic vectors represent-
mais, 1997; Rapp, 2003; Schutze, 1997), we tried balAPinc
ing these phrases as evidence. We are thus first
on the SVD-reduced vectors as well, but results were consis- and foremost interested in testing whether these
tently worse than with PMI vectors. vectors encode information that can help a power-

                        P     R     F     Accuracy (95% C.I.)
SVMpair-out             76.7  77.0  76.8  78.1 (77.5–78.8)
SVMquantifier-out       70.1  65.3  68.0  71.0 (70.3–71.7)
SVM^Q_pair-out          67.9  69.8  68.9  70.2 (69.5–70.9)
SVM^Q_quantifier-out    53.3  52.9  53.1  56.0 (55.2–56.8)
cos(QN1, QN2)           52.9  52.3  52.3  53.1 (52.3–53.9)
balAPinc_AN|=N          46.7  5.6   10.0  52.5 (51.7–53.3)
SVM_AN|=N               2.8   42.9  5.2   52.4 (51.7–53.2)
fq(QN1) < fq(QN2)       51.0  47.4  49.1  50.2 (49.4–51.0)
balAPinc_upper          47.1  100   64.1  47.2 (46.4–47.9)

baselines are only slightly better overall than more trivial baselines.) We consider moreover an alternative approach that ignores the noun altogether and uses vectors for the quantifiers only (e.g., the decision about all dogs |= some dogs considers the corpus-derived all and some vectors only). The models resulting from this Q-only strategy are marked with the superscript Q in the table.
The results confirm clearly that semantic vectors for QNs contain enough information to allow a classifier to detect entailment: SVMquantifier-out performs as well as the lexical entailment classifiers of our first study, and SVMpair-out does even better. This success is especially impressive given our challenging training and testing regimes.
Table 3: Detecting quantifier entailment. Results ranked by accuracy and expressed as percentages. 95% confidence intervals around accuracy calculated
95% confidence intervals around accuracy calculated In contrast to the first study, now SVMAN |= N ,
by binomial exact tests. the classifier trained on the AN |= N data set,
and balAPinc perform no better than the base-
lines. (Here balAPincupper and balAPincAN |= N
ful classifier, such as SVM, to detect entailment. pick very different thresholds: the first settling
To abstract away from lexical or other effects on a very low t = 0.01, whereas for the sec-
linked to a specific quantifier, we consider two ond t = 0.26.) As predicted by FS (see Section
challenging training and testing regimes. In the 2.2 above), noun-level entailment does not gen-
first (SVMpair-out ), we hold out one quantifier pair eralize to quantifier phrase entailment, since the
as testing data and use the other 29 pairs in Table 1 two structures have different semantic types, cor-
as training data. Thus, for example, the classifier responding to different kinds of entailment rela-
must discover all dogs |= some dogs without see- tions. Moreover, the failure of balAPinc suggests
ing any all N |= some N instance in the training that, whatever evidence the SVMs rely upon, it is
data. In the second (SVMquantifier-out ), we hold out not simple feature inclusion.
one of the 12 quantifiers as testing data (that is, Interestingly, even the Q vectors alone encode
hold out every pair involving a certain quantifier) enough information to capture entailment above
and use the rest as training data. For example, chance. Still, the huge drop in performance from
the quantifier must guess all dogs |= some dogs SVMQ Q
pair-out to SVMquantifier-out suggests that the Q-
without ever seeing all in the training data. We only method learned ad-hoc properties that do not
expect the second training regime to be more dif- generalize (e.g., all entails every Q2 ).
ficult, not just because there is less training data, Tables 1 and 4 break down the SVM results by
but also because the trained classifier is tested on (pairs of) quantifiers. We highlight the remark-
a quantifier that it has never encountered within able dichotomy in Table 4 between the good per-
any training QN sequence.4 formance on the universal-like quantifiers (each,
Table 3 reports the results for SVMpair-out and every, all, much) and the poor performance on the
SVMquantifier-out , as well as for the methods we existential-like ones (some, no, both, either).
tried in the lexical entailment experiments. (As In sum, the QN experiments show that seman-
in the first study, the frequency- and cosine-based tic vectors contain enough information to detect
4
a logical relation such as entailment not only be-
In our initial experiments, we added negative entail-
ment instances by blindly permuting the nouns, under the
tween words, but also between phrases contain-
assumption that Q1 N1 typically does not entail Q2 N2 when ing quantifiers that determine their entailment re-
Q1 6= Q2 and N1 6= N2 . These additional instances turned lation. While a flexible classifier such as SVM
out to be much easier to classify: adding an equal proportion performs this task well, neither measuring fea-
of them to the training data and testing data, such that the
number of instances where N1 = N2 and where N1 6= N2
ture inclusion nor generalizing nominal entail-
is equal, reduced every error rate roughly by half. The re- ment works. SVMs are evidently tapping into
ported results do not involve these additional instances. other properties of the vectors.

Quantifier Instances Correct Very importantly, instead of extracting vectors
|= 6|= |= 6|= representing phrases directly from the corpus, we
each 656 656 649 637 (98%) intend to derive them by compositional operations
every 460 1322 402 1293 (95%) proposed in the literature (see Section 2.1 above).
much 248 0 216 0 (87%) We will look for composition methods producing
all 2949 2641 2011 2494 (81%) vector representations of composite expressions
several 1731 1509 1302 1267 (79%) that are as good as (or better than) vectors directly
many 3341 4163 2349 3443 (77%) extracted from the corpus at encoding entailment.
few 0 461 0 311 (67%)
Finally, we would like to evaluate our entail-
most 928 832 549 511 (60%)
some 4062 3145 1780 2190 (55%) ment detection strategies for larger phrases and
no 0 714 0 380 (53%) sentences, possibly containing multiple quanti-
both 636 1404 589 303 (44%) fiers, and eventually embed them as core compo-
either 63 63 2 41 (34%) nents of an RTE system.
Total 15074 16910 9849 12870 (71%)
Acknowledgments
Table 4: Breakdown of results with leaving-one-
We thank the Erasmus Mundus EMLCT Program
quantifier-out (SVMquantifier-out ) training regime.
for the student and visiting scholar grants to the
third and fourth author, respectively. The first
6 Conclusion two authors are partially funded by the ERC 2011
Starting Independent Research Grant supporting
Our main results are as follows. the COMPOSES project (nr. 283554). We are
grateful to Gemma Boleda, Louise McNally, and
1. Corpus-harvested semantic vectors repre-
the anonymous reviewers for valuable comments,
senting adjective-noun constructions and
and to Ido Dagan for important insights into en-
their heads encode a relation of entailment
tailment from an empirical point of view.
that can be exploited to train a classifier
to detect lexical entailment. In particular,
a relation of feature inclusion between the References
narrower antecedent and broader consequent
terms captures both AN |= N and N1 |= N2 Timothy Baldwin, Colin Bannard, Takaaki Tanaka,
and Dominic Widdows. 2003. An empirical model
entailment.
of multiword expression decomposability. In Pro-
ceedings of the ACL 2003 Workshop on Multiword
2. The semantic vectors of quantifier-noun con-
Expressions, pages 8996.
structions also encode information sufficient Marco Baroni and Alessandro Lenci. 2011. How
to learn an entailment relation that general- we BLESSed distributional semantic evaluation. In
izes to QNs containing quantifiers that were Proceedings of the Workshop on Geometrical Mod-
not seen during training. els of Natural Language Semantics.
Marco Baroni and Roberto Zamparelli. 2010. Nouns
3. Neither the entailment information encoded are vectors, adjectives are matrices: Representing
in AN |= N vectors nor the balAPinc mea- adjective-noun constructions in semantic space. In
sure generalizes well to entailment detection Proceedings of EMNLP, pages 11831193, Boston,
in QNs. This result suggests that QN vectors MA.
encode a different kind of entailment, as also Johan Bos and Katja Markert. 2006. When logical
suggested by type distinctions in Formal Se- inference helps determining textual entailment (and
when it doesnt. In Proceedings of the Second PAS-
mantics. CAL Challenges Workshop on Recognising Textual
Entailment.
In future work, we want first of all to conduct
Paul Buitelaar and Philipp Cimiano. 2008. Bridging
an analysis of the features in the Q1 N |= Q2 N vec- the Gap between Text and Knowledge. IOS, Ams-
tors that are crucially exploited by our success- terdam.
ful entailment recognizers, in order to understand Nathanael Chambers, Daniel Cer, Trond Grenager,
which characteristics of entailment are encoded in David Hall, Chloe Kiddon, Bill MacCartney, Marie-
these vectors. Catherine de Marneffe, Daniel Ramage, Eric Yeh,

Acknowledgments

We thank the Erasmus Mundus EMLCT Program for the student and visiting scholar grants to the third and fourth author, respectively. The first two authors are partially funded by the ERC 2011 Starting Independent Research Grant supporting the COMPOSES project (nr. 283554). We are grateful to Gemma Boleda, Louise McNally, and the anonymous reviewers for valuable comments, and to Ido Dagan for important insights into entailment from an empirical point of view.

References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96.

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP, pages 1183–1193, Boston, MA.

Johan Bos and Katja Markert. 2006. When logical inference helps determining textual entailment (and when it doesn't). In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.

Paul Buitelaar and Philipp Cimiano. 2008. Bridging the Gap between Text and Knowledge. IOS, Amsterdam.

Nathanael Chambers, Daniel Cer, Trond Grenager, David Hall, Chloe Kiddon, Bill MacCartney, Marie-Catherine de Marneffe, Daniel Ramage, Eric Yeh, and Christopher D. Manning. 2007. Learning alignments and leveraging natural logic. In ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.

Kenneth Church and Peter Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

Nello Cristianini and John Shawe-Taylor. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge.

Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth. 2009. Recognizing textual entailment: rational, evaluation and approaches. Natural Language Engineering, 15:459–476.

Katrin Erk. 2009. Supporting inferences in semantic space: representing words as regions. In Proceedings of IWCS, pages 104–115, Tilburg, Netherlands.

Maayan Geffet and Ido Dagan. 2005. The distributional inclusion hypotheses and lexical entailment. In Proceedings of ACL, pages 107–114, Ann Arbor, MI.

Edward Grefenstette and Mehrnoosh Sadrzadeh. 2011. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of EMNLP, pages 1394–1404, Edinburgh.

Emiliano Guevara. 2010. A regression model of adjective-noun compositionality in distributional semantics. In Proceedings of the ACL GEMS Workshop, pages 33–37, Uppsala, Sweden.

Marti Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING, pages 539–545, Nantes, France.

Irene Heim and Angelika Kratzer. 1998. Semantics in Generative Grammar. Blackwell, Oxford.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.

Milen Kouleykov and Bernardo Magnini. 2005. Tree edit distance for textual entailment. In Proceedings of RANLP-2005, International Conference on Recent Advances in Natural Language Processing, pages 271–278.

Thomas Landauer and Susan Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211–240.

Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of ICML, pages 296–304, Madison, WI, USA.

Kevin Lund and Curt Burgess. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, 28:203–208.

Chris Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Richard Montague. 1970. Universal Grammar. Theoria, 36:373–398.

Patrick Pantel and Deepak Ravichandran. 2004. Automatically labeling semantic classes. In Proceedings of HLT-NAACL 2004, pages 321–328.

Reinhard Rapp. 2003. Word sense discovery based on sense descriptor dissimilarity. In Proceedings of the 9th MT Summit, pages 315–322, New Orleans, LA.

Magnus Sahlgren. 2006. The Word-Space Model. Dissertation, Stockholm University.

Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the EACL-SIGDAT Workshop, Dublin, Ireland.

Hinrich Schütze. 1997. Ambiguity Resolution in Natural Language Learning. CSLI, Stanford, CA.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In Proceedings of NIPS 17.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of ACL 2006, pages 801–808.

Richmond H. Thomason, editor. 1974. Formal Philosophy: Selected Papers of Richard Montague. Yale University Press, New York.

Peter Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Peter Turney. 2008. A uniform approach to analogies, synonyms, antonyms and associations. In Proceedings of COLING, pages 905–912, Manchester, UK.

Julie Weeds, David Weir, and Diana McCarthy. 2004. Characterising measures of lexical distributional similarity. In Proceedings of the 20th International Conference on Computational Linguistics, COLING-2004, pages 1015–1021.

Fabio M. Zanzotto, Marco Pennacchiotti, and Alessandro Moschitti. 2007. Shallow semantics in fast textual entailment rule learners. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing.

Maayan Zhitomirsky-Geffet and Ido Dagan. 2010. Bootstrapping distributional feature vector quality. Computational Linguistics, 35(3):435–461.
Evaluating Distributional Models of Semantics for Syntactically
Invariant Inference

Jackie CK Cheung and Gerald Penn


Department of Computer Science
University of Toronto
Toronto, ON, M5S 3G4, Canada
{jcheung,gpenn}@cs.toronto.edu

Abstract

A major focus of current work in distributional models of semantics is to construct phrase representations compositionally from word representations. However, the syntactic contexts which are modelled are usually severely limited, a fact which is reflected in the lexical-level WSD-like evaluation methods used. In this paper, we broaden the scope of these models to build sentence-level representations, and argue that phrase representations are best evaluated in terms of the inference decisions that they support, invariant to the particular syntactic constructions used to guide composition. We propose two evaluation methods in relation classification and QA which reflect these goals, and apply several recent compositional distributional models to the tasks. We find that the models outperform a simple lemma overlap baseline slightly, demonstrating that distributional approaches can already be useful for tasks requiring deeper inference.

1 Introduction

A number of unsupervised semantic models (Mitchell and Lapata, 2008, for example) have recently been proposed which are inspired at least in part by the distributional hypothesis (Harris, 1954) that a word's meaning can be characterized by the contexts in which it appears. Such models represent word meaning as one or more high-dimensional vectors which capture the lexical and syntactic contexts of the word's occurrences in a training corpus.

Much of the recent work in this area has, following Mitchell and Lapata (2008), focused on the notion of compositionality as the litmus test of a truly semantic model. Compositionality is a natural way to construct representations of linguistic units larger than a word, and it has a long history in Montagovian semantics for dealing with argument structure and assembling rich semantical expressions of the kind found in predicate logic.

While compositionality may thus provide a convenient recipe for producing representations of propositionally typed phrases, it is not a necessary condition for a semantic representation. Rather, that distinction still belongs to the crucial ability to support inference. It is not the intention of this paper to argue for or against compositionality in semantic representations. Rather, our interest is in evaluating semantic models in order to determine their suitability for inference tasks. In particular, we contend that it is desirable and arguably necessary for a compositional semantic representation to support inference invariantly, in the sense that the particular syntactic construction that guided the composition should not matter relative to the representations of syntactically different phrases with the same meanings. For example, we can assert that John threw the ball and The ball was thrown by John have the same meaning for the purposes of inference, even though they differ syntactically.

An analogy can be drawn to research in image processing, in which it is widely regarded as important for the representations of images to be invariant to rotation and scaling. What we should want is a representation of sentence meaning that is invariant to diathesis, other regular syntactic alternations in the assignment of argument structure, and, ideally, even invariant to other meaning-preserving or near-preserving paraphrases.
Existing evaluations of distributional semantic 2 Compositionality and Distributional
models fall short of measuring this. One evalua- Semantics
tion approach consists of lexical-level word sub-
The idea of compositionality has been central to
stitution tasks which primarily evaluate a sys-
understanding contemporary natural language se-
tems ability to disambiguate word senses within a
mantics from an historiographic perspective. The
controlled syntactic environment (McCarthy and
idea is often credited to Frege, although in fact
Navigli, 2009, for example). Another approach is
Frege had very little to say about compositional-
to evaluate parsing accuracy (Socher et al., 2010,
ity that had not already been repeated since the
for example), which is really a formalism-specific
time of Aristotle (Hodges, 2005). Our modern
approximation to argument structure analysis.
notion of compositionality took shape primarily
These evaluations may certainly be relevant to
with the work of Tarski (1956), who was actu-
specific components of, for example, machine
ally arguing that a central difference between for-
translation or natural language generation sys-
mal languages and natural languages is that nat-
tems, but they tell us little about a semantic
ural language is not compositional. This in turn
models ability to support inference.
was the the contention that an important theo-
In this paper, we propose a general framework
retical difference exists between formal and nat-
for evaluating distributional semantic models that
ural languages, that Richard Montague so fa-
build sentence representations, and suggest two
mously rejected (Montague, 1974). Composi-
evaluation methods that test the notion of struc-
tionality also features prominently in Fodor and
turally invariant inference directly. Both rely on
Pylyshyns (1988) rejection of early connection-
determining whether sentences express the same
ist representations of natural language semantics,
semantic relation between entities, a crucial step
which seems to have influenced Mitchell and La-
in solving a wide variety of inference tasks like
pata (2008) as well.
recognizing textual entailment, information re-
Logic-based forms of compositional semantics
trieval, question answering, and summarization.
have long strived for syntactic invariance in mean-
The first evaluation is a relation classification
ing representations, which is known as the doc-
task, where a semantic model is tested on its abil-
trine of the canonical form. The traditional justifi-
ity to recognize whether a pair of sentences both
cation for canonical forms is that they allow easy
contain a particular semantic relation, such as
access to a knowledge base to retrieve some de-
Company X acquires Company Y. The second task
sired information, which amounts to a form of in-
is a question answering task, the goal of which is
ference. Our work can be seen as an extension of
to locate the sentence in a document that contains
this notion to distributional semantic models with
the answer. Here, the semantic model must match
a more general notion of representational similar-
the question, which expresses a proposition with a
ity and inference.
missing argument, to the answer-bearing sentence
There are many regular alternations that seman-
which contains the full proposition.
tics models have tried to account for such as pas-
We apply these new evaluation protocols to
sive or dative alternations. There are also many
several recent distributional models, extending
lexical paraphrases which can take drastically dif-
several of them to build sentence representa-
ferent syntactic forms. Take the following exam-
tions. We find that the models outperform a sim-
ple from Poon and Domingos (2009), in which the
ple lemma overlap model only slightly, but that
same semantic relation can be expressed by a tran-
combining these models with the lemma overlap
sitive verb or an attributive prepositional phrase:
model can improve performance. This result is
likely due to weaknesses in current models abil- (1) Utah borders Idaho.
ity to deal with issues such as named entities, Utah is next to Idaho.
coreference, and negation, which are not empha-
sized by existing evaluation methods, but it does In distributional semantics, the original sen-
suggest that distributional models of semantics tence similarity test proposed by Kintsch (2001)
can play a more central role in systems that re- served as the inspiration for the evaluation per-
quire deep, precise inference. formed by Mitchell and Lapata (2008) and most
later work in the area. Intransitive verbs are given

in the context of their syntactic subject, and can- which words are given in the context of the sur-
didate synonyms are ranked for their appropri- rounding sentence, and the task is to rank a given
ateness. This method targets the fact that a syn- list of proposed substitutions for that word. The
onym is appropriate for only some of the verbs list of substitutions as well as the correct rankings
senses, and the intended verb sense depends on are elicited from annotators. This task was origi-
the surrounding context. For example, burn and nally conceived as an applied evaluation of WSD
beam are both synonyms of glow, but given a par- systems, not an evaluation of phrase representa-
ticular subject, one of the synonyms (called the tions.
High similarity landmark) may be a more appro- Parsing accuracy has been used as a prelimi-
priate substitution than the other (the Low similar- nary evaluation of semantic models that produce
ity landmark). So, if the fire is the subject, glowed syntactic structure (Socher et al., 2010; Wu and
is the High similarity landmark, and beamed the Schuler, 2011). However, syntax does not always
Low similarity landmark. reflect semantic content, and we are specifically
Fundamentally, this method was designed as interested in supporting syntactic invariance when
a demonstration that compositionality in com- doing semantic inference. Also, this type of eval-
puting phrasal semantic representations does not uation is tied to a particular grammar formalism.
interfere with the ability of a representation to The existing evaluations that are most similar in
synthesize non-compositional collocation effects spirit to what we propose are paraphrase detection
that contribute to the disambiguation of homo- tasks that do not assume a restricted syntactic con-
graphs. Here, word-sense disambiguation is im- text. Washtell (2011) collected human judgments
plicitly viewed as a very restricted, highly lexi- on the general meaning similarity of candidate
calized case of inference for selecting the appro- phrase pairs. Unfortunately, no additional guid-
priate disjunct in the representation of a words ance on the definition of most similar in mean-
meaning. ing was provided, and it appears likely that sub-
Kintsch (2001) was interested in sentence sim- jects conflated lexical, syntactic, and semantic re-
ilarity, but he only conducted his evaluation on latedness. Dolan and Brockett (2005) define para-
a few hand-selected examples. Mitchell and La- phrase detection as identifying sentences that are
pata (2008) conducted theirs on a much larger in a bidirectional entailment relation. While such
scale, but chose to focus only on this single case sentences do support exactly the same inferences,
of syntactic combination, intransitive verbs and we are also interested in the inferences that can
their subjects, in order to factor out inessential be made from similar sentences that are not para-
degrees of freedom to compare their various al- phrases according to this strict definition a sit-
ternative models more equitably. This was not uation that is more often encountered in end ap-
necessaryusing the same, sufficiently large, un- plications. Thus, we adopt a less restricted notion
biased but syntactically heterogeneous sample of of paraphrasis.
evaluation sentences would have served as an ade-
quate controland this decision furthermore pre- 3 An Evaluation Framework
vents the evaluation from testing the desired in- We now describe a simple, general framework
variance of the semantic representation. for evaluating semantic models. Our framework
Other lexical evaluations suffer from the same consists of the following components: a seman-
problem. One uses the WordSim-353 dataset tic model to be evaluated, pairs of sentences that
(Finkelstein et al., 2002), which contains hu- are considered to have high similarity, and pairs
man word pair similarity judgments that seman- of sentences that are considered to have low simi-
tic models should reproduce. However, the word larity.
pairs are given without context, and homography In particular, the semantic model is a binary
is unaddressed. Also, it is unclear how reliable function, s = M(x, x ), which returns a real-
the similarity scores are, as different annotators valued similarity score, s, given a pair of arbitrary
may interpret the integer scale of similarity scores linguistic units (that is, words, phrases, sentences,
differently. Recent work uses this dataset mostly etc.), x and x . Note that this formulation of the
for parameter tuning. Another is the lexical para- semantic model is agnostic to whether the models
phrase task of McCarthy and Navigli (2009), in use compositionality to build a phrase represen-

tation from constituent representations, and even to the actual representation used. The model is tested by applying it to each element in the following two sets:

H = {(h, h′) | h and h′ are linguistic units with high similarity}   (2)
L = {(l, l′) | l and l′ are linguistic units with low similarity}   (3)

The resulting sets of similarity scores are:

S^H = {M(h, h′) | (h, h′) ∈ H}   (4)
S^L = {M(l, l′) | (l, l′) ∈ L}   (5)

The semantic model is evaluated according to its ability to separate S^H and S^L. We will define specific measures of separation for the tasks that we propose shortly. While the particular definitions of "high similarity" and "low similarity" depend on the task, at the crux of both our evaluations is that two sentences are similar if they express the same semantic relation between a given entity pair, and dissimilar otherwise. This threshold for similarity is closely tied to the argument structure of the sentence, and allows considerable flexibility in the other semantic content that may be contained in the sentence, unlike the bidirectional paraphrase detection task. Yet it ensures that a consistent and useful distinction for inference is being detected, unlike unconstrained similarity judgments.

Also, compared to word similarity assessments or paraphrase elicitation, determining whether a sentence expresses a semantic relation is a much easier task cognitively for human judges. This binary judgment does not involve interpreting a numerical scale or coming up with an open-ended set of alternative paraphrases. It is thus easier to get reliable annotated data.

Below, we present two tasks that instantiate this evaluation framework and choice of similarity threshold. They differ in that the first is targeted towards recognizing declarative sentences or phrases, while the second is targeted towards a question answering scenario, where one argument in the semantic relation is queried.

3.1 Task 1: Relation Classification

The first task is a relation classification task. Relation extraction and recognition are central to a variety of other tasks, such as information retrieval, ontology construction, recognizing textual entailment and question answering.

In this task, the high and the low similarity sentence pairs are constructed in the following manner. First, a target semantic relation, such as Company X acquires Company Y, is chosen, and entities are chosen for each slot in the relation, such as Company X=Pfizer and Company Y=Rinat Neuroscience. Then, sentences containing these entities are extracted and divided into two subsets. In one of them, E, the entities are in the target semantic relation, while in the other, NE, they are not. The evaluation sets H and L are then constructed as follows:

H = (E × E) \ {(e, e) | e ∈ E}   (6)
L = E × NE   (7)

In other words, the high similarity sentence pairs are all the pairs where both express the target semantic relation, except the pairs between a sentence and itself, while the low similarity pairs are all the pairs where exactly one of the two sentences expresses the target relation.

Several sentences expressing the relation Pfizer acquires Rinat Neuroscience are shown in Examples 8 to 10. These sentences illustrate the amount of syntactic and lexical variation that the semantic model must recognize as expressing the same semantic relation. In particular, besides recognizing synonymy or near-synonymy at the lexical level, models must also account for subcategorization differences, extra arguments or adjuncts, and part-of-speech differences due to nominalization.

(8) Pfizer buys Rinat Neuroscience to extend neuroscience research and in doing so acquires a product candidate for OA. (lexical difference)

(9) A month earlier, Pfizer paid an estimated several hundred million dollars for biotech firm Rinat Neuroscience. (extra argument, subcategorization)

(10) Pfizer to Expand Neuroscience Research With Acquisition of Biotech Company Rinat Neuroscience (nominalization)

Since our interest is to measure the model's ability to separate S^H and S^L in an unsupervised setting, standard supervised classification accuracy is not applicable.
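To make the construction concrete, the evaluation sets of Eqs. 6-7 and the score sets of Eqs. 4-5 can be computed as follows. This is an illustrative Python sketch: E and NE are the two sentence subsets described above, and the model argument stands in for any similarity function M(x, x′); the function names are assumptions of the sketch, not part of an existing implementation.

```python
def build_evaluation_sets(E, NE):
    """H: pairs of relation-bearing sentences, excluding self-pairs (Eq. 6);
    L: pairs of one relation-bearing and one non-relation-bearing sentence (Eq. 7)."""
    H = [(E[i], E[j]) for i in range(len(E)) for j in range(len(E)) if i != j]
    L = [(e, n) for e in E for n in NE]
    return H, L

def score_sets(model, H, L):
    """Score sets S^H and S^L of Eqs. 4-5; model(x, y) returns a real-valued similarity."""
    return [model(a, b) for a, b in H], [model(a, b) for a, b in L]
```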
Instead, we employ the area under a ROC curve (AUC), which does not depend on choosing an arbitrary classification threshold. A ROC curve is a plot of the true positive versus false positive rate of a binary classifier as the classification threshold is varied. The area under a ROC curve can thus be seen as the performance of linear classifiers over the scores produced by the semantic model. The AUC can also be interpreted as the probability that a randomly chosen positive instance will have a higher similarity score than a randomly chosen negative instance. A random classifier is expected to have an AUC of 0.5.

3.2 Task 2: Restricted QA

The second task that we propose is a restricted form of question answering. In this task, the system is given a question q and a document D consisting of a list of sentences, in which one of the sentences contains the answer to the question. We define:

H = {(q, d) | d ∈ D and d answers q}   (11)
L = {(q, d) | d ∈ D and d does not answer q}   (12)

In other words, the sentences are divided into two subsets; those that contain the answer to q should be similar to q, while those that do not should be dissimilar. We also assume that only one sentence in each document contains the answer, so H contains only one sentence.

Unrestricted question answering is a difficult problem that forces a semantic representation to deal sensibly with a number of other semantic issues such as coreference and information aggregation which still seem to be out of reach for contemporary distributional models of meaning. Since our focus in this work is on argument structure semantics, we restrict the question-answer pairs to those that only require dealing with paraphrases of this type.

To do so, we semi-automatically restrict the question-answer pairs by using the output of an unsupervised clustering semantic parser (Poon and Domingos, 2009). The semantic parser clusters semantic sub-expressions derived from a dependency parse of the sentence, so that those sub-expressions that express the same semantic relations are clustered. The parser is used to answer questions, and the output of the parser is manually checked. We use only those cases that have thus been determined to be correct question-answer pairs. As a result of this restriction, this task is rather more like Task 1 in how it tests a model's ability to recognize lexical and syntactic paraphrases. This task also involves recognizing voicing alternations, which were automatically extracted by the semantic parser.

An example of a question-answer pair involving a voicing alternation that is used in this task is presented in Example 13.

(13) Q: What does il-2 activate?
     A: PI3K
     Sentence: Phosphatidyl inositol 3-kinase (PI3K) is activated by IL-2.

Since there is only one element in H and hence S^H for each question and document, we measure the separation between S^H and S^L using the rank of the score of the answer-bearing sentence among the scores of all the sentences in the document. We normalize the rank so that it is between 0 (ranked least similar) and 1 (ranked most similar). Where ties occur, the sentence is ranked as if it were in the median position among the tied sentences. If the question-answer pairs are zero-indexed by i, answer(i) is the index of the sentence containing the answer for the ith pair, and length(i) is the number of sentences in the document, then the mean normalized rank score of a system is:

norm_rank = E_i [ 1 - answer(i) / (length(i) - 1) ]   (14)
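Both separation measures can be computed directly from the score sets, without fitting a classifier. The sketch below is a straightforward reading of the definitions above: the AUC in its probabilistic interpretation (the chance that a positive pair outscores a negative pair, counting ties as one half), and the tie-aware normalized rank of Eq. 14 for a single question-document pair. It is an illustration of the definitions, not the implementation used in the experiments.

```python
def auc(S_H, S_L):
    """Area under the ROC curve over the two score sets (random baseline: 0.5)."""
    wins = 0.0
    for h in S_H:
        for l in S_L:
            if h > l:
                wins += 1.0
            elif h == l:
                wins += 0.5
    return wins / (len(S_H) * len(S_L))

def normalized_rank(scores, answer_index):
    """Normalized rank of the answer-bearing sentence: 1 if ranked most similar
    to the question, 0 if least similar; ties take the median position among
    the tied sentences. Assumes the document has more than one sentence."""
    s = scores[answer_index]
    higher = sum(1 for x in scores if x > s)
    ties = sum(1 for x in scores if x == s) - 1
    rank = higher + ties / 2.0   # 0 = most similar
    return 1.0 - rank / (len(scores) - 1)
```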
4 Experiments

We drew a number of recent distributional semantic models to compare in this paper. We first describe the models and our reimplementation of them, before describing the tasks and the datasets used in detail and the results.

4.1 Distributional Semantic Models

We tested four recent distributional models and a lemma overlap baseline, which we now describe. We extended several of the models to compositionally construct phrase representations using component-wise vector addition and multiplication, as we note below. Since the focus of this paper is on evaluation methods for such models, we did not experiment with other compositionality operators. We do note, however, that component-wise operators have been popular in recent literature, and have been applied across unrestricted syntactic contexts (Mitchell and Lapata, 2009), so there is value in evaluating the performance of these operators in itself. The models were trained on the Gigaword corpus (2nd ed., ~2.3B words). All models use cosine similarity to measure the similarity between representations, except for the baseline model.

Lemma Overlap  This baseline simply represents a sentence as the counts of each lemma present in the sentence after removing stop words. Let a sentence x consist of lemma-tokens m1, ..., m|x|. The similarity between two sentences is then defined as

M(x, x′) = #In(x, x′) + #In(x′, x)   (15)
#In(x, x′) = Σ_{i=1..|x|} 1_{x′}(m_i)   (16)

where 1_{x′}(m_i) is an indicator function that returns 1 if m_i ∈ x′, and 0 otherwise. This definition accounts for multiple occurrences of a lemma.

M&L  Mitchell and Lapata (2008) propose a framework for compositional distributional semantics using a standard term-context vector space word representation. A phrase is represented as a vector of context-word counts (actually, pmi-scaled values), which is derived compositionally by a function over constituent vectors, such as component-wise addition or multiplication. This model ignores syntactic relations and is insensitive to word-order.

E&P  Erk and Padó (2008) introduce a structured vector space model which uses syntactic dependencies to model the selectional preferences of words. The vector representation of a word in context depends on the inverse selectional preferences of its dependents, and the selectional preferences of its head. For example, suppose catch occurs with a dependent ball in a direct object relation. The vector for catch would then be influenced by the inverse direct object preferences of ball (e.g. throw, organize), and the vector for ball would be influenced by the selectional preferences of catch (e.g. cold, drift). More formally, given words a and b in a dependency relation r, a distributional representation of a, v_a, the representation of a in context, a′, is given by

a′ = v_a ⊙ R_b(r^-1)   (17)
R_b(r) = Σ_{c : f(c,r,b) > θ} f(c, r, b) · v_c   (18)

where R_b(r) is the vector describing the selectional preference of word b in relation r, f(c, r, b) is the frequency of this dependency triple, θ is a frequency threshold to weed out uncommon dependency triples (10 in our experiments), and ⊙ is a vector combination operator, here component-wise multiplication. We extend the model to compute sentence representations from the contextualized word vectors using component-wise addition and multiplication.

TFP  The model of Thater et al. (2010) is also sensitive to selectional preferences, but to two degrees. For example, the vector for catch might contain a dimension labelled (OBJ, OBJ^-1, throw), which indicates the strength of connection between the two verbs through all of the co-occurring direct objects which they share. Unlike E&P, TFP's model encodes the selectional preferences in a single vector using frequency counts. We extend the model to the sentence level with component-wise addition and multiplication, and word vectors are contextualized by the dependency neighbours. We use a frequency threshold of 10 and a pmi threshold of 2 to prune infrequent words and dependencies.

D&L  Dinu and Lapata (2010) (D&L) assume a global set of latent senses for all words, and model each word as a mixture over these latent senses. The vector for a word t_i in the context of a word c_j is modelled by

v(t_i, c_j) = ⟨P(z_1 | t_i, c_j), ..., P(z_K | t_i, c_j)⟩   (19)

where z_1...K are the latent senses. By making independence assumptions and decomposing probabilities, training becomes a matter of estimating the probability distributions P(z_k | t_i) and P(c_j | z_k) from data. While Dinu and Lapata (2010) describe two methods to do so, based on non-negative matrix factorization and latent Dirichlet allocation, the performances are similar, so we tested only the latent Dirichlet allocation method.
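For concreteness, the baseline of Eqs. 15-16 and the component-wise extension to sentence vectors can be written in a few lines. This sketch assumes that sentences arrive as lists of stop-word-filtered lemmas and that a model supplies word vectors as NumPy arrays; it illustrates the operations described above and is not the experimental code.

```python
import numpy as np

def lemma_overlap(x, x_prime):
    """Eqs. 15-16: count the lemma tokens of each sentence that occur in the other."""
    in_forward = sum(1 for m in x if m in x_prime)
    in_backward = sum(1 for m in x_prime if m in x)
    return in_forward + in_backward

def compose(word_vectors, op="add"):
    """Component-wise addition or multiplication of word vectors into a sentence vector."""
    stacked = np.vstack(word_vectors)
    return stacked.sum(axis=0) if op == "add" else stacked.prod(axis=0)

def cosine(u, v):
    """Similarity between two composed representations (used by the non-baseline models)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```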
            Pfizer/Rinat N.  Yahoo/Inktomi  Besson/Paris  Antoinette/Vienna  Average
Overlap              0.7393         0.6007        0.7395             0.8914   0.7427
Models trained on the entire GigaWord
M&L add              0.6196         0.5387        0.5259             0.7275   0.6029
M&L mult             0.9036         0.6099        0.6443             0.8467   0.7511
D&L add              0.9214         0.8168        0.6989             0.8932   0.8326
D&L mult             0.7732         0.6734        0.6527             0.7659   0.7163
Models trained on the AFP section
E&P add              0.7536         0.4933        0.2780             0.6408   0.5414
E&P mult             0.5268         0.5328        0.5252             0.8421   0.6067
TFP add              0.4357         0.5325        0.8725             0.7183   0.6398
TFP mult             0.5554         0.5524        0.7283             0.6917   0.6320
M&L add              0.5643         0.5504        0.4594             0.7640   0.5845
M&L mult             0.8679         0.6324        0.4356             0.8258   0.6904
D&L add              0.8143         0.9062        0.6373             0.8664   0.8061
D&L mult             0.8429         0.7461        0.645              0.5948   0.7072

Table 1: Task 1 results in AUC scores. The values in bold indicate the best performing model for a particular training corpus. The expected random baseline performance is 0.5.

Entities: {X, Y}                   +     N
Relation: acquires
{Pfizer, Rinat Neuroscience}      41    50
{Yahoo, Inktomi}                 115   433
Relation: was born in
{Luc Besson, Paris}                6   126
{Marie Antoinette, Vienna}        39   105

Table 2: Task 1 dataset characteristics. N is the total number of sentences. + is the number of sentences that express the relation.

Like the two previous models, we extend the model to build sentence representations from the contextualized representations. We set the number of latent senses to 1200, and train for 600 Gibbs sampling iterations.

4.2 Training and Parameter Settings

We reimplemented these four models, following the parameter settings described by previous work where possible, though we also aimed for consistency in parameter settings between models (for example, in the number of context words). For the non-baseline models, we followed previous work and model only the 30000 most frequent lemmata. Context vectors are constructed using a symmetric window of 5 words, and their dimensions represent the 3000 most frequent lemmatized context words excluding stop words. Due to resource limitations, we trained the syntactic models over the AFP subset of Gigaword (~338M words). We also trained the other two models on just the AFP portion for comparison. Note that the AFP portion of Gigaword is three times larger than the BNC corpus (~100M words), on which several previous syntactic models were trained. Because our main goal is to test the general performance of the models and to demonstrate the feasibility of our evaluation methods, we did not further tune the parameter settings to each of the tasks, as doing so would likely only yield minor improvements.

4.3 Task 1

We used the dataset by Bunescu and Mooney (2007), which we selected because it contains multiple realizations of an entity pair in a target semantic relation, unlike similar datasets such as the one by Roth and Yih (2002). Controlling for the target entity pair in this manner makes the task more difficult, because the semantic model cannot make use of distributional information about the entity pair in inference. The dataset is separated into subsets depending on the target binary relation (Company X acquires Company Y or Person X was born in Place Y) and the entity pair (e.g., Yahoo and Inktomi) (Table 2).

The dataset was constructed semi-automatically using a Google search for the two entities in order with up to seven content words in between. Then, the extracted sentences were hand-labelled with whether they express the target relation.
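Returning to the parameter settings of Section 4.2, the underlying term-context space can be sketched as follows: co-occurrence counts are collected within a symmetric five-word window over restricted target and context vocabularies, and then rescaled, for instance with pointwise mutual information as in the M&L space. The function and variable names below are illustrative assumptions, not part of any released implementation.

```python
from collections import Counter
import math

def cooccurrence_counts(sentences, target_vocab, context_vocab, window=5):
    """Count context lemmas within a symmetric window around each target lemma."""
    counts = {w: Counter() for w in target_vocab}
    for sent in sentences:  # each sentence: a list of lemmas with stop words removed
        for i, w in enumerate(sent):
            if w not in counts:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in context_vocab:
                    counts[w][sent[j]] += 1
    return counts

def pmi_rescale(counts):
    """Replace raw co-occurrence counts with pointwise mutual information scores."""
    total = sum(sum(c.values()) for c in counts.values())
    target_totals = {w: sum(c.values()) for w, c in counts.items()}
    context_totals = Counter()
    for c in counts.values():
        context_totals.update(c)
    return {w: {c: math.log(n * total / (target_totals[w] * context_totals[c]))
                for c, n in ctr.items()}
            for w, ctr in counts.items()}
```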
              Pure models            Mixed models
              All       Subset       All       Subset
Overlap       0.8770    0.7291       0.8770    0.7291
Models trained on the entire GigaWord
M&L add       0.7467    0.6106       0.8782    0.7523
M&L mult      0.5331    0.5690       0.8841    0.7678
D&L add       0.6552    0.5716       0.8791    0.7539
D&L mult      0.5488    0.5255       0.8841    0.7466
Models trained on the AFP section
E&P add       0.4589    0.4516       0.8748    0.7375
E&P mult      0.5201    0.5584       0.8882    0.7719
TFP add       0.6887    0.6443       0.8940    0.7871
TFP mult      0.5210    0.5199       0.8785    0.7432
M&L add       0.7588    0.6206       0.8710    0.7371
M&L mult      0.5710    0.5540       0.8801    0.7540
D&L add       0.6358    0.5402       0.8713    0.7305
D&L mult      0.5647    0.5461       0.8856    0.7683

Table 3: Task 2 results, in normalized rank scores. Subset is the cases where lemma overlap does not achieve a perfect score. The two columns on the right indicate performance using the sum of the scores from the lemma overlap and the semantic model. The expected random baseline performance is 0.5.

Because the order of the entities has been fixed, passive alternations do not appear in this dataset.

The results for Task 1 indicate that the D&L addition model performs the best (Table 1), though the lemma overlap model presents a surprisingly strong baseline. The syntax-modulated E&P and TFP models perform poorly on this task, even when compared to the other models trained on the AFP subset. The M&L multiplication model outperforms the addition model, a result which corroborates previous findings on the lexical substitution task. The same does not hold in the D&L latent sense space. Overall, some of the datasets (Yahoo and Antoinette) appear to be easier for the models than others (Pfizer and Besson), but more entity pairs and relations would be needed to investigate the models' variance across datasets.

4.4 Task 2

We used the question-answer pairs extracted by the Poon and Domingos (2009) semantic parser from the GENIA biomedical corpus that have been manually checked to be correct (295 pairs). Because our models were trained on newspaper text, they required adaptation to this specialized domain. Thus, we also trained the M&L, E&P and TFP models on the GENIA corpus, backing off to word vectors from the GENIA corpus when a word vector could not be found in the Gigaword-trained model. We could not do this for the D&L model, since the global latent senses that are found by latent Dirichlet allocation training do not have any absolute meaning that holds across multiple runs. Instead, we found the 5 words in the Gigaword-trained D&L model that were closest to each novel word in the GENIA corpus according to cosine similarity over the co-occurrence vectors of the words in the GENIA corpus, and took their average latent sense distributions as the vector for that word.

Unlike in Task 1, there is no control for the named entities in a sentence, because one of the entities in the semantic relation is missing. Also, distributional models have problems in dealing with named entities which are common in this corpus, such as the names of genes and proteins. To address these issues, we tested hybrid models where the similarity score from a semantic model is added to the similarity score from the lemma overlap model.

The results are presented in Table 3. Lemma overlap again presents a strong baseline, but the hybridized models are able to outperform simple lemma overlap. Unlike in Task 1, the E&P and TFP models are comparable to the D&L model, and the mixed TFP addition model achieves the best result, likely due to the need to more precisely distinguish syntactic roles in this task. The D&L addition model, which achieved the best performance in Task 1, does not perform as well in this task. This could be due to the domain adaptation procedure for the D&L model, which could not be reasonably trained on such a small, specialized corpus.

5 Related Work

Turney and Pantel (2010) survey various types of vector space models and applications thereof in computational linguistics. We summarize below a number of other word- or phrase-level distributional models.

Several approaches are specialized to deal with homography. The top-down multi-prototype approach determines a number of senses for each word, and then clusters the occurrences of the word (Reisinger and Mooney, 2010) into these senses. A prototype vector is created for each of these sense clusters. When a new occurrence
40
of a word is encountered, it is represented as a results demonstrate that compositional distribu-
combination of the prototype vectors, with the de- tional models of semantics already have some
gree of influence from each prototype determined utility in the context of more empirically complex
by the similarity of the new context to the exist- semantic tasks than WSD-like lexical substitution
ing sense contexts. In contrast, the bottom-up ex- tasks, in which compositional invariance is a req-
emplar-based approach assumes that each occur- uisite property. Simply computing lemma over-
rence of a word expresses a different sense of the lap, however, is a very competitive baseline, due
word. The most similar senses of the word are ac- to issues in these protocols with named entities
tivated when a new occurrence of it is encountered and domain adaptivity. The better performance
and combined, for example with a kNN algorithm of the mixture models in Task 2 shows that such
(Erk and Pado, 2010). weaknesses can be addressed by hybrid seman-
The models we compared and the above work tic models. Future work should investigate more
assume each dimension in the feature vector cor- refined versions of such hybridization, as well as
responds to a context word. In contrast, Washtell extend this idea to other semantic phenomena like
(2011) uses potential paraphrases directly as di- coreference, negation and modality.
mensions in his expectation vectors. Unfortu- We also observe that no single model or com-
nately, this approach does not outperform vari- position operator performs best for all tasks and
ous context word-based approaches in two phrase datasets. The latent sense mixture model of Dinu
similarity tasks. and Lapata (2010) performs well in recognizing
In terms of the vector composition function, semantic relations in general web text. Because
component-wise addition and multiplication are of the difficulty of adapting it to a specialized
the most popular in recent work, but there ex- domain, however, it does less well in biomedi-
ist a number of other operators such as tensor cal question answering, where the syntax-based
product and convolution product, which are re- model of Thater et al. (2010) performs the best.
viewed by Widdows (2008). Instead of vector A more thorough investigation of the factors that
space representations, one could also use a matrix can predict the performance and/or invariance of
space representation with its much more expres- a given composition operator is warranted.
sive matrix operators (Rudolph and Giesbrecht, In the future, we would like to evaluate other
2010). So far, however, this has only been ap- models of compositional semantics that have been
plied to specific syntactic contexts (Baroni and recently proposed. We would also like to collect
Zamparelli, 2010; Guevara, 2010; Grefenstette more comprehensive test data, to increase the ex-
and Sadrzadeh, 2011), or tasks (Yessenalina and ternal validity of our evaluations.
Cardie, 2011).
Neural networks have been used to learn both Acknowledgments
phrase structure and representations. In Socher et We would like to thank Georgiana Dinu and Ste-
al. (2010), word representations learned by neu- fan Thater for help with reimplementing their
ral network models such as (Bengio et al., 2006; models. Saif Mohammad, Peter Turney, and
Collobert and Weston, 2008) are fed as input into the anonymous reviewers provided valuable com-
a recursive neural network whose nodes represent ments on drafts of this paper. This project was
syntactic constituents. Each node models both the supported by the Natural Sciences and Engineer-
probability of the input forming a constituent and ing Research Council of Canada.
the phrase representation resulting from composi-
tion.
References
6 Conclusions Marco Baroni and Roberto Zamparelli. 2010. Nouns
We have proposed an evaluation framework for are vectors, adjectives are matrices: Representing
adjective-noun constructions in semantic space. In
distributional models of semantics which build
Proceedings of the 2010 Conference on Empirical
phrase- and sentence-level representations, and Methods in Natural Language Processing, pages
instantiated two evaluation tasks which test for 11831193.
the crucial ability to recognize whether sen- Yoshua Bengio, Holger Schwenk, Jean-Sebastien
tences express the same semantic relation. Our Senecal, Frederic Morin, and Jean-Luc Gauvain.

2006. Neural probabilistic language models. In- Diana McCarthy and Roberto Navigli. 2009. The en-
novations in Machine Learning, pages 137186. glish lexical substitution task. Language Resources
Razvan C. Bunescu and Raymond J. Mooney. 2007. and Evaluation, 43(2):139159.
Learning to extract relations from the web using Jeff Mitchell and Mirella Lapata. 2008. Vector-based
minimal supervision. In Proceedings of the 45th models of semantic composition. In Proceedings of
Annual Meeting of the Association for Computa- ACL-08: HLT, pages 236244.
tional Linguistics, pages 576583. Jeff Mitchell and Mirella Lapata. 2009. Language
Ronan Collobert and Jason Weston. 2008. A unified models based on semantic composition. In Pro-
architecture for natural language processing: Deep ceedings of the 2009 Conference on Empirical
neural networks with multitask learning. In Pro- Methods in Natural Language Processing, pages
ceedings of the 25th International Conference on 430439.
Machine Learning, page 160167. Richard Montague. 1974. English as a formal lan-
Georgiana Dinu and Mirella Lapata. 2010. Measuring guage. Formal Philosophy, pages 188221.
distributional similarity in context. In Proceedings Hoifung Poon and Pedro Domingos. 2009. Unsuper-
of the 2010 Conference on Empirical Methods in vised semantic parsing. In Proceedings of the 2009
Natural Language Processing, pages 11621172. Conference on Empirical Methods in Natural Lan-
William B. Dolan and Chris Brockett. 2005. Auto- guage Processing, pages 110.
matically constructing a corpus of sentential para- Joseph Reisinger and Raymond J. Mooney. 2010.
phrases. In Proceedings of the Third International Multi-prototype vector-space models of word
Workshop on Paraphrasing, pages 916. meaning. In Human Language Technologies: The
Katrin Erk and Sebastian Pado. 2008. A structured 2010 Annual Conference of the North American
vector space model for word meaning in context. In Chapter of the Association for Computational Lin-
Proceedings of the Conference on Empirical Meth- guistics.
ods in Natural Language Processing, pages 897 Dan Roth and Wen-tau Yih. 2002. Probabilistic rea-
906. soning for entity & relation recognition. In Pro-
ceedings of the 19th International Conference on
Katrin Erk and Sebastian Pado. 2010. Exemplar-
Computational Linguistics, pages 835841.
based models for word meaning in context. In Pro-
Sebastian Rudolph and Eugenie Giesbrecht. 2010.
ceedings of the ACL 2010 Conference Short Papers,
Compositional matrix-space models of language.
pages 9297.
In Proceedings of the 48th Annual Meeting of the
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias,
Association for Computational Linguistics, pages
Ehud Rivlin, Zach Solan, Gadi Wolfman, and Ey-
907916.
tan Ruppin. 2002. Placing search in context: The
Richard Socher, Christopher D. Manning, and An-
concept revisited. ACM Transactions on Informa-
drew Y. Ng. 2010. Learning continuous phrase
tion Systems, 20(1):116131.
representations and syntactic parsing with recursive
Jerry A. Fodor and Zenon W. Pylyshyn. 1988. Con-
neural networks. Proceedings of the Deep Learn-
nectionism and cognitive architecture: A critical
ing and Unsupervised Feature Learning Workshop
analysis. Cognition, 28:371.
of NIPS 2010, pages 19.
Edward Grefenstette and Mehrnoosh Sadrzadeh. Alfred Tarski. 1956. The concept of truth in formal-
2011. Experimental support for a categorical com- ized languages. Logic, Semantics, Metamathemat-
positional distributional model of meaning. In ics, pages 152278.
Proceedings of the 2011 Conference on Empirical Stefan Thater, Hagen Furstenau, and Manfred Pinkal.
Methods in Natural Language Processing, pages 2010. Contextualizing semantic representations us-
13941404. ing syntactically enriched vector models. In Pro-
Emiliano Guevara. 2010. A regression model ceedings of the 48th Annual Meeting of the Associa-
of adjective-noun compositionality in distributional tion for Computational Linguistics, pages 948957.
semantics. In Proceedings of the 2010 Workshop on Peter D. Turney and Patrick Pantel. 2010. From
GEometrical Models of Natural Language Seman- frequency to meaning: Vector space models of se-
tics, pages 3337. mantics. Journal of Artificial Intelligence Research,
Zeller S. Harris. 1954. Distributional structure. Word, 37:141188.
10(23):146162. Justin Washtell. 2011. Compositional expectation:
Wilfred Hodges. 2005. The interplay of fact and the- A purely distributional model of compositional se-
ory in separating syntax from meaning. In Work- mantics. In Proceedings of the Ninth International
shop on Empirical Challenges and Analytical Al- Conference on Computational Semantics (IWCS
ternatives to Strict Compositionality. 2011), pages 285294.
Walter Kintsch. 2001. Predication. Cognitive Sci- Dominic Widdows. 2008. Semantic vector products:
ence, 25(2):173202. Some initial investigations. In Second AAAI Sym-
posium on Quantum Interaction.

Stephen Wu and William Schuler. 2011. Structured
composition of semantic vectors. In Proceedings
of the Ninth International Conference on Computa-
tional Semantics (IWCS 2011), pages 295304.
Ainur Yessenalina and Claire Cardie. 2011. Com-
positional matrix-space models for sentiment analy-
sis. In Proceedings of the 2011 Conference on Em-
pirical Methods in Natural Language Processing,
pages 172182.

Cross-Framework Evaluation for Statistical Parsing

Reut Tsarfaty Joakim Nivre Evelina Andersson


Uppsala University, Box 635, 75126 Uppsala, Sweden
tsarfaty@stp.lingfil.uu.se,{joakim.nivre,evelina.andersson}@lingfil.uu.se

Abstract a phrase-structure tree using hard-coded conver-


sion procedures (de Marneffe et al., 2006). This
A serious bottleneck of comparative parser diversity poses a challenge to cross-experimental
evaluation is the fact that different parsers parser evaluation, namely: How can we evaluate
subscribe to different formal frameworks the performance of these different parsers relative
and theoretical assumptions. Converting to one another?
outputs from one framework to another is
less than optimal as it easily introduces Current evaluation practices assume a set of
noise into the process. Here we present a correctly annotated test data (or gold standard)
principled protocol for evaluating parsing for evaluation. Typically, every parser is eval-
results across frameworks based on func- uated with respect to its own formal representa-
tion trees, tree generalization and edit dis- tion type and the underlying theory which it was
tance metrics. This extends a previously
trained to recover. Therefore, numerical scores
proposed framework for cross-theory eval-
uation and allows us to compare a wider of parses across experiments are incomparable.
class of parsers. We demonstrate the useful- When comparing parses that belong to different
ness and language independence of our pro- formal frameworks, the notion of a single gold
cedure by evaluating constituency and de- standard becomes problematic, and there are two
pendency parsers on English and Swedish. different questions we have to answer. First, what
is an appropriate gold standard for cross-parser
evaluation? And secondly, how can we alle-
1 Introduction viate the differences between formal representa-
The goal of statistical parsers is to recover a for- tion types and theoretical assumptions in order to
mal representation of the grammatical relations make our comparison sound that is, to make sure
that constitute the argument structure of natural that we are not comparing apples and oranges?
language sentences. The argument structure en- A popular way to address this has been to
compasses grammatical relationships between el- pick one of the frameworks and convert all
ements such as subject, predicate, object, etc., parser outputs to its formal type. When com-
which are useful for further (e.g., semantic) pro- paring constituency-based and dependency-based
cessing. The parses yielded by different parsing parsers, for instance, the output of constituency
frameworks typically obey different formal and parsers has often been converted to dependency
theoretical assumptions concerning how to rep- structures prior to evaluation (Cer et al., 2010;
resent the grammatical relationships in the data Nivre et al., 2010). This solution has vari-
(Rambow, 2010). For example, grammatical rela- ous drawbacks. First, it demands a conversion
tions may be encoded on top of dependency arcs script that maps one representation type to another
in a dependency tree (Melcuk, 1988), they may when some theoretical assumptions in one frame-
decorate nodes in a phrase-structure tree (Marcus work may be incompatible with the other one.
et al., 1993; Maamouri et al., 2004; Simaan et In the constituency-to-dependency case, some
al., 2001), or they may be read off of positions in constituency-based structures (e.g., coordination

and ellipsis) do not comply with the single head 2 Preliminaries: Relational Schemes for
assumption of dependency treebanks. Secondly, Cross-Framework Parse Evaluation
these scripts may be labor intensive to create, and
are available mostly for English. So the evalua- Traditionally, different statistical parsers have
tion protocol becomes language-dependent. been evaluated using specially designated evalu-
In Tsarfaty et al. (2011) we proposed a gen- ation measures that are designed to fit their repre-
eral protocol for handling annotation discrepan- sentation types. Dependency trees are evaluated
cies when comparing parses across different de- using attachment scores (Buchholz and Marsi,
pendency theories. The protocol consists of three 2006), phrase-structure trees are evaluated using
phases: converting all structures into function ParsEval (Black et al., 1991), LFG-based parsers
trees, for each sentence, generalizing the different postulate an evaluation procedure based on f-
gold standard function trees to get their common structures (Cahill et al., 2008), and so on. From a
denominator, and employing an evaluation mea- downstream application point of view, there is no
sure based on tree edit distance (TED) which dis- significance as to which formalism was used for
cards edit operations that recover theory-specific generating the representation and which learning
structures. Although the protocol is potentially methods have been utilized. The bottom line is
applicable to a wide class of syntactic represen- simply which parsing framework most accurately
tation types, formal restrictions in the procedures recovers a useful representation that helps to un-
effectively limit its applicability only to represen- ravel the human-perceived interpretation.
tations that are isomorphic to dependency trees. Relational schemes, that is, schemes that en-
The present paper breaks new ground in the code the set of grammatical relations that con-
ability to soundly compare the accuracy of differ- stitute the predicate-argument structures of sen-
ent parsers relative to one another given that they tences, provide an interface to semantic interpre-
employ different formal representation types and tation. They are more intuitively understood than,
obey different theoretical assumptions. Our solu- say, phrase-structure trees, and thus they are also
tion generally confines with the protocol proposed more useful for practical applications. For these
in Tsarfaty et al. (2011) but is re-formalized to reasons, relational schemes have been repeatedly
allow for arbitrary linearly ordered labeled trees, singled out as an appropriate level of representa-
thus encompassing constituency-based as well as tion for the evaluation of statistical parsers (Lin,
dependency-based representations. The frame- 1995; Carroll et al., 1998; Cer et al., 2010).
work in Tsarfaty et al. (2011) assumes structures The annotated data which statistical parsers are
that are isomorphic to dependency trees, bypass- trained on encode these grammatical relationships
ing the problem of arbitrary branching. Here we in different ways. Dependency treebanks provide
lift this restriction, and define a protocol which a ready-made representation of grammatical rela-
is based on generalization and TED measures to tions on top of arcs connecting the words in the
soundly compare the output of different parsers. sentence (Kubler et al., 2009). The Penn Tree-
We demonstrate the utility of this protocol by bank and phrase-structure annotated resources en-
comparing the performance of different parsers code partial information about grammatical rela-
for English and Swedish. For English, our tions as dash-features decorating phrase structure
parser evaluation across representation types al- nodes (Marcus et al., 1993). Treebanks like Tiger
lows us to analyze and precisely quantify previ- for German (Brants et al., 2002) and Talbanken
ously encountered performance tendencies. For for Swedish (Nivre and Megyesi, 2007) explic-
Swedish we show the first ever evaluation be- itly map phrase structures onto grammatical rela-
tween dependency-based and constituency-based tions using dedicated edge labels. The Relational-
parsing models, all trained on the Swedish tree- Realizational structures of Tsarfaty and Simaan
bank data. All in all we show that our ex- (2008) encode relational networks (sets of rela-
tended protocol, which can handle linearly- tions) projected and realized by syntactic cate-
ordered labeled trees with arbitrary branch- gories on top of ordinary phrase-structure nodes.
ing, can soundly compare parsing results across Function trees, as defined in Tsarfaty et al.
frameworks in a representation-independent and (2011), are linearly ordered labeled trees in which
language-independent fashion. every node is labeled with the grammatical func-

[Figure 1: Deterministic conversion into function trees. The algorithm for extracting a function tree from a dependency tree as in (a) is provided in Tsarfaty et al. (2011). For a phrase-structure tree as in (b) we can replace each node label with its function (dash-feature). In a relational-realizational structure like (c) we can remove the projection nodes (sets) and realization nodes (phrase labels), which leaves the function nodes intact.]

Function trees benefit from the same advantages as other relational schemes, namely that they are intuitive to understand, they provide the interface for semantic interpretation, and thus may be useful for downstream applications. Yet they do not suffer from formal restrictions inherent in dependency structures, for instance, the single head assumption.

For many formal representation types there exists a fully deterministic, heuristics-free procedure mapping them to function trees. In Figure 1 we illustrate some such procedures for a simple transitive sentence. Now, while all the structures at the right-hand side of Figure 1 are of the same formal type (function trees), they have different tree structures due to different theoretical assumptions underlying the original formal frameworks.
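As an illustration of such a deterministic conversion, the sketch below derives a span-based function tree from a labelled dependency tree, in the spirit of the procedure attributed above to Tsarfaty et al. (2011). It is not the authors' implementation; the data layout and helper names are hypothetical. The head word of each subtree receives the label hd, and every dependent contributes a node labelled with its grammatical function over the yield of its subtree.

    # Illustrative sketch only: turning a labelled dependency tree into a
    # span-based function tree. All names here are hypothetical.
    from collections import defaultdict

    def to_function_tree(heads, labels):
        """heads[i] is the 0-based head of token i (-1 for the root);
        labels[i] is the grammatical function of token i."""
        children = defaultdict(list)
        for i, h in enumerate(heads):
            children[h].append(i)

        def subtree_span(i):
            lo = hi = i
            for c in children[i]:
                c_lo, c_hi = subtree_span(c)
                lo, hi = min(lo, c_lo), max(hi, c_hi)
            return lo, hi

        def build(i, label):
            lo, hi = subtree_span(i)
            if not children[i]:
                # a dependent without children becomes a single function node
                return {"label": label, "span": (i, i + 1), "children": []}
            kids = [build(c, labels[c]) for c in children[i]]
            # the head daughter always receives the label 'hd'
            kids.append({"label": "hd", "span": (i, i + 1), "children": []})
            kids.sort(key=lambda n: n["span"])
            return {"label": label, "span": (lo, hi + 1), "children": kids}

        return build(heads.index(-1), "root")

    # "John loves Mary": John <- sbj, Mary <- obj, loves is the root
    tree = to_function_tree(heads=[1, -1, 1], labels=["sbj", "root", "obj"])

For the transitive example of Figure 1 this yields a root node over the whole sentence with daughters labelled sbj, hd and obj, as in tree (a) of the figure.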
[Figure 2: Unary chains in function trees.]

Once we have converted framework-specific representations into function trees, the problem of cross-framework evaluation can potentially be reduced to a cross-theory evaluation following Tsarfaty et al. (2011). The main idea is that once all structures have been converted into function trees, one can perform a formal operation called generalization in order to harmonize the differences between theories, and measure accurately the distance of parse hypotheses from the generalized gold. The generalization operation defined in Tsarfaty et al. (2011), however, cannot handle trees that may contain unary chains, and therefore cannot be used for arbitrary function trees.

Consider for instance (t1) and (t2) in Figure 2. According to the definition of subsumption in Tsarfaty et al. (2011), (t1) is subsumed by (t2) and vice versa, so the two trees should be identical, but they are not. The interpretation we wish to give to a function tree such as (t1) is that the word w has both the grammatical function f1 and the grammatical function f2. This can be graphically represented as a set of labels dominating w, as in (t3). We call structures such as (t3) multi-function trees. In the next section we formally define multi-function trees, and then use them to develop our protocol for cross-framework and cross-theory evaluation.

3 The Proposal: Cross-Framework Evaluation with Multi-Function Trees

Our proposal is a three-phase evaluation protocol in the spirit of Tsarfaty et al. (2011). First, we obtain a formal common ground for all frameworks in terms of multi-function trees. Then we obtain a theoretical common ground by means of tree-generalization on gold trees. Finally, we calculate TED-based scores that discard the cost of annotation-specific edits. In this section, we define multi-function trees and update the tree-generalization and TED-based metrics to handle multi-function trees that reflect different theories.

Figure 3: The Evaluation Protocol. Different formal frameworks yield different parse and gold formal types.
All types are transformed into multi-function trees. All gold trees enter generalization to yield a new gold for
each sentence. The different arcs represent the different edit scripts used for calculating the TED-based scores.

3.1 Defining Multi-Function Trees

An ordinary function tree is a linearly ordered tree T = (V, A) with yield w1, ..., wn, where internal nodes are labeled with grammatical function labels drawn from some set L. We use span(v) and label(v) to denote the yield and label, respectively, of an internal node v. A multi-function tree is a linearly ordered tree T = (V, A) with yield w1, ..., wn, where internal nodes are labeled with sets of grammatical function labels drawn from L and where v ≠ v′ implies span(v) ≠ span(v′) for all internal nodes v, v′. We use labels(v) to denote the label set of an internal node v.

We interpret multi-function trees as encoding sets of functional constraints over spans in function trees. Each node v in a multi-function tree represents a constraint of the form: for each l ∈ labels(v), there should be a node v′ in the function tree such that span(v) = span(v′) and label(v′) = l. Whenever we have a conversion for function trees, we can efficiently collapse them into multi-function trees with no unary productions, and with label sets labeling their nodes. Thus, trees (t1) and (t2) in Figure 2 would both be mapped to tree (t3), which encodes the functional constraints encoded in either of them.

For dependency trees, we assume the conversion to function trees defined in Tsarfaty et al. (2011), where head daughters always get the label hd. For PTB-style phrase-structure trees, we replace the phrase-structure labels with functional dash-features. In relational-realizational structures we remove projection and realization nodes. Deterministic conversions exist also for Tiger-style treebanks and frameworks such as LFG, but we do not discuss them here.¹

¹ All the conversions we use are deterministic and are defined in graph-theoretic and language-independent terms. We make them available at http://stp.lingfil.uu.se/tsarfaty/unipar/index.html.
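To make the collapsing step concrete, here is a minimal sketch under the assumption that a multi-function tree is stored simply as a mapping from spans to label sets; the function-tree format follows the earlier sketch, and the names are hypothetical rather than part of the paper's released software.

    # Illustrative sketch: collapsing a function tree (possibly containing
    # unary chains) into a multi-function tree, stored as {span: set(labels)}.

    def collapse(function_tree):
        spans = {}

        def visit(node):
            spans.setdefault(node["span"], set()).add(node["label"])
            for child in node["children"]:
                visit(child)

        visit(function_tree)
        return spans

    # (t1) and (t2) from Figure 2 collapse to the same multi-function tree (t3):
    t1 = {"label": "f1", "span": (0, 1),
          "children": [{"label": "f2", "span": (0, 1), "children": []}]}
    t2 = {"label": "f2", "span": (0, 1),
          "children": [{"label": "f1", "span": (0, 1), "children": []}]}
    assert collapse(t1) == collapse(t2) == {(0, 1): {"f1", "f2"}}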
3.2 Generalizing Multi-Function Trees

Once we obtain multi-function trees for all the different gold standard representations in the system, we feed them to a generalization operation as shown in Figure 3. The goal of this operation is to provide a consensus gold standard that captures the linguistic structure that the different gold theories agree on. The generalization structures are later used as the basis for the TED-based evaluation. Generalization is defined by means of subsumption. A multi-function tree subsumes another one if and only if all the constraints defined by the first tree are also defined by the second tree. So, instead of demanding equality of labels as in Tsarfaty et al. (2011), we demand set inclusion:

T-Subsumption, denoted ⊑t, is a relation between multi-function trees that indicates that a tree τ1 is consistent with and more general than tree τ2. Formally: τ1 ⊑t τ2 iff for every node n ∈ τ1 there exists a node m ∈ τ2 such that span(n) = span(m) and labels(n) ⊆ labels(m).

T-Unification, denoted ⊔t, is an operation that returns the most general tree structure that contains the information from both input trees, and fails if such a tree does not exist. Formally: τ1 ⊔t τ2 = τ3 iff τ1 ⊑t τ3 and τ2 ⊑t τ3, and for all τ4 such that τ1 ⊑t τ4 and τ2 ⊑t τ4 it holds that τ3 ⊑t τ4.

T-Generalization, denoted ⊓t, is an operation that returns the most specific tree that is more general than both trees. Formally, τ1 ⊓t τ2 = τ3 iff τ3 ⊑t τ1 and τ3 ⊑t τ2, and for every τ4 such that τ4 ⊑t τ1 and τ4 ⊑t τ2 it holds that τ4 ⊑t τ3.
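A minimal sketch of these three operations on the span-to-label-set representation used above; it assumes both inputs share the same yield, uses hypothetical names, and simplifies the failure condition of unification to a check that the combined spans still nest into a tree.

    # Sketch of T-subsumption, T-generalization and T-unification over
    # multi-function trees stored as {span: set(labels)}.

    def t_subsumes(t1, t2):
        """t1 is consistent with and more general than t2."""
        return all(span in t2 and labels <= t2[span]
                   for span, labels in t1.items())

    def t_generalize(t1, t2):
        """Shared spans keep the intersection of their label sets."""
        return {span: t1[span] & t2[span] for span in t1.keys() & t2.keys()}

    def _crossing(a, b):
        (i, j), (k, l) = a, b
        return i < k < j < l or k < i < l < j

    def t_unify(t1, t2):
        """Union of label sets over the union of spans; fails (None) if the
        combined spans no longer form a tree."""
        spans = t1.keys() | t2.keys()
        if any(_crossing(a, b) for a in spans for b in spans):
            return None
        return {span: t1.get(span, set()) | t2.get(span, set()) for span in spans}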

The generalization tree contains all nodes that exist in both trees, and for each node it is labeled by the intersection of the label sets dominating the same span in both trees. The unification tree contains nodes that exist in one tree or another, and for each span it is labeled by the union of all label sets for this span in either tree. If we generalize two trees and one tree has no specification for labels over a span, it does not share anything with the label set dominating the same span in the other tree, and the label set dominating this span in the generalized tree is empty. If the trees do not agree on any label for a particular span, the respective node is similarly labeled with an empty set. When we wish to unify theories, an empty set over a span is unified with any other set dominating the same span in the other tree, without altering it.

Digression: Using Unification to Merge Information From Different Treebanks. In Tsarfaty et al. (2011), only the generalization operation was used, providing the common denominator of all the gold structures and serving as a common ground for evaluation. The unification operation is useful for other NLP tasks, for instance, combining information from two different annotation schemes or enriching one annotation scheme with information from a different one. In particular, we can take advantage of the new framework to enrich the node structure reflected in one theory with grammatical functions reflected in an annotation scheme that follows a different theory. To do so, we define the Tree-Labeling-Unification operation on multi-function trees.

TL-Unification, denoted ⊔tl, is an operation that returns a tree that retains the structure of the first tree and adds labels that exist over its spans in the second tree. Formally: τ1 ⊔tl τ2 = τ3 iff for every node n ∈ τ1 there exists a node m ∈ τ3 such that span(m) = span(n) and labels(m) = labels(n) ∪ labels(τ2, span(n)),

where labels(τ2, span(n)) is the set of labels of the node with yield span(n) in τ2 if such a node exists, and ∅ otherwise. We further discuss the TL-Unification and its use for data preparation in Section 4.
Unification and its use for data preparation in 4.
3.3 TED Measures for Multi-Function Trees

The result of the generalization operation provides us with multi-function trees for each of the sentences in the test set, representing sets of constraints on which the different gold theories agree. We would now like to use distance-based metrics in order to measure the gap between the gold and predicted theories. The idea behind distance-based evaluation in Tsarfaty et al. (2011) is that recording the edit operations between the native gold and the generalized gold allows one to discard their cost when computing the cost of a parse hypothesis turned into the generalized gold. This makes sure that different parsers do not get penalized, or favored, due to annotation-specific decisions that are not shared by other frameworks.

The problem is now that TED is undefined with respect to multi-function trees because it cannot handle complex labels. To overcome this, we convert multi-function trees into sorted function trees, which are simply function trees in which any label set is represented as a unary chain of single-labeled nodes, and the nodes are sorted according to the canonical order of their labels.² In case of an empty set, a 0-length chain is created, that is, no node is created over this span. Sorted function trees prevent reordering nodes in a chain in one tree to fit the order in another tree, since that would violate the idea that the set of constraints over a span in a multi-function tree is unordered.

The edit operations we assume are add-node(l, i, j) and delete-node(l, i, j), where l ∈ L is a grammatical function label and i < j define the span of a node in the tree. Insertion into a unary chain must conform to the canonical order of the labels. Every operation is assigned a cost. An edit script is a sequence of edit operations that turns a function tree τ1 into τ2, that is:

    ES(τ1, τ2) = ⟨e1, ..., ek⟩

Since all operations are anchored in spans, the sequence can be determined to have a unique order of traversing the tree (say, DFS). Different edit scripts then only differ in their set of operations on spans. The edit distance problem is finding the minimal-cost script, that is, one needs to solve:

    ES*(τ1, τ2) = argmin_{ES(τ1, τ2)} Σ_{e ∈ ES(τ1, τ2)} cost(e)

In the current setting, when using only add and delete operations on spans, there is only one edit script that corresponds to the minimal edit cost. So, finding the minimal edit script entails finding a single set of operations turning τ1 into τ2.

² The ordering can be alphabetic, thematic, etc.
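Under these assumptions the minimal edit script can be read off directly: identifying each node of a sorted function tree by its (label, span) pair, the script is the symmetric difference between the two constraint sets. The following sketch (hypothetical names, unit costs) is one way to spell this out; it is a simplification of the TED machinery described above, not the software used in the experiments.

    # Sketch: with only add-node(l, i, j) and delete-node(l, i, j), the minimal
    # edit script between two trees reduces to the symmetric difference of
    # their (label, span) constraints.

    def constraints(mft):
        """Flatten a {span: set(labels)} tree into (label, span) constraints."""
        return {(label, span) for span, labels in mft.items() for label in labels}

    def edit_script(t1, t2, cost=lambda op: 1.0):
        c1, c2 = constraints(t1), constraints(t2)
        script = ([("delete-node", l, s) for (l, s) in sorted(c1 - c2)] +
                  [("add-node", l, s) for (l, s) in sorted(c2 - c1)])
        return script, sum(cost(op) for op in script)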

We can now define Δ, for the i-th framework, as the error of parse_i relative to its native gold standard gold_i and to the generalized gold gen. This is the edit cost minus the cost of the script turning parse_i into gen intersected with the script turning gold_i into gen. The underlying intuition is that if an operation that was used to turn parse_i into gen is used to discard theory-specific information from gold_i, its cost should not be counted as error:

    Δ(parse_i, gold_i, gen) = cost(ES*(parse_i, gen)) − cost(ES*(parse_i, gen) ∩ ES*(gold_i, gen))

In order to turn distance measures into parse scores we now normalize the error relative to the size of the trees and subtract it from unity. So the Sentence Score for parsing with framework i is:

    score(parse_i, gold_i, gen) = 1 − Δ(parse_i, gold_i, gen) / (|parse_i| + |gen|)

Finally, the Test-Set Average is defined by macro-averaging over all sentences in the test set:

    1 − ( Σ_{j=1..|testset|} Δ(parse_ij, gold_ij, gen_j) ) / ( Σ_{j=1..|testset|} (|parse_ij| + |gen_j|) )

This last formula represents the TedEval metric that we use in our experiments.
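A compact sketch of this three-way computation on the span-to-label-set representation, with unit edit costs and tree size taken as the number of labelled span constraints (both simplifying assumptions; the names are hypothetical):

    # Sketch of the three-way TED-based TedEval score: edits shared with the
    # gold-to-generalized-gold script are not counted as parser error.

    def _constraints(mft):
        return {(label, span) for span, labels in mft.items() for label in labels}

    def _edits(src, tgt):
        """Unit-cost add/delete operations turning src into tgt."""
        a, b = _constraints(src), _constraints(tgt)
        return ({("delete-node",) + c for c in a - b} |
                {("add-node",) + c for c in b - a})

    def tedeval(parses, golds, gens):
        """Test-set average: 1 - (summed discounted errors / summed tree sizes)."""
        num = den = 0.0
        for parse, gold, gen in zip(parses, golds, gens):
            error = _edits(parse, gen) - _edits(gold, gen)   # three-way discount
            num += len(error)
            den += len(_constraints(parse)) + len(_constraints(gen))
        return 1.0 - num / den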
function trees with the SD multi-function
A Note on System Complexity Conversion of trees (PTBttl SD) as illustrated in Figure 4.
a dependency or a constituency tree into a func- The theory encoded in the multi-function trees
tion tree is linear in the size of the tree. Our corresponding to SD is different from the one
implementation of the generalization and unifica- obtained by our TL-Unification, as may be seen
tion operation is an exact, greedy, chart-based al- from the difference between the flat SD multi-
gorithm that runs in polynomial time (O(n2 ) in function tree and the result of the PTBttl SD in
n the number of terminals). The TED software Figure 4. Another difference concerns coordina-
that we utilize builds on the TED efficient algo- tion structures, encoded as binary branching trees
rithm of Zhang and Shasha (1989) which runs in in SD and as flat productions in the PTBttl SD.
O(|T1 ||T2 | min(d1 , n1 ) min(d2 , n2 )) time where Such differences are not only observable but also
di is the tree degree (depth) and ni is the number quantifiable, and using our redefined TED metric
of terminals in the respective tree (Bille, 2005). the cross-theory overlap is 0.8571.
The two dependency parsers were trained using
4 Experiments
the same settings as in Tsarfaty et al. (2011), using
We validate our cross-framework evaluation pro- SVMTool (Gimenez and Marquez, 2004) to pre-
cedure on two languages, English and Swedish. dict part-of-speech tags at parsing time. The two
For English, we compare the performance of constituency parsers were used with default set-
two dependency parsers, MaltParser (Nivre et al., tings and were allowed to predict their own part-
2006) and MSTParser (McDonald et al., 2005), of-speech tags. We report three different evalua-
and two constituency-based parsers, the Berkeley tion metrics for the different experiments:

[Figure 4: Conversion of PTB and SD tree to multi-function trees, followed by TL-Unification of the trees. Note that some PTB nodes remain without an SD label.]

- LAS/UAS (Buchholz and Marsi, 2006)
- ParseEval (Black et al., 1991)
- TedEval, as defined in Section 3

We use LAS/UAS for dependency parsers that were trained on the same dependency theory. We use ParseEval to evaluate phrase-structure parsers that were trained on PTB trees in which dash-features and empty traces are removed. We use our implementation of TedEval to evaluate parsing results across all frameworks under two different scenarios:³ TedEval SINGLE evaluates against the native gold multi-function trees; TedEval MULTIPLE evaluates against the generalized (cross-theory) multi-function trees. Unlabeled TedEval scores are obtained by simply removing all labels from the multi-function nodes and using unlabeled edit operations. We calculate pairwise statistical significance using a shuffling test with 10K iterations (Cohen, 1995).

³ Our TedEval software can be downloaded at http://stp.lingfil.uu.se/tsarfaty/unipar/download.html.
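The significance test referred to here is an approximate randomization (shuffling) test over per-sentence scores; the following sketch shows the standard procedure with 10,000 shuffles (hypothetical function name, not the authors' script).

    # Sketch of a paired shuffling (approximate randomization) test over
    # per-sentence scores of two parsers.
    import random

    def shuffling_test(scores_a, scores_b, iterations=10000, seed=0):
        """p-value for the observed difference of mean scores."""
        rng = random.Random(seed)
        observed = abs(sum(scores_a) - sum(scores_b)) / len(scores_a)
        hits = 0
        for _ in range(iterations):
            sa = sb = 0.0
            for a, b in zip(scores_a, scores_b):
                if rng.random() < 0.5:       # randomly swap the pair's assignment
                    a, b = b, a
                sa += a
                sb += b
            if abs(sa - sb) / len(scores_a) >= observed:
                hits += 1
        return (hits + 1) / (iterations + 1)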
Tables 1 and 2 present the results of our cross-framework evaluation for English parsing. In the left column of Table 1 we report ParseEval scores for constituency-based parsers. As expected, F-scores for the Brown parser are higher than the F-scores of the Berkeley parser. F-scores are however not applicable across frameworks. In the rightmost column of Table 1 we report the LAS/UAS results for all parsers. If a parser yields a constituency tree, it is converted to and evaluated on SD. Here we see that MST outperforms Malt, though the differences for labeled dependencies are insignificant. We also observe here a familiar pattern from Cer et al. (2010) and others, where the constituency parsers significantly outperform the dependency parsers after conversion of their output into dependencies.

The conversion to SD allows one to compare results across formal frameworks, but not without a cost. The conversion introduces a set of annotation-specific decisions which may introduce a bias into the evaluation. In the middle column of Table 1 we report the TedEval metrics measured against the generalized gold standard for all parsing frameworks. We can now confirm that the constituency-based parsers significantly outperform the dependency parsers, and that this is not due to specific theoretical decisions of the kind that are seen to affect LAS/UAS metrics (Schwartz et al., 2011). For the dependency parsers we now see that Malt slightly outperforms MST on labeled dependencies, but the difference is insignificant.

The fact that the discrepancy in theoretical assumptions between different frameworks indeed affects the conversion-based evaluation procedure is reflected in the results we report in Table 2. Here the leftmost and rightmost columns report TedEval scores against the parser's own native gold (SINGLE) and the middle column against the generalized gold (MULTIPLE). Had the theories for SD and PTB ⊔tl SD been identical, TedEval SINGLE and TedEval MULTIPLE would have been equal in each line. Because of theoretical discrepancies, we see small gaps in parser performance between these cases. Our protocol ensures that such discrepancies do not bias the results.

4.2 Cross-Framework Swedish Parsing

We use the standard training and test sets of the Swedish Treebank (Nivre and Megyesi, 2007) with two gold standards presupposing different theories:

THE DEPENDENCY-BASED THEORY is the dependency version of the Swedish Treebank. All trees are projectivized (STB-Dep).

THE CONSTITUENCY-BASED THEORY is the standard Swedish Treebank with grammatical function labels on the edges of constituency structures (STB).

Formalism    PS Trees          MF Trees                Dep Trees
Theory       PTB ⊔tl SD        (PTB ⊔tl SD) ⊓t SD      SD
Metrics      ParseEval         TedEval                 AttScores
Malt         N/A               U: 0.9525  L: 0.9088    U: 0.8962  L: 0.8772
MST          N/A               U: 0.9549  L: 0.9049    U: 0.9059  L: 0.8795
Berkeley     F-score 0.9096    U: 0.9677  L: 0.9227    U: 0.9254  L: 0.9031
Brown        F-score 0.9129    U: 0.9702  L: 0.9264    U: 0.9289  L: 0.9057

Table 1: English cross-framework evaluation: three measures as applicable to the different schemes. Boldface scores are highest in their column. Italic scores are the highest for dependency parsers in their column.

Formalism    PS Trees                MF Trees                Dep Trees
Theory       PTB ⊔tl SD              (PTB ⊔tl SD) ⊓t SD      SD
Metrics      TedEval SINGLE          TedEval MULTIPLE        TedEval SINGLE
Malt         N/A                     U: 0.9525  L: 0.9088    U: 0.9524  L: 0.9186
MST          N/A                     U: 0.9549  L: 0.9049    U: 0.9548  L: 0.9149
Berkeley     U: 0.9645  L: 0.9271    U: 0.9677  L: 0.9227    U: 0.9649  L: 0.9324
Brown        U: 0.9667  L: 0.9301    U: 0.9702  L: 0.9264    U: 0.9679  L: 0.9362

Table 2: English cross-framework evaluation: TedEval scores against gold and generalized gold. Boldface scores are highest in their column. Italic scores are highest for dependency parsers in their column.

Because there are no parsers that can output the complete STB representation including edge labels, we experiment with two variants of this theory, one which is obtained by simply removing the edge labels and keeping only the phrase-structure labels (STB-PS), and one which is loosely based on the Relational-Realizational scheme of Tsarfaty and Simaan (2008) but excludes the projection set nodes (STB-RR). RR trees only add function nodes to PS trees, and it holds that STB-PS ⊓t STB-RR = STB-PS. The overlap between the theories expressed in multi-function trees originating from STB-Dep and STB-RR is 0.7559. Our evaluation protocol takes such discrepancies into account while avoiding biases that may be caused by these differences.

We evaluate MaltParser, MSTParser and two versions of the Berkeley parser, one trained on STB-PS and one trained on STB-RR. We use predicted part-of-speech tags for the dependency parsers, using the HunPoS tagger (Megyesi, 2009), but let the Berkeley parser predict its own tags. We use the same evaluation metrics and procedures as before. Prior to evaluating RR trees using ParseEval we strip off the added function nodes. Prior to evaluating them using TedEval we strip off the phrase-structure nodes.

Formalism      PS Trees          MF Trees                Dep Trees
Theory         STB               STB ⊓t Dep              Dep
Metrics        ParseEval         TedEval                 AttScore
Malt           N/A               U: 0.9266  L: 0.8225    U: 0.8298  L: 0.7782
MST            N/A               U: 0.9275  L: 0.8121    U: 0.8438  L: 0.7824
Bkly/STB-RR    F-score 0.7914    U: 0.9281  L: 0.7861    N/A
Bkly/STB-PS    F-score 0.7855    N/A                     N/A

Table 3: Swedish cross-framework evaluation: three measures as applicable to the different schemes. Boldface scores are the highest in their column.

Formalism      PS Trees                MF Trees                Dep Trees
Theory         STB                     STB ⊓t Dep              Dep
Metrics        TedEval SINGLE          TedEval MULTIPLE        TedEval SINGLE
Malt           N/A                     U: 0.9266  L: 0.8225    U: 0.9264  L: 0.8372
MST            N/A                     U: 0.9275  L: 0.8121    U: 0.9272  L: 0.8275
Bkly/STB-RR    U: 0.9239  L: 0.7946    U: 0.9281  L: 0.7861    N/A

Table 4: Swedish cross-framework evaluation: TedEval scores against the native gold and the generalized gold. Boldface scores are the highest in their column.

Tables 3 and 4 summarize the parsing results for the different Swedish parsers. In the leftmost column of Table 3 we present the constituency-based evaluation measures. Interestingly, the Berkeley RR instantiation performs better than training the Berkeley parser on PS trees. These constituency-based scores however have a limited applicability, and we cannot use them to compare constituency and dependency parsers. In the rightmost column of Table 3 we report the LAS/UAS results for the two dependency parsers. Here we see higher performance demonstrated by MST on both labeled and unlabeled dependencies, but the differences on labeled dependencies are insignificant. Since there is no automatic procedure for converting bare-bone phrase-structure Swedish trees to dependency trees, we cannot use LAS/UAS to compare across frameworks, and we use TedEval for cross-framework evaluation.

Training the Berkeley parser on RR trees, which encode a mapping of PS nodes to grammatical functions, allows us to compare parse results for trees belonging to the STB theory with trees obeying the STB-Dep theory. For unlabeled TedEval scores, the dependency parsers perform at the same level as the constituency parser, though the difference is insignificant. For labeled TedEval the dependency parsers significantly outperform the constituency parser. When considering only the dependency parsers, there is a small advantage for Malt on labeled dependencies, and an advantage for MST on unlabeled dependencies, but the latter is insignificant. This effect is replicated in Table 4, where we evaluate dependency parsers using TedEval against their own gold theories. Table 4 further confirms that there is a gap between the STB and the STB-Dep theories, reflected in the scores against the native and generalized gold.

5 Discussion

We presented a formal protocol for evaluating parsers across frameworks and used it to soundly compare parsing results for English and Swedish. Our approach follows the three-phase protocol of Tsarfaty et al. (2011), namely: (i) obtaining a formal common ground for the different representation types, (ii) computing the theoretical common ground for each test sentence, and (iii) counting only what counts, that is, measuring the distance between the common ground and the parse tree while discarding annotation-specific edits.

A pre-condition for applying our protocol is the availability of a relational interpretation of trees in the different frameworks. For dependency frameworks this is straightforward, as these relations are encoded on top of dependency arcs. For constituency trees with an inherent mapping of nodes onto grammatical relations (Merlo and Musillo, 2005; Gabbard et al., 2006; Tsarfaty and Simaan, 2008), a procedure for reading relational schemes off of the trees is trivial to implement.

For parsers that are trained on and parse into bare-bones phrase-structure trees this is not so. Reading off the relational structure may be more costly and require the injection of additional theoretical assumptions via manually written scripts. Scripts that read off grammatical relations based on tree positions work well for configurational languages such as English (de Marneffe et al., 2006), but since grammatical relations are reflected differently in different languages (Postal and Perlmutter, 1977; Bresnan, 2000), a procedure to read off these relations in a language-independent fashion from phrase-structure trees does not, and should not, exist (Rambow, 2010).

The crucial point is that even when using external scripts for recovering a relational scheme for phrase-structure trees, our protocol has a clear advantage over simply scoring converted trees. Manually created conversion scripts alter the theoretical assumptions inherent in the trees and thus may bias the results. Our generalization operation and three-way TED make sure that theory-specific idiosyncrasies injected through such scripts do not lead to over-penalizing or over-crediting theory-specific structural variations.

Certain linguistic structures cannot yet be evaluated with our protocol because of the strict assumption that the labeled spans in a parse form a tree. In the future we plan to extend the protocol to evaluate structures that go beyond linearly-ordered trees, in order to allow for non-projective trees and directed acyclic graphs. In addition, we plan to lift the restriction that the parse yield is known in advance, in order to allow for the evaluation of joint parse-segmentation hypotheses.

6 Conclusion

We developed a protocol for comparing parsing results across different theories and representation types which is framework-independent in the sense that it can accommodate any formal syntactic framework that encodes grammatical relations, and language-independent in the sense that there is no language-specific knowledge encoded in the procedure. As such, this protocol is adequate for parser evaluation in cross-framework and cross-language tasks and parsing competitions, and using it across the board is expected to open new horizons in our understanding of the strengths and weaknesses of different parsers in the face of different theories and different data.

Acknowledgments. We thank David McClosky, Marco Khulmann, Yoav Goldberg and three anonymous reviewers for useful comments. We further thank Jennifer Foster for the Brown parses and parameter files. This research is partly funded by the Swedish National Science Foundation.

References

Philip Bille. 2005. A survey on tree edit distance and related problems. Theoretical Computer Science, 337:217–239.

Ezra Black, Steven P. Abney, D. Flickenger, Claudia Gdaniec, Ralph Grishman, P. Harrison, Donald Hindle, Robert Ingria, Frederick Jelinek, Judith L. Klavans, Mark Liberman, Mitchell P. Marcus, Salim Roukos, Beatrice Santorini, and Tomek Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the DARPA Workshop on Speech and Natural Language, pages 306–311.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The Tiger treebank. In Proceedings of TLT.

Joan Bresnan. 2000. Lexical-Functional Syntax. Blackwell.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL-X, pages 149–164.

Aoife Cahill, Michael Burke, Ruth O'Donovan, Stefan Riezler, Josef van Genabith, and Andy Way. 2008. Wide-coverage deep statistical parsing using automatic dependency structure annotation. Computational Linguistics, 34(1):81–124.

John Carroll, Edward Briscoe, and Antonio Sanfilippo. 1998. Parser evaluation: A survey and a new proposal. In Proceedings of LREC, pages 447–454.

Daniel Cer, Marie-Catherine de Marneffe, Daniel Jurafsky, and Christopher D. Manning. 2010. Parsing to Stanford Dependencies: Trade-offs between speed and accuracy. In Proceedings of LREC.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of ACL.

Paul Cohen. 1995. Empirical Methods for Artificial Intelligence. The MIT Press.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, pages 449–454.

Ryan Gabbard, Mitchell Marcus, and Seth Kulick. 2006. Fully parsing the Penn treebank. In Proceedings of HLT-NAACL, pages 184–191.

Jesus Gimenez and Lluis Marquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of LREC.

Sandra Kubler, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing. Number 2 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Dekang Lin. 1995. A dependency-based method for evaluating broad-coverage parsers. In Proceedings of IJCAI-95, pages 1420–1425.

Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki. 2004. The Penn Arabic treebank: Building a large-scale annotated Arabic corpus. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005. Non-projective dependency parsing using spanning tree algorithms. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530, Morristown, NJ, USA. Association for Computational Linguistics.

Beata Megyesi. 2009. The open source tagger HunPoS for Swedish. In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA), pages 239–241.

Igor Melcuk. 1988. Dependency Syntax: Theory and Practice. State University of New York Press.

Paola Merlo and Gabriele Musillo. 2005. Accurate function parsing. In Proceedings of EMNLP, pages 620–627.

Joakim Nivre and Beata Megyesi. 2007. Bootstrapping a Swedish Treebank using cross-corpus harmonization and annotation projection. In Proceedings of TLT.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC, pages 2216–2219.

Joakim Nivre, Laura Rimell, Ryan McDonald, and Carlos Gomez-Rodriguez. 2010. Evaluation of dependency parsers on unbounded dependencies. In Proceedings of COLING, pages 813–821.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of ACL.

Paul M. Postal and David M. Perlmutter. 1977. Toward a universal characterization of passivization. In Proceedings of the 3rd Annual Meeting of the Berkeley Linguistics Society, pages 394–417.

Owen Rambow. 2010. The simple truth about dependency and phrase structure representations: An opinion piece. In Proceedings of HLT-ACL, pages 337–340.

Roy Schwartz, Omri Abend, Roi Reichart, and Ari Rappoport. 2011. Neutralizing linguistically problematic annotations in unsupervised dependency parsing evaluation. In Proceedings of ACL, pages 663–672.

Khalil Simaan, Alon Itai, Yoad Winter, Alon Altman, and Noa Nativ. 2001. Building a Tree-Bank for Modern Hebrew Text. In Traitement Automatique des Langues.

Reut Tsarfaty and Khalil Simaan. 2008. Relational-Realizational parsing. In Proceedings of CoLing.

Reut Tsarfaty, Joakim Nivre, and Evelina Andersson. 2011. Evaluating dependency parsing: Robust and heuristics-free cross-framework evaluation. In Proceedings of EMNLP.

Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. In SIAM Journal of Computing, volume 18, pages 1245–1262.

Dependency Parsing of Hungarian: Baseline Results and Challenges

Richard Farkas1 , Veronika Vincze2 , Helmut Schmid1


1
Institute for Natural Language Processing, University of Stuttgart
{farkas,schmid}@ims.uni-stuttgart.de
2
Research Group on Artificial Intelligence, Hungarian Academy of Sciences
vinczev@inf.u-szeged.hu

Abstract

Hungarian is a stereotype of morphologically rich and non-configurational languages. Here, we introduce results on dependency parsing of Hungarian that employ an 80K, multi-domain, fully manually annotated corpus, the Szeged Dependency Treebank. We show that the results achieved by state-of-the-art data-driven parsers on Hungarian and English (which is at the other end of the configurational-non-configurational spectrum) are quite similar to each other in terms of attachment scores. We reveal the reasons for this and present a systematic and comparative linguistically motivated error analysis on both languages. This analysis highlights that addressing the language-specific phenomena is required for a further remarkable error reduction.

1 Introduction

From the viewpoint of syntactic parsing, the languages of the world are usually categorized according to their level of configurationality. At one end, there is English, a strongly configurational language, while Hungarian is at the other end of the spectrum. It has very few fixed structures at the sentence level. Leaving aside the issue of the internal structure of NPs, most sentence-level syntactic information in Hungarian is conveyed by morphology, not by configuration (E. Kiss, 2002).

A large part of the methodology for syntactic parsing has been developed for English. However, parsing non-configurational and less configurational languages requires different techniques. In this study, we present results on Hungarian dependency parsing and we investigate this general issue in the case of English and Hungarian.

We employed three state-of-the-art data-driven parsers (Nivre et al., 2004; McDonald et al., 2005; Bohnet, 2010), which achieved (un)labeled attachment scores on Hungarian not so different from the corresponding English scores (and even higher on certain domains/subcorpora). Our investigations show that the feature representation used by the data-driven parsers is so rich that they can, without any modification, effectively learn a reasonable model for non-configurational languages as well.

We also conducted a systematic and comparative error analysis of the systems' outputs for Hungarian and English. This analysis highlights the challenges of parsing Hungarian and suggests that the further improvement of parsers requires special handling of language-specific phenomena. We believe that some of our findings can be relevant for intermediate languages on the configurational-non-configurational spectrum.

2 Chief Characteristics of the Hungarian Morphosyntax

Hungarian is an agglutinative language, which means that a word can have hundreds of word forms due to inflectional or derivational affixation. A lot of grammatical information is encoded in morphology, and Hungarian is a stereotype of morphologically rich languages. The Hungarian word order is free in the sense that the positions of the subject, the object and the verb are not fixed within the sentence, but word order is related to information structure, e.g. new (or emphatic) information (the focus) always precedes the verb

and old information (the topic) precedes the focus with that of a nominative noun while in the second
position. Thus, the position relative to the verb case, it coincides with a dative noun.
has no predictive force as regards the syntactic According to these facts, a Hungarian parser
function of the given argument: while in English, must rely much more on morphological analysis
the noun phrase before the verb is most typically than e.g. an English one since in Hungarian it
the subject, in Hungarian, it is the focus of the is morphemes that mostly encode morphosyntac-
sentence, which itself can be the subject, object tic information. One of the consequences of this
or any other argument (E. Kiss, 2002). is that Hungarian sentences are shorter in terms
The grammatical function of words is deter- of word numbers than English ones. Based on
mined by case suffixes as in gyerek child gye- the word counts of the HungarianEnglish paral-
reknek (child-DAT) for (a/the) child. Hungarian lel corpus Hunglish (Varga et al., 2005), an En-
nouns can have about 20 cases1 which mark the glish sentence contains 20.5% more words than its
relationship between the head and its arguments Hungarian equivalent. These extra words in En-
and adjuncts. Although there are postpositions glish are most frequently prepositions, pronomi-
in Hungarian, case suffixes can also express re- nal subjects or objects, whose parent and depen-
lations that are expressed by prepositions in En- dency label are relatively easy to identify (com-
glish. pared to other word classes). This train of thought
Verbs are inflected for person and number and indicates that the cross-lingual comparison of fi-
the definiteness of the object. Since conjugational nal parser scores should be conducted very care-
information is sufficient to deduce the pronominal fully.
subject or object, they are typically omitted from
the sentence: Varlak (wait-1 SG 2 OBJ) I am wait- 3 Related work
ing for you. This pro-drop feature of Hungar- We decided to focus on dependency parsing in
ian leads to the fact that there are several clauses this study as it is a superior framework for non-
without an overt subject or object. configurational languages. It has gained inter-
Another peculiarity of Hungarian is that the est in natural language processing recently be-
third person singular present tense indicative form cause the representation itself does not require
of the copula is phonologically empty, i.e. there the words inside of constituents to be consecu-
are apparently verbless sentences in Hungarian: tive and it naturally represent discontinuous con-
A haz nagy (the house big) The house is big. structions, which are frequent in languages where
However, in other tenses or moods, the copula grammatical relations are often signaled by mor-
is present as in A haz nagy lesz (the house big phology instead of word order (McDonald and
will.be) The house will be big. Nivre, 2011). The two main efficient approaches
There are two possessive constructions in for dependency parsing are the graph-based and
Hungarian. First, the possessive relation is only the transition-based parsers. The graph-based
marked on the possessed noun (in contrast, it is models look for the highest scoring directed span-
marked only on the possessor in English): a fiu ning tree in the complete graph whose nodes are
kutyaja (the boy dog-POSS) the boys dog. Sec- the words of the sentence in question. They solve
ond, both the possessor and the possessed bear a the machine learning problem of finding the opti-
possessive marker: a fiunak a kutyaja (the boy- mal scoring function of subgraphs (Eisner, 1996;
DAT the dog- POSS ) the boys dog. In the latter McDonald et al., 2005). The transition-based ap-
case, the possessor and the possessed may not be proaches parse a sentence in a single left-to-right
adjacent within the sentence as in A fiunak latta a pass over the words. The next transition in these
kutyajat (the boy-DAT see-PAST 3 SGOBJ the dog- systems is predicted by a classifier that is based
POSS - ACC ) He saw the boys dog, which results on history-related features (Kudo and Matsumoto,
in a non-projective syntactic tree. Note that in 2002; Nivre et al., 2004).
the first case, the form of the possessor coincides Although the available treebanks for Hungar-
1
Hungarian grammars and morphological coding sys- ian are relatively big (82K sentences) and fully
tems do not agree on the exact number of cases, some rare manually annotated, the studies on parsing Hun-
suffixes are treated as derivational suffixes in one grammar garian are rather limited. The Szeged (Con-
and as case suffixes in others; see e.g. Farkas et al. (2010).
stituency) Treebank (Csendes et al., 2005) con-

sists of six domains namely, short business The annotation employs 16 coarse grained POS
news, newspaper, law, literature, compositions tags, 95 morphological feature values and 29 de-
and informatics and it is manually annotated pendency labels. 19.6% of the sentences in the
for the possible alternatives of words morpho- corpus contain non-projective edges and 1.8% of
logical analyses, the disambiguated analysis and the edges are non-projective2 , which is almost 5
constituency trees. We are aware of only two times more frequent than in English and is the
articles on phrase-structure parsers which were same as the Czech non-projectivity level (Buch-
trained and evaluated on this corpus (Barta et al., holz and Marsi, 2006). Here we discuss two an-
2005; Ivan et al., 2007) and there are a few studies notation principles along with our modifications
on hand-crafted parsers reporting results on small in the dataset for this study which strongly influ-
own corpora (Babarczy et al., 2005; Proszeky et ence the parsers accuracies.
al., 2004).
Named Entities (NEs) were treated as one to-
The Szeged Dependency Treebank (Vincze et
ken in the Szeged Dependency Treebank. Assum-
al., 2010) was constructed by first automatically
ing a perfect phrase recogniser on the whitespace
converting the phrase-structure trees into depen-
tokenised input for them is quite unrealistic. Thus
dency trees, then each of them was manually
we decided to split them into tokens for this study.
investigated and corrected. We note that the
The new tokens automatically got a proper noun
dependency treebank contains more information
with default morphological features morphologi-
than the constituency one as linguistic phenom-
cal analysis except for the last token the head of
ena (like discontinuous structures) were not anno-
the phrase , which inherited the morphological
tated in the former corpus, but were added to the
analysis of the original multiword unit (which can
dependency treebank. To the best of our knowl-
contain various grammatical information). This
edge no parser results have been published on this
resulted in an N N N N POS sequence for Kovacs
corpus. Both corpora are available at www.inf.
es tarsa kft. Smith and Co. Ltd. which would
u-szeged.hu/rgai/SzegedTreebank.
be annotated as N C N N in the Penn Treebank.
The multilingual track of the CoNLL-2007
Moreover, we did not annotate any internal struc-
Shared Task (Nivre et al., 2007) addressed also
ture of Named Entities. We consider the last word
the task of dependency parsing of Hungarian. The
of multiword named entities as the head because
Hungarian corpus used for the shared task con-
of morphological reasons (the last word of multi-
sists of automatically converted dependency trees
word units gets inflected in Hungarian) and all the
from the Szeged Constituency Treebank. Several
previous elements are attached to the succeeding
issues of the automatic conversion tool were re-
word, i.e. the penultimate word is attached to the
considered before the manual annotation of the
last word, the antepenultimate word to the penulti-
Szeged Dependency Treebank was launched and
mate one etc. The reasons for these considerations
the annotation guidelines contained instructions
are that we believe that there are no downstream
related to linguistic phenomena which could not
applications which can exploit the information of
be converted from the constituency representa-
the internal structures of Named Entities and we
tion for a detailed discussion, see Vincze et al.
imagine a pipeline where a Named Entity Recog-
(2010). Hence the annotation schemata of the
niser precedes the parsing step.
CoNLL-2007 Hungarian corpus and the Szeged
Dependency Treebank are rather different and the Empty copula: In the verbless clauses (pred-
final scores reported for the former are not di- icative nouns or adjectives) the Szeged Depen-
rectly comparable with our reported scores here dency Treebank introduces virtual nodes (16,000
(see Section 5). items in the corpus). This solution means that
a similar tree structure is ascribed to the same
4 The Szeged Dependency Treebank sentence in the present third person singular and
We utilize the Szeged Dependency Treebank all the other tenses / persons. A further argu-
(Vincze et al., 2010) as the basis of our experi- ment for the use of a virtual node is that the vir-
ments for Hungarian dependency parsing. It con- tual node is always present at the syntactic level
tains 82,000 sentences, 1.2 million words and 2
Using the transitive closure definition of Nivre and Nils-
250,000 punctuation marks from six domains. son (2005).

corpus           Malt ULA      Malt LAS      MST ULA       MST LAS       Mate ULA      Mate LAS
Hungarian dev    88.3 (89.9)   85.7 (87.9)   86.9 (88.5)   80.9 (82.9)   89.7 (91.1)   86.8 (89.0)
Hungarian test   88.7 (90.2)   86.1 (88.2)   87.5 (89.0)   81.6 (83.5)   90.1 (91.5)   87.2 (89.4)
English dev      87.8 (89.1)   84.5 (86.1)   89.4 (91.2)   86.1 (87.7)   91.6 (92.7)   88.5 (90.0)
English test     88.8 (89.9)   86.2 (87.6)   90.7 (91.8)   87.7 (89.2)   92.6 (93.4)   90.3 (91.5)

Table 1: Results achieved by the three parsers on the (full) Hungarian (Szeged Dependency Treebank) and English (CoNLL-2009) datasets. The scores in brackets are achieved with gold-standard POS tagging.
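For reference, the unlabeled and labeled attachment scores reported in Table 1 can be computed along the following lines; this is only an illustrative sketch with hypothetical names, with punctuation tokens included, as in the paper's evaluation setting.

    # Sketch: unlabeled (ULA) and labeled (LAS) attachment scores from
    # gold and predicted (head_index, dependency_label) pairs per token.

    def attachment_scores(gold, pred):
        total = ua = la = 0
        for g_sent, p_sent in zip(gold, pred):
            for (g_head, g_lab), (p_head, p_lab) in zip(g_sent, p_sent):
                total += 1
                if g_head == p_head:
                    ua += 1
                    if g_lab == p_lab:
                        la += 1
        return 100.0 * ua / total, 100.0 * la / total  # (ULA, LAS)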

since it is overt in all the other forms, tenses and Tools: We employed a finite state automata-
moods of the verb. Still, the state-of-the-art de- based morphological analyser constructed from
pendency parsers cannot handle virtual nodes. For the morphdb.hu lexical resource (Tron et al.,
this study, we followed the solution of the Prague 2006) and we used the MSD-style morphological
Dependency Treebank (Hajic et al., 2000) and vir- code system of the Szeged TreeBank (Alexin et
tual nodes were removed from the gold standard al., 2003). The output of the morphological anal-
annotation and all of their dependents were at- yser is a set of possible lemmamorphological
tached to the head of the original virtual node and analysis pairs. This set of possible morphologi-
they were given a dedicated edge label (Exd). cal analyses for a word form is then used as pos-
sible alternatives instead of open and closed tag
Dataset splits: We formed training, develop-
sets in a standard sequential POS tagger. Here,
ment and test sets from the corpus where each
we applied the Conditional Random Fields-based
set consists of texts from each of the domains.
Stanford POS tagger (Toutanova et al., 2003) and
We paid attention to the issue that a document
carried out 5-fold-cross POS training/tagging in-
should not be separated into different datasets be-
side the subcorpora.4 For the English experiments
cause it could result in a situation where a part of
we used the predicted POS tags provided for the
the test document was seen in the training dataset
CoNLL-2009 shared task (Hajic et al., 2009).
(which is unrealistic because of unknown words,
As the dependency parser we employed three
style and frequently used grammatical structures).
state-of-the-art data-driven parsers, a transition-
As the fiction subcorpus consists of three books
based parser (Malt) and two graph-based parsers
and the law subcorpus consists of two rules, we
(MST and Mate parsers). The Malt parser (Nivre
took half of one of the documents for the test
et al., 2004) is a transition-based system, which
and development sets and used the other part(s)
uses an arc-eager system along with support vec-
for training there. This principle was followed at
tor machines to learn the scoring function for tran-
our cross-fold-validation experiments as well ex-
sitions and which uses greedy, deterministic one-
cept for the law subcorpus. We applied 3 folds for
best search at parsing time. As one of the graph-
cross-validation for the fiction subcorpus, other-
based parsers, we employed the MST parser (Mc-
wise we used 10 folds (splitting at documentary
Donald et al., 2005) with a second-order feature
boundaries would yield a training fold consisting
decoder. It uses an approximate exhaustive search
of just 3000 sentences).3
for unlabeled parsing, then a separate arc label
5 Experiments classifier is applied to label each arc. The Mate
parser (Bohnet, 2010) is an efficient second or-
We carried out experiments using three state-of- der dependency parser that models the interaction
the-art parsers on the Szeged Dependency Tree- between siblings as well as grandchildren (Car-
bank (Vincze et al., 2010) and on the English reras, 2007). Its decoder works on labeled edges,
datasets of the CoNLL-2009 Shared Task (Hajic i.e. it uses a single-step approach for obtaining
et al., 2009). labeled dependency trees. Mate uses a rich and
3
Both the training/development/test and the cross- 4
The JAVA implementation of the morphological anal-
validation splits are available at www.inf.u-szeged. yser and the slightly modified POS tagger along with trained
hu/rgai/SzegedTreebank. models are available at http://www.inf.u-szeged.
hu/rgai/magyarlanc.

corpus           #sent.   length   CPOS   DPOS   ULA           all ULA   LAS           all LAS
newspaper        9189     21.6     97.2   96.5   88.0 (90.0)   +0.8      84.7 (87.5)   +1.0
short business   8616     23.6     98.0   97.7   93.8 (94.8)   +0.3      91.9 (93.4)   +0.4
fiction          9279     12.6     96.9   95.8   87.7 (89.4)   -0.5      83.7 (86.2)   -0.3
law              8347     27.3     98.3   98.1   90.6 (90.7)   +0.2      88.9 (89.0)   +0.2
computer         8653     21.9     96.4   95.8   91.3 (92.8)   -1.2      88.9 (91.2)   -1.6
composition      22248    13.7     96.7   95.6   92.7 (93.9)   +0.3      88.9 (91.0)   +0.3

Table 2: Domain results achieved by the Mate parser in cross-validation settings. The scores in brackets are achieved with gold-standard POS tagging. The "all" columns contain the added value of extending the training sets with each of the five out-domain subcorpora.
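The CPOS and DPOS columns of Table 2 can be computed as sketched below: a token counts for DPOS only if both its main POS tag and all of its morphological feature values are correct. The representation of the MSD-style analyses and the function name are assumptions made for the sake of illustration.

    # Sketch: CPOS (main POS tag) and DPOS (main tag + every morphological
    # feature) accuracies from gold and predicted (main_tag, feature_dict)
    # analyses, one pair per token.

    def tagging_accuracy(gold, pred):
        total = cpos = dpos = 0
        for (g_tag, g_feats), (p_tag, p_feats) in zip(gold, pred):
            total += 1
            if g_tag == p_tag:
                cpos += 1
                if g_feats == p_feats:
                    dpos += 1
        return 100.0 * cpos / total, 100.0 * dpos / total  # (CPOS, DPOS)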

well-engineered feature set and it is enhanced by ment was to gain an insight into the performance
a Hash Kernel, which leads to higher accuracy. of the parsers which can only access configura-
tional information. These parsers achieved worse
Evaluation metrics: We apply the Labeled At-
results than the full parsers by 6.8 ULA, 20.3 LAS
tachment Score (LAS) and Unlabeled Attachment
and 2.9 ULA, 6.4 LAS on the development sets
Score (ULA), taking into account punctuation as
of Hungarian and English, respectively. As ex-
well for evaluating dependency parsers and the
pected, Hungarian suffers much more when the
accuracy on the main POS tags (CPOS) and a
parser has to learn from configurational informa-
fine-grained morphological accuracy (DPOS) for
tion only, especially when grammatical functions
evaluating the POS tagger. In the latter, the analy-
have to be predicted (LAS). Despite this, the re-
sis is regarded as correct if the main POS tag and
sults of Table 1 show that the parsers can practi-
each of the morphological features of the token in
cally eliminate this gap by learning from morpho-
question are correct.
logical features (and lexicalization). This means
Results: Table 1 shows the results got by the that the data-driven parsers employing a very rich
parsers on the whole Hungarian corpora and on feature set can learn a model which effectively
the English datasets. The most important point captures the dependency structures using feature
is that scores are not different from the English weights which are radically different from the
scores (although they are not directly compara- ones used for English.
ble). To understand the reasons for this, we man- Another cause of the relatively high scores is
ually investigated the set of firing features with that the CPOS accuracy scores on Hungarian
the highest weights in the Mate parser. Although and English are almost equal: 97.2 and 97.3, re-
the assessment of individual feature contributions spectively. This also explains the small differ-
to a particular decoder decision is not straightfor- ence between the results got by gold-standard and
ward, we observed that features encoding config- predicted POS tags. Moreover, the parser can
urational information (i.e. the direction or length also exploit the morphological features as input
of an edge, the words or POS tag sequences/sets in Hungarian.
between the governor and the dependent) were The Mate parser outperformed the other two
frequently among the highest weighted features parsers on each of the four datasets. Comparing
in English but were extremely rare in Hungarian. the two graph-based parsers Mate and MST, the
For instance, one of the top weighted features for gap between them was twice as big in LAS than in
a subject dependency in English was the there is ULA in Hungarian, which demonstrates that the
no word between the head and the dependent fea- one-step approach looking for the maximum
ture while this never occurred among the top fea- labeled spanning tree is more suitable for Hun-
tures in Hungarian. garian than the two-step arc labeling approach of
As a control experiment, we trained the Mate MST. This probably holds for other morpholog-
parser only having access to the gold-standard ically rich languages too as the decoder can ex-
POS tag sequences of the sentences, i.e. we ploit information from the labels of decoded arcs.
switched off the lexicalization and detailed mor- Based on these results, we decided to use only
phological information. The goal of this experi- Mate for our further experiments.

Table 2 provides an insight into the effect of domain differences on POS tagging and parsing scores. There is a noticeable difference between the newspaper and the short business news corpora. Although these domains seem to be close to each other at first glance (both are news), they have different characteristics. On the one hand, short business news is a very narrow domain consisting of short financial reports of 2-3 sentences. It frequently uses the same grammatical structures (like "Stock indexes rose X percent at the Y Stock on Wednesday") and the lexicon is also limited. On the other hand, the newspaper subcorpus consists of full journal articles covering various domains and it has an elaborate journalistic style.

The effect of extending the training dataset with out-of-domain parses is not convincing. In spite of the ten times bigger training datasets, there are two subcorpora where the extra data actually hurt the parser, and the improvement on the other subcorpora is less than 1 percent. This clearly demonstrates the domain dependence of parsing.

The parser and the POS tagger react to domain difficulties in a similar way, according to the first four rows of Table 2. This observation also holds for the scores of the parsers working with gold-standard POS tags, which suggests that domain difficulties harm POS tagging and parsing alike. Regarding the two last subcorpora, the compositions consist of very short and usually simple sentences and the training corpora are twice as big compared with the other subcorpora. Both factors probably contribute to the good parsing performance. In the computer corpus, there are many English terms which are manually tagged with an "unknown" tag. They could not be accurately predicted by the POS tagger, but the parser could still predict their syntactic role.

Table 2 also tells us that the difference between CPOS and DPOS is usually less than 1 percent. This experimentally supports that the ambiguity among alternative morphological analyses is mostly present at the POS level and that the morphological features are efficiently identified by our morphological analyser. The most frequent morphological features which cannot be disambiguated at the word level are related to suffixes with multiple functions, or the word itself cannot be unambiguously segmented into morphemes. Although the number of such ambiguous cases is low, they form important features for the parser, thus we will focus on the more accurate handling of these cases in future work.

Comparison to CoNLL-2007 results: The best performing participant of the CoNLL-2007 Shared Task (Nivre et al., 2007) achieved an ULA of 83.6 and a LAS of 80.3 (Hall et al., 2007) on the Hungarian corpus. The differences between the top performing English and Hungarian systems were 8.14 ULA and 9.3 LAS. The results reported in 2007 were thus significantly lower, and the gap between English and Hungarian was larger than in our current results. To locate the sources of this difference we carried out further experiments with Mate on the CoNLL-2007 dataset using the gold-standard POS tags (the shared task used gold-standard POS tags for evaluation).

First we trained and evaluated Mate on the original CoNLL-2007 datasets, where it achieved ULA 84.3 and LAS 80.0. Then we used the sentences of the CoNLL-2007 datasets but with the new, manual annotation. Here, Mate achieved ULA 88.6 and LAS 85.5, which means that the modified annotation schema and the less erroneous/noisy annotation caused an improvement of ULA 4.3 and LAS 5.5. The annotation schema changed a lot: coordination had to be corrected manually since it is treated differently after conversion; moreover, the internal structure of adjectival/participial phrases was not marked in the original constituency treebank, so it was also added manually (Vincze et al., 2010). The improvement in the labeled attachment score is probably due to the reduction of the label set (from 49 to 29 labels), a step which was justified by the fact that some morphosyntactic information was doubly coded in the case of nouns (e.g. házzal (house-INS) 'with the/a house') in the original CoNLL-2007 dataset: first, by their morphological case (Cas=ins) and second, by their dependency label (INS).

Lastly, as the CoNLL-2007 sentences came from the newspaper subcorpus, we can compare these scores with the ULA 90.0 and LAS 87.5 of Table 2. The ULA 1.5 and LAS 2.0 differences are the result of the bigger training corpus (9189 sentences on average compared to 6390 in the CoNLL-2007 dataset).
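The ULA and LAS figures compared throughout this section are plain per-token accuracies. As a quick illustration (this helper is not taken from the paper; the data layout is assumed), the two scores relate as follows:

```python
def attachment_scores(gold, pred):
    """gold, pred: aligned lists of (head, label) pairs, one entry per token."""
    correct_head = correct_labeled = 0
    for (g_head, g_label), (p_head, p_label) in zip(gold, pred):
        if g_head == p_head:
            correct_head += 1                 # counted for ULA/UAS
            if g_label == p_label:
                correct_labeled += 1          # counted for LAS
    n = len(gold)
    return 100.0 * correct_head / n, 100.0 * correct_labeled / n

# Example: attachment_scores([(2, "SUBJ"), (0, "ROOT")], [(2, "OBJ"), (0, "ROOT")])
# gives (100.0, 50.0): both heads are right, but only one label matches.
```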
                              Hungarian                                        English
                          label    attachment                              label    attachment
virtual nodes             31.5%      39.5%       multiword NEs             15.2%      17.6%
conjunctions and negation     -      11.2%       PP-attachment                 -      15.9%
noun attachment               -       9.6%       non-canonical word order   6.4%       6.5%
more than 1 premodifier       -       5.1%       misplaced clause              -       9.7%
coordination              13.5%      16.5%       coordination               8.5%      12.5%
mislabeled adverb         16.3%          -       mislabeled adverb         40.1%          -
annotation errors         10.7%       6.8%       annotation errors          9.7%       8.5%
other                     28.0%      11.3%       other                     20.1%      29.3%
TOTAL                      100%       100%       TOTAL                      100%       100%

Table 3: The most frequent corpus-specific and general attachment and labeling error categories (based on a manual investigation of 200 erroneous sentences per language).

6 A Systematic Error Analysis

In order to discover the specialties and challenges of Hungarian dependency parsing, we conducted an error analysis of parsed texts from the newspaper domain both in English and Hungarian. 200 randomly selected erroneous sentences from the output of Mate were investigated in both languages, and we categorized the errors on the basis of the linguistic phenomenon responsible for them: for instance, when an error occurred because of the incorrect identification of a multiword Named Entity containing a conjunction, we treated it as a Named Entity error instead of a conjunction error. That is, our goal was to reveal the real linguistic sources of errors rather than deducing them from automatically countable attachment/labeling statistics.

We used the parses based on gold-standard POS tagging for this analysis, as our goal was to identify the challenges of parsing independently of the challenges of POS tagging. The error categories are summarized in Table 3 along with their relative contribution to attachment and labeling errors. This table contains the categories with over 5% relative frequency.[5]

The 200 sentences contained 429/319 and 353/330 attachment/labeling errors in Hungarian and English, respectively. In Hungarian, attachment errors outnumber label errors to a great extent, whereas in English their distribution is basically the same. This might be attributed to the higher level of non-projectivity (see Section 4) and to the more fine-grained label set of the English dataset (36 against 29 labels in English and Hungarian, respectively).

[5] The full tables are available at www.inf.u-szeged.hu/rgai/SzegedTreebank.

Virtual nodes: In Hungarian, the most common source of parsing errors was virtual nodes. As there are quite a lot of verbless clauses in Hungarian (see Section 2 on sentences without a copula), it might be difficult to figure out the proper dependency relations within the sentence, since the verb plays the central role in the sentence, cf. Tesniere (1959). Our parser was not efficient in identifying the structure of such sentences, probably due to the lack of information available to data-driven parsers (each edge is labeled as Exd, while such edges have features similar to ordinary edges). We also note that the output of the current system with Exd labels does not contain much information for downstream applications of parsing. The appropriate handling of virtual nodes is an important direction for future work.

Noun attachment: In Hungarian, the nominal arguments of infinitives and participles were frequently attached erroneously to the main verb. Take the following sentence: A Horn-kabinet idején jól bevált módszerhez próbálnak meg visszatérni (the Horn-government time-3SGPOSS-SUP well tried method-ALL try-3PL PREVERB return-INF) 'They are trying to return to the well-tried method of the Horn government'. In this sentence, a Horn-kabinet idején 'during the Horn government' is a modifier of the past participle bevált 'well-tried'; however, it is attached to the main verb próbálnak 'they are trying' by the parser. Moreover, módszerhez 'to the method' is an argument of the infinitive visszatérni 'to return', but the parser links it to the main verb.
In free word order languages, the order of the arguments of the infinitive and the main verb may get mixed, which is called scrambling (Ross, 1986). This is not a common source of error in English, as arguments cannot scramble.

Article attachment: In Hungarian, if there is an article before a prenominal modifier, it can belong to the head noun or to the modifier as well. In a szoba ajtaja (the room door-3SGPOSS) 'the door of the room', the article belongs to the modifier, but when the prenominal modifier cannot have an article (e.g. a februárban induló projekt (the February-INE starting project) 'the project starting in February'), it is attached to the head noun (i.e. to projekt 'project'). It was not always clear for the parser which parent to select for the article. In contrast, these cases are not problematic in English, since the modifier typically follows the head and thus each article precedes its head noun.

Conjunctions or negation words (most typically the words is 'too', csak 'only/just' and nem/sem 'not') were much more frequently attached to the wrong node in Hungarian than in English. In Hungarian, they are ambiguous between being adverbs and conjunctions, and it is mostly their conjunctive uses which are problematic from the viewpoint of parsing. On the other hand, these words have an important role in marking the information structure of the sentence: they are usually attached to the element in focus position, and if there is no focus, they are attached to the verb. However, sentences with or without focus can have similar word order although their stress pattern is different. Dependency parsers obviously cannot recognize stress patterns, hence conjunctions and negation words are sometimes erroneously attached to the verb in Hungarian.

English sentences with non-canonical word order (e.g. questions) were often parsed incorrectly, e.g. the noun following the main verb was analysed as the object in sentences like "Replied a salesman: Exactly.", where it is the subject that follows the verb for stylistic reasons. However, in Hungarian, morphological information is of help in such sentences, as it is not the position relative to the verb but the case suffix that determines the grammatical role of the noun.

In English, high or low PP-attachment was responsible for many parsing ambiguities: most typically, the prepositional complement which follows the head was attached to the verb instead of the noun, or vice versa. In contrast, Hungarian is a head-after-dependent language, which means that dependents most often occur before the head. Furthermore, there are no prepositions in Hungarian, and grammatical relations encoded by prepositions in English are conveyed by suffixes or postpositions. Thus, if there is a modifier before the nominal head, it requires the presence of a participle, as in: Felvette a kirakatban levő ruhát (take.on-PAST3SGOBJ the shop.window-INE being dress-ACC) 'She put on the dress in the shop window'. The English sentence is ambiguous (either the event happens in the shop window or the dress was originally in the shop window), while the Hungarian one has only the latter meaning.[6]

General dependency parsing difficulties: There were certain structures that led to typical label and/or attachment errors in both languages. The most frequent one among them is coordination. However, it should be mentioned that such syntactic ambiguities are often problematic even for humans to disambiguate without contextual or background semantic knowledge.

In the case of label errors, the relation between the given node and its parent was labeled incorrectly. In both English and Hungarian, one of the most common errors of this type was mislabeled adverbs and adverbial phrases, e.g. locative adverbs were labeled as ADV/MODE. However, the frequency of this error type is much higher in English than in Hungarian, which may be related to the fact that in the English corpus there is a much more balanced distribution of adverbial labels than in the Hungarian one (where the categories MODE and TLOCY are responsible for 90% of the occurrences). Assigning the most frequent label of the training dataset to each adverb yields an accuracy of 82% in English and 93% in Hungarian, which suggests that there is a higher level of ambiguity for English adverbial phrases. For instance, the preposition by may introduce an adverbial modifier of manner (MNR) in "by creating a bill" and the agent in a passive sentence (LGS). Thus, labeling adverbs seems to be a more difficult task in English.[7]

[6] However, there exists a head-before-dependent version of the sentence (Felvette a ruhát a kirakatban), whose preferred reading is 'She was in the shop window while dressing up', that is, the modifier belongs to the verb.

[7] We would nevertheless like to point out that adverbial labels have a highly semantic nature, i.e. it could be argued that it is not the syntactic parser that should identify them but a semantic processor.
Clauses were also often mislabeled in both languages, most typically when there was no overt conjunction between the clauses. Another source of error was when more than one modifier occurred before a noun (5.1% and 4.2% of attachment errors in Hungarian and in English): in these cases, the first modifier could belong to the noun (a brown Japanese car) or to the second modifier (a brown haired girl).

Multiword Named Entities: As we mentioned in Section 4, members of multiword Named Entities had a proper noun POS tag and an NE label in our dataset. Hence, when parsing is based on gold-standard POS tags, their recognition is almost perfect, while it is a frequent source of errors in the CoNLL-2009 corpus. We investigated the parses of our 200 sentences with predicted POS tags at NEs and found that this introduces several errors (about 5% of both attachment and labeling errors) in Hungarian. On the other hand, the results are only slightly worse in English, i.e. identifying the inner structure of NEs does not depend on whether the parser builds on gold-standard or predicted POS tags, since function words like conjunctions or prepositions which mark grammatical relations are tagged in the same way in both cases. The relative frequency of this error type is much higher in English even when the Hungarian parser does not have access to the gold proper noun POS tags. The reason for this is simple: in the Penn Treebank the correct internal structure of the NEs has to be identified beyond the phrase boundaries, while in Hungarian their members simply form a chain.

Annotation errors: We note that our analysis took into account only sentences which contained at least one parsing error, and we examined only the dependencies where the gold-standard annotation and the output of the parser did not match. Hence, the frequency of annotation errors is probably higher than what we found (about 1% of the entire set of dependencies) during our investigation, as there could be annotation errors in the error-free sentences and also in the investigated sentences where the parser agrees with that error.

7 Conclusions

We showed that state-of-the-art dependency parsers achieve similar results in terms of attachment scores on Hungarian and English. Although the results of this comparison should be taken with a pinch of salt (sentence lengths and the information encoded in single words differ, and domain differences and annotation schema divergences cannot be controlled for), we conclude that parsing Hungarian is just as hard a task as parsing English. We argued that this is due to the relatively good POS tagging accuracy (which is a consequence of the low ambiguity among the alternative morphological analyses of a sentence and the good coverage of the morphological analyser) and the fact that data-driven dependency parsers employ a rich feature representation which enables them to learn different kinds of feature weight profiles.

We also discussed the domain differences among the subcorpora of the Szeged Dependency Treebank and their effect on parsing results. Our results support that there can be higher differences in parsing scores among domains in one language than among corpora from a similar domain but different languages (which again marks the pitfalls of inter-language comparison of parsing scores).

Our systematic error analysis showed that handling virtual nodes (mostly the empty copula) is a frequent source of errors. We identified several phenomena which are not typically listed as Hungarian syntax-specific features but are challenging for the current data-driven parsers, while they are not problematic in English (like the attachment of conjunctions and negation words and the attachment problem of nouns and articles). Based on our quantitative analysis, we concluded that a further notable error reduction is only achievable if distinctive attention is paid to these language-specific phenomena.

We intend to investigate the problem of virtual nodes in dependency parsing in more depth and to implement new feature templates for the Hungarian-specific challenges as future work.

Acknowledgments

This work was supported in part by the Deutsche Forschungsgemeinschaft grant SFB 732 and the NIH grant (project codename MASZEKER) of the Hungarian government.
References

Zoltan Alexin, Janos Csirik, Tibor Gyimothy, Karoly Bibok, Csaba Hatvani, Gabor Proszeky, and Laszlo Tihanyi. 2003. Annotated Hungarian National Corpus. In Proceedings of the EACL, pages 53-56.

Anna Babarczy, Balint Gabor, Gabor Hamp, and Andras Rung. 2005. Hunpars: a rule-based sentence parser for Hungarian. In Proceedings of the 6th International Symposium on Computational Intelligence.

Csongor Barta, Dora Csendes, Janos Csirik, Andras Hocza, Andras Kocsor, and Kornel Kovacs. 2005. Learning syntactic tree patterns from a balanced Hungarian natural language database, the Szeged Treebank. In Proceedings of the 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering, pages 225-231.

Bernd Bohnet. 2010. Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 89-97.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X Shared Task on Multilingual Dependency Parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 149-164.

Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 957-961.

Dora Csendes, Janos Csirik, Tibor Gyimothy, and Andras Kocsor. 2005. The Szeged Treebank. In TSD, pages 123-131.

Katalin E. Kiss. 2002. The Syntax of Hungarian. Cambridge University Press, Cambridge.

Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: an exploration. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, COLING '96, pages 340-345.

Richard Farkas, Daniel Szeredi, Daniel Varga, and Veronika Vincze. 2010. MSD-KR harmonizacio a Szeged Treebank 2.5-ben [Harmonizing MSD and KR codes in the Szeged Treebank 2.5]. In VII. Magyar Szamitogepes Nyelveszeti Konferencia, pages 349-353.

Jan Hajic, Alena Bohmova, Eva Hajicova, and Barbora Vidova-Hladka. 2000. The Prague Dependency Treebank: A Three-Level Annotation Scenario. In Anne Abeille, editor, Treebanks: Building and Using Parsed Corpora, pages 103-127. Amsterdam: Kluwer.

Jan Hajic, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antonia Marti, Lluis Marquez, Adam Meyers, Joakim Nivre, Sebastian Pado, Jan Stepanek, Pavel Stranak, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, pages 1-18.

Johan Hall, Jens Nilsson, Joakim Nivre, Gulsen Eryigit, Beata Megyesi, Mattias Nilsson, and Markus Saers. 2007. Single Malt or Blended? A Study in Multilingual Parser Optimization. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 933-939.

Szilard Ivan, Robert Ormandi, and Andras Kocsor. 2007. Magyar mondatok SVM alapu szintaxis elemzese [SVM-based syntactic parsing of Hungarian sentences]. In V. Magyar Szamitogepes Nyelveszeti Konferencia, pages 281-283.

Taku Kudo and Yuji Matsumoto. 2002. Japanese dependency analysis using cascaded chunking. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, COLING-02, pages 1-7.

Ryan McDonald and Joakim Nivre. 2011. Analyzing and integrating dependency parsers. Computational Linguistics, 37:197-230.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005. Non-Projective Dependency Parsing using Spanning Tree Algorithms. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 523-530.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-Projective Dependency Parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 99-106.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-Based Dependency Parsing. In HLT-NAACL 2004 Workshop: Eighth Conference on Computational Natural Language Learning (CoNLL-2004), pages 49-56.

Joakim Nivre, Johan Hall, Sandra Kubler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 Shared Task on Dependency Parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915-932.

Gabor Proszeky, Laszlo Tihanyi, and Gabor L. Ugray. 2004. Moose: A Robust High-Performance Parser and Generator. In Proceedings of the 9th Workshop of the European Association for Machine Translation.

John R. Ross. 1986. Infinite syntax! ABLEX, Norwood, NJ.

Lucien Tesniere. 1959. Elements de syntaxe structurale. Klincksieck, Paris.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173-180.

Viktor Tron, Peter Halacsy, Peter Rebrus, Andras Rung, Eszter Simon, and Peter Vajda. 2006. Morphdb.hu: Hungarian lexical database and morphological grammar. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006).

Daniel Varga, Peter Halacsy, Andras Kornai, Viktor Nagy, Laszlo Nemeth, and Viktor Tron. 2005. Parallel corpora for medium density languages. In Proceedings of the RANLP, pages 590-596.

Veronika Vincze, Dora Szauter, Attila Almasi, Gyorgy Mora, Zoltan Alexin, and Janos Csirik. 2010. Hungarian Dependency Treebank. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10).
Dependency Parsing with Undirected Graphs

Carlos Gómez-Rodríguez                      Daniel Fernández-González
Departamento de Computación                 Departamento de Informática
Universidade da Coruña                      Universidade de Vigo
Campus de Elviña, 15071                     Campus As Lagoas, 32004
A Coruña, Spain                             Ourense, Spain
carlos.gomez@udc.es                         danifg@uvigo.es

Abstract

We introduce a new approach to transition-based dependency parsing in which the parser does not directly construct a dependency structure, but rather an undirected graph, which is then converted into a directed dependency tree in a post-processing step. This alleviates error propagation, since undirected parsers do not need to observe the single-head constraint.

Undirected parsers can be obtained by simplifying existing transition-based parsers satisfying certain conditions. We apply this approach to obtain undirected variants of the planar and 2-planar parsers and of Covington's non-projective parser. We perform experiments on several datasets from the CoNLL-X shared task, showing that these variants outperform the original directed algorithms in most of the cases.

1 Introduction

Dependency parsing has proven to be very useful for natural language processing tasks. Data-driven dependency parsers such as those by Nivre et al. (2004), McDonald et al. (2005), Titov and Henderson (2007), Martins et al. (2009) or Huang and Sagae (2010) are accurate and efficient, they can be trained from annotated data without the need for a grammar, and they provide a simple representation of syntax that maps to predicate-argument structure in a straightforward way.

In particular, transition-based dependency parsers (Nivre, 2008) are a type of dependency parsing algorithms which use a model that scores transitions between parser states. Greedy deterministic search can be used to select the transition to be taken at each state, thus achieving linear or quadratic time complexity.

Figure 1: An example dependency structure where transition-based parsers enforcing the single-head constraint will incur in error propagation if they mistakenly build a dependency link 1 → 2 instead of 2 → 1 (dependency links are represented as arrows going from head to dependent).

It has been shown by McDonald and Nivre (2007) that such parsers suffer from error propagation: an early erroneous choice can place the parser in an incorrect state that will in turn lead to more errors. For instance, suppose that a sentence whose correct analysis is the dependency graph in Figure 1 is analyzed by any bottom-up or left-to-right transition-based parser that outputs dependency trees, therefore obeying the single-head constraint (only one incoming arc is allowed per node). If the parser chooses an erroneous transition that leads it to build a dependency link from 1 to 2 instead of the correct link from 2 to 1, this will lead it to a state where the single-head constraint makes it illegal to create the link from 3 to 2. Therefore, a single erroneous choice will cause two attachment errors in the output tree.

With the goal of minimizing these sources of errors, we obtain novel undirected variants of several parsers; namely, of the planar and 2-planar parsers by Gómez-Rodríguez and Nivre (2010) and the non-projective list-based parser described by Nivre (2008), which is based on Covington's algorithm (Covington, 2001).
These variants work by collapsing the LEFT-ARC and RIGHT-ARC transitions in the original parsers, which create right-to-left and left-to-right dependency links, into a single ARC transition creating an undirected link. This has the advantage that the single-head constraint need not be observed during the parsing process, since the directed notions of head and dependent are lost in undirected graphs. This gives the parser more freedom and can prevent situations where enforcing the constraint leads to error propagation, as in Figure 1.

On the other hand, these new algorithms have the disadvantage that their output is an undirected graph, which has to be post-processed to recover the direction of the dependency links and generate a valid dependency tree. Thus, some complexity is moved from the parsing process to this post-processing step; and each undirected parser will outperform the directed version only if the simplification of the parsing phase is able to avoid more errors than are generated by the post-processing. As will be seen in later sections, experimental results indicate that this is in fact the case.

The rest of this paper is organized as follows: Section 2 introduces some notation and concepts that we will use throughout the paper. In Section 3, we present the undirected versions of the parsers by Gómez-Rodríguez and Nivre (2010) and Nivre (2008), as well as some considerations about the feature models suitable to train them. In Section 4, we discuss post-processing techniques that can be used to recover dependency trees from undirected graphs. Section 5 presents an empirical study of the performance obtained by these parsers, and Section 6 contains a final discussion.

2 Preliminaries

2.1 Dependency Graphs

Let w = w1 ... wn be an input string. A dependency graph for w is a directed graph G = (Vw, E), where Vw = {0, ..., n} is the set of nodes, and E ⊆ Vw × Vw is the set of directed arcs. Each node in Vw encodes the position of a token in w, and each arc in E encodes a dependency relation between two tokens. We write i → j to denote a directed arc (i, j), which will also be called a dependency link from i to j.[1] We say that i is the head of j and, conversely, that j is a syntactic dependent of i.

Given a dependency graph G = (Vw, E), we write i →* j ∈ E if there is a (possibly empty) directed path from i to j, and i ↔* j ∈ E if there is a (possibly empty) path between i and j in the undirected graph underlying G (omitting the references to E when clear from the context).

Most dependency-based representations of syntax do not allow arbitrary dependency graphs; instead, they are restricted to acyclic graphs that have at most one head per node. Dependency graphs satisfying these constraints are called dependency forests.

Definition 1 A dependency graph G is said to be a forest iff it satisfies:

1. Acyclicity constraint: if i →* j, then not j → i.
2. Single-head constraint: if j → i, then there is no k ≠ j such that k → i.

A node that has no head in a dependency forest is called a root. Some dependency frameworks add the additional constraint that dependency forests have only one root (or, equivalently, that they are connected). Such a forest is called a dependency tree. A dependency tree can be obtained from any dependency forest by linking all of its root nodes as dependents of a dummy root node, conventionally located in position 0 of the input.

[1] In practice, dependency links are usually labeled, but to simplify the presentation we will ignore labels throughout most of the paper. However, all the results and algorithms presented can be applied to labeled dependency graphs and will be so applied in the experimental evaluation.

2.2 Transition Systems

In the framework of Nivre (2008), transition-based parsers are described by means of a non-deterministic state machine called a transition system.

Definition 2 A transition system for dependency parsing is a tuple S = (C, T, cs, Ct), where

1. C is a set of possible parser configurations,
2. T is a finite set of transitions, which are partial functions t : C → C,
3. cs is a total initialization function mapping each input string to a unique initial configuration, and
4. Ct ⊆ C is a set of terminal configurations.

To obtain a deterministic parser from a non-deterministic transition system, an oracle is used to deterministically select a single transition at each configuration.
An oracle for a transition system S = (C, T, cs, Ct) is a function o : C → T. Suitable oracles can be obtained in practice by training classifiers on treebank data (Nivre et al., 2004).

2.3 The Planar, 2-Planar and Covington Transition Systems

Our undirected dependency parsers are based on the planar and 2-planar transition systems by Gómez-Rodríguez and Nivre (2010) and the version of the Covington (2001) non-projective parser defined by Nivre (2008). We now outline these directed parsers briefly; a more detailed description can be found in the above references.

2.3.1 Planar

The planar transition system by Gómez-Rodríguez and Nivre (2010) is a linear-time transition-based parser for planar dependency forests, i.e., forests whose dependency arcs do not cross when drawn above the words. The set of planar dependency structures is a very mild extension of that of projective structures (Kuhlmann and Nivre, 2006).

Configurations in this system are of the form c = ⟨Σ, B, A⟩, where Σ and B are disjoint lists of nodes from Vw (for some input w), and A is a set of dependency links over Vw. The list B, called the buffer, holds the input words that are still to be read. The list Σ, called the stack, is initially empty and is used to hold words that have dependency links pending to be created. The system is shown at the top in Figure 2, where the notation Σ|i is used for a stack with top i and tail Σ, and we invert the notation for the buffer for clarity (i.e., i|B is a buffer with top i and tail B).

The system reads the input sentence and creates links in a left-to-right order by executing its four transitions, until it gets to a terminal configuration. A SHIFT transition moves the first (leftmost) node in the buffer to the top of the stack. The transitions LEFT-ARC and RIGHT-ARC create a leftward or rightward link, respectively, involving the first node in the buffer and the topmost node in the stack. Finally, the REDUCE transition is used to pop the top word from the stack when we have finished building arcs to or from it.

2.3.2 2-Planar

The 2-planar transition system by Gómez-Rodríguez and Nivre (2010) is an extension of the planar system that uses two stacks, allowing it to recognize 2-planar structures, a larger set of dependency structures that has been shown to cover the vast majority of non-projective structures in a number of treebanks (Gómez-Rodríguez and Nivre, 2010).

This transition system, shown in Figure 2, has configurations of the form c = ⟨Σ0, Σ1, B, A⟩, where we call Σ0 the active stack and Σ1 the inactive stack. Its SHIFT, LEFT-ARC, RIGHT-ARC and REDUCE transitions work similarly to those in the planar parser, but while SHIFT pushes the first word in the buffer to both stacks, the other three transitions only work with the top of the active stack, ignoring the inactive one. Finally, a SWITCH transition is added that makes the active stack inactive and vice versa.

2.3.3 Covington Non-Projective

Covington (2001) proposes several incremental parsing strategies for dependency representations, and one of them can recover non-projective dependency graphs. Nivre (2008) implements a variant of this strategy as a transition system with configurations of the form c = ⟨λ1, λ2, B, A⟩, where λ1 and λ2 are lists containing partially processed words and B is the buffer list of unprocessed words.

The Covington non-projective transition system is shown at the bottom in Figure 2. At each configuration c = ⟨λ1, λ2, B, A⟩, the parser has to consider whether any dependency arc should be created involving the top of the buffer and the words in λ1. A LEFT-ARC transition adds a link from the first node j in the buffer to the node in the head of the list λ1, which is moved to the list λ2 to signify that we have finished considering it as a possible head or dependent of j. The RIGHT-ARC transition does the same manipulation, but creating the symmetric link. A NO-ARC transition removes the head of the list λ1 and inserts it at the head of the list λ2 without creating any arcs: this transition is to be used when there is no dependency relation between the top node in the buffer and the head of λ1, but we still may want to create an arc involving the top of the buffer and other nodes in λ1. Finally, if we do not want to create any such arcs at all, we can execute a SHIFT transition, which advances the parsing process by removing the first node in the buffer B and inserting it at the head of a list obtained by concatenating λ1 and λ2.
This list becomes the new λ1, whereas λ2 is empty in the resulting configuration.

Note that the Covington parser has quadratic complexity with respect to input length, while the planar and 2-planar parsers run in linear time.

3 The Undirected Parsers

The transition systems defined in Section 2.3 share the common property that their LEFT-ARC and RIGHT-ARC transitions have exactly the same effects except for the direction of the links that they create. We can take advantage of this property to define undirected versions of these transition systems, by transforming them as follows:

- Configurations are changed so that the arc set A is a set of undirected arcs, instead of directed arcs.
- The LEFT-ARC and RIGHT-ARC transitions in each parser are collapsed into a single ARC transition that creates an undirected arc.
- The preconditions of transitions that guarantee the single-head constraint are removed, since the notions of head and dependent are lost in undirected graphs.

By performing these transformations and leaving the systems otherwise unchanged, we obtain the undirected variants of the planar, 2-planar and Covington algorithms that are shown in Figure 3.

Note that the transformation can be applied to any transition system having LEFT-ARC and RIGHT-ARC transitions that are equal except for the direction of the created link, and thus collapsable into one. The above three transition systems fulfill this property, but not every transition system does. For example, the well-known arc-eager parser of Nivre (2003) pops a node from the stack when creating left arcs, and pushes a node to the stack when creating right arcs, so the transformation cannot be applied to it.[2]

[2] One might think that the arc-eager algorithm could still be transformed by converting each of its arc transitions into an undirected transition, without collapsing them into one. However, this would result in a parser that violates the acyclicity constraint, since the algorithm is designed in such a way that acyclicity is only guaranteed if the single-head constraint is kept. It is easy to see that this problem cannot happen in parsers where LEFT-ARC and RIGHT-ARC transitions have the same effect: in these, if a directed graph is not parsable in the original algorithm, its underlying undirected graph cannot be parsable in the undirected variant.

3.1 Feature models

Some of the features that are typically used to train transition-based dependency parsers depend on the direction of the arcs that have been built up to a certain point. For example, two such features for the planar parser could be the POS tag associated with the head of the topmost stack node, or the label of the arc going from the first node in the buffer to its leftmost dependent.[3]

As the notion of head and dependent is lost in undirected graphs, this kind of features cannot be used to train undirected parsers. Instead, we use features based on undirected relations between nodes. We found that the following kinds of features worked well in practice as a replacement for features depending on arc direction:

- Information about the ith node linked to a given node (topmost stack node, topmost buffer node, etc.) on the left or on the right, and about the associated undirected arc, typically for i = 1, 2, 3,
- Information about whether two nodes are linked or not in the undirected graph, and about the label of the arc between them,
- Information about the first left and right undirected siblings of a given node, i.e., the first node q located to the left of the given node p such that p and q are linked to some common node r located to the right of both, and vice versa. Note that this notion of undirected siblings does not correspond exclusively to siblings in the directed graph, since it can also capture other second-order interactions, such as grandparents.

[3] These example features are taken from the default model for the planar parser in version 1.5 of MaltParser (Nivre et al., 2006).

4 Reconstructing the dependency forest

The modified transition systems presented in the previous section generate undirected graphs. To obtain complete dependency parsers that are able to produce directed dependency forests, we will need a reconstruction step that will assign a direction to the arcs in such a way that the single-head constraint is obeyed. This reconstruction step can be implemented by building a directed graph with weighted arcs corresponding to both possible directions of each undirected edge, and then finding an optimum branching to reduce it to a directed tree.
Planar     initial/terminal configurations: cs(w1 ... wn) = ⟨[], [1 ... n], ∅⟩,  Cf = {⟨Σ, [], A⟩ ∈ C}
Transitions:
  SHIFT       ⟨Σ, i|B, A⟩    ⇒  ⟨Σ|i, B, A⟩
  REDUCE      ⟨Σ|i, B, A⟩    ⇒  ⟨Σ, B, A⟩
  LEFT-ARC    ⟨Σ|i, j|B, A⟩  ⇒  ⟨Σ|i, j|B, A ∪ {(j, i)}⟩
              only if there is no k such that (k, i) ∈ A (single-head) and there is no path between i and j in A (acyclicity).
  RIGHT-ARC   ⟨Σ|i, j|B, A⟩  ⇒  ⟨Σ|i, j|B, A ∪ {(i, j)}⟩
              only if there is no k such that (k, j) ∈ A (single-head) and there is no path between i and j in A (acyclicity).

2-Planar   initial/terminal configurations: cs(w1 ... wn) = ⟨[], [], [1 ... n], ∅⟩,  Cf = {⟨Σ0, Σ1, [], A⟩ ∈ C}
Transitions:
  SHIFT       ⟨Σ0, Σ1, i|B, A⟩    ⇒  ⟨Σ0|i, Σ1|i, B, A⟩
  REDUCE      ⟨Σ0|i, Σ1, B, A⟩    ⇒  ⟨Σ0, Σ1, B, A⟩
  LEFT-ARC    ⟨Σ0|i, Σ1, j|B, A⟩  ⇒  ⟨Σ0|i, Σ1, j|B, A ∪ {(j, i)}⟩
              only if there is no k such that (k, i) ∈ A (single-head) and there is no path between i and j in A (acyclicity).
  RIGHT-ARC   ⟨Σ0|i, Σ1, j|B, A⟩  ⇒  ⟨Σ0|i, Σ1, j|B, A ∪ {(i, j)}⟩
              only if there is no k such that (k, j) ∈ A (single-head) and there is no path between i and j in A (acyclicity).
  SWITCH      ⟨Σ0, Σ1, B, A⟩      ⇒  ⟨Σ1, Σ0, B, A⟩

Covington  initial/terminal configurations: cs(w1 ... wn) = ⟨[], [], [1 ... n], ∅⟩,  Cf = {⟨λ1, λ2, [], A⟩ ∈ C}
Transitions:
  SHIFT       ⟨λ1, λ2, i|B, A⟩    ⇒  ⟨λ1·λ2|i, [], B, A⟩
  NO-ARC      ⟨λ1|i, λ2, B, A⟩    ⇒  ⟨λ1, i|λ2, B, A⟩
  LEFT-ARC    ⟨λ1|i, λ2, j|B, A⟩  ⇒  ⟨λ1, i|λ2, j|B, A ∪ {(j, i)}⟩
              only if there is no k such that (k, i) ∈ A (single-head) and there is no path between i and j in A (acyclicity).
  RIGHT-ARC   ⟨λ1|i, λ2, j|B, A⟩  ⇒  ⟨λ1, i|λ2, j|B, A ∪ {(i, j)}⟩
              only if there is no k such that (k, j) ∈ A (single-head) and there is no path between i and j in A (acyclicity).

Figure 2: Transition systems for planar, 2-planar and Covington non-projective dependency parsing.
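To make the notation in Figure 2 concrete, the following minimal Python sketch implements the four transitions of the directed planar parser. It is an illustration under our own assumptions (plain lists and sets for configurations), not the authors' implementation, and the acyclicity precondition is rendered as the connectivity check described above.

```python
def connected(arcs, i, j):
    """Is there a path between i and j in the graph underlying the arc set?"""
    adj = {}
    for h, d in arcs:
        adj.setdefault(h, set()).add(d)
        adj.setdefault(d, set()).add(h)
    seen, todo = {i}, [i]
    while todo:
        for k in adj.get(todo.pop(), ()):
            if k not in seen:
                seen.add(k)
                todo.append(k)
    return j in seen

def shift(stack, buf, arcs):
    return stack + [buf[0]], buf[1:], arcs

def reduce_(stack, buf, arcs):
    return stack[:-1], buf, arcs

def left_arc(stack, buf, arcs):
    i, j = stack[-1], buf[0]
    assert all(d != i for _, d in arcs), "single-head"   # i has no head yet
    assert not connected(arcs, i, j), "acyclicity"
    return stack, buf, arcs | {(j, i)}                    # leftward link j -> i

def right_arc(stack, buf, arcs):
    i, j = stack[-1], buf[0]
    assert all(d != j for _, d in arcs), "single-head"   # j has no head yet
    assert not connected(arcs, i, j), "acyclicity"
    return stack, buf, arcs | {(i, j)}                    # rightward link i -> j

# Initial configuration for a 3-word sentence: ([], [1, 2, 3], set()).
# A trained classifier chooses which transition to apply at each step.
```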

Undirected Planar     initial/terminal conf.: cs(w1 ... wn) = ⟨[], [1 ... n], ∅⟩,  Cf = {⟨Σ, [], A⟩ ∈ C}
Transitions:
  SHIFT    ⟨Σ, i|B, A⟩    ⇒  ⟨Σ|i, B, A⟩
  REDUCE   ⟨Σ|i, B, A⟩    ⇒  ⟨Σ, B, A⟩
  ARC      ⟨Σ|i, j|B, A⟩  ⇒  ⟨Σ|i, j|B, A ∪ {{i, j}}⟩
           only if there is no path between i and j in A (acyclicity).

Undirected 2-Planar   initial/terminal conf.: cs(w1 ... wn) = ⟨[], [], [1 ... n], ∅⟩,  Cf = {⟨Σ0, Σ1, [], A⟩ ∈ C}
Transitions:
  SHIFT    ⟨Σ0, Σ1, i|B, A⟩    ⇒  ⟨Σ0|i, Σ1|i, B, A⟩
  REDUCE   ⟨Σ0|i, Σ1, B, A⟩    ⇒  ⟨Σ0, Σ1, B, A⟩
  ARC      ⟨Σ0|i, Σ1, j|B, A⟩  ⇒  ⟨Σ0|i, Σ1, j|B, A ∪ {{i, j}}⟩
           only if there is no path between i and j in A (acyclicity).
  SWITCH   ⟨Σ0, Σ1, B, A⟩      ⇒  ⟨Σ1, Σ0, B, A⟩

Undirected Covington  initial/terminal conf.: cs(w1 ... wn) = ⟨[], [], [1 ... n], ∅⟩,  Cf = {⟨λ1, λ2, [], A⟩ ∈ C}
Transitions:
  SHIFT    ⟨λ1, λ2, i|B, A⟩    ⇒  ⟨λ1·λ2|i, [], B, A⟩
  NO-ARC   ⟨λ1|i, λ2, B, A⟩    ⇒  ⟨λ1, i|λ2, B, A⟩
  ARC      ⟨λ1|i, λ2, j|B, A⟩  ⇒  ⟨λ1, i|λ2, j|B, A ∪ {{i, j}}⟩
           only if there is no path between i and j in A (acyclicity).

Figure 3: Transition systems for undirected planar, 2-planar and Covington non-projective dependency parsing.
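The collapse shown in Figure 3 is equally easy to see in code. A hedged sketch of the undirected planar ARC transition (again our own illustration, not the authors' software): LEFT-ARC and RIGHT-ARC merge into one transition, the single-head check disappears, and edges are stored as unordered pairs, while SHIFT and REDUCE stay unchanged.

```python
def undirected_connected(edges, i, j):
    """Path test over undirected edges stored as frozensets {a, b}."""
    adj = {}
    for e in edges:
        a, b = tuple(e)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, todo = {i}, [i]
    while todo:
        for k in adj.get(todo.pop(), ()):
            if k not in seen:
                seen.add(k)
                todo.append(k)
    return j in seen

def arc(stack, buf, edges):
    """Single collapsed ARC transition: only the acyclicity precondition remains."""
    i, j = stack[-1], buf[0]
    assert not undirected_connected(edges, i, j), "acyclicity"
    return stack, buf, edges | {frozenset((i, j))}
```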

Different criteria for assigning weights to arcs provide different variants of the reconstruction technique.

To describe these variants, we first introduce preliminary definitions. Let U = (Vw, E) be an undirected graph produced by an undirected parser for some string w. We define the following sets of arcs:

A1(U) = {(i, j) | j ≠ 0 ∧ {i, j} ∈ E},
A2(U) = {(0, i) | i ∈ Vw}.

Note that A1(U) represents the set of arcs obtained from assigning an orientation to an edge in U, except arcs whose dependent is the dummy root, which are disallowed. On the other hand, A2(U) contains all the possible arcs originating from the dummy root node, regardless of whether their underlying undirected edges are in U or not; this is so that reconstructions are allowed to link unattached tokens to the dummy root.

The reconstruction process consists of finding a minimum branching (i.e. a directed minimum spanning tree) for a weighted directed graph obtained by assigning a cost c(i, j) to each arc (i, j) of the following directed graph:

D(U) = {Vw, A(U) = A1(U) ∪ A2(U)}.

That is, we will find a dependency tree T = (Vw, AT ⊆ A(U)) such that the sum of costs of the arcs in AT is minimal. In general, such a minimum branching can be calculated with the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967). Since the graph D(U) has O(n) nodes and O(n) arcs for a string of length n, this can be done in O(n log n) if implemented as described by Tarjan (1977).

However, applying these generic techniques is not necessary in this case: since our graph U is acyclic, the problem of reconstructing the forest can be reduced to choosing a root word for each connected component in the graph, linking it as a dependent of the dummy root and directing the other arcs in the component in the (unique) way that makes them point away from the root.

It remains to see how to assign the costs c(i, j) to the arcs of D(U): different criteria for assigning scores will lead to different reconstructions.

4.1 Naive reconstruction

A first, very simple reconstruction technique can be obtained by assigning arc costs to the arcs in A(U) as follows:

c(i, j) = 1   if (i, j) ∈ A1(U),
c(i, j) = 2   if (i, j) ∈ A2(U) ∧ (i, j) ∉ A1(U).

This approach gives the same cost to all arcs obtained from the undirected graph U, while also allowing (at a higher cost) to attach any node to the dummy root. To obtain satisfactory results with this technique, we must train our parser to explicitly build undirected arcs from the dummy root node to the root word(s) of each sentence using arc transitions (note that this implies that we need to represent forests as trees, in the manner described at the end of Section 2.1). Under this assumption, it is easy to see that we can obtain the correct directed tree T for a sentence if we are provided with its underlying undirected tree U: the tree is obtained in O(n) as the unique orientation of U that makes each of its edges point away from the dummy root.

This approach to reconstruction has the advantage of being very simple and not adding any complications to the parsing process, while guaranteeing that the correct directed tree will be recovered if the undirected tree for a sentence is generated correctly. However, it is not very robust, since the direction of all the arcs in the output depends on which node is chosen as sentence head and linked to the dummy root. Therefore, a parsing error affecting the undirected edge involving the dummy root may result in many dependency links being erroneous.

4.2 Label-based reconstruction

To achieve a more robust reconstruction, we use labels to encode a preferred direction for dependency arcs. To do so, for each pre-existing label X in the training set, we create two labels Xl and Xr. The parser is then trained on a modified version of the training set where leftward links originally labelled X are labelled Xl, and rightward links originally labelled X are labelled Xr. Thus, the output of the parser on a new sentence will be an undirected graph where each edge has a label with an annotation indicating whether the reconstruction process should prefer to link the pair of nodes with a leftward or a rightward arc. We can then assign costs to our minimum branching algorithm so that it will return a tree agreeing with as many such annotations as possible.
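Because the parser output U is acyclic, the reconstruction never needs a full Chu-Liu-Edmonds implementation in practice. The sketch below is our own illustration of the orientation step shared by the reconstruction variants: pick a root per connected component, hang it from the dummy node 0, and direct every other edge away from it. The choice of each component's root is simplified here to "the leftmost unattached word", whereas the paper derives it from the arc costs.

```python
from collections import deque

def orient(n_words, undirected_edges):
    """undirected_edges: iterable of frozensets {i, j}; returns a head mapping."""
    adj = {i: set() for i in range(n_words + 1)}
    for e in undirected_edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    head, visited = {}, set()

    def point_away_from(root):
        visited.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in visited:
                    head[v] = u            # orient the edge u-v away from the root
                    visited.add(v)
                    queue.append(v)

    point_away_from(0)                     # component already linked to the dummy root
    for i in range(1, n_words + 1):
        if i not in visited:               # leftover component: attach a root word to 0
            head[i] = 0
            point_away_from(i)
    return head

# orient(3, {frozenset((0, 2)), frozenset((2, 1)), frozenset((2, 3))})
# returns {2: 0, 1: 2, 3: 2}, i.e. the directed tree 0->2, 2->1, 2->3.
```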
To do this, we call A1+(U) ⊆ A1(U) the set of arcs in A1(U) that agree with the annotations, i.e., arcs (i, j) ∈ A1(U) where either i < j and {i, j} is labelled Xr in U, or i > j and {i, j} is labelled Xl in U. We call A1-(U) the set of arcs in A1(U) that disagree with the annotations, i.e., A1-(U) = A1(U) \ A1+(U). And we assign costs as follows:

c(i, j) = 1    if (i, j) ∈ A1+(U),
c(i, j) = 2    if (i, j) ∈ A1-(U),
c(i, j) = 2n   if (i, j) ∈ A2(U) ∧ (i, j) ∉ A1(U),

where n is the length of the string.

With these costs, the minimum branching algorithm will find a tree which agrees with as many annotations as possible. Additional arcs from the root not corresponding to any edge in the output of the parser (i.e. arcs in A2(U) but not in A1(U)) will be used only if strictly necessary to guarantee connectedness; this is implemented by the high cost for these arcs.

While this may be the simplest cost assignment to implement label-based reconstruction, we have found that better empirical results are obtained if we give the algorithm more freedom to create new arcs from the root, as follows:

c(i, j) = 1    if (i, j) ∈ A1+(U) ∧ (i, j) ∉ A2(U),
c(i, j) = 2    if (i, j) ∈ A1-(U) ∧ (i, j) ∉ A2(U),
c(i, j) = 2n   if (i, j) ∈ A2(U).

While the cost of arcs from the dummy root is still 2n, this is now so even for arcs that are in the output of the undirected parser, which had cost 1 before. Informally, this means that with this configuration the postprocessor does not trust the links from the dummy root created by the parser, and may choose to change them if it is convenient to get a better agreement with the label annotations (see Figure 4 for an example of the difference between both cost assignments). We believe that the better accuracy obtained with this criterion probably stems from the fact that it is biased towards changing links from the root, which tend to be more problematic for transition-based parsers, while respecting the parser output for links located deeper in the dependency structure, for which transition-based parsers tend to be more accurate (McDonald and Nivre, 2007).

Figure 4: a) An undirected graph obtained by the parser with the label-based transformation; b) and c) the dependency graph obtained by each of the variants of the label-based reconstruction (note how the second variant moves an arc from the root).

Note that both variants of label-based reconstruction have the property that, if the undirected parser produces the correct edges and labels for a given sentence, then the obtained directed tree is guaranteed to be correct (as it will simply be the tree obtained by decoding the label annotations).

5 Experiments

In this section, we evaluate the performance of the undirected planar, 2-planar and Covington parsers on eight datasets from the CoNLL-X shared task (Buchholz and Marsi, 2006).

Tables 1, 2 and 3 compare the accuracy of the undirected versions with naive and label-based reconstruction to that of the directed versions of the planar, 2-planar and Covington parsers, respectively. In addition, we provide a comparison to well-known state-of-the-art projective and non-projective parsers: the planar parsers are compared to the arc-eager projective parser by Nivre (2003), which is also restricted to planar structures; and the 2-planar parsers are compared with the arc-eager parser with pseudo-projective transformation of Nivre and Nilsson (2005), capable of handling non-planar dependencies.

We use SVM classifiers from the LIBSVM package (Chang and Lin, 2001) for all the languages except Chinese, Czech and German. For these, we use the LIBLINEAR package (Fan et al., 2008) for classification, which reduces training time for these larger datasets, and feature models adapted to this system which, in the case of German, result in higher accuracy than published results using LIBSVM.
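As a purely illustrative aside, the kind of transition classifier discussed here can be sketched with scikit-learn's liblinear-backed LinearSVC. This is not the experimental setup of the paper (which uses MaltParser-style feature models with LIBSVM/LIBLINEAR directly), and the feature names below are invented for the example.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Each training instance pairs the features of a parser configuration with the
# transition the oracle would take in that configuration.
train_features = [
    {"s0.pos": "NOUN", "b0.pos": "VERB", "b0.form": "rose", "linked(s0,b0)": 0},
    {"s0.pos": "VERB", "b0.pos": "ADP",  "b0.form": "at",   "linked(s0,b0)": 1},
]
train_transitions = ["ARC", "SHIFT"]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(train_features)
model = LinearSVC()                  # linear SVM trained with a liblinear backend
model.fit(X, train_transitions)

# At parsing time the highest-scoring transition is applied greedily.
print(model.predict(vectorizer.transform([{"s0.pos": "NOUN", "b0.pos": "VERB"}])))
```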
The LIBSVM feature models for the arc-eager projective and pseudo-projective parsers are the same used by these parsers in the CoNLL-X shared task, where the pseudo-projective version of MaltParser was one of the two top performing systems (Buchholz and Marsi, 2006). For the 2-planar parser, we took the feature models from Gómez-Rodríguez and Nivre (2010) for the languages included in that paper. For all the algorithms and datasets, the feature models used for the undirected parsers were adapted from those of the directed parsers as described in Section 3.1.[4]

The results show that the use of undirected parsing with label-based reconstruction clearly improves the performance in the vast majority of the datasets for the planar and Covington algorithms, where in many cases it also improves upon the corresponding projective and non-projective state-of-the-art parsers provided for comparison. In the case of the 2-planar parser the results are less conclusive, with improvements over the directed versions in five out of the eight languages.

The improvements in LAS obtained with label-based reconstruction over directed parsing are statistically significant at the .05 level[5] for Danish, German and Portuguese in the case of the planar parser, and for Czech, Danish and Turkish in the case of Covington's parser. No statistically significant decrease in accuracy was detected in any of the algorithm/dataset combinations.

As expected, the good results obtained by the undirected parsers with label-based reconstruction contrast with those obtained by the variants with root-based reconstruction, which performed worse in all the experiments.

6 Discussion

We have presented novel variants of the planar and 2-planar transition-based parsers by Gómez-Rodríguez and Nivre (2010) and of Covington's non-projective parser (Covington, 2001; Nivre, 2008) which ignore the direction of dependency links, and reconstruction techniques that can be used to recover the direction of the arcs thus produced. The results obtained show that this idea of undirected parsing, together with the label-based reconstruction technique of Section 4.2, improves parsing accuracy on most of the tested dataset/algorithm combinations, and it can outperform state-of-the-art transition-based parsers.

The accuracy improvements achieved by relaxing the single-head constraint to mitigate error propagation were able to overcome the errors generated in the reconstruction phase, which were few: we observed empirically that the differences between the undirected LAS obtained from the undirected graph before the reconstruction and the final directed LAS are typically below 0.20%. This is true both for the naive and label-based transformations, indicating that both techniques are able to recover arc directions accurately, and that the accuracy differences between them come mainly from the differences in training (e.g. having tentative arc direction as part of the feature information in the label-based reconstruction and not in the naive one) rather than from the differences in the reconstruction methods themselves.

The reason why we can apply the undirected simplification to the three parsers that we have used in this paper is that their LEFT-ARC and RIGHT-ARC transitions have the same effect except for the direction of the links they create. The same transformation and reconstruction techniques could be applied to any other transition-based dependency parsers sharing this property. The reconstruction techniques alone could potentially be applied to any dependency parser (transition-based or not) as long as it can be somehow converted to output undirected graphs.

The idea of parsing with undirected relations between words has been applied before in the work on Link Grammar (Sleator and Temperley, 1991), but in that case the formalism itself works with undirected graphs, which are the final output of the parser. To our knowledge, the idea of using an undirected graph as an intermediate step towards obtaining a dependency structure has not been explored before.

Acknowledgments

This research has been partially funded by the Spanish Ministry of Economy and Competitiveness and FEDER (projects TIN2010-18552-C03-01 and TIN2010-18552-C03-02), Ministry of Education (FPU Grant Program) and Xunta de Galicia (Rede Galega de Recursos Lingüísticos para unha Soc. do Conec.). The experiments were conducted with the help of computing resources provided by the Supercomputing Center of Galicia (CESGA). We thank Joakim Nivre for helpful input in the early stages of this work.

[4] All the experimental settings and feature models used are included in the supplementary material and also available at http://www.grupolys.org/cgomezr/exp/.

[5] Statistical significance was assessed using Dan Bikel's randomized comparator: http://www.cis.upenn.edu/~dbikel/software.html
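The significance test mentioned in footnote 5 is a paired randomization test over per-sentence scores. A hedged sketch of the idea (our own illustration, not Bikel's actual comparator) looks like this:

```python
import random

def approx_randomization(scores_a, scores_b, trials=10000, seed=0):
    """scores_a, scores_b: per-sentence scores (e.g. LAS) of the two systems."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b)) / len(scores_a)
    hits = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:           # randomly exchange the paired scores
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(sum(shuffled_a) - sum(shuffled_b)) / len(scores_a)
        if diff >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)         # p-value estimate

# Example use: p = approx_randomization(las_system1, las_system2); the difference
# is reported as significant at the .05 level when p < 0.05.
```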
Planar UPlanarN UPlanarL MaltP
Lang. LAS(p) UAS(p) LAS(p) UAS(p) LAS(p) UAS(p) LAS(p) UAS(p)
Arabic 66.93 (67.34) 77.56 (77.22) 65.91 (66.33) 77.03 (76.75) 66.75 (67.19) 77.45 (77.22) 66.43 (66.74) 77.19 (76.83)
Chinese 84.23 (84.20) 88.37 (88.33) 83.14 (83.10) 87.00 (86.95) 84.51* (84.50*) 88.37 (88.35*) 86.42 (86.39) 90.06 (90.02)
Czech 77.24 (77.70) 83.46 (83.24) 75.08 (75.60) 81.14 (81.14) 77.60* (77.93*) 83.56* (83.41*) 77.24 (77.57) 83.40 (83.19)
Danish 83.31 (82.60) 88.02 (86.64) 82.65 (82.45) 87.58 (86.67*) 83.87* (83.83*) 88.94* (88.17*) 83.31 (82.64) 88.30 (86.91)
German 84.66 (83.60) 87.02 (85.67) 83.33 (82.77) 85.78 (84.93) 86.32* (85.67*) 88.62* (87.69*) 86.12 (85.48) 88.52 (87.58)
Portug. 86.22 (83.82) 89.80 (86.88) 85.89 (83.82) 89.68 (87.06*) 86.52* (84.83*) 90.28* (88.03*) 86.60 (84.66) 90.20 (87.73)
Swedish 83.01 (82.44) 88.53 (87.36) 81.20 (81.10) 86.50 (85.86) 82.95 (82.66*) 88.29 (87.45*) 82.89 (82.44) 88.61 (87.55)
Turkish 62.70 (71.27) 73.67 (78.57) 59.83 (68.31) 70.15 (75.17) 63.27* (71.63*) 73.93* (78.72*) 62.58 (70.96) 73.09 (77.95)

Table 1: Parsing accuracy of the undirected planar parser with naive (UPlanarN) and label-based (UPlanarL)
postprocessing in comparison to the directed planar (Planar) and the MaltParser arc-eager projective (MaltP)
algorithms, on eight datasets from the CoNLL-X shared task (Buchholz and Marsi, 2006): Arabic (Hajic et al.,
2004), Chinese (Chen et al., 2003), Czech (Hajic et al., 2006), Danish (Kromann, 2003), German (Brants et
al., 2002), Portuguese (Afonso et al., 2002), Swedish (Nilsson et al., 2005) and Turkish (Oflazer et al., 2003;
Atalay et al., 2003). We show labelled (LAS) and unlabelled (UAS) attachment score excluding and including
punctuation tokens in the scoring (the latter in brackets). Best results for each language are shown in boldface,
and results where the undirected parser outperforms the directed version are marked with an asterisk.

2Planar U2PlanarN U2PlanarL MaltPP


Lang. LAS(p) UAS(p) LAS(p) UAS(p) LAS(p) UAS(p) LAS(p) UAS(p)
Arabic 66.73 (67.19) 77.33 (77.11) 66.37 (66.93) 77.15 (77.09) 66.13 (66.52) 76.97 (76.70) 65.93 (66.02) 76.79 (76.14)
Chinese 84.35 (84.32) 88.31 (88.27) 83.02 (82.98) 86.86 (86.81) 84.45* (84.42*) 88.29 (88.25) 86.42 (86.39) 90.06 (90.02)
Czech 77.72 (77.91) 83.76 (83.32) 74.44 (75.19) 80.68 (80.80) 78.00* (78.59*) 84.22* (84.21*) 78.86 (78.47) 84.54 (83.89)
Danish 83.81 (83.61) 88.50 (87.63) 82.00 (81.63) 86.87 (85.80) 83.75 (83.65*) 88.62* (87.82*) 83.67 (83.54) 88.52 (87.70)
German 86.28 (85.76) 88.68 (87.86) 82.93 (82.53) 85.52 (84.81) 86.52* (85.99*) 88.72* (87.92*) 86.94 (86.62) 89.30 (88.69)
Portug. 87.04 (84.92) 90.82 (88.14) 85.61 (83.45) 89.36 (86.65) 86.70 (84.75) 90.38 (87.88) 87.08 (84.90) 90.66 (87.95)
Swedish 83.13 (82.71) 88.57 (87.59) 81.00 (80.71) 86.54 (85.68) 82.59 (82.25) 88.19 (87.29) 83.39 (82.67) 88.59 (87.38)
Turkish 61.80 (70.09) 72.75 (77.39) 58.10 (67.44) 68.03 (74.06) 61.92* (70.64*) 72.18 (77.46*) 62.80 (71.33) 73.49 (78.44)

Table 2: Parsing accuracy of the undirected 2-planar parser with naive (U2PlanarN) and label-based (U2PlanarL)
postprocessing in comparison to the directed 2-planar (2Planar) and MaltParser arc-eager pseudo-projective
(MaltPP) algorithms. The meaning of the scores shown is as in Table 1.

Covington UCovingtonN UCovingtonL


Lang. LAS(p) UAS(p) LAS(p) UAS(p) LAS(p) UAS(p)
Arabic 65.17 (65.49) 75.99 (75.69) 63.49 (63.93) 74.41 (74.20) 65.61* (65.81*) 76.11* (75.66)
Chinese 85.61 (85.61) 89.64 (89.62) 84.12 (84.02) 87.85 (87.73) 86.28* (86.17*) 90.16* (90.04*)
Czech 78.26 (77.43) 84.04 (83.15) 74.02 (74.78) 79.80 (79.92) 78.42* (78.69*) 84.50* (84.16*)
Danish 83.63 (82.89) 88.50 (87.06) 82.00 (81.61) 86.55 (85.51) 84.27* (83.85*) 88.82* (87.75*)
German 86.70 (85.69) 89.08 (87.78) 84.03 (83.51) 86.16 (85.39) 86.50 (85.90*) 88.84 (87.95*)
Portug. 84.73 (82.56) 89.10 (86.30) 83.83 (81.71) 87.88 (85.17) 84.95* (82.70*) 89.18* (86.31*)
Swedish 83.53 (82.76) 88.91 (87.61) 81.78 (81.47) 86.78 (85.96) 83.09 (82.73) 88.11 (87.23)
Turkish 64.25 (72.70) 74.85 (79.75) 63.51 (72.08) 74.07 (79.10) 64.91* (73.38*) 75.46* (80.40*)

Table 3: Parsing accuracy of the undirected Covington non-projective parser with naive (UCovingtonN) and
label-based (UCovingtonL) postprocessing in comparison to the directed algorithm (Covington). The meaning
of the scores shown is as in Table 1.

References

Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos. 2002. Floresta sintá(c)tica: a treebank for Portuguese. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 1698–1703, Paris, France. ELRA.

Nart B. Atalay, Kemal Oflazer, and Bilge Say. 2003. The annotation process in the Turkish treebank. In Proceedings of EACL Workshop on Linguistically Interpreted Corpora (LINC-03), pages 243–246, Morristown, NJ, USA. Association for Computational Linguistics.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, September 20-21, Sozopol, Bulgaria.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL), pages 149–164.

Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

K. Chen, C. Luo, M. Chang, F. Chen, C. Chen, C. Huang, and Z. Gao. 2003. Sinica treebank: Design criteria, representational issues and implementation. In Anne Abeillé, editor, Treebanks: Building and Using Parsed Corpora, chapter 13, pages 231–248. Kluwer.

Y. J. Chu and T. H. Liu. 1965. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400.

Michael A. Covington. 2001. A fundamental algorithm for dependency parsing. In Proceedings of the 39th Annual ACM Southeast Conference, pages 95–102.

Jack Edmonds. 1967. Optimum branchings. Journal of Research of the National Bureau of Standards, 71B:233–240.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Carlos Gómez-Rodríguez and Joakim Nivre. 2010. A transition-based parser for 2-planar dependency structures. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 1492–1501, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004. Prague Arabic Dependency Treebank: Development in data and tools. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools.

Jan Hajič, Jarmila Panevová, Eva Hajičová, Jarmila Panevová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, and Marie Mikulová. 2006. Prague Dependency Treebank 2.0. CDROM CAT: LDC2006T01, ISBN 1-58563-370-4. Linguistic Data Consortium.

Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 1077–1086, Stroudsburg, PA, USA. Association for Computational Linguistics.

Matthias T. Kromann. 2003. The Danish dependency treebank and the underlying linguistic theory. In Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories (TLT), pages 217–220, Växjö, Sweden. Växjö University Press.

Marco Kuhlmann and Joakim Nivre. 2006. Mildly non-projective dependency structures. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 507–514.

André Martins, Noah Smith, and Eric Xing. 2009. Concise integer linear programming formulations for dependency parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL-IJCNLP), pages 342–350.

Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 122–131.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 523–530.

Jens Nilsson, Johan Hall, and Joakim Nivre. 2005. MAMBA meets TIGER: Reconstructing a Swedish treebank from Antiquity. In Peter Juel Henrichsen, editor, Proceedings of the NODALIDA Special Session on Treebanks.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 99–106.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In Proceedings of the 8th Conference on Computational Natural Language Learning (CoNLL-2004), pages 49–56, Morristown, NJ, USA. Association for Computational Linguistics.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 2216–2219.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160.

Joakim Nivre. 2008. Algorithms for Deterministic Incremental Dependency Parsing. Computational Linguistics, 34(4):513–553.

Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, and Gökhan Tür. 2003. Building a Turkish treebank. In Anne Abeillé, editor, Treebanks: Building and Using Parsed Corpora, pages 261–277. Kluwer.

Daniel Sleator and Davy Temperley. 1991. Parsing English with a link grammar. Technical Report CMU-CS-91-196, Carnegie Mellon University, Computer Science.

R. E. Tarjan. 1977. Finding optimum branchings. Networks, 7:25–35.

Ivan Titov and James Henderson. 2007. A latent variable model for generative dependency parsing. In Proceedings of the 10th International Conference on Parsing Technologies (IWPT), pages 144–155.

The Best of Both Worlds – A Graph-based Completion Model for
Transition-based Parsers

Bernd Bohnet and Jonas Kuhn


University of Stuttgart
Institute for Natural Language Processing
{bohnet,jonas}@ims.uni-stuttgart.de

Abstract

Transition-based dependency parsers are often forced to make attachment decisions at a point when only partial information about the relevant graph configuration is available. In this paper, we describe a model that takes into account complete structures as they become available to rescore the elements of a beam, combining the advantages of transition-based and graph-based approaches. We also propose an efficient implementation that allows for the use of sophisticated features and show that the completion model leads to a substantial increase in accuracy. We apply the new transition-based parser on typologically different languages such as English, Chinese, Czech, and German and report competitive labeled and unlabeled attachment scores.

1 Introduction

Background. A considerable amount of recent research has gone into data-driven dependency parsing, and interestingly throughout the continuous process of improvements, two classes of parsing algorithms have stayed at the centre of attention, the transition-based (Nivre, 2003) vs. the graph-based approach (Eisner, 1996; McDonald et al., 2005).1 The two approaches apply fundamentally different strategies to solve the task of finding the optimal labeled dependency tree over the words of an input sentence (where supervised machine learning is used to estimate the scoring parameters on a treebank).

The transition-based approach is based on the conceptually (and cognitively) compelling idea that machine learning, i.e., a model of linguistic experience, is used in exactly those situations when there is an attachment choice in an otherwise deterministic incremental left-to-right parsing process. As a new word is processed, the parser has to decide on one out of a small number of possible transitions (adding a dependency arc pointing to the left or right and/or pushing or popping a word on/from a stack representation). Obviously, the learning can be based on the feature information available at a particular snapshot in incremental processing, i.e., only surface information for the unparsed material to the right, but full structural information for the parts of the string already processed. For the completely processed parts, there are no principled limitations as regards the types of structural configurations that can be checked in feature functions.

The graph-based approach in contrast emphasizes the objective of exhaustive search over all possible trees spanning the input words. Commonly, dynamic programming techniques are used to decide on the optimal tree for each particular word span, considering all candidate splits into subspans, successively building longer spans in a bottom-up fashion (similar to chart-based constituent parsing). Machine learning drives the process of deciding among alternative candidate splits, i.e., feature information can draw on full structural information for the entire material in the span under consideration. However, due to the dynamic programming approach, the features cannot use arbitrarily complex structural configurations: otherwise the dynamic programming chart would have to be split into exponentially many special states.

1 More references will be provided in sec. 2.

The typical feature models are based on combinations of edges (so-called second-order factors) that closely follow the bottom-up combination of subspans in the parsing algorithm, i.e., the feature functions depend on the presence of two specific dependency edges. Configurations not directly supported by the bottom-up building of larger spans are more cumbersome to integrate into the model (since the combination algorithm has to be adjusted), in particular for third-order factors or higher.

Empirically, i.e., when applied in supervised machine learning experiments based on existing treebanks for various languages, both strategies (and further refinements of them not mentioned here) turn out roughly equal in their capability of picking up most of the relevant patterns well; some subtle strengths and weaknesses are complementary, such that stacking of two parsers representing both strategies yields the best results (Nivre and McDonald, 2008): in training and application, one of the parsers is run on each sentence prior to the other, providing additional feature information for the other parser. Another successful technique to combine parsers is voting as carried out by Sagae and Lavie (2006).

The present paper addresses the question if and how a more integrated combination of the strengths of the two strategies can be achieved and implemented efficiently to warrant competitive results.

The main issue and solution strategy. In order to preserve the conceptual (and complexity) advantages of the transition-based strategy, the integrated algorithm we are looking for has to be transition-based at the top level. The advantages of the graph-based approach – a more globally informed basis for the decision among different attachment options – have to be included as part of the scoring procedure. As a prerequisite, our algorithm will require a memory for storing alternative analyses among which to choose. This has been previously introduced in transition-based approaches in the form of a beam (Johansson and Nugues, 2006): rather than representing only the best-scoring history of transitions, the k best-scoring alternative histories are kept around.

As we will indicate in the following, the mere addition of beam search does not help overcome a representational key issue of transition-based parsing: in many situations, a transition-based parser is forced to make an attachment decision for a given input word at a point where no or only partial information about the word's own dependents (and further descendants) is available. Figure 1 illustrates such a case.

Figure 1: The left set of brackets indicates material that has been processed or is under consideration; on the right is the input, still to be processed. Access to information that is yet unavailable would help the parser to decide on the correct transition.

Here, the parser has to decide whether to create an edge between house and with or between bought and with (which is technically achieved by first popping house from the stack and then adding the edge). At this time, no information about the object of with is available; with fails to provide what we call a complete factor for the calculation of the scores of the alternative transitions under consideration. In other words, the model cannot make use of any evidence to distinguish between the two examples in Figure 1, and it is bound to get one of the two cases wrong.

Figure 2 illustrates the same case from the perspective of a graph-based parser.

Figure 2: A second order model as used in graph-based parsers has access to the crucial information to build the correct tree. In this case, the parser considers the word friend (as opposed to garden, for instance) as it introduces the bold-face edge.

Here, the combination of subspans is performed at a point when their internal structure has been finalized, i.e., the attachment of with (to bought or house) is not decided until it is clear that friend is the object of with; hence, the semantically important lexicalization of with's object informs the higher-level attachment decision through a so-called second order factor in the feature model.
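To make the role of the missing information concrete, the following toy sketch (illustrative only; the feature templates and weights are invented and do not come from the paper) contrasts a first-order view of the with-attachment with a second-order grandchild factor that becomes available once the object of with is known.

def first_order_feature(head, dep):
    return "1st:h=%s:d=%s" % (head, dep)

def grandchild_feature(head, dep, grandchild):
    # Second-order factor over head, dependent and the dependent's own
    # dependent (cf. Figure 2).
    return "2nd:h=%s:d=%s:g=%s" % (head, dep, grandchild)

def score(weights, feature):
    return weights.get(feature, 0.0)

# Toy weights standing in for a trained model.
w = {"2nd:h=bought:d=with:g=friend": 1.5,
     "2nd:h=house:d=with:g=garden": 1.2}

# Before the object of "with" has been attached, the two candidate
# attachments receive identical (here: zero) scores:
print(score(w, first_order_feature("bought", "with")))   # 0.0
print(score(w, first_order_feature("house", "with")))    # 0.0

# Once "friend" is attached to "with", the complete grandchild factor
# separates the two readings:
print(score(w, grandchild_feature("bought", "with", "friend")))  # 1.5
print(score(w, grandchild_feature("house", "with", "friend")))   # 0.0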

Given a suitable amount of training data, the model can thus learn to make the correct decision. The dynamic-programming based graph-based parser is designed in such a way that any score calculation is based on complete factors for the subspans that are combined at this point.

Note that the problem for the transition-based parser cannot be remedied by beam search alone. If we were to keep the two options for attaching with around in a beam (say, with a slightly higher score for attachment to house, but with bought following narrowly behind), there would be no point in the further processing of the sentence at which the choice could be corrected: the transition-based parser still needs to make the decision that friend is attached to with, but this will not lead the parser to reconsider the decision made earlier on.

The strategy we describe in this paper applies in this very type of situation: whenever information is added in the transition-based parsing process, the scores of all the histories stored in the beam are recalculated based on a scoring model inspired by the graph-based parsing approach, i.e., taking complete factors into account as they become incrementally available. As a consequence the beam is reordered, and hence, the incorrect preference of an attachment of with to house (based on incomplete factors) can later be corrected as friend is processed and the complete second-order factor becomes available.2

The integrated transition-based parsing strategy has a number of advantages:
(1) We can integrate and investigate a number of third order factors, without the need to implement a more complex parsing model each time anew to explore the properties of such a distinct model.
(2) The parser with completion model maintains the favorable complexity of transition-based parsers.
(3) The completion model compensates for the lower accuracy of cases when only incomplete information is available.
(4) The parser combines the two leading parsing paradigms in a single efficient parser without stacking the two approaches. Therefore the parser requires only one training phase (without jackknifing) and it uses only a single transition-based decoder.

The structure of this paper is as follows. In Section 2, we discuss related work. In Section 3, we introduce our transition-based parser and in Section 4 the completion model as well as the implementation of third order models. In Section 5, we describe experiments and provide evaluation results on selected data sets.

2 Related Work

Kudo and Matsumoto (2002) and Yamada and Matsumoto (2003) carried over the idea for deterministic parsing by chunks from Abney (1991) to dependency parsing. Nivre (2003) describes in a more strict sense the first incremental parser that tries to find the most appropriate dependency tree by a sequence of local transitions. In order to optimize the results towards a more globally optimal solution, Johansson and Nugues (2006) first applied beam search, which leads to a substantial improvement of the results (cf. also (Titov and Henderson, 2007)). Zhang and Clark (2008) augment the beam-search algorithm, adapting the early update strategy of Collins and Roark (2004) to dependency parsing. In this approach, the parser stops and updates the model when the oracle transition sequence drops out of the beam. In contrast to most other approaches, the training procedure of Zhang and Clark (2008) takes the complete transition sequence into account as it is calculating the update. Zhang and Clark compare aspects of transition-based and graph-based parsing, and end up using a transition-based parser with a combined transition-based/second-order graph-based scoring model (Zhang and Clark, 2008, 567), which is similar to the approach we describe in this paper. However, their approach does not involve beam rescoring as the partial structures built by the transition-based parser are subsequently augmented; hence, there are cases in which our approach is able to differentiate based on higher-order factors that go unnoticed by the combined model of (Zhang and Clark, 2008, 567).

One step beyond the use of a beam is a dynamic programming approach to carry out a full search in the state space, cf. (Huang and Sagae, 2010; Kuhlmann et al., 2011).

2 Since search is not exhaustive, there is of course a slight danger that the correct history drops out of the beam before complete information becomes available. But as our experiments show, this does not seem to be a serious issue empirically.

However, in this case one has to restrict the employed features to a set which fits to the elements composed by the dynamic programming approach. This is a trade-off between an exhaustive search and an unrestricted (rich) feature set, and the question which provides a higher accuracy is still an open research question, cf. (Kuhlmann et al., 2011).

Parsing of non-projective dependency trees is an important feature for many languages. At first most algorithms were restricted to projective dependency trees and used pseudo-projective parsing (Kahane et al., 1998; Nivre and Nilsson, 2005). Later, additional transitions were introduced to handle non-projectivity (Attardi, 2006; Nivre, 2009). The most common strategy uses the swap transition (Nivre, 2009; Nivre et al., 2009); an alternative solution uses two planes and a switch transition to switch between the two planes (Gómez-Rodríguez and Nivre, 2010).

Since we use the scoring model of a graph-based parser, we briefly review related work on graph-based parsing. The most well known graph-based parser is the MST (maximum spanning tree) parser, cf. (McDonald et al., 2005; McDonald and Pereira, 2006). The idea of the MST parser is to find the highest scoring tree in a graph that contains all possible edges. Eisner (1996) introduced a dynamic programming algorithm to solve this problem efficiently. Carreras (2007) introduced the left-most and right-most grandchild as factors. We use the factor model of Carreras (2007) as starting point for our experiments, cf. Section 4. We extend Carreras (2007) graph-based model with factors involving three edges similar to that of Koo and Collins (2010).

3 Transition-based Parser with a Beam

This section specifies the transition-based beam-search parser underlying the combined approach more formally. Sec. 4 will discuss the graph-based scoring model that we are adding.

The input to the parser is a word string x, the goal is to find the optimal set y of labeled edges xi →l xj forming a dependency tree over x ∪ {root}. We characterize the state of a transition-based parser as πi = ⟨σi, βi, yi, hi⟩, πi ∈ Π, the set of possible states. σi is a stack of words from x that are still under consideration; βi is the input buffer, the suffix of x yet to be processed; yi the set of labeled edges already assigned (a partial labeled dependency tree); hi is a sequence recording the history of transitions (from the set of operations τ = {shift, left-arcl, right-arcl, reduce, swap}) taken up to this point.

(1) The initial state π0 has an empty stack, the input buffer is the full input string x, and the edge set is empty. (2) The (partial) transition function δ(πi, t) : Π × τ → Π maps a state and an operation t to a new state πi+1. (3) Final states πf are characterized by an empty input buffer and stack; no further transitions can be taken.

The transition function is informally defined as follows: The shift transition removes the first element of the input buffer and pushes it to the stack. The left-arcl transition adds an edge with label l from the first word in the buffer to the word on top of the stack, removes the top element from the stack and pushes the first element of the input buffer to the stack. The right-arcl transition adds an edge from the word on top of the stack to the first word in the input buffer and removes the top element of the input buffer and pushes that element onto the stack. The reduce transition pops the top word from the stack. The swap changes the order of the two top elements on the stack (possibly generating non-projective trees).

When more than one operation is applicable, a scoring function assigns a numerical value (based on a feature vector and a weight vector trained by supervised machine learning) to each possible continuation. When using a beam search approach with beam size k, the highest-scoring k alternative states with the same length n of transition history h are kept in a set beamn.

In the beam-based parsing algorithm (cf. the pseudo code in Algorithm 1), all candidate states for the next set beamn+1 are determined using the transition function δ, but based on the scoring function, only the best k are preserved. (Final) states to which no more transitions apply are copied to the next state set. This means that once all transition paths have reached a final state, the overall best-scoring states can be read off the final beamn. The y of the top-scoring state is the predicted parse.

Under the plain transition-based scoring regime scoreT, the score for a state π is the sum of the local scores for the transitions ti in the state's history sequence:

scoreT(π) = Σ_{i=0}^{|h|} w · f(πi, ti)
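As a minimal, self-contained restatement of this transition system (a sketch written for this edition, not the authors' implementation, and omitting the preconditions and full label handling a real parser needs), the state tuple and the five transitions can be written as follows.

from dataclasses import dataclass, field

@dataclass
class State:
    stack: list             # words (indices) still under consideration
    buffer: list            # suffix of the input yet to be processed
    arcs: set = field(default_factory=set)   # labeled edges (head, label, dep)
    history: tuple = ()     # transitions taken so far

def shift(s):
    return State(s.stack + [s.buffer[0]], s.buffer[1:], set(s.arcs),
                 s.history + (("shift",),))

def left_arc(s, label):
    # Arc from the front of the buffer to the top of the stack; the stack
    # top is removed and the buffer front is moved onto the stack.
    arc = (s.buffer[0], label, s.stack[-1])
    return State(s.stack[:-1] + [s.buffer[0]], s.buffer[1:],
                 s.arcs | {arc}, s.history + (("left-arc", label),))

def right_arc(s, label):
    # Arc from the top of the stack to the front of the buffer; the buffer
    # front is moved onto the stack.
    arc = (s.stack[-1], label, s.buffer[0])
    return State(s.stack + [s.buffer[0]], s.buffer[1:],
                 s.arcs | {arc}, s.history + (("right-arc", label),))

def reduce_(s):
    return State(s.stack[:-1], list(s.buffer), set(s.arcs),
                 s.history + (("reduce",),))

def swap(s):
    new_stack = s.stack[:-2] + [s.stack[-1], s.stack[-2]]
    return State(new_stack, list(s.buffer), set(s.arcs),
                 s.history + (("swap",),))

A beam-search driver in the spirit of Algorithm 1 below then expands each state in the beam with every applicable transition, scores the candidates, and keeps the k best.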

Algorithm 1: Transition-based parser
  // x is the input sentence, k is the beam size
  σ0 ← ∅, β0 ← x, y0 ← ∅, h0 ← ∅
  π0 ← ⟨σ0, β0, y0, h0⟩   // initial parts of a state
  beam0 ← {π0}            // create initial state
  n ← 0                   // iteration
  repeat
    n ← n + 1
    for all πj ∈ beamn−1 do
      transitions ← possible-applicable-transition(πj)
      // if no transition is applicable keep state πj:
      if transitions = ∅ then beamn ← beamn ∪ {πj}
      else for all ti ∈ transitions do
        // apply the transition ti to state πj
        π ← δ(πj, ti)
        beamn ← beamn ∪ {π}
      // end for
    // end for
    sort beamn due to the score(πj)
    beamn ← sublist(beamn, 0, k)
  until beamn−1 = beamn   // beam changed?

w is the weight vector. Note that the features f(πi, ti) can take into account all structural and labeling information available prior to taking transition ti, i.e., the graph built so far, the words (and their part of speech etc.) on the stack and in the input buffer, etc. But if a larger graph configuration involving the next word evolves only later, as in Figure 1, this information is not taken into account in scoring. For instance, if the feature extraction uses the subcategorization frame of a word under consideration to compute a score, it is quite possible that some dependents are still missing and will only be attached in a future transition.

4 Completion Model

We define an augmented scoring function which can be used in the same beam-search algorithm in order to ensure that in the scoring of alternative transition paths, larger configurations can be exploited as they are completed in the incremental process. The feature configurations can be largely taken from graph-based approaches. Here, spans from the string are assembled in a bottom-up fashion, and the scoring for an edge can be based on structurally completed subspans (factors).

Our completion model for scoring a state πn incorporates factors for all configurations (matching the extraction scheme that is applied) that are present in the partial dependency graph yn built up to this point, which is continuously augmented. This means if at a given point n in the transition path, complete information for a particular configuration (e.g., a third-order factor involving a head, its dependent and its grand-child dependent) is unavailable, scoring will ignore this factor at time n, but the configuration will inform the scoring later on, maybe at point n + 4, when the complete information for this factor has entered the partial graph yn+4.

We present results for a number of different second-order and third-order feature models.

Second Order Factors. We start with the model introduced by Carreras (2007). Figure 3 illustrates the factors used.

Figure 3: Model 2a. Second order factors of Carreras (2007). We omit the right-headed cases, which are mirror images. The model comprises a factoring into one first order part and three second order factors (2-4): 1) The head (h) and the dependent (c); 2) the head, the dependent and the left-most (or right-most) grandchild in between (cmi); 3) the head, the dependent and the right-most (or left-most) grandchild away from the head (cmo); 4) the head, the dependent and between those words the right-most (or left-most) sibling (ci).

Figure 4: 2b. The left-most dependent of the head or the right-most dependent in the right-headed case.

Figure 4 illustrates a new type of factor we use, which includes the left-most dependent in the left-headed case and symmetrically the right-most sibling in the right-headed case.

Third Order Factors. In addition to the second order factors, we investigate combinations of third order factors. Figure 5 and 6 illustrate the third order factors, which are similar to the factors of Koo and Collins (2010). They restrict the factor to the innermost sibling pair for the tri-siblings

and the outermost pair for the grand-siblings. We model, we have to add the scoring function (2a)
use the first two siblings of the dependent from the sum:
the left side of the head for the tri-siblings and (2b) scoreG2b (x, y) = scoreG2a (x, y)
the first two dependents of the child for the grand- P
+ (h,c,cmi)y w fgra (x,h,c,cmi)
siblings. With these factors, we aim to capture
non-projective edges and subcategorization infor-
In order to build a scoring function for combi-
mation. Figure 7 illustrates a factor of a sequence
nation of the factors shown in Figure 5 to 7, we
of four nodes. All the right headed variants are
have to add to the equation 2b one or more of the
symmetrically and left out for brevity.
following sums:
P
(3a) (h,c,ch1,ch2)y w fgra (x,h,c,ch1,ch2)
P
(3b) (h,c,cm1,cm2)y w fgra (x,h,c,cm1,cm2)
P
(3c) (h,c,cmo,tmo)y w fgra (x,h,c,cmo,tmo)
Figure 5: 3a. The first two children of the head, which
do not include the edge between the head and the de-
pendent.
Feature Set. The feature set of the transition
model is similar to that of Zhang and Nivre
(2011). In addition, we use the cross product of
morphologic features between the head and the
dependent since we apply also the parser on mor-
phologic rich languages.
Figure 6: 3b. The first two children of the dependent.
The feature sets of the completion model de-
scribed above are mostly based on previous work
(McDonald et al., 2005; McDonald and Pereira,
2006; Carreras, 2007; Koo and Collins, 2010).
The models denoted with + use all combinations
Figure 7: 3c. The right-most dependent of the right-
of words before and after the head, dependent,
most dependent.
sibling, grandchilrden, etc. These are respectively
three-, and four-grams for the first order and sec-
Integrated approach. To obtain an integrated
ond order. The algorithm includes these features
system for the various feature models, the scoring
only the words left and right do not overlap with
function of the transition-based parser from Sec-
the factor (e.g. the head, dependent, etc.). We use
tion 3 is augmented by a family of scoring func-
feature extraction procedure for second order, and
tions scoreGm for the completion model, where m
third order factors. Each feature extracted in this
is from 2a, 2b, 3a etc., x is the input string, and y
procedure includes information about the position
is the (partial) dependency tree built so far:
of the nodes relative to the other nodes of the part
scoreTm () = scoreT () + scoreGm (x, y) and a factor identifier.
The scoring function of the completion model
depends on the selected factor model Gm . The Training. For the training of our parser, we use
model G2a comprises the edge factoring of Fig- a variant of the perceptron algorithm that uses the
ure 3. With this model, we obtain the following Passive-Aggressive update function, cf. (Freund
scoring function. and Schapire, 1998; Collins, 2002; Crammer et
al., 2006). The Passive-Aggressive perceptron
P
scoreG2a (x, y) = (h,c)y w ff irst (x,h,c) uses an aggressive update strategy by modifying
P
+ (h,c,ci)y w fsib (x,h,c,ci) the weight vector by as much as needed to clas-
P
+ (h,c,cmo)y w fgra (x,h,c,cmo) sify correctly the current example, cf. (Crammer
P
+ (h,c,cmi)y w fgra (x,h,c,cmi) et al., 2006). We apply a random function (hash
function) to retrieve the weights from the weight
The function f maps the input sentence x, and vector instead of a table. Bohnet (2010) showed
a subtree y defined by the indexes to a feature- that the Hash Kernel improves parsing speed and
vector. Again, w is the corresponding weight vec- accuracy since the parser uses additionaly nega-
tor. In order to add the factor of Figure 4 to our tive features. Ganchev and Dredze (2008) used
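The completion-model scoring functions above (partially garbled in this extraction) sum factor scores over whatever configurations are already present in the partial graph, and the combined state score adds this graph-based part to the transition score. The sketch below illustrates only that general shape, with invented feature strings; it does not reproduce the paper's feature templates, factor inventory, or the bookkeeping optimizations described in the implementation section.

def completion_score(weights, arcs):
    # arcs: set of (head, label, dep) edges of the partial dependency graph.
    deps_of = {}
    for h, lab, d in arcs:
        deps_of.setdefault(h, []).append(d)
    total = 0.0
    for h, lab, d in arcs:
        total += weights.get("first:%s:%s:%s" % (h, lab, d), 0.0)
        # Sibling factors: other dependents of the same head.
        for sib in deps_of.get(h, []):
            if sib != d:
                total += weights.get("sib:%s:%s:%s" % (h, d, sib), 0.0)
        # Grandchild factors: dependents of the dependent; these only
        # contribute once the grandchild has actually been attached.
        for gc in deps_of.get(d, []):
            total += weights.get("gra:%s:%s:%s" % (h, d, gc), 0.0)
    return total

def rescore_beam(candidates, weights):
    # candidates: list of (transition_score, arcs) pairs for the states in
    # the beam; combined score in the spirit of score_Tm = score_T + score_Gm.
    return sorted(candidates,
                  key=lambda c: c[0] + completion_score(weights, c[1]),
                  reverse=True)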

this technique for structured prediction in NLP to select the best parse tree. The complexity of the
reduce the needed space, cf. (Shi et al., 2009). transition-based parser is quadratic due to swap
We use as weight vector size 800 million. After operation in the worse case, which is rare, and
the training, we counted 65 millions non zero O(n) in the best case, cf. (Nivre, 2009). The
weights for English (penn2malt), 83 for Czech beam size B is constant. Hence, the complexity
and 87 millions for German. The feature vectors is in the worst case O(n2 ).
are the union of features originating from the The parsing time is to a large degree deter-
transition sequence of a sentence and the features mined by the feature extraction, the score calcu-
of the factors over all edges of a dependency tree lation and the implementation, cf. also (Goldberg
(e.g. G2a , etc.). To prevent over-fitting, we use and Elhadad, 2010). The transition-based parser
averaging to cope with this problem, cf. (Freund is able to parse 30 sentences per second. The
and Schapire, 1998; Collins, 2002). We calculate parser with completion model processes about 5
the error e as the sum of all attachment errors and sentences per second with a beam size of 80.
label errors both weighted by 0.5. We use the Note, we use a rich feature set, a completion
following equations to compute the update. model with third order factors, negative features,
and a large beam. 3
loss: lt = e-(scoreT (xgt , ytg )-scoreT (xt , yt )) We implemented the following optimizations:
(1) We use a parallel feature extraction for the
lt
PA-update: t = ||fg fp ||2 beam elements. Each process extracts the fea-
tures, scores the possible transitions and computes
We train the model to select the transitions and the score of the completion model. After the ex-
the completion model together and therefore, we tension step, the beam is sorted and the best ele-
use one parameter space. In order to compute the ments are selected according to the beam size.
weight vector, we employ standard online learn- (2) The calculation of each score is optimized (be-
ing with 25 training iterations, and carry out early yond the distinction of a static and a dynamic
updates, cf. Collins and Roark (2004; Zhang and component): We calculate for each location de-
Clark (2008). termined by the last element sl i and the first
element of b0 i a numeric feature representa-
Efficient Implementation. Keeping the scoring tion. This is kept fix and we add only the numeric
with the completion model tractable with millions value for each of the edge labels plus a value for
of feature weights and for second- and third-order the transition left-arc or right-arc. In this way, we
factors requires careful bookkeeping and a num- create the features incrementally. This has some
ber of specialized techniques from recent work on similarity to Goldberg and Elhadad (2010).
dependency parsing. (3) We apply edge filtering as it is used in graph-
We use two variables to store the scores (a) based dependency parsing, cf. (Johansson and
for complete factors and (b) for incomplete fac- Nugues, 2008), i.e., we calculate the edge weights
tors. The complete factors (first-order factors and only for the labels that were found for the part-of-
higher-order factors for which further augmenta- speech combination of the head and dependent in
tion is structurally excluded) need to be calculated the training data.
only once and can then be stored with the tree fac-
tors. The incomplete factors (higher-order factors 5 Parsing Experiments and Discussion
whose node elements may still receive additional
descendants) need to be dynamically recomputed The results of different parsing systems are of-
while the tree is built. ten hard to compare due to differences in phrase
structure to dependency conversions, corpus ver-
The parsing algorithm only has to compute the
sion, and experimental settings. For better com-
scores of the factored model when the transition-
parison, we provide results on English for two
based parser selects a left-arc or right-arc transi-
commonly used data sets, based on two differ-
tion and the beam has to be sorted. The parser
ent conversions of the Penn Treebank. The first
sorts the beam when it exceeds the maximal beam
uses the Penn2Malt conversion based on the head-
size, in order to discard superfluous parses or
3
when the parsing algorithm terminates in order to 6 core, 3.33 Ghz Intel Nehalem

Section Sentences PoS Acc. Parser UAS LAS
Training 2-21 39.832 97.08 (McDonald et al., 2005) 90.9
Dev 24 1.394 97.18 (McDonald and Pereira, 2006) 91.5
Test 23 2.416 97.30 (Huang and Sagae, 2010) 92.1
(Zhang and Nivre, 2011) 92.9
Table 1: Overview of the training, development and (Koo and Collins, 2010) 93.04
test data split converted to dependency graphs with (Martins et al., 2010) 93.26
T (baseline) 92.7
head-finding rules of (Yamada and Matsumoto, 2003).
G2a (baseline) 92.89
The last column shows the accuracy of Part-of-Speech
T2a 92.94 91.87
tags. T2ab 93.16 92.08
T2ab3a 93.20 92.10
T2ab3b 93.23 92.15
finding rules of Yamada and Matsumoto (2003).
T2ab3c 93.17 92.10
Table 1 gives an overview of the properties of the T2ab3abc+ 93.39 92.38
corpus. The annotation of the corpus does not G2a+ 93.1
contain non-projective links. The training data (Koo et al., 2008) 93.16
was 10-fold jackknifed with our own tagger.4 . Ta- (Carreras et al., 2008) 93.5
(Suzuki et al., 2009) 93.79
ble 1 shows the tagging accuracy.
Table 2 lists the accuracy of our transition- Table 2: English Attachment Scores for the
based parser with completion model together with Penn2Malt conversion of the Penn Treebank for the
results from related work. All results use pre- test set. Punctuation is excluded from the evaluation.
dicted PoS tags. As a baseline, we present in ad- The results marked with are not directly comparable
to our work as they depend on additional sources of
dition results without the completion model and
information (Brown Clusters).
a graph-based parser with second order features
(G2a ). For the Graph-based parser, we used 10
training iterations. The following rows denoted tags. From the same data set, we selected the
with Ta , T2a , T2ab , T2ab3a , T2ab3b , T2ab3bc , and corpora for Czech and German. In all cases, we
T2a3abc present the result for the parser with com- used the provided training, development, and test
pletion model. The subscript letters denote the data split, cf. (Hajic et al., 2009). In contrast
used factors of the completion model as shown to the evaluation of the Penn2Malt conversion,
in Figure 3 to 7. The parsers with subscribed plus we include punctuation marks for these corpora
(e.g. G2a+ ) in addition use feature templates that and follow in that the evaluation schema of the
contain one word left or right of the head, depen- CoNLL Shared Task 2009. Table 3 presents the
dent, siblings, and grandchildren. We left those results as obtained for these data set.
feature in our previous models out as they may in- The transition-based parser obtains higher ac-
terfere with the second and third order factors. As curacy scores for Czech but still lower scores for
in previous work, we exclude punctuation marks English and German. For Czech, the result of T
for the English data converted with Penn2Malt in is 1.59 percentage points higher than the top la-
the evaluation, cf. (McDonald et al., 2005; Koo beled score in the CoNLL shared task 2009. The
and Collins, 2010; Zhang and Nivre, 2011).5 We reason is that T includes already third order fea-
optimized the feature model of our parser on sec- tures that are needed to determine some edge la-
tion 24 and used section 23 for evaluation. We use bels. The transition-based parser with completion
a beam size of 80 for our transition-based parser model T2a has even 2.62 percentage points higher
and 25 training iterations. accuracy and it could improve the results of the
The second English data set was obtained by parser T by additional 1.03 percentage points.
using the LTH conversion schema as used in the The results of the parser T are lower for English
CoNLL Shared Task 2009, cf. (Hajic et al., 2009). and German compared to the results of the graph-
This corpus preserves the non-projectivity of the based parser G2a . The completion model T2a can
phrase structure annotation, it has a rich edge reach a similar accuracy level for these two lan-
label set, and provides automatic assigned PoS guages. The third order features let the transition-
4
http://code.google.com/p/mate-tools/
based parser reach higher scores than the graph-
5
We follow Koo and Collins (2010) and ignore any token based parser. The third order features contribute
whose POS tag is one of the following tokens :,. for each language a relatively small improvement

Parser Eng. Czech German 6 Conclusion and Future Work
(Gesmundo
et al., 2009) 88.79/- 80.38 87.29 The parser introduced in this paper combines
(Bohnet, 2009) 89.88/- 80.11 87.48
advantageous properties from the two major
T (Baseline) 89.52/92.10 81.97/87.26 87.53/89.86
G2a (Baseline) 90.14/92.36 81.13/87.65 87.79/90.12 paradigms in data-driven dependency parsing,
T2a 90.20/92.55 83.01/88.12 88.22/90.36 in particular worst case quadratic complexity of
T2ab 90.26/92.56 83.22/88.34 88.31/90.24 transition-based parsing with a swap operation
T2ab3a 90.20/90.51 83.21.88.30 88.14/90.23
and the consideration of complete second and
T2ab3b 90.26/92.57 83.22/88.35 88.50/90.59
T2ab3abc 90.31/92.58 83.31/88.30 88.33/90.45 third order factors in the scoring of alternatives.
G2a+ 90.39/92.8 81.43/88.0 88.26/90.50 While previous work using third order factors, cf.
T2ab3ab+ 90.36/92.66 83.48/88.47 88.51/90.62 Koo and Collins (2010), was restricted to unla-
beled and projective trees, our parser can produce
Table 3: Labeled Attachment Scores of parsers that
use the data sets of the CoNLL shared task 2009. In
labeled and non-projective dependency trees.
line with previous work, punctuation is included. The In contrast to parser stacking, which involves
parsers marked with used a joint model for syntactic running two parsers in training and application,
parsing and semantic role labelling. We provide more we use only the feature model of a graph-based
parsing results for the languages of CoNLL-X Shared parser but not the graph-based parsing algorithm.
Task at http://code.google.com/p/mate-tools/. This is not only conceptually superior, but makes
training much simpler, since no jackknifing has
Parser UAS LAS to be carried out. Zhang and Clark (2008) pro-
(Zhang and Clark, 2008) 84.3 posed a similar combination, without the rescor-
(Huang and Sagae, 2010) 85.2 ing procedure. Our implementation allows for the
(Zhang and Nivre, 2011) 86.0 84.4 use of rich feature sets in the combined scoring
T2ab3abc+ 87.5 85.9
functions, and our experimental results show that
Table 4: Chinese Attachment Scores for the conver- the graph-based completion model leads to an
sion of CTB 5 with head rules of Zhang and Clark increase of between 0.4 (for English) and about
(2008). We take the standard split of CTB 5 and use 1 percentage points (for Czech). The scores go
in line with previous work gold segmentation, POS- beyond the current state of the art results for ty-
tags and exclude punctuation marks for the evaluation.
pologically different languages such as Chinese,
Czech, English, and German. For Czech, English
(Penn2Malt) and German, these are to our knowl-
of the score. Small and statistically significant im- ege the highest reported scores of a dependency
provements provides the additional second order parser that does not use additional sources of in-
factor (2b).6 We tried to determine the best third formation (such as extra unlabeled training data
order factors or set of factors but we cannot denote for clustering). Note that the efficient techniques
such a factor which is the best for all languages. and implementation such as the Hash Kernel, the
For German, we obtained a significant improve- incremental calculation of the scores of the com-
ment with the factor (3b). We believe that this is pletion model, and the parallel feature extraction
due to the flat annotation of PPs in the German as well as the parallelized transition-based pars-
corpus. If we combine all third order factors we ing strategy play an important role in carrying out
obtain for the Penn2Malt conversion a small im- this idea in practice.
provement of 0.2 percentage points over the re-
sults of (2ab). We think that a more deep feature
selection for third order factors may help to im- References
prove the actuary further. S. Abney. 1991. Parsing by chunks. In Principle-
In Table 4, we present results on the Chinese Based Parsing, pages 257278. Kluwer Academic
Treebank. To our knowledge, we obtain the best Publishers.
published results so far. G. Attardi. 2006. Experiments with a Multilan-
guage Non-Projective Dependency Parser. In Tenth
Conference on Computational Natural Language
6
The results of the baseline T compared to T2ab3abc are Learning (CoNLL-X).
statistically significant (p < 0.01). B. Bohnet. 2009. Efficient Parsing of Syntactic and

Semantic Dependency Structures. In Proceedings S. Pado, J. Stepanek, P. Stranak, M. Surdeanu,
of the 13th Conference on Computational Natural N. Xue, and Y. Zhang. 2009. The CoNLL-2009
Language Learning (CoNLL-2009). shared task: Syntactic and semantic dependencies
B. Bohnet. 2010. Top accuracy and fast dependency in multiple languages. In Proceedings of the Thir-
parsing is not a contradiction. In Proceedings of the teenth Conference on Computational Natural Lan-
23rd International Conference on Computational guage Learning (CoNLL 2009): Shared Task, pages
Linguistics (Coling 2010), pages 8997, Beijing, 118, Boulder, United States, June.
China, August. Coling 2010 Organizing Commit- L. Huang and K. Sagae. 2010. Dynamic programming
tee. for linear-time incremental parsing. In Proceedings
X. Carreras, M. Collins, and T. Koo. 2008. Tag, of the 48th Annual Meeting of the Association for
dynamic programming, and the perceptron for ef- Computational Linguistics, pages 10771086, Up-
ficient, feature-rich parsing. In Proceedings of the psala, Sweden, July. Association for Computational
Twelfth Conference on Computational Natural Lan- Linguistics.
guage Learning, CoNLL 08, pages 916, Strouds- R. Johansson and P. Nugues. 2006. Investigating
burg, PA, USA. Association for Computational Lin- multilingual dependency parsing. In Proceedings
guistics. of the Shared Task Session of the Tenth Confer-
X. Carreras. 2007. Experiments with a Higher-order ence on Computational Natural Language Learning
Projective Dependency Parser. In EMNLP/CoNLL. (CoNLL-X), pages 206210, New York City, United
M. Collins and B. Roark. 2004. Incremental parsing States, June 8-9.
with the perceptron algorithm. In ACL, pages 111 R. Johansson and P. Nugues. 2008. Dependency-
118. based SyntacticSemantic Analysis with PropBank
M. Collins. 2002. Discriminative Training Methods and NomBank. In Proceedings of the Shared Task
for Hidden Markov Models: Theory and Experi- Session of CoNLL-2008, Manchester, UK.
ments with Perceptron Algorithms. In EMNLP. S. Kahane, A. Nasr, and O. Rambow. 1998.
Pseudo-projectivity: A polynomially parsable non-
K. Crammer, O. Dekel, S. Shalev-Shwartz, and
projective dependency grammar. In COLING-ACL,
Y. Singer. 2006. Online Passive-Aggressive Al-
pages 646652.
gorithms. Journal of Machine Learning Research,
T. Koo and M. Collins. 2010. Efficient third-order
7:551585.
dependency parsers. In Proceedings of the 48th
J. Eisner. 1996. Three New Probabilistic Models for
Annual Meeting of the Association for Computa-
Dependency Parsing: An Exploration. In Proceed-
tional Linguistics, pages 111, Uppsala, Sweden,
ings of the 16th International Conference on Com-
July. Association for Computational Linguistics.
putational Linguistics (COLING-96), pages 340
Terry Koo, Xavier Carreras, and Michael Collins.
345, Copenhaen.
2008. Simple semi-supervised dependency parsing.
Y. Freund and R. E. Schapire. 1998. Large margin pages 595603.
classification using the perceptron algorithm. In T. Kudo and Y. Matsumoto. 2002. Japanese de-
11th Annual Conference on Computational Learn- pendency analysis using cascaded chunking. In
ing Theory, pages 209217, New York, NY. ACM proceedings of the 6th conference on Natural lan-
Press. guage learning - Volume 20, COLING-02, pages 1
K. Ganchev and M. Dredze. 2008. Small statisti- 7, Stroudsburg, PA, USA. Association for Compu-
cal models by random feature mixing. In Proceed- tational Linguistics.
ings of the ACL-2008 Workshop on Mobile Lan- M. Kuhlmann, C. Gomez-Rodrguez, and G. Satta.
guage Processing. Association for Computational 2011. Dynamic programming algorithms for
Linguistics. transition-based dependency parsers. In ACL, pages
A. Gesmundo, J. Henderson, P. Merlo, and I. Titov. 673682.
2009. A Latent Variable Model of Syn- Andre Martins, Noah Smith, Eric Xing, Pedro Aguiar,
chronous Syntactic-Semantic Parsing for Multiple and Mario Figueiredo. 2010. Turbo parsers: De-
Languages. In Proceedings of the 13th Confer- pendency parsing by approximate variational infer-
ence on Computational Natural Language Learning ence. pages 3444.
(CoNLL-2009), Boulder, Colorado, USA., June 4-5. R. McDonald and F. Pereira. 2006. Online Learning
Y. Goldberg and M. Elhadad. 2010. An efficient al- of Approximate Dependency Parsing Algorithms.
gorithm for easy-first non-directional dependency In In Proc. of EACL, pages 8188.
parsing. In HLT-NAACL, pages 742750. R. McDonald, K. Crammer, and F. Pereira. 2005. On-
C. Gomez-Rodrguez and J. Nivre. 2010. A line Large-margin Training of Dependency Parsers.
Transition-Based Parser for 2-Planar Dependency In Proc. ACL, pages 9198.
Structures. In ACL, pages 14921501. J. Nivre and R. McDonald. 2008. Integrating Graph-
J. Hajic, M. Ciaramita, R. Johansson, D. Kawahara, Based and Transition-Based Dependency Parsers.
M. Antonia Mart, L. Marquez, A. Meyers, J. Nivre, In ACL-08, pages 950958, Columbus, Ohio.

J. Nivre and J. Nilsson. 2005. Pseudo-projective de-
pendency parsing. In ACL.
J. Nivre, M. Kuhlmann, and J. Hall. 2009. An im-
proved oracle for dependency parsing with online
reordering. In Proceedings of the 11th Interna-
tional Conference on Parsing Technologies, IWPT
09, pages 7376, Stroudsburg, PA, USA. Associa-
tion for Computational Linguistics.
J. Nivre. 2003. An Efficient Algorithm for Pro-
jective Dependency Parsing. In 8th International
Workshop on Parsing Technologies, pages 149160,
Nancy, France.
J. Nivre. 2009. Non-Projective Dependency Parsing
in Expected Linear Time. In Proceedings of the
47th Annual Meeting of the ACL and the 4th IJC-
NLP of the AFNLP, pages 351359, Suntec, Singa-
pore.
K. Sagae and A. Lavie. 2006. Parser combina-
tion by reparsing. In NAACL 06: Proceedings of
the Human Language Technology Conference of the
NAACL, Companion Volume: Short Papers on XX,
pages 129132, Morristown, NJ, USA. Association
for Computational Linguistics.
Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola,
and S.V.N. Vishwanathan. 2009. Hash Kernels for
Structured Data. In Journal of Machine Learning.
J. Suzuki, H. Isozaki, X. Carreras, and M Collins.
2009. An empirical study of semi-supervised struc-
tured conditional models for dependency parsing.
In EMNLP, pages 551560.
I. Titov and J. Henderson. 2007. A Latent Variable
Model for Generative Dependency Parsing. In Pro-
ceedings of IWPT, pages 144155.
H. Yamada and Y. Matsumoto. 2003. Statistical De-
pendency Analysis with Support Vector Machines.
In Proceedings of IWPT, pages 195206.
Y. Zhang and S. Clark. 2008. A tale of two
parsers: investigating and combining graph-based
and transition-based dependency parsing using
beam-search. In Proceedings of EMNLP, Hawaii,
USA.
Y. Zhang and J. Nivre. 2011. Transition-based de-
pendency parsing with rich non-local features. In
Proceedings of the 49th Annual Meeting of the As-
sociation for Computational Linguistics: Human
Language Technologies, pages 188193, Portland,
Oregon, USA, June. Association for Computational
Linguistics.

Answer Sentence Retrieval by Matching Dependency Paths
Acquired from Question/Answer Sentence Pairs

Michael Kaisser
AGT Group (R&D) GmbH
Jagerstr. 41, 10117 Berlin, Germany
mkaisser@agtgermany.com

Abstract

In Information Retrieval (IR) in general and Question Answering (QA) in particular, queries and relevant textual content often significantly differ in their properties and are therefore difficult to relate with traditional IR methods, e.g. key-word matching. In this paper we describe an algorithm that addresses this problem, but rather than looking at it on a term matching/term reformulation level, we focus on the syntactic differences between questions and relevant text passages. To this end we propose a novel algorithm that analyzes dependency structures of queries and known relevant text passages and acquires transformational patterns that can be used to retrieve relevant textual content. We evaluate our algorithm in a QA setting, and show that it outperforms a baseline that uses only dependency information contained in the questions by 300% and that it also improves performance of a state of the art QA system significantly.

1 Introduction

It is a well known problem in Information Retrieval (IR) and Question Answering (QA) that queries and relevant textual content often significantly differ in their properties, and are therefore difficult to match with traditional IR methods. A common example is a user entering words to describe their information need that do not match the words used in the most relevant indexed documents. This work addresses this problem, but shifts focus from words to syntactic structures of questions and relevant pieces of text. To this end, we present a novel algorithm that analyses the dependency structures of known valid answer sentences and from these acquires patterns that can be used to more precisely retrieve relevant text passages from the underlying document collection. To achieve this, the position of key phrases in the answer sentence relative to the answer itself is analyzed and linked to a certain syntactic question type. Unlike most previous work that uses dependency paths for QA (see Section 2), our approach does not require a candidate sentence to be similar to the question in any respect. We learn valid dependency structures from the known answer sentences alone, and therefore are able to link a much wider spectrum of answer sentences to the question.

The work in this paper is presented and evaluated in a classical factoid Question Answering (QA) setting. The main reason for this is that in QA suitable training and test data is available in the public domain, e.g. via the Text REtrieval Conference (TREC), see for example (Voorhees, 1999). The methods described in this paper however can also be applied to other IR scenarios, e.g. web search. The necessary condition for our approach to work is that the user query is somewhat grammatically well formed; such queries are commonly referred to as Natural Language Queries or NLQs.

Table 1 provides evidence that users indeed search the web with NLQs. The data is based on two query sets sampled from three months of user logs from a popular search engine, using two different sampling techniques. The head set samples queries taking query frequency into account, so that more common queries have a proportionally higher chance of being selected. The tail query set samples only queries that have been
issued less than 500 times during a three-month period and it disregards query frequency. As a result, rare and frequent queries have the same chance of being selected. Doubles are excluded from both sets. Table 1 lists the percentage of queries in the query sets that start with the specified word. In most contexts this indicates that the query is a question, which in turn means that we are dealing with an NLQ. Of course there are many NLQs that start with words other than the ones listed, so we can expect their real percentage to be even higher.

Set        Head     Tail
Query #    15,665   12,500
how        1.33%    2.42%
what       0.77%    1.89%
define     0.34%    0.18%
is/are     0.25%    0.42%
where      0.18%    0.45%
do/does    0.14%    0.30%
can        0.14%    0.25%
why        0.13%    0.30%
who        0.12%    0.38%
when       0.09%    0.21%
which      0.03%    0.08%
Total      3.55%    6.86%

Table 1: Percentages of Natural Language queries in head and tail search engine query logs. See text for details.

2 Related Work

In IR the problem that queries and relevant textual content often do not exhibit the same terms is commonly encountered. Latent Semantic Indexing (Deerwester et al., 1990) was an early, highly influential approach to solve this problem. More recently, a significant amount of research is dedicated to query alteration approaches. (Cui et al., 2002), for example, assume that if queries containing one term often result in the selection of documents containing another term, then a strong relationship between the two terms exists. In their approach, query terms and document terms are linked via sessions in which users click on documents that are presented as results for the query. (Riezler and Liu, 2010) apply a Statistical Machine Translation model to parallel data consisting of user queries and snippets from clicked web documents and in such a way extract contextual expansion terms from the query rewrites.

We see our work as addressing the same fundamental problem, but shifting focus from query term/document term mismatch to mismatches observed between the grammatical structure of Natural Language Queries and relevant text pieces. In order to achieve this we analyze the syntactic structure of the queries and of the relevant content by using dependency paths.

Especially in QA there is a strong tradition of using dependency structures: (Lin and Pantel, 2001) present an unsupervised algorithm to automatically discover inference rules (essentially paraphrases) from text. These inference rules are based on dependency paths, each of which connects two nouns. Their paths have the following form:

N:subj:V←find→V:obj:N→solution→N:to:N

This path represents the relation "X finds a solution to Y" and can be mapped to another path representing e.g. "X solves Y". As such the approach is suitable to detect paraphrases that describe the relation between two entities in documents. However, the paper does not describe how the mined paraphrases can be linked to questions, and which paraphrase is suitable to answer which question type.

(Attardi et al., 2001) describes a QA system that, after a set of candidate answer sentences has been identified, matches their dependency relations against the question. Questions and answer sentences are parsed with MiniPar (Lin, 1998) and the dependency output is analyzed in order to determine whether relations present in a question also appear in a candidate sentence. For the question "Who killed John F. Kennedy?", for example, an answer sentence is expected to contain the answer as subject of the verb "kill", to which "John F. Kennedy" should be in object relation.

(Cui et al., 2005) describe a fuzzy dependency relation matching approach to passage retrieval in QA. Here, the authors present a statistical technique to measure the degree of overlap between dependency relations in candidate sentences with their corresponding relations in the question. Question/answer passage pairs from TREC-8 and TREC-9 evaluations are used as training data. As in some of the papers mentioned earlier, a statistical translation model is used, but this time to learn relatedness between paths. (Cui et al., 2004) apply the same idea to answer
extraction. In each sentence returned by the IR module, all named entities of the expected answer types are treated as answer candidates. For questions with an unknown answer type, all NPs in the candidate sentence are considered. Then those paths in the answer sentence that are connected to an answer candidate are compared against the corresponding paths in the question, in a similar fashion as in (Cui et al., 2005). The candidate whose paths show the highest matching score is selected. (Shen and Klakow, 2006) also describe a method that is primarily based on similarity scores between dependency relation pairs. However, their algorithm computes the similarity of paths between key phrases, not between words. Furthermore, it takes relations in a path not as independent from each other, but acknowledges that they form a sequence, by comparing two paths with the help of an adaptation of the Dynamic Time Warping algorithm (Rabiner et al., 1991). (Molla, 2006) presents an approach for the acquisition of question answering rules by applying graph manipulation methods. Questions are represented as dependency graphs, which are extended with information from answer sentences. These combined graphs can then be used to identify answers. Finally, in (Wang et al., 2007), a quasi-synchronous grammar (Smith and Eisner, 2006) is used to model relations between questions and answer sentences.

In this paper we describe an algorithm that learns possible syntactic answer sentence formulations for syntactic question classes from a set of example question/answer sentence pairs. Unlike the related work described above, it acknowledges that a) a valid answer sentence's syntax might be very different from the question's syntax and b) several valid answer sentence structures, which might be completely independent from each other, can exist for one and the same question.

To illustrate this consider the question "When was Alaska purchased?" The following four sentences all answer the given question, but only the first sentence is a straightforward reformulation of the question:

1. The United States purchased Alaska in 1867 from Russia.
2. Alaska was bought from Russia in 1867.
3. In 1867, the Russian Empire sold the Alaska territory to the USA.
4. The acquisition of Alaska by the United States of America from Russia in 1867 is known as Seward's Folly.

The remaining three sentences introduce various forms of syntactic and semantic transformations. In order to capture a wide range of possible ways in which answer sentences can be formulated, in our model a candidate sentence is not evaluated according to its similarity with the question. Instead, its similarity to known answer sentences (which were presented to the system during training) is evaluated. This allows us to capture a much wider range of syntactic and semantic transformations.

3 Overview of the Algorithm

Our algorithm uses input data containing pairs of the following:

NLQs/Questions: NLQs that describe the user's information need. For the experiments carried out in this paper we use questions from the TREC QA track 2002-2006.

Relevant textual content: This is a piece of text that is relevant to the user query in that it contains the information the user is searching for. In this paper, we use sentences extracted from the AQUAINT corpus (Graff, 2002) that contain the answer to the given TREC question.

In total, the data available to us for our experiments consists of 8,830 question/answer sentence pairs. This data is publicly available, see (Kaisser and Lowe, 2008). The algorithm described in this paper has three main steps:

Phrase alignment: Key phrases from the question are paired with phrases from the answer sentences.

Pattern creation: The dependency structures of queries and answer sentences are analyzed and patterns are extracted.

Pattern evaluation: The patterns discovered in the last step are evaluated and a confidence score is assigned to each.

The acquired patterns can then be used during retrieval, where a question is matched against the antecedents describing the syntax of the question.
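To make the antecedent/consequent representation more concrete, the following is a minimal Python sketch of how such a pattern could be stored and matched. The class name, field names and the matching helper are illustrative choices of ours, not taken from the paper; the example instance anticipates the pattern derived in Figure 1 below.

    from dataclasses import dataclass, field

    @dataclass
    class Pattern:
        # Antecedent: a flat syntactic representation of the question class,
        # e.g. "When[1]+was[2]+NP[3]+VERB[4]".
        antecedent: str
        # Consequent: for each aligned question constituent (by its number),
        # the dependency path leading from it to the answer node in a known
        # answer sentence; None if no alignment was found.
        paths: dict = field(default_factory=dict)
        # Counts filled in later, during pattern evaluation (Section 6).
        correct: int = 0
        incorrect: int = 0

    def matching_patterns(question_template, patterns):
        # One question can potentially match several acquired patterns;
        # all of them are applied during retrieval.
        return [p for p in patterns if p.antecedent == question_template]

    example = Pattern("When[1]+was[2]+NP[3]+VERB[4]",
                      {3: ["pobj", "prep", "nsubj", "prep", "pobj"],
                       4: ["nsubj", "prep", "pobj"]})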
Input: (a) Query: "When was Alaska purchased?"
       (b) Answer sentence: "The acquisition of Alaska happened in 1867."
Step 1: Question is segmented into key phrases and stop words:
        When[1]+was[2]+NP[3]+VERB[4]
Step 2: Key question phrases are aligned with key answer sentence phrases:
        [3] Alaska → Alaska
        [4] purchased → acquisition
        ANSWER → 1867
Step 3: A pre-computed parse tree of the answer sentence is loaded:
        1: The (the, DT, 2) [det]
        2: acquisition (acquisition, NN, 5) [nsubj]
        3: of (of, IN, 2) [prep]
        4: Alaska (Alaska, IN, 2) [pobj]
        5: happened (happen, VBD, null) [ROOT]
        6: in (in, IN, 5) [prep]
        7: 1867 (1867, CD, 6) [pobj]
Step 4: Dependency paths from key question phrases to the answer are computed:
        Alaska → 1867: pobj → prep → nsubj → prep → pobj
        acquisition → 1867: nsubj → prep → pobj
Step 5: The resulting pattern is stored:
        Query: When[1]+was[2]+NP[3]+VERB[4]
        Path 3: pobj → prep → nsubj → prep → pobj
        Path 4: nsubj → prep → pobj

Figure 1: The pattern creation algorithm exemplified in five key steps for the query "When was Alaska purchased?" and the answer sentence "The acquisition of Alaska happened in 1867."
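Step 4 is the heart of the procedure, so a small illustration may help. The sketch below shows one straightforward way to read such paths off a dependency tree: walk from each node towards the root, locate the lowest common ancestor, and concatenate the labels of both branches. It assumes the parse is available as (head index, label) pairs per token with 0-based indices, attaches "Alaska" to the preposition "of" (which is what the listed path implies), and uses function names of our own choosing.

    def ancestors(tree, i):
        # Chain of token indices from i up to the root; tree[i] = (head, label).
        chain = [i]
        while tree[i][0] is not None:
            i = tree[i][0]
            chain.append(i)
        return chain

    def dependency_path(tree, start, answer):
        # Labels on the path: start -> lowest common ancestor -> answer.
        up, down = ancestors(tree, start), ancestors(tree, answer)
        lca = next(n for n in up if n in down)
        up_labels = [tree[n][1] for n in up[:up.index(lca)]]
        down_labels = [tree[n][1] for n in down[:down.index(lca)]]
        return up_labels + list(reversed(down_labels))

    # "The acquisition of Alaska happened in 1867."
    tree = [(1, "det"),      # 0 The
            (4, "nsubj"),    # 1 acquisition
            (1, "prep"),     # 2 of
            (2, "pobj"),     # 3 Alaska
            (None, "ROOT"),  # 4 happened
            (4, "prep"),     # 5 in
            (5, "pobj")]     # 6 1867
    print(dependency_path(tree, 3, 6))  # ['pobj', 'prep', 'nsubj', 'prep', 'pobj']
    print(dependency_path(tree, 1, 6))  # ['nsubj', 'prep', 'pobj']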

Note that one question can potentially match several patterns. The consequents contain descriptions of grammatical structures of potential answer sentences that can be used to identify and evaluate candidate sentences.

4 Phrase Alignment

The goal of this processing step is to align phrases from the question with corresponding phrases from the answer sentences in the training data. Consider the following example:

Query: "When was the Alaska territory purchased?"
Answer sentence: "The acquisition of what would become the territory of Alaska took place in 1867."

The mapping that has to be achieved is:

Query phrase        Answer sentence phrase
Alaska territory    territory of Alaska
purchased           acquisition
ANSWER              1867

In our approach, this is a two step process. First we align on a word level, then the output of the word alignment process is used to identify and align phrases. Word alignment is important in many fields of NLP, e.g. Machine Translation (MT) where words in parallel, bilingual corpora need to be aligned, see (Och and Ney, 2003) for a comparison of various statistical alignment models. In our case however we are dealing with a monolingual alignment problem which enables us to exploit clues not available for bilingual alignment: First of all, we can expect many query words to be present in the answer sentence, either with the exact same surface appearance or in some morphological variant. Secondly, there are tools available that tell us how semantically related two words are, most notably WordNet (Miller et al., 1993). For these reasons we implemented a bespoke alignment strategy, tailored towards our problem description.

This method is described in detail in (Kaisser, 2009). The processing steps described in the next sections build on its output. For reasons of brevity, we skip a detailed explanation in this paper and focus only on its key part: the alignment of words with very different surface structures. For more details we would like to point the reader to the aforementioned work.

In the above example, the alignment of "purchased" and "acquisition" is the most
problematic, because the surface structures of the two words clearly are very different. For such cases we experimented with a number of alignment strategies based on WordNet. These approaches are similar in that each picks one word that has to be aligned from the question at a time and compares it to all of the non-stop words in the answer sentence. Each of the answer sentence words is assigned a value between zero and one expressing its relatedness to the question word. The highest scoring word, if above a certain threshold, is selected as the closest semantic match. Most of these approaches make use of WordNet::Similarity, a Perl software package that "measures semantic similarity (or relatedness) between a pair of word senses by returning a numeric value that represents the degree to which they are similar or related" (Pedersen et al., 2004). Additionally, we developed a custom-built method that assumes that two words are semantically related if any kind of pointer exists between any occurrence of the words' root forms in WordNet. For details of these experiments, please refer to (Kaisser, 2009). In our experiments the custom-built method performed best, and was therefore used for the experiments described in this paper. The main reasons for this are:

1. Many of the measures in the WordNet::Similarity package take only hyponym/hypernym relations into account. This makes aligning words of different parts of speech difficult or even impossible. However, such alignments are important for our needs.

2. Many of the measures return results, even if only a weak semantic relationship exists. For our purposes however, it is beneficial to only take strong semantic relations into account.
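A rough sketch of this word-level step is given below, using NLTK's WordNet interface. It only follows a few common pointer types (hypernyms, hyponyms and derivationally related forms), so it merely approximates the "any kind of pointer" test described above; the stop word list, the helper names and the binary 1/0 scoring are simplifications of ours and not the original implementation.

    from nltk.corpus import wordnet as wn

    STOP_WORDS = {"the", "a", "an", "of", "in", "to", "was", "is", "and"}

    def related(word_a, word_b):
        # True if some WordNet pointer followed from a synset of word_a
        # reaches a synset of word_b (only a subset of pointer types is used).
        targets = set(wn.synsets(word_b))
        if not targets:
            return False
        for syn in wn.synsets(word_a):
            reached = set(syn.hypernyms()) | set(syn.hyponyms()) | {syn}
            for lemma in syn.lemmas():
                reached |= {l.synset() for l in lemma.derivationally_related_forms()}
            if reached & targets:
                return True
        return False

    def closest_match(question_word, answer_sentence):
        # Compare the question word to every non-stop word of the answer
        # sentence and return the best scoring one, or None.
        candidates = [w for w in answer_sentence.split() if w.lower() not in STOP_WORDS]
        scored = [(1.0 if w.lower() == question_word.lower() or related(question_word, w)
                   else 0.0, w) for w in candidates]
        best = max(scored, default=(0.0, None))
        return best[1] if best[0] > 0 else None

    # e.g. closest_match("purchased", "The acquisition of Alaska happened in 1867")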
5 Pattern Creation

Figure 1 details our algorithm in its five key steps. In steps 1 and 2 key phrases from the question are aligned to the corresponding phrases in the answer sentence, see Section 4 of this paper. Step 3 is concerned with retrieving the parse tree for the answer sentence. In our implementation all answer sentences in the training set have for performance reasons been parsed beforehand with the Stanford Parser (Klein and Manning, 2003b; Klein and Manning, 2003a), so at this point they are simply loaded from file. Step 4 is the key step in our algorithm. From the previous steps, we know where the key constituents from the question as well as the answer are located in the answer sentence. This enables us to compute the dependency paths in the answer sentence's parse tree that connect the answer with the key constituents. In our example the answer is "1867" and the key constituents are "acquisition" and "Alaska". Knowing the syntactic relationships (captured by their dependency paths) between the answer and the key phrases enables us to capture one syntactic possibility of how answer sentences to queries of the form When+was+NP+VERB can be formulated.

As can be seen in Step 5 a flat syntactic question representation is stored, together with numbers assigned to each constituent. The numbers for those constituents for which alignments in the answer sentence were sought and found are listed together with the resulting dependency paths. Path 3 for example denotes the path from constituent 3 (the NP "Alaska") to the answer. If no alignment could be found for a constituent, null is stored instead of a path. Should two or more alternative constituents be identified for one question constituent, additional patterns are created, so that each contains one of the possibilities. The described procedure is repeated for all question/answer sentence pairs in the training set and for each, one or more patterns are created.

It is worth noting that many TREC questions are fairly short and grammatically simple. In our training data we for example find 102 questions matching the pattern When[1]+was[2]+NP[3]+VERB[4], which together list 382 answer sentences, and thus 382 potentially different answer sentence structures from which patterns can be gained. As a result, the amount of training examples we have available is sufficient to achieve the performance described in Section 7. The algorithm described in this paper can of course also be used for more complicated NLQs, although in such a scenario a significantly larger amount of training data would have to be used.

6 Pattern Evaluation

For each created pattern, at least one matching example must exist: the sentence that was
used to create it in the first place. However, we do not know how precise each pattern is. To this end, an additional processing step between pattern creation and application is needed: pattern evaluation. Similar approaches to ours have been described in the relevant literature, many of them concerned with bootstrapping, starting with (Ravichandran and Hovy, 2002). The general purpose of this step is to use the available data about questions and their correct answers to evaluate how often each created pattern returns a correct or an incorrect result. This data is stored with each pattern and the result of the equation, often called pattern precision, can be used during the retrieval stage. Pattern precision in our case is defined as:

p = \frac{\#correct + 1}{\#correct + \#incorrect + 2}    (1)

We use Lucene to retrieve the top 100 paragraphs from the AQUAINT corpus by issuing a query that consists of the query's key words and all non-stop words in the answer. Then, all patterns are loaded whose antecedent matches the query that is currently being processed. After that, constituents from all sentences in the retrieved 100 paragraphs are aligned to the query's constituents in the same way as for the sentences during pattern creation, see Section 5. Now, the paths specified in these patterns are searched for in the paragraphs' parse trees. If they are all found, it is checked whether they all point to the same node and whether this node's surface structure is in some morphological form present in the answer strings associated with the question in our training data. If this is the case a variable in the pattern named correct is increased by 1, otherwise the variable incorrect is increased by 1. After the evaluation process is finished the final version of the pattern given as an example in Figure 1 now is:

Query: When[1]+was[2]+NP[3]+VERB[4]
Path 3: pobj → prep → nsubj → prep → pobj
Path 4: nsubj → prep → pobj
Correct: 15
Incorrect: 4

The variables correct and incorrect are used during retrieval, where the score of an answer candidate ac is the sum of all scores of all matching patterns p:

score(ac) = \sum_{i=1}^{n} score(p_i)    (2)

where

score(p_i) = \begin{cases} \frac{correct_i + 1}{correct_i + incorrect_i + 2} & \text{if } p_i \text{ matches} \\ 0 & \text{otherwise} \end{cases}    (3)

The highest scoring candidate is selected.

We would like to explicitly call out one property of our algorithm: it effectively returns two entities: a) a sentence that constitutes a valid response to the query, and b) the head node of a phrase in that sentence that constitutes the answer. Therefore the algorithm can be used for sentence retrieval or for answer retrieval. It depends on the application which of the two behaviors is desired. In the next section, we evaluate its answer retrieval performance.
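Equations 1-3 translate into very little code. The following Python sketch restates them; the Pattern fields mirror the correct/incorrect counts above, while paths_match is only a placeholder for the test that all stored paths are found in the candidate's parse tree and point at its head node, a step that is not spelled out here.

    def pattern_precision(correct, incorrect):
        # Equations (1) and (3): add-one smoothed precision of a pattern.
        return (correct + 1) / (correct + incorrect + 2)

    def score_candidate(candidate, patterns, paths_match):
        # Equation (2): sum the precision of every pattern that matches.
        return sum(pattern_precision(p.correct, p.incorrect)
                   for p in patterns if paths_match(p, candidate))

    def best_candidate(candidates, patterns, paths_match):
        # The highest scoring candidate is selected.
        return max(candidates, key=lambda c: score_candidate(c, patterns, paths_match))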
7 Experiments & Results

This section provides an evaluation of the algorithm described in this paper. The key questions we seek to answer are the following:

1. How does our method perform when compared to a baseline that extracts dependency paths from the question?

2. How much does the described algorithm improve performance of a state-of-the-art QA system?

3. What is the effect of training data size on performance? Can we expect that more training data would further improve the algorithm's performance?

7.1 Evaluation Setup

We use all factoid questions in TREC's QA test sets from 2002 to 2006 for evaluation for which a known answer exists in the AQUAINT corpus. Additionally, the data in (Lin and Katz, 2005) is used. In this paper the authors attempt to identify a much more complete set of relevant documents for a subset of TREC 2002 questions than TREC itself. We adopt a cross-validation approach for our evaluation. Table 4 shows how the data is split into five folds.
Test set   = 0    <= 1   <= 3   <= 5   <= 10  <= 25  <= 50  >= 75  >= 90  >= 100  Mean   Med
2002       0.203  0.396  0.580  0.671  0.809  0.935  0.984  0.0    0.0    0.0     6.86   2.0
2003       0.249  0.429  0.627  0.732  0.828  0.955  0.997  0.003  0.003  0.0     5.67   2.0
2004       0.221  0.368  0.539  0.637  0.799  0.936  0.985  0.0    0.0    0.0     6.51   3.0
2005       0.245  0.404  0.574  0.665  0.777  0.912  0.987  0.0    0.0    0.0     7.56   2.0
2006       0.241  0.389  0.568  0.665  0.807  0.920  0.966  0.006  0.0    0.0     8.04   3.0

Table 2: Fraction of sentences that contain correct answers (number of correct answer sentences per question) in Evaluation Set 1 (approximation).
Test set   = 0    <= 1   <= 3   <= 5   <= 10  <= 25  <= 50  >= 75  >= 90  >= 100  Mean   Med
2002       0.0    0.074  0.158  0.235  0.342  0.561  0.748  0.172  0.116  0.060   33.46  21.0
2003       0.0    0.099  0.203  0.254  0.356  0.573  0.720  0.161  0.090  0.031   32.88  19.0
2004       0.0    0.073  0.137  0.211  0.328  0.598  0.779  0.142  0.069  0.034   30.82  20.0
2005       0.0    0.163  0.238  0.279  0.410  0.589  0.759  0.141  0.097  0.069   30.87  17.0
2006       0.0    0.125  0.207  0.281  0.415  0.596  0.727  0.173  0.122  0.088   32.93  17.5

Table 3: Fraction of sentences that contain correct answers (number of correct answer sentences per question) in Evaluation Set 2 (approximation).
Fold   Training Data sets used        #      Test Data set   #
1      T03, T04, T05, T06             4565   T02             1159
2      T02, T04, T05, T06, Lin02      6174   T03             1352
3      T02, T03, T05, T06, Lin02      6700   T04             826
4      T02, T03, T04, T06, Lin02      6298   T05             1228
5      T02, T03, T04, T05, Lin02      6367   T06             1159

Table 4: Splits into training and test sets of the data used for evaluation. T02 stands for TREC 2002 data etc. Lin02 is based on (Lin and Katz, 2005). The # columns show how many question/answer sentence pairs are used for training and for testing.

In order to evaluate the algorithm's patterns we need a set of sentences to which they can be applied. In a traditional QA system architecture, see e.g. (Prager, 2006; Voorhees, 2003), the document or passage retrieval step performs this function. This step is crucial to a QA system's performance, because it is impossible to locate answers in the subsequent answer extraction step if the passages returned during passage retrieval do not contain the answer in the first place. This also holds true in our case: the patterns cannot be expected to identify a correct answer if none of the sentences used as input contains the correct answer. We therefore use two different evaluation sets to evaluate our algorithm:

1. The first set contains for each question all sentences in the top 100 paragraphs returned by Lucene when using simple queries made up from the question's key words. It cannot be guaranteed that answers to every question are present in this test set.

2. For the second set, the query additionally lists all known correct answers to the question as parts of one OR operator. This increases the chance that the evaluation set actually contains valid answer sentences significantly.

In order to provide a quantitative characterization of the two evaluation sets we estimated the number of correct answer sentences they contain. For each paragraph it was determined whether it contained one of the known answer strings and at least one of the question key words. Tables 2 and 3 show for each evaluation set how many answers on average it contains per question. The column "= 0" for example shows the fraction of questions for which no valid answer sentence is contained in the evaluation set, while column ">= 90" gives the fraction of questions with 90 or more valid answer sentences. The last two columns show mean and median values.

7.2 Comparison with Baseline

As pointed out in Section 2 there is a strong tradition of using dependency paths in QA. Many relevant papers describe algorithms that analyze a question's grammatical structure and expect to find a similar structure in valid answer sentences, e.g. (Attardi et al., 2001), (Cui et al., 2005) or (Bouma et al., 2005) to name just a few. As already pointed out, a major contribution of our work is that we do not assume this similarity. In our approach valid answer sentences are allowed to have grammatical structures that are very different from the question and also very different from each other. Thus it is natural to compare our approach against a baseline that compares candidate sentences not against patterns that were gained from question/answer sentence pairs, but from questions alone. In order to create these patterns, we use a small trick: during the Pattern Creation step, see Section 5 and Figure 1, we
replace the answer sentences in the input file with the questions, and assume that the question word indicates the position where the answer should be located.

Test set   Q number   Qs with patterns   >= 1 correct   Overall correct   Accuracy overall   Acc. if pattern
2002       429        321                147            50                0.117              0.156
2003       354        237                76             22                0.062              0.093
2004       204        142                74             26                0.127              0.183
2005       319        214                97             46                0.144              0.215
2006       352        208                85             31                0.088              0.149
Sum        1658       1122               452            176               0.106              0.156

Table 5: Performance based on evaluation set 1.

Test set   Q number   Qs with patterns   >= 1 correct   Overall correct   Accuracy overall   Acc. if pattern
2002       429        321                239            133               0.310              0.414
2003       354        237                149            88                0.248              0.371
2004       204        142                119            65                0.319              0.458
2005       319        214                161            92                0.288              0.429
2006       352        208                139            84                0.238              0.403
Sum        1658       1122               807            462               0.278              0.411

Table 6: Performance based on evaluation set 2.

Tables 5 and 6 show how our algorithm performs on evaluation sets 1 and 2, respectively. Tables 7 and 8 show how the baseline performs on evaluation sets 1 and 2, respectively. The tables' columns list the year of the TREC test set used, the number of questions in the set (we only use questions for which we know that there is an answer in the corpus), the number of questions for which one or more patterns exist, how often at least one pattern returned the correct answer, how often we get an overall correct result by taking all patterns and their confidence values into account, accuracy@1 of the overall system, and accuracy@1 computed only for those questions for which we have at least one pattern available (for all other questions the system returns no result). As can be seen, on evaluation set 1 our method outperforms the baseline by 300%, on evaluation set 2 by 311%, taking accuracy if a pattern exists as a basis.

Test set   Q number   Qs with patterns   Min. one correct   Overall correct   Accuracy overall   Acc. if pattern
2002       429        321                43                 14                0.033              0.044
2003       354        237                28                 10                0.028              0.042
2004       204        142                19                 6                 0.029              0.042
2005       319        214                21                 7                 0.022              0.033
2006       352        208                20                 7                 0.020              0.034
Sum        1658       1122               131                44                0.027              0.039

Table 7: Baseline performance based on evaluation set 1.

Test set   Q number   Qs with patterns   Min. one correct   Overall correct   Accuracy overall   Acc. if pattern
2002       429        321                77                 37                0.086              0.115
2003       354        237                39                 26                0.073              0.120
2004       204        142                25                 15                0.074              0.073
2005       319        214                38                 18                0.056              0.084
2006       352        208                34                 16                0.045              0.077
Sum        1658       1122               213                112               0.068              0.100

Table 8: Baseline performance based on evaluation set 2.

Many of the papers cited earlier that use an approach similar to our baseline approach of course report much better results than Tables 7 and 8. This however is not too surprising as the approach described in this paper and the baseline approach do not make use of many techniques commonly used to increase performance of a QA system, e.g. TF-IDF fallback strategies, fuzzy matching, manual reformulation patterns etc. It was a deliberate decision from our side not to use any of these approaches. After all, this would result in an experimental setup where the performance of our answer extraction strategy could not have been observed in isolation. The QA system used as a baseline in the next section makes use of many of these techniques and we will see that our method, as described here, is suitable to increase its performance significantly.

7.3 Impact on an existing QA System

Tables 9 and 10 show how our algorithm increases performance of our QuALiM system, see e.g. (Kaisser et al., 2006). Section 6 in this paper describes via formulas 2 and 3 how answer candidates are ranked. This ranking is combined with the existing QA system's candidate ranking by simply using it as an additional feature that boosts candidates proportionally to their confidence score. The difference between both tables is that the first uses all 1658 questions in our test sets for the evaluation, whereas the second considers only those 1122 questions for which our system was able to learn a pattern. Thus, for Table 10, questions which the system had no chance of answering due to limited training data are omitted. As can be seen, accuracy@1 increases by 4.9% on the complete test set and by 11.5% on the partial set.

Note that the QA system used as a baseline is at an advantage in at least two respects: a) It has important web-based components and as such has access to a much larger body of textual information. b) The algorithm described in this paper is an answer extraction approach only. For paragraph retrieval we use the same approach as for evaluation set 1, see Section 7.1. However, in more than 20% of the cases, this method returns not
a single paragraph that contains both the answer and at least one question keyword. In such cases, the simple paragraph retrieval makes it close to impossible for our algorithm to return the correct answer.

Test Set   QuALiM   QASP    combined   increase
2002       0.503    0.117   0.524      4.2%
2003       0.367    0.062   0.390      6.2%
2004       0.426    0.127   0.451      5.7%
2005       0.373    0.144   0.389      4.2%
2006       0.341    0.088   0.358      5.0%
02-06      0.405    0.106   0.425      4.9%

Table 9: Top-1 accuracy of the QuALiM system on its own and when combined with the algorithm described in this paper. All increases are statistically significant using a sign test (p < 0.05).

Test Set   QuALiM   QASP    combined   increase
2002       0.530    0.156   0.595      12.3%
2003       0.380    0.093   0.430      13.3%
2004       0.465    0.183   0.514      10.6%
2005       0.388    0.214   0.421      8.4%
2006       0.385    0.149   0.428      11.3%
02-06      0.436    0.157   0.486      11.5%

Table 10: Top-1 accuracy of the QuALiM system on its own and when combined with the algorithm described in this paper, when only considering questions for which a pattern could be acquired from the training data. All increases are statistically significant using a sign test (p < 0.05).

Figure 2: Effect of the amount of training data on system performance

7.4 Effect of Training Data Size

We now assess the effect of training data size on performance. Tables 5 and 6 presented earlier show that an average of 32.2% of the questions have no matching patterns. This is because the data used for training contained no examples for a significant subset of question classes. It can be expected that, if more training data were available, this percentage would decrease and performance would increase. In order to test this assumption, we repeated the evaluation procedure detailed in this section several times, initially using data from only one TREC test set for training and then gradually adding more sets until all available training data had been used. The results for evaluation set 2 are presented in Figure 2. As can be seen, every time more data is added, performance increases. This strongly suggests that the point of diminishing returns, when adding additional training data no longer improves performance, is not yet reached.

8 Conclusions

In this paper we present an algorithm that acquires syntactic information about how relevant textual content to a question can be formulated from a collection of paired questions and answer sentences. Other than previous work employing dependency paths for QA, our approach does not assume that a valid answer sentence is similar to the question and it allows many potentially very different syntactic answer sentence structures. The algorithm is evaluated using TREC data, and it is shown that it outperforms an algorithm that merely uses the syntactic information contained in the question itself by 300%. It is also shown that the algorithm improves the performance of a state-of-the-art QA system significantly.

As always, there are many ways in which our algorithm could be improved. Combining it with fuzzy matching techniques as in (Cui et al., 2004) or (Cui et al., 2005) is an obvious direction for future work. We are also aware that in order to apply our algorithm on a larger scale and in a real world setting with real users, we would need a much larger set of training data. These could be acquired semi-manually, for example by using crowd-sourcing techniques. We are also thinking about fully automated approaches, or about using indirect human evidence, e.g. user clicks in search engine logs. Typically users only see the title and a short abstract of the document when clicking on a result, so it is possible to imagine a scenario where a subset of these abstracts, paired with user queries, could serve as training data.
References

Giuseppe Attardi, Antonio Cisternino, Francesco Formica, Maria Simi, and Alessandro Tommasi. 2001. PIQASso: Pisa Question Answering System. In Proceedings of the 2001 Edition of the Text REtrieval Conference (TREC-01).

Gosse Bouma, Jori Mur, and Gertjan van Noord. 2005. Reasoning over Dependency Relations for QA. In Proceedings of the IJCAI Workshop on Knowledge and Reasoning for Answering Questions (KRAQ-05).

Hang Cui, Ji-Rong Wen, Jian-Yun Nie, and Wei-Ying Ma. 2002. Probabilistic query expansion using query logs. In 11th International World Wide Web Conference (WWW-02).

Hang Cui, Keya Li, Renxu Sun, Tat-Seng Chua, and Min-Yen Kan. 2004. National University of Singapore at the TREC-13 Question Answering Main Task. In Proceedings of the 2004 Edition of the Text REtrieval Conference (TREC-04).

Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan, and Tat-Seng Chua. 2005. Question Answering Passage Retrieval Using Dependency Relations. In Proceedings of the 28th ACM-SIGIR International Conference on Research and Development in Information Retrieval (SIGIR-05).

Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman. 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6).

David Graff. 2002. The AQUAINT Corpus of English News Text.

Michael Kaisser and John Lowe. 2008. Creating a Research Collection of Question Answer Sentence Pairs with Amazon's Mechanical Turk. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC-08).

Michael Kaisser, Silke Scheible, and Bonnie Webber. 2006. Experiments at the University of Edinburgh for the TREC 2006 QA track. In Proceedings of the 2006 Edition of the Text REtrieval Conference (TREC-06).

Michael Kaisser. 2009. Acquiring Syntactic and Semantic Transformations in Question Answering. Ph.D. thesis, University of Edinburgh.

Dan Klein and Christopher D. Manning. 2003a. Accurate Unlexicalized Parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics (ACL-03).

Dan Klein and Christopher D. Manning. 2003b. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15.

Jimmy Lin and Boris Katz. 2005. Building a Reusable Test Collection for Question Answering. Journal of the American Society for Information Science and Technology (JASIST).

Dekang Lin and Patrick Pantel. 2001. Discovery of Inference Rules for Question-Answering. Natural Language Engineering, 7(4):343–360.

Dekang Lin. 1998. Dependency-based Evaluation of MINIPAR. In Workshop on the Evaluation of Parsing Systems.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. 1993. Introduction to WordNet: An On-Line Lexical Database. Journal of Lexicography, 3(4):235–244.

Diego Molla. 2006. Learning of Graph-based Question Answering Rules. In Proceedings of the HLT/NAACL 2006 Workshop on Graph Algorithms for Natural Language Processing.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–52.

Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity - Measuring the Relatedness of Concepts. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04).

John Prager. 2006. Open-Domain Question-Answering. Foundations and Trends in Information Retrieval, 1(2).

L. R. Rabiner, A. E. Rosenberg, and S. E. Levinson. 1991. Considerations in Dynamic Time Warping Algorithms for Discrete Word Recognition. In Proceedings of IEEE Transactions on Acoustics, Speech and Signal Processing.

Deepak Ravichandran and Eduard Hovy. 2002. Learning Surface Text Patterns for a Question Answering System. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02).

Stefan Riezler and Yi Liu. 2010. Query Rewriting using Monolingual Statistical Machine Translation. Computational Linguistics, 36(3).

Dan Shen and Dietrich Klakow. 2006. Exploring Correlation of Dependency Relation Paths for Answer Extraction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL (COLING/ACL-06).

David A. Smith and Jason Eisner. 2006. Quasi-synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies. In Proceedings of the HLT-NAACL Workshop on Statistical Machine Translation.

Ellen M. Voorhees. 1999. Overview of the Eighth Text REtrieval Conference (TREC-8). In Proceedings of the Eighth Text REtrieval Conference (TREC-8).

Ellen M. Voorhees. 2003. Overview of the TREC 2003 Question Answering Track. In Proceedings of the 2003 Edition of the Text REtrieval Conference (TREC-03).
Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy Model? A Quasi-synchronous Grammar for QA. In Proceedings of EMNLP-CoNLL 2007.
Can Click Patterns across Users' Query Logs Predict Answers to
Definition Questions?

Alejandro Figueroa
Yahoo! Research Latin America
Blanco Encalada 2120, Santiago, Chile
afiguero@yahoo-inc.com
Abstract

In this paper, we examined click patterns produced by users of Yahoo! search engine when prompting definition questions. Regularities across these click patterns are then utilized for constructing a large and heterogeneous training corpus for answer ranking. In a nutshell, answers are extracted from clicked web-snippets originating from any class of web-site, including Knowledge Bases (KBs). On the other hand, non-answers are acquired from redundant pieces of text across web-snippets.

The effectiveness of this corpus was assessed via training two state-of-the-art models, wherewith answers to unseen queries were distinguished. These testing queries were also submitted by search engine users, and their answer candidates were taken from their respective returned web-snippets. This corpus helped both techniques to finish with an accuracy higher than 70%, and to predict over 85% of the answers clicked by users. In particular, our results underline the importance of non-KB training data.

1 Introduction

It is a well-known fact that definition queries are very popular across users of commercial search engines (Rose and Levinson, 2004). The essential characteristic of definition questions is their aim for discovering as much descriptive information as possible about the concept being defined (a.k.a. definiendum, pl. definienda). Some examples of this kind of query include "Who is Benjamin Millepied?" and "Tell me about Bank of America".

It is a standard practice of definition question answering (QA) systems to mine KBs (e.g., online encyclopedias and dictionaries) for reliable descriptive information on the definiendum (Sacaleanu et al., 2008). Normally, these pieces of information (i.e., nuggets) explain different facets of the definiendum (e.g., "ballet choreographer" and "born in Bordeaux"), and the main idea consists in projecting the acquired nuggets into the set of answer candidates afterwards. However, the performance of this category of method falls into sharp decline whenever little or no coverage is found across KBs (Zhang et al., 2005; Han et al., 2006). Put differently, this technique usually succeeds in discovering the most relevant facts about the most prominent sense of the definiendum. But it often misses many pertinent nuggets, especially those that can be paraphrased in several ways, and/or those regarding ancillary senses of the definiendum, which are hardly found in KBs.

As a means of dealing with this, current strategies try to construct general definition models inferred from a collection of definitions coming from the Internet or KBs (Androutsopoulos and Galanis, 2005; Xu et al., 2005; Han et al., 2006). To a great extent, models exploiting non-KB sources demand considerable annotation efforts, or when the data is obtained automatically, they benefit from empirical thresholds that ensure a certain degree of similarity to an array of KB articles. These thresholds attempt to trade off the cleanness of the training material against its coverage. Moreover, gathering negative samples is also hard as it is not easy to find wide-coverage authoritative sources of non-descriptive information about a particular definiendum.

Our approach has different innovative aspects
compared to other research in the area of definition extraction. It is at the crossroads of query log analysis and QA systems. We study the click behavior of search engine users with regard to definition questions. Based on this study, we propose a novel way of acquiring large-scale and heterogeneous training material for this task, which consists of:

- automatically obtaining positive samples in accordance with click patterns of search engine users. This aids in harvesting a host of descriptions from non-KB sources in conjunction with descriptive information from KBs.

- automatically acquiring negative data in consonance with redundancy patterns across snippets displayed within search engine results when processing definition queries.

In brief, our experiments reveal that these patterns can be effectively exploited for devising efficient models.

Given the huge amount of amassed data, we additionally contrast the performance of systems built on top of samples originating solely from KB, non-KB, and both combined. Our comparison corroborates that KBs yield massive trustworthy descriptive knowledge, but they do not bear enough diversity to discriminate all answering nuggets within any kind of text. Essentially, our experiments unveil that non-KB data is richer and therefore it is useful for discovering more descriptive nuggets than KB material. But its usage relies on its cleanness and on a negative set. Many people had these intuitions before, but to the best of our knowledge, we provide the first empirical confirmation and quantification.

The road-map of this paper is as follows: section 2 touches on related work; section 3 digs deeper into click patterns for definition questions; subsequently section 4 explains our corpus construction strategy; section 5 describes our experiments, and section 6 draws final conclusions.

2 Related Work

In recent years, definition QA systems have shown a trend towards the utilization of several discriminant and statistical learning techniques (Androutsopoulos and Galanis, 2005; Chen et al., 2006; Han et al., 2006; Fahmi and Bouma, 2006; Katz et al., 2007; Westerhout, 2009; Navigli and Velardi, 2010). Due to training, there is a pressing necessity for large-scale authoritative sources of descriptive and non-descriptive nuggets. In the same manner, there is a growing importance of strategies capable of extracting trustworthy negative/positive samples from any type of text. Conventionally, these methods interpret descriptions as positive examples, whereas contexts providing non-descriptive information are taken as negative elements. Four representative techniques are:

- centroid vector (Xu et al., 2003; Cui et al., 2004) collects an array of articles about the definiendum from a battery of pre-determined KBs. These articles are then used to learn a vector of word frequencies, wherewith answer candidates are rated afterwards. Sometimes web-snippets together with a query reformulation method are exploited instead of pre-defined KBs (Chen et al., 2006).

- (Androutsopoulos and Galanis, 2005) gathered articles from KBs to score 250-character windows carrying the definiendum. These windows were taken from the Internet, and accordingly, highly similar windows were interpreted as positive examples, while highly dissimilar ones as negative samples. For this purpose, two thresholds are used, which ensure the trustworthiness of both sets. However, they also cause the sets to be less diverse as not all definienda are widely covered across KBs. Indeed, many facets outlined within the 250-character windows will not be detected.

- (Xu et al., 2005) manually labeled samples taken from an Intranet. Manual annotations are constrained to a small number of examples, because it requires substantial human effort to tag a large corpus, and disagreements between annotators are not uncommon.

- (Figueroa and Atkinson, 2009) capitalized on abstracts supplied by Wikipedia for building language models (LMs), thus there was no need for a negative set.

Our contribution is a novel technique for obtaining heterogeneous training material for
definitional QA, that is to say, massive examples harvested from KBs and non-KBs. Fundamentally, positive examples are extracted from web snippets grounded on click patterns of users of a search engine, whereas the negative collection is acquired via redundancy patterns across web-snippets displayed to the user by the search engine. This data is capitalized on by two state-of-the-art definition extractors, which are different in nature. In addition, our paper discusses the effect on performance of different sorts (KB and non-KB) and amounts of training data.

As for user clicks, they provide valuable relevance feedback for a variety of tasks, cf. (Radlinski et al., 2010). For instance, (Ji et al., 2009) extracted relevance information from clicked and non-clicked documents within aggregated search sessions. They modelled sequences of clicks as a means of learning to globally rank the relative relevance of all documents with respect to a given query. (Xu et al., 2010) improved the quality of training material for learning to rank approaches via predicting labels using clickthrough data. In our work, we combine click patterns across Yahoo! search query logs with QA techniques to build one-sided and two-sided classifiers for recognizing answers to definition questions.

3 User Click Analysis for Definition QA

In this section, we examine a collection of queries submitted to Yahoo! search engine during the period from December 2010 to March 2011. More specifically, for this analysis, we considered a log encompassing a random sample of 69,845,262 (23,360,089 distinct) queries. Basically, this log comprises the query sent by the user in conjunction with the displayed URLs and the information about the sequence of their clicks.

In the first place, we associate each query with a category in the taxonomy proposed by (Rose and Levinson, 2004), and in this way definition queries are selected. Secondly, we investigate user click patterns observed across these filtered definition questions.

3.1 Finding Definition Queries

According to (Broder, 2002; Lee et al., 2005; Dupret and Piwowarski, 2008), the intention of the user falls into at least two categories: navigational (e.g., "google") and informational (e.g., "maximum entropy models"). The former entails the desire of going to a specific site that the user has in mind, and the latter regards the goal of learning something by reading or viewing some content (Rose and Levinson, 2004). Navigational queries are hence of less relevance to definition questions, and for this reason, these were removed in congruence with the next three criteria:

- (Lee et al., 2005) pointed out that users will only visit the web site they bear in mind when prompting navigational queries. Thus, these queries are characterized by clicking the same URL almost all the time (Lee et al., 2005). More precisely, we discarded queries that: a) appear more than four times in the query log; and which at the same time b) have a most clicked URL that represents more than 98% of all their clicks. Following the same idea, we additionally eliminated prompted URLs and queries where the clicked URL is of the form www.search-query-without-spaces.

- By the same token, queries containing keywords such as "homepage", "on-line", and "sign in" were also removed.

- After the previous steps, many navigational queries (e.g., "facebook") still remained in the query log. We noticed that a substantial portion was signaled by several frequently and indistinctly clicked URLs. Take for instance "facebook": www.facebook.com and www.facebook.com/login.php. With this in mind, we discarded entries embodied in a manually compiled black list. This list contains the 600 most frequent cases.

A third category in (Rose and Levinson, 2004) regards resource queries, which we distinguished via keywords like "image", "lyrics" and "maps". Altogether, an amount of (35.67%) 24,916,610 (3,576,817 distinct) queries were seen as navigational and resource. Note that in (Rose and Levinson, 2004) both classes encompassed between 37%-38% of their query set.
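The three removal criteria above can be restated compactly in code. The sketch below is our own approximation and not the authors' implementation: the thresholds (more than four occurrences, a top URL taking over 98% of the clicks) and the keyword lists are the ones quoted in the text, while the black list handling and the URL normalization are stand-ins.

    NAV_KEYWORDS = ("homepage", "on-line", "sign in")
    RESOURCE_KEYWORDS = {"image", "lyrics", "maps"}

    def is_navigational(query, freq, clicks_by_url, black_list):
        # clicks_by_url maps each URL clicked for this query to its click count.
        total = sum(clicks_by_url.values())
        if freq > 4 and total > 0 and max(clicks_by_url.values()) / total > 0.98:
            return True
        if any(k in query for k in NAV_KEYWORDS):
            return True
        if "www." + query.replace(" ", "") in clicks_by_url:  # www.search-query-without-spaces
            return True
        return query in black_list

    def is_resource(query):
        return any(w in RESOURCE_KEYWORDS for w in query.split())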
Subsequently, we profited from the remaining 44,928,652 (informational) entries for detecting queries where the intention of the user is finding descriptive information about a topic (i.e., definiendum). In the taxonomy delineated by
(Rose and Levinson, 2004), informational queries are sub-categorized into five groups including list, locate, and definitional (directed and undirected). In practice, we filtered definition questions as follows:

1. We exploited an array of expressions that are commonly utilized in query analysis for classifying definition questions (Figueroa, 2010), e.g. "Who is/was...", "What is/was a/an...", "define..." and "describe...". Overall, these rules assisted in selecting 332,227 entries.

2. As stated in (Dupret and Piwowarski, 2008), informational queries are typified by the user clicking several documents. In light of that, we say that some definitional queries are characterized by multiple clicks, where at least one belongs to a KB. This aids in capturing the intention of the user when looking for descriptive knowledge and only entering noun phrases like "thoracic outlet syndrome":

   www.medicinenet.com
   en.wikipedia.org
   health.yahoo.net
   www.livestrong.com
   health.yahoo.net
   en.wikipedia.org
   www.medicinenet.com
   www.mayoclinic.com
   en.wikipedia.org
   www.nismat.org
   en.wikipedia.org

   Table 1: Four distinct sequences of hosts clicked by users given the search query "thoracic outlet syndrome".

   In so doing, we manually compiled a list of 36 frequently clicked KB hosts (e.g., Wikipedia and the Britannica encyclopedia). This filter produced 567,986 queries.

Unfortunately, since query logs stored by search engines are not publicly available due to privacy and legal concerns, there is no accessible training material to build models on top of annotated data. Thus, we exploited the aforementioned hand-crafted rules to connect queries to their respective category in this taxonomy.
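Taken together, the two filters of this subsection can be pictured roughly as follows. This is a sketch under our own assumptions: the regular expressions only stand in for the larger rule set cited from (Figueroa, 2010), and only two plausible hosts are listed in place of the 36 manually compiled KB hosts, whose full list is not given.

    import re

    DEFINITION_PATTERNS = [r"^who (is|was) ", r"^what (is|was) (a|an|the) ",
                           r"^define ", r"^describe "]
    KB_HOSTS = {"en.wikipedia.org", "www.britannica.com"}  # stand-ins for the 36 hosts

    def is_definition_query(query, clicked_hosts):
        # Filter 1: the query itself is phrased like a definition question.
        q = query.lower().strip()
        if any(re.match(p, q) for p in DEFINITION_PATTERNS):
            return True
        # Filter 2: multiple clicks, at least one of them on a KB host.
        return len(clicked_hosts) > 1 and any(h in KB_HOSTS for h in clicked_hosts)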
While the first filter infers the intention of the
Unfortunately, since query logs stored by user from the query itself, the second deduces it
search engines are not publicly available due to from the origin of the clicked documents. With
privacy and legal concerns, there is no accessible regard to this second filter, clicking patterns are
training material to build models on top of anno- more disperse. Here, the first two clicks normally
tated data. Thus, we exploited the aforementioned correspond to the top two/three ranked hits re-
hand-crafted rules to connect queries to their re- turned by the search engine, see also (Ji et al.,
spective category in this taxonomy. 2009). Also, sequences of clicks signal that users

normally visit only one site belonging to a KB, and at least one coming from a non-KB (see Table 1).

All in all, the insight gained in this analysis allows the construction of a heterogeneous corpus for definition question answering. Put differently, these user click patterns offer a way to obtain huge amounts of heterogeneous training material. In this way the heavy dependence of open-domain description identifiers on KB data can be alleviated.

4 Click-Based Corpus Acquisition

Since queries obtained by the previous two filters are not associated with the actual snippets seen by the users (due to storage limitations), snippets were recovered by means of submitting the queries to Yahoo! search engine.

After retrieval, we benefited from OpenNLP (http://opennlp.sourceforge.net) for detecting sentence boundaries, tokenization and part-of-speech (POS) information. Here, we additionally interpreted truncations ("...") as sentence delimiters. POS tags were used to recognize and replace numbers with a placeholder (#CD#) as a means of creating sentence templates. We modified numbers as their value is just as often confusing as useful (Baeza-Yates and Ribeiro-Neto, 1999).

Along with numbers, sequences of full and partial matches of the definiendum were also substituted with placeholders, #Q# and #QT#, respectively. To exemplify, consider this pre-processed snippet regarding Benjamin Millepied from www.mashceleb.com:

#Q# / News & Biography - MashCeleb
Latest news coverage of #Q#
#Q# ( born #CD# ) is a principal dancer at New York City Ballet and a ballet choreographer...

We benefit from these templates for building both a positive and a negative training set.
2005; Schlaefer et al., 2007);
4.1 Negative Set
The negative set comprised templates appearing b. sentences bearing several named entities
across all (clicked and unclicked) web-snippets, (Schlaefer et al., 2006; Schlaefer et al.,
which at the same time, are related to more 2007), which were recognized by the number
than five distinct queries. We hypothesize that of tokens starting with a capital letter versus
these prominent elements correspond to non- those starting with a lowercase letter;
informative, and thus non-descriptive, content as
c. statements of persons (Schlaefer et al.,
1
http://opennlp.sourceforge.net 2007); and

103
d. we also profited from about five hundred 23,132 elements, and some illustrative annota-
common expressions across web snippets in- tions are shown in Table 2. It is worth highlight-
cluding Picture of , and Jump to : naviga- ing that these examples signal that our models
tion , search, as well as Recent posts. are considering pattern-free descriptions, that is
to say, unlike other systems (Xu et al., 2003; Katz
This process assisted in acquiring 881,726 dif- et al., 2004; Fernandes, 2004; Feng et al., 2006;
ferent examples, where 673,548 came from KBs. Figueroa and Atkinson, 2009; Westerhout, 2009)
Here, we also randomly selected 1,000 instances which consider definitions aligning an array of
and manually checked if they were actual descrip- well-known patterns (e.g., is a and also known
tions. The error of this set was 12.2%. as), our models disregard any class of syntactic
To put things into perspective, in contrast to constraint.
other corpus acquisition approaches, the present As to a baseline system, we accounted for the
method generated more than 1,800,000 positive centroid vector (Xu et al., 2003; Cui et al., 2004).
and negative training samples combined, while When implementing, we followed the blueprint
the open-domain strategy of (Miliaraki and An- in (Chen et al., 2006), and it was built for each
droutsopoulos, 2004; Androutsopoulos and Gala- definiendum from a maximum of 330 web snip-
nis, 2005) ca. 20,000 examples, the close-domain pets fetched by means of Bing Search. This base-
technique of (Xu et al., 2005) about 3,000 and line achieved a modest performance as it correctly
(Fahmi and Bouma, 2006) ca. 2,000. classified 43.75% of the testing examples. In de-
tail, 47.75% out of the 56.25% of the misclas-
5 Answering New Definition Queries sified elements were a result of data-sparseness.
This baseline has been widely used as a starting
In our experiments, we checked the effectiveness
point for comparison purposes, however it is hard
of our user click-based corpus acquisition tech-
for this technique to discover diverse descriptive
nique by studying its impact on two state-of-the-
nuggets. This problem stems from the narrow-
art systems. The first one is based on the bi-term
coverage of the centroid vector learned for the re-
LMs proposed by (Chen et al., 2006). This sys-
spective definienda (Zhang et al., 2005). In short,
tem requires only positive samples as training ma-
these figures support the necessity for more robust
terial. Conversely, our second system capitalizes
methods based on massive training material.
on both positive and negative examples, and it is
Experiments. We trained both models by sys-
based on the Maximum Entropy (ME) models
tematically increasing the size of the training ma-
presented by (Fahmi and Bouma, 2006). These
terial by 1%. For this, we randomly split the train-
ME2 models amalgamated bigrams and unigrams
ing data into 100 equally sized packs, and system-
as well as two additional syntactic features, which
atically added one to the previously selected sets
were not applicable to our task (i.e, sentence posi-
(i. e., 1%, 2%, 3%, . . ., 99%, 100%). We also ex-
tion). We added to this model the sentence length
perimented with: 1) positive examples originated
as a feature in order to homologate the attributes
solely from KBs; 2) positive samples harvested
used by both systems, therefore offering a good
only from non-KBs; and eventually 3) all positive
framework to assess the impact of our negative
examples combined.
set. Note that (Fahmi and Bouma, 2006), unlike
Figure 1 juxtaposes the outcomes accom-
us, applied their models only to sentences observ-
plished by both techniques under the different
ing some specific syntactic patterns.
configurations. These figures, compared with re-
With regard to the test set, this was constructed
sults obtained by the baseline, indicate the im-
by manually annotating 113,184 sentence tem-
portant contribution of our corpus to tackle data-
plates corresponding to 3,162 unseen definienda.
sparseness. This contrast substantiates our claim
In total, this array of unseen testing instances
that click patterns can be utilized as indicators of
encompassed 11,566 different positive samples.
answers to definition questions. Since our models
In order to build a balanced testing collection,
ignore definition patterns, they have the potential
the same number of negative examples were ran-
of detecting a wide diversity of descriptive infor-
domly selected. Overall, our testing set contains
mation.
2
http://maxent.sourceforge.net/about.html Further, the improvement of about 9%-10% by

104
Label Example/Template
+ Propylene #Q# is a type of alcohol made from fermented yeast and carbohydrates and
is commonly used in a wide variety of products .
+ #Q# is aggressive behavior intended to achieve a goal .
+ In Hispanic culture , when a girl turns #CD# , a celebration is held called the #Q#,
symbolizing the girl s passage to womanhood .
+ Kirschwasser , German for cherry water and often shortened to #Q# in English-speaking
countries , is a colorless brandy made from black ...
+ From the Gaelic dubhglas meaning #Q#, #QT# stream , or from the #QT# river .
+ Council Bluffs Orthopedic Surgeon Doctors physician directory - Read about #Q#, damage
to any of the #CD# tendons that stabilize the shoulder joint .
+ It also occurs naturally in our bodies in fact , an average size adult manufactures up to
#CD# grams of #Q# daily during normal metabolism .
- Sterling Silver #Q# Hoop Earrings Overstockjeweler.com
- I know V is the rate of reaction and the #Q# is hal ...
- As sad and mean as that sounds , there is some truth to it , as #QT# as age their bodies do
not function as well as they used to ( in all respects ) so there is a ...
- If you re new to the idea of Christian #Q#, what I call the wild things of God ,
- A look at the Biblical doctrine of the #QT# , showing the biblical basis for the teaching and
including a discussion of some of the common objections .
- #QT# is Users Choice ( application need to be run at #QT# , but is not system critical ) ,
this page shows you how it affects your Windows operating system .
- Your doctor may recommend that you use certain drugs to help you control your #Q# .
- Find out what is the full meaning of #Q# on Abbreviations.com !

Table 2: Samples of manual annotations (testing set).

means of exploiting our negative set makes its Best True Positive
positive contribution clear. In particular, this sup- Conf. of Accuracy positives examples
ME-combined 80.72% 88% 881,726
ports our hypothesis that redundancy across web-
ME-KB 80.33% 89.37% 673,548
snippets pertaining to several definition questions ME-N-KB 78.99% 93.38% 208,178
can be exploited as negative evidence. On the
whole, this enhancement also suggests that ME Table 3: Comparison of performance, the total amount
models are a better option than LMs. and origin of training data, and the number of recog-
nized descriptions.
Furthermore, in the case of ME models, putting
together evidence from KB and non-KBs bet-
ters the performance. Conversely, in the case of racy. Nevertheless, this fraction (32%) is still
LMs, we do not observe a noticeable improve- larger than the data-sets considered by other open-
ment when unifying both sources. We attribute domain Machine Learning approaches (Miliaraki
this difference to the fact that non-KB data is nois- and Androutsopoulos, 2004; Androutsopoulos
ier, and thus negative examples are necessary to and Galanis, 2005).
cushion this noise. By and large, the outcomes In detail, when contrasting the confusion ma-
show that the usage of descriptive information de- trices of the best configurations accomplished
rived exclusively from KBs is not the best, but a by ME-combined (80.72%), ME-KB (80.33%)
cost-efficient solution. and ME-N-KB (78.99%), one can find that ME-
Incidentally, Figure 1 reveals that more training combined correctly identified 88% of the answers
data does not always imply better results. Overall, (true positives), while ME-KB 89.37% and ME-
the best performance (ME-combined 80.72%) N-KB 93.38% (see Table 3).
was reaped when considering solely 32% of the Interestingly enough, non-KB data only em-
training material. Hence, ME-KB finished with bodies 23.61% of all positive training material,
the best performance when accounting for about but it still has the ability to recognize more an-
215,500 positive examples (see Table 3). Adding swers. Despite of that, the other two strate-
more examples brought about a decline in accu- gies outperform ME-N-KB, because they are able

105
Figure 1: Results for each configuration (accuracy).

to correctly label more negative test examples. size of the corpus. Our figures additionally sug-
Given these figures, we can conclude that this is gest that more effort should go into increasing di-
achieved by mitigating the impact of the noise in versity than the number of training instances. In
the training corpus by means of cleaner (KB) data. light of these observations, we also conjecture that
We verified this synergy by inspecting the num- a more reduced, but diverse and manually anno-
ber of answers from non-KBs detected by the tated, corpus might be more effective. In partic-
three top configurations in Table 3: ME-combined ular, a manually checked corpus distilled by in-
(9,086), ME-KB (9,230) and ME-N-KB (9,677). specting click patterns across query logs of search
In like manner, we examined the confusion ma- engines.
trix for the best configuration (ME-combined Lastly, in order to evaluate how good a click
80.72%): 1,388 (6%) positive examples were mis- predictor the three top ME-configurations are,
labeled as negative, while 3,071 (13.28%) nega- we focused our attention only on the manu-
tive samples were mistagged as positive. ally labeled positive samples (answers) that were
In addition, we performed significance tests uti- clicked by the users. Overall, 86.33% (ME-
lizing two-tailed paired t-test at 95% confidence combined), 88.85% (ME-KB) and 92.45% (ME-
interval on twenty samples. For this, we used N-KB) of these responses were correctly pre-
only the top three configurations in Table 3 and dicted. In light of that, one can conclude that
each sample was determined by using boostrap- (clicked and non-clicked) answers to definition
ping resampling. Each sample has the same size questions can be identified/predicted on the basis
of the original test corpus. Overall, the tests im- of users click patterns across query logs.
plied that all pairs were statistically different from From the viewpoint of search engines, web
each other. snippets are computed off-line, in general. In
In summary, the results show that both negative so doing, some methods select the spans of text
examples and combining positive examples from bearing query terms with the potential of putting
heterogeneous sources are indispensable to tackle the document on top of the rank (Turpin et al.,
any class of text. However, it is vital to lessen the 2007; Tsegay et al., 2009). This helps to create an
noise in non-KB data, since this causes a more abridged version of the document that can quickly
adverse effect on the performance. Given the up- produce the snippet. This has to do with the trade-
perbound in accuracy, our outcomes indicate that off between storage capacity, indexing, and re-
cleanness and quality are more important than the trieval speed. Ergo, our technique can help to de-

106
termine whether or not a span of text is worth ex- this implies that these tools have to be re-trained
panding, or in some cases whether or not it should to cope with web-snippets.
be included in the snippet view of the document.
In our instructive snippet, we now might have: Acknowledgements
Benjamin Millepied / News &amp; This work was partially supported by R&D
Biography - MashCeleb project FONDEF D09I1185. We also thank our
Benjamin Millepied (born 1977) is a
principal dancer at New York City Ballet
reviewers for their interesting comments, which
and a ballet choreographer of helped us to make this work better.
international reputation. Millepied was
born in Bordeaux, France. His...
References
Improving the results of informational (e.g.,
definition) queries, especially of less frequent I. Androutsopoulos and D. Galanis. 2005. A prac-
tically Unsupervised Learning Method to Identify
ones, is key for competing commercial search
Single-Snippet Answers to Definition Questions on
engines as they are embodied in the non- the web. In HLT/EMNLP, pages 323330.
navigational tail where these engines differ the R. Baeza-Yates and B. Ribeiro-Neto. 1999. Modern
most (Zaragoza et al., 2010). Information Retrieval. Addison Wesley.
A. Broder. 2002. A Taxonomy of Web Search. SIGIR
6 Conclusions Forum, 36:310, September.
Y. Chen, M. Zhon, and S. Wang. 2006. Reranking An-
This work investigates into the click behavior of swers for Definitional QA Using Language Model-
commercial search engine users regarding defi- ing. In Coling/ACL-2006, pages 10811088.
nition questions. These behaviour patterns are H. Cui, K. Li, R. Sun, T.-S. Chua, and M.-Y. Kan.
then exploited as a corpus acquisition technique 2004. National University of Singapore at the
for definition QA, which offers the advantage of TREC 13 Question Answering Main Task. In Pro-
encompassing positive samples from heterogo- ceedings of TREC 2004. NIST.
neous sources. In contrast, negative examples Georges E. Dupret and Benjamin Piwowarski. 2008.
are obtained in conformity to redundancy pat- A user browsing model to predict search engine
click data from past observations. In SIGIR 08,
terns across snippets, which are returned by the
pages 331338.
search engine when processing several definition
Ismail Fahmi and Gosse Bouma. 2006. Learning to
queries. The effectiveness of these patterns, and Identify Definitions using Syntactic Features. In
hence of the obtained corpus, was tested by means Proceedings of the Workshop on Learning Struc-
of two models different in nature, where both tured Information in Natural Language Applica-
were capable of achieving an accuracy higher than tions.
70%. Donghui Feng, Deepak Ravichandran, and Eduard H.
As a future work, we envision that answers de- Hovy. 2006. Mining and Re-ranking for Answering
tected by our strategy can aid in determining some Biographical Queries on the Web. In AAAI.
Aaron Fernandes. 2004. Answering Definitional
query expansion terms, and thus to devise some
Questions before they are Asked. Masters thesis,
relevance feedback methods that can bring about Massachusetts Institute of Technology.
an improvement in terms of the recall of answers. A. Figueroa and J. Atkinson. 2009. Using Depen-
Along the same lines, it can cooperate on the vi- dency Paths For Answering Definition Questions on
sualization of the results by highlighting and/or The Web. In WEBIST 2009, pages 643650.
extending truncated answers, that is more infor- Alejandro Figueroa. 2010. Finding Answers to Defini-
mative snippets, which is one of the holy grail of tion Questions on the Web. Phd-thesis, Universitaet
search operators, especially when processing in- des Saarlandes, 7.
formational queries. K. Han, Y. Song, and H. Rim. 2006. Probabilis-
tic Model for Definitional Question Answering. In
NLP tools (e.g., parsers and name entity recog-
Proceedings of SIGIR 2006, pages 212219.
nizers) can also be exploited for designing better
Shihao Ji, Ke Zhou, Ciya Liao, Zhaohui Zheng, Gui-
training data filters and more discriminative fea- Rong Xue, Olivier Chapelle, Gordon Sun, and
tures for our models that can assist in enhanc- Hongyuan Zha. 2009. Global ranking by exploit-
ing the performance, cf. (Surdeanu et al., 2008; ing user clicks. In Proceedings of the 32nd inter-
Figueroa, 2010; Surdeanu et al., 2011). However, national ACM SIGIR conference on Research and

107
development in information retrieval, SIGIR 09, Yohannes Tsegay, Simon J. Puglisi, Andrew Turpin,
pages 3542, New York, NY, USA. ACM. and Justin Zobel. 2009. Document compaction
B. Katz, M. Bilotti, S. Felshin, A. Fernandes, for efficient query biased snippet generation. In
W. Hildebrandt, R. Katzir, J. Lin, D. Loreto, Proceedings of the 31th European Conference on
G. Marton, F. Mora, and O. Uzuner. 2004. An- IR Research on Advances in Information Retrieval,
swering multiple questions on a topic from hetero- ECIR 09, pages 509520, Berlin, Heidelberg.
geneous resources. In Proceedings of TREC 2004. Springer-Verlag.
NIST. Andrew Turpin, Yohannes Tsegay, David Hawking,
B. Katz, S. Felshin, G. Marton, F. Mora, Y. K. Shen, and Hugh E. Williams. 2007. Fast generation of
G. Zaccak, A. Ammar, E. Eisner, A. Turgut, and result snippets in web search. In Proceedings of
L. Brown Westrick. 2007. CSAIL at TREC 2007 the 30th annual international ACM SIGIR confer-
Question Answering. In Proceedings of TREC ence on Research and development in information
2007. NIST. retrieval, SIGIR 07, pages 127134, New York,
Jae Hong Kil, Levon Lloyd, and Steven Skiena. 2005. NY, USA. ACM.
Question Answering with Lydia (TREC 2005 QA Eline Westerhout. 2009. Extraction of definitions us-
track). In Proceedings of TREC 2005. NIST. ing grammar-enhanced machine learning. In Pro-
ceedings of the EACL 2009 Student Research Work-
U. Lee, Z. Liu, and J. Cho. 2005. Automatic Iden-
shop, pages 8896.
tification of User Goals in Web Search. In Pro-
ceedings of the 14th WWW conference, WWW 05, Jinxi Xu, Ana Licuanan, and Ralph Weischedel. 2003.
pages 391400. TREC2003 QA at BBN: Answering Definitional
Questions. In Proceedings of TREC 2003, pages
S. Miliaraki and I. Androutsopoulos. 2004. Learn-
98106. NIST.
ing to identify single-snippet answers to definition
J. Xu, Y. Cao, H. Li, and M. Zhao. 2005. Ranking
questions. In COLING 04, pages 13601366.
Definitions with Supervised Learning Methods. In
Roberto Navigli and Paola Velardi. 2010. WWW2005, pages 811819.
LearningWord-Class Lattices for Definition
Jingfang Xu, Chuanliang Chen, Gu Xu, Hang Li, and
and Hypernym Extraction. In Proceedings of
Elbio Renato Torres Abib. 2010. Improving qual-
the 48th Annual Meeting of the Association for
ity of training data for learning to rank using click-
Computational Linguistics (ACL 2010).
through data. In Proceedings of the third ACM in-
Filip Radlinski, Martin Szummer, and Nick Craswell. ternational conference on Web search and data min-
2010. Inferring query intent from reformulations ing, WSDM 10, pages 171180, New York, NY,
and clicks. In Proceedings of the 19th international USA. ACM.
conference on World wide web, WWW 10, pages H. Zaragoza, B. Barla Cambazoglu, and R. Baeza-
11711172, New York, NY, USA. ACM. Yates. 2010. We Search Solved? All Result Rank-
Daniel E. Rose and Danny Levinson. 2004. Un- ings the Same? In Proceedings of CKIM10, pages
derstanding User Goals in Web Search. In WWW, 529538.
pages 1319. Zhushuo Zhang, Yaqian Zhou, Xuanjing Huang, and
B. Sacaleanu, G. Neumann, and C. Spurk. 2008. Lide Wu. 2005. Answering Definition Questions
DFKI-LT at QA@CLEF 2008. In In Working Notes Using Web Knowledge Bases. In Proceedings of
for the CLEF 2008 Workshop. IJCNLP 2005, pages 498506.
Nico Schlaefer, P. Gieselmann, and Guido Sautter.
2006. The Ephyra QA System at TREC 2006. In
Proceedings of TREC 2006. NIST.
Nico Schlaefer, Jeongwoo Ko, Justin Betteridge,
Guido Sautter, Manas Pathak, and Eric Nyberg.
2007. Semantic Extensions of the Ephyra QA Sys-
tem for TREC 2007. In Proceedings of TREC 2007.
NIST.
Mihai Surdeanu, Massimiliano Ciaramita, and Hugo
Zaragoza. 2008. Learning to Rank Answers on
Large Online QA Collections. In Proceedings of the
46th Annual Meeting of the Association for Compu-
tational Linguistics (ACL 2008), pages 719727.
Mihai Surdeanu, Massimiliano Ciaramita, and Hugo
Zaragoza. 2011. Learning to rank answers to non-
factoid questions from web collections. Computa-
tional Linguistics, 37:351383.

108
Adaptation of Statistical Machine Translation Model for Cross-Lingual
Information Retrieval in a Service Context
Vassilina Nikoulina Bogomil Kovachev
Xerox Research Center Europe Informatics Institute
vassilina.nikoulina@xrce.xerox.com University of Amsterdam
B.K.Kovachev@uva.nl

Nikolaos Lagos Christof Monz


Xerox Research Center Europe Informatics Institute
nikolaos.lagos@xrce.xerox.com University of Amsterdam
C.Monz@uva.nl

Abstract to the undelying IR system used and without ac-


cessing, at translation time, the content providers
This work proposes to adapt an existing document set. Keeping in mind these constraints,
general SMT model for the task of translat- we present two approaches on query translation
ing queries that are subsequently going to optimisation.
be used to retrieve information from a tar-
One of the important observations done dur-
get language collection. In the scenario that
we focus on access to the document collec- ing the CLEF 2009 campaign (Ferro and Peters,
tion itself is not available and changes to 2009) related to CLIR was that the usage of Sta-
the IR model are not possible. We propose tistical Machine Translation (SMT) systems (eg.
two ways to achieve the adaptation effect Google Translate) for query translation led to
and both of them are aimed at tuning pa- important improvements in the cross-lingual re-
rameter weights on a set of parallel queries. trieval performance (the best CLIR performance
The first approach is via a standard tuning
increased from 55% of the monolingual baseline
procedure optimizing for BLEU score and
the second one is via a reranking approach in 2008 to more than 90% in 2009 for French
optimizing for MAP score. We also extend and German target languages). However, general-
the second approach by using syntax-based purpose SMT systems are not necessarily adapted
features. Our experiments show improve- for query translation. That is because SMT sys-
ments of 1-2.5 in terms of MAP score over tems trained on a corpus of standard parallel
the retrieval with the non-adapted transla- phrases take into account the phrase structure im-
tion. We show that these improvements are plicitly. The structure of queries is very differ-
due both to the integration of the adapta-
ent from the standard phrase structure: queries are
tion and syntax-features for the query trans-
lation task. very short and the word order might be different
than the typical full phrase one. This problem can
be seen as a problem of genre adaptation for SMT,
1 Introduction where the genre is query.
To our knowledge, no suitable corpora of par-
Cross Lingual Information Retrieval (CLIR) is an
allel queries is available to train an adapted SMT
important feature for any digital content provider
system. Small corpora of parallel queries1 how-
in todays multilingual environment. However,
ever can be obtained (eg. CLEF tracks) or man-
many of the content providers are not willing to
ually created. We suggest to use such corpora
change existing well-established document index-
in order to adapt the SMT model parameters for
ing and search tools, nor to provide access to
query translation. In our approach the parameters
their document collection by a third-party exter-
of the SMT models are optimized on the basis of
nal service. The work presented in this paper as-
the parallel queries set. This is achieved either di-
sumes such a context of use, where a query trans-
rectly in the SMT system using the MERT (Mini-
lation service allows translating queries posed to
mum Error Rate Training) algorithm and optimiz-
the search engine of a content provider into sev-
1
eral target languages, without requiring changes Insufficient for a full SMT system training (500 entries)

109
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 109119,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
ing according to the BLEU2 (Papineni et al., 2001) ror on the training data.
score, or via reranking the Nbest translation can- To our knowledge, existing work that use MT-
didates generated by a baseline system based on based techniques for query translation use an out-
new parameters (and possibly new features) that of-the-box MT system, without adapting it for
aim to optimize a retrieval metric. query translation in particular (Jones et al., 1999;
It is important to note that both of the pro- Wu et al., 2008) (although some query expan-
posed approaches allow keeping the MT system sion techniques might be applied to the produced
independent of the document collection and in- translation afterwards (Wu and He, 2010)).
dexing, and thus suitable for a query translation There is a number of works done for do-
service. These two approaches can also be com- main adaptation in Statistical Machine Transla-
bined by using the model produced with the first tion. However, we want to distinguish between
approach as a baseline that produces the Nbest list genre and domain adaptation in this work. Gen-
of translations that is then given to the reranking erally, genre can be seen as a sub-problem of do-
approach. main. Thus, we consider genre to be the general
The remainder of this paper is organized as fol- style of the text e.g. conversation, news, blog,
lows. We first present related work addressing the query (responsible mostly for the text structure)
problem of query translation. We then describe while the domain reflects more what the text is
two approaches towards adapting an SMT system about eg. social science, healthcare, history, so
to the query-genre: tuning the SMT system on a domain adaptation involves lexical disambigua-
parallel set of queries (Section 3.1) and adapting tion and extra lexical coverage problems. To our
machine translation via the reranking framework knowledge, there is not much work addressing ex-
(Section 3.2). We then present our experimental plicitly the problem of genre adaptation for SMT.
settings and results (Section 4) and conclude in Some work done on domain adaptation could be
section 5. applied to genre adaptation, such as incorporating
available in-domain corpora in the SMT model:
2 Related work either monolingual (Bertoldi and Federico, 2009;
Wu et al., 2008; Zhao et al., 2004; Koehn and
We may distinguish two main groups of ap- Schroeder, 2007), or small parallel data used for
proaches to CLIR: document translation and tuning the SMT parameters (Zheng et al., 2010;
query translation. We concentrate on the second Pecina et al., 2011).
group which is more relevant to our settings. The
standard query translation methods use different 3 Our approach
translation resources such as bilingual dictionar-
This work is based on the hypothesis that the
ies, parallel corpora and/or machine translation.
general-purpose SMT system needs to be adapted
The aspect of disambiguation is important for the
for query translation. Although in (Ferro and
first two techniques.
Peters, 2009) it has been mentioned that using
Different methods were proposed to deal with
Google translate (general-purpose MT) for query
disambiguation issues, often relying on the docu-
translation allowed to CLEF participants to obtain
ment collection or embedding the translation step
the best CLIR performance, there is still 10% gap
directly into the retrieval model (Hiemstra and
between monolingual and cross-lingual IR. We
Jong, 1999; Berger et al., 1999; Kraaij et al.,
believe that, as in (Clinchant and Renders, 2007),
2003). Other methods rely on external resources
more adapted query translation, possibly further
like query logs (Gao et al., 2010), Wikipedia (Ja-
combined with query expansion techniques, can
didinejad and Mahmoudi, 2009) or the web (Nie
lead to improved retrieval.
and Chen, 2002; Hu et al., 2008). (Gao et al.,
The problem of the SMT adaptation for query-
2006) proposes syntax-based translation models
genre translation has different quality aspects.
to deal with the disambiguation issues (NP-based,
On the one hand, we want our model to pro-
dependency-based). The candidate translations
duce a good translation (well-formed and trans-
proposed by these models are then reranked with
mitting the information contained in the source
the model learned to minimize the translation er-
query) of an input query. On the other hand, we
2
Standard MT evaluation metric want to obtain good retrieval performance using

110
the proposed translation. These two aspects are Our hypothesis is that the impact of different
not necessarily correlated: a bag-of-word transla- features should be different depending on whether
tion can lead to good retrieval performance, even we translate a full sentence, or a query-genre en-
though it wont be syntactically well-formed; at try. Thus, one would expect that in the case
the same time a well-formed translation can lead of query-genre the language model or the distor-
to worse retrieval if the wrong lexical choice is tion features should get less importance than in
done. Moreover, often the retrieval demands some the case of the full-sentence translation. MERT
linguistic preprocessing (eg. lemmatisation, PoS tuning on a genre-adapted parallel corpus should
tagging) which in interaction with badly-formed leverage this information from the data, adapting
translations might bring some noise. the SMT model to the query-genre. We would
A couple of works studied the correlation be- also like to note that the tuning approach (pro-
tween the standard MT evaluation metrics and posed for domain adaptation by (Zheng et al.,
the retrieval precision. Thus, (Fujii et al., 2009) 2010)) seems to be more appropriate for genre
showed a good correlation of the BLEU scores adaptation than for domain adaptation where the
with the MAP scores for Cross-Lingual Patent problem of lexical ambiguity is encoded in the
Retrieval. However, the topics in patent search translation model and re-weighting the main fea-
(long and well structured) are very different from tures might not be sufficient.
standard queries. (Kettunen, 2009) also found a We use the MERT implementation provided
pretty high correlation ( 0.8 0.9) between stan- with the Moses toolkit with default settings. Our
dard MT evaluation metrics (METEOR(Banerjee assumption is that this procedure although not ex-
and Lavie, 2005), BLEU, NIST(Doddington, plicitly aimed at improving retrieval performance
2002)) and retrieval precision for long queries. will nevertheless lead to better query transla-
However, the same work shows that the correla- tions when compared to the baseline. The results
tion decreases ( 0.6 0.7) for short queries. of this apporach allow us also to observe whether
In this paper we propose two approaches to and to what extent changes in BLEU scores are
SMT adaptation for queries. The first one op- correlated to changes in MAP scores.
timizes BLEU, while the second one optimizes
Mean Average Precision (MAP), a standard met- 3.2 Reranking framework for query
ric in information retrieval. Well address the is- translation
sue of the correlation between BLEU and MAP in The second approach addresses the retrieval qual-
Section 4. ity problem. An SMT system is usually trained to
Both of the proposed approaches rely on the optimize the quality of the translation (eg. BLEU
phrase-based SMT (PBMT) model (Koehn et al., score for SMT), which is not necessarily corre-
2003) implemented in the Open Source SMT lated with the retrieval quality (especially for the
toolkit MOSES (Koehn et al., 2007). short queries). Thus, for example, the word or-
der which is crucial for translation quality (and is
3.1 Tuning for genre adaptation taken into account by most MT evaluation met-
First, we propose to adapt the PBMT model by rics) is often ignored by IR models. Our second
tuning the models weights on a parallel set of approach follows (Nie, 2010, pp.106) argument
queries. This approach addresses the first as- that the translation problem is an integral part
pect of the problem, which is producing a good of the whole CLIR problem, and unified CLIR
translation. The PBMT model combines differ- models integrating translation should be defined.
ent types of features via a log-linear model. The We propose integrating the IR metric (MAP) into
standard features include (Koehn, 2010, Chapter the translation model optimisation step via the
5): language model, word penalty, distortion, dif- reranking framework.
ferent translation models, etc. The weights of Previous attempts to apply the reranking ap-
these features are learned during the tuning step proach to SMT did not show significant improve-
with the MERT (Och, 2003) algorithm. Roughly ments in terms of MT evaluation metrics (Och
the MERT algorithm tunes feature weights one by et al., 2003; Nikoulina and Dymetman, 2008).
one and optimizes them according to the BLEU One of the reasons being the poor diversity of the
score obtained. Nbest list of the translations. However, we be-

111
lieve that this approach has more potential in the defined as a weighted linear combination of
context of query translation. features: t() = arg maxtGEN (q) F (t)
First of all the average query length is 5 words, As shown above the best translation is selected ac-
which means that the Nbest list of the translations cording to features weights . In order to learn
is more diverse than in the case of general phrase the weights maximizing the retrieval perfor-
translation (average length 25-30 words). mance, an appropriate annotated training set has
Moreover, the retrieval precision is more natu- to be created. We use the CLEF tracks to create
rally integrated into the reranking framework than the training set. The retrieval scores annotations
standard MT evaluation metrics such as BLEU. are based on the document relevance annotations
The main reason is that the notion of Average Re- performed by human annotators during the CLEF
trieval Precision is well defined for a single query campaign.
translation, while BLEU is defined on the corpus The annotated training set is created out of
level and correlates poorly with human quality queries {q1 , ..., qK } with an Nbest list of trans-
judgements for the individual translations (Specia lations GEN (qi ) of each query qi , i {1..K} as
et al., 2009; Callison-Burch et al., 2009). follows:
Finally, the reranking framework allows a lot
of flexibility. Thus, it allows enriching the base- A list of N (we take N = 1000) translations
line translation model with new complex features (GEN (qi )) is produced by the baseline MT
which might be difficult to introduce into the model for each query qi , i = 1..K.
translation model directly. Each translation t GEN (qi ) is used
Other works applied the reranking framework to perform a retrieval from a target docu-
to different NLP tasks such as Named Entities ment collection, and an Average Precision
Extraction (Collins, 2001), parsing (Collins and score (AP (t)) is computed for each t
Roark, 2004), and language modelling (Roark et GEN (qi ) by comparing its retrieval to the
al., 2004). Most of these works used the reranking relevance annotations done during the CLEF
framework to combine generative and discrimina- campaign.
tive methods when both approaches aim at solv-
ing the same problem: the generative model pro- The weights are learned with the objective of
duces a set of hypotheses, and the best hypoth- maximizing MAP for all the queries of the train-
esis is chosen afterwards via the discriminative ing set, and, therefore, are optimized for retrieval
reranking model, which allows to enrich the base- quality.
line model with the new complex and heteroge- The weights optimization is done with
neous features. We suggest using the reranking the Margin Infused Relaxed Algorithm
framework to combine two different tasks: Ma- (MIRA)(Crammer and Singer, 2003), which
chine Translation and Cross-lingual Information was applied to SMT by (Watanabe et al., 2007;
Retrieval. In this context the reranking framework Chiang et al., 2008). MIRA is an online learning
doesnt only allow enriching the baseline transla- algorithm where each weights update is done to
tion model but also performing training using a keep the new weights as close as possible to the
more appropriate evaluation metric. old weights (first term), and score oracle trans-
lation (the translation giving the best retrieval
3.2.1 Reranking training score : ti = arg maxt AP (t)) higher than each
Generally, the reranking framework can be re- non-oracle translation (tij ) by a margin at least as
sumed in the following steps : wide as the loss lij (second term):
0
1. The baseline (generic-purpose) MT system = min0 21 k k2 +
generates a list of candidate translations  0

C K ) F (t )
P
i=1 max j=1..N lij (F (ti ij
GEN (q) for each query q;
The loss lij is defined as the difference in the re-
2. A vector of features F (t) is assigned to each
trieval average precision between the oracle and
translation t GEN (q);
non-oracle translations: lij = AP (ti ) AP (tij ).
3. The best translation t is chosen as the one C is the regularization parameter which is chosen
maximizing the translation score, which is via 5-fold cross-validation.

112
3.2.2 Features PoS mapping features. The goal of the PoS
One of the advantages of the reranking frame- mapping features is to control the correspondence
work is that new complex features can be easily of Part Of Speech Tags between an input query
integrated. We suggest to enrich the reranking and its translation. As the coupling features, the
model with different syntax-based features, such PoS mapping features rely on the word align-
as: ments between the source sentence and its trans-
lation3 . A vector of sparse features is introduced
features relying on dependency structures: where each component corresponds to a pair of
called therein coupling features (proposed by PoS tags aligned in the training data. We intro-
(Nikoulina and Dymetman, 2008)); duce a generic PoS map variant, which counts a
number of occurrences of a specific pair of PoS
features relying on Part of Speech Tagging: tags, and lexical PoS map variant, which weights
called therein PoS mapping features. down these pairs by a lexical alignment score
(p(s|t) or p(t|s)).
By integrating the syntax-based features we
have a double goal: showing the potential of 4 Experiments
the reranking framework with more complex fea- 4.1 Experimental basis
tures, and examining whether the integration of
syntactic information could be useful for query 4.1.1 Data
translation. To simulate parallel query data we used trans-
lation equivalent CLEF topics. The data set used
Coupling features. The goal of the coupling for the first approach consists of the CLEF topic
features is to measure the similarity between data from the following years and tasks: AdHoc-
source and target dependency structures. The ini- main track from 2000 to 2008; CLEF AdHoc-
tial hypothesis is that a better translation should TEL track 2008; Domain Specific tracks from
have a dependency structure closer to the one of 2000 to 2008; CLEF robust tracks 2007 and 2008;
the source query. GeoCLEf tracks 2005-2007. To avoid the issue of
In this work we experiment with two dif- overlapping topics we removed duplicates. The
ferent coupling variants proposed in (Nikoulina created parallel queries set contained 500 700
and Dymetman, 2008), namely, Lexicalised and parallel entries (depending on the language pair,
Label-specific coupling features. Table 1) and was used for Moses parameters tun-
The generic coupling features are based on ing.
the notion of rectangles that are of the follow- In order to create the training set for the rerank-
ing type : ((s1 , ds12 , s2 ), (t1 , dt12 , t2 )), where ing approach, we need to have access to the rele-
ds12 is an edge between source words s1 and s2 , vance judgements. We didnt have access to all
dt12 is an edge between target words t1 and t2 , relevance judgements of the previously desribed
s1 is aligned with t1 and s2 is aligned with t2 . tracks. Thus we used only a subset of the previ-
Lexicalised features take into account the qual- ously extracted parallel set, which includes CLEF
ity of lexical alignment, by weighting each rect- 2000-2008 topics from the AdHoc-main, AdHoc-
angle (s1 , s2 , t1 , t2 ) by a probability of align- TEL and GeoCLEF tracks.
ing s1 to t1 and s2 to t2 (eg. p(s1 |t1 )p(s2 |t2 ) or The number of queries obtained altogether is
p(t1 |s1 )p(t2 |s2 )). shown in (Table 1).
The Label-Specific features take into account
4.1.2 Baseline
the nature of the aligned dependencies. Thus, the
rectangles of the form ((s1, subj, s2), (t1, subj, We tested our approaches on the CLEF AdHoc-
t2)) will get more weight than a rectangle ((s1, TEL 2009 task (50 topics). This task dealt
subj, s2), (t1, nmod, t2)). The importance of with monolingual and cross-lingual search in a
each rectangle is learned on the parallel anno- library catalog. The monolingual retrieval is
tated corpus by introducing a collection of Label- 3
This alignment can be either produced by a toolkit like
Specific coupling features, each for a specific pair GIZA++(Och and Ney, 2003) or obtained directly by a sys-
of source label and target label. tem that produced the Nbest list of the translations (Moses).

113
Language pair Number of queries The 5best retrieval can be seen as a sort of query
Total queries expansion, without accessing the document col-
En - Fr, Fr - En 470 lection or any external resources.
En - De, De - En 714 Given that the query length is shorter than for a
Annotated queries standard sentence, the 4-gramm BLEU (used for
En - Fr, Fr - En 400 standart MT evaluation) might not be able to cap-
En - De, De - En 350 ture the difference between the translations (eg.
English-German 4-gramm BLEU is equal to 0 for
Table 1: Top: total number of parallel queries gathered our task). For that reason we report both 3- and
from all the CLEF tasks (size of the tuning set). Bot- 4-gramm BLEU scores.
tom: number of queries extracted from the tasks for Note, that the French-English baseline retrieval
which the human relevance judgements were availble
quality is much better than the German-English.
(size of the reranking training set).
This is probably due to the fact that our German-
English translation system doesnt use any de-
performed with the lemur4 toolkit (Ogilvie and coumpounding, which results into many non-
Callan, 2001). The preprocessing includes lem- translated words.
matisation (with the Xerox Incremental Parser-
XIP (At-Mokhtar et al., 2002)) and filtering out 4.2 Results
the function words (based on XIP PoS tagging). We performed the query-genre adaptation ex-
Table 2 shows the performance of the monolin- periments for English-French, French-English,
gual retrieval model for each collection. The German-English and English-German language
monolingual retrieval results are comparable to pairs.
the CLEF AdHoc-TEL 2009 participants (Ferro Ideally, we would have liked to combine the
and Peters, 2009). Let us note here that it is not two approaches we proposed: use the query-
the case for our CLIR results since we didnt ex- genre-tuned model to produce the Nbest list
ploit the fact that each of the collections could ac- which is then reranked to optimize the MAP
tually contain the entries in a language other than score. However, it was not possible in our exper-
the official language of the collection. imental settings due to the small amount of train-
The cross-lingual retrieval is performed as fol- ing data available. We thus simply compare these
lows : two approaches to a baseline approach and com-
ment on their respective performance.
the input query (eg. in English) is first trans-
lated into the language of the collection (eg. 4.2.1 Query-genre tuning approach
German);
For the CLEF-tuning experiments we used the
this translation is used to search the target same translation model and language model as for
collection (eg. Austrian National Library for the baseline (Europarl-based). The weights were
German ) . then tuned on the CLEF topics described in sec-
tion 4.1.1. We then tested the system obtained on
The baseline translation is produced with 50 parallel queries from the CLEF AdHoc-TEL
Moses trained on Europarl. Table 2 reports the 2009 task.
baseline performance both in terms of MT evalu- Table 3 describes the results of the evalua-
ation metrics (BLEU) and Information Retrieval tion. We observe consistent 1-best MAP improve-
evaluation metric MAP (Mean Average Preci- ments, but unstable BLEU (3-gramm) (improve-
sion). ments for English-German, and degradation for
The 1best MAP score corresponds to the case other language pairs), although one would have
when the single translation is proposed for the expected BLEU to be improved in this experi-
retrieval by the query translation model. 5best mental setting given that BLEU was the objective
MAP score corresponds to the case when the 5 function for MERT. These results, on one side,
top translations proposed by the translation ser- confirm the remark of (Kettunen, 2009) that there
vice are concatenated and used for the retrieval. is a correlation (although low) between BLEU
4
http://www.lemurproject.org/ and MAP scores. The unstable BLEU scores

114
MAP MAP BLEU BLEU
MAP
1-best 5-best 4-gramm 3-gramm
Monolingual IR Bilingual IR
French-English 0.1828 0.2186 0.1199 0.1568
English 0.3159
German-English 0.0941 0.0942 0.2351 0.2923
French 0.2386 English-French 0.1504 0.1543 0.2863 0.3423
German 0.2162 English-German 0.1009 0.1157 0.0000 0.1218

Table 2: Baseline MAP scores for monolingual and bilingual CLEF AdHoc TEL 2009 task.

MAP MAP BLEU BLEU


1-best 5-best 4-gramm 3-gramm
Fr-En 0.1954 0.2229 0.1062 0.1489
De-En 0.1018 0.1078 0.2240 0.2486
En-Fr 0.1611 0.1516 0.2072 0.2908
En-De 0.1062 0.1132 0.0000 0.1924

Table 3: BLEU and MAP performance on CLEF AdHoc TEL 2009 task for the genre-tuned model.

might also be explained by the small size of the structure: mostly content words and fewer func-
test set (compared to a standard test set of 1000 tion words when compared to the full sentence.
full-sentences). The language model weight is consistently
Secondly, we looked at the weights of the fea- though not drastically smaller when tuning with
tures both in the baseline model (Europarl-tuned) CLEF data. We suppose that this is due to the
and in the adapted model (CLEF-tuned), shown in fact that a Europarl-base language model is not
Table 4. We are unsure how suitable the sizes of the best choice for translating query data.
the CLEF tuning sets are, especially for the pairs 4.2.2 Reranking approach
involving English and French. Nevertheless we
The reranking experiments include different
do observe and comment on some patterns.
features combinations. First, we experiment with
For the pairs involving English and German the Moses features only in order to make this ap-
the distortion weight is much higher when tuning proach comparable with the first one. Secondly,
with CLEF data compared to tuning with Europarl we compare different syntax-based features com-
data. The picture is reversed when looking at the binations, as described in section 3.2.2. Thus, we
two pairs involving English and French. This is compare the following reranking models (defined
to be expected if we interpret a high distortion by the feature set): moses, lex (lexical coupling
weight as follows: it is not encouraged to place + moses features), lab (label-specific coupling +
source words that are near to each other far away moses features), posmaplex (lexical PoS mapping
from each other in the translation. Indeed, the lo- + moses features ), lab-lex (label-specific cou-
cal reorderings are much more frequent between pling + lexical coupling + moses features), lab-
English and French (e.g. white house = maison lex-posmap (label-specific coupling + lexical cou-
blanche), while the long-distance reorderings are pling features + generic PoS mapping). To reduce
more typcal between English and German. the size of feature-functions vectors we take only
The word penalty is consistenly higher over all the 20 most frequent features in the training data
pairs when tuning with CLEF data compared to for Label-specific coupling and PoS mapping fea-
tuning with Europarl data. We could see an ex- tures. The computation of the syntax features is
planation for this pattern in the smaller size of based on the rule-based XIP parser, where some
the CLEF sentences if we interpret higher word heuristics specific to query processing have been
penalty as a preference for shorter translations. integrated into English and French (but not Ger-
This can be explained both with the smaller aver- man) grammars (Brun et al., 2012).
age size of the queries and with the specific query The results of these experiments are illustrated

115
Lng pair Tune set DW LM (f |e) lex(f |e) (e|f ) lex(e|f ) PP WP
Europarl 0.0801 0.1397 0.0431 0.0625 0.1463 0.0638 -0.0670 -0.3975
Fr-En
CLEF 0.0015 0.0795 -0.0046 0.0348 0.1977 0.0208 -0.2904 0.3707
Europarl 0.0588 0.1341 0.0380 0.0181 0.1382 0.0398 -0.0904 -0.4822
De-En
CLEF 0.3568 0.1151 0.1168 0.0549 0.0932 0.0805 0.0391 -0.1434
Europarl 0.0789 0.1373 0.0002 0.0766 0.1798 0.0293 -0.0978 -0.4002
En-Fr
CLEF 0.0322 0.1251 0.0350 0.1023 0.0534 0.0365 -0.3182 -0.2972
Europarl 0.0584 0.1396 0.0092 0.0821 0.1823 0.0437 -0.1613 -0.3233
En-De
CLEF 0.3451 0.1001 0.0248 0.0872 0.2629 0.0153 -0.0431 0.1214

Table 4: Feature weights for the query-genre tuned model. Abbreviations: DW - distortion weight, LM - language
model weight, PP - phrase penalty, WP - word penalty, -phrase translation probability, lex-lexical weighting.

Query Example MAP bleu1 German which can be explained by the fact that
Src1 Weibliche Martyrer the German grammar used for query processing
Ref Female Martyrs
was not adapted for queries as opposed to English
T1 female martyrs 0.07 1
T2 Women martyr 0.4 0
and French grammars. However, we do not ob-
Src 2 Genmanipulation am serve the same tendency for BLEU score, where
Menschen only a few of the adapted models outperform the
Ref Human Gene Manipula- baseline, which confirms the hypothesis of the
tion low correlation between BLEU and MAP scores
T1 On the genetic manipula- 0.044 0.167 in these settings. Table 5 gives some examples of
tion of people the queries translations before (T1) and after (T2)
T2 genetic manipulation of 0.069 0.286
reranking. These examples also illustrate differ-
the human being
Src 3 Arbeitsrecht in der Eu-
ent types of disagreement between MAP and 1-
ropaischen Union gramm BLEU5 score.
Ref European Union Labour The results for English-German and English-
Laws French look more confusing. This can be partly
T1 Labour law in the Euro- 0.015 0.5 due to the more rich morphology of the target lan-
pean Union guages which may create more noise in the syn-
T2 labour legislation in the 0.036 0.5
tax structure. Reranking however improves over
European Union
the 1-best MAP baseline for English-German, and
Table 5: Some examples of queries translations (T1: 5-best MAP is also improved excluding the mod-
baseline, T2: after reranking with lab-lex), MAP and els involving PoS tagging for German (posmap,
1-gramm BLEU scores for German-English. posmaplex, lab-lex-posmap). The results for
English-French are more difficult to interpret. To
find out the reason of such a behavior, we looked
in Figure 1. To keep the figure more readable, at the translations. We observed the following to-
we report only on 3-gramm BLEU scores. When kenization problem for French: the apostrophe is
computing the 5best MAP score, the order in the systematically separated, e.g. d aujourd hui.
Nbest list is defined by a corresponding reranking This leads to both noisy pre-retrieval preprocess-
model. Each reranking model is illustrated by a ing (eg. d is tagged as a NOUN) and noisy syntax-
single horizontal red bar. We compare the rerank- based feature values, which might explain the un-
ing results to the baseline model (vertical line) and stable results.
also to the results of the first approach (yellow bar Finally, we can see that the syntax-based fea-
labelled MERT:moses) on the same figure. tures can be beneficial for the final retrieval qual-
First, we remark that the adapted models ity: the models with syntax features can outper-
(query-genre tuning and reranking) outperform form the model basd on the moses features only.
the baseline in terms of MAP (1best and 5 best) The syntax-based features leading to the most sta-
for French-English and German-English transla-
tions for most of the models. The only exception 5
The higher order BLEU scores are equal to 0 for most
is posmaplex model (based on PoS tagging) for of the individual translations.

116
Figure 1: Reranking results. The vertical line corresponds to the baseline scores. The lowest bar (MERT:moses,
in yellow): the results of the tuning approach, other bars(in red): the results of the reranking approach.

ble results seem to be lab-lex (combination of lex- of MAP is improved between 1-2.5 points. We
ical and label-specific coupling): it leads to the believe that the combination of these two meth-
best gains over 1-best and 5-best MAP for all lan- ods would be the most beneficial setting, although
guage pairs excluding English-French. This is a we were not able to prove this experimentally
surprising result given the fact that the underlying (due to the lack of training data). None of these
IR model doesnt take syntax into account in any methods require access to the document collec-
way. In our opinion, this is probably due to the tion at test time, and can be used in the context
interaction between the pre-retrieval preprocess- of a query translation service. The combination
ing (lemmatisation, PoS tagging) done with the of our adapted SMT model with other state-of-the
linguistic tools which might produce noisy results art CLIR techniques (eg. query expansion with
when applied to the SMT outputs. The rerank- PRF) will be explored in future work.
ing with syntax-based features allows to choose
a better-formed query for which the PoS tagging Acknowledgements
and lemmatisation tools produce less noise which This research was supported by the European
leads to a better retrieval. Unions ICT Policy Support Programme as part of
the Competitiveness and Innovation Framework
5 Conclusion Programme, CIP ICT-PSP under grant agreement
nr 250430 (Project GALATEAS).
In this work we proposed two methods for query-
genre adaptation of an SMT model: the first
method addressing the translation quality aspect References
and the second one the retrieval precision aspect. Salah At-Mokhtar, Jean-Pierre Chanod, and Claude
We have shown that CLIR performance in terms Roux. 2002. Robustness beyond shallowness: in-

117
cremental deep parsing. Natural Language Engi- Technology Research, pages 138145, San Diego,
neering, 8:121144, June. California. Morgan Kaufmann Publishers Inc.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: Nicola Ferro and Carol Peters. 2009. CLEF 2009
an automatic metric for MT evaluation with im- ad hoc track overview: TEL and persian tasks.
proved correlation with human judgments. In Pro- In Working Notes for the CLEF 2009 Workshop,
ceedings of the ACL Workshop on Intrinsic and Ex- Corfu, Greece.
trinsic Evaluation Measures for Machine Transla- Atsushi Fujii, Masao Utiyama, Mikio Yamamoto, and
tion and/or Summarization, pages 6572, Ann Ar- Takehito Utsuro. 2009. Evaluating effects of ma-
bor, Michigan, June. Association for Computational chine translation accuracy on cross-lingual patent
Linguistics. retrieval. In Proceedings of the 32nd international
Adam Berger, John Lafferty, and John La Erty. 1999. ACM SIGIR conference on Research and develop-
The weaver system for document retrieval. In In ment in information retrieval, SIGIR 09, pages
Proceedings of the Eighth Text REtrieval Confer- 674675.
ence (TREC-8, pages 163174. Jianfeng Gao, Jian-Yun Nie, and Ming Zhou. 2006.
Nicola Bertoldi and Marcello Federico. 2009. Do- Statistical query translation models for cross-
main adaptation for statistical machine translation language information retrieval. 5:323359, Decem-
with monolingual resources. In Proceedings of ber.
the Fourth Workshop on Statistical Machine Trans- Wei Gao, Cheng Niu, Jian-Yun Nie, Ming Zhou, Kam-
lation, pages 182189. Association for Computa- Fai Wong, and Hsiao-Wuen Hon. 2010. Exploit-
tional Linguistics. ing query logs for cross-lingual query suggestions.
Caroline Brun, Vassilina Nikoulina, and Nikolaos La- ACM Trans. Inf. Syst., 28(2).
gos. 2012. Linguistically-adapted structural query
Djoerd Hiemstra and Franciska de Jong. 1999. Dis-
annotation for digital libraries in the social sciences.
ambiguation strategies for cross-language informa-
In Proceedings of the 6th EACL Workshop on Lan-
tion retrieval. In Proceedings of the Third European
guage Technology for Cultural Heritage, Social Sci-
Conference on Research and Advanced Technology
ences, and Humanities, Avignon, France, April.
for Digital Libraries, pages 274293.
Chris Callison-Burch, Philipp Koehn, Christof Monz,
and Josh Schroeder. 2009. Findings of the 2009 Rong Hu, Weizhu Chen, Peng Bai, Yansheng Lu,
Workshop on Statistical Machine Translation. In Zheng Chen, and Qiang Yang. 2008. Web query
Proceedings of the Fourth Workshop on Statistical translation via web log mining. In Proceedings of
Machine Translation, pages 128, Athens, Greece, the 31st annual international ACM SIGIR confer-
March. Association for Computational Linguistics. ence on Research and development in information
David Chiang, Yuval Marton, and Philip Resnik. retrieval, SIGIR 08, pages 749750. ACM.
2008. Online large-margin training of syntactic and Amir Hossein Jadidinejad and Fariborz Mahmoudi.
structural translation features. In Proceedings of the 2009. Cross-language information retrieval us-
2008 Conference on Empirical Methods in Natural ing meta-language index construction and structural
Language Processing, pages 224233. Association queries. In Proceedings of the 10th cross-language
for Computational Linguistics. evaluation forum conference on Multilingual in-
Stephane Clinchant and Jean-Michel Renders. 2007. formation access evaluation: text retrieval experi-
Query translation through dictionary adaptation. In ments, CLEF09, pages 7077, Berlin, Heidelberg.
CLEF07, pages 182187. Springer-Verlag.
Michael Collins and Brian Roark. 2004. Incremental Gareth Jones, Sakai Tetsuya, Nigel Collier, Akira Ku-
parsing with the perceptron algorithm. In ACL 04: mano, and Kazuo Sumita. 1999. Exploring the
Proceedings of the 42nd Annual Meeting on Asso- use of machine translation resources for english-
ciation for Computational Linguistics. japanese cross-language information retrieval. In In
Michael Collins. 2001. Ranking algorithms for Proceedings of MT Summit VII Workshop on Ma-
named-entity extraction: boosting and the voted chine Translation for Cross Language Information
perceptron. In ACL02: Proceedings of the 40th Retrieval, pages 181188.
Annual Meeting on Association for Computational Kimmo Kettunen. 2009. Choosing the best mt pro-
Linguistics, pages 489496, Philadelphia, Pennsyl- grams for clir purposes can mt metrics be help-
vania. Association for Computational Linguistics. ful? In Proceedings of the 31th European Confer-
Koby Crammer and Yoram Singer. 2003. Ultracon- ence on IR Research on Advances in Information
servative online algorithms for multiclass problems. Retrieval, ECIR 09, pages 706712, Berlin, Hei-
Journal of Machine Learning Research, 3:951991. delberg. Springer-Verlag.
George Doddington. 2002. Automatic evaluation Philipp Koehn and Josh Schroeder. 2007. Experi-
of Machine Translation quality using n-gram co- ments in domain adaptation for statistical machine
occurrence statistics. In Proceedings of the sec- translation. In Proceedings of the Second Work-
ond international conference on Human Language shop on Statistical Machine Translation, StatMT

118
07, pages 224227. Association for Computational K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001.
Linguistics. Bleu: a method for automatic evaluation of machine
Philipp Koehn, Franz Josef Och, and Daniel Marcu. translation.
2003. Statistical phrase-based translation. In Pavel Pecina, Antonio Toral, Andy Way, Vassilis Pa-
NAACL 03: Proceedings of the 2003 Conference pavassiliou, Prokopis Prokopidis, and Maria Gi-
of the North American Chapter of the Association agkou. 2011. Towards using web-crawled data for
for Computational Linguistics on Human Language domain adaptation in statistical machine translation.
Technology, pages 4854, Morristown, NJ, USA. In Proceedings of the 15th Annual Conference of
Association for Computational Linguistics. the European Associtation for Machine Translation,
Philipp Koehn, Hieu Hoang, Alexandra Birch, pages 297304, Leuven, Belgium. European Asso-
Chris Callison-Burch, Marcello Federico, Nicola ciation for Machine Translation.
Bertoldi, Brooke Cowan, Wade Shen, Christine Brian Roark, Murat Saraclar, Michael Collins, and
Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Mark Johnson. 2004. Discriminative language
Alexandra Constantin, and Evan Herbst. 2007. modeling with conditional random fields and the
Moses: open source toolkit for statistical machine perceptron algorithm. In Proceedings of the 42nd
translation. In ACL 07: Proceedings of the 45th Annual Meeting of the Association for Computa-
Annual Meeting of the ACL on Interactive Poster tional Linguistics (ACL04), July.
and Demonstration Sessions, pages 177180. As- Lucia Specia, Marco Turchi, Nicola Cancedda, Marc
sociation for Computational Linguistics. Dymetman, and Nello Cristianini. 2009. Estimat-
Philip Koehn. 2010. Statistical Machine Translation. ing the sentence-level quality of machine translation
Cambridge University Press. systems. In Proceedings of the 13th Annual Confer-
ence of the EAMT, page 2835, Barcelona, Spain.
Wessel Kraaij, Jian-Yun Nie, and Michel Simard.
Taro Watanabe, Jun Suzuki, Hajime Tsukada, and
2003. Embedding web-based statistical trans-
Hideki Isozaki. 2007. Online large-margin train-
lation models in cross-language information re-
ing for statistical machine translation. In Proceed-
trieval. Computational Linguistiques, 29:381419,
ings of the 2007 Joint Conference on Empirical
September.
Methods in Natural Language Processing and Com-
Jian-yun Nie and Jiang Chen. 2002. Exploiting the putational Natural Language Learning (EMNLP-
web as parallel corpora for cross-language informa- CoNLL), pages 764773, Prague, Czech Republic.
tion retrieval. Web Intelligence, pages 218239. Association for Computational Linguistics.
Jian-Yun Nie. 2010. Cross-Language Information Re- Dan Wu and Daqing He. 2010. A study of query
trieval. Morgan & Claypool Publishers. translation using google machine translation sys-
Vassilina Nikoulina and Marc Dymetman. 2008. Ex- tem. Computational Intelligence and Software En-
periments in discriminating phrase-based transla- gineering (CiSE).
tions on the basis of syntactic coupling features. In Hua Wu, Haifeng Wang, and Chengqing Zong. 2008.
Proceedings of the ACL-08: HLT Second Workshop Domain adaptation for statistical machine transla-
on Syntax and Structure in Statistical Translation tion with domain dictionary and monolingual cor-
(SSST-2), pages 5560. Association for Computa- pora. In Proceedings of the 22nd International
tional Linguistics, June. Conference on Computational Linguistics (Col-
Franz Josef Och and Hermann Ney. 2003. A sys- ing2008), pages 993100.
tematic comparison of various statistical alignment Bing Zhao, Matthias Eck, and Stephan Vogel. 2004.
models. Computational Linguistics, 29(1):1951. Language model adaptation for statistical machine
Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur, translation with structured query models. In Pro-
Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar ceedings of the 20th international conference on
Kumar, Libin Shen, David Smith, Katherine Eng, Computational Linguistics, COLING 04. Associ-
Viren Jain, Zhen Jin, and Dragomir Radev. 2003. ation for Computational Linguistics.
Syntax for Statistical Machine Translation: Final Zhongguang Zheng, Zhongjun He, Yao Meng, and
report of John Hopkins 2003 Summer Workshop. Hao Yu. 2010. Domain adaptation for statisti-
Technical report, John Hopkins University. cal machine translation in development corpus se-
Franz Josef Och. 2003. Minimum error rate train- lection. In Universal Communication Symposium
ing in statistical machine translation. In ACL 03: (IUCS), 2010 4th International, pages 27. IEEE.
Proceedings of the 41st Annual Meeting on Asso-
ciation for Computational Linguistics, pages 160
167, Morristown, NJ, USA. Association for Com-
putational Linguistics.
Paul Ogilvie and James P. Callan. 2001. Experiments
using the lemur toolkit. In TREC.

119
Computing Lattice BLEU Oracle Scores for Machine Translation

Artem Sokolov Guillaume Wisniewski Francois Yvon


LIMSI-CNRS & Univ. Paris Sud
BP-133, 91 403 Orsay, France
{firstname.lastname}@limsi.fr

Abstract to better understand the behavior of the system


(Turchi et al., 2008; Auli et al., 2009). Useful
The search space of Phrase-Based Statisti- diagnostics are, for instance, provided by look-
cal Machine Translation (PBSMT) systems ing at the best (oracle) hypotheses contained in
can be represented under the form of a di- the search space, i.e, those hypotheses that have
rected acyclic graph (lattice). The quality
of this search space can thus be evaluated
the highest quality score with respect to one or
by computing the best achievable hypoth- several references. Such oracle hypotheses can
esis in the lattice, the so-called oracle hy- be used for failure analysis and to better under-
pothesis. For common SMT metrics, this stand the bottlenecks of existing translation sys-
problem is however NP-hard and can only tems (Wisniewski et al., 2010). Indeed, the in-
be solved using heuristics. In this work, ability to faithfully reproduce reference transla-
we present two new methods for efficiently tions can have many causes, such as scantiness
computing BLEU oracles on lattices: the
of the translation table, insufficient expressiveness
first one is based on a linear approximation
of the corpus BLEU score and is solved us- of reordering models, inadequate scoring func-
ing the FST formalism; the second one re- tion, non-literal references, over-pruned lattices,
lies on integer linear programming formu- etc. Oracle decoding has several other applica-
lation and is solved directly and using the tions: for instance, in (Liang et al., 2006; Chi-
Lagrangian relaxation framework. These ang et al., 2008) it is used as a work-around to
new decoders are positively evaluated and the problem of non-reachability of the reference
compared with several alternatives from the in discriminative training of MT systems. Lattice
literature for three language pairs, using lat-
tices produced by two PBSMT systems.
reranking (Li and Khudanpur, 2009), a promising
way to improve MT systems, also relies on oracle
decoding to build the training data for a reranking
1 Introduction algorithm.
The search space of Phrase-Based Statistical Ma- For sentence level metrics, finding oracle hy-
chine Translation (PBSMT) systems has the form potheses in n-best lists is a simple issue; how-
of a very large directed acyclic graph. In several ever, solving this problem on lattices proves much
softwares, an approximation of this search space more challenging, due to the number of embed-
can be outputted, either as a n-best list contain- ded hypotheses, which prevents the use of brute-
ing the n top hypotheses found by the decoder, or force approaches. When using BLEU, or rather
as a phrase or word graph (lattice) which com- sentence-level approximations thereof, the prob-
pactly encodes those hypotheses that have sur- lem is in fact known to be NP-hard (Leusch et
vived search space pruning. Lattices usually con- al., 2008). This complexity stems from the fact
tain much more hypotheses than n-best lists and that the contribution of a given edge to the total
better approximate the search space. modified n-gram precision can not be computed
Exploring the PBSMT search space is one of without looking at all other edges on the path.
the few means to perform diagnostic analysis and Similar (or worse) complexity result are expected

120
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 120129,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
for other metrics such as METEOR (Banerjee and that it contains a unique initial state q0 and a
Lavie, 2005) or TER (Snover et al., 2006). The unique final state qF . Let f denote the set of all
exact computation of oracles under corpus level paths from q0 to qF in Lf . Each path f cor-
metrics, such as BLEU, poses supplementary com- responds to a possible translation e . The job of
binatorial problems that will not be addressed in a (conventional) decoder is to find the best path(s)
this work. in Lf using scores that combine the edges fea-
In this paper, we present two original methods ture vectors with the parameters learned during
for finding approximate oracle hypotheses on lat- tuning.
tices. The first one is based on a linear approxima- In oracle decoding, the decoders job is quite
tion of the corpus BLEU, that was originally de- different, as we assume that at least a reference
signed for efficient Minimum Bayesian Risk de- rf is provided to evaluate the quality of each indi-
coding on lattices (Tromble et al., 2008). The sec- vidual hypothesis. The decoder therefore aims at
ond one, based on Integer Linear Programming, is finding the path that generates the hypothesis
an extension to lattices of a recent work on failure that best matches rf . For this task, only the output
analysis for phrase-based decoders (Wisniewski labels ei will matter, the other informations can be
et al., 2010). In this framework, we study two left aside.4
decoding strategies: one based on a generic ILP Oracle decoding assumes the definition of a
solver, and one, based on Lagrangian relaxation. measure of the similarity between a reference
Our contribution is also experimental as we and a hypothesis. In this paper we will con-
compare the quality of the BLEU approxima- sider sentence-level approximations of the popu-
tions and the time performance of these new ap- lar BLEU score (Papineni et al., 2002). BLEU is
proaches with several existing methods, for differ- formally defined for two parallel corpora, E =
ent language pairs and using the lattice generation {ej }Jj=1 and R = {rj }Jj=1 , each containing J
capacities of two publicly-available state-of-the- sentences as:
art phrase-based decoders: Moses1 and N-code2 . Y n 1/n
The rest of this paper is organized as follows. n-BLEU(E, R) = BP pm , (1)
In Section 2, we formally define the oracle decod- m=1
ing task and recall the formalism of finite state
automata on semirings. We then describe (Sec- where BP = min(1, e1c1 (R)/c1 (E) ) is the
tion 3) two existing approaches for solving this brevity penalty and pm = cm (E, R)/cm (E) are
task, before detailing our new proposals in sec- clipped or modified m-gram precisions: cm (E) is
tions 4 and 5. We then report evaluations of the the total number of word m-grams in E; cm (E, R)
existing and new oracles on machine translation accumulates over sentences the number of m-
tasks. grams in ej that also belong to rj . These counts
are clipped, meaning that a m-gram that appears
2 Preliminaries k times in E and l times in R, with k > l, is only
counted l times. As it is well known, BLEU per-
2.1 Oracle Decoding Task forms a compromise between precision, which is
We assume that a phrase-based decoder is able directly appears in Equation (1), and recall, which
to produce, for each source sentence f , a lattice is indirectly taken into account via the brevity
Lf = hQ, i, with # {Q} vertices (states) and penalty. In most cases, Equation (1) is computed
# {} edges. Each edge carries a source phrase with n = 4 and we use BLEU as a synonym for
fi , an associated output phrase ei as well as a fea- 4- BLEU .
ture vector hi , the components of which encode BLEU is defined for a pair of corpora, but, as an
various compatibility measures between fi and ei . oracle decoder is working at the sentence-level, it
We further assume that Lf is a word lattice, should rely on an approximation of BLEU that can
meaning that each ei carries a single word3 and linear chain of arcs.
4
The algorithms described below can be straightfor-
1
http://www.statmt.org/moses/ wardly generalized to compute oracle hypotheses under
2
http://ncode.limsi.fr/ combined metrics mixing model scores and quality measures
3
Converting a phrase lattice to a word lattice is a simple (Chiang et al., 2008), by weighting each edge with its model
matter of redistributing a compound input or output over a score and by using these weights down the pipe.

121
evaluate the similarity between a single hypoth- 2.3 Finite State Acceptors
esis and its reference. This approximation intro- The implementations of the oracles described in
duces a discrepancy as gathering sentences with the first part of this work (sections 3 and 4) use the
the highest (local) approximation may not result common formalism of finite state acceptors (FSA)
in the highest possible (corpus-level) BLEU score. over different semirings and are implemented us-
Let BLEU0 be such a sentence-level approximation ing the generic OpenFST toolbox (Allauzen et al.,
of BLEU. Then lattice oracle decoding is the task 2007).
of finding an optimal path (f ) among all paths A (, )-semiring K over a set K is a system
f for a given f , and amounts to the following hK, , , 0, 1i, where hK, , 0i is a commutative
optimization problem: monoid with identity element 0, and hK, , 1i is
a monoid with identity element 1. distributes
(f ) = arg max BLEU0 (e , rf ). (2) over , so that a (b c) = (a b) (a c)
f
and (b c) a = (b a) (c a) and element
0 annihilates K (a 0 = 0 a = 0).
2.2 Compromises of Oracle Decoding Let A = (, Q, I, F, E) be a weighted finite-
state acceptor with labels in and weights in K,
As proved by Leusch et al. (2008), even with
meaning that the transitions (q, , q 0 ) in A carry a
brevity penalty dropped, the problem of deciding
weight w K. Formally, E is a mapping from
whether a confusion network contains a hypoth-
(Q Q) into K; likewise, initial I and fi-
esis with clipped uni- and bigram precisions all
nal weight F functions are mappings from Q into
equal to 1.0 is NP-complete (and so is the asso-
K. We borrow the notations of Mohri (2009):
ciated optimization problem of oracle decoding
if = (q, a, q 0 ) is a transition in domain(E),
for 2-BLEU). The case of more general word and
p() = q (resp. n() = q 0 ) denotes its origin
phrase lattices and 4-BLEU score is consequently
(resp. destination) state, w() = its label and
also NP-complete. This complexity stems from
E() its weight. These notations extend to paths:
chaining up of local unigram decisions that, due
if is a path in A, p() (resp. n()) is its initial
to the clipping constraints, have non-local effect
(resp. ending) state and w() is the label along
on the bigram precision scores. It is consequently
the path. A finite state transducer (FST) is an FSA
necessary to keep a possibly exponential num-
with output alphabet, so that each transition car-
ber of non-recombinable hypotheses (character-
ries a pair of input/output symbols.
ized by counts for each n-gram in the reference)
As discussed in Sections 3 and 4, several oracle
until very late states in the lattice.
decoding algorithms can be expressed as shortest-
These complexity results imply that any oracle path problems, provided a suitable definition of
decoder has to waive either the form of the objec- the underlying acceptor and associated semiring.
tive function, replacing BLEU with better-behaved In particular, quantities such as:
scoring functions, or the exactness of the solu- M
tion, relying on approximate heuristic search al- E(), (3)
gorithms. (A)
In Table 1, we summarize different compro- where the total weight of a successful path =
mises that the existing (section 3), as well as 1 . . . l in A is computed as:
our novel (sections 4 and 5) oracle decoders,
l
have to make. The target and target level O 
columns specify the targeted score. None of E() =I(p(1 )) E(i ) F (n(l ))
i=1
the decoders optimizes it directly: their objec-
tive function is rather the approximation of BLEU can be efficiently found by generic shortest dis-
given in the target replacement column. Col- tance algorithms over acyclic graphs (Mohri,
umn search details the accuracy of the target re- 2002). For FSA-based implementations over
placement optimization. Finally, columns clip- semirings where = max, the optimization
ping and brevity indicate whether the corre- problem (2) is thus reduced to Equation (3), while
sponding properties of BLEU score are considered the oracle-specific details can be incorporated into
in the target substitute and in the search algorithm. in the definition of .

122
this paper existing oracle target target level target replacement search clipping brevity
LM-2g/4g 2/4- BLEU sentence P2 (e; r) or P4 (e; r) exact no no
PB 4- BLEU sentence partial log BLEU (4) appr. no no
PB` 4- BLEU sentence partial log BLEU (4) appr. no yes
LB-2g/4g 2/4- BLEU corpus linear appr. lin BLEU (5) exact no yes
SP 1- BLEU sentence unigram count exact no yes
ILP 2- BLEU sentence uni/bi-gram counts (7) appr. yes yes
RLX 2- BLEU sentence uni/bi-gram counts (8) exact yes yes
Table 1: Recapitulative overview of oracle decoders.

3 Existing Algorithms 3.2 Partial BLEU Oracle (PB)

In this section, we describe our reimplementation Another approach is put forward in (Dreyer et
of two approximate search algorithms that have al., 2007) and used in (Li and Khudanpur, 2009):
been proposed in the literature to solve the oracle oracle translations are shortest paths in a lattice
decoding problem for BLEU. In addition to their L, where the weight of each path is the sen-
approximate nature, none of them accounts for the tence level log BLEU() score of the correspond-
fact that the count of each matching word has to ing complete or partial hypothesis:
be clipped. 1 X
log BLEU() = log pm . (4)
4
3.1 Language Model Oracle (LM) m=1...4

The simplest approach we consider is introduced Here, the brevity penalty is ignored and n-
in (Li and Khudanpur, 2009), where oracle decod- gram precisions are offset to avoid null counts:
ing is reduced to the problem of finding the most pm = (cm (e , r) + 0.1)/(cm (e ) + 0.1).
likely hypothesis under a n-gram language model This approach has been reimplemented using
trained with the sole reference translation. the FST formalism by defining a suitable semir-
Let us suppose we have a n-gram language ing. Let each weight of the semiring keep a set
model that gives a probability P (en |e1 . . . en1 ) of tuples accumulated up to the current state of
of word en given the n 1 previous words. the lattice. Each tuple contains three words of re-
The probability
Q of a hypothesis e is then cent history, a partial hypothesis as well as current
Pn (e|r) = i=1 P (ei+n |ei . . . ei+n1 ). The lan- values of the length of the partial hypothesis, n-
guage model can conveniently be represented as a gram counts (4 numbers) and the sentence-level
FSA ALM , with each arc carrying a negative log- log BLEU score defined by Equation (4). In the
probability weight and with additional -type fail- beginning each arc is initialized with a singleton
ure transitions to accommodate for back-off arcs. set containing one tuple with a single word as the
If we train, for each source sentence f , a sepa- partial hypothesis. For the semiring operations we
rate language model ALM (rf ) using only the ref- define one common -operation and two versions
erence rf , oracle decoding amounts to finding a of the -operation:
shortest (most probable) path in the weighted FSA L1 P B L2 appends a word on the edge of
resulting from the composition L ALM (rf ) over L2 to L1 s hypotheses, shifts their recent histories
the (min, +)-semiring: and updates n-gram counts, lengths, and current
score; L1 P B L2 merges all sets from L1
LM (f ) = ShortestPath(L ALM (rf )). and L2 and recombinates those having the same
recent history; L1 P B` L2 merges all sets
This approach replaces the optimization of n- from L1 and L2 and recombinates those having
BLEU with a search for the most probable path the same recent history and the same hypothesis
under a simplistic n-gram language model. One length.
may expect the most probable path to select fre- If several hypotheses have the same recent
quent n-gram from the reference, thus augment- history (and length in the case of P B` ), re-
ing n-BLEU. combination removes all of them, but the one

123
1:/0 1:111/0
0:00/0 1:11/0
0:/0 1:101/0
1:1/0 1 10 11
0:0/0 1:/0 0:110/0

1:01/0 0:100/0

q 0 0:10/0 1
0:/0 q 1:011/0
0:/0
0:/0 0:000/0
q
1:/0 0:010/0
1:/0
0 01 00
1:001/0

(a) 1 (b) 2 (c) 3

Figure 1: Examples of the n automata for = {0, 1} and n = 1 . . . 3. Initial and final states are marked,
respectively, with bold and with double borders. Note that arcs between final states are weighted with 0, while in
reality they will have this weight only if the corresponding n-gram does not appear in the reference.

with the largest current BLEU score. Optimal gram, and all weighted transitions of the kind
path is then found by launching the generic (1n1 , n : 1n /n 1n (r), 2n ), where s are
ShortestDistance(L) algorithm over one of in , input word sequence 1n1 and output se-
the semirings above. quence 2n , are, respectively, the maximal prefix
The (P B` , P B )-semiring, in which the and suffix of an n-gram 1n .
equal length requirement also implies equal In supplement, we add auxiliary states corre-
brevity penalties, is more conservative in recom- sponding to m-grams (m < n 1), whose func-
bining hypotheses and should achieve final BLEU tional purpose is to help reach one of the main
n1
that is least as good as that obtained with the (n 1)-gram states. There are ||||11 , n > 1,
(P B , P B )-semiring5 . such supplementary states and their transitions are
(1k , k+1 : 1k+1 /0, 1k+1 ), k = 1 . . . n2. Apart
4 Linear BLEU Oracle (LB)
from these auxiliary states, the rest of the graph
In this section, we propose a new oracle based on (i.e., all final states) reproduces the structure of
the linear approximation of the corpus BLEU in- the well-known de Bruijn graph B(, n) (see Fig-
troduced in (Tromble et al., 2008). While this ap- ure 1).
proximation was earlier used for Minimum Bayes To actually compute the best hypothesis, we
Risk decoding in lattices (Tromble et al., 2008; first weight all arcs in the input FSA L with 0 to
Blackwood et al., 2010), we show here how it can obtain 0 . This makes each words weight equal
also be used to approximately compute an oracle in a hypothesis path, and the total weight of the
translation. path in 0 is proportional to the number of words
Given five real parameters 0...4 and a word vo- in it. Then, by sequentially composing 0 with
cabulary , Tromble et al. (2008) showed that one other n s, we discount arcs whose output n-gram
can approximate the corpus-BLEU with its first- corresponds to a matching n-gram. The amount
order (linear) Taylor expansion: of discount is regulated by the ratio between n s
for n > 0.
4
X X With all operations performed over the
lin BLEU() = 0 |e |+ n cu (e )u (r),
(min, +)-semiring, the oracle translation is then
n=1 un
(5) given by:
where cu (e) is the number of times the n-gram
u appears in e, and u (r) is an indicator variable LB = ShortestPath(0 1 2 3 4 ).
testing the presence of u in r.
We set parameters n as in (Tromble et al.,
To exploit this approximation for oracle decod-
2008): 0 = 1, roughly corresponding to the
ing, we construct four weighted FSTs n con-
brevity penalty (each word in a hypothesis adds
taining a (final) state for each possible (n 1)-
up equally to the final path length) and n =
5
See, however, experiments in Section 6. (4p rn1 )1 , which are increasing discounts

124
define, for every edge i , an associated reward, i
36
34 that describes the edges local contribution to the
32
36
34 30
28
hypothesis score. For instance, for the sentence
32
BLEU 30
28
26
24
approximation of the 1-BLEU score, the rewards
26
24
22
are defined as:
22 0
0.2 (
0.4
r
1 if w(i ) is in the reference,
1
0.8
0.6 0.8
0.6
i =
p
0.4
0.2
0 1
2 otherwise,

Figure 2: Performance of the LB-4g oracle for differ- where 1 and 2 are two positive constants cho-
ent combinations of p and r on WMT11 de2en task. sen to maximize the corpus BLEU score6 . Con-
stant 1 (resp. 2 ) is a reward (resp. a penalty)
for generating a word in the reference (resp. not in
for matching n-grams. The values of p and r were
the reference). The score of an assignment P
found by grid search with a 0.05 step value. A P#{}
is then defined as: score() = i=1 i i . This
typical result of the grid evaluation of the LB or-
score can be seen as a compromise between the
acle for German to English WMT11 task is dis-
number of common words in the hypothesis and
played on Figure 2. The optimal values for the
the reference (accounting for recall) and the num-
other pairs of languages were roughly in the same
ber of words of the hypothesis that do not appear
ballpark, with p 0.3 and r 0.2.
in the reference (accounting for precision).
5 Oracles with n-gram Clipping As explained in Section 2.3, finding the or-
acle hypothesis amounts to solving the shortest
In this section, we describe two new oracle de- distance (or path) problem (3), which can be re-
coders that take n-gram clipping into account. formulated by a constrained optimization prob-
These oracles leverage on the well-known fact lem (Wolsey, 1998):
that the shortest path problem, at the heart of
#{}
all the oracles described so far, can be reduced X
straightforwardly to an Integer Linear Program- arg max i i (6)
P i=1
ming (ILP) problem (Wolsey, 1998). Once oracle X X
decoding is formulated as an ILP problem, it is s.t. = 1, =1
relatively easy to introduce additional constraints, (qF ) + (q0 )
for instance to enforce n-gram clipping. We will
X X
= 0, q Q \ {q0 , qF }
first describe the optimization problem of oracle + (q) (q)
decoding and then present several ways to effi-
ciently solve it. where q0 (resp. qF ) is the initial (resp. final) state
of the lattice and (q) (resp. + (q)) denotes the
5.1 Problem Description set of incoming (resp. outgoing) edges of state q.
Throughout this section, abusing the notations, These path constraints ensure that the solution of
we will also think of an edge i as a binary vari- the problem is a valid path in the lattice.
able describing whether the edge is selected or The optimization problem in Equation (6) can
not. The set {0, 1}#{} of all possible edge as- be further extended to take clipping into account.
signments will be denoted by P. Note that , the Let us introduce, for each word w, a variable w
set of all paths in the lattice is a subset of P: by that denotes the number of times w appears in the
enforcing some constraints on an assignment in hypothesis clipped to the number of times, it ap-
P, it can be guaranteed that it will represent a path pears in the reference. Formally, w is defined by:
in the lattice. For the sake of presentation, we as-
sume that each edge i generates a single word X
w(i ) and we focus first on finding the optimal w = min , cw (r)

(w)
hypothesis with respect to the sentence approxi-
mation of the 1-BLEU score. 6
We tried several combinations of 1 and 2 and kept
As 1-BLEU is decomposable, it is possible to the one that had the highest corpus 4-BLEU score.

125
whereP (w) is the subset of edges generating w, 5.2 Shortest Path Oracle (SP)
and (w) is the number of occurrences of As a trivial special class of the above formula-
w in the solution and cw (r) is the number of oc- tion, we also define a Shortest Path Oracle (SP)
currences of w in the reference r. Using the that solves the optimization problem in (6). As
variables, we define a clipped approximation of no clipping constraints apply, it can be solved ef-
1- BLEU : ficiently using the standard Bellman algorithm.

#{}
5.3 Oracle Decoding through Lagrangian
X X X
1 w 2 i w
w i=1 w Relaxation (RLX)
Indeed, the clipped number of words in the hy- In this section, we introduce another method to
pothesis that appear in the reference is given by solve problem (7) without relying on an exter-
P P#{} P nal ILP solver. Following (Rush et al., 2010;
w w , and i=1 i w w corresponds to
the number of words in the hypothesis that do not Chang and Collins, 2011), we propose an original
appear in the reference or that are surplus to the method for oracle decoding based on Lagrangian
clipped count. relaxation. This method relies on the idea of re-
Finally, the clipped lattice oracle is defined by laxing the clipping constraints: starting from an
the following optimization problem: unconstrained problem, the counts clipping is en-
forced by incrementally strengthening the weight
#{}
X X of paths satisfying the constraints.
arg max (1 + 2 ) w 2 i The oracle decoding problem with clipping
P,w w i=1
constraints amounts to solving:
(7)
X #{}
s.t. w 0, w cw (r), w arg min
X
i i (8)
(w)
X X i=1
X
= 1, =1 s.t. cw (r), w r
(qF ) + (q0 ) (w)
X X
= 0, q Q \ {q0 , qF } where, by abusing the notations, r also denotes
+ (q) (q)
the set of words in the reference. For sake of clar-
where the first three sets of constraints are the lin- ity, the path constraints are incorporated into the
earization of the definition of w , made possible domain (the arg min runs over and not over P).
by the positivity of 1 and 2 , and the last three To solve this optimization problem we consider its
sets of constraints are the path constraints. dual form and use Lagrangian relaxation to deal
In our implementation we generalized this op- with clipping constraints.
timization problem to bigram lattices, in which Let = {w }wr be positive Lagrange mul-
each edge is labeled by the bigram it generates. tipliers, one for each different word of the refer-
Such bigram FSAs can be produced by compos- ence, then the Lagrangian of the problem (8) is:
ing the word lattice with 2 from Section 4. In
#{}
this case, the reward of an edge will be defined as X X X
a combination of the (clipped) number of unigram L(, ) = i i + w cw (r)
i=1 wr (w)
matches and bigram matches, and solving the op-
timization problem yields a 2-BLEU optimal hy- The dual objective is L() = min L(, )
pothesis. The approach can be further generalized and the dual problem is: max,0 L(). To
to higher-order BLEU or other metrics, as long as solve the latter, we first need to work out the dual
the reward of an edge can be computed locally. objective:
The constrained optimization problem (7) can
be solved efficiently using off-the-shelf ILP = arg min L(, )

solvers7 .
7
#{}
In our experiments we used Gurobi (Optimization, X 
2010) a commercial ILP solver that offers free academic li- = arg min i w(i ) i
i=1
cense.

126
where we assume that w(i ) is 0 when word decoder fr2en de2en en2de
w(i ) is not in the reference. In the same way N-code 27.88 22.05 15.83

oracle test
as in Section 5.2, the solution of this problem can Moses 27.68 21.85 15.89
be efficiently retrieved with a shortest path algo- N-code 36.36 29.22 21.18
rithm. Moses 35.25 29.13 22.03
It is possible to optimize L() by noticing that
it is a concave function. It can be shown (Chang Table 2: Test BLEU scores and oracle scores on
and Collins, 2011) that, at convergence, the clip- 100-best lists for the evaluated systems.
ping constraints will be enforced in the optimal
solution. In this work, we chose to use a simple
and 4). Systems were trained on the data provided
gradient descent to solve the dual problem. A sub-
for the WMT11 Evaluation task10 , tuned on the
gradient of the dual objective is:
WMT09 test data and evaluated on WMT10 test
L() X set11 to produce lattices. The BLEU test scores
= cw (r).
w and oracle scores on 100-best lists with the ap-
(w)
proximation (4) for N-code and Moses are given
Each component of the gradient corresponds to in Table 2. It is not until considering 10,000-best
the difference between the number of times the lists that n-best oracles achieve performance com-
word w appears in the hypothesis and the num- parable to the (mediocre) SP oracle.
ber of times it appears in the reference. The algo- To make a fair comparison with the ILP and
rithm below sums up the optimization of task (8). RLX oracles which optimize 2-BLEU, we in-
In the algorithm (t) corresponds to the step size cluded 2-BLEU versions of the LB and LM ora-
at the tth iteration. In our experiments we used a cles, identified below with the -2g suffix. The
constant step size of 0.1. Compared to the usual two versions of the PB oracle are respectively
gradient descent algorithm, there is an additional denoted as PB and PB`, by the type of the -
projection step of on the positive orthant, which operation they consider (Section 3.2). Parame-
enforces the constraint  0. ters p and r for the LB-4g oracle for N-code were
found with grid search and reused for Moses:
(0)
w, w 0 p = 0.25, r = 0.15 (fr2en); p = 0.175, r = 0.575
for t = 1 T do (en2de) and p = 0.35, r = 0.425 (de2en). Cor-
(t) = arg min i i w(i ) i

respondingly, for the LB-2g oracle: p = 0.3, r =
P
if all clipping constraints are enforced 0.15; p = 0.3, r = 0.175 and p = 0.575, r = 0.1.
then optimal solution found The proposed LB, ILP and RLX oracles were
else for w r do the best performing oracles, with the ILP and
nw n. of occurrences of w in (t) RLX oracles being considerably faster, suffering
(t) (t) only a negligible decrease in BLEU, compared to
w w + (t) (nw cw (r))
(t)
w max(0, w )
(t) the 4-BLEU-optimized LB oracle. We stopped
RLX oracle after 20 iterations, as letting it con-
verge had a small negative effect (1 point of the
6 Experiments corpus BLEU), because of the sentence/corpus dis-
crepancy ushered by the BLEU score approxima-
For the proposed new oracles and the existing ap- tion.
proaches, we compare the quality of oracle trans- Experiments showed consistently inferior per-
lations and the average time per sentence needed formance of the LM-oracle resulting from the op-
to compute them8 on several datasets for 3 lan- timization of the sentence probability rather than
guage pairs, using lattices generated by two open- BLEU . The PB oracle often performed compara-
source decoders: N-code and Moses9 (Figures 3 bly to our new oracles, however, with sporadic
8
Experiments were run in parallel on a server with 64G resource-consumption bursts, that are difficult to
of RAM and 2 Xeon CPUs with 4 cores at 2.3 GHz.
9 10
As the ILP (and RLX) oracle were implemented in http://www.statmt.org/wmt2011
11
Python, we pruned Moses lattices to accelerate task prepa- All BLEU scores are reported using the multi-bleu.pl
ration for it. script.

127
50 6 30
BLEU BLEU BLEU
avg. time avg. time avg. time

48.22
48.12
47.82

47.71

46.76

46.48
5 1.5

35.49
45 35

35.09

34.85
34.79

34.76
34.70
1

25.34
4 25

24.85

24.78
24.75

24.73
24.66
41.23
40

avg. time, s

avg. time, s

avg. time, s
1
BLEU

BLEU

BLEU
38.91

38.75
3

22.19
30.78
35 30

20.78

20.74
29.53

29.53
2 20 0.5
0.5
30
1

25 0 25 0 15 0
RLX ILP LB-4g LB-2g PB PBl SP LM-4g LM-2g RLX ILP LB-4g LB-2g PB PBl SP LM-4g LM-2g RLX ILP LB-4g LB-2g PB PBl SP LM-4g LM-2g

(a) fr2en (b) de2en (c) en2de

Figure 3: Oracles performance for N-code lattices.


50 30

37.73

29.94
BLEU BLEU 4 BLEU
avg. time avg. time avg. time 9

36.91

28.94
36.75

28.76
28.68
36.62

28.65
28.64
36.52
36.43
3
8
45 35
44.44

26.48
44.08
43.82

43.82

3 7
43.42

43.20

25
41.03

6
40
avg. time, s

avg. time, s

avg. time, s
2
BLEU

BLEU

BLEU
5
2
36.34

36.25

30.52
35 30

21.29

21.23
20

29.51

29.45
3
1
1
30 2

25 0 25 0 15 0
RLX ILP LB-4g LB-2g PB PBl SP LM-4g LM-2g RLX ILP LB-4g LB-2g PB PBl SP LM-4g LM-2g RLX ILP LB-4g LB-2g PB PBl SP LM-4g LM-2g

(a) fr2en (b) de2en (c) en2de

Figure 4: Oracles performance for Moses lattices pruned with parameter -b 0.5.

avoid without more cursory hypotheses recom- ter approximations of BLEU than was previously
bination strategies and the induced effect on the done, taking the corpus-based nature of BLEU, or
translations quality. The length-aware PB` oracle clipping constrainst into account, delivering better
has unexpectedly poorer scores compared to its oracles without compromising speed.
length-agnostic PB counterpart, while it should, Using 2-BLEU and 4-BLEU oracles yields com-
at least, stay even, as it takes the brevity penalty parable performance, which confirms the intuition
into account. We attribute this fact to the com- that hypotheses sharing many 2-grams, would
plex effect of clipping coupled with the lack of likely have many common 3- and 4-grams as well.
control of the process of selecting one hypothe- Taking into consideration the exceptional speed of
sis among several having the same BLEU score, the LB-2g oracle, in practice one can safely opti-
length and recent history. Anyhow, BLEU scores mize for 2-BLEU instead of 4-BLEU, saving large
of both of PB oracles are only marginally differ- amounts of time for oracle decoding on long sen-
ent, so the PB`s conservative policy of pruning tences.
and, consequently, much heavier memory con- Overall, these experiments accentuate the
sumption makes it an unwanted choice. acuteness of scoring problems that plague modern
decoders: very good hypotheses exist for most in-
7 Conclusion put sentences, but are poorly evaluated by a linear
We proposed two methods for finding oracle combination of standard features functions. Even
translations in lattices, based, respectively, on a though the tuning procedure can be held respon-
linear approximation to the corpus-level BLEU sible for part of the problem, the comparison be-
and on integer linear programming techniques. tween lattice and n-best oracles shows that the
We also proposed a variant of the latter approach beam search leaves good hypotheses out of the n-
based on Lagrangian relaxation that does not rely best list until very high value of n, that are never
on a third-party ILP solver. All these oracles have used in practice.
superior performance to existing approaches, in
Acknowledgments
terms of the quality of the found translations, re-
source consumption and, for the LB-2g oracles, This work has been partially funded by OSEO un-
in terms of speed. It is thus possible to use bet- der the Quaero program.

128
References Mehryar Mohri. 2002. Semiring frameworks and al-
gorithms for shortest-distance problems. J. Autom.
Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wo- Lang. Comb., 7:321350.
jciech Skut, and Mehryar Mohri. 2007. OpenFst: Mehryar Mohri. 2009. Weighted automata algo-
A general and efficient weighted finite-state trans- rithms. In Manfred Droste, Werner Kuich, and
ducer library. In Proc. of the Int. Conf. on Imple- Heiko Vogler, editors, Handbook of Weighted Au-
mentation and Application of Automata, pages 11 tomata, chapter 6, pages 213254.
23. Gurobi Optimization. 2010. Gurobi optimizer, April.
Michael Auli, Adam Lopez, Hieu Hoang, and Philipp Version 3.0.
Koehn. 2009. A systematic analysis of translation Kishore Papineni, Salim Roukos, Todd Ward, and
model search spaces. In Proc. of WMT, pages 224 Wei-Jing Zhu. 2002. BLEU: a method for auto-
232, Athens, Greece. matic evaluation of machine translation. In Proc. of
Satanjeev Banerjee and Alon Lavie. 2005. ME- the Annual Meeting of the ACL, pages 311318.
TEOR: An automatic metric for MT evaluation with Alexander M. Rush, David Sontag, Michael Collins,
improved correlation with human judgments. In and Tommi Jaakkola. 2010. On dual decomposi-
Proc. of the ACL Workshop on Intrinsic and Extrin- tion and linear programming relaxations for natural
sic Evaluation Measures for Machine Translation, language processing. In Proc. of the 2010 Conf. on
pages 6572, Ann Arbor, MI, USA. EMNLP, pages 111, Stroudsburg, PA, USA.
Graeme Blackwood, Adria de Gispert, and William Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin-
Byrne. 2010. Efficient path counting transducers nea Micciulla, and John Makhoul. 2006. A study
for minimum bayes-risk decoding of statistical ma- of translation edit rate with targeted human anno-
chine translation lattices. In Proc. of the ACL 2010 tation. In Proc. of the Conf. of the Association for
Conference Short Papers, pages 2732, Strouds- Machine Translation in the America (AMTA), pages
burg, PA, USA. 223231.
Yin-Wen Chang and Michael Collins. 2011. Exact de- Roy W. Tromble, Shankar Kumar, Franz Och, and
coding of phrase-based translation models through Wolfgang Macherey. 2008. Lattice minimum
lagrangian relaxation. In Proc. of the 2011 Conf. on bayes-risk decoding for statistical machine transla-
EMNLP, pages 2637, Edinburgh, UK. tion. In Proc. of the Conf. on EMNLP, pages 620
629, Stroudsburg, PA, USA.
David Chiang, Yuval Marton, and Philip Resnik.
Marco Turchi, Tijl De Bie, and Nello Cristianini.
2008. Online large-margin training of syntactic
2008. Learning performance of a machine trans-
and structural translation features. In Proc. of the
lation system: a statistical and computational anal-
2008 Conf. on EMNLP, pages 224233, Honolulu,
ysis. In Proc. of WMT, pages 3543, Columbus,
Hawaii.
Ohio.
Markus Dreyer, Keith B. Hall, and Sanjeev P. Khu- Guillaume Wisniewski, Alexandre Allauzen, and
danpur. 2007. Comparing reordering constraints Francois Yvon. 2010. Assessing phrase-based
for SMT using efficient BLEU oracle computation. translation models with oracle decoding. In Proc.
In Proc. of the Workshop on Syntax and Structure of the 2010 Conf. on EMNLP, pages 933943,
in Statistical Translation, pages 103110, Morris- Stroudsburg, PA, USA.
town, NJ, USA. L. Wolsey. 1998. Integer Programming. John Wiley
Gregor Leusch, Evgeny Matusov, and Hermann Ney. & Sons, Inc.
2008. Complexity of finding the BLEU-optimal hy-
pothesis in a confusion network. In Proc. of the
2008 Conf. on EMNLP, pages 839847, Honolulu,
Hawaii.
Zhifei Li and Sanjeev Khudanpur. 2009. Efficient
extraction of oracle-best translations from hyper-
graphs. In Proc. of Human Language Technolo-
gies: The 2009 Annual Conf. of the North Ameri-
can Chapter of the ACL, Companion Volume: Short
Papers, pages 912, Morristown, NJ, USA.
Percy Liang, Alexandre Bouchard-Cote, Dan Klein,
and Ben Taskar. 2006. An end-to-end discrim-
inative approach to machine translation. In Proc.
of the 21st Int. Conf. on Computational Linguistics
and the 44th annual meeting of the ACL, pages 761
768, Morristown, NJ, USA.

129
Toward Statistical Machine Translation without Parallel Corpora

Alexandre Klementiev Ann Irvine Chris Callison-Burch David Yarowsky


Center for Language and Speech Processing
Johns Hopkins University

Abstract novel algorithm to estimate reordering features


from monolingual data alone, and we report the
We estimate the parameters of a phrase- performance of a phrase-based statistical model
based statistical machine translation sys- (Koehn et al., 2003) estimated using these mono-
tem from monolingual corpora instead of a
lingual features.
bilingual parallel corpus. We extend exist-
ing research on bilingual lexicon induction
Most of the prior work on lexicon induction
to estimate both lexical and phrasal trans- is motivated by the idea that it could be applied
lation probabilities for MT-scale phrase- to machine translation but stops short of actu-
tables. We propose a novel algorithm to es- ally doing so. Lexicon induction holds the po-
timate reordering probabilities from mono- tential to create machine translation systems for
lingual data. We report translation results languages which do not have extensive parallel
for an end-to-end translation system us- corpora. Training would only require two large
ing these monolingual features alone. Our
monolingual corpora and a small bilingual dictio-
method only requires monolingual corpora
in source and target languages, a small nary, if one is available. The idea is that intrin-
bilingual dictionary, and a small bitext for sic properties of monolingual data (possibly along
tuning feature weights. In this paper, we ex- with a handful of bilingual pairs to act as exam-
amine an idealization where a phrase-table ple mappings) can provide independent but infor-
is given. We examine the degradation in mative cues to learn translations because words
translation performance when bilingually (and phrases) behave similarly across languages.
estimated translation probabilities are re-
This work is the first attempt to extend and apply
moved and show that 80%+ of the loss can
be recovered with monolingually estimated
these ideas to an end-to-end machine translation
features alone. We further show that our pipeline. While we make an explicit assumption
monolingual features add 1.5 BLEU points that a table of phrasal translations is given a priori,
when combined with standard bilingually we induce every other parameter of a full phrase-
estimated phrase table features. based translation system from monolingual data
alone. The contributions of this work are:
1 Introduction
In Section 2.2 we analyze the challenges
The parameters of statistical models of transla- of using bilingual lexicon induction for sta-
tion are typically estimated from large bilingual tistical MT (performance on low frequency
parallel corpora (Brown et al., 1993). However, items, and moving from words to phrases).
these resources are not available for most lan-
guage pairs, and they are expensive to produce in In Sections 3.1 and 3.2 we use multiple cues
quantities sufficient for building a good transla- present in monolingual data to estimate lexi-
tion system (Germann, 2001). We attempt an en- cal and phrasal translation scores.
tirely different approach; we use cheap and plen- In Section 3.3 we propose a novel algo-
tiful monolingual resources to induce an end-to- rithm for estimating phrase reordering fea-
end statistical machine translation system. In par- tures from monolingual texts.
ticular, we extend the long line of work on in-
ducing translation lexicons (beginning with Rapp Finally, in Section 5 we systematically drop
(1995)) and propose to use multiple independent feature functions from a phrase table and
cues present in monolingual texts to estimate lex- then replace them with monolingually es-
ical and phrasal translation probabilities for large, timated equivalents, reporting end-to-end
MT-scale phrase-tables. We then introduce a translation quality.

130
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 130140,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
2 Background

Facebook

verdienen
aufrgund
Wieviel

seines

Profils
sollte

man
We begin with a brief overview of the stan-

in
dard phrase-based statistical machine translation
How
model. Here, we define the parameters which
much
we later replace with monolingual alternatives.
We continue with a discussion of bilingual lex- should m

icon induction; we extend these methods to es- you m

timate the monolingual parameters in Section 3. d


charge
This approach allows us to replace expensive/rare
for d
bilingual parallel training data with two large
your m
monolingual corpora, a small bilingual dictionary,
and 2,000 sentence bilingual development set, Facebook d

which are comparatively plentiful/inexpensive. profile s

2.1 Parameters of phrase-based SMT


Statistical machine translation (SMT) was first Figure 1: The reordering probabilities from the phrase-
based models are estimated from bilingual data by cal-
formulated as a series of probabilistic mod-
culating how often in the parallel corpus a phrase pair
els that learn word-to-word correspondences (f, e) is orientated with the preceding phrase pair in
from sentence-aligned bilingual parallel corpora the 3 types of orientations (monotone, swapped, and
(Brown et al., 1993). Current methods, includ- discontinuous).
ing phrase-based (Och, 2002; Koehn et al., 2003)
and hierarchical models (Chiang, 2005), typically
start by word-aligning a bilingual parallel cor- age word translation probabilities, w(ei |fj ),
pus (Och and Ney, 2003). They extract multi- are calculated via phrase-pair-internal word
word phrases that are consistent with the Viterbi alignments.
word alignments and use these phrases to build
new translations. A variety of parameters are es- Reordering model. Each phrase pair (e, f )
timated using the bitexts. Here we review the pa- also has associated reordering parameters,
rameters of the standard phrase-based translation po (orientation|f, e), which indicate the dis-
model (Koehn et al., 2007). Later we will show tribution of its orientation with respect to the
how to estimate them using monolingual texts in- previously translated phrase. Orientations
stead. These parameters are: are monotone, swap, discontinuous (Tillman,
2004; Kumar and Byrne, 2004), see Figure 1.
Phrase pairs. Phrase extraction heuristics
(Venugopal et al., 2003; Tillmann, 2003; Other features. Other typical features are
Och and Ney, 2004) produce a set of phrase n-gram language model scores and a phrase
pairs (e, f ) that are consistent with the word penalty, which governs whether to use fewer
alignments. In this paper we assume that the longer phrases or more shorter phrases.
phrase pairs are given (without any scores), These are not bilingually estimated, so we
and we induce every other parameter of the can re-use them directly without modifica-
phrase-based model from monolingual data. tion.
Phrase translation probabilities. Each
phrase pair has a list of associated fea- The features are combined in a log linear model,
ture functions (FFs). These include phrase and their weights are set through minimum error
translation probabilities, (e|f ) and (f |e), rate training (Och, 2003). We use the same log
which are typically calculated via maximum linear formulation and MERT but propose alterna-
likelihood estimation. tives derived directly from monolingual data for
all parameters except for the phrase pairs them-
Lexical weighting. Since MLE overestimates selves. Our pipeline still requires a small bitext of
for phrase pairs with sparse counts, lexi- approximately 2,000 sentences to use as a devel-
cal weighting FFs are used to smooth. Aver- opment set for MERT parameter tuning.

131
2.2 Bilingual lexicon induction for SMT

Bilingual lexicon induction describes the class of algorithms that attempt to learn translations from monolingual corpora. Rapp (1995) was the first to propose using non-parallel texts to learn the translations of words. Using large, unrelated English and German corpora (with 163m and 135m words) and a small German-English bilingual dictionary (with 22k entries), Rapp (1999) demonstrated that reasonably accurate translations could be learned for 100 German nouns that were not contained in the seed bilingual dictionary. His algorithm worked by (1) building a context vector representing an unknown German word by counting its co-occurrence with all the other words in the German monolingual corpus, (2) projecting this German vector onto the vector space of English using the seed bilingual dictionary, (3) calculating the similarity of this sparse projected vector to vectors for English words that were constructed using the English monolingual corpus, and (4) outputting the English words with the highest similarity as the most likely translations.

A variety of subsequent work has extended the original idea either by exploring different measures of vector similarity (Fung and Yee, 1998) or by proposing other ways of measuring similarity beyond co-occurrence within a context window. For instance, Schafer and Yarowsky (2002) demonstrated that word translations tend to co-occur in time across languages. Koehn and Knight (2002) used similarity in spelling as another kind of cue that a pair of words may be translations of one another. Garera et al. (2009) defined context vectors using dependency relations rather than adjacent words. Bergsma and Van Durme (2011) used the visual similarity of labeled web images to learn translations of nouns. Additional related work on learning translations from monolingual corpora is discussed in Section 6.

In this paper, we apply bilingual lexicon induction methods to statistical machine translation. Given the obvious benefits of not having to rely on scarce bilingual parallel training data, it is surprising that bilingual lexicon induction has not been used for SMT before now. There are several open questions that make its applicability to SMT uncertain. Previous research on bilingual lexicon induction learned translations only for a small number of high frequency words (e.g. 100 nouns in Rapp (1995), 1,000 most frequent words in Koehn and Knight (2002), or 2,000 most frequent nouns in Haghighi et al. (2008)). Although previous work reported high translation accuracy, it may be misleading to extrapolate the results to SMT, where it is necessary to translate a much larger set of words and phrases, including many low frequency items.

Figure 2: Accuracy of single-word translations induced using contextual similarity as a function of the source word corpus frequency. Accuracy is the proportion of the source words with at least one correct (bilingual dictionary) translation in the top 1 and top 10 candidate lists.

In a preliminary study, we plotted the accuracy of translations against the frequency of the source words in the monolingual corpus. Figure 2 shows the result for translations induced using contextual similarity (defined in Section 3.1). Unsurprisingly, frequent terms have a substantially better chance of being paired with a correct translation, with words that only occur once having a low chance of being translated accurately.¹ This problem is exacerbated when we move to multi-token phrases. As with phrase translation features estimated from parallel data, longer phrases are more sparse, making similarity scores less reliable than for single words.

1 For a description of the experimental setup used to produce these translations, see Experiment 8 in Section 5.2.

Another impediment (not addressed in this paper) for using lexicon induction for SMT is the number of translations that must be learned. Learning translations for all words in the source language requires n² vector comparisons, since each word in the source language vocabulary must

be compared against the vectors for all words in the target language vocabulary. The size of the n² comparisons hugely increases if we compare vectors for multi-word phrases instead of just words. In this work, we avoid this problem by assuming that a limited set of phrase pairs is given a priori (but without scores). By limiting ourselves to phrases in a phrase table, we vastly limit the search space of possible translations. This is an idealization because high quality translations are guaranteed to be present. However, as our lesion experiments in Section 5.1 show, a phrase table without accurate translation probability estimates is insufficient to produce high quality translations. We show that lexicon induction methods can be used to replace bilingual estimation of phrase- and lexical-translation probabilities, making a significant step towards SMT without parallel corpora.

3 Monolingual Parameter Estimation

We use bilingual lexicon induction methods to estimate the parameters of a phrase-based translation model from monolingual data. Instead of scores estimated from bilingual parallel data, we make use of cues present in monolingual data to provide multiple orthogonal estimates of similarity between a pair of phrases.

3.1 Phrasal similarity features

Contextual similarity. We extend the vector space approach of Rapp (1999) to compute similarity between phrases in the source and target languages. More formally, assume that (s_1, s_2, ..., s_N) and (t_1, t_2, ..., t_M) are (arbitrarily indexed) source and target vocabularies, respectively. A source phrase f is represented with an N- and target phrase e with an M-dimensional vector (see Figure 3). The component values of the vector representing a phrase correspond to how often each of the words in that vocabulary appear within a two word window on either side of the phrase. These counts are collected using monolingual corpora. After the values have been computed, a contextual vector f is projected onto the English vector space using translations in a seed bilingual dictionary to map the component values into their appropriate English vector positions. This sparse projected vector is compared to the vectors representing all English phrases e. Each phrase pair in the phrase table is assigned a contextual similarity score c(f, e) based on the similarity between e and the projection of f.

Figure 3: Scoring contextual similarity of phrases: first, contextual vectors are projected using a small seed dictionary and then compared with the target language candidates.

Figure 4: Temporal histograms of the English phrase terrorist, its Spanish translation terrorista, and riqueza (wealth) collected from monolingual texts spanning a 13 year period. While the correct translation has a good temporal match, the non-translation riqueza has a distinctly different signature.

Various means of computing the component values and vector similarity measures have been proposed in the literature (e.g. Rapp (1999), Fung and Yee (1998)). Following Fung and Yee (1998), we compute the value of the k-th component of f's contextual vector as follows:

    w_k = n_{f,k} * (log(n / n_k) + 1)

where n_{f,k} and n_k are the number of times s_k appears in the context of f and in the entire corpus, and n is the maximum number of occurrences of any word in the data. Intuitively, the more frequently s_k appears with f and the less common it is in the corpus in general, the higher its component value. Similarity between two vectors is measured as the cosine of the angle between them.
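As an illustration of the contextual similarity computation described above, the following minimal Python sketch builds a windowed context vector, applies the w_k weighting, projects it through a seed dictionary, and compares it to a target-side vector with cosine similarity. The corpora, dictionary, and single-word phrases are toy assumptions for illustration; this is not the authors' implementation.

import math
from collections import Counter

def context_vector(phrase, sentences, window=2):
    # Count words appearing within +/- window positions of each occurrence of `phrase`.
    counts = Counter()
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok == phrase:  # single-word "phrase" for simplicity
                lo, hi = max(0, i - window), min(len(toks), i + window + 1)
                counts.update(toks[lo:i] + toks[i + 1:hi])
    return counts

def weighted(counts, corpus_counts, n_max):
    # w_k = n_{f,k} * (log(n / n_k) + 1), following the formula above.
    return {w: c * (math.log(n_max / corpus_counts[w]) + 1.0) for w, c in counts.items()}

def project(vector, seed_dict):
    # Map source-language components onto target-language vector positions.
    return {seed_dict[w]: v for w, v in vector.items() if w in seed_dict}

def cosine(u, v):
    dot = sum(val * v.get(k, 0.0) for k, val in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy example: c(f, e) for f = "crecimiento" and e = "growth".
es_sents = ["el crecimiento economico fue lento", "el crecimiento del empleo"]
en_sents = ["economic growth was slow", "employment growth increased"]
seed = {"economico": "economic", "empleo": "employment", "lento": "slow", "el": "the"}
es_corpus = Counter(w for s in es_sents for w in s.split())
en_corpus = Counter(w for s in en_sents for w in s.split())
f_vec = project(weighted(context_vector("crecimiento", es_sents), es_corpus, max(es_corpus.values())), seed)
e_vec = weighted(context_vector("growth", en_sents), en_corpus, max(en_corpus.values()))
print(cosine(f_vec, e_vec))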

Temporal similarity. In addition to contextual similarity, phrases in two languages may be scored in terms of their temporal similarity (Schafer and Yarowsky, 2002; Klementiev and Roth, 2006; Alfonseca et al., 2009). The intuition is that news stories in different languages will tend to discuss the same world events on the same day. The frequencies of translated phrases over time give them particular signatures that will tend to spike on the same dates. For instance, if the phrase asian tsunami is used frequently during a particular time span, the Spanish translation maremoto asiatico is likely to also be used frequently during that time. Figure 4 illustrates how the temporal distribution of terrorist is more similar to Spanish terrorista than to other Spanish phrases. We calculate the temporal similarity between a pair of phrases t(f, e) using the method defined by Klementiev and Roth (2006). We generate a temporal signature for each phrase by sorting the set of (time-stamped) documents in the monolingual corpus into a sequence of equally sized temporal bins and then counting the number of phrase occurrences in each bin. In our experiments, we set the window size to 1 day, so the size of temporal signatures is equal to the number of days spanned by our corpus. We use cosine distance to compare the normalized temporal signatures for a pair of phrases (f, e).

Topic similarity. Phrases and their translations are likely to appear in articles written about the same topic in two languages. Thus, topic or category information associated with monolingual data can also be used to indicate similarity between a phrase and its candidate translation. In order to score a pair of phrases, we collect their topic signatures by counting their occurrences in each topic and then comparing the resulting vectors. We again use the cosine similarity measure on the normalized topic signatures. In our experiments, we use interlingual links between Wikipedia articles to estimate topic similarity. We treat each linked article pair as a topic and collect counts for each phrase across all articles in its corresponding language. Thus, the size of a phrase topic signature is the number of article pairs with interlingual links in Wikipedia, and each component contains the number of times the phrase appears in (the appropriate side of) the corresponding pair. Our Wikipedia-based topic similarity feature, w(f, e), is similar in spirit to polylingual topic models (Mimno et al., 2009), but it is scalable to full bilingual lexicon induction.

3.2 Lexical similarity features

In addition to the three phrase similarity features used in our model, c(f, e), t(f, e) and w(f, e), we include four additional lexical similarity features for each phrase pair. The first three lexical features c_lex(f, e), t_lex(f, e) and w_lex(f, e) are the lexical equivalents of the phrase-level contextual, temporal and Wikipedia topic similarity scores. They score the similarity of individual words within the phrases. To compute these lexical similarity features, we average similarity scores over all possible word alignments across the two phrases. Because individual words are more frequent than multiword phrases, the accuracy of c_lex, t_lex, and w_lex tends to be higher than their phrasal equivalents (this is similar to the effect observed in Figure 2).

Orthographic / phonetic similarity. The final lexical similarity feature that we incorporate is o(f, e), which measures the orthographic similarity between words in a phrase pair. Etymologically related words often retain similar spelling across languages with the same writing system, and low string edit distance sometimes signals translation equivalency. Berg-Kirkpatrick and Klein (2011) present methods for learning correspondences between the alphabets of two languages. We can also extend this idea to language pairs not sharing the same writing system since many cognates, borrowed words, and names remain phonetically similar. Transliterations can be generated for tokens in a source phrase (Knight and Graehl, 1997), with o(f, e) calculating phonetic similarity rather than orthographic.

The three phrasal and four lexical similarity scores are incorporated into the log linear translation model as feature functions, replacing the bilingually estimated phrase translation probabilities φ and lexical weighting probabilities w. Our seven similarity scores are not the only ones that could be incorporated into the translation model. Various other similarity scores can be computed depending on the available monolingual data and its associated metadata (see, e.g. Schafer and Yarowsky (2002)).
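As a concrete example of one of these signature-based scores, the short Python sketch below mirrors the temporal similarity t(f, e) described above: occurrences are binned per day over a shared date range, the signatures are normalized, and their cosine is taken. The dates and counts are invented toy values; this is an illustration, not the experimental code.

import math

def temporal_signature(dated_counts, all_dates):
    # Build a vector of per-day occurrence counts (1-day bins) and L2-normalize it.
    vec = [float(dated_counts.get(d, 0)) for d in all_dates]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def temporal_similarity(sig_f, sig_e):
    # t(f, e): cosine between the normalized temporal signatures.
    return sum(a * b for a, b in zip(sig_f, sig_e))

# Toy daily counts over a shared date range (real signatures span years of news text).
dates = ["2004-12-26", "2004-12-27", "2004-12-28", "2004-12-29"]
counts_en = {"2004-12-26": 40, "2004-12-27": 95, "2004-12-28": 60}
counts_es = {"2004-12-27": 80, "2004-12-28": 70, "2004-12-29": 15}
print(temporal_similarity(temporal_signature(counts_en, dates),
                          temporal_signature(counts_es, dates)))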

3.3 Reordering

The remaining component of the phrase-based SMT model is the reordering model. We introduce a novel algorithm for estimating po(orientation|f, e) from two monolingual corpora instead of a bitext.

Figure 1 illustrates how the phrase pair orientation statistics are estimated in the standard phrase-based SMT pipeline. For a phrase pair like (f = Profils, e = profile), we count its orientation with the previously translated phrase pair (f' = in Facebook, e' = Facebook) across all translated sentence pairs in the bitext.

In our pipeline we do not have translated sentence pairs. Instead, we look for monolingual sentences in the source corpus which contain the source phrase that we are interested in, like f = Profils, and at least one other phrase that we have a translation for, like f' = in Facebook. We then look for all target language sentences in the target monolingual corpus that contain the translation of f (here e = profile) and any translation of f'. Figure 6 illustrates that it is possible to find evidence for po(swapped|Profils, profile), even from the non-parallel, non-translated sentences drawn from two independent monolingual corpora. By looking for foreign sentences containing pairs of adjacent foreign phrases (f, f') and English sentences containing their corresponding translations (e, e'), we are able to increment orientation counts for (f, e) by looking at whether e and e' are adjacent, swapped, or discontinuous. The orientations correspond directly to those shown in Figure 1.

One subtlety of our method is that shorter and more frequent phrases (e.g. punctuation) are more likely to appear in multiple orientations with a given phrase, and therefore provide poor evidence of reordering. Therefore, we (a) collect the longest contextual phrases (which also appear in the phrase table) for reordering feature estimation, and (b) prune the set of sentences so that we only keep a small set of least frequent contextual phrases (this has the effect of dropping many function words and punctuation marks and relying more heavily on multi-word content phrases to estimate the reordering).²

2 The pruning step has an additional benefit of minimizing the memory needed for orientation feature estimations.

Our algorithm for learning the reordering parameters is given in Figure 5. The algorithm estimates a probability distribution over monotone, swap, and discontinuous orientations (p_m, p_s, p_d) for a phrase pair (f, e) from two monolingual corpora Cf and Ce. It begins by calling CollectOccurs to collect the longest matching phrase table phrases that precede f in source monolingual data (Bf), as well as those that precede (Be), follow (Ae), and are discontinuous (De) with e in the target language data. For each unique phrase f' preceding f, we look up translations in the phrase table T. Next, we count³ how many translations e' of f' appeared before, after or were discontinuous with e in the target language data. Finally, the counts are normalized and returned. These normalized counts are the values we use as estimates of po(orientation|f, e).

3 #L(x) returns the count of object x in list L.

    Input:  source and target phrases f and e,
            source and target monolingual corpora Cf and Ce,
            phrase table pairs T = {(f^(i), e^(i))}, i = 1..N
    Output: orientation features (p_m, p_s, p_d)

    Sf <- sentences containing f in Cf
    Se <- sentences containing e in Ce
    (Bf, _, _)   <- CollectOccurs(f, union of the f^(i), Sf)
    (Be, Ae, De) <- CollectOccurs(e, union of the e^(i), Se)
    c_m = c_s = c_d = 0
    foreach unique f' in Bf do
        foreach translation e' of f' in T do
            c_m = c_m + #Be(e')
            c_s = c_s + #Ae(e')
            c_d = c_d + #De(e')
    c <- c_m + c_s + c_d
    return (c_m/c, c_s/c, c_d/c)

    CollectOccurs(r, R, S):
        B <- (); A <- (); D <- ()
        foreach sentence s in S do
            foreach occurrence of phrase r in s do
                B <- B + (longest phrase preceding r and in R)
                A <- A + (longest phrase following r and in R)
                D <- D + (longest phrase discontinuous with r and in R)
        return (B, A, D)

Figure 5: Algorithm for estimating reordering probabilities from monolingual data.

Figure 6: Collecting phrase orientation statistics for an English-German phrase pair (profile, Profils) from non-parallel sentences (the German sentence translates as Creating a Facebook profile is easy).
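The Python sketch below illustrates the spirit of the Figure 5 procedure under strong simplifying assumptions: every "phrase" is a single token, only the token immediately preceding f is used as context, and the phrase table is a toy dictionary. It is not the full CollectOccurs algorithm, and the sentences and table entries are invented for illustration.

from collections import defaultdict

def orientation_features(f, e, src_sents, trg_sents, phrase_table):
    # Estimate (p_monotone, p_swap, p_discontinuous) for the pair (f, e) from two
    # unrelated monolingual corpora, following the counting convention of Figure 5.
    counts = defaultdict(float)
    preceding = {s[i - 1] for s in src_sents for i, w in enumerate(s) if w == f and i > 0}
    for f_prev in preceding:
        for e_prev in phrase_table.get(f_prev, []):
            for t in trg_sents:
                for j, w in enumerate(t):
                    if w != e:
                        continue
                    if j > 0 and t[j - 1] == e_prev:
                        counts["monotone"] += 1       # translation of f' directly precedes e
                    elif j + 1 < len(t) and t[j + 1] == e_prev:
                        counts["swap"] += 1           # translation of f' directly follows e
                    elif e_prev in t:
                        counts["discontinuous"] += 1  # co-occurs elsewhere in the sentence
    total = sum(counts.values()) or 1.0
    return tuple(counts[o] / total for o in ("monotone", "swap", "discontinuous"))

# Toy usage with whitespace-tokenized sentences and a tiny phrase table.
src = [s.split() for s in ["in Facebook Profils anlegen", "das Profils bild"]]
trg = [s.split() for s in ["your Facebook profile", "profile picture settings"]]
table = {"in": ["in"], "Facebook": ["Facebook"], "das": ["the"]}
print(orientation_features("Profils", "profile", src, trg, table))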

Monolingual training corpora

                      Europarl      Gigaword          Wikipedia
  date range          4/96-10/09    5/94-12/08        n/a
  uniq shared dates   829           5,249             n/a
  Spanish articles    n/a           3,727,954         59,463
  English articles    n/a           4,862,876         59,463
  Spanish lines       1,307,339     22,862,835        2,598,269
  English lines       1,307,339     67,341,030        3,630,041
  Spanish words       28,248,930    774,813,847       39,738,084
  English words       27,335,006    1,827,065,374     61,656,646

Spanish-English phrase table

  Phrase pairs        3,093,228
  Spanish phrases     89,386
  English phrases     926,138
  Spanish unigrams    13,216    (avg # translations 98.7)
  Spanish bigrams     41,426    (avg # translations 31.9)
  Spanish trigrams    34,744    (avg # translations 13.5)

Table 1: Statistics about the monolingual training data and the phrase table that was used in all of the experiments.

4 Experimental Setup

We use the Spanish-English language pair to test our method for estimating the parameters of an SMT system from monolingual corpora. This allows us to compare our method against the normal bilingual training procedure. We expect bilingual training to result in higher translation quality because it is a more direct method for learning translation probabilities. We systematically remove different parameters from the standard phrase-based model, and then replace them with our monolingual equivalents. Our goal is to recover as much of the loss as possible for each of the deleted bilingual components.

The standard phrase-based model that we use as our top-line is the Moses system (Koehn et al., 2007) trained over the full Europarl v5 parallel corpus (Koehn, 2005). With the exception of maximum phrase length (set to 3 in our experiments), we used default values for all of the parameters. All experiments use a trigram language model trained on the English side of the Europarl corpus using SRILM with Kneser-Ney smoothing. To tune feature weights in minimum error rate training, we use a development bitext of 2,553 sentence pairs, and we evaluate performance on a test set of 2,525 single-reference translated newswire articles. These development and test datasets were distributed in the WMT shared task (Callison-Burch et al., 2010).⁴ MERT was re-run for every experiment.

4 Specifically, news-test2008 plus news-syscomb2009 for dev and newstest2009 for test.

We estimate the parameters of our model from two sets of monolingual data, detailed in Table 1:

First, we treat the two sides of the Europarl parallel corpus as independent, monolingual corpora. Haghighi et al. (2008) also used this method to show how well translations could be learned from monolingual corpora under ideal conditions, where the contextual and temporal distribution of words in the two monolingual corpora are nearly identical.

Next, we estimate the features from truly monolingual corpora. To estimate the contextual and temporal similarity features, we use the Spanish and English Gigaword corpora.⁵ These corpora are substantially larger than the Europarl corpora, providing 27x as much Spanish and 67x as much English for contextual similarity, and 6x as many paired dates for temporal similarity. Topical similarity is estimated using Spanish and English Wikipedia articles that are paired with interlanguage links.

5 We use the afp, apw and xin sections of the corpora.

To project context vectors from Spanish to English, we use a bilingual dictionary containing entries for 49,795 Spanish words. Note that end-to-end translation quality is robust to substantially reducing dictionary size, but we omit these experiments due to space constraints. The context vectors for words and phrases incorporate co-occurrence counts using a two-word window on either side.

The title of our paper uses the word towards because we assume that an inventory of phrase pairs is given. Future work will explore inducing the

phrase table itself from monolingual texts. Across all of our experiments, we use the phrase table that the bilingual model learned from the Europarl parallel corpus. We keep its phrase pairs, but we drop all of its scores. Table 1 gives details of the phrase pairs. In our experiments, we estimated similarity and reordering scores for more than 3 million phrase pairs. For each source phrase, the set of possible translations was constrained and likely to contain good translations. However, the average number of possible translations was high (ranging from nearly 100 translations for each unigram to 14 for each trigram). These contain a lot of noise and result in low end-to-end translation quality without good estimates of translation quality, as the experiments in Section 5.1 show.

Software. Because many details of our estimation procedures must be omitted for space, we distribute our full set of code along with scripts for running our experiments and output translations. These may be downloaded from http://www.cs.jhu.edu/anni/papers/lowresmt/

Figure 7: Much of the loss in BLEU score when bilingually estimated features are removed from a Spanish-English translation system (experiments 1-4) can be recovered when they are replaced with monolingual equivalents estimated from monolingual Europarl data (experiments 5-10). The labels indicate how the different types of parameters are estimated; the first part is for phrase-table features, the second is for reordering probabilities.

Figure 8: Performance of monolingual features derived from truly monolingual corpora. Over 82% of the BLEU score loss can be recovered.

  Exp      Phrase scores / orientation scores
  1        B/B    bilingual / bilingual (Moses)
  2        B/-    bilingual / distortion
  3        -/B    none / bilingual
  4        -/-    none / distortion
  5, 12    -/M    none / mono
  6, 13    t/-    temporal mono / distortion
  7, 14    o/-    orthographic mono / distortion
  8, 15    c/-    contextual mono / distortion
  16       w/-    Wikipedia topical mono / distortion
  9, 17    M/-    all mono / distortion
  10, 18   M/M    all mono / mono
  11, 19   BM/B   bilingual + all mono / bilingual

5 Experimental Results

Figures 7 and 8 give experimental results. Figure 7 shows the performance of the standard phrase-based model when each of the bilingually estimated features are removed. It shows how much of the performance loss can be recovered using our monolingual features when they are estimated from the Europarl training corpus but treating each side as an independent, monolingual corpus. Figure 8 shows the recovery when using truly monolingual corpora to estimate the parameters.

5.1 Lesion experiments

Experiments 1-4 remove bilingually estimated parameters from the standard model. For Spanish-English, the relative contribution of the phrase-table features (which include the phrase translation probabilities φ and the lexical weights w) is greater than the reordering probabilities. When the reordering probability po(orientation|f, e) is eliminated and replaced with a simple distance-based distortion feature that does not require a bitext to estimate, the score dips only marginally since word order in English and Spanish is similar. However, when both the reordering and the phrase table features are dropped, leaving only the LM feature and the phrase penalty, the resulting translation quality is abysmal, with the score dropping a total of over 17 BLEU points.

5.2 Adding equivalent monolingual features estimated using Europarl

Experiments 5-10 show how much our monolingual equivalents could recover when the monolingual corpora are drawn from the two sides of the bitext. For instance, our algorithm for estimating

reordering probabilities from monolingual data (-/M) adds 5 BLEU points, which is 73% of the potential recovery going from the model (-/-) to the model with bilingual reordering features (-/B). Of the temporal, orthographic, and contextual monolingual features the temporal feature performs the best. Together (M/-), they recover more than each individually. Combining monolingually estimated reordering and phrase table features (M/M) yields a total gain of 13.5 BLEU points, or over 75% of the BLEU score loss that occurred when we dropped all features from the phrase table. However, these results use monolingual corpora which have practically identical phrasal and temporal distributions.

5.3 Estimating features using truly monolingual corpora

Experiments 12-18 estimate all of the features from truly monolingual corpora. Our novel algorithm for estimating reordering holds up well and recovers 69% of the loss, only 0.4 BLEU points less than when estimated from the Europarl monolingual texts. The temporal similarity feature does not perform as well as when it was estimated using Europarl data, but the contextual feature does. The topic similarity using Wikipedia performs the strongest of the individual features.

Combining the monolingually estimated reordering features with the monolingually estimated similarity features (M/M) yields a total gain of 14.8 BLEU points, or over 82% of the BLEU point loss that occurred when we dropped all features from the phrase table. This is equivalent to training the standard system on a bitext with roughly 60,000 lines or nearly 2 million words (learning curve omitted for space).

Finally, we supplement the standard bilingually estimated model parameters with our monolingual features (BM/B), and we see a 1.5 BLEU point increase over the standard model. Therefore, our monolingually estimated scores capture some novel information not contained in the standard feature set.

6 Additional Related Work

Carbonell et al. (2006) described a data-driven MT system that used no parallel text. It produced translation lattices using a bilingual dictionary and scored them using an n-gram language model. Their method has no notion of translation similarity aside from a bilingual dictionary. Similarly, Sanchez-Cartagena et al. (2011) supplement an SMT phrase table with translation pairs extracted from a bilingual dictionary and give each a frequency of one for computing translation scores. Ravi and Knight (2011) treat MT without parallel training data as a decipherment task and learn a translation model from monolingual text. They translate corpora of Spanish time expressions and subtitles, which both have a limited vocabulary, into English. Their method has not been applied to broader domains of text.

Most work on learning translations from monolingual texts only examines small numbers of frequent words. Huang et al. (2005) and Daume and Jagarlamudi (2011) are exceptions that improve MT by mining translations for OOV items.

A variety of past research has focused on mining parallel or comparable corpora from the web (Munteanu and Marcu, 2006; Smith et al., 2010; Uszkoreit et al., 2010). Others use an existing SMT system to discover parallel sentences within independent monolingual texts, and use them to re-train and enhance the system (Schwenk, 2008; Chen et al., 2008; Schwenk and Senellart, 2009; Rauf and Schwenk, 2009; Lambert et al., 2011). These are complementary but orthogonal to our research goals.

7 Conclusion

This paper has demonstrated a novel set of techniques for successfully estimating phrase-based SMT parameters from monolingual corpora, potentially circumventing the need for large bitexts, which are expensive to obtain for new languages and domains. We evaluated the performance of our algorithms in a full end-to-end translation system. Assuming that a bilingual-corpus-derived phrase table is available, we were able to utilize our monolingually-estimated features to recover over 82% of the BLEU loss that resulted from removing the bilingual-corpus-derived phrase-table probabilities. We also showed that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated features. Thus our techniques have stand-alone efficacy when large bilingual corpora are not available and also make a significant contribution to combined ensemble performance when they are.

References

Enrique Alfonseca, Massimiliano Ciaramita, and Keith Hall. 2009. Gazpacho and summer rash: lexical relationships from temporal patterns of web search queries. In Proceedings of EMNLP.

Taylor Berg-Kirkpatrick and Dan Klein. 2011. Simple effective decipherment via combinatorial optimization. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP-2011), Edinburgh, Scotland, UK.

Shane Bergsma and Benjamin Van Durme. 2011. Learning bilingual lexicons using the visual similarity of labeled web images. In Proceedings of the International Joint Conference on Artificial Intelligence.

Peter Brown, John Cocke, Stephen Della Pietra, Vincent Della Pietra, Frederick Jelinek, Robert Mercer, and Paul Poossin. 1988. A statistical approach to language translation. In 12th International Conference on Computational Linguistics (CoLing-1988).

Peter Brown, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer. 1993. The mathematics of machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311, June.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Workshop on Statistical Machine Translation.

Jaime Carbonell, Steve Klein, David Miller, Michael Steinbaum, Tomer Grassiany, and Jochen Frey. 2006. Context-based machine translation. In Proceedings of AMTA.

Boxing Chen, Min Zhang, Aiti Aw, and Haizhou Li. 2008. Exploiting n-best hypotheses for SMT self-enhancement. In Proceedings of ACL/HLT, pages 157-160.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of ACL.

Hal Daume and Jagadeesh Jagarlamudi. 2011. Domain adaptation for machine translation by mining unseen words. In Proceedings of ACL/HLT.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of ACL/CoLing.

Nikesh Garera, Chris Callison-Burch, and David Yarowsky. 2009. Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Thirteenth Conference On Computational Natural Language Learning (CoNLL-2009), Boulder, Colorado.

Ulrich Germann. 2001. Building a statistical machine translation system from scratch: How much bang for the buck can we expect? In ACL 2001 Workshop on Data-Driven Machine Translation, Toulouse, France.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL/HLT.

Fei Huang, Ying Zhang, and Stephan Vogel. 2005. Mining key phrase translations from web corpora. In Proceedings of EMNLP.

Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proceedings of the ACL/Coling.

Kevin Knight and Jonathan Graehl. 1997. Machine transliteration. In Proceedings of ACL.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In ACL Workshop on Unsupervised Lexical Acquisition.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT/NAACL.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL-2007 Demo and Poster Sessions.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Machine Translation Summit.

Shankar Kumar and William Byrne. 2004. Local phrase reordering models for statistical machine translation. In Proceedings of HLT/NAACL.

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on translation model adaptation using monolingual data. In Proceedings of the Workshop on Statistical Machine Translation, pages 284-293, Edinburgh, Scotland, UK.

David Mimno, Hanna Wallach, Jason Naradowsky, David Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of EMNLP.

Dragos Stefan Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of the ACL/Coling.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417-449.

Franz Joseph Och. 2002. Statistical Machine Transla-
tion: From Single-Word Models to Alignment Tem-
plates. Ph.D. thesis, RWTH Aachen.
Franz Josef Och. 2003. Minimum error rate training
for statistical machine translation. In Proceedings
of ACL.
Reinhard Rapp. 1995. Identifying word translations
in non-parallel texts. In Proceedings of ACL.
Reinhard Rapp. 1999. Automatic identification of
word translations from unrelated English and Ger-
man corpora. In Proceedings of ACL.
Sadaf Abdul Rauf and Holger Schwenk. 2009. On the
use of comparable corpora to improve SMT perfor-
mance. In Proceedings of EACL.
Sujith Ravi and Kevin Knight. 2011. Deciphering for-
eign language. In Proceedings of ACL/HLT.
Víctor M. Sánchez-Cartagena, Felipe Sánchez-Martínez, and Juan Antonio Pérez-Ortiz. 2011.
Integrating shallow-transfer rules into phrase-based
statistical machine translation. In Proceedings of
the XIII Machine Translation Summit.
Charles Schafer and David Yarowsky. 2002. Inducing
translation lexicons via diverse similarity measures
and bridge languages. In Proceedings of CoNLL.
Holger Schwenk and Jean Senellart. 2009. Transla-
tion model adaptation for an Arabic/French news
translation system by lightly-supervised training. In
MT Summit.
Holger Schwenk. 2008. Investigations on large-scale
lightly-supervised training for statistical machine
translation. In Proceedings of IWSLT.
Jason R. Smith, Chris Quirk, and Kristina Toutanova.
2010. Extracting parallel sentences from compa-
rable corpora using document level alignment. In
Proceedings of HLT/NAACL.
Christoph Tillman. 2004. A unigram orientation
model for statistical machine translation. In Pro-
ceedings of HLT/NAACL.
Christoph Tillmann. 2003. A projection extension al-
gorithm for statistical machine translation. In Pro-
ceedings of EMNLP.
Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and
Moshe Dubiner. 2010. Large scale parallel docu-
ment mining for machine translation. In Proceed-
ings of CoLing.
Ashish Venugopal, Stephan Vogel, and Alex Waibel.
2003. Effective phrase translation extraction from
alignment models. In Proceedings of ACL.

Character-Based Pivot Translation for Under-Resourced Languages and
Domains

Jörg Tiedemann
Department of Linguistics and Philology
Uppsala University, Uppsala/Sweden
jorg.tiedemann@lingfil.uu.se

Abstract

In this paper we investigate the use of character-level translation models to support the translation from and to under-resourced languages and textual domains via closely related pivot languages. Our experiments show that these low-level models can be successful even with tiny amounts of training data. We test the approach on movie subtitles for three language pairs and legal texts for another language pair in a domain adaptation task. Our pivot translations outperform the baselines by a large margin.

1 Introduction

Data-driven approaches have been extremely successful in most areas of natural language processing (NLP) and can be considered the main paradigm in application-oriented research and development. Research in machine translation is a typical example with the dominance of statistical models over the last decade. This is even enforced due to the availability of toolboxes such as Moses (Koehn et al., 2007) which make it possible to build translation engines within days or even hours for any language pair provided that appropriate training data is available. However, this reliance on training data is also the most severe limitation of statistical approaches. Resources in large quantities are only available for a few languages and domains. In the case of SMT, the dilemma is even more apparent as parallel corpora are rare and usually quite sparse. Some languages can be considered lucky, for example, because of political situations that lead to the production of freely available translated material on a large scale. A lot of research and development would not have been possible without the European Union and its language policies, to give an example.

One of the main challenges of current NLP research is to port data-driven techniques to under-resourced languages, which refers to the majority of the world's languages. One obvious approach is to create appropriate data resources even for those languages in order to enable the use of similar techniques designed for high-density languages. However, this is usually too expensive and often impossible with the quantities needed. Another idea is to develop new models that can work with (much) less data but still make use of resources and techniques developed for other well-resourced languages.

In this paper, we explore pivot translation techniques for the translation from and to resource-poor languages with the help of intermediate resource-rich languages. We explore the fact that many poorly resourced languages are closely related to well equipped languages, which enables low-level techniques such as character-based translation. We can show that these techniques can boost the performance enormously, tested for several language pairs. Furthermore, we show that pivoting can also be used to overcome data sparseness in specific domains. Even high density languages are under-resourced in most textual domains and pivoting via in-domain data of another language can help to adapt statistical models. In our experiments, we observe that related languages have the largest impact in such a setup.

The remaining parts of the paper are organized as follows: First we describe the pivot translation approach used in this study. Thereafter, we dis-

cuss character-based translation models followed by a detailed presentation of our experimental results. Finally, we briefly summarize related work and conclude the paper with discussions and prospects for future work.

2 Pivot Models

Information from pivot languages can be incorporated in SMT models in various ways. The main principle refers to the combination of source-to-pivot and pivot-to-target translation models. In our setup, one of these models includes a resource-poor language (source or target) and the other one refers to a standard model with appropriate data resources. A condition is that we have at least some training data for the translation between pivot and the resource-poor language. However, for the original task (source-to-target translation) we do not require any data resources except for purposes of comparison.

We will explore various models for the translation between the resource-poor language and the pivot language and most of them are not compatible with standard phrase-based translation models. Hence, triangulation methods (Cohn and Lapata, 2007) for combining phrase tables are not applicable in our case. Instead, we explore a cascaded approach (also called transfer method (Wu and Wang, 2009)) in which we translate the input text in two steps using a linear interpolation for rescoring N-best lists. Following the method described in (Utiyama and Isahara, 2007) and (Wu and Wang, 2009), we use the best n hypotheses from the translation of source sentences s to pivot sentences p and combine them with the top m hypotheses for translating these pivot sentences to target sentences t:

    t̂ = argmax_t Σ_{k=1}^{L} [ λ λ_k^{sp} h_k^{sp}(s, p) + (1 − λ) λ_k^{pt} h_k^{pt}(p, t) ]

where h_k^{xy} are feature functions for model xy with appropriate weights λ_k^{xy}.¹ Basically, this means that we simply add the scores and, similar to related work, we assume that the feature weights can be set independently for each model using minimum error rate training (MERT) (Och, 2003). In our setup we added the parameter λ that can be used to weight the importance of one model over the other. This can be useful as we do not consider the entire hypothesis space but only a small subset of N-best lists. In the simplest case, this weight is set to 0.5 making both models equally important. An alternative to fitting the interpolation weight would be to perform a global optimization procedure. However, a straightforward implementation of pivot-based MERT would be prohibitively slow due to the expensive two-step translation procedure over n-best lists.

1 Note that we do not require the same feature functions in both models even though the formula above implies this for simplicity of representation.

A general condition for the pivot approach is to assume independent training sets for both translation models as already pointed out by (Bertoldi et al., 2008). In contrast to research presented in related work (see, for example, (Koehn et al., 2009)) this condition is met in our setup in which all data sets represent different samples over the languages considered (see section 4).²

2 Note that different samples may still include common sentences.

3 Character-Based SMT

The basic idea behind character-based translation models is to take advantage of the strong lexical and syntactic similarities between closely related languages. Consider, for example, Figure 1. Related languages like Catalan and Spanish or Danish and Norwegian have common roots and, therefore, use similar concepts and express them in similar grammatical structures. Spelling conventions can still be quite different but those differences are often very consistent. The Bosnian-Macedonian example also shows that we do not have to require any alphabetic overlap in order to obtain character-level similarities.

Regularities between such closely related languages can be captured below the word level. We can also assume a more or less monotonic relation between the two languages which motivates the idea of translation models over character N-grams treating translation as a transliteration task (Vilar et al., 2007). Conceptually it is straightforward to think of phrase-based models on the character level. Sequences of characters can be used instead of word N-grams for both, translation and language models. Training can proceed with the same tools and approaches. The basic task is to
142
cedure and the resulting transducer can be used to
find the Viterbi alignment between characters ac-
cording to the best sequence of edit operations ap-
plied to transform one string into the other. Exten-
sions to this model are possible, for example the
use of many-to-many alignments which have been
shown to be very effective in letter-to-phoneme
alignment tasks (Jiampojamarn et al., 2007).
One advantage of the edit-distance-based trans-
ducer models is that the alignments they pre-
dict are strictly monotonic and cannot easily be
Figure 1: Some examples of movie subtitle transla- confused by spurious relations between charac-
tions between closely related languages (either sharing
ters over longer distances. Long distance align-
parts of the same alphabet or not).
ments are only possible in connection with a se-
ries of insertions and deletions that usually in-
prepare the data to comply with the training pro- crease the alignment costs in such a way that they
cedures (see Figure 2). are avoided if possible. On the other hand, IBM
word alignment models also prefer monotonic
alignments over non-monotonic ones if there is no
good reason to do otherwise (i.e., there is frequent
evidence of distorted alignments). However, the
size of the vocabulary in a character-level model
is very small (several orders of magnitude smaller
Figure 2: Data pre-processing for training models on
than on the word level) and this may cause serious
the character level. Spaces are represented by and
each sentence is treated as one sequence of characters. confusion of the word alignment model that very
much relies on context-independent lexical trans-
lation probabilities. Hence, for character align-
3.1 Character Alignment ment, the lexical evidence is much less reliable
One crucial difference is the alignment of charac- without their context.
ters, which is required instead of an alignment of It is certainly possible to find a compromise be-
words. Clearly, the traditional IBM word align- tween word-level and character-level models in
ment models are not designed for this task es- order to generalize below word boundaries but
pecially with respect to distortion. However, the avoiding alignment problems as discussed above.
same generative story can still be applied in gen- Morpheme-based translation models have been
eral. Vilar et al. (2007) explore a two-step proce- explored in several studies with similar motiva-
dure where words are aligned first (with the tradi- tions as in our approach, a better generalization
tional IBM models) to divide sentence pairs into from sparse training data (Fishel and Kirik, 2010;
aligned segments of reasonable size and the char- Luong et al., 2010). However, these approaches
acters are then aligned with the same algorithm. have the drawback that they require proper mor-
An alternative is to use models designed for phological analyses. Data-driven techniques ex-
transliteration or related character-level transfor- ist even for morphology, but their use in SMT
mation tasks. Many approaches are based on still needs to be shown (Fishel, 2009). The sit-
transducer models that resemble string edit oper- uation is comparable to the problems of integrat-
ations such as insertions, deletions and substitu- ing linguistically motivated phrases into phrase-
tions (Ristad and Yianilos, 1998). Weighted fi- based SMT (Koehn et al., 2003). Instead we opt
nite state transducers (WFSTs) can be trained on for a more general approach to extend context to
unaligned pairs of character sequences and have facilitate, especially, the alignment step. Figure 3
been shown to be very effective for transliteration shows how we can transform texts into sequences
tasks or letter-to-phoneme conversions (Jiampoja- of bigrams that can be aligned with standard ap-
marn et al., 2007). The training procedure usually proaches without making any assumptions about
employs an expectation maximization (EM) pro- linguistically motivated segmentations.

143
cu ur rs so o c co on nf fi ir rm ma ad do o . . BLEU, NIST, METEOR etc. The same simple
q qu ue e e es s e es so o ? ? post-processing as mentioned in the previous sec-
tion can be applied to turn the character transla-
Figure 3: Two Spanish sentences as sequences of char- tions into normal text. However, it can be use-
acter bigrams with a final marking the end of a sen- ful to look at some other measures as well that
tence. consider near matches on the character level in-
stead of matching words and word N-grams only.
In this way we can construct a parallel corpus with Character-level models have the ability to produce
slightly richer contextual information as input to strings that may be close to the reference and still
the alignment program. The vocabulary remains do not match any of the words contained. They
small (for example, 1267 bigrams in the case of may generate non-words that include mistakes
Spanish compared to 84 individual characters in which look like spelling-errors or minor gram-
our experiments) but lexical translation probabili- matical mistakes. Those words are usually close
ties become now much more differentiated. enough to the correct target words to be recog-
With this, it is now possible to use the align- nized by the user, which is often more acceptable
ment between bigrams to train a character-level than leaving foreign words untranslated. This is
translation system as we have the same number of especially true as many unknown words represent
bigrams as we have characters (and the first char- important content words that bear a lot of infor-
acter in each bigram corresponds to the charac- mation. The problem of unknown words is even
ter at that position). Certainly, it is also possible more severe for morphologically rich language as
to train a bigram translation model (and language many word forms are simply not part of (sparse)
model). This has the (one and only) advantage training data sets. Untranslated words are espe-
that one character of context across phrase bound- cially annoying when translating languages that
aries (i.e. character N-grams) is used in the se- use different writing systems. Consider, for ex-
lection of translation alternatives from the phrase ample, the following subtitles in Macedonian (us-
table.3 ing Cyrillic letters) that have been translated from
Bosnian (written in Latin characters):
3.2 Tuning Character-Level Models
reference: , .
A final remark on training character-based SMT word-based: casu vina, .
models is concerned with feature weight tun- char-based: , .
ing. It certainly makes not much sense to com- reference: .
pute character-level BLEU scores for tuning fea- word-based: starom svetilistu.
char-based: .
ture weights especially with the standard settings
of matching relatively short N-grams. Instead The underlined parts mark examples of character-
we would still like to measure performance in level differences with respect to the reference
terms of word-level BLEU scores (or any other translation. For the pivot translation approach, it
MT evaluation metric used in minimum error is important that the translations generated in the
rate training). Therefore, it is important to post- first step can be handled by the second one. This
process character-translated development sets be- means, that words generated by a character-based
fore adjusting weights. This is simply done model should at least be valid input words for the
by merging characters accordingly and replacing second step, even though they might refer to er-
the place-holders with spaces again. Thereafter, roneous inflections in that context. Therefore, we
MERT can run as usual. add another measure to our experimental results
presented below the number of unknown words
3.3 Evaluation
with respect to the input language of the second
Character-level translations can be evaluated in step. This applies only to models that are used
the same way as other translation hypotheses, as the first step in pivot-based translations. For
for example using automatic measures such as other models, we include a string similarity mea-
3
Using larger units (trigrams, for example) led to lower
sure based on the longest common subsequence
scores in our experiments (probably due to data sparseness) ratio (LCSR) (Stephen, 1992) in order to give an
and, therefore, are not reported here. impression about the closeness of the system

144
output to the reference translations. and another 2000 sentences for testing. For Gali-
cian, we only used 1000 sentences for each set
4 Experiments due to the lack of additional data. We were espe-
cially careful when preparing the data to exclude
We conducted a series of experiments to test
all sentences from tuning and test sets that could
the ideas of (character-level) pivot translation for
be found in any pivot or direct translation model.
resource-poor languages. We chose to use data
Hence, all test sentences are unseen strings for all
from a collection of translated subtitles com-
models presented in this paper (but they are not
piled in the freely available OPUS corpus (Tiede-
mann, 2009b). This collection includes a large variety of languages and contains mainly short sentences and sentence fragments, which suits character-level alignment very well. The selected settings represent translation tasks between languages (and domains) for which only very limited training data is available or none at all.

Below we present results from two general tasks:4 (i) Translating between English and a resource-poor language (in both directions) via a pivot language that is closely related to the resource-poor language. (ii) Translating between two languages in a domain for which no in-domain training data is available via a pivot language with in-domain data. We will start with the presentation of the first task and the character-based translation between closely related languages.

4.1 Task 1: Pivoting via Related Languages

We decided to look at resource-poor languages from two language families: Macedonian, representing a Slavic language from the Balkan region, and Catalan and Galician, representing two Romance languages spoken mainly in Spain. There is only little or no data available for translating from or to English for these languages. However, there are related languages with medium or large amounts of training data. For Macedonian, we use Bulgarian (which also uses a Cyrillic alphabet) and Bosnian (another related language that mainly uses Latin characters) as pivot languages. For Catalan and Galician, the obvious choice was Spanish (however, Portuguese would, for example, have been another reasonable option for Galician). Table 1 lists the data available for training the various models. Furthermore, we reserved 2000 sentences for tuning parameters; note that the test sets are not comparable with each other as they are sampled individually from independent data sets.

language pair          #sents   #words
Galician-English       -        -
Galician-Spanish       2k       15k
Catalan-English        50k      400k
Catalan-Spanish        64k      500k
Spanish-English        30M      180M
Macedonian-English     220k     1.2M
Macedonian-Bosnian     12k      60k
Macedonian-Bulgarian   155k     800k
Bosnian-English        2.1M     11M
Bulgarian-English      14M      80M

Table 1: Training data for the translation task between closely related languages in the domain of movie subtitles. Number of sentences (#sents) and number of words (#words) in thousands (k) and millions (M) (averages of source and target language).

The data sets represent several interesting test cases: Galician is the least supported language, with extremely little training data for building our pivot model. There is no data for the direct model and, therefore, no explicit baseline for this task. There is 30 times more data available for Catalan-English, but still too little for a decent standard SMT model. Interesting here is that we have more or less the same amount of data available for the baseline and for the pivot translation between the related languages. The data set for Macedonian-English is by far the largest among the baseline models and also bigger than the sets available for the related pivot languages. Especially Macedonian-Bosnian is not well supported. The interesting question is whether tiny amounts of pivot data can still be competitive. In all three cases, there is much more data available for the translation models between English and the pivot language.

In the following section we will look at the translation between related languages with various models and training setups before we consider the actual translation task via the bridge languages.

4 In all experiments we use standard tools like Moses, GIZA++, SRILM, mteval, etc. Details about basic settings are omitted here due to space constraints but can be found in the supplementary material. The data sets are available from http://stp.lingfil.uu.se/joerg/index.php?resources
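Before turning to the results in Table 2, it may help to picture what character-level training data looks like in practice: each side of a sentence pair is re-tokenized into character (or character-bigram) units so that the standard word-based pipeline listed in the footnote above can be reused unchanged. The following Python sketch is only an illustration under that assumption; the space symbol and the function names are ours, not part of the systems evaluated here.

def to_char_units(sentence, use_bigrams=False, space_symbol="_"):
    """Re-tokenize a sentence into space-separated character units.

    Word boundaries are kept by replacing blanks with a visible space
    symbol, so the character-level output can be mapped back to words.
    """
    chars = list(sentence.replace(" ", space_symbol))
    if not use_bigrams:
        return " ".join(chars)
    # Overlapping character bigrams; the final character is padded.
    return " ".join(a + b for a, b in zip(chars, chars[1:] + [space_symbol]))

def from_char_units(char_sequence, space_symbol="_"):
    """Invert to_char_units for single-character units."""
    return "".join(char_sequence.split()).replace(space_symbol, " ")

Applying such a conversion to both sides of every training pair turns a word-level parallel corpus into a character-level one on which alignment, phrase extraction and language modelling can then be run as usual.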

bs-mk bg-mk es-gl es-ca
Model BLEU % LCSR BLEU % LCSR BLEU % LCSR BLEU % LCSR
word-based 15.43 0.5067 14.66 0.6225 41.11 0.7966 62.73 0.8526
char WFST1:1 21.37++ 0.6903 13.33 0.6159 36.94 0.7832 73.17++ 0.8728
char WFST2:2 19.17++ 0.6737 12.67 0.6190 43.39++ 0.8083 70.64++ 0.8684
char IBMchar 23.17++ 0.6968 14.57 0.6347 45.21++ 0.8171 73.12++ 0.8767
char IBMbigram 24.84++ 0.7046 15.01++ 0.6374 44.06++ 0.8144 74.21++ 0.8803

Table 2: Translating from a related pivot language to the target language. Bosnian (bs) / Bulgarian (bg) → Macedonian (mk); Galician (gl) / Catalan (ca) → Spanish (es). Word-based refers to standard phrase-based SMT models. All other models use phrases over character sequences. The WFSTx:y models use weighted finite state transducers for character alignment with units that are at most x and y characters long, respectively. Other models use Viterbi alignments created by IBM model 4 using GIZA++ (Och and Ney, 2003) between characters (IBMchar) or bigrams (IBMbigram). LCSR refers to the averaged longest common subsequence ratio between system translations and references. Results are significantly better (++: p < 0.01, +: p < 0.05) or worse (--: p < 0.01, -: p < 0.05) than the word-based baseline.
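For reference, the LCSR figures in Table 2 can be computed along the lines of the following sketch, which measures the longest common subsequence at the character level and normalises by the length of the longer string (one common definition of the ratio; the exact normalisation used above is not restated here, and the function names are ours).

def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def lcsr(hypothesis, reference):
    """Longest common subsequence ratio between two strings (0..1)."""
    if not hypothesis and not reference:
        return 1.0
    return lcs_length(hypothesis, reference) / max(len(hypothesis), len(reference))

def averaged_lcsr(hypotheses, references):
    """Corpus-level figure: mean of the sentence-level ratios."""
    scores = [lcsr(h, r) for h, r in zip(hypotheses, references)]
    return sum(scores) / len(scores) if scores else 0.0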

mk-bs mk-bg gl-es ca-es


Model BLEU % UNK BLEU % UNK BLEU % UNK BLEU % UNK
word-based 14.22 17.83% 14.77 5.29% 43.22 10.18% 59.34 3.80%
char WFST1:1 21.74++ 1.50% 16.04++ 0.77% 50.24++ 1.17% 62.87++ 0.45%
char WFST2:2 19.19++ 2.05% 15.32 0.96% 50.59++ 1.28% 59.84 0.47%
char IBMchar 24.15++ 1.30% 17.12++ 0.80% 51.18++ 1.38% 64.35 ++ 0.59%
char IBMbigram 24.82++ 1.00% 17.28++ 0.77% 50.70++ 1.36% 65.14++ 0.48%

Table 3: Translating from the source language to a related pivot language. UNK gives the proportion of unknown
words with respect to the translation model from the pivot language to English.
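The UNK column of Table 3 can be read as in the short sketch below: the share of tokens in the intermediate (pivot-language) output of step one that are not covered by the source-side vocabulary of the pivot-to-English model. The variable names are purely illustrative.

def unknown_word_rate(step_one_output, pivot_model_vocabulary):
    """Proportion of intermediate tokens unknown to the next translation model."""
    total = 0
    unknown = 0
    for sentence in step_one_output:
        for token in sentence.split():
            total += 1
            unknown += token not in pivot_model_vocabulary
    return unknown / total if total else 0.0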

4.1.1 Translating Related Languages

The main challenge for the translation models between related languages is the restriction to very limited parallel training data. Character-level models make it possible to generalize to very basic translation units, leading to robust models in the sense of models without unknown events. The basic question is whether they provide reasonable translations with respect to given accepted references. Tables 2 and 3 give a comprehensive summary of various models for the languages selected in our experiments.

We can see that at least one character-based translation model outperforms the standard word-based model in all cases. This is true (and not very surprising) for the language pairs with very little training data, but it is also the case for language pairs with slightly more reasonable data sets like Bulgarian-Macedonian. The automatic measures indicate decent translation performances at this stage, which encourages their use in pivot translation that we will discuss in the next section.

Furthermore, we can also see the influence of different character alignment algorithms. Somewhat surprisingly, the best results are achieved with IBM alignment models that are not designed for this purpose. Transducer-based alignments produce consistently worse translation models (at least in terms of BLEU scores). The reason for this might be that the IBM models can handle noise in the training data more robustly. However, in terms of unknown words, WFST-based alignment is very competitive and often the best choice (but not much different from the best IBM-based models). The use of character bigrams leads to further BLEU improvements for all data sets except Galician-Spanish. However, this data set is extremely small, which may cause unpredictable results. In any case, the differences between character-based alignments and bigram-based ones are rather small and our experiments do not lead to conclusive results.

4.1.2 Pivot Translation

In this section we now look at cascaded translations via the related pivot language. Tables 4 and 5 summarize the results for various settings.

As we can see, the pivot translations for Catalan and Galician outperform the baselines by a large margin. Here, the baselines are, of course, very weak due to the minimal amount of training data. Furthermore, the Catalan-English test set appears to be very easy considering the relatively high BLEU scores achieved even with tiny

amounts of training data for the baseline. Still, no test sentence appears in any training or development set for either direct translation or pivot models. From the results, we can also see that Catalan and Galician are quite different from Spanish and require language-specific treatment. Using a large Spanish-English model (with over 30% BLEU in both directions) to translate from or to Catalan or Galician is not an option. The experiments show that character-based pivot models lead to better translations than word-based pivot models (in terms of BLEU scores). This reflects the performance gains presented in Table 2. Rescoring of N-best lists, on the other hand, does not have a big impact on our results. However, we did not spend time optimizing the parameters of N-best size and interpolation weight.

Model (BLEU in %)                       1x1      10x10
English → Catalan (baseline)            26.70
English → (Spanish = Catalan)            8.38
English → Spanish -word- Catalan        38.91++  39.59++
English → Spanish -char- Catalan        44.46++  46.82++
Catalan → English (baseline)            27.86
(Catalan = Spanish) → English            9.52
Catalan -word- Spanish → English        38.41++  38.65++
Catalan -char- Spanish → English        40.43++  40.73++
English → Galician (baseline)             -
English → (Spanish = Galician)           7.46
English → Spanish -word- Galician       20.55    20.76
English → Spanish -char- Galician       21.12    21.09
Galician → English (baseline)             -
(Galician = Spanish) → English           5.76
Galician -word- Spanish → English       13.16    13.20
Galician -char- Spanish → English       16.04    16.02

Table 4: Translating between Galician/Catalan and English via Spanish using a standard phrase-based SMT baseline, Spanish-English SMT models to translate from/to Catalan/Galician, and pivot-based approaches using word-level models or character-level models (based on IBMbigram alignments) with either one-best translations (1x1) or N-best lists (10x10 with an interpolation weight of 0.85).

Model (BLEU in %)                       1x1      10x10
English → Maced. (baseline)             11.04
English → Bosn. -word- Maced.            7.33     7.64
English → Bosn. -char- Maced.            9.99    10.34
English → Bulg. -word- Maced.           12.49++  12.62++
English → Bulg. -char- Maced.           11.57++  11.59+
Maced. → English (baseline)             20.24
Maced. -word- Bosn. → English           12.36    12.48
Maced. -char- Bosn. → English           18.73    18.64
Maced. -word- Bulg. → English           19.62    19.74
Maced. -char- Bulg. → English           21.05    21.10

Table 5: Translating between Macedonian (Maced) and English via Bosnian (Bosn) / Bulgarian (Bulg).

The results from the Macedonian task are not as clear. This is especially due to the different setup in which the baseline uses more training data than any of the related language pivot models. However, we can still see that the pivot translation via Bulgarian clearly outperforms the baseline. For the case of translating to Macedonian via Bulgarian, the word-based model seems to be more robust than the character-level model. This may be due to a larger number of non-words generated by the character-based pivot model. In general, the BLEU scores are much lower for all models involved (even for the high-density languages), which indicates larger problems with the generation of correct output and intermediate translations.

It is interesting that we can achieve almost the same performance as the baseline when translating via Bosnian even though we had much less training data at our disposal for the translation between Macedonian and Bosnian. In this setup, we can see that a character-based model was necessary in order to obtain the desired abstraction from the tiny amount of training data.

4.2 Task 2: Pivoting for Domain Adaptation

Sparse resources are not only a problem for specific languages but also for specific domains. SMT models are very sensitive to domain shifts and domain-specific data is often rare. In the following, we investigate a test case of translating between two languages (English and Norwegian) with reasonable amounts of data resources but in the wrong domain (movie subtitles instead of legal texts). Here again, we facilitate the translation process by a pivot language, this time with domain-specific data.

The task is to translate legal texts from Norwegian (Bokmål) to English and vice versa. The test set is taken from the English-Norwegian Parallel Corpus (ENPC) (Johansson et al., 1996) and contains 1493 parallel sentences (a selection of European treaties, directives and agreements). Otherwise, there is no training data available in this domain for English and Norwegian. Table 6 lists the other data resources we used in our study.

As we can see, there is a decent amount of training data for English-Norwegian, but the domain is strikingly different. On the other hand, there

is in-domain data for other languages like Danish that may act as an intermediate pivot. Furthermore, we have out-of-domain data for the translation between pivot and Norwegian. The sizes of the training data sets for the pivot models are comparable (in terms of words). The in-domain pivot data is controlled and very consistent and, therefore, high quality translations can be expected. The subtitle data is noisy and includes various movie genres. It is important to mention that the pivot data still does not contain any sentence included in the English-Norwegian test set.

Language pair        Domain     #sents  #words
English-Norwegian    subtitles  2.4M    18M
Norwegian-Danish     subtitles  1.5M    10M
Danish-English       DGT-TM     430k    9M

Table 6: Training data available for the domain adaptation task. DGT-TM refers to the translation memories provided by the JRC (Steinberger et al., 2006).

Table 7 summarizes the results of our experiments when using Danish and in-domain data as a pivot in translations from and to Norwegian.

Model (task: English → Norwegian)               BLEU
(step 1) English -dgt- Danish                   52.76
(step 2) Danish -subs_wo- Norwegian             29.87
(step 2) Danish -subs_ch- Norwegian             29.65
(step 2) Danish -subs_bi- Norwegian             25.65
English -subs- Norwegian (baseline)              7.20
English -dgt- (Danish = Norwegian)               9.44++
English -dgt- Danish -subs_wo- Norwegian        17.49++
English -dgt- Danish -subs_ch- Norwegian        17.61++
English -dgt- Danish -subs_bi- Norwegian        14.07++

Model (task: Norwegian → English)               BLEU
(step 1) Norwegian -subs_wo- Danish             30.15
(step 1) Norwegian -subs_ch- Danish             27.81
(step 1) Norwegian -subs_bi- Danish             28.52
(step 2) Danish -dgt- English                   57.23
Norwegian -subs- English (baseline)             11.41
(Norwegian = Danish) -dgt- English              13.21++
Norwegian -subs+dgtLM- English                  13.33++
Norwegian -subs_wo- Danish -dgt- English        25.75++
Norwegian -subs_ch- Danish -dgt- English        23.77++
Norwegian -subs_bi- Danish -dgt- English        26.29++

Table 7: Translating out-of-domain data via Danish. Models using in-domain data are marked with dgt and out-of-domain models are marked with subs. subs+dgtLM refers to a model with an out-of-domain translation model and an added in-domain language model. The subscripts wo, ch and bi refer to word, character and bigram models, respectively.

The influence of in-domain data in the translation process is enormous. As expected, the out-of-domain baseline does not perform well even though it uses the largest amount of training data in our setup. It is even outperformed by the in-domain pivot model when pretending that Norwegian is in fact Danish. For the translation into English, the in-domain language model helps a little bit (similar resources are not available for the other direction). However, having the strong in-domain model for translating to (and from) the pivot language improves the scores dramatically. The out-of-domain model in the other part of the cascaded translation does not destroy this advantage completely and the overall score is much higher than any other baseline.

In our setup, we used again a closely related language as a pivot. However, this time we had more data available for training the pivot translation model. Naturally, the advantages of the character-level approach diminish and the word-level model becomes a better alternative. However, there can still be a good reason for the use of a character-based model, as we can see in the success of the bigram model (subs_bi) in the translation from Norwegian to English (via Danish). A character-based model may generalize beyond domain-specific terminology, which leads to a reduction of unknown words when applied to a new domain. Note that using a character-based model in step two could possibly cause more harm than using it in step one of the pivot-based procedure. Using n-best lists for a subsequent word-based translation in step two may fix errors caused by character-based translation simply by ignoring hypotheses containing them, which makes such a model more robust to noisy input.

Finally, as an alternative, we can also look at other pivot languages. The domain adaptation task is not at all restricted to closely related pivot languages, especially considering the success of word-based models in the experiments above. Table 8 lists results for three other pivot languages. Surprisingly, the results are much worse than for the Danish test case. Apparently, these models are strongly influenced by the out-of-domain translation between Norwegian and the pivot language. The only success can be seen with another closely related language, Swedish. Lexical and syntactic similarity seems to be important to create models that are robust enough for domain shifts.
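The cascaded setups in Tables 4, 5 and 7 amount to a two-step translation, either over single best outputs or over small N-best lists whose step scores are combined before re-ranking. The sketch below shows one straightforward way to realise this; the translator callables, the linear score combination and the defaults (10-best lists and a weight of 0.85, echoing the 10x10 systems above) are assumptions for illustration rather than the exact recipe used in these experiments.

from typing import Callable, List, Tuple

# A decoding step is modelled as: sentence -> list of (hypothesis, score),
# with higher scores better (e.g. log-probabilities from the decoder).
NBestTranslator = Callable[[str], List[Tuple[str, float]]]

def pivot_translate(source: str,
                    src_to_pivot: NBestTranslator,
                    pivot_to_target: NBestTranslator,
                    n: int = 10,
                    weight: float = 0.85) -> str:
    """Translate source -> pivot -> target through two systems.

    Each pivot hypothesis is translated again; the two step scores are
    combined linearly and the best target hypothesis overall is returned.
    With n = 1 this reduces to a plain cascade of single best outputs.
    """
    candidates = []
    for pivot_hyp, score_1 in src_to_pivot(source)[:n]:
        for target_hyp, score_2 in pivot_to_target(pivot_hyp)[:n]:
            combined = weight * score_1 + (1.0 - weight) * score_2
            candidates.append((combined, target_hyp))
    return max(candidates)[1]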

Pivot=xx     en→xx    xx→no    en→xx→no
German       53.09    23.60     3.15
French       66.47    17.84     5.03
Swedish      52.62    24.79    10.07++

Pivot=xx     no→xx    xx→en    no→xx→en
German       15.02    53.02     5.52
French       17.69    65.85     8.78
Swedish      19.72    59.55    16.35++

Table 8: Alternative word-based pivot translations between Norwegian (no) and English (en).

5 Related Work

There is a wide range of pivot language approaches to machine translation and a number of strategies have been proposed. One of them is often called triangulation and usually refers to the combination of phrase tables (Cohn and Lapata, 2007). Phrase translation probabilities are merged and lexical weights are estimated by bridging word alignment models (Wu and Wang, 2007; Bertoldi et al., 2008). Cascaded translation via pivot languages is discussed by Utiyama and Isahara (2007) and is frequently used by various researchers (de Gispert and Mariño, 2006; Koehn et al., 2009; Wu and Wang, 2009) and commercial systems such as Google Translate. A third strategy is to generate or augment data sets with the help of pivot models. This is, for example, explored by de Gispert and Mariño (2006) and Wu and Wang (2009) (who call it the synthetic method). Pivoting has also been used for paraphrasing and lexical adaptation (Bannard and Callison-Burch, 2005; Crego et al., 2010). Nakov and Ng (2009) investigate pivot languages for resource-poor languages (but only when translating from the resource-poor language). They also use transliteration for adapting models to a new (related) language. Character-level SMT has been used for transliteration (Matthews, 2007; Tiedemann and Nabende, 2009) and also for the translation between closely related languages (Vilar et al., 2007; Tiedemann, 2009a).

6 Conclusions and Discussion

In this paper, we have discussed possibilities to translate via pivot languages on the character level. These models are useful to support under-resourced languages and explore strong lexical and syntactic similarities between closely related languages. Such an approach makes it possible to train reasonable translation models even with extremely sparse data sets. Moreover, character-level models introduce an abstraction that reduces the number of unknown words dramatically. In most cases, these unknown words represent information-rich units that bear large portions of the meaning to be translated. The following illustrates this effect on example translations with and without pivot model:

Example: Catalan → English (via Spanish)
Reference:   I have to grade these papers.
Baseline:    Tincque qualificar these examens.
Pivot_word:  Tincque qualificar these tests.
Pivot_char:  I have to grade these papers.

Example: Macedonian → English (via Bulgarian)
Reference:   It's a simple matter of self-preservation.
Baseline:    It's simply a question of [untranslated Cyrillic word].
Pivot_word:  That's a matter of [untranslated Cyrillic word].
Pivot_char:  It's just a question of yourself.

Leaving unseen words untranslated is not only annoying (especially if the input language uses a different writing system) but often makes translations completely incomprehensible. Pivot translations will still not be perfect (see example two above), but can at least be more intelligible. Character-based models can even take care of tokenization errors as the one shown above (Tincque should be two words, Tinc que). Fortunately, the generation of non-word sequences (observed as unknown words) does not seem to be a big problem and no special treatment is required to avoid such output. We would still like to address this issue in future work by adding a word-level LM in character-based SMT. However, Vilar et al. (2007) already showed that this did not have any positive effect in their character-based system. In a second study, we also showed that pivot models can be useful for adapting to a new domain. The use of in-domain pivot data leads to systems that outperform out-of-domain translation models by a large margin. Our findings point to many prospects for future work. For example, we would like to investigate combinations of character-based and word-based models. Character-based models may also be used for treating unknown words only. Multiple source approaches via several pivots are another possibility to be explored. Finally, we also need to further investigate the robustness of the approach with respect to other language pairs, data sets and learning parameters.
References

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL05), pages 597-604, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Nicola Bertoldi, Madalina Barbaiani, Marcello Federico, and Roldano Cattoni. 2008. Phrase-Based Statistical Machine Translation with Pivot Languages. In Proceedings of the International Workshop on Spoken Language Translation, pages 143-149, Hawaii, USA.

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 728-735, Prague, Czech Republic, June. Association for Computational Linguistics.

Josep Maria Crego, Aurélien Max, and François Yvon. 2010. Local lexical adaptation in machine translation through triangulation: SMT helping SMT. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 232-240, Beijing, China, August. Coling 2010 Organizing Committee.

A. de Gispert and J. B. Mariño. 2006. Catalan-English statistical machine translation without parallel corpus: Bridging through Spanish. In Proceedings of the 5th Workshop on Strategies for developing Machine Translation for Minority Languages (SALTMIL06) at LREC, pages 65-68, Genova, Italy.

Mark Fishel and Harri Kirik. 2010. Linguistically motivated unsupervised segmentation for machine translation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 1741-1745, Valletta, Malta.

Mark Fishel. 2009. Deeper than words: Morph-based alignment for statistical machine translation. In Proceedings of the Conference of the Pacific Association for Computational Linguistics (PacLing 2009), Sapporo, Japan.

Sittichai Jiampojamarn, Grzegorz Kondrak, and Tarek Sherif. 2007. Applying many-to-many alignments and hidden Markov models to letter-to-phoneme conversion. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 372-379, Rochester, New York, April. Association for Computational Linguistics.

Stig Johansson, Jarle Ebeling, and Knut Hofland. 1996. Coding and aligning the English-Norwegian Parallel Corpus. In K. Aijmer, B. Altenberg, and M. Johansson, editors, Languages in Contrast, pages 87-112. Lund University Press.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, NAACL 03, pages 48-54, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June. Association for Computational Linguistics.

Philipp Koehn, Alexandra Birch, and Ralf Steinberger. 2009. 462 machine translation systems for Europe. In Proceedings of MT Summit XII, pages 65-72, Ottawa, Canada.

Minh-Thang Luong, Preslav Nakov, and Min-Yen Kan. 2010. A hybrid morpheme-word representation for machine translation of morphologically rich languages. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 148-157, Cambridge, MA, October. Association for Computational Linguistics.

David Matthews. 2007. Machine transliteration of proper names. Master's thesis, School of Informatics, University of Edinburgh.

Preslav Nakov and Hwee Tou Ng. 2009. Improved statistical machine translation for resource-poor languages using related resource-rich languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1358-1367, Singapore, August. Association for Computational Linguistics.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160-167, Sapporo, Japan, July. Association for Computational Linguistics.

Eric Sven Ristad and Peter N. Yianilos. 1998. Learning string edit distance. IEEE Transactions on Pattern Recognition and Machine Intelligence, 20(5):522-532, May.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, and Dan Tufiş. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of

the 5th International Conference on Language Resources and Evaluation (LREC), pages 2142-2147.

Graham A. Stephen. 1992. String Search. Technical report, School of Electronic Engineering Science, University College of North Wales, Gwynedd.

Jörg Tiedemann and Peter Nabende. 2009. Translating transliterations. International Journal of Computing and ICT Research, 3(1):33-41.

Jörg Tiedemann. 2009a. Character-based PSMT for closely related languages. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT09), pages 12-19, Barcelona, Spain.

Jörg Tiedemann. 2009b. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, volume V, pages 237-248. John Benjamins, Amsterdam/Philadelphia.

Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 484-491, Rochester, New York, April. Association for Computational Linguistics.

David Vilar, Jan-Thorsten Peter, and Hermann Ney. 2007. Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation, pages 33-39, Prague, Czech Republic, June. Association for Computational Linguistics.

Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 856-863, Prague, Czech Republic, June. Association for Computational Linguistics.

Hua Wu and Haifeng Wang. 2009. Revisiting pivot language approach for machine translation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 154-162, Suntec, Singapore, August. Association for Computational Linguistics.

Does more data always yield better translations?
Guillem Gascó, Martha-Alicia Rocha, Germán Sanchis-Trilles,
Jesús Andrés-Ferrer and Francisco Casacuberta
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València
Camí de Vera s/n, 46022 València, Spain
{ggasco,mrocha,gsanchis,jandres,fcn}@dsic.upv.es

Abstract

Nowadays, there are large amounts of data available to train statistical machine translation systems. However, it is not clear whether all the training data actually help or not. A system trained on a subset of such huge bilingual corpora might outperform the use of all the bilingual data. This paper studies such issues by analysing two training data selection techniques: one based on approximating the probability of an in-domain corpus, and another based on infrequent n-gram occurrence. Experimental results not only report significant improvements over random sentence selection but also an improvement over a system trained with the whole available data. Surprisingly, the improvements are obtained with just a small fraction of the data that accounts for less than 0.5% of the sentences. Afterwards, we show that a much larger room for improvement exists, although this is done under non-realistic conditions.

1 Introduction

Globalisation and the popularisation of the Internet have led to a rapid increase in the amount of bilingual corpora available. Entities such as the European Union, the United Nations and other multinational organisations need to translate all the documentation they generate. Such translations happen every day and provide very large multilingual corpora, which are oftentimes difficult to process and significantly increase the computational requirements needed to train statistical machine translation (SMT) systems. For instance, the corpora made available for recent machine translation evaluations are in the order of 1 billion running words (Callison-Burch et al., 2010).

However, two main problems arise when attempting to use this huge pool of sentences for training SMT systems: firstly, a large portion of this data is obtained from domains that differ from that in which the SMT system is to be used or assessed; secondly, the use of all this data for training the system increases the computational training requirements. Despite the previous remarks, the de facto standard consists in training SMT systems with all the available data. This is due to the widespread misconception that the more data a system is trained with, the better its performance should be. Although the previous statement is theoretically true if all the data belongs to the same domain, this is not the case in the problems tackled by most of the SMT systems. For instance, enterprises often need to build on-demand systems (Yuste et al., 2010). In this case, since we are interested in translating some specific text, it is not clear whether training a system with all data yields better performance than training it with a wisely selected subset of bilingual sentences.

The bilingual sentence selection (BSS) task is stated as the problem of selecting the best subset of bilingual sentences from an available pool of sentences, with which to train an SMT system. This paper is concerned with BSS, and mainly two ideas are developed.

On the one hand, two BSS strategies that attempt to build better translation systems are analysed. Such strategies are able to improve state-of-the-art translation quality without the very high computational resources that are required when using the complete pool of sentences. Both techniques span two orthogonal criteria when selecting bilingual sentences from the available pool: avoiding the introduction of a bias in the original data distribution, and increasing the informativeness of the corpus.

On the other hand, we prove that among all possible subsets from the sentence pool, there is at least a small one that yields large improvements (up to 10 BLEU points) with respect to a system trained with all the data. In order to retrieve such a subset, we had to use an oracle that employs information extracted from the reference translations
only for the purpose of selecting bilingual sentences. However, references are not used at any stage within the translation system for obtaining the hypotheses. Note that although we are not able to achieve such an improvement without an oracle, this result restates the BSS problem as an interesting approach not only for reducing computational effort but also for significantly boosting performance. To our knowledge, no previous work has quantified the room for improvement that BSS techniques could achieve.

In order to assess the performance of the different BSS techniques, translation results are obtained by using a standard state-of-the-art SMT system (Koehn et al., 2007). The most recent literature defines the SMT problem (Papineni et al., 1998; Och and Ney, 2002) as follows: given an input sentence f from a certain source language, the purpose is to find an output sentence e in a certain target language such that

  \hat{e} = \arg\max_{e} \sum_{k=1}^{K} \lambda_k h_k(f, e)    (1)

where h_k(f, e) is a score function representing an important feature for the translation of f into e, as for example the language model of the target language, a reordering model or several translation models. The \lambda_k are the log-linear combination weights.

The main contributions of this paper are:

- A BSS technique is analysed which improves the results obtained with a random bilingual sentence selection strategy when the specific domain to be translated significantly differs from that of the pool of sentences.

- Another BSS technique is analysed that, using less than 0.5% of the sentences available, significantly improves over random selection, beating a system trained with all the pool of sentences.

- We prove, by means of an oracle, that a wise BSS technique can yield large improvements when compared with systems trained with all data available.

The remainder of the paper is structured as follows. Section 2 summarises the related work. Sections 3 and 4 present two BSS techniques, namely, probabilistic sampling and recovery of infrequent n-grams. In Section 5 experimental results are reported. Finally, the main results of the work and several future work directions are discussed in Section 6.

2 Related Work

Training data selection has been receiving an increasing amount of attention within the SMT community. For instance, in (Li et al., 2010; Gascó et al., 2010) several BSS techniques, similar to those analysed in this paper, have been applied for training MT systems when there are large training corpora available. However, neither have such techniques been formalised, nor their performance thoroughly analysed. A similar approach that gives weights to different subcorpora was proposed in (Matsoukas et al., 2009).

In (Lü et al., 2007), information retrieval methods are used in order to produce different submodels which are then weighted according to the sentence to be translated. In that work, the authors define the baseline as the result obtained training only with the corpus that shares the same domain as the test. Afterwards they claim that they are able to improve baseline translation quality by adding new sentences retrieved with their method. However, they neither compare their technique with random sentence selection, nor with a model trained with all the corpora.

Although the techniques that are applied for BSS are often very similar to those applied for active learning (AL), both problems are essentially different. Since the AL strategies assume that the pool of sentences is not translated, they are usually interested in finding the best monolingual subset of sentences to be translated by a human annotator. In contrast, in BSS, it is assumed that a fairly large amount of bilingual corpora is readily available, and the main goal consists in selecting only those sentences which will maximise system performance.

Some works have applied sentence selection in small scale AL frameworks. These works extend the training corpora at most with 5000 sentences. In (Ananthakrishnan et al., 2010), sentences are selected by means of discriminative techniques. In (Haffari et al., 2009) a technique is proposed for increasing the counts of phrases that are considered infrequent. Both works significantly differ from the current work not only on the framework, but also on the scale of the experiments, the

proposed techniques and the obtained improvements. Similar ideas applied to adaptation problems have been proposed in (Moore and Lewis, 2010; Axelrod et al., 2011).

3 Probabilistic Sampling

As discussed in Section 2, BSS is inherently linked with AL techniques in many meaningful ways. Selecting samples for learning our models incurs in a well-known difficulty in AL, the so-called sample bias problem (Dasgupta, 2009). This problem, which carries over to the BSS case, is summarised as the distortion introduced by the active strategy into the probability distribution underlying the training corpus. This bias forces the training algorithm to learn a distorted probability model which can significantly differ from the actual one.

In order to further analyse the sampling bias problem, consider the maximum likelihood estimation (MLE) of a probability model, p_\theta(e, f), for a given corpus of N data points {(e_n, f_n)} sampled from the actual probability distribution Pr(e, f). Recall that e denotes a target sentence whereas f stands for its source counterpart. MLE techniques aim at minimising the Kullback-Leibler divergence between the actual unknown probability distribution and the probability model (Bishop, 2006), defined as

  \mathrm{KL}(\mathrm{Pr} \,\|\, p_\theta) = \sum_{e,f} \mathrm{Pr}(e,f)\,\log\frac{\mathrm{Pr}(e,f)}{p_\theta(e,f)}    (2)

When minimising, Eq. (2) is simplified to

  \hat{\theta} = \arg\max_{\theta} \sum_{e,f} \mathrm{Pr}(e,f)\,\log p_\theta(e,f)    (3)

which is approximated by a sufficiently large dataset, under the commonly held assumption that it is independently and identically distributed according to Pr(e, f), as

  \hat{\theta} = \arg\max_{\theta} \sum_{n} \log p_\theta(e_n, f_n)    (4)

Therefore, by perturbing the sample {(e_n, f_n)} with an active strategy, we are, in fact, modifying the approximation to Eq. (3) and learning a different underlying probability distribution.

In this section a statistical framework is proposed to build systems with BSS while avoiding the sample bias. The proposed approach relies on conserving the probability distribution of the task domain by wisely selecting the bilingual pairs to be used from the whole pool of sentences. Hence, it is mandatory to exclude sentences from the pool that distort the actual probability. In order to approximate the probability distribution, we assume that a small but representative corpus is available from the task domain. This corpus, referred to henceforth as the in-domain corpus, provides a way to build an initial model which approximates the actual probability of the system. The pool of sentences will be oppositely denoted as the out-of-domain corpus.

The actual probability of the task domain, the so-called in-domain probability, is approximated with the following model

  p(e, f, |e|, |f|) = p(e, f \mid |e|, |f|)\; p(|e|, |f|)    (5)

where p(|e|, |f|) denotes the in-domain length probability, and p(e, f | |e|, |f|) the in-domain bilingual probability.

The length probability is estimated by MLE as

  p(|e|, |f|) = \frac{N(|e| + |f|)}{N}    (6)

where N(|e|+|f|) is the number of bilingual pairs in the in-domain corpus such that their lengths sum up to |e|+|f|, and N denotes the total number of sentences. Note that no distinction is made between source and target lengths since the model is intended for sampling.

The complexity of the in-domain bilingual probability distribution, p(e, f | |e|, |f|), requires a more sophisticated approximation

  p(e, f \mid |e|, |f|) = \frac{\exp\bigl(\sum_k \lambda_k f_k(e, f)\bigr)}{Z}    (7)

where Z is a normalisation constant, and f_k(...) and \lambda_k are the features of the model and their respective parametric weights. Specifically, four logarithmic features were considered for this sampling technique: a direct and an inverse IBM model 4 (Brown et al., 1994), and both source and target 5-gram language models. All feature models are estimated on the in-domain corpus with standard techniques (Brown et al., 1994; Stolcke, 2002). As a first approach, the parameters of the log-linear model in Eq. (7), \lambda_k, were uniformly fixed to 1.
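Putting Eqs. (5)-(7) together, the sampling scheme can be sketched as follows; the bucket-wise selection procedure itself is summarised in the bullet points below. The feature functions are passed in as callables because their estimation relies on external tools, the roulette-wheel draw without replacement is only one possible realisation, and all names are ours.

import math
import random
from collections import defaultdict

def sample_out_of_domain(in_domain, pool, features, n_samples, rng=random):
    """Draw bilingual pairs from the pool following the in-domain distribution.

    in_domain, pool : lists of (e, f) pairs (target sentence, source sentence).
    features        : callables (e, f) -> log score, e.g. the four log-features
                      mentioned above, estimated with external tools.
    n_samples       : total number of pairs to draw (without replacement).
    """
    # Eq. (6): in-domain distribution over combined sentence lengths.
    length_counts = defaultdict(int)
    for e, f in in_domain:
        length_counts[len(e.split()) + len(f.split())] += 1
    quota = {length: round(n_samples * count / len(in_domain))
             for length, count in length_counts.items()}

    # Group the pool into the same length buckets.
    buckets = defaultdict(list)
    for e, f in pool:
        buckets[len(e.split()) + len(f.split())].append((e, f))

    selected = []
    for length, k in quota.items():
        # Eq. (7) with uniform weights: unnormalised score per candidate pair.
        scored = [(math.exp(sum(feat(e, f) for feat in features)), (e, f))
                  for e, f in buckets.get(length, [])]
        for _ in range(min(k, len(scored))):
            # Roulette-wheel draw proportional to the score, without replacement.
            total = sum(weight for weight, _ in scored)
            r = rng.random() * total
            acc = 0.0
            for i, (weight, pair) in enumerate(scored):
                acc += weight
                if acc >= r:
                    selected.append(pair)
                    del scored[i]
                    break
    return selected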

Once we have an appropriate model for the in-domain probability distribution, the proposed method randomly samples a given number of bilingual pairs from the out-of-domain corpora (the pool of sentences). The process of extending the in-domain corpus with additional bilingual pairs from the out-of-domain corpus is summarised as follows:

- Decide, according to the in-domain length probability in Eq. (6), how many samples should be drawn for each length, i.e. divide the number of sentences to add into length-dependent buckets.

- Randomly draw the number of samples specified in each bucket according to the in-domain bilingual probability in Eq. (7), among all the bilingual sentences that share the current bucket length.

Although the pool of sentences is typically large, it is not large enough to gather a significant amount of probability mass. Consequently, a small set of sentences accumulates most of the probability mass and tends to be selected multiple times. To avoid this awkward and undesired behaviour, the sampling is performed without replacement.

4 Infrequent n-gram Recovery

Another criterion when confronting the BSS task is to increase the informativeness of the training set. Thus, it seems important to choose sentences that provide information not seen in the training corpus. Note that this criterion is sometimes opposed to the one presented in Section 3.

The performance of phrase-based machine translation systems strongly relies on the quality of the phrases extracted from the training samples. In most of the cases, the inference of such phrases or rules is based on word alignments, which cannot be computed accurately for units appearing rarely in the training corpus. The extreme case is that of out-of-vocabulary words: words that do not appear in the training set cannot be translated. Moreover, this problem can be extended to sequences of words (n-grams). Consider a 2-gram f_i f_j appearing few or no times in the training set. Although f_i and f_j may appear separately in the training set, the system might not be able to infer the translation of the 2-gram f_i f_j, which may be different from the concatenation of the translations of both words separately.

When selecting sentences from the pool it is important to choose sentences that contain n-grams that have never been seen (or have been seen just a few times) in the training set. Such n-grams will be henceforth referred to as infrequent n-grams. An n-gram is considered infrequent when it appears fewer times than an infrequency threshold t. If the source language sentences to be translated are known beforehand, the set of infrequent n-grams can be reduced to those present in such sentences. Then, the technique consists in selecting from the pool those sentences which contain infrequent n-grams present in the source sentences to be translated.

Sentences in the pool are sorted by their infrequency score in order to select the most informative first. Let X be the set of n-grams that appear in the sentences to be translated and w one of them; C(w) the counts of w in the source language training set; and N(w) the counts of w in the source sentence f to be scored. The infrequency score of f is:

  i(f) = \sum_{w \in X} \min(1, N(w)) \cdot \max(0, t - C(w))    (8)

In order to avoid giving a high score to noisy sentences with a lot of occurrences of the same infrequent n-gram, only one occurrence of each n-gram is taken into account to compute the score. In addition, the score gives more importance to the n-grams with the lowest counts in the training set. Although it would be possible to simply select the highest-scored sentences, we updated the scores each time a sentence was selected. This decision was taken to avoid the selection of too many sentences with the same infrequent n-gram. First, sentences in the pool are scored using Equation (8). Then, in each iteration, the sentence f with the highest score is selected, added to the training set and removed from the pool. In addition, the counts of the n-grams present in f are updated and, hence, the scores of the rest of the sentences in the pool. Since rescoring the whole pool would incur a very high computational cost, a suboptimal search strategy was followed, in which the search was constrained to a given set of highest scoring sentences. Here it was set to one million.

         t = 1          t = 10         t = 25
         tr     all     tr     all     tr     all
1-gr     11.6   1.3     40.5   3.5     59.9   5.1
2-gr     38     9.8     73.2   21.3    84.9   27.9
3-gr     66.8   33.5    91.1   55.7    96.4   64.9
4-gr     87.1   65.8    98.2   85.5    99.4   90.7

Table 1: Percentage of infrequent n-grams in the TED test set when considering only the TED training set (tr), and when adding the out-of-domain pool (all), for different infrequency thresholds t.

Table 1 shows the percentage of source language infrequent n-grams for the test set of a relatively small corpus such as the TED corpus (for details see Section 5) when considering just the in-domain training set (about 40K sentences) and the same percentage when adding the larger out-of-domain corpora. The percentages in the table have been computed separately for different values of the threshold t and for n-grams of order from 1 to 4. Note that the reduction in the number of infrequent n-grams is very high for the 1-grams but decreases progressively when considering n-grams of higher order. This indicates that the infrequent n-gram recovery technique should be very effective for lower order n-grams, but might have less effect for higher order n-grams. Therefore, and in order to lower the computational cost involved, the experiments carried out for this paper were performed considering only infrequent 1-grams, 2-grams and 3-grams.

5 Experiments

In this section, we first describe the experimental framework employed to assess the performance of the BSS techniques described. Then, results for the probabilistic sentence selection strategy are shown, followed by results obtained with the infrequent n-grams technique. Some example translations are shown and, finally, we also report experiments using the infrequent n-grams technique in oracle mode, in order to establish the potential improvement for such a technique and for BSS in general.

5.1 Experimental Setup

All experiments were carried out using the open-source SMT toolkit Moses (Koehn et al., 2007), in its standard non-monotonic configuration. The phrase tables were generated by means of word alignments obtained with GIZA++ (Och and Ney, 2003). The language model used was a 5-gram with modified Kneser-Ney smoothing (Kneser and Ney, 1995), built with the SRILM toolkit (Stolcke, 2002). The log-linear combination weights in Eq. (1) were optimised using Minimum Error Rate Training (Och and Ney, 2002) on the corresponding development sets.

Experiments were carried out on two corpora: TED (Paul et al., 2010) and News Commentary (NC) (Callison-Burch et al., 2010). TED is an English-French corpus composed of subtitles for a collection of public speeches on a variety of topics. The same partitions as in the IWSLT 2010 evaluation task (Paul et al., 2010) have been used. Subtitles have been concatenated into complete sentences. NC is a slightly larger English-French corpus in the news domain. Main figures of both corpora are shown in Tables 2 and 3. As for the pool of sentences, three large corpora have been used: Europarl (Euro), United Nations (UN) and Gigaword (Giga), in the partition established for the 2010 workshop on SMT of the ACL (Callison-Burch et al., 2010). Sentences of length greater than 50 have been pruned. Table 4 shows the main figures of the tokenised and lowercased corpora.

Subset   Language   |S|      |W|      |V|
train    English    47.5K    747K     24.6K
         French              793K     31.7K
dev      English    571      9.2K     1.9K
         French              10.3K    2.2K
test     English    641      12.6K    2.4K
         French              12.8K    2.7K

Table 2: TED corpus main figures. K denotes thousands of elements. |S| stands for number of sentences, |W| for number of running words, and |V| for vocabulary size.

Subset    Language   |S|      |W|      |V|
train     English    77.2K    1.71M    29.9K
          French              1.99M    48K
dev 08    English    2.1K     49.8K    8.7K
          French              55.4K    7.7K
test 09   English    2.5K     65.6K    8.9K
          French              72.5K    10.6K
test 10   English    2.5K     62K      8.9K
          French              70.5K    10.3K

Table 3: News Commentary corpus main figures.

When translating between some language pairs, there are words that remain invariable, like for example numbers or punctuation marks in the case of European languages. In fact, an easy and
156
Corpus Language |S| |W | |V | Europarl
English 25.6M 81K Gigaword
Euro 1.25M 0.03

Relative frequency
French 28.2M 101K UN
TED
English 94.4M 302K NC
UN 5M
French 107M 283K 0.02
English 303M 1.6M
Giga 15.5M
French 361M 1.6M
0.01
Table 4: Figures of the corpora used as sentence pool.
M stands for millions of elements.
0
effective technique that is commonly used is to re- 0 10 20 30 40 50 60 70 80 90 100
produce out-of-vocabulary words from the source Combined sentence length
sentence in the target hypothesis. However, in- Figure 2: Combined length relative frequency.
variable n-grams are usually infrequent as well, 5.2 Results for Probabilistic Sampling
which implies that the infrequent n-grams tech-
In addition to the probabilistic sampling tech-
nique would select sentences containing such n-
nique proposed in Section 3, we also analysed the
grams, even though they do not provide further
effect of sampling only according to the combined
information. As a first approach, we exclude n-
source-reference length, with the purpose of es-
grams without any letter.
tablishing whether potential improvements were
Baseline experiments have been carried out for
only due to the length component, or rather to the
TED and NC corpora using the corresponding
complete sampling model. Results for the 2009
training set. For comparison purposes, we also
test set are shown in Figure 1. Several things
included results for a purely random sentence se-
should be noted:
lection without replacement. In the plots, each
point corresponding to random selection represent Performing sentence selection only according
the average of 10 repetitions. Experiments using to sentence lengths does not achieve better
all data are also reported, although a 64GB ma- performance than random selection.
chine was necessary, even with binarized phrase Selecting sentences according to probabilis-
and distortion tables. tic sampling is able to improve random se-
Experiments were conducted by selecting a lection in the case of the TED corpus, but
fixed amount of sentences according to each one is not able to do so in the case of the NC
of the techniques described above. Then, these corpus. Significance tests for the 500K case
sentences were included into the training data and reported that the differences were significant
subsequent SMT systems were built for translat- in the case of the TED corpus, but not in the
ing the test set. case of the NC corpus.
Results are shown in terms of BLEU (Papineni In the case of the TED corpus, the perfor-
et al., 2001), which is an accuracy metric that mance achieved with the system built by
measures n-gram precision, with a penalty for sampling 500K sentences is only 0.5 BLEU
sentences that are too short. Although it could points below the performance achieved by
be argued that improvements obtained might be the system built with all the data available.
due to a side effect of the brevity penalty, this The explanation to the fact that probabilistic
was not found to be true: the BSS techniques (in- sampling is able to improve over random sam-
cluding random) and considering all data yielded pling only in the case of the TED corpus, but not
very similar brevity penalties (0.005), within in the case of NC, relies in the nature of the cor-
each corpus. In addition, TER scores (Snover et pora. Although both of them belong to a very
al., 2006) were also computed, but are omitted generic domain, their characteristics are very dif-
for clarity purposes and since they were found to ferent. In fact, the NC data is very similar to the
be coherent with BLEU. TER is an error metric sentences in the pool, but, in contrast, the sen-
that computes the minimum number of edits re- tences present in the TED corpus have a much
quired to modify the system hypotheses so that more different structure. This difference is illus-
they match the references translations. trated in Figure 2, where the relative frequency of

157
TED corpus NC corpus

24
in domain length in domain length
all sampling 22 all sampling
random random
23
BLEU

BLEU
21

22 20

19
21
0 100K 200K 300K 400K 500K 0 100K 200K 300K 400K 500K
Number of sentences added Number of sentences added

Figure 1: Effect of adding sentences over the BLEU score using the probabilistic sampling, length sampling and
random selection techniques for the two corpora, TED and News Commentary. Horizontal lines represent the
scores when using just the in domain training set and all the data available.
TED corpus NC corpus
26
all in domain random 23
t=10 t=25
25
22
BLEU

BLEU

24
21
23 all
in domain
20 random
22
t=10
19 t=25
21
0 50k 100k 200k 0 50k 100k 200k
Number of sentences added Number of sentences added

Figure 3: Effect of adding sentences over the BLEU score using the infrequent n-grams (with different thresh-
olds) and random selection techniques for the two corpora, TED and News Commentary. Horizontal lines repre-
sent the scores when using just the in domain training set and all the data available.

each combined sentence length is shown. In this sented similar curves, although less sentences can
plot, it stands out clearly that the TED corpus has be selected and hence improvements obtained are
a very different length distribution than the other slightly lower. Several conclusions can be drawn:
four corpora considered, whereas the NC corpus The translation quality provided by the in-
presents a very similar distribution. This implies frequent n-grams technique is significantly
that, when considering TED, an intelligent data better than the results achieved with random
selection strategy will have better chances to im- selection, comparing similar amount of sen-
prove random selection than in the case of NC. tences. Specifically, the improvements ob-
5.3 Results for Infrequent n-grams Recovery tained are in the range of 3 BLEU points.
Results for the TED corpus are more irreg-
Figure 3 shows the effect of adding sentences us-
ular. The best performance is achieved for
ing the infrequent n-grams and the random se-
t = 25 and 50K sentences added. In NC, the
lection techniques on the 2009 test set. Once
best result is for t = 10 and 112K.
all the infrequent n-grams have been covered
t times, the infrequency score for all the sen- Selecting sentences with the infrequent n-
tences remaining in the pool is 0, and none of grams technique provides better results than
them can be selected. Hence, the number of including all the available data. While using
sentences that can be selected for each t is lim- less than 0.5% of the data, improvements be-
ited. Although for clarity we only show results tween 0.5 and 1 BLEU points are achieved.
for t = {10, 25}, experiments have also been car- When looking at Figure 3, one might suspect
ried out for t = {1, 5, 10, 25}. Such results pre- that t needs to be set specifically for a given test

158
set, and that results from one set are not to be ex- Src the budget has also been criticised by klaus .
trapolated to other test sets. For this reason, we Bsl le budget a egalement ete criticised par m. klaus .
Rdm le budget a egalement ete critiquees par m. klaus .
selected the best configuration in Figure 3 and
PS le budget a egalement ete critiquee par klaus .
used it to build a new system for translating the All le budget a egalement ete critique par klaus .
unseen NC 2010 test set. Such experiment, with Infr le budget a egalement ete critique par klaus .
t = 10 and including all sentences with score Ref klaus critique egalement le budget .
greater than 0 ( 110K), is shown in Table 5 and Src and one has come from music .
evidences that improvements are actually coher- Bsl et un a de la musique .
ent among different test sets. Rdm et on vient de musique .
PS et on a viennent de musique .
technique BLEU TER #phrases All et de la musique .
in-domain 19.0 65.2 5.1M Infr et un est venu de la musique .
Ref et un vient du monde de la musique .
all data 22.7 60.8 1236M
infreq. t = 10 23.6 59.2 16.5M Figure 4: Examples of two translations for each of the
SMT systems built: Src (source sentence), Bsl (base-
Table 5: Effect of the infrequent n-gram recovery tech- line), Rdm (random selection), PS (probabilistic sam-
nique for an unseen test set, when setting t = 10 and pling), All (all the data available), Infr (Infrequent n-
number of phrases (parameters) of the models. grams) and Ref (reference).

to translate criticised, which is considered out-of-


5.5 Example Translations
Example translations are shown in Figure 4. In the first example, the baseline system is not able to translate "criticised", which is considered out-of-vocabulary. Even though random selection happens to avoid this problem, it does not manage to translate the word correctly, introducing an agreement error. A similar thing happens with probabilistic sampling, where a grammatical error is also present; only Infr and All produce a correct translation. This is not coincidental: by ensuring that a given n-gram appears at least a certain number of times t, the odds of including all possible translations of "criticised" increase significantly. Note that, even if the Infr translation is different from the reference, it is equally correct. In the second example, the baseline translation is grammatically acceptable but has a different meaning (something like "and one has music"). Similarly, when including all the data, the translation obtained by the system means "and some music". In this case, both random and probabilistic selection produce grammatically incorrect sentences, and only Infr provides a correct translation, although a rather literal one that differs from the reference.

6 Discussion

Bilingual sentence selection (BSS) might be understood to be closely related to adaptation, even though the two paradigms tackle problems which are, in essence, different. The goal of an adaptation technique is to adapt model parameters, which have been estimated on a large out-of-domain (or generic) data set, so that they are
best suited for dealing with a domain-specific test set. This adaptation is achieved by means of a (potentially small) adaptation set, which belongs to the same domain as the test data. In contrast, BSS tackles the problem of how to select samples from a large pool of training data, regardless of whether that pool is in-domain or out-of-domain. Hence, in one case we can assume we have a fairly well estimated translation model, which is to be adapted, whereas in BSS we still have full control over the estimation of that model and need not aim at a specific domain, although it might often be so.

BSS is related to instance weighting (Jiang and Zhai, 2007; Foster et al., 2010). Adaptation and BSS can be considered orthogonal (yet complementary) problems under the instance weighting paradigm. In that case, instance weighting can be considered to span a complete paradigmatic space between the two: at one end there is sample selection (BSS for SMT), while at the other end there is adaptation. For instance, it is quite common to approach the adaptation problem by extracting different phrase tables from different corpora and then interpolating those tables. This technique could also be applied to boost the performance of the system built by means of BSS. However, this is left as future work.

We thoroughly analysed two BSS approaches that obtain competitive results while using a small fraction of the training data, although there is still much to be gained. For instance, oracle results have also been reported in this work, yielding improvements of up to 10 BLEU points. Even though the use of an oracle typically implies that the results obtained are not realistic, recall that the proposed oracle is special, in the sense that it only uses the reference sentences for the specific purpose of selecting training samples, but the references are not included in the training data as such. This is useful for assessing the potential behind BSS: ideally, if we were able to design a BSS strategy that, without using the references, would select exactly those training samples, we would be boosting system performance by 10 BLEU points. This re-states BSS as a compelling technique that has not yet received the attention it deserves.

BSS is not aimed at optimising computational requirements, but does so as a byproduct. This may seem a minor point, but it allows running more experiments with the same resources, using larger corpora, or even more complex techniques, such as synchronous grammars or hierarchical models. For instance, the infrequent n-grams technique, using just a small fraction of the corpus (only 0.5%), is able to outperform a system trained with all the data by 0.9 BLEU points and the random baseline by 3 points; this baseline has been shown by other works to be difficult to beat.

Preliminary experiments were performed in order to analyse the perplexity of the references, the number of out-of-vocabulary words (OoVs) and the ratio of target-source phrases. These experiments revealed that the improvements obtained are largely correlated with a decrease in perplexity and in the number of OoVs. On the one hand, reducing the amount of OoVs was mirrored by an important improvement in BLEU when the amount of additional data was small, and also entailed a decrease in perplexity. However, a reduction in perplexity by itself did not always imply significant improvements. Moreover, no real conclusion could be drawn from the analysis of the target-source phrase ratio. Hence, we understand that the improvements obtained come mainly from a more specialised estimation of the model parameters. However, further experiments should still be conducted in order to verify this conclusion.

Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement nr. 287755. This work was also supported by the Spanish MEC/MICINN under the MIPRCV Consolider Ingenio 2010 program (CSD2007-00018) and the iTrans2 (TIN2009-14511) project. Also supported by the Spanish MITyC under the erudito.com (TSI-020110-2009-439) project, and by Instituto Tecnologico de Leon, DGEST-PROMEP and CONACYT, Mexico.
References

Sankaranarayanan Ananthakrishnan, Rohit Prasad, David Stallard, and Prem Natarajan. 2010. Discriminative sample selection for statistical machine translation. In Proc. of the EMNLP, pages 626-635, Cambridge, MA, October.
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proc. of the EMNLP, pages 355-362.
Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1994. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proc. of the WSMT, pages 1-28, Athens, Greece, March.
Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar Zaidan. 2010. Findings of the 2010 joint Workshop on Statistical Machine Translation and Metrics for Machine Translation. In Proc. of the MATR(ACL), pages 17-53, Uppsala, Sweden, July.
Sanjoy Dasgupta. 2009. The two faces of active learning. In Proc. of the Twentieth Conference on Algorithmic Learning Theory, page 1, Porto, Portugal, October.
George Foster, Cyril Goutte, and Roland Kuhn. 2010. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proc. of the EMNLP, pages 451-459, Cambridge, MA, October.
Guillem Gasco, Vicent Alabau, Jesus Andres-Ferrer, Jesus Gonzalez-Rubio, Martha-Alicia Rocha, German Sanchis-Trilles, Francisco Casacuberta, Jorge Gonzalez, and Joan-Andreu Sanchez. 2010. ITI-UPV system description for IWSLT 2010. In Proc. of the IWSLT 2010, Paris, France, December.
Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proc. of HLT/NAACL'09, pages 415-423, Morristown, NJ, USA.
Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proc. of ACL'07, pages 264-271.
Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. Proc. of ICASSP, II:181-184, May.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL, pages 177-180.
Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Ann Irvine, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Ziyuan Wang, Jonathan Weese, and Omar Zaidan. 2010. Joshua 2.0: A toolkit for parsing-based machine translation with syntax, semirings, discriminative training and other goodies. In Proc. of the MATR(ACL), pages 139-143, Uppsala, Sweden, July.
Yajuan Lu, Jin Huang, and Qun Liu. 2007. Improving statistical machine translation performance by training data selection and optimization. In Proc. of the EMNLP-CoNLL, pages 343-350, Prague, Czech Republic, June.
Spyros Matsoukas, Antti-Veikko I. Rosti, and Bing Zhang. 2009. Discriminative corpus weight estimation for machine translation. In Proc. of the EMNLP, pages 708-717, Singapore, August.
Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In ACL (Short Papers), pages 220-224.
Franz J. Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. of ACL, pages 295-302.
Franz J. Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.
Kishore Papineni, Salim Roukos, and Todd Ward. 1998. Maximum likelihood and discriminative training of direct translation models. In Proc. of ICASSP'98, pages 189-192.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. Bleu: A method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022).
Michael Paul, Marcello Federico, and Sebastian Stuker. 2010. Overview of the IWSLT 2010 evaluation campaign. In Proc. of the IWSLT 2010, Paris, France, December.
Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proc. of AMTA'06.
Andreas Stolcke. 2002. SRILM, an extensible language modeling toolkit. In Proc. of ICSLP.
Elia Yuste, Manuel Herranz, Antonio Lagarda, Lionel Tarazon, Isaias Sanchez-Cortina, and Francisco Casacuberta. 2010. PangeaMT: putting open standards to work... well. In Proc. of the AMTA 2010, Denver, CO, USA, November.
Recall-Oriented Learning of Named Entities in Arabic Wikipedia

Behrang Mohit   Nathan Schneider   Rishav Bhowmick   Kemal Oflazer   Noah A. Smith
School of Computer Science, Carnegie Mellon University
P.O. Box 24866, Doha, Qatar / Pittsburgh, PA 15213, USA
{behrang@,nschneid@cs.,rishavb@qatar.,ko@cs.,nasmith@cs.}cmu.edu

Abstract

We consider the problem of NER in Arabic Wikipedia, a semisupervised domain adaptation setting for which we have no labeled training data in the target domain. To facilitate evaluation, we obtain annotations for articles in four topical groups, allowing annotators to identify domain-specific entity types in addition to standard categories. Standard supervised learning on newswire text leads to poor target-domain recall. We train a sequence model and show that a simple modification to the online learner, a loss function encouraging it to "arrogantly" favor recall over precision, substantially improves recall and F1. We then adapt our model with self-training on unlabeled target-domain data; enforcing the same recall-oriented bias in the self-training stage yields marginal gains.¹

    ¹ The annotated dataset and a supplementary document with additional details of this work can be found at: http://www.ark.cs.cmu.edu/AQMAR

1 Introduction

This paper considers named entity recognition (NER) in text that is different from most past research on NER. Specifically, we consider Arabic Wikipedia articles with diverse topics beyond the commonly-used news domain. These data challenge past approaches in two ways:

First, Arabic is a morphologically rich language (Habash, 2010). Named entities are referenced using complex syntactic constructions (cf. English NEs, which are primarily sequences of proper nouns). The Arabic script suppresses most vowels, increasing lexical ambiguity, and lacks capitalization, a key clue for English NER.

Second, much research has focused on the use of news text for system building and evaluation. Wikipedia articles are not news, belonging instead to a wide range of domains that are not clearly delineated. One hallmark of this divergence between Wikipedia and the news domain is a difference in the distributions of named entities. Indeed, the classic named entity types (person, organization, location) may not be the most apt for articles in other domains (e.g., scientific or social topics). On the other hand, Wikipedia is a large dataset, inviting semisupervised approaches.

In this paper, we describe advances on the problem of NER in Arabic Wikipedia. The techniques are general and make use of well-understood building blocks. Our contributions are:
- A small corpus of articles annotated in a new scheme that provides more freedom for annotators to adapt NE analysis to new domains;
- An "arrogant" learning approach designed to boost recall in supervised training as well as self-training; and
- An empirical evaluation of this technique as applied to a well-established discriminative NER model and feature set.

Experiments show consistent gains on the challenging problem of identifying named entities in Arabic Wikipedia text.

2 Arabic Wikipedia NE Annotation

Most of the effort in NER has been focused around a small set of domains and general-purpose entity classes relevant to those domains, especially the categories PER(SON), ORG(ANIZATION), and LOC(ATION) (POL), which are highly prominent in news text. Arabic is no exception: the publicly available NER corpora, ACE (Walker et al., 2006), ANER (Benajiba et al., 2008), and OntoNotes (Hovy et al., 2006), are all in the news domain.²

    ² OntoNotes contains news-related text. ACE includes some text from blogs. In addition to the POL classes, both corpora include additional NE classes such as facility, event, product, vehicle, etc. These entities are infrequent and may not be comprehensive enough to cover the larger set of possible NEs (Sekine et al., 2002). Nezda et al. (2006) annotated and evaluated an Arabic NE corpus with an extended set of 18 classes (including temporal and numeric entities); this corpus has not been released publicly.
          History               Science            Sports                   Technology
  dev:    Damascus              Atom               Raul Gonzales            Linux
          Imam Hussein Shrine   Nuclear power      Real Madrid              Solaris
  test:   Crusades              Enrico Fermi       2004 Summer Olympics     Computer
          Islamic Golden Age    Light              Christiano Ronaldo       Computer Software
          Islamic History       Periodic Table     Football                 Internet
          Ibn Tolun Mosque      Physics            Portugal football team   Richard Stallman
          Ummaya Mosque         Muhammad al-Razi   FIFA World Cup           X Window System

  Sample NEs with standard and article-specific classes: Claudio Filippone (PER) [Arabic]; Linux (SOFTWARE) [Arabic]; Spanish League (CHAMPIONSHIPS) [Arabic]; proton (PARTICLE) [Arabic]; nuclear radiation (GENERIC-MISC) [Arabic]; Real Zaragoza (ORG) [Arabic]

Table 1: Translated titles of Arabic Wikipedia articles in our development and test sets, and some NEs with standard and article-specific classes. Additionally, Prussia and Amman were reserved for training annotators, and Gulf War for estimating inter-annotator agreement.

However, appropriate entity classes will vary widely by domain; occurrence rates for entity classes are quite different in news text vs. Wikipedia, for instance (Balasuriya et al., 2009). This is abundantly clear in technical and scientific discourse, where much of the terminology is domain-specific, but it holds elsewhere. Non-POL entities in the history domain, for instance, include important events (wars, famines) and cultural movements (romanticism). Ignoring such domain-critical entities likely limits the usefulness of the NE analysis.

Recognizing this limitation, some work on NER has sought to codify more robust inventories of general-purpose entity types (Sekine et al., 2002; Weischedel and Brunstein, 2005; Grouin et al., 2011) or to enumerate domain-specific types (Settles, 2004; Yao et al., 2003). Coarse, general-purpose categories have also been used for semantic tagging of nouns and verbs (Ciaramita and Johnson, 2003). Yet as the number of classes or domains grows, rigorously documenting and organizing the classes, even for a single language, requires intensive effort. Ideally, an NER system would refine the traditional classes (Hovy et al., 2011) or identify new entity classes when they arise in new domains, adapting to new data. For this reason, we believe it is valuable to consider NER systems that identify (but do not necessarily label) entity mentions, and also to consider annotation schemes that allow annotators more freedom in defining entity classes.

Our aim in creating an annotated dataset is to provide a testbed for evaluation of new NER models. We will use these data as development and testing examples, but not as training data. In section 4 we will discuss our semisupervised approach to learning, which leverages ACE and ANER data as an annotated training corpus.

2.1 Annotation Strategy

We conducted a small annotation project on Arabic Wikipedia articles. Two college-educated native Arabic speakers annotated about 3,000 sentences from 31 articles. We identified four topical areas of interest (history, technology, science, and sports) and browsed these topics until we had found 31 articles that we deemed satisfactory on the basis of length (at least 1,000 words), cross-lingual linkages (associated articles in English, German, and Chinese³), and subjective judgments of quality. The list of these articles along with sample NEs is presented in Table 1. These articles were then preprocessed to extract the main article text (eliminating tables, lists, info-boxes, captions, etc.) for annotation.

    ³ These three languages have the most articles on Wikipedia. Associated articles here are those that have been manually hyperlinked from the Arabic page as cross-lingual correspondences. They are not translations, but if the associations are accurate, these articles should be topically similar to the Arabic page that links to them.

Our approach follows ACE guidelines (LDC, 2005) in identifying NE boundaries and choosing POL tags. In addition to this traditional form of annotation, annotators were encouraged to articulate one to three salient, article-specific entity categories per article. For example, names of particles (e.g., proton) are highly salient in the Atom article. Annotators were asked to read the entire article first, and then to decide which non-traditional classes of entities would be important in the context of the article. In some cases, annotators reported using heuristics (such as being proper
nouns or having an English translation which is conventionally capitalized) to help guide their determination of non-canonical entities and entity classes. Annotators produced written descriptions of their classes, including example instances.

This scheme was chosen for its flexibility: in contrast to a scenario with a fixed ontology, annotators required minimal training beyond the POL conventions, and did not have to worry about delineating custom categories precisely enough that they would extend straightforwardly to other topics or domains. Of course, we expect inter-annotator variability to be greater for these open-ended classification criteria.

2.2 Annotation Quality Evaluation

During annotation, two articles (Prussia and Amman) were reserved for training annotators on the task. Once they were accustomed to annotation, both independently annotated a third article. We used this 4,750-word article (Gulf War, [Arabic]) to measure inter-annotator agreement. Table 2 provides scores for token-level agreement measures and entity-level F1 between the two annotated versions of the article.⁴

    ⁴ The position and boundary measures ignore the distinctions between the POLM classes. To avoid artificial inflation of the token and token position agreement rates, we exclude the 81% of tokens tagged by both annotators as not belonging to an entity.

  Token position agreement rate   92.6%   Cohen's kappa: 0.86
  Token agreement rate            88.3%   Cohen's kappa: 0.86
  Token F1 between annotators     91.0%
  Entity boundary match F1        94.0%
  Entity category match F1        87.4%

Table 2: Inter-annotator agreement measurements.

These measures indicate strong agreement for locating and categorizing NEs both at the token and chunk levels. Closer examination of agreement scores shows that the PER and MIS classes have the lowest rates of agreement. That the miscellaneous class, used for infrequent or article-specific NEs, receives poor agreement is unsurprising. The low agreement on the PER class seems to be due to the use of titles and descriptive terms in personal names. Despite explicit guidelines to exclude the titles, annotators disagreed on the inclusion of descriptors that disambiguate the NE (e.g., "the father" in [Arabic]: "George Bush, the father").
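The agreement figures in Table 2 are standard quantities. As an illustration only (this is our own sketch, not the scoring tooling used for the paper), the raw token agreement rate and Cohen's kappa over two annotators' parallel label sequences can be computed as below, excluding tokens that both annotators left outside any entity, as described in footnote 4.

```python
from collections import Counter

def token_agreement(labels_a, labels_b, exclude_both="O"):
    """Raw agreement and Cohen's kappa for two parallel label sequences.

    Token positions labeled `exclude_both` by *both* annotators are
    dropped, mirroring the exclusion of non-entity tokens in the text.
    """
    pairs = [(a, b) for a, b in zip(labels_a, labels_b)
             if not (a == exclude_both and b == exclude_both)]
    n = len(pairs)
    observed = sum(a == b for a, b in pairs) / n
    # Chance agreement from each annotator's marginal label distribution.
    dist_a = Counter(a for a, _ in pairs)
    dist_b = Counter(b for _, b in pairs)
    expected = sum(dist_a[label] * dist_b[label] for label in dist_a) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```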
2.3 Validating Category Intuitions

To investigate the variability between annotators with respect to custom category intuitions, we asked our two annotators to independently read 10 of the articles in the data (scattered across our four focus domains) and suggest up to 3 custom categories for each. We assigned short names to these suggestions, seen in Table 3. In 13 cases, both annotators suggested a category for an article that was essentially the same; three such categories spanned multiple articles. In three cases a category was suggested by only one annotator.⁵ Thus, we see that our annotators were generally, but not entirely, consistent with each other in their creation of custom categories. Further, almost all of our article-specific categories correspond to classes in the extended NE taxonomy of Sekine et al. (2002), which speaks to the reasonableness of both sets of categories, and by extension, of our open-ended annotation process.

    ⁵ When it came to tagging NEs, one of the two annotators was assigned to each article. Custom categories only suggested by the other annotator were ignored.

  History (Gulf War, Prussia, Damascus, Crusades): WAR, CONFLICT
  Science (Atom, Periodic table): THEORY, CHEMICAL, NAME ROMAN, PARTICLE
  Sports (Football, Raul Gonzales): SPORT, CHAMPIONSHIP, AWARD, NAME ROMAN
  Technology (Computer, Richard Stallman): COMPUTER VARIETY, SOFTWARE, COMPONENT

Table 3: Custom NE categories suggested by one or both annotators for 10 articles. Article titles are translated from Arabic. Annotators were not given a predetermined set of possible categories; rather, category matches between annotators were determined by post hoc analysis. NAME ROMAN indicates an NE rendered in Roman characters.

Our annotation of named entities outside of the traditional POL classes creates a useful resource for entity detection and recognition in new domains. Even the ability to detect non-canonical types of NEs should help applications such as QA and MT (Toral et al., 2005; Babych and Hartley, 2003). Possible avenues for future work include annotating and projecting non-canonical NEs from English articles to their Arabic counterparts (Hassan et al., 2007), automatically clustering non-canonical types of entities into article-specific or cross-article classes (cf. Freitag, 2004), or using non-canonical classes to improve the (author-specified) article categories in Wikipedia.

Hereafter, we merge all article-specific categories with the generic MIS category. The proportion of entity mentions that are tagged as MIS, while varying to a large extent by document, is a major indication of the gulf between the news data (<10%) and the Wikipedia data (53% for the development set, 37% for the test set).

Below, we aim to develop entity detection models that generalize beyond the traditional POL entities. We do not address here the challenges of automatically classifying entities or inferring non-canonical groupings.

3 Data

Table 4 summarizes the various corpora used in this work.⁶ Our NE-annotated Wikipedia subcorpus, described above, consists of several Arabic Wikipedia articles from four focus domains.⁷ We do not use these as supervised training data; they serve only as development and test data. A larger set of Arabic Wikipedia articles, selected on the basis of quality heuristics, serves as unlabeled data for semisupervised learning.

    ⁶ Additional details appear in the supplement.
    ⁷ We downloaded a snapshot of Arabic Wikipedia (http://ar.wikipedia.org) on 8/29/2009 and preprocessed the articles to extract main body text and metadata using the mwlib package for Python (PediaPress, 2010).

                                        words       NEs
  Training
    ACE+ANER                            212,839     15,796
    Wikipedia (unlabeled, 397 docs)     1,110,546
  Development
    ACE                                 7,776       638
    Wikipedia (4 domains, 8 docs)       21,203      2,073
  Test
    ACE                                 7,789       621
    Wikipedia (4 domains, 20 docs)      52,650      3,781

Table 4: Number of words (entity mentions) in data sets.

Our out-of-domain labeled NE data is drawn from the ANER (Benajiba et al., 2007) and ACE-2005 (Walker et al., 2006) newswire corpora. Entity types in this data are POL categories (PER, ORG, LOC) and MIS. Portions of the ACE corpus were held out as development and test data; the remainder is used in training.

4 Models

Our starting point for statistical NER is a feature-based linear model over sequences, trained using the structured perceptron (Collins, 2002).⁸

    ⁸ A more leisurely discussion of the structured perceptron and its connection to empirical risk minimization can be found in the supplementary document.

In addition to lexical and morphological⁹ features known to work well for Arabic NER (Benajiba et al., 2008; Abdul-Hamid and Darwish, 2010), we incorporate some additional features enabled by Wikipedia. We do not employ a gazetteer, as the construction of a broad-domain gazetteer is a significant undertaking orthogonal to the challenges of a new text domain like Wikipedia.¹⁰ A descriptive list of our features is available in the supplementary document.

    ⁹ We obtain morphological analyses from the MADA tool (Habash and Rambow, 2005; Roth et al., 2008).
    ¹⁰ A gazetteer ought to yield further improvements in line with previous findings in NER (Ratinov and Roth, 2009).

We use a first-order structured perceptron; none of our features consider more than a pair of consecutive BIO labels at a time. The model enforces the constraint that NE sequences must begin with B (so the bigram <O, I> is disallowed).

Training this model on ACE and ANER data achieves performance comparable to the state of the art (F1-measure¹¹ above 69%), but fares much worse on our Wikipedia test set (F1-measure around 47%); details are given in section 5.

    ¹¹ Though optimizing NER systems for F1 has been called into question (Manning, 2006), no alternative metric has achieved widespread acceptance in the community.

4.1 Recall-Oriented Perceptron

By augmenting the perceptron's online update with a cost function term, we can incorporate a task-dependent notion of error into the objective, as with structured SVMs (Taskar et al., 2004; Tsochantaridis et al., 2005). Let c(y, y') denote a measure of error when y is the correct label sequence but y' is predicted. For observed sequence x and feature weights (model parameters) w, the structured hinge loss is

  l_hinge(x, y, w) = max_{y'} [ w^T g(x, y') + c(y, y') ] - w^T g(x, y)    (1)

The maximization problem inside the brackets is known as cost-augmented decoding. If c factors similarly to the feature function g(x, y), then we can increase penalties for predicted sequences that have more local mistakes. This raises the learner's awareness about how it will be evaluated. Incorporating cost-augmented decoding into the perceptron leads to this decoding step:

  y_hat = argmax_{y'} [ w^T g(x, y') + c(y, y') ],    (2)

which amounts to performing stochastic subgradient ascent on an objective function with the Eq. 1 loss (Ratliff et al., 2006).

In this framework, cost functions can be formulated to distinguish between different types of errors made during training. For a tag sequence y = <y_1, y_2, ..., y_M>, Gimpel and Smith (2010b) define word-local cost functions that differently penalize precision errors (i.e., y_i = O and y'_i != O for the ith word), recall errors (y_i != O and y'_i = O), and entity class/position errors (other cases where y_i != y'_i). As will be shown below, a key problem in cross-domain NER is poor recall, so we will penalize recall errors more severely:

  c(y, y') = sum_{i=1}^{M} { 0 if y_i = y'_i;  beta if y_i != O and y'_i = O;  1 otherwise }    (3)

for a penalty parameter beta > 1. We call our learner the recall-oriented perceptron (ROP).

We note that Minkov et al. (2006) similarly explored the recall vs. precision tradeoff in NER. Their technique was to directly tune the weight of a single feature, the feature marking O (non-entity tokens): a lower weight for this feature will incur a greater penalty for predicting O. Below we demonstrate that our method, which is less coarse, is more successful in our setting.¹²

    ¹² The distinction between the techniques is that our cost function adjusts the whole model in order to perform better at recall on the training data.

In our experiments we will show that injecting "arrogance" into the learner via the recall-oriented loss function substantially improves recall, especially for non-POL entities (section 5.3).
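To make the decoding step concrete, here is a small sketch of cost-augmented first-order Viterbi decoding with the word-local cost of Eq. 3. This is our own illustration, not the authors' implementation: the tag set is simplified to plain B/I/O (the paper's model uses per-class BIO tags), and the `emit`/`trans` score structures are placeholders for the feature-based scores w^T g. During training, the perceptron update would compare the features of the gold sequence with those of the sequence returned here.

```python
TAGS = ["O", "B", "I"]  # simplified; one B/I pair per entity class in practice
NEG_INF = float("-inf")

def local_cost(gold_tag, tag, beta):
    """Word-local cost from Eq. 3: recall errors cost beta, other errors 1."""
    if tag == gold_tag:
        return 0.0
    if gold_tag != "O" and tag == "O":
        return beta
    return 1.0

def viterbi(emit, trans, gold=None, beta=1.0):
    """Cost-augmented first-order Viterbi decoding (sketch).

    emit[i][t]: local score of tag t at position i.
    trans[p][t]: score of the tag bigram (p, t).
    The <O, I> bigram is disallowed, and sequences may not start with I.
    If `gold` is given, Eq. 3's cost is added to each local score.
    """
    n = len(emit)
    delta = [{} for _ in range(n)]
    back = [{} for _ in range(n)]
    for t in TAGS:
        score = emit[0][t] + (local_cost(gold[0], t, beta) if gold else 0.0)
        delta[0][t] = NEG_INF if t == "I" else score
    for i in range(1, n):
        for t in TAGS:
            best_prev, best = None, NEG_INF
            for p in TAGS:
                if p == "O" and t == "I":  # enforce the BIO constraint
                    continue
                s = delta[i - 1][p] + trans[p][t]
                if s > best:
                    best_prev, best = p, s
            local = emit[i][t] + (local_cost(gold[i], t, beta) if gold else 0.0)
            delta[i][t], back[i][t] = best + local, best_prev
    tag = max(TAGS, key=lambda t: delta[n - 1][t])
    path = [tag]
    for i in range(n - 1, 0, -1):
        tag = back[i][tag]
        path.append(tag)
    return path[::-1]
```

With gold=None this is ordinary Viterbi prediction; with a gold sequence and beta > 1 it prefers outputs that commit recall errors, which is exactly what makes those errors costly to the learner during the update.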
4.2 Self-Training and Semisupervised Learning

As we will show experimentally, the differences between news text and Wikipedia text call for domain adaptation. In the case of Arabic Wikipedia, there is no available labeled training data. Yet the available unlabeled data is vast, so we turn to semisupervised learning.

Here we adapt self-training, a simple technique that leverages a supervised learner (like the perceptron) to perform semisupervised learning (Clark et al., 2003; Mihalcea, 2004; McClosky et al., 2006). In our version, a model is trained on the labeled data, then used to label the unlabeled target data. We iterate between training on the hypothetically-labeled target data plus the original labeled set, and relabeling the target data; see Algorithm 1. Before self-training, we remove sentences hypothesized not to contain any named entity mentions, which we found avoids further encouragement of the model toward low recall.

  Input: labeled data <<x^(n), y^(n)>> for n = 1..N; unlabeled data <x^(j)> for j = 1..J;
         supervised learner L; number of iterations T'
  Output: w
  w <- L(<<x^(n), y^(n)>>_{n=1..N})
  for t = 1 to T' do
      for j = 1 to J do
          y_hat^(j) <- argmax_y w^T g(x^(j), y)
      w <- L(<<x^(n), y^(n)>>_{n=1..N} united with <<x^(j), y_hat^(j)>>_{j=1..J})

Algorithm 1: Self-training.
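A minimal rendering of Algorithm 1 in code might look like the following sketch. It is ours, not the released system: `train` and `decode` are placeholders for the supervised learner and the Viterbi decoder, and the filter for sentences with no predicted entity mentions follows the description above.

```python
def self_train(labeled, unlabeled, train, decode, iterations=1):
    """Self-training in the style of Algorithm 1 (sketch).

    labeled:   list of (sentence, gold_tags) pairs
    unlabeled: list of target-domain sentences
    train:     supervised learner, labeled data -> model
    decode:    (model, sentence) -> predicted tag sequence
    """
    model = train(labeled)
    for _ in range(iterations):
        auto_labeled = []
        for sentence in unlabeled:
            tags = decode(model, sentence)
            # Drop sentences hypothesized to contain no entities, which
            # otherwise reinforce the model's bias toward low recall.
            if any(tag != "O" for tag in tags):
                auto_labeled.append((sentence, tags))
        model = train(labeled + auto_labeled)
    return model
```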
5 Experiments

We investigate two questions in the context of NER for Arabic Wikipedia:
- Loss function: Does integrating a cost function into our learning algorithm, as we have done in the recall-oriented perceptron (section 4.1), improve recall and overall performance on Wikipedia data?
- Semisupervised learning for domain adaptation: Can our models benefit from large amounts of unlabeled Wikipedia data, in addition to the (out-of-domain) labeled data? We experiment with a self-training phase following the fully supervised learning phase.

We report experiments for the possible combinations of the above ideas. These are summarized in Table 5. Note that the recall-oriented perceptron can be used for the supervised learning phase, for the self-training phase, or both. This leaves us with the following combinations:
- reg/none (baseline): regular supervised learner.
- ROP/none: recall-oriented supervised learner.
- reg/reg: standard self-training setup.
- ROP/reg: recall-oriented supervised learner, followed by standard self-training.
- reg/ROP: regular supervised model as the initial labeler for recall-oriented self-training.
- ROP/ROP (the "double ROP" condition): recall-oriented supervised model as the initial labeler for recall-oriented self-training. Note that the two ROPs can use different cost parameters.

For evaluating our models we consider the named entity detection task, i.e., recognizing which spans of words constitute entities. This is measured by per-entity precision, recall, and F1.¹³ To measure statistical significance of differences between models we use Gimpel and Smith's (2010) implementation of the paired bootstrap resampler of Koehn (2004), taking 10,000 samples for each comparison.

    ¹³ Only entity spans that exactly match the gold spans are counted as correct. We calculated these scores with the conlleval.pl script from the CoNLL 2003 shared task.

5.1 Baseline

Our baseline is the perceptron, trained on the POL entity boundaries in the ACE+ANER corpus (reg/none).¹⁴ Development data was used to select the number of iterations (10). We performed 3-fold cross-validation on the ACE data and found wide variance in the in-domain entity detection performance of this model:

             P       R       F1
  fold 1     70.43   63.08   66.55
  fold 2     87.48   81.13   84.18
  fold 3     65.09   51.13   57.27
  average    74.33   65.11   69.33

(Fold 1 corresponds to the ACE test set described in Table 4.) We also trained the model to perform POL detection and classification, achieving nearly identical results in the 3-way cross-validation of ACE data. From these data we conclude that our baseline is on par with the state of the art for Arabic NER on ACE news text (Abdul-Hamid and Darwish, 2010).¹⁵

    ¹⁴ In keeping with prior work, we ignore non-POL categories for the ACE evaluation.
    ¹⁵ Abdul-Hamid and Darwish report as their best result a macroaveraged F1-score of 76. As they do not specify which data they used for their held-out test set, we cannot perform a direct comparison. However, our feature set is nearly a superset of their best feature set, and their result lies well within the range of results seen in our cross-validation folds.

Here is the performance of the baseline entity detection model on our 20-article test set:¹⁶

               P       R       F1
  technology   60.42   20.26   30.35
  science      64.96   25.73   36.86
  history      63.09   35.58   45.50
  sports       71.66   59.94   65.28
  overall      66.30   35.91   46.59

    ¹⁶ Our Wikipedia evaluations use models trained on POLM entity boundaries in ACE. Per-domain and overall scores are microaverages across articles.

Unsurprisingly, performance on Wikipedia data varies widely across article domains and is much lower than in-domain performance. Precision scores fall between 60% and 72% for all domains, but recall in most cases is far worse. Miscellaneous class recall, in particular, suffers badly (under 10%), which partially accounts for the poor recall in science and technology articles (they have by far the highest proportion of MIS entities).

5.2 Self-Training

Following Clark et al. (2003), we applied self-training as described in Algorithm 1, with the perceptron as the supervised learner. Our unlabeled data consists of 397 Arabic Wikipedia articles (1 million words) selected at random from all articles exceeding a simple length threshold (1,000 words); see Table 4. We used only one iteration (T' = 1), as experiments on development data showed no benefit from additional rounds. Several rounds of self-training hurt performance, an effect attested in earlier research (Curran et al., 2007) and sometimes known as semantic drift.

                            SELF-TRAINING
                 none                reg                 ROP
  SUPERVISED     P     R     F1      P     R     F1      P     R     F1
  reg            66.3  35.9  46.59   66.7  35.6  46.41   59.2  40.3  47.97
  ROP            60.9  44.7  51.59   59.8  46.2  52.11   58.0  47.4  52.16

Table 5: Entity detection precision, recall, and F1 for each learning setting, microaveraged across the 24 articles in our Wikipedia test set. Rows differ in the supervised learning condition on the ACE+ANER data (regular vs. recall-oriented perceptron). Columns indicate whether this supervised learning phase was followed by self-training on unlabeled Wikipedia data, and if so which version of the perceptron was used for self-training.

Results are shown in Table 5. We find that standard self-training (the middle column) has very little impact on performance.¹⁷ Why is this the case? We venture that poor baseline recall and the domain variability within Wikipedia are to blame.

    ¹⁷ In neither case does regular self-training produce a significantly different F1 score than no self-training.

5.3 Recall-Oriented Learning

The recall-oriented bias can be introduced in either or both of the stages of our semisupervised learning framework: in the supervised learning phase, modifying the objective of our baseline (section 5.1); and within the self-training algorithm (section 5.2).¹⁸ As noted in section 4.1, the aim of this approach is to discourage recall errors (false negatives), which are the chief difficulty for the news-text-trained model in the new domain. We selected the value of the false positive penalty for cost-augmented decoding, beta, using the development data (Figure 1).

    ¹⁸ Standard Viterbi decoding was used to label the data within the self-training algorithm; note that cost-augmented decoding only makes sense in learning, not as a prediction technique, since it deliberately introduces errors relative to a correct output that must be provided.

Figure 1: Tuning the recall-oriented cost parameter beta for different learning settings. We optimized for development set F1, choosing penalty beta = 200 for recall-oriented supervised learning (in the plot, ROP/*; this is regardless of whether a stage of self-training will follow); beta = 100 for recall-oriented self-training following recall-oriented supervised learning (ROP/ROP); and beta = 3200 for recall-oriented self-training following regular supervised learning (reg/ROP).

The results in Table 5 demonstrate improvements due to the recall-oriented bias in both stages of learning.¹⁹ When used in the supervised phase (bottom left cell), the recall gains are substantial: nearly 9% over the baseline. Integrating this bias within self-training (last column of the table) produces a more modest improvement (less than 3%) relative to the baseline. In both cases, the improvements to recall more than compensate for the amount of degradation to precision. This trend is robust: wherever the recall-oriented perceptron is added, we observe improvements in both recall and F1. Perhaps surprisingly, these gains are somewhat additive: using the ROP in both learning phases gives a small (though not always significant) gain over the alternatives (standard supervised perceptron, no self-training, or self-training with a standard perceptron). In fact, when the standard supervised learner is used, recall-oriented self-training succeeds despite the ineffectiveness of standard self-training.

    ¹⁹ In terms of F1, the worst of the 3 models with the ROP supervised learner significantly outperforms the best model with the regular supervised learner (p < 0.005). The improvements due to self-training are marginal, however: ROP self-training produces a significant gain only following regular supervised learning (p < 0.05).

Performance breakdowns by (gold) class, Figure 2, and domain, Figure 3, further attest to the robustness of the overall results. The most dramatic gains are in miscellaneous class recall: each form of the recall bias produces an improvement, and using this bias in both the supervised and self-training phases is clearly most successful for miscellaneous entities. Correspondingly, the technology and science domains (in which this class dominates: 83% and 61% of mentions, versus 6% and 12% for history and sports, respectively) receive the biggest boost. Still, the gaps between domains are not entirely removed.

            entities   words   baseline recall
  PER       1081       1743    49.95
  ORG       286        637     23.92
  LOC       1019       1413    61.43
  MIS       1395       2176    9.30
  overall   3781       5969    35.91

Figure 2: Recall improvement over baseline in the test set by gold NER category, counts for those categories in the data, and recall scores for our baseline model. Markers in the plot indicate different experimental settings corresponding to cells in Table 5.

Figure 3: Supervised learner precision vs. recall as evaluated on Wikipedia test data in different topical domains. The regular perceptron (baseline model) is contrasted with ROP. No self-training is applied.

Most improvements relate to the reduction of false negatives, which fall into three groups: (a) entities occurring infrequently or partially in the labeled training data (e.g. uranium); (b) domain-specific entities sharing lexical or contextual features with the POL entities (e.g. Linux, titanium); and (c) words with Latin characters, common in the science and technology domains. (a) and (b) are mostly transliterations into Arabic.

An alternative, and simpler, approach to controlling the precision-recall tradeoff is the Minkov et al. (2006) strategy of tuning a single feature weight subsequent to learning (see section 4.1 above). We performed an oracle experiment to determine how this compares to recall-oriented learning in our setting. An oracle trained with the method of Minkov et al. outperforms the three models in Table 5 that use the regular perceptron for the supervised phase of learning, but underperforms the supervised ROP conditions.²⁰

    ²⁰ Tuning the O feature weight to optimize for F1 on our test set, we found that oracle precision would be 66.2, recall would be 39.0, and F1 would be 49.1. The F1 score of our best model is nearly 3 points higher than the Minkov et al.-style oracle, and over 4 points higher than the non-oracle version where the development set is used for tuning.

Overall, we find that incorporating the recall-oriented bias in learning is fruitful for adapting to Wikipedia because the gains in recall outpace the damage to precision.

6 Discussion

To our knowledge, this work is the first suggestion that substantively modifying the supervised learning criterion in a resource-rich domain can reap benefits in subsequent semisupervised application in a new domain. Past work has looked at regularization (Chelba and Acero, 2006) and feature design (Daume III, 2007); we alter the loss function. Not surprisingly, the double-ROP approach harms performance on the original domain (on ACE data, we achieve 55.41% F1, far below the standard perceptron). Yet we observe that models can be prepared for adaptation even before a learner is exposed to a new domain, sacrificing performance in the original domain.

The recall-oriented bias is not merely encouraging the learner to identify entities already seen in training. As recall increases, so does the number of new entity types recovered by the model: of the 2,070 NE types in the test data that were never seen in training, only 450 were ever found by the baseline, versus 588 in the reg/ROP condition, 632 in the ROP/none condition, and 717 in the double-ROP condition.

We note finally that our method is a simple extension to the standard structured perceptron; cost-augmented inference is often no more expensive than traditional inference, and the algorithmic change is equivalent to adding one additional feature. Our recall-oriented cost function is parameterized by a single value, beta; recall is highly sensitive to the choice of this value (Figure 1 shows how we tuned it on development data), and thus we anticipate that, in general, such tuning will be essential to leveraging the benefits of arrogance.

7 Related Work

Our approach draws on insights from work in the areas of NER, domain adaptation, NLP with Wikipedia, and semisupervised learning. As all are broad areas of research, we highlight only the most relevant contributions here.

Research in Arabic NER has focused on compiling and optimizing the gazetteers and feature sets for standard sequential modeling algorithms (Benajiba et al., 2008; Farber et al., 2008; Shaalan and Raza, 2008; Abdul-Hamid and Darwish, 2010). We make use of features identified in this prior work to construct a strong baseline system. We are unaware of any Arabic NER work that has addressed diverse text domains like Wikipedia. Both the English and Arabic versions of Wikipedia have been used, however, as resources in service of traditional NER (Kazama and Torisawa, 2007; Benajiba et al., 2008). Attia et al. (2010) heuristically induce a mapping between Arabic Wikipedia and Arabic WordNet to construct Arabic NE gazetteers.

Balasuriya et al. (2009) highlight the substantial divergence between entities appearing in English Wikipedia versus traditional corpora, and the effects of this divergence on NER performance. There is evidence that models trained on Wikipedia data generalize and perform well on corpora with narrower domains. Nothman et al. (2009) and Balasuriya et al. (2009) show that NER models trained on both automatically and manually annotated Wikipedia corpora perform reasonably well on news corpora. The reverse scenario does not hold for models trained on news text, a result we also observe in Arabic NER. Other work has gone beyond the entity detection problem: Florian et al. (2004) additionally predict within-document entity coreference for Arabic, Chinese, and English ACE text, while Cucerzan (2007) aims to resolve every mention detected in English Wikipedia pages to a canonical article devoted to the entity in question.

The domain and topic diversity of NEs has been studied in the framework of domain adaptation research. A group of these methods use self-training and select the most informative features and training instances to adapt a source domain learner to the new target domain. Wu et al. (2009) bootstrap the NER learner with a subset of unlabeled instances that bridge the source and target domains. Jiang and Zhai (2006) and Daume III (2007) make use of some labeled target-domain data to tune or augment the features of the source model towards the target domain. Here, in contrast, we use labeled target-domain data only for tuning and evaluation. Another important distinction is that domain variation in this prior work is restricted to topically-related corpora (e.g. newswire vs. broadcast news), whereas in our work, major topical differences distinguish the training and test corpora, and consequently, their salient NE classes. In these respects our NER setting is closer to that of Florian et al. (2010), who recognize English entities in noisy text, Surdeanu et al. (2011), which concerns information extraction in a topically distinct target domain, and Dalton et al. (2011), which addresses English NER in noisy and topically divergent text.

Self-training (Clark et al., 2003; Mihalcea, 2004; McClosky et al., 2006) is widely used in NLP and has inspired related techniques that learn from automatically labeled data (Liang et al., 2008; Petrov et al., 2010). Our self-training procedure differs from some others in that we use all of the automatically labeled examples, rather than filtering them based on a confidence score.

Cost functions have been used in non-structured classification settings to penalize certain types of errors more than others (Chan and Stolfo, 1998; Domingos, 1999; Kiddon and Brun, 2011). The goal of optimizing our structured NER model for recall is quite similar to the scenario explored by Minkov et al. (2006), as noted above.

8 Conclusion

We explored the problem of learning an NER model suited to domains for which no labeled training data are available. A loss function to encourage recall over precision during supervised discriminative learning substantially improves recall and overall entity detection performance, especially when combined with a semisupervised learning regimen incorporating the same bias. We have also developed a small corpus of Arabic Wikipedia articles via a flexible entity annotation scheme spanning four topical domains (publicly available at http://www.ark.cs.cmu.edu/AQMAR).

Acknowledgments

We thank Mariem Fekih Zguir and Reham Al Tamime for assistance with annotation, Michael Heilman for his tagger implementation, and Nizar Habash and colleagues for the MADA toolkit. We thank members of the ARK group at CMU, Hal Daume, and anonymous reviewers for their valuable suggestions. This publication was made possible by grant NPRP-08-485-1-083 from the Qatar National Research Fund (a member of the Qatar Foundation). The statements made herein are solely the responsibility of the authors.

References

Ahmed Abdul-Hamid and Kareem Darwish. 2010. Simplified feature set for Arabic named entity recognition. In Proceedings of the 2010 Named Entities Workshop, pages 110-115, Uppsala, Sweden, July. Association for Computational Linguistics.
Mohammed Attia, Antonio Toral, Lamia Tounsi, Monica Monachini, and Josef van Genabith. 2010. An automatically built named entity lexicon for Arabic. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).
Bogdan Babych and Anthony Hartley. 2003. Improving machine translation quality with automatic named entity recognition. In Proceedings of the 7th International EAMT Workshop on MT and Other Language Technology Tools, EAMT '03.
Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran. 2009. Named entity recognition in Wikipedia. In Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, pages 10-18, Suntec, Singapore, August. Association for Computational Linguistics.
Yassine Benajiba, Paolo Rosso, and Jose Miguel Benedi Ruiz. 2007. ANERsys: an Arabic named entity recognition system based on maximum entropy. In Alexander Gelbukh, editor, Proceedings of CICLing, pages 143-153, Mexico City, Mexico. Springer.
Yassine Benajiba, Mona Diab, and Paolo Rosso. 2008. Arabic named entity recognition using optimized feature sets. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 284-293, Honolulu, Hawaii, October. Association for Computational Linguistics.
Philip K. Chan and Salvatore J. Stolfo. 1998. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 164-168, New York City, New York, USA, August. AAAI Press.
Ciprian Chelba and Alex Acero. 2006. Adaptation of maximum entropy capitalizer: Little data can help a lot. Computer Speech and Language, 20(4):382-399.
Massimiliano Ciaramita and Mark Johnson. 2003. Supersense tagging of unknown nouns in WordNet. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 168-175.
Stephen Clark, James Curran, and Miles Osborne. 2003. Bootstrapping POS-taggers using unlabelled data. In Walter Daelemans and Miles Osborne, editors, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 49-55.
Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1-8, Stroudsburg, PA, USA. Association for Computational Linguistics.
Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708-716, Prague, Czech Republic, June.
James R. Curran, Tara Murphy, and Bernhard Scholz. 2007. Minimising semantic drift with Mutual Exclusion Bootstrapping. In Proceedings of PACLING 2007.
Jeffrey Dalton, James Allan, and David A. Smith. 2011. Passage retrieval for incorporating global evidence in sequence labeling. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11), pages 355-364, Glasgow, Scotland, UK, October. ACM.
Hal Daume III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256-263, Prague, Czech Republic, June. Association for Computational Linguistics.
Pedro Domingos. 1999. MetaCost: a general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155-164.
Benjamin Farber, Dayne Freitag, Nizar Habash, and Owen Rambow. 2008. Improving NER in Arabic using a morphological tagger. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis, and Daniel Tapias, editors, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), pages 2509-2514, Marrakech, Morocco, May. European Language Resources Association (ELRA).
Radu Florian, Hany Hassan, Abraham Ittycheriah, Hongyan Jing, Nanda Kambhatla, Xiaoqiang Luo, Nicolas Nicolov, and Salim Roukos. 2004. A statistical model for multilingual entity detection and tracking. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 1-8, Boston, Massachusetts, USA, May. Association for Computational Linguistics.
Radu Florian, John Pitrelli, Salim Roukos, and Imed Zitouni. 2010. Improving mention detection robustness to noisy input. In Proceedings of EMNLP 2010, pages 335-345, Cambridge, MA, October. Association for Computational Linguistics.
Dayne Freitag. 2004. Trained named entity recognition using distributional clusters. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 262-269, Barcelona, Spain, July. Association for Computational Linguistics.
Kevin Gimpel and Noah A. Smith. 2010a. Softmax-margin CRFs: Training log-linear models with loss functions. In Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, pages 733-736, Los Angeles, California, USA, June.
Kevin Gimpel and Noah A. Smith. 2010b. Softmax-margin training for structured log-linear models. Technical Report CMU-LTI-10-008, Carnegie Mellon University. http://www.lti.cs.cmu.edu/research/reports/2010/cmulti10008.pdf.
Cyril Grouin, Sophie Rosset, Pierre Zweigenbaum, Karen Fort, Olivier Galibert, and Ludovic Quintard. 2011. Proposal for an extension of traditional named entities: from guidelines to evaluation, an overview. In Proceedings of the 5th Linguistic Annotation Workshop, pages 92-100, Portland, Oregon, USA, June. Association for Computational Linguistics.
Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 573-580, Ann Arbor, Michigan, June. Association for Computational Linguistics.
Nizar Habash. 2010. Introduction to Arabic Natural Language Processing. Morgan and Claypool Publishers.
Ahmed Hassan, Haytham Fahmy, and Hany Hassan. 2007. Improving named entity translation by exploiting comparable and parallel corpora. In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP '07), Borovets, Bulgaria.
Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL (HLT-NAACL), pages 57-60, New York City, USA, June. Association for Computational Linguistics.
Dirk Hovy, Chunliang Zhang, Eduard Hovy, and Anselmo Penas. 2011. Unsupervised discovery of domain-specific knowledge from text. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1466-1475, Portland, Oregon, USA, June. Association for Computational Linguistics.
Jing Jiang and ChengXiang Zhai. 2006. Exploiting domain structure for named entity recognition. In Proceedings of the Human Language Technology Conference of the NAACL (HLT-NAACL), pages 74-81, New York City, USA, June. Association for Computational Linguistics.
Junichi Kazama and Kentaro Torisawa. 2007. Exploiting Wikipedia as external knowledge for named entity recognition. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 698-707, Prague, Czech Republic, June. Association for Computational Linguistics.
Chloe Kiddon and Yuriy Brun. 2011. That's what she said: double entendre identification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 89-94, Portland, Oregon, USA, June. Association for Computational Linguistics.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 388-395, Barcelona, Spain, July. Association for Computational Linguistics.
LDC. 2005. ACE (Automatic Content Extraction) Arabic annotation guidelines for entities, version 5.3.3. Linguistic Data Consortium, Philadelphia.
Percy Liang, Hal Daume III, and Dan Klein. 2008. Structure compilation: trading structure for features. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 592-599, Helsinki, Finland.
Chris Manning. 2006. Doing named entity recognition? Don't optimize for F1. http://nlpers.blogspot.com/2006/08/doing-named-entity-recognition-dont.html.
David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152-159, New York City, USA, June. Association for Computational Linguistics.
Rada Mihalcea. 2004. Co-training and self-training for word sense disambiguation. In HLT-NAACL 2004 Workshop: Eighth Conference on Computational Natural Language Learning (CoNLL-2004), Boston, Massachusetts, USA.
Einat Minkov, Richard Wang, Anthony Tomasic, and William Cohen. 2006. NER systems that suit users' preferences: adjusting the recall-precision trade-off for entity extraction. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 93-96, New York City, USA, June. Association for Computational Linguistics.
Luke Nezda, Andrew Hickl, John Lehmann, and Sarmad Fayyaz. 2006. What in the world is a Shahab? Wide coverage named entity recognition for Arabic. In Proceedings of LREC, pages 41-46.
Joel Nothman, Tara Murphy, and James R. Curran. 2009. Analysing Wikipedia and gold-standard corpora for NER training. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), pages 612-620, Athens, Greece, March. Association for Computational Linguistics.
PediaPress. 2010. mwlib. http://code.pediapress.com/wiki/wiki/mwlib.
Slav Petrov, Pi-Chuan Chang, Michael Ringgaard, and
Khaled Shaalan and Hafsa Raza. 2008. Arabic named entity recognition from diverse text types. In Advances in Natural Language Processing, pages 440-451. Springer.
Mihai Surdeanu, David McClosky, Mason R. Smith, Andrey Gusev, and Christopher D. Manning. 2011. Customizing an information extraction system to a new domain. In Proceedings of the ACL 2011 Workshop on Relational Models of Semantics, Portland, Oregon, USA, June. Association for Computational Linguistics.
Ben Taskar, Carlos Guestrin, and Daphne Koller. 2004. Max-margin Markov networks. In Sebastian Thrun, Lawrence Saul, and Bernhard Scholkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press.
Antonio Toral, Elisa Noguera, Fernando Llopis, and Rafael Munoz. 2005. Improving question answering using named entity recognition. Natural Language Processing and Information Systems, 3513/2005:181-191.
Ioannis Tsochantaridis, Thorsten Joachims, Thomas
Hiyan Alshawi. 2010. Uptraining for accurate de- Hofmann, and Yasemin Altun. 2005. Large margin
terministic question parsing. In Proceedings of the methods for structured and interdependent output
2010 Conference on Empirical Methods in Natural variables. Journal of Machine Learning Research,
Language Processing, pages 705713, Cambridge, 6:14531484, September.
MA, October. Association for Computational Lin- Christopher Walker, Stephanie Strassel, Julie Medero,
guistics. and Kazuaki Maeda. 2006. ACE 2005 multi-
Lev Ratinov and Dan Roth. 2009. Design chal- lingual training corpus. LDC2006T06, Linguistic
lenges and misconceptions in named entity recog- Data Consortium, Philadelphia.
nition. In Proceedings of the Thirteenth Confer- Ralph Weischedel and Ada Brunstein. 2005.
ence on Computational Natural Language Learning BBN pronoun coreference and entity type cor-
(CoNLL-2009), pages 147155, Boulder, Colorado, pus. LDC2005T33, Linguistic Data Consortium,
June. Association for Computational Linguistics. Philadelphia.
Nathan D. Ratliff, J. Andrew Bagnell, and Martin A. Dan Wu, Wee Sun Lee, Nan Ye, and Hai Leong Chieu.
Zinkevich. 2006. Subgradient methods for maxi- 2009. Domain adaptive bootstrapping for named
mum margin structured learning. In ICML Work- entity recognition. In Proceedings of the 2009 Con-
shop on Learning in Structured Output Spaces, ference on Empirical Methods in Natural Language
Pittsburgh, Pennsylvania, USA. Processing, pages 15231532, Singapore, August.
Ryan Roth, Owen Rambow, Nizar Habash, Mona Association for Computational Linguistics.
Diab, and Cynthia Rudin. 2008. Arabic morpho- Tianfang Yao, Wei Ding, and Gregor Erbach. 2003.
logical tagging, diacritization, and lemmatization CHINERS: a Chinese named entity recognition sys-
using lexeme models and feature ranking. In Pro- tem for the sports domain. In Proceedings of the
ceedings of ACL-08: HLT, pages 117120, Colum- Second SIGHAN Workshop on Chinese Language
bus, Ohio, June. Association for Computational Processing, pages 5562, Sapporo, Japan, July. As-
Linguistics. sociation for Computational Linguistics.
Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata.
2002. Extended named entity hierarchy. In Pro-
ceedings of LREC.
Burr Settles. 2004. Biomedical named entity recogni-
tion using conditional random fields and rich feature
sets. In Nigel Collier, Patrick Ruch, and Adeline
Nazarenko, editors, COLING 2004 International
Joint workshop on Natural Language Processing in
Biomedicine and its Applications (NLPBA/BioNLP)
2004, pages 107110, Geneva, Switzerland, Au-
gust. COLING.

173
Tree Representations in Probabilistic Models for Extended Named
Entities Detection

Marco Dinarelli Sophie Rosset


LIMSI-CNRS LIMSI-CNRS
Orsay, France Orsay, France
marcod@limsi.fr rosset@limsi.fr

Abstract labelling approach. Additionally, the use of noisy


data like transcriptions of French broadcast data,
In this paper we deal with Named En- makes the task very challenging for traditional
tity Recognition (NER) on transcriptions of
NLP solutions. To deal with such problems, we
French broadcast data. Two aspects make
the task more difficult with respect to previ- adopt a two-steps approach, the first being real-
ous NER tasks: i) named entities annotated ized with Conditional Random Fields (CRF) (Laf-
used in this work have a tree structure, thus ferty et al., 2001), the second with a Probabilistic
the task cannot be tackled as a sequence la- Context-Free Grammar (PCFG) (Johnson, 1998).
belling task; ii) the data used are more noisy The motivations behind that are:
than data used for previous NER tasks. We
approach the task in two steps, involving Since the named entities have a tree struc-
Conditional Random Fields and Probabilis-
ture, it is reasonable to use a solution com-
tic Context-Free Grammars, integrated in a
single parsing algorithm. We analyse the ing from syntactic parsing. However pre-
effect of using several tree representations. liminary experiments using such approaches
Our system outperforms the best system of gave poor results.
the evaluation campaign by a significant
margin. Despite the tree-structure of the entities,
trees are not as complex as syntactic trees,
1 Introduction thus, before designing an ad-hoc solution for
the task, which require a remarkable effort
Named Entity Recognition is a traditional task of and yet it doesn't guarantee better perfor-
the Natural Language Processing domain. The mances, we designed a solution providing
task aims at mapping words in a text into seman- good results and which required a limited de-
tic classes, such as persons, organizations or lo- velopment effort.
cations. While at first the NER task was quite
simple, involving a limited number of classes (Gr- Conditional Random Fields are models ro-
ishman and Sundheim, 1996), along the years bust to noisy data, like automatic transcrip-
the task complexity increased as more complex tions of ASR systems (Hahn et al., 2010),
class taxonomies were defined (Sekine and No- thus it is the best choice to deal with tran-
bata, 2004). The interest in the task is related to scriptions of broadcast data. Once words
its use in complex frameworks for (semantic) con- have been annotated with basic entity con-
tent extraction, such as Relation Extraction ap- stituents, the tree structure of named entities
plications (Doddington et al., 2004). is simple enough to be reconstructed with
This work presents research on a Named Entity relatively simple model like PCFG (Johnson,
Recognition task defined with a new set of named 1998).
entities. The characteristic of such a set is that
named entities have a tree structure. As a conse- The two models are integrated in a single pars-
quence, the task cannot be tackled as a sequence ing algorithm. We analyze the effect of the use of

174
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 174–184,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
Figure 1: Examples of structured named entities annotated on the data used in this work

Figure 2: An example of named entity tree corresponding to entities of a whole sentence. Tree leaves, corresponding to sentence words, have been removed to keep readability

several tree representations, which result in different parsing models with different performances.
We provide a detailed evaluation of our mod-
Table 1: Statistics on the training and development sets of the Quaero corpus
                      training (words / entities)   dev (words / entities)
# sentences           43,251                        112
# tokens              1,251,432 / 245,880           2,659 / 570
# vocabulary          39,631 / 134                  891 / 30
# components          133,662                       971
# components dict.    28                            18
# OOV rate [%]        -                             17.15 / 0

els. Results can be compared with those obtained
in the evaluation campaign where the same data
were used. Our system outperforms the best sys-
tem of the evaluation campaign by a significant
margin.
The rest of the paper is structured as follows: in
the next section we introduce the extended named
entities used in this work, in section 3 we describe
our two-steps algorithm for parsing entity trees,
in section 4 we detail the second step of our ap-
proach based on syntactic parsing approaches, in name.last for pers.ind or val (for value) and ob-
particular we describe the different tree represen- ject for amount.
tations used in this work to encode entity trees These named entities have been annotated on
in parsing models. In section 6 we describe and transcriptions of French broadcast news coming
comment experiments, and finally, in section 7, from several radio channels. The transcriptions
we draw some conclusions. constitute a corpus that has been split into train-
ing, development and evaluation sets.The evalu-
2 Extended Named Entities ation set, in particular, is composed of two set
The most important aspect of the NER task we of data, Broadcast News (BN in the table) and
investigated is provided by the tree structure of Broadcast Conversations (BC in the table). The
named entities. Examples of such entities are evaluation of the models presented in this work
given in figures 1 and 2, where words have been re- is performed on the merge of the two data types.
moved for readability and are: (90 persons Some statistics of the corpus are reported in ta-
are still present at Atambua. It's there that 3 employ- ble 1 and 2. This set of named entities has been
ees of the High Conseil of United Nations for refugees defined in order to provide more fine semantic in-
have been killed yesterday morning): formation for entities found in the data, e.g. a
person is better specified by first and last name,
90 personnes toujours présentes à and is fully described in (Grouin, 2011). In or-
Atambua c'est là qu'hier matin ont der to avoid confusion, entities that can be associ-
été tuées 3 employés du haut commis- ated directly to words, like name.first, name.last,
sariat des Nations unies aux réfugiés, val and object, are called entity constituents, com-
le HCR ponents or entity pre-terminals (as they are pre-
Words realizing entities in figure 2 are in bold, terminals nodes in the trees). The other entities,
and they correspond to the tree leaves in the like pers.ind or amount, are called entities or non-
picture. As we see in the figures, entities terminal entities, depending on the context.
can have complex structures. Beyond the use
3 Models Cascade for Extended Named
of subtypes, like individual in person (to give
Entities
pers.ind), or administrative in organization
(to give org.adm), entities with more specific con- Since the task of Named Entity Recognition pre-
tent can be constituents of more general enti- sented here cannot be modeled as sequence la-
ties to form tree structures, like name.first and belling and, as mentioned previously, an approach

175
Table 2: Statistics on the test set of the Quaero corpus, divided in Broadcast News (BN) and Broadcast Conversations (BC)
                      test BN (words / entities)    test BC (words / entities)
# sentences           1704                          3933
# tokens              32945 / 2762                  69414 / 2769
# vocabulary          28                            28
# components          4128                          4017
# components dict.    21                            20
# OOV rate [%]        3.63 / 0                      3.84 / 0

3.1 Conditional Random Fields
CRFs are particularly suitable for sequence la-
belling tasks (Lafferty et al., 2001). Beyond the
possibility to include a huge number of features
using the same framework as Maximum Entropy
models (Berger et al., 1996), CRF models en-
code global conditional probabilities normalized
at sentence level.
Given a sequence of N words W1N =
w1 , ..., wN and its corresponding components se-
quence E1N = e1 , ..., eN , CRF trains the condi-
tional probabilities

$$P(E_1^N \mid W_1^N) = \frac{1}{Z} \prod_{n=1}^{N} \exp\left( \sum_{m=1}^{M} \lambda_m\, h_m(e_{n-1}, e_n, w_{n-2}^{n+2}) \right) \qquad (1)$$

Figure 3: Processing schema of the two-step approach proposed in this work: CRF plus PCFG

coming from syntactic parsing to perform named where $\lambda_m$ are the training parameters.
entity annotation in one-shot is not robust on $h_m(e_{n-1}, e_n, w_{n-2}^{n+2})$ are the feature functions
the data used in this work, we adopt a two-step approach. capturing dependencies of entities and words. Z
The first is designed to be robust to noisy data and is the partition function:
is used to annotate entity components, while the
second is used to parse complete entity trees and
$$Z = \sum_{E_1^N} \prod_{n=1}^{N} H(e_{n-1}, e_n, w_{n-2}^{n+2}) \qquad (2)$$
is based on a relatively simple model. Since we
are dealing with noisy data, the hardest part of the
task is indeed to annotate components on words.
On the other hand, since entity trees are relatively which ensures that probabilities sum up to one.
simple, at least much simpler than syntactic trees, $e_{n-1}$ and $e_n$ are components for previous and cur-
once entity components have been annotated in a rent words, $H(e_{n-1}, e_n, w_{n-2}^{n+2})$ is an abbreviation
first step, for the second step, a complex model is for $\sum_{m=1}^{M} \lambda_m h_m(e_{n-1}, e_n, w_{n-2}^{n+2})$, i.e. the set
not required, which would also make the process- of active feature functions at current position in
ing slower. Taking all these issues into account, the sequence.
the two steps of our system for tree-structured In the last few years different CRF implemen-
named entity recognition are performed as fol- tations have been realized. The implementation
lows: we refer in this work is the one described in
(Lavergne et al., 2010), which optimize the fol-
1. A CRF model (Lafferty et al., 2001) is used lowing objective function:
to annotate components on words.
$$-\log\left(P(E_1^N \mid W_1^N)\right) + \rho_1 \|\lambda\|_1 + \frac{\rho_2}{2} \|\lambda\|_2^2 \qquad (3)$$
2. A PCFG model (Johnson, 1998) is used
to parse complete entity trees upon compo-
nents, i.e. using components annotated by
$\|\lambda\|_1$ and $\|\lambda\|_2^2$ are the $\ell_1$ and $\ell_2$ regulariz-
CRF as starting point.
ers (Riezler and Vasserman, 2004), and together
This processing schema is depicted in figure 3. in a linear combination implement the elastic net
Conditional Random Fields are described shortly regularizer (Zou and Hastie, 2005). As mentioned
in the next subsection. PCFG models, constituting in (Lavergne et al., 2010), this kind of regulariz-
the main part of this work together with the analy- ers are very effective for feature selection at train-
sis over tree representations, is described more in ing time, which is a very good point when dealing
details in the next sections. with noisy data and big set of features.

176
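Purely as an illustration of equations (1)-(3), and not of the actual wapiti implementation, the score a linear-chain CRF assigns to one candidate component sequence and the elastic-net penalty can be sketched as follows. The function names, the generic feature_funcs interface and the sentence-boundary handling are hypothetical placeholders introduced only for this example.

def crf_sequence_score(words, components, feature_funcs, weights):
    # Unnormalized log-score of one component sequence (the exponent of
    # equation (1)): at every position n we sum lambda_m * h_m over all
    # feature functions, using the previous component, the current one and
    # a window of words w_{n-2}..w_{n+2}.
    score = 0.0
    for n in range(len(words)):
        prev = components[n - 1] if n > 0 else "<s>"
        window = tuple(words[max(0, n - 2):n + 3])
        for m, h in enumerate(feature_funcs):
            score += weights[m] * h(prev, components[n], window)
    return score

def elastic_net_penalty(weights, rho1, rho2):
    # The l1 and l2 terms added to the training objective in equation (3).
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return rho1 * l1 + (rho2 / 2.0) * l2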
4 Models for Parsing Trees
The models used in this work for parsing en-
tity trees refer to the models described in (John-
son, 1998), in (Charniak, 1997; Caraballo and
Charniak, 1997) and (Charniak et al., 1998), and
which constitutes the basis of the maximum en-
tropy model for parsing described in (Charniak, Figure 4: Baseline tree representations used in the PCFG parsing
model
2000). A similar lexicalized model has been pro-
posed also by Collins (Collins, 1997). All these
models are based on a PCFG trained from data
and used in a chart parsing algorithm to find the
best parse for the given input. The PCFG model
of (Johnson, 1998) is made of rules of the form:

$X_i \to X_j X_k$
$X_i \to w$

Figure 5: Filler-parent tree representations used in the PCFG parsing model
where X are non-terminal entities and w are


terminal symbols (words in our case).1 The prob- have all rules in the form of 4 and 5, is straight-
ability associated to these rules are: forward and can be done with simple algorithms
P (Xi Xj , Xk ) not discussed here.
pij,k = (4)
P (Xi )
4.1 Tree Representations for Extended
P (Xi w) Named Entities
piw = (5)
P (Xi )
As discussed in (Johnson, 1998), an important
The models described in (Charniak, 1997; point for a parsing algorithm is the representation
Caraballo and Charniak, 1997) encode probabil- of trees being parsed. Changing the tree represen-
ities involving more information, such as head tation can change significantly the performances
words. In order to have a PCFG model made of of the parser. Since there is a large difference be-
rules with their associated probabilities, we ex- tween entity trees used in this work and syntac-
tract rules from the entity trees of our corpus. This tic trees, from both meaning and structure point
processing is straightforward, for example from of view, it is worth performing an analysis with
the tree depicted in figure 2, the following rules the aim of finding the most suitable representa-
are extracted: tion for our task. In order to perform this analy-
S amount loc.adm.town time.dat.rel amount sis, we start from a named entity annotated on the
amount val object words de notre president , M. Nicolas Sarkozy(of
time.date.rel name time-modifier our president, Mr. Nicolas Sarkozy). The corre-
object func.coll sponding named entity is shown in figure 4. As
func.coll kind org.adm decided in the annotation guidelines, fillers can be
org.adm name part of a named entity. This can happen for com-
plex named entities involving several words. The
Using counts of these rules we then compute representation shown in figure 4 is the default rep-
maximum likelihood probabilities of the Right resentation and will be referred to as baseline. A
Hand Side (RHS) of the rule given its Left Hand problem created by this representation is the fact
Side (LHS). Also binarization of rules, applied to that fillers are present also outside entities. Fillers
of named entities should be, in principle, distin-
1
These rules are actually in Chomsky Normal Form, i.e. guished from any other filler, since they may be
unary or binary rules only. A PCFG, in general, can have any
informative to discriminate entities.
rule, however, the algorithm we are discussing convert the
PCFG rules into Chomsky Normal Form, thus for simplicity Following this intuition, we designed two dif-
we provide directly such formulation. ferent representations where entity fillers are con-

177
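As a minimal illustrative sketch of the rule extraction and maximum-likelihood estimation just described (the authors' own scripts are not reproduced here), assuming entity trees encoded as nested (label, children) pairs, where children are sub-trees or plain word strings:

from collections import Counter

def extract_rules(tree, rules):
    # One count per rule occurrence; the RHS is the sequence of child
    # labels (or the word itself for lexical rules X -> w).
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, rules)

def rule_probabilities(trees):
    # Maximum-likelihood estimates of P(RHS | LHS), as in equations (4)-(5).
    rules = Counter()
    for t in trees:
        extract_rules(t, rules)
    lhs_totals = Counter()
    for (lhs, _rhs), count in rules.items():
        lhs_totals[lhs] += count
    return {rule: count / lhs_totals[rule[0]] for rule, count in rules.items()}

# One of the sub-trees of figure 2, encoded with the assumed format:
example = ("org.adm", [("name", ["HCR"])])
print(rule_probabilities([example]))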
Figure 8: Parent-node-filler tree representations used in the PCFG
parsing model
Figure 6: Parent-context tree representations used in the PCFG
parsing model
referred to as parent-node-filler. This representa-
tion is a good trade-off between contextual infor-
mation and rigidity, by still representing entities
as concatenation of labels, while using a common
special label for entity fillers. This allows to keep
lower the number of entities annotated on words,
Figure 7: Parent-node tree representations used in the PCFG pars-
ing model
i.e. components.
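To make the contextualized representations discussed in this section concrete, the following is a small illustrative sketch of the parent-node re-labeling; the "@" separator and the (label, children) tree encoding are assumptions made only for the example, and the filler-parent, parent-context and parent-node-filler variants would apply analogous re-labelings restricted to filler nodes.

def to_parent_node(tree, parent=None):
    # "parent-node" transformation: every node label is annotated with the
    # label of its parent node.
    label, children = tree
    new_label = label if parent is None else label + "@" + parent
    return (new_label,
            [c if isinstance(c, str) else to_parent_node(c, label)
             for c in children])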
Using different tree representations affects both
the structure and the performance of the parsing
textualized so that to be distinguished from the model. The structure is described in the next sec-
other fillers. In the first representation we give to tion, the performance in the evaluation section.
the filler the same label of the parent node, while
in the second representation we use a concatena- 4.2 Structure of the Model
tion of the filler and the label of the parent node. Lexicalized models for syntactic parsing de-
These two representations are shown in figure 5 scribed in (Charniak, 2000; Charniak et al., 1998)
and 6, respectively. The first one will be referred and (Collins, 1997), integrate more information
to as filler-parent, while the second will be re- than what is used in equations 4 and 5. Consider-
ferred as parent-context. A problem that may be ing a particular node in the entity tree, not includ-
introduced by the first representation is that some ing terminals, the information used is:
entities that originally were used only for non-
terminal entities will appear also as components, s: the head word of the node, i.e. the most
i.e. entities annotated on words. This may intro- important word of the chunk covered by the
duce some ambiguity. current node
Another possible contextualization can be to
h: the head word of the parent node
annotate each node with the label of the parent
node. This representation is shown in figure 7 t: the entity tag of the current node
and will be referred to as parent-node. Intuitively,
this representation is effective since entities an- l: the entity tag of the parent node
notated directly on words provide also the en-
The head word of the parent node is defined
tity of the parent node. However this representa-
percolating head words from children nodes to
tion increases drastically the number of entities,
parent nodes, giving the priority to verbs. They
in particular the number of components, which
can be found using automatic approaches based
in our case are the set of labels to be learned by
on words and entity tag co-occurrence or mutual
the CRF model. For the same reason this repre-
information. Using this information, the model
sentation produces more rigid models, since label
described in (Charniak et al., 1998) is P (s|h, t, l).
sequences vary widely and thus is not likely to
This model being conditioned on several pieces
match sequences not seen in the training data.
of information, it can be affected by data sparsity
Finally, another interesting tree representation
problems. Thus, the model is actually approxi-
is a variation of the parent-node tree, where en-
mated as an interpolation of probabilities:
tity fillers are only distinguished from fillers not
in an entity, using the label ne-filler, but they are
not contextualized with entity information. This
representation is shown in figure 8 and it will be
$$P(s|h,t,l) = \lambda_1 P(s|h,t,l) + \lambda_2 P(s|c_h,t,l) + \lambda_3 P(s|t,l) + \lambda_4 P(s|t) \qquad (6)$$

178
have shown less effective for syntactic parsing
where $\lambda_i$, $i = 1, ..., 4$, are parameters of the than their lexicalized counterparts, there are evi-
model to be tuned, and $c_h$ is the cluster of head dences showing that they can be effective in our
words for a given entity tag t. With such model, task. With reference to figure 4, considering the
when not all pieces of information are available to entity pers.ind instantiated by Nicolas Sarkozy,
estimate reliably the probability with more con- our algorithm detects first name.first for Nicolas
ditioning, the model can still provide a proba- and name.last for Sarkozy using the CRF model.
bility with terms conditioned with less informa- As mentioned earlier, once the CRF model has de-
tion. The use of head words and their percola- tected components, since entity trees have not a
tion over the tree is called lexicalization. The complex structure with respect to syntactic trees,
goal of tree lexicalization is to add lexical infor- even a simple model like the one in equation 7
mation all over the tree. This way the probabil- or 8 is effective for entity tree parsing. For ex-
ity of all rules can be conditioned also on lexi- ample, once name.first and name.last have been
cal information, allowing to define the probabili- detected by CRF, pers.ind is the only entity hav-
ties P (s|h, t, l) and P (s|ch , t, l). Tree lexicaliza- ing name.first and name.last as children. Am-
tion reflects the characteristics of syntactic pars- biguities, like for example for kind or qualifier,
ing, for which the models described in (Charniak, which can appear in many entities, can affect the
2000; Charniak et al., 1998) and (Collins, 1997) model 7, but they are overcome by the model 8,
were defined. Head words are very informative taking the entity tag of the parent node into ac-
since they constitute keywords instantiating la- count. Moreover, the use of CRF allows to in-
bels, regardless if they are syntactic constituents clude in the model much more features than the
or named entities. However, for named entity lexicalized model in equation 6. Using features
recognition it doesn't make sense to give prior- like word prefixes (P), suffixes (S), capitalization
ity to verbs when percolating head words over the (C), morpho-syntactic features (MS) and other
tree, even more because head words of named en- features indicated as F2 , the CRF model encodes
tities are most of the time nouns. Moreover, it the conditional probability:
doesn't make sense to give priority to the head
word of a particular entity with respect to the oth- $P(t \mid w, P, S, C, MS, F) \qquad (9)$

ers, all entities in a sentence have the same im-


portance. Intuitively, lexicalization of entity trees
is not as straightforward as lexicalization of syntac- where w is an input word and t is the corre-
tic trees. At the same time, using non-lexicalized sponding component.
trees doesn't make sense with models like 6, since The probability of the CRF model, used in the
all the terms involve lexical information. Instead, first step to tag input words with components,
we can use the model of (Johnson, 1998), which is combined with the probability of the PCFG
define the probability of a tree as: model, used to parse entity trees starting from
components. Thus the structure of our model is:
$$P(\tau) = \prod_{X \to \beta} P(X \to \beta)^{C_\tau(X \to \beta)} \qquad (7)$$
$$P(t \mid w, P, S, C, MS, F) \cdot P(\tau) \qquad (10)$$

here the RHS of rules has been generalized with
$\beta$, representing the RHS of both unary and binary
or
$$P(t \mid w, P, S, C, MS, F) \cdot P(\tau \mid l) \qquad (11)$$
rules 4 and 5. $C_\tau(X \to \beta)$ is the number of times
the rule $X \to \beta$ appears in the tree $\tau$. The model
7 is instantiated when using tree representations
depending if we are using the tree representa-
shown in Fig. 4, 5 and 6. When using representa-
tion given in figure 4, 5 and 6 or in figure 7 and 8,
tions given in Fig. 7 and 8, the model is:
respectively. A scale factor could be used to com-
$P(\tau \mid l) \qquad (8)$ bine the two scores, but this is optional as CRFs
can provide normalized posterior probabilities.

where l is the entity label of the parent node. 2


The set of features used in the CRF model will be de-
Although non-lexicalized models like 7 and 8 scribed in more details in the evaluation section.

179
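A minimal sketch of how the two scores of equations (7) and (10) can be combined is given below; it assumes the CRF posterior of the component sequence and the PCFG rule probabilities from the earlier sketch are available, uses the same (label, children) tree encoding, and is illustrative only, not the actual parsing algorithm of (Johnson, 1998).

import math

def tree_log_prob(tree, rule_probs):
    # log P(tau) as a product of rule probabilities, equation (7).
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    logp = math.log(rule_probs[(label, rhs)])
    for child in children:
        if not isinstance(child, str):
            logp += tree_log_prob(child, rule_probs)
    return logp

def combined_log_score(crf_log_posterior, tree, rule_probs):
    # Equation (10): CRF posterior of the component sequence multiplied by
    # the PCFG probability of the tree built on top of those components
    # (sum of logs here); equation (11) would condition the tree on l.
    return crf_log_posterior + tree_log_prob(tree, rule_probs)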
5 Related Work target-side features (Tang et al., 2006). An inte-
gration of the same kind of features has been tried
While the models used for named entity detection also in the model used in this work, without giv-
and the set of named entities defined along the ing significant improvements, but making model
years have been discussed in the introduction and training much harder. Thus, this direction has not
in section 2, since CRFs and models for parsing been further investigated.
constitute the main issue in our work, we discuss
some important models here. 6 Evaluation
Beyond the models for parsing discussed in In this section we describe experiments performed
section 4, together with motivations for using or to evaluate our models. We first describe the set-
not in our work, another important model for syn- tings used for the two models involved in the en-
tactic parsing has been proposed in (Ratnaparkhi, tity tree parsing, and then describe and comment
1999). Such model is made of four Maximum the results obtained on the test corpus.
Entropy models used in cascade for parsing at
different stages. Also this model makes use of 6.1 Settings
head words, like those described in section 4, thus The CRF implementation used in this work is de-
the same considerations hold, moreover it seems scribed in (Lavergne et al., 2010), named wapiti.3
quite complex for real applications, as it involves We didn't optimize parameters $\rho_1$ and $\rho_2$ of the
the use of four different models together. The elastic net (see section 3.1), although this im-
models described in (Johnson, 1998), (Charniak, proves significantly the performances and leads
1997; Caraballo and Charniak, 1997), (Charniak to more compact models, default values lead in
et al., 1998), (Charniak, 2000), (Collins, 1997) most cases to very accurate models. We used a
and (Ratnaparkhi, 1999), constitute the main in- wide set of features in CRF models, in a window
dividual models proposed for constituent-based of [-2, +2] around the target word:
syntactic parsing. Later other approaches based
on models combination have been proposed, like A set of standard features like word prefixes
e.g. the reranking approach described in (Collins and suffixes of length from 1 to 6, plus some
and Koo, 2005), among many, and also evolutions Yes/No features like Does the word start with
or improvements of these models. capital letter?, etc.
More recently, approaches based on log-linear
models have been proposed (Clark and Curran, Morpho-syntactic features extracted from
2007; Finkel et al., 2008) for parsing, called also the output of the tool tagger (Allauzen and
Tree CRF, using also different training criteria Bonneau-Maynard, 2008)
(Auli and Lopez, 2011). Using such models in our Features extracted from the output of the se-
work has basically two problems: one related to mantic analyzer (Rosset et al., (2009)) pro-
scaling issues, since our data present a large num- vided by the tool WMatch (Galibert, 2009).
ber of labels, which makes CRF training problem-
atic, even more when using Tree CRF; another This analysis provides morpho-syntactic information as
problem is related to the difference between syn- well as semantic information at the same level
tactic parsing and named entity detection tasks, as named entities. Using two different sets of
as mentioned in sub-section 4.2. Adapting Tree morpho-syntactic features results in more effec-
CRF to our task is thus a quite complex work, it tive models, as they create a kind of agreement
constitutes an entire work by itself; we leave it as for a given word in case of match. Concerning
future work. the PCFG model, grammars, tree binarization and
Concerning linear-chain CRF models, the the different tree representations are created with
one we use is a state-of-the-art implementation our own scripts, while entity tree parsing is per-
(Lavergne et al., 2010), as it implements the formed with the chart parsing algorithm described
most effective optimization algorithms as well as in (Johnson, 1998).4
state-of-the-art regularizers (see sub-section 3.1). 3
available at http://wapiti.limsi.fr
Some improvement of linear-chain CRF have 4
available at http://web.science.mq.edu.au/
been proposed, trying to integrate higher order mjohnson/Software.htm

180
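Purely as an illustration of the kinds of surface features listed above (the real system uses wapiti feature templates plus the morpho-syntactic and WMatch semantic features, which are not shown), a feature extractor for one token position could be sketched as follows; all names are hypothetical.

def token_features(words, i):
    # Prefixes/suffixes of length 1 to 6, a capitalization flag, and the
    # words in a [-2, +2] window around position i.
    w = words[i]
    feats = {"word=" + w.lower(),
             "starts_with_capital=" + str(w[:1].isupper())}
    for k in range(1, 7):
        feats.add("prefix%d=%s" % (k, w[:k].lower()))
        feats.add("suffix%d=%s" % (k, w[-k:].lower()))
    for offset in range(-2, 3):
        j = i + offset
        context = words[j] if 0 <= j < len(words) else "<pad>"
        feats.add("w[%+d]=%s" % (offset, context.lower()))
    return feats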
CRF PCFG DEV TEST
Model # features # labels # rules Model SER F1 SER F1
baseline 3,041,797 55 29,611 baseline 20.0% 73.4% 14.2% 79.4%
filler-parent 3,637,990 112 29,611 filler-parent 16.2% 77.8% 12.5% 81.2%
parent-context 3,605,019 120 29,611 parent-context 15.2% 78.6% 11.9% 81.4%
parent-node 3,718,089 441 31,110 parent-node 6.6% 96.7% 5.9% 96.7%
parent-node-filler 3,723,964 378 31,110 parent-node-filler 6.8% 95.9% 5.7% 96.8%

Table 3: Statistics showing the characteristics of the different Table 4: Results computed from oracle predictions obtained with
models used in this work the different models presented in this work

DEV TEST
6.2 Evaluation Metrics Model SER F1 SER F1
baseline 33.5% 72.5% 33.4% 72.8%
All results are expressed in terms of Slot Error filler-parent 31.3% 74.4% 33.4% 72.7%
parent-context 30.9% 74.6% 33.3% 72.8%
Rate (SER) (Makhoul et al., 1999) which has a parent-node 31.2% 77.8% 31.4% 79.5%
similar definition of word error rate for ASR sys- parent-node-filler 28.7% 78.9% 30.2% 80.3%
tems, with the difference that substitution errors
Table 5: Results obtained with our combined algorithm based on
are split in three types: i) correct entity type with CRF and PCFG
wrong segmentation; ii) wrong entity type with
correct segmentation; iii) wrong entity type with
will have more rules. For example, the rule
wrong segmentation; here, i) and ii) are given half
pers.ind → name.first name.last can
points, while iii), as well as insertion and deletion
appear as it is or contextualized with func.ind,
errors, are given full points. Moreover, results are
like in figure 8. In contrast the other tree repre-
given using the well known F 1 measure, defined
sentations modify only fillers, thus the number of
as a function of precision and recall.
rules is not affected.
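The following short sketch only illustrates the weighting scheme described above; it is not the official scorer of (Makhoul et al., 1999), and the normalization by the number of reference slots is an assumption of this example.

def slot_error_rate(num_reference_slots, insertions, deletions,
                    wrong_segmentation_only, wrong_type_only, wrong_both):
    # Substitutions of kind i) (correct type, wrong segmentation) and
    # ii) (wrong type, correct segmentation) cost half a point; kind iii)
    # (both wrong), insertions and deletions cost a full point.
    errors = (insertions + deletions + wrong_both
              + 0.5 * (wrong_segmentation_only + wrong_type_only))
    return errors / num_reference_slots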
6.3 Results Concerning CRF models, as shown in table 3,
the use of the different tree representations results
In this section we provide evaluations of the mod- in an increasing number of labels to be learned by
els described in this work, based on combination CRF. This aspect is quite critical in CRF learn-
of CRF and PCFG and using different tree repre- ing, as training time is exponential in the number
sentations of named entity trees. of labels. Indeed, the most complex models, ob-
6.3.1 Model Statistics tained with parent-node and parent-node-filler
tree representations, took roughly 8 days for train-
As a first evaluation, we describe some statis- ing. Additionally, increasing the number of labels
tics computed from the CRF and PCFG models can create data sparseness problems, however this
using the tree representations. Such statistics pro- problem doesn't seem to arise in our case since,
vide interesting clues of how difficult is learning apart the baseline model which has quite less fea-
the task and which performance we can expect tures, all the others have approximately the same
from the model. Statistics for this evaluation are number of features, meaning that there are actu-
presented in table 3. Rows corresponds to the dif- ally enough data to learn the models, regardless
ferent tree representations described in this work, the number of labels.
while in the columns we show the number of fea-
tures and labels for the CRF models (# features 6.3.2 Evaluations of Tree Representations
and # labels), and the number of rules for PCFG In this section we evaluate the models in terms
models (# rules). of the evaluation metrics described in previous
As we can see from the table, the number section, Slot Error Rate (SER) and F1 measure.
of rules is the same for the tree representations In order to evaluate PCFG models alone, we
baseline, filler-parent and parent-context, and performed entity tree parsing using as input ref-
for the representations parent-node and parent- erence transcriptions, i.e. manual transcriptions
node-filler. This is the consequence of the con- and reference component annotations taken from
textualization applied by the latter representa- development and test sets. This can be consid-
tions, i.e. parent-node and parent-node-filler ered a kind of oracle evaluations and provides us
create several different labels depending from an upper bound of the performance of the PCFG
the context, thus the corresponding grammar models. Results for this evaluation are reported in

181
Participant SER the 2011 evaluation campaign of extended named
P1 48.9
P2 41.0 entity recognition (Galibert et al., 2011; 2) Re-
parent-context 33.3 sults are reported in table 6, where the other two
parent-node 31.4
parent-node-filler 30.2 participants to the campaign are indicated as P 1
and P 2. These two participants P1 and P2, used
Table 6: Results obtained with our combined algorithm based on a system based on CRF, and rules for deep syn-
CRF and PCFG
tactic analysis, respectively. In particular, P 2 ob-
tained superior performances in previous evalua-
table 4. As it can be intuitively expected, adding tion campaign on named entity recognition. The
more contextualization in the trees results in more system we proposed at the evaluation campaign
accurate models, the simplest model, baseline, used a parent-context tree representation. The
has the worst oracle performance, filler-parent results obtained at the evaluation campaign are
and parent-context models, adding similar con- in the first three lines of Table 6. We compare
textualization information, have very similar ora- such results with those obtained with the parent-
cle performances. Same line of reasoning applies node and parent-node-filler tree representations,
to models parent-node and parent-node-filler, reported in the last two rows of the same table. As
which also add similar contextualization and have we can see, the new tree representations described
very similar oracle predictions. These last two in this work allow to achieve the best absolute per-
models have also the best absolute oracle perfor- formances.
mances. However, adding more contextualization
in the trees results also in more rigid models, the 7 Conclusions
fact that models are robust on reference transcrip-
tions and based on reference component annota- In this paper we have presented a Named Entity
tions, doesn't imply a proportional robustness on Recognition system dealing with extended named
component sequences generated by CRF models. entities with a tree structure. Given such represen-
tation of named entities, the task cannot be mod-
This intuition is confirmed from results re-
eled as a sequence labelling approach. We thus
ported in table 5, where a real evaluation of our
proposed a two-step system based on CRF and
models is reported, using this time CRF out-
PCFG. CRF annotate entity components directly
put components as input to PCFG models, to
on words, while PCFG apply parsing techniques
parse entity trees. The results reported in ta-
to predict the whole entity tree. We motivated
ble 5 show in particular that models using base-
our choice by showing that it is not effective to
line, filler-parent and parent-context tree repre-
apply techniques used widely for syntactic pars-
sentations have similar performances, especially
ing, like for example tree lexicalization. We pre-
on test set. Models characterized by parent-node
sented an analysis of different tree representations
and parent-node-filler tree representations have
for PCFG, which affect significantly parsing per-
indeed the best performances, although the gain
formances.
with respect to the other models is not as much
as it could be expected given the difference in We provided and discussed a detailed evalua-
the oracle performances discussed above. In par- tion of all the models obtained by combining CRF
ticular the best absolute performance is obtained and PCFG with the different tree representation
with the model parent-node-filler. As we men- proposed. Our combined models result in better
tioned in subsection 4.1, this model represents the performances with respect to other models pro-
best trade-off between rigidity and accuracy using posed at the official evaluation campaign, as well
the same label for all entity fillers, but still distin- as our previous model used also at the evaluation
guishing between fillers found in entity structures campaign.
and other fillers found in words not instantiating
any entity.
Acknowledgments
This work has been funded by the project Quaero,
6.3.3 Comparison with Official Results
under the program Oseo, French State agency for
As a final evaluation of our models, we pro- innovation.
vide a comparison of official results obtained at

182
References

Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: a brief history. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, pages 466–471, Stroudsburg, PA, USA. Association for Computational Linguistics.
Satoshi Sekine and Chikashi Nobata. 2004. Definition, Dictionaries and Tagger for Extended Named Entity Hierarchy. In Proceedings of LREC.
G. Doddington, A. Mitchell, M. Przybocki, L. Ramshaw, S. Strassel, and R. Weischedel. 2004. The Automatic Content Extraction (ACE) Program - Tasks, Data, and Evaluation. In Proceedings of LREC 2004, pages 837–840.
Cyril Grouin, Sophie Rosset, Pierre Zweigenbaum, Karën Fort, Olivier Galibert, and Ludovic Quintard. 2011. Proposal for an extension of traditional named entities: From guidelines to evaluation, an overview. In Proceedings of the Linguistic Annotation Workshop (LAW).
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282–289, Williamstown, MA, USA, June.
Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, 24:613–632.
Stefan Hahn, Marco Dinarelli, Christian Raymond, Fabrice Lefèvre, Patrick Lehnen, Renato De Mori, Alessandro Moschitti, Hermann Ney, and Giuseppe Riccardi. 2010. Comparing stochastic approaches to spoken language understanding in multiple languages. IEEE Transactions on Audio, Speech and Language Processing (TASLP), 99.
Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22:39–71.
Thomas Lavergne, Olivier Cappé, and François Yvon. 2010. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 504–513. Association for Computational Linguistics, July.
Stefan Riezler and Alexander Vasserman. 2004. Incremental feature selection and l1 regularization for relaxed maximum-entropy modeling. In Proceedings of the International Conference on Empirical Methods for Natural Language Processing (EMNLP).
Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society B, 67:301–320.
Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, AAAI'97/IAAI'97, pages 598–603. AAAI Press.
Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 132–139, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Sharon A. Caraballo and Eugene Charniak. 1997. New figures of merit for best-first probabilistic chart parsing. Computational Linguistics, 24:275–298.
Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, ACL '98, pages 16–23, Stroudsburg, PA, USA. Association for Computational Linguistics.
Eugene Charniak, Sharon Goldwater, and Mark Johnson. 1998. Edge-based best-first chart parsing. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 127–133. Morgan Kaufmann.
Alexandre Allauzen and Hélène Bonneau-Maynard. 2008. Training and evaluation of POS taggers on the French MULTITAG corpus. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May.
Olivier Galibert. 2009. Approches et méthodologies pour la réponse automatique à des questions adaptées à un cadre interactif en domaine ouvert. Ph.D. thesis, Université Paris Sud, Orsay.
Sophie Rosset, Olivier Galibert, Guillaume Bernard, Eric Bilinski, and Gilles Adda. 2009. The LIMSI multilingual, multitask QAst system. In Proceedings of the 9th Cross-language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, CLEF'08, pages 480–487, Berlin, Heidelberg. Springer-Verlag.
Azeddine Zidouni, Sophie Rosset, and Hervé Glotin. 2010. Efficient combined approach for named entity recognition in spoken language. In Proceedings of the International Conference of the Speech Communication Association (Interspeech), Makuhari, Japan.
John Makhoul, Francis Kubala, Richard Schwartz, and Ralph Weischedel. 1999. Performance measures for information extraction. In Proceedings of the DARPA Broadcast News Workshop, pages 249–252.
Adwait Ratnaparkhi. 1999. Learning to Parse Natural Language with Maximum Entropy Models. Machine Learning, 34(1-3):151–175.
Michael Collins and Terry Koo. 2005. Discriminative Re-ranking for Natural Language Parsing. Machine Learning, 31(1):25–70.
Stephen Clark and James R. Curran. 2007. Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Computational Linguistics, 33(4):493–552.
Jenny R. Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, Feature-based, Conditional Random Field Parsing. In Proceedings of the Association for Computational Linguistics, pages 959–967, Columbus, Ohio.
Michael Auli and Adam Lopez. 2011. Training a Log-Linear Parser with Loss Functions via Softmax-Margin. In Proceedings of Empirical Methods in Natural Language Processing, pages 333–343, Edinburgh, U.K.
Jie Tang, MingCai Hong, Juan-Zi Li, and Bangyong Liang. 2006. Tree-Structured Conditional Random Fields for Semantic Annotation. In Proceedings of the International Semantic Web Conference, pages 640–653. Springer.
Olivier Galibert, Sophie Rosset, Cyril Grouin, Pierre Zweigenbaum, and Ludovic Quintard. 2011. Structured and Extended Named Entity Evaluation in Automatic Speech Transcriptions. In Proceedings of IJCNLP 2011.
Marco Dinarelli and Sophie Rosset. 2011. Models Cascade for Tree-Structured Named Entity Detection. In Proceedings of IJCNLP 2011.

184
When Did that Happen? Linking Events and Relations to Timestamps

Dirk Hovy*, James Fan, Alfio Gliozzo, Siddharth Patwardhan and Chris Welty
IBM T. J. Watson Research Center
19 Skyline Drive
Hawthorne, NY 10532
dirkh@isi.edu, {fanj,gliozzo,siddharth,welty}@us.ibm.com

Abstract In this paper we present methods to estab-


lish links between events (e.g. bombing or
We present work on linking events and flu- election) or fluents (e.g. spouseOf or em-
ents (i.e., relations that hold for certain
ployedBy) and temporal expressions (e.g. last
periods of time) to temporal information
in text, which is an important enabler for
Tuesday and November 2008). While previ-
many applications such as timelines and ous research has mainly focused on temporal links
reasoning. Previous research has mainly for events only, we deal with both events and flu-
focused on temporal links for events, and ents with the same method. For example, consider
we extend that work to include fluents the sentence below
as well, presenting a common methodol-
ogy for linking both events and relations Before his death in October, Steve Jobs
to timestamps within the same sentence. led Apple for 15 years.
Our approach combines tree kernels with
classical feature-based learning to exploit For a machine reading system processing this
context and achieves competitive F1-scores sentence, we would expect it to link the fluent
on event-time linking, and comparable F1-
CEO of (Steve Jobs, Apple) to time duration 15
scores for fluents. Our best systems achieve
F1-scores of 0.76 on events and 0.72 on flu- years. Similarly we expect it to link the event
ents. death to the time expression October.
We do not take a strong ontological position
on what events and fluents are, as part of our
1 Introduction
task these distinctions are made a priori. In other
It is a long-standing goal of NLP to process natu- words, events and fluents are input to our tempo-
ral language content in such a way that machines ral linking framework. In the remainder of this pa-
can effectively reason over the entities, relations, per, we also do not make a strong distinction be-
and events discussed within that content. The ap- tween relations in general and fluents in particu-
plications of such technology are numerous, in- lar, and use them interchangeably, since our focus
cluding intelligence gathering, business analytics, is only on the specific types of relations that rep-
healthcare, education, etc. Indeed, the promise resent fluents. While we only use binary relations
of machine reading is actively driving research in in this work, there is nothing in the framework
this area (Etzioni et al., 2007; Barker et al., 2007; that would prevent the use of n-ary relations. Our
Clark and Harrison, 2010; Strassel et al., 2010). work focuses on accurately identifying temporal
Temporal information is a crucial aspect of this links for eventual use in a machine reading con-
task. For a machine to successfully understand text.
natural language text, it must be able to associate In this paper, we describe a single approach that
time points and temporal durations with relations applies to both fluents and events, using feature
and events it discovers in text. engineering as well as tree kernels. We show that

The first author conducted this research during an in- we can achieve good results for both events and
ternship at IBM Research. fluents using the same feature space, and advocate

185
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 185–193,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
the versatility of our approach by achieving com- the TempEval tasks overlap with ours in many
petitive results on yet another similar task with a ways. Our task is similar to task A and C of
different data set. TempEval-1 (Verhagen et al., 2007) in the sense
Our approach requires us to capture contextual that we attempt to identify temporal relation be-
properties of text surrounding events, fluents and tween events and time expressions or document
time expressions that enable an automatic system dates. However, we do not use a restricted set of
to detect temporal linking within our framework. events, but focus primarily on a single temporal
A common strategy for this is to follow standard relation tlink instead of named relations like BE-
feature engineering methodology and manually FORE, AFTER or OVERLAP (although we show
develop features for a machine learning model that we can incorporate these as well). Part of our
from the lexical, syntactic and semantic analysis task is similar to task C of TempEval-2 (Verha-
of the text. A key contribution of our work in this gen et al., 2010), determining the temporal rela-
paper is to demonstrate a shallow tree-like repre- tion between an event and a time expression in
sentation of the text that enables us to employ tree the same sentence. In this paper, we do apply our
kernel models, and more accurately detect tempo- system to TempEval-2 data and compare our per-
ral linking. The feature space represented by such formance to the participating systems.
tree kernels is far larger than a manually engi- Our work is similar to that of Boguraev and
neered feature space, and is capable of capturing Ando (2005), whose research only deals with
the contextual information required for temporal temporal links between events and time expres-
linking. sions (and does not consider relations at all). They
The remainder of this paper goes into the de- employ a sequence tagging model with manual
tails of our approach for temporal linking, and feature engineering for the task and achieved
presents empirical evidence for the effectiveness state-of-the-art results on Timebank (Pustejovsky
of our approach. The contributions of this paper et al., 2003) data. Our task is slightly different be-
can be summarized as follows: cause we include relations in the temporal linking,
and our use of tree kernels enables us to explore a
1. We define a common methodology to link wider feature space very quickly.
events and fluents to timestamps. Filatova and Hovy (2001) also explore tempo-
2. We use tree kernels in combination with clas- ral linking with events, but do not assume that
sical feature-based approaches to obtain sig- events and time stamps have been provided by an
nificant gains by exploiting context. external process. They used a heuristics-based ap-
proach to assign temporal expressions to events
3. Empirical evidence illustrates that our (also relying on the proximity as a base case).
framework for temporal linking is very ef- They report accuracy of the assignment for the
fective for the task, achieving an F1-score of correctly classified events, the best being 82.29%.
0.76 on events and 0.72 on fluents/relations, Our best event system achieves an accuracy of
as well as 0.65 for TempEval2, approaching 84.83%. These numbers are difficult to compare,
state-of-the-art. however, since accuracy does not efficiently cap-
ture the performance of a system on a task with so
2 Related Work many negative examples.
Most of the previous work on relation extraction Mirroshandel et al. (2011) describe the use of
focuses on entity-entity relations, such as in the syntactic tree kernels for event-time links. Their
ACE (Doddington et al., 2004) tasks. Temporal results on TempEval are comparable to ours. In
relations are part of this, but to a lesser extent. contrast to them, we found, though, that syntactic
The primary research effort in event temporality tree kernels alone do not perform as well as using
has gone into ordering events with respect to one several flat tree representations.
another (e.g., Chambers and Jurafsky (2008)), and
3 Problem Definition
detecting their typical durations (e.g., Pan et al.
(2006)). The task of linking events and relations to time
Recently, TempEval workshops have focused stamps can be defined as the following: given a set
on the temporal related issues in NLP. Some of of expressions denoting events or relation men-

186
tions in a document, and a set of time expressions 4 Temporal Linking Framework
in the same document, find all instances of the
tlink relation between elements of the two input As previously mentioned, we approach the tem-
sets. The existence of a tlink (e, t) means that e, poral linking problem as a classification task. In
which is an event or a relation mention, occurs the framework of classification, we refer to each
within the temporal context specified by the time pair of (event/relation, temporal expression) oc-
expression t. curring within a sentence as an instance. The goal
Thus, our task can be cast as a binary rela- is to devise a classifier that separates positive (i.e.,
tion classification task: for each possible pair linked) instances from negative ones, i.e., pairs
of (event/relation, time) in a document, decide where there is no link between the event/relation
whether there exists a link between the two, and and the temporal expression in question. The lat-
if so, express it in the data. ter case is far more frequent, so we have an inher-
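A minimal sketch of how such within-sentence candidate pairs could be enumerated as classification instances is given below; all names are hypothetical and the contextual features attached to each pair, described later, are omitted.

def make_instances(mentions, time_expressions, gold_links):
    # Every (event/relation mention, temporal expression) pair inside one
    # sentence becomes an instance, positive only if a gold tlink connects
    # the two; as noted in the text, negatives dominate.
    instances = []
    for e in mentions:
        for t in time_expressions:
            label = 1 if (e, t) in gold_links else -1
            instances.append(((e, t), label))
    return instances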
In addition, we make these assumptions about ent bias toward negative examples in our data.1
the data: Note that the basis of the positive and nega-
tive links is the context around the target terms.
1. There does not exist a timestamp for ev- It is impossible even for humans to determine the
ery event/relation in a document. Although existence of a link based only on the two terms
events and relations typically have temporal without their context. For instance, given just two
context, it may not be explicitly stated in a words (e.g., said and yesterday) there is no
document. way to tell if it is a positive or a negative example.
We need the context to decide.
2. Every event/relation has at most one time ex- Therefore, we base our classification models on
pression associated with it. This is a simpli- contextual features drawn from lexical and syn-
fying assumption, which in the case of rela- tactic analyses of the text surrounding the target
tions we explore as future work. terms. For this, we first define a feature-based
approach, then we improve it by using tree ker-
3. Each temporal expression can be linked to nels. These two subsections, plus the treatment
one or more events or relations. Since mul- of fluent relations, are the main contributions of
tiple events or relations may happen for a this paper. In all of this work, we employ SVM
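To make the setup concrete, the following sketch shows how such within-sentence instances could be generated. The attribute names (events, times, tlinks, id) are illustrative placeholders; the paper does not specify an implementation.

    # Sketch: build within-sentence (event/relation, time) instances.
    # Each sentence is assumed to expose its event/relation mentions,
    # its temporal expressions, and the gold tlink pairs.
    def build_instances(sentences):
        instances = []
        for sent in sentences:
            for e in sent.events:            # event or relation mentions
                for t in sent.times:         # temporal expressions
                    label = 1 if (e.id, t.id) in sent.tlinks else -1
                    instances.append((sent, e, t, label))
        return instances

Each instance is then described by the features and tree representations introduced in the following subsections.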
4.1 Feature Engineering

A manual analysis of development data provided several intuitions about the kinds of features that would be useful in this task. Based on this analysis, and with inspiration from previous work (cf. Boguraev and Ando (2005)), we established three categories of features, whose description follows.

Features describing events or relations. We check whether the event or relation is phrasal, a verb, or a noun; whether it is present tense, past tense, or progressive; the type assigned to the event/relation by the UIMA type system used for processing; and whether it includes certain trigger words, such as reporting verbs ("said", "reported", etc.).

Features describing temporal expressions. We check for the presence of certain trigger words ("last", "next", "old", numbers, etc.) and the type of the expression (DURATION, TIME, or DATE) as specified by the UIMA type system.

Features describing context. We also include syntactic/structural features, such as testing whether the relation/event dominates the temporal expression, which one comes first in the sentence order, and whether either of them is dominated by a separate verb, a preposition, "that" (which often indicates a subordinate sentence) or counterfactual nouns or verbs (which would negate the temporal link).

It is not surprising that some of the most informative features (event comes before temporal expression, time is syntactic child of event) are strongly correlated with the baselines. Less salient features include the test for certain words, whether the event is a noun or a verb, and if so which tense it has and whether it is a reporting verb.
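Purely as an illustration of how these three groups could be put together, the sketch below builds a feature dictionary for one instance; attribute names such as is_verb, tense or uima_type are assumptions standing in for whatever the UIMA pipeline actually provides.

    def extract_features(sent, event, time):
        f = {}
        # Features describing the event or relation
        f["ev_is_verb"] = event.is_verb
        f["ev_tense"] = event.tense                  # present, past or progressive
        f["ev_type"] = event.uima_type
        f["ev_reporting"] = event.lemma in {"say", "report"}
        # Features describing the temporal expression
        f["tm_type"] = time.uima_type                # DURATION, TIME or DATE
        f["tm_trigger"] = any(w in {"last", "next", "old"} for w in time.tokens)
        # Features describing context
        f["ev_before_tm"] = event.start < time.start
        f["ev_dominates_tm"] = sent.dominates(event, time)
        return f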
4.2 Tree Kernel Engineering

We expect that there exist certain patterns between the entities of a temporal link, which manifest themselves on several levels: some on the lexical level, others expressed by certain sequences of POS tags, NE labels, or other representations. Kernels provide a principled way of expanding the number of dimensions in which we search for a decision boundary, and allow us to easily model local sequences and patterns in a natural way (Giuliano et al., 2009). While it is possible to define a space in which we find a decision boundary that separates positive and negative instances with manually engineered features, these features can hardly capture the notion of context as well as those explored by a tree kernel.

Tree kernels are a family of kernel functions developed to compute the similarity between tree structures by counting the number of subtrees they have in common. This generates a high-dimensional feature space that can be handled efficiently using dynamic programming techniques (Shawe-Taylor and Christianini, 2004). For our purposes we used an implementation of the Subtree and Subset Tree (SST) kernel (Moschitti, 2006).

The advantages of using tree kernels are two-fold. First, thanks to an existing implementation (SVM-Light with tree kernels, Moschitti (2004)), it is faster and easier than traditional feature engineering. Second, the tree structure allows us to use different levels of representation (POS, lemma, etc.) and combine their contributions, while at the same time taking into account the ordering of labels. We use POS, lemma, semantic type, and a representation that replaces each word with a concatenation of its features (capitalization, countable, abstract/concrete noun, etc.).

We developed a shallow tree representation that captures the context of the target terms without encoding too much structure (which may prevent generalization). In essence, our tree structure induces behavior somewhat similar to a string kernel. In addition, we can model the tasks by providing specific markup on the generated tree. For example, in our experiments we used the labels EVENT (or equivalently RELATION) and TIMESTAMP to mark our target terms. In order to reduce the complexity of this comparison, we focus on the substring between the event/relation and the time stamp, and the rest of the tree structure is truncated.

Figure 1 illustrates an example of the structure described so far for both lemmas and POS tags (note that the lowest level of the tree contains tokenized items, so their number can differ from the actual words, as in "attorney general"). Similar trees are produced for each level of representation used, and for each instance (i.e., pair of time expression and event/relation). If a sentence contains more than one event/relation, we create separate trees for each of them, which differ in the position of the EVENT/RELATION marks (at level 1 of the tree).

The tree kernel implicitly expands this structure into a number of substructures, allowing us to capture sequential patterns in the data. As we will see, this step provides significant boosts to task performance.

Curiously, using a full-parse syntactic tree as input representation did not help performance. This is in line with our finding that syntactic relations are less important than sequential patterns (see also Section 5.2). Therefore we adopted the string-kernel-like representation illustrated in Figure 1.

Example sentence: "Scores of supporters of detained Egyptian opposition leader Nur demonstrated outside the attorney general's office in Cairo last Saturday, demanding he be freed immediately."

BOW tree: a root labelled BOW, a second level marking the target terms and the terms between them (EVENT, TERM, ..., TIME), TOK nodes below, and the lemmas as leaves (demonstrate, outside, attorney, general, office, in, cairo, last, saturday).

BOP tree: the same structure with POS tags as leaves (VBD, ADV, NNP, NN, IN, NNP, JJ, NNP).

Figure 1: Input Sentence and Tree Kernel Representations for Bag of Words (BOW) and POS tags (BOP)
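SVM-Light-TK reads each training instance as a parenthesised tree string. The sketch below shows one plausible way to serialise the flat BOW tree of Figure 1; the node labels follow the figure, but the exact serialisation used here is an assumption, as the paper does not give it.

    def to_bow_tree(tokens, event_idx, time_idx):
        # tokens: lemmas between (and including) the event and the time expression
        # event_idx / time_idx: sets of token positions covered by the two targets
        nodes = []
        for i, tok in enumerate(tokens):
            if i in event_idx:
                label = "EVENT"
            elif i in time_idx:
                label = "TIME"
            else:
                label = "TERM"
            nodes.append("(%s (TOK %s))" % (label, tok.lower()))
        return "(BOW " + " ".join(nodes) + ")"

    # e.g. "(BOW (EVENT (TOK demonstrate)) (TERM (TOK outside)) ... (TIME (TOK saturday)))"

An analogous function over POS tags, semantic types or feature bundles yields the other tree views of the same instance.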

5 Evaluation

We now apply our models to real-world data and empirically demonstrate their effectiveness at the task of temporal linking. In this section, we describe the data sets that were used for evaluation, the baselines for comparison, the parameter settings, and the results of the experiments.

5.1 Benchmark

We evaluated our approach in 3 different tasks:

1. Linking timestamps and events in the IC domain
2. Linking timestamps and relations in the IC domain
3. Linking events to temporal expressions (TempEval-2, task C)

The first two data sets contained annotations in the intelligence community (IC) domain, i.e., mainly news reports about terrorism. The corpus comprised 169 documents. This dataset has been developed in the context of the machine reading program (MRP) (Strassel et al., 2010). In both cases our goal is to develop a binary classifier to judge whether the event (or relation) overlaps with the time interval denoted by the timestamp. Success of this classification can be measured by precision and recall on annotated data.

We originally considered using accuracy as a measure of performance, but this does not correctly reflect the true performance of the system: given the skewed nature of the data (a much smaller number of positive examples), we could achieve a high accuracy simply by classifying all instances as negative, i.e., not assigning a time stamp at all. We thus decided to report precision, recall and F1. Unless stated otherwise, results were obtained via 10-fold cross-validation (10-CV).

The number of instances (i.e., pairs of event and temporal expression) for each of the different cases listed above was (in brackets the ratio of positive to negative instances):

events: 2046 (505 positive, 1541 negative)
relations: 6526 (1847 positive, 4679 negative)

The size of the relation data set after filtering is 5511 (1847 positive, 3395 negative).

In order to increase the originally lower number of event instances, we made use of the annotated event coreference as a sort of closure to add more instances: if events A and B corefer, and there is a link between A and time expression t, then there is also a link between B and t. This was not explicitly expressed in the data.

For the task at hand, we used gold-standard annotations for timestamps, events and relations. The task was thus not the identification of these objects (a necessary precursor and a difficult task in itself), but the decision as to which events and time expressions could and should be linked.

We also evaluated our system on TempEval-2 (Verhagen et al., 2010) for better comparison to the state of the art. The TempEval-2 data included the task of linking events to temporal expressions (there called task C), using several link types (OVERLAP, BEFORE, AFTER, BEFORE-OR-OVERLAP, OVERLAP-OR-AFTER). This is a bit different from our setting as it required the implementation of a multi-class classifier. Therefore we trained three different binary classifiers (using the same feature set) for the first three of those types (for which there was sufficient training data) and used a one-versus-all strategy to distinguish positive from negative examples. The output of the system is the category with the highest SVM decision score. Since we only use three labels, we incur an error every time the gold label is something else. Note that this is stricter than the evaluation in the actual task, which left contestants with the option of skipping examples their systems could not classify.
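A minimal sketch of this one-versus-all decision; the three classifier objects are placeholders for the binary SVMs trained on the shared feature set.

    def classify_link(instance, classifiers):
        # classifiers: {"OVERLAP": svm1, "BEFORE": svm2, "AFTER": svm3},
        # each exposing a real-valued SVM decision score for the instance
        scores = {label: clf.decision_value(instance)
                  for label, clf in classifiers.items()}
        return max(scores, key=scores.get)   # category with the highest score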
5.2 Baselines

Intuitively, one would expect temporal expressions to be close to the event they denote, or even syntactically related. In order to test this, we applied two baselines. In the first, each temporal expression was linked to the closest event (as measured in token distance). In the second, we attached each temporal expression to its syntactic head, if the head was an event. Results are reported in Figure 2.
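The closest-event baseline amounts to a few lines; token positions on the mentions are assumed to be available.

    def closest_event_baseline(sent):
        links = []
        for t in sent.times:
            if sent.events:
                nearest = min(sent.events,
                              key=lambda e: abs(e.position - t.position))
                links.append((nearest, t))   # link each time to its nearest event
        return links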
While these results are encouraging for our task, it seems at first counter-intuitive that the syntactic baseline does worse than the proximity-based one. It does, however, reveal two facts: events are not always synonymous with syntactic units, and they are not always bound to temporal expressions through direct syntactic links. The latter makes even more sense given that the links can even occur across sentence boundaries. Parsing quality could play a role, yet seems far-fetched as an account of the difference.

More important than syntactic relations seem to be sequential patterns on different levels, a fact we exploit with the different tree representations used (POS tags, NE types, etc.).

For relations, we only applied the closest-relation baseline. Since relations consist of two or more arguments that occur in different, often separated syntactic constituents, a syntactic approach seems futile, especially given our experience with events. Results are reported in Figure 3.

Figure 2: Performance on events. The figure plots precision, recall and F1 (in %) for the BL-parent and BL-closest baselines, the feature-based system, and the feature-based system with tree kernels.

Table 1: Comparison to the best systems in TempEval-2.

System | Accuracy
TRIOS | 65%
this work | 64.5%
JU-CSE, NCSU-indi, TRIPS, USFD2 | all 63%

5.3 Events

Figure 2 shows the improvements of the feature-based approach over the two baselines, and the additional gain obtained by using the tree kernel. Both the features and the tree kernels mainly improve precision, while the tree kernel adds a small boost in recall. It is remarkable, though, that the closest-event baseline has a very high recall value. This suggests that most of the links actually do occur between items that are close to one another. For a possible explanation of the low precision value, see the error analysis (Section 5.5).

Using a two-tailed t-test, we compute the significance of the difference between the F1-scores. Both the feature-based and the tree kernel improvements are statistically significant at p < 0.001 over the baseline scores.

Table 1 compares the performance of our system to the state-of-the-art systems on the TempEval-2 data, task C, showing that our approach is very competitive. The best systems there used sequential models. We attribute the competitive nature of our results to the use of tree kernels, which enables us to make use of contextual information.

5.4 Relations

In general, performance for relations is not as high as for events (see Figure 3). The reason here is two-fold: relations consist of two (or more) elements, which can be in various positions with respect to one another and the temporal expression, and each relation can be expressed in a number of different ways.

Figure 3: Performance on relations/fluents. The figure plots precision, recall and F1 (in %) for the BL-closest baseline, the feature-based system, and the feature-based system with tree kernels.

Again, we perform significance tests on the difference in F1 scores and find that our improvements over the baseline are statistically significant at p < 0.001. The improvement of the tree kernel over the feature-based approach, however, is not statistically significant at the same value.

The learning curve over parts of the training data (shown here for relations as an example, Figure 4) [2] indicates that there is another advantage to using tree kernels: the approach can benefit from more data. This is conceivably because the more examples the kernel gets, the more common subtrees it can find in the various representations, while the feature space instead accumulates more instances that invalidate the expressiveness of the features (i.e., it encounters positive and negative instances that have very similar feature vectors). The curve suggests that tree kernels could yield even better results with more data, while there is little to no expected gain using only features.

[2] The learning curve for events looks similar and is omitted due to space constraints.

Figure 4: Learning curves for relation models. The figure plots the F1 score against the percentage of training data used, for the feature-based system and the system with tree kernels.

5.5 Error Analysis

Examining the misclassified examples in our data, we find that both the feature-based and the tree-kernel approaches struggle to correctly classify examples where the time expression and the event/relation are immediately adjacent but unrelated, as in "the man arrested last Tuesday told the police ...", where "last Tuesday" modifies "arrested". This limits the amount of context that is available to the tree kernels, since we truncate the tree representations to the words between those two elements. This case closely resembles the problem we see in the closest-event/relation baseline, which, as we have seen, does not perform too well. In this case, the incorrect event ("told") is as close to the time expression as the correct one ("arrested"), resulting in a false positive that affects precision. Features capturing the order of the elements do not seem to help here, since the elements can be arranged in any order (i.e., temporal expression before or after the event/relation). The only way to solve this problem would be to include additional information about whether a time expression is already attached to another event/relation.

5.6 Ablations

To quantify the utility of each tree representation, we also performed all-but-one ablation tests, i.e., we left out each of the tree representations in turn, ran 10-fold cross-validation on the data and observed the effect on F1. The larger the loss in F1, the more informative the left-out representation. We performed ablations for both events and relations, and found that the ranking of the representations is the same for both.

For events and relations alike, leaving out POS trees has the greatest effect on F1, followed by the feature-bundle representation. The lemma and semantic type representations have less of an impact. We hypothesize that the former two capture underlying regularities better by representing different words with the same label. Lemmas, in turn, are too numerous to form many recurring patterns, and the semantic type, while having a smaller label alphabet, does not assign a label to every word, thus creating a very sparse representation that picks up more noise than signal.

In preliminary tests, we also used annotated dependency trees as input to the tree kernel, but found that performance improved when they were left out. This is at odds with work that clearly showed the value of syntactic tree kernels (Mirroshandel et al., 2011). We identify two potential causes: either our setup was not capable of correctly capturing and exploiting the information
from the dependency trees, or our formulation of the task was not amenable to it. We did not investigate this further, but leave it to future work.

6 Conclusion and Future Work

We cast the problem of linking events and relations to temporal expressions as a classification task using a combination of features and tree kernels, with probabilistic type filtering. Our main contributions are:

- We showed that within-sentence temporal links for both events and relations can be approached with a common strategy.

- We developed flat tree representations and showed that these produce considerable gains, with significant improvements over different baselines.

- We applied our technique without great adjustments to an existing data set and achieved competitive results.

- Our best systems achieve F1 scores of 0.76 on events and 0.72 on relations, and are effective at the task of temporal linking.

We developed the models as part of a machine reading system and are currently evaluating it in an end-to-end task. Following tasks proposed in TempEval-2, we plan to use our approach for across-sentence classification, as well as a similar model for linking entities to the document creation date.

Acknowledgements

We would like to thank Alessandro Moschitti for his help with the tree kernel setup, and the reviewers who supplied us with very constructive feedback. Research supported in part by Air Force Contract FA8750-09-C-0172 under the DARPA Machine Reading Program.

References

Ken Barker, Bhalchandra Agashe, Shaw-Yi Chaw, James Fan, Noah Friedland, Michael Glass, Jerry Hobbs, Eduard Hovy, David Israel, Doo Soon Kim, Rutu Mulkar-Mehta, Sourabh Patwardhan, Bruce Porter, Dan Tecuci, and Peter Yeh. 2007. Learning by reading: A prototype system, performance baseline and lessons learned. In Proceedings of the 22nd National Conference for Artificial Intelligence, Vancouver, Canada, July.

Branimir Boguraev and Rie Kubota Ando. 2005. TimeML-compliant text analysis for temporal reasoning. In Proceedings of IJCAI, volume 5, pages 997-1003. IJCAI.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. Pages 789-797. Association for Computational Linguistics.

Peter Clark and Phil Harrison. 2010. Machine reading as a process of partial question-answering. In Proceedings of the NAACL HLT Workshop on Formalisms and Methodology for Learning by Reading, Los Angeles, CA, June.

George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The automatic content extraction program: tasks, data and evaluation. In Proceedings of the LREC Conference, Canary Islands, Spain, July.

Oren Etzioni, Michele Banko, and Michael Cafarella. 2007. Machine reading. In Proceedings of the AAAI Spring Symposium Series, Stanford, CA, March.

Elena Filatova and Eduard Hovy. 2001. Assigning time-stamps to event-clauses. In Proceedings of the Workshop on Temporal and Spatial Information Processing, volume 13, pages 1-8. Association for Computational Linguistics.

Claudio Giuliano, Alfio Massimiliano Gliozzo, and Carlo Strapparava. 2009. Kernel methods for minimally supervised WSD. Computational Linguistics, 35(4).

Seyed A. Mirroshandel, Mahdy Khayyamian, and Gholamreza Ghassem-Sani. 2011. Syntactic tree kernels for event-time temporal relation learning. Human Language Technology: Challenges for Computer Science and Linguistics, pages 213-223.

Alessandro Moschitti. 2004. A study on convolution kernels for shallow semantic parsing. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, page 335. Association for Computational Linguistics.

Alessandro Moschitti. 2006. Making tree kernels practical for natural language learning. In Proceedings of EACL, volume 6.

Feng Pan, Rutu Mulkar, and Jerry R. Hobbs. 2006. Learning event durations from event descriptions. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 393-400. Association for Computational Linguistics.

James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, and Marcia Lazo. 2003. The TIMEBANK Corpus. In Proceedings of Corpus Linguistics 2003, pages 647-656.

John Shawe-Taylor and Nello Christianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.

Stephanie Strassel, Dan Adams, Henry Goldberg, Jonathan Herr, Ron Keesing, Daniel Oblinger, Heather Simpson, Robert Schrag, and Jonathan Wright. 2010. The DARPA Machine Reading Program: Encouraging linguistic and reasoning research with a series of reading tasks. In Proceedings of LREC 2010.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer, New York, NY.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. 2007. SemEval-2007 task 15: TempEval temporal relation identification. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 75-80. Association for Computational Linguistics.

Marc Verhagen, Roser Sauri, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 57-62. Association for Computational Linguistics.
Compensating for Annotation Errors in Training a Relation Extractor

Bonan Min (New York University, 715 Broadway, 7th floor, New York, NY 10003 USA; min@cs.nyu.edu)
Ralph Grishman (New York University, 715 Broadway, 7th floor, New York, NY 10003 USA; grishman@cs.nyu.edu)

Abstract

The well-studied supervised Relation Extraction algorithms require training data that is accurate and has good coverage. To obtain such a gold standard, the common practice is to do independent double annotation followed by adjudication. This takes significantly more human effort than annotation done by a single annotator. We do a detailed analysis on a snapshot of the ACE 2005 annotation files to understand the differences between single-pass annotation and the more expensive nearly three-pass process, and then propose an algorithm that learns from the much cheaper single-pass annotation and achieves a performance on a par with the extractor trained on multi-pass annotated data. Furthermore, we show that given the same amount of human labor, the better way to do relation annotation is not to annotate with high-cost quality assurance, but to annotate more.

1. Introduction

Relation Extraction aims at detecting and categorizing semantic relations between pairs of entities in text. It is an important NLP task that has many practical applications such as answering factoid questions, building knowledge bases and improving web search.

Supervised methods for relation extraction have been studied extensively since rich annotated linguistic resources, e.g. the Automatic Content Extraction [1] (ACE) training corpus, were released. We will give a summary of related methods in section 2. Those methods rely on accurate and complete annotation. To obtain high-quality annotation, the common wisdom is to let two annotators independently annotate a corpus, and then ask a senior annotator to adjudicate the disagreements [2]. This annotation procedure roughly requires 3 passes [3] over the same corpus. Therefore it is very expensive. The ACE 2005 annotation of relations was conducted in this way.

In this paper, we analyzed a snapshot of ACE training data and found that each annotator missed a significant fraction of relation mentions and annotated some spurious ones. We found that it is possible to separate most missing examples from the vast majority of true-negative unlabeled examples, and, in contrast, that most of the relation mentions that are adjudicated as incorrect contain useful expressions for learning a relation extractor. Based on this observation, we propose an algorithm that purifies negative examples and applies transductive inference to utilize missing examples during the training process on the single-pass annotation. Results show that the extractor trained on single-pass annotation with the proposed algorithm has a performance that is close to an extractor trained on the 3-pass annotation. We further show that the proposed algorithm trained on a single-pass annotation of the complete set of documents has a higher performance than an extractor trained on 3-pass annotation of 90% of the documents in the same corpus, although the effort of doing a single-pass annotation over the entire set costs less than half that of doing 3 passes over 90% of the documents. From the perspective of learning a high-performance relation extractor, this suggests that a better way to do relation annotation is not to annotate with high-cost quality assurance, but to annotate more.

[1] http://www.itl.nist.gov/iad/mig/tests/ace/
[2] The senior annotator also found some missing examples, as shown in figure 1.
[3] In this paper, we will assume that the adjudication pass has a cost similar to each of the two first passes. The adjudicator may not have to look at as many sentences as an annotator, but he is required to review all instances found by both annotators. Moreover, he has to be more skilled and may have to spend more time on each instance to be able to resolve disagreements.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 194-203, Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics

2. Background

2.1 Supervised Relation Extraction

One of the most studied relation extraction tasks is the ACE relation extraction evaluation sponsored by the U.S. government. ACE 2005 defined 7 major entity types, such as PER (Person), LOC (Location), ORG (Organization). A relation in ACE is defined as an ordered pair of entities appearing in the same sentence which expresses one of the predefined relations. ACE 2005 defines 7 major relation types and more than 20 subtypes. Following previous work, we ignore subtypes in this paper and only evaluate on types when reporting relation classification performance. Types include General-affiliation (GEN-AFF), Part-whole (PART-WHOLE), Person-social (PER-SOC), etc. ACE provides a large corpus which is manually annotated with entities (with coreference chains between entity mentions annotated), relations, events and values. Each mention of a relation is tagged with a pair of entity mentions appearing in the same sentence as its arguments. More details about the ACE evaluation are on the ACE official website.

Given a sentence s and two entity mentions arg1 and arg2 contained in s, a candidate relation mention r with argument arg1 preceding arg2 is defined as r = (s, arg1, arg2). The goal of Relation Detection and Classification (RDC) is to determine whether r expresses one of the types defined, and if so, to classify it into one of the types.

Supervised learning treats RDC as a classification problem and solves it with supervised machine learning algorithms such as MaxEnt and SVM. There are two commonly used learning strategies (Sun et al., 2011). Given an annotated corpus, one could apply a flat learning strategy, which trains a single multi-class classifier on training examples labeled as one of the relation types or not-a-relation, and applies it to determine the type, or to output not-a-relation, for each candidate relation mention during testing. The examples of each type are the relation mentions that are tagged as instances of that type, and the not-a-relation examples are constructed from pairs of entities that appear in the same sentence but are not tagged as any of the types. Alternatively, one could apply a hierarchical learning strategy, which trains two classifiers: a binary classifier RD for relation detection and a multi-class classifier RC for relation classification. RD is trained by grouping tagged relation mentions of all types as positive instances and using all the not-a-relation cases (same as described above) as negative examples. RC is trained on the annotated examples with their tagged types. During testing, RD is applied first to identify whether an example expresses some relation; RC is then applied to determine the most likely type, but only if the example is detected as correct by RD.
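A schematic rendering of this hierarchical strategy; rd and rc stand for the trained detection and classification models, and the method names are placeholders rather than an actual API.

    def hierarchical_rdc(candidates, rd, rc):
        predictions = []
        for r in candidates:                              # r = (s, arg1, arg2)
            if rd.detect(r):                              # binary relation detection first
                predictions.append((r, rc.classify(r)))   # then choose a relation type
            else:
                predictions.append((r, "not-a-relation"))
        return predictions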
State-of-the-art supervised methods for relation extraction also differ from each other in their data representation. Given a relation mention, feature-based methods (Miller et al., 2000; Kambhatla, 2004; Boschee et al., 2005; Grishman et al., 2005; Zhou et al., 2005; Jiang and Zhai, 2007; Sun et al., 2011) extract a rich list of structural, lexical, syntactic and semantic features to represent it; in contrast, kernel-based methods (Zelenko et al., 2003; Bunescu and Mooney, 2005a; Bunescu and Mooney, 2005b; Zhao and Grishman, 2005; Zhang et al., 2006a; Zhang et al., 2006b; Zhou et al., 2007; Qian et al., 2008) represent each instance with an object such as an augmented token sequence or a parse tree, and use a carefully designed kernel function, e.g. a subsequence kernel (Bunescu and Mooney, 2005b) or a convolution tree kernel (Collins and Duffy, 2001), to calculate their similarity. These objects are usually augmented with features such as semantic features.

In this paper, we use the hierarchical learning strategy since it simplifies the problem by letting us focus on relation detection only. The relation classification stage remains unchanged and we will show that it benefits from improved detection. For experiments on both relation detection and relation classification, we use SVM [4] (Vapnik 1998) as the learning algorithm, since it can be extended to support transductive inference as discussed in section 4.3. However, for the analysis in section 3.2 and the purification preprocessing steps in section 4.2, we use a MaxEnt [5] model since it outputs probabilities [6] for its predictions. For the choice of features, we use the full set of features from Zhou et al. (2005) since it is reported to have state-of-the-art performance (Sun et al., 2011).

[4] SVM-Light is used. http://svmlight.joachims.org/
[5] The OpenNLP MaxEnt package is used. http://maxent.sourceforge.net/about.html
[6] SVM also outputs a value associated with each prediction. However, this value cannot be interpreted as a probability.

2.2 ACE 2005 annotation

The ACE 2005 training data contains 599 articles from newswire, broadcast news, weblogs, usenet newsgroups/discussion forums, conversational telephone speech and broadcast conversations. The annotation process is conducted as follows: two annotators working independently annotate each article and complete all annotation tasks (entities, values, relations and events). After the two annotators have both finished annotating a file, all discrepancies are adjudicated by a senior annotator. This results in a high-quality annotation file. More details can be found in the documentation of ACE 2005 Multilingual Training Data V3.0.

Since the final release of the ACE training corpus only contains the final adjudicated annotations, in which all traces of the two first-pass annotations are removed, we use a snapshot of almost-finished annotation, ACE 2005 Multilingual Training Data V3.0, for our analysis. In the remainder of this paper, we will call the two independent first passes of annotation fp1 and fp2. The higher-quality data produced by merging fp1 and fp2 and then having disagreements adjudicated by the senior annotator is called adj. From this corpus, we removed the files that have not been completed for all three passes. On the final corpus consisting of 511 files, we can differentiate the annotations on which the three annotators have agreed and disagreed.

A notable fact about ACE relation annotation is that it is done with arguments drawn from the list of annotated entity mentions. For example, in a relation mention "tyco's ceo and president dennis kozlowski", which expresses an EMP-ORG relation, the two arguments "tyco" and "dennis kozlowski" must have been tagged as entity mentions previously by the annotator. Since fp1 and fp2 are done on all tasks independently, their disagreement on entity annotation will be propagated to the relation annotation; thus we need to deal with these cases specifically.

3. Analysis of data annotation

3.1 General statistics

As discussed in section 2, relation mentions are annotated with entity mentions as arguments, and the lists of annotated entity mentions vary in fp1, fp2 and adj. To estimate the impact propagated from entity annotation, we first calculate the ratio of overlapping entity mentions between the entities annotated in fp1/fp2 and adj. We found that fp1/fp2 each agrees with adj on around 89% of the entity mentions. Following up, we checked the relation mentions [7] from fp1 and fp2 against the adjudicated list of entity mentions from adj and found that 682 and 665 relation mentions respectively have at least one argument which doesn't appear in the list of adjudicated entity mentions.

Given the list of relation mentions with both arguments appearing in the list of adjudicated entity mentions, figure 1 shows the inter-annotator agreement of the ACE 2005 relation annotation. In this figure, the three circles represent the lists of relation mentions in fp1, fp2 and adj, respectively.

Figure 1. Inter-annotator agreement of ACE 2005 relation annotation. Numbers are the distinct relation mentions whose arguments are both in the list of adjudicated entity mentions. (The original is a three-circle Venn diagram over fp1, fp2 and adj; the region counts shown in it are 645, 47, 538, 1486, 3065, 1525 and 383.)

It shows that each annotator missed a significant number of relation mentions annotated by the other. Considering that we removed 682/665 relation mentions from fp1/fp2 because we generated this figure based on the list of adjudicated entity mentions, we estimate that fp1 and fp2 each missed around 18.3-28.5% [8] of the relation mentions. This clearly shows that both of the annotators missed a significant fraction of the relation mentions. They also annotated some spurious relation mentions (as adjudicated in adj), although the fraction is smaller (close to 10% of all relation mentions in adj).

[7] This is done by selecting the relation mentions whose arguments are both in the list of adjudicated entity mentions.
[8] We calculate the lower bound by assuming that the 682 relation mentions removed from fp1 are found in fp2, although with different argument boundaries and headwords tagged. The upper bound is calculated by assuming that they are all irrelevant and erroneous relation mentions.

The ACE 2005 relation annotation guidelines (ACE English Annotation Guidelines for Relations, version 5.8.3) defined 7 syntactic classes and the "other" class. We plot the distribution of syntactic classes of the annotated relations in figure 2 (3 of the classes, accounting together for less than 10% of the cases, are omitted), together with the "other" class. It seems that it is generally easier for the annotators to find and agree on relation mentions of the type Preposition/PreMod/Possessives, but harder to find and agree on the ones belonging to Verbal and Other. The definitions and examples of these syntactic classes can be found in the annotation guidelines.

In the following sections, we will show the analysis on fp1 and adj, since the result is similar for fp2.

Figure 2. Percentage of examples of major syntactic classes.

3.2 Why the differences?

To understand what causes the missing annotations and the spurious ones, we need methods to find out how similar/different the false positives are to the true positives, and also how similar/different the false negatives (missing annotations) are to the true negatives. If we adopt a good similarity metric, which captures the structural, lexical and semantic similarity between relation mentions, this analysis will help us to understand the similarity/difference from an extraction perspective.

We use a state-of-the-art feature space (Zhou et al., 2005) to represent examples (including all correct examples, erroneous ones and untagged examples) and use MaxEnt as the weight learning model, since it shows competitive performance in relation extraction (Jiang and Zhai, 2007) and outputs probabilities associated with each prediction. We train a MaxEnt model for relation detection on true positives and true negatives, which respectively are the subset of correct examples annotated by fp1 (and adjudicated as correct) and negative examples that are not annotated in adj, and use it to make predictions on the mixed pool of correct examples, missing examples and spurious ones.

To illustrate how distinguishable the missing examples (false negatives) are from the true negative ones, 1) we apply the MaxEnt model to both false negatives and true negatives, 2) put them together and rank them by the model-predicted probabilities of being positive, and 3) calculate their relative rank in this pool. We plot the cumulative distribution of frequency (CDF) of the ranks (as percentages in the mixed pools) of false negatives in figure 3. We took similar steps for the spurious ones (false positives) and plot them in figure 3 as well (however, they are ranked by model-predicted probabilities of being negative).
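The ranking analysis can be reproduced along the following lines (a sketch only; prob_positive stands for the trained MaxEnt detector's probability of the positive class):

    def relative_rank_cdf(false_negatives, true_negatives, prob_positive):
        # Mix the two sets and rank by model-predicted probability of being positive.
        pool = [(prob_positive(x), True) for x in false_negatives] + \
               [(prob_positive(x), False) for x in true_negatives]
        pool.sort(key=lambda item: -item[0])
        # Relative rank (as a fraction of the pool) of every false negative.
        ranks = [i / len(pool) for i, (_, is_fn) in enumerate(pool, start=1) if is_fn]
        return sorted(ranks)    # support points of the CDF plotted in figure 3

The same procedure, ranking by the probability of being negative, yields the curve for the false positives.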

Figure 3: Cumulative distribution of frequency (CDF) of the relative ranking of the model-predicted probability of being positive for false negatives, in a pool mixed of false negatives and true negatives; and the CDF of the relative ranking of the model-predicted probability of being negative for false positives, in a pool mixed of false positives and true positives.

For false negatives, it shows a highly skewed distribution in which around 75% of the false negatives are ranked within the top 10%. That means the missing examples are lexically, structurally or semantically similar to correct examples, and are distinguishable from the true negative examples. However, the distribution of false positives (spurious examples) is close to uniform (a flat curve), which means they are generally indistinguishable from the correct examples.

3.3 Categorize annotation errors

The automatic method shows that the errors (spurious annotations) are very similar to the correct examples, but it provides little clue as to why that is the case. To understand their causes, we sampled 65 examples from fp1 (10% of the 645 errors), read the sentences containing these erroneous relation mentions and compared them to the correct relation mentions in the same sentence; we categorized these examples and show them in table 1.
Table 1. Categories of spurious relation mentions in fp1 (on a sample of 10% of relation mentions), ranked by the percentage of the examples in each category. In the sample text, the head words of the first and second arguments were highlighted in the original table; the notes give similar examples annotated in adj for comparison.

Category: Duplicate relation mention for coreferential entity mentions. Percentage: 49.2%. Relation type: ORG-AFF. Sample text: "his budding friendship with US President George W. Bush in the face of". Note: a similar mention, "his budding friendship with US President George W. Bush in the face of", is annotated in adj.

Category: Correct. Percentage: 20%. Relation type: PHYS (symmetric relation). Sample text: "Hundreds of thousands of demonstrators took to the streets in Britain". Also relation type PER-SOC, sample text: "The dead included the quack doctor, 55-year-old Nityalila Naotia, his teenaged son and"; the similar adj mention is "The dead included the quack doctor, 55-year-old Nityalila Naotia, his teenaged son".

Category: Argument not in list. Percentage: 15.4%. Relation type: PER-SOC. Sample text: "Putin had even secretly invited British Prime Minister Tony Blair, Bush's staunchest backer in the war on Iraq".

Category: Violate reasonable reader rule. Percentage: 6.2%. Relation type: PHYS. Sample text: ""The amazing thing is they are going to turn San Francisco into ground zero for every criminal who wants to profit at their chosen profession", Paredes said."

Category: Errors. Percentage: 6.1%. Relation type: PART-WHOLE. Sample text: "a likely candidate to run Vivendi Universal's entertainment unit in the United States" (arguments are tagged reversed). Also relation type PART-WHOLE, sample text: "Khakamada argued that the United States would also need Russia's help 'to make the new Iraqi government seem legitimate'" (relation type error).

Category: Illegal promotion through blocked categories. Percentage: 3%. Relation type: PHYS. Sample text: "Up to 20,000 protesters thronged the plazas and streets of San Francisco, where"; a similar mention, "Up to 20,000 protesters thronged the plazas and streets of San Francisco, where", is annotated in adj.
The most common type of error is duplicate relation mention for coreferential entity mentions. The first row of table 1 shows an example, in which there is a relation ORG-AFF tagged between "US" and "George W. Bush" in adj. Because "President" and "George W. Bush" are coreferential, the example <US, President> from fp1 is adjudicated as incorrect. This shows that if a relation is expressed repeatedly across relation mentions whose arguments are coreferential, the adjudicator only tags one of the relation mentions as correct, although the other is correct too. This shares the same principle with another type of error, illegal promotion through blocked categories [9], as defined in the annotation guideline. The second largest category is correct, by which we mean that the example is a correct relation mention and the adjudicator made a mistake. The third largest category is argument not in list, by which we mean that at least one of the arguments is not in the list of adjudicated entity mentions.

[9] For example, in the sentence "Smith went to a hotel in Brazil", (Smith, hotel) is a taggable PHYS relation but (Smith, Brazil) is not, because to get the second relationship, one would have to promote Brazil through hotel. For the precise definition of the annotation rules, please refer to the ACE (Automatic Content Extraction) English Annotation Guidelines for Relations, version 5.8.3.

Based on Table 1, we can see that as many as 72%-88% of the examples which are adjudicated as incorrect are actually correct if viewed from a relation learning perspective, since most of them contain informative expressions for tagging relations. The annotation guideline is designed to ensure high quality while not imposing too much burden on human annotators. To reduce annotation effort, it defined rules such as illegal promotion through blocked categories. The annotators' practice suggests that they are following another rule: not to annotate duplicate relation mentions for coreferential entity mentions. This follows the same principle of reducing annotation effort but is not explicitly stated in the guideline: to avoid propagation of a relation through a coreference chain. However, these examples are useful for learning more ways to express a relation. Moreover, even the erroneous examples (shown in table 1 as violate reasonable reader rule and errors) mostly have some level of structural or semantic similarity to the targeted relation. Therefore, it is very hard to distinguish them without human proofreading.

3.4 Why missing annotations and how many examples are missing?

For the large number of missing annotations, there are a couple of possible reasons. One reason is that it is generally easier for a human annotator to annotate correctly given a well-defined guideline, but it is hard to ensure completeness, especially for a task like relation extraction. Furthermore, the ACE 2005 annotation guideline defines more than 20 relation subtypes. These many subtypes make it hard for an annotator to keep all of them in mind while doing the annotation, and thus it is inevitable that some examples are missed.

Here we proceed to approximate the number of missing examples given limited knowledge. Let each annotator annotate n examples and assume that each pair of annotators agrees on a certain fraction p of the examples. Assuming the examples are equally likely to be found by an annotator, the total number of unique examples found by k annotators is n + n(1-p) + n(1-p)^2 + ... + n(1-p)^(k-1) = n(1 - (1-p)^k)/p. If we had an infinite number of annotators (k approaching infinity), the total number of unique examples would be n/p, which is the upper bound of the total number of examples. In the case of the ACE 2005 relation mention annotation, since the two annotators annotate around 4500 examples and they agree on 2/3 of them, the total number of all positive examples is around 6750. This is close to the number of relation mentions in the adjudicated list: 6459. Here we assume the adjudicator is doing a more complex task than an annotator: resolving the disagreements and completing the annotation (as shown in figure 1). The assumptions of the calculation are a little crude but reasonable given the limited number of passes of annotation we have. Recent research (Ji et al., 2010) shows that, by adding annotators for IE tasks, the merged annotation tends to converge after having 5 annotators. To understand the annotation behavior better, in particular whether annotation will converge after adding a few annotators, more passes of annotation need to be collected. We leave this as future work.

4. Relation extraction with low-cost annotation

4.1 Baseline algorithm

To see whether a single-pass annotation is useful for relation detection and classification, we did 5-fold cross validation (5-fold CV) with each of fp1, fp2 and adj as the training set, and tested on adj. The experiments are done with the same 511 documents we used for the analysis. As shown in table 2, we did 5-fold CV on adj for experiment 3. For fairness, we use settings similar to 5-fold CV for experiments 1 and 2. Take experiment 1 as an example: we split both fp1 and adj into 5 folds, use 4 folds from fp1 as training data and 1 fold from adj as testing data, and do one train-test cycle. We rotate the folds (both training and testing) and repeat 5 times. The final results are averaged over the 5 runs. Experiment 2 was conducted similarly. In the remainder of the paper, 5-fold CV experiments are all conducted in this way.

Table 2. Performance of RDC trained on fp1/fp2/adj and tested on adj.

Exp # | Training data | Testing data | Detection P / R / F1 (%) | Classification P / R / F1 (%)
1 | fp1 | adj | 83.4 / 60.4 / 70.0 | 75.7 / 54.8 / 63.6
2 | fp2 | adj | 83.5 / 60.5 / 70.2 | 76.0 / 55.1 / 63.9
3 | adj | adj | 80.4 / 69.7 / 74.6 | 73.4 / 63.6 / 68.2

Table 2 shows that a relation tagger trained on the single-pass annotated data fp1 performs worse than the one trained on the merged and adjudicated data adj, with an F measure 4.6 points lower in relation detection and 4.6 points lower in relation classification. For detection, precision on fp1 is 3 points higher than on adj, but recall is much lower (by close to 10 points). The recall difference shows that the missing annotations contain expressions that can help to find more correct examples during testing. The small precision difference indirectly shows that the spurious mentions in fp1 (as adjudicated) do not hurt precision. Performance on classification shows a similar trend because the relation classifier takes the examples predicted as correct by the detector as its input; therefore, if there is an error, it gets propagated to this stage. Table 2 also shows similar performance differences between fp2 and adj.

In the remainder of this paper, we will discuss a few algorithms to improve a relation tagger trained on single-pass annotated data [10]. Since we already showed that most of the spurious annotations are not actually errors from an extraction perspective, and table 2 shows that they do not hurt precision, we will only focus on utilizing the missing examples, in other words, on training with an incomplete annotation.

[10] We only use fp1 and adj in the following experiments because we observed that fp1 and fp2 are similar in general in the analysis, though a fraction of the annotation in fp1 and fp2 is different. Moreover, algorithms trained on them show similar performance.
4.2 Purify the set of negative examples

As discussed in section 2, traditional supervised methods find all pairs of entity mentions that appear within a sentence, and then use the pairs that are not annotated as relation mentions as the negative examples for the purpose of training a relation detector. This relies on the assumption that the annotators annotated all relation mentions and missed no (or very few) examples. However, this is not true when training on a single-pass annotation, in which a significant portion of relation mentions are left unannotated. If this scheme is applied, all of the correct pairs which the annotators missed fall into the negative category. Therefore, we need a way to purify the set of negative examples obtained by this conventional approach.

Li and Liu (2003) focus on classifying documents with only positive examples. Their algorithm initially sets all unlabeled data to be negative and trains a Rocchio classifier, selects negative examples which are closer to the negative centroid than to the positive centroid as the purified negative examples, and then retrains the model. Their algorithm performs well for text classification. It is based on the assumption that there are fewer unannotated positive examples than negative ones in the unlabeled set, so true negative examples still dominate the set of noisy negative examples in the purification step.

Based on the same assumption, our purification process consists of the following steps:

1) Use annotated relation mentions as positive examples; construct all possible relation mentions that are not annotated, and initially set them to be negative. We call this noisy data set D.
2) Train a MaxEnt relation detection model Mdet on D.
3) Apply Mdet to all unannotated examples, and rank them by the model-predicted probabilities of being positive.
4) Remove the top N examples from D.

These preprocessing steps result in a purified data set D'. We can use D' for the normal training process of a supervised relation extraction algorithm.

The algorithm is similar to that of Li and Liu (2003). However, we drop a few noisy examples instead of choosing a small purified subset, since we have relatively few false negatives compared to the entire set of unannotated examples. Moreover, after step 3, most false negatives are clustered within the small region of top-ranked examples which have a high model-predicted probability of being positive. The intuition is similar to what we observed in figure 3 for false negatives, since we also observed a very similar distribution using the model trained with the noisy data. Therefore, we can purify the negatives by removing the examples in this noisy subset.

However, the false negatives are still mixed with true negatives; for example, slightly more than half of the top 2000 examples are still true negatives. Thus we cannot simply flip their labels and use them as positive examples. In the following section, we will use them in the form of unlabeled examples to help train a better model.
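A compact sketch of these four steps (train_maxent and prob_positive are placeholders for the MaxEnt trainer and its probability output; N = 2000 in the experiments reported in section 5):

    def purify(positives, unannotated, train_maxent, n_remove=2000):
        # Step 1: noisy data set D = annotated positives + all unannotated pairs as negatives
        D = [(x, 1) for x in positives] + [(x, -1) for x in unannotated]
        # Step 2: train a MaxEnt detection model on the noisy data
        m_det = train_maxent(D)
        # Step 3: rank unannotated examples by predicted probability of being positive
        ranked = sorted(unannotated, key=lambda x: -m_det.prob_positive(x))
        # Step 4: drop the top N, which are dense in false negatives
        removed = ranked[:n_remove]
        removed_set = set(removed)
        purified_negatives = [x for x in unannotated if x not in removed_set]
        return purified_negatives, removed   # negatives of D', plus the dropped examples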
4.3 Transductive inference on unlabeled examples

Transductive SVM (Vapnik, 1998; Joachims, 1999) is a semi-supervised learning method which learns a model from a data set consisting of both labeled and unlabeled examples. Compared to its popular antecedent, the SVM, it also learns a maximum-margin classification hyperplane, but it additionally forces the hyperplane to separate a set of unlabeled data with a large margin. The optimization function of the Transductive SVM (TSVM) is the following:

Figure 4. TSVM optimization function for the non-separable case (Joachims, 1999). (The equation is given as an image in the original and is not reproduced here.)
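For reference, the non-separable TSVM objective as formulated by Joachims (1999) has the following form; this is our transcription of that formulation rather than a copy of the figure, so treat it as an approximation of what the figure shows:

    \min_{\vec{y}^{*},\, \vec{w},\, b,\, \vec{\xi},\, \vec{\xi}^{*}} \;
        \tfrac{1}{2}\|\vec{w}\|^{2}
        + C \sum_{i=1}^{n} \xi_{i}
        + C^{*} \sum_{j=1}^{k} \xi_{j}^{*}
    \quad \text{subject to} \quad
        y_{i}(\vec{w}\cdot\vec{x}_{i} + b) \ge 1 - \xi_{i}, \qquad
        y_{j}^{*}(\vec{w}\cdot\vec{x}_{j}^{*} + b) \ge 1 - \xi_{j}^{*}, \qquad
        \xi_{i} \ge 0, \; \xi_{j}^{*} \ge 0,

where the x_i are the labeled training examples, the x*_j are the unlabeled examples whose labels y*_j are assigned during optimization, and C and C* trade off the margin against the slack on the labeled and unlabeled parts respectively.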
predicted probabilities of being positive, examples to improve supervised learning. As
4) Remove the top N examples from D. shown in section 3, a significant number of
These preprocessing steps result in a purified relation mentions are missing from the single-
data set . We can use for the normal pass annotation data. Although it is not possible
to find all missing annotations without human
and fp2 is different. Moreover, algorithms trained on them effort, we can improve the model by further
show similar performance.

200
utilizing the fact that some unannotated examples +tSVM: First, the same purification process of
should have been annotated. +purify is applied. Then we follow the steps
The purification process discussed in the described in section 4.3 to construct the set of
previous section removes N examples which unlabeled examples, and set all the rest of
have a high density of false negatives. We further purified negative examples to be negative.
utilize the N examples as follows: Finally, we train TSVM on both labeled and
1) Construct a training corpus from unlabeled data and replace the relation detection
by taking a random sample 11 of N*(1- in the RDC algorithm. The relation classification
p)/p (p is the ratio of annotated examples to is unchanged.
all examples; p=0.05 in fp1) negatively Table 3 shows the results. All experiments are
labeled examples in and setting them to done with 5-fold cross validation 13 using testing
be unlabeled. In addition, the N examples data from adj. The first three rows show
removed by the purification process are added experiments trained on fp1, and the last row
back as unlabeled examples. (ADJ) shows the unmodified RDC algorithm
2) Train TSVM on . trained on adj for comparison. The purification
of negative examples shows significant
The second step trained a model which
performance gain, 3.7% F1 on relation detection
replaced the detection model in the hierarchical
and 3.4% on relation classification. The precision
detection-classification learning scheme we used.
decreases but recall increases substantially since
We will show in the next section that this
the missing examples are not treated as
improves the model.
5. Experiments

Experiments were conducted over the same set of documents on which we did the analysis: the 511 documents which have completed annotation in all of fp1, fp2 and adj from the ACE 2005 Multilingual Training Data V3.0. To reemphasize, we apply the hierarchical learning scheme and we focus on improving relation detection while keeping relation classification unchanged (results show that its performance is improved because of the improved detection). We use SVM as our learning algorithm with the full feature set from Zhou et al. (2005).
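As background for this setup, the sketch below shows the shape of the hierarchical detection-then-classification (RDC) pipeline with SVM classifiers. It is an illustration rather than the authors' implementation: feature extraction (the Zhou et al. (2005) feature set), corpus handling, and the document-level 5-fold split are omitted.

from sklearn.svm import LinearSVC

class HierarchicalRDC:
    """Stage 1 detects whether a candidate mention pair expresses a relation;
    stage 2 assigns a relation type only to the pairs the detector accepts."""

    def __init__(self):
        self.detector = LinearSVC()      # relation vs. no relation
        self.classifier = LinearSVC()    # relation type, trained on positives only

    def fit(self, X, types):
        # types[i] is the relation type string, or None for a negative example.
        has_rel = [t is not None for t in types]
        self.detector.fit(X, has_rel)
        X_pos = [x for x, t in zip(X, types) if t is not None]
        y_pos = [t for t in types if t is not None]
        self.classifier.fit(X_pos, y_pos)
        return self

    def predict(self, X):
        has_rel = self.detector.predict(X)
        labels = self.classifier.predict(X)
        return [lab if d else None for d, lab in zip(has_rel, labels)]

# Usage (sketch): split at the document level into 5 folds, featurize the
# candidate mention pairs, then
#   model = HierarchicalRDC().fit(X_train, types_train)
#   predictions = model.predict(X_test)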
Baseline algorithm: The relation detector is unchanged. We follow the common practice, which is to use annotated examples as positive ones and all possible untagged relation mentions as negative ones. We sub-sampled the negative data, since that shows better performance.

+purify: This algorithm adds the purification preprocessing step (section 4.2) before the hierarchical RDC learning. After purification, the RDC algorithm is trained on the positive examples and the purified negative examples. We set N=2000[12] in all experiments.

+tSVM: First, the same purification process as in +purify is applied. Then we follow the steps described in section 4.3 to construct the set of unlabeled examples, and set all the remaining purified negative examples to be negative. Finally, we train TSVM on both the labeled and unlabeled data and replace the relation detection component in the RDC algorithm. The relation classification stage is unchanged.

Table 3 shows the results. All experiments are done with 5-fold cross validation[13] using testing data from adj. The first three rows show experiments trained on fp1, and the last row (ADJ) shows the unmodified RDC algorithm trained on adj for comparison. The purification of negative examples yields a significant performance gain: 3.7% F1 on relation detection and 3.4% on relation classification. Precision decreases but recall increases substantially, since the missing examples are no longer treated as negatives. Experiments show that the purification process removes more than 60% of the false negatives. Transductive SVM further improves performance by a relatively small margin. This shows that the latent positive examples can help refine the model. The results also show that transductive inference can find around 17% of the missing relation mentions. We notice that the performance of relation classification is improved as well: by improving relation detection, some examples that do not express a relation are removed. The classification performance obtained from single-pass annotation is close to that of the model trained on adj, thanks to the better relation detector trained with our algorithm.

We also did 5-fold cross validation with models trained on a fraction of the 4/5 (4 folds) of the adj data (each experiment shown in Table 4 uses 4 folds of adj documents for training, since one fold is left out for cross validation). The documents are sampled randomly. Table 4 shows the results for varying training data sizes.

... and fp2 is different. Moreover, algorithms trained on them show similar performance.
[11] We included this large random sample so that the balance of positive to negative examples in the unlabeled set would be similar to that of the labeled data. The test data is not included in the unlabeled set.
[12] We choose 2000 because it is close to the number of relations missed from each single-pass annotation. In practice, it contains more than 70% of the false negatives, and it is less than 10% of the unannotated examples. To estimate how many examples are missing (section 3.4), one should perform multiple passes of independent annotation on a small dataset and measure inter-annotator agreement.
[13] Details about the settings for 5-fold cross validation are in section 4.1.
                 Detection (%)              Classification (%)
Algorithm        Precision  Recall  F1      Precision  Recall  F1
Baseline         83.4       60.4    70.0    75.7       54.8    63.6
+purify          76.8       70.9    73.7    69.8       64.5    67.0
+tSVM            76.4       72.1    74.2    69.4       65.2    67.2
ADJ (on adj)     80.4       69.7    74.6    73.4       63.6    68.2

Table 3. 5-fold cross-validation results. All systems are trained on fp1 (except the last row, which shows the unchanged algorithm trained on adj for comparison) and tested on adj. McNemar's test shows that the improvements from +purify to +tSVM and from +tSVM to ADJ are statistically significant (p < 0.05).
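The significance statement in the caption relies on McNemar's test over paired system decisions. A minimal sketch of such a comparison, assuming boolean per-mention correctness vectors for two systems (the helper name and inputs are illustrative, not the authors' evaluation code), is:

from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_pvalue(correct_a, correct_b):
    """McNemar's test on paired per-example correctness of two systems."""
    both    = sum(1 for x, y in zip(correct_a, correct_b) if x and y)
    only_a  = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    only_b  = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    neither = sum(1 for x, y in zip(correct_a, correct_b) if not x and not y)
    table = [[both, only_a], [only_b, neither]]
    return mcnemar(table, exact=False, correction=True).pvalue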
Percentage of    Detection (%)              Classification (%)
adj used         Precision  Recall  F1      Precision  Recall  F1
60% of 4/5       86.9       41.2    55.8    78.6       37.2    50.5
70% of 4/5       85.5       51.3    64.1    77.7       46.6    58.2
80% of 4/5       83.3       58.1    68.4    75.8       52.9    62.3
90% of 4/5       82.0       64.9    72.5    74.9       59.4    66.2

Table 4. Performance of SVM trained on a fraction of adj (5-fold cross-validation results).
Compared to the results shown in the +tSVM row of Table 3, we can see that our best model trained on single-pass annotation outperforms an SVM trained on 90% of the dual-pass, adjudicated data in both relation detection and classification, although it costs less than half of the 3-pass annotation. This suggests that, given the same amount of human effort for relation annotation, annotating more documents with a single pass offers advantages over annotating less data with high quality assurance (dual passes and adjudication).

6. Related work

Dligach et al. (2010) studied WSD annotation from a cost-effectiveness viewpoint. They showed empirically that, with the same amount of annotation dollars spent, single-annotation is better than dual-annotation and adjudication. The common practice for quality control of WSD annotation is similar to relation annotation. However, the task of WSD annotation is very different from relation annotation: WSD requires that every example be assigned some tag, whereas that is not required for relation tagging. Moreover, relation tagging requires identifying two arguments and correctly categorizing their types.

The purification approach applied in this paper is related to the general framework of learning from positive and unlabeled examples. Li and Liu (2003) initially set all unlabeled data to be negative and train a Rocchio classifier, then select negative examples which are closer to the negative centroid than to the positive centroid as the purified negative examples. We share a similar assumption with Li and Liu (2003), but we use a different method to select negative examples, since the false negative examples show a very skewed distribution, as described in section 5.2. Transductive SVM was introduced by Vapnik (1998) and later refined in Joachims (1999). A few related methods were studied on the subtask of relation classification (the second stage of the hierarchical learning scheme) in Zhang (2005). Chan and Roth (2011) observed a similar phenomenon, namely that ACE annotators rarely duplicate a relation link for coreferential mentions. They use an evaluation scheme to avoid being penalized by the relation mentions which are not annotated because of this behavior.

7. Conclusion

We analyzed a snapshot of the ACE 2005 relation annotation and found that each single-pass annotation missed around 18-28% of relation mentions and contained around 10% spurious mentions. A detailed analysis showed that it is possible to find some of the false negatives, and that most spurious cases are actually correct examples from a system builder's perspective. By automatically purifying negative examples and applying transductive inference on suspicious examples, we can train a relation classifier whose performance is comparable to a classifier trained on the dual-annotated and adjudicated data. Furthermore, we show that single-pass annotation is more cost-effective than annotation with high quality assurance.

Acknowledgments

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.
References

ACE. http://www.itl.nist.gov/iad/mig/tests/ace/

ACE (Automatic Content Extraction) English Annotation Guidelines for Relations, version 5.8.3. 2005. http://projects.ldc.upenn.edu/ace/.

ACE 2005 Multilingual Training Data V3.0. 2005. LDC2005E18. LDC Catalog.

Elizabeth Boschee, Ralph Weischedel, and Alex Zamanian. 2005. Automatic information extraction. In Proceedings of the International Conference on Intelligence Analysis.

Razvan C. Bunescu and Raymond J. Mooney. 2005a. A shortest path dependency kernel for relation extraction. In Proceedings of HLT/EMNLP-2005.

Razvan C. Bunescu and Raymond J. Mooney. 2005b. Subsequence kernels for relation extraction. In Proceedings of NIPS-2005.

Yee Seng Chan and Dan Roth. 2011. Exploiting syntactico-semantic structures for relation extraction. In Proceedings of ACL-2011.

Michael Collins and Nigel Duffy. 2001. Convolution kernels for natural language. In Proceedings of NIPS-2001.

Dmitriy Dligach, Rodney D. Nielsen and Martha Palmer. 2010. To annotate more accurately or to annotate more. In Proceedings of the Fourth Linguistic Annotation Workshop at ACL 2010.

Ralph Grishman, David Westbrook and Adam Meyers. 2005. NYU's English ACE 2005 system description. In Proceedings of the ACE 2005 Evaluation Workshop.

Scott Miller, Heidi Fox, Lance Ramshaw, and Ralph Weischedel. 2000. A novel use of statistical parsing to extract information from text. In Proceedings of NAACL-2000.

Heng Ji, Ralph Grishman, Hoa Trang Dang and Kira Griffitt. 2010. An overview of the TAC 2010 Knowledge Base Population track. In Proceedings of TAC-2010.

Jing Jiang and ChengXiang Zhai. 2007. A systematic exploration of the feature space for relation extraction. In Proceedings of HLT-NAACL-2007.

Thorsten Joachims. 1999. Transductive inference for text classification using Support Vector Machines. In Proceedings of ICML-1999.

Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction. In Proceedings of ACL-2004.

Xiao-Li Li and Bing Liu. 2003. Learning to classify text using positive and unlabeled data. In Proceedings of IJCAI-2003.

Longhua Qian, Guodong Zhou, Qiaoming Zhu and Peide Qian. 2008. Exploiting constituent dependencies for tree kernel-based semantic relation extraction. In Proceedings of COLING-2008.

Ang Sun, Ralph Grishman and Satoshi Sekine. 2011. Semi-supervised relation extraction with large-scale word clustering. In Proceedings of ACL-2011.

Vladimir N. Vapnik. 1998. Statistical Learning Theory. John Wiley.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. Journal of Machine Learning Research.

Min Zhang, Jie Zhang and Jian Su. 2006a. Exploring syntactic features for relation extraction using a convolution tree kernel. In Proceedings of HLT-NAACL-2006.

Min Zhang, Jie Zhang, Jian Su, and GuoDong Zhou. 2006b. A composite kernel to extract relations between entities with both flat and structured features. In Proceedings of COLING-ACL-2006.

Zhu Zhang. 2005. Mining inter-entity semantic relations using improved transductive learning. In Proceedings of IJCNLP-2005.

Shubin Zhao and Ralph Grishman. 2005. Extracting relations with integrated information using kernel methods. In Proceedings of ACL-2005.

Guodong Zhou, Jian Su, Jie Zhang and Min Zhang. 2005. Exploring various knowledge in relation extraction. In Proceedings of ACL-2005.

Guodong Zhou, Min Zhang, DongHong Ji, and QiaoMing Zhu. 2007. Tree kernel-based relation extraction with context-sensitive structured parse tree information. In Proceedings of EMNLP/CoNLL-2007.
Incorporating Lexical Priors into Topic Models

Jagadeesh Jagarlamudi (University of Maryland, College Park, USA) jags@umiacs.umd.edu
Hal Daumé III (University of Maryland, College Park, USA) hal@umiacs.umd.edu
Raghavendra Udupa (Microsoft Research, Bangalore, India) raghavu@microsoft.com

Abstract

Topic models have great potential for helping users understand document corpora. This potential is stymied by their purely unsupervised nature, which often leads to topics that are neither entirely meaningful nor effective in extrinsic tasks (Chang et al., 2009). We propose a simple and effective way to guide topic models to learn topics of specific interest to a user. We achieve this by providing sets of seed words that a user believes are representative of the underlying topics in a corpus. Our model uses these seeds to improve both topic-word distributions (by biasing topics to produce appropriate seed words) and to improve document-topic distributions (by biasing documents to select topics related to the seed words they contain). Extrinsic evaluation on a document clustering task reveals a significant improvement when using seed information, even over other models that use seed information naively.

1 Introduction

Topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have emerged as a powerful tool to analyze document collections in an unsupervised fashion. When fit to a document collection, topic models implicitly use document-level co-occurrence information to group semantically related words into a single topic. Since the objective of these models is to maximize the probability of the observed data, they have a tendency to explain only the most obvious and superficial aspects of a corpus. They effectively sacrifice performance on rare topics to do a better job in modeling frequently occurring words. The user is then left with a skewed impression of the corpus, and perhaps one that does not perform well in extrinsic tasks.

To illustrate this problem, we ran LDA on the most frequent five categories of the Reuters-21578 (Lewis et al., 2004) text corpus. This document distribution is very skewed: more than half of the collection belongs to the most frequent category (Earn). The five topics identified by the LDA are shown in Table 1. A brief observation of the topics reveals that LDA has roughly allocated topics 1 & 2 to the most frequent class (Earn) and one topic each to the subsequent two frequent classes (Acquisition and Forex), and merged the two least frequent classes (Crude and Grain) into a single topic. The red colored words in topic 5 correspond to the Crude class and the blue words are from the Grain class.

This leads to the situation where the topics identified by LDA are not in accordance with the underlying topical structure of the corpus. This is a problem not just with LDA: it is potentially a problem with any extension thereof that has focused on improving the semantic coherence of the words in each topic (Griffiths et al., 2005; Wallach, 2005; Griffiths et al., 2007), the document-topic distributions (Blei and McAuliffe, 2008; Lacoste-Julien et al., 2008) or other aspects (Blei and Lafferty, 2009).

We address this problem by providing some additional information to the model. Initially, along with the document collection, a user may provide a higher-level view of the document collection. For instance, as discussed in Section 4.4, when run on historical NIPS papers, LDA fails to find topics related to Brain Imaging, Cognitive Science or Hardware, even though we know from the call for

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 204-213, Avignon, France, April 23-27 2012. ©2012 Association for Computational Linguistics
Topic 1: mln, dlrs, billion, year, pct, company, share, april, record, cts, quarter, march, earnings, stg, first, pay
Topic 2: mln, NUM, cts, loss, net, dlrs, shr, profit, revs, year, note, oper, avg, shrs, sales, includes
Topic 3: lt, company, shares, corp, dlrs, stock, offer, group, share, common, board, acquisition, shareholders
Topic 4: bank, market, dollar, pct, exchange, foreign, trade, rate, banks, japan, yen, government, rates, today
Topic 5: oil, tonnes, prices, mln, wheat, production, pct, gas, year, grain, crude, price, corn, dlrs, bpd, opec

Table 1: Topics identified by LDA on the frequent-5 categories of the Reuters corpus. The categories are Earn, Acquisition, Forex, Grain and Crude (in order of document frequency).
Seed topic 1: company, billion, quarter, shrs, earnings
Seed topic 2: acquisition, procurement, merge
Seed topic 3: exchange, currency, trading, rate, euro
Seed topic 4: grain, wheat, corn, oilseed, oil
Seed topic 5: natural, gas, oil, fuel, products, petrol

Table 2: An example of sets of seed words (seed topics) for the frequent-5 categories of the Reuters-21578 categorization corpus. We use them as a running example in the rest of the paper.

papers that such topics should exist in the corpus. By allowing the user to provide some seed words related to these underrepresented topics, we encourage the model to find evidence of these topics in the data. Importantly, we only encourage the model to follow the seed sets and do not force it. So if it has compelling evidence in the data to overcome the seed information, then it still has the freedom to do so. Our seeding approach in combination with interactive topic modeling (Hu et al., 2011) will allow a user to both explore a corpus and guide the exploration towards the distinctions that he/she finds more interesting.

2 Incorporating Seeds

Our approach to allowing a user to guide the topic discovery process is to let him provide seed information at the level of word type. Namely, the user provides sets of seed words that are representative of the corpus. Table 2 shows an example of seed sets one might use for the Reuters corpus. This kind of supervision is similar to the seeding in the bootstrapping literature (Thelen and Riloff, 2002) or prototype-based learning (Haghighi and Klein, 2006). Our reliance on seed sets is orthogonal to existing approaches that use external knowledge, which operate at the level of documents (Blei and McAuliffe, 2008), tokens (Andrzejewski and Zhu, 2009) or pair-wise constraints (Andrzejewski et al., 2009).

We build a model that uses the seed words in two ways: to improve both topic-word and document-topic probability distributions. For ease of exposition, we present these ideas separately and then in combination (Section 2.3). To improve topic-word distributions, we set up a model in which each topic prefers to generate words that are related to the words in a seed set (Section 2.1). To improve document-topic distributions, we encourage the model to select document-level topics based on the existence of input seed words in that document (Section 2.2).

Before moving on to the details of our models, we briefly recall the generative story of the LDA model; the reader is encouraged to refer to (Blei et al., 2003) for further details.

1. For each topic k = 1 … T, choose φ_k ∼ Dir(β).
2. For each document d, choose θ_d ∼ Dir(α). For each token i = 1 … N_d:
   (a) Select a topic z_i ∼ Mult(θ_d).
   (b) Select a word w_i ∼ Mult(φ_{z_i}).

where T is the number of topics, α and β are hyperparameters of the model, and φ_k and θ_d are the topic-word and document-topic Multinomial probability distributions respectively.
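As an illustration of this generative story, the snippet below samples a small synthetic corpus from plain LDA; the vocabulary size, number of topics, and hyperparameter values are arbitrary stand-ins rather than settings used in the paper.

import numpy as np

rng = np.random.default_rng(0)

def generate_corpus(n_docs, doc_len, vocab_size, T, alpha=1.0, beta=0.01):
    """Sample documents from the plain LDA generative story."""
    # For each topic k, draw a topic-word distribution phi_k ~ Dir(beta).
    phi = rng.dirichlet([beta] * vocab_size, size=T)
    docs = []
    for _ in range(n_docs):
        # For each document, draw theta_d ~ Dir(alpha).
        theta = rng.dirichlet([alpha] * T)
        words = []
        for _ in range(doc_len):
            z = rng.choice(T, p=theta)            # (a) topic z_i ~ Mult(theta_d)
            w = rng.choice(vocab_size, p=phi[z])  # (b) word w_i ~ Mult(phi_{z_i})
            words.append(w)
        docs.append(words)
    return docs, phi

docs, phi = generate_corpus(n_docs=100, doc_len=50, vocab_size=500, T=5)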
2.1 Word-Topic Distributions (Model 1)

In regular topic models, each topic k is defined by a Multinomial distribution φ_k over words. We extend this notion and instead define a topic as a mixture of two Multinomial distributions: a seed topic distribution and a regular topic distribution. The seed topic distribution is constrained to only generate words from a corresponding seed set. The regular topic distribution may generate any word (including seed words). For example, seed topic 4 (in Table 2) can only generate the five words in its set. The word oil can be generated by seed topics 4 and 5, as well as any regular
doc their distribution to only generate words in the
corresponding seed set. Then, for each token in a
z=1 z=2 z=T document, we first generate a topic. After choos-
1 1 1 1 T T
ing a topic, we flip a (biased) coin to pick either
the seed or the regular topic distribution. Once
r1 s1 rT sT this distribution is selected we generate a word
from it. It is important to note that although there
are 2T topic-word distributions in total, each
Figure 1: Tree representation of a document in Model document is still a mixture of only T topics (as
1. shown in Fig. 1). This is crucial in relating seed
and regular topics and is similar to the way top-
topic. We want to emphasize that, like any regular ics and aspects are tied in TAM model (Paul and
topic, each seed topic is a non-uniform probabil- Girju, 2010).
ity distribution over the words in its set. The user To understand how this model gathers words
only inputs the sets of seed words and the model related to seed words, consider a seed topic (say
will infer their probability distributions. the fourth row in Table 2) with seed words {grain,
For the sake of simplicity, we describe our wheat, corn, etc. }. Now by assigning all the re-
model by assuming a one-to-one correspondence lated words such as tonnes, agriculture, pro-
between seed and regular topics. This assumption duction etc. to its corresponding regular topic,
can be easily relaxed by duplicating the seed top- the model can potentially put high probability
ics when there are more regular topics. As shown mass on topic z = 4 for agriculture related doc-
in Fig. 1, each document is a mixture over T top- uments. Instead, if it places these words in an-
ics, where each of those topics is a mixture of other regular topic, say z = 3, then the document
a regular topic (r ) and its associated seed topic probability mass has to be distributed among top-
(s ) distributions. The parameter k controls the ics 3 and 4 and as a result the model will pay a
probability of drawing a word from the seed topic steeper penalty. Thus the model uses seed topic
distribution versus the regular topic distribution. to gather related words into its associated regu-
For our first model, we assume that the corpus is lar topic and as a consequence the document-topic
generated based on the following generative pro- distributions also become focussed.
cess (its graphical notation is shown in Fig. 2(a)): We have experimented with two ways of choos-
ing the binary variable xi (step 2b) of the gener-
1. For each topic k=1 T, ative story. In the first method, we fix this sam-
pling probability to a constant value which is in-
(a) Choose regular topic rk Dir(r ).
dependent of the chosen topic (i.e. i = , i =
(b) Choose seed topic sk Dir(s ). 1 T). And in the second method we learn the
(c) Choose k Beta(1, 1). probability as well (Sec. 4).
2. For each document d, choose d Dir(). 2.2 Document-Topic distributions (Model 2)
For each token i = 1 Nd : In the previous model we used seed words to im-
(a) Select a topic zi Mult(d ). prove topic-word probability distributions. Here
(b) Select an indicator xi Bern(zi ) we propose a model to explore the use of seed
(c) if xi is 0 words to improve document-topic probability dis-
Select a word wi Mult(rzi ). tributions. Unlike the previous model, we will
// choose from regular topic present this model in the general case where the
(d) if xi is 1 number of seed topics is not equal to the number
Select a word wi Mult(szi ). of regular topics. Hence, we associate each seed
// choose from seed topic set (we refer seed set as group for conciseness)
with a Multinomial distribution over the regular
The first step is to generate Multinomial distribu- topics which we call group-topic distribution.
tions for both seed topics and regular topics. The To give an overview of our model, first, we
seed topics are drawn in a way that constrains transfer the seed information from words onto

206

~b ~b

g g


x z
s z s x z

r r w r r w r r w
T Nd D
T Nd D T Nd D

(a) Model 1 (b) Model 2 (c) SeededLDA

Figure 2: The graphical notation of all the three models. In Model 1 we use seed topics to improve the topic-word
probability distributions. In Model 2, the seed topic information is first transfered to the document level based
on the document tokens and then it is used to improve document-topic distributions. In the final, SeededLDA,
model we combine both the models. In Model 1 and SeededLDA, we dropped the dependency of s on hyper
parameter s since it is observed. And, for clarity, we also dropped the dependency of x on .

the documents that contain them. Then, the represented using the binary vector ~b. This bi-
document-topic distribution is drawn in a two step nary vector can be populated based on the docu-
process: we sample a seed set (g for group) and ment words and hence it is treated as an observed
then use its group-topic distribution (g ) as prior variable. For example, consider the (very short!)
to draw the document-topic distribution (d ). We document oil companies have merged. Accord-
used this two step process, to allow flexible num- ing to the seed sets from Table 2, we define a bi-
ber of seed and regular topics, and to tie the topic nary vector that denotes which seed topics contain
distributions of all the documents within a group. words in this document. In this case, this vec-
We assume the following generative story and its tor ~b = h1, 1, 0, 1, 1i, indicating the presence of
graphical notation is shown in Fig. 2(b). seeds from sets 1, 2, 4 and 5.1 As discussed in
(Williamson et al., 2010), generating binary vec-
1. For each k = 1 T, tor is crucial if we want a document to talk about
(a) Choose rk Dir(r ). topics that are less prominent in the corpus.
2. For each seed set s = 1 S, The binary vector ~b, that indicates which seeds
exist in this document, defines a mean of a
(a) Choose group-topic distribution s
Dirichlet distribution from which we sample a
Dir(). // the topic distribution for sth
document-group distribution, d (step 3b). We
group (seed set) a vector of length T.
set the concentration of this Dirichlet to a hy-
3. For each document d, perparamter , which we set by hand (Sec. 4);
(a) Choose a binary vector ~b of length S. thus, d Dir(~b). From the resulting multino-
(b) Choose a document-group distribution mial, we draw a group variable g for this docu-
d Dir(~b). ment. This group variable brings clustering struc-
(c) Choose a group variable g Mult( d ) ture among the documents by grouping the docu-
ments that are likely to talk about same seed set.
(d) Choose d Dir(g ). // of length T
Once the group variable (g) is drawn, we
(e) For each token i = 1 Nd :
choose the document-topic distribution (d ) from
i. Select a topic zi Mult(d ). a Dirichlet distribution with the groups-topic dis-
ii. Select a word wi Mult(rzi ). tribution as the prior (step 3d). This step ensures
that the topic distributions of documents within
We first generate T topic-word distributions
each group are related. The remaining sampling
(k ) and S group-topic distributions (s ). Then
for each document, we generate a list of seed sets 1
As a special case, if no seed word is found in the docu-
that are allowed for this document. This list is ment, ~b is defined as the all-ones vector.

207
process proceeds like LDA. We sample a topic 2.4 Automatic Seed Selection
for each word and then generate a word from its In (Andrzejewski and Zhu, 2009; Andrzejewski
corresponding topic-word distribution. Observe et al., 2009), the seed information is provided
that, if the binary vector is all ones and if we manually. Here, we describe the use of feature se-
set d = d then this model reduces to the LDA lection techniques, prevalent in the classification
model with and r as the hyperparameters. literature, to automatically derive the seed sets. If
2.3 SeededLDA we want the topicality structure identified by the
LDA to align with the underlying class structure,
Both of our models use seed words in different then the seed words need to be representative of
ways to improve topic-word and document-topic the underlying topicality structure. To enable this,
distributions respectively. We can combine both we first take class labeled data (doesnt need to
the above models easily. We refer to the combined be multi-class labeled data unlike (Ramage et al.,
model as SeededLDA and its generative story is 2009)) and identify the discriminating features for
as follows (its graphical notation is shown in Fig. each class. Then we choose these discriminating
2(c)). The variables have same semantics as in the features as the initial sets of seed words. In prin-
previous models. ciple, this is similar to the prototype driven unsu-
1. For each k=1 T, pervised learning (Haghighi and Klein, 2006).
We use Information Gain (Mitchell, 1997) to
(a) Choose regular topic rk Dir(r ).
identify the required discriminating features. The
(b) Choose seed topic sk Dir(s ). Information Gain (IG) of a word (w) in a class (c)
(c) Choose k Beta(1, 1). is given by
2. For each seed set s = 1 S,
IG(c, w) = H(c) − H(c|w)
(a) Choose group-topic distribution s
Dir(). where H(c) is the entropy of the class and H(c|w)
3. For each document d, is the conditional entropy of the class given the
word. In computing Information Gain, we bina-
(a) Choose a binary vector ~b of length S.
rize the document vectors and consider whether a
(b) Choose a document-group distribution word occurs in any document of a given class or
d Dir(~b). not. Thus obtained ranked list of words for each
(c) Choose a group variable g Mult( d ). class are filtered for ambiguous words and then
(d) Choose d Dir(g ). // of length T used as initial sets of seed words to be input to the
(e) For each token i = 1 Nd : model.
i. Select a topic zi Mult(d ).
3 Related Work
ii. Select an indicator xi Bern(zi ).
iii. if xi is 0 Seed-based supervision is closely related to the
Select a word wi Mult(rzi ). idea of seeding in the bootstrapping literature for
iv. if xi is 1 learning semantic lexicons (Thelen and Riloff,
Select a word wi Mult(szi ). 2002). The goals are similar as well: growing
a small set of seed examples into a much larger
In the SeededLDA model, the process for gen- set. A key difference is the type of semantic in-
erating group variable of a document is same as formation that the two approaches aim to capture:
the one described in the Model 2. And like in the semantic lexicons are based on much more spe-
Model 2, we sample a document-topic probability cific notions of semantics (e.g. all the country
distribution as a Dirichlet draw with the group- names) than the generic topic semantics of topic
topic distribution of the chosen group as prior. models. The idea of seeding has also been used
Subsequently, we choose a topic for each token in prototype-driven learning (Haghighi and Klein,
and then flip a biased coin. We choose either the 2006) and shown similar efficacies for these semi-
seed or the regular topic based on the result of the supervised learning approaches.
coin toss and then generate a word from its distri- LDAWN (Boyd-Graber et al., 2007) models
bution. sets of words for the word sense disambiguation

208
task. It assumes that a topic is a distribution (Sec. 2.2). However our model differs from La-
over synsets and relies on the Wordnet to obtain beledLDA in the subsequent steps. Rather than
the synsets. The most related prior work is that using the group distribution directly, we sam-
of (Andrzejewski et al., 2009), who propose the ple a group variable and use it to constrain the
use Dirichlet Forest priors to incorporate Must document-topic distributions of all the documents
Link and Cannot Link constraints into the topic within this group. Moreover, in their model the
models. This work is analogous to constrained binary vector is observed directly in the form of
K-means clustering (Wagstaff et al., 2001; Basu document labels while, in our case, it is automat-
et al., 2008). A must link between a pair word ically populated based on the document tokens.
types represents that the model should encourage Interactive topic modeling brings the user into
both the words to have either high or low prob- the loop, by allowing him/her to make suggestions
ability in any particular topic. A cannot link be- on how to improve the quality of the topics at each
tween a word pair indicates both the words should iteration (Hu et al., 2011). In their approach, the
not have high probability in a single topic. In the authors use Dirichlet Forest method to incorpo-
Dirichlet Forest approach, the constraints are first rate the users preferences. In our experiments
converted into trees with words as the leaves and (Sec. 4), we show that SeededLDA performs bet-
edges having pre-defined weights. All the trees ter than Dirichlet Forest method, so SeededLDA
are joined to a dummy node to form a forest. The when used with their framework can allow an user
sampling for a word translates into a random walk to explore a document collection in a more mean-
on the forest: starting from the root and selecting ingful manner.
one of its children based on the edge weights until
you reach a leaf node. 4 Experiments
While the Dirichlet Forest method requires su- We evaluate different aspects of the model sep-
pervision in terms of Must link and Cannot link arately. Our experimental setup proceeds as fol-
information, the Topics In Sets (Andrzejewski and lows: a) Using an existing model, we evaluate the
Zhu, 2009) model proposes a different approach. effectiveness of automatically derived constraints
Here, the supervision is provided at the token indicating the potential benefits of adding seed
level. The user chooses specific tokens and re- words into the topic models. b) We evaluate each
strict them to occur only with in a specified list of of our proposed models in different settings and
topics. While this needs minimal changes to the compare with multiple baseline systems.
inference process of LDA, it requires information Since our aim is to overcome the domi-
at the level of tokens. The word type level seed nance of majority topics by encouraging the
information can be converted into token level in- topicality structure identified by the topic mod-
formation (like we do in Sec. 4) but this prevents els to align with that of the document cor-
their model from distinguishing the tokens based pus, we choose extrinsic evaluation as the
on the word senses. primary evaluation method. We use docu-
Several models have been proposed which use ment clustering task and use frequent-5 cate-
supervision at the document level. Supervised gories of Reuters-21578 corpus (Lewis et al.,
LDA (Blei and McAuliffe, 2008) and DiscLDA 2004) and four classes from the 20 News-
(Lacoste-Julien et al., 2008) try to predict the cat- groups data set (i.e.rec.autos, sci.electronics,
egory labels (e.g. sentiment classification) for comp.hardware and alt.atheism). For both
the input documents based on a document labeled the corpora we do the standard preprocessing
data. Of these models, the most related one to of removing stopwords and infrequent words
SeededLDA is the LabeledLDA model (Ramage (Williamson et al., 2010).
et al., 2009). Their model operates on multi-class For all the models, we use a Collapsed Gibbs
labeled corpus. Each document is assumed to be sampler (Griffiths and Steyvers, 2004) for the in-
a mixture over a known subset of topics (classes) ference process. We use the standard hyperparam-
with each topic being a distribution over words. eters values = 1.0, = 0.01 and = 1.0 and
The process of generating document topic distri- run the sampler for 1000 iterations, but one can
bution in LabeledLDA is similar to the process use techniques like slice sampling to estimate the
of generating group distribution in our Model 2 hyperparameters (Johnson and Goldwater, 2009).

209
Reuters 20 Newsgroups
F-measure VI F-measure VI
LDA 0.64 (.05) 1.26 (.16) 0.77 (.06) 0.9 (.13)
Dirichlet Forest 0.67 (.02) 1.17 (.11) 0.79(.01) 0.83 (.03)
over LDA (+4.68%) (-7.1%) (+2.6%) (-7.8%)

Table 3: The effect of adding constraints by Dirichlet Forest Encoding. For Variational Information (VI) a lower
score indicates a better clustering. indicates statistical significance at p = 0.01 as measured by the t-test. All
the four improvements are significant at p = 0.05.

We run all the models with the same number of every pair of words belonging to different sets.
topics as the number of clusters. Then, for each The accuracies are averaged over 25 different ran-
document, we find the topic that has maximum dom initializations and are shown in Table 3. We
probability in the posterior document-topic distri- have also indicated the relative performance gains
bution and assign it to that cluster. The accuracy compared to LDA. The significant improvement
of the document clustering is measured in terms over the plain LDA demonstrates the effectiveness
of F-measure and Variation of Information. F- of the automatic extraction of seed words in topic
measure is calculated based on the pairs of doc- models.
uments, i.e. if two documents belong to a cluster
in both ground truth and the clustering proposed 4.2 Document Clustering
by the system then it is counted as correct, other- In the next experiment, we compare our models
wise it is counted as wrong. Variational Informa- with LDA and other baselines. The first baseline
tion (VI) of two clusterings X and Y is given as (maxCluster) simply counts the number of tokens
(Meila, 2007): in each document from each of the seed topics and
assigns the document to the seed topic that has
VI(X, Y ) = H(X) + H(Y ) 2I(X, Y ) most tokens. This results in a clustering of doc-
uments based on the seed topic they are assigned
where H(X) denotes the entropy of the clustering to. This baseline evaluates the effectiveness of the
X and I(X, Y ) denotes the mutual information seed words with respect to the underlying cluster-
between the two clusterings. For VI, a lower value ing. Apart from the maxCluster baseline, we use
indicates a better clustering. All the accuracies are LDA and z-labels (Andrzejewski and Zhu, 2009)
averaged over 25 different random initializations as our baselines. For z-labels, we treat all the to-
and all the significance results are measured using kens of a seed word in the same way. Table 4
the t-test at p = 0.01. shows the comparison of our models with respect
to the baseline systems.2 Comparing the perfor-
4.1 Seed Extraction mance of maxCluster to that of LDA, we observe
The seeds were extracted automatically (Sec. 2.4) that the seed words themselves do a poor job in
based on a small sample of labeled data other than clustering the documents.
the test data. We first extract 25 seeds words per We experimented with two variants of Model 1.
each class and then remove the seed words that In the first run (Model 1) we sample the k value,
appear in more than one class. After this filtering, i.e. the probability of choosing a seed topic for
on an average, we are left with 9 and 15 words per each topic. While in the Model 1 ( = 0.7) run,
each seed topic for Reuters and 20 Newsgroups we fix this probability to a constant value of 0.7 ir-
corpora respectively. respective of the topic.3 Though both the models
We use the existing Dirichlet Forest method to 2
The code used for LDA baseline in Tables 3 and 4
evaluate the effectiveness of the automatically ex- are different. For Table 3, we use the code available from
tracted seed words. The Must and Cannot links http://pages.cs.wisc.edu/andrzeje/research/df lda.html.
required for the supervision (Andrzejewski et al., We use our own version for Table 4. We tried to produce
a comparable baseline by running the former for more
2009) are automatically obtained by adding a iterations and with different hyperparameters. In Table 3,
must-link between every pair of words belonging we report their best results.
3
to the same seed set and a split constraint between We chose this value based on intuition; it is not tuned.

210
Reuters 20 Newsgroups
F-measure VI F-measure VI
maxCluster 0.53 1.75 0.58 1.44
LDA 0.66 (.04) 1.2 (.12) 0.76 (.06) 0.9 (.14)
z-labels 0.73 (.01) 1.04 (.01) 0.8 (.00) 0.82 (.01)
over LDA (+10.6%) (-13.3%) (+5.26%) (-8.8%)
Model 1 0.69 (.00) 1.13 (.01) 0.8 (.01) 0.81 (.02)
Model 1 ( = 0.7) 0.73 (.00) 1.09 (.01) 0.8 (.01) 0.81 (.02)
Model 2 0.66 (.04) 1.22 (.1) 0.77 (.07) 0.85 (.12)
SeededLDA 0.76 (.01) 0.99 (.03) 0.81 (.01) 0.75 (.02)
over LDA (+15.5%) (-17.5%) (+6.58%) (-16.7%)

Table 4: Accuracies on document clustering task with different models. indicates significant improvement
compared to the z-labels approach, as measured by the t-test with p = 0.01. The relative performance gains are
with respect to the LDA model and are provided for comparison with Dirichlet Forest method (in Table 3.)

performed better than LDA, fixing the probabil- these intervals reveals the superior performance
ity gave better results. When we attempt to learn of SeededLDA compared to all the baselines. The
this value, the model chooses to explain some of standard deviation of the F-measures over dif-
the seed words by the regular topics. On the other ferent random initializations of our our model is
hand, when is fixed, it explains almost all the about 1% for both the corpora while it is 4% and
seed words based on the seed topics. The next 6% for the LDA on Reuters and 20 Newsgroups
row (Model 2) indicates the performance of our corpora respectively. The reduction in the vari-
second model on the same data sets. The first ance, across all the approaches that use seed infor-
model seems to be performing better than the sec- mation, shows the increased robustness of the in-
ond model, which is justifiable since the latter ference process when using seed words. From the
uses seed topics indirectly. Though the variants accuracies in both the tables, it is clear that Seed-
of Model 1 and Model 2 performed better than edLDA model out-performs other models which
the LDA, they fell short of the z-labels approach. try to incorporate seed information into the topic
Table 4 also shows the performance of our com- models.
bined model (SeededLDA) on both the corpora. 4.3 Effect of Ambiguous Seeds
When the models are combined, the performance
In the following experiment we study the effect
improves over each of them and is also better than
of ambiguous seeds. We allow a seed word to oc-
the baseline systems. As explained before, our in-
cur in multiple seed sets. Table 6 shows the cor-
dividual models improve both the topic-word and
responding results. The performance drops when
document-topic distributions respectively. But it
we add ambiguous seed words, but it is still higher
turns out that the knowledge learnt by both the in-
than that of the LDA model. This suggests that the
dividual models is complementary to each other.
quality of the seed topics is determined by the dis-
As a result the combined model performed better
criminative power of the seed words rather than
than the individual models and other baseline sys-
the number of seed words in each seed topic. The
tems. Comparing the last rows of Tables 4 and 3,
topics identified by the SeededLDA on Reuters
we notice that the relative performance gains ob-
corpus are shown in the Table 5. With the help of
served in the case of SeededLDA is significantly
the seed sets, the model is able to split the Grain
higher than the performance gains obtained by
and Crude into two separate topics which were
incorporating the constraints using the Dirichlet
merged into a single topic by the plain LDA.
Forest method. Moreover, as indicated in the Ta-
ble 4, SeededLDA achieves significant gains over 4.4 Qualitative Evaluation on NIPS papers
the z-labels approach as well. We ran LDA and SeededLDA models on the NIPS
We have also provided the standard intervals papers from 2001 to 2010. For this corpus, the
for each of the approaches. A quick inspection of seed words are chosen from the call for proposal.

211
group, offer, common, cash, agreement, shareholders, acquisition, stake, merger, board, sale
oil, price, prices, production, lt, gas, crude, 1987, 1985, bpd, opec, barrels, energy, first, petroleum
0, mln, cts, net, loss, 2, dlrs, shr, 3, profit, 4, 5, 6, revs, 7, 9, 8, year, note, 1986, 10, 0, sales
tonnes, wheat, mln, grain, week, corn, department, year, export, program, agriculture, 0, soviet, prices
bank, market, pct, dollar, exchange, billion, stg, today, foreign, rate, banks, japan, yen, rates, trade

Table 5: Topics identified by SeededLDA on the frequent-5 categories of Reuters corpus

Reuters 20 Newsgroups drzejewski et al., 2009). Moreover, since, in our


F VI F VI method each seed topic is a distribution over the
LDA 0.66 1.2 0.76 0.9 seed words, the convex combination of regular
SeededLDA 0.76 0.99 0.81 0.75 and seed topics can be seen as adding different
SeededLDA weights (ci ) to different components of the prior
0.71 1.08 0.79 0.78
(amb) vector. Thus our Model 1 can be seen as an asym-
metric generalization of the Informed priors.
Table 6: Effect of ambiguous seed words on Seed- For comparability purposes, in this paper, we
edLDA. experimented with same number of regular topics
as the number of seed topics. But as explained in
There are 10 major areas with sub areas under the modeling part, our model is general enough
each of them. We ran both the models with 10 top- to handle situation with unequal number of seed
ics. For SeededLDA, the words in each of the ar- and regular topics. In this case, we assume that
eas are selected as seed words and we filter out the the seed topics indicate a higher level of topical-
ambiguous seed words. Upon a qualitative obser- ity structure of the corpus and associate each seed
vation of the output topics, we found that LDA has topic (or group) with a distribution over the regu-
identified seven major topics and left out Brain lar topics. On the other hand, in many NLP appli-
Imaging, Cognitive Science and Artificial In- cations, we tend to have only a partial information
telligence and Hardware Technologies areas. rather than high-level supervision. In such cases,
Not surprisingly, but reassuringly, these areas are one can create some empty seed sets and tweak
underrepresented among the NIPS papers. On the the model 2 to output a 1 in the binary vector cor-
other hand, SeededLDA successfully identifies all responding to these seed sets. In this paper, we
of the major topics. The topics identified by LDA used information gain to select the discriminating
and SeededLDA are shown in the supplementary seed words. But in the real world applications,
material. one can use publicly available ODP categorization
data to obtain the higher level seed words and thus
5 Discussion explore the corporal in a more meaningful way.
In this paper, we have explored two methods
In traditional topic models, a symmetric Dirich- to incorporate lexical prior into the topic mod-
let distribution is used as prior for topic-word dis- els, combining them into a single model that we
tributions. A first attempt method to incorporate call SeededLDA. From our experimental analysis,
seed words into the model is to use an asymmetric we found that automatically derived seed words
Dirichlet distribution as prior for the topic-word can improve clustering performance significantly.
distributions (also called as Informed priors). For Moreover, we found out that allowing a seed word
example, to encourage Topic 5 to align with a seed to be shared across multiple sets of seed words de-
set we can choose an asymmetric prior of the form grades the performance.
~5 = {, , + c, , }, i.e. we increase
the component values corresponding to the seed 6 Acknowledgments
words by a positive constant value. This favors
the desired seed words to be drawn with a higher We thank the anonymous reviewers for their help-
probability from this topic. But, it is argued else- ful comments. This material is partially supported
where that words drawn from such distributions by the National Science Foundation under Grant
rarely pick words other than the seed words (An- No. IIS-1153487.

212
References - Volume 1, HLT 11, pages 248257, Stroudsburg,
PA, USA. Association for Computational Linguis-
Andrzejewski, D. and Zhu, X. (2009). Latent dirichlet tics.
allocation with topic-in-set knowledge. In Proceed-
Johnson, M. and Goldwater, S. (2009). Improving
ings of the NAACL HLT 2009 Workshop on Semi-
nonparameteric bayesian inference: experiments
Supervised Learning for Natural Language Pro-
on unsupervised word segmentation with adap-
cessing, SemiSupLearn 09, pages 4348, Morris-
tor grammars. In Proceedings of Human Lan-
town, NJ, USA. Association for Computational Lin-
guage Technologies: The 2009 Annual Conference
guistics.
of the North American Chapter of the Association
Andrzejewski, D., Zhu, X., and Craven, M. (2009). In- for Computational Linguistics, NAACL 09, pages
corporating domain knowledge into topic modeling 317325, Stroudsburg, PA, USA. Association for
via dirichlet forest priors. In ICML 09: Proceed- Computational Linguistics.
ings of the 26th Annual International Conference Lacoste-Julien, S., Sha, F., and Jordan, M. (2008).
on Machine Learning, pages 2532, New York, NY, DiscLDA: Discriminative learning for dimensional-
USA. ACM. ity reduction and classification. In Proceedings of
Basu, S., Ian, D., and Wagstaff, K. (2008). Con- NIPS 08.
strained Clustering : Advances in Algorithms, The- Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004).
ory, and Applications. Chapman & Hall/CRC Pres. Rcv1: A new benchmark collection for text catego-
Blei, D. and McAuliffe, J. (2008). Supervised topic rization research. J. Mach. Learn. Res., 5:361397.
models. In Advances in Neural Information Pro- Meila, M. (2007). Comparing clusteringsan infor-
cessing Systems 20, pages 121128, Cambridge, mation based distance. J. Multivar. Anal., 98:873
MA. MIT Press. 895.
Blei., D. M. and Lafferty., J. (2009). Topic models. In Mitchell, T. M. (1997). Machine Learning. McGraw-
Text Mining: Theory and Applications. Taylor and Hill, New York.
Francis. Paul, M. and Girju, R. (2010). A two-dimensional
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). La- topic-aspect model for discovering multi-faceted
tent dirichlet allocation. Journal of Maching Learn- topics. In AAAI.
ing Research, 3:9931022. Ramage, D., Hall, D., Nallapati, R., and Manning,
Boyd-Graber, J., Blei, D. M., and Zhu, X. (2007). A C. D. (2009). Labeled LDA: a supervised topic
topic model for word sense disambiguation. In Em- model for credit attribution in multi-labeled cor-
pirical Methods in Natural Language Processing. pora. In Proceedings of the 2009 Conference on
Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., and Empirical Methods in Natural Language Process-
Blei, D. M. (2009). Reading tea leaves: How hu- ing: Volume 1 - Volume 1, EMNLP 09, pages 248
mans interpret topic models. In Neural Information 256, Morristown, NJ, USA. Association for Com-
Processing Systems. putational Linguistics.
Griffiths, T., Steyvers, M., and Tenenbaum, J. (2007). Thelen, M. and Riloff, E. (2002). A bootstrapping
Topics in semantic representation. Psychological method for learning semantic lexicons using extrac-
Review, 114(2):211244. tion pattern contexts. In In Proc. 2002 Conf. Empir-
Griffiths, T. L. and Steyvers, M. (2004). Finding sci- ical Methods in NLP (EMNLP).
entific topics. Proceedings of National Academy of Wagstaff, K., Cardie, C., Rogers, S., and Schrodl, S.
Sciences USA, 101 Suppl 1:52285235. (2001). Constrained k-means clustering with back-
Griffiths, T. L., Steyvers, M., Blei, D. M., and Tenen- ground knowledge. In Proceedings of the Eigh-
baum, J. B. (2005). Integrating topics and syntax. teenth International Conference on Machine Learn-
In Advances in Neural Information Processing Sys- ing, ICML 01, pages 577584, San Francisco, CA,
tems, volume 17, pages 537544. USA. Morgan Kaufmann Publishers Inc.
Haghighi, A. and Klein, D. (2006). Prototype-driven Wallach, H. M. (2005). Topic modeling: beyond bag-
learning for sequence models. In Proceedings of of-words. In NIPS 2005 Workshop on Bayesian
the main conference on Human Language Tech- Methods for Natural Language Processing.
nology Conference of the North American Chap- Williamson, S., Wang, C., Heller, K. A., and Blei,
ter of the Association of Computational Linguis- D. M. (2010). The IBP compound dirichlet pro-
tics, HLT-NAACL 06, pages 320327, Strouds- cess and its application to focused topic modeling.
burg, PA, USA. Association for Computational Lin- In ICML, pages 11511158.
guistics.
Hu, Y., Boyd-Graber, J., and Satinoff, B. (2011). In-
teractive topic modeling. In Proceedings of the 49th
Annual Meeting of the Association for Computa-
tional Linguistics: Human Language Technologies

213
DualSum: a Topic-Model based approach for update summarization

Jean-Yves Delort Enrique Alfonseca


Google Research Google Research
Brandschenkestrasse 110 Brandschenkestrasse 110
8002 Zurich, Switzerland 8002 Zurich, Switzerland
jydelort@google.com ealfonseca@google.com

Abstract of sentences extracted from the document collec-


tion. Extracts can have coherence and cohesion
Update summarization is a new challenge problems, but they generally offer a good trade-
in multi-document summarization focusing off between linguistic quality and informative-
on summarizing a set of recent documents ness.
relatively to another set of earlier docu-
ments. We present an unsupervised proba- While numerous extractive summarization
bilistic approach to model novelty in a doc- techniques have been proposed for multi-
ument collection and apply it to the genera- document summarization (Erkan and Radev,
tion of update summaries. The new model, 2004; Radev et al., 2004; Shen and Li, 2010; Li et
called D UAL S UM, results in the second or al., 2011), few techniques have been specifically
third position in terms of the ROUGE met- designed for update summarization. Most exist-
rics when tuned for previous TAC competi-
ing approaches handle it as a redundancy removal
tions and tested on TAC-2011, being statis-
tically indistinguishable from the winning
problem, with the goal of producing a summary of
system. A manual evaluation of the gen- collection B that is as dissimilar as possible from
erated summaries shows state-of-the art re- either collection A or from a summary of collec-
sults for D UAL S UM with respect to focus, tion A. A problem with this approach is that it can
coherence and overall responsiveness. easily classify as redundant sentences in which
novel information is mixed with existing informa-
tion (from collection A). Furthermore, while this
1 Introduction approach can identify sentences that contain novel
Update summarization is the problem of extract- information, it cannot model explicitly what the
ing and synthesizing novel information in a col- novel information is.
lection of documents with respect to a set of doc- Recently, Bayesian models have successfully
uments assumed to be known by the reader. This been applied to multi-document summarization
problem has received much attention in recent showing state-of-the-art results in summarization
years, as can be observed in the number of partic- competitions (Haghighi and Vanderwende, 2009;
ipants to the special track on update summariza- Jin et al., 2010). These approaches offer clear and
tion organized by DUC and TAC since 2007. The rigorous probabilistic interpretations that many
problem is usually formalized as follows: Given other techniques lack. Furthermore, they have the
two collections A and B, where the documents in advantage of operating in unsupervised settings,
A chronologically precede the documents in B, which can be used in real-world scenarios, across
generate a summary of B under the assumption domains and languages. To our best knowledge,
that the user of the summary has already read the previous work has not used this approach for up-
documents in A. date summarization.
Extractive techniques are the most common In this article, we propose a novel nonpara-
approaches in multi-document summarization. metric Bayesian approach for update summariza-
Summaries generated by such techniques consist tion. Our approach, which is a variation of Latent

214
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 214223,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
Dirichlet Allocation (LDA) (Blei et al., 2003), mary S to approximate, KL is commonly used as
aims to learn to distinguish between common in- the scoring function to select the subset of sen-
formation and novel information. We have eval- tences S that minimizes the KL divergence with
uated this approach on the ROUGE scores and T:
demonstrate that it produces comparable results
to the top system in TAC-2011. Furthermore, our X pT (w)
approach improves over that system when evalu- S = argminKL(T, S) = pT (w) log
S pS (w)
ated manually in terms of linguistic quality and wV

overall responsiveness.
where w is a word from the vocabulary V. This
2 Related work strategy is called KLSum. Usually, a smoothing
factor is applied on the candidate distribution S
2.1 Bayesian approaches in Summarization in order to avoid the divergence to be undefined1 .
Most Bayesian approaches to summarization are This objective function selects the most repre-
based on topic models. These generative mod- sentative sentences of the collection, and at the
els represent documents as mixtures of latent top- same time it also diversifies the generated sum-
ics, where a topic is a probability distribution over mary by penalizing redundancy. Since the prob-
words. In T OPIC S UM (Haghighi and Vander- lem of finding the subset of sentences from a
wende, 2009), each word is generated by a sin- collection that minimizes the KL divergence is
gle topic which can be a corpus-wide background NP-complete, a greedy algorithm is often used in
distribution over common words, a distribution practice2 . Some variations of this objective func-
of document-specific words or a distribution of tion can be considered, such as penalizing sen-
the core content of a given cluster. BAYES S UM tences that contain document-specific topics (Ma-
(Daume and Marcu, 2006) and the Special Words son and Charniak, 2011) or rewarding sentences
and Background model (Chemudugunta et al., appearing closer to the beginning of the docu-
2006) are very similar to T OPIC S UM. ment.
A commonality of all these models is the use of Wang et al. (2009) propose a Bayesian ap-
collection and document-specific distributions in proach for summarization that does not use KL
order to distinguish between the general and spe- for reranking. In their model, Bayesian Sentence-
cific topics in documents. In the context of sum- based Topic Models, every sentence in a docu-
marization, this distinction helps to identify the ment is assumed to be associated to a unique la-
important pieces of information in a collection. tent topic. Once the model parameters have been
Models that use more structure in the repre- calculated, a summary is generated by choosing
sentation of documents have also been proposed the sentence with the highest probability for each
for generating more coherent and less redun- topic.
dant summaries, such as H IER S UM (Haghighi While hierarchical topic modeling approaches
and Vanderwende, 2009) and TTM (Celikyilmaz have shown remarkable effectiveness in learning
and Hakkani-Tur, 2011). For instance, H IER S UM the latent topics of document collections, they are
models the intuitions that first sentences in docu- not designed to capture the novel information in
ments should contain more general information, a collection with respect to another one, which is
and that adjacent sentences are likely to share the primary focus of update summarization.
specific content vocabulary. However, HIERSUM,
which builds upon TOPICSUM, does not show 2.2 Update Summarization
a statistically significant improvement in ROUGE The goal of update summarization is to generate
over TOPICSUM. an update summary of a collection B of recent
A number of techniques have been proposed to documents assuming that the users already read
rank sentences of a collection given a word distri- earlier documents from a collection A. We refer
bution (Carbonell and Goldstein, 1998; Goldstein
et al., 1999). The Kullback-Leibler divergence
(KL) is a widely used measure in summarization.
Given a target distribution T that we want a sum-

1. In our experiments we set the smoothing factor to 0.01.
2. In our experiments, we follow the same approach as in (Haghighi and Vanderwende, 2009) by greedily adding sentences to a summary so long as they decrease KL divergence.
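The KLSum strategy and its greedy approximation (footnote 2) can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the whitespace tokenization and the helper names (word_distribution, kl_divergence, klsum) are assumptions made here for clarity, and the smoothing constant mirrors footnote 1.

```python
from collections import Counter
from math import log

def word_distribution(sentences, smoothing=0.01, vocab=None):
    """Smoothed unigram distribution over a set of sentences."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = vocab or set(counts)
    total = sum(counts.values()) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q); q is smoothed, so it is never zero on p's support."""
    return sum(p_w * log(p_w / q[w]) for w, p_w in p.items() if p_w > 0)

def klsum(sentences, target, budget):
    """Greedily add the sentence that most decreases KL(target || summary),
    as long as the character budget allows it."""
    vocab = set(target)
    summary, best_kl = [], float("inf")
    candidates = list(sentences)
    while candidates:
        kl, sent = min(
            (kl_divergence(target, word_distribution(summary + [s], vocab=vocab)), s)
            for s in candidates)
        over_budget = sum(len(x) for x in summary) + len(sent) > budget
        if kl >= best_kl or over_budget:
            break  # stop once no remaining sentence decreases the divergence
        summary.append(sent)
        candidates.remove(sent)
        best_kl = kl
    return summary
```

In the update-summarization setting described below, the target distribution would be built from the learned topic distributions rather than from raw counts.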

to collection A as the base collection and to col- 3 DualSum
lection B as the update collection.
3.1 Model Formulation

Update summarization is related to novelty de- The input for D UAL S UM is a set of pairs of collec-
tection which can be defined as the problem of tions of documents C = {(Ai , Bi )}i=1...m , where
determining whether a document contains new in- Ai is a base document collection and Bi is an up-
formation given an existing collection (Soboroff date document collection. We use c to refer to a
and Harman, 2005). Thus, while the goal of nov- collection pair (Ac , Bc ).
elty detection is to determine whether some infor- In D UAL S UM, documents are modeled as a bag
mation is new, the goal of update summarization of words that are assumed to be sampled from a
is to extract and synthesize the novel information. mixture of latent topics. Each word is associated
with a latent variable that specifies which topic
distribution is used to generate it. Words in a doc-
Update summarization is also related to con- ument are assumed to be conditionally indepen-
trastive summarization, i.e. the problem of jointly dent given the hidden topic.
generating summaries for two entities in order to As in previous Bayesian works for summariza-
highlight their differences (Lerman and McDon- tion (Daume and Marcu, 2006; Chemudugunta
ald, 2009). The primary difference here is that et al., 2006; Haghighi and Vanderwende, 2009),
update summarization aims to extract novel or up- D UAL S UM not only learns collection-specific dis-
dated information in the update collection with re- tributions, but also a general background distri-
spect to the base collection. bution over common words, G and a document-
specific distribution cd for each document d in
The most common approach for update sum- collection pair c, which is useful to separate the
marization is to apply a normal multi-document specific aspects from the general aspects of c. The
summarizer, with some added functionality to re- main novelty is that D UAL S UM introduces spe-
move sentences that are redundant with respect cific machinery for identifying novelty.
to collection A. This can be achieved using sim- To capture the differences between the base and
ple filtering rules (Fisher and Roark, 2008), Max- the update collection for each pair c, D UAL S UM
imal Marginal Relevance (Boudin et al., 2008), or learns two topics for every collection pair. The
more complex graph-based algorithms (Shen and joint topic, Ac captures the common information
Li, 2010; Wenjie et al., 2008). The goal here is between the two collections in the pair, i.e. the
to boost sentences in B that bring out completely main event that both collections are discussing.
novel information. One problem with this ap- The update topic, Bc focuses on the specific as-
proach is that it is likely to discard as redundant pects that are specific to the documents inside the
sentences in B containing novel information if it update collection.
is mixed with known information from collection In the generative model,
A.
For a document d in a collection Ac , words
can originate from one of three differ-
Another approach is to introduce specific fea-
ent topics: G , cd and Ac , the last one of
tures intended to capture the novelty in collection
which captures the main topic described in
B. For example, comparing collections A and B,
the collection pair.
FastSum derives features for the collection B such
as number of named entities in the sentence that For a document d in a collection Bc , words
already occurred in the old cluster or the number can originate from one of four different
of new content words in the sentence not already topics: G , cd , Ac and Bc . The last one
mentioned in the old cluster that are subsequently will capture the most important updates to
used to train a Support Vector Machine classifier the main topic.
(Schilder et al., 2008). A limitation with this ap-
proach is there are no large training sets available To make this representation easier, we can also
and, the more features it has, the more it is af- state that both collections are generated from the
fected by the sparsity of the training data. four topics, but we constrain the topic probability

1. Sample φ_G ∼ Dir(β_G)
2. For each collection pair c = (A_c, B_c):
   - Sample φ_{A_c} ∼ Dir(β_A)
   - Sample φ_{B_c} ∼ Dir(β_B)
   - For each document d of type u_cd ∈ {A, B}:
     - Sample φ_cd ∼ Dir(β_D)
     - If (u_cd = A) sample ψ_cd ∼ Dir(α^A)
     - If (u_cd = B) sample ψ_cd ∼ Dir(α^B)
     - For each word w in document d:
       (a) Sample a topic z ∼ Mult(ψ_cd), z ∈ {G, cd, A_c, B_c}
       (b) Sample a word w ∼ Mult(φ_z)

Figure 1: Generative model in DUALSUM.

Figure 2: Graphical model representation of DUALSUM.

for B_c to be always zero when generating a base document.

We denote by u_cd ∈ {A, B} the type of a document d in pair c. This is an observed, Boolean variable stating whether the document d belongs to the base or the update collection inside the pair c.

The generation process of documents in DUALSUM is described in Figure 1, and the plate diagram corresponding to this generative story is shown in Figure 2. DUALSUM is an LDA-like model, where topic distributions are multinomial distributions over words and topics that are sampled from Dirichlet distributions. We use β = (β_G, β_D, β_A, β_B) as symmetric priors for the Dirichlet distributions generating the word distributions. In our experiments, we set β_G = 0.1 and β_D = β_A = β_B = 0.001. A greater value is assigned to β_G in order to reflect the intuition that there should be more words in the background than in the other distributions, so the mass is expected to be shared on a larger number of words.

Unlike for the word distributions, mixing probabilities are drawn from a Dirichlet distribution with asymmetric priors. The prior knowledge about the origin of words in the base and update collections is again encoded at the level of the hyper-parameters. For example, if we set α^A = (5, 3, 2, 0), this would reflect the intuition that, on average, in the base collections, 50% of the words originate from the background distribution, 30% from the document-specific distribution, and 20% from the joint topic. Similarly, if we set α^B = (5, 2, 2, 1), the prior reflects the assumption that, on average, in the update collections, 50% of the words originate from the background distribution, 20% from the document-specific distribution, 20% from the joint topic, and 10% from the novel, update topic.^3 The priors we have actually used are reported in Section 4.

3.2 Learning and inference

In order to find the optimal model parameters, the following equation needs to be computed:

$$p(\mathbf{z}, \phi, \psi \mid \mathbf{w}, \mathbf{u}) = \frac{p(\mathbf{z}, \phi, \psi, \mathbf{w}, \mathbf{u})}{p(\mathbf{w}, \mathbf{u})}$$

Omitting hyper-parameters for notational simplicity, the joint distribution over the observed variables is:

$$p(\mathbf{w}, \mathbf{u}) = p(\phi_G) \prod_{c} p(\phi_{A_c})\, p(\phi_{B_c}) \prod_{d} p(u_{cd})\, p(\phi_{cd}) \int_{\Delta} p(\psi_{cd} \mid u_{cd}) \prod_{n} \sum_{z_{cdn}} p(w_{cdn} \mid \phi_{z_{cdn}})\, p(z_{cdn} \mid \psi_{cd})\, d\psi_{cd}$$

where Δ denotes the 4-dimensional simplex.^4 Since this equation is intractable, we need to perform approximate inference in order to estimate the model parameters. A number of Bayesian statistical inference techniques can be used to address this problem.

3. To highlight the difference between asymmetric and symmetric priors we put the indices in superscript and subscript respectively.
4. Remember that, for base documents, words cannot be generated by the update topic, so Δ denotes the 3-dimensional simplex for base documents.
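To make the generative story in Figure 1 concrete, the following sketch forward-samples one base and one update document for a single collection pair. It is a minimal illustration under assumed settings: the toy vocabulary size, the random seed, the variable names, and the use of a tiny pseudo-count in place of the structural zero in α^A are all choices made here, not details of the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 50                      # toy vocabulary size
K = 4                       # topics: 0=background, 1=document-specific, 2=joint, 3=update
beta = np.array([0.1, 0.001, 0.001, 0.001])   # symmetric priors on the word distributions
alpha = {"A": np.array([5.0, 3.0, 2.0, 1e-10]),  # ~ (5, 3, 2, 0): base docs never use the update topic
         "B": np.array([5.0, 2.0, 2.0, 1.0])}

def sample_document(doc_type, phi_G, phi_joint, phi_update, length=30):
    """Forward-sample one document of the given type ('A' base or 'B' update)."""
    phi_doc = rng.dirichlet([beta[1]] * V)   # document-specific word distribution
    psi = rng.dirichlet(alpha[doc_type])     # topic-mixing proportions for this document
    phis = [phi_G, phi_doc, phi_joint, phi_update]
    words = []
    for _ in range(length):
        z = rng.choice(K, p=psi)                  # pick a topic from the mixture
        words.append(rng.choice(V, p=phis[z]))    # pick a word from that topic
    return words

phi_G = rng.dirichlet([beta[0]] * V)       # shared background distribution
phi_joint = rng.dirichlet([beta[2]] * V)   # joint topic of one collection pair
phi_update = rng.dirichlet([beta[3]] * V)  # update topic of the same pair
base_doc = sample_document("A", phi_G, phi_joint, phi_update)
update_doc = sample_document("B", phi_G, phi_joint, phi_update)
```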

Variational approaches (Blei et al., 2003) and collapsed Gibbs sampling (Griffiths and Steyvers, 2004) are common techniques for approximate inference in Bayesian models. They offer different advantages: the variational approach is arguably faster computationally, but the Gibbs sampling approach is in principle more accurate since it asymptotically approaches the correct distribution (Porteous et al., 2008). In this section, we provide details on a collapsed Gibbs sampling strategy to infer the model parameters of DUALSUM for a given dataset.

Collapsed Gibbs sampling is a particular case of Markov Chain Monte Carlo (MCMC) that involves repeatedly sampling a topic assignment for each word in the corpus. A single iteration of the Gibbs sampler is completed after sampling a new topic for each word based on the previous assignment. In a collapsed Gibbs sampler, the model parameters are integrated out (or collapsed), allowing us to only sample z. Let us call w_cdn the n-th word in document d in collection c, and z_cdn its topic assignment. For Gibbs sampling, we need to calculate p(z_cdn | w, u, z_¬cdn), where z_¬cdn denotes the random vector of topic assignments except the assignment z_cdn:

$$p(z_{cdn} = j \mid \mathbf{w}, \mathbf{u}, \mathbf{z}_{\neg cdn}, \alpha^{A}, \alpha^{B}, \beta) \;\propto\; \frac{n^{(w_{cdn})}_{\neg cdn, j} + \beta_{j}}{\sum_{v=1}^{V} n^{(v)}_{\neg cdn, j} + V \beta_{j}} \cdot \frac{n^{(cd)}_{\neg cdn, j} + \alpha^{u_{cd}}_{j}}{\sum_{k \in K} \left( n^{(cd)}_{\neg cdn, k} + \alpha^{u_{cd}}_{k} \right)}$$

where K = {G, cd, A_c, B_c}, n^{(v)}_{¬cdn,j} denotes the number of times word v is assigned to topic j excluding the current assignment of word w_cdn, and n^{(cd)}_{¬cdn,j} denotes the number of words in document d of collection c that are assigned to topic j, excluding the current assignment of word w_cdn.

After each sampling iteration, the model parameters can be estimated using the following formulas:^5

$$\phi_{kw} = \frac{n^{(w)}_{k} + \beta_{k}}{\sum_{v=1}^{V} n^{(v)}_{k} + V \beta_{k}} \qquad\qquad \psi_{kcd} = \frac{n^{(cd)}_{k} + \alpha_{k}}{n^{(cd)}_{\cdot} + \sum_{k' \in K} \alpha_{k'}}$$

where k ∈ K, n^{(v)}_k denotes the number of times word v is assigned to topic k, and n^{(cd)}_k denotes the number of words in document d of collection c that are assigned to topic k.

By the strong law of large numbers, the average of sample parameters should converge towards the true expected value of the model parameter. Therefore, good estimates of the model parameters can be obtained by averaging over the sampled values. As suggested by Gamerman and Lopes (2006), we have set a lag (20 iterations) between samples in order to reduce auto-correlation between samples. Our sampler also discards the first 100 iterations as a burn-in period in order to avoid averaging over samples that are still strongly influenced by the initial assignment.

4 Experiments in Update Summarization

The Bayesian graphical model described in the previous section can be run over a set of news collections to learn the background distribution, a joint distribution for each collection, an update distribution for each collection and the document-specific distributions. Once this is done, one of the learned collections can be used to generate the summary that best approximates this collection, using the greedy algorithm described by Haghighi and Vanderwende (2009). Still, there are some parameters that can be defined and which affect the results obtained:

- DUALSUM's choice of hyper-parameters affects how the topics are learned.
- The documents can be represented with n-grams of different lengths.
- It is possible to generate a summary that approximates the joint distribution, the update-only distribution, or a combination of both.

This section describes how these parameters have been tuned.

4.1 Parameter tuning

We use the TAC 2008 and 2009 update task datasets as training set for tuning the hyper-parameters for the model, namely the pseudo-counts for the two Dirichlet priors that affect the topic mix assignment for each document. By performing a grid search over a large set of possible hyper-parameters, these have been fixed to

5. The interested reader is invited to consult (Wang, 2011) for more details on using Gibbs sampling for LDA-like models.
els sible hyper-parameters, these have been fixed to

α^A = (90, 190, 50, 0) and α^B = (90, 170, 45, 25)
as the values that produced the best ROUGE-2
score on those two datasets.
Regarding the base collection, this can be inter-
preted as setting as prior knowledge that roughly
27% of the words in the original dataset originate
from the background distribution, 58% from the
document-specific distributions, and 15% from
the topic of the original collection. We remind
the reader that the last value in A is set to zero
because, due to the problem definition, the origi-
nal collection must have no words generated from
the update topic, which reflects the most recent
developments that are still not present in the base Figure 3: Variation in ROUGE-2 score in the TAC-
2010 dataset as we change the mixture weight for the
collections A.
joined topic model between 0 and 1.
Regarding the update set, 27% of the words are
assumed to originate again from the background
distribution, 51% from the document-specific dis-
tributions, 14% from a topic in common with
the original collection, and 8% from the update-
specific topic. One interesting fact to note from
these settings is that most of the words belong to
topics that are specific to single documents (58%
and 51% respectively for both sets A and B) and
to the background distribution, whereas the joint
and update topics generate a much smaller, lim-
ited set of words. This helps these two distribu-
tions to be more focused.
The other settings mentioned at the beginning
of this section have been tuned using the TAC- Figure 4: Effect of the mixture weight in ROUGE-2
scores (TAC-2010 dataset). Results are reported us-
2010 dataset, which we reserved as our develop-
ing bigrams (above, blue), unigrams (middle, red) and
ment set. Once the different document-specific trigrams (below, yellow).
and collection-specific distributions have been ob-
tained, we have to choose the target distribu-
tion T with which the possible summaries will
be compared using the KL metric. Usually, the roughly the interval [0.6, 0.8], and from that point
human-generated update summaries not only in- performance slowly degrades as at the right part
clude the terms that are very specific about the last of the curve the update model is given very little
developments, but they also include a little back- importance in generating the summary. Based on
ground regarding the developing event. There- these results, from this point onwards, the mixture
fore, we try, for KLSum, a simple mixture be- weight has been set to 0.7. Note that using only
tween the joint topic (A ) and the update topic the joint distribution (setting the mixture weight
(B ). to 1.0) also produces reasonable results, hinting
Figure 3 shows the ROUGE-2 results obtained that it successfully incorporates the most impor-
as we vary the mixture weight between the joint tant n-grams from across the base and the update
A distribution and the update-specific B distri- collections at the same time.
bution. As can be seen at the left of the curve, us- A second parameter is the size of the n-grams
ing only the update-specific model, which disre- for representing the documents. The original
gards the generic words about the topic described, implementations of S UM BASIC (Nenkova and
produces much lower results. The results improve Vanderwende, 2005) and T OPIC S UM (Haghighi
as the relative weight of the joined topic model and Vanderwende, 2009) were defined over sin-

gle words (unigrams). Still, Haghighi and Van- automatically evaluated using the TAC-2011
derwende (2009) report some improvements in dataset. Table 1 shows the ROUGE results ob-
the ROUGE-2 score when representing words as tained. Because of the non-deterministic nature
a bag of bigrams, and Darling (2010) mention of Gibbs sampling, the results reported here are
similar improvements when running S UM BASIC the average of five runs for all the baselines and
with bigrams. Figure 4 shows the effect on the for D UAL S UM. D UAL S UM outperforms two of
ROUGE-2 curve when we switch to using uni- the baselines in all three ROUGE metrics, and it
grams and trigrams. As stated in previous work, also outperforms T OPIC S UMB on two of the three
using bigrams has better results than using uni- metrics.
grams. Using trigrams was worse than either of
them. This is probably because trigrams are too The top three systems in TAC-2011 have been
specific and the document collections are small, included for comparison. The results between
so the models are more likely to suffer from data these three systems, and between them and D U -
AL S UM , are all indistinguishable at 95% confi-
sparseness.
dence. Note that the best baseline, T OPIC S UMB ,
4.2 Baselines is quite competitive, with results that are indis-
D UAL S UM is a modification of T OPIC S UM de- tinguishable to the top participants in this years
signed specifically for the case of update sum- evaluation. Note as well that, because we have
marization, by modifying T OPIC S UMs graphical five different runs for our algorithms, whereas
model in a way that captures the dependency be- we just have one output for the TAC participants,
tween the joint and the update collections. Still, it the confidence intervals in the second case were
is important to discover whether the new graphi- slightly bigger when checking for statistical sig-
cal model actually improves over simpler applica- nificance, so it is slightly harder for these systems
tions of T OPIC S UM to this task. The three base- to assert that they outperform the baselines with
lines that we have considered are: 95% confidence. These results would have made
D UAL S UM the second best system for ROUGE-
Running T OPIC S UM on the set of collections 1 and ROUGE-SU4, and the third best system in
containing only the update documents. We terms of ROUGE-2.
call this run T OPIC S UMB .
The supplementary materials contain a detailed
Running TOPICSUM on the set of collections example of the topic model obtained for the
containing both the base and the update doc- background in the TAC-2011 dataset, and the base
uments. Contrary to the previous run, the and update models for collection D1110. As
topic model for each collection in this run expected, the top unigrams and bigrams are all
will contain information relevant to the base closed-class words and auxiliary verbs. Because
events. We call this run T OPIC S UMAB . trigrams are longer, background trigrams actu-
Running T OPIC S UM twice, once on the set ally include some content words (e.g. university
of collections containing the update docu- or director). Regarding the models for A and
ments, and the second time on the set of B , the base distribution contains words related
collections containing the base documents. to the original event of an earthquake in Sichuan
Then, for each collection, the obtained base province (China), and the update distribution fo-
and update models are combined in a mix- cuses more on the official (updated) death toll
ture model using a mixture weight between numbers. It can be noted here that the tokenizer
zero and one. The weight has been tuned us- we used is very simple (splitting tokens separated
ing TAC-2010 as development set. We call with white-spaces or punctuation) so that num-
this run T OPIC S UMA +T OPIC S UMB . bers such as 7.9 (the magnitude of the earthquake)
and 12,000 or 14,000 are divided into two tokens.
4.3 Automatic evaluation We thought this might be a reason for the bigram-based
D UAL S UM and the three baselines6 have been system to produce better results, but we ran the
6
Using the settings obtained in the previous section, hav-
summarizers with a numbers-aware tokenizer and
ing been optimized on the datasets from previous TAC com- the statistical differences between versions still
petitions. hold.
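The unigram/bigram/trigram comparison discussed above amounts to changing the units over which the summary and target distributions are estimated. A small sketch of building such bag-of-n-gram distributions follows; the function names and the toy example are illustrative only, not taken from the paper's code.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_distribution(sentences, n=2, smoothing=0.01):
    """Smoothed bag-of-n-grams distribution over a collection of tokenized sentences."""
    counts = Counter(g for s in sentences for g in ngrams(s, n))
    total = sum(counts.values()) + smoothing * len(counts)
    return {g: (c + smoothing) / total for g, c in counts.items()}

# The same collection represented with unigrams, bigrams and trigrams.
collection = [["the", "earthquake", "hit", "sichuan", "province"],
              ["the", "death", "toll", "rose", "again"]]
uni = ngram_distribution(collection, n=1)
bi = ngram_distribution(collection, n=2)
tri = ngram_distribution(collection, n=3)
```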

Method                           R-1      R-2      R-SU4
TOPICSUM_B                       0.3442   0.0868   0.1194
TOPICSUM_AB                      0.3385   0.0809   0.1159
TOPICSUM_A + TOPICSUM_B          0.3328   0.0770   0.1125
DUALSUM                          0.3575   0.0924   0.1285
TAC-2011 best system (Peer 43)   0.3559   0.0958   0.1308
TAC-2011 2nd system (Peer 25)    0.3582   0.0926   0.1276
TAC-2011 3rd system (Peer 17)    0.3558   0.0886   0.1279

Table 1: Results on the TAC-2011 dataset. Significance markers (p < 0.05) indicate that a result is significantly better than TOPICSUM_B, TOPICSUM_AB and TOPICSUM_A + TOPICSUM_B, respectively.

4.4 Manual evaluation

While the ROUGE metrics provide an arguable estimate of the informativeness of a generated summary, they do not account for other important aspects such as the readability or the overall responsiveness. To evaluate such aspects, a manual evaluation is required. A fairly standard approach for manual evaluation is through pairwise comparison (Haghighi and Vanderwende, 2009; Celikyilmaz and Hakkani-Tur, 2011).

In this approach, raters are presented with pairs of summaries generated by two systems and they are asked to say which one is best with respect to some aspects. We followed a similar approach to compare DualSum with Peer 43, the best system with respect to ROUGE-2, on the TAC 2011 dataset. For each collection, raters were presented with three summaries: a reference summary randomly chosen from the model summaries, and the summaries generated by Peer 43 and DualSum. They were asked to read the summaries and say which one of the two generated summaries is best with respect to: 1) Overall responsiveness: which summary is best overall (both in terms of content and fluency), 2) Focus: which summary contains fewer irrelevant details, 3) Coherence: which summary is more coherent, and 4) Non-redundancy: which summary repeats the same information less. For each aspect, the rater could also reply that both summaries were of the same quality.

For each of the 44 collections in TAC-2011, 3 ratings were collected from raters.^7 Results are reported in Table 2. DualSum outperforms Peer 43 in three aspects, including Overall Responsiveness, which aggregates all the other scores and can be considered the most important one. Regarding Non-redundancy, DualSum and Peer 43 obtain similar results, but the majority of raters found no difference between the two systems.

7. In total 132 raters participated in the task via our own crowdsourcing platform, not mentioned yet for blind review.

                         Best system
Aspect                   Peer 43   Same   DualSum
Overall Responsiveness   39        25     68
Focus                    41        22     69
Coherence                39        30     63
Non-redundancy           40        53     39

Table 2: Results of the side-by-side manual evaluation.

Fleiss' κ has been used to measure the inter-rater agreement. For each aspect, we observe κ ≈ 0.2, which corresponds to a slight agreement; but if we focus on tasks where the 3 ratings reflect a preference for either of the two systems, then κ ≈ 0.5, which indicates moderate agreement.

4.5 Efficiency and applicability

The running time for summarizing the TAC collections with DualSum, averaged over a hundred runs, is 4.97 minutes, using one core (2.3 GHz). Memory consumption was 143 MB.

It is important to note as well that, while TOPICSUM incorporates an additional layer to model topic distributions at the sentence level, we noted early in our experiments that this did not improve the performance (as evaluated with ROUGE) and consequently relaxed that assumption in DualSum. This resulted in a simplification of the model and a reduction of the sampling time.

While five minutes is fast enough to be able to experiment and tune parameters with the TAC collections, it would be quite slow for a real-time summarization system able to generate summaries on request. As can be seen from the plate diagram in Figure 2, all the collections are generated independently from each other. The only exception, for which it is necessary to have all

the collections available at the same time dur- pling to on-line settings. By fixing the back-
ing Gibbs sampling, is the background distribu- ground distribution we are able to summarize a
tion, which is estimated from all the collections distribution in only three seconds, which seems
simultaneously, roughly representing 27% of the reasonable for some on-line applications.
words, that should appear distributed across all As future work, we plan to explore the use of
documents. D UAL S UM to generate more general contrastive
The good news is that this background distri- summaries, by identifying differences between
bution will contain closed-class words in the lan- collections whose differences are not of temporal
guage, which are domain-independent (see sup- nature.
plementary material for examples). Therefore,
we can generate this distribution from one of Acknowledgments
the TAC datasets only once, and then it can be
The research leading to these results has received
reused. Fixing the background distribution to a
funding from the European Unions Seventh
pre-computed value requires a very simple mod-
Framework Programme (FP7/2007-2013) under
ification of the Gibbs sampling implementation,
grant agreement number 257790. We would also
which just needs to adjust at each iteration the
like to thank Yasemin Altun and the anonymous
collection and document-specific models, and the
reviewers for their useful comments on the draft
topic assignment for the words.
of this paper.
Using this modified implementation, it is now
possible to summarize a single collection inde-
pendently. The summarization of a single col- References
lection of the size of the TAC collections is re-
David M. Blei, Andrew Y. Ng, and Michael I. Jordan.
duced on average to only three seconds on the 2003. Latent dirichlet allocation. J. Mach. Learn.
same hardware settings, allowing the use of this Res., 3:9931022, March.
summarizer in an on-line application. Florian Boudin, Marc El-Beze, and Juan-Manuel
Torres-Moreno. 2008. A scalable MMR approach
5 Conclusions to sentence scoring for multi-document update sum-
marization. In Coling 2008: Companion volume:
The main contribution of this paper is D UAL S UM,
Posters, pages 2326, Manchester, UK, August.
a new topic model that is specifically designed to Coling 2008 Organizing Committee.
identify and extract novelty from pairs of collec- J. Carbonell and J. Goldstein. 1998. The use of mmr,
tions. diversity-based reranking for reordering documents
It is inspired by T OPIC S UM (Haghighi and and producing summaries. In Proceedings of the
Vanderwende, 2009), with two main changes: 21st annual international ACM SIGIR conference
Firstly, while T OPIC S UM can only learn the main on Research and development in information re-
topic of a collection, D UAL S UM focuses on the trieval, pages 335336. ACM.
differences between two collections. Secondly, Asli Celikyilmaz and Dilek Hakkani-Tur. 2011. Dis-
covery of topically coherent sentences for extrac-
while T OPIC S UM incorporates an additional layer
tive summarization. In Proceedings of the 49th An-
to model topic distributions at the sentence level, nual Meeting of the Association for Computational
we have found that relaxing this assumption and Linguistics: Human Language Technologies, pages
modeling the topic distribution at document level 491499, Portland, Oregon, USA, June. Associa-
does not decrease the ROUGE scores and reduces tion for Computational Linguistics.
the sampling time. Chaitanya Chemudugunta, Padhraic Smyth, and Mark
The generated summaries, tested on the TAC- Steyvers. 2006. Modeling general and specific as-
2011 collection, would have resulted on the sec- pects of documents with a probabilistic topic model.
ond and third position in the last summarization In NIPS, pages 241248.
W.M. Darling. 2010. Multi-document summarization
competition according to the different ROUGE
from first principles. In Proceedings of the third
scores. This would make D UAL S UM statistically Text Analysis Conference, TAC-2010. NIST.
indistinguishable from the top system with 0.95 Hal Daume, III and Daniel Marcu. 2006. Bayesian
confidence. query-focused summarization. In Proceedings of
We also propose and evaluate the applicability the 21st International Conference on Computa-
of an alternative implementation of Gibbs sam- tional Linguistics and the 44th annual meeting

of the Association for Computational Linguistics, search, Redmond, Washington, Tech. Rep. MSR-TR-
ACL-2006, pages 305312, Stroudsburg, PA, USA. 2005-101.
Association for Computational Linguistics. Ian Porteous, David Newman, Alexander Ihler, Arthur
Gunes Erkan and Dragomir R. Radev. 2004. Lexrank: Asuncion, Padhraic Smyth, and Max Welling.
graph-based lexical centrality as salience in text 2008. Fast collapsed Gibbs sampling for latent
summarization. J. Artif. Int. Res., 22:457479, De- Dirichlet allocation. In KDD 08: Proceeding of
cember. the 14th ACM SIGKDD international conference on
S. Fisher and B. Roark. 2008. Query-focused super- Knowledge discovery and data mining, pages 569
vised sentence ranking for update summaries. In 577, New York, NY, USA, August. ACM.
Proceedings of the first Text Analysis Conference, Dragomir R. Radev, Hongyan Jing, Malgorzata Stys,
TAC-2008. and Daniel Tam. 2004. Centroid-based summariza-
Dani Gamerman and Hedibert F. Lopes. 2006. tion of multiple documents. Inf. Process. Manage.,
Markov Chain Monte Carlo: Stochastic Simulation 40:919938, November.
for Bayesian Inference. Chapman and Hall/CRC. Frank Schilder, Ravikumar Kondadadi, Jochen L. Lei-
Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and dner, and Jack G. Conrad. 2008. Thomson reuters
Jaime Carbonell. 1999. Summarizing text docu- at tac 2008: Aggressive filtering with fastsum for
ments: sentence selection and evaluation metrics. update and opinion summarization. In Proceedings
In Proceedings of the 22nd annual international of the first Text Analysis Conference, TAC-2008.
ACM SIGIR conference on Research and develop- Chao Shen and Tao Li. 2010. Multi-document sum-
ment in information retrieval, SIGIR 99, pages marization via the minimum dominating set. In
121128, New York, NY, USA. ACM. Proceedings of the 23rd International Conference
T. L. Griffiths and M. Steyvers. 2004. Finding scien- on Computational Linguistics, COLING 10, pages
tific topics. Proceedings of the National Academy 984992, Stroudsburg, PA, USA. Association for
of Sciences, 101(Suppl. 1):52285235, April. Computational Linguistics.
A. Haghighi and L. Vanderwende. 2009. Exploring Ian Soboroff and Donna Harman. 2005. Novelty de-
content models for multi-document summarization. tection: the trec experience. In Proceedings of the
In Proceedings of Human Language Technologies: conference on Human Language Technology and
The 2009 Annual Conference of the North Ameri- Empirical Methods in Natural Language Process-
can Chapter of the Association for Computational ing, HLT 05, pages 105112, Stroudsburg, PA,
Linguistics, pages 362370. Association for Com- USA. Association for Computational Linguistics.
putational Linguistics. Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong
Feng Jin, Minlie Huang, and Xiaoyan Zhu. 2010. The Gong. 2009. Multi-document summarization us-
thu summarization systems at tac 2010. In Proceed- ing sentence-based topic models. In Proceedings
ings of the third Text Analysis Conference, TAC- of the ACL-IJCNLP 2009 Conference Short Papers,
2010. ACLShort 09, pages 297300, Stroudsburg, PA,
Kevin Lerman and Ryan McDonald. 2009. Con- USA. Association for Computational Linguistics.
trastive summarization: an experiment with con- Yi Wang. 2011. Distributed gibbs sampling of latent
sumer reviews. In Proceedings of Human Lan- dirichlet allocation: The gritty details.
guage Technologies: The 2009 Annual Conference Li Wenjie, Wei Furu, Lu Qin, and He Yanxiang. 2008.
of the North American Chapter of the Association Pnr2: ranking sentences with positive and nega-
for Computational Linguistics, Companion Volume: tive reinforcement for query-oriented update sum-
Short Papers, NAACL-Short 09, pages 113116, marization. In Proceedings of the 22nd Interna-
Stroudsburg, PA, USA. Association for Computa- tional Conference on Computational Linguistics -
tional Linguistics. Volume 1, COLING 08, pages 489496, Strouds-
Xuan Li, Liang Du, and Yi-Dong Shen. 2011. Graph- burg, PA, USA. Association for Computational Lin-
based marginal ranking for update summarization. guistics.
In Proceedings of the Eleventh SIAM International
Conference on Data Mining. SIAM / Omnipress.
Rebecca Mason and Eugene Charniak. 2011. Ex-
tractive multi-document summaries should explic-
itly not contain document-specific content. In Pro-
ceedings of the Workshop on Automatic Summariza-
tion for Different Genres, Media, and Languages,
WASDGML 11, pages 4954, Stroudsburg, PA,
USA. Association for Computational Linguistics.
A. Nenkova and L. Vanderwende. 2005. The im-
pact of frequency on summarization. Microsoft Re-

Large-Margin Learning of Submodular Summarization Models

Ruben Sipos Pannaga Shivaswamy Thorsten Joachims


Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science
Cornell University Cornell University Cornell University
Ithaca, NY 14853 USA Ithaca, NY 14853 USA Ithaca, NY 14853 USA
rs@cs.cornell.edu pannaga@cs.cornell.edu tj@cs.cornell.edu

Abstract Bilmes (2010), using a submodular scoring func-


tion based on inter-sentence similarity. On the one
In this paper, we present a supervised hand, this scoring function rewards summaries
learning approach to training submodu- that are similar to many sentences in the origi-
lar scoring functions for extractive multi- nal documents (i.e. promotes coverage). On the
document summarization. By taking a other hand, it penalizes summaries that contain
structured prediction approach, we pro-
sentences that are similar to each other (i.e. dis-
vide a large-margin method that directly
optimizes a convex relaxation of the de- courages redundancy). While obtaining the exact
sired performance measure. The learning summary that optimizes the objective is computa-
method applies to all submodular summa- tionally hard, they show that a greedy algorithm
rization methods, and we demonstrate its is guaranteed to compute a good approximation.
effectiveness for both pairwise as well as However, their work does not address how to
coverage-based scoring functions on mul- select a good inter-sentence similarity measure,
tiple datasets. Compared to state-of-the-
leaving this problem as well as selecting an appro-
art functions that were tuned manually, our
method significantly improves performance
priate trade-off between coverage and redundancy
and enables high-fidelity models with num- to manual tuning.
ber of parameters well beyond what could To overcome this problem, we propose a su-
reasonably be tuned by hand. pervised learning method that can learn both
the similarity measure as well as the cover-
age/reduncancy trade-off from training data. Fur-
1 Introduction thermore, our learning algorithm is not limited to
the model of Lin and Bilmes (2010), but applies to
Automatic document summarization is the prob-
all monotone submodular summarization models.
lem of constructing a short text describing the
Due to the diminishing-returns property of mono-
main points in a (set of) document(s). Exam-
tone submodular set functions and their computa-
ple applications range from generating short sum-
tional tractability, this class of functions provides
maries of news articles, to presenting snippets for
a rich space for designing summarization meth-
URLs in web-search. In this paper we focus on
ods. To illustrate the generality of our approach,
extractive multi-document summarization, where
we also provide experiments for a coverage-based
the final summary is a subset of the sentences
model originally developed for diversified infor-
from multiple input documents. In this way, ex-
mation retrieval (Swaminathan et al., 2009).
tractive summarization avoids the hard problem
In general, our method learns a parameterized
of generating well-formed natural-language sen-
monotone submodular scoring function from su-
tences, since only existing sentences from the in-
pervised training data, and its implementation is
put documents are presented as part of the sum-
available for download.1 Given a set of docu-
mary.
ments and their summaries as training examples,
A current state-of-the-art method for document
1
summarization was recently proposed by Lin and http://www.cs.cornell.edu/rs/sfour/

we formulate the learning problem as a struc- concept of eigenvector centrality in a graph of
tured prediction problem and derive a maximum- sentence similarities. Similarly, TextRank (Mi-
margin algorithm in the structural support vec- halcea and Tarau, 2004) is also graph based rank-
tor machine (SVM) framework. Note that, un- ing system for identification of important sen-
like other learning approaches, our method does tences in a document by using sentence similar-
not require a heuristic decomposition of the learn- ity and PageRank (Brin and Page, 1998). Sen-
ing task into binary classification problems (Ku- tence extraction can also be implemented using
piec et al., 1995), but directly optimizes a struc- other graph based scoring approaches (Mihalcea,
tured prediction. This enables our algorithm to di- 2004) such as HITS (Kleinberg, 1999) and po-
rectly optimize the desired performance measure sitional power functions. Graph based methods
(e.g. ROUGE) during training. Furthermore, our can also be paired with clustering such as in Col-
method is not limited to linear-chain dependen- labSum (Wan et al., 2007). This approach first
cies like (Conroy and Oleary, 2001; Shen et al., uses clustering to obtain document clusters and
2007), but can learn any monotone submodular then uses graph based algorithm for sentence se-
scoring function. lection which includes inter and intra-document
This ability to easily train summarization mod- sentence similarities. Another clustering-based
els makes it possible to efficiently tune models algorithm (Nomoto and Matsumoto, 2001) is a
to various types of document collections. In par- diversity-based extension of MMR that finds di-
ticular, we find that our learning method can re- versity by clustering and then proceeds to reduce
liably tune models with hundreds of parameters redundancy by selecting a representative for each
based on a training set of about 30 examples. cluster.
This increases the fidelity of models compared The manually tuned sentence pairwise model
to their hand-tuned counterparts, showing sig- (Lin and Bilmes, 2010; Lin and Bilmes, 2011) we
nificantly improved empirical performance. We took inspiration from is based on budgeted sub-
provide a detailed investigation into the sources modular optimization. A summary is produced
of these improvements, identifying further direc- by maximizing an objective function that includes
tions for research. coverage and redundancy terms. Coverage is de-
fined as the sum of sentence similarities between
2 Related work the selected summary and the rest of the sen-
Work on extractive summarization spans a large tences, while redundancy is the sum of pairwise
range of approaches. Starting with unsupervised intra-summary sentence similarities. Another ap-
methods, one of the widely known approaches proach based on submodularity (Qazvinian et al.,
is Maximal Marginal Relevance (MMR) (Car- 2010) relies on extracting important keyphrases
bonell and Goldstein, 1998). It uses a greedy ap- from citation sentences for a given paper and us-
proach for selection and considers the trade-off ing them to build the summary.
between relevance and redundancy. Later it was In the supervised setting, several early methods
extended (Goldstein et al., 2000) to support multi- (Kupiec et al., 1995) made independent binary de-
document settings by incorporating additional in- cisions whether to include a particular sentence
formation available in this case. Good results can in the summary or not. This ignores dependen-
be achieved by reformulating this as a knapsack cies between sentences and can result in high re-
packing problem and solving it using dynamic dundancy. The same problem arises when using
programing (McDonald, 2007). Alternatively, we learning-to-rank approaches such as ranking sup-
can use annotated phrases as textual units and se- port vector machines, support vector regression
lect a subset that covers most concepts present and gradient boosted decision trees to select the
in the input (Filatova and Hatzivassiloglou, 2004) most relevant sentences for the summary (Metzler
(which can also be achieved by our coverage scor- and Kanungo, 2008).
ing function if it is extended with appropriate fea- Introducing some dependencies can improve
tures). the performance. One limited way of introduc-
A popular stochastic graph-based summariza- ing dependencies between sentences is by using a
tion method is LexRank (Erkan and Radev, 2004). linear-chain HMM. The HMM is assumed to pro-
It computes sentence importance based on the duce the summary by having a chain transitioning

between summarization and non-summarization however it uses vine-growth model and employs
states (Conroy and O'Leary, 2001) while travers- search to find the best policy which is then used
ing the sentences in a document. A more expres- to generate a summary.
sive approach is using a CRF for sequence label- A specific subclass of submodular (but not
ing (Shen et al., 2007) which can utilize larger and monotone) functions are defined by Determinan-
not necessarily independent feature spaces. The tal Point Processes (DPPs) (Kulesza and Taskar,
disadvantage of using linear chain models, how- 2011). While they provide an elegant probabilis-
ever, is that they represent the summary as a se- tic interpretation of the resulting summarization
quence of sentences. Dependencies between sen- models, the lack of monotonicity means that no
tences that are far away from each other cannot efficient approximation algorithms are known for
be modeled efficiently. In contrast to such lin- computing the highest-scoring summary.
ear chain models, our approach on submodular
scoring functions can model long-range depen- 3 Submodular document summarization
dencies. In this way our method can use proper- In this section, we illustrate how document sum-
ties of the whole summary when deciding which marization can be addressed using submodular set
sentences to include in it. functions. The set of documents to be summa-
More closely related to our work is that of Li rized is split into a set of individual sentences
et al. (2009). They use the diversified retrieval x = {s1 , ..., sn }. The summarization method
method proposed in Yue and Joachims (2008) for then selects a subset y x of sentences that max-
document summarization. Moreover, they assume imizes a given scoring function Fx : 2x R
that subtopic labels are available so that additional subject to a budget constraint (e.g. less than B
constraints for diversity, coverage and balance can characters).
be added to the structural SVM learning prob-
lem. In contrast, our approach does not require the y = arg max Fx (y) s.t. |y| B (1)
knowledge of subtopics (thus allowing us to ap- yx

ply it to a wider range of tasks) and avoids adding


In the following we restrict the admissible scoring
additional constraints (simplifying the algorithm).
functions F to be submodular.
Furthermore, it can use different submodular ob-
jective functions, for example word coverage and Definition 1. Given a set x, a function F : 2x
sentence pairwise models described later in this R is submodular iff for all u U and all sets s
paper. and t such that s t x, we have,
Another closely related work also takes a max-
F (s {u}) F (s) F (t {u}) F (t).
margin discriminative learning approach in the
structural SVM framework (Berg-Kirkpatrick et Intuitively, this definition says that adding u to
al., 2011) or by using MIRA (Martins and Smith, a subset s of t increases f at least as much as
2009) to learn the parameters for summarizing adding it to t. Using two specific submodular
a set of documents. However, they do not con- functions as examples, the following sections il-
sider submodular functions, but instead solve an lustrate how this diminishing returns property nat-
Integer Linear Program (ILP) or an approxima- urally reflects the trade-off between maximizing
tion thereof. The ILP encodes a compression coverage while minimizing redundancy.
model where arbitrary parts of the parse trees
of sentences in the summary can be cut and re- 3.1 Pairwise scoring function
moved. This allows them to select parts of sen- The first submodular scoring function we con-
tences and yet preserve some gramatical struc- sider was proposed by Lin and Bilmes (2010) and
ture. Their work focuses on learning a particular is based on a model of pairwise sentence similar-
compression model based on ILP inference, while ities. It scores a summary y using the following
our work explores learning a general and large function, which Lin and Bilmes (2010) show is
class of sentence selection models using submod- submodular:
ular optimization. The third notable approach X X
uses SEARN (Daume, 2006) to learn parameters Fx (y) = (i, j) (i, j). (2)
for joint summarization and compression model, ix\y,jy i,jy:i6=j

Figure 1: Illustration of the pairwise model. Not all Figure 2: Illustration of the coverage model. Word
edges are shown for clarity purposes. Edge thickness border thickness represents importance.
denotes the similarity score.
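The two models illustrated in Figure 1 (pairwise) and Figure 2 (coverage) can be sketched directly from their definitions in Eqns. (2) and (3). The function and variable names below (pairwise_score, coverage_score, the similarity matrix sim and the word-weight map weight), and the toy data, are assumptions made here for illustration, not the paper's released implementation.

```python
def pairwise_score(selected, all_ids, sim, trade_off):
    """Pairwise model: reward similarity of the summary to the excluded sentences,
    penalize similarity among the selected sentences (Eq. 2)."""
    selected = set(selected)
    coverage = sum(sim[i][j] for i in all_ids if i not in selected for j in selected)
    redundancy = sum(sim[i][j] for i in selected for j in selected if i != j)
    return coverage - trade_off * redundancy

def coverage_score(selected, sentence_words, weight):
    """Coverage model: sum of importance weights of the distinct words covered
    by the selected sentences (Eq. 3)."""
    covered = set()
    for i in selected:
        covered |= sentence_words[i]
    return sum(weight.get(w, 0.0) for w in covered)

# Toy example with three sentences.
sim = [[0, 0.2, 0.1], [0.2, 0, 0.4], [0.1, 0.4, 0]]
sentence_words = {0: {"quake", "toll"}, 1: {"toll", "rescue"}, 2: {"aid"}}
weight = {"quake": 2.0, "toll": 1.5, "rescue": 1.0, "aid": 0.5}
print(pairwise_score([1], range(3), sim, trade_off=0.5))   # 0.6
print(coverage_score([1, 2], sentence_words, weight))       # 3.0
```

Both functions are monotone submodular in the set of selected sentences, which is what allows the greedy procedure described later to come with an approximation guarantee.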
An example of how a summary is scored is il-
In the above equation, (i, j) 0 denotes a mea- lustrated in the Figure 2. Analogous to the defini-
sure of similarity between pairs of sentences i and tion of similarity (i, j) in the pairwise model, the
j. The first term in Eq. 2 is a measure of how simi- choice of the word importance function (v) is
lar the sentences included in summary y are to the crucial in the coverage model. A simple heuristic
other sentences in x. The second term penalizes is to weigh words highly that occur in many sen-
y by how similar its sentences are to each other. tences of x, but in few other documents (Swami-
> 0 is a scalar parameter that trades off be- nathan et al., 2009). However, we will show in the
tween the two terms. Maximizing Fx (y) amounts following how to learn (v) from training data.
to increasing the similarity of the summary to ex-
cluded sentences while minimizing repetitions in Algorithm 1 Greedy algorithm for finding the
the summary. An example is illustrated in Figure best summary y given a scoring function Fx (y).
1. In the simplest case, (i, j) may be the TFIDF Parameter: r > 0.
(Salton and Buckley, 1988) cosine similarity, but y
we will show later how to learn sophisticated sim- Ax
ilarity functions. while A 6= do
Fx (y {l}) Fx (y)
k arg max
3.2 Coverage scoring function PlA (cl )r
A second scoring function we consider was if ck+ iy ci B and Fx (y{k})Fx (y)
first proposed for diversified document retrieval 0 then
(Swaminathan et al., 2009; Yue and Joachims, y y {k}
2008), but it naturally applies to document sum- end if
marization as well (Li et al., 2009). It is based on A A\{k}
a notion of word coverage, where each word v has end while
some importance weight (v) 0. A summary
y covers a word if at least one of its sentences
3.3 Computing a Summary
contains the word. The score of a summary is
then simply the sum of the word weights its cov- Computing the summary that maximizes either of
ers (though we could also include a concave dis- the two scoring functions from above (i.e. Eqns.
count function that rewards covering a word mul- (2) and (3)) is NP-hard (McDonald, 2007). How-
tiple times (Raman et al., 2011)): ever, it is known that the greedy algorithm 1 can
achieve a 1 1/e approximation to the optimum
X
Fx (y) = (v). (3) solution for any linear budget constraint (Lin and
vV (y) Bilmes, 2010; Khuller et al., 1999). Even further,
this algorithm provides a 1 1/e approximation
In the above equation, V (y) denotes the union of for any monotone submodular scoring function.
all words in y. This function is analogous to a The algorithm starts with an empty summariza-
maximum coverage problem, which is known to tion. In each step, a sentence is added to the sum-
be submodular (Khuller et al., 1999). mary that results in the maximum relative increase

of the objective. The increase is relative to the called the joint feature-map between input x and
amount of budget that is used by the added sen- output y. Note that both submodular scoring func-
tence. The algorithm terminates when the budget tion in Eqns. (2) and (3) can be brought into the
B is reached. form wT (x, y) for the linear parametrization in
Note that the algorithm has a parameter r in Eq. (6) and (7):
the denominator of the selection rule, which Lin X X
and Bilmes (2010) report to have some impact p (x, y) = px (i, j) px (i, j), (6)
on performance. In the algorithm, ci represents ix\y,jy i,jy:i6=j
X
the cost of the sentence (i.e., length). Thus, the c (x, y) = cx (v). (7)
algorithm actually selects sentences with large vV (y)
marginal unity relative to their length (trade-off
controlled by the parameter r). Selecting r to be After this transformation, it is easy to see that
less than 1 gives more importance to information computing the maximizing summary in Eq. (1)
density (i.e. sentences that have a higher ratio and the structural SVM prediction rule in Eq. (5)
of score increase per length). The 1 1e greedy are equivalent.
approximation guarantee holds despite this addi- To learn the weight vector w, structural SVMs
tional parameter (Lin and Bilmes, 2010). More require training examples (x1 , y 1 ), ..., (xn , y n ) of
details on our choice of r and its effects are pro- input/output pairs. In document summarization,
vided in the experiments section. however, the correct extractive summary is typ-
ically not known. Instead, training documents
4 Learning algorithm xi are typically annotated with multiple manual
(non-extractive) summaries (denoted by Y i ). To
In this section, we propose a supervised learning determine a single extractive target summary y i
method for training a submodular scoring func- for training, we find the extractive summary that
tion to produce desirable summaries. In particu- (approximately) optimizes ROUGE score or
lar, for the pairwise and the coverage model, we some other loss function (Y i , y) with respect
show how to learn the similarity function (i, j) to Y i .
and the word importance weights (v) respec- y i = argmin (Y i , y) (8)
tively. In particular, we parameterize (i, j) and yY
(v) using a linear model, allowing that each de-
We call the y i determined in this way the target
pends on the full set of input sentences x:
summary for xi . Note that y i is a greedily con-
x (i, j) = wTpx (i, j) x (v) = wTcx (v). (4) structed approximate target summary based on its
proximity to Y i via . Because of this, we will
In the above equations, w is a weight vector that learn a model that can predict approximately good
is learned, and px (i, j) and cx (v) are feature vec- summaries y i from xi . However, we believe that
tors. In the pairwise model, px (i, j) may include most of the score difference between manual sum-
feature like the TFIDF cosine between i and j or maries and y i (as explored in the experiments sec-
the number of words from the document titles that tion) is due to it being an extractive summary and
i and j share. In the coverage model, cx (v) may not due to greedy construction.
include features like a binary indicator of whether Following the structural SVM approach, we
v occurs in more than 10% of the sentences in x can now formulate the problem of learning w as
or whether v occurs in the document title. the following quadratic program (QP):
We propose to learn the weights following a n
1 CX
large-margin framework using structural SVMs min kwk2 + i (9)
w,0 2 n
(Tsochantaridis et al., 2005). Structural SVMs i=1
learn a discriminant function s.t. w> (xi , y i ) w> (xi , y i )

h(x) = arg max w> (x, y) (5) (y i , Y i ) i , y i 6= y i , 1 i n.


yY
The above formulation ensures that the scor-
that predicts a structured output y given a (pos- ing function with the target summary (i.e.
sibly also structured) input x. (x, y) RN is w> (xi , y i )) is larger than the scoring function

Algorithm 2 Cutting-plane algorithm for solving low:
the learning optimization problem.
Parameter: desired tolerance  > 0. (Y i , y) = max(0, R (Y i , y) R (Y i , y i )),
i : Wi
repeat The loss was used in our experiments. Thus
for i do training a structural SVM with this loss aims to
y arg max wT (xi , y) + (Y i , y) maximize the ROUGE-1 F score with the man-
y ual summaries provided in the training examples,
if wT (xi , y i ) +  wT (xi , y) + while trading it off with margin. Note that we
(Y i , y) i then could also use a different loss function (as the
Wi Wi {y} method is not tied to this particular choice), if we
w solve QP (9) using constraints Wi had a different target evaluation metric. Finally,
end if once a w is obtained from structural SVM train-
end for ing, a predicted summary for a test document x
until no Wi has changed during iteration can be obtained from (5).

5 Experiments
5 Experiments

In this section, we empirically evaluate the approach proposed in this paper. Following Lin and Bilmes (2010), experiments were conducted on two different datasets (DUC 03 and 04). These datasets contain document sets with four manual summaries for each set. For each document set, we concatenated all the articles and split them into sentences using the tool provided with the 03 dataset. For the supervised setting we used 10 resamplings with a random 20/5/5 (03) and 40/5/5 (04) train/test/validation split. We determined the best C value in (9) using the performance on each validation set and then report average performance over the corresponding test sets. Baseline performance (the approach of Lin and Bilmes (2010)) was computed using all 10 test sets as a single test set. For all experiments and datasets, we used r = 0.3 in the greedy algorithm as recommended in Lin and Bilmes (2010) for the 03 dataset. We find that changing r has only a small influence on performance.²

² Setting r to 1 and thus eliminating the non-linearity does lower the score (e.g., to 0.38466 for the pairwise model on DUC 03, compared with the results in Figure 3).

The construction of features for learning is organized by word groups. The most trivial group is simply all words (basic). Considering the properties of the words themselves, we constructed several features from properties such as capitalized words, non-stop words and words of certain length (cap+stop+len). We obtained another set of features from the most frequently occurring words in all the articles (minmax). We also considered the position of a sentence (containing the word) in the article as another feature (location). All those word groups can then be further refined by selecting different thresholds, weighting schemes (e.g. TFIDF) and forming binned variants of these features.

For the pairwise model we use cosine similarity between sentences, using only words in a given word group during the computation. For the word coverage model we create separate features for covering words in different groups. This gives us fairly comparable feature strength in both models. The only further addition is the use of different word coverage levels in the coverage model. First, we consider how well a sentence covers a word (e.g., a sentence with five instances of the same word might cover it better than another with only a single instance). Second, we look at how important it is to cover a word (e.g., if a word appears in a large fraction of sentences we might want to be sure to cover it). Combining those two criteria using different thresholds, we get a set of features for each word. Our coverage features are motivated by the approach of Yue and Joachims (2008). In contrast, the hand-tuned pairwise baseline uses only TFIDF-weighted cosine similarity between sentences using all words, following the approach in Lin and Bilmes (2010).
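The word-group-restricted pairwise similarities described above can be illustrated with the following sketch; the group definitions and the TFIDF weighting are assumptions made for illustration, not the exact feature set used in the experiments.

```python
import math
from collections import Counter

def tfidf_vector(sentence_tokens, df, n_docs, group):
    """TFIDF vector of a sentence restricted to the words in `group`."""
    counts = Counter(t for t in sentence_tokens if t in group)
    return {t: c * math.log(n_docs / (1 + df[t])) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def pairwise_group_similarities(sent_a, sent_b, df, n_docs, groups):
    """One similarity feature per word group (e.g. basic, cap+stop+len, minmax)."""
    return {name: cosine(tfidf_vector(sent_a, df, n_docs, group),
                         tfidf_vector(sent_b, df, n_docs, group))
            for name, group in groups.items()}
```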
The resulting summaries are evaluated using ROUGE version 1.5.5 (Lin and Hovy, 2003). We selected the ROUGE-1 F measure because it was used by Lin and Bilmes (2010) and because it is one of the commonly used performance scores in recent work. However, our learning method applies to other performance measures as well. Note that we use the ROUGE-1 F measure both for the loss function during learning and for the evaluation of the predicted summaries.
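Since ROUGE-1 F drives both the loss and the evaluation, a simplified unigram-overlap version is sketched below. This is only an approximation of the official ROUGE-1.5.5 scorer (no stemming, stopword handling, or per-reference jackknifing) and is included purely to make the quantity concrete.

```python
from collections import Counter

def rouge1_f(references, candidate):
    """Approximate ROUGE-1 F: unigram overlap between a candidate summary
    and a pooled set of reference summaries (each given as a token list)."""
    cand = Counter(candidate)
    ref = Counter()
    for r in references:
        ref.update(r)
    overlap = sum(min(cand[t], ref[t]) for t in cand)
    recall = overlap / max(1, sum(ref.values()))
    precision = overlap / max(1, sum(cand.values()))
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```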
5.1 How does learning compare to manual tuning?

In our first experiment, we compare our supervised learning approach to the hand-tuned approach. The results from this experiment are summarized in Figure 3. First, supervised training of the pairwise model (Lin and Bilmes, 2010) resulted in a statistically significant (p ≤ 0.05 using a paired t-test) increase in performance on both datasets compared to our reimplementation of the manually tuned pairwise model. Note that our reimplementation of the approach of Lin and Bilmes (2010) resulted in slightly different performance numbers than those reported in Lin and Bilmes (2010): better on DUC 03 and somewhat lower on DUC 04, if evaluated on the same selection of test examples as theirs. We conjecture that this is due to small differences in implementation and/or preprocessing of the dataset. Furthermore, as the authors of Lin and Bilmes (2010) note in their paper, the 03 and 04 datasets behave quite differently.

model        dataset   ROUGE-1 F (stderr)
pairwise     DUC 03    0.3929 (0.0074)
coverage               0.3784 (0.0059)
hand-tuned             0.3571 (0.0063)
pairwise     DUC 04    0.4066 (0.0061)
coverage               0.3992 (0.0054)
hand-tuned             0.3935 (0.0052)

Figure 3: Results obtained on the DUC 03 and 04 datasets using the supervised models. The increase in performance over the hand-tuned model is statistically significant (p ≤ 0.05) for the pairwise model on both datasets, but only on DUC 03 for the coverage model.

Figure 3 also reports the performance for the coverage model as trained by our algorithm. These results can be compared against those for the pairwise model. Since we are using features of comparable strength in both approaches, as well as the same greedy algorithm and structural SVM learning method, this comparison largely reflects the quality of the models themselves. On the 04 dataset both models achieve the same performance, while on 03 the pairwise model performs significantly (p ≤ 0.05) better than the coverage model.

Overall, the pairwise model appears to perform slightly better than the coverage model with the datasets and features we used. Therefore, we focus on the pairwise model in the following.

5.2 How fast does the algorithm learn?

Hand-tuned approaches have limited flexibility. Whenever we move to a significantly different collection of documents, we have to reinvest time to retune it. Learning can make this adaptation to a new collection more automatic and faster, especially since training data has to be collected even for manual tuning.

Figure 4 evaluates how effectively the learning algorithm can make use of a given amount of training data. In particular, the figure shows the learning curve for our approach. Even with very few training examples, the learning approach already outperforms the baseline. Furthermore, at the maximum number of training examples available to us the curve still increases. We therefore conjecture that more data would further improve performance.

Figure 4: Learning curve for the pairwise model on the DUC 04 dataset, showing ROUGE-1 F scores for different numbers of learning examples (logarithmic scale). The dashed line represents the performance of the hand-tuned model.
5.3 Where is room for improvement?

To get a rough estimate of what is actually achievable in terms of the final ROUGE-1 F score, we looked at different upper bounds under various scenarios (Figure 5). First, the ROUGE score is computed using four manual summaries from different assessors, so that we can estimate inter-subject disagreement. If one computes the ROUGE score of a held-out summary against the remaining three summaries, the resulting performance is given in the row labeled human of Figure 5. It provides a reasonable estimate of human performance.

Second, in extractive summarization we restrict summaries to sentences from the documents themselves, which is likely to lead to a reduction in ROUGE. To estimate this drop, we use the greedy algorithm to select the extractive summary that maximizes ROUGE on the test documents. The resulting performance is given in the row extractive of Figure 5. On both datasets, the drop in performance for this (approximately³) optimal extractive summary is about 10 points of ROUGE.

³ We compared the greedy algorithm with exhaustive search for up to three selected sentences (more than that would take too long). In about half the cases we got the same solution; in the other cases the solution was on average about 1% below optimal, confirming that greedy selection works quite well.

Third, we expect some drop in performance, since our model may not be able to fit the optimal extractive summaries due to a lack of expressiveness. This can be estimated by looking at training set performance, as reported in the row model fit of Figure 5. On both datasets, we see a drop of about 5 points of ROUGE performance. Adding more and better features might help the model fit the data better.

Finally, a last drop in performance may come from overfitting. The test set ROUGE scores are given in the row prediction of Figure 5. Note that the drop between training and test performance is rather small, so overfitting is not an issue and is well controlled in our algorithm. We therefore conclude that increasing model fidelity seems like a promising direction for further improvements.

bound        dataset   ROUGE-1 F
human        DUC 03    0.56235
extractive             0.45497
model fit              0.40873
prediction             0.39294
human        DUC 04    0.55221
extractive             0.45199
model fit              0.40963
prediction             0.40662

Figure 5: Upper bounds on ROUGE-1 F scores: agreement between manual summaries, greedily computed best extractive summaries, best model fit on the training set (using the best C value), and the test scores of the pairwise model.
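The greedy extractive selection used for the bounds above (and for inference throughout) can be sketched as follows. This is a generic scaled-cost greedy procedure in the style of Lin and Bilmes (2010), with the objective score, the per-sentence cost, and the exponent r left as assumed parameters; it is not the exact routine of Figure 1.

```python
def greedy_summary(sentences, score, cost, budget, r=0.3):
    """Greedily build a summary under a length budget.

    sentences : iterable of candidate sentence ids
    score     : callable, score(set_of_ids) -> float (monotone objective)
    cost      : callable, cost(sentence_id) -> float (e.g. length in words)
    budget    : maximum total cost of the summary
    r         : cost-scaling exponent (r = 1 removes the non-linearity)
    """
    selected, remaining = set(), set(sentences)
    total_cost = 0.0
    while remaining:
        base = score(selected)
        best, best_gain = None, float("-inf")
        for s in remaining:
            # scaled marginal gain of adding sentence s
            gain = (score(selected | {s}) - base) / (cost(s) ** r)
            if gain > best_gain:
                best, best_gain = s, gain
        remaining.discard(best)
        if total_cost + cost(best) <= budget and best_gain > 0:
            selected.add(best)
            total_cost += cost(best)
    return selected
```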
5.4 Which features are most useful?

To understand which features affected the final performance of our approach, we assessed the strength of each set of our features. In particular, we looked at how the final test score changes when we remove certain feature groups (described in the beginning of Section 5), as shown in Figure 6.

The most important group of features are the basic features (pure cosine similarity between sentences), since removing them results in the largest drop in performance. However, other features play a significant role too (i.e., the basic ones alone are not enough to achieve good performance). This confirms that performance can be improved by adding richer features instead of using only a single similarity score as in Lin and Bilmes (2010). Using learning for these complex models is essential, since hand-tuning is likely to be intractable.

The second most important group of features, considering the drop in performance (i.e., location), looks at the positions of sentences in the articles. This makes intuitive sense because the first sentences in news articles are usually packed with information. The other three groups do not have a significant impact on their own.

removed group      ROUGE-1 F
none               0.40662
basic              0.38681
all except basic   0.39723
location           0.39782
sent+doc           0.39901
cap+stop+len       0.40273
minmax             0.40721

Figure 6: Effects of removing different feature groups on the DUC 04 dataset. Bold font marks a significant difference (p ≤ 0.05) when compared to the full pairwise model. The most important are the basic similarity features including all words (similar to (Lin and Bilmes, 2010)). The last feature group actually lowered the score but is included in the model because we only found this out later on the DUC 04 dataset.

5.5 How important is it to train with multiple summaries?

While having four manual summaries may be important for computing a reliable ROUGE score for evaluation, it is not clear whether such an approach is the most efficient use of annotator resources for training. In our final experiment, we trained our method using only a single manual summary for each set of documents. When using only a single manual summary, we arbitrarily took the first one out of the provided four reference summaries and used only it to compute the target label for training (instead of using the average loss towards all four of them). Otherwise, the experimental setup was the same as in the previous subsections, using the pairwise model.

For DUC 04, the ROUGE-1 F score obtained using only a single summary per document set was 0.4010, which is slightly but not significantly lower than the 0.4066 obtained with four summaries (as shown in Figure 3). Similarly, on DUC 03 the performance drop from 0.3929 to 0.3838 was also not significant.

Based on those results, we conjecture that having more document sets with only a single manual summary is more useful for training than fewer training examples with better labels (i.e., multiple summaries). In both cases, we spend approximately the same amount of effort (as the summaries are the most expensive component of the training data); however, having more training examples helps (according to the learning curve presented before), while spending effort on multiple summaries appears to have only a minor benefit for training.

6 Conclusions

This paper presented a supervised learning approach to extractive document summarization based on structural SVMs. The learning method applies to all submodular scoring functions, ranging from pairwise-similarity models to coverage-based approaches. The learning problem is formulated as a convex quadratic program and is then solved approximately using a cutting-plane method. In an empirical evaluation, the structural SVM approach significantly outperforms conventional hand-tuned models on the DUC 03 and 04 datasets. A key advantage of the learning approach is its ability to handle large numbers of features, providing substantial flexibility for building high-fidelity summarization models. Furthermore, it shows good control of overfitting, making it possible to train models even with only a few training examples.

Acknowledgments

We thank Claire Cardie and the members of the Cornell NLP Seminar for their valuable feedback. This research was funded in part through NSF Awards IIS-0812091 and IIS-0905467.

References

T. Berg-Kirkpatrick, D. Gillick and D. Klein. Jointly Learning to Extract and Compress. In Proceedings of ACL, 2011.

S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of WWW, 1998.
J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR, 1998.

J. M. Conroy and D. P. O'Leary. Text summarization via hidden Markov models. In Proceedings of SIGIR, 2001.

H. Daume III. Practical Structured Learning Techniques for Natural Language Processing. Ph.D. Thesis, 2006.

G. Erkan and D. R. Radev. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. In Journal of Artificial Intelligence Research, Vol. 22, 2004, pp. 457-479.

E. Filatova and V. Hatzivassiloglou. Event-Based Extractive Summarization. In Proceedings of the ACL Workshop on Summarization, 2004.

T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In Proceedings of ICML, 2008.

D. Gillick and Y. Liu. A scalable global model for summarization. In Proceedings of the ACL Workshop on Integer Linear Programming for Natural Language Processing, 2009.

J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. Multi-document summarization by sentence extraction. In Proceedings of NAACL-ANLP, 2000.

S. Khuller, A. Moss and J. Naor. The budgeted maximum coverage problem. In Information Processing Letters, Vol. 70, Issue 1, 1999, pp. 39-45.

J. M. Kleinberg. Authoritative sources in a hyperlinked environment. In Journal of the ACM, Vol. 46, Issue 5, 1999, pp. 604-632.

A. Kulesza and B. Taskar. Learning Determinantal Point Processes. In Proceedings of UAI, 2011.

J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In Proceedings of SIGIR, 1995.

L. Li, K. Zhou, G. Xue, H. Zha, and Y. Yu. Enhancing Diversity, Coverage and Balance for Summarization through Structure Learning. In Proceedings of WWW, 2009.

H. Lin and J. Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In Proceedings of NAACL-HLT, 2010.

H. Lin and J. Bilmes. A Class of Submodular Functions for Document Summarization. In Proceedings of ACL-HLT, 2011.

C. Y. Lin and E. Hovy. Automatic evaluation of summaries using N-gram co-occurrence statistics. In Proceedings of NAACL, 2003.

F. T. Martins and N. A. Smith. Summarization with a joint model for sentence extraction and compression. In Proceedings of the ACL Workshop on Integer Linear Programming for Natural Language Processing, 2009.

R. McDonald. A Study of Global Inference Algorithms in Multi-document Summarization. In Advances in Information Retrieval, Lecture Notes in Computer Science, 2007, pp. 557-564.

D. Metzler and T. Kanungo. Machine learned sentence selection strategies for query-biased summarization. In Proceedings of SIGIR, 2008.

R. Mihalcea. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, 2004.

R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of EMNLP, 2004.

T. Nomoto and Y. Matsumoto. A new approach to unsupervised text summarization. In Proceedings of SIGIR, 2001.

V. Qazvinian, D. R. Radev, and A. Ozgur. Citation Summarization Through Keyphrase Extraction. In Proceedings of COLING, 2010.

K. Raman, T. Joachims and P. Shivaswamy. Structured Learning of Two-Level Dynamic Rankings. In Proceedings of CIKM, 2011.

G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, 1988, pp. 513-523.

D. Shen, J. T. Sun, H. Li, Q. Yang, and Z. Chen. Document summarization using conditional random fields. In Proceedings of IJCAI, 2007.

A. Swaminathan, C. V. Mathew and D. Kirovski. Essential Pages. In Proceedings of WI-IAT, IEEE Computer Society, 2009.

I. Tsochantaridis, T. Hofmann, T. Joachims and Y. Altun. Large margin methods for structured and interdependent output variables. In Journal of Machine Learning Research, Vol. 6, 2005, pp. 1453-1484.

X. Wan, J. Yang, and J. Xiao. CollabSum: Exploiting multiple document clustering for collaborative single document summarizations. In Proceedings of SIGIR, 2007.

Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. In Proceedings of ICML, 2008.
A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their Meanings

Tom Kwiatkowski* (tomk@cs.washington.edu), Sharon Goldwater (sgwater@inf.ed.ac.uk), Luke Zettlemoyer (lsz@cs.washington.edu), Mark Steedman (steedman@inf.ed.ac.uk)

ILCC, School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, UK
Computer Science & Engineering, University of Washington, Seattle, WA, 98195, USA
Abstract

This paper presents an incremental probabilistic learner that models the acquisition of syntax and semantics from a corpus of child-directed utterances paired with possible representations of their meanings. These meaning representations approximate the contextual input available to the child; they do not specify the meanings of individual words or syntactic derivations. The learner then has to infer the meanings and syntactic properties of the words in the input along with a parsing model. We use the CCG grammatical framework and train a non-parametric Bayesian model of parse structure with online variational Bayesian expectation maximization. When tested on utterances from the CHILDES corpus, our learner outperforms a state-of-the-art semantic parser. In addition, it models such aspects of child acquisition as fast mapping, while also countering previous criticisms of statistical syntactic learners.

1 Introduction

Children learn language by mapping the utterances they hear onto what they believe those utterances mean. The precise nature of the child's prelinguistic representation of meaning is not known. We assume for present purposes that it can be approximated by compositional logical representations such as (1), where the meaning is a logical expression that describes a relationship have between the person you refers to and the object another(x, cookie(x)):

    Utterance : you have another cookie    (1)
    Meaning : have(you, another(x, cookie(x)))

Most situations will support a number of plausible meanings, so the child has to learn in the face of propositional uncertainty¹, from a set of contextually afforded meaning candidates, as here:

    Utterance : you have another cookie
    Candidate Meanings : have(you, another(x, cookie(x)))
                         eat(you, your(x, cake(x)))
                         want(i, another(x, cookie(x)))

¹ Similar to referential uncertainty but relating to propositions rather than referents.

The task is then to learn, from a sequence of such (utterance, meaning-candidates) pairs, the correct lexicon and parsing model. Here we present a probabilistic account of this task with an emphasis on cognitive plausibility.

Our criteria for plausibility are that the learner must not require any language-specific information prior to learning and that the learning algorithm must be strictly incremental: it sees each training instance sequentially and exactly once. We define a Bayesian model of parse structure with Dirichlet process priors and train this on a set of (utterance, meaning-candidates) pairs derived from the CHILDES corpus (MacWhinney, 2000) using online variational Bayesian EM.

We evaluate the learnt grammar in three ways. First, we test the accuracy of the trained model in parsing unseen utterances onto gold standard annotations of their meaning. We show that it outperforms a state-of-the-art semantic parser (Kwiatkowski et al., 2010) when run with similar training conditions (i.e., neither system is given the corpus-based initialization originally used by Kwiatkowski et al.). We then examine the learning curves of some individual words, showing that the model can learn word meanings on the basis of a single exposure, similar to the fast mapping phenomenon observed in children (Carey and Bartlett, 1978). Finally, we show that our
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 234-244, Avignon, France, April 23-27 2012. (c) 2012 Association for Computational Linguistics
learner captures the step-like learning curves for word order regularities that Thornton and Tesan (2007) claim children show. This result counters Thornton and Tesan's criticism of statistical grammar learners, namely that they tend to exhibit gradual learning curves rather than the abrupt changes in linguistic competence observed in children.

1.1 Related Work

Models of syntactic acquisition, whether they have addressed the task of learning both syntax and semantics (Siskind, 1992; Villavicencio, 2002; Buttery, 2006) or syntax alone (Gibson and Wexler, 1994; Sakas and Fodor, 2001; Yang, 2002), have aimed to learn a single, correct, deterministic grammar. With the exception of Buttery (2006), they also adopt the Principles and Parameters grammatical framework, which assumes detailed knowledge of linguistic regularities². Our approach contrasts with all previous models in assuming a very general kind of linguistic knowledge and a probabilistic grammar. Specifically, we use the probabilistic Combinatory Categorial Grammar (CCG) framework, and assume only that the learner has access to a small set of general combinatory schemata and a functional mapping from semantic type to syntactic category. Furthermore, this paper is the first to evaluate a model of child syntactic-semantic acquisition by parsing unseen data.

² This linguistic use of the term parameter is distinct from the statistical use found elsewhere in this paper.

Models of child word learning have focused on semantics only, learning word meanings from utterances paired with either sets of concept symbols (Yu and Ballard, 2007; Frank et al., 2008; Fazly et al., 2010) or a compositional meaning representation of the type used here (Siskind, 1996). The models of Alishahi and Stevenson (2008) and Maurits et al. (2009) learn, as well as word meanings, orderings for verb-argument structures, but not the full parsing model that we learn here.

Semantic parser induction as addressed by Zettlemoyer and Collins (2005, 2007, 2009), Kate and Mooney (2007), Wong and Mooney (2006, 2007), Lu et al. (2008), Chen et al. (2010), Kwiatkowski et al. (2010, 2011) and Borschinger et al. (2011) has the same task definition as the one addressed by this paper. However, the learning approaches presented in those previous papers are not designed to be cognitively plausible, using batch training algorithms, multiple passes over the data, and language-specific initialisations (lists of noun phrases and additional corpus statistics), all of which we dispense with here. In particular, our approach is closely related to that of Kwiatkowski et al. (2010) but, whereas that work required careful initialisation and multiple passes over the training data to learn a discriminative parsing model, here we learn a generative parsing model without either.

1.2 Overview of the approach

Our approach takes, as input, a corpus of (utterance, meaning-candidates) pairs {(s_i, {m}_i) : i = 1, . . . , N}, and learns a CCG lexicon Λ and the probability of each production a → b that could be used in a parse. Together, these define a probabilistic parser that can be used to find the most probable meaning for any new sentence.

We learn both the lexicon and production probabilities from allowable parses of the training pairs. The set of allowable parses {t} for a single (utterance, meaning-candidates) pair consists of those parses that map the utterance onto one of the meanings. This set is generated with the functional mapping T:

    {t} = T(s, m),    (2)

which is defined, following Kwiatkowski et al. (2010), using only the CCG combinators and a mapping from semantic type to syntactic category (presented in Section 4).

The CCG lexicon Λ is learnt by reading off the lexical items used in all parses of all training pairs. Production probabilities are learnt in conjunction with Λ through the use of an incremental parameter estimation algorithm, online Variational Bayesian EM, as described in Section 5.

Before presenting the probabilistic model, the mapping T, and the parameter training algorithm, we first provide some background on the meaning representations we use and on CCG.

2 Background

2.1 Meaning Representations

We represent the meanings of utterances in first-order predicate logic using the lambda-calculus. An example logical expression (henceforth also referred to as a lambda expression) is:

    like(eve, mummy)    (3)
which expresses a logical relationship like between the entity eve and the entity mummy. In Section 6.1 we will see how logical expressions like this are created for a set of child-directed utterances (to use in training our model).

The lambda-calculus uses λ-operators to define functions. These may be used to represent functional meanings of utterances, but they may also be used as a glue language, to compose elements of first-order logical expressions. For example, the function λx.λy.like(y, x) can be combined with the object mummy to give the phrasal meaning λy.like(y, mummy) through the lambda-calculus operation of function application.

2.2 CCG

Combinatory Categorial Grammar (CCG; Steedman 2000) is a strongly lexicalised linguistic formalism that tightly couples syntax and semantics. Each CCG lexical item in the lexicon Λ is a triple, written as word ⊢ syntactic category : logical expression. Examples are:

    You  ⊢ NP : you
    read ⊢ S\NP/NP : λx.λy.read(y, x)
    the  ⊢ NP/N : λf.the(x, f(x))
    book ⊢ N : λx.book(x)

A full CCG category X : h has syntactic category X and logical expression h. Syntactic categories may be atomic (e.g., S or NP) or complex (e.g., (S\NP)/NP). Slash operators in complex categories define functions from the range on the right of the slash to the result on the left, in much the same way as lambda operators do in the lambda-calculus. The direction of the slash defines the linear order of function and argument.

CCG uses a small set of combinatory rules to concurrently build syntactic parses and semantic representations. Two example combinatory rules are forward (>) and backward (<) application:

    X/Y : f    Y : g    =>    X : f(g)    (>)
    Y : g    X\Y : f    =>    X : f(g)    (<)

Given the lexicon above, the phrase You read the book can be parsed using these rules, as illustrated in Figure 1 (with additional notation discussed in the following section).
lambda-calculus. The direction of the slash de- leaf nodes in the derivation tree. Each syntactic
fines the linear order of function and argument. production Ch R has conditional probability
CCG uses a small set of combinatory rules to P (R|Ch ). There are 3 binary and 5 unary syntac-
concurrently build syntactic parses and semantic tic productions in Figure 1.
representations. Two example combinatory rules Lexical productions have two forms. Logical
are forward (>) and backward (<) application: expressions are produced from leaf nodes in the
syntactic derivation tree Alex m with condi-
X/Y : f Y : g X : f (g) (>) tional probability P (m|Alex ). Words are then pro-
Y : g X\Y : f X : f (g) (<) duced from these logical expressions with condi-
tional probability P (w|m). An example logical
Given the lexicon above, the phrase You read the
production from Figure 1 is [NP]lex you. An
book can be parsed using these rules, as illus-
example word production is you You.
trated in Figure 1 (with additional notation dis-
Every production a b used in a parse tree t
cussed in the following section)..
is chosen from the set of productions that could
CCG also includes combinatory rules of
be used to expand a head node a. If there are a
forward (> B) and backward (< B) composition:
finite K productions that could expand a then a
X/Y : f Y /Z : g X/Z : x.f (g(x)) (> B) K-dimensional Multinomial distribution parame-
Y \Z : g X\Y : f X\Z : x.f (g(x)) (< B) terised by a can be used to model the categorical

236
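A direct reading of Equations 4 and 5 is sketched below: given the list of productions used in a derivation and a per-head categorical distribution over targets, the derivation probability is the product of the individual production probabilities. The dictionary-based representation is an assumption made for illustration only.

```python
import math

def derivation_log_prob(productions, theta):
    """Log of Equation (4): sum of log P(b | a) over the productions a -> b.

    productions : list of (head, target) pairs used in the derivation t
    theta       : dict mapping head a to a dict of target probabilities,
                  i.e. theta[a][b] = P(b | a)
    """
    return sum(math.log(theta[a][b]) for a, b in productions)

# Tiny illustrative example with made-up production probabilities.
theta = {
    "START": {"Sdcl": 0.7, "Swh": 0.3},
    "Sdcl":  {("NP", "Sdcl\\NP"): 0.6, "[Sdcl]lex": 0.4},
}
t = [("START", "Sdcl"), ("Sdcl", ("NP", "Sdcl\\NP"))]
print(math.exp(derivation_log_prob(t, theta)))  # 0.7 * 0.6 = 0.42
```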
[Figure 1: Derivation of the sentence You read the book with meaning read(you, the(x, book(x))).]

However, before training a model of language acquisition, the dimensionality and contents of both the syntactic grammar and the lexicon Λ are unknown. In order to maintain a probability model with cover over the countably infinite number of possible productions, we define a Dirichlet Process (DP) prior for each possible production head a. For the production head a, DP(αa, Ha) assigns some probability mass to all possible production targets {b} covered by the base distribution Ha.

It is possible to use the DP as an infinite prior from which the parameter set of a finite-dimensional Multinomial may be drawn, provided that we can choose a suitable partition of {b}. When calculating the probability of an (s, m, t) triple, the choice of this partition is easy. For any given production head a there is a finite set of usable production targets {b1, . . . , bk−1} in t. We create a partition that includes one entry for each of these, along with a final entry {bk, . . .} that includes all other ways in which a could be expanded in different contexts. Then, by applying the distribution Ga drawn from the DP to this partition, we get a parameter vector θa that is equivalent to a draw from a k-dimensional Dirichlet distribution:

    Ga ∼ DP(αa, Ha)    (6)

    θa = (Ga(b1), . . . , Ga(bk−1), Ga({bk, . . .}))
       ∼ Dir(αa Ha(b1), . . . , αa Ha(bk−1), αa Ha({bk, . . .}))    (7)

Together, Equations 4-7 describe the joint distribution P(X, S, θ) over the observed training data X = {(si, {m}i) : i = 1, . . . , N}, the latent variables S (containing the productions used in each parse t), and the parsing parameters θ.

4 Generating Parses

The previous section defined a parameterisation over parses assuming that the CCG lexicon Λ was known. In practice Λ is empty prior to training and must be populated with the lexical items from parses t consistent with the training pairs (s, {m}). The set of allowed parses {t} is defined by the function T from Equation 2. Here we review the splitting procedure of Kwiatkowski et al. (2010) that is used to generate CCG lexical items and describe how it is used by T to create a packed chart representation of all parses {t} that are consistent with s and at least one of the meaning representations in {m}. In this section we assume that s is paired at each point with only a single meaning m. Later we will show how T is used multiple times to create the set of parses consistent with s and a set of candidate meanings {m}.

The splitting procedure takes as input a CCG category X : h, such as NP : a(x, cookie(x)), and returns a set of category splits. Each category split is a pair of CCG categories (Cl : ml, Cr : mr) that can be recombined to give X : h using one of the CCG combinators in Section 2.2. The CCG category splitting procedure has two parts: logical splitting of the category semantics h, and syntactic splitting of the syntactic category X. Each logical split of h is a pair of lambda expressions (f, g) in the following set:

    {(f, g) | h = f(g) or h = λx.f(g(x))},    (8)

which means that f and g can be recombined using either function application or function composition to give the original lambda expression h. An example split of the lambda expression h = a(x, cookie(x)) is the pair

    (λy.a(x, y(x)), λx.cookie(x)),    (9)

where λy.a(x, y(x)) applied to λx.cookie(x) returns the original expression a(x, cookie(x)).
Dir(a H(b1 ), . . . , a Ha (bk1 ), (7) Syntactic splitting assigns linear order and syn-
a Ha ({bk , . . . })) tactic categories to the two lambda expressions f
and g. The initial syntactic category X is split by
Together, Equations 4-7 describe the joint distri- a reversal of the CCG application combinators in
bution P (X, S, ) over the observed training data Section 2.2 if f and g can be recombined to give

237
Syntactic Category Semantic Type Example Phrase
Sdcl hev, ti I took it ` Sdcl :e.took(i, it, e)
St t I0 m angry ` St :angry(i)
Swh he, hev, tii Who took it? ` Swh :xe.took(x, it, e)
Sq hev, ti Did you take it? ` Sq :e.Q(take(you, it, e))
N he, ti cookie ` N:x.cookie(x)
NP e John ` NP:john
PP hev, ti on John ` PP:e.on(john, e)

Figure 2: Atomic Syntactic Categories.

h with function application: T cycles over all cell entries in increasingly small
spans and populates the chart with their splits. For
{(X/Y : f Y : g), (10) any cell entry X : h spanning more than one word
(Y : g : X\Y : f )|h = f (g)} T generates a set of pairs representing the splits of
X:h. For each split (Cl :ml , Cr :mr ) and every bi-
or by a reversal of the CCG composition combi- nary partition (wi:k , wk:j ) of the word-span T cre-
nators if f and g can be recombined to give h with ates two new cell entries in the chart: (Cl : ml )i:k
function composition: and (Cr :mr )k:j .

{(X/Z : f Z/Y : g, (11) Input : Sentence [w1 , . . . , wn ], top node Cm :m


(Z\Y : g : X\Z : f )|h = x.f (g(x))} Output: Packed parse chart Ch containing {t}
Ch = [ [{}1 , . . . , {}n ]1 , . . . , [{}1 , . . . , {}n ]n ]
Unknown category names in the result of a Ch[1][n 1] = Cm :m
split (Y in (10) and Z in (11)) are labelled via a for i = n, . . . , 2; j = 1 . . . (n i) + 1 do
for X:h Ch[j][i] do
functional mapping cat from semantic type T to
for (Cl :ml , Cr :mr ) split(X:h) do
syntactic category: for k = 1, . . . , i 1 do
Ch[j][k] Cl :ml
Atomic(T ) if T Figure 2 Ch[j + k][i k] Cr :mr
cat(T ) = cat(T1 )/cat(T2 ) if T = hT1 , T2 i
cat(T1 )\cat(T2 ) if T = hT1 , T2 i

Algorithm 1: Generating {t} with T .
which uses the Atomic function illustrated
in Figure 2 to map semantic-type to basic CCG Algorithm 1 shows how the learner uses T to
syntactic category. As an example, the logical generate a packed chart representation of {t} in
split in (9) supports two CCG category splits, one the chart Ch. The function T massively overgen-
for each of the CCG application rules. erates parses for any given natural language. The
probabilistic parsing model introduced in Sec-
(NP/N:y.a(x, y(x)), N:x.cookie(x)) (12) tion 3 is used to choose the best parse from the
(N:x.cookie(x), NP\N:y.a(x, y(x))) (13) overgenerated set.

The parse generation algorithm T uses the func- 5 Training


tion split to generate all CCG category pairs that 5.1 Parameter Estimation
are an allowed split of an input category X:h:
The probabilistic model of the grammar describes
{(Cl :ml , Cr :mr )} = split(X:h), a distribution over the observed training data X,
latent variables S, and parameters . The goal of
and then packs a chart representation of {t} in a training is to estimate the posterior distribution:
top-down fashion starting with a single cell entry
p(S, X|)p()
Cm : m for the top node shared by all parses {t}. p(S, |X) = (14)
p(X)
For the utterance and meaning in (1) the top parse
node, spanning the entire word-string, is which we do with online Variational Bayesian Ex-
pectation Maximisation (oVBEM; Sato (2001),
S:have(you, another(x, cookie(x))). Hoffman et al. (2010)). oVBEM is an online

238
Bayesian extension of the EM algorithm that accumulates observation pseudocounts n_{a→b} for each of the productions a → b in the grammar. These pseudocounts define the posterior over production probabilities as follows:

    (θ_{a→b1}, . . . , θ_{a→{bk,...}}) | X, S ∼ Dir(α Ha(b1) + n_{a→b1}, . . . , Σ_{j=k} α Ha(bj) + n_{a→bj})    (15)

These pseudocounts are computed in two steps:

oVBE-step. For the training pair (si, {m}i), which supports the set of parses {t}, the expectation E_{{t}}[a → b] of each production a → b is calculated by creating a packed chart representation of {t} and running the inside-outside algorithm. This is similar to the E-step in standard EM, apart from the fact that each production is scored with the current expectation of its parameter weight θ̃^{i-1}_{a→b}, where:

    θ̃^{i-1}_{a→b} = exp(ψ(αa Ha(a→b) + n^{i-1}_{a→b})) / exp(ψ(Σ_{b'} αa Ha(a→b') + n^{i-1}_{a→b'}))    (16)

and ψ is the digamma function (Beal, 2003).

oVBM-step. The expectations from the oVBE step are used to update the pseudocounts in Equation 15 as follows,

    n^i_{a→b} = n^{i-1}_{a→b} + η_i (N E_{{t}}[a → b] - n^{i-1}_{a→b})    (17)

where η_i is the learning rate and N is the size of the dataset.
The algorithm, shown in Algorithm 2, passes over type discussed in Section 2.1. For example, the
the training data only once and one training in- dependency graph in Figure 3 is automatically
stance at a time. For each (si , {m}i ) it uses the transformed into the logical expression
function T |{m}i | times to generate a set of con-
sistent parses {t}0 . The lexicon is populated by e.have(you,another(y, cookie(y)), e) (18)
using the lex function to read all of the lexical on(the(z, table(z)), e),
items off from the derivations in each {t}0 . In
the parameter update step, the training algorithm where e is a Davidsonian event variable used to
updates the pseudocounts associated with each of deal with adverbial and prepositional attachments.
the productions a b that have ever been seen The deterministic mapping to logical expressions
during training according to Equation (17). uses 19 templates, three of which are used in this
Only non-zero pseudocounts are stored in our example: one for the verb and its arguments, one
model. The count vector is expanded with a new for the prepositional attachment and one (used
entry every time a new production is used. While twice) for the quantifier-noun constructions.

239
SUBJ      ROOT    DET         OBJ       JCT      DET      POBJ
pro|you   v|have  qn|another  n|cookie  prep|on  det|the  n|table
You       have    another     cookie    on       the      table

Figure 3: Syntactic dependency graph from the Eve corpus.

This mapping from graph to logical expression makes use of a predefined dictionary of allowed, typed, logical constants. The mapping is successful for 31% of the child-directed utterances in the Eve corpus³. The remaining data is mostly accounted for by one-word utterances that have no straightforward interpretation in our typed logical language (e.g. what; okay; alright; no; yeah; hmm; yes; uhhuh; mhm; thankyou), missing verbal arguments that cannot be properly guessed from the context (largely in imperative sentences such as drink the water), and complex noun constructions that are hard to match with a small set of templates (e.g. as top to a jar). We also remove the small number of utterances containing more than 10 words, for reasons of computational efficiency (see discussion in Section 8).

³ Data available at www.tomkwiat.com/resources.html

Following Alishahi and Stevenson (2010), we generate a context set {m}i for each utterance si by pairing that utterance with its correct logical expression along with the logical expressions of the preceding and following (|{m}i| - 1)/2 utterances.

6.2 Base Distributions and Learning Rate

Each of the production heads a in the grammar requires a base distribution Ha and concentration parameter αa. For word-productions the base distribution is a geometric distribution over character strings and spaces. For syntactic-productions the base distribution is defined in terms of the new category to be named by cat and the probability of splitting the rule by reversing either the application or composition combinators.

The base distributions of semantic-productions are defined by a probabilistic branching process conditioned on the type of the syntactic category. This distribution prefers less complex logical expressions. All concentration parameters are set to 1.0. The learning rate for parameter updates is ηi = (0.8 + i)^{-0.5}.

[Figure 4: Meaning prediction accuracy against proportion of data seen: train on files 1, . . . , n, test on file n + 1. Curves compare Our Approach and Our Approach + Guess with UBL1 and UBL10.]

7 Experiments

7.1 Parsing Unseen Sentences

We test the parsing model that is learnt by training on the first i files of the longitudinally ordered Eve corpus and testing on file i + 1, for i = 1 . . . 19. For each utterance s' in the test file we use the parsing model to predict a meaning m̂ and compare this to the target meaning m'. We report the proportion of utterances for which the prediction m̂ is returned correctly, both with and without word-meaning guessing. When a word has never been seen at training time, our parser has the ability to guess a typed logical meaning with placeholders for constant and predicate names.

For comparison we use the UBL semantic parser of Kwiatkowski et al. (2010) trained in a similar setting, i.e., with no language-specific initialisation⁴.

⁴ Kwiatkowski et al. (2010) initialise lexical weights in their learning algorithm using corpus-wide alignment statistics across words and meaning elements. Instead we run UBL with a small positive weight for all lexical items. When run with Giza++ parameter initialisations, UBL10 achieves 48.1% across folds compared to 49.2% for our approach.

Figure 4 shows accuracy for our approach with and without guessing, for UBL
when run over the training data once (UBL1) and for UBL when run over the training data 10 times (UBL10), as in Kwiatkowski et al. (2010). Each of the points represents accuracy on one of the 19 test files. All of these results are from parsers trained on utterances paired with a single candidate meaning. The lines of best fit show the upward trend in parser performance over time.

Despite only seeing each training instance once, our approach, due to its broader lexical search strategy, outperforms both versions of UBL, which performs a greedy search in the space of lexicons and requires initialisation with co-occurrence statistics between words and logical constants to guide this search. These statistics are not justified in a model of language acquisition and so they are not used here. The low performance of all systems is due largely to the sparsity of the data, with 32.9% of all sentences containing a previously unseen word.

7.2 Word Learning

Due to the sparsity of the data, the training algorithm needs to be able to learn word-meanings on the basis of very few exposures. This is also a desirable feature from the perspective of modelling language acquisition, as Carey and Bartlett (1978) have shown that children have the ability to learn word meanings on the basis of one, or very few, exposures through the process of fast mapping.

[Figure 5: Learning quantifiers with frequency f. Four panels (1, 3, 5 and 7 candidate meanings) plot P(m|w) against the number of utterances for a (f = 168, λf.a(x, f(x))), another (f = 10, λf.another(x, f(x))) and any (f = 2, λf.any(x, f(x))).]

Figure 5 shows the posterior probability of the correct meanings for the quantifiers a, another and any over the course of training with 1, 3, 5 and 7 candidate meanings for each utterance⁵. These three words are all of the same class but have very different frequencies in the training subset shown (168, 10 and 2 respectively). In all training settings, the word a is learnt gradually from many observations, but the rarer words another and any are learnt (when they are learnt) through large updates to the posterior on the basis of few observations. These large updates result from a syntactic bootstrapping effect (Gleitman, 1990). When the model has great confidence about the derivation in which an unseen lexical item occurs, the pseudocounts for that lexical item get a large update under Equation 17. This large update has a greater effect on rare words, which are associated with small amounts of probability mass, than it does on common ones that have already accumulated large pseudocounts. The fast learning of rare words later in learning correlates with observations of word learning in children.

⁵ The term fast mapping is generally used to refer to noun learning. We chose to examine quantifier learning here as there is a greater variation in quantifier frequencies. Fast mapping of nouns is also achieved.

7.3 Word Order Learning

Figure 6 shows the posterior probability of the correct SVO word order learnt from increasing amounts of training data. This is calculated by summing over all lexical items containing transitive verb semantics and sampling in the space of parse trees that could have generated them. With no propositional uncertainty in the training data, the correct word order is learnt very quickly and stabilises. As the amount of propositional uncertainty increases, the rate at which this rule is learnt decreases. However, even in the face of ambiguous training data, the model can learn the correct word-order rule. The distribution over word orders also exhibits initial uncertainty, followed by a sharp convergence to the correct analysis. This ability to learn syntactic regularities abruptly means that our system is not subject to the criticisms that Thornton and Tesan (2007) levelled at statistical models of language acquisition, namely that their learning rates are too gradual.
[Figure 6: Learning SVO word order. Four panels (1, 3, 5 and 7 candidate meanings) plot P(word order) against the number of utterances for the word orders svo, sov, vso, ovs, vos and osv.]

8 Discussion

We have presented an incremental model of language acquisition that learns a probabilistic CCG grammar from utterances paired with one or more potential meanings. The model assumes no language-specific knowledge, but does assume that the learner has access to language-universal correspondences between syntactic and semantic types, as well as a Bayesian prior encouraging grammars with heavy reuse of existing rules and lexical items. We have shown that this model not only outperforms a state-of-the-art semantic parser, but also exhibits learning curves similar to children's: lexical items can be acquired on a single exposure and word order is learnt suddenly rather than gradually.

Although we use a Bayesian model, our approach is different from many of the Bayesian models proposed in cognitive science and language acquisition (Xu and Tenenbaum, 2007; Goldwater et al., 2009; Frank et al., 2009; Griffiths and Tenenbaum, 2006; Griffiths, 2005; Perfors et al., 2011). These models are intended as ideal observer analyses, demonstrating what would be learned by a probabilistically optimal learner. Our learner uses a more cognitively plausible but approximate online learning algorithm. In this way, it is similar to other cognitively plausible approximate Bayesian learners (Pearl et al., 2010; Sanborn et al., 2010; Shi et al., 2010).

Of course, despite the incremental nature of our learning algorithm, there are still many aspects that could be criticized as cognitively implausible. In particular, it generates all parses consistent with each training instance, which can be both memory- and processor-intensive. It is unlikely that children do this once they have learnt at least some of the target language. In future, we plan to investigate more efficient parameter estimation methods. One possibility would be an approximate oVBEM algorithm in which the expectations in Equation 17 are calculated according to a high-probability subset of the parses {t}. Another option would be particle filtering, which has been investigated as a cognitively plausible method for approximate Bayesian inference (Shi et al., 2010; Levy et al., 2009; Sanborn et al., 2010).

As a crude approximation to the context in which an utterance is heard, the logical representations of meaning that we present to the learner are also open to criticism. However, Steedman (2002) argues that children do have access to structured meaning representations from a much older apparatus used for planning actions, and we wish to eventually ground these in sensory input.

Despite the limitations listed above, our approach makes several important contributions to the computational study of language acquisition. It is the first model to learn syntax and semantics concurrently; previous systems (Villavicencio, 2002; Buttery, 2006) learnt categorial grammars from sentences where all word meanings were known. Our model is also the first to be evaluated by parsing sentences onto their meanings, in contrast to the work mentioned above and that of Gibson and Wexler (1994), Siskind (1992), Sakas and Fodor (2001), and Yang (2002). These all evaluate their learners on the basis of a small number of predefined syntactic parameters.

Finally, our work addresses a misunderstanding about statistical learners, namely that their learning curves must be gradual (Thornton and Tesan, 2007). By demonstrating sudden learning of word order and fast mapping, our model shows that statistical learners can account for sudden changes in children's grammars. In future, we hope to extend these results by examining other learning behaviors and testing the model on other languages.

9 Acknowledgements

We thank Mark Johnson for suggesting an analysis of learning rates. This work was funded by the ERC Advanced Fellowship 24952 GramPlus and EU IP grant EC-FP7-270273 Xperience.
References Goldwater, S., Griffiths, T. L., and Johnson, M.
(2009). A Bayesian framework for word seg-
Alishahi and Stevenson, S. (2008). A computa-
mentation: Exploring the effects of context.
tional model for early argument structure ac-
Cognition, 112(1):2154.
quisition. Cognitive Science, 32:5:789834.
Griffiths, T. L., . T. J. B. (2005). Structure and
Alishahi, A. and Stevenson, S. (2010). Learning
strength in causal induction. Cognitive Psy-
general properties of semantic roles from usage
chology, 51:354384.
data: a computational model. Language and
Cognitive Processes, 25:1. Griffiths, T. L. and Tenenbaum, J. B. (2006). Op-
timal predictions in everyday cognition. Psy-
Beal, M. J. (2003). Variational algorithms for ap-
chological Science.
proximate Bayesian inference. Technical re-
port, Gatsby Institute, UCL. Hoffman, M., Blei, D. M., and Bach, F. (2010).
Borschinger, B., Jones, B. K., and Johnson, M. Online learning for latent dirichlet allocation.
(2011). Reducing grounded learning tasks In NIPS.
to grammatical inference. In Proceedings of Kate, R. J. and Mooney, R. J. (2007). Learning
the 2011 Conference on Empirical Methods language semantics from ambiguous supervi-
in Natural Language Processing, pages 1416 sion. In Proceedings of the 22nd Conference
1425, Edinburgh, Scotland, UK. Association on Artificial Intelligence (AAAI-07).
for Computational Linguistics. Kwiatkowski, T., Zettlemoyer, L., Goldwater, S.,
Brown, R. (1973). A First Language: the Early and Steedman, M. (2010). Inducing proba-
Stages. Harvard University Press, Cambridge bilistic CCG grammars from logical form with
MA. higher-order unification. In Proceedings of the
Buttery, P. J. (2006). Computational models for Conference on Emperical Methods in Natural
first language acquisition. Technical Report Language Processing.
UCAM-CL-TR-675, University of Cambridge, Kwiatkowski, T., Zettlemoyer, L., Goldwater, S.,
Computer Laboratory. and Steedman, M. (2011). Lexical general-
Carey, S. and Bartlett, E. (1978). Acquring a sin- ization in ccg grammar induction for semantic
gle new word. Papers and Reports on Child parsing. In Proceedings of the Conference on
Language Development, 15. Emperical Methods in Natural Language Pro-
Chen, D. L., Kim, J., and Mooney, R. J. (2010). cessing.
Training a multilingual sportscaster: Using per- Levy, R., Reali, F., and Griffiths, T. (2009). Mod-
ceptual context to learn language. J. Artif. In- eling the effects of memory on human online
tell. Res. (JAIR), 37:397435. sentence processing with particle filters. In Ad-
Fazly, A., Alishahi, A., and Stevenson, S. (2010). vances in Neural Information Processing Sys-
A probabilistic computational model of cross- tems 21.
situational word learning. Cognitive Science, Lu, W., Ng, H. T., Lee, W. S., and Zettlemoyer,
34(6):10171063. L. S. (2008). A generative model for parsing
Frank, M., Goodman, S., and Tenenbaum, J. natural language to meaning representations. In
(2009). Using speakers referential intentions Proceedings of The Conference on Empirical
to model early cross-situational word learning. Methods in Natural Language Processing.
Psychological Science, 20(5):578585. MacWhinney, B. (2000). The CHILDES project:
Frank, M. C., Goodman, N. D., and Tenenbaum, tools for analyzing talk. Lawrence Erlbaum,
J. B. (2008). A bayesian framework for cross- Mahwah, NJ u.a. EN.
situational word-learning. Advances in Neural Maurits, L., Perfors, A., and Navarro, D. (2009).
Information Processing Systems 20. Joint acquisition of word order and word refer-
Gibson, E. and Wexler, K. (1994). Triggers. Lin- ence. In Proceedings of the 31th Annual Con-
guistic Inquiry, 25:355407. ference of the Cognitive Science Society.
Gleitman, L. (1990). The structural sources of Pearl, L., Goldwater, S., and Steyvers, M. (2010).
verb meanings. Language Acquisition, 1:155. How ideal are we? Incorporating human limi-

tations into Bayesian models of word segmen- University of Cambridge, Computer Labora-
tation. pages 315326, Somerville, MA. Cas- tory.
cadilla Press. Wong, Y. W. and Mooney, R. (2006). Learning for
Perfors, A., Tenenbaum, J. B., and Regier, T. semantic parsing with statistical machine trans-
(2011). The learnability of abstract syntactic lation. In Proceedings of the Human Language
principles. Cognition, 118(3):306 338. Technology Conference of the NAACL.
Sagae, K., MacWhinney, B., and Lavie, A. Wong, Y. W. and Mooney, R. (2007). Learn-
(2004). Adding syntactic annotations to tran- ing synchronous grammars for semantic pars-
scripts of parent-child dialogs. In Proceed- ing with lambda calculus. In Proceedings of
ings of the 4th International Conference on the Association for Computational Linguistics.
Language Resources and Evaluation. Lisbon, Xu, F. and Tenenbaum, J. B. (2007). Word learn-
LREC. ing as Bayesian inference. Psychological Re-
Sakas, W. and Fodor, J. D. (2001). The struc- view, 114:245272.
tural triggers learner. In Bertolo, S., editor, Yang, C. (2002). Knowledge and Learning in Nat-
Language Acquisition and Learnability, pages ural Language. Oxford University Press, Ox-
172233. Cambridge University Press, Cam- ford.
bridge.
Yu, C. and Ballard, D. H. (2007). A unified model
Sanborn, A. N., Griffiths, T. L., and Navarro, of early word learning: Integrating statisti-
D. J. (2010). Rational approximations to ratio- cal and social cues. Neurocomputing, 70(13-
nal models: Alternative algorithms for category 15):2149 2165.
learning. Psychological Review.
Zettlemoyer, L. S. and Collins, M. (2005). Learn-
Sato, M. (2001). Online model selection based ing to map sentences to logical form: Struc-
on the variational bayes. Neural Computation, tured classification with probabilistic categorial
13(7):16491681. grammars. In Proceedings of the Conference on
Shi, L., Griffiths, T. L., Feldman, N. H., and San- Uncertainty in Artificial Intelligence.
born, A. N. (2010). Exemplar models as a Zettlemoyer, L. S. and Collins, M. (2007). Online
mechanism for performing bayesian inference. learning of relaxed CCG grammars for pars-
Psychonomic Bulletin & Review, 17(4):443 ing to logical form. In Proc. of the Joint Con-
464. ference on Empirical Methods in Natural Lan-
Siskind, J. M. (1992). Naive Physics, Event Per- guage Processing and Computational Natural
ception, Lexical Semantics, and Language Ac- Language Learning.
quisition. PhD thesis, Massachusetts Institute Zettlemoyer, L. S. and Collins, M. (2009). Learn-
of Technology. ing context-dependent mappings from sen-
Siskind, J. M. (1996). A computational study of tences to logical form. In Proceedings of The
cross-situational techniques for learning word- Joint Conference of the Association for Com-
to-meaning mappings. Cognition, 61(1-2):1 putational Linguistics and International Joint
38. Conference on Natural Language Processing.
Steedman, M. (2000). The Syntactic Process.
MIT Press, Cambridge, MA.
Steedman, M. (2002). Plans, affordances, and
combinatory grammar. Linguistics and Philos-
ophy, 25.
Thornton, R. and Tesan, G. (2007). Categori-
cal acquisition: Parameter setting in universal
grammar. Biolinguistics, 1.
Villavicencio, A. (2002). The acquisition of a
unification-based generalised categorial gram-
mar. Technical Report UCAM-CL-TR-533,

244
Active learning for interactive machine translation

Jesús González-Rubio, Daniel Ortiz-Martínez and Francisco Casacuberta
D. de Sistemas Informáticos y Computación
U. Politècnica de València
C. de Vera s/n, 46022 Valencia, Spain
{jegonzalez,dortiz,fcn}@dsic.upv.es

Abstract

Translation needs have greatly increased during the last years. In many situations, the text to be translated constitutes an unbounded stream of data that grows continually with time. An effective approach to translate text documents is to follow an interactive-predictive paradigm in which both the system is guided by the user and the user is assisted by the system to generate error-free translations. Unfortunately, when processing such unbounded data streams even this approach requires an overwhelming amount of manpower. It is in this scenario that the use of active learning techniques is compelling. In this work, we propose different active learning techniques for interactive machine translation. Results show that, for a given translation quality, the use of active learning allows us to greatly reduce the human effort required to translate the sentences in the stream.

1 Introduction

Translation needs have greatly increased during the last years due to phenomena such as globalization and technological development. For example, the European Parliament¹ translates its proceedings into 22 languages on a regular basis, and Project Syndicate² translates editorials into different languages. In these and many other examples, the data can be viewed as an incoming unbounded stream, since it grows continually with time (Levenberg et al., 2010). Manual translation of such streams of data is extremely expensive given the huge volume of translation required; therefore, various automatic machine translation methods have been proposed.

¹ http://www.europarl.europa.eu
² http://project-syndicate.org

However, automatic statistical machine translation (SMT) systems are far from generating error-free translations, and their outputs usually require human post-editing in order to achieve high-quality translations. One way of taking advantage of SMT systems is to combine them with the knowledge of a human translator in the interactive-predictive machine translation (IMT) framework (Foster et al., 1998; Langlais and Lapalme, 2002; Barrachina et al., 2009), which is a particular case of the computer-assisted translation paradigm (Isabelle and Church, 1997). In the IMT framework, a state-of-the-art SMT model and a human translator collaborate to obtain high-quality translations while minimizing the required human effort.

Unfortunately, the application of either post-editing or IMT to data streams with massive data volumes is still too expensive, simply because manual supervision of all instances requires huge amounts of manpower. For such massive data streams the need of employing active learning (AL) is compelling. AL techniques for IMT selectively ask an oracle (e.g. a human translator) to supervise a small portion of the incoming sentences. Sentences are selected so that SMT models estimated from them translate new sentences as accurately as possible. There are three challenges when applying AL to unbounded data streams (Zhu et al., 2010). These challenges can be instantiated to IMT as follows:

1. The pool of candidate sentences is dynamically changing, whereas existing AL algorithms deal with static datasets only.

2. Concepts such as the optimum translation and the translation probability distribution are continually evolving, whereas existing AL algorithms only deal with constant concepts.

3. The data volume is unbounded, which makes it impractical to batch-learn one single system from all previously translated sentences. Therefore, model training must be done in an incremental fashion.

In this work, we present a proposal of AL for IMT specifically designed to work with stream data. In short, our proposal divides the data stream into blocks where AL techniques for static datasets are applied. Additionally, we implement an incremental learning technique to efficiently train the base SMT models as new data becomes available.

2 Related work

A body of work has recently been proposed to apply AL techniques to SMT (Haffari et al., 2009; Ambati et al., 2010; Bloodgood and Callison-Burch, 2010). The aim of these works is to build one single optimal SMT model from manually translated data extracted from static datasets. None of them fits the setting of data streams.

Some of the above described challenges of AL from unbounded streams have been previously addressed in the MT literature. In order to deal with the evolutionary nature of the problem, Nepveu et al. (2004) propose an IMT system with dynamic adaptation via cache-based model extensions for language and translation models. Pursuing the same goal for SMT, Levenberg et al. (2010) study how to bound the space when processing (potentially) unbounded streams of parallel data and propose a method to incrementally retrain SMT models. Another method to efficiently retrain an SMT model with new data was presented in (Ortiz-Martínez et al., 2010). In this work, the authors describe an application of the online learning paradigm to the IMT framework.

To the best of our knowledge, the only previous work on AL for IMT is (González-Rubio et al., 2011). There, the authors present a naive application of the AL paradigm for IMT that does not take into account the dynamic change in the probability distribution of the stream. Nevertheless, results show that even that simple AL framework halves the required human effort to obtain a certain translation quality.

In this work, the AL framework presented in (González-Rubio et al., 2011) is extended in an effort to address all the above described challenges. In short, we propose an AL framework for IMT that splits the data stream into blocks. This approach allows us to have more context to model the changing probability distribution of the stream (challenge 2) and results in a more accurate sampling of the changing pool of sentences (challenge 1). In contrast to the proposal described in (González-Rubio et al., 2011), we define sentence sampling strategies whose underlying models can be updated with the newly available data. This way, the sentences to be supervised by the user are chosen taking into account previously supervised sentences. To efficiently retrain the underlying SMT models of the IMT system (challenge 3), we follow the online learning technique described in (Ortiz-Martínez et al., 2010). Finally, we integrate all these elements to define an AL framework for IMT with the objective of obtaining an optimum balance between translation quality and human user effort.

3 Interactive machine translation

IMT can be seen as an evolution of the SMT framework. Given a sentence f from a source language to be translated into a sentence e of a target language, the fundamental equation of SMT (Brown et al., 1993) is defined as follows:

  \hat{e} = \arg\max_{e} Pr(e \mid f)    (1)

where Pr(e | f) is usually approximated by a log-linear translation model (Koehn et al., 2003). In this case, the decision rule is given by the expression:

  \hat{e} = \arg\max_{e} \left\{ \sum_{m=1}^{M} \lambda_m h_m(e, f) \right\}    (2)

where each h_m(e, f) is a feature function representing a statistical model and \lambda_m is its weight.

In the IMT framework, a human translator is introduced in the translation process to collaborate with an SMT model. For a given source sentence, the SMT model fully automatically generates an initial translation. The human user checks this translation, from left to right, correcting the first error. Then, the SMT model proposes a new extension taking the correct prefix, e_p, into account. These steps are repeated until the user accepts the translation. Figure 1 illustrates a typical IMT session. In the resulting decision rule, we have to find an extension e_s for a given prefix e_p. To do this we reformulate equation (1) as follows, where the term Pr(e_p | f) has been dropped since it does not depend on e_s:

  \hat{e}_s = \arg\max_{e_s} Pr(e_p, e_s \mid f)    (3)
            \approx \arg\max_{e_s} p(e_s \mid f, e_p)    (4)

The search is restricted to those sentences e which contain e_p as a prefix. Since e \equiv e_p e_s, we can use the same log-linear SMT model, equation (2), whenever the search procedures are adequately modified (Barrachina et al., 2009).
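To make the prefix-constrained decision rule of equations (3)-(4) concrete, the following minimal Python sketch re-ranks an n-best list under a validated prefix. It is illustrative only: the function name best_suffix and the (translation, score) n-best representation are assumptions; the actual system modifies the decoder's search procedure itself (Barrachina et al., 2009) rather than re-ranking a fixed list.

```python
def best_suffix(nbest, prefix):
    """Pick the suffix e_s of the highest-scoring hypothesis compatible with e_p.

    `nbest` is assumed to be a list of (translation, log_score) pairs produced
    by the underlying SMT model for the current source sentence.
    """
    compatible = [(t, s) for (t, s) in nbest if t.startswith(prefix)]
    if not compatible:
        # A real IMT decoder would force-decode the prefix instead of giving up.
        return ""
    best, _ = max(compatible, key=lambda ts: ts[1])
    return best[len(prefix):]   # e_s: the part that extends the validated prefix
```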

  source (f):              Para ver la lista de recursos
  desired translation (ê): To view a listing of resources
  inter.-0   e_p:
             e_s: To view the resources list
  inter.-1   e_p: To view              k: a
             e_s: list of resources
  inter.-2   e_p: To view a list       k: i
             e_s: ng resources
  inter.-3   e_p: To view a listing    k: o
             e_s: f resources
  accept     e_p: To view a listing of resources

Figure 1: IMT session to translate a Spanish sentence into English. The desired translation is the translation the human user has in mind. At interaction-0, the system suggests a translation (e_s). At interaction-1, the user moves the mouse to accept the first eight characters "To view" and presses the "a" key (k); then the system suggests completing the sentence with "list of resources" (a new e_s). Interactions 2 and 3 are similar. In the final interaction, the user accepts the current translation.

4 Active learning for IMT

The aim of the IMT framework is to obtain high-quality translations while minimizing the required human effort. Despite the fact that IMT may reduce the required effort with respect to post-editing, it still requires the user to supervise all the translations. To address this problem, we propose to use AL techniques to select only a small number of sentences whose translations are worth being supervised by the human expert.

This approach implies a modification of the user-machine interaction protocol. For a given source sentence, the SMT model generates an initial translation. Then, if this initial translation is classified as incorrect or worth supervising, we perform a conventional IMT procedure as in Figure 1. If not, we directly return the initial automatic translation and no effort is required from the user. At the end of the process, we use the newly available sentence pair (f, e) to refine the SMT models used by the IMT system.

In this scenario, the user only checks a small number of sentences; thus, the final translations are not error-free as in conventional IMT. However, results in previous works (González-Rubio et al., 2011) show that this approach yields important reductions in human effort. Moreover, depending on the definition of the sampling strategy, we can modify the ratio of sentences that are interactively translated to adapt our system to the requirements of a specific translation task. For example, if the main priority is to minimize human effort, our system can be configured to translate all the sentences without user intervention.

Algorithm 1 describes the basic algorithm to implement AL for IMT. The algorithm receives as input an initial SMT model, M, a sampling strategy, S, a stream of source sentences, F, and the block size, B. First, a block of B sentences, X, is extracted from the data stream (line 3). From this block, we sample those sentences, Y, that are worth being supervised by the human expert (line 4). For each of the sentences in X, the current SMT model generates an initial translation, e (line 6). If the sentence has been sampled as worthy of supervision, f ∈ Y, the user is required to interactively translate it (lines 8-13) as exemplified in Figure 1. The source sentence f and its human-supervised translation, ê, are then used to retrain the SMT model (line 14). Otherwise, we directly output the automatic translation e as our final translation (line 17).

Most of the functions in the algorithm denote different steps in the interaction between the human user and the machine:

- translate(M, f): returns the most probable automatic translation of f given by M.
- validPrefix(ê): returns the prefix of ê validated by the user as correct. This prefix includes the correction k.
- genSuffix(M, f, e_p): returns the suffix of maximum probability that extends the prefix e_p.
- validTranslation(ê): returns True if the user considers the current translation to be correct and False otherwise.

Apart from these, the two elements that define the performance of our algorithm are the sampling strategy S(X, M) and the retrain(M, (f, ê)) function. On the one hand, the sampling strategy decides which sentences should be supervised by the user, which determines the human effort required by the algorithm. Section 5 describes our implementation of sentence sampling to deal with the dynamic nature of data streams. On the other hand, the retrain() function incrementally trains the SMT model with each new training pair (f, ê). Section 6 describes the implementation of this function.
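The following Python sketch mirrors Algorithm 1 (reproduced below). The model API (M.translate, M.gen_suffix, M.retrain) and the user-interaction helpers (user_valid_prefix, user_accepts, emit) are hypothetical placeholders for the components just described, not the actual implementation.

```python
def al_imt(M, S, stream, B):
    """Block-wise active learning for IMT over an unbounded stream (sketch).

    M: SMT model exposing translate / gen_suffix / retrain (assumed API);
    S: sampling strategy, S(block, M) -> subset of sentences worth supervising;
    stream: iterator over incoming source sentences; B: block size.
    """
    while True:
        X = [next(stream) for _ in range(B)]      # line 3: extract the next block
        Y = S(X, M)                               # line 4: sample sentences to supervise
        for f in X:
            e = M.translate(f)                    # line 6: initial automatic hypothesis
            if f in Y:                            # worth supervising: run an IMT session
                e_hat = e
                while not user_accepts(e_hat):    # lines 9-13: prefix/suffix interaction
                    e_p = user_valid_prefix(e_hat)
                    e_s = M.gen_suffix(f, e_p)
                    e_hat = e_p + e_s
                M.retrain(f, e_hat)               # line 14: incremental (online) update
                emit(e_hat)
            else:
                emit(e)                           # line 17: ship the automatic translation
```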

  input   : M (initial SMT model)
            S (sampling strategy)
            F (stream of source sentences)
            B (block size)
  auxiliar: X (block of sentences)
            Y (sentences worth of supervision)
   1  begin
   2    repeat
   3      X = getSentsFromStream(B, F);
   4      Y = S(X, M);
   5      foreach f ∈ X do
   6        e = translate(M, f);
   7        if f ∈ Y then
   8          ê = e;
   9          repeat
  10            e_p = validPrefix(ê);
  11            e_s = genSuffix(M, f, e_p);
  12            ê = e_p e_s;
  13          until validTranslation(ê);
  14          M = retrain(M, (f, ê));
  15          output(ê);
  16        else
  17          output(e);
  18    until True;
  19  end

Algorithm 1: Pseudo-code of the proposed algorithm to implement AL for IMT from unbounded data streams.

5 Sentence sampling strategies

A good sentence sampling strategy must be able to select those sentences that, along with their correct translations, most improve the performance of the SMT model. To do that, the sampling strategy has to correctly discriminate informative sentences from those that are not. We can make different approximations to measure the informativeness of a given sentence. In the following sections, we describe the three different sampling strategies tested in our experimentation.

5.1 Random sampling

Arguably, the simplest sampling approach is random sampling, where the sentences are randomly selected to be interactively translated. Although simple, it turns out that random sampling performs surprisingly well in practice. The success of random sampling stems from the fact that in data stream environments the translation probability distributions may vary significantly through time. While general AL algorithms ask the user to translate informative sentences, they may significantly change the probability distributions by favoring certain translations; consequently, the previously human-translated sentences may no longer reveal the genuine translation distribution at the current point of the data stream (Zhu et al., 2007). This problem is less severe for static data, where the candidate pool is fixed and AL algorithms are able to survey all instances. Random sampling avoids this problem by randomly selecting sentences for human supervision. As a result, it always selects those sentences with the most similar distribution to the current sentence distribution in the data stream.

5.2 n-gram coverage sampling

One technique to measure the informativeness of a sentence is to directly measure the amount of new information that it will add to the SMT model. This sampling strategy considers that sentences with rare n-grams are more informative. The intuition behind this approach is that rare n-grams need to be seen several times in order to accurately estimate their probability.

To do that, we store the counts for each n-gram present in the sentences used to train the SMT model. We assume that an n-gram is accurately represented when it appears A or more times in the training samples. Therefore, the score for a given sentence f is computed as:

  C(f) = \frac{\sum_{n=1}^{N} |N_n^{<A}(f)|}{\sum_{n=1}^{N} |N_n(f)|}    (5)

where N_n(f) is the set of n-grams of size n in f, N_n^{<A}(f) is the set of n-grams of size n in f that are inaccurately represented in the training data, and N is the maximum n-gram order. In the experimentation, we assume N = 4 as the maximum n-gram order and a value of 10 for the threshold A. This sampling strategy works by selecting a given percentage of the highest-scoring sentences.

We update the counts of the n-grams seen by the SMT model with each new sentence pair. Hence, the sampling strategy is always up to date with the latest training data.
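A minimal sketch of the n-gram coverage score of equation (5), assuming whitespace-tokenized sentences and training counts kept in a Counter that is updated after every newly supervised pair; the function names are illustrative, only N = 4 and A = 10 come from the text.

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def coverage_score(sentence, counts, N=4, A=10):
    """C(f) from Eq. (5): fraction of n-grams (n = 1..N) seen fewer than A times
    in the data used so far to train the SMT model."""
    rare, total = 0, 0
    words = sentence.split()
    for n in range(1, N + 1):
        for g in ngrams(words, n):
            total += 1
            if counts[g] < A:
                rare += 1
    return rare / total if total else 0.0

def update_counts(counts, sentence, N=4):
    """Keep the sampler up to date with each newly added training sentence."""
    for n in range(1, N + 1):
        counts.update(ngrams(sentence.split(), n))
```

Selecting, say, the top 10% of a block by this score would then reproduce the percentage-based selection described above.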

5.3 Dynamic confidence sampling

Another technique is to consider that the most informative sentence is the one the current SMT model translates worst. The intuition behind this approach is that an SMT model cannot generate good translations unless it has enough information to translate the sentence.

The usual approach to compute the quality of a translation hypothesis is to compare it to a reference translation but, in this case, that is not a valid option since reference translations are not available. Hence, we use confidence estimation (Gandrabur and Foster, 2003; Blatz et al., 2004; Ueffing and Ney, 2007) to estimate the probability of correctness of the translations. Specifically, we estimate the quality of a translation from the confidence scores of its individual words.

The confidence score of a word e_i of the translation e = e_1 ... e_i ... e_I generated from the source sentence f = f_1 ... f_j ... f_J is computed as described in (Ueffing and Ney, 2005):

  C_w(e_i, f) = \max_{0 \le j \le |f|} p(e_i \mid f_j)    (6)

where p(e_i | f_j) is an IBM model 1 (Brown et al., 1993) bilingual lexicon probability and f_0 is the empty source word. The confidence score for the full translation e is computed as the ratio of its words classified as correct by the word confidence measure, given a word-level threshold \tau_w. Therefore, we define the confidence-based informativeness score as:

  C(e, f) = 1 - \frac{|\{e_i \mid C_w(e_i, f) > \tau_w\}|}{|e|}    (7)

Finally, this sampling strategy works by selecting a given percentage of the highest-scoring sentences.

We dynamically update the confidence sampler each time a new sentence pair is added to the SMT model. The incremental version of the EM algorithm (Neal and Hinton, 1999) is used to incrementally train the IBM model 1.
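The confidence-based score of equations (6)-(7) can be sketched as follows. The lexicon lookup lex_prob and the threshold value 0.4 are assumptions for illustration; only the max over source positions (including the empty word) and the one-minus-ratio form follow the text.

```python
def word_confidence(e_i, src_words, lex_prob):
    """Eq. (6): max IBM model 1 lexicon probability over the source words
    plus the empty word f_0 (represented here as None)."""
    candidates = [None] + list(src_words)
    return max(lex_prob(e_i, f_j) for f_j in candidates)

def sentence_informativeness(trg_words, src_words, lex_prob, tau_w=0.4):
    """Eq. (7): 1 minus the fraction of target words classified as correct.
    tau_w is the word-level threshold; its value here is arbitrary."""
    correct = sum(1 for e_i in trg_words
                  if word_confidence(e_i, src_words, lex_prob) > tau_w)
    return 1.0 - correct / len(trg_words)
```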

6 Retraining of the SMT model

To retrain the SMT model, we implement the online learning techniques proposed in (Ortiz-Martínez et al., 2010). In that work, a state-of-the-art log-linear model (Och and Ney, 2002) and a set of techniques to incrementally train this model were defined. The log-linear model is composed of a set of feature functions governing different aspects of the translation process, including a language model, a source sentence-length model, inverse and direct translation models, a target phrase-length model, a source phrase-length model and a distortion model.

The incremental learning algorithm allows us to process each new training sample in constant time (i.e. the computational complexity of training a new sample does not depend on the number of previously seen training samples). To do that, a set of sufficient statistics is maintained for each feature function. If the estimation of the feature function does not require the use of the well-known expectation-maximization (EM) algorithm (Dempster et al., 1977) (e.g. n-gram language models), then it is generally easy to incrementally extend the model given a new training sample. By contrast, if the EM algorithm is required (e.g. word alignment models), the estimation procedure has to be modified, since the conventional EM algorithm is designed for use in batch learning scenarios. For such models, the incremental version of the EM algorithm (Neal and Hinton, 1999) is applied. A detailed description of the update algorithm for each of the models in the log-linear combination is presented in (Ortiz-Martínez et al., 2010).

7 Experiments

We carried out experiments to assess the performance of the proposed AL implementation for IMT. In each experiment, we started with an initial SMT model that is incrementally updated with the sentences selected by the current sampling strategy. Due to the unavailability of public benchmark data streams, we selected a relatively large corpus and treated it as a data stream for AL. To simulate the interaction with the user, we used the reference translations in the data stream corpus as the translations the human user would like to obtain. Since each experiment is carried out under the same conditions, if one sampling strategy outperforms its peers, then we can safely conclude that this is because the sentences selected to be translated are more informative.

7.1 Training corpus and data stream

The training data comes from the Europarl corpus as distributed for the shared task of the NAACL 2006 workshop on statistical machine translation (Koehn and Monz, 2006). We used this data to estimate the initial log-linear model used by our IMT system (see Section 6). The weights of the different feature functions were tuned by means of minimum error-rate training (Och, 2003) executed on the Europarl development corpus. Once the SMT model was trained, we used the News Commentary corpus (Callison-Burch et al., 2007) to simulate the data stream. The size of these corpora is shown in Table 1. The reasons to choose the News Commentary corpus for our experiments are threefold: first, its size is large enough to simulate a data stream and test our AL techniques in the long term; second, it is out-of-domain data, which allows us to simulate a real-world situation that may occur in a translation company; and, finally, it consists of editorials from an eclectic domain (general politics, economics and science), which effectively represents the variations in the sentence distributions of the simulated data stream.

  corpus            use     sentences   words (Spa/Eng)
  Europarl          train   731K        15M/15M
  Europarl          devel.  2K          60K/58K
  News Commentary   test    51K         1.5M/1.2M

Table 1: Size of the Spanish-English corpora used in the experiments. K and M stand for thousands and millions of elements, respectively.

7.2 Assessment criteria

We want to measure both the quality of the generated translations and the human effort required to obtain them.

We measure translation quality with the well-known BLEU (Papineni et al., 2002) score.

To estimate human user effort, we simulate the actions taken by a human user in their interaction with the IMT system. The first translation hypothesis for each given source sentence is compared with a single reference translation and the longest common character prefix (LCP) is obtained. The first non-matching character is replaced by the corresponding reference character and then a new translation hypothesis is produced (see Figure 1). This process is iterated until a full match with the reference is obtained. Each computation of the LCP would correspond to the user looking for the next error and moving the pointer to the corresponding position of the translation hypothesis. Each character replacement, on the other hand, would correspond to a keystroke of the user.

Bearing this in mind, we measure the user effort by means of the keystroke and mouse-action ratio (KSMR) (Barrachina et al., 2009). This measure has been extensively used to report results in the IMT literature. KSMR is calculated as the number of keystrokes plus the number of mouse movements divided by the total number of reference characters. From a user point of view the two types of actions are different and require different types of effort (Macklovitch, 2006). In any case, as an approximation, KSMR assumes that both actions require a similar effort.
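The user simulation behind KSMR can be sketched as below, assuming a hypothetical translate_with_prefix handle to the IMT engine; it iterates the longest-common-prefix computation and single-character corrections described above and returns (keystrokes + mouse actions) divided by the reference length.

```python
def common_prefix_len(hyp, ref):
    k = 0
    while k < min(len(hyp), len(ref)) and hyp[k] == ref[k]:
        k += 1
    return k

def simulate_ksmr(translate_with_prefix, source, reference):
    """Simulated IMT session against a single reference (sketch of Section 7.2)."""
    keystrokes = mouse_actions = 0
    hyp = translate_with_prefix(source, "")
    while hyp != reference:
        k = common_prefix_len(hyp, reference)
        mouse_actions += 1                 # user moves the pointer to the first error
        prefix = reference[:k + 1]         # keep the LCP and type the correct character
        keystrokes += 1
        if prefix == reference:
            hyp = reference                # full match reached
        else:
            hyp = translate_with_prefix(source, prefix)
    return (keystrokes + mouse_actions) / len(reference)
```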

7.3 Experimental results

In this section, we report results for three different experiments. First, we studied the performance of the sampling strategies when dealing with the sampling bias problem. In the second experiment, we carried out a typical AL experiment measuring the performance of the sampling strategies as a function of the percentage of the corpus used to retrain the SMT model. Finally, we tested our AL implementation for IMT in order to study the tradeoff between required human effort and final translation quality.

7.3.1 Dealing with the sampling bias

In this experiment, we want to study the performance of the different sampling strategies when dealing with the sampling bias problem. Figure 2 shows the evolution of the translation quality, in terms of BLEU, across different data blocks for the three sampling strategies described in Section 5, namely dynamic confidence sampling (DCS), n-gram coverage sampling (NS) and random sampling (RS). On the one hand, the x-axis represents the data block number in temporal order. On the other hand, the y-axis represents the BLEU score when automatically translating a block. Such a translation is obtained by the SMT model trained with the translations supervised by the user up to that point of the data stream. To fairly compare the different methods, we fixed the percentage of words supervised by the human user (10%). In addition to this, we used a block size of 500 sentences. Similar results were obtained for other block sizes.

[Figure 2: line plot of BLEU (y-axis) against block number (x-axis) for DCS, NS and RS.]
Figure 2: Performance of the AL methods across different data blocks. Block size 500. Human supervision of 10% of the corpus.

[Figure 3: line plot of BLEU (y-axis) against percentage of the corpus in words (x-axis) for DCS, NS, SCS and RS.]
Figure 3: BLEU of the initial automatic translations as a function of the percentage of the corpus used to retrain the model.

Results in Figure 2 indicate that the performances for the data blocks fluctuate, and the fluctuations are quite significant. This phenomenon is due to the eclectic domain of the sentences in the data stream. Additionally, the steady increase in performance is caused by the increasing amount of data used to retrain the SMT model.

Regarding the results for the different sampling strategies, DCS consistently outperformed RS and NS. This observation asserts that for concept-drifting data streams with constantly changing translation distributions, DCS can adaptively ask the user to translate sentences so as to build a superior SMT model. On the other hand, NS obtains worse results than RS. This result can be explained by the fact that NS is independent of the target language and only looks at the source language, while DCS takes into account both the source sentence and its automatic translation. Similar phenomena have been reported in previous work on AL for SMT (Haffari et al., 2009).

7.3.2 AL performance

We carried out experiments to study the performance of the different sampling strategies. To this end, we compare the quality of the initial automatic translations generated in our AL implementation for IMT (line 6 in Algorithm 1). Figure 3 shows the BLEU score of these initial translations represented as a function of the percentage of the corpus used to retrain the SMT model. The percentage of the corpus is measured in number of running words.

In Figure 3, we present results for the three sampling strategies described in Section 5. Additionally, we also compare our techniques with the AL technique for IMT proposed in (González-Rubio et al., 2011). Such a technique is similar to DCS, but it does not update the IBM model 1 used by the confidence sampler with the newly available human-translated sentences. This technique is referred to as the static confidence sampler (SCS).
Results in Figure 3 indicate that the performance of the retrained SMT models increased as more data was incorporated. Regarding the sampling strategies, DCS improved the results obtained by the other sampling strategies. NS obtained by far the worst results, which confirms the results shown in the previous experiment. Finally, as can be seen, SCS obtained slightly worse results than DCS, showing the importance of dynamically adapting the underlying model used by the sampling strategy.

7.3.3 Balancing human effort and translation quality

Finally, we studied the balance between required human effort and final translation error. This can be useful in a real-world scenario where a translation company is hired to translate a stream of sentences. Under these circumstances, it would be important to be able to predict the effort required from the human translators to obtain a certain translation quality.

The experiment simulates this situation using our proposed IMT system with AL to translate the stream of sentences. To have a broad view of the behavior of our system, we repeated this translation process multiple times, requiring an increasing human effort each time. Experiments range from a fully-automatic translation system with no need of human intervention to a system where the human is required to supervise all the sentences. Figure 4 presents results for SCS (see Section 7.3.2) and the sentence selection strategies presented in Section 5. In addition, we also present results for a static system without AL (w/o AL). This system is equal to SCS but it does not perform any SMT retraining.

[Figure 4: curves of BLEU (y-axis) against KSMR (x-axis) for DCS, NS, SCS, RS and w/o AL, with an inset zooming in on the low-effort region.]
Figure 4: Quality of the data stream translation (BLEU) as a function of the required human effort (KSMR). w/o AL denotes a system with no retraining.

Results in Figure 4 show a consistent reduction in required user effort when using AL. For a given human effort, the use of AL methods allowed us to obtain twice the translation quality. Regarding the different AL sampling strategies, DCS obtains the better results, but the differences with respect to the other methods are slight.

By varying the sentence classifier, we can achieve a balance between final translation quality and required human effort. This feature allows us to adapt the system to suit the requirements of the particular translation task or the available economic or human resources. For example, if a translation quality of 60 BLEU points is satisfactory, then the human translators would need to modify only 20% of the characters of the automatically generated translations.

Finally, it should be noted that our IMT systems with AL are able to generate new suffixes and retrain with new sentence pairs in tenths of a second. Thus, they can be applied in real-time scenarios.

8 Conclusions and future work

In this work, we have presented an AL framework for IMT specially designed to process data streams with massive volumes of data. Our proposal splits the data stream into blocks of sentences of a certain size and applies AL techniques individually to each block. For this purpose, we implemented different sampling strategies that measure the informativeness of a sentence according to different criteria.

To evaluate the performance of our proposed sampling strategies, we carried out experiments comparing them with random sampling and with the only previously proposed AL technique for IMT, described in (González-Rubio et al., 2011). According to the results, one of the proposed sampling strategies, specifically the dynamic confidence sampling strategy, consistently outperformed all the other strategies.

The results of the experimentation show that the use of AL techniques allows us to make a tradeoff between required human effort and final translation quality. In other words, we can adapt our system to meet the translation quality requirements of the translation task or the available human resources.

As future work, we plan to investigate more sophisticated sampling strategies, such as those based on information density or query-by-committee. Additionally, we will conduct experiments with real users to confirm the results obtained by our user simulation.
Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287576. Work also supported by the EC (FEDER/FSE) and the Spanish MEC under the MIPRCV Consolider Ingenio 2010 program (CSD2007-00018) and the iTrans2 (TIN2009-14511) project, and by the Generalitat Valenciana under grant ALMPR (Prometeo/2009/01).

References

Vamshi Ambati, Stephan Vogel, and Jaime Carbonell. 2010. Active learning and crowd-sourcing for machine translation. In Proc. of the conference on International Language Resources and Evaluation, pages 2169-2174.
Sergio Barrachina, Oliver Bender, Francisco Casacuberta, Jorge Civera, Elsa Cubel, Shahram Khadivi, Antonio Lagarda, Hermann Ney, Jesús Tomás, Enrique Vidal, and Juan-Miguel Vilar. 2009. Statistical approaches to computer-assisted translation. Computational Linguistics, 35:3-28.
John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2004. Confidence estimation for machine translation. In Proc. of the international conference on Computational Linguistics, pages 315-321.
Michael Bloodgood and Chris Callison-Burch. 2010. Bucking the trend: large-scale cost-focused active learning for statistical machine translation. In Proc. of the Association for Computational Linguistics, pages 854-864.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19:263-311.
Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proc. of the Workshop on Statistical Machine Translation, pages 136-158.
Arthur Dempster, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38.
George Foster, Pierre Isabelle, and Pierre Plamondon. 1998. Target-text mediated interactive machine translation. Machine Translation, 12:175-194.
Simona Gandrabur and George Foster. 2003. Confidence estimation for text prediction. In Proc. of the Conference on Computational Natural Language Learning, pages 315-321.
Jesús González-Rubio, Daniel Ortiz-Martínez, and Francisco Casacuberta. 2011. An active learning scenario for interactive machine translation. In Proc. of the 13th International Conference on Multimodal Interaction. ACM.
Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proc. of the North American Chapter of the Association for Computational Linguistics, pages 415-423.
Pierre Isabelle and Kenneth Ward Church. 1997. Special issue on new tools for human translators. Machine Translation, 12(1-2):1-2.
Philipp Koehn and Christof Monz. 2006. Manual and automatic evaluation of machine translation between European languages. In Proc. of the Workshop on Statistical Machine Translation, pages 102-121.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48-54.
Philippe Langlais and Guy Lapalme. 2002. TransType: development-evaluation cycles to boost translators' productivity. Machine Translation, 17:77-98.
Abby Levenberg, Chris Callison-Burch, and Miles Osborne. 2010. Stream-based translation models for statistical machine translation. In Proc. of the North American Chapter of the Association for Computational Linguistics, pages 394-402, Los Angeles, California, June.
Elliott Macklovitch. 2006. TransType2: the last word. In Proc. of the conference on International Language Resources and Evaluation, pages 167-17.
Radford Neal and Geoffrey Hinton. 1999. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, pages 355-368.
Laurent Nepveu, Guy Lapalme, Philippe Langlais, and George Foster. 2004. Adaptive language and translation models for interactive machine translation. In Proc. of EMNLP, pages 190-197, Barcelona, Spain, July.
Franz Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proc. of the Association for Computational Linguistics, pages 295-302.
Franz Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of the Association for Computational Linguistics, pages 160-167.
Daniel Ortiz-Martínez, Ismael García-Varea, and Francisco Casacuberta. 2010. Online learning for interactive statistical machine translation. In Proc. of the North American Chapter of the Association for Computational Linguistics, pages 546-554.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of the Association for Computational Linguistics, pages 311-318.
Nicola Ueffing and Hermann Ney. 2005. Application of word-level confidence measures in interactive statistical machine translation. In Proc. of the European Association for Machine Translation conference, pages 262-270.
Nicola Ueffing and Hermann Ney. 2007. Word-level confidence estimation for machine translation. Computational Linguistics, 33:9-40.
Xingquan Zhu, Peng Zhang, Xiaodong Lin, and Yong Shi. 2007. Active learning from data streams. In Proc. of the 7th IEEE International Conference on Data Mining, pages 757-762. IEEE Computer Society.
Xingquan Zhu, Peng Zhang, Xiaodong Lin, and Yong Shi. 2010. Active learning from stream data using optimal weight classifier ensemble. Transactions on Systems, Man and Cybernetics Part B, 40:1607-1621, December.
Adapting Translation Models to Translationese Improves SMT

Gennadi Lembersky, Noam Ordan and Shuly Wintner
Dept. of Computer Science, University of Haifa, 31905 Haifa, Israel
glembers@campus.haifa.ac.il  noam.ordan@gmail.com  shuly@cs.haifa.ac.il

Abstract

Translation models used for statistical machine translation are compiled from parallel corpora; such corpora are manually translated, but the direction of translation is usually unknown, and is consequently ignored. However, much research in Translation Studies indicates that the direction of translation matters, as translated language (translationese) has many unique properties. Specifically, phrase tables constructed from parallel corpora translated in the same direction as the translation task perform better than ones constructed from corpora translated in the opposite direction.

We reconfirm that this is indeed the case, but emphasize the importance of also using texts translated in the "wrong" direction.

We take advantage of information pertaining to the direction of translation in constructing phrase tables, by adapting the translation model to the special properties of translationese. We define entropy-based measures that estimate the correspondence of target-language phrases to translationese, thereby eliminating the need to annotate the parallel corpus with information pertaining to the direction of translation. We show that incorporating these measures as features in the phrase tables of statistical machine translation systems results in consistent, statistically significant improvement in the quality of the translation.

1 Introduction

Much research in Translation Studies indicates that translated texts have unique characteristics that set them apart from original texts (Toury, 1980; Gellerstam, 1986; Toury, 1995). Known as translationese, translated texts (in any language) constitute a genre, or a dialect, of the target language, which reflects both artifacts of the translation process and traces of the original language from which the texts were translated. Among the better-known properties of translationese are simplification and explicitation (Baker, 1993, 1995, 1996): translated texts tend to be shorter, to have a lower type/token ratio, and to use certain discourse markers more frequently than original texts. Incidentally, translated texts are so markedly different from original ones that automatic classification can identify them with very high accuracy (van Halteren, 2008; Baroni and Bernardini, 2006; Ilisei et al., 2010; Koppel and Ordan, 2011).

Contemporary Statistical Machine Translation (SMT) systems use parallel corpora to train translation models that reflect source- and target-language phrase correspondences. Typically, SMT systems ignore the direction of translation used to produce those corpora. Given the unique properties of translationese, however, it is reasonable to assume that this direction may affect the quality of the translation. Recently, Kurokawa et al. (2009) showed that this is indeed the case. They train a system to translate between French and English (and vice versa) using a French-translated-to-English parallel corpus, and then an English-translated-to-French one. They find that in translating into French the latter parallel corpus yields better results, whereas for translating into English it is better to use the former.

Usually, of course, the translation direction of a parallel corpus is unknown. Therefore, Kurokawa et al. (2009) train an SVM-based classifier to predict which side of a bi-text is the origin and which one is the translation, and only use the subset of the corpus that corresponds to the translation direction of the task in training their translation model.
255
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 255265,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
We use these results as our departure point, from this table. The benefit of this method is that
but improve them in two major ways. First, not only does it yield the best results, but it also
we demonstrate that the other subset of the cor- eliminates the need to directly predict the direc-
pus, reflecting translation in the wrong direc- tion of translation of the parallel corpus. The main
tion, is also important for the translation task, and contribution of this work, therefore, is a method-
must not be ignored; second, we show that ex- ology that improves the quality of SMT by build-
plicit information on the direction of translation of ing translation models that are adapted to the na-
the parallel corpus, whether manually-annotated ture of translationese.
or machine-learned, is not mandatory. This is
achieved by casting the problem in the framework 2 Related Work
of domain adaptation: we use domain-adaptation Kurokawa et al. (2009) are the first to address
techniques to direct the SMT system toward pro- the direction of translation in the context of SMT.
ducing output that better reflects the properties Their main finding is that using the S T por-
of translationese. We show that SMT systems tion of the parallel corpus results in mucqqh better
adapted to translationese produce better transla- translation quality than when the T S portion
tions than vanilla systems trained on exactly the is used for training the translation model. We in-
same resources. We confirm these findings using deed replicate these results here (Section 3), and
an automatic evaluation metric, BLEU (Papineni view them as a baseline. Additionally, we show
et al., 2002), as well as through a qualitative anal- that the T S portion is also important for ma-
ysis of the results. chine translation and thus should not be discarded.
Our departure point is the results of Kurokawa Using information-theory measures, and in par-
et al. (2009), which we successfully replicate in ticular cross-entropy, we gain statistically signif-
Section 3. First (Section 4), we explain why trans- icant improvements in translation quality beyond
lation quality improves when the parallel corpus the results of Kurokawa et al. (2009). Further-
is translated in the right direction. We do so more, we eliminate the need to (manually or au-
by showing that the subset of the corpus that was tomatically) detect the direction of translation of
translated in the direction of the translation task the parallel corpus.
(the right direction, henceforth source-to-target, Lembersky et al. (2011) also investigate the re-
or S T ) yields phrase tables that are better lations between translationese and machine trans-
suited for translation of the original language than lation. Focusing on the language model (LM),
the subset translated in the reverse direction (the they show that LMs trained on translated texts
wrong direction, henceforth target-to-source, or yield better translation quality than LMs compiled
T S). We use several statistical measures that from original texts. They also show that perplex-
indicate the better quality of the phrase tables in ity is a good discriminator between original and
the former case. translated texts.
Then (Section 5), we explore ways to build a Our current work is closely related to research
translation model that is adapted to the unique in domain-adaptation. In a typical domain adap-
properties of translationese. We first show that tation scenario, a system is trained on a large cor-
using the entire parallel corpus, including texts pus of general (out-of-domain) training mate-
that are translated both in the right and in the rial, with a small portion of in-domain training
wrong direction, improves the quality of the re- texts. In our case, the translation model is trained
sults. Furthermore, we show that the direction of on a large parallel corpus, of which some (gener-
translation used for producing the parallel corpus ally unknown) subset is in-domain (S T ),
can be approximated by defining several entropy- and some other subset is out-of-domain (T
based measures that correlate well with transla- S). Most existing adaptation methods focus on
tionese, and, consequently, with the quality of the selecting in-domain data from a general domain
translation. corpus. In particular, perplexity is used to score
Specifically, we use the entire corpus, create a the sentences in the general-domain corpus ac-
single, unified phrase table and then use the statis- cording to an in-domain language model. Gao
tical measures mentioned above, and in particular et al. (2002) and Moore and Lewis (2010) apply
cross-entropy, as a clue for selecting phrase pairs this method to language modeling, while Foster

256
et al. (2010) and Axelrod et al. (2011) use it on French-to-English and twelve English-to-French
the translation model. Moore and Lewis (2010) phrase-based (PB-) SMT systems using the
suggest a slightly different approach, using cross- MOSES toolkit (Koehn et al., 2007), each trained
entropy difference as a ranking function. on a different subset of the corpus. We use
Domain adaptation methods are usually applied GIZA++ (Och and Ney, 2000) with grow-diag-
at the corpus level, while we focus on an adap- final alignment, and extract phrases of length up
tation of the phrase table used for SMT. In this to 10 words. We prune the resulting phrase tables
sense, our work follows Foster et al. (2010), who as in Johnson et al. (2007), using at most 30 trans-
weigh out-of-domain phrase pairs according to lations per source phrase and discarding singleton
their relevance to the target domain. They use phrase pairs.
multiple features that help distinguish between We construct English and French 5-gram lan-
phrase pairs in the general domain and those in guage models from the English and French
the specific domain. We rely on features that are subsections of the Europarl-V6 corpus (Koehn,
motivated by the findings of Translation Studies, 2005), using interpolated modified Kneser-Ney
having established their relevance through a com- discounting (Chen, 1998) and no cut-off on all
parative analysis of the phrase tables. In particu- n-grams. Europarl consists of a large number
lar, we use measures such as translation model en- of subsets translated from various languages, and
tropy, inspired by Koehn et al. (2009). Addition- is therefore unlikely to be biased towards a spe-
ally, we apply the method suggested by Moore cific source language. The reordering model used
and Lewis (2010) using perplexity ratio instead in all MT systems is trained on the union of
of cross-entropy difference. the 1.5M French-original and the 1.5M English-
original subsets, using msd-bidirectional-fe re-
3 Experimental Setup ordering. We use the MERT algorithm (Och,
The tasks we focus on are translation between 2003) for tuning and BLEU (Papineni et al., 2002)
French and English, in both directions. We as our evaluation metric. We test the statistical
use the Hansard corpus, containing transcripts of significance of the differences between the results
the Canadian parliament from 19962007, as the using the bootstrap resampling method (Koehn,
source of all parallel data. The Hansard is a 2004).
bilingual FrenchEnglish corpus comprising ap- A word on notation: We use English-original
proximately 80% English-original texts and 20% (EO) and French-original (FO) to refer to the
French-original texts. Crucially, each sentence subsets of the corpus that are translated from En-
pair in the corpus is annotated with the direction glish to French and from French to English, re-
of translation. Both English and French are lower- spectively. The translation tasks are English-to-
cased and tokenized using MOSES (Koehn et al., French (E2F) and French-to-English (F2E). We
2007). Sentences longer than 80 words are dis- thus use S T when the FO corpus is used for
carded. the F2E task or when the EO corpus is used for
To address the effect of the corpus size, we the E2F task; and T S when the FO corpus
compile six subsets of different sizes (250K, is used for the E2F task or when the EO corpus is
500K, 750K, 1M, 1.25M and 1.5M parallel used for the F2E task.
sentences) from each portion (English-original Table 1 depicts the BLEU scores of the baseline
and French-original) of the corpus. Addition- systems. The data are consistent with the findings
ally, we use the devtest section of the Hansard of Kurokawa et al. (2009): systems trained on
corpus to randomly select French-original and S T parallel texts outperform systems trained
English-original sentences that are used for tun- on T S texts, even when the latter are much
ing (1,000 sentences each) and evaluation (5,000 larger. The difference in BLEU score can be as
sentences each). French-to-English MT sys- high as 3 points.
tems are tuned and tested on French-original sen-
4 Analysis of the Phrase Tables
tences and English-to-French systems on English-
original ones. The baseline results suggest that S T and
To replicate the results of Kurokawa et al. T S phrase tables differ substantially, presum-
(2009) and set up a baseline, we train twelve ably due to the different characteristics of original

257
4 Analysis of the Phrase Tables

The baseline results suggest that S→T and T→S phrase tables differ substantially, presumably due to the different characteristics of original and translated texts. In this section we explain the better translation quality in terms of the better quality of the respective phrase tables, as defined by a number of statistical measures. We first relate these measures to the unique properties of translationese.

Task: French-to-English
Corpus subset    S→T      T→S
250K             34.35    31.33
500K             35.21    32.38
750K             36.12    32.90
1M               35.73    33.07
1.25M            36.24    33.23
1.5M             36.43    33.73

Task: English-to-French
Corpus subset    S→T      T→S
250K             27.74    26.58
500K             29.15    27.19
750K             29.43    27.63
1M               29.94    27.88
1.25M            30.63    27.84
1.5M             29.89    27.83

Table 1: BLEU scores of baseline systems

Translated texts tend to be simpler than original ones along a number of criteria. Generally, translated texts are not as rich and variable as original ones, and in particular, their type/token ratio is lower. Consequently, we expect S→T phrase tables (which are based on a parallel corpus whose source is original texts, and whose target is translationese) to have more unique source phrases and a lower number of translations per source phrase. A large number of unique source phrases suggests better coverage of the source text, while a small number of translations per source phrase means a lower phrase table entropy. Entropy-based measures are well-established tools to assess the quality of a phrase table. Phrase table entropy captures the amount of uncertainty involved in choosing candidate translation phrases (Koehn et al., 2009). Given a source phrase s and a phrase table T with translations t of s whose probabilities are p(t|s), the entropy H of s is:

H(s) = -\sum_{t \in T} p(t|s) \log_2 p(t|s)    (1)

There are two major flavors of the phrase table entropy metric: Lambert et al. (2011) calculate the average entropy over all translation options for each source phrase (henceforth, phrase table entropy or PtEnt), whereas Koehn et al. (2009) search through all possible segmentations of the source sentence to find the optimal covering set of test sentences that minimizes the average entropy of the source phrases in the covering set (henceforth, covering set entropy or CovEnt).

We also propose a metric that assesses the quality of the source side of a phrase table. The metric finds the minimal covering set of a given text in the source language using source phrases from a particular phrase table, and outputs the average length of a phrase in the covering set (henceforth, covering set average length or CovLen).

Lembersky et al. (2011) show that perplexity distinguishes well between translated and original texts. Moreover, perplexity reflects the degree of relatedness of a given phrase to original language or to translationese. Motivated by this observation, we design two cross-entropy-based measures to assess how well each phrase table fits the genre of translationese. Since MT systems are evaluated against human translations, we believe that this factor may have a significant impact on translation performance. The cross-entropy of a text T = w_1, w_2, ..., w_N according to a language model L is:

H(T, L) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 L(w_i)    (2)

We build language models of translated texts as follows. For English translationese, we extract 170,000 French-original sentences from the English portion of Europarl, and 3,000 English-translated-from-French sentences from the Hansard corpus (disjoint from the training, development and test sets, of course). We use each corpus to train a trigram language model with interpolated modified Kneser-Ney discounting and no cut-off. All out-of-vocabulary words are mapped to a special token, <unk>. Then, we interpolate the Hansard and Europarl language models to minimize the perplexity of the target side of the development set (λ = 0.58). For French translationese, we use 270,000 sentences from Europarl and 3,000 sentences from Hansard, with λ = 0.81. Finally, we compute the cross-entropy of each target phrase in the phrase tables according to these language models.
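To make the entropy-based measures concrete, the following Python sketch computes Equation (1) and PtEnt over a phrase table; the Moses-like file layout assumed here (source ||| target ||| scores, with p(t|s) as the first score) and the function names are assumptions made for illustration, not the exact tooling used in the experiments.

import math
from collections import defaultdict

def read_phrase_table(path):
    # One entry per line: "source ||| target ||| scores ...".  The first score is
    # assumed here to be p(t|s); real tables store several scores in a fixed order.
    table = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = [x.strip() for x in line.split("|||")]
            table[fields[0]][fields[1]] = float(fields[2].split()[0])
    return table

def phrase_entropy(translations):
    # Equation (1): H(s) = -sum_t p(t|s) log2 p(t|s).
    return -sum(p * math.log2(p) for p in translations.values() if p > 0)

def pt_ent(table):
    # PtEnt: average entropy over all source phrases in the table.
    return sum(phrase_entropy(t) for t in table.values()) / len(table)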
As with the entropy-based measures, we define two cross-entropy metrics: phrase table cross-entropy or PtCrEnt calculates the average cross-entropy over weighted cross-entropies of all translation options for each source phrase, and covering set cross-entropy or CovCrEnt finds the optimal covering set of test sentences that minimizes the weighted cross-entropy of the source phrases in the covering set. Given a phrase table T and a language model L, the weighted cross-entropy W for a source phrase s is:

W(s, L) = \sum_{t \in T} H(t, L) \, p(t|s)    (3)

where H(t, L) is the cross-entropy of t according to a language model L.

Table 2 depicts various statistical measures computed on the phrase tables corresponding to our 24 SMT systems.1 The data meet our preliminary expectations: S→T phrase tables have more unique source phrases, but fewer translation options per source phrase. They have lower entropy and cross-entropy, but higher covering set length.

In order to assess the correspondence of each measure to translation quality, we compute the correlation of BLEU scores from Table 1 with each of the measures specified in Table 2; we compute the correlation coefficient R² (the square of Pearson's product-moment correlation coefficient) by fitting a simple linear regression model. Table 3 lists the results. Only the covering set cross-entropy measure shows stability over the French-to-English and English-to-French translation tasks, with R² equal to 0.56 and 0.54, respectively. Other measures are sensitive to the translation task: covering set entropy has the highest correlation with BLEU (R² = 0.94) when translating French-to-English, but it drops to 0.46 for the reverse task. The covering set average length measure shows similar behavior: R² drops from 0.75 in French-to-English to 0.56 in English-to-French. Still, the correlation of these measures with BLEU is high.

Measure      R² (FR-EN)    R² (EN-FR)
AvgTran      0.06          0.22
PtEnt        0.03          0.19
CovEnt       0.94          0.46
PtCrEnt      0.33          0.44
CovCrEnt     0.56          0.54
CovLen       0.75          0.56

Table 3: Correlation of BLEU scores with phrase table statistical measures

Consequently, we use the three best measures, namely covering set entropy, cross-entropy and average length, as indicators of better translations, more similar to translationese. Crucially, these measures are computed directly on the phrase table, and do not require reference translations or meta-information pertaining to the direction of translation of the parallel phrase.

5 Translation Model Adaptation

We have thus established the fact that S→T phrase tables have an advantage over T→S ones that stems directly from the different characteristics of original and translated texts. We have also identified three statistical measures that explain most of the variability in translation quality. We now explore ways for taking advantage of the entire parallel corpus, including translations in both directions, in light of the above findings. Our goal is to establish the best method to address the issue of different translation direction components in the parallel corpus.

First, we simply take the union of the two subsets of the parallel corpus. We create three different mixtures of FO and EO: 500K sentences each of FO and EO (MIX1), 500K sentences of FO and 1M sentences of EO (MIX2), and 1M sentences of FO and 500K sentences of EO (MIX3). We use these corpora to train French-to-English and English-to-French MT systems, evaluating their quality on the evaluation sets described in Section 3. We use the same Moses configuration as well as the same language and reordering models as in Section 3.

Table 4 reports the results, comparing them to the results obtained for the baseline MT systems trained on individual French-original and English-original bi-texts (see Section 3).2

1 The phrase tables were pruned, retaining only phrases that are included in the evaluation set.
2 Recall that when translating from French to English, S→T means that the bi-text is French-original; when translating from English to French, S→T means it is English-original.
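The two covering-set measures can be sketched in the same spirit; Equation (3) is implemented directly below, while the covering step uses a greedy longest-match cover as a simplification of the optimal search described in the text, so the code is an illustration rather than the exact procedure.

def weighted_cross_entropy(translations, xent):
    # Equation (3): W(s, L) = sum_t H(t, L) * p(t|s), where `translations` maps a
    # target phrase to p(t|s) and `xent` maps it to its cross-entropy H(t, L).
    return sum(xent[t] * p for t, p in translations.items())

def covering_length(tokens, source_phrases, max_len=7):
    # CovLen-style measure: cover a tokenized source sentence with phrases known to
    # the table and return the average length of the covering phrases; unknown
    # words are covered by single tokens.
    lengths, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if n == 1 or " ".join(tokens[i:i + n]) in source_phrases:
                lengths.append(n)
                i += n
                break
    return sum(lengths) / len(lengths)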
Task: French-to-English
Set Total Source AvgTran PtEnt CovEnt PtCrEnt CovCrEnt CovLen
S→T
250K 231K 69K 3.35 0.86 0.36 3.94 1.64 2.44
500K 360K 86K 4.21 0.98 0.35 3.52 1.30 2.64
750K 461K 96K 4.81 1.05 0.35 3.24 1.10 2.77
1M 544K 103K 5.27 1.10 0.34 3.09 0.99 2.85
1.25M 619K 109K 5.66 1.14 0.34 2.98 0.91 2.92
1.5M 684K 114K 6.01 1.18 0.33 2.90 0.85 2.97
T→S
250K 199K 55K 3.65 0.92 0.45 4.00 1.87 2.25
500K 317K 69K 4.56 1.05 0.43 3.57 1.52 2.42
750K 405K 78K 5.19 1.12 0.43 3.39 1.35 2.53
1M 479K 85K 5.66 1.16 0.42 3.21 1.21 2.61
1.25M 545K 90K 6.07 1.20 0.41 3.11 1.12 2.67
1.5M 602K 94K 6.43 1.24 0.41 3.04 1.07 2.71
Task: English-to-French
Set Total Source AvgTran PtEnt CovEnt PtCrEnt CovCrEnt CovLen
S→T
250K 224K 49K 4.52 1.07 0.63 3.48 1.88 2.08
500K 346K 61K 5.64 1.21 0.59 3.08 1.49 2.25
750K 437K 68K 6.39 1.29 0.57 2.91 1.33 2.33
1M 513K 74K 6.95 1.34 0.55 2.75 1.18 2.41
1.25M 579K 78K 7.42 1.38 0.54 2.63 1.09 2.46
1.5M 635K 81K 7.83 1.41 0.53 2.58 1.03 2.50
T→S
250K 220K 46K 4.75 1.12 0.63 3.62 2.09 2.02
500K 334K 57K 5.82 1.24 0.60 3.24 1.70 2.16
750K 421K 64K 6.54 1.31 0.58 2.97 1.48 2.25
1M 489K 69K 7.10 1.36 0.57 2.84 1.35 2.32
1.25M 550K 73K 7.56 1.40 0.55 2.74 1.25 2.37
1.5M 603K 76K 7.92 1.43 0.55 2.66 1.17 2.41
Table 2: Statistical measures computed on the phrase tables: total size, in tokens (Total); the number of unique source phrases (Source); the average number of translations per source phrase (AvgTran); phrase table entropy (PtEnt) and covering set entropy (CovEnt); phrase table cross-entropy (PtCrEnt) and covering set cross-entropy (CovCrEnt); and the covering set average length (CovLen)
Note that the mixed corpus includes many more sentences than each of the baseline models; this is a realistic scenario, in which one can opt either to use the entire parallel corpus, or only its S→T subset. Even with a corpus several times as large, however, the mixed MT systems perform only slightly better than the S→T ones. On one hand, this means that one can train MT systems on S→T data only, at the expense of only a minor loss in quality. On the other hand, it is obvious that the T→S component also contributes to translation quality. We now look at ways to better utilize this portion.

We compute the measures established in the previous section on phrase tables trained on the MIX corpora, and compare them with the same measures computed for phrase tables trained on the relevant S→T corpus for both translation tasks. Table 5 displays the figures for the MIX1 corpus: phrase tables trained on mixed corpora have higher covering set average length, similar covering set entropy, but significantly worse covering set cross-entropy. Consequently, improving covering set cross-entropy has the greatest potential for improving translation quality. We therefore use this feature to encourage the decoder to
select translation options that are more related to the genre of translated texts.

Task: French-to-English
System      MIX1     MIX2     MIX3
Union       35.27    35.36    35.94
S→T         35.21    35.21    35.73
T→S         32.38    33.07    32.38

Task: English-to-French
System      MIX1     MIX2     MIX3
Union       29.27    30.01    29.44
S→T         29.15    29.94    29.15
T→S         27.19    27.19    27.88

Table 4: Evaluation of the MIX systems

French-to-English
Measure      MIX1    S→T
CovLen       2.78    2.64
CovEnt       0.37    0.35
CovCrEnt     1.58    1.10

English-to-French
Measure      MIX1    S→T
CovLen       2.40    2.25
CovEnt       0.55    0.58
CovCrEnt     2.09    1.48

Table 5: Statistical measures computed for mixed vs. source-to-target phrase tables

We do so by adding to each phrase pair in the phrase tables an additional factor, as a measure of its fitness to the genre of translationese. We experiment with two such factors. First, we use the language models described in Section 4 to compute the cross-entropy of each translation option according to this model. We add cross-entropy as an additional score of a translation pair that can be tuned by MERT (we refer to this system as CrEnt). Since cross-entropy is a lower-is-better metric, we adjust the range of values used by MERT for this score to be negative. Second, following Moore and Lewis (2010), we define an adapting feature that not only measures how close phrases are to translated language, but also how far they are from original language, and use it as a factor in a phrase table (this system is referred to as PplRatio). We build two additional language models of original texts as follows. For original English, we extract 135,000 English-original sentences from the English portion of Europarl, and 2,700 English-original sentences from the Hansard corpus. We train a trigram language model with interpolated modified Kneser-Ney discounting on each corpus and we interpolate both models to minimize the perplexity of the source side of the development set for the English-to-French translation task (λ = 0.49). For original French, we use 110,000 sentences from Europarl and 2,900 sentences from Hansard, with λ = 0.61. Finally, for each target phrase t in the phrase table we compute the ratio of the perplexity of t according to the original language model L_o and the perplexity of t with respect to the translated model L_t (see Section 4). In other words, the factor F is computed as follows:

F(t) = \frac{H(t, L_o)}{H(t, L_t)}    (4)

We apply these techniques to the French-to-English and English-to-French phrase tables built from the mixed corpora and use each phrase table to train an SMT system. Table 6 summarizes the performance of these systems. All systems outperform the corresponding Union systems. CrEnt systems show significant improvements (p < 0.05) on balanced scenarios (MIX1) and on scenarios biased towards the S→T component (MIX2 in the French-to-English task, MIX3 in English-to-French). PplRatio systems exhibit more consistent behavior, showing small, but statistically significant improvements (p < 0.05) in all scenarios.

Task: French-to-English
System        MIX1     MIX2     MIX3
Union         35.27    35.36    35.94
CrEnt         35.54    35.45    36.75
PplRatio      35.59    35.78    36.22

Task: English-to-French
System        MIX1     MIX2     MIX3
Union         29.27    30.01    29.44
CrEnt         29.47    30.44    29.45
PplRatio      29.65    30.34    29.62

Table 6: Evaluation of MT systems
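The following sketch shows how a PplRatio-style factor could be attached to a phrase table. The unigram log-probability tables used here are only a stand-in for the trigram Kneser-Ney models described above, and the phrase-table field layout is again an assumption made for the example.

def cross_entropy(phrase, logprob2):
    # Per-word cross-entropy of a phrase, Equation (2); `logprob2` maps a token to
    # its log2 probability and must contain an "<unk>" entry for unseen tokens.
    tokens = phrase.split()
    return -sum(logprob2.get(w, logprob2["<unk>"]) for w in tokens) / len(tokens)

def ppl_ratio_factor(phrase, lm_original, lm_translated):
    # Equation (4): F(t) = H(t, Lo) / H(t, Lt).  Values below 1 mark phrases that
    # look more like translated than original language.
    return cross_entropy(phrase, lm_original) / cross_entropy(phrase, lm_translated)

def add_factor(in_path, out_path, lm_original, lm_translated):
    # Append F(t) as an extra score to every phrase pair so that MERT can tune its
    # weight, in the spirit of the PplRatio systems.
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fields = [x.strip() for x in line.rstrip("\n").split("|||")]
            fields[2] += " %.4f" % ppl_ratio_factor(fields[1], lm_original, lm_translated)
            fout.write(" ||| ".join(fields) + "\n")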
Note again that all systems in the same column are trained on exactly the same corpus and have exactly the same phrase tables. The only difference is an additional factor in the phrase table that encourages the decoder to select translation options that are closer to translated texts than to original ones.

6 Analysis

In order to study the effect of the adaptation qualitatively, rather than quantitatively, we focus on several concrete examples. We compare translations produced by the Union (henceforth baseline) and by the PplRatio (henceforth adapted) French-English SMT systems. We manually inspect 200 sentences of length between 15 and 25 from the French-English evaluation set.

In many cases, the adapted system produces more fluent and accurate translations. In the following examples, the baseline system generates common translations of French words that are adequate for a wider context, whereas the adapted system chooses less common, but more suitable translations:

Source: J'ai eu cette perception et j'étais assez certain que ça allait se faire.
Baseline: I had that perception and I was enough certain it was going do.
Adapted: I had that perception and I was quite certain it was going do.

Source: J'attends donc que vous en demandiez la permission, monsieur le Président.
Baseline: I look so that you seek permission, mr. chairman.
Adapted: I await, then, that you seek permission, mr. chairman.

In quite a few cases, the baseline system leaves out important words from the source sentence, producing ungrammatical, even illegible translations, whereas the adapted system generates good translations. Careful traceback reveals that the baseline system splits the source sentence into phrases differently (and less optimally) than the adapted system. Apparently, when the decoder is coerced to select translation options that are more adapted to translationese, it tends to select source phrases that are more related to original texts, resulting in more successful coverage of the source sentence:

Source: Pourtant, lorsqu'on les avait présentées, c'était pour corriger les problèmes liés au PCSRA.
Baseline: Yet when they had presented, it was to correct the problems the CAIS program.
Adapted: Yet when they had presented, it was to correct the problems associated with CAIS.

Source: Cependant, je pense qu'il est prématuré de le faire actuellement, étant donné que le ministre a lancé cette tournée.
Baseline: However, I think it is premature to the right now, since the minister launched this tour.
Adapted: However, I think it is premature to do so now, given that the minister has launched this tour.

Finally, there are often cultural differences between languages, specifically the use of a 24-hour clock (common in French) vs. a 12-hour clock (common in English). The adapted system is more consistent in translating the former to the latter:

Source: On avait décidé de poursuivre la séance jusqu'à 18 heures, mais on n'aura pas le temps de faire un autre tour de table.
Baseline: We had decided to continue the meeting until 18 hours, but we will not have the time to do another round.
Adapted: We had decided to continue the meeting until 6 p.m., but we won't have the time to do another round.

Source: Vu qu'il est 17h 20, je suis d'accord pour qu'on ne discute pas de ma motion immédiatement.
Baseline: Seen that it is 17h 20, I agree that we are not talking about my motion immediately.
Adapted: Given that it is 5:20, I agree that we are not talking about my motion immediately.

In (human) translation circles, translating out of one's mother tongue is considered unprofessional, even unethical (Beeby, 2009). Many professional associations in Europe urge translators to work exclusively into their mother tongue (Pavlović, 2007). The two kinds of automatic systems built in this paper reflect only partly the human situation, but they do so in a crucial way. The S→T systems learn examples from many human translators who follow the decree according to which translation should be made into one's native tongue. The T→S systems are flipped directions of humans' input and output. The S→T direction proved to be more fluent, accurate and even more culturally sensitive. This has to do with the fact that the translators cover the source texts more fully, having a better translation model.
7 Conclusion

Phrase tables trained on parallel corpora that were translated in the same direction as the translation task perform better than ones trained on corpora translated in the opposite direction. Nonetheless, even 'wrong' phrase tables contribute to the translation quality. We analyze both 'correct' and 'wrong' phrase tables, uncovering a great deal of difference between them. We use insights from Translation Studies to explain these differences; we then adapt the translation model to the nature of translationese.

We incorporate information-theoretic measures that correlate well with translationese into phrase tables as an additional score that can be tuned by MERT, and show a statistically significant improvement in the translation quality over all baseline systems. We also analyze the results qualitatively, showing that SMT systems adapted to translationese tend to produce more coherent and fluent outputs than the baseline systems. An additional advantage of our approach is that it does not require an annotation of the translation direction of the parallel corpus. It is completely generic and can be applied to any language pair, domain or corpus.

This work can be extended in various directions. We plan to further explore the use of two phrase tables, one for each direction-determined subset of the parallel corpus. Specifically, we will interpolate the translation models as in Foster and Kuhn (2007), including a maximum a posteriori combination (Bacchiani et al., 2006). We also plan to upweight the S→T subset of the parallel corpus and train a single phrase table on the concatenated corpus. Finally, we intend to extend this work by combining the translation-model adaptation we present here with the language-model adaptation suggested by Lembersky et al. (2011) in a unified system that is more tuned to generating translationese.

Acknowledgments

We are grateful to Cyril Goutte, George Foster and Pierre Isabelle for providing us with an annotated version of the Hansard corpus. This research was supported by the Israel Science Foundation (grant No. 137/06) and by a grant from the Israeli Ministry of Science and Technology.

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355-362. Association for Computational Linguistics, July 2011. URL http://www.aclweb.org/anthology/D11-1033.

Michiel Bacchiani, Michael Riley, Brian Roark, and Richard Sproat. MAP adaptation of stochastic grammars. Computer Speech and Language, 20:41-68, January 2006. ISSN 0885-2308. doi: 10.1016/j.csl.2004.12.001. URL http://dl.acm.org/citation.cfm?id=1648820.1648854.

Mona Baker. Corpus linguistics and translation studies: Implications and applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and technology: in honour of John Sinclair, pages 233-252. John Benjamins, Amsterdam, 1993.

Mona Baker. Corpora in translation studies: An overview and some suggestions for future research. Target, 7(2):223-243, September 1995.

Mona Baker. Corpus-based translation studies: The challenges that lie ahead. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Terminology, LSP and Translation. Studies in language engineering in honour of Juan C. Sager, pages 175-186. John Benjamins, Amsterdam, 1996.

Marco Baroni and Silvia Bernardini. A new approach to the study of Translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing, 21(3):259-274, September 2006. URL http://llc.oxfordjournals.org/cgi/content/short/21/3/259?rss=1.

Alison Beeby. Direction of translation (directionality). In Mona Baker and Gabriela Saldanha, editors, Routledge Encyclopedia of Translation Studies, pages 84-88. Routledge (Taylor and Francis), New York, 2nd edition, 2009.

Stanley F. Chen. An empirical study of smoothing techniques for language modeling. Technical report 10-98, Computer Science Group, Harvard University, November 1998.

George Foster and Roland Kuhn. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 128-135. Association for Computational Linguistics, June 2007. URL http://www.aclweb.org/anthology/W/W07/W07-0717.
George Foster, Cyril Goutte, and Roland Kuhn. Discriminative instance weighting for domain adaptation in statistical machine translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 451-459, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1870658.1870702.

Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. Toward a unified approach to statistical language modeling for Chinese. ACM Transactions on Asian Language Information Processing, 1:3-33, March 2002. ISSN 1530-0226. doi: http://doi.acm.org/10.1145/595576.595578. URL http://doi.acm.org/10.1145/595576.595578.

Martin Gellerstam. Translationese in Swedish novels translated from English. In Lars Wollin and Hans Lindquist, editors, Translation Studies in Scandinavia, pages 88-95. CWK Gleerup, Lund, 1986.

Iustina Ilisei, Diana Inkpen, Gloria Corpas Pastor, and Ruslan Mitkov. Identification of translationese: A machine learning approach. In Alexander F. Gelbukh, editor, Proceedings of CICLing-2010: 11th International Conference on Computational Linguistics and Intelligent Text Processing, volume 6008 of Lecture Notes in Computer Science, pages 503-511. Springer, 2010. ISBN 978-3-642-12115-9. URL http://dx.doi.org/10.1007/978-3-642-12116-6.

Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. Improving translation quality by discarding most of the phrasetable. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 967-975. Association for Computational Linguistics, June 2007. URL http://www.aclweb.org/anthology/D/D07/D07-1103.

Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP 2004, pages 388-395, Barcelona, Spain, July 2004. Association for Computational Linguistics.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79-86, Phuket, Thailand, 2005. AAMT. URL http://mt-archive.info/MTS-2005-Koehn.pdf.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume Proceedings of the Demo and Poster Sessions, pages 177-180, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P07-2045.

Philipp Koehn, Alexandra Birch, and Ralf Steinberger. 462 machine translation systems for Europe. In Machine Translation Summit XII, 2009.

Moshe Koppel and Noam Ordan. Translationese and its dialects. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1318-1326, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1132.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT-Summit XII, 2009.

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. Investigations on translation model adaptation using monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284-293. Association for Computational Linguistics, July 2011. URL http://www.aclweb.org/anthology/W11-2132.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language models for machine translation: Original vs. translated texts. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 363-374, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D11-1034.

Robert C. Moore and William Lewis. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference, Short Papers, pages 220-224, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http://dl.acm.org/citation.cfm?id=1858842.1858883.

Franz Josef Och. Minimum error rate training in statistical machine translation. In ACL '03: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160-167, Morristown, NJ, USA, 2003. Association for Computational Linguistics. doi: http://dx.doi.org/10.3115/1075096.1075117.
Franz Josef Och and Hermann Ney. Improved statistical alignment models. In ACL '00: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 440-447, Morristown, NJ, USA, 2000. Association for Computational Linguistics. doi: http://dx.doi.org/10.3115/1075218.1075274.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318, Morristown, NJ, USA, 2002. Association for Computational Linguistics. doi: http://dx.doi.org/10.3115/1073083.1073135.

Nataša Pavlović. Directionality in translation and interpreting practice. Report on a questionnaire survey in Croatia. Forum, 5(2):79-99, 2007.

Gideon Toury. In Search of a Theory of Translation. The Porter Institute for Poetics and Semiotics, Tel Aviv University, Tel Aviv, 1980.

Gideon Toury. Descriptive Translation Studies and beyond. John Benjamins, Amsterdam / Philadelphia, 1995.

Hans van Halteren. Source language markers in EUROPARL translations. In COLING '08: Proceedings of the 22nd International Conference on Computational Linguistics, pages 937-944, Morristown, NJ, USA, 2008. Association for Computational Linguistics. ISBN 978-1-905593-44-6.
Aspectual Type and Temporal Relation Classification

Francisco Costa                          António Branco
Universidade de Lisboa                   Universidade de Lisboa
fcosta@di.fc.ul.pt                       Antonio.Branco@di.fc.ul.pt

Abstract

In this paper we investigate the relevance of aspectual type for the problem of temporal information processing, i.e. the problems of the recent TempEval challenges.

For a large list of verbs, we obtain several indicators about their lexical aspect by querying the web for expressions where these verbs occur in contexts associated with specific aspectual types.

We then proceed to extend existing solutions for the problem of temporal information processing with the information extracted this way. The improved performance of the resulting models shows that (i) aspectual type can be data-mined with unsupervised methods with a level of noise that does not prevent this information from being useful and that (ii) temporal information processing can profit from information about aspectual type.

1 Introduction

Extracting the temporal information present in a text is relevant to many natural language processing applications, including question-answering, information extraction, and even document summarization, as summaries may be more readable if they follow a chronological order.

Recent evaluation campaigns have focused on the extraction of temporal information from written text. TempEval (Verhagen et al., 2007), in 2007, and more recently TempEval-2 (Verhagen et al., 2010), in 2010, were concerned with this problem. Additionally, they provided data that can be used to develop and evaluate systems that can automatically temporally tag natural language text. These data are annotated according to the TimeML (Pustejovsky et al., 2003) scheme.

Figure 1 shows a small and slightly simplified fragment of the data from TempEval, with TimeML annotations. There, event terms, such as the term referring to the event of releasing the tapes, are annotated using EVENT tags. States (such as the situations denoted by verbs like want or love) are also considered events. Temporal expressions, such as today, are enclosed in TIMEX3 tags. The attribute value of time expressions holds a normalized representation of the date or time they refer to (e.g. the word today denotes the date 1998-01-14 in this example). The TLINK elements at the end describe temporal relations between events and temporal expressions. For instance, the event of the plane going down is annotated as temporally preceding the date denoted by the temporal expression today.

The major tasks of these two TempEval evaluation challenges were about guessing the type of temporal relations, i.e. the value of the relType attribute of the TLINK elements in Figure 1, all other annotations being given. Temporal relation classification is also the most interesting problem in temporal information processing. The other relevant tasks (identifying and normalizing temporal expressions and events) have a longer research history and show better evaluation results.
<s>In Washington <TIMEX3 tid="t53" type="DATE" value="1998-01-14">today</TIMEX3>, the Federal
Aviation Administration <EVENT eid="e1" class="OCCURRENCE" stem="release" aspect="NONE"
tense="PAST" polarity="POS" pos="VERB">released</EVENT> air traffic control tapes from
<TIMEX3 tid="t54" type="TIME" value="1998-XX-XXTNI">the night</TIMEX3> the TWA Flight eight
hundred <EVENT eid="e2" class="OCCURRENCE" stem="go" aspect="NONE" tense="PAST"
polarity="POS" pos="VERB">went</EVENT> down.</s>
<TLINK lid="l1" relType="BEFORE" eventID="e2" relatedToTime="t53"/>
<TLINK lid="l2" relType="OVERLAP" eventID="e2" relatedToTime="t54"/>

Figure 1: Sample of the data annotated for TempEval, corresponding to the fragment: In Washington today, the Federal Aviation Administration released air traffic control tapes from the night the TWA Flight eight hundred went down.

TempEval was organized in three tasks (TempEval-2 has four additional ones, that are not relevant to this work): task A was concerned with classifying temporal relations holding between an event and a time mentioned in the same sentence (although they could be syntactically unrelated, as the temporal relation represented by the TLINK with the lid with the value l1 in Figure 1); task B focused on the temporal relation between events and the document's creation time, which is also annotated in TimeML (not shown in that Figure); and task C was about classifying the temporal relation between the main events of two consecutive sentences. The possible values for the type of temporal relation are BEFORE, AFTER and OVERLAP.1

Table 1 shows the results of the first TempEval evaluation. The results of TempEval-2 are fairly similar (Verhagen et al., 2010), but the data used are similar but not identical.

                                       Task
                                   A       B       C
Best system                        0.62    0.80    0.55
Average of all participants        0.56    0.74    0.51
Majority class baseline            0.57    0.56    0.47

Table 1: Results for English in TempEval (F-measure), from Verhagen et al. (2009)

The best system in TempEval for tasks A and B (Puscasu, 2007) combined statistical and knowledge based methods to propagate temporal constraints along parse trees coming from a syntactic parser. The best system for task C (Min et al., 2007) also combined rule-based and machine learning approaches. It employed sophisticated NLP to compute some of the features used; more specifically it used syntactic features.

Our goal with this work is to evaluate the impact of information about aspectual type on these tasks. The TimeML annotations include an attribute class for EVENTs that encodes some aspectual information, distinguishing between stative (annotated with the value STATE) and non-stative events (value OCCURRENCE). This attribute is relevant to the classification problem at hand, i.e. it is a useful feature for machine learned classifiers for the TempEval tasks (although this class attribute encodes other kinds of information as well). However, aspectual distinctions can be more fine-grained than a mere binary distinction, and so far no system has explored this sort of information to help improve the solutions to temporal relation classification.

In this paper we work with Portuguese, but in principle there is no reason to believe that our findings would not apply to other languages that display similar aspectual phenomena, such as English. Some of the details, such as the material in Section 4.2, are however language specific and would need adaptation.

2 Aspectual Type

Distinctions of aspectual type (also referred to as situation type, lexical aspect or Aktionsart) of the sort of Vendler (1967) and Dowty (1979) are expected to improve the existing solutions to the problem of temporal relation classification. The major aspectual distinctions are between (i) states (e.g. to hate beer, to know the answer, to own a car, to stink), (ii) processes, also called activities (to work, to eat ice cream, to grow, to play the piano), (iii) culminated processes, also called accomplishments (to paint a picture, to burn down, to deliver a sermon) and (iv) culminations, also called achievements (to explode, to win the game, to find the key). States and processes are atelic situations in that they do not make salient a specific instant in time. Culminated processes and culminations are telic situations: they have an intrinsic, instantaneous endpoint, called the culmination (e.g. in the case of to paint a picture, it is the moment when the picture is ready; in the case of to explode, it is the moment of the explosion).

1 There are the additional disjunctive values BEFORE-OR-OVERLAP, OVERLAP-OR-AFTER and VAGUE, employed when the annotators could not make a more specific decision, but these affect a small number of instances.
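As an illustration of how such annotations become classification instances for these tasks, the following Python sketch extracts (event attributes, related entity, relType) triples from a TimeML fragment; the sample string is abridged from Figure 1 and the function name is illustrative.

import xml.etree.ElementTree as ET

SAMPLE = """<doc><s>In Washington <TIMEX3 tid="t53" type="DATE" value="1998-01-14">today</TIMEX3>,
the FAA <EVENT eid="e1" class="OCCURRENCE" stem="release" tense="PAST" pos="VERB">released</EVENT>
tapes from <TIMEX3 tid="t54" type="TIME" value="1998-XX-XXTNI">the night</TIMEX3> the flight
<EVENT eid="e2" class="OCCURRENCE" stem="go" tense="PAST" pos="VERB">went</EVENT> down.</s>
<TLINK lid="l1" relType="BEFORE" eventID="e2" relatedToTime="t53"/>
<TLINK lid="l2" relType="OVERLAP" eventID="e2" relatedToTime="t54"/></doc>"""

def tlink_instances(timeml):
    # One instance per TLINK: the attributes of the event, the id of the related
    # time or event, and the relType label to be predicted.
    root = ET.fromstring(timeml)
    events = {e.get("eid"): e.attrib for e in root.iter("EVENT")}
    return [(events[t.get("eventID")],
             t.get("relatedToTime") or t.get("relatedToEvent"),
             t.get("relType"))
            for t in root.iter("TLINK")]

for event, related, label in tlink_instances(SAMPLE):
    print(event["eid"], related, label)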
There are several reasons to think aspectual type is relevant to temporal information processing. First, these distinctions are related to how long events last: culminations are punctual, whereas states can be very prolonged in time. States are thus more likely to temporally overlap other temporal entities than culminations, for instance.

Second, there are grammatical consequences on how events are anchored in time. Consider the following examples, from Ritchie (1979) and Moens and Steedman (1988):

(1) When they built the 59th Street bridge, they used the best materials.
(2) When they built that bridge, I was still a young lad.

The situation of building the bridge is a culminated process, composed by the process of actively building a bridge followed by the culmination of the bridge being finished. In sentence (1), the event described in the main clause (that of using the best materials) is a process, but in sentence (2) it is a state (the state of being a young lad). Even though the two clauses in each sentence are connected by when, the temporal relations holding between the events of each clause are different. On the one hand, in sentence (1) the event of using the best materials (a process) overlaps with the process of actively building the bridge and precedes the culmination of finishing the bridge. On the other hand, in sentence (2) the event of being a young lad (which is a state) overlaps with both the process of actively building the bridge and the culmination of the bridge being built. This difference is arguably caused by the different aspectual types of the main events of each sentence.

As another example, states overlap with temporal location adverbials, as in (3), while culminations are included in them, as in (4).

(3) He was happy last Monday.
(4) He reached the top of Mount Everest last Monday.

In other cases, differences in aspectual type can disambiguate ambiguous linguistic material. For instance, the preposition in is ambiguous as it can be used to locate events in the future but also to measure the duration of culminated processes; it is thus ambiguous with culminated processes, as in he will read the book in three days, but not with other aspectual types, as in he will be living there in three days.

A factor related to aspectual class, that is not trivial to account for, is the phenomenon of aspectual shift, or aspectual coercion (Moens and Steedman, 1988; de Swart, 1998; de Swart, 2000). Many linguistic contexts pose constraints on aspectual type. This does not mean, however, that clashes of aspectual type cause ungrammaticality. What often happens is that phrases associated with an incompatible aspectual type get their type changed in order to be of the required type, causing a change in meaning.

For instance, the progressive construction combines with processes. When it combines with e.g. a culminated process, the culmination is stripped off from this culminated process, which is thus converted into a process. The result is that a sentence like (5) does not say that the bridge was finished (the event has no culmination), whereas one such as (6) does say this (the event has a culmination).

(5) They were building that bridge.
(6) They built that bridge.

Aspectual type is not a property of just words, but phrases as well. For example, while the progressive construction just mentioned combines with processes, the resulting phrase behaves as a state (cf. the sentence When they built the 59th Street bridge, they were using the best materials and what was mentioned above about when clauses).

3 Strategy

Aspectual type is hard to annotate. This is partly because of what was just mentioned: it is not a property of just words, but rather phrases, and different phrases with the same head word can have different aspectual types; however, annotation schemes like TimeML annotate the head word as denoting events, not full phrases or clauses.

For this reason, our strategy is to obtain aspectual type information from unannotated data.
Because these data are gradient (an event-denoting word can be associated with different aspectual types, depending on word sense), we do not aim to extract categorical information, but rather numeric values for each event term that reflect associations to aspectual types. These may be seen as values that are indicative of the frequencies with which an event term denotes a state, or a process, etc.

In order to extract these indicators, we resort to a methodology sometimes referred to as Google Hits: large amounts of queries are sent to a web search engine (not necessarily Google), and the number of search results (the number of web pages that match the query) is recorded and taken as a measure of the frequency of the queried expression.

This methodology is not perfect, since multiple occurrences of the queried expression in the same web page are not reflected in the hit count, and in many cases the hit counts reported by search engines are just estimates and might not be very accurate. Additionally, carelessly formulated queries can match expressions that are syntactically and semantically very different from what was intended. In any case, it has the advantages of being based on a very large amount of data and not requiring any manual annotation, which can introduce errors.

3.1 The Web as a Very Large Corpus

Hearst (1992) is one of the earliest studies where specific textual patterns are used to extract lexico-semantic information from very large corpora. The author's goal was to extract hyponymy relations. With the same goal, Kozareva et al. (2008) apply similar textual patterns to the web.

The web has been used as a corpus by many other authors with the purpose of extracting syntactic or semantic properties of words or relations between them, e.g. Ravichandran and Hovy (2002), Etzioni et al. (2004), etc. Some of this work is specially relevant to the problem of temporal information processing. VerbOcean (Chklovski and Pantel, 2004) is a database of web-mined relations between verbs. Among other kinds of relations, it includes typical precedence relations, e.g. sleeping happens before waking up. This type of information has in fact been used by some of the participating systems of TempEval-2 (Ha et al., 2010), with good results.

More generally, there is a large body of work focusing on lexical acquisition from corpora. Just as an example, Mayol et al. (2005) learn subcategorization frames of verbs from large amounts of data. Relevant to our work is that of Siegel and McKeown (2000). The authors guess the aspectual type of verbs by searching for specific patterns in a one million word corpus that has been syntactically parsed. They extract several linguistic indicators and combine them with machine learning algorithms. The indicators that they extract are naturally different from ours, since they have access to syntactic structure and we do not, but our data are based on a much larger corpus.

3.2 Textual Patterns as Indicators of Aspectual Type

Because of aspectual shift phenomena (see Section 2), full syntactic parsing is necessary in order to determine the aspectual type of a natural language expression. However, this can be approximated by frequencies: it is natural to expect that e.g. stative verbs occur more frequently in stative contexts than non-stative verbs, even if there may be errors in determining these contexts if syntactic parsing is not a possibility.

If one uses Google Hits, syntactic information is not accessible. In return for its impreciseness, Google Hits have the advantage of being based on very large amounts of data.

4 Scope and Approach

In this study we focus exclusively on verbs, but events can be denoted by words belonging to other parts-of-speech. This limitation is linked to the fact that the textual patterns that are used to search for specific aspectual contexts are sensitive to part-of-speech (i.e. what may work for a verb may not work equally well for a noun).

In order to assess whether aspectual type information is relevant to the problem of temporal relation classification, our approach is to check whether incorporating that kind of information into existing solutions for this problem can improve their performance. TimeML annotated data, such as those used for TempEval, can be used to train machine learned classifiers. These can then be augmented with attributes encoding aspectual type information and their performance compared to the original classifiers.

Additionally, we work with Portuguese data. This is because our work is part of an effort to implement a temporal processing system for Portuguese. We briefly describe the data next.
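The bookkeeping behind this methodology is simple: what matters is the ratio of two hit counts. The sketch below leaves the actual search-engine call as a stub, since no particular engine or API is implied by the method, and the function names are illustrative.

def hit_ratio(hits_a, hits_b, numerator="a"):
    # The kind of ratio used by the indicators described below: a/(a+b) or b/(a+b).
    # Returns None when neither query has any hits.
    total = hits_a + hits_b
    if total == 0:
        return None
    return (hits_a if numerator == "a" else hits_b) / total

def hit_count(query):
    # Stub: submit `query`, enclosed in double quotes for exact matching, to a web
    # search engine and return the reported number of matching pages.
    raise NotImplementedError

# e.g., Indicator S1 for 'fazer': hit_ratio(hit_count('"fazia"'), hit_count('"fez"'))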
<s>Em Washington, <TIMEX3 tid="t53" type="DATE" value="1998-01-14">hoje</TIMEX3>, a Federal
Aviation Administration <EVENT eid="e1" class="OCCURRENCE" stem="publicar" aspect="NONE"
tense="PPI" polarity="POS" pos="VERB">publicou</EVENT> gravações do controlo de tráfego aéreo
da <TIMEX3 tid="t54" type="TIME" value="1998-XX-XXTNI">noite</TIMEX3> em que o voo TWA800
<EVENT eid="e2" class="OCCURRENCE" stem="cair" aspect="NONE" tense="PPI" polarity="POS"
pos="VERB">caiu</EVENT>.</s>
<TLINK lid="l1" relType="BEFORE" eventID="e2" relatedToTime="t53"/>
<TLINK lid="l2" relType="OVERLAP" eventID="e2" relatedToTime="t54"/>

Figure 2: Sample of the Portuguese data adapted from the TempEval data, corresponding to the fragment: Em Washington, hoje, a Federal Aviation Administration publicou gravações do controlo de tráfego aéreo da noite em que o voo TWA800 caiu.

4.1 Data

Our experiments used TimeBankPT (Costa and Branco, 2010; Costa and Branco, 2012; Costa, to appear). This corpus is an adaptation of the original TempEval data to Portuguese, obtained by translating it and then adapting the annotations. Figure 2 shows the Portuguese equivalent to the sample presented above in Figure 1. The two corpora are quite similar, but there is of course the language difference. TimeBankPT contains a few corrections to the data (mostly the temporal relations), but these corrections only changed around 1.2% of the total number of annotated temporal relations (Costa and Branco, 2012). Although we did not test our results on English data, we speculate that our results carry over to other languages.

Just like the original English corpus for TempEval, it is divided in a training part and a testing part. The numbers (sentences, words, annotated events, time expressions and temporal relations) are fairly similar for the two corpora (the English one and the Portuguese one).

4.2 Extracting the Aspectual Indicators

We extracted the 4,000 most common verbs from a 180 million word corpus of Portuguese newspaper text, CETEMPúblico. Because this corpus is not annotated, we used a part-of-speech tagger and morphological analyzer (Barreto et al., 2006; Silva, 2007) to detect verbs and to obtain their dictionary form. We then used an inflection tool (Branco et al., 2009) to generate the specific verb forms that are used in the queries. They are mostly third person singular forms of several different tenses.

The indicators that we used are ratios of Google Hits. They compare two queries. Several indicators were tested. We provide examples with the verb fazer 'do' for the queries being compared by each indicator. The name of each indicator reflects the aspectual type being tested, i.e. states should present high values for State Indicators 1 and 2, processes should show high values for Process Indicators 1-4, etc.

State Indicator 1 (Indicator S1) is about imperfective and perfective past forms of verbs. It compares the number of hits a for an imperfective form fazia 'did' to the number of hits b for a perfective form fez 'did': a/(a+b). Assuming the imperfective past constrains the entire clause to be a state, and the perfective past constrains it to be telic, the higher this value the more frequently the verb appears in stative clauses in a past tense.2

State Indicator 2 (Indicator S2) is about the co-occurrence with acaba de 'has just finished'. It compares the number of hits a for acaba de fazer 'has just finished doing' to the number of hits b for fazer 'to do': b/(a+b). In Portuguese, this construction does not seem to be felicitous with states.

Process Indicator 1 (Indicator P1) is about past progressive forms and simple past forms (both imperfective). It compares the number of hits a for fazia 'did' to the number of hits b for estava a fazer 'was doing': b/(a+b). Assuming the progressive construction is a function from processes to states (see Section 2), the higher this value, the more likely the verb can occur with the interpretation of a process.

2 We expect this frequency to be indicative of states because states can appear in the imperfective past tense with their interpretation unchanged, whereas non-stative events have their interpretation shifted to a stative one in that context (e.g. they get a habitual reading). In order to refer to an event occurring in the past with an on-going interpretation, non-stative verbs require the progressive construction to be used in Portuguese, whereas states do not. Therefore, states should occur more freely in the simple imperfective past.
Process Indicator 2 (Indicator P2) is about past progressive forms vs. simple past forms (perfective). It compares the number of hits a for fez 'did' to the number of hits b for esteve a fazer 'was doing': b/(a+b). Similarly to the previous indicator, this one tests the frequency of a verb appearing in a context typical of processes.

Process Indicator 3 (Indicator P3) is about the occurrence of for-adverbials. It compares the number of hits a for fez 'did' to the number of hits b for fez durante muito tempo 'did for a long time': b/(a+b). This number is also intended to be an indication of how frequently a verb can be used with the interpretation of a process. Note that Portuguese allows modifiers to occur freely between a verb and its complements, so this test should work for transitive verbs (or any other subcategorization frame involving complements), not just intransitive ones.

Process Indicator 4 (Indicator P4) is about the co-occurrence of a verb with parar de 'to stop'. It compares the number of hits a for parou de fazer 'stopped doing' to the number of hits b for fazer 'to do': a/(a+b). Just like the English verbs stop and finish are sensitive to the aspectual type of their complement, so is the Portuguese verb parar, which selects for processes.

Atelicity Indicator 1 (Indicator A1) is about comparing in- and for-adverbials. It compares the number of hits a for fez num instante 'did in an instant' to the number of hits b for fez durante muito tempo 'did for a long time': b/(a+b). Processes can be modified by for-adverbials, whereas culminated processes are modified by in-adverbials. This indicator tests the occurrence of a verb in contexts that require these aspectual types.

Atelicity Indicator 2 (Indicator A2) is about comparing for-adverbials with suddenly. It compares the number of hits a for fez de repente 'did suddenly' to the number of hits b for fez durante muito tempo 'did for a long time': b/(a+b). De repente 'suddenly' seems to modify culminations, so this indicator compares process readings with culmination readings.

Culmination Indicator 1 (Indicator C1) is about differentiating culminations and culminated processes. It compares the number of hits a for fez de repente 'did suddenly' to the number of hits b for fez num instante 'did in an instant': a/(a+b).

For each of the 4,000 verbs, the queries required by these indicators were generated and then sent to a search engine. The queries were enclosed in quotes, so as to guarantee exact matches. The number of hits was recorded for each query.

We had some problems with outliers for a few rather infrequent verbs. These could show very extreme values for some indicators. In order to minimize their impact, for each indicator we homogenized the 100 highest values that were found. More specifically, for each indicator, each one of the highest 100 values was replaced by the 100th highest value. The bottom 100 values were similarly changed. This way the top 99 values and the bottom 99 values are replaced by the 100th highest value and the 100th lowest value respectively.

Each indicator ranges between 0 and 1 in theory. In practice, we seldom find values close to the extremes, as this would imply that some queries would have close to 0 hits, which does not occur very often (after all, we intentionally used queries for which we would expect large hit counts, as these are more likely to be representative of true language use). For this reason, each indicator is scaled so that its minimum (actual) value is 0 and its maximum (actual) value is 1.

5 Evaluation

As mentioned before, in order to assess the usefulness of these aspectual indicators for the tasks of temporal relation classification, we checked whether they can improve machine learned classifiers trained for this problem. We next describe the classifiers that were used as the bases for comparison.

5.1 Experimental Setup

In order to obtain bases for comparison, we trained machine learned classifiers on the Portuguese corpus TimeBankPT, that is adapted from the TempEval data (see Section 4.1). We took inspiration in the work of Hepple et al. (2007).
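Putting the indicator definitions together, the query generation and the post-processing (outlier homogenization and rescaling) could be sketched as follows. The dictionary keys for the inflected forms are assumptions, and which of the two orientations (a/(a+b) or b/(a+b)) each indicator uses follows the prose above rather than the code.

def query_pairs(forms):
    # The (query a, query b) pairs behind each indicator, following the examples
    # given for 'fazer'; `forms` supplies the infinitive and the 3sg imperfective
    # and perfective past forms produced by the inflection tool.
    inf, impf, perf = forms["inf"], forms["impf"], forms["perf"]
    return {
        "S1": (impf, perf),
        "S2": ("acaba de " + inf, inf),
        "P1": (impf, "estava a " + inf),
        "P2": (perf, "esteve a " + inf),
        "P3": (perf, perf + " durante muito tempo"),
        "P4": ("parou de " + inf, inf),
        "A1": (perf + " num instante", perf + " durante muito tempo"),
        "A2": (perf + " de repente", perf + " durante muito tempo"),
        "C1": (perf + " de repente", perf + " num instante"),
    }

def homogenize(values, k=100):
    # Replace everything above the k-th highest value (and below the k-th lowest)
    # by those two values, taming outliers from infrequent verbs; assumes the list
    # is much longer than 2k (4,000 verbs here).
    ordered = sorted(values)
    lo, hi = ordered[k - 1], ordered[-k]
    return [min(max(v, lo), hi) for v in values]

def rescale(values):
    # Scale an indicator so that its observed minimum becomes 0 and its maximum 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(query_pairs({"inf": "fazer", "impf": "fazia", "perf": "fez"})["A1"])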
This was one of the participating systems of TempEval. It used machine learning algorithms implemented in Weka (Witten and Frank, 1999). For our experiments, we used Weka's implementation of the C4.5 algorithm, trees.J48 (Quinlan, 1993), the RIPPER algorithm as implemented by Weka's rules.JRip (Cohen, 1995), a nearest neighbors classifier, lazy.KStar (Cleary and Trigg, 1995), a Naïve Bayes classifier, namely Weka's bayes.NaiveBayes (John and Langley, 1995), and a support vector classifier, Weka's functions.SMO (Platt, 1998). We chose these algorithms as they are representative of a wide range of machine learning approaches.

Recall that the tasks of TempEval are to guess the type of temporal relations. Each train or test instance thus corresponds to a temporal relation, i.e. a TLINK element in the TimeML annotations (see Figures 1 and 2). The classification problem is to determine the value of the attribute relType of TimeML TLINK elements. These temporal relations relate an event (referred by the eventID attribute of TLINK elements) to another temporal entity, that can be a time (pointed to by the relatedToTime attribute), in the case of tasks A and B, or, in the case of task C, another event (given by the relatedToEvent attribute).

As for the features that were employed, we also took inspiration in the approach of Hepple et al. (2007). These authors used as classifier attributes two types of features. The first group of features corresponds to TimeML attributes: for instance the value of the aspect attribute of EVENT elements, for the events involved in the temporal relation to be classified. The second group of features corresponds to simple features that can be computed with string manipulation and do not require any kind of natural language processing. Table 2 shows the features that were tried and employed.

                         Task
Attribute                A  B  C
event-aspect             ✓ ✓
event-polarity           ✓ ✓ ✓
event-POS                ✓
event-stem               ✓
event-string             ✓
event-class              ✓ ✓
event-tense              ✓ ✓ ✓
order-event-first        ✓ N/A N/A
order-event-between      ✓ N/A N/A
order-timex3-between     N/A N/A
order-adjacent           ✓ N/A N/A
timex3-mod               ✓ N/A
timex3-type              N/A
tlink-relType            ✓ ✓ ✓

Table 2: Feature combinations used in the classifiers used as comparison bases. Features inspired by the ones used by Hepple et al. (2007) in TempEval.

The event features correspond to attributes of EVENT elements, with the exception of the event-string feature, which takes as value the character data inside the corresponding TimeML EVENT element. In a similar spirit, the timex3 features are taken from the attributes of TIMEX3 elements with the same name. The tlink-relType feature is the class attribute and corresponds to the relType attribute of the TimeML TLINK element that represents the temporal relation to be classified. The order features are the attributes computed from the document's textual content. The feature order-event-first encodes whether the event term precedes in the text the time expression it is related to by the temporal relation to classify. The classifier attribute order-event-between describes whether any other event is mentioned in the text between the two expressions for the entities that are in the temporal relation, and similarly order-timex3-between is about whether there is an intervening temporal expression. Finally, order-adjacent is true iff both order-timex3-between and order-event-between are false (even if other linguistic material occurs between the expressions denoting the two entities in the temporal relation).

In order to arrive at the final set of features (marked with a check mark in Table 2), we performed exhaustive search on all possible combinations of these features for each task, using the Naïve Bayes algorithm. They were compared using 10-fold cross-validation on the training data. The feature combinations shown in Table 2 are the optimal combinations arrived at in this way.
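The exhaustive feature-combination search can be sketched as follows, with scikit-learn standing in for the Weka Naïve Bayes implementation that was actually used; instances are assumed to be dictionaries over the attributes of Table 2, and all names are illustrative.

from itertools import combinations
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def cv_accuracy(instances, labels, subset):
    # 10-fold cross-validation accuracy on the training data for one feature subset.
    rows = [{f: inst.get(f, "NA") for f in subset} for inst in instances]
    model = make_pipeline(DictVectorizer(), BernoulliNB())
    return cross_val_score(model, rows, labels, cv=10).mean()

def best_feature_set(instances, labels, features):
    # Exhaustive search over all non-empty feature combinations, keeping the best.
    best, best_score = None, -1.0
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            score = cv_accuracy(instances, labels, subset)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score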
These are the classifiers that we used for the comparison with the aspectual type indicators. We chose this straightforward approach because it forms a basis for comparison that is easily reproducible: the algorithm implementations that were used are part of freely available software, and the features that were employed are easily computed from the annotated data, with no need to run any natural language processing tools whatsoever.

As mentioned before in Section 4.1, the data used are organized in a training set and an evaluation set. The training part is around 60K words long, the test data containing around 9K words. When tested on held-out data, these classifiers present the scores shown in italics in Table 3. These results are fairly similar to the scores that the system of Hepple et al. (2007) obtained in TempEval with English data: 0.59 for task A, 0.73 for task B, and 0.54 for task C. They are also not very far from the best results of TempEval. As such they represent interesting bases for comparison, as improving their performance is likely to be relevant to the best systems that have been developed for temporal information processing.

                                Task
Classifier                  A       B       C
trees.J48                   0.57    0.77    0.53
  With best indicator                       0.55
rules.JRip                  0.60    0.76    0.51
  With best indicator       0.61            0.54
lazy.KStar                  0.54    0.70    0.52
  With best indicator               0.73    0.53
bayes.NaiveBayes            0.50    0.76    0.53
  With best indicator       0.53            0.54
functions.SMO               0.55    0.79    0.54
  With best indicator       0.56            0.55

Table 3: Evaluation on held-out test data of classifiers trained on full train data. Values for the classifiers used as comparison bases are in italics. Boldface highlights improvements resulting from incorporating aspectual indicators as classifier features, and missing values represent no improvement.

5.2 Results and Discussion

After obtaining the bases for comparison described above, we proceeded to check whether the aspectual type indicators described in Section 4.2 can improve these results.

For each aspectual indicator, we implemented a classifier feature that encodes its value for the event term in the temporal relation (if it is not a verb, this value is missing). In the case of task C, two features are added for each indicator, one for each event term.

We extended each of these classifiers with one of these features at a time (two in the case of task C), and checked whether it improved the results on the test data. So for instance, in order to test Indicator S1, we extended each of these classifiers with a feature that encodes the value that this indicator presents for the term that denotes the event present in the temporal relation to be classified. In the case of task C, two classifier features are added, one for each event term, and both for the same Indicator S1. For instance, for the (training) instance corresponding to the TLINK in Figure 2 with the lid attribute that has the value l1, the classifier feature for Indicator S1 has the value that was computed for the verb cair 'go down', the stem of the word that denotes the event that is the first argument of this temporal relation. After adding each of these features, we retrained the classifiers on the training data and tested them on the held-out test data. In order to keep the evaluation manageable, we did not test combinations of multiple indicators.

Table 3 shows the overall results. For task A, the best indicators were P4 (with JRip), A1 (NaiveBayes) and S1 (SMO). For task B the best one was P4 (KStar). For task C, the best indicators were P3 (J48), A1 and P3 (JRip), C1 (KStar), A1 (NaiveBayes) and P2 (SMO). Each of the indicators S2, P1 and A2 either does not improve the results or does so but not as much as another, better indicator for the same task and algorithm.

It seems clear from Table 3 that some tasks benefit from these indicators more than others. In particular, task C shows consistent improvements whereas task B is hardly affected. Since task C is about relations involving two events, the classifiers may be picking up the sort of linguistic generalizations mentioned in Section 2 about when clauses.

J48 and JRip produce human-readable models. We checked how these classifiers are taking advantage of the aspectual indicators.

273
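The per-indicator evaluation protocol behind Table 3 is easy to express in code. The sketch below is an illustrative reconstruction, not the authors' implementation: it assumes each instance carries a dictionary of base features plus the verb stem(s) of its event term(s), adds one extra attribute per aspectual indicator (two for task C), retrains the classifier on the training set, and scores it on the held-out test set. The helper indicator_value, which stands in for the web-count-based indicators of Section 4.2, and the data layout are assumptions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

INDICATORS = ["S1", "S2", "P1", "P2", "P3", "P4", "A1", "A2", "C1"]

def with_indicator(instance, name, indicator_value, task):
    """Copy the base features and add the indicator value for each event term."""
    feats = dict(instance["features"])
    slots = ["event1_verb", "event2_verb"] if task == "C" else ["event1_verb"]
    for slot in slots:
        verb = instance.get(slot)
        if verb is not None:            # the value is missing for non-verb events
            feats[f"{name}-{slot}"] = indicator_value(verb, name)
    return feats

def evaluate_indicators(train, test, make_classifier, indicator_value, task):
    """Retrain with one extra indicator feature at a time; report held-out accuracy."""
    y_train = [inst["relType"] for inst in train]
    y_test = [inst["relType"] for inst in test]
    scores = {}
    for name in INDICATORS:
        X_train = [with_indicator(inst, name, indicator_value, task) for inst in train]
        X_test = [with_indicator(inst, name, indicator_value, task) for inst in test]
        model = make_pipeline(DictVectorizer(), make_classifier())
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores
```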
For task C, the induced models generally associate high values of the indicators A1 and P3 with overlap relations and low values of these indicators with other types of relations. This is expected. On the one hand, high values for these indicators are associated with atelicity (i.e. the endpoint of the corresponding event is not presented). On the other hand, both indicators are based on queries containing the phrase durante muito tempo 'for a long time', which, in addition to picking up events that can be modified by for-adverbials, more specifically pick up events that happen for a long time and are thus likely to overlap other events.

For task A, JRip also associates high values of the indicator P4, which constitute evidence that the corresponding events are processes (and therefore atelic), with overlap relations. This is an especially interesting result, considering that the queries on which this indicator is based reflect a purely aspectual constraint.

6 Concluding Remarks

In this paper, we evaluated the relevance of information about aspectual type for temporal processing tasks.

Temporal information processing has received substantial attention recently with the two TempEval challenges in 2007 and 2010. The most interesting problem of temporal information processing, that of temporal relation classification, is still affected by high error rates.

Even though a very substantial part of the semantics literature on tense and aspect focuses on aspectual type, solutions to the problem of automatic temporal relation classification have not incorporated this sort of semantic information. In part this is expected, as aspectual type is very interconnected with syntax (cf. the discussion about aspectual coercion in Section 2), and the phenomenon of aspect shift can make it hard to compute even when syntactic information is available.

Our contribution with this paper is to incorporate this sort of information in existing machine-learned classifiers that tackle this problem. Even though these classifiers do not have access to syntactic information, aspectual type information seemed to be useful in improving the performance of these models. We hypothesize that combining aspectual type information with information about syntactic structure can further improve performance on temporal information processing problems, but we leave that research to future work.

An interesting question that we hope will be addressed by future work is how these results extend to other languages. We cannot provide an answer to this question, as we do not have the data. However, this experiment can be replicated for any language that has (i) TimeML annotated data, (ii) a reasonable number of documents on the Web and a search engine capable of separating them from the documents in other languages, and (iii) an aspectual system similar enough that the question being addressed in this paper makes sense (and useful patterns for queries can be constructed, even if not entirely identical to the ones that we used). The second criterion is met by many, many languages. The third one also seems to hold for many languages, as the existing literature on aspectual phenomena indicates that these phenomena are quite widespread. The first criterion is, at the moment, the hardest to fulfill, as not many languages have data with rich annotations about time (i.e. including events and temporal relations). We speculate that our results can extend to English, although a different set of query patterns may have to be used in order to extract the aspectual indicators that are employed. We believe this because the two languages largely overlap when it comes to aspectual phenomena.

References

Florbela Barreto, António Branco, Eduardo Ferreira, Amália Mendes, Maria Fernanda Nascimento, Filipe Nunes, and João Silva. 2006. Open resources and tools for the shallow processing of Portuguese: the TagShare project. In Proceedings of LREC 2006.

António Branco, Francisco Costa, Eduardo Ferreira, Pedro Martins, Filipe Nunes, João Silva, and Sara Silveira. 2009. LX-Center: a center of online linguistic services. In Proceedings of the Demo Session, ACL-IJCNLP 2009, Singapore.

Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the Web for fine-grained semantic verb relations. In Proceedings of EMNLP-2004, Barcelona, Spain.

John G. Cleary and Leonard E. Trigg. 1995. K*: An instance-based learner using an entropic distance measure. In 12th International Conference on Machine Learning, pages 108-114.

William W. Cohen. 1995. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115-123.
Francisco Costa and António Branco. 2010. Temporal information processing of a new language: Fast porting with minimal resources. In Proceedings of ACL 2010.

Francisco Costa and António Branco. 2012. TimeBankPT: A TimeML annotated corpus of Portuguese. In Proceedings of LREC 2012.

Francisco Costa. to appear. Processing Temporal Information in Unstructured Documents. Ph.D. thesis, Universidade de Lisboa, Lisbon.

Henriette de Swart. 1998. Aspect shift and coercion. Natural Language and Linguistic Theory, 16:347-385.

Henriette de Swart. 2000. Tense, aspect and coercion in a cross-linguistic perspective. In Proceedings of the Berkeley Formal Grammar conference, Stanford. CSLI Publications.

David R. Dowty. 1979. Word Meaning and Montague Grammar: the Semantics of Verbs and Times in Generative Semantics and Montague's PTQ. Reidel, Dordrecht.

Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2004. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web.

Eun Young Ha, Alok Baikadi, Carlyle Licata, and James C. Lester. 2010. NCSU: Modeling temporal relations with Markov logic and lexical ontology. In Proceedings of SemEval 2010.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics, volume 2, pages 539-545, Nantes, France.

Mark Hepple, Andrea Setzer, and Rob Gaizauskas. 2007. USFD: Preliminary exploration of features and classifiers for the TempEval-2007 tasks. In Proceedings of SemEval-2007, pages 484-487, Prague, Czech Republic. Association for Computational Linguistics.

George H. John and Pat Langley. 1995. Estimating continuous distributions in Bayesian classifiers. In Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338-345, San Mateo.

Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. In Proceedings of ACL-08: HLT, pages 1048-1056, Columbus, Ohio. Association for Computational Linguistics.

Laia Mayol, Gemma Boleda, and Toni Badia. 2005. Automatic acquisition of syntactic verb classes with basic resources. Language Resources and Evaluation, 39(4):295-312.

Congmin Min, Munirathnam Srikanth, and Abraham Fowler. 2007. LCC-TE: A hybrid approach to temporal relation identification in news text. pages 219-222.

Marc Moens and Mark Steedman. 1988. Temporal ontology and temporal reference. Computational Linguistics, 14(2):15-28.

John Platt. 1998. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Chris Burges, and Alexander J. Smola, editors, Advances in Kernel Methods: Support Vector Learning.

Georgiana Puscasu. 2007. WVALI: Temporal relation identification by syntactico-semantic analysis. In Proceedings of SemEval-2007, pages 484-487, Prague, Czech Republic. Association for Computational Linguistics.

James Pustejovsky, José Castaño, Robert Ingria, Roser Saurí, Robert Gaizauskas, Andrea Setzer, and Graham Katz. 2003. TimeML: Robust specification of event and temporal expressions in text. In IWCS-5, Fifth International Workshop on Computational Semantics.

John Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Deepak Ravichandran and Eduard Hovy. 2002. Learning surface text patterns for a question answering system. In Proceedings of ACL 2002.

Graeme D. Ritchie. 1979. Temporal clauses in English. Theoretical Linguistics, 6:87-115.

Eric V. Siegel and Kathleen McKeown. 2000. Learning methods to combine linguistic indicators: Improving aspectual classification and revealing linguistic insights. Computational Linguistics, 24(4):595-627.

João Ricardo Silva. 2007. Shallow processing of Portuguese: From sentence chunking to nominal lemmatization. Master's thesis, Faculdade de Ciências da Universidade de Lisboa, Lisbon, Portugal.

Zeno Vendler. 1967. Verbs and times. Linguistics in Philosophy, pages 97-121.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, and James Pustejovsky. 2007. SemEval-2007 Task 15: TempEval temporal relation identification. In Proceedings of SemEval-2007.

Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Jessica Moszkowicz, and James Pustejovsky. 2009. The TempEval challenge: identifying temporal relations in text. Language Resources and Evaluation.

Marc Verhagen, Roser Saurí, Tommaso Caselli, and James Pustejovsky. 2010. SemEval-2010 task 13: TempEval-2. In Proceedings of SemEval-2010.

Ian H. Witten and Eibe Frank. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco.
Automatic generation of short informative sentiment summaries

Andrea Glaser and Hinrich Schütze
Institute for Natural Language Processing
University of Stuttgart, Germany
glaseraa@ims.uni-stuttgart.de

Abstract needs: it must convey the sentiment of the re-


view, but it must also provide a specific reason
In this paper, we define a new type of
for that sentiment, so that the user can make an
summary for sentiment analysis: a single-
sentence summary that consists of a sup- informed decision as to whether reading the en-
porting sentence that conveys the overall tire review is likely to be worth the users time
sentiment of a review as well as a convinc- again similar to the purpose of the summary of a
ing reason for this sentiment. We present a web page in search engine results.
system for extracting supporting sentences We call a sentence that satisfies these two crite-
from online product reviews, based on a
ria a supporting sentence. A supporting sentence
simple and unsupervised method. We de-
sign a novel comparative evaluation method contains information on the sentiment as well as
for summarization, using a crowdsourcing a specific reason for why the author arrived at this
service. The evaluation shows that our sentiment. Examples for supporting sentences are
sentence extraction method performs better The picture quality is very good or The bat-
than a baseline of taking the sentence with tery life is 2 hours. Non-supporting sentences
the strongest sentiment. contain opinions without such reasons such as I
like the camera or This camera is not worth the
1 Introduction money.
Given the success of work on sentiment analy- To address use cases of sentiment analysis that
sis in NLP, increasing attention is being focused involve quick scanning and selective reading of
on how to present the results of sentiment analy- large numbers of reviews, we present a simple un-
sis to the user. In this paper, we address an im- supervised system in this paper that extracts one
portant use case that has so far been neglected: supporting sentence per document and show that
quick scanning of short summaries of a body of it is superior to a baseline of selecting the sentence
reviews with the purpose of finding a subset of with the strongest sentiment.
reviews that can be studied in more detail. This One problem we faced in our experiments was
use case occurs in companies that want to quickly that standard evaluations of summarization would
assess, perhaps on a daily basis, what consumers have been expensive to conduct for this study. We
think about a particular product. One-sentence therefore used crowdsourcing to perform a new
summaries can be quickly scanned similar to type of comparative evaluation method that is dif-
the summaries that search engines give for search ferent from training set and gold standard cre-
results and the reviews that contain interesting ation, the dominant way crowdsourcing has been
and new information can then be easily identified. used in NLP so far.
Consumers who want to quickly scan review sum- In summary, our contributions in this paper are
maries to pick out a few reviews that are helpful as follows. We define supporting sentences, a new
for a purchasing decision are a similar use case. type of sentiment summary that is appropriate in
For a one-sentence summary to be useful in this situations where both the sentiment of a review
context, it must satisfy two different information and a good reason for that sentiment need to be

276
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 276-285,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
conveyed succinctly. We present a simple un- aspects can result in bad summaries.
supervised method for extracting supporting sen- Our approach enables us to find strong support-
tences and show that it is superior to a baseline in ing sentences even if the reason given in that sen-
a novel crowdsourcing-based evaluation. tence does not fit well into the fixed inventory. No
In the next section, we describe related work manual work like the creation of an aspect inven-
that is relevant to our new approach. In Section 3 tory is necessary and there are no requirements on
we present the approach we use to identify sup- the format of the reviews such as author-provided
porting sentences. Section 4 describes the fea- pros and cons.
ture representation of sentences and the classifi- Aspect-oriented summarization also differs in
cation method. In Section 5 we give an overview that it does not differentiate along the dimension
of the crowdsourcing evaluation. Section 6 dis- of quality of the reason given for a sentiment. For
cusses our experimental results. In Sections 7 and example, I dont like the zoom and The zoom
8, we present our conclusions and plans for future range is too limited both give reasons for why a
work. camera gets a negative evaluation, but only the lat-
ter reason is informative. In our work, we evaluate
2 Related Work the quality of the reason given for a sentiment.
Both sentiment analysis (Pang and Lee, 2008; The use case we address in this paper requires
Liu, 2010) and summarization (Nenkova and a short, easy-to-read summary. A well-formed
McKeown, 2011) are important subfields of NLP. sentence is usually easier to understand than a
The work most relevant to this paper is work on pro/con table. It also has the advantage that the
summarization methods that addresses the spe- information conveyed is accurately representing
cific requirements of summarization in sentiment what the user wanted to say this is not the case
analysis. There are two lines of work in this vein for a presentation that involves several complex
with goals similar to ours: (i) aspect-based and processing steps and takes linguistic material out
pro/con-summarization and (ii) approaches that of the context that may be needed to understand it
extract summary sentences from reviews. correctly.
An aspect is a component or attribute of a Berend (2011) performs a form of pro/con
product such as battery, lens cap, battery summarization that does not rely on aspects.
life, and picture quality for cameras. Aspect- However, most of the problems of aspect-based
oriented summarization (Hu and Liu, 2004; pro/con summarization also apply to this paper:
Zhuang et al., 2006; Kim and Hovy, 2006) col- no differentiation between good and bad reasons,
lects sentiment assessments for a given set of as- the need for human labels to train a classifier, and
pects and returns a list of pros and cons about ev- inferior readability compared to a well-formed
ery aspect for a review or, in some cases, on a sentence.
per-sentence basis. Two previous approaches that have attempted
Aspect-oriented summarization and pro/con- to extract sentences from reviews in the context
summarization differ in a number of ways from of summarization are (Beineke et al., 2004) and
supporting sentence summarization. First, as- (Arora et al., 2009). Beineke et al. (2004) train
pects and pros&cons are taken from a fixed in- a classifier on rottentomatoes.com summary sen-
ventory. The inventory is typically small and does tences provided by review authors. These sen-
not cover the full spectrum of relevant informa- tences sometimes contain a specific reason for the
tion. Second, in its most useful form, aspect- overall sentiment of the review, but sometimes
oriented summarization requires classification of they are just catchy lines whose purpose is to
phrases and sentences according to the aspect they draw moviegoers in to read the entire review; e.g.,
belong to; e.g., The camera is very light has El Bulli barely registers a pulse stronger than a
to be recognized as being relevant to the aspect books (which does not give a specific reason for
weight. Developing a component that assigns why the movie does not register a strong pulse).
phrases and sentences to their corresponding cat- Arora et al. (2009) define two classes of sen-
egories is time-consuming and has to be redone tences: qualified claims and bald claims. A qual-
for each domain. Any such component will make ified claim gives the reader more details (e.g.,
mistakes and undetected or incorrectly classified This camera is small enough to fit easily in a

277
coat pocket) while a bald claim is open to inter- in the supporting sentence are nominal; the
pretation (e.g., This camera is small). Quali- verb will be needed in many cases to accu-
fied/bald is a dimension of classification of senti- rately convey the reason for the sentiment
ment statements that is to some extent orthogonal expressed. However, it is a fairly safe as-
to quality of reason. Qualified claims do not have sumption that part of the information is con-
to contain a reason and bald claims can contain veyed using noun phrases since it is dif-
an informative reason. For example, I didnt like ficult to convey specific information with-
the camera, but I suspect it will be a great camera out using specific noun phrases. Adjectives
for first timers is classified as a qualified claim, are often important when expressing a rea-
but the sentence does not give a good reason for son, but frequently a noun is also mentioned
the sentiment of the document. Both dimensions or one would need to resolve a pronoun to
(qualified/bald, high-quality/low-quality reason) make the sentence a self-contained support-
are important and can be valuable components of ing sentence. In a sentence like Its easy
a complete sentiment analysis system. to use it is not clear what the adjective is
Apart from the definition of the concept of sup- referring to.
porting sentence, which we believe to be more ap-
propriate for the application we have in mind than (iii) Noun phrases that express supporting facts
rottentomatoes.com summary sentences and qual- tend to be domain-specific; they can be
ified claims, there are two other important differ- automatically identified by selecting noun
ences of our approach to these two papers. First, phrases that are frequent in the domain ei-
we directly evaluate the quality of the reasons in a ther in relative terms (compared to a generic
crowdsourcing experiment. Second, our approach corpus) or in absolute terms. By making
is unsupervised and does not require manual an- this assumption we may fail to detect sup-
notation of a training set of supporting sentences. porting sentences that are worded in an orig-
As we will discuss in Section 5, we propose inal way using ordinary words. However,
a novel evaluation measure for summarization in a specific domain there is usually a lot
based on crowdsourcing in this paper. The most of redundancy and most good reasons oc-
common use of crowdsourcing in NLP is to have cur many times and are expressed by similar
workers label a training set and then train a super- words.
vised classifier on this training set. In contrast, we
use crowdsourcing to directly evaluate the relative Based on these assumptions, we select the sup-
quality of the automatic summaries generated by porting sentence in two steps. In the first step, we
the unsupervised method we propose. determine the n sentences with the strongest sen-
timent within every review by classifying the po-
3 Approach larity of the sentences (where n is a parameter).
In the second step, we select one of the n sen-
Our approach is based on the following three tences as the best supporting sentence by means
premises. of a weighting function.
(i) A good supporting sentence conveys both Step 1: Sentiment Classification
the reviews sentiment and a supporting fact.
We make this assumption because we want In this step, we apply a sentiment classifier to all
the sentence to be self-contained. If it only sentences of the review to classify sentences as
describes a fact about a product without positive or negative. We then select the n sen-
evaluation, then it does not on its own ex- tences with the highest probability of conforming
plain which sentiment is conveyed by the ar- with the overall sentiment of the document. For
ticle and why. example, if the documents polarity is negative,
we select the n sentences that are most likely to be
(ii) Supporting facts are most often expressed by negative according to the sentiment classifier. We
noun phrases. We call a noun phrase that ex- restrict the set of n sentences to sentences with the
presses a supporting fact a keyphrase. We right sentiment because even an excellent sup-
are not assuming that all important words porting sentence is not a good characterization of

278
the content of the review if it contradicts the over- quency), I1 (the set of infrequent nouns), F2 (the
all assessment given by the review. Only in cases set of compounds with high frequency), and I2
where there are fewer than n sentences with the (the set of infrequent compounds). An infrequent
correct sentiment, we also select sentences with noun (resp. compound) is simply defined as a
the wrong sentence to obtain a minimum of n noun (resp. compound) that does not meet the fre-
sentences for each review. quency criterion.
We define the score s of a sentence with n to-
Step 2: Weighting Function kens t1 . . . tn (where the last token tn is a punctu-
Based on premises (ii) and (iii) above, we score ation mark) as follows:
a sentence based on the number of noun phrases
n1
that occur with high absolute and relative fre- X
s= wf2 [[(ti , ti+1 ) F2 ]]
quency in the domain. We only consider sim-
i=1
ple nouns and compound nouns consisting of + wi2 [[(ti , ti+1 ) I2 ]] (2)
two nouns in this paper. In general, compound + wf1 [[ti F1 ]]
nouns are more informative and specific. A com- + wi1 [[ti I1 ]]
pound noun may refer to a specific reason even
if the head noun does not (e.g., life vs. battery where [[]] = 1 if is true and [[]] = 0 otherwise.
life). This means that we need to compute scores Note that a noun in a compound will contribute to
in a way that allows us to give higher weight to the overall score in two different summands.
compound nouns than to simple nouns. The weights wf2 , wi2 , wf1 , and wi1 are deter-
In addition, we also include counts of nouns mined using logistic regression. The training set
and compounds in the scoring that do not have for the regression is created in an unsupervised
high absolute/relative frequency because fre- fashion as follows. From each set of n sentences
quency heuristics identify keyphrases with only (one per review), we select the two highest scor-
moderate accuracy. However, theses nouns and ing, i.e., the two sentences that were classified
compounds are given a lower weight. with the highest confidence. The two classes in
This motivates a scoring function that is a the regression problem are then the top ranked
weighted sum of four variables: number of simple sentences vs. the sentences at rank 2. Since tak-
nouns with high frequency, number of infrequent ing all sentences turned out to be too noisy, we
simple nouns, number of compound nouns with eliminate sentence pairs where the top sentence is
high frequency, and number of infrequent com- better than the second sentence on almost all of
pound nouns. High frequency is defined as fol- the set counts (i.e., count of members of F1 , I1 ,
lows. Let fdom (p) be the domain-specific abso- F2 , and I2 ). Our hypothesis in setting up this re-
lute frequency of phrase p, i.e., the frequency in gression was that the sentence with the strongest
the review corpus, and fwiki (p) the frequency of sentiment often does not give a good reason. Our
p in the English Wikipedia. We view the distribu- experiments confirm that this hypothesis is true.
tion of terms in Wikipedia as domain-independent The weights wf2 , wi2 , wf1 , and wi1 estimated
and define the relative frequency as in Equation 1. by the regression are then used to score sentences
according to Equation 2.
fdom (p)
frel (p) = (1) We give the same weight to all keyphrase com-
fwiki (p) pounds (and the same weight to all keyphrase
We do not consider nouns and compound nouns nouns) in future work one could attempt to give
that do not occur in Wikipedia for computing higher weights to keyphrases with higher absolute
the relative frequency. A noun (resp. compound or relative frequency. In this paper, our goal is to
noun) is deemed to be of high frequency if it is establish a simple baseline for the task of extrac-
one of the k% nouns (resp. compound nouns) with tion of supporting sentences.
the highest fdom (p) and at the same time is one of After computing the overall weight for each
the k% nouns (resp. compound nouns) with the sentence in a review, the sentence with the highest
highest frel (p) where k is a parameter. weight is chosen as the supporting sentence the
Based on these definitions, we define four dif- sentence that is most informative for explaining
ferent sets: F1 (the set of nouns with high fre- the overall sentiment of the review.

279
4 Experiments reasons. The cleaned corpus consists of 11,624
documents. Finally, we split the corpus into train-
4.1 Data
ing set (85%) and test set (15%) as shown in Table
We use part of the Amazon dataset from Jindal 1. The average number of sentences of a review is
and Liu (2008). The dataset consists of more than 13.36 sentences, the median number of sentences
5.8 million consumer-written reviews of several is 10.
products, taken from the Amazon website. For
our experiment we used the digital camera do- 4.3 Sentiment Classification
main and extracted 15,340 reviews covering a to- We first build a sentence sentiment classifier by
tal of 740 products. See table 1 for key statistics training the Stanford maximum entropy classifier
of the data set. (Manning and Klein, 2003) on the sentences in the
training set. Sentences occurring in positive (resp.
Type Number
negative) reviews are labeled positive (resp. neg-
Brands 17
ative). We use a simple bag-of-words representa-
Products 740
tion (without punctuation characters and frequent
Documents (all) 15,340
stop words). Propagating labels from documents
Documents (cleaned) 11,624 to sentences creates a noisy training set because
Documents (train) 9,880 some sentences have sentiment different from the
Documents (test) 1,744 sentiment in their documents; however, there is
Short test documents 147 no alternative because we need per-sentence clas-
Long test documents 1,597 sification decisions, but do not have per-sentence
Average number of sents 13.36 human labels.
Median number of sents 10 The accuracy of the classifier is 88.4% on
propagated sentence labels.
Table 1: Key statistics of our dataset We use the sentence classifier in two ways.
First, it defines our baseline BL for extracting
In addition to the review text, authors can give supporting sentences: the baseline simply pro-
an overall rating (a number of stars) to the prod- poses the sentence with the highest sentiment
uct. Possible ratings are 5 (very positive), 4 (pos- score that is compatible with the sentiment of the
itive), 3 (neutral), 2 (negative), and 1 (very nega- document as the supporting sentence.
tive). We unify ratings of 4 and 5 to positive and Second, the sentence classifier selects a subset
ratings of 1 and 2 to negative to obtain polarity of candidate sentences that is then further pro-
labels for binary classification. Reviews with a cessed using the scoring function in Equation 2.
rating of 3 are discarded. This subset consists of the n = 5 sentences with
the highest sentiment scores of the right polarity
4.2 Preprocessing that is, if the document is positive (resp. nega-
We tokenized and part-of-speech (POS) tagged tive), then the n = 5 sentences with the highest
the corpus using TreeTagger (Schmid, 1994). We positive (resp. negative) scores are selected.
split each review into individual sentences by us-
ing the sentence boundaries given by TreeTag- 4.4 Determining Frequencies and Weights
ger. One problem with user-written reviews is The absolute frequency of nouns and compound
that they are often not written in coherent En- nouns simply is computed as their token fre-
glish, which results in wrong POS tags. To ad- quency in the training set. For computing the rel-
dress some of these problems, we cleaned the ative frequency (as described in Section 3, Equa-
corpus after the tokenization step. We separated tion 1), we use the 20110405 dump of the English
word-punctuation clusters (e.g., word...word) and Wikipedia.
removed emoticons, html tags, and all sentences In the product review corpora we studied,
with three or fewer tokens, many of which were the percentage of high-frequency keyphrase com-
a result of wrong tokenization. We excluded all pound nouns was higher than that of simple
reviews with fewer than five sentences. Short re- nouns. We therefore use two different thresh-
views are often low-quality and do not give good olds for absolute and relative frequency. We de-

280
fine F1 as the set of nouns that are in the top supporting sentences.
kn = 2.5% for both absolute and relative fre-
quencies; and F2 as the set of compounds that are 5 Comparative Evaluation with Amazon
in the top kp = 5% for both absolute and rela- Mechanical Turk
tive frequencies. These thresholds are set to ob- One standard way to evaluate summarization sys-
tain a high density of good keyphrases with few tems is to create hand-edited summaries and to
false positives. Below the threshold there are still compute some measure of similarity (e.g., word
other good keyphrases, but they cannot be sepa- or n-gram overlap) between automatic and human
rated easily from non-keyphrases. summaries. An alternative for extractive sum-
Sentences are scored according to Equation 2. maries is to classify all sentences in the document
Recall that the parameters wf2 , wi2 , wf1 , and wi1 with respect to their appropriateness as summary
are determined using logistic regression. The ob- sentences. An automatic summary can then be
tained parameter values (see table 2) indicate the scored based on its ability to correctly identify
relative importance of the four different types of good summary sentences. Both of these meth-
terms. Compounds are the most important term ods require a large annotation effort and are most
and even those with a frequency below the thresh- likely too complex to be outsourced to a crowd-
old kp still provide more detailed information than sourcing service because the creation of manual
simple nouns above the threshold kn ; the value of summaries requires skilled writers. For the sec-
wi2 is approximately twice the value wf1 for this ond type of evaluation, ranking sentences accord-
reason. Non-keyphrase nouns are least important ing to a criterion is a lot more time consuming
and are weighted with only a very small value of than making a binary decision so ranking the
wi1 = 0.01. 13 or 14 sentences that a review contains on av-
erage for the entire test set would be a signifi-
Phrase Par Value cant annotation effort. It would also be difficult
keyphrase compounds w f2 1.07 to obtain consistent and repeatable annotation in
non-keyphrase compounds wi2 0.89 crowdsourcing on this task due to its subtlety.
keyphrase nouns w f1 0.46 We therefore designed a novel evaluation
non-keyphrase nouns wi1 0.01 methodology in this paper that has a much smaller
startup cost. It is well known that relative judg-
Table 2: Weight settings ments are easier to make on difficult tasks than ab-
solute judgments. For example, much recent work
The scoring function with these parameter val- on relevance ranking in information retrieval re-
ues is applied to the n = 5 selected sentences of lies on relative relevance judgments (one docu-
the review. The highest scoring sentence is then ment is more relevant than another) rather than ab-
selected as the supporting sentence proposed by solute relevance judgments. We adopt this gen-
our system. eral idea and only request such relative judgments
For 1380 of the 1744 reviews, the sentence se- on supporting sentences from annotators. Unlike
lected by our system is different from the baseline a complete ranking of the sentences (which would
sentence; however, there are 364 cases (20.9%) require m(m 1)/2 judgments where m is the
where the two are the same. Only the 1380 cases length of the review), we choose a setup where
where the two methods differ are included in the we need to only elicit a single relative judgment
crowdsourcing evaluation to be described in the per review, one relative judgment on a sentence
next section. As we will show below, our sys- pair (consisting of the baseline sentence and the
tem selects better supporting sentences than the system sentence) for each of the 1380 reviews se-
baseline in most cases. So if baseline and our sys- lected in the previous section. This is a manage-
tem agree, then it is even more likely that the sen- able annotation task that can be run on a crowd-
tence selected by both is a good supporting sen- sourcing service in a short time and at little cost.
tence. However, there could also be cases where We use Amazon Mechanical Turk (AMT) for
the n = 5 sentences selected by the sentiment this annotation task. The main advantage of AMT
classifier are all bad supporting sentences or cases is that cost per annotation task is very low, so that
where the document does not contain any good we can obtain large annotated datasets for an af-

281
[Screenshot of the Mechanical Turk HIT: the two candidate sentences (e.g., "This 5 meg camera meets all my requirements." and "Very good pictures, small bulk, long battery life."), the question "Which sentence gives the more convincing reason?", answer fields s1 and s2 into which the blue word of the chosen sentence is typed, a field for NOTCONV if neither sentence gives a convincing reason, and a Submit button.]
Figure 1: AMT interface for annotators

fordable price. The disadvantage is the level of is simply the number of times it was rated bet-
quality of the annotation which will be discussed ter than its competitor. The score can be 0, 1, 2
at the end of this section. or 3. HITs for which the worker chooses the op-
tion Neither sentence has a convincing reason
5.1 Task Design are ignored when computing sentence scores.
We created a HIT (Human Intelligence Task) The sentence with the higher score is then se-
template including detailed annotation guidelines. lected as the best supporting sentence for the cor-
Every HIT consists of a pair of sentences. One responding review.
sentence is the baseline sentence; the other sen- In cases of ties, we posted the sentence pair one
tence is the system sentence, i.e., the sentence se- more time for one worker. If one of the two sen-
lected by the scoring function. The two sentences more time for one worker. If one of the two sen-
are presented in random order to avoid bias. tences has a higher score after this reposting, we
The workers are then asked to evaluate the rel- sentence pair no decision or N-D.
ative quality of the sentences by selecting one of
the following three options: 5.2 Quality of AMT Annotations
Since our crowdsourcing based evaluation is
1. Sentence 1 has the more convincing reason novel, it is important to investigate if human an-
notators perform the annotation consistently and
2. Sentence 2 has the more convincing reason
reproducibly.
3. Neither sentence has a convincing reason The Fleiss agreement score for the final
experiment is 0.17. AMT workers only have
If both sentences contain reasons, the worker the instructions given by the requesters. If they
has to compare the two reasons and choose the are not clear enough or too complicated, work-
sentence with the more convincing reason. ers can misunderstand the task, which decreases
Each HIT was posted to three different workers the quality of the answers. There are also AMT
to make it possible to assess annotator agreement. workers who spam and give random answers to
Every worker can process each HIT only once tasks. Moreover, ranking sentences according to
so that the three assignments are always done by the quality of the given reason is a subjective task.
three different people. Even if the sentence contains a reason, it might
Based on the worker annotations, we compute a not be convincing for the worker.
gold standard score for each sentence. This score To ensure a high level of quality for our dataset,

282
Experiment # Docs BL SY N-D B=S
1 AMT, first pass 1380 27.4 57.9 14.7 -
2 AMT, second pass 203 46.8 45.8 7.4 -
3 AMT final 1380 34.3 64.6 1.1 -
4 AMT+[B=S] 1744 27.1 51.1 0.9 20.9

Table 3: AMT evaluation results. Numbers are percentages or counts. BL = baseline, SY = system, N-D = no
decision, B=S = same sentence selected by baseline and system
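The way the three relative judgments per sentence pair are turned into the counts reported in Table 3 can be sketched as follows. This is an illustrative reconstruction of the procedure described in Section 5, not the authors' code; the vote labels and the tie-breaking argument are assumptions.

```python
from collections import Counter

def decide_pair(judgments, tie_breaker=None):
    """judgments: the three worker answers for one review, each 'BL' (baseline
    sentence better), 'SY' (system sentence better) or 'NOTCONV' (neither is
    convincing).  NOTCONV answers are ignored when scoring; a tie is resolved
    by the single extra judgment collected when the pair is reposted."""
    votes = Counter(j for j in judgments if j != "NOTCONV")
    bl, sy = votes["BL"], votes["SY"]
    if bl == sy:
        if tie_breaker in ("BL", "SY"):
            return tie_breaker
        return "N-D"                      # still tied: no decision
    return "BL" if bl > sy else "SY"

# Two workers prefer the system sentence, one finds neither convincing.
assert decide_pair(["SY", "NOTCONV", "SY"]) == "SY"
# One vote each way and no tie-breaking repost: no decision.
assert decide_pair(["NOTCONV", "BL", "SY"]) == "N-D"
```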

we took some precautions. To force workers to baseline system, 46% the system sentence; 7.4%
actually read the sentences and not just click a of the responses were undecided (line 2). Line 3
few boxes, we randomly marked one word of each presents the consolidated results where the 14.7%
sentence blue. The worker had to type the word ties on line 1 are replaced by the ratings obtained
of their preferred sentence into the corresponding on line 2 in the second pass.
answer field or NOTCONV into the special field if The consolidated results (line 3) show that our
neither sentence was convincing. Figure 1 shows system is clearly superior to the baseline of se-
our AMT interface design. lecting the sentence with the strongest sentiment.
For each answer field we have a gold stan- Our system selected a better supporting sentence
dard (the words we marked blue and the word for 64.6% of the reviews; the baseline selected a
NOTCONV) which enables us to look for spam. better sentence for 34.3% of the reviews. These
The analysis showed that some workers mistyped results exclude the reviews where baseline and
some words, which however only indicates that system selected the same sentence. If we as-
the worker actually typed the word instead of sume that these sentences are also acceptable sen-
copying it from the task. Some workers submit- tences (since they score well on the traditional
ted inconsistent answers, for instance, they typed sentiment metrics as well as on our new con-
a random word or filled out all three answer fields. tent keyword metric), then our system finds a
In such cases we reposted this HIT again to re- good supporting sentence for 72.0% of reviews
ceive a correct answer. (51.1+20.9) whereas the baseline does so for only
After the task, we counted how often a worker 48.0% (27.1+20.9).
said that neither sentence is convincing since a
6.1 Error Analysis
high number indicates that the worker might have
only copied the word for several sentence pairs Our error analysis revealed that a significant pro-
without checking the content of the sentences. We portion of system sentences that were worse than
also analyzed the time a worker needed for every baseline sentences did contain a reason. How-
HIT. Since no task was done in less than 10 sec- ever, the baseline sentence also contained a reason
onds, the possibility of just copying the word was and was rated better by AMT annotators. Exam-
rather low. ples (1) and (2) show two such cases. The first
sentence is the baseline sentence (BL) which was
6 Results and discussion rated better. The system sentence (SY) contains
a similar or different reason. Since rating reasons
The results of the AMT experiment are shown in is a very subjective task, it is impossible to de-
table 3. As described above, each of the 1380 fine which of these two sentences contains the bet-
sentence pairs was evaluated by three workers. ter reason and depends on how the workers think
Workers rated the system sentence as better for about it.
57.9% of the reviews, and the baselines sentence
as better for 27.4% of the reviews; for 14.7% of (1) BL:The best thing is that everything is just so
reviews, the scores of the two sentences were tied easily displayed and one doesnt need a
(line 1 of Table 3). The 203 reviews in this cate- manual to start getting the work done.
gory were reposted one more time (as described in SY: The zoom is incredible, the video was so
Section 5). The responses were almost perfectly clear that I actually thought of making a
evenly split: about 47% of workers preferred the 15 min movie.

283
(2) BL:The colors are horrible, indoor shots are Finally, there are a number of cases where our
horrible, and too much noise. assumption that good supporting sentences con-
SY: Who cares about 8 mega pixels and 1600 tain keyphrases is incorrect. For example, sen-
iso when it takes such bad quality pic- tence (6) does not contain any keyphrases indica-
tures. tive of good reasons. The information that makes
it a good supporting sentence is mainly expressed
In example (3) the system sentence is an in- using verbs and particles.
complete sentence consisting of only two noun
(6) I have had an occasional problem with
phrases. These cut-off sentences are mainly
the camera not booting up and telling me
caused by incorrect usage of grammar and punc-
to turn it off and then on again.
tuation by the reviewers which results in wrongly
determined sentence boundaries in the prepro-
7 Conclusion
cessing step.
In this work, we presented a system that ex-
(3) BL:Gives peace of mind to have it fit per- tracts supporting sentences, single-sentence sum-
fectly. maries of a document that contain a convincing
SY: battery and SD card. reason for the authors opinion about a product.
We used an unsupervised approach that extracts
In some cases, the two sentences that were pre- keyphrases of the given domain and then weights
sented to the worker in the evaluation had a dif- these keyphrases to identify supporting sentences.
ferent polarity. This can have two reasons: (i) due We used a novel comparative evaluation method-
to noisy training input, the classifier misclassified ology with the crowdsourcing framework Ama-
some of the sentences, and (ii) for short reviews zon Mechanical Turk to evaluate this novel task
we also used sentences with the non-conforming since no gold standard is available. We showed
polarity. Sentences with different polarity often that our keyphrase-based system performs better
confused the workers and they tended to prefer than a baseline of extracting the sentence with the
the positive sentence even if the negative one con- highest sentiment score.
tained a more convincing reason as can be seen in
example (4). 8 Future work
(4) BL:It shares same basic commands and Our method failed for some of the about 35% of
setup, so the learning curve was minimal. reviews where it did not find a convincing reason
because of the noisiness of reviews. Reviews are
SY: I was not blown away by the image qual-
user-generated content and contain grammatically
ity, and as others have mentioned, the
incorrect sentences and are full of typographical
flash really is weak.
errors. This problem makes it hard to perform pre-
A general problem with our approach is that the processing steps like part-of-speech tagging and
weighting function favors sentences with many sentence boundary detection correctly and reli-
noun phrases. The system sentence in example ably. We plan to address these problems in fu-
(5) contains many noun phrases, including some ture work by developing a more robust processing
highly frequent nouns (e.g., lens, battery), pipeline.
but there is no convincing reason and the baseline
sentence has been selected by the workers. Acknowledgments
This work was supported by Deutsche
(5) BL:I have owned my cd300 for about 3 weeks
Forschungsgemeinschaft (Sonderforschungs-
and have already taken 700 plus pictures.
bereich 732, Project D7) and in part by the
SY: It has something to do with the lens be- IST Programme of the European Community,
cause the manual says it only happens to under the PASCAL2 Network of Excellence,
the 300 and when I called Sony tech sup- IST-2007-216886. This publication only reflects
port the guy tried to tell me the battery the authors views.
was faulty and it wasnt.

284
References International Conference on New Methods in Lan-
guage Processing, Manchester, UK.
Shilpa Arora, Mahesh Joshi, and Carolyn P. Rose.
Li Zhuang, Feng Jing, and Xiao-Yan Zhu. 2006.
2009. Identifying types of claims in online cus-
Movie review mining and summarization. In Pro-
tomer reviews. In Proceedings of Human Lan-
ceedings of the 15th ACM international conference
guage Technologies: The 2009 Annual Conference
on Information and knowledge management, CIKM
of the North American Chapter of the Association
06, pages 4350, New York, NY, USA. ACM.
for Computational Linguistics, Companion Volume:
Short Papers, NAACL-Short 09, pages 3740,
Stroudsburg, PA, USA. Association for Computa-
tional Linguistics.
Philip Beineke, Trevor Hastie, Christopher Manning,
and Shivakumar Vaithyanathan. 2004. Exploring
sentiment summarization. In Proceedings of the
AAAI Spring Symposium on Exploring Attitude and
Affect in Text: Theories and Applications. AAAI
Press. AAAI technical report SS-04-07.
Gabor Berend. 2011. Opinion expression mining by
exploiting keyphrase extraction. In Proceedings of
5th International Joint Conference on Natural Lan-
guage Processing, pages 11621170, Chiang Mai,
Thailand, November. Asian Federation of Natural
Language Processing.
Minqing Hu and Bing Liu. 2004. Mining and sum-
marizing customer reviews. In Proceedings of the
Tenth ACM SIGKDD international conference on
Knowledge discovery and data mining, KDD 04,
pages 168177, New York, NY, USA. ACM.
Nitin Jindal and Bing Liu. 2008. Opinion spam
and analysis. In WSDM 08: Proceedings of the
international conference on Web search and web
data mining, pages 219230, New York, NY, USA.
ACM.
Soo-Min Kim and Eduard Hovy. 2006. Automatic
identification of pro and con reasons in online re-
views. In Proceedings of the COLING/ACL on
Main conference poster sessions, COLING-ACL
06, pages 483490, Stroudsburg, PA, USA. Asso-
ciation for Computational Linguistics.
Bing Liu. 2010. Sentiment analysis and subjectivity.
Handbook of Natural Language Processing, 2nd ed.
Christopher Manning and Dan Klein. 2003. Opti-
mization, maxent models, and conditional estima-
tion without magic. In Proceedings of the 2003
Conference of the North American Chapter of the
Association for Computational Linguistics on Hu-
man Language Technology: Tutorials - Volume 5,
NAACL-Tutorials 03, pages 88, Stroudsburg, PA,
USA. Association for Computational Linguistics.
Ani Nenkova and Kathleen McKeown. 2011. Auto-
matic summarization. Foundations and Trends in
Information Retrieval, 5(2-3):103233.
Bo Pang and Lillian Lee. 2008. Opinion mining and
sentiment analysis. Foundations and Trends in In-
formation Retrieval, 2(1-2):1135.
Helmut Schmid. 1994. Probabilistic part-of-speech
tagging using decision trees. In Proceedings of the

285
Bootstrapped Training of Event Extraction Classifiers

Ruihong Huang and Ellen Riloff
School of Computing
University of Utah
Salt Lake City, UT 84112
{huangrh,riloff}@cs.utah.edu

Abstract 2002; Maslennikov and Chua, 2007)). How-


ever, manually generating answer keys for event
Most event extraction systems are trained extraction is time-consuming and tedious. And
with supervised learning and rely on a col- more importantly, event extraction annotations
lection of annotated documents. Due to
are highly domain-specific, so new annotations
the domain-specificity of this task, event
extraction systems must be retrained with
must be obtained for each domain.
new annotated data for each domain. In The goal of our research is to use bootstrap-
this paper, we propose a bootstrapping so- ping techniques to automatically train a state-of-
lution for event role filler extraction that re- the-art event extraction system without human-
quires minimal human supervision. We aim
generated answer key templates. The focus of our
to rapidly train a state-of-the-art event ex-
traction system using a small set of seed work is the TIER event extraction model, which
nouns for each event role, a collection is a multi-layered architecture for event extrac-
of relevant (in-domain) and irrelevant (out- tion (Huang and Riloff, 2011). TIERs innova-
of-domain) texts, and a semantic dictio- tion over previous techniques is the use of four
nary. The experimental results show that different classifiers that analyze a document at in-
the bootstrapped system outperforms previ- creasing levels of granularity. TIER progressively
ous weakly supervised event extraction sys- zooms in on event information using a pipeline
tems on the MUC-4 data set, and achieves
of classifiers that perform document-level classi-
performance levels comparable to super-
vised training with 700 manually annotated fication, sentence classification, and noun phrase
documents. classification. TIER outperformed previous event
extraction systems on the MUC-4 data set, but re-
lied heavily on a large collection of 1,300 docu-
1 Introduction ments coupled with answer key templates to train
Event extraction systems process stories about its four classifiers.
domain-relevant events and identify the role fillers In this paper, we present a bootstrapping solu-
of each event. A key challenge for event extrac- tion that exploits a large unannotated corpus for
tion is that recognizing role fillers is inherently training by using role-identifying nouns (Phillips
contextual. For example, a PERSON can be a and Riloff, 2007) as seed terms. Phillips and
perpetrator or a victim in different contexts (e.g., Riloff observed that some nouns, by definition,
John Smith assassinated the mayor vs. John refer to entities or objects that play a specific role
Smith was assassinated). Similarly, any COM - in an event. For example, assassin, sniper,
PANY can be an acquirer or an acquiree depending and hitman refer to people who play the role
on the context. of PERPETRATOR in a criminal event. Similarly,
Many supervised learning techniques have victim, casualty, and fatality refer to peo-
been used to create event extraction systems us- ple who play the role of VICTIM, by virtue of
ing gold standard answer key event templates their lexical semantics. Phillips and Riloff called
for training (e.g., (Freitag, 1998a; Chieu and Ng, these words role-identifying nouns and used them

286
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 286-295,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
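The weakly supervised setup sketched in this introduction (seed role-identifying nouns that label noun phrases in relevant documents, classifiers trained on those automatically labeled instances, and a self-training loop that adds confident predictions back to the training pool) can be outlined in a few lines. The sketch below is a generic bootstrapping skeleton under those assumptions, not the authors' TIER implementation; all function and attribute names are placeholders.

```python
def bootstrap_role_classifier(seed_nouns, relevant_docs, make_classifier,
                              extract_np_instances, rounds=5, threshold=0.9):
    """Generic self-training loop: start from noun phrases whose head noun is a
    seed role-identifying noun, train a role classifier, then add the most
    confident new predictions to the training pool and retrain."""
    labeled = [(np, role)
               for doc in relevant_docs
               for np, role in extract_np_instances(doc, seed_nouns)]
    classifier = None
    for _ in range(rounds):
        classifier = make_classifier()
        classifier.fit([np for np, _ in labeled], [role for _, role in labeled])
        new = []
        for doc in relevant_docs:
            for np in doc.noun_phrases:
                role, confidence = classifier.predict_with_confidence(np)
                if confidence >= threshold and (np, role) not in labeled:
                    new.append((np, role))
        if not new:          # nothing confident left to add: stop early
            break
        labeled.extend(new)
    return classifier
```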
to learn extraction patterns. Our research also and Riloff, 2009)). Other systems take a more
uses role-identifying nouns to learn extraction pat- global view and consider discourse properties of
terns, but the role-identifying nouns and patterns the document as a whole to improve performance
are then used to create training data for event ex- (e.g., (Maslennikov and Chua, 2007; Ji and Gr-
traction classifiers. Each classifier is then self- ishman, 2008; Liao and Grishman, 2010; Huang
trained in a bootstrapping loop. and Riloff, 2011)). Currently, the learning-based
Our weakly supervised training procedure re- event extraction systems that perform best all use
quires a small set of seed nouns for each event supervised learning techniques that require a large
role, and a collection of relevant (in-domain) and number of texts coupled with manually-generated
irrelevant (out-of-domain) texts. No answer key annotations or answer key templates.
templates or annotated texts are needed. The seed A variety of techniques have been explored
nouns are used to automatically generate a set for weakly supervised training of event extrac-
of role-identifying patterns, and then the nouns, tion systems, primarily in the realm of pattern or
patterns, and a semantic dictionary are used to rule-based approaches (e.g., (Riloff, 1996; Riloff
label training instances. We also propagate the and Jones, 1999; Yangarber et al., 2000; Sudo et
event role labels across coreferent noun phrases al., 2003; Stevenson and Greenwood, 2005)). In
within a document to produce additional train- some of these approaches, a human must man-
ing instances. The automatically labeled texts are ually review and clean the learned patterns to
used to train three components of TIER: its two obtain good performance. Research has also been
types of sentence classifiers and its noun phrase done to learn extraction patterns in an unsuper-
classifiers. To create TIERs fourth component, vised way (e.g., (Shinyama and Sekine, 2006;
its document genre classifier, we apply heuristics Sekine, 2006)). But these efforts target open do-
to the output of the sentence classifiers. main information extraction. To extract domain-
We present experimental results on the MUC- specific event information, domain experts are
4 data set, which is a standard benchmark for needed to select the pattern subsets to use.
event extraction research. Our results show that There have also been weakly supervised ap-
the bootstrapped system, TIERlite , outperforms proaches that use more than just local context.
previous weakly supervised event extraction sys- (Patwardhan and Riloff, 2007) uses a semantic
tems and achieves performance levels comparable affinity measure to learn primary and secondary
to supervised training with 700 manually anno- patterns, and the secondary patterns are applied
tated documents. only to event sentences. The event sentence clas-
sifier is self-trained using seed patterns. Most
2 Related Work recently, (Chambers and Jurafsky, 2011) acquire
Event extraction techniques have largely focused event words from an external resource, group the
on detecting event triggers with their arguments event words to form event scenarios, and group
for extracting role fillers. Classical methods are extraction patterns for different event roles. How-
either pattern-based (Kim and Moldovan, 1993; ever, these weakly supervised systems produce
Riloff, 1993; Soderland et al., 1995; Huffman, substantially lower performance than the best su-
1996; Freitag, 1998b; Ciravegna, 2001; Califf and pervised systems.
Mooney, 2003; Riloff, 1996; Riloff and Jones,
3 Overview of TIER
1999; Yangarber et al., 2000; Sudo et al., 2003;
Stevenson and Greenwood, 2005) or classifier- The goal of our research is to develop a weakly
based (e.g., (Freitag, 1998a; Chieu and Ng, 2002; supervised training process that can successfully
Finn and Kushmerick, 2004; Li et al., 2005; Yu et train a state-of-the-art event extraction system for
al., 2005)). a new domain with minimal human input. We de-
Recently, several approaches have been pro- cided to focus our efforts on the TIER event ex-
posed to address the insufficiency of using only traction model because it recently produced bet-
local context to identify role fillers. Some ap- ter performance on the MUC-4 data set than prior
proaches look at the broader sentential context learning-based event extraction systems (Huang
around a potential role filler when making a de- and Riloff, 2011). In this section, we briefly give
cision (e.g., (Gu and Cercone, 2006; Patwardhan an overview of TIERs architecture and its com-

287
and time-consuming. Furthermore, answer key
templates for one domain are virtually never
reusable for different domains, so a new set of
answer keys must be produced from scratch for
each domain. In the next section, we present our
weakly supervised approach for training TIERs
Figure 1: TIER Overview event extraction classifiers.

ponents. 4 Bootstrapped Training of Event


TIER is a multi-layered architecture for event Extraction Classifiers
extraction, as shown in Figure 1. Documents pass
We adopt a two-phase approach to train TIERs
through a pipeline where they are analyzed at dif-
event extraction modules using minimal human-
ferent levels of granularity, which enables the sys-
generated resources. The goal of the first phase
tem to gradually zoom in on relevant facts. The
is to automatically generate positive training ex-
pipeline consists of a document genre classifier,
amples using role-identifying seed nouns as input.
two types of sentence classifiers, and a set of noun
phrase (role filler) classifiers. The seed nouns are used to automatically gener-
ate a set of role-identifying patterns for each event
The lower pathway in Figure 1 shows that all
role. Each set of patterns is then assigned a set
documents pass through an event sentence clas-
of semantic constraints (selectional restrictions)
sifier. Sentences labeled as event descriptions
that are appropriate for that event role. The se-
then proceed to the noun phrase classifiers, which
mantic constraints consist of the role-identifying
are responsible for identifying the role fillers in
seed nouns as well as general semantic classes
each sentence. The upper pathway in Figure 1 in-
that constrain the event role (e.g., a victim must
volves a document genre classifier to determine
be a HUMAN). A noun phrase will satisfy the se-
whether a document is an event narrative story
mantic constraints if its head noun is in the seed
(i.e., an article that primarily discusses the details
noun list or if it has the appropriate semantic type
of a domain-relevant event). Documents that are
(based on dictionary lookup). Each pattern is then
classified as event narratives warrant additional
matched against the unannotated texts, and if the
scrutiny because they most likely contain a lot of
extracted noun phrase satisfies its semantic con-
event information. Event narrative stories are pro-
straints, then the noun phrase is automatically la-
cessed by an additional set of role-specific sen-
beled as a role filler.
tence classifiers that look for role-specific con-
texts that will not necessarily mention the event. The second phase involves bootstrapped train-
For example, a victim may be mentioned in a sen- ing of TIERs classifiers. Using the labeled in-
tence that describes the aftermath of a crime, such stances generated in the first phase, we iteratively
as transportation to a hospital or the identifica- train three of TIERs components: the two types
tion of a body. Sentences that are determined to of sentential classifiers and the noun phrase clas-
have role-specific contexts are passed along to sifiers. For the fourth component, the document
the noun phrase classifiers for role filler extrac- classifier, we apply heuristics to the output of the
tion. Consequently, event narrative documents sentence classifiers to assess the density of rel-
pass through both the lower pathway and the up- evant sentences in a document and label high-
per pathway. This approach creates an event ex- density stories as event narratives. In the fol-
traction system that can discover role fillers in a lowing sections, we present the details of each of
variety of different contexts by considering the these steps.
type of document being processed.
4.1 Automatically Labeling Training Data
TIER was originally trained with supervised
learning using 1,300 texts and their corresponding Finding seeding instances of high precision and
answer key templates from the MUC-4 data set reasonable coverage is important in bootstrap-
(MUC-4 Proceedings, 1992). Human-generated ping. However, this is especially challenging
answer key templates are expensive to produce for event extraction task because identifying role
because the annotation process is both difficult fillers is inherently contextual. Furthermore, role

288
patterns automatically generated from unanno-
tated texts to assess the similarity of nouns. First,
Basilisk assigns a score to each pattern based on
the number of seed words that co-occur with it.
Basilisk then collects the noun phrases extracted
by the highest-scoring patterns. Next, the head
noun of each noun phrase is assigned a score
Figure 2: Using Basilisk to Induce Role-Identifying based on the set of patterns that it co-occurred
Patterns with. Finally, Basilisk selects the highest-scoring
nouns, automatically labels them with the seman-
fillers occur sparsely in text and in diverse con- tic class of the seeds, adds these nouns to the lex-
texts. icon, and restarts the learning process in a boot-
In this section, we explain how we gener- strapping fashion.
ate role-identifying patterns automatically using For our work, we give Basilisk role-identifying
seed nouns, and we discuss why we add seman- seed nouns for each event role. We run the boot-
tic constraints to the patterns when producing la- strapping process for 20 iterations and then har-
beled instances for training. Then, we discuss the vest the 40 best patterns that Basilisk identifies
coreference-based label propagation that we used for each event role. We also tried using the addi-
to obtain additional training instances. Finally, we tional role-identifying nouns learned by Basilisk,
give examples to illustrate how we create training but found that these nouns were too noisy.
instances.
4.1.2 Using the Patterns to Label NPs
4.1.1 Inducing Role-Identifying Patterns The induced role-identifying patterns can be
The input to our system is a small set of matched against the unannotated texts to produce
manually-defined seed nouns for each event role. labeled instances. However, relying solely on the
Specifically, the user is required to provide pattern contexts can be misleading. For example,
10 role-identifying nouns for each event role. the pattern context <subject> caused damage
(Phillips and Riloff, 2007) defined a noun as be- will extract some noun phrases that are weapons
ing role-identifying if its lexical semantics re- (e.g., the bomb) but some noun phrases that are
veal the role of the entity/object in an event. For not (e.g., the tsunami).
example, the words assassin and sniper are Based on this observation, we add selectional
people who participate in a violent event as a PER - restrictions to each pattern that requires a noun
PETRATOR . Therefore, the entities referred to by phrase to satisfy certain semantic constraints in
role-identifying nouns are probable role fillers. order to be extracted and labeled as a positive
However, treating every context surrounding a instances for an event role. The selectional re-
role-identifying noun as a role-identifying pattern strictions are satisfied if the head noun is among
is risky. The reason is that many instances of role- the role-identifying seed nouns or if the semantic
identifying nouns appear in contexts that do not class of the head noun is compatible with the cor-
describe the event. But, if one pattern has been responding event role. In the previous example,
seen to extract many role-identifying nouns and tsunami will not be extracted as a weapon because
seldomly seen to extract other nouns, then the pat- it has an incompatible semantic class (EVENT),
tern likely represents an event context. but bomb will be extracted because it has a com-
As (Phillips and Riloff, 2007) did, we use patible semantic class (WEAPON).
Basilisk to learn patterns for each event role. We use the semantic class labels assigned by
Basilisk was originally designed for semantic the Sundance parser (Riloff and Phillips, 2004) in
class learning (e.g., to learn nouns belonging to our experiments. Sundance looks up each noun
semantic categories, such as building or human). in a semantic dictionary to assign the semantic
As shown in Figure 2, beginning with a small set class labels. As an alternative, general resources
of seed nouns for each semantic class, Basilisk (e.g., WordNet (Miller, 1990)) or a semantic tag-
learns additional nouns belonging to the same se- ger (e.g., (Huang and Riloff, 2010)) could be
mantic class. Internally, Basilisk uses extraction used.

289
propagate the perpetrator label from noun phrase
men = Human terrorists was killed by <np> #1 to noun phrase #3.
assassins <subject> attacked
building = Object snipers <subject> fired shots
... ... ...
4.2 Creating TIERlite with Bootstrapping
Semantic RoleIdentifying RoleIdentifying In this section, we explain how the labeled in-
Dictionary Noun Patterns stances are used to train TIERs classifiers with
Constraints Constraints
bootstrapping. In addition to the automatically
labeled instances, the training process depends
John Smith was killed by two armed men on a text corpus that consists of both relevant
1
in broad daylight this morning. (in-domain) and irrelevant (out-of-domain) doc-
The assassins
2
attacked the mayor as he uments. Positive instances are generated from
left his house to go to work about 8:00 am.
Police arrested the unidentified men
the relevant documents and negative instances are
3
an hour later. generated by randomly sampling from the irrele-
vant documents.
Figure 3: Automatic Training Data Creation The classifiers are all support vector machines
(SVMs), implemented using the SVMlin software
(Keerthi and DeCoste, 2005). When applying the
4.1.3 Propagating Labels with Coreference classifiers during bootstrapping, we use a sliding
To enrich the automatically labeled training in- confidence threshold to determine which labels
stances, we also propagate the event role labels are reliable based on the values produced by the
across coreferent noun phrases within a docu- SVM. Initially, we set the threshold to be 2.0 to
ment. The observation is that once a noun phrase identify highly confident predictions. But if fewer
has been identified as a role filler, its corefer- than k instances pass the threshold, then we slide
ent mentions in the same document likely fill the the threshold down in decrements of 0.1 until we
same event role since they are referring to the obtain at least k labeled instances or the thresh-
same real world entity. old drops below 0, in which case bootstrapping
To leverage these coreferential contexts, we ends. We used k=10 for both sentence classifiers
employ a simple head noun matching heuristic to and k=30 for the noun phrase classifiers.
identify coreferent noun phrases. This heuristic The following sections present the details of the
assumes that two noun phrases that have the same bootstrapped training process for each of TIERs
head noun are coreferential. We considered us- components.
ing an off-the-shelf coreference resolver, but de-
cided that the head noun matching heuristic would
likely produce higher precision results, which is
important to produce high-quality labeled data.

4.1.4 Examples of Training Instance


Creation
Figure 3 illustrates how we label training in-
stances automatically. The text example shows
three noun phrases that are automatically labeled
Figure 4: The Bootstrapping Process
as perpetrators. Noun phrases #1 and #2 oc-
cur in role-identifying pattern contexts (was killed
by <np> and <subject> attacked) and satisfy 4.2.1 Noun Phrase Classifiers
the semantic constraints for perpetrators because The mission of the noun phrase classifiers is to
men has a compatible semantic type and assas- determine whether a noun phrase is a plausible
sins is a role-identifying noun for perpetrators. event role filler based on the local features sur-
Noun phrase #3 (the unidentified men) does rounding the noun phrase (NP). A set of classifiers
not occur in a pattern context, but it is deemed is needed, one for each event role.
to be coreferent with two armed men because As shown in Figure 4, to seed the classifier
they have the same head noun. Consequently, we training, the positive noun phrase instances are

290
generated from the relevant documents follow- to maintain the negative:positive ratio of 10:1.
ing Section 4.1. The negative noun phrase in- The bootstrapping process and feature set are the
stances are drawn randomly from the irrelevant same as for the event sentence classifier.
documents. Considering the sparsity of role fillers The difference between the two types of sen-
in texts, we set the negative:positive ratio to be tence classifiers is that the event sentence classi-
10:1. Once the classifier is trained, it is applied to fier uses positive instances from all event roles,
the unlabeled noun phrases in the relevant docu- while each role-specific sentence classifiers only
ments. Noun phrases that are assigned role filler uses the positive instances for one particular event
labels by the classifier with high confidence (us- role. The rationale is similar as in the super-
ing the sliding threshold) are added to the set of vised setting (Huang and Riloff, 2011); the event
positive instances. New negative instances are sentence classifier is expected to generalize over
drawn randomly from the irrelevant documents to all event roles to identify event mention contexts,
maintain the 10:1 (negative:positive) ratio. while the role-specific sentence classifiers are ex-
We extract features from each noun phrase pected to learn to identify contexts specific to in-
(NP) and its surrounding context. The features dividual roles.
include the NP head noun and its premodifiers.
4.2.4 Event Narrative Document Classifier
We also use the Stanford NER tagger (Finkel et
al., 2005) to identify Named Entities within the TIER also uses an event narrative document
NP. The context features include four words to the classifier and only extracts information from role-
left of the NP, four words to the right of the NP, specific sentences within event narrative docu-
and the lexico-syntactic patterns generated by Au- ments. In the supervised setting, TIER uses
toSlog to capture expressions around the NP (see heuristic rules derived from answer key templates
(Riloff, 1993) for details). to identify the event narrative documents in the
training set, which are used to train an event nar-
4.2.2 Event Sentence Classifier rative document classifier. The heuristic rules re-
The event sentence classifier is responsible quire that an event narrative should have a high
for identifying sentences that describe a relevant density of relevant information and tend to men-
event. Similar to the noun phrase classifier train- tion the relevant information within the first sev-
ing, positive training instances are selected from eral sentences.
the relevant documents and negative instances are In our weakly supervised setting, we use the
drawn from the irrelevant documents. All sen- information density heuristic directly instead of
tences in the relevant documents that contain one training an event narrative classifier. We approxi-
or more labeled noun phrases (belonging to any mate the relevant information density heuristic by
event role) are labeled as positive training in- computing the ratio of relevant sentences (both
stances. We randomly sample sentences from the event sentences and role-specific sentences) out of
irrelevant documents to obtain a negative:positive all the sentences in a document. Thus, the event
training instance ratio of 10:1. The bootstrapping narrative labeller only relies on the output of the
process is then identical to that of the noun phrase two sentence classifiers. Specifically, we label a
classifiers. The feature set for this classifier con- document as an event narrative if 50% of the
sists of unigrams, bigrams and AutoSlogs lexico- sentences in the document are relevant (i.e., la-
syntactic patterns surrounding all noun phrases in beled positively by either sentence classifier).
the sentence.
5 Evaluation
4.2.3 Role-Specific Sentence Classifiers In this section, we evaluate our bootstrapped sys-
The role-specific sentence classifiers are tem, TIERlite , on the MUC-4 event extraction
trained to identify the contexts specific to each data set. First, we describe the IE task, the data
event role. All sentences in the relevant doc- set, and the weakly supervised baseline systems
uments that contain at least one labeled noun that we use for comparison. Then we present the
phrase for the appropriate event role are used results of our fully bootstrapped system TIERlite ,
as positive instances. Negative instances are the weakly supervised baseline systems, and two
randomly sampled from the irrelevant documents fully supervised event extraction systems, TIER

291
and GLACIER. In addition, we analyze the per- manually selects the best patterns for each event
formance of TIERlite using different configura- role. During testing, the patterns are matched
tions to assess the impact of its components. against unseen texts to extract event role fillers.
PIPER (Patwardhan and Riloff, 2007; Patward-
5.1 IE Task and Data han, 2010) learns extraction patterns using a se-
We evaluated the performance of our systems on mantic affinity measure, and it distinguishes be-
the MUC-4 terrorism IE task (MUC-4 Proceed- tween primary and secondary patterns and ap-
ings, 1992) about Latin American terrorist events. plies them selectively. (Chambers and Jurafsky,
We used 1,300 texts (DEV) as our training set and 2011) (C+J) created an event extraction system
200 texts (TST3+TST4) as the test set. All the by acquiring event words from WordNet (Miller,
documents have answer key templates. For the 1990), clustering the event words into different
training set, we used the answer keys to separate event scenarios, and grouping extraction patterns
the documents into relevant and irrelevant sub- for different event roles.
sets. Any document containing at least one rel-
evant event was considered to be relevant. 5.3 Performance of TIERlite
Table 2 shows the seed nouns that we used in our
PerpInd PerpOrg Target Victim Weapon
129 74 126 201 58 experiments, which were generated by sorting the
nouns in the corpus by frequency and manually
Table 1: # of Role Fillers in the MUC-4 Test Set identifying the first 10 role-identifying nouns for
each event role.3 Table 3 shows the number of
Following previous studies, we evaluate our training instances (noun phrases) that were auto-
system on five MUC-4 string event roles: perpe- matically labeled for each event role using our
trator individuals (PerpInd), perpetrator organi- training data creation approach (Section 4.1).
zations (PerpOrg), physical targets, victims, and
weapons. Table 1 shows the distribution of role Event Role Seed Nouns
fillers in the MUC-4 test set. The complete IE task Perpetrator terrorists assassins criminals rebels
Individual murderers death squads guerrillas
involves the creation of answer key templates, one member members individuals
template per event1 . Our work focuses on extract- Perpetrator FMLN ELN FARC MRTA M-19 Front
ing individual role fillers and not template genera- Organization Shining Path Medellin Cartel
tion, so we evaluate the accuracy of the role fillers The Extraditables
Army of National Liberation
irrespective of which template they occur in.
Target houses residence building home homes
We used the same head noun scoring scheme offices pipeline hotel car vehicles
as previous systems, where an extraction is cor- Victim victims civilians children jesuits Galan
rect if its head noun matches the head noun in the priests students women peasants Romero
answer key2 . Pronouns were discarded from both Weapon weapons bomb bombs explosives rifles
dynamite grenades device car bomb
the system responses and the answer keys since
no coreference resolution is done. Duplicate ex- Table 2: Role-Identifying Seed Nouns
tractions were conflated before being scored, so
they count as just one hit or one miss.
PerpInd PerpOrg Target Victim Weapon
5.2 Weakly Supervised Baselines 296 157 522 798 248

We compared the performance of our system with Table 3: # of Automatically Labeled NPs
three previous weakly supervised event extraction
systems. Table 4 shows how our bootstrapped system
AutoSlog-TS (Riloff, 1996) generates lexico- TIERlite compares with previous weakly super-
syntactic patterns exhaustively from unannotated vised systems and two supervised systems, its su-
texts and ranks them based on their frequency and pervised counterpart TIER (Huang and Riloff,
probability of occurring in relevant documents. 2011) and a model that jointly considers local
A human expert then examines the patterns and and sentential contexts, G LACIER (Patwardhan
1 3
Documents may contain multiple events per article. We only found 9 weapon terms among the high-
2
For example, armed men will match 5 armed men. frequency terms.

292
Weakly Supervised Baselines
PerpInd PerpOrg Target Victim Weapon Average
AUTO S LOG -TS (1996) 33/49/40 52/33/41 54/59/56 49/54/51 38/44/41 45/48/46
P IPERBest (2007) 39/48/43 55/31/40 37/60/46 44/46/45 47/47/47 44/46/45
C+J (2011) - - - - - 44/36/40
Supervised Models
G LACIER (2009) 51/58/54 34/45/38 43/72/53 55/58/56 57/53/55 48/57/52
TIER (2011) 48/57/52 46/53/50 51/73/60 56/60/58 53/64/58 51/62/56
Weakly Supervised Models
TIERlite 47/51/49 60/39/47 37/65/47 39/53/45 53/55/54 47/53/50

Table 4: Performance of the Bootstrapped Event Extraction System (Precision/Recall/F-score)

60 5.4 Analysis
55
Table 6 shows the effect of the coreference prop-
agation step described in Section 4.1.3 as part of
IE performance(F1)

50
training data creation. Without this step, the per-
45 formance of the bootstrapped system yields an F
score of 41. With the benefit of the additional
40
training instances produced by coreference prop-
35 agation, the system yields an F score of 53. The
new instances produced by coreference propaga-
30
0 200 400 600 800 1000
# of training documents
1200 1400 tion seem to substantially enrich the diversity of
the set of labeled instances.
Figure 5: The Learning Curve of Supervised TIER Seeding P/R/F
wo/Coref 45/38/41
w/Coref 47/53/50

Table 6: Effects of Coreference Propagation


and Riloff, 2009). We see that TIERlite outper-
forms all three weakly supervised systems, with In the evaluation section, we saw that the su-
slightly higher precision and substantially more pervised event extraction systems achieve higher
recall. When compared to the supervised sys- recall than the weakly supervised systems. Al-
tems, the performance of TIERlite is similar to though our bootstrapped event extraction sys-
G LACIER, with comparable precision but slightly tem TIERlite produces higher recall than previ-
lower recall. But the supervised TIER system, ous weakly supervised systems, a substantial re-
which was trained with 1,300 annotated docu- call gap still exists.
ments, is still superior, especially in recall. Considering the pipeline structure of the event
extraction system, as shown in Figure 1, the noun
Figure 5 shows the learning curve for TIER phrase extractors are responsible for identifying
when it is trained with fewer documents, rang- all candidate role fillers. The sentential classifiers
ing from 100 to 1,300 in increments of 100. Each and the document classifier effectively serve as
data point represents five experiments where we filters to rule out candidates from irrelevant con-
randomly selected k documents from the train- texts. Consequently, there is no way to recover
ing set and averaged the results. The bars show missing recall (role fillers) if the noun phrase ex-
the range of results across the five runs. Figure 5 tractors fail to identify them.
shows that TIERs performance increases from an Since the noun phrase classifiers are so central
F score of 34 when trained on just 100 documents to the performance of the system, we compared
up to an F score of 56 when training on 1,300 doc- the performance of the bootstrapped noun phrase
uments. The circle shows the performance of our classifiers directly with their supervised conter-
bootstrapped system, TIERlite , which achieves an parts. The results are shown in Table 5. Both sets
F score comparable to supervised training with of classifiers produce low precision when used in
about 700 manually annotated documents. isolation, but their precision levels are compara-

293
PerpInd PerpOrg Target Victim Weapon Average
Supervised Classifier 25/67/36 26/78/39 34/83/49 32/72/45 30/75/43 30/75/42
Bootstrapped Classifier 30/54/39 37/53/44 30/71/42 28/63/39 36/57/44 32/60/42

Table 5: Evaluation of Bootstrapped Noun Phrase Classifiers (Precision/Recall/F-score)

ble. The TIER pipeline architecture is successful References


at eliminating many of the false hits. However, M.E. Califf and R. Mooney. 2003. Bottom-up Re-
the recall of the bootstrapped classifiers is consis- lational Learning of Pattern Matching rules for In-
tently lower than the recall of the supervised clas- formation Extraction. Journal of Machine Learning
sifiers. Specifically, the recall is about 10 points Research, 4:177210.
lower for three event roles (PerpInd, Target and Nathanael Chambers and Dan Jurafsky. 2011.
Victim) and 20 points lower for the other two event Template-Based Information Extraction without the
roles (PerpOrg and Weapon). These results sug- Templates. In Proceedings of the 49th Annual
Meeting of the Association for Computational Lin-
gest that our bootstrapping approach to training
guistics: Human Language Technologies (ACL-11).
instance creation does not fully capture the diver- H.L. Chieu and H.T. Ng. 2002. A Maximum Entropy
sity of role filler contexts that are available in the Approach to Information Extraction from Semi-
supervised training set of 1,300 documents. This Structured and Free Text. In Proceedings of the
issue is an interesting direction for future work. 18th National Conference on Artificial Intelligence.
F. Ciravegna. 2001. Adaptive Information Extraction
6 Conclusions from Text by Rule Induction and Generalisation. In
Proceedings of the 17th International Joint Confer-
We have presented a bootstrapping approach for ence on Artificial Intelligence.
training a multi-layered event extraction model J. Finkel, T. Grenager, and C. Manning. 2005. In-
using a small set of seed nouns for each event corporating Non-local Information into Information
role, a collection of relevant (in-domain) and ir- Extraction Systems by Gibbs Sampling. In Pro-
relevant (out-of-domain) texts and a semantic dic- ceedings of the 43rd Annual Meeting of the Associa-
tion for Computational Linguistics, pages 363370,
tionary. The experimental results show that the
Ann Arbor, MI, June.
bootstrapped system, TIERlite , outperforms pre- A. Finn and N. Kushmerick. 2004. Multi-level
vious weakly supervised event extraction sys- Boundary Classification for Information Extraction.
tems on a standard event extraction data set, and In In Proceedings of the 15th European Conference
achieves performance levels comparable to super- on Machine Learning, pages 111122, Pisa, Italy,
vised training with 700 manually annotated docu- September.
ments. The minimal supervision required to train Dayne Freitag. 1998a. Multistrategy Learning for
such a model increases the portability of event ex- Information Extraction. In Proceedings of the Fif-
teenth International Conference on Machine Learn-
traction systems.
ing. Morgan Kaufmann Publishers.
Dayne Freitag. 1998b. Toward General-Purpose
7 Acknowledgments
Learning for Information Extraction. In Proceed-
We gratefully acknowledge the support of the ings of the 36th Annual Meeting of the Association
National Science Foundation under grant IIS- for Computational Linguistics.
Z. Gu and N. Cercone. 2006. Segment-Based Hidden
1018314 and the Defense Advanced Research
Markov Models for Information Extraction. In Pro-
Projects Agency (DARPA) Machine Reading ceedings of the 21st International Conference on
Program under Air Force Research Laboratory Computational Linguistics and 44th Annual Meet-
(AFRL) prime contract no. FA8750-09-C-0172. ing of the Association for Computational Linguis-
Any opinions, findings, and conclusions or rec- tics, pages 481488, Sydney, Australia, July.
ommendations expressed in this material are those Ruihong Huang and Ellen Riloff. 2010. Inducing
of the authors and do not necessarily reflect the Domain-specific Semantic Class Taggers from (Al-
view of the DARPA, AFRL, or the U.S. govern- most) Nothing. In Proceedings of The 48th Annual
Meeting of the Association for Computational Lin-
ment.
guistics (ACL 2010).
Ruihong Huang and Ellen Riloff. 2011. Peeling Back
the Layers: Detecting Event Role Fillers in Sec-
ondary Contexts. In Proceedings of the 49th Annual

294
Meeting of the Association for Computational Lin- W. Phillips and E. Riloff. 2007. Exploiting Role-
guistics: Human Language Technologies (ACL-11). Identifying Nouns and Expressions for Information
S. Huffman. 1996. Learning Information Extraction Extraction. In Proceedings of the 2007 Interna-
Patterns from Examples. In Stefan Wermter, Ellen tional Conference on Recent Advances in Natural
Riloff, and Gabriele Scheler, editors, Connectionist, Language Processing (RANLP-07), pages 468473.
Statistical, and Symbolic Approaches to Learning E. Riloff and R. Jones. 1999. Learning Dictionar-
for Natural Language Processing, pages 246260. ies for Information Extraction by Multi-Level Boot-
Springer-Verlag, Berlin. strapping. In Proceedings of the Sixteenth National
H. Ji and R. Grishman. 2008. Refining Event Extrac- Conference on Artificial Intelligence.
tion through Cross-Document Inference. In Pro- E. Riloff and W. Phillips. 2004. An Introduction to the
ceedings of ACL-08: HLT, pages 254262, Colum- Sundance and AutoSlog Systems. Technical Report
bus, OH, June. UUCS-04-015, School of Computing, University of
Utah.
S. Keerthi and D. DeCoste. 2005. A Modified Finite
E. Riloff. 1993. Automatically Constructing a Dictio-
Newton Method for Fast Solution of Large Scale
nary for Information Extraction Tasks. In Proceed-
Linear SVMs. Journal of Machine Learning Re-
ings of the 11th National Conference on Artificial
search.
Intelligence.
J. Kim and D. Moldovan. 1993. Acquisition of E. Riloff. 1996. Automatically Generating Extraction
Semantic Patterns for Information Extraction from Patterns from Untagged Text. In Proceedings of the
Corpora. In Proceedings of the Ninth IEEE Con- Thirteenth National Conference on Artificial Intel-
ference on Artificial Intelligence for Applications, ligence, pages 10441049.
pages 171176, Los Alamitos, CA. IEEE Computer Satoshi Sekine. 2006. On-demand information extrac-
Society Press. tion. In Proceedings of Joint Conference of the In-
Y. Li, K. Bontcheva, and H. Cunningham. 2005. Us- ternational Committee on Computational Linguis-
ing Uneven Margins SVM and Perceptron for Infor- tics and the Association for Computational Linguis-
mation Extraction. In Proceedings of Ninth Confer- tics (COLING/ACL-06.
ence on Computational Natural Language Learn- Y. Shinyama and S. Sekine. 2006. Preemptive In-
ing, pages 7279, Ann Arbor, MI, June. formation Extraction using Unrestricted Relation
Shasha Liao and Ralph Grishman. 2010. Using Docu- Discovery. In Proceedings of the Human Lan-
ment Level Cross-Event Inference to Improve Event guage Technology Conference of the North Ameri-
Extraction. In Proceedings of the 48st Annual can Chapter of the Association for Computational
Meeting on Association for Computational Linguis- Linguistics, pages 304311, New York City, NY,
tics (ACL-10). June.
M. Maslennikov and T. Chua. 2007. A Multi- S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert.
Resolution Framework for Information Extraction 1995. CRYSTAL: Inducing a conceptual dictio-
from Free Text. In Proceedings of the 45th Annual nary. In Proc. of the Fourteenth International Joint
Meeting of the Association for Computational Lin- Conference on Artificial Intelligence, pages 1314
guistics. 1319.
G. Miller. 1990. Wordnet: An On-line Lexical M. Stevenson and M. Greenwood. 2005. A Seman-
Database. International Journal of Lexicography, tic Approach to IE Pattern Induction. In Proceed-
3(4). ings of the 43rd Annual Meeting of the Association
for Computational Linguistics, pages 379386, Ann
MUC-4 Proceedings. 1992. Proceedings of the
Arbor, MI, June.
Fourth Message Understanding Conference (MUC-
K. Sudo, S. Sekine, and R. Grishman. 2003. An Im-
4). Morgan Kaufmann.
proved Extraction Pattern Representation Model for
S. Patwardhan and E. Riloff. 2007. Effective Informa- Automatic IE Pattern Acquisition. In Proceedings
tion Extraction with Semantic Affinity Patterns and of the 41st Annual Meeting of the Association for
Relevant Regions. In Proceedings of 2007 the Con- Computational Linguistics (ACL-03).
ference on Empirical Methods in Natural Language R. Yangarber, R. Grishman, P. Tapanainen, and S. Hut-
Processing (EMNLP-2007). tunen. 2000. Automatic Acquisition of Domain
S. Patwardhan and E. Riloff. 2009. A Unified Model Knowledge for Information Extraction. In Proceed-
of Phrasal and Sentential Evidence for Information ings of the Eighteenth International Conference on
Extraction. In Proceedings of 2009 the Conference Computational Linguistics (COLING 2000).
on Empirical Methods in Natural Language Pro- K. Yu, G. Guan, and M. Zhou. 2005. Resume In-
cessing (EMNLP-2009). formation Extraction with Cascaded Hybrid Model.
S. Patwardhan. 2010. Widening the Field of View In Proceedings of the 43rd Annual Meeting of the
of Information Extraction through Sentential Event Association for Computational Linguistics, pages
Recognition. Ph.D. thesis, University of Utah. 499506, Ann Arbor, MI, June.

295
Bootstrapping Events and Relations from Text
Ting Liu Tomek Strzalkowski
ILS, University at Albany, ILS, University at Albany, USA
USA Polish Academy of Sciences
tliu@albany.edu tomek@albany.edu
(2) self-adapting unsupervised multi-pass boot-
Abstract strapping by which the system learns new rules
as it reads un-annotated text using the rules learnt
In this paper, we describe a new approach to in the first step and in the subsequent learning
semi-supervised adaptive learning of event passes. When a sufficient quantity and quality of
extraction from text. Given a set of exam- text material is supplied, the system will learn
ples and an un-annotated text corpus, the many ways in which a specific class of events
BEAR system (Bootstrapping Events And
can be described. This includes the capability to
Relations) will automatically learn how to
recognize and understand descriptions of detect individual event mentions using a system
complex semantic relationships in text, such of context-sensitive triggers and to isolate perti-
as events involving multiple entities and nent attributes such as agent, object, instrument,
their roles. For example, given a series of time, place, etc., as may be specific for each type
descriptions of bombing and shooting inci- of event. This method produces an accurate and
dents (e.g., in newswire) the system will highly adaptable event extraction that significant-
learn to extract, with a high degree of accu- ly outperforms current information extraction
racy, other attack-type events mentioned techniques both in terms of accuracy and robust-
elsewhere in text, irrespective of the form of ness, as well as in deployment cost.
description. A series of evaluations using
the ACE data and event set show a signifi-
2 Learning by bootstrapping
cant performance improvement over our
baseline system. As a semi-supervised machine learning method,
bootstrapping can start either with a set of prede-
fined rules or patterns, or with a collection of
1 Introduction training examples (seeds) annotated by a domain
expert on a (small) data set. These are normally
We constructed a semi-supervised machine
related to a target application domain and may be
learning process that effectively exploits statisti-
regarded as initial teacher instructions to the
cal and structural properties of natural language
learning system. The training set enables the sys-
discourse in order to rapidly acquire rules to de-
tem to derive initial extraction rules, which are
tect mentions of events and other complex rela-
applied to un-annotated text data in order to pro-
tionships in text, extract their key attributes, and
duce a much larger set of examples. The exam-
construct template-like representations. The
ples found by the initial rules will occur in a
learning process exploits descriptive and struc-
variety of linguistic contexts, and some of these
tural redundancy, which is common in language;
contexts may provide support for creating alter-
it is often critical for achieving successful com-
native extraction rules. When the new rules are
munication despite distractions, different con-
subsequently applied to the text corpus, addition-
texts, or incompatible semantic models between
al instances of the target concepts will be identi-
a speaker/writer and a hearer/reader. We also
fied, some of which will be positive and some
take advantage of the high degree of referential
not. As this process continues to iterate over, the
consistency in discourse (e.g., as observed in
system acquires more extraction rules, fanning
word sense distribution by (Gale, et al. 1992),
out from the seed set until no new rules can be
and arguably applicable to larger linguistic
learned.
units), which enables the reader to efficiently
Thus defined, bootstrapping has been used in
correlate different forms of description across
natural language processing research, notably in
coherent spans of text.
word sense disambiguation (Yarowsky, 1995).
The method we describe here consists of two
Strzalkowski and Wang (1996) were first to
steps: (1) supervised acquisition of initial extrac-
demonstrate that the technique could be applied
tion rules from an annotated training corpus, and
to adaptive learning of named entity extraction

296
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 296305,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
rules. For example, given a nave rule for iden- be found in essays and other narrative forms. The
tifying company names in text, e.g., capitalized system needs to recognize any of these forms and
NP followed by Co., their system would first to do so we need to distill each description to a
find a large number of (mostly) positive instanc- basic event pattern. This pattern will capture the
es of company names, such as Henry Kauffman heads of key phrases and their dependency struc-
Co. From the context surrounding each of these ture while suppressing modifiers and certain oth-
instances it would isolate alternative indicators, er non-essential elements. Such skeletal
such as the president of, which is noted to oc- representations cannot be obtained with keyword
cur in front of many company names, as in The analysis or linear processing of sentences at word
president of American Electric Automobile Co. level (e.g., Agichtein and Gravano, 2000), be-
. Such alternative indicators give rise to new cause such methods cannot distinguish a phrase
extraction rules, e.g., president of + CNAME. head from its modifier. A shallow dependency
The new rules find more entities, including com- parser, such as Minipar (Lin, 1998), that recog-
pany names that do not end with Co., and the nizes dependency relations between words is
process iterates until no further rules are found. quite sufficient for deriving head-modifier rela-
The technique achieved a very high performance tions and thus for construction of event tem-
(95% precision and 90% recall), which encour- plates. Event templates are obtained by stripping
aged more research in IE area by using boot- the parse tree of modifiers while preserving the
strapping techniques. Using a similar approach, basic dependency structure as shown in Figure 1,
(Thelen and Riloff, 2002) generated new syntac- which is a stripped down parse tree of, Also
tic patterns by exploiting the context of known Monday, Israeli soldiers fired on four diplomatic
seeds for learning semantic categories. vehicles in the northern Gaza town of Beit
In Snowball (Agichtein and Gravano, 2000 ) Hanoun, said diplomats
and Yangarbers IE system (2000), bootstrapping The model proposed here represents a signifi-
technique was applied for extraction of binary cant advance over the current methods for rela-
relations, such as Organization-Location, e.g., tion extraction, such as the SVO model
between Microsoft and Redmond, WA. Then, Xu (Yangarber, et al. 2000) and its extension, e.g.,
(2007) extended the method for more complex the chain model (Sudo, et al. 2001) and other
relations extraction by using sentence syntactic related variants (Riloff, 1996) all of which lack
structure and a data driven pattern generation. In the expressive power to accurately recognize and
this paper, we describe a different approach on represent complex event descriptions and to sup-
building event patterns and adapting to the dif- port successful machine learning. While Sudos
ferent structures of unseen events. subtree model (2003) overcomes some of the
limitations of the chain models and is thus con-
3 Bootstrapping applied to event learn- ceptually closer to our method, it nonetheless
ing lacks efficiency required for practical applica-
tions.
Our objective in this project was to expand the We represent complex relations as tree-like
bootstrapping technique to learn extraction of structures anchored at an event trigger (which is
events from text, irrespective of their form of usually but not necessarily the main verb) with
description, a property essential for successful branches extending to the event attributes (which
adaptability to new domains and text genres. The are usually named entities). Unlike the singular
major challenge in advancing from entities and concepts (i.e., named entities such as person or
binary relations to event learning is the complex-
ity of structures involved that not only consist of
multiple elements but their linguistic context
may now extend well beyond a few surrounding
words, even past sentence boundaries. These
considerations guided the design of the BEAR
system (Bootstrapping Events And Relations),
which is described in this paper.
3.1 Event representation
An event description can vary from very concise,
newswire-style to very rich and complex as may Figure 1. Skeletal dependency structure representation of an
event mention.

297
location) or linear relations (i.e., tuples such as 3.2 Designating the sense of event triggers
Gates CEO Microsoft), an event description
An event trigger may have multiple senses but
consists of elements that form non-linear de-
only one of them is for the event representation.
pendencies, which may not be apparent in the
If the correct sense can be determined, we would
word order and therefore require syntactic and
be able to use its synonyms and hyponym as al-
semantic analysis to extract. Furthermore, an ar-
ternative event triggers, thus enabling extraction
rangement of these elements in text can vary
of more events. This, in turn, requires sense dis-
greatly from one event mention to the next, and
ambiguation to be performed on the event trig-
there is usually other intervening material in-
gers.
volved. Consequently, we construe event repre-
In MUC evaluations, participating groups (
sentation as a collection of paths linking the
Yangarber and Grishman, 1998) used human
trigger to the attributes through the nodes of a
experts to decide the correct sense of event trig-
parse tree1.
gers and then manually added correct synonyms
To create an event pattern (which will be part
to generalize event patterns. Although accurate,
of an extraction rule), we generalize the depend-
the process is time consuming and not portable to
ency paths that connect the event trigger with
new domains.
each of the event key attributes (the roles). A
We developed a new approach for utilizing
dependency path consists of lexical and syntactic
Wordnet to decide the correct sense of an event
relations (POS and phrase dependencies), as well
trigger. The method is based on the hypothesis
as semantic relations, such as entity tags (e.g.,
that event triggers will share same sense when
Person, Company, etc.) of event roles and word
represent same type of event. For example, when
sense designations (based on Wordnet senses) of
the verbs, attack, assail, strike, gas, bomb, are
event triggers. In addition to the trigger-role
trigger words of Conflict-Attack event, they
paths (which we shall call the sub-patterns), an
share same sense. This process is described in the
event pattern also contains the following:
following steps:
Event Type and Subtype which is inher- 1) From training corpus, collect all triggers,
ited from seed examples; which specify the lemma, POS tag, the type
Trigger class an instance of the trigger of event and get all possible senses of them
must be found in text before any patterns from Wordnet.
are applied; 2) Order the triggers by the trigger frequency
Confidence score expected accuracy of TrF(t, w_pos),2 which is calculated by divid-
the pattern established during training ing number of times each word (w_pos) is
process; used as a trigger for the event of type (t) by
Context profile additional features col- the total number of times this word occurs in
lected from the context surrounding the the training corpus. Clearly, the greater trig-
event description, including references of ger frequency of a word, the more discrimi-
other types of events near this event, in native it is as a trigger for the given type of
the same sentence, same paragraph, or ad- event. When the senses of the triggers with
jacent paragraphs. high accuracy are defined, they can be the
We note that the trigger-attribute sub-patterns reference for the triggers in low accuracy.
are defined over phrase structures rather than 3) From the top of the trigger list, select the
over linear text, as shown in Figure 2. In order to first none-sense defined trigger (Tr1)
compose a complete event pattern, sub-patterns 4) Again, beginning from the top of the trigger
are collected across multiple mentions of the list, for every trigger Tr2 (other than Tr1),
same-type event. we look for a pair of compatible senses be-
tween Tr1 and Tr2. To do so, traverse Syno-
Attacker: <N(subj, PER): Attacker> <V(fire): trigger>
Place: <V(fire): trigger> <Prep> <N> <Prep(in)> <N(GPE): Place>
nym, Hypernym, and Hyponym links starting
Target: <V(fire): trigger> <Prep(on)> <N(VEH): Target> from the sense(s) of Tr2 (use either the sense
Time-Within:<N(timex2): Time-Within><SentHead><V(fire): already assigned to Tr2 if has or all its possi-
trigger>
Figure 2. Trigger-attribute sub-patterns for key roles in a Conflict- ble senses) and see whether there are paths
Attack event pattern. which can reach the senses of Tr1. If such
1
Details of how to derive the skeletal tree representation are converging paths exist, the compatible senses
described in (Liu, 2009).
2 2
t the type of the event, w_pos the lemma of a word and t the type of the event, w_pos the lemma of a word and
its POS. its POS.
3
In this figure we omit the parse tree trimming step which
was explained in the previous section.
298
relaxation, is particularly useful for rapid adapta-
tion of extraction capability to slightly altered,
partly ungrammatical, or otherwise variant data.
The basic idea is as follows: the patterns ac-
quired in prior learning iterations (starting with
those obtained from the seed examples) are
matched against incoming text to extract new
events. Along the way there will be a number of
partial matches, i.e., when no existing pattern
fully matches a span of text. This may simply
mean that no event is present; however, depend-
ing upon the degree of the partial match we may
Figure 3. A Conflict-Attack event pattern derived from a also consider that a novel structural variant was
positive example in the training corpus
are identified and assigned to Tr1 and Tr2 (if found. BEAR would automatically test this hy-
Tr2s sense wasnt assigned before). Then go pothesis by attempting to construe a new pattern,
back to step 3. However, if no such path ex- out of the elements of existing patterns, in order
ist between Tr1 senses with other triggers to achieve a full match. If a match is achieved,
senses, the first sense listed in Wordnet will the new mutated pattern will be added to
be assigned to Tr1 BEAR learned collection, subject to a validation
This algorithm tries to assign the most proper step. The validation step (discussed later in this
sense to every trigger for one type of event. For paper) is to assure that the added pattern would
example, the sense of fire as trigger of Conflict- not introduce an unacceptable drop in overall
Attack event is start firing a weapon; while it is system precision. Specific pattern mutation tech-
used in Personal-End_Position, its sense is ter- niques include the following:
minate the employment of. After the trigger Adding a role subpattern: When a pattern
sense is defined, we can expand event triggers by matches an event mention while there is a
adding their synonyms and hyponyms during the sufficient linguistic evidence (e.g., pres-
event extraction. ence of certain types of named entities)
that additional roles may be present in
3.3 Deriving initial rules from seed exam- text, then appropriate role subpatterns can
ples be "imported" from other, non-matching
patterns (Figure 4).
Extraction rules are construed as transformations
from the event patterns derived from text onto a Replacing a role subpattern: When a pat-
formal representation of an event. The initial tern matches but for one role, the system
rules are derived from a manually annotated can replace this role subpattern by another
training text corpus (seed data), supplied as part subpattern for the same role taken from a
of an application task. Each rule contains the different pattern for the same event type.
type of events it extracts, trigger, a list of role Adding or replacing a trigger: When a
sub-patterns, and the confidence score obtained pattern matches but for the trigger, a new
through a validation process (see section 3.6). trigger can be added if it either is already
Figure 3 shows an extraction pattern for the Con- present in another pattern for the same
flict-Attack event derived from the training cor- event type or the syno-
pus (but not validated yet)3. nym/hyponym/hypernym of the trigger
(found in section 3.2).
3.4 Learning through pattern mutation We should point out that some of the same ef-
fects can be obtained by making patterns more
Given an initial set of extraction rules, a variety
general, i.e., adding "optional" attributes (i.e.,
of pattern mutation techniques are applied to de-
optional sub-patterns), etc. Nonetheless, the pat-
rive new patterns and new rules. This is done by
tern mutation is more efficient because it will
selecting elements of previously learnt patterns,
automatically learn such generalization on an as-
based on the history of partial matches and com-
needed basis in an entirely data-driven fashion,
bining them into new patterns. This form of
while also maintaining high precision of the re-
learning, which also includes conditional rule
sulting pattern set. It is thus a more general
3
In this figure we omit the parse tree trimming step which method. Figure 4 illustrated the use of the ele-
was explained in the previous section. ments combination technique. In this example,

299
Figure 4. Deriving a new pattern by importing a role from another pattern

Figure 4 illustrates the use of the element-combination technique. In this example, neither of the two existing patterns can fully match the new event description; however, by combining the first pattern with the Place role sub-pattern from the second pattern, we obtain a new pattern that fully matches the text. While this adjustment is quite simple, it is nonetheless performed automatically and without any human assistance. The new pattern is then learned by BEAR, subject to a verification step explained in a later section.

3.5 Learning by exploiting structural duality

As the system reads through new text, extracting more events using already learnt rules, each extracted event mention is analyzed for the presence of alternative trigger elements that can consistently predict the presence of a subset of events that includes the current one. Subsequently, an alternative sub-pattern structure is built, with branches extending from the new trigger to the already identified attributes, as shown schematically in Figure 5.

In this example, a Conflict-Attack-type event is extracted using a pattern (shown in Figure 5A) anchored at the "bombing" trigger. Nonetheless, an alternative trigger structure is discovered, anchored at an "attack" NP, as shown on the right side of Figure 5. This discovery is based on seeing the new trigger repeatedly: it needs to "explain" a subset of previously seen events in order to be adopted. The new trigger prompts BEAR to derive additional event patterns by computing alternative trigger-attribute paths in the dependency tree. The new pattern (shown in Figure 5B) is of course subject to confidence validation, after which it is immediately applied to extract more events.

Figure 5. A new extraction pattern is derived by identifying an alternative trigger for an event.

Figure 5A. A pattern with the "bombing" trigger matches the event mention in Fig. 5:
Pattern ID: 1207
Type: Conflict   Subtype: Attack
Trigger: bombing_N
Target: <N(bombing): trigger> <Prep(of)> <N(FAC): Target>
Attacker: <N(PER): Attacker> <V> <N(bombing): trigger>
Time-Within: <N(bombing): trigger> <Prep> <N> <Prep> <N> <E0> <V> <N(timex2): Time-within>

Figure 5B. A new pattern is derived for the event in Fig. 5, with "attack" as the trigger:
Pattern ID: 1286
Type: Conflict   Subtype: Attack
Trigger: attack_N
Target: <N(FAC): Target> <Prep(in)> <N(attack): trigger>
Attacker: <N(PER): Attacker> <V> <N> <Prep> <N> <Prep(in)> <N(attack): trigger>
Time-Within: <N(attack): trigger> <E0> <V> <N(timex2): Time-within>

Another way of getting at this kind of structural duality is to exploit co-referential consistency within coherent spans of discourse, e.g., a single news article or a similar document. Such documents may contain references to multiple events, but when the same type of event is mentioned along with the same attributes, it is more likely than not a reference to the same event. This hypothesis is a variant of an argument advanced in (Gale et al., 1992) that a polysemous word used multiple times within a single document is consistently used in the same sense. So if we extract an event mention (of type T) with trigger t in one part of a document, and then find that t occurs in another part of the same document, we may assume that this second occurrence of t has the same sense as the first. Since t is a trigger for an event of type T, we can hypothesize that its subsequent occurrences indicate additional mentions of type-T events that were not extracted by any of the existing patterns. Our objective is to exploit these unextracted mentions and automatically generate additional event patterns.
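The following sketch illustrates this document-level heuristic (a simplification of ours, not BEAR's implementation; the sentence and mention representations are hypothetical): sentences of the same document that contain a trigger already seen in an extracted mention become candidate mentions of that event type, from which new patterns may later be derived.

```python
# Illustrative "one trigger sense per document" heuristic (ours, simplified).
def candidate_mentions(document_sentences, extracted_mentions):
    """document_sentences: list of token lists for one document.
    extracted_mentions: list of (sentence_index, event_type, trigger) triples
    found by existing patterns in the same document."""
    seen = {(i, trig) for i, _, trig in extracted_mentions}
    trigger_type = {}                       # trigger -> event type seen in this document
    for _, etype, trig in extracted_mentions:
        trigger_type[trig] = etype
    candidates = []
    for i, tokens in enumerate(document_sentences):
        for trig, etype in trigger_type.items():
            if trig in tokens and (i, trig) not in seen:
                candidates.append((i, etype, trig))   # unextracted candidate mention
    return candidates
```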
Indeed, Ji and Grishman (2008) showed that trigger co-occurrence helps to find new mentions of the same event; we found, however, that by using entity co-reference as an additional factor, more new mentions can be identified when the trigger has low projected accuracy (Liu, 2009; Hong et al., 2011). Our experiments (Figure 6), which compared the triggers and the roles across all event mentions within each document of the ACE training corpus, showed that when the trigger accuracy is 0.5 or higher, each of its occurrences within the document indicates an event mention of the same type with very high probability (mostly > 0.9). For triggers with lower accuracy, this high probability is reached only when the two mentions also share at least 60% of their roles, in addition to having a common trigger. Thus, our approach uses the co-occurrence of both the trigger and the event arguments to detect new event mentions.

Figure 6. The probability of a sentence containing a mention of the same type of event within a single document. (The X-axis is the percentage of entities co-referred between the event mentions (EVMs) and the sentences (SEs); the Y-axis shows the probability that the SE contains a mention of the same type as the EVM.)

In Figure 7, an End-Position event is extracted from the left sentence (L), with "resign" as the trigger and "Capek" and "UBS" assigned the Person and Entity roles, respectively (Entity is the employer in this event pattern). The right sentence (R), taken from the same document, contains the same trigger word, "resigned", and also the same entities, "Howard G. Capek" and "UBS". The projected accuracy of resign_V as an End-Position trigger is 0.88. With a 100% argument overlap rate, we estimate the probability that sentence R contains an event mention of the same type as sentence L (and in fact a co-referential mention) at 97%; we set 80% as the threshold. Thus a new event mention is found, and a new pattern for End-Position is automatically derived from R, as shown in Figure 7A.

Figure 7. Two event mentions with different triggers and sub-pattern structures.

Figure 7A. A new pattern for End-Position learned by exploiting event co-reference:
Pattern ID: -1
Type: Personnel   Subtype: End-Position
Trigger: resign_V
Person: <N(PER, subj): Person> <V(resign): trigger>
Entity: <V(resign): trigger> <E0> <N(ORG): Entity> <N> <V>
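A minimal sketch of the decision rule implied by this example follows (ours; the probability function is a crude, hypothetical stand-in for the empirical statistics behind Figure 6):

```python
# Illustrative decision rule combining trigger projected accuracy with
# argument overlap, in the spirit of the resign/End-Position example above.
def same_event_probability(trigger_accuracy, argument_overlap):
    """Hypothetical stand-in for the empirical curves of Figure 6."""
    if trigger_accuracy >= 0.5:
        return 0.9      # high-accuracy triggers almost always repeat the event type
    if argument_overlap >= 0.6:
        return 0.9      # low-accuracy triggers additionally need shared arguments
    return 0.5

def accept_new_mention(trigger_accuracy, argument_overlap, threshold=0.8):
    return same_event_probability(trigger_accuracy, argument_overlap) >= threshold

# E.g., resign_V with projected accuracy 0.88 and full argument overlap clears
# the 0.8 threshold, so a new End-Position pattern would be derived:
# accept_new_mention(0.88, 1.0)  -> True
```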
3.6 Pattern validation

Extraction patterns are validated after each learning cycle against the already annotated data. In the first, supervised learning step, pattern accuracy is tested against the training corpus, based on the similarity between the extracted events and the human-annotated events:

A Full Match is achieved when the event type is correctly identified and all of its roles are correctly matched; full credit is added to the pattern score.

A Partial Match is achieved when the event type is correctly identified but only a subset of the roles is correctly extracted; a partial score, the ratio of matched roles to all roles, is added.

A False Alarm occurs when a wrong type of event is extracted (including when no event is present in the text); no credit is added to the pattern score.

In the subsequent steps, the validation is extended over parts of the unannotated corpus. In Riloff (1996) and Sudo et al. (2001), pattern accuracy depends mainly on a pattern's occurrences in the relevant documents (a document that contains the same type of events extracted in previous steps is treated as a relevant document) vs. the whole corpus. However, one document may contain multiple types of events, so we use a more restrictive validation measure for new rules:

Good Match: a new rule rediscovers already extracted events of the same type; it is counted as either a Full Match or a Partial Match, based on the previous rules.

Possible Match: an already extracted event of the same type as the rule shares the same entities and trigger with the candidate extracted by the rule; such a candidate is a possible match and receives a partial score based on the statistics summarized in Figure 6.

False Alarm: a new rule picks up an already extracted event of a different type.

Thus, event patterns are validated for overall expected precision by calculating the ratio of positive matches to all matches against known events. This produces pattern confidence scores, which are used to decide whether a pattern is to be learned. Learning only the patterns with sufficiently high confidence scores helps to keep the bootstrapping process from spinning off track; nonetheless, the overall objective is to maximize the performance of the resulting set of extraction rules, particularly by expanding its recall.
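This scoring scheme can be sketched as follows (an illustration under our own data layout, not BEAR's internal representation): each extraction earns full, partial, or zero credit against the annotated events, and a pattern's confidence is the ratio of earned credit to the number of matches it produced.

```python
# Illustrative Full/Partial/False-Alarm scoring (ours, for exposition only).
def match_credit(extracted, gold_events):
    """extracted and each gold event are dicts with a 'type' and a 'roles' dict."""
    for gold in gold_events:
        if gold["type"] != extracted["type"]:
            continue
        matched = sum(1 for role, value in extracted["roles"].items()
                      if gold["roles"].get(role) == value)
        if matched == len(gold["roles"]):
            return 1.0                                # full match
        if matched > 0:
            return matched / len(gold["roles"])       # partial match
    return 0.0                                        # false alarm: wrong type or no event

def pattern_confidence(extractions, gold_events):
    """Ratio of earned credit to the number of matches the pattern produced."""
    if not extractions:
        return 0.0
    return sum(match_credit(e, gold_events) for e in extractions) / len(extractions)
```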
For the patterns whose projected accuracy score falls below the cutoff threshold, we may still be able to make some "repairs" by taking into account their context profile. To do so, we applied an approach similar to (Liao, 2010), which showed that some types of events tend to appear frequently with each other. We collect all the matches produced by such a failed pattern and create a list of all other events that occur in their immediate vicinity: in the same sentence, as well as in the sentences before and after it (if a known event is detected in the same sentence, the same paragraph, or an adjacent paragraph as the candidate event, it becomes an element of the pattern context support; these cases correspond to the sent_, para_, and adj_para_ prefixes in Figure 8). These other events, of different types and detected by different patterns, may be seen as co-occurring near the target event: those that co-occur near positive matches of our pattern are added to the positive context support of the pattern; conversely, events co-occurring near false alarms are added to its negative context support. By collecting such contextual information, we can find contextually based indicators and non-indicators for the occurrence of event mentions. When these extra constraints are included in a previously failed pattern, its projected accuracy is expected to increase, in some cases above the threshold.

Figure 8. An Arrest-Jail pattern with context profile information:
Event id: 27
from: sample
Projected Accuracy: 0.1765
Adjusted Projected Accuracy: 0.91
Type: Justice   Subtype: Arrest-Jail
Trigger: capture
Person sub-pattern: <N(obj, PER): Person> <V(capture): trigger>
Co-occurrence ratio: {para_Conflict_Demonstrate=100%}
Mutually exclusive ratio: {sent_Conflict_Attack=100%, para_Conflict_Attack=96.3%}

For example, the pattern in Figure 8 has an initially low projected accuracy score; however, we find that positive matches of this pattern show a very high (in fact, 100%) degree of correlation with mentions of Demonstrate events. Therefore, limiting the application of this pattern to situations where a Conflict-Demonstrate event is mentioned in nearby text improves its projected accuracy to 91%, which is well above the required threshold.

In addition to the confidence rate of each new pattern, we also calculate the projected accuracy of each of the role sub-patterns, because they may be used in the process of detecting new patterns, and it is necessary to score partial matches as a function of confidence weights for pattern components. To validate a sub-pattern, we apply it to the training corpus and calculate its projected accuracy score by dividing the number of correctly matched roles by the total number of matches returned. The projected accuracy score tells us how well a sub-pattern can distinguish a specific event role from other information when used independently of the other elements of the complete pattern.

Figure 9. Sub-patterns with projected accuracy scores:
Victim pattern: <N(obj, PER): Victim> <V(kill): trigger> (Life-Die)
Projected Accuracy: 0.939; negative matches: 5; positive matches: 77

Attacker pattern: <N(subj, PE/PER/ORG): Attacker> <V> <V(use): trigger> (Conflict-Attack)
Projected Accuracy: 0.025; negative matches: 116; positive matches: 3

Attacker pattern: <N(subj, GPE/PER): Attacker> <V(attack): trigger> (Conflict-Attack)
Projected Accuracy: 0.417; negative matches: 7; positive matches: 5
Categories of positive matches: GPE: 4 (GPE_Nation: 4), PER: 1 (PER_Individual: 1)
Categories of negative matches: GPE: 1 (GPE_Nation: 1), PER: 6 (PER_Group: 1, PER_Individual: 5)

Figure 9 shows three sub-pattern examples. The first sub-pattern extracts the Victim role in a Life-Die event with very high projected accuracy; it is also a good candidate for the generation of additional patterns for this type of event. The second sub-pattern was built to extract the Attacker role in Conflict-Attack events, but it has very low projected accuracy. The third one shows another Attacker sub-pattern, whose projected accuracy score is 0.417 after the first validation step. This is quite low; however, it can be repaired by constraining its entity type to GPE: with a GPE entity, the sub-pattern is on target 80% of the time, while with a PER entity it is a false alarm 85% of the time. After this sub-pattern is restricted to GPE, its projected accuracy becomes 0.8.
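This repair step can be sketched as follows (our illustration; the threshold and data layout are assumptions): the matches of a sub-pattern are split by the entity type they fired on, types that are mostly false alarms are removed, and the projected accuracy is recomputed on the remaining matches, as in Table 1 below.

```python
# Illustrative entity-type repair of a sub-pattern (ours, for exposition only).
from collections import defaultdict

def projected_accuracy(matches):
    """matches: list of (entity_type, is_correct) pairs for one sub-pattern."""
    return sum(ok for _, ok in matches) / len(matches) if matches else 0.0

def repair_by_entity_type(matches, min_type_accuracy=0.5):
    """Keep only entity types whose own accuracy reaches min_type_accuracy;
    return the kept types and the revised accuracy on the remaining matches."""
    by_type = defaultdict(list)
    for etype, ok in matches:
        by_type[etype].append(ok)
    kept = {t for t, oks in by_type.items()
            if sum(oks) / len(oks) >= min_type_accuracy}
    remaining = [(t, ok) for t, ok in matches if t in kept]
    return kept, projected_accuracy(remaining)

# E.g., an Attacker sub-pattern with 0.417 accuracy overall may rise to 0.8
# once PER matches (mostly false alarms) are removed and only GPE is kept.
```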
Table 1 lists example sub-patterns for which the projected accuracy increases significantly after adding more constraints. When the projected accuracy of a sub-pattern is improved, all patterns containing this sub-pattern also improve their projected accuracy. If the adjusted projected accuracy rises above the predefined threshold, the repaired pattern is saved.

Table 1. Sub-patterns whose projected accuracy is significantly increased after noisy samples are removed.

Sub-pattern | Projected Accuracy | Additional constraint | Revised Accuracy
Movement-Transport:
<N(obj, PER/VEH): Artifact> <V(send): trigger> | 0.475 | removing PER | 0.667
<V(bring): trigger> <N(obj)> <Prep=to> <N(FAC/GPE): Destination> | 0.375 | removing GPE | 1.0
Conflict-Attack:
<N(PER/ORG/GPE): Attacker> <N(attack): trigger> | 0.682 | removing PER | 0.8
<N(subj, GPE/PER): Attacker> <V(attack): trigger> | 0.417 | removing GPE | 0.8
<N(obj, VEH/PER/FAC): Target> <V(target): trigger> | 0.364 | removing PER_Individual | 0.667

In the following section, we discuss the experiments conducted to evaluate the performance of the techniques underlying BEAR: how effectively it can learn and how accurately it can perform its extraction task.

4 Evaluation

We test the system's learning effectiveness by comparing its performance immediately following the first iteration (i.e., using rules derived from the training data) with its performance after N cycles of unsupervised learning. We split the ACE training corpus (599 documents from news, weblogs, Usenet, and conversational telephone speech; 33 types of events are defined in the ACE corpus) randomly into five folds, trained BEAR on four folds, evaluated it on the remaining fold, and repeated this for 5-fold cross-validation. Our experiments showed that BEAR reached the best cross-validated score, 66.72%, when the pattern accuracy threshold is set to 0.5. The highest score for a single run is 67.62%. In the remainder of this section, we use the results of a single run to illustrate the learning behavior of BEAR.

In Figure 10, the X-axis shows values of the learning threshold (in descending order), while the Y-axis is the average F-score achieved by the automatically learned patterns for all types of events against the test corpus. The red (lower) line represents BEAR's base run immediately after the first iteration (the supervised learning step); the blue (upper) line represents BEAR's performance after an additional 10 unsupervised learning cycles are completed (the learning process for one event type stops when no new patterns can be generated, so the number of learning cycles differs across event types, ranging from 2 to 10). We note that the final performance of the bootstrapped system steadily increases as the learning threshold is lowered, peaking at about the 0.5 threshold value, and then declines as the threshold is further decreased, although it remains solidly above the base run. Analyzing a few selected points on this chart more closely, we note, for example, that the base run at a threshold of 0 has an F-score of 34.5%, which represents 30.42% recall and 40% precision. At the other end of the curve, at a threshold of 0.9, the base-run precision is 91.8% but recall is only 21.5%, producing an F-score of 34.8%. It is interesting to observe that at neither of these two extremes is the system's learning effectiveness particularly good; it is significantly lower than at the median threshold of 0.5 (based on the experiments conducted thus far), where the system performance improves from 42% to 66.86% F-score, representing 83.9% precision and 55.57% recall.

Figure 10. BEAR cross-validated scores.
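The evaluation sweep over the learning threshold can be sketched as follows (our illustration; the train, extract, and evaluate callables are placeholders for BEAR's actual learning and extraction steps, not functions of the system):

```python
# Illustrative 5-fold sweep over the pattern-confidence threshold (ours).
def f_score(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def sweep_thresholds(folds, train, extract, evaluate,
                     thresholds=(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0)):
    """For each threshold, train on four folds, test on the held-out fold,
    and average the F-score over folds."""
    results = {}
    for t in thresholds:
        scores = []
        for i, test_fold in enumerate(folds):
            train_docs = [d for j, f in enumerate(folds) if j != i for d in f]
            patterns = train(train_docs, confidence_threshold=t)
            precision, recall = evaluate(extract(patterns, test_fold), test_fold)
            scores.append(f_score(precision, recall))
        results[t] = sum(scores) / len(scores)
    return results
```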
Figure 11 shows BEAR's learning effectiveness at what we determined empirically to be the optimal confidence threshold (0.5) for pattern acquisition. We note that the performance of the system steadily increases until it reaches a plateau after about 10 learning cycles.

Figure 11. BEAR's unsupervised learning curve.

Figure 12 and Figure 13 show a detailed breakdown of BEAR's extraction performance after 10 learning cycles for different types of events. We note that while precision holds steady across event types, recall levels vary significantly. The main reason for low recall on some types of events is the failure to find a sufficient number of high-confidence patterns. This may point to limitations of the current pattern discovery methods and may require new ways of reaching outside the current feature set.

Figure 12. Event mention extraction after learning: precision for each type of event.
Figure 13. Event mention extraction after learning: recall for each type of event.

In the previous section we described several learning methods that BEAR uses to discover, validate and adapt new event extraction rules. Some of them work by manipulating already learnt patterns and adapting them to new data in order to create new patterns; we call these pattern-mutation methods (PMM). The other methods work by exploiting the broader linguistic context in which the events occur; we call these context-based methods (CBM). CB methods look for structural duality in the text surrounding the events and thus discover alternative extraction patterns.

In Table 2, we report the results of running BEAR with each of these two groups of learning methods separately and then in combination, to see how they contribute to the end performance. Base1 and Base2 show the results without and with trigger synonyms in event extraction: by introducing trigger synonyms, 27% more good events were extracted in the first iteration, and thus BEAR had more resources to use in the unsupervised learning steps. All is the combination of PMM and CBM; it demonstrates that both groups of methods contribute to the final results. Furthermore, as explained before, new extraction rules are learned in each iteration cycle based on what was learned in prior cycles, and new rules are adopted only after they are tested for their projected accuracy (confidence score), so that the overall precision of the resulting rule set is maintained at a high level relative to the base run.

Table 2. BEAR performance following different selections of learning steps.

Run | Precision | Recall | F-score
Base1 | 0.89 | 0.22 | 0.35
Base2 | 0.87 | 0.28 | 0.42
All | 0.84 | 0.56 | 0.67
PMM | 0.84 | 0.48 | 0.61
CBM | 0.86 | 0.37 | 0.52

5 Conclusion and future work

In this paper, we presented a semi-supervised method for learning new event extraction patterns from un-annotated text. The techniques described here add significant new tools that increase the capabilities of information extraction technology in general, and more specifically of systems built by purely supervised methods or from manually designed rules. Our evaluation using the ACE dataset demonstrated that bootstrapping can be effectively applied to learning event extraction rules for 33 different types of events, and that the resulting system can significantly outperform the supervised system (base run).

Some follow-up research issues include:

New techniques are needed to recognize event descriptions that still evade the current pattern derivation techniques, especially for the events defined in the Personnel, Business, and Transactions classes.

Adapting the bootstrapping method to extract events in a different language, e.g., Chinese or Arabic.

Expanding this method to the extraction of larger "scenarios", i.e., groups of correlated events that form coherent stories, often described in larger sections of text, e.g., an event and its immediate consequences.
References

Agichtein, E. and Gravano, L. 2000. Snowball: Extracting Relations from Large Plain-Text Collections. In Proceedings of the Fifth ACM International Conference on Digital Libraries.

Gale, W. A., Church, K. W., and Yarowsky, D. 1992. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, 233-237. Harriman, New York: Association for Computational Linguistics.

Hong, Y., Zhang, J., Ma, B., Yao, J., Zhou, G., and Zhu, Q. 2011. Using Cross-Entity Inference to Improve Event Extraction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2011), Portland, Oregon, USA.

Ji, H. and Grishman, R. 2008. Refining Event Extraction Through Unsupervised Cross-document Inference. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2008), Ohio, USA.

Liao, S. and Grishman, R. 2010. Using Document Level Cross-Event Inference to Improve Event Extraction. In Proceedings of ACL 2010, pages 789-797, Uppsala, Sweden, July.

Lin, D. 1998. Dependency-based evaluation of MINIPAR. In Workshop on the Evaluation of Parsing Systems, Granada, Spain.

Liu, T. 2009. BEAR: Bootstrap Event and Relations from Text. Ph.D. thesis.

Riloff, E. 1996. Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1044-1049. The AAAI Press/MIT Press.

Strzalkowski, T. and Wang, J. 1996. A self-learning universal concept spotter. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2, 931-936, Copenhagen, Denmark: Association for Computational Linguistics.

Sudo, K., Sekine, S., and Grishman, R. 2001. Automatic Pattern Acquisition for Japanese Information Extraction. In Proceedings of the Human Language Technology Conference (HLT 2001).

Sudo, K., Sekine, S., and Grishman, R. 2003. An improved extraction pattern representation model for automatic IE pattern acquisition. In Proceedings of ACL 2003, 224-231, Tokyo.

Thelen, M. and Riloff, E. 2002. A bootstrapping method for learning semantic lexicons using extraction pattern contexts. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, 214-222. Morristown, NJ: Association for Computational Linguistics.

Xu, F., Uszkoreit, H., and Li, H. 2007. A seed-driven bottom-up machine learning framework for extracting relations of various complexity. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 584-591, Prague, Czech Republic.

Yangarber, R. and Grishman, R. 1998. NYU: Description of the Proteus/PET System as Used for MUC-7 ST. In Proceedings of the 7th Conference on Message Understanding.

Yangarber, R., Grishman, R., Tapanainen, P., and Huttunen, S. 2000. Unsupervised discovery of scenario-level patterns for information extraction. In Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP-NAACL 2000), 282-289.

Yarowsky, D. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 189-196, Cambridge, Massachusetts: Association for Computational Linguistics.
CLex: A Lexicon for Exploring Color, Concept and Emotion
Associations in Language

Svitlana Volkova, Johns Hopkins University, 3400 North Charles, Baltimore, MD 21218, USA (svitlana@jhu.edu)
William B. Dolan, Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA (billdol@microsoft.com)
Theresa Wilson, HLTCOE, 810 Wyman Park Drive, Baltimore, MD 21211, USA (taw@jhu.edu)

Abstract

Existing concept-color-emotion lexicons limit themselves to small sets of basic emotions and colors, which cannot capture the rich palette of color terms that humans use in communication. In this paper we begin to address this problem by building a novel color-emotion-concept association lexicon via crowdsourcing. This lexicon, which we call CLEX, has over 2,300 color terms, over 3,000 affect terms and almost 2,000 concepts. We investigate the relation between color and concept, and between color and emotion, reinforcing results from previous studies as well as discovering new associations. We also investigate cross-cultural differences in color-emotion associations between US- and India-based annotators.

1 Introduction

People typically use color terms to describe the visual characteristics of objects, and certain colors often have strong associations with particular objects, e.g., blue - sky, white - snow. However, people also take advantage of color terms to strengthen their messages and convey emotions in natural interactions (Jacobson and Bender, 1996; Hardin and Maffi, 1997). Colors are both indicative of and have an effect on our feelings and emotions. Some colors are associated with positive emotions, e.g., joy, trust and admiration, and some with negative emotions, e.g., aggressiveness, fear, boredom and sadness (Ortony et al., 1988).

Given the importance of color and visual descriptions in conveying emotion, obtaining a deeper understanding of the associations between colors, concepts and emotions may be helpful for many tasks in language understanding and generation. A detailed set of color-concept-emotion associations (e.g., brown - darkness - boredom; red - blood - anger) could be quite useful for sentiment analysis, for example in helping to understand what emotion a newspaper article, a fairy tale, or a tweet is trying to evoke (Alm et al., 2005; Mohammad, 2011b; Kouloumpis et al., 2011). Color-concept-emotion associations may also be useful for textual entailment, and for machine translation as a source of paraphrasing.

Color-concept-emotion associations also have the potential to enhance human-computer interactions in many real- and virtual-world domains, e.g., online shopping and avatar construction in gaming environments. Such knowledge may allow for clearer and hopefully more natural descriptions by users, for example searching for a "sky-blue" shirt rather than a "blue" or "light blue" shirt. Our long-term goal is to use color-emotion-concept associations to enrich dialog systems with information that will help them generate more appropriate responses to users' different emotional states.

This work introduces a new lexicon of color-concept-emotion associations, created through crowdsourcing. We call this lexicon CLEX (available for download at http://research.microsoft.com/en-us/downloads/; questions about the data and the access process may be sent to svitlana@jhu.edu). It is comparable in size to only two known lexicons: WORDNET-AFFECT (Strapparava and Valitutti, 2004) and EMOLEX (Mohammad and Turney, 2010). In contrast to the development of these lexicons, we do not restrict our annotators to a particular set of emotions. This allows us to
collect more linguistically rich color-concept annotations associated with mood, cognitive state, behavior and attitude. We also place no restrictions on color naming, which helps us to discover a rich lexicon of color terms and collocations that represent various hues, darkness, saturation and other natural-language collocations.

We also perform a comprehensive analysis of the data by investigating several questions, including: What affect terms are evoked by a certain color, e.g., positive vs. negative? What concepts are frequently associated with a particular color? What is the distribution of part-of-speech tags over concepts and affect terms in data collected without any preselected set of affect terms and concepts? What affect terms are strongly associated with a certain concept or a category of concepts, and is there any correlation with the semantic orientation of a concept?

Finally, we share our experience collecting the data using crowdsourcing, and describe advantages and disadvantages as well as the strategies we used to ensure high-quality annotations.

2 Related Work

Interestingly, some color-concept associations vary by culture and are influenced by the traditions and beliefs of a society. As shown in (Sable and Akcay, 2010), green represents danger in Malaysia, envy in Belgium, and love and happiness in Japan; red is associated with luck in China and Denmark, but with bad luck in Nigeria and Germany, and reflects ambition and desire in India.

Some expressions involving colors share the same meaning across many languages, for instance "white heat" or "red heat" (a state of high physical and mental tension), "blue-blood" (an aristocrat, royalty), and "white-collar" or "blue-collar" (office clerks). However, there are expressions where color associations differ across languages: e.g., the British or Italian "black eye" becomes blue in Germany, purple in Spain and "black-butter" in France; your French, Italian and English neighbors are "green with envy" while Germans are "yellow with envy" (Bortoli and Maroto, 2001).

There has been little academic work on constructing color-concept and color-emotion lexicons. The work most closely related to ours collects concept-color (Mohammad, 2011c) and concept-emotion (EMOLEX) associations, both relying on crowdsourcing. His project involved collecting color and emotion annotations for 10,170 word-sense pairs from the Macquarie Thesaurus (http://www.macquarieonline.com.au). They analyzed their annotations, looking for associations with the 11 basic color terms from Berlin and Kay (1988). The set of emotion labels used in their annotations was restricted to the 8 basic emotions proposed by Plutchik (1980). Their annotators were restricted to the US, and produced 4.45 annotations per word-sense pair on average.

There is also a commercial project, Cymbolism (http://www.cymbolism.com/), that collects concept-color associations. It has 561,261 annotations for a restricted set of 256 concepts, mainly nouns, adjectives and adverbs.

Other work on collecting the emotional aspect of concepts includes WordNet-Affect (WNA) (Strapparava and Valitutti, 2004), the General Inquirer (GI) (Stone et al., 1966), Affective Norms for English Words (Bradley and Lang, 1999) and Elliott's Affective Reasoner (Elliott, 1992).

The WNA lexicon is a set of affect terms from WordNet (Miller, 1995). It contains emotions, cognitive states, personality traits, behavior, attitude and feelings, e.g., joy, doubt, competitive, cry, indifference, pain. A total of 289 affect terms were manually extracted, and later the lexicon was extended using WordNet semantic relationships. WNA covers 1,903 affect terms: 539 nouns, 517 adjectives, 238 verbs and 15 adverbs.

The General Inquirer covers 11,788 concepts labeled with 182 category labels, including certain affect categories (e.g., pleasure, arousal, feeling, pain) in addition to a positive/negative semantic orientation for each concept (http://www.wjh.harvard.edu/inquirer/).

Affective Norms for English Words describes a manually collected set of normative emotional ratings for 1K English words, rated in terms of emotional arousal (ranging from calm to excited), affective valence (ranging from pleasant to unpleasant) and dominance (ranging from in control to dominated).

Elliott's Affective Reasoner is a collection of programs that is able to reason about human emotions. The system covers a set of 26 emotion categories from Ortony et al. (1988).

Kaya (2004) and Strapparava and Ozbal (2010) both have worked on inferring emotions associated with colors using semantic similarity. Their
research found that Americans perceive red as ex- a set of trusted workers who had been consistently
citement, yellow as cheer, purple as dignity and working on similar tasks for us.
associate blue with comfort and security. Other
research includes that geared toward discovering 3.2 Task Design
culture-specific color-concept associations (Gage, Our task was designed to collect a linguistically
1993) and color preference, for example, in chil- rich set of color terms, emotions, and concepts
dren vs. adults (Ou et al., 2011). that were associated with a large set of colors,
specifically the 152 RGB values corresponding to
3 Data Collection facial features of cartoon human avatars. In to-
tal we had 36 colors for hair/eyebrows, 18 for
In order to collect color-concept and color- eyes, 27 for lips, 26 for eye shadows, 27 for fa-
emotion associations, we use Amazon Mechani- cial mask and 18 for skin. These data is necessary
cal Turk5 . It is a fast and relatively inexpensive to achieve our long-term goal which is to model
way to get a large amount of data from many cul- natural human-computer interactions in a virtual
tures all over the world. world domain such as the avatar editor.
We designed two MTurk tasks. For Task 1, we
3.1 MTurk and Data Quality showed a swatch for one RGB value and asked
Amazon Mechanical Turk is a crowdsourcing 50 workers to name the color, describe emotions
platform that has been extensively used for ob- this color evokes and define a set of concepts as-
taining low-cost human annotations for various sociated with that color. For Task 2, we showed a
linguistic tasks over the last few years (Callison- particular facial feature and a swatch in a particu-
Burch, 2009). The quality of the data obtained lar color, and asked 50 workers to name the color
from non-expert annotators, also referred to as and describe the concepts and emotions associ-
workers or turkers, was investigated by Snow et ated with that color. Figure 1 shows what would
al (2008). Their empirical results show that the be presented to worker for Task 2.
quality of non-expert annotations is comparable Q1. How would you name this color?
to the quality of expert annotations on a variety of Q2. What emotion does this color evoke?
natural language tasks, but the cost of the annota- Q3. What concepts do you associate with it?
tion is much lower.
There are various quality control strategies that
can be used to ensure annotation quality. For in-
stance, one can restrict a crowd by creating a
pilot task that allows only workers who passed
the task to proceed with annotations (Chen and Figure 1: Example of MTurk Task 2. Task 1 is the
Dolan, 2011). In addition, new quality control same except that only a swatch is given.
mechanisms have been recently introduced e.g.,
Masters. They are groups of workers who are The design that we suggested has a minor lim-
trusted for their consistent high quality annota- itation in that a color swatch may display differ-
tions, but to employ them costs more. ently on different monitors. However, we hope to
Our task required direct natural language in- overcome this issue by collecting 50 annotations
e c
put from workers and did not include any mul- per RGB value. The example color emotion
tiple choice questions (which tend to attract more concept associations produced by different anno-
cheating). Thus, we limited our quality control ef- tators ai are shown below:
forts to (1) checking for empty input fields and (2)
blocking copy/paste functionality on a form. We [R=222, G=207, B=186] (a1 ) light golden
e c
did not ask workers to complete any qualification yellow purity, happiness butter cookie,
e c
tasks because it is impossible to have gold stan- vanilla; (a2 ) gold cheerful, happy sun,
e c
dard answers for color-emotion and color-concept corn; (a3 ) golden sexy beach, jewelery.
associations. In addition, we limited our crowd to
[R=218, G=97, B=212] (a4 ) pinkish pur-
e c
5
http://www.mturk.com ple peace, tranquility, stressless justin

biebers headphones, someday perfume; (a5 ) orange), darkness (dark, light, medium), satura-
e c
pink happiness rose, bougainvillea. tion (grayish, vivid), and brightness (deep, pale)
(Mojsilovic, 2002). Interestingly, we observe
In addition, we collected data about workers
these dimensions in CL EX by looking for B&K
gender, age, native language, number of years of
color terms and their frequent collocations. We
experience with English, and color preferences.
present the top 10 color collocations for the B&K
This data is useful for investigating variance in an-
colors in Table 2. As can be seen, color terms
notations for color-emotion-concept associations
truly are distinguished by darkness, saturation and
among workers from different cultural and lin-
brightness terms e.g., light, dark, greenish, deep.
guistic backgrounds.
In addition, we find that color terms are also as-
4 Data Analysis sociated with color-specific collocations, e.g., sky
blue, chocolate brown, pea green, salmon pink,
We collected 15,200 annotations evenly divided carrot orange. These collocations were produced
between the two tasks over 12 days. In total, 915 by annotators to describe the color of particular
workers (41% male, 51% female and 8% who did RGB values. We investigate these color-concept
not specify), mainly from India and United States, associations in more details in Section 4.3.
completed our tasks as shown in Table 1. 18% In total, the CL EX has 2,315 unique color
workers produced 20 or more annotations. They
spent 78 seconds on average per annotation with P
an average salary rate $2.3 per hour ($0.05 per Color Co-occurrences
completed task). white off, antique, half, dark, black, bone, 0.62
milky, pale, pure, silver
Country Annotations black light, blackish brown, brownish, 0.43
brown, jet, dark, green, off, ash,
India 7844
blackish grey
United States 5824 red dark, light, dish brown, brick, or- 0.59
Canada 187 ange, brown, indian, dish, crimson,
United Kingdom 172 bright
Colombia 100 green dark, light, olive, yellow, lime, for- 0.54
est, sea, dark olive, pea, dirty
Table 1: Demographic information about annota- yellow light, dark, green, pale, golden, 0.63
tors: top 5 countries represented in our dataset. brown, mustard, orange, deep,
bright
In total, we collected 2,315 unique color terms, blue light, sky, dark, royal, navy, baby, 0.55
grey, purple, cornflower, violet
3,397 unique affect terms, and 1,957 unique con-
brown dark, light, chocolate, saddle, red- 0.67
cepts for the given 152 RGB values. In the dish, coffee, pale, deep, red,
sections below we discuss our findings on color medium
naming, color-emotion and color-concept associ- pink dark, light, hot, pale, salmon, baby, 0.55
ations. We also give a comparison of annotated deep, rose, coral, bright
affect terms and concepts from C LEX and other purple light, dark, deep, blue, bright, 0.69
existing lexicons. medium, pink, pinkish, bluish,
pretty
4.1 Color Terms orange light, burnt, red, dark, yellow, 0.68
brown, brownish, pale, bright, car-
Berlin and Kay (1988) state that as languages rot
evolve they acquire new color terms in a strict gray dark, light, blue, brown, charcoal, 0.62
chronological order. When a language has only leaden, greenish, grayish blue, pale,
two colors they are white (light, warm) and black grayish brown
(dark, cold). English is considered to have 11 ba-
sic colors: white, black, red, green, yellow, blue, Table 2: Top 10 color term collocations for the
brown, pink, purple, orange and gray, which is 11 B&K colors; co-occurrences are sorted by fre-
known as the B&K order. quency
P10 from left to right in a decreasing order;
In addition, colors can be distinguished along at 1 p( | color) is a total estimated probability
most three independent dimensions of hue (olive, of the top 10 co-occurrences.

Agreement Color Term Valitutti, 2004). Of this set, 41% appeared at
% of overall Exact match 0.492 least once in CL EX. We also looked specifically
agreement Substring match 0.461 at the set of terms labeled as emotions in the
Free-marginal Exact match 0.458 W ORD N ET-A FFECT hierarchy. Of these, 12 are
Kappa Substring match 0.424 positive emotions and 10 are negative emotions.
We found that 9 out of 12 positive emotion
Table 3: Inter-annotator agreement on assigning terms (except self-pride, levity and fearlessness)
names to RGB values: 100 annotators, 152 RGB and 9 out of 10 negative emotion terms (except in-
values and 16 color categories including 11 B&K gratitude) also appear in CL EX as shown in Table
colors, 4 additional colors and none of the above. 5. Thus, we can conclude that annotators do not
names for the set of 152 RGB values. The associate any colors with self-pride, levity, fear-
inter-annotator agreement rate on color naming is lessness and ingratitude. In addition, some emo-
shown in Table 3. We report free-marginal Kappa tions were associated more frequently with colors
(Randolph, 2005) because we did not force an- than others. For instance, positive emotions like
notators to assign certain number of RGB values calmness, joy, love are more frequent in CL EX
to a certain number of color terms. Additionally, than expectation and ingratitude; negative emo-
we report inter-annotator agreement for an exact tions like sadness, fear are more frequent than
string match e.g., purple, green and a substring shame, humility and daze.
match e.g., pale yellow = yellow = golden yellow.
Positive Freq. Negative Freq.
4.2 Color-Emotion Associations calmness 1045 sadness 356
In total, the CL EX lexicon has 3,397 unique af- joy 527 fear 250
fect terms representing feelings (calm, pleasure), love 482 anxiety 55
emotions (joy, love, anxiety), attitudes (indiffer- hope 147 despair 19
ence, caution), and mood (anger, amusement). affection 86 compassion 10
The affect terms in C LEX include the 8 basic emo- enthusiasm 33 dislike 8
tions from (Plutchik, 1980): joy, sadness, anger, liking 5 shame 5
fear, disgust, surprise, trust and anticipation6 expectation 3 humility 3
CL EX is a very rich lexicon because we did not gratitude 3 daze 1
restrict our annotators to any specific set of affect
terms. A wide range of parts-of-speech are rep- Table 5: W ORD N ET-A FFECT positive and neg-
resented, as shown in the first column in Table 4. ative emotion terms from CL EX. Emotions are
For instance, the term love is represented by other sorted by frequency in decreasing order from the
semantically related terms such as: lovely, loved, total 27,802 annotations.
loveliness, loveless, love-able and the term joy is Next, we analyze the color-emotion associ-
represented as enjoy, enjoyable, enjoyment, joy- ations in CL EX in more detail and compare
ful, joyfulness, overjoyed. them with the only other publicly-available color-
emotion lexicon, E MO L EX. Recall that E MO L EX
POS Affect Terms, % Concepts, % (Mohammad, 2011a) has 11 B&K colors associ-
Nouns 79 52 ated with 8 basic positive and negative emotions
Adjectives 12 29 from (Plutchik, 1980). Affect terms in CL EX are
Adverbs 3 5 not labeled as conveying positive or negative emo-
Verbs 6 12 tions. Instead, we use the overlapping 289 affect
terms between W ORD N ET-A FFECT and CL EX
Table 4: Main syntactic categories for affect terms
and propagate labels from W ORD N ET-A FFECT to
and concepts in CL EX.
the corresponding affect terms in CL EX. As a re-
The
manually constructed portion of sult we discover positive and negative affect term
W ORD N ET-A FFECT includes 101 positive associations with the 11 B&K colors. Table 6
and 188 negative affect terms (Strapparava and shows the percentage of positive and negative af-
6
The set of 8 Plutchiks emotions is a superset of emotions fect term associations with colors for both CL EX
from (Ekman, 1992). and E MO L EX.

Positive Negative a disagreement in color-emotion associations be-
CL EX EL CL EX EL tween CL EX and E MO L EX. For instance antic-
white 2.5 20.1 0.3 2.9 ipation is associated with orange in CL EX com-
black 0.6 3.9 9.3 28.3 pared to white, red or yellow in E MO L EX. We also
red 1.7 8.0 8.2 21.6 found quite a few inconsistent associations with
green 3.3 15.5 2.7 4.7 the disgust emotion. This inconsistency may be
yellow 3.0 10.8 0.7 6.9 explained by several reasons: (a) E MO L EX asso-
blue 5.9 12.0 1.6 4.1 ciates emotions with colors through concepts, but
brown 6.5 4.8 7.6 9.4 CL EX has color-emotion associations obtained
pink 5.6 7.8 1.1 1.2 directly from annotators; (b) CL EX has 3,397
purple 3.1 5.7 1.8 2.5 affect terms compared to 8 basic emotions in
orange 1.6 5.4 1.7 3.8 E MO L EX. Therefore, it may be introducing some
gray 1.0 5.7 3.6 14.1 ambiguous color-emotion associations.
Finally, we investigate cross-cultural differ-
Table 6: The percentage of affect terms associated ences in color-emotion associations between the
with B&K colors in CL EX and E MO L EX (similar two most representative groups of our annotators:
color-emotion associations are shown in bold). US-based and India-based. We consider the 8
The percentage of color-emotion associations Plutchiks emotions and allow associations with
in CL EX and E MO L EX differs because the set of all possible color terms (rather than only 11 B&K
affect terms in CL EX consists of 289 positive and colors). We show top 5 colors associated with
negative affect terms compared to 8 affect terms emotions for two groups of annotators in Figure 2.
in E MO L EX. Nevertheless, we observe the same For example, we found that US-based annotators
pattern as (Mohammad, 2011a) for negative emo- associate pink with joy, dark brown with trust vs.
tions. They are associated with black, red and India-based annotators who associate yellow with
gray colors, except yellow becomes a color of joy and blue with trust.
positive emotions in CL EX. Moreover, we found
4.3 Color-Concept Associations
the associations with the color brown to be am-
biguous as it was associated with both positive In total, workers annotated the 152 RGB values
and negative emotions. In addition, we did not ob- with 37,693 concepts which is on average 2.47
serve strong associations between white and pos- concepts compared to 1.82 affect term per anno-
itive emotions. This may be because white is the tation. CL EX contains 1,957 unique concepts in-
color of grief in India. The rest of the positive cluding 1,667 nouns, 23 verbs, 28 adjectives, and
emotions follow the E MO L EX pattern and are as- 12 adverbs. We investigate an overlap of con-
sociated with green, pink, blue and purple colors. cepts by part-of-speech tag between CL EX and
Next, we perform a detailed comparison be- other lexicons including E MO L EX (EL), Affec-
tween CL EX and E MO L EX color-emotion asso- tive Norms of English Words (AN), General In-
ciations for the 11 B&K colors and the 8 basic quirer (GI). The results are shown in Table 8.
emotions from (Plutchik, 1980) in Table 7. Recall Finally, we generate concept clusters associ-
that annotations in E MO L EX are done by workers ated with yellow, white and brown colors in Fig-
from the USA only. Thus, we report two num- ure 3. From the clusters, we observe the most
bers for CL EX - annotations from workers from frequent k concepts associated with these colors
the USA (CA ) and all annotations (C). We take have a correlation with either positive or negative
E MO L EX results from (Mohammad, 2011c). We emotion. For example, white is frequently associ-
observe a strong correlation between CL EX and ated with snow, milk, cloud and all of these con-
E MO L EX affect lexicons for some color-emotion cepts evolve positive emotions. This observation
associations. For instance, anger has a strong as- helps resolve the ambiguity in color-emotion as-
sociation with red and brown, anticipation with sociations we found in Table 7.
green, fear with black, joy with pink, sadness
5 Conclusions
with black, brown and gray, surprise with yel-
low and orange, and finally, trust is associated We have described a large-scale crowdsourcing
with blue and brown. Nonetheless, we also found effort aimed at constructing a rich color-emotion-

white black red green yellow blue brown pink purple orange grey
C - 3.6 43.4 0.3 0.3 0.3 3.3 0.6 0.3 1.5 2.1
anger CA - 3.8 40.6 0.8 - - 4.5 - 0.8 2.3 0.8
EA 2.1 30.7 32.4 5.0 5.0 2.4 6.6 0.5 2.3 2.5 9.9
C 0.3 24.0 0.3 0.6 0.3 4.2 11.4 0.3 2.2 0.3 10.3
sadness CA - 22.2 - 0.6 - 5.3 9.4 - 4.1 - 12.3
EA 3.0 36.0 18.6 3.4 5.4 5.8 7.1 0.5 1.4 2.1 16.1
C 0.8 43.0 8.9 2.0 1.2 0.4 6.1 0.4 0.8 0.4 2.0
fear CA - 29.5 10.5 3.2 1.1 - 3.2 - 1.1 1.1 4.2
EA 4.5 31.8 25.0 3.5 6.9 3.0 6.1 1.3 2.3 3.3 11.8
C - 2.3 1.1 11.2 1.1 1.1 24.7 1.1 3.4 1.1 -
disgust CA - - - 14.8 1.8 - 33.3 - 1.8 - -
EA 2.0 33.7 24.9 4.8 5.5 1.9 9.7 1.1 1.8 3.5 10.5
C 1.0 0.2 0.2 3.4 5.7 4.2 4.2 9.1 4.4 4.0 0.6
joy CA 0.9 - 0.3 3.3 4.5 4.8 2.7 10.6 4.2 3.9 0.6
EA 21.8 2.2 7.4 14.1 13.4 11.3 3.1 11.1 6.3 5.8 2.8
C - - 1.2 3.5 1.2 17.4 8.1 1.2 1.2 5.8 1.2
trust CA - - 3.0 6.1 3.0 3.0 9.1 - - 3.0 3.0
EA 22.0 6.3 8.4 14.2 8.3 14.4 5.9 5.5 4.9 3.8 5.8
C - - - 3.3 6.7 6.7 3.3 3.3 6.7 13.3 3.3
surprise CA - - - - 5.6 5.6 - 5.6 11.1 11.1 -
EA 11.0 13.4 21.0 8.3 13.5 5.2 3.4 5.2 4.1 5.6 8.8
C - - - 5.3 5.3 - 5.3 5.3 - 15.8 5.3
anticipation CA - - - - - - - 10.0 - 10.0 10.0
EA 16.2 7.5 11.5 16.2 10.7 9.5 5.7 5.9 3.1 4.9 8.4

Table 7: The percentage of the 8 basic emotions associated with 11 B&K colors in CL EX vs. E MO L EX,
e.g., sadness is associated with black by 36% of annotators in E MOLEX (EA ), 22.1% in CL EX (CA ) by
US-based annotators only and 24% in CL EX (C) by all annotators; we report zero associations by -.

(a) Joy - US: 331, I: 154 (b) Trust - US: 33, I: 47 (c) Surprise - US: 18, I: 12 (d) Anticipation - US: 10, I: 9

(e) Anger - US: 133, I: 160 (f) Sadness - US: 171, I: 142 (g) Fear - US: 95, I: 105 (h) Disgust - US: 54, I: 16

Figure 2: Apparent cross-cultural differences in color-emotion associations between US- and India-
based annotators. 10.6% of US workers associated joy with pink, while 7.1% India-based workers
associated joy with yellow (based on 331 joy associations from the US and from 154 India).

(a) Yellow (b) Brown (c) White

Figure 3: Concept clusters of color-concept associations for ambiguous colors: yellow, white, brown.

concept association lexicon, CL EX. This lexicon the way that colors are associated with concepts
links concepts, color terms and emotions to spe- and emotions in languages other than English.
cific RGB values. This lexicon may help to dis-
ambiguate objects when modeling conversational Acknowledgments
interactions in many domains. We have examined
We are grateful to everyone in the NLP group
the association between color terms and positive
at Microsoft Research for helpful discussion and
or negative emotions.
feedback especially Chris Brocket, Piali Choud-
Our work also investigated cross-cultural dif- hury, and Hassan Sajjad. We thank Natalia Rud
ferences in color-emotion associations between from Tyumen State University, Center of Linguis-
India- and US-based annotators. We identified tic Education for helpful comments and sugges-
frequent color-concept associations, which sug- tions.
gests that concepts associated with a particular
color may express the same sentiment as the color.
Our future work includes applying statistical References
inference for discovering a hidden structure of
Cecilia Ovesdotter Alm, Dan Roth, and Richard
concept-emotion associations. Moreover, auto- Sproat. 2005. Emotions from text: machine
matically identifying the strength of association learning for text-based emotion prediction. In
between a particular concept and emotions is an- Proceedings of the conference on Human Lan-
other task which is more difficult than just iden- guage Technology and Empirical Methods in Natu-
tifying the polarity of the word. We are also in- ral Language Processing, HLT 05, pages 579586,
terested in using a similar approach to investigate Stroudsburg, PA, USA. Association for Computa-
tional Linguistics.
CL EXAN CL EXEL CL EXGI Brent Berlin and Paul Kay. 1988. Basic Color Terms:
their Universality and Evolution. Berkley: Univer-
Noun 287 Noun 574 Noun 708
sity of California Press.
Verb 4 Verb 13 Verb 17
M. Bortoli and J. Maroto. 2001. Translating colors in
Adj 28 Adj 53 Adj 66 web site localisation. In In The Proceedings of Eu-
Adv 1 Adv 2 Adv 3 ropean Languages and the Implementation of Com-
320 642 794 munication and Information Technologies (Elicit).
AN\CL EX EL\CL EX GI\CL EX M. Bradley and P. Lang. 1999. Affective forms for
712 7,445 11,101 english words (anew): Instruction manual and af-
CL EX\AN CL EX\EL CL EX\GI fective ranking.
1,637 1,315 1,163 Chris Callison-Burch. 2009. Fast, cheap, and creative:
evaluating translation quality using amazons me-
Table 8: An overlap of concepts by part-of- chanical turk. In EMNLP 09: Proceedings of the
speech tag between CL EX and existing lexicons. 2009 Conference on Empirical Methods in Natural
Language Processing, pages 286295, Stroudsburg,
CL EXGI stands for the intersection of sets,
PA, USA. Association for Computational Linguis-
CL EX\GI denotes the difference of sets. tics.

David L. Chen and William B. Dolan. 2011. Building Aleksandra Mojsilovic. 2002. A method for color
a persistent workforce on mechanical turk for mul- naming and description of color composition in im-
tilingual data collection. In Proceedings of The 3rd ages. In Proc. IEEE Int. Conf. Image Processing,
Human Computation Workshop (HCOMP 2011), pages 789792.
August. Andrew Ortony, Gerald L. Clore, and Allan Collins.
Paul Ekman. 1992. An argument for basic emotions. 1988. The Cognitive Structure of Emotions. Cam-
Cognition & Emotion, 6(3):169200. bridge University Press, July.
Clark Davidson Elliott. 1992. The affective reasoner: Li-Chen Ou, M. Ronnier Luo, Pei-Li Sun, Neng-
a process model of emotions in a multi-agent sys- Chung Hu, and Hung-Shing Chen. 2011. Age ef-
tem. Ph.D. thesis, Evanston, IL, USA. UMI Order fects on colour emotion, preference, and harmony.
No. GAX92-29901. Color Research and Application.
R. Plutchik, 1980. A general psychoevolutionary the-
J. Gage. 1993. Color and culture: Practice and mean-
ory of emotion, pages 333. Academic press, New
ing from antiquity to abstraction, univ. of calif.
York.
C. Hardin and L. Maffi. 1997. Color Categories in Justus J. Randolph. 2005. Author note: Free-marginal
Thought and Language. multirater kappa: An alternative to fleiss fixed-
N. Jacobson and W. Bender. 1996. Color as a deter- marginal multirater kappa.
mined communication. IBM Syst. J., 35:526538, P. Sable and O. Akcay. 2010. Color: Cross cultural
September. marketing perspectives as to what governs our re-
N. Kaya. 2004. Relationship between color and emo- sponse to it. In In The Proceedings of ASSBS, vol-
tion: a study of college students. College Student ume 17.
Journal. Rion Snow, Brendan OConnor, Daniel Jurafsky, and
Efthymios Kouloumpis, Theresa Wilson, and Johanna Andrew Y. Ng. 2008. Cheap and fastbut is it
Moore. 2011. Twitter sentiment analysis: The good good?: evaluating non-expert annotations for natu-
the bad and the OMG! In Proc. ICWSM. ral language tasks. In Proceedings of the Confer-
ence on Empirical Methods in Natural Language
George A. Miller. 1995. Wordnet: A lexical database
Processing, EMNLP 08, pages 254263, Strouds-
for english. Communications of the ACM, 38:39
burg, PA, USA. Association for Computational Lin-
41.
guistics.
Saif M. Mohammad and Peter D. Turney. 2010. Emo- Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith,
tions evoked by common words and phrases: using and Daniel M. Ogilvie. 1966. The General In-
mechanical turk to create an emotion lexicon. In quirer: A Computer Approach to Content Analysis.
Proceedings of the NAACL HLT 2010 Workshop on MIT Press.
Computational Approaches to Analysis and Gener- Carlo Strapparava and Gozde Ozbal. 2010. The color
ation of Emotion in Text, CAAGET 10, pages 26 of emotions in text. COLING, pages 2832.
34, Stroudsburg, PA, USA. Association for Compu- C. Strapparava and A. Valitutti. 2004. Wordnet-affect:
tational Linguistics. an affective extension of wordnet. In In: Proceed-
Saif Mohammad. 2011a. Colourful language: Mea- ings of the 4th International Conference on Lan-
suring word-colour associations. In Proceedings guage Resources and Evaluation (LREC 2004), Lis-
of the 2nd Workshop on Cognitive Modeling and bon, pages 10831086.
Computational Linguistics, pages 97106, Port-
land, Oregon, USA, June. Association for Compu-
tational Linguistics.
Saif Mohammad. 2011b. From once upon a time
to happily ever after: Tracking emotions in novels
and fairy tales. In Proceedings of the 5th ACL-
HLT Workshop on Language Technology for Cul-
tural Heritage, Social Sciences, and Humanities,
pages 105114, Portland, OR, USA, June. Associa-
tion for Computational Linguistics.
Saif M. Mohammad. 2011c. Even the abstract have
colour: consensus in word-colour associations. In
Proceedings of the 49th Annual Meeting of the As-
sociation for Computational Linguistics: Human
Language Technologies: short papers - Volume 2,
HLT 11, pages 368373, Stroudsburg, PA, USA.
Association for Computational Linguistics.

Extending the Entity-based Coherence Model with Multiple Ranks

Vanessa Wei Feng Graeme Hirst


Department of Computer Science Department of Computer Science
University of Toronto University of Toronto
Toronto, ON, M5S 3G4, Canada Toronto, ON, M5S 3G4, Canada
weifeng@cs.toronto.edu gh@cs.toronto.edu

Abstract 3.1 below). However, coherence is matter of de-


gree rather than a binary distinction, so a model
We extend the original entity-based coher- based only on such pairwise rankings is insuffi-
ence model (Barzilay and Lapata, 2008) ciently fine-grained and cannot capture the sub-
by learning from more fine-grained coher- tle differences in coherence between the permuted
ence preferences in training data. We asso-
documents.
ciate multiple ranks with the set of permuta-
tions originating from the same source doc- Since the first appearance of B&Ls model,
ument, as opposed to the original pairwise several extensions have been proposed (see Sec-
rankings. We also study the effect of the tion 2.3 below), primarily focusing on modify-
permutations used in training, and the effect ing or enriching the original feature set by incor-
of the coreference component used in en- porating other document information. By con-
tity extraction. With no additional manual
trast, we wish to refine the learning procedure
annotations required, our extended model
is able to outperform the original model on in a way such that the resulting model will be
two tasks: sentence ordering and summary able to evaluate coherence on a more fine-grained
coherence rating. level. Specifically, we propose a concise exten-
sion to the standard entity-based coherence model
by learning not only from the original docu-
1 Introduction ment and its corresponding permutations but also
from ranking preferences among the permutations
Coherence is important in a well-written docu-
themselves.
ment; it helps make the text semantically mean-
ingful and interpretable. Automatic evaluation We show that this can be done by assigning a
of coherence is an essential component of vari- suitable objective score for each permutation indi-
ous natural language applications. Therefore, the cating its dissimilarity from the original one. We
study of coherence models has recently become call this a multiple-rank model since we train our
an active research area. A particularly popular model on a multiple-rank basis, rather than tak-
coherence model is the entity-based local coher- ing the original pairwise ranking approach. This
ence model of Barzilay and Lapata (B&L) (2005; extension can also be easily combined with other
2008). This model represents local coherence extensions by incorporating their enriched feature
by transitions, from one sentence to the next, in sets. We show that our multiple-rank model out-
the grammatical role of references to entities. It performs B&Ls basic model on two tasks, sen-
learns a pairwise ranking preference between al- tence ordering and summary coherence rating,
ternative renderings of a document based on the evaluated on the same datasets as in Barzilay and
probability distribution of those transitions. In Lapata (2008).
particular, B&L associated a lower rank with au- In sentence ordering, we experiment with
tomatically created permutations of a source doc- different approaches to assigning dissimilarity
ument, and learned a model to discriminate an scores and ranks (Section 5.1.1). We also exper-
original text from its permutations (see Section iment with different entity extraction approaches

315
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 315324,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
Manila Miles Island Quake Baco are simply clustered by string matching.
1 X X 2.2 Evaluation Tasks
2 S O
Two evaluation tasks for Barzilay and Lapata
3 X X X X X
(2008)s entity-based model are sentence order-
Table 1: A fragment of an entity grid for five entities ing and summary coherence rating.
across three sentences. In sentence ordering, a set of random permu-
tations is created for each source document, and
the learning procedure is conducted on this syn-
(Section 5.1.2) and different distributions of per-
thetic mixture of coherent and incoherent docu-
mutations used in training (Section 5.1.3). We
ments. Barzilay and Lapata (2008) experimented
show that these two aspects are crucial, depend-
on two datasets: news articles on the topic of
ing on the characteristics of the dataset.
earthquakes (Earthquakes) and narratives on the
2 Entity-based Coherence Model topic of aviation accidents (Accidents). A train-
ing data instance is constructed as a pair con-
2.1 Document Representation sisting of a source document and one of its ran-
The original entity-based coherence model is dom permutations, and the permuted document
based on the assumption that a document makes is always considered to be less coherent than the
repeated reference to elements of a set of entities source document. The entity transition features
that are central to its topic. For a document d, an are then used to train a support vector machine
entity grid is constructed, in which the columns ranker (Joachims, 2002) to rank the source docu-
represent the entities referred to in d, and rows ments higher than the permutations. The model is
represent the sentences. Each cell corresponds tested on a different set of source documents and
to the grammatical role of an entity in the corre- their permutations, and the performance is evalu-
sponding sentence: subject (S), object (O), nei- ated as the fraction of correct pairwise rankings in
ther (X), or nothing (). An example fragment the test set.
of an entity grid is shown in Table 1; it shows In summary coherence rating, a similar exper-
the representation of three sentences from a text imental framework is adopted. However, in this
on a Philippine earthquake. B&L define a lo- task, rather than training and evaluating on a set
cal transition as a sequence {S , O, X, }n , repre- of synthetic data, system-generated summaries
senting the occurrence and grammatical roles of and human-composed reference summaries from
an entity in n adjacent sentences. Such transi- the Document Understanding Conference (DUC
tion sequences can be extracted from the entity 2003) were used. Human annotators were asked
grid as continuous subsequences in each column. to give a coherence score on a seven-point scale
For example, the entity Manila in Table 1 has for each item. The pairwise ranking preferences
a bigram transition {S , X} from sentence 2 to 3. between summaries generated from the same in-
The entity grid is then encoded as a feature vector put document cluster (excluding the pairs consist-
(d) = (p1 (d), p2 (d), . . . , pm (d)), where pt (d) is ing of two human-written summaries) are used by
the probability of the transition t in the entity grid, a support vector machine ranker to learn a dis-
and m is the number of transitions with length no criminant function to rank each pair according to
more than a predefined optimal transition length their coherence scores.
k. pt (d) is computed as the number of occurrences
of t in the entity grid of document d, divided by 2.3 Extended Models
the total number of transitions of the same length Filippova and Strube (2007) applied Barzilay and
in the entity grid. Lapatas model on a German corpus of newspa-
For entity extraction, Barzilay and Lapata per articles with manual syntactic, morphological,
(2008) had two conditions: Coreference+ and and NP coreference annotations provided. They
Coreference. In Coreference+, entity corefer- further clustered entities by semantic relatedness
ence relations in the document were resolved by as computed by the WikiRelated! API (Strube and
an automatic coreference resolution tool (Ng and Ponzetto, 2006). Though the improvement was
Cardie, 2002), whereas in Coreference, nouns not significant, interestingly, a short subsection in

316
their paper described their approach to extending 3.1 Sentence Ordering
pairwise rankings to longer rankings, by supply-
In the standard entity-based model, a discrimina-
ing the learner with rankings of all renderings as
tive system is trained on the pairwise rankings be-
computed by Kendalls , which is one of our
tween source documents and their permutations
extensions considered in this paper. Although
(see Section 2.2). However, a model learned from
Filippova and Strube simply discarded this idea
these pairwise rankings is not sufficiently fine-
because it hurt accuracies when tested on their
grained, since the subtle differences between the
data, we found it a promising direction for further
permutations are not learned. Our major contribu-
exploration. Cheung and Penn (2010) adapted
tion is to further differentiate among the permuta-
the standard entity-based coherence model to the
tions generated from the same source documents,
same German corpus, but replaced the original
rather than simply treating them all as being of the
linguistic dimension used by Barzilay and Lap-
same degree of coherence.
ata (2008) grammatical role with topologi-
cal field information, and showed that for German Our fundamental assumption is that there exists
text, such a modification improves accuracy. a canonical ordering for the sentences of a doc-
ument; therefore we can approximate the degree
For English text, two extensions have been pro- of coherence of a document by the similarity be-
posed recently. Elsner and Charniak (2011) aug- tween its actual sentence ordering and that canon-
mented the original features used in the standard ical sentence ordering. Practically, we automati-
entity-based coherence model with a large num- cally assign an objective score for each permuta-
ber of entity-specific features, and their extension tion to estimate its dissimilarity from the source
significantly outperformed the standard model document (see Section 4). By learning from all
on two tasks: document discrimination (another the pairs across a source document and its per-
name for sentence ordering), and sentence inser- mutations, the effective size of the training data
tion. Lin et al. (2011) adapted the entity grid rep- is increased while no further manual annotation
resentation in the standard model into a discourse is required, which is favorable in real applica-
role matrix, where additional discourse informa- tions when available samples with manually an-
tion about the document was encoded. Their ex- notated coherence scores are usually limited. For
tended model significantly improved ranking ac- r source documents each with m random permuta-
curacies on the same two datasets used by Barzi- tions, the number of training instances in the stan-
lay and Lapata (2008) as well as on the Wall Street dard entity-based model is therefore r m, while
Journal corpus. in our
 multiple-rank
 model learning process, it is
r m+1 1
r m 2 > r m, when m > 2.
However, while enriching or modifying the 2 2
original features used in the standard model is cer-
tainly a direction for refinement of the model, it 3.2 Summary Coherence Rating
usually requires more training data or a more so- Compared to the standard entity-based coherence
phisticated feature representation. In this paper, model, our major contribution in this task is to
we instead modify the learning approach and pro- show that by automatically assigning an objective
pose a concise and highly adaptive extension that score for each machine-generated summary to es-
can be easily combined with other extended fea- timate its dissimilarity from the human-generated
tures or applied to different languages. summary from the same input document cluster,
we are able to achieve performance competitive
with, or even superior to, that of B&Ls model
3 Experimental Design without knowing the true coherence score given
by human judges.
Following Barzilay and Lapata (2008), we wish Evaluating our multiple-rank model in this task
to train a discriminative model to give the cor- is crucial, since in summary coherence rating,
rect ranking preference between two documents the coherence violations that the reader might en-
in terms of their degree of coherence. We experi- counter in real machine-generated texts can be
ment on the same two tasks as in their work: sen- more precisely approximated, while the sentence
tence ordering and summary coherence rating. ordering task is only partially capable of doing so.

317
4 Dissimilarity Metrics by Bollegala et al. (2006). This metric esti-
mates the quality of a particular sentence order-
As mentioned previously, the subtle differences ing by the number of correctly arranged contin-
among the permutations of the same source docu- uous sentences, compared to the reference order-
ment can be used to refine the model learning pro- ing. For example, if = (. . . , 3, 4, 5, 7, . . . , oN ),
cess. Considering an original document d and one then {3, 4, 5} is considered as continuous while
of its permutations, we call = (1, 2, . . . , N) the {3, 4, 5, 7} is not. Average continuity is calculated
reference ordering, which is the sentence order- as
ing in d, and = (o1 , o2 , . . . , oN ) the test order- n

1 X
AC = exp log (Pi + ) ,

ing, which is the sentence ordering in that permu-

n 1 i=2
tation, where N is the number of sentences being
rendered in both documents. where n = min(4, N) is the maximum number
In order to approximate different degrees of co- of continuous sentences to be considered, and
herence among the set of permutations which bear = 0.01. Pi is the proportion of continuous sen-
the same content, we need a suitable metric to tences of length i in that are also continuous in
quantify the dissimilarity between the test order- the reference ordering . To represent the dis-
ing and the reference ordering . Such a metric similarity between the two orderings and , we
needs to satisfy the following criteria: (1) It can be use its complement AC 0 = 1 AC, such that the
automatically computed while being highly corre- larger AC 0 is, the more dissimilar two orderings
lated with human judgments of coherence, since are2 .
additional manual annotation is certainly undesir- Edit distance (ED): Edit distance is a com-
able. (2) It depends on the particular sentence monly used metric in information theory to mea-
ordering in a permutation while remaining inde- sure the difference between two sequences. Given
pendent of the entities within the sentences; oth- a test ordering , its edit distance is defined as the
erwise our multiple-rank model might be trained minimum number of edits (i.e., insertions, dele-
to fit particular probability distributions of entity tions, and substitutions) needed to transform it
transitions rather than true coherence preferences. into the reference ordering . For permutations,
In our work we use three different metrics: the edits are essentially movements, which can
Kendalls distance, average continuity, and edit be considered as equal numbers of insertions and
distance. deletions.
Kendalls distance: This metric has been
5 Experiments
widely used in evaluation of sentence ordering
(Lapata, 2003; Lapata, 2006; Bollegala et al., 5.1 Sentence Ordering
2006; Madnani et al., 2007)1 . It measures the Our first set of experiments is on sentence order-
disagreement between two orderings and in ing. Following Barzilay and Lapata (2008), we
terms of the number of inversions of adjacent sen- use all transitions of length 3 for feature extrac-
tences necessary to convert one ordering into an- tion. In addition, we explore three specific aspects
other. Kendalls distance is defined as in our experiments: rank assignment, entity ex-
2m traction, and permutation generation.
= ,
N(N 1) 5.1.1 Rank Assignment
In our multiple-rank model, pairwise rankings
where m is the number of sentence inversions nec-
between a source document and its permutations
essary to convert to .
are extended into a longer ranking with multiple
Average continuity (AC): Following Zhang
ranks. We assign a rank to a particular permuta-
(2011), we use average continuity as the sec-
tion, based on the result of applying a chosen dis-
ond dissimilarity metric. It was first proposed
similarity metric from Section 4 (, AC, or ED) to
1
Filippova and Strube (2007) found that their perfor- the sentence ordering in that permutation.
mance dropped when using this metric for longer rankings; We experiment with two different approaches
but they were using data in a different language and with to assigning ranks to permutations, while each
manual annotations, so its effect on our datasets is worth try-
ing nonetheless. 2
We will refer to AC 0 as AC from now on.

318
source document is always assigned a zero (the 5.1.3 Permutation Generation
highest) rank. The quality of the model learned depends on
In the raw option, we rank the permutations di- the set of permutations used in training. We are
rectly by their dissimilarity scores to form a full not aware of how B&Ls permutations were gen-
ranking for the set of permutations generated from erated, but we assume they are generated in a per-
the same source document. fectly random fashion.
Since a full ranking might be too sensitive to However, in reality, the probabilities of seeing
noise in training, we also experiment with the documents with different degrees of coherence are
stratified option, in which C ranks are assigned to not equal. For example, in an essay scoring task,
the permutations generated from the same source if the target group is (near-) native speakers with
document. The permutation with the smallest dis- sufficient education, we should expect their essays
similarity score is assigned the same (zero, the to be less incoherent most of the essays will
highest) rank as the source document, and the one be coherent in most parts, with only a few minor
with the largest score is assigned the lowest (C1) problems regarding discourse coherence. In such
rank; then ranks of other permutations are uni- a setting, the performance of a model trained from
formly distributed in this range according to their permutations generated from a uniform distribu-
raw dissimilarity scores. We experiment with 3 tion may suffer some accuracy loss.
to 6 ranks (the case where C = 2 reduces to the Therefore, in addition to the set of permutations
standard entity-based model). used by Barzilay and Lapata (2008) (PS BL ), we
create another set of permutations for each source
document (PS M ) by assigning most of the proba-
5.1.2 Entity Extraction bility mass to permutations which are mostly sim-
ilar to the original source document. Besides its
Barzilay and Lapata (2008)s best results were capability of better approximating real-life situ-
achieved by employing an automatic coreference ations, training our model on permutations gen-
resolution tool (Ng and Cardie, 2002) for ex- erated in this way has another benefit: in the
tracting entities from a source document, and the standard entity-based model, all permuted doc-
permutations were generated only afterwards uments are treated as incoherent; thus there are
entity extraction from a permuted document de- many more incoherent training instances than co-
pends on knowing the correct sentence order and herent ones (typically the proportion is 20:1). In
the oracular entity information from the source contrast, in our multiple-rank model, permuted
document since resolving coreference relations documents are assigned different ranks to fur-
in permuted documents is too unreliable for an au- ther differentiate the different degrees of coher-
tomatic tool. ence within them. By doing so, our model will
We implement our multiple-rank model with be able to learn the characteristics of a coherent
full coreference resolution using Ng and Cardies document from those near-coherent documents as
coreference resolution system, and entity extrac- well, and therefore the problem of lacking coher-
tion approach as described above the Coref- ent instances can be mitigated.
erence+ condition. However, as argued by El- Our permutation generation algorithm is shown
sner and Charniak (2011), to better simulate in Algorithm 1, where = 0.05, = 5.0,
the real situations that human readers might en- MAX NUM = 50, and K and K 0 are two normal-
counter in machine-generated documents, such ization factors to make p(swap num) and p(i, j)
oracular information should not be taken into ac- proper probability distributions. For each source
count. Therefore we also employ two alterna- document, we create the same number of permu-
tive approaches for entity extraction: (1) use the tations as PS BL .
same automatic coreference resolution tool on
permuted documents we call it the Corefer- 5.2 Summary Coherence Rating
ence condition; (2) use no coreference reso- In the summary coherence rating task, we are
lution, i.e., group head noun clusters by simple dealing with a mixture of multi-document sum-
string matching B&Ls Coreference condi- maries generated by systems and written by hu-
tion. mans. Barzilay and Lapata (2008) did not assume

319
Algorithm 1 Permutation Generation. with the optimal transition length set to 2.
Input: S 1 , S 2 , . . . , S N ; = (1, 2, . . . , N)
Choose a number of sentence swaps 6 Results
swap num with probability eswap num /K 6.1 Sentence Ordering
for i = 1 swap num do
Swap a pair of sentence (S i , S j ) In this task, we use the same two sets of source
with probability p(i, j) = e|i j| /K 0 documents (Earthquakes and Accidents, see Sec-
end for tion 3.1) as Barzilay and Lapata (2008). Each
Output: = (o1 , o2 , . . . , oN ) contains 200 source documents, equally divided
between training and test sets, with up to 20 per-
mutations per document. We conduct experi-
a simple binary distinction among the summaries ments on these two domains separately. For each
generated from the same input document clus- domain, we accompany each source document
ter; rather, they had human judges give scores for with two different sets of permutations: the one
each summary based on its degree of coherence used by B&L (PS BL ), and the one generated from
(see Section 3.2). Therefore, it seems that the our model described in Section 5.1.3 (PS M ). We
subtle differences among incoherent documents train our multiple-rank model and B&Ls standard
(system-generated summaries in this case) have two-rank model on each set of permutations using
already been learned by their model. the SVM rank package (Joachims, 2006), and eval-
uate both systems on their test sets. Accuracy is
But we wish to see if we can replace hu-
measured as the fraction of correct pairwise rank-
man judgments by our computed dissimilarity
ings for the test set.
scores so that the original supervised learning is
converted into unsupervised learning and yet re- 6.1.1 Full Coreference Resolution with
tain competitive performance. However, given Oracular Information
a summary, computing its dissimilarity score is In this experiment, we implement B&Ls fully-
a bit involved, due to the fact that we do not fledged standard entity-based coherence model,
know its correct sentence order. To tackle this and extract entities from permuted documents us-
problem, we employ a simple sentence align- ing oracular information from the source docu-
ment between a system-generated summary and ments (see Section 5.1.2).
a human-written summary originating from the Results are shown in Table 2. For each test sit-
same input document cluster. Given a system- uation, we list the best accuracy (in Acc columns)
generated summary D s = (S s1 , S s2 , . . . , S sn ) and for each chosen dissimilarity metric, with the cor-
its corresponding human-written summary Dh = responding rank assignment approach. C repre-
(S h1 , S h2 , . . . , S hN ) (here it is possible that n , sents the number of ranks used in stratifying raw
N), we treat the sentence ordering (1, 2, . . . , N) scores (N if using raw configuration, see Sec-
in Dh as (the original sentence ordering), and tion 5.1.1 for details). Baselines are accuracies
compute = (o1 , o2 , . . . , on ) based on D s . To trained using the standard entity-based coherence
compute each oi in , we find the most similar model3 .
sentence S h j , j [1, N] in Dh by computing their Our model outperforms the standard entity-
cosine similarity over all tokens in S h j and S si ; based model on both permutation sets for both
if all sentences in Dh have zero cosine similarity datasets. The improvement is not significant
with S si , we assign 1 to oi . when trained on the permutation set PS BL , and
Once is known, we can compute its dissimi- is achieved only with one of the three metrics;
larity from using a chosen metric. But because 3
There are discrepancies between our reported accuracies
now is not guaranteed to be a permutation of
and those of Barzilay and Lapata (2008). The differences are
(there may be repetition or missing values, i.e., due to the fact that we use a different parser: the Stanford de-
1, in ), Kendalls cannot be used, and we use pendency parser (de Marneffe et al., 2006), and might have
only average continuity and edit distance as dis- extracted entities in a slightly different way than theirs, al-
though we keep other experimental configurations as close
similarity metrics in this experiment.
as possible to theirs. But when comparing our model with
The remaining experimental configuration is theirs, we always use the exact same set of features, so the
the same as that of Barzilay and Lapata (2008), absolute accuracies do not matter.

320
Condition: Coreference+ Condition: Coreference
Earthquakes Accidents Earthquakes Accidents
Perms Metric Perms Metric
C Acc C Acc C Acc C Acc
3 79.5 3 82.0 3 71.0 3 73.3
AC 4 85.2 3 83.3 AC 3 *76.8 3 74.5
PS BL PS BL
ED 3 86.8 6 82.2 ED 4 *77.4 6 74.4
Baseline 85.3 83.2 Baseline 71.7 73.8
3 86.8 3 85.2* 3 55.9 3 51.5
AC 3 85.6 1 85.4* AC 4 53.9 6 49.0
PS M PS M
ED N 87.9* 4 86.3* ED 4 53.9 5 52.3
Baseline 85.3 81.7 Baseline 49.2 53.2

Table 2: Accuracies (%) of extending the stan- Table 3: Accuracies (%) of extending the stan-
dard entity-based coherence model with multiple-rank dard entity-based coherence model with multiple-rank
learning in sentence ordering using Coreference+ op- learning in sentence ordering using Coreference op-
tion. Accuracies which are significantly better than the tion. Accuracies which are significantly better than the
baseline (p < .05) are indicated by *. baseline (p < .05) are indicated by *.

but when trained on PS M (the set of permutations generated from our model), running full corefer-
generated from our biased model), our models ence resolution is not a good option, since it al-
performance significantly exceeds B&Ls4 for all most makes the accuracies no better than random
three metrics, especially as their models perfor- guessing (50%).
mance drops for dataset Accidents. Moreover, considering training using PS BL ,
From these results, we see that in the ideal sit- running full coreference resolution has a different
uation where we extract entities and resolve their influence for the two datasets. For Earthquakes,
coreference relations based on the oracular infor- our model significantly outperforms B&Ls while
mation from the source document, our model is the improvement is insignificant for Accidents.
effective in terms of improving ranking accura- This is most probably due to the different way that
cies, especially when trained on our more realistic entities are realized in these two datasets. As an-
permutation sets PS M . alyzed by Barzilay and Lapata (2008), in dataset
Earthquakes, entities tend to be referred to by pro-
6.1.2 Full Coreference Resolution without
nouns in subsequent mentions, while in dataset
Oracular Information
Accidents, literal string repetition is more com-
In this experiment, we apply the same auto- mon.
matic coreference resolution tool (Ng and Cardie, Given a balanced permutation distribution as
2002) on not only the source documents but also we assumed in PS BL , switching distant sentence
their permutations. We want to see how removing pairs in Accidents may result in very similar en-
the oracular component in the original model af- tity distribution with the situation of switching
fects the performance of our multiple-rank model closer sentence pairs, as recognized by the auto-
and the standard model. Results are shown in Ta- matic tool. Therefore, compared to Earthquakes,
ble 3. our multiple-rank model may be less powerful in
First we can see when trained on PS M , run- indicating the dissimilarity between the sentence
ning full coreference resolution significantly hurts orderings in a permutation and its source docu-
performance for both models. This suggests that, ment, and therefore can improve on the baseline
in real-life applications, where the distribution of only by a small margin.
training instances with different degrees of co-
herence is skewed (as in the set of permutations 6.1.3 No Coreference Resolution
4
Following Elsner and Charniak (2011), we use the In this experiment, we do not employ any coref-
Wilcoxon Sign-rank test for significance. erence resolution tool, and simply cluster head

321
Condition: Coreference 88.0

Accuracy (%)
Earthquakes Accidents 83.0 Earthquake ED Coref+
Perms Metric Earthquake ED Coref
C Acc C Acc 78.0
Accidents ED Coref+

4 82.8 N 82.0 73.0 Accidents ED Coref


Accidents Coref-
AC 3 78.0 3 **84.2
PS BL 68.0
ED N 78.2 3 *82.7 3 4 5 6 N
C
Baseline 83.7 80.1
Figure 1: Effect of C on testing accuracies in selected
3 **86.4 N **85.7 sentence ordering experimental configurations.
AC 4 *84.4 N **86.6
PS M
ED 5 **86.7 N **84.6
choices of Cs with the configurations where our
Baseline 82.6 77.5 model outperforms the baseline model. In each
configuration, we choose the dissimilarity metric
Table 4: Accuracies (%) of extending the stan-
dard entity-based coherence model with multiple-rank
which achieves the best accuracy reported in Ta-
learning in sentence ordering using Coreference op- bles 2 to 4 and the PS BL permutation set. We
tion. Accuracies which are significantly better than the can see that the dependency of accuracies on the
baseline are indicated by * (p < .05) and ** (p < .01). particular choice of C is not consistent across all
experimental configurations, which suggests that
this free parameter C needs careful tuning in dif-
nouns by string matching. Results are shown in
ferent experimental setups.
Table 4.
Combining our multiple-rank model with sim-
Even with such a coarse approximation of
ple string matching for entity extraction is a ro-
coreference resolution, our model is able to
bust option for coherence evaluation, regardless
achieve around 85% accuracy in most test cases,
of the particular distribution of permutations used
except for dataset Earthquakes, training on PS BL
in training, and it significantly outperforms the
gives poorer performance than the standard model
baseline in most conditions.
by a small margin. But such inferior perfor-
mance should be expected, because as explained 6.2 Summary Coherence Rating
above, coreference resolution is crucial to this
As explained in Section 3.2, we employ a simple
dataset, since entities tend to be realized through
sentence alignment between a system-generated
pronouns; simple string matching introduces too
summary and its corresponding human-written
much noise into training, especially when our
summary to construct a test ordering and calcu-
model wants to train a more fine-grained discrim-
late its dissimilarity between the reference order-
inative system than B&Ls. However, we can see
ing from the human-written summary. In this
from the result of training on PS M , if the per-
way, we convert B&Ls supervised learning model
mutations used in training do not involve swap-
into a fully unsupervised model, since human an-
ping sentences which are too far away, the result-
notations for coherence scores are not required.
ing noise is reduced, and our model outperforms
We use the same dataset as Barzilay and Lap-
theirs. And for dataset Accidents, our model
ata (2008), which includes multi-document sum-
consistently outperforms the baseline model by a
maries from 16 input document clusters generated
large margin (with significance test at p < .01).
by five systems, along with reference summaries
6.1.4 Conclusions for Sentence Ordering composed by humans.
Considering the particular dissimilarity metric In this experiment, we consider only average
used in training, we find that edit distance usually continuity (AC) and edit distance (ED) as dissimi-
stands out from the other two metrics. Kendalls larity metrics, with raw configuration for rank as-
distance proves to be a fairly weak metric, which signment, and compare our multiple-rank model
is consistent with the findings of Filippova and with the standard entity-based model using ei-
Strube (2007) (see Section 2.3). Figure 1 plots ther full coreference resolution5 or no resolution
the testing accuracies as a function of different 5
We run the coreference resolution tool on all documents.

322
Entities Metric Same Full vs. 72.3% on full test.
When our model performs poorer than the
AC 82.5 *72.6
baseline (using Coreference configuration), the
Coreference+ ED 81.3 **73.0
difference is not significant, which suggests that
Baseline 78.8 70.9 our multiple-rank model with unsupervised score
AC 76.3 72.0 assignment via simple cosine matching can re-
Coreference ED 78.8 71.7 main competitive with the standard model, which
requires human annotations to obtain a more fine-
Baseline 80.0 72.3 grained coherence spectrum. This observation is
consistent with Banko and Vanderwende (2004)s
Table 5: Accuracies (%) of extending the stan-
discovery that human-generated summaries look
dard entity-based coherence model with multiple-rank
learning in summary rating. Baselines are results of quite extractive.
standard entity-based coherence model. Accuracies
which are significantly better than the corresponding 7 Conclusions
baseline are indicated by * (p < .05) and ** (p < .01).
In this paper, we have extended the popular co-
herence model of Barzilay and Lapata (2008) by
for entity extraction. We train both models on adopting a multiple-rank learning approach. This
the ranking preferences (144 in all) among sum- is inherently different from other extensions to
maries originating from the same input document this model, in which the focus is on enriching
cluster using the SVM rank package (Joachims, the set of features for entity-grid construction,
2006), and test on two different test sets: same- whereas we simply keep their original feature set
cluster test and full test. Same-cluster test is the intact, and manipulate only their learning method-
one used by Barzilay and Lapata (2008), in which ology. We show that this concise extension is
only the pairwise rankings (80 in all) between effective and able to outperform B&Ls standard
summaries originating from the same input doc- model in various experimental setups, especially
ument cluster are tested; we also experiment with when experimental configurations are most suit-
full test, in which pairwise rankings (1520 in all) able considering certain dataset properties (see
between all summary pairs excluding two human- discussion in Section 6.1.4).
written summaries are tested. We experimented with two tasks: sentence or-
dering and summary coherence rating, following
Results are shown in Table 5. Coreference+
B&Ls original framework. In sentence ordering,
and Coreference denote the configuration of
we also explored the influence of removing the
using full coreference resolution or no resolu-
oracular component in their original model and
tion separately. First, clearly for both models,
dealing with permutations generated from differ-
performance on full test is inferior to that on
ent distributions, showing that our model is robust
same-cluster test, but our model is still able to
for different experimental situations. In summary
achieve performance competitive with the stan-
coherence rating, we further extended their model
dard model, even if our fundamental assumption
such that their original supervised learning is con-
about the existence of canonical sentence order-
verted into unsupervised learning with competi-
ing in documents with same content may break
tive or even superior performance.
down on those test pairs not originating from the
Our multiple-rank learning model can be easily
same input document cluster. Secondly, for the
adapted into other extended entity-based coher-
baseline model, using the Coreference configu-
ence models with their enriched feature sets, and
ration yields better accuracy in this task (80.0%
further improvement in ranking accuracies should
vs. 78.8% on same-cluster test, and 72.3% vs.
be expected.
70.9% on full test), which is consistent with the
findings of Barzilay and Lapata (2008). But our Acknowledgments
multiple-rank model seems to favor the Corefer-
ence+ configuration, and our best accuracy even This work was financially supported by the Nat-
exceeds B&Ls best when tested on the same set: ural Sciences and Engineering Research Council
82.5% vs. 80.0% on same-cluster test, and 73.0% of Canada and by the University of Toronto.

323
References Mirella Lapata. 2006. Automatic evaluation of in-
formation ordering: Kendalls tau. Computational
Michele Banko and Lucy Vanderwende. 2004. Us- Linguistics, 32(4):471484.
ing n-grams to understand the nature of summaries.
Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan. 2011.
In Proceedings of Human Language Technologies
Automatically evaluating text coherence using dis-
and North American Association for Computational
course relations. In Proceedings of the 49th Annual
Linguistics 2004: Short Papers, pages 14.
Meeting of the Association for Computational Lin-
Regina Barzilay and Mirella Lapata. 2005. Modeling guistics (ACL 2011), pages 9971006.
local coherence: An entity-based approach. In Pro- Nitin Madnani, Rebecca Passonneau, Necip Fazil
ceedings of the 42rd Annual Meeting of the Asso- Ayan, John M. Conroy, Bonnie J. Dorr, Ju-
ciation for Computational Linguistics (ACL 2005), dith L. Klavans, Dianne P. OLeary, and Judith D.
pages 141148. Schlesinger. 2007. Measuring variability in sen-
Regina Barzilay and Mirella Lapata. 2008. Modeling tence ordering for news summarization. In Pro-
local coherence: an entity-based approach. Compu- ceedings of the Eleventh European Workshop on
tational Linguistics, 34(1):134. Natural Language Generation (ENLG 2007), pages
Danushka Bollegala, Naoaki Okazaki, and Mitsuru 8188.
Ishizuka. 2006. A bottom-up approach to sen- Vincent Ng and Claire Cardie. 2002. Improving ma-
tence ordering for multi-document summarization. chine learning approaches to coreference resolution.
In Proceedings of the 21st International Confer- In Proceedings of the 40th Annual Meeting on Asso-
ence on Computational Linguistics and 44th Annual ciation for Computational Linguistics (ACL 2002),
Meeting of the Association for Computational Lin- pages 104111.
guistics, pages 385392. Michael Strube and Simone Paolo Ponzetto. 2006.
Jackie Chi Kit Cheung and Gerald Penn. 2010. Entity- Wikirelate! Computing semantic relatedness using
based local coherence modelling using topological Wikipedia. In Proceedings of the 21st National
fields. In Proceedings of the 48th Annual Meet- Conference on Artificial Intelligence, pages 1219
ing of the Association for Computational Linguis- 1224.
tics (ACL 2010), pages 186195. Renxian Zhang. 2011. Sentence ordering driven by
Marie-Catherine de Marneffe, Bill MacCartney, and local and global coherence for summary generation.
Christopher D. Manning. 2006. Generating typed In Proceedings of the ACL 2011 Student Session,
dependency parses from phrase structure parses. In pages 611.
Proceedings of the 5th International Conference on
Language Resources and Evaluation (LREC 2006).
Micha Elsner and Eugene Charniak. 2011. Extending
the entity grid with entity-specific features. In Pro-
ceedings of the 49th Annual Meeting of the Asso-
ciation for Computational Linguistics (ACL 2011),
pages 125129.
Katja Filippova and Michael Strube. 2007. Extend-
ing the entity-grid coherence model to semantically
related entities. In Proceedings of the Eleventh Eu-
ropean Workshop on Natural Language Generation
(ENLG 2007), pages 139142.
Thorsten Joachims. 2002. Optimizing search en-
gines using clickthrough data. In Proceedings of
the 8th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (KDD
2002), pages 133142.
Thorsten Joachims. 2006. Training linear SVMs
in linear time. In Proceedings of the 12th ACM
SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD 2006), pages
217226.
Mirella Lapata. 2003. Probabilistic text structuring:
Experiments with sentence ordering. In Proceed-
ings of the 41st Annual Meeting of the Association
for Computational Linguistics (ACL 2003), pages
545552.

324
Generalization Methods for In-Domain and Cross-Domain Opinion
Holder Extraction

Michael Wiegand and Dietrich Klakow


Spoken Language Systems
Saarland University
D-66123 Saarbrucken, Germany
{Michael.Wiegand|Dietrich.Klakow}@lsv.uni-saarland.de

Abstract In order to illustrate this, compare for instance


(1) and (2).
In this paper, we compare three different
generalization methods for in-domain and (1) Malaysia did not agree to such treatment of Al-Qaeda sol-
cross-domain opinion holder extraction be- diers as they were prisoners-of-war and should be accorded
treatment as provided for under the Geneva Convention.
ing simple unsupervised word clustering, (2) Japan wishes to build a $21 billion per year aerospace indus-
an induction method inspired by distant try centered on commercial satellite development.
supervision and the usage of lexical re-
sources. The generalization methods are Though both sentences contain an opinion
incorporated into diverse classifiers. We
holder, the lexical items vary considerably. How-
show that generalization causes significant
improvements and that the impact of im- ever, if the two sentences are compared on the ba-
provement depends on the type of classifier sis of some higher level patterns, some similari-
and on how much training and test data dif- ties become obvious. In both cases the opinion
fer from each other. We also address the holder is an entity denoting a person and this en-
less common case of opinion holders being tity is an agent1 of some predictive predicate (i.e.
realized in patient position and suggest ap- agree in (1) and wishes in (2)), more specifically,
proaches including a novel (linguistically- an expression that indicates that the agent utters a
informed) extraction method how to detect
subjective statement. Generalization methods ide-
those opinion holders without labeled train-
ing data as standard datasets contain too ally capture these patterns, for instance, they may
few instances of this type. provide a domain-independent lexicon for those
predicates. In some cases, even higher order fea-
tures, such as certain syntactic constructions may
1 Introduction vary throughout the different domains. In (1) and
Opinion holder extraction is one of the most im- (2), the opinion holders are agents of a predictive
portant subtasks in sentiment analysis. The ex- predicate, whereas the opinion holder her daugh-
traction of sources of opinions is an essential com- ters in (3) is a patient2 of embarrasses.
ponent for complex real-life applications, such (3) Mrs. Bennet does what she can to get Jane and Bingley to-
as opinion question answering systems or opin- gether and embarrasses her daughters by doing so.
ion summarization systems (Stoyanov and Cardie,
2011). Common approaches designed to extract If only sentences, such as (1) and (2), occur in
opinion holders are based on data-driven methods, the training data, a classifier will not correctly ex-
in particular supervised learning. tract the opinion holder in (3), unless it obtains
In this paper, we examine the role of general- additional knowledge as to which predicates take
ization for opinion holder extraction in both in- opinion holders as patients.
domain and cross-domain classification. General- 1
By agent we always mean constituents being labeled as
ization may not only help to compensate the avail- A0 in PropBank (Kingsbury and Palmer, 2002).
ability of labeled training data but also conciliate 2
By patient we always mean constituents being labeled
domain mismatches. as A1 in PropBank.

325
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 325335,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
In this work, we will consider three differ- Domain # Sentences # Holders in sentence (average)
ETHICS 5700 0.79
ent generalization methods being simple unsuper- SPACE 628 0.28
vised word clustering, an induction method and FICTION 614 1.49
the usage of lexical resources. We show that gen-
Table 1: Statistics of the different domain corpora.
eralization causes significant improvements and
that the impact of improvement depends on how
much training and test data differ from each other.
In addition to these two (sub)domains, we
We also address the issue of opinion holders in
chose some text type that is not even news text
patient position and present methods including a
in order to have a very distant domain. There-
novel extraction method to detect these opinion
fore, we had to use some text not included in the
holders without any labeled training data as stan-
MPQA corpus. Existing text collections contain-
dard datasets contain too few instances of them.
ing product reviews (Kessler et al., 2010; Toprak
In the context of generalization it is also impor- et al., 2010), which are generally a popular re-
tant to consider different classification methods source for sentiment analysis, were not found
as the incorporation of generalization may have a suitable as they only contain few distinct opinion
varying impact depending on how robust the clas- holders. We finally used a few summaries of fic-
sifier is by itself, i.e. how well it generalizes even tional work (two Shakespeare plays and one novel
with a standard feature set. We compare two state- by Jane Austen4 ) since their language is notably
of-the-art learning methods, conditional random different from that of news texts and they con-
fields and convolution kernels, and a rule-based tain a large number of different opinion holders
method. (therefore opinion holder extraction is a meaning-
ful task on this text type). These texts make up
2 Data our third domain FICTION. We manually labeled
As a labeled dataset we mainly use the MPQA it with opinion holder information by applying the
2.0 corpus (Wiebe et al., 2005). We adhere to annotation scheme of the MPQA corpus.
the definition of opinion holders from previous Table 1 lists the properties of the different do-
work (Wiegand and Klakow, 2010; Wiegand and main corpora. Note that ETHICS is the largest do-
Klakow, 2011a; Wiegand and Klakow, 2011b), main. We consider it our primary (source) domain
i.e. every source of a private state or a subjective as it serves both as a training and (in-domain) test
speech event (Wiebe et al., 2005) is considered an set. Due to their size, the other domains only
opinion holder. serve as test sets (target domains).
This corpus contains almost exclusively news For some of our generalization methods, we
texts. In order to divide it into different domains, also need a large unlabeled corpus. We use the
we use the topic labels from (Stoyanov et al., North American News Text Corpus (LDC95T21).
2004). By inspecting those topics, we found that
3 The Different Types of Generalization
many of them can grouped to a cluster of news
items discussing human rights issues mostly in 3.1 Word Clustering (Clus)
the context of combating global terrorism. This The simplest generalization method that is con-
means that there is little point in considering every sidered in this paper is word clustering. By that,
single topic as a distinct (sub)domain and, there- we understand the automatic grouping of words
fore, we consider this cluster as one single domain occurring in similar contexts. Such clusters are
ETHICS.3 For our cross-domain evaluation, we usually computed on a large unlabeled corpus.
want to have another topic that is fairly different Unlike lexical features, features based on clusters
from this set of documents. By visual inspection, are less sparse and have been proven to signif-
we found that the topic discussing issues regard- icantly improve data-driven classifiers in related
ing the International Space Station would suit our tasks, such as named-entity recognition (Turian et
purpose. It is henceforth called SPACE.
4
available at: www.absoluteshakespeare.com/
3
The cluster is the union of documents with the following guides/{othello|twelfth night}/summary/
MPQA-topic labels: axisofevil, guantanamo, humanrights, {othello|twelfth night} summary.htm
mugabe and settlements. www.wikisummaries.org/Pride and Prejudice

326
I. Madrid, Dresden, Bordeaux, Istanbul, Caracas, Manila, ... majority of holders are agents (4). A certain
II. Toby, Betsy, Michele, Tim, Jean-Marie, Rory, Andrew, ...
III. detest, resent, imply, liken, indicate, suggest, owe, expect, ...
number of predicates, however, also have opinion
IV. disappointment, unease, nervousness, dismay, optimism, ... holders in patient position, e.g. (5) and (6).
V. remark, baby, book, saint, manhole, maxim, coin, batter, ... Wiegand and Klakow (2011b) found that many
Table 2: Some automatically induced clusters.
of those latter predicates are listed in one of
Levins verb classes called amuse verbs. While
ETHICS SPACE FICTION
on the evaluation on the entire MPQA corpus,
1.47 2.70 11.59 opinion holders in patient position are fairly rare
(Wiegand and Klakow, 2011b), we may wonder
Table 3: Percentage of opinion holders as patients. whether the same applies to the individual do-
mains that we consider in this work. Table 3
lists the proportion of those opinion holders (com-
al., 2010). Such a generalization is, in particular,
puted manually) based on a random sample of 100
attractive as it is cheaply produced. As a state-
opinion holder mentions from those corpora. The
of-the-art clustering method, we consider Brown
table shows indeed that on the domains from the
clustering (Brown et al., 1992) as implemented in
MPQA corpus, i.e. ETHICS and SPACE, those
the SRILM-toolkit (Stolcke, 2002). We induced
opinion holders play a minor role but there is a no-
1000 clusters which is also the configuration used
tably higher proportion on the FICTION-domain.
in (Turian et al., 2010).5
Table 2 illustrates a few of the clusters induced 3.3 Task-Specific Lexicon Induction (Induc)
from our unlabeled dataset introduced in Section
3.3.1 Distant Supervision with Prototypical
() 2. Some of these clusters represent location
Opinion Holders
or person names (e.g. I. & II.). This exempli-
Lexical resources are potentially much more
fies why clustering is effective for named-entity
expressive than word clustering. This knowledge,
recognition. We also find clusters that intuitively
however, is usually manually compiled, which
seem to be meaningful for our task (e.g. III. &
makes this solution much more expensive. Wie-
IV.) but, on the other hand, there are clusters that
gand and Klakow (2011a) present an intermedi-
contain words that with the exception of their part
ate solution for opinion holder extraction inspired
of speech do not have anything in common (e.g.
by distant supervision (Mintz et al., 2009). The
V.).
output of that method is also a lexicon of predi-
3.2 Manually Compiled Lexicons (Lex) cates but it is automatically extracted from a large
The major shortcoming of word clustering is that unlabeled corpus. This is achieved by collecting
it lacks any task-specific knowledge. The oppo- predicates that frequently co-occur with prototyp-
site type of generalization is the usage of manu- ical opinion holders, i.e. common nouns such as
ally compiled lexicons comprising predicates that opponents (7) or critics (8), if they are an agent
indicate the presence of opinion holders, such as of that predicate. The rationale behind this is
supported, worries or disappointed in (4)-(6). that those nouns act very much like actual opin-
ion holders and therefore can be seen as a proxy.
(4) I always supported this idea. holder:agent.
(5) This worries me. holder:patient (7) Opponents say these arguments miss the point.
(6) He disappointed me. holder:patient (8) Critics argued that the proposed limits were unconstitutional.

This method reduces the human effort to specify-


We follow Wiegand and Klakow (2011b) who
ing a small set of such prototypes.
found that those predicates can be best obtained
Following the best configuration reported
by using a subset of Levins verb classes (Levin,
in (Wiegand and Klakow, 2011a), we extract 250
1993) and the strong subjective expressions of the
verbs, 100 nouns and 100 adjectives from our un-
Subjectivity Lexicon (Wilson et al., 2005). For
labeled corpus (2).
those predicates it is also important to consider
in which argument position they usually take an 3.3.2 Extension for Opinion Holders in
opinion holder. Bethard et al. (2004) found the Patient Position
5
We also experimented with other sizes but they did not The downside of using prototypical opinion
produce a better overall performance. holders as a proxy for opinion holders is that it

327
anguish , astonish, astound, concern, convince, daze, delight, opinion holders to persons. This means that we
disenchant , disappoint, displease, disgust, disillusion, dissat-
isfy, distress, embitter , enamor , engross, enrage, entangle , allow personal pronouns (i.e. I, you, he, she and
excite, fatigue , flatter, fluster, flummox , frazzle , hook , hu- we) to appear in this position. We believe that this
miliate, incapacitate , incense, interest, irritate, obsess, outrage,
perturb, petrify , sadden, sedate , shock, stun, tether , trouble
relaxation can be done in that particular case, as
adjectives are much more likely to convey opin-
Table 4: Examples of the automatically extracted verbs ions a priori than verbs (Wiebe et al., 2004).
taking opinion holders as patients ( : not listed as An intrinsic evaluation of the predicates that we
is limited to agentive opinion holders. Opinion holders in patient position, such as the ones taken by amuse verbs in (5) and (6), are not covered. Wiegand and Klakow (2011a) show that considering less restrictive contexts significantly drops classification performance. So the natural extension of looking for predicates having prototypical opinion holders in patient position is not effective. Sentences, such as (9), would mar the result.

(9) They criticized their opponents.

In (9) the prototypical opinion holder opponents (in the patient position) is not a true opinion holder.

Our novel method to extract those predicates rests on the observation that the past participle of those verbs, such as shocked in (10), is very often identical to some predicate adjective (11) having a similar if not identical meaning. For the predicate adjective, the opinion holder is, however, its subject/agent and not its patient.

(10) He had shocked_verb me. [holder: patient]
(11) I was shocked_adj. [holder: agent]

Instead of extracting those verbs directly (10), we take the detour via their corresponding predicate adjectives (11). This means that we collect all those verbs (from our large unlabeled corpus (Section 2)) for which there is a predicate adjective that coincides with the past participle of the verb. To increase the likelihood that our extracted predicates are meaningful for opinion holder extraction, we also need to check the semantic type in the relevant argument position, i.e. make sure that the agent of the predicate adjective (which would be the patient of the corresponding verb) is an entity likely to be an opinion holder. Our initial attempts with prototypical opinion holders were too restrictive, i.e. the number of prototypical opinion holders co-occurring with those adjectives was too small. Therefore, we widen the semantic type of this position from prototypical [...] amuse verb). [...] thus extracted from our unlabeled corpus is difficult. The 250 most frequent verbs exhibiting this special property of coinciding with adjectives (this will be the list that we use in our experiments) contains 42% of the entries of the amuse verbs (Section 3.2). However, we also found many other potentially useful predicates on this list that are not listed as amuse verbs (Table 4). As amuse verbs cannot be considered a complete gold standard for all predicates taking opinion holders as patients, we will focus on a task-based evaluation of our automatically extracted list (Section 6).

4 Data-driven Methods

In the following, we present the two supervised classifiers we use in our experiments. Both classifiers incorporate the same levels of representation, including the same generalization methods.

4.1 Conditional Random Fields (CRF)

The supervised classifiers most frequently used for information extraction tasks, in general, are conditional random fields (CRF) (Lafferty et al., 2001). Using CRF, the task of opinion holder extraction is framed as a tagging problem in which, given a sequence of observations x = x_1 x_2 ... x_n (the words in a sentence), a sequence of output tags y = y_1 y_2 ... y_n indicating the boundaries of opinion holders is computed by modeling the conditional probability P(y|x).

The features we use (Table 5) are mostly inspired by Choi et al. (2005) and by the ones used for plain support vector machines (SVMs) in (Wiegand and Klakow, 2010). They are organized into groups. The basic group Plain does not contain any generalization method. Each other group is dedicated to one specific generalization method that we want to examine (Clus, Induc and Lex). Apart from considering generalization features indicating the presence of generalization types, we also consider those types in conjunction with semantic roles. As already indicated above, semantic roles are especially important for the detection of opinion holders.
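As an illustration of this tagging formulation, the following is a minimal sketch (not the authors' code; the paper itself uses CRF++ with the feature groups of Table 5). The feature names and the tiny example are invented stand-ins for a few of the Plain features and for the BIO labeling of opinion holders.

```python
# Sketch of opinion holder extraction as BIO sequence tagging.
# Feature names below are illustrative assumptions, not the paper's exact template.

def token_features(tokens, pos_tags, i):
    """A very small subset of the 'Plain' feature group for token i."""
    prev = tokens[i - 1].lower() if i > 0 else "<s>"
    return {
        "token": tokens[i].lower(),
        "pos": pos_tags[i],
        "token_bigram": prev + "_" + tokens[i].lower(),
    }

def bio_labels(tokens, holder_spans):
    """holder_spans: list of (start, end) token offsets of gold opinion holders."""
    labels = ["O"] * len(tokens)
    for start, end in holder_spans:
        labels[start] = "B-HOLDER"
        for j in range(start + 1, end):
            labels[j] = "I-HOLDER"
    return labels

# Example: "I was shocked ." with the opinion holder "I"
tokens = ["I", "was", "shocked", "."]
pos = ["PRP", "VBD", "JJ", "."]
X = [token_features(tokens, pos, i) for i in range(len(tokens))]
y = bio_labels(tokens, [(0, 1)])
print(list(zip([f["token"] for f in X], y)))
```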
Group - Features
Plain - Token features: unigrams and bigrams; POS/chunk/named-entity features: unigrams, bigrams and trigrams; constituency tree path to nearest predicate; nearest predicate; semantic role to predicate + lexical form of predicate
Clus - Cluster features: unigrams, bigrams and trigrams; semantic role to predicate + cluster-id of predicate; cluster-id of nearest predicate
Induc - Is there a predicate from the induced lexicon within a window of 5 tokens?; semantic role to predicate, if predicate is contained in induced lexicon; is nearest predicate contained in induced lexicon?
Lex - Is there a predicate from the manually compiled lexicons within a window of 5 tokens?; semantic role to predicate, if predicate is contained in manually compiled lexicons; is nearest predicate contained in manually compiled lexicons?

Table 5: Feature set for CRF.

Unfortunately, the corresponding feature from the Plain feature group that also includes the lexical form of the predicate is most likely a sparse feature. For the opinion holder me in (10), for example, it would correspond to A1_shock. Therefore, we introduce for each generalization method an additional feature replacing the sparse lexical item by a generalization label, i.e. Clus: A1_CLUSTER-35265, Induc: A1_INDUC-PRED and Lex: A1_LEX-PRED.[6]

For this learning method, we use CRF++.[7] We choose a configuration that provides good performance on our source domain (i.e. ETHICS).[8] For semantic role labeling we use SWIRL[9], for chunk parsing CASS (Abney, 1991) and for constituency parsing the Stanford Parser (Klein and Manning, 2003). Named-entity information is provided by the Stanford Tagger (Finkel et al., 2005).

4.2 Convolution Kernels (CK)

Convolution kernels (CK) are special kernel functions. A kernel function K : X × X → R computes the similarity of two data instances x_i and x_j (x_i, x_j ∈ X). It is mostly used in SVMs that estimate a hyperplane to separate data instances from different classes, H(x) = w · x + b = 0, where w ∈ R^n and b ∈ R (Joachims, 1999). In convolution kernels, the structures to be compared within the kernel function are not vectors comprising manually designed features but the underlying discrete structures, such as syntactic parse trees or part-of-speech sequences. Since they are directly provided to the learning algorithm, a classifier can be built without taking the effort of implementing an explicit feature extraction.

We take the best configuration from (Wiegand and Klakow, 2010), which comprises a combination of three different tree kernels: two tree kernels based on constituency parse trees (one with predicate scope and another with semantic scope) and a tree kernel encoding predicate-argument structures based on semantic role information. These representations are illustrated in Figure 1. The resulting kernels are combined by plain summation.

In order to integrate our generalization methods into the convolution kernels, the input structures, i.e. the linguistic tree structures, have to be augmented. For that we just add additional nodes whose labels correspond to the respective generalization types (i.e. Clus: CLUSTER-ID, Induc: INDUC-PRED and Lex: LEX-PRED). The nodes are added in such a way that they (directly) dominate the leaf node for which they provide a generalization.[10] If several generalization methods are used and several of them apply for the same lexical unit, then the (vertical) order of the generalization nodes is LEX-PRED > INDUC-PRED > CLUSTER-ID.[11] Figure 2 illustrates the predicate argument structure from Figure 1 augmented with INDUC-PRED and CLUSTER-IDs.

For this learning method, we use the SVMLight-TK toolkit.[12] Again, we tune the parameters to our source domain (ETHICS).[13]

5 Rule-based Classifiers (RB)

Finally, we also consider rule-based classifiers (RB). The main difference towards CRF and CK is that this is an unsupervised approach not requiring training data. We re-use the framework by Wiegand and Klakow (2011b). The candidate set are all noun phrases in a test set.

[6] Predicates in patient position are given the same generalization label as the predicates in agent position. Specially marking them did not result in a notable improvement.
[7] http://crfpp.sourceforge.net
[8] The soft margin parameter c is set to 1.0 and all features occurring less than 3 times are removed.
[9] http://www.surdeanu.name/mihai/swirl
[10] Note that even for the configuration Plain the trees are already augmented with named-entity information.
[11] We chose this order as it roughly corresponds to the specificity of those generalization types.
[12] disi.unitn.it/moschitti
[13] The cost parameter j (Morik et al., 1999) was set to 5.
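To make the tree augmentation described above more concrete, here is a minimal sketch (not the authors' implementation). The use of nltk.Tree and the simplified predicate-argument structure are assumptions for the example; the node labels follow the generalization types named in the text, with the CLUSTER-35265 id taken from the paper's example.

```python
# Sketch: inserting generalization nodes directly above a leaf of a tree,
# ordered so that LEX-PRED dominates INDUC-PRED, which dominates the CLUSTER-ID node.
from nltk.tree import Tree

def augment_leaf(tree, leaf_index, labels):
    """labels must be ordered from most to least specific, e.g.
    ["LEX-PRED", "INDUC-PRED", "CLUSTER-35265"]."""
    leaf_pos = tree.treepositions("leaves")[leaf_index]
    parent, child_idx = tree[leaf_pos[:-1]], leaf_pos[-1]
    subtree = parent[child_idx]          # the leaf token itself
    for label in reversed(labels):       # wrap bottom-up so the first label ends on top
        subtree = Tree(label, [subtree])
    parent[child_idx] = subtree
    return tree

# Simplified predicate-argument structure for "... shocked me ..."
pas = Tree("PRED", [Tree("LEMMA", ["shock"]), Tree("A1", ["me"])])
augment_leaf(pas, 0, ["LEX-PRED", "INDUC-PRED", "CLUSTER-35265"])
print(pas)
# (PRED (LEMMA (LEX-PRED (INDUC-PRED (CLUSTER-35265 shock)))) (A1 me))
```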
Figure 1: The different structures (left: constituency trees, right: predicate argument structure) derived from Sentence (1) for the opinion holder candidate Malaysia used as input for convolution kernels (CK).

Figure 2: Predicate argument structure augmented with generalization nodes.

A candidate is classified as an opinion holder if all of the following conditions hold:
- The candidate denotes a person or group of persons.
- There is a predictive predicate in the same sentence.
- The candidate has a pre-specified semantic role in the event that the predictive predicate evokes (default: agent-role).

The set of predicates is obtained from a given lexicon. For predicates that take opinion holders as patients, the default agent-role is overruled.

We consider several classifiers that differ in the lexicon they use. RB-Lex uses the combination of the manually compiled lexicons presented in Section 3.2. RB-Induc uses the predicates that have been automatically extracted from a large unlabeled corpus using the methods presented in Section 3.3. RB-Induc+Lex considers the union of those lexicons. In order to examine the impact of modeling opinion holders in patient position, we also introduce two versions of each lexicon. AG just considers predicates in agentive position while AG+PT also considers predicates that take opinion holders as patients. For example, RB-Induc_AG+PT is a classifier that uses automatically extracted predicates in order to detect opinion holders in both agent and patient argument position, i.e. RB-Induc_AG+PT also covers our novel extraction method for patients (Section 3.3.2).

The output of clustering will exclusively be evaluated in the context of learning-based methods, since there is no straightforward way of incorporating this output into a rule-based classifier.

Domains    Induc (AG)  Induc (AG+PT)  Lex (AG)  Lex (AG+PT)  Induc+Lex (AG+PT)
ETHICS     50.77       50.99          52.22     52.27        53.07
SPACE      45.81       46.55          47.60     48.47        45.20
FICTION    46.59       49.97          54.84     59.35        63.11

Table 6: F-score of the different rule-based classifiers.

6 Experiments

CK and RB have an instance space that is different from the one of CRF. While CRF produces a prediction for every word token in a sentence, CK and RB only produce a prediction for every noun phrase. For evaluation, we project the predictions from RB and CK to word token level in order to ensure comparability. We evaluate the sequential output with precision, recall and F-score as defined in (Johansson and Moschitti, 2010; Johansson and Moschitti, 2011).

6.1 Rule-based Classifier

Table 6 shows the cross-domain performance of the different rule-based classifiers. RB-Lex performs better than RB-Induc. In comparison to the domains ETHICS and SPACE the difference is larger on FICTION. Presumably, this is due to the fact that the predicates in Induc are extracted from a news corpus (Section 2). Thus, Induc may slightly suffer from a domain mismatch. A combination of the two classifiers, i.e. RB-Lex+Induc, results in a notable improvement in the FICTION-domain. The approaches that also detect opinion holders as patients (AG+PT), including our novel approach (Section 3.3.2), are effective.
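The rule-based decision from Section 5 can be sketched as follows (not the authors' code). The candidate and lexicon data structures are invented for the illustration; only the three conditions and the agent/patient role handling are taken from the text.

```python
# Sketch of the RB decision: accept a noun-phrase candidate as opinion holder if it
# denotes a person, a predictive predicate from the lexicon occurs in the sentence,
# and the candidate fills the required semantic role of that predicate.

def is_opinion_holder(candidate, sentence_predicates, agent_lexicon, patient_lexicon):
    """candidate: {"is_person": bool, "role_by_predicate": {lemma: role}}
    sentence_predicates: predicate lemmas occurring in the same sentence."""
    if not candidate["is_person"]:
        return False
    for lemma in sentence_predicates:
        role = candidate["role_by_predicate"].get(lemma)
        if lemma in agent_lexicon and role == "agent":      # default agent-role
            return True
        if lemma in patient_lexicon and role == "patient":  # AG+PT variants only
            return True
    return False

# An AG classifier would pass patient_lexicon=set(); an AG+PT classifier additionally
# passes the patient-taking predicates (e.g. the amuse-type verbs of Section 3.3.2).
```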
Features      Alg.   5      10     20     50     100    (training size in %)
Plain         CRF    32.14  35.24  41.03  51.05  55.13
              CK     42.15  46.34  51.14  56.39  59.52
+Clus         CRF    33.06  37.11  43.47  52.05  56.18
              CK     42.02  45.86  51.11  56.59  59.77
+Induc        CRF    37.28  42.31  46.54  54.27  56.71
              CK     46.26  49.35  53.26  57.28  60.42
+Lex          CRF    40.69  43.91  48.43  55.37  58.46
              CK     46.45  50.59  53.93  58.63  61.50
+Clus+Induc   CRF    37.27  42.19  47.35  54.95  57.14
              CK     45.14  48.20  52.39  57.37  59.97
+Clus+Lex     CRF    40.52  44.29  49.32  55.44  58.80
              CK     45.89  49.35  53.56  58.74  61.43
+Lex+Induc    CRF    42.23  45.92  49.96  55.61  58.40
              CK     47.46  51.44  54.80  58.74  61.58
All           CRF    41.56  45.75  50.39  56.24  59.08
              CK     46.18  50.10  54.04  58.92  61.44

Table 7: F-score of in-domain (ETHICS) learning-based classifiers.

A notable improvement can only be measured on the FICTION-domain since this is the only domain with a significant proportion of those opinion holders (Table 3).

6.2 In-Domain Evaluation of Learning-based Methods

Table 7 shows the performance of the learning-based methods CRF and CK on an in-domain evaluation (ETHICS-domain) using different amounts of labeled training data. We carry out a 5-fold cross-validation and use n% of the training data in the training folds. The table shows that CK is more robust than CRF. The fewer training data are used, the more important generalization becomes. CRF benefits much more from generalization than CK. Interestingly, the CRF configuration with the best generalization is usually as good as plain CK. This proves the effectiveness of CK. In principle, Lex is the strongest generalization method while Clus is by far the weakest. For Clus, systematic improvements over no generalization (even though they are minor) can only be observed with CRF. As far as combinations are concerned, either Lex+Induc or All performs best. This in-domain evaluation proves that opinion holder extraction is different from named-entity recognition: simple unsupervised generalization, such as word clustering, is not effective, and popular sequential classifiers are less robust than margin-based tree-kernels.

Table 8 complements Table 7 in that it compares the learning-based methods with the best rule-based classifier and also displays precision and recall. RB achieves a high recall, whereas the learning-based methods always exceed RB in precision.[14] Applying generalization to the learning-based methods results in an improvement of both recall and precision if few training data are used. The impact on precision decreases, however, the more training data are added. There is always a significant increase in recall, but learning-based methods may not reach the level of RB even though they use the same resources. This is a side-effect of preserving a much higher precision. It also explains why learning-based methods with generalization may have a lower F-score than RB.

[14] The reason for RB having a high recall is extensively discussed in (Wiegand and Klakow, 2011b).

6.3 Out-of-Domain Evaluation of Learning-based Methods

Table 9 presents the results of out-of-domain classifiers. The complete ETHICS-dataset is used for training. Some properties are similar to the previous experiments: CK always outperforms CRF. RB provides a high recall whereas the learning-based methods maintain a higher precision. Similar to the in-domain setting using few labeled training data, the incorporation of generalization increases both precision and recall. Moreover, a combination of generalization methods is better than just using one method on average, although Lex is again a fairly robust individual generalization method. Generalization is more effective in this setting than on the in-domain evaluation using all training data, in particular for CK, since the training and test data are much more different from each other and suitable generalization methods partly close that gap.

There is a notable difference in precision between the SPACE- and FICTION-domain (and also the source domain ETHICS (Table 8)). We strongly assume that this is due to the distribution of opinion holders in those datasets (Table 1). The FICTION-domain contains many more opinion holders, therefore the chance that a predicted opinion holder is correct is much higher.

With regard to recall, a similar level of performance as in the ETHICS-domain can only be achieved in the SPACE-domain, i.e. CK achieves a recall of 60%. In the FICTION-domain, however, the recall is much lower (the best recall of CK is below 47%). This is no surprise as the SPACE-domain is more similar to the source domain than
the FICTION-domain since ETHICS and SPACE are news texts. FICTION contains more out-of-domain language. Therefore, RB (which exclusively uses domain-independent knowledge) outperforms both learning-based methods including the ones incorporating generalization. Similar results have been observed for rule-based classifiers from other tasks in cross-domain sentiment analysis, such as subjectivity detection and polarity classification: high-level information as it is encoded in a rule-based classifier generalizes better than learning-based methods (Andreevskaia and Bergler, 2008; Lambov et al., 2009).

We set up another experiment exclusively for the FICTION-domain in which we combine the output of our best learning-based method, i.e. CK, with the prediction of a rule-based classifier. The combined classifier will predict an opinion holder if either classifier predicts one. The motivation for this is the following: the FICTION-domain is the only domain to have a significant proportion of opinion holders appearing as patients. We want to know how many of them can be recognized with the best out-of-domain classifier using training data with only very few instances of this type, and what benefit the addition of various RBs, which have a clearer notion of these constructions, brings about. Moreover, we already observed that the learning-based methods have a bias towards preserving a high precision, and this may have as a consequence that the generalization features incorporated into CK will not receive sufficiently large weights. Unlike the SPACE-domain, where a sufficiently high recall is already achieved with CK (presumably due to its stronger similarity towards the source domain), the FICTION-domain may be more severely affected by this bias, and evidence from RB may compensate for this.

Table 10 shows the performance of those combined classifiers. For all generalization types considered, there is, indeed, an improvement by adding information from RB, resulting in a large boost in recall. Already the application of our induction approach Induc results in an increase of more than 8% points compared to plain CK. The table also shows that there is always some improvement if RB considers opinion holders as patients (AG+PT). This can be considered as some evidence that (given the available data we use) opinion holders in patient position can only be effectively extracted with the help of RBs. It is also further evidence that our novel approach to extract those predicates (Section 3.3.2) is effective.

The combined approach in Table 10 not only outperforms CK (discussed above) but also RB (Table 6). We manually inspected the output of the classifiers to find also cases in which CK detects opinion holders that RB misses. CK has the advantage that it is not only bound to the relationship between candidate holder and predicate. It learns further heuristics, e.g. that sentence-initial mentions of persons are likely opinion holders. In (12), for example, this heuristic fires while RB overlooks this instance, as to give someone a share of advice is not part of the lexicon.

(12) She later gives Charlotte her share of advice on running a household.

Size  Feat.   CRF: Prec / Rec / F1     CK: Prec / Rec / F1
10    Plain   52.17 / 26.61 / 35.24    58.26 / 38.47 / 46.34
10    All     62.85 / 35.96 / 45.75    63.18 / 41.50 / 50.10
50    Plain   59.85 / 44.50 / 51.05    59.60 / 53.50 / 56.39
50    All     62.99 / 50.80 / 56.24    61.91 / 56.20 / 58.92
100   Plain   64.14 / 48.33 / 55.13    62.38 / 56.91 / 59.52
100   All     64.75 / 54.32 / 59.08    63.81 / 59.24 / 61.44
RB            47.38 / 60.32 / 53.07    47.38 / 60.32 / 53.07

Table 8: Comparison of best RB with learning-based approaches on in-domain classification.

Algorithms     Generalization   Prec    Rec     F
CK (Plain)     -                66.90   41.48   51.21
CK             Induc            67.06   45.15   53.97
CK+RB_AG       Induc            60.22   54.52   57.23
CK+RB_AG+PT    Induc            61.09   58.14   59.58
CK             Lex              69.45   46.65   55.81
CK+RB_AG       Lex              67.36   59.02   62.91
CK+RB_AG+PT    Lex              68.25   63.28   65.67
CK             Induc+Lex        69.73   46.17   55.55
CK+RB_AG       Induc+Lex        61.41   65.56   63.42
CK+RB_AG+PT    Induc+Lex        62.26   70.56   66.15

Table 10: Combination of out-of-domain CK and rule-based classifiers on FICTION (i.e. distant domain).

7 Related Work

The research on opinion holder extraction has been focusing on applying different data-driven approaches. Choi et al. (2005) and Choi et al. (2006) explore conditional random fields, Wiegand and Klakow (2010) examine different combinations of convolution kernels, while Johansson and Moschitti (2010) present a re-ranking approach modeling complex relations between multiple opinions in a sentence.
SPACE (similar target domain) FICTION (distant target domain)
CRF CK CRF CK
Features Prec Rec F1 Prec Rec F1 Prec Rec F1 Prec Rec F1
Plain 47.32 48.62 47.96 45.89 57.07 50.87 68.58 28.96 40.73 66.90 41.48 51.21
+Clus 49.00 48.62 48.81 49.23 57.64 53.10 71.85 32.21 44.48 67.54 41.21 51.19
+Induc 42.92 49.15 45.82 46.66 60.45 52.67 71.59 34.77 46.80 67.06 45.15 53.97
+Lex 49.65 49.07 49.36 49.60 59.88 54.26 71.91 35.83 47.83 69.45 46.65 55.81
+Clus+Induc 46.61 48.78 47.67 48.65 58.20 53.00 71.32 35.88 47.74 67.46 42.17 51.90
+Lex+Induc 48.75 50.87 49.78 49.92 58.76 53.98 74.02 37.37 49.67 69.73 46.17 55.55
+Clus+Lex 49.72 50.87 50.29 53.70 59.32 56.37 73.41 37.15 49.33 70.59 43.98 54.20
All 49.87 51.03 50.44 51.68 58.76 54.99 72.00 37.44 49.26 70.61 44.83 54.84
best RB 41.72 57.80 48.47 41.72 57.80 48.47 63.26 62.96 63.11 63.26 62.96 63.11

Table 9: Comparison of best RB with learning-based approaches on out-of-domain classification.

A comparison of those methods has not yet been attempted. In this work, we compare the popular state-of-the-art learning algorithms conditional random fields and convolution kernels for the first time. All these data-driven methods have been evaluated on the MPQA corpus. Some generalization methods are incorporated but, unlike this paper, they are neither systematically compared nor combined. The role of resources that provide the knowledge of argument positions of opinion holders is not covered in any of these works. This kind of knowledge should be directly learnt from the labeled training data. In this work, we found, however, that the distribution of argument positions of opinion holders varies throughout the different domains and, therefore, cannot be learnt from any arbitrary out-of-domain training set.

Bethard et al. (2004) and Kim and Hovy (2006) explore the usefulness of semantic roles provided by FrameNet (Fillmore et al., 2003). Bethard et al. (2004) use this resource to acquire labeled training data while in (Kim and Hovy, 2006) FrameNet is used within a rule-based classifier mapping frame-elements of frames to opinion holders. Bethard et al. (2004) only evaluate on an artificial dataset (i.e. a subset of sentences from FrameNet and PropBank (Kingsbury and Palmer, 2002)). The only realistic test set on which Kim and Hovy (2006) evaluate their approach are news texts. Their method is compared against a simple rule-based baseline and, unlike this work, not against a robust data-driven algorithm.

(Wiegand and Klakow, 2011b) is similar to (Kim and Hovy, 2006) in that a rule-based approach is used relying on the relationship towards predictive predicates. Diverse resources are considered for obtaining such words, however, they are only evaluated on the entire MPQA corpus.

The only cross-domain evaluation of opinion holder extraction is reported in (Li et al., 2007) using the MPQA corpus as a training set and the NTCIR collection as a test set. A low cross-domain performance is obtained and the authors conclude that this is due to the very different annotation schemes of those corpora.

8 Conclusion

We examined different generalization methods for opinion holder extraction. We found that for in-domain classification, the more labeled training data are used, the smaller is the impact of generalization. Robust learning methods, such as convolution kernels, benefit less from generalization than weaker classifiers, such as conditional random fields. For cross-domain classification, generalization is always helpful. Distant domains are problematic for learning-based methods; however, rule-based methods provide a reasonable recall and can be effectively combined with the learning-based methods. The types of generalization that help best are manually compiled lexicons followed by an induction method inspired by distant supervision. Finally, we examined the case of opinion holders as patients and also presented a novel automatic extraction method that proved effective. Such dedicated extraction methods are important as common labeled datasets (from the news domain) do not provide sufficient training data for these constructions.

Acknowledgements

This work was funded by the German Federal Ministry of Education and Research (Software-Cluster) under grant no. 01IC10S01. The authors thank Alessandro Moschitti, Benjamin Roth and Josef Ruppenhofer for their technical support and interesting discussions.
References

Steven Abney. 1991. Parsing By Chunks. In Robert Berwick, Steven Abney, and Carol Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht.

Alina Andreevskaia and Sabine Bergler. 2008. When Specialists and Generalists Work Together: Overcoming Domain Dependence in Sentiment Tagging. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT), Columbus, OH, USA.

Steven Bethard, Hong Yu, Ashley Thornton, Vasileios Hatzivassiloglou, and Dan Jurafsky. 2004. Extracting Opinion Propositions and Opinion Holders using Syntactic and Lexical Cues. In Computing Attitude and Affect in Text: Theory and Applications. Springer-Verlag.

Peter F. Brown, Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479.

Yejin Choi, Claire Cardie, Ellen Riloff, and Siddharth Patwardhan. 2005. Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), Vancouver, BC, Canada.

Yejin Choi, Eric Breck, and Claire Cardie. 2006. Joint Extraction of Entities and Relations for Opinion Recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Sydney, Australia.

Charles J. Fillmore, Christopher R. Johnson, and Miriam R. Petruck. 2003. Background to FrameNet. International Journal of Lexicography, 16:235–250.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, USA.

Thorsten Joachims. 1999. Making Large-Scale SVM Learning Practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press.

Richard Johansson and Alessandro Moschitti. 2010. Reranking Models in Fine-grained Opinion Analysis. In Proceedings of the International Conference on Computational Linguistics (COLING), Beijing, China.

Richard Johansson and Alessandro Moschitti. 2011. Extracting Opinion Expressions and Their Polarities - Exploration of Pipelines and Joint Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Portland, OR, USA.

Jason S. Kessler, Miriam Eckert, Lyndsay Clarke, and Nicolas Nicolov. 2010. The ICWSM JDPA 2010 Sentiment Corpus for the Automotive Domain. In Proceedings of the International AAAI Conference on Weblogs and Social Media Data Challenge Workshop (ICWSM-DCW), Washington, DC, USA.

Soo-Min Kim and Eduard Hovy. 2006. Extracting Opinions, Opinion Holders, and Topics Expressed in Online News Media Text. In Proceedings of the ACL Workshop on Sentiment and Subjectivity in Text, Sydney, Australia.

Paul Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In Proceedings of the Conference on Language Resources and Evaluation (LREC), Las Palmas, Spain.

Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Sapporo, Japan.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the International Conference on Machine Learning (ICML).

Dinko Lambov, Gaël Dias, and Veska Noncheva. 2009. Sentiment Classification across Domains. In Proceedings of the Portuguese Conference on Artificial Intelligence (EPIA), Aveiro, Portugal. Springer-Verlag.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Yangyong Li, Kalina Bontcheva, and Hamish Cunningham. 2007. Experiments of Opinion Analysis on the Corpora MPQA and NTCIR-6. In Proceedings of the NTCIR-6 Workshop Meeting, Tokyo, Japan.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant Supervision for Relation Extraction without Labeled Data. In Proceedings of the Joint Conference of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/IJCNLP), Singapore.

Katharina Morik, Peter Brockhausen, and Thorsten Joachims. 1999. Combining Statistical Learning with a Knowledge-based Approach - A Case Study in Intensive Care Monitoring. In Proceedings of the International Conference on Machine Learning (ICML).

Andreas Stolcke. 2002. SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the In-
ternational Conference on Spoken Language Pro-
cessing (ICSLP), Denver, CO, USA.
Veselin Stoyanov and Claire Cardie. 2011. Auto-
matically Creating General-Purpose Opinion Sum-
maries from Text. In Proceedings of Recent Ad-
vances in Natural Language Processing (RANLP),
Hissar, Bulgaria.
Veselin Stoyanov, Claire Cardie, Diane Litman, and
Janyce Wiebe. 2004. Evaluating an Opinion An-
notation Scheme Using a New Multi-Perspective
Question and Answer Corpus. In Proceedings of
the AAAI Spring Symposium on Exploring Attitude
and Affect in Text, Menlo Park, CA, USA.
Cigdem Toprak, Niklas Jakob, and Iryna Gurevych.
2010. Sentence and Expression Level Annotation
of Opinions in User-Generated Discourse. In Pro-
ceedings of the Annual Meeting of the Associa-
tion for Computational Linguistics (ACL), Uppsala,
Sweden.
Joseph Turian, Lev Ratinov, and Yoshua Bengio.
2010. Word Representations: A Simple and Gen-
eral Method for Semi-supervised Learning. In Pro-
ceedings of the Annual Meeting of the Associa-
tion for Computational Linguistics (ACL), Uppsala,
Sweden.
Janyce Wiebe, Theresa Wilson, Rebecca Bruce,
Matthew Bell, and Melanie Martin. 2004. Learn-
ing Subjective Language. Computational Linguis-
tics, 30(3).
Janyce Wiebe, Theresa Wilson, and Claire Cardie.
2005. Annotating Expressions of Opinions and
Emotions in Language. Language Resources and
Evaluation, 39(2/3):164–210.
Michael Wiegand and Dietrich Klakow. 2010. Convo-
lution Kernels for Opinion Holder Extraction. In
Proceedings of the Human Language Technology
Conference of the North American Chapter of the
ACL (HLT/NAACL), Los Angeles, CA, USA.
Michael Wiegand and Dietrich Klakow. 2011a. Proto-
typical Opinion Holders: What We can Learn from
Experts and Analysts. In Proceedings of Recent Ad-
vances in Natural Language Processing (RANLP),
Hissar, Bulgaria.
Michael Wiegand and Dietrich Klakow. 2011b. The
Role of Predicates in Opinion Holder Extraction. In
Proceedings of the RANLP Workshop on Informa-
tion Extraction and Knowledge Acquisition (IEKA),
Hissar, Bulgaria.
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann.
2005. Recognizing Contextual Polarity in Phrase-
level Sentiment Analysis. In Proceedings of the
Conference on Human Language Technology and
Empirical Methods in Natural Language Process-
ing (HLT/EMNLP), Vancouver, BC, Canada.

Skip N-grams and Ranking Functions for Predicting Script Events

Bram Jans, KU Leuven, Leuven, Belgium, bram.jans@gmail.com
Steven Bethard, University of Colorado Boulder, Boulder, Colorado, USA, steven.bethard@colorado.edu
Ivan Vulic, KU Leuven, Leuven, Belgium, ivan.vulic@cs.kuleuven.be
Marie Francine Moens, KU Leuven, Leuven, Belgium, sien.moens@cs.kuleuven.be

Abstract

In this paper, we extend current state-of-the-art research on unsupervised acquisition of scripts, that is, stereotypical and frequently observed sequences of events. We design, evaluate and compare different methods for constructing models for script event prediction: given a partial chain of events in a script, predict other events that are likely to belong to the script. Our work aims to answer key questions about how best to (1) identify representative event chains from a source text, (2) gather statistics from the event chains, and (3) choose ranking functions for predicting new script events. We make several contributions, introducing skip-grams for collecting event statistics, designing improved methods for ranking event predictions, defining a more reliable evaluation metric for measuring predictiveness, and providing a systematic analysis of the various event prediction models.

1 Introduction

There has been recent interest in automatically acquiring world knowledge in the form of scripts (Schank and Abelson, 1977), that is, frequently recurring situations that have a stereotypical sequence of events, such as a visit to a restaurant. All of the techniques so far proposed for this task share a common sub-task: given an event or partial chain of events, predict other events that belong to the same script (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; Chambers and Jurafsky, 2011; Manshadi et al., 2008; McIntyre and Lapata, 2009; McIntyre and Lapata, 2010; Regneri et al., 2010). Such a model can then serve as input to a system that identifies the order of the events within that script (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009) or that generates a story using the selected events (McIntyre and Lapata, 2009; McIntyre and Lapata, 2010).

In this article, we analyze and compare techniques for constructing models that, given a partial chain of events, predict other events that belong to the script. In particular, we consider the following questions:

- How should representative chains of events be selected from the source text?
- Given an event chain, how should statistics be gathered from it?
- Given event n-gram statistics, which ranking function best predicts the events for a script?

In the process of answering these questions, this article makes several contributions to the field of script and narrative event chain understanding:

- We explore for the first time the use of skip-grams for collecting narrative event statistics, and show that this approach performs better than classic n-gram statistics.
- We propose a new method for ranking events given a partial script, and show that it performs substantially better than ranking methods from prior work.
- We propose a new evaluation procedure (using Recall@N) for the cloze test, and advocate its usage instead of the average rank used previously in the literature.
- We provide a systematic analysis of the interactions between the choices made when constructing an event prediction model.
to a system that identifies the order of the events constructing an event prediction model.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 336–344,
Avignon, France, April 23-27 2012. © 2012 Association for Computational Linguistics
Section 2 gives an overview of the prior work related to this task. Section 3 lists and briefly describes different approaches that try to provide answers to the three questions posed in this introduction, while Section 4 presents the results of our experiments and reports on our findings. Finally, Section 5 provides a conclusive discussion along with ideas for future work.

2 Prior Work

Our work is primarily inspired by the work of Chambers and Jurafsky, which combined a dependency parser with coreference resolution to collect event script statistics and predict script events (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009). For each document in their training corpus, they used coreference resolution to identify all the entities, and a dependency parser to identify all verbs that had an entity as either a subject or object. They defined an event as a verb plus a dependency type (either subject or object), and collected, for each entity, the chain of events that it participated in. They then calculated pointwise mutual information (PMI) statistics over all the pairs of events that occurred in the event chains in their corpus. To predict a new script event given a partial chain of events, they selected the event with the highest sum of PMIs with all the events in the partial chain.

The work of McIntyre and Lapata followed in this same paradigm (McIntyre and Lapata, 2009; McIntyre and Lapata, 2010), collecting chains of events by looking at entities and the sequence of verbs for which they were a subject or object. They also calculated statistics over the collected event chains, though they considered both event bigram and event trigram counts. Rather than predicting an event for a script, however, they used these simple counts to predict the next event that should be generated for a children's story.

Manshadi and colleagues were concerned about the scalability of running parsers and coreference over a large collection of story blogs, and so used a simplified version of event chains: just the main verb of each sentence (Manshadi et al., 2008). Rather than rely on an ad-hoc summation of PMIs, they apply language modeling techniques (specifically, a smoothed 5-gram model) over the sequence of events in the collected chains. However, they only tested these language models on sequencing tasks (e.g. is the real sequence better than a random sequence?) rather than on prediction tasks (e.g. which event should follow these events?). In the current article, we attempt to shed some light on these previous works by comparing different ways of collecting and using event chains.

3 Methods

Models that predict script events typically have three stages. First, a large corpus is processed to find event chains in each of the documents. Next, statistics over these event chains are gathered and stored. Finally, the gathered statistics are used to create a model that takes as input a partial script and produces as output a ranked list of events for that script. The following sections give more details about each of these stages and identify the decisions that must be made in each step; an overview of the whole process with an example source text is displayed in Figure 1.

3.1 Identifying Event Chains

Event chains are typically defined as a sequence of actions performed by some actor. Formally, an event chain C for some actor a is a partially ordered set of events (v, d), where each v is a verb that has the actor a as its dependency d. Following prior work (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; McIntyre and Lapata, 2009; McIntyre and Lapata, 2010), these event chains are identified by running a coreference system and a dependency parser. Then for each entity identified by the coreference system, all verbs that have a mention of that entity as one of their dependencies are collected.[1] The event chain is then the sequence of (verb, dependency-type) tuples. For example, given the sentence "A Crow was sitting on a branch of a tree when a Fox observed her", the event chain for the Crow would be (sitting, SUBJECT), (observed, OBJECT).

Once event chains have been identified, the most appropriate event chains for training the model must be selected. The goal of this process is to select the subset of the event chains identified by the coreference system and the dependency parser that look to be the most reliable. Both the coreference system and the dependency parser make some errors, so not all event chains are necessarily useful for training a model. The three strategies we consider for this selection process are:

[1] Also following prior work, we consider only the dependencies subject and object.
[Figure 1 shows the example source text "John woke up. He opened his eyes and yawned. Then he crossed the room and walked to the door. There he saw Mary. Mary smiled and kissed him. Then they both blushed." processed in three steps: (1) identifying event chains (for JOHN and MARY, under the selection strategies all chains, long chains, the longest chain), (2) gathering event chain statistics (regular, 1-skip and 2-skip bigrams), and (3) predicting script events for a cloze-test partial script with the ranking functions C&J PMI, ordered PMI and bigram probabilities.]

Figure 1: An overview of the whole linear work flow showing the three key steps: identifying event chains, collecting statistics out of the chains and predicting a missing event in a script. The figure also displays how a partial script for evaluation (Section 4.3) is constructed. We show the whole process for Mary's event chain only, but the same steps are followed for John's event chain.
- Select all event chains, that is, all sequences of two or more events linked by common actors. This strategy will produce the largest number of event chains to train a model from, but it may produce noisier training data, as the very short chains included by this strategy may be less likely to represent real scripts.

- Select all long event chains consisting of 5 or more events. This strategy will produce a smaller number of event chains, but as they are longer, they may be more likely to represent scripts.

- Select only the longest event chain. This strategy will produce the smallest number of event chains from a corpus. However, they may be of higher quality, since this strategy looks for the key actor in each story, and only uses the events that are tied together by that key actor. Since this is the single actor that played the largest role in the story, its actions may be the most likely to represent a real script.

3.2 Gathering Event Chain Statistics

Once event chains have been collected from the corpus, the statistics necessary for constructing the event prediction model must be gathered. Following prior work (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; Manshadi et al., 2008; McIntyre and Lapata, 2009; McIntyre and Lapata, 2010), we focus on gathering statistics about the n-grams of events that occur in the collected event chains. Specifically, we look at strategies for collecting bigram statistics, the most common type of statistics gathered in prior work. We consider three strategies for collecting bigram statistics:

- Regular bigrams. We find all pairs of events that are adjacent in an event chain and collect the number of times each event pair was observed. For example, given the chain of events (saw, SUBJ), (kissed, OBJ), (blushed, SUBJ), we would extract the two event bigrams: ((saw, SUBJ), (kissed, OBJ))
and ((kissed, OBJ), (blushed, SUBJ)). In addition to the event pair counts, we also collect the number of times each event was observed individually, to allow for various conditional probability calculations. This strategy follows the classic approach for most language models.

- 1-skip bigrams. We collect pairs of events that occur with 0 or 1 events intervening between them. For example, given the chain (saw, SUBJ), (kissed, OBJ), (blushed, SUBJ), we would extract three bigrams: the two regular bigrams ((saw, SUBJ), (kissed, OBJ)) and ((kissed, OBJ), (blushed, SUBJ)), plus the 1-skip-bigram ((saw, SUBJ), (blushed, SUBJ)). This approach to collecting n-gram statistics is sometimes called skip-gram modeling, and it can reduce data sparsity by extracting more event pairs per chain (Guthrie et al., 2006). It has not previously been applied in the task of predicting script events, but it may be quite appropriate to this task because in most scripts it is possible to skip some events in the sequence.

- 2-skip bigrams. We collect pairs of events that occur with 0, 1 or 2 intervening events, similar to what was done in the 1-skip bigrams strategy. This will extract even more pairs of events from each chain, but it is possible the statistics over these pairs of events will be noisier.

3.3 Predicting Script Events

Once statistics over event chains have been collected, it is possible to construct the model for predicting script events. The input of this model will be a partial script c of n events, where c = c_1 c_2 ... c_n = (v_1, d_1), (v_2, d_2), ..., (v_n, d_n), and the output of this model will be a ranked list of events where the highest ranked events are the ones most likely to belong to the event sequence in the script. Thus, the key issue for this model is to define the function f for ranking events. We consider three such ranking functions:

- Chambers & Jurafsky PMI. Chambers and Jurafsky (2008) define their event ranking function based on pointwise mutual information. Given a partial script c as defined above, they consider each event e = (v', d') collected from their corpus, and score it as the sum of the pointwise mutual informations between the event e and each of the events in the script:

  f(e, c) = \sum_{i=1}^{n} \log \frac{P(c_i, e)}{P(c_i) P(e)}

  Chambers and Jurafsky's description of this score suggests that it is unordered, such that P(a, b) = P(b, a). Thus the probabilities must be defined as:

  P(e_1, e_2) = \frac{C(e_1, e_2) + C(e_2, e_1)}{\sum_{e_i} \sum_{e_j} C(e_i, e_j)}

  P(e) = \frac{C(e)}{\sum_{e'} C(e')}

  where C(e_1, e_2) is the number of times that the ordered event pair (e_1, e_2) was counted in the training data, and C(e) is the number of times that the event e was counted.

- Ordered PMI. A variation on the approach of Chambers and Jurafsky is to have a score that takes the order of the events in the chain into account. In this scenario, we assume that in addition to the partial script of events, we are given an insertion point, m, where the new event should be added. The score is then defined as:

  f(e, c) = \sum_{k=1}^{m} \log \frac{P(c_k, e)}{P(c_k) P(e)} + \sum_{k=m+1}^{n} \log \frac{P(e, c_k)}{P(e) P(c_k)}

  where the probabilities are defined as:

  P(e_1, e_2) = \frac{C(e_1, e_2)}{\sum_{e_i} \sum_{e_j} C(e_i, e_j)}

  P(e) = \frac{C(e)}{\sum_{e'} C(e')}

  This approach uses pointwise mutual information but also models the event chain in the order it was observed.

- Bigram probabilities. Finally, a natural ranking function, which has not been applied to the script event prediction task (but has
been applied to related tasks (Manshadi et al., 2008)) is to use the bigram probabilities of language modeling rather than pointwise mutual information scores. Again, given an insertion point m for the event in the script, we define the score as:

  f(e, c) = \sum_{k=1}^{m} \log P(e | c_k) + \sum_{k=m+1}^{n} \log P(c_k | e)

  where the conditional probability is defined as:[2]

  P(e_1 | e_2) = \frac{C(e_1, e_2)}{C(e_2)}

  This approach scores an event based on the probability that it was observed following all the events before it in the chain and preceding all the events after it in the chain. This approach most directly models the event chain in the order it was observed.

4 Experiments

Our experiments aimed to answer three questions: Which event chains are worth keeping? How should event bigram counts be collected? And which ranking method is best for predicting script events? To answer these questions we use two corpora, the Reuters Corpus and the Andrew Lang Fairy Tale Corpus, to evaluate our three different chain selection methods, {all chains, long chains, the longest chain}, our three different bigram counting methods, {regular bigrams, 1-skip bigrams, 2-skip bigrams}, and our three different ranking methods, {Chambers & Jurafsky PMI, ordered PMI, bigram probabilities}.

4.1 Corpora

We consider two corpora for evaluation:

- Reuters Corpus, Volume 1[3] (Lewis et al., 2004): a large collection of 806,791 news stories written in English concerning a number of different topics such as politics, economics, sports, etc., strongly varying in length, topics and narrative structure.

- Andrew Lang Fairy Tale Corpus[4]: a small collection of 437 children's stories with an average length of 125 sentences, used previously for story generation by McIntyre and Lapata (2009).

In general, the Reuters Corpus is much larger and allows us to see how well script events can be predicted when a lot of data is available, while the Andrew Lang Fairy Tale Corpus is much smaller, but has a more straightforward narrative structure that may make identifying scripts simpler.

4.2 Corpus Processing

Constructing a model for predicting script events requires a corpus that has been parsed with a dependency parser, and whose entities have been identified via a coreference system. We therefore processed our corpora by (1) filtering out non-narrative articles, (2) applying a dependency parser, (3) applying a coreference resolution system and (4) identifying event chains via entities and dependencies.

First, articles that had no narrative content were removed from the corpora. In the Reuters Corpus, we removed all files solely listing stock exchange values, interest rates, etc., as well as all articles that were simply summaries of headlines from different countries or cities. After removing these files, the Reuters corpus was reduced to 788,245 files. Removing files from the Fairy Tale corpus was not necessary: all 437 stories were retained.

We then applied the Stanford Parser (Klein and Manning, 2003) to identify the dependency structure of each sentence in each article in the corpus. This parser produces a constituent-based syntactic parse tree for each sentence, and then converts this tree to a collapsed dependency structure via a set of tree patterns.

Next we applied the OpenNLP coreference engine[5] to identify the entities in each article, and the noun phrases that were mentions of each entity.

[2] Note that predicted bigram probabilities are calculated in this way for both classic language modeling and skip-gram modeling. In skip-gram modeling, skips in the n-grams are only used to increase the size of the training data; prediction is performed exactly as in classic language modeling.
[3] http://trec.nist.gov/data/reuters/reuters.html
[4] http://www.mythfolklore.net/andrewlang/
[5] http://incubator.apache.org/opennlp/
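The bigram-counting strategies of Section 3.2 can be illustrated with the following sketch (not the authors' code; the data structures are invented for the example). With k = 0 this is regular bigram counting; k = 1 and k = 2 give 1-skip and 2-skip bigrams respectively.

```python
# Sketch of (skip-)bigram counting over event chains.
# An event is a (verb, dependency) tuple; a chain is a list of events.
from collections import Counter

def count_skip_bigrams(chains, k):
    pair_counts, event_counts = Counter(), Counter()
    for chain in chains:
        for i, e1 in enumerate(chain):
            event_counts[e1] += 1
            # pair e1 with the next event and with up to k further events after it
            for j in range(i + 1, min(i + 2 + k, len(chain))):
                pair_counts[(e1, chain[j])] += 1
    return pair_counts, event_counts

chain = [("saw", "SUBJ"), ("kissed", "OBJ"), ("blushed", "SUBJ")]
pairs, events = count_skip_bigrams([chain], k=1)
# 1-skip bigrams: the two adjacent pairs plus ((saw, SUBJ), (blushed, SUBJ))
print(sorted(pairs))
```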
Finally, to identify the event chains, we took each of the entities proposed by the coreference system, walked through each of the noun phrases associated with that entity, retrieved any subject or object dependencies that linked a verb to that noun phrase, and created an event chain from the sequence of (verb, dependency-type) tuples in the order that they appeared in the text.

4.3 Evaluation Metrics

We follow the approach of Chambers and Jurafsky (2008), evaluating our models for predicting script events in a narrative cloze task. The narrative cloze task is inspired by the classic psychological cloze task in which subjects are given a sentence with a word missing and asked to fill in the blank (Taylor, 1953). Similarly, in the narrative cloze task, the system is given a sequence of events from a script where one event is missing, and asked to predict the missing event. The difficulty of a cloze task depends a lot on the context around the missing item: in some cases it may be quite predictable, but in many cases there is no single correct answer, though some answers are more probable than others. Thus, performing well on a cloze task is more about ranking the missing event highly, and not about proposing a single correct event.

In this way, narrative cloze is like perplexity in a language model. However, where perplexity measures how good the model is at predicting a script event given the previous events in the script, narrative cloze measures how good the model is at predicting what is missing between events in the script. Thus narrative cloze is somewhat more appropriate to our task, and at the same time simplifies comparisons to prior work.

Rather than manually constructing a set of scripts on which to run the cloze test, we follow Chambers and Jurafsky in reserving a section of our parsed corpora for testing, and then using the event chains from that section as the scripts for which the system must predict events. Given an event chain of length n, we run n cloze tests, with a different one of the n events removed each time to create a partial script from the remaining n-1 events (see Figure 1). Given a partial script as input, an accurate event prediction model should rank the missing event highly in the guess list that it generates as output.

We consider two approaches to evaluating the guess lists produced in response to narrative cloze tests. Both are defined in terms of a test collection C, consisting of |C| partial scripts, where for each partial script c with missing event e, rank_sys(c) is the rank of e in the system's guess list for c.

- Average rank. The average rank of the missing event across all of the partial scripts:

  \frac{1}{|C|} \sum_{c \in C} rank_{sys}(c)

  This is the evaluation metric used by Chambers and Jurafsky (2008).

- Recall@N. The fraction of partial scripts where the missing event is ranked N or less[6] in the guess list:

  \frac{1}{|C|} |\{c : c \in C \wedge rank_{sys}(c) \leq N\}|

  In our experiments we use N = 50, but results are roughly similar for lower and higher values of N.

Recall@N has not been used before for evaluating models that predict script events; however, we suggest that it is a more reliable metric than average rank. When calculating the average rank, the length of the guess lists will have a significant influence on results. For instance, if a small model is trained with only a small vocabulary of events, its guess lists will usually be shorter than a larger model's, but if both models predict the missing event at the bottom of the list, the larger model will get penalized more. Recall@N does not have this issue: it is not influenced by the length of the guess lists.

An alternative evaluation metric would have been mean average precision (MAP), a metric commonly used to evaluate information retrieval. Mean average precision reduces to mean reciprocal rank (MRR) when there is only a single answer, as in the case of narrative cloze, and would have scored the ranked lists as:

  \frac{1}{|C|} \sum_{c \in C} \frac{1}{rank_{sys}(c)}

Note that mean reciprocal rank has the same issues with guess list length that average rank does. Thus, since it does not aid us in comparing to prior work, and it has the same deficiencies as average rank, we do not report MRR in this article.

[6] Rank 1 is the event that the system predicts is most probable, so we want the missing event to have the smallest rank possible.
partial script c with missing event e, ranksys (c) is possible.

341
Chain selection      Av. rank   Recall@50
all chains           502        0.5179
long chains          549        0.4951
the longest chain    546        0.4984

Table 1: Chain selection methods for the Reuters corpus (2-skip + bigram prob.): comparison of average ranks and Recall@50.

Chain selection      Av. rank   Recall@50
all chains           1650       0.3376
long chains          452        0.3461
the longest chain    1534       0.3376

Table 2: Chain selection methods for the Fairy Tale corpus (2-skip + bigram prob.): comparison of average ranks and Recall@50.

Bigram selection     Av. rank   Recall@50
regular bigrams      789        0.4886
1-skip bigrams       630        0.4951
2-skip bigrams       502        0.5179

Table 3: Event bigram selection methods for the Reuters corpus (all chains + bigram prob.): comparison of average ranks and Recall@50.

Bigram selection     Av. rank   Recall@50
regular bigrams      2363       0.3227
1-skip bigrams       1690       0.3418
2-skip bigrams       1650       0.3376

Table 4: Event bigram selection methods for the Fairy Tale corpus (all chains + bigram prob.): comparison of average ranks and Recall@50.

4.4 Results

We considered all 27 combinations of our chain selection methods, bigram counting methods, and ranking methods: {all chains, long chains, the longest chain} x {regular bigrams, 1-skip bigrams, 2-skip bigrams} x {Chambers & Jurafsky PMI, ordered PMI, bigram probabilities}. The best among these 27 combinations for the Reuters corpus was {all chains} x {2-skip bigrams} x {bigram probabilities}, achieving an average rank of 502 and a Recall@50 of 0.5179.

Since viewing all the combinations at once would be confusing, the following sections instead investigate each decision (selection, counting, ranking) one at a time. While one decision is varied across its three choices, the other decisions are held to their values in the best model above.

4.4.1 Identifying Event Chains

We first try to answer the question: How should representative chains of events be selected from the source text? Tables 1 and 2 show performance when we vary the strategy for selecting event chains, while fixing the counting method to 2-skip bigrams, and fixing the ranking method to bigram probabilities.

For the Reuters collection, we see that using all chains gives a lower average rank and a higher Recall@50 than either of the strategies that select a subset of the event chains. The explanation is probably simple: using all chains produces more than 700,000 bigrams from the Reuters corpus, while using only the long chains produces only around 300,000. So more data is better data for predicting script events.

For the Fairy Tale collection, long chains gives the lowest average rank and highest Recall@50. In this collection, there is apparently some benefit to filtering the shorter event chains, probably because the collection is small enough that the noise introduced from dependency and coreference errors plays a larger role.

4.4.2 Gathering Event Chain Statistics

We next try to answer the question: Given an event chain, how should statistics be gathered from it? Tables 3 and 4 show performance when we vary the strategy for counting event pairs, while fixing the selection method to all chains, and fixing the ranking method to bigram probabilities.

For the Reuters corpus, 2-skip bigrams achieves the lowest average rank and the highest Recall@50. For the Fairy Tale corpus, 1-skip bigrams and 2-skip bigrams perform similarly, and both have lower average rank and higher Recall@50 than regular bigrams.

Skip-grams probably outperform regular n-grams on both of these corpora because the skip-grams provide many more event pairs over which to calculate statistics: in the Reuters corpus, regular bigrams extracts 737,103 bigrams, while 2-skip bigrams extracts 1,201,185 bigrams. Though skip-grams have not been applied to predicting script events before, it seems that they are a good fit, and better capture statistics about narrative event chains than regular n-grams do.
342
all bigrams + 2-skip ing the intuition that events do not have to appear
Ranking method Av. rank Recall@50 strictly one after another to be closely semantically
C&J PMI 2052 0.1954
related, skip-grams decrease data sparsity and in-
ordered PMI 3584 0.1694
bigram prob. 502 0.5179 crease the size of the training data.
Second, our novel bigram probabilities ranking
Table 5: Ranking methods for the Reuters corpus - function outperforms the other ranking methods.
comparison of average ranks and Recall@50. In particular, it outperforms the state-of-the-art
pointwise mutual information method introduced
all bigrams + 2-skip by Chambers and Jurafsky (2008), and it does so
Ranking method Av. rank Recall@50 by a large margin, more than doubling the Re-
C&J PMI 1455 0.1975
call@50 on the Reuters corpus. The key insight
ordered PMI 2460 0.0467
bigram prob. 1650 0.3376 here is that, when modeling events in a script, a
language-model-like approach better fits the task
Table 6: Ranking methods for the Fairy Tale corpus - than a mutual information approach.
comparison of average ranks and Recall@50. Third, we have discussed why Recall@N is a
better and more consistent evaluation metric than
Average rank. However, both evaluation metrics
4.4.3 Predicting Script Events
suffer from the strictness of the narrative cloze test,
Finally, we try to answer the question: Given which accepts only one event being the correct
event n-gram statistics, which ranking function event, while it is sometimes very difficult, even
best predicts the events for a script? Tables 5 and for humans, to predict the missing events, and
6 show performance when we vary the strategy for sometimes more solutions are possible and equally
ranking event predictions, while fixing the selec- correct. In future research, our goal is to design
tion method to all chains, and fixing the counting a better evaluation framework which is more suit-
method to 2-skip bigrams. able for this task, where credit can be given for
For both Reuters and the Fairy Tale corpus, Re- proposed script events that are appropriate but not
call@50 identifies bigram probabilities as the best identical to the ones observed in a text.
ranking function by far. On the Reuters corpus Fourth, we have observed some differences in
the Chambers & Jurafsky PMI ranking method results between the Reuters and the Fairy Tale
achieves Recall@50 of only 0.1954, while bigram corpora. The results for Reuters are consistently
probabilities ranking method achieves 0.5179. The better (higher Recall@50, lower average rank), al-
gap is also quite large on the Fairy Tales corpus: though fairy tales contain a plainer narrative struc-
0.1975 vs. 0.3376. ture, which should be more appropriate to our task.
On the Reuters corpus, average rank also identi- This again leads us to the conclusion that more
fies bigram probabilities as the best ranking func- data (even with more noise as in Reuters) leads to
tion, yet for the Fairy Tales corpus, Chambers & a greater coverage of events, better overall models
Jurafsky PMI and bigram probabilities have simi- and, consequently, to more accurate predictions.
lar average ranks. This inconsistency is probably Still, the Reuters corpus seems to be far from a
due to the flaws in the average rank evaluation perfect corpus for research in the automatic acqui-
measure that were discussed in Section 4.3 the sition of scripts, since only a small portion of the
measure is overly sensitive to the length of the corpus contains true narratives. Future work must
guess list, particularly when the missing event is therefore gather a large corpus of true narratives,
ranked lower, as it is likely to be when training on like fairy tales and childrens stories, whose sim-
a smaller corpus like the Fairy Tales corpus. ple plot structures should provide better learning
material, both for models predicting script events,
5 Discussion
and for related tasks like automatic storytelling
Our experiments have led us to several important (McIntyre and Lapata, 2009).
conclusions. First, we have introduced skip-grams One of the limitations of the work presented
and proved their utility for acquiring script knowl- here is that it takes a fairly linear, n-gram-based ap-
edge our models that employ skip bigrams score proach to characterizing story structure. We think
consistently higher on event prediction. By follow- such an approach is useful because it forms a natu-
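As an illustration of the two families of ranking functions contrasted here, the sketch below scores a candidate event against a context chain either with a PMI-style score or with a smoothed bigram probability. The smoothing constants and the summation over context events are assumptions made for the example, not the exact formulation used in the experiments.

```python
import math

def pmi_score(pair_counts, event_counts, total_pairs, a, b):
    """PMI-style association of an ordered pair (a, b); a small floor avoids log(0)."""
    p_ab = pair_counts.get((a, b), 0.5) / total_pairs
    total_events = sum(event_counts.values())
    p_a = event_counts[a] / total_events
    p_b = event_counts[b] / total_events
    return math.log(p_ab / (p_a * p_b))

def bigram_prob_score(pair_counts, event_counts, a, b, alpha=1.0):
    """Language-model-style add-alpha estimate of P(b | a)."""
    vocab = len(event_counts)
    return (pair_counts.get((a, b), 0) + alpha) / (event_counts[a] + alpha * vocab)

def rank(candidates, context, score):
    """Order candidate events by their total score against every context event.
    `score` is a two-argument function; bind the count tables beforehand,
    e.g. with functools.partial(bigram_prob_score, pair_counts, event_counts)."""
    totals = {c: sum(score(e, c) for e in context) for c in candidates}
    return sorted(candidates, key=totals.get, reverse=True)
```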
Third, we have discussed why Recall@N is a better and more consistent evaluation metric than average rank. However, both evaluation metrics suffer from the strictness of the narrative cloze test, which accepts only one event as the correct event, while it is sometimes very difficult, even for humans, to predict the missing events, and sometimes more solutions are possible and equally correct. In future research, our goal is to design a better evaluation framework which is more suitable for this task, where credit can be given for proposed script events that are appropriate but not identical to the ones observed in a text.

Fourth, we have observed some differences in results between the Reuters and the Fairy Tale corpora. The results for Reuters are consistently better (higher Recall@50, lower average rank), although fairy tales contain a plainer narrative structure, which should be more appropriate to our task. This again leads us to the conclusion that more data (even with more noise, as in Reuters) leads to a greater coverage of events, better overall models and, consequently, to more accurate predictions. Still, the Reuters corpus seems to be far from a perfect corpus for research in the automatic acquisition of scripts, since only a small portion of the corpus contains true narratives. Future work must therefore gather a large corpus of true narratives, like fairy tales and children's stories, whose simple plot structures should provide better learning material, both for models predicting script events and for related tasks like automatic storytelling (McIntyre and Lapata, 2009).

One of the limitations of the work presented here is that it takes a fairly linear, n-gram-based approach to characterizing story structure. We think such an approach is useful because it forms a natural baseline for the task (as it does in many other tasks such as named entity tagging and language modeling). However, story structure is seldom strictly linear, and future work should consider models based on grammatical or discourse links that can capture the more complex nature of script events and story structure.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments. This research was carried out as a master's thesis in the framework of the TERENCE European project (EU FP7-257410).

References

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 789-797.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 602-610.

Nathanael Chambers and Dan Jurafsky. 2011. Template-based information extraction without the templates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 976-986.

David Guthrie, Ben Allison, W. Liu, Louise Guthrie, and Yorick Wilks. 2006. A closer look at skip-gram modelling. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pages 1222-1225.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423-430.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397.

Mehdi Manshadi, Reid Swanson, and Andrew S. Gordon. 2008. Learning a probabilistic model of event sequences from internet weblog stories. In Proceedings of the Twenty-First International Florida Artificial Intelligence Research Society Conference.

Neil McIntyre and Mirella Lapata. 2009. Learning to tell tales: A data-driven approach to story generation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 217-225.

Neil McIntyre and Mirella Lapata. 2010. Plot induction and evolutionary search for story generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1562-1572.

Michaela Regneri, Alexander Koller, and Manfred Pinkal. 2010. Learning script knowledge with web experiments. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 979-988.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, plans, goals, and understanding: an inquiry into human knowledge structures. Lawrence Erlbaum Associates.

Wilson L. Taylor. 1953. Cloze procedure: a new tool for measuring readability. Journalism Quarterly, 30:415-433.
The Problem with Kappa
David M W Powers
Centre for Knowledge & Interaction Technology, CSEM
Flinders University
David.Powers@flinders.edu.au
Abstract

It is becoming clear that traditional evaluation measures used in Computational Linguistics (including Error Rates, Accuracy, Recall, Precision and F-measure) are of limited value for unbiased evaluation of systems, and are not meaningful for comparison of algorithms unless both the dataset and algorithm parameters are strictly controlled for skew (Prevalence and Bias). The use of techniques originally designed for other purposes, in particular Receiver Operating Characteristics Area Under Curve, plus variants of Kappa, has been proposed to fill the void.

This paper aims to clear up some of the confusion relating to evaluation, by demonstrating that the usefulness of each evaluation method is highly dependent on the assumptions made about the distributions of the dataset and the underlying populations. The behaviour of a number of evaluation measures is compared under common assumptions.

Deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa but leave Powers Kappa unchanged. For most performance evaluation purposes, the latter is thus most appropriate, whilst for comparison of behaviour, Matthews Correlation is recommended.

Introduction

Research in Computational Linguistics usually requires some form of quantitative evaluation. A number of traditional measures borrowed from Information Retrieval (Manning & Schütze, 1999) are in common use but there has been considerable critical evaluation of these measures themselves over the last decade or so (Entwisle & Powers, 1998; Flach, 2003; Ben-David, 2008). Receiver Operating Analysis (ROC) has been advocated as an alternative by many, and in particular has been used by Fürnkranz and Flach (2005), Ben-David (2008) and Powers (2008) to better understand both learning algorithms and the relationship between the various measures, and the inherent biases that make many of them suspect. One of the key advantages of ROC is that it provides a clear indication of chance level performance as well as a less well known indication of the relative cost weighting of positive and negative cases for each possible system or parameterization represented.

ROC Area Under the Curve (Fig. 1) has also been used as a performance measure but averages over the false positive rate (Fallout) and is thus a function of cost that is dependent on the classifier rather than the application. For this reason it has come in for considerable criticism and a number of variants and alternatives have been proposed (e.g. AUK, Kaymak et al., 2010, and H-measure, Hand, 2009). An AUC curve that is at least as good as a second curve at all points is said to dominate it, and indicates that the first classifier is equal or better than the second for all plotted values of the parameters, and all cost ratios. However AUC being greater for one classifier than another does not have such a property; indeed deconvexities within or
intersections of ROC curves are both prima facie evidence that fusion of the parameterized classifiers will be useful (cf. Provost and Fawcett, 2001; Flach and Wu, 2005).

AUK stands for Area Under Kappa, and represents a step in the advocacy of Kappa (Ben-David, 2008a,b) as an alternative to the traditional measures and ROC AUC. Powers (2003, 2007) has also proposed a Kappa-like measure (Informedness) and analysed it in terms of ROC, and there are many more, with Warrens (2010) analyzing the relationships between some of the others.

Systems like RapidMiner (2011) and Weka (Witten and Frank, 2005) provide almost all of the measures we have considered, and many more besides. This encourages the use of multiple measures, and indeed it is now becoming routine to display tables of multiple results for each system, and this is in particular true for the frameworks of some of the challenges and competitions brought to the communities (e.g. 2nd i2b2 Challenge in NLP for Clinical Data, 2011; 2nd Pascal Challenge on HTC, 2011).

This use of multiple statistics is no doubt in response to the criticism levelled at the evaluation mechanisms used in earlier generations of competitions and the above mentioned critiques, but the proliferation of alternate measures in some ways merely compounds the problem. Researchers have the temptation of choosing those that favour their system as they face the dilemma of what to do about competing (and often disagreeing) evaluation measures that they do not completely understand. These systems and competitions also exhibit another issue, the tendency to macro-average over multiple classes, even of measures that are not denominated in class (e.g. that are proportions of predicted labels rather than real classes, as with Precision).

This paper is directed at better understanding some of these new and old measures as well as providing recommendations as to which measures are appropriate in which circumstances.

What's in a Kappa?

In this paper we focus on the Kappa family of measures, as well as some closely related statistics named for other letters of the Greek alphabet, and some measures that we will show behave as Kappa measures although they were not originally defined as such. These include Informedness, Gini Coefficient and single point ROC AUC, which are in fact all equivalent to DeltaP in the dichotomous case, which we deal with first, and to the other Kappas when the marginal prevalences (or biases) match.

1.1 Two classes and non-negative Kappa

Kappa was originally proposed (Cohen, 1960) to compare human ratings in a binary, or dichotomous, classification task. Cohen (1960) recognized that Rand Accuracy did not take chance into account and therefore proposed to subtract off the chance level of Accuracy and then renormalize to the form of a probability:

K(Acc) = [Acc - E(Acc)] / [1 - E(Acc)]    (1)

This leaves the question of how to estimate the expected Accuracy, E(Acc). Cohen (1960) made the assumption that raters would have different distributions that could be estimated as the products of the corresponding marginal coefficients of the contingency table:

                  +ve Class   -ve Class
+ve Prediction    A=TP        B=FP        PP
-ve Prediction    C=FN        D=TN        PN
                  RP          RN          N

Table 1. Statistical and IR Contingency Notation

In order to discuss this further it is important to discuss our notational conventions, and it is noted that in statistics, the letters A-D (upper case or lower case) are conventionally used to label the cells, and their sums may be used to label the marginal cells. However in the literature on ROC analysis, which we follow here, it is usual to talk about true and false positives (that is positive predictions that are correct or incorrect), and conversely true and false negatives. Often upper case is used to indicate counts in the contingency table, which sum to the number of instances, N. In this case lower case letters are used to indicate probabilities, which means that the corresponding upper case values in the contingency table are all divided by N, and n=1.

Statistics relative to (the total numbers of items in) the real classes are called Rates and have the number (or proportion) of Real Positives (RP) or Real Negatives (RN) in the denominator. In this notation, we have Recall = TPR = TP/RP.

Conversely, statistics relative to the (number of) predictions are called Accuracies, so relative to the predictions that label instances positively, Predicted Positives (PP), we have Precision = TPA = TP/PP.
Figure 1. Illustration of ROC Analysis. The solid diagonal represents chance performance for different rates of guessing positive or negative labels. The dotted line represents the convex hull enclosing the results of different systems, thresholds or parameters tested. The (0,0) and (1,1) points represent guessing always negative and always positive and are always nominal systems in a ROC curve. The points along any straight line segment of a convex hull are achievable by probabilistic interpolation of the systems at each end, the gradient represents the cost ratio and all points along the segment, including the endpoints, have the same effective cost benefit. AUC is the area under the curve joining the systems with straight edges and AUCH is the area under the convex hull where points within it are ignored. The height above the chance line of any point represents DeltaP, the Gini Coefficient and also the Dichotomous Informedness of the corresponding system, and also corresponds to twice the area of the triangle between it and the chance line, and thus 2AUC-1 where AUC is calculated on this single point curve (not shown) joining it to (0,0) and (1,1). The (1,0) point represents perfect performance with 100% True Positive Rate and 0% False Negative Rate.

The accuracy of all our predictions, positive or negative, is given by Rand Accuracy = (TP+TN)/N = tp+tn, and this is what is meant in general by the unadorned term Accuracy, or the abbreviation Acc.

Rand Accuracy is the weighted average of Precision and Inverse Precision (probability that negative predictions are correctly labeled), where the weighting is made according to the number of predictions made for the corresponding labels. Rand Accuracy is also the weighted average of Recall and Inverse Recall (probability that negative instances are correctly predicted), where the weighting is made according to the number of instances in the corresponding classes.

The marginal probabilities rp and pp are also known as Prevalence (the class prevalence of positive instances) and Bias (the label bias to positive predictions), and the corresponding probabilities of negative classes and labels are the Inverse Prevalence and Inverse Bias respectively. In the ROC literature, the ratio of negative to positive classes is often referred to as the class ratio or skew. We can similarly also refer to a label ratio, prediction ratio or prediction skew. Note that optimal performance can only be achieved if class skew = label skew.

The Expected True Positives and Expected True Negatives for Cohen Kappa, as well as Chi-squared significance, are estimated as the product of Bias and Prevalence, and the product of Inverse Bias and Inverse Prevalence, resp., where for traditional uses of Kappa for agreement of human raters, the contingency table represents one rater as providing the classification to be predicted by the other rater. Cohen assumes that their distributions of ratings are independent, as reflected both by the margins and the contingencies: ETP = RP*PP; ETN = RN*PN. This gives us E(Acc) = (ETP+ETN)/N = etp+etn.

By contrast the two rater, two class form of Fleiss (1981) Kappa, also known as Scott Pi, assumes that both raters are labeling independently using the same distribution, and that the margins reflect this potential variation. The expected number of positives is thus effectively estimated as the average of the two raters' counts, so that EP = (RP+PP)/2 and EN = (RN+PN)/2, ETP = EP^2 and ETN = EN^2.
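The two expectations can be written down directly from the margins; the snippet below works in proportions (counts divided by N) and is a sketch of the definitions above, not code from the paper.

```python
def expected_accuracy_cohen(rp, rn, pp, pn):
    """Cohen: the two raters' margins are treated as independent,
    so etp = rp*pp and etn = rn*pn (all arguments are proportions of N)."""
    return rp * pp + rn * pn

def expected_accuracy_fleiss(rp, rn, pp, pn):
    """Fleiss / Scott's Pi: both raters share one distribution estimated by
    averaging the margins, ep = (rp+pp)/2 and en = (rn+pn)/2."""
    ep, en = (rp + pp) / 2.0, (rn + pn) / 2.0
    return ep ** 2 + en ** 2

def kappa(acc, e_acc):
    """Eqn (1): K(Acc) = [Acc - E(Acc)] / [1 - E(Acc)]."""
    return (acc - e_acc) / (1.0 - e_acc)
```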
1.2 Inverting Kappa

The definition of Kappa in Eqn (1) can be seen to be applicable to arbitrary definitions of Expected Accuracy, and in order to discover how other measures relate to the family of Kappa measures it is useful to invert Kappa to discover the implicit definition of Expected Accuracy that allows a measure to be interpreted as a form of Kappa. We simply make E(Acc) the subject by multiplying out Eqn (1) to a common denominator and associating factors of E(Acc):

K(Acc) = [Acc - E(Acc)] / [1 - E(Acc)]    (1)
E(Acc) = [Acc - K(Acc)] / [1 - K(Acc)]    (2)

Note that for a given value of Acc the function connecting E(Acc) and K(Acc) is its own inverse:

E(Acc) = fAcc(K(Acc))    (3)
K(Acc) = fAcc(E(Acc))    (4)

For the future we will tend to drop the Acc argument or subscript when it is clear, and we will also subscript E and K with the name or initial of the corresponding definition of Expectation and thus Kappa (viz. Fleiss and Cohen so far).

Note that given Acc and E(Acc) are in the range of 0..1 as probabilities, Kappa is also restricted to this range, and takes the form of a probability.

1.3 Multiclass multirater Kappa

Fleiss (1981) and others sought to generalize the Cohen (1960) definition of Kappa to handle both multiple classes (not just positive/negative) and multiple raters (not just two, one of which we have called real and the other prediction). Fleiss in fact generalized Scott's (1955) Pi in both senses, not Cohen Kappa. The Fleiss Kappa is not formulated as we have done here for exposition, but in terms of pairings (agreements) amongst the raters, who are each assumed to have rated the same number of items, N, but not necessarily all. Krippendorff (1970, 1978) effectively generalizes further by dealing with arbitrary numbers of raters assessing different numbers of items.

Light (1971) and Hubert (1977) successfully generalized Cohen Kappa. Another approach to estimating E(Acc) was taken by Bennett et al. (1955) which basically assumed all classes were equilikely (effectively what use of Accuracy, F-Measure etc. do, although they don't subtract off the chance component).

The Bennett Kappa was generalized by Randolph (2005), but as our starting point is that we need to take the actual margins into account, we do not pursue these further. However, Warrens (2010a) shows that, under certain conditions, Fleiss Kappa is a lower bound of both the Hubert generalization of Cohen Kappa and the Randolph generalization of Bennett Kappa, which is itself correspondingly an upper bound of both the Hubert and the Light generalizations of Cohen Kappa. Unfortunately the conditions are that there is some agreement between the class and label skews (viz. the prevalence and bias of each class/label). Our focus in this paper is the behaviour of the various Kappa measures as we move from strongly matched to strongly mismatched biases.

Cohen (1968) also introduced a weighted variant of Kappa. We have also discussed cost weighting in the context of ROC, and Hand (2009) seeks to improve on ROC AUC by introducing a beta distribution as an estimated cost profile, but we will not discuss them further here as we are more interested in the effectiveness of the classifier overall rather than matching a particular cost profile, and are skeptical about any generic cost distribution. In particular the beta distribution gives priority to central tendency rather than boundary conditions, but boundary conditions are frequently encountered in optimization. Similarly Kaymak et al.'s (2010) proposal to replace AUC by AUK corresponds to a Cohen Kappa reweighting of the ROC that eliminates many of its useful properties, without any expectation that the measure, as an integration across a surrogate cost distribution, has any validity for system selection. Introducing alternative weights is also allowed in the definition of F-Measure, although in practice this is almost invariably employed as the equally weighted harmonic mean of Recall and Precision. Introducing additional weight or distribution parameters just multiplies the confusion as to which measure to believe.

Powers (2003) derived a further multiclass Kappa-like measure from first principles, dubbing it Informedness, based on an analogy of a Bookmaker associating costs/payoffs based on the odds. This is then proven to measure the proportion of time (or probability) a decision is informed versus random, based on the same assumptions re expectation as Cohen Kappa, and we will thus call it Powers Kappa, and derive a formulation of the corresponding expectation.
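For reference, here is a minimal sketch of the dichotomous quantities developed in the following subsections: Informedness (Recall + Inverse Recall - 1), its dual Markedness, and their geometric mean, the Matthews Correlation. It assumes non-degenerate counts (no zero margins) and is an illustration of the definitions, not the author's implementation.

```python
import math

def informedness(tp, fp, fn, tn):
    """Powers' Bookmaker Informedness: Recall + Inverse Recall - 1."""
    return tp / (tp + fn) + tn / (tn + fp) - 1.0

def markedness(tp, fp, fn, tn):
    """The prediction-conditioned dual: Precision + Inverse Precision - 1."""
    return tp / (tp + fp) + tn / (tn + fn) - 1.0

def matthews_correlation(tp, fp, fn, tn):
    """Geometric mean of Informedness and Markedness; the sign is carried by the
    determinant tp*tn - fp*fn, which both duals share."""
    b, m = informedness(tp, fp, fn, tn), markedness(tp, fp, fn, tn)
    return math.copysign(math.sqrt(abs(b * m)), tp * tn - fp * fn)
```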
Powers (2007) further identifies that the dichotomous form of Powers Kappa is equivalent to the Gini coefficient as a deskewed version of the weighted Relative Accuracy proposed by Flach (2003), based on his analysis and deskewing of common evaluation measures in the ROC paradigm. Powers (2007) also identifies that Dichotomous Informedness is equivalent to an empirically derived psychological measure called DeltaP (Perruchet et al., 2004). DeltaP (and its dual DeltaP') were derived based on analysis of human word association data; the combination of this empirical observation with the place of DeltaP as the dichotomous case of Powers Informedness suggests that human association is in some sense optimal. Powers (2007) also introduces a dual of Informedness that he names Markedness, and shows that the geometric mean of Informedness and Markedness is Matthews Correlation, the nominal analog of Pearson Correlation.

Powers Informedness is in fact a variant of Kappa with some similarities to Cohen Kappa, but also some advantages over both Cohen and Fleiss Kappa due to its asymmetric relation with Recall; in the dichotomous form of Powers (2007),

Informedness = Recall + InverseRecall - 1
             = (Recall - Bias) / (1 - Prevalence).

If we think of Kappa as assessing the relationship between two raters, Powers' statistic is not evenhanded, and the Informedness and Markedness duals measure the two directions of prediction, normalizing Recall and Precision. In fact, the relationship with Correlation allows these to be interpreted as regression coefficients for the prediction function and its inverse.

1.4 Kappa vs Correlation

It is often asked why we don't just use Correlation to measure. In fact, Castellan (1966) uses Tetrachoric Correlation, another generalization of Pearson Correlation that assumes that the two class variables are given by underlying normal distributions. Uebersax (1987), Hutchinson (1993) and Bonett and Price (2005) each compare Kappa and Correlation and conclude that there does not seem to be any situation where Kappa would be preferable to Correlation. However, all the Kappa and Correlation variants considered were symmetric, and it is thus interesting to consider the separate regression coefficients underlying it that represent the Powers Kappa duals of Informedness and Markedness, which have the advantage of separating out the influences of Prevalence and Bias (which then allows macro-averaging, which is not admissible for any symmetric form of Correlation or Kappa, as we will discuss shortly). Powers (2007) regards Matthews Correlation as an appropriate measure for symmetric situations (like rater agreement) and generalizes the relationships between Correlation and Significance to the Markedness and Informedness measures. The differences between Informedness and Markedness, which relate to mismatches in Prevalence and Bias, mean that the pair of numbers provides further information about the nature of the relationship between the two classifications or raters, whilst the ability to take the geometric mean of (macro-averaged) Informedness and Markedness means that a single Correlation can be provided when appropriate.

Our aim now is therefore to characterize Informedness (and hence its dual Markedness) as a Kappa measure in relation to the families of Kappa measures represented by Cohen and Fleiss Kappa in the dichotomous case. Note that Warrens (2011) shows that a linearly weighted version of Cohen's (1968) Kappa is in fact a weighted average of dichotomous Kappas. Similarly Powers (2003) shows that his Kappa (Informedness) has this property. Thus it is appropriate to consider the dichotomous case, and from this we can generalize as required.

1.5 Kappa vs Determinant

Warrens (2010c) discusses another commonly used measure, the Odds Ratio ad/bc (in Epidemiology rather than Computer Science or Computational Linguistics). Closely related to this is the Determinant of the Contingency Matrix, dtp = ad-bc = tp-etp (in the Chi-Sqr, Cohen and Powers sense based on independent marginal probabilities). Both show whether the odds favour positives over negatives more for the first rater (real) than the second (predicted): for the ratio it is if it is greater than one, for the difference it is if it is greater than 0. Note that taking logs of all coefficients would maintain the same relationship and that the difference of the logs corresponds to the log of the ratio, mapping into the information domain.

Warrens (2010c) further shows (in cost-weighted form) that Cohen Kappa is given by the following (in the notation of this paper, but preferring the notations Prevalence and Inverse Prevalence to rp and rn for clarity):

KC = dtp / [(Prev*IBias + Bias*IPrev)/2].    (5)

Based on the previous characterization of Fleiss Kappa, we can further characterize it by

KF = dtp / [(Prev+Bias)*(IBias+IPrev)/4].    (6)

Powers (2007) also showed corresponding formulations for Bookmaker Informedness (B, or Powers Kappa = KP), Markedness and Matthews Correlation:

B = dtp / (Prev*IPrev).    (7)
M = dtp / (Bias*IBias).    (8)
C = dtp / sqrt(Prev*IPrev*Bias*IBias).    (9)

These elegant dichotomous forms are straightforward, with the independence assumptions on Bias and Prevalence clear in
Cohen Kappa, the arithmetic means of Bias and 1.7 Averaging
Prevalence clear in Fleiss Kappa, and the We now consider the issue of dealing with
geometric means of Bias and Prevalence in the multiple measures and results of multiple
Matthews Correlation. Further the independence classifiers by averaging. We first consider
of Bias is apparent for Powers Kappa in the averages of some of the individual measures we
Informedness form, and independence of have seen. The averages need not be arithmetic
Prevalence is clear in the Markedness direction. means, or may represent means over the
Note that the names Powers uses suggest that Prevalences and Biases.
we are measuring something about the We will be punctuating our theoretical
information conveyed by the prediction about the discussions and explanations with empirical
class in the case of Informedness, and the demonstrations where we use 1:1 and 4:1
information conveyed to the predictor by the prevalence versus matching and mismatching
class state in the case of Markedness. To the bias to generate the chance level contingency
extent that Prevalence and Bias can be controlled based on marginal independence. We then mix
independently, Informedness and Markedness are in a proportion of informed decisions, with the
independent and Correlation represents the joint remaining decisions made by chance.
probability of information being passed in both Table 2 compares Accuracy and F-Measure
directions! Powers (2007) further proposes using for an informed decision percentage of 0, 100, 15
log formulations of these measures to take them and -15. Note that Powers Kappa or
into the information domain, as well as relating Informedness purports to recover this
them to mutual information, G-squared and chi- proportion or probability.
squared significance. F-Measure is one of the most common
measures in Computational Linguistics and
1.6 Kappa vs Concordance Information Retrieval, being a Harmonic Mean
of Recall and Precision, which in the common
The pairwise approach used by Fleiss Kappa and
unweighted form also is interpretable with
its relatives does not assume raters use a
respect to a mean of Prevalence and Bias:
common distribution, but does assume they are
using the same set, and number of categories. F = tp / [(Prev+Bias)/2] (10)
When undertaking comparison of unconstrained Note that like Recall and Precision, F-Measure
ratings or unsupervised learning, this constraint ignores totally cell D corresponding to tn. This
is removed and we need to use a measure of is an issue when Prevalence and Bias are uneven
concordance to compare clusterings against each or mismatched. In Information Retrieval, it is
other or against a Gold Standard. Some of the often justified on the basis that the number of
concordance measures use operators in irrelevant documents is large and not precisely
probability space and relate closely to the known, but in fact this is due to lack of
techniques here, whilst others operate in knowledge of the number of relevant documents,
information space. See Pfitzner et al. (2009) for which affects Recall. In fact if tn is large with
reviews of clustering comparison/concordance. respect to both rp and pp, and thus with respect
A complete coverage of evaluation would also to components tp, fp and fn, then both tn/pn and
cover significance and the multiple testing tn/rn approach 0 as tn increases without bound.
problem, but we will confine our focus in this As discussed earlier, Rand Accuracy is a
paper to the issue of choice of Kappa or prevalence (real class) weighted average of
Correlation statistic, as well as addressing some Precision and Inverse Precision, as well as a bias
issues relating to the use of macro-averaging. In (prediction label) weighted average of Recall and
this paper we are regarding the choice of Bias as Inverse Precision. It reflects the D (tn) cell unlike
under the control of the experimenter, as we have F, and while it does not remove the effect of
a focus on learned or hand crafted computational chance it does not have the positive bias of F.
linguistics systems. In fact, when we are using Acc = tp + fp (11)
bootstrapping techniques or dealing with We also point out that the differences between
multiple real samples or different subjects or the various Kappas shown in Determinant
ecosystems, Prevalence may also vary. Thus the normalized form in Eqns (5-9) vary only in the
simple marginal assumptions of Cohen or way prevalences and biases are averaged
Powers statistics are the appropriate ones. together in the normalizing denominator.
Informed 1:1/1:1 4:1/4:1 4:1/1:4 We now turn to macro-averaging across
Acc 50% 68% 32% multiple classifiers or raters. The Area Under the
0% Curve measures are all of this form, whether we
F 50% 80% 32%
Acc 100% 100% 100% are talking about ROC, Kappa, Recall-Precision
100% curves or whatever. The controversy over these
F 100% 100% 100%
Acc 57.5% 72.8% 42.2% averages, and macro-averaging in general, relates
15% to one of two issues: 1. The averages are not in
F 57.5% 83% 46.97%
Acc 42.5% 57.8% 27.2% general over the appropriate units or
-15% denominators of the individual statistics; or 2.
F 42.5% 72% 27.2%
Table 2. Accuracy and F-Measure for different The averages are over a classifier determined
mixes of prevalence and bias skew (odds ratio cost function rather than an externally or
shown) as well as different proportions of correct standardly defined cost function. AUK and H-
(informed) answers versus guessing negative Measure seek to address these issues as discussed
proportions imply that the informed decisions are earlier. In fact they both boil down to averaging
deliberately made incorrectly (oracle tells me with an inappropriate distribution of weights.
what to do and I do the opposite). Commonly macro-averaging averages across
classes as average statistics derived for each class
weighted by the cardinality of the class (viz.
From Table 2 we note that the first set of prevalence). In our review above, we cited four
statistics notes the chance level varies from the examples, but we will refer only to WEKA
50% expected for Bias=Prevalence=50%. This is (Witten et al., 2005) here as a commonly used
in fact the E(Acc) used in calculating Cohen system and associated text book that employs
Kappa. Where Prevalences and Biases are equal and advocates macro-averaging. WEKA
and balanced, all common statistics agree averages over tpr, fpr, Recall (yes redundantly),
Recall = Precision = Accuracy = F, and they are Precision, F-Factor and ROC AUC. Only the
interpretable with respect to this 50% chance average over tpr=Recall is actually meaningful,
level. All the Kappas will also agree, as the because only it has the number of members of
different averages of the identical prevalences the class, or its prevalence, as its denominator.
and biases all come down to 50% as well. So Precision needs to be macro-averaged over the
subtracting 50% from 57.5% and normalizing number of predictions for each class, in which
(dividing) by the average effective prevalence of case it is equivalent to micro-averaging.
50%, we return 15% informed decisions in all Other micro-averaged statistics are also
cases (as seen in detail in Table 3). shown, including Kappa (with the expectation
However, F-measure gives an inflated estimate determined from ZeroR predicting the majority
when it focus on the more prevalent positive class, leading to a Cohen-like Kappa).
class, with corresponding bias in the chance AUC will be pointwise for classifiers that
component. dont provide any probabilistic information
Worse still is the strength of the Acc and F associated with label prediction, and thus dont
scores under conditions of matched bias and allow varying a threshold for additional points on
prevalence when the deviation from chance is - the ROC or other threshold curves. In the case
15% - that is making the wrong decision 15% of where multiple threshold points are available,
the time and guessing the rest of the time. In ROC AUC cannot be interpreted as having any
academic terms, if we bump these rates up to relevance to any particular classifier, but is an
25% F-factor gives a High Distinction for average over a range of classifiers. Even then it
guessing 75% of the time and putting the right is not so meaningful as AUCH, which should be
answer for the other 25%, a Distinction for 100% used as classifiers on the convex hull are usually
guessing, and a Credit for guessing 75% of the available. The AUCH measure will then
time and putting a wrong answer for the other dominate any individual classifiers, as if the
25%! In fact, the Powers Kappa corresponds to convex hull is not the same as the single
the methodology of multiple choice marking, classifier it must include points that are above the
where for questions with k+1 choices, a right classifier curve and thus its enclosed area totally
answer gets 1 mark, and a wrong answer gets -1/k includes the area that is enclosed by the
so that guessing achieves an expected mark of 0. individual classifier.
Cohen Kappa achieves a very similar result for Macroaveraging of the curve based on each
unbiased guessing strategies. class in turn as the Positive Class, and weighted
by the size of the positive class, is not two of the more complex cases that both relate to
meaningful as effectively shown by Powers Fleiss Kappa with its mismatch to the marginal
(2003) for the special case of the single point independence assumptions we prefer. These will
curve given its equivalence to Powers Kappa. provide informedness of probability B plus a
In fact Markedness does admit averaging over remaining proportion 1-B of random responses
classes, whilst Informedness requires averaging exhibiting extreme bias versus both neutral and
over predicted labels, as does Precision. The contrary prevalence. Note that we consider only
other Kappa and Correlations are more complex |B|<1 as all Kappas give Acc=1 and thus K=1 for
(note the demoninators in Eqns 5-9) and how B=1, and only Powers Kappa is designed to work
they might be meaningfully macro-averaged is for B<1, giving K= -1 for B= -1.
an open question. However, microaveraging can Recall that the general calculation of Expected
always be done quickly and easily by simply Accuracy is
summing all the contingency tables (the true E(Acc) = etp+etn (11)
contingency tables are tables of counts, not For Fleiss Kappa we must calculate the
probabilities, as shown in Table 1). expected values of the correct contingencies as
Macroaveraging should never be done except discussed previously with expected probabilities
for the special cases of Recall and Markedness ep = (rp+pp)/2 & en = (rn+pn)/2 (12)
when it is equivalent to micro-average, which is etp = ep 2
& etn = en 2
(13)
only slightly more expensive/complicated to do.
We first consider cases where prevalence is
Comparison of Kappas extreme and the chance component exhibits
inverse bias. We thus consider limits as
We now turn to explore the different definitions rp0, rn1, pp1-B, pnB. This gives us
of Kappas, using the same approach employed (assuming |B|<1)
with Accuracy and F-Factor in Table 1: We will
EF(Acc) = (1/4+B2/4+B/2)2+(1/4+B2/4-B/2)2
consider 0%, 100%, 15% and -15% informed = (1+B2)/2 (14)
decisions, with random decisions modelled on
the basis of independent Bias and Prevalence. KF(Acc) = (1-B)2/[B2-2] (15)
This clearly biases against the Fleiss family of We second consider cases where the
Kappas, which is entirely appropriate. As prevalence is balanced and chance extreme, with
pointed out by Entwisle & Powers (1998) the rp0.5, rn0.5, pp1-B, pnB, giving
practice of deliberately skewing bias to achieve EF(Acc) = 1/2 + (B-1/2)2/2
better statistics is to be deprecated they used = 5/8 + B(B-1)/2 (16)
1 1 2 1 1 2
the real-life example of a CL researcher choosing KF(Acc)=[(B- /2)-(B- /2) /2]/[ /2-(B- /2) /2] (17)
to say water was always a noun because it was a =[B-5/8+B(B-1)/2]/[1-(5/8+B(B-1)/2)
noun more often than not. With Cohen or Powers
measures, any actual power of the system to Conclusions
determine PoS, however weak, would be The asymmetric Powers Informedness gives
reflected in an improvement in the scores versus the clearest measure of the predictive value of a
any random choice, whatever the distribution. system, while the Matthews Correlation (as
Recall that choosing one answer all the time geometric mean with the Powers Markedness
corresponds to the extreme points of the chance dual) is appropriate for comparing equally valid
line in the ROC curve. classifications or ratings into an agreed number
Studies like Fitzgibbon et al (2007) and of classes. Concordance measures should be used
Leibbrandt and Powers (2012) show divergences if number of classes is not agreed or specified.
amongst the conventional and debiased measures, For mismatch cases (15) Fleiss is always
but it is tricky to prove which is better. negative for |B|<1) and thus fails to adequately
reward good performance under these marginal
Kappa in the Limit conditions. For the chance case (17), the first
It is however straightforward to derive limits for form we provide shows that the deviation from
the various Kappas and Expectations under matching Prevalence is a driver in a Kappa-like
extreme and central conditions of bias and function. Cohen on the other hand (Table 3)
prevalence, including both match and mismatch. tends to apply multiply the weight given to error
The 36 theoretical results match the mixture in even mild prevalence-bias mismatch
model results in Table 3, however, due to space conditions. None of the symmetric Kappas
constraints, formal treatment will be limited to designed for raters are suitable for classifiers.
1:1 1:1 4:1 4:1 4:1 1:4 1:1 1:1 4:1 4:1 4:1 1:4 1:1 1:1 4:1 4:1 4:1 1:4
Informedness 0% 0% 0% 0% 0% 0% 0% 0% 0%
Prevalence 50% 80% 80% 50% 80% 80% 50% 20% 20%
Iprevalence 50% 20% 20% 50% 20% 20% 50% 80% 80%
Bias 50% 80% 20% 50% 80% 20% 50% 20% 80%
Ibias 50% 20% 80% 50% 20% 80% 50% 80% 20%

SkewR 100% 25% 25% 100% 25% 25% 100% 400% 400%
SkewP 100% 25% 400% 100% 25% 400% 100% 400% 25%
OddsRatio 100% 100% 6% 100% 100% 6% 100% 100% 1600%
ePowers 50% 68% 32% 50% 68% 32% 50% 68% 32%
eCohen 50% 68% 32% 50% 68% 32% 50% 68% 32%
eFleiss 50% 68% 50% 50% 68% 50% 50% 68% 50%
kPowers 0% 0% 0% 0% 0% 0% 0% 0% 0%
kCohen 0% 0% 0% 0% 0% 0% 0% 0% 0%
kFleiss 0% 0% -36% 0% 0% -36% 0% 0% -36%
Informedness 100% 100% 100% 100% 100% 100% 100% 100% 100%
Prevalence 50% 80% 80% 50% 80% 80% 50% 20% 20%
Iprevalence 50% 20% 20% 50% 20% 20% 50% 80% 80%
Bias 50% 80% 80% 50% 80% 80% 50% 20% 20%
Ibias 50% 20% 20% 50% 20% 20% 50% 80% 80%

SkewR 100% 25% 25% 100% 25% 25% 100% 400% 400%
SkewP 100% 25% 25% 100% 25% 25% 100% 400% 400%
OddsRatio 100% 100% 100% 100% 100% 100% 100% 100% 100%
ePowers 50% 68% 68% 50% 68% 68% 50% 68% 68%
aCohen 50% 68% 68% 50% 68% 68% 50% 68% 68%
aFleiss 50% 68% 68% 50% 68% 68% 50% 68% 68%
kPowers 100% 100% 100% 100% 100% 100% 100% 100% 100%
kCohen 100% 100% 100% 100% 100% 100% 100% 100% 100%
kFleiss 100% 100% 100% 100% 100% 100% 100% 100% 100%
Informedness 15% 15% 15% 99% 99% 99% 99% 99% 99%
Prevalence 50% 80% 80% 50% 80% 80% 50% 20% 20%
Iprevalence 50% 20% 20% 50% 20% 20% 50% 80% 80%
Bias 50% 80% 29% 50% 80% 79% 50% 20% 79%
Ibias 50% 20% 71% 50% 20% 21% 50% 80% 21%

SkewR 100% 25% 25% 100% 25% 25% 100% 400% 400%
SkewP 100% 25% 245% 100% 25% 26% 100% 400% 26%
OddsRatio 100% 100% 6% 100% 100% 6% 100% 100% 1600%
ePowers 50% 68% 32% 50% 68% 32% 50% 68% 32%
eCohen 50% 68% 37% 50% 68% 68% 50% 68% 32%
eFleiss 50% 68% 50% 50% 68% 68% 50% 68% 50%
kPowers 15% 15% 15% 99% 99% 99% 1% 1% 1%
kCohen 15% 15% 8% 99% 99% 98% 1% 1% 0%
kFleiss 15% 15% -17% 99% 99% 98% 1% 1% -35%
Informedness -15% -15% -15% -99% -99% -99% -99% -99% -99%
Prevalence 50% 80% 20% 50% 80% 80% 50% 20% 20%
Iprevalence 50% 20% 80% 50% 20% 20% 50% 80% 80%
Bias 50% 71% 80% 50% 21% 20% 50% 21% 80%
Ibias 50% 29% 20% 50% 79% 80% 50% 79% 20%

SkewR 100% 25% 400% 100% 25% 25% 100% 400% 400%
SkewP 100% 41% 25% 100% 385% 400% 100% 385% 25%
OddsRatio 100% 65% 1038% 100% 25% 25% 100% 104% 1542%
ePowers 50% 63% 37% 50% 50% 50% 50% 68% 32%
eCohen 50% 63% 32% 50% 32% 32% 50% 68% 32%
eFleiss 50% 63% 50% 50% 50% 50% 50% 68% 50%
kPowers -15% -15% -15% -99% -99% -99% -1% -1% -1%
kCohen -15% -13% -7% -99% -47% -47% -1% -1% 0%
kFleiss -15% -14% -46% -99% -99% -99% -1% -1% -37%

Table 3. Empirical Results for Accuracy and Kappa for Fleiss/Scott, Cohen and Powers. Shaded
cells indicate misleading results, which occur for both Cohen and Fleiss Kappas.
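The mixture construction behind these results can be sketched as follows: a proportion b of decisions is informed (and therefore correct), while the remaining 1-b are guesses made independently of the true class with a chosen positive-label bias; the kappas are then computed from the resulting expected contingency. The guessing-bias parameter and the handling of the contrary (negative informedness) rows are assumptions of this illustration, not the author's code.

```python
def mixture_contingency(b, prevalence, guess_bias):
    """Expected 2x2 cell probabilities when a proportion b of decisions is informed
    (always correct) and the rest are guessed independently of the true class."""
    tp = b * prevalence + (1 - b) * prevalence * guess_bias
    fn = (1 - b) * prevalence * (1 - guess_bias)
    fp = (1 - b) * (1 - prevalence) * guess_bias
    tn = b * (1 - prevalence) + (1 - b) * (1 - prevalence) * (1 - guess_bias)
    return tp, fp, fn, tn

def cohen_kappa(tp, fp, fn, tn):
    acc = tp + tn
    e = (tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)   # Prev*Bias + IPrev*IBias
    return (acc - e) / (1 - e)

def fleiss_kappa(tp, fp, fn, tn):
    ep = ((tp + fn) + (tp + fp)) / 2.0                   # average of Prevalence and Bias
    en = 1.0 - ep
    e = ep ** 2 + en ** 2
    return (tp + tn - e) / (1 - e)

def powers_kappa(tp, fp, fn, tn):
    """Bookmaker Informedness: TPR - FPR."""
    return tp / (tp + fn) - fp / (fp + tn)

tp, fp, fn, tn = mixture_contingency(0.15, prevalence=0.8, guess_bias=0.8)
print(round(tp + tn, 3), round(powers_kappa(tp, fp, fn, tn), 3))  # ~0.728 and ~0.15, matching the corresponding cells above
```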
References P. A. Flach (2003). The Geometry of ROC Space:
Understanding Machine Learning Metrics through
2nd i2b2 Workshop on Challenges in Natural
ROC Isometrics, Proceedings of the Twentieth
Language Processing for Clinical Data (2008).
International Conference on Machine Learning
http://gnode1.mib.man.ac.uk/awards.html
(ICML-2003), Washington DC, 2003, pp. 226-233.
(accessed 4 November 2011)
J. L. Fleiss (1981). Statistical methods for rates and
2nd Pascal Challenge on Hierarchical Text
proportions (2nd ed.). New York: Wiley.
Classification http://lshtc.iit.demokritos.gr/node/48
(accessed 4 November 2011) A. Fraser & D. Marcu (2007). Measuring Word
Alignment Quality for Statistical Machine
N. Ailon. and M. Mohri (2010) Preference-based
Translation, Computational Linguistics 33(3):293-
learning to rank. Machine Learning 80:189-211.
303.
A. Ben-David. (2008a). About the relationship
J. Frnkranz & P. A. Flach (2005). ROC n Rule
between ROC curves and Cohens kappa.
Learning Towards a Better Understanding of
Engineering Applications of AI, 21:874882, 2008.
Covering Algorithms, Machine Learning 58(1):39-
A. Ben-David (2008b). Comparison of classification 77.
accuracy using Cohens Weighted Kappa, Expert
D. J. Hand (2009). Measuring classifier performance:
Systems with Applications 34 (2008) 825832
a coherent alternative to the area under the ROC
Y. Benjamini and Y. Hochberg (1995). "Controlling curve. Machine Learning 77:103-123.
the false discovery rate: a practical and powerful
T. P. Hutchinson (1993). Focus on Psychometrics.
approach to multiple testing". Journal of the Royal
Kappa muddles together two sources of
Statistical Society. Series B (Methodological) 57
disagreement: tetrachoric correlation is preferable.
(1), 289300.
Research in Nursing & Health 16(4):313-6, 1993
D. G. Bonett & R.M. Price, (2005). Inferential Aug.
Methods for the Tetrachoric Correlation
U. Kaymak, A. Ben-David and R. Potharst (2010),
Coefficient, Journal of Educational and Behavioral
AUK: a sinple alternative to the AUC, Technical
Statistics 30:2, 213-225
Report, Erasmus Research Institute of
J. Carletta (1996). Assessing agreement on Management, Erasmus School of Economics,
classification tasks: the kappa statistic. Rotterdam NL.
Computational Linguistics 22(2):249-254
K. Krippendorff (1970). Estimating the reliability,
N. J. Castellan, (1966). On the estimation of the systematic error, and random error of interval data.
tetrachoric correlation coefficient. Psychometrika, Educational and Psychological Measurement, 30
31(1), 67-73. (1),61-70.
J. Cohen (1960). A coefficient of agreement for K. Krippendorff (1978). Reliability of binary attribute
nominal scales. Educational and Psychological data. Biometrics, 34 (1), 142-144.
Measurement, 1960:37-46.
J. Lafferty, A. McCallum. & F. Pereira. (2001).
J. Cohen (1968). Weighted kappa: Nominal scale Conditional Random Fields: Probabilistic Models
agreement with provision for scaled disagreement for Segmenting and Labeling Sequence Data.
or partial credit. Psychological Bulletin 70:213-20. Proceedings of the 18th International Conference
on Machine Learning (ICML-2001), San
B. Di Eugenio and M. Glass (2004), The Kappa
Francisco, CA: Morgan Kaufmann, pp. 282-289.
Statistic: A Second Look., Computational
Linguistics 30:1 95-101. R. Leibbrandt & D. M. W. Powers, Robust Induction
of Parts-of-Speech in Child-Directed Language by
J. Entwisle and D. M. W. Powers (1998). "The
Co-Clustering of Words and Contexts. (2012).
Present Use of Statistics in the Evaluation of NLP
EACL Joint Workshop of ROBUS (Robust
Parsers", pp215-224, NeMLaP3/CoNLL98 Joint
Unsupervised and Semi-supervised Methods in
Conference, Sydney, January 1998
NLP) and UNSUP (Unsupervised Learning in NLP).
Sean Fitzgibbon, David M. W. Powers, Kenneth
P. J. G. Lisboa, A. Vellido & H. Wong (2000). Bias
Pope, and C. Richard Clark (2007). Removal of
reduction in skewed binary classfication with
EEG noise and artefact using blind source
Bayesian neural networks. Neural Networks
separation. Journal of Clinical Neurophysiology
13:407-410.
24(3):232-243, June 2007
R. Lowry (1999). Concepts and Applications of L. H. Reeker, (2000), Theoretic Constructs and
Inferential Statistics. (Published on the web as Measurement of Performance and Intelligence in
http:// faculty.vassar.edu/lowry/webtext.html.) Intelligent Systems, PerMIS 2000. (See
http://www.isd.mel.nist.gov/research_areas/
C. D. Manning, and H. Schtze (1999). Foundations
research_engineering/PerMIS_Workshop/ accessed
of Statistical Natural Language Processing. MIT
22 December 2007.)
Press, Cambridge, MA.
W. A. Scott (1955). Reliability of content analysis:
J. H McDonald, (2007). The Handbook of Biological
The case of nominal scale coding. Public Opinion
Statistics. (Course handbook web published as Quarterly, 19, 321-325.
http: //udel.edu/~mcdonald/statpermissions.html)
D. R. Shanks (1995). Is human learning rational?
J.C. Nunnally and Bernstein, I.H. (1994). Quarterly Journal of Experimental Psychology,
Psychometric Theory (Third ed.). McGraw-Hill. 48A, 257-279.
K. Pearson and D. Heron (1912). On Theories of T. Sellke, Bayarri, M.J. and Berger, J. (2001), Calibration
Association. J. Royal Stat. Soc. LXXV:579-652 of P-values for testing precise null hypotheses,
P. Perruchet and R. Peereman (2004). The American Statistician 55, 62-71. (See http://
exploitation of distributional information in www.stat.duke.edu/%7Eberger/papers.html#p-value
syllable processing, J. Neurolinguistics 17:97119. accessed 22 December 2007.)

D. Pfitzner, R. E. Leibbrandt and D. M. W. Powers P. J. Smith, Rae, DS, Manderscheid, RW and


(2009). Characterization and evaluation of Silbergeld, S. (1981). Approximating the moments
similarity measures for pairs of clusterings, and distribution of the likelihood ratio statistic for
Knowledge and Information Systems, 19:3, 361-394 multinomial goodness of fit. Journal of the
American Statistical Association 76:375,737-740.
D. M. W. Powers (2003), Recall and Precision versus
the Bookmaker, Proceedings of the International R. R. Sokal, Rohlf FJ (1995) Biometry: The principles
Conference on Cognitive Science (ICSC-2003), and practice of statistics in biological research, 3rd
ed New York: WH Freeman and Company.
Sydney Australia, 2003, pp. 529-534. (See http://
david.wardpowers.info/BM/index.htm.) J. Uebersax (1987). Diversity of decision-making
models and the measurement of interrater
D. M. W. Powers (2008), Evaluation Evaluation, The
agreement. Psychological Bulletin 101, 140146.
18th European Conference on Artificial
Intelligence (ECAI08) J. Uebersax (2009) http://ourworld.compuserve.com/
homepages/jsuebersax/agree.htm accessed 24
D. M W Powers, (2007/2011) Evaluation: From
February 2011.
Precision, Recall and F-Factor to ROC,
Informedness, Markedness & Correlation, M. J. Warrens (2010a), Inequalities between multi-
School of Informatics and Engineering, Flinders rater kappas. Advances in Data Analysis and
University, Adelaide, Australia, TR SIE-07-001, Classification 4:271-286.
Journal of Machine Learning Technologies 2:1 37-63. M. J. Warrens (2010b). A formal proof of a paradox
https://dl-web.dropbox.com/get/Public/201101- associated with Cohens kappa. Journal of
Evaluation_JMLT_Postprint-Colour.pdf?w=abcda988 Classificaiton 27:322-332.
D. M. W. Powers, 2012. The Problem of Area Under M. J. Warrens (2010c). A Kraemer-type rescaling that
the Curve. International Conference on Information transforms the Odds Ratio into the Weighted
Science and Technology, ICIST2012, in press. Kappa Coefficient. Psychometrika 75:2 328-330.
D. M. W. Powers and A. Atyabi, 2012. The Problem M. J. Warrens (2011). Cohens linearly wieghted
of Cross-Validation: Averaging and Bias, Kappa is a weighted average of 2x2 Kappas.
Repetition and Significance, SCET2012, in press. Psychometrika 76:3, 471-486.
F. Provost and T. Fawcett. Robust classification for D. A. Williams (1976). Improved Likelihood Ratio
imprecise environments. Machine Learning, Tests for Complete Contingency Tables,
44:203231, 2001. Biometrika 63:33-37.
RapidMiner (2011). http://rapid-i.com (accessed 4 I. H. Witten & E. Frank, (2005). Data mining (2nd
November 2011). ed.). London: Academic Press.
User Edits Classification Using Document Revision Histories

Amit Bronner, Informatics Institute, University of Amsterdam, a.bronner@uva.nl
Christof Monz, Informatics Institute, University of Amsterdam, c.monz@uva.nl

Abstract

Document revision histories are a useful and abundant source of data for natural language processing, but selecting relevant data for the task at hand is not trivial. In this paper we introduce a scalable approach for automatically distinguishing between factual and fluency edits in document revision histories. The approach is based on supervised machine learning using language model probabilities, string similarity measured over different representations of user edits, comparison of part-of-speech tags and named entities, and a set of adaptive features extracted from large amounts of unlabeled user edits. Applied to contiguous edit segments, our method achieves statistically significant improvements over a simple yet effective edit-distance baseline. It reaches high classification accuracy (88%) and is shown to generalize to additional sets of unseen data.

1 Introduction

Many online collaborative editing projects such as Wikipedia1 keep track of complete revision histories. These contain valuable information about the evolution of documents in terms of content as well as language, style and form. Such data is publicly available in large volumes and constantly growing. According to Wikipedia statistics, in August 2011 the English Wikipedia contained 3.8 million articles with an average of 78.3 revisions per article. The average number of revision edits per month is about 4 million in English and almost 11 million in total for all languages.2

1 http://www.wikipedia.org
2 Average for the 5 years period between August 2006 and August 2011. The count includes edits by registered users, anonymous users, software bots and reverts. Source: http://stats.wikimedia.org.

Exploiting document revision histories has proven useful for a variety of natural language processing (NLP) tasks, including sentence compression (Nelken and Yamangil, 2008; Yamangil and Nelken, 2008) and simplification (Yatskar et al., 2010; Woodsend and Lapata, 2011), information retrieval (Aji et al., 2010; Nunes et al., 2011), textual entailment recognition (Zanzotto and Pennacchiotti, 2010), and paraphrase extraction (Max and Wisniewski, 2010; Dutrey et al., 2011).

The ability to distinguish between factual changes or edits, which alter the meaning, and fluency edits, which improve the style or readability, is a crucial requirement for approaches exploiting revision histories. The need for an automated classification method has been identified (Nelken and Yamangil, 2008; Max and Wisniewski, 2010), but to the best of our knowledge has not been directly addressed. Previous approaches have either applied simple heuristics (Yatskar et al., 2010; Woodsend and Lapata, 2011) or manual annotations (Dutrey et al., 2011) to restrict the data to the type of edits relevant to the NLP task at hand. The work described in this paper shows that it is possible to automatically distinguish between factual and fluency edits. This is very desirable as it does not rely on heuristics, which often generalize poorly, and does not require manual annotation beyond a small collection of training data, thereby allowing for much larger data sets of revision histories to be used for NLP research.

In this paper, we make the following novel contributions:

• We address the problem of automated classification of user edits as factual or fluency edits
- A set of features is designed and integrated into a supervised machine learning framework. It is composed of language model probabilities and string similarity measured over different representations, including part-of-speech tags and named entities. Despite their relative simplicity, the features achieve high classification accuracy when applied to contiguous edit segments.

- We go beyond labeled data and exploit large amounts of unlabeled data. First, we demonstrate that the trained classifier generalizes to thousands of examples identified by user comments as specific types of fluency edits. Furthermore, we introduce a new method for extracting features from an evolving set of unlabeled user edits. This method is successfully evaluated as an alternative or supplement to the initial supervised approach.

2 Related Work

The need for user edits classification is implicit in studies of Wikipedia edit histories. For example, Viegas et al. (2004) use revision size as a simplified measure for the change of content, and Kittur et al. (2007) use metadata features to predict user edit conflicts.

Classification becomes an explicit requirement when exploiting edit histories for NLP research. Yamangil and Nelken (2008) use edits as training data for sentence compression. They make the simplifying assumption that all selected edits retain the core meaning. Zanzotto and Pennacchiotti (2010) use edits as training data for textual entailment recognition. In addition to manually labeled edits, they use Wikipedia user comments and a co-training approach to leverage unlabeled edits. Woodsend and Lapata (2011) and Yatskar et al. (2010) use Wikipedia comments to identify relevant edits for learning sentence simplification.

The work by Max and Wisniewski (2010) is closely related to the approach proposed in this paper. They extract a corpus of rewritings, distinguish between weak semantic differences and strong semantic differences, and present a typology of multiple subclasses. Spelling corrections are heuristically identified but the task of automatic classification is deferred. Follow-up work by Dutrey et al. (2011) focuses on automatic paraphrase identification using a rule-based approach and manually annotated examples.

Wikipedia vandalism detection is a user edits classification problem addressed by a yearly competition (since 2010) in conjunction with the CLEF conference (Potthast et al., 2010; Potthast and Holfeld, 2011). State-of-the-art solutions involve supervised machine learning using various content and metadata features. Content features use spelling, grammar, and character- and word-level attributes. Many of them are relevant for our approach. Metadata features allow detection by patterns of usage, time and place, which are generally useful for the detection of online malicious activities (West et al., 2010; West and Lee, 2011). We deliberately refrain from using such features.

A wide range of methods and approaches has been applied to the similar tasks of textual entailment and paraphrase recognition; see Androutsopoulos and Malakasiotis (2010) for a comprehensive review. These are all related because paraphrases and bidirectional entailments represent types of fluency edits.

A different line of research uses classifiers to predict sentence-level fluency (Zwarts and Dras, 2008; Chae and Nenkova, 2009). These could be useful for fluency edit detection. Alternatively, user edits could be a potential source of human-produced training data for fluency models.

3 Definition of User Edits Scope

Within our approach we distinguish between edit segments, which represent the comparison (diff) between two document revisions, and user edits, which are the input for classification.

An edit segment is a contiguous sequence of deleted, inserted or equal words. The difference between two document revisions $(v_i, v_j)$ is represented by a sequence of edit segments $E$. Each edit segment $(\tau, w_1^m) \in E$ is a pair, where $\tau \in \{deleted, inserted, equal\}$ and $w_1^m$ is an $m$-word substring of $v_i$, $v_j$ or both (respectively).

A user edit is a minimal set of sentences overlapping with deleted or inserted segments. Given the two sets of revision sentences $(S_{v_i}, S_{v_j})$, let

  $\sigma(\tau, w_1^m) = \{ s \in S_{v_i} \cup S_{v_j} \mid w_1^m \cap s \neq \emptyset \}$   (1)

be the subset of sentences overlapping with a given edit segment, and let

  $\epsilon(s) = \{ (\tau, w_1^m) \in E \mid w_1^m \cap s \neq \emptyset \}$   (2)

be the subset of edit segments overlapping with a given sentence.
A user edit is a pair $(pre \subseteq S_{v_i},\ post \subseteq S_{v_j})$ where

  $\forall s \in pre \cup post,\ \forall \tau \in \{deleted, inserted\},\ \forall w_1^m:\ (\tau, w_1^m) \in \epsilon(s) \Rightarrow \sigma(\tau, w_1^m) \subseteq pre \cup post$   (3)

  $\forall s \in pre \cup post,\ \exists \tau \in \{deleted, inserted\},\ \exists w_1^m:\ (\tau, w_1^m) \in \epsilon(s)$   (4)

Table 1 illustrates different types of edit segments and user edits. The term replaced segment refers to adjacent deleted and inserted segments. Example (1) contains a replaced segment because the deleted segment ("1700s") is adjacent to the inserted segment ("18th century"). Example (2) contains an inserted segment ("and largest professional"), a replaced segment ("(est." replaced by "established in") and a deleted segment (")"). User edits of both examples consist of a single pre sentence and a single post sentence because deleted and inserted segments do not cross any sentence boundary. Example (3) contains a replaced segment (". He" replaced by "who"). In this case the deleted segment (". He") overlaps with two sentences and therefore the user edit consists of two pre sentences.

(1) Revisions 368209202 & 378822230
    pre:  ("By the mid 1700s, Medzhybizh was the seat of power in Podilia Province.")
    post: ("By the mid 18th century, Medzhybizh was the seat of power in Podilia Province.")
    diff: (equal, "By the mid"), (deleted, "1700s"), (inserted, "18th century"), (equal, ", Medzhybizh was the seat of power in Podilia Province.")

(2) Revisions 148109085 & 149440273
    pre:  ("Original Society of Teachers of the Alexander Technique (est. 1958).")
    post: ("Original and largest professional Society of Teachers of the Alexander Technique established in 1958.")
    diff: (equal, "Original"), (inserted, "and largest professional"), (equal, "Society of Teachers of the Alexander Technique"), (deleted, "(est."), (inserted, "established in"), (equal, "1958"), (deleted, ")"), (equal, ".")

(3) Revisions 61406809 & 61746002
    pre:  ("Fredrik Modin is a Swedish ice hockey left winger.", "He is known for having one of the hardest slap shots in the NHL.")
    post: ("Fredrik Modin is a Swedish ice hockey left winger who is known for having one of the hardest slap shots in the NHL.")
    diff: (equal, "Fredrik Modin is a Swedish ice hockey left winger"), (deleted, ". He"), (inserted, "who"), (equal, "is known for having one of the hardest slap shots in the NHL.")

Table 1: Examples of user edits and the corresponding edit segments (revision numbers correspond to the English Wikipedia).
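To make the definitions concrete, the following is a minimal sketch of edit segment extraction in Python. It is an illustration only, not the extraction pipeline used in this paper (Section 5.1 relies on Myers' diff algorithm via google-diff-match-patch); Python's difflib is used here purely for compactness, and the toy input reproduces example (1) of Table 1.

    import difflib

    def edit_segments(pre_words, post_words):
        # Contiguous runs of equal, deleted and inserted words (Section 3).
        matcher = difflib.SequenceMatcher(None, pre_words, post_words)
        segments = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                segments.append(("equal", pre_words[i1:i2]))
            elif op == "delete":
                segments.append(("deleted", pre_words[i1:i2]))
            elif op == "insert":
                segments.append(("inserted", post_words[j1:j2]))
            else:  # "replace" corresponds to adjacent deleted and inserted segments
                segments.append(("deleted", pre_words[i1:i2]))
                segments.append(("inserted", post_words[j1:j2]))
        return segments

    pre  = "By the mid 1700s , Medzhybizh was the seat of power .".split()
    post = "By the mid 18th century , Medzhybizh was the seat of power .".split()
    for tag, words in edit_segments(pre, post):
        print(tag, " ".join(words))

Grouping the sentences that overlap with deleted or inserted segments, as required by Equations 1-4, would then yield the user edits themselves.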
4 Features for Edits Classification

We design a set of features for supervised classification of user edits. The design is guided by two main considerations: simplicity and interoperability. Simplicity is important because there are potentially hundreds of millions of user edits to be classified. This amount continues to grow at a rapid pace and a scalable solution is required. Interoperability is important because millions of user edits are available in multiple languages. Wikipedia is a flagship project, but there are other collaborative editing projects. The solution should preferably be language- and project-independent. Consequently, we refrain from deeper syntactic parsing, Wikipedia-specific features, and language resources that are limited to English.

Our basic intuition is that longer edits are likely to be factual and shorter edits are likely to be fluency edits. The baseline method is therefore character-level edit distance (Levenshtein, 1966) between pre- and post-edited text.

Six feature categories are added to the baseline. Most features take the form of threefold counts referring to deleted, inserted and equal elements of each user edit. For instance, example (1) in Table 1 has one deleted token, two inserted tokens and 14 equal tokens. Many features use string similarity calculated over alternative representations.

Character-level features include counts of deleted, inserted and equal characters of different types, such as word and non-word characters or digits and non-digits. Character types may help identify edit types. For example, a change of digits may suggest a factual edit while a change of non-word characters may suggest a fluency edit.

Word-level features count deleted, inserted and equal words using three parallel representations: original case, lower case, and lemmas. Word-level edit distance is calculated for each representation. Table 2 illustrates how edit distance may vary across different representations.
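A simplified sketch of the character- and word-level measurements over parallel representations follows. It is an illustration under stated assumptions: the lemma, PoS and NE representations of Table 2 would additionally require a tagger (the paper uses Stanford CoreNLP), which is omitted here.

    def levenshtein(a, b):
        # Standard dynamic-programming edit distance between two sequences
        # (characters or words).
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
            prev = cur
        return prev[-1]

    def word_level_features(pre, post):
        # Word-level edit distance over two of the parallel representations
        # used above (original case and lower case), plus the character-level
        # baseline distance.
        feats = {}
        for name, f in [("orig", str), ("lower", str.lower)]:
            a = [f(w) for w in pre.split()]
            b = [f(w) for w in post.split()]
            feats["word_dist_" + name] = levenshtein(a, b)
        feats["char_dist"] = levenshtein(pre, post)
        return feats

    print(word_level_features("Branch lines were built in Kenya",
                              "A branch line was built in Kenya"))

For the Table 2 example this yields a word-level distance of 4 in the original-case representation and 3 in the lower-case representation.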
  Rep.       User Edit                                      Dist
  Words      pre:  Branch lines were built in Kenya          4
             post: A branch line was built in Kenya
  Lowcase    pre:  branch lines were built in kenya          3
             post: a branch line was built in kenya
  Lemmas     pre:  branch line be build in Kenya             1
             post: a branch line be build in Kenya
  PoS tags   pre:  NN NNS VBD VBN IN NNP                     2
             post: DT NN NN VBD VBN IN NNP
  NE tags    pre:  LOCATION                                  0
             post: LOCATION

Table 2: Word- and tag-level edit distance measured over different representations (example from Wikipedia revisions 2678278 & 2682972).

Fluency edits may shift words, which sometimes may be slightly modified. Fluency edits may also add or remove words that already appear in context. Optimal calculation of edit distance with shifts is computationally expensive (Shapira and Storer, 2002). Translation error rate (TER) provides an approximation but it is designed for the needs of machine translation evaluation (Snover et al., 2006). To have a more sensitive estimation of the degree of edit, we compute the minimal character-level edit distance between every pair of words that belong to different edit segments. For each pair of edit segments $(\tau, w_1^m)$, $(\tau', w'^k_1)$ overlapping with a user edit, if $\tau \neq \tau'$ we compute:

  $\forall w \in w_1^m:\ \min_{w' \in w'^k_1} EditDist(w, w')$   (5)

Binned counts of the number of words with a minimal edit distance of 0, 1, 2, 3 or more characters are accumulated per edit segment type (equal, deleted or inserted).
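The cross-segment minimum of Equation 5 and the subsequent binning can be sketched as follows. This is a simplified reading of the feature, using the same dynamic-programming edit distance as in the previous sketch; the bin value 3 stands for "3 or more".

    def levenshtein(a, b):
        # Same edit-distance helper as in the earlier sketch.
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
            prev = cur
        return prev[-1]

    def binned_shift_features(segments, max_bin=3):
        # For every word, the minimal character-level edit distance to the
        # words of edit segments of a *different* type is computed and the
        # binned count is accumulated per segment type.
        counts = {}
        for tag_a, words_a in segments:
            others = [w for tag_b, words_b in segments if tag_b != tag_a for w in words_b]
            if not others:
                continue
            for w in words_a:
                d = min(levenshtein(w, w2) for w2 in others)
                b = min(d, max_bin)
                counts[(tag_a, b)] = counts.get((tag_a, b), 0) + 1
        return counts

    segs = [("deleted", ["1700s"]), ("inserted", ["18th", "century"]),
            ("equal", ["By", "the", "mid"])]
    print(binned_shift_features(segs))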
Part-of-speech (PoS) features include counts of deleted, inserted and equal PoS tags (per tag) and edit distance at the tag level between PoS tags before and after the edit. Similarly, named-entity (NE) features include counts of deleted, inserted and equal NE tags (per tag, excluding OTHER) and edit distance at the tag level between NE tags before and after the edit. Table 2 illustrates the edit distance at different levels of representation. We assume that a deleted NE tag, e.g. PERSON or LOCATION, could indicate a factual edit. It could however be a fluency edit where the NE is replaced by a co-referent like "she" or "it". Even if we encounter an inserted PRP PoS tag, the features do not capture the explicit relation between the deleted NE tag and the inserted PoS tag. This is an inherent weakness of these features when compared to parsing-based alternatives.

An additional set of counts, NE values, describes the number of deleted, inserted and equal normalized values of numeric entities such as numbers and dates. For instance, if the word "100" is replaced by "200" and the respective numeric values 100.0 and 200.0 are normalized, the counts of deleted and inserted NE values will be incremented and suggest a factual edit. If on the other hand "100" is replaced by "hundred" and the latter is normalized as having the numeric value 100.0, then the count of equal NE values will be incremented, rather suggesting a fluency edit.

Acronym features count deleted, inserted and equal acronyms. Potential acronyms are extracted from word sequences that start with a capital letter and from words that contain multiple capital letters. If, for example, "UN" is replaced by "United Nations", "MicroSoft" by "MS" or "Jean Pierre" by "J.P", the count of equal acronyms will be incremented, suggesting a fluency edit.

The last category, language model (LM) features, takes a different approach. These features look at n-gram based sentence probabilities before and after the edit, with and without normalization with respect to sentence lengths. The ratio of the two probabilities, $P_{ratio}(pre, post)$, is computed as follows:

  $P(w_1^m) = \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}^{i-1})$   (6)

  $P_{norm}(w_1^m) = P(w_1^m)^{1/m}$   (7)

  $P_{ratio}(pre, post) = \dfrac{P_{norm}(post)}{P_{norm}(pre)}$   (8)

  $\log P_{ratio}(pre, post) = \log \dfrac{P_{norm}(post)}{P_{norm}(pre)} = \log P_{norm}(post) - \log P_{norm}(pre) = \dfrac{1}{|post|} \log P(post) - \dfrac{1}{|pre|} \log P(pre)$   (9)

where $P$ is the sentence probability estimated as a product of n-gram conditional probabilities and $P_{norm}$ is the sentence probability normalized by the sentence length. We hypothesize that the relative change of normalized sentence probabilities is related to the edit type. As an additional feature, the number of out of vocabulary (OOV) words before and after the edit is computed. The intuition is that unknown words are more likely to be indicative of factual edits.
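In code, Equation 9 reduces to a difference of per-word average log probabilities. The sketch below uses made-up log probabilities; in the paper they come from a 4-gram SRILM model (Section 5.1).

    def log_p_ratio(pre_logprob, pre_len, post_logprob, post_len):
        # Length-normalized log-probability ratio of Equation 9:
        # log Pratio = (1/|post|) log P(post) - (1/|pre|) log P(pre)
        return post_logprob / post_len - pre_logprob / pre_len

    # Toy values only; pre_lp and post_lp stand for log P(pre) and log P(post)
    # as returned by some language model, pre_n and post_n for the lengths.
    pre_lp, pre_n = -42.0, 12
    post_lp, post_n = -39.5, 12
    print(log_p_ratio(pre_lp, pre_n, post_lp, post_n))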
5 Experiments

5.1 Experimental Setup

First, we extract a large amount of user edits from revision histories of the English Wikipedia.[3] The extraction process scans pairs of subsequent revisions of article pages and ignores any revision that was reverted due to vandalism. It parses the Wikitext and filters out markup, hyperlinks, tables and templates. The process analyzes the clean text of the two revisions[4] and computes the difference between them.[5] The process identifies the overlap between edit segments and sentence boundaries and extracts user edits. Features are calculated and user edits are stored and indexed. LM features are calculated against a large English 4-gram language model built by SRILM (Stolcke, 2002) with modified interpolated Kneser-Ney smoothing using the AFP and Xinhua portions of the English Gigaword corpus (LDC2003T05).

We extract a total of 4.3 million user edits of which 2.52 million (almost 60%) are insertions and deletions of complete sentences. Although these may include fluency edits such as sentence reordering or rewriting from scratch, we assume that the large majority is factual. Of the remaining 1.78 million edits, the majority (64.5%) contains single deleted, inserted or replaced segments. We decide to focus on this subset because sentences with multiple non-contiguous edit segments are more likely to contain mixed cases of unrelated factual and fluency edits, as illustrated by example (2) in Table 1. Learning to classify contiguous edit segments seems to be a reasonable way of breaking down the problem into smaller parts. We filter out user edits with an edit distance longer than 100 characters or 10 words, which we assume to be factual. The resulting dataset contains 923,820 user edits: 58% replaced segments, 25.5% inserted segments and 16.5% deleted segments.

Manual labeling of user edits is carried out by a group of annotators with near-native or native level of English. All annotators receive the same written guidelines. In short, fluency labels are assigned to edits of letter case, spelling, grammar, synonyms, paraphrases, co-referents, language and style. Factual labels are assigned to edits of dates, numbers and figures, named entities, semantic change or disambiguation, and addition or removal of content. A random set of 2,676 instances is labeled: 2,008 instances with a majority agreement of at least two annotators are selected as the training set, 270 instances are held out as a development set, and 164 trivial fluency corrections of a single letter's case and 234 instances with no clear agreement among annotators are excluded. The last group (8.7%) emphasizes that the task is, to a limited extent, subjective. It suggests that automated classification of certain user edits would be difficult. Nevertheless, inter-rater agreement between annotators is high to very high. Kappa values between 0.74 and 0.84 are measured between six pairs of annotators, each pair having annotated a common subset of at least 100 instances. Table 3 describes the resulting dataset, which we also make available to the research community.[6]

  Dataset                                      Labeled Subset
  Number of User Edits:
    923,820 (100%)                             2,008 (100%)
  Edit Segments Distribution:
    Replaced   535,402 (57.96%)                1,259 (62.70%)
    Inserted   235,968 (25.54%)                  471 (23.46%)
    Deleted    152,450 (16.5%)                   278 (13.84%)
  Character-level Edit Distance Distribution:
    1          202,882 (21.96%)                  466 (23.21%)
    2           81,388 (8.81%)                   198 (9.86%)
    3-10       296,841 (32.13%)                  645 (32.12%)
    11-100     342,709 (37.10%)                  699 (34.81%)
  Word-level Edit Distance Distribution:
    1          493,095 (53.38%)                1,008 (54.18%)
    2          182,770 (19.78%)                  402 (20.02%)
    3           77,603 (8.40%)                   161 (8.02%)
    4-10       170,352 (18.44%)                  357 (17.78%)
  Labels Distribution:
    Fluency    -                               1,008 (50.2%)
    Factual    -                               1,000 (49.8%)

Table 3: Dataset of nearly 1 million user edits with single deleted, inserted or replaced segments, of which 2K are labeled. The labels are almost equally distributed. The distribution over edit segment types and edit distance intervals is detailed.

[3] Dump of all pages with complete edit history as of January 15, 2011 (342GB bz2), http://dumps.wikimedia.org.
[4] Tokenization, sentence splitting, PoS & NE tags by Stanford CoreNLP, http://nlp.stanford.edu/software/corenlp.shtml.
[5] Myers' O(ND) difference algorithm (Myers, 1986), http://code.google.com/p/google-diff-match-patch.
[6] Available for download at http://staff.science.uva.nl/abronner/uec/data.
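The inter-annotator agreement reported above can be illustrated with a short sketch. The paper does not state which kappa variant is used; the following computes standard Cohen's kappa for two annotators over a common subset of items.

    def cohens_kappa(labels_a, labels_b):
        # kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
        # and p_e the chance agreement estimated from the label marginals.
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        labels = set(labels_a) | set(labels_b)
        p_e = sum((labels_a.count(l) / n) * (labels_b.count(l) / n) for l in labels)
        return (p_o - p_e) / (1 - p_e)

    a = ["flu", "fac", "flu", "flu", "fac", "fac", "flu", "fac"]
    b = ["flu", "fac", "fac", "flu", "fac", "fac", "flu", "flu"]
    print(cohens_kappa(a, b))   # 0.5 for this toy example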
[Figure 1: A decision tree that uses character-level edit distance as a sole feature. Edits with an edit distance of at most 4 are labeled fluency (725 fluency, 179 factual instances) and edits with a distance greater than 4 are labeled factual (821 factual, 283 fluency instances). The tree correctly classifies 76% of the labeled user edits.]

  Feature set    SVM      RF       Logit
  Baseline       76.26%   76.26%   76.34%
  + Char-level   83.71%   84.45%   84.01%
  + Word-level   78.38%   81.38%   78.13%
  + PoS          76.58%   76.97%   78.35%
  + NE           82.71%   83.12%   82.38%
  + Acronyms     76.55%   76.61%   76.96%
  + LM           76.20%   77.42%   76.52%
  All Features   87.14%   87.14%   85.64%

Table 4: Classification accuracy using the baseline, each feature set added to the baseline, and all features combined. Statistical significance at p < 0.05 is indicated w.r.t. the baseline (using the same classifier) and w.r.t. another classifier (using the same features). Highest accuracy per classifier is marked in bold.

  Feature set    SVM            RF             Logit
                 flu. / fac.    flu. / fac.    flu. / fac.
  Baseline       0.85 / 0.67    0.74 / 0.79    0.85 / 0.67
  + Char-level   0.85 / 0.82    0.83 / 0.86    0.86 / 0.82
  + Word-level   0.88 / 0.69    0.81 / 0.82    0.86 / 0.70
  + PoS          0.85 / 0.68    0.78 / 0.76    0.84 / 0.72
  + NE           0.86 / 0.79    0.79 / 0.87    0.87 / 0.78
  + Acronyms     0.87 / 0.66    0.83 / 0.70    0.86 / 0.68
  + LM           0.85 / 0.67    0.79 / 0.76    0.84 / 0.69
  All Features   0.88 / 0.86    0.86 / 0.88    0.87 / 0.84

Table 5: Fraction of correctly classified edits per type: fluency edits (left) and factual edits (right), using the baseline, each feature set added to the baseline, and all features combined.

5.2 Feature Analysis

We experiment with three classifiers: Support Vector Machines (SVM), Random Forests (RF) and Logistic Regression (Logit).[7] SVMs (Cortes and Vapnik, 1995) and Logistic Regression (or Maximum Entropy classifiers) are two widely used machine learning techniques. SVMs have been applied to many text classification problems (Joachims, 1998). Maximum Entropy classifiers have been applied to the similar tasks of paraphrase recognition (Malakasiotis, 2009) and textual entailment (Hickl et al., 2006). Random Forests (Breiman, 2001) as well as other decision tree algorithms are successfully used for classifying Wikipedia edits for the purpose of vandalism detection (Potthast et al., 2010; Potthast and Holfeld, 2011).
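The classifiers themselves are trained with Weka (SMO, RandomForest and Logistic; see footnote 7). As a rough illustration of the evaluation protocol only, an equivalent 10-fold cross-validation setup with scikit-learn (an assumption, not the authors' code) might look like the following, where X and y stand for the feature matrix and the fluency/factual labels.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(200, 20)        # placeholder feature matrix
    y = np.random.randint(0, 2, 200)   # placeholder fluency/factual labels

    for name, clf in [("SVM", SVC(kernel="linear")),
                      ("RF", RandomForestClassifier(n_estimators=100)),
                      ("Logit", LogisticRegression(max_iter=1000))]:
        scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross validation
        print(name, scores.mean())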
Experiments begin with the edit-distance baseline. Then each one of the feature groups is separately added to the baseline. Finally, all features are evaluated together. Table 4 reports the percentage of correctly classified edits (classifier accuracy), and Table 5 reports the fraction of correctly classified edits per type. All results are for 10-fold cross validation. Statistical significance against the baseline and between classifiers is calculated at p < 0.05 using a paired t-test.

The first interesting result is the high predictive power of the single-feature baseline. It confirms the intuition that longer edits are mainly factual. Figure 1 shows that the edit distance of 72% of the user edits labeled as fluency is between 1 and 4, while the edit distance of 82% of those labeled as factual is greater than 4. The cut-off value is found by a single-node decision tree that uses edit distance as a sole feature. The tree correctly classifies 76% of the instances. This result implies that the actual challenge is to correctly classify short factual edits and long fluency edits.

Character-level features and named-entity features lead to significant improvements over the baseline for all classifiers. Their strength lies in their ability to identify short factual edits such as changes of numeric values or proper names. Word-level features also significantly improve the baseline but their contribution is smaller. PoS and acronym features lead to small statistically insignificant improvements over the baseline.

The poor contribution of LM features is surprising. It might be due to the limited context of n-grams, but it might be that LM probabilities are not a good predictor for the task. Removing LM features from the set of all features leads to a small decrease in classification accuracy, namely 86.68% instead of 87.14% for SVM. This decrease is not statistically significant.

[7] Using Weka classifiers: SMO (SVM), RandomForest & Logistic (Hall et al., 2009). Classifier parameters are tuned using the held-out development set.
The highest accuracy is achieved by both SVM and RF and there are few significant differences among the three classifiers. The fraction of correctly classified edits per type (Table 5) reveals that for SVM and Logit, most fluency edits are correctly classified by the baseline and most improvements over the baseline are attributed to better classification of factual edits. This is not the case for RF, where the fraction of correctly classified factual edits is higher and the fraction of correctly classified fluency edits is lower. This insight motivates further experimentation. Repeating the experiment with a meta-classifier that uses a majority voting scheme achieves an improved accuracy of 87.58%. This improvement is not statistically significant.

5.3 Error Analysis

To gain a better understanding of the errors made by the classifier, 50 fluency edit misclassifications and 50 factual edit misclassifications are randomly selected and manually examined. The errors are grouped into categories as summarized in Table 6. These explain certain limitations of the classifier and suggest possible improvements.

  Fluency Edits Misclassified as Factual
    Equivalent or redundant in context       14
    Paraphrases                              13
    Equivalent numeric patterns               7
    Replacing first name with last name       4
    Acronyms                                  4
    Non-specific adjectives or adverbs        3
    Other                                     5

  Factual Edits Misclassified as Fluency
    Short correction of content              35
    Opposites                                 3
    Similar names                             3
    Noise (unfiltered vandalism)              3
    Other                                     6

Table 6: Error types based on manual examination of 50 fluency edit misclassifications and 50 factual edit misclassifications.

  Correctly Classified Fluency Edits
    Adventure education [-makes intentional use of-] [+intentionally uses+] challenging experiences for learning.
    He served as president from October 1, 1985 [-and retired-] [+through his retirement+] on June 30, 2002.
    In 1973, he [-helped organize-] [+assisted in organizing+] his first ever visit to the West.

  Correctly Classified Factual Edits
    Over the course of the next [-two years-] [+five months+], the unit completed a series of daring raids.
    Scottish born David Tennant has reportedly said he would like his Doctor to wear a kilt.
    This family joined the strip [-in late 1990-] [+around March 1991+].

Table 7: Examples of correctly classified user edits. Deleted segments are marked [-...-], inserted segments [+...+] (revision numbers are omitted for brevity).

Fluency edit misclassifications: 14 instances (28%) are phrases (often co-referents) that are either equivalent or redundant in the given context. For example: "in 1986" → "that year", "when she returned" → "when Ruffa returned" and "the core member of the group are" → "the core members are". 13 (26%) are paraphrases misclassified as factual edits. Examples are: "made cartoons" → "produced animated cartoons" and "with the implication that they are similar to" → "implying a connection to". 7 modify numeric patterns that do not change the meaning, such as "the year 37" → "1937". 4 replace the first name of a person with the last name. 4 contain acronyms, e.g. "Display PostScript" → "Display PostScript (or DPS)". Acronym features are correctly identified but the classifier fails to recognize a fluency edit. 3 modify adjectives or adverbs that do not change the meaning, such as "entirely" and "various".

Factual edit misclassifications: the large majority, 35 instances (70%), could be characterized as short corrections, often replacing a similar word, that make the content more accurate or more precise. Examples (context is omitted): "city" → "village", "emigrated" → "immigrated" and "electrical" → "electromagnetic". 3 are opposites or antonyms such as "previous" → "next" and "lived" → "died". 3 are modifications of similar person or entity names, e.g. "Kelly" → "Kate". 3 are instances of unfiltered vandalism, i.e. noisy examples. Other misclassifications include verb tense modifications such as "is" → "was" and "consists" → "consisted". These are difficult to classify because the modification of verb tense in a given context is sometimes factual and sometimes a fluency edit.
These findings agree with the feature analysis. Fluency edit misclassifications are typically longer phrases that carry the same meaning, while factual edit misclassifications are typically single words or short phrases that carry a different meaning. The main conclusion is that the classifier should take into account explicit content and context. Putting aside the considerations of simplicity and interoperability, features based on co-reference resolution and paraphrase recognition are likely to improve fluency edit classification, and features from language resources that describe synonymy and antonymy relations are likely to improve factual edit classification. While this conclusion may come as no surprise, it is important to highlight the high classification accuracy that is achieved without such capabilities and resources. Table 7 presents several examples of correct classification produced by our classifier.

6 Exploiting Unlabeled Data

We extracted a large set of user edits but our approach has been limited to a restricted number of labeled examples. This section attempts to find out whether the classifier generalizes beyond labeled data and whether unlabeled data could be used to improve classification accuracy.

6.1 Generalizing Beyond Labeled Data

The aim of the next experiment is to test how well the supervised classifier generalizes beyond the labeled test set. The problem is the availability of test data. There is no shared task for user edits classification and no common test set to evaluate against. We resort to Wikipedia user comments. It is a problematic option because it is unreliable. Users may add a comment when submitting an edit, but it is not mandatory. The comment is free text with no predefined structure. It could be meaningful or nonsense. The comment is per revision. It may refer to one, some or all edits submitted for a given revision. Nevertheless, we identify several keywords that represent certain types of fluency edits: "grammar", "spelling", "typo", and "copyedit". The first three clearly indicate grammar and spelling corrections. The last indicates a correction of format and style, but also of accuracy of the text. Therefore it only represents a bias towards fluency edits.

  Comment       Test Set Size    Classified as Fluency Edits
  grammar       1,122            88.9%
  spelling      2,893            97.6%
  typo          3,382            91.6%
  copyedit      3,437            68.4%
  Random set    5,000            49.4%

Table 8: Classifying unlabeled data selected by user comments that suggest a fluency edit. The SVM classifier is trained using the labeled data. User comments are not used as features.

  Replaced by    Frequency    Edit class
  second         144          Factual
  First           38          Fluency
  last            31          Factual
  1st             22          Fluency
  third           22          Factual

Table 9: User edits replacing the word "first" with another single word: most frequent 5 out of 524.

  Replaced by    Frequency        Replaced by    Frequency
  Adams          7                Squidward      6
  Joseph         7                Alexander      5
  Einstein       6                Davids         5
  Galland        6                Haim           5
  Lowe           6                Hickes         5

Table 10: Fluency edits replacing the word "He" with a proper noun: most frequent 10 out of 1,381.

We extract unlabeled edits whose comment is equal to one of the keywords and construct a test set per keyword. An additional test set consists of randomly selected unlabeled edits with any comment. The five test sets are classified by the SVM classifier trained using the labeled data and the set of all features. To remove any doubt, user comments are not part of any feature of the classifier. The results in Table 8 show that most unlabeled edits whose comments are "grammar", "spelling" or "typo" are indeed classified as fluency edits. The classification of edits whose comment is "copyedit" is biased towards fluency edits, but as expected the result is less distinct. The classification of the random set is balanced, as expected.
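A short sketch of this comment-based evaluation is given below. The names edits, classifier and the label strings are hypothetical placeholders; the procedure simply groups unlabeled edits by comment keyword and measures the fraction predicted as fluency.

    def fluency_rate(edits, classifier, keyword):
        # edits: list of (comment, feature_vector) pairs for unlabeled edits;
        # classifier.predict is assumed to return "fluency" or "factual".
        subset = [features for comment, features in edits
                  if comment.strip().lower() == keyword]
        if not subset:
            return None
        predictions = classifier.predict(subset)
        return sum(p == "fluency" for p in predictions) / len(subset)

    # Example usage, assuming unlabeled_edits and svm_classifier exist:
    # for kw in ["grammar", "spelling", "typo", "copyedit"]:
    #     print(kw, fluency_rate(unlabeled_edits, svm_classifier, kw))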
6.2 Features from Unlabeled Data

The purpose of the last experiment is to exploit unlabeled data in order to extract additional features for the classifier. The underlying assumption is that reoccurring patterns may indicate whether a user edit is factual or a fluency edit.

We could assume that fluency edits would reoccur across many revisions, while factual edits would only appear in revisions of specific documents. However, this assumption does not necessarily hold. Table 9 gives a simple example of single-word replacements for which the most reoccurring edit is actually factual and other factual and fluency edits reoccur with similar frequencies.

Finding user edit reoccurrence is not trivial. We could rely on exact matches of surface forms, but this may lead to data sparseness issues. Fluency edits that exchange co-referents and proper nouns, as illustrated by the example in Table 10, may reoccur frequently but this fact could not be revealed by exact matching of specific proper nouns. On the other hand, using a bag-of-words approach may find too many unrelated edits.

We introduce a two-step method that measures the reoccurrence of edits in unlabeled data using exact and approximate matching over multiple representations. The method provides a set of frequencies that is fed into the classifier and allows for learning subtle patterns of reoccurrence. Staying consistent with our initial design considerations, the method is simple and interoperable.

Given a user edit (pre, post), the method does not compare pre with post in any way. It only compares pre with pre-edited sentences of other unlabeled edits and post with post-edited sentences of other unlabeled edits. The first step is to select candidates using a bag-of-words approach. The second step is a comparison of the user edit with each one of the candidates while incrementing counts of similarity measures. These account for exact matches between different representations (original and lower case, lemmas, PoS and NE tags) as well as for approximate matches using character- and word-level edit distance between those representations. An additional feature is the number of distinct documents in the candidate set.
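A much simplified sketch of the two-step procedure follows. It covers only candidate selection by bag of words and exact matching over two representations; the full feature set also uses lemmas, PoS and NE tags and approximate matching, and the function and variable names here are illustrative assumptions.

    from collections import Counter, defaultdict

    def candidate_index(unlabeled_pre_sentences):
        # Step 1: a bag-of-words inverted index over the pre-edited sentences
        # of the unlabeled edits, used to select candidates.
        index = defaultdict(set)
        for i, sent in enumerate(unlabeled_pre_sentences):
            for w in set(sent.lower().split()):
                index[w].add(i)
        return index

    def reoccurrence_features(pre_sentence, unlabeled_pre_sentences, index, k=1000):
        # Step 2: compare the edit's pre side with the top-k candidates and
        # accumulate counts of exact matches over two representations.
        votes = Counter()
        for w in set(pre_sentence.lower().split()):
            votes.update(index.get(w, ()))
        candidates = [unlabeled_pre_sentences[i] for i, _ in votes.most_common(k)]
        feats = Counter()
        for cand in candidates:
            if cand == pre_sentence:
                feats["exact_orig"] += 1
            if cand.lower() == pre_sentence.lower():
                feats["exact_lower"] += 1
        feats["num_candidates"] = len(candidates)
        return dict(feats)

The same two steps are repeated for the post side against the post-edited sentences of the unlabeled edits.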
We compute the set of features for the labeled dataset based on the unlabeled data. The number of candidates is set to 1,000 per user edit. We re-train the classifiers using five configurations: Baseline and All Features are identical to the first experiment. Unlabeled only uses the new feature set without any other feature. Base + Unlabeled adds the new feature set to the baseline. All + Unlabeled uses all available features. All results are for 10-fold cross validation with statistical significance at p < 0.05 by paired t-test, see Table 11.

  Feature set        SVM      RF       Logit
  Baseline           76.26%   76.26%   76.34%
  All Features       87.14%   87.14%   85.64%
  Unlabeled only     78.11%   83.49%   78.78%
  Base + unlabeled   80.86%   85.45%   81.83%
  All + unlabeled    87.23%   88.35%   85.92%

Table 11: Classification accuracy using features from unlabeled data. The first two rows are identical to Table 4. Statistical significance at p < 0.05 is indicated w.r.t. the baseline, w.r.t. all features excluding features from unlabeled data, and w.r.t. another classifier (using the same features). The best result is marked in bold.

We find that features extracted from unlabeled data outperform the baseline and lead to statistically significant improvements when added to it. The combination of all features allows Random Forests to achieve the highest statistically significant accuracy level of 88.35%.

7 Conclusions

This work addresses the task of user edits classification as factual or fluency edits. It adopts a supervised machine learning approach and uses character- and word-level features, part-of-speech tags, named entities, language model probabilities, and a set of features extracted from large amounts of unlabeled data. Our experiments with contiguous user edits extracted from revision histories of the English Wikipedia achieve high classification accuracy and demonstrate generalization to data beyond labeled edits.

Our approach shows that machine learning techniques can successfully distinguish between user edit types, making them a favorable alternative to heuristic solutions. The simple and adaptive nature of our method allows for application to large and evolving sets of user edits.

Acknowledgments. This research was funded in part by the European Commission through the CoSyne project FP7-ICT-4-248531.
References

A. Aji, Y. Wang, E. Agichtein, and E. Gabrilovich. 2010. Using the past to score the present: Extending term weighting models through revision history analysis. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 629-638.

I. Androutsopoulos and P. Malakasiotis. 2010. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38(1):135-187.

L. Breiman. 2001. Random forests. Machine learning, 45(1):5-32.

J. Chae and A. Nenkova. 2009. Predicting the fluency of text with shallow structural features: case studies of machine translation and human-written text. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 139-147.

C. Cortes and V. Vapnik. 1995. Support-vector networks. Machine learning, 20(3):273-297.

C. Dutrey, D. Bernhard, H. Bouamor, and A. Max. 2011. Local modifications and paraphrases in Wikipedia's revision history. Procesamiento del Lenguaje Natural, Revista no 46:51-58.

M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10-18.

A. Hickl, J. Williams, J. Bensley, K. Roberts, B. Rink, and Y. Shi. 2006. Recognizing textual entailment with LCC's GROUNDHOG system. In Proceedings of the Second PASCAL Challenges Workshop.

T. Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98, pages 137-142.

A. Kittur, B. Suh, B.A. Pendleton, and E.H. Chi. 2007. He says, she says: Conflict and coordination in Wikipedia. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 453-462.

V.I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707-710.

P. Malakasiotis. 2009. Paraphrase recognition using machine learning to combine similarity measures. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pages 27-35.

A. Max and G. Wisniewski. 2010. Mining naturally-occurring corrections and paraphrases from Wikipedia's revision history. In Proceedings of LREC, pages 3143-3148.

E.W. Myers. 1986. An O(ND) difference algorithm and its variations. Algorithmica, 1(1):251-266.

R. Nelken and E. Yamangil. 2008. Mining Wikipedia's article revision history for training computational linguistics algorithms. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pages 31-36.

S. Nunes, C. Ribeiro, and G. David. 2011. Term weighting based on document revision history. Journal of the American Society for Information Science and Technology, 62(12):2471-2478.

M. Potthast and T. Holfeld. 2011. Overview of the 2nd international competition on Wikipedia vandalism detection. Notebook for PAN at CLEF 2011.

M. Potthast, B. Stein, and T. Holfeld. 2010. Overview of the 1st international competition on Wikipedia vandalism detection. Notebook Papers of CLEF, pages 22-23.

D. Shapira and J. Storer. 2002. Edit distance with move operations. In Combinatorial Pattern Matching, pages 85-98.

M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223-231.

A. Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the international conference on spoken language processing, volume 2, pages 901-904.

F.B. Viegas, M. Wattenberg, and K. Dave. 2004. Studying cooperation and conflict between authors with history flow visualizations. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 575-582.

A.G. West and I. Lee. 2011. Multilingual vandalism detection using language-independent & ex post facto evidence. Notebook for PAN at CLEF 2011.

A.G. West, S. Kannan, and I. Lee. 2010. Detecting Wikipedia vandalism via spatio-temporal analysis of revision metadata. In Proceedings of the Third European Workshop on System Security, pages 22-28.

K. Woodsend and M. Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 409-420.

E. Yamangil and R. Nelken. 2008. Mining Wikipedia revision histories for improving sentence compression. In Proceedings of ACL-08: HLT, Short Papers, pages 137-140.

M. Yatskar, B. Pang, C. Danescu-Niculescu-Mizil, and L. Lee. 2010. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 365-368.
F.M. Zanzotto and M. Pennacchiotti. 2010. Expand-
ing textual entailment corpora from Wikipedia us-
ing co-training. In Proceedings of the 2nd Work-
shop on Collaboratively Constructed Semantic Re-
sources, COLING 2010.
S. Zwarts and M. Dras. 2008. Choosing the right
translation: A syntactically informed classification
approach. In Proceedings of the 22nd International
Conference on Computational Linguistics-Volume
1, pages 1153-1160.
User Participation Prediction in Online Forums

Zhonghua Qu and Yang Liu
The University of Texas at Dallas
{qzh,yangl}@hlt.utdallas.edu

Abstract

Online community is an important source for latest news and information. Accurate prediction of a user's interest can help provide a better user experience. In this paper, we develop a recommendation system for online forums. There are a lot of differences between online forums and formal media. For example, content generated by users in online forums contains more noise compared to formal documents. Content topics in the same forum are more focused than in sources like news websites. Some of these differences present challenges to traditional word-based user profiling and recommendation systems, but some also provide opportunities for better recommendation performance. In our recommendation system, we propose to (a) use latent topics to interpolate with content-based recommendation; (b) model latent user groups to utilize information from other users. We have collected three types of forum data sets. Our experimental results demonstrate that our proposed hybrid approach works well in all three types of forums.

1 Introduction

The internet is an important source of information. It has become a habit of many people to go to the internet for the latest news and updates. However, not all articles are equally interesting for different users. In order to intelligently predict interesting articles for individual users, personalized news recommendation systems have been developed. There are in general two types of approaches upon which recommendation systems are built. Content-based recommendation systems use the textual information of news articles and user-generated content to rank items. Collaborative filtering, on the other hand, uses co-occurrence information from a collection of users for recommendation.

During the past few years, online communities have become a large part of the internet. Often, the latest information and knowledge appear in online communities earlier than in other formal media. This makes them a favorable place for people seeking timely updates and the latest information. Online community sites appear in many forms, for example, online forums, blogs, and social networking websites. Here we focus our study on online forums. It is very helpful to build an automatic system to suggest the latest information a user would be interested in. However, unlike formal news media, user-generated content in forums is usually less organized and not well formed. This presents a great challenge to many existing news article recommendation systems. In addition, what makes online forums different from other media is that users of online communities are not only the information consumers but also active providers as participants. Therefore in this study we develop a recommendation system to account for these characteristics of forums. We propose several improvements over previous work:

- Latent topic interpolation: This is to address the issue with the word-based content representation. In this paper we use Latent Dirichlet Allocation (LDA), a generative multinomial mixture model, for topic inference inside threads. We build a system based on words and latent topics, and linearly interpolate their results.

- User modeling: We model users' participation inside threads as latent user groups. Each latent group is a multinomial distribution on users. LDA is then used to infer the group mixture inside each thread, based on which the probability of a user's participation can be derived.

- Hybrid system: Since content- and user-based methods rely on different information sources, we combine the results from them for further improvement.

We have evaluated our proposed method using three data sets collected from three representative forums. Our experimental results show that in all forums, by using latent topic information, the system can achieve better accuracy in predicting threads for recommendation. In addition, by modeling latent user groups in thread participation, further improvement is achieved in the hybrid system. Our analysis also showed that each forum has its own nature, resulting in different optimal parameters in the different forums.

2 Related Work

Recommendation systems can help make the information retrieving process more intelligent. Generally, recommendation methods are categorized into two types (Adomavicius and Tuzhilin, 2005): content-based filtering and collaborative filtering.

Systems using content-based filtering use the content information of recommendation items a user is interested in to recommend new items to the user. For example, in a news recommendation system, in order to recommend appropriate news articles to a user, it finds the most prominent features (e.g., key words, tags, category) in the documents that a user likes, then suggests similar articles based on this personal profile. In Fab's system (Balabanovic and Shoham, 1997) and the Skyskill & Webert system (Pazzani et al., 1997), documents are represented using a set of most important words according to a weighting measure. The most popular measure of word importance is TF-IDF (term frequency, inverse document frequency) (Salton and Buckley, 1988), which gives weights to words according to their informativeness. Then, based on this personal profile, a ranking machine is applied to give a ranked recommendation list. In Fab's system, the Rocchio algorithm (Rocchio, 1971) is used to learn the average TF-IDF vector of highly rated documents. Skyskill & Webert's system uses Naive Bayes classifiers to give the probability of documents being liked. Winnow's algorithm (Littlestone, 1988), which is similar to the perceptron algorithm, has been shown to perform well when there are many features. An adaptive framework is introduced in (Li et al., 2010) using forum comments for news recommendation. In (Wu et al., 2010), a topic-specific topic flow model is introduced to rank the likelihood of a user participating in a thread in online forums.

Collaborative-filtering based systems, unlike content-based systems, predict the recommended items using co-occurrence information between users. For example, in a news recommendation system, in order to recommend an article to user c, the system tries to find users with similar taste as c. Items favored by similar users would be recommended. Grundy (Rich, 1979) is known to be one of the first collaborative-filtering based systems. Collaborative filtering systems can be either model based or memory based (Breese et al., 1998). Memory-based algorithms, such as (Delgado and Ishii, 1999; Nakamura and Abe, 1998; Shardanand and Maes, 1995), use a utility function to measure the similarity between users. Then recommendation of an item is made according to the sum of the utility values of active users that participate in it. Model-based algorithms, on the other hand, try to formulate the probability function of one item being liked statistically using active user information. (Ungar et al., 1998) clustered similar users into groups for recommendation. Different clustering methods have been experimented with, including K-means and Gibbs Sampling. Other probabilistic models have also been used to model collaborative relationships, including a Bayesian model (Chien and George, 1999), a linear regression model (Sarwar et al., 2001), and Gaussian mixture models (Hofmann, 2003; Hofmann, 2004). In (Blei et al., 2001) a collaborative filtering application is discussed using LDA. However, in this model, re-estimation of parameters for the whole system is needed when a new item comes in. In this paper, we formulate users' participation differently using the LDA mixture model.

Some previous work has also evaluated using a hybrid model with both content and collaborative features and showed outstanding performance. For example, in (Basu et al., 1998), hybrid features are used to make recommendations using inductive learning.

3 Forum Data

We have collected data from three forums in this study.[1] The Ubuntu community forum is a technical support forum; the World of Warcraft (WoW) forum is about gaming; the Fitness forum is about how to live a healthy life. These three forums are quite representative of online forums on the internet. Using three different types of forums for task evaluation helps to demonstrate the robustness of our proposed method. In addition, it can show how the same method could have substantial performance differences on forums of different nature. Users' behaviors in these three forums are very different. Casual forums like "Wow gaming" have many more posts in each thread. However, its posts are the shortest in length. This is because discussions inside these types of forums are more like casual conversation, there is not much requirement on the users' background, and thus there is more user participation. In contrast, technical forums like "Ubuntu" have fewer average posts in each thread, and have the longest post length. This is because a Question and Answer (QA) forum tends to be very goal oriented. If a user finds the thread is unrelated, then there will be no motivation for participation.

Inside forums, different boards are created to categorize the topics allowed for discussion. From the data we find that users tend to participate in a few selected boards of their choice. To create a data set for user interest prediction in this study, we pick the most popular boards in each forum. Even within the same board, users tend to participate in different threads based on their interest. We use a user's participation information as an indication of whether a thread is interesting to a user or not. Hence, our task is to predict the user participation in forum threads. Note this approach could introduce some bias toward negative instances in terms of user interests. A user's absence from a thread does not necessarily mean the user is not interested in that thread; it may be a result of the user being offline at that time or the thread being too far behind in pages. As a matter of fact, we found most users read only the threads on the first page during their time of visit to a forum. This makes participation prediction an even harder task than interest prediction.

In online forums, threads are ordered by the time stamp of their last participating post. Provided with the time stamp for each post, we can calculate the order of a thread on its board during a user's participation. Figure 1 shows the distribution of post location during users' participation. We found that most of the users read only the posts on the first page. In order to minimize the false negative instances from the data set, we did thread location filtering. That is, we want to filter out messages that actually interest the user but do not have the user's participation because they are not on the first page. For any user, only those threads appearing in the first 10 entries on a page during a user's visit are included in the data set.

[Figure 1: Thread position during users' participation.]

In the pre-processing step of the experiment, we first use the online status filtering discussed above to remove threads that a user does not see while offline. The statistics of the boards we have used in each forum are shown in Table 1. The statistics are consistent with the full forum statistics. For example, users in technical forums tend to post less than in casual forums. We define active users as those who have participated in 10 or more threads. Column "Part. @300" shows the average number of threads the top 300 users have participated in. "Filt. Threads @300" shows the average number of threads after using online filtering with a window of 10. Thread participation in the "Ubuntu" forum is very sparse for each user, having only 10.01% participating threads for each user after filtering. The "Fitness" and "Wow" forums have denser participation, at 18.97% and 13.86% respectively.

[1] Please contact the authors to obtain the data.

4 Interesting Thread Prediction

In the task of interesting thread prediction, the system generates a ranked list of threads a user is likely to be interested in based on the user's past history of thread participation. Here, instead of predicting the true interestedness, we predict the participation of the user, which is a sufficient condition for interestedness. This approach is also used by (Wu et al., 2010) for their task evaluation. In this section, we describe our proposed approaches for thread participation prediction.

4.1 Content-based Filtering

In the content-based filtering approach, only the content of a thread is used as features for prediction. Recommendation through content-based filtering has its deep roots in information retrieval. Here we use a Naive Bayes classifier for ranking the threads using information based on the words and the latent topic analysis.

4.1.1 Naive Bayes Classification

In (Pazzani et al., 1997) the Naive Bayesian classifier showed outstanding performance in web page recommendation compared to several other classifiers. A Naive Bayes classifier is a generative model in which words inside a document are assumed to be conditionally independent. That is, given the class of a document, words are generated independently. The posterior probability of a test instance in a Naive Bayes classifier takes the following form:

  $P(C_i \mid f_{1..k}) = \frac{1}{Z} P(C_i) \prod_j P(f_j \mid C_i)$   (1)

where $Z$ is the class-label-independent normalization term and $f_{1..k}$ is the bag-of-words feature vector for the document. The Naive Bayes classifier is known for not having a well calibrated posterior probability (Bennett, 2000). (Pavlov et al., 2004) showed that normalization by document length yielded good empirical results in approximating a well calibrated posterior probability for the Naive Bayes classifier. The normalized Naive Bayes classifier they used is as follows:

  $P(C_i \mid f_{1..k}) = \frac{1}{Z} P(C_i) \prod_j P(f_j \mid C_i)^{\frac{1}{|f|}}$   (2)

In this equation, the probability of generating each word is normalized by the length of the feature vector $|f|$. The posterior probability $P(interested \mid f_{1..k})$ from the (normalized) Naive Bayes classifier is used for recommendation item ranking.

4.1.2 Latent Topics based Interpolation

Because of noisy forum writing and limited training data, the above bag-of-words model used in the Naive Bayes classifier may suffer from data sparsity issues. We thus propose to use latent topic modeling to alleviate this problem. Latent Dirichlet Allocation (LDA) is a generative model based on latent topics. The major difference between LDA and previous methods such as probabilistic Latent Semantic Analysis (pLSA) is that LDA can efficiently infer the topic composition of new documents, regardless of the training data size (Blei et al., 2001). This makes it ideal for efficiently reducing the dimension of incoming documents.

In an online forum, words contained in threads tend to be very noisy. Irregular words, such as abbreviations, misspellings and synonyms, are very common in an online environment. From our experiments, we observe that LDA seems to be quite robust to these phenomena and able to capture word relationships semantically. To illustrate the words inside latent topics in the LDA model inferred from online forums, we show in Table 2 the top words in 3 out of 20 latent topics inferred from the "Ubuntu" forum according to its multinomial distribution. We can see that variations of the same words are grouped into the same topic.

Since each post could be very short and LDA is generally known not to work well with short documents, we concatenated the content of posts inside each thread to form documents. In order to build a valid evaluation configuration, only posts before the first time the testing user participated are used for model fitting and inference.
Forum Name Threads Posts Active Users Part. @300 Filt. Threads @300
Ubuntu 185,747 940,230 1,700 464.72 4641.25
Fitness 27,250 529,201 2,808 613.15 3231.04
Wow Gaming 34,187 1,639,720 19,173 313.77 2264.46

Table 1: Data statistics after filtering.

Topic 1   Topic 2   Topic 3
lold      wine      email
lol.      Wine      mail
imo.      game      Thunderbird
,         fixme     evolution
-,        stub      send
lulz.     not       emails
lmao.     WINE      gmail
rofl.     play      postfix

Table 2: Example of LDA topics that capture words with different variations.

After model fitting for LDA, the topic distributions on new threads can be inferred using the model. Compared to the original bag-of-word feature vector, the topic distribution vector is not only more robust against noise, but also closer to human interpretation of words. For example, in topic 3 in Table 2, people who care about Thunderbird, an email client, are also very likely to show interest in postfix, which is a Linux email service. These closely related words, however, might not be captured using the bag-of-word model, since that would require the exact words to appear in the training set.

In order to take advantage of the topic-level information while not losing the fine-grained word-level features, we use the topic distribution as additional features in combination with the bag-of-word features. To tune the contribution of topic-level features in classifiers like the Naive Bayes classifier, we normalize the topic-level features to a length of Lt = γ|f| and the bag-of-word features to Lw = (1 − γ)|f|. γ is a tuning parameter from 0 to 1 that determines the proportion of the topic information used in the features. |f| is from the original bag-of-word feature vector. The final feature vector for each thread can be represented as:

F = ⟨Lw·w1, ..., Lw·wk, Lt·θ1, ..., Lt·θT⟩    (3)

where θ1, ..., θT is the multinomial distribution of topics for the thread.

4.2 Collaborative Filtering

Collaborative filtering techniques make predictions using information from similar users. They have an advantage over content-based filtering in that they can correctly predict items that are vastly different in content but similar in concepts indicated by users' participation.

In some previous work, clustering methods were used to partition users into several groups; then, predictions were made using information from users in the same group. However, in the case of thread recommendation, we found that users' interest does not form clean clusters. Figure 2 shows the mutual information between users after doing an average-link clustering on their pairwise mutual information. In a clean clustering, intra-cluster mutual information should be high, while inter-cluster mutual information is very low. If so, we would expect the figure to show clear rectangles along the diagonal. Unfortunately, from this figure it appears that users far away in the hierarchy tree still have a lot of common thread participation. Here, we propose to model user similarity based on latent user groups.

4.2.1 Latent User Groups

In this paper, we model users' participation inside threads as an LDA generative model. We model each user group as a multinomial distribution. Users inside each group are assumed to have common interests in certain topic(s). A thread in an online forum typically contains several such topics. We could model a user's participation in a thread as a mixture of several different user groups. Since one thread typically attracts a subset of user groups, it is reasonable to add a Dirichlet prior on the user group mixture.

The generative process is the same as the LDA used above for topic modeling, except now users

are words and user groups are topics. Using LDA to model user participation can be viewed as soft clustering of users, in the sense that one user could appear in multiple groups at the same time. The generative process for participating users is as follows.

1. Choose θ ~ Dir(α)

2. For each of N participating users, u_n:
   (a) Choose a group z_n ~ Multinomial(θ)
   (b) Choose a user u_n ~ p(u_n | z_n)

[Figure 2: Mutual information between users in Average Link Hierarchical clustering.]

One thing worth noting is that in the LDA model a document is assumed to consist of many words. In the case of modeling user participation, a thread typically has far fewer users than words inside a document. This could potentially cause problems during variable estimation and inference. However, we show that this approach actually works well in practice (experimental results in Section 5).

4.2.2 Using Latent User Groups for Prediction

For an incoming new thread, first the latent group distribution is inferred using collapsed Gibbs sampling (Griffiths and Steyvers, 2004). The posterior probability of a user u_i participating in thread j given the user group distribution is as follows.

P(u_i | θ_j, φ) = Σ_{k ∈ T} P(u_i | φ_k) P(k | θ_j)    (4)

In the equation, φ_k is the multinomial distribution of users in group k, T is the number of latent user groups, and θ_j is the group composition in thread j after inference using the training data. In general, the probability of user u_i appearing in thread j is proportional to the membership probabilities of this user in the groups that compose the participating users.

4.3 Hybrid System

Up to this point we have two separate systems that can generate ranked recommendation lists based on different factors of threads. In order to generate the final ranked list, we give each item a score according to the ranked lists from the two systems. Then the two scores are linearly interpolated using a tuning parameter λ, as shown in Equation 5. The final ranked list is generated accordingly.

C_i = (1 − λ)·Score_content + λ·Score_collaborative    (5)

We propose several different rescoring methods to generate the scores in the above formula for the two individual systems.

Posterior: The posterior probabilities of each item from the two systems are used directly as the score.

Score_dir = p(c_like | item_i)    (6)

This way the confidence of how likely an item is to be interesting is preserved. However, the downside is that the two systems have different calibrations of their posterior probabilities, which could be problematic when directly adding them together.

Linear rescore: To counter the problem associated with posterior probability calibration, we use linear rescoring based on the ranked list:

Score_lin = 1 − pos_i / N    (7)

In the formula, pos_i is the position of item i in the ranked list, and N is the total number of items being ranked. The resulting score is between 0 and 1, 1 being the first item on the list and 0 being the last.

Sigmoid rescore: In a ranked list, usually items on the top and bottom of the list have

higher confidence than those in the middle. That is to say, more emphasis should be put on both ends of the list. Hence we use a sigmoid function on Score_lin to capture this.

Score_sig = 1 / (1 + e^{−l·(Score_lin − 0.5)})    (8)

A sigmoid function is relatively flat on both ends while being steep in the middle. In the equation, l is a tuning parameter that decides how flat the scores at both ends of the list are going to be. Determining the best value for l is not a trivial problem. Here we empirically assign l = 10.

5 Experiment and Evaluation

In this section, we evaluate our approach empirically on the three forum data sets described in Section 3. We pick the top 300 most active users from each forum for the evaluation. Among the 300 users, 100 are randomly selected as the development set for parameter tuning, while the rest form the test set. All the data sets are filtered using an online filter as previously described, with a window size of 10 threads.

Threads are tokenized into words and filtered using a simple English stop word list. All words are then ordered by their occurrences multiplied by their inverse document frequencies (IDF).

idf_w = log( |D| / |{d : w ∈ d}| )    (9)

The top 4,000 words from this list are then used to form the vocabulary.

We used standard mean average precision (MAP) as the evaluation metric. This standard information retrieval evaluation metric measures the quality of the ranked lists returned by a system. Entries higher in the rank are more accurate than lower ones. For an interesting-thread recommendation system, it is preferable to provide a short and high-quality list of recommendations; therefore, instead of reporting full-range MAP, we report MAP on the top 10 relevant threads (MAP@10). The reason why we picked 10 as the number of relevant documents for MAP evaluation is that users might not have time to read too many posts, even if they are relevant.

During evaluation, a 3-fold cross-validation is performed for each user in the test set. In each fold, the MAP@10 score is calculated from the ranked list generated by the system. Then the average over all the folds and all the users is computed as the final result.

To make a proper evaluation configuration, for each user, only posts up to the first participation of the testing user are used for the test set.

5.1 Content-based Results

Here we evaluate the performance of interesting-thread prediction using only features from text. First we use the ranking model with latent topic information only on the development set to determine an optimal number of topics. Empirically, we set the two Dirichlet hyperparameters of LDA to 0.1 and 1/K (K is the number of topics). We use the performance of content-based recommendation directly to determine the optimal topic number K. We varied the latent topic number K from 10 to 100, and found that the best performance was achieved using 30 topics in all three forums. Hence we use K = 30 for content-based recommendation unless otherwise specified.

Next, we show how topic information can help content-based recommendation achieve better results. We tune the parameter γ described in Section 4.1.2 and show the corresponding performances. We compare the performance using the Naive Bayes classifier, before and after normalization. The MAP@10 results on the test set are shown in Figure 3 for the three forums. When γ = 0, no latent topic information is used, and when γ = 1, latent topics are used without any word features.

When using the Naive Bayes classifier without normalization, we find a relatively larger performance gain from adding topic information for values of γ close to 0. This phenomenon is probably because of the poor posterior probabilities of the Naive Bayes classifier, which are close to either 1 or 0.

For the normalized Naive Bayes classifier, interpolating with latent-topic-based ranking consistently yields a performance improvement compared to word-based results for the three forums. In the Wow Gaming corpus, the optimal performance is achieved with a relatively high γ value (at around 0.5), and it is even higher for the Fitness forum.
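For reference, MAP@10 over per-user ranked lists can be computed as sketched below; the binary relevance judgments and the cutoff of 10 follow the description above, while the helper itself is our own simplified version.

```python
def average_precision_at_k(ranked, relevant, k=10):
    """Average precision over the top-k positions of one ranked list."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank          # precision at this rank
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_10(runs):
    """Mean over per-user (and per-fold) ranked lists with their relevant sets."""
    return sum(average_precision_at_k(r, rel) for r, rel in runs) / len(runs)

runs = [(["t3", "t1", "t7"], {"t1", "t9"}), (["t2", "t5"], {"t2"})]
print(map_at_10(runs))
```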

This means that the system relies more on the latent topics information. This is because in these forums, casual conversation contains more irregular words, causing a more severe data sparsity problem than in others.

Between the two Naive Bayes classifiers, we can see that using normalized probabilities outperforms the original one in the Wow Gaming and Ubuntu forums. This observation is consistent with previous work (e.g., (Pavlov et al., 2004)). However, we found that in the Fitness forum, the performance degrades with normalization. Further work is still needed to understand why this is the case.

5.2 Latent User Group Classification

In this section, collaborative filtering using latent user groups is evaluated. First, participating users from the training set are used to estimate an LDA model. Then, users participating in a thread are used to infer the topic distribution of the thread. Candidate threads are then sorted by the probability of a target user's participation according to Equation 4. Note that all the users in the forum are used to estimate the latent user groups, but only the top 300 active users are used in evaluation. Here, we vary the number of latent user groups G from 5 to 100. The hyperparameters were set empirically to 1/G and 0.1.

Figure 4 shows the MAP@10 results using different numbers of latent groups for the three forums. We compare the performance using latent groups with a baseline using SVM ranking. In the baseline system, users' participation in a thread is used as a binary feature. LibSVM with a radial basis function (RBF) kernel is used to estimate the probability of a user's participation.

From the results, we find that ranking using latent group information outperforms the baseline in almost all non-trivial cases. In the case of the Ubuntu forum, the performance gain is smaller compared to the other forums. We believe this is because in this technical support forum, the average user participation in threads is much lower, thus making it hard to infer a reliable group distribution in a thread. In addition, the optimal number of user groups differs greatly between the Fitness forum and the Wow Gaming forum. We conjecture the reason behind this is that in the Fitness forum, users

[Figure 5 (axes: #user vs. #word): Position of items with different #users and #words in a ranked list (red = 0, i.e., higher on the ranked list; green = lower).]

may be interested in a larger variety of topics and thus the user distribution in different topics is not very distinct. In contrast, people in the gaming forum are more specific about the topics they are interested in.

It is known that LDA tends to perform poorly when there are too few words/users. To get a general idea of how much user participation is enough for decent prediction, we show a graph (Figure 5) depicting the relationships among the number of users, the number of words, and the position of the positive instances in the ranked lists. In this graph, every dot is a positive thread instance in the Wow Gaming forum. Red color shows that the positive thread is indeed getting higher ranks than others. We observe that threads with around 16 participants can already achieve a decent performance.

5.3 Hybrid System Performance

In this section, we evaluate the performance of the hybrid system output. Parameters used in each forum data set are the optimal parameters found in the previous sections. Here we show the effect of the tuning parameter λ (described in Section 4.3). Also, we compare three different scoring schemes used to generate the final ranked list. Performance of the hybrid system is shown in Table 3.

We can see that the combination of the two systems always outperforms any one model alone.

[Figure 3 consists of three panels (Ubuntu Forum, Wow Gaming, Fitness Forum), each plotting MAP@10 on the y-axis against γ from 0 to 1 on the x-axis, with curves for Naive Bayes and Normalized NB.]

Figure 3: Content-based filtering results: MAP@10 vs. γ (contribution of topic-based features).
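As a concrete reading of Equation 3, whose effect the γ sweep in Figure 3 measures, the sketch below rescales the topic block to length γ|f| and the word block to length (1 − γ)|f| before concatenating them. The exact normalization is our interpretation of the text, not the authors' code.

```python
import numpy as np

def combine_features(word_vec, topic_vec, gamma):
    """Concatenate a bag-of-words block rescaled to length (1-gamma)*|f| with a
    topic block rescaled to length gamma*|f|, where |f| is the norm of the
    original bag-of-words vector."""
    w = np.asarray(word_vec, dtype=float)
    t = np.asarray(topic_vec, dtype=float)
    f_norm = np.linalg.norm(w)
    w_scaled = w * (1 - gamma) * f_norm / (np.linalg.norm(w) or 1.0)
    t_scaled = t * gamma * f_norm / (np.linalg.norm(t) or 1.0)
    return np.concatenate([w_scaled, t_scaled])

print(combine_features([3.0, 0.0, 1.0], [0.7, 0.3], gamma=0.5))
```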

[Figure 4 consists of three panels (Ubuntu Forum, Wow Gaming, Fitness Forum), each plotting MAP@10 on the y-axis against the number of latent user groups from 1 to 100 on the x-axis, with curves for the Latent Group and SVM rankers.]

Figure 4: Collaborative filtering results: MAP@10 vs. user group number.
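To illustrate the collaborative score of Equation 4 that Figure 4 evaluates: given a thread's inferred group mixture and per-group user distributions, a user's participation probability is a mixture over groups. The variable names follow the reconstruction above, and the numbers are invented.

```python
def participation_prob(user, theta_thread, phi_groups):
    """Equation 4: P(u | thread) = sum_k P(u | group k) * P(group k | thread)."""
    return sum(phi_groups[k].get(user, 0.0) * p_k
               for k, p_k in enumerate(theta_thread))

theta_thread = [0.7, 0.3]                        # inferred group mixture of one thread
phi_groups = [{"alice": 0.4, "bob": 0.1},        # user distribution of group 0
              {"alice": 0.05, "carol": 0.5}]     # user distribution of group 1
users = ["alice", "bob", "carol"]
print(sorted(users, key=lambda u: participation_prob(u, theta_thread, phi_groups),
             reverse=True))
```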

Forum      Contribution factor
           0.0      1.0      Optimal
Ubuntu     0.523    0.198    0.534 (0.9)
Wow        0.278    0.283    0.304 (0.1)
Fitness    0.545    0.457    0.551 (0.85)

Table 3: Performance of the hybrid system with different values of the contribution factor.

This is intuitive, since the two models use different information sources. A MAP@10 score of 0.5 means that around half of the suggested results do have user participation. We think this is a good result considering that this is not a trivial task.

We also notice that, based on the nature of different forums, the optimal λ value can be substantially different. For example, in the Wow Gaming forum, where people participate in more threads, a higher λ value is observed, which favors the collaborative filtering score. In contrast, in the Ubuntu forum, where people participate in far fewer threads, the content-based system is more reliable in thread prediction, hence a lower λ is used. This observation also shows that the hybrid system is more robust against differences among forums compared with single-model systems.

6 Conclusion

In this paper, we proposed a new system that can intelligently recommend threads from an online community according to a user's interest. The system uses both content-based filtering and collaborative-filtering techniques. In content-based filtering, we address the problem of data sparsity in online content by smoothing using latent topic information. In collaborative filtering, we model users' participation in threads with latent groups under an LDA framework. The two systems complement each other, and their combination achieves better performance than the individual ones. Our experiments across different forums demonstrate the robustness of our methods and the differences among forums. In future work, we plan to explore how social information could help further refine a user's interest.

References

Gediminas Adomavicius and Alexander Tuzhilin. 2005. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734-749.

Marko Balabanovic and Yoav Shoham. 1997.

Fab: Content-based, collaborative recommendation. Communications of the ACM, 40:66-72.

Chumki Basu, Haym Hirsh, and William Cohen. 1998. Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 714-720. AAAI Press.

Paul N. Bennett. 2000. Assessing the calibration of naive Bayes posterior estimates.

David Blei, Andrew Y. Ng, and Michael I. Jordan. 2001. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.

John S. Breese, David Heckerman, and Carl Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. Pages 43-52. Morgan Kaufmann.

Y. H. Chien and E. I. George. 1999. A Bayesian model for collaborative filtering. Number 1.

Joaquin Delgado and Naohiro Ishii. 1999. Memory-based weighted-majority prediction for recommender systems.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228-5235, April.

Thomas Hofmann. 2003. Collaborative filtering via Gaussian probabilistic latent semantic analysis. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, pages 259-266, New York, NY, USA. ACM.

Thomas Hofmann. 2004. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115.

Qing Li, Jia Wang, Yuanzhu Peter Chen, and Zhangxi Lin. 2010. User comments for news recommendation in forum-based social media. Information Sciences, 180:4929-4939, December.

Nick Littlestone. 1988. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. In Machine Learning, pages 285-318.

Atsuyoshi Nakamura and Naoki Abe. 1998. Collaborative filtering using weighted majority prediction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98, pages 395-403, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Dmitry Pavlov, Ramnath Balasubramanyan, Byron Dom, Shyam Kapur, and Jignashu Parikh. 2004. Document preprocessing for naive Bayes classification and clustering with mixture of multinomials. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 829-834, New York, NY, USA. ACM.

Michael Pazzani, Daniel Billsus, S. Michalski, and Janusz Wnek. 1997. Learning and revising user profiles: The identification of interesting web sites. In Machine Learning, pages 313-331.

Elaine Rich. 1979. User modeling via stereotypes. Cognitive Science, 3(4):329-354.

J. Rocchio. 1971. Relevance Feedback in Information Retrieval.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513-523.

Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In WWW '01: Proceedings of the 10th International Conference on World Wide Web, pages 285-295, New York, NY, USA. ACM.

Upendra Shardanand and Pattie Maes. 1995. Social information filtering: Algorithms for automating word of mouth. In CHI, pages 210-217.

Lyle Ungar and Dean Foster. 1998. Clustering methods for collaborative filtering. AAAI Press.

Hao Wu, Jiajun Bu, Chun Chen, Can Wang, Guang Qiu, Lijun Zhang, and Jianfeng Shen. 2010. Modeling dynamic multi-topic discussions in online forums. In AAAI.

Inferring Selectional Preferences from Part-Of-Speech N-grams

Hyeju Jang and Jack Mostow


Project LISTEN (www.cs.cmu.edu/~listen), School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
hyejuj@cs.cmu.edu, mostow@cs.cmu.edu

Abstract

We present the PONG method to compute selectional preferences using part-of-speech (POS) N-grams. From a corpus labeled with grammatical dependencies, PONG learns the distribution of word relations for each POS N-gram. From the much larger but unlabeled Google N-grams corpus, PONG learns the distribution of POS N-grams for a given pair of words. We derive the probability that one word has a given grammatical relation to the other. PONG estimates this probability by combining both distributions, whether or not either word occurs in the labeled corpus. PONG achieves higher average precision on 16 relations than a state-of-the-art baseline in a pseudo-disambiguation task, but lower coverage and recall.

1 Introduction

Selectional preferences specify plausible fillers for the arguments of a predicate, e.g., celebrate. Can you celebrate a birthday? Sure. Can you celebrate a pencil? Arguably yes: Today the Acme Pencil Factory celebrated its one-billionth pencil. However, such a contrived example is unnatural because, unlike birthday, pencil lacks a strong association with celebrate. How can we compute the degree to which birthday or pencil is a plausible and typical object of celebrate?

Formally, we are interested in computing the probability Pr(r | t, R), where (as Table 1 specifies) t is a target word such as celebrate, r is a word possibly related to it, such as birthday or pencil, and R is a possible relation between them, whether a semantic role such as the agent of an action, or a grammatical dependency such as the object of a verb. We call t the target because originally it referred to a vocabulary word targeted for instruction, and r its relative.

Notation      Description
R             a relation between words
t             a target word
r, r'         possible relatives of t
g             a word N-gram
g_i and g_j   the ith and jth words of g
p             the POS N-gram of g

Table 1: Notation used throughout this paper

Previous work on selectional preferences has used them primarily for natural language analytic tasks such as word sense disambiguation (Resnik, 1997), dependency parsing (Zhou et al., 2011), and semantic role labeling (Gildea and Jurafsky, 2002). However, selectional preferences can also apply to natural language generation tasks such as sentence generation and question generation. For generation tasks, choosing the right word to express a specified argument of a relation requires knowing its connotations, that is, its selectional preferences. Therefore, it is useful to know selectional preferences for many different relations. Such knowledge could have many uses. In education, they could help teach word connotations. In machine learning they could help computers learn languages. In machine translation, they could help generate more natural wording.

This paper introduces a method named PONG (for Part-Of-Speech N-Grams) to compute selectional preferences for many different relations by combining part-of-speech information and Google N-grams. PONG achieves higher precision on a pseudo-

disambiguation task than the best previous model (Erk et al., 2010), but lower coverage.

The paper is organized as follows. Section 2 describes the relations for which we compute selectional preferences. Section 3 describes PONG. Section 4 evaluates PONG. Section 5 relates PONG to prior work. Section 6 concludes.

2 Relations Used

Selectional preferences characterize constraints on the arguments of predicates. Selectional preferences for semantic roles (such as agent and patient) are generally more informative than for grammatical dependencies (such as subject and object). For example, consider these semantically equivalent but grammatically distinct sentences:

Pat opened the door.
The door was opened by Pat.

In both sentences the agent of opened, namely Pat, must be capable of opening something, an informative constraint on Pat. In contrast, knowing that the grammatical subject of opened is Pat in the first sentence and the door in the second sentence tells us only that they are nouns.

Despite this limitation, selectional preferences for grammatical dependencies are still useful, for a number of reasons. First, in practice they approximate semantic role labels. For instance, typically the grammatical subject of opened is its agent. Second, grammatical dependencies can be extracted by parsers, which tend to be more accurate than current semantic role labelers. Third, the number of different grammatical dependencies is large enough to capture diverse relations, but not so large as to have sparse data for individual relations. Thus in this paper, we use grammatical dependencies as relations.

A parse tree determines the basic grammatical dependencies between the words in a sentence. For instance, in the parse of Pat opened the door, the verb opened has Pat as its subject and door as its object, and door has the as its determiner. Besides these basic dependencies, we use two additional types of dependencies.

Composing two basic dependencies yields a collapsed dependency (de Marneffe and Manning, 2008). For example, consider this sentence:

The airplane flies in the sky.

Here sky is the prepositional object of in, which is the head of a prepositional phrase attached to flies. Composing these two dependencies yields the collapsed dependency prep_in between flies and sky, which captures an important semantic relation between these two content words: sky is the location where flies occurs. Other function words yield different collapsed dependencies. For example, consider these two sentences:

The airplane flies over the ocean.
The airplane flies and lands.

Collapsed dependencies for the first sentence include prep_over between flies and ocean, which characterizes their relative vertical position, and conj_and between flies and lands, which links two actions that an airplane can perform. As these examples illustrate, collapsing dependencies involving prepositions and conjunctions can yield informative dependencies between content words.

Besides collapsed dependencies, PONG infers inverse dependencies. Inverse selectional preferences are selectional preferences of arguments for their predicates, such as a preference of a subject or object for its verb. They capture semantic regularities such as the set of verbs that an agent can perform, which tend to outnumber the possible agents for a verb (Erk et al., 2010).

3 Method

To compute selectional preferences, PONG combines information from a limited corpus labeled with the grammatical dependencies described in Section 2, and a much larger unlabeled corpus. The key idea is to abstract word sequences labeled with grammatical relations into POS N-grams, in order to learn a mapping from POS N-grams to those relations. For instance, PONG abstracts the parsed sentence Pat opened the door as NN VB DT NN, with the first and last NN as the subject and object of the VB. To estimate the distribution of POS N-grams containing particular target and relative words, PONG POS-tags Google N-grams (Franz and Brants, 2006).

Section 3.1 derives PONG's probabilistic model for combining information from labeled and unlabeled corpora. Sections 3.2 and 3.3 describe how PONG estimates probabilities from each corpus. Section 3.4 discusses a sparseness problem revealed during probability estimation, and how we address it in PONG.

3.1 Probabilistic model

We quantify the selectional preference for a relative r to instantiate a relation R of a target t as the probability Pr(r | t, R), estimated as follows. By the definition of conditional probability:

Pr(r | t, R) = Pr(r, t, R) / Pr(t, R)

We care only about the relative probability of different r for fixed t and R, so we rewrite it as:

Pr(r, t, R)

We use the chain rule:

Pr(R | r, t) Pr(r | t) Pr(t)

and notice that t is held constant:

Pr(R | r, t) Pr(r | t)

We estimate the second factor as follows:

Pr(r | t) = Pr(t, r) / Pr(t) ≈ freq(t, r) / freq(t)

We calculate the denominator freq(t) as the number of N-grams in the Google N-gram corpus that contain t, and the numerator freq(t, r) as the number of N-grams containing both t and r.

To estimate the factor Pr(R | r, t) directly from a corpus of text labeled with grammatical relations, it would be trivial to count how often a word r bears relation R to target word t. However, the results would be limited to the words in the corpus, and many relation frequencies would be estimated sparsely or missing altogether; t or r might not even occur. Instead, we abstract each word in the corpus as its part-of-speech (POS) label. Thus we abstract The big boy ate meat as DT JJ NN VB NN. We call this sequence of POS tags a POS N-gram. We use POS N-grams to predict word relations. For instance, we predict that in any word sequence with this POS N-gram, the JJ will modify (amod) the first NN, and the second NN will be the direct object (dobj) of the VB.

This prediction is not 100% reliable. For example, the initial 5-gram of The big boy ate meat pie has the same POS 5-gram as before. However, the dobj of its VB (ate) is not the second NN (meat), but the subsequent NN (pie). Thus POS N-grams predict word relations only in a probabilistic sense.

To transform Pr(R | r, t) into a form we can estimate, we first apply the definition of conditional probability:

Pr(R | t, r) = Pr(R, t, r) / Pr(t, r)

To estimate the numerator Pr(R, t, r), we first marginalize over the POS N-gram p:

Σ_p Pr(R, t, r, p) / Pr(t, r)

We expand the numerator using the chain rule:

Σ_p Pr(R | t, r, p) Pr(p | t, r) Pr(t, r) / Pr(t, r)

Cancelling the common factor yields:

Σ_p Pr(R | p, t, r) Pr(p | t, r)

We approximate the first term Pr(R | p, t, r) as Pr(R | p), based on the simplifying assumption that R is conditionally independent of t and r, given p. In other words, we assume that given a POS N-gram, the target and relative words t and r give no additional information about the probability of a relation. However, their respective positions i and j in the POS N-gram p matter, so we condition the probability on them:

Pr(R | p, t, r) ≈ Pr(R | p, i, j)

Summing over their possible positions, we get

Pr(R | r, t) ≈ Σ_p Σ_i Σ_j Pr(R | p, i, j) Pr(p | t = g_i, r = g_j)

As Figure 1 shows, we estimate Pr(R | p, i, j) by abstracting the labeled corpus into POS N-grams. We estimate Pr(p | t = g_i, r = g_j) based on the frequency of partially lexicalized POS N-grams like DT JJ:red NN:hat VB NN among Google N-grams with t and r in the specified positions. Sections 3.2 and 3.3 describe how we estimate Pr(R | p, i, j) and Pr(p | t = g_i, r = g_j), respectively.

Note that PONG estimates relative rather than absolute probabilities. Therefore it cannot (and does not) compare them against a fixed threshold to make decisions about selectional preferences.

3.2 Mapping POS N-grams to relations

To estimate Pr(R | p, i, j), we use the Penn Treebank Wall Street Journal (WSJ) corpus, which is labeled with grammatical relations using the Stanford dependency parser (Klein and Manning, 2003).

To estimate the probability Pr(R | p, i, j) of a relation R between a target at position i and a relative at position j in a POS N-gram p, we compute what fraction of the word N-grams g with POS N-gram p have relation R between some target t and relative r at positions i and j:

Pr(R | p, i, j) = freq(g s.t. POS(g) = p ∧ relation(g_i, g_j) = R) / freq(g s.t. POS(g) = p ∧ relation(g_i, g_j))

3.3 Estimating POS N-gram distributions

Given a target and relative, we need to estimate their distribution of POS N-grams and positions.

Figure 1: Overview of PONG.
From the labeled corpus, PONG extracts abstract mappings from POS N-grams to relations.
From the unlabeled corpus, PONG estimates POS N-gram probability given a target and relative.
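A schematic sketch of how the two distributions in Figure 1 could be combined into the relative score Pr(r | t, R) derived in Section 3.1. The table layouts, the example POS pattern, and the counts are invented for illustration; they are not PONG's actual data structures.

```python
def preference_score(R, t, r, rel_given_pattern, pattern_given_pair, freq_pair, freq_t):
    """Relative score for Pr(r | t, R):
    freq(t, r)/freq(t) * sum over (POS N-gram p, positions i, j) of
    Pr(R | p, i, j) * Pr(p | t at i, r at j)."""
    inner = sum(rel_given_pattern.get(key, {}).get(R, 0.0) * prob
                for key, prob in pattern_given_pair.get((t, r), {}).items())
    return freq_pair.get((t, r), 0) / freq_t.get(t, 1) * inner

# Invented toy tables: one POS trigram pattern with the target verb at position 0
# and the relative noun at position 2.
rel_given_pattern = {(("VB", "DT", "NN"), 0, 2): {"dobj": 0.8, "nsubj": 0.05}}
pattern_given_pair = {("celebrate", "birthday"): {(("VB", "DT", "NN"), 0, 2): 0.6}}
freq_pair = {("celebrate", "birthday"): 120}
freq_t = {"celebrate": 5000}
print(preference_score("dobj", "celebrate", "birthday",
                       rel_given_pattern, pattern_given_pair, freq_pair, freq_t))
```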

A labeled corpus is too sparse for this purpose, so we use the much larger unlabeled Google N-grams corpus (Franz and Brants, 2006).

The probability that an N-gram with target t at position i and relative r at position j will have the POS N-gram p is:

Pr(p | t = g_i, r = g_j) = freq(g s.t. POS(g) = p, g_i = t, g_j = r) / freq(g s.t. g_i = t ∧ g_j = r)

To compute this ratio, we first use a well-indexed table to efficiently retrieve all N-grams with words t and r at the specified positions. We then obtain their POS N-grams from the Stanford POS tagger (Toutanova et al., 2003), and count how many of them have the POS N-gram p.

3.4 Reducing POS N-gram sparseness

We abstract word N-grams into POS N-grams to address the sparseness of the labeled corpus, but even the POS N-grams can be sparse. For n=5, the rarer ones occur too sparsely (if at all) in our labeled corpus to estimate their frequency.

To address this issue, we use a coarser POS tag set than the Penn Treebank POS tag set. As Table 2 shows, we merge tags for adjectives, nouns, adverbs, and verbs into four coarser tags.

Coarse     Original
ADJ        JJ, JJR, JJS
ADVERB     RB, RBR, RBS
NOUN       NN, NNS, NNP, NNPS
VERB       VB, VBD, VBG, VBN, VBP, VBZ

Table 2: Coarser POS tag set used in PONG

To gauge the impact of the coarser POS tags, we calculated Pr(r | t, R) for 76 test instances used in an earlier unpublished study by Liu Liu, a former Project LISTEN graduate student. Each instance consists of two randomly chosen words in the WSJ corpus labeled with a grammatical relation. Coarse POS tags increased coverage of this pilot set, that is, the fraction of instances for which PONG computes a probability, from 69% to 92%.

Using the universal tag set (Petrov et al., 2011) as an even coarser tag set is an interesting future direction, especially for other languages. Its smaller size (12 tags vs. our 23) should reduce data sparseness, but increase the risk of over-generalization.

4 Evaluation

To evaluate PONG, we use a standard pseudo-disambiguation task, detailed in Section 4.1. Section 4.2 describes our test set. Section 4.3 lists the metrics we evaluate on this test set. Section 4.4 describes the baselines we compare PONG against on these metrics, and Section 4.5 describes the relations we compare them on. Section 4.6 reports our results. Section 4.7 analyzes sources of error.

4.1 Evaluation task

The pseudo-disambiguation task (Gale et al., 1992; Schutze, 1992) is as follows: given a target word t, a relation R, a relative r, and a random distracter r', prefer either r or r', whichever is likelier to have relation R to word t. This evaluation does not use a threshold: just prefer whichever word is likelier according to the model being evaluated. If the model assigns only one of the words a probability, prefer it, based on the assumption that the unknown probability of the other word is lower. If the model assigns the same probability to both words, or no probability to either word, do not prefer either word.
a former Project LISTEN graduate student. Each

4.2 Test set

As a source of evaluation data, we used the British National Corpus (BNC). As a common test corpus for all the methods we evaluated, we selected one half of BNC by sorting filenames alphabetically and using the odd-numbered files. We used the other half of BNC as a training corpus for the baseline methods we compared PONG to.

A test set for the pseudo-disambiguation task consists of tuples of the form (R, t, r, r'). To construct a test set, we adapted the process used by Rooth et al. (1999) and Erk et al. (2010).

First, we chose 100 (R, t) pairs for each relation R at random from the test corpus. Rooth et al. (1999) and Erk et al. (2010) chose such pairs from a training corpus to ensure that it contained the target t. In contrast, choosing pairs from an unseen test corpus includes target words whether or not they occur in the training corpus.

To obtain a sample stratified by frequency, rather than skewed heavily toward high-frequency pairs, Erk et al. (2010) drew (R, t) pairs from each of five frequency bands in the entire British National Corpus (BNC): 50-100 occurrences; 101-200; 201-500; 500-1000; and more than 1000. However, we use only half of BNC as our test corpus, so to obtain a comparable test set, we drew 20 (R, t) pairs from each of the corresponding frequency bands in that half: 26-50 occurrences; 51-100; 101-250; 251-500; and more than 500.

For each chosen (R, t) pair, we drew a separate (R, t, r) triple from each of six frequency bands: 1-25 occurrences; 26-50; 51-100; 101-250; 251-500; and more than 500. We necessarily omitted frequency bands that contained no such triples. We filtered out triples where r did not have the most frequent part of speech for the relation R. For example, this filter would exclude the triple (dobj, celebrate, the) because a direct object is most frequently a noun, but the is a determiner.

Then, like Erk et al. (2010), we paired the relative r in each (R, t, r) triple with a distracter r' with the same (most frequent) part of speech as the relative r, yielding the test tuple (R, t, r, r'). Rooth et al. (1999) restricted distracter candidates to words with between 30 and 3,000 occurrences in BNC; accordingly, we chose only distracters with between 15 and 1,500 occurrences in our test corpus. We selected r' from these candidates randomly, with probability proportional to their frequency in the test corpus. Like Rooth et al. (1999), we excluded as distracters any actual relatives, i.e. candidates r' where the test corpus contained the triple (R, t, r'). Table 3 shows the resulting number of (R, t, r, r') test tuples for each relation.

Relation R   # tuples for R   # tuples for RT
advmod       121              131
amod         162              128
conj_and     155              151
dobj         145              167
nn           173              158
nsubj        97               124
prep_of      144              153
xcomp        139              140

Table 3: Test set size for each relation

4.3 Metrics

We report four evaluation metrics: precision, coverage, recall, and F-score. Precision (called accuracy in some papers on selectional preferences) is the percentage of all covered tuples where the original relative r is preferred. Coverage is the percentage of tuples for which the model prefers r to r' or vice versa. Recall is the percentage of all tuples where the original relative is preferred, i.e., precision times coverage. F-score is the harmonic mean of precision and recall.

4.4 Baselines

We compare PONG to two baseline methods.

EPP is a state-of-the-art model for which Erk et al. (2010) reported better performance than both Resnik's (1996) WordNet model and Rooth's (1999) EM clustering model. EPP computes selectional preferences using distributional similarity, based on the assumption that relatives are likely to appear in the same contexts as relatives seen in the training corpus. EPP computes the similarity of a potential relative's vector space representation to relatives in the training corpus.

EPP has various options for its vector space representation, similarity measure, weighting scheme, generalization space, and whether to use PCA. In re-implementing EPP, we chose the options that performed best according to Erk et al. (2010), with one exception. To save work, we chose not to use PCA, which Erk et al. (2010) described as performing only slightly better in the dependency-based space.
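The pseudo-disambiguation protocol of Section 4.1 and the four metrics of Section 4.3 can be summarized as follows; score stands for whichever model is being evaluated (PONG, EPP, or DEP) and is assumed, for illustration, to return None when it assigns no probability.

```python
def evaluate(test_tuples, score):
    """test_tuples: (R, t, r, r_prime); score(R, t, x) returns a float or None."""
    covered = correct = 0
    for R, t, r, r2 in test_tuples:
        s1, s2 = score(R, t, r), score(R, t, r2)
        if s1 is None and s2 is None:
            continue                                  # no preference: uncovered
        if s1 is not None and s2 is not None and s1 == s2:
            continue                                  # tie: uncovered
        covered += 1
        prefer_r = s2 is None or (s1 is not None and s1 > s2)
        correct += prefer_r                           # the original relative is r
    precision = correct / covered if covered else 0.0
    coverage = covered / len(test_tuples)
    recall = precision * coverage
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, coverage, recall, f_score

print(evaluate([("dobj", "celebrate", "birthday", "pencil")],
               lambda R, t, x: {"birthday": 0.9, "pencil": 0.1}.get(x)))
```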

Relation Target Relative Description
advmod verb adverb Adverbial modifier
amod noun adjective Adjective modifier
conj_and noun noun Conjunction with and
dobj verb noun Direct object
nn noun noun Noun compound modifier
nsubj verb noun Nominal subject
prep_of noun noun Prepositional modifier
xcomp verb verb Open clausal complement

Table 4: Relations tested in the pseudo-disambiguation experiment.


Relation names and descriptions are from de Marneffe and Manning (2008) except for prep_of.
Target and relative POS are the most frequent POS pairs for the relations in our labeled WSJ corpus.

Precision (%) Coverage (%) Recall (%) F-score (%)


Relation PONG EPP DEP PONG EPP DEP PONG EPP DEP PONG EPP DEP
advmod 78.7 - 98.6 72.1 - 69.2 56.7 - 68.3 65.9 - 80.7
advmodT 89.0 71.0 97.4 69.5 100 59.5 61.8 71.0 58.0 73.0 71.0 72.7
amod 78.8 - 99.0 90.1 - 61.1 71.0 - 60.5 74.7 - 75.1
amodT 84.1 74.0 97.3 83.6 99.2 57.0 70.3 73.4 55.5 76.6 73.7 70.6
conj_and 77.2 74.2 100 73.6 100 52.3 56.8 74.2 52.3 65.4 74.2 68.6
conj_andT 80.5 70.2 97.3 74.8 100 49.7 60.3 70.2 48.3 68.9 70.2 64.6
dobj 87.2 80.0 97.7 80.7 100 60.0 70.3 80.0 58.6 77.9 80.0 73.3
dobjT 89.6 80.2 98.1 92.2 100 64.1 82.6 80.2 62.9 86.0 80.2 76.6
nn 86.7 73.8 97.2 95.3 99.4 63.0 82.7 73.4 61.3 84.6 73.6 75.2
nnT 83.8 79.7 99.0 93.7 100 60.8 78.5 79.7 60.1 81.0 79.7 74.8
nsubj 76.1 77.3 100 69.1 100 42.3 52.6 77.3 42.3 62.2 77.3 59.4
nsubjT 78.5 66.9 95.0 86.3 100 48.4 67.7 66.9 46.0 72.7 66.9 62.0
prep_of 88.4 77.8 98.4 84.0 100 44.4 74.3 77.8 43.8 80.3 77.8 60.6
prep_ofT 79.2 76.5 97.4 81.7 100 50.3 64.7 76.5 49.0 71.2 76.5 65.2
xcomp 84.0 61.9 95.3 85.6 100 61.2 71.9 61.9 58.3 77.5 61.9 72.3
xcompT 86.4 78.6 98.9 89.3 100 63.6 77.1 78.6 62.9 81.5 78.6 76.9
average 83.0 74.4 97.9 82.6 99.9 56.7 68.7 74.4 55.5 75.0 74.4 70.5

Table 5: Coverage, Precision, Recall, and F-score for various relations; RT is the inverse of relation R.
PONG uses POS N-grams, EPP uses distributional similarity, and DEP uses dependency parses.

To score a potential relative r0, EPP uses this formula:

Selpref_{R,t}(r0) = Σ_{r ∈ Seenargs(R,t)} sim(r0, r) · wt_{R,t}(r) / Z_{R,t}

Here sim(r0, r) is the nGCM similarity, defined below, between vector space representations of r0 and a relative r seen in the training data:

sim_nGCM(a, a') = exp( − sqrt( Σ_{i=1}^{n} (a_{b_i}/|a| − a'_{b_i}/|a'|)² ) )

where |a| = sqrt( Σ_{i=1}^{n} a_{b_i}² )

The weight function wt_{R,t}(a) is analogous to inverse document frequency in Information Retrieval.

DEP, our second baseline method, runs the Stanford dependency parser to label the training corpus with grammatical relations, and uses their frequencies to predict selectional preferences. To do the pseudo-disambiguation task, DEP compares the frequencies of (R, t, r) and (R, t, r').

4.5 Relations tested

To test PONG, EPP, and DEP, we chose the most frequent eight relations between content words in the WSJ corpus, which occur over 10,000 times and are described in Table 4. We also tested their inverse relations. However, EPP does not compute selectional preferences for adjectives and adverbs as relatives. For this reason, we did not test EPP on advmod and amod relations with adverbs and adjectives as relatives.

4.6 Experimental results

Table 5 displays results for all 16 relations. To compute statistical significance conservatively in comparing methods, we used paired t-tests with N = 16 relations.

PONG's precision was significantly better than EPP (p<0.001) but worse than DEP (p<0.0001). Still, PONG's high precision validates its underlying assumption that POS N-grams strongly predict grammatical dependencies.

On coverage and recall, EPP beat PONG, which beat DEP (p<0.0001). PONG's F-score was higher, but not significantly, than EPP's (p>0.5) or DEP's (p>0.02).

4.7 Error analysis

In the pseudo-disambiguation task of choosing which of two words is related to a target, PONG makes errors of coverage (preferring neither word) and precision (preferring the wrong word).

Coverage errors, which occurred 17.4% of the time on average, arose only when PONG failed to estimate a probability for either word. PONG fails to score a potential relative r of a target t with a specified relation R if the labeled corpus has no POS N-grams that (a) map to R, (b) contain the POS of t and r, and (c) match Google word N-grams with t and r at those positions. Every relation has at least one POS N-gram that maps to it, so condition (a) never fails. PONG uses the most frequent POS of t and r, and we believe that condition (b) never fails. However, condition (c) can and does fail when t and r do not co-occur in any Google N-grams, at least any that match a POS N-gram that can map to relation R. For example, oversee and diet do not co-occur in any Google N-grams, so PONG cannot score diet as a potential dobj of oversee.

Precision errors, which occur 17% of the time on average, arose when (a) PONG scored the distracter but failed to score the true relative, or (b) scored them both but preferred the distracter. Case (a) accounted for 44.62% of the errors on the covered test tuples.

One likely cause of errors in case (b) is over-generalization when PONG abstracts a word N-gram labeled with a relation by mapping its POS N-gram to that relation. In particular, the coarse POS tag set may discard too much information. Another likely cause of errors is probabilities estimated poorly due to sparse data. The probability of a relation for a POS N-gram rare in the training corpus is likely to be inaccurate. So is the probability of a POS N-gram for rare co-occurrences of a target and relative in Google word N-grams. Using a smaller tag set may reduce the sparse data problem but increase the risk of over-generalization.

5 Relation to Prior Work

In predicting selectional preferences, a key issue is generalization. Our DEP baseline simply counts co-occurrences of target and relative words in a corpus to predict selectional preferences, but only for words seen in the corpus. Prior work, summarized in Table 6, has therefore tried to infer the similarity of unseen relatives to seen relatives. To illustrate, consider the problem of inducing that the direct objects of celebrate tend to be days or events.

Resnik (1996) combined WordNet with a labeled corpus to model the probability that relatives of a predicate belong to a particular conceptual class. This method could notice, for example, that the direct objects of celebrate tend to belong to the conceptual class event. Thus it could prefer anniversary or occasion as the object of celebrate even if unseen in its training corpus. However, this method depends strongly on the WordNet taxonomy.

Rather than use linguistic resources such as WordNet, Rooth et al. (1999) and Schulte im Walde et al. (2008) induced semantically annotated subcategorization frames from unlabeled corpora. They modeled semantic classes as hidden variables, which they estimated using EM-based clustering. Ritter (2010) computed selectional preferences by using unsupervised topic models such as LinkLDA, which infers semantic classes of words automatically instead of requiring a predefined set of classes as input.

The contexts in which a linguistic unit occurs provide information about its meaning. Erk (2007) and Erk et al. (2010) modeled the contexts of a word as the distribution of words that co-occur with it. They calculated the semantic similarity of two words as the similarity of their context distributions according to various measures. Erk et al. (2010) reported the state-of-the-art method we used as our EPP baseline.

In contrast to prior work that explored various solutions to the generalization problem, we don't so much solve this problem as circumvent it. Instead of generalizing from a training corpus directly to unseen words, PONG abstracts a word N-gram to a POS N-gram and maps it to the relations that the word N-gram is labeled with.

Each row of Table 6 lists a reference, its relation(s) to the target, the lexical resource used, the primary (labeled) corpus and information used, the generalization (unlabeled) corpus and information used, and the method.

Resnik, 1996. Relations: verb-object, verb-subject, adjective-noun, modifier-head, head-modifier. Lexical resource: senses in the WordNet noun taxonomy. Labeled corpus: target, relative, and relation in a parsed, partially sense-tagged corpus (Brown corpus). Unlabeled corpus: none. Method: information-theoretic model.

Rooth et al., 1999. Relations: verb-object, verb-subject. Lexical resource: none. Labeled corpus: target, relative, and relation in a parsed corpus (parsed BNC). Unlabeled corpus: none. Method: EM-based clustering.

Ritter, 2010. Relations: verb-subject, verb-object, subject-verb-object. Lexical resource: none. Labeled corpus: subject-verb-object tuples from 500 million web pages. Unlabeled corpus: none. Method: LDA model.

Erk, 2007. Relations: predicate and semantic roles. Lexical resource: none. Labeled corpus: target, relative, and relation in a semantic role labeled corpus (FrameNet). Unlabeled corpus: words and their relations in a parsed corpus (BNC). Method: similarity model based on word co-occurrence.

Erk et al., 2010. Relations: SYN option: verb-subject, verb-object, and their inverse relations; SEM option: verb and semantic roles that have nouns as their headword in a primary corpus, and their inverse relations. Lexical resource: none. Labeled corpus: target, relative, and relation in (SYN option) a parsed corpus (parsed BNC) or (SEM option) a semantic role labeled corpus (FrameNet). Unlabeled corpus, two options: WORDSPACE: an unlabeled corpus (BNC); DEPSPACE: words and their subject and object relations in a parsed corpus (parsed BNC). Method: similarity model using vector space representations of words.

Zhou et al., 2011. Relations: any (relations not distinguished). Lexical resource: none. Labeled corpus: counts of words in the Web or Google N-grams. Unlabeled corpus: none. Method: PMI (Pointwise Mutual Information).

This paper. Relations: all grammatical dependencies in a parsed corpus, and their inverse relations. Lexical resource: none. Labeled corpus: POS N-gram distribution for relations in the parsed WSJ corpus. Unlabeled corpus: POS N-gram distribution for target and relative in Google N-grams. Method: combine both POS N-gram distributions.

Table 6: Comparison with prior methods to compute selectional preferences

To compute selectional preferences, whether the words are in the training corpus or not, PONG applies these abstract mappings to word N-grams in the much larger Google N-grams corpus.

Some prior work on selectional preferences has used POS N-grams and a large unlabeled corpus. The most closely related work we found was by Gormley et al. (2011). They used patterns in POS N-grams to generate test data for their selectional preferences model, but not to infer preferences. Zhou et al. (2011) identified selectional preferences of one word for another

by using Pointwise Mutual Information (PMI) (Fano, 1961) to check whether they co-occur more frequently in a large corpus than predicted by their unigram frequencies. However, their method did not distinguish among different relations.

6 Conclusion

This paper describes, derives, and evaluates PONG, a novel probabilistic model of selectional preferences. PONG uses a labeled corpus to map POS N-grams to grammatical relations. It combines this mapping with probabilities estimated from a much larger POS-tagged but unlabeled Google N-grams corpus.

We tested PONG on the eight most common relations in the WSJ corpus, and their inverses, more relations than evaluated in prior work. Compared to the state-of-the-art EPP baseline (Erk et al., 2010), PONG averaged higher precision but lower coverage and recall. Compared to the DEP baseline, PONG averaged lower precision but higher coverage and recall. All these differences were substantial (p < 0.001). Compared to both baselines, PONG's average F-score was higher, though not significantly.

Some directions for future work include: First, improve PONG by incorporating models of lexical similarity explored in prior work. Second, use the universal tag set to extend PONG to other languages, or to perform better in English. Third, in place of grammatical relations, use rich, diverse semantic roles, while avoiding sparsity. Finally, use selectional preferences to teach word connotations by using various relations to generate example sentences or useful questions.

Acknowledgments

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080157. The opinions expressed are those of the authors and do not necessarily represent the views of the Institute or the U.S. Department of Education. We thank the helpful reviewers and Katrin Erk for her generous assistance.

References

de Marneffe, M.-C. and Manning, C.D. 2008. Stanford Typed Dependencies Manual. http://nlp.stanford.edu/software/dependencies_manual.pdf, Stanford University, Stanford, CA.

Erk, K. 2007. A Simple, Similarity-Based Model for Selectional Preferences. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, June 2007, 216-223.

Erk, K., Padó, S. and Padó, U. 2010. A Flexible, Corpus-Driven Model of Regular and Inverse Selectional Preferences. Computational Linguistics 36(4), 723-763.

Fano, R. 1961. Transmission of Information: A Statistical Theory of Communications. MIT Press, Cambridge, MA.

Franz, A. and Brants, T. 2006. All Our N-Gram Are Belong to You.

Gale, W.A., Church, K.W. and Yarowsky, D. 1992. Work on Statistical Methods for Word Sense Disambiguation. In Proceedings of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language, Cambridge, MA, October 23-25, 1992, 54-60.

Gildea, D. and Jurafsky, D. 2002. Automatic Labeling of Semantic Roles. Computational Linguistics 28(3), 245-288.

Gormley, M.R., Dredze, M., Durme, B.V. and Eisner, J. 2011. Shared Components Topic Models with Application to Selectional Preference. NIPS Workshop on Learning Semantics, Sierra Nevada, Spain.

Schulte im Walde, S., Hying, C., Scheible, C. and Schmid, H. 2008. Combining EM Training and the MDL Principle for an Automatic Verb Classification Incorporating Selectional Preferences. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, OH, 2008, 496-504.

Klein, D. and Manning, C.D. 2003. Accurate Unlexicalized Parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 7-12, 2003, E.W. Hinrichs and D. Roth, Eds.

Petrov, S., Das, D. and McDonald, R.T. 2011. A Universal Part-of-Speech Tagset. ArXiv 1104.2086.

Resnik, P. 1996. Selectional Constraints: An Information-Theoretic Model and Its Computational Realization. Cognition 61, 127-159.

Resnik, P. 1997. Selectional Preference and Sense Disambiguation. In ACL SIGLEX Workshop on

Tagging Text with Lexical Semantics: Why, What,
and How, Washington, DC, April 4-5, 1997, 52-57.

Ritter, A., Mausam and Etzioni, O. 2010. A Latent


Dirichlet Allocation Method for Selectional
Preferences. In Proceedings of the 48th Annual
Meeting of the Association for Computational
Linguistics, Uppsala, Sweden, 2010, 424-434.

Rooth, M., Riezler, S., Prescher, D., Carroll, G. and


Beil, F. 1999. Inducing a Semantically Annotated
Lexicon Via Em-Based Clustering. In Proceedings
of the 37th Annual Meeting of the Association for
Computational Linguistics on Computational
Linguistics, College Park, MD, 1999, Association
for Computational Linguistics, 104-111.

Schutze, H. 1992. Context Space. In Proceedings of


the AAAI Fall Symposium on Intelligent
Probabilistic Approaches to Natural Language,
Cambridge, MA, 1992, 113-120.

Toutanova, K., Klein, D., Manning, C. and Singer, Y.


2003. Feature-Rich Part-of-Speech Tagging with a
Cyclic Dependency Network. In Proceedings of the
Human Language Technology Conference and
Annual Meeting of the North American Chapter of
the Association for Computational Linguistics
(HLT-NAACL), Edmonton, Canada, 2003, 252
259.

Zhou, G., Zhao, J., Liu, K. and Cai, L. 2011.


Exploiting Web-Derived Selectional Preference to
Improve Statistical Dependency Parsing. In
Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics,
Portland, OR, 2011, 15561565.

WebCAGe A Web-Harvested Corpus Annotated with GermaNet Senses

Verena Henrich, Erhard Hinrichs, and Tatiana Vodolazova


University of Tübingen
Department of Linguistics
{firstname.lastname}@uni-tuebingen.de

Abstract

This paper describes an automatic method for creating a domain-independent sense-annotated corpus harvested from the web. As a proof of concept, this method has been applied to German, a language for which sense-annotated corpora are still in short supply. The sense inventory is taken from the German wordnet GermaNet. The web-harvesting relies on an existing mapping of GermaNet to the German version of the web-based dictionary Wiktionary. The data obtained by this method constitute WebCAGe (short for: Web-Harvested Corpus Annotated with GermaNet Senses), a resource which currently represents the largest sense-annotated corpus available for German. While the present paper focuses on one particular language, the method as such is language-independent.

1 Motivation

The availability of large sense-annotated corpora is a necessary prerequisite for any supervised and many semi-supervised approaches to word sense disambiguation (WSD). There has been steady progress in the development and in the performance of WSD algorithms for languages such as English for which hand-crafted sense-annotated corpora have been available (Agirre et al., 2007; Erk and Strapparava, 2012; Mihalcea et al., 2004), while WSD research for languages that lack these corpora has lagged behind considerably or has been impossible altogether.

1 http://www.wiktionary.org/

Thus far, sense-annotated corpora have typically been constructed manually, making the creation of such resources expensive and the compilation of larger data sets difficult, if not completely infeasible. It is therefore timely and appropriate to explore alternatives to manual annotation and to investigate automatic means of creating sense-annotated corpora. Ideally, any automatic method should satisfy the following criteria:

(1) The method used should be language independent and should be applicable to as many languages as possible for which the necessary input resources are available.

(2) The quality of the automatically generated data should be extremely high so as to be usable as is or with a minimal amount of manual post-correction.

(3) The resulting sense-annotated materials (i) should be non-trivial in size and should be dynamically expandable, (ii) should not be restricted to a narrow subject domain, but be as domain-independent as possible, and (iii) should be freely available for other researchers.

The method presented below satisfies all of the above criteria and relies on the following resources as input: (i) a sense inventory and (ii) a mapping between the sense inventory in question and a web-based resource such as Wiktionary1 or

387
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 387–396,
Avignon, France, April 23 - 27 2012. ©2012 Association for Computational Linguistics
Wikipedia².

As a proof of concept, this automatic method has been applied to German, a language for which sense-annotated corpora are still in short supply and fail to satisfy most if not all of the criteria under (3) above. While the present paper focuses on one particular language, the method as such is language-independent. In the case of German, the sense inventory is taken from the German wordnet GermaNet³ (Henrich and Hinrichs, 2010; Kunze and Lemnitzer, 2002). The web-harvesting relies on an existing mapping of GermaNet to the German version of the web-based dictionary Wiktionary. This mapping is described in Henrich et al. (2011). The resulting resource consists of a web-harvested corpus WebCAGe (short for: Web-Harvested Corpus Annotated with GermaNet Senses), which is freely available at: http://www.sfs.uni-tuebingen.de/en/webcage.shtml

The remainder of this paper is structured as follows: Section 2 provides a brief overview of the resources GermaNet and Wiktionary. Section 3 introduces the mapping of GermaNet to Wiktionary and how this mapping can be used to automatically harvest sense-annotated materials from the web. The algorithm for identifying the target words in the harvested texts is described in Section 4. In Section 5, the approach of automatically creating a web-harvested corpus annotated with GermaNet senses is evaluated and compared to existing sense-annotated corpora for German. Related work is discussed in Section 6, together with concluding remarks and an outlook on future work.

2 Resources

2.1 GermaNet

GermaNet (Henrich and Hinrichs, 2010; Kunze and Lemnitzer, 2002) is a lexical semantic network that is modeled after the Princeton WordNet for English (Fellbaum, 1998). It partitions the lexical space into a set of concepts that are interlinked by semantic relations. A semantic concept is represented as a synset, i.e., as a set of words whose individual members (referred to as lexical units) are taken to be (near) synonyms. Thus, a synset is a set-representation of the semantic relation of synonymy.

There are two types of semantic relations in GermaNet. Conceptual relations hold between two semantic concepts, i.e. synsets. They include relations such as hypernymy, part-whole relations, entailment, or causation. Lexical relations hold between two individual lexical units. Antonymy, a pair of opposites, is an example of a lexical relation.

GermaNet covers the three word categories of adjectives, nouns, and verbs, each of which is hierarchically structured in terms of the hypernymy relation of synsets. The development of GermaNet started in 1997, and is still in progress. GermaNet's version 6.0 (release of April 2011) contains 93407 lexical units, which are grouped into 69594 synsets.

2.2 Wiktionary

Wiktionary is a web-based dictionary that is available for many languages, including German. As is the case for its sister project Wikipedia, it is written collaboratively by volunteers and is freely available⁴. The dictionary provides information such as part-of-speech, hyphenation, possible translations, inflection, etc. for each word. It includes, among others, the same three word classes of adjectives, nouns, and verbs that are also available in GermaNet. Distinct word senses are distinguished by sense descriptions and accompanied with example sentences illustrating the sense in question.

Further, Wiktionary provides relations to other words, e.g., in the form of synonyms, antonyms, hypernyms, hyponyms, holonyms, and meronyms. In contrast to GermaNet, the relations are (mostly) not disambiguated.

For the present project, a dump of the German Wiktionary as of February 2, 2011 is uti-

² http://www.wikipedia.org/
³ Using a wordnet as the gold standard for the sense inventory is fully in line with standard practice for English where the Princeton WordNet (Fellbaum, 1998) is typically taken as the gold standard.
⁴ Wiktionary is available under the Creative Commons Attribution/Share-Alike license http://creativecommons.org/licenses/by-sa/3.0/deed.en
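To make the structure described in Section 2.1 concrete, the following minimal Python sketch models synsets, lexical units, and the two relation types. The class and field names are illustrative only and are not taken from the GermaNet API or data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class LexicalUnit:
    """One word sense; (near) synonyms share a synset."""
    lu_id: str
    lemma: str
    category: str  # ADJ, NN, or VB
    lexical_relations: Dict[str, List["LexicalUnit"]] = field(default_factory=dict)

@dataclass
class Synset:
    """A set of lexical units plus conceptual relations to other synsets."""
    synset_id: str
    lexical_units: List[LexicalUnit]
    conceptual_relations: Dict[str, List["Synset"]] = field(default_factory=dict)

# Antonymy is a lexical relation between lexical units,
# hypernymy a conceptual relation between synsets.
gross = LexicalUnit("l1", "groß", "ADJ")
klein = LexicalUnit("l2", "klein", "ADJ")
gross.lexical_relations["antonym"] = [klein]
size_adj = Synset("s1", [gross])
```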

388
Figure 1: Sense mapping of GermaNet and Wiktionary using the example of Bogen.

lized, consisting of 46457 German words comprising 70339 word senses. The Wiktionary data was extracted by the freely available Java-based library JWKTL⁵.

3 Creation of a Web-Harvested Corpus

The starting point for creating WebCAGe is an existing mapping of GermaNet senses with Wiktionary sense definitions as described in Henrich et al. (2011). This mapping is the result of a two-stage process: i) an automatic word overlap alignment algorithm in order to match GermaNet senses with Wiktionary sense descriptions, and ii) a manual post-correction step of the automatic alignment. Manual post-correction can be kept at a reasonable level of effort due to the high accuracy (93.8%) of the automatic alignment.

The original purpose of this mapping was to automatically add Wiktionary sense descriptions to GermaNet. However, the alignment of these two resources opens up a much wider range of possibilities for data mining community-driven resources such as Wikipedia and web-generated content more generally. It is precisely this potential that is fully exploited for the creation of the WebCAGe sense-annotated corpus.

Fig. 1 illustrates the existing GermaNet-Wiktionary mapping using the example word Bogen. The polysemous word Bogen has three distinct senses in GermaNet which directly correspond to three separate senses in Wiktionary⁶. Each Wiktionary sense entry contains a definition and one or more example sentences illustrating the sense in question. The examples in turn are often linked to external references, including sentences contained in the German Gutenberg text archive⁷ (see link in the topmost Wiktionary sense entry in Fig. 1), Wikipedia articles (see link for the third Wiktionary sense entry in Fig. 1), and other textual sources (see the second sense entry in Fig. 1). It is precisely this collection of

⁵ http://www.ukp.tu-darmstadt.de/software/jwktl
⁶ Note that there are further senses in both resources not displayed here for reasons of space.
⁷ http://gutenberg.spiegel.de/
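The first, automatic stage of the mapping described above is a word-overlap alignment between GermaNet sense descriptions and Wiktionary glosses. The sketch below only illustrates the general idea of overlap-based matching; it is not the alignment algorithm of Henrich et al. (2011), the function names are hypothetical, and the toy glosses are invented.

```python
def word_overlap(a: str, b: str) -> int:
    """Count shared lowercase word tokens between two sense descriptions."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def align_sense(germanet_description: str, wiktionary_glosses: list) -> int:
    """Return the index of the Wiktionary gloss with the largest word overlap.

    A realistic system would add lemmatization, stopword filtering, and a
    minimum-overlap threshold before accepting an alignment.
    """
    scores = [word_overlap(germanet_description, g) for g in wiktionary_glosses]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example with invented glosses for the word "Bogen":
glosses = ["gebogene Linie, Kurve",
           "Teil eines Streichinstruments",
           "Waffe, mit der Pfeile geschossen werden"]
print(align_sense("Waffe zum Schiessen von Pfeilen", glosses))
```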

389
Figure 2: Sense mapping of GermaNet and Wiktionary using the example of Archiv.

heterogeneous material that can be harvested for the purpose of compiling a sense-annotated corpus. Since the target word (rendered in Fig. 1 in bold face) in the example sentences for a particular Wiktionary sense is linked to a GermaNet sense via the sense mapping of GermaNet with Wiktionary, the example sentences are automatically sense-annotated and can be included as part of WebCAGe.

Additional material for WebCAGe is harvested by following the links to Wikipedia, the Gutenberg archive, and other web-based materials. The external webpages and the Gutenberg texts are obtained from the web by a web-crawler that takes some URLs as input and outputs the texts of the corresponding web sites. The Wikipedia articles are obtained by the open-source Java Wikipedia Library JWPL⁸. Since the links to Wikipedia, the Gutenberg archive, and other web-based materials also belong to particular Wiktionary sense entries that in turn are mapped to GermaNet senses, the target words contained in these materials are automatically sense-annotated.

Notice that the target word often occurs more than once in a given text. In keeping with the widely used heuristic of one sense per discourse, multiple occurrences of a target word in a given text are all assigned to the same GermaNet sense. An inspection of the annotated data shows that this heuristic has proven to be highly reliable in practice. It is correct in 99.96% of all target word occurrences in the Wiktionary example sentences, in 96.75% of all occurrences in the external webpages, and in 95.62% of the Wikipedia files.

WebCAGe is developed primarily for the purpose of the word sense disambiguation task. Therefore, only those target words that are genuinely ambiguous are included in this resource. Since WebCAGe uses GermaNet as its sense inventory, this means that each target word has at least two GermaNet senses, i.e., belongs to at least two distinct synsets.

The GermaNet-Wiktionary mapping is not always one-to-one. Sometimes one GermaNet sense is mapped to more than one sense in Wiktionary. Fig. 2 illustrates such a case. For the word Archiv each resource records three distinct senses. The first sense (data repository)

⁸ http://www.ukp.tu-darmstadt.de/software/jwpl/
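A minimal sketch of the harvesting step just described, assuming a mapping from Wiktionary sense entries to GermaNet lexical-unit IDs. The fetch_text() helper and the input format are hypothetical simplifications; the actual WebCAGe pipeline uses a dedicated web-crawler and the JWPL library.

```python
import urllib.request

def fetch_text(url: str) -> str:
    """Naive page fetch; a real pipeline extracts clean text from the HTML."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def harvest(sense_entries):
    """sense_entries: iterable of (germanet_lu_id, example_sentences, external_links)."""
    corpus = []
    for lu_id, examples, links in sense_entries:
        # Wiktionary example sentences are sense-annotated directly.
        corpus.extend((lu_id, sentence) for sentence in examples)
        # Texts behind external links inherit the sense of the linking entry;
        # following one sense per discourse, every occurrence of the target
        # word in such a text is later tagged with the same lu_id.
        for url in links:
            corpus.append((lu_id, fetch_text(url)))
    return corpus
```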

390
in GermaNet corresponds to the first sense in Wiktionary, and the second sense in GermaNet (archive) corresponds to both the second and third senses in Wiktionary. The third sense in GermaNet (archived file) does not map onto any sense in Wiktionary at all. As a result, the word Archiv is included in the WebCAGe resource with precisely the sense mappings connected by the arrows shown in Fig. 2. The fact that the second GermaNet sense corresponds to two sense descriptions in Wiktionary simply means that the target words in the example are both annotated by the same sense. Furthermore, note that the word Archiv is still genuinely ambiguous since there is a second (one-to-one) mapping between the first senses recorded in GermaNet and Wiktionary, respectively. However, since the third GermaNet sense is not mapped onto any Wiktionary sense at all, WebCAGe will not contain any example sentences for this particular GermaNet sense.

The following section describes how the target words within these textual materials can be automatically identified.

4 Automatic Detection of Target Words

For highly inflected languages such as German, target word identification is more complex compared to languages with an impoverished inflectional morphology, such as English, and thus requires automatic lemmatization. Moreover, the target word in a text to be sense-annotated is not always a simplex word but can also appear as subpart of a complex word such as a compound. Since the constituent parts of a compound are not usually separated by blank spaces or hyphens, German compounding poses a particular challenge for target word identification. Another challenging case for automatic target word detection in German concerns particle verbs such as ankündigen 'announce'. Here, the difficulty arises when the verbal stem (e.g., kündigen) is separated from its particle (e.g., an) in German verb-initial and verb-second clause types.

As a preprocessing step for target word identification, the text is split into individual sentences, tokenized, and lemmatized. For this purpose, the sentence detector and the tokenizer of the suite of Apache OpenNLP tools⁹ and the TreeTagger (Schmid, 1994) are used. Further, compounds are split by using BananaSplit¹⁰. Since the automatic lemmatization obtained by the tagger and the compound splitter are not 100% accurate, target word identification also utilizes the full set of inflected forms for a target word whenever such information is available. As it turns out, Wiktionary can often be used for this purpose as well since the German version of Wiktionary often contains the full set of word forms in tables¹¹ such as the one shown in Fig. 3 for the word Bogen.

Figure 3: Wiktionary inflection table for Bogen.

Fig. 4 shows an example of such a sense-annotated text for the target word Bogen 'violin bow'. The text is an excerpt from the Wikipedia article Violine 'violin', where the target word (rendered in bold face) appears many times. Only the second occurrence shown in the figure (marked with a 2 on the left) exactly matches the word Bogen as is. All other occurrences are either the plural form Bögen (4 and 7), the genitive form Bogens (8), part of a compound such as Bogenstange (3), or the plural form as part of a compound such as in Fernambukbögen and Schülerbögen (5 and 6). The first occurrence of the target word in Fig. 4 is also part of a compound. Here, the target word occurs in the singular as part of the adjectival compound bogengestrichenen.

For expository purposes, the data format shown in Fig. 4 is much simplified compared to the actual, XML-based format in WebCAGe. The infor-

⁹ http://incubator.apache.org/opennlp/
¹⁰ http://niels.drni.de/s9y/pages/bananasplit.html
¹¹ The inflection table cannot be extracted with the Java Wikipedia Library JWPL. It is rather extracted from the Wiktionary dump file.
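The following sketch illustrates the kind of matching that the target-word identification described above has to perform: an exact match, a match of any known inflected form (e.g. from a Wiktionary inflection table), or a match as part of a compound. It is a deliberate simplification (no lemmatizer or compound splitter is called, and compound membership is approximated by substring matching); the helper names are illustrative.

```python
def find_target_occurrences(tokens, lemma, inflected_forms):
    """Return indices of tokens that realize the target word."""
    forms = {lemma.lower()} | {f.lower() for f in inflected_forms}
    hits = []
    for i, token in enumerate(tokens):
        t = token.lower().strip('.,;:!?"')
        # exact form match, or a known form embedded in a longer compound
        if t in forms or any(f in t for f in forms if len(f) > 3):
            hits.append(i)
    return hits

tokens = "Der Bogen wird mit Fernambukbögen und Schülerbögen verglichen".split()
print(find_target_occurrences(tokens, "Bogen", ["Bögen", "Bogens"]))  # [1, 4, 6]
```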

391
Figure 4: Excerpt from Wikipedia article Violine 'violin' tagged with target word Bogen 'violin bow'.

mation for each occurrence of a target word consists of the GermaNet sense, i.e., the lexical unit ID, the lemma of the target word, and the GermaNet word category information, i.e., ADJ for adjectives, NN for nouns, and VB for verbs.

5 Evaluation

In order to assess the effectiveness of the approach, we examine the overall size of WebCAGe and the relative size of the different text collections (see Table 1), compare WebCAGe to other sense-annotated corpora for German (see Table 2), and present a precision- and recall-based evaluation of the algorithm that is used for automatically identifying target words in the harvested texts (see Table 3).

Table 1 shows that Wiktionary (7644 tagged word tokens) and Wikipedia (1732) contribute by far the largest subsets of the total number of tagged word tokens (10750) compared with the external webpages (589) and the Gutenberg texts (785). These tokens belong to 2607 distinct polysemous words contained in GermaNet, among which there are 211 adjectives, 1499 nouns, and 897 verbs (see Table 2). On average, these words have 2.9 senses in GermaNet (2.4 for adjectives, 2.6 for nouns, and 3.6 for verbs).

Table 2 also shows that WebCAGe is considerably larger than the other two sense-annotated corpora available for German ((Broscheit et al., 2010) and (Raileanu et al., 2002)). It is important to keep in mind, though, that the other two resources were manually constructed, whereas WebCAGe is the result of an automatic harvesting method. Such an automatic method will only constitute a viable alternative to the labor-intensive manual method if the results are of sufficient quality so that the harvested data set can be used as is or can be further improved with a minimal amount of manual post-editing.

For the purpose of the present evaluation, we conducted a precision- and recall-based analysis for the text types of Wiktionary examples, external webpages, and Wikipedia articles sep-

392
Table 1: Current size of WebCAGe.

                                       Wiktionary   External    Wikipedia   Gutenberg   All
                                       examples     webpages    articles    texts       texts
Number of tagged   adjectives          575          31          79          28          713
word tokens        nouns               4103         446         1643        655         6847
                   verbs               2966         112         10          102         3190
                   all word classes    7644         589         1732        785         10750
Number of tagged   adjectives          565          31          76          26          698
sentences          nouns               3965         420         1404        624         6413
                   verbs               2945         112         10          102         3169
                   all word classes    7475         563         1490        752         10280
Total number of    adjectives          623          1297        430         65030       67380
sentences          nouns               4184         9630        6851        376159      396824
                   verbs               3087         5285        263         146755      155390
                   all word classes    7894         16212       7544        587944      619594

Table 2: Comparing WebCAGe to other sense-tagged corpora of German.

                                       WebCAGe   Broscheit et al., 2010   Raileanu et al., 2002
Sense-tagged       adjectives          211       6                        0
words              nouns               1499      18                       25
                   verbs               897       16                       0
                   all word classes    2607      40                       25
Number of tagged word tokens           10750     approx. 800              2421
Domain independent                     yes       yes                      medical domain

arately for the three word classes of adjectives, nouns, and verbs. Table 3 shows that precision and recall for all three word classes that occur for Wiktionary examples, external webpages, and Wikipedia articles lies above 92%. The only sizeable deviations are the results for verbs that occur in the Gutenberg texts. Apart from this one exception, the results in Table 3 prove the viability of the proposed method for automatic harvesting of sense-annotated data. The average precision for all three word classes is of sufficient quality to be used as-is if approximately 2-5% noise in the annotated data is acceptable. In order to eliminate such noise, manual post-editing is required. However, such post-editing is within acceptable limits: it took an experienced research assistant a total of 25 hours to hand-correct all the occurrences of sense-annotated target words and to manually sense-tag any missing target words for the four text types.

6 Related Work and Future Directions

With relatively few exceptions to be discussed shortly, the construction of sense-annotated corpora has focussed on purely manual methods. This is true for SemCor, the WordNet Gloss Corpus, and for the training sets constructed for English as part of the SensEval and SemEval shared task competitions (Agirre et al., 2007; Erk and Strapparava, 2012; Mihalcea et al., 2004). Purely manual methods were also used for the German sense-annotated corpora constructed by Broscheit et al. (2010) and Raileanu et al. (2002) as well as for other languages including the Bulgarian and

393
Table 3: Evaluation of the algorithm of identifying the target words.

                                 Wiktionary   External    Wikipedia   Gutenberg
                                 examples     webpages    articles    texts
Precision   adjectives           97.70%       95.83%      99.34%      100%
            nouns                98.17%       98.50%      95.87%      92.19%
            verbs                97.38%       92.26%      100%        69.87%
            all word classes     97.32%       96.19%      96.26%      87.43%
Recall      adjectives           97.70%       97.22%      98.08%      97.14%
            nouns                98.30%       96.03%      92.70%      97.38%
            verbs                97.51%       99.60%      100%        89.20%
            all word classes     97.94%       97.32%      93.36%      95.42%
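The precision and recall figures in Table 3 are computed per text type and word class in the standard way, i.e. by comparing the automatically identified target-word occurrences with a gold standard. A generic sketch (variable names are illustrative, not the evaluation scripts used for the paper):

```python
def precision_recall(predicted, gold):
    """predicted, gold: sets of (document_id, token_position) target-word hits."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

pred = {("doc1", 3), ("doc1", 7), ("doc2", 1)}
gold = {("doc1", 3), ("doc2", 1), ("doc2", 9)}
print(precision_recall(pred, gold))  # (0.666..., 0.666...)
```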

the Chinese sense-tagged corpora (Koeva et al., 2006; Wu et al., 2006). The only previous attempts of harvesting corpus data for the purpose of constructing a sense-annotated corpus are the semi-supervised method developed by Yarowsky (1995), the knowledge-based approach of Leacock et al. (1998), later also used by Agirre and Lopez de Lacalle (2004), and the automatic association of Web directories (from the Open Directory Project, ODP) to WordNet senses by Santamaría et al. (2003).

The latter study (Santamaría et al., 2003) is closest in spirit to the approach presented here. It also relies on an automatic mapping between wordnet senses and a second web resource. While our approach is based on automatic mappings between GermaNet and Wiktionary, their mapping algorithm maps WordNet senses to ODP subdirectories. Since these ODP subdirectories contain natural language descriptions of websites relevant to the subdirectory in question, this textual material can be used for harvesting sense-specific examples. The ODP project also covers German so that, in principle, this harvesting method could be applied to German in order to collect additional sense-tagged data for WebCAGe.

The approach of Yarowsky (1995) first collects all example sentences that contain a polysemous word from a very large corpus. In a second step, a small number of examples that are representative for each of the senses of the polysemous target word is selected from the large corpus from step 1. These representative examples are manually sense-annotated and then fed into a decision-list supervised WSD algorithm as a seed set for iteratively disambiguating the remaining examples collected in step 1. The selection and annotation of the representative examples in Yarowsky's approach is performed completely manually and is therefore limited to the amount of data that can reasonably be annotated by hand.

Leacock et al. (1998), Agirre and Lopez de Lacalle (2004), and Mihalcea and Moldovan (1999) propose a set of methods for automatic harvesting of web data for the purposes of creating sense-annotated corpora. By focusing on web-based data, their work resembles the research described in the present paper. However, the underlying harvesting methods differ. While our approach relies on a wordnet to Wiktionary mapping, their approaches all rely on the monosemous relative heuristic. Their heuristic works as follows: In order to harvest corpus examples for a polysemous word, the WordNet relations such as synonymy and hypernymy are inspected for the presence of unambiguous words, i.e., words that only appear in exactly one synset. The examples found for these monosemous relatives can then be sense-annotated with the particular sense of its ambiguous word relative. In order to increase coverage of the monosemous relatives approach, Mihalcea and Moldovan (1999) have developed a gloss-based extension, which relies on word overlap of the gloss and the WordNet sense in question for all those cases where a monosemous relative is not contained in the WordNet dataset.

The approaches of Leacock et al., Agirre and Lopez de Lacalle, and Mihalcea and Moldovan as
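As an illustration of the monosemous-relatives heuristic described above, the sketch below uses the Princeton WordNet via NLTK (rather than GermaNet) to collect unambiguous synonyms and hypernym members for each sense of a polysemous word; sentences containing such relatives could then be labeled with the corresponding sense. This is an illustration of the general idea only, not the procedure of Leacock et al. or Mihalcea and Moldovan.

```python
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data

def monosemous_relatives(word, pos=wn.NOUN):
    """Map each sense of `word` to related words that have only one sense."""
    relatives = {}
    for synset in wn.synsets(word, pos=pos):
        candidates = set(synset.lemma_names())
        for hyper in synset.hypernyms():
            candidates.update(hyper.lemma_names())
        relatives[synset.name()] = sorted(
            c for c in candidates
            if c.lower() != word.lower() and len(wn.synsets(c, pos=pos)) == 1
        )
    return relatives

print(monosemous_relatives("bow"))
```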

394
well as Yarowsky's approach provide interesting directions for further enhancing the WebCAGe resource. It would be worthwhile to use the automatically harvested sense-annotated examples as the seed set for Yarowsky's iterative method for creating a large sense-annotated corpus. Another fruitful direction for further automatic expansion of WebCAGe is to use the heuristic of monosemous relatives used by Leacock et al., by Agirre and Lopez de Lacalle, and by Mihalcea and Moldovan. However, we have to leave these matters for future research.

In order to validate the language independence of our approach, we plan to apply our method to sense inventories for languages other than German. A precondition for such an experiment is an existing mapping between the sense inventory in question and a web-based resource such as Wiktionary or Wikipedia. With BabelNet, Navigli and Ponzetto (2010) have created a multilingual resource that allows the testing of our approach to languages other than German. As a first step in this direction, we applied our approach to English using the mapping between the Princeton WordNet and the English version of Wiktionary provided by Meyer and Gurevych (2011). The results of these experiments, which are reported in Henrich et al. (2012), confirm the general applicability of our approach.

To conclude: This paper describes an automatic method for creating a domain-independent sense-annotated corpus harvested from the web. The data obtained by this method for German have resulted in the WebCAGe resource which currently represents the largest sense-annotated corpus available for this language. The publication of this paper is accompanied by making WebCAGe freely available.

Acknowledgements

The research reported in this paper was jointly funded by the SFB 833 grant of the DFG and by the CLARIN-D grant of the BMBF. We would like to thank Christina Hoppermann, Marie Hinrichs as well as three anonymous EACL 2012 reviewers for their helpful comments on earlier versions of this paper. We are very grateful to Reinhild Barkey, Sarah Schulz, and Johannes Wahle for their help with the evaluation reported in Section 5. Special thanks go to Yana Panchenko and Yannick Versley for their support with the web-crawler and to Emanuel Dima and Klaus Suttner for helping us to obtain the Gutenberg and Wikipedia texts.

References

Agirre, E., Lopez de Lacalle, O. 2004. Publicly available topic signatures for all WordNet nominal senses. Proceedings of the 4th International Conference on Languages Resources and Evaluations (LREC'04), Lisbon, Portugal, pp. 1123–1126

Agirre, E., Màrquez, L., Wicentowski, R. 2007. Proceedings of the 4th International Workshop on Semantic Evaluations. Assoc. for Computational Linguistics, Stroudsburg, PA, USA

Broscheit, S., Frank, A., Jehle, D., Ponzetto, S. P., Rehl, D., Summa, A., Suttner, K., Vola, S. 2010. Rapid bootstrapping of Word Sense Disambiguation resources for German. Proceedings of the 10. Konferenz zur Verarbeitung Natürlicher Sprache, Saarbrücken, Germany, pp. 19–27

Erk, K., Strapparava, C. 2010. Proceedings of the 5th International Workshop on Semantic Evaluation. Assoc. for Computational Linguistics, Stroudsburg, PA, USA

Fellbaum, C. (ed.). 1998. WordNet – An Electronic Lexical Database. The MIT Press.

Henrich, V., Hinrichs, E. 2010. GernEdiT – The GermaNet Editing Tool. Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, pp. 2228–2235

Henrich, V., Hinrichs, E., Vodolazova, T. 2011. Semi-Automatic Extension of GermaNet with Sense Definitions from Wiktionary. Proceedings of the 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC'11), Poznan, Poland, pp. 126–130

Henrich, V., Hinrichs, E., Vodolazova, T. 2012. An Automatic Method for Creating a Sense-Annotated Corpus Harvested from the Web. Poster presented at 13th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2012), New Delhi, India, March 2012

Koeva, S., Leseva, S., Todorova, M. 2006. Bulgarian Sense Tagged Corpus. Proceedings of the 5th SALTMIL Workshop on Minority Languages:

395
Strategies for Developing Machine Translation for Minority Languages, Genoa, Italy, pp. 79–87

Kunze, C., Lemnitzer, L. 2002. GermaNet – representation, visualization, application. Proceedings of the 3rd International Language Resources and Evaluation (LREC'02), Las Palmas, Canary Islands, pp. 1485–1491

Leacock, C., Chodorow, M., Miller, G. A. 1998. Using corpus statistics and wordnet relations for sense identification. Computational Linguistics, 24(1):147–165

Meyer, C. M., Gurevych, I. 2011. What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), Chiang Mai, Thailand, pp. 883–892

Mihalcea, R., Moldovan, D. 1999. An Automatic Method for Generating Sense Tagged Corpora. Proceedings of the American Association for Artificial Intelligence (AAAI'99), Orlando, Florida, pp. 461–466

Mihalcea, R., Chklovski, T., Kilgarriff, A. 2004. Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain

Navigli, R., Ponzetto, S. P. 2010. BabelNet: Building a Very Large Multilingual Semantic Network. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL'10), Uppsala, Sweden, pp. 216–225

Raileanu, D., Buitelaar, P., Vintar, S., Bay, J. 2002. Evaluation Corpora for Sense Disambiguation in the Medical Domain. Proceedings of the 3rd International Language Resources and Evaluation (LREC'02), Las Palmas, Canary Islands, pp. 609–612

Santamaría, C., Gonzalo, J., Verdejo, F. 2003. Automatic Association of Web Directories to Word Senses. Computational Linguistics 29 (3), MIT Press, pp. 485–502

Schmid, H. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK

Wu, Y., Jin, P., Zhang, Y., Yu, S. 2006. A Chinese Corpus with Word Sense Annotation. Proceedings of 21st International Conference on Computer Processing of Oriental Languages (ICCPOL'06), Singapore, pp. 414–421

Yarowsky, D. 1995. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (ACL'95), Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 189–196

396
Learning to Behave by Reading

Regina Barzilay
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
regina@csail.mit.edu

Abstract

In this talk, I will address the problem of grounding linguistic analysis in control applications, such
as game playing and robot navigation. We assume access to natural language documents that describe
the desired behavior of a control algorithm (e.g., game strategy guides). Our goal is to demonstrate
that knowledge automatically extracted from such documents can dramatically improve performance
of the target application. First, I will present a reinforcement learning algorithm for learning to map
natural language instructions to executable actions. This technique has enabled automation of tasks
that until now have required human participation; for example, automatically configuring software
by consulting how-to guides. Next, I will present a Monte-Carlo search algorithm for game playing
that incorporates information from game strategy guides. In this framework, the task of text inter-
pretation is formulated as a probabilistic model that is trained based on feedback from Monte-Carlo
search. When applied to the Civilization strategy game, a language-empowered player outperforms
its traditional counterpart by a significant margin.

397
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, page 397,
Avignon, France, April 23 - 27 2012. ©2012 Association for Computational Linguistics
Lexical surprisal as a general predictor of reading time

Irene Fernandez Monsalve, Stefan L. Frank and Gabriella Vigliocco


Division of Psychology and Language Sciences
University College London
{ucjtife, s.frank, g.vigliocco}@ucl.ac.uk

Abstract

Probabilistic accounts of language processing can be psychologically tested by comparing word-reading times (RT) to the conditional word probabilities estimated by language models. Using surprisal as a linking function, a significant correlation between unlexicalized surprisal and RT has been reported (e.g., Demberg and Keller, 2008), but success using lexicalized models has been limited. In this study, phrase structure grammars and recurrent neural networks estimated both lexicalized and unlexicalized surprisal for words of independent sentences from narrative sources. These same sentences were used as stimuli in a self-paced reading experiment to obtain RTs. The results show that lexicalized surprisal according to both models is a significant predictor of RT, outperforming its unlexicalized counterparts.

1 Introduction

Context-sensitive, prediction-based processing has been proposed as a fundamental mechanism of cognition (Bar, 2007): Faced with the problem of responding in real-time to complex stimuli, the human brain would use basic information from the environment, in conjunction with previous experience, in order to extract meaning and anticipate the immediate future. Such a cognitive style is a well-established finding in low level sensory processing (e.g., Kveraga et al., 2007), but has also been proposed as a relevant mechanism in higher order processes, such as language. Indeed, there is ample evidence to show that human language comprehension is both incremental and predictive. For example, on-line detection of semantic or syntactic anomalies can be observed in the brain's EEG signal (Hagoort et al., 2004) and eye gaze is directed in anticipation at depictions of plausible sentence completions (Kamide et al., 2003). Moreover, probabilistic accounts of language processing have identified unpredictability as a major cause of processing difficulty in language comprehension. In such incremental processing, parsing would entail a pre-allocation of resources to expected interpretations, so that effort would be related to the suitability of such an allocation to the actually encountered stimulus (Levy, 2008).

Possible sentence interpretations can be constrained by both linguistic and extra-linguistic context, but while the latter is difficult to evaluate, the former can be easily modeled: The predictability of a word for the human parser can be expressed as the conditional probability of a word given the sentence so far, which can in turn be estimated by language models trained on text corpora. These probabilistic accounts of language processing difficulty can then be validated against empirical data, by taking reading time (RT) on a word as a measure of the effort involved in its processing.

Recently, several studies have followed this approach, using surprisal (see Section 1.1) as the linking function between effort and predictability. These can be computed for each word in a text, or alternatively for the word's parts of speech (POS). In the latter case, the obtained estimates can give an indication of the importance of syntactic structure in developing upcoming-word expectations, but ignore the rich lexical information that is doubtlessly employed by the human parser

398
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 398–408,
Avignon, France, April 23 - 27 2012. ©2012 Association for Computational Linguistics
to constrain predictions. However, whereas such an unlexicalized (i.e., POS-based) surprisal has been shown to significantly predict RTs, success with lexical (i.e., word-based) surprisal has been limited. This can be attributed to data sparsity (larger training corpora might be needed to provide accurate lexical surprisal than for the unlexicalized counterpart), or to the noise introduced by participants' world knowledge, inaccessible to the models. The present study thus sets out to find such a lexical surprisal effect, trying to overcome possible limitations of previous research.

1.1 Surprisal theory

The concept of surprisal originated in the field of information theory, as a measure of the amount of information conveyed by a particular event. Improbable ('surprising') events carry more information than expected ones, so that surprisal is inversely related to probability, through a logarithmic function. In the context of sentence processing, if w_1, ..., w_{t-1} denotes the sentence so far, then the cognitive effort required for processing the next word, w_t, is assumed to be proportional to its surprisal:

    effort(t) ∝ surprisal(w_t) = −log(P(w_t | w_1, ..., w_{t−1}))    (1)

Different theoretical groundings for this relationship have been proposed (Hale, 2001; Levy 2008; Smith and Levy, 2008). Smith and Levy derive it by taking a 'scale free' assumption: Any linguistic unit can be subdivided into smaller entities (e.g., a sentence is comprised of words, a word of phonemes), so that time to process the whole will equal the sum of processing times for each part. Since the probability of the whole can be expressed as the product of the probabilities of the subunits, the function relating probability and effort must be logarithmic. Levy (2008), on the other hand, grounds surprisal in its information-theoretical context, describing difficulty encountered in on-line sentence processing as a result of the need to update a probability distribution over possible parses, being directly proportional to the difference between the previous and updated distributions. By expressing the difference between these in terms of relative entropy, Levy shows that difficulty at each newly encountered word should be equal to its surprisal.

1.2 Empirical evidence for surprisal

The simplest statistical language models that can be used to estimate surprisal values are n-gram models or Markov chains, which condition the probability of a given word only on its n−1 preceding ones. Although Markov models theoretically limit the amount of prior information that is relevant for prediction of the next step, they are often used in linguistic context as an approximation to the full conditional probability. The effect of bigram probability (or forward transitional probability) has been repeatedly observed (e.g. McDonald and Shillcock, 2003), and Smith and Levy (2008) report an effect of lexical surprisal as estimated by a trigram model on RTs for the Dundee corpus (a collection of newspaper texts with eye-tracking data from ten participants; Kennedy and Pynte, 2005).

Phrase structure grammars (PSGs) have also been amply used as language models (Boston et al., 2008; Brouwer et al., 2010; Demberg and Keller, 2008; Hale, 2001; Levy, 2008). PSGs can combine statistical exposure effects with explicit syntactic rules, by annotating norms with their respective probabilities, which can be estimated from occurrence counts in text corpora. Information about hierarchical sentence structure can thus be included in the models. In this way, Brouwer et al. trained a probabilistic context-free grammar (PCFG) on 204,000 sentences extracted from Dutch newspapers to estimate lexical surprisal (using an Earley-Stolcke parser; Stolcke, 1995), showing that it could account for the noun phrase coordination bias previously described and explained by Frazier (1987) in terms of a minimal-attachment preference of the human parser. In contrast, Demberg and Keller used texts from a naturalistic source (the Dundee corpus) as the experimental stimuli, thus evaluating surprisal as a wide-coverage account of processing difficulty. They also employed a PSG, trained on a one-million-word language sample from the Wall Street Journal (part of the Penn Treebank II, Marcus et al., 1993). Using Roark's (2001) incremental parser, they found significant effects of unlexicalized surprisal on RTs (see also Boston et al. for a similar approach and results for German texts). However, they failed to find an effect for lexicalized surprisal, over and above forward transitional probability. Roark et al. (2009) also looked at the
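As a concrete illustration of Equation 1 above, the short sketch below computes per-word surprisal from conditional probabilities supplied by a bigram model; the toy probabilities are invented, and in practice the estimates would come from a trained language model such as those compared in this paper.

```python
import math

def surprisal(prob: float) -> float:
    """-log P(w_t | w_1 ... w_{t-1}); the base of the logarithm only rescales."""
    return -math.log(prob)

# Toy bigram probabilities P(w_t | w_{t-1}); the values are invented.
bigram = {("the", "dog"): 0.01, ("dog", "barked"): 0.05}
sentence = ["the", "dog", "barked"]
for prev, word in zip(sentence, sentence[1:]):
    print(word, surprisal(bigram[(prev, word)]))
```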

399
effects of syntactic and lexical surprisal, using RT data for short narrative texts. However, their estimates of these two surprisal values differ from those described above: In order to tease apart semantic and syntactic effects, they used Demberg and Keller's lexicalized surprisal as a total surprisal measure, which they decompose into syntactic and lexical components. Their results show significant effects of both syntactic and lexical surprisal, although the latter was found to hold only for closed class words. Lack of a wider effect was attributed to data sparsity: The models were trained on the relatively small Brown corpus (over one million words from 500 samples of American English text), so that surprisal estimates for the less frequent content words would not have been accurate enough.

Using the same training and experimental language samples as Demberg and Keller (2008), and only unlexicalized surprisal estimates, Frank (2009) and Frank and Bod (2011) focused on comparing different language models, including various n-gram models, PSGs and recurrent networks (RNN). The latter were found to be the better predictors of RTs, and PSGs could not explain any variance in RT over and above the RNNs, suggesting that human processing relies on linear rather than hierarchical representations.

Summing up, the only models taking into account actual words that have been consistently shown to simulate human behaviour with naturalistic text samples are bigram models.¹ A possible limitation in previous studies can be found in the stimuli employed. In reading real newspaper texts, prior knowledge of current affairs is likely to highly influence RTs, however, this source of variability cannot be accounted for by the models. In addition, whereas the models treat each sentence as an independent unit, in the text corpora employed they make up coherent texts, and are therefore clearly dependent. Thirdly, the stimuli used by Demberg and Keller (2008) comprise a very particular linguistic style: journalistic editorials, reducing the ability to generalize conclusions to language in general. Finally, failure to find lexical surprisal effects can also be attributed to the training texts. Larger corpora are likely to be needed for training language models on actual words than on POS (both the Brown corpus and the WSJ are relatively small), and in addition, the particular journalistic style of the WSJ might not be the best alternative for modeling human behaviour. Although similarity between the training and experimental data sets (both from newspaper sources) can improve the linguistic performance of the models, their ability to simulate human behaviour might be limited: Newspaper texts probably form just a small fraction of a person's linguistic experience. This study thus aims to tackle some of the identified limitations: Rather than cohesive texts, independent sentences, from a narrative style are used as experimental stimuli for which word-reading times are collected (as explained in Section 3). In addition, as discussed in the following section, language models are trained on a larger corpus, from a more representative language sample. Following Frank (2009) and Frank and Bod (2011), two contrasting types of models are employed: hierarchical PSGs and linear RNNs.

2 Models

2.1 Training data

The training texts were extracted from the written section of the British National Corpus (BNC), a collection of language samples from a variety of sources, designed to provide a comprehensive representation of current British English. A total of 702,412 sentences, containing only the 7,754 most frequent words (the open-class words used by Andrews et al., 2009, plus the 200 most frequent words in English) were selected, making up a 7.6-million-word training corpus. In addition to providing a larger amount of data than the WSJ, this training set thus provides a more representative language sample.

2.2 Experimental sentences

Three hundred and sixty-one sentences, all comprehensible out of context and containing only words included in the subset of the BNC used to train the models, were randomly selected from three freely accessible on-line novels² (for additional details, see Frank, 2012). The fictional narrative provides a good contrast to the pre-

¹ Although Smith and Levy (2008) report an effect of trigrams, they did not check if it exceeded that of simpler bigrams.
² Obtained from www.free-online-novels.com. Having not been published elsewhere, it is unlikely participants had read the novels previously.
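A minimal sketch of the kind of vocabulary filtering used to build the training corpus described in Section 2.1: keep only sentences whose word tokens all belong to the selected vocabulary. The tokenization and data handling are simplified and illustrative, not the preprocessing actually used for the BNC.

```python
def filter_sentences(sentences, vocabulary):
    """Keep sentences containing only in-vocabulary word tokens."""
    vocab = {w.lower() for w in vocabulary}
    return [s for s in sentences if all(w.lower() in vocab for w in s.split())]

vocab = {"the", "dog", "barked", "loudly"}
print(filter_sentences(["The dog barked", "The cat meowed"], vocab))
# ['The dog barked']
```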

400
viously examined newspaper editorials from the Dundee corpus, since participants did not need prior knowledge regarding the details of the stories, and a less specialised language and style were employed. In addition, the randomly selected sentences did not make up coherent texts (in contrast, Roark et al., 2009, employed short stories), so that they were independent from each other, both for the models and the readers.

2.3 Part-of-speech tagging

In order to produce POS-based surprisal estimates, versions of both the training and experimental texts with their words replaced by POS were developed: The BNC sentences were parsed by the Stanford Parser, version 1.6.7 (Klein and Manning, 2003), whilst the experimental texts were tagged by an automatic tagger (Tsuruoka and Tsujii, 2005), with posterior review and correction by hand following the Penn Treebank Project Guidelines (Santorini, 1991). By training language models and subsequently running them on the POS versions of the texts, unlexicalized surprisal values were estimated.

2.4 Phrase-structure grammars

The Treebank formed by the parsed BNC sentences served as training data for Roark's (2001) incremental parser. Following Frank and Bod (2011), a range of grammars was induced, differing in the features of the tree structure upon which rule probabilities were conditioned. In four grammars, probabilities depended on the left-hand side's ancestors, from one up to four levels up in the parse tree (these grammars will be denoted a1 to a4). In four other grammars (s1 to s4), the ancestors' left siblings were also taken into account. In addition, probabilities were conditioned on the current head node in all grammars. Subsequently, Roark's (2001) incremental parser parsed the experimental sentences under each of the eight grammars, obtaining eight surprisal values for each word. Since earlier research (Frank, 2009) showed that decreasing the parser's base beam width parameter improves performance, it was set to 10⁻¹⁸ (the default being 10⁻¹²).

2.5 Recurrent neural network

The RNN (see Figure 1) was trained in three stages, each taking the selected (unparsed) BNC sentences as training data.

Figure 1: Architecture of neural network language model, and its three learning stages. Numbers indicate the number of units in each network layer.

Stage 1: Developing word representations

Neural network language models can benefit from using distributed word representations: Each word is assigned a vector in a continuous, high-dimensional space, such that words that are paradigmatically more similar are closer together (e.g., Bengio et al., 2003; Mnih and Hinton, 2007). Usually, these representations are learned together with the rest of the model, but here we used a more efficient approach in which word representations are learned in an unsupervised manner from simple co-occurrences in the training data. First, vectors of word co-occurrence frequencies were developed using Good-Turing (Gale and Sampson, 1995) smoothed frequency counts from the training corpus. Values in the vector corresponded to the smoothed frequencies with which each word directly preceded or followed the represented word. Thus, each word w was assigned a vector (f_{w,1}, ..., f_{w,15508}), such that f_{w,v} is the number of times word v directly precedes (for v ≤ 7754) or follows (for v > 7754) word w. Next, the frequency counts were transformed into Pointwise Mutual Information (PMI) values (see Equation 2), following Bullinaria and Levy's (2007) findings that PMI produced more psychologically accurate predictions than other measures:

401
    PMI(w, v) = log( (f_{w,v} Σ_{i,j} f_{i,j}) / (Σ_i f_{i,v} Σ_j f_{w,j}) )    (2)

Finally, the 400 columns with the highest variance were selected from the 7754×15508 matrix of row vectors, making them more computationally manageable, but not significantly less informative.

Stage 2: Learning temporal structure

Using the standard backpropagation algorithm, a simple recurrent network (SRN) learned to predict, at each point in the training corpus, the next word's vector given the sequence of word vectors corresponding to the sentence so far. The total corpus was presented five times, each time with the sentences in a different random order.

Stage 3: Decoding predicted word representations

The distributed output of the trained SRN served as training input to the feedforward decoder network, that learned to map the distributed representations back to localist ones. This network, too, used standard backpropagation. Its output units had softmax activation functions, so that the output vector constitutes a probability distribution over word types. These translate directly into surprisal values, which were collected over the experimental sentences at ten intervals over the course of Stage 3 training (after presenting 2K, 5K, 10K, 20K, 50K, 100K, 200K, and 350K sentences, and after presenting the full training corpus once and twice). These will be denoted by RNN-1 to RNN-10.

A much simpler RNN model suffices for obtaining unlexicalized surprisal. Here, we used the same models as described by Frank and Bod (2011), albeit trained on the POS tags of our BNC training corpus. These models employed so-called Echo State Networks (ESN; Jaeger and Haas, 2004), which are RNNs that do not develop internal representations because weights of input and recurrent connections remain fixed at random values (only the output connection weights are trained). Networks of six different sizes were used. Of each size, three networks were trained, using different random weights. The best and worst model of each size were discarded to reduce the effect of the random weights.

3 Experiment

3.1 Procedure

Text display followed a self-paced reading paradigm: Sentences were presented on a computer screen one word at a time, with onset of the next word being controlled by the subject through a key press. The time between word onset and subsequent key press was recorded as the RT (measured in milliseconds) on that word by that subject.³ Words were presented centrally aligned in the screen, and punctuation marks appeared with the word that preceded them. A fixed-width font type (Courier New) was used, so that physical size of a word equalled number of characters. Order of presentation was randomized for each subject. The experiment was time-bounded to 40 minutes, and the number of sentences read by each participant varied between 120 and 349, with an average of 224. Yes-no comprehension questions followed 46% of the sentences.

3.2 Participants

A total of 117 first year psychology students took part in the experiment. Subjects unable to answer correctly to more than 20% of the questions and 47 participants who were non-native English speakers were excluded from the analysis, leaving a total of 54 subjects.

3.3 Design

The obtained RTs served as the dependent variable against which a mixed-effects multiple regression analysis with crossed random effects for subjects and items (Baayen et al., 2008) was performed. In order to control for low-level lexical factors that are known to influence RTs, such as word length or frequency, a baseline regression model taking them into account was built. Subsequently, the decrease in the model's deviance, after the inclusion of surprisal as a fixed factor to the baseline, was assessed using likelihood tests. The resulting χ² statistic indicates the extent to which each surprisal estimate accounts for RT, and can thus serve as a measure of the psychological accuracy of each model.

However, this kind of analysis assumes that RT for a word reflects processing of only that word,

³ The collected RT data are available for download at www.stefanfrank.info/EACL2012.
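Equation 2 above can be applied to a raw co-occurrence matrix as in the following numpy sketch; the Good-Turing smoothing and the selection of the 400 highest-variance columns described in Stage 1 are omitted, and the tiny count matrix is invented for illustration.

```python
import numpy as np

def pmi(counts: np.ndarray) -> np.ndarray:
    """Pointwise mutual information for a words-by-contexts count matrix.

    PMI(w, v) = log( f[w, v] * f_total / (row_sum[w] * col_sum[v]) ),
    following Equation 2; cells with zero counts are left at 0 instead of -inf.
    """
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # sum_j f[w, j]
    col = counts.sum(axis=0, keepdims=True)   # sum_i f[i, v]
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = counts * total / (row * col)
        values = np.where(counts > 0, np.log(ratio), 0.0)
    return values

counts = np.array([[10.0, 0.0], [3.0, 7.0]])
print(pmi(counts))
```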

402
but spill-over effects (in which processing difficulty at word wt shows up in the RT on wt+1) have been found in self-paced and natural reading (Just et al., 1982; Rayner, 1998; Rayner and Pollatsek, 1987). To evaluate these effects, the decrease in deviance after adding surprisal of the previous item to the baseline was also assessed. The following control predictors were included in the baseline regression model:

Lexical factors:

- Number of characters: Both physical size and number of characters have been found to affect RTs for a word (Rayner and Pollatsek, 1987), but the fixed-width font used in the experiment assured number of characters also encoded physical word length.

- Frequency and forward transitional probability: The effects of these two factors have been repeatedly reported (e.g. Juhasz and Rayner, 2003; Rayner, 1998). Given the high correlations between surprisal and these two measures, their inclusion in the baseline assures that the results can be attributed to predictability in context, over and above frequency and bigram probability. Frequency was estimated from occurrence counts of each word in the full BNC corpus (written section). The same transformation (negative logarithm) was applied as for computing surprisal, thus obtaining unconditional and bigram surprisal values.

- Previous word lexical factors: Lexical factors for the previous word were included in the analysis to control for spill-over effects.

Temporal factors and autocorrelation:

RT data over naturalistic texts violate the regression assumption of independence of observations in several ways, and important word-by-word sequential correlations exist. In order to ensure validity of the statistical analysis, as well as providing a better model fit, the following factors were also included:

- Sentence position: Fatigue and practice effects can influence RTs. Sentence position in the experiment was included both as linear and quadratic factor, allowing for the modeling of initial speed-up due to practice, followed by a slowing down due to fatigue.

- Word position: Low-level effects of word order, not related to predictability itself, were modeled by including word position in the sentence, both as a linear and quadratic factor (some of the sentences were quite long, so that the effect of word position is unlikely to be linear).

- Reading time for previous word: As suggested by Baayen and Milin (2010), including RT on the previous word can control for several autocorrelation effects.

4 Results

Data were analysed using the free statistical software package R (R Development Core Team, 2009) and the lme4 library (Bates et al., 2011). Two analyses were performed for each language model, using surprisal for either current or previous word as the independent variable. Unlikely reading times (lower than 50ms or over 3000ms) were removed from the analysis, as were clitics, words followed by punctuation, words following punctuation or clitics (since factors for previous word were included in the analysis), and sentence-initial words, leaving a total of 132,298 data points (between 1,335 and 3,829 per subject).

4.1 Baseline model

Theoretical considerations guided the selection of the initial predictors presented above, but an empirical approach led actual regression model building. Initial models with the original set of fixed effects, all two-way interactions, plus random intercepts for subjects and items were evaluated, and least significant factors were removed one at a time, until only significant predictors were left (|t| > 2). A different strategy was used to assess which by-subject and by-item random slopes to include in the model. Given the large number of predictors, starting from the saturated model with all random slopes generated non-convergence problems and excessively long running times. By-subject and by-item random slopes for each fixed effect were therefore assessed individually, using likelihood tests. The final baseline model included by-subject random intercepts, by-subject random slopes for sentence position and word position, and by-item slopes for previous RT. All factors (random slopes and fixed effects) were centred and standardized to avoid
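The likelihood tests referred to above compare a baseline regression model against the same model extended with the predictor(s) of interest; the reported χ² statistic is twice the difference in log-likelihood. The analyses in the paper were run in R with lme4; the following is only a language-neutral sketch of that comparison, with placeholder log-likelihood values.

```python
from scipy.stats import chi2

def likelihood_ratio_test(loglik_baseline, loglik_extended, extra_params):
    """Chi-square test for nested models differing by `extra_params` predictors."""
    statistic = 2.0 * (loglik_extended - loglik_baseline)
    p_value = chi2.sf(statistic, df=extra_params)
    return statistic, p_value

# Placeholder values: baseline vs. baseline + current & previous surprisal.
print(likelihood_ratio_test(-1000.0, -970.0, extra_params=2))
```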

403
[Figure 2 consists of two panels, 'Lexicalized models' and 'Unlexicalized models', plotting psychological accuracy (χ²) on the vertical axis against linguistic accuracy (−average surprisal) on the horizontal axis, with separate point series for the PSG-a, PSG-s, and RNN models.]

Figure 2: Psychological accuracy (combined effect of current and previous surprisal) against linguistic accuracy
of the different models. Numbered labels denote the maximum number of levels up in the tree from which
conditional information is used (PSG); point in training when estimates were collected (word-based RNN); or
network size (POS-based RNN).

multicollinearity-related problems.

4.2 Surprisal effects

All model categories (PSGs and RNNs) produced lexicalized surprisal estimates that led to a significant (p < 0.05) decrease in deviance when included as a fixed factor in the baseline, with positive coefficients: Higher surprisal led to longer RTs. Significant effects were also found for their unlexicalized counterparts, albeit with considerably smaller χ²-values.

Both for the lexicalized and unlexicalized versions, these effects persisted whether surprisal for the previous or current word was taken as the independent variable. However, the effect size was much larger for previous surprisal, indicating the presence of strong spill-over effects (e.g. lexicalized PSG-s3: current surprisal: χ²(1) = 7.29, p = 0.007; previous surprisal: χ²(1) = 36.73, p < 0.001).

From hereon, only results for the combined effect of both (inclusion of previous and current surprisal as fixed factors in the baseline) are reported. Figure 2 shows the psychological accuracy of each model (χ²(2) values) plotted against its linguistic accuracy (i.e., its quality as a language model, measured by the negative average surprisal on the experimental sentences: the higher this value, the less surprised the model is by the test corpus). For the lexicalized models, RNNs clearly outperform PSGs. Moreover, the RNNs' accuracy increases as training progresses (the highest psychological accuracy is achieved at point 8, when 350K training sentences were presented). The PSGs taking into account sibling nodes are slightly better than their ancestor-only counterparts (the best psychological model is PSG-s3). Contrary to the trend reported by Frank and Bod (2011), the unlexicalized PSGs and RNNs reach similar levels of psychological accuracy, with the PSG-s4 achieving the highest χ²-value.

Table 1: Model comparison between best performing word-based PSG and RNN.

Model comparison    χ²(2)    p-value
PSG over RNN        12.45    0.002
RNN over PSG        30.46    < 0.001

Although RNNs outperform PSGs in the lexicalized estimates, comparisons between the best performing model (i.e. highest χ²) in each category showed both were able to explain variance over and above each other (see Table 1). It is worth noting, however, that if comparisons are made amongst models including surprisal for current, but not previous word, the PSG is unable

404
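The spill-over analyses above enter the previous word's surprisal, rather than the current word's, as a predictor of the current reading time. A minimal sketch of how such a lagged predictor could be aligned with per-word RTs; surprisal and rt are hypothetical parallel lists for one text, and this is an illustration rather than the authors' preprocessing code.

def spillover_rows(surprisal, rt):
    # Pair each word's reading time with the current word's surprisal and with
    # the previous word's surprisal (the spill-over predictor). The first word
    # has no predecessor and is dropped.
    rows = []
    for t in range(1, len(rt)):
        rows.append({"rt": rt[t],
                     "current_surprisal": surprisal[t],
                     "previous_surprisal": surprisal[t - 1]})
    return rows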
Model comparison            χ²(2)    p-value
Best models overall:
  POS- over word-based      10.40    0.006
  word- over POS-based      47.02    < 0.001
PSGs:
  POS- over word-based       6.89    0.032
  word- over POS-based      25.50    < 0.001
RNNs:
  POS- over word-based       5.80    0.055
  word- over POS-based      49.74    < 0.001

Table 2: Word- vs. POS-based models: comparisons between best models overall, and best models within each category.

4.3 Differences across word classes

In order to make sure that the lexicalized surprisal effects found were not limited to closed-class words (as Roark et al., 2009, report), a further model comparison was performed by adding by-POS random slopes of surprisal to the models containing the baseline plus surprisal. If particular syntactic categories were contributing to the overall effect of surprisal more than others, including such random slopes would lead to additional variance being explained. However, this was not the case: inclusion of by-POS random slopes of surprisal did not lead to a significant improvement in model fit (PSG: χ²(1) = 0.86, p = 0.35; RNN: χ²(1) = 3.20, p = 0.07).⁶

⁴ Best models in this case were PSG-a3 and RNN-7.
⁵ Since best performing lexicalized and unlexicalized models belonged to different groups: RNN and PSG, respectively, Table 2 also shows comparisons within model type.
⁶ Comparison was made on the basis of previous word surprisal (best models in this case were PSG-s3 and RNN-9).

5 Discussion

The present study aimed to find further evidence for surprisal as a wide-coverage account of language processing difficulty, and indeed, the results show the ability of lexicalized surprisal to explain a significant amount of variance in RT data for naturalistic texts, over and above that accounted for by other low-level lexical factors, such as frequency, length, and forward transitional probability. Although previous studies had presented results supporting such a probabilistic language processing account, evidence for word-based surprisal was limited: Brouwer et al. (2010) only examined a specific psycholinguistic phenomenon, rather than a random language sample; Demberg and Keller (2008) reported effects that were only significant for POS but not word-based surprisal; and Smith and Levy (2008) found an effect of lexicalized surprisal (according to a trigram model), but did not assess whether simpler predictability estimates (i.e., by a bigram model) could have accounted for those effects.

Demberg and Keller's (2008) failure to find lexicalized surprisal effects can be attributed both to the language corpus used to train the language models, as well as to the experimental texts used. Both were sourced from newspaper texts: As training corpora these are unrepresentative of a person's linguistic experience, and as experimental texts they are heavily dependent on participants' world knowledge. Roark et al. (2009), in contrast, used a more representative, albeit relatively small, training corpus, as well as narrative-style stimuli, thus obtaining RTs less dependent on participants' prior knowledge. With such an experimental set-up, they were able to demonstrate the effects of lexical surprisal for RT of closed-class, but not open-class, words, which they attributed to their differential frequency and to training-data sparsity: The limited Brown corpus would have been enough to produce accurate estimates of surprisal for function words, but not for the less frequent content words. A larger training corpus, constituting a broad language sample, was used in our study, and the detected surprisal effects were shown to hold across syntactic category (modeling slopes for POS separately did not improve model fit). However, direct comparison with Roark et al.'s results is not possible: They employed alternative definitions of structural and lexical surprisal, which they derived by decomposing the total surprisal as obtained with a fully lexicalized PSG model.
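The χ² statistics reported in Tables 1 and 2 and in the by-POS comparison above appear to be likelihood-ratio (deviance) tests between nested regression models, with and without the predictor of interest. A minimal sketch of such a test; ll_reduced and ll_full are hypothetical maximized log-likelihoods of the two fitted models and df is the number of added parameters, so this is an illustration rather than the authors' analysis script.

from scipy.stats import chi2

def likelihood_ratio_test(ll_reduced, ll_full, df):
    # Deviance difference between two nested models, referred to a
    # chi-squared distribution with df degrees of freedom.
    statistic = 2.0 * (ll_full - ll_reduced)
    return statistic, chi2.sf(statistic, df)

# Hypothetical values: adding current and previous surprisal (df = 2).
stat, p = likelihood_ratio_test(ll_reduced=-10250.3, ll_full=-10235.1, df=2)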
In the current study, a similar approach to that taken by Demberg and Keller (2008) was used to define structural (or unlexicalized) and lexicalized surprisal, but the results are strikingly different: Whereas Demberg and Keller report a significant effect for POS-based estimates, but not for word-based surprisal, our results show that lexicalized surprisal is a far better predictor of RTs than its unlexicalized counterpart. This is not surprising, given that while the unlexicalized models only have access to syntactic sources of information, the lexicalized models, like the human parser, can also take into account lexical co-occurrence trends. However, when a training corpus is not large enough to accurately capture the latter, it might still be able to model the former, given the higher frequency of occurrence of each possible item (POS vs. word) in the training data. Roark et al. (2009) also included in their analysis a POS-based surprisal estimate, which lost significance when the two components of the lexicalized surprisal were present, suggesting that such unlexicalized estimates can be interpreted only as a coarse version of the fully lexicalized surprisal, incorporating both syntactic and lexical sources of information at the same time. The results presented here do not replicate this finding: The best unlexicalized estimates were able to explain additional variance over and above the best word-based estimates. However, this comparison contrasted two different model types: a word-based RNN and a POS-based PSG, so that the observed effects could be attributed to the model representations (hierarchical vs. linear) rather than to the item of analysis (POS vs. words). Within-model comparisons showed that unlexicalized estimates were still able to account for additional variance, although only reaching significance at the 0.05 level for the PSGs.

Previous results reported by Frank (2009) and Frank and Bod (2011) regarding the higher psychological accuracy of RNNs and the inability of the PSGs to explain any additional variance in RT were not replicated. Although for the word-based estimates RNNs outperform the PSGs, we found both to have independent effects. Furthermore, in the POS-based analysis, performance of PSGs and RNNs reaches similarly high levels of psychological accuracy, with the best-performing PSG producing slightly better results than the best-performing RNN. This discrepancy in the results could reflect contrasting reading styles in the two studies: natural reading of newspaper texts, or self-paced reading of independent, narrative sentences. The absence of global context, or the unnatural reading methodology employed in the current experiment, could have led to an increased reliance on hierarchical structure for sentence comprehension. The sources and structures relied upon by the human parser to elaborate upcoming-word expectations could therefore be task-dependent. On the other hand, our results show that the independent effects of word-based PSG estimates only become apparent when investigating the effect of surprisal of the previous word. That is, considering only the current word's surprisal, as in Frank and Bod's analysis, did not reveal a significant contribution of PSGs over and above RNNs. Thus, additional effects of PSG surprisal might only be apparent when spill-over effects are investigated by taking previous word surprisal as a predictor of RT.

6 Conclusion

The results here presented show that lexicalized surprisal can indeed model RT over naturalistic texts, thus providing a wide-coverage account of language processing difficulty. Failure of previous studies to find such an effect could be attributed to the size or nature of the training corpus, suggesting that larger and more general corpora are needed to model successfully both the structural and lexical regularities used by the human parser to generate predictions. Another crucial finding presented here is the importance of spill-over effects: Surprisal of a word had a much larger influence on the RT of the following item than on that of the word itself. Previous studies where lexicalized surprisal was only analysed in relation to current RT could have missed a significant effect only manifested on the following item. Whether spill-over effects are as important for different RT collection paradigms (e.g., eye-tracking) remains to be tested.

Acknowledgments

The research presented here was funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant number 253803. The authors acknowledge the use of the UCL Legion High Performance Computing Facility, and associated support services, in the completion of this work.
References

Gerry T.M. Altmann and Yuki Kamide. 1999. Incremental interpretation at verbs: Restricting the domain of subsequent reference. Cognition, 73:247–264.

Mark Andrews, Gabriella Vigliocco, and David P. Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116:463–498.

R. Harald Baayen and Petar Milin. 2010. Analyzing reaction times. International Journal of Psychological Research, 3:12–28.

R. Harald Baayen, Doug J. Davidson, and Douglas M. Bates. 2008. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59:390–412.

Moshe Bar. 2007. The proactive brain: using analogies and associations to generate predictions. Trends in Cognitive Sciences, 11:280–289.

Douglas Bates, Martin Maechler, and Ben Bolker. 2011. lme4: Linear mixed-effects models using S4 classes. Available from: http://CRAN.R-project.org/package=lme4 (R package version 0.999375-39).

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Marisa Ferrara Boston, John Hale, Reinhold Kliegl, Umesh Patil, and Shravan Vasishth. 2008. Parsing costs as predictors of reading difficulty: An evaluation using the Potsdam Sentence Corpus. Journal of Eye Movement Research, 2:1–12.

Harm Brouwer, Hartmut Fitz, and John C. J. Hoeks. 2010. Modeling the noun phrase versus sentence coordination ambiguity in Dutch: evidence from surprisal theory. In Proceedings of the 2010 Workshop on Cognitive Modeling and Computational Linguistics, pages 72–80, Stroudsburg, PA, USA.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39:510–526.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109:193–210.

Stefan L. Frank and Rens Bod. 2011. Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science, 22:829–834.

Stefan L. Frank. 2009. Surprisal-based comparison between a symbolic and a connectionist model of sentence processing. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 1139–1144, Austin, TX.

Stefan L. Frank. 2012. Uncertainty reduction as a measure of cognitive processing load in sentence comprehension. Manuscript submitted for publication.

Peter Hagoort, Lea Hald, Marcel Bastiaansen, and Karl Magnus Petersson. 2004. Integration of word meaning and world knowledge in language comprehension. Science, 304:438–441.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1–8, Stroudsburg, PA.

Herbert Jaeger and Harald Haas. 2004. Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science, pages 78–80.

Barbara J. Juhasz and Keith Rayner. 2003. Investigating the effects of a set of intercorrelated variables on eye fixation durations in reading. Journal of Experimental Psychology: Learning, Memory and Cognition, 29:1312–1318.

Marcel A. Just, Patricia A. Carpenter, and Jacqueline D. Woolley. 1982. Paradigms and processes in reading comprehension. Journal of Experimental Psychology: General, 111:228–238.

Yuki Kamide, Christoph Scheepers, and Gerry T. M. Altmann. 2003. Integration of syntactic and semantic information in predictive processing: cross-linguistic evidence from German and English. Journal of Psycholinguistic Research, 32:37–55.

Alan Kennedy and Joel Pynte. 2005. Parafoveal-on-foveal effects in normal reading. Vision Research, 45:153–168.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Meeting of the Association for Computational Linguistics, pages 423–430.

Kestutis Kveraga, Avniel S. Ghuman, and Moshe Bar. 2007. Top-down predictions in the cognitive brain. Brain and Cognition, 65:145–168.

Roger Levy. 2008. Expectation-based syntactic comprehension. Cognition, 106:1126–1177.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330.

Scott A. McDonald and Richard C. Shillcock. 2003. Low-level predictive inference in reading: the influence of transitional probabilities on eye movements. Vision Research, 43:1735–1751.

Andriy Mnih and Geoffrey Hinton. 2007. Three new graphical models for statistical language modelling. In Proceedings of the 25th International Conference on Machine Learning, pages 641–648.
Keith Rayner and Alexander Pollatsek. 1987. Eye movements in reading: A tutorial review. In M. Coltheart, editor, Attention and Performance XII: The Psychology of Reading, pages 327–362. Lawrence Erlbaum Associates, London, UK.

Keith Rayner. 1998. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124:372–422.

Brian Roark, Asaf Bachrach, Carlos Cardenas, and Christophe Pallier. 2009. Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 1, pages 324–333, Stroudsburg, PA.

Brian Roark. 2001. Probabilistic top-down parsing and language modeling. Computational Linguistics, 27:249–276.

Beatrice Santorini. 1991. Part-of-speech tagging guidelines for the Penn Treebank Project. Technical report, Philadelphia, PA.

Nathaniel J. Smith and Roger Levy. 2008. Optimal processing times in reading: a formal model and empirical investigation. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, pages 595–600, Austin, TX.

Andreas Stolcke. 1995. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21:165–201.

Yoshimasa Tsuruoka and Jun'ichi Tsujii. 2005. Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 467–474, Stroudsburg, PA.
Spectral Learning for Non-Deterministic Dependency Parsing

Franco M. Luque
Universidad Nacional de Córdoba and CONICET
Córdoba X5000HUA, Argentina
francolq@famaf.unc.edu.ar

Ariadna Quattoni and Borja Balle and Xavier Carreras
Universitat Politècnica de Catalunya
Barcelona E-08034
{aquattoni,bballe,carreras}@lsi.upc.edu
Abstract

In this paper we study spectral learning methods for non-deterministic split head-automata grammars, a powerful hidden-state formalism for dependency parsing. We present a learning algorithm that, like other spectral methods, is efficient and non-susceptible to local minima. We show how this algorithm can be formulated as a technique for inducing hidden structure from distributions computed by forward-backward recursions. Furthermore, we also present an inside-outside algorithm for the parsing model that runs in cubic time, hence maintaining the standard parsing costs for context-free grammars.

1 Introduction

Dependency structures of natural language sentences exhibit a significant amount of non-local phenomena. Historically, there have been two main approaches to model non-locality: (1) increasing the order of the factors of a dependency model (e.g. with sibling and grandparent relations (Eisner, 2000; McDonald and Pereira, 2006; Carreras, 2007; Martins et al., 2009; Koo and Collins, 2010)), and (2) using hidden states to pass information across factors (Matsuzaki et al., 2005; Petrov et al., 2006; Musillo and Merlo, 2008).

Higher-order models have the advantage that they are relatively easy to train, because estimating the parameters of the model can be expressed as a convex optimization. However, they have two main drawbacks. (1) The number of parameters grows significantly with the size of the factors, leading to potential data-sparsity problems. A solution to address the data-sparsity problem is to explicitly tell the model what properties of higher-order factors need to be remembered. This can be achieved by means of feature engineering, but compressing such information into a state of bounded size will typically be labor intensive, and will not generalize across languages. (2) Increasing the size of the factors generally results in polynomial increases in the parsing cost.

In principle, hidden variable models could solve some of the problems of feature engineering in higher-order factorizations, since they could automatically induce the information in a derivation history that should be passed across factors. Potentially, they would require less feature engineering since they can learn from an annotated corpus an optimal way to compress derivations into hidden states. For example, one line of work has added hidden annotations to the non-terminals of a phrase-structure grammar (Matsuzaki et al., 2005; Petrov et al., 2006; Musillo and Merlo, 2008), resulting in compact grammars that obtain parsing accuracies comparable to lexicalized grammars. A second line of work has modeled hidden sequential structure, like in our case, but using PDFA (Infante-Lopez and de Rijke, 2004). Finally, a third line of work has induced hidden structure from the history of actions of a parser (Titov and Henderson, 2007).

However, the main drawback of the hidden variable approach to parsing is that, to the best of our knowledge, there has not been any convex formulation of the learning problem. As a result, training a hidden-variable model is both expensive and prone to local minima issues.

In this paper we present a learning algorithm for hidden-state split head-automata grammars (SHAG) (Eisner and Satta, 1999). In this for-
malism, head-modifier sequences are generated ner, 2000), a context-free grammatical formal-
by a collection of finite-state automata. In our ism whose derivations are projective dependency
case, the underlying machines are probabilistic trees. We will use xi:j = xi xi+1 xj to de-
non-deterministic finite state automata (PNFA), note a sequence of symbols xt with i t j.
which we parameterize using the operator model A SHAG generates sentences s0:N , where sym-
representation. This representation allows the use bols st X with 1 t N are regular words
of simple spectral algorithms for estimating the and s0 = ? 6 X is a special root symbol. Let
model parameters from data (Hsu et al., 2009; X = X {?}. A derivation y, i.e. a depen-
Bailly, 2011; Balle et al., 2012). In all previous dency tree, is a collection of head-modifier se-
work, the algorithms used to induce hidden struc- quences hh, d, x1:T i, where h X is a word,
ture require running repeated inference on train- d {LEFT, RIGHT} is a direction, and x1:T is
ing datae.g. Expectation-Maximization (Demp- a sequence of T words, where each xt X is
ster et al., 1977), or split-merge algorithms. In a modifier of h in direction d. We say that h is
contrast, spectral methods are simple and very ef- the head of each xt . Modifier sequences x1:T are
ficient parameter estimation is reduced to com- ordered head-outwards, i.e. among x1:T , x1 is the
puting some data statistics, performing SVD, and word closest to h in the derived sentence, and xT
inverting matrices. is the furthest. A derivation y of a sentence s0:N
The main contributions of this paper are: consists of a LEFT and a RIGHT head-modifier se-
quence for each st . As special cases, the LEFT se-
We present a spectral learning algorithm for
quence of the root symbol is always empty, while
inducing PNFA with applications to head-
the RIGHT one consists of a single word corre-
automata dependency grammars. Our for-
sponding to the head of the sentence. We denote
mulation is based on thinking about the dis-
by Y the set of all valid derivations.
tribution generated by a PNFA in terms of
Assume a derivation y contains hh, LEFT, x1:T i
the forward-backward recursions.
and hh, RIGHT, x01:T 0 i. Let L(y, h) be the derived
Spectral learning algorithms in previous sentence headed by h, which can be expressed as
work only use statistics of prefixes of se- L(y, xT ) L(y, x1 ) h L(y, x01 ) L(y, x0T 0 ).1
quences. In contrast, our algorithm is able The language generated by a SHAG are the
to learn from substring statistics. strings L(y, ?) for any y Y.
In this paper we use probabilistic versions of
We derive an inside-outside algorithm for SHAG where probabilities of head-modifier se-
non-deterministic SHAG that runs in cubic quences in a derivation are independent of each
time, keeping the costs of CFG parsing. other:
In experiments we show that adding non-
Y
P(y) = P(x1:T |h, d) . (1)
determinism improves the accuracy of sev- hh,d,x1:T iy
eral baselines. When we compare our algo-
rithm to EM we observe a reduction of two In the literature, standard arc-factored models fur-
orders of magnitude in training time. ther assume that
TY
+1
The paper is organized as follows. Next section P(x1:T |h, d) = P(xt |h, d, t ) , (2)
describes the necessary background on SHAG t=1
and operator models. Section 3 introduces Op-
where xT +1 is always a special STOP word, and t
erator SHAG for parsing, and presents a spectral
is the state of a deterministic automaton generat-
learning algorithm. Section 4 presents a parsing
ing x1:T +1 . For example, setting 1 = FIRST and
algorithm. Section 5 presents experiments and
t>1 = REST corresponds to first-order models,
analysis of results, and section 6 concludes.
while setting 1 = NULL and t>1 = xt1 corre-
2 Preliminaries sponds to sibling models (Eisner, 2000; McDon-
ald et al., 2005; McDonald and Pereira, 2006).
2.1 Head-Automata Dependency Grammars 1
Throughout the paper we assume we can distinguish the
In this work we use split head-automata gram- words in a derivation, irrespective of whether two words at
mars (SHAG) (Eisner and Satta, 1999; Eis- different positions correspond to the same symbol.
2.2 Operator Models symbol a and moving to state i given that we are
An operator model A with n states is a tuple at state j.
> , {A }
h1 , a aX i, where Aa R
nn is an op- HMM are only one example of distributions
erator matrix and 1 , Rn are vectors. A that can be parameterized by operator models.
computes a function f : X R as follows: In general, operator models can parameterize any
PNFA, where the parameters of the model corre-
>
f (x1:T ) = A xT A x1 1 . (3) spond to probabilities of emitting a symbol from
a state and moving to the next state.
One intuitive way of understanding operator The advantage of working with operator mod-
models is to consider the case where f computes els is that, under certain mild assumptions on the
a probability distribution over strings. Such a dis- operator parameters, there exist algorithms that
tribution can be described in two equivalent ways: can estimate the operators from observable statis-
by making some independence assumptions and tics of the input sequences. These algorithms are
providing the corresponding parameters, or by ex- extremely efficient and are not susceptible to local
plaining the process used to compute f . This is minima issues. See (Hsu et al., 2009) for theoret-
akin to describing the distribution defined by an ical proofs of the learnability of HMM under the
HMM in terms of a factorization and its corre- operator model representation.
sponding transition and emission parameters, or In the following, we write x = xi:j X to
using the inductive equations of the forward al- denote sequences of symbols, and use Axi:j as a
gorithm. The operator model representation takes shorthand for Axj Axi . Also, for convenience
the latter approach. we assume X = {1, . . . , l}, so that we can index
Operator models have had numerous applica- vectors and matrices by symbols in X .
tions. For example, they can be used as an alter-
native parameterization of the function computed 3 Learning Operator SHAG
by an HMM (Hsu et al., 2009). Consider an HMM
with n hidden states and initial-state probabilities We will define a SHAG using a collection of op-
Rn , transition probabilities T Rnn , and erator models to compute probabilities. Assume
observation probabilities Oa Rnn for each that for each possible head h in the vocabulary X
a X , with the following meaning: and each direction d {LEFT, RIGHT} we have
an operator model that computes probabilities of
(i) is the probability of starting at state i, modifier sequences as follows:
T (i, j) is the probability of transitioning h,d > h,d
P(x1:T |h, d) = ( ) AxT Ah,d h,d
x1 1 .
from state j to state i,
Then, this collection of operator models defines
Oa is a diagonal matrix, such that Oa (i, i) is
an operator SHAG that assigns a probability to
the probability of generating symbol a from
each y Y according to (1). To learn the model
state i.
parameters, namely h1h,d , h,d
, {Ah,d
a }aX i for
Given an HMM, an equivalent operator model h X and d {LEFT, RIGHT}, we use spec-
can be defined by setting 1 = , Aa = T Oa tral learning methods based on the works of Hsu
and = ~1. To see this, let us show that the for- et al. (2009), Bailly (2011) and Balle et al. (2012).
ward algorithm computes the expression in equa- The main challenge of learning an operator
tion (3). Let t denote the state of the HMM model is to infer a hidden-state space from ob-
at time t. Consider a state-distribution vector servable quantities, i.e. quantities that can be com-
t Rn , where t (i) = P(x1:t1 , t = i). Ini- puted from the distribution of sequences that we
tially 1 = . At each step in the chain of prod- observe. As it turns out, we cannot recover the
ucts (3), t+1 = Axt t updates the state dis- actual hidden-state space used by the operators
tribution from positions t to t + 1 by applying we wish to learn. The key insight of the spectral
the appropriate operator, i.e. by emitting symbol learning method is that we can recover a hidden-
xt and transitioning to the new statePdistribution. state space that corresponds to a projection of the
The probability of x1:T is given by i T +1 (i). original hidden space. Such projected space is
Hence, Aa (i, j) is the probability of generating equivalent to the original one in the sense that we
can find operators in the projected space that pa- Furthermore, for each b X let Pb Rll denote
rameterize the same probability distribution over the matrix whose entries are given by
sequences.
Pb (c, a) = E(abc v] x) , (7)
In the rest of this section we describe an algo-
rithm for learning an operator model. We will as- the expected number of occurrences of trigrams.
l l
sume a fixed head word and direction, and drop h P p1 R and p R
Finally, we define vectors
and d from all terms. Hence, our goal is to learn as follows: p1 (a) = sX P(as), the probabil-
the following distribution, parameterized by oper- ity that a stringPbegins with a particular symbol;
ators 1 , {Aa }aX , and : and p (a) = pX P(pa), the probability that
a string ends with a particular symbol.
>
P(x1:T ) = A xT A x1 1 . (4) Now we show a particularly useful way to ex-
press the quantities defined above in terms of the
Our algorithm shares many features with the > , {A }
operators h1 , a aX i of P. First, note
previous spectral algorithms of Hsu et al. (2009)
that each entry of P can be written in this form:
and Bailly (2011), though the derivation given X
here is based upon the general formulation of P (b, a) = P(pabs) (8)
Balle et al. (2012). The main difference is that p,sX
our algorithm is able to learn operator models
X
>
= As Ab Aa Ap 1
from substring statistics, while algorithms in pre- p,sX
vious works were restricted to statistics on pre- >
X X
fixes. In principle, our algorithm should extract = ( As ) Ab Aa ( Ap 1 ) .
sX pX
much more information from a sample.
It is not hard to see that, since P isPa probability
3.1 Preliminary Definitions distribution over X , actually >
A =
P sX s
The spectral learning algorithm will use statistics ~1> . Furthermore, since pX Ap =
estimated from samples of the target distribution. P P k (I
P 1
k0 ( aX Aa ) P= aX Aa ) ,
More specifically, consider the function that com- we write 1 = (I aX Aa )1 1 . From (8) it
putes the expected number of occurrences of a is natural to define a forward matrix F Rnl
substring x in a random string x0 drawn from P: whose ath column contains the sum of all hidden-
state vectors obtained after generating all prefixes
f (x) = E(x v] x0 )
X ended in a:
= (x v] x0 )P(x0 ) X
x0 X
F (:, a) = Aa Ap 1 = Aa 1 . (9)
X pX
= P(pxs) , (5)
p,sX
Conversely, we also define a backward matrix
B Rln whose ath row contains the probability
where x v] x0 denotes the number of times x ap- of generating a from any possible state:
pears in x0 . Here we assume that the true values >
X
of f (x) for bigrams are known, though in practice B(a, :) = As Aa = ~1> Aa . (10)
sX
the algorithm will work with empirical estimates
of these. By plugging the forward and backward matri-
The information about f known by the algo- ces into (8) one obtains the factorization P =
rithm is organized in matrix form as follows. Let BF . With similar arguments it is easy to see
P Rll be a matrix containing the value of f (x) that one also has Pb = BAb F , p1 = B 1 , and
for all strings of length two, i.e. bigrams.2 . That p> >
= F . Hence, if B and F were known, one
is, each entry in P Rll contains the expected could in principle invert these expressions in order
number of occurrences of a given bigram: to recover the operators of the model from em-
pirical estimations computed from a sample. In
P (b, a) = E(ab v] x) . (6) the next section we show that in fact one does not
2
In fact, while we restrict ourselves to strings of length
need to know B and F to learn an operator model
two, an analogous algorithm can be derived that considers for P, but rather that having a good factorization
longer strings to define P . See (Balle et al., 2012) for details. of P is enough.
3.2 Inducing a Hidden-State Space Algorithm 1 Learn Operator SHAG
We have shown that an operator model A com- inputs:
An alphabet X
puting P induces a factorization of the matrix P ,
A training set TRAIN = {hhi , di , xi1:T i}M
i=1
namely P = BF . More generally, it turns out that The number of hidden states n
when the rank of P equals the minimal number of
1: for each h X and d {LEFT, RIGHT} do
states of an operator model that computes P, then
2: Compute an empirical estimate from TRAIN of
one can prove a duality relation between opera-
statistics matrices pb1 , pb , Pb, and {Pba }aX
tors and factorizations of P . In particular, one can 3: Compute the SVD of Pb and let U b be the matrix
show that, for any rank factorization P = QR, the of top n left singular vectors of Pb
operators given by 1 = Q+ p1 , > = p> R+ ,
4: Compute the observable operators for h and d:
+ +
and Aa = Q Pa R , yield an operator model for 5: b1h,d = U
b > pb1
P. A key fact in proving this result is that the func- 6: ) = pb>
(b h,d > b> b +
(U P )
h,d > b > Pb)+ for each a X
tion P is invariant to the basis chosen to represent 7: Ab =U
a
b Pba (U
operator matrices. See (Balle et al., 2012) for fur- 8: end for
ther details. 9: return Operators hb 1h,d ,
bh,d bh,d
, Aa i
Thus, we can recover an operator model for P for each h X , d {LEFT, RIGHT}, a X
from any rank factorization of P , provided a rank
assumption on P holds (which hereafter we as-
SHAG is learned separately. The running time
sume to be the case). Since we only have access
of the algorithm is dominated by two computa-
to an approximation of P , it seems reasonable to
tions. First, a pass over the training sequences to
choose a factorization which is robust to estima-
compute statistics over unigrams, bigrams and tri-
tion errors. A natural such choice is the thin SVD
grams. Second, SVD and matrix operations for
decomposition of P (i.e. using top n singular vec-
computing the operators, which run in time cubic
tors), given by: P = U (V > ) = U (U > P ).
in the number of symbols l. However, note that
Intuitively, we can think of U and U > P as pro-
when dealing with sparse matrices many of these
jected backward and forward matrices. Now that
operations can be performed more efficiently.
we have a factorization of P we can construct an
operator model for P as follows: 3 4 Parsing Algorithms
1 = U > p1 , (11) Given a sentence s0:N we would like
>
= p> >
(U P )
+
, (12) to find its most likely derivation, y =
> > argmaxyY(s0:N ) P(y). This problem, known as
Aa = U Pa (U P )+ . (13)
MAP inference, is known to be intractable for
Algorithm 1 presents pseudo-code for an algo- hidden-state structure prediction models, as it
rithm learning operators of a SHAG from train- involves finding the most likely tree structure
ing head-modifier sequences using this spectral while summing out over hidden states. We use
method. Note that each operator model in the a common approximation to MAP based on first
computing posterior marginals of tree edges (i.e.
3
To see that equations (11-13) define a model for P, one dependencies) and then maximizing over the
must first see that the matrix M = F (V > )+ is invertible
with inverse M 1 = U > B. Using this and recalling that
tree structure (see (Park and Darwiche, 2004)
p1 = B1 , Pa = BAa F , p> >
= F , one obtains that:
for complexity of general MAP inference and
approximations). For parsing, this strategy is
1 = U > B1 = M 1 1 ,
> >
sometimes known as MBR decoding; previous
= F (U > BF )+ =
>
M ,
work has shown that empirically it gives good
Aa = U > BAa F (U > BF )+ = M 1 Aa M .
performance (Goodman, 1996; Clark and Cur-
Finally: ran, 2004; Titov and Henderson, 2006; Petrov
> and Klein, 2007). In our case, we use the
P(x1:T ) = AxT Ax1 1
>
non-deterministic SHAG to compute posterior
= M M 1 AxT M M 1 Ax1 M M 1 1
>
marginals of dependencies. We first explain the
= AxT Ax1 1
general strategy of MBR decoding, and then
present an algorithm to compute marginals.
Let (si , sj ) denote a dependency between head and Satta (1999), we use decoding structures re-
word i and modifier word j. The posterior lated to complete half-constituents (or triangles,
or marginal probability of a dependency (si , sj ) denoted C) and incomplete half-constituents (or
given a sentence s0:N is defined as trapezoids, denoted I), each decorated with a di-
X rection (denoted L and R). We assume familiarity
i,j = P((si , sj ) | s0:N ) = P(y) . with their algorithm.
I,R
yY(s0:N ) : (si ,sj )y We define i,j Rn as the inside score-vector
of a right trapezoid dominated by dependency
To compute marginals, the sum over derivations (si , sj ),
can be decomposed into a product of inside and
outside quantities (Baker, 1979). Below we de-
X
I,R
i,j = P(y 0 )si ,R (x1:t ) . (15)
scribe an inside-outside algorithm for our gram- yY(si:j ) : (si ,sj )y ,
mars. Given a sentence s0:N and marginal scores y={hsi ,R,x1:t i} y 0 , xt =sj
i,j , we compute the parse tree for s0:N as
The term P(y 0 ) is the probability of head-modifier
sequences in the range si:j that do not involve
X
y = argmax log i,j (14)
yY(s0:N ) (s ,s )y
i j
si . The term si ,R (x1:t ) is a forward state-
distribution vector the qth coordinate of the
using the standard projective parsing algorithm vector is the probability that si generates right
for arc-factored models (Eisner, 2000). Overall modifiers x1:t and remains at state q. Similarly,
I,R
we use a two-pass parsing process, first to com- we define i,j Rn as the outside score-vector
pute marginals and then to compute the best tree. of a right trapezoid, as
X
4.1 An Inside-Outside Algorithm I,R
i,j = P(y 0 ) si ,R (xt+1:T ) , (16)
In this section we sketch an algorithm to com- yY(s0:i sj:n ) : root(y)=s0 ,
y={hsi ,R,xt:T i} y 0 , xt =sj
pute marginal probabilities of dependencies. Our
algorithm is an adaptation of the parsing algo- where si ,R (xt+1:T ) Rn is a backward state-
rithm for SHAG by Eisner and Satta (1999) to distribution vector the qth coordinate is the
the case of non-deterministic head-automata, and probability of being at state q of the right au-
has a runtime cost of O(n2 N 3 ), where n is the tomaton of si and generating xt+1:T . Analogous
number of states of the model, and N is the inside-outside expressions can be defined for the
length of the input sentence. Hence the algorithm rest of structures (left/right triangles and trape-
maintains the standard cubic cost on the sentence zoids). With these quantities, we can compute
length, while the quadratic cost on n is inher- marginals as
ent to the computations defined by our model in
( I , R > I , R 1
Eq. (3). The main insight behind our extension (i,j ) i,j Z if i < j ,
is that, because the computations of our model in- i,j = I,L > I,L 1 (17)
(i,j ) i,j Z if j < i ,
volve state-distribution vectors, we need to extend
the standard inside/outside quantities to be in the P ?,R > C , R
form of such state-distribution quantities.4 where Z = yY(s0:N) P(y) = ( ) 0,N .
Throughout this section we assume a fixed sen- Finally, we sketch the equations for computing
tence s0:N . Let Y(xi:j ) be the set of derivations inside scores in O(N 3 ) time. The outside equa-
that yield a subsequence xi:j . For a derivation y, tions can be derived analogously (see (Paskin,
we use root(y) to indicate the root word of it, 2001)). For 0 i < j N :
and use (xi , xj ) y to refer a dependency in y C,R
i,i = 1si ,R (18)
from head xi to modifier xj . Following Eisner
j
X  
4 C,R I,R sk , R > C , R
Technically, when working with the projected operators i,j = i,k ( ) k,j (19)
the state-distribution vectors will not be distributions in the k=i+1
formal sense. However, they correspond to a projection of a j
state distribution, for some projection that we do not recover
 
s ,L
X
from data (namely M 1 in footnote 3). This projection has
I,R
i,j = C,R
Assij,R i,k (j )> k+1,j
C,L
(20)
no effect on the computations because it cancels out. k=i
5 Experiments 82

80
The goal of our experiments is to show that in-

unlabeled attachment score


corporating hidden states in a SHAG using oper- 78

ator models can consistently improve parsing ac- 76


curacy. A second goal is to compare the spec-
74
tral learning algorithm to EM, a standard learning
72 Det
method that also induces hidden states. Det+F
Spectral
The first set of experiments involve fully unlex- 70 EM (5)
EM (10)
icalized models, i.e. parsing part-of-speech tag se- 68
EM (25)
EM (100)
quences. While this setting falls behind the state- 2 4 6 8 10 12 14
of-the-art, it is nonetheless valid to analyze empir- number of states

ically the effect of incorporating hidden states via


operator models, which results in large improve- Figure 1: Accuracy curve on English development set
ments. In a second set of experiments, we com- for fully unlexicalized models.
bine the unlexicalized hidden-state models with
simple lexicalized models. Finally, we present created a diagonal matrix Oah,d Rnn ,
some analysis of the automaton learned by the where Oah,d (i, i) is the probability of gener-
spectral algorithm to see the information that is ating symbol a from h and d (estimated from
captured in the hidden state space. training); (4) we set A bh,d h,d
a = T Oa .

5.1 Fully Unlexicalized Grammars We trained SHAG models using the standard
We trained fully unlexicalized dependency gram- WSJ sections of the English Penn Treebank (Mar-
mars from dependency treebanks, that is, X are cus et al., 1994). Figure 1 shows the Unlabeled
PoS tags and we parse PoS tag sequences. In Attachment Score (UAS) curve on the develop-
all cases, our modifier sequences include special ment set, in terms of the number of hidden states
START and STOP symbols at the boundaries. 5 6 for the spectral and EM models. We can see
We compare the following SHAG models: that D ET +F largely outperforms D ET7 , while the
hidden-state models obtain much larger improve-
D ET: a baseline deterministic grammar with
ments. For the EM model, we show the accuracy
a single state.
curve after 5, 10, 25 and 100 iterations.8
D ET +F: a deterministic grammar with two
In terms of peak accuracies, EM gives a slightly
states, one emitting the first modifier of a
better result than the spectral method (80.51% for
sequence, and another emitting the rest (see
EM with 15 states versus 79.75% for the spectral
(Eisner and Smith, 2010) for a similar deter-
method with 9 states). However, the spectral al-
ministic baseline).
gorithm is much faster to train. With our Matlab
S PECTRAL: a non-deterministic grammar implementation, it took about 30 seconds, while
with n hidden states trained with the spectral each iteration of EM took from 2 to 3 minutes,
algorithm. n is a parameter of the model. depending on the number of states. To give a con-
EM: a non-deterministic grammar with n crete example, to reach an accuracy close to 80%,
states trained with EM. Here, we estimate there is a factor of 150 between the training times
operators hb 1 , bh,d
b , A a i using forward- of the spectral method and EM (where we com-
backward for the E step. To initialize, we pare the peak performance of the spectral method
mimicked an HMM initialization: (1) we set versus EM at 25 iterations with 13 states).

b1 and b randomly; (2) we created a ran- 7
dom transition matrix T Rnn ; (3) we For parsing with deterministic SHAG we employ MBR
inference, even though Viterbi inference can be performed
5
Even though the operators 1 and of a PNFA ac- exactly. In experiments on development data D ET improved
count for start and stop probabilities, in preliminary experi- from 62.65% using Viterbi to 68.52% using MBR, and
ments we found that having explicit START and STOP sym- D ET +F improved from 72.72% to 74.80%.
8
bols results in more accurate models. We ran EM 10 times under different initial conditions
6
Note that, for parsing, the operators for the START and and selected the run that gave the best absolute accuracy after
STOP symbols can be packed into 1 and respectively. 100 iterations. We did not observe significant differences
One just defines 10 = ASTART 1 and 0>
= >
ASTOP . between the runs.
D ET D ET +F S PECTRAL EM 86
WSJ 69.45% 75.91% 80.44% 81.68%
84

unlabeled attachment score


82
Table 1: Unlabeled Attachment Score of fully unlexi-
calized models on the WSJ test set. 80

78

Table 1 shows results on WSJ test data, se- 76


Lex
lecting the models that obtain peak performances Lex+F
Lex+FCP
74
in development. We observe the same behavior: Lex + Spectral
Lex+F + Spectral
72 Lex+FCP + Spectral
hidden-states largely improve over deterministic
baselines, and EM obtains a slight improvement 2 3 4 5 6 7 8 9 10
number of states
over the spectral algorithm. Comparing to previ-
ous work on parsing WSJ PoS sequences, Eisner Figure 2: Accuracy curve on English development set
and Smith (2010) obtained an accuracy of 75.6% for lexicalized models.
using a deterministic SHAG that uses informa-
tion about dependency lengths. However, they
coarse level conditions on {th , d, }. For PB we
used Viterbi inference, which we found to per-
use three levels, which from fine to coarse are
form worse than MBR inference (see footnote 7).
{ta , wh , d, }, {ta , th , d, } and {ta }. We follow
5.2 Experiments with Lexicalized Collins (1999) to estimate PA and PB from a tree-
Grammars bank using a back-off strategy.
We now turn to combining lexicalized determinis- We use a simple approach to combine lexical
tic grammars with the unlexicalized grammars ob- models with the unlexical hidden-state models we
tained in the previous experiment using the spec- obtained in the previous experiment. Namely, we
tral algorithm. The goal behind this experiment use a log-linear model that computes scores for
is to show that the information captured in hidden head-modifier sequences as
states is complimentary to head-modifier lexical
preferences. s(hh, d, x1:T i) = log Psp (x1:T |h, d) (21)
In this case X consists of lexical items, and we + log Pdet (x1:T |h, d) ,
assume access to the PoS tag of each lexical item.
We will denote as ta and wa the PoS tag and word where Psp and Pdet are respectively spectral and
of a symbol a X . We will estimate condi- deterministic probabilistic models. We tested
tional distributions P(a | h, d, ), where a X combinations of each deterministic model with
is a modifier, h X is a head, d is a direction, the spectral unlexicalized model using different
and is a deterministic state. Following Collins number of states. Figure 2 shows the accuracies of
(1999), we use three configurations of determin- single deterministic models, together with combi-
istic states: nations using different number of states. In all
cases, the combinations largely improve over the
L EX: a single state. purely deterministic lexical counterparts, suggest-
L EX +F: two distinct states for first modifier ing that the information encoded in hidden states
and rest of modifiers. is complementary to lexical preferences.
L EX +FCP: four distinct states, encoding:
first modifier, previous modifier was a coor- 5.3 Results Analysis
dination, previous modifier was punctuation, We conclude the experiments by analyzing the
and previous modifier was some other word. state space learned by the spectral algorithm.
To estimate P we use a back-off strategy: Consider the space Rn where the forward-state
vectors lie. Generating a modifier sequence corre-
P(a|h, d, ) = PA (ta |h, d, )PB (wa |ta , h, d, ) sponds to a path through the n-dimensional state
space. We clustered sets of forward-state vectors
To estimate PA we use two back-off levels, in order to create a DFA that we can use to visu-
the fine level conditions on {wh , d, } and the alize the phenomena captured by the state space.
nns ments in accuracy with respect to the baselines.
STOP
, I A DFA for the automaton (NN, LEFT) is shown
cc in Figure 3. The vectors were originally divided
prp$ vbg jjs
rb vbn pos $ nn in ten clusters, but the DFA construction required
jjr nnp cd
jj in dt cd two state mergings, leading to a eight state au-
9 2 tomaton. The state named I is the initial state.
prp$ nn pos
nn
$ nnp
jj dt nnp Clearly, we can see that there are special states
cc for punctuation (state 9) and coordination (states
, , STOP 1 and 5). States 0 and 2 are harder to interpret.
nn

1 0 cc To understand them better, we computed an esti-


5 mation of the probabilities of the transitions, by
cd nns
cc prp$ rb pos counting the number of times each of them is
jj dt nnp
used. We found that our estimation of generating
STOP
3 STOP from state 0 is 0.67, and from state 2 it is
STOP 0.15. Interestingly, state 2 can transition to state 0
generating prp$, POS or DT, that are usual end-
ings of modifier sequences for nouns (recall that
7 modifiers are generated head-outwards, so for a
left automaton the final modifier is the left-most
Figure 3: DFA approximation for the generation of NN modifier in the sentence).
left modifier sequences.
6 Conclusion

To build a DFA, we computed the forward vec- Our main contribution is a basic tool for inducing
tors corresponding to frequent prefixes of modi- sequential hidden structure in dependency gram-
fier sequences of the development set. Then, we mars. Most of the recent work in dependency
clustered these vectors using a Group Average parsing has explored explicit feature engineering.
Agglomerative algorithm using the cosine simi- In part, this may be attributed to the high cost of
larity measure (Manning et al., 2008). This simi- using tools such as EM to induce representations.
larity measure is appropriate because it compares Our experiments have shown that adding hidden-
the angle between vectors, and is not affected by structure improves parsing accuracy, and that our
their magnitude (the magnitude of forward vec- spectral algorithm is highly scalable.
tors decreases with the number of modifiers gen- Our methods may be used to enrich the rep-
erated). Each cluster i defines a state in the DFA, resentational power of more sophisticated depen-
and we say that a sequence x1:t is in state i if its dency models. For example, future work should
corresponding forward vector at time t is in clus- consider enhancing lexicalized dependency gram-
ter i. Then, transitions in the DFA are defined us- mars with hidden states that summarize lexical
ing a procedure that looks at how sequences tra- dependencies. Another line for future research
verse the states. If a sequence x1:t is at state i at should extend the learning algorithm to be able
time t 1, and goes to state j at time t, then we to capture vertical hidden relations in the depen-
define a transition from state i to state j with la- dency tree, in addition to sequential relations.
bel xt . This procedure may require merging states Acknowledgements We are grateful to Gabriele
to give a consistent DFA, because different se- Musillo and the anonymous reviewers for providing us
quences may define different transitions for the with helpful comments. This work was supported by
same states and modifiers. After doing a merge, a Google Research Award and by the European Com-
new merges may be required, so the procedure mission (PASCAL2 NoE FP7-216886, XLike STREP
must be repeated until a DFA is obtained. FP7-288342). Borja Balle was supported by an FPU
fellowship (AP2008-02064) of the Spanish Ministry
For this analysis, we took the spectral model of Education. The Spanish Ministry of Science and
with 9 states, and built DFA from the non- Innovation supported Ariadna Quattoni (JCI-2009-
deterministic automata corresponding to heads 04240) and Xavier Carreras (RYC-2008-02223 and
and directions where we saw largest improve- KNOW2 TIN2009-14715-C04-04).
References Daniel Hsu, Sham M. Kakade, and Tong Zhang. 2009.
A spectral algorithm for learning hidden markov
Raphael Bailly. 2011. Quadratic weighted automata:
models. In COLT 2009 - The 22nd Conference on
Spectral algorithm and likelihood maximization.
Learning Theory.
JMLR Workshop and Conference Proceedings
Gabriel Infante-Lopez and Maarten de Rijke. 2004.
ACML.
Alternative approaches for generating bodies of
James K. Baker. 1979. Trainable grammars for speech
grammar rules. In Proceedings of the 42nd Meet-
recognition. In D. H. Klatt and J. J. Wolf, editors,
ing of the Association for Computational Lin-
Speech Communication Papers for the 97th Meeting
Combining Tree Structures, Flat Features and Patterns
for Biomedical Relation Extraction

Md. Faisal Mahbub Chowdhury and Alberto Lavelli

Fondazione Bruno Kessler (FBK-irst), Italy
University of Trento, Italy
{chowdhury,lavelli}@fbk.eu
Abstract

Kernel based methods dominate the current trend for various relation extraction tasks including protein-protein interaction (PPI) extraction. PPI information is critical in understanding biological processes. Despite considerable efforts, previously reported PPI extraction results show that none of the approaches already known in the literature is consistently better than the other approaches when evaluated on different benchmark PPI corpora. In this paper, we propose a novel hybrid kernel that combines (automatically collected) dependency patterns, trigger words, negative cues, walk features and regular expression patterns along with a tree kernel and a shallow linguistic kernel. The proposed kernel outperforms the existing state-of-the-art approaches on the BioInfer corpus, the largest PPI benchmark corpus available. On the other four smaller benchmark corpora, it performs either better or almost as well as the existing approaches. Moreover, empirical results show that the proposed hybrid kernel attains considerably higher precision than the existing approaches, which indicates its capability of learning more accurate models. This also demonstrates that the different types of information that we use are able to complement each other for relation extraction.

1 Introduction

Kernel methods are considered the most effective techniques for various relation extraction (RE) tasks on both general (e.g. newspaper text) and specialized (e.g. biomedical text) domains. In particular, as the importance of syntactic structures for deriving the relationships between entities in text has been growing, several graph and tree kernels have been designed and experimented with.

Early RE approaches more or less fall into one of the following categories: (i) exploitation of statistics about co-occurrences of entities, (ii) usage of patterns and rules, and (iii) usage of flat features to train machine learning (ML) classifiers. These approaches have been studied for a long period and have their own pros and cons. Exploitation of co-occurrence statistics results in high recall but low precision, while rule or pattern based approaches can increase precision but suffer from low recall. Flat feature based ML approaches employ various kinds of linguistic, syntactic or contextual information and integrate them into the feature space. They obtain relatively good results but are hindered by the drawbacks of a limited feature space and excessive feature engineering. Kernel based approaches have become an attractive alternative, as they can exploit a huge amount of features without an explicit representation.

In this paper, we propose a new hybrid kernel for RE. We apply the kernel to protein-protein interaction (PPI) extraction, the most widely researched topic in biomedical relation extraction. PPI^1 information is very critical in understanding biological processes. Considerable progress has been made for this task. Nevertheless, empirical results of previous studies show that none of the approaches already known in the literature is consistently better than the other approaches when evaluated on different benchmark PPI corpora (see Table 4). This demands further study and innovation

^1 PPIs occur when two or more proteins bind together, and are integral to virtually all cellular processes, such as metabolism, signalling, regulation, and proliferation (Tikk et al., 2010).
of new approaches that are sensitive to the variations of complex linguistic constructions.

The proposed hybrid kernel is the composition of one tree kernel and two feature based kernels (one of them is already known in the literature and the other is proposed in this paper for the first time). The novelty of the newly proposed feature based kernel is that it aims to accommodate the advantages of pattern based approaches. More precisely:

1. We propose a new feature based kernel (details in Section 4.1) by using syntactic dependency patterns, trigger words, negative cues, regular expression (henceforth, regex) patterns and walk features (i.e. e-walks and v-walks)^2.

2. The syntactic dependency patterns are automatically collected from a type of dependency subgraph (we call it "reduced graph", more details in Section 4.1.1) during run-time.

3. We only use the regex patterns, trigger words and negative cues mentioned in the literature (Ono et al., 2001; Fundel et al., 2007; Bui et al., 2010). The objective is to verify whether we can exploit knowledge which is already known and used.

4. We propose a hybrid kernel by combining the proposed feature based kernel (outlined above) with the Shallow Linguistic (SL) kernel (Giuliano et al., 2006) and the Path-enclosed Tree (PET) kernel (Moschitti, 2004).

The aim of our work is to take advantage of different types of information (i.e., dependency patterns, regex patterns, trigger words, negative cues, syntactic dependencies among words and constituent parse trees) and their different representations (i.e. flat features, tree structures and graphs) which can complement each other to learn more accurate models.

The remainder of the paper is organized as follows. In Section 2, we briefly review previous work. Section 3 lists the datasets. Then, in Section 4, we define our proposed hybrid kernel and describe its individual component kernels. Section 5 outlines the experimental settings. Following that, empirical results are discussed in Section 6. Finally, we conclude with a summary of our study as well as suggestions for further improvement of our approach.

2 Related Work

In this section, we briefly discuss some of the recent work on PPI extraction. Several RE approaches have been reported to date for the PPI task, most of which are kernel based methods. Tikk et al. (2010) reported a benchmark evaluation of various kernels on PPI extraction. An interesting finding is that the Shallow Linguistic (SL) kernel (Giuliano et al., 2006) (to be discussed in Section 4.2), despite its simplicity, is on par with the best kernels in most of the evaluation settings.

Kim et al. (2010) proposed a walk-weighted subsequence kernel using e-walks, partial matches, non-contiguous paths, and different weights for different sub-structures (which are used to capture structural similarities during kernel computation).

Miwa et al. (2009a) proposed a hybrid kernel, which combines the all-paths graph (APG) kernel (Airola et al., 2008), the bag-of-words kernel, and the subset tree kernel (Moschitti, 2006) (applied on the shortest dependency paths between target protein pairs). They used multiple parser inputs. The system is regarded as the current state-of-the-art PPI extraction system because of its high results on different PPI corpora (see the results in Table 4).

As an extension of their work, they boosted system performance by training on multiple PPI corpora instead of on a single corpus and adopting a corpus weighting concept with support vector machines (SVM), which they call SVM-CW (Miwa et al., 2009b). Since most of their results are reported by training on the combination of multiple corpora, it is not possible to compare them directly with the results published in the other related works (which usually adopt 10-fold cross validation on a single PPI corpus). To be comparable with the vast majority of the existing work, we also report results using 10-fold cross validation

^2 The syntactic dependencies of the words of a sentence create a dependency graph. A v-walk feature consists of (word_i - dependency type_{i,i+1} - word_{i+1}), and an e-walk feature is composed of (dependency type_{i-1,i} - word_i - dependency type_{i,i+1}). Note that, in a dependency graph, the words are nodes while the dependency types are edges.
on single corpora.

Corpus   | Sentences | Positive pairs | Negative pairs
BioInfer | 1,100     | 2,534          | 7,132
AIMed    | 1,955     | 1,000          | 4,834
IEPA     | 486       | 335            | 482
HPRD50   | 145       | 163            | 270
LLL      | 77        | 164            | 166

Table 1: Basic statistics of the 5 benchmark PPI corpora.

Apart from the approaches described above, there also exist other studies that used kernels for PPI extraction (e.g. the subsequence kernel (Bunescu and Mooney, 2006)).

A notable exception is the work published by Bui et al. (2010). They proposed an approach that consists of two phases. In the first phase, their system categorizes the data into different groups (i.e. subsets) based on various properties and patterns. Later they classify candidate PPI pairs inside each of the groups using SVM trained with features specific for the corresponding group.

3 Data

There are 5 benchmark corpora for the PPI task that are frequently used: HPRD50 (Fundel et al., 2007), IEPA (Ding et al., 2002), LLL (Nedellec, 2005), BioInfer (Pyysalo et al., 2007) and AIMed (Bunescu et al., 2005). These corpora adopt different PPI annotation formats. For a comparative evaluation Pyysalo et al. (2008) put all of them in a common format, which has become the standard evaluation format for the PPI task. In our experiments, we use the versions of the corpora converted to such format. Table 1 shows various statistics regarding the 5 (converted) corpora.

4 Proposed Hybrid Kernel

The hybrid kernel that we propose is as follows:

K_Hybrid(R1, R2) = K_TPWF(R1, R2) + K_SL(R1, R2) + w * K_PET(R1, R2)

where K_TPWF stands for the new feature based kernel (henceforth, TPWF kernel) computed using flat features collected by exploiting patterns, trigger words, negative cues and walk features. K_SL and K_PET stand for the Shallow Linguistic (SL) kernel and the Path-enclosed Tree (PET) kernel respectively. w is a multiplicative constant used for the PET kernel. It allows the hybrid kernel to assign more (or less) weight to the information obtained using tree structures depending on the corpus. The proposed hybrid kernel is valid according to the closure properties of kernels.

Both the TPWF and SL kernels are linear kernels, while the PET kernel is computed using the Unlexicalized Partial Tree (uPT) kernel (Severyn and Moschitti, 2010). The following subsections explain each of the individual kernels in more detail.
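Because the hybrid kernel is simply a sum of its component kernels, it can be assembled directly from their Gram matrices. The short Python sketch below is only an illustration of this composition (the authors instead modified SVM-LIGHT-TK); the matrices and names are placeholders, and the validity of the sum follows from the closure properties mentioned above.

import numpy as np

def hybrid_kernel(K_tpwf, K_sl, K_pet, w=1.0):
    """Combine the component kernels as K_TPWF + K_SL + w * K_PET.

    K_tpwf, K_sl, K_pet: precomputed Gram matrices (n x n) over the same set
    of candidate relation instances; w: multiplicative constant for the PET kernel.
    """
    return K_tpwf + K_sl + w * K_pet

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 10))
    K_lin = X @ X.T                      # placeholder for a linear feature-based kernel
    K = hybrid_kernel(K_lin, K_lin, K_lin, w=0.5)
    print(K.shape)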
                           | BioInfer (P/R/F)   | AIMed (P/R/F)      | IEPA (P/R/F)       | HPRD50 (P/R/F)     | LLL (P/R/F)
Only walk features         | 51.8 / 71.2 / 60.0 | 48.7 / 63.2 / 55.0 | 61.0 / 75.2 / 67.4 | 60.2 / 65.0 / 62.5 | 64.6 / 87.8 / 74.4
Features: dep. patterns,   | 53.8 / 68.8 / 60.4 | 50.6 / 63.9 / 56.5 | 63.9 / 74.6 / 68.9 | 65.0 / 71.8 / 68.2 | 66.5 / 89.6 / 76.4
trigger, neg. cues, walks  |                    |                    |                    |                    |
Features: dep. patterns,   | 53.5 / 68.6 / 60.1 | 52.5 / 62.9 / 57.2 | 63.8 / 74.6 / 68.8 | 65.1 / 69.9 / 67.5 | 67.4 / 88.4 / 76.5
trigger, neg. cues, walks, |                    |                    |                    |                    |
regex patterns             |                    |                    |                    |                    |

Table 2: Results of the proposed TPWF feature based kernel on the 5 benchmark PPI corpora before and after adding features collected using dependency patterns, regex patterns, trigger words and negative cues to the walk features. The TPWF kernel is a component of the new hybrid kernel.

4.1 Proposed TPWF Kernel

4.1.1 Reduced graph, trigger words, negative cues and dependency patterns

For each of the candidate entity pairs, we construct a type of subgraph from the dependency graph formed by the syntactic dependencies among the words of a sentence. We call it "reduced graph" and define it in the following way:

A reduced graph is a subgraph of the dependency graph of a sentence which includes:
- the two candidate entities and their governor nodes up to their least common governor (if it exists);
- the dependent nodes (if they exist) of all the nodes added in the previous step;
- the immediate governor(s) (if it exists) of the least common governor.

Figure 1: Dependency graph for the sentence "A pVHL mutant containing a P154L substitution does not promote degradation of HIF1-Alpha" generated by the Stanford parser. The edges with blue dots form the smallest common subgraph for the candidate entity pair pVHL and HIF1-Alpha, while the edges with red dots form the reduced graph for the pair. [Figure not reproducible in this text version.]

Figure 1 shows an example of a reduced graph. A reduced graph is an extension of the smallest common subgraph of the dependency graph that aims at overcoming its limitations. It is a known issue that the smallest common subgraph (or subtree) sometimes does not contain cue words. Previously, Chowdhury et al. (2011a) proposed a linguistically motivated extension of the minimal (i.e. smallest) common subtree (which includes the candidate entity pairs), known as the Mildly Extended Dependency Tree (MEDT). However, the rules used for MEDT are too constrained. Our objective in constructing the reduced graph is to include any potential modifier(s) or cue word(s) that describes the relation between the given pair of entities. Sometimes such modifiers or cue words are not directly dependent (syntactically) on any of the entities (of the candidate pair). Rather they are dependent on some other word(s) which is dependent on one (or both) of the entities. The word "not" in Figure 1 is one such example. The reduced graph aims to preserve these cue words.

The following types of features are collected from the reduced graph of a candidate pair:

1. HasTriggerWord: whether the least common governor(s) of the target entity pair inside the reduced graph matches any trigger word.

2. Trigger-X: whether the least common governor(s) of the target entity pair inside the reduced graph matches the trigger word X.

3. HasNegWord: whether the reduced graph contains any negative word.

4. DepPattern-i: whether the reduced graph contains all the syntactic dependencies of the i-th pattern of the dependency pattern list.

The dependency pattern list is automatically constructed from the training data during the learning phase. Each pattern is a set of syntactic dependencies of the corresponding reduced graph of a (positive or negative) entity pair in the training data. For example, the dependency pattern for the reduced graph in Figure 1 is {det, amod, partmod, nsubj, aux, neg, dobj, prep_of}. The same dependency pattern might be constructed for multiple (positive or negative) entity pairs. However, if it is constructed for both positive and negative pairs, it has to be discarded from the pattern list.

The dependency patterns allow some kind of underspecification, as they do not contain the lexical items (i.e. words) but contain the likely combination of syntactic dependencies that a given related pair of entities would pose inside their reduced graph.

The list of trigger words contains 144 words previously used by Bui et al. (2010) and Fundel et al. (2007). The list of negative cues contains 18 words, most of which are mentioned in Fundel et al. (2007).
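To make the definitions above concrete, the following Python sketch shows one way the reduced graph and the HasTriggerWord, Trigger-X, HasNegWord and DepPattern-i features could be computed. It is a simplified illustration under stated assumptions (a cycle-free dependency structure given as governor/dependent/type triples, and tiny placeholder trigger and negative-cue lists), not the authors' implementation.

TRIGGER_WORDS = {"interact", "bind", "activate"}   # placeholder subset of the 144-word list
NEGATIVE_CUES = {"not", "no", "fail"}               # placeholder subset of the 18-word list

def governor_chain(deps, node):
    """Follow governor links from a node up to the root; deps = (gov, dep, type) triples."""
    heads = {d: g for g, d, t in deps}
    chain = [node]
    while chain[-1] in heads:
        chain.append(heads[chain[-1]])
    return chain

def reduced_graph(deps, e1, e2):
    """Return the node set of the reduced graph and the least common governor (LCG)."""
    chain1, chain2 = governor_chain(deps, e1), governor_chain(deps, e2)
    common = [n for n in chain1 if n in chain2]
    lcg = common[0] if common else None
    if lcg is not None:
        nodes = set(chain1[:chain1.index(lcg) + 1] + chain2[:chain2.index(lcg) + 1])
    else:
        nodes = {e1, e2}
    nodes |= {d for g, d, t in deps if g in nodes}      # dependents of the nodes added so far
    if lcg is not None:
        nodes |= {g for g, d, t in deps if d == lcg}    # immediate governor(s) of the LCG
    return nodes, lcg

def tpwf_flat_features(deps, words, e1, e2, pattern_list):
    """pattern_list: list of frozensets of dependency types (the dependency patterns)."""
    nodes, lcg = reduced_graph(deps, e1, e2)
    dep_set = {t for g, d, t in deps if g in nodes and d in nodes}
    feats = {}
    if lcg is not None and words[lcg].lower() in TRIGGER_WORDS:
        feats["HasTriggerWord"] = 1
        feats["Trigger-" + words[lcg].lower()] = 1
    if any(words[n].lower() in NEGATIVE_CUES for n in nodes):
        feats["HasNegWord"] = 1
    for i, pattern in enumerate(pattern_list):
        if pattern <= dep_set:                          # all dependencies of the pattern present
            feats["DepPattern-%d" % i] = 1
    return feats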
                         | BioInfer (P/R/F)   | AIMed (P/R/F)      | IEPA (P/R/F)       | HPRD50 (P/R/F)     | LLL (P/R/F)
Pos. / Neg.              | 2,534 / 7,132      | 1,000 / 4,834      | 335 / 482          | 163 / 270          | 164 / 166
Proposed TPWF kernel     | 53.8 / 68.8 / 60.4 | 50.6 / 63.9 / 56.5 | 63.9 / 74.6 / 68.9 | 65.0 / 71.8 / 68.2 | 66.5 / 89.6 / 76.4
(without regex)          |                    |                    |                    |                    |
Proposed TPWF kernel     | 53.5 / 68.6 / 60.1 | 52.5 / 62.9 / 57.2 | 63.8 / 74.6 / 68.8 | 65.1 / 69.9 / 67.5 | 67.4 / 88.4 / 76.5
(with regex)             |                    |                    |                    |                    |
SL kernel                | 60.8 / 65.8 / 63.2 | 56.2 / 64.4 / 60.0 | 73.3 / 71.9 / 72.6 | 62.0 / 65.0 / 63.5 | 74.9 / 85.4 / 79.8
PET kernel               | 72.8 / 74.9 / 73.9 | 44.8 / 72.8 / 55.5 | 70.7 / 77.9 / 74.2 | 65.0 / 73.0 / 68.8 | 72.1 / 89.6 / 79.9
Proposed hybrid kernel   | 80.0 / 71.4 / 75.5 | 64.2 / 58.2 / 61.1 | 81.1 / 69.3 / 74.7 | 72.9 / 59.5 / 65.5 | 70.4 / 95.7 / 81.1
(PET + SL + TPWF         |                    |                    |                    |                    |
(without regex))         |                    |                    |                    |                    |
Proposed hybrid kernel   | 80.1 / 72.0 / 75.9 | 64.4 / 58.3 / 61.2 | 79.3 / 69.6 / 74.1 | 71.9 / 61.4 / 66.2 | 70.6 / 95.1 / 81.0
(PET + SL + TPWF         |                    |                    |                    |                    |
(with regex))            |                    |                    |                    |                    |

Table 3: Results of the proposed hybrid kernel and its individual components. Pos. and Neg. refer to the number of positive and negative relations respectively. PET refers to the path-enclosed tree kernel, SL refers to the shallow linguistic kernel, and TPWF refers to the kernel computed using trigger, pattern, negative cue and walk features.

4.1.2 Walk features

We extract e-walk and v-walk features from the Mildly Extended Dependency Tree (MEDT) (Chowdhury et al., 2011a) of each candidate pair. Reduced graphs sometimes include some uninformative words which produce uninformative walk features. Hence, they are not suitable for walk feature generation. MEDT suits better for this purpose. The walk features extracted from MEDTs have the following properties:

- The directionality of the edges (or nodes) in an e-walk (or v-walk) is not considered. In other words, e.g., pos(stimulatory)-amod-pos(effects) and pos(effects)-amod-pos(stimulatory) are treated as the same feature.

- The v-walk features are of the form (pos_i - dependency type_{i,i+1} - pos_{i+1}). Here, pos_i is the POS tag of word_i, i is the governor node and i+1 is the dependent node.

- The e-walk features are of the form (dep. type_{i-1,i} - pos_i - dep. type_{i,i+1}) and (dep. type_{i-1,i} - lemma_i - dep. type_{i,i+1}). Here, lemma_i is the lemmatized form of word_i.

- Usually, the e-walk features are constructed using dependency types between {governor of X, node X} and {node X, dependent of X}. However, we also extract e-walk features from the dependency types between any two dependents and their common governor (i.e. {node X, dependent 1 of X} and {node X, dependent 2 of X}).

Apart from the above types of features, we also add features for the lemmas of the immediately preceding and following words of the candidate entities. These feature names are augmented with -1 or +1 depending on whether the corresponding word precedes or follows a candidate entity.
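The e-walk and v-walk features described above can be illustrated with the following minimal sketch, which assumes the dependency tree is given as (governor, dependent, type) triples and that POS tags and lemmas are available per token; it is not the authors' code.

def walk_features(deps, pos, lemma):
    feats = set()
    children = {}
    for g, d, t in deps:
        children.setdefault(g, []).append((d, t))
        # v-walk: governor POS - dependency type - dependent POS; the two POS tags
        # are sorted so that both directions collapse into the same feature.
        a, b = sorted([pos[g], pos[d]])
        feats.add(("v-walk", a, t, b))
    heads = {d: (g, t) for g, d, t in deps}
    for i in heads:
        g, t_in = heads[i]
        for d, t_out in children.get(i, []):
            # e-walk around node i: incoming type - POS/lemma of i - outgoing type
            feats.add(("e-walk", t_in, pos[i], t_out))
            feats.add(("e-walk", t_in, lemma[i], t_out))
    for g, kids in children.items():
        # e-walks between two dependents that share the same governor
        for d1, t1 in kids:
            for d2, t2 in kids:
                if d1 < d2:
                    feats.add(("e-walk-sib", t1, pos[g], t2))
    return feats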
4.1.3 Regular expression patterns

We use a set of 22 regex patterns as binary features. These patterns were previously used by Ono et al. (2001) and Bui et al. (2010). If there is a match for a pattern (e.g. "Entity_1.*activates.*Entity_2", where Entity_1 and Entity_2 form the candidate entity pair) in a given sentence, the value 1 is added for the feature (i.e., pattern) inside the feature vector.
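As an illustration of how such patterns act as binary features, consider the following sketch; the two patterns shown are invented examples in the spirit of the cited work, and the entity placeholders (Entity_1, Entity_2) are illustrative names rather than the exact blinding scheme.

import re

REGEX_PATTERNS = [
    r"Entity_1.*activates.*Entity_2",
    r"Entity_1.*interacts\s+with.*Entity_2",
]

def regex_features(sentence_with_blinded_entities):
    """One binary feature per pattern, for a sentence in which the two candidate
    entities have been replaced by the placeholders Entity_1 and Entity_2."""
    return {
        "Regex-%d" % i: 1 if re.search(p, sentence_with_blinded_entities) else 0
        for i, p in enumerate(REGEX_PATTERNS)
    }

# Example:
# regex_features("Entity_1 strongly activates Entity_2 in vitro")
# -> {"Regex-0": 1, "Regex-1": 0}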
                           | BioInfer (P/R/F)   | AIMed (P/R/F)      | IEPA (P/R/F)       | HPRD50 (P/R/F)     | LLL (P/R/F)
Pos. / Neg.                | 2,534 / 7,132      | 1,000 / 4,834      | 335 / 482          | 163 / 270          | 164 / 166
SL kernel                  | -                  | 60.9 / 57.2 / 59.0 | -                  | -                  | -
(Giuliano et al., 2006)    |                    |                    |                    |                    |
APG kernel                 | 56.7 / 67.2 / 61.3 | 52.9 / 61.8 / 56.4 | 69.6 / 82.7 / 75.1 | 64.3 / 65.8 / 63.4 | 72.5 / 87.2 / 76.8
(Airola et al., 2008)      |                    |                    |                    |                    |
Hybrid kernel and multiple | 65.7 / 71.1 / 68.1 | 55.0 / 68.8 / 60.8 | 67.5 / 78.6 / 71.7 | 68.5 / 76.1 / 70.9 | 77.6 / 86.0 / 80.1
parser input               |                    |                    |                    |                    |
(Miwa et al., 2009a)       |                    |                    |                    |                    |
SVM-CW, multiple parser    | F = 67.6           | F = 64.2           | F = 74.4           | F = 69.7           | F = 80.5
input and graph, walk and  |                    |                    |                    |                    |
BOW features               |                    |                    |                    |                    |
(Miwa et al., 2009b)       |                    |                    |                    |                    |
kBSPS kernel               | 49.9 / 61.8 / 55.1 | 50.1 / 41.4 / 44.6 | 58.8 / 89.7 / 70.5 | 62.2 / 87.1 / 71.0 | 69.3 / 93.2 / 78.1
(Tikk et al., 2010)        |                    |                    |                    |                    |
Walk weighted subsequence  | 61.8 / 54.2 / 57.6 | 61.4 / 53.3 / 56.6 | 73.8 / 71.8 / 72.9 | 66.7 / 69.2 / 67.8 | 76.9 / 91.2 / 82.4
kernel (Kim et al., 2010)  |                    |                    |                    |                    |
2 phase extraction         | 61.7 / 57.5 / 60.0 | 55.3 / 68.5 / 61.2 | -                  | -                  | -
(Bui et al., 2010)         |                    |                    |                    |                    |
Our proposed hybrid kernel | 80.0 / 71.4 / 75.5 | 64.2 / 58.2 / 61.1 | 81.1 / 69.3 / 74.7 | 72.9 / 59.5 / 65.5 | 70.4 / 95.7 / 81.1
(PET + SL + TPWF without   |                    |                    |                    |                    |
regex)                     |                    |                    |                    |                    |

Table 4: Comparison of the results on the 5 benchmark PPI corpora. Pos. and Neg. refer to the number of positive and negative relations respectively. The underlined numbers in the original table indicate the best results for the corresponding corpus reported by any of the existing state-of-the-art approaches. The results of Bui et al. (2010) on LLL, HPRD50, and IEPA are not reported since they did not use all the positive and negative examples during cross validation. Miwa et al. (2009b) showed that better results can be obtained using multiple corpora for training. However, we consider only those results of their experiments where they used a single training corpus, as it is the standard evaluation approach adopted by all the other studies on PPI extraction for comparing results. All the results of the previous approaches reported in this table are directly quoted from their respective original papers.

4.2 Shallow Linguistic (SL) Kernel

The Shallow Linguistic (SL) kernel was proposed by Giuliano et al. (2006). It is one of the best performing kernels applied to different biomedical RE tasks such as PPI and DDI (drug-drug interaction) extraction (Tikk et al., 2010; Segura-Bedmar et al., 2011; Chowdhury and Lavelli, 2011b; Chowdhury et al., 2011c). It is defined as follows:

K_SL(R1, R2) = K_LC(R1, R2) + K_GC(R1, R2)

where K_SL, K_GC and K_LC correspond to the SL, global context (GC) and local context (LC) kernels respectively. The GC kernel exploits contextual information of the words occurring before, between and after the pair of entities (to be investigated for RE) in the corresponding sentence, while the LC kernel exploits contextual information surrounding the individual entities.

4.3 Path-enclosed Tree (PET) Kernel

The path-enclosed tree (PET) kernel^3 was first proposed by Moschitti (2004) for semantic role labeling. It was later successfully adapted by Zhang et al. (2005) and other works for relation extraction on general texts (such as the newspaper domain). A PET is the smallest common subtree of a phrase structure tree that includes the two entities involved in a relation.

A tree kernel calculates the similarity between two input trees by counting the number of common sub-structures. Different techniques have been proposed to measure such similarity. We use the Unlexicalized Partial Tree (uPT) kernel (Severyn and Moschitti, 2010) for the computation of the PET kernel, since a comparative evaluation by Chowdhury et al. (2011a) reported that uPT kernels achieve better results for PPI extraction than the other techniques used for tree kernel computation.

^3 Also known as the shortest path-enclosed tree (SPT) kernel.
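The notion of a path-enclosed tree can be illustrated with the following sketch, which only finds the lowest node dominating both (blinded) entity tokens; the actual PET additionally prunes material outside the span between the two entities, which is omitted here for brevity. The tree representation and entity labels are illustrative assumptions, not the authors' data structures.

# Trees are nested tuples: (label, child1, child2, ...) with string leaves.

def contains(tree, target):
    if isinstance(tree, str):
        return tree == target
    return any(contains(child, target) for child in tree[1:])

def lowest_common_cover(tree, e1="ENTITY1", e2="ENTITY2"):
    """Return the lowest node of the phrase-structure tree dominating both entities."""
    if isinstance(tree, str):
        return None
    for child in tree[1:]:
        if contains(child, e1) and contains(child, e2):
            return lowest_common_cover(child, e1, e2)   # a lower node still covers both
    return tree

# Example:
# t = ("S", ("NP", "ENTITY1"), ("VP", "activates", ("NP", "ENTITY2")))
# lowest_common_cover(t) -> the S node, the smallest subtree containing both entities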
5 Experimental Settings

We have followed the same criteria commonly used for the PPI extraction tasks, i.e. abstract-wise 10-fold cross validation on each individual corpus and the one-answer-per-occurrence criterion. In fact, we have used exactly the same (abstract-wise) fold splitting of the 5 benchmark (converted) corpora used by Tikk et al. (2010) for benchmarking various kernel methods^4.

The Charniak-Johnson reranking parser (Charniak and Johnson, 2005), along with a self-trained biomedical parsing model (McClosky, 2010), has been used for tokenization, POS-tagging and parsing of the sentences. Before parsing the sentences, all the entities are blinded by assigning names as EntityX, where X is the entity index. In each example, the POS tags of the two candidate entities are changed to EntityX. The parse trees produced by the Charniak-Johnson reranking parser are then processed by the Stanford parser^5 (Klein and Manning, 2003) to obtain syntactic dependencies according to the Stanford Typed Dependency format.

The Stanford parser often skips some syntactic dependencies in its output. We use the following two rules to add some of such dependencies (a minimal sketch follows the list):

- If there is a conj_and or conj_or dependency between two words X and Y, then X should be dependent on any word Z on which Y is dependent, and vice versa.

- If there are two verbs X and Y such that inside the corresponding sentence they have only the word "and" or "or" between them, then any word Z dependent on X should also be dependent on Y, and vice versa.
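A minimal sketch of the first rule is shown below (the second rule is analogous but additionally needs POS tags and token positions, so it is omitted); dependencies are assumed to be (governor, dependent, type) triples, and this is an illustration rather than the authors' code.

def propagate_conj(deps):
    deps = set(deps)
    added = True
    while added:                      # repeat until no new dependency can be added
        added = False
        conj_pairs = {(g, d) for g, d, t in deps if t in ("conj_and", "conj_or")}
        for x, y in list(conj_pairs):
            for g, d, t in list(deps):
                # whatever governs one conjunct should also govern the other
                if d == y and (g, x, t) not in deps:
                    deps.add((g, x, t)); added = True
                if d == x and (g, y, t) not in deps:
                    deps.add((g, y, t)); added = True
    return deps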
Our system exploits SVM-LIGHT-TK^6 (Moschitti, 2006; Joachims, 1999). We made minor changes in the toolkit to compute the proposed hybrid kernel. The ratio of negative and positive examples has been used as the value of the cost-ratio-factor parameter. We have done parameter tuning following the approach described by Hsu et al. (2003).

^4 Downloaded from http://informatik.hu-berlin.de/forschung/gebiete/wbi/ppi-benchmark .
^5 http://nlp.stanford.edu/software/lex-parser.shtml
^6 http://disi.unitn.it/moschitti/Tree-Kernel.htm

6 Results and Discussion

To measure the contribution of the features collected from the reduced graphs (using dependency patterns, trigger words and negative cues) and regex patterns, we have applied the new TPWF kernel on the 5 PPI corpora before and after using these features. The results shown in Table 2 clearly indicate that the usage of these features improves the performance. The improvement is primarily due to the usage of dependency patterns, which resulted in higher precision for all the corpora.

We have also tried to measure the contribution of the regex patterns. However, from the empirical results a clear trend does not emerge (see Table 2).

Table 3 shows a comparison among the results of the proposed hybrid kernel and its individual components. As we can see, the overall results of the hybrid kernel (with and without using regex pattern features) are better than those of any of its individual component kernels. Interestingly, precision achieved on the 4 benchmark corpora (other than the smallest corpus, LLL) is much higher for the hybrid kernel than for the individual components. This strongly indicates that these different types of information (i.e. dependency patterns, regex patterns, triggers, negative cues, syntactic dependencies among words and constituent parse trees) and their different representations (i.e. flat features, tree structures and graphs) can complement each other to learn more accurate models.

Table 4 shows a comparison of the PPI extraction results of our proposed hybrid kernel with those of other state-of-the-art approaches. Since the contribution of the regex patterns to the performance of the hybrid kernel was not relevant (as Tables 2 and 3 show), we used the results of the proposed hybrid kernel without regex for the comparison. As we can see, the proposed kernel achieves significantly higher results on the BioInfer corpus, the largest benchmark PPI corpus (2,534 positive PPI pair annotations) available, than any of the existing approaches. Moreover, the results of the proposed hybrid kernel are on par with the state-of-the-art results on the other smaller corpora.

Furthermore, empirical results show that the proposed hybrid kernel attains considerably higher precision than the existing approaches.
Since a dependency pattern, by construction, contains all the syntactic dependencies inside the corresponding reduced graph, it may happen that some of the dependencies (e.g. det, or determiner) are not informative for predicting the class label (i.e., positive or negative relation) of the pattern. Their presence inside a pattern might make it unnecessarily rigid and less general. So, we tried to identify and discard such non-informative dependencies by measuring the probabilities of the dependencies with respect to the class label and then removing any of them which has a probability lower than a threshold (we tried with different threshold values). But doing so decreased the performance. This suggests that the syntactic dependencies of a dependency pattern are not independent of each other, even if some of them might have low probability (with respect to the class label) individually. We plan to further investigate whether there could be different criteria for identifying non-informative dependencies. For the work reported in this paper, we used the dependency patterns as they are initially constructed.

We also did experiments to see whether collecting features for trigger words from the whole reduced graph would help. But that also decreased performance. This suggests that trigger words are more likely to appear in the least common governors.

7 Conclusion

In this paper, we have proposed a new hybrid kernel for RE that combines two vector based kernels and a tree kernel. The proposed kernel outperforms any of the existing approaches by a wide margin on the BioInfer corpus, the largest PPI benchmark corpus available. On the other four smaller benchmark corpora, it performs either better or almost as well as the existing state-of-the-art approaches.

We have also proposed a novel feature based kernel, called the TPWF kernel, using (automatically collected) dependency patterns, trigger words, negative cues, walk features and regular expression patterns. The TPWF kernel is used as a component of the new hybrid kernel.

Empirical results show that the proposed hybrid kernel achieves considerably higher precision than the existing approaches, which indicates its capability of learning more accurate models. This also demonstrates that the different types of information that we use are able to complement each other for relation extraction.

We believe there are at least three ways to further improve the proposed approach. First of all, the 22 regular expression patterns (collected from Ono et al. (2001) and Bui et al. (2010)) are applied at the level of the sentence and this sometimes produces unwanted matches. For example, consider the sentence "X activates Y and inhibits Z", where X, Y, and Z are entities. The pattern "Entity1.*activates.*Entity2" matches both the X-Y and X-Z pairs in the sentence. But only the X-Y pair should be considered. So, the patterns should be constrained to reduce the number of unwanted matches. For example, they could be applied on smaller linguistic units than full sentences. Secondly, different techniques could be used to identify less informative syntactic dependencies inside dependency patterns to make them more accurate and effective. Thirdly, usage of automatically collected paraphrases of the regular expression patterns instead of the patterns themselves could also be helpful. Weakly supervised collection of paraphrases for RE has already been investigated (e.g. Romano et al. (2006)) and, hence, can be tried for improving the TPWF kernel (which is a component of the proposed hybrid kernel).

Acknowledgments

This work was carried out in the context of the project eOnco - Pervasive knowledge and data management in cancer care. The authors are grateful to Alessandro Moschitti for his help in the use of SVM-LIGHT-TK. We also thank the anonymous reviewers for helpful suggestions.

References

Antti Airola, Sampo Pyysalo, Jari Björne, Tapio Pahikkala, Filip Ginter, and Tapio Salakoski. 2008. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics, 9(Suppl 11):S2.

Quoc-Chinh Bui, Sophia Katrenko, and Peter M.A. Sloot. 2010. A hybrid approach to extract protein-protein interactions. Bioinformatics.

Razvan Bunescu and Raymond J. Mooney. 2006. Subsequence kernels for relation extraction. In Proceedings of NIPS 2006, pages 171-178.
Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong. 2005. Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intelligence in Medicine, 33(2):139-155.

Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and maxent discriminative reranking. In Proceedings of ACL 2005.

Md. Faisal Mahbub Chowdhury and Alberto Lavelli. 2011b. Drug-drug interaction extraction using composite kernels. In Proceedings of DDIExtraction2011: First Challenge Task: Drug-Drug Interaction Extraction, pages 27-33, Huelva, Spain, September.

Md. Faisal Mahbub Chowdhury, Alberto Lavelli, and Alessandro Moschitti. 2011a. A study on dependency tree kernels for automatic extraction of protein-protein interaction. In Proceedings of BioNLP 2011 Workshop, pages 124-133, Portland, Oregon, USA, June.

Md. Faisal Mahbub Chowdhury, Asma Ben Abacha, Alberto Lavelli, and Pierre Zweigenbaum. 2011c. Two different machine learning techniques for drug-drug interaction extraction. In Proceedings of DDIExtraction2011: First Challenge Task: Drug-Drug Interaction Extraction, pages 19-26, Huelva, Spain, September.

J. Ding, D. Berleant, D. Nettleton, and E. Wurtele. 2002. Mining MEDLINE: abstracts, sentences, or phrases? In Pacific Symposium on Biocomputing, pages 326-337.

Katrin Fundel, Robert Kuffner, and Ralf Zimmer. 2007. RelEx - relation extraction using dependency parse trees. Bioinformatics, 23(3):365-371.

Claudio Giuliano, Alberto Lavelli, and Lorenza Romano. 2006. Exploiting shallow linguistic information for relation extraction from biomedical literature. In Proceedings of EACL 2006, pages 401-408.

C.W. Hsu, C.C. Chang, and C.J. Lin. 2003. A practical guide to support vector classification. Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.

Thorsten Joachims. 1999. Making large-scale support vector machine learning practical. In Advances in kernel methods: support vector learning, pages 169-184. MIT Press, Cambridge, MA, USA.

Seonho Kim, Juntae Yoon, Jihoon Yang, and Seog Park. 2010. Walk-weighted subsequence kernels for protein-protein interaction extraction. BMC Bioinformatics, 11(1).

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of ACL 2003, pages 423-430, Sapporo, Japan.

David McClosky. 2010. Any Domain Parsing: Automatic Domain Adaptation for Natural Language Parsing. Ph.D. thesis, Department of Computer Science, Brown University.

Makoto Miwa, Rune Saetre, Yusuke Miyao, and Jun'ichi Tsujii. 2009a. Protein-protein interaction extraction by leveraging multiple kernels and parsers. International Journal of Medical Informatics, 78.

Makoto Miwa, Rune Saetre, Yusuke Miyao, and Jun'ichi Tsujii. 2009b. A rich feature vector for protein-protein interaction extraction from multiple corpora. In Proceedings of EMNLP 2009, pages 121-130, Singapore.

Alessandro Moschitti. 2004. A study on convolution kernels for shallow semantic parsing. In Proceedings of ACL 2004, Barcelona, Spain.

Alessandro Moschitti. 2006. Making tree kernels practical for natural language learning. In Proceedings of EACL 2006, Trento, Italy.

Claire Nedellec. 2005. Learning language in logic - genic interaction extraction challenge. In Proceedings of the ICML 2005 workshop: Learning Language in Logic (LLL05), pages 31-37.

Toshihide Ono, Haretsugu Hishigaki, Akira Tanigami, and Toshihisa Takagi. 2001. Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics, 17(2):155-161.

Sampo Pyysalo, Filip Ginter, Juho Heimonen, Jari Björne, Jorma Boberg, Jouni Jarvinen, and Tapio Salakoski. 2007. BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics, 8(1):50.

Sampo Pyysalo, Antti Airola, Juho Heimonen, Jari Björne, Filip Ginter, and Tapio Salakoski. 2008. Comparative analysis of five protein-protein interaction corpora. BMC Bioinformatics, 9(Suppl 3):S6.

Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli. 2006. Investigating a generic paraphrase-based approach for relation extraction. In Proceedings of EACL 2006, pages 409-416.

Isabel Segura-Bedmar, Paloma Martinez, and Cesar de Pablo-Sanchez. 2011. Using a shallow linguistic kernel for drug-drug interaction extraction. Journal of Biomedical Informatics. In Press, Corrected Proof, available online 24 April.

Aliaksei Severyn and Alessandro Moschitti. 2010. Fast cutting plane training for structural kernels. In Proceedings of ECML-PKDD 2010.

Domonkos Tikk, Philippe Thomas, Peter Palaga, Jörg Hakenberg, and Ulf Leser. 2010. A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature. PLoS Computational Biology, 6(7), July.

Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan. 2005. Discovering relations between named entities from a large raw corpus using tree similarity-based clustering. In Natural Language Processing - IJCNLP 2005, volume 3651 of Lecture Notes in Computer Science, pages 378-389. Springer Berlin / Heidelberg.
Coordination Structure Analysis using Dual Decomposition

Atsushi Hanamoto (1), Takuya Matsuzaki (1), Junichi Tsujii (2)

1. Department of Computer Science, University of Tokyo, Japan
2. Web Search & Mining Group, Microsoft Research Asia, China
{hanamoto, matuzaki}@is.s.u-tokyo.ac.jp
jtsujii@microsoft.com
Abstract

Coordination disambiguation remains a difficult sub-problem in parsing despite the frequency and importance of coordination structures. We propose a method for disambiguating coordination structures. In this method, dual decomposition is used as a framework to take advantage of both HPSG parsing and coordinate structure analysis with alignment-based local features. We evaluate the performance of the proposed method on the Genia corpus and the Wall Street Journal portion of the Penn Treebank. Results show it increases the percentage of sentences in which coordination structures are detected correctly, compared with each of the two algorithms alone.

1 Introduction

Coordination structures often give rise to syntactic ambiguity in natural language. Although a wrong analysis of a coordination structure often leads to a totally garbled parsing result, coordination disambiguation remains a difficult sub-problem in parsing, even for state-of-the-art parsers.

One approach to solve this problem is a grammatical approach. This approach, however, often fails in noun and adjective coordinations because there are many possible structures in these coordinations that are grammatically correct. For example, a noun sequence of the form "n0 n1 and n2 n3" has as many as five possible structures (Resnik, 1999). Therefore, a grammatical approach is not sufficient to disambiguate coordination structures. In fact, the Stanford parser (Klein and Manning, 2003) and Enju (Miyao and Tsujii, 2004) fail to disambiguate the sentence "I am a freshman advertising and marketing major." Table 1 shows the output from them and the correct coordination structure.

The coordination structure above is obvious to humans because there is a symmetry of conjuncts (-ing) in the sentence. Coordination structures often have such structural and semantic symmetry of conjuncts. One approach is to capture the local symmetry of conjuncts. However, this approach fails in VP and sentential coordinations, which can easily be detected by a grammatical approach. This is because conjuncts in these coordinations do not necessarily have local symmetry.

It is therefore natural to think that considering both the syntax and the local symmetry of conjuncts would lead to a more accurate analysis. However, it is difficult to consider both of them in a dynamic programming algorithm, which has often been used for each of them, because it explodes the computational and implementational complexity. Thus, previous studies on coordination disambiguation often dealt only with a restricted form of coordination (e.g. noun phrases) or used a heuristic approach for simplicity.

In this paper, we present a statistical analysis model for coordination disambiguation that uses dual decomposition as a framework. We consider both the syntax, and the structural and semantic symmetry of conjuncts, so that it outperforms existing methods that consider only either of them. Moreover, it is still simple and requires only O(n^4) time per iteration, where n is the number of words in a sentence. This is equal to that of coordination structure analysis with alignment-based local features. The overall system still has a quite simple structure because we need just slight modifications of existing models in this approach,
Stanford parser / Enju:
    I am a ( freshman advertising ) and ( marketing major )
Correct coordination structure:
    I am a freshman ( ( advertising and marketing ) major )

Table 1: Output from the Stanford parser, Enju and the correct coordination structure.

so we can easily add other modules or features in the future.

The structure of this paper is as follows. First, we describe the three basic methods required in the technique we propose: 1) coordination structure analysis with alignment-based local features, 2) HPSG parsing, and 3) dual decomposition. Finally, we show experimental results that demonstrate the effectiveness of our approach. We compare three methods: coordination structure analysis with alignment-based local features, HPSG parsing, and the dual-decomposition-based approach that combines both.

2 Related Work

Many previous studies of coordination disambiguation have focused on a particular type of NP coordination (Hogan, 2007). Resnik (1999) disambiguated coordination structures by using the semantic similarity of the conjuncts in a taxonomy. He dealt with two kinds of patterns, [n0 n1 and n2 n3] and [n1 and n2 n3], where the ni are all nouns. He detected coordination structures based on similarity of form, meaning and conceptual association between n1 and n2 and between n1 and n3. Nakov and Hearst (2005) used the Web as a training set and applied it to a task that is similar to Resnik's.

In terms of integrating coordination disambiguation with an existing parsing model, our approach resembles the approach by Hogan (2007). She detected noun phrase coordinations by finding symmetry in conjunct structure and the dependency between the lexical heads of the conjuncts. They are used to rerank the n-best outputs of the Bikel parser (2004), whereas the two models interact with each other in our method.

Shimbo and Hara (2007) proposed an alignment-based method for detecting and disambiguating non-nested coordination structures. They disambiguated coordination structures based on the edit distance between two conjuncts. Hara et al. (2009) extended the method, dealing with nested coordinations as well. We used their method as one of the two sub-models.

3 Background

3.1 Coordination structure analysis with alignment-based local features

Coordination structure analysis with alignment-based local features (Hara et al., 2009) is a hybrid approach to coordination disambiguation that combines a simple grammar to ensure a consistent global structure of coordinations in a sentence, and features based on sequence alignment to capture the local symmetry of conjuncts. In this section, we describe the method briefly.

A sentence is denoted by x = x1 ... xk, where xi is the i-th word of x. A coordination boundaries set is denoted by y = y1 ... yk, where

yi = (bl, el, br, er)   if xi is a coordinating conjunction having left conjunct xbl ... xel and right conjunct xbr ... xer
yi = null               otherwise

In other words, yi has a non-null value only when xi is a coordinating conjunction. For example, the sentence "I bought books and stationery" has the coordination boundaries set (null, null, null, (3, 3, 5, 5), null).

The score of a coordination boundaries set is defined as the sum of the scores of all coordinating conjunctions in the sentence:

score(x, y) = sum_{m=1..k} score(x, ym) = sum_{m=1..k} w . f(x, ym)    (1)

where f(x, ym) is a real-valued feature vector of the coordination conjunct xm. We used almost the same feature set as Hara et al. (2009): namely, the surface word, part-of-speech, suffix and prefix of the words, and their combinations. We used the averaged perceptron to tune the weight vector w.
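The scoring function in Eq (1) can be illustrated with the following sketch; the features shown are simplified stand-ins for the surface-word, POS, suffix and prefix features of Hara et al. (2009), and the indices are 0-based rather than the 1-based convention used in the running text.

def features(x, ym, m):
    """x: list of (word, pos) pairs; ym: (bl, el, br, er) conjunct boundaries for
    the coordinating conjunction at position m."""
    bl, el, br, er = ym
    return {
        "conj_word=" + x[m][0].lower(): 1.0,
        "left_last_pos=" + x[el][1]: 1.0,
        "right_last_pos=" + x[er][1]: 1.0,
        # crude symmetry cue: do the two conjuncts end with the same POS tag?
        "same_end_pos=" + str(x[el][1] == x[er][1]): 1.0,
    }

def score(x, y, w):
    total = 0.0
    for m, ym in enumerate(y):
        if ym is None:            # non-conjunction positions contribute nothing
            continue
        for name, value in features(x, ym, m).items():
            total += w.get(name, 0.0) * value
    return total

# Example: "I bought books and stationery", boundaries stored 0-based here.
x = [("I", "PRP"), ("bought", "VBD"), ("books", "NNS"), ("and", "CC"), ("stationery", "NN")]
y = [None, None, None, (2, 2, 4, 4), None]
print(score(x, y, {"conj_word=and": 0.5, "same_end_pos=False": -0.2}))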
Hara et al. (2009) proposed to use a context-free grammar to find a properly nested coordination structure. That is, the scoring function Eq (1) is only defined on the coordination structures that are licensed by the grammar. We only slightly extended their grammar to cover a greater variety of coordinating conjunctions.

Table 2 and Table 3 show the non-terminals and production rules used in the model. The only objective of the grammar is to ensure the consistency of two or more coordinations in a sentence, which means that any two coordinations must be either non-overlapping or nested. We use a bottom-up chart parsing algorithm to output the coordination boundaries with the highest score. Note that these production rules don't need to be isomorphic to those of HPSG parsing, and actually they aren't. This is because the two methods interact only through dual decomposition and the search spaces defined by the methods are considered separately.

COORD  Coordination.
CJT    Conjunct.
N      Non-coordination.
CC     Coordinating conjunction like "and".
W      Any word.

Table 2: Non-terminals

Rules for coordinations:
    COORD_{i,m} -> CJT_{i,j} CC_{j+1,k-1} CJT_{k,m}
Rules for conjuncts:
    CJT_{i,j} -> (COORD|N)_{i,j}
Rules for non-coordinations:
    N_{i,k} -> COORD_{i,j} N_{j+1,k}
    N_{i,j} -> W_{i,i} (COORD|N)_{i+1,j}
    N_{i,i} -> W_{i,i}
Rules for pre-terminals:
    CC_{i,i}   -> (and | or | but | , | ; | + | +/-)_i
    CC_{i,i+1} -> (, | ;)_i (and | or | but)_{i+1}
    CC_{i,i+2} -> (as)_i (well)_{i+1} (as)_{i+2}
    W_{i,i}    -> (any word)_i

Table 3: Production rules

This method requires O(n^4) time, where n is the number of words. This is because there are O(n^2) possible coordination structures in a sentence, and the method requires O(n^2) time to get a feature vector of each coordination structure.

3.2 HPSG parsing

HPSG (Pollard and Sag, 1994) is one of the linguistic theories based on the lexicalized grammar formalism. In a lexicalized grammar, quite a small number of schemata are used to explain general grammatical constraints, compared with other theories. On the other hand, rich word-specific characteristics are embedded in lexical entries. Both schemata and lexical entries are represented by typed feature structures, and constraints in parsing are checked by unification among them. Figure 1 shows examples of HPSG schemata.

Figure 1: Subject-Head Schema (left) and Head-Complement Schema (right); taken from Miyao et al. (2004). [Feature-structure diagrams not reproducible in this text version.]

Figure 2 shows an HPSG parse tree of the sentence "Spring has come." First, the lexical entries of "has" and "come" are joined by the head-complement schema. Unification gives the phrasal sign of the mother. The sign of the larger constituent is obtained by repeatedly applying schemata to lexical/phrasal signs. Finally, the HPSG sign of the whole sentence is output.

Figure 2: HPSG parsing; taken from Miyao et al. (2004). [Parse-tree diagram not reproducible in this text version.]

We use Enju as an English HPSG parser (Miyao et al., 2004). Figure 3 shows how a coordination structure is built in the Enju grammar. First, a coordinating conjunction and the right conjunct are joined by coord_right_schema. Afterwards, the parent and the left conjunct are joined by coord_left_schema.

Figure 3: Construction of coordination in Enju. [Diagram not reproducible in this text version.]

The Enju parser is equipped with a disambiguation model trained by the maximum entropy method (Miyao and Tsujii, 2008). Since we do not need the probability of each parse tree, we treat the model just as a linear model that defines the score of a parse tree as the sum of feature weights. The features of the model are defined on local subtrees of a parse tree.

The Enju parser takes O(n^3) time since it uses the CKY algorithm, and each cell in the CKY parse table has at most a constant number of edges because we use a beam search algorithm. Thus, we can regard the parser as a decoder for a weighted CFG.

3.3 Dual decomposition

Dual decomposition is a classical method to solve complex optimization problems that can be decomposed into efficiently solvable sub-problems. It is becoming popular in the NLP community and has been shown to work effectively on several NLP tasks (Rush et al., 2010).

We consider an optimization problem

    arg max_x (f(x) + g(x))    (2)

which is difficult to solve (e.g. NP-hard), while arg max_x f(x) and arg max_x g(x) are effectively solvable. In dual decomposition, we solve

    min_u max_{x,y} (f(x) + g(y) + u . (x - y))

instead of the original problem. To find the minimum value, we can use a subgradient method (Rush et al., 2010). The subgradient method is given in Table 4. As the algorithm shows, you can use existing algorithms and don't need to have an exact algorithm for the optimization problem, which are features of dual decomposition.

    u^(1) <- 0
    for k = 1 to K do
        x^(k) <- arg max_x (f(x) + u^(k) . x)
        y^(k) <- arg max_y (g(y) - u^(k) . y)
        if x^(k) = y^(k) then
            return u^(k)
        end if
        u^(k+1) <- u^(k) - a_k (x^(k) - y^(k))
    end for
    return u^(K)

Table 4: The subgradient method

If x^(k) = y^(k) occurs during the algorithm, then we simply take x^(k) as the primal solution, which is the exact answer. If not, we simply take x^(K), the answer of coordination structure analysis with alignment-based features, as an approximate answer to the primal solution. The answer does not always solve the original problem Eq (2), but previous works (e.g., Rush et al. (2010)) have shown that it is effective in practice. We use it in this paper.

4 Proposed method

In this section, we describe how we apply dual decomposition to the two models.

4.1 Notation

We define some notations here. First we describe weighted CFG parsing, which is used for both coordination structure analysis with alignment-based features and HPSG parsing. We follow the formulation by Rush et al. (2010). We assume a context-free grammar in Chomsky normal form, with a set of non-terminals N. All rules of the grammar are either of the form A -> B C or A -> w, where A, B, C are in N and w is in V. For rules of the form A -> w we refer to A as the pre-terminal for w.

Given a sentence with n words, w1 w2 ... wn, a parse tree is a set of rule productions of the form <A -> B C, i, k, j> where A, B, C are in N and 1 <= i <= k <= j <= n. Each rule production represents the use of CFG rule A -> B C, where non-terminal A spans words wi ... wj, non-terminal B spans words wi ... wk, and non-terminal C spans words wk+1 ... wj if k < j, and the use of CFG rule A -> wi if i = k = j.

We now define the index set for the coordination structure analysis as

    Icsa = { <A -> B C, i, k, j> : A, B, C in N, 1 <= i <= k <= j <= n }

Each parse tree is a vector y = {y_r : r in Icsa}, with y_r = 1 if rule r is in the parse tree, and y_r = 0 otherwise. Therefore, each parse tree is represented as a vector in {0, 1}^m, where m = |Icsa|. We use Y to denote the set of all valid parse-tree vectors. The set Y is a subset of {0, 1}^m.

In addition, we assume a vector theta_csa = {theta_csa_r : r in Icsa} that specifies a score for each rule production. Each theta_csa_r can take any real value. The optimal parse tree is y* = arg max_{y in Y} y . theta_csa, where y . theta_csa = sum_r y_r theta_csa_r is the inner product between y and theta_csa.

We use similar notation for HPSG parsing. We define Ihpsg, Z and theta_hpsg as the index set for HPSG parsing, the set of all valid parse-tree vectors and the weight vector for HPSG parsing, respectively.

We extend the index sets for both the coordination structure analysis with alignment-based features and HPSG parsing to make a constraint between the two sub-problems. For the coordination structure analysis with alignment-based features we define the extended index set to be I'csa = Icsa U Iuni where

    Iuni = { (a, b, c) : a, b, c in {1 ... n} }

Here each triple (a, b, c) represents that word wc is recognized as the last word of the right conjunct and the scope of the left conjunct or the coordinating conjunction is wa ... wb.^1 Thus each parse-tree vector y will have additional components y_{a,b,c}. Note that this representation is over-complete, since a parse tree is enough to determine the unique coordination structures for a sentence: more explicitly, the value of y_{a,b,c} is 1 if rule <COORD_{a,c} -> CJT_{a,b} CC_{*,*} CJT_{*,c}> or <COORD_{*,c} -> CJT_{*,*} CC_{a,b} CJT_{*,c}> is in the parse tree; otherwise it is 0.

^1 This definition is derived from the structure of a coordination in Enju (Figure 3). The triples show where the coordinating conjunction and the right conjunct are in coord_right_schema, and where the left conjunct and the partial coordination are in coord_left_schema. Thus they alone enable not only the coordination structure analysis with alignment-based features but also Enju to uniquely determine the structure of a coordination.

We apply the same extension to the HPSG index set, also giving an over-complete representation. We define z_{a,b,c} analogously to y_{a,b,c}.

4.2 Proposed method

We now describe the dual decomposition approach for coordination disambiguation. First, we define the set Q as follows:

    Q = { (y, z) : y in Y, z in Z, y_{a,b,c} = z_{a,b,c} for all (a, b, c) in Iuni }

Therefore, Q is the set of all (y, z) pairs that agree on their coordination structures. The coordination structure analysis with alignment-based features and HPSG parsing problem is then to solve

    max_{(y,z) in Q} (y . theta_csa + gamma * z . theta_hpsg)    (3)

where gamma > 0 is a parameter dictating the relative weight of the two models and is chosen to optimize performance on the development test set. This problem is equivalent to

    max_{z in Z} (g(z) . theta_csa + gamma * z . theta_hpsg)    (4)

where g : Z -> Y is a function that maps an HPSG tree z to its set of coordination structures y = g(z).

We solve this optimization problem by using dual decomposition. Figure 4 shows the resulting algorithm. The algorithm tries to optimize the combined objective by separately solving the sub-problems again and again. After each iteration, the algorithm updates the weights u(a, b, c). These updates modify the objective functions for the two sub-problems, encouraging them to agree on the same coordination structures. If y^(k) = z^(k) occurs during the iterations, then the algorithm simply returns y^(k) as the exact answer. If not, the algorithm returns the answer of coordination analysis with alignment features as a heuristic answer.

It is necessary to modify the original sub-problems for calculating (1) and (2) in Table 4. We modified the sub-problems to regard the score of u(a, b, c) as a bonus/penalty of the coordination. The modified coordination structure analysis with alignment features adds u^(k)(i, j, m) and u^(k)(j+1, l
u^(1)(a, b, c) ← 0 for all (a, b, c) ∈ I_uni
for k = 1 to K do
    y^(k) ← arg max_{y ∈ Y} (y · θ_csa − Σ_{(a,b,c) ∈ I_uni} u^(k)(a, b, c) y_{a,b,c})    ... (1)
    z^(k) ← arg max_{z ∈ Z} (γ z · θ_hpsg + Σ_{(a,b,c) ∈ I_uni} u^(k)(a, b, c) z_{a,b,c})    ... (2)
    if y^(k)(a, b, c) = z^(k)(a, b, c) for all (a, b, c) ∈ I_uni then
        return y^(k)
    end if
    for all (a, b, c) ∈ I_uni do
        u^(k+1)(a, b, c) ← u^(k)(a, b, c) + a_k (y^(k)(a, b, c) − z^(k)(a, b, c))
    end for
end for
return y^(K)

Figure 4: Proposed algorithm
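To make the control flow of Figure 4 concrete, here is a minimal Python sketch of the same subgradient scheme. The two solvers, the triple set and the step-size schedule are hypothetical stand-ins (in the paper the two sub-problems are the alignment-based CSA decoder and Enju); the sketch only illustrates the loop and the agreement test, not the authors' implementation.

    def dual_decompose(solve_csa, solve_hpsg, triples, step_size, K=50):
        """Subgradient loop over the coordination triples I_uni.

        solve_csa(u), solve_hpsg(u): stand-in sub-problem solvers; each takes
        the current penalty dictionary u and returns {(a, b, c): 0 or 1}, the
        coordination indicators of its best parse under the penalized score.
        step_size(k): returns a_k (see Section 5.2.1).
        """
        u = {t: 0.0 for t in triples}                  # u^(1) <- 0
        y = None
        for k in range(1, K + 1):
            y = solve_csa(u)                           # sub-problem (1)
            z = solve_hpsg(u)                          # sub-problem (2)
            if all(y[t] == z[t] for t in triples):     # the two models agree
                return y                               # certificate of optimality
            a_k = step_size(k)
            for t in triples:                          # push the models together
                u[t] += a_k * (y[t] - z[t])
        return y                                       # heuristic answer y^(K)

Only the coordination triples have to be exchanged between the sub-problems, so each solver can keep its own internal dynamic programming (CKY-style decoding for the alignment-based model, HPSG chart parsing for Enju).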

The modified coordination structure analysis with alignment-based features adds u^(k)(i, j, m) and u^(k)(j+1, l−1, m), as well as adding w · f(x, (i, j, l, m)), to the score of the subtree when the rule production COORD_{i,m} → CJT_{i,j} CC_{j+1,l−1} CJT_{l,m} is applied.

The modified Enju adds u^(k)(a, b, c) when coord_right_schema is applied, where the word sequence w_a ... w_b is recognized as a coordinating conjunction and the last word of the right conjunct is w_c, or when coord_left_schema is applied, where w_a ... w_b is recognized as the left conjunct and the last word of the right conjunct is w_c.

COORD    WSJ    Genia
NP       63.7   66.3
VP       13.8   11.4
ADJP      6.8    9.6
S        11.4    6.0
PP        2.4    5.1
Others    1.9    1.5

Table 6: The percentage of each conjunct type (%) of each test set
5 Experiments

5.1 Test/Training data

We trained the alignment-based coordination analysis model on both the Genia corpus (Kim et al., 2003) and the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993), and evaluated the performance of our method on (i) the Genia corpus and (ii) the Wall Street Journal portion of the Penn Treebank. More precisely, we used the HPSG treebank converted from the Penn Treebank and Genia, and further extracted the training/test data for coordination structure analysis with alignment-based features using the annotation in the treebank. Table 5 shows the corpora used in the experiments.

The Wall Street Journal portion of the Penn Treebank test set has 2317 sentences from WSJ articles, and there are 1356 COOD tags in these sentences, while the Genia test set has 1764 sentences from MEDLINE abstracts, and there are 1848 COOD tags. Coordinations are further subcategorized into phrase types such as NP-COOD or VP-COOD. Table 6 shows the percentage of each phrase type in all coordinations. It indicates that the Wall Street Journal portion of the Penn Treebank has more VP-COOD and S-COOD tags, while the Genia corpus has more NP-COOD and ADJP-COOD tags.

5.2 Implementation of sub-problems

We used Enju (Miyao and Tsujii, 2004) for the implementation of HPSG parsing, which has a wide-coverage probabilistic HPSG grammar and an efficient parsing algorithm, while we re-implemented Hara et al. (2009)'s algorithm with slight modifications.

5.2.1 Step size

We used the following step size in our algorithm (Figure 4). First, we initialized a_0, which is chosen to optimize performance on the development set. Then we defined a_k = a_0 · 2^(−η_k), where η_k is the number of times that L(u^(k')) > L(u^(k'−1)) for k' ≤ k.
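The following is a small sketch of this step-size schedule, under the stated rule a_k = a_0 · 2^(−η_k), with η_k counting the iterations at which the dual objective increased; the dual values L(u^(k)) are assumed to be computed by the caller after solving (1) and (2), and a_0 is whatever value was tuned on the development set.

    class HalvingStepSize:
        """Step size a_k = a0 * 2**(-eta), where eta counts how many times the
        dual objective L(u) has increased so far (a sign the step was too large)."""

        def __init__(self, a0):
            self.a0 = a0
            self.eta = 0
            self.prev_dual = None

        def next(self, dual_value):
            # dual_value = L(u^(k)), supplied by the caller at iteration k
            if self.prev_dual is not None and dual_value > self.prev_dual:
                self.eta += 1
            self.prev_dual = dual_value
            return self.a0 * 2.0 ** (-self.eta)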
              Task (i)                                  Task (ii)
Training      WSJ (sec. 2-21) + Genia (No. 1-1600)      WSJ (sec. 2-21)
Development   Genia (No. 1601-1800)                     WSJ (sec. 22)
Test          Genia (No. 1801-1999)                     WSJ (sec. 23)

Table 5: The corpora used in the experiments

            Proposed   Enju   CSA
Precision     72.4     66.3   65.3
Recall        67.8     65.5   60.5
F1            70.0     65.9   62.8

Table 7: Results of Task (i) on the test set. The precision, recall, and F1 (%) for the proposed method, Enju, and coordination structure analysis with alignment-based features (CSA)

5.3 Evaluation metric

We evaluated the performance of the tested methods by the accuracy of coordination-level bracketing (Shimbo and Hara, 2007); i.e., we count each of the coordination scopes as one output of the system, and the system output is regarded as correct if both the beginning of the first output conjunct and the end of the last conjunct match the annotations in the Treebank (Hara et al., 2009).

5.4 Experimental results of Task (i)

We ran the dual decomposition algorithm with a limit of K = 50 iterations. We found that the two sub-problems return the same answer during the algorithm in over 95% of sentences.

We compare the accuracy of the dual decomposition approach to two baselines: Enju and coordination structure analysis with alignment-based features. Table 7 shows all three results. The dual decomposition method gives a statistically significant gain in precision and recall over the two methods.2

Table 8 shows the recall of coordinations of each type. It indicates that our re-implementation of CSA and Hara et al. (2009) have roughly similar performance, although their experimental settings are different. It also shows that the proposed method took advantage of Enju and CSA in NP coordination, while it is likely just to take the answer of Enju in VP and sentential coordinations. This means we might well use dual decomposition only on NP coordinations to obtain a better result.

Figure 5 shows the performance of the approach as a function of K, the maximum number of iterations of dual decomposition. The graphs show that values of K much smaller than 50 produce almost identical performance to K = 50 (with K = 50 the accuracy of the method is 73.4%, with K = 20 it is 72.6%, and with K = 1 it is 69.3%). This means that a smaller K can be used in practice for speed.

Figure 5: Performance of the approach as a function of K of Task (i) on the development set. accuracy (%): the percentage of sentences that are correctly parsed. certificates (%): the percentage of sentences for which a certificate of optimality is obtained.

5.5 Experimental results of Task (ii)

We also ran the dual decomposition algorithm with a limit of K = 50 iterations on Task (ii). Tables 9 and 10 show the results of Task (ii). They show that the proposed method statistically significantly outperformed the two methods in precision and recall.3

Figure 6 shows the performance of the approach as a function of K, the maximum number of iterations of dual decomposition. The convergence speed for WSJ was faster than that for Genia. This is because a sentence in WSJ often has a simpler coordination structure than a sentence in Genia.

2 p < 0.01 (by chi-square test)
3 p < 0.01 (by chi-square test)
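As an illustration of the coordination-level bracketing metric used above, the following is a small sketch in the spirit of Shimbo and Hara (2007), not their exact scorer. The data format is a hypothetical one: each coordination scope is a (begin, end) pair of word indices, where begin is the start of the first conjunct and end is the end of the last conjunct.

    def bracketing_prf(gold, predicted):
        """gold, predicted: lists (one entry per sentence) of sets of
        (begin, end) coordination scopes."""
        correct = sum(len(g & p) for g, p in zip(gold, predicted))
        n_gold = sum(len(g) for g in gold)
        n_pred = sum(len(p) for p in predicted)
        precision = correct / n_pred if n_pred else 0.0
        recall = correct / n_gold if n_gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1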

COORD # Proposed Enju CSA # Hara et al. (2009)
Overall 1848 67.7 63.3 61.9 3598 61.5
NP 1213 67.5 61.4 64.1 2317 64.2
VP 208 79.8 78.8 66.3 456 54.2
ADJP 193 58.5 59.1 54.4 312 80.4
S 111 51.4 52.3 34.2 188 22.9
PP 110 64.5 59.1 57.3 167 59.9
Others 13 78.3 73.9 65.2 140 49.3

Table 8: The number of coordinations of each type (#), and the recall (%) for the proposed method, Enju,
coordination structure analysis with alignment-based features (CSA) , and Hara et al. (2009) of Task (i) on the
development set. Note that Hara et al. (2009) uses a different test set and different annotation rules, although its
test data is also taken from the Genia corpus. Thus we cannot compare them directly.

            Proposed   Enju   CSA
Precision     76.3     70.7   66.0
Recall        70.6     69.0   60.1
F1            73.3     69.9   62.9

Table 9: Results of Task (ii) on the test set. The precision, recall, and F1 (%) for the proposed method, Enju, and coordination structure analysis with alignment-based features (CSA)

COORD      #    Proposed   Enju   CSA
Overall   1017    71.6     68.1   60.7
NP         573    76.1     71.0   67.7
VP         187    62.0     62.6   47.6
ADJP        73    82.2     75.3   79.5
S          141    64.5     62.4   42.6
PP          19    52.6     47.4   47.4
Others      24    62.5     70.8   54.2

Table 10: The number of coordinations of each type (#), and the recall (%) for the proposed method, Enju, and coordination structure analysis with alignment-based features (CSA) of Task (ii) on the development set.

Figure 6: Performance of the approach as a function of K of Task (ii) on the development set. accuracy (%): the percentage of sentences that are correctly parsed. certificates (%): the percentage of sentences for which a certificate of optimality is provided.

6 Conclusion and Future Work

In this paper, we presented an efficient method for detecting and disambiguating coordinate structures. Our basic idea was to consider both grammar and symmetries of conjuncts by using dual decomposition. Experiments on the Genia corpus and the Wall Street Journal portion of the Penn Treebank showed that we could obtain statistically significant improvements in accuracy when using dual decomposition.

Further study is needed from the following points of view. First, we should evaluate our method on corpora in other domains: because the characteristics of coordination structures differ from corpus to corpus, experiments on other corpora could lead to different results. Second, we would like to add further features, such as ontologies, to the coordination structure analysis with alignment-based local features. Finally, we can add other methods (e.g. dependency parsing) as sub-problems to our method by using the extension of dual decomposition that can deal with more than two sub-problems.

Acknowledgments

The second author is partially supported by KAKENHI Grant-in-Aid for Scientific Research C 21500131 and Microsoft CORE project 7.

References

Kazuo Hara, Masashi Shimbo, Hideharu Okuma, and Yuji Matsumoto. 2009. Coordinate structure analysis with global structural constraints and alignment-based local features. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 967-975, Aug.

Deirdre Hogan. 2007. Coordinate noun phrase disambiguation in a generative parsing model. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL 2007), pages 680-687.

Jin-Dong Kim, Tomoko Ohta, and Jun'ichi Tsujii. 2003. GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics, 19.

Dan Klein and Christopher D. Manning. 2003. Fast exact inference with a factored model for natural language parsing. Advances in Neural Information Processing Systems, 15:3-10.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313-330.

Yusuke Miyao and Jun'ichi Tsujii. 2004. Deep linguistic analysis for the accurate identification of predicate-argument relations. In Proceedings of COLING 2004, pages 1392-1397.

Yusuke Miyao and Jun'ichi Tsujii. 2008. Feature forest models for probabilistic HPSG parsing. Computational Linguistics, 34(1):35-80.

Yusuke Miyao, Takashi Ninomiya, and Jun'ichi Tsujii. 2004. Corpus-oriented grammar development for acquiring a head-driven phrase structure grammar from the Penn Treebank. In Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP 2004).

Preslav Nakov and Marti Hearst. 2005. Using the web as an implicit training set: Application to structural ambiguity resolution. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP 2005), pages 835-842.

Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. University of Chicago Press.

Philip Resnik. 1999. Semantic similarity in a taxonomy. Journal of Artificial Intelligence Research, 11:95-130.

Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola. 2010. On dual decomposition and linear programming relaxations for natural language processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Masashi Shimbo and Kazuo Hara. 2007. A discriminative learning model for coordinate conjunctions. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 610-619, Jun.
Cutting the Long Tail: Hybrid Language Models
for Translation Style Adaptation

Arianna Bisazza and Marcello Federico


Fondazione Bruno Kessler
Trento, Italy
{bisazza,federico}@fbk.eu

Abstract

In this paper, we address statistical machine translation of public conference talks. Modeling the style of this genre can be very challenging given the shortage of available in-domain training data. We investigate the use of a hybrid LM, where infrequent words are mapped into classes. Hybrid LMs are used to complement word-based LMs with statistics about the language style of the talks. Extensive experiments comparing different settings of the hybrid LM are reported on publicly available benchmarks based on TED talks, from Arabic to English and from English to French. The proposed models show to better exploit in-domain data than conventional word-based LMs for the target language modeling component of a phrase-based statistical machine translation system.

1 Introduction

The translation of TED conference talks1 is an emerging task in the statistical machine translation (SMT) community (Federico et al., 2011). The variety of topics covered by the speeches, as well as their specific language style, make this a very challenging problem.

Fixed expressions, colloquial terms, figures of speech and other phenomena recurrent in the talks should be properly modeled to produce translations that are not only fluent but that also employ the right register. In this paper, we propose a language modeling technique that leverages in-domain training data for style adaptation.

Hybrid class-based LMs are trained on text where only infrequent words are mapped to Part-of-Speech (POS) classes. In this way, topic-specific words are discarded and the model focuses on generic words that we assume more useful to characterize the language style. The factorization of similar expressions made possible by this mixed text representation yields a better n-gram coverage, but with a much higher discriminative power than POS-level LMs.

Hybrid LM also differs from POS-level LM in that it uses a word-to-class mapping to determine POS tags. Consequently, it doesn't require the decoding overload of factored models nor the tagging of all parallel data used to build phrase tables. A hybrid LM trained on in-domain data can thus be easily added to an existing baseline system trained on large amounts of background data. The proposed models are used in addition to standard word-based LMs, in the framework of log-linear phrase-based SMT.

The remainder of this paper is organized as follows. After discussing the language style adaptation problem, we will give an overview of relevant work. In the following sections we will describe in detail hybrid LM and its possible variants. Finally, we will present an empirical analysis of the proposed technique, including intrinsic evaluation and SMT experiments.

2 Background

Our working scenario is the translation of TED talks transcripts as proposed by the IWSLT Evaluation Campaign2. This genre covers a variety of topics ranging from business to psychology. The available training material both parallel and

1 http://www.ted.com/talks
2 http://www.iwslt2011.org

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 439-448,
Avignon, France, April 23-27 2012. ©2012 Association for Computational Linguistics
Beginning of Sentence: [s] End of Sentence: [/s]
TED NEWS TED NEWS
1st [s] Thank you . [/s] 1st [s] ( AP ) - 1st [s] Thank you . [/s] 1st he said . [/s]
2 [s] Thank you very much 2 [s] WASHINGTON ( ... 2 you very much . [/s] 2 she said . [/s]
3 [s] I m going to 3 [s] NEW YORK ( AP 3 in the world . [/s] 3 , he said . [/s]
4 [s] And I said , 4 [s] ( CNN ) 4 and so on . [/s] 4 he said . [/s]
5 [s] I don t know 5 [s] NEW YORK ( R... 5 , you know . [/s] 5 in a statement . [/s]
6 [s] He said , 6 [s] He said : 6 of the world . [/s] 6 the United States . [/s]
7 [s] I said , 7 [s] I don t 7 around the world . [/s] 7 to this report . [/s]
8 [s] And of course , 8 [s] It was last updated 8 . Thank you . [/s] 8 he added . [/s]
9 [s] And one of the 9 [s] At the same time 9 the United States . [/s] 9 , police said . [/s]
10 [s] And I want to ... 10 all the time . [/s] 10 , officials said . [/s]
11 [s] And that s what 69 [s] I don t know 11 to do it . [/s] ...
12 [s] We re going to 612 [s] I m going to 12 and so forth . [/s] 13 in the world . [/s]
13 [s] And I think that 2434 [s] I said , 13 don t know . [/s] 17 around the world . [/s]
14 [s] And you can see 7034 [s] He said , 14 to do that . [/s] 46 of the world . [/s]
15 [s] And this is a 8199 [s] And I said , 15 in the future . [/s] 129 all the time . [/s]
16 [s] And this is the 8233 [s] Thank you very much 16 the same time . [/s] 157 and so on . [/s]
17 [s] And he said , ... 17 , you know ? [/s] 1652 , you know . [/s]
18 [s] So this is a [s] Thank you . [/s] 18 to do this . [/s] 5509 you very much . [/s]

Table 1: Common sentence-initial and sentence-final 5-grams, as ranked by frequency, in the TED and NEWS
corpora. Numbers denote the frequency rank.
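A rough sketch of how rankings like those in Table 1 can be extracted from a corpus; the input format (one tokenized sentence per line, boundaries marked with [s] and [/s] as in the table) is an assumption for illustration, not a description of the authors' tooling.

    from collections import Counter

    def boundary_5gram_ranks(sentences):
        """Rank sentence-initial and sentence-final 5-grams by frequency.

        sentences: iterable of token lists (tokenized, without boundary tags).
        Returns two lists of (5-gram, frequency), most frequent first.
        """
        initial, final = Counter(), Counter()
        for toks in sentences:
            padded = ["[s]"] + toks + ["[/s]"]
            if len(padded) >= 5:
                initial[tuple(padded[:5])] += 1
                final[tuple(padded[-5:])] += 1
        return initial.most_common(), final.most_common()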

monolingual consists of a rather small collection of TED talks plus a variety of large out-of-domain corpora, such as news stories and UN proceedings.

Given the diversity of topics, the in-domain data alone cannot ensure sufficient coverage to an SMT system. The addition of background data can certainly improve the n-gram coverage and thus the fluency of our translations, but it may also move our system towards an unsuitable language style, such as that of written news.

In our study, we focus on the subproblem of target language modeling and consider two English text collections, namely the in-domain TED and the out-of-domain NEWS3, summarized in Table 2. Because of its larger size, two orders of magnitude, the NEWS corpus can provide a better LM coverage than the TED on the test data. This is reflected both on perplexity and on the average length of the context (or history h) actually used by these two LMs to score the test reference translations. Note that the latter measure is bounded at the LM order minus one, and is inversely proportional to the number of back-offs performed by the model. Hence, we use this value to estimate how well an n-gram LM fits the test data. Indeed, despite the genre mismatch, the perplexity of a NEWS 5-gram LM on the TED-2010 test reference translations is 104 versus 112 for the in-domain LM, and the average history size is 2.5 versus 1.7 words.

3 http://www.statmt.org/wmt11/translation-task.html

LM Data    |S|     |W|    |V|    PP    h5g
TED-En     124K    2.4M   51K    112   1.7
NEWS-En    30.7M   782M   2.2M   104   2.5

Table 2: Training data and coverage statistics of two 5-gram LMs used for the TED task: number of sentences and tokens, vocabulary size; perplexity and average word history.

       TED             NEWS
1st    ,         1st   the
...              ...
9      I         40    I
12     you       64    you
90     actually  965   actually
268    stuff     2479  guy
370    guy       2861  stuff
436    amazing   4706  amazing

Table 3: Excerpts from TED and NEWS training vocabularies, as ranked by frequency. Numbers denote the frequency rank.

Yet we observe that the style of public speeches is much better represented in the in-domain corpus than in the out-of-domain one. For instance, let us consider the vocabulary distribution4 of the

4 Hesitations and filler words, typical of spoken language, are not covered in our study because they are generally not reported in the TED talk transcripts.
two corpora (Table 3). The very first forms, as ranked by frequency, are quite similar in the two corpora. However, there are important exceptions: the pronouns I and you are among the top 20 frequent forms in the TED, while in the NEWS they are ranked only 40th and 64th respectively. Other interesting cases are the words actually, stuff, guy and amazing, all ranked about 10 times higher in the TED than in the NEWS corpus.

We can also analyze the most typical ways to start and end a sentence in the two text collections. As shown in Table 1, the frequency ranking of sentence-initial and sentence-final 5-grams in the in-domain corpus is notably different from the out-of-domain one. TED's most frequent sentence-initial 5-gram [s] Thank you . [/s] is not at all attested in the NEWS corpus. As for the 4th most common sentence start, [s] And I said , is only ranked 8199th in the NEWS, and so on. Notably, the top ranked NEWS 5-grams include names of cities (Washington, New York) and of news agencies (AP, Reuters). As regards sentence endings, we observe similar contrasts: for instance, the word sequence and so on . [/s] is ranked 4th in the TED and 157th in the NEWS, while , you know . [/s] is 5th in the TED and only 1652nd in the NEWS.

These figures confirm that the talks have a specific language style, remarkably different from that of the written news genre. In summary, talks are characterized by a massive use of first and second persons, by shorter sentences, and by more colloquial lexical and syntactic constructions.

3 Related Work

The brittleness of n-gram LMs in case of mismatch between training and task data is a well known issue (Rosenfeld, 2000). So called domain adaptation methods (Bellegarda, 2004) can improve the situation, once a limited amount of task-specific data becomes available. Ideally, domain-adaptive LMs aim to improve model robustness under changing conditions, involving possible variations in vocabulary, syntax, content, and style. Most of the known LM adaptation techniques (Bellegarda, 2004), however, address all these variations in a holistic way. A possible reason for this is that LM adaptation methods were originally developed under the automatic speech recognition framework, which typically assumes the presence of one single LM. The progressive adoption of the log-linear modeling framework in many NLP tasks has recently introduced the use of multiple LM components (features), which permit to naturally factor out and integrate different aspects of language into one model. In SMT, the factored model (Koehn and Hoang, 2007), for instance, permits to better tailor the LM to the task syntax, by complementing word-based n-grams with a part-of-speech (POS) LM, that can be estimated even on a limited amount of task-specific data. Besides many works addressing holistic LM domain adaptation for SMT, e.g. Foster and Kuhn (2007), methods were also recently proposed to explicitly adapt the LM to the discourse topic of a talk (Ruiz and Federico, 2011). Our work makes another step in this direction by investigating hybrid LMs that try to explicitly represent the speaking style of the talk genre. As a difference from standard class-based LMs (Brown et al., 1992) or the more recent local LMs (Monz, 2011), which are used to predict sequences of classes or word-class pairs, our hybrid LM is devised to predict sequences of classes interleaved by words. While we do not claim any technical novelty in the model itself, to our knowledge a deep investigation of hybrid LMs for the sake of style adaptation is definitely new. Finally, the term hybrid LM was inspired by Yazgan and Saraclar (2004), which called with this name a LM predicting sequences of words and sub-word units, devised to let a speech recognizer detect out-of-vocabulary words.

4 Hybrid Language Model

Hybrid LMs are n-gram models trained on a mixed text representation where each word is either mapped to a class or left as is. This choice is made according to a measure of word commonness and is univocal for each word type.

The rationale is to discard topic-specific words, while preserving those words that best characterize the language style (note that word frequency is computed on the in-domain corpus only). Mapping non-frequent terms to classes naturally leads to a shorter tail in the frequency distribution, as visualized by Figure 1. A model trained on such data has a better n-gram coverage of the test set and may take advantage of a larger context when scoring translation hypotheses.

As classes, we use deterministically assigned POS tags, obtained by first tagging the data with
441
!""""""#

!"""""# 4.1 Word commonness criteria


The most intuitive way to measure word common-
!""""# )*+,-# ness is by absolute term frequency (F ). We will
!"""#
$'./01# use this criterion in most of our experiments. A
finer solution would be to also consider the com-
!""# monness of a word across different talks. At this
end, we propose to use the fdf statistics, that is the
!"# product of relative term f requency and document
"# !"""# $"""# %"""# &"""# '"""# ("""#
f requency5 :
c(w) c(dw )
Figure 1: Type frequency distribution in the English f dfw = P 0)

TED corpus before and after POS-mapping of words w 0 c(w c(d)
with less than 500 occurrences (25% of tokens). The where dw are the documents (talks) containing at
rank in the frequency list (x-axis) is plotted against the
respective frequency in logarithmic scale. Types with
least one occurrence of the word w.
less than 20 occurrences are omitted from the graph. If available, real talk boundaries can be used
to define the documents. Alternatively, we can
simply split the corpus into chunks of fixed size.
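The two ingredients described above can be sketched in a few lines of Python; the tagged-corpus format and the unknown-word fallback are illustrative assumptions rather than the authors' code (they train on Tree Tagger output and pick the most likely tag per word type).

    from collections import Counter, defaultdict

    def fdf_scores(docs):
        """docs: list of documents, each a list of words. Returns {word: fdf}."""
        tf, df = Counter(), Counter()
        for doc in docs:
            for w in doc:
                tf[w] += 1
            for w in set(doc):
                df[w] += 1
        total = sum(tf.values())
        return {w: (tf[w] / total) * (df[w] / len(docs)) for w in tf}

    def hybrid_mapping(tagged_corpus, freq_threshold=500):
        """tagged_corpus: list of sentences of (word, pos) pairs.

        Returns a function word -> word-or-POS: frequent word types are kept
        as words, the rest are mapped to their most frequent tag (the mapping
        is univocal per word type)."""
        freq, tag_counts = Counter(), defaultdict(Counter)
        for sent in tagged_corpus:
            for w, pos in sent:
                freq[w] += 1
                tag_counts[w][pos] += 1
        best_tag = {w: tag_counts[w].most_common(1)[0][0] for w in freq}
        # "UNK" for unseen words is an assumption made for this sketch only
        return lambda w: w if freq.get(w, 0) >= freq_threshold else best_tag.get(w, "UNK")

Applying the returned mapping to the training text and to each translation hypothesis yields the mixed word/POS streams on which the hybrid n-gram LM is estimated and queried.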
Another issue is how to set the threshold. Independently from the chosen commonness measure, we can reason in terms of the ratio of tokens that are mapped to POS classes (WP). For instance, in our experiments with English, we can set the threshold to F=500 and observe that WP corresponds to 25% of the tokens (and 99% of the types). In the same corpus, a similar ratio is obtained with fdf=0.012.

In our study, we consider three ratios WP = {.25, .50, .75} that correspond to different levels of language modeling: from a domain-generic word-level LM to a lexically anchored POS-level LM.

4.2 Handling morphology

Token frequency-based measures may not be suitable for languages other than English. When translating into French, for instance, we have to deal with a much richer morphology.

As a solution we can use lemmas, univocally assigned to word types in the same manner as POS tags. Lemmas can be employed in two ways: only for word selection, as a frequency measure, or also for word representation, as a mapping for common words. In the former, we preserve inflected variants that may be useful to model the language style, but we also risk to see n-gram coverage decrease due to the presence of rare types. In the latter, only canonical forms and POS tags
appear in the processed text, thus introducing a Hybrid 10g LM |V | POS-Err h10g
further level of abstraction from the original text. all words 51299 0.0% 1.7
all lemmas 38486 0.0% 1.9
Here follows a TED sentence in its original
.25 POS/words 475 1.9% 2.7
version (first line) and after three different hy- .50 POS/words 93 4.1% 3.5
brid mappings namely WP =.25, WP =.25 with .75 POS/words 50 5.7% 4.1
lemma forms, and WP =.50: allPOS 43 6.6% 4.4
.25 POS/lemmas 302 1.8% 2.8
Now you laugh, but that quote has kind of a sting to it, right. .25 POS/words(fdf) 301 1.9% 2.7
Now you VB , but that NN has kind of a NN to it, right.
Now you VB , but that NN have kind of a NN to it, right. Table 5: Comparison of LMs obtained from different
RB you VB , CC that NN VBZ NN of a NN to it, RB . hybrid mappings of the English TED corpus: vocabu-
lary size, POS error rate, and average word history on
IWSLTtst2010s reference translations.
5 Evaluation
In this section we perform an intrinsic evaluation course, the more words are mapped, the less dis-
of the proposed LM technique, then we measure criminative our model will be. Thus, choosing the
its impact on translation quality when integrated best hybrid mapping means finding the best trade-
into a state-of-the-art phrase-based SMT system. off between coverage and informativeness.
We also applied hybrid LM to the French lan-
5.1 Intrinsic evaluation guage, again using Tree Tagger to create the POS
We analyze here a set of hybrid LMs trained on mapping. The tag set in this case comprises 34
the English TED corpus by varying the ratio of classes and the POS error rate with WP =.25 is
POS-mapped words and the word representation 1.2% (compare with 1.9% in English). As previ-
technique (word vs lemma). All models were ously discussed, morphology has a notable effect
trained with the IRSTLM toolkit (Federico et al., on the modeling of French. In fact, the vocabu-
2008), using a very high n-gram order (10) and lary reduction obtained by mapping all the words
Witten-Bell smoothing. to their most probable lemma is -45% (57959 to
First, we estimate an upper bound of the POS 31908 types in the TED corpus), while in English
tagging errors introduced by deterministic tag- it is only -25%.
ging. At this end, the hybridly mapped data is
5.2 SMT baseline
compared with the actual output of Tree Tagger on
the TED training corpus (see Table 5). Naturally, Our SMT experiments address the translation of
the impact of tagging errors correlates with the ra- TED talks from Arabic to English and from En-
tio of POS-mapped tokens, as no error is counted glish to French. The training and test datasets
on non-mapped tokens. For instance, we note that were provided by the organizers of the IWSLT11
the POS error rate is only 1.9% in our primary set- evaluation, and are summarized in Table 6.
ting, WP =.25 and word representation, whereas Marked in bold are the corpora used for hybrid
on a fully POS-mapped text it is 6.6%. Note that LM training. Dev and test sets have a single ref-
the English tag set used by Tree Tagger includes erence translation.
43 classes. For both language pairs, we set up com-
Now we focus on the main goal of hybrid text petitive phrase-based systems6 using the Moses
representation, namely increasing the coverage of toolkit (Koehn et al., 2007). The decoder fea-
the in-domain LM on the test data. Here too, we tures a statistical log-linear model including a
measure coverage by the average length of word phrase translation model and a phrase reordering
history h used to score the test reference transla- model (Tillmann, 2004; Koehn et al., 2005), two
tions (see Section 2). We do not provide perplex- word-based language models, distortion, word
ity figures, since these are not directly compara- and phrase penalties. The translation and re-
ble across models with different vocabularies. As ordering models are obtained by combining mod-
shown by Table 5, n-gram coverage increases with els independently trained on the available paral-
the ratio of POS-mapped tokens, ranging from 1.7 6
The SMT systems used in this paper are thoroughly de-
on an all-words LM to 4.4 on an all-POS LM. Of scribed in (Ruiz et al., 2011).

Corpus |S| |W | ` translation models, while the English-French sys-
TED 90K 1.7M 18.9 tem uses lowercased models and a standard re-
AR-EN
UN 7.9M 220M 27.8
casing post-process.
TED 124K 2.4M 19.5
EN
NEWS 30.7M 782M 25.4
Feature weights are tuned on dev2010 by
dev2010 934 19K 20.0 means of a minimum error training procedure
AR test (MERT) (Och, 2003). Following suggestions by
tst2010 1664 30K 18.1
TED 105K 2.0M 19.5 Clark et al. (2011) and Cettolo et al. (2011) on
EN-FR UN 11M 291M 26.5 controlling optimizer instability, we run MERT
NEWS 111K 3.1M 27.6 four times on the same configuration and use the
FR
TED 107K 2.2M 20.6 average of the resulting weights to evaluate trans-
NEWS 11.6M 291M 25.2 lation performance.
dev2010 934 20K 21.5
EN test
tst2010 1664 32K 19.1 5.3 Hybrid LM integration
As previously stated, hybrid LMs are trained only
Table 6: IWSLT11 training and test data statistics:
number of sentences |S|, number of tokens |W | and on in-domain data and are added to the log-linear
average sentence length `. Token numbers are com- decoder as an additional target LM. To this end,
puted on the target language, except for the test sets. we use the class-based LM implementation pro-
vided in Moses and IRSTLM, which applies the
word-to-class mapping to translation hypotheses
lel corpora: namely TED and NEWS for Arabic- before LM querying8 . The order of the additional
English; TED, NEWS and UN for English- LM is set to 10 in the Arabic-English evaluation
French. To this end we applied the fill-up method and 7 in the English-French, as these appeared to
(Nakov, 2008; Bisazza et al., 2011) in which out- be the best settings in preliminary tests.
of-domain phrase tables are merged with the in- Translation quality is measured by BLEU (Pa-
domain table by adding only new phrase pairs. pineni et al., 2002), METEOR (Banerjee and
Out-of-domain phrases are marked with a binary Lavie, 2005) and TER (Snover et al., 2006)9 . To
feature whose weight is tuned together with the test whether differences among systems are statis-
SMT system weights. tically significant we use approximate randomiza-
For each target language, two standard 5-gram tion as done in (Riezler and Maxwell, 2005)10 .
LMs are trained separately on the monolingual
TED and NEWS datasets, and log-linearly com- Model variants. The effect on MT quality of
bined at decoding time. In the Arabic-English various hybrid LM variants is shown in Table 7.
task, we use a hierarchical reordering model (Gal- Note that allPOS and allLemmas refer to deter-
ley and Manning, 2008; Hardmeier et al., 2011), ministically assigned POS tags and lemmas, re-
while in the English-French task we use a default spectively. Concerning the ratio of POS-mapped
word-based bidirectional model. The distortion tokens, the best performing values are WP =.25 in
limit is set to the default value of 6. Note that Arabic-English and WP =.50 in English-French.
the use of large n-gram LMs and of lexicalized These hybrid mappings outperform all the uni-
reordering models was shown to wipe out the im- form representations (words, lemmas and POS)
provement achievable by POS-level LM (Kirch- with statistically significant BLEU and METEOR
hoff and Yang, 2005; Birch et al., 2007). improvements.
Concerning data preprocessing we apply stan- The fdf experiment involves the use of doc-
dard tokenization to the English and French text, ument frequency for the selection of common
while for Arabic we use an in-house tokenizer that words. Its performance is very close to that of hy-
removes diacritics and normalizes special charac- 8
Detailed instructions on how to build and use hybrid
ters and digits. Arabic text is then segmented with LMs can be found at http://hlt.fbk.eu/people/bisazza.
AMIRA (Diab et al., 2004) according to the ATB 9
We use case-sensitive BLEU and TER, but case-
scheme7 . The Arabic-English system uses cased insensitive METEOR to enable the use of paraphrase tables
distributed with the tool (version 1.3).
7 10
The Arabic Treebank tokenization scheme isolates con- Translation scores and significance tests were com-
junctions w+ and f+, prepositions l+, k+, b+, future marker puted with the Multeval toolkit (Clark et al., 2011):
s+, pronominal suffixes, but not the article Al+. https://github.com/jhclark/multeval.

(a) Arabic to English, IWSLTtst2010 (b) English to French, IWSLTtst2010
Added InDomain 10gLM BLEU MET TER Added InDomain 7gLM BLEU MET TER
.00 POS/words (all words) 26.1 30.5 55.4 .00 POS/words (all words) 31.1 52.5 49.9
.00 POS/lemmas (all lem.) 26.0 30.5 55.4 .00 POS/lemmas (all lem.) 31.2 52.6 49.7
1.0 POS/words (all POS) 25.9 30.6 55.3 1.0 POS/words (all POS) 31.4 52.8 49.8
.25 POS/words 26.5 30.6 54.7 .25 POS/lemmas 31.5 52.9 49.7
.50 POS/words 26.5 30.6 54.9 .50 POS/lemmas 31.9 53.3 49.5
.75 POS/words 26.3 30.7 55.0 .75 POS/lemmas 31.7 53.2 49.6
.25 POS/words(fdf) 26.5 30.7 54.7 .50 POS/lemmas(fdf) 31.9 53.3 49.5
.25 POS/lemmaF 26.4 30.6 54.8 .50 POS/lemmaF 31.6 53.0 49.6
.25 POS/lemmas 26.5 30.8 54.6 .50 POS/words 31.7 53.1 49.5

Table 7: Comparison of various hybrid LM variants. Translation quality is measured with BLEU, METEOR and
TER (all in percentage form). The settings used for weight tuning are marked with . Best models according to
all metrics are highlighted in bold.

brid LMs simply based on term frequency; only Comparison with baseline. In Table 8 the
METEOR gains 0.1 points in Arabic-English. A best performing hybrid LM is compared against
possible reason for this is that document fre- the baseline that only includes the standard LMs
quency was computed on fixed-size text chunks described in Section 5.2. To complete our eval-
rather than on real document boundaries (see Sec- uation, we also report the effect of an in-domain
tion 4.1). The lemmaF experiment refers to the LM trained on 50 word classes induced from the
use of canonical forms for frequency measuring: corpus by maximum-likelihood based clustering
this technique does not seem to help in either lan- (Och, 1999).
guage pair. Finally, we compare the use of lem- In the two language pairs, both types of LM
mas versus surface forms to represent common result in consistent improvements over the base-
words. As expected, lemmas appear to be help- line. However, the gains achieved by the hybrid
ful for French language modeling. Interestingly approach are larger and all statistically signifi-
this is also the case for English, even if by a small cant. The hybrid approach is significantly bet-
margin (+0.2 METEOR, -0.1 TER). ter than the unsupervised one by TER in Arabic-
English and by BLEU and METEOR in English-
Summing up, hybrid mapping appears as a French (these siginificances are not reported in
winning strategy compared to uniform map-
ping. Although differences among LM variants
(a) Arabic to English, IWSLTtst2010
are small, the best model in Arabic-English is
.25-POS/lemmas, which can be thought of as Added InDomain
BLEU MET TER
a domain-generic lemma-level LM. In English- 10g LM
French, instead, the highest scores are achieved none (baseline) 26.0 30.4 55.6
by .50-POS/lemmas or .50-POS/lemmas(fdf), that unsup. classes 26.4 30.8 55.1
is POS-level LM with few frequently occurring hybrid 26.5 (+.5) 30.8 (+.4) 54.6 (-1.0)

lexical anchors (vocabulary size 59). An inter- (b) English to French, IWSLTtst2010
pretation of this result is that, for French, mod- Added InDomain
BLEU MET TER
eling the syntax is more helpful than modeling 7g LM
the style. We also suspect that the French TED none (baseline) 31.2 52.7 49.8
corpus is more irregular and diverse with respect unsup. classes 31.5 52.9 49.6
to the style, than its English counterpart. In fact, hybrid 31.9 (+.7) 53.3 (+.6) 49.5 (-.3)
while the English corpus include transcripts of
talks given by English speakers, the French one is Table 8: Final MT results: baseline vs unsupervised
mostly a collection of (human) translations. Typi- word classes-based LM and best hybrid LM. Statis-
cal features of the speech style may have been lost tically significant improvements over the baseline are
in this process. marked with at the p < .01 and at the p < .05 level.

the table for clarity). The proposed method appears to better leverage the available in-domain data, achieving improvements according to all metrics: +0.5/+0.4/-1.0 BLEU/METEOR/TER in Arabic-English and +0.7/+0.6/-0.3 in English-French, without requiring any bitext annotation or decoder modification.

Talk-level analysis. To conclude the study, we analyze the effect of our best hybrid LM on Arabic-English translation quality, at the single talk level. The test used in the experiments (tst2010) consists of 11 transcripts with an average length of 151±73 sentences. For each talk, we compare the baseline BLEU score with that obtained by adding a .25-POS/lemmas hybrid LM. Results are presented in Figure 2. The dark and light columns denote baseline and hybrid-LM BLEU scores, respectively, and refer to the left y-axis. Additional data points, plotted on the right y-axis in reverse order, represent talk-level perplexities (PP) of a standard 5-gram LM trained on TED and those of the .25-POS/lemmas 10-gram hybrid LM, computed on reference translations.

What emerges first is a dramatic variation of performance among the speeches, with baseline BLEU scores ranging from 33.95 on talk 00 to only 12.42 on talk 02. The latter talk appears as a corner case also according to perplexities (397 by word LM and 111 by hybrid LM). Notably, the perplexities of the two LMs correlate well with each other, but the hybrid's PP is much more stable across talks: its standard deviation is only 14 points, while that of the word-based PP is 79. The BLEU improvement given by hybrid LM, however modest, is consistent across the talks, with only two outliers: a drop of -0.2 on talk 00, and a drop of -0.7 on talk 02. The largest gain (+1.1) is observed on talk 10, from 16.8 to 17.9 BLEU.

Figure 2: Talk-level evaluation on Arabic-English (IWSLT-tst2010). Left y-axis: BLEU impact of a .25-POS/lemma hybrid LM. Right y-axis: perplexities by word LM and by hybrid LM.

6 Conclusions

We have proposed a language modeling technique that leverages the in-domain data for SMT style adaptation. Trained to predict mixed sequences of POS classes and frequent words, hybrid LMs are devised to capture typical lexical and syntactic constructions that characterize the style of speech transcripts.

Compared to standard language models, hybrid LMs generalize better to the test data and partially compensate for the disproportion between in-domain and out-of-domain training data. At the same time, hybrid LMs show more discriminative power than merely POS-level LMs. The integration of hybrid LMs into a competitive phrase-based SMT system is straightforward and leads to consistent improvements on the TED task, according to three different translation quality metrics.

Target language modeling is only one aspect of the statistical translation problem. Now that the usability of the proposed method has been assessed for language modeling, future work will address the extension of the idea to the modeling of phrase translation and reordering.

Acknowledgments

This work was supported by the T4ME network of excellence (IST-249119), funded by the DG INFSO of the European Commission through the 7th Framework Programme. We thank the anonymous reviewers for their valuable suggestions.

References

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72, Ann Arbor, Michigan, June. Association for Computational Linguistics.
Jerome R. Bellegarda. 2004. Statistical language Processing, pages 848856, Morristown, NJ, USA.
model adaptation: review and perspectives. Speech Association for Computational Linguistics.
Communication, 42(1):93 108. Christian Hardmeier, Jorg Tiedemann, Markus Saers,
Alexandra Birch, Miles Osborne, and Philipp Koehn. Marcello Federico, and Mathur Prashant. 2011.
2007. CCG supertags in factored statistical ma- The Uppsala-FBK systems at WMT 2011. In Pro-
chine translation. In Proceedings of the Second ceedings of the Sixth Workshop on Statistical Ma-
Workshop on Statistical Machine Translation, pages chine Translation, pages 372378, Edinburgh, Scot-
916, Prague, Czech Republic, June. Association land, July. Association for Computational Linguis-
for Computational Linguistics. tics.
Arianna Bisazza, Nick Ruiz, and Marcello Fed- Katrin Kirchhoff and Mei Yang. 2005. Improved lan-
erico. 2011. Fill-up versus Interpolation Meth- guage modeling for statistical machine translation.
ods for Phrase-based SMT Adaptation. In Interna- In Proceedings of the ACL Workshop on Building
tional Workshop on Spoken Language Translation and Using Parallel Texts, pages 125128, Ann Ar-
(IWSLT), San Francisco, CA. bor, Michigan, June. Association for Computational
P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, Linguistics.
and R. L. Mercer. 1992. Class-based n-gram mod- Philipp Koehn and Hieu Hoang. 2007. Factored
els of natural language. Computational Linguistics, translation models. In Proceedings of the 2007
18(4):467479. Joint Conference on Empirical Methods in Natural
Mauro Cettolo, Nicola Bertoldi, and Marcello Fed- Language Processing and Computational Natural
erico. 2011. Methods for smoothing the optimizer Language Learning (EMNLP-CoNLL), pages 868
instability in SMT. In MT Summit XIII: the Thir- 876, Prague, Czech Republic, June. Association for
teenth Machine Translation Summit, pages 3239, Computational Linguistics.
Xiamen, China. Philipp Koehn, Amittai Axelrod, Alexandra Birch
Jonathan Clark, Chris Dyer, Alon Lavie, and Mayne, Chris Callison-Burch, Miles Osborne, and
Noah Smith. 2011. Better hypothesis testing David Talbot. 2005. Edinburgh system description
for statistical machine translation: Controlling for the 2005 IWSLT speech translation evaluation.
for optimizer instability. In Proceedings of In Proc. of the International Workshop on Spoken
the Association for Computational Lingustics, Language Translation, October.
ACL 2011, Portland, Oregon, USA. Associa- P. Koehn, H. Hoang, A. Birch, C. Callison-Burch,
tion for Computational Linguistics. available at M. Federico, N. Bertoldi, B. Cowan, W. Shen,
http://www.cs.cmu.edu/ jhclark/pubs/significance.pdf. C. Moran, R. Zens, C. Dyer, O. Bojar, A. Con-
Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. stantin, and E. Herbst. 2007. Moses: Open Source
2004. Automatic Tagging of Arabic Text: From Toolkit for Statistical Machine Translation. In Pro-
Raw Text to Base Phrase Chunks. In Daniel Marcu ceedings of the 45th Annual Meeting of the Associa-
Susan Dumais and Salim Roukos, editors, HLT- tion for Computational Linguistics Companion Vol-
NAACL 2004: Short Papers, pages 149152, ume Proceedings of the Demo and Poster Sessions,
Boston, Massachusetts, USA, May 2 - May 7. As- pages 177180, Prague, Czech Republic.
sociation for Computational Linguistics. Christof Monz. 2011. Statistical Machine Translation
Marcello Federico, Nicola Bertoldi, and Mauro Cet- with Local Language Models. In Proceedings of the
tolo. 2008. IRSTLM: an Open Source Toolkit for 2011 Conference on Empirical Methods in Natural
Handling Large Scale Language Models. In Pro- Language Processing, pages 869879, Edinburgh,
ceedings of Interspeech, pages 16181621, Mel- Scotland, UK., July. Association for Computational
bourne, Australia. Linguistics.
Marcello Federico, Luisa Bentivogli, Michael Paul, Preslav Nakov. 2008. Improving English-Spanish
and Sebastian Stuker. 2011. Overview of the Statistical Machine Translation: Experiments in
IWSLT 2011 Evaluation Campaign. In Interna- Domain Adaptation, Sentence Paraphrasing, Tok-
tional Workshop on Spoken Language Translation enization, and Recasing. . In Workshop on Statis-
(IWSLT), San Francisco, CA. tical Machine Translation, Association for Compu-
George Foster and Roland Kuhn. 2007. Mixture- tational Linguistics.
model adaptation for SMT. In Proceedings of the Franz Josef Och. 1999. An efficient method for de-
Second Workshop on Statistical Machine Transla- termining bilingual word classes. In Proceedings of
tion, pages 128135, Prague, Czech Republic, June. the 9th Conference of the European Chapter of the
Association for Computational Linguistics. Association for Computational Linguistics (EACL),
Michel Galley and Christopher D. Manning. 2008. A pages 7176.
simple and effective hierarchical phrase reordering Franz Josef Och. 2003. Minimum Error Rate Train-
model. In EMNLP 08: Proceedings of the Con- ing in Statistical Machine Translation. In Erhard
ference on Empirical Methods in Natural Language Hinrichs and Dan Roth, editors, Proceedings of the

41st Annual Meeting of the Association for Compu-
tational Linguistics, pages 160167.
Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu. 2002. BLEU: a method for auto-
matic evaluation of machine translation. In Pro-
ceedings of the 40th Annual Meeting of the Asso-
ciation of Computational Linguistics (ACL), pages
311318, Philadelphia, PA.
Stefan Riezler and John T. Maxwell. 2005. On some
pitfalls in automatic evaluation and significance
testing for MT. In Proceedings of the ACL Work-
shop on Intrinsic and Extrinsic Evaluation Mea-
sures for Machine Translation and/or Summariza-
tion, pages 5764, Ann Arbor, Michigan, June. As-
sociation for Computational Linguistics.
R. Rosenfeld. 2000. Two decades of statistical lan-
guage modeling: where do we go from here? Pro-
ceedings of the IEEE, 88(8):1270 1278.
Nick Ruiz and Marcello Federico. 2011. Topic adap-
tation for lecture translation through bilingual la-
tent semantic models. In Proceedings of the Sixth
Workshop on Statistical Machine Translation, pages
294302, Edinburgh, Scotland, July. Association
for Computational Linguistics.
Nick Ruiz, Arianna Bisazza, Fabio Brugnara, Daniele
Falavigna, Diego Giuliani, Suhel Jaber, Roberto
Gretter, and Marcello Federico. 2011. FBK @
IWSLT 2011. In International Workshop on Spo-
ken Language Translation (IWSLT), San Francisco,
CA.
Helmut Schmid. 1994. Probabilistic part-of-speech
tagging using decision trees. In Proceedings of In-
ternational Conference on New Methods in Lan-
guage Processing.
Matthew Snover, Bonnie Dorr, Rich Schwartz, Linnea
Micciulla, and John Makhoul. 2006. A study of
translation edit rate with targeted human annotation.
In 5th Conference of the Association for Machine
Translation in the Americas (AMTA), Boston, Mas-
sachusetts, August.
Christoph Tillmann. 2004. A Unigram Orientation
Model for Statistical Machine Translation. In Pro-
ceedings of the Joint Conference on Human Lan-
guage Technologies and the Annual Meeting of the
North American Chapter of the Association of Com-
putational Linguistics (HLT-NAACL).
A. Yazgan and M. Saraclar. 2004. Hybrid language
models for out of vocabulary word detection in large
vocabulary conversational speech recognition. In
Proceedings of ICASSP, volume 1, pages I 7458
vol.1, may.

448
Detecting Highly Confident Word Translations from Comparable
Corpora without Any Prior Knowledge

Ivan Vulic and Marie-Francine Moens


Department of Computer Science
KU Leuven
Celestijnenlaan 200A
Leuven, Belgium
{ivan.vulic,marie-francine.moens}@cs.kuleuven.be

Abstract

In this paper, we extend the work on using latent cross-language topic models for identifying word translations across comparable corpora. We present a novel precision-oriented algorithm that relies on per-topic word distributions obtained by the bilingual LDA (BiLDA) latent topic model. The algorithm aims at harvesting only the most probable word translations across languages in a greedy fashion, without any prior knowledge about the language pair, relying on a symmetrization process and the one-to-one constraint. We report results for the Italian-English and Dutch-English language pairs that outperform the current state-of-the-art by a significant margin. In addition, we show how to use the algorithm for the construction of high-quality initial seed lexicons of translations.

1 Introduction

Bilingual lexicons serve as an invaluable resource of knowledge in various natural language processing tasks, such as dictionary-based cross-language information retrieval (Carbonell et al., 1997; Levow et al., 2005) and statistical machine translation (SMT) (Och and Ney, 2003). In order to construct high-quality bilingual lexicons for different domains, one usually needs to possess parallel corpora or build such lexicons by hand. Compiling such lexicons manually is often an expensive and time-consuming task, whereas the methods for mining lexicons from parallel corpora are not applicable for language pairs and domains where such corpora are unavailable or missing. Therefore, the focus of researchers has turned to comparable corpora, which consist of documents with partially overlapping content, usually available in abundance. Thus, it is much easier to build a high-volume comparable corpus. A representative example of such a comparable text collection is Wikipedia, where one may observe articles discussing a similar topic, but strongly varying in style, length and vocabulary, while still sharing a certain amount of main concepts (or topics).

Over the years, several approaches for mining translations from non-parallel corpora have emerged (Rapp, 1995; Fung and Yee, 1998; Rapp, 1999; Diab and Finch, 2000; Dejean et al., 2002; Chiao and Zweigenbaum, 2002; Gaussier et al., 2004; Fung and Cheung, 2004; Morin et al., 2007; Haghighi et al., 2008; Shezaf and Rappoport, 2010; Laroche and Langlais, 2010), all sharing the same Firthian assumption, often called the distributional hypothesis (Harris, 1954), which states that words with a similar meaning are likely to appear in similar contexts across languages. All these methods have examined different representations of word contexts and different methods for matching words across languages, but they all have in common a need for a seed lexicon of translations to efficiently bridge the gap between languages. That seed lexicon is usually crawled from the Web or obtained from parallel corpora. Recently, Li et al. (2011) have proposed an approach that improves the precision of existing bilingual lexicon extraction methods by improving the comparability of the corpus under consideration, prior to extracting the actual bilingual lexicons. Other methods, such as Koehn and Knight (2002), try to design a bootstrapping algorithm based on an initial seed lexicon of translations and various lexical evidences. However, the quality of their initial seed lexicon is disputable,
since the construction of their lexicon is language-pair biased and cannot be completely employed on distant languages. It solely relies on unsatisfactory language-pair independent cross-language clues such as words shared across languages.

Recent work by Vulic et al. (2011) utilized the distributional hypothesis in a different direction. It attempts to abrogate the need for a seed lexicon as a prerequisite for bilingual lexicon extraction. They train a cross-language topic model on document-aligned comparable corpora and introduce different methods for identifying word translations across languages, underpinned by the per-topic word distributions from the trained topic model. Because they deal with comparable Wikipedia data, their translation model contains a lot of noise, and some words are poorly translated simply because there are not enough occurrences in the corpus. The goal of this work is to design an algorithm that learns to harvest only the most probable translations from the per-word topic distributions. The translations learned by the algorithm might then serve as a highly accurate, precision-based initial seed lexicon, which can in turn be used as a tool for translating source word vectors into the target language. The key advantage of such a lexicon lies in the fact that there is no language-pair dependent prior knowledge involved in its construction (e.g., orthographic features). Hence, it is completely applicable to any language pair for which sufficient comparable data exist for training the topic model.

Since comparable corpora often constitute a very noisy environment, it is of the utmost importance for a precision-oriented algorithm to learn when to stop the process of matching words, and which candidate pairs are surely not translations of each other. The method described in this paper follows this intuition: while extracting a bilingual lexicon, we try to rematch words, keeping only the most confident candidate pairs and disregarding all the others. After that step, the most confident candidate pairs might be used with some of the existing context-based techniques to find translations for the words discarded in the previous step. The algorithm is based on: (1) the assumption of symmetry, and (2) the one-to-one constraint. The idea of symmetrization has been borrowed from the symmetrization heuristics introduced for word alignments in SMT (Och and Ney, 2003), where the intersection heuristic is employed for a precision-oriented algorithm. In our setting, it basically means that we keep a translation pair (w_i^S, w_j^T) if and only if, after the symmetrization process, the top translation candidate for the source word w_i^S is the target word w_j^T and vice versa. The one-to-one constraint aims at matching the most confident candidates during the early stages of the algorithm, and then excluding them from further search. The utility of the constraint for parallel corpora has already been evaluated by Melamed (2000).

The remainder of the paper is structured as follows. Section 2 gives a brief overview of the methods relying on per-topic word distributions, which serve as the tool for computing cross-language similarity between words. In Section 3, we motivate the main assumptions of the algorithm and describe the full algorithm. Section 4 justifies the underlying assumptions of the algorithm by providing comparisons with a current state-of-the-art system for the Italian-English and Dutch-English language pairs. It also contains another set of experiments which investigates the potential of the algorithm in building a language-pair unbiased seed lexicon, and compares that lexicon with other seed lexicons. Finally, Section 5 lists conclusions and possible paths of future work.

2 Calculating Initial Cross-Language Word Similarity

This section gives a quick overview of the Cue method, the TI method, and their combination, described by Vulic et al. (2011), which proved to be the most efficient and accurate for identifying potential word translations once the cross-language BiLDA topic model is trained and the associated per-topic distributions are obtained for both the source and target corpora. The BiLDA model we use is a natural extension of the standard LDA model and, along with the definition of per-topic word distributions, has been presented in (Ni et al., 2009; De Smet and Moens, 2009; Mimno et al., 2009). BiLDA takes advantage of the document alignment by using a single variable that contains the topic distribution θ. This variable is language-independent, because it is shared by each of the paired bilingual comparable documents. Topics for each document are sampled from θ, from which the words are then sampled in conjunction with the vocabulary distributions φ (for language S) and ψ (for language T).

Figure 1: Plate model of the bilingual Latent Dirichlet Allocation (BiLDA) model.
2.1 Cue Method

A straightforward approach to expressing similarity between words emphasizes the associative relation in a natural way, by modeling the probability P(w_2^T | w_1^S), i.e., the probability that a target word w_2^T will be generated as a response to a cue source word w_1^S, where the link between the words is established via the shared topic space: P(w_2^T | w_1^S) = Σ_{k=1}^{K} P(w_2^T | z_k) P(z_k | w_1^S), where K denotes the number of cross-language topics.

2.2 TI Method

This approach constructs word vectors over a shared space of cross-language topics, where the values within the vectors are TF-ITF scores (term frequency - inverse topic frequency), computed in a completely analogous manner to the TF-IDF scores for the original word-document space (Manning and Schutze, 1999). Term frequency, given a source word w_i^S and a topic z_k, measures the importance of the word w_i^S within the particular topic z_k, while the inverse topical frequency (ITF) of the word w_i^S measures the general importance of the source word w_i^S across all topics. The final TF-ITF score for the source word w_i^S and the topic z_k is given by TF-ITF_{i,k} = TF_{i,k} * ITF_i. The TF-ITF scores for target words associated with target topics are calculated in an analogous manner, and the standard cosine similarity is then used to find the most similar target word vectors for a given source word vector.

2.3 Combining the Methods

Topic models have the ability to build clusters of words which might not always co-occur together in the same textual units and therefore add extra information of potential relatedness. These two methods for automatic bilingual lexicon extraction interpret and exploit the underlying per-topic word distributions in different ways, so combining the two should lead to even better results. The two methods are linearly combined, with the overall score given by:

Sim_{TI+Cue}(w_1^S, w_2^T) = λ * Sim_{TI}(w_1^S, w_2^T) + (1 - λ) * Sim_{Cue}(w_1^S, w_2^T)    (1)

Both methods possess several desirable properties. According to Griffiths et al. (2007), the conditioning in the Cue method automatically compromises between word frequency and semantic relatedness, since higher-frequency words tend to have higher probability across all topics, but the distribution over topics P(z_k | w_1^S) ensures that semantically related topics dominate the sum. A similar phenomenon is captured by the TI method through the usage of TF, which rewards high-frequency words, and ITF, which assigns a higher importance to words semantically more related to a specific topic. These properties are incorporated in the combination of the methods. As the final result, the combined method provides, for each source word, a ranked list of target words with associated scores that measure the strength of cross-language similarity. The higher the score, the more confident a translation pair is. We will use this observation in the next section during the construction of the algorithm.

The lexicon constructed by solely applying the combination of these methods, without any additional assumptions, will serve as a baseline in the results section.
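The three scores above lend themselves to a compact implementation. The following is a minimal illustrative sketch, not code from the paper: it assumes that the per-topic word distributions are available as matrices phi_s (K x |V_S|) and phi_t (K x |V_T|), that P(z_k | w_1^S) may be approximated by normalizing the source word's column under a uniform topic prior, and that lam plays the role of the interpolation weight λ in Equation (1); whether the raw Cue and TI scores should be rescaled before interpolation is left open here.

import numpy as np

def cue_scores(phi_s, phi_t, s):
    # Cue method: P(w_T | w_S) = sum_k P(w_T | z_k) * P(z_k | w_S).
    # P(z_k | w_S) is approximated by normalising the source word's column,
    # i.e. assuming a uniform prior over the K shared topics.
    p_z_given_w = phi_s[:, s] / phi_s[:, s].sum()
    return phi_t.T @ p_z_given_w                      # one Cue score per target word

def tf_itf_vectors(phi):
    # TF-ITF vectors over the shared topic space, one row per word:
    # TF_{i,k} is the weight of word i within topic k,
    # ITF_i = log(K / number of topics in which word i occurs).
    K = phi.shape[0]
    topics_containing = np.maximum(np.count_nonzero(phi > 1e-12, axis=0), 1)
    itf = np.log(K / topics_containing)
    return phi.T * itf[:, None]

def ti_scores(phi_s, phi_t, s):
    # TI method: cosine similarity between the source word's TF-ITF vector
    # and the TF-ITF vector of every target word.
    vs = tf_itf_vectors(phi_s)[s]
    vt = tf_itf_vectors(phi_t)
    return (vt @ vs) / (np.linalg.norm(vt, axis=1) * np.linalg.norm(vs) + 1e-12)

def ti_cue_scores(phi_s, phi_t, s, lam=0.1):
    # Equation (1): Sim_TI+Cue = lam * Sim_TI + (1 - lam) * Sim_Cue.
    return lam * ti_scores(phi_s, phi_t, s) + (1.0 - lam) * cue_scores(phi_s, phi_t, s)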
3 Constructing the Algorithm

This section explains the underlying assumptions of the algorithm: the assumption of symmetry and the one-to-one assumption. Finally, it provides the complete outline of the algorithm.

3.1 Assumption of Symmetry

First, we start with the intuition that the assumption of symmetry strengthens the confidence of a translation pair. In other words, if the most probable translation candidate for a source word w_1^S is a target word w_2^T and, vice versa, the most probable translation candidate of the target word w_2^T

is the source word w_1^S, and their TI+Cue scores are above a certain threshold, we can claim that the words w_1^S and w_2^T are a translation pair. The definition of the symmetric relation can also be relaxed. Instead of observing only the single top candidate from each list, we can observe the top N candidates from both sides and include them in the search space, and then re-rank the potential candidates taking into account their associated TI+Cue scores and their respective positions in the lists. We will call N the search space depth. Here is the outline of the re-ranking method if the search space consists of the top N candidates on both sides:

1. Given is a source word w_s^S, for which we actually want to find the most probable translation candidate. Initialize an empty list Final_s = {} in which target language candidates will be stored with their recalculated associated scores.

2. Obtain TI+Cue scores for all target words. Keep only the N best scoring target candidates {w_{s,1}^T, ..., w_{s,N}^T}, along with their respective scores.

3. For each target candidate from {w_{s,1}^T, ..., w_{s,N}^T}, acquire TI+Cue scores over the entire source vocabulary. Keep only the N best scoring source language candidates. Each word w_{s,i}^T in {w_{s,1}^T, ..., w_{s,N}^T} now has a list of N source language candidates associated with it: {w_{i,1}^S, w_{i,2}^S, ..., w_{i,N}^S}.

4. For each target candidate word w_{s,i}^T in {w_{s,1}^T, ..., w_{s,N}^T}, do as follows:
(a) If one of the words from the associated list is the given source word w_s^S, remember: (1) the position m, denoting how high in the list the word w_s^S was found, and (2) the associated TI+Cue score Sim_{TI+Cue}(w_{s,i}^T, w_{i,m}^S = w_s^S). Calculate:
(i) G_{1,i} = Sim_{TI+Cue}(w_s^S, w_{s,i}^T) / i
(ii) G_{2,i} = Sim_{TI+Cue}(w_{s,i}^T, w_{i,m}^S) / m
Following that, calculate GM_i, the geometric mean of the values G_{1,i} and G_{2,i}: GM_i = sqrt(G_{1,i} * G_{2,i}), and add the tuple (w_{s,i}^T, GM_i) to the list Final_s. (The scores G_{1,i} and G_{2,i} are structured in such a way as to balance between the positions in the ranked lists and the TI+Cue scores: they reward candidate words which have high TI+Cue scores associated with them, and penalize words that are found lower in the list of potential candidates.)
(b) If we have reached the end of the list for the target candidate word w_{s,i}^T without finding the given source word w_s^S, and i < N, continue with the next word w_{s,i+1}^T. Do not add any tuple to Final_s in this step.

5. If the list Final_s is not empty, sort the tuples in the list in descending order according to their GM_i scores. The first element of the sorted list contains a word w_{s,high}^T, the final translation candidate of the source word w_s^S, and the final result of this process is the cross-language word translation pair (w_s^S, w_{s,high}^T).

We will call this symmetrization process the symmetrizing re-ranking. It attempts to push the correct cross-language synonym to the top of the candidate list, taking into account both the strength of the similarities defined through the TI+Cue scores in both directions and the positions in the ranked lists. A clear example of how this process helps boost precision is presented in Figure 2. We can also design a thresholded variant of this procedure by imposing an extra constraint. When calculating target language candidates for the source word w_s^S in Step 2, we proceed further only if the first target candidate scores above a certain threshold P and, additionally, in Step 3, we keep the lists of N source language candidates only for those target words whose first source language candidate scored above the same threshold P. We will call this procedure the thresholded symmetrizing re-ranking, and this version will be employed in the final algorithm.
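To make the procedure concrete, the following is a minimal sketch of the thresholded symmetrizing re-ranking for a single source word. It is an illustration under simplifying assumptions rather than the authors' implementation: score_s2t and score_t2s are hypothetical functions returning TI+Cue-ranked candidate lists in the two translation directions (for instance built on the scoring sketch in Section 2), and n and threshold correspond to the search space depth N and the threshold P.

import math

def rerank(src_word, score_s2t, score_t2s, n, threshold):
    # score_s2t(word) and score_t2s(word) are assumed to return lists of
    # (candidate, TI+Cue score) pairs, sorted by descending score.
    target_candidates = score_s2t(src_word)[:n]
    if not target_candidates or target_candidates[0][1] < threshold:
        return None                              # threshold check in Step 2
    final = []
    for i, (tgt, sim_s2t) in enumerate(target_candidates, start=1):
        back = score_t2s(tgt)[:n]
        if not back or back[0][1] < threshold:
            continue                             # threshold check in Step 3
        for m, (src_back, sim_t2s) in enumerate(back, start=1):
            if src_back == src_word:             # Step 4(a): symmetry found at positions i, m
                g1, g2 = sim_s2t / i, sim_t2s / m
                final.append((tgt, math.sqrt(g1 * g2)))
                break
    if not final:
        return None                              # Step 4(b) applied to every candidate
    return max(final, key=lambda pair: pair[1])[0]   # Step 5: best geometric mean

Combined with the one-to-one removal of already matched words introduced below, this is the behaviour illustrated on the abdij/abbey example in Figure 2.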
3.2 One-to-one Assumption

Melamed (2000) has already established that most source words in parallel corpora tend to translate to only one target word. That tendency is modeled by the one-to-one assumption, which constrains each source word to have at most one translation on the target side. Melamed's paper reports that this bias leads to a significant positive impact on the precision and recall of bilingual lexicon extraction from parallel corpora. This assumption should also be reasonable for many types of comparable corpora, such as Wikipedia or news corpora, which are topically aligned or cover similar themes. We

[Figure 2: ranked lists of translation candidates with their TI+Cue scores, linking the English words monastery, monk and abbey to the Dutch candidates klooster, monnik, benedictijn and abdij.]

Figure 2: An example where the assumption of symmetry and the one-to-one assumption clearly help boost
precision. If we keep top Nc = 3 candidates from both sides, the algorithm is able to detect that the correct
Dutch-English translation pair is (abdij, abbey). The TI+Cue method without any assumptions would result with
an indirect association (abdij, monastery). If only the one-to-one assumption was present, the algorithm would
greedily learn the correct direct association (monastery, klooster), remove those words from their respective
vocabularies and then again result with another indirect association (abdij, monk). By additionally employing
the assumption of symmetry with the re-ranking method from Subsection 3.1, the algorithm correctly learns
the translation pair (abdij, abbey). Correct translation pairs (klooster, monastery) and (monnik, monk) are also
obtained. Again here, the pair (monnik, monk) would not be obtained without the one-to-one assumption.

will prove that the assumption leads to better precision scores even for bilingual lexicon extraction from such comparable data. The intuition behind introducing this constraint is fairly simple. Without the assumption, the similarity scores between source and target words are calculated independently of each other. We will illustrate the problem arising from the independence assumption with an example.

Suppose we have an Italian word arcipelago, and we would like to detect its correct English translation (archipelago). However, after the TI+Cue method is employed, and even after the symmetrizing re-ranking process from the previous step is used, we still acquire a wrong translation candidate pair (arcipelago, island). Why is that so? The word arcipelago (or its translation) and the acquired translation island are semantically very close, and therefore have similar distributions over cross-language topics, but island is a much more frequent term. The TI+Cue method concludes that two words are potential translations whenever their distributions over cross-language topics are much more similar than expected by chance. Moreover, it gives a preference to more frequent candidates, so it will eventually end up learning an indirect association between the words arcipelago and island. (A direct association, as defined by Melamed (2000), is an association between two words, in this setting one found by the TI+Cue method, where the two words are indeed mutual translations; otherwise, it is an indirect association.) The one-to-one assumption should mitigate the problem of such indirect associations if we design our algorithm in such a way that it learns the most confident direct associations first:

1. Learn the correct direct association pair (isola, island).
2. Remove the words isola and island from their respective vocabularies.
3. Since island is no longer in the vocabulary, the indirect association between arcipelago and island is not present any more. The algorithm learns the correct direct association (arcipelago, archipelago).

3.3 The Algorithm

3.3.1 One-Vocabulary-Pass

First, we provide a version of the algorithm with a fixed threshold P which completes only one pass through the source vocabulary. Let V^S denote a given source vocabulary, and let V^T denote a given target vocabulary. We need to define several parameters of the algorithm. Let N_0 be the initial maximum search space depth for the thresholded symmetrizing re-ranking procedure. In Figure 2, the current depth N_c is 3, while the maximum depth might be set to a value higher than 3. The algorithm with the fixed threshold P proceeds as follows:

1. Initialize the maximum search space depth N_M = N_0. Initialize an empty lexicon L.
2. For each source word w_s^S in V^S do:
(a) Set the current search space depth N_c = 1. (The intuition here is simple: we are trying to detect a direct association as high as possible in the list. In other words, if the first translation candidate for the source word isola is the target word island and, vice versa, the first translation candidate for the target word island is isola, we do not need to expand our search depth, because these two words are the most likely translations.)
(b) Perform the thresholded symmetrizing re-ranking procedure with the current search space depth N_c and the threshold P. If a translation pair (w_s^S, w_{s,high}^T) is found, go to Sub-step 2(d).
(c) If a translation pair is not found and N_c < N_M, increment the current search space depth, N_c = N_c + 1, and return to Sub-step 2(b). If a translation pair is not found and N_c = N_M, return to Step 2 and proceed with the next word.
(d) For the found translation pair (w_s^S, w_{s,high}^T), remove the words w_s^S and w_{s,high}^T from their respective vocabularies, V^S = V^S \ {w_s^S} and V^T = V^T \ {w_{s,high}^T}, to satisfy the one-to-one constraint. Add the pair (w_s^S, w_{s,high}^T) to the lexicon L.

We will name this procedure the one-vocabulary-pass and employ it later in an iterative algorithm with a varying threshold and a varying maximum search space depth.

3.3.2 The Final Algorithm

Let us now define P_0 as the initial threshold, let P_f be the threshold at which we stop decreasing the threshold value and start expanding the maximum search space depth for the thresholded symmetrizing re-ranking, and let dec_p be the value by which we decrease the current threshold in each step. Finally, let N_f be the limit for the maximum search space depth, and let N_M denote the current maximum search space depth. The final algorithm is given by:

1. Initialize the maximum search space depth N_M = N_0 and the starting threshold P = P_0. Initialize an empty lexicon L_final.
2. Check the stopping criterion: if N_M > N_f, go to Step 5, otherwise continue with Step 3.
3. Perform the one-vocabulary-pass with the current values of P and N_M. Whenever a translation pair is found, it is added to the lexicon L_final. Additionally, we can also save the threshold and the depth at which that pair was found.
4. Decrease P: P = P - dec_p, and check whether P < P_f. If not, go to Step 3 and perform the one-vocabulary-pass again. Otherwise, if P < P_f and there are still unmatched words in the source vocabulary, reset P to P_0, increment N_M to N_M + 1, and go to Step 2.
5. Return L_final as the final output of the algorithm.
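Read as pseudocode, the final algorithm is two nested loops over the threshold and the maximum search space depth around the one-vocabulary-pass. The sketch below is only an illustration under assumptions: it reuses the hypothetical rerank function from the previous sketch, assumes that the scorers only rank words still present in the shrinking vocabularies, and uses as defaults the parameter values reported later in the paper (P_0 = 0.20, P_f = 0.00, dec_p = 0.01, N_0 = 3, N_f = 10).

def extract_lexicon(v_src, v_tgt, score_s2t, score_t2s,
                    p0=0.20, pf=0.00, dec_p=0.01, n0=3, nf=10):
    # Repeated one-vocabulary-passes with a decreasing threshold and, once the
    # threshold range is exhausted, an increasing maximum search space depth.
    lexicon = {}
    v_src, v_tgt = set(v_src), set(v_tgt)
    n_max = n0
    while n_max <= nf and v_src and v_tgt:
        p = p0
        while p >= pf:
            # one-vocabulary-pass with the current threshold p and depth limit n_max
            for src in sorted(v_src):
                for n_cur in range(1, n_max + 1):        # expand the depth gradually
                    tgt = rerank(src, score_s2t, score_t2s, n_cur, p)
                    if tgt is not None and tgt in v_tgt:
                        lexicon[src] = tgt
                        v_src.discard(src)               # one-to-one constraint:
                        v_tgt.discard(tgt)               # matched words are removed
                        break
            p -= dec_p
        n_max += 1
    return lexicon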
The parameters of the algorithm model its behavior. Typically, we would like to set P_0 to a high value and N_0 to a low value, which makes our constraints strict and narrows the search space, and consequently extracts fewer translation pairs in the first steps of the algorithm, but the set of those translation pairs should be highly accurate. Once it is not possible to extract any more pairs with such strict constraints, the algorithm re-

laxes them by lowering the threshold and expand- BiLDA training are obtained from Vulic et al.
ing the search space by incrementing the max- (2011). We train the BiLDA model with 2000
imum search space depth. The algorithm may topics using Gibbs sampling, since that number
leave some of the source words unmatched, which of topics displays the best performance in their
is also dependent on the parameters of the algo- paper. The linear interpolation parameter for the
rithm, but, due to the one-to-one assumption, that combined TI+Cue method is set to = 0.1.
scenario also occurs whenever a target vocabulary The parameters of the algorithm, adjusted on a
contains more words than a source vocabulary. set of 500 randomly sampled Italian words, are set
The number of operations of the algorithm also to the following values in all experiments, except
depends on the parameters, but it mostly depends where noted different: P0 = 0.20, Pf = 0.00,
on the sizes of the given vocabularies. The com- decp = 0.01, N0 = 3, and Nf = 10.
plexity is O(|V S ||V T |), but the algorithm is com- The initial ground truth for our source vocab-
putationally feasible even for large vocabularies. ularies has been constructed by the freely avail-
able Google Translate tool. The final ground truth
4 Results and Discussion for our test sets has been established after we
4.1 Training Collections have manually revised the list of pairs obtained by
Google Translate, deleting incorrect entries and
The data used for training of the models is col- adding additional correct entries. All translation
lected from various sources and varies strongly in candidates are evaluated against this benchmark
theme, style, length and its comparableness. In lexicon.
order to reduce data sparsity, we keep only lem-
matized non-proper noun forms. 4.3 Experiment I: Do Our Assumptions Help
For Italian-English language pair, we use Lexicon Extraction?
18, 898 Wikipedia article pairs to train BiLDA,
With this set of experiments, we wanted to test
covering different themes with different scopes
whether both the assumption of symmetry and
and subtopics being addressed. Document align-
the one-to-one assumption are useful in improv-
ment is established via interlingual links from the
ing precision of the initial TI+Cue lexicon extrac-
Wikipedia metadata. Our vocabularies consist of
tion method. We compare three different lexicon
7, 160 Italian nouns and 9, 116 English nouns.
extraction algorithms: (1) the basic TI+Cue ex-
For Dutch-English language pair, we use 7, 602
traction algorithm (LALG-BASIC) which serves
Wikipedia article pairs, and 6, 206 Europarl doc-
as the baseline algorithm5 , (2) the algorithm from
ument pairs, and combine them for training.4 Our
Section 3, but without the one-to-one assump-
final vocabularies consist of 15, 284 Dutch nouns
tion (LALG-SYM), meaning that if we find a
and 12, 715 English nouns.
translation pair, we still keep words from the
Unlike, for instance, Wikipedia articles, where
translation pair in their respective vocabularies,
document alignment is established via interlin-
and (3) the complete algorithm from Section 3
gual links, in some cases it is necessary to perform
(LALG-ALL). In order to evaluate these lexicon
document alignment as the initial step. Since our
extraction algorithms for both Italian-English and
work focuses on Wikipedia data, we will not get
Dutch-English, we have constructed a test set of
into detail with algorithms for document align-
650 Italian nouns, and a test set of 1000 Dutch
ment. An IR-based method for document align-
nouns of high and medium frequency. Precision
ment is given in (Utiyama and Isahara, 2003;
scores for both language pairs and for all lexicon
Munteanu and Marcu, 2005), and a feature-based
extraction algorithms are provided in Table 1.
method can be found in (Vu et al., 2009).
Based on these results, it is clearly visible that
4.2 Experimental Setup both assumptions our algorithm makes are valid
All our experiments rely on BiLDA training 5
We have also tested whether LALG-BASIC outperforms
with comparable data. Corpora and software for a method modeling direct co-occurrence, that uses cosine
to detect similarity between word vectors consisting of TF-
4
In case of Europarl, we use only the evidence of docu- IDF scores in the shared document space (Cimiano et al.,
ment alignment during the training and do not benefit from 2009). Precision using that method is significantly lower,
the parallelness of the sentences in the corpus. e.g. 0.5538 vs. 0.6708 of LALG-BASIC for Italian-English.

1
LEX Algorithm Italian-English Dutch-English
IT-EN Precision
LALG-BASIC 0.6708 0.6560 IT-EN F-score
LALG-SYM 0.6862 0.6780 0.95 NL-EN Precision
NL-EN F-score
LALG-ALL 0.7215 0.7170
0.9
Table 1: Precision scores on our test sets for the 3 dif-

Precision/F-score
ferent lexicon extraction algorithms.
0.85

and contribute to better overall scores. Therefore 0.8

in all further experiments we will use the LALG-


ALL extraction algorithm. 0.75

4.4 Experiment II: How Does Thresholding 0.7

Affect Precision?
0.65
The next set of experiments aims at exploring how 0.2 0.15 0.1 0.05 0
precision scores change while we gradually de- Threshold

crease threshold values. The main goal of these


Figure 3: Precision and F0.5 scores in relation to
experiments is to detect when to stop with the ex-
threshold values. We can observe that the algorithm
traction of translation candidates in order to pre- retrieves only highly accurate translations for both lan-
serve a lexicon of only highly accurate transla- guage pairs while the threshold goes down from value
tions. We have fixed the maximum search space 0.2 to 0.1, while precision starts to drop significantly
depth N0 = Nf = 3. We used the same test sets after the threshold of 0.1. F0.5 scores also reach their
from Experiment I. Figure 3 displays the change peaks within that threshold region.
of precision in relation to different threshold val-
ues, where we start harvesting translations from
If we do not know anything about a given lan-
the threshold P0 = 0.2 down to Pf = 0.0. Since
guage pair, we can only use words shared across
our goal is to extract as many correct translation
languages as lexical clues for the construction of
pairs as possible, but without decreasing the pre-
a seed lexicon. It often leads to a low precision
cision scores, we have also examined what impact
lexicon, since many false friends are detected.
this gradual decrease of threshold also has on the
For Italian-English, we have found 431 nouns
number of extracted translations. We have opted
shared between the two languages, of which 350
for the F measure (van Rijsbergen, 1979):
were correct translations, leading to a precision
P recision Recall of 0.8121. As an illustration, if we take the
F = (1 + 2 ) (2) first 431 translation pairs retrieved by LALG-
2 P recision + Recall
ALL, there are 427 correct translation pairs, lead-
Since our task is precision-oriented, we have set ing to a precision of 0.9907. Some pairs do
= 0.5. F0.5 measure values precision as twice not share any orthographic similarities: (uccello,
as important as recall. The F0.5 scores are also bird), (tastiera, keyboard), (salute, health), (terre-
provided in Figure 3. moto, earthquake) etc.
Following Koehn and Knight (2002), we have
4.5 Experiment III: Building a Seed Lexicon also employed simple transformation rules for the
Finally, we wanted to test how many accurate adoption of words from one language to another.
translation pairs our best scoring LALG-ALL al- The rules specific to the Italian-English transla-
gorithm is able to acquire from the entire source tion process that have been employed are: (R1) if
vocabulary, with very high precision still remain- an Italian noun ends in ione, but not in zione,
ing paramount. The obtained highly-precise seed strip the final e to obtain the corresponding En-
lexicon then might be employed for an additional glish noun. Otherwise, strip the suffix zione,
bootstrapping procedure similar to (Koehn and and append tion; (R2) if a noun ends in ia,
Knight, 2002; Fung and Cheung, 2004) or sim- but not in zia or f ia, replace the suffix ia
ply for translating context vectors as in (Gaussier with y. If a noun ends in zia, replace the suf-
et al., 2004). fix with cy and if a noun ends in f ia, replace

                      Italian-English                  Dutch-English
Lexicon               # Correct  Precision  F0.5       # Correct  Precision  F0.5
LEX-1                    350      0.8121    0.1876        898      0.8618    0.2308
LEX-2                    766      0.8938    0.3473       1376      0.9011    0.3216
LEX-LALG                 782      0.8958    0.3524       1106      0.9559    0.2778
LEX-1+LEX-LALG          1070      0.8785    0.4290       1860      0.9082    0.3961
LEX-R+LEX-LALG          1141      0.9239    0.4548       1507      0.9642    0.3500
LEX-2+LEX-LALG          1429      0.8926    0.5102       2261      0.9217    0.4505

Table 2: A comparison of different lexicons. For lexicons employing our LALG-ALL algorithm, only translation
candidates that scored above the threshold P = 0.11 have been kept.

it with phy. Similar rules have been introduced Knight (2002) has been outperformed in terms of
for Dutch-English: the suffix tie is replaced by precision and coverage. Additionally, we have
tion, sie by sion, and teit by ty. shown that adding simple translation rules for lan-
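Read as string rewrites, the transformation rules (R1) and (R2) and the Dutch-English suffix rules above amount to a handful of suffix substitutions. The sketch below is only an illustrative reading of those rules as stated in the text, not code from the paper, and makes no claim about their coverage.

def italian_to_english(noun):
    # (R1): -ione but not -zione -> strip the final "e"; -zione -> -tion.
    if noun.endswith("ione") and not noun.endswith("zione"):
        return noun[:-1]
    if noun.endswith("zione"):
        return noun[:-5] + "tion"
    # (R2): -ia but not -zia/-fia -> -y; -zia -> -cy; -fia -> -phy.
    if noun.endswith("zia"):
        return noun[:-3] + "cy"
    if noun.endswith("fia"):
        return noun[:-3] + "phy"
    if noun.endswith("ia"):
        return noun[:-2] + "y"
    return None          # no rule applies

def dutch_to_english(noun):
    # Dutch-English suffix rules: -tie -> -tion, -sie -> -sion, -teit -> -ty.
    for nl, en in (("tie", "tion"), ("sie", "sion"), ("teit", "ty")):
        if noun.endswith(nl):
            return noun[: -len(nl)] + en
    return None

For instance, the rules map informazione to information and universiteit to university.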
Finally, we have compared the results of the guages sharing same roots might lead to even bet-
following constructed lexicons: ter scores (LEX-2+LEX-LALG). However, it is
not always possible to rely on such knowledge,
A lexicon containing only words shared and the usefulness of the designed LALG-ALL
across languages (LEX-1). algorithm really comes to the fore when the algo-
A lexicon containing shared words and trans- rithm is applied on distant language pairs which
lation pairs found by applying the language- do not share many words and cognates, and word
specific transformation rules (LEX-2). translation rules cannot be easily established. In
A lexicon containing only translation pairs such cases, without any prior knowledge about the
obtained by the LALG-ALL algorithm that languages involved in a translation process, one is
score above a certain threshold P (LEX- left with the linguistically unbiased LEX-1+LEX-
LALG). LALG lexicon, which also displays a promising
A combination of the lexicons LEX-1 and performance.
LEX-LALG (LEX-1+LEX-LALG). Non- 5 Conclusions and Future Work
matching duplicates are resolved by taking
We have designed an algorithm that focuses on ac-
the translation pair from LEX-LALG as the
quiring and keeping only highly confident trans-
correct one. Note that this lexicon is com-
lation candidates from multilingual comparable
pletely language-pair independent.
corpora. By employing the algorithm we have
A lexicon combining only translation pairs
improved precision scores of the methods rely-
found by applying the language-specific
ing on per-topic word distributions from a cross-
transformation rules and LEX-LALG (LEX-
language topic model. We have shown that the al-
R+LEX-LALG).
gorithm is able to produce a highly reliable bilin-
A combination of the lexicons LEX-2 and
gual seed lexicon even when all other lexical clues
LEX-LALG, where non-matching dupli-
are absent, thus making our algorithm suitable
cates are resolved by taking the translation
even for unrelated language pairs. In future work,
pair from LEX-LALG if it is present in
we plan to further improve the algorithm and use
LEX-1, and from LEX-2 otherwise (LEX-
it as a source of translational evidence for differ-
2+LEX-LALG).
ent alignment tasks in the setting of non-parallel
According to the results from Table 2, we can corpora.
conclude that adding translation pairs extracted
Acknowledgments
by our LALG-ALL algorithm has a major posi-
tive impact on both precision and coverage. Ob- The research has been carried out in the frame-
taining results for two different language pairs work of the TermWise Knowledge Platform (IOF-
proves that the approach is generic and appli- KP/09/001) funded by the Industrial Research
cable to any other language pairs. The previ- Fund K.U. Leuven, Belgium.
ous approach relying on work from Koehn and

References 46th Annual Meeting of the Association for Compu-
tational Linguistics, pages 771779.
Jaime G. Carbonell, Jaime G. Yang, Robert E. Fred-
Zellig S. Harris. 1954. Distributional structure. Word
erking, Ralf D. Brown, Yibing Geng, Danny Lee,
10, (23):146162.
Yiming Frederking, Robert E, Ralf D. Geng, and
Philipp Koehn and Kevin Knight. 2002. Learning a
Yiming Yang. 1997. Translingual information re-
translation lexicon from monolingual corpora. In
trieval: A comparative evaluation. In Proceedings
Proceedings of the ACL-02 Workshop on Unsuper-
of the 15th International Joint Conference on Arti-
vised Lexical Acquisition, pages 916.
ficial Intelligence, pages 708714.
Audrey Laroche and Philippe Langlais. 2010. Re-
Yun-Chuang Chiao and Pierre Zweigenbaum. 2002.
visiting context-based projection methods for term-
Looking for candidate translational equivalents in
translation spotting in comparable corpora. In Pro-
specialized, comparable corpora. In Proceedings
ceedings of the 23rd International Conference on
of the 19th International Conference on Computa-
Computational Linguistics, pages 617625.
tional Linguistics, pages 15.
Gina-Anne Levow, Douglas W. Oard, and Philip
Philipp Cimiano, Antje Schultz, Sergej Sizov, Philipp Resnik. 2005. Dictionary-based techniques for
Sorg, and Steffen Staab. 2009. Explicit versus cross-language information retrieval. Information
latent concept models for cross-language informa- Processing and Management, 41:523547.
tion retrieval. In Proceedings of the 21st Inter-
Bo Li, Eric Gaussier, and Akiko Aizawa. 2011. Clus-
national Joint Conference on Artifical Intelligence,
tering comparable corpora for bilingual lexicon ex-
pages 15131518.
traction. In Proceedings of the 49th Annual Meeting
Wim De Smet and Marie-Francine Moens. 2009. of the Association for Computational Linguistics:
Cross-language linking of news stories on the Web Human Language Technologies, pages 473478.
using interlingual topic modeling. In Proceedings Christopher D. Manning and Hinrich Schutze. 1999.
of the CIKM 2009 Workshop on Social Web Search Foundations of Statistical Natural Language Pro-
and Mining, pages 5764. cessing. MIT Press, Cambridge, MA, USA.
Herve Dejean, Eric Gaussier, and Fatia Sadat. 2002. I. Dan Melamed. 2000. Models of translational equiv-
An approach based on multilingual thesauri and alence among words. Computational Linguistics,
model combination for bilingual lexicon extraction. 26:221249.
In Proceedings of the 19th International Conference David Mimno, Hanna M. Wallach, Jason Naradowsky,
on Computational Linguistics, pages 17. David A. Smith, and Andrew McCallum. 2009.
Mona T. Diab and Steve Finch. 2000. A statis- Polylingual topic models. In Proceedings of the
tical translation model using comparable corpora. 2009 Conference on Empirical Methods in Natural
In Proceedings of the 6th Triennial Conference on Language Processing, pages 880889.
Recherche dInformation Assistee par Ordinateur Emmanuel Morin, Beatrice Daille, Koichi Takeuchi,
(RIAO), pages 15001508. and Kyo Kageura. 2007. Bilingual terminology
Pascale Fung and Percy Cheung. 2004. Mining very- mining - using brain, not brawn comparable cor-
non-parallel corpora: Parallel sentence and lexicon pora. In Proceedings of the 45th Annual Meeting
extraction via bootstrapping and EM. In Proceed- of the Association for Computational Linguistics,
ings of the Conference on Empirical Methods in pages 664671.
Natural Language Processing, pages 5763. Dragos Stefan Munteanu and Daniel Marcu. 2005.
Pascale Fung and Lo Yuen Yee. 1998. An IR ap- Improving machine translation performance by ex-
proach for translating new words from nonparallel, ploiting non-parallel corpora. Computational Lin-
comparable texts. In Proceedings of the 17th Inter- guistics, 31:477504.
national Conference on Computational Linguistics, Xiaochuan Ni, Jian-Tao Sun, Jian Hu, and Zheng
pages 414420. Chen. 2009. Mining multilingual topics from
Eric Gaussier, Jean-Michel Renders, Irina Matveeva, Wikipedia. In Proceedings of the 18th International
Cyril Goutte, and Herve Dejean. 2004. A geomet- World Wide Web Conference, pages 11551156.
ric view on bilingual lexicon extraction from com- Franz Josef Och and Hermann Ney. 2003. A sys-
parable corpora. In Proceedings of the 42nd Annual tematic comparison of various statistical alignment
Meeting of the Association for Computational Lin- models. Computational Linguistics, 29(1):1951.
guistics, pages 526533. Reinhard Rapp. 1995. Identifying word translations in
Thomas L. Griffiths, Mark Steyvers, and Joshua B. non-parallel texts. In Proceedings of the 33rd An-
Tenenbaum. 2007. Topics in semantic represen- nual Meeting of the Association for Computational
tation. Psychological Review, 114(2):211244. Linguistics, pages 320322.
Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, Reinhard Rapp. 1999. Automatic identification of
and Dan Klein. 2008. Learning bilingual lexicons word translations from unrelated English and Ger-
from monolingual corpora. In Proceedings of the man corpora. In Proceedings of the 37th Annual

Meeting of the Association for Computational Lin-
guistics, pages 519526.
Daphna Shezaf and Ari Rappoport. 2010. Bilingual
lexicon generation using non-aligned signatures. In
Proceedings of the 48th Annual Meeting of the As-
sociation for Computational Linguistics, pages 98
107.
Masao Utiyama and Hitoshi Isahara. 2003. Reliable
measures for aligning Japanese-English news arti-
cles and sentences. In Proceedings of the 41st An-
nual Meeting of the Association for Computational
Linguistics, pages 7279.
C. J. van Rijsbergen. 1979. Information Retrieval.
Butterworth.
Thuy Vu, Ai Ti Aw, and Min Zhang. 2009. Feature-
based method for document alignment in compara-
ble news corpora. In Proceedings of the 12th Con-
ference of the European Chapter of the Association
for Computational Linguistics, pages 843851.
Ivan Vulic, Wim De Smet, and Marie-Francine Moens.
2011. Identifying word translations from compara-
ble corpora using latent topic models. In Proceed-
ings of the 49th Annual Meeting of the Association
for Computational Linguistics: Human Language
Technologies, pages 479484.

Efficient Parsing with Linear Context-Free Rewriting Systems

Andreas van Cranenburgh


Huygens ING & ILLC, University of Amsterdam
Royal Netherlands Academy of Arts and Sciences
Postbus 90754, 2509 LT The Hague, the Netherlands
andreas.van.cranenburgh@huygens.knaw.nl

Abstract SBARQ

Previous work on treebank parsing with SQ


discontinuous constituents using Linear
Context-Free Rewriting systems (LCFRS) VP
has been limited to sentences of up to 30
words, for reasons of computational com- WHNP MD NP VB .
plexity. There have been some results on
binarizing an LCFRS in a manner that min- What should I do ?
imizes parsing complexity, but the present
work shows that parsing long sentences with Figure 1: A tree with WH-movement from the Penn
such an optimally binarized grammar re- treebank, in which traces have been converted to dis-
mains infeasible. Instead, we introduce a continuity. Taken from Evang and Kallmeyer (2011).
technique which removes this length restric-
tion, while maintaining a respectable accu-
racy. The resulting parser has been applied
to a discontinuous treebank with favorable 1997) and Tiger (Brants et al., 2002) corpora, or
results. those that can be extracted from traces such as in
the Penn treebank (Marcus et al., 1993) annota-
tion. However, the computational complexity is
1 Introduction such that until now, the length of sentences needed
Discontinuity in constituent structures (cf. figure 1 to be restricted. In the case of Kallmeyer and
& 2) is important for a variety of reasons. For Maier (2010) and Evang and Kallmeyer (2011) the
one, it allows a tight correspondence between limit was 25 words. Maier (2010) and van Cranen-
syntax and semantics by letting constituent struc- burgh et al. (2011) manage to parse up to 30 words
ture express argument structure (Skut et al., 1997). with heuristics and optimizations, but no further.
Other reasons are phenomena such as extraposi- Algorithms have been suggested to binarize the
tion and word-order freedom, which arguably re- grammars in such a way as to minimize parsing
quire discontinuous annotations to be treated sys- complexity, but the current paper shows that these
tematically in phrase-structures (McCawley, 1982; techniques are not sufficient to parse longer sen-
Levy, 2005). Empirical investigations demon- tences. Instead, this work presents a novel form
strate that discontinuity is present in non-negligible of coarse-to-fine parsing which does alleviate this
amounts: around 30% of sentences contain dis- limitation.
continuity in two German treebanks (Maier and The rest of this paper is structured as follows.
Sgaard, 2008; Maier and Lichte, 2009). Re- First, we introduce linear context-free rewriting
cent work on treebank parsing with discontinuous systems (LCFRS). Next, we discuss and evalu-
constituents (Kallmeyer and Maier, 2010; Maier, ate binarization strategies for LCFRS. Third, we
2010; Evang and Kallmeyer, 2011; van Cranen- present a technique for approximating an LCFRS
burgh et al., 2011) shows that it is feasible to by a PCFG in a coarse-to-fine framework. Lastly,
directly parse discontinuous constituency anno- we evaluate this technique on a large corpus with-
tations, as given in the German Negra (Skut et al., out the usual length restrictions.

ROOT
ROOT(ab) S(a) $.(b)
S S(abcd) VAFIN(b) NN(c) VP2 (a, d)
VP2 (a, bc) PROAV(a) NN(b) VVPP(c)
VP
PROAV(Danach) 
PROAV VAFIN NN NN VVPP $. VAFIN(habe) 
NN(Kohlenstaub) 
Danach habe Kohlenstaub Feuer gefangen .
NN(Feuer) 
Afterwards had coal dust fire caught . VVPP(gefangen) 
$.(.) 

Figure 2: A discontinuous tree from the Negra corpus. Figure 3: The productions that can be read off from the
Translation: After that coal dust had caught fire. tree in figure 2. Note that lexical productions rewrite to
, because they do not rewrite to any non-terminals.
2 Linear Context-Free Rewriting
Systems terminal may cover a tuple of discontinuous strings
instead of a single, contiguous sequence of termi-
Linear Context-Free Rewriting Systems (LCFRS; nals. The number of components in such a tuple
Vijay-Shanker et al., 1987; Weir, 1988) subsume is called the fan-out of a rule, which is equal to
a wide variety of mildly context-sensitive for- the number of gaps plus one; the fan-out of the
malisms, such as Tree-Adjoining Grammar (TAG), grammar is the maximum fan-out of its production.
Combinatory Categorial Grammar (CCG), Min- A context-free grammar is a LCFRS with a fan-out
imalist Grammar, Multiple Context-Free Gram- of 1. For convenience we will will use the rule
mar (MCFG) and synchronous CFG (Vijay-Shanker notation of simple RCG (Boullier, 1998), which
and Weir, 1994; Kallmeyer, 2010). Furthermore, is a syntactic variant of LCFRS, with an arguably
they can be used to parse dependency struc- more transparent notation.
tures (Kuhlmann and Satta, 2009). Since LCFRS A LCFRS is a tuple G = hN, T, V, P, Si. N
subsumes various synchronous grammars, they are is a finite set of non-terminals; a function dim :
also important for machine translation. This makes N N specifies the unique fan-out for every non-
it possible to use LCFRS as a syntactic backbone terminal symbol. T and V are disjoint finite sets
with which various formalisms can be parsed by of terminals and variables. S is the distinguished
compiling grammars into an LCFRS, similar to the start symbol with dim(S) = 1. P is a finite set of
TuLiPa system (Kallmeyer et al., 2008). As all rewrite rules (productions) of the form:
mildly context-sensitive formalisms, LCFRS are
parsable in polynomial time, where the degree A(1 , . . . dim(A) ) B1 (X11 , . . . , Xdim(B
1
1)
)
depends on the productions of the grammar. In-
. . . Bm (X1m , . . . , Xdim(B
m
m)
)
tuitively, LCFRS can be seen as a generalization
of context-free grammars to rewriting other ob- for m 0, where A, B1 , . . . , Bm N ,
jects than just continuous strings: productions are each Xji V for 1 i m, 1 j dim(Aj )
context-free, but instead of strings they can rewrite and i (T V ) for 1 i dim(Ai ).
tuples, trees or graphs. Productions must be linear: if a variable occurs
We focus on the use of LCFRS for parsing with in a rule, it occurs exactly once on the left hand
discontinuous constituents. This follows up on side (LHS), and exactly once on the right hand side
recent work on parsing the discontinuous anno- (RHS). A rule is ordered if for any two variables
tations in German corpora with LCFRS (Maier, X1 and X2 occurring in a non-terminal on the RHS,
2010; van Cranenburgh et al., 2011) and work on X1 precedes X2 on the LHS iff X1 precedes X2
parsing the Wall Street journal corpus in which on the RHS.
traces have been converted to discontinuous con- Every production has a fan-out determined by
stituents (Evang and Kallmeyer, 2011). In the case the fan-out of the non-terminal symbol on the left-
of parsing with discontinuous constituents a non- hand side. Apart from the fan-out productions also

have a rank: the number of non-terminals on the This binarization introduces a production with
right-hand side. These two variables determine a fan-out of 2, which could have been avoided.
the time complexity of parsing with a grammar. A After binarization, an LCFRS can be parsed in
production can be instantiated when its variables O(|G| |w|p ) time, where |G| is the size of the
can be bound to non-overlapping spans such that grammar, |w| is the length of the sentence. The de-
for each component i of the LHS, the concatena- gree p of the polynomial is the maximum parsing
tion of its terminals and bound variables forms a complexity of a rule, defined as:
contiguous span in the input, while the endpoints
of each span are non-contiguous. parsing complexity := + 1 + 2 (6)
As in the case of a PCFG, we can read off LCFRS where is the fan-out of the left-hand side and
productions from a treebank (Maier and Sgaard, 1 and 2 are the fan-outs of the right-hand side
2008), and the relative frequencies of productions of the rule in question (Gildea, 2010). As Gildea
form a maximum likelihood estimate, for a prob- (2010) shows, there is no one to one correspon-
abilistic LCFRS (PLCFRS), i.e., a (discontinuous) dence between fan-out and parsing complexity: it
treebank grammar. As an example, figure 3 shows is possible that parsing complexity can be reduced
the productions extracted from the tree in figure 2. by increasing the fan-out of a production. In other
words, there can be a production which can be bi-
3 Binarization
narized with a parsing complexity that is minimal
A probabilistic LCFRS can be parsed using a CKY- while its fan-out is sub-optimal. Therefore we fo-
like tabular parsing algorithm (cf. Kallmeyer and cus on parsing complexity rather than fan-out in
Maier, 2010; van Cranenburgh et al., 2011), but this work, since parsing complexity determines the
this requires a binarized grammar.1 Any LCFRS actual time complexity of parsing with a grammar.
can be binarized. Crescenzi et al. (2011) state There has been some work investigating whether
while CFGs can always be reduced to rank two the increase in complexity can be minimized ef-
(Chomsky Normal Form), this is not the case for fectively (Gomez-Rodrguez et al., 2009; Gildea,
LCFRS with any fan-out greater than one. How- 2010; Crescenzi et al., 2011).
ever, this assertion is made under the assumption of More radically, it has been suggested that the
a fixed fan-out. If this assumption is relaxed then power of LCFRS should be limited to well-nested
it is easy to binarize either deterministically or, as structures, which gives an asymptotic improve-
will be investigated in this work, optimally with ment in parsing time (Gomez-Rodrguez et al.,
a dynamic programming approach. Binarizing an 2010). However, there is linguistic evidence that
LCFRS may increase its fan-out, which results in not all language use can be described in well-
an increase in asymptotic complexity. Consider nested structures (Chen-Main and Joshi, 2010).
the following production: Therefore we will use the full power of LCFRS in
X(pqrs) A(p, r) B(q) C(s) (1) this workparsing complexity is determined by
the treebank, not by a priori constraints.
Henceforth, we assume that non-terminals on the
right-hand side are ordered by the order of their 3.1 Further binarization strategies
first variable on the left-hand side. There are two Apart from optimizing for parsing complexity, for
ways to binarize this production. The first is from linguistic reasons it can also be useful to parse
left to right: the head of a constituent first, yielding so-called
X(ps) XAB (p) C(s) (2) head-driven binarizations (Collins, 1999). Addi-
tionally, such a head-driven binarization can be
XAB (pqr) A(p, r) B(q) (3) Markovizedi.e., the resulting production can be
This binarization maintains the fan-out of 1. The constrained to apply to a limited amount of hor-
second way is from right to left: izontal context as opposed to the full context in
the original constituent (e.g., Klein and Manning,
X(pqrs) A(p, r) XBC (q, s) (4)
2003), which can have a beneficial effect on accu-
XBC (q, s) B(q) C(s) (5) racy. In the notation of Klein and Manning (2003)
1
Other algorithms exist which support n-ary productions, there are two Markovization parameters: h and
but these are less suitable for statistical treebank parsing. v. The first parameter describes the amount of

X X X

X XB,C,D,E XB XD

XB,C,D,E XB,C,D B XE XA

X XC,D,E XB,C XD XB

B B XD,E B B

A X C Y D E A X C Y D E A X C Y D E A X C Y D E A X C Y D E
0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5
original right branching optimal head-driven optimal head-driven
p = 4, = 2 p = 5, = 2 p = 4, = 2 p = 5, = 2 p = 4, = 2

Figure 4: The four binarization strategies. C is the head node. Underneath each tree is the maximum parsing
complexity and fan-out among its productions.

The first parameter describes the amount of horizontal context for the artificial labels of a binarized production. In a normal form binarization, this parameter equals infinity, because the binarized production should only apply in the exact same context as the context in which it originally belongs, as otherwise the set of strings accepted by the grammar would be affected. An artificial label will have the form X_{A,B,C} for a binarized production of a constituent X that has covered children A, B, and C of X. The other extreme, h = 1, enables generalizations by stringing parts of binarized constituents together, as long as they share one non-terminal. In the previous example, the label would become just X_A, i.e., the presence of B and C would no longer be required, which enables switching to any binarized production that has covered A as the last node. Limiting the amount of horizontal context on which a production is conditioned is important when the treebank contains many unique constituents which can only be parsed by stringing together different binarized productions; in other words, it is a way of dealing with the data sparseness about n-ary productions in the treebank.

The second parameter describes parent annotation, which will not be investigated in this work; the default value is v = 1, which implies only including the immediate parent of the constituent that is being binarized; including grandparents is a way of weakening independence assumptions.

Crescenzi et al. (2011) also remark that an optimal head-driven binarization allows for Markovization. However, it is questionable whether such a binarization is worthy of the name Markovization, as the non-terminals are not introduced deterministically from left to right, but in an arbitrary fashion dictated by concerns of parsing complexity; as such there is not a Markov process based on a meaningful (e.g., temporal) ordering, and there is no probabilistic interpretation of Markovization in such a setting.

To summarize, we have at least four binarization strategies (cf. figure 4 for an illustration):

1. right branching: A right-to-left binarization. No regard for optimality or statistical tweaks.
2. optimal: A binarization which minimizes parsing complexity, introduced in Gildea (2010). Binarizing with this strategy is exponential in the resulting optimal fan-out (Gildea, 2010).
3. head-driven: Head-outward binarization with horizontal Markovization. No regard for optimality.
4. optimal head-driven: Head-outward binarization with horizontal Markovization. Minimizes parsing complexity. Introduced in and proven to be NP-hard by Crescenzi et al. (2011).

3.2 Finding optimal binarizations

An issue with the minimal binarizations is that the algorithm for finding them has a high computational complexity, and has not been evaluated empirically on treebank data.2 Empirical investigation is interesting for two reasons. First of all, the high computational complexity may not be relevant with constant factors of constituents, which can reasonably be expected to be relatively small. Second, it is important to establish whether an asymptotic improvement is actually obtained through optimal binarizations, and whether this translates to an improvement in practice.

2 Gildea (2010) evaluates on a dependency bank, but does not report whether any improvement is obtained over a naive binarization.

Gildea (2010) presents a general algorithm to binarize an LCFRS while minimizing a given scoring function. We will use this algorithm with two different scoring functions.
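To make the binarization strategies and the complexity measure concrete, the sketch below binarizes the n-ary production from section 3 head-outward with horizontal Markovization and reports the fan-out and parsing complexity of each resulting production. It is only an illustration, not the implementation used in this work: the label format X|<...>, the choice to add right siblings before left ones, and the assumption that parsing complexity equals the fan-out of the left-hand side plus the fan-outs of the right-hand side non-terminals are all simplifications made here.

```python
# A minimal sketch (not the parser's implementation) of head-outward
# binarization with horizontal Markovization, for the production
# X(pqrs) -> A(p,r) B(q) C(s) from section 3.  Non-terminals are paired
# with the set of terminal positions they cover, so that fan-out and
# parsing complexity can be read off directly.

def fanout(positions):
    """Number of contiguous blocks formed by a set of terminal positions."""
    blocks, prev = 0, None
    for p in sorted(positions):
        if prev is None or p != prev + 1:
            blocks += 1
        prev = p
    return blocks

def complexity(lhs_pos, rhs_pos_list):
    """Parsing complexity of a production: fan-out of the left-hand side
    plus the fan-outs of the right-hand side non-terminals (assumed here;
    this is the quantity minimized by the 'optimal' strategies)."""
    return fanout(lhs_pos) + sum(fanout(p) for p in rhs_pos_list)

def binarize_head_outward(label, children, head_index, h=2):
    """Introduce the head child first, then the remaining children one at a
    time (right siblings first, then left siblings; one simple convention).
    Artificial labels record only the h most recently added child labels."""
    head = children[head_index]
    rest = children[head_index + 1:] + children[:head_index][::-1]
    cur_label, cur_pos = head[0], set(head[1])
    history, productions = [head[0]], []
    for i, (child_label, child_pos) in enumerate(rest):
        history.append(child_label)
        last = i == len(rest) - 1
        new_label = label if last else "%s|<%s>" % (label, "-".join(history[-h:]))
        new_pos = cur_pos | set(child_pos)
        productions.append((new_label, new_pos,
                            [(cur_label, cur_pos), (child_label, set(child_pos))]))
        cur_label, cur_pos = new_label, new_pos
    return productions

# A(p,r) covers positions {0,2}, B(q) covers {1}, C(s) covers {3}; C is the head.
children = [("A", {0, 2}), ("B", {1}), ("C", {3})]
for lhs_label, lhs_pos, rhs in binarize_head_outward("X", children, head_index=2):
    print(lhs_label, "fan-out:", fanout(lhs_pos),
          "complexity:", complexity(lhs_pos, [pos for _, pos in rhs]))
```

With h = 2 this produces an intermediate label such as X|<C-B>; letting h grow without bound would make the label record all covered children, which corresponds to the normal-form binarization described above.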

[Figure 5 (plot; legend: right branching, optimal; x-axis: parsing complexity, 3 to 9; y-axis: frequency): The distribution of parsing complexity among productions in binarized grammars read off from NEGRA-25. The y-axis has a logarithmic scale.]

[Figure 6 (plot; legend: head-driven, optimal head-driven; x-axis: parsing complexity, 3 to 9; y-axis: frequency): The distribution of parsing complexity among productions in Markovized, head-driven grammars read off from NEGRA-25. The y-axis has a logarithmic scale.]

The first directly optimizes parsing complexity. Given a (partially) binarized constituent c, the function returns a tuple of scores, for which a linear order is defined by comparing elements starting from the most significant (left-most) element. The tuples contain the parsing complexity p, and the fan-out φ to break ties in parsing complexity; if there are still ties after considering the fan-out, the sum of the parsing complexities of the subtrees of c is considered, which will give preference to a binarization where the worst case complexity occurs once instead of twice. The formula is then:

opt(c) = ⟨p, φ, s⟩

The second function is similar except that only head-driven strategies are accepted. A head-driven strategy is a binarization in which the head is introduced first, after which the rest of the children are introduced one at a time.

opt-hd(c) = ⟨p, φ, s⟩ if c is head-driven; ⟨∞, ∞, ∞⟩ otherwise

Given a (partial) binarization c, the score should reflect the maximum complexity and fan-out in that binarization, to optimize for the worst case, as well as the sum, to optimize the average case. This aspect appears to be glossed over by Gildea (2010). Considering only the score of the last production in a binarization produces suboptimal binarizations.

3.3 Experiments

As data we use version 2 of the Negra (Skut et al., 1997) treebank, with the common training, development and test splits (Dubey and Keller, 2003). Following common practice, punctuation, which is left out of the phrase-structure in Negra, is re-attached to the nearest constituent.

In the course of experiments it was discovered that the heuristic method for punctuation attachment used in previous work (e.g., Maier, 2010; van Cranenburgh et al., 2011), as implemented in rparse,3 introduces additional discontinuity. We applied a slightly different heuristic: punctuation is attached to the highest constituent that contains a neighbor to its right. The result is that punctuation can be introduced into the phrase-structure without any additional discontinuity, and thus without artificially inflating the fan-out and complexity of grammars read off from the treebank. This new heuristic provides a significant improvement: instead of a fan-out of 9 and a parsing complexity of 19, we obtain values of 4 and 9 respectively.

3 Available from http://www.wolfgang-maier.net/rparse/downloads. Retrieved March 25th, 2011.

The parser is presented with the gold part-of-speech tags from the corpus. For reasons of efficiency we restrict sentences to 25 words (including punctuation) in this experiment: NEGRA-25. A grammar was read off from the training part of NEGRA-25, and sentences of up to 25 words in the development set were parsed using the resulting PLCFRS, using the different binarization schemes. First with a right-branching, right-to-left binarization, and second with the minimal binarization according to parsing complexity and fan-out.

                   right branching   optimal     head-driven   optimal head-driven
Markovization      v=1, h=∞          v=1, h=∞    v=1, h=2      v=1, h=2
fan-out            4                 4           4             4
complexity         8                 8           9             8
labels             12861             12388       4576          3187
clauses            62072             62097       53050         52966
time to binarize   1.83 s            46.37 s     2.74 s        28.9 s
time to parse      246.34 s          193.94 s    2860.26 s     716.58 s
coverage           96.08 %           96.08 %     98.99 %       98.73 %
F1 score           66.83 %           66.75 %     72.37 %       71.79 %

Table 1: The effect of binarization strategies on parsing efficiency, with sentences from the development section of NEGRA-25.

The last two binarizations are head-driven and Markovized: the first straightforwardly from left-to-right, the latter optimized for minimal parsing complexity. With Markovization we are forced to add a level of parent annotation to tame the increase in productivity caused by h = 1.

The distribution of parsing complexity (measured with eq. 6) in the grammars with different binarization strategies is shown in figures 5 and 6. Although the optimal binarizations do seem to have some effect on the distribution of parsing complexities, it remains to be seen whether this can be cashed out as a performance improvement in practice. To this end, we also parse using the binarized grammars.

In this work we binarize and parse with disco-dop, introduced in van Cranenburgh et al. (2011).4 In this experiment we report scores of the (exact) Viterbi derivations of a treebank PLCFRS; cf. table 1 for the results. Times represent CPU time (single core); accuracy is given with a generalization of PARSEVAL to discontinuous structures, described in Maier (2010).

4 All code is available from: http://github.com/andreasvc/disco-dop.

Instead of using Maier's implementation of discontinuous F1 scores in rparse, we employ a variant that ignores (a) punctuation, and (b) the root node of each tree. This makes our evaluation incomparable to previous results on discontinuous parsing, but brings it in line with common practice on the Wall Street Journal benchmark. Note that this change yields scores about 2 or 3 percentage points lower than those of rparse.

Despite the fact that obtaining optimal binarizations is exponential (Gildea, 2010) and NP-hard (Crescenzi et al., 2011), they can be computed relatively quickly on this data set.5 Importantly, in the first case there is no improvement on fan-out or parsing complexity, while in the head-driven case there is a minimal improvement because of a single production with parsing complexity 15 without optimal binarization. On the other hand, the optimal binarizations might still have a significant effect on the average case complexity, rather than the worst-case complexities. Indeed, in both cases parsing with the optimal grammar is faster; in the first case, however, when the time for binarization is considered as well, this advantage mostly disappears.

5 The implementation exploits two important optimizations. The first is the use of bit vectors to keep track of which non-terminals are covered by a partial binarization. The second is to skip constituents without discontinuity, which are equivalent to CFG productions.

The difference in F1 scores might relate to the efficacy of Markovization in the binarizations. It should be noted that it makes little theoretical sense to Markovize a binarization when it is not a left-to-right or right-to-left binarization, because with an optimal binarization the non-terminals of a constituent are introduced in an arbitrary order.

More importantly, in our experiments, these techniques of optimal binarizations did not scale to longer sentences. While it is possible to obtain an optimal binarization of the unrestricted Negra corpus, parsing long sentences with the resulting grammar remains infeasible. Therefore we need to look at other techniques for parsing longer sentences.

We will stick with the straightforward head-driven, head-outward binarization strategy, despite this being a computationally sub-optimal binarization.

One technique for efficient parsing of LCFRS is the use of context-summary estimates (Kallmeyer and Maier, 2010), as part of a best-first parsing algorithm. This allowed Maier (2010) to parse sentences of up to 30 words. However, the calculation of these estimates is not feasible for longer sentences and large grammars (van Cranenburgh et al., 2011).

Another strategy is to perform an online approximation of the sentence to be parsed, after which parsing with the LCFRS can be pruned effectively. This is the strategy that will be explored in the current work.

4 Context-free grammar approximation for coarse-to-fine parsing

Coarse-to-fine parsing (Charniak et al., 2006) is a technique to speed up parsing by exploiting the information that can be gained from parsing with simpler, coarser grammars, e.g., a grammar with a smaller set of labels on which the original grammar can be projected. Constituents that do not contribute to a full parse tree with a coarse grammar can be ruled out for finer grammars as well, which greatly reduces the number of edges that need to be explored. However, by changing just the labels only the grammar constant is affected. With discontinuous treebank parsing the asymptotic complexity of the grammar also plays a major role. Therefore we suggest to parse not just with a coarser grammar, but with a coarser grammar formalism, following a suggestion in van Cranenburgh et al. (2011).

This idea is inspired by the work of Barthélemy et al. (2001), who apply it in a non-probabilistic setting where the coarse grammar acts as a guide to the non-deterministic choices of the fine grammar. Within the coarse-to-fine approach the technique becomes a matter of pruning with some probabilistic threshold. Instead of using the coarse grammar only as a guide to solve non-deterministic choices, we apply it as a pruning step which also discards the most suboptimal parses. The basic idea is to extract a grammar that defines a superset of the language we want to parse, but with a fan-out of 1. More concretely, a context-free grammar can be read off from discontinuous trees that have been transformed to context-free trees by the procedure introduced in Boyd (2007). Each discontinuous node is split into a set of new nodes, one for each component; for example a node NP2 will be split into two nodes labeled NP*1 and NP*2 (like Barthélemy et al., we mark components with an index to reduce overgeneration). Because Boyd's transformation is reversible, chart items from this grammar can be converted back to discontinuous chart items, and can guide parsing of an LCFRS.

This guiding takes the form of a white list. After parsing with the coarse grammar, the resulting chart is pruned by removing all items that fail to meet a certain criterion. In our case this is whether a chart item is part of one of the k-best derivations; we use k = 50 in all experiments (as in van Cranenburgh et al., 2011). This has similar effects as removing items below a threshold of marginalized posterior probability; however, the latter strategy requires computation of outside probabilities from a parse forest, which is more involved with an LCFRS than with a PCFG. When parsing with the fine grammar, whenever a new item is derived, the white list is consulted to see whether this item is allowed to be used in further derivations; otherwise it is immediately discarded. This coarse-to-fine approach will be referred to as CFG-CTF, and the transformed, coarse grammar will be referred to as a split-PCFG.

Splitting discontinuous nodes for the coarse grammar introduces new nodes, so obviously we need to binarize after this transformation. On the other hand, the coarse-to-fine approach requires a mapping between the grammars, so after reversing the transformation of splitting nodes, the resulting discontinuous trees must be binarized (and optionally Markovized) in the same manner as those on which the fine grammar is based.

To resolve this tension we elect to binarize twice. The first time is before splitting discontinuous nodes, and this is where we introduce Markovization. This same binarization will be used for the fine grammar as well, which ensures the models make the same kind of generalizations. The second binarization is after splitting nodes, this time with a binary normal form (2NF; all productions are either unary, binary, or lexical).

Parsing with this grammar proceeds as follows. After obtaining an exhaustive chart from the coarse stage, the chart is pruned so as to only contain items occurring in the k-best derivations.
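The white-list pruning just described can be summarized in a few lines; the sketch below is schematic rather than the disco-dop code, and the item representation (a label plus component spans) and the way the k-best derivations are obtained are assumptions of the sketch.

```python
# Schematic sketch of coarse-to-fine pruning with a white list.
def build_whitelist(kbest_derivations):
    """Collect every coarse (split-PCFG) item that occurs in one of the
    k-best derivations (k = 50 in the experiments).  Each derivation is
    assumed to be an iterable of (label, start, end) items."""
    whitelist = set()
    for derivation in kbest_derivations:
        whitelist.update(derivation)
    return whitelist

def allowed(label, components, whitelist):
    """Check a newly derived fine (LCFRS) item against the white list:
    one lookup per component, using the split labels of the coarse grammar
    (e.g. NP*1 and NP*2 for a discontinuous NP); items that fail are
    discarded immediately."""
    if len(components) == 1:
        (start, end), = components
        return (label, start, end) in whitelist
    return all(("%s*%d" % (label, i), start, end) in whitelist
               for i, (start, end) in enumerate(components, 1))

# usage: an NP with components (0,1) and (2,3) survives only if both halves
# were kept in the pruned coarse chart.
whitelist = build_whitelist([[("NP*1", 0, 1), ("NP*2", 2, 3), ("S", 0, 4)]])
print(allowed("NP", [(0, 1), (2, 3)], whitelist))   # True
```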

[Figure 7 (tree diagrams): Transformations for a context-free coarse grammar. From left to right: the original constituent, Markovized with v = 1, h = 1, discontinuities resolved, normal form (second binarization).]
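The "resolve discontinuity" step shown in figure 7 can be illustrated with a small sketch. It assumes a simple (label, children) tree representation with terminal positions as leaves, and it splits only the node it is given; in a full implementation discontinuous descendants would be handled recursively and the reverse mapping would be kept so that chart items can be converted back, as described above.

```python
# Sketch of splitting a discontinuous node into indexed components (Boyd, 2007).
def leaves(node):
    label, children = node
    out = []
    for c in children:
        out.extend(leaves(c) if isinstance(c, tuple) else [c])
    return sorted(out)

def blocks(positions):
    """Group sorted terminal positions into maximal contiguous blocks."""
    result = [[positions[0]]]
    for p in positions[1:]:
        if p == result[-1][-1] + 1:
            result[-1].append(p)
        else:
            result.append([p])
    return result

def split_node(node):
    """Return one node per component; a continuous node is left unchanged.
    Children are assumed to fall entirely within one component each."""
    label, children = node
    comps = blocks(leaves(node))
    if len(comps) == 1:
        return [node]
    return [("%s*%d" % (label, i + 1),
             [c for c in children
              if (leaves(c) if isinstance(c, tuple) else [c])[0] in comp])
            for i, comp in enumerate(comps)]

# A discontinuous NP covering terminal positions 0 and 2:
print(split_node(("NP", [0, 2])))   # [('NP*1', [0]), ('NP*2', [2])]
```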

model        train   dev   test   rules     labels   fan-out   complexity
Split-PCFG   17988   975   968    57969     2026     1         3
PLCFRS       17988   975   968    55778     947      4         9
Disco-DOP    17988   975   968    2657799   702246   4         9

Table 2: Some statistics on the coarse and fine grammars read off from NEGRA-40.

When parsing in the fine stage, each new item is looked up in this pruned coarse chart, with multiple lookups if the item is discontinuous (one for each component).

To summarize, the transformation happens in four steps (cf. figure 7 for an illustration):

1. Treebank tree: Original (discontinuous) tree
2. Binarization: Binarize discontinuous tree, optionally with Markovization
3. Resolve discontinuity: Split discontinuous nodes into components, marked with indices
4. 2NF: A binary normal form is applied; all productions are either unary, binary, or lexical.

5 Evaluation

We evaluate on Negra with the same setup as in section 3.3. We report discontinuous F1 scores as well as exact match scores. For previous results on discontinuous parsing with Negra, see table 3. For results with the CFG-CTF method see table 4.

We first establish the viability of the CFG-CTF method on NEGRA-25, with a head-driven v = 1, h = 2 binarization, and reporting again the scores of the exact Viterbi derivations from a treebank PLCFRS versus a PCFG using our transformations. Figure 8 compares the parsing times of LCFRS with and without the new CFG-CTF method. The graph shows a steep incline for parsing with LCFRS directly, which makes it infeasible to parse longer sentences, while the CFG-CTF method is faster for sentences of length > 22 despite its overhead of parsing twice.

[Figure 8 (plot of CPU time in seconds against sentence length, 0 to 25, for PLCFRS and for CFG-CTF (Split-PCFG, then PLCFRS)): Efficiency of parsing PLCFRS with and without coarse-to-fine. The latter includes time for both coarse & fine grammar. Datapoints represent the average time to parse sentences of that length; each length is made up of 20-40 sentences.]

The second experiment demonstrates the CFG-CTF technique on longer sentences. We restrict the length of sentences in the training, development and test corpora to 40 words: NEGRA-40.

                                           words   PARSEVAL (F1)   Exact match
DPSG: Plaehn (2004)                        15      73.16           39.0
PLCFRS: Maier (2010)                       30      71.52           31.65
Disco-DOP: van Cranenburgh et al. (2011)   30      73.98           34.80

Table 3: Previous work on discontinuous parsing of Negra.

                               words   PARSEVAL (F1)   Exact match
PLCFRS, dev set                25      72.37           36.58
Split-PCFG, dev set            25      70.74           33.80
Split-PCFG, dev set            40      66.81           27.59
CFG-CTF, PLCFRS, dev set       40      67.26           27.90
CFG-CTF, Disco-DOP, dev set    40      74.27           34.26
CFG-CTF, Disco-DOP, test set   40      72.33           33.16
CFG-CTF, Disco-DOP, dev set            73.32           33.40
CFG-CTF, Disco-DOP, test set           71.08           32.10

Table 4: Results on NEGRA-25 and NEGRA-40 with the CFG-CTF method. NB: As explained in section 3.3, these F1 scores are incomparable to the results in table 3; for comparison, the F1 score for Disco-DOP on the dev set 40 is 77.13 % using that evaluation scheme.

As a first step we apply the CFG-CTF technique to parse with a PLCFRS as the fine grammar, pruning away all items not occurring in the 10,000 best derivations from the PCFG chart. The result shows that the PLCFRS gives a slight improvement over the split-PCFG, which accords with the observation that the latter makes stronger independence assumptions in the case of discontinuity.

In the next experiments we turn to an all-fragments grammar encoded in a PLCFRS using Goodman's (2003) reduction, to realize a (discontinuous) Data-Oriented Parsing (DOP; Scha, 1990) model, which goes by the name of Disco-DOP (van Cranenburgh et al., 2011). This provides an effective yet conceptually simple method to weaken the independence assumptions of treebank grammars. Table 2 gives statistics on the grammars, including the parsing complexities. The fine grammar has a parsing complexity of 9, which means that parsing with this grammar has complexity O(|w|^9). We use the same parameters as van Cranenburgh et al. (2011), except that unlike van Cranenburgh et al., we can use v = 1, h = 1 Markovization, in order to obtain a higher coverage. The DOP grammar is added as a third stage in the coarse-to-fine pipeline. This gave slightly better results than substituting the DOP grammar for the PLCFRS stage. Parsing with NEGRA-40 took about 11 hours and 4 GB of memory. The same model from NEGRA-40 can also be used to parse the full development set, without length restrictions, establishing that the CFG-CTF method effectively eliminates any limitation of length for parsing with LCFRS.

6 Conclusion

Our results show that optimal binarizations are clearly not the answer to parsing LCFRS efficiently, as they do not significantly reduce parsing complexity in our experiments. While they provide some efficiency gains, they do not help with the main problem of longer sentences.

We have presented a new technique for large-scale parsing with LCFRS, which makes it possible to parse sentences of any length, with favorable accuracies. The availability of this technique may lead to a wider acceptance of LCFRS as a syntactic backbone in computational linguistics.

Acknowledgments

I am grateful to Willem Zuidema, Remko Scha, Rens Bod, and three anonymous reviewers for comments.

References

François Barthélemy, Pierre Boullier, Philippe Deschamp, and Eric de la Clergerie. 2001. Guided parsing of range concatenation languages. In Proc. of ACL, pages 42–49.

Pierre Boullier. 1998. Proposal for a natural language processing syntactic backbone. Technical Report RR-3342, INRIA-Rocquencourt, Le Chesnay, France. URL http://www.inria.fr/RRRT/RR-3342.html.

Adriane Boyd. 2007. Discontinuity revisited: An improved conversion to context-free representations. In Proceedings of the Linguistic Annotation Workshop, pages 41–44.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The Tiger treebank. In Proceedings of the workshop on treebanks and linguistic theories, pages 24–41.

Eugene Charniak, Mark Johnson, M. Elsner, J. Austerweil, D. Ellis, I. Haxton, C. Hill, R. Shrivaths, J. Moore, M. Pozar, et al. 2006. Multilevel coarse-to-fine PCFG parsing. In Proceedings of NAACL-HLT, pages 168–175.

Joan Chen-Main and Aravind K. Joshi. 2010. Unavoidable ill-nestedness in natural language and the adequacy of tree local-MCTAG induced dependency structures. In Proceedings of TAG+. URL http://www.research.att.com/srini/TAG+10/papers/chenmainjoshi.pdf.

Michael Collins. 1999. Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania.

Pierluigi Crescenzi, Daniel Gildea, Andrea Marino, Gianluca Rossi, and Giorgio Satta. 2011. Optimal head-driven parsing complexity for linear context-free rewriting systems. In Proc. of ACL.

Amit Dubey and Frank Keller. 2003. Parsing German with sister-head dependencies. In Proc. of ACL, pages 96–103.

Kilian Evang and Laura Kallmeyer. 2011. PLCFRS parsing of English discontinuous constituents. In Proceedings of IWPT, pages 104–116.

Daniel Gildea. 2010. Optimal parsing strategies for linear context-free rewriting systems. In Proceedings of NAACL HLT 2010, pages 769–776.

Carlos Gómez-Rodríguez, Marco Kuhlmann, and Giorgio Satta. 2010. Efficient parsing of well-nested linear context-free rewriting systems. In Proceedings of NAACL HLT 2010, pages 276–284.

Carlos Gómez-Rodríguez, Marco Kuhlmann, Giorgio Satta, and David Weir. 2009. Optimal reduction of rule length in linear context-free rewriting systems. In Proceedings of NAACL HLT 2009, pages 539–547.

Joshua Goodman. 2003. Efficient parsing of DOP with PCFG-reductions. In Rens Bod, Remko Scha, and Khalil Sima'an, editors, Data-Oriented Parsing. The University of Chicago Press.

Laura Kallmeyer. 2010. Parsing Beyond Context-Free Grammars. Cognitive Technologies. Springer Berlin Heidelberg.

Laura Kallmeyer, Timm Lichte, Wolfgang Maier, Yannick Parmentier, Johannes Dellert, and Kilian Evang. 2008. TuLiPA: Towards a multi-formalism parsing environment for grammar engineering. In Proceedings of the Workshop on Grammar Engineering Across Frameworks, pages 1–8.

Laura Kallmeyer and Wolfgang Maier. 2010. Data-driven parsing with probabilistic linear context-free rewriting systems. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 537–545.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proc. of ACL, volume 1, pages 423–430.

Marco Kuhlmann and Giorgio Satta. 2009. Treebank grammar techniques for non-projective dependency parsing. In Proceedings of EACL, pages 478–486.

Roger Levy. 2005. Probabilistic models of word order and syntactic discontinuity. Ph.D. thesis, Stanford University.

Wolfgang Maier. 2010. Direct parsing of discontinuous constituents in German. In Proceedings of the SPMRL workshop at NAACL HLT 2010, pages 58–66.

Wolfgang Maier and Timm Lichte. 2009. Characterizing discontinuity in constituent treebanks. In Proceedings of Formal Grammar 2009, pages 167–182. Springer.

Wolfgang Maier and Anders Søgaard. 2008. Treebanks and mild context-sensitivity. In Proceedings of Formal Grammar 2008, page 61.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

James D. McCawley. 1982. Parentheticals and discontinuous constituent structure. Linguistic Inquiry, 13(1):91–106.

Oliver Plaehn. 2004. Computing the most probable parse for a discontinuous phrase structure grammar. In Harry Bunt, John Carroll, and Giorgio Satta, editors, New developments in parsing technology, pages 91–106. Kluwer Academic Publishers, Norwell, MA, USA.

Remko Scha. 1990. Language theory and language technology; competence and performance. In Q.A.M. de Kort and G.L.J. Leerdam, editors, Computertoepassingen in de Neerlandistiek, pages 7–22. LVVN, Almere, the Netherlands. Original title: Taaltheorie en taaltechnologie; competence en performance. Translation available at http://iaaa.nl/rs/LeerdamE.html.

Stuart M. Shieber. 1985. Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333–343.

Wojciech Skut, Brigitte Krenn, Thorsten Brants, and Hans Uszkoreit. 1997. An annotation scheme for free word order languages. In Proceedings of ANLP, pages 88–95.

Andreas van Cranenburgh, Remko Scha, and Federico Sangati. 2011. Discontinuous data-oriented parsing: A mildly context-sensitive all-fragments grammar. In Proceedings of SPMRL, pages 34–44.

K. Vijay-Shanker and David J. Weir. 1994. The equivalence of four extensions of context-free grammars. Theory of Computing Systems, 27(6):511–546.

K. Vijay-Shanker, David J. Weir, and Aravind K. Joshi. 1987. Characterizing structural descriptions produced by various grammatical formalisms. In Proc. of ACL, pages 104–111.

David J. Weir. 1988. Characterizing mildly context-sensitive grammar formalisms. Ph.D. thesis, University of Pennsylvania. URL http://repository.upenn.edu/dissertations/AAI8908403/.
Evaluating language understanding accuracy with respect to objective
outcomes in a dialogue system

Myroslava O. Dzikovska and Peter Bell and Amy Isard and Johanna D. Moore
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh, United Kingdom
{m.dzikovska,peter.bell,amy.isard,j.moore}@ed.ac.uk

Abstract

It is not always clear how the differences in intrinsic evaluation metrics for a parser or classifier will affect the performance of the system that uses it. We investigate the relationship between the intrinsic evaluation scores of an interpretation component in a tutorial dialogue system and the learning outcomes in an experiment with human users. Following the PARADISE methodology, we use multiple linear regression to build predictive models of learning gain, an important objective outcome metric in tutorial dialogue. We show that standard intrinsic metrics such as F-score alone do not predict the outcomes well. However, we can build predictive performance functions that account for up to 50% of the variance in learning gain by combining features based on standard evaluation scores and on the confusion matrix entries. We argue that building such predictive models can help us better evaluate performance of NLP components that cannot be distinguished based on F-score alone, and illustrate our approach by comparing the current interpretation component in the system to a new classifier trained on the evaluation data.

1 Introduction

Much of the work in natural language processing relies on intrinsic evaluation: computing standard evaluation metrics such as precision, recall and F-score on the same data set to compare the performance of different approaches to the same NLP problem. However, once a component, such as a parser, is included in a larger system, it is not always clear that improvements in intrinsic evaluation scores will translate into improved overall system performance. Therefore, extrinsic or task-based evaluation can be used to complement intrinsic evaluations. For example, NLP components such as parsers and co-reference resolution algorithms could be compared in terms of how much they contribute to the performance of a textual entailment (RTE) system (Sammons et al., 2010; Yuret et al., 2010); parser performance could be evaluated by how well it contributes to an information retrieval task (Miyao et al., 2008).

However, task-based evaluation can be difficult and expensive for interactive applications. Specifically, task-based evaluation for dialogue systems typically involves collecting data from a number of people interacting with the system, which is time-consuming and labor-intensive. Thus, it is desirable to develop an off-line evaluation procedure that relates intrinsic evaluation metrics to predicted interaction outcomes, reducing the need to conduct experiments with human participants.

This problem can be addressed via the use of the PARADISE evaluation methodology for spoken dialogue systems (Walker et al., 2000). In a PARADISE study, after an initial data collection with users, a performance function is created to predict an outcome metric (e.g., user satisfaction) which can normally only be measured through user surveys. Typically, a multiple linear regression is used to fit a predictive model of the desired metric based on the values of interaction parameters that can be derived from system logs without additional user studies (e.g., dialogue length, word error rate, number of misunderstandings). PARADISE models have been used extensively in task-oriented spoken dialogue systems to establish which components of the system most need improvement, with user satisfaction as the outcome metric (Moller et al., 2007; Moller et al., 2008; Walker et al., 2000; Larsen, 2003). In tutorial dialogue, PARADISE studies investigated

which manually annotated features predict learn- in series or in parallel. Explanation and defi-
ing outcomes, to justify new features needed in nition questions require longer answers that con-
the system (Forbes-Riley et al., 2007; Rotaru and sist of 1-2 sentences, e.g., Why was bulb A on
Litman, 2006; Forbes-Riley and Litman, 2006). when switch Z was open? (expected answer Be-
We adapt the PARADISE methodology to eval- cause it was still in a closed path with the bat-
uating individual NLP components, linking com- tery) or What is voltage? (expected answer
monly used intrinsic evaluation scores with ex- Voltage is the difference in states between two
trinsic outcome metrics. We describe an evalua- terminals). We focus on the performance of the
tion of an interpretation component of a tutorial system on these long-answer questions, since re-
dialogue system, with student learning gain as the acting to them appropriately requires processing
target outcome measure. We first describe the more complex input than factual questions.
evaluation setup, which uses standard classifica- We collected a corpus of 35 dialogues from
tion accuracy metrics for system evaluation (Sec- paid undergraduate volunteers interacting with the
tion 2). We discuss the results of the intrinsic sys- system as part of a formative system evaluation.
tem evaluation in Section 3. We then show that Each student completed a multiple-choice test as-
standard evaluation metrics do not serve as good sessing their knowledge of the material before and
predictors of system performance for the system after the session. In addition, system logs con-
we evaluated. However, adding confusion matrix tained information about how each students utter-
features improves the predictive model (Section ance was interpreted. The resulting data set con-
4). We argue that in practical applications such tains 3426 student answers grouped into 35 sub-
predictive metrics should be used alongside stan- sets, paired with test results. The answers were
dard metrics for component evaluations, to bet- then manually annotated to create a gold standard
ter predict how different components will perform evaluation corpus.
in the context of a specific task. We demonstrate
how this technique can help differentiate the out- 2.2 B EETLE II Interpretation Output
put quality between a majority class baseline, the The interpretation component of B EETLE II uses
systems output, and the output of a new classifier a syntactic parser and a set of hand-authored rules
we trained on our data (Section 5). Finally, we to extract the domain-specific semantic represen-
discuss some limitations and possible extensions tations of student utterances from the text. The
to this approach (Section 6). student answer is first classified with respect to its
domain-specific speech act, as follows:
2 Evaluation Procedure
Answer: a contentful expression to which
2.1 Data Collection the system responds with a tutoring action,
We collected transcripts of students interacting either accepting it as correct or remediating
with B EETLE II (Dzikovska et al., 2010b), a tu- the problems as discussed in (Dzikovska et
torial dialogue system for teaching conceptual al., 2010a).
knowledge in the basic electricity and electron-
Help request: any expression indicating that
ics domain. The system is a learning environment
the student does not know the answer and
with a self-contained curriculum targeted at stu-
without domain content.
dents with no knowledge of high school physics.
When interacting with the system, students spend Social: any expression such as sorry which
3-5 hours going through pre-prepared reading ma- appears to relate to social interaction and has
terial, building and observing circuits in a simula- no recognizable domain content.
tor, and talking with a dialogue-based computer
tutor via a text-based chat interface. Uninterpretable: the system could not arrive
During the interaction, students can be asked at any interpretation of the utterance. It will
two types of questions. Factual questions require respond by identifying the likely source of
them to name a set of objects or a simple prop- error, if possible (e.g., a word it does not un-
erty, e.g., Which components in circuit 1 are in derstand) and asking the student to rephrase
a closed path? or Are bulbs A and B wired their utterance (Dzikovska et al., 2009).

If the student utterance was determined to be an the tutoring strategy based on the general answer
answer, it is further diagnosed for correctness as class (correct, incomplete, or contradictory). In
discussed in (Dzikovska et al., 2010b), using a do- addition, this allows us to cast the problem in
main reasoner together with semantic representa- terms of classifier evaluation, and to use standard
tions of expected correct answers supplied by hu- classifier evaluation metrics. If more detailed an-
man tutors. The resulting diagnosis contains the notations were available, this approach could eas-
following information: ily be extended, as discussed in Section 6.
We employed a hierarchical annotation scheme
Consistency: whether the student statement
shown in Figure 1, which is a simplification of
correctly describes the facts mentioned in
the DeMAND coding scheme (Campbell et al.,
the question and the simulation environment:
2009). Student utterances were first annotated
e.g., student saying Switch X is closed is
as either related to domain content, or not con-
labeled inconsistent if the question stipulated
taining any domain content, but expressing the
that this switch is open.
students metacognitive state or attitudes. Utter-
Diagnosis: an analysis of how well the stu- ances expressing domain content were then coded
dents explanation matches the expected an- with respect to their correctness, as being fully
swer. It consists of 4 parts correct, partially correct but incomplete, contain-
ing some errors (rather than just omissions) or
Matched: parts of the student utterance
irrelevant1 . The irrelevant category was used
that matched the expected answer
for utterances which were correct in general but
Contradictory: parts of the student ut-
which did not directly answer the question. Inter-
terance that contradict the expected an-
annotator agreement for this annotation scheme
swer
on the corpus was = 0.69.
Extra: parts of the student utterance that The speech acts and diagnoses logged by the
do not appear in the expected answer system can be automatically mapped into our an-
Not-mentioned: parts of the expected notation labels. Help requests and social acts
answer missing from the student utter- are assigned the non-content label; answers
ance. are assigned a label based on which diagnosis
The speech act and the diagnosis are passed to fields were filled: contradictory for those an-
the tutorial planner which makes decisions about swers labeled as either inconsistent, or contain-
feedback. They constitute the output of the inter- ing something in the contradictory field; incom-
pretation component, and its quality is likely to plete if there is something not mentioned, but
affect the learning outcomes, therefore we need something matched as well, and irrelevant if
an effective way to evaluate it. In future work, nothing matched (i.e., the entire expected answer
performance of individual pipeline components is in not-mentioned). Finally, uninterpretable ut-
could also be evaluated in a similar fashion. terances are treated as unclassified, analogous to a
situation where a statistical classifier does not out-
2.3 Data Annotation put a label for an input because the classification
The general idea of breaking down the student an- probability is below its confidence threshold.
swer into correct, incorrect and missing parts is This mapping was then compared against the
common in tutorial dialogue systems (Nielsen et manually annotated labels to compute the intrin-
al., 2008; Dzikovska et al., 2010b; Jordan et al., sic evaluation scores for the B EETLE II interpreter
2006). However, representation details are highly described in Section 3.
system specific, and difficult and time-consuming
to annotate. Therefore we implemented a simpli-
3 Intrinsic Evaluation Results
fied annotation scheme which classifies whole an- The interpretation component of B EETLE II was
swers as correct, partially correct but incomplete, developed based on the transcripts of 8 sessions
or contradictory, without explicitly identifying the 1
Several different subcategories of non-content utter-
correct and incorrect parts. This makes it easier to ances, and of contradictory utterances, were recorded. How-
create the gold standard and still retains useful in- ever, they resulting classes were too small and so were col-
formation, because tutoring systems often choose lapsed into a single category for purposes of this study.

Category      Subcategory      Description
Non-content                    Metacognitive and social expressions without domain content, e.g., "I don't know", "I need help", "you are stupid"
Content                        The utterance includes domain content.
              correct          The student answer is fully correct
              pc incomplete    The student said something correct, but incomplete, with some parts of the expected answer missing
              contradictory    The student's answer contains something incorrect or contradicting the expected answer, rather than just an omission
              irrelevant       The student's statement is correct in general, but it does not answer the question.

Figure 1: Annotation scheme used in creating the gold standard
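The automatic mapping from logged speech acts and diagnoses to these labels (section 2.3) is essentially a small decision list; the sketch below illustrates it with hypothetical field and label names rather than the actual BEETLE II log format.

```python
def map_to_label(speech_act, diagnosis):
    """Map interpreter output to the annotation labels of figure 1.
    `diagnosis` is assumed to be a dict with the four diagnosis fields;
    the exact names are illustrative."""
    if speech_act in ("help_request", "social"):
        return "non_content"
    if speech_act == "uninterpretable":
        return None                          # treated as unclassified
    # speech_act == "answer": decide from the diagnosis fields
    if diagnosis.get("inconsistent") or diagnosis.get("contradictory"):
        return "contradictory"
    if diagnosis.get("not_mentioned") and diagnosis.get("matched"):
        return "pc_incomplete"
    if not diagnosis.get("matched"):
        return "irrelevant"                  # whole expected answer missing
    return "correct"

print(map_to_label("answer", {"matched": ["closed path"],
                              "not_mentioned": ["battery"]}))   # pc_incomplete
```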

Label           Count   Frequency
correct         1438    0.43
pc incomplete   796     0.24
contradictory   808     0.24
irrelevant      105     0.03
non content     232     0.07

Table 1: Distribution of annotated labels in the evaluation corpus

of students interacting with earlier versions of the system. These sessions were completed prior to the beginning of the experiment during which our evaluation corpus was collected, and are not included in the corpus. Thus, the corpus constitutes unseen testing data for the BEETLE II interpreter.

Table 1 shows the distribution of codes in the annotated data. The distribution is unbalanced, and therefore in our evaluation results we use two different ways to average over per-class evaluation scores. Macro-average combines per-class scores disregarding the class sizes; micro-average weighs the per-class scores by class size. The overall classification accuracy (defined as the number of correctly classified instances out of all instances) is mathematically equivalent to micro-averaged recall; however, macro-averaging better reflects performance on small classes, and is commonly used for unbalanced classification problems (see, e.g., (Lewis, 1991)).

The detailed evaluation results are presented in Table 2. We will focus on two metrics: the overall classification accuracy (listed as micro-averaged recall as discussed above), and the macro-averaged F score.

The majority class baseline is to assign correct to every instance. Its overall accuracy is 43%, the same as BEETLE II. However, this is obviously not a good choice for a tutoring system, since students who make mistakes will never get tutoring feedback. This is reflected in a much lower value of the F score (0.12 macroaverage F score for baseline vs. 0.44 for BEETLE II). Note also that there is a large difference in the micro- and macro-averaged scores. It is not immediately clear which of these metrics is the most important, and how they relate to actual system performance. We discuss machine learning models to help answer this question in the next section.

4 Linking Evaluation Measures to Outcome Measures

Although the intrinsic evaluation shows that the BEETLE II interpreter performs better than the baseline on the F score, ultimately system developers are not interested in improving interpretation for its own sake: they want to know whether the time spent on improvements, and the complications in system design which may accompany them, are worth the effort. Specifically, do such changes translate into improvement in overall system performance?

To answer this question without running expensive user studies we can build a model which predicts likely outcomes based on the data observed so far, and then use the model's predictions as an additional evaluation metric. We chose a multiple linear regression model for this task, linking the classification scores with learning gain as measured during the data collection. This approach follows the general PARADISE approach (Walker et al., 2000), but while PARADISE is typically used to determine which system components need

                 baseline                     BEETLE II
Label            prec.   recall   F1          prec.   recall   F1
correct          0.43    1.00     0.60        0.93    0.52     0.67
pc incomplete    0.00    0.00     0.00        0.42    0.53     0.47
contradictory    0.00    0.00     0.00        0.57    0.22     0.31
irrelevant       0.00    0.00     0.00        0.17    0.15     0.16
non-content      0.00    0.00     0.00        0.91    0.41     0.57
macroaverage     0.09    0.20     0.12        0.60    0.37     0.44
microaverage     0.18    0.43     0.25        0.70    0.43     0.51

Table 2: Intrinsic evaluation results for BEETLE II and a majority class baseline
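The macro/micro distinction used in Table 2 is easy to make precise; the sketch below computes both from a confusion matrix stored as nested dictionaries (an assumed representation), with micro-averaged recall coinciding with overall accuracy when every utterance receives a label.

```python
def per_class_prf(confusion, labels):
    """confusion[predicted][actual] = count; returns (prec, rec, F1) per class."""
    scores = {}
    for lab in labels:
        tp = confusion.get(lab, {}).get(lab, 0)
        predicted = sum(confusion.get(lab, {}).values())
        actual = sum(confusion.get(p, {}).get(lab, 0) for p in labels)
        prec = tp / predicted if predicted else 0.0
        rec = tp / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    return scores

def macro_average(scores):
    """Unweighted mean over classes, regardless of class size."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))

def micro_recall(confusion, labels):
    """Micro-averaged recall: correctly classified instances / all instances."""
    correct = sum(confusion.get(l, {}).get(l, 0) for l in labels)
    total = sum(confusion.get(p, {}).get(a, 0) for p in labels for a in labels)
    return correct / total if total else 0.0
```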

the most improvement, we focus on finding a bet- rate confusion matrices for each student. We nor-
ter performance metric for a single component malized each confusion matrix cell by the total
(interpretation), using standard evaluation scores number of incorrect classifications for that stu-
as features. dent. We then added features based on confusion
Recall from Section 2.1 that each participant frequencies to our feature set.2
in our data collection was given a pre-test and Ideally, we should add 20 different features to
a post-test, measuring their knowledge of course our model, corresponding to every possible con-
material. The test score was equal to the propor- fusion. However, we are facing a sparse data
tion of correctly answered questions. The normal- problem, illustrated by the overall confusion ma-
ized learning gain, (post - pre) / (1 - pre), is a metric typically malized each confusion matrix cell by the total
used to assess system quality in intelligent tutor- we only observed 25 instances where a contra-
ing, and this is the metric we are trying to model. dictory utterance was miscategorized as correct
Thus, the training data for our model consists of (compared to 200 contradictorypc incomplete
35 instances, each corresponding to a single dia- confusions), and so for many students this mis-
logue and the learning gain associated with it. We classification was never observed, and predictions
can compute intrinsic evaluation scores for each based on this feature are not likely to be reliable.
dialogue, in order to build a model that predicts Therefore, we limited our features to those mis-
that students learning gain based on these scores. classifications that occurred at least twice for each
If the models predictions are sufficiently reliable, student (i.e., at least 70 times in the entire cor-
we can also use them for predicting the learning pus). The list of resulting features is shown in the
gain that a student could achieve when interacting conf row of Table 4. Since only a small num-
with a new version of the interpretation compo- ber of features was included, this limits the appli-
nent for the system, not yet tested with users. We cability of the model we derived from this data
can then use the predicted score to compare dif- set to the systems which make similar types of
ferent implementations and choose the one with confusions. However, it is still interesting to in-
the highest predicted learning gain. vestigate whether confusion probabilities provide
additional information compared to standard eval-
4.1 Features uation metrics. We discuss how better coverage
Table 4 lists the feature sets we used. We tried two could be obtained in Section 6.
basic types of features. First, we used the eval-
uation scores reported in the previous section as 4.2 Regression Models
features. Second, we hypothesized that some er- Table 5 shows the regression models we obtained
rors that the system makes are likely to be worse using different feature sets. All models were ob-
than others from a tutoring perspective. For ex- tained using stepwise linear regression, using the
ample, if the student gives a contradictory answer, Akaike information criterion (AIC) for variable
accepting it as correct may lead to student miscon- 2
We also experimented with using % unclassified as an
ceptions; on the other hand, calling an irrelevant additional feature, since % of rejections is known to be a
answer partially correct but incomplete may be problem for spoken dialogue systems. However, it did not
less of a problem. Therefore, we computed sepa- improve the models, and we do not report it here for brevity.

                          Actual
Predicted         contradictory   correct   irrelevant   non-content   pc incomplete
contradictory     175             86        3            0             43
correct           25              752       1            4             26
irrelevant        31              12        16           4             29
non-content       1               3         2            95            3
pc incomplete     200             317       40           28            419

Table 3: Confusion matrix for BEETLE II. System predicted values are in rows; actual values in columns.
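The per-student confusion features of section 4.1 can be derived from (predicted, actual) label pairs as in the sketch below; the feature-name format follows Table 4, and the normalization by the student's total number of misclassifications is the one described above.

```python
from collections import Counter

def confusion_features(pairs, selected_cells):
    """pairs: (predicted, actual) labels for one student's answers.
    selected_cells: the confusion cells kept as features (those occurring
    at least twice per student in the paper).  Each feature is the cell
    count divided by the student's total number of misclassifications."""
    errors = Counter((p, a) for p, a in pairs if p != a)
    total = sum(errors.values()) or 1
    return {"Freq.predicted.%s.actual.%s" % (p, a): errors[(p, a)] / total
            for (p, a) in selected_cells}

# e.g. the single cell retained by the 'conf' model:
cells = [("pc_incomplete", "contradictory")]
print(confusion_features([("pc_incomplete", "contradictory"),
                          ("correct", "correct")], cells))
```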

selection implemented in the R stepwise regres- full set of intrinsic evaluation scores with confu-
sion library. As measures of model quality, we re- sion frequencies. Note that if the full set of met-
port R2 , the percentage of variance accounted for rics (precision, recall, F score) is used, the model
by the models (a typical measure of fit in regres- derives a more complex formula which covers
sion modeling), and mean squared error (MSE). about 33% of the variance. Our best models,
These were estimated using leave-one-out cross- however, combine the averaged scores with con-
validation, since our data set is small. fusion frequencies, resulting in a higher R2 and
We used feature ablation to evaluate the contri- a lower MSE (22% relative decrease between the
bution of different features. First, we investigated scores.f and conf+scores.f models in the ta-
models using precision, recall or F-score alone. ble). This shows that these features have comple-
As can be seen from the table, precision is not pre- mentary information, and that combining them in
dictive of learning gain, while F-score and recall an application-specific way may help to predict
perform similarly to one another, with R2 = 0.12. how the components will behave in practice.
In comparison, the model using only confusion
frequencies has substantially higher estimated R2 5 Using prediction models in evaluation
and a lower MSE.3 In addition, out of the 3 con-
The models from Table 5 can be used to compare
fusion features, only one is selected as predictive.
different possible implementations of the inter-
This supports our hypothesis that different types
pretation component, under the assumption that
of errors may have different importance within a
the component with a higher predicted learning
practical system.
gain score is more appropriate to use in an ITS.
The confusion frequency feature chosen by To show how our predictive models can be used
the stepwise model (predicted-pc incomplete- in making implementation decisions, we compare
actual-contradictory) has a reasonable theoret- three possible choices for an interpretation com-
ical justification. Previous research shows that ponent: the original B EETLE II interpreter, the
students who give more correct or partially cor- baseline classifier described earlier, and a new de-
rect answers, either in human-human or human- cision tree classifier trained on our data.
computer dialogue, exhibit higher learning gains,
We built a decision tree classifier using the
and this has been established for different sys-
Weka implementation of C4.5 pruned decision
tems and tutoring domains (Litman et al., 2009).
trees, with default parameters. As features, we
Consequently, % of contradictory answers is neg-
used lexical similarity scores computed by the
atively predictive of learning gain. It is reasonable
Text::Similarity package4 . We computed
to suppose, as predicted by our model, that sys-
8 features: the similarity between student answer
tems that do not identify such answers well, and
and either the expected answer text or the question
therefore do not remediate them correctly, will do
text, using 4 different scores: raw number of over-
worse in terms of learning outcomes.
lapping words, F1 score, lesk score and cosine
Based on this initial finding, we investigated score. Its intrinsic evaluation scores are shown in
the models that combined either F scores or the Table 6, estimated using 10-fold cross-validation.
3
The decrease in MSE is not statistically significant, pos- We can compare B EETLE II and baseline clas-
sibly because of the small data set. However, since we ob- sifier using the scores.all model. The predicted
serve the same pattern of results across our models, it is still
4
useful to examine. http://search.cpan.org/dist/Text-Similarity/

Name               Variables
scores.fm          fmeasure.microaverage, fmeasure.macroaverage, fmeasure.correct, fmeasure.contradictory, fmeasure.pc incomplete, fmeasure.non-content, fmeasure.irrelevant
scores.precision   precision.microaverage, precision.macroaverage, precision.correct, precision.contradictory, precision.pc incomplete, precision.non-content, precision.irrelevant
scores.recall      recall.microaverage, recall.macroaverage, recall.correct, recall.contradictory, recall.pc incomplete, recall.non-content, recall.irrelevant
scores.all         scores.fm + scores.precision + scores.recall
conf               Freq.predicted.contradictory.actual.correct, Freq.predicted.pc incomplete.actual.correct, Freq.predicted.pc incomplete.actual.contradictory

Table 4: Feature sets for regression models

Variables                Cross-validation R2   Cross-validation MSE   Formula
scores.f                 0.12 (0.02)           0.0232 (0.0302)        0.32 + 0.56 fmeasure.microaverage
scores.precision         0.00 (0.00)           0.0242 (0.0370)        0.61
scores.recall            0.12 (0.02)           0.0232 (0.0310)        0.37 + 0.56 recall.microaverage
conf                     0.25 (0.03)           0.0197 (0.0262)        0.74 - 0.56 Freq.predicted.pc incomplete.actual.contradictory
scores.all               0.33 (0.03)           0.0218 (0.0264)        0.63 + 4.20 fmeasure.microaverage - 1.30 precision.microaverage - 2.79 recall.microaverage - 0.07 recall.non content
conf+scores.f            0.36 (0.03)           0.0179 (0.0281)        0.52 - 0.66 Freq.predicted.pc incomplete.actual.contradictory + 0.42 fmeasure.correct - 0.07 fmeasure.non content
full (conf+scores.all)   0.49 (0.02)           0.0189 (0.0248)        0.88 - 0.68 Freq.predicted.pc incomplete.actual.contradictory - 0.06 precision.non domain + 0.28 recall.correct - 0.79 precision.microaverage + 0.65 fmeasure.microaverage

Table 5: Regression models for learning gain. R2 and MSE estimated with leave-one-out cross-validation. Standard deviation in parentheses.
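The evaluation protocol behind the R2 and MSE columns (leave-one-out cross-validation over the 35 dialogues) can be reproduced for any fixed feature set as sketched below; the stepwise AIC-based variable selection itself was done with the R stepwise regression library and is not shown, and the scikit-learn based evaluation here is an illustration rather than the original setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def loocv_r2_mse(X, y):
    """X: one row of evaluation-score features per dialogue;
    y: normalized learning gains.  Returns cross-validated R^2 and MSE
    computed from the pooled leave-one-out predictions."""
    preds = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
    ss_res = float(np.sum((y - preds) ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    return 1 - ss_res / ss_tot, ss_res / len(y)
```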

Label           prec.   recall   F1
correct         0.66    0.76     0.71
pc incomplete   0.38    0.34     0.36
contradictory   0.40    0.35     0.37
irrelevant      0.07    0.04     0.05
non-content     0.62    0.76     0.68
macroaverage    0.43    0.45     0.43
microaverage    0.51    0.53     0.52

Table 6: Intrinsic evaluation scores for our newly built classifier.

score for BEETLE II is 0.66. The predicted score for the baseline is 0.28. We cannot use the models based on confusion scores (conf, conf+scores.f or full) for evaluating the baseline, because the confusions it makes are always to predict that the answer is correct when the actual label is incomplete or contradictory. Such situations were too rare in our training data, and therefore were not included in the models (as discussed in Section 4.1). Additional data will need to be collected before this model can rea-
sonably predict baseline behavior.
Compared to our new classifier, B EETLE II has swer. However, we could still use a classifier to
lower overall accuracy (0.43 vs. 0.53), but performs better on micro- and macro-averaged scores. BEETLE II precision is higher than that of the classifier. This is not unexpected given how the system was designed: since misunderstandings caused dialogue breakdown in pilot tests, the interpreter was built to prefer rejecting utterances as uninterpretable rather than assigning them to an incorrect class, leading to high precision but lower recall.

However, we can use all our predictive models to evaluate the classifier. We checked the confusion matrix (not shown here due to space limitations), and saw that the classifier made some of the same types of confusions that the BEETLE II interpreter made. On the scores.all model, the predicted learning gain score for the classifier is 0.63, also very close to BEETLE II. But with the conf+scores.all model, the predicted score is 0.89, compared to 0.59 for BEETLE II, indicating that we should prefer the newly built classifier.

Looking at individual class performance, the classifier performs better than the BEETLE II interpreter on identifying correct and contradictory answers, but does not do as well for partially correct but incomplete, and for irrelevant answers. Using our predictive performance metric highlights the differences between the classifiers and effectively helps determine which confusion types are the most important.

One limitation of this prediction, however, is that the original system's output is considerably more complex: the BEETLE II interpreter explicitly identifies correct, incorrect and missing parts of the student answer, which are then used by the system to formulate adaptive feedback. This is an important feature of the system because it allows for implementation of strategies such as acknowledging and restating correct parts of the answer.

[...] double-check the interpreter's output. If the predictions made by the original interpreter and the classifier differ, and in particular when the classifier assigns the contradictory label to an answer, BEETLE II may choose to use a generic strategy for contradictory utterances, e.g. telling the student that their answer is incorrect without specifying the exact problem, or asking them to re-read portions of the material.

6 Discussion and Future Work

In this paper, we proposed an approach for cost-sensitive evaluation of language interpretation within practical applications. Our approach is based on the PARADISE methodology for dialogue system evaluation (Walker et al., 2000). We followed the typical pattern of a PARADISE study, but instead of relying on a variety of features that characterize the interaction, we used scores that reflect only the performance of the interpretation component. For BEETLE II we could build regression models that account for nearly 50% of the variance in the desired outcomes, on par with models reported in earlier PARADISE studies (Möller et al., 2007; Möller et al., 2008; Walker et al., 2000; Larsen, 2003). More importantly, we demonstrated that combining averaged scores with features based on confusion frequencies improves prediction quality and allows us to see differences between systems which are not obvious from the scores alone.
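To make this kind of model concrete, the following sketch (not the authors' code) fits an ordinary least-squares regression from intrinsic interpretation metrics and confusion-frequency features to an objective outcome such as learning gain. The feature names and numbers are purely illustrative.

```python
# Minimal sketch of a PARADISE-style predictive model: regress an objective
# outcome (e.g. learning gain) on intrinsic interpretation metrics plus
# per-class confusion frequencies. All values below are invented.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# One row per evaluation session (hypothetical feature set):
# accuracy, precision, recall, freq(correct -> contradictory), freq(incomplete -> correct)
X = np.array([
    [0.53, 0.81, 0.48, 0.02, 0.05],
    [0.43, 0.64, 0.61, 0.06, 0.03],
    [0.50, 0.75, 0.52, 0.03, 0.04],
    [0.47, 0.70, 0.55, 0.05, 0.06],
])
y = np.array([0.59, 0.63, 0.61, 0.58])   # objective outcome per session

model = LinearRegression().fit(X, y)
names = ["acc", "prec", "rec", "conf_corr_contr", "conf_inc_corr"]
print("weights:", dict(zip(names, model.coef_)))
print("R^2 on training data:", r2_score(y, model.predict(X)))
```

A new interpretation component can then be compared to the original one by scoring its metrics with the fitted model and preferring the configuration with the higher predicted outcome, as done for the classifier above.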
Previous work on task-based evaluation of NLP components used RTE or information extraction as target tasks (Sammons et al., 2010; Yuret et al., 2010; Miyao et al., 2008), based on standard corpora. We specifically targeted applications which involve human-computer interaction, where running task-based evaluations is particularly expensive, and building a predictive model of system performance can simplify system development.

Our evaluation data limited the set of features that we could use in our models. For most confusion features, there were not enough instances in the data to build a model that would reliably predict learning gain for those cases. One way to solve this problem would be to conduct a user study in which the system simulates random errors appearing some of the time. This could provide the data needed for more accurate models.

The general pattern we observed in our data is that a model based on F-scores alone predicts only a small proportion of the variance. If a full set of metrics (including F-score, precision and recall) is used, linear regression derives a more complex equation, with different weights for precision and recall. Instead of the linear model, we may consider using a model based on the F_β score, F_β = (1 + β²)·PR / (β²P + R), and fitting it to the data to derive the β weight rather than using the standard F1 score. We plan to investigate this in the future.
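As a rough illustration of that idea, the sketch below fits the β weight to the data instead of fixing β = 1, under the assumption of a simple linear link between F_β and the outcome; the precision, recall and outcome values are invented.

```python
# Sketch: fit beta in F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
# so that F_beta best predicts the target outcome (illustrative data only).
import numpy as np
from scipy.optimize import curve_fit

P = np.array([0.81, 0.64, 0.75, 0.70, 0.68])   # per-session precision
R = np.array([0.48, 0.61, 0.52, 0.55, 0.58])   # per-session recall
y = np.array([0.59, 0.63, 0.61, 0.58, 0.60])   # objective outcome, e.g. learning gain

def f_beta_model(pr, beta, a, b):
    p, r = pr
    f = (1 + beta**2) * p * r / (beta**2 * p + r)
    return a * f + b                            # linear link from F_beta to the outcome

(beta, a, b), _ = curve_fit(f_beta_model, (P, R), y,
                            p0=[1.0, 1.0, 0.0],
                            bounds=([0.0, -5.0, -5.0], [10.0, 5.0, 5.0]))
print(f"fitted beta = {beta:.2f} (beta > 1 weights recall more heavily than precision)")
```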
Our method would apply to a wide range of systems. It can be used straightforwardly with many current spoken dialogue systems which rely on classifiers to support language understanding in domains such as call routing and technical support (Gupta et al., 2006; Acomb et al., 2007). We applied it to a system that outputs more complex logical forms, but we showed that we could simplify its output to a set of labels which still allowed us to make informed decisions. Similar simplifications could be derived for other systems based on domain-specific dialogue acts typically used in dialogue management. For slot-based systems, it may be useful to consider concept accuracy for recognizing individual slot values. Finally, for tutoring systems it is possible to annotate the answers on a more fine-grained level. Nielsen et al. (2008) proposed an annotation scheme based on the output of a dependency parser, and trained a classifier to identify individual dependencies as "expressed", "contradicted" or "unaddressed". Their system could be evaluated using the same approach.

The specific formulas we derived are not likely to be highly generalizable. It is a well-known limitation of PARADISE evaluations that models built based on one system often do not perform well when applied to different systems (Möller et al., 2008). But using them to compare implementation variants during system development, without re-running user evaluations, can provide important information, as we illustrated with an example of evaluating a new classifier we built for our interpretation task. Moreover, the confusion frequency feature that our models picked is consistent with earlier results from a different tutoring domain (see Section 4.2). Thus, these models could provide a starting point when making system development choices, which can then be confirmed by user evaluations in new domains.

The models we built do not fully account for the variance in the training data. This is expected, since interpretation performance is not the only factor influencing the objective outcome: other factors, such as choosing the appropriate tutoring strategy, are also important. Similar models could be built for other system components to account for their contribution to the variance. Finally, we could consider using different learning algorithms. Möller et al. (2008) examined decision trees and neural networks in addition to multiple linear regression for predicting user satisfaction in spoken dialogue. They found that neural networks had the best prediction performance for their task. We plan to explore other learning algorithms for this task as part of our future work.

7 Conclusion

In this paper, we described an evaluation of an interpretation component of a tutorial dialogue system using predictive models that link intrinsic evaluation scores with learning outcomes. We showed that adding features based on confusion frequencies for individual classes significantly improves the prediction. This approach can be used to compare different implementations of language interpretation components, and to decide which option to use, based on the predicted improvement in a task-specific target outcome metric trained on previous evaluation data.

Acknowledgments

We thank Natalie Steinhauser, Gwendolyn Campbell, Charlie Scott, Simon Caine, Leanne Taylor, Katherine Harrison and Jonathan Kilgour for help with data collection and preparation; and Christopher Brew for helpful comments and discussion. This work has been supported in part by the US ONR award N000141010085.
References

Kate Acomb, Jonathan Bloom, Krishna Dayanidhi, Phillip Hunter, Peter Krogh, Esther Levin, and Roberto Pieraccini. 2007. Technical support dialog systems: Issues, problems, and solutions. In Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies, pages 25–31, Rochester, NY, April.

Gwendolyn C. Campbell, Natalie B. Steinhauser, Myroslava O. Dzikovska, Johanna D. Moore, Charles B. Callaway, and Elaine Farrow. 2009. The DeMAND coding scheme: A common language for representing and analyzing student discourse. In Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED), poster session, Brighton, UK, July.

Myroslava O. Dzikovska, Charles B. Callaway, Elaine Farrow, Johanna D. Moore, Natalie B. Steinhauser, and Gwendolyn E. Campbell. 2009. Dealing with interpretation errors in tutorial dialogue. In Proceedings of the SIGDIAL 2009 Conference, pages 38–45, London, UK, September.

Myroslava Dzikovska, Diana Bental, Johanna D. Moore, Natalie B. Steinhauser, Gwendolyn E. Campbell, Elaine Farrow, and Charles B. Callaway. 2010a. Intelligent tutoring with natural language support in the Beetle II system. In Sustaining TEL: From Innovation to Learning and Practice - 5th European Conference on Technology Enhanced Learning (EC-TEL 2010), Barcelona, Spain, October.

Myroslava O. Dzikovska, Johanna D. Moore, Natalie Steinhauser, Gwendolyn Campbell, Elaine Farrow, and Charles B. Callaway. 2010b. Beetle II: a system for tutoring and computational linguistics experimentation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-2010) demo session, Uppsala, Sweden, July.

Kate Forbes-Riley and Diane J. Litman. 2006. Modelling user satisfaction and student learning in a spoken dialogue tutoring system with generic, tutoring, and user affect parameters. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL '06), pages 264–271, Stroudsburg, PA, USA.

Kate Forbes-Riley, Diane Litman, Amruta Purandare, Mihai Rotaru, and Joel Tetreault. 2007. Comparing linguistic features for modeling learning in computer tutoring. In Proceedings of the 2007 Conference on Artificial Intelligence in Education: Building Technology Rich Learning Contexts That Work, pages 270–277, Amsterdam, The Netherlands. IOS Press.

Narendra K. Gupta, Gokhan Tur, Dilek Hakkani-Tur, Srinivas Bangalore, Giuseppe Riccardi, and Mazin Gilbert. 2006. The AT&T spoken language understanding system. IEEE Transactions on Audio, Speech & Language Processing, 14(1):213–222.

Pamela W. Jordan, Maxim Makatchev, and Umarani Pappuswamy. 2006. Understanding complex natural language explanations in tutorial applications. In Proceedings of the Third Workshop on Scalable Natural Language Understanding, ScaNaLU '06, pages 17–24.

Lars Bo Larsen. 2003. Issues in the evaluation of spoken dialogue systems using objective and subjective measures. In Proceedings of the 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 209–214.

David D. Lewis. 1991. Evaluating text categorization. In Proceedings of the Workshop on Speech and Natural Language, HLT '91, pages 312–318, Stroudsburg, PA, USA.

Diane Litman, Johanna Moore, Myroslava Dzikovska, and Elaine Farrow. 2009. Using natural language processing to analyze tutorial dialogue corpora across domains and modalities. In Proceedings of the 14th International Conference on Artificial Intelligence in Education (AIED), Brighton, UK, July.

Yusuke Miyao, Rune Sætre, Kenji Sagae, Takuya Matsuzaki, and Jun'ichi Tsujii. 2008. Task-oriented evaluation of syntactic parsers and their representations. In Proceedings of ACL-08: HLT, pages 46–54, Columbus, Ohio, June.

Sebastian Möller, Paula Smeele, Heleen Boland, and Jan Krebber. 2007. Evaluating spoken dialogue systems according to de-facto standards: A case study. Computer Speech & Language, 21(1):26–53.

Sebastian Möller, Klaus-Peter Engelbrecht, and Robert Schleicher. 2008. Predicting the quality and usability of spoken dialogue services. Speech Communication, pages 730–744.

Rodney D. Nielsen, Wayne Ward, and James H. Martin. 2008. Learning to assess low-level conceptual understanding. In Proceedings of the 21st International FLAIRS Conference, Coconut Grove, Florida, May.

Mihai Rotaru and Diane J. Litman. 2006. Exploiting discourse structure for spoken dialogue performance analysis. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP '06, pages 85–93, Stroudsburg, PA, USA.

Mark Sammons, V.G.Vinod Vydiswaran, and Dan Roth. 2010. Ask not what textual entailment can do for you... In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1199–1208, Uppsala, Sweden, July.

Marilyn A. Walker, Candace A. Kamm, and Diane J. Litman. 2000. Towards Developing General Models of Usability with PARADISE. Natural Language Engineering, 6(3).
Deniz Yuret, Aydin Han, and Zehra Turgut. 2010. SemEval-2010 task 12: Parser evaluation using textual entailments. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 51–56, Uppsala, Sweden, July.
Experimenting with Distant Supervision for Emotion Classification

Matthew Purver and Stuart Battersby
Interaction Media and Communication Group / Chatterbox Analytics
School of Electronic Engineering and Computer Science
Queen Mary University of London
Mile End Road, London E1 4NS, UK
m.purver@qmul.ac.uk, stuart@cbanalytics.co.uk

Abstract in unconventional style and without accompany-


ing metadata, audio/video signals or access to the
We describe a set of experiments using au- author for disambiguation, how can we easily pro-
tomatically labelled data to train supervised duce a gold-standard labelling for training and/or
classifiers for multi-class emotion detection
for evaluation and test? One possible solution
in Twitter messages with no manual inter-
vention. By cross-validating between mod-
that is becoming popular is crowd-sourcing the la-
els trained on different labellings for the belling task, as the easy access to very large num-
same six basic emotion classes, and testing bers of annotators provided by tools such as Ama-
on manually labelled data, we conclude that zons Mechanical Turk can help with the problem
the method is suitable for some emotions of dataset size; however, this has its own attendant
(happiness, sadness and anger) but less able problems of annotator reliability (see e.g. (Hsueh
to distinguish others; and that different la- et al., 2009)), and cannot directly help with the in-
belling conventions are more suitable for
herent problem of ambiguity using many anno-
some emotions than others.
tators does not guarantee that they can understand
or correctly assign the authors intended interpre-
1 Introduction tation or emotional state.
We present a set of experiments into classify- In this paper, we investigate a different ap-
ing Twitter messages into the six basic emotion proach via distant supervision (see e.g. (Mintz
classes of (Ekman, 1972). The motivation behind et al., 2009)). By using conventional markers of
this work is twofold: firstly, to investigate the pos- emotional content within the texts themselves as
sibility of detecting emotions of multiple classes a surrogate for explicit labels, we can quickly re-
(rather than purely positive or negative sentiment) trieve large subsets of (noisily) labelled data. This
in such short texts; and secondly, to investigate approach has the advantage of giving us direct
the use of distant supervision to quickly bootstrap access to the authors own intended interpreta-
large datasets and classifiers without the need for tion or emotional state, without relying on third-
manual annotation. party annotators. Of course, the labels themselves
Text classification according to emotion and may be noisy: ambiguous, vague or not having
sentiment is a well-established research area. In a direct correspondence with the desired classi-
this and other areas of text analysis and classifica- fication. We therefore experiment with multiple
tion, recent years have seen a rise in use of data such conventions with apparently similar mean-
from online sources and social media, as these ings here, emoticons (following (Read, 2005))
provide very large, often freely available datasets and Twitter hashtags allowing us to examine the
(see e.g. (Eisenstein et al., 2010; Go et al., 2009; similarity of classifiers trained on independent la-
Pak and Paroubek, 2010) amongst many others). bels but intended to detect the same underlying
However, one of the challenges this poses is that class. We also investigate the precision and cor-
of data annotation: given very large amounts of respondence of particular labels with the desired
data, often consisting of very short texts, written emotion classes by testing on a small set of man-

We also investigate the precision and correspondence of particular labels with the desired emotion classes by testing on a small set of manually labelled data.

We show that the success of this approach depends on both the conventional markers chosen and the emotion classes themselves. Some emotions are both reliably marked by different conventions and distinguishable from other emotions; this seems particularly true for happiness, sadness and anger, indicating that this approach can provide not only the basic distinction required for sentiment analysis but some more finer-grained information. Others are either less distinguishable from short text messages, or less reliably marked.

2 Related Work

2.1 Emotion and Sentiment Classification

Much research in this area has concentrated on the related tasks of subjectivity classification (distinguishing objective from subjective texts, see e.g. Wiebe and Riloff (2005)) and sentiment classification (classifying subjective texts into those that convey positive, negative and neutral sentiment, see e.g. Pang and Lee (2008)). We are interested in emotion detection: classifying subjective texts according to a finer-grained classification of the emotions they convey, and thus providing richer and more informative data for social media analysis than simple positive/negative sentiment. In this study we confine ourselves to the six basic emotions identified by Ekman (1972) as being common across cultures; other finer-grained classifications are of course available.

2.1.1 Emotion Classification

The task of emotion classification is by nature a multi-class problem, and classification experiments have therefore achieved lower accuracies than seen in the binary problems of sentiment and subjectivity classification. Danisman and Alpkocak (2008) used vector space models for the same six-way emotion classification we examine here, and achieved F-measures around 32%; Seol et al. (2008) used neural networks for an 8-way classification (hope, love, thank, neutral, happy, sad, fear, anger) and achieved per-class accuracies of 45% to 65%. Chuang and Wu (2004) used supervised classifiers (SVMs) and manually defined keyword features over a seven-way classification consisting of the same six-class taxonomy plus a neutral category, and achieved an average accuracy of 65.5%, varying from 56% for disgust to 74% for anger. However, they achieved significant improvements using acoustic features available in their speech data, improving accuracies up to a maximum of 81.5%.

2.2 Conventions

As we are using text data, such intonational and prosodic cues are unavailable, as are the other rich sources of emotional cues we obtain from gesture, posture and facial expression in face-to-face communication. However, the prevalence of online text-based communication has led to the emergence of textual conventions understood by the users to perform some of the same functions as these acoustic and non-verbal cues. The most familiar of these is the use of emoticons, either Western-style (e.g. :), :-( etc.) or Eastern-style (e.g. (^_^), (>_<) etc.). Other conventions have emerged more recently for particular interfaces or domains; in Twitter data, one common convention is the use of hashtags to add or emphasise emotional content, see (1).

(1) a. Best day in ages! #Happy :)
    b. Gets so #angry when tutors dont email back... Do you job idiots!

Linguistic and social research into the use of such conventions suggests that their function is generally to emphasise or strengthen the emotion or sentiment conveyed by a message, rather than to add emotional content which would not otherwise be present. Walther and D'Addario (2001) found that the contribution of emoticons towards the sentiment of a message was outweighed by the verbal content, although negative ones tended to shift interpretation towards the negative. Ip (2002) experimented with emoticons in instant messaging, with the results suggesting that emoticons do not add positivity or negativity but rather increase valence (making positive messages more positive and vice versa). Similarly, Derks et al. (2008a; 2008b) found that emoticons are used in strengthening the intensity of a verbal message (although they serve other functions such as expressing humour), and hypothesized that they serve similar functions to actual non-verbal behavior; Provine et al. (2007) also found that emoticons are used to punctuate messages rather than replace lexical content, appearing in similar grammatical locations to verbal laughter and preserving phrase structure.
2.3 Distant Supervision

These findings suggest, of course, that emoticons and related conventional markers are likely to be useful features for sentiment and emotion classification. They also suggest, though, that they might be used as surrogates for manual emotion class labels: if their function is often to complement the verbal content available in messages, they should give us a way to automatically label messages according to emotional class, while leaving us with messages with enough verbal content to achieve reasonable classification.

This approach has been exploited in several ways in recent work; Tanaka et al. (2005) used Japanese-style emoticons as classification labels, and Go et al. (2009) and Pak and Paroubek (2010) used Western-style emoticons to label and classify Twitter messages according to positive and negative sentiment, using traditional supervised classification methods. The highest accuracies appear to have been achieved by Go et al. (2009), who used various combinations of features (unigrams, bigrams, part-of-speech tags) and classifiers (Naïve Bayes, maximum entropy, and SVMs), achieving their best accuracy of 83.0% with unigram and bigram features and a maximum entropy classifier; using only unigrams with an SVM classifier achieved only slightly lower accuracy at 82.2%. Ansari (2010) then provides an initial investigation into applying the same methods to six-way emotion classification, treating each emotion independently as a binary classification problem and showing that accuracy varied with emotion class as well as with dataset size. The highest accuracies achieved were up to 81%, but these were on very small datasets (e.g. 81.0% accuracy on fear, but with only around 200 positive and negative data instances).

We view this approach as having several advantages; apart from the ease of data collection it allows by avoiding manual annotation, it gives us access to the authors' own intended interpretations, as the markers are of course added by the authors themselves at time of writing. In some cases, such as the examples of (1) above, the emotion conveyed may be clear to a third-party annotator; but in others it may not be clear at all without the marker, see (2):

(2) a. Still trying to recover from seeing the #bluewaffle on my TL #disgusted #sick
    b. Leftover ToeJams with Kettle Salt and Vinegar chips. #stress #sadness #comfort #letsturnthisfrownupsidedown

3 Methodology

We used a collection of Twitter messages, all marked with emoticons or hashtags corresponding to one of Ekman (1972)'s six emotion classes. For emoticons, we used Ansari (2010)'s taxonomy, taken from the Yahoo messenger classification. For hashtags, we used the emotion names themselves together with the main related adjective; both are used commonly on Twitter in slightly different ways as shown in (3); note that emotion names are often used as marked verbs as well as nouns. Details of the classes and markers are given in Table 1.

(3) a. Gets so #angry when tutors dont email back... Do you job idiots!
    b. Im going to say it, Paranormal Activity 2 scared me and I didnt sleep well last night because of it. #fear #demons
    c. Girls that sleep w guys without even fully getting to know them #disgust me

Messages with multiple conventions (see (4)) were collected and used in the experiments, ensuring that the marker being used as a label in a particular experiment was not available as a feature in that experiment. Messages with no markers were not collected. While this prevents us from experimenting with the classification of neutral or objective messages, it would require manual annotation to distinguish these from emotion-carrying messages which are not marked. We assume that any implementation of the techniques we investigate here would be able to use a preliminary stage of subjectivity and/or sentiment detection to identify these messages, and leave this aside here. A sketch of the labelling step is given after example (4).

(4) a. just because people are celebs they dont reply to your tweets! NOT FAIR #Angry :( I wish They would reply! #Please
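The sketch below illustrates one way such a distant-labelling step could look; it is not the authors' implementation, and the marker sets are abbreviated (Table 1 gives the full lists). A message is kept for a class under a given convention only if it carries a marker of that class and no marker of any other class of the same convention, and the markers are stripped so they cannot later be used as features.

```python
# Minimal sketch of distant labelling by hashtag markers (abbreviated sets).
HASHTAGS = {
    "happy": {"#happy", "#happiness"}, "sad": {"#sad", "#sadness"},
    "anger": {"#angry", "#anger"}, "fear": {"#scared", "#fear"},
    "surprise": {"#surprised", "#surprise"}, "disgust": {"#disgusted", "#disgust"},
}

def distant_label(tweet, marker_sets):
    """Return (label, text with markers removed), or None if 0 or >1 classes match."""
    tokens = tweet.lower().split()
    matched = {cls for cls, markers in marker_sets.items()
               if any(tok in markers for tok in tokens)}
    if len(matched) != 1:
        return None
    label = matched.pop()
    cleaned = " ".join(t for t in tokens if t not in marker_sets[label])
    return label, cleaned

print(distant_label("Best day in ages! #happy", HASHTAGS))
# -> ('happy', 'best day in ages!')
```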
Data was collected from Twitter's Streaming API service [1]. This provides a 1-2% random sample of all tweets with no constraints on language or location. These are collected in near real time and stored in a local database. An English language selection filter was applied; scripts collecting each conventional marker set were alternated throughout different times of day and days of the week to avoid any bias associated with e.g. weekends or mornings. The numbers of messages collected varied with the popularity of the markers themselves: for emoticons, we obtained a maximum of 837,849 (for happy) and a minimum of 10,539 for anger; for hashtags, a maximum of 10,219 for happy and a minimum of 536 for disgust [2].

[1] See http://dev.twitter.com/docs/streaming-api.
[2] One possible way to increase dataset sizes for the rarer markers might be to include synonyms in the hashtag names used; however, people's use and understanding of hashtags is not straightforwardly predictable from lexical form. Instead, we intend to run a longer-term data gathering exercise.

Classification in all experiments used support vector machines (SVMs) (Vapnik, 1995) via the LIBSVM implementation of Chang and Lin (2001) with a linear kernel and unigram features. Unigram features included all words and hashtags (other than those used as labels in relevant experiments) after removal of URLs and Twitter usernames. Some improvement in performance might be available using more advanced features (e.g. n-grams), other classification methods (e.g. maximum entropy, as lexical features are unlikely to be independent) and/or feature weightings (e.g. the variant of TFIDF used for sentiment classification by Martineau (2009)). Here, our interest is more in the difference between the emotion and convention marker classes; we leave investigation of absolute performance for future work.

Table 1: Conventional markers used for emotion classes.

happy      :-)  :)  ;-)  :D  :P  8)  8-|  <@o
sad        :-(  :(  ;-(  :-<  :'(
anger      :-@  :@
fear       :|  :-o  :-O
surprise   :s  :S
disgust    :$  +o(
happy      #happy  #happiness
sad        #sad  #sadness
anger      #angry  #anger
fear       #scared  #fear
surprise   #surprised  #surprise
disgust    #disgusted  #disgust

4 Experiments

Throughout, the markers (emoticons and/or hashtags) used as labels in any experiment were removed before feature extraction in that experiment; labels were not used as features.

4.1 Experiment 1: Emotion detection

To simulate the task of detecting emotion classes from a general stream of messages, we first built for each convention type C and each emotion class E a dataset D^C_E of size N containing (a) as positive instances, N/2 messages containing markers of the emotion class E and no other markers of type C, and (b) as negative instances, N/2 messages containing markers of type C of any other emotion class. For example, the positive instance set for emoticon-marked anger was based on those tweets which contained :-@ or :@, but none of the emoticons from the happy, sad, surprise, disgust or fear classes; any hashtags were allowed, including those associated with emotion classes. The negative instance set contained a representative sample of the same number of instances, with each having at least one of the happy, sad, surprise, disgust or fear emoticons but not containing :-@ or :@.
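A minimal sketch of this setup is shown below. The paper used the LIBSVM implementation with a linear kernel; here scikit-learn's CountVectorizer and LinearSVC stand in for it, and `labelled` is assumed to be the list of (class, text) pairs produced by the distant-labelling step, with the label markers already stripped from the text.

```python
# Sketch of Experiment 1: build a balanced positive/negative set for one emotion
# and evaluate a linear SVM with unigram features using 10-fold cross-validation.
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def experiment1_dataset(labelled, emotion, n):
    pos = [txt for cls, txt in labelled if cls == emotion]
    neg = [txt for cls, txt in labelled if cls != emotion]
    pos, neg = random.sample(pos, n // 2), random.sample(neg, n // 2)
    texts = pos + neg
    y = [1] * len(pos) + [0] * len(neg)
    return texts, y

def evaluate(texts, y):
    clf = make_pipeline(CountVectorizer(), LinearSVC())   # unigram features, linear SVM
    return cross_val_score(clf, texts, y, cv=10).mean()   # mean 10-fold accuracy
```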
This of course excludes messages with no emotional markers; for this to act as an approximation of the general task therefore requires an assumption that unmarked messages reflect the same distribution over emotion classes as marked messages. For emotion-carrying but unmarked messages, this does seem intuitively likely, but requires investigation. For neutral objective messages it is clearly false, but as stated above we assume a preliminary stage of subjectivity detection in any practical application.

Performance was evaluated using 10-fold cross-validation. Results are shown as the bold figures in Table 2; despite the small dataset sizes in some cases, a χ² test shows all to be significantly different from chance. The best-performing classes show accuracies very similar to those achieved by Go et al. (2009) for their binary positive/negative classification, as might be expected; for emoticon markers, the best classes are happy, sad and anger; interestingly, the best classes for hashtag markers are not the same: happy performs best, but disgust and fear outperform sad and anger, and surprise performs particularly badly. For sad, one reason may be a dual meaning of the tag #sad (one emotional and one expressing ridicule); for anger one possibility is the popularity on Twitter of the game Angry Birds; for surprise, the data seems split between two rather distinct usages, ones expressing the author's emotion, but one expressing an intended effect on the audience (see (5)). However, deeper analysis is needed to establish the exact causes.

(5) a. broke 100 followers. #surprised im glad that the HOFF is one of them.
    b. Whos excited for the Big Game? We know we are AND we have a #surprise for you!

Table 2: Experiment 1: Within-class results. Same-convention (bold) figures are accuracies over 10-fold cross-validation; cross-convention (italic) figures are accuracies over full sets.

Test convention / class   Train: emoticon   Train: hashtag
emoticon happy            79.8%             63.5%
emoticon sad              79.9%             65.5%
emoticon anger            80.1%             62.9%
emoticon fear             76.2%             58.5%
emoticon surprise         77.4%             48.2%
emoticon disgust          75.2%             54.6%
hashtag happy             67.7%             82.5%
hashtag sad               67.1%             74.6%
hashtag anger             62.8%             74.7%
hashtag fear              60.6%             77.2%
hashtag surprise          51.9%             67.4%
hashtag disgust           64.6%             78.3%

To investigate whether the different convention types actually convey similar properties (and hence are used to mark similar messages) we then compared these accuracies to those obtained by training classifiers on the dataset for a different convention: in other words, for each emotion class E, train a classifier on dataset D^C1_E and test on D^C2_E. As the training and testing sets are different, we now test on the entire dataset rather than using cross-validation. Results are shown as the italic figures in Table 2; a χ² test shows all to be significantly different from the bold same-convention results. Accuracies are lower overall, but the highest figures (between 63% and 68%) are achieved for happy, sad and anger; here perhaps we can have some confidence that not only are the markers acting as predictable labels themselves, but also seem to be labelling the same thing (and therefore perhaps are actually labelling the emotion we are hoping to label).

4.2 Experiment 2: Emotion discrimination

To investigate whether these independent classifiers can be used in multi-class classification (distinguishing emotion classes from each other rather than just distinguishing one class from a general "other" set), we next cross-tested the classifiers between emotion classes: training models on one emotion and testing on the others; for each convention type C and each emotion class E1, train a classifier on dataset D^C_E1 and test on D^C_E2, D^C_E3, etc. The datasets in Experiment 1 had an uneven balance of emotion classes (including a high proportion of happy instances) which could bias results; for this experiment, therefore, we created datasets with an even balance of emotions among the negative instances. For each convention type C and each emotion class E1, we built a dataset D^C_E1 of size N containing (a) as positive instances, N/2 messages containing markers of the emotion class E1 and no other markers of type C, and (b) as negative instances, N/2 messages consisting of N/10 messages containing only markers of class E2, N/10 messages containing only markers of class E3, etc. Results were then generated as in Experiment 1.

Within-class results are shown in Table 3 and are similar to those obtained in Experiment 1; again, differences between bold/italic results are statistically significant. Cross-class results are shown in Table 4. The happy class was well distinguished from other emotion classes for both convention types (i.e. cross-class classification accuracy is low compared to the within-class figures in italics and parentheses). The sad class also seems well distinguished when using hashtags as labels, although less so when using emoticons. However, other emotion classes show a surprisingly high cross-class performance in many cases; in other words, they are producing disappointingly similar classifiers.
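The cross-testing used in Experiment 2 (and in the cross-convention comparison above) can be sketched as follows; this is an illustration rather than the authors' code, and `datasets` is assumed to map a (convention, emotion) pair to the (texts, labels) pair built as described above.

```python
# Sketch: train on one emotion/convention dataset, test on another's full dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def cross_test(datasets, train_key, test_key):
    train_texts, train_y = datasets[train_key]
    test_texts, test_y = datasets[test_key]
    clf = make_pipeline(CountVectorizer(), LinearSVC()).fit(train_texts, train_y)
    return clf.score(test_texts, test_y)   # accuracy over the full test set

# e.g. train on emoticon-labelled "anger", test on hashtag-labelled "anger":
# acc = cross_test(datasets, ("emoticon", "anger"), ("hashtag", "anger"))
```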
Table 4: Experiment 2: Cross-class results. Same-class figures from 10-fold cross-validation are shown in
(italics) for comparison; all other figures are accuracies over full sets.
Train
Convention Test happy sad anger fear surprise disgust
emoticon happy (78.1%) 17.3% 39.6% 26.7% 28.3% 42.8%
emoticon sad 16.5% (78.9%) 59.1% 71.9% 69.9% 55.5%
emoticon anger 29.8% 67.0% (79.7%) 74.2% 76.4% 67.5%
emoticon fear 27.0% 69.9% 64.4% (75.3%) 74.0% 61.2%
emoticon surprise 25.4% 69.9% 67.7% 76.3% (78.1%) 66.4%
emoticon disgust 42.2% 54.4% 61.1% 64.2% 64.1% (73.9%)
hashtag happy (81.1%) 10.7% 45.3% 47.8% 52.7% 43.4%
hashtag sad 13.8% (77.9%) 47.7% 49.7% 46.5% 54.2%
hashtag anger 44.6% 45.2% (74.3%) 72.0% 63.0% 62.9%
hashtag fear 45.0% 50.4% 68.6% (74.7%) 63.9% 60.7%
hashtag surprise 51.5% 45.7% 67.4% 70.7% (70.2%) 64.2%
hashtag disgust 40.4% 53.5% 74.7% 71.8% 70.8% (74.2%)

we would expect emoticon-trained models to fail


Table 3: Experiment 2: Within-class results. Same-
convention (bold) figures are accuracies over 10-fold to discriminate hashtag-labelled test sets, but
cross-validation; cross-convention (italic) figures are hashtag-trained models to discriminate emoticon-
accuracies over full sets. labelled test sets well; if on the other hand the
Train cause lies in the overlap of verbal content or the
Convention Test emoticon hashtag emotions themselves, the effect should be simi-
emoticon happy 78.1% 61.2% lar in either direction. This experiment also helps
emoticon sad 78.9% 60.2% determine in more detail whether the labels used
emoticon anger 79.7% 63.7% label similar underlying properties.
emoticon fear 75.3% 55.9%
Table 5 shows the results. For the three classes
emoticon surprise 78.1% 53.1%
happy, sad and perhaps anger, models trained
emoticon disgust 73.9% 51.5%
using emoticon labels do a reasonable job of dis-
hashtag happy 68.7% 81.1%
tinguishing classes in hashtag-labelled data, and
hashtag sad 65.4% 77.9%
vice versa. However, for the other classes, dis-
hashtag anger 63.9% 74.3%
crimination is worse. Emoticon-trained mod-
hashtag fear 58.9% 74.7%
els appear to give (undesirably) higher perfor-
hashtag surprise 51.8% 70.2%
mance across emotion classes in hashtag-labelled
hashtag disgust 55.4% 74.2%
data (for the problematic non-happy classes).
Hashtag-trained models perform around the ran-
dom 50% level on emoticon-labelled data for
ated with the emotions, or of genuine frequent co- those classes, even when tested on nominally
presence of the emotions. Given the close lex- the same emotion as they are trained on. For
ical specification of emotions in hashtag labels, both label types, then, the lower within-class and
the latter reasons seem more likely; however, with higher cross-class performance with these nega-
emoticon labels, we suspect that the emoticons tive classes (fear, surprise, disgust) sug-
themselves are often used in ambiguous or vague gests that these emotion classes are genuinely
ways. hard to tell apart (they are all negative emotions,
As one way of investigating this directly, we and may use similar words), or are simply of-
tested classifiers across labelling conventions as ten expressed simultaneously. The higher perfor-
well as across emotion classes, to determine mance of emoticon-trained classifiers compared
whether the (lack of) cross-class discrimination to hashtag-trained classifiers, though, also sug-
holds across convention marker types. In the gests vagueness or ambiguity in emoticons: data
case of ambiguity or vagueness of emoticons, labelled with emoticons nominally thought to be

Table 5: Experiment 2: Cross-class, cross-convention results (train on hashtags, test on emoticons and vice
versa). All figures are accuracies over full sets. Accuracies over 60% are shown in bold.
Train
Convention Test happy sad anger fear surprise disgust
emoticon happy 61.2% 40.4% 44.1% 47.4% 52.0% 45.9%
emoticon sad 38.3% 60.2% 55.1% 51.5% 47.1% 53.9%
emoticon anger 47.0% 48.0% 63.7% 56.2% 50.9% 56.6%
emoticon fear 39.8% 57.7% 57.1% 55.9% 50.8% 56.1%
emoticon surprise 43.7% 55.2% 59.2% 58.4% 53.1% 54.0%
emoticon disgust 51.5% 48.0% 53.5% 55.1% 53.1% 51.5%
hashtag happy 68.7% 32.5% 43.6% 32.1% 35.4% 50.4%
hashtag sad 33.8% 65.4% 53.2% 65.0% 61.8% 48.8%
hashtag anger 43.9% 55.5% 63.9% 59.6% 60.4% 53.0%
hashtag fear 44.3% 54.6% 56.1% 58.9% 61.5% 54.3%
hashtag surprise 54.2% 45.3% 49.8% 49.9% 51.8% 52.3%
hashtag disgust 41.5% 57.6% 61.6% 62.2% 59.3% 55.4%

associated with surprise produces classifiers for the three classes already seen to be prob-
which perform well on data labelled with many lematic: surprise, fear and disgust. To
other hashtag classes, suggesting that those emo- create our dataset for this experiment, we there-
tions were present in the training data. Con- fore took only instances which were given the
versely, the more specific hashtag labels produce same primary label by all labellers i.e. only
classifiers which perform poorly on data labelled those examples which we could take as reliably
with emoticons and which thus contains a range and unambiguously labelled. This gave an un-
of actual emotions. balanced dataset, with numbers varying from 266
instances for happy to only 12 instances for
4.3 Experiment 3: Manual labelling each of surprise and fear. Classifiers were
To confirm whether either (or both) set of auto- trained using the datasets from Experiment 2. Per-
matic (distant) labels do in fact label the under- formance is shown in Table 6; given the imbal-
lying emotion class intended, we used human an- ance between class numbers in the test dataset,
notators via Amazons Mechanical Turk to label evaluation is given as recall, precision and F-score
a set of 1,000 instances. These instances were all for the class in question rather than a simple accu-
labelled with emoticons (we did not use hashtag- racy figure (which is biased by the high proportion
labelled data: as hashtags are so lexically close to of happy examples).
the names of the emotion classes being labelled,
their presence may influence labellers unduly)3
and were evenly distributed across the 6 classes, Table 6: Experiment 3: Results on manual labels.
in so far as indicated by the emoticons. Labellers Train Class Precision Recall F-score
were asked to choose the primary emotion class emoticon happy 79.4% 75.6% 77.5%
(from the fixed set of six) associated with the mes- emoticon sad 43.5% 73.2% 54.5%
sage; they were also allowed to specify if any emoticon anger 62.2% 37.3% 46.7%
other classes were also present. Each data in- emoticon fear 6.8% 63.6% 12.3%
stance was labelled by three different annotators. emoticon surprise 15.0% 90.0% 25.7%
Agreement between labellers was poor over- emoticon disgust 8.3% 25.0% 12.5%
all. The three annotators unanimously agreed in hashtag happy 78.9% 51.9% 62.6%
only 47% of cases overall; although two of three hashtag sad 47.9% 81.7% 60.4%
agreed in 83% of cases. Agreement was worst hashtag anger 58.2% 76.0% 65.9%
3
hashtag fear 10.1% 81.8% 18.0%
Although, of course, one may argue that they do the hashtag surprise 5.9% 60.0% 10.7%
same for their intended audience of readers in which case,
such an effect is legitimate. hashtag disgust 6.7% 66.7% 11.8%

Again, results for happy are good, and cor- To avoid any effect of ordering, the order of the
respond fairly closely to the levels of accuracy emoticon list and each drop-down menu was ran-
reported by Go et al. (2009) and others for the domised every time the survey page was loaded.
binary positive/negative sentiment detection task. The survey was distributed via Twitter, Facebook
Emoticons give significantly better performance and academic mailing lists. Respondents were not
than hashtags here. Results for sad and anger given the opportunity to give their own definitions
are reasonable, and provide a baseline for fur- or to provide finer-grained classifications, as we
ther experiments with more advanced features and wanted to establish purely whether they would re-
classification methods once more manually anno- liably associate labels with the six emotions in our
tated data is available for these classes. In con- taxonomy.
trast, hashtags give much better performance with
these classes than the (perhaps vague or ambigu- 5.2 Results
ous) emoticons. The survey was completed by 492 individuals;
The remaining emotion classes, however, show full results are shown in Table 7. It demonstrated
poor performance for both labelling conventions. agreement with the predefined emoticons for sad
The observed low precision and high recall can be and most of the emoticons for happy (people
adjusted using classifier parameters, but F-scores were unsure what 8-| and <@o meant). For all
are not improved. Note that Experiment 1 shows the emoticons listed as anger, surprise and
that both emoticon and hashtag labels are to some disgust, the survey showed that people are reli-
extent predictable, even for these classes; how- ably unsure as to what these mean. For the emoti-
ever, Experiment 2 shows that they may not be con :-o there was a direct contrast between the
reliably different to each other, and Experiment 3 defined meaning and the survey meaning; the def-
tells us that they do not appear to coincide well inition of this emoticon following Ansari (2010)
with human annotator judgements of emotions. was fear, but the survey reliably assigned this to
More reliable labels may therefore be required; surprise.
although we do note that the low reliability of Given the small scale of the survey, we hesi-
the human annotations for these classes, and the tate to draw strong conclusions about the emoti-
correspondingly small amount of annotated data con meanings themselves (in fact, recent conver-
used in this evaluation, means we hesitate to draw sations with schoolchildren see below have in-
strong conclusions about fear, surprise and dicated very different interpretations from these
disgust. An approach which considers multi- adult survey respondents). However, we do con-
ple classes to be associated with individual mes- clude that for most emotions outside happy and
sages may also be beneficial: using majority- sad, emoticons may indeed be an unreliable la-
decision labels rather than unanimous labels im- bel; as hashtags also appear more reliable in the
proves F-scores for surprise to 23-35% by in- classification experiments, we expect these to be
cluding many examples also labelled as happy a more promising approach for fine-grained emo-
(although this gives no improvements for other tion discrimination in future.
classes).
6 Conclusions
5 Survey
The approach shows reasonable performance at
To further detemine whether emoticons used individual emotion label prediction, for both
as emotion class labels are ambiguous or vague emoticons and hashtags. For some emotions (hap-
in meaning, we set up a web survey to exam- piness, sadness and anger), performance across
ine whether people could reliably classify these label conventions (training on one, and testing on
emoticons. the other) is encouraging; for these classes, per-
formance on those manually labelled examples
5.1 Method where annotators agree is also reasonable. This
Our survey asked people to match up which of gives us confidence not only that the approach
the six emotion classes (selected from a drop- produces reliable classifiers which can predict the
down menu) best matched each emoticon. Each labels, but that these classifiers are actually de-
drop-down menu included a Not Sure option. tecting the desired underlying emotional classes,

Table 7: Survey results showing the defined emotion, the most popular emotion from the survey, the percentage
of votes this emotion received, and the 2 significance test for the distribution of votes. These are indexed by
emoticon.
Emoticon Defined Emotion Survey Emotion % of votes Significance of votes distribution
:-) Happy Happy 94.9 2 = 3051.7 (p < 0.001)
:) Happy Happy 95.5 2 = 3098.2 (p < 0.001)
;-) Happy Happy 87.4 2 = 2541 (p < 0.001)
:D Happy Happy 85.7 2 = 2427.2 (p < 0.001)
:P Happy Happy 59.1 2 = 1225.4 (p < 0.001)
8) Happy Happy 61.9 2 = 1297.4 (p < 0.001)
8-| Happy Not Sure 52.2 2 = 748.6 (p < 0.001)
<@o Happy Not Sure 84.6 2 = 2335.1 (p < 0.001)
:-( Sad Sad 91.3 2 = 2784.2 (p < 0.001)
:( Sad Sad 89.0 2 = 2632.1 (p < 0.001)
;-( Sad Sad 67.9 2 = 1504.9 (p < 0.001)
:-< Sad Sad 56.1 2 = 972.59 (p < 0.001)
:( Sad Sad 80.7 2 = 2116 (p < 0.001)
:-@ Anger Not Sure 47.8 2 = 642.47 (p < 0.001)
:@ Anger Not Sure 50.4 2 = 691.6 (p < 0.001)
:s Surprise Not Sure 52.2 2 = 757.7 (p < 0.001)
:$ Disgust Not Sure 62.8 2 = 1136 (p < 0.001)
+o( Disgust Not Sure 64.2 2 = 1298.1 (p < 0.001)
:| Fear Not Sure 55.1 2 = 803.41 (p < 0.001)
:-o Fear Surprise 89.2 2 = 2647.8 (p < 0.001)

without requiring manual annotation. We there- Acknowledgements


fore plan to pursue this approach with a view to
improving performance by investigating training The authors are supported in part by the Engi-
with combined mixed-convention datasets, and neering and Physical Sciences Research Council
cross-training between classifiers trained on sepa- (grants EP/J010383/1 and EP/J501360/1) and the
rate conventions. Technology Strategy Board (R&D grant 700081).
We thank the reviewers for their comments.
However, this cross-convention performance is
much better for some emotions (happiness, sad-
ness and anger) than others (fear, surprise and dis- References
gust). Indications are that the poor performance Saad Ansari. 2010. Automatic emotion tone detection
on these latter emotion classes is to a large de- in twitter. Masters thesis, Queen Mary University
gree an effect of ambiguity or vagueness of the of London.
emoticon and hashtag conventions we have used Chih-Chung Chang and Chih-Jen Lin, 2001. LIB-
as labels here; we therefore intend to investi- SVM: a library for Support Vector Machines.
gate other conventions with more specific and/or Software available at http://www.csie.ntu.
less ambiguous meanings, and the combination edu.tw/cjlin/libsvm.
of multiple conventions to provide more accu- Ze-Jing Chuang and Chung-Hsien Wu. 2004. Multi-
rately/specifically labelled data. Another possi- modal emotion recognition from speech and text.
Computational Linguistics and Chinese Language
bility might be to investigate approaches to anal-
Processing, 9(2):4562, August.
yse emoticons semantically on the basis of their
Taner Danisman and Adil Alpkocak. 2008. Feeler:
shape, or use features of such an analysis see Emotion classification of text using vector space
(Ptaszynski et al., 2010; Radulovic and Milikic, model. In AISB 2008 Convention, Communication,
2009) for some interesting recent work in this di- Interaction and Social Intelligence, volume 2, pages
rection. 5359, Aberdeen.

Daantje Derks, Arjan Bos, and Jasper von Grumbkow. F. Radulovic and N. Milikic. 2009. Smiley ontology.
2008a. Emoticons and online message interpreta- In Proceedings of The 1st International Workshop
tion. Social Science Computer Review, 26(3):379 On Social Networks Interoperability.
388. Jonathon Read. 2005. Using emoticons to reduce de-
Daantje Derks, Arjan Bos, and Jasper von Grumbkow. pendency in machine learning techniques for sen-
2008b. Emoticons in computer-mediated commu- timent classification. In Proceedings of the 43rd
nication: Social motives and social context. Cy- Meeting of the Association for Computational Lin-
berPsychology & Behavior, 11(1):99101, Febru- guistics. Association for Computational Linguis-
ary. tics.
Jacob Eisenstein, Brendan OConnor, Noah A. Smith, Young-Soo Seol, Dong-Joo Kim, and Han-Woo Kim.
and Eric P. Xing. 2010. A latent variable model 2008. Emotion recognition from text using knowl-
for geographic lexical variation. In Proceedings edge based ANN. In Proceedings of ITC-CSCC.
of the 2010 Conference on Empirical Methods in Y. Tanaka, H. Takamura, and M. Okumura. 2005. Ex-
Natural Language Processing, pages 12771287, traction and classification of facemarks with kernel
Cambridge, MA, October. Association for Compu- methods. In Proceedings of IUI.
tational Linguistics. Vladimir N. Vapnik. 1995. The Nature of Statistical
Paul Ekman. 1972. Universals and cultural differ- Learning Theory. Springer.
ences in facial expressions of emotion. In J. Cole, Joseph Walther and Kyle DAddario. 2001. The
editor, Nebraska Symposium on Motivation 1971, impacts of emoticons on message interpretation in
volume 19. University of Nebraska Press. computer-mediated communication. Social Science
Alec Go, Richa Bhayani, and Lei Huang. 2009. Twit- Computer Review, 19(3):324347.
ter sentiment classification using distant supervi- J. Wiebe and E. Riloff. 2005. Creating subjective
sion. Masters thesis, Stanford University. and objective sentence classifiers from unannotated
Pei-Yun Hsueh, Prem Melville, and Vikas Sindhwani. texts . In Proceedings of the 6th International Con-
2009. Data quality from crowdsourcing: A study of ference on Computational Linguistics and Intelli-
annotation selection criteria. In Proceedings of the gent Text Processing (CICLing-05), volume 3406 of
NAACL HLT 2009 Workshop on Active Learning for Springer LNCS. Springer-Verlag.
Natural Language Processing, pages 2735, Boul-
der, Colorado, June. Association for Computational
Linguistics.
Amy Ip. 2002. The impact of emoticons on affect in-
terpretation in instant messaging. Carnegie Mellon
University.
Justin Martineau. 2009. Delta TFIDF: An improved
feature space for sentiment analysis. Artificial In-
telligence, 29:258261.
Mike Mintz, Steven Bills, Rion Snow, and Dan Juraf-
sky. 2009. Distant supervision for relation extrac-
tion without labeled data. In Proceedings of ACL-
IJCNLP 2009.
Alexander Pak and Patrick Paroubek. 2010. Twitter
as a corpus for sentiment analysis and opinion min-
ing. In Proceedings of the 7th conference on Inter-
national Language Resources and Evaluation.
Bo Pang and Lillian Lee. 2008. Opinion mining and
sentiment analysis. Foundations and Trends in In-
formation Retrieval, 2(12):1135.
Robert Provine, Robert Spencer, and Darcy Mandell.
2007. Emotional expression online: Emoticons
punctuate website text messages. Journal of Lan-
guage and Social Psychology, 26(3):299307.
M. Ptaszynski, J. Maciejewski, P. Dybala, R. Rzepka,
and K Araki. 2010. CAO: A fully automatic emoti-
con analysis system based on theory of kinesics. In
Proceedings of The 24th AAAI Conference on Arti-
ficial Intelligence (AAAI-10), pages 10261032, At-
lanta, GA.

Feature-Rich Part-of-speech Tagging
for Morphologically Complex Languages: Application to Bulgarian

Georgi Georgiev and Valentin Zhikov Petya Osenova and Kiril Simov
Ontotext AD IICT, Bulgarian Academy of Sciences
135 Tsarigradsko Sh., Sofia, Bulgaria 25A Acad. G. Bonchev, Sofia, Bulgaria
{georgi.georgiev,valentin.zhikov}@ontotext.com {petya,kivs}@bultreebank.org

Preslav Nakov
Qatar Computing Research Institute, Qatar Foundation
Tornado Tower, floor 10, P.O. Box 5825, Doha, Qatar
pnakov@qf.org.qa

Abstract For example, there are six tags for verbs in the
Penn Treebank: VB (verb, base form; e.g., sing),
We present experiments with part-of-
speech tagging for Bulgarian, a Slavic lan- VBD (verb, past tense; e.g., sang), VBG (verb,
guage with rich inflectional and deriva- gerund or present participle; e.g., singing), VBN
tional morphology. Unlike most previous (verb, past participle; e.g., sung) VBP (verb, non-
work, which has used a small number of 3rd person singular present; e.g., sing), and VBZ
grammatical categories, we work with 680 (verb, 3rd person singular present; e.g., sings);
morpho-syntactic tags. We combine a large these tags are morpho-syntactic in nature. Other
morphological lexicon with prior linguis-
corpora have used even larger tagsets, e.g., the
tic knowledge and guided learning from a
POS-annotated corpus, achieving accuracy Brown corpus (Kucera and Francis, 1967) and the
of 97.98%, which is a significant improve- Lancaster-Oslo/Bergen (LOB) corpus (Johansson
ment over the state-of-the-art for Bulgarian. et al., 1986) use 87 and 135 tags, respectively.
POS tagging poses major challenges for mor-
1 Introduction phologically complex languages, whose tagsets
encode a lot of additional morpho-syntactic fea-
Part-of-speech (POS) tagging is the task of as-
tures (for most of the basic POS categories), e.g.,
signing each of the words in a given piece of text a
gender, number, person, etc. For example, the
contextually suitable grammatical category. This
BulTreeBank (Simov et al., 2004) for Bulgarian
is not trivial since words can play different syn-
uses 680 tags, while the Prague Dependency Tree-
tactic roles in different contexts, e.g., can is a
bank (Hajic, 1998) for Czech has over 1,400 tags.
noun in I opened a can of coke. but a verb in
I can write. Traditionally, linguists have classi- Below we present experiments with POS tag-
fied English words into the following eight basic ging for Bulgarian, which is an inflectional lan-
POS categories: noun, pronoun, adjective, verb, guage with rich morphology. Unlike most previ-
adverb, preposition, conjunction, and interjection; ous work, which has used a reduced set of POS
this list is often extended a bit, e.g., with deter- tags, we use all 680 tags in the BulTreeBank. We
miners, particles, participles, etc., but the number combine prior linguistic knowledge and statistical
of categories considered is rarely more than 15. learning, achieving accuracy comparable to that
Computational linguistics works with a larger reported for state-of-the-art systems for English.
inventory of POS tags, e.g., the Penn Treebank The remainder of the paper is organized as fol-
(Marcus et al., 1993) uses 48 tags: 36 for part- lows: Section 2 provides an overview of related
of-speech, and 12 for punctuation and currency work, Section 3 describes Bulgarian morphology,
symbols. This increase in the number of tags Section 4 introduces our approach, Section 5 de-
is partially due to finer granularity, e.g., there scribes the datasets, Section 6 presents our exper-
are special tags for determiners, particles, modal iments in detail, Section 7 discusses the results,
verbs, cardinal numbers, foreign words, existen- Section 8 offers application-specific error analy-
tial there, etc., but also to the desire to encode sis, and Section 9 concludes and points to some
morphological information as part of the tags. promising directions for future work.

492
2 Related Work

Most research on part-of-speech tagging has focused on English, and has relied on the Penn Treebank (Marcus et al., 1993) and its tagset for training and evaluation. The task is typically addressed as a sequential tagging problem; one notable exception is the work of Brill (1995), who proposed non-sequential transformation-based learning.

A number of different sequential learning frameworks have been tried, yielding 96-97% accuracy: Lafferty et al. (2001) experimented with conditional random fields (CRFs) (95.7% accuracy), Ratnaparkhi (1996) used a maximum entropy sequence classifier (96.6% accuracy), Brants (2000) employed a hidden Markov model (96.6% accuracy), and Collins (2002) adopted an averaged perceptron discriminative sequence model (97.1% accuracy). All these models fix the order of inference from left to right.

Toutanova et al. (2003) introduced a cyclic dependency network (97.2% accuracy), where the search is bi-directional. Shen et al. (2007) have further shown that better results (97.3% accuracy) can be obtained using guided learning, a framework for bidirectional sequence classification, which integrates token classification and inference order selection into a single learning task and uses a perceptron-like (Collins and Roark, 2004) passive-aggressive classifier to make the easiest decisions first. Recently, Tsuruoka et al. (2011) proposed a simple perceptron-based classifier applied from left to right but augmented with a lookahead mechanism that searches the space of future actions, yielding 97.3% accuracy.

For morphologically complex languages, the problem of POS tagging typically includes morphological disambiguation, which yields a much larger number of tags. For example, for Arabic, Habash and Rambow (2005) used support vector machines (SVM), achieving 97.6% accuracy with 139 tags from the Arabic Treebank (Maamouri et al., 2003). For Czech, Hajic et al. (2001) combined a hidden Markov model (HMM) with linguistic rules, which yielded 95.2% accuracy using an inventory of over 1,400 tags from the Prague Dependency Treebank (Hajic, 1998). For Icelandic, Dredze and Wallenberg (2008) reported 92.1% accuracy with 639 tags developed for the Icelandic frequency lexicon (Pind et al., 1991); they used guided learning and tag decomposition: first, a coarse POS class is assigned (e.g., noun, verb, adjective), then additional fine-grained morphological features like case, number and gender are added, and finally the proposed tags are further reconsidered using non-local features. Similarly, Smith et al. (2005) decomposed the complex tags into factors, where models for predicting part-of-speech, gender, number, case, and lemma are estimated separately, and then composed into a single CRF model; this yielded competitive results for Arabic, Korean, and Czech.

Most previous work on Bulgarian POS tagging has started with large tagsets, which were then reduced. For example, Dojchinova and Mihov (2004) mapped their initial tagset of 946 tags to just 40, which allowed them to achieve 95.5% accuracy using the transformation-based learning of Brill (1995), and 98.4% accuracy using manually crafted linguistic rules. Similarly, Georgiev et al. (2009), who used maximum entropy and the BulTreeBank (Simov et al., 2004), grouped its 680 fine-grained POS tags into 95 coarse-grained ones, and thus improved their accuracy from 90.34% to 94.4%. Simov and Osenova (2001) used a recurrent neural network to predict (a) 160 morpho-syntactic tags (92.9% accuracy) and (b) 15 POS tags (95.2% accuracy).

Some researchers did not reduce the tagset: Savkov et al. (2011) used 680 tags (94.7% accuracy), and Tanev and Mitkov (2002) used 303 tags and the BULMORPH morphological analyzer (Krushkov, 1997), achieving P=R=95%.

3 Bulgarian Morphology

Bulgarian is an Indo-European language from the Slavic language group, written with the Cyrillic alphabet and spoken by about 9-12 million people. It is also a member of the Balkan Sprachbund and thus differs from most other Slavic languages: it has no case declensions, uses a suffixed definite article (which has a short and a long form for singular masculine), and lacks verb infinitive forms. It further uses special evidential verb forms to express unwitnessed, retold, and doubtful activities.

Bulgarian is an inflective language with very rich morphology. For example, Bulgarian verbs have 52 synthetic wordforms on average, while pronouns have altogether more than ten grammatical features (not necessarily shared by all pronouns), including case, gender, person, number, definiteness, etc.
This rich morphology inevitably leads to ambiguity proliferation; our analysis of BulTreeBank shows four major types of ambiguity:

1. Between the wordforms of the same lexeme, i.e., in the paradigm. For example, divana, an inflected form of divan ('sofa', masculine), can mean (a) 'the sofa' (definite, singular, short definite article) or (b) a count form, e.g., as in dva divana ('two sofas').

2. Between two or more lexemes, i.e., conversion. For example, kato can be (a) a subordinator meaning 'as, when', or (b) a preposition meaning 'like, such as'.

3. Between a lexeme and an inflected wordform of another lexeme, i.e., across paradigms. For example, politika can mean (a) 'the politician' (masculine, singular, definite, short definite article) or (b) 'politics' (feminine, singular, indefinite).

4. Between the wordforms of two or more lexemes, i.e., across paradigms and quasi-conversion. For example, vrvi can mean (a) 'walks' (verb, 2nd or 3rd person, present tense) or (b) 'strings, laces' (feminine, plural, indefinite).

Some morpho-syntactic ambiguities in Bulgarian are occasional, but many are systematic, e.g., neuter singular adjectives have the same forms as adverbs. Overall, most ambiguities are local, and thus arguably resolvable using n-grams, e.g., compare hubavo dete ('beautiful child'), where hubavo is a neuter adjective, and Pe hubavo. ('I sing beautifully.'), where it is an adverb of manner. Other ambiguities, however, are non-local and may require discourse-level analysis, e.g., Vidh go. can mean 'I saw him.', where go is a masculine pronoun, or 'I saw it.', where it is a neuter pronoun. Finally, there are ambiguities that are very hard or even impossible(1) to resolve, e.g., Deteto vleze veselo. can mean both 'The child came in happy.' (veselo is an adjective) and 'The child came in happily.' (it is an adverb); however, the latter is much more likely.

(1) The problem also exists for English, e.g., the annotators of the Penn Treebank were allowed to use tag combinations for inherently ambiguous cases: JJ|NN (adjective or noun as prenominal modifier), JJ|VBG (adjective or gerund/present participle), JJ|VBN (adjective or past participle), NN|VBG (noun or gerund), and RB|RP (adverb or particle).

In many cases, strong domain preferences exist about how various systematic ambiguities should be resolved. We made a study for the newswire domain, analyzing a corpus of 546,029 words, and we found that ambiguity type 2 (lexeme-lexeme) prevailed for functional parts-of-speech, while the other types were more frequent for inflecting parts-of-speech. Below we show the most frequent types of morpho-syntactic ambiguities and their frequency in our corpus:

- na: preposition ('of') vs. emphatic particle, with a ratio of 28,554 to 38;
- da: auxiliary particle ('to') vs. affirmative particle, with a ratio of 12,035 to 543;
- e: 3rd person present auxiliary verb ('to be') vs. particle ('well') vs. interjection ('wow'), with a ratio of 9,136 to 21 to 5;
- singular masculine noun with a short definite article vs. count form of a masculine noun, with a ratio of 6,437 to 1,592;
- adverb vs. neuter singular adjective, with a ratio of 3,858 to 1,753.

Overall, the following factors should be taken into account when modeling Bulgarian morpho-syntax: (1) locality vs. non-locality of grammatical features, (2) interdependence of grammatical features, and (3) domain-specific preferences.

4 Method

We used the guided learning framework described in (Shen et al., 2007), which has yielded state-of-the-art results for English and has been successfully applied to other morphologically complex languages such as Icelandic (Dredze and Wallenberg, 2008); we found it quite suitable for Bulgarian as well. We used the feature set defined in (Shen et al., 2007), which includes the following:

1. The feature set of Ratnaparkhi (1996), including prefix, suffix and lexical features, as well as some bigram and trigram context features;
2. Feature templates as in (Ratnaparkhi, 1996), which have been shown helpful in bidirectional search;
3. More bigram and trigram features and bi-lexical features as in (Shen et al., 2007).

Note that we allowed prefixes and suffixes of length up to 9, as in (Toutanova et al., 2003) and (Tsuruoka and Tsujii, 2005).

We further extended the set of features with the tags proposed for the current word token by a morphological lexicon, which maps words to possible tags; it is exhaustive, i.e., the correct tag is always among the suggested ones for each token. We also used 70 linguistically-motivated, high-precision rules in order to further reduce the number of possible tags suggested by the lexicon. The rules are similar to those proposed by Hinrichs and Trushkina (2004) for German; we implemented them as constraints in the CLaRK system (Simov et al., 2003).

Here is an example of a rule: If a wordform is ambiguous between a masculine count noun (Ncmt) and a singular short definite masculine noun (Ncmsh), the Ncmt tag should be chosen if the previous token is a numeral or a number.
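A minimal sketch of how such a high-precision rule could be expressed as a tag filter outside CLaRK; the function names and the numeral test below are illustrative assumptions, not the actual constraints used in the system:

    # Sketch of the Ncmt/Ncmsh rule described above (illustrative only).
    # candidate_tags: tags suggested by the morphological lexicon for this token.
    def apply_count_noun_rule(prev_token, candidate_tags):
        ambiguous = {"Ncmt", "Ncmsh"} <= set(candidate_tags)
        if ambiguous and looks_like_numeral(prev_token):
            # keep the count-noun reading, discard the short definite reading
            return [t for t in candidate_tags if t != "Ncmsh"]
        return candidate_tags

    def looks_like_numeral(token):
        # crude placeholder test; the real rules can also consult the lexicon
        return token.isdigit()

Cascading several such filters in a fixed order is what produces the reduction in ambiguity reported below.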
The 70 rules were developed by linguists based on observations over the training dataset only. They target primarily the most frequent cases of ambiguity, and to a lesser extent some infrequent but very problematic cases. Some rules operate over classes of words, while others refer to particular wordforms. The rules were designed to be 100% accurate on our training dataset; our experiments show that they are also 100% accurate on the test and on the development dataset.

Note that some of the rules are dependent on others, and thus the order of their cascaded application is important. For example, one wordform is ambiguous between an accusative feminine singular short form of a personal pronoun ('her') and an interjection ('wow'). To handle this properly, the rule for the interjection, which targets sentence-initial positions followed by a comma, needs to be executed first. The rule for personal pronouns is only applied afterwards.

Word | Tags
To$i | Ppe-os3m
obaqe | Cc; Dd
nma | Afsi; Vnitf-o3s; Vnitf-r3s; Vpitf-o2s; Vpitf-o3s; Vpitf-r3s
vzmonost | Ncfsi
da | Ta; Tx
sledi | Ncfpi; Vpitf-o2s; Vpitf-o3s; Vpitf-r3s; Vpitz2s
... | ...

Table 1: Sample fragment showing the possible tags suggested by the lexicon. The tags that are further filtered by the rules are in italic; the correct tag is bold.

The rules are quite efficient at reducing the POS ambiguity. On the test dataset, before the rule application, 34.2% of the tokens (excluding punctuation) had more than one tag in our morphological lexicon. This number is reduced to 18.5% after the cascaded application of the 70 linguistic rules. Table 1 illustrates the effect of the rules on a small sentence fragment. In this example, the rules have left only one tag (the correct one) for three of the ambiguous words. Since the rules in essence decrease the average number of tags per token, we calculated that the lexicon suggests 1.6 tags per token on average, and after the application of the rules this number decreases to 1.44 per token.

5 Datasets

5.1 BulTreeBank

We used the latest version of the BulTreeBank (Simov and Osenova, 2004), which contains 20,556 sentences and 321,542 word tokens (about four times fewer than the English Penn Treebank), annotated using a total of 680 unique morpho-syntactic tags. See (Simov et al., 2004) for a detailed description of the BulTreeBank tagset.

We split the data into training/development/test as shown in Table 2. Note that only 552 of all 680 tag types were used in the training dataset, and the development and the test datasets combined contain a total of 128 new tag types that were not seen in the training dataset. Moreover, 32% of the word types in the development dataset and 31% of those in the testing dataset do not occur in the training dataset. Thus, data sparseness is an issue at two levels: word-level and tag-level.

Dataset | Sentences | Tokens | Types | Tags
Train | 16,532 | 253,526 | 38,659 | 552
Dev | 2,007 | 32,995 | 9,635 | 425
Test | 2,017 | 35,021 | 9,627 | 435

Table 2: Statistics about our datasets.

5.2 Morphological Lexicon

In order to alleviate the data sparseness issues, we further used a large morphological lexicon for Bulgarian, which is an extended version of the dictionary described in (Popov et al., 1998) and (Popov et al., 2003). It contains over 1.5M inflected wordforms (for 110K lemmata and 40K proper names), each mapped to a set of possible morpho-syntactic tags.
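Conceptually, such a lexicon can be thought of as a map from wordforms to tag sets; the sketch below is only an illustration of that interface (the example entry is taken from the ambiguity discussion in Section 3, not from the actual lexicon):

    # Minimal sketch of a morphological lexicon lookup (illustrative entry).
    morph_lexicon = {
        "divana": {"Ncmsh", "Ncmt"},  # 'the sofa' vs. count form, as in Section 3
    }

    def suggested_tags(word, lexicon, all_tags):
        # Soft-constraint use: the suggestions only become features, and the
        # tagger is still allowed to predict any tag from the full tagset.
        return lexicon.get(word, all_tags)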

6 Experiments and Evaluation

State-of-the-art POS taggers for English typically build a lexicon containing all tags a word type has taken in the training dataset; this lexicon is then used to limit the set of possible tags that an input token can be assigned, i.e., it imposes a hard constraint on the possibilities explored by the POS tagger. For example, if 'can' has only been tagged as a verb and as a noun in the training dataset, it will only be assigned those two tags at test time; other tags such as adjective, adverb and pronoun will not be considered. Out-of-vocabulary words, i.e., those that were not seen in the training dataset, are constrained as well, e.g., to a small set of frequent open-class tags.

In our experiments, we used a morphological lexicon that is much larger than what could be built from the training corpus only: building a lexicon from the training corpus only is of limited utility since one can hardly expect to see in the training corpus all 52 synthetic forms a verb can possibly have. Moreover, we did not use the tags listed in the lexicon as hard constraints (except in one of our baselines); instead, we experimented with a different, non-restrictive approach: we used the lexicon's predictions as features or soft constraints, i.e., as suggestions only, thus allowing each token to take any possible tag. Note that for both known and out-of-vocabulary words we used all 680 tags rather than the 552 tags observed in the training dataset; we could afford to explore this huge search space thanks to the efficiency of the guided learning framework. Allowing all 680 tags in training helped the model by exposing it to a larger set of negative examples.

We combined these lexicon features with standard features extracted from the training corpus. We further experimented with the 70 contextual linguistic rules, using them (a) as soft and (b) as hard constraints. Finally, we set four baselines: three that do not use the lexicon and one that does.

# | Baseline | Token-level accuracy (%)
1 | MFT + unknowns are wrong | 78.10
2 | MFT + unknowns are Ncmsi | 78.52
3 | MFT + guesser for unknowns | 79.49
4 | MFT + lexicon tag-classes | 94.40

Table 3: Most-frequent-tag (MFT) baselines.

6.1 Baselines

First, we experimented with the most-frequent-tag baseline, which is standard for POS tagging. This baseline ignores context altogether and assigns each word type the POS tag it was most frequently seen with in the training dataset; ties are broken randomly. We coped with word types not seen in the training dataset using three simple strategies: (a) we considered them all wrong, (b) we assigned them Ncmsi, which is the most frequent open-class tag in the training dataset, or (c) we used a very simple guesser, which assigned Ncfsi, Ncnsi, Ncfsi, and Ncmsf if the target word ended in -a, -o, -i, and -t, respectively; otherwise, it assigned Ncmsi. The results are shown in lines 1-3 of Table 3: we can see that the token-level accuracy ranges between 78% and 80% for (a)-(c), which is relatively high, given that we use a large inventory of 680 morpho-syntactic tags.
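A compact sketch of this family of baselines is given below; the data structures are assumptions for illustration, and ties are broken arbitrarily here rather than randomly:

    # Most-frequent-tag baseline with the simple suffix guesser described above.
    from collections import Counter, defaultdict

    def train_most_frequent_tag(tagged_tokens):
        counts = defaultdict(Counter)
        for word, tag in tagged_tokens:
            counts[word][tag] += 1
        return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

    def guess_unknown(word):
        # suffix-based guesses, mirroring strategy (c) above
        for suffix, tag in (("a", "Ncfsi"), ("o", "Ncnsi"), ("i", "Ncfsi"), ("t", "Ncmsf")):
            if word.endswith(suffix):
                return tag
        return "Ncmsi"

    def tag_word(word, mft_lexicon):
        return mft_lexicon.get(word) or guess_unknown(word)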
We further tried a baseline that uses the above-described morphological lexicon, in addition to the training dataset. We first built two frequency lists, containing respectively (1) the most frequent tag in the training dataset for each word type, as before, and (2) the most frequent tag in the training dataset for each class of tags that can be assigned to some word type, according to the lexicon. For example, the most frequent tag for politika is Ncfsi, and the most frequent tag for the tag-class {Ncmt;Ncmsi} is Ncmt.

Given a target word type, this new baseline first tries to assign it the most frequent tag from the first list. If this is not possible, which happens (i) in case of ties or (ii) when the word type was not seen in training, it extracts the tag-class from the lexicon and consults the second list. If there is a single most frequent tag in the corpus for this tag-class, it is assigned; otherwise a random tag from this tag-class is selected.

Line 4 of Table 3 shows that this latter baseline achieves a very high accuracy of 94.40%. Note, however, that this is over-optimistic: the lexicon contains a tag-class for each word type in our testing dataset, i.e., while there can be word types not seen in the training dataset, there are no word types that are not listed in the lexicon. Thus, this high accuracy is probably due to a large extent to the scale and quality of our morphological lexicon, and it might not be as strong with smaller lexicons; we plan to investigate this in future work.

6.2 Lexicon Tags as Soft Constraints

We experimented with three types of features:

1. Word-related features only;
2. Word-related features + the tags suggested by the lexicon;
3. Word-related features + the tags suggested by the lexicon but then further filtered using the 70 contextual linguistic rules.

Table 4 shows the sentence-level and the token-level accuracy on the test dataset for the three kinds of features, shown on lines 1, 3 and 4, respectively. We can see that using the tags proposed by the lexicon as features (lines 3 and 4) has a major positive impact, yielding up to 49% error reduction at the token-level and up to 37% at the sentence-level, as compared to using word-related features alone (line 1).

Interestingly, filtering the tags proposed by the lexicon using the 70 contextual linguistic rules yields a minor decrease in accuracy both at the word token-level and at the sentence-level (compare line 4 to line 3). This is surprising since the linguistic rules are extremely reliable: they were designed to be 100% accurate on the training dataset, and we found them experimentally to be 100% correct on the development and on the testing dataset as well.

One possible explanation is that by limiting the set of available tags for a given token at training time, we prevent the model from observing some potentially useful negative examples. We tested this hypothesis by using the unfiltered lexicon predictions at training time but then making use of the filtered ones at testing time; the results are shown on line 5. We can observe a small increase in accuracy compared to line 4: from 97.80% to 97.84% at the token-level, and from 70.30% to 70.40% at the sentence-level. Although these differences are tiny, they suggest that having more negative examples at training is helpful.

We can conclude that using the lexicon as a source of soft constraints has a major positive impact, e.g., because it provides access to important external knowledge that is complementary to what can be learned from the training corpus alone; the improvements when using linguistic rules as soft constraints are more limited.

6.3 Linguistic Rules as Hard Constraints

Next, we experimented with using the suggestions of the linguistic rules as hard constraints. Table 4 shows that this is a very good idea. Comparing line 1 to line 2, which do not use the morphological lexicon, we can see very significant improvements: from 95.72% to 97.20% at the token-level and from 52.95% to 64.50% at the sentence-level. The improvements are smaller but still consistent when the morphological lexicon is used: comparing lines 3 and 4 to lines 6 and 7, respectively, we see an improvement from 97.83% to 97.91% and from 97.80% to 97.93% at the token-level, and about 1% absolute at the sentence-level.

6.4 Increasing the Beam Size

Finally, we increased the beam size of guided learning from 1 to 3 as in (Shen et al., 2007). Comparing line 7 to line 8 in Table 4, we can see that this yields a further token-level improvement: from 97.93% to 97.98%.

7 Discussion

Table 5 compares our results to previously reported evaluation results for Bulgarian. The first four lines show the token-level accuracy for standard POS tagging tools trained and evaluated on the BulTreeBank:(2) TreeTagger (Schmid, 1994), which uses decision trees, TnT (Brants, 2000), which uses a hidden Markov model, SVMtool (Gimenez and Marquez, 2004), which is based on support vector machines, and ACOPOST (Schroder, 2002), implementing the memory-based model of Daelemans et al. (1996). The following lines report the token-level accuracy reported in previous work, as compared to our own experiments using guided learning.

We can see that we outperform by a very large margin (92.53% vs. 97.98%, which represents 73% error reduction) the systems from the first four lines, which are directly comparable to our experiments: they are trained and evaluated on the BulTreeBank using the full inventory of 680 tags. We further achieved a statistically significant improvement (p < 0.0001; Pearson's chi-squared test (Plackett, 1983)) over the best previous result on 680 tags: from 94.65% to 97.98%, which represents 62.24% error reduction at the token-level.
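As a sanity check on the figures quoted above (not an additional result), the relative error reductions follow the standard formula:

    \mathrm{error\ reduction} = \frac{acc_{\mathrm{new}} - acc_{\mathrm{old}}}{100 - acc_{\mathrm{old}}},
    \qquad \frac{97.98 - 92.53}{100 - 92.53} \approx 0.73,
    \qquad \frac{97.98 - 94.65}{100 - 94.65} \approx 0.6224.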
(2) We used the pre-trained TreeTagger; for the rest, we report the accuracy given on the Webpage of the BulTreeBank: www.bultreebank.org/taggers/taggers.html
# | Lexicon (source of) | Rules filter (a) the lexicon features | Rules filter (b) the output tags | Beam size | Sentence-level acc. (%) | Token-level acc. (%)
1 | - | - | - | 1 | 52.95 | 95.72
2 | - | - | yes | 1 | 64.50 | 97.20
3 | features | - | - | 1 | 70.40 | 97.83
4 | features | yes | - | 1 | 70.30 | 97.80
5 | features | yes, for test only | - | 1 | 70.40 | 97.84
6 | features | - | yes | 1 | 71.34 | 97.91
7 | features | yes | yes | 1 | 71.69 | 97.93
8 | features | yes | yes | 3 | 71.94 | 97.98

Table 4: Evaluation results on the test dataset. Line 1 shows the evaluation results when using features derived
from the text corpus only; these features are used by all systems in the table. Line 2 further uses the contextual
linguistic rules to limit the set of possible POS tags that can be predicted. Note that these rules (1) consult the
lexicon, and (2) always predict a single POS tag. Line 3 uses the POS tags listed in the lexicon as features, i.e.,
as soft suggestions only. Line 4 is like line 3, but the list of feature-tags proposed by the lexicon is filtered by
the contextual linguistic rules. Line 5 is like line 4, but the linguistic rules filtering is only applied at test time;
it is not done on training. Lines 6 and 7 are similar to lines 3 and 4, respectively, but here the linguistic rules
are further applied to limit the set of possible POS tags that can be predicted, i.e., the rules are used as hard
constraints. Finally, line 8 is like line 7, but here the beam size is increased to 3.
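The difference between the soft and hard uses of these suggestions can be summarized in a small sketch; this is an illustrative pseudo-interface under assumed data structures, not the actual tagger code:

    # Soft constraints: suggestions only become features; any tag remains possible.
    def soft_features(base_features, suggested_tags):
        # base_features is assumed to be a set of string-valued features
        return base_features | {"lexicon_suggests=" + t for t in suggested_tags}

    # Hard constraints: the prediction is restricted to the suggested tags.
    def hard_predict(score, all_tags, suggested_tags):
        allowed = suggested_tags or all_tags
        return max(allowed, key=score)  # score: tag -> model score for this token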

Overall, we improved over almost all previously published results. Our accuracy is second only to the manual rules approach of Dojchinova and Mihov (2004). Note, however, that they used 40 tags only, i.e., their inventory is 17 times smaller than ours. Moreover, they have optimized their tagset specifically to achieve very high POS tagging accuracy by choosing not to attempt to resolve some inherently hard systematic ambiguities, e.g., they do not try to choose between second and third person past singular verbs, whose inflected forms are identical in Bulgarian and hard to distinguish when the subject is not present (Bulgarian is a pro-drop language).

In order to compare our results more closely to the smaller tagsets in Table 5, we evaluated our best model with respect to (a) the first letter of the tag only (which is part-of-speech only, no morphological information; 13 tags), e.g., Ncmsf becomes N, and (b) the first two letters of the tag (POS + limited morphological information; 49 tags), e.g., Ncmsf becomes Nc. This yielded 99.30% accuracy for (a) and 98.85% for (b). The latter improves over (Dojchinova and Mihov, 2004), while using a slightly larger number of tags.

Our best token-level accuracy of 97.98% is comparable to and even slightly better than the state-of-the-art results for English: 97.33% when using Penn Treebank data only (Shen et al., 2007), and 97.50% for Penn Treebank plus some additional unlabeled data (Søgaard, 2011). Of course, our results are only indirectly comparable to English.

Still, our performance is impressive because (1) our model is trained on 253,526 tokens only, while the standard training sections 0-18 of the Penn Treebank contain a total of 912,344 tokens, i.e., almost four times more, and (2) we predict 680 rather than just 48 tags as for the Penn Treebank, which is 14 times more.

Note, however, that (1) we used a large external morphological lexicon for Bulgarian, which yielded about 50% error reduction (without it, our accuracy was 95.72% only), and (2) our train/dev/test sentences are generally shorter, and thus arguably simpler for a POS tagger to analyze: we have 17.4 words per test sentence in the BulTreeBank vs. 23.7 in the Penn Treebank.

Our results also compare favorably to the state-of-the-art results for other morphologically complex languages that use large tagsets, e.g., 95.2% for Czech with 1,400+ tags (Hajic et al., 2001), 92.1% for Icelandic with 639 tags (Dredze and Wallenberg, 2008), 97.6% for Arabic with 139 tags (Habash and Rambow, 2005).

8 Error Analysis

In this section, we present error analysis with respect to the impact of the POS tagger's performance on other processing steps in a natural language processing pipeline, such as lemmatization and syntactic dependency parsing.

First, we explore the most frequently confused pairs of tags for our best-performing POS tagging system; these are shown in Table 6.
Tool/Authors | Method | # Tags | Token-level accuracy (%)
*TreeTagger | Decision Trees | 680 | 89.21
*ACOPOST | Memory-based Learning | 680 | 89.91
*SVMtool | Support Vector Machines | 680 | 92.22
*TnT | Hidden Markov Model | 680 | 92.53
(Georgiev et al., 2009) | Maximum Entropy | 680 | 90.34
(Simov and Osenova, 2001) | Recurrent Neural Network | 160 | 92.87
(Georgiev et al., 2009) | Maximum Entropy | 95 | 94.43
(Savkov et al., 2011) | SVM + Lexicon + Rules | 680 | 94.65
(Tanev and Mitkov, 2002) | Manual Rules | 303 | 95.00 (=P=R)
(Simov and Osenova, 2001) | Recurrent Neural Network | 15 | 95.17
(Dojchinova and Mihov, 2004) | Transformation-based Learning | 40 | 95.50
(Dojchinova and Mihov, 2004) | Manual Rules + Lexicon | 40 | 98.40
This work | Guided Learning | 680 | 95.72
This work | Guided Learning + Lexicon | 680 | 97.83
This work | Guided Learning + Lexicon + Rules | 680 | 97.98
This work | Guided Learning + Lexicon + Rules | 49 | 98.85
This work | Guided Learning + Lexicon + Rules | 13 | 99.30

Table 5: Comparison to previous work for Bulgarian. The first four lines report evaluation results for various
standard POS tagging tools, which were retrained and evaluated on the BulTreeBank. The following lines report
token-level accuracy for previously published work, as compared to our own experiments using guided learning.

We can see that most of the wrong tags share the same part-of-speech (indicated by the initial uppercase letter), such as V for verb, N for noun, etc. This means that most errors concern the morphosyntactic features, for example, personal or impersonal verb; definite or indefinite feminine noun; singular or plural masculine adjective, etc. At the same time, there are also cases where the error has to do with the part-of-speech label itself, for example, between an adjective and an adverb, or between a numeral and an indefinite pronoun.

We want to use the above tagger to develop (1) a rule-based lemmatizer, using the morphological lexicon, e.g., as in (Plisson et al., 2004), and (2) a dependency parser like MaltParser (Nivre et al., 2007), trained on the dependency part of the BulTreeBank. We thus study the potential impact of wrong tags on the performance of these tools.

The lemmatizer relies on the lexicon and uses string transformation functions defined via two operations, remove and concatenate:

    if tag = Tag then {remove OldEnd; concatenate NewEnd}

where Tag is the tag of the wordform, OldEnd is the string that has to be removed from the end of the wordform, and NewEnd is the string that has to be concatenated to the remaining string in order to produce the lemma.

Here is an example of such a rule:

    if tag = Vpitf-o1s then {remove oh; concatenate a}

The application of the above rule to the past simple verb form qetoh ('I read') would remove oh, and then concatenate a. The result would be the correct lemma qeta ('to read').
oh, and then concatenate a. The result would be
noun; singular or plural masculine adjective, etc.
the correct lemma qeta (to read).
At the same time, there are also cases, where the
Such rules are generated for each wordform in
error has to do with the part-of-speech label itself.
the morphological lexicon; the above functional
For example, between an adjective and an adverb,
representation allows for compact representation
or between a numeral and an indefinite pronoun.
in a finite state automaton. Similar rules are ap-
We want to use the above tagger to develop plied to the unknown words, where the lemma-
(1) a rule-based lemmatizer, using the morpholog- tizer tries to guess the correct lemma.
ical lexicon, e.g., as in (Plisson et al., 2004), and Obviously, the applicability of each rule cru-
(2) a dependency parser like MaltParser (Nivre et cially depends on the output of the POS tagger.
al., 2007), trained on the dependency part of the If the tagger suggests the correct tag, then the
BulTreeBank. We thus study the potential impact wordform would be lemmatized correctly. Note
of wrong tags on the performance of these tools. that, in some cases of wrongly assigned POS tags
The lemmatizer relies on the lexicon and uses in a given context, we might still get the correct
string transformation functions defined via two lemma. This is possible in the majority of the
operations remove and concatenate: erroneous cases in which the part-of-speech has
if tag = Tag then been assigned correctly, but the wrong grammat-
ical alternative has been selected. In such cases,
{remove OldEnd; concatenate NewEnd}
the error does not influence lemmatization.
where Tag is the tag of the wordform, OldEnd is In order to calculate the proportion of such
the string that has to be removed from the end of cases, we divided each tag into two parts:
the wordform, and NewEnd is the string that has (a) grammatical features that are common for all
to be concatenated to the beginning of the word- wordforms of a given lemma, and (b) features that
form in order to produce the lemma. are specific to the wordform.

499
Freq. | Gold Tag | Proposed Tag
43 | Ansi | Dm
23 | Vpitf-r3s | Vnitf-r3s
16 | Npmsh | Npmsi
14 | Vpiif-r3s | Vniif-r3s
13 | Npfsd | Npfsi
12 | Dm | Ansi
12 | Vpitcam-smi | Vpitcao-smi
12 | Vpptf-r3p | Vpitf-r3p
11 | Vpptf-r3s | Vpptf-o3s
10 | Mcmsi | Pfe-os-mi
10 | Ppetas3n | Ppetas3m
10 | Ppetds3f | Psot3f
9 | Npnsi | Npnsd
9 | Vpptf-o3s | Vpptf-r3s
8 | Dm | A-pi
8 | Ppxts | Ppxtd
7 | Mcfsi | Pfe-os-fi
7 | Npfsi | Npfsd
7 | Ppetas3m | Ppetas3n
7 | Vnitf-r3s | Vpitf-r3s
7 | Vpitcam-p-i | Vpitcao-p-i

Table 6: Most frequently confused pairs of tags.

The part-of-speech features are always determined by the lemma. For example, Bulgarian verbs have the lemma features aspect and transitivity. If they are correct, then the lemma is also predicted correctly, regardless of whether the grammatical features are correct or wrong. For example, if the verb participle form (aorist or imperfect) has its correct aspect and transitivity, then it is also lemmatized correctly, regardless of whether the imperfect or aorist features were guessed correctly; similarly for other error types. We evaluated these cases for the 711 errors in our experiment, and we found that 206 of them (about 29%) were non-problematic for lemmatization.

For the MaltParser, we encode most of the grammatical features of the wordforms as specific features for the parser. Hence, it is much harder to evaluate the problematic cases due to the tagger. Still, we were able to make an estimation for some cases. Our strategy was to ignore the grammatical features that do not always contribute to the syntactic behavior of the wordforms. Such grammatical features for the verbs are aspect and tense. Thus, proposing perfective instead of imperfective for a verb, or present instead of past tense, would not cause problems for the MaltParser. Among our 711 errors, 190 cases (or about 27%) were not problematic for parsing.

Finally, we should note that there are two special classes of tokens for which it is generally hard to predict some of the grammatical features: (1) abbreviations and (2) numerals written with digits. In sentences, they participate in agreement relations only if they are pronounced as whole phrases; unfortunately, it is very hard for the tagger to guess such relations since it does not have at its disposal enough features, such as the inflection of the numeral form, that might help detect and use the agreement pattern.

9 Conclusion and Future Work

We have presented experiments with part-of-speech tagging for Bulgarian, a Slavic language with rich inflectional and derivational morphology. Unlike most previous work for this language, which has limited the number of possible tags, we used a very rich tagset of 680 morpho-syntactic tags as defined in the BulTreeBank. By combining a large morphological lexicon with prior linguistic knowledge and guided learning from a POS-annotated corpus, we achieved an accuracy of 97.98%, which is a significant improvement over the state-of-the-art for Bulgarian. Our token-level accuracy is also comparable to the best results reported for English.

In future work, we want to experiment with a richer set of features, e.g., derived from unlabeled data (Søgaard, 2011) or from the Web (Umansky-Pesin et al., 2010; Bansal and Klein, 2011). We further plan to explore ways to decompose the complex Bulgarian morpho-syntactic tags, e.g., as proposed in (Simov and Osenova, 2001) and (Smith et al., 2005). Modeling long-distance syntactic dependencies (Dredze and Wallenberg, 2008) is another promising direction; we believe this can be implemented efficiently using posterior regularization (Graca et al., 2009) or expectation constraints (Bellare et al., 2009).

Acknowledgments

We would like to thank the anonymous reviewers for their useful comments, which have helped us improve the paper.

The research presented above has been partially supported by the EU FP7 project 231720 EuroMatrixPlus, and by the SmartBook project, funded by the Bulgarian National Science Fund under grant D002-111/15.12.2008.
References

Mohit Bansal and Dan Klein. 2011. Web-scale features for full-scale parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT, pages 693-702, Portland, Oregon, USA.

Kedar Bellare, Gregory Druck, and Andrew McCallum. 2009. Alternating projections for learning with expectation constraints. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, UAI '09, pages 43-50, Montreal, Quebec, Canada.

Thorsten Brants. 2000. TnT: a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference, ANLP '00, pages 224-231, Seattle, Washington, USA.

Eric Brill. 1995. Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics, 21:543-565.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics, Main Volume, ACL '04, pages 111-118, Barcelona, Spain.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '02, pages 1-8, Philadelphia, PA, USA.

Walter Daelemans, Jakub Zavrel, Peter Berck, and Steven Gillis. 1996. MBT: A memory-based part of speech tagger generator. In Eva Ejerhed and Ido Dagan, editors, Fourth Workshop on Very Large Corpora, pages 14-27, Copenhagen, Denmark.

Veselka Dojchinova and Stoyan Mihov. 2004. High performance part-of-speech tagging of Bulgarian. In Christoph Bussler and Dieter Fensel, editors, AIMSA, volume 3192 of Lecture Notes in Computer Science, pages 246-255. Springer.

Mark Dredze and Joel Wallenberg. 2008. Icelandic data driven part of speech tagging. In Proceedings of the 44th Annual Meeting of the Association of Computational Linguistics: Short Papers, ACL '08, pages 33-36, Columbus, Ohio, USA.

Georgi Georgiev, Preslav Nakov, Petya Osenova, and Kiril Simov. 2009. Cross-lingual adaptation as a baseline: adapting maximum entropy models to Bulgarian. In Proceedings of the RANLP'09 Workshop on Adaptation of Language Resources and Technology to New Domains, AdaptLRTtoND '09, pages 35-38, Borovets, Bulgaria.

Jesús Giménez and Lluís Màrquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC '04, Lisbon, Portugal.

João Graça, Kuzman Ganchev, Ben Taskar, and Fernando Pereira. 2009. Posterior vs parameter sparsity in latent variable models. In Yoshua Bengio, Dale Schuurmans, John D. Lafferty, Christopher K. I. Williams, and Aron Culotta, editors, Advances in Neural Information Processing Systems 22, NIPS '09, pages 664-672. Curran Associates, Inc., Vancouver, British Columbia, Canada.

Nizar Habash and Owen Rambow. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 573-580, Ann Arbor, Michigan.

Jan Hajič, Pavel Krbec, Pavel Květoň, Karel Oliva, and Vladimír Petkevič. 2001. Serial combination of rules and statistics: A case study in Czech tagging. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, ACL '01, pages 268-275, Toulouse, France.

Jan Hajič. 1998. Building a syntactically annotated corpus: The Prague Dependency Treebank. In Eva Hajičová, editor, Issues of Valency and Meaning. Studies in Honor of Jarmila Panevová, pages 12-19. Prague Karolinum, Charles University Press.

Erhard W. Hinrichs and Julia S. Trushkina. 2004. Forging agreement: Morphological disambiguation of noun phrases. Research on Language & Computation, 2:621-648.

Stig Johansson, Eric Atwell, Roger Garside, and Geoffrey Leech. 1986. The Tagged LOB Corpus: Users' manual. ICAME, The Norwegian Computing Centre for the Humanities, Bergen University, Norway.

Hristo Krushkov. 1997. Modelling and building machine dictionaries and morphological processors (in Bulgarian). Ph.D. thesis, University of Plovdiv, Faculty of Mathematics and Informatics, Plovdiv, Bulgaria.

Henry Kučera and Winthrop Nelson Francis. 1967. Computational analysis of present-day American English. Brown University Press, Providence, RI.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, ICML '01, pages 282-289, San Francisco, CA, USA.

Mohamed Maamouri, Ann Bies, Hubert Jin, and Tim Buckwalter. 2003. Arabic Treebank: Part 1 v 2.0. LDC2003T06.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313-330.

Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryiğit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95-135.

Jörgen Pind, Fridrik Magnusson, and Stefan Briem. 1991. The Icelandic frequency dictionary. Technical report, The Institute of Lexicography, University of Iceland, Reykjavik, Iceland.

Robin L. Plackett. 1983. Karl Pearson and the chi-squared test. International Statistical Review / Revue Internationale de Statistique, 51(1):59-72.

Joel Plisson, Nada Lavrač, and Dunja Mladenić. 2004. A rule based approach to word lemmatization. In Proceedings of the 7th International Multiconference: Information Society, IS 2004, pages 83-86, Ljubljana, Slovenia.

Dimitar Popov, Kiril Simov, and Svetlomira Vidinska. 1998. Dictionary of Writing, Pronunciation and Punctuation of Bulgarian Language (in Bulgarian). Atlantis KL, Sofia, Bulgaria.

Dimityr Popov, Kiril Simov, Svetlomira Vidinska, and Petya Osenova. 2003. Spelling Dictionary of Bulgarian. Nauka i izkustvo, Sofia, Bulgaria.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Eva Ejerhed and Ido Dagan, editors, Fourth Workshop on Very Large Corpora, pages 133-142, Copenhagen, Denmark.

Aleksandar Savkov, Laska Laskova, Petya Osenova, Kiril Simov, and Stanislava Kancheva. 2011. A web-based morphological tagger for Bulgarian. In Daniela Majchráková and Radovan Garabík, editors, Slovko 2011. Sixth International Conference. Natural Language Processing, Multilinguality, pages 126-137, Modra/Bratislava, Slovakia.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pages 44-49, Manchester, UK.

Ingo Schröder. 2002. A case study in part-of-speech tagging using the ICOPOST toolkit. Technical Report FBI-HH-M-314/02, Department of Computer Science, University of Hamburg.

Libin Shen, Giorgio Satta, and Aravind Joshi. 2007. Guided learning for bidirectional sequence classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, ACL '07, pages 760-767, Prague, Czech Republic.

Kiril Simov and Petya Osenova. 2001. A hybrid system for morphosyntactic disambiguation in Bulgarian. In Proceedings of the EuroConference on Recent Advances in Natural Language Processing, RANLP '01, pages 5-7, Tzigov Chark, Bulgaria.

Kiril Simov and Petya Osenova. 2004. BTB-TR04: BulTreeBank morphosyntactic annotation of Bulgarian texts. Technical Report BTB-TR04, Bulgarian Academy of Sciences.

Kiril Ivanov Simov, Alexander Simov, Milen Kouylekov, Krasimira Ivanova, Ilko Grigorov, and Hristo Ganev. 2003. Development of corpora within the CLaRK system: The BulTreeBank project experience. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, EACL '03, pages 243-246, Budapest, Hungary.

Kiril Simov, Petya Osenova, and Milena Slavcheva. 2004. BTB-TR03: BulTreeBank morphosyntactic tagset. Technical Report BTB-TR03, Bulgarian Academy of Sciences.

Noah A. Smith, David A. Smith, and Roy W. Tromble. 2005. Context-based morphological disambiguation with random fields. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 475-482, Vancouver, British Columbia, Canada.

Anders Søgaard. 2011. Semi-supervised condensed nearest neighbor for part-of-speech tagging. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, ACL-HLT, pages 48-52, Portland, Oregon, USA.

Hristo Tanev and Ruslan Mitkov. 2002. Shallow language processing architecture for Bulgarian. In Proceedings of the 19th International Conference on Computational Linguistics, COLING '02, pages 1-7, Taipei, Taiwan.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '03, pages 173-180, Edmonton, Canada.

Yoshimasa Tsuruoka and Junichi Tsujii. 2005. Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT-EMNLP '05, pages 467-474, Vancouver, British Columbia, Canada.

Yoshimasa Tsuruoka, Yusuke Miyao, and Junichi Kazama. 2011. Learning with lookahead: Can history-based models rival globally optimized models? In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT, pages 238-246, Portland, Oregon, USA.

Shulamit Umansky-Pesin, Roi Reichart, and Ari Rappoport. 2010. A multi-domain web-based algorithm for POS tagging of unknown words. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING '10, pages 1274-1282, Beijing, China.
Instance-Driven Attachment of Semantic Annotations
over Conceptual Hierarchies

Janara Christensen, University of Washington, Seattle, Washington 98195, janara@cs.washington.edu
Marius Pasca, Google Inc., Mountain View, California 94043, mars@google.com

Abstract

Whether automatically extracted or human generated, open-domain factual knowledge is often available in the form of semantic annotations (e.g., composed-by) that take one or more specific instances (e.g., rhapsody in blue, george gershwin) as their arguments. This paper introduces a method for converting flat sets of instance-level annotations into hierarchically organized, concept-level annotations, which capture not only the broad semantics of the desired arguments (e.g., People rather than Locations), but also the correct level of generality (e.g., Composers rather than People, or Jazz Composers). The method refrains from encoding features specific to a particular domain or annotation, to ensure immediate applicability to new, previously unseen annotations. Over a gold standard of semantic annotations and concepts that best capture their arguments, the method substantially outperforms three baselines, on average computing concepts that are less than one step in the hierarchy away from the corresponding gold standard concepts.

1 Introduction

Background: Knowledge about the world can be thought of as semantic assertions or annotations, at two levels of granularity: instance level (e.g., rhapsody in blue, tristan und isolde, george gershwin, richard wagner) and concept level (e.g., Musical Compositions, Works of Art, Composers). Instance-level annotations correspond to factual knowledge that can be found in repositories extracted automatically from text (Banko et al., 2007; Wu and Weld, 2010) or manually created within encyclopedic resources (Remy, 2002). Such facts could state, for instance, that rhapsody in blue was composed-by george gershwin, or that tristan und isolde was composed-by richard wagner. In comparison, concept-level annotations more concisely and effectively capture the underlying semantics of the annotations by identifying the concepts corresponding to the arguments, e.g., Musical Compositions are composed-by Composers.

The frequent occurrence of instances, relative to more abstract concepts, in Web documents and popular Web search queries (Barr et al., 2008; Li, 2010), is both an asset and a liability from the point of view of knowledge acquisition. On one hand, it makes instance-level annotations relatively easy to find, either from manually created resources (Remy, 2002; Bollacker et al., 2008), or extracted automatically from text (Banko et al., 2007). On the other hand, it makes concept-level annotations more difficult to acquire directly. While "Rhapsody in Blue was composed by George Gershwin [..]" may occur in some form within Web documents, the more abstract "Musical compositions are composed by musicians [..]" is unlikely to occur. A more practical approach to collecting concept-level annotations is to indirectly derive them from already plentiful instance-level annotations, effectively distilling factual knowledge into more abstract, concise and generalizable knowledge.

Contributions: This paper introduces a method for converting flat sets of specific, instance-level annotations into hierarchically organized, concept-level annotations. As illustrated in Figure 1, the resulting annotations must capture not just the broad semantics of the desired arguments (e.g., People rather than Locations or Prod-

* Contributions made during an internship at Google.
ucts, as the right argument of the annotation composed-by), but actually identify the concepts at the correct level of generality/specificity (e.g., Composers rather than Artists or Jazz Composers) in the underlying conceptual hierarchy. To ensure portability to new, previously unseen annotations, the proposed method avoids encoding features specific to a particular domain or annotation. In particular, the use of annotation labels (composed-by) as lexical features might be tempting, but would anchor the annotation model to that particular annotation. Instead, the method relies only on features that generalize across annotations. Over a gold standard of semantic annotations and concepts that best capture their arguments, the method substantially outperforms three baseline methods. On average, the method computes concepts that are less than one step in the hierarchy away from the corresponding gold standard concepts of the various annotations.

[Figure 1 diagram omitted. It shows annotations (composed-by, lives-in, instrument-played, sung-by) attached to nodes of a conceptual hierarchy containing, among others, People, Composers, Musicians, Cellists, Singers, Composers by genre, Baroque Composers and Jazz Composers.]

Figure 1: Hierarchical Semantic Annotations: The attachment of semantic annotations (e.g., composed-by) into a conceptual hierarchy, a portion of which is shown in the diagram, requires the identification of the correct concept at the correct level of generality (e.g., Composers rather than Jazz Composers or People, for the right argument of composed-by).

2 Hierarchical Semantic Annotations

2.1 Task Description

Data Sources: The computation of hierarchical semantic annotations relies on the following data sources:

- a target annotation r (e.g., acted-in) that takes M arguments;
- N annotations I = {<i_1^j, ..., i_M^j>}, j = 1..N, of r at instance level, e.g., {<leonardo dicaprio, inception>, <milla jovovich, fifth element>} (in this example, M = 2);
- mappings {i -> c} from instances to concepts to which they belong, e.g., milla jovovich -> American Actors, milla jovovich -> People from Kiev, milla jovovich -> Models;
- mappings {c_s -> c_g} from more specific concepts to more general concepts, as encoded in a hierarchy H, e.g., American Actors -> Actors, People from Kiev -> People from Ukraine, Actors -> Entertainers.

Thus, the main inputs are the conceptual hierarchy H, and the instance-level annotations I. The hierarchy contains instance-to-concept mappings, as well as specific-to-general concept mappings. Via transitivity, instances (milla jovovich) and concepts (American Actors) may be immediate children of more general concepts (Actors), or transitive descendants of more general concepts (Entertainers). The hierarchy is not required to be a tree; in particular, a concept may have multiple parent concepts. The instance-level annotations may be created collaboratively by human contributors, or extracted automatically from Web documents or some other data source.

Goal: Given the data sources, the goal is to determine to which concept c in the hierarchy H the arguments of the target concept-level annotation r should be attached. While the left argument of acted-in could attach to American Actors, People from Kiev, Entertainers or People, it is best attached to the concept Actors. The goal is to select the concept c that most appropriately generalizes across the instances. Over the set I of instance-level annotations, selecting a method for this goal can be thought of as a minimization problem. The metric to be minimized is the sum of the distances between each predicted concept c and the correct concept c_gold, where the distance is the number of edges between c and c_gold in H.

Intuitions and Challenges: Given instances such as milla jovovich that instantiate an argument of an annotation like acted-in, the conceptual hierarchy can be used to propagate the annotation upwards, from instances to their concepts, then in turn further upwards to more general concepts. The best concept would be one of the many candidate concepts reached during propagation. Intuitively, when compared to other candidate concepts, a higher proportion of the descendant instances of the best concept should instantiate (or match) the annotation. At the same time, relative to other candidate concepts, the best concept should have more descendant instances.

While the intuitions seem clear, their inclusion in a working method faces a series of practical challenges. First, the data sources may be noisy. One form of noise is missing or erroneous
instance-level annotations, which may artificially skew the distribution of matching instances towards a less than optimal region in the hierarchy. If the input annotations for acted-in are available almost exhaustively for all descendant instances of American Actors, and are available for only a few of the descendant instances of Belgian Actors, Italian Actors etc., then the distribution over the hierarchy may incorrectly suggest that the left argument of acted-in is American Actors rather than the more general Actors. In another example, if virtually all instances that instantiate the left argument of the annotation won-award are mapped to the concept Award Winning Actors, then it would be difficult to distinguish Award Winning Actors from the more general Actors or People, as the best concept to be computed for the annotation. Another type of noise is missing or erroneous edges in the hierarchy, which could artificially direct propagation towards irrelevant regions of the hierarchy, or prevent propagation from even reaching relevant regions of the hierarchy. For example, if the hierarchy incorrectly maps Actors to Entertainment, then Entertainment and its ancestor concepts incorrectly become candidate concepts during propagation for the left argument of acted-in. Conversely, if missing edges caused Actors to not have any children in the hierarchy, then Actors would not even be reached and considered as a candidate concept during propagation.

Second, to apply evidence collected from some annotations to a new annotation, the evidence must generalize across annotations. However, collected evidence or statistics may vary widely across annotations. Observing that 90% of all descendant instances of the concept Actors match an annotation acted-in constitutes strong evidence that Actors is a good concept for acted-in. In contrast, observing that only 0.09% of all descendant instances of the concept Football Teams match won-super-bowl should not be as strong negative evidence as the percentage suggests.

[Figure 2 diagram omitted. It shows the flow from the inputs (a conceptual hierarchy; instance-level annotations such as acted-in(leonardo dicaprio, inception), acted-in(milla jovovich, fifth element), acted-in(judy dench, casino royale), acted-in(colin firth, the kings speech); instance-to-concept mappings; query logs) to candidate concepts, raw statistics, pairwise training/testing data, classified data and ranked data, ending with the concept-level annotation acted-in(Actors, ?).]

Figure 2: Method Overview: Inferring concept-level annotations from instance-level annotations.

2.2 Inferring Concept-Level Annotations

Determining Candidate Concepts: As illustrated in the left part of Figure 2, the first step towards inferring concept-level from instance-level annotations is to propagate the instances that instantiate a particular argument of the annotation upwards in the hierarchy. Starting from the left arguments of the annotation acted-in, namely leonardo dicaprio, milla jovovich etc., the propagation reaches their parent concepts American Actors, English Actors, then their parent and ancestor concepts Actors, People, Entities etc. The concepts reached during upward propagation become candidate concepts. In subsequent steps, the candidates are modeled, scored and ranked such that ideally the best concept is ranked at the top.

Ranking Candidate Concepts: The identification of a ranking function is cast as a semi-supervised learning problem. Given the correct (gold) concept of an annotation, it would be tempting to employ binary classification directly, by marking the correct concept as a positive example, and all other candidate concepts as negative examples. Unfortunately, this would produce a highly imbalanced training set, with thousands of negative examples and, more importantly, with only one positive example. Another disadvantage of using binary classification directly is that it is difficult to capture the preference for concepts closer in the hierarchy to the correct concept, over concepts many edges away. Finally, the absolute values of the features that might be employed may be comparable within an annotation, but incomparable across annotations, which reduces the portability of the resulting model to new annotations.

To address the above issues, the ranking function proposed does not construct training examples from raw features collected for each individual candidate concept. Instead, it constructs training examples from pairwise comparisons of a candidate concept with another candidate concept. Concretely, a pairwise comparison is labeled as a positive example if the first concept is closer to the correct concept than the second, or as negative otherwise. The pairwise formulation has three immediate advantages. First, it accommodates the preference for concepts closer to the gold concept. Second, the pairwise formulation produces a larger, more balanced training set. Third, decisions of whether the first concept being compared is more relevant than the second are more likely to generalize across annotations than absolute decisions of whether (and how much) a particular concept is relevant for a given annotation.

Compiling Ranking Features: The features are grouped into four categories: (A) annotation co-occurrence features, (B) concept features, (C) argument co-occurrence features, and (D) combination features, as described below.

(A) Annotation Co-occurrence Features: The annotation co-occurrence features emphasize how well an annotation applies to a concept. These features include (1) MATCHED INSTANCES, the number of descendant instances of the concept that appear with the annotation, (2) INSTANCE PERCENT, the percentage of matched instances in the concept, and (3) MORE THAN THREE MATCHING INSTANCES and (4) MORE THAN TEN MATCHING INSTANCES, which indicate when the matching descendant instances might be noise.

Also in this category are features that relay information about the candidate concept's child concepts. These features include (1) MATCHED CHILDREN, the number of child concepts containing at least one matching instance, (2) CHILDREN PERCENT, the percentage of child concepts with at least one matching instance, (3) AVG INSTANCE PERCENT CHILDREN, the average percentage of matching descendant instances of the child concepts, and (4) INSTANCE PERCENT TO INSTANCE PERCENT CHILDREN, the ratio between INSTANCE PERCENT and AVG INSTANCE PERCENT CHILDREN. The last feature is meant to capture dramatic changes in percentages when moving in the hierarchy from child concepts to the candidate concept in question.

(B) Concept Features: Concept features approximate the generality of the concepts: (1) NUM INSTANCES, the number of descendant instances of the concept, (2) NUM CHILDREN, the number of child concepts, and (3) DEPTH, the distance to the concept's farthest descendant.

(C) Argument Co-occurrence Features: The argument co-occurrence features model the likelihood that an annotation applies to a concept by looking at co-occurrences with another argument of the same annotation. Intuitively, if a concept representing one argument has a high co-occurrence with an instance that is some other argument, a relationship more likely exists between members of the concept and the instance. For example, given acted-in, Actors is likely to have a higher co-occurrence with casablanca than People is. These features are generated from a set of Web queries. Therefore, the collected values are likely to be affected by different noise than that present in the original dataset. For every concept and instance pair from the arguments of a given annotation, they feature the number of times each of the tokens in the concept appears in the same query with each of the tokens in the instance, normalizing to the respective number of tokens. The procedure generates, for each candidate concept, an average co-occurrence score (AVG CO-OCCURRENCE) and a total co-occurrence score (TOTAL CO-OCCURRENCE) over all instances the concept is paired with.

(D) Combination Features: The last group of features are combinations of the above features: (1) DEPTH,INSTANCE PERCENT, which is DEPTH multiplied by INSTANCE PERCENT, and
Concept Distance Match Total Match Total AvgInst Depth Avg Total
ToCorrect Inst Inst Child Child PercOfChild Cooccur Cooccur
People 4 36512 879423 22 29 4% 14 0.67 33506
Actors 0 29101 54420 6 10 32% 6 2.08 99971
English Actors 2 3091 5922 3 4 37% 3 2.75 28378

Labeled Concept Pair Annotation Co-occurrence Concept Arg Co-occurrence Combination


Features Features Features Features
Concept Label Match Inst Match Child AvgInst Num Num Depth Avg Total Depth DepthInst
Pair Inst Perc Child Perc PercChild Inst Child Cooccur Cooccur InstPerc PercChild
People-Actors 0 1.25 0.08 3.67 1.26 0.13 1.25 3.67 2.33 0.32 0.34 0.18 0.66
Actors-People 1 0.8 12.88 0.27 0.79 7.65 0.8 0.27 0.43 3.11 2.98 5.52 1.51
Actors-English Actors 1 9.41 1.02 2.0 0.8 0.87 9.41 2.0 2.0 0.76 3.52 2.05 4.1
English Actors-Actors 0 0.11 0.98 0.5 1.25 1.15 0.11 0.5 0.5 1.32 0.28 0.49 0.24
English Actors-People 1 0.08 12.57 0.14 0.99 8.82 0.08 0.14 0.21 4.12 0.85 2.69 0.37
People-English Actors 0 11.81 0.08 7.33 1.01 0.11 11.81 7.33 4.67 0.24 1.18 0.37 2.72

Table 1: Training/Testing Examples: The top table shows examples of raw statistics gathered for three candidate
concepts for the left argument of the annotation acted-in. The second table shows the training/testing examples
generated from these concepts and statistics. Each example represents a pair of concepts which is labeled positive
if the first concept is closer to the correct concept than the second concept. Features shown here are the ratio
between a statistic for the first concept and a statistic for the second (e.g. D EPTH for Actors-English Actors is 2
as Actors has depth of 6 and English Actors has depth of 3). Some features omitted due to space constraints.
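To make the examples in Table 1 concrete, the following sketch generates pairwise training examples with raw-ratio features and ranks candidates by the number of pairwise comparisons they win, as described in this section. It is not the authors' code; the helper names, the dictionary-based inputs and the exact feature set are illustrative assumptions.

```python
from itertools import permutations

def pairwise_examples(candidates, stats, dist_to_gold):
    """Turn per-concept statistics into pairwise training examples.

    candidates:   list of candidate concept names
    stats:        dict concept -> dict of raw feature values (all concepts are
                  assumed to share the same feature names)
    dist_to_gold: dict concept -> number of hierarchy edges to the gold concept

    Each ordered pair (c1, c2) becomes one example: its label is 1 if c1 is
    strictly closer to the gold concept than c2, and its feature values are the
    ratios between the corresponding statistics of c1 and c2 (0 when the
    denominator is 0), mirroring the bottom part of Table 1.
    """
    examples = []
    for c1, c2 in permutations(candidates, 2):
        label = 1 if dist_to_gold[c1] < dist_to_gold[c2] else 0
        feats = {name: (stats[c1][name] / stats[c2][name] if stats[c2][name] else 0.0)
                 for name in stats[c1]}
        examples.append((label, feats))
    return examples

def rank_by_pairwise_wins(candidates, more_relevant):
    """Rank candidates by how many other candidates they beat.

    more_relevant(c1, c2) is the trained classifier's decision that c1 is more
    relevant than c2; concepts are sorted by their number of wins.
    """
    wins = {c: sum(more_relevant(c, other) for other in candidates if other != c)
            for c in candidates}
    return sorted(candidates, key=wins.get, reverse=True)
```

With the raw statistics from the top part of Table 1, the pair Actors-People would, for instance, receive label 1 and ratio-valued features of the kind shown in the bottom part of the table.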

(2) DEPTH,INSTANCEPERCENT,CHILDREN, which is the DEPTH multiplied by the INSTANCEPERCENT multiplied by MATCHEDCHILDREN. Both these features seek to balance the perceived relevance of an annotation to a candidate concept, with the generality of the candidate concept.

Generating Learning Examples: For a given annotation, the ranking features described so far are computed for each candidate concept (e.g., Movie Actors, Models, Actors). However, the actual training and testing examples are generated for pairs of candidate concepts (e.g., <Film Actors, Models>, <Film Actors, Actors>, <Models, Actors>). A training example represents a comparison between two candidate concepts, and specifies which of the two is more relevant. To create training and testing examples, the values of the features of the first concept in the pair are respectively combined with the values of the features of the second concept in the pair to produce values corresponding to the entire pair. Following classification of testing examples, concepts are ranked according to the number of other concepts which they are classified as more relevant than. Table 1 shows examples of training/testing data.

3 Experimental Setting

3.1 Data Sources

Conceptual Hierarchy: The experiments compute concept-level annotations relative to a conceptual hierarchy derived automatically from the Wikipedia (Remy, 2002) category network, as described in (Ponzetto and Navigli, 2009). The hierarchy filters out edges (e.g., from British Film Actors to Cinema of the United Kingdom) from the Wikipedia category network that do not correspond to IsA relations. A concept in the hierarchy is a Wikipedia category (e.g., English Film Actors) that has zero or more Wikipedia categories as child concepts, and zero or more Wikipedia categories (e.g., English People by Occupation, British Film Actors) as parent concepts. Each concept in the hierarchy has zero or more instances, which are the Wikipedia articles listed (in Wikipedia) under the respective categories (e.g., colin firth is an instance of English Actors).

Instance-Level Annotations: The experiments exploit a set of binary instance-level annotations (e.g., acted-in, composed) among Wikipedia instances, as available in Freebase (Bollacker et al., 2008). The annotation is a Freebase property (e.g., /music/composition/composer). Internally, the left and right arguments are Freebase topic identifiers mapped to their corresponding Wikipedia articles (e.g., /m/03f4k mapped to the Wikipedia article on george gershwin). In this paper, the derived annotations and instances are displayed in a shorter, more readable form for conciseness and clarity. As features do not use the label of the annotation, labels are never used in the experiments and evaluation.
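Given these two resources, the upward propagation step described earlier can be sketched as follows. This is a minimal illustration, not the actual implementation; the parent map over Wikipedia instances and categories is a simplifying assumption about how the hierarchy is stored.

```python
def propagate_to_candidates(annotated_instances, parents):
    """Collect candidate concepts for one argument of an annotation.

    annotated_instances: the Wikipedia instances observed as this argument of
                         the instance-level annotation (e.g. the left arguments
                         of acted-in)
    parents:             dict mapping a node (instance or category) to its
                         parent categories in the conceptual hierarchy

    Every category reachable upwards from any matching instance becomes a
    candidate concept.
    """
    candidates = set()
    frontier = list(annotated_instances)
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, ()):
            if parent not in candidates:
                candidates.add(parent)
                frontier.append(parent)
    return candidates
```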

Web Search Queries: The argument co-occurrence features described above are computed over a set of around 100 million anonymized Web search queries from 2010.

3.2 Experimental Runs

The experimental runs exploit ranking features described in the previous section, employing:

- one of three learning algorithms: naive Bayes (NAIVEBAYES), maximum entropy (MAXENT), or perceptron (PERCEPTRON) (Mitchell, 1997), chosen for their scalability to larger datasets via distributed implementations;

- one of three ways of combining the values of features collected for individual candidate concepts into values of features for pairs of candidate concepts: the raw ratio of the values of the respective features of the two concepts (0 when the denominator is 0); the ratio scaled to the interval [0, 1]; or a binary value indicating which of the values is larger.

For completeness, the experiments include three additional, baseline runs. Each baseline computes scores for all candidate concepts based on the respective metric; then candidate concepts are ranked in decreasing order of their scores. The baseline metrics are:

- INSTPERCENT ranks candidate concepts by the percentage of matched instances that are descendants of the concept. It emphasizes concepts which are proven to belong to the annotation;

- ENTROPY ranks candidate concepts by the entropy (Shannon, 1948) of the proportion of matched descendant instances of the concept;

- AVGDEPTH ranks candidate concepts by their distances to half of the maximum hierarchy height, emphasizing a balance of generality and specificity.

3.3 Evaluation Procedure

Gold Standard of Concept-Level Annotations: A random, weighted sample of 200 annotation labels (e.g., corresponding to composed-by, play-instrument) is selected, out of the set of labels of all instance-level annotations collected from Freebase. During sampling, the weights are the counts of distinct instance-level annotations (e.g., <rhapsody in blue, george gershwin>) available for the label. The arguments of the annotation labels are then manually annotated with a gold concept, which is the category from the Wikipedia hierarchy that best captures their semantics. The manual annotation is carried out independently by two human judges, who then verify each other's work and discard inconsistencies. For example, the gold concept of the left argument of composed-by is annotated to be the Wikipedia category Musical Compositions. In the process, some annotation labels are discarded, when (a) it is not clear what concept captures an argument (e.g., for the right argument of function-of-building), or (b) more than 5000 candidate concepts are available via propagation for one of the arguments, which would cause too many training or testing examples to be generated via concept pairs, and slow down the experiments. The retained 139 annotation labels, whose arguments have been labeled with their respective gold concepts, form the gold standard for the experiments. More precisely, an entry in the resulting gold standard consists of an annotation label, one of its arguments being considered (left or right), and a gold concept that best captures that argument. The set of annotation labels from the gold standard is quite diverse and covers many domains of potential interest, e.g., has-company(Industries, Companies), written-by(Films, Screenwriters), member-of(Politicians, Political Parties), or part-of-movement(Artists, Art Movements).

Evaluation Metric: Following previous work on selectional preferences (Kozareva and Hovy, 2010; Ritter et al., 2010), each entry in the gold standard (i.e., each argument for a given annotation) is evaluated separately. Experimental runs compute a ranked list of candidate concepts for each entry in the gold standard. In theory, a computed candidate concept is better if it is closer semantically to the gold concept. In practice, the accuracy of a ranked list of candidate concepts, relative to the gold concept of the annotation label, is measured by two scoring metrics that correspond to the mean reciprocal rank score (MRR) (Voorhees and Tice, 2000) and a modification of it (DRR) (Pasca and Alfonseca, 2009):

MRR = \frac{1}{N} \sum_{i=1}^{N} \max_{rank} \frac{1}{rank_i}

N is the number of annotations and rank_i is the rank of the gold concept in the returned list for MRR. An annotation a_i receives no credit for MRR if the gold concept does not appear in the corresponding ranked list.

DRR = \frac{1}{N} \sum_{i=1}^{N} \max_{rank} \frac{1}{rank_i \, (1 + Len)}

For DRR, rank_i is the rank of a candidate concept in the returned list and Len is the length of

Annotation (Number of Candidate Concepts) Examples of Instances Top Ranked Concepts
Composers compose Musical Compositions (3038) aaron copland; black sabbath Music by Nationality; Composers; Classical
Composers
Musical Compositions composed-by Composers (1734) we are the champions; yor- Musical Compositions; Compositions by
ckscher marsch Composer; Classical Music
Foods contain Nutrients (1112) acca sellowiana; lasagna Foods; Edible Plants; Food Ingredients
Organizations has-boardmember People (3401) conocophillips; spence school Companies by Stock Exchange; Companies
Listed on the NYSE; Companies
Educational Organizations has-graduate Alumni (4072) air force institute of technology; Education by Country; Schools by Country;
deering high school Universities and Colleges by Country
Television Actors guest-role Fictional Characters (4823) melanie griffith; patti laBelle Television Actors by Nationality; Actors;
American Actors
Musical Groups has-member Musicians (2287) steroid maximus; u2 Musical Groups; Musical Groups by Genre;
Musical Groups by Nationality
Record Labels represent Musician (920) columbia records; vandit Record Labels; Record Labels by Country;
Record Labels by Genre
Awards awarded-to People (458) academy award for best original Film Awards; Awards; Grammy Awards
song; erasmus prize
Foods contain Nutrients (177) lycopene; glutamic acid Carboxylic Acids ; Acids; Essential Nutrients
Architects design Buildings and Structures (4811) 20 times square; berkeley build- Buildings and Structures; Buildings and Struc-
ing tures by Architect; Houses by Country
People died-from Causes of Death (577) malaria; skiing Diseases; Infectious Diseases; Causes of
Death
Art Directors direct Films (1265) batman begins; the lion king Films; Films by Director; Film
Episodes guest-star Television Actors (1067) amy poehler; david caruso Television Actors by Nationality; Actors;
American Actors
Television Network has-tv-show Television Series (2492) george of the jungle; great expec- Television Series by Network; Television Se-
tations ries; Television Series by Genre
Musicians play Musical Instruments (423) accordion; tubular bell Musical Instruments; Musical Instruments by
Nationality; Percussion Instruments
Politicians member-of Political Parties (938) independent moralizing front; Political Parties; Political Parties by Country;
national coalition party Political Parties by Ideology

Table 2: Concepts Computed for Gold-Standard Annotations: Examples of entries from the gold standard and
counts of candidate concepts (Wikipedia categories) reached from upward propagation of instances (Wikipedia
instances). The target gold concept is shown in bold. Also shown are examples of Wikipedia instances, and the
top concepts computed by the best-performing learning algorithm for the respective gold concepts.

the minimum path in the hierarchy between the concept and the gold concept. Len is minimum (0) if the candidate concept is the same as the gold standard concept. A given annotation a_i receives no credit for DRR if no path is found between the returned concepts and the gold concept.

As an illustration, for a single annotation, the right argument of composed-by, the ranked list of concepts returned by an experimental run may be [Symphonies by Anton Bruckner, Symphonies by Joseph Haydn, Symphonies by Gustav Mahler, Musical Compositions, ...], with the gold concept being Musical Compositions. The length of the path between Symphonies by Anton Bruckner etc. and Musical Compositions is 2 (via Symphonies). Therefore, the MRR score would be 0.25 (given by the fourth element of the ranked list), whereas the DRR score would be 0.33 (given by the first element of the ranked list).

MRR and DRR are computed in five-fold cross-validation. Concretely, the gold standard is split into five folds such that the sets of annotation labels in each fold are disjoint. Thus, none of the annotation labels in testing appears in training. This restriction makes the evaluation more rigorous and conservative as it actually assesses the extent to which the models learned are applicable to new, previously unseen annotation labels. If this restriction were relaxed, the baselines would perform equivalently as they do not depend on the training data, but the learned methods would likely do better.

4 Evaluation Results

4.1 Quantitative Results

Conceptual Hierarchy: The conceptual hierarchy contains 108,810 Wikipedia categories, and its maximum depth, measured as the distance from a concept to its farthest descendant, is 16.

Candidate Concepts: On average, for the gold standard, the method propagates a given annotation from instances to 1,525 candidate concepts, from which the single best concept must be determined. The left part of Table 2 illustrates the number of candidate concepts reached during propagation for a sample of annotations.
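For concreteness, the two metrics of Section 3.3 can be sketched as follows. This is a simplified illustration, not the evaluation code used in the paper; the path_length helper is an assumption that returns the shortest hierarchy path between two concepts, or None if they are disconnected.

```python
def mrr_score(ranked, gold):
    """Reciprocal rank of the gold concept, 0 if it is not in the list."""
    for rank, concept in enumerate(ranked, start=1):
        if concept == gold:
            return 1.0 / rank
    return 0.0

def drr_score(ranked, gold, path_length):
    """Best value of 1 / (rank * (1 + Len)) over the returned list."""
    best = 0.0
    for rank, concept in enumerate(ranked, start=1):
        length = path_length(concept, gold)   # Len; None if no path exists
        if length is not None:
            best = max(best, 1.0 / (rank * (1 + length)))
    return best

# Averaging these scores over the N gold-standard entries gives the reported
# MRR and DRR values.
```

Applied to the composed-by illustration above, the gold concept at rank 4 yields an MRR of 0.25, while the rank-1 concept two hierarchy edges away yields the DRR of 0.33.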

Experimental Run Accuracy 0.513 (DRR) over the top 20 computed concepts,
N=1 N=20
MRR DRR MRR DRR
and 0.245 (MRR) and 0.456 (DRR) when consid-
With raw-ratio features:
ering only the first concept. These scores corre-
NAIVE BAYES 0.021 0.180 0.054 0.222 spond to the ranked list being less than one step
M AXENT 0.029 0.168 0.045 0.208 away in the hierarchy. The very first computed
P ERCEPTRON 0.029 0.176 0.045 0.216 concept exactly matches the gold concept in about
With scaled-ratio features: one in four cases, and is slightly more than one
NAIVE BAYES 0.050 0.170 0.112 0.243 step away from it. In comparison, the very first
M AXENT 0.245 0.456 0.430 0.513
concept computed by the best baseline matches
P ERCEPTRON 0.245 0.391 0.367 0.461
the gold concept in about one in 35 cases (0.029
With binary features:
NAIVE BAYES 0.115 0.297 0.224 0.361 MRR), and is about 6 steps away (0.173 DRR).
M AXENT 0.165 0.390 0.293 0.441 The accuracies of the various learning algorithms
P ERCEPTRON 0.180 0.332 0.330 0.429 (not shown) were also measured and correlated
For baselines: roughly with the MRR and DRR scores.
I NST P ERCENT 0.029 0.173 0.045 0.224 Discussion: The baseline runs I NST P ERCENT
E NTROPY 0.000 0.110 0.007 0.136
and E NTROPY produce categories that are far
AVG D EPTH 0.007 0.018 0.028 0.045
too specific. For the gold annotation composed-
Table 3: Precision Results: Accuracy of ranked lists by(Composers, Musical Compositions), I NST-
of concepts (Wikipedia categories) computed by var- P ERCENT produces Scottish Flautists for the left
ious runs, as an average over the gold standard of argument and Operas by Ernest Reyer for the
concept-level annotations, considering the top N can- right. AVG D EPTH does not suffer from over-
didate concepts computed for each gold standard entry.
specification, but often produces concepts that
4.2 Qualitative Results have been reached via propagation, yet are not
close to the gold concept. For composed-by,
Precision: Table 3 compares the precision of the
AVG D EPTH produces Film for the left argument
ranked lists of candidate concepts produced by the
and History by Region for the right.
experimental runs. The MRR and DRR scores in
the table consider either at most 20 of the concepts
4.3 Error Analysis
in the ranked list computed by a given experimen-
tal run, or only the first, top ranked computed con- The right part of Table 2 provides a more de-
cept. Note that, in the latter case, the MRR and tailed view into the best performing experimental
DRR scores are equivalent to precision@1 scores. run, showing actual ranked lists of concepts pro-
Several conclusions can be drawn from the re- duced for a sample of the gold standard entries
sults. First, as expected by definition of the by M AXENT with scaled-ratio. A separate analy-
scoring metrics, DRR scores are higher than the sis of the results indicates that the most common
stricter MRR scores, as they give partial credit cause of errors is noise in the conceptual hier-
to concepts that, while not identical to the gold archy, in the form of unbalanced instance-level
concepts, are still close approximations. This is annotations and missing hierarchy edges. Un-
particularly noticeable for the runs M AXENT and balanced annotations are annotations where cer-
P ERCEPTRON with raw-ratio features (4.6 and tain subtrees of the hierarchy are artificially more
4.8 times higher respectively). Second, among populated than other subtrees. For the left argu-
the baselines, I NST P ERCENT is the most accu- ment of the annotation has-profession, 0.05% of
rate, with the computed concepts identifying the New York Politicians are matched but 70% of
gold concept strictly at rank 22 on average (for Bushrangers are matched. Such imbalances may
an MRR score 0.045), and loosely at an aver- be inherent to how annotations are added to Free-
age of 4 steps away from the gold concept (for base: different human contributors may add new
a DRR score of 0.224). Third, the accuracy of annotations to particular portions of Freebase, but
the learning algorithms varies with how the pair- miss other relevant portions.
wise feature values are combined. Overall, raw- The results are also affected by missing edges
ratio feature values perform the worst, and scaled- in the hierarchy. Of the more than 100K con-
ratio the best, with binary in-between. Fourth, cepts in the hierarchy, 3479 are roots of subhier-
the scores of the best experimental run, M AXENT archies that are mutually disconnected. Exam-
with scaled-ratio features, are 0.430 (MRR) and ples are People by Region, Shades of Red, and

Members of the Parliament of Northern Ireland, all of which should have parents in the hierarchy. If a few edges are missing in a particular region of the hierarchy, the method can recover, but if so many edges are missing that a gold concept has very few descendants, then propagation can be substantially affected. In the worst case, the gold concept becomes disconnected, and thus will be missing from the set of candidate concepts compiled during propagation. For example, for the annotation team-color(Sports Clubs, Colors), the only descendant concept of Colors in the hierarchy is Horse Coat Colors, meaning that the gold concept Colors is not reached during propagation from instances upwards in the hierarchy.

5 Related Work

Similar to the task of attaching a semantic annotation to the concept in a hierarchy that has the best level of generality is the task of finding selectional preferences for relations. Most relevant to this paper is work that seeks to find the appropriate concept in a hierarchy for an argument of a specific relation (Ribas, 1995; McCarthy, 1997; Li and Abe, 1998). Li and Abe (1998) address this problem by attempting to identify the best tree cut in a hierarchy for an argument of a given verb. They use the minimum description length principle to select a set of concepts from a hierarchy to represent the selectional preferences. This work makes several limiting assumptions, including that the hierarchy is a tree, and that every instance belongs to just one concept. Clark and Weir (2002) investigate the task of generalizing a single relation-concept pair. A relation is propagated up a hierarchy until a chi-square test determines the difference between the probability of the child and parent concepts to be significant, where the probabilities are relation-concept frequencies. This method has no direct translation to the task discussed here; it is unclear how to choose the correct concept if instances generalize to different concepts.

In other research on selectional preferences, Pantel et al. (2007), Kozareva and Hovy (2010) and Ritter et al. (2010) focus on generating admissible arguments for relations, and Erk (2007) and Bergsma et al. (2008) investigate classifying a relation-instance pair as plausible or not.

Important to this paper is the Wikipedia category network (Remy, 2002) and work on refining it. Ponzetto and Navigli (2009) disambiguate Wikipedia categories by using WordNet synsets and use this semantic information to construct a taxonomy. The resulting taxonomy is the conceptual hierarchy used in the evaluation.

Another related area of work is the discovery of relations between concepts. Nastase and Strube (2008) use Wikipedia category names and category structure to generate a set of relations between concepts. Yan et al. (2009) discover relations between Wikipedia concepts via deep linguistic information and Web frequency information. Mohamed et al. (2011) generate candidate relations by coclustering text contexts for every pair of concepts in a hierarchy. In a sense, this area of research is complementary to that discussed in this paper. These methods induce new relations, and the proposed method can be used to find appropriate levels of generalization for the arguments of any given relation.

6 Conclusions

This paper introduces a method to convert flat sets of instance-level annotations to hierarchically organized, concept-level annotations. The method determines the appropriate concept for a given semantic annotation in three stages. First, it propagates annotations upwards in the hierarchy, forming a set of candidate concepts. Second, it classifies each candidate concept as more or less appropriate than each other candidate concept within an annotation. Third, it ranks candidate concepts by the number of other concepts relative to which it is classified as more appropriate. Because the features are comparisons between concepts within a single semantic annotation, rather than considerations of individual concepts, the method is able to generalize across annotations, and can thus be applied to new, previously unseen annotations. Experiments demonstrate that, on average, the method is able to identify the concept of a given annotation's argument within one hierarchy edge of the gold concept.

The proposed method can take advantage of existing work on open-domain information extraction. The output of such work is usually instance-level annotations, although often at surface level (non-disambiguated arguments) rather than semantic level (disambiguated arguments). After argument disambiguation (e.g., (Dredze et al., 2010)), the annotations can be used as input to determining concept-level annotations. Thus, the method has the potential to generalize any existing database of instance-level annotations to concept-level annotations.

References Diana McCarthy. 1997. Word sense disambiguation
for acquisition of selectional preferences. In Pro-
Michele Banko, Michael Cafarella, Stephen Soder- ceedings of the ACL/EACL Workshop on Automatic
land, Matt Broadhead, and Oren Etzioni. 2007. Information Extraction and Building of Lexical Se-
Open information extraction from the Web. In Pro- mantic Resources for NLP Applications, pages 52
ceedings of the 20th International Joint Conference 60, Madrid, Spain.
on Artificial Intelligence (IJCAI-07), pages 2670 Tom Mitchell. 1997. Machine Learing. McGraw Hill.
2676, Hyderabad, India. Thahir Mohamed, Estevam Hruschka, and Tom
Cory Barr, Rosie Jones, and Moira Regelson. 2008. Mitchell. 2011. Discovering relations between
The linguistic structure of English Web-search noun categories. In Proceedings of the 2011 Con-
queries. In Proceedings of the 2008 Conference ference on Empirical Methods in Natural Language
on Empirical Methods in Natural Language Pro- Processing (EMNLP-11), pages 14471455, Edin-
cessing (EMNLP-08), pages 10211030, Honolulu, burgh, United Kingdom.
Hawaii. Vivi Nastase and Michael Strube. 2008. Decoding
Shane Bergsma, Dekang Lin, and Randy Goebel. Wikipedia categories for knowledge acquisition. In
2008. Discriminative learning of selectional pref- Proceedings of the 23rd National Conference on
erence from unlabeled text. In Proceedings of the Artificial Intelligence (AAAI-08), pages 12191224,
2008 Conference on Empirical Methods in Natural Chicago, Illinois.
Language Processing (EMNLP-08), pages 5968, M. Pasca and E. Alfonseca. 2009. Web-derived
Honolulu, Hawaii. resources for Web Information Retrieval: From
Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim conceptual hierarchies to attribute hierarchies. In
Sturge, and Jamie Taylor. 2008. Freebase: A Proceedings of the 32nd International Conference
collaboratively created graph database for struc- on Research and Development in Information Re-
turing human knowledge. In Proceedings of the trieval (SIGIR-09), pages 596603, Boston, Mas-
2008 International Conference on Management of sachusetts.
Data (SIGMOD-08), pages 12471250, Vancouver, Patrick Pantel, Rahul Bhagat, Timothy Chklovski, and
Canada. Eduard Hovy. 2007. ISP: Learning inferential se-
Stephen Clark and David Weir. 2002. Class-based lectional preferences. In Proceedings of the Annual
probability estimation using a semantic hierarchy. Meeting of the North American Chapter of the Asso-
Computational Linguistics, 28(2):187206. ciation for Computational Linguistics (NAACL-07),
pages 564571, Rochester, New York.
Mark Dredze, Paul McNamee, Delip Rao, Adam Ger-
Simone Paolo Ponzetto and Roberto Navigli. 2009.
ber, and Tim Finin. 2010. Entity disambiguation
Large-scale taxonomy mapping for restructuring
for knowledge base population. In Proceedings
and integrating Wikipedia. In Proceedings of
of the 23rd International Conference on Compu-
the 21st International Joint Conference on Ar-
tational Linguistics (COLING-10), pages 277285,
tifical Intelligence (IJCAI-09), pages 20832088,
Beijing, China.
Barcelona, Spain.
Katrin Erk. 2007. A simple, similarity-based model Melanie Remy. 2002. Wikipedia: The free encyclope-
for selectional preferences. In Proceedings of the dia. Online Information Review, 26(6):434.
45th Annual Meeting of the Association for Com-
Francesc Ribas. 1995. On learning more appropriate
putational Linguistics (ACL-07), pages 216223,
selectional restrictions. In Proceedings of the 7th
Prague, Czech Republic.
Conference of the European Chapter of the Asso-
Zornitsa Kozareva and Eduard Hovy. 2010. Learning ciation for Computational Linguistics (EACL-97),
arguments and supertypes of semantic relations us- pages 112118, Madrid, Spain.
ing recursive patterns. In Proceedings of the 48th Alan Ritter, Mausam, and Oren Etzioni. 2010. A la-
Annual Meeting of the Association for Computa- tent dirichlet allocation method for selectional pref-
tional Linguistics (ACL-10), pages 14821491, Up- erences. In Proceedings of the 48th Annual Meet-
psala, Sweden. ing of the Association for Computational Linguis-
Hang Li and Naoki Abe. 1998. Generalizing case tics (ACL-10), pages 424434, Uppsala, Sweden.
frames using a thesaurus and the mdl principle. In Claude Shannon. 1948. A mathematical theory of
Proceedings of the ECAI-2000 Workshop on Ontol- communication. Bell System Technical Journal,
ogy Learning, pages 217244, Berlin, Germany. 27:379423,623656.
Xiao Li. 2010. Understanding the semantic struc- Ellen Voorhees and Dawn Tice. 2000. Building a
ture of noun phrase queries. In Proceedings of the question-answering test collection. In Proceedings
48th Annual Meeting of the Association for Com- of the 23rd International Conference on Research
putational Linguistics (ACL-10), pages 13371345, and Development in Information Retrieval (SIGIR-
Uppsala, Sweden. 00), pages 200207, Athens, Greece.

Fei Wu and Daniel S. Weld. 2010. Open information
extraction using wikipedia. In Proceedings of the
48th Annual Meeting of the Association for Compu-
tational Linguistics (ACL-10), pages 118127, Up-
psala, Sweden.
Yulan Yan, Naoaki Okazaki, Yutaka Matsuo, Zhenglu
Yang, and Mitsuru Ishizuka. 2009. Unsupervised
relation extraction by mining Wikipedia texts using
information from the Web. In Proceedings of the
Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on
Natural Language Processing of the AFNLP (ACL-
IJCNLP-09), pages 10211029, Suntec, Singapore.

Joint Satisfaction of Syntactic and Pragmatic Constraints
Improves Incremental Spoken Language Understanding
Andreas Peldszus (University of Potsdam, Department for Linguistics) peldszus@uni-potsdam.de
Okko Buß (University of Potsdam, Department for Linguistics) okko@ling.uni-potsdam.de
Timo Baumann (University of Hamburg, Department for Informatics) baumann@informatik.uni-hamburg.de
David Schlangen (University of Bielefeld, Department for Linguistics) david.schlangen@uni-bielefeld.de

Abstract

We present a model of semantic processing of spoken language that (a) is robust against ill-formed input, such as can be expected from automatic speech recognisers, (b) respects both syntactic and pragmatic constraints in the computation of most likely interpretations, (c) uses a principled, expressive semantic representation formalism (RMRS) with a well-defined model theory, and (d) works continuously (producing meaning representations on a word-by-word basis, rather than only for full utterances) and incrementally (computing only the additional contribution by the new word, rather than re-computing for the whole utterance-so-far).

We show that the joint satisfaction of syntactic and pragmatic constraints improves the performance of the NLU component (around 10 % absolute, over a syntax-only baseline).

1 Introduction

Incremental processing for spoken dialogue systems (i.e., the processing of user input even while it still may be extended) has received renewed attention recently (Aist et al., 2007; Baumann et al., 2009; Buß and Schlangen, 2010; Skantze and Hjalmarsson, 2010; DeVault et al., 2011; Purver et al., 2011). Most of the practical work, however, has so far focussed on realising the potential for generating more responsive system behaviour through making available processing results earlier (e.g. (Skantze and Schlangen, 2009)), but has otherwise followed a typical pipeline architecture where processing results are passed only in one direction towards the next module.

In this paper, we investigate whether the other potential advantage of incremental processing, namely providing higher-level feedback to lower-level modules in order to improve subsequent processing of the lower-level module, can be realised as well. Specifically, we experimented with giving a syntactic parser feedback about whether semantic readings of nominal phrases it is in the process of constructing have a denotation in the given context or not. Based on the assumption that speakers do plan their referring expressions so that they can successfully refer, we use this information to re-rank derivations; this in turn has an influence on how the derivations are expanded, given continued input. As we show in our experiments, for a corpus of realistic dialogue utterances collected in a Wizard-of-Oz setting, this strategy led to an absolute improvement in computing the intended denotation of around 10 % over a baseline (even more using a more permissive metric), both for manually transcribed test data as well as for the output of automatic speech recognition.

The remainder of this paper is structured as follows: We discuss related work in the next section, and then describe in general terms our model and its components. In Section 4 we then describe the data resources we used for the experiments and the actual implementation of the model, the baselines for comparison, and the results of our experiments. We close with a discussion and an outlook on future work.

2 Related Work

The idea of using real-world reference to inform syntactic structure building has been previously explored by a number of authors. Stoness et al. (2004, 2005) describe a proof-of-concept imple-

mentation of a continuous understanding mod- The components of our model are described in
ule that uses reference information in guiding a the following sections: first the parser which com-
bottom-up chart-parser, which is evaluated on a putes the syntactic probability in an incremental,
single dialogue transcript. In contrast, our model top-down manner; the semantic construction al-
uses a probabilistic top-down parser with beam gorithm which associates (underspecified) logi-
search (following Roark (2001)) and is evalu- cal forms to derivations; the reference resolution
ated on a large number of real-world utterances component that computes the pragmatic plausi-
as processed by an automatic speech recogniser. bility; and the combination that incorporates the
Similarly, DeVault and Stone (2003) describe a feedback from this pragmatic signal.
system that implements interaction between a
parser and higher-level modules (in this case, even 3.2 Parser
more principled, trying to prove presuppositions), Roark (2001) introduces a strategy for incremen-
which however is also only tested on a small, con- tal probabilistic top-down parsing and shows that
structed data-set. it can compete with high-coverage bottom-up
Schuler (2003) and Schuler et al. (2009) present parsers. One of the reasons he gives for choosing
a model where information about reference is a top-down approach is that it enables fully left-
used directly within the speech recogniser, and connected derivations, where at every process-
hence informs not only syntactic processing but ing step new increments directly find their place
also word recognition. To this end, the processing in the existing structure. This monotonically en-
is folded into the decoding step of the ASR, and riched structure can then serve as a context for in-
is realised as a hierarchical HMM. While techni- cremental language understanding, as the author
cally interesting, this approach is by design non- claims, although this part is not further developed
modular and restricted in its syntactic expressiv- by Roark (2001). He discusses a battery of dif-
ity. ferent techniques for refining his results, mostly
The work presented here also has connections based on grammar transformations and on con-
to work in psycholinguistics. Pado et al. (2009) ditioning functions that manipulate a derivation
present a model that combines syntactic and se- probability on the basis of local linguistic and lex-
mantic models into one plausibility judgement ical information.
that is computed incrementally. However, that We implemented a basic version of his parser
work is evaluated for its ability to predict reading without considering additional conditioning or
time data and not for its accuracy in computing lexicalizations. However, we applied left-facto-
meaning. rization to parts of the grammar to delay cer-
tain structural decisions as long as possible. The
3 The Model search-space is reduced by using beam search. To
match the next token, the parser tries to expand
3.1 Overview
the existing derivations. These derivations are
Described abstractly, the model computes the stored in a priorized queue, which means that the
probability of a syntactic derivation (and its ac- most probable derivation will always be served
companying logical form) as a combination of a first. Derivations resulting from rule expansions
syntactic probability (as in a typical PCFG) and are kept in the current queue, derivations result-
a semantic or pragmatic plausibility.1 The prag- ing from a successful lexical match are pushed in
matic plausibility here comes from the presuppo- a new queue. The parser proceeds with the next
sition that the speaker intended her utterance to most probable derivation until the current queue
successfully refer, i. e. to have a denotation in the is empty or until a threshhold is reached at which
current situation (a unique one, in the case of def- remaining analyses are pruned. This threshhold
inite reference). Hence, readings that do have a is determined dynamically: If the probability of
denotation are preferred over those that do not. the current derivation is lower than the product of
1
the best derivations probability on the new queue,
Note that, as described below, in the actual implemen-
tation the weights given to particular derivations are not real
the number of derivations in the new queue, and a
probabilities anymore, as derivations fall out of the beam and base beam factor (an initial parameter for the size
normalisation is not performed after re-weighting. of the search beam), then all further old deriva-

[Figure 1 appears here in the original layout: a network of incremental units for the words "nimm den winkel in ...", linking TextualWordIUs, TagIUs (vvimp, det, nn, appr), CandidateAnalysisIUs (each with its last derivation actions LD, probability P and remaining stack S) and FormulaIUs containing the corresponding RMRS fragments.]
Figure 1: An example network of incremental units, including the levels of words, POS-tags, syntactic derivations
and logical forms. See section 3 for a more detailed description.
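To make the parsing procedure of Section 3.2 more concrete, here is a schematic sketch of one incremental step with beam search and the dynamic pruning threshold described above. It is not the InproTK implementation; expand, lex_match and the queue layout are simplifying assumptions, and the grammar (left-factorised, with penalised robust operations) is assumed to keep expansion finite.

```python
import heapq

def parse_step(old_derivations, token, expand, lex_match, base_beam=0.01):
    """One incremental step of a probabilistic top-down parser with beam search.

    old_derivations: list of (probability, derivation) pairs from the previous step
    expand(d):       rule expansions of derivation d -> [(prob, d_new), ...]
    lex_match(d, w): derivations that consume token w from d -> [(prob, d_new), ...]
                     (including penalised robust insert/delete/repair variants)

    Derivations are served from a priority queue, most probable first; successful
    lexical matches go to the new queue, rule expansions stay in the old queue.
    """
    old_q = [(-p, i, d) for i, (p, d) in enumerate(old_derivations)]
    heapq.heapify(old_q)
    new_q = []                      # derivations that have consumed `token`
    counter = len(old_q)            # tie-breaker so derivations never get compared

    while old_q:
        neg_p, _, deriv = heapq.heappop(old_q)
        p = -neg_p
        if new_q:
            best_new = max(np for np, _ in new_q)
            # dynamic threshold: prune all remaining, less probable derivations
            if p < best_new * len(new_q) * base_beam:
                break
        for p_lex, d_lex in lex_match(deriv, token):
            new_q.append((p * p_lex, d_lex))
        for p_exp, d_exp in expand(deriv):
            counter += 1
            heapq.heappush(old_q, (-(p * p_exp), counter, d_exp))
    return new_q
```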

tions are pruned. Due to probabilistic weighing derivations (CandidateAnalysisIUs) are repre-
and the left factorization of the rules, left recur- sented by three features: a list of the last parser ac-
sion poses no direct threat in such an approach. tions of the derivation (LD), with rule expansions
Additionally, we implemented three robust lex- or (robust) lexical matches; the derivation proba-
ical operations: insertions consume the current bility (P); and the remaining stack (S), where S*
token without matching it to the top stack item; is the grammars start symbol and S! an explicit
deletions can consume a requested but actu- end-of-input marker. (To keep the Figure small,
ally non-existent token; repairs adjust unknown we artificially reduced the beam size and cut off
tokens to the requested token. These robust op- alternatives paths, shown in grey.)
erations have strong penalties on the probability
to make sure they will survive in the derivation 3.3 Semantic Construction Using RMRS
only in critical situations. Additionally, only a As a novel feature, we use for the representation
single one of them is allowed to occur between of meaning increments (that is, the contributions
the recognition of two adjacent input tokens. of new words and syntactic constructions) as well
Figure 1 illustrates this process for the first few as for the resulting logical forms the formalism
words of the example sentence nimm den winkel Robust Minimal Recursion Semantics (Copestake,
in der dritten reihe (take the bracket in the third 2006). This is a representation formalism that was
row), using the incremental unit (IU) model to originally constructed for semantic underspecifi-
represent increments and how they are linked; see cation (of scope and other phenomena) and then
(Schlangen and Skantze, 2009).2 Here, syntactic adapted to serve the purposes of semantics repre-
2
Very briefly: rounded boxes in the Figures represent the same predecessor can be regarded as alternatives. Solid
IUs, and dashed arrows link an IU to its predecessor on the arrows indicate which information from a previous level an
same level, where the levels correspond to processing stages. IU is grounded in (based on); here, every semantic IU is
The Figure shows the levels of input words, POS-tags, syn- grounded in a syntactic IU, every syntactic IU in a POS-tag-
tactic derivations and logical forms. Multiple IUs sharing IU, and so on.

sentations in heterogeneous situations where in- semantic combination in synchronisation with the
formation from deep and shallow parsers must be syntactic expansion of the tree, i.e. in a top-down
combined. In RMRS, meaning representations of left-to-right fashion. This way, no underspecifica-
a first order logic are underspecified in two ways: tion of projected nodes and no re-interpretation of
First, the scope relationships can be underspeci- already existing parts of the tree is required. This,
fied by splitting the formula into a list of elemen- however, requires adjustments to the slot structure
tary predications (EP) which receive a label ` and of RMRS. Left-recursive rules can introduce mul-
are explicitly related by stating scope constraints tiple slots of the same sort before they are filled,
to hold between them (e.g. qeq-constraints). This which is not allowed in the classic (R)MRS se-
way, all scope readings can be compactly repre- mantic algebra, where only one named slot of
sented. Second, RMRS allows underspecification each sort can be open at a time. We thus organize
of the predicate-argument-structure of EPs. Ar- the slots as a stack of unnamed slots, where mul-
guments are bound to a predicate by anchor vari- tiple slots of the same sort can be stored, but only
ables a, expressed in the form of an argument re- the one on top can be accessed. We then define
lation ARGREL(a,x). This way, predicates can a basic combination operation equivalent to for-
be introduced without fixed arity and arguments ward function composition (as in standard lambda
can be introduced without knowing which predi- calculus, or in CCG (Steedman, 2000)) and com-
cates they are arguments of. We will make use of bine substructures in a principled way across mul-
this second form of underspecification and enrich tiple syntactic rules without the need to represent
lexical predicates with arguments incrementally. slot names.
Combining two RMRS structures involves at Each lexical items receives a generic represen-
least joining their list of EPs and ARGRELs and tation derived from its lemma and the basic se-
of scope constraints. Additionally, equations be- mantic type (individual, event, or underspecified
tween the variables can connect two structures, denotations), determined by its POS tag. This
which is an essential requirement for semantic makes the grammar independent of knowledge
construction. A semantic algebra for the combi- about what later (semantic) components will ac-
nation of RMRSs in a non-lexicalist setting is de- tually be able to process (understand).3 Parallel
fined in (Copestake, 2007). Unsaturated semantic to the production of syntactic derivations, as the
increments have open slots that need to be filled tree is expanded top-down left-to-right, seman-
by what is called the hook of another structure. tic macros are activated for each syntactic rule,
Hook and slot are triples [`:a:x] consisting of a composing the contribution of the new increment.
label, an anchor and an index variable. Every vari- This allows for a monotonic semantics construc-
able of the hook is equated with the corresponding tion process that proceeds in lockstep with the
one in the slot. This way the semantic representa- syntactic analysis.
tion can grow monotonically at each combinatory Figure 1 (in the FormulaIU box) illustrates
step by simply adding predicates, constraints and the results of this process for our example deriva-
equations. tion. Again, alternatives paths have been cut to
Our approach differs from (Copestake, 2007) keep the size of the illustration small. Notice that,
only in the organisation of the slots: In an incre- apart from the end-of-input marker, the stack of
mental setting, a proper semantic representation semantic slots (in curly brackets) is always syn-
is desired for every single state of growth of the chronized with the parsers stack.
syntactic tree. Typically, RMRS composition as-
3.4 Computing Noun Phrase Denotations
sumes that the order of semantic combination is
parallel to a bottom-up traversal of the syntactic Formally, the task of this module is, given a model
tree. Yet, this would require for every incremental M of the current context, to compute the set of
step first to calculate an adequate underspecified all variable assignments such that M satisfies :
semantic representation for the projected nodes G = {g | M |=g }. If |G| > 1, we say that
on the lower right border of the tree and then to refers ambiguously; if |G| = 1, it refers uniquely;
proceed with the combination not only of the new 3
This feature is not used in the work presented here, but
semantic increments but of the complete tree. For it could be used for enabling the system to learn the meaning
our purposes, it is more elegant to proceed with of unknown words.

and if |G| = 0, it fails to refer. This process does
not work directly on RMRS formulae, but on ex-
tracted and unscoped first-order representations of
their nominal content.

3.5 Parse Pruning Using Reference


Information
After all possible syntactic hypotheses at an in-
crement have been derived by the parser and
the corresponding semantic representations have
been constructed, reference resolution informa-
tion can be used to re-rank the derivations. If
pragmatic feedback is enabled, the probability of
every reprentation that does not resolve in the cur-
rent context is degraded by a constant factor (we
Figure 2: The game board used in the study, as pre-
used 0.001 in our experiments described below, sented to the player: (a) the current state of the game
determined by experimentation). The degradation on the left, (b) the goal state to be reached on the right.
thus changes the derivation order in the parsing
queue for the next input item and increases the
chances of degraded derivations to be pruned in our study does not focus on these, we have dis-
the following parsing step. regarded another 661 utterances in which pieces
are referred to by pronouns, leaving us with 1026
4 Experiments and Results utterances for evaluation. These utterances con-
tained on average 5.2 words (median 5 words;
4.1 Data std dev 2 words).
We use data from the Pentomino puzzle piece do- In order to test the robustness of our method,
main (which has been used before for example we generated speech recognition output using an
by (Fernandez and Schlangen, 2007; Schlangen et acoustic model trained for spontaneous (German)
al., 2009)), collected in a Wizard-of-Oz study. In speech. We used leave-one-out language model
this specific setting, users gave instructions to the training, i. e. we trained a language model for ev-
system (the wizard) in order to manipulate (select, ery utterance to be recognized which was based
rotate, mirror, delete) puzzle pieces on an upper on all the other utterances in the corpus. Unfor-
board and to put them onto a lower board, reach- tunately, the audio recordings of the first record-
ing a pre-specified goal state. Figure 2 shows an ing day were too quiet for successful recognition
example configuration. Each participant took part (with a deletion rate of 14 %). We thus decided
in several rounds in which the distinguishing char- to limit the analysis for speech recognition out-
acteristics for puzzle pieces (color, shape, pro- put to the remaining 633 utterances from the other
posed name, position on the board) varied widely. recording days. On this part of the corpus word
In total, 20 participants played 284 games. error rate (WER) was at 18 %.
We extracted the semantics of an utterance The subset of the full corpus that we used for
from the wizards response action. In some cases, evaluation, with the utterances selected according
such a mapping was not possible to do (e. g. be- to the criteria described above, nevertheless still
cause the wizard did not perform a next action, only consists of natural, spontaneous utterances
mimicking a non-understanding by the system), (with all the syntactic complexity that brings) that
or potentially unreliable (if the wizard performed are representative for interactions in this type of
several actions at or around the end of the utter- domain.
ance). We discarded utterances without a clear se-
mantics alignment, leaving 1687 semantically an- 4.2 Grammar and Resolution Model
notated user utterances. The wizard of course was The grammar used in our experiments was hand-
able to use her model of the previous discourse for constructed, inspired by a cursory inspection of
resolving references, including anaphoric ones; as the corpus and aiming to reach good coverage

Words Predicates Status
nimm nimm(e) -1
nimm den nimm(e,x) def(x) 0
nimm den Winkel nimm(e,x) def(x) winkel(x) 0
nimm den Winkel in nimm(e,x) def(x) winkel(x) in(x,y) 0
nimm den Winkel in der nimm(e,x) def(x) winkel(x) in(x,y) def(y) 0
nimm den Winkel in der dritten nimm(e,x) def(x) winkel(x) in(x,y) def(y) third(y) 1
nimm den Winkel in der dritten Reihe nimm(e,x) def(x) winkel(x) in(x,y) def(y) third(y) row(y) 1

Table 1: Example of logical forms (flattened into first-order base-language formulae) and reference resolution
results for incrementally parsing and resolving nimm den winkel in der dritten reihe
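The resolution status in the right column of Table 1 can be illustrated with the following simplified sketch, restricted to the unary predications actually used for resolution (Section 4.2); the function name, the dictionary-based inputs and the satisfies predicate are assumptions for illustration, not the system's interface.

```python
def resolution_status(predicates, domain_objects, satisfies):
    """Resolve the nominal content of the current logical form against the context.

    predicates:     iterable of (predicate_name, variable) pairs extracted from
                    the formula, e.g. [("winkel", "x"), ("third", "y")]
    domain_objects: the objects of the current game-board model
    satisfies(obj, pred): True if the object satisfies the unary predicate

    Returns -1 if there is nothing to resolve or some variable has no referent,
    1 if every variable resolves uniquely, and 0 otherwise (ambiguous
    reference), mirroring the status column of Table 1.
    """
    by_var = {}
    for pred, var in predicates:
        candidates = {o for o in domain_objects if satisfies(o, pred)}
        by_var[var] = by_var.get(var, set(domain_objects)) & candidates
    if not by_var:
        return -1                      # no nominal content yet
    if any(len(objs) == 0 for objs in by_var.values()):
        return -1
    if all(len(objs) == 1 for objs in by_var.values()):
        return 1
    return 0
```

In the full model, this status is what Section 3.5 uses to degrade the weight of non-resolving derivations (by the constant factor 0.001) before the next parsing step.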

for a core fragment. We created 30 rules, whose weights were also set by hand (as discussed below, this is an obvious area for future improvement), sparingly and according to standard intuitions. When parsing, the first step is the assignment of a POS tag to each word. This is done by a simple lookup tagger that stores the most frequent tag for each word (as determined on a small subset of our corpus).4

The situation model used in reference resolution is automatically derived from the internal representation of the current game state. (This was recorded in an XML format for each utterance in our corpus.) Variable assignments were then derived from the relevant nominal predicate structures,5 consisting of extracted simple predications, e.g. red(x) and cross(x) for the NP in a phrase such as "take the red cross". For each unique predicate argument X in these EP structures (such as x above), the set of domain objects that satisfied all predicates of which X was an argument was determined. For example, for the phrase above, X mapped to all elements that were red and crosses.

Finally, the size of these sets was determined: no elements, one element, or multiple elements, as described above. Emptiness of at least one set denoted that no resolution was possible (for instance, if no red crosses were available, x's set was empty), uniqueness of all sets denoted that an exact resolution was possible, while multiple elements in at least some sets denoted ambiguity. This status was then leveraged for parse pruning, as per Section 3.5.

A more complex example using the scene depicted in Figure 2 and the sentence "nimm den Winkel in der dritten Reihe" (take the bracket in the third row) is shown in Table 1. The first column shows the incremental word hypothesis string, the second the set of predicates derived from the most recent RMRS representation, and the third the resolution status (-1 for no resolution, 0 for some resolution and 1 for a unique resolution).

4 A more sophisticated approach has recently been proposed by Beuck et al. (2011); this could be used in our setup.
5 The domain model did not allow making a plausibility judgement based on verbal resolution.

4.3 Baselines and Evaluation Metric

4.3.1 Variants / Baselines

To be able to accurately quantify and assess the effect of our reference-feedback strategy, we implemented different variants / baselines. These all differ in how, at each step, the reading is determined that is evaluated against the gold standard, and are described in the following:

In the Just Syntax (JS) variant, we simply take the single-best derivation, as determined by syntax alone, and evaluate this.

The External Filtering (EF) variant adds information from reference resolution, but keeps it separate from the parsing process. Here, we look at the 5 highest-ranking derivations (as determined by syntax alone), and go through them beginning at the highest ranked, picking the first derivation where reference resolution can be performed uniquely; this reading is then put up for evaluation. If there is no such reading, the highest-ranking one will be put forward for evaluation (as in JS).

Syntax/Pragmatics Interaction (SPI) is the variant described in the previous section. Here, all active derivations are sent to the reference resolution module, and are re-weighted as described above; after this has been done, the highest-ranking reading is evaluated.

Finally, the Combined Interaction and Filtering (CIF) variant combines the previous two strategies, by using reference-feedback in computing the ranking for the derivations, and then

519
again using reference information to identify the most promising reading within the set of the 5 highest-ranking ones.

4.3.2 Metric

When a reading has been identified according to one of these methods, a score s is computed as follows: s = 1 if the correct referent (according to the gold standard) is computed as the denotation for this reading; s = 0 if no unique referent can be computed, but the correct one is part of the set of possible referents; s = -1 if no referent can be computed at all, or the correct one is not part of the set of those that are computed.

As this is done incrementally for each word (adding the new word to the parser chart), for an utterance of length m we get a sequence of m such numbers. (In our experiments we treat the end-of-utterance signal as a pseudo-word, since knowing that an utterance has concluded allows the parser to close off derivations and remove those that are still requiring elements. Hence, we in fact have sequences of m+1 numbers.) A combined score for the whole utterance is computed according to the following formula:

s_u = \sum_{n=1}^{m} (s_n \cdot n/m)

(where s_n is the score at position n). The factor n/m causes later decisions to count more towards the final score, reflecting the idea that it is more to be expected (and less harmful) to be wrong early on in the utterance, whereas the longer the utterance goes on, the more pressing it becomes to get a correct result (and the more damaging if mistakes are made).6

6 This metric compresses into a single number some of the concerns of the incremental metrics developed in (Baumann et al., 2011), which can express more fine-grainedly the temporal development of hypotheses.

Note that this score is not normalised by utterance length m, the maximally achievable score being (m + 1)/2. This has the additional effect of increasing the weight of long utterances when averaging over the score of all utterances; we see this as desirable, as the analysis task becomes harder the longer the utterance is.

We use success in resolving reference to evaluate the performance of our parsing and semantic construction component, where more traditionally, metrics like parse bracketing accuracy might be used. But as we are building this module for an interactive system, ultimately, accuracy in recovering meaning is what we are interested in, and so we see this not just as a proxy, but actually as a more valuable metric. Moreover, this metric can be applied at each incremental step, which is not clear how to do with more traditional metrics.

4.4 Experiments

Our parser, semantic construction and reference resolution modules are implemented within the InproTK toolkit for incremental spoken dialogue systems development (Schlangen et al., 2010). In this toolkit, incremental hypotheses are modified as more information becomes available over time. Our modules support all such modifications (i.e. they also allow reverting their states and output if word input is revoked).

As explained in Section 4.1, we used offline recognition results in our evaluation. However, the results would be identical if we were to use the incremental speech recognition output of InproTK directly.

The system performs several times faster than real-time on a standard workstation computer. We thus consider it ready to improve practical end-to-end incremental systems which perform within-turn actions such as those outlined in (Buß and Schlangen, 2010).

The parser was run with a base-beam factor of 0.01; this parameter may need to be adjusted if a larger grammar is used.

4.5 Results

Table 2 shows an overview of the experiment results. The table lists, separately for the manual transcriptions and the ASR transcripts, first the number of times that the final reading did not resolve at all, or to a wrong entity; did not uniquely resolve, but included the correct entity in its denotation; or did uniquely resolve to the correct entity (-1, 0, and 1, respectively). The next lines show "strict accuracy" (proportion of 1 among all results) at the end of utterance, and "relaxed accuracy" (which allows ambiguity, i.e., is the set {0, 1}). incr.scr is the incremental score as described above, which includes in the evaluation the development of references and not just the final state. (And in that sense, it is the most appropriate metric here, as it captures the incremental behaviour.) This score is shown both as absolute

520
                   JS       EF       SPI      CIF
transcript
  -1               563      518      364      363
   0               197      198      267      268
   1               264      308      392      392
  str.acc.         25.7 %   30.0 %   38.2 %   38.2 %
  rel.acc.         44.9 %   49.3 %   64.2 %   64.3 %
  incr.scr         -1568    -1248    -536     -504
  avg.incr.scr     -1.52    -1.22    -0.52    -0.49
recognition
  -1               362      348      254      255
   0               122      121      173      173
   1               143      158      196      195
  str.acc.         22.6 %   25.0 %   31.0 %   30.8 %
  rel.acc.         41.2 %   44.1 %   58.3 %   58.1 %
  incr.scr         -1906    -1730    -1105    -1076
  avg.incr.scr     -1.86    -1.69    -1.01    -1.05

Table 2: Results of the Experiments. See text for explanation of metrics.
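As a reading aid for the incr.scr rows above, here is a small illustrative sketch (not the authors' code) of the utterance score s_u = \sum_{n=1}^{m} (s_n \cdot n/m) from Section 4.3.2; it assumes the list of per-increment scores already includes the end-of-utterance pseudo-word.

def utterance_score(scores):
    # scores: one value in {-1, 0, 1} per increment; later positions weigh more
    m = len(scores)
    return sum(s * (n / m) for n, s in enumerate(scores, start=1))

# an utterance that starts out unresolved and ends with the correct referent
print(utterance_score([-1, -1, 0, 1, 1]))   # later, correct decisions dominate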

number as well as averaged for each utterance.

As these results show, the strategy of providing the parser with feedback about the real-world utility of constructed phrases (in the form of reference decisions) improves the parser, in the sense that it helps the parser to successfully retrieve the intended meaning more often compared to an approach that only uses syntactic information (JS) or that uses pragmatic information only outside of the main programme: 38.2 % strict or 64.2 % relaxed for SPI over 25.7 % / 44.9 % for JS, an absolute improvement of 12.5 % for strict or even more, 19.3 %, for the relaxed metric; the incremental metric shows that this advantage holds not only at the final word, but also consistently within the utterance, the average incremental score for an utterance being -0.49 for SPI and -1.52 for JS. The improvement is somewhat smaller against the variant that uses some reference information, but does not integrate this into the parsing process (EF), but it is still consistently present. Adding such n-best-list processing to the output of the parser+reference combination (as variant CIF does) finally does not further improve the performance noticeably. When processing partially defective material (the output of the speech recogniser), the difference between the variants is maintained, showing a clear advantage of SPI, although performance of all variants is degraded somewhat.

Clearly, accuracy is rather low for the baseline condition (JS); this is due to the large number of non-standard constructions in our spontaneous material (e.g., utterances like "löschen, unten" (delete, bottom)), which we did not try to cover with syntactic rules and which may not even contain NPs. The SPI condition can promote derivations resulting from robust rules (here, deletion) which then can refer. In general, though, state-of-the-art grammar engineering may narrow the gap between JS and SPI (this remains to be tested), but we see it as an advantage of our approach that it can improve over the (easy-to-engineer) set of core grammar rules.

5 Conclusions

We have described a model of semantic processing of natural, spontaneous speech that strives to jointly satisfy syntactic and pragmatic constraints (the latter being approximated by the assumption that referring expressions are intended to indeed successfully refer in the given context). The model is robust, accepting also input of the kind that can be expected from automatic speech recognisers, and incremental, that is, it can be fed input on a word-by-word basis, computing at each increment only exactly the contribution of the new word. Lastly, as another novel contribution, the model makes use of a principled formalism for semantic representation, RMRS (Copestake, 2006).

While the results show that our approach of combining syntactic and pragmatic information can work in a real-world setting on realistic data (previous work in this direction has so far

521
only been at the proof-of-concept stage), there is much room for improvement. First, we are now exploring ways of bootstrapping a grammar and derivation weights from hand-corrected parses. Secondly, we are looking at making the variable assignment / model checking function probabilistic, assigning probabilities (degree of strength of belief) to candidate resolutions (as, for example, the model of Schlangen et al. (2009) does). Another next step, which will be very easy to take given the modular nature of the implementation framework that we have used, will be to integrate this component into an interactive end-to-end system, and testing other domains in the process.

Acknowledgements  We thank the anonymous reviewers for their helpful comments. The work reported here was supported by a DFG grant in the Emmy Noether programme to the last author and a stipend from DFG-CRC (SFB) 632 to the first author.

References

Gregory Aist, James Allen, Ellen Campana, Carlos Gomez Gallo, Scott Stoness, Mary Swift, and Michael K. Tanenhaus. 2007. Incremental understanding in human-computer dialogue and experimental evidence for advantages over nonincremental methods. In Proceedings of Decalog 2007, the 11th International Workshop on the Semantics and Pragmatics of Dialogue, Trento, Italy.

Timo Baumann, Michaela Atterer, and David Schlangen. 2009. Assessing and improving the performance of speech recognition for incremental systems. In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT) 2009 Conference, Boulder, Colorado, USA, May.

Timo Baumann, Okko Buß, and David Schlangen. 2011. Evaluation and optimization of incremental processors. Dialogue and Discourse, 2(1):113-141.

Niels Beuck, Arne Köhn, and Wolfgang Menzel. 2011. Decision strategies for incremental POS tagging. In Proceedings of the 18th Nordic Conference of Computational Linguistics, NODALIDA-2011, Riga, Latvia.

Okko Buß and David Schlangen. 2010. Modelling sub-utterance phenomena in spoken dialogue systems. In Proceedings of the 14th International Workshop on the Semantics and Pragmatics of Dialogue (Pozdial 2010), pages 33-41, Poznan, Poland, June.

Ann Copestake. 2006. Robust minimal recursion semantics. Technical report, Cambridge Computer Lab. Unpublished draft.

Ann Copestake. 2007. Semantic composition with (robust) minimal recursion semantics. In Proceedings of the Workshop on Deep Linguistic Processing, DeepLP '07, pages 73-80, Stroudsburg, PA, USA. Association for Computational Linguistics.

David DeVault and Matthew Stone. 2003. Domain inference in incremental interpretation. In Proceedings of ICOS 4: Workshop on Inference in Computational Semantics, Nancy, France, September. INRIA Lorraine.

David DeVault, Kenji Sagae, and David Traum. 2011. Incremental interpretation and prediction of utterance meaning for interactive dialogue. Dialogue and Discourse, 2(1):143-170.

Raquel Fernández and David Schlangen. 2007. Referring under restricted interactivity conditions. In Simon Keizer, Harry Bunt, and Tim Paek, editors, Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, pages 136-139, Antwerp, Belgium, September.

Ulrike Padó, Matthew W. Crocker, and Frank Keller. 2009. A probabilistic model of semantic plausibility in sentence processing. Cognitive Science, 33(5):794-838.

Matthew Purver, Arash Eshghi, and Julian Hough. 2011. Incremental semantic construction in a dialogue system. In J. Bos and S. Pulman, editors, Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 365-369, Oxford, UK, January.

Brian Roark. 2001. Robust Probabilistic Predictive Syntactic Processing: Motivations, Models, and Applications. Ph.D. thesis, Department of Cognitive and Linguistic Sciences, Brown University.

David Schlangen and Gabriel Skantze. 2009. A general, abstract model of incremental dialogue processing. In EACL '09: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 710-718. Association for Computational Linguistics, March.

David Schlangen, Timo Baumann, and Michaela Atterer. 2009. Incremental reference resolution: The task, metrics for evaluation, and a Bayesian filtering model that is sensitive to disfluencies. In Proceedings of SIGdial 2009, the 10th Annual SIGDIAL Meeting on Discourse and Dialogue, London, UK, September.

David Schlangen, Timo Baumann, Hendrik Buschmeier, Okko Buß, Stefan Kopp, Gabriel Skantze, and Ramin Yaghoubzadeh. 2010. Middleware for incremental processing in conversational agents. In Proceedings of SigDial 2010, Tokyo, Japan, September.

522
William Schuler, Stephen Wu, and Lane Schwartz. 2009. A framework for fast incremental interpretation during speech decoding. Computational Linguistics, 35(3).

William Schuler. 2003. Using model-theoretic semantic interpretation to guide statistical parsing and word recognition in a spoken language interface. In Proceedings of the 41st Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo, Japan. Association for Computational Linguistics.

Gabriel Skantze and Anna Hjalmarsson. 2010. Towards incremental speech generation in dialogue systems. In Proceedings of the SIGdial 2010 Conference, pages 1-8, Tokyo, Japan, September.

Gabriel Skantze and David Schlangen. 2009. Incremental dialogue processing in a micro-domain. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), pages 745-753, Athens, Greece, March.

Mark Steedman. 2000. The Syntactic Process. MIT Press, Cambridge, Massachusetts.

Scott C. Stoness, Joel Tetreault, and James Allen. 2004. Incremental parsing with reference interaction. In Proceedings of the Workshop on Incremental Parsing at the ACL 2004, pages 18-25, Barcelona, Spain, July.

Scott C. Stoness, James Allen, Greg Aist, and Mary Swift. 2005. Using real-world reference to improve spoken language understanding. In AAAI Workshop on Spoken Language Understanding, pages 38-45.

523
Learning How to Conjugate the Romanian Verb. Rules for Regular and
Partially Irregular Verbs
Liviu P. Dinu
Faculty of Mathematics and Computer Science, University of Bucharest
ldinu@fmi.unibuc.ro

Vlad Niculae
Faculty of Mathematics and Computer Science, University of Bucharest
vlad@vene.ro

Octavia-Maria Șulea
Faculty of Foreign Languages and Literatures and Faculty of Mathematics and Computer Science, University of Bucharest
mary.octavia@gmail.com

Abstract to give other conjugational classifications based


on the way the verb actually conjugates. Lom-
In this paper we extend our work described bard (1955), looking at a corpus of 667 verbs,
in (Dinu et al., 2011) by adding more con- combined the traditional 4 classes with the way in
jugational rules to the labelling system in- which the biggest two subgroups conjugate (one
troduced there, in an attempt to capture
using the suffix ez, the other esc) and ar-
the entire dataset of Romanian verbs ex-
tracted from (Barbu, 2007), and we em- rived at 6 classes. Ciompec (Ciompec et. al.,
ploy machine learning techniques to predict 1985 in Costanzo, 2011) proposed 10 conjuga-
a verbs correct label (which says what con- tional classes, while Felix (1964) proposed 12,
jugational pattern it follows) when only the both of them looking at the inflection of the verbs
infinitive form is given. and number of allomorphs of the stem. Romalo
(1968, p. 5-203) produced a list of 38 verb types,
which she eventually reduced to 10.
1 Introduction
For the purpose of machine translation, Moisil
Using only a restricted group of verbs, in (Dinu (1960) proposed 5 regrouped classes of verbs,
et al., 2011) we validated the hypothesis that pat- with numerous subgroups, and introduced the
terns can be identified in the conjugation of the method of letters with variable values, while Pa-
Romanian (partially irregular) verb and that these pastergiou et al. (2007) have recently developed
patterns can be learnt automatically so that, given a classification from a (second) language acquisi-
the infinitive of a verb, its correct conjugation tion point of view, dividing the 1st and 4th tradi-
for the indicative present tense can be produced. tional classes into 3 and respectively 5 subclasses,
In this paper, we extend our investigation to the each with a different conjugational pattern, and
whole dataset described in (Barbu, 2008) and at- offering rules for alternations in the stem.
tempt to capture, beside the general ending pat- Of the more extensive classifications, Barbu
terns during conjugation, as much of the phono- (2007) distinguished 41 conjugational classes for
logical alternations occuring in the stem of verbs all tenses and 30 for the indicative present alone,
(apophony) from the dataset as we can. covering a whole corpus of more that 7000 con-
Traditionally, Romanian has received a Latin- temporary Romanian verbs, a corpus which was
inspired classification of verbs into 4 (or some- also used in the present paper. However, her
times 5) conjugational classes based on the ending classes were developed on the basis of the suf-
of their infinitival form alone (Costanzo, 2011). fixes each verb receives during conjugation, and
However, this infinitive-based classification has the classification system did not take into account
proved itself inadequate due to its inability to ac- the alternations occuring in the stem of irregular
count for the behavior of partially irregular verbs and partially irregular verbs. The system of rules
(whose stems have a smaller number of allo- presented below took into account both the end-
morphs than the completely irregular) during their ings pattern and the type of stem alternation for
conjugation. each verb.
There have been, thus, numerous attempts In what follows we describe our method for la-
throughout the history of Romanian Linguistics beling the dataset and finding a model able to pre-

524
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 524-528,
Avignon, France, April 23-27 2012. ©2012 Association for Computational Linguistics
dict the labels.

2 Approach

The problem which we are aiming to solve is to determine how to conjugate a verb, given its infinitive form. The traditional infinitive-based classification taught in school does not take one all the way to solving this problem. Many conjugational patterns exist within each of these four classes.

2.1 Labeling the dataset

Following our own observations, the alternations identified in (Papastergiou et al., 2007) and the classes of suffix patterns given in (Barbu, 2007), we developed a number of conjugational rules which were narrowed down to the 30 most productive in relation to the dataset. Each of these 30 rules (or patterns) contains 6 regular expressions through which the rule models how a (different) type of Romanian verb conjugates in the indicative present. They each consist of 6 regular expressions because there are three persons (first, second, and third) times two numbers (singular and plural).

Rule 10, for example, models, as stated in the list that follows, how verbs of the type "a canta" (to sing) conjugate in the indicative present, by having the first regular expression model the first person singular form (eu) cant (in regular expression format: (.+)$), the second model the second person singular form (tu) canti ((.+)ti$), the third model the third person singular form (ei) canta ((.+)a$), and so forth. Thus, rule 10 catches the alternation tt for the 2nd person singular, while modelling a particular type of verb class with a particular set of suffixes. Note that the dot accepts any letter in the Romanian alphabet and that, for each of the six forms, the value of the capturing groups (those between brackets) remains constant, in this case "can". These groups correspond to all parts of the stem that remain unchanged and ensure that, given the infinitive and the regular expressions, one can work backwards and produce the correct conjugation.

For a clearer understanding of one such rule, Table 1 shows an example of how the verb "a tresalta" is modeled by rule 14.

Person        Regexp          Example
1st singular  (.+)a(.+)t$     tresalt
2nd singular  (.+)a(.+)ti$    tresalti
3rd singular  (.+)a(.+)ta$    tresalta
1st plural    (.+)a(.+)tam$   tresaltam
2nd plural    (.+)a(.+)tati$  tresaltati
3rd plural    (.+)a(.+)ta$    tresalta

Table 1: Rule 14 modelling "a tresalta"

Below, we list all the rules used, with the stem alternations they capture and an example of a verb that they model. Note that, when we say (no) alternation, we mean (no) alternation in the stem. So the difference between rules 1, 20, 22, and the like lies in the suffix that is added to the stem for each verb form. They may share some suffixes, but not all and/or not for the same person and number.

1. no alternation; a spera (to hope);

2. alternation: ae for the 2nd person singular; a numara (to count);

3. no alternation; a intra (to enter), stem ends in tr, pl, bl or fl, which determines the addition of u at the end of the 1st person singular form;

4. alternation: it lacks the tt for the 2nd person singular which otherwise normally occurs; a misca (to move), stem ends in sca;

5. no alternation; a taia (to cut), ends in ia and has a vowel before;

6. no alternation; a speria (to scare), ends in ia and has a consonant before;

7. no alternation; a dansa (to dance), conjugated with the suffix ez;

8. no alternation; a copia (to copy), conjugated with a modified ez due to the stem ending in ia;

9. alternation: cch(e) or ggh(e); a parca (to park), conjugated with ez, ending in ca or ga;

10. alternation: tt for the 2nd person singular; a canta (to sing);

11. alternation: ss, which replaces the usual tt for the 2nd person singular; a exista (to exist);

525
12. alternation: aea for the 3rd person singular and plural, tt for the 2nd person singular; a destepta (to awake/arouse);

13. alternation: eea for the 3rd person singular and plural, tt for the 2nd person singular; a deserta (to empty);

14. alternation: aa for all the forms except the 1st and 2nd person plural; a tresalta (to start, to take fright);

15. alternation: aa in the 3rd person singular and plural, ae in the 2nd person singular; a desfata (to delight);

16. alternation: aa for all the forms except for the 1st and 2nd person plural; a parea (to seem);

17. alternation: dz for the 2nd person singular due to palatalization, along with ae; a vedea (to see), stem ends in d;

18. alternation: aa for all forms except the 1st and 2nd person plural, dz for the 2nd person singular due to palatalization; a cadea (to fall);

19. no alternation; a veghea (to watch over), conjugates with another type of ez ending pattern;

20. no alternation; a merge (to walk), receives the typical ending pattern for the third conjugational class;

21. alternation: tt for the 2nd person singular; a promite (to promise);

22. no alternation; a scrie (to write);

23. alternation: stsc for the 1st person singular and 3rd person plural; a naste (to give birth), ends in ste;

24. alternation: n is deleted from the stem in the 2nd person singular; a pune (to put), ends in ne;

25. alternation: dz in the 2nd person singular due to palatalization; a crede (to believe), stem ends in d;

26. no alternation; a sui (to climb), ends in ui, ai, or ai;

27. no alternation; a citi (to read), conjugates with the suffix esc;

28. this type preserves the i from the infinitive; a locui (to reside), ends in ai, oi, or ui and conjugates with esc;

29. alternation: ooa in the 3rd person singular and plural; ends in î, a omorî (to kill);

30. no alternation; a hotarî (to decide), ends in î and conjugates with asc, a variant of esc.

2.2 Classifiers and features

Each infinitive in the dataset received a label corresponding to the first rule that correctly produces a conjugation for it. This was implemented in order to reduce the ambiguity of the data, which was due to some verbs having alternate conjugation patterns. The unlabeled verbs were thrown out, while the labeled ones were used to train and evaluate a classifier.

The context-sensitive nature of the alternations leads to the idea that n-gram character windows are useful. In the preprocessing step, the list of infinitives is transformed to a sparse matrix whose lines correspond to samples, and whose features are the occurrence or the frequency of a specific n-gram. This feature extraction step has three free parameters: the maximum n-gram length, the optional binarization of the features (taking only binary occurrences instead of counts), and the optional appending of a terminator character. The terminator character allows the classifier to identify and assign a different weight to the n-grams that overlap with the suffix of the string.

For example, consider the English infinitive to walk. We will assume the following illustrative values for the parameters: n-gram size of 3 and appending the terminator character. Firstly, a terminator is appended to the end, yielding the string walk$. Subsequently, the string is broken into 1, 2 and 3-grams: w, a, l, k, $, wa, al, lk, k$, wal, alk, lk$. Next, this list is turned into a vector using a standard process. We have first built a dictionary of all the n-grams from the whole dataset. These, in order, encode the features. The verb (to) walk is therefore encoded as a row vector with ones in the columns corresponding to the features w, a, etc. and zeros in the rest. In this particular case, there is no difference between binary and count

526
features, because all of the n-grams of this short verb occur only once. But for a verb such as (to) tantalize, the feature corresponding to the 2-gram ta would get a value of 2 in a count representation, but only a value of 1 in a binary one.

The system was put together using the scikit-learn machine learning library for Python (Pedregosa et al., 2011), which provides a fast, scalable implementation of linear support vector machines based on liblinear (Fan et al., 2008), along with n-gram extraction and grid search functionality.

3 Results

Table 2 shows how well the rules fitted the dataset. Out of 7,295 verbs in the dataset, 349 were uncaptured by our rules. As expected, the rule capturing the most verbs (3,330) is the one modelling those from the 1st conjugational class (whose infinitives end in a) which conjugate with the ez suffix and are regular, namely rule 7, created for verbs like a dansa. The second largest class, also as expected, is the one belonging to verbs from the 4th conjugational group (whose infinitives end in i), which are regular, meaning no alternation in the stem, and conjugate with the esc suffix. This class is modeled by rule number 27.

rule  no. verbs     rule  no. verbs
1     547           16    13
2     8             17    6
3     18            18    4
4     5             19    14
5     8             20    124
6     16            21    25
7     3330          22    15
8     273           23    7
9     89            24    41
10    4             25    51
11    5             26    185
12    4             27    1554
13    106           28    486
14    13            29    5
15    5             30    27

Table 2: Number of verbs captured by each of our rules

The support vector classifier was evaluated using 10-fold cross-validation. The multi-class problem is treated using the one-versus-all scheme. The parameters chosen by grid search are a maximum n-gram length of 5, with appended terminator and with non-binarized (count) features. The estimated correct classification rate is 90.64%, with a weighted averaged precision of 80.90%, recall of 90.64% and F1 score of 89.89%. Appending the artificial terminator character $ consistently improves accuracy by around 0.7%.

Because each word was represented as a bag of character n-grams instead of a continuous string, and because, by its nature, an SVM yields sparse solutions, combined with the evaluation using cross-validation, we can safely say that the model does not overfit and indeed learns useful decision boundaries.

4 Conclusions and Future Work

Our results show that the labelling system based on the verb conjugation model we developed can be learned with reasonable accuracy. In the future, we plan to develop a multiple-tiered labelling system that will allow for general alternations, such as the ones occurring as a result of palatalization, to be defined only once for all verbs that have them, taking cues from the idea of letters with multiple values. This, we feel, will highly improve the accuracy of the classifier.

5 Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments. All authors contributed equally to this work. The research of Liviu P. Dinu was supported by the CNCS, IDEI - PCE project 311/2011, "The Structure and Interpretation of the Romanian Nominal Phrase in Discourse Representation Theory: the Determiners".

References

Ana-Maria Barbu. Conjugarea verbelor romanesti. Dictionar: 7500 de verbe romanesti grupate pe clase de conjugare. Bucharest: Coresi, 2007. 4th edition, revised. (In Romanian.) (263 pp.).

Ana-Maria Barbu. Romanian lexical databases: Inflected and syllabic forms dictionaries. In Sixth International Language Resources and Evaluation (LREC '08), 2008.

Angelo Roth Costanzo. Romance Conjugational Classes: Learning from the Peripheries. PhD thesis, Ohio State University, 2011.

527
Figure 1: 10-fold cross-validation scores for various combinations of parameters. Only the values corresponding to the best C regularization parameters are shown.

Liviu P. Dinu, Emil Ionescu, Vlad Niculae, and Octavia-Maria Sulea. Can alternations be learned? A machine learning approach to verb alternations. In Recent Advances in Natural Language Processing 2011, September 2011.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, June 2008. ISSN 1532-4435.

Jiri Felix. Classification des verbes roumains, volume VII. Philosophica Pragensia, 1964.

Alf Lombard. Le verbe roumain. Etude morphologique, volume 1. Lund, C. W. K. Gleerup, 1955.

Grigore C. Moisil. Probleme puse de traducerea automata. Conjugarea verbelor în limba romana. Studii si cercetari lingvistice, XI(1):7-29, 1960.

I. Papastergiou, N. Papastergiou, and L. Mandeki. Verbul romanesc - reguli pentru înlesnirea însusirii indicativului prezent. In Romanian National Symposium "Directions in Romanian Philological Research", 7th Edition, May 2007.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, Oct 2011.

Valeria Gutu Romalo. Morfologie Structurala a limbii romane. Editura Academiei Republicii Socialiste Romania, 1968.
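As an aside to Sections 2.2 and 3 above, the following is a minimal sketch of the described setup (character n-grams with an appended terminator, fed to a linear SVM via scikit-learn). It is not the authors' code; the toy verbs, labels and parameter values are illustrative only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

infinitives = ["dansa", "canta", "citi", "locui"]   # toy sample of infinitives
labels = [7, 10, 27, 28]                            # rule labels, as in Table 2

# appending '$' lets n-grams that overlap the suffix get their own weight
docs = [verb + "$" for verb in infinitives]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 5), binary=False),
    LinearSVC(C=1.0),   # liblinear-backed linear SVM, one-vs-rest by default
)
model.fit(docs, labels)  # the paper evaluates with 10-fold cross-validation and grid search
print(model.predict(["tresalta$"]))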

528
Measuring Contextual Fitness Using Error Contexts Extracted from the
Wikipedia Revision History

Torsten Zesch
Ubiquitous Knowledge Processing Lab (UKP-DIPF)
German Institute for Educational Research and Educational Information, Frankfurt
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universitat Darmstadt
http://www.ukp.tu-darmstadt.de

Abstract the optical character recognition module (Walker


et al., 2010).
We evaluate measures of contextual fitness A malapropism or real-word spelling error oc-
on the task of detecting real-word spelling curs when a word is replaced with another cor-
errors. For that purpose, we extract nat- rectly spelled word which does not suit the con-
urally occurring errors and their contexts
text, e.g. People with lots of honey usually
from the Wikipedia revision history. We
show that such natural errors are better live in big houses., where money was replaced
suited for evaluation than the previously with honey. Besides typing mistakes, a major
used artificially created errors. In partic- source of such errors is the failed attempt of au-
ular, the precision of statistical methods tomatic spelling correctors to correct a misspelled
has been largely over-estimated, while the word (Hirst and Budanitsky, 2005). A real-word
precision of knowledge-based approaches spelling error is hard to detect, as the erroneous
has been under-estimated. Additionally, we
word is not misspelled and fits syntactically into
show that knowledge-based approaches can
be improved by using semantic relatedness the sentence. Thus, measures of contextual fitness
measures that make use of knowledge be- are required to detect words that do not fit their
yond classical taxonomic relations. Finally, contexts.
we show that statistical and knowledge- Existing measures of contextual fitness can be
based methods can be combined for in- categorized into knowledge-based (Hirst and Bu-
creased performance.
danitsky, 2005) and statistical methods (Mays et
al., 1991; Wilcox-OHearn et al., 2008). Both
test the lexical cohesion of a word with its con-
1 Introduction
text. For that purpose, knowledge-based ap-
Measuring the contextual fitness of a term in its proaches employ the structural knowledge en-
context is a key component in different NLP ap- coded in lexical-semantic networks like WordNet
plications like speech recognition (Inkpen and (Fellbaum, 1998), while statistical approaches
Desilets, 2005), optical character recognition rely on co-occurrence counts collected from large
(Wick et al., 2007), co-reference resolution (Bean corpora, e.g. the Google Web1T corpus (Brants
and Riloff, 2004), or malapropism detection (Bol- and Franz, 2006).
shakov and Gelbukh, 2003). The main idea is al- So far, evaluation of contextual fitness mea-
ways to test what fits better into the current con- sures relied on artificial datasets (Mays et al.,
text: the actual term or a possible replacement that 1991; Hirst and Budanitsky, 2005) which are cre-
is phonetically, structurally, or semantically simi- ated by taking a sentence that is known to be cor-
lar. We are going to focus on malapropism detec- rect, and replacing a word with a similar word
tion as it allows evaluating measures of contex- from the vocabulary. This has a couple of dis-
tual fitness in a more direct way than evaluating advantages: (i) the replacement might be a syn-
in a complex application which always entails in- onym of the original word and perfectly valid in
fluence from other components, e.g. the quality of the given context, (ii) the generated error might

529
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 529-538,
Avignon, France, April 23-27 2012. ©2012 Association for Computational Linguistics
be very unlikely to be made by a human, and real-word spelling errors at some point, which
(iii) inserting artificial errors often leads to un- are then corrected in subsequent revisions of the
natural sentences that are quite easy to correct, same article. The challenge lies in discriminating
e.g. if the word class has changed. However, real-word spelling errors from all sorts of other
even if the word class is unchanged, the origi- changes, including non-word spelling errors, re-
nal word and its replacement might still be vari- formulations, or the correction of wrong facts.
ants of the same lemma, e.g. a noun in singu- For that purpose, we apply a set of precision-
lar and plural, or a verb in present and past form. oriented heuristics narrowing down the number
This usually leads to a sentence where the error of possible error candidates. Such an approach
can be easily detected using syntactical or statis- is feasible, as the high number of revisions in
tical methods, but is almost impossible to detect Wikipedia allows to be extremely selective.
for knowledge-based measures of contextual fit-
ness, as the meaning of the word stays more or 2.1 Accessing the Revision Data
less unchanged. To estimate the impact of this is- We access the Wikipedia revision data using
sue, we randomly sampled 1,000 artificially cre- the freely available Wikipedia Revision Toolkit
ated real-word spelling errors1 and found 387 sin- (Ferschke et al., 2011) together with the JWPL
gular/plural pairs and 57 pairs which were in an- Wikipedia API (Zesch et al., 2008a).3 The API
other direct relation (e.g. adjective/adverb). This outputs plain text converted from Wiki-Markup,
means that almost half of the artificially created but the text still contains a small portion of left-
errors are not suited for an evaluation targeted at over markup and other artifacts. Thus, we per-
finding optimal measures of contextual fitness, as form additional cleaning steps removing (i) to-
they over-estimate the performance of statistical kens with more than 30 characters (often URLs),
measures while underestimating the potential of (ii) sentences with less than 5 or more than 200
semantic measures. In order to investigate this tokens, and (iii) sentences containing a high frac-
issue, we present a framework for mining natu- tion of special characters like : usually indicat-
rally occurring errors and their contexts from the ing Wikipedia-specific artifacts like lists of lan-
Wikipedia revision history. We use the resulting guage links. The remaining sentences are part-of-
English and German datasets to evaluate statisti- speech tagged and lemmatized using TreeTagger
cal and knowledge-based measures. (Schmid, 2004). Using these cleaned and anno-
We make the full experimental framework pub- tated articles, we form pairs of adjacent article re-
licly available2 which will allow reproducing our visions (ri and ri+1 ).
experiments as well as conducting follow-up ex-
2.2 Sentence Alignment
periments. The framework contains (i) methods
to extract natural errors from Wikipedia, (ii) ref- Fully aligning all sentences of the adjacent revi-
erence implementations of the knowledge-based sions is a quite costly operation, as sentences can
and the statistical methods, and (iii) the evalua- be split, joined, replaced, or moved in the arti-
tion datasets described in this paper. cle. However, we are only looking for sentence
pairs which are almost identical except for the
2 Mining Errors from Wikipedia real-word spelling error and its correction. Thus,
we form all sentence pairs and then apply an ag-
Measures of contextual fitness have previously gressive but cheap filter that rules out all sentences
been evaluated using artificially created datasets, which (i) are equal, or (ii) whose lengths differ
as there are very few sources of sentences with more than a small number of characters. For the
naturally occurring errors and their corrections. resulting much smaller subset of sentence pairs,
Recently, the revision history of Wikipedia has we compute the Jaro distance (Jaro, 1995) be-
been introduced as a valuable knowledge source tween each pair. If the distance exceeds a cer-
for NLP (Nelken and Yamangil, 2008; Yatskar et tain threshold tsim (0.05 in this case), we do not
al., 2010). It is also a possible source of natural further consider the pair. The small amount of re-
errors, as it is likely that Wikipedia editors make maining sentence pairs is passed to the sentence
1
pair filter for in-depth inspection.
The same artificial data as described in Section 3.2.
2 3
http://code.google.com/p/dkpro-spelling-asl/ http://code.google.com/p/jwpl/
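The following is a rough sketch (not the toolkit's actual code) of the cheap pre-filter plus Jaro-distance check described in Section 2.2; the jaro_distance argument stands in for any string-metric implementation, and the length threshold is an illustrative assumption.

def candidate_pairs(old_sentences, new_sentences, jaro_distance,
                    max_len_diff=10, t_sim=0.05):
    # keep only near-identical sentence pairs from two adjacent revisions
    pairs = []
    for s_old in old_sentences:
        for s_new in new_sentences:
            if s_old == s_new:                               # (i) unchanged sentence
                continue
            if abs(len(s_old) - len(s_new)) > max_len_diff:  # (ii) lengths differ too much
                continue
            if jaro_distance(s_old, s_new) > t_sim:          # too dissimilar overall
                continue
            pairs.append((s_old, s_new))   # passed the cheap filters, inspect in depth
    return pairs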

530
2.3 Sentence Pair Filtering tions, the change is likely to be semantically mo-
The sentence pair filter further reduces the num- tivated, e.g. if house was replaced with hut.
ber of remaining sentence pairs by applying a set Thus, we do not consider cases, where we detect
of heuristics including surface level and semantic a direct semantic relation between the original and
level filters. Surface level filters include: the replaced term. For this purpose, we use Word-
Replaced Token Sentences need to consist of Net (Fellbaum, 1998) for English and GermaNet
identical tokens, except for one replaced token. (Lemnitzer and Kunze, 2002) for German.
No Numbers The replaced token may not be a
3 Resulting Datasets
number.
UPPER CASE The replaced token may not be 3.1 Natural Error Datasets
in upper case.
Using our framework for mining real-word
Case Change The change should not only in-
spelling errors in context, we extracted an En-
volve case changes, e.g. changing english into
glish dataset5 , and a German dataset6 . Although
English.
the output generally was of high quality, man-
Edit Distance The edit distance between the
ual post-processing was necessary7 , as (i) for
replaced token and its correction need to be be-
some pairs the available context did not provide
low a certain threshold.
enough information to decide which form was
After applying the surface level filters, the re- correct, and (ii) a problem that might be spe-
maining sentence pairs are well-formed and con- cific to Wikipedia vandalism. The revisions are
tain exactly one changed token at the same posi- full of cases where words are replaced with simi-
tion in the sentence. However, the change does lar sounding but greasy alternatives. A relatively
not need to characterize a real-word spelling er- mild example is In romantic comedies, there is
ror, but could also be a normal spelling error or a a love story about a man and a woman who fall
semantically motivated change. Thus, we apply a in love, along with silly or funny comedy farts.,
set of semantic filters: where parts was replaced with farts only to be
Vocabulary The replaced token needs to occur changed back shortly afterwards by a Wikipedia
in the vocabulary. We found that even quite com- vandalism hunter. We removed all cases that re-
prehensive word lists discarded too many valid sulted from obvious vandalism. For further ex-
errors as Wikipedia contains articles from a very periments, a small list of offensive terms could be
wide range of domains. Thus, we use a frequency added to the stopword list to facilitate this pro-
filter based on the Google Web1T n-gram counts cess.
(Brants and Franz, 2006). We filter all sentences
A connected problem is correct words that get
where the replaced token has a very low unigram
falsely corrected by Wikipedia editors (without
count. We experimented with different values and
the malicious intend from the previous examples,
found 25,000 for English and 10,000 for German
but with similar consequences). For example, the
to yield good results.
initially correct sentence Dung beetles roll it into
Same Lemma The original token and the re- a ball, sometimes being up to 50 times their own
placed token may not have the same lemma, e.g. weight. was corrected by exchanging weight
car and cars would not pass this filter. with wait. We manually removed such obvious
Stopwords The replaced token should not be in mistakes, but are still left with some borderline
a short list of stopwords (mostly function words). cases. In the sentence By the 1780s the goals
Named Entity The replaced token should not of England were so full that convicts were often
be part of a named entity. For this purpose, we chained up in rotting old ships. the obvious error
applied the Stanford NER (Finkel et al., 2005).
5
Normal Spelling Error We apply the Jazzy Using a revision dump from April 5, 2011.
6
spelling detector4 and rule out all cases in which Using a revision dump from August 13, 2010.
7
The most efficient and precise way of finding real-word
it is able to detect the error. spelling errors would of course be to apply measures of con-
Semantic Relation If the original token and the textual fitness. However, the resulting dataset would then
replaced token are in a close lexical-semantic rela- only contain errors that are detectable by the measures we
want to evaluate a clearly unacceptable bias. Thus, a cer-
4
http://jazzy.sourceforge.net/ tain amount of manual validation is inevitable.

531
goal was changed by some Wikipedia editor to jail. However, actually it should have been the old English form for jail, gaol, which can be deduced when looking at the full context and later versions of the article. We decided to not remove these rare cases, because jail is a valid correction in this context.

After manual inspection, we are left with 466 English and 200 German errors. Given that we restricted our experiment to 5 million English and German revisions, much larger datasets can be extracted if the whole revision history is taken into account. Our snapshot of the English Wikipedia contains 305·10^6 revisions. Even if not all of them correspond to article revisions, it is safe to assume that more than 10,000 real-word spelling errors can be extracted from this version of Wikipedia. Using the same amount of source revisions, we found significantly more English than German errors. This might be due to (i) English having more short nouns or verbs than German that are more likely to be confused with each other, and (ii) the English Wikipedia being known to attract a larger amount of non-native editors, which might lead to higher rates of real-word spelling errors. However, this issue needs to be further investigated, e.g. based on comparable corpora built on the basis of different language editions of Wikipedia. Further refining the identification of real-word errors in Wikipedia would allow evaluating how frequently such errors actually occur and how long it takes the Wikipedia editors to detect them. If errors persist over a long time, using measures of contextual fitness for detection would be even more important.

Another interesting observation is that the average edit distance is around 1.4 for both datasets. This means that a substantial proportion of errors involve more than one edit operation. Given that many measures of contextual fitness allow at most one edit, many naturally occurring errors will not be detected. However, allowing a larger edit distance enormously increases the search space, resulting in increased run-time and possibly decreased detection precision due to more false positives.

3.2 Artificial Error Datasets

In contrast to the quite challenging process of mining naturally occurring errors, creating artificial errors is relatively straightforward. From a corpus that is known to be free of spelling errors, sentences are randomly sampled. For each sentence, a random word is selected and all strings with edit distance smaller than a given threshold (2 in our case) are generated. If one of those generated strings is a known word from the vocabulary, it is picked as the artificial error.

Previous work on evaluating real-word spelling correction (Hirst and Budanitsky, 2005; Wilcox-O'Hearn et al., 2008; Islam and Inkpen, 2009) used a dataset sampled from the Wall Street Journal corpus which is not freely available. Thus, we created a comparable English dataset of 1,000 artificial errors based on the easily available Brown corpus (Francis W. Nelson and Kucera, 1964).8 Additionally, we created a German dataset with 1,000 artificial errors based on the TIGER corpus.9

4 Measuring Contextual Fitness

There are two main approaches for measuring the contextual fitness of a word in its context: the statistical (Mays et al., 1991) and the knowledge-based approach (Hirst and Budanitsky, 2005).

4.1 Statistical Approach

Mays et al. (1991) introduced an approach based on the noisy-channel model. The model assumes that the correct sentence s is transmitted through a noisy channel adding noise, which results in a word w being replaced by an error e, leading to the wrong sentence s' which we observe. The probability of the correct word w given that we observe the error e can be computed as P(w|e) = P(w) · P(e|w). The channel model P(e|w) describes how likely the typist is to make an error. This is modeled by the parameter α.10 The remaining probability mass (1 − α) is distributed equally among all words in the vocabulary within an edit distance of 1 (edits(w)):

P(e|w) = α, if e = w; (1 − α)/|edits(w)|, if e ≠ w

The source model P(w) is estimated using a trigram language model, i.e. the probability of the

8 http://www.archive.org/details/BrownCorpus (CC-by-na).
9 http://www.ims.uni-stuttgart.de/projekte/TIGER/ The corpus contains 50,000 sentences of German newspaper text, and is freely available under a non-commercial license.
10 We optimize α on a held-out development set of errors.
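To make the channel model above concrete, here is a small illustrative sketch (not the paper's implementation); trigram_prob and edits are assumed to be supplied externally, alpha is illustrative (the paper tunes it on held-out errors), and the padding of the trigram context is simplified.

def channel_prob(error, word, edits, alpha=0.99):
    # P(e|w): alpha for the unchanged word, the rest spread over its edit-distance-1 neighbours
    if error == word:
        return alpha
    neighbours = edits(word)
    return (1.0 - alpha) / len(neighbours) if error in neighbours else 0.0

def candidate_score(candidate, observed, trigram_prob, edits, alpha=0.99):
    # P(s) * P(s'|s): trigram source model times per-word channel model
    p = 1.0
    padded = ["<s>", "<s>"] + candidate
    for i in range(2, len(padded)):
        p *= trigram_prob(padded[i], padded[i - 1], padded[i - 2])
    for w, e in zip(candidate, observed):
        p *= channel_prob(e, w, edits, alpha)
    return p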

532
intended word w_i is computed as the conditional probability P(w_i | w_{i-1} w_{i-2}). Hence, the probability of the correct sentence s = w_1 ... w_n can be estimated as

P(s) = \prod_{i=1}^{n+2} P(w_i \mid w_{i-1} w_{i-2})

The set of candidate sentences S_c contains all versions of the observed sentence s' derived by replacing one word with a word from edits(w), while all other words in the sentence remain unchanged. The correct sentence s is the sentence from S_c that maximizes P(s|s'), i.e. s = arg max_{s ∈ S_c} P(s) · P(s'|s).

4.2 Knowledge-Based Approach

Hirst and Budanitsky (2005) introduced a knowledge-based approach that detects real-word spelling errors by checking the semantic relations of a target word with its context. For this purpose, they apply WordNet as the source of lexical-semantic knowledge.

The algorithm flags all words as error candidates and then applies filters to remove those words from further consideration that are unlikely to be errors. First, the algorithm removes all closed-class word candidates as well as candidates which cannot be found in the vocabulary. Candidates are then tested for having lexical cohesion with their context, by (i) checking whether the same surface form or lemma appears again in the context, or (ii) whether a semantically related concept is found in the context. In both cases, the candidate is removed from the list of candidates. For each remaining possible real-word spelling error, edits are generated by inserting, deleting, or replacing characters up to a certain edit distance (usually 1). Each edit is then tested for lexical cohesion with the context. If at least one of them fits into the context, the candidate is selected as a real-word error.

Hirst and Budanitsky (2005) use two additional filters: First, they remove candidates that are common non-topical words. It is unclear how the list of such words was compiled. Their list of examples contains words like "find" or "world" which we consider to be perfectly valid candidates. Second, they also applied a filter using a list of known multi-words, as the probability for words to accidentally form multi-words is low. It is unclear which list was used. We could use multi-words from WordNet, but coverage would be rather limited. We decided not to use both filters in order to better assess the influence of the underlying semantic relatedness measure on the overall performance.

The knowledge-based approach uses semantic relatedness measures to determine the cohesion between a candidate and its context. In the experiments by Budanitsky and Hirst (2006), the measure by Jiang and Conrath (1997) yields the best results. However, a wide range of other measures have been proposed, cf. (Zesch and Gurevych, 2010). Some measures use a wider definition of semantic relatedness (Gabrilovich and Markovitch, 2007; Zesch et al., 2008b) instead of only using taxonomic relations in a knowledge source.

As semantic relatedness measures usually return a numeric value, we need to determine a threshold in order to come up with a binary related/unrelated decision. Budanitsky and Hirst (2006) used a characteristic gap in the standard evaluation dataset by Rubenstein and Goodenough (1965) that separates unrelated from related word pairs. We do not follow this approach, but optimize the threshold on a held-out development set of real-word spelling errors.

5 Results & Discussion

In this section, we report on the results obtained in our evaluation of contextual fitness measures using artificial and natural errors in English and German.

5.1 Statistical Approach

Dataset              P    R    F
Artificial-English  .77  .50  .60
Natural-English     .54  .26  .35
Artificial-German   .90  .49  .63
Natural-German      .77  .20  .32

Table 1: Performance of the statistical approach using a trigram model based on Google Web1T.

Table 1 summarizes the results obtained by the statistical approach using a trigram model based on the Google Web1T data (Brants and Franz, 2006). On the English artificial errors, we observe a quite high F-measure of .60 that drops to

533
.35 when switching to the naturally occurring errors which we extracted from Wikipedia. On the German dataset, we observe almost the same performance drop (from .63 to .32).

These observations correspond to our earlier analysis where we showed that the artificial data contains many cases that are quite easy to correct using a statistical model, e.g. where a plural form of a noun is replaced with its singular form (or vice versa) as in "I bought a car." vs. "I bought a cars.". The naturally occurring errors often contain much harder contexts, as shown in the following example: "Through the open window they heard sounds below in the street: cartwheels, a tired horse's plodding step, vices.", where "vices" should be corrected to "voices". While the lemma "voice" is clearly semantically related to other words in the context like "hear" or "sound", the position at the end of the sentence is especially difficult for the trigram-based statistical approach. The only trigram that connects the error to the context is (step, ",", vices/voices), which will probably yield a low frequency count even for very large trigram models. Higher order n-gram models would help, but suffer from the usual data-sparseness problems.

Influence of the N-gram Model  For building the trigram model, we used the Google Web1T data, which has some known quality issues and is not targeted towards the Wikipedia articles from which we sampled the natural errors. Thus, we also tested a trigram model based on Wikipedia. However, it is much smaller than the Web model, which leads us to additionally testing smaller Web models. Table 2 summarizes the results.

Dataset  N-gram model  Size     P    R    F
Art-En   Google Web    7·10^11  .77  .50  .60
         Google Web    7·10^10  .78  .48  .59
         Google Web    7·10^9   .76  .42  .54
         Wikipedia     2·10^9   .72  .37  .49
Nat-En   Google Web    7·10^11  .54  .26  .35
         Google Web    7·10^10  .51  .23  .31
         Google Web    7·10^9   .46  .19  .27
         Wikipedia     2·10^9   .49  .19  .27
Art-De   Google Web    8·10^10  .90  .49  .63
         Google Web    8·10^9   .90  .47  .61
         Google Web    8·10^8   .88  .36  .51
         Wikipedia     7·10^8   .90  .37  .52
Nat-De   Google Web    8·10^10  .77  .20  .32
         Google Web    8·10^9   .68  .14  .23
         Google Web    8·10^8   .65  .10  .17
         Wikipedia     7·10^8   .70  .13  .22

Table 2: Influence of the n-gram model on the performance of the statistical approach.

We observe that "more data is better data" still holds, as the largest Web model always outperforms the Wikipedia model in terms of recall. If we reduce the size of the Web model to the same order of magnitude as the Wikipedia model, the performance of the two models is comparable. We would have expected to see better results for the Wikipedia model in this setting, but its higher quality does not lead to a significant difference.

Even if statistical approaches quite reliably detect real-word spelling errors, the size of the required n-gram models remains a serious obstacle for use in real-world applications. The English Web1T trigram model is about 25GB, which currently is not suited for being applied in settings with limited storage capacities, e.g. for intelligent input assistance in mobile devices. As we have seen above, using smaller models will decrease recall to a point where hardly any error will be detected anymore. Thus, we will now have a look at knowledge-based approaches, which are less demanding in terms of the required resources.

5.2 Knowledge-based Approach

Dataset              P    R    F
Artificial-English  .26  .15  .19
Natural-English     .29  .18  .23
Artificial-German   .47  .16  .24
Natural-German      .40  .13  .19

Table 3: Performance of the knowledge-based approach using the Jiang-Conrath semantic relatedness measure.

Table 3 shows the results for the knowledge-based measure. In contrast to the statistical approach, the results on the artificial errors are not higher than on the natural errors, but almost equal for German and even lower for English; another piece of evidence supporting our view that the properties of artificial datasets over-estimate the performance of statistical measures.

Influence of the Relatedness Measure  As was pointed out before, Budanitsky and Hirst (2006)

534
Dataset   Measure              P    R    F
Art-En    Jiang-Conrath  0.5   .26  .15  .19
          Lin            0.5   .22  .17  .19
          Lesk           0.5   .19  .16  .17
          ESA-Wikipedia  0.05  .43  .13  .20
          ESA-Wiktionary 0.05  .35  .20  .25
          ESA-Wordnet    0.05  .33  .15  .21
Nat-En    Jiang-Conrath  0.5   .29  .18  .23
          Lin            0.5   .26  .21  .23
          Lesk           0.5   .19  .19  .19
          ESA-Wikipedia  0.05  .48  .14  .22
          ESA-Wiktionary 0.05  .39  .21  .27
          ESA-Wordnet    0.05  .36  .15  .21

Table 4: Performance of the knowledge-based approach using different relatedness measures.

Dataset              Comb.-Strategy   P    R    F
Artificial-English   Best-Single      .77  .50  .60
                     Union            .52  .55  .54
                     Intersection     .91  .15  .25
Natural-English      Best-Single      .54  .26  .35
                     Union            .40  .36  .38
                     Intersection     .82  .11  .19

Table 5: Results obtained by a combination of the best statistical and knowledge-based configuration. Best-Single is the best precision or recall obtained by a single measure. Union merges the detections of both approaches. Intersection only detects an error if both methods agree on a detection.

show that the measure by Jiang and Conrath (1997) yields the best results in their experiments on malapropism detection. In addition, we test another path-based measure by Lin (1998), the gloss-based measure by Lesk (1986), and the ESA measure (Gabrilovich and Markovitch, 2007) based on concept vectors from Wikipedia, Wiktionary, and WordNet. Table 4 summarizes the results. In contrast to the findings of Budanitsky and Hirst (2006), Jiang-Conrath is not the best path-based measure, as Lin provides equal or better performance. Even more importantly, other (non path-based) measures yield better performance than both path-based measures. Especially ESA based on Wiktionary provides a good overall performance, while ESA based on Wikipedia provides excellent precision. The advantage of ESA over the other measure types can be explained by its ability to incorporate semantic relationships beyond classical taxonomic relations (as used by path-based measures).

5.3 Combining the Approaches

The statistical and the knowledge-based approach use quite different methods to assess the contextual fitness of a word in its context. This makes it worthwhile to try to combine both approaches. We ran the statistical method (using the full Wikipedia trigram model) and the knowledge-based method (using the ESA-Wiktionary relatedness measure) in parallel and then combined the resulting detections using two strategies: (i) we merge the detections of both approaches in order to obtain higher recall (Union), and (ii) we only count an error as detected if both methods agree on a detection (Intersection). When comparing the combined results in Table 5 with the best precision or recall obtained by a single measure (Best-Single), we observe that precision can be significantly improved using the Intersection strategy, while recall is only moderately improved using the Union strategy. This means that (i) a large subset of errors is detected by both approaches and, due to their different sources of knowledge, the approaches mutually reinforce the detection, leading to increased precision, and (ii) a small but otherwise undetectable subset of errors requires considering detections made by one approach only.

6 Related Work

To our knowledge, we are the first to create a dataset of naturally occurring errors based on the revision history of Wikipedia. Max and Wisniewski (2010) used similar techniques to create a dataset of errors from the French Wikipedia. However, they target a wider class of errors including non-word spelling errors, and their class of real-word errors conflates malapropisms as well as other types of changes like reformulations. Thus, their dataset cannot easily be used for our purposes and is only available in French, while our framework allows creating datasets for all major languages with minimal manual effort.
Another possible source of real-word spelling errors are learner corpora (Granger, 2002), e.g. the Cambridge Learner Corpus (Nicholls, 1999). However, annotation of errors is difficult and costly (Rozovskaya and Roth, 2010), only a small fraction of the observed errors will be real-word spelling errors, and learners are likely to make different mistakes than proficient language users.
Islam and Inkpen (2009) presented another statistical approach using the Google Web1T data (Brants and Franz, 2006) to create the n-gram model. It slightly outperformed the approach by Mays et al. (1991) when evaluated on a corpus of artificial errors based on the WSJ corpus. However, the results are not directly comparable, as Mays et al. (1991) used a much smaller n-gram model, and our results in Section 5.1 show that the size of the n-gram model has a large influence on the results. Eventually, we decided to use the Mays et al. (1991) approach in our study, as it is easier to adapt and augment.
In a re-evaluation of the statistical model by Mays et al. (1991), Wilcox-O'Hearn et al. (2008) found that it outperformed the knowledge-based method by Hirst and Budanitsky (2005) when evaluated on a corpus of artificial errors based on the WSJ corpus. This is consistent with our findings on the artificial errors based on the Brown corpus, but, as we have seen in the previous section, evaluation on the naturally occurring errors shows a different picture. They also tried to improve the model by permitting multiple corrections and using fixed-length context windows instead of sentences, but obtained discouraging results.
All previously discussed methods are unsupervised in the sense that they do not rely on any training data with annotated errors. However, real-word spelling correction has also been tackled by supervised approaches (Golding and Schabes, 1996; Jones and Martin, 1997; Carlson et al., 2001). Those methods rely on predefined confusion sets, i.e. sets of words that are often confounded, e.g. {peace, piece} or {weather, whether}. For each set, the methods learn a model of the context in which one or the other alternative is more probable. This yields very high precision, but only for the limited number of previously defined confusion sets. Our framework for extracting natural errors could be used to increase the number of known confusion sets.

7 Conclusions and Future Work

In this paper, we evaluated two main approaches for measuring the contextual fitness of terms, the statistical approach by Mays et al. (1991) and the knowledge-based approach by Hirst and Budanitsky (2005), on the task of detecting real-word spelling errors. For that purpose, we extracted a dataset with naturally occurring errors and their contexts from the Wikipedia revision history. We show that evaluating measures of contextual fitness on this dataset provides a more realistic picture of task performance. In particular, using artificial datasets over-estimates the performance of the statistical approach, while it under-estimates the performance of the knowledge-based approach.
We show that n-gram models targeted towards the domain from which the errors are sampled do not improve the performance of the statistical approach if larger n-gram models are available. We further show that the performance of the knowledge-based approach can be improved by using semantic relatedness measures that incorporate knowledge beyond the taxonomic relations in a classical lexical-semantic resource like WordNet. Finally, by combining both approaches, significant increases in precision or recall can be achieved.
In future work, we want to evaluate a wider range of contextual fitness measures, and learn how to combine them using more sophisticated combination strategies. Both the statistical and the knowledge-based approach will benefit from a better model of the typist, as not all edit operations are equally likely (Kernighan et al., 1990). On the side of the error extraction, we are going to further improve the extraction process by incorporating more knowledge about the revisions. For example, vandalism is often reverted very quickly, which can be detected when looking at the full set of revisions of an article. We hope that making the experimental framework publicly available will foster future research in this field, as our results on the natural errors show that the problem is still quite challenging.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806. We thank Andreas Kellner and Tristan Miller for checking the datasets, and the anonymous reviewers for their helpful feedback.
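To make the combination strategies of Section 5.3 concrete, the following sketch is an editorial illustration (not part of the original paper); the two detector functions and the example sentence are toy stand-ins for the trigram-based and the relatedness-based measures.

    # Illustrative sketch of the Union/Intersection combination strategies.
    # The detectors below are toy stand-ins, not the paper's implementations.

    def statistical_detector(tokens):
        """Toy stand-in: flag positions the n-gram model finds unlikely."""
        return {i for i, tok in enumerate(tokens) if tok == "vices"}

    def knowledge_based_detector(tokens):
        """Toy stand-in: flag positions with low relatedness to the context."""
        return {i for i, tok in enumerate(tokens) if tok in {"vices", "cartwheels"}}

    def combine(tokens, strategy="union"):
        stat = statistical_detector(tokens)
        know = knowledge_based_detector(tokens)
        if strategy == "union":          # merge detections -> higher recall
            return stat | know
        if strategy == "intersection":   # require agreement -> higher precision
            return stat & know
        raise ValueError(strategy)

    sentence = "cartwheels , a tired horses plodding step , vices".split()
    print(combine(sentence, "union"))         # {0, 8} with this toy input
    print(combine(sentence, "intersection"))  # {8}

Union returns every position flagged by either detector, Intersection only the positions both agree on, which mirrors the recall/precision trade-off visible in Table 5.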
References

David Bean and Ellen Riloff. 2004. Unsupervised learning of contextual role knowledge for coreference resolution. In Proc. of HLT/NAACL, pages 297–304.

Igor A. Bolshakov and Alexander Gelbukh. 2003. On Detection of Malapropisms by Multistage Collocation Testing. In Proceedings of NLDB-2003, 8th International Workshop on Applications of Natural Language to Information Systems.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1.

Alexander Budanitsky and Graeme Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47.

Andrew J. Carlson, Jeffrey Rosen, and Dan Roth. 2001. Scaling Up Context-Sensitive Text Correction. In Proceedings of IAAI.

C. Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

Oliver Ferschke, Torsten Zesch, and Iryna Gurevych. 2011. Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, pages 97–102, Portland, OR, USA.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 363–370, Morristown, NJ, USA. Association for Computational Linguistics.

Francis W. Nelson and Henry Kucera. 1964. Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers.

Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611.

Andrew R. Golding and Yves Schabes. 1996. Combining Trigram-based and feature-based methods for context-sensitive spelling correction. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, pages 71–78, Morristown, NJ, USA. Association for Computational Linguistics.

Sylviane Granger. 2002. A bird's-eye view of learner corpus research, pages 3–33. John Benjamins Publishing Company.

Graeme Hirst and Alexander Budanitsky. 2005. Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering, 11(1):87–111, March.

Diana Inkpen and Alain Desilets. 2005. Semantic similarity for detecting recognition errors in automatic speech transcripts. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 49–56, Morristown, NJ, USA. Association for Computational Linguistics.

Aminul Islam and Diana Inkpen. 2009. Real-word spelling correction using Google Web 1T 3-grams. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 3, EMNLP '09, Morristown, NJ, USA. Association for Computational Linguistics.

M. A. Jaro. 1995. Probabilistic linkage of large public health data file. Statistics in Medicine, 14:491–498.

Jay J. Jiang and David W. Conrath. 1997. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In Proceedings of the 10th International Conference on Research in Computational Linguistics, Taipei, Taiwan.

Michael P. Jones and James H. Martin. 1997. Contextual spelling correction using latent semantic analysis. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 166–173, Morristown, NJ, USA. Association for Computational Linguistics.

Mark D. Kernighan, Kenneth W. Church, and William A. Gale. 1990. A Spelling Correction Program Based on a Noisy Channel Model. In Proceedings of the 13th International Conference on Computational Linguistics, pages 205–210, Helsinki, Finland.

Lothar Lemnitzer and Claudia Kunze. 2002. GermaNet - Representation, Visualization, Application. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), pages 1485–1491.

M. Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference, pages 24–26.

Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. In Proceedings of the International Conference on Machine Learning, pages 296–304, Madison, Wisconsin.

Aurelien Max and Guillaume Wisniewski. 2010. Mining Naturally-occurring Corrections and Paraphrases from Wikipedia's Revision History. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), pages 3143–3148.

Eric Mays, Fred J. Damerau, and Robert L. Mercer. 1991. Context based spelling correction. Information Processing & Management, 27(5):517–522.
Rani Nelken and Elif Yamangil. 2008. Mining Wikipedia's Article Revision History for Training Computational Linguistics Algorithms. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy (WikiAI08).

Diane Nicholls. 1999. The Cambridge Learner Corpus - Error Coding and Analysis for Lexicography and ELT. In Summer Workshop on Learner Corpora, Tokyo, Japan.

Alla Rozovskaya and Dan Roth. 2010. Annotating ESL Errors: Challenges and Rewards. In The 5th Workshop on Innovative Use of NLP for Building Educational Applications (NAACL-HLT).

H. Rubenstein and J. B. Goodenough. 1965. Contextual Correlates of Synonymy. Communications of the ACM, 8(10):627–633.

Helmut Schmid. 2004. Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland.

Daniel D. Walker, William B. Lund, and Eric K. Ringger. 2010. Evaluating Models of Latent Document Semantics in the Presence of OCR Errors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 240–250, October.

M. Wick, M. Ross, and E. Learned-Miller. 2007. Context-sensitive error correction: Using topic models to improve OCR. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 2, pages 1168–1172. IEEE, September.

Amber Wilcox-O'Hearn, Graeme Hirst, and Alexander Budanitsky. 2008. Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model. In Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing).

Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2010. For the sake of simplicity: unsupervised extraction of lexical simplifications from Wikipedia. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 365–368.

Torsten Zesch and Iryna Gurevych. 2010. Wisdom of Crowds versus Wisdom of Linguists - Measuring the Semantic Relatedness of Words. Journal of Natural Language Engineering, 16(1):25–59.

Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008a. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In Proceedings of the Conference on Language Resources and Evaluation (LREC).

Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008b. Using Wiktionary for computing semantic relatedness. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 861–867, Chicago, IL, USA, July.
Perplexity Minimization for Translation Model Domain Adaptation in
Statistical Machine Translation

Rico Sennrich
Institute of Computational Linguistics
University of Zurich
Binzmühlestr. 14
CH-8050 Zürich
sennrich@cl.uzh.ch

Abstract We move the focus away from a binary com-


bination of in-domain and out-of-domain data. If
We investigate the problem of domain we can scale up the number of models whose con-
adaptation for parallel data in Statistical tributions we weight, this reduces the need for a
Machine Translation (SMT). While tech- priori knowledge about the fitness1 of each poten-
niques for domain adaptation of monolin-
tial training text, and opens new research oppor-
gual data can be borrowed for parallel data,
we explore conceptual differences between
tunities, for instance experiments with clustered
translation model and language model do- training data.
main adaptation and their effect on per-
formance, such as the fact that translation 2 Domain Adaptation for Translation
models typically consist of several features Models
that have different characteristics and can
be optimized separately. We also explore To motivate efforts in domain adaptation, let us
adapting multiple (410) data sets with no review why additional training data can improve,
a priori distinction between in-domain and but also decrease translation quality.
out-of-domain data except for an in-domain Adding more training data to a translation sys-
development set.
tem is easy to motivate through the data sparse-
ness problem. Koehn and Knight (2001) show
1 Introduction that translation quality correlates strongly with
how often a word occurs in the training corpus.
The increasing availability of parallel corpora Rare words or phrases pose a problem in sev-
from various sources, welcome as it may be, eral stages of MT modelling, from word align-
leads to new challenges when building a statis- ment to the computation of translation probabil-
tical machine translation system for a specific ities through Maximum Likelihood Estimation.
domain. The task of determining which par- Unknown words are typically copied verbatim to
allel texts should be included for training, and the target text, which may be a good strategy for
which ones hurt translation performance, is te- named entities, but is often wrong otherwise. In
dious when performed through trial-and-error. general, more data allows for a better word align-
Alternatively, methods for a weighted combina- ment, a better estimation of translation probabili-
tion exist, but there is conflicting evidence as to ties, and for the consideration of more context (in
which approach works best, and the issue of de- phrase-based or syntactic SMT).
termining weights is not adequately resolved. A second effect of additional data is not nec-
The picture looks better in language mod- essarily positive. Translations are inherently am-
elling, where model interpolation through per- biguous, and a strong source of ambiguity is the
plexity minimization has become a widespread 1
We borrow this term from early evolutionary biology to
method of domain adaptation. We investigate the emphasize that the question in domain adaptation is not how
applicability of this method for translation mod- good or bad the data is, but how well-adapted it is to the
els, and discuss possible applications. task at hand.

Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
domain of a text. The German word Wort (engl. dividual model probabilities. It is defined as fol-
word) is typically translated as floor in Europarl, lows:
a corpus of Parliamentary Proceedings (Koehn,
2005), owing to the high frequency of phrases
such as you have the floor, which is translated into

p(x|y; \lambda) = \sum_{i=1}^{n} \lambda_i \, p_i(x|y)    (1)
German as Sie haben das Wort. This translation
is highly idiomatic and unlikely to occur in other with \lambda_i being the interpolation weight of each
contexts. Still, adding Europarl as out-of-domain model i, and with \sum_i \lambda_i = 1.
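As an illustration of equation 1, here is a minimal editorial sketch (ours, not the paper's implementation) of linearly interpolating several phrase tables; it also applies the renormalization described in Section 2.1, i.e. models whose phrase table does not contain the source phrase receive weight 0 for that phrase and the remaining weights are rescaled to sum to 1.

    # Sketch of equation (1): linear interpolation of several phrase tables.
    # Each model is a dict: source phrase -> {target phrase: probability}.

    def interpolate(models, weights, src, tgt):
        # drop models that do not know the source phrase, renormalize weights
        active = [(w, m) for w, m in zip(weights, models) if src in m]
        if not active:
            return 0.0  # source phrase unknown to all models
        total = sum(w for w, _ in active)
        return sum((w / total) * m[src].get(tgt, 0.0) for w, m in active)

    in_domain = {"Wort": {"word": 0.9, "floor": 0.1}}
    out_domain = {"Wort": {"word": 0.3, "floor": 0.7}}
    p = interpolate([in_domain, out_domain], [0.8, 0.2], "Wort", "floor")
    print(round(p, 3))  # 0.22 with these toy numbers

With a high in-domain weight, the out-of-domain preference for "floor" is damped, which is exactly the effect a weighted combination is meant to achieve.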
training data shifts the probability distribution of For SMT, linear interpolation of translation
p(t|Wort) in favour of p(floor|Wort), and models has been used in numerous systems. The
may thus lead to improper translations. approaches diverge in how they set the inter-
We will refer to the two problems as the data polation weights. Some authors use uniform
sparseness problem and the ambiguity problem. weights (Cohn and Lapata, 2007), others em-
Adding out-of-domain data typically mitigates the pirically test different interpolation coefficients
data sparseness problem, but exacerbates the am- (Finch and Sumita, 2008; Yasuda et al., 2008;
biguity problem. The net gain (or loss) of adding Nakov and Ng, 2009; Axelrod et al., 2011), others
more data changes from case to case. Because apply monolingual metrics to set the weights for
there are (to our knowledge) no tools that predict TM interpolation (Foster and Kuhn, 2007; Koehn
this net effect, it is a matter of empirical investi- et al., 2010).
gation (or, in less suave terms, trial-and-error), to There are reasons against all these approaches.
determine which corpora to use.2 Uniform weights are easy to implement, but give
From this understanding of the reasons for and little control. Empirically, it has been shown that
against out-of-domain data, we formulate the fol- they often do not perform optimally (Finch and
lowing hypotheses: Sumita, 2008; Yasuda et al., 2008). An opti-
mization of B LEU scores on a development set is
1. A weighted combination can control the con- promising, but slow and impractical. There is no
tribution of the out-of-domain corpus on the easy way to integrate linear interpolation into log-
probability distribution, and thus limit the linear SMT frameworks and perform optimization
ambiguity problem. through MERT. Monolingual optimization objec-
tives such as language model perplexity have the
2. A weighted combination eliminates the need advantage of being well-known and readily avail-
for data selection, offering a robust baseline able, but their relation to the ambiguity problem
for domain-specific machine translation. is indirect at best.
Linear interpolation is seemingly well-defined
We will discuss three mixture modelling tech-
in equation 1. Still, there are a few implemen-
niques for translation models. Our aim is to adapt
tation details worth pointing out. If we directly
all four features of the standard Moses SMT trans-
interpolate each feature in the translation model,
lation model: the phrase translation probabilities
and define the feature values of non-occurring
p(t|s) and p(s|t), and the lexical weights lex(t|s)
phrase pairs as 0, this disregards the meaning of
and lex(s|t).3
each feature. If we estimate p(x|y) via MLE as in
2.1 Linear Interpolation equation 2, and c(y) = 0, then p(x|y) is strictly
speaking undefined. Alternatively to a naive al-
A well-established approach in language mod-
gorithm, which treats unknown phrase pairs as
elling is the linear interpolation of several mod-
having a probability of 0, which results in a defi-
els, i.e. computing the weighted average of the in-
cient probability distribution, we propose and im-
2
A frustrating side-effect is that these findings rarely gen- plement the following algorithm. For each value
eralize. For instance, we were unable to reproduce the find- pair (x, y) for which we compute p(x|y), we re-
ing by Ceausu et al. (2011) that patent translation systems place i with 0 for all models i with p(y) =
are highly domain-sensitive and suffer from the inclusion of
parallel training data from other patent subdomains.
0, then renormalize the weight vector to 1.
3
We can ignore the fifth feature, the phrase penalty, We do this for p(t|s) and lex(t|s), but not for
which is a constant. p(s|t) and lex(s|t), the reasoning being the con-

sequences for perplexity minimization (see sec- 2.3 Alternative Paths
tion 2.4). Namely, we do not want to penalize
A third method is using multiple translation mod-
a small in-domain model for having a high out-
els as alternative decoding paths (Birch et al.,
of-vocabulary rate on the source side, but we do
2007), an idea which Koehn and Schroeder (2007)
want to penalize models that know the source
first used for domain adaptation. This approach
phrase, but not its correct translation. A sec-
has the attractive theoretical property that adding
ond modification pertains to the lexical weights
new models is guaranteed to lead to equal or bet-
lex(s|t) and lex(t|s), which form no true proba-
ter performance, given the right weights. At best,
bility distribution, but are derived from the indi-
a model is beneficial with appropriate weights. At
vidual word translation probabilities of a phrase
worst, we can set the feature weights so that the
pair (see (Koehn et al., 2003)). We propose to
decoding paths of one model are never picked for
not interpolate the features directly, but the word
the final translation. In practice, each translation
translation probabilities which are the basis of the
model adds 5 features and thus 5 more dimensions
lexical weight computation. The reason for this is
to the weight space, which leads to longer search,
that word pairs are less sparse than phrase pairs,
search errors, and/or overfitting. The expectation
so that we can even compute lexical weights for
is that, at least with MERT, using alternative de-
phrase pairs which are unknown in a model.4
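The following editorial sketch (ours; the data and helper names are illustrative) shows the idea of interpolating word translation probabilities first and then deriving a lexical weight from them, following the standard lex(t|s, a) definition of Koehn et al. (2003).

    # Sketch: interpolate word translation probabilities, then compute the
    # lexical weight of a phrase pair from them (Koehn et al., 2003).
    # w_models: list of dicts, source word -> {target word: probability}.

    def interpolate_word_prob(w_models, weights, s_word, t_word):
        return sum(w * m.get(s_word, {}).get(t_word, 0.0)
                   for w, m in zip(weights, w_models))

    def lexical_weight(src, tgt, alignment, w_models, weights):
        """alignment: set of (i, j) pairs linking src[i] to tgt[j].
        Unaligned target words (which would use w(t|NULL)) are ignored here."""
        score = 1.0
        for j, t_word in enumerate(tgt):
            links = [i for (i, jj) in alignment if jj == j]
            if not links:
                continue
            score *= sum(interpolate_word_prob(w_models, weights, src[i], t_word)
                         for i in links) / len(links)
        return score

    w_in = {"der": {"the": 0.9}, "Mann": {"man": 0.8}}
    w_out = {"der": {"the": 0.7}, "Mann": {"man": 0.6}}
    print(lexical_weight(["der", "Mann"], ["the", "man"], {(0, 0), (1, 1)},
                         [w_in, w_out], [0.5, 0.5]))  # 0.8 * 0.7 = 0.56

Because the word-level distributions are interpolated before the lexical weight is computed, a value can be obtained even for phrase pairs that a single model has never seen, as noted in the text.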
coding paths does not scale well to a high number
2.2 Weighted Counts of models.
Weighting of different corpora can also be imple- A suboptimal choice of weights is not the only
mented through a modified Maximum Likelihood weakness of alternative paths, however. Let us
Estimation. The traditional equation for MLE is: assume that all models have the same weights.
Note that, if a phrase pair occurs in several mod-
c(x, y) c(x, y) els, combining models through alternative paths
p(x|y) = \frac{c(x, y)}{c(y)} = \frac{c(x, y)}{\sum_{x'} c(x', y)}    (2)
c(y) x0 c(x , y) means that the decoder selects the path with the
where c denotes the count of an observation, and highest probability, whereas with linear interpo-
p the model probability. If we generalize the for- lation, the probability of the phrase pair would
mula to compute a probability from n corpora, be the (weighted) average of all models. Select-
and assign a weight i to each, we get5 : ing the highest-scoring phrase pair favours statis-
Pn tical outliers and hence is the less robust decision,
c (x, y)
p(x|y; ) = Pn i=1 P i i 0
(3) prone to data noise and data sparseness.
i=1 x0 i ci (x , y)
The main difference to linear interpolation is 2.4 Perplexity Minimization
that this equation takes into account how well- In language modelling, perplexity is frequently
evidenced a phrase pair is. This includes the dis- used as a quality measure for language models
tinction between lack of evidence and negative ev- (Chen and Goodman, 1998). Among other appli-
idence, which is missing in a naive implementa- cations, language model perplexity has been used
tion of linear interpolation. for domain adaptation (Foster and Kuhn, 2007).
Translation models trained with weighted For translation models, perplexity is most closely
counts have been discussed before, and have associated with EM word alignment (Brown et
been shown to outperform uniform ones in some al., 1993) and has been used to evaluate different
settings. However, researchers who demon- alignment algorithms (Al-Onaizan et al., 1999).
strated this fact did so with arbitrary weights (e.g.
We investigate translation model perplexity
(Koehn, 2002)), or by empirically testing differ-
minimization as a method to set model weights
ent weights (e.g. (Nakov and Ng, 2009)). We do
in mixture modelling. For the purpose of opti-
not know of any research on automatically deter-
mization, the cross-entropy H(p), the perplexity
mining weights for this method, or which is not
2H(p) , and other derived measures are equivalent.
limited to two corpora.
The cross-entropy H(p) is defined as:6
4
For instance if the word pairs (the,der) and (man,Mann)
6
are known, but the phrase pair (the man, der Mann) is not. See (Chen and Goodman, 1998) for a short discussion
5
P Unlike equation 1, equation 3 does not require that of the equation. In short, a lower cross-entropy indicates that
( i i ) = 1. the model is better able to predict the development set.

H(p) = -\sum_{x, y} \tilde{p}(x, y) \log_2 p(x|y)    (4)

Our main technical contributions are as fol-
lows: Additionally to perplexity optimization for
linear interpolation, which was first applied by
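A compact editorial sketch (ours; it assumes SciPy is available and is not the authors' code) of the optimization behind equations 4 and 5: choose the weight vector that minimizes the cross-entropy of the interpolated model on phrase pairs extracted from the development set, using L-BFGS-B with numerical gradients.

    # Sketch of equations (4)-(5): minimize development-set cross-entropy over
    # the interpolation weights with SciPy's L-BFGS-B (numerical gradients).
    import numpy as np
    from scipy.optimize import minimize

    def cross_entropy(lam, dev_pairs, models, floor=1e-12):
        lam = np.clip(lam, 1e-9, None)
        lam = lam / lam.sum()  # simplification: normalize inside the objective
        h = 0.0
        for (x, y), emp_prob in dev_pairs.items():
            p = sum(l * m.get(y, {}).get(x, 0.0) for l, m in zip(lam, models))
            h -= emp_prob * np.log2(max(p, floor))
        return h

    def optimize_weights(dev_pairs, models):
        x0 = np.full(len(models), 1.0 / len(models))
        res = minimize(cross_entropy, x0, args=(dev_pairs, models),
                       method="L-BFGS-B", bounds=[(1e-9, 1.0)] * len(models))
        lam = np.clip(res.x, 1e-9, None)
        return lam / lam.sum()

Here dev_pairs maps development-set phrase pairs (x, y) to their empirical probabilities, and models are the per-corpus phrase tables; the same scheme can be run once per translation model feature.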
The phrase pairs (x, y) whose probability we Foster et al. (2010), we propose perplexity opti-
measure, and their empirical probability p need mization for weighted counts (equation 3), and a
to be extracted from a development set, whereas modified implementation of linear interpolation.
p is the model probability. To obtain the phrase Also, we independently perform perplexity mini-
pairs, we process the development set with the mization for all four features of the standard SMT
same word alignment and phrase extraction tools translation model: the phrase translation proba-
that we use for training, i.e. GIZA++ and heuris- bilities p(t|s) and p(s|t), and the lexical weights
tics for phrase extraction (Och and Ney, 2003). lex(t|s) and lex(s|t).
The objective function is the minimization of the
3 Other Domain Adaptation Techniques
cross-entropy, with the weight vector as argu-
ment: So far, we discussed mixture modelling for trans-
X lation models, which is only a subset of domain
\lambda = \arg\min_{\lambda} \, -\sum_{x, y} \tilde{p}(x, y) \log_2 p(x|y; \lambda)    (5)
adaptation techniques in SMT.
x,y
Mixture-modelling for language models is well
We can fill in equations 1 or 3 for p(x|y; ). The established (Foster and Kuhn, 2007). Language
optimization itself is convex and can be done with model adaptation serves the same purpose as
off-the-shelf software.7 We use L-BFGS with translation model adaptation, i.e. skewing the
numerically approximated gradients (Byrd et al., probability distribution in favour of in-domain
1995). translations. This means that LM adaptation may
Perplexity minimization has the advantage that have similar effects as TM adaptation, and that
it is well-defined for both weighted counts and lin- the two are to some extent redundant. Foster and
ear interpolation, and can be quickly computed. Kuhn (2007) find that both TM and LM adap-
Other than in language modelling, where p(x|y) tation are effective, but that combined LM and
is the probability of a word given a n-gram his- TM adaptation is not better than LM adaptation
tory, conditional probabilities in translation mod- on its own.
els express the probability of a target phrase given A second strand of research in domain adap-
a source phrase (or vice versa), which connects tation is data selection, i.e. choosing a subset of
the perplexity to the ambiguity problem. The the training data that is considered more relevant
higher the probability of correct phrase pairs, for the task at hand. This has been done for lan-
the lower the perplexity, and the more likely guage models using techniques from information
the model is to successfully resolve the ambigu- retrieval (Zhao et al., 2004), or perplexity (Lin et
ity. The question is in how far perplexity min- al., 1997; Moore and Lewis, 2010). Data selec-
imization coincides with empirically good mix- tion has also been proposed for translation mod-
ture weights.8 This depends, among others, on els (Axelrod et al., 2011). Note that for transla-
the other model components in the SMT frame- tion models, data selection offers an unattractive
work, for instance the language model. We will trade-off between the data sparseness and the am-
not evaluate perplexity minimization against em- biguity problem, and that the optimal amount of
pirically optimized mixture weights, but apply it data to select is hard to determine.
in situations where the latter is infeasible, e.g. be- Our discussion of mixture-modelling is rela-
cause of the number of models. tively coarse-grained, with 2-10 models being
7
combined. Matsoukas et al. (2009) propose an ap-
A quick demonstration of convexity: equation 1 is
affine; equation 3 linear-fractional. Both are convex in the proach where each sentence is weighted accord-
domain R>0 . Consequently, equation 4 is also convex be- ing to a classifier, and Foster et al. (2010) ex-
cause it is the weighted sum of convex functions. tend this approach by weighting individual phrase
8
There are tasks for which perplexity is known to be un- pairs. These more fine-grained methods need not
reliable, e.g. for comparing models with different vocabular-
ies. However, such confounding factors do not affect the op-
be seen as alternatives to coarse-grained ones.
timization algorithm, which works with a fixed set of phrase Foster et al. (2010) combine the two, apply-
pairs, and merely varies . ing linear interpolation to combine the instance-

weighted out-of-domain model with an in-domain
model.

4 Evaluation

Apart from measuring the performance of the ap-
proaches introduced in section 2, we want to in-
vestigate the following open research questions.

1. Does an implementation of linear interpola-
tion that is more closely tailored to trans-
lation modelling outperform a naive imple-
mentation?

2. How do the approaches perform outside a
binary setting, i.e. when we do not work
with one in-domain and one out-of-domain
model, but with a higher number of models?

3. Can we apply perplexity minimization to
other translation model features such as the

Data set             sentences   words (fr)
Alpine (in-domain)   220k        4 700k
Europarl             1 500k      44 000k
JRC Acquis           1 100k      24 000k
OpenSubtitles v2     2 300k      18 000k
Total train          5 200k      91 000k
Dev                  1424        33 000
Test                 991         21 000

Table 1: Parallel data sets for the German-French translation task.

Data set             sentences   words
Alpine (in-domain)   650k        13 000k
News-commentary      150k        4 000k
Europarl             2 000k      60 000k
News                 25 000k     610 000k
Total                28 000k     690 000k

Table 2: Monolingual French data sets for the German-French translation task.
3. Can we apply perplexity minimization to French translation task.
other translation model features such as the
lexical weights, and if yes, does a separate all translation model features, and a modified one
optimization of each translation model fea- that normalizes for each phrase pair (s, t) for
ture improve performance? p(t|s) and recomputes the lexical weights based
on interpolated word translation probabilites. The
4.1 Data and Methods
fourth weighted combination is using alternative
In terms of tools and techniques used, we mostly decoding paths with weights set through MERT.
adhere to the work flow described for the WMT The four weighted combinations are evaluated
2011 baseline system9 . The main tools are Moses twice: once applied to the original four or ten par-
(Koehn et al., 2007), SRILM (Stolcke, 2002), and allel data sets, once in a binary setting in which
GIZA++ (Och and Ney, 2003), with settings as all out-of-domain data sets are first concatenated.
described in the WMT 2011 guide. We report Since we want to concentrate on translation
two translation measures: B LEU (Papineni et al., model domain adaptation, we keep other model
2002) and METEOR 1.3 (Denkowski and Lavie, components, namely word alignment and the lex-
2011). All results are lowercased and tokenized, ical reordering model, constant throughout the ex-
measured with five independent runs of MERT periments. We contrast two language models. An
(Och and Ney, 2003) and MultEval (Clark et al., unadapted, out-of-domain language model trained
2011) for resampling and significance testing. on data sets provided for the WMT 2011 transla-
We compare three baselines and four transla- tion task, and an adapted language model which is
tion model mixture techniques. The three base- the linear interpolation of all data sets, optimized
lines are a purely in-domain model, a purely out- for minimal perplexity on the in-domain develop-
of-domain model, and a model trained on the con- ment set.
catenation of the two, which corresponds to equa- While unadapted language models are becom-
tion 3 with uniform weights. Additionally, we ing more rare in domain adaptation research, they
evaluate perplexity optimization with weighted allow us to contrast different TM mixtures with-
counts and the two implementations of linear in- out the effect on performance being (partially)
terpolation contrasted in section 2.1. The two lin- hidden by language model adaptation with the
ear interpolations that are contrasted are a naive same effect.
one, i.e. a direct, unnormalized interpolation of The first data set is a DEFR translation sce-
9
http://www.statmt.org/wmt11/baseline. nario in the domain of mountaineering. The in-
html domain corpus is a collection of Alpine Club pub-

lications (Volk et al., 2010). As parallel out-of-
domain dataset, we use Europarl, a collection of
parliamentary proceedings (Koehn, 2005), JRC-
Acquis, a collection of legislative texts (Stein-
berger et al., 2006), and OpenSubtitles v2, a par-
allel corpus extracted from film subtitles10 (Tiede-
mann, 2009). For language modelling, we use in-
domain data and data from the 2011 Workshop
on Statistical Machine Translation. The respec-
tive sizes of the data sets are listed in tables 1 and
2.
As the second data set, we use the Haitian Cre-
ole-English data from the WMT 2011 featured
translation task. It consists of emergency SMS
sent in the wake of the 2010 Haiti earthquake.
Originally, Microsoft Research and CMU oper-
ated under severe time constraints to build a trans-
lation system for this language pair. This limits
the ability to empirically verify how much each
data set contributes to translation quality, and in-
creases the importance of automated and quick
domain adaptation methods.

Data set           units     words (en)
SMS (in-domain)    16 500    380 000
Medical            1 600     10 000
Newswire           13 500    330 000
Glossary           35 700    90 000
Wikipedia          8 500     110 000
Wikipedia NE       10 500    34 000
Bible              30 000    920 000
Haitisurf dict     3 700     4 000
Krengle dict       1 600     2 600
Krengle            650       4 200
Total train        120 000   1 900 000
Dev                900       22 000
Test               1274      25 000

Table 3: Parallel data sets for the Haiti Creole-English translation task.

Data set           sentences   words
SMS (in-domain)    16k         380k
News               113 000k    2 650 000k

Table 4: Monolingual English data sets for the Haiti Creole-English translation task.

Note that both data sets have a relatively high
Note that both data sets have a relatively high ole English translation task.
ratio of in-domain to out-of-domain parallel train-
ing data (1:20 for DEEN and 1:5 for HTEN)
Previous research has been performed with ratios
of 1:100 (Foster et al., 2010) or 1:400 (Axelrod LM performs better than an out-of-domain one,
et al., 2011). Since domain adaptation becomes and using all available in-domain parallel data is
more important when the ratio of IN to OUT is better than using only part of it. The same is not
low, and since such low ratios are also realistic11 , true for out-of-domain data, which highlights the
we also include results for which the amount of problem discussed in the introduction. For the
in-domain parallel data has been restricted to 10% DEFR task, adding 86 million words of out-of-
of the available data set. domain parallel data to the 5 million in-domain
We used the same development set for lan- data set does not lead to consistent performance
guage/translation model adaptation and setting gains. We observe a decrease of 0.3 B LEU points
the global model weights with MERT. While it with an out-of-domain LM, and an increase of 0.4
is theoretically possible that MERT will give too B LEU points with an adapted LM. The out-of-
high weights to models that are optimized on the domain training data has a larger positive effect
same development set, we found no empirical evi- if less in-domain data is available, with a gain of
dence for this in experiments with separate devel- 1.4 B LEU points. The results in the HTEN trans-
opment sets. lation task (table 6) paint a similar picture. An
interesting side note is that even tiny amounts of
4.2 Results in-domain parallel data can have strong effects on
performance. A training set of 1600 emergency
The results are shown in tables 5 and 6. In the
SMS (38 000 tokens) yields a comparable perfor-
DEFR translation task, results vary between 13.5
mance to an out-of-domain data set of 1.5 million
and 18.9 B LEU points; in the HTEN task, be-
tokens.
tween 24.3 and 33.8. Unsurprisingly, an adapted
10
As to the domain adaptation experiments,
http://www.opensubtitles.org
11
We predict that the availability of parallel data will
weights optimized through perplexity minimiza-
steadily increase, most data being out-of-domain for any tion are significantly better in the majority of
given task. cases, and never significantly worse, than uniform

out-of-domain LM adapted LM
System full IN TM full IN TM small IN TM
B LEU METEOR B LEU METEOR B LEU METEOR
in-domain 16.8 35.9 17.9 37.0 15.7 33.5
out-of-domain 13.5 31.3 14.8 32.3 14.8 32.3
counts (concatenation) 16.5 35.7 18.3 37.3 17.1 35.4
binary in/out
weighted counts 17.4 36.6 18.7 37.9 17.6 36.2
linear interpolation (naive) 17.4 36.7 18.8 37.9 17.6 36.1
linear interpolation (modified) 17.2 36.5 18.9 38.0 17.6 36.2
alternative paths 17.2 36.5 18.6 37.8 17.4 36.0
4 models
weighted counts 17.3 36.6 18.8 37.8 17.4 36.0
linear interpolation (naive) 17.1 36.5 18.5 37.7 17.3 35.9
linear interpolation (modified) 17.2 36.5 18.7 37.9 17.3 36.0
alternative paths 17.0 36.2 18.3 37.4 16.3 35.1
Table 5: Domain adaptation results DEFR. Domain: Alpine texts. Full IN TM: Using the full in-domain parallel
corpus; small IN TM: using 10% of available in-domain parallel data.

weights.12 However, the difference is smaller for mization methods does not change significantly.
the experiments with an adapted language model This is positive for perplexity optimization be-
than for those with an out-of-domain one, which cause it demonstrates that it requires less a priori
confirms that the benefit of language model adap- information, and opens up new research possibil-
tation and translation model adaptation are not ities, i.e. experiments with different clusterings of
fully cumulative. Performance-wise, there seems parallel data. The performance degradation for
to be no clear winner between weighted counts alternative paths is partially due to optimization
and the two alternative implementations of lin- problems in MERT, but also due to a higher sus-
ear interpolation. We can still argue for weighted ceptibility to statistical outliers, as discussed in
counts on theoretical grounds. A weighted MLE section 2.3.14
(equation 3) returns a true probability distribution, A pessimistic interpretation of the results
whereas a naive implementation of linear interpo- would point out that performance gains compared
lation results in a deficient model. Consequently, to the best baseline system are modest or even
probabilities are typically lower in the naively in- inexistent in some settings. However, we want
terpolated model, which results in higher (worse) to stress two important points. First, we often
perplexities. While the deficiency did not affect do not know a priori whether adding an out-of-
MERT or decoding negatively, it might become domain data set boosts or weakens translation per-
problematic in other applications, for instance if formance. An automatic weighting of data sets re-
we want to use an interpolated model as a compo- duces the need for trial-and-error experimentation
nent in a second perplexity-based combination of and is worthwhile even if a performance increase
models.13 is not guaranteed. Second, the potential impact
When moving from a binary setting with of a weighted combination depends on the trans-
one in-domain and one out-of-domain transla- lation scenario and the available data sets. Gen-
tion model (trained on all available out-of-domain erally, we expect non-uniform weighting to have
data) to 410 translation models, we observe a a bigger impact when the models that are com-
serious performance degradation for alternative bined are more dissimilar (in terms of fitness for
paths, while performance of the perplexity opti- the task), and if the ratio of in-domain to out-of-
domain data is low. Conversely, there are situa-
12
This also applies to linear interpolation with uniform
14
weights, which is not shown in the tables. We empirically verified this weakness in a synthetic ex-
13
Specifically, a deficient model would be dispreferred by periment with a randomly split training corpus and identical
the perplexity minimization algorithm. weights for each path.

out-of-domain LM adapted LM
System full IN TM full IN TM small IN TM
B LEU METEOR B LEU METEOR B LEU METEOR
in-domain 30.4 30.7 33.4 31.7 29.7 28.6
out-of-domain 24.3 28.0 28.9 30.2 28.9 30.2
counts (concatenation) 30.3 31.2 33.6 32.4 31.3 31.3
binary in/out
weighted counts 31.0 31.6 33.8 32.4 31.5 31.3
linear interpolation (naive) 30.8 31.4 33.7 32.4 31.9 31.3
linear interpolation (modified) 30.8 31.5 33.7 32.4 31.7 31.2
alternative paths 30.8 31.3 33.2 32.4 29.8 30.7
10 models
weighted counts 31.0 31.5 33.5 32.3 31.8 31.5
linear interpolation (naive) 30.9 31.4 33.8 32.4 31.9 31.3
linear interpolation (modified) 31.0 31.6 33.8 32.5 32.1 31.5
alternative paths 25.9 29.2 24.3 29.1 29.8 30.9
Table 6: Domain adaptation results HTEN. Domain: emergency SMS. Full IN TM: Using the full in-domain
parallel corpus; small IN TM: using 10% of available in-domain parallel data.

tions where we actually expect a simple concate-
nation to be optimal, e.g. when the data sets have
very similar probability distributions.

                      perplexity
weights      1      2      3      4       BLEU
weighted counts
uniform      5.12   7.68   4.84   13.67   30.3
separate     4.68   6.62   4.24   8.57    31.0
1            4.68   6.84   4.50   10.86   30.3
2            4.78   6.62   4.48   10.54   30.3
3            4.86   7.31   4.24   9.15    30.8
4            5.33   7.87   4.52   8.57    30.9
average      4.72   6.71   4.38   9.95    30.4
linear interpolation (modified)
uniform      19.89  82.78  4.80   10.78   30.6
separate     5.45   8.56   4.28   8.85    31.0
1            5.45   8.79   4.40   8.89    30.8
2            5.71   8.56   4.54   8.91    30.9
3            6.46   11.88  4.28   9.07    31.0
4            6.12   10.86  4.47   8.85    30.9
average      5.73   9.72   4.34   8.89    30.9
LM           6.01   9.83   4.56   8.96    30.8

Table 7: Contrast between a separate optimization of each feature and applying the weight vector optimized on one feature to the whole model. HT-EN with out-of-domain LM.

4.2.1 Individually Optimizing Each TM Feature

It is hard to empirically show how translation
model perplexity optimization compares to using
monolingual perplexity measures for the purpose
of weighting translation models, as e.g. done by
(Foster and Kuhn, 2007; Koehn et al., 2010). One
problem is that there are many different possible
configurations for the latter. We can use source
side or target side language models, operate with
different vocabularies, smoothing techniques, and
n-gram orders.
One of the theoretical considerations that
favour measuring perplexity on the translation
model rather than using monolingual measures
is that we can optimize each translation model
feature separately. In the default Moses transla-
tion model, the four features are p(s|t), lex(s|t),
p(t|s) and lex(t|s).
We empirically test different optimization
schemes as follows. We optimize perplexity on
each feature independently, obtaining 4 weight
vectors. We then compute one model with one
weight vector per feature (namely the feature that
the vector was optimized on), and four models
that use one of the weight vectors for all features.
A further model uses a weight vector that is the

average of the other four. For linear interpolation, elling. We envision that a weighted combination
we also include a model whose weights have been could be useful to deal with noisy datasets, or ap-
optimized through language model perplexity op- plied after a clustering of training data.
timization, with a 3-gram language model (modi-
fied Knesey-Ney smoothing) trained on the target Acknowledgements
side of each parallel data set. This research was funded by the Swiss National
Table 7 shows the results. In terms of B LEU Science Foundation under grant 105215_126999.
score, a separate optimization of each feature is a
winner in our experiment in that no other scheme
is better, with 8 of the 11 alternative weighting References
schemes (excluding uniform weights) being sig- Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin
nificantly worse than a separate optimization. The Knight, John Lafferty, Dan Melamed, Franz-Josef
differences in B LEU score are small, however, Och, David Purdy, Noah A. Smith, and David
since the alternative weighting schemes are gen- Yarowsky. 1999. Statistical machine translation.
erally felicitious in that they yield both a lower Technical report, Final Report, JHU Summer Work-
shop.
perplexity and better B LEU scores than uniform
Amittai Axelrod, Xiaodong He, and Jianfeng Gao.
weighting. While our general expectation is that 2011. Domain adaptation via pseudo in-domain
lower perplexities correlate with higher transla- data selection. In Proceedings of the EMNLP 2011
tion performance, this relation is complicated by Workshop on Statistical Machine Translation.
several facts. Since the interpolated models are Alexandra Birch, Miles Osborne, and Philipp Koehn.
deficient (i.e. their probabilities do not sum to 1), 2007. CCG supertags in factored statistical ma-
perplexities for weighted counts and our imple- chine translation. In Proceedings of the Second
mentation of linear interpolation cannot be com- Workshop on Statistical Machine Translation, pages
916, Prague, Czech Republic, June. Association
pard. Also, note that not all features are equally
for Computational Linguistics.
important for decoding. Their weights in the log- Peter F. Brown, Vincent J. Della Pietra, Stephen A.
linear model are set through MERT and vary be- Della Pietra, and Robert L. Mercer. 1993. The
tween optimization runs. Mathematics of Statistical Machine Translation:
Parameter Estimation. Computational Linguistics,
5 Conclusion 19(2):263311.
Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and
This paper contributes to SMT domain adaptation Ciyou Zhu. 1995. A limited memory algorithm
research in several ways. We expand on work for bound constrained optimization. SIAM J. Sci.
by (Foster et al., 2010) in establishing transla- Comput., 16:11901208, September.
tion model perplexity minimization as a robust Alexandru Ceausu, John Tinsley, Jian Zhang, and
Andy Way. 2011. Experiments on domain adap-
baseline for a weighted combination of translation
tation for patent machine translation in the PLuTO
models.15 We demonstrate perplexity optimiza- project. In Proceedings of the 15th conference of
tion for weighted counts, which are a natural ex- the European Association for Machine Translation,
tension of unadapted MLE training, but are of lit- Leuven, Belgium.
tle prominence in domain adaptation research. We Stanley F. Chen and Joshua Goodman. 1998. An em-
also show that we can separately optimize the four pirical study of smoothing techniques for language
variable features in the Moses translation model modeling. Computer Speech & Language, 13:359
through perplexity optimization. 393.
Jonathan H. Clark, Chris Dyer, Alon Lavie, and
We break with prior domain adaptation re-
Noah A. Smith. 2011. Better hypothesis testing for
search in that we do not rely on a binary clustering statistical machine translation: Controlling for op-
of in-domain and out-of-domain training data. We timizer instability. In Proceedings of the 49th An-
demonstrate that perplexity minimization scales nual Meeting of the Association for Computational
well to a higher number of translation models. Linguistics: Human Language Technologies, pages
This is not only useful for domain adaptation, but 176181, Portland, Oregon, USA, June. Associa-
for various tasks that profit from mixture mod- tion for Computational Linguistics.
Trevor Cohn and Mirella Lapata. 2007. Machine
15 Translation by Triangulation: Making Effective Use
The source code is available in the Moses repository
http://github.com/moses-smt/mosesdecoder of Multi-Parallel Corpora. In Proceedings of the

547
45th Annual Meeting of the Association of Compu- Philipp Koehn, Barry Haddow, Philip Williams, and
tational Linguistics, pages 728735, Prague, Czech Hieu Hoang. 2010. More linguistic annotation
Republic, June. Association for Computational Lin- for statistical machine translation. In Proceedings
guistics. of the Joint Fifth Workshop on Statistical Machine
Michael Denkowski and Alon Lavie. 2011. Meteor Translation and MetricsMATR, pages 115120, Up-
1.3: Automatic Metric for Reliable Optimization psala, Sweden, July. Association for Computational
and Evaluation of Machine Translation Systems. In Linguistics.
Proceedings of the EMNLP 2011 Workshop on Sta- Philipp Koehn. 2002. Europarl: A Multilingual Cor-
tistical Machine Translation. pus for Evaluation of Machine Translation.
Andrew Finch and Eiichiro Sumita. 2008. Dynamic Philipp Koehn. 2005. Europarl: A parallel corpus for
model interpolation for statistical machine transla- statistical machine translation. In Machine Transla-
tion. In Proceedings of the Third Workshop on tion Summit X, pages 7986, Phuket, Thailand.
Subcat-LMF: Fleshing out a standardized format
for subcategorization frame interoperability
Judith Eckle-Kohler and Iryna Gurevych
Ubiquitous Knowledge Processing Lab (UKP-DIPF)
German Institute for Educational Research and Educational Information
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science
Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de

Abstract

This paper describes Subcat-LMF, an ISO-LMF compliant lexicon representation format featuring a uniform representation of subcategorization frames (SCFs) for the two languages English and German. Subcat-LMF is able to represent SCFs at a very fine-grained level. We utilized Subcat-LMF to standardize lexicons with large-scale SCF information: the English VerbNet and two German lexicons, i.e., a subset of IMSlex and GermaNet verbs. To evaluate our LMF-model, we performed a cross-lingual comparison of SCF coverage and overlap for the standardized versions of the English and German lexicons. The Subcat-LMF DTD, the conversion tools and the standardized versions of VerbNet and IMSlex subset are publicly available. [1]

[1] http://www.ukp.tu-darmstadt.de/data/uby

1 Introduction

Computational lexicons providing accurate lexical-syntactic information, such as subcategorization frames (SCFs), are vital for many NLP applications involving parsing and word sense disambiguation. In parsing, SCFs have been successfully used to improve the output of statistical parsers (Klenner (2007), Deoskar (2008), Sigogne et al. (2011)), which is particularly significant in high-precision domain-independent parsing. In word sense disambiguation, SCFs have been identified as important features for verb sense disambiguation (Brown et al., 2011), which is due to the correlation of verb senses and SCFs (Andrew et al., 2004).

SCFs specify syntactic arguments of verbs and other predicate-like lexemes, e.g. the verb say takes two arguments that can be realized, for instance, as noun phrase and that-clause as in He says that the window is open.

Although a number of freely available, large-scale and accurate SCF lexicons exist, e.g. COMLEX (Grishman et al., 1994) and VerbNet (Kipper et al., 2008) for English, availability and limitations in size and coverage remain an inherent issue. This applies even more to languages other than English.

One particular approach to address this issue is the combination and integration of existing manually built SCF lexicons. Lexicon integration has widely been adopted for increasing the coverage of lexicons regarding lexical-semantic information types, such as semantic roles, selectional restrictions, and word senses (e.g., Shi and Mihalcea (2005), the Semlink project [2], Navigli and Ponzetto (2010), Niemann and Gurevych (2011), Meyer and Gurevych (2011)).

[2] http://verbs.colorado.edu/semlink/

Currently, SCFs are represented idiosyncratically in existing SCF lexicons. However, integration of SCFs requires a common, interoperable representation format. Monolingual SCF integration based on a common representation format has already been addressed by King and Crouch (2005) and just recently by Necsulescu et al. (2011) and Padro et al. (2011). However, neither King and Crouch (2005) nor Necsulescu et al. (2011) or Padro et al. (2011) make use of existing standards in order to create a uniform SCF representation for lexicon merging. The definition of an interoperable representation format according to an existing standard, such as the ISO standard Lexical Markup Framework (LMF, ISO 24613:2008, see Francopoulo et al. (2006)), is the

prerequisite for re-using this format in different contexts, thus contributing to the standardization and interoperability of language resources.

While LMF models exist that cover the representation of SCFs (see Quochi et al. (2008), Buitelaar et al. (2009)), their suitability for representing SCFs at a large scale remains unclear: neither of these LMF-models has been used for standardizing lexicons with a large number of SCFs, such as VerbNet. Furthermore, the question of their applicability to different languages has not been investigated yet, a situation that is complicated by the fact that SCFs are highly language-specific.

The goal of this paper is to address these gaps for the two languages English and German by presenting a uniform LMF representation of SCFs for English and German which is utilized for the standardization of large-scale English and German SCF lexicons. The contributions of this paper are threefold: (1) We present the LMF model Subcat-LMF, an LMF-compliant lexicon representation format featuring a uniform and very fine-grained representation of SCFs for English and German. Subcat-LMF is a subset of Uby-LMF (Eckle-Kohler et al., 2012), the LMF model of the large integrated lexical resource Uby (Gurevych et al., 2012). (2) We convert lexicons with large-scale SCF information to Subcat-LMF: the English VerbNet and two German lexicons, i.e., GermaNet (Kunze and Lemnitzer, 2002) and a subset of IMSlex [3] (Eckle-Kohler, 1999). (3) We perform a comparison of these three lexicons regarding SCF coverage and SCF overlap, based on the standardized representation.

The remainder of this paper is structured as follows: Section 2 gives a detailed description of Subcat-LMF and section 3 demonstrates its usefulness for representing and cross-lingually comparing large-scale English and German lexicons. Section 4 provides a discussion including related work and section 5 concludes.

2 Subcat-LMF

2.1 ISO-LMF: a meta-model

LMF defines a meta-model of lexical resources, covering NLP lexicons and Machine Readable Dictionaries. This meta-model is based on the Unified Modeling Language (UML) and specifies a core package and a number of extensions for modeling different types of lexicons, including subcategorization lexicons.

The development of an LMF-compliant lexicon model requires two steps: in the first step, the structure of the lexicon model has to be defined by choosing a combination of the LMF core package and zero to many extensions (i.e. UML packages). While the LMF core package models a lexicon in terms of lexical entries, each of which is defined as the pairing of one to many forms and zero to many senses, the LMF extensions provide UML classes for different types of lexicon organization, e.g., covering the synset-based organization of WordNet and the class-based organization of VerbNet. The first step results in a set of UML classes that are associated according to the UML diagrams given in ISO LMF.

In the second step, these UML classes may be enriched by attributes. While neither attributes nor their values are given by the standard, the standard states that both are to be linked to Data Categories (DCs) defined in a Data Category Registry (DCR) such as ISOCat. [4] DCs that are not available in ISOCat may be defined and submitted for standardization. The second step results in a so-called Data Category Selection (DCS).

DCs specify the linguistic vocabulary used in an LMF model. Consider as an example the linguistic term direct object that often occurs in SCFs of verbs taking an accusative NP as argument. In ISOCat, there are two different specifications of this term, one explicitly referring to the capability of becoming the clause subject in passivization [5], the other not mentioning passivization at all. [6] Consequently, the use of a DCR plays a major role regarding the semantic interoperability of lexicons (Ide and Pustejovsky, 2010). Different resources that share a common definition of their linguistic vocabulary are said to be semantically interoperable.

2.2 Fleshing out ISO-LMF

Approach: We started our development of Subcat-LMF with a thorough inspection of large-scale English and German resources providing SCFs for verbs, nouns, and adjectives. For

[3] http://www.ims.uni-stuttgart.de/projekte/IMSLex/
[4] http://www.isocat.org/, the implementation of the ISO 12620 DCR (Broeder et al., 2010).
[5] http://www.isocat.org/datcat/DC-1274
[6] http://www.isocat.org/datcat/DC-2263

English, our analysis included VerbNet [7] and FrameNet syntactically annotated example sentences from Ruppenhofer et al. (2010). For German, we inspected GermaNet, SALSA annotation guidelines (Burchardt et al., 2006) and IMSlex documentation (Eckle-Kohler, 1999). In addition, the EAGLES synopsis on morphosyntactic phenomena [8] (Calzolari and Monachini, 1996), as well as the EAGLES recommendations on subcategorization [9], have been used to identify DCs relevant for SCFs.

We specified Subcat-LMF by a DTD yielding an XML serialization of ISO-LMF. Thus, existing lexicons can be standardized, i.e. converted into Subcat-LMF format, based on the DTD. [10]

Lexicon structure: Next, we defined the lexicon structure of Subcat-LMF. In addition to the core package, Subcat-LMF primarily makes use of the LMF Syntax and Semantics extension. Figure 1 shows the most important classes of Subcat-LMF including SynsemCorrespondence where the linking of syntactic and semantic arguments is encoded. It might be worth noting that both synsets from GermaNet and verb classes from VerbNet can be represented in Subcat-LMF by using the Synset and SubcategorizationFrameSet class.

Diverging linguistic properties of SCFs in English and German: For verbs (and also for predicate-like nouns and adjectives), SCFs specify the syntactic and morphosyntactic properties of their arguments that have to be present in concrete realizations of these arguments within a sentence. While some properties of syntactic arguments in English and German correspond (both English and German are Germanic languages and hence closely related), there are other properties, mainly morphosyntactic ones, that diverge. By way of examples, we illustrate some of these divergences in the following (we contrast English examples with their German equivalents):

- overt case marking in German: He helps him. vs. Er hilft ihm. (dative)
- specific verb form in verb phrase arguments: He suggested cleaning the house. (ing-form) vs. Er schlug vor, das Haus zu putzen. (to-infinitive)
- morphosyntactic marking of verb phrase arguments in the main clause: He managed to win. (no marking) vs. Er hat es geschafft zu gewinnen. (obligatory es)
- morphosyntactic marking of clausal arguments in the main clause: That depends on who did it. (preposition) vs. Das hängt davon ab, wer es getan hat. (pronominal adverb)

Uniform Data Categories for English and German: Thus, the main challenge in developing Subcat-LMF has been the specification of DCs (attributes and attribute values) in such a way that a uniform specification of SCFs in the two languages English and German can be achieved. The specification of DCs for Subcat-LMF involved fleshing out ISO-LMF, because it is a meta-standard in the sense that it provides only few linguistic terms, i.e. DCs, and these DCs are not linked to any DCR: in the Syntax Extension, the standard only provides 7 class names (see Figure 1), complemented by 17 example attributes given in an informative, non-binding Annex F. These are by far not sufficient to represent the fine-grained SCFs available in such large-scale lexicons as VerbNet.

In contrast, the Syntax part of Subcat-LMF comprises 58 DCs that are properly linked to ISOCat DCs; a number of DCs were missing in ISOCat, so we entered them ourselves. [11] The majority of the attributes in Subcat-LMF are attached to the SyntacticArgument class. The corresponding DCs can be divided into two main groups:

- Cross-lingually valid DCs for the specification of grammatical functions (e.g. subject, prepositionalComplement) and syntactic categories (e.g. nounPhrase, prepositionalPhrase), see Table 1.
- Partly language-specific morphosyntactic DCs that further specify the syntactic arguments (e.g. attribute case, attribute verbForm and

[7] SCFs in VerbNet also cover SCFs in VALEX, a lexicon automatically extracted from corpora.
[8] http://www.ilc.cnr.it/EAGLES96/morphsyn/
[9] http://www.ilc.cnr.it/EAGLES96/synlex/
[10] Available at http://www.ukp.tu-darmstadt.de/data/uby
[11] The Subcat-LMF DCS is publicly available on the ISOCat website.

Figure 1: Selected classes of Subcat-LMF.
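The UML diagram itself is not reproduced here. As a rough textual stand-in, the sketch below lists the classes named in the text as Python dataclasses; the attributes and associations shown are an illustrative approximation of ours, not the normative ISO-LMF model.

```python
# Illustrative approximation of the Subcat-LMF classes named in the text
# (Figure 1 shows the authoritative UML associations).
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class SyntacticArgument:
    # attribute-value pairs such as grammaticalFunction, syntacticCategory, case
    attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class LexemeProperty:
    syntacticProperty: Optional[str] = None       # e.g. control/raising properties

@dataclass
class SubcategorizationFrame:
    arguments: List[SyntacticArgument] = field(default_factory=list)
    lexeme_property: Optional[LexemeProperty] = None

@dataclass
class SemanticPredicate:
    semantic_arguments: List[Dict[str, str]] = field(default_factory=list)

@dataclass
class SynSemCorrespondence:
    # links syntactic arguments of a frame to semantic arguments of a predicate
    argument_mapping: Dict[int, int] = field(default_factory=dict)

@dataclass
class Sense:
    frame: Optional[SubcategorizationFrame] = None
    predicate: Optional[SemanticPredicate] = None
    correspondence: Optional[SynSemCorrespondence] = None

@dataclass
class LexicalEntry:
    lemma: str = ""
    senses: List[Sense] = field(default_factory=list)

@dataclass
class Synset:
    senses: List[Sense] = field(default_factory=list)

@dataclass
class SubcategorizationFrameSet:
    # used to represent VerbNet classes and GermaNet synset frames
    frames: List[SubcategorizationFrame] = field(default_factory=list)
```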

Values of grammaticalFunction Example


subject They arrived in time.
subjectComplement He becomes a teacher.
directObject He saw a rainbow.
objectComplement They elected him governor.
complement He told him a story.
prepositionalComplement It depends on several factors.
adverbialComplement They moved far away.
Values of syntacticCategory Example
nounPhrase The train stopped.
reflexive He drank himself sick.
expletive It is raining.
prepositionalPhrase It depends on several factors.
adverbPhrase They moved far away.
adjectivePhrase The light turned red.
verbPhrase She tried to exercise.
declarativeClause He says he agrees.
subordinateClause He believes that it works.

Table 1: Cross-lingually valid (English-German) attributes and values of the SyntacticArgument class.

values toInfinitive, bareInfinitive, ingForm, participle), see Table 2.

In the class LexemeProperty, we introduced an attribute syntacticProperty to encode control and raising properties of verbs taking infinitival verb phrase arguments. [12]

In Subcat-LMF, syntactic arguments can be specified by a selection of appropriate attribute-value pairs. While all syntactic arguments are uniformly specified by a grammatical function and a syntactic category, the use of the morphosyntactic attributes depends on the particular type of syntactic argument. Different phrase types are specified by different subsets of morphosyntactic attributes, see Table 2. The following examples illustrate some of these attributes:

- number: the number of a noun phrase argument can be lexically governed by the verb as in These types of fish mix well together.
- verbForm: the verb form of a clausal complement can be required to be a bare infinitive as in They demanded that he be there.
- tense: not only the verb form, but also the tense of a verb phrase complement can be lexically governed, e.g., to be a participle in the past tense as in They had it removed.

[12] Control or raising specify the co-reference between the implicit subject of the infinitival argument and syntactic arguments in the main clause, either the subject (subject control or raising) or direct object (object control or raising).
trol or raising) or direct object (object control or raising). the past tense as in They had it removed.

553
Morphosyntactic attributes and values NP PP VP C
case: nominative, genitive, dative, accusative x x
determiner: possessive, indefinite x x
number: singular, plural x
verbForm: toInfinitive, bareInfinitive, ingForm(!), Participle x x
tense: present, past x
complementizer: thatType, whType, yesNoType x
prepositionType: external ontological type, e.g. locative x x x
preposition: (string) (!) x x x
lexeme: (string) (!) x x

Table 2: Morphosyntactic attributes of SyntacticArgument and phrase types for which the attributes are
appropriate (NP: noun phrase, PP: prepositional phrase, VP: verb phrase, C: clause). Language-specific attributes
are marked by (!).

3 Utilizing Subcat-LMF

3.1 Standardizing large-scale lexicons

Lexicon Data: We converted VerbNet (VN) and two German lexicons, i.e., GermaNet (GN) and a subset of IMSlex (ILS), to Subcat-LMF format. ILS has been developed independently from GN and the lexicon data were published in Eckle-Kohler (1999).

VN is organized in verb classes based on Levin-style syntactic alternations (Levin, 1993): verbs with common SCFs and syntactic alternation behavior that also share common semantic roles are grouped into classes. VN (version 3.1) lists 568 frames that are encoded as phrase structure rules (XML element SYNTAX), specifying phrase types and semantic roles of the arguments, as well as selectional, syntactic and morphosyntactic restrictions on the arguments. Additionally, a descriptive specification of each frame is given (XML element DESCRIPTION). The verb learn, for instance, has the following VN frame:
DESCRIPTION (primary): NP V NP
SYNTAX: Agent V Topic
We extracted both the descriptive specifications and the phrase structure rules, using the API available for VN [13], resulting in 682 unique VN frames. [14]

GN provides detailed SCFs for verbs, in contrast to the Princeton WordNet: GN version 6.0 from April 2011, accessed by the GN API [15], lists 202 frames. GN SCFs are represented as a dot-separated sequence of letter pairs. Each letter pair specifies a syntactic argument: the first letter encodes the grammatical function and the second letter the syntactic category. [16] For instance, the following shows the GN code for transitive verbs: NN.AN.

ILS is represented in delimiter-separated values format and contains 784 verbs in total. Of these 784 verbs, 740 of them are also present in GN, and 44 are listed in ILS only. Although ILS contains only verbs that take clausal arguments and verb phrase arguments, a total number of 220 SCFs is present in ILS, also including SCFs without clausal and verb phrase arguments. ILS lists for each verb lemma a number of SCFs, thus specifying coarse-grained verb senses given by a lemma-SCF pair. [17] The SCFs are represented as parenthesized lists. For instance, the ILS SCF for transitive verbs is: (subj(NPnom),obj(NPacc)).

Automatic Conversion: We implemented Java tools for the conversion of VN, GN and ILS to Subcat-LMF. These tools convert the source lexicons based on a manual mapping of lexicon units and terms (e.g., VN verb class, GN synset) to Subcat-LMF. For the majority of SCFs, this mapping is defined on argument level. Lexical data is extracted from the source lexicons by using the native APIs (VN, GN) and additional Perl scripts.

[13] http://verbs.colorado.edu/verb-index/inspector/
[14] The VN API was used with the view options wrexyzsq for verb frame pairs and ctuqw for verb class information.
[15] GermaNet Java API 2.0.2
[16] See http://www.sfs.uni-tuebingen.de/GermaNet/verb_frames.shtml
[17] In addition, ILS provides a semantic class label for each verb; however, these semantic labels are attached at lemma level, i.e. they need to be disambiguated.
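The GN frame codes just described can be unpacked mechanically. The sketch below only illustrates the idea: the letter-to-value mappings shown are assumptions made for this example, not the official GermaNet inventory (see footnote [16] for the authoritative documentation).

```python
# Hypothetical decoding of a GN frame code such as "NN.AN" (transitive verbs):
# each dot-separated letter pair is one syntactic argument; the first letter
# encodes the grammatical function, the second the syntactic category.
# Both mapping tables below are assumptions for illustration only.
ASSUMED_FUNCTION = {"N": "subject", "A": "directObject"}
ASSUMED_CATEGORY = {"N": "nounPhrase"}

def decode_gn_frame(code):
    arguments = []
    for pair in code.split("."):
        function_letter, category_letter = pair[0], pair[1]
        arguments.append({
            "grammaticalFunction": ASSUMED_FUNCTION.get(function_letter, function_letter),
            "syntacticCategory": ASSUMED_CATEGORY.get(category_letter, category_letter),
        })
    return arguments

print(decode_gn_frame("NN.AN"))
# [{'grammaticalFunction': 'subject', 'syntacticCategory': 'nounPhrase'},
#  {'grammaticalFunction': 'directObject', 'syntacticCategory': 'nounPhrase'}]
```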

# LexicalEntry # Sense # Subcat.Frame # SemanticPred.
LMF-VN 3962 31891 284 617
orig. VN (3962 verbs) (31891 groups of verb, frame, sem.pred.) (568 frames) (572 sem. Pred.)
LMF-GN 8626 12981 147 84
orig. GN (8626 verbs) (12981 verb-synset pairs) (202 GN frames) (no sem. Pred.)
LMF-ILS 784 3675 217 10
orig. ILS (784 verbs) (3675 verb-frame pairs) (220 SCFs) (no sem. Pred.)

Table 3: Evaluation of the automatic conversion. Numbers of Subcat-LMF instances in the converted lexicons
compared to numbers of corresponding units in original lexicons.

Evaluation of Automatic Conversion: Table 3 shows the mapping of the major source lexicon units (such as verb-synset pairs) to Subcat-LMF and lists the corresponding numbers of units.

For VN, groups of VN verb, frame and semantic predicate have been mapped to LMF senses. VN classes have been mapped to SubcategorizationFrameSet. Thus, the original VN-sense, a pairing of verb lemma and class, can be recovered by grouping LMF senses that share the same verb class. There is a significant difference between the original VN frames and their Subcat-LMF representation: the semantic information present in VN frames (semantic roles and selectional restrictions) is mapped to semantic arguments in Subcat-LMF, i.e. the mapping splits VN frames into a purely syntactic and a purely semantic part. Consequently, the number of unique SCFs in the Subcat-LMF version of VN is much smaller than the number of frames in the original VN. The conversion tool creates for each sense (specifying a unique verb, frame, semantic predicate combination) a SynSemCorrespondence. On the other hand, the Subcat-LMF version of VN contains more semantic predicates than VN. This is due to selectional restrictions for semantic arguments that are specified in Subcat-LMF within semantic predicates, in contrast to VN.

For GN, verb-synset pairs (i.e., GN lexical units) have been mapped to LMF senses. Few GN frame codes also specify semantic role information, e.g. manner, location. These were mapped to the semantics part of Subcat-LMF, resulting in 84 semantic predicates that encode the semantic role information in their semantic arguments.

ILS specifies similar semantic role information as GN; these few cases were mapped in the same way as for GN. Therefore, the LMF version of ILS, too, specifies fewer SCFs, but additional semantic predicates not present in the original.

Discussion: Grammatical functions of arguments are specified distinctly in the three lexicons. While both GN and ILS specify grammatical functions, they are not explicitly encoded in VN. They have to be inferred on the basis of the phrase structure rules given in the SYNTAX element. We assigned subject to the noun phrase which directly precedes the verb and directObject to the noun phrase directly following the verb and having the semantic role Patient. The semantic role information has to be considered at this point, because not all noun phrase arguments are able to become the subject in a corresponding passive sentence. An example is the verb learn which has the VN frame NP(Agent) V NP(Topic); here, the Topic-NP is not able to become the subject of a corresponding passive sentence. We assigned the grammatical function complement to all other phrase types.

Argument order constraints in SCFs are represented in LMF by a list implementation of syntactic arguments. Most SCFs from VN require the subject to be the first argument, reflecting the basic word order in English sentences. VN lists one exception to this rule for the verb appear, illustrated by the example On the horizon appears a ship.

Argument optionality in VN is expressed at the semantic level and at the syntactic level in parallel: it is explicitly specified at the semantic level and implicitly specified at the syntactic level. At the syntactic level, two SCF versions exist in VN, one with the optional argument, the other without it. In addition, the semantic predicate attached to
these SCFs marks optional (semantic) arguments by a ?-sign. GN, on the other hand, expresses argument optionality at the level of syntactic arguments, i.e., within the frame code. In Subcat-LMF, optionality is represented at the syntactic level by an (optional) attribute optional for syntactic arguments, thus reflecting the explicit representation used in GN and the implicit representation present in VN. [18]

GN frames specify syntactic alternations of argument realizations, e.g. adverbial complements that can alternatively be realized as adverb phrase, prepositional phrase or noun phrase. We encoded this generalization in Subcat-LMF by introducing attribute values for these aggregated syntactic categories.

3.2 Cross-lingual comparison of lexicons

Lexicons that are standardized according to Subcat-LMF can be quantitatively compared regarding SCFs. For two lexicons, such a comparison gives answers to questions, such as: how many SCFs are present in both lexicons (overlapping SCFs), how many SCFs are only listed in one of the lexicons (complementary SCFs). Answers to these questions are important, for instance, for assessing the potential gain in SCF coverage that can be achieved by lexicon merging.

In order to validate our claim that Subcat-LMF yields a cross-lingually uniform SCF representation, we contrast the monolingual comparison of GN and ILS with the cross-lingual comparisons of VN-GN and VN-ILS. Assuming that our claim is valid, the cross-lingual comparisons can be expected to yield similar results regarding overlapping and complementary SCFs as the monolingual comparison.

Comparison: The comparison of SCFs from two lexicons that are in Subcat-LMF format can be performed on the basis of the uniform DCs. As Subcat-LMF is implemented in XML, we compared string representations of SCFs. SCFs from VN, GN and ILS were converted to strings by concatenating attribute values of syntactic arguments and lexemeProperty. We created string representations of different granularities: First, fine-grained, language-specific string SCFs have been generated by concatenating all attribute values apart from the attribute optional, which is specific to GN (resulting in a considerably smaller number of SCFs in GN). Second, fine-grained, but cross-lingual string SCFs were considered; these omit the attributes case, lexeme, preposition and the attribute value ingForm. Finally, coarse-grained cross-lingual string SCFs were compared. These only contain the values of the attributes syntacticCategory, complementizer and verbForm (without the attribute value ingForm). For instance, a coarse cross-lingual string SCF for transitive verbs is nounPhrase nounPhrase.

Table 4 lists the results of our quantitative comparison. For each lexicon pair, the number of overlapping SCFs and the numbers of complementary SCFs are given. Regarding VN and the German lexicons, the overlap at the language-specific level is (close to) zero, which is due to the specification of case, e.g. dative, for German arguments. However, the numbers for cross-lingual SCFs clearly validate our claim: the numbers of overlapping SCFs for the German lexicon pair and for the two German-English pairs are comparable, ranging from 12 to 18 for the fine-grained SCFs and from 20 to 21 for the coarse SCFs.

Based on the sets of cross-lingually overlapping SCFs, we made an estimation on how many high frequent verbs actually have SCFs that are in the cross-lingual SCF overlap of an English-German lexicon pair. For this, we used the lemma frequency lists of the English and German WaCky corpora (Baroni et al., 2009) and extracted verbs from VN, GN and ILS that are on 100 top ranked positions of these lists, starting from rank 100. [19] Table 5 shows the results for the cross-lingual SCF overlap between VN-GN and between VN-ILS. While only around 40% of the high frequent verbs have an SCF in the fine-grained SCF overlap, more than 70% are in the coarse overlap between VN-GN, and even more than 80% in the coarse overlap between VN-ILS.

Analysis of results: The small numbers of overlapping cross-lingual SCFs (relative to the total number of SCFs), at both levels of granularity, indicate that the three lexicons each encode substantially different lexical-syntactic properties of

[18] As a consequence, all semantic arguments specified in the Subcat-LMF version of VN have a corresponding syntactic argument.
[19] Since the WaCky frequency lists do not contain POS information, our lists of extracted verbs contain some noise, which we tolerated, because we aimed at an approximate estimate.
language-specific cross-lingual cross-lingual
(fine-grained) (fine-grained) (coarse)
GN vs. ILS 72 GN, 21 both, 196 ILS 61 GN, 23 both, 69 ILS 40 GN, 24 both, 23 ILS
VN vs. GN 284 VN, 0 both, 93 GN 96 VN, 15 both, 69 GN 29 VN, 24 both, 40 GN
VN vs. ILS 283 VN, 1 both, 216 ILS 93 VN, 18 both, 74 ILS 31 VN, 22 both, 25 ILS

Table 4: Comparison of lexicon pairs regarding SCF overlap and complementary SCFs.

VN-GN overlap VN-GN overlap VN-ILS overlap VN-ILS overlap


fine-grained (15 SCFs) coarse (24 SCFs) fine-grained (18 SCFs) coarse (22 SCFs)
43% VN verbs 85% VN verbs 41% VN verbs 84% VN verbs
41% GN verbs 71% GN verbs 43% ILS verbs 87% ILS verbs

Table 5: Percentage of 100 high frequent verbs from VN, GN, ILS with a SCF in the cross-lingual SCF overlap
(fine-grained vs. coarse) between VN-GN and VN-ILS.
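Overlap and complementarity counts of the kind reported in Table 4 can be computed directly from the standardized SCF strings. The following is a rough sketch under assumed data structures (SCFs as lists of attribute-value dicts); it is not the published conversion code, and the ingForm exclusion mentioned above is omitted for brevity.

```python
# Sketch: build SCF strings at different granularities and count overlapping /
# complementary SCFs for a lexicon pair. The dict-based layout is an assumption
# for illustration; Subcat-LMF itself is an XML format.

CROSS_LINGUAL_DROP = {"case", "lexeme", "preposition"}             # fine-grained, cross-lingual
COARSE_KEEP = {"syntacticCategory", "complementizer", "verbForm"}  # coarse, cross-lingual

def scf_to_string(scf, drop=frozenset(), keep=None):
    parts = []
    for argument in scf:
        items = sorted((k, v) for k, v in argument.items()
                       if k not in drop and (keep is None or k in keep))
        parts.append(",".join(f"{k}={v}" for k, v in items))
    return ";".join(parts)

def compare(lexicon_a, lexicon_b, **granularity):
    strings_a = {scf_to_string(scf, **granularity) for scf in lexicon_a}
    strings_b = {scf_to_string(scf, **granularity) for scf in lexicon_b}
    return len(strings_a - strings_b), len(strings_a & strings_b), len(strings_b - strings_a)

# A German transitive SCF (with case) vs. an English one (without case):
gn_transitive = [{"grammaticalFunction": "subject", "syntacticCategory": "nounPhrase", "case": "nominative"},
                 {"grammaticalFunction": "directObject", "syntacticCategory": "nounPhrase", "case": "accusative"}]
vn_transitive = [{"grammaticalFunction": "subject", "syntacticCategory": "nounPhrase"},
                 {"grammaticalFunction": "directObject", "syntacticCategory": "nounPhrase"}]

print(compare([gn_transitive], [vn_transitive]))                           # (1, 0, 1): no language-specific overlap
print(compare([gn_transitive], [vn_transitive], drop=CROSS_LINGUAL_DROP))  # (0, 1, 0): cross-lingual overlap
```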

verbs. This can at least partly be explained by the historic development of these lexicons in different contexts, e.g., Levin's work on verb classes (VN), Lexical Functional Grammar (ILS), as well as their use for different purposes and applications.

Another reason for the small SCF overlap is the comparison of strings derived from the XML format. A more sophisticated representation format, notably one that provides semantic typing and type hierarchies, e.g., OWL, could be employed to define hierarchies of grammatical functions (e.g. direct object would be a sub-type of complement) and other attributes. These would presumably support the identification of further overlapping SCFs.

During a subsequent qualitative analysis of the overlapping and complementary SCFs, we collected some enlightening background information. Overlapping SCFs in the cross-lingual comparison (both fine-grained and coarse) include prominent SCFs corresponding to transitive and intransitive verbs, as well as verbs with that-clause and verbs with to-infinitive.

GN and ILS are highly complementary regarding SCFs: for instance, while many SCFs with adverbial arguments are unique to GN, only ILS provides a fine-grained specification of prepositional complements including the preposition, as well as the case the preposition requires. [20] VN, too, contains a large number of SCFs with a detailed specification of possible prepositions, partly specified as language-independent preposition types.

A large number of complementary SCFs in VN vs. GN and GN vs. ILS are due to a diverging linguistic analysis of extraposed subject clauses with an es (it) in the main clause (e.g., It annoys him that the train is late.). In GN, such clauses are not specified as subject, whereas in VN and ILS they are.

Regarding VN and ILS, only VN lists subject control for verbs, while both VN and ILS list object control and subject raising. GN, on the other hand, does not specify control or raising at all.

4 Discussion

4.1 Previous Work

Merging SCFs: Previous work on merging SCF lexicons has only been performed in a monolingual setting and lacks the use of standards. King and Crouch (2005) describe the process of unifying several large-scale verb lexicons for English, including VN and WordNet. They perform a conversion of these lexicons into a uniform, but non-standard representation format, resulting in a lexicon which is integrated at the level of verb senses, SCFs and lexical-semantics. Thus, the result of their work is not applicable to cross-lingual settings.

Necsulescu et al. (2011) and Padro et al. (2011) report on approaches to automatic merging of two Spanish SCF lexicons. As these lexicons lack sense information apart from the SCFs, their merging approach only works on a very coarse-grained sense level given by lemma-SCF pairs. The fully automatic merging approach described

[20] In German, prepositions govern the case of their noun phrase.
in (Padro et al., 2011) assumes that one of the lexicons to be integrated is already represented in the target representation format, i.e. given two lexicons, they map one lexicon to the format of the other. Moreover, their approach requires a significant overlap of SCFs and verbs in any two lexicons to be merged. The authors state that it is presently unclear how much overlap is required to obtain sufficiently precise merging results.

Standardizing SCFs: Much previous work on standardizing NLP lexicons in LMF has focused on WordNet-like resources. Soria et al. (2009) describe WordNet-LMF, an LMF model for representing wordnets which has been used in the KYOTO project. [21] Later, WordNet-LMF has been adapted by Henrich and Hinrichs (2010) to GermaNet and by Toral et al. (2010) to the Italian WordNet. WordNet-LMF does not provide the possibility to represent subcategorization at all. The adaptation of WordNet-LMF to GN (Henrich and Hinrichs, 2010) allows SCFs to be represented as string values. However, this extension is not sufficient, because it provides no means to model the syntax-semantics interface, which specifies correspondences between syntactic and semantic arguments of verbs and other predicates. Quochi et al. (2008) report on an LMF model that covers the syntax-semantics mapping just mentioned; it has been used for standardizing an Italian domain-specific lexicon. Buitelaar et al. (2009) describe LexInfo, an LMF-model that is used for lexicalizing ontologies. LexInfo is implemented in OWL and specifies a linking of syntactic and semantic arguments. For SCFs and arguments, a type hierarchy is defined. In their paper, Buitelaar et al. (2009) show only few SCFs and do not indicate what kinds of SCFs can be represented with LexInfo in principle. On the LexInfo website [22], the current LexInfo version 2.0 can be viewed, but no further documentation is given. We inspected LexInfo version 2.0 and found that it specifies a large number of fine-grained SCFs. However, LexInfo has not been evaluated so far on large-scale SCF lexicons, such as VerbNet.

4.2 Subcat-LMF

Subcat-LMF enables the uniform representation of fine-grained SCFs across the two languages English and German. By mapping large-scale SCF lexicons to Subcat-LMF, we have demonstrated its usability for uniformly representing a wide range of SCFs and other lexical-syntactic information types in English and German.

As our cross-lingual comparison of lexicons has revealed many complementary SCFs in VN, GN and ILS, mono- and cross-lingual alignments of these lexicons at sense level would lead to a major increase in SCF coverage. Moreover, the cross-lingually uniform representation of SCFs can be exploited for an additional alignment of the lexicons at the level of SCF arguments. Such a fine-grained alignment of SCFs can be used, for instance, to project VN semantic roles to GN, thus yielding a German resource for semantic role labeling (see Gildea and Jurafsky (2002), Swier and Stevenson (2005)).

Subcat-LMF could be used for standardizing further English and German lexicons. The automatic conversion of lexicons to Subcat-LMF requires the manual definition of a mapping, at least for syntactic arguments. Furthermore, the automatic merging approach by Padro et al. (2011) could be tested for English: given our standardized version of VN, other English SCF lexicons could be merged fully automatically with the Subcat-LMF version of VN.

5 Conclusion

Subcat-LMF contributes to fostering the standardization of language resources and their interoperability at the lexical-syntactic level across English and German. The Subcat-LMF DTD including links to ISOCat, all conversion tools, and the standardized versions of VN and ILS [23] are publicly available at http://www.ukp.tu-darmstadt.de/data/uby.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806. We thank the anonymous reviewers for their valuable comments. We also thank Dr. Jungi Kim and Christian M. Meyer for their contributions to this paper, and Yevgen Chebotar and Zijad Maksuti for their contributions to the conversion software.

[21] http://www.kyoto-project.eu/
[22] See http://lexinfo.net/
[23] The converted version of GN cannot be made available due to licensing.
References

Galen Andrew, Trond Grenager, and Christopher D. Manning. 2004. Verb sense and subcategorization: using joint inference to improve performance on complementary tasks. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 150-157, Barcelona, Spain.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209-226.

Daan Broeder, Marc Kemps-Snijders, Dieter Van Uytvanck, Menzo Windhouwer, Peter Withers, Peter Wittenburg, and Claus Zinn. 2010. A Data Category Registry- and Component-based Metadata Framework. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), pages 43-47, Valletta, Malta.

Susan Windisch Brown, Dmitriy Dligach, and Martha Palmer. 2011. VerbNet Class Assignment as a WSD Task. In Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 85-94, Oxford, UK.

Paul Buitelaar, Philipp Cimiano, Peter Haase, and Michael Sintek. 2009. Towards Linguistically Grounded Ontologies. In Lora Aroyo, Paolo Traverso, Fabio Ciravegna, Philipp Cimiano, Tom Heath, Eero Hyvonen, Riichiro Mizoguchi, Eyal Oren, Marta Sabou, and Elena Simperl, editors, The Semantic Web: Research and Applications, pages 111-125, Berlin Heidelberg. Springer-Verlag.

Aljoscha Burchardt, Katrin Erk, Anette Frank, Andrea Kowalski, Sebastian Pado, and Manfred Pinkal. 2006. The SALSA Corpus: a German Corpus Resource for Lexical Semantics. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pages 969-974, Genoa, Italy.

Nicoletta Calzolari and Monica Monachini. 1996. EAGLES Proposal for Morphosyntactic Standards: in view of a ready-to-use package. In G. Perissinotto, editor, Research in Humanities Computing, volume 5, pages 48-64. Oxford University Press, Oxford, UK.

Tejaswini Deoskar. 2008. Re-estimation of lexical parameters for treebank PCFGs. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING), pages 193-200, Manchester, United Kingdom.

Judith Eckle-Kohler, Iryna Gurevych, Silvana Hartmann, Michael Matuschek, and Christian M. Meyer. 2012. UBY-LMF - A Uniform Format for Standardizing Heterogeneous Lexical-Semantic Resources in ISO-LMF. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), page (to appear), Istanbul, Turkey.

Judith Eckle-Kohler. 1999. Linguistisches Wissen zur automatischen Lexikon-Akquisition aus deutschen Textcorpora. Logos-Verlag, Berlin, Germany. PhD Thesis.

Gil Francopoulo, Nuria Bel, Monte George, Nicoletta Calzolari, Monica Monachini, Mandy Pet, and Claudia Soria. 2006. Lexical Markup Framework (LMF). In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pages 233-236, Genoa, Italy.

Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28:245-288, September.

Ralph Grishman, Catherine Macleod, and Adam Meyers. 1994. Comlex Syntax: Building a Computational Lexicon. In Proceedings of the 15th International Conference on Computational Linguistics (COLING), pages 268-272, Kyoto, Japan.

Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer, and Christian Wirth. 2012. Uby - A Large-Scale Unified Lexical-Semantic Resource. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), page (to appear), Avignon, France.

Verena Henrich and Erhard Hinrichs. 2010. Standardizing wordnets in the ISO standard LMF: Wordnet-LMF for GermaNet. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 456-464, Beijing, China.

Nancy Ide and James Pustejovsky. 2010. What Does Interoperability Mean, anyway? Toward an Operational Definition of Interoperability. In Proceedings of the Second International Conference on Global Interoperability for Language Resources, Hong Kong.

Tracy Holloway King and Dick Crouch. 2005. Unifying lexical resources. In Proceedings of the Interdisciplinary Workshop on the Identification and Representation of Verb Features and Verb Classes, Saarbruecken, Germany.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A Large-scale Classification of English Verbs. Language Resources and Evaluation, 42:21-40.

Manfred Klenner. 2007. Shallow dependency labeling. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), Companion Volume Proceedings of the Demo and Poster Sessions, pages 201-204, Prague, Czech Republic.

Claudia Kunze and Lothar Lemnitzer. 2002. GermaNet - representation, visualization, application. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), pages 1485-1491, Las Palmas, Canary Islands, Spain.

Beth Levin. 1993. English Verb Classes and Alternations. The University of Chicago Press, Chicago, USA.

Christian M. Meyer and Iryna Gurevych. 2011. What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pages 883-892, Chiang Mai, Thailand.

Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 216-225, Uppsala, Sweden.

Silvia Necsulescu, Nuria Bel, Muntsa Padro, Montserrat Marimon, and Eva Revilla. 2011. Towards the Automatic Merging of Language Resources. In Proceedings of the 2011 ESSLLI Workshop on Lexical Resources (WoLeR 2011), Ljubljana, Slovenia.

Elisabeth Niemann and Iryna Gurevych. 2011. The People's Web meets Linguistic Knowledge: Automatic Sense Alignment of Wikipedia and WordNet. In Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 205-214, Oxford, UK.

Muntsa Padro, Nuria Bel, and Silvia Necsulescu. 2011. Towards the Automatic Merging of Lexical Resources: Automatic Mapping. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 296-301, Hissar, Bulgaria.

Valeria Quochi, Monica Monachini, Riccardo Del Gratta, and Nicoletta Calzolari. 2008. A lexicon for biology and bioinformatics: the BOOTStrep experience. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 2285-2292, Marrakech, Morocco, May.

Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. 2010. FrameNet II: Extended Theory and Practice, September.

Lei Shi and Rada Mihalcea. 2005. Putting pieces together: Combining FrameNet, VerbNet and WordNet for robust semantic parsing. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), pages 100-111, Mexico City, Mexico.

Anthony Sigogne, Matthieu Constant, and Eric Laporte. 2011. Integration of data from a syntactic lexicon into generative and discriminative probabilistic parsers. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 363-370, Hissar, Bulgaria.

Claudia Soria, Monica Monachini, and Piek Vossen. 2009. Wordnet-LMF: fleshing out a standardized format for Wordnet interoperability. In Proceedings of the 2009 International Workshop on Intercultural Collaboration, pages 139-146, Palo Alto, California, USA.

Robert S. Swier and Suzanne Stevenson. 2005. Exploiting a verb lexicon in automatic semantic role labelling. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT '05), pages 883-890, Vancouver, British Columbia, Canada.

Antonio Toral, Stefania Bracale, Monica Monachini, and Claudia Soria. 2010. Rejuvenating the Italian WordNet: upgrading, standardising, extending. In Proceedings of the 5th Global WordNet Conference, Bombay, India.
The effect of domain and text type on text prediction quality

Suzan Verberne, Antal van den Bosch, Helmer Strik, Lou Boves
Centre for Language Studies
Radboud University Nijmegen
s.verberne@let.ru.nl

Abstract

Text prediction is the task of suggesting text while the user is typing. Its main aim is to reduce the number of keystrokes that are needed to type a text. In this paper, we address the influence of text type and domain differences on text prediction quality. By training and testing our text prediction algorithm on four different text types (Wikipedia, Twitter, transcriptions of conversational speech and FAQ) with equal corpus sizes, we found that there is a clear effect of text type on text prediction quality: training and testing on the same text type gave percentages of saved keystrokes between 27 and 34%; training on a different text type caused the scores to drop to percentages between 16 and 28%.

In our case study, we compared a number of training corpora for a specific data set for which training data is sparse: questions about neurological issues. We found that both text type and topic domain play a role in text prediction quality. The best performing training corpus was a set of medical pages from Wikipedia. The second-best result was obtained by leave-one-out experiments on the test questions, even though this training corpus was much smaller (2,672 words) than the other corpora (1.5 Million words).

1 Introduction

Text prediction is the task of suggesting text while the user is typing. Its main aim is to reduce the number of keystrokes that are needed to type a text, thereby saving time. Text prediction algorithms have been implemented for mobile devices, office software (Open Office Writer), search engines (Google query completion), and in special-needs software for writers who have difficulties typing (Garay-Vitoria and Abascal, 2006). In most applications, the scope of the prediction is the completion of the current word; hence the often-used term word completion.

The most basic method for word completion is checking after each typed character whether the prefix typed since the last whitespace is unique according to a lexicon. If it is, the algorithm suggests to complete the prefix with the lexicon entry. The algorithm may also suggest to complete a prefix even before the word's uniqueness point is reached, using statistical information on the previous context. Moreover, it has been shown that significantly better prediction results can be obtained if not only the prefix of the current word is included as previous context, but also previous words (Fazly and Hirst, 2003) or characters (Van den Bosch and Bogers, 2008).

In the current paper, we follow up on this work by addressing the influence of text type and domain differences on text prediction quality. Brief messages on mobile devices (such as text messages, Twitter and Facebook updates) are of a different style and lexicon than documents typed in office software (Westman and Freund, 2010). In addition, the topic domain of the text also influences its content. These differences may cause an algorithm trained on one text type or domain to perform poorly on another.

The questions that we aim to answer in this paper are (1) "What is the effect of text type differences on the quality of a text prediction algorithm?" and (2) "What is the best choice of training data if domain- and text type-specific data is sparse?". To answer these questions, we perform three experiments:

1. A series of within-text type experiments on four different types of Dutch text: Wikipedia articles, Twitter data, transcriptions of
conversational speech and web pages of Frequently Asked Questions (FAQ).

2. A series of across-text type experiments in which we train and test on different text types;

3. A case study using texts from a specific domain and text type: questions about neurological issues. Training data for this combination of language (Dutch), text type (FAQ) and domain (medical/neurological) is sparse. Therefore, we search for the type of training data that gives the best prediction results for this corpus. We compare the following training corpora:
- The corpora that we compared in the text type experiments: Wikipedia, Twitter, Speech and FAQ, 1.5 Million words per corpus.
- A 1.5 Million words training corpus that is of the same domain as the target data: medical pages from Wikipedia;
- The 359 questions from the neuro-QA data themselves, evaluated in a leave-one-out setting (359 times training on 358 questions and evaluating on the remaining questions).

The prospective application of the third series of experiments is the development of a text prediction algorithm in an online care platform: an online community for patients seeking information about their illness. In this specific case the target group is patients with language disabilities due to neurological disorders.

The remainder of this paper is organized as follows: In Section 2 we give a brief overview of text prediction methods discussed in the literature. In Section 3 we present our approach to text prediction. Sections 4 and 5 describe the experiments that we carried out and the results we obtained. We phrase our conclusions in Section 6.

2 Text prediction methods

Text prediction methods have been developed for several different purposes. The older algorithms were built as communicative devices for people with disabilities, such as motor and speech impairments. More recently, text prediction is developed for writing with reduced keyboards, specifically for writing (composing messages) on mobile devices (Garay-Vitoria and Abascal, 2006).

All modern methods share the general idea that previous context (which we will call the buffer) can be used to predict the next block of characters (the predictive unit). If the user gets correct suggestions for continuation of the text then the number of keystrokes needed to type the text is reduced. The unit to be predicted by a text prediction algorithm can be anything ranging from a single character (which actually does not save any keystrokes) to multiple words. Single words are the most widely used as prediction units because they are recognizable at a low cognitive load for the user, and word prediction gives good results in terms of keystroke savings (Garay-Vitoria and Abascal, 2006).

There is some variation among methods in the size and type of buffer used. Most methods use character n-grams as buffer, because they are powerful and can be implemented independently of the target language (Carlberger, 1997). In many algorithms the buffer is cleared at the start of each new word (making the buffer never larger than the length of the current word). In the paper by (Van den Bosch and Bogers, 2008), two extensions to the basic prefix-model are compared. They found that an algorithm that uses the previous n characters as buffer, crossing word borders without clearing the buffer, performs better than both a prefix character model and an algorithm that includes the full previous word as feature. In addition to using the previously typed characters and/or words in the buffer, word characteristics such as frequency and recency could also be taken into account (Garay-Vitoria and Abascal, 2006).

Possible evaluation measures for text prediction are the proportion of words that are correctly predicted, the percentage of keystrokes that could maximally be saved (if the user would always make the correct decision), and the time saved by the use of the algorithm (Garay-Vitoria and Abascal, 2006). The performance that can be obtained by text prediction algorithms depends on the language they are evaluated on. Lower results are obtained for higher-inflected languages such as German than for low-inflected languages such as English (Matiasek et al., 2002). In their overview of text prediction systems, (Garay-Vitoria and Abascal, 2006) report performance scores ranging from 29% to 56% of keystrokes saved.

An important factor that is known to influence the quality of text prediction systems is training
set size (Lesher et al., 1999; Van den Bosch, 3.1 Evaluation
2011). The paper by (Van den Bosch, 2011) shows We evaluate our algorithms on corpus data. This
log-linear learning curves for word prediction (a means that we have to make assumptions about
constant improvement each time the training cor- user behaviour. We assume that the user confirms
pus size is doubled), when the training set size is a suggested word as soon as it is suggested cor-
increased incrementally from 102 to 3107 words. rectly, not typing any additional characters before
confirming. We evaluate our text prediction al-
3 Our approach to text prediction
gorithms in terms of the percentage of keystrokes
We implement a text prediction algorithm for saved K:
Dutch, which is a productive compounding lan-
guage like German, but has a somewhat simpler Pn Pn
inflectional system. We do not focus on the effect i=0 (Fi ) i=0 (Wi )
K= Pn 100 (1)
of training set size, but on the effect of text type i=0 (Fi )
and topic domain differences. in which n is the number of words in the test
Our approach to text prediction is largely in- set, Wi is the number of keystrokes that have been
spired by (Van den Bosch and Bogers, 2008). We typed before the word i is correctly suggested
experiment with two different buffer types that are and Fi is the number of keystrokes that would be
based on character n-grams: needed to type the complete word i. For example,
Prefix of current word contains all char- our algorithm correctly predicts the word niveau
acters of only the word currently keyed in, after the context i n g t o t e e n n i
where the buffer shifts by one character posi- v in the test set. Assuming that the user confirms
tion with every new character. the word niveau at this point, three keystrokes
were needed for the prefix niv. So, Wi = 3 and
Buffer15 buffer also includes any other Fi = 6. The number of keystrokes needed for
characters keyed in belonging to previously whitespace and punctuation are unchanged: these
keyed-in words. have to be typed anyway, independently of the
Modeling character history beyond the current support by a text prediction algorithm.
word can naturally be done with a buffer model in 4 Text type experiments
which the buffer shifts by one position per charac-
ter, while a typical left-aligned prefix model (that In this section, we describe the first and second se-
never shifts and fixes letters to their positional fea- ries of experiments. The case study on questions
ture) would not be able to do this. from the neurological domain is described in Sec-
4 Text type experiments

In this section, we describe the first and second series of experiments. The case study on questions from the neurological domain is described in Section 5.

4.1 Data

In the text type experiments, we evaluate our text prediction algorithm on four different types of Dutch text: Wikipedia, Twitter data, transcriptions of conversational speech, and web pages of Frequently Asked Questions (FAQ). The Wikipedia corpus that we use is part of the Lassy corpus (Van Noord, 2009); we obtained a version from the summer of 2010.[1] The Twitter data are collected continuously and automatically filtered for language by Erik Tjong Kim Sang (Tjong Kim Sang, 2011). We used the tweets from all users that posted at least 19 tweets (excluding retweets) during one day in June 2011. This is a set of 1 Million Twitter messages from 30,000

[1] http://www.let.rug.nl/vannoord/trees/Treebank/Machine/NLWIKI20100826/COMPACT/

t tot
t o tot
t o t tot
e een
e e een
e e n een
n niveau
n i niveau
n i v niveau
n i v e niveau
n i v e a niveau
n i v e a u niveau

Figure 1: Example of buffer type Prefix for the text fragment (elke verkiezing) tot een niveau. Un-
derscores represent whitespaces.
l k e v e r k i e z i n g tot
k e v e r k i e z i n g t tot
e v e r k i e z i n g t o tot
v e r k i e z i n g t o t tot
v e r k i e z i n g t o t een
e r k i e z i n g t o t e een
r k i e z i n g t o t e e een
k i e z i n g t o t e e n een
i e z i n g t o t e e n niveau
e z i n g t o t e e n n niveau
z i n g t o t e e n n i niveau
i n g t o t e e n n i v niveau
n g t o t e e n n i v e niveau
g t o t e e n n i v e a niveau
t o t e e n n i v e a u niveau

Figure 2: Example of buffer type Buffer15 for the text fragment (elke verkiezing) tot een niveau.
Underscores represent whitespaces.
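As an illustration of how instances like those in Figures 1 and 2 can be generated, the sketch below builds Prefix- and Buffer15-style (features, label) pairs from a text fragment. It is our own minimal reconstruction of the windowing idea described above, not the authors' code; the IGTree/TiMBL training step is not shown.

    def prefix_instances(text):
        """(prefix of the current word, word to predict) pairs, as in Figure 1."""
        for token in text.split():
            for i in range(1, len(token) + 1):
                yield token[:i], token

    def buffer15_instances(text, size=15):
        """(last `size` characters typed, word to predict) pairs, as in Figure 2.
        Whitespace inside the buffer is shown as '_'."""
        typed = ""
        for token in text.split():
            for ch in token:
                typed += ch
                yield typed[-size:].replace(" ", "_"), token
            typed += " "    # the space that ends the token

    fragment = "elke verkiezing tot een niveau"
    for features, label in buffer15_instances(fragment):
        print(features, "->", label)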

different users. The transcriptions of conversational speech are from the Spoken Dutch Corpus (CGN) (Oostdijk, 2000); for our experiments, we only use the category spontaneous speech. We obtained the FAQ data by downloading the first 1,000 pages that Google returns for the query "faq" with the language restriction Dutch. After cleaning the pages from HTML and other coding, the resulting corpus contained approximately 1.7 Million words of questions and answers.

4.2 Within-text type experiments

For each of the four text types, we compare the buffer types Prefix and Buffer15. In each experiment, we use 1.5 Million words from the corpus to train the algorithm and 100,000 words to test it. The results are in Table 1.

4.3 Across-text type experiments

We investigate the importance of text type differences for text prediction with a series of experiments in which we train and test our algorithm on texts of different text types. We keep the size of the train and test sets the same: 1.5 Million words and 100,000 words respectively. The results are in Table 2.

4.4 Discussion of the results

Table 1 shows that for all text types, the buffer of 15 characters that crosses word borders gives better results than the prefix of the current word only. We get a relative improvement of 35% (for FAQ) to 62% (for Speech) of Buffer15 compared to Prefix-only.
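As a quick check of the relative-improvement figures quoted above (using the Prefix and Buffer15 scores from Table 1), a short computation reproduces the 35% and 62% values within rounding; the snippet is ours, for illustration only.

    for name, prefix, buffer15 in [("FAQ", 20.2, 27.2), ("Speech", 20.7, 33.4)]:
        rel = (buffer15 - prefix) / prefix * 100
        print(name, round(rel, 1))   # FAQ 34.7, Speech 61.4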
Table 2 shows that text type differences have an influence on text prediction quality: all across-text type experiments lead to lower results than the within-text type experiments. From the results in Table 2, we can deduce that of the four text types, speech and Twitter language resemble each other more than they resemble the other two, and Wikipedia and FAQ resemble each other more. Twitter and Wikipedia data are the least similar: training on Wikipedia data makes the text prediction score for Twitter data drop from 29.2 to 16.5%.[2]

[2] Note that the results are not symmetric. For example, training on Wikipedia, testing on Twitter gives a different result from training on Twitter, testing on Wikipedia. This is due to the size and domain of the vocabularies in both data sets and the richness of the contexts (in order for the algorithm to predict a word, it has to have seen it in the train set). If the test set has a larger vocabulary than the train set, a lower proportion of words can be predicted than when it is the other way around.

Table 1: Results from the within-text type experiments in terms of percentages of saved keystrokes.
Prefix means: use the previous characters of the current word as features. Buffer 15 means use a buffer
of the previous 15 characters as features.
Prefix Buffer15
Wikipedia 22.2% 30.5%
Twitter 21.3% 29.2%
Speech 20.7% 33.4%
FAQ 20.2% 27.2%

Table 2: Results from the across-text type experiments in terms of percentages of saved keystrokes, using
the best-scoring configuration from the within-text type experiments: a buffer of 15 characters
Trained on Tested on Wikipedia Tested on Twitter Tested on Speech Tested on FAQ
Wikipedia 30.5% 16.5% 22.3% 24.9%
Twitter 17.9% 29.2% 27.9% 20.7%
Speech 19.7% 22.5% 33.4% 21.0%
FAQ 22.6% 18.2% 22.9% 27.2%

5 Case study: questions about neurological issues

Online care platforms aim to bring together patients and experts. Through this medium, patients can find information about their illness, and get in contact with fellow-sufferers. Patients who suffer from neurological damage may have communicative disabilities because their speaking and writing skills are impaired. For these patients, existing online care platforms are often not easily accessible. Aphasia, for example, hampers the exchange of information because the patient has problems with word finding.

In the project Communicatie en revalidatie DigiPoli (ComPoli), language and speech technologies are implemented in the infrastructure of an existing online care platform in order to facilitate communication for patients suffering from neurological damage. Part of the online care platform is a list of frequently asked questions about neurological diseases with answers. A user can browse through the questions using a chat-by-click interface (Geuze et al., 2008). Besides reading the listed questions and answers, the user has the option to submit a question that is not yet included in the list. The newly submitted questions are sent to an expert who answers them and adds both question and answer to the chat-by-click database. In typing the question to be submitted, the user will be supported by a text prediction application.

The aim of this section is to find the best training corpus for newly formulated questions in the neurological domain. We realize that questions formulated by users of a web interface are different from questions formulated by experts for the purpose of a FAQ-list. Therefore, we plan to gather real user data once we have a first version of the user interface running online. For developing the text prediction algorithm that is behind the initial version of the application, we aim to find the best training corpus using the questions from the chat-by-click data as training set.

5.1 Data

The chat-by-click data set on neurological issues consists of 639 questions with corresponding answers. A small sample of the data (translated to English) is shown in Table 3. In order to create the test data for our experiments, we removed duplicate questions from the chat-by-click data, leaving a set of 359 questions.[3]

[3] Some questions and answers are repeated several times in the chat-by-click data because they are located at different places in the chat-by-click hierarchy.

In the previous sections, we used corpora of 100,000 words as test collections and we calculated the percentage of saved keystrokes over the

Table 3: A sample of the neuro-QA data, translated to English.
question 0 505 Can (P)LS be cured?
answer 0 505 Unfortunately, a real cure is not possible. However, things can be done to combat the effects of the
diseases, mainly relieving symptoms such as stiffness and spasticity. The physical therapist and reha-
bilitation specialist can play a major role in symptom relief. Moreover, there are medications that can
reduce spasticity.
question 0 508 How is (P)LS diagnosed?
answer 0 508 The diagnosis PLS is difficult to establish, especially because the symptoms strongly resemble HSP
symptoms (Strumpell's disease). Apart from blood and muscle research, several neurological examina-
tions will be carried out.

Table 4: Results for the neuro-QA questions only in terms of percentages of saved keystrokes, using
different training sets. The text prediction configuration used in all settings is Buffer15. The test samples
are 359 questions with an average length of 7.5 words. The percentages of saved keystrokes are means
over the 359 questions.
Training corpus    # words    Mean % of saved keystrokes in neuro-QA questions (stdev)    OOV-rate
Twitter 1.5 Million 13.3% (12.5) 28.5%
Speech 1.5 Million 14.1% (13.2) 26.6%
Wikipedia 1.5 Million 16.1% (13.1) 19.4%
FAQ 1.5 Million 19.4% (15.6) 20.0%
Medical Wikipedia 1.5 Million 28.1% (16.5) 7.0%
Neuro-QA questions (leave-one-out) 2,672 26.5% (19.9) 17.8%

complete test corpus. In the reality of our case study, however, users will type only brief fragments of text: the length of the question they want to submit. This means that there is potentially a large deviation in the effectiveness of the text prediction algorithm per user, depending on the content of the small text they are typing. Therefore, we decided to evaluate our training corpora separately on each of the 359 unique questions, so that we can report both mean and standard deviation of the text prediction scores on small (realistically sized) samples. The average number of words per question is 7.5; the total size of the neuro-QA corpus is 2,672 words.

5.2 Experiments

We aim to find the training set that gives the best text prediction result for the neuro-QA questions. We compare the following training corpora:

- The corpora that we compared in the text type experiments: Wikipedia, Twitter, Speech and FAQ, 1.5 Million words per corpus;

- A 1.5 Million words training corpus that is of the same topic domain as the target data: Wikipedia articles from the medical domain;

- The 359 questions from the neuro-QA data themselves, evaluated in a leave-one-out setting (359 times training on 358 questions and evaluating on the remaining questions).

In order to create the medical Wikipedia corpus, we consulted the category structure of the Wikipedia corpus. The Wikipedia category Geneeskunde (Medicine) contains 69,898 pages, and in the deeper nodes of the hierarchy we see many non-medical pages, such as trappist beers (ordered under beer, booze, alcohol, Psychoactive drug, drug, and then medicine). If we remove all pages that are more than five levels under the Geneeskunde category root, 21,071 pages are left, which contain well over the 1.5 Million words that we need. We used the first 1.5 Million words of the corpus in our experiments.

The text prediction results for the different corpora are in Table 4. For each corpus, the out-of-vocabulary rate is given: the percentage of words in the Neuro-QA questions that do not occur in the corpus.[4]

[4] The OOV-rate for the Neuro-QA corpus itself is the average of the OOV-rate of each leave-one-out experiment: the proportion of words that only occur in one question.
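A minimal sketch of the two quantities reported in Table 4, per-question keystroke-savings statistics and the out-of-vocabulary (OOV) rate, could look as follows; the function names and the toy data are ours, not the authors'.

    from statistics import mean, stdev

    def oov_rate(question_words, training_vocabulary):
        """Percentage of words in the questions that the training corpus lacks."""
        missing = [w for w in question_words if w not in training_vocabulary]
        return len(missing) / len(question_words) * 100

    # One keystroke-savings score (Equation 1) per test question.
    scores_per_question = [0.0, 12.5, 28.6, 41.2, 17.3]
    print(round(mean(scores_per_question), 1), round(stdev(scores_per_question), 1))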
5.3 Discussion of the results

We measured the statistical significance of the mean differences between all text prediction scores using a Wilcoxon Signed Rank test on paired results for the 359 questions. We found that

[Figure 3 (plot) not reproduced. Title: ECDFs for text prediction scores on Neuro-QA questions using six different training corpora. X-axis: text prediction scores (0-60); Y-axis: cumulative percent of test corpus (0.0-1.0); legend: Twitter, Speech, Wikipedia, FAQ, Neuro-QA (leave-one-out), Medical Wikipedia.]

Figure 3: Empirical CDFs for text prediction scores on Neuro-QA data. Note that the curves that are at the bottom-right side represent the better-performing settings.

the difference between the Twitter and Speech corpora on the task is not significant (P = 0.18). The difference between Neuro-QA and Medical Wikipedia is significant with P = 0.02; all other differences are significant with P < 0.01.
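For readers who want to reproduce this kind of comparison, a paired Wilcoxon signed-rank test over per-question scores can be run with SciPy; this is a generic sketch with made-up arrays, not the authors' analysis script.

    from scipy.stats import wilcoxon

    # One keystroke-savings score per test question for two training corpora
    # (same questions, so the samples are paired).
    scores_medical_wikipedia = [31.0, 8.2, 45.2, 28.6, 12.5, 40.0]   # toy values
    scores_neuro_qa          = [27.5, 9.9, 51.0, 22.1, 15.0, 33.3]   # toy values

    statistic, p_value = wilcoxon(scores_medical_wikipedia, scores_neuro_qa)
    print(statistic, p_value)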
The Medical Wikipedia corpus and the leave-one-out experiments on the Neuro-QA data give better text prediction scores than the other corpora. The Medical Wikipedia even scores slightly better than the Neuro-QA data itself. Twitter and Speech are the least-suited training corpora for the Neuro-QA questions, and FAQ data gives a bit better results than a general Wikipedia corpus.

These results suggest that both text type and topic domain play a role in text prediction quality, but the high scores for the Medical Wikipedia corpus show that topic domain is even more important than text type.[5] The column OOV-rate shows that this is probably due to the high coverage of terms in the Neuro-QA data by the Medical Wikipedia corpus.

[5] We should note here that we did not control for domain differences between the four different text types. They are intended to be general domain, but Wikipedia articles will naturally be of different topics than conversational speech.

Table 4 also shows that the standard deviation among the 359 samples is relatively large. For some questions, 0% of the keystrokes are saved, while for others, scores of over 80% are obtained (by the Neuro-QA and Medical Wikipedia training corpora). We further analyzed the differences between the training sets by plotting the Empirical Cumulative Distribution Function (ECDF) for each experiment. An ECDF shows the development of text prediction scores (shown on the X-axis) by walking through the test set in 359 steps (shown on the Y-axis).

The ECDFs for our training corpora are in Figure 3. Note that the curves that are at the bottom-right side represent the better-performing settings (they get to a higher maximum after having seen a smaller portion of the samples). From Figure 3, it is again clear that the Neuro-QA and Medical Wikipedia corpora outperform the other training corpora, and that of the other four, FAQ is the best-performing corpus. Figure 3 also shows a large difference in the sizes of the starting percentiles: the proportion of samples with a text prediction

[Figures 4 and 5 (histograms) not reproduced. X-axes: percentage of keystrokes saved (0-80); Y-axes: frequency (0-80).]

Figure 4: Histogram of text prediction scores for the Neuro-QA questions trained on Medical Wikipedia. Each bin represents 36 questions.

Figure 5: Histogram of text prediction scores for leave-one-out experiments on Neuro-QA questions. Each bin represents 36 questions.

score of 0% is less than 10% for the Medical Wikipedia up to more than 30% for Speech.

We inspected the questions that get a text prediction score of 0%. We see many medical terms in these questions, and many of the utterances are not even questions, but multi-word terms representing topical headers in the chat-by-click data. Seven samples get a zero-score in the output of all six training corpora, e.g.:

- glycogenose III.
- potassium-aggrevated myotonias.

26 samples get a zero-score in the output of all training corpora except for Medical Wikipedia and Neuro-QA itself. These are mainly short headings with domain-specific terms such as:

- idiopatische neuralgische amyotrofie.
- Markesbery-Griggs distale myopathie.
- oculopharyngeale spierdystrofie.

Interestingly, the ECDFs show that the Medical Wikipedia and Neuro-QA corpora cross at around percentile 70 (around the point of 40% saved keystrokes). This indicates that although the means of the two result samples are close to each other, the distribution of the scores for the individual questions is different. The histograms of both distributions (Figures 4 and 5) confirm this: the algorithm trained on the Medical Wikipedia corpus leads to a larger number of samples with scores around the mean, while the leave-one-out experiments lead to a larger number of samples with low prediction scores and a larger number of samples with high prediction scores. This is also reflected by the higher standard deviation for Neuro-QA than for Medical Wikipedia.
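The ECDF used in Figure 3 (and contrasted with the histograms in Figures 4 and 5) is straightforward to compute from the per-question scores; the following sketch with NumPy uses invented scores and is only meant to show the construction.

    import numpy as np

    scores = np.array([0.0, 0.0, 12.5, 28.6, 31.0, 41.2, 45.2, 80.3])  # toy values

    x = np.sort(scores)                        # text prediction scores (X-axis)
    y = np.arange(1, len(x) + 1) / len(x)      # cumulative fraction of questions (Y-axis)

    for score, fraction in zip(x, y):
        print(score, fraction)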
Since both the leave-one-out training on the Neuro-QA questions and the Medical Wikipedia led to good results but behave differently for different portions of the test data, we also evaluated a combination of both corpora on our test set: we created training corpora consisting of the Medical Wikipedia corpus, complemented by 90% of the Neuro-QA questions, testing on the remaining 10% of the Neuro-QA questions. This led to a mean percentage of saved keystrokes of 28.6%, not significantly higher than just the Medical Wikipedia corpus.

6 Conclusions

In Section 1, we asked two questions: (1) What is the effect of text type differences on the quality of a text prediction algorithm? and (2) What is the best choice of training data if domain- and text type-specific data is sparse?

By training and testing our text prediction algorithm on four different text types (Wikipedia, Twitter, transcriptions of conversational speech and FAQ) with equal corpus sizes, we found that there is a clear effect of text type on text prediction quality: training and testing on the same text type

gave percentages of saved keystrokes between 27 and 34%; training on a different text type caused the scores to drop to percentages between 16 and 28%.

In our case study, we compared a number of training corpora for a specific data set for which training data is sparse: questions about neurological issues. We found significant differences between the text prediction scores obtained with the six training corpora: the Twitter and Speech corpora were the least suited, followed by the Wikipedia and FAQ corpus. The highest scores were obtained by training the algorithm on the medical pages from Wikipedia, immediately followed by leave-one-out experiments on the 359 neurological questions. The large differences between the lexical coverage of the medical domain played a central role in the scores for the different training corpora.

Because we obtained good results by both the Medical Wikipedia corpus and the neuro-QA questions themselves, we opted for a combination of both data types as training corpus in the initial version of the online text prediction application. Currently, a demonstration version of the application is running for ComPoli-users. We hope to collect questions from these users to re-train our algorithm with more representative examples.

Acknowledgments

This work is part of the research programme Communicatie en revalidatie digiPoli (ComPoli[6]), which is funded by ZonMW, the Netherlands organisation for health research and development.

[6] http://lands.let.ru.nl/strik/research/ComPoli/

References

J. Carlberger. 1997. Design and Implementation of a Probabilistic Word Prediction Program. Master thesis, Royal Institute of Technology (KTH), Sweden.

W. Daelemans, A. Van den Bosch, and T. Weijters. 1997. IGTree: Using trees for compression and classification in lazy learning algorithms. Artificial Intelligence Review, 11(1):407-423.

A. Fazly and G. Hirst. 2003. Testing the efficacy of part-of-speech information in word completion. In Proceedings of the 2003 EACL Workshop on Language Modeling for Text Entry Methods, pages 9-16.

N. Garay-Vitoria and J. Abascal. 2006. Text prediction systems: a survey. Universal Access in the Information Society, 4(3):188-203.

J. Geuze, P. Desain, and J. Ringelberg. 2008. Rephrase: chat-by-click: a fundamental new mode of human communication over the internet. In CHI '08 extended abstracts on Human factors in computing systems, pages 3345-3350. ACM.

G.W. Lesher, B.J. Moulton, D.J. Higginbotham, et al. 1999. Effects of ngram order and training text size on word prediction. In Proceedings of the RESNA '99 Annual Conference, pages 52-54.

Johannes Matiasek, Marco Baroni, and Harald Trost. 2002. FASTY - A Multi-lingual Approach to Text Prediction. In Klaus Miesenberger, Joachim Klaus, and Wolfgang Zagler, editors, Computers Helping People with Special Needs, volume 2398 of Lecture Notes in Computer Science, pages 165-176. Springer Berlin / Heidelberg.

N. Oostdijk. 2000. The spoken Dutch corpus: overview and first evaluation. In Proceedings of LREC-2000, Athens, volume 2, pages 887-894.

Erik Tjong Kim Sang. 2011. Het gebruik van Twitter voor Taalkundig Onderzoek. In TABU: Bulletin voor Taalwetenschap, volume 39, pages 62-72. In Dutch.

A. Van den Bosch and T. Bogers. 2008. Efficient context-sensitive word completion for mobile devices. In Proceedings of the 10th international conference on Human computer interaction with mobile devices and services, pages 465-470. ACM.

A. Van den Bosch. 2011. Effects of context and recency in scaled word completion. Computational Linguistics in the Netherlands Journal, 1:79-94, 12/2011.

G. Van Noord. 2009. Huge parsed corpora in LASSY. In Proceedings of The 7th International Workshop on Treebanks and Linguistic Theories (TLT7).

S. Westman and L. Freund. 2010. Information Interaction in 140 Characters or Less: Genres on Twitter. In Proceedings of the third symposium on Information Interaction in Context (IIiX), pages 323-328. ACM.

The Impact of Spelling Errors on Patent Search

Benno Stein and Dennis Hoppe and Tim Gollub

Bauhaus-Universität Weimar
99421 Weimar, Germany
<first name>.<last name>@uni-weimar.de

Abstract

The search in patent databases is a risky business compared to the search in other domains. A single document that is relevant but overlooked during a patent search can turn into an expensive proposition. While recent research engages in specialized models and algorithms to improve the effectiveness of patent retrieval, we bring another aspect into focus: the detection and exploitation of patent inconsistencies. In particular, we analyze spelling errors in the assignee field of patents granted by the United States Patent & Trademark Office. We introduce technology in order to improve retrieval effectiveness despite the presence of typographical ambiguities. In this regard, we (1) quantify spelling errors in terms of edit distance and phonological dissimilarity and (2) render error detection as a learning problem that combines word dissimilarities with patent meta-features. For the task of finding all patents of a company, our approach improves recall from 96.7% (when using a state-of-the-art patent search engine) to 99.5%, while precision is compromised by only 3.7%.

1 Introduction

Patent search forms the heart of most retrieval tasks in the intellectual property domain; cf. Table 1, which provides an overview of various user groups along with their typical and related tasks. The due diligence task, for example, is concerned with legal issues that arise while investigating another company. Part of an investigation is a patent portfolio comparison between one or more competitors (Lupu et al., 2011). Within all tasks recall is preferred over precision, a fact which distinguishes patent search from general web search. This retrieval constraint has produced a variety of sophisticated approaches tailored to the patent domain: citation analysis (Magdy and Jones, 2010), the learning of section-specific retrieval models (Lopez and Romary, 2010), and automated query generation (Xue and Croft, 2009). Each approach improves retrieval performance, but what keeps them from attaining maximum effectiveness in terms of recall are the inconsistencies found in patents: incomplete citation sets, incorrectly assigned classification codes, and, not least, spelling errors.

Our paper deals with spelling errors in an obligatory and important field of each patent, namely, the patent assignee name. Bibliographic fields are widely used among professional patent searchers in order to constrain keyword-based search sessions (Joho et al., 2010). The assignee name is particularly helpful for patentability searches and portfolio analyses since it determines the company holding the patent. Patent experts address these search tasks by formulating queries containing the company name in question, in the hope of finding all patents owned by that company. A formal and more precise description of this relevant search task is as follows: Given a query q which specifies a company, and a set D of patents, determine the set Dq ⊆ D comprised of all patents held by the respective company.

For this purpose, all assignee names in the patents in D should be analyzed. Let A denote the set of all assignee names in D, and let a ∼ q denote the fact that an assignee name a ∈ A refers to company q. Then in the portfolio search task, all patents filed under a are relevant. The retrieval of Dq can thus be rendered as a query expansion
all tasks recall is preferred over precision, a fact of Dq can thus be rendered as a query expansion

570
Table 1: User groups and patent-search-related retrieval tasks in the patent domain (Hunt et al., 2007). User groups (columns): Analyst, Attorney, Manager, Inventor, Investor, Researcher. Patent search tasks (rows): Patentability, State of the art, Infringement, Opposition, Due diligence, Portfolio.

task, where q is expanded by the disjunction of assignee names Aq with Aq = {a ∈ A | a ∼ q}.

While the trivial expansion of q by the entire set A ensures maximum recall but entails an unacceptable precision, the expansion of q by the empty set yields a reasonable baseline. The latter approach is implemented in patent search engines such as PatBase[1] or FreePatentsOnline,[2] which return all patents where the company name q occurs as a substring of the assignee name a. This baseline is simple but reasonable; due to trademark law, a company name q must be a unique identifier (i.e. a key), and an assignee name a that contains q can be considered as relevant. It should be noted in this regard that |q| < |a| holds for most elements in Aq, since the assignee names often contain company suffixes such as "Ltd" or "Inc".

Our hypothesis is that due to misspelled assignee names a substantial fraction of relevant patents cannot be found by the baseline approach. In this regard, the types of spelling errors in assignee names given in Table 2 should be considered.

Table 2: Types of spelling errors with increasing problem complexity according to Stein and Curatolo (2006). The first row refers to lexical errors, whereas the last two rows refer to phonological errors. For each type, an example is given, where a misspelled company name is followed by the correctly spelled variant.

Permutations or dropped letters: Whirpool Corporation / Whirlpool Corporation
Misremembering spelling details: Whetherford International / Weatherford International
Spelling out the pronunciation: Emulecks Corporation / Emulex Corporation

In order to raise the recall for portfolio search without significantly impairing precision, an approach more sophisticated than the standard retrieval approach, which is the expansion of q by the empty set, is needed. Such an approach must strive for an expansion of q by a subset of Aq, whereby this subset should be as large as possible.

1.1 Contributions

The paper provides a new solution to the problem outlined. This solution employs machine learning on orthographic features, as well as on patent meta features, to reliably detect spelling errors. It consists of two steps: (1) the computation of A+q, the set of assignee names that are in a certain edit distance neighborhood to q; and (2) the filtering of A+q, yielding the set Ãq, which contains those assignee names from A+q that are classified as misspellings of q. The power of our approach can be seen from Table 3, which also shows a key result of our research; a retrieval system that exploits our classifier will miss only 0.5% of the relevant patents, while retrieval precision is compromised by only 3.7%.

Another contribution relates to a new, manually-labeled corpus comprising spelling errors in the assignee field of patents (cf. Section 3). In this regard, we consider the over 2 million patents granted by the USPTO between 2001 and 2010. Last, we analyze indications of deliberately inserted spelling errors (cf. Section 4).

Table 3: Mean average Precision, Recall, and F-Measure (β = 2) for different expansion sets for q in a portfolio search task, which is conducted on our test corpus (cf. Section 3).

Expansion set for q          Precision   Recall   F2
∅ (baseline)                 0.993       0.967    0.968
Ãq (machine learning)        0.956       0.995    0.980
A (trivial)                  0.001       1.0      0.005
A+q (edit distance)          0.274       1.0      0.672

[1] www.patbase.com
[2] www.freepatentsonline.com
www.freepatentsonline.com

1.2 Causes for Inconsistencies in Patents

We identify the following six factors for inconsistencies in the bibliographic fields of patents, in particular for assignee names: (1) Misspellings are introduced due to the lack of knowledge, the lack of attention, and due to spelling disabilities. Intellevate Inc. (2006) reports that 98% of a sample of patents taken from the USPTO database contain errors, most of which are spelling errors. (2) Spelling errors are only removed by the USPTO upon request (U.S. Patent & Trademark Office, 2010). (3) Spelling variations of inventor names are permitted by the USPTO. The Manual of Patent Examining Procedure (MPEP) states in paragraph 605.04(b) that if the applicant's full name is "John Paul Doe", either "John P. Doe" or "J. Paul Doe" is acceptable. Thus, it is valid to introduce many different variations: with and without initials, with and without a middle name, or with and without suffixes. This convention applies to assignee names, too. (4) Companies often have branches in different countries, where each branch has its own company suffix, e.g., "Limited" (United States), "GmbH" (Germany), or "Kabushiki Kaisha" (Japan). Moreover, the usage of punctuation varies along company suffix abbreviations: "L.L.C." in contrast to "LLC", for example. (5) Indexing errors emerge from OCR processing of patent applications, because similar looking letters such as "e" versus "c" or "l" versus "I" are likely to be misinterpreted. (6) With the advent of electronic patent application filing, the number of patent reexamination steps was reduced. As a consequence, the chance of undetected spelling errors increases (Adams, 2010). All of the mentioned factors add to a highly inconsistent USPTO corpus.

2 Related Work

Information within a corpus can only be retrieved effectively if the data is both accurate and unique (Müller and Freytag, 2003). In order to yield data that is accurate and unique, approaches to data cleansing can be utilized to identify and remove inconsistencies. Müller and Freytag (2003) classify inconsistencies, where duplicates of entities in a corpus are part of a semantic anomaly. These duplicates exist in a database if two or more different tuples refer to the same entity. With respect to the bibliographic fields of patents, the assignee names "Howlett-Packard" and "Hewett-Packard" are distinct but refer to the same company. These kinds of near-duplicates impede the identification of duplicates (Naumann and Herschel, 2010).

Near-duplicate Detection. The problem of identifying near-duplicates is also known as record linkage, or name matching; it is the subject of active research (Elmagarmid et al., 2007). With respect to text documents, slightly modified passages in these documents can be identified using fingerprints (Potthast and Stein, 2008). On the other hand, for data fields which contain natural language such as the assignee name field, string similarity metrics (Cohen et al., 2003) as well as spelling correction technology are exploited (Damerau, 1964; Monge and Elkan, 1997). String similarity metrics compute a numeric value to capture the similarity of two strings. Spelling correction algorithms, by contrast, capture the likelihood for a given word being a misspelling of another word. In our analysis, the similarity metric SoftTfIdf is applied, which performs best in name matching tasks (Cohen et al., 2003), as well as the complete range of spelling correction algorithms shown in Figure 1: Soundex, which relies on similarity hashing (Knuth, 1997), the Levenshtein distance, which gives the minimum number of edits needed to transform a word into another word (Levenshtein, 1966), and SmartSpell, a phonetic production approach that computes the likelihood of a misspelling (Stein and Curatolo, 2006). In order to combine the strength of multiple metrics within a near-duplicate detection task, several authors resort to machine learning (Bilenko and Mooney, 2002; Cohen et al., 2003). Christen (2006) concludes that it is important to exploit all kinds of knowledge about the type of data in question, and that inconsistencies are domain-specific. Hence, an effective near-duplicate detection approach should employ domain-specific heuristics and algorithms (Müller and Freytag, 2003). Following this argumentation, we augment various word similarity assessments with patent-specific meta-features.
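Soundex-style similarity hashing, as mentioned above, maps similar-sounding names to the same short code. The simplified, standard American Soundex sketch below is our own illustration (not necessarily the exact variant used in the experiments); it uses the phonological example from Table 2.

    def soundex(name):
        """Simplified American Soundex hash (sketch only)."""
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4", "m": "5", "n": "5", "r": "6"}
        name = name.lower()
        result = name[0].upper()
        previous = codes.get(name[0], "")
        for ch in name[1:]:
            code = codes.get(ch, "")
            if code and code != previous:
                result += code
            if ch not in "hw":          # h and w do not reset the previous code
                previous = code
        return (result + "000")[:4]

    # Table 2's phonological example: both variants receive the same hash.
    print(soundex("Emulex"), soundex("Emulecks"))          # E542 E542
    # The Soundex feature is 1 if the hashes of q and a agree, 0 otherwise.
    print(int(soundex("Emulex") == soundex("Emulecks")))   # 1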
Patent Search. Commercial patent search engines, such as PatBase and FreePatentsOnline, handle near-duplicates in assignee names as follows. For queries which contain a company name followed by a wildcard operator, PatBase suggests

a set of additional companies (near-duplicates), which can be considered alongside the company name in question. These suggestions are solely retrieved based on a trailing wildcard query. Each additional company name can then be marked individually by a user to expand the original query. In case the entire set of suggestions is considered, this strategy conforms to the expansion of a query by the empty set, which equals a reasonable baseline approach. This query expansion strategy, however, has the following drawbacks: (1) The strategy captures only inconsistencies that succeed the given company name in the original query. Thus, near-duplicates which contain spelling errors in the company name itself are not found. Even if PatBase would support left trailing wildcards, then only the full combination of wildcard expressions would cover all possible cases of misspellings. (2) Given an acronym of a company such as "IBM", it is infeasible to expand the abbreviation to "International Business Machines" without considering domain knowledge.

[Figure 1 (diagram) not reproduced. It organizes single word spelling correction into near similarity hashing (collision-based, neighborhood-based), editing (trigram-based, edit-distance-based, rule-based), and heuristic search (phonetic production approach, hidden Markov models).]

Figure 1: Classification of spelling correction methods according to Stein and Curatolo (2006).

Query Expansion Methods for Patent Search. To date, various studies have investigated query expansion techniques in the patent domain that focus on prior-art search and invalidity search (Magdy and Jones, 2011). Since we are dealing with queries that comprise only a company name, existing methods cannot be applied. Instead, the near-duplicate task in question is more related to a text reuse detection task discussed by Hagen and Stein (2011); given a document, passages which also appear identical or slightly modified in other documents have to be retrieved by using standard keyword-based search engines. Their approach is guided by the user-over-ranking hypothesis introduced by Stein and Hagen (2011). It states that the best retrieval performance can be achieved with queries returning about as many results as can be considered at user site. If we make use of their terminology, then we can distinguish the query expansion sets (cf. Table 3) into two categories: (1) The trivial as well as the edit distance expansion sets are underspecific, i.e., users cannot cope with the large amount of irrelevant patents returned; the precision is close to zero. (2) The baseline approach, by contrast, is overspecific; it returns too few documents, i.e., the achieved recall is not optimal. As a consequence, these query expansion sets are not suitable for portfolio search. Our approach, on the other hand, excels in both precision and recall.

Query Spelling Correction. Queries which are submitted to standard web search engines differ from queries which are posed to patent search engines with respect to both length and language diversity. Hence, research in the field of web search is concerned with suggesting reasonable alternatives to misspelled queries rather than correcting single words (Li et al., 2011). Since standard spelling correction dictionaries (e.g. ASpell) are not able to capture the rich language used in web queries, large-scale knowledge sources such as Wikipedia (Li et al., 2011), query logs (Chen et al., 2007), and large n-gram corpora (Brants et al., 2007) are employed. It should be noted that the set of correctly written assignee names is unknown for the USPTO patent corpus. Moreover, spelling errors are modeled on the basis of language models (Li et al., 2011). Okuno (2011) proposes a generative model to encounter spelling errors, where the original query is expanded based on alternatives produced by a small edit distance to the original query. This strategy correlates to the trivial query expansion set (cf. Section 1). Unlike using a small edit distance, we allow a reasonably high edit distance to maximize the recall.

Trademark Search. The trademark search is about identifying registered trademarks which are similar to a new trademark application. Similarities between trademarks are assessed based on figurative and verbal criteria. In the former case, the focus is on image-based retrieval techniques. Trademarks are considered verbally similar for a variety of reasons, such as pronunciation, spelling, and conceptual closeness, e.g., swapping letters or using numbers for words. The verbal similarity of trademarks, on the other hand, can be determined by using techniques comparable to near-duplicate detection: phonological parsing,

fuzzy search, and edit distance computation (Fall and Giraud-Carrier, 2005).

3 Detection of Spelling Errors

This section presents our machine learning approach to expand a company query q; the classifier c delivers the set Ãq = {a ∈ A | c(q, a) = 1}, an approximation of the ideal set of relevant assignee names Aq. As classification technology, a support vector machine with a linear kernel is used, which receives each pair (q, a) as a six-dimensional feature vector. For training and test purposes we identified misspellings for 100 different company names. A detailed description of the constructed test corpus and a report on the classifier's performance is given in the remainder of this section.

3.1 Feature Set

The feature set comprises six features, three of them being orthographic similarity metrics, which are computed for every pair (q, a). Each metric compares a given company name q with the first |q| words of the assignee name a:

1. SoftTfIdf. The SoftTfIdf metric is considered, since the metric is suitable for the comparison of names (Cohen et al., 2003). The metric incorporates the Jaro-Winkler metric (Winkler, 1999) with a distance threshold of 0.9. The frequency values for the similarity computation are trained on A.

2. Soundex. The Soundex spelling correction algorithm captures phonetic errors. Since the algorithm computes hash values for both q and a, the feature is 1 if these hash values are equal, 0 otherwise.

3. Levenshtein distance. The Levenshtein distance for (q, a) is normalized by the character length of q.

To obtain further evidence for a misspelling in an assignee name, meta information about the patents in D, to which the assignee name refers, is exploited. In this regard, the following three features are derived:

1. Assignee Name Frequency. The number of patents filed under an assignee name a: F_Freq(a) = Freq(a, D). We assume that the probability of a misspelling to occur multiple times is low, and thus an assignee name with a misspelled company name has a low frequency.

2. IPC Overlap. The IPC codes of a patent specify the technological areas it applies to. We assume that patents filed under the same company name are likely to share the same set of IPC codes, regardless of whether the company name is misspelled or not. Hence, if we determine the IPC codes of patents which contain q in the assignee name, IPC(q), and the IPC codes of patents filed under assignee name a, IPC(a), then the intersection size of the two sets serves as an indicator for a misspelled company name in a:

F_IPC(q, a) = |IPC(q) ∩ IPC(a)| / |IPC(q) ∪ IPC(a)|

3. Company Suffix Match. The suffix match relies on the company suffixes Suffixes(q) that occur in the assignee names of A containing q. Similar to the IPC overlap feature, we argue that if the company suffix of a exists in the set Suffixes(q), a misspelling in a is likely: F_Suffixes(q, a) = 1 iff Suffixes(a) ⊆ Suffixes(q).
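A minimal sketch of how the meta-features just defined can be computed is given below; the helper data structures (dictionaries and sets describing patent counts, IPC codes, and suffixes) are hypothetical stand-ins for the parsed USPTO data, and a Levenshtein implementation such as the one sketched in Section 1.1's example is assumed for the normalized edit-distance feature.

    def assignee_frequency(a, patents_per_assignee):
        """F_Freq: number of patents filed under assignee name a."""
        return patents_per_assignee.get(a, 0)

    def ipc_overlap(ipc_q, ipc_a):
        """F_IPC: Jaccard overlap of the IPC code sets of q and of assignee a."""
        union = ipc_q | ipc_a
        return len(ipc_q & ipc_a) / len(union) if union else 0.0

    def suffix_match(suffixes_a, suffixes_q):
        """F_Suffixes: 1 iff all suffixes of a occur among the suffixes seen with q."""
        return int(suffixes_a <= suffixes_q)

    # Hypothetical values for one (q, a) pair.
    print(assignee_frequency("Whirpool Corporation", {"Whirpool Corporation": 1}))  # 1
    print(ipc_overlap({"G06F", "H04L"}, {"G06F"}))                                  # 0.5
    print(suffix_match({"Inc."}, {"Inc.", "Corporation"}))                          # 1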
3.2 Webis Patent Retrieval Assignee Corpus

A key contribution of our work is a new corpus called Webis Patent Retrieval Assignee Corpus 2012 (Webis-PRA-12). We compiled the corpus in order to assess the impact of misspelled companies on patent retrieval and the effectiveness of our classifier to detect them.[3] The corpus is built on the basis of 2 132 825 patents D granted by the USPTO between 2001 and 2010; the patent corpus is provided publicly by the USPTO in XML format. Each patent contains bibliographic fields as well as textual information such as the abstract and the claims section. Since we are interested in the assignee name a associated with each patent d ∈ D, we parse each patent and extract the assignee name. This yields the set A of 202 846 different assignee names. Each assignee name refers to a set of patents, whose size varies from 1 to 37 202 (the number of patents filed under "International Business Machines Corporation"). It should be noted that for a portfolio

[3] The Webis-PRA-12 corpus is freely available via www.webis.de/research/corpora

Table 4: Statistics of spelling errors for the 100 companies in the Webis-PRA-12 corpus. Considered are the
number of words and the number of letters in the company names, as well as the number of different company
suffixes that are used together with a company name (denoted as variants of q)
                                Total   Num. of words in q    Num. of letters in q    Num. of variants of q
                                        1     2     3-4       2-10   11-15   16-35    1-5   6-15   16-96
Number of companies in Q        100     36    53    11        30     35      35       45    32     23
Avg. num. of misspellings in A  3.79    2.13  3.75  9.36      1.16   2.94    6.88     0.91  3.81   9.39

search task the number of patents which refer to an assignee name matters for the computation of precision and recall. If we, however, isolate the task of detecting misspelled company names, then it is also reasonable to weight each assignee name equally and independently from the number of patents it refers to. Both scenarios are addressed in the experiments.

Given A, the corpus construction task is to map each assignee name a ∈ A to the company name q it refers to. This gives for each company name q the set of relevant assignee names Aq. For our corpus, we do not construct Aq for all company names but take a selection of 100 company names from the 2011 Fortune 500 ranking as our set of company names Q. Since the Fortune 500 ranking contains only large companies, the test corpus may appear to be biased towards these companies. However, rather than the company size, the structural properties of a company name are determinative; our sample includes short, medium, and long company names, as well as company names with few, medium, and many different company suffixes. Table 4 shows the distribution of company names in Q along these criteria in the first row.

For each company name q ∈ Q, we apply a semi-automated procedure to derive the set of relevant assignee names Aq. In a first step, all assignee names in A which do not refer to the company name q are filtered automatically. From a preliminary evaluation we concluded that the Levenshtein distance d(q, a) with a relative threshold of |q|/2 is a reasonable choice for this filtering step. The resulting sets A+q = {a ∈ A | d(q, a) ≤ |q|/2} contain, in total over Q, 14 189 assignee names. These assignee names are annotated by human assessors within a second step to derive the final set Aq for each q ∈ Q. Altogether we identify 1 538 assignee names that refer to the 100 companies in Q. With respect to our classification task, the assignee names in each Aq are positive examples; the remaining assignee names A+q \ Aq form the set of negative examples (12 651 in total).

During the manual assessment, names of assignees which include the correct company name q were distinguished from misspelled ones. The latter holds true for 379 of the 1 538 assignee names. These names are not retrievable by the baseline system, and thus form the main target for our classifier. The second row of Table 4 reports on the distribution of the 379 misspelled assignee names. As expected, the longer the company name, the more spelling errors occur. Companies which file patents under many different assignee names are likelier to have patents with misspellings in the company name.

3.3 Classifier Performance

For the evaluation with the Webis-PRA-12 corpus, we train a support vector machine,[4] which considers the six outlined features, and compare it to the other expansion techniques. For the training phase, we use 2/3 of the positive examples to form a balanced training set of 1 025 positive and 1 025 negative examples. After 10-fold cross validation, the achieved classification accuracy is 95.97%.

For a comparison of the expansion techniques on the test set, which contains the examples not considered in the training phase, two tasks are distinguished: finding near-duplicates in assignee names (cf. Table 5, Columns 3-5), and finding all patents of a company (cf. Table 5, Columns 6-8). The latter refers to the actual task of portfolio search. It can be observed that the performance improvements on both tasks are pretty similar. The baseline expansion ∅ yields a recall of 0.83 in the first task. The difference of 0.17 to a perfect recall can be addressed by considering query expansion techniques. If the trivial expansion A is applied to the task, the maximum recall can be achieved, which, however,

[4] We use the implementation of the WEKA toolkit with default parameters.

Table 5: The search results (macro-averaged) for two retrieval tasks and various expansion techniques. Besides Precision and Recall, the F-Measure with β = 2 is stated.

Misspelling detection            Task: assignee names         Task: patents
                                 P      R      F2             P      R      F2
Baseline (∅)                     .975   .829   .838           .993   .967   .968
Trivial (A)                      .000   1.0    .001           .001   1.0    .005
Edit distance (A+q)              .274   1.0    .499           .412   1.0    .672
SVM (Levenshtein)                .752   .981   .853           .851   .991   .911
SVM (SoftTfIdf)                  .702   .980   .796           .826   .993   .886
SVM (Soundex)                    .433   .931   .624           .629   .984   .759
SVM (orthographic features)      .856   .975   .922           .942   .990   .967
SVM (Ãq, all features)           .884   .975   .938           .956   .995   .980

is bought with precision close to zero. Using the edit distance expansion A+q yields a precision of 0.274 while keeping the recall at maximum. Finally, the machine learning expansion Ãq leads to a dramatic improvement (cf. Table 5, bottom lines), whereas the exploitation of patent meta-features significantly outperforms the exclusive use of orthography-related features; the increase in recall which is achieved by Ãq is statistically significant (matched pair t-test) for both tasks (assignee names task: t = 7.6856, df = 99, p = 0.00; patents task: t = 2.1113, df = 99, p = 0.037). Note that when being applied as a single feature, none of the spelling metrics (Levenshtein, SoftTfIdf, Soundex) is able to achieve a recall close to 1 without significantly impairing the precision.
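The F2 values in Table 5 weight recall twice as strongly as precision. The sketch below shows the generic F-beta computation behind such tables (a standard formula, not taken from the authors' evaluation code). Because Table 5 reports macro-averages over the 100 company queries, its F2 entries are averages of per-company values and cannot be recomputed by plugging the averaged P and R into this formula.

    def f_beta(precision, recall, beta=2.0):
        """F-Measure that weights recall beta times as strongly as precision."""
        if precision == 0 and recall == 0:
            return 0.0
        return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

    print(round(f_beta(0.5, 1.0), 3))   # 0.833: high recall dominates for beta = 2
    print(round(f_beta(1.0, 0.5), 3))   # 0.556: low recall is punished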
4 Distribution of Spelling Errors

Encouraged by the promising retrieval results achieved on the Webis-PRA-12 corpus, we extend the analysis of spelling errors in patents to the entire USPTO corpus of granted patents between 2001 and 2010. The analysis focuses on the following two research questions:

1. Are spelling errors an increasing issue in patents? According to Adams (2010), the amount of spelling errors should have increased in the last years due to the electronic patent filing process (cf. Section 1.2). We address this hypothesis by analyzing the distribution of spelling errors in company names that occur in patents granted between 2001 and 2010.

2. Are misspellings introduced deliberately in patents? We address this question by analyzing the patents with respect to the eight technological areas based on the International Patent Classification scheme IPC: A (Human necessities), B (Performing operations; transporting), C (Chemistry; metallurgy), D (Textiles; paper), E (Fixed constructions), F (Mechanical engineering; lighting; heating; weapons; blasting), G (Physics), and H (Electricity). If spelling errors are introduced accidentally, then we expect them to be uniformly distributed across all areas. A biased distribution, on the other hand, indicates that errors might be inserted deliberately.

In the following, we compile a second corpus on the basis of the entire set A of assignee names. In order to yield a uniform distribution of the companies across years, technological areas and countries, a set of 120 assignee names is extracted for each dimension. After the removal of duplicates, we revised these assignee names manually in order to check (and correct) their spelling. Finally, trailing business suffixes are removed, which results in a set of 3 110 company names. For each company name q, we generate the set Ãq as described in Section 3.

The results of our analysis are shown in Table 6. Table 6(a) refers to the first research question and shows that the amount of misspellings in companies decreased over the years from 6.67% in 2001 to 4.74% in 2010 (cf. Row 3). These results let us reject the hypothesis of Adams (2010). Nevertheless, the analysis provides evidence that spelling errors are still an issue. For example, the companies identified with the most spelling errors are Koninklijke Philips Electronics, with 45 misspellings in 2008, and Centre National de la Recherche Scientifique, with 28 misspellings in 2009. The results are consistent with our findings with

Table 6: Distribution of spelling errors for 3 110 company identifiers in the USPTO patents. The mean of spelling
errors per company identifier and the standard deviation refer to companies with misspellings. The last row in
each table shows the number of patents that are additionally found if the original query q is expanded by Ãq.
(a) Distribution of spelling errors between the years 2001 and 2010.

Year
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
Number of companies 1 028 1 066 1 115 1 151 1 219 1 261 1 274 1 210 1 224 1 268
Number of companies with misspellings 67 63 53 65 65 60 65 64 53 60
Measure

Companies with misspellings (%) 6.52 5.91 4.75 5.65 5.33 4.76 5.1 5.29 4.33 4.73
Mean 2.78 2.35 2.23 2.28 2.18 2.48 2.23 3.0 2.64 2.8
Standard deviation 4.62 3.3 3.63 3.13 2.8 3.55 2.87 6.37 4.71 4.6
Maximum misspellings per company 24 12 16 12 10 18 12 45 28 22
Additional number of patents 7.1 7.21 7.43 7.68 7.91 8.48 7.83 8.84 8.92 8.92

(b) Distribution of spelling errors based on the IPC scheme.

IPC code
A B C D E F G H
Number of companies 954 1 231 811 277 412 771 1 232 949
Number of companies with misspellings 59 70 51 7 10 33 83 63
Measure

Companies with misspellings (%) 6.18 5.69 6.29 2.53 2.43 4.28 6.74 6.64
Mean 3.0 2.49 3.57 1.86 2.8 1.88 3.29 4.05
Standard deviation 5.28 3.65 7.03 1.99 4.22 2.31 5.72 7.13
Maximum misspellings per company 32 14 40 3 12 6 24 35
Additional number of patents 9.25 9.67 11.12 4.71 4.6 4.79 8.92 12.84

respect to the Fortune 500 sample (cf. Table 4), where company names that are longer and presumably more difficult to write contain more spelling errors.

In contrast to the uniform distribution of misspellings over the years, the situation with regard to the technological areas is different (cf. Table 6(b)). Most companies are associated with the IPC sections G and B, which both refer to technical domains (cf. Table 6(b), Row 1). The percentage of misspellings in these sections increased compared to the spelling errors grouped by year. A significant difference can be seen for the sections D and E. Here, the number of assigned companies drops below 450 and the percentage of misspellings decreases significantly from about 6% to 2.5%. These findings might support the hypothesis that spelling errors are inserted deliberately in technical domains.

5 Conclusions

While researchers in the patent domain concentrate on retrieval models and algorithms to improve the search performance, the original aspect of our paper is that it points to a different (and orthogonal) research avenue: the analysis of patent inconsistencies. With the analysis of spelling errors in assignee names we made a first yet considerable contribution in this respect; searches with assignee constraints become a more sensible operation. We showed how a special treatment of spelling errors can significantly raise the effectiveness of patent search. The identification of this untapped potential, but also the utilization of machine learning to combine patent features with typography, form our main contributions.

Our current research broadens the application of a patent spelling analysis. In order to identify errors that are introduced deliberately we investigate different types of misspellings (edit distance versus phonological). Finally, we consider the analysis of acquisition histories of companies as a promising research direction: since acquired companies often own granted patents, these patents should be considered while searching for the company in question in order to further increase the recall.

Acknowledgements

This work is supported in part by the German Science Foundation under grants STE1019/2-1 and FU205/22-1.

References

Stephen Adams. 2010. The Text, the Full Text and nothing but the Text: Part 1 - Standards for creating Textual Information in Patent Documents and General Search Implications. World Patent Information, 32(1):22-29, March.

Mikhail Bilenko and Raymond J. Mooney. 2002. Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases. Technical Report AI 02-296, Artificial Intelligence Laboratory, University of Austin, Texas, USA, Austin, TX, February.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large Language Models in Machine Translation. In EMNLP-CoNLL '07: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 858-867. ACL, June.

Qing Chen, Mu Li, and Ming Zhou. 2007. Improving Query Spelling Correction Using Web Search Results. In EMNLP-CoNLL '07: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 181-189. ACL, June.

Peter Christen. 2006. A Comparison of Personal Name Matching: Techniques and Practical Issues. In ICDM '06: Workshops Proceedings of the sixth IEEE International Conference on Data Mining, pages 290-294. IEEE Computer Society, December.

William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. 2003. A Comparison of String Distance Metrics for Name-Matching Tasks. In Subbarao Kambhampati and Craig A. Knoblock, editors, IIWeb '03: Proceedings of the IJCAI workshop on Information Integration on the Web, pages 73-78, August.

Fred J. Damerau. 1964. A Technique for Computer Detection and Correction of Spelling Errors. Communications of the ACM, 7(3):171-176.

Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19(1):1-16.

Caspar J. Fall and Christophe Giraud-Carrier. 2005. Searching Trademark Databases for Verbal Similarities. World Patent Information, 27(2):135-143.

Matthias Hagen and Benno Stein. 2011. Candidate Document Retrieval for Web-Scale Text Reuse Detection. In 18th International Symposium on String Processing and Information Retrieval (SPIRE 11), volume 7024 of Lecture Notes in Computer Science, pages 356-367. Springer.

David Hunt, Long Nguyen, and Matthew Rodgers, editors. 2007. Patent Searching: Tools & Techniques. Wiley.

Intellevate Inc. 2006. Patent Quality, a blog entry. http://www.patenthawk.com/blog/2006/01/patent_quality.html, January.

Hideo Joho, Leif A. Azzopardi, and Wim Vanderbauwhede. 2010. A Survey of Patent Users: An Analysis of Tasks, Behavior, Search Functionality and System Requirements. In IIiX '10: Proceeding of the third symposium on Information Interaction in Context, pages 13-24, New York, NY, USA. ACM.

Donald E. Knuth. 1997. The Art of Computer Programming, Volume I: Fundamental Algorithms, 3rd Edition. Addison-Wesley.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707-710. Original in Doklady Akademii Nauk SSSR 163(4): 845-848.

Yanen Li, Huizhong Duan, and ChengXiang Zhai. 2011. CloudSpeller: Spelling Correction for Search Queries by Using a Unified Hidden Markov Model with Web-scale Resources. In Spelling Alteration for Web Search Workshop, pages 10-14, July.

Patrice Lopez and Laurent Romary. 2010. Experiments with Citation Mining and Key-Term Extraction for Prior Art Search. In Martin Braschler, Donna Harman, and Emanuele Pianta, editors, CLEF 2010 LABs and Workshops, Notebook Papers, September.

Mihai Lupu, Katja Mayer, John Tait, and Anthony J. Trippe, editors. 2011. Current Challenges in Patent Information Retrieval, volume 29 of The Information Retrieval Series. Springer.

Walid Magdy and Gareth J. F. Jones. 2010. Applying the KISS Principle for the CLEF-IP 2010 Prior Art Candidate Patent Search Task. In Martin Braschler, Donna Harman, and Emanuele Pianta, editors, CLEF 2010 LABs and Workshops, Notebook Papers, September.

Walid Magdy and Gareth J. F. Jones. 2011. A Study on Query Expansion Methods for Patent Retrieval. In PAIR '11: Proceedings of the 4th workshop on Patent information retrieval, AAAI Workshop on Plan, Activity, and Intent Recognition, pages 19-24, New York, NY, USA. ACM.

Alvaro E. Monge and Charles Elkan. 1997. An Efficient Domain-Independent Algorithm for Detecting

578
ing Approximately Duplicate Database Records.
In DMKD 09: Proceedings of the 2nd workshop
on Research Issues on Data Mining and Knowl-
edge Discovery, pages 2329, New York, NY,
USA. ACM.
Heiko Mller and Johann-C. Freytag. 2003. Prob-
lems, Methods and Challenges in Comprehensive
Data Cleansing. Technical Report HUB-IB-164,
Humboldt-Universitt zu Berlin, Institut fr Infor-
matik, Germany.
Felix Naumann and Melanie Herschel. 2010. An In-
troduction to Duplicate Detection. Synthesis Lec-
tures on Data Management. Morgan & Claypool
Publishers.
Yoh Okuno. 2011. Spell Generation based on Edit
Distance. In Spelling Alteration for Web Search
Workshop, pages 2526, July.
Martin Potthast and Benno Stein. 2008. New Is-
sues in Near-duplicate Detection. In Christine
Preisach, Hans Burkhardt, Lars Schmidt-Thieme,
and Reinhold Decker, editors, Data Analysis, Ma-
chine Learning and Applications. Selected papers
from the 31th Annual Conference of the German
Classification Society (GfKl 07), Studies in Classi-
fication, Data Analysis, and Knowledge Organiza-
tion, pages 601609, Berlin Heidelberg New York.
Springer.
Benno Stein and Daniel Curatolo. 2006. Phonetic
Spelling and Heuristic Search. In Gerhard Brewka,
Silvia Coradeschi, Anna Perini, and Paolo Traverso,
editors, 17th European Conference on Artificial In-
telligence (ECAI 06), pages 829830, Amsterdam,
Berlin, August. IOS Press.
Benno Stein and Matthias Hagen. 2011. Introducing
the User-over-Ranking Hypothesis. In Advances in
Information Retrieval. 33rd European Conference
on IR Resarch (ECIR 11), volume 6611 of Lecture
Notes in Computer Science, pages 503509, Berlin
Heidelberg New York, April. Springer.
U.S. Patent & Trademark Office. 2010. Manual of
Patent Examining Procedure (MPEP), Eighth Edi-
tion, July.
William W. Winkler. 1999. The State of Record Link-
age and Current Research Problems. Technical re-
port, Statistical Research Division, U.S. Bureau of
the Census.
Xiaobing Xue and Bruce W. Croft. 2009. Automatic
Query Generation for Patent Search. In CIKM
09: Proceeding of the eighteenth ACM conference
on Information and Knowledge Management, pages
20372040, New York, NY, USA. ACM.
UBY – A Large-Scale Unified Lexical-Semantic Resource Based on LMF

Iryna Gurevych, Judith Eckle-Kohler, Silvana Hartmann, Michael Matuschek, Christian M. Meyer and Christian Wirth

Ubiquitous Knowledge Processing Lab (UKP-DIPF), German Institute for Educational Research and Educational Information
Ubiquitous Knowledge Processing Lab (UKP-TUDA), Department of Computer Science, Technische Universität Darmstadt
http://www.ukp.tu-darmstadt.de
Abstract

We present UBY, a large-scale lexical-semantic resource combining a wide range of information from expert-constructed and collaboratively constructed resources for English and German. It currently contains nine resources in two languages: English WordNet, Wiktionary, Wikipedia, FrameNet and VerbNet, German Wikipedia, Wiktionary and GermaNet, and multilingual OmegaWiki, modeled according to the LMF standard. For FrameNet, VerbNet and all collaboratively constructed resources, this is done for the first time. Our LMF model captures lexical information at a fine-grained level by employing a large number of Data Categories from ISOCat and is designed to be directly extensible by new languages and resources. All resources in UBY can be accessed with an easy to use publicly available API.

1 Introduction

Lexical-semantic resources (LSRs) are the foundation of many NLP tasks such as word sense disambiguation, semantic role labeling, question answering and information extraction. They are needed on a large scale in different languages. The growing demand for resources is met neither by the largest single expert-constructed resources (ECRs), such as WordNet and FrameNet, whose coverage is limited, nor by collaboratively constructed resources (CCRs), such as Wikipedia and Wiktionary, which encode lexical-semantic knowledge in a less systematic form than ECRs, because they are lacking expert supervision.

Previously, there have been several independent efforts of combining existing LSRs to enhance their coverage w.r.t. their breadth and depth, i.e. (i) the number of lexical items, and (ii) the types of lexical-semantic information contained (Shi and Mihalcea, 2005; Johansson and Nugues, 2007; Navigli and Ponzetto, 2010b; Meyer and Gurevych, 2011). As these efforts often targeted particular applications, they focused on aligning selected, specialized information types. To our knowledge, no single work focused on modeling a wide range of ECRs and CCRs in multiple languages and a large variety of information types in a standardized format. Frequently, the presented model is not easily scalable to accommodate an open set of LSRs in multiple languages and the information mined automatically from corpora. The previous work also lacked the aspects of lexicon format standardization and API access. We believe that easy access to information in LSRs is crucial in terms of their acceptance and broad applicability in NLP.

In this paper, we propose a solution to this. We define a standardized format for modeling LSRs. This is a prerequisite for resource interoperability and the smooth integration of resources. We employ the ISO standard Lexical Markup Framework (LMF: ISO 24613:2008), a metamodel for LSRs (Francopoulo et al., 2006), and Data Categories (DCs) selected from ISOCat (http://www.isocat.org/). One of the main challenges of our work is to develop a model that is standard-compliant, yet able to express the information contained in diverse LSRs, and that in the long term supports the integration of the various resources.
The main contributions of this paper can be summarized as follows: (1) We present an LMF-based model for large-scale multilingual LSRs called UBY-LMF. We model the lexical-semantic information down to a fine-grained level of information (e.g. syntactic frames) and employ standardized definitions of linguistic information types from ISOCat. (2) We present UBY, a large-scale LSR implementing the UBY-LMF model. UBY currently contains nine resources in two languages: English WordNet (WN, Fellbaum (1998)), Wiktionary (http://www.wiktionary.org/) (WKT-en), Wikipedia (http://www.wikipedia.org/) (WP-en), FrameNet (FN, Baker et al. (1998)), and VerbNet (VN, Kipper et al. (2008)); German Wiktionary (WKT-de), Wikipedia (WP-de), and GermaNet (GN, Kunze and Lemnitzer (2002)); and the English and German entries of OmegaWiki (http://www.omegawiki.org/) (OW), referred to as OW-en and OW-de. OW, a novel CCR, is inherently multilingual: its basic structure are multilingual synsets, which are a valuable addition to our multilingual UBY. Essential to UBY are the nine pairwise sense alignments between resources, which we provide to enable resource interoperability on the sense level, e.g. by providing access to the often complementary information for a sense in different resources. (3) We present a Java API which offers unified access to the information contained in UBY.

We will make the UBY-LMF model, the resource UBY and the API freely available to the research community (http://www.ukp.tu-darmstadt.de/data/uby). This will make it easy for the NLP community to utilize UBY in a variety of tasks in the future.

2 Related Work

The work presented in this paper concerns standardization of LSRs, large-scale integration thereof at the representational level, and the unified access to lexical-semantic information in the integrated resources.

Standardization of resources. Previous work includes models for representing lexical information relative to ontologies (Buitelaar et al., 2009; McCrae et al., 2011), and standardized single wordnets (English, German and Italian wordnets) in the ISO standard LMF (Soria et al., 2009; Henrich and Hinrichs, 2010; Toral et al., 2010).

McCrae et al. (2011) propose LEMON, a conceptual model for lexicalizing ontologies as an extension of the LexInfo model (Buitelaar et al., 2009). LEMON provides an LMF implementation in the Web Ontology Language (OWL), which is similar to UBY-LMF, as it also uses DCs from ISOCat, but diverges further from the standard (e.g. by removing structural elements such as the predicative representation class). While we focus on modeling lexical-semantic information comprehensively and at a fine-grained level, the goal of LEMON is to support the linking between ontologies and lexicons. This goal entails a task-targeted application: domain-specific lexicons are extracted from ontology specifications and merged with existing LSRs on demand. As a consequence, there is no available large-scale instance of the LEMON model.

Soria et al. (2009) define WordNet-LMF, an LMF model for representing wordnets used in the KYOTO project, and Henrich and Hinrichs (2010) do this for GN, the German wordnet. These models are similar, but they still present different implementations of the LMF metamodel, which hampers interoperability between the resources. We build upon this work, but extend it significantly: UBY goes beyond modeling a single ECR and represents a large number of both ECRs and CCRs with very heterogeneous content in the same format. Also, UBY-LMF features deeper modeling of lexical-semantic information. Henrich and Hinrichs (2010), for instance, do not explicitly model the argument structure of subcategorization frames, since each frame is represented as a string. In UBY-LMF, we represent them at a fine-grained level necessary for the transparent modeling of the syntax-semantics interface.

Large-scale integration of resources. Most previous research efforts on the integration of resources targeted world knowledge rather than lexical-semantic knowledge. Well-known examples are YAGO (Suchanek et al., 2007) or DBpedia (Bizer et al., 2009).

Atserias et al. (2004) present the Meaning Multilingual Central Repository (MCR). MCR integrates five local wordnets based on the Interlingual Index of EuroWordNet (Vossen, 1998). The overall goal of the work is to improve word sense disambiguation.
This work is similar to ours, as it aims at a large-scale multilingual resource and includes several resources. It is however restricted to a single type of resource (wordnets) and features a single type of lexical information (semantic relations) specified upon synsets. Similarly, de Melo and Weikum (2009) create a multilingual wordnet by integrating wordnets, bilingual dictionaries and information from parallel corpora. None of these resources integrate lexical-semantic information, such as syntactic subcategorization or semantic roles.

McFate and Forbus (2011) present NULEX, a syntactic lexicon automatically compiled from WN, WKT-en and VN. As their goal is to create an open-license resource to enhance syntactic parsing, they enrich verbs and nouns in WN with inflection information from WKT-en and syntactic frames from VN. Thus, they only use a small part of the lexical information present in WKT-en.

Padró et al. (2011) present their work on lexicon merging within the Panacea Project. One goal of Panacea is to create a lexical resource development platform that supports large-scale lexical acquisition and can be used to combine existing lexicons with automatically acquired ones. To this end, Padró et al. (2011) explore the automatic integration of subcategorization lexicons. Their current work only covers Spanish, and though they mention the LMF standard as a potential data model, they do not make use of it.

Shi and Mihalcea (2005) integrate FN, VN and WN, and Palmer (2009) presents a combination of PropBank, VN and FN in a resource called SemLink in order to enhance semantic role labeling. Similar to our work, multiple resources are integrated, but their work is restricted to a single language and does not cover CCRs, whose popularity and importance has grown tremendously over the past years. In fact, with the exception of NULEX, CCRs have only been considered in the sense alignment of individual resource pairs (Navigli and Ponzetto, 2010a; Meyer and Gurevych, 2011).

API access for resources. An important factor in the success of a large, integrated resource is a single public API, which facilitates access to the information contained in the resource. The most important LSRs so far can be accessed using various APIs, for instance the Java WordNet API (http://sourceforge.net/projects/jwordnet/) or the Java-based Wikipedia API (http://code.google.com/p/jwpl/). With a stronger focus of the NLP community on sharing data and reproducing experimental results, these tools are becoming important as never before. Therefore, a major design objective of UBY is a single API. This is similar in spirit to the motivation of Pradhan et al. (2007), who present integrated access to corpus annotations as a main goal of their work on standardizing and integrating corpus annotations in the OntoNotes project.

To summarize, related work focuses either on the standardization of single resources (or a single type of resource), which leads to several slightly different formats constrained to these resources, or on the integration of several resources in an idiosyncratic format. CCRs have not been considered at all in previous work on resource standardization, and the level of detail of the modeling is insufficient to fully accommodate different types of lexical-semantic information. API access is rarely provided. This makes it hard for the community to exploit their results on a large scale. Thus, it diminishes the impact that these projects might achieve upon NLP beyond their original specific purpose, if their results were represented in a unified resource and could easily be accessed by the community through a single public API.

3 UBY Data model

LMF defines a metamodel of LSRs in the Unified Modeling Language (UML). It provides a number of UML packages and classes for modeling many different types of resources, e.g. wordnets and multilingual lexicons. The design of a standard-compliant lexicon model in LMF involves two steps: in the first step, the structure of the lexicon model has to be defined by choosing a combination of the LMF core package and zero to many extensions (i.e. UML packages). In the second step, these UML classes are enriched by attributes. To contribute to semantic interoperability, it is essential for the lexicon model that the attributes and their values refer to Data Categories (DCs) taken from a reference repository. DCs are standardized specifications of the terms that are used for attributes and their values, or in other words, the linguistic vocabulary occurring in a lexicon model.
Consider, for instance, the term lexeme, which is defined differently in WN and FN: in FN, a lexeme refers to a word form, not including the sense aspect. In WN, on the contrary, a lexeme is an abstract pairing of meaning and form. According to LMF, the DCs are to be selected from ISOCat, the implementation of the ISO 12620 Data Category Registry (DCR, Broeder et al. (2010)), resulting in a Data Category Selection (DCS).

Design of UBY-LMF. We have designed UBY-LMF (see www.ukp.tu-darmstadt.de/data/uby) as a model of the union of various heterogeneous resources, namely WN, GN, FN, and VN on the one hand and CCRs on the other hand. Two design principles guided our development of UBY-LMF: first, to preserve the information available in the original resources and to uniformly represent it in UBY-LMF; second, to be able to extend UBY in the future by further languages, resources, and types of linguistic information, in particular, alignments between different LSRs.

Wordnets, FN and VN are largely complementary regarding the information types they provide, see, e.g. Baker and Fellbaum (2009). Accordingly, they use different organizational units to represent this information. Wordnets, such as WN and GN, primarily contain information on lexical-semantic relations, such as synonymy, and use synsets (groups of lexemes that are synonymous) as organizational units. FN focuses on groups of lexemes that evoke the same prototypical situation (so-called semantic frames, Fillmore (1982)) involving semantic roles (so-called frame elements). VN, a large-scale verb lexicon, is organized in Levin-style verb classes (Levin, 1993) (groups of verbs that share the same syntactic alternations and semantic roles) and provides rich subcategorization frames including semantic roles and a specification of semantic predicates.

UBY-LMF employs several direct subclasses of Lexicon in order to account for the various organization types found in the different LSRs considered. While the LexicalEntry class reflects the traditional headword-based lexicon organization, Synset represents synsets from wordnets, SemanticPredicate models FN semantic frames, and SubcategorizationFrameSet corresponds to VN alternation classes. SubcategorizationFrame is composed of syntactic arguments, while SemanticPredicate is composed of semantic arguments. The linking between syntactic and semantic arguments is represented by the SynSemCorrespondence class.

The SenseAxis class is very important in UBY-LMF, as it connects the different source LSRs. Its role is twofold: first, it links the corresponding word senses from different languages, e.g. English and German. Second, it represents monolingual sense alignments, i.e. sense alignments between different lexicons in the same language. The latter is a novel interpretation of SenseAxis introduced by UBY-LMF.

The organization of lexical-semantic knowledge found in WP, WKT, and OW can be modeled with the classes in UBY-LMF as well. WP primarily provides encyclopedic information on nouns. It mainly consists of article pages which are modeled as Senses in UBY-LMF. WKT is in many ways similar to traditional dictionaries, because it enumerates senses under a given headword on an entry page. Thus, WKT entry pages can be represented by LexicalEntries and WKT senses by Senses. OW is different from WKT and WP, as it is organized in multilingual synsets. To model OW in UBY-LMF, we split the synsets per language and included them as monolingual Synsets in the corresponding Lexicon (e.g., OW-en or OW-de). The original multilingual information is preserved by adding a SenseAxis between corresponding synsets in OW-en and OW-de.

The LMF standard itself contains only few linguistic terms and specifies neither attributes nor their values. Therefore, an important task in developing UBY-LMF has been the specification of attributes and their values along with the proper attachment of attributes to LMF classes. In particular, this task involved selecting DCs from ISOCat and, if necessary, adding new DCs to ISOCat.

Extensions in UBY-LMF. Although UBY-LMF is largely compliant with LMF, the task of building a homogeneous lexicon model for many highly heterogeneous LSRs led us to extend LMF in several ways: we added two new classes and several new relationships between classes.
First, we were facing a huge variety of lexical-semantic labels for many different dimensions of semantic classification. Examples of such dimensions include ontological type (e.g. selectional restrictions in VN and FN), domain (e.g. Biology in WN), style and register (e.g. labels in WKT, OW), or sentiment (e.g. sentiment of lexical units in FN). Since we aim at an extensible LMF model, capable of representing further dimensions of semantic classification, we did not squeeze the information on semantic classes present in the considered LSRs into existing LMF classes. Instead, we addressed this issue by introducing a more general class, SemanticLabel, which is an optional subclass of Sense, SemanticPredicate, and SemanticArgument. This new class has three attributes, encoding the name of the label, its type (e.g. ontological, register, sentiment), and a numeric quantification (e.g. sentiment strength).

Second, we attached the subclass Frequency to most of the classes in UBY-LMF, in order to encode frequency information. This is of particular importance when using the resource in machine learning applications. This extension of the standard has already been made in WordNet-LMF (Soria et al., 2009). Currently, the Frequency class is used to keep corpus frequencies for lexical units in FN, but we plan to use it for enriching many other classes with frequency information in future work, such as Senses or SubcategorizationFrames.

Third, the representation of FN in LMF required adding two new relationships between LMF classes: we added a relationship between SemanticArgument and Definition, in order to represent the definitions available for frame elements in FN. In addition, we added a relationship between the Context class and the MonoLingualExternalRef, to represent the links to annotated corpus sentences in FN.

Finally, WKT turned out to be hard to tackle, because it contains a special kind of ambiguity in the semantic relations and translation links listed for senses: the targets of both relations and translation links are ambiguous, as they refer to lemmas (word forms), rather than to senses (Meyer and Gurevych, 2010). These ambiguous relation targets could not directly be represented in LMF, since sense and translation relations are defined between senses. To resolve this, we added a relationship between SenseRelation and FormRepresentation, in order to encode the ambiguous WKT relation target as a word form. Disambiguating the WKT relation targets to infer the target sense is left to future work.

A related issue occurred when we mapped WN to LMF. WN encodes morphologically related forms as sense relations. UBY-LMF represents these related forms not only as sense relations (as in WordNet-LMF), but also at the morphological level using the RelatedForm class from the LMF Morphology extension. In LMF, however, the RelatedForm class for morphologically related lexemes is not associated with the corresponding sense in any way. Discarding the WN information on the senses involved in a particular morphological relation would lead to information loss in some cases. Consider as an example the WN verb buy (purchase), which is derivationally related to the noun buy, while on the other hand buy (accept as true, e.g. "I can't buy this story") is not derivationally related to the noun buy. We addressed this issue by adding a sense attribute to the RelatedForm class. Thus, in extension of LMF, UBY-LMF allows sense relations to refer to a form relation target and morphological relations to refer to a sense relation target.

Data Categories in UBY-LMF. We encountered large differences in the availability of DCs in ISOCat for the morpho-syntactic, lexical-syntactic, and lexical-semantic parts of UBY-LMF. Many DCs were missing in ISOCat and we had to enter them ourselves. While this was feasible at the morpho-syntactic and lexical-syntactic level, due to a large body of standardization results available, it was much harder at the lexical-semantic level where standardization is still ongoing. At the lexical-semantic level, UBY-LMF currently allows string values for a number of attribute values, e.g. for semantic roles. We can easily integrate the results of the ongoing standardization efforts into UBY-LMF in the future.

4 UBY Population with information

4.1 Representing LSRs in UBY-LMF

UBY-LMF is represented by a DTD (as suggested by the standard) which can be used to automatically convert any given resource into the corresponding XML format; therefore, UBY-LMF can be considered as a serialization of LMF.
This conversion requires a detailed analysis of the resource to be converted, followed by the definition of a mapping of the concepts and terms used in the original resource to the UBY-LMF model. There are two major tasks involved in the development of an automatic conversion routine: first, the basic organizational unit in the source LSR has to be identified and mapped, e.g. synset in WN or semantic frame in FN, and second, it has to be determined how a (LMF) sense is defined in the source LSR.

A notable aspect of converting resources into UBY-LMF is the harmonization of linguistic terminology used in the LSRs. For instance, a WN Word and a GN Lexical Unit are mapped to Sense in UBY-LMF.

We developed reusable conversion routines for the future import of updated versions of the source LSRs into UBY, provided the structure of the source LSR remains stable. These conversion routines extract lexical data from the source LSRs by calling their native APIs (rather than processing the underlying XML data). Thus, all lexical information which can be accessed via the APIs is converted into UBY-LMF.

Converting the LSRs introduced in the previous section yielded an instantiation of UBY-LMF named UBY. The LexicalResource instance UBY currently comprises 10 Lexicon instances, one each for OW-de and OW-en, and one lexicon each for the remaining eight LSRs.

4.2 Adding Sense Alignments

Besides the uniform and standardized representation of the single LSRs, one major asset of UBY is the semantic interoperability of resources at the sense level. In the following, we (i) describe how we converted already existing sense alignments of resources into LMF, and (ii) present a framework to infer alignments automatically for any pair of resources.

Existing Alignments. Previous work on sense alignment yielded several alignments, such as WN–WP-en (Niemann and Gurevych, 2011), WN–WKT-en (Meyer and Gurevych, 2011) and VN–FN (Palmer, 2009). We converted these alignments into UBY-LMF by creating a SenseAxis instance for each pair of aligned senses. This involved mapping the sense IDs from the proprietary alignment files to the corresponding sense IDs in UBY.

In addition, we integrated the sense alignments already present in OW and WP. Some OW entries provide links to the corresponding WP page. Also, the German and English language editions of WP and OW are connected by inter-language links between articles (Senses in UBY). We can expect that these links have high quality, as they were entered manually by users and are subject to community control. Therefore, we straightforwardly imported them into UBY.

Alignment Framework. Automatically creating new alignments is difficult because of word ambiguities, different granularities of senses, or language-specific conceptualizations (Navigli, 2006). To support this task for a large number of resources across languages, we have designed a flexible alignment framework based on the state-of-the-art method of Niemann and Gurevych (2011). The framework is generic in order to allow alignments between different kinds of entities as found in different resources, e.g. WN synsets, FN frames or WP articles. The only requirement is that the individual entities are distinguishable by a unique identifier in each resource.

The alignment consists of the following steps: First, we extract the alignment candidates for a given resource pair, e.g. WN sense candidates for a WKT-en entry. Second, we create a gold standard by manually annotating a subset of candidate pairs as valid or non-valid. Then, we extract the sense representations (e.g. lemmatized bag-of-words based on glosses) to compute the similarity of word senses (e.g. by cosine similarity). The gold standard with corresponding similarity values is fed into Weka (Hall et al., 2009) to train a machine learning classifier, and in the final step this classifier is used to automatically classify the candidate sense pairs as (non-)valid alignments. Our framework also allows us to train on a combination of different similarity measures.
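To make the similarity-based classification step concrete, the following sketch shows how a single candidate sense pair could be scored from lemmatized bag-of-words gloss representations and accepted or rejected by a threshold. This is an illustration under simplifying assumptions, not the UBY implementation: the class and method names are hypothetical, and the actual framework trains a classifier with Weka on the gold standard rather than using a hand-set threshold.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of gloss-based candidate scoring: bag-of-words vectors, cosine
// similarity, and a threshold decision. All names are illustrative.
public class GlossAlignmentSketch {

    // Builds a term-frequency vector from a (pre-lemmatized) gloss.
    static Map<String, Integer> bagOfWords(List<String> lemmas) {
        Map<String, Integer> bag = new HashMap<>();
        for (String lemma : lemmas) {
            bag.merge(lemma, 1, Integer::sum);
        }
        return bag;
    }

    // Cosine similarity between two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Simple threshold decision over the similarity score; in the real
    // framework this decision is learned from the annotated gold standard.
    static boolean isValidAlignment(List<String> glossA, List<String> glossB, double threshold) {
        return cosine(bagOfWords(glossA), bagOfWords(glossB)) >= threshold;
    }

    public static void main(String[] args) {
        // Toy example: a WordNet-style gloss and a Wiktionary-style gloss.
        List<String> wnGloss = Arrays.asList("craft", "design", "water", "transportation");
        List<String> wktGloss = Arrays.asList("craft", "use", "water", "transport", "vessel");
        double threshold = 0.3; // hypothetical value, in practice tuned on the gold standard
        System.out.println(isValidAlignment(wnGloss, wktGloss, threshold));
    }
}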
Using our framework, we were able to reproduce the results reported by Niemann and Gurevych (2011) and Meyer and Gurevych (2011) based on the publicly available evaluation datasets (http://www.ukp.tu-darmstadt.de/data/sense-alignment/) and the configuration details reported in the corresponding papers.

Cross-Lingual Alignment. In order to align word senses across languages, we extended the monolingual sense alignment described above to the cross-lingual setting. Our approach utilizes Moses (http://www.statmt.org/moses/), trained on the Europarl corpus.
The lemma of one of the two senses to be aligned, as well as its representations (e.g. the gloss), is translated into the language of the other resource, yielding a monolingual setting. E.g., the WN synset {vessel, watercraft} with its gloss "a craft designed for water transportation" is translated into {Schiff, Wasserfahrzeug} and "Ein Fahrzeug für Wassertransport", and then the candidate extraction and all downstream steps can take place in German. An inherent problem with this approach is that incorrect translations also lead to invalid alignment candidates. However, these are most probably filtered out by the machine learning classifier, as the calculated similarity between the sense representations (e.g. glosses) should be low if the candidates do not match.

We evaluated our approach by creating a cross-lingual alignment between WN and OW-de, i.e. the concepts in OW with a German lexicalization. (OmegaWiki consists of interlinked language-independent concepts to which lexicalizations in several languages are attached.) To our knowledge, this is the first study on aligning OW with another LSR. OW is especially interesting for this task due to its multilingual concepts, as described by Matuschek and Gurevych (2011). The created gold standard could, for instance, be re-used to evaluate alignments for other languages in OW.

To compute the similarity of word senses, we followed the approach by Niemann and Gurevych (2011) while covering both translation directions. We used cosine similarity for comparing the German OW glosses with the German translations of WN glosses, and cosine and personalized page rank (PPR) similarity for comparison of the German OW glosses translated into English with the original English WN glosses. Note that PPR similarity is not available for German as it is based on WN. Thereby, we filtered out the OW concepts without a German gloss, which left us with 11,806 unique candidate pairs. We randomly selected 500 WN synsets for analysis, yielding 703 candidate pairs. These were manually annotated as being (non-)alignments. For the subsequent machine learning task we used a simple threshold-based classifier and ten-fold cross validation.

Translation direction   Similarity measure   P       R       F1
EN > DE                 Cosine (Cos)         0.666   0.575   0.594
DE > EN                 Cos                  0.674   0.658   0.665
DE > EN                 PPR                  0.721   0.712   0.716
DE > EN                 PPR + Cos            0.723   0.712   0.717

Table 1: Cross-lingual alignment results

Table 1 summarizes the results of different system configurations. We observe that translation into English works significantly better than into German. Also, the more elaborate similarity measure PPR yields better results than cosine similarity, while the best result is achieved by a combination of both. Niemann and Gurevych (2011) make a similar observation for the monolingual setting. Our F-measure of 0.717 in the best configuration lies between the results of Meyer and Gurevych (2011) (0.66) and Niemann and Gurevych (2011) (0.78), and thus verifies the validity of the machine translation approach. Therefore, the best alignment was subsequently integrated into UBY.

5 Evaluating UBY

We performed an intrinsic evaluation of UBY by computing a number of resource statistics. Our evaluation covers two aspects: first, it addresses the question if our automatic conversion routines work correctly. Second, it provides indicators for assessing UBY in terms of the gain in coverage compared to the single LSRs.

Correctness of conversion. Since we aim to preserve the maximal amount of information from the original LSRs, we should be able to replace any of the original LSRs and APIs by UBY and the UBY-API without losing information. As the conversion is largely performed automatically, systematic errors and information loss could be introduced by a faulty conversion routine. In order to detect such errors and to prove the correctness of the automatic conversion and the resulting representation, we have compared the original resource statistics of the classes and information types in the source LSRs to the corresponding classes in their UBY counterparts. For instance, the number of lexical relations in WordNet has been compared to the number of SenseRelations in the UBY WordNet lexicon. (For detailed analysis results see the UBY website.)
Lexicon   Lexical Entry   Sense       Sense Relation
FN        9,704           11,942      -
GN        83,091          93,407      329,213
OW-de     30,967          34,691      60,054
OW-en     51,715          57,921      85,952
WP-de     790,430         838,428     571,286
WP-en     2,712,117       2,921,455   3,364,083
WKT-de    85,575          72,752      434,358
WKT-en    335,749         421,848     716,595
WN        156,584         206,978     8,559
VN        3,962           31,891      -
UBY       4,259,894       4,691,313   5,300,941

Table 2: UBY resource statistics (selected classes).

Lexicon pair    Languages   SenseAxis
WN–WP-en        EN–EN       50,351
WN–WKT-en       EN–EN       99,662
WN–VN           EN–EN       40,716
FN–VN           EN–EN       17,529
WP-en–OW-en     EN–EN       3,960
WP-de–OW-de     DE–DE       1,097
WN–OW-de        EN–DE       23,024
WP-en–WP-de     EN–DE       463,311
OW-en–OW-de     EN–DE       58,785
UBY             All         758,435

Table 3: UBY alignment statistics.

Gain in coverage. UBY offers an increased coverage compared to the single LSRs as reflected in the resource statistics. Tables 2 and 3 show the statistics on central classes in UBY. As UBY is organized in several Lexicons, the number of UBY lexical entries is the sum of the lexical entries in all 10 Lexicons. Thus, UBY contains more than 4.2 million lexical entries, 4.6 million senses, 5.3 million semantic relations between senses and more than 750,000 alignments. These statistics represent the total numbers of lexical entries, senses and sense relations in UBY without filtering of identical (i.e. corresponding) lexical entries, senses and relations. Listing the number of unique senses would require a full alignment between all integrated resources, which is currently not available.

We can, however, show that UBY contains over 3.08 million unique lemma-POS combinations for English and over 860,000 for German, over 3.94 million in total, see Table 4. Therefore, we assessed the coverage on the lemma level. Table 4 also shows the number of lemmas with entries in one or more than one lexicon, additionally split by POS and language. Lemmas occurring only once in UBY increase the coverage at the lemma level. For lemmas with parallel entries in several UBY lexicons, new information becomes available in the form of additional sense definitions and complementary information types attached to lemmas.

Finally, the increase in coverage at the sense level can be estimated for senses that are aligned across at least two UBY lexicons. We gain access to all available, partly complementary information types attached to these aligned senses, e.g. semantic relations, subcategorization frames, encyclopedic or multilingual information. The number of pairwise sense alignments provided by UBY is given in Table 3. In addition, we computed how many senses simultaneously take part in at least two pairwise sense alignments. For English, this applies to 31,786 senses, for which information from 3 UBY lexicons is available.

EN Lexicons   noun        verb     adjective
5             1           699      -
4             1,630       1,888    430
3             8,439       1,948    2,271
2             53,856      4,727    12,290
1             2,900,652   50,209   41,731
(unique EN)   3,080,771

DE Lexicons   noun        verb     adjective
4             1,546       -        -
3             10,374      372      342
2             26,813      3,174    2,643
1             803,770     6,108    7,737
(unique DE)   862,879

Table 4: Number of lemmas (split by POS and language) with entries in i UBY lexicons, i = 1, ..., 5.

6 Using UBY

UBY API. For convenient access to UBY, we implemented a Java API which is built around the Hibernate (http://www.hibernate.org/) framework. Hibernate makes it easy to store the XML data which results from converting resources into UBY-LMF in a corresponding SQL database. Our main design principle was to keep the access to the resource as simple as possible, despite the rich and complex structure of UBY.
Another important design aspect was to ensure that the functionality of the individual, resource-specific APIs or user interfaces is mirrored in the UBY API. This enables porting legacy applications to our new resource. To facilitate the transition to UBY, we plan to provide reference tables which list the corresponding UBY-API operations for the most important operations in the WN API, some of which are shown in Table 5.

WN function                UBY function
Dictionary                 Uby
getIndexWord(pos, lemma)   getLexicalEntries(pos, lemma)
IndexWord                  LexicalEntry
getLemma()                 getLemmaForm()
Synset                     Synset
getGloss()                 getDefinitionText()
getWords()                 getSenses()
Pointer                    SynsetRelation
getType()                  getRelName()
Word                       Sense
getPointers()              getSenseRelations()

Table 5: Some equivalent operations in the WN API and the UBY API.

While it is possible to limit access to single resources by a parameter and thus mimic the behavior of the legacy APIs (e.g. only retrieve Synsets and their relations from WN), the true power of the UBY API becomes visible when no such constraints are applied. In this case, all imported resources are queried to get one combined result, while retaining the source of the respective information. On top of this, the information about existing sense alignments across resources can be accessed via SenseAxis relations, so that the returned combined result covers not only the lexical, but also the sense level.
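As an illustration of this combined-query access pattern, the following sketch shows how client code might traverse lexical entries and senses through a single facade. The interfaces below are simplified stand-ins defined only for the example; apart from getLexicalEntries(), getSenses(), getLemmaForm() and getDefinitionText(), which correspond to operations named in Table 5, all names and signatures are assumptions and not the documented UBY API.

import java.util.List;

// Sketch of the single-API access pattern described above. "UbyLexicon" and
// getSourceLexicon() are hypothetical; the remaining method names follow Table 5.
public class UbyAccessSketch {

    interface Sense {
        String getDefinitionText();   // cf. Table 5
        String getSourceLexicon();    // assumed helper: which resource the sense comes from
    }

    interface LexicalEntry {
        String getLemmaForm();        // cf. Table 5
        List<Sense> getSenses();      // cf. Table 5
    }

    interface UbyLexicon {
        // Queries all imported lexicons at once, cf. Table 5 (getLexicalEntries).
        List<LexicalEntry> getLexicalEntries(String pos, String lemma);
    }

    // Prints every definition of a lemma found in any of the integrated resources.
    static void printAllDefinitions(UbyLexicon uby, String pos, String lemma) {
        for (LexicalEntry entry : uby.getLexicalEntries(pos, lemma)) {
            for (Sense sense : entry.getSenses()) {
                System.out.println(sense.getSourceLexicon() + ": " + sense.getDefinitionText());
            }
        }
    }
}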
Community issues. One of the most important reasons for UBY is creating an easy-to-use, powerful LSR to advance NLP research and development. Therefore, community building around the resource is one of our major concerns. To this end, we will offer free downloads of the lexical data and software presented in this paper under open licenses, namely: the UBY-LMF DTD, mappings and conversion tools for existing resources and sense alignments, the Java API, and, as far as licensing allows, already converted resources. (Only GermaNet is subject to a restricted license and cannot be redistributed in UBY format.) If resources cannot be made available for download, the conversion tools will still allow users with access to these resources to import them into UBY easily. In this way, it will be possible for users to build their custom UBY containing selected resources. As the underlying resources are subject to continuous change, updates of the corresponding components will be made available on a regular basis.

7 Conclusions

We presented UBY, a large-scale, standardized LSR containing nine widely used resources in two languages: English WN, WKT-en, WP-en, FN and VN, German WP-de, WKT-de, and GN, and OW in English and German. As all resources are modeled in UBY-LMF, UBY enables structural interoperability across resources and languages down to a fine-grained level of information. For FN, VN and all of the CCRs in English and German, this is done for the first time. Besides, by integrating sense alignments we also enable the lexical-semantic interoperability of resources. We presented a unified framework for aligning any LSRs pairwise and reported on experiments which align OW-de and WN. We will release the UBY-LMF model, the resource and the UBY-API at the time of publication (http://www.ukp.tu-darmstadt.de/data/uby). Due to the added value and the large scale of UBY, as well as its ease of use, we believe UBY will boost the performance of NLP making use of lexical-semantic knowledge.

Acknowledgments

This work has been supported by the Emmy Noether Program of the German Research Foundation (DFG) under grant No. GU 798/3-1 and by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806. We thank Richard Eckart de Castilho, Yevgen Chebotar, Zijad Maksuti and Tri Duc Nghiem for their contributions to this project.

References
Jordi Atserias, Luís Villarejo, German Rigau, Eneko Agirre, John Carroll, Bernardo Magnini, and Piek Vossen. 2004. The Meaning Multilingual Central Repository. In Proceedings of the Second International WordNet Conference (GWC 2004), pages 23–30, Brno, Czech Republic.

Collin F. Baker and Christiane Fellbaum. 2009. WordNet and FrameNet as complementary resources for annotation. In Proceedings of the Third Linguistic Annotation Workshop, ACL-IJCNLP '09, pages 125–129, Suntec, Singapore.

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL '98), pages 86–90, Montreal, Canada.

Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia – A Crystallization Point for the Web of Data. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, (7):154–165.

Daan Broeder, Marc Kemps-Snijders, Dieter Van Uytvanck, Menzo Windhouwer, Peter Withers, Peter Wittenburg, and Claus Zinn. 2010. A Data Category Registry- and Component-based Metadata Framework. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC), pages 43–47, Valletta, Malta.

Paul Buitelaar, Philipp Cimiano, Peter Haase, and Michael Sintek. 2009. Towards Linguistically Grounded Ontologies. In Lora Aroyo, Paolo Traverso, Fabio Ciravegna, Philipp Cimiano, Tom Heath, Eero Hyvönen, Riichiro Mizoguchi, Eyal Oren, Marta Sabou, and Elena Simperl, editors, The Semantic Web: Research and Applications, pages 111–125, Berlin/Heidelberg, Germany. Springer.

Gerard de Melo and Gerhard Weikum. 2009. Towards a universal wordnet by learning from combined evidence. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), pages 513–522, New York, NY, USA. ACM.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, USA.

Charles J. Fillmore. 1982. Frame Semantics. In The Linguistic Society of Korea, editor, Linguistics in the Morning Calm, pages 111–137. Hanshin Publishing Company, Seoul, Korea.

Gil Francopoulo, Nuria Bel, Monte George, Nicoletta Calzolari, Monica Monachini, Mandy Pet, and Claudia Soria. 2006. Lexical Markup Framework (LMF). In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 233–236, Genoa, Italy.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA Data Mining Software: An Update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Verena Henrich and Erhard Hinrichs. 2010. Standardizing wordnets in the ISO standard LMF: Wordnet-LMF for GermaNet. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 456–464, Beijing, China.

Richard Johansson and Pierre Nugues. 2007. Using WordNet to extend FrameNet coverage. In Proceedings of the Workshop on Building Frame-semantic Resources for Scandinavian and Baltic Languages, at NODALIDA, pages 27–30, Tartu, Estonia.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2008. A Large-scale Classification of English Verbs. Language Resources and Evaluation, 42:21–40.

Claudia Kunze and Lothar Lemnitzer. 2002. GermaNet – representation, visualization, application. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), pages 1485–1491, Las Palmas, Canary Islands, Spain.

Beth Levin. 1993. English Verb Classes and Alternations. The University of Chicago Press, Chicago, IL, USA.

Michael Matuschek and Iryna Gurevych. 2011. Where the journey is headed: Collaboratively constructed multilingual Wiki-based resources. In SFB 538: Mehrsprachigkeit, editor, Hamburger Arbeiten zur Mehrsprachigkeit, Hamburg, Germany.

John McCrae, Dennis Spohr, and Philipp Cimiano. 2011. Linking Lexical Resources and Ontologies on the Semantic Web with Lemon. In The Semantic Web: Research and Applications, volume 6643 of Lecture Notes in Computer Science, pages 245–259. Springer, Berlin/Heidelberg, Germany.

Clifton J. McFate and Kenneth D. Forbus. 2011. NULEX: an open-license broad coverage lexicon. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, HLT '11, pages 363–367, Portland, OR, USA.

Christian M. Meyer and Iryna Gurevych. 2010. Worth its Weight in Gold or Yet Another Resource – A Comparative Study of Wiktionary, OpenThesaurus and GermaNet. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing: 11th International Conference, volume 6008 of Lecture Notes in Computer Science, pages 38–49. Berlin/Heidelberg: Springer, Iasi, Romania.

Christian M. Meyer and Iryna Gurevych. 2011. What Psycholinguists Know About Chemistry: Aligning Wiktionary and WordNet for Increased Domain Coverage. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP), pages 883–892, Chiang Mai, Thailand.
Roberto Navigli and Simone Paolo Ponzetto. 2010a. BabelNet: Building a Very Large Multilingual Semantic Network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225, Uppsala, Sweden, July.

Roberto Navigli and Simone Paolo Ponzetto. 2010b. Knowledge-rich Word Sense Disambiguation Rivaling Supervised Systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1522–1531, Uppsala, Sweden.

Roberto Navigli. 2006. Meaningful Clustering of Senses Helps Boost Word Sense Disambiguation Performance. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), pages 105–112, Sydney, Australia.

Elisabeth Niemann and Iryna Gurevych. 2011. The People's Web meets Linguistic Knowledge: Automatic Sense Alignment of Wikipedia and WordNet. In Proceedings of the 9th International Conference on Computational Semantics (IWCS), pages 205–214, Oxford, UK.

Muntsa Padró, Nuria Bel, and Silvia Necsulescu. 2011. Towards the Automatic Merging of Lexical Resources: Automatic Mapping. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), pages 296–301, Hissar, Bulgaria.

Martha Palmer. 2009. SemLink: Linking PropBank, VerbNet and FrameNet. In Proceedings of the Generative Lexicon Conference (GenLex-09), pages 9–15, Pisa, Italy.

Sameer S. Pradhan, Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2007. OntoNotes: A Unified Relational Semantic Representation. In Proceedings of the International Conference on Semantic Computing, pages 517–526, Washington, DC, USA.

Lei Shi and Rada Mihalcea. 2005. Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing), pages 100–111, Mexico City, Mexico.

Claudia Soria, Monica Monachini, and Piek Vossen. 2009. Wordnet-LMF: fleshing out a standardized format for Wordnet interoperability. In Proceedings of the 2009 International Workshop on Intercultural Collaboration, pages 139–146, Palo Alto, CA, USA.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706, Banff, Canada.

Antonio Toral, Stefania Bracale, Monica Monachini, and Claudia Soria. 2010. Rejuvenating the Italian WordNet: Upgrading, Standardising, Extending. In Proceedings of the 5th Global WordNet Conference (GWC), Bombay, India.

Piek Vossen, editor. 1998. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Dordrecht, Netherlands.
Word Sense Induction for Novel Sense Detection

Jey Han Lau, Paul Cook, Diana McCarthy, David Newman, and Timothy Baldwin

NICTA Victoria Research Laboratory
Dept of Computer Science and Software Engineering, University of Melbourne
Dept of Computer Science, University of California Irvine
Lexical Computing
jhlau@csse.unimelb.edu.au, paulcook@unimelb.edu.au, diana@dianamccarthy.co.uk, newman@uci.edu, tb@ldwin.net

Abstract

We apply topic modelling to automatically induce word senses of a target word, and demonstrate that our word sense induction method can be used to automatically detect words with emergent novel senses, as well as token occurrences of those senses. We start by exploring the utility of standard topic models for word sense induction (WSI), with a pre-determined number of topics (= senses). We next demonstrate that a non-parametric formulation that learns an appropriate number of senses per word actually performs better at the WSI task. We go on to establish state-of-the-art results over two WSI datasets, and apply the proposed model to a novel sense detection task.

1 Introduction

Word sense induction (WSI) is the task of automatically inducing the different senses of a given word, generally in the form of an unsupervised learning task with senses represented as clusters of token instances. It contrasts with word sense disambiguation (WSD), where a fixed sense inventory is assumed to exist, and token instances of a given word are disambiguated relative to the sense inventory. While WSI is intuitively appealing as a task, there have been no real examples of WSI being successfully deployed in end-user applications, other than work by Schütze (1998) and Navigli and Crisafulli (2010) in an information retrieval context. A key contribution of this paper is the successful application of WSI to the lexicographical task of novel sense detection, i.e. identifying words which have taken on new senses over time.

One of the key challenges in WSI is learning the appropriate sense granularity for a given word, i.e. the number of senses that best captures the token occurrences of that word. Building on the work of Brody and Lapata (2009) and others, we approach WSI via topic modelling, using Latent Dirichlet Allocation (LDA: Blei et al. (2003)) and derivative approaches, and use the topic model to determine the appropriate sense granularity. Topic modelling is an unsupervised approach to jointly learn topics (in the form of multinomial probability distributions over words) and per-document topic assignments (in the form of multinomial probability distributions over topics). LDA is appealing for WSI as it both assigns senses to words (in the form of topic allocation), and outputs a representation of each sense as a weighted list of words. LDA offers a solution to the question of sense granularity determination via non-parametric formulations, such as a Hierarchical Dirichlet Process (HDP: Teh et al. (2006), Yao and Durme (2011)).

Our contributions in this paper are as follows. We first establish the effectiveness of HDP for WSI over both the SemEval-2007 and SemEval-2010 WSI datasets (Agirre and Soroa, 2007; Manandhar et al., 2010), and show that the non-parametric formulation is superior to a standard LDA formulation with oracle determination of sense granularity for a given word. We next demonstrate that our interpretation of HDP-based WSI is superior to other topic model-based approaches to WSI, and indeed, better than the best published results for both SemEval datasets. Finally, we apply our method to the novel sense detection task based on a dataset developed in this research, and achieve highly encouraging results.

2 Methodology
In topic modelling, documents are assumed to exhibit multiple topics, with each document having its own distribution over topics. Words are generated in each document by first sampling a topic from the document's topic distribution, then sampling a word from that topic. In this work we use the topic model's probabilistic assignment of topics to words for the WSI task.

2.1 Data Representation and Pre-processing

In the context of WSI, topics form our sense representation, and words in a sentence are generated conditioned on a particular sense of the target word. The "document" in the WSI case is a single sentence or a short document fragment containing the target word, as we would not expect to be able to generate a full document from the sense of a single target word (notwithstanding the one sense per discourse heuristic (Gale et al., 1992)). In the case of the SemEval datasets, we use the word contexts provided in the dataset, while in our novel sense detection experiments, we use a context window of three sentences, one sentence to either side of the token occurrence of the target word.

As our baseline representation, we use a bag of words, where word frequency is kept but not word order. All words are lemmatised, and stopwords and low frequency terms are removed.

We also experiment with the addition of positional context word information, as commonly used in WSI. That is, we introduce an additional word feature for each of the three words to the left and right of the target word.

Padó and Lapata (2007) demonstrated the importance of syntactic dependency relations in the construction of semantic space models, e.g. for WSD. Based on these findings, we include dependency relations as additional features in our topic models (we use the Stanford Parser to do part-of-speech tagging and to extract the dependency relations (Klein and Manning, 2003; De Marneffe et al., 2006)), but just for dependency relations that involve the target word.
To facilitate comparison of our proposed method
2.2 Topic Modelling
for WSI with previous approaches, we use the
Topic models learn a probability distribution over dataset from the SemEval-2007 and SemEval-
topics for each document, by simply aggregating 2010 word sense induction tasks (Agirre and
the distributions over topics for each word in the 3
We use the C++ implementation of HDP
document. In WSI terms, we take this distribu- (http://www.cs.princeton.edu/blei/
tion over topics for each target word (instance topicmodeling.html) in our experiments.
4
in WSI parlance) as our distribution over senses The two HDP parameters and 0 control the variabil-
for that word. ity of senses in the documents. In particular, controls the
degree of sharing of topics across documents a high
1
Notwithstanding the one sense per discourse heuristic value leads to more topics, as topics for different documents
(Gale et al., 1992). are more dissimilar. 0 , on the other hand, controls the de-
2
We use the Stanford Parser to do part of speech tagging gree of mixing of topics within a document a high 0 gen-
and to extract the dependency relations (Klein and Manning, erates fewer topics, as topics are less homogeneous within a
2003; De Marneffe et al., 2006). document.
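To make the pipeline above concrete, the following Python sketch (not the authors' implementation, which uses the C++ HDP package cited in footnote 3) shows one way to realise the LDA variant with gensim: each instance becomes a bag of lemmas plus positional features, a topic model is trained per target word, and each instance is assigned the topic with the highest aggregated probability. The function names and the feature-naming scheme are illustrative only, and the dependency features of Section 2.1 are omitted for brevity.

# Minimal, illustrative sketch of topic-model-based WSI for one target word.
from gensim import corpora, models

def build_pseudo_document(context_lemmas, target_index, window=3):
    # Bag of lemmas plus positional features for the +/- `window` words
    # around the target (e.g. "husband_#-1" = "husband" one word to the left).
    doc = list(context_lemmas)
    for offset in range(-window, window + 1):
        pos = target_index + offset
        if offset != 0 and 0 <= pos < len(context_lemmas):
            doc.append("%s_#%d" % (context_lemmas[pos], offset))
    return doc

def induce_and_assign(instances, num_topics):
    # instances: list of (context_lemmas, target_index) pairs for one target word.
    # Returns the induced sense (topic id) assigned to each instance.
    docs = [build_pseudo_document(ctx, i) for ctx, i in instances]
    dictionary = corpora.Dictionary(docs)
    bows = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    senses = []
    for bow in bows:
        # The document-topic distribution aggregates the topic assignments of
        # all words in the instance; take arg max_z P(t = z | d) as the sense.
        topic_dist = lda.get_document_topics(bow, minimum_probability=0.0)
        senses.append(max(topic_dist, key=lambda zt: zt[1])[0])
    return senses

A non-parametric model (for example gensim's HdpModel, or the C++ HDP implementation used in the paper) could be substituted for LdaModel to remove the need to fix num_topics in advance, mirroring the HDP configuration described above.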

3 SemEval Experiments

To facilitate comparison of our proposed method for WSI with previous approaches, we use the datasets from the SemEval-2007 and SemEval-2010 word sense induction tasks (Agirre and Soroa, 2007; Manandhar et al., 2010). We first experiment with the SemEval-2010 dataset, as it includes explicit training and test data for each target word and utilises a more robust evaluation methodology. We then return to experiment with the SemEval-2007 dataset, for comparison purposes with other published results for topic modelling approaches to WSI.

3.1 SemEval-2010

3.1.1 Dataset and Methodology

Our primary WSI evaluation is based on the dataset provided by the SemEval-2010 WSI shared task (Manandhar et al., 2010). The dataset contains 100 target words: 50 nouns and 50 verbs. For each target word, a fixed set of training and test instances are supplied, typically 1 to 3 sentences in length, each containing the target word.

The default approach to evaluation for the SemEval-2010 WSI task is in the form of WSD over the test data, based on the senses that have been automatically induced from the training data. Because the induced senses will likely vary in number and nature between systems, the WSD evaluation has to incorporate a sense alignment step, which it performs by splitting the test instances into two sets: a mapping set and an evaluation set. The optimal mapping from induced senses to gold-standard senses is learned from the mapping set, and the resulting sense alignment is used to map the predictions of the WSI system to pre-defined senses for the evaluation set. The particular split we use to calculate WSD effectiveness in this paper is 80%/20% (mapping/test), averaged across 5 random splits.5

The SemEval-2010 training data consists of approximately 163K training instances for the 100 target words, all taken from the web. The test data is approximately 9K instances taken from a variety of news sources. Following the standard approach used by the participating systems in the SemEval-2010 task, we induce senses only from the training instances, and use the learned model to assign senses to the test instances.

5 A 60%/40% split is also provided as part of the task setup, but the results are almost identical to those for the 80%/20% split, and so are omitted from this paper. The original task also made use of V-measure and Paired F-score to evaluate the induced word sense clusters, but these have degenerate behaviour in correlating strongly with the number of senses induced by the method (Manandhar et al., 2010), and are hence omitted from this paper.
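As a concrete illustration of the mapping-based evaluation described above, the sketch below (our own simplification, not the official SemEval scorer) learns the induced-to-gold mapping on the mapping split by majority vote, scores the held-out split, and averages over random splits. With a single gold and induced sense per instance this reduces to accuracy; the official scorer additionally handles instances with multiple gold senses.

import random
from collections import Counter, defaultdict

def wsd_supervised_score(instances, n_splits=5, map_frac=0.8, seed=0):
    # instances: list of (induced_sense, gold_sense) pairs over the test data.
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = instances[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * map_frac)
        mapping_set, eval_set = shuffled[:cut], shuffled[cut:]
        # Map each induced sense to the gold sense it most often co-occurs with.
        by_induced = defaultdict(Counter)
        for induced, gold in mapping_set:
            by_induced[induced][gold] += 1
        mapping = {i: c.most_common(1)[0][0] for i, c in by_induced.items()}
        correct = sum(1 for induced, gold in eval_set
                      if mapping.get(induced) == gold)
        scores.append(correct / float(len(eval_set)))
    return sum(scores) / len(scores)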
In our original experiments with LDA, we set the number of topics (T) for each target word to the number of senses represented in the test data for that word (varying T for each target word). This is based on the unreasonable assumption that we will have access to gold-standard information on sense granularity for each target word, and is done to establish an upper bound score for LDA. We then relax the assumption, and use a fixed T setting for each of the sets of nouns (T = 7) and verbs (T = 3), based on the average number of senses from the test data in each case. Finally, we introduce positional context features for LDA, once again using the fixed T values for nouns and verbs.

We next apply HDP to the WSI task, using positional features, but learning the number of senses automatically for each target word via the model. Finally, we experiment with adding dependency features to the model.

To summarise, we provide results for the following models:

1. LDA+Variable T: LDA with variable T for each target word based on the number of gold-standard senses.
2. LDA+Fixed T: LDA with fixed T for each of nouns and verbs.
3. LDA+Fixed T+Position: LDA with fixed T and extra positional word features.
4. HDP+Position: HDP (which automatically learns T), with extra positional word features.
5. HDP+Position+Dependency: HDP with both positional word and dependency features.

We compare our models with two baselines from the SemEval-2010 task: (1) Baseline Random, which randomly assigns each test instance to one of four senses; and (2) Baseline MFS, the most frequent sense baseline, which assigns all test instances to one sense. We also compare against a benchmark system (UoY), in the form of the University of York system (Korkontzelos and Manandhar, 2010), which achieved the best overall WSD results in the original SemEval-2010 task.

3.2 SemEval-2010 Results

The results of our experiments over the SemEval-2010 dataset are summarised in Table 1.

                              WSD (80%/20%)
System                      All    Verbs   Nouns
Baselines
  Baseline Random           0.57   0.66    0.51
  Baseline MFS              0.59   0.67    0.53
LDA
  Variable T                0.64   0.69    0.60
  Fixed T                   0.63   0.68    0.59
  Fixed T+Position          0.63   0.68    0.60
HDP
  +Position                 0.68   0.72    0.65
  +Position+Dependency      0.68   0.72    0.65
Benchmark
  UoY                       0.62   0.67    0.59

Table 1: WSD F-score over the SemEval-2010 dataset

Looking first at the results for LDA, we see that the first LDA approach (variable T) is very competitive, outperforming the benchmark system. In this approach, however, we assume perfect knowledge of the number of gold senses of each target word, meaning that the method isn't truly unsupervised. When we fixed T for each of the nouns and verbs, we see a small drop in F-score, but encouragingly the method still performs above the benchmark. Adding positional word features improves the results very slightly for nouns.

When we relax the assumption on the number of word senses in moving to HDP, we observe a marked improvement in F-score over LDA. This is highly encouraging and somewhat surprising, as in hiding information about sense granularity from the model, we have actually improved our results. We return to discuss this effect below. For the final feature, we add dependency features to the HDP model (in addition to retaining the positional word features), but see no movement in the results.6 While the dependency features didn't reduce F-score, their utility is questionable as the generation of the features from the Stanford parser is computationally expensive.

To better understand these results, we present the top-10 terms for each of the senses induced for the word cheat in Table 2. These senses are learnt using HDP with both positional word features (e.g. husband #-1, indicating the lemma husband to the immediate left of the target word) and dependency features (e.g. cheat#prep on#wife). The first observation to make is that senses 7, 8 and 9 are "junk" senses, in that the top-10 terms do not convey a coherent sense. These topics are an artifact of HDP: they are learnt at a much later stage of the iterative process of Gibbs sampling and are often smaller than other topics (i.e. have more zero-probability terms). We notice that they are assigned as topics to instances very rarely (although they are certainly used to assign topics to non-target words in the instances), and as such, they do not present a real issue when assigning the sense to an instance, as they are likely to be overshadowed by the dominant senses.7 This conclusion is borne out when we experimented with manually filtering out these topics when assigning instances to senses: there was no perceptible change in the results, reinforcing our suggestion that these topics do not impact on target word sense assignment.

Comparing the results for HDP back to those for LDA, HDP tends to learn almost double the number of senses per target word as are in the gold-standard (and hence are used for the Variable T version of LDA). Far from hurting our WSD F-score, however, the extra topics are dominated by "junk" topics, and boost WSD F-score for the genuine topics. Based on this insight, we ran LDA once again with variable T (and positional and dependency features), but this time setting T to the value learned by HDP, to give LDA the facility to use junk topics. This resulted in an F-score of 0.66 across all word classes (verbs = 0.71, nouns = 0.62), demonstrating that, surprisingly, even for the same T setting, HDP achieves superior results to LDA. I.e., not only does HDP learn T automatically, but the topic model learned for a given T is superior to that for LDA.

Looking at the other senses discovered for cheat, we notice that the model has induced a myriad of senses: the relationship sense of cheat (senses 1, 3 and 4, e.g. husband cheats); the exam usage of cheat (sense 2); the competition/game usage of cheat (sense 5); and cheating in the political domain (sense 6). Although the senses are possibly split a little more than desirable (e.g. senses 1, 3 and 4 arguably describe the same sense), the overall quality of the produced senses

6 An identical result was observed for LDA.
7 In the WSD evaluation, the alignment of induced senses to the gold senses is learnt automatically based on the mapping instances. E.g. if all instances that are assigned sense a have gold sense x, then sense a is mapped to gold sense x. Therefore, if the proportion of junk senses in the mapping instances is low, their influence on WSD results will be negligible.

Sense Num Top-10 Terms
1 cheat think want ... love feel tell guy cheat#nsubj#include find
2 cheat student cheating test game school cheat#aux#to teacher exam study
3 husband wife cheat wife #1 tiger husband #-1 cheat#prep on#wife ... woman cheat#nsubj#husband
4 cheat woman relationship cheating partner reason cheat#nsubj#man woman #-1 cheat#aux#to spouse
5 cheat game play player cheating poker cheat#aux#to card cheated money
6 cheat exchange china chinese foreign cheat #-2 cheat #2 china #-1 cheat#aux#to team
7 tina bette kirk walk accuse mon pok symkyn nick star
8 fat jones ashley pen body taste weight expectation parent able
9 euro goal luck fair france irish single 2000 cheat#prep at#point complain

Table 2: The top-10 terms for each of the senses induced for the verb cheat by the HDP model (with positional
word and dependency features)

is encouraging. Also, we observe a spin-off benefit of topic modelling approaches to WSI: the high-ranking words in each topic can be used to gist the sense, and anecdotally confirm the impact of the different feature types (i.e. the positional word and dependency features).

3.3 Comparison with other Topic Modelling Approaches to WSI

The idea of applying topic modelling to WSI is not entirely new. Brody and Lapata (2009) proposed an LDA-based model which assigns different weights to different feature sets (e.g. unigram tokens vs. dependency relations), using a layered feature representation. They carry out extensive parameter optimisation of both the (fixed) number of senses, number of layers, and size of the context window.

Separately, Yao and Durme (2011) proposed the use of non-parametric topic models in WSI. The authors preprocess the instances slightly differently, opting to remove the target word from each instance and stem the tokens. They also tuned the hyperparameters of the topic model to optimise the WSI effectiveness over the evaluation set, and didn't use positional or dependency features.

Both of these papers were evaluated over only the SemEval-2007 WSI dataset (Agirre and Soroa, 2007), so we similarly apply our HDP method to this dataset for direct comparability. In the remainder of this section, we refer to Brody and Lapata (2009) as BL, and Yao and Durme (2011) as YVD.

The SemEval-2007 dataset consists of roughly 27K instances, for 65 target verbs and 35 target nouns. BL report on results only over the noun instances, so we similarly restrict our attention to the nouns in this paper. Training data was not provided as part of the original dataset, so we follow the approach of BL and YVD in constructing our own training dataset for each target word from instances extracted from the British National Corpus (BNC: Burnard (2000)).8 Both BL and YVD separately report slightly higher in-domain results from training on WSJ data (the SemEval-2007 data was taken from the WSJ). For the purposes of model comparison under identical training settings, however, it is appropriate to report on results for only the BNC.

We experiment with both our original method (with both positional word and dependency features, and default parameter settings for HDP) without any parameter tuning, and the same method with the tuned parameter settings of YVD, for direct comparability. We present the results in Table 3, including the results for the best-performing system in the original SemEval-2007 task (I2R: Niu et al. (2007)).

System                             F-Score
BL                                 0.855
YVD                                0.857
SemEval Best (I2R)                 0.868
Our method (default parameters)    0.842
Our method (tuned parameters)      0.869

Table 3: F-score for the SemEval-2007 WSI task, for our HDP method with default and tuned parameter settings, as compared to competitor topic modelling and other approaches to WSI

8 In creating the training dataset, each instance is made up of the sentence the target word occurs in, as well as one sentence to either side of that sentence, i.e. 3 sentences in total per instance.
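The three-sentence training contexts described in footnote 8 can be extracted with a few lines of code. The sketch below is illustrative only and assumes the corpus has already been sentence-split and lemmatised; all names are ours.

def context_instances(sentences, lemmas_per_sentence, target_lemma):
    # One training instance per occurrence of the target lemma: the sentence
    # it occurs in plus one sentence to either side (3 sentences in total).
    instances = []
    for i, lemmas in enumerate(lemmas_per_sentence):
        if target_lemma in lemmas:
            window = sentences[max(0, i - 1): i + 2]
            instances.append(" ".join(window))
    return instances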

The results are enlightening: with default parameter settings, our methodology is slightly below the results of the other three models. Bear in mind, however, that the two topic modelling-based approaches were tuned extensively to the dataset. When we use the tuned hyperparameter settings of YVD, our results rise around 2.5% to surpass both topic modelling approaches, and marginally outperform the I2R system from the original task. Recall that both BL and YVD report higher results again using in-domain training data, so we would expect to see further gains again over the I2R system in following this path.

Overall, these results agree with our findings over the SemEval-2010 dataset (Section 3.2), underlining the viability of topic modelling for automated word sense induction.

3.4 Discussion

As part of our preprocessing, we remove all stopwords (other than for the positional word and dependency features), as described in Section 2.1. We separately experimented with not removing stopwords, based on the intuition that prepositions such as to and on can be informative in determining word sense based on local context. The results were markedly worse, however. We also tried appending part of speech information to each word lemma, but the resulting data sparseness meant that results dropped marginally.

When determining the sense for an instance, we aggregate the sense assignments for each word in the instance (not just the target word). An alternate strategy is to use only the target word topic assignment, but again, the results for this strategy were inferior to the aggregate method.

In the SemEval-2007 experiments (Section 3.3), we found that YVD's hyperparameter settings yielded better results than the default settings. We experimented with parameter tuning over the SemEval-2010 dataset (including YVD's optimal setting on the 2007 dataset), but found that the default setting achieved the best overall results: although the WSD F-score improved a little for nouns, it worsened for verbs. This observation is not unexpected: as the hyperparameters were optimised for nouns in their experiments, the settings might not be appropriate for verbs. This also suggests that their results may be due in part to overfitting the SemEval-2007 data.

4 Identifying Novel Senses

Having established the effectiveness of our approach at WSI, we next turn to an application of WSI, in identifying words which have taken on novel senses over time, based on analysis of diachronic data. Our topic modelling approach is particularly attractive for this task as, not only does it jointly perform type-level WSI, and token-level WSD based on the induced senses (in assigning topics to each instance), but it is possible to gist the induced senses via the contents of the topic (typically using the topic words with highest marginal probability).

The meanings of words can change over time; in particular, words can take on new senses. Contemporary examples of new word-senses include the meanings of swag and tweet as used below:

1. We all know Frankie is adorable, but does he have swag? [swag = style]

2. The alleged victim gave a description of the man on Twitter and tweeted that she thought she could identify him. [tweet = send a message on Twitter]

These senses of swag and tweet are not included in many dictionaries or computational lexicons (e.g., neither of these senses is listed in WordNet 3.0 (Fellbaum, 1998)), yet appear to be in regular usage, particularly in text related to pop culture and online media.

The manual identification of such new word-senses is a challenge in lexicography over and above identifying new words themselves, and is essential to keeping dictionaries up-to-date. Moreover, lexicons that better reflect contemporary usage could benefit NLP applications that use sense inventories.

The challenge of identifying changes in word sense has only recently been considered in computational linguistics. For example, Sagi et al. (2009), Cook and Stevenson (2010), and Gulordava and Baroni (2011) propose type-based models of semantic change. Such models do not account for polysemy, and appear best-suited to identifying changes in predominant sense. Bamman and Crane (2011) use a parallel Latin-English corpus to induce word senses and build a WSD system, which they then apply to study diachronic variation in word senses. Crucially, in this token-based approach there is a clear connection between word senses and tokens, making it possible to identify usages of a specific sense.

Based on the findings in Section 3.2, here we apply the HDP method for WSI to the task of

identifying new word-senses. In contrast to Bamman and Crane (2011), our token-based approach does not require parallel text to induce senses.

4.1 Method

Given two corpora (a reference corpus, which we take to represent standard usage, and a second corpus of newer texts), we identify senses that are novel to the second corpus compared to the reference corpus. For a given word w, we pool all usages of w in the reference corpus and second corpus, and run the HDP WSI method on this super-corpus to induce the senses of w. We then tag all usages of w in both corpora with their single most-likely automatically-induced sense.

Intuitively, if a word w is used in some sense s in the second corpus, and w is never used in that sense in the reference corpus, then w has acquired a new sense, namely s. We capture this intuition in a novelty score (Nov) that indicates whether a given word w has a new sense in the second corpus, s, compared to the reference corpus, r, as below:

Nov(w) = max( { (p_s(t_i) - p_r(t_i)) / p_r(t_i) : t_i \in T } )    (1)

where p_s(t_i) and p_r(t_i) are the probability of sense t_i in the second corpus and reference corpus, respectively, calculated using smoothed maximum likelihood estimates, and T is the set of senses induced for w. Novelty is high if there is some sense t that has much higher relative frequency in s than r and that is also relatively infrequent in r.

4.2 Data

Because we are interested in the identification of novel word-senses for applications such as lexicon maintenance, we focus on relatively newly-coined word-senses. In particular, we take the written portion of the BNC (consisting primarily of British English text from the late 20th century) as our reference corpus, and a similarly-sized random sample of documents from the ukWaC (Ferraresi et al., 2008), a Web corpus built from the .uk domain in 2007 which includes a wide range of text types, as our second corpus. Text genres are represented to different extents in these corpora with, for example, text types related to the Internet being much more common in the ukWaC. Such differences are a noted challenge for approaches to identifying lexical semantic differences between corpora (Peirsman et al., 2010), but are difficult to avoid given the corpora that are available. We use TreeTagger (Schmid, 1994) to tokenise and lemmatise both corpora.

Evaluating approaches to identifying semantic change is a challenge, particularly due to the lack of appropriate evaluation resources; indeed, most previous approaches have used very small datasets (Sagi et al., 2009; Cook and Stevenson, 2010; Bamman and Crane, 2011). Because this is a preliminary attempt at applying WSI techniques to identifying new word-senses, our evaluation will also be based on a rather small dataset.

We require a set of words that are known to have acquired a new sense between the late 20th and early 21st centuries. The Concise Oxford English Dictionary aims to document contemporary usage, and has been published in numerous editions, including Thompson (1995, COD95) and Soanes and Stevenson (2008, COD08). Although some of the entries have been substantially revised between editions, many have not, enabling us to easily identify new senses amongst the entries in COD08 relative to COD95. A manual linear search through the entries in these dictionaries would be very time consuming, but by exploiting the observation that new words often correspond to concepts that are culturally salient (Ayto, 2006), we can quickly identify some candidates for words that have taken on a new sense.

Between the time periods of our two corpora, computers and the Internet have become much more mainstream in society. We therefore extracted all entries from COD08 containing the word computing (which is often used as a topic label in this dictionary) that have a token frequency of at least 1000 in the BNC. We then read the entries for these 87 lexical items in COD95 and COD08 and identified those which have a clear computing sense in COD08 that was not present in COD95. In total we found 22 such items. This process, along with all the annotation in this section, is carried out by a native English-speaking author of this paper.

To ensure that the words identified from the dictionaries do in fact have a new sense in the ukWaC sample compared to the BNC, we examine the usage of these words in the corpora.
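Equation (1) translates directly into code. The sketch below is ours, not the authors'; the paper states only that smoothed maximum likelihood estimates are used, so the add-alpha smoothing here is a placeholder, and all names are illustrative.

from collections import Counter

def sense_distribution(sense_assignments, all_senses, alpha=1.0):
    # Smoothed maximum-likelihood estimate of P(sense) in one corpus.
    # The exact smoothing scheme is not specified above; add-alpha is a stand-in.
    counts = Counter(sense_assignments)
    total = sum(counts.values()) + alpha * len(all_senses)
    return {t: (counts[t] + alpha) / total for t in all_senses}

def novelty(second_corpus_senses, reference_corpus_senses, all_senses):
    # Nov(w) from Equation (1): the largest relative increase in probability of
    # any induced sense in the second corpus over the reference corpus.
    p_s = sense_distribution(second_corpus_senses, all_senses)
    p_r = sense_distribution(reference_corpus_senses, all_senses)
    return max((p_s[t] - p_r[t]) / p_r[t] for t in all_senses)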

We extract a random sample of 100 usages of each lemma from the BNC and ukWaC sample and annotate these usages as to whether they correspond to the novel sense or not. This binary distinction is easier than fine-grained sense annotation, and since we do not use these annotations for formal evaluation (only for selecting items for our dataset) we do not carry out an inter-annotator agreement study here. We eliminate any lemma for which we find evidence of the novel sense in the BNC, or for which we do not find evidence of the novel sense in the ukWaC sample.9 We further check word sketches (Kilgarriff and Tugwell, 2002)10 for each of these lemmas in the BNC and ukWaC for collocates that likely correspond to the novel sense; we exclude any lemma for which we find evidence of the novel sense in the BNC, or fail to find evidence of the novel sense in the ukWaC sample. At the end of this process we have identified the following 5 lemmas that have the indicated novel senses in the ukWaC compared to the BNC: domain (n) "Internet domain"; export (v) "export data"; mirror (n) "mirror website"; poster (n) "one who posts online"; and worm (n) "malicious program". For each of the 5 lemmas with novel senses, a second annotator (also a native English-speaking author of this paper) annotated the sample of 100 usages from the ukWaC. The observed agreement and unweighted Kappa between the two annotators are 97.2% and 0.92, respectively, indicating that this is indeed a relatively easy annotation task. The annotators discussed the small number of disagreements to reach consensus.

For our dataset we also require items that have not acquired a novel sense in the ukWaC sample. For each of the above 5 lemmas we identified a distractor lemma of the same part-of-speech that has a similar frequency in the BNC, and that has not undergone sense change between COD95 and COD08. The 5 distractors are: cinema (n); guess (v); symptom (n); founder (n); and racism (n).

4.3 Results

We compute novelty (Nov, Equation 1) for all 10 items in our dataset, based on the output of the topic modelling. The results are shown in column Novelty in Table 4. The lemmas with a novel sense have higher novelty scores than the distractors according to a one-sided Wilcoxon rank sum test (p < .05).

Lemma         Novelty   Freq. ratio   Novel sense freq.
domain (n)      116.2      2.60          41
worm (n)         68.4      1.04          30
mirror (n)       38.4      0.53          10
guess (v)        16.5      0.93           -
export (v)       13.8      0.88          28
founder (n)      11.0      1.20           -
cinema (n)        9.7      1.30           -
poster (n)        7.9      1.83           4
racism (n)        2.4      0.98           -
symptom (n)       2.1      1.16           -

Table 4: Novelty score (Nov), ratio of frequency in the ukWaC sample and BNC, and frequency of the novel sense in the manually-annotated 100 instances from the ukWaC sample (where applicable), for all lemmas in our dataset. Lemmas shown in boldface in the original (domain, worm, mirror, export and poster) have a novel sense in the ukWaC sample compared to the BNC.

When a lemma takes on a new sense, it might also increase in frequency. We therefore also consider a baseline in which we rank the lemmas by the ratio of their frequency in the second and reference corpora. These results are shown in column Freq. ratio in Table 4. The difference between the frequency ratios for the lemmas with a novel sense, and the distractors, is not significant (p > .05).

Examining the frequency of the novel senses (shown in column Novel sense freq. in Table 4) we see that the lowest-ranked lemma with a novel sense, poster, is also the lemma with the least-frequent novel sense. This result is unsurprising as our novelty score will be higher for higher-frequency novel senses. The identification of infrequent novel senses remains a challenge.

The top-ranked topic words for the sense corresponding to the maximum in Equation 1 for the highest-ranked distractor, guess, are the following: @card@, post, ..., nt, comment, think, subject, forum, view, guess. This sense seems to correspond to usages of guess in the context of online forums, which are better represented in the ukWaC sample than the BNC. Because of the challenges posed by such differences between corpora (discussed in Section 4.2) we are unsurprised to see such an error, but this could be addressed in the future by building comparable corpora for use in this application.

9 We use the IMS Open Corpus Workbench (http://cwb.sourceforge.net/) to extract the usages of our target lemmas from the corpora. This extraction process fails in some cases, and so we also eliminate such items from our dataset.
10 http://www.sketchengine.co.uk/
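The ranking comparison reported above can be reproduced with an off-the-shelf rank-sum test. The sketch below is illustrative only: it uses scipy's Mann-Whitney U implementation (equivalent to a one-sided Wilcoxon rank-sum test) and plugs in the Novelty column of Table 4; the exact test implementation used in the original experiments is not specified beyond the name of the test.

from scipy.stats import mannwhitneyu

novel = {"domain": 116.2, "worm": 68.4, "mirror": 38.4, "export": 13.8, "poster": 7.9}
distractors = {"guess": 16.5, "founder": 11.0, "cinema": 9.7, "racism": 2.4, "symptom": 2.1}

# One-sided test of whether novelty scores for the novel-sense lemmas are higher
# than for the distractors; the paper reports p < .05 for the novelty ranking.
stat, p_value = mannwhitneyu(list(novel.values()), list(distractors.values()),
                             alternative="greater")
print("p = %.3f" % p_value)

Replacing the novelty scores with the Freq. ratio column gives the frequency-ratio baseline, which, as reported above, does not reach significance.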

              Nov                         Oracle (single topic)         Oracle (multiple topics)
Lemma         Precision Recall F-score    Precision Recall F-score      Precision Recall F-score
domain (n)    1.00      0.29   0.45       1.00      0.56   0.72         0.97      0.88   0.92
export (v)    0.93      0.96   0.95       0.93      0.96   0.95         0.90      1.00   0.95
mirror (n)    0.67      1.00   0.80       0.67      1.00   0.80         0.67      1.00   0.80
poster (n)    0.00      0.00   0.00       0.44      1.00   0.62         0.44      1.00   0.62
worm (n)      0.93      0.90   0.92       0.93      0.90   0.92         0.86      1.00   0.92

Table 5: Results for identifying the gold-standard novel senses based on the three topic selection methodologies of: (1) Nov; (2) oracle selection of a single topic; and (3) oracle selection of multiple topics.

Having demonstrated that our method for identifying novel senses can distinguish lemmas that have a novel sense in one corpus compared to another from those that do not, we now consider whether this method can also automatically identify the usages of the induced novel sense.

For each lemma with a gold-standard novel sense, we define the automatically-induced novel sense to be the single sense corresponding to the maximum in Equation 1. We then compute the precision, recall, and F-score of this novel sense with respect to the gold-standard novel sense, based on the 100 annotated tokens for each of the 5 lemmas with a novel sense. The results are shown in the first three numeric columns of Table 5.

In the case of export and worm the results are remarkably good, with precision and recall both over 0.90. For domain, the low recall is a result of the majority of usages of the gold-standard novel sense ("Internet domain") being split across two induced senses, the top-two highest ranked induced senses according to Equation 1. The poor performance for poster is unsurprising due to the very low frequency of this lemma's gold-standard novel sense.

These results are based on our novelty ranking method (Nov), and the assumption that the novel sense will be represented in a single topic. To evaluate the theoretical upper bound for a topic-ranking method which uses our HDP-based WSI method and selects a single topic to capture the novel sense, we next evaluate an optimal topic selection approach. In the middle three numeric columns of Table 5, we present results for an experimental setup in which the single best induced sense (in terms of F-score) is selected as the novel sense by an oracle. We see big improvements in F-score for domain and poster. This encouraging result suggests that refining the sense selection heuristic could theoretically improve our method for identifying novel senses, and that the topic modelling approach proposed in this paper has considerable promise for automatic novel sense detection. Of particular note is the result for poster: although the gold-standard novel sense of poster is rare, all of its usages are grouped into a single topic.

Finally, we consider whether an oracle which can select the best subset of induced senses (in terms of F-score) as the novel sense could offer further improvements. In this case (results shown in the final three columns of Table 5) we again see an increase in F-score, to 0.92 for domain. For this lemma the gold-standard novel sense usages were split across multiple induced topics, and so we are unsurprised to find that a method which is able to select multiple topics as the novel sense performs well. Based on these findings, in future work we plan to consider alternative formulations of novelty.

5 Conclusion

We propose the application of topic modelling to the task of word sense induction (WSI), starting with a simple LDA-based methodology with a fixed number of senses, and culminating in a nonparametric method based on a Hierarchical Dirichlet Process (HDP), which automatically learns the number of senses for a given target word. Our HDP-based method outperforms all methods over the SemEval-2010 WSI dataset, and is also superior to other topic modelling-based approaches to WSI based on the SemEval-2007 dataset. We applied the proposed WSI model to the task of identifying words which have taken on new senses, including identifying the token occurrences of the new word sense. Over a small dataset developed in this research, we achieved highly encouraging results.

References Dan Klein and Christopher D. Manning. 2003. Fast
exact inference with a factored model for natural
Eneko Agirre and Aitor Soroa. 2007. SemEval-2007 language parsing. In Advances in Neural Informa-
Task 02: Evaluating word sense induction and dis- tion Processing Systems 15 (NIPS 2002), pages 3
crimination systems. In Proceedings of the Fourth 10, Whistler, Canada.
International Workshop on Semantic Evaluations
Ioannis Korkontzelos and Suresh Manandhar. 2010.
(SemEval-2007), pages 712, Prague, Czech Re-
Uoy: Graphs of unambiguous vertices for word
public.
sense induction and disambiguation. In Proceed-
John Ayto. 2006. Movers and Shakers: A Chronology
ings of the 5th International Workshop on Semantic
of Words that Shaped our Age. Oxford University
Evaluation, pages 355358, Uppsala, Sweden.
Press, Oxford.
Suresh Manandhar, Ioannis Klapaftis, Dmitriy Dli-
David Bamman and Gregory Crane. 2011. Measur-
gach, and Sameer Pradhan. 2010. SemEval-2010
ing historical word sense variation. In Proceedings
Task 14: Word sense induction & disambiguation.
of the 2011 Joint International Conference on Dig-
In Proceedings of the 5th International Workshop
ital Libraries (JCDL 2011), pages 110, Ottawa,
on Semantic Evaluation, pages 6368, Uppsala,
Canada.
Sweden.
D. Blei, A. Ng, and M. Jordan. 2003. Latent dirichlet
Roberto Navigli and Giuseppe Crisafulli. 2010. In-
allocation. Journal of Machine Learning Research,
ducing word senses to improve web search result
3:9931022.
clustering. In Proceedings of the 2010 Conference
S. Brody and M. Lapata. 2009. Bayesian word sense
on Empirical Methods in Natural Language Pro-
induction. pages 103111, Athens, Greece.
cessing, pages 116126, Cambridge, USA.
Lou Burnard. 2000. The British National Corpus
Zheng-Yu Niu, Dong-Hong Ji, and Chew-Lim Tan.
Users Reference Guide. Oxford University Com-
2007. I2R: Three systems for word sense discrimi-
puting Services.
nation, chinese word sense disambiguation, and en-
Paul Cook and Suzanne Stevenson. 2010. Automat-
glish word sense disambiguation. In Proceedings
ically identifying changes in the semantic orienta-
of the Fourth International Workshop on Seman-
tion of words. In Proceedings of the Seventh In-
tic Evaluations (SemEval-2007), pages 177182,
ternational Conference on Language Resources and
Prague, Czech Republic.
Evaluation (LREC 2010), pages 2834, Valletta,
Malta. Sebastian Pado and Mirella Lapata. 2007.
Dependency-based construction of semantic
Marie-Catherine De Marneffe, Bill Maccartney, and
space models. Comput. Linguist., 33:161199.
Christopher D. Manning. 2006. Generating typed
dependency parses from phrase structure parses. Yves Peirsman, Dirk Geeraerts, and Dirk Speelman.
Genoa, Italy. 2010. The automatic identification of lexical varia-
tion between language varieties. Natural Language
Christiane Fellbaum, editor. 1998. WordNet: An Elec-
Engineering, 16(4):469491.
tronic Lexical Database. MIT Press, Cambridge,
MA. Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2009.
Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Semantic density analysis: Comparing word mean-
Silvia Bernardini. 2008. Introducing and evaluat- ing across time and space. In Proceedings of
ing ukwac, a very large web-derived corpus of en- the EACL 2009 Workshop on GEMS: GEometrical
glish. In Proceedings of the 4th Web as Corpus Models of Natural Language Semantics, pages 104
Workshop: Can we beat Google, pages 4754, Mar- 111, Athens, Greece.
rakech, Morocco. Helmut Schmid. 1994. Probabilistic part-of-speech
William A. Gale, Kenneth W. Church, and David tagging using decision trees. In Proceedings of the
Yarowsky. 1992. One sense per discourse. pages International Conference on New Methods in Lan-
233237. guage Processing, pages 4449, Manchester, UK.
Kristina Gulordava and Marco Baroni. 2011. A dis- Hinrich Schutze. 1998. Automatic word sense dis-
tributional similarity approach to the detection of crimination. Computational Linguistics, 24(1):97
semantic change in the Google Books Ngram cor- 123.
pus. In Proceedings of the GEMS 2011 Workshop Catherine Soanes and Angus Stevenson, editors. 2008.
on GEometrical Models of Natural Language Se- The Concise Oxford English Dictionary. Oxford
mantics, pages 6771, Edinburgh, Scotland. University Press, eleventh (revised) edition. Oxford
Adam Kilgarriff and David Tugwell. 2002. Sketch- Reference Online.
ing words. In Marie-Helene Correard, editor, Lex- Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei.
icography and Natural Language Processing: A 2006. Hierarchical Dirichlet processes. Journal
Festschrift in Honour of B. T. S. Atkins, pages 125 of the American Statistical Association, 101:1566
137. Euralex, Grenoble, France. 1581.

Della Thompson, editor. 1995. The Concise Oxford
Dictionary of Current English. Oxford University
Press, Oxford, ninth edition.
Xuchen Yao and Benjamin Van Durme. 2011. Non-
parametric bayesian word sense induction. In Pro-
ceedings of TextGraphs-6: Graph-based Methods
for Natural Language Processing, pages 1014,
Portland, Oregon.

Learning Language from Perceptual Context

Raymond Mooney
University of Texas at Austin
mooney@cs.utexas.edu

Abstract

Machine learning has become the dominant approach to building natural-language processing sys-
tems. However, current approaches generally require a great deal of laboriously constructed human-
annotated training data. Ideally, a computer would be able to acquire language like a child by being
exposed to linguistic input in the context of a relevant but ambiguous perceptual environment. As
a step in this direction, we have developed systems that learn to sportscast simulated robot soccer
games and to follow navigation instructions in virtual environments by simply observing sample hu-
man linguistic behavior in context. This work builds on our earlier work on supervised learning of
semantic parsers that map natural language into a formal meaning representation. In order to apply
such methods to learning from observation, we have developed methods that estimate the meaning of
sentences given just their ambiguous perceptual context.

Learning for Microblogs with Distant Supervision:
Political Forecasting with Twitter

Micol Marchetti-Bowick
Microsoft Corporation
475 Brannan Street
San Francisco, CA 94122
micolmb@microsoft.com

Nathanael Chambers
Department of Computer Science
United States Naval Academy
Annapolis, MD 21409
nchamber@usna.edu

Abstract

Microblogging websites such as Twitter offer a wealth of insight into a population's current mood. Automated approaches to identify general sentiment toward a particular topic often perform two steps: Topic Identification and Sentiment Analysis. Topic Identification first identifies tweets that are relevant to a desired topic (e.g., a politician or event), and Sentiment Analysis extracts each tweet's attitude toward the topic. Many techniques for Topic Identification simply involve selecting tweets using a keyword search. Here, we present an approach that instead uses distant supervision to train a classifier on the tweets returned by the search. We show that distant supervision leads to improved performance in the Topic Identification task as well as in the downstream Sentiment Analysis stage. We then use a system that incorporates distant supervision into both stages to analyze the sentiment toward President Obama expressed in a dataset of tweets. Our results better correlate with Gallup's Presidential Job Approval polls than previous work. Finally, we discover a surprising baseline that outperforms previous work without a Topic Identification stage.

1 Introduction

Social networks and blogs contain a wealth of data about how the general public views products, campaigns, events, and people. Automated algorithms can use this data to provide instant feedback on what people are saying about a topic. Two challenges in building such algorithms are (1) identifying topic-relevant posts, and (2) identifying the attitude of each post toward the topic. This paper studies distant supervision (Mintz et al., 2009) as a solution to both challenges. We apply our approach to the problem of predicting Presidential Job Approval polls from Twitter data, and we present results that improve on previous work in this area. We also present a novel baseline that performs remarkably well without using topic identification.

Topic identification is the task of identifying text that discusses a topic of interest. Most previous work on microblogs uses simple keyword searches to find topic-relevant tweets on the assumption that short tweets do not need more sophisticated processing. For instance, searches for the name "Obama" have been assumed to return a representative set of tweets about the U.S. President (O'Connor et al., 2010). One of the main contributions of this paper is to show that keyword search can lead to noisy results, and that the same keywords can instead be used in a distantly supervised framework to yield improved performance.

Distant supervision uses noisy signals in text as positive labels to train classifiers. For instance, the token "Obama" can be used to identify a series of tweets that discuss U.S. President Barack Obama. Although searching for token matches can return false positives, using the resulting tweets as positive training examples provides supervision from a distance. This paper experiments with several diverse sets of keywords to train distantly supervised classifiers for topic identification. We evaluate each classifier on a hand-labeled dataset of political and apolitical tweets, and demonstrate an improvement in F1 score over simple keyword search (.39 to .90 in the best case). We also make available the first labeled dataset for topic identification in politics to encourage future work.

Sentiment analysis encompasses a broad field of research, but most microblog work focuses on two moods: positive and negative sentiment.
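The distantly supervised topic identification idea sketched above can be illustrated in a few lines of Python. This is not the authors' system (which is detailed later in the paper); it simply shows how keyword hits can serve as noisy positive labels, random tweets as negatives, and a multinomial Naive Bayes classifier trained on the result. scikit-learn is used purely for convenience, and all names and parameter values are illustrative.

import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def train_topic_classifier(tweets, keywords, n_negative=None, seed=0):
    # Noisy positives: tweets matching any keyword; negatives: random tweets.
    keywords = [k.lower() for k in keywords]
    positives = [t for t in tweets if any(k in t.lower() for k in keywords)]
    pool = [t for t in tweets if t not in positives]
    rng = random.Random(seed)
    negatives = rng.sample(pool, n_negative or len(positives))
    texts = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    # Unigram and bigram counts, dropping rare features.
    vectoriser = CountVectorizer(ngram_range=(1, 2), min_df=10)
    features = vectoriser.fit_transform(texts)
    return vectoriser, MultinomialNB().fit(features, labels)

# Hypothetical usage: label unseen tweets as topic-relevant (1) or not (0).
# vec, clf = train_topic_classifier(all_tweets, ["obama"])
# relevant = clf.predict(vec.transform(["POTUS speech on health care tonight"]))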

Algorithms to identify these moods range from In contrast to lexicons, many approaches in-
matching words in a sentiment lexicon to training stead focus on ways to train supervised classi-
classifiers with a hand-labeled corpus. Since la- fiers. However, labeled data is expensive to cre-
beling corpora is expensive, recent work on Twit- ate, and examples of Twitter classifiers trained on
ter uses emoticons (i.e., ASCII smiley faces such hand-labeled data are few (Jiang et al., 2011). In-
as :-( and :-)) as noisy labels in tweets for distant stead, distant supervision has grown in popular-
supervision (Pak and Paroubek, 2010; Davidov et ity. These algorithms use emoticons to serve as
al., 2010; Kouloumpis et al., 2011). This paper semantic indicators for sentiment. For instance,
presents new analysis of the downstream effects a sad face (e.g., :-() serves as a noisy label for a
of topic identification on sentiment classifiers and negative mood. Read (2005) was the first to sug-
their application to political forecasting. gest emoticons for UseNet data, followed by Go
Interest in measuring the political mood of et al. (Go et al., 2009) on Twitter, and many others
a country has recently grown (OConnor et al., since (Bifet and Frank, 2010; Pak and Paroubek,
2010; Tumasjan et al., 2010; Gonzalez-Bailon et 2010; Davidov et al., 2010; Kouloumpis et al.,
al., 2010; Carvalho et al., 2011; Tan et al., 2011). 2011). Hashtags (e.g., #cool and #happy) have
Here we compare our sentiment results to Presi- also been used as noisy sentiment labels (Davi-
dential Job Approval polls and show that the sen- dov et al., 2010; Kouloumpis et al., 2011). Fi-
timent scores produced by our system are posi- nally, multiple models can be blended into a sin-
tively correlated with both the Approval and Dis- gle classifier (Barbosa and Feng, 2010). Here, we
approval job ratings. adopt the emoticon algorithm for sentiment analy-
In this paper we present a method for cou- sis, and evaluate it on a specific domain (politics).
pling two distantly supervised algorithms for Topic identification in Twitter has received
topic identification and sentiment classification on much less attention than sentiment analysis. The
Twitter. In Section 4, we describe our approach to majority of approaches simply select a single
topic identification and present a new annotated keyword (e.g., Obama) to represent their topic
corpus of political tweets for future study. In Sec- (e.g., US President) and retrieve all tweets that
tion 5, we apply distant supervision to sentiment contain the word (OConnor et al., 2010; Tumas-
analysis. Finally, Section 6 discusses our sys- jan et al., 2010; Tan et al., 2011). The underlying
tems performance on modeling Presidential Job assumption is that the keyword is precise, and due
Approval ratings from Twitter data. to the vast number of tweets, the search will re-
turn a large enough dataset to measure sentiment
2 Previous Work toward that topic. In this work, we instead use
a distantly supervised system similar in spirit to
The past several years have seen sentiment anal- those recently applied to sentiment analysis.
ysis grow into a diverse research area. The idea Finally, we evaluate the approaches presented
of sentiment applied to microblogging domains is in this paper on the domain of politics. Tumasjan
relatively new, but there are numerous recent pub- et al. (2010) showed that the results of a recent
lications on the subject. Since this paper focuses German election could be predicted through fre-
on the microblog setting, we concentrate on these quency counts with remarkable accuracy. Most
contributions here. similar to this paper is that of OConnor et al.
The most straightforward approach to senti- (2010), in which tweets relating to President
ment analysis is using a sentiment lexicon to la- Obama are retrieved with a keyword search and
bel tweets based on how many sentiment words a sentiment lexicon is used to measure overall
appear. This approach tends to be used by appli- approval. This extracted approval ratio is then
cations that measure the general mood of a popu- compared to Gallups Presidential Job Approval
lation. OConnor et al. (2010) use a ratio of posi- polling data. We directly compare their results
tive and negative word counts on Twitter, Kramer with various distantly supervised approaches.
(2010) counts lexicon words on Facebook, and
3 Datasets
Thelwall (2011) uses the publicly available Sen-
tiStrength algorithm to make weighted counts of The experiments in this paper use seven months of
keywords based on predefined polarity strengths. tweets from Twitter (www.twitter.com) collected

between June 1, 2009 and December 31, 2009. ID Type Keywords
The corpus contains over 476 million tweets la- PC-1 Obama obama
beled with usernames and timestamps, collected PC-2 General republican, democrat, senate,
congress, government
through Twitters spritzer API without keyword
PC-3 Topic health care, economy, tax cuts,
filtering. Tweets are aligned with polling data in tea party, bailout, sotomayor
Section 6 using their timestamps. PC-4 Politician obama, biden, mccain, reed,
The full system is evaluated against the pub- pelosi, clinton, palin
licly available daily Presidential Job Approval PC-5 Ideology liberal, conservative, progres-
polling data from Gallup1 . Every day, Gallup asks sive, socialist, capitalist
1,500 adults in the United States about whether
Table 1: The keywords used to select positive training
they approve or disapprove of the job Presi-
sets for each political classifier (a subset of all PC-3
dent Obama is doing as president. The results and PC-5 keywords are shown to conserve space).
are compiled into two trend lines for Approval
and Disapproval ratings, as shown in Figure 1.
positive: LOL, obama made a bears refer-
We compare our positive and negative sentiment
ence in green bay. uh oh.
scores against these two trends.
negative: New blog up! It regards the new
4 Topic Identification iPhone 3G S: <URL>

This section addresses the task of Topic Identi- We then use these automatically extracted
fication in the context of microblogs. While the datasets to train a multinomial Naive Bayes classi-
general field of topic identification is broad, its fier. Before feature collection, the text is normal-
use on microblogs has been somewhat limited. ized as follows: (a) all links to photos (twitpics)
Previous work on the political domain simply uses are replaced with a single generic token, (b) all
keywords to identify topic-specific tweets (e.g., non-twitpic URLs are replaced with a token, (c)
OConnor et al. (2010) use Obama to find pres- all user references (e.g., @MyFriendBob) are col-
idential tweets). This section shows that distant lapsed, (d) all numbers are collapsed to INT, (e)
supervision can use the same keywords to build a tokens containing the same letter twice or more
classifier that is much more robust to noise than in a row are condensed to a two-letter string (e.g.
approaches that use pure keyword search. the word ahhhhh becomes ahh), (f) lowercase the
text and insert spaces between words and punctu-
4.1 Distant Supervision ation. The text of each tweet is then tokenized,
Distant supervision uses noisy signals to identify and the tokens are used to collect unigram and bi-
positive examples of a topic in the face of unla- gram features. All features that occur fewer than
beled data. As described in Section 2, recent sen- 10 times in the training corpus are ignored.
timent analysis work has applied distant supervi- Finally, after training a classifier on this dataset,
sion using emoticons as the signals. The approach every tweet in the corpus is classified as either
extracts tweets with ASCII smiley faces (e.g., :) positive (i.e., relevant to the topic) or negative
and ;)) and builds classifiers trained on these pos- (i.e., irrelevant). The positive tweets are then sent
itive examples. We apply distant supervision to to the second sentiment analysis stage.
topic identification and evaluate its effectiveness
on this subtask. 4.2 Keyword Selection
As with sentiment analysis, we need to collect Keywords are the input to our proposed distantly
positive and negative examples of tweets about supervised system, and of course, the input to pre-
the target topic. Instead of emoticons, we extract vious work that relies on keyword search. We
positive tweets containing one or more predefined evaluate classifiers based on different keywords to
keywords. Negative tweets are randomly chosen measure the effects of keyword selection.
from the corpus. Examples of positive and neg- OConnor et al. (2010) used the keywords
ative tweets that can be used to train a classifier Obama and McCain, and Tumasjan et al.
based on the keyword Obama are given here: (2010) simply extracted tweets containing Ger-
1
http://gallup.com/poll/113980/gallup-daily-obama-job- manys political party names. Both approaches
approval.aspx extracted matching tweets, considered them rele-

Figure 1: Gallup presidential job Approval and Disapproval ratings measured between June and Dec 2009.

vant (correctly, in many cases), and applied sen- domly chosen from the keyword searches of PC-
timent analysis. However, different keywords 2, PC-3, PC-4, and PC-5 with 500 tweets from
may result in very different extractions. We in- each. This combined dataset enables an evalua-
stead attempted to build a generic political topic tion of how well each classifier can identify tweets
classifier. To do this, we experimented with the from other classifiers. The General Dataset con-
five different sets of keywords shown in Table 1. tains 2,000 random tweets from the entire corpus.
For each set, we extracted all tweets matching This dataset allows us to evaluate how well clas-
one or more keywords, and created a balanced sifiers identify political tweets in the wild.
positive/negative training set by then selecting This papers authors initially annotated the
negative examples randomly from non-matching same 200 tweets in the General Dataset to com-
tweets. A couple examples of ideology (PC-5) ex- pute inter-annotator agreement. The Kappa was
tractions are shown here: 0.66, which is typically considered good agree-
ment. Most disagreements occurred over tweets
You often hear of deontologist libertarians
and utilitarian liberals but are there any about money and the economy. We then split the
Aristotelian socialists? remaining portions of the two datasets between
<url> - Then, slather on a liberal amount
the two annotators. The Political Dataset con-
of plaster, sand down smooth, and paint tains 1,691 political and 309 apolitical tweets, and
however you want. I hope this helps! the General Dataset contains 28 political tweets
and 1,978 apolitical tweets. These two datasets of
The second tweet is an example of the noisy 2000 tweets each are publicly available for future
nature of keyword extraction. Most extractions evaluation and comparison to this work2 .
are accurate, but different keywords retrieve very
different sets of tweets. Examples for the political 4.4 Experiments
topics (PC-3) are shown here: Our first experiment addresses the question of
RT @PoliticalMath: hope the presidents keyword variance. We measure performance on
health care predictions <url> are better the Political Dataset, a combination of all of our
than his stimulus predictions <url> proposed political keywords. Each keyword set
@adamjschmidt You mean we could have contributed to 25% of the dataset, so the eval-
chosen health care for every man woman uation measures the extent to which a classifier
and child in America or the Iraq war? identifies other keyword tweets. We classified
the 2000 tweets with the five distantly supervised
Each keyword set builds a classifier using the ap-
classifiers and the one Obama keyword extrac-
proach described in Section 4.1.
tor from OConnor et al. (2010).
4.3 Labeled Datasets Results are shown on the left side of Figure 2.
Precision and recall calculate correct identifica-
In order to evaluate distant supervision against
tion of the political label. The five distantly super-
keyword search, we created two new labeled
vised approaches perform similarly, and show re-
datasets of political and apolitical tweets.
markable robustness despite their different train-
The Political Dataset is an amalgamation of all
ing sets. In contrast, the keyword extractor only
four keyword extractions (PC-1 is a subset of PC-
2
4) listed in Table 1. It consists of 2,000 tweets ran- http://www.usna.edu/cs/nchamber/data/twitter

Figure 2: Five distantly supervised classifiers and the Obama keyword classifier. Left panel: the Political Dataset
of political tweets. Right panel: the General Dataset representative of Twitter as a whole.

PC-1 is the distantly supervised analog to the Obama keyword extractor, and we see that distant supervision increases its F1 score dramatically from 0.39 to 0.90.

The second evaluation addresses the question of classifier performance on Twitter as a whole, not just on a political dataset. We evaluate on the General Dataset just as on the Political Dataset. Results are shown on the right side of Figure 2. Most tweets posted to Twitter are not about politics, so the apolitical label dominates this more representative dataset. Again, the five distant supervision classifiers have similar results. The Obama keyword search has the highest precision, but drastically sacrifices recall. Four of the five classifiers outperform keyword search in F1 score.

4.5 Discussion

The Political Dataset results show that distant supervision adds robustness to a keyword search. The distantly supervised Obama classifier (PC-1) improved the basic Obama keyword search by 0.51 absolute F1 points. Furthermore, distant supervision doesn't require additional human input, but simply adds a trained classifier. Two example tweets that an Obama keyword search misses but that its distantly supervised analog captures are shown here:

    Why does Congress get to opt out of the Obummercare and we can't. A company gets fined if they don't comply. Kiss freedom goodbye.

    I agree with the lady from california, I am sixty six years old and for the first time in my life I am ashamed of our government.

These results also illustrate that distant supervision allows for flexibility in construction of the classifier. Different keywords show little change in classifier performance.

The General Dataset experiment evaluates classifier performance in the wild. The keyword approach again scores below those trained on noisy labels. It classifies most tweets as apolitical and thus achieves very low recall for tweets that are actually about politics. On the other hand, distant supervision creates classifiers that over-extract political tweets. This is a result of using balanced datasets in training; such effects can be mitigated by changing the training balance. Even so, four of the five distantly trained classifiers score higher than the raw keyword approach. The only underperformer was PC-1, which suggests that when building a classifier for a relatively broad topic like politics, a variety of keywords is important.

The next section takes the output from our classifiers (i.e., our topic-relevant tweets) and evaluates a fully automated sentiment analysis algorithm against real-world polling data.

5 Targeted Sentiment Analysis

The previous section evaluated algorithms that extract topic-relevant tweets. We now evaluate methods to distill the overall sentiment that they express. This section compares two common approaches to sentiment analysis.
We first replicated the technique used in O'Connor et al. (2010), in which a lexicon of positive and negative sentiment words called OpinionFinder (Wilson and Hoffmann, 2005) is used to evaluate the sentiment of each tweet (others have used similar lexicons (Kramer, 2010; Thelwall et al., 2010)). We compare our full distantly supervised approach to theirs. We also experimented with SentiStrength, a lexicon-based program built to identify sentiment in online comments of the social media website MySpace. Though MySpace is close in genre to Twitter, we did not observe a performance gain. All reported results thus use OpinionFinder to facilitate a more accurate comparison with previous work.

Second, we built a distantly supervised system using tweets containing emoticons as done in previous work (Read, 2005; Go et al., 2009; Bifet and Frank, 2010; Pak and Paroubek, 2010; Davidov et al., 2010; Kouloumpis et al., 2011). Although distant supervision has previously been shown to outperform sentiment lexicons, these evaluations do not consider the extra topic identification step.

5.1 Sentiment Lexicon

The OpinionFinder lexicon is a list of 2,304 positive and 4,151 negative sentiment terms (Wilson and Hoffmann, 2005). We ignore neutral words in the lexicon and we do not differentiate between weak and strong sentiment words. A tweet is labeled positive if it contains any positive terms, and negative if it contains any negative terms. A tweet can be marked as both positive and negative, and if a tweet contains words in neither category, it is marked neutral. This procedure is the same as used by O'Connor et al. (2010). The sentiment scores S_pos and S_neg for a given set of N tweets are calculated as follows:

    S_{pos} = \frac{\sum_x 1\{x_{label} = \text{positive}\}}{N}    (1)

    S_{neg} = \frac{\sum_x 1\{x_{label} = \text{negative}\}}{N}    (2)

where 1\{x_{label} = \text{positive}\} is 1 if the tweet x is labeled positive, and N is the number of tweets in the corpus. For the sake of comparison, we also calculate a sentiment ratio as done in O'Connor et al. (2010):

    S_{ratio} = \frac{\sum_x 1\{x_{label} = \text{positive}\}}{\sum_x 1\{x_{label} = \text{negative}\}}    (3)
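A minimal Python sketch of Equations 1–3, assuming each tweet has already been assigned a set of sentiment labels; the data layout is illustrative, not the authors' implementation.

```python
def sentiment_scores(labels):
    """Compute S_pos, S_neg and S_ratio (Equations 1-3) from a list of
    per-tweet label sets, e.g. [{"positive"}, {"negative", "positive"}, set()]."""
    n = len(labels)
    pos = sum(1 for l in labels if "positive" in l)
    neg = sum(1 for l in labels if "negative" in l)
    s_pos = pos / n
    s_neg = neg / n
    s_ratio = pos / neg if neg else float("inf")  # guard against division by zero
    return s_pos, s_neg, s_ratio
```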
5.2 Distant Supervision

To build a trained classifier, we automatically generated a positive training set by searching for tweets that contain at least one positive emoticon and no negative emoticons. We generated a negative training set using an analogous process. The emoticon symbols used for positive sentiment were :) =) :-) :] =] :-] :} :o) :D =D :-D :P =P :-P C:. Negative emoticons were :( =( :-( :[ =[ :-[ :{ :-c :c} D: D= :S :/ =/ :-/ :( : (. Using this data, we train a multinomial Naive Bayes classifier using the same method used for the political classifiers described in Section 4.1. This classifier is then used to label topic-specific tweets as expressing positive or negative sentiment. Finally, the three overall sentiment scores S_pos, S_neg, and S_ratio are calculated from the results.
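The emoticon-based labelling and classifier training just described can be sketched as follows. The snippet uses scikit-learn's CountVectorizer and MultinomialNB as stand-ins for the training procedure of Section 4.1 and abbreviates the emoticon lists; it illustrates the technique under those assumptions rather than reproducing the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

POS = [":)", "=)", ":-)", ":]", ":D", ":P"]   # abbreviated positive emoticons
NEG = [":(", "=(", ":-(", ":[", "D:", ":/"]   # abbreviated negative emoticons

def emoticon_label(tweet):
    """Return 'pos'/'neg' for unambiguous tweets, None otherwise."""
    has_pos = any(e in tweet for e in POS)
    has_neg = any(e in tweet for e in NEG)
    if has_pos and not has_neg:
        return "pos"
    if has_neg and not has_pos:
        return "neg"
    return None

def train_emoticon_classifier(tweets):
    """Distantly supervised sentiment classifier from emoticon-labelled tweets."""
    data = [(t, emoticon_label(t)) for t in tweets]
    data = [(t, y) for t, y in data if y is not None]
    texts, labels = zip(*data)
    vec = CountVectorizer(lowercase=True)
    clf = MultinomialNB()
    clf.fit(vec.fit_transform(texts), labels)
    return vec, clf
```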
6 Predicting Approval Polls

This section uses the two-stage Targeted Sentiment Analysis system described above in a real-world setting. We analyze the sentiment of Twitter users toward U.S. President Barack Obama. This allows us to both evaluate distant supervision against previous work on the topic, and demonstrate a practical application of the approach.

6.1 Experiment Setup

The following experiments combine both topic identification and sentiment analysis. The previous sections described six topic identification approaches, and two sentiment analysis approaches. We evaluate all combinations of these systems, and compare their final sentiment scores for each day in the nearly seven-month period that our dataset spans.

Gallup's Daily Job Approval reports two numbers: Approval and Disapproval. We calculate individual sentiment scores S_pos and S_neg for each day, and compare the two sets of trends using Pearson's correlation coefficient. O'Connor et al. do not explicitly evaluate these two, but instead use the ratio S_ratio. We also calculate this daily ratio from Gallup for comparison purposes by dividing the Approval by the Disapproval.
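Comparing a daily Twitter sentiment series against the corresponding Gallup series reduces to a single Pearson correlation; a short sketch, assuming two date-aligned lists of daily scores:

```python
from scipy.stats import pearsonr

def daily_correlation(twitter_scores, gallup_scores):
    """Pearson's r between a daily Twitter sentiment series and the
    corresponding Gallup series (both lists aligned by date)."""
    r, _p_value = pearsonr(twitter_scores, gallup_scores)
    return r
```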

6.2 Results and Discussion

The first set of results uses the lexicon-based classifier for sentiment analysis and compares the different topic identification approaches. The first table in Table 2 reports Pearson's correlation coefficient with Gallup's Approval and Disapproval ratings. Regardless of the Topic classifier, all systems inversely correlate with Presidential Approval. However, they correlate well with Disapproval. Figure 3 graphically shows the trend lines for the keyword and the distantly supervised system PC-1. The visualization illustrates how the keyword-based approach is highly influenced by day-by-day changes, whereas PC-1 displays a much smoother trend.

Table 2: Correlation between Gallup polling data and the extracted sentiment with a lexicon (trends shown in Figure 3) and distant supervision (Figure 4).

    Sentiment Lexicon
    Topic Classifier   Approval   Disapproval
    keyword            -0.22      0.42
    PC-1               -0.65      0.71
    PC-2               -0.61      0.71
    PC-3               -0.51      0.65
    PC-4               -0.49      0.60
    PC-5               -0.65      0.74

    Distantly Supervised Sentiment
    Topic Classifier   Approval   Disapproval
    keyword            0.27       0.38
    PC-1               0.71       0.73
    PC-2               0.33       0.46
    PC-3               0.05       0.31
    PC-4               0.08       0.26
    PC-5               0.54       0.62

Table 3: Correlation between Gallup Approval / Disapproval ratio and extracted sentiment ratio scores.

    Sentiment Lexicon
    keyword   PC-1   PC-2   PC-3   PC-4   PC-5
    .22       .63    .46    .33    .27    .61

    Distantly Supervised Sentiment
    keyword   PC-1   PC-2   PC-3   PC-4   PC-5
    .40       .64    .46    .30    .28    .60

The second set of results uses distant supervision for sentiment analysis and again varies the topic identification approach. The second table in Table 2 gives the correlation numbers and Figure 4 shows the keyword and PC-1 trend lines. The results are considerably better than when a lexicon is used for sentiment analysis. Approval is no longer inversely correlated, and two of the distantly supervised systems strongly correlate (PC-1, PC-5).

The best performing system (PC-1) used distant supervision for both topic identification and sentiment analysis. Pearson's correlation coefficient for this approach is 0.71 with Approval and 0.73 with Disapproval.

Finally, we compute the ratio S_ratio between the positive and negative sentiment scores (Equation 3) to compare to O'Connor et al. (2010). Table 3 shows the results. The distantly supervised topic identification algorithms show little change between a sentiment lexicon or a classifier. However, O'Connor et al.'s keyword approach improves when used with a distantly supervised sentiment classifier (.22 to .40). Merging Approval and Disapproval into one ratio appears to mask the sentiment lexicon's poor correlation with Approval. The ratio may not be an ideal evaluation metric for this reason. Real-world consumers of Presidential Approval ratings want separate Approval and Disapproval scores, as Gallup reports. Our results (Table 2) show that distant supervision avoids a negative correlation with Approval, but the ratio hides this important advantage.

One reason the ratio may mask the negative Approval correlation is because tweets are often classified as both positive and negative by a lexicon (Section 5.1). This could explain the behavior seen in Figure 3 in which both the positive and negative sentiment scores rise over time. However, further experimentation did not rectify this pattern. We revised S_pos and S_neg to make binary decisions for a lexicon: a tweet is labeled positive if it strictly contains more positive words than negative (and vice versa). Correlation showed little change. Approval was still negatively correlated, Disapproval positive (although less so in both), and the ratio scores actually dropped further. The sentiment ratio continued to hide the poor Approval performance by a lexicon.

6.3 New Baseline: Topic-Neutral Sentiment

Distant supervision for sentiment analysis outperforms that with a sentiment lexicon (Table 2). Distant supervision for topic identification further improves the results (PC-1 v. keyword). The best system uses distant supervision in both stages (PC-1 with distantly supervised sentiment), outperforming the purely keyword-based algorithm of O'Connor et al. (2010). However, the question of how important topic identification is has not yet been addressed here or in the literature.

Both O'Connor et al. (2010) and Tumasjan et al. (2010) created joint systems with two topic identification and sentiment analysis stages.
Sentiment Lexicon
Figure 3: Presidential job approval and disapproval calculated using two different topic identification techniques, and using a sentiment lexicon for sentiment analysis. Gallup polling results are shown in black.

Distantly Supervised Sentiment
Figure 4: Presidential job approval sentiment scores calculated using two different topic identification techniques, and using the emoticon classifier for sentiment analysis. Gallup polling results are shown in black.

Topic-Neutral Sentiment
Figure 5: Presidential job approval sentiment scores calculated using the entire twitter corpus, with two different techniques for sentiment analysis. Gallup polling results are shown in black for comparison.
Table 4: Pearson's correlation coefficient of Sentiment Analysis without Topic Identification.

    Topic-Neutral Sentiment
    Algorithm         Approval   Disapproval
    Distant Sup.      0.69       0.74
    Keyword Lexicon   -0.63      0.69

But what if the topic identification step were removed and sentiment analysis instead run on the entire Twitter corpus? To answer this question, we ran the distantly supervised emoticon classifier to classify all tweets in the 7 months of Twitter data. For each day, we computed the positive and negative sentiment scores as above. The evaluation is identical, except for the removal of topic identification. Correlation results are shown in Table 4.

This baseline parallels the results seen when topic identification is used: the sentiment lexicon is again inversely correlated with Approval, and distant supervision outperforms the lexicon approach in both ratings. This is not surprising given previous distantly supervised work on sentiment analysis (Go et al., 2009; Davidov et al., 2010; Kouloumpis et al., 2011). However, our distant supervision also performs as well as the best performing topic-specific system. The best performing topic classifier, PC-1, correlated with Approval with r=0.71 (0.69 here) and Disapproval with r=0.73 (0.74 here). Computing overall sentiment on Twitter performs as well as political-specific sentiment. This unintuitive result suggests a new baseline that all topic-based systems should compute.

7 Discussion

This paper introduces a new methodology for gleaning topic-specific sentiment information. We highlight four main contributions here.

First, this work is one of the first to evaluate distant supervision for topic identification. All five political classifiers outperformed the lexicon-driven keyword equivalent that has been widely used in the past. Our model achieved .90 F1 compared to the keyword .39 F1 on our political tweet dataset. On Twitter as a whole, distant supervision increased F1 by over 100%. The results also suggest that performance is relatively insensitive to the specific choice of seed keywords that are used to select the training set for the political classifier.

Second, the sentiment analysis experiments build upon what has recently been shown in the literature: distant supervision with emoticons is a valuable methodology. We also expand upon prior work by discovering drastic performance differences between positive and negative lexicon words. The OpinionFinder lexicon failed to correlate (inversely) with Gallup's Approval polls, whereas a distantly trained classifier correlated strongly with both Approval and Disapproval (Pearson's .71 and .73). We only tested OpinionFinder and SentiStrength, so it is possible that another lexicon might perform better. However, our results suggest that lexicons vary in their quality across sentiment, and distant supervision may provide more robustness.

Third, our results outperform previous work on Presidential Job Approval prediction (O'Connor et al., 2010). We presented two novel approaches to the domain: a coupled distantly supervised system, and a topic-neutral baseline, both of which outperform previous results. In fact, the baseline surprisingly matches or outperforms the more sophisticated approaches that use topic identification. The baseline correlates .69 with Approval and .74 with Disapproval. This suggests a new baseline that should be used in all topic-specific sentiment applications.

Fourth, we described and made available two new annotated datasets of political tweets to facilitate future work in this area.

Finally, Twitter users are not a representative sample of the U.S. population, yet the high correlation between political sentiment on Twitter and Gallup ratings makes these results all the more intriguing for polling methodologies. Our specific 7-month period of time differs from previous work, and thus we hesitate to draw strong conclusions from our comparisons or to extend implications to non-political domains. Future work should further investigate distant supervision as a tool to assist topic detection in microblogs.

Acknowledgments

We thank Jure Leskovec for the Twitter data, Brendan O'Connor for open and frank correspondence, and the reviewers for helpful suggestions.
References

Luciano Barbosa and Junlan Feng. 2010. Robust sentiment detection on twitter from biased and noisy data. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010).

Albert Bifet and Eibe Frank. 2010. Sentiment knowledge discovery in twitter streaming data. In Lecture Notes in Computer Science, volume 6332, pages 1–15.

Paula Carvalho, Luis Sarmento, Jorge Teixeira, and Mario J. Silva. 2011. Liars and saviors in a sentiment annotated corpus of comments to political debates. In Proceedings of the Association for Computational Linguistics (ACL-2011), pages 564–568.

Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Enhanced sentiment learning using twitter hashtags and smileys. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010).

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. Technical report.

Sandra Gonzalez-Bailon, Rafael E. Banchs, and Andreas Kaltenbrunner. 2010. Emotional reactions and the pulse of public opinion: Measuring the impact of political events on the sentiment of online discussions. Technical report.

Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. 2011. Target-dependent twitter sentiment classification. In Proceedings of the Association for Computational Linguistics (ACL-2011).

Efthymios Kouloumpis, Theresa Wilson, and Johanna Moore. 2011. Twitter sentiment analysis: The good the bad and the omg! In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media.

Adam D. I. Kramer. 2010. An unobtrusive behavioral model of gross national happiness. In Proceedings of the 28th International Conference on Human Factors in Computing Systems (CHI 2010).

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL '09, pages 1003–1011.

Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of the AAAI Conference on Weblogs and Social Media.

Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference On Language Resources and Evaluation (LREC).

Jonathon Read. 2005. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In Proceedings of the ACL Student Research Workshop (ACL-2005).

Chenhao Tan, Lillian Lee, Jie Tang, Long Jiang, Ming Zhou, and Ping Li. 2011. User-level sentiment analysis incorporating social networks. In Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.

Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544–2558.

Mike Thelwall, Kevan Buckley, and Georgios Paltoglou. 2011. Sentiment in twitter events. Journal of the American Society for Information Science and Technology, 62(2):406–418.

Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe. 2010. Election forecasts with twitter: How 140 characters reflect the political landscape. Social Science Computer Review.

T. Wilson, J. Wiebe, and P. Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing.
Learning from evolving data streams: online triage of bug reports

Grzegorz Chrupala
Spoken Language Systems
Saarland University
gchrupala@lsv.uni-saarland.de

Abstract

Open issue trackers are a type of social media that has received relatively little attention from the text-mining community. We investigate the problems inherent in learning to triage bug reports from time-varying data. We demonstrate that concept drift is an important consideration. We show the effectiveness of online learning algorithms by evaluating them on several bug report datasets collected from open issue trackers associated with large open-source projects. We make this collection of data publicly available.

1 Introduction

There has been relatively little research to date on applying machine learning and Natural Language Processing techniques to automate software project workflows. In this paper we address the problem of bug report triage.

1.1 Issue tracking

Large software projects typically track defect reports, feature requests and other issue reports using an issue tracker system. Open source projects tend to use trackers which are open to both developers and users. If the product has many users its tracker can receive an overwhelming number of issue reports: Mozilla was receiving almost 300 reports per day in 2006 (Anvik et al. 2006). Someone has to monitor those reports and triage them, that is, decide which component they affect and which developer or team of developers should be responsible for analyzing them and fixing the reported defects. An automated agent assisting the staff responsible for such triage has the potential to substantially reduce the time and cost of this task.

1.2 Issue trackers as social media

In a large software project with a loose, not strictly hierarchical organization, standards and practices are not exclusively imposed top-down but also tend to spontaneously arise in a bottom-up fashion, arrived at through interaction of individual developers, testers and users. The individuals involved may negotiate practices explicitly, but may also imitate and influence each other via implicitly acquired reputation and status. This process has a strong emergent component: an informal taxonomy may arise and evolve in an issue tracker via the use of free-form tags or labels. Developers, testers and users can attach tags to their issue reports in order to informally classify them. The issue tracking software may give users feedback by informing them which tags were frequently used in the past, or suggest tags based on the content of the report or other information. Through this collaborative, feedback-driven process involving both human and machine participants, an evolving consensus on the label inventory and semantics typically arises, without much top-down control (Halpin et al. 2007).

This kind of emergent taxonomy is known as a folksonomy or collaborative tagging and is very common in the context of social web applications. Large software projects, especially those with open policies and little hierarchical structure, tend to exhibit many of the same emergent social properties as the more prototypical social applications. While this is a useful phenomenon, it presents a special challenge from the machine-learning point of view.
1.3 Concept drift

Many standard supervised approaches in machine learning assume a stationary distribution from which training examples are independently drawn. The set of training examples is processed as a batch, and the resulting learned decision function (such as a classifier) is then used on test items, which are assumed to be drawn from the same stationary distribution.

If we need an automated agent which uses human labels to learn to tag objects, the batch learning approach is inadequate. Examples arrive one-by-one in a stream, not as a batch. Even more importantly, both the output (label) distribution and the input distribution from which the examples come are emphatically not stationary. As a software project progresses and matures, the type of issues reported is going to change. As project members and users come and go, the vocabulary they use to describe the issues will vary. As the consensus tag folksonomy emerges, the label and training example distribution will evolve. This phenomenon is sometimes referred to as concept drift (Widmer and Kubat 1996, Tsymbal 2004).

Early research on learning to triage tended to either not notice the problem (Cubranic and Murphy 2004), or acknowledge but not address it (Anvik et al. 2006): the evaluation these authors used assigned bug reports randomly to training and evaluation sets, discarding the temporal sequencing of the data stream.

Bhattacharya and Neamtiu (2010) explicitly address the issue of online training and evaluation. In their setup, the system predicts the output for an item based only on items preceding it in time. However, their approach to incremental learning is simplistic: they use a batch classifier, but retrain it from scratch after receiving each training example. A fully retrained batch classifier will adapt only slowly to a changing data stream, as more recent examples have no more influence on the decision function than less recent ones.

Tamrawi et al. (2011) propose an incremental approach to bug triage: the classes are ranked according to a fuzzy set membership function, which is based on incrementally updated feature/class co-occurrence counts. The model is efficient in online classification, but also adapts only slowly.

1.4 Online learning

This paucity of research on online learning from issue tracker streams is rather surprising, given that truly incremental learners have been well-known for many years. In fact one of the first learning algorithms proposed was Rosenblatt's perceptron, a simple mistake-driven discriminative classification algorithm (Rosenblatt 1958). In the current paper we address this situation and show that by using simple, standard online learning methods we can improve on batch or pseudo-online learning. We also show that when using a sophisticated state-of-the-art stochastic gradient descent technique the performance gains can be quite large.

1.5 Contributions

Our main contributions are the following: Firstly, we explicitly show that concept drift is pervasive and serious in real bug report streams. We then address this problem by leveraging state-of-the-art online learning techniques which automatically track the evolving data stream and incrementally update the model after each data item. We also adopt the continuous evaluation paradigm, where the learner predicts the output for each example before using it to update the model. Secondly, we address the important issue of reproducibility in research in bug triage automation by making available the data sets which we collected and used, in both their raw and preprocessed forms.

2 Open issue-tracker data

Open source software repositories and their associated issue trackers are a naturally occurring source of large amounts of (partially) labeled data. There seems to be growing interest in exploiting this rich resource, as evidenced by existing publications as well as the appearance of a dedicated workshop (Working Conference on Mining Software Repositories).

In spite of the fact that the data is publicly available in open repositories, it is not possible to directly compare the results of the research conducted on bug triage so far: authors use non-trivial project-specific filtering, re-labeling and pre-processing heuristics; these steps are usually not specified in enough detail that they could be easily reproduced.
Table 1: Issue report record

    Field         Meaning
    Identifier    Issue ID
    Title         Short description of issue
    Description   Content of issue report, which may include steps to reproduce, error messages, stack traces etc.
    Author        ID of report submitter
    CCS           List of IDs of people CC'd on the issue report
    Labels        List of tags associated with issue
    Status        Label describing the current status of the issue (e.g. Invalid, Fixed, Won't Fix)
    Assigned To   ID of person who has been assigned to deal with the issue
    Published     Date on which issue report was submitted

To help remedy this situation we decided to collect data from several open issue trackers, use the minimal amount of simple preprocessing and filter heuristics to get useful input data, and publicly share both the raw and preprocessed data.

We designed a simple record type which acts as a common denominator for several tracker formats. Thus we can use a common representation for issue reports from various trackers. The fields in our record are shown in Table 1.

Below we describe the issue trackers used and the datasets we build from them. As discussed above (and in more detail in Section 4.1), we use progressive validation rather than a split into training and test set. However, in order to avoid developing on the test data, we split each data stream into two substreams, by assigning odd-numbered examples to the test stream and the even-numbered ones to the development stream. We can use the development stream for exploratory data analysis and feature and parameter tuning, and then use progressive validation to evaluate on entirely unseen test data. Below we specify the size and number of unique labels in the development sets; the test sets are very similar in size.

Chromium  Chromium is the open-source project behind Google's Chrome browser (http://code.google.com/p/chromium/). We retrieved all the bugs from the issue tracker, of which 66,704 have one of the closed statuses. We generated two data sets from the Chromium issues:

    Chromium SUBCOMPONENT. Chromium uses special tags to help triage the bug reports. Tags prefixed with Area- specify which subcomponent of the project the bug should be routed to. In some cases more than one Area- tag is present. Since this affects less than 1% of reports, for simplicity we treat these as single, compound labels. The development set contains 31,953 items, and 75 unique output labels.

    Chromium ASSIGNED. In this dataset the output is the value of the assignedTo field. We discarded issues where the field was left empty, as well as the ones which contained the placeholder value all-bugs-test.chromium.org. The development set contains 16,154 items and 591 unique output labels.

Android  Android is a mobile operating system project (http://code.google.com/p/android/). We retrieved all the bug reports, of which 6,341 had a closed status. We generated two datasets:

    Android SUBCOMPONENT. The reports which are labeled with tags prefixed with Component-. The development set contains 888 items and 12 unique output labels.

    Android ASSIGNED. The output label is the value of the assignedTo field. We discarded issues with the field left empty. The development set contains 718 items and 72 unique output labels.

Firefox  Firefox is the well-known web-browser project (https://bugzilla.mozilla.org). We obtained a total of 81,987 issues with a closed status.

    Firefox ASSIGNED. We discarded issues where the field was left empty, as well as the ones which contained a placeholder value (nobody). The development set contains 12,733 items and 503 unique output labels.

Launchpad  Launchpad is an issue tracker run by Canonical Ltd for mostly Ubuntu-related projects (https://bugs.launchpad.net/). We obtained a total of 99,380 issues with a closed status.

    Launchpad ASSIGNED. We discarded issues where the field was left empty. The development set contains 18,634 items and 1,970 unique output labels.
3 Analysis of concept drift

In the introduction we have hypothesized that in issue tracker streams concept drift would be an especially acute problem. In this section we show how class distributions evolve over time in the data we collected.

A time-varying distribution is difficult to summarize with a single number, but it is easy to appreciate in a graph. Figures 1 and 2 show concept drift for several of our data streams. The horizontal axis indexes the position in the data stream. The vertical axis shows the class proportions at each position, averaged over a window containing 7% of all the examples in the stream, i.e. in each thin vertical bar the proportion of colors used corresponds to the smoothed class distribution at a particular position in the stream.

Figure 1: SUBCOMPONENT class distribution change over time

Consider the plot for Chromium SUBCOMPONENT. We can see that a bit before the middle point in the stream class proportions change quite dramatically: the orange BROWSERUI and violet MISC almost disappear, while blue INTERNALS, pink UI and dark red UNDEFINED take over. This likely corresponds to an overhaul in the label inventory and/or recommended best practice for triage in this project. There are also more gradual and smaller scale changes throughout the data stream.

The Android SUBCOMPONENT stream contains much less data so the plot is less smooth, but there are clear transitions in this image also. We see that light blue GOOGLE all but disappears after about the two-thirds point and the proportion of violet TOOLS and light-green DALVIK dramatically increases.

In Figure 2 we see the evolution of class proportions in the ASSIGNED datasets. Each plot's idiosyncratic shape illustrates that there is wide variation in the amount and nature of concept drift in different software project issue trackers.

4 Experimental results

In an online setting it is important to use an evaluation regime which closely mimics the continuous use of the system in a real-life situation.

4.1 Progressive validation

When learning from data streams the standard evaluation methodology where data is split into a separate training and test set is not applicable. An evaluation regime known as progressive validation has been used to accurately measure the generalization performance of online algorithms (Blum et al. 1999). Under progressive evaluation, an input example from a temporally ordered sequence is sent to the learner, which returns the prediction. The error incurred on this example is recorded, and the true output is only then sent to the learner, which may update its model based on it. The final error is the mean of the per-example errors. Thus even though there is no separate test set, the prediction for each input is generated based on a model trained on examples which do not include it.
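The progressive validation loop itself is only a few lines. The `predict`/`update` interface of the learner is an assumption made for illustration; it is not the interface of any specific system described here.

```python
def progressive_validation(learner, stream):
    """Progressive (prequential) validation: predict each example before
    learning from it, then average the per-example errors (Blum et al. 1999)."""
    errors = 0
    n = 0
    for features, true_label in stream:        # temporally ordered examples
        predicted = learner.predict(features)  # prediction from past data only
        errors += int(predicted != true_label)
        learner.update(features, true_label)   # only now reveal the true label
        n += 1
    return errors / n                          # mean per-example error
```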
Figure 2: ASSIGNED class distribution change over time

In previous work on bug report triage, Bhattacharya and Neamtiu (2010) and Tamrawi et al. (2011) used an evaluation scheme (close to) progressive validation. They omit the initial 1/11th of the examples from the mean.

4.2 Mean reciprocal rank

A bug report triaging agent is most likely to be used in a semi-automatic workflow, where a human triager is presented with a ranked list of possible outputs (component labels or developer IDs). As such it is important to evaluate not only the accuracy of the top-ranked suggestion, but rather the quality of the whole ranked list.

Previous research (Bhattacharya and Neamtiu 2010, Tamrawi et al. 2011) made an attempt at approximating this criterion by reporting scores which indicate whether the true output is present in the top n elements of the ranking, for several values of n. Here we suggest borrowing the mean reciprocal rank (MRR) metric from the information retrieval domain (Voorhees 2000). It is defined as the mean of the reciprocals of the rank at which the true output is found:

    \mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{rank}(i)^{-1}

where rank(i) indicates the rank of the ith true output. MRR has the advantage of providing a single number which summarizes the quality of whole rankings for all the examples. MRR is also a special case of Mean Average Precision when there is only one true output per item.
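A small sketch of the MRR computation, assuming each model returns a full ranking of the candidate labels so the true label always appears somewhere in the list:

```python
def mean_reciprocal_rank(rankings, true_labels):
    """MRR over a stream: rankings[i] is the model's ranked label list for
    example i and true_labels[i] is its gold label."""
    total = 0.0
    for ranking, gold in zip(rankings, true_labels):
        rank = ranking.index(gold) + 1   # 1-based rank of the true output
        total += 1.0 / rank
    return total / len(true_labels)
```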
4.3 Input representation

Since in this paper we focus on the issues related to concept drift and online learning, we kept the feature set relatively simple. We preprocess the text in the issue report title and description fields by removing HTML markup, tokenizing, lowercasing and removing most punctuation. We then extracted the following feature types:

- Title unigram and bigram counts
- Description unigram and bigram counts
- Author ID (binary indicator feature)
- Year, month and day of submission (binary indicator features)

4.4 Models

We tested a simple online baseline, a pseudo-online algorithm which uses a batch model and repeatedly retrains it, an online model used in previous research on bug triage, and two generic online learning algorithms.

Window Frequency Baseline  This baseline does not use any input features. It outputs the ranked list of labels for the current item based on the relative frequencies of output labels in the window of k previous items. We tested windows of size 100 and 1000 and report the better result.

SVM Minibatch  This model uses the multiclass linear Support Vector Machine model (Crammer and Singer 2002) as implemented in SVM Light (Joachims 1999). SVM is known as a state-of-the-art batch model in classification in general and in text categorization in particular. The output classes for an input example are ranked according to the value of the discriminant values returned by the SVM classifier. In order to adapt the model to an online setting we retrain it every n examples on the window of k previous examples. The parameters n and k can have large influence on the prediction, but it is not clear how to set them when learning from streams. Here we chose the values (100, 1000) based on how feasible the run time was and on the performance during exploratory experiments on Chromium SUBCOMPONENT. Interestingly, keeping the window parameter relatively small helps performance: a window of 1,000 works better than a window of 5,000.

Perceptron  We implemented a single-pass online multiclass Perceptron with a constant learning rate. It maintains a weight vector for each output seen so far: the prediction function ranks outputs according to the inner product of the current example with the corresponding weight vector. The update function takes the true output and the predicted output. If they are not equal, the current input is subtracted from the weight vector corresponding to the predicted output and added to the weight vector corresponding to the true output (see Algorithm 1). We hash each feature to an integer value and use it as the feature's index in the weight vectors in order to bound memory usage in an online setting (Weinberger et al. 2009). The Perceptron is a simple but strong baseline for online learning.

Algorithm 1 Multiclass online perceptron
    function PREDICT(Y, W, x)
        return {(y, W_y^T x) | y ∈ Y}
    procedure UPDATE(W, x, y, ŷ)
        if y ≠ ŷ then
            W_ŷ ← W_ŷ − x
            W_y ← W_y + x
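A compact sketch of this hashed multiclass perceptron (Algorithm 1 plus the feature-hashing step) is given below. The class and method names are illustrative assumptions, and the constant learning rate of 1 is implicit in the unit-sized updates; this is not the system's actual implementation.

```python
class HashedPerceptron:
    """Single-pass multiclass perceptron with hashed feature indices."""

    def __init__(self, n_bits=20):
        self.mask = (1 << n_bits) - 1   # bound memory usage via hashing
        self.weights = {}               # label -> {hashed index -> weight}

    def _index(self, feature):
        return hash(feature) & self.mask

    def predict(self, features):
        """Return labels ranked by the inner product <W_y, x>."""
        scores = {y: sum(w.get(self._index(f), 0.0) * v
                         for f, v in features.items())
                  for y, w in self.weights.items()}
        return sorted(scores, key=scores.get, reverse=True)

    def update(self, features, true_label):
        self.weights.setdefault(true_label, {})
        ranking = self.predict(features)
        predicted = ranking[0] if ranking else None
        if predicted != true_label:            # mistake-driven update
            for f, v in features.items():
                i = self._index(f)
                if predicted is not None:
                    wp = self.weights[predicted]
                    wp[i] = wp.get(i, 0.0) - v
                wt = self.weights[true_label]
                wt[i] = wt.get(i, 0.0) + v
```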
Bugzie  This is the model described in Tamrawi et al. (2011). The output classes are ranked according to the fuzzy set membership function defined as follows:

    \mu(y, X) = 1 - \prod_{x \in X} \left(1 - \frac{n(y, x)}{n(y) + n(x) - n(y, x)}\right)

where y is the output label, X the set of features in the input issue report, n(y, x) the number of examples labeled as y which contain feature x, n(y) the number of examples labeled y, and n(x) the number of examples containing feature x. The counts are updated online. Tamrawi et al. (2011) also use two so-called caches: the label cache keeps the j% most recent labels and the term cache the k most significant features for each label. Since in Tamrawi et al. (2011)'s experiments the label cache did not affect the results significantly, here we always set j to 100%. We select the optimal k parameter from {100, 1000, 5000} based on the development set.
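The fuzzy membership score can be computed directly from the three count tables; the dictionary-based layout below is an illustrative assumption, not Bugzie's actual data structures.

```python
def bugzie_score(label, features, n_yx, n_y, n_x):
    """Fuzzy set membership mu(y, X) from Tamrawi et al. (2011):
    mu(y, X) = 1 - prod_{x in X} (1 - n(y,x) / (n(y) + n(x) - n(y,x)))."""
    product = 1.0
    for x in features:
        nyx = n_yx.get((label, x), 0)
        denom = n_y.get(label, 0) + n_x.get(x, 0) - nyx
        share = nyx / denom if denom > 0 else 0.0
        product *= (1.0 - share)
    return 1.0 - product
```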
Regression with Stochastic Gradient Descent  This model performs online multiclass learning by means of a reduction to regression. The regressor is a linear model trained using Stochastic Gradient Descent (Zhang 2004). SGD updates the current parameter vector w^{(t)} based on the gradient of the loss incurred by the regressor on the current example (x^{(t)}, y^{(t)}):

    w^{(t+1)} = w^{(t)} - \eta^{(t)} \nabla L(y^{(t)}, w^{(t)\top} x^{(t)})

The parameter \eta^{(t)} is the learning rate at time t, and L is the loss function. We use the squared loss:

    L(y, \hat{y}) = (y - \hat{y})^2

We reduce multiclass learning to regression using a one-vs-all-type scheme, by effectively transforming an example (x, y) ∈ X × Y into |Y| examples (x', y') ∈ X' × {0, 1}, where Y is the set of labels seen so far. The transform T is defined as follows:

    T(x, y) = \{(x', I(y = y')) \mid y' \in Y,\; x'_{h(i, y')} = x_i\}

where h(i, y') composes the index i with the label y' (by hashing). For a new input x the ranking of the outputs y ∈ Y is obtained according to the value of the prediction of the base regressor on the binary example corresponding to each class label.

As our basic regression learner we use the efficient implementation of regression via SGD, Vowpal Wabbit (VW) (Langford et al. 2011). VW implements setting adaptive individual learning rates for each feature as proposed by Duchi et al. (2010), McMahan and Streeter (2010). This is appropriate when there are many sparse features, and is especially useful in learning from text from fast evolving data. The features such as unigram and bigram counts that we rely on are notoriously sparse, and this is exacerbated by the change over time in bug report streams.
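The one-vs-all reduction with label-feature hashing can be sketched as follows; `regressor.predict` stands for the base SGD regressor (in our case Vowpal Wabbit) and is an assumed interface for illustration, not VW's actual API.

```python
def transform(features, label, n_bits=24):
    """Hash each (feature, label) pair into a shared weight space, following
    the transform T: the binary example for label y' uses indices h(i, y')."""
    mask = (1 << n_bits) - 1
    return {hash((f, label)) & mask: v for f, v in features.items()}

def rank_labels(regressor, features, labels):
    """Rank candidate labels by the base regressor's prediction on the
    corresponding binary example (one-vs-all reduction to regression)."""
    scores = {y: regressor.predict(transform(features, y)) for y in labels}
    return sorted(scores, key=scores.get, reverse=True)
```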
4.5 Results

Figures 3 and 4 show the progressive validation results on all the development data streams. The horizontal lines indicate the mean MRR scores for the whole stream. The curves show a moving average of MRR in a window comprised of 7% of the total number of items. In most of the plots it is evident how the prediction performance depends on the concept drift illustrated in the plots in Section 3: for example on Chromium SUBCOMPONENT the performance of all the models drops a bit before the midpoint in the stream while the learners adapt to the change in label distribution that is happening at this time. This is especially pronounced for Bugzie, since it is not able to learn from mistakes and adapt rapidly, but simply accumulates counts.

Figure 3: SUBCOMPONENT evaluation results on the development set

Figure 4: ASSIGNED evaluation results on the development set

For five out of the six datasets, Regression SGD gives the best overall performance. On Launchpad ASSIGNED, Bugzie scores higher; we investigate this anomaly below.

Another observation is that the window-based frequency baseline can be quite hard to beat: in three out of the six cases, the minibatch SVM model is no better than the baseline. Bugzie sometimes performs quite well, but for Chromium SUBCOMPONENT and Firefox ASSIGNED it scores below the baseline.

Regarding the quality of the different datasets, an interesting indicator is the relative error reduction by the best model over the baseline (see Table 2). It is especially hard to extract meaningful information about the labeling from the inputs on the Firefox ASSIGNED dataset. One possible cause of this can be that the assignment labeling practices in this project are not consistent: this impression seems to be borne out by informal inspection.

Table 2: Best model's error relative to baseline on the development set

    Dataset          RER
    Chromium SUB     0.36
    Android SUB      0.38
    Chromium AS      0.21
    Android AS       0.19
    Firefox AS       0.16
    Launchpad AS     0.49

On the other hand, as the scores in Table 2 indicate, Chromium SUBCOMPONENT, Android SUBCOMPONENT and Launchpad ASSIGNED contain enough high-quality signal for the best model to substantially outperform the label frequency baseline.

On Launchpad ASSIGNED Regression SGD performs worse than Bugzie. The concept drift plot for these data suggests one reason: there is very little change in class distribution over time as compared to the other datasets. In fact, even though the issue reports in Launchpad range from year 2005 to 2011, the more recent ones are heavily overrepresented: 84% of the items in the development data are from 2011. Thus fast adaptation is less important in this case and Bugzie is able to perform well.

On the other hand, the reason for the less than stellar score achieved with Regression SGD is due to another special feature of this dataset: it has by far the largest number of labels, almost 2,000. This degrades the performance of the one-vs-all scheme we use with SGD Regression. Preliminary investigation indicates that the problem is mostly caused by our application of the hashing trick to feature-label pairs (see Section 4.4), which leads to excessive collisions with very large label sets. Our current implementation can use at most 29-bit hashes, which is insufficient for datasets like Launchpad ASSIGNED. We are currently removing this limitation and we expect it will lead to substantial gains on massively multiclass problems.

In Tables 3 and 4 we present the overall MRR results on the test data streams. The picture is similar to the development data discussed above.

Table 3: SUBCOMPONENT evaluation results on test set

    Task       Model        MRR      Acc
    Chromium   Window       0.5747   0.3467
               SVM          0.5766   0.4535
               Perceptron   0.5793   0.4393
               Bugzie       0.4971   0.2638
               Regression   0.7271   0.5672
    Android    Window       0.5209   0.3080
               SVM          0.5459   0.4255
               Perceptron   0.5892   0.4390
               Bugzie       0.6281   0.4614
               Regression   0.7012   0.5610

Table 4: ASSIGNED evaluation results on test set

    Task        Model        MRR      Acc
    Chromium    Window       0.0999   0.0472
                SVM          0.0908   0.0550
                Perceptron   0.1817   0.1128
                Bugzie       0.2063   0.0960
                Regression   0.3074   0.2157
    Android     Window       0.3198   0.1684
                SVM          0.2541   0.1684
                Perceptron   0.3225   0.2057
                Bugzie       0.3690   0.2086
                Regression   0.4446   0.2951
    Firefox     Window       0.5695   0.4426
                SVM          0.4604   0.4166
                Perceptron   0.5191   0.4306
                Bugzie       0.5402   0.4100
                Regression   0.6367   0.5245
    Launchpad   Window       0.0725   0.0337
                SVM          0.1006   0.0704
                Perceptron   0.3323   0.2607
                Bugzie       0.5271   0.4339
                Regression   0.4702   0.3879

5 Discussion and related work

Our results show that by choosing the appropriate learner for the scenario of learning from data streams, we can achieve much better results than by attempting to twist a batch algorithm to fit the online learning setting. Even a simple and well-known algorithm such as the Perceptron can be effective, but by using recent advances in research on SGD algorithms we can obtain substantial improvements on the best previously used approach. Below we review the research on bug report triage most relevant to our work.

Cubranic and Murphy (2004) seems to be the first attempt to automate bug triage. The authors cast bug triage as a text classification task and use the data representation (bag of words) and learning algorithm (Naive Bayes) typical for text classification at the time. They collect over 15,000 bug reports from the Eclipse project. The maximum accuracy they report is 30%, which was achieved by using 90% of the data for training.

In Anvik et al. (2006) the authors experiment with three learning algorithms: Naive Bayes, SVM and Decision Tree; SVM performs best in their experiments. They evaluate using precision and recall rather than accuracy. They report results on the Eclipse and Firefox projects, with precision 57% and 64% respectively, but very low recall (7% and 2%).

Matter et al. (2009) adopt a different approach to bug triage. In addition to the project's issue tracker data, they also use the source-code version control data. They build an expertise model for each developer which is a word count vector of the source code changes committed. They also build a word count vector for each bug report, and use the cosine between the report and the expertise model to rank developers. Using this approach (with a heuristic term weighting scheme) they report 33.6% accuracy on Eclipse.

Bhattacharya and Neamtiu (2010) acknowledge the evolving nature of bug report streams and attempt to apply incremental learning methods to bug triage. They use a two-step approach: first they predict the most likely developer to assign to a bug using a classifier. In a second step they rank candidate developers according to how likely they were to take over a bug from the developer predicted in the first step. Their approach to incremental learning simply involves fully retraining a batch classifier after each item in the data stream. They test their approach on fixed bugs in Mozilla and Eclipse, reporting accuracies of 27.5% and 38.2% respectively.

Tamrawi et al. (2011) propose the Bugzie model where developers are ranked according to the fuzzy set membership function as defined in Section 4.4. They also use the label (developer) cache and term cache to speed up processing and make the model adapt better to the evolving data stream. They evaluate Bugzie and compare its performance to the models used in Bhattacharya and Neamtiu (2010) on seven issue trackers: Bugzie has superior performance on all of them, ranging from 29.9% to 45.7% for top-1 output. They do not use separate validation sets for system development and parameter tuning.

In comparison to Bhattacharya and Neamtiu (2010) and Tamrawi et al. (2011), here we focus much more on the analysis of concept drift in data streams and on the evaluation of learning under its constraints. We also show that for evolving issue tracker data, in a large majority of cases SGD Regression handily outperforms Bugzie.

6 Conclusion

We demonstrate that concept drift is a real, pervasive issue for learning from issue tracker streams. We show how to adapt to it by leveraging recent research in online learning algorithms. We also make our dataset collection publicly available to enable direct comparisons between different bug triage systems.¹

We have identified a good learning framework for mining bug reports: in future we would like to explore smarter ways of extracting useful signals from the data by using more linguistically informed preprocessing and higher-level features such as word classes.

Acknowledgments

This work was carried out in the context of the Software-Cluster project EMERGENT and was partially funded by BMBF under grant number 01IC10S01O.

¹ Available from http://goo.gl/ZquBe
References

Anvik, J., Hiew, L., and Murphy, G. (2006). Who should fix this bug? In Proceedings of the 28th International Conference on Software Engineering, pages 361–370. ACM.

Bhattacharya, P. and Neamtiu, I. (2010). Fine-grained incremental learning and multi-feature tossing graphs to improve bug triaging. In International Conference on Software Maintenance (ICSM), pages 1–10. IEEE.

Blum, A., Kalai, A., and Langford, J. (1999). Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 203–208. ACM.

Crammer, K. and Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292.

Duchi, J., Hazan, E., and Singer, Y. (2010). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.

Halpin, H., Robu, V., and Shepherd, H. (2007). The complex dynamics of collaborative tagging. In Proceedings of the 16th International Conference on World Wide Web, pages 211–220. ACM.

Joachims, T. (1999). Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods – Support Vector Learning. MIT Press.

Langford, J., Hsu, D., Karampatziakis, N., Chapelle, O., Mineiro, P., Hoffman, M., Hofman, J., Lamkhede, S., Chopra, S., Faigon, A., Li, L., Rios, G., and Strehl, A. (2011). Vowpal Wabbit. https://github.com/JohnLangford/vowpal_wabbit/wiki.

Matter, D., Kuhn, A., and Nierstrasz, O. (2009). Assigning bug reports using a vocabulary-based expertise model of developers. In Sixth IEEE Working Conference on Mining Software Repositories.

McMahan, H. and Streeter, M. (2010). Adaptive bound optimization for online convex optimization. Arxiv preprint arXiv:1002.4908.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.

Tamrawi, A., Nguyen, T., Al-Kofahi, J., and Nguyen, T. (2011). Fuzzy set and cache-based approach for bug triaging. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pages 365–375. ACM.

Tsymbal, A. (2004). The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin.

Voorhees, E. (2000). The TREC-8 question answering track report. NIST Special Publication, pages 77–82.

Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. (2009). Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM.

Widmer, G. and Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101.

Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116. ACM.

Cubranic, D. and Murphy, G. C. (2004). Automatic bug triage using text categorization. In SEKE 2004: Proceedings of the Sixteenth International Conference on Software Engineering & Knowledge Engineering, pages 92–97. KSI Press.
Towards a model of formal and informal address in English

Manaal Faruqui
Computer Science and Engineering
Indian Institute of Technology
Kharagpur, India
manaalfar@gmail.com

Sebastian Padó
Institute of Computational Linguistics
Heidelberg University
Heidelberg, Germany
pado@cl.uni-heidelberg.de

Abstract

Informal and formal (T/V) address in dialogue is not distinguished overtly in modern English, e.g. by pronoun choice like in many other languages such as French (tu/vous). Our study investigates the status of the T/V distinction in English literary texts. Our main findings are: (a) human raters can label monolingual English utterances as T or V fairly well, given sufficient context; (b), a bilingual corpus can be exploited to induce a supervised classifier for T/V without human annotation. It assigns T/V at sentence level with up to 68% accuracy, relying mainly on lexical features; (c), there is a marked asymmetry between lexical features for formal speech (which are conventionalized and therefore general) and informal speech (which are text-specific).

1 Introduction

In many Indo-European languages, there are two pronouns corresponding to the English you. This distinction is generally referred to as the T/V dichotomy, from the Latin pronouns tu (informal, T) and vos (formal, V) (Brown and Gilman, 1960). The V form (such as Sie in German and Vous in French) can express neutrality or polite distance and is used to address social superiors. The T form (German du, French tu) is employed towards friends or addressees of lower social standing, and implies solidarity or lack of formality.

English used to have a T/V distinction until the 18th century, using you as V pronoun and thou for T. However, in contemporary English, you has taken over both uses, and the T/V distinction is not marked anymore. In NLP, this makes generation in English and translation into English easy. Conversely, many NLP tasks suffer from the lack of information about formality, e.g. the extraction of social relationships or, notably, machine translation from English into languages with a T/V distinction which involves a pronoun choice.

In this paper, we investigate the possibility to recover the T/V distinction for (monolingual) sentences of 19th and 20th-century English such as:

(1) Can I help you, Sir? (V)
(2) You are my best friend! (T)

After describing the creation of an English corpus of T/V labels via annotation projection (Section 3), we present an annotation study (Section 4) which establishes that taggers can indeed assign T/V labels to monolingual English utterances in context fairly reliably. Section 5 investigates how T/V is expressed in English texts by experimenting with different types of features, including words, semantic classes, and expressions based on Politeness Theory. We find word features to be most reliable, obtaining an accuracy of close to 70%.

2 Related Work

There is a large body of work on the T/V distinction in (socio-)linguistics and translation studies, covering in particular the conditions governing T/V usage in different languages (Kretzenbacher et al., 2006; Schüpbach et al., 2006) and the difficulties in translation (Ardila, 2003; Künzli, 2010). However, many observations from this literature are difficult to operationalize. Brown and Levinson (1987) propose a general theory of politeness which makes many detailed predictions. They assume that the pragmatic goal of being polite gives rise to general communication strategies, such as avoiding to "lose face" (cf. Section 5.2).

In computational linguistics, it is a common observation that for almost every language pair, there are distinctions that are expressed overtly

in one language, but remain covert in the other. Examples include morphology (Fraser, 2009) and tense (Schiehlen, 1998). A technique that is often applied in such cases is annotation projection, the use of parallel corpora to copy information from a language where it is overtly realized to one where it is not (Yarowsky and Ngai, 2001; Hwa et al., 2005; Bentivogli and Pianta, 2005).

[Figure 1: T/V label induction for English sentences in a parallel corpus with annotation projection. The German sentence "Darf ich Sie etwas fragen?" is aligned with "Please permit me to ask you a question." Step 1: the German pronoun provides an overt T/V label (V); Step 2: the T/V class label is copied to the English sentence.]

The phenomenon of formal and informal address has been considered in the contexts of translation into (Hobbs and Kameyama, 1990; Kanayama, 2003) and generation in Japanese (Bateman, 1988). Li and Yarowsky (2008) learn pairs of formal and informal constructions in Chinese with a paraphrase mining strategy. Other relevant recent studies consider the extraction of social networks from corpora (Elson et al., 2010). A related study is (Bramsen et al., 2011) which considers another sociolinguistic distinction, classifying utterances as "upspeak" and "downspeak" based on the social relationship between speaker and addressee.

This paper extends a previous pilot study (Faruqui and Padó, 2011). It presents more annotation, investigates a larger and better motivated feature set, and discusses the findings in detail.

3 A Parallel Corpus of Literary Texts

This section discusses the construction of T/V gold standard labels for English sentences. We obtain these labels from a parallel English-German corpus using the technique of annotation projection (Yarowsky and Ngai, 2001) sketched in Figure 1: We first identify the T/V status of German pronouns, then copy this T/V information onto the corresponding English sentence.

3.1 Data Selection and Preparation

Annotation projection requires a parallel corpus. We found commonly used parallel corpora like EUROPARL (Koehn, 2005) or the JRC Acquis corpus (Steinberger et al., 2006) to be unsuitable for our study since they either contain almost no direct address at all or, if they do, just formal address (V). Fortunately, for many literary texts from the 19th and early 20th century, copyright has expired, and they are freely available in several languages.

We identified 110 stories and novels among the texts provided by Project Gutenberg (English) and Project Gutenberg-DE (German)¹ that were available in both languages, with a total of 0.5M sentences per language. Examples are Dickens' David Copperfield or Tolstoy's Anna Karenina. We excluded plays and poems, as well as 19th-century adventure novels by Sir Walter Scott and James F. Cooper which use anachronistic English for stylistic reasons, including words that previously (until the 16th century) indicated T (thee, didst).

¹ http://www.gutenberg.org, http://gutenberg.spiegel.de/

We cleaned the English and German novels manually by deleting the tables of contents, prologues, epilogues, as well as chapter numbers and titles occurring at the beginning of each chapter to obtain properly parallel texts. The files were then formatted to contain one sentence per line using the sentence splitter and tokenizer provided with EUROPARL (Koehn, 2005). Blank lines were inserted to preserve paragraph boundaries. All novels were lemmatized and POS-tagged using TreeTagger (Schmid, 1994).² Finally, they were sentence-aligned using Gargantuan (Braune and Fraser, 2010), an aligner that supports one-to-many alignments, and word-aligned in both directions using Giza++ (Och and Ney, 2003).

² It must be expected that the tagger degrades on this dataset; however we did not quantify this effect.

3.2 T/V Gold Labels for English Utterances

As Figure 1 shows, the automatic construction of T/V labels for English involves two steps.

Step 1: Labeling German Pronouns as T/V. German has three relevant personal pronouns for the T/V distinction: du (T), sie (V), and ihr (T/V). However, various ambiguities make their interpretation non-straightforward.

The pronoun ihr can both be used for plural T address or for a somewhat archaic singular or plural V address. In principle, these usages should be distinguished by capitalization (V pronouns are generally capitalized in German), but many T instances (informal use) in our corpora are nevertheless capitalized.
Additionally, ihr can be the dative form of the 3rd person feminine pronoun sie (she/her). These instances are neutral with respect to T/V but were misanalysed by TreeTagger as instances of the T/V lemma ihr. Since TreeTagger does not provide person information, and we did not want to use a full parser, we decided to omit ihr/Ihr from consideration.³

Of the two remaining pronouns (du and sie), du expresses (singular) T. A minor problem is presented by novels set in France, where du is used as a nobiliary particle. These instances can be recognised reliably since the names before and after du are generally unknown to the German tagger. Thus we do not interpret du as T if the word preceding or succeeding it has "unknown" as its lemma.

The V pronoun, sie, doubles as the pronoun for third person (she/they) when not capitalized. We therefore interpret only capitalized instances of Sie as V. Furthermore, we ignore utterance-initial positions, where all words are capitalized. This is defined as tokens directly after a sentence boundary (POS $.) or after a bracket (POS $().

These rules concentrate on precision rather than recall. They leave many instances of German second person pronouns unlabeled; however, this is not a problem since we do not currently aim at obtaining complete coverage on the English side of our parallel corpus. From the 0.5M German sentences, about 14% of the sentences were labeled as T or V (37K for V and 28K for T). In a random sample of roughly 300 German sentences which we analysed, we did not find any errors. This puts the precision of our heuristics at above 99%.

³ Instances of ihr as possessive pronoun occurred as well, but could be filtered out on the basis of the POS tag.
These rules concentrate on precision rather than ing, development and test sets with 74 novels
recall. They leave many instances of German sec- (26K sentences), 19 novels (9K sentences) and
ond person pronouns unlabeled; however, this is 13 novels (8K sentences), respectively. The cor-
not a problem since we do not currently aim at pus is available for download at http://www.
obtaining complete coverage on the English side nlpado.de/~sebastian/data.shtml.
of our parallel corpus. From the 0.5M German sen-
tences, about 14% of the sentences were labeled 4 Human Annotation of T/V for English
as T or V (37K for V and 28K for T). In a random
sample of roughly 300 German sentences which This section investigates how well the T/V distinc-
we analysed, we did not find any errors. This puts tion can be made in English by human raters, and
the precision of our heuristics at above 99%. on the basis of what information. Two annotators
with near native-speaker competence in English
Step 2: Annotation Projection. We now copy were asked to label 200 random sentences from
the information over onto the English side. We the training set as T or V. Sentences were first pre-
originally intended to transfer T/V labels between sented in isolation (no context). Subsequently,
German and English word-aligned pronouns. How- they were presented with three sentences pre- and
ever, we pronouns are not necessarily translated post-context each (in context).
into pronouns; additionally, we found word align- Table 1 shows the results of the annotation
ment accuracy for pronouns to be far from perfect, study. The first line compares the annotations
due to the variability in function word translation. of the two annotators against each other (inter-
For these reason, we decided to look at T/V labels annotator agreement). The next two lines compare
at the level of complete sentences, ignoring word the taggers annotations against the gold standard
alignment. This is generally unproblematic ad- labels projected from German (GS). The last line
dress is almost always consistent within sentences: compares the annotator-assigned labels to the GS
of the 65K German sentences with T or V labels, for the instances on which the annotators agree.
only 269 (< 0.5%) contain both T and V. Our pro- For all cases, we report raw accuracy and Co-
jection on the English side results in 25K V and hens (1960), i.e. chance-corrected agreement.
3 4
Instances of ihr as possessive pronoun occurred as well, Our sentence aligner supports one-to-many alignments
but could be filtered out on the basis of the POS tag. and often aligns single German to multiple English sentences.

We first observe that the T/V distinction is considerably more difficult to make for individual sentences ("no context") than when the discourse is available. In context, inter-annotator agreement increases from 75% to 79%, and agreement with the gold standard rises by 10%. It is notable that the two annotators agree worse with one another than with the gold standard (see below for discussion). On those instances where they agree, Cohen's κ reaches 0.58 in context, which is interpreted as approaching good agreement (Fleiss, 1981). Although far from perfect, this inter-annotator agreement is comparable to results for the annotation of fine-grained word sense or sentiment (Navigli, 2009; Bermingham and Smeaton, 2009).

An analysis of disagreements showed that many sentences can be uttered in both T and V contexts and cannot be labeled without context:

(3) And perhaps sometime you may see her.

This case (gold label: V) is disambiguated by the previous sentence which indicates a hierarchical social relation between speaker and addressee:

(4) "And she is a sort of relation of your lordship's", said Dawson. ...

Still, even a three-sentence window is often not sufficient, since the surrounding sentences may be just as uninformative. In these cases, more global information about the situation is necessary. Even with perfect information, however, judgments can sometimes deviate, as there are considerable grey areas in T/V usage (Kretzenbacher et al., 2006).

In addition, social rules like T/V usage vary in time and between countries (Schüpbach et al., 2006). This helps to explain why annotators agree better with one another than with the gold standard: 21st century annotators tend to be unfamiliar with 19th century T/V usage. Consider this example from a book written in second person perspective:

(5) Finally, you acquaint Caroline with the fatal result: she begins by consoling you. "One hundred thousand francs lost! We shall have to practice the strictest economy", you imprudently add.⁵

⁵ H. de Balzac: Petty Troubles of Married Life

Here, the author and translator use V to refer to the reader, while today's usage would almost certainly be T, as presumed by both annotators. Conversations between lovers or family members form another example, where T is modern usage, but the novels tend to use V:

(6) [...] she covered her face with the other to conceal her tears. "Corinne!", said Oswald, "Dear Corinne! My absence has then rendered you unhappy!"⁶

⁶ A.L.G. de Staël: Corinne

In sum, our annotation study establishes that the T/V distinction, although not realized by different pronouns in English, can be recovered manually from text, provided that discourse context is available. A substantial part of the errors is due to social changes in T/V usage.

5 Monolingual T/V Modeling

The second part of the paper explores the automatic prediction of the T/V distinction for English sentences. Given the ability to create an English training corpus with T/V labels with the annotation projection methods described in Section 3.2, we can phrase T/V prediction for English as a standard supervised learning task. Our experiments have a twin motivation: (a), on the NLP side, we are mainly interested in obtaining a robust classifier to assign the labels T and V to English sentences; (b), on the sociolinguistic side, we are interested in investigating through which features the categories T and V are expressed in English.

5.1 Classification Framework

We phrase T/V labeling as a binary classification task at the sentence level, performing the classification with L2-regularized logistic regression using the LibLINEAR library (Fan et al., 2008). Logistic regression defines the probability that a binary response variable y takes some value as a logit-transformed linear combination of the features f_i, each of which is assigned a coefficient β_i.

    p(y = 1) = 1 / (1 + e^(-z))   with   z = Σ_i β_i f_i      (7)

Regularization incorporates the size of the coefficient vector β into the objective function, subtracting it from the likelihood of the data given the model. This allows the user to trade faithfulness to the data against generalization.⁷
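A minimal sketch of this classification setup follows; the paper uses LibLINEAR directly, whereas the sketch uses scikit-learn's wrapper around the same liblinear solver, and the feature dictionaries shown are hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical feature dictionaries, one per training sentence, e.g.
# {"w=monsieur": 2, "sem_class_17": 1, "polite_markers": 1, ...}
def train_tv_classifier(feature_dicts, labels):
    vec = DictVectorizer()
    X = vec.fit_transform(feature_dicts)
    # L2-regularized logistic regression; scikit-learn's "liblinear" solver
    # wraps the same library used in the paper, and C plays the role of the
    # cost parameter that trades data fit against regularization strength.
    clf = LogisticRegression(penalty="l2", C=0.01, solver="liblinear")
    clf.fit(X, labels)
    return vec, clf

def predict_tv(vec, clf, feature_dict):
    return clf.predict(vec.transform([feature_dict]))[0]
```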
⁷ We use LIBLINEAR's default parameters and set the cost (regularization) parameter to 0.01.

5.2 Feature Types

We experiment with three feature types that are candidates to express the T/V English distinction.

Word Features. The intuition to use word features draws on the parallel between T/V and information retrieval tasks like document classification: some words are presumably correlated with formal address (like titles), while others should indicate informal address (like first names). In a preliminary experiment, we noticed that in the absence of further constraints, many of the most indicative features are names of persons from particular novels which are systematically addressed formally (like Phileas Fogg from J. Verne's Around the world in eighty days) or informally (like Mowgli, Baloo, and Bagheera from R. Kipling's Jungle Book). These features clearly do not generalize to new books. We therefore added a constraint to remove all features which did not occur in at least three novels. To reduce the number of word features to a reasonable order of magnitude, we also performed a χ²-based feature selection (Manning et al., 2008) on the training set. Preliminary experiments established that selecting the top 800 word features yielded a model with good generalization.
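A possible realization of this selection step, assuming the word features are already restricted to those occurring in at least three novels and collected in a sentence-by-feature count matrix; scikit-learn's chi2 scorer stands in for the selection procedure described in Manning et al. (2008).

```python
from sklearn.feature_selection import SelectKBest, chi2

def select_word_features(X_train, y_train, k=800):
    """Keep the k word features ranked highest by a chi-squared test.

    X_train is a (sentences x word features) count matrix over the training
    set; y_train holds the projected T/V labels.  Features that occur in
    fewer than three novels are assumed to have been removed beforehand.
    """
    selector = SelectKBest(chi2, k=k)
    X_reduced = selector.fit_transform(X_train, y_train)
    return selector, X_reduced
```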
Semantic Class Features. Our second feature type is semantic class features. These can be seen as another strategy to counteract the sparseness at the level of word features. We cluster words into 400 semantic classes on the basis of distributional and morphological similarity features which are extracted from an unlabeled English collection of Gutenberg novels comprising more than 100M tokens, using the approach by Clark (2003). These features measure how similar tokens are to one another in terms of their occurrences in the document and are useful in Named Entity Recognition (Finkel and Manning, 2009). As features in the T/V classification of a given sentence, we simply count for each class the number of tokens in this class present in the current sentence. For illustration, Table 2 shows the three classes most indicative for V, ranked by the ratio of probabilities for T and V, estimated on the training set.

p(C|V)/p(C|T)    Words
4.59             Mister, sir, Monsieur, sirrah, ...
2.36             Mlle., Mr., M., Herr, Dr., ...
1.60             Gentlemen, patients, rascals, ...

Table 2: 3 of the 400 clustering-based semantic classes (classes most indicative for V)

Politeness Theory Features. The third feature type is based on the Politeness Theory (Brown and Levinson, 1987). Brown and Levinson's prediction is that politeness levels will be detectable in concrete utterances in a number of ways, e.g. a higher use of conjunctive or hedges in polite speech. Formal address (i.e., V as opposed to T) is one such expression. Politeness Theory therefore predicts that other politeness indicators should correlate with the T/V classification. This holds in particular for English, where pronoun choice is unavailable to indicate politeness.

We constructed 16 features on the basis of Politeness Theory predictions, that is, classes of expressions indicating either formality or informality. From a computational perspective, the problem with Politeness Theory predictions is that they are only described qualitatively and by example, without detailed lists. For each feature, we manually identified around 10 words or multi-word relevant expressions. Table 3 shows these 16 features with their intended classes and some example expressions. Similar to the semantic class features, the value of each politeness feature is the sum of the frequencies of its members in a sentence.
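Both the semantic class features and the politeness features reduce to the same counting scheme, sketched below; the dictionary-of-classes input format is an assumption made for illustration.

```python
def class_count_features(tokens, word_classes):
    """Count, for each class, how many of its members occur in a sentence.

    `word_classes` maps a class name to a set of (lemmatized) member words
    or expressions; it can hold the 400 distributional clusters or the 16
    hand-built politeness classes alike.  Multi-word expressions are matched
    against the joined sentence string.
    """
    sentence = " ".join(tokens)
    feats = {}
    for name, members in word_classes.items():
        count = 0
        for m in members:
            if " " in m:
                count += sentence.count(m)   # multi-word expression
            else:
                count += tokens.count(m)     # single token
        if count:
            feats[name] = count
    return feats

# e.g. class_count_features(["please", "pardon", "me"],
#                           {"Polite markers (V)": {"please", "sorry"},
#                            "Apologizing (V)": {"bother", "pardon"}})
```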
5.3 Context: Size and Type

As our annotation study in Section 4 found, context is crucial for human annotators, and this presumably carries over to automatic methods: if the features for a sentence are computed just on that sentence, we will face extremely sparse data. We experiment with symmetrical window contexts, varying the size between n = 0 (just the target sentence) and n = 10 (target sentence plus 10 preceding and 10 succeeding sentences).

This kind of simple sentence context makes an important oversimplification, however. It lumps together material from different speech turns as well as from narrative sentences, which may generate misleading features. For example, narrative sentences may refer to protagonists by their full names including titles (strong features for V) even when these protagonists are in T-style conversations:

(8) "You are the love of my life", said Sir Phileas Fogg.⁸ (T)

⁸ J. Verne: Around the world in 80 days

Class                  Example expressions      Class                   Example expressions
Inclusion (T)          let's, shall we          Exclamations (T)        hey, yeah
Subjunctive I (T)      can, will                Subjunctive II (V)      could, would
Proximity (T)          this, here               Distance (V)            that, there
Negated question (V)   didn't I, hasn't it      Indirect question (V)   would there, is there
Indefinites (V)        someone, something       Apologizing (V)         bother, pardon
Polite adverbs (V)     marvellous, superb       Optimism (V)            I hope, would you
Why + modal (V)        why would(n't)           Impersonals (V)         necessary, have to
Polite markers (V)     please, sorry            Hedges (V)              in fact, I guess

Table 3: 16 Politeness theory-based features with intended classes and example expressions

Example (8) also demonstrates that narrative material and direct speech may even be mixed within individual sentences.

For these reasons, we introduce an alternative concept of context, namely direct speech context, whose purpose is to exclude narrative material. We compute direct speech context in two steps: (a), segmentation of sentences into chunks that are either completely narrative or speech, and (b), labeling of chunks with a classifier that distinguishes these two classes. The segmentation step (a) takes place with a regular expression that subdivides sentences on every occurrence of quotes (", ', etc.). As training data for the classification step (b), we manually tagged 1000 chunks from our training data as either B-DS (begin direct speech), I-DS (inside direct speech) and O (outside direct speech, i.e. narrative material).⁹ We used this dataset to train the CRF-based sequence tagger Mallet (McCallum, 2002) using all tokens, including punctuation, as features.¹⁰ This tagger is used to classify all chunks in our dataset, resulting in output like the following example:

(9) (B-DS) "I am going to see his Ghost!
    (I-DS) It will be his Ghost, not him!"
    (O) Mr. Lorry quietly chafed the hands that held his arm.¹¹

Direct speech chunks belonging to the same sentence are subsequently recombined.

We define the direct speech context of size n for a given sentence as the n preceding and following direct speech chunks that are labeled B-DS or I-DS while skipping any chunks labeled O. Note that this definition of direct speech context still lumps together utterances by different speakers and can therefore yield misleading features in the case of asymmetric conversational situations, in addition to possible direct speech misclassifications.

⁹ The labels are chosen after IOB notation conventions (Ramshaw and Marcus, 1995).
¹⁰ We also experimented with rule-based chunk labeling based on quotes, but found the use of quotes too inconsistent.
¹¹ C. Dickens: A tale of two cities.
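A simplified sketch of both steps and of the resulting context definition; the quote-character set and the chunk-index interface are assumptions, and the actual chunk labels would come from the Mallet tagger rather than from this code.

```python
import re

QUOTE_RE = re.compile(r'["\u201c\u201d\u201e\u00ab\u00bb]')

def segment_into_chunks(sentence):
    """Split a sentence at every quotation mark, keeping non-empty chunks."""
    return [c.strip() for c in QUOTE_RE.split(sentence) if c.strip()]

def direct_speech_context(chunk_labels, index, n):
    """Indices of the n preceding and n following direct-speech chunks.

    `chunk_labels` is the tagger output ("B-DS", "I-DS" or "O") for every
    chunk in document order; chunks labeled "O" (narration) are skipped.
    """
    speech = [i for i, lab in enumerate(chunk_labels) if lab != "O"]
    if index not in speech:
        return []
    pos = speech.index(index)
    return speech[max(0, pos - n):pos] + speech[pos + 1:pos + 1 + n]
```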
6 Experimental Evaluation

6.1 Evaluation on the Development Set

We first perform model selection on the development set and then validate our results on the test set (cf. Section 3.3).

[Figure 2: Accuracy vs. number of sentences in context (empty circles: sentence context; solid circles: direct speech context). Accuracy (%) is plotted against context size n from 0 to 10; the curves lie between roughly 61% and 67%.]

Influence of Context. Figure 2 shows the influence of size and type of context, using only words as features. Without context, we obtain a performance of 61.1% (sentence context) and of 62.9% (direct speech context). These numbers beat the random baseline (50.0%) and the frequency baseline (59.1%). The addition of more context further improves performance substantially for both context types. The ideal context size is fairly large, namely 7 sentences and 8 direct speech chunks, respectively.

This indicates that sparseness is indeed a major challenge, and context can become large before the effects mentioned in Section 5.3 counteract the positive effect of more data. Direct speech context outperforms sentence context throughout, with a maximum accuracy of 67.0% as compared to 65.2%, even though it shows higher variation, which we attribute to the less stable nature of the direct speech chunks and their automatically created labels. From now on, we adopt a direct speech context of size 8 unless specified differently.

Model                              Accuracy
Random Baseline                    50.0
Frequency Baseline                 59.1
Words                              67.0
SemClass                           57.5
PoliteClass                        59.6
Words + SemClass                   66.6
Words + PoliteClass                66.4
Words + PoliteClass + SemClass     66.2
Raw human IAA (no context)         75.0
Raw human IAA (in context)         79.0

Table 4: T/V classification accuracy on the development set (direct speech context, size 8). *: Significant difference to frequency baseline (p<0.01)

Influence of Features. Table 4 shows the results for different feature types. The best model (word features only) is highly significantly better than the frequency baseline (which it beats by 8%) as determined by a bootstrap resampling test (Noreen, 1989). It gains 17% over the random baseline, but is still more than 10% below inter-annotator agreement in context, which is often seen as an upper bound for automatic models.
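One common way to realize such a test is a paired bootstrap over the test sentences, sketched below; this is one variant of the family of resampling tests described by Noreen (1989), not necessarily the exact procedure used here.

```python
import random

def bootstrap_p_value(correct_a, correct_b, samples=10000, seed=0):
    """Paired bootstrap test for the accuracy difference of two models.

    `correct_a` and `correct_b` are parallel lists of 0/1 indicators saying
    whether each model classified a test sentence correctly.  Returns the
    proportion of resamples in which model A does not beat model B, which
    serves as an approximate one-sided p-value.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    worse = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        if diff <= 0:
            worse += 1
    return worse / samples
```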
Disappointingly, the comparison of the feature groups yields a null result: We are not able to improve over the results for just word features with either the semantic class or the politeness features. Neither feature type outperforms the frequency baseline significantly (p>0.05). Combinations of the different feature types also do worse than just words. The differences between the best model (just words) and the combination models are all not significant (p>0.05). These negative results warrant further analysis. It follows in Section 6.3.

6.2 Results on the Test Set

Model                      Accuracy    Δ to dev set
Frequency baseline         59.3        +0.2
Words (no context)         62.5        -0.4
Words (context size 6)     67.3        +1.0
Words (context size 8)     67.5        +0.5
Words (context size 10)    66.8        +1.0

Table 5: T/V classification accuracy on the test set and differences to dev set results (direct speech context)

Table 5 shows the results of evaluating models with the best feature set and with different context sizes on the test set, in order to verify that we did not overfit on the development set when picking the best model. The tendencies correspond well to the development set: the frequency baseline is almost identical, as are the results for the different models. The differences to the development set are all equal to or smaller than 1% accuracy, and the best result at 67.5% is 0.5% better than on the development set. This is a reassuring result, as our model appears to generalize well to unseen data.

6.3 Analysis by Feature Types

The results from Section 6.1 motivate further analysis of the individual feature types.

Analysis of Word Features. Word features are by far the most effective features. Table 6 lists the top twenty words indicating T and V (ranked by the ratio of probabilities for the two classes on the training set). The list still includes some proper names like Vrazumihin or Louis-Gaston (even though all features have to occur in at least three novels), but they are relatively infrequent. The most prominent indicators for the formal class V are titles (monsieur, (ma)'am) and instances of formulaic language (Permit (me), Excuse (me)). There are also some terms which are not straightforward indicators of formal address (angelic, stubbornness), but are associated with a high register.

There is a notable asymmetry between T and V. The word features for T are considerably more difficult to interpret. We find some forms of earlier period English (thee, hast, thou, wilt) that result from occasional archaic passages in the novels as well as first names (Louis-Gaston, Justine). Nevertheless, most features are not straightforward to connect to specifically informal speech.

Top 20 words for V                   Top 20 words for T
Word w         P(w|V)/P(w|T)         Word w           P(w|T)/P(w|V)
Excuse         36.5                  thee             94.3
Permit         35.0                  amenable         94.3
ai             29.2                  stuttering       94.3
am             29.2                  guardian         94.3
stubbornness   29.2                  hast             92.0
flights        29.2                  Louis-Gaston     92.0
monsieur       28.6                  lease-making     92.0
Vrazumihin     28.6                  melancholic      92.0
mademoiselle   26.5                  ferry-boat       92.0
angelic        26.5                  Justine          92.0
Allow          24.5                  Thou             66.0
madame         21.2                  responsibility   63.8
delicacies     21.2                  thou             63.8
entrapped      21.2                  Iddibal          63.8
lack-a-day     21.2                  twenty-fifth     63.8
ma             21.0                  Chic             63.8
duke           18.0                  allegiance       63.8
policeman      18.0                  Jouy             63.8
free-will      18.0                  wilt             47.0
Canon          18.0                  shall            47.0

Table 6: Most indicative word features for T or V
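The ranking used for Table 6 can be reproduced along the following lines; the add-one smoothing is an assumption, since the handling of zero counts is not specified.

```python
from collections import Counter

def rank_by_class_ratio(sentences_v, sentences_t, top_n=20):
    """Rank words by p(w|V)/p(w|T), estimated with add-one smoothing.

    `sentences_v` and `sentences_t` are lists of token lists from the
    training set, labeled V and T respectively.
    """
    count_v = Counter(w for s in sentences_v for w in s)
    count_t = Counter(w for s in sentences_t for w in s)
    vocab = set(count_v) | set(count_t)
    total_v = sum(count_v.values()) + len(vocab)
    total_t = sum(count_t.values()) + len(vocab)
    ratio = {w: ((count_v[w] + 1) / total_v) / ((count_t[w] + 1) / total_t)
             for w in vocab}
    return sorted(ratio, key=ratio.get, reverse=True)[:top_n]
```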

Analysis of Semantic Class Features. We ranked the semantic classes we obtained by distributional clustering in a similar manner to the word features. Table 2 shows the top three classes indicative for V. Almost all others of the 400 clusters do not have a strong formal/informal association but mix formal, informal, and neutral vocabulary. This tendency is already apparent in class 3: Gentlemen is clearly formal, while rascals is informal. patients can belong to either class. Even in class 1, we find Sirrah, a contemptuous term used in addressing a man or boy, with a low formality score (p(w|V)/p(w|T) = 0.22). From cluster 4 onward, none of the clusters is strongly associated with either V or T (p(c|V)/p(c|T) ≈ 1).

Our interpretation of these observations is that in contrast to text categorization, there is no clear-cut topical or domain difference between T and V: both categories co-occur with words from almost any domain. In consequence, semantic classes do not, in general, represent strong unambiguous indicators. Similar to the word features, the situation is worse for T than for V: there still are reasonably strong features for V, the marked case, but it is more difficult to find indicators for T.
Analysis of politeness features. A major reason for the ineffectiveness of the Politeness Theory-based features seems to be their low frequency: in the best model, with a direct speech context of size 8, only an average of 7 politeness features was active for any given sentence. However, frequency was not the only problem: the politeness features were generally unable to discriminate well between T and V. For all features, the values of p(f|V)/p(f|T) are between 0.9 and 1.3, that is, the features were only weakly indicative of one of the classes. Furthermore, not all features turned out to be indicative of the class we designed them for. The best indicator for V was the Indefinites feature (somehow, someone; cf. Table 3), as expected. In contrast, the best indicator for T was the Negated question feature which was supposedly an indicator for V (didn't I, haven't we).

A majority of politeness features (13 of the 16) had p(f|V)/p(f|T) values above 1, that is, were indicative for the class V. Thus for this feature type, like for the others, it appears to be more difficult to identify T than to identify V. This negative result can be attributed at least in part to our method of hand-crafting lists of expressions for these features. The inadvertent inclusion of overly general terms might be responsible for the features' inability to discriminate well, while we have presumably missed specific terms, which has hurt coverage. This situation may in the future be remedied with the semi-automatic acquisition of instantiations of politeness features.

6.4 Analysis of Individual Novels

One possible hypothesis regarding the difficulty of finding indicators for the class T is that indicators for T tend to be more novel-specific than indicators for V, since formal language is more conventionalized (Brown and Levinson, 1987). If this were the case, then our strategy of building well-generalizing models by combining text from different novels would naturally result in models that have problems with picking up T features.

To investigate this hypothesis, we trained models with the best parameters as before (8-sentence direct speech context, words as features). However, this time we trained novel-specific models, splitting each novel into 50% training data and 50% testing data. We required novels to contain more than 200 labeled sentences. This ruled out most short stories, leaving us with 7 novels in the test set. The results are shown in Table 7 and show a clear improvement. The accuracy is 13% higher than in our main experiment (67% vs. 80%), even though the models were trained on considerably less data. Six of the seven novels perform above the 67.5% result from the main experiment.

Novel                                     Accuracy
H. Beecher-Stowe: Uncle Tom's Cabin       90.0
J. Spyri: Cornelli                        88.3
E. Zola: Lourdes                          83.9
H. de Balzac: Cousin Pons                 82.3
C. Dickens: The Pickwick Papers           77.7
C. Dickens: Nicholas Nickleby             74.8
F. Hodgson Burnett: Little Lord           61.6
All (micro average)                       80.0

Table 7: T/V prediction models for individual novels (50% of each novel for training and 50% testing)

The top-ranked features for T and V show a much higher percentage of names for both T and V than in the main experiment. This is to be expected, since this experiment does not restrict itself to features that occurred in at least three novels. The price we pay for this is worse generalization to other novels. There is also still a T/V asymmetry: more top features are shared among the V lists of individual novels and with the main experiment V list than on the T side. Like in the main experiment (cf. Section 6.3), V features indicate titles and other features of elevated speech, while T features mostly refer to novel-specific protagonists and events. In sum, these results provide evidence for a difference in status of T and V.

7 Discussion and Conclusions

In this paper, we have studied the distinction between formal and informal (T/V) address, which is not expressed overtly through pronoun choice or morphosyntactic marking in modern English. Our hypothesis was that the T/V distinction can be recovered in English nevertheless. Our manual annotation study has shown that annotators can in fact tag monolingual English sentences as T or V with reasonable accuracy, but only if they have sufficient context. We exploited the overt information from German pronouns to induce T/V labels for English and used this labeled corpus to train a monolingual T/V classifier for English. We experimented with features based on words, semantic classes, and Politeness Theory predictions.

With regard to our NLP goal of building a T/V classifier, we conclude that T/V classification is a phenomenon that can be modelled on the basis of corpus features. A major factor in classification performance is the inclusion of a wide context to counteract sparse data, and more sophisticated context definitions improve results. We currently achieve top accuracies of 67%-68%, which still leave room for improvement. We next plan to couple our T/V classifier with a machine translation system for a task-based evaluation on the translation of direct address into German and other languages with different T/V pronouns.

Considering our sociolinguistic goal of determining the ways in which English realizes the T/V distinction, we first obtained a negative result: only word features perform well, while semantic classes and politeness features do hardly better than a frequency baseline. Notably, there are no clear topical divisions between T and V, like for example in text categorization: almost all words are very weakly correlated with either class, and semantically similar words can co-occur with different classes. Consequently, distributionally determined semantic classes are not helpful for the distinction. Politeness features are difficult to operationalize with sufficiently high precision and recall.

An interesting result is the asymmetry between the linguistic features for V and T at the lexical level. V language appears to be more conventionalized; the models therefore identified formulaic expressions and titles as indicators for V. On the other hand, very few such generic features exist for the class T; consequently, the classifier has a hard time learning good discriminating and yet generic features. Those features that are indicative of T, such as first names, are highly novel-specific and were deliberately excluded from the main experiment. When we switched to individual novels, the models picked up such features, and accuracy increased, at the cost of lower generalizability between novels. A more technical solution to this problem would be the training of a single-class classifier for V, treating T as the default class (Tax and Duin, 1999).

Finally, an error analysis showed that many errors arise from sentences that are too short or unspecific to determine T or V reliably. This points to the fact that T/V should not be modelled as a sentence-level classification task in the first place: T/V is not a choice made for each sentence, but one that is determined once for each pair of interlocutors and rarely changed. In future work, we will attempt to learn social networks from novels (Elson et al., 2010), which should provide constraints on all instances of communication between a speaker and an addressee. However, the big (and unsolved, as far as we know) challenge is to automatically assign turns to interlocutors, given the varied and often inconsistent presentation of direct speech turns in novels.

References

John Ardila. 2003. (Non-Deictic, Socio-Expressive) T-/V-Pronoun Distinction in Spanish/English Formal Locutionary Acts. Forum for Modern Language Studies, 39(1):74-86.

John A. Bateman. 1988. Aspects of clause politeness in Japanese: An extended inquiry semantics treatment. In Proceedings of ACL, pages 147-154, Buffalo, New York.

Luisa Bentivogli and Emanuele Pianta. 2005. Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus. Journal of Natural Language Engineering, 11(3):247-261.

Adam Bermingham and Alan F. Smeaton. 2009. A study of inter-annotator agreement for opinion retrieval. In Proceedings of ACM SIGIR, pages 784-785.

Philip Bramsen, Martha Escobar-Molano, Ami Patel, and Rafael Alonso. 2011. Extracting social power relationships from natural language. In Proceedings of ACL/HLT, pages 773-782, Portland, OR.

Fabienne Braune and Alexander Fraser. 2010. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. In Coling 2010: Posters, pages 81-89, Beijing, China.

Roger Brown and Albert Gilman. 1960. The pronouns of power and solidarity. In Thomas A. Sebeok, editor, Style in Language, pages 253-277. MIT Press, Cambridge, MA.

Penelope Brown and Stephen C. Levinson. 1987. Politeness: Some Universals in Language Usage. Number 4 in Studies in Interactional Sociolinguistics. Cambridge University Press.

Alexander Clark. 2003. Combining distributional and morphological information for part of speech induction. In Proceedings of EACL, pages 59-66, Budapest, Hungary.

J. Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37-46.

David Elson, Nicholas Dames, and Kathleen McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of ACL, pages 138-147, Uppsala, Sweden.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874.

Manaal Faruqui and Sebastian Padó. 2011. "I Thou Thee, Thou Traitor": Predicting formal vs. informal address in English literature. In Proceedings of ACL/HLT 2011, pages 467-472, Portland, OR.

Jenny Rose Finkel and Christopher D. Manning. 2009. Nested named entity recognition. In Proceedings of EMNLP, pages 141-150, Singapore.

Joseph L. Fleiss. 1981. Statistical methods for rates and proportions. John Wiley, New York, 2nd edition.

Alexander Fraser. 2009. Experiments in morphosyntactic processing for translating to and from German. In Proceedings of the EACL MT workshop, pages 115-119, Athens, Greece.

Jerry Hobbs and Megumi Kameyama. 1990. Translation by abduction. In Proceedings of COLING, pages 155-161, Helsinki, Finland.

Rebecca Hwa, Philipp Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Journal of Natural Language Engineering, 11(3):311-325.

Hiroshi Kanayama. 2003. Paraphrasing rules for automatic evaluation of translation into Japanese. In Proceedings of the Second International Workshop on Paraphrasing, pages 88-93, Sapporo, Japan.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit, pages 79-86, Phuket, Thailand.

Heinz L. Kretzenbacher, Michael Clyne, and Doris Schüpbach. 2006. Pronominal Address in German: Rules, Anarchy and Embarrassment Potential. Australian Review of Applied Linguistics, 39(2):17.1-17.18.

Alexander Künzli. 2010. Address pronouns as a problem in French-Swedish translation and translation revision. Babel, 55(4):364-380.

Zhifei Li and David Yarowsky. 2008. Mining and modeling relations between formal and informal Chinese phrases from web corpora. In Proceedings of EMNLP, pages 1031-1040, Honolulu, Hawaii.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 1st edition.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Roberto Navigli. 2009. Word Sense Disambiguation: a survey. ACM Computing Surveys, 41(2):1-69.

Eric W. Noreen. 1989. Computer-intensive Methods for Testing Hypotheses: An Introduction. John Wiley and Sons Inc.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51.

Lance Ramshaw and Mitch Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the 3rd ACL Workshop on Very Large Corpora, Cambridge, MA.

Michael Schiehlen. 1998. Learning tense translation from bilingual corpora. In Proceedings of ACL/COLING, pages 1183-1187, Montreal, Canada.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44-49, Manchester, UK.

Doris Schüpbach, John Hajek, Jane Warren, Michael Clyne, Heinz Kretzenbacher, and Catrin Norrby. 2006. A cross-linguistic comparison of address pronoun use in four European languages: Intralingual and interlingual dimensions. In Proceedings of the Annual Meeting of the Australian Linguistic Society, Brisbane, Australia.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, and Dan Tufis. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of LREC, pages 2142-2147, Genoa, Italy.

David M. J. Tax and Robert P. W. Duin. 1999. Support vector domain description. Pattern Recognition Letters, 20:1191-1199.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of NAACL, pages 200-207, Pittsburgh, PA.

Character-based Kernels for Novelistic Plot Structure

Micha Elsner
Institute for Language, Cognition and Computation (ILCC)
School of Informatics
University of Edinburgh
melsner0@gmail.com

Abstract

Better representations of plot structure could greatly improve computational methods for summarizing and generating stories. Current representations lack abstraction, focusing too closely on events. We present a kernel for comparing novelistic plots at a higher level, in terms of the cast of characters they depict and the social relationships between them. Our kernel compares the characters of different novels to one another by measuring their frequency of occurrence over time and the descriptive and emotional language associated with them. Given a corpus of 19th-century novels as training data, our method can accurately distinguish held-out novels in their original form from artificially disordered or reversed surrogates, demonstrating its ability to robustly represent important aspects of plot structure.

1 Introduction

Every culture has stories, and storytelling is one of the key functions of human language. Yet while we have robust, flexible models for the structure of informative documents (for instance (Chen et al., 2009; Abu Jbara and Radev, 2011)), current approaches have difficulty representing the narrative structure of fictional stories. This causes problems for any task requiring us to model fiction, including summarization and generation of stories; Kazantseva and Szpakowicz (2010) show that state-of-the-art summarizers perform extremely poorly on short fictional texts¹. A major problem with applying models for informative text to fiction is that the most important structure underlying the narrative, its plot, occurs at a high level of abstraction, while the actual narration is of a series of lower-level events.

A short synopsis of Jane Austen's novel Pride and Prejudice, for example, is that Elizabeth Bennet first thinks Mr. Darcy is arrogant, but later grows to love him. But this is not stated straightforwardly in the text; the reader must infer it from the behavior of the characters as they participate in various everyday scenes.

In this paper, we present the plot kernel, a coarse-grained, but robust representation of novelistic plot structure. The kernel evaluates the similarity between two novels in terms of the characters and their relationships, constructing functional analogies between them. These are intended to correspond to the labelings produced by human literary critics when they write, for example, that Elizabeth Bennet and Emma Woodhouse are protagonists of their respective novels. By focusing on which characters and relationships are important, rather than specifically how they interact, our system can abstract away from events and focus on more easily-captured notions of what makes a good story.

The ability to find correspondences between characters is key to eventually summarizing or even generating interesting stories. Once we can effectively model the kinds of people a romance or an adventure story is usually about, and what kind of relationships should exist between them, we can begin trying to analyze new texts by comparison with familiar ones. In this work, we evaluate our system on the comparatively easy task

¹ Apart from Kazantseva, we know of one other attempt to apply a modern summarizer to fiction, by the artist Jason Huff, using Microsoft Word 2008's extractive summary feature: http://jason-huff.com/projects/autosummarize. Although this cannot be treated as a scientific experiment, the results are unusably bad; they consist mostly of short exclamations containing the names of major characters.
of recognizing acceptable novels (section 6), but ture in terms of both characters and their emo-
recognition is usually a good first step toward tional states. However, they operate at a very de-
generationa recognition model can always be tailed level and so can be applied only to short
used as part of a generate-and-rank pipeline, and texts. Scheherazade (Elson and McKeown, 2010)
potentially its underlying representation can be allows human annotators to mark character goals
used in more sophisticated ways. We show a de- and emotional states in a narrative, and indicate
tailed analysis of the character correspondences the causal links between them. AESOP (Goyal et
discovered by our system, and discuss their po- al., 2010) attempts to learn a similar structure au-
tential relevance to summarization, in section 9. tomatically. AESOPs accuracy, however, is rel-
atively poor even on short fables, indicating that
2 Related work this fine-grained approach is unlikely to be scal-
Some recent work on story understanding has fo- able to novel-length texts; our system relies on a
cused on directly modeling the series of events much coarser analysis.
that occur in the narrative. McIntyre and Lapata Kazantseva and Szpakowicz (2010) summarize
(2010) create a story generation system that draws short stories, although unlike the other projects
on earlier work on narrative schemas (Chambers we discuss here, they explicitly try to avoid giving
and Jurafsky, 2009). Their system ensures that away plot detailstheir goal is to create spoiler-
generated stories contain plausible event-to-event free summaries focusing on characters, settings
transitions and are coherent. Since it focuses only and themes, in order to attract potential readers.
on events, however, it cannot enforce a global no- They do find it useful to detect character men-
tion of what the characters want or how they relate tions, and also use features based on verb aspect to
to one another. automatically exclude plot events while retaining
Our own work draws on representations that descriptive passages. They compare their genre-
explicitly model emotions rather than events. Alm specific system with a few state-of-the-art meth-
and Sproat (2005) were the first to describe sto- ods for summarizing news, and find it outper-
ries in terms of an emotional trajectory. They an- forms them substantially.
notate emotional states in 22 Grimms fairy tales We evaluate our system by comparing real nov-
and discover an increase in emotion (mostly posi- els to artificially produced surrogates, a procedure
tive) toward the ends of stories. They later use this previously used to evaluate models of discourse
corpus to construct a reasonably accurate clas- coherence (Karamanis et al., 2004; Barzilay and
sifier for emotional states of sentences (Alm et Lapata, 2005) and models of syntax (Post, 2011).
al., 2005). Volkova et al. (2010) extend the hu- As in these settings, we anticipate that perfor-
man annotation approach using a larger number of mance on this kind of task will be correlated with
emotion categories and applying them to freely- performance in applied settings, so we use it as an
defined chunks instead of sentences. The largest- easier preliminary test of our capabilities.
scale emotional analysis is performed by Moham-
3 Dataset
mad (2011), using crowd-sourcing to construct a
large emotional lexicon with which he analyzes We focus on the 19th century novel, partly fol-
adult texts such as plays and novels. In this work, lowing Elson et al. (2010) and partly because
we adopt the concept of emotional trajectory, but these texts are freely available via Project Guten-
apply it to particular characters rather than works berg. Our main dataset is composed of romances
as a whole. (which we loosely define as novels focusing on a
In focusing on characters, we follow Elson et courtship or love affair). We select 41 texts, tak-
al. (2010), who analyze narratives by examining ing 11 as a development set and the remaining
their social network relationships. They use an 30 as a test set; a complete list is given in Ap-
automatic method based on quoted speech to find pendix A. We focus on the novels used in Elson
social links between characters in 19th century et al. (2010), but in some cases add additional ro-
novels. Their work, designed for computational mances by an already-included author. We also
literary criticism, does not extract any temporal selected 10 of the least romantic works as an out-
or emotional structure. of-domain set; experiments on these are in section
A few projects attempt to represent story struc- 8.

4 Preprocessing reply left-of-[name] 17
right-of-[name] feel 14
In order to compare two texts, we must first ex- right-of-[name] look 10
tract the characters in each and some features of right-of-[name] mind 7
their relationships with one another. Our first step right-of-[name] make 7
is to split the text into chapters, and each chapter
into paragraphs; if the text contains a running di- Table 1: Top five stemmed unigram dependency fea-
alogue where each line begins with a quotation tures for Miss Elizabeth Bennet, protagonist of
mark, we append it to the previous paragraph. Pride and Prejudice, and their frequencies.
We segment each paragraph with MXTerminator
(Reynar and Ratnaparkhi, 1997) and parse it with
the self-trained Charniak parser (McClosky et al., and the first and last names are consistent (Char-
2006). Next, we extract a list of characters, com- niak, 2001). We then merge single-word mentions
pute dependency tree-based unigram features for with matching multiword mentions if they appear
each character, and record character frequencies in the same paragraph, or if not, with the multi-
and relationships over time. word mention that occurs in the most paragraphs.
When this process ends, we have resolved each
4.1 Identifying characters mention in the novel to some specific character.
As in previous work, we discard very infrequent
We create a list of possible character references
characters and their mentions.
for each work by extracting all strings of proper
nouns (as detected by the parser), then discarding For the reasons stated, this method is error-
those which occur less than 5 times. Grouping prone. Our intuition is that the simpler method
these into a useful character list is a problem of described in Elson et al. (2010), which merges
cross-document coreference. each mention to the most recent possible coref-
Although cross-document coreference has been erent, must be even more so. However, due to
extensively studied (Bhattacharya and Getoor, the expense of annotation, we make no attempt to
2005) and modern systems can achieve quite high compare these methods directly.
accuracy on the TAC-KBP task, where the list
of available entities is given in advance (Dredze
4.2 Unigram character features
et al., 2010), novelistic text poses a significant
challenge for the methods normally used. The Once we have obtained the character list, we use
typical 19th-century novel contains many related the dependency relationships extracted from our
characters, often named after one another. There parse trees to compute features for each charac-
are complicated social conventions determining ter. Similar feature sets are used in previous work
which titles are used for whomfor instance, in word classification, such as (Lin and Pantel,
the eldest unmarried daughter of a family can be 2001). A few example features are shown in Table
called Miss Bennet, while her younger sister 1.
must be Miss Elizabeth Bennet. And characters
To find the features, we take each mention in
often use nicknames, such as Lizzie.
the corpus and count up all the words outside the
Our system uses the multi-stage clustering
mention which depend on the mention head, ex-
approach outlined in Bhattacharya and Getoor
cept proper nouns and stop words. We also count
(2005), but with some features specific to 19th
the mentions own head word, and mark whether
century European names. To begin, we merge all
it appears to the right or the left (in general, this
identical mentions which contain more than two
word is a verb and the direction reflects the men-
words (leaving bare first or last names unmerged).
tions role as subject or object). We lemmatize
Next, we heuristically assign each mention a gen-
all feature words with the WordNet (Miller et al.,
der (masculine, feminine or neuter) using a list of
1990) stemmer. The resulting distribution over
gendered titles, then a list of male and female first
words is our set of unigram features for the char-
names2 . We then merge mentions where each is
acter. (We do not prune rare features, although
longer than one word, the genders do not clash,
they have proportionally little influence on our
2
The most frequent names from the 1990 US census. measurement of similarity.)

[Figure 1: Normalized frequency and emotions associated with Miss Elizabeth Bennet, protagonist of Pride and Prejudice, and frequency of paragraphs about her and Mr. Darcy, smoothed and projected onto 50 basis points. The plot shows three curves ("Freq of Miss Elizabeth Bennet", "Emotions of Miss Elizabeth Bennet", "Cross freq x Mr. Darcy") over the 50 basis points, with values between 0.0 and 1.6.]

4.3 Temporal relationships

We record two time-varying features for each character, each taking one value per chapter. The first is the character's frequency as a proportion of all character mentions in the chapter. The second is the frequency with which the character is associated with emotional language: their emotional trajectory (Alm et al., 2005). We use the strong subjectivity cues from the lexicon of Wilson et al. (2005) as a measurement of emotion. If, in a particular paragraph, only one character is mentioned, we count all emotional words in that paragraph and add them to the character's total. To render the numbers comparable across works, each paragraph subtotal is normalized by the amount of emotional language in the novel as a whole. Then the chapter score is the average over paragraphs.

For pairwise character relationships, we count the number of paragraphs in which only two characters are mentioned, and treat this number (as a proportion of the total) as a measurement of the strength of the relationship between that pair³. Elson et al. (2010) show that their method of finding conversations between characters is more precise in showing whether a relationship exists, but the co-occurrence technique is simpler, and we care mostly about the strength of key relationships rather than the existence of infrequent ones.

³ We tried also counting emotional language in these paragraphs, but this did not seem to help in development experiments.

Finally, we perform some smoothing, by taking a weighted moving average of each feature value with a window of the three values on either side. Then, in order to make it easy to compare books with different numbers of chapters, we linearly interpolate each series of points into a curve and project it onto a fixed basis of 50 evenly spaced points. An example of the final output is shown in Figure 1.

5 Kernels

Our plot kernel k(x, y) measures the similarity between two novels x and y in terms of the features computed above. It takes the form of a convolution kernel (Haussler, 1999) where the parts of each novel are its characters u \in x, v \in y and c is a kernel over characters:

k(x, y) = \sum_{u \in x} \sum_{v \in y} c(u, v)    (1)

We begin by constructing a first-order kernel over characters, c1(u, v), which is defined in terms of a kernel d over the unigram features and a kernel e over the single-character temporal features. We represent the unigram feature counts as distributions p_u(w) and p_v(w), and compute their similarity as the amount of shared mass, times a small penalty of .1 for mismatched genders:
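The smoothing and resampling step of subsection 4.3 is easy to prototype. The sketch below is an illustrative reimplementation, not the authors' code: the exact window weights are not specified in the text, so the triangular weighting here is an assumption.

    import numpy as np

    def smooth_and_project(values, n_points=50, window=3):
        """Weighted moving average over +/-window chapters, then linear
        interpolation onto a fixed basis of n_points evenly spaced points."""
        values = np.asarray(values, dtype=float)
        n = len(values)
        # Triangular weights; the paper only says "weighted moving average".
        weights = np.array([window + 1 - abs(k) for k in range(-window, window + 1)], float)
        smoothed = np.empty(n)
        for i in range(n):
            lo, hi = max(0, i - window), min(n, i + window + 1)
            w = weights[(lo - i + window):(hi - i + window)]
            smoothed[i] = np.dot(values[lo:hi], w) / w.sum()
        # Resample so novels with different chapter counts become comparable.
        old_x = np.linspace(0.0, 1.0, n)
        new_x = np.linspace(0.0, 1.0, n_points)
        return np.interp(new_x, old_x, smoothed)

    # Example: a 12-chapter frequency series resampled to 50 basis points.
    curve = smooth_and_project([0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.2, 0.1, 0.3, 0.7, 0.8, 0.6])
    print(curve.shape)  # (50,)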

637
d(p_u, p_v) = \exp\bigl(-\lambda\,(1 - \sum_w \min(p_u(w), p_v(w)))\bigr) \cdot 0.1^{\,I\{gen_u \neq gen_v\}}

We compute similarity between a pair of time-varying curves (which are projected onto 50 evenly spaced points) using standard cosine distance, which approximates the normalized integral of their product.

e(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}    (2)

The weights that scale d and e are parameters of the system; they make the two kernels comparable to one another, and also determine how fast the similarity scales up as the feature sets grow closer. We set them to 5 and 10 respectively.

We sum together the similarities of the character frequency and emotion curves to measure overall temporal similarity between the characters. Thus our first-order character kernel c1 is:

c_1(u, v) = d(p_u, p_v)\,\bigl(e(u_{freq}, v_{freq}) + e(u_{emo}, v_{emo})\bigr)

We use c1 and equation 1 to construct a first-order plot kernel (which we call k1), and also as an ingredient in a second-order character kernel c2 which takes into account the curve of pairwise frequencies \widehat{u, u'} between two characters u and u' in the same novel.

c_2(u, v) = c_1(u, v) \sum_{u' \in x} \sum_{v' \in y} e(\widehat{u, u'}, \widehat{v, v'})\, c_1(u', v')

In other words, u is similar to v if, for some relationships of u with other characters u', there are similar characters v' who serve the same role for v. We use c2 and equation 1 to construct our full plot kernel k2.

5.1 Sentiment-only baseline

In addition to our plot kernel systems, we implement a simple baseline intended to test the effectiveness of tracking the emotional trajectory of the novel without using character identities. We give our baseline access to the same subjectivity lexicon used for our temporal features. We compute the number of emotional words used in each chapter (regardless of which characters they co-occur with), smoothed and normalized as described in subsection 4.3. This produces a single time-varying curve for each novel, representing the average emotional intensity of each chapter. We use our curve kernel e (equation 2) to measure similarity between novels.

6 Experiments

We evaluate our kernels on their ability to distinguish between real novels from our dataset and artificial surrogate novels of three types. First, we alter the order of a real novel by permuting its chapters before computing features. We construct one uniformly random permutation for each test novel. Second, we change the identities of the characters by reassigning the temporal features for the different characters uniformly at random while leaving the unigram features unaltered. (For example, we might assign the frequency, emotion and relationship curves for Mr. Collins to Miss Elizabeth Bennet instead.) Again, we produce one test instance of this type for each test novel. Third, we experiment with a more difficult ordering task by taking the chapters in reverse.

In each case, we use our kernel to perform a ranking task, deciding whether k(x, y) > k(x, y_{perm}). Since this is a binary forced-choice classification, a random baseline would score 50%. We evaluate performance in the case where we are given only a single training document x, and for a whole training set X, in which case we combine the decisions using a weighted nearest neighbor (WNN) strategy:

\sum_{x \in X} k(x, y) > \sum_{x \in X} k(x, y_{perm})

In each case, we perform the experiment in a leave-one-out fashion; we include the 11 development documents in X, but not in the test set. Thus there are 1200 single-document comparisons and 30 with WNN. The results of our three systems (the baseline, the first-order kernel k1 and the second-order kernel k2) are shown in Table 2. (The sentiment-only baseline has no character-specific features, and so cannot perform the character task.)

Using the full dataset and second-order kernel k2, our system's performance on these tasks is quite good; we are correct 90% of the time for order and character examples, and 67% for the
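To make the kernel definitions concrete, here is a hedged Python sketch of the unigram kernel d, the curve kernel e (equation 2), the first-order character kernel c1, and the convolution plot kernel of equation 1. The data layout (dictionaries of unigram counts, 50-point curves) and the use of a single weight on the d side are assumptions for illustration, not the authors' exact implementation.

    import numpy as np

    LAMBDA = 5.0  # the text sets the two scaling weights to 5 and 10; only the d-side weight is used here

    def d_unigram(p_u, p_v, gen_u, gen_v, lam=LAMBDA):
        """Shared probability mass between two unigram distributions, passed
        through an exponential, with a 0.1 penalty for mismatched genders."""
        shared = sum(min(p_u.get(w, 0.0), p_v.get(w, 0.0)) for w in set(p_u) | set(p_v))
        score = np.exp(-lam * (1.0 - shared))
        return score * (0.1 if gen_u != gen_v else 1.0)

    def e_curve(u, v):
        """Cosine similarity between two 50-point curves (equation 2)."""
        u, v = np.asarray(u, float), np.asarray(v, float)
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v / denom) if denom > 0 else 0.0

    def c1(char_u, char_v):
        """First-order character kernel: unigram similarity times the summed
        similarity of the frequency and emotion curves."""
        return d_unigram(char_u["unigrams"], char_v["unigrams"],
                         char_u["gender"], char_v["gender"]) * (
            e_curve(char_u["freq"], char_v["freq"]) +
            e_curve(char_u["emo"], char_v["emo"]))

    def plot_kernel(novel_x, novel_y, char_kernel=c1):
        """Convolution plot kernel (equation 1): sum the character kernel over
        all pairs of characters drawn from the two novels."""
        return sum(char_kernel(u, v) for u in novel_x for v in novel_y)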

638
more difficult reverse cases. Results of this quality rely heavily on the WNN strategy, which trusts close neighbors more than distant ones.

                  order   character   reverse
sentiment only    46.2        -         51.5
single doc k1     59.5      63.7        50.7
single doc k2     61.8      67.7        51.6
WNN sentiment     50          -         53
WNN k1            77         90         63
WNN k2            90         90         67

Table 2: Accuracy of kernels ranking 30 real novels against artificial surrogates (chance accuracy 50%).

In the single training point setup, the system is much less accurate. In this setting, the system is forced to make decisions for all pairs of texts independently, including pairs it considers very dissimilar because it has failed to find any useful correspondences. Performance for these pairs is close to chance, dragging down overall scores (52% for reverse) even if the system performs well on pairs where it finds good correspondences, enabling a higher WNN score (67%).

The reverse case is significantly harder than order. This is because randomly permuting a novel actually breaks up the temporal continuity of the text: for instance, a minor character who appeared in three adjacent chapters might now appear in three separate places. Reversing the text does not cause this kind of disruption, so correctly detecting a reversal requires the system to represent patterns with a distinct temporal orientation, for instance an intensification in the main character's emotions, or in the number of paragraphs focusing on pairwise relationships, toward the end of the text.

The baseline system is ineffective at detecting either ordering or reversals⁴. The first-order kernel k1 is as good as k2 in detecting character permutations, but less effective on reorderings and reversals. As we will show in section 9, k1 places more emphasis on correspondences between minor characters and between places, while k2 is more sensitive to protagonists and their relationships, which carry the richest temporal information.

⁴ The baseline detects reversals as well as the plot kernels given only a single point of comparison, but these results do not transfer to the WNN strategy. This suggests that unlike the plot kernels, the baseline is no more accurate for documents it considers similar than for those it judges are distant.

7 Significance testing

In addition to using our kernel as a classifier, we can directly test its ability to distinguish real from altered novels via a non-parametric two-sample significance test, the Maximum Mean Discrepancy (MMD) test (Gretton et al., 2007). Given samples from a pair of distributions p and q and a kernel k, this test determines whether the null hypothesis that p and q are identically distributed in the kernel's feature space can be rejected. The advantage of this test is that, since it takes all pairwise comparisons (except self-comparisons) within and across the classes into account, it uses more information than our classification experiments, and can therefore be more sensitive.

As in Gretton et al. (2007), we find an unbiased estimate of the test statistic MMD² for sample sets x ∼ p, y ∼ q, each with m samples, by pairing the two as z = (x_i, y_i) and computing:

\mathrm{MMD}^2(x, y) = \frac{1}{m(m-1)} \sum_{i \neq j} h(z_i, z_j)

h(z_i, z_j) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)

Intuitively, MMD² approaches 0 if the kernel cannot distinguish x from y and is positive otherwise. The null distribution is computed by the bootstrap method; we create null-distributed samples by randomly swapping x_i and y_i in elements of z and computing the test statistic. We use 10000 test permutations. Using both k1 and k2, we can reject the null hypothesis that the distribution of novels is equal to order or characters with p < .001; for reversals, we cannot reject the null hypothesis.

8 Out-of-domain data

In our main experiments, we tested our kernel only on romances; here we investigate its ability to generalize across genres. We take as our training set X the same romances as above, but as our test set Y a disjoint set of novels focusing mainly on crime, children and the supernatural.

Our results (Table 3) are not appreciably different from those of the in-domain experiments (Table 2) considering the small size of the dataset. This shows our system to be robust, but shallow;
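The MMD statistic and its bootstrap null distribution are straightforward to compute once a kernel is available. The sketch below follows the formulas above; it is an illustration, not the authors' code (the number of permutations is a parameter; the paper uses 10000).

    import random

    def mmd2(xs, ys, k):
        """Unbiased MMD^2 estimate for paired samples xs, ys (equal length m)
        under kernel k, using the h(z_i, z_j) formulation above."""
        m = len(xs)
        total = 0.0
        for i in range(m):
            for j in range(m):
                if i == j:
                    continue
                total += (k(xs[i], xs[j]) + k(ys[i], ys[j])
                          - k(xs[i], ys[j]) - k(xs[j], ys[i]))
        return total / (m * (m - 1))

    def mmd_p_value(xs, ys, k, n_perm=1000, seed=0):
        """Bootstrap null distribution: randomly swap x_i and y_i within each
        pair, recompute the statistic, and return a permutation p-value."""
        rng = random.Random(seed)
        observed = mmd2(xs, ys, k)
        count = 0
        for _ in range(n_perm):
            px, py = [], []
            for a, b in zip(xs, ys):
                if rng.random() < 0.5:
                    a, b = b, a
                px.append(a)
                py.append(b)
            if mmd2(px, py, k) >= observed:
                count += 1
        return (count + 1) / (n_perm + 1)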

639
the patterns it can represent generalize acceptably across domains, but this suggests it is describing broad concepts like "main character" rather than genre-specific ones like "female romantic lead".

                  order   character   reverse
sentiment only    33.0        -         53.4
single doc k1     59.5      61.7        52.7
single doc k2     63.7      62.0        57.3
WNN sentiment     20          -         70
WNN k1            80         90         80
WNN k2            100        80         70

Table 3: Accuracy of kernels ranking 10 non-romance novels against artificial surrogates, with 41 romances used for comparison.

9 Character-level analysis

To gain some insight into exactly what kinds of similarities the system picks up on when comparing two works, we sorted the characters detected by our system into categories and measured their contribution to the kernel's overall scores. We selected four Jane Austen works from the development set⁵ and hand-categorized each character detected by our system. (We performed the categorization based on the most common full name mention in each cluster. This name is usually a good identifier for all the mentions in the cluster, but if our coreference system has made an error, it may not be.)

⁵ Pride and Prejudice, Emma, Mansfield Park and Persuasion.

Our categorization for characters is intended to capture the stereotypical plot dynamics of literary romance, sorting the characters according to their gender and a simple notion of their plot function. The genders are female, male, plural (the Crawfords) or not a character (London). The functional classes are protagonist (used for the female viewpoint character and her eventual husband), marriageable (single men and women who are seeking to marry within the story) and other (older characters, children, and characters married before the story begins).

We evaluate the pairwise kernel similarities among our four works, and add up the proportional contribution made by character pairs of each type to the eventual score. (For instance, the similarity between Elizabeth Bennet and Emma Woodhouse, both labeled "female protagonist", contributes 26% of the kernel similarity between the works in which they appear.) We plot these as Hinton-style diagrams in Figure 2. The size of each black rectangle indicates the magnitude of the contribution. (Since kernel functions are symmetric, we show only the lower diagonal.)

Under the kernel for unigram features, d (top), the most common character types, non-characters (almost always places) and non-marriageable women, contribute most to the kernel scores; this is especially true for places, since they often occur with similar descriptive terms. The diagram also shows the effect of the kernel's penalty for gender mismatches, since females pair more strongly with females and males with males. Character roles have relatively little impact.

The first-order kernel c1 (middle), which takes into account frequency and emotion as well as unigrams, is much better than d at distinguishing places from real characters, and assigns somewhat more weight to protagonists.

Finally, c2 (bottom), which takes into account second-order relationships, places much more emphasis on female protagonists and much less on places. This is presumably because the female protagonists of Jane Austen's novels are the viewpoint characters, and the novels focus on their relationships, while characters do not tend to have strong relationships with places. An increased tendency to match male "marriageable" characters with marriageable females, and "other" males with "other" females, suggests that c2 relies more on character function and less on unigrams than c1 when finding correspondences between characters.

As we concluded in the previous section, the frequent confusion between categories suggests that the analogies we construct are relatively non-specific. We might hope to create role-based summaries of novels by finding their nearest neighbors and then propagating the character categories (for example, "___ is the protagonist of this novel. She lives at ___. She eventually marries ___, her other suitors are ___ and her older guardian is ___.") but the present system is probably not adequate for the purpose. We expect that detecting a fine-grained set of emotions will help to separate character functions more clearly.
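The category-level analysis is essentially bookkeeping over the kernel's summands. A minimal sketch (assuming the character kernel c1 from the earlier sketch and a user-supplied category labeling function):

    from collections import defaultdict

    def contribution_by_category(novel_x, novel_y, char_kernel, category):
        """Share of the total kernel score k(x, y) contributed by each
        (category(u), category(v)) pair, as visualized in the Hinton diagrams."""
        totals = defaultdict(float)
        grand_total = 0.0
        for u in novel_x:
            for v in novel_y:
                score = char_kernel(u, v)
                totals[(category(u), category(v))] += score
                grand_total += score
        return {pair: s / grand_total for pair, s in totals.items()} if grand_total else {}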

640
[Figure 2 appears here: four Hinton-style affinity diagrams, panels "Character frequency by category", "Unigram features (d)", "First-order (c1)" and "Second-order (c2)", with rows and columns indexed by character type (F Prot, M Prot, F Marr., M Marr., Pl Marr., F Other, M Other, Pl Other, Non-char).]

Figure 2: Affinity diagrams showing character types contributing to the kernel similarity between four works by Jane Austen.

10 Conclusions

This work presents a method for describing novelistic plots at an abstract level. It has three main contributions: the description of a plot in terms of analogies between characters, the use of emotional and frequency trajectories for individual characters rather than whole works, and evaluation using artificially disordered surrogate novels. In future work, we hope to sharpen the analogies we construct so that they are useful for summarization, perhaps by finding an external standard by which we can make the notion of analogous characters precise. We would also like to investigate what gains are possible with a finer-grained emotional vocabulary.

Acknowledgements

Thanks to Sharon Goldwater, Mirella Lapata, Victoria Adams and the ProbModels group for their comments on preliminary versions of this work, Kira Mourao for suggesting graph kernels, and three reviewers for their comments.

References

Amjad Abu Jbara and Dragomir Radev. 2011. Coherent citation-based summarization of scientific papers. In Proceedings of ACL 2011, Portland, Oregon.

Cecilia Ovesdotter Alm and Richard Sproat. 2005. Emotional sequencing and development in fairy tales. In ACII, pages 668-674.

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 579-586, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.

Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: an entity-based approach. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05).

Indrajit Bhattacharya and Lise Getoor. 2005. Relational clustering for multi-type entity resolution. In Proceedings of the 4th international workshop on Multi-relational mining, MRDM '05, pages 3-12, New York, NY, USA. ACM.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the
participants. In Proceedings of the Joint Confer-
ence of the 47th Annual Meeting of the ACL and the

641
4th International Joint Conference on Natural Language Processing of the AFNLP, pages 602-610, Suntec, Singapore, August. Association for Computational Linguistics.

Eugene Charniak. 2001. Unsupervised learning of name structure from coreference data. In Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-01).

Harr Chen, S.R.K. Branavan, Regina Barzilay, and David R. Karger. 2009. Global models of document structure using latent permutations. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 371-379, Boulder, Colorado, June. Association for Computational Linguistics.

Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, and Tim Finin. 2010. Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 277-285, Beijing, China, August. Coling 2010 Organizing Committee.

David K. Elson and Kathleen R. McKeown. 2010. Building a bank of semantically encoded narratives. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

David Elson, Nicholas Dames, and Kathleen McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 138-147, Uppsala, Sweden, July. Association for Computational Linguistics.

Amit Goyal, Ellen Riloff, and Hal Daume III. 2010. Automatically producing plot unit representations for narrative text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 77-86, Cambridge, MA, October. Association for Computational Linguistics.

Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Scholkopf, and Alexander J. Smola. 2007. A kernel method for the two-sample-problem. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513-520. MIT Press, Cambridge, MA.

David Haussler. 1999. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, UC Santa Cruz.

Nikiforos Karamanis, Massimo Poesio, Chris Mellish, and Jon Oberlander. 2004. Evaluating centering-based metrics of coherence. In ACL, pages 391-398.

Anna Kazantseva and Stan Szpakowicz. 2010. Summarizing short stories. Computational Linguistics, pages 71-109.

Dekang Lin and Patrick Pantel. 2001. Induction of semantic classes from natural language text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '01, pages 317-322, New York, NY, USA. ACM.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152-159.

Neil McIntyre and Mirella Lapata. 2010. Plot induction and evolutionary search for story generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1562-1572, Uppsala, Sweden, July. Association for Computational Linguistics.

G. Miller, A.R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4).

Saif Mohammad. 2011. From once upon a time to happily ever after: Tracking emotions in novels and fairy tales. In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 105-114, Portland, OR, USA, June. Association for Computational Linguistics.

Matt Post. 2011. Judging grammaticality with tree substitution grammar derivations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 217-222, Portland, Oregon, USA, June. Association for Computational Linguistics.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16-19, Washington D.C.

Ekaterina P. Volkova, Betty Mohler, Detmar Meurers, Dale Gerdemann, and Heinrich H. Bulthoff. 2010. Emotional perception of fairy tales: Achieving agreement in emotion annotation of text. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 98-106, Los Angeles, CA, June. Association for Computational Linguistics.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language

642
Processing, pages 347-354, Vancouver, British
Columbia, Canada, October. Association for Com-
putational Linguistics.

643
A List of texts

Dev set (11 works):
Austen: Emma, Mansfield Park, Northanger Abbey, Persuasion, Pride and Prejudice, Sense and Sensibility; Bronte, Emily: Wuthering Heights; Burney: Cecilia (1782); Hardy: Tess of the D'Urbervilles; James: The Ambassadors; Scott: Ivanhoe.

Test set (30 works):
Braddon: Aurora Floyd; Bronte, Anne: The Tenant of Wildfell Hall; Bronte, Charlotte: Jane Eyre, Villette; Bulwer-Lytton: Zanoni; Disraeli: Coningsby, Tancred; Edgeworth: The Absentee, Belinda, Helen; Eliot: Adam Bede, Daniel Deronda, Middlemarch; Gaskell: Mary Barton, North and South; Gissing: In the Year of Jubilee, New Grub Street; Hardy: Far From the Madding Crowd, Jude the Obscure, Return of the Native, Under the Greenwood Tree; James: The Wings of the Dove; Meredith: The Egoist, The Ordeal of Richard Feverel; Scott: The Bride of Lammermoor; Thackeray: History of Henry Esmond, History of Pendennis, Vanity Fair; Trollope: Doctor Thorne.

Out-of-domain set (10 works):
Ainsworth: The Lancashire Witches; Bulwer-Lytton: Paul Clifford; Dickens: Oliver Twist, The Pickwick Papers; Collins: The Moonstone; Conan-Doyle: A Study in Scarlet, The Sign of the Four; Hughes: Tom Brown's Schooldays; Stevenson: Treasure Island; Stoker: Dracula.

Table 4: 19th century novels used in our study.

644
Smart Paradigms and the Predictability and Complexity of Inflectional
Morphology

Gregoire Detrez and Aarne Ranta


Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract

Morphological lexica are often implemented on top of morphological paradigms, corresponding to different ways of building the full inflection table of a word. Computationally precise lexica may use hundreds of paradigms, and it can be hard for a lexicographer to choose among them. To automate this task, this paper introduces the notion of a smart paradigm. It is a meta-paradigm, which inspects the base form and tries to infer which low-level paradigm applies. If the result is uncertain, more forms are given for discrimination. The number of forms needed on average is a measure of predictability of an inflection system. The overall complexity of the system also has to take into account the code size of the paradigms' definition itself. This paper evaluates the smart paradigms implemented in the open-source GF Resource Grammar Library. Predictability and complexity are estimated for four different languages: English, French, Swedish, and Finnish. The main result is that predictability does not decrease when the complexity of morphology grows, which means that smart paradigms provide an efficient tool for the manual construction and/or automatic bootstrapping of lexica.

1 Introduction

Paradigms are a cornerstone of grammars in the European tradition. A classical Latin grammar has five paradigms for nouns (declensions) and four for verbs (conjugations). The modern reference on French verbs, Bescherelle (Bescherelle, 1997), has 88 paradigms for verbs. Swedish grammars traditionally have, like Latin, five paradigms for nouns and four for verbs, but a modern computational account (Hellberg, 1978), aiming for more precision, has 235 paradigms for Swedish.

Mathematically, a paradigm is a function that produces inflection tables. Its argument is a word string (either a dictionary form or a stem), and its value is an n-tuple of strings (the word forms):

P : String → String^n

We assume that the exponent n is determined by the language and the part of speech. For instance, English verbs might have n = 5 (for sing, sings, sang, sung, singing), whereas for French verbs in Bescherelle, n = 51. We assume the tuples to be ordered, so that for instance the French second person singular present subjunctive is always found at position 17. In this way, word-paradigm pairs can be easily converted to morphological lexica and to transducers that map form descriptions to surface forms and back. A properly designed set of paradigms permits a compact representation of a lexicon and a user-friendly way to extend it.

Different paradigm systems may have different numbers of paradigms. There are two reasons for this. One is that traditional paradigms often in fact require more arguments than one:

P : String^m → String^n

Here m ≤ n and the set of arguments is a subset of the set of values. Thus the so-called fourth verb conjugation in Swedish actually needs three forms to work properly, for instance sitta, satt, suttit for the equivalent of sit, sat, sat in English. In Hellberg (1978), as in the French Bescherelle, each paradigm is defined to take exactly one argument, and hence each vowel alternation pattern must be a different paradigm.

The other factor that affects the number of paradigms is the nature of the string operations
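As a concrete, if toy, illustration of a paradigm P : String → String^n with n = 5 for English verbs, one might write something like the following sketch. The regular-verb rules here are simplifications for illustration only, not the GF library's actual paradigm.

    def english_regular_verb(lemma):
        """Toy paradigm: map an English verb lemma to a 5-tuple of forms
        (infinitive, 3rd sg present, past, past participle, present participle)."""
        if lemma.endswith("e"):
            stem = lemma[:-1]
            return (lemma, lemma + "s", stem + "ed", stem + "ed", stem + "ing")
        if lemma.endswith("y") and lemma[-2:-1] not in "aeiou":
            return (lemma, lemma[:-1] + "ies", lemma[:-1] + "ied",
                    lemma[:-1] + "ied", lemma + "ing")
        return (lemma, lemma + "s", lemma + "ed", lemma + "ed", lemma + "ing")

    print(english_regular_verb("walk"))  # ('walk', 'walks', 'walked', 'walked', 'walking')
    print(english_regular_verb("cry"))   # ('cry', 'cries', 'cried', 'cried', 'crying')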

645
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 645-653,
Avignon, France, April 23 - 27 2012. ©2012 Association for Computational Linguistics
allowed in the function P. In Hellberg (1978), noun paradigms only permit the concatenation of suffixes to a stem. Thus the paradigms are identified with suffix sets. For instance, the inflection patterns bil-bilar (car-cars) and nyckel-nycklar (key-keys) are traditionally both treated as instances of the second declension, with the plural ending ar and the contraction of the unstressed e in the case of nyckel. But in Hellberg, the word nyckel has nyck as its technical stem, to which the paradigm numbered 231 adds the singular ending el and the plural ending lar.

The notion of paradigm used in this paper allows multiple arguments and powerful string operations. In this way, we will be able to reduce the number of paradigms drastically: in fact, each lexical category (noun, adjective, verb) will have just one paradigm but with a variable number of arguments. Paradigms that follow this design will be called smart paradigms and are introduced in Section 2. Section 3 defines the notions of predictability and complexity of smart paradigm systems. Section 4 estimates these figures for four different languages of increasing richness in morphology: English, Swedish, French, and Finnish. We also evaluate the smart paradigms as a data compression method. Section 5 explores some uses of smart paradigms in lexicon building. Section 6 compares smart paradigms with related techniques such as morphology guessers and extraction tools. Section 7 concludes.

2 Smart paradigms

In this paper, we will assume a notion of paradigm that allows multiple arguments and arbitrary computable string operations. As argued in (Kaplan and Kay, 1994) and amply demonstrated in (Beesley and Karttunen, 2003), no generality is lost if the string operators are restricted to ones computable by finite-state transducers. Thus the examples of paradigms that we will show (only informally) can be converted to matching and replacements with regular expressions.

For example, a majority of French verbs can be defined by the following paradigm, which analyzes a variable-size suffix of the infinitive form and dispatches to the Bescherelle paradigms (identified by a number and an example verb):

mkV : String → String^51
mkV(s) =
  conj19finir(s),    if s ends ir
  conj53rendre(s),   if s ends re
  conj14assieger(s), if s ends eger
  conj11jeter(s),    if s ends eler or eter
  conj10ceder(s),    if s ends eder
  conj07placer(s),   if s ends cer
  conj08manger(s),   if s ends ger
  conj16payer(s),    if s ends yer
  conj06parler(s),   if s ends er

Notice that the cases must be applied in the given order; for instance, the last case applies only to those verbs ending with er that are not matched by the earlier cases.

Also notice that the above paradigm is just like the more traditional ones, in the sense that we cannot be sure if it really applies to a given verb. For instance, the verb partir ends with ir and would hence receive the same inflection as finir; however, its real conjugation is number 26 in Bescherelle. That mkV uses 19 rather than number 26 has a good reason: a vast majority of ir verbs is inflected in this conjugation, and it is also the productive one, to which new ir verbs are added.

Even though there is no mathematical difference between the mkV paradigm and the traditional paradigms like those in Bescherelle, there is a reason to call mkV a smart paradigm. This name implies two things. First, a smart paradigm implements some artificial intelligence to pick the underlying "stupid" paradigm. Second, a smart paradigm uses heuristics (informed guessing) if string matching doesn't decide the matter; the guess is informed by statistics of the distributions of different inflection classes.

One could thus say that smart paradigms are "second-order" or "meta-paradigms", compared to more traditional ones. They implement a lot of linguistic knowledge and intelligence, and thereby enable tasks such as lexicon building to be performed with less expertise than before. For instance, instead of "07" for foncer and "06" for marcher, the lexicographer can simply write mkV for all verbs instead of choosing from 88 numbers.

In fact, just V, indicating that the word is a verb, will be enough, since the name of the paradigm depends only on the part of speech. This follows the model of many dictionaries and
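The dispatch in mkV is essentially an ordered suffix match. Here is a hedged Python sketch of the same idea, with stub conjugation functions standing in for the Bescherelle paradigms (their bodies are placeholders that only report which conjugation was chosen, not real 51-form inflection tables).

    # Stub paradigms: each should return the verb's 51-form table; here they
    # just report which Bescherelle conjugation was selected.
    def conj19_finir(s):  return ("conj 19 (finir)", s)
    def conj53_rendre(s): return ("conj 53 (rendre)", s)
    def conj06_parler(s): return ("conj 06 (parler)", s)

    # Ordered (suffix, paradigm) rules; order matters, exactly as in mkV.
    RULES = [
        ("ir", conj19_finir),
        ("re", conj53_rendre),
        ("er", conj06_parler),   # catch-all for the remaining -er verbs
    ]

    def mk_verb(infinitive):
        """Smart-paradigm guess from the infinitive alone (a simplified mkV)."""
        for suffix, paradigm in RULES:
            if infinitive.endswith(suffix):
                return paradigm(infinitive)
        raise ValueError("no paradigm matches " + infinitive)

    print(mk_verb("parler"))  # conj 06
    print(mk_verb("finir"))   # conj 19 (also what 'partir' would wrongly get)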

646
methods of language teaching, where characteristic forms are used instead of paradigm identifiers. For instance, another variant of mkV could use as its second argument the first person plural present indicative to decide whether an ir verb is in conjugation 19 or in 26:

mkV : String^2 → String^51
mkV(s, t) =
  conj26partir(s), if for some x, s = x+ir and t = x+ons
  conj19finir(s), if s ends with ir
  (all the other cases that can be recognized by this extra form)
  mkV(s) otherwise (fall-back to the one-argument paradigm)

In this way, a series of smart paradigms is built for each part of speech, with more and more arguments. The trick is to investigate which new forms have the best discriminating power. For ease of use, the paradigms should be displayed to the user in an easy to understand format, e.g. as a table specifying the possible argument lists:

verb parler
verb parler, parlons
verb parler, parlons, parlera, parla, parle
noun chien
noun chien, masculine
noun chien, chiens, masculine

Notice that, for French nouns, the gender is listed as one of the pieces of information needed for lexicon building. In many cases, it can be inferred from the dictionary form just like the inflection; for instance, that most nouns ending e are feminine. A gender argument in the smart noun paradigm makes it possible to override this default behaviour.

2.1 Paradigms in GF

Smart paradigms as used in this paper have been implemented in the GF programming language (Grammatical Framework, (Ranta, 2011)). GF is a functional programming language enriched with regular expressions. For instance, the following function implements a part of the one-argument French verb paradigm shown above. It uses a case expression to pattern match with the argument s; the pattern _ matches anything, while + divides a string into two pieces, and | expresses alternation. The functions conj19finir etc. are defined elsewhere in the library. Function application is expressed without parentheses, by the juxtaposition of the function and the argument.

mkV : Str -> V
mkV s = case s of {
  _ + "ir"            -> conj19finir s ;
  _ + ("eler"|"eter") -> conj11jeter s ;
  _ + "er"            -> conj06parler s ;
  }

The GF Resource Grammar Library¹ has comprehensive smart paradigms for 18 languages: Amharic, Catalan, Danish, Dutch, English, Finnish, French, German, Hindi, Italian, Nepalese, Norwegian, Romanian, Russian, Spanish, Swedish, Turkish, and Urdu. A few other languages have complete sets of traditional inflection paradigms but no smart paradigms.

¹ Source code and documentation in http://www.grammaticalframework.org/lib.

Six languages in the library have comprehensive morphological dictionaries: Bulgarian (53k lemmas), English (42k), Finnish (42k), French (92k), Swedish (43k), and Turkish (23k). They have been extracted from other high-quality resources via conversions to GF using the paradigm systems. In Section 4, four of them will be used for estimating the strength of the smart paradigms, that is, the predictability of each language.

3 Cost, predictability, and complexity

Given a language L, a lexical category C, and a set P of smart paradigms for C, the predictability of the morphology of C in L by P depends inversely on the average number of arguments needed to generate the correct inflection table for a word. The lower the number, the more predictable the system.

Predictability can be estimated from a lexicon that contains such a set of tables. Formally, a smart paradigm is a family P_m of functions

P_m : String^m → String^n

where m ranges over some set of integers from 1 to n, but need not contain all those integers. A lexicon L is a finite set of inflection tables,

L = { w_i : String^n | i = 1, ..., M_L }

647
As the n is fixed, this is a lexicon specialized to source code size rather than e.g. a finite automa-
one part of speech. A word is an element of the ton size gives in our view a better approximation
lexicon, that is, an inflection table of size n. of the cognitive load of the paradigm system,
An application of a smart paradigm Pm to a its learnability. As a functional programming
word w L is an inflection table resulting from language, GF permits abstractions comparable to
applying Pm to the appropriate subset m (w) of those available for human language learners, who
the inflection table w, dont need to learn the repetitive details of a finite
automaton.
Pm [w] = Pm (m (w)) : Stringn We define the code complexity as the size of
the abstract syntax tree of the source code. This
Thus we assume that all arguments are existing
size is given as the number of nodes in the syntax
word forms (rather than e.g. stems), or features
tree; for instance,
such as the gender. n
X
An application is correct if size(f (x1 , . . . , xn )) = 1 + size(xi )
i=1
Pm [w] = w size(s) = 1, for a string literal s
Using the abstract syntax size makes it possible
The cost of a word w is the minimum number of
to ignore programmer-specific variation such as
arguments needed to make the application correct:
identifier size. Measurements of the GF Resource
cost(w) = argmin(Pm [w] = w) Grammar Library show that code size measured
m in this way is in average 20% of the size of source
For practical applications, it is useful to require files in bytes. Thus a source file of 1 kB has the
Pm to be monotonic, in the sense that increasing code complexity around 200 on the average.
m preserves correctness. Notice that code complexity is defined in a way
The cost of a lexicon L is the average cost for that makes it into a straightforward generaliza-
its words, tion of the cost of a word as expressed in terms
of paradigm applications in GF source code. The
ML
X source code complexity of a paradigm application
cost(wi )
is
i=1
cost(L) = size(Pm [w]) = 1 + m
ML
where ML is the number of words in the lexicon, Thus the complexity for a word w is its cost plus
as defined above. one; the addition of one comes from the applica-
The predictability of a lexicon could be de- tion node for the function Pm and corresponds to
fined as a quantity inversely dependent on its cost. knowing the part of speech of the word.
For instance, an information-theoretic measure
4 Experimental results
could be defined
1 We conducted experiments in four languages (En-
predict(L) = glish, Swedish, French and Finnish2 ), presented
1 + log cost(L)
here in order of morphological richness. We used
with the intuition that each added argument cor- trusted full form lexica (i.e. lexica giving the com-
responds to a choice in a decision tree. However, plete inflection table of every word) to compute
we will not use this measure in this paper, but just the predictability, as defined above, in terms of
the concrete cost. the smart paradigms in GF Resource Grammar Li-
The complexity of a paradigm system is de- brary.
fined as the size of its code in a given coding We used a simple algorithm for computing the
system, following the idea of Kolmogorov com- cost c of a lexicon L with a set Pm of smart
plexity (Solomonoff, 1964). The notion assumes paradigms:
a coding system, which we fix to be GF source 2
This choice correspond to the set of language for which
code. As the results are relative to the coding both comprehensive smart paradigms and morphological
system, they are only usable for comparing def- dictionaries were present in GF with the exception of Turk-
initions in the same system. However, using GF ish, which was left out because of time constraints.

648
set c := 0 one third of the nouns of the lexicon were not in-
cluded in the experiment because one of the form
for each word wi in L, was missing. The vast majority of the remaining
for each m in growing order for which 15,000 nouns are very regular, with predictable
Pm is defined: deviations such as kiss - kisses and fly - flies which
if Pm [w] = w, then c := c + m, else try can be easily predicted by the smart paradigm.
with next m With the average cost of 1.05, this was the most
predictable lexicon in our experiment.
return c Verbs. Verbs are the most interesting category
in English because they present the richest mor-
The average cost is c divided by the size of L. phology. Indeed, as shown by Table 1, the cost
The procedure presupposes that it is always for English verbs, 1.21, is similar to what we got
possible to get the correct inflection table. For for morphologically richer languages.
this to be true, the smart paradigms must have a
worst case scenario version that is able to gen- 4.2 Swedish
erate all forms. In practice, this was not always As gold standard, we used the SALDO lexicon
the case but we checked that the number of prob- (Borin et al., 2008).
lematic words is so small that it wouldnt be sta- Nouns. The noun inflection tables had 8
tistically significant. A typical problem word was forms (singular/plural indefinite/definite nomina-
the equivalent of the verb be in each language. tive/genitive) plus a gender (uter/neuter). Swedish
Another source of deviation is that a lexicon nouns are intrinsically very unpredictable, and
may have inflection tables with size deviating there are many examples of homonyms falling un-
from the number n that normally defines a lex- der different paradigms (e.g. val - val choice vs.
ical category. Some words may be defective, val -valar whale). The cost 1.70 is the highest
i.e. lack some forms (e.g. the singular form of all the lexica considered. Of course, there may
in plurale tantum words), whereas some words be room for improving the smart paradigm.
may have several variants for a given form (e.g. Verbs. The verbs had 20 forms, which in-
learned and learnt in English). We made no ef- cluded past participles. We ran two experiments,
fort to predict defective words, but just ignored by choosing either the infinitive or the present in-
them. With variant forms, we treated a prediction dicative as the base form. In traditional Swedish
as correct if it matched any of the variants. grammar, the base form of the verb is considered
The above algorithm can also be used for help- to be the infinitive, e.g. spela, leka (play in
ing to select the optimal sets of characteristic two different senses). But this form doesnt dis-
forms; we used it in this way to select the first tinguish between the first and the second con-
form of Swedish verbs and the second form of jugation. However, the present indicative, here
Finnish nouns. spelar, leker, does. Using it gives a predictive
The results are collected in Table 1. The sec- power 1.13 as opposed to 1.22 with the infinitive.
tions below give more details of the experiment in Some modern dictionaries such as Lexin4 there-
each language. fore use the present indicative as the base form.
4.1 English 4.3 French
As gold standard, we used the electronic version For French, we used the Morphalou morpholog-
of the Oxford Advanced Learners Dictionary of ical lexicon (Romary et al., 2004). As stated in
Current English3 which contains about 40,000 the documentation5 the current version of the lex-
root forms (about 70,000 word forms). icon (version 2.0) is not complete, and in par-
Nouns. We considered English nouns as hav- ticular, many entries are missing some or all in-
ing only two forms (singular and plural), exclud- flected forms. So for those experiments we only
ing the genitive forms which can be considered to
4
be clitics and are completely predictable. About http://lexin.nada.kth.se/lexin/
5
http://www.cnrtl.fr/lexiques/
3
available in electronic form at http://www.eecs. morphalou/LMF-Morphalou.php#body_3.4.11,
qmul.ac.uk/mpurver/software.html accessed 2011-11-04

649
Table 1: Lexicon size and average cost for the nouns (N) and verbs (V) in four languages, with the percentage of
words correctly inferred from one and two forma (i.e. m = 1 and m 2, respectively).
Lexicon Forms Entries Cost m=1 m2
Eng N 2 15,029 1.05 95% 100%
Eng V 5 5,692 1.21 84% 95%
Swe N 9 59,225 1.70 46% 92%
Swe V 20 4,789 1.13 97% 97%
Fre N 3 42,390 1.25 76% 99%
Fre V 51 6,851 1.27 92% 94%
Fin N 34 25,365 1.26 87% 97%
Fin V 102 10,355 1.09 96% 99%

included entries where all the necessary forms glutinative way. The traditional number and case
were presents. count for nouns gives 26, whereas for verbs the
Nouns: Nouns in French have two forms (sin- count is between 100 and 200, depending on how
gular and plural) and an intrinsic gender (mascu- participles are counted. Notice that the definition
line or feminine), which we also considered to be of predictability used in this paper doesnt depend
a part of the inflection table. Most of the unpre- on the number of forms produced (i.e. not on n
dictability comes from the impossibility to guess but only on m); therefore we can simply ignore
the gender. this question. However, the question is interesting
Verbs: The paradigms generate all of the sim- if we think about paradigms as a data compression
ple (as opposed to compound) tenses given in tra- method (Section 4.5).
ditional grammars such as the Bescherelle. Also Nouns. Compound nouns are a problem for
the participles are generated. The auxiliary verb morphology prediction in Finnish, because inflec-
of compound tenses would be impossible to guess tion is sensitive to the vowel harmony and num-
from morphological clues, and was left out of ber of syllables, which depend on where the com-
consideration. pound boundary goes. While many compounds
are marked in KOTUS, we had to remove some
4.4 Finnish compounds with unmarked boundaries. Another
The Finnish gold standard was the KOTUS lexi- peculiarity was that adjectives were included in
con (Kotimaisten Kielten Tutkimuskeskus, 2006). nouns; this is no problem since the inflection pat-
It has around 90,000 entries tagged with part terns are the same, if comparison forms are ig-
of speech, 50 noun paradigms, and 30 verb nored. The figure 1.26 is better than the one re-
paradigms. Some of these paradigms are rather ported in (Ranta, 2008), which is 1.42; the reason
abstract and powerful; for instance, grade alterna- is mainly that the current set of paradigms has a
tion would multiply many of the paradigms by a better coverage of three-syllable nouns.
factor of 10 to 20, if it was treated in a concate- Verbs. Even though more numerous in forms
native way. For instance, singular nominative- than nouns, Finnish verbs are highly predictable
genitive pairs show alternations such as talotalon (1.09).
(house), kattokaton (roof), kantokannon
(stub), rakoraon (crack), and satosadon 4.5 Complexity and data compression
(harvest). All of these are treated with one and The cost of a lexicon has an effect on learnabil-
the same paradigm, which makes the KOTUS sys- ity. For instance, even though Finnish words have
tem relatively abstract. ten or a hundred times more forms than English
The total number of forms of Finnish nouns and forms, these forms can be derived from roughly
verbs is a question of definition. Koskenniemi the same number of characteristic forms as in En-
(Koskenniemi, 1983) reports 2000 for nouns and glish. But this is of course just a part of the truth:
12,000 for verbs, but most of these forms result by it might still be that the paradigm system itself is
adding particles and possessive suffixes in an ag- much more complex in some languages than oth-

650
gives, for the Finnish verb lexicon, a file of 60 kB,
Table 2: Paradigm complexities for nouns and verbs
in the four languages, computed as the syntax tree size which implies a joint compression rate of 227.
of GF code. That the compression rates for the code can be
language noun verb total higher than the numbers of forms in the full-form
English 403 837 991 lexicon is explained by the fact that the gener-
Swedish 918 1039 1884 ated forms are longer than the base forms. For
instance, the full-form entry of the Finnish verb
French 351 2193 2541
uida (swim) is 850 bytes, which means that the
Finnish 4772 3343 6885
average form size is twice the size of the basic
form.

ers. 5 Smart paradigms in lexicon building


Following the definitions of Section 3, we have
counted the the complexity of the smart paradigm Building a high-quality lexicon needs a lot of
definitions for nouns and verbs in the different manual work. Traditionally, when one is not writ-
languages in the GF Resource Grammar Library. ing all the forms by hand (which would be almost
Notice that the total complexity of the system is impossible in languages with rich morphology),
lower than the sum of the parts, because many sets of paradigms are used that require the lexi-
definitions (such as morphophonological transfor- cographer to specify the base form of the word
mations) are reused in different parts of speech. and an identifier for the paradigm to use. This has
The results are in Table 2. several usability problems: one has to remember
These figures suggest that Finnish indeed has a all the paradigm identifiers and choose correctly
more complex morphology than French, and En- from them.
glish is the simplest. Of course, the paradigms Smart paradigm can make this task easier, even
were not implemented with such comparisons in accessible to non-specialist, because of their abil-
mind, and it may happen that some of the differ- ity to guess the most probable paradigm from a
ences come from different coding styles involved single base form. As shown by Table 1, this is
in the collaboratively built library. Measuring more often correct than not, except for Swedish
code syntax trees rather than source code text neu- nouns. If this information is not enough, only a
tralizes some of this variation (Section 3). few more forms are needed, requiring only prac-
tical knowledge of the language. Usually (92% to
Finally, we can estimate the power of smart
100% in Table 1), adding a second form (m = 2)
paradigms as a data compression function. In a
is enough to cover all words. Then the best prac-
sense, a paradigm is a function designed for the
tice for lexicon writing might be always to give
very purpose of compressing a lexicon, and one
these two forms instead of just one.
can expect better compression than with generic
Smart paradigms can also be used for an auto-
tools such as bzip2. Table 3 shows the compres-
matic bootstrapping of a list of base forms into a
sion rates for the same full-form lexica as used
full form lexicon. As again shown by the last col-
in the predictability experiment (Table 1). The
umn of Table 1, one form alone can provide an
sizes are in kilobytes, where the code size for
excellent first approximation in most cases. What
paradigms is calculated as the number of con-
is more, it is often the case that uncaught words
structors multiplied by 5 (Section 3). The source
belong to a limited set of irregular words, such
lexicon size is a simple character count, similar to
as the irregular verbs in many languages. All new
the full-form lexicon.
words can then be safely inferred from the base
Unexpectedly, the compression rate of the
form by using smart paradigms.
paradigms improves as the number of forms in
the full-form lexicon increases (see Table 1 for 6 Related work
these numbers). For English and French nouns,
bzip2 is actually better. But of course, unlike Smart paradigms were used for a study of Finnish
the paradigms, it also gives a global compression morphology in (Ranta, 2008). The present paper
over all entries in the lexicon. Combining the can be seen as a generalization of that experiment
two methods by applying bzip2 to the source code to more languages and with the notion of code

651
Table 3: Comparison between using bzip2 and paradigms+lexicon source as a compression method. Sizes in
kB.
Lexicon Fullform bzip2 fullform/bzip2 Source fullform/source
Eng N 264 99 2.7 135 2.0
Eng V 245 78 3.2 57 4.4
Swe N 6,243 1,380 4.5 1,207 5.3
Swe V 840 174 4.8 58 15
Fre N 952 277 3.4 450 2.2
Fre V 3,888 811 4.8 98 40
Fin N 11,295 2,165 5.2 343 34
Fin V 13,609 2,297 5.9 123 114

complexity. Also the paradigms for Finnish are ber of forms to determine that a word belongs to a
improved here (cf. Section 4.4 above). certain paradigm. Smart paradigms can then give
Even though smart paradigm-like descriptions the method to actually construct the full inflection
are common in language text books, there is to tables from the characteristic forms.
our knowledge no computational equivalent to the
smart paradigms of GF. Finite state morphology 7 Conclusion
systems often have a function called a guesser, We have introduced the notion of smart
which, given a word form, tries to guess either paradigms, which implement the linguistic
the paradigm this form belongs to or the dictio- knowledge involved in inferring the inflection of
nary form (or both). A typical guesser differs words. We have used the paradigms to estimate
from a smart paradigms in that it does not make the predictability of nouns and verbs in English,
it possible to correct the result by giving more Swedish, French, and Finnish. The main result
forms. Examples of guessers include (Chanod is that, with the paradigms used, less than two
and Tapanainen, 1995) for French, (Hlavacova, forms in average is always enough. In half of the
2001) for Czech, and (Nakov et al., 2003) for Ger- languages and categories, one form is enough to
man. predict more than 90% of forms correctly. This
Another related domain is the unsupervised gives a promise for both manual lexicon building
learning of morphology where machine learning and automatic bootstrapping of lexicon from
is used to automatically build a language mor- word lists.
phology from corpora (Goldsmith, 2006). The To estimate the overall complexity of inflection
main difference is that with the smart paradigms, systems, we have also measured the size of the
the paradigms and the guess heuristics are imple- source code for the paradigm systems. Unsurpris-
mented manually and with a high certainty; in un- ingly, Finnish is around seven times as complex
supervised learning of morphology the paradigms as English, and around three times as complex as
are induced from the input forms with much lower Swedish and French. But this cost is amortized
certainty. Of particular interest are (Chan, 2006) when big lexica are built.
and (Dreyer and Eisner, 2011), dealing with the Finally, we looked at smart paradigms as a data
automatic extraction of paradigms from text and compression method. With simple morphologies,
investigate how good these can become. The main such as English nouns, bzip2 gave a better com-
contrast is, again, that our work deals with hand- pression of the lexicon than the source code us-
written paradigms that are correct by design, and ing paradigms. But with Finnish verbs, the com-
we try to see how much information we can drop pression rate was almost 20 times higher with
before losing correctness. paradigms than with bzip2.
Once given, a set of paradigms can be used in The general conclusion is that smart paradigms
automated lexicon extraction from raw data, as in (Forsberg et al., 2006) and (Clement et al., 2004), by a method that tries to collect a sufficient num-

are a good investment when building morphological lexica, as they ease the task of both human lexicographers and automatic bootstrapping methods. They also suggest a method to assess the complexity and learnability of languages, related to Kolmogorov complexity. The results in the current paper are just preliminary in this respect, since they might still tell more about particular implementations of paradigms than about the languages themselves.

Acknowledgements

We are grateful to the anonymous referees for valuable remarks and questions. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no FP7-ICT-247914 (the MOLTO project).

References

[Beesley and Karttunen2003] Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications.
[Bescherelle1997] Bescherelle. 1997. La conjugaison pour tous. Hatier.
[Borin et al.2008] Lars Borin, Markus Forsberg, and Lennart Lönngren. 2008. SALDO 1.0 (svenskt associationslexikon version 2). Språkbanken, 05.
[Chan2006] Erwin Chan. 2006. Learning probabilistic paradigms for morphology in a latent class model. In Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology, SIGPHON '06, pages 69-78, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Chanod and Tapanainen1995] Jean-Pierre Chanod and Pasi Tapanainen. 1995. Creating a tagset, lexicon and guesser for a French tagger. CoRR, cmp-lg/9503004.
[Clement et al.2004] Lionel Clément, Benoît Sagot, and Bernard Lang. 2004. Morphology based automatic acquisition of large-coverage lexica. In Proceedings of LREC-04, Lisboa, Portugal, pages 1841-1844.
[Dreyer and Eisner2011] Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 616-627, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Forsberg et al.2006] Markus Forsberg, Harald Hammarström, and Aarne Ranta. 2006. Morphological Lexicon Extraction from Raw Text Data. In T. Salakoski, editor, FinTAL 2006, volume 4139 of LNCS/LNAI.
[Goldsmith2006] John Goldsmith. 2006. An Algorithm for the Unsupervised Learning of Morphology. Nat. Lang. Eng., 12(4):353-371.
[Hellberg1978] Staffan Hellberg. 1978. The Morphology of Present-Day Swedish. Almqvist & Wiksell.
[Hlavacova2001] Jaroslava Hlaváčová. 2001. Morphological guesser of Czech words. In Václav Matoušek, Pavel Mautner, Roman Mouček, and Karel Tauser, editors, Text, Speech and Dialogue, volume 2166 of Lecture Notes in Computer Science, pages 70-75. Springer Berlin / Heidelberg.
[Kaplan and Kay1994] R. Kaplan and M. Kay. 1994. Regular Models of Phonological Rule Systems. Computational Linguistics, 20:331-380.
[Koskenniemi1983] Kimmo Koskenniemi. 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Ph.D. thesis, University of Helsinki.
[Kotimaisten Kielten Tutkimuskeskus2006] Kotimaisten Kielten Tutkimuskeskus. 2006. KOTUS Wordlist. http://kaino.kotus.fi/sanat/nykysuomi.
[Nakov et al.2003] Preslav Nakov, Yury Bonev, et al. 2003. Guessing morphological classes of unknown German nouns.
[Ranta2008] Aarne Ranta. 2008. How predictable is Finnish morphology? An experiment on lexicon construction. In J. Nivre, M. Dahllöf, and B. Megyesi, editors, Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein, pages 130-148. University of Uppsala. http://publications.uu.se/abstract.xsql?dbid=8933.
[Ranta2011] Aarne Ranta. 2011. Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, Stanford. ISBN-10: 1-57586-626-9 (Paper), 1-57586-627-7 (Cloth).
[Romary et al.2004] Laurent Romary, Susanne Salmon-Alt, and Gil Francopoulo. 2004. Standards going concrete: from LMF to Morphalou. In The 20th International Conference on Computational Linguistics - COLING 2004, Geneva, Switzerland.
[Solomonoff1964] Ray J. Solomonoff. 1964. A formal theory of inductive inference: Parts 1 and 2. Information and Control, 7:1-22 and 224-254.
Probabilistic Hierarchical Clustering of
Morphological Paradigms

Burcu Can Suresh Manandhar


Department of Computer Science Department of Computer Science
University of York University of York
Heslington, York, YO10 5GH, UK Heslington, York, YO10 5GH, UK
burcucan@gmail.com suresh@cs.york.ac.uk

Abstract

We propose a novel method for learning morphological paradigms that are structured within a hierarchy. The hierarchical structuring of paradigms groups morphologically similar words close to each other in a tree structure. This allows detecting morphological similarities easily, leading to improved morphological segmentation. Our evaluation using the (Kurimo et al., 2011a; Kurimo et al., 2011b) datasets shows that our method performs competitively when compared with current state-of-the-art systems.

1 Introduction

Unsupervised morphological segmentation of a text involves learning rules for segmenting words into their morphemes. Morphemes are the smallest meaning-bearing units of words. The learning process is fully unsupervised, using only raw text as input to the learning system. For example, the word respectively is split into the morphemes respect, ive and ly. Many fields, such as machine translation, information retrieval, speech recognition etc., require morphological segmentation, since new words are always created and storing all the word forms would require a massive dictionary. The task is even more complex when morphologically complicated languages (i.e. agglutinative languages) are considered. The sparsity problem is more severe for more morphologically complex languages. Applying morphological segmentation mitigates data sparsity by tackling the issue of out-of-vocabulary (OOV) words.

In this paper, we propose a paradigmatic approach. A morphological paradigm is a pair (StemList, SuffixList) such that each concatenation Stem+Suffix (where Stem ∈ StemList and Suffix ∈ SuffixList) is a valid word form. The learning of morphological paradigms is not novel, as there is already existing work in this area, such as Goldsmith (2001), Snover et al. (2002), Monson et al. (2009), Can and Manandhar (2009) and Dreyer and Eisner (2011). However, none of these existing approaches addresses learning of the hierarchical structure of paradigms.

Hierarchical organisation of words helps capture morphological similarities between words in a compact structure by factoring these similarities through stems, suffixes or prefixes. Our inference algorithm simultaneously infers latent variables (i.e. the morphemes) along with their hierarchical organisation. Most hierarchical clustering algorithms are single-pass: once the hierarchical structure is built, the structure does not change further.

The paper is structured as follows: section 2 gives the related work, section 3 describes the probabilistic hierarchical clustering scheme, section 4 explains the morphological segmentation model by embedding it into the clustering scheme and describes the inference algorithm along with how the morphological segmentation is performed, section 5 presents the experiment settings along with the evaluation scores, and finally section 6 presents a discussion with a comparison with other systems that participated in Morpho Challenge 2009 and 2010.

2 Related Work

We propose a Bayesian approach for learning of paradigms in a hierarchy. If we ignore the hierarchical aspect of our learning algorithm, then our

Figure 1: A sample tree structure. [Tree with root {walk, talk, quick}{0, ed, ing, ly, s}, internal nodes {walk, talk}{0, ed, ing, s} and {quick}{0, ly}, lower nodes {walk}{0, ing} and {talk}{ed, s}, and leaves walk, walking, talked, talks, quick, quickly.]

Figure 2: A segment of a tree with internal nodes Di, Dj, Dk having data points {x1, x2, x3, x4}. The subtree below the internal node Di is called Ti, the subtree below the internal node Dj is Tj, and the subtree below the internal node Dk is Tk.

method is similar to the Dirichlet Process (DP) based model of Goldwater et al. (2006). From this perspective, our method can be understood as adding a hierarchical structure learning layer on top of the DP based learning method proposed in Goldwater et al. (2006). Dreyer and Eisner (2011) propose an infinite Dirichlet mixture model for capturing paradigms. However, they do not address learning of hierarchy.

The method proposed in Chan (2006) also learns within a hierarchical structure, where Latent Dirichlet Allocation (LDA) is used to find stem-suffix matrices. However, their work is supervised, as true morphological analyses of words are provided to the system. In contrast, our proposed method is fully unsupervised.

3 Probabilistic Hierarchical Model

The hierarchical clustering proposed in this work differs from existing hierarchical clustering algorithms in two aspects:

- It is not single-pass, as the hierarchical structure changes during learning.

- It is probabilistic and is not dependent on a distance metric.

3.1 Mathematical Definition

In this paper, a hierarchical structure is a binary tree in which each internal node represents a cluster. Let a data set be D = {x_1, x_2, ..., x_n} and let T be the entire tree, where each data point x_i is located at one of the leaf nodes (see Figure 2). Here, D_k denotes the data points in the branch T_k. Each node defines a probabilistic model for the words that the cluster acquires. The probabilistic model can be denoted as p(x_i | θ), where θ denotes the parameters of the probabilistic model.

The marginal probability of the data in any node can be calculated as:

    p(D_k) = ∫ p(D_k | θ) p(θ | β) dθ    (1)

The likelihood of the data under any subtree is defined as follows:

    p(D_k | T_k) = p(D_k) p(D_l | T_l) p(D_r | T_r)    (2)

where the probability is defined in terms of the left T_l and right T_r subtrees. Equation 2 provides a recursive decomposition of the likelihood in terms of the likelihoods of the left and the right subtrees until the leaf nodes are reached. We use the marginal probability (Equation 1) as prior information, since the marginal probability bears the probability of having the data from the left and right subtrees within a single cluster.
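As a concrete illustration of the recursion in Equations 1 and 2, the following minimal Python sketch computes the likelihood of the data under a subtree; the Node class, its field names and the marginal_prob callable are illustrative assumptions of ours, not part of the paper.

```python
class Node:
    """A node of the binary paradigm tree; leaves hold single data points."""
    def __init__(self, data, left=None, right=None):
        self.data = data      # the words D_k falling under this node
        self.left = left      # left subtree T_l (None for a leaf)
        self.right = right    # right subtree T_r (None for a leaf)


def subtree_likelihood(node, marginal_prob):
    """Recursive decomposition of Equation 2:
    p(D_k | T_k) = p(D_k) * p(D_l | T_l) * p(D_r | T_r),
    where marginal_prob(data) plays the role of p(D_k) from Equation 1."""
    if node.left is None and node.right is None:
        return marginal_prob(node.data)
    return (marginal_prob(node.data)
            * subtree_likelihood(node.left, marginal_prob)
            * subtree_likelihood(node.right, marginal_prob))
```

In practice the computation would be carried out in log space, which is also how the convergence of the marginal likelihood is reported later (Figure 8).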
4 Morphological Segmentation

In our model, data points are words to be clustered and each cluster represents a paradigm. In the hierarchical structure, words are organised in such a way that morphologically similar words are located close to each other, so that they can be grouped in the same paradigms. Morphological similarity refers to at least one common morpheme between words. However, we do not make a distinction between morpheme types. Instead, we assume that each word is organised as a stem+suffix combination.

4.1 Model Definition

Let a dataset D consist of words to be analysed, where each word w_i has a latent variable, the split point that analyses the word into its stem s_i and suffix m_i:

    D = {w_1 = s_1 + m_1, ..., w_n = s_n + m_n}

The marginal likelihood of the words in node k is defined such that:

    p(D_k) = p(S_k) p(M_k) = p(s_1, s_2, ..., s_n) p(m_1, m_2, ..., m_n)

The words in each cluster represent a paradigm that consists of stems and suffixes. The hierarchical model puts words sharing the same stems or suffixes close to each other in the tree. Each word is part of all the paradigms on the path from the leaf node having that word to the root. A word can share either its stem or its suffix with other words in the same paradigm. Hence, a considerable number of words that may not be seen in the corpus can be generated through this approach.

Figure 3: The plate diagram of the model, representing the generation of a word w_i from the stem s_i and the suffix m_i that are generated from Dirichlet processes. In the representation, solid boxes denote that the process is repeated the number of times given in the corner of each box. [Plates: stems s_i drawn from G_s with base P_s, repeated L times; suffixes m_i drawn from G_m with base P_m, repeated N times; words w_i, repeated n times.]

We postulate that stems and suffixes are generated independently from each other. Thus, the probability of a word becomes:

    p(w = s + m) = p(s) p(m)    (3)

We define two Dirichlet processes to generate stems and suffixes independently:

    G_s | α_s, P_s ∼ DP(α_s, P_s)
    G_m | α_m, P_m ∼ DP(α_m, P_m)
    s | G_s ∼ G_s
    m | G_m ∼ G_m

where DP(α_s, P_s) denotes a Dirichlet process that generates stems. Here, α_s is the concentration parameter, which determines the number of stem types generated by the Dirichlet process. The smaller the value of the concentration parameter, the less likely the process is to generate new stem types. In contrast, the larger the value of the concentration parameter, the more likely it is to generate new stem types, yielding a more uniform distribution over stem types. If α_s < 1, sparse stems are supported, yielding a more skewed distribution. To support a small number of stem types in each cluster, we chose α_s < 1.

Here, P_s is the base distribution, which we use as a prior probability distribution over morpheme lengths. We model morpheme lengths implicitly through the morpheme letters:

    P_s(s_i) = ∏_{c_i ∈ s_i} p(c_i)    (4)

where c_i denotes the letters, which are distributed uniformly. Modelling morpheme letters is a way of modelling morpheme length, since shorter morphemes are favoured in order to have fewer factors in Equation 4 (Creutz and Lagus, 2005b). The Dirichlet process DP(α_m, P_m) is defined for suffixes analogously. The graphical representation of the entire model is given in Figure 3.

Once the probability distributions G = {G_s, G_m} are drawn from both Dirichlet processes, words can be generated by drawing a stem from G_s and a suffix from G_m. However, we do not attempt to estimate the probability distributions G; instead, G is integrated out. The joint probability of the stems is calculated by integrating out G_s:

    p(s_1, s_2, ..., s_L) = ∫ p(G_s) ∏_{i=1}^{L} p(s_i | G_s) dG_s    (5)

where L denotes the number of stem tokens.
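To make the base distribution of Equation 4 concrete, here is a small Python sketch that scores a morpheme as a product of uniform letter probabilities; the alphabet size and the function name are assumptions made for the example, not values given in the paper.

```python
def base_prob(morpheme, alphabet_size=26):
    """P_s(s_i) = product of p(c_i) over the letters c_i of the morpheme
    (Equation 4). With uniform letter probabilities, every extra letter
    multiplies in another factor of 1/alphabet_size, so shorter stems and
    suffixes receive higher base probability."""
    prob = 1.0
    for _ in morpheme:
        prob *= 1.0 / alphabet_size
    return prob

# For example, base_prob("ing") > base_prob("keeper"): the implicit length
# penalty is what makes the model favour short morphemes.
```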
The joint probability distribution of the stems can be tackled as a Chinese restaurant process, which introduces dependencies between stems. Hence, the joint probability of the stems S = {s_1, ..., s_L} becomes:

    p(s_1, s_2, ..., s_L) = p(s_1) p(s_2 | s_1) ... p(s_L | s_1, ..., s_{L-1})
                          = [Γ(α_s) / Γ(L + α_s)] α_s^K ∏_{i=1}^{K} P_s(s_i) ∏_{i=1}^{K} (n_{s_i} − 1)!    (6)

where K denotes the number of stem types. In the equation, the second and the third factors correspond to the case where novel stems are generated for the first time; the last factor corresponds to the case in which stems that have already been generated n_{s_i} times previously are generated again. The first factor consists of all the denominators from both cases.

The integration process is applied to the probability distribution G_m for suffixes analogously. Hence, the joint probability of the suffixes M = {m_1, ..., m_N} becomes:

    p(m_1, m_2, ..., m_N) = p(m_1) p(m_2 | m_1) ... p(m_N | m_1, ..., m_{N-1})
                          = [Γ(α_m) / Γ(N + α_m)] α_m^T ∏_{i=1}^{T} P_m(m_i) ∏_{i=1}^{T} (n_{m_i} − 1)!    (7)

where T denotes the number of suffix types and n_{m_i} is the number of times the suffix m_i has already been generated.

Following the joint probability distribution of stems, the conditional probability of a stem given the previously generated stems can be derived as:

    p(s_i | S^{-s_i}, α_s, P_s) = n_{s_i}^{S^{-s_i}} / (L − 1 + α_s)    if s_i ∈ S
                                = α_s P_s(s_i) / (L − 1 + α_s)         otherwise    (8)

where n_{s_i}^{S^{-s_i}} denotes the number of stem instances s_i that have been generated previously, and S^{-s_i} denotes the stem set excluding the new instance of the stem s_i.

The conditional probability of a suffix given the other suffixes that have been previously generated is defined similarly:

    p(m_i | M^{-m_i}, α_m, P_m) = n_{m_i}^{M^{-m_i}} / (N − 1 + α_m)    if m_i ∈ M
                                = α_m P_m(m_i) / (N − 1 + α_m)          otherwise    (9)

where n_{m_i}^{M^{-m_i}} is the number of instances of m_i that have been generated previously, and M^{-m_i} is the set of suffixes excluding the new instance of the suffix m_i.

Figure 4: A portion of a sample tree. [Leaves: plugg+ed, skew+ed, exclaim+ed, borrow+s, borrow+ed, liken+s, liken+ed, consist+s, consist+ed, grouped under shared internal nodes.]

A portion of a tree is given in Figure 4. As can be seen in the figure, all words are located at leaf nodes. Therefore, the root node of this subtree consists of the words {plugg+ed, skew+ed, exclaim+ed, borrow+s, borrow+ed, liken+s, liken+ed, consist+s, consist+ed}.
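The conditional probabilities in Equations 8 and 9 are the standard Chinese restaurant process predictive probabilities; the root-node variants in Equations 14 and 15 later in the paper have the same shape. A minimal sketch, with counts kept in a dictionary and base_prob as in the previous sketch (all names are ours):

```python
def crp_prob(item, counts, total, alpha, base_prob):
    """Predictive probability of generating `item` next, as in Equations 8/9:
    counts[item] / (total + alpha) for a previously seen type, and
    alpha * base_prob(item) / (total + alpha) for a new one.
    `total` is the number of previously generated tokens (L-1 or N-1)."""
    if item in counts:
        return counts[item] / (total + alpha)
    return alpha * base_prob(item) / (total + alpha)
```

The same function serves for stems and suffixes; only the count table, the concentration parameter and the base distribution differ.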
4.2 Inference

The initial tree is constructed by randomly choosing a word from the corpus and adding it at a randomly chosen position in the tree. When constructing the initial tree, latent variables are also assigned randomly, i.e. each word is split at a random position (see Algorithm 1).

Algorithm 1 Creating the initial tree
 1: input: data D = {w_1 = s_1 + m_1, ..., w_n = s_n + m_n}
 2: initialise: root D_1 where D_1 = {w_1 = s_1 + m_1}
 3: initialise: c ← n − 1
 4: while c >= 1 do
 5:    Draw a word w_j from the corpus.
 6:    Split the word randomly such that w_j = s_j + m_j
 7:    Create a new node D_j where D_j = {w_j = s_j + m_j}
 8:    Choose a sibling node D_k for D_j
 9:    Merge D_new ← D_j ∪ D_k
10:    Remove w_j from the corpus
11:    c ← c − 1
12: end while
13: output: Initial tree

We use the Metropolis-Hastings algorithm (Hastings, 1970), an instance of Markov Chain Monte Carlo (MCMC) algorithms, to infer the optimal hierarchical structure along with the morphological segmentation of words (given in Algorithm 2). During each iteration i, a leaf node D_i = {w_i = s_i + m_i} is drawn from the current tree structure. The drawn leaf node is removed from the tree. Next, a node D_k is drawn uniformly from the tree to make it a sibling node to D_i. In addition to a sibling node, a split point w_i = s_i + m_i is drawn uniformly. Next, the node D_i = {w_i = s_i + m_i} is inserted as a sibling node to D_k. After updating all probabilities along the path to the root, the new tree structure is either accepted or rejected by applying the Metropolis-Hastings update rule. The likelihood of the data under the given tree structure is used as the sampling probability.

Algorithm 2 Inference algorithm
 1: input: data D = {w_1 = s_1 + m_1, ..., w_n = s_n + m_n}, initial tree T, initial temperature of the system γ, target temperature of the system γ_target, temperature decrement η
 2: initialise: i ← 1, w ← w_i = s_i + m_i, p_cur(D|T) ← p(D|T)
 3: while γ > γ_target do
 4:    Remove the leaf node D_i that has the word w_i = s_i + m_i
 5:    Draw a split point for the word such that w_i = s_i + m_i
 6:    Draw a sibling node D_j
 7:    D_m ← D_i ∪ D_j
 8:    Update p_next(D|T)
 9:    if p_next(D|T) >= p_cur(D|T) then
10:       Accept the new tree structure
11:       p_cur(D|T) ← p_next(D|T)
12:    else
13:       random ∼ Normal(0, 1)
14:       if random < (p_next(D|T) / p_cur(D|T))^(1/γ) then
15:          Accept the new tree structure
16:          p_cur(D|T) ← p_next(D|T)
17:       else
18:          Reject the new tree structure
19:          Re-insert the node D_i at its previous position with the previous split point
20:       end if
21:    end if
22:    w ← w_{i+1} = s_{i+1} + m_{i+1}
23:    γ ← γ − η
24: end while
25: output: A tree structure where each node corresponds to a paradigm.

We use a simulated annealing schedule to update P_Acc:

    P_Acc = (p_next(D|T) / p_cur(D|T))^(1/γ)    (10)

where γ denotes the current temperature, p_next(D|T) denotes the marginal likelihood of the data under the new tree structure, and p_cur(D|T) denotes the marginal likelihood of the data under the latest accepted tree structure. If p_next(D|T) > p_cur(D|T), then the update is accepted (see line 9, Algorithm 2); otherwise, the tree structure is still accepted with a probability of P_Acc (see line 14, Algorithm 2). In our experiments (see section 5) we set the initial temperature γ to 2. The system temperature is reduced in each iteration of the Metropolis-Hastings algorithm:

    γ ← γ − η    (11)

Most tree structures are accepted in the earlier stages of the algorithm; however, as the temperature decreases, only tree structures that lead to a considerable improvement in the marginal probability p(D|T) are accepted.
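The accept/reject step of Algorithm 2 together with the annealed acceptance probability of Equation 10 can be sketched as below. The helper callables (propose_move, undo_move, likelihood) are stand-ins for the tree operations described in the text, and the uniform draw in accept() is the usual Metropolis-Hastings choice, whereas Algorithm 2 as printed draws the comparison value from Normal(0, 1); everything else here is a schematic re-statement, not the authors' implementation.

```python
import random

def accept(p_cur, p_next, gamma):
    """Equation 10: accept with probability (p_next / p_cur) ** (1 / gamma);
    proposals that do not decrease the likelihood are always accepted."""
    if p_next >= p_cur:
        return True
    return random.random() < (p_next / p_cur) ** (1.0 / gamma)

def run_sampler(tree, propose_move, undo_move, likelihood,
                gamma=2.0, gamma_target=0.01, eta=0.0001):
    """Schematic outer loop of Algorithm 2: propose a new position and split
    point for one leaf, rescore the tree, accept or undo, then cool down."""
    p_cur = likelihood(tree)
    while gamma > gamma_target:
        move = propose_move(tree)      # remove a leaf, redraw split + sibling
        p_next = likelihood(tree)
        if accept(p_cur, p_next, gamma):
            p_cur = p_next
        else:
            undo_move(tree, move)      # re-insert the leaf at its old position
        gamma -= eta                   # Equation 11: reduce the temperature
    return tree
```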
An illustration of sampling a new tree structure is given in Figures 5 and 6. Figure 5 shows that D_0 will be removed from the tree in order to sample a new position in the tree, along with a new split point of the word. Once the leaf node is removed from the tree, the parent node is also removed, as the parent node D_5 would consist of only one child. Figure 6 shows that D_8 is sampled to be the sibling node of D_0. Subsequently, the two nodes are merged within a new cluster that introduces a new node D_9.

Figure 5: D_0 will be removed from the tree. [Tree with root D_6, internal nodes D_7, D_5, D_8 and leaves D_0, D_1, D_2, D_3, D_4.]

Figure 6: D_8 is sampled to be the sibling of D_0. [Tree with root D_6, internal nodes D_7, D_9, D_8 and leaves D_1, D_2, D_3, D_4, D_0.]

4.3 Morphological Segmentation

Once the optimal tree structure is inferred, along with the morphological segmentation of the words, any novel word can be analysed. For the segmentation of novel words, the root node is used, as it contains all the stems and suffixes already extracted from the training data. Morphological segmentation is performed in two ways: segmentation at a single point and segmentation at multiple points.

4.3.1 Single Split Point

In order to find a single split point for the morphological segmentation of a word, the split point yielding the maximum probability given the inferred stems and suffixes is chosen as the final analysis of the word:

    arg max_j p(w_i = s_j + m_j | D_root, α_m, P_m, α_s, P_s)    (12)

where D_root refers to the root of the entire tree. Here, the probability of a segmentation of a given word given D_root is calculated as follows:

    p(w_i = s_j + m_j | D_root, α_m, P_m, α_s, P_s) = p(s_j | S_root, α_s, P_s) p(m_j | M_root, α_m, P_m)    (13)

where S_root denotes all the stems in D_root and M_root denotes all the suffixes in D_root. Here p(s_j | S_root, α_s, P_s) is calculated as follows:

    p(s_i | S_root, α_s, P_s) = n_{s_i}^{S_root} / (L + α_s)    if s_i ∈ S_root
                              = α_s P_s(s_i) / (L + α_s)        otherwise    (14)

Similarly, p(m_j | M_root, α_m, P_m) is calculated as:

    p(m_i | M_root, α_m, P_m) = n_{m_i}^{M_root} / (N + α_m)    if m_i ∈ M_root
                              = α_m P_m(m_i) / (N + α_m)        otherwise    (15)
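A small sketch of the single-split-point segmentation of Section 4.3.1: every split of a word is scored as the product of the root-node stem and suffix probabilities (Equations 12-15), reusing the crp_prob and base_prob helpers from the earlier sketches; the function name and argument layout are ours, not the authors'.

```python
def segment_single(word, stem_counts, suffix_counts, L, N,
                   alpha_s, alpha_m, base_prob):
    """Return the (stem, suffix) split that maximises Equation 12,
    p(w = s + m | D_root) = p(s | S_root) * p(m | M_root),
    where the two factors follow Equations 14 and 15."""
    best_split, best_score = None, -1.0
    for i in range(1, len(word) + 1):       # i == len(word) gives an empty suffix
        stem, suffix = word[:i], word[i:]
        score = (crp_prob(stem, stem_counts, L, alpha_s, base_prob)
                 * crp_prob(suffix, suffix_counts, N, alpha_m, base_prob))
        if score > best_score:
            best_split, best_score = (stem, suffix), score
    return best_split
```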


 
4.3.2 Multiple Split Points

In order to discover words with multiple split points, we propose a hierarchical segmentation in which each segment is split further. The rules for generating multiple split points are given by the following context-free grammar:

    w  → s1 m1 | s2 m2    (16)
    s1 → s m | s s        (17)
    s2 → s                (18)
    m1 → m m              (19)
    m2 → s m | m m        (20)

Here, s is a pre-terminal node that generates all the stems from the root node and, similarly, m is a pre-terminal node that generates all the suffixes from the root node. First, using Equation 16, the word (e.g. housekeeper) is split into s1 m1 (e.g. housekeep+er) or s2 m2 (house+keeper). The first segment is regarded as a stem, and the second segment is either a stem or a suffix, considering the probability of having a compound word. Equation 12 is used to decide whether the second segment is a stem or a suffix. At the second segmentation level, each segment is split once more. If the first production rule is followed at the first segmentation level, the first segment s1 can be analysed as s m (e.g. housekeep+∅) or s s (e.g. house+keep) (Equation 17). The decision about which production rule to apply is made using:

    s1 → s s    if p(s | S, α_s, P_s) > p(m | M, α_m, P_m)
    s1 → s m    otherwise    (21)

where S and M denote all the stems and suffixes in the root node.

Figure 7: An example that depicts how the word housekeeper can be analysed further to find more split points.

Following the same production rule, the second segment m1 can only be analysed as m m (er+∅). We postulate that words cannot have more than two stems and that suffixes always follow stems. We do not allow any prefixes, circumfixes, or infixes. Therefore, the first production rule can output two different analyses: s m m m and s s m m (e.g. housekeep+er and house+keep+er).

On the other hand, if the word is analysed as s2 m2 (e.g. house+keeper), then s2 cannot be analysed further (e.g. house). The second segment m2 can be analysed further, either as s m (stem+suffix) (e.g. keep+er, keeper+∅) or as m m (suffix+suffix). The decision about which production rule to apply is made as follows:

    m2 → s m    if p(s | S, α_s, P_s) > p(m | M, α_m, P_m)
    m2 → m m    otherwise    (22)

Thus, the second production rule yields two different analyses: s s m and s m m (e.g. house+keep+er or house+keeper).

5 Experiments & Results

Two sets of experiments were performed for the evaluation of the model. In the first set of experiments, each word is split at a single point, giving a single stem and a single suffix. In the second set of experiments, potentially multiple split points are generated by splitting each stem and suffix once more, if it is possible to do so.

Figure 8: Marginal likelihood convergence for datasets of size 16K and 22K words.

Morpho Challenge (Kurimo et al., 2011b) provides a well established evaluation framework that additionally allows comparing our model in a range of languages. In both sets of experiments, the Morpho Challenge 2010 dataset is used (Kurimo et al., 2011b). Experiments are performed for English, where the dataset consists of 878,034 words. Although the dataset provides word frequencies, we have not used any frequency information. However, for training our model, we only chose words with frequency greater than 200.

In our experiments, we used dataset sizes of 10K, 16K and 22K words. However, for the final evaluation, we trained our models on 22K words. We were unable to complete the experiments with larger training datasets due to memory limitations; we plan to report this in future work. Once the tree is learned by the inference algorithm, the final tree is used for the segmentation of the entire dataset. Several experiments are performed for each setting, where the setting varies with the tree size and the model parameters. The model parameters are the concentration parameters α = {α_s, α_m} of the Dirichlet processes. The concentration parameter values used in the experiments are 0.1, 0.2, 0.02, 0.001 and 0.002.

In all experiments, the initial temperature of the system is set to γ = 2 and it is reduced to the target temperature γ = 0.01 with decrements of η = 0.0001. Figure 8 shows how the log likelihoods of trees of size 16K and 22K converge over time (where the time axis refers to sampling iterations).

Since different training sets lead to different tree structures, each experiment is repeated three times keeping the experiment setting the same.
Data Size    P(%)     R(%)     F(%)     α_s, α_m
10K          81.48    33.03    47.01    0.1, 0.1
16K          86.48    35.13    50.02    0.002, 0.002
22K          89.04    36.01    51.28    0.002, 0.002

Table 1: Highest evaluation scores of single split point experiments obtained from the trees with 10K, 16K, and 22K words.

Data Size    P(%)     R(%)     F(%)     α_s, α_m
10K          62.45    57.62    59.98    0.1, 0.1
16K          67.80    57.72    62.36    0.002, 0.002
22K          68.71    62.56    62.56    0.001, 0.001

Table 2: Evaluation scores of multiple split point experiments obtained from the trees with 10K, 16K, and 22K words.

5.1 Experiments with Single Split Points

In the first set of experiments, words are split into a single stem and suffix. During segmentation, Equation 12 is used to determine the split position of each word. Evaluation scores are given in Table 1. The highest F-measure obtained is 51.28%, with the dataset of 22K words. The scores are noticeably higher with the largest training set.

5.2 Experiments with Multiple Split Points

The evaluation scores of the experiments with multiple split points are given in Table 2. The highest F-measure obtained is 62.56%, with the dataset of 22K words. As for single split points, the scores are noticeably higher with the largest training set. For both single and multiple segmentation, the same inferred tree has been used.

5.3 Comparison with Other Systems

For all our evaluation experiments using Morpho Challenge 2010 (English and Turkish) and Morpho Challenge 2009 (English), we used 22K words for training. For each evaluation, we randomly chose 22K words for training and ran our MCMC inference procedure to learn our model. We generated 3 different models by choosing 3 different randomly generated training sets, each consisting of 22K words. The results are the best results over these 3 models. We report the best results out of the 3 models due to the small (22K word) datasets used; use of larger datasets would have resulted in less variation and better results.

System                          P(%)     R(%)     F(%)
Allomorf [1]                    68.98    56.82    62.31
Morf. Base. [2]                 74.93    49.81    59.84
PM-Union [3]                    55.68    62.33    58.82
Lignos [4]                      83.49    45.00    58.48
Prob. Clustering (multiple)     57.08    57.58    57.33
PM-mimic [3]                    53.13    59.01    55.91
MorphoNet [5]                   65.08    47.82    55.13
Rali-cof [6]                    68.32    46.45    55.30
CanMan [7]                      58.52    44.82    50.76

Table 3: Comparison with other unsupervised systems that participated in Morpho Challenge 2009 for English. [1] Virpioja et al. (2009), [2] Creutz and Lagus (2002), [3] Monson et al. (2009), [4] Lignos et al. (2009), [5] Bernhard (2009), [6] Lavallee and Langlais (2009), [7] Can and Manandhar (2009).

We compare our system with the other participant systems in Morpho Challenge 2010. The results are given in Table 6 (Virpioja et al., 2011). Since the model is evaluated using the official (hidden) Morpho Challenge 2010 evaluation dataset, for which we submitted our system to the organisers for evaluation, the scores differ from the ones presented in Table 1 and Table 2.

We also report experiments with the Morpho Challenge 2009 English dataset, which consists of 384,904 words. Our results and the results of other participant systems in Morpho Challenge 2009 are given in Table 3 (Kurimo et al., 2009). It should be noted that we only present the top systems that participated in Morpho Challenge 2009. If all the systems are considered, our system comes 5th out of 16 systems.

The problem of morphologically rich languages is not our priority within this research. Nevertheless, we provide evaluation scores on Turkish. The Turkish dataset consists of 617,298 words. We chose words with frequency greater than 50 for Turkish, since the Turkish dataset is not large enough. The results for Turkish are given in Table 4. Our system comes 3rd out of 7 systems.

6 Discussion

The model can easily capture common suffixes such as -less, -s, -ed, -ment, etc. Some sample tree nodes obtained from various trees are given in Table 5.
System                          P(%)     R(%)     F(%)
Morf. CatMAP                    79.38    31.88    45.49
Aggressive Comp.                55.51    34.36    42.45
Prob. Clustering (multiple)     72.36    25.81    38.04
Iterative Comp.                 68.69    21.44    32.68
Nicolas                         79.02    19.78    31.64
Morf. Base.                     89.68    17.78    29.67
Base Inference                  72.81    16.11    26.38

Table 4: Comparison with other unsupervised systems that participated in Morpho Challenge 2010 for Turkish.

regard+less, base+less, shame+less, bound+less, harm+less, regard+ed, relent+less
solve+d, high+-priced, lower+s, lower+-level, high+-level, lower+-income, histor+ians
pre+mise, pre+face, pre+sumed, pre+, pre+gnant
base+ment, ail+ment, over+looked, predica+ment, deploy+ment, compart+ment, embodi+ment
anti+-fraud, anti+-war, anti+-tank, anti+-nuclear, anti+-terrorism, switzer+, anti+gua, switzer+land
sharp+ened, strength+s, tight+ened, strength+ened, black+ened
inspir+e, inspir+ing, inspir+ed, inspir+es, earn+ing, ponder+ing
downgrade+s, crash+ed, crash+ing, lack+ing, blind+ing, blind+, crash+, compris+ing, compris+es, stifl+ing, compris+ed, lack+s, assist+ing, blind+ed, blind+er

Table 5: Sample tree nodes obtained from various trees.

System                          P(%)     R(%)     F(%)
Base Inference [1]              80.77    53.76    64.55
Iterative Comp. [1]             80.27    52.76    63.67
Aggressive Comp. [1]            71.45    52.31    60.40
Nicolas [2]                     67.83    53.43    59.78
Prob. Clustering (multiple)     57.08    57.58    57.33
Morf. Baseline [3]              81.39    41.70    55.14
Prob. Clustering (single)       70.76    36.51    48.17
Morf. CatMAP [4]                86.84    30.03    44.63

Table 6: Comparison of our model with other unsupervised systems that participated in Morpho Challenge 2010 for English. [1] Lignos (2010), [2] Nicolas et al. (2010), [3] Creutz and Lagus (2002), [4] Creutz and Lagus (2005a).

As seen from Table 5, morphologically similar words are grouped together, where morphological similarity refers to at least one common morpheme between words. For example, the words high-priced and lower-level are grouped in the same node through the word high-level, which shares the same stem with high-priced and the same ending with lower-level.

As seen from the sample nodes, prefixes can also be identified, for example anti+fraud, anti+war, anti+tank, anti+nuclear. This illustrates the flexibility of the model in capturing similarities through either stems, suffixes or prefixes. However, as mentioned above, the model does not discriminate between different types of morphological forms during training. As the prefix pre- appears at the beginning of words, it is identified as a stem. However, identifying pre- as a stem does not change the morphological analysis of the word.

Sometimes similarities may not yield a valid analysis of words. For example, the prefix pre- leads the words pre+mise, pre+sumed, pre+gnant to be analysed wrongly, whereas pre- is a valid prefix for the word pre+face. Another nice feature of the model is that compounds are easily captured through common stems: e.g. doubt+fire, bon+fire, gun+fire, clear+cut.

7 Conclusion & Future Work

In this paper, we present a novel probabilistic model for unsupervised morphology learning. The model adopts a hierarchical structure in which words are organised in a tree so that morphologically similar words are located close to each other.

In hierarchical clustering, tree-cutting would be a very useful thing to do, but it is not addressed in the current paper. We used just the root node as a morpheme lexicon to apply segmentation. Clearly, adding tree-cutting would improve the accuracy of the segmentation and would help us identify paradigms with higher accuracy. However, the segmentation accuracy obtained without using tree-cutting provides a very useful indicator of whether this approach is promising, and the experimental results show that this is indeed the case.

In the current model, we did not use any syntactic information, only words. POS tags could be utilised to group words which are both morphologically and syntactically similar.
References

Delphine Bernhard. 2009. MorphoNet: Exploring the use of community structure for unsupervised morpheme analysis. In Working Notes for the CLEF 2009 Workshop, September.
Burcu Can and Suresh Manandhar. 2009. Clustering morphological paradigms using syntactic categories. In Working Notes for the CLEF 2009 Workshop, September.
Erwin Chan. 2006. Learning probabilistic paradigms for morphology in a latent class model. In Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology, SIGPHON '06, pages 69-78, Stroudsburg, PA, USA. Association for Computational Linguistics.
Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning - Volume 6, MPL '02, pages 21-30, Stroudsburg, PA, USA. Association for Computational Linguistics.
Mathias Creutz and Krista Lagus. 2005a. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR 2005), pages 106-113.
Mathias Creutz and Krista Lagus. 2005b. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Technical Report A81.
Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 616-627, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.
John Goldsmith. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198.
Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18, page 18.
W. K. Hastings. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97-109.
Mikko Kurimo, Sami Virpioja, Ville T. Turunen, Graeme W. Blackwood, and William Byrne. 2009. Overview and results of Morpho Challenge 2009. In Proceedings of the 10th Cross-Language Evaluation Forum Conference on Multilingual Information Access Evaluation: Text Retrieval Experiments, CLEF'09, pages 578-597, Berlin, Heidelberg. Springer-Verlag.
Mikko Kurimo, Krista Lagus, Sami Virpioja, and Ville Turunen. 2011a. Morpho Challenge 2009. http://research.ics.tkk.fi/events/morphochallenge2009/, June.
Mikko Kurimo, Krista Lagus, Sami Virpioja, and Ville Turunen. 2011b. Morpho Challenge 2010. http://research.ics.tkk.fi/events/morphochallenge2010/, June.
Jean Francois Lavallee and Philippe Langlais. 2009. Morphological acquisition by formal analogy. In Working Notes for the CLEF 2009 Workshop, September.
Constantine Lignos, Erwin Chan, Mitchell P. Marcus, and Charles Yang. 2009. A rule-based unsupervised morphology learning framework. In Working Notes for the CLEF 2009 Workshop, September.
Constantine Lignos. 2010. Learning from unseen data. In Mikko Kurimo, Sami Virpioja, Ville Turunen, and Krista Lagus, editors, Proceedings of the Morpho Challenge 2010 Workshop, pages 35-38, Aalto University, Espoo, Finland.
Christian Monson, Kristy Hollingshead, and Brian Roark. 2009. Probabilistic ParaMor. In Proceedings of the 10th Cross-Language Evaluation Forum Conference on Multilingual Information Access Evaluation: Text Retrieval Experiments, CLEF'09, September.
Lionel Nicolas, Jacques Farre, and Miguel A. Molinero. 2010. Unsupervised learning of concatenative morphology based on frequency-related form occurrence. In Mikko Kurimo, Sami Virpioja, Ville Turunen, and Krista Lagus, editors, Proceedings of the Morpho Challenge 2010 Workshop, pages 39-43, Aalto University, Espoo, Finland.
Matthew G. Snover, Gaja E. Jarosz, and Michael R. Brent. 2002. Unsupervised learning of morphology using a novel directed search algorithm: Taking the first step. In Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, pages 11-20, Morristown, NJ, USA. ACL.
Sami Virpioja, Oskar Kohonen, and Krista Lagus. 2009. Unsupervised morpheme discovery with Allomorfessor. In Working Notes for the CLEF 2009 Workshop, September.
Sami Virpioja, Ville T. Turunen, Sebastian Spiegler, Oskar Kohonen, and Mikko Kurimo. 2011. Empirical comparison of evaluation methods for unsupervised learning of morphology. In Traitement Automatique des Langues.
Modeling Inflection and Word-Formation in SMT

Alexander Fraser Marion Weller Aoife Cahill Fabienne Cap



Institut für Maschinelle Sprachverarbeitung          Educational Testing Service
Universität Stuttgart                                Princeton, NJ 08541
D-70174 Stuttgart, Germany                           USA
{fraser,wellermn,cap}@ims.uni-stuttgart.de acahill@ets.org

Abstract

The current state-of-the-art in statistical machine translation (SMT) suffers from issues of sparsity and inadequate modeling power when translating into morphologically rich languages. We model both inflection and word-formation for the task of translating into German. We translate from English words to an underspecified German representation and then use linear-chain CRFs to predict the fully specified German representation. We show that improved modeling of inflection and word-formation leads to improved SMT.

1 Introduction

Phrase-based statistical machine translation (SMT) suffers from problems of data sparsity with respect to inflection and word-formation which are particularly strong when translating to a morphologically rich target language, such as German. We address the problem of inflection by first translating to a stem-based representation, and then using a second process to inflect these stems. We study several models for doing this, including: strongly lexicalized models, unlexicalized models using linguistic features, and models combining the strengths of both of these approaches. We address the problem of word-formation for compounds in German by translating from English into German word parts, and then determining whether to merge these parts to form compounds.

We make the following new contributions: (i) We introduce the first SMT system combining inflection prediction with synthesis of portmanteaus and compounds. (ii) For inflection, we compare the mostly unlexicalized prediction of linguistic features (with a subsequent surface form generation step) versus the direct prediction of surface forms, and show that both approaches have complementary strengths. (iii) We combine the advantages of the prediction of linguistic features with the prediction of surface forms. We implement this in a CRF framework which improves on a standard phrase-based SMT baseline. (iv) We develop separate (but related) procedures for inflection prediction and dealing with word-formation (compounds and portmanteaus), in contrast with most previous work which usually either approaches both problems as inflectional problems, or approaches both problems as word-formation problems.

We evaluate on the end-to-end SMT task of translating from English to German of the 2009 ACL workshop on SMT. We achieve BLEU score increases on both the test set and the blind test set.

2 Overview of the translation process for inflection prediction

The work we describe is focused on generalizing phrase-based statistical machine translation to better model German NPs and PPs. We particularly want to ensure that we can generate novel German NPs, where what we mean by novel is that the (inflected) realization is not present in the parallel German training data used to build the SMT system, and hence cannot be produced by our baseline (a standard phrase-based SMT system). We first present our system for dealing with the difficult problem of inflection in German, including the inflection-dependent phenomenon of portmanteaus. Later, after performing an extensive analysis of this system, we will extend it

to model compounds, a highly productive phe- We then build a standard Moses system trans-
nomenon in German (see Section 8). lating from English to German stems. We obtain
The key linguistic knowledge sources that we a sequence of stems and POS2 from this system,
use are morphological analysis and generation of and then predict the correct inflection using a se-
German based on SMOR, a morphological ana- quence model. Finally we generate surface forms.
lyzer/generator of German (Schmid et al., 2004)
and the BitPar parser, which is a state-of-the-art 2.3 German Stem Markup
parser of German (Schmid, 2004). The translation process consists of two major
steps. The first step is translation of English
2.1 Issues of inflection prediction words to German stems, which are enriched with
In order to ensure coherent German NPs, we some inflectional markup. The second step is
model linguistic features of each word in an NP. the full inflection of these stems (plus markup)
We model case, gender, and number agreement to obtain the final sequence of inflected words.
and whether or not the word is in the scope of The purpose of the additional German inflectional
a determiner (such as a definite article), which markup is to strongly improve prediction of in-
we label in-weak-context (this linguistic feature flection in the second step through the addition of
is necessary to determine the type of inflection of markup to the stems in the first step.
adjectives and other words: strong, weak, mixed). In general, all features to be predicted are
This is a diverse group of features. The number stripped from the stemmed representation because
of a German noun can often be determined given they are subject to agreement restrictions of a
only the English source word. The gender of a noun or prepositional phrase (such as case of
German noun is innate and often difficult to deter- nouns or all features of adjectives). However, we
mine given only the English source word. Case need to keep all morphological features that are
is a function of the slot in the subcategorization not dependent on, and thus not predictable from,
frame of the verb (or preposition). There is agree- the (German) context. They will serve as known
ment in all of these features in an NP. For instance input for the inflection prediction model. We now
the number of an article or adjective is determined describe this markup in detail.
by the head noun, while the type of inflection of an Nouns are marked with gender and number: we
adjective is determined by the choice of article. consider the gender of a noun as part of its stem,
We can have a large number of surface forms. whereas number is a feature which we can obtain
For instance, English blue can be translated as from English nouns.
German blau, blaue, blauer, blaues, blauen. We Personal pronouns have number and gender an-
predict which form is correct given the context. notation, and are additionally marked with nom-
Our system can generate forms not seen in the inative and not-nominative, because English pro-
training data. We follow a two-step process: in nouns are marked for this (except for you).
step-1 we translate to blau (the stem), in step-2 we Prepositions are marked with the case their ob-
predict features and generate the inflected form.1 ject takes: this moves some of the difficulty in pre-
dicting case from the inflection prediction step to
2.2 Procedure the stem translation step. Since the choice of case
We begin building an SMT system by parsing the in a PP is often determined by the PPs meaning
German training data with BitPar. We then extract (and there are often different meanings possible
morphological features from the parse. Next, we given different case choices), it seems reasonable
lookup the surface forms in the SMOR morpholog- to make this decision during stem translation.
ical analyzer. We use the morphological features Verbs are represented using their inflected surface
in the parse to disambiguate the set of possible form. Having access to inflected verb forms has a
SMOR analyses. Finally, we output the stems
positive influence on case prediction in the second
of the German text, with the addition of markup 2
We use an additional target factor to obtain the coarse
taken from the parse (discussed in Section 2.3). POS for each stem, applying a 7-gram POS model. Koehn
and Hoang (2007) showed that the use of a POS factor only
1
E.g., case=nominative, gender=masculine, num- results in negligible BLEU improvements, but we need ac-
ber=singular, in-weak-context=true; inflected: blaue. cess to the POS in our inflection prediction models.

must be inflected before making a decision about whether to merge a preposition and the article into a portmanteau. See Table 1 for examples.

input      decoder output               inflected    merged
in         in<APPR><Dat>                in           im
           die<+ART><Def>               dem
contrast   Gegensatz<+NN><Masc><Sg>     Gegensatz    Gegensatz
to         zu<APPR><Dat>                zu           zur
the        die<+ART><Def>               der
animated   lebhaft<+ADJ><Pos>           lebhaften    lebhaften
debate     Debatte<+NN><Fem><Sg>        Debatte      Debatte

Table 1: Re-merging of prepositions and articles after inflection to form portmanteaus; in dem means in the.

4 Models for Inflection Prediction

We present 5 procedures for inflectional prediction using supervised sequence models. The first two procedures use simple N-gram models over fully inflected surface forms.
step through subject-verb agreement. 1. Surface with no features is presented with an
Articles are reduced to their stems (the stem itself underspecified input (a sequence of stems), and
makes clear the definite or indefinite distinction, returns the most likely inflected sequence.
but lemmatizing involves removing markings of
2. Surface with case, number, gender is a hybrid
case, gender and number features).
system giving the surface model access to linguis-
Other words are also represented by their stems
tic features. In this system prepositions have addi-
(except for words not covered by SMOR, where
tionally been labeled with the case they mark (in
surface forms are used instead).
both the underspecified input and the fully spec-
3 Portmanteaus ified output the sequence model is built on) and
gender and number markup is also available.
Portmanteaus are a word-formation phenomenon
The rest of the procedures predict morpholog-
dependent on inflection. As we have discussed,
ical features (which are input to a morphological
standard phrase-based systems have problems
generator) rather than surface words. We have de-
with picking a definite article with the correct
veloped a two-stage process for predicting fully
case, gender and number (typically due to spar-
inflected surface forms. The first stage takes a
sity in the language model, e.g., a noun which
stem and predicts morphological features for that
was never before seen in dative case will often
stem, based on the surrounding context. The aim
not receive the correct article). In German, port-
of the first stage is to take a stem and predict
manteaus increase this sparsity further, as they
four morphological features: case, gender, num-
are compounds of prepositions and articles which
ber and type of inflection. We experiment with
must agree with a noun.
a number of models for doing this. The sec-
We adopt the linguistically strict definition of
ond stage takes the stems marked with morpho-
the term portmanteau: the merging of two func-
logical features (predicted in the first stage) and
tion words.3 We treat this phenomena by split-
uses a morphological generator to generate the
ting the component parts during training and re-
full surface form. For the second stage, a modified
merging during generation. Specifically for
version of SMOR (Schmid et al., 2004) is used,
German, this requires splitting the words which
which, given a stem annotated with morphologi-
have German POS tag APPRART into an APPR
cal features, generates exactly one surface form.
(preposition) and an ART (article). Merging is re-
stricted, the article must be definite, singular4 and We now introduce our first linguistic feature
the preposition can only take accusative or dative prediction systems, which we call joint sequence
case. Some prepositions allow for merging with models (JSMs). These are standard language
an article only for certain noun genders, for exam- models, where the word tokens are not repre-
ple the preposition inDative is only merged with sented as surface forms, but instead using POS
the following article if the following noun is of and features. In testing, we supply the input as a
masculine or neuter gender. The definite article sequence in underspecified form, where some of
the features are specified in the stem markup (for
3
Some examples are: zum (to the) = zu (to) + dem (the) instance, POS=Noun, gender=masculine, num-
[German], du (from the) = de (from) + le (the) [French] or al
(to the) = a (to) + el (the) [Spanish].
ber=plural), and then use Viterbi search to find the
4
This is the reason for which the preposition + article in most probable fully specified form (for instance,
Table 2 remain unmerged. POS=Noun, gender=masculine, number=plural,
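To make the splitting and re-merging of Section 3 concrete, here is a minimal Python sketch that re-assembles preposition+article portmanteaus after inflection, in the spirit of footnote 3 and of Table 1; the merge table lists only a few common German contractions and ignores the gender and number restrictions discussed above, so it is illustrative rather than the authors' actual rule set.

```python
# A few common German preposition + definite article contractions.
PORTMANTEAUS = {
    ("in", "dem"): "im",
    ("zu", "dem"): "zum",
    ("zu", "der"): "zur",
    ("an", "dem"): "am",
    ("von", "dem"): "vom",
}

def remerge(tokens):
    """Merge adjacent preposition + inflected definite article pairs,
    e.g. ["in", "dem", "Gegensatz"] -> ["im", "Gegensatz"] as in Table 1."""
    merged, i = [], 0
    while i < len(tokens):
        pair = (tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if pair in PORTMANTEAUS:
            merged.append(PORTMANTEAUS[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```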

decoder output             prediction input            prediction output                         inflected form    gloss
haben<VAFIN>               haben-V                     haben-V                                   haben             have
Zugang<+NN><Masc><Sg>      NN-Sg-Masc                  NN-Masc.Acc.Sg.in-weak-context=false      Zugang            access
zu<APPR><Dat>              APPR-zu-Dat                 APPR-zu-Dat                               zu                to
die<+ART><Def>             ART-in-weak-context=true    ART-Neut.Dat.Pl.in-weak-context=true      den               the
betreffend<+ADJ><Pos>      ADJA                        ADJA-Neut.Dat.Pl.in-weak-context=true     betreffenden      respective
Land<+NN><Neut><Pl>        NN-Pl-Neut                  NN-Neut.Dat.Pl.in-weak-context=true       Ländern           countries

Table 2: Overview: inflection prediction steps using a single joint sequence model. All words except verbs and prepositions are replaced by their POS tags in the input. Verbs are inflected in the input (haben, meaning have as in they have, in the example). Prepositions are lexicalized (zu in the example) and indicate which case value they mark (Dat, i.e., Dative, in the example).
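The two-stage procedure illustrated in Table 2 (predict a bundle of morphological features for each stem, then hand stem plus features to a morphological generator) can be sketched as follows. predict_features and generate_surface are placeholders for the sequence model and for SMOR respectively; the names and the simplified data format are ours and do not reflect SMOR's actual interface.

```python
def inflect_sentence(stems, predict_features, generate_surface):
    """Two-stage inflection as in Table 2.
    stems:            stem+markup tokens, e.g. "Zugang<+NN><Masc><Sg>"
    predict_features: maps the whole stem sequence to one feature bundle
                      per token (case, gender, number, type of inflection)
    generate_surface: maps (stem, features) to exactly one inflected
                      surface form, the role played here by SMOR"""
    features = predict_features(stems)       # stage 1: sequence prediction
    return [generate_surface(stem, feats)    # stage 2: surface generation
            for stem, feats in zip(stems, features)]
```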

case=nominative, in-weak-context=true).5 sonable linguistic assumption to make given the


3. Single joint sequence model on features. We additional German markup that we use. By split-
illustrate the different stages of the inflection pre- ting the inflection prediction problem into 4 com-
diction when using a joint sequence model. The ponent parts, we end up with 4 simpler models
stemmed input sequence (cf. Section 2.3) contains which are less sensitive to data sparseness.
several features that will be part of the input to Each linguistic feature is modeled indepen-
the inflection prediction. With the exception of dently (by a JSM) and has a different input rep-
verbs and prepositions, the representation for fea- resentation based on the previously described
ture prediction is based on POS-tags. markup. The input consists of a sequence of
As gender and number are given by the heads coarse POS tags, and for those stems that are
of noun phrases and prepositional phrases, and marked up with the relevant feature, this feature
the expected type of inflection is set by articles, value. Finally, we combine the predicted fea-
the model has sufficient information to compute tures together to produce the same final output as
values for these features and there is no need to the single joint sequence model, and then generate
know the actual words. In contrast, the prediction each surface form using SMOR.
of case is more difficult as it largely depends on 5. Using four CRFs (one for each linguistic fea-
the content of the sentence (e.g. which phrase is ture). The sequence models already presented are
object, which phrase is subject). Assuming that limited to the n-gram feature space, and those that
verbs and prepositions indicate subcategorization predict linguistic features are not strongly lexi-
frames, the model is provided crucial information calized. Toutanova et al. (2008) uses an MEMM
for the prediction of case by keeping verbs (recall which allows the integration of a wide variety of
that verbs are produced by the stem translation feature functions. We also wanted to experiment
system in their inflected form) and prepositions with additional feature functions, and so we train
(the prepositions also have case markup) instead 4 separate linear chain CRF6 models on our data
of replacing them with their tags. (one for each linguistic feature we want to pre-
After having predicted a single label with val- dict). We chose CRFs over MEMMs to avoid the
ues for all features, an inflected word form for the label bias problem (Lafferty et al., 2001).
stem and the features is generated. The prediction The CRF feature functions, for each German
steps are illustrated in Table 2. word wi , are in Table 3. The common feature
4. Using four joint sequence models (one for functions are used in all models, while each of the
each linguistic feature). Here the four linguistic 4 separate models (one for each linguistic feature)
feature values are predicted separately. The as- includes the context of only that linguistic feature.
sumption that the different linguistic features can We use L1 regularization to eliminate irrelevant
be predicted independently of one another is a rea- feature functions, the regularization parameter is
5
Joint sequence models are a particularly simple HMM.
optimized on held out data.
Unlike the HMMs used for POS-tagging, an HMM as used
6
here only has a single emission possibility for each state, We use the Wapiti Toolkit (Lavergne et al., 2010) on 4
with probability 1. The states in the HMM are the fully x 12-Core Opteron 6176 2.3 GHz with 256GB RAM to train
specified representation. The emissions of the HMM are the our CRF models. Training a single CRF model on our data
stems+markup (the underspecified representation). was not tractable, so we use one for each linguistic feature.

Common            lemma(w_{i-5}) ... lemma(w_{i+5}), tag(w_{i-7}) ... tag(w_{i+7})
Case              case(w_{i-5}) ... case(w_{i+5})
Gender            gender(w_{i-5}) ... gender(w_{i+5})
Number            number(w_{i-5}) ... number(w_{i+5})
in-weak-context   in-weak-context(w_{i-5}) ... in-weak-context(w_{i+5})

Table 3: Feature functions used in CRF models (feature functions are binary indicators of the pattern).

tion 2.3), and the second is inflection prediction as described previously in the paper. To derive the stem+markup representation we first parse the German training data and then produce the stemmed representation. We then build a sys-
man stems (the stem+markup representation), on
the same data (so the German side of the parallel
5 Experimental Setup data, and the German language modeling uses the
To evaluate our end-to-end system, we perform stem+markup representation). Likewise, MERT
the well-studied task of news translation, us- is performed using references which are in the
ing the Moses SMT package. We use the En- stem+markup representation.
glish/German data released for the 2009 ACL To train the inflection prediction systems, we
Workshop on Machine Translation shared task on use the monolingual data. The basic surface form
translation.7 There are 82,740 parallel sentences model is trained on lowercased surface forms,
from news-commentary09.de-en and 1,418,115 the hybrid surface form model with features is
parallel sentences from europarl-v4.de-en. The trained on lowercased surface forms annotated
monolingual data contains 9.8 M sentences.8 with markup. The linguistic feature prediction
To build the baseline, the data was tokenized systems are trained on the monolingual data pro-
using the Moses tokenizer and lowercased. We cessed as described previously (see Table 2).
use GIZA++ to generate alignments, by running Our JSMs are trained using the SRILM Toolkit.
5 iterations of Model 1, 5 iterations of the HMM We use the SRILM disambig tool for predicting
Model, and 4 iterations of Model 4. We sym- inflection, which takes a map that specifies the
metrize using the grow-diag-final-and heuris- set of fully specified representations that each un-
tic. Our Moses systems use default settings. The derspecified stem can map to. For surface form
LM uses the monolingual data and is trained as models, it specifies the mapping from stems to
a five-gram9 using the SRILM-Toolkit (Stolcke, lowercased surface forms (or surface forms with
6 Results for Inflection Prediction

We build two different kinds of translation system, the baseline and the stem translation system (where MERT is used to train the system to produce a stem+markup sequence which agrees with the stemmed reference of the dev set). In this section we present the end-to-end translation results for the different inflection prediction models defined in Section 4; see Table 4.

If we translate from English into a stemmed German representation and then apply a unigram stem-to-surface-form model to predict the surface form, we achieve a BLEU score of 9.97 (line 2). This is only presented for comparison.

7 http://www.statmt.org/wmt09/translation-task.html
8 However, we reduced the monolingual data (only) by retaining only one copy of each unique line, which resulted in 7.55 M sentences.
9 Add-1 smoothing for unigrams and Kneser-Ney smoothing for higher-order n-grams, pruning defaults.
10 This is a better case-sensitive score than the baselines on wmt-2009-b in experiments by the top performers Edinburgh and Karlsruhe at the shared task. We use Moses with default settings.
11 Note that we use a different set, the clean data set, to determine the choice of n-gram order; see Section 7. We use a 5-gram for surface forms and a 4-gram for JSMs, and the same smoothing (Kneser-Ney, add-1 for unigrams, default pruning).
The baseline10 is 14.16, line 1. We compare this with a 5-gram sequence model11 that predicts surface forms without access to morphological features, resulting in a BLEU score of 14.26. Introducing morphological features (case on prepositions, number and gender on nouns) increases the BLEU score to 14.58, which is in the same range as the single JSM system predicting all linguistic features at once.

  1  baseline                                               14.16
  2  unigram surface (no features)                           9.97
  3  surface (no features)                                  14.26
  4  surface (with case, number, gender features)           14.58
  5  1 JSM, morphological features                          14.53
  6  4 JSMs, morphological features                         14.29
  7  4 CRFs, morphological features, lexical information    14.72

Table 4: BLEU scores (detokenized, case sensitive) on the development test set wmt-2009-b.

This result shows that the mostly unlexicalized single JSM can produce results competitive with direct surface form prediction, despite not having access to a model of inflected forms, which is the desired final output. This strongly suggests that the prediction of morphological features can be used to achieve additional generalization over direct surface form prediction. When comparing the simple direct surface form prediction (line 3) with the hybrid system enriched with number, gender and case (line 4), it becomes evident that feature markup can also aid surface form prediction.

Since the single JSM has no access to lexical information, we used a language model to score different feature predictions: for each sentence of the development set, the 100 best feature predictions were inflected and scored with a language model. We then optimized weights for the two scores LM (language model on surface forms) and FP (feature prediction, the score assigned by the JSM). This method disprefers feature predictions with a top FP score if the inflected sentence obtains a bad LM score, and likewise disfavors low-ranked feature predictions with a high LM score. The prediction of case is the most difficult given no lexical information, thus scoring different prediction possibilities on inflected words is helpful. An example is when the case of a noun phrase leads to an inflected phrase which never occurs in the (inflected) language model (e.g., case=genitive vs. case=other). Applying this method to the single JSM leads to a negligible improvement (14.53 vs. 14.56). Using the n-best output of the stem translation system did not lead to any improvement.

The comparison between different feature prediction models is also illustrative. Performance decreases somewhat when using individual joint sequence models (one for each linguistic feature) compared to one single model (14.29, line 6). The framework using the individual CRFs for each linguistic feature performs best (14.72, line 7). The CRF framework combines the advantages of surface form prediction and linguistic feature prediction by using feature functions that effectively cover the feature function spaces used by both forms of prediction. The performance of the CRF models results in a statistically significant improvement12 (p < 0.05) over the baseline. We also tried CRFs with bilingual features (projected from English parses via the alignment output by Moses), but obtained only a small improvement of 0.03, probably because the required information is transferred in our stem markup (a poor improvement beyond monolingual features is also consistent with previous work; see Section 8.3). Details are omitted due to space.

We further validated our results by translating the blind test set from wmt-2009, which we have never looked at in any way. Here we also had a statistically significant difference between the baseline and the CRF-based prediction; the scores were 13.68 and 14.18.

7 Analysis of Inflection-based System

Stem Markup. The first step of translating from English to German stems (with the markup we previously discussed) is substantially easier than translating directly to inflected German (we see BLEU scores on stems+markup that are over 2.0 BLEU higher than the BLEU scores on inflected forms when running MERT). The addition of case to prepositions only lowered the BLEU score reached by MERT by about 0.2, but is very helpful for prediction of the case feature.

Inflection Prediction Task. Clean data task results13 are given in Table 5. The 4 CRFs outperform the 4 JSMs by more than 2%.

12 We used Kevin Gimpel's implementation of pairwise bootstrap resampling with 1000 samples.
13 26,061 of 55,057 tokens in our test set are ambiguous. We report % surface form matches for ambiguous tokens.
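Footnote 12 refers to pairwise bootstrap resampling over sentences; a generic sketch of that test (not Gimpel's implementation) is shown below, parameterized by a corpus-level scoring function such as BLEU. The toy unigram-precision metric is included only to make the sketch self-contained.

    import random

    def paired_bootstrap(score_fn, sys_a, sys_b, refs, n_samples=1000, seed=0):
        """Paired bootstrap resampling over sentences.
        score_fn(hyps, refs) -> corpus-level score (e.g., BLEU).
        Returns the fraction of resamples in which system A beats system B;
        one minus this fraction is (roughly) the p-value for A not being better."""
        rng = random.Random(seed)
        n = len(refs)
        wins_a = 0
        for _ in range(n_samples):
            idx = [rng.randrange(n) for _ in range(n)]
            sample_a = [sys_a[i] for i in idx]
            sample_b = [sys_b[i] for i in idx]
            sample_r = [refs[i] for i in idx]
            if score_fn(sample_a, sample_r) > score_fn(sample_b, sample_r):
                wins_a += 1
        return wins_a / n_samples

    def unigram_precision(hyps, refs):
        # Toy stand-in for a real corpus-level metric such as BLEU.
        match = total = 0
        for hyp, ref in zip(hyps, refs):
            ref_tokens = set(ref.split())
            hyp_tokens = hyp.split()
            match += sum(1 for t in hyp_tokens if t in ref_tokens)
            total += len(hyp_tokens)
        return match / max(total, 1)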
Model                                                    Accuracy
unigram surface (no features)                              55.98
surface (no features)                                      86.65
surface (with case, number, gender features)               91.24
1 JSM, morphological features                              92.45
4 JSMs, morphological features                             92.01
4 CRFs, morphological features, lexical information        94.29

Table 5: Comparing predicting surface forms directly with predicting morphological features.

training data         1 model    4 models
7.3 M sentences         92.41       91.88
1.5 M sentences         92.45       92.01
100000 sentences        90.20       90.64
1000 sentences          83.72       86.94

Table 6: Accuracy for different training data sizes of the single and the four separate joint sequence models.

As we mentioned in Section 4, there is a sparsity issue at small training data sizes for the single joint sequence model. This is shown in Table 6. At the largest training data sizes, modeling all 4 features together results in the best predictions of inflection. However, using 4 separate models is worth this minimal decrease in performance, since it facilitates experimentation with the CRF framework, for which the training of a single model is not currently tractable.

Overall, the inflection prediction works well for gender, number and type of inflection, which are features local to the NP that normally agree with the explicit markup output by the stem translation system (for example, the gender of a common noun, which is marked in the stem markup, is usually successfully propagated to the rest of the NP). Prediction of case does not always work well, and could perhaps be improved through hierarchical labeled-syntax stem translation.

Portmanteaus. An example of where the system is improved because of the new handling of portmanteaus can be seen in the dative phrase im internationalen Rampenlicht (in the international spotlight), which does not occur in the parallel data. The accusative phrase in das internationale Rampenlicht does occur; however, in this case there is no portmanteau, but a one-to-one mapping between in the and in das. For a given context, only one of accusative or dative case is valid, and a strongly disfluent sentence results from the incorrect choice. In our system, these two cases are handled in the same way (def-article international Rampenlicht). This allows us to generalize from the accusative example with no portmanteau and take advantage of longer phrase pairs, even when translating to something that will be inflected as dative and should be realized as a portmanteau. The baseline does not have this capability. It should be noted that the portmanteau merging method described in Section 3 remerges all occurrences of APPR and ART that can technically form a portmanteau. There are a few cases where merging, despite being grammatical, does not lead to a good result. Such exceptions require semantic interpretation and are difficult to capture with a fixed set of rules.
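To make the remerging step concrete, here is a minimal sketch of rejoining APPR+ART pairs using a small contraction table; the table and tag handling are illustrative and far less complete than the rules used in the actual system.

    # Illustrative table of German preposition+article contractions (not exhaustive).
    PORTMANTEAUS = {
        ("in", "dem"): "im",   ("in", "das"): "ins",
        ("an", "dem"): "am",   ("an", "das"): "ans",
        ("zu", "dem"): "zum",  ("zu", "der"): "zur",
        ("von", "dem"): "vom", ("bei", "dem"): "beim",
    }

    def remerge_portmanteaus(tokens, tags):
        """tokens/tags are parallel lists; APPR = preposition, ART = article.
        Adjacent APPR+ART pairs found in the table are merged into one token."""
        out_tokens, out_tags = [], []
        i = 0
        while i < len(tokens):
            if (i + 1 < len(tokens) and tags[i] == "APPR" and tags[i + 1] == "ART"
                    and (tokens[i].lower(), tokens[i + 1].lower()) in PORTMANTEAUS):
                out_tokens.append(PORTMANTEAUS[(tokens[i].lower(), tokens[i + 1].lower())])
                out_tags.append("APPRART")
                i += 2
            else:
                out_tokens.append(tokens[i])
                out_tags.append(tags[i])
                i += 1
        return out_tokens, out_tags

    # "in dem internationalen Rampenlicht" -> "im internationalen Rampenlicht"
    print(remerge_portmanteaus(
        ["in", "dem", "internationalen", "Rampenlicht"],
        ["APPR", "ART", "ADJA", "NN"]))

In the pipeline described above this merging runs after inflection prediction; as noted, a few merges that are grammatically possible are still semantically undesirable, which a fixed table like this cannot detect.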
8 Adding Compounds to the System

Compounds are highly productive in German and lead to data sparsity. We split the German compounds in the training data, so that our stem translation system can now work with the individual words in the compounds. After we have translated to a split/stemmed representation, we determine whether to merge words together to form a compound. Then we merge them to create stems in the same representation as before, and we perform inflection and portmanteau merging exactly as previously discussed.

8.1 Details of Splitting Process

We prepare the training data by splitting compounds in two steps, following the technique of Fritzinger and Fraser (2010). First, possible split points are extracted using SMOR, and second, the best split points are selected using the geometric mean of word part frequencies.

compound         word parts        gloss
Inflationsrate   Inflation Rate    inflation rate
auszubrechen     aus zu brechen    out to break (to break out)

Training data is then stemmed as described in Section 2.3. The formerly modifying words of the compound (in our example the words to the left of the rightmost word) do not have a stem markup assigned, except in two cases: i) they are nouns themselves or ii) they are particles separated from a verb. In these cases, former modifiers are represented identically to their individually occurring counterparts, which helps generalization.
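The second step of the splitting process above, choosing among candidate analyses by the geometric mean of word part frequencies, can be sketched as follows. In the actual system the candidate analyses come from SMOR; here they are passed in, and the frequencies are invented for illustration.

    from math import exp, log

    def best_split(candidates, freq, min_freq=1):
        """candidates: list of analyses, each a list of word parts, e.g.
        [["inflationsrate"], ["inflations", "rate"], ["inflation", "rate"]].
        freq: dict mapping word (part) -> corpus frequency.
        Returns the analysis whose parts have the highest geometric mean frequency."""
        def geo_mean(parts):
            f = [freq.get(p, 0) for p in parts]
            if min(f) < min_freq:          # an unseen part rules the analysis out
                return 0.0
            return exp(sum(log(x) for x in f) / len(f))
        return max(candidates, key=geo_mean)

    # Toy example with made-up frequencies.
    freq = {"inflationsrate": 2, "inflation": 150, "rate": 300, "inflations": 0}
    candidates = [["inflationsrate"], ["inflations", "rate"], ["inflation", "rate"]]
    print(best_split(candidates, freq))   # -> ["inflation", "rate"]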
8.2 Model for Compound Merging

After translation, compound parts have to be resynthesized into compounds before inflection. Two decisions have to be taken: i) where to merge and ii) how to merge. Following the work of Stymne and Cancedda (2011), we implement a linear-chain CRF merging system using the following features: the stemmed (separated) surface form, the part-of-speech14, frequencies from the training corpus for the bigram and for the merged form of word and word+1, word as a true prefix, word+1 as a true suffix, plus frequency comparisons of these. The CRF is trained on the split monolingual data. It only proposes merging decisions; merging itself uses a list extracted from the monolingual data (Popovic et al., 2006).
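A sketch of how such per-boundary features could be extracted is given below; the feature names and the coarse frequency binning are illustrative choices, and the linear-chain CRF training itself is not shown.

    def merge_features(words, pos, i, unigram_freq, bigram_freq, merged_freq):
        """Features describing whether words[i] and words[i+1] should be merged.
        The frequency tables are counted on the split, stemmed training corpus."""
        w, w_next = words[i], words[i + 1]
        f_bigram = bigram_freq.get((w, w_next), 0)
        f_merged = merged_freq.get(w + w_next, 0)
        return {
            "word": w,
            "next_word": w_next,
            "pos": pos[i],
            "next_pos": pos[i + 1],
            "bigram_freq_bin": min(f_bigram, 5),          # coarse frequency bins
            "merged_freq_bin": min(f_merged, 5),
            "word_is_true_prefix": any(k.startswith(w) and k != w for k in unigram_freq),
            "next_is_true_suffix": any(k.endswith(w_next) and k != w_next for k in unigram_freq),
            "merged_more_frequent": f_merged > f_bigram,  # frequency comparison
        }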
8.3 Experiments

We evaluated the end-to-end inflection system with the addition of compounds.15 As in the inflection experiments described in Section 5, we use a 5-gram surface LM and a 7-gram POS LM, but for this experiment they are trained on stemmed, split data. The POS LM helps compound parts and heads appear in the correct order. The results are in Table 7. The BLEU score of the CRF on test is 14.04, which is low. However, the system produces 19 compound types which are in the reference but not in the parallel data, and therefore not accessible to other systems. We also observe many more compounds in general. The 100-best inflection rescoring technique previously discussed reached 14.07 on the test set. Blind test results with CRF prediction are much better, 14.08, which is a statistically significant improvement over the baseline (13.68) and approaches the result we obtained without compounds (14.18). Correctly generated compounds are single words which usually carry the same information as multiple words in English, and are hence likely underweighted by BLEU. We again see many interesting generalizations. For instance, take the case of translating English miniature cameras to the German compound Miniaturkameras. Neither miniature camera nor miniature cameras occurs in the training data, and so there is no appropriate phrase pair in any system (baseline, inflection, or inflection&compound-splitting). However, our system with compound splitting has learned from split composita that English miniature can be translated as German Miniatur- and gets the correct output.

1  1 JSM, morphological features                           13.94
2  4 CRFs, morphological features, lexical information     14.04

Table 7: Results with Compounds on the test set.

14 Compound modifiers get assigned a special tag based on the POS of their former heads, e.g., Inflation in the example is marked as a non-head of a noun.
15 We found it most effective to merge word parts during MERT (so MERT uses the same stem references as before).

9 Related Work

There has been a large amount of work on translating from a morphologically rich language to English; we omit a literature review here due to space considerations. Our work is in the opposite direction, which primarily involves problems of generation, rather than problems of analysis.

The idea of translating to stems and then inflecting is not novel. We adapted the work of Toutanova et al. (2008), which is effective but limited by the conflation of two separate issues: word formation and inflection.

Given a stem such as brother, Toutanova et al.'s system might generate the stem and inflection corresponding to and his brother. Viewing and and his as inflection is problematic, since a mapping from the English phrase and his brother to the Arabic stem for brother is required. The situation is worse if there are English words (e.g., adjectives) separating his and brother. This required mapping is a significant problem for generalization. We view this issue as a different sort of problem entirely, one of word formation (rather than inflection). We apply a "split in preprocessing and resynthesize in postprocessing" approach to these phenomena, combined with inflection prediction that is similar to that of Toutanova et al. The only work that we are aware of which deals with both issues is the work of de Gispert and Marino (2008), which deals with verbal morphology and attached pronouns. There has been other work on solving inflection. Koehn and Hoang (2007) introduced factored SMT. We use more complex context features. Fraser (2009) tried to solve the inflection prediction problem by simply building an SMT system for translating from stems to inflected forms. Bojar and Kos (2010) improved on this by marking prepositions with the case they mark (one of the most important markups in our system). Both efforts were ineffective on large data sets. Williams and Koehn (2011) used unification in an SMT system to model some of the agreement phenomena that we model. Our CRF framework allows us to use more complex context features.
We have directly addressed the question as to whether inflection should be predicted using surface forms as the target of the prediction, or whether linguistic features should be predicted, along with the use of a subsequent generation step. The direct prediction of surface forms is limited to those forms observed in the training data, which is a significant limitation. However, it is reasonable to expect that the use of features (and morphological generation) could also be problematic, as this requires the use of morphologically-aware syntactic parsers to annotate the training data with such features, and additionally depends on the coverage of morphological analysis and generation. Despite this, our research clearly shows that the feature-based approach is superior for English-to-German SMT. This is a striking result considering that the state-of-the-art performance of German parsing is poor compared with the best performance on English parsing. As parsing performance improves, the performance of linguistic-feature-based approaches will increase.

Virpioja et al. (2007), Badr et al. (2008), Luong et al. (2010), Clifton and Sarkar (2011), and others are primarily concerned with using morpheme segmentation in SMT, which is a useful approach for dealing with issues of word formation. However, this does not deal directly with linguistic features marked by inflection. In German these linguistic features are marked very irregularly and there is widespread syncretism, making it difficult to split off morphemes specifying these features. So it is questionable whether morpheme segmentation techniques are sufficient to solve the inflectional problem we are addressing.

Much previous work looks at the impact of using source-side information (i.e., feature functions on the aligned English), such as that of Avramidis and Koehn (2008), Yeniterzi and Oflazer (2010) and others. Toutanova et al.'s work showed that it is most important to model target-side coherence, and our stem markup also allows us to access source-side information. Using additional source-side information beyond the markup did not produce a gain in performance.

For compound splitting, we follow Fritzinger and Fraser (2010), using linguistic knowledge encoded in a rule-based morphological analyser and then selecting the best analysis based on the geometric mean of word part frequencies. Other approaches use less deep linguistic resources (e.g., POS tags; Stymne (2008)) or are (almost) knowledge-free (e.g., Koehn and Knight (2003)). Compound merging is less well studied. Popovic et al. (2006) used a simple, list-based merging approach, merging all consecutive words included in a merging list. This approach resulted in too many compounds. We follow Stymne and Cancedda (2011) for compound merging. We trained a CRF using (nearly all) of the features they used and found their approach to be effective (when combined with inflection and portmanteau merging) on one of our two test sets.

10 Conclusion

We have shown that both the prediction of surface forms and the prediction of linguistic features are of interest for improving SMT. We have obtained the advantages of both in our CRF framework, and also integrated handling of compounds and of an inflection-dependent word formation phenomenon, portmanteaus. We validated our work on a well-studied, large-corpus translation task.

Acknowledgments

The authors wish to thank the anonymous reviewers for their comments. Aoife Cahill was partly supported by Deutsche Forschungsgemeinschaft grant SFB 732. Alexander Fraser, Marion Weller and Fabienne Cap were funded by Deutsche Forschungsgemeinschaft grant "Models of Morphosyntax for Statistical Machine Translation". The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement Nr. 248005. This work was supported in part by the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. This publication only reflects the authors' views. We thank Thomas Lavergne and Helmut Schmid.

References
Eleftherios Avramidis and Philipp Koehn. 2008. Enriching Morphologically Poor Languages for Statistical Machine Translation. In Proceedings of ACL-08: HLT, pages 763-770, Columbus, Ohio, June. Association for Computational Linguistics.
Ibrahim Badr, Rabih Zbib, and James Glass. 2008. Segmentation for English-to-Arabic statistical machine translation. In Proceedings of ACL-08: HLT, Short Papers, pages 153-156, Columbus, Ohio, June. Association for Computational Linguistics.
Ondrej Bojar and Kamil Kos. 2010. 2010 Failures in English-Czech Phrase-Based MT. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 60-66, Uppsala, Sweden, July. Association for Computational Linguistics.
Ann Clifton and Anoop Sarkar. 2011. Combining morpheme-based machine translation with post-processing morpheme prediction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 32-42, Portland, Oregon, USA, June. Association for Computational Linguistics.
Adria de Gispert and Jose B. Marino. 2008. On the impact of morphology in English to Spanish statistical MT. Speech Communication, 50(11-12):1034-1046.
Alexander Fraser. 2009. Experiments in Morphosyntactic Processing for Translating to and from German. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 115-119, Athens, Greece, March. Association for Computational Linguistics.
Fabienne Fritzinger and Alexander Fraser. 2010. How to Avoid Burning Ducks: Combining Linguistic Analysis and Corpus Statistics for German Compound Processing. In Proceedings of the Fifth Workshop on Statistical Machine Translation, pages 224-234. Association for Computational Linguistics.
Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 868-876, Prague, Czech Republic, June. Association for Computational Linguistics.
Philipp Koehn and Kevin Knight. 2003. Empirical methods for compound splitting. In EACL '03: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, pages 187-193, Morristown, NJ, USA. Association for Computational Linguistics.
John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA.
Thomas Lavergne, Olivier Cappe, and Francois Yvon. 2010. Practical very large scale CRFs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 504-513. Association for Computational Linguistics, July.
Minh-Thang Luong, Preslav Nakov, and Min-Yen Kan. 2010. A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 148-157, Cambridge, MA, October. Association for Computational Linguistics.
Maja Popovic, Daniel Stein, and Hermann Ney. 2006. Statistical Machine Translation of German Compound Words. In Proceedings of FINTAL-06, pages 616-624, Turku, Finland. Springer Verlag, LNCS.
Helmut Schmid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German Computational Morphology Covering Derivation, Composition, and Inflection. In 4th International Conference on Language Resources and Evaluation.
Helmut Schmid. 2004. Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In Proceedings of Coling 2004, pages 162-168, Geneva, Switzerland, Aug 23-Aug 27. COLING.
Andreas Stolcke. 2002. SRILM - An Extensible Language Modeling Toolkit. In International Conference on Spoken Language Processing.
Sara Stymne and Nicola Cancedda. 2011. Productive Generation of Compound Words in Statistical Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 250-260, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.
Sara Stymne. 2008. German Compounds in Factored Statistical Machine Translation. In Proceedings of GOTAL-08, pages 464-475, Gothenburg, Sweden. Springer Verlag, LNCS/LNAI.
Kristina Toutanova, Hisami Suzuki, and Achim Ruopp. 2008. Applying Morphology Generation Models to Machine Translation. In Proceedings of ACL-08: HLT, pages 514-522, Columbus, Ohio, June. Association for Computational Linguistics.
Sami Virpioja, Jaakko J. Vayrynen, Mathias Creutz, and Markus Sadeniemi. 2007. Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner. In Proceedings of MT Summit XI, pages 491-498.
Philip Williams and Philipp Koehn. 2011. Agreement constraints for statistical machine translation into German. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 217-226, Edinburgh, Scotland, July. Association for Computational Linguistics.
Reyyan Yeniterzi and Kemal Oflazer. 2010. Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 454-464, Uppsala, Sweden, July. Association for Computational Linguistics.
Identifying Broken Plurals, Irregular Gender,
and Rationality in Arabic Text

Sarah Alkuhlani and Nizar Habash


Center for Computational Learning Systems
Columbia University
{sma2149,nh2142}@columbia.edu

Abstract

Arabic morphology is complex, partly because of its richness, and partly because of common irregular word forms, such as broken plurals (which resemble singular nouns), and nouns with irregular gender (feminine nouns that look masculine and vice versa). In addition, Arabic morpho-syntactic agreement interacts with the lexical semantic feature of rationality, which has no morphological realization. In this paper, we present a series of experiments on the automatic prediction of the latent linguistic features of functional gender and number, and rationality in Arabic. We compare two techniques, using simple maximum likelihood (MLE) with back-off and a support vector machine based sequence tagger (Yamcha). We study a number of orthographic, morphological and syntactic learning features. Our results show that the MLE technique is preferred for words seen in the training data, while the Yamcha technique is optimal for unseen words, which are our real target. Furthermore, we show that for unseen words, morphological features help beyond orthographic features and that syntactic features help even more. A combination of the two techniques improves overall performance even further.

1 Introduction

Arabic morphology is complex, partly because of its richness, and partly because of its complex morpho-syntactic agreement rules, which depend on functional features not necessarily expressed in word forms. Particularly challenging are broken plurals (which resemble singular nouns), nouns with irregular gender (masculine nouns that look feminine and feminine nouns that look masculine), and the semantic feature of rationality, which has no morphological realization (Smrž, 2007b; Alkuhlani and Habash, 2011). These features heavily participate in Arabic morpho-syntactic agreement. Alkuhlani and Habash (2011) show that without proper modeling, Arabic agreement cannot be accounted for in about a third of all noun-adjective pairs and a quarter of verb-subject pairs. They also report that over half of all plurals in Arabic are irregular, 8% of nominals have irregular gender, and almost half of all proper nouns and 5% of all nouns are rational.

In this paper, we present results on the task of automatic identification of functional gender, number and rationality of Arabic words in context. We consider two supervised learning techniques: a simple maximum-likelihood model with back-off (MLE) and a support-vector-machine-based sequence tagger, Yamcha (Kudo and Matsumoto, 2003). We consider a large number of orthographic, morphological and syntactic learning features. Our results show that the MLE technique is preferred for words seen in the training data, while the Yamcha technique is optimal for unseen words, which are our real target. Furthermore, we show that for unseen words, morphological features help beyond orthographic features and that syntactic features help even more. A combination of the two techniques improves overall performance even further.

This paper is structured as follows: Sections 2 and 3 present relevant linguistic facts and related work, respectively. Section 4 presents the data collection we use and the metrics we target. Section 5 discusses our approach. And Section 6 presents our results.
[Figure 1: dependency tree and per-word form-based and functional gender, number and rationality features for the example Arabic sentence; see the caption below.]
English Modern writers are inspired by ancient Arab culture to write new stories .

Figure 1: An example Arabic sentence showing its dependency representation together with the form-based and
functional gender and number features and rationality. The dependency tree is in the CATiB treebank represen-
tation (Habash and Roth, 2009). The shown POS tags are VRB verb, NOM nominal (noun/adjective), and
PRT particle. The relations are SBJ subject, OBJ object and MOD modifier. The form-based features
are only for gender and number.

2 Linguistic Facts
QA mAhrwn (M P ), and H@  QA mAhrAt
(F P ). For a sizable minority of words, these
Arabic has a rich and complex morphology. In features are expressed templatically, i.e., through
addition to being both templatic (root/pattern) and pattern change, coupled with some singular suf-
concatenative (stems/affixes/clitics), Arabics op- fix. A typical example of this phenomenon is the
tional diacritics add to the degree of word ambi- class of broken plurals, which accounts for over
guity. We focus on two problems of Arabic mor- half of all plurals (Alkuhlani and Habash, 2011).
phology: the discrepancy between morphological In such cases, the form of the morphology (sin-
form and function; and the complexity of morpho- gular suffix) is inconsistent with the words func-
syntactic agreement rules. tional number (plural). For example, the word
2.1 Form and Function I.KA kAtb (M S) writer has the broken plural:
H. AJ ktAb ( MMPS ).2 See the second word in the ex-
Arabic nominals (i.e. nouns, proper nouns and
ample in Figure 1, which is the word H

adjectives) and verbs inflect for gender: mascu- . AJ ktAb
line (M ) and feminine (F ), and for number: sin- writers prefixed with the definite article Al+. In
gular (S), dual (D) and plural (P ). These features addition to broken plurals, Arabic has words with
are regularly expressed using a set of suffixes that irregular gender, e.g., the feminine singular ad-
uniquely convey gender and number combina- jective red Z@Qg HmrA ( M S
F S ), and the nouns

tions: + (M S), + +h1 (F S), + +wn (M P ), J
g xlyfh ( MF SS ) caliph and Ag HAml ( MF SS )
and H@ + +At (F P ). For example, the adjective pregnant. Verbs and nominal duals do not dis-
play this discrepancy.
QA mAhr clever has the following forms among

others: QA mAhr (M S), QA mAhrh (F S), 2.2 Morpho-syntactic Agreement
1
Arabic transliteration is presented in the Habash-Soudi- Arabic gender and number features participate in
Buckwalter (HSB) scheme (Habash et al., 2007): (in alpha- morpho-syntactic agreement within specific con-
betical order) AbtjHxdrzsSDTDfqklmnhwy and the ad-
  2 F orm
ditional symbols: Z, @, A @, A @, w ', y Z', h , . This nomenclature denotes ( F unction ).

structions such as nouns with their adjectives Altantawy et al., 2010; Alkuhlani and Habash,
and verbs with their subjects. Arabic agreement 2011).
rules are more complex than the simple match- In terms of resources, Smr (2007b)s work
ing rules found in languages such as Spanish contrasting illusory (form) features and functional
(Holes, 2004; Habash, 2010). For instance, Ara- features inspired our distinction of morphologi-
bic adjectives agree with the nouns they mod- cal form and function. However, unlike him, we
ify in gender and number except for plural ir- do not distinguish between sub-functional (logi-
rational (non-human) nouns, which always take cal and formal) features. His ElixirFM analyzer
feminine singular adjectives. Rationality (hu- (Smr, 2007a) extends BAMA by including func-
 
manness A Q
/ A) is a morpho-lexical tional number and some functional gender infor-
feature that is narrower than animacy. English mation, but not rationality. This analyzer was
expresses it mainly in pronouns (he/she vs. it) used as part of the annotation of the Prague Ara-
and relativizers (men who... vs. cars/cows bic Dependency Treebank (PADT) (Smr and Ha-
which...). We follow the convention by Alkuh- jic, 2006). More recently, Alkuhlani and Habash
lani and Habash (2011) who specify rationality (2011) built on the work of Smr (2007b) and ex-
as part of the functional features of the word. tended beyond it to fully annotate functional gen-
The values of this feature are: rational (R), irra- der, number and rationality in the PATB part 3.
tional (I), and not-specified (N ). N is assigned to We use their resource to train and evaluate our
verbs, adjectives, numbers and quantifiers.3 For system.
example, in Figure 1, the plural rational noun In terms of techniques, Goweder et al. (2004)

H. AJ@ AlktAb ( MMPSR ) writers takes the plural investigated several approaches using root and
adjective JK
Ym '@ AlHdywn ( M P ) modern; pattern morphology for identifying broken plu-
MP N

while the plural irrational word A qSSA sto- rals in undiacritized Arabic text. Their effort re-
ries ( FMPSI ) takes the feminine singular adjective sulted in an improved stemming system for Ara-
YK Yg jdydh ( F S ). bic information retrieval that collapses singulars

. F SN and plurals. They report results on identifying
3 Related Work broken plurals out of context. Similar to them,
we undertake the task of identifying broken plu-
Much work has been done on Arabic morpholog- rals; however, we also target the templatic gen-
ical analysis, morphological disambiguation and der and rationality features, and we do this in-
part-of-speech (POS) tagging (Al-Sughaiyer and context. Elghamry et al. (2008) presented an auto-
Al-Kharashi, 2004; Soudi et al., 2007; Habash, matic cue-based algorithm that uses bilingual and
2010). The bulk of this work does not address monolingual cues to build a web-extracted lexi-
form-function discrepancy or morpho-syntactic con enriched with gender, number and rationality
agreement issues. This includes the most com- features. Their automatic technique achieves an
monly used resources and tools for Arabic NLP: F-score of 89.7% against a gold standard set. Un-
the Buckwalter Arabic Morphological Analyzer like them, we use a manually annotated corpus to
(BAMA) (Buckwalter, 2004) which is used in the train and test the prediction of gender, number and
Penn Arabic Tree Bank (PATB) (Maamouri et al., rationality features.
2004), and the various POS tagging and morpho- Our approach to identifying these features ex-
logical disambiguation tools trained using them plores a large set of orthographic, morphological
(Diab et al., 2004; Habash and Rambow, 2005). and syntactic learning features. This is very much
There are some important exceptions (Goweder et following several previous efforts in Arabic NLP
al., 2004; Habash, 2004; Smr, 2007b; Elghamry in which different tagsets and morphological fea-
et al., 2008; Abbs et al., 2004; Attia, 2008; tures have been studied for a variety of purposes,
3
We previously defined the rationality value N as not- e.g., base phrase chunking (Diab, 2007) and de-
applicable when we only considered nominals (Alkuhlani pendency parsing (Marton et al., 2010). In this
and Habash, 2011). In this work, we rename the rationality paper we use the parser of Marton et al. (2010)
value N as not-specified without changing its meaning. We
use the value N a (not-applicable) for parts-of-speech that
as our source of syntactic learning features. We
do not have a meaningful value for any feature, e.g., prepo- follow their splits for training, development and
sitions have gender, number and rationality values of N a. testing.

677
4 Problem Definition 5 Approach
Our approach involves using two techniques:
Our goal is to predict the functional gender, num-
MLE with back-off and Yamcha. For each tech-
ber and rationality features for all words.
nique, we explore the effects of different learning
features and try to come up with the best tech-
4.1 Corpus and Experimental Settings nique and feature set for each target feature.

We use the corpus of Alkuhlani and Habash 5.1 Learning Features


(2011), which is based on the PATB. The corpus We investigate the contribution of different learn-
contains around 16.6K sentences and over 400K ing features in predicting functional gender, num-
tokens. We use the train/development/test splits ber and rationality features. The learning features
of Marton et al. (2010). We train on a quarter of are explored in the following order:
the training set and classify words in sequence.
We only use a portion of the training data to in- Orthographic Features These features are or-
crease the percentage of words unseen in training. ganized in two sets: W1 is the unnormalized form
We also compare to using all of the training data of the word, and W2 includes W1 plus letter n-
in Section 6.7. grams. The n-grams used are the first letter, first
two letters, last letter, and last two letters of the
Our data is gold tokenized; however, all of word form. We tried using the Alif/Ya normalized
the features we use are predicted using MADA forms of the words (Habash, 2010), but these be-
(Habash and Rambow, 2005) following the work haved consistently worse than the unnormalized
of Marton et al. (2010). Words whose tags are un- forms.
known in the training set are excluded from the
evaluation, but not training. In terms of ambigu- Morphological Features We explore the fol-
ity, the percentage of word types with ambiguous lowing morphological features inspired by the
gender, number and rationality in the train set is work of Marton et al. (2010):
1.35%, 0.79%, and 4.8% respectively. These per- POS tags. We experiment with different POS
centages are consistent with how we perform on tag sets: CATiB-6 (6 tags) (Habash et al., 2009),
these features, with number being the easiest and CATiB-EX (44 tags), Kulick (34 tags) (Kulick et
rationality the hardest. al., 2006), Buckwalter (BW) (Buckwalter, 2004),
which is the tag used in the PATB (430 tags),
and a reduced form of BW tag that ignores case
4.2 Metrics
and mood (BW-) (217 tags). These tags differ in
We report all results in terms of token accuracy. their granularity and range from very specific tags
Evaluation is done for the following sets: all (Buckwalter) to more general tags (CATiB).
words, seen words, and unseen words. A word is Lemma. We use the diacritized lemma
considered seen if it is in the training data regard- (Lemma), and the normalized and undiacritized
less of whether it appears with the same lemma form of the lemma, the LMM (LMM).
and POS tag or not. Defining seen words this way Form-based features. Form-based features
makes the decision on whether a word is seen or (F) are extracted from the word form and do not
unseen unaffected by lemma and/or POS predic- necessarily reflect functional features. These fea-
tion errors in the development and test sets. Us- tures are form-based gender, form-based number,
ing our definition of seen words, 34.3% of words person and the definite article.
types (and 10.2% of word tokens) in the devel- Syntactic Features We use the following syn-
opment set have not been seen in quarter of the tactic features (SYN) derived from the CATiB de-
training set. pendency version of the PATB (Habash and Roth,
We train single classifiers for G (gender), N 2009): parent, dependency relation, order of ap-
(number), R (rationality), GN and GNR, and eval- pearance (the word comes before or after its par-
uate them. We also combine the tags of the sin- ent), the distance between the word and its parent,
gle classifiers into larger tags (G+N, GN+R and and the parents orthographic and morphological
G+N+R). features.

For all of these features, we train on gold values, but only experiment with predicted values in the development and test sets. For predicting morphological features, we use the MADA system (Habash and Rambow, 2005). The MADA system corrects for suboptimal orthographic choices and effectively produces a consistent and unnormalized orthography. For the syntactic features, we use Marton et al. (2010)'s system.

5.2 Techniques

We describe below the two techniques we explored.

MLE with Back-off  We implemented an MLE system with multiple back-off modes using our set of linguistic features. The order of the back-off is from specific to general. We start with an MLE system that uses only the word form, and backs off to the most common feature value across all words (excluding unknown and Na values). This simple MLE system is used as a baseline.

As we add more features to the MLE system, it tries to match all these features to predict the value for a given word. If such a combination of features is not seen in the training set, the system backs off to a more general combination of features. For example, if an MLE system is using the features W2+LMM+BW, the system tries to match this combination. If it is not seen in training, the system backs off to the following set: LMM+BW, and tries to return the most common value for this POS tag and lemma combination. If again it fails to find a match, it backs off to BW, and returns the most common value for that particular POS tag. If no word is seen with this POS tag, the system returns the most common value across all words.

Yamcha Sequence Tagger  We use Yamcha (Kudo and Matsumoto, 2003), a support-vector-machine-based sequence tagger. We perform different experiments with the different sets of features presented above. After that, we apply a consistency filter that ensures that every word-lemma-POS combination always gets the same value for gender, number and rationality features. Yamcha in its default settings tags words using a window of two words before and two words after the word being tagged. This gives Yamcha an advantage over the MLE system, which tags each word independently.

Single vs Joint Classification  In this paper, we only discuss systems trained for a single classifier (for gender, for number and for rationality). In experiments we have done, we found that training single classifiers and combining their outcomes almost always outperforms a single joint classifier for the three target features. In other words, combining the results of G and N (G+N) outperforms the results of the single classifier GN. The same is also true for G+N+R, which outperforms GNR and GN+R. Therefore, we only present the results for the single classifiers G, N, R and their combination G+N+R.

6 Results

We perform a series of experiments increasing in feature complexity. We greedily select which features to pass on to the next level of experiments. In cases of ties, we pass the top two performers to the next step. We discuss each of these experiments next for both the MLE and Yamcha techniques. Statistical significance is measured using the McNemar test of statistical significance (McNemar, 1947).

6.1 Experiment Set I: Orthographic Features

The first set of experiments uses the orthographic features. See Table 1. The MLE system with the word-only feature (W1) is effectively our baseline. It does surprisingly well for seen cases. In fact it is the highest performer across all experiments in this paper for seen cases. For unseen cases, it produces a miserable and expected low score of 21.0% accuracy. The addition of the n-gram features (W2) improves statistically significantly over W1 for unseen cases, but it is indistinguishable for seen cases. The Yamcha system shows the same difference in results between W1 and W2.

Across the two sets of features, the MLE system consistently outperforms Yamcha in the case of seen words, while Yamcha does better for unseen words. This can be explained by the fact that the MLE system matches only on the word form and if the word is unseen, it backs off to the most common value across all words. Moreover, Yamcha uses some limited context information that allows it to generalize for unseen words.

Among the target features, number is the easiest to predict, while rationality is the hardest.
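The MLE back-off cascade described in Section 5.2 can be sketched as follows; the feature names and the specific W2+LMM+BW-style chain are illustrative, and this is not the authors' implementation.

    from collections import Counter, defaultdict

    class BackoffMLE:
        """Predict a target feature (e.g., gender) from increasingly general keys,
        e.g. (word, lemma, pos) -> (lemma, pos) -> (pos,) -> global majority."""
        def __init__(self, key_chains):
            self.key_chains = key_chains                   # most specific first
            self.tables = [defaultdict(Counter) for _ in key_chains]
            self.global_counts = Counter()

        def train(self, instances):
            # instances: iterable of (feature_dict, label) pairs
            for feats, label in instances:
                if label in ("UNK", "Na"):                 # skip uninformative labels
                    continue
                self.global_counts[label] += 1
                for table, chain in zip(self.tables, self.key_chains):
                    key = tuple(feats[k] for k in chain)
                    table[key][label] += 1

        def predict(self, feats):
            for table, chain in zip(self.tables, self.key_chains):
                key = tuple(feats[k] for k in chain)
                if table[key]:                             # found at this back-off level
                    return table[key].most_common(1)[0][0]
            return self.global_counts.most_common(1)[0][0]

    # A word+lemma+POS cascade backing off to lemma+POS, then POS alone.
    model = BackoffMLE([("word", "lmm", "bw"), ("lmm", "bw"), ("bw",)])
    model.train([({"word": "ktAb", "lmm": "ktAb", "bw": "NOUN"}, "M")])
    print(model.predict({"word": "qSS", "lmm": "qSh", "bw": "NOUN"}))  # -> "M"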

MLE Yamcha
G N R G+N+R G N R G+N+R
Features seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen
W1 99.2 61.6 99.3 69.2 97.4 44.7 97.0 21.0 95.9 67.8 96.7 72.0 94.5 67.4 90.2 35.2
W2 99.2 81.7 99.3 81.6 97.4 63.4 97.0 49.1 97.1 86.6 97.7 87.1 95.6 82.0 92.8 65.5

Table 1: Experiment Set I: Baselines and simple orthographic features. W1 is the word only. W2 is the word
with additional 1-gram and 2-gram prefix and suffix features. All numbers are accuracy percentages.

MLE Yamcha
G N R G+N+R G N R G+N+R
Features seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen seen unseen
W2+F 99.2 86.9 99.3 88.9 97.4 63.4 96.9 51.9 97.7 89.8 98.1 91.7 96.0 83.5 93.8 72.0
W2+Lemma 97.4 68.3 97.6 71.5 95.6 70.3 95.2 33.8 97.4 86.8 97.7 86.4 96.1 82.2 93.3 65.4
W2+LMM 99.1 68.8 99.3 71.7 97.2 67.6 96.8 33.2 97.5 86.7 97.9 86.6 96.1 82.6 93.5 65.7
W2+CATIB 99.1 85.0 99.3 83.8 97.4 70.0 97.1 56.2 97.5 87.9 98.0 88.6 96.0 83.5 93.6 69.7
W2+CATIB-EX 99.1 85.7 99.3 84.3 97.4 70.4 97.1 56.7 97.5 88.0 97.9 88.1 96.0 83.6 93.6 69.9
W2+Kulick 99.0 86.7 99.1 85.6 97.1 78.7 96.7 65.5 97.3 88.8 97.9 89.4 95.8 83.5 93.3 70.9
W2+BW- 99.0 88.8 99.0 88.8 97.0 80.7 96.6 68.5 97.5 89.7 98.0 91.2 96.0 85.2 93.7 73.2
W2+BW 98.6 87.9 98.5 88.8 96.8 80.3 95.9 67.8 97.5 89.5 97.9 89.5 96.1 85.7 93.7 72.8

Table 2: Experiment Set II.a: Morphological features: (i) form-based gender and number, (ii) lemma and LMM
(undiacritized lemma) and (iii) a variety of POS tag sets. For each subset, the best performers are bolded.

6.2 Experiment Set II: Morphological reasonable given that LMM is easier to predict;
Features although LMM is more ambiguous.
As for the POS tag sets, looking at the MLE
Individual Morphological Features In this set results, CATIB-EX is the best performer for seen
of experiments, we use our best system from the words, and BW- is the best for unseen. CATIB-6
previous set, W2, and add individual morpholog- is a general POS tag set and since the MLE tech-
ical features to it. We organize these features in nique is very strict in its matching process (an ex-
three sub-groups: (i) form-based features (F), (ii) act match or no match), using a general key to
lemma and LMM, and (iii) the five POS tag sets. match on adds a lot of ambiguity. With Yamcha,
See Table 2. BW and BW- are the best among all POS. Yamcha
The F, Lemma and LMM improve over the is still doing consistently better in terms of unseen
baseline in terms of unseen words for both MLE words. The best two systems from both Yamcha
and Yamcha techniques. However, for seen and MLE are used as the basic systems for the
words, these systems do worse than or equal to the next subset of experiments where we combine the
baseline when the MLE technique is used. The morphological features.
MLE system in these cases tries to match the word
and its morphological features as a single unit and Combined Morphological Features Until this
if such a combination is not seen, it backs off to point, all experiments using the two techniques
the morphological feature which is more general. are similar. In this subset, MLE explores the ef-
Since we are using predicted data, prediction er- fect of using the CATIB-EX and BW- with other
rors could be the reason behind this decrease in morphological features. And Yamcha explores
accuracy for seen words. Among these systems, the effect of using BW- and BW with other mor-
W2+F is the best for both Yamcha and MLE ex- phological features. See Table 3. Again, Yamcha
cept for rationality which is expected since there is still doing consistently better in terms of unseen
are no form-based features for rationality. In this words, but when it comes to seen words, MLE
set of experiments, Yamcha consistently outper- performs better. For seen words, our best results
forms MLE when it comes to unseen words, but come from MLE using CATIB-EX and LMM. For
for seen words, MLE does better almost always. unseen words, our best results come from Yam-
LMM overall does better than Lemma. This is cha with the BW- tag and the form-based features

MLE Yamcha
Features: G N R G+N+R Features: G N R G+N+R
W2 seen unseen seen unseen seen unseen seen unseen W2 seen unseen seen unseen seen unseen seen unseen
+CATIB-EX 99.1 85.7 99.3 84.3 97.4 70.4 97.0 56.7 +BW 97.5 89.5 97.9 89.5 96.1 85.7 93.7 72.8
+F 98.7 88.6 99.1 89.4 94.9 70.4 94.3 59.7 +F 97.8 90.6 98.2 92.4 96.3 85.3 94.2 75.4
+LMM 99.1 78.9 99.3 80.4 97.3 69.6 96.9 44.7 +LMM 97.6 88.9 98.1 88.9 96.5 85.7 94.1 72.3
+LMM+F 98.7 89.9 99.0 89.7 94.8 69.6 94.2 58.1 +LMM+F 98.1 90.4 98.4 92.5 96.7 85.8 94.8 75.9
+BW- 99.0 88.8 99.0 88.8 97.0 80.7 96.6 68.5 +BW- 97.5 89.7 98.0 91.2 96.0 85.2 93.7 73.2
+F 99.0 88.8 99.1 89.9 97.0 80.7 96.6 69.6 +F 97.7 90.7 98.2 92.5 96.1 85.6 94.0 75.3
+LMM 98.9 90.0 99.0 88.0 97.0 83.6 96.6 69.8 +LMM 97.7 89.6 98.1 90.4 96.2 85.1 94.0 72.5
+LMM+F 98.9 90.0 99.0 89.1 97.0 83.6 96.6 70.8 +LMM+F 98.0 90.3 98.2 92.4 96.5 85.7 94.5 75.1

Table 3: Experiment Set II.b: Combining different morphological features.

Yamcha
G N R G+N+R
Features: seen unseen seen unseen seen unseen seen unseen
W2 +BW +F+SYN 97.3 90.6 97.8 92.5 96.1 86.1 93.5 76.0
W2 +BW +LMM+SYN 97.4 89.1 97.5 88.3 96.2 86.0 93.4 71.7
W2 +BW +LMM+F+SYN 97.5 90.8 98.0 92.5 96.4 86.2 93.8 76.2
W2 +BW- +F+SYN 97.4 90.7 97.9 92.7 96.1 85.2 93.5 75.0
W2 +BW- +LMM+SYN 97.4 89.5 97.7 89.8 96.1 85.7 93.4 72.1
W2 +BW- +LMM+F+SYN 97.4 90.8 97.9 92.7 96.2 85.3 93.6 75.2

Table 4: Experiment Set III: Syntactic features.

for both gender and number. For rationality, the words. In Yamcha, we can argue that the +/-2
best features to use with Yamcha are BW, LMM word window allows some form of shallow syn-
and form-based features. The lemma seems to ac- tax modeling, which is why Yamcha is doing bet-
tually hurt when predicting gender and number. ter from the start. But the longer distance features
This can be explained by the fact that gender and are helping even more, perhaps because they cap-
number features are often properties of the word ture agreement relations. The overall best system
form and not of the lemma. This is different for for unseen words is W2+BW+LMM+F+SYN,
rationality, which is a property of the lemma and except for number, where W2+BW-+F+SYN
therefore, we expect the lemma to help. is slightly better. In terms of G+N+R
The fact that the predicted BW set helps is not scores, W2+BW+LMM+F+SYN is statistically
consistent with previous work by Marton et al. significantly better than all other systems in
(2010). In that effort, BW helps parsing only in this set for seen and unseen words, ex-
the gold condition. BW prediction accuracy is cept for unseen words with W2+BW+F+SYN.
low because it includes case endings. We pos- W2+BW+LMM+F+SYN is also statistically sig-
tulate that perhaps in our task, which is far more nificantly better than its non-syntactic variant for
limited than general parsing, errors in case pre- both seen and unseen words. The prediction ac-
diction may not matter too much. The more com- curacy for seen words is still not as good as the
plex tag set may actually help establish good lo- MLE systems.
cal agreement sequences (even if incorrect case-
wise), which is relevant to the target features. 6.4 System Combination
The simple MLE W1 system, which happens to be
6.3 Experiment Set III: Syntactic Features the baseline, is the best predictor for seen words,
This set of experiments adds syntactic features and the more advanced Yamcha system using syn-
to the experiments in set II. We add syntax to tactic features is the best predictor for unseen
the systems that uses Yamcha only since it is words. Next, we create a new system that takes
not obvious how to add syntactic information to advantage of the two systems. We use the sim-
the MLE system. Syntax improves the predic- ple MLE W1 system for seen words, and Yam-
tion accuracy for unseen words but not for seen cha with syntax for unseen words. For unseen

words, since each target feature has its own set of All seen unseen
best learning features, we also build a combina- MLE W1 88.5 96.8 21.2
tion system that uses the best systems for gender, Yamcha BW+LMM+F 91.4 94.1 70.4
Yamcha BW+LMM+F+SYN 91.0 93.3 72.2
number and rationality and combine their output
Combination 94.1 96.8 72.4
into a single system for unseen words. For gender
and rationality, we use W2+BW+LMM+F+SYN, Table 5: Results on blind test. Scores for
and for number, we use W2+BW-+F+SYN. As All/Seen/Unseen are shown for the G+N+R condition.
expected the combination system outperforms the We compare the MLE word baseline, with the best
basic systems. For comparison: The MLE W1 Yamcha system with and without syntactic features
and the combined system.
system gets an (all, seen, unseen) scores of (89.3,
97.0, 21.0) for G+N+R, while the best single
Yamcha syntactic system gets (92.0, 93.8, 76.2); Since the Yamcha system uses MADA features,
the combination on the other hand gets (94.9, we investigated the effect of the correctness of
97.0, 76.2). The overall (all) improvement over MADA features on the system prediction accu-
the MLE baseline or the best Yamcha translates racy. The overall MADA accuracy in identifying
into 52% error reduction or 36% error reduction, the lemma and the Buckwalter tag together a
respectively. very harsh measure is 77.0% (79.3% for seen
and 56.8% for unseen). Our error analysis shows
6.5 Error Analysis that when MADA is correct, the prediction ac-
We conducted an analysis of the errors in the out- curacy for G+N+R is 95.6%, 96.5% and 84.4%
put of the combination system as well as the two for all, seen and unseen, respectively. However,
systems that contributed to it. this accuracy goes down to 79.2%, 82.5% and
In the combination system, out of the total er- 65.5% for all, seen and unseen, respectively, when
ror in G+N+R (5.1%), 53% of the cases are for MADA is wrong. This suggests that the Yam-
seen words (3.0% of all seen) and 47% for unseen cha system suffers when MADA makes wrong
words (23.8% of all unseen). Overall, rational- choices and improving MADA would lead to im-
ity errors are the biggest contributor to G+N+R provement in the systems performance.
error at 73% relative, followed by gender (33%
relative) and number (26% relative). Among er- 6.6 Blind Test
ror cases of seen words, rationality errors soar to Finally, we apply our baseline, best combination
87% relative, almost four times the corresponding model and best single Yamcha syntactic model
gender and number errors (27% and 22%, respec- (with and without syntax) to the blind test set.
tively). However, among error cases of unseen The results are in Table 5. The results in the blind
words, rationality errors are 57% relative, while test are consistent with the development set. The
gender and number corresponding errors are (39% MLE baseline is best on seen words, Yamcha is
and 31%, respectively). As expected, rational- best on unseen words, syntactic features help in
ity is much harder to tag than gender and number handling unseen words, and overall combination
due to its higher word-form ambiguity and depen- improves over all specific systems.
dence on context.
We classified the type of errors in the MLE sys- 6.7 Additional Training Data
tem for seen words, which we use in the combi- After experimenting on quarter of the train set to
nation system. We found that 86% of the G+N+R optimize for various settings, we train our com-
errors involve an ambiguity in the training data bination system on the full train set and achieve
where the correct answer was present but not cho- (96.0, 96.8, 74.9) for G+N+R (all, seen, unseen)
sen. This is an expected limitation of the MLE ap- on the development set and (96.5, 96.8, 65.6)
proach. In the rest of the cases, the correct answer on the blind test set. As expected, the overall
was not actually present in the training data. The (all) scores are higher simply due to the addi-
proportion of ambiguity errors is almost identical tional training data. The results on seen and un-
for gender, number and rationality. However ra- seen words, which are redefined against the larger
tionality overall is the biggest cause of error, sim- training set, are not higher than results for the
ply due to its higher degree of ambiguity. quarter training data. Of course, these numbers

should not be compared directly: the number of unseen word tokens in the full train set is 3.7%, compared to 10.2% in the quarter train set.

6.8 Comparison with MADA

We compare our results with the form-based features from the state-of-the-art morphological analyzer MADA (Habash and Rambow, 2005). We use the form-based gender and number features produced by MADA after we filter the MADA choices by tokenization. Since MADA does not give a rationality value, we assign the value I (irrational) to nouns and proper nouns and the value N (not-specified) to verbs and adjectives. Everything else receives Na (not-applicable). The POS tags are determined by MADA.

On the development set, MADA achieves (72.6, 73.1, 58.6) for G+N+R (all, seen, unseen), where the seen/unseen distinction is based on the full training set of the previous section and is provided for comparison reasons only. The results for the test set are (71.4, 72.2, 53.7). These results are consistent with our expectation that MADA would do badly on this task since it is not designed for it (Alkuhlani and Habash, 2011). We remind the reader that MADA-derived features are used as machine learning features in this paper, where they actually help. In the future, we plan to integrate this task inside of MADA.
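The default rationality assignment described above is a simple POS-driven mapping. The following is a minimal illustrative sketch of that rule; the tag names and the function name are assumptions for illustration, not MADA's actual tagset or interface.

```python
# Illustrative sketch of the default rationality rule described above
# (I = irrational, N = not-specified, Na = not-applicable).
# The POS tag names here are assumed, not MADA's exact labels.
def default_rationality(pos_tag):
    if pos_tag in {"noun", "noun_prop"}:   # nouns and proper nouns
        return "I"
    if pos_tag in {"verb", "adj"}:         # verbs and adjectives
        return "N"
    return "Na"                            # everything else
```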
6.9 Extrinsic Evaluation

We use the predicted gender, number and rationality features obtained from training on the full train set in a dependency syntactic parsing experiment. The parsing feature set we use is the best performing feature set described in Marton et al. (2011), which used an earlier, unpublished version of our MLE model. The parser we use is the Easy-First Parser (Goldberg and Elhadad, 2010). More details on this parsing experiment are in Marton et al. (2012).

The functional gender and number features increase the labeled attachment score by 0.4% absolute over a comparable model that uses the form-based gender and number features. Rationality, on the other hand, does not help much. One possible reason for this is the lower quality of the predicted rationality feature compared to the other features. Another possible reason is that the rationality feature is not utilized optimally in the parser.

7 Conclusions and Future Work

We presented a series of experiments for the automatic prediction of the latent features of functional gender and number, and rationality, in Arabic. We compared two techniques, a simple MLE with back-off and an SVM-based sequence tagger, Yamcha, using a number of orthographic, morphological and syntactic features. Our conclusions are that for words seen in training, the MLE model does best; for unseen words, Yamcha does best; and, most interestingly, syntactic features help the prediction for unseen words.

In the future, we plan to explore training on predicted features instead of gold features to minimize the effect of tagger errors. Furthermore, we plan to use our tools to collect vocabulary not covered by commonly used morphological analyzers and try to assign it correct functional features. Finally, we would like to use our predictions for gender, number and rationality as learning features for relevant NLP applications such as sentiment analysis, phrase-based chunking and named entity recognition.

Acknowledgments

We would like to thank Yuval Marton for help with the parsing experiments. The first author was funded by a scholarship from the Saudi Arabian Ministry of Higher Education. The rest of the work was funded under DARPA projects number HR0011-08-C-0004 and HR0011-08-C-0110.

References

Ramzi Abbès, Joseph Dichy, and Mohamed Hassoun. 2004. The Architecture of a Standard Arabic Lexical Database: Some Figures, Ratios and Categories from the DIINAR.1 Source Program. In Ali Farghaly and Karine Megerdoomian, editors, COLING 2004 Computational Approaches to Arabic Script-based Languages, pages 15-22, Geneva, Switzerland, August. COLING.

Imad Al-Sughaiyer and Ibrahim Al-Kharashi. 2004. Arabic Morphological Analysis Techniques: A Comprehensive Survey. Journal of the American Society for Information Science and Technology, 55(3):189-213.

Sarah Alkuhlani and Nizar Habash. 2011. A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality. In Proceedings of the 49th Annual Meeting of the Association
for Computational Linguistics (ACL'11), Portland, Oregon, USA.

Mohamed Altantawy, Nizar Habash, Owen Rambow, and Ibrahim Saleh. 2010. Morphological Analysis and Generation of Arabic Nouns: A Morphemic Functional Approach. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta.

Mohammed Attia. 2008. Handling Arabic Morphological and Syntactic Ambiguity within the LFG Framework with a View to Machine Translation. Ph.D. thesis, The University of Manchester, Manchester, UK.

Tim Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer Version 2.0. LDC catalog number LDC2004L02, ISBN 1-58563-324-0.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL04), pages 149-152, Boston, MA.

Mona Diab. 2007. Towards an Optimal POS Tag Set for Modern Standard Arabic Processing. In Proceedings of Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria.

Khaled Elghamry, Rania Al-Sabbagh, and Nagwa El-Zeiny. 2008. Cue-based bootstrapping of Arabic semantic features. In JADT 2008: 9es Journées internationales d'Analyse statistique des Données Textuelles.

Yoav Goldberg and Michael Elhadad. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 742-750, Los Angeles, California, June. Association for Computational Linguistics.

Abduelbaset Goweder, Massimo Poesio, Anne De Roeck, and Jeff Reynolds. 2004. Identifying Broken Plurals in Unvowelised Arabic Text. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 246-253, Barcelona, Spain, July.

Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 573-580, Ann Arbor, Michigan.

Nizar Habash and Ryan Roth. 2009. CATiB: The Columbia Arabic Treebank. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 221-224, Suntec, Singapore.

Nizar Habash, Abdelhadi Soudi, and Tim Buckwalter. 2007. On Arabic Transliteration. In A. van den Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.

Nizar Habash, Reem Faraj, and Ryan Roth. 2009. Syntactic Annotation in the Columbia Arabic Treebank. In Proceedings of the MEDAR International Conference on Arabic Language Resources and Tools, Cairo, Egypt.

Nizar Habash. 2004. Large Scale Lexeme Based Arabic Morphological Generation. In Proceedings of Traitement Automatique des Langues Naturelles (TALN-04), pages 271-276, Fez, Morocco.

Nizar Habash. 2010. Introduction to Arabic Natural Language Processing. Morgan & Claypool Publishers.

Clive Holes. 2004. Modern Arabic: Structures, Functions, and Varieties. Georgetown Classics in Arabic Language and Linguistics. Georgetown University Press.

Taku Kudo and Yuji Matsumoto. 2003. Fast Methods for Kernel-Based Text Analysis. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL'03), pages 24-31, Sapporo, Japan, July.

Seth Kulick, Ryan Gabbard, and Mitch Marcus. 2006. Parsing the Arabic Treebank: Analysis and Improvements. In Proceedings of the Treebanks and Linguistic Theories Conference, pages 31-42, Prague, Czech Republic.

Mohamed Maamouri, Ann Bies, Tim Buckwalter, and Wigdan Mekki. 2004. The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. In NEMLAR Conference on Arabic Language Resources and Tools, pages 102-109, Cairo, Egypt.

Yuval Marton, Nizar Habash, and Owen Rambow. 2010. Improving Arabic Dependency Parsing with Lexical and Inflectional Morphological Features. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 13-21, Los Angeles, CA, USA, June.

Yuval Marton, Nizar Habash, and Owen Rambow. 2011. Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL'11), Portland, Oregon, USA.

Yuval Marton, Nizar Habash, and Owen Rambow. 2012. Dependency Parsing of Modern Standard Arabic with Lexical and Inflectional Features. Manuscript submitted for publication.

Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153-157.

Otakar Smrž and Jan Hajič. 2006. The Other Arabic Treebank: Prague Dependencies and Functions. In Ali Farghaly, editor, Arabic Computational Linguistics: Current Implementations. CSLI Publications.
Otakar Smrž. 2007a. ElixirFM: Implementation of Functional Arabic Morphology. In ACL 2007 Proceedings of the Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, pages 1-8, Prague, Czech Republic. ACL.

Otakar Smrž. 2007b. Functional Arabic Morphology. Formal System and Implementation. Ph.D. thesis, Charles University in Prague, Prague, Czech Republic.

Abdelhadi Soudi, Antal van den Bosch, and Günter Neumann, editors. 2007. Arabic Computational Morphology: Knowledge-based and Empirical Methods, volume 38 of Text, Speech and Language Technology. Springer, August.
Framework of Semantic Role Assignment based on Extended Lexical
Conceptual Structure: Comparison with VerbNet and FrameNet

Yuichiroh Matsubayashi    Yusuke Miyao    Akiko Aizawa
National Institute of Informatics, Japan
{y-matsu,yusuke,aizawa}@nii.ac.jp

Abstract

Widely accepted resources for semantic parsing, such as PropBank and FrameNet, are not perfect as a semantic role labeling framework. Their semantic roles are not strictly defined; therefore, their meanings and semantic characteristics are unclear. In addition, it is presupposed that a single semantic role is assigned to each syntactic argument. This is not necessarily true when we consider the internal structures of verb semantics. We propose a new framework for semantic role annotation which solves these problems by extending the theory of lexical conceptual structure (LCS). By comparing our framework with that of existing resources, including VerbNet and FrameNet, we demonstrate that our extended LCS framework can give a formal definition of semantic role labels, and that multiple roles of arguments can be represented strictly and naturally.

Sentence    [John]   threw   [a ball]   [from the window]
Affection   Agent            Patient
Movement    Source           Theme      Source/Path
PropBank    Arg0             Arg1       Arg2
VerbNet     Agent            Theme      Source
FrameNet    Agent            Theme      Source

Table 1: Examples of single role assignments with existing resources.

1 Introduction

Recent developments of large semantic resources have accelerated empirical research on semantic processing (Marquez et al., 2008). Specifically, corpora with semantic role annotations, such as PropBank (Kingsbury and Palmer, 2002) and FrameNet (Ruppenhofer et al., 2006), are indispensable resources for semantic role labeling. However, there are two topics we have to carefully take into consideration regarding role assignment frameworks: (1) the clarity of semantic role meanings and (2) the constraint that a single semantic role is assigned to each syntactic argument.

While these resources are undoubtedly invaluable for empirical research on semantic processing, the current usage of semantic labels for SRL systems is questionable from a theoretical viewpoint. For example, most work on SRL has used PropBank's numerical role labels (Arg0 to Arg5). However, the meanings of these numbers depend on each verb in principle, and PropBank does not expect semantic consistency, namely on Arg2 to Arg5. Moreover, Yi et al. (2007) explicitly showed that Arg2 to Arg5 are semantically inconsistent. The reason why such labels have been used in SRL systems is that verb-specific roles generally have a small number of instances and are not suitable for learning. However, it is necessary to avoid using inconsistent labels, since those labels confuse machine learners and can be a cause of low accuracy in automatic processing. In addition, the clarity of the definition of roles is particularly important for users to know rationally how to use each role in their applications. For these reasons, well-organized and generalized labels grounded in linguistic characteristics are needed in practice. The semantic roles of FrameNet and VerbNet (Kipper et al., 2000) are used more consistently to some extent, but the definition of the roles is not given in a formal manner and their semantic characteristics are unclear.

Another, somewhat related, problem of existing annotation frameworks is that it is presupposed
that a single semantic role is assigned to each syntactic argument.1 In fact, one syntactic argument can play multiple roles in the event (or events) expressed by a verb. For example, Table 1 shows a sentence containing the verb throw and the semantic roles assigned to its arguments in each framework. The table shows that each framework assigns a single role, such as Arg0 and Agent, to each syntactic argument. However, we can acquire from this sentence the information that John is an agent of the throwing event (the Affection row), as well as a source of the movement event of the ball (the Movement row). Existing frameworks of assigning single roles simply ignore such information that verbs inherently have in their semantics. We believe that giving a clear definition of multiple argument roles would be beneficial not only as a theoretical framework but also for practical applications that require detailed meanings derived from secondary roles.

This issue is also related to fragmentation and the unclear definition of semantic roles in these frameworks. As we exemplify in this paper, multiple semantic characteristics are conflated in a single role label in these resources due to the manner of single-role assignment. This means that the semantic roles of existing resources are not monolithic and are inherently not mutually independent, but share some semantic characteristics.

The aim of this paper is more a theoretical discussion of role-labeling frameworks than the introduction of a new resource. We developed a framework of verb lexical semantics, which is an extension of the lexical conceptual structure (LCS) theory, and compare it with the existing frameworks used in VerbNet and FrameNet as annotation schemes for SRL. LCS is a decomposition-based approach to verb semantics and describes a meaning by composing a set of primitive predicates. The advantage of this approach is that the primitive predicates and their compositions are formally defined. As a result, we can give a strict definition of semantic roles by grounding them in the lexical semantic structures of verbs. In fact, we define semantic roles as argument slots in primitive predicates. With this approach, we demonstrate that some of the semantic characteristics that VerbNet and FrameNet informally or implicitly describe in their roles can be given formal definitions, and that multiple argument roles can be represented strictly and naturally by extending the LCS theory.

In the first half of this paper, we define our extended LCS framework and describe how it gives a formal definition of roles and solves the problem of multiple roles. In the latter half, we discuss the analysis of the empirical data we collected for 60 Japanese verbs and also discuss theoretical relationships with the frameworks of existing resources. We discuss in detail the relationships between our role labels and VerbNet's thematic roles. We also describe the relationship between our framework and FrameNet with regard to the definitions of the relationships between semantic frames.

1 To be precise, FrameNet permits multiple-role assignment, but it does not perform this systematically, as we show in Table 1. It mostly defines a single role label for a corresponding syntactic argument that plays multiple roles in several sub-events of a verb.

2 Related works

There have been several attempts in linguistics to assign multiple semantic properties to one argument. Gruber (1965) demonstrated the dispensability of the constraint that an argument takes only one semantic role, with some concrete examples. Rozwadowska (1988) suggested an approach of feature decomposition for semantic roles using her three features of change, cause, and sentient, and defined typical thematic roles by combining these features. This approach made it possible to classify semantic properties across thematic roles. However, Levin and Rappaport Hovav (2005) argued that the number of combinations of defined features is usually larger than the actual number of possible combinations; therefore, feature decomposition approaches should predict the possible feature combinations.

Culicover and Wilkins (1984) divided their roles into two groups, action and perceptional roles, and explained that dual assignment of roles always involves one role from each set. Jackendoff (1990) proposed an LCS framework for representing the meaning of a verb by using several primitive predicates. Jackendoff also stated that an LCS represents two tiers in its structure, an action tier and a thematic tier, which are similar to Culicover and Wilkins's two sets. Essentially, these two approaches distinguished roles related to action and change, and successfully restricted
combinations of roles by taking a role from each set. Dorr (1997) created an LCS-based lexical resource as an interlingual representation for machine translation. This framework was also used for text generation (Habash et al., 2003). However, the problem of multiple-role assignment was not completely solved in that resource. As a comparison of different semantic structures, Dorr (2001) and Hajicova and Kucerova (2002) analyzed the connection between LCS and PropBank roles, and showed that the mapping between LCS and PropBank roles is a many-to-many correspondence and that roles can be mapped only by comparing the whole argument structure of a verb. Habash and Dorr (2001) tried to map LCS structures onto thematic roles by using their thematic hierarchy.

3 Multiple role expression using lexical conceptual structure

Lexical conceptual structure is an approach to describing a generalized structure of an event or state represented by a verb. The meaning of a verb is represented as a structure composed of several primitive predicates. For example, the LCS structure for the verb throw is shown in Figure 1 and includes the predicates cause, affect, go, from, fromward, toward, locate, in, and at. The arguments of the primitive predicates are filled by the core arguments of the verb. This type of decomposition approach enables us to represent a case in which one syntactic argument fills multiple slots in the structure. In Figure 1, the argument i appears twice in the structure: as the first argument of affect and as the argument inside from.

Figure 1: LCS of the verb throw: cause(affect(i, j), go(j, [from(locate(in(i))), fromward(locate(at(k))), toward(locate(at(l)))])).
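To make the multi-slot idea concrete, the following is a minimal sketch (not part of the proposed framework or its released dictionary) that encodes the LCS of throw from Figure 1 as nested tuples and lists and collects, for each core argument, every primitive-predicate slot it fills; the data layout and function name are illustrative assumptions.

```python
# Illustrative sketch: an LCS term is (predicate, arg1, arg2, ...);
# leaf arguments are strings naming the verb's core arguments.
THROW_LCS = ("cause",
             ("affect", "i", "j"),
             ("go", "j", [("from", ("locate", ("in", "i"))),
                          ("fromward", ("locate", ("at", "k"))),
                          ("toward", ("locate", ("at", "l")))]))

def slots(term, out=None):
    """Collect the (predicate, position) slots filled by each core argument."""
    out = {} if out is None else out
    if isinstance(term, tuple):
        pred, *args = term
        for pos, arg in enumerate(args, start=1):
            if isinstance(arg, str):
                out.setdefault(arg, []).append((pred, pos))
            else:
                slots(arg, out)
    elif isinstance(term, list):
        for sub in term:
            slots(sub, out)
    return out

# slots(THROW_LCS)["i"] == [("affect", 1), ("in", 1)]: the same argument i
# fills the first slot of affect and the slot inside from/in, i.e. two roles.
```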
The primitives are designed to represent a full or partial action-change-state chain, which consists of a state, a change in or maintaining of a state, or an action that changes or maintains a state. Table 2 shows the primitives that play important roles in representing that chain. Some primitives embed other primitives as their arguments, and the semantics of the entire LCS structure is calculated according to the definition of each primitive. For instance, the LCS structure in Figure 1 represents the action changing the state of j, and the inner structure of the second argument of go represents the path of the change.

Predicates      Semantic Functions
state(x, y)     First argument is in the state specified by the second argument.
cause(x, y)     Action in the first argument causes the change specified in the second argument.
act(x)          First argument affects itself.
affect(x, y)    First argument affects the second argument.
react(x, y)     First argument affects itself, due to the effect from the second argument.
go(x, y)        First argument changes according to the path described in the second argument.
from(x)         Starting point of a change event.
fromward(x)     Direction of the starting point.
via(x)          Pass point of a change event.
toward(x)       Direction of the end point.
to(x)           End point of a change event.
along(x)        Linear-shaped path of a change event.

Table 2: Major primitive predicates and their semantic functions.

The overall definition of our extended LCS framework is shown in Figure 2.2 Basically, our definition is based on Jackendoff's LCS framework (1990), but we performed some simplifications and added some extensions. The modifications are made in order to increase the strictness and generality of the representation and also its coverage of the various verbs appearing in a corpus. The main differences between the two LCS frameworks are as follows. In our extended LCS framework, (i) the possible combinations of cause, act, affect, react, and go are clearly restricted, (ii) multiple actions or changes in an event can be described by introducing a combination function (comb for short), (iii) GO, STAY and INCH in Jackendoff's theory are incorporated into one function go, and (iv) most change-of-state events are represented as a metaphor using a spatial transition.

The idea of the comb function comes from a natural extension of Jackendoff's EXCH function. In our case, comb is not limited to describing a counter-transfer of the main event but can describe subordinate events occurring in relation to the main event.3 We can also describe multiple main events if the agent performs two or more actions simultaneously and all the actions are in focus (e.g., John exchanges A with B). This extension is simple, but essential for creating LCS structures for the predicates appearing in actual data. In our development of 60 Japanese predicates (verbs and verbal nouns) frequently appearing in the Kyoto University Text Corpus (KTC) (Kurohashi and Nagao, 1997), 37.6% of the frames included multiple events. By using the comb function, we can express complicated events with predicate decomposition and prevent missing (multiple) roles.

2 Here we omit the attributes taken by each predicate, in order to simplify the explanation. We also omit an explanation of lower-level primitives, such as the STATE and PLACE groups, which are not necessarily important for the topic of this paper.

3 In our extended LCS theory, we can describe multiple events in the semantic structure of a verb. However, a verb generally focuses on one of those events, and this makes a semantic variation among verbs such as buy, sell, and pay, as well as a difference in the syntactic behavior of the arguments. Therefore, the focused event should be distinguished from the others as lexical information. We express focused events as main formulae (formulae that are not surrounded by a comb function).
Figure 2: Description system of our LCS. Operators +, *, ? follow basic regular expression syntax; {} represents a choice of elements.
LCS = [ EVENT+  (comb EVENT)* ]
EVENT = { state(arg, STATE) | go(arg, PATH) | cause(act(arg1), go(arg1, PATH)) | cause(affect(arg1, arg2), go(arg2, PATH)) | cause(react(arg1, arg2), go(arg1, PATH)) }  manner(constant)?  mean(constant)?  instrument(constant)?  purpose(EVENT)*
STATE = { locate(PLACE) | orient(PLACE) | extent(PLACE) | connect(arg) }
PATH = [ from(STATE)?  fromward(STATE)?  via(STATE)?  toward(STATE)?  to(STATE)?  along(arg)? ]
PLACE = { in(arg) | on(arg) | cover(arg) | fit(arg) | inscribed(arg) | beside(arg) | around(arg) | near(arg) | inside(arg) | at(arg) }

Role          Description
Protagonist   Entity which is the viewpoint of the verb.
Theme         Entity whose state or change of state is mentioned.
State         Current state of a certain entity.
Actor         Entity which performs an action that changes/maintains its own state.
Effector      Entity which performs an action that changes/maintains the state of another entity.
Patient       Entity whose state is changed/maintained by another entity.
Stimulus      Entity which is the cause of the action.
Source        Starting point of a certain change event.
Source dir    Direction of the starting point.
Middle        Pass point of a certain change event.
Goal          End point of a certain change event.
Goal dir      Direction of the end point.
Route         Linear-shaped path of a certain change event.

Table 3: Semantic role list for the proposed extended LCS framework.

A key point in associating the LCS framework with the existing frameworks of semantic roles is that each primitive predicate of LCS represents a fundamental function in semantics. The functions of the arguments of the primitive predicates can be explained using generalized semantic roles such as typical thematic roles. In order to represent the semantic functions of the arguments in the LCS primitives simply, and to make it easier to compare our extended LCS framework with other SRL frameworks, we define a semantic role set that corresponds to the semantic functions of the primitive predicates in the LCS structure (Table 3). We employ role names similar to typical thematic roles in order to make it easy to compare the role sets, but the definitions are different. Also, due to the increased generality of the LCS representation, we obtain a clearer definition of the correspondence between LCS primitives and typical thematic roles than Jackendoff's predicates provide. Note that the core semantic information of a verb represented by the LCS framework is embodied directly in its LCS structure, and this information decreases if the structure is mapped to semantic roles; the mapping is just for contrasting thematic roles. Each role is given an obvious meaning and is designed to fit the upper-level primitives of the LCS structure, which are the arguments of the EVENT and PATH functions. In Table 4, we can see that these roles correspond almost one-to-one to the primitive arguments. One special role is Protagonist, which does not match an argument of a specific primitive. The Protagonist is assigned to the first argument in the main formula to distinguish that formula from the sub-formulae. There are 13 defined roles, and
this number is comparatively smaller than that in VerbNet. The discussion with regard to this number is given in the next section.

Predicate    1st arg      2nd arg
state        Theme        State
act          Actor
affect       Effector     Patient
react        Actor        Stimulus
go           Theme        PATH
from         Source
fromward     Source dir
via          Middle
toward       Goal dir
to           Goal
along        Route

Table 4: Correspondence between semantic roles and arguments of LCS primitives.

Essentially, the semantic functions of the arguments in LCS primitives are similar to those of traditional, or basic, thematic roles. However, there are two important differences. Our extended LCS framework principally guarantees that the primitive predicates do not contain any information concerning (i) selectional preference and (ii) complex structural relations of arguments. Primitives are designed to purely represent a function in an action-change-state chain; thus the information on selectional preference is annotated in a different layer: specifically, it is annotated directly on core arguments (e.g., we can annotate i with selPref(animate organization) in Figure 1). Also, the semantic function is already decomposed, and the structural relation among the arguments is represented as a structure of primitives in the LCS representation. Therefore, each argument slot of the primitive predicates does not include complicated meanings and represents a primitive semantic property which is highly functional. These characteristics are necessary to ensure the clarity of the semantic role meanings. We believe that even though certain types of complex semantic roles surely exist, it is reasonable to represent such roles based on decomposed properties.

In order to show an instance of our extended LCS theory, we constructed a dictionary of LCS structures for 60 Japanese verbs (including event nouns) using our extended LCS framework. The 60 verbs were the most frequent verbs in KTC after excluding the 100 most frequent ones.4 We created the dictionary looking at the instances of the target verbs in KTC. To increase the coverage of senses and case frames, we also consulted the online Japanese dictionary Digital Daijisen5 and the Kyoto University case frames (Kawahara and Kurohashi, 2006), a compilation of case frames automatically acquired from a huge web corpus. There were 97 constructed frames in the dictionary.

Role          Single   Multiple   Grow (%)
Theme         21       108        414
State         1        1          0
Actor         12       13         8.3
Effector      73       92         26
Patient       77       79         2.5
Stimulus      0        0          0
Source        11       44         300
Source dir    4        4          0
Middle        1        8          700
Goal          42       81         93
Goal dir      2        3          50
Route         2        2          0
w/o Theme     225      327        45
Total         246      435        77

Table 5: Number of appearances of each role.

Then we analyzed how many roles are additionally assigned by permitting multiple-role assignment (see Table 5). The numbers of assigned roles for single-role assignment are calculated by counting the roles that appear first for each target argument in the structure. Table 5 shows that the total number of assigned roles is 1.77 times larger than with single-role assignment. The main reason is an increase in Theme: under single-role assignment, Theme, in our sense, is always duplicated with Actor/Patient in action verbs, whereas LCS strictly divides the functions for action and change, so the duplicated Theme is correctly annotated. Moreover, we obtained a 45% increase even when we did not count the duplicated Theme. Most of the increase results from the increase in Source and Goal. For example, Effectors of transmission verbs are also annotated with a Source, and Effectors of movement verbs are sometimes annotated with Source or Goal.
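For clarity, the Grow (%) column in Table 5 is simply the relative increase from the single-role counts to the multiple-role counts. A minimal check (illustrative only, not from the paper's tooling):

```python
# Illustrative check of the Grow (%) column in Table 5.
def grow(single, multiple):
    return round((multiple - single) / single * 100, 1) if single else 0.0

# grow(21, 108) -> 414.3 (Theme); grow(246, 435) -> 76.8 (Total, reported as 77).
```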
4 We omitted the top 100 verbs since these most frequent ones contain a phonogram (Hiragana) form of a verb normally written with Kanji characters, and that phonogram form generally has huge ambiguity because many different verbs have the same pronunciation in Japanese.

5 Available at http://dictionary.goo.ne.jp/jn/.

4 Comparison with other resources

4.1 Number of semantic roles

The number of roles is related to the number of semantic properties represented in a framework and to the generality of those properties. Table 6 lists the number of semantic roles defined in our extended LCS framework, in VerbNet and in FrameNet.

Resource          Frame-independent   # of roles
LCS               yes                 13
VerbNet (v3.1)    yes                 30
FrameNet (r1.4)   no                  8884

Table 6: Number of roles in each resource.

There are two ways to define semantic roles. One is frame-specific, where the definition of each role depends on a specific lexical entry and such a role is never used in other frames. The other is frame-independent, which is to construct roles whose semantic function is generalized across all verbs. The number of roles in FrameNet is comparatively large because it defines roles in a frame-specific way: FrameNet respects the individual meanings of arguments rather than the generality of roles.

Compared with VerbNet, the number of roles defined in our extended LCS framework is less than half. However, this does not mean that the representational ability of our framework is lower than VerbNet's. We manually checked and listed a corresponding representation in our extended LCS framework for each thematic role in VerbNet in Table 7. This table does not provide a perfect or complete mapping between the roles in the two frameworks, because the mappings are not based on annotated data. However, we can roughly say that the VerbNet roles combine three types of information: a function of the argument in the action-change-state chain, selectional preference, and structural information about the arguments, which are in different layers in the LCS representation. VerbNet has many roles whose functions in the action-change-state chain are duplicated. For example, Destination, Recipient, and Beneficiary have the same property, the end state (Goal in LCS) of a changing event. The difference between such roles comes from a specific sub-type of the changing event (possession), selectional preference, and structural information among the arguments. By distinguishing such roles, VerbNet roles may take into account specific syntactic behaviors of certain semantic roles. Packing such complex information into semantic roles is useful for analyzing argument realization. However, from the viewpoint of semantic representation, the clarity of semantic properties provided by a predicate decomposition approach is beneficial. The 13 roles of the LCS approach are sufficient for identifying a function in the action-change-state chain. In our LCS framework, selectional preference can be assigned to arguments at the individual verb or verb-class level instead of to the role labels themselves, which maintains the generality of the semantic functions. In addition, our extended LCS framework can easily separate complex structural information from role labels because LCS directly represents a structure among the arguments; we can calculate that information from the LCS structure instead of coding it into role labels. As a result, our extended LCS framework maintains the generality of roles, and the number of roles is smaller than in the other frameworks.

4.2 Clarity of role meanings

We showed that the approach of predicate decomposition used in LCS theory clarifies the meanings of the roles assigned to syntactic arguments. Moreover, LCS achieves high generality of roles by separating selectional preference and structural information from role labels. The complex meaning of one syntactic argument is represented by multiple appearances of the argument in an LCS structure. For example, we show an LCS structure and a VerbNet frame for the verb buy in Figure 3. The LCS structure consists of four formulae; the first one is the main formula and the others are sub-formulae that represent co-occurring actions. The semantic-role-like representation of the structure, following Table 4, is: i = {Protagonist, Effector, Source, Goal}, j = {Patient, Theme}, k = {Effector, Source, Goal}, and l = {Patient, Theme}. Selectional preference is annotated on each argument as i: selPref(animate organization), j: selPref(any), k: selPref(animate organization), and l: selPref(valuable entity). If we want to represent information such as "Source of what?", we can extend the notation to Source(j) to refer to the changing object.
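As an illustration of this multi-role notation, the following is a minimal sketch of how the role assignment for buy could be stored and queried. The dictionary format, variable names, and function name below are assumptions made for illustration, not the format of the released resource.

```python
# Illustrative sketch of the multi-role annotation for "buy" (Figure 3).
# Each core argument carries every role it fills plus a selectional preference.
BUY_ARGS = {
    "i": {"roles": ["Protagonist", "Effector", "Source", "Goal"],
          "selPref": "animate organization"},                       # the buyer (John)
    "j": {"roles": ["Patient", "Theme"], "selPref": "any"},          # the goods
    "k": {"roles": ["Effector", "Source", "Goal"],
          "selPref": "animate organization"},                        # the seller (Mary)
    "l": {"roles": ["Patient", "Theme"], "selPref": "valuable entity"},  # the money
}

def arguments_with_role(role, args=BUY_ARGS):
    """Answer questions like 'Source of what?': which arguments carry a role."""
    return [name for name, info in args.items() if role in info["roles"]]

# arguments_with_role("Source") -> ["i", "k"]: the buyer is a Source of the
# money (Source(l)) and the seller is a Source of the goods (Source(j)).
```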
On the other hand, VerbNet combines multiple types of information into a single role, as mentioned above.
VerbNet role (# of uses)              Representation in LCS
Actor (9), Actor1 (9), Actor2 (9)     Actor or Effector in symmetric formulae in the structure
Agent (212)                           (Actor or Effector), Protagonist
Asset (6)                             Theme; Source of the change is (locate(in()) Protagonist); selPref(valuable entity)
Beneficiary (9)                       (peripheral role, or Goal locate(in())); selPref(animate organization); (Actor or Effector); the transferred entity is something beneficial
Cause (21)                            (Effector selPref(animate organization)), Stimulus, or peripheral role
Destination (32)                      Goal
Experiencer (24)                      Actor of react()
Instrument (25)                       (Effector selPref(animate organization)), or peripheral role
Location (45)                         (Theme, PATH roles, or peripheral role); selPref(location)
Material (6)                          Theme; Source of a change whose Goal is locate(fit()); the Goal fulfills selPref(physical object)
Patient (59), Patient1 (11)           Patient, Theme
Patient2 (11)                         (Source or Goal) of connect()
Predicate (23)                        Theme; (Goal locate(fit())); peripheral role
Product (7)                           Theme; (Goal locate(fit()), selPref(physical object))
Proposition (11)                      Theme
Recipient (33)                        Goal locate(in()); selPref(animate organization)
Source (34)                           Source
Theme (162)                           Theme
Theme1 (13), Theme2 (13)              Both are Theme, or Theme1 is Theme and Theme2 is State
Topic (18)                            Theme; selPref(knowledge information)

Table 7: Relationship of roles between VerbNet and our LCS framework. VerbNet roles that appear more than five times in frame definitions are analyzed. Each relationship shown here is only a partial and consistent part of the complete correspondence table; the complete mapping depends highly on each lexical entry (or verb class). Here, locate(in()) generally means possession or recognition.

Also, the meaning of some roles depends more on selectional preference or on the structure of the arguments than on a primitive function in the action-change-state chain. Such VerbNet roles are used for several different functions depending on the verbs and their alternations, and it is therefore difficult to capture decomposed properties from the role label without specific lexical knowledge. Moreover, some semantic functions, such as Mary being a Goal of the money in Figure 3, are completely discarded from the representation at the level of role labels.

Example: John bought a book from Mary for $10.
VerbNet: Agent V Theme {from} Source {for} Asset
  has_possession(start(E), Source, Theme), has_possession(end(E), Agent, Theme), transfer(during(E), Theme), cost(E, Asset)
LCS: cause(aff(i:John, j:a book), go(j, to(loc(in(i)))))
  comb cause(aff(i, l:$10), go(l, [from(loc(in(i))), to(loc(at(k:Mary)))]))
  comb cause(aff(k, j), go(j, [from(loc(in(k))), to(loc(at(i)))]))
  comb cause(aff(k, l), go(l, to(loc(in(k)))))

Figure 3: Comparison between the semantic predicate representation and the LCS structure of the verb buy.

There is another representation related to the argument meanings in VerbNet. This representation is a type of predicate decomposition using VerbNet's own set of predicates, which are referred to as semantic predicates. For example, the verb buy in Figure 3 has the predicates has_possession, transfer and cost for composing the meaning of its event structure. The thematic roles are the fillers of the predicates' arguments; thus the semantic predicates may implicitly provide additional functions to the roles and possibly represent multiple roles. Unfortunately, we cannot discover what each argument of the semantic predicates exactly means, since the definition of each predicate is not publicly available. A requirement for obtaining the implicit semantic functions from these semantic predicates is clearly defining how the roles (or functions) are calculated from these complex relations of semantic predicates.

FrameNet does not use semantic roles generalized among all verbs, nor does it represent
semantic properties of roles using a predicate decomposition approach, but defines specific roles for each conceptual event/state to represent a specific background of the roles in the event/state. However, at the same time, FrameNet defines several types of parent-child relations between most of the frames and between their roles; therefore, we may say that FrameNet implicitly describes a sort of decomposed property using roles in highly general or abstract frames and represents the inheritance of these semantic properties. One advantage of this approach is that the inheritance of a meaning between roles is controlled through the relations, which are carefully maintained by human effort, and is not restricted by the representation ability of the decomposition system. On the other hand, the only way to represent the generalized properties of a certain semantic role is to enumerate all inherited roles by tracing ancestors. Also, a semantic relation between the arguments in a certain frame, which is given by the LCS structure and by the semantic predicates of VerbNet, is only defined by a natural language description for each frame in FrameNet. From a CL point of view, we consider that, at least, a certain level of formalization of the semantic relations of arguments is important for utilizing this information in applications. The LCS approach, or an approach using a well-defined predicate decomposition, can explicitly describe semantic properties and relationships between arguments in a lexical structure. The primitive properties can be clearly defined, even though the representation ability is restricted by the generality of the roles.

Figure 4: LCS of the verbs get, buy, sell, pay, and collect and their relationships calculated from the structures. (Selectional preferences in the figure: i: selPref(animate organization), j: selPref(any), k: selPref(animate organization), l: selPref(valuable entity).)

Figure 5: The frame relations among the verbs get, buy, sell, pay, and collect in FrameNet.

In addition, the frame-to-frame relations in FrameNet may be a useful resource for some application tasks such as paraphrasing and entailment. We argue that some types of relationships between frames can be automatically calculated using the LCS approach. For example, one of the relations is based on an inclusion relation between two LCS structures. Figure 4 shows the automatically calculated relations surrounding the verb buy. Note that we chose a sense related to a commercial transaction, which means an exchange of goods and money, for each word in order to compare the resulting relation graph with that of FrameNet. We call the relations among buy, sell, pay and collect different viewpoints, since
they contain exactly the same formulae, and the only difference is the main formula. The relation between buy and get is defined as inheritance: a part of the child structure exactly equals the parent structure. Interestingly, the relations surrounding buy are similar to those in FrameNet (see Figure 5). We cannot describe all the types of relations we considered due to space limitations. However, the point is that these relationships are represented as rewriting rules between two LCS representations and can thus be calculated automatically. Moreover, the grounds for the relations remain clear, being based on concrete structural relations. A semantic relation construction for frames based on structural relationships is another possible application of LCS approaches, one that connects traditional LCS theories with resources representing a lexical network such as FrameNet.
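The inclusion-based relation test sketched above can be stated very compactly. The following is a minimal sketch under the simplifying assumption that a frame is just a set of formulae plus a designated main formula; the representation and names are illustrative, not those of the released dictionary.

```python
# Illustrative sketch of the frame-relation test described above:
# "different viewpoints" = same formula set, different main formula;
# "inheritance" = the parent's formulae are a subset of the child's.
def frame_relation(child, parent):
    """Each frame is (main_formula, frozenset_of_all_formulae)."""
    child_main, child_all = child
    parent_main, parent_all = parent
    if child_all == parent_all and child_main != parent_main:
        return "different viewpoints"   # e.g., buy vs. sell, pay, collect
    if parent_all <= child_all:
        return "inheritance"            # e.g., buy inherits from get
    return None                         # no relation derived from inclusion
```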
4.3 Consistency of semantic structures

Constructing an LCS dictionary is generally difficult work, since LCS has high flexibility in describing structures, and different people tend to write different structures for a single verb. We maintained the consistency of the dictionary by taking into account the similarity of the structures of verbs that are in paraphrasing or entailment relations. This idea was inspired by the automatic calculation of semantic relations of the lexicon mentioned above. We created an LCS structure for each lexical entry such that we can calculate semantic relations between related verbs, and thereby maintained a high level of consistency among the verbs.

Using our extended LCS theory, we successfully created 97 frames for 60 predicates without any extra modification. From this result, we believe that our extended theory is stable to some extent. On the other hand, we found that an extra extension of the LCS theory is needed for some verbs to explain the different syntactic behaviors of one verb. For example, a condition for a certain syntactic behavior of verbs related to reciprocal alternation (see class 2.5 of Levin (1993)), such as the Japanese verbs for "connect" and "integrate", cannot be explained without considering the number of entities in some arguments. Also, some verbs need to define an order of their internal events. For example, the Japanese verb meaning "shuttle" implies that going is the first action and coming back is the second action. These are not problems that are directly related to the semantic role annotation we focus on in this paper, but we plan to solve them with further extensions.

5 Conclusion

We discussed two problems in current labeling approaches for argument-structure analysis: the problems in the clarity of role meanings and in multiple-role assignment. By focusing on the fact that an approach of predicate decomposition is suitable for solving these problems, we proposed a new framework for semantic role assignment by extending Jackendoff's LCS framework. The statistics of our LCS dictionary for 60 Japanese verbs showed that 37.6% of the created frames included multiple events, and the number of assigned roles for one syntactic argument increased 77% over that in single-role assignment.

Compared to other resources such as VerbNet and FrameNet, the role definitions in our extended LCS framework are clearer, since the primitive predicates limit the meaning of each role to a function in the action-change-state chain. We also showed that LCS can separate three types of information, the functions represented by primitives, the selectional preference, and the structural relation of arguments, which are conflated in role labels in existing resources. As a potential of LCS, we demonstrated that several types of frame relations, which are similar to those in FrameNet, can be automatically calculated using the structural relations between LCSs. We still must perform a thorough investigation to enumerate the relations which can be represented in terms of rewriting rules for LCS structures. However, the automatic construction of a consistent relation graph of semantic frames may be possible based on lexical structures.

We believe that this kind of decomposed analysis will accelerate both fundamental and application research on argument-structure analysis. As future work, we plan to expand the dictionary and construct a corpus based on our LCS dictionary.

Acknowledgment

This work was partially supported by JSPS Grant-in-Aid for Scientific Research #22800078.
References

P.W. Culicover and W.K. Wilkins. 1984. Locality in Linguistic Theory. Academic Press.

Bonnie J. Dorr. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation, 12(4):271-322.

Bonnie J. Dorr. 2001. LCS database. http://www.umiacs.umd.edu/~bonnie/LCS_Database_Documentation.html.

Jeffrey S. Gruber. 1965. Studies in Lexical Relations. Ph.D. thesis, MIT.

N. Habash and B. Dorr. 2001. Large scale language independent generation using thematic hierarchies. In Proceedings of MT Summit VIII.

N. Habash, B. Dorr, and D. Traum. 2003. Hybrid natural language generation from lexical conceptual structures. Machine Translation, 18(2):81-128.

Eva Hajicova and Ivona Kucerova. 2002. Argument/valency structure in PropBank, LCS database and Prague Dependency Treebank: A comparative pilot study. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pages 846-851.

Ray Jackendoff. 1990. Semantic Structures. The MIT Press.

D. Kawahara and S. Kurohashi. 2006. Case frame compilation from the web using high-performance computing. In Proceedings of LREC-2006, pages 1344-1347.

Paul Kingsbury and Martha Palmer. 2002. From TreeBank to PropBank. In Proceedings of LREC-2002, pages 1989-1993.

Karin Kipper, Hoa Trang Dang, and Martha Palmer. 2000. Class-based construction of a verb lexicon. In Proceedings of the National Conference on Artificial Intelligence, pages 691-696. AAAI Press / MIT Press.

Sadao Kurohashi and Makoto Nagao. 1997. Kyoto University text corpus project. Proceedings of the Annual Conference of JSAI, 11:58-61.

Beth Levin and Malka Rappaport Hovav. 2005. Argument Realization. Cambridge University Press.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Lluís Màrquez, Xavier Carreras, Kenneth C. Litkowski, and Suzanne Stevenson. 2008. Semantic role labeling: an introduction to the special issue. Computational Linguistics, 34(2):145-159.

B. Rozwadowska. 1988. Thematic restrictions on derived nominals. In W. Wilkins, editor, Syntax and Semantics, volume 21, pages 147-165. Academic Press.

J. Ruppenhofer, M. Ellsworth, M.R.L. Petruck, C.R. Johnson, and J. Scheffczyk. 2006. FrameNet II: Extended Theory and Practice. Berkeley FrameNet Release, 1.

Szu-ting Yi, Edward Loper, and Martha Palmer. 2007. Can semantic roles generalize across genres? In Proceedings of HLT-NAACL 2007, pages 548-555.
Unsupervised Detection of Downward-Entailing Operators By
Maximizing Classification Certainty

Jackie CK Cheung and Gerald Penn


Department of Computer Science
University of Toronto
Toronto, ON, M5S 3G4, Canada
{jcheung,gpenn}@cs.toronto.edu

Abstract

We propose an unsupervised, iterative method for detecting downward-entailing operators (DEOs), which are important for deducing entailment relations between sentences. Like the distillation algorithm of Danescu-Niculescu-Mizil et al. (2009), the initialization of our method depends on the correlation between DEOs and negative polarity items (NPIs). However, our method trusts the initialization more and aggressively separates likely DEOs from spurious distractors and other words, unlike distillation, which we show to be equivalent to one iteration of EM prior re-estimation. Our method is also amenable to a bootstrapping method that co-learns DEOs and NPIs, and achieves the best results in identifying DEOs in two corpora.

1 Introduction

Reasoning about text has been a long-standing challenge in NLP, and there has been considerable debate both on what constitutes inference and on what techniques should be used to support inference. One task involving inference that has recently received much attention is that of recognizing textual entailment (RTE), in which the goal is to determine whether a hypothesis sentence can be entailed from a piece of source text (Bentivogli et al., 2010, for example).

An important consideration in RTE is whether a sentence or context produces an entailment relation for events that are a superset or subset of the original sentence (MacCartney and Manning, 2008). By default, contexts are upward-entailing, allowing reasoning from a set of events to a superset of events, as seen in (1). In the scope of a downward-entailing operator (DEO), however, this entailment relation is reversed, as in the scope of the classical DEO not (2). There are also operators which are neither upward- nor downward-entailing, such as the expression exactly three (3).

(1) She sang in French. => She sang. (upward-entailing)

(2) She did not sing in French. => She did not sing. (downward-entailing)

(3) Exactly three students sang. =/=> Exactly three students sang in French. (neither upward- nor downward-entailing)

Danescu-Niculescu-Mizil et al. (2009) (henceforth DLD09) proposed the first computational methods for detecting DEOs from a corpus. They proposed two unsupervised algorithms which rely on the correlation between DEOs and negative polarity items (NPIs), which by the definition of Ladusaw (1980) must appear in the context of DEOs. An example of an NPI is yet, as in the sentence "This project is not complete yet." The first baseline method proposed by DLD09 simply calculates the ratio of the relative frequencies of a word in NPI contexts versus in a general corpus, and the second is a distillation method which appears to refine the baseline ratios using a task-specific heuristic. Danescu-Niculescu-Mizil and Lee (2010) (henceforth DL10) extend this approach to Romanian, where a comprehensive list of NPIs is not available, by proposing a bootstrapping approach to co-learn DEOs and NPIs.

DLD09 are to be commended for having identified a crucial component of inference that nevertheless lends itself to a classification-based ap-
proach, as we will show. However, as noted sitional operator for proposition p, then an oper-
by DL10, the performance of the distillation ator is non-veridical if F p 6 p. Positive opera-
method is mixed across languages and in the tors such as past tense adverbials are veridical (4),
semi-supervised bootstrapping setting, and there whereas questions, negation and other DEOs are
is no mathematical grounding of the heuristic to non-veridical (5, 6).
explain why it works and whether the approach
can be refined or extended. This paper supplies the missing mathematical basis for distillation and shows that, while its intentions are fundamentally sound, the formulation of distillation neglects an important requirement: that the method not be easily distracted by other word co-occurrences in NPI contexts. We call our alternative certainty, which uses an unusual posterior classification confidence score (based on the max function) to favour single, definite assignments of DEO-hood within every NPI context. DLD09 actually speculated on the use of max as an alternative, but within the context of an EM-like optimization procedure that throws away its initial parameter settings too willingly. Certainty iteratively and directly boosts the scores of the currently best-ranked DEO candidates relative to the alternatives in a Naïve Bayes model, which thus pays more respect to the initial weights, constructively building on top of what the model already knows. This method proves to perform better on two corpora than distillation, and is more amenable to the co-learning of NPIs and DEOs. In fact, the best results are obtained by co-learning the NPIs and DEOs in conjunction with our method.

2 Related work

There is a large body of literature in linguistic theory on downward entailment and polarity items¹, of which we will only mention the most relevant work here. The connection between downward-entailing contexts and negative polarity items was noticed by Ladusaw (1980), who stated the hypothesis that NPIs must be grammatically licensed by a DEO. However, DEOs are not the sole licensors of NPIs, as NPIs can also be found in the scope of questions, certain numeric expressions (i.e., non-monotone quantifiers), comparatives, and conditionals, among others. Giannakidou (2002) proposes that the property shared by these constructions and downward entailment is non-veridicality. If F is a propositional operator, F is veridical if Fp entails p, and non-veridical otherwise; denial and questions, for instance, are non-veridical:

(4) She sang yesterday. ⊨ She sang.
(5) She denied singing. ⊭ She sang.
(6) Did she sing? ⊭ She sang.

While Ladusaw's hypothesis is thus accepted to be insufficient from a linguistic perspective, it is nevertheless a useful starting point for computational methods for detecting NPIs and DEOs, and has inspired successful techniques to detect DEOs, like the work by DLD09, DL10, and also this work. In addition to this hypothesis, we further assume that there should only be one plausible DEO candidate per NPI context. While there are counterexamples, this assumption is in practice very robust, and is a useful constraint for our learning algorithm. An analogy can be drawn to the one sense per discourse assumption in word sense disambiguation (Gale et al., 1992).

The related (and, as we will argue, more difficult) problem of detecting NPIs has also been studied, and in fact predates the work on DEO detection. Hoeksema (1997) performed the first corpus-based study of NPIs, predominantly for Dutch, and there has also been work on detecting NPIs in German which assumes linguistic knowledge of licensing contexts for NPIs (Lichte and Soehn, 2007). Richter et al. (2010) make this assumption as well as use syntactic structure to extract NPIs that are multi-word expressions. Parse information is an especially important consideration in freer-word-order languages like German, where a MWE may not appear as a contiguous string. In this paper, we explicitly do not assume detailed linguistic knowledge about licensing contexts for NPIs and do not assume that a parser is available, since neither of these is guaranteed when extending this technique to resource-poor languages.

¹ See van der Wouden (1997) for a comprehensive reference.
3 Distillation as EM Prior Re-estimation

Let us first review the baseline and distillation methods proposed by DLD09, then show that distillation is equivalent to one iteration of EM prior re-estimation in a Naïve Bayes generative probabilistic model, up to constant rescaling. The baseline method assigns a score to each word-type based on the ratio of its relative frequency within NPI contexts to its relative frequency within a general corpus. Suppose we are given a corpus C with extracted NPI contexts N, and that they contain tokens(C) and tokens(N) tokens respectively. Let y be a candidate DEO, count_C(y) be the unigram frequency of y in the corpus, and count_N(y) be the unigram frequency of y in N. Then, we define S(y) to be the ratio between the relative frequencies of y within NPI contexts and in the entire corpus²:

S(y) = \frac{\text{count}_N(y)/\text{tokens}(N)}{\text{count}_C(y)/\text{tokens}(C)}  (7)

The scores are then used as a ranking to determine word-types that are likely to be DEOs. This method approximately captures Ladusaw's hypothesis by highly ranking words that appear in NPI contexts more often than would be expected by chance. However, the problem with this approach is that DEOs are not the only words that co-occur with NPIs. In particular, there exist many piggybackers, which, as defined by DLD09, collocate with DEOs due to semantic relatedness or chance, and would thus incorrectly receive a high S(y) score.
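To make the baseline concrete, here is a minimal sketch (our own illustration, not DLD09's code) of equation (7), assuming the corpus and the extracted NPI contexts are available as plain lists of word tokens:

```python
from collections import Counter

def baseline_scores(corpus_tokens, npi_contexts):
    """S(y): relative frequency of y inside NPI contexts divided by its
    relative frequency in the whole corpus (equation 7)."""
    corpus_counts = Counter(corpus_tokens)
    npi_tokens = [w for context in npi_contexts for w in context]
    npi_counts = Counter(npi_tokens)
    n_corpus, n_npi = len(corpus_tokens), len(npi_tokens)
    return {
        y: (npi_counts[y] / n_npi) / (corpus_counts[y] / n_corpus)
        for y in npi_counts
    }
```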
Examples of piggybackers found by DLD09 include the proper noun Milken and the adverb vigorously, which collocate with DEOs like deny in the corpus they used. DLD09's solution to the piggybacker problem is a method that they term distillation. Let N_y be the NPI contexts that contain word y; i.e., N_y = \{c \in N \mid y \in c\}. In distillation, each word-type is given a distilled score according to the following equation:

S_d(y) = \frac{1}{|N_y|} \sum_{p \in N_y} \frac{S(y)}{\sum_{y' \in p} S(y')}  (8)

where p indexes the set of NPI contexts which contain y³, and the denominator is the number of NPI contexts which contain y.
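A corresponding sketch of the distilled score of equation (8), under the same assumptions as the baseline sketch above (S is the dictionary of baseline scores, and each context is treated as a set of word types):

```python
from collections import defaultdict

def distilled_scores(S, npi_contexts):
    """S_d(y): average, over the NPI contexts containing y, of y's share of
    the total baseline score in that context (equation 8)."""
    share_sum = defaultdict(float)   # running sum over contexts p containing y
    n_contexts = defaultdict(int)    # |N_y|
    for context in npi_contexts:
        types_in_p = set(context)
        total = sum(S[y] for y in types_in_p)
        for y in types_in_p:
            share_sum[y] += S[y] / total
            n_contexts[y] += 1
    return {y: share_sum[y] / n_contexts[y] for y in share_sum}
```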
Figure 1: Naïve Bayes formulation of DEO detection (latent DEO variable Y; observed context-word variables X).

DLD09 find that distillation seems to improve the performance of DEO detection in BLLIP. Later work by DL10, however, shows that distillation does not seem to improve performance over the baseline method in Romanian, and the authors also note that distillation does not improve performance in their experiments on co-learning NPIs and DEOs via bootstrapping.

A better mathematical grounding of the distillation method's apparent heuristic in terms of existing probabilistic models sheds light on the mixed performance of distillation across languages and experimental settings. In particular, it turns out that the distillation method of DLD09 is equivalent to one iteration of EM prior re-estimation in a Naïve Bayes model. Given a lexicon \mathcal{L} of L words, let each NPI context be one sample generated by the model. One sample consists of a latent categorical (i.e., a multinomial with one trial) variable Y whose values range over \mathcal{L}, corresponding to the DEO that licenses the context, and observed Bernoulli variables \vec{X} = X_{i=1 \ldots L} which indicate whether a word appears in the NPI context (Figure 1). This method does not attempt to model the order of the observed words, nor the number of times each word appears. Formally, a Naïve Bayes model is given by the following expression:

P(\vec{X}, Y) = \prod_{i=1}^{L} P(X_i \mid Y)\, P(Y)  (9)

The probability of a DEO given a particular NPI context is

P(Y \mid \vec{X}) \propto \prod_{i=1}^{L} P(X_i \mid Y)\, P(Y)  (10)

² DLD09 actually use the number of NPI contexts containing y rather than count_N(y), but we find that using the raw count works better in our experiments.
³ In DLD09, the corresponding equation does not indicate that p should be the contexts that include y, but it is clear from the surrounding text that our version is the intended meaning. If all the NPI contexts were included in the summation, S_d(y) would reduce to inverse relative frequency.
The probability of a set of observed NPI contexts N is the product of the probabilities for each sample:

P(N) = \prod_{\vec{X} \in N} P(\vec{X})  (11)

P(\vec{X}) = \sum_{y \in \mathcal{L}} P(\vec{X}, y)  (12)

We first instantiate the baseline method of DLD09 by initializing the parameters of the model, P(X_i = 1 \mid y) and P(Y = y), such that P(Y = y) is proportional to S(y). Recall that this initialization utilizes domain knowledge about the correlation between NPIs and DEOs, inspired by Ladusaw's hypothesis:

P(Y = y) = S(y) \Big/ \sum_{y'} S(y')  (13)

P(X_i = 1 \mid y) = \begin{cases} 1 & \text{if } X_i \text{ corresponds to } y \\ 0.5 & \text{otherwise} \end{cases}  (14)

This initialization of P(X_i = 1 \mid y) ensures that the value of y corresponds to one of the words in the NPI context, and the initialization of P(Y) is simply a normalization of S(y).

Since we are working in an unsupervised setting, there are no labels for Y available. A common and reasonable assumption about learning the parameter settings in this case is to find the parameters that maximize the likelihood of the observed training data; i.e., the NPI contexts:

\theta^{*} = \operatorname{argmax}_{\theta} P(N; \theta)  (15)

The EM algorithm is a well-known iterative algorithm for performing this optimization. Assuming that the prior P(Y = y) is a categorical distribution, the M-step estimate of these parameters after one iteration through the corpus is as follows:

P^{t+1}(Y = y) = \frac{\sum_{\vec{X} \in N} P^{t}(y \mid \vec{X})}{\sum_{y'} \sum_{\vec{X} \in N} P^{t}(y' \mid \vec{X})}  (16)

We do not re-estimate P(X_i = 1 \mid y) because their role is simply to ensure that the DEO responsible for an NPI context exists in the context. Estimating these parameters would exacerbate the problems with EM for this task, which we will discuss shortly.
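As an illustration of what this re-estimation looks like in practice, the following sketch (ours, not part of DLD09's system) performs one EM iteration of equation (16); with the fixed emission probabilities of equation (14), the E-step responsibility of y in a context reduces to its prior renormalized over the word types present in that context:

```python
from collections import defaultdict

def em_prior_update(prior, npi_contexts):
    """One EM iteration re-estimating P(Y = y) (equation 16). The prior
    argument is assumed to be the normalized S(y) of equation (13)."""
    expected_counts = defaultdict(float)
    for context in npi_contexts:
        types_in_p = set(context)
        z = sum(prior[y] for y in types_in_p)
        for y in types_in_p:
            expected_counts[y] += prior[y] / z        # E-step responsibility
    total = sum(expected_counts.values())
    return {y: c / total for y, c in expected_counts.items()}  # M-step
```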
P(Y) gives a prior probability that a certain word-type y is a DEO in an NPI context, without normalizing for the frequency of y in NPI contexts. Since we are interested in estimating the context-independent probability that y is a DEO, we must calculate the probability that a word is a DEO given that it appears in an NPI context. Let X_y be the observed variable corresponding to y. Then, the expression we are interested in is P(y \mid X_y = 1). We now show that P(y \mid X_y = 1) = P(y)/P(X_y = 1), and that this expression is equivalent to (8).

P(y \mid X_y = 1) = \frac{P(y, X_y = 1)}{P(X_y = 1)}  (17)

Recall that P(y, X_y = 0) = 0 because of the assumption that a DEO appears in the NPI context that it generates. Thus,

P(y, X_y = 1) = P(y, X_y = 1) + P(y, X_y = 0) = P(y)  (18)

One iteration of EM to calculate this probability is equivalent to the distillation method of DLD09. In particular, the numerator of (17), which we just showed to be equal to the estimate of P(Y) given by (16), is exactly the sum of the responsibilities for a particular y, and is proportional to the summation in (8) modulo normalization, because P(\vec{X} \mid y) is constant for all y in the context. The denominator P(X_y = 1) is simply the proportion of contexts containing y, which is proportional to |N_y|. Since both the numerator and denominator are equivalent up to a constant factor, an identical ranking is produced by distillation and EM prior re-estimation.

Unfortunately, the EM algorithm does not provide good results on this task. In fact, as more iterations of EM are run, the performance drops drastically, even though the corpus likelihood is increasing. The reason is that unsupervised EM learning is not constrained or biased towards learning a good set of DEOs. Rather, a higher data likelihood can be achieved simply by assigning high prior probabilities to frequent word-types.

This can be seen qualitatively by considering the top-ranking DEOs after several iterations of EM/distillation (Figure 2). The top-ranking words are simply function words or other words common in the corpus, which have nothing to do with downward entailment. In effect,
1 iteration    2 iterations    3 iterations
denies         the             the
denied         to              to
unaware        denied          that
longest        than            than
hardly         that            and
lacking        if              has
deny           has             if
nobody         denies          of
opposes        and             denied
highest        but             denies

Figure 2: Top 10 DEOs after iterations of EM on BLLIP.

EM/distillation overrides the initialization based on Ladusaw's hypothesis and finds another solution with a higher data likelihood. We will also provide a quantitative analysis of the effects of EM/distillation in Section 5.

4 Alternative to EM: Maximizing the Posterior Classification Certainty

We have seen that, in trying to solve the piggybacker problem, EM/distillation too readily abandons the initialization based on Ladusaw's hypothesis, leading to an incorrect solution. Instead of optimizing the data likelihood, what we need is a measure of the number of plausible DEO candidates there are in an NPI context, and a method that refines the scores towards having only one such plausible candidate per context. To this end, we define the classification certainty to be the product of the maximum posterior classification probabilities over the DEO candidates. For a set of hidden variables y^N for NPI contexts N, this is the expression:

\mathrm{Certainty}(y^N \mid N) = \prod_{\vec{X} \in N} \max_{y} P(y \mid \vec{X})  (19)

To increase this certainty score, we propose a novel iterative heuristic method for refining the baseline initializations of P(Y). Unlike EM/distillation, our method biases learning towards trusting the initialization, but refines the scores towards having only one plausible DEO per context in the training corpus. This is accomplished by treating the problem as a DEO classification problem, and then maximizing an objective ratio that favours one DEO per context. Our method is not guaranteed to increase classification certainty between iterations, but we will show that it does increase certainty very quickly in practice.

The key observation that allows us to resolve the tension between trusting the initialization and enforcing one DEO per NPI context is that the distributions of words that co-occur with DEOs and piggybackers are different, and that this difference follows from Ladusaw's hypothesis. In particular, while DEOs may appear with or without piggybackers in NPI contexts, piggybackers do not appear without DEOs in NPI contexts, because Ladusaw's hypothesis stipulates that a DEO is required to license the NPI in the first place. Thus, the presence of a high-scoring DEO candidate among otherwise low-scoring words is strong evidence that the high-scoring word is not a piggybacker and its high score from the initialization is deserved. Conversely, a DEO candidate which always appears in the presence of other strong DEO candidates is likely a piggybacker whose initial high score should be discounted.

We now describe our heuristic method that is based on this intuition. For clarity, we use scores rather than probabilities in the following explanation, though it is equally applicable to either. As in EM/distillation, the method is initialized with the baseline S(y) scores. One iteration of the method proceeds as follows. Let the score of the strongest DEO candidate in an NPI context p be:

M(p) = \max_{y \in p} S_h^{t}(y)  (20)

where S_h^{t}(y) is the score of candidate y at the t-th iteration according to this heuristic method.

Then, for each word-type y in each context p, we compare the current score of y to the scores of the other words in p. If y is currently the strongest DEO candidate in p, then we give y credit equal to the proportional change to M(p) if y were removed (context p without y is denoted p \setminus y). A large change means that y is the only plausible DEO candidate in p, while a small change means that there are other plausible DEO candidates. If y is not currently the strongest DEO candidate, it receives no credit:

\mathrm{cred}(p, y) = \begin{cases} \dfrac{M(p) - M(p \setminus y)}{M(p)} & \text{if } S_h^{t}(y) = M(p) \\ 0 & \text{otherwise} \end{cases}  (21)
NPI contexts: {A B C}, {B C}, {B C}, {D C}
Original scores: S(A) = 5, S(B) = 4, S(C) = 1, S(D) = 2
Updated scores:
S_h(A) = 5 * ((5 - 4)/5) / 1 = 1
S_h(B) = 4 * (0 + 2 * (4 - 1)/4) / 3 = 2
S_h(C) = 1 * (0 + 0 + 0 + 0) / 4 = 0
S_h(D) = 2 * ((2 - 1)/2) / 1 = 1

Figure 3: Example of one iteration of the certainty-based heuristic on four NPI contexts with four words in the lexicon.

Then, the average credit received by each y is a measure of how much we should trust the current score for y. The updated score for each DEO candidate is the original score multiplied by this average:

S_h^{t+1}(y) = \frac{S_h^{t}(y)}{|N_y|} \sum_{p \in N_y} \mathrm{cred}(p, y)  (22)
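Before turning to the re-normalization step, the update can be made concrete. The sketch below (our own illustration, assuming scores in a dictionary and contexts as lists of word types) applies equations (20)-(22) once; on the four toy contexts of Figure 3 it returns the scores 1, 2, 0 and 1 for A, B, C and D, matching the updated scores shown there.

```python
from collections import defaultdict

def certainty_iteration(S_h, npi_contexts):
    """One iteration of the certainty-based update (equations 20-22)."""
    credit = defaultdict(float)
    n_contexts = defaultdict(int)
    for p in npi_contexts:
        types_in_p = set(p)
        m = max(S_h[y] for y in types_in_p)                  # M(p)
        for y in types_in_p:
            n_contexts[y] += 1
            if S_h[y] == m:                                  # y is the strongest candidate
                m_without_y = max((S_h[w] for w in types_in_p if w != y),
                                  default=0.0)               # M(p \ y)
                credit[y] += (m - m_without_y) / m           # cred(p, y)
    return {y: S_h[y] * credit[y] / n for y, n in n_contexts.items()}

# Toy check against Figure 3:
# certainty_iteration({"A": 5, "B": 4, "C": 1, "D": 2},
#                     [["A", "B", "C"], ["B", "C"], ["B", "C"], ["D", "C"]])
# -> {"A": 1.0, "B": 2.0, "C": 0.0, "D": 1.0}
```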
The probability P^{t+1}(Y = y) is then simply S_h^{t+1}(y) normalized:

P^{t+1}(Y = y) = \frac{S_h^{t+1}(y)}{\sum_{y' \in \mathcal{L}} S_h^{t+1}(y')}  (23)

We iteratively reduce the scores in this fashion to get better estimates of the relative suitability of word-types as DEOs.

An example of this method and how it solves the piggybacker problem is given in Figure 3. In this example, we would like to learn that B and D are DEOs, A is a piggybacker, and C is a frequent word-type, such as a stop word. Using the original scores, piggybacker A would appear to be the most likely word to be a DEO. However, by noticing that it never occurs on its own with words that are unlikely to be DEOs (in the example, word C), our heuristic penalizes A more than B, and ranks B higher after one iteration. EM prior re-estimation would not correctly solve this example, as it would converge on a solution where C receives all of the probability mass because it appears in all of the contexts, even though it is unlikely to be a DEO according to the initialization.

5 Experiments

We evaluate the performance of these methods on the BLLIP corpus (30M words) and the AFP portion of the Gigaword corpus (338M words). Following DLD09, we define an NPI context to be all the words to the left of an NPI, up to the closest comma or semi-colon, and we removed NPI contexts which contain the most common DEOs, like not. We further removed all empty NPI contexts or those which only contain other punctuation. After this filtering, there were 26,696 NPI contexts in BLLIP and 211,041 NPI contexts in AFP, using the same list of 26 NPIs defined by DLD09.

We first define an automatic measure of performance that is common in information retrieval. We use average precision to quantify how well a system separates DEOs from non-DEOs. Given a list of known DEOs, G, and non-DEOs, the average precision of a ranked list of items, X, is defined by the following equation:

AP(X) = \frac{\sum_{k=1}^{n} P(X_{1 \ldots k}) \cdot \mathbb{1}(x_k \in G)}{|G|}  (24)

where P(X_{1 \ldots k}) is the precision of the first k items and \mathbb{1}(x_k \in G) is an indicator function which is 1 if x_k is in the gold standard list of DEOs and 0 otherwise.

DLD09 simply evaluated the top 150 output DEO candidates by their systems, and qualitatively judged the precision of the top-k candidates at various values of k up to 150. Average precision can be seen as a generalization of this evaluation procedure that is sensitive to the ranking of DEOs and non-DEOs. For development purposes, we use the list of 150 annotations by DLD09. Of these, 90 were DEOs, 30 were not, and 30 were classified as other (they were either difficult to classify, or were other types of non-veridical operators like comparatives or conditionals). We discarded the 30 other items and ignored all items not in the remaining 120 items when evaluating a ranked list of DEO candidates. We call this measure AP120.
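For completeness, a small sketch of this measure as we use it (our own code; gold_deos and gold_non_deos are assumed to be sets containing the annotated items, and candidates outside both sets are ignored):

```python
def average_precision(ranked, gold_deos, gold_non_deos):
    """AP of a ranked candidate list (equation 24); items outside the
    annotated gold lists are ignored, as in the AP-120 / AP-246 setting."""
    kept = [x for x in ranked if x in gold_deos or x in gold_non_deos]
    hits, ap = 0, 0.0
    for k, x in enumerate(kept, start=1):
        if x in gold_deos:
            hits += 1
            ap += hits / k          # precision of the first k items
    return ap / len(gold_deos)
```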
absolve, abstain, banish, bereft, boycott, caution, clear, coy, delay, denial, desist, devoid, disavow, discount, dispel, disqualify, downplay, exempt, exonerate, foil, forbid, forego, impossible, inconceivable, irrespective, limit, mitigate, nip, noone, omit, outweigh, precondition, pre-empt, prerequisite, refute, remove⁵, repel, repulse, scarcely, scotch, scuttle, seldom, sensitive, shy, sidestep, snuff, thwart, waive, zero-tolerance

Figure 4: Lemmata of DEOs identified in this work not found by DLD09.

In addition, we annotated DEO candidates from the top-150 rankings produced by our certainty-based heuristic on BLLIP, and also by the distillation and heuristic methods on AFP, in order to better evaluate the final output of the methods. This produced an additional 68 DEOs (narrowly defined) (Figure 4), 58 non-DEOs, and 31 other items⁴. Adding the DEOs and non-DEOs we found to the 120 items from above, we have an expanded list of 246 items to rank, and a corresponding average precision which we call AP246.

We employ the frequency cut-offs used by DLD09 for sparsity reasons. A word-type must appear at least 10 times in an NPI context and 150 times in the corpus overall to be considered. We treat BLLIP as a development corpus and use AP120 on AFP to determine the number of iterations to run our heuristic (5 iterations for BLLIP and 13 iterations for AFP). We run EM/distillation for one iteration in development and testing, because more iterations hurt performance, as explained in Section 3.

We first report the AP120 results of our experiments on the BLLIP corpus (Table 1, second column). Our method outperforms both EM/distillation and the baseline method. These results are replicated on the final test set from AFP using the full set of annotations AP246 (Table 1, third column). Note that the scores are lower when using all the annotations because there are more non-DEOs relative to DEOs in this list, making the ranking task more challenging.

Method          BLLIP AP120    AFP AP246
Baseline        .879           .734
Distillation    .946           .785
This work       .955           .809

Table 1: Average precision results on the BLLIP and AFP corpora.

A better understanding of the algorithms can be obtained by examining the data likelihood and the classification certainty at each iteration of the algorithms (Figure 5). Whereas EM/distillation maximizes the former expression, the certainty-based heuristic method actually decreases data likelihood for the first couple of iterations before increasing it again. In terms of classification certainty, EM/distillation converges to a lower classification certainty score compared to our heuristic method. Thus, our method better captures the assumption of one DEO per NPI context.

6 Bootstrapping to Co-Learn NPIs and DEOs

The above experiments show that the heuristic method outperforms the EM/distillation method given a list of NPIs. We would like to extend this result to novel domains, corpora, and languages. DLD09 and DL10 proposed the following bootstrapping algorithm for co-learning NPIs and DEOs given a much smaller list of NPIs as a seed set.

1. Begin with a small set of seed NPIs
2. Iterate:
   (a) Use the current list of NPIs to learn a list of DEOs
   (b) Use the current list of DEOs to learn a list of NPIs

Interestingly, DL10 report that while this method works in Romanian data, it does not work in the English BLLIP corpus. They speculate that the reason might be due to the nature of the English DEO any, which can occur in all classes of DE contexts according to an analysis by Haspelmath (1997). Further, they find that in Romanian, distillation does not perform better than the baseline method during Step (2a).

⁴ The complete list will be made publicly available.
⁵ We disagree with DLD09 that remove is not downward-entailing; e.g., The detergent removed stains from his clothing. ⊨ The detergent removed stains from his shirts.
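A sketch of the bootstrapping loop above, with the specific components passed in as callables since they are described elsewhere in the paper (NPI-context and DEO-context extraction, the DEO detector of Sections 3-4, and the NPI ranking of equation (25) given in the next section); the parameter choices follow those reported later in this section and are otherwise assumptions of this sketch:

```python
def bootstrap(sentences, corpus_tokens, seed_npis,
              learn_deos, rank_npis, extract_left, extract_right,
              iterations=11):
    """Co-learning of NPIs and DEOs (Section 6). learn_deos returns a ranked
    list of DEO candidates from NPI contexts; rank_npis returns a ranked list
    of NPI candidates from DEO contexts; extract_left/extract_right return the
    words to the left of an NPI / right of a DEO up to a comma or semi-colon."""
    npis, deos = list(seed_npis), []
    for t in range(iterations):
        # Step (2a): NPI contexts -> ranked DEO candidates.
        npi_contexts = extract_left(sentences, npis)
        ranked_deos = learn_deos(corpus_tokens, npi_contexts)
        deos = ranked_deos[:25 + 5 * t]        # top 25 DEOs plus 5 per iteration
        # Step (2b): DEO contexts -> ranked (pseudo-)NPI candidates.
        deo_contexts = extract_right(sentences, deos)
        ranked_npis = rank_npis(corpus_tokens, deo_contexts)
        npis += [x for x in ranked_npis if x not in npis][:5]  # add top 5 new NPIs
    return deos, npis
```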
(a) Data log likelihood. (b) Log classification certainty probabilities.

Figure 5: Log likelihood and classification certainty probabilities of NPI contexts in two corpora, plotted against the number of iterations (y-axis: log probability). Thinner lines near the top are for BLLIP; thicker lines for AFP. Blue dotted: baseline; red dashed: distillation; green solid: our certainty-based heuristic method. P(X|y) probabilities are not included since they would only result in a constant offset in the log domain.

While this linguistic explanation may certainly be a factor, we raise a second possibility: that the distillation algorithm itself may be responsible for these results. As evidence, we show that the heuristic algorithm is able to work in English with just the single seed NPI any, and in fact the bootstrapping approach in conjunction with our heuristic even outperforms the above approaches when using a static list of NPIs.

In particular, we use the methods described in the previous sections for Step (2a), and the following ratio to rank NPI candidates in Step (2b), corresponding to the baseline method to detect DEOs in reverse:

T(x) = \frac{\text{count}_D(x)/\text{tokens}(D)}{\text{count}_C(x)/\text{tokens}(C)}  (25)

Here, count_D(x) refers to the number of occurrences of NPI candidate x in DEO contexts D, defined to be the words to the right of a DEO operator up to a comma or semi-colon. We do not use the EM/distillation or heuristic methods in Step (2b). Learning NPIs from DEOs is a much harder problem than learning DEOs from NPIs. Because DEOs (and other non-veridical operators) license NPIs, the majority of occurrences of NPIs will be in the context of a DEO, modulo ambiguity of DEOs such as the free-choice any and other spurious correlations such as piggybackers, as discussed earlier. In the other direction, it is not the case that DEOs always or nearly always appear in the context of an NPI. Rather, the most common collocations of DEOs are the selectional preferences of the DEO, such as common arguments to verbal DEOs, prepositions that are part of the subcategorization of the DEO, and words that together with the surface form of the DEO comprise an idiomatic expression or multi-word expression. Further, NPIs are more likely to be composed of multiple words, while many DEOs are single words, possibly with PP subcategorization requirements which can be filled in post hoc. Because of these issues, we cannot trust the initialization to learn NPIs nearly as much as with DEOs, and cannot use the distillation or certainty methods for this step. Rather, the hope is that learning a noisy list of pseudo-NPIs, which often occur in negative contexts but may not actually be NPIs, can still improve the performance of DEO detection.

There are a number of parameters to the method, which we tuned to the BLLIP corpus using AP120. At the end of Step (2a), we use the current top 25 DEOs plus 5 per iteration as the DEO list for the next step.
To the initial seed NPI of any, we add the top 5 ranking NPI candidates at the end of Step (2b) in each subsequent iteration. We ran the bootstrapping algorithm for 11 iterations for all three algorithms. The final evaluation was done on AFP using AP246.

The results show that bootstrapping can indeed improve performance, even in English (Table 2). Using bootstrapping to co-learn NPIs and DEOs actually results in better performance than specifying a static list of NPIs. The certainty-based heuristic in particular achieves gains with bootstrapping in both corpora, in contrast to the baseline and distillation methods. Another factor that we found to be important is to add a sufficient number of NPIs to the NPI list each iteration, as adding too few NPIs results in only a small change in the NPI contexts available for DEO detection. DL10 only added one NPI per iteration, which may explain why they did not find any improvement with bootstrapping in English. It also appears that learning the pseudo-NPIs does not hurt performance in detecting DEOs, and further, that a number of true NPIs are learned by our method (Figure 6).

Method          BLLIP AP120      AFP AP246
Baseline        .889 (+.010)     .729 (-.005)
Distillation    .930 (-.016)     .804 (+.019)
This work       .962 (+.007)     .821 (+.012)

Table 2: Average precision results with bootstrapping on the BLLIP and AFP corpora. Absolute gain in average precision compared to using a fixed list of NPIs given in brackets.

anymore, anything, anytime, avail, bother, bothered, budge, budged, countenance, faze, fazed, inkling, iota, jibe, mince, nor, whatsoever, whit

Figure 6: Probable NPIs found by bootstrapping using the certainty-based heuristic method.

7 Conclusion

We have proposed a novel unsupervised method for discovering downward-entailing operators from raw text based on their co-occurrence with negative polarity items. Unlike the distillation method of DLD09, which we show to be an instance of EM prior re-estimation, our method directly addresses the issue of piggybackers, which spuriously correlate with NPIs but are not downward-entailing. This is achieved by maximizing the posterior classification certainty of the corpus in a way that respects the initialization, rather than maximizing the data likelihood as in EM/distillation. Our method outperforms distillation and a baseline method on two corpora as well as in a bootstrapping setting where NPIs and DEOs are jointly learned. It achieves the best performance in the bootstrapping setting, rather than when using a fixed list of NPIs. The performance of our algorithm suggests that it is suitable for other corpora and languages.

Interesting future research directions include detecting DEOs of more than one word as well as distinguishing the particular word sense and subcategorization that is downward-entailing. Another problem that should be addressed is the scope of the downward entailment, generalizing work being done in detecting the scope of negation (Councill et al., 2010, for example).

Acknowledgments

We would like to thank Cristian Danescu-Niculescu-Mizil for his help with replicating his results on the BLLIP corpus. This project was supported by the Natural Sciences and Engineering Research Council of Canada.

References

Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa T. Dang, and Danilo Giampiccolo. 2010. The sixth PASCAL recognizing textual entailment challenge. In The Text Analysis Conference (TAC 2010).

Isaac G. Councill, Ryan McDonald, and Leonid Velikovich. 2010. What's great and what's not: Learning to classify the scope of negation for improved sentiment analysis. In Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, pages 51–59. Association for Computational Linguistics.

Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2010. Don't have a clue?: Unsupervised co-learning of downward-entailing operators. In Proceedings of the ACL 2010 Conference Short Papers, pages 247–252. Association for Computational Linguistics.

Cristian Danescu-Niculescu-Mizil, Lillian Lee, and Richard Ducott. 2009. Without a doubt?: Unsupervised discovery of downward-entailing operators. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
William A. Gale, Kenneth W. Church, and David Yarowsky. 1992. One sense per discourse. In Proceedings of the Workshop on Speech and Natural Language, pages 233–237. Association for Computational Linguistics.

Anastasia Giannakidou. 2002. Licensing and sensitivity in polarity items: from downward entailment to nonveridicality. CLS, 38:29–53.

Martin Haspelmath. 1997. Indefinite pronouns. Oxford University Press.

Jack Hoeksema. 1997. Corpus study of negative polarity items. IV–V Jornades de corpus linguistics 1996–1997.

William A. Ladusaw. 1980. On the notion 'affective' in the analysis of negative-polarity items. Journal of Linguistic Research, 1(2):1–16.

Timm Lichte and Jan-Philipp Soehn. 2007. The retrieval and classification of negative polarity items using statistical profiles. Roots: Linguistics in Search of Its Evidential Base, pages 249–266.

Bill MacCartney and Christopher D. Manning. 2008. Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd International Conference on Computational Linguistics.

Frank Richter, Fabienne Fritzinger, and Marion Weller. 2010. Who can see the forest for the trees? Extracting multiword negative polarity items from dependency-parsed text. Journal for Language Technology and Computational Linguistics, 25:83–110.

Ton van der Wouden. 1997. Negative Contexts: Collocation, Polarity and Multiple Negation. Routledge.
Elliphant: Improved Automatic Detection of Zero Subjects and Impersonal Constructions in Spanish

Luz Rello (NLP and Web Research Groups, Univ. Pompeu Fabra, Barcelona, Spain)
Ricardo Baeza-Yates (Yahoo! Research, Barcelona, Spain)
Ruslan Mitkov (Research Group in Computational Linguistics, Univ. of Wolverhampton, UK)

Abstract

In pro-drop languages, the detection of explicit subjects, zero subjects and non-referential impersonal constructions is crucial for anaphora and co-reference resolution. While the identification of explicit and zero subjects has attracted the attention of researchers in the past, the automatic identification of impersonal constructions in Spanish has not been addressed yet, and this work is the first such study. In this paper we present a corpus to underpin research on the automatic detection of these linguistic phenomena in Spanish and a novel machine learning-based methodology for their computational treatment. This study also provides an analysis of the features, discusses performance across two different genres and offers an error analysis. The evaluation results show that our system performs better in detecting explicit subjects than alternative systems.

1 Introduction

Subject ellipsis is the omission of the subject in a sentence. We consider not only the missing referential subject (zero subject) as a manifestation of ellipsis, but also non-referential impersonal constructions.

Various natural language processing (NLP) tasks benefit from the identification of elliptical subjects, primarily anaphora resolution (Mitkov, 2002) and co-reference resolution (Ng and Cardie, 2002). The difficulty in detecting missing subjects and non-referential pronouns has been acknowledged since the first studies on the computational treatment of anaphora (Hobbs, 1977; Hirst, 1981). However, this task is of crucial importance when processing pro-drop languages, since subject ellipsis is a pervasive phenomenon in these languages (Chomsky, 1981). For instance, in our Spanish corpus, 29% of the subjects are elided.

Our method is based on the classification of all expressions in subject position, including the recognition of Spanish non-referential impersonal constructions which, to the best of our knowledge, has not yet been addressed. The necessity of identifying this kind of elliptical construction has been specifically highlighted in work on Spanish zero pronouns (Ferrández and Peral, 2000) and co-reference resolution (Recasens and Hovy, 2009).

The main contributions of this study are:

- A public annotated corpus in Spanish to compare different strategies for detecting explicit subjects, zero subjects and impersonal constructions.
- The first ML-based approach to this problem in Spanish and a thorough analysis regarding features, learnability, genre and errors.
- The best performing algorithms to automatically detect explicit subjects and impersonal constructions in Spanish.

* This work was partially funded by a La Caixa grant for master students.

The remainder of the paper is organized as follows. Section 2 describes the classes of Spanish subjects, while Section 3 provides a literature review. Section 4 describes the creation and the annotation of the corpus, and in Section 5 the machine learning (ML) method is presented. The analysis of the features, the learning curves, the

genre impact and the error analysis are all detailed 3 Related Work
in Section 6. Finally, in Section 7, conclusions
Identification of non-referential pronouns, al-
are drawn and plans for future work are discussed.
though a crucial step in co-reference and anaphora
This work is an extension of the first author mas-
resolution systems (Mitkov, 2010),2 has been ap-
ters thesis (Rello, 2010) and a preliminary ver-
plied only to the pleonastic it in English (Evans,
sion of the algorithm was presented in Rello et al.
2001; Boyd et al., 2005; Bergsma et al., 2008)
(2010).
and expletive pronouns in French (Danlos, 2005).
2 Classes of Spanish Subjects Machine learning methods are known to perform
better than rule-based techniques for identifying
Literature related to ellipsis in NLP (Ferrandez non-referential expressions (Boyd et al., 2005).
and Peral, 2000; Rello and Illisei, 2009a; Mitkov, However, there is some debate as to which ap-
2010) and linguistic theory (Bosque, 1989; Bru- proach may be optimal in anaphora resolution
cart, 1999; Real Academia Espanola, 2009) has systems (Mitkov and Hallett, 2007).
served as a basis for establishing the classes of Both English and French texts use an ex-
this work. plicit word, with some grammatical information
Explicit subjects are phonetically realized and (a third person pronoun), which is non-referential
their syntactic position can be pre-verbal or post- (Mitkov, 2010). By contrast, in Spanish, non-
verbal. In the case of post-verbal subjects (a), the referential expressions are not realized by exple-
syntactic position is restricted by some conditions tive or pleonastic pronouns but rather by a certain
(Real Academia Espanola, 2009). kind of ellipsis. For this reason, it is easy to mis-
(a) Careceran de validez las disposiciones que con- take them for zero pronouns, which are, in fact,
tradigan otra de rango superior.1 referential.
The dispositions which contradict higher range Previous work on detecting Spanish subject el-
ones will not be valid. lipsis focused on distinguishing verbs with ex-
plicit subjects and verbs with zero subjects (zero
Zero subjects (b) appear as the result of a nomi- pronouns), using rule-based methods (Ferrandez
nal ellipsis. That is, a lexical element the elliptic and Peral, 2000; Rello and Illisei, 2009b). The
subject, which is needed for the interpretation of Ferrandez and Peral algorithm (2000) outper-
the meaning and the structure of the sentence, is forms the (Rello and Illisei, 2009b) approach
elided; therefore, it can be retrieved from its con- with 57% accuracy in identifying zero subjects.
text. The elision of the subject can affect the en- In (Ferrandez and Peral, 2000), the implementa-
tire noun phrase and not just the noun head when tion of a zero subject identification and resolution
a definite article occurs (Brucart, 1999). module forms part of an anaphora resolution sys-
(b) Fue refrendada por el pueblo espanol. tem.
(It) was countersigned by the people of Spain.
ML based studies on the identification of
explicit non-referential constructions in English
The class of impersonal constructions is present accuracies of 71% (Evans, 2001), 87.5%
formed by impersonal clauses (c) and reflex- (Bergsma et al., 2008) and 88% (Boyd et al.,
ive impersonal clauses with particle se (d) (Real 2005), while 97.5% is achieved for French (Dan-
Academia Espanola, 2009). los, 2005). However, in these languages, non-
referential constructions are explicit and not omit-
(c) No hay matrimonio sin consentimiento.
ted which makes this task more challenging for
(There is) no marriage without consent.
Spanish.
(d) Se estara a lo que establece el apartado siguiente.
(It) will be what is established in the next section. 4 Corpus
1
All the examples provided are taken from our corpus. We created and annotated a corpus composed
In the examples, explicit subjects are presented in italics. of legal texts (law) and health texts (psychiatric
Zero subjects are presented by the symbol and in the En-
2
glish translations the subjects which are elided in Spanish are In zero anaphora resolution, the identification of zero
marked with parentheses. Impersonal constructions are not anaphors first requires that they be distinguished from non-
explicitly indicated. referential impersonal constructions (Mitkov, 2010).

707
papers) originally written in peninsular Spanish. for each of the three categories is shown against
The corpus is named after its annotated content the thirteen annotation tags to which they belong
Explicit Subjects, Zero Subjects and Impersonal (Table 1).
Constructions (ESZIC es Corpus). Afterwards, each of the tags are grouped in one
To the best of our knowledge, the existing cor- of the three main classes.
pora annotated with elliptical subjects belong to
other genres. The Blue Book (handbook) and Explicit subjects: [- elliptic, + referential].
Lexesp (journalistic texts) used in (Ferrandez and
Peral, 2000) contain zero subjects but not imper- Zero subjects: [+ elliptic, + referential].
sonal constructions. On the other hand, the Span-
Impersonal constructions: [+ elliptic, - refer-
ish AnCora corpus based on journalistic texts in-
ential].
cludes zero pronouns and impersonal construc-
tions (Recasens and Mart, 2010) while the Z- Of these annotated verbs, 71% have an explicit
corpus (Rello and Illisei, 2009b) comprises legal, subject, 26% have a zero subject and 3% belong
instructional and encyclopedic texts but has no an- to an impersonal construction (see Table 2).
notated impersonal constructions.
The ESZIC corpus contains a total of 6,827 Number of instances Legal Health All
verbs including 1,793 zero subjects. Except for Explicit subjects 2,739 2,116 4,855
AnCora-ES, with 10,791 elliptic pronouns, our Zero subjects 619 1,174 1,793
corpus is larger than the ones used in previous ap- Impersonals 71 108 179
proaches: about 1,830 verbs including zero and Total 3,429 3,398 6,827
explicit subjects in (Ferrandez and Peral, 2000) Table 2: Instances per class in ESZIC Corpus.
(the exact number is not mentioned in the pa-
per) and 1,202 zero subjects in (Rello and Illisei, To measure inter-annotator reliability we use
2009b). Fleiss Kappa statistical measure (Fleiss, 1971).
The corpus was parsed by Connexors Ma- We extracted 10% of the instances of each of the
chinese Syntax (Connexor Oy, 2006), which re- texts of the corpus covering the two genres.
turns lexical and morphological information as
well as the dependency relations between words Fleiss Kappa Legal Health All
by employing a functional dependency grammar Two Annotators 0.934 0.870 0.902
(Tapanainen and Jarvinen, 1997). Three Annotators 0.925 0.857 0.891
To annotate our corpus we created an annota-
Table 3: Inter-annotator Agreement.
tion tool that extracts the finite clauses and the
annotators assign to each example one of the de-
In Table 3 we present the Fleiss kappa inter-
fined annotation tags. Two volunteer graduate stu-
annotator agreement for two and three annota-
dents of linguistics annotated the verbs after one
tors. These results suggest that the annotation
training session. The annotations of a third volun-
is reliable since it is common practice among re-
teer with the same profile were used to compute
searchers in computational linguistics to consider
the inter-annotator agreement. During the anno-
0.8 as a minimum value of acceptance (Artstein
tation phase, we evaluated the adequacy and clar-
and Poesio, 2008).
ity of the annotation guidelines and established a
typology of the rising borderline cases, which is 5 Machine Learning Approach
included in the annotation guidelines.
Table 1 shows the linguistic and formal criteria We opted for an ML approach given that our
used to identify the chosen categories that served previous rule-based methodology improved only
as the basis for the corpus annotation. For each 0.02 over the 0.55 F-measure of a simple base-
tag, in addition to the two criteria that are crucial line (Rello and Illisei, 2009b). Besides, ML based
for identifying subject ellipsis ([ elliptic] and methods for the identification of explicit non-
[ referential]) a combination of syntactic, se- referential constructions in English appear to per-
mantic and discourse knowledge is also encoded form better than than rule-based ones (Boyd et al.,
during the annotation. The linguistic motivation 2005).

708
L INGUISTIC INFORMATION P HONETIC S YNTACTIC V ERBAL S EMANTIC D ISCOURSE
R EALIZATION CATEGORY D IATHESIS I NTERPR .
Annotation Annotation Elliptic Ell. noun Nominal Active Active Referential
Categories Tags noun phrase subject participant subject
phrase head
Explicit subject + + + +
Explicit Reflex passive + + +
subject subject
Passive subject + +
Omitted subject + + + + +
Omitted subject + + + + +
head
Non-nominal + + +
subject
Zero Reflex passive + + + +
subject omitted subject
Reflex pass. omit- + + + +
ted subject head
Reflex pass. non- + +
nominal subject
Passive omitted + + +
subject
Pass. non-nominal +
subject
Impersonal Reflex imp. clause n/a n/a
construction (with se)
Imp. construction n/a + n/a
(without se)

Table 1: ESZIC Corpus Annotation Tags.

5.1 Features complex conjunction, clauses starting with a


We built the training data from the annotated cor- simple conjunction, and clauses introduced
pus and defined fourteen features. The linguisti- using punctuation marks (commas, semi-
cally motivated features are inspired by previous colons, etc). We implemented a method
ML approaches in Chinese (Zhao and Ng, 2007) to identify these different types of clauses,
and English (Evans, 2001). The values for the fea- as the parser does not explicitly mark the
tures (see Table 4) were derived from information boundaries of clauses within sentences. The
provided both by Connexors Machinese Syntax method took into account the existence of a
parser and a set of lists. finite verb, its dependencies, the existence of
We can describe each of the features as broadly conjunctions and punctuation marks.
belonging to one of ten classes, as follows:
3 LEMMA: lexical information extracted from
1 PARSER: the presence or absence of a sub- the parser, the lemma of the finite verb.
ject in the clause, as identified by the parser.
We are not aware of a formal evaluation of 4-5 NUMBER, PERSON: morphological infor-
Connexors accuracy. It presents an accu- mation of the verb, its grammatical number
racy of 74.9% evaluated against our corpus and its person.
and we used it as a simple baseline.
6 AGREE: feature which encodes the tense,
2 CLAUSE: the clause types considered are: mood, person, and number of the verb in the
main clauses, relative clauses starting with a clause, and its agreement in person, number,

709
Feature Definition Value
1 PARSER Parsed subject True, False
2 CLAUSE Clause type Main, Rel, Imp, Prop, Punct
3 LEMMA Verb lemma Parsers lemma tag
4 NUMBER Verb morphological number SG, PL
5 PERSON Verb morphological person P1, P2, P3
6 AGREE Agreement in person, number, tense FTFF, TTTT, FFFF, TFTF, TTFF, FTFT, FTTF, TFTT,
and mood FFFT, TTTF, FFTF, TFFT, FFTT, FTTT, TFFF, TTFT
7 NHPREV Previous noun phrases Number of noun phrases previous to the verb
8 NHTOT Total noun phrases Number of noun phrases in the clause
9 INF Infinitive Number of infinitives in the clause
10 SE Spanish particle se True, False
11 A Spanish preposition a True, False
12 POSpre Four parts of the speech previous to 292 different values combining the parsers
the verb POS tags
14 POSpos Four parts of the speech following 280 different values combining the parsers
the verb POS tags
14 VERBtype Type of verb: copulative, impersonal CIPX, XIXX, XXXT, XXPX, XXXI, CIXX, XXPT, XIPX,
pronominal, transitive and intransitive XIPT, XXXX, XIXI, CXPI, XXPI, XIPI, CXPX

Table 4: Features, definitions and values.

tense, and mood with the preceding verb in (e) Se admiten los alumnos que reunan los req-
the sentence and also with the main verb of uisitos.
the sentence.3 (They) accept the students who fulfill the
requirements.
7-9 NHPREV, NHTOT, INF: the candidates for (f) Se admite a los alumnos que reunan los req-
the subject of the clause are represented by uisitos.
the number of noun phrases in the clause that (It) is accepted for the students who fulfill
precede the verb, the total number of noun the requirements.
phrases in the clause, and the number of in-
finitive verbs in the clause. 12-3 POSpre , POSpos : the part of the speech
(POS) of eight tokens, that is, the 4-grams
10 SE: a binary feature encoding the presence preceding and the 4-grams following the in-
or absence of the Spanish particle se when it stance.
occurs immediately before or after the verb
or with a maximum of one token lying be- 14 VERBtype : the verb is classified as copula-
tween the verb and itself. Particle se occurs tive, pronominal, transitive, or with an im-
in passive reflex clauses with zero subjects personal use.4 Verbs belonging to more than
and in some impersonal constructions. one class are also accommodated with dif-
ferent feature values for each of the possible
11 A: a binary feature encoding the presence or combinations of verb type.
absence of the Spanish preposition a in the
5.2 Evaluation
clause. Since the distinction between passive
reflex clauses with zero subjects and imper- To determine the most accurate algorithm for our
sonal constructions sometimes relies on the classification task, two comparisons of learning
appearance of preposition a (to, for, etc.). algorithms implemented in W EKA (Witten and
For instance, example (e) is a passive reflex Frank, 2005) were carried out. Firstly, the classi-
clause containing a zero subject while exam- fication was performed using 20% of the training
ple (s) is an impersonal construction. instances. Secondly, the seven highest perform-
3
ing classifiers were compared using 100% of the
In Spanish, when a finite verb appears in a subordinate
4
clause, its tense and mood can assist in recognition of these We used four lists provided by Molino de Ideas s.a. con-
features in the verb of the main clause and help to enforce taining 11,060 different verb lemmas belonging to the Royal
some restrictions required by this verb, especially when both Spanish Academy Dictionary (Real Academia Espanola,
verbs share the same referent as subject. 2001).

710
Class           P       R       F       Acc.
Explicit subj.  90.1%   92.3%   91.2%   87.3%
Zero subj.      77.2%   74.0%   75.5%   87.4%
Impersonals     85.6%   63.1%   72.7%   98.8%

Table 5: K* performance (87.6% accuracy for ten-fold cross validation).

Algorithm    Explicit subjects   Zero subjects   Impersonals
RAE          -                   -               70.4%
Connexor     71.7%               83.0%           -
Ferr./Peral  79.7%               98.4%           -
Elliphant    87.3%               87.4%           98.8%

Table 6: Summary of accuracy comparison with previous work.
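For readers who wish to reproduce this kind of evaluation outside Weka, the following rough sketch runs a ten-fold cross-validated accuracy estimate. K* has no direct scikit-learn equivalent, so a k-nearest-neighbours classifier is used here purely as a stand-in, and the input layout (one row of categorical feature values per verb instance, all encoded as strings) is our assumption rather than part of the original system:

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

def evaluate(instances, labels):
    """Ten-fold cross-validated accuracy over the three subject classes.
    The categorical features of Table 4 are one-hot encoded; the KNN model
    is only a stand-in for Weka's K* classifier."""
    model = make_pipeline(
        OneHotEncoder(handle_unknown="ignore"),
        KNeighborsClassifier(n_neighbors=5),
    )
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(model, instances, labels, cv=folds)
    return scores.mean()
```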
training data and ten-fold cross-validation. The
corpus was partitioned into training and tested
it without impersonal constructions. We achieve
using ten-fold cross-validation for randomly or-
a precision of 87% for explicit subjects compared
dered instances in both cases. The lazy learn-
to 80%, and a precision of 87% for zero subjects
ing classifier K* (Cleary and Trigg, 1995), us-
compared to their 98%. The overall accuracy
ing a blending parameter of 40%, was the best
is the same for both techniques, 87.5%, but our
performing one, with an accuracy of 87.6% for
results are more balanced. Nevertheless, the
ten-fold cross-validation. K* differs from other
approaches and corpora used in both studies are
instance-based learners in that it computes the dis-
different, and hence it is not possible to do a fair
tance between two instances using a method mo-
comparison. For example, their corpus has 46%
tivated by information theory, where a maximum
of zero subjects while ours has only 26%.
entropy-based distance function is used (Cleary
and Trigg, 1995). Table 5 shows the results For impersonal constructions our method out-
for each class using ten-fold cross-validation. performs the RAE baseline (precision 6.5%,
In contrast to previous work, the K* algorithm recall 77.7%, F-measure 12.0% and accuracy
(Cleary and Trigg, 1995) was found to provide the 70.4%). Table 6 summarizes the comparison. The
most accurate classification in the current study. low performance of the RAE baseline is due to the
Other approaches have employed various clas- fact that verbs with impersonal use are often am-
sification algorithms, including JRip in WEKA biguous. For these cases, we first tagged them as
(Muller, 2006), with precision of 74% and recall ambiguous and then, we defined additional crite-
of 60%, and K-nearest neighbors in TiMBL: both ria after analyzing then manually. The resulting
in (Evans, 2001) with precision of 73% and recall annotated criteria are stated in Table 1.
of 69%, and in (Boyd et al., 2005) with precision
6 Analysis
of 82% and recall of 71%.
Since there is no previous ML approach for this Through these analyses we aim to extract the most
task in Spanish, our baselines for the explicit sub- effective features and the information that would
jects and the zero subjects are the parser output complement the output of an standard parser to
and the previous rule-based work with the high- achieve this task. We also examine the learning
est performance (Ferrandez and Peral, 2000). For process of the algorithm to find out how many in-
the impersonal constructions the baseline is a sim- stances are needed to train it efficiently and de-
ple greedy algorithm that classifies as an imper- termine how much Elliphant is genre dependent.
sonal construction every verb whose lemma is cat- The analyses indicate that our approach is robust:
egorized as a verb with impersonal use according it performs nearly as well with just six features,
to the RAE dictionary (Real Academia Espanola, has a steep learning curve, and seems to general-
2001). ize well to other text collections.
Our method outperforms the Connexor parser
which identifies the explicit subjects but makes no 6.1 Best Features
distinction between zero subjects and impersonal We carried out three different experiments to eval-
constructions. Connexor yields 74.9% overall ac- uate the most effective group of features, and
curacy and 80.2% and 65.6% F-measure for ex- the features themselves considering the individ-
plicit and elliptic subjects, respectively. ual predictive ability of each one along with their
To compare with Ferrandez and Peral degree of redundancy.
(Ferrandez and Peral, 2000) we do consider Based on the following three feature selection

711
methods we can state that there is a complex and Omission of all but one of the simple features
balanced interaction between the features. led to a reduction in accuracy, justifying their in-
clusion in the training instances. Nevertheless, the
6.1.1 Grouping Features
majority of features present low informativeness
In the first experiment we considered the 11 except for feature A which does not make any
groups of relevant ordered features from the train- meaningful contribution to the classification. The
ing data, which were selected using each W EKA feature PARSER presents the greatest difference
attribute selection algorithm and performed the in performance (86.3% total accuracy); however,
classifications over the complete training data, us- this is no big loss, considering it is the main fea-
ing only the different groups features selected. ture. Hence, as most features do not bring a sig-
The most effective group of six features (NH- nificant loss in accuracy, the features need to be
PREV, PARSER, NHTOT, POSpos , PERSON, combined to improve the performance.
LEMMA) was the one selected by W EKAs Sym-
metricalUncertAttribute technique, which gives 6.2 Learning Analysis
an accuracy of 83.5%. The most frequently The learning curve of Figure 1 (left) presents the
selected features by all methods are PARSER, increase of the performance obtained by Elliphant
POSpos , and NHTOT, and they alone get an accu- using the training data randomly ordered. The
racy of 83.6% together. As expected, the two pairs performance reaches its plateau using 90% of the
of features that perform best (both 74.8% accu- training instances. Using different ordering of the
racy) are PARSER with either POSpos or NHTOT. training set we obtain the same result.
Based on how frequent each feature is selected Figure 1 (right) presents the precision for each
by W EKAs attribute selection algorithms, we can class and overall in relation to the number of train-
rank the features as following: (1) PARSER, ing instances for each one of them. Recall grows
(2) NHTOT, (3) POSpos , (4) NHPREV and (5) similarly to precision. Under all conditions, sub-
LEMMA. jects are classified with a high precision since the
6.1.2 Complex vs. Simple Features information given by the parser (collected in the
Second, a set of experiments was conducted features) achieves an accuracy of 74.9% for the
in which features were selected on the basis identification of explicit subjects.
of the degree of computational effort needed to The impersonal construction class has the
generate them. We propose two sets of fea- fastest learning curve. When utilizing a training
tures. One group corresponds to simple fea- set of only 163 instances (90% of the training
tures, whose values can be obtained by trivial data), it reaches a precision of 63.2%. The un-
exploitation of the tags produced in the parsers stable behaviour for impersonal constructions can
output (PARSER, LEMMA, PERSON, POSpos , be attributed to not having enough training data
POSpre ). The second group of features, com- for that class, since impersonals are not frequent
plex features (CLAUSE, AGREE, NHPREV, in Spanish. On the other hand, the zero subject
NHTOT, VERBtype ) have values that required the class is learned more gradually.
implementation of more sophisticated modules to The learning curve for the explicit subject class
identify the boundaries of syntactic constituents is almost flat due to the great variety of subjects
such as clauses and noun phrases. The accuracy occurring in the training data. In addition, reach-
obtained when the classifier exclusively exploits ing a precision of 92.0% for explicit subjects us-
complex features is 82.6% while for simple ing just 20% of the training data is far more ex-
features is 79.9%. No impersonal constructions pensive in terms of the number of training in-
are identified when only complex features are stances (978) as seen in Figure 1 (right). Actually,
used. with just 20% of the training data we can already
achieve a precision of 85.9%.
6.1.3 One-left-out Feature This demonstrates that Elliphant does not need
In the third experiment, to estimate the weight very large sets of expensive training data and
of each feature, classifications were made in is able to reach adequate levels of performance
which each feature was omitted from the train- when exploiting far fewer training instances. In
ing instances that were presented to the classifier. fact, we see that we only need a modest set of

712
Figure 1: Learning curve for precision, recall and F-measure (left) and with respect to the number of instances of each class (right) for a given percentage of training data. (Right panel: precision (%) for the overall model and for the explicit subject, zero subject and impersonal construction classes.)

6.3 Impact of Genre

To examine the influence of the different text genres on this method, we divided our training data into two subgroups belonging to different genres (legal and health) and analyze the differences. A comparative evaluation using ten-fold cross-validation over the two subgroups shows that Elliphant is more successful when classifying instances of explicit subjects in legal texts (89.8% accuracy) than health texts (85.4% accuracy). This may be explained by the greater uniformity of the sentences in the legal genre compared to ones from the health genre, as well as the fact that there are a larger number of explicit subjects in the legal training data (2,739 compared with 2,116 in the health texts). Further, texts from the health genre present the additional complication of specialized named entities and acronyms, which are used quite frequently. Similarly, better performance in the detection of zero subjects and impersonal sentences in the health texts may be due to their more frequent occurrence and hence greater learnability.

Training/Testing   Legal    Health   All
Legal              90.0%    86.8%    89.3%
Health             86.8%    85.9%    88.7%
All                92.5%    93.7%    87.6%

Table 7: Accuracy of cross-genre training and testing evaluation (ten-fold evaluation).

We have also studied the effect of training the classifier on data derived from one genre and testing on instances derived from a different genre. Table 7 shows that instances from legal texts are more homogeneous, as the classifier obtains higher accuracy when testing and training only on legal instances (90.0%). In addition, legal texts are also more informative, because when both legal and health genres are combined as training data, only instances from the health genre show a significantly increased accuracy (93.7%). These results reveal that the health texts are the most heterogeneous ones. In fact, we also found subsets of the legal documents where our method achieves an accuracy of 94.6%, implying more homogeneous texts.

6.4 Error Analysis

Since the features of the system are linguistically motivated, we performed a linguistic analysis of the erroneously classified instances to find out which patterns are more difficult to classify and which type of information would improve the method (Rello et al., 2011).

We extract the erroneously classified instances of our training data and classify the errors. According to the distribution of the errors per class (Table 8) we take into account the following four classes of errors for the analysis: (a) impersonal constructions classified as zero subjects, (b) impersonal constructions classified as explicit subjects, (c) zero subjects classified as explicit subjects, and (d) explicit subjects classified as zero subjects. The diagonal numbers are the true predicted cases. The classification of impersonal constructions is less balanced than the ones for explicit subjects and zero subjects. Most of the wrongly identified instances are classified as explicit subject, given that this class is the largest one. On the other hand, 25% of the zero subjects are classified as explicit subject, while only 8% of the explicit subjects are identified as zero subjects.

Class            Zero subj.   Explicit subj.   Impers.
Zero subj.       1327         453 (c)          13
Explicit subj.   368 (d)      4481             6
Impersonals      25 (a)       41 (b)           113

Table 8: Confusion Matrix (ten-fold validation).
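The per-class rates quoted above can be recomputed directly from the rows of Table 8; a minimal arithmetic check in Python (the dictionary layout is ours, purely for illustration):

```python
# Rows of Table 8: gold class -> counts per predicted class.
confusion = {
    "zero":       {"zero": 1327, "explicit": 453,  "impersonal": 13},
    "explicit":   {"zero": 368,  "explicit": 4481, "impersonal": 6},
    "impersonal": {"zero": 25,   "explicit": 41,   "impersonal": 113},
}

for gold, row in confusion.items():
    total = sum(row.values())
    for predicted, count in row.items():
        if predicted != gold:
            print(f"{gold} -> {predicted}: {100.0 * count / total:.1f}%")
# zero -> explicit comes out at about 25% and explicit -> zero at about 8%,
# matching the figures discussed in the text.
```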
For the analysis we first performed an exploration of the feature values, which allows us to generate smaller samples of the groups of errors for the further linguistic analyses. Then, we explore the linguistic characteristics of the instances by examining the clause in which the instance appears in our corpus. A great variety of different patterns are found. We mention only the linguistic characteristics in the errors which at least double the corpus general trends.

In all groups (a-d) there is a tendency of using the following elements: post-verbal prepositions, auxiliary verbs, future verbal tenses, subjunctive verbal mode, negation, punctuation marks appearing before the verb and the preceding noun phrases, and concessive and adverbial subordinate clauses. In groups (a) and (b) the lemma of the verb may play a relevant role; for instance, the verb haber (there is/are) appears in the errors seven times more than in the training data, while the verb tratar (to be about, to deal with) appears 12 times more. Finally, in groups (c) and (d) we notice the frequent occurrence of idioms which include verbs with impersonal uses, such as es decir (that is to say), and words which can be subject on their own, i.e. ambos (both) or todo (all).

7 Conclusions and Future Work

In this study we learn which is the most accurate approach for identifying explicit subjects and impersonal constructions in Spanish and which are the linguistic characteristics and features that help to perform this task. The corpus created is freely available online.5 Our method complements previous work on Spanish anaphora resolution by addressing the identification of non-referential constructions. It outperforms current approaches in explicit subject detection and impersonal constructions, doing better than the parser for every class.

5 ESZIC es Corpus is available at: http://luzrello.com/Projects.html.

A possible future avenue to explore could be to combine our approach with Ferrandez and Peral (Ferrandez and Peral, 2000) by employing both algorithms in sequence: first Ferrandez and Peral's algorithm to detect all zero subjects and then ours to identify explicit subjects and impersonals. Assuming that the same accuracy could be maintained, on our data set the combined performance could potentially be in the range of 95%. Future research goals are the extrinsic evaluation of our system by integrating it in NLP tasks and its adaptation to other Romance pro-drop languages. Finally, we believe that our ML approach could be improved as it is the first attempt of this kind.

Acknowledgements

We thank Richard Evans, Julio Gonzalo and the anonymous reviewers for their wise comments.

References

R. Artstein and M. Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.

S. Bergsma, D. Lin, and R. Goebel. 2008. Distributional identification of non-referential pronouns. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT-08), pages 10–18.

I. Bosque. 1989. Clases de sujetos tacitos. In Julio Borrego Nieto, editor, Philologica: homenaje a Antonio Llorente, volume 2, pages 91–112. Servicio de Publicaciones, Universidad Pontificia de Salamanca, Salamanca.

A. Boyd, W. Gegg-Harrison, and D. Byron. 2005. Identifying non-referential it: a machine learning approach incorporating linguistically motivated patterns. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing. 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), pages 40–47.

J. M. Brucart. 1999. La elipsis. In I. Bosque and V. Demonte, editors, Gramatica descriptiva de la lengua espanola, volume 2, pages 2787–2863. Espasa-Calpe, Madrid.

N. Chomsky. 1981. Lectures on Government and Binding. Mouton de Gruyter, Berlin, New York.

J. G. Cleary and L. E. Trigg. 1995. K*: an instance-based learner using an entropic distance measure. In Proceedings of the 12th International Conference on Machine Learning (ICML-95), pages 108–114.

Connexor Oy, 2006. Machinese language model.

L. Danlos. 2005. Automatic recognition of French expletive pronoun occurrences. In Robert Dale, Kam-Fai Wong, Jiang Su, and Oi Yee Kwong, editors, Natural language processing. Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), pages 73–78, Berlin, Heidelberg, New York. Springer. Lecture Notes in Computer Science, Vol. 3651.

R. Evans. 2001. Applying machine learning: toward an automatic classification of it. Literary and Linguistic Computing, 16(1):45–57.

A. Ferrandez and J. Peral. 2000. A computational approach to zero-pronouns in Spanish. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), pages 166–172.

J. L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

G. Hirst. 1981. Anaphora in natural language understanding: a survey. Springer-Verlag.

J. Hobbs. 1977. Resolving pronoun references. Lingua, 44:311–338.

R. Mitkov and C. Hallett. 2007. Comparing pronoun resolution algorithms. Computational Intelligence, 23(2):262–297.

R. Mitkov. 2002. Anaphora resolution. Longman, London.

R. Mitkov. 2010. Discourse processing. In Alexander Clark, Chris Fox, and Shalom Lappin, editors, The handbook of computational linguistics and natural language processing, pages 599–629. Wiley Blackwell, Oxford.

C. Muller. 2006. Automatic detection of nonreferential it in spoken multi-party dialog. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), pages 49–56.

V. Ng and C. Cardie. 2002. Identifying anaphoric and non-anaphoric noun phrases to improve coreference resolution. In Proceedings of the 19th International Conference on Computational Linguistics (COLING-02), pages 1–7.

Real Academia Espanola. 2001. Diccionario de la lengua espanola. Espasa-Calpe, Madrid, 22 edition.

Real Academia Espanola. 2009. Nueva gramatica de la lengua espanola. Espasa-Calpe, Madrid.

M. Recasens and E. Hovy. 2009. A deeper look into features for coreference resolution. In Lalitha Devi Sobha, Antonio Branco, and Ruslan Mitkov, editors, Anaphora Processing and Applications. Proceedings of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC-09), pages 29–42. Springer, Berlin, Heidelberg, New York. Lecture Notes in Computer Science, Vol. 5847.

M. Recasens and M. A. Marti. 2010. AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan. Language Resources and Evaluation, 44(4):315–345.

L. Rello and I. Illisei. 2009a. A comparative study of Spanish zero pronoun distribution. In Proceedings of the International Symposium on Data and Sense Mining, Machine Translation and Controlled Languages, and their application to emergencies and safety critical domains (ISMTCL-09), pages 209–214. Presses Universitaires de Franche-Comte, Besancon.

L. Rello and I. Illisei. 2009b. A rule-based approach to the identification of Spanish zero pronouns. In Student Research Workshop. International Conference on Recent Advances in Natural Language Processing (RANLP-09), pages 209–214.

L. Rello, P. Suarez, and R. Mitkov. 2010. A machine learning method for identifying non-referential impersonal sentences and zero pronouns in Spanish. Procesamiento del Lenguaje Natural, 45:281–287.

L. Rello, G. Ferraro, and A. Burga. 2011. Error analysis for the improvement of subject ellipsis detection. Procesamiento de Lenguaje Natural, 47:223–230.

L. Rello. 2010. Elliphant: A machine learning method for identifying subject ellipsis and impersonal constructions in Spanish. Master's thesis, Erasmus Mundus, University of Wolverhampton & Universitat Autonoma de Barcelona.

P. Tapanainen and T. Jarvinen. 1997. A non-projective dependency parser. In Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), pages 64–71.

I. H. Witten and E. Frank. 2005. Data mining: practical machine learning tools and techniques. Morgan Kaufmann, London, 2 edition.

S. Zhao and H. T. Ng. 2007. Identification and resolution of Chinese zero pronouns: a machine learning approach. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP/CoNLL-07), pages 541–550.

Validation of sub-sentential paraphrases acquired
from parallel monolingual corpora

Houda Bouamor Aurelien Max Anne Vilnat

LIMSI-CNRS & Univ. Paris Sud


Orsay, France
firstname.lastname@limsi.fr

Abstract

The task of paraphrase acquisition from related sentences can be tackled by a variety of techniques making use of various types of knowledge. In this work, we make the hypothesis that their performance can be increased if candidate paraphrases can be validated using information that characterizes paraphrases independently of the set of techniques that proposed them. We implement this as a bi-class classification problem (i.e. paraphrase vs. not paraphrase), allowing any paraphrase acquisition technique to be easily integrated into the combination system. We report experiments on two languages, English and French, with 5 individual techniques on parallel monolingual corpora obtained via multiple translation, and a large set of classification features including surface to contextual similarity measures. Relative improvements in F-measure close to 18% are obtained on both languages over the best performing techniques.

1 Introduction

The fact that natural language allows messages to be conveyed in a great variety of ways constitutes an important difficulty for NLP, with applications in both text analysis and generation. The term paraphrase is now commonly used in the NLP literature to refer to textual units of equivalent meaning at the phrasal level (including single words). For instance, the phrases six months and half a year form a paraphrase pair applicable in many different contexts, as they would appropriately denote the same concept. Although one can envisage to manually build high-coverage lists of synonyms, enumerating meaning equivalences at the level of phrases is too daunting a task for humans. Because this type of knowledge can however greatly benefit many NLP applications, automatic acquisition of such paraphrases has attracted a lot of attention (Androutsopoulos and Malakasiotis, 2010; Madnani and Dorr, 2010), and significant research efforts have been devoted to this objective (Callison-Burch, 2007; Bhagat, 2009; Madnani, 2010).

Central to acquiring paraphrases is the need to assess the quality of the candidate paraphrases produced by a given technique. Most works to date have resorted to human evaluation of paraphrases on the levels of grammaticality and meaning equivalence. Human evaluation is however often criticized as being both costly and non-reproducible, and the situation is even more complicated by the inherent complexity of the task, which can produce low inter-judge agreement. Task-based evaluation involving the use of paraphrasing in some application thus seems an acceptable solution, provided the evaluation methodologies for the given task are deemed acceptable. This, in turn, puts the emphasis on observing the impact of paraphrasing on the targeted application and is rarely accompanied by a study of the intrinsic limitations of the paraphrase acquisition technique used.

The present work is concerned with the task of sub-sentential paraphrase acquisition from pairs of related sentences. A large variety of techniques have been proposed that can be applied to this task. They typically make use of different kinds of automatically or manually acquired knowledge. We make the hypothesis that their performance can be increased if candidate paraphrases can be validated using information that characterizes paraphrases in complement to the set of techniques that proposed them.

We propose to implement this as a bi-class classification problem (i.e. paraphrase vs. not paraphrase), allowing any paraphrase acquisition technique to be easily integrated into the combination system. In this article, we report experiments on two languages, English and French, with 5 individual techniques based on a) statistical word alignment models, b) translational equivalence, c) hand-coded rules of term variation, d) syntactic similarity, and e) edit distance on word sequences. We used parallel monolingual corpora obtained via multiple translation from a single language as our sources of related sentences, and a large set of features including surface to contextual similarity measures. Relative improvements in F-measure close to 18% are obtained on both languages over the best performing techniques.

The remainder of this article is organized as follows. We first briefly review previous work on sub-sentential paraphrase acquisition in section 2. We then describe our experimental setting in section 3 and the individual techniques that we have studied in section 4. Section 5 is devoted to our approach for validating paraphrases proposed by individual techniques. Finally, section 6 concludes the article and presents some of our future work in the area of paraphrase acquisition.

2 Related work

The hypothesis that if two words or, by extension, two phrases, occur in similar contexts then they may be interchangeable has been extensively tested. The distributional hypothesis, attributed to Zellig Harris, was for example applied to syntactic dependency paths in the work of Lin and Pantel (2001). Their results take the form of equivalence patterns with two arguments such as {X asks for Y, X requests Y, X's request for Y, X wants Y, Y is requested by X, ...}.

Using comparable corpora, where the same information probably exists under various linguistic forms, increases the likelihood of finding very close contexts for sub-sentential units. Barzilay and Lee (2003) proposed a multi-sequence alignment algorithm that takes structurally similar sentences and builds a compact lattice representation that encodes local variations. The work by Bhagat and Ravichandran (2008) describes an application of a similar technique on a very large scale.

The hypothesis that two words or phrases are interchangeable if they share a common translation into one or more other languages has also been extensively studied in works on sub-sentential paraphrase acquisition. Bannard and Callison-Burch (2005) described a pivoting approach that can exploit bilingual parallel corpora in several languages. The same technique has been applied to the acquisition of local paraphrasing patterns in Zhao et al. (2008). The work of Callison-Burch (2008) has shown how the monolingual context of a sentence to paraphrase can be used to improve the quality of the acquired paraphrases.

Another approach consists in modelling local paraphrase identification rules. The work of Jacquemin (1999) on the identification of term variants, which exploits rewriting morphosyntactic rules and descriptions of morphological and semantic lexical families, can be extended to extract the various forms corresponding to input patterns from large monolingual corpora.

When parallel monolingual corpora aligned at the sentence level are available (e.g. multiple translations into the same language), the task of sub-sentential paraphrase acquisition can be cast as one of word alignment between two aligned sentences (Cohn et al., 2008). Barzilay and McKeown (2001) applied the distributionality hypothesis on such parallel sentences, and Pang et al. (2003) proposed an algorithm to align sentences by recursive fusion of their common syntactic constituents.

Finally, there has been a recent interest in automatic evaluation of paraphrases (Callison-Burch et al., 2008; Liu et al., 2010; Chen and Dolan, 2011; Metzler et al., 2011).

3 Experimental setting

We used the main aspects of the methodology described by Cohn et al. (2008) for constructing evaluation corpora and assessing the performance of techniques on the task of sub-sentential paraphrase acquisition. Pairs of related sentences are hand-aligned to define a set of reference atomic paraphrase pairs at the level of words or phrases, denoted as R_atom.1

1 Note that in this study we do not distinguish between Sure and Possible alignments, and when reusing annotated corpora using them we considered all alignments as being correct.

                                          single language  multiple language  video          multiply-translated  news
                                          translation      translation        descriptions   subtitles            headlines
# tokens                                  4,476            4,630              1,452          2,721                1,908
# unique tokens                           656              795                357            830                  716
% aligned tokens (excluding identities)   60.58            48.80              23.82          29.76                14.46
lexical overlap (tokens)                  77.21            61.03              59.50          32.51                39.63
lexical overlap (lemmas, content words)   83.77            71.04              64.83          39.54                45.31
translation edit rate (TER)               0.32             0.55               0.76           0.68                 0.62
penalized n-gram prec. (BLEU)             0.33             0.15               0.13           0.14                 0.39

Table 1: Various indicators of sentence pair comparability for different corpus types. Statistics are reported for French on sets of 100 sentence pairs.

We conducted a small-scale study to assess different types of corpora of related sentences:

1. single language translation: Corpora obtained by several independent human translations of the same sentences (e.g. (Barzilay and McKeown, 2001)).

2. multiple language translation: Same as above, but where a sentence is translated from 4 different languages into the same language (Bouamor et al., 2010).

3. video descriptions: Descriptions of short YouTube videos obtained via Mechanical Turk (Chen and Dolan, 2011).

4. multiply-translated subtitles: Aligned multiple translations of contributed movie subtitles (Tiedemann, 2007).

5. comparable news headlines: News headlines collected from Google News clusters (e.g. (Dolan et al., 2004)).

We collected 100 sentence pairs of each type in French, for which various comparability measures are reported in Table 1. In particular, the "% aligned tokens" row indicates the proportion of tokens from the sentence pairs that could be manually aligned by a native-speaker annotator.2 Obviously, the more common tokens two sentences from a pair contain, the fewer sub-sentential paraphrases may be extracted from that pair. However, high lexical overlap increases the probability that two sentences be indeed paraphrases, and in turn the probability that some of their phrases be paraphrases. Furthermore, the presence of common tokens may serve as useful clues to guide paraphrase extraction.

For our experiments, we chose to use parallel monolingual corpora obtained by single language translation, the most direct resource type for acquiring sub-sentential paraphrase pairs. This allows us to define acceptable references for the task and resort to the most consensual evaluation technique for paraphrase acquisition to date. Using such corpora, we expect to be able to extract precise paraphrases (see Table 1), which will be natural candidates for further validation, which will be addressed in section 5.3.

Figure 1 illustrates a reference alignment obtained on a pair of English sentential paraphrases and the list of atomic paraphrase pairs that can be extracted from it, against which acquisition techniques will be evaluated. Note that we do not consider pairs of identical units during evaluation, so we filter them out from the list of reference paraphrase pairs.

The example in Figure 1 shows different cases that point to the inherent complexity of this task, even for human annotators: it could be argued, for instance, that a correct atomic paraphrase pair should be reached ↔ amounted to rather than reached ↔ amounted. Also, aligning independently 260 ↔ 0.26 and million ↔ billion is assuredly an error, while the pair 260 million ↔ 0.26 billion would have been appropriate. A case of alignment that seems non-trivial can be observed in the provided example (during the entire year ↔ annual). The abovementioned reasons explain in part the difficulties in reaching high performance values using such gold standards.

Reference composite paraphrase pairs (denoted as R), obtained by joining adjacent atomic paraphrase pairs from R_atom up to 6 tokens,3 will also be considered when measuring performance.

2 The same annotator hand-aligned the 5*100=500 paraphrase pairs using the YAWAT (Germann, 2008) manual alignment tool.
3 We used standard biphrase extraction heuristics (Koehn et al., 2007): all words from a phrase must be aligned to at least one word from the other and not to words outside, but unaligned words at phrase boundaries are not used.

[Figure 1 alignment grid omitted: it aligns the sentence "the amount of foreign capital actually utilized during the entire year reached 260 million us dollars ." with a sentential paraphrase of it (containing, among others, the words the, used, to, actually, foreign, investment, amounted, annual, 0.26, billion, us$). The atomic paraphrase pairs read off the grid are: capital ↔ investment, utilized ↔ used, during the entire year ↔ annual, reached ↔ amounted, 260 ↔ 0.26, million ↔ billion, us dollars ↔ us$.]

Figure 1: Reference alignments for a pair of English sentential paraphrases from the annotation corpus of Cohn et al. (2008) (note that possible and sure alignments are not distinguished here) and the list of atomic paraphrase pairs extracted from these alignments.

Evaluated techniques have to output atomic candidate paraphrase pairs (denoted as H_atom) from which composite paraphrase pairs (denoted as H) are computed. The usual measures of precision (P), recall (R) and F-measure (F1) can then be defined in the following way (Cohn et al., 2008):

P = |H_atom ∩ R| / |H_atom|        R = |H ∩ R_atom| / |R_atom|        F1 = 2PR / (P + R)
We conducted experiments using two different corpora in English and French. In each case, a held-out development corpus of 150 sentential paraphrase pairs was used for development and tuning, and all techniques were evaluated on the same test set consisting of 375 sentential paraphrase pairs. For English, we used the MTC corpus described in (Cohn et al., 2008), consisting of multiply-translated Chinese sentences into English, and used as our gold standard both the alignments marked as Sure and Possible. For French, we used the CESTA corpus of news articles (http://www.elda.org/article125.html) obtained by translating into French from English.

We used the YAWAT (Germann, 2008) manual alignment tool. Inter-annotator agreement values (averaging with each annotation set as the gold standard) are 66.1 for English and 64.6 for French, which we interpret as acceptable values. Manual inspection of the two corpora reveals that the French corpus tends to contain more literal translations, possibly due to the original languages of the sentences, which are closer to the target language than Chinese is to English.

4 Individual techniques for paraphrase acquisition

As discussed in section 2, the acquisition of sub-sentential paraphrases is a challenging task that has previously attracted a lot of work. In this work, we consider the scenario where sentential paraphrases are available and words and phrases from one sentence can be aligned to words and phrases from the other sentence to form atomic paraphrase pairs. We now describe several techniques that perform the task of sub-sentential unit alignment. We have selected and implemented five techniques which we believe are representative of the type of knowledge that these techniques use, and have reused existing tools, initially developed for other tasks, when possible.

4.1 Statistical learning of word alignments (Giza)

The GIZA++ tool (Och and Ney, 2004) computes statistical word alignment models of increasing complexity from parallel corpora. While originally developed in the bilingual context of Statistical Machine Translation, nothing prevents building such models on monolingual corpora. However, in order to build reliable models, it is necessary to use enough training material including minimal redundancy of words. To this end, we provided GIZA++ with all possible sentence pairs from our multiply-translated corpus to improve the quality of its word alignments (note that we used symmetrized alignments from the alignments in both directions).
This constitutes a significant advantage for this technique that techniques working on each sentence pair independently do not have.

4.2 Translational equivalence (Pivot)

Translational equivalence can be exploited to determine that two phrases may be paraphrases. Bannard and Callison-Burch (2005) defined a paraphrasing probability between two phrases based on their translation probability through all possible pivot phrases as:

P_para(p1, p2) = Σ_piv P_t(piv|p1) P_t(p2|piv)

where P_t denotes translation probabilities. We used the Europarl corpus (http://statmt.org/europarl) of parliamentary debates in English and French, consisting of approximately 1.7 million parallel sentences: this allowed us to use the same resource to build paraphrases for English, using French as the pivot language, and for French, using English as the pivot language. The GIZA++ tool was used for word alignment and the MOSES Statistical Machine Translation toolkit (Koehn et al., 2007) was used to compute phrase translation probabilities from these word alignments. For each sentential paraphrase pair, we applied the following algorithm: for each phrase, we build the entire set of paraphrases using the previous definition. We then extract its best paraphrase as the one exactly appearing in the other sentence with maximum paraphrase probability, using a minimal threshold value of 10^-4.
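A sketch of this pivot scoring step, assuming the phrase translation tables are available as nested dictionaries of probabilities (the data structures, names and toy values are ours, purely for illustration of the formula above):

```python
from collections import defaultdict

def pivot_paraphrase_probs(phrase, e2f, f2e, threshold=1e-4):
    """P_para(p1, p2) = sum over pivot phrases piv of Pt(piv|p1) * Pt(p2|piv).
    e2f[p1][piv] = Pt(piv|p1); f2e[piv][p2] = Pt(p2|piv)."""
    scores = defaultdict(float)
    for piv, p_piv_given_p1 in e2f.get(phrase, {}).items():
        for p2, p_p2_given_piv in f2e.get(piv, {}).items():
            if p2 != phrase:
                scores[p2] += p_piv_given_p1 * p_p2_given_piv
    # Keep only paraphrases above the minimal probability threshold.
    return {p2: s for p2, s in scores.items() if s >= threshold}

# Toy example: "six months" pivoting through French phrases.
e2f = {"six months": {"six mois": 0.9, "semestre": 0.1}}
f2e = {"six mois": {"six months": 0.8, "half a year": 0.2},
       "semestre": {"half a year": 0.6, "semester": 0.4}}
print(pivot_paraphrase_probs("six months", e2f, f2e))
```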
4.3 Linguistic knowledge on term variation (Fastr)

The FASTR tool (Jacquemin, 1999) was designed to spot term/phrase variants in large corpora. Variants are described through metarules expressing how the morphosyntactic structure of a term variant can be derived from a given term by means of regular expressions on word morphosyntactic categories. Paradigmatic variation can also be expressed through constraints between words, imposing that they be of the same morphological or semantic family. Both constraints rely on preexisting repertoires available for English and French. To compute candidate paraphrase pairs using FASTR, we first consider all phrases from the first sentence and search for variants in the other sentence, then do the reverse process and finally take the intersection of the two sets.

4.4 Syntactic similarity (Synt)

The algorithm introduced by Pang et al. (2003) takes two sentences as input and merges them by top-down syntactic fusion guided by compatible syntactic substructure. A lexical blocking mechanism prevents constituents from fusing when there is evidence of the presence of a word in another constituent of one of the sentences. We use the Berkeley Probabilistic parser (Klein and Manning, 2003) to obtain syntactic trees for English and its adapted version for French (Candito et al., 2010). Because this process is highly sensitive to syntactic parse errors, we use in our implementation k-best parses and retain the most compact fusion from any pair of candidate parses.

4.5 Edit rate on word sequences (TERp)

TERp (Translation Edit Rate Plus) (Snover et al., 2010) is a score designed for the evaluation of Machine Translation output. Its typical use takes a system hypothesis to compute an optimal set of word edits that can transform it into some existing reference translation. Edit types include exact word matching, word insertion and deletion, block movement of contiguous words (computed as an approximation), as well as optionally variant substitution through stemming, synonym or paraphrase matching.6 Each edit type is parameterized by at least one weight which can be optimized using e.g. hill climbing. TERp being a tunable metric, our experiments will include tuning TERp systems towards either precision (TERp-P), recall (TERp-R), or F-measure (TERp-F1).7

4.6 Evaluation of individual techniques

Results for the 5 individual techniques are given on the left part of Table 2. It is first apparent that all techniques but TERp fared better on the French corpus than on the English corpus. This can certainly be explained by the fact that the former results from more literal translations (from English to French, compared with from Chinese to English), which should be consequently easier to word-align. This is for example clearly shown by the results of the statistical aligner GIZA, which obtains a 7.68 advantage on recall for French over English.

6 Note that for these experiments we did not use the stemming module, the interface to WordNet for synonym matching and the provided paraphrase table for English, due to the fact that these resources were available for English only.
7 Hill climbing was used for all tunings as done by Snover et al. (2010), and we used one iteration starting with uniform weights and 100 random restarts.

            Individual techniques                                                      Combinations
        GIZA    PIVOT   FASTR   SYNT    TERp-P   TERp-R   TERp-F1    union   validation
English
P       31.01   31.78   37.38   52.17   50.00    29.15    33.37      21.44   50.51
R       38.30   18.50   6.71    2.53    5.83     45.19    45.37      60.87   41.19
F1      34.27   23.39   11.38   4.83    10.44    35.44    38.46      31.71   45.37
French
P       28.99   29.53   52.48   62.50   31.35    30.26    31.43      17.58   40.77
R       45.98   26.66   8.59    8.65    44.22    44.60    44.10      63.36   45.85
F1      35.56   28.02   14.77   15.20   36.69    36.05    36.70      27.53   43.16

Table 2: Results on the test set on English and French for the 5 individual paraphrase acquisition techniques (left part) and for the 2 combination techniques (right part).

The two linguistically-aware techniques, FASTR and SYNT, have a very strong precision on the more parallel French corpus, but fail to achieve an acceptable recall on their own. This is not surprising: FASTR metarules are focussed on term variant extraction, and SYNT requires two syntactic trees to be highly comparable to extract sub-sentential paraphrases. When these constrained conditions are met, these two techniques appear to perform quite well in terms of precision.

GIZA and TERp perform roughly in the same range on French, with acceptable precision and recall, TERp performing overall better, with e.g. a 1.14 advantage on F-measure on French and 4.19 on English. The fact that TERp performs comparatively better on English than on French,8 with a 1.76 advantage on F-measure, is not contradictory: the implemented edit distance makes it possible to align reasonably distant words and phrases independently from syntax, and to find alignments for close remaining words, so the differences in performance between the two languages are not necessarily expected to be comparable with the results of a statistical alignment technique. English being a poorly-inflected language, alignment clues between two sentential paraphrases are expected to be more numerous than for highly-inflected French.

PIVOT is on par with GIZA as regards precision, but obtains a comparatively much lower recall (differences of 19.32 and 19.80 on recall on French and English respectively). This may first be due in part to the paraphrasing score threshold used for PIVOT, but most certainly to the use of a bilingual corpus from the domain of parliamentary debates to extract paraphrases when our test sets are from the news domain: we may be observing differences inherent to the domain, and possibly facing the issue of numerous out-of-vocabulary phrases, in particular for named entities which frequently occur in the news domain.

Importantly, we can note that we obtain at best a recall of 45.98 on French (GIZA) and of 45.37 on English (TERp). This may come as a disappointment but, given the broad set of techniques evaluated, this should rather underline the inherent complexity of the task. Also, recall that the metrics used do not consider identity paraphrases (e.g. at the same time ↔ at the same time), as well as the fact that gold standard alignment is a very difficult process, as shown by inter-judge agreement values and our example from section 3. This, again, confirms that the task that is addressed is indeed a difficult one, and provides further justification for initially focussing on parallel monolingual corpora, albeit scarce, for conducting fine-grained studies on sub-sentential paraphrasing.

Lastly, we can also note that precision is not very high, with (at best, using TERp-P) average values for all techniques of 40.97 and 40.46 on French and English, respectively. Several facts may provide explanations for this observation. First, it should be noted that none of those techniques, except SYNT, was originally developed for the task of sub-sentential paraphrase acquisition from monolingual parallel corpora.

8 Recall that all specific linguistic modules for English only from TERp had been disabled, so the better performance on English cannot be explained by a difference in terms of resources used.

This results in definitions that are at best closely related to this task.9 Designing new techniques was not one of the objectives of our study, so we have reused existing techniques, originally developed with different aims (bilingual parallel corpora word alignment (GIZA), term variant recognition (FASTR), Machine Translation evaluation (TERp)). Also, techniques such as GIZA and TERp attempt to align as many words as possible in a sentence pair, when gold standard alignments sometimes contain gaps.10 Finally, the metrics used will count as false small variations of gold standard paraphrases (e.g. a missing function word): the acceptability or not of such candidates could either be evaluated in a scenario where such acceptable variants would be taken into account, or be considered in the context of some actual use of the acquired paraphrases in some application. Nonetheless, on average the techniques in our study produce more candidates that are not in the gold standard: this will be an important fact to keep in mind when tackling the task of combining their outputs. In particular, we will investigate the use of features indicating the combination of techniques that predicted a given paraphrase pair, aiming to capture consensus information.

5 Paraphrase validation

5.1 Technique complementarity

Before considering combining and validating the outputs of individual techniques, it is informative to look at some notion of complementarity between techniques, in terms of how many correct paraphrases a technique would add to a combined set. The following formula was used to account for the complementarity between the set of candidates from some technique i, ti, and the set for some technique j, tj:

C(ti, tj) = recall(ti ∪ tj) − max(recall(ti), recall(tj))
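A small sketch of this complementarity computation over candidate sets, with recall measured against a shared gold standard (set names and toy values are ours, for illustration):

```python
def recall(candidates, gold):
    return len(candidates & gold) / len(gold) if gold else 0.0

def complementarity(cand_i, cand_j, gold):
    """C(t_i, t_j) = recall(t_i U t_j) - max(recall(t_i), recall(t_j))."""
    return recall(cand_i | cand_j, gold) - max(recall(cand_i, gold),
                                                recall(cand_j, gold))

# Toy example: two techniques finding partly disjoint gold pairs.
gold = {("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")}
giza = {("a", "b"), ("c", "d")}
terp = {("a", "b"), ("e", "f")}
print(complementarity(giza, terp, gold))   # 0.75 - 0.5 = 0.25
```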
discussion, constitutes a good result and provide
a clear justification for combining different tech-
9
Recall, however, that our best performing technique on niques for improving performance on this task.
F-measure, TERp , was optimized to our task using a held Precision is mechanically lowered to account for
out development set. roughly 1 correct paraphrase over 5 candidates
10
It is arguable whether such cases should happen in sen- for both languages. F-measure values are much
tence pairs obtained by translating the same original sentence
into the same language, but this clearly depends on the inter-
lower than those of TERp and G IZA, showing
pretation of the expected level of annotation by the annota- that the union of all techniques is only interest-
tors. ing for recall-oriented paraphrase acquisition. In

5.3 Paraphrase validation via automatic classification

A natural improvement to the naive combination of paraphrase candidates from all techniques can consist in validating candidate paraphrases by using several models that may be good indicators of their paraphrasing status. We can therefore cast our problem as one of bi-class classification (i.e. paraphrase vs. not paraphrase).

We have used a maximum entropy classifier11 with the following features, aiming at capturing information on the paraphrase status of a candidate pair:

Morphosyntactic equivalence (POS) It may be the case that some sequences of part-of-speech can be rewritten as different sequences, e.g. as a result of verb nominalization. We therefore use features to indicate the sequences of part-of-speech for a pair of candidate paraphrases. We used the preterminal symbols of the syntactic trees of the parser used for SYNT.

Character-based distance (CAR) Morphological variants often have close word forms, and more generally close word forms in sentential paraphrase pairs may indicate related words. We used features for discretized values of the edit distance between the two phrases of a candidate paraphrase pair as measured by the Levenshtein distance.

Stem similarity (STEM) Inflectional morphology, which is quite productive in languages such as French, can increase vocabulary size significantly, while in sentential paraphrases common stems may indicate related words. We used a binary feature indicating whether the stemmed phrases of a candidate paraphrase pair match.12

Token set identity (BOW) Syntactic rearrangements may involve the same sets of words in various orders. We used discretized features indicating the proportion of common tokens in the set of tokens for the two phrases of a candidate paraphrase pair.

Context similarity (CTXT) It can be derived from the distributionality hypothesis that the more two phrases will be seen in similar contexts, the more they are likely to be paraphrases. We used discretized features indicating how similar the contexts of occurrences of two paraphrases are. For this, we used the full set of bilingual English-French data available for the translation task of the Workshop on Statistical Machine Translation13, totalling roughly 30 million parallel sentences: this again ensures that the same resources are used for experiments in the two languages. We collect all occurrences of the phrases in a pair, and build a vector of content words cooccurring within a distance of 10 words from each phrase. We finally compute the cosine between the vectors of the two phrases of a candidate paraphrase pair.

Relative position in a sentence (REL) Depending on the language in which parallel sentences are analyzed, it may be the case that sub-sentential paraphrases occur at close locations in their respective sentence. We used a discretized feature indicating the relative position of the two phrases in their original sentence.

Identity check (COOC) We used a binary feature indicating whether one of the two phrases from a candidate pair, or the two, occurred at some other location in the other sentence.

Phrase length ratio (LEN) We used a discretized feature indicating phrase length ratio.

Source techniques (SRC) Finally, as our setting validates paraphrase candidates produced by a set of techniques, we used features indicating which combination of techniques predicted a paraphrase candidate. This can allow learning that paraphrases in the intersection of the predicted sets for some techniques may produce good results.
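To make a few of the surface features concrete, here is a minimal sketch of how CAR, STEM, BOW, LEN and COOC could be computed for one candidate pair; the discretization bins, the toy stemmer and all names are our assumptions for illustration, not the authors' implementation (SequenceMatcher is used as a stand-in for a proper Levenshtein distance):

```python
from difflib import SequenceMatcher   # stand-in; the paper uses Levenshtein distance

def discretize(value, bins=(0.2, 0.4, 0.6, 0.8)):
    """Map a value in [0,1] to a coarse bin index usable as a categorical feature."""
    return sum(value > b for b in bins)

def pair_features(p1, p2, sent1, sent2, stem=lambda w: w[:4]):
    t1, t2 = p1.split(), p2.split()
    feats = {}
    # CAR: character-level similarity between the two phrases.
    feats["CAR"] = discretize(SequenceMatcher(None, p1, p2).ratio())
    # STEM: do the stemmed phrases match exactly?
    feats["STEM"] = int([stem(w) for w in t1] == [stem(w) for w in t2])
    # BOW: proportion of shared tokens between the two token sets.
    shared = len(set(t1) & set(t2))
    feats["BOW"] = discretize(shared / max(len(set(t1) | set(t2)), 1))
    # LEN: phrase length ratio.
    feats["LEN"] = discretize(min(len(t1), len(t2)) / max(len(t1), len(t2)))
    # COOC: does either phrase also occur elsewhere in the other sentence?
    feats["COOC"] = int(p1 in sent2 or p2 in sent1)
    return feats

print(pair_features("during the entire year", "annual",
                    "... actually utilized during the entire year reached ...",
                    "... annual foreign investment actually used ..."))
```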
We used a held out training set consisting of 150 sentential paraphrase pairs from the same corpora as our previous development and test sets for both languages. Positive examples were taken from the candidate paraphrase pairs from any of the 5 techniques in our study which belong to the gold standard, and we used a corresponding number of negative examples (randomly selected) from candidate pairs not in the gold standard. The right part of Table 2 provides the results for our validation experiments of the union set for all previous techniques.
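A sketch of the validation step as a binary classifier over such feature dictionaries; scikit-learn's logistic regression is used here as a stand-in for the maximum entropy toolkit cited in the paper, and all example data is toy data of our own:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: feature dicts for candidate pairs with gold labels
# (1 = paraphrase, 0 = not paraphrase).
X_dicts = [
    {"CAR": 3, "STEM": 1, "BOW": 4, "LEN": 4, "COOC": 0, "SRC": "GIZA+TERp"},
    {"CAR": 0, "STEM": 0, "BOW": 0, "LEN": 1, "COOC": 1, "SRC": "PIVOT"},
    {"CAR": 2, "STEM": 0, "BOW": 3, "LEN": 3, "COOC": 0, "SRC": "GIZA"},
    {"CAR": 0, "STEM": 0, "BOW": 0, "LEN": 2, "COOC": 1, "SRC": "FASTR"},
]
y = [1, 0, 1, 0]

vec = DictVectorizer()            # one-hot encodes categorical values such as SRC
X = vec.fit_transform(X_dicts)
clf = LogisticRegression().fit(X, y)

# Validate a new candidate pair: keep it only if it is predicted as a paraphrase.
candidate = {"CAR": 3, "STEM": 1, "BOW": 4, "LEN": 4, "COOC": 0, "SRC": "GIZA"}
print(clf.predict(vec.transform([candidate]))[0])
```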

11 We used the implementation available at: http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html
12 We use the implementations of the Snowball stemmer for English and French available from: http://snowball.tartarus.org
13 http://www.statmt.org/wmt11/translation-task.html

We obtain our best results for this study using the output of our validation classifier over the set of all candidate paraphrase pairs. On French, it yields an improvement in F-measure (43.16) of +6.46 over the best individual technique (TERp) and of +15.63 over the naive union from all individual techniques. On English, the improvement in F-measure (45.37) is for the same conditions of respectively +6.91 (over TERp) and +13.66. We unfortunately observe an important decrease in recall over the naive union, of respectively -17.54 and -19.68 for French and English. Increasing our amount of training data to better represent the full range of paraphrase types may certainly overcome this in part. This would indeed be sensible, as better covering the variety of paraphrase types as a one-time effort would help all subsequent validations. Figure 2 shows how performance varies on French with the number of training examples for various feature configurations. However, some paraphrase types will require the integration of more complex knowledge, as is the case, for instance, for paraphrase pairs involving some anaphora and its antecedent (e.g. China ↔ it).

[Figure 2 plot omitted: F-measure (roughly 31-43) against the percentage of examples from the training corpus (10%-100%), with one curve for the full feature set (All) and one curve for each feature removed individually (\POS, \SRC, \CTXT, \STEM, \LEN, \COOC).]

Figure 2: Learning curves obtained on French by removing features individually.

While these results, which are very comparable for the two languages studied, are already satisfying given the complexity of our task, further inspection of false positives and negatives may help us to develop additional models that will help us obtain a better classification performance.

6 Conclusions and future work

In this article, we have addressed the task of combining the results of sub-sentential paraphrase acquisition from parallel monolingual corpora using a large variety of techniques. We have provided justifications for using highly parallel corpora consisting of multiply translated sentences from a single language. All our experiments were conducted on both English and French using comparable resources, so although the results cannot be directly compared they give some acceptable comparison points. The best recall of any individual technique is around 45 for both languages, and F-measure in the range 36-38, indicating that the task under study is a very challenging one. Our validation strategy based on bi-class classification using a broad set of features applicable to all candidate paraphrase pairs allowed us to obtain an 18% relative improvement in F-measure over the best individual technique for both languages.

Our future work includes performing a deeper error analysis of our current results, to better comprehend what characteristics of paraphrases still defy current validation. Also, we want to investigate adding new individual techniques to provide so far unseen candidates. Another possible approach would be to submit all pairs of sub-sentential paraphrase pairs from a sentence pair to our validation process, which would obviously require some optimization and devising sensible heuristics to limit time complexity. We also intend to collect larger corpora for all other corpus types appearing in Table 1 and to conduct our acquisition and validation tasks anew.

Acknowledgements

The authors would like to thank the reviewers for their comments and suggestions, as well as Guillaume Wisniewski for helpful discussions. This work was partly funded by ANR project Edylex (ANR-09-CORD-008).

References

Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A Survey of Paraphrasing and Textual Entailment Methods. Journal of Artificial Intelligence Research, 38:135–187.

Colin Bannard and Chris Callison-Burch. 2005. Paraphrasing with Bilingual Parallel Corpora. In Proceedings of ACL, Ann Arbor, USA.

Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of NAACL-HLT, Edmonton, Canada.

Regina Barzilay and Kathleen R. McKeown. 2001. Extracting paraphrases from a parallel corpus. In Proceedings of ACL, Toulouse, France.

Rahul Bhagat and Deepak Ravichandran. 2008. Large scale acquisition of paraphrases for learning surface patterns. In Proceedings of ACL-HLT, Columbus, USA.

Rahul Bhagat. 2009. Learning Paraphrases from Text. Ph.D. thesis, University of Southern California.

Houda Bouamor, Aurelien Max, and Anne Vilnat. 2010. Comparison of Paraphrase Acquisition Techniques on Sentential Paraphrases. In Proceedings of IceTAL, Rejkavik, Iceland.

Chris Callison-Burch, Trevor Cohn, and Mirella Lapata. 2008. Parametric: An automatic evaluation metric for paraphrasing. In Proceedings of COLING, Manchester, UK.

Chris Callison-Burch. 2007. Paraphrasing and Translation. Ph.D. thesis, University of Edinburgh.

Chris Callison-Burch. 2008. Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. In Proceedings of EMNLP, Hawai, USA.

Marie Candito, Benoit Crabbe, and Pascal Denis. 2010. Statistical French dependency parsing: treebank conversion and first results. In Proceedings of LREC, Valletta, Malta.

David Chen and William Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of ACL, Portland, USA.

Trevor Cohn, Chris Callison-Burch, and Mirella Lapata. 2008. Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics, 34(4).

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of COLING, Geneva, Switzerland.

Ulrich Germann. 2008. Yawat: Yet Another Word Alignment Tool. In Proceedings of the ACL-HLT, demo session, Columbus, USA.

Christian Jacquemin. 1999. Syntagmatic and paradigmatic representations of term variation. In Proceedings of ACL, College Park, USA.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of ACL, Sapporo, Japan.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of ACL, demo session, Prague, Czech Republic.

Dekang Lin and Patrick Pantel. 2001. Discovery of inference rules for question answering. Natural Language Engineering, 7(4):343–360.

Chang Liu, Daniel Dahlmeier, and Hwee Tou Ng. 2010. PEM: A paraphrase evaluation metric exploiting parallel texts. In Proceedings of EMNLP, Cambridge, USA.

Nitin Madnani and Bonnie J. Dorr. 2010. Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods. Computational Linguistics, 36(3).

Nitin Madnani. 2010. The Circle of Meaning: From Translation to Paraphrasing and Back. Ph.D. thesis, University of Maryland College Park.

Donald Metzler, Eduard Hovy, and Chunliang Zhang. 2011. An empirical evaluation of data-driven paraphrase generation techniques. In Proceedings of ACL-HLT, Portland, USA.

Franz Josef Och and Herman Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4).

Bo Pang, Kevin Knight, and Daniel Marcu. 2003. Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of NAACL-HLT, Edmonton, Canada.

Matthew Snover, Nitin Madnani, Bonnie J. Dorr, and Richard Schwartz. 2010. TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate. Machine Translation, 23(2-3).

Jorg Tiedemann. 2007. Building a Multilingual Parallel Subtitle Corpus. In Proceedings of the Conference on Computational Linguistics in the Netherlands, Leuven, Belgium.

Shiqi Zhao, Haifeng Wang, Ting Liu, and Sheng Li. 2008. Pivot Approach for Extracting Paraphrase Patterns from Bilingual Corpora. In Proceedings of ACL-HLT, Columbus, USA.

Determining the placement of German verbs in English-to-German SMT

Anita Gojun Alexander Fraser


Institute for Natural Language Processing
University of Stuttgart, Germany
{gojunaa, fraser}@ims.uni-stuttgart.de

Abstract

When translating English to German, existing reordering models often cannot model the long-range reorderings needed to generate German translations with verbs in the correct position. We reorder English as a preprocessing step for English-to-German SMT. We use a sequence of hand-crafted reordering rules applied to English parse trees. The reordering rules place English verbal elements in the positions within the clause they will have in the German translation. This is a difficult problem, as German verbal elements can appear in different positions within a clause (in contrast with English verbal elements, whose positions do not vary as much). We obtain a significant improvement in translation performance.

1 Introduction

Phrase-based SMT (PSMT) systems translate word sequences (phrases) from a source language into a target language, performing reordering of target phrases in order to generate a fluent target language output. The reordering models, such as, for example, the models implemented in Moses (Koehn et al., 2007), are often limited to a certain reordering range since reordering beyond this distance cannot be performed accurately. This results in problems of fluency for language pairs with large differences in constituent order, such as English and German. When translating from English to German, verbs in the German output are often incorrectly left near their position in English, creating problems of fluency. Verbs are also often omitted since the distortion model cannot move verbs to positions which are licensed by the German language model, making the translations difficult to understand.

A common approach for handling the long-range reordering problem within PSMT is performing syntax-based or part-of-speech-based (POS-based) reordering of the input as a preprocessing step before translation (e.g., Collins et al. (2005), Gupta et al. (2007), Habash (2007), Xu et al. (2009), Niehues and Kolss (2009), Katz-Brown et al. (2011), Genzel (2010)).

We reorder English to improve the translation to German. The verb reordering process is implemented using deterministic reordering rules on English parse trees. The sequence of reorderings is derived from the clause type and the composition of a given verbal complex (a (possibly discontiguous) sequence of verbal elements in a single clause). Only one rule can be applied in a given context and for each word to be reordered, there is a unique reordered position. We train a standard PSMT system on the reordered English training and tuning data and use it to translate the reordered English test set into German.

This paper is structured as follows: in section 2, we outline related work. In section 3, English and German verb positioning is described. The reordering rules are given in section 4. In section 5, we show the relevance of the reordering, present the experiments and present an extensive error analysis. We discuss some problems observed in section 7 and conclude in section 8.

2 Related work

There have been a number of attempts to handle the long-range reordering problem within PSMT. Many of them are based on the reordering of a source language sentence as a preprocessing step before translation.

Our approach is related to the work of Collins et al. (2005). They reordered German sentences as a preprocessing step for German-to-English SMT. Hand-crafted reordering rules are applied on German parse trees in order to move the German verbs into the positions corresponding to the positions of the English verbs. Subsequently, the reordered German sentences are translated into English, leading to better translation performance when compared with the translation of the original German sentences.

We apply this method on the opposite translation direction, thus having English as a source language and German as a target language. However, we cannot simply invert the reordering rules which are applied on German as a source language in order to reorder the English input. While the reordering of German implies movement of the German verbs into a single position, when reordering English, we need to split the English verbal complexes and, where required, move their parts into different positions. Therefore, we need to identify exactly which parts of a verbal complex must be moved and their possible positions in a German sentence.

Reordering rules can also be extracted automatically. For example, Niehues and Kolss (2009) automatically extracted discontiguous reordering rules (allowing gaps between POS tags which can include an arbitrary number of words) from a word-aligned parallel corpus with POS tagged source side. Since many different rules can be applied on a given sentence, a number of reordered sentence alternatives are created which are encoded as a word lattice (Dyer et al., 2008). They dealt with the translation directions German-to-English and English-to-German, but translation improvement was obtained only for the German-to-English direction. This may be due to missing information about clause boundaries, since English verbs often have to be moved to the clause end. Our reordering has access to this kind of knowledge since we are working with a full syntactic parser of English.

Genzel (2010) proposed a language-independent method for learning reordering rules where the rules are extracted from parsed source language sentences. For each node, all possible reorderings (permutations) of a limited number of the child nodes are considered. The candidate reordering rules are applied on the dev set which is then translated and evaluated. Only those rule sequences are extracted which maximize the translation performance of the reordered dev set.

For the extraction of reordering rules, Genzel (2010) uses shallow constituent parse trees which are obtained from dependency parse trees. The trees are annotated using both Penn Treebank POS tags and using Stanford dependency types. However, the constraints on possible reorderings are too restrictive in order to model all word movements required for English-to-German translation. In particular, the reordering rules involve only the permutation of direct child nodes and do not allow changing of child-parent relationships (deleting of a child or attaching a node to a new father node). In our implementation, a verb can be moved to any position in a parse tree (according to the reordering rules): the reordering can be a simple permutation of child nodes, or attachment of these nodes to a new father node (cf. movement of bought and read in figure 1).¹

¹ The verb movements shown in figure 1 will be explained in detail in section 4.

Thus, in contrast to Genzel (2010), our approach does not have any constraints with respect to the position of nodes marking a verb within the tree. Only the syntactic structure of the sentence restricts the distance of the linguistically motivated verb movements.

3 Verb positions in English and German

3.1 Syntax of German sentences

Since in this work we concentrate on verbs, we use the notion verbal complex for a sequence consisting of verbs, verbal particles and negation. The verb positions in the German sentences depend on clause type and the tense, as shown in table 1. Verbs can be placed in 1st, 2nd or clause-final position. Additionally, if a composed tense is given, the parts of a verbal complex can be interrupted by the middle field (MF) which contains arbitrary sentence constituents, e.g., subjects and objects (noun phrases), adjuncts (prepositional phrases), adverbs, etc. We assume that the German sentences are SVO (analogously to English); topicalization is beyond the scope of our work.
In this work, we consider two possible positions of the negation in German: (1) directly in front of the main verb, and (2) directly after the finite verb. The two negation positions are illustrated in the following examples:

(1) Ich behaupte, dass ich es nicht gesagt habe.
    I claim that I it not say did.
(2) Ich denke nicht, dass er das gesagt hat.
    I think not that he that said has.

It should, however, be noted that in German, the negative particle nicht can have several positions in a sentence depending on the context (verb arguments, emphasis). Thus, more analysis is ideally needed (e.g., discourse, etc.).

             1st       2nd       MF     clause-final
decl         subject   finV      any
             subject   finV      any    mainV
int/perif    finV      subject   any
             finV      subject   any    mainV
sub/inf      relCon    subject   any    finV
             relCon    subject   any    VC

Table 1: Position of the German subjects and verbs in declarative clauses (decl), interrogative clauses and clauses with a peripheral clause (int/perif), subordinate/infinitival (sub/inf) clauses. mainV = main verb, finV = finite verb, VC = verbal complex, any = arbitrary words, relCon = relative pronoun or conjunction. We consider extraposed constituents in perif, as well as optional interrogatives in int, to be in position 0.

3.2 Comparison of verb positions

English and German verbal complexes differ both in their construction and their position. The German verbal complex can be discontiguous, i.e., its parts can be placed in different positions, which implies that a (large) number of other words can be placed between the verbs (situated in the MF). In English, the verbal complex can only be interrupted by adverbials and subjects (in interrogative clauses). Furthermore, in German, the finite verb can sometimes be the last element of the verbal complex, while in English, the finite verb is always the first verb in the verbal complex.

In terms of positions, the verbs in English and German can differ significantly. As previously noted, the German verbal complex can be discontiguous, simultaneously occupying 1st/2nd and clause-final position (cf. rows decl and int/perif in table 1), which is not the case in English. While in English, the verbal complex is placed in the 2nd position in declarative, or in the 1st position in interrogative clauses, in German, the entire verbal complex can additionally be placed at the clause end in subordinate or infinitival clauses (cf. row sub/inf in table 1).

Because of these differences, for nearly all types of English clauses, reordering is needed in order to place the English verbs in the positions which correspond to the correct verb positions in German. Only English declarative clauses with simple present and simple past tense have the same verb position as their German counterparts. We give statistics on clause types and their relevance for the verb reordering in section 5.1.

4 Reordering of the English input

The reordering is carried out on English parse trees. We first enrich the parse trees with clause type labels, as described below. Then, for each node marking a clause (S nodes), the corresponding sequence of reordering rules is carried out. The appropriate reordering is derived from the clause type label and the composition of the given verbal complex. The reordering rules are deterministic. Only one rule can be applied in a given context and for each verb to be reordered, there is a unique reordered position.

The reordering procedure is the same for the training and the testing data. It is carried out on English parse trees, resulting in modified parse trees which are read out in order to generate the reordered English sentences. These are input for training a PSMT system or input to the decoder. The processing steps are shown in figure 1.

For the development of the reordering rules, we used a small sample of the training data. In particular, by observing the English parse trees extracted randomly from the training data, we developed a set of rules which transform the original trees in such a way that the English verbs are moved to the positions which correspond to the placement of verbs in German.

4.1 Labeling clauses with their type

As shown in section 3.1, the verb positions in German depend on the clause type. Since we use English parse trees produced by the generative parser of Charniak and Johnson (2005) which do not have any function labels, we implemented a simple rule-based clause type labeling script which enriches every clause starting node with the corresponding clause type label. The label depends on the context (father, child nodes) of a given clause node. If, for example, the first child node of a given S node is WH* (wh-word) or IN (subordinating conjunction), then the clause type label is SUB (subordinate clause, cf. figure 1).

We defined five clause type labels which indicate main clauses (MAIN), main clauses with a peripheral clause in the prefield (EXTR), subordinate (SUB), infinitival (XCOMP) and interrogative clauses (INT).
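To make the labeling step more concrete, the following is a minimal sketch of such a rule-based clause type labeler. Only the WH*/IN trigger for SUB is taken from the description above; the Node representation and the trigger conditions for XCOMP, INT and EXTR are illustrative assumptions, not the authors' actual script.

```python
# Minimal sketch of rule-based clause type labeling on a constituency tree.
class Node:
    def __init__(self, label, children=None, parent=None):
        self.label = label              # e.g. "S", "VP", "NP", "WHNP", "IN"
        self.children = children or []
        self.parent = parent

def clause_type(s_node):
    """Assign MAIN, EXTR, SUB, XCOMP or INT to a clause (S) node."""
    first = s_node.children[0].label if s_node.children else ""
    parent = s_node.parent.label if s_node.parent else ""
    if first.startswith("WH") or first == "IN":
        return "SUB"      # wh-word or subordinating conjunction (rule from the text)
    if first in ("TO", "VP"):
        return "XCOMP"    # infinitival clause (assumed trigger)
    if parent == "SQ" or first in ("MD", "VBD", "VBZ", "VBP"):
        return "INT"      # interrogative clause (assumed trigger)
    if first in ("S", "SBAR", "ADVP", "PP"):
        return "EXTR"     # peripheral clause/constituent in the prefield (assumed trigger)
    return "MAIN"

def label_clauses(node):
    """Enrich every clause-starting node with its clause type label."""
    if node.label == "S":
        node.label = "S-" + clause_type(node)
    for child in node.children:
        label_clauses(child)
```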
Figure 1: Processing steps: Clause type labeling annotates the given original tree with clause type labels (in figure, S-EXTR and S-SUB). Subsequently, the reordering is performed (cf. movement of the verbs read and bought). The reordered sentence is finally read out and given to the decoder.

4.2 Clause boundary identification

The German verbs are often placed at the clause end (cf. rows decl, int/perif and sub/inf in table 1), making it necessary to move their English counterparts into the corresponding positions within an English tree. For this reason, we identify the clause ends (the right boundaries). The search for the clause end is implemented as a breadth-first search for the next S node or sentence end. The starting node is the node which marks the verbal phrase in which the verbs are enclosed. When the next node marking a clause is identified, the search stops and returns the position in front of the identified clause marking node.

When, for example, searching for the clause boundary of S-EXTR in figure 1, we search recursively for the first clause marking node within VP1, which is S-SUB. The position in front of S-SUB is marked as clause-final position of S-EXTR.
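A rough illustration of this boundary search is sketched below; the traversal details and the node interface (.label, .children, as in the earlier sketch) are assumptions made for the example rather than the authors' implementation.

```python
from collections import deque

def clause_final_position(vp_node):
    """Breadth-first search below a verbal phrase for the next clause (S*) node.

    Returns the node in front of which the clause-final position lies, or None
    if no embedded clause is found (then the clause end is the sentence end).
    """
    queue = deque(vp_node.children)
    while queue:
        node = queue.popleft()
        if node.label.startswith("S"):   # S, S-SUB, SBAR, ... marks a clause
            return node                  # clause-final position = in front of this node
        queue.extend(node.children)
    return None                          # no embedded clause: use the sentence end
```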
4.3 Basic verb reordering rules

The reordering procedure takes into account the following word categories: verbs, verb particles, the infinitival particle to and the negative particle not, as well as its abbreviated form 't. The reordering rules are based on POS labels in the parse tree.

The reordering procedure is a sequence of applications of the reordering rules. For each element of an English verbal complex, its properties are derived (tense, main verb/auxiliary, finiteness). The reordering is then carried out corresponding to the clause type and verbal properties of a verb to be processed.

In the following, the reordering rules are presented. Examples of reordered sentences are given in table 2, and are discussed further here.

Main clause (S-MAIN)
(i) simple tense: no reordering required (cf. appears_finV in input 1);
(ii) composed tense: the main verb is moved to the clause end. If a negative particle exists, it is moved in front of the reordered main verb, while the optional verb particle is moved after the reordered main verb (cf. [has]_finV [been developing]_mainV in input 2).

Main clause with peripheral clause (S-EXTR)
(i) simple tense: the finite verb is moved together with an optional particle to the 1st position (i.e. in front of the subject);
(ii) composed tense: the main verb, as well as optional negative and verb particles, are moved to the clause end. The finite verb is moved to the 1st position, i.e. in front of the subject (cf. have_finV [gone up]_mainV in input 3).

Subordinate clause (S-SUB)
(i) simple tense: the finite verb is moved to the clause end (cf. boasts_finV in input 3);
(ii) composed tense: the main verb, as well as optional negative and verb particles, are moved to the clause end; the finite verb is placed after the reordered main verb (cf. have_finV [been executed]_mainV in input 5).

Infinitival clause (S-XCOMP)
The entire English verbal complex is moved from the 2nd position to the clause-final position (cf. [to discuss]_VC in input 4).

Interrogative clause (S-INT)
(i) simple tense: no reordering required;
(ii) composed tense: the main verb, as well as optional negative and verb particles, are moved to the clause end (cf. [did]_finV know_mainV in input 5).

4.4 Reordering rules for other phenomena

4.4.1 Multiple auxiliaries in English

Some English tenses require a sequence of auxiliaries, not all of which have a German counterpart. In the reordering process, non-finite auxiliaries are considered to be a part of the main verb complex and are moved together with the main verb (cf. movement of has_finV [been developing]_mainV in input 2).

4.4.2 Simple vs. composed tenses

In English, there are some tenses composed of an auxiliary and a main verb which correspond to a German tense composed of only one verb, e.g., am reading -> lese and does John read? -> liest John? Splitting such English verbal complexes and only moving the main verbs would lead to constructions which do not exist in German. Therefore, in the reordering process, the English verbal complex in present continuous, as well as interrogative phrases composed of do and a main verb, are not split. They are handled as one main verb complex and reordered as a single unit using the rules for main verbs (e.g. [because I am reading a book]_SUB -> because I a book am reading -> weil ich ein Buch lese).²

² We only consider present continuous and verbs in combination with do for this kind of reordering. There are also other tenses which could (or should) be treated in the same way (cf. has been developing in input 2, table 2). We do not do this to keep the reordering rules simple and general.

4.4.3 Flexible position of German verbs

We stated that the English verbs are never moved outside the subclause they were originally in. In German there are, however, some constructions (infinitival and relative clauses) in which the main verb can be placed after a subsequent clause. Consider two German translations of the English sentence He has promised to come:

(3a) Er hat [zu kommen]_S versprochen.
     he has to come promised.
(3b) Er hat versprochen, [zu kommen]_S.
     he has promised, to come.

In (3a), the German main verb versprochen is placed after the infinitival clause zu kommen (to come), while in (3b), the same verb is placed in front of it. Both alternatives are grammatically correct.

Whether a German verb should come after an embedded clause as in example (3a) or precede it (cf. example (3b)) depends not only on syntactic but also on stylistic factors. Regarding the verb reordering problem, we would therefore have to examine the given sentence in order to derive the correct (or more probable) new verb position, which is beyond the scope of this work. Therefore, we allow only for reorderings which do not cross clause boundaries, as shown in example (3b).
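The rule schema in section 4.3 can be made concrete with a small sketch. The restriction to the S-SUB case, the flat token representation and the POS-set based split are illustrative assumptions about one possible implementation, not the authors' script (which operates on parse trees and also handles particles and negation).

```python
# Illustrative sketch of the subordinate-clause (S-SUB) rules applied to a
# flat (word, POS) token list covering one clause.
FINITE = {"VBD", "VBZ", "VBP", "MD"}
NONFINITE = {"VB", "VBN", "VBG"}

def reorder_s_sub(tokens):
    finite = [t for t in tokens if t[1] in FINITE]
    nonfinite = [t for t in tokens if t[1] in NONFINITE]
    rest = [t for t in tokens if t not in finite and t not in nonfinite]
    if not nonfinite:
        # simple tense: the finite verb moves to the clause end
        return rest + finite
    # composed tense: main verb to the clause end, finite verb after it
    return rest + nonfinite + finite

clause = [("because", "IN"), ("I", "PRP"), ("have", "VBP"),
          ("said", "VBN"), ("it", "PRP"), ("to", "TO"), ("me", "PRP")]
print(" ".join(w for w, _ in reorder_s_sub(clause)))
# -> "because I it to me said have"
```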
5 Experiments

In order to evaluate the translation of the reordered English sentences, we built two SMT systems with Moses (Koehn et al., 2007). As training data, we used the Europarl corpus which consists of 1,204,062 English/German sentence pairs. The baseline system was trained on the original English training data while the contrastive system was trained on the reordered English training data. In both systems, the same original German sentences were used. We used the WMT 2009 dev and test sets to tune and test the systems. The baseline system was tuned and tested on the original data while for the contrastive system, we used the reordered English side of the dev and test sets. The German 5-gram language model used in both systems was trained on the WMT 2009 German language modeling data, a large German newspaper corpus consisting of 10,193,376 sentences.

Input 1:     The programme appears to be successful for published data shows that MRSA is on the decline in the UK.
Reordered:   The programme appears successful to be for published data shows that MRSA on the decline in the UK is.
Input 2:     The real estate market in Bulgaria has been developing at an unbelievable rate - all of Europe has its eyes on this heretofore rarely heard-of Balkan nation.
Reordered:   The real estate market in Bulgaria has at an unbelievable rate been developing - all of Europe has its eyes on this heretofore rarely heard-of Balkan nation.
Input 3:     While Bulgaria boasts the European Union's lowest real estate prices, they have still gone up by 21 percent in the past five years.
Reordered:   While Bulgaria the European Union's lowest real estate prices boasts, have they still by 21 percent in the past five years gone up.
Input 4:     Professionals and politicians from 192 countries are slated to discuss the Bali Roadmap that focuses on efforts to cut greenhouse gas emissions after 2012, when the Kyoto Protocol expires.
Reordered:   Professionals and politicians from 192 countries are slated the Bali Roadmap to discuss that on efforts focuses greenhouse gas emissions after 2012 to cut, when the Kyoto Protocol expires.
Input 5:     Did you know that in that same country, since 1976, 34 mentally-retarded offenders have been executed?
Reordered:   Did you know that in that same country, since 1976, 34 mentally-retarded offenders been executed have?

Table 2: Examples of reordered English sentences

5.1 Applied rules

In order to see how many English clauses are relevant for reordering, we derived statistics about clause types and the number of reordering rules applied on the training data.

In table 3, the number of English clauses for all considered clause type/tense combinations is shown. The bold numbers indicate combinations which are relevant to the reordering. Overall, 62% of all EN clauses from our training data (2,706,117 clauses) are relevant for the verb reordering. Note that there is an additional category rest which indicates incorrect clause type/tense combinations and might thus not be correctly reordered. These are mostly due to parsing and/or tagging errors.

tense       MAIN      EXTR      SUB       INT     XCOMP
simple      675,095   170,806   449,631   8,739   -
composed    343,178   116,729   277,733   8,817   314,573
rest        98,464    5,158     90,139    306     146,746

Table 3: Counts of English clause types and used tenses. Bold numbers indicate clause type/tense combinations where reordering is required.

The performance of the systems was measured by BLEU (Papineni et al., 2002). The evaluation results are shown in table 4. The contrastive system outperforms the baseline. Its BLEU score is 13.63, which is a gain of 0.61 BLEU points over the baseline. This is a statistically significant improvement at p<0.05 (computed with Gimpel's implementation of the pairwise bootstrap resampling method (Koehn, 2004)).

        Baseline   Reordered
BLEU    13.02      13.63

Table 4: Scores of baseline and contrastive systems
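For reference, paired bootstrap resampling in the style of Koehn (2004) can be sketched as follows; the corpus-level metric callback and the number of resamples are assumptions made for illustration, and this is not the specific implementation (Gimpel's) used above.

```python
import random

def paired_bootstrap(hyps_a, hyps_b, refs, metric, samples=1000):
    """Estimate how often system A beats system B on resampled test sets.

    hyps_a, hyps_b, refs: parallel lists of sentences; metric: a corpus-level
    scorer such as BLEU taking (hypotheses, references). Returns the fraction
    of resamples on which A scores higher than B.
    """
    n = len(refs)
    wins = 0
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]   # sample sentences with replacement
        sample_a = [hyps_a[i] for i in idx]
        sample_b = [hyps_b[i] for i in idx]
        sample_r = [refs[i] for i in idx]
        if metric(sample_a, sample_r) > metric(sample_b, sample_r):
            wins += 1
    return wins / samples   # e.g. > 0.95 corresponds to significance at p < 0.05
```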
Manual examination of the translations produced by both systems confirms the result of the automatic evaluation. Many translations produced by the contrastive system now have verbs in the correct positions. If we compare the generated translations for input sentence 1 in table 5, we see that the contrastive system generates a translation in which all verbs are placed correctly. In the baseline translation, only the translation of the finite verb was, namely war, is placed correctly, while the translation of the main verb (diagnosed -> festgestellt) should be placed at the clause end, as in the translation produced by our system.

5.2 Evaluation

Often, the English verbal complex is translated only partially by the baseline system. For example, the English verbal complexes in sentence 2 in table 5, will climb and will drop, are only partially translated (will climb -> wird (will), will drop -> fallen (fall)). Moreover, the generated verbs are placed incorrectly. In our translation, all verbs are translated and placed correctly.

Another problem which was often observed in the baseline is the omission of the verbs in the German translations. The baseline translation of the example sentence 3 in table 5 illustrates such a case. There is no translation of the English infinitival verbal complex to have. In the translation generated by the contrastive system, the verbal complex does get translated (zu haben) and is also placed correctly. We think this is because the reordering model is not able to identify the position for the verb which is licensed by the language model, causing a hypothesis with no verb to be scored higher than the hypotheses with incorrectly placed verbs.

6 Error analysis

6.1 Erroneous reordering in our system

In some cases, the reordering of the English parse trees fails. Most erroneous reorderings are due to a number of different parsing and tagging errors. Coordinated verbs are also problematic due to their complexity. Their composition can vary, and thus it would require a large number of different reordering rules to fully capture this. In our reordering script, the movement of complex structures such as verbal phrases consisting of a sequence of child nodes is not implemented (only nodes with one child, namely the verb, verbal particle or negative particle, are moved).

6.2 Splitting of the English verbal complex

Since in many cases the German verbal complex is discontiguous, we need to split the English verbal complex and move its parts into different positions. This ensures the correct placement of German verbs. However, this does not ensure that the German verb forms are correct, because of highly ambiguous English verbs. In some cases, we can lose contextual information which would be useful for disambiguating ambiguous verbs and generating the appropriate German verb forms.

6.2.1 Subject-verb agreement

Let us consider the English clause in (4a) and its reordered version in (4b):

(4a) ... because they have said it to me yesterday.
(4b) ... because they it to me yesterday said have.

In (4b), the English verbs said have are separated from the subject they. The English said have can be translated in several ways into German. Without any information about the subject (the distance between the verbs and the subject can be very large), it is relatively likely that an erroneous German translation is generated. On the other hand, in the baseline SMT system, the subject they is likely to be a part of a translation phrase with the correct German equivalent (they have said -> sie haben gesagt). They is then used as a disambiguating context which is missing in the reordered sentence (but the order is wrong).

6.2.2 Verb dependency

A similar problem occurs in a verbal complex:

(5a) They have said it to me yesterday.
(5b) They have it to me yesterday said.

In sentence (5a), the English consecutive verbs have said are a sequence consisting of a finite auxiliary have and the past participle said. They should be translated into the corresponding German verbal complex haben gesagt. But if the verbs are split, we will probably get translations which are completely independent. Even if the German auxiliary is correctly inflected, it is hard to predict how said is going to be translated. If the distance between the auxiliary habe and the hypothesized translation of said is large, the language model will not be able to help select the correct translation. Here, the baseline SMT system again has an advantage, as the verbs are consecutive. It is likely they will be found in the training data and extracted with the correct German phrase (but the German order is again incorrect).

6.3 Collocations

Collocations (verb-object pairs) are another case which can lead to a problem:

(6a) I think that the discussion would take place later this evening.
(6b) I think that the discussion place later this evening take would.

The English collocation in (6a) consisting of the verb take and the object place corresponds to the German verb stattfinden. Without this specific object, the verb take is likely to be translated literally. In the reordered sentence, the verbal complex take would is indeed separated from the object place, which would probably lead to the literal translation of both parts of the mentioned collocation. So, as already described in the preceding paragraphs, an important source of contextual information is lost which could ensure the correct translation of the given phrase.

This problem is not specific to English-to-German. For instance, the same problem occurs when translating German into English.
If, for example, the object Kauf (buying) of the collocation nehmen + in Kauf (accept) is separated from the verb nehmen (take), they are very likely to be translated literally (rather than as the idiom meaning to accept), thus leading to an erroneous English translation.

Input 1:               An MRSA - an antibiotic resistant staphylococcus - infection was recently diagnosed in the traumatology ward of Janos hospital.
Reordered input:       An MRSA - an antibiotic resistant staphylococcus - infection was recently in the traumatology ward of Janos hospital diagnosed.
Baseline translation:  Ein MRSA - ein Antibiotikum resistenter Staphylococcus - war vor kurzem in der festgestellt traumatology Ward von Janos Krankenhaus.
  (literally: A MRSA - an antibiotic resistant Staphylococcus - was before recent in the diagnosed traumatology ward of Janos hospital.)
Reordered translation: Ein MRSA - ein Antibiotikum resistenter Staphylococcus - Infektion wurde vor kurzem in den traumatology Station der Janos Krankenhaus diagnostiziert.
  (literally: A MRSA - an antibiotic resistant Staphylococcus - infection was before recent in the traumatology ward of Janos hospital diagnosed.)

Input 2:               The ECB predicts that 2008 inflation will climb to 2.5 percent from the earlier 2.1, but will drop back to 1.9 percent in 2009.
Reordered input:       The ECB predicts that 2008 inflation to 2.5 percent from the earlier 2.1 will climb, but back to 1.9 percent in 2009 will drop.
Baseline translation:  Die EZB sagt, dass 2008 die Inflationsrate wird auf 2,5 Prozent aus der früheren 2,1, sondern fallen zurück auf 1,9 Prozent im Jahr 2009.
  (literally: The ECB says, that 2008 the inflation rate will to 2.5 percent from the earlier 2.1, but fall back to 1.9 percent in the year 2009.)
Reordered translation: Die EZB prophezeit, dass 2008 die Inflation zu 2,5 Prozent aus der früheren 2,1 ansteigen wird, aber auf 1,9 Prozent in 2009 sinken wird.
  (literally: The ECB predicts, that 2008 the inflation rate to 2.5 percent from the earlier 2.1 climb will, but to 1.9 percent in 2009 fall will.)

Input 3:               Labour Minister Monika Lamperth appears not to have a sensitive side.
Reordered input:       Labour Minister Monika Lamperth appears a sensitive side not to have.
Baseline translation:  Arbeitsminister Monika Lamperth scheint nicht eine sensible Seite.
  (literally: Labour Minister Monika Lamperth appears not a sensitive side.)
Reordered translation: Arbeitsminister Monika Lamperth scheint eine sensible Seite nicht zu haben.
  (literally: Labour Minister Monika Lamperth appears a sensitive side not to have.)

Table 5: Example translations; the baseline has problems with verbal elements, reordered is correct

6.4 Error statistics

We manually checked 100 randomly chosen English sentences to see how often the problems described in the previous sections occur. From a total of 276 clauses, 29 were not reordered correctly. 20 errors were caused by incorrect parsing and/or POS tags, while the remaining 9 are mostly due to different kinds of coordination. Table 6 shows correctly reordered clauses which might pose a problem for translation (see sections 6.2-6.3). Although the positions of the verbs in the translations are now correct, the distance between subjects and verbs, or between verbs in a single VP, might lead to the generation of erroneously inflected verbs. The separate generation of German verbal morphology is an interesting area of future work, see (de Gispert and Mariño, 2008). We also found 2 problematic collocations, but note that this only gives a rough idea of the problem; further study is needed.
                   total   d >= 5 tokens
subject-verb        40        19
verb dependency     32        14
collocations         8         2

Table 6: total is the number of clauses found for the respective phenomenon. d >= 5 tokens is the number of clauses where the distance between relevant tokens is at least 5, which is problematic.

6.5 POS-based disambiguation of the English verbs

With respect to the problems described in 6.2.1 and 6.2.2, we carried out an experiment in which we used POS tags in order to disambiguate the English verbs. For example, the English verb said corresponds to the German participle gesagt, as well as to the finite verb in simple past, e.g. sagte. We attached the POS tags to the English verbs in order to simulate a disambiguating suffix of a verb (e.g. said -> said_VBN, said_VBD). The idea behind this was to extract the correct verbal translation phrases and score them with appropriate translation probabilities (e.g. p(said_VBN, gesagt) > p(said_VBN, sagte)).

We built and tested two PSMT systems using the data enriched with verbal POS tags. The first system is trained and tested on the original English sentences, while the contrastive one was trained and tested on the reordered English sentences. Evaluation results are shown in table 7. The baseline obtains a gain of 0.09 and the contrastive system of 0.05 BLEU points over the corresponding PSMT system without POS tags. Although there are verbs which are now generated correctly, the overall translation improvement lies under our expectation. We will directly model the inflection of German verbs in future work.

        Baseline + POS   Reordered + POS
BLEU        13.11            13.68

Table 7: BLEU scores of the baseline and the contrastive SMT system using verbal POS tags

7 Discussion and future work

We implemented reordering rules for English verbal complexes because their placement differs significantly from German placement. The implementation required dealing with three important problems: (i) definition of the clause boundaries, (ii) identification of the new verb positions and (iii) correct splitting of the verbal complexes.

We showed some phenomena for which a stochastic reordering would be more appropriate. For example, since in German the auxiliary and the main verb of a verbal complex can occupy different positions in a clause, we had to define the English counterparts of the two components of the German verbal complex. We defined non-finite English verbal elements as a part of the main verb complex, which are then moved together with the main verb. This rigid definition could be relaxed by considering multiple different splittings and movements of the English verbs.

Furthermore, the reordering rules are applied on a clause, not allowing for movements across the clause boundaries. However, we also showed that in some cases, the main verbs may be moved after the succeeding subclause. Stochastic rules could allow for both placements or carry out the more probable reordering given a specific context. We will address these issues in future work.

Unfortunately, some important contextual information is lost when splitting and moving English verbs. When English verbs are highly ambiguous, erroneous German verbs can be generated. The experiment described in section 6.5 shows that more effort should be made in order to overcome this problem. The incorporation of separate morphological generation of inflected German verbs would improve translation.

8 Conclusion

We presented a method for reordering English as a preprocessing step for English-to-German SMT. To our knowledge, this is one of the first papers which reports on experiments regarding the reordering problem for English-to-German SMT. We showed that the reordering rules specified in this work lead to improved translation quality. We observed that verbs are placed correctly more often than in the baseline, and that verbs which were omitted in the baseline are now often generated. We carried out a thorough analysis of the rules applied and discussed problems which are related to highly ambiguous English verbs. Finally, we presented ideas for future work.

Acknowledgments

This work was funded by Deutsche Forschungsgemeinschaft grant Models of Morphosyntax for Statistical Machine Translation.
References
Eugene Charniak and Mark Johnson. 2005. Coarse-
to-fine n-best parsing and MaxEnt discriminative
reranking. In ACL.
Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation. In ACL.
Adrià de Gispert and José B. Mariño. 2008. On the impact of morphology in English to Spanish statistical MT. Speech Communication, 50(11-12).
Chris Dyer, Smaranda Muresan, and Philip Resnik.
2008. Generalizing word lattice translation. In
ACL-HLT.
Dmitriy Genzel. 2010. Automatically learning
source-side reordering rules for large scale machine
translation. In COLING.
Deepa Gupta, Mauro Cettolo, and Marcello Federico.
2007. POS-based reordering models for statistical
machine translation. In Proceedings of the Machine
Translation Summit (MT-Summit).
Nizar Habash. 2007. Syntactic preprocessing for sta-
tistical machine translation. In Proceedings of the
Machine Translation Summit (MT-Summit).
Jason Katz-Brown, Slav Petrov, Ryan McDon-
ald, Franz Och, David Talbot, Hiroshi Ichikawa,
Masakazu Seno, and Hideto Kazawa. 2011. Train-
ing a parser for machine translation reordering. In
EMNLP.
Philipp Koehn, Hieu Hoang, Alexandra Birch,
Chris Callison-Burch, Marcello Federico, Nicola
Bertoldi, Brooke Cowan, Wade Shen, Christine
Moran, Richard Zens, Chris Dyer, Ondrej Bojar,
Alexandra Constantin, and Evan Herbst. 2007.
Moses: Open source toolkit for statistical machine
translation. In ACL, Demonstration Program.
Philipp Koehn. 2004. Statistical significance tests for
machine translation evaluation. In EMNLP.
Jan Niehues and Muntsin Kolss. 2009. A POS-based
model for long-range reorderings in SMT. In EACL
Workshop on Statistical Machine Translation.
Kishore Papineni, Salim Roukos, Todd Ward, and
Wei-Jing Zhu. 2002. BLEU: a method for auto-
matic evaluation of machine translation. In ACL.
Peng Xu, Jaecho Kang, Michael Ringgaard, and Franz Och. 2009. Using a dependency parser to improve SMT for subject-object-verb languages. In NAACL.
Syntax-Based Word Ordering Incorporating a Large-Scale Language Model

Yue Zhang, University of Cambridge, Computer Laboratory, yz360@cam.ac.uk
Graeme Blackwood, University of Cambridge, Engineering Department, gwb24@eng.cam.ac.uk
Stephen Clark, University of Cambridge, Computer Laboratory, sc609@cam.ac.uk
Abstract

A fundamental problem in text generation is word ordering. Word ordering is a computationally difficult problem, which can be constrained to some extent for particular applications, for example by using synchronous grammars for statistical machine translation. There have been some recent attempts at the unconstrained problem of generating a sentence from a multi-set of input words (Wan et al., 2009; Zhang and Clark, 2011). By using CCG and learning guided search, Zhang and Clark reported the highest scores on this task. One limitation of their system is the absence of an N-gram language model, which has been used by text generation systems to improve fluency. We take the Zhang and Clark system as the baseline, and incorporate an N-gram model by applying online large-margin training. Our system significantly improved on the baseline by 3.7 BLEU points.

1 Introduction

One fundamental problem in text generation is word ordering, which can be abstractly formulated as finding a grammatical order for a multi-set of words. The word ordering problem can also include word choice, where only a subset of the input words are used to produce the output.

Word ordering is a difficult problem. Finding the best permutation for a set of words according to a bigram language model, for example, is NP-hard, which can be proved by linear reduction from the traveling salesman problem. In practice, exploring the whole search space of permutations is often prevented by adding constraints. In phrase-based machine translation (Koehn et al., 2003; Koehn et al., 2007), a distortion limit is used to constrain the position of output phrases. In syntax-based machine translation systems such as Wu (1997) and Chiang (2007), synchronous grammars limit the search space so that polynomial time inference is feasible. In fluency improvement (Blackwood et al., 2010), parts of translation hypotheses identified as having high local confidence are held fixed, so that word ordering elsewhere is strictly local.

Some recent work attempts to address the fundamental word ordering task directly, using syntactic models and heuristic search. Wan et al. (2009) uses a dependency grammar to solve word ordering, and Zhang and Clark (2011) uses CCG (Steedman, 2000) for word ordering and word choice. The use of syntax models makes their search problems harder than word permutation using an N-gram language model only. Both methods apply heuristic search. Zhang and Clark developed a bottom-up best-first algorithm to build output syntax trees from input words, where search is guided by learning for both efficiency and accuracy. The framework is flexible in allowing a large range of constraints to be added for particular tasks.

We extend the work of Zhang and Clark (2011) (Z&C) in two ways. First, we apply online large-margin training to guide search. Compared to the perceptron algorithm on "constituent level" features by Z&C, our training algorithm is theoretically more elegant (see Section 3) and converges more smoothly empirically (see Section 5).
Using online large-margin training not only improves the output quality, but also allows the incorporation of an N-gram language model into the system. N-gram models have been used as a standard component in statistical machine translation, but have not been applied to the syntactic model of Z&C. Intuitively, an N-gram model can improve local fluency when added to a syntax model. Our experiments show that a four-gram model trained using the English GigaWord corpus gave improvements when added to the syntax-based baseline system.

The contributions of this paper are as follows. First, we improve on the performance of the Z&C system for the challenging task of the general word ordering problem. Second, we develop a novel method for incorporating a large-scale language model into a syntax-based generation system. Finally, we analyse large-margin training in the context of learning-guided best-first search, offering a novel solution to this computationally hard problem.

2 The statistical model and decoding algorithm

We take Z&C as our baseline system. Given a multi-set of input words, the baseline system builds a CCG derivation by choosing and ordering words from the input set. The scoring model is trained using CCGBank (Hockenmaier and Steedman, 2007), and best-first decoding is applied. We apply the same decoding framework in this paper, but apply an improved training process, and incorporate an N-gram language model into the syntax model. In this section, we describe and discuss the baseline statistical model and decoding framework, motivating our extensions.

2.1 Combinatory Categorial Grammar

CCG, and parsing with CCG, has been described elsewhere (Clark and Curran, 2007; Hockenmaier and Steedman, 2002); here we provide only a short description.

CCG (Steedman, 2000) is a lexicalized grammar formalism, which associates each word in a sentence with a lexical category. There is a small number of basic lexical categories, such as noun (N), noun phrase (NP), and prepositional phrase (PP). Complex lexical categories are formed recursively from basic categories and slashes, which indicate the directions of arguments. The CCG grammar used by our system is read off the derivations in CCGbank, following Hockenmaier and Steedman (2002), meaning that the CCG combinatory rules are encoded as rule instances, together with a number of additional rules which deal with punctuation and type-changing. Given a sentence, its CCG derivation can be produced by first assigning a lexical category to each word, and then recursively applying CCG rules bottom-up.

2.2 The decoding algorithm

In the decoding algorithm, a hypothesis is an edge, which corresponds to a sub-tree in a CCG derivation. Edges are built bottom-up, starting from leaf edges, which are generated by assigning all possible lexical categories to each input word. Each leaf edge corresponds to an input word with a particular lexical category. Two existing edges can be combined if there exists a CCG rule which combines their category labels, and if they do not contain the same input word more times than its total count in the input. The resulting edge is assigned a category label according to the combinatory rule, and covers the concatenated surface strings of the two sub-edges in their order of combination. New edges can also be generated by applying unary rules to a single existing edge. Starting from the leaf edges, the bottom-up process is repeated until a goal edge is found, and its surface string is taken as the output.

This derivation-building process is reminiscent of a bottom-up CCG parser in the edge combination mechanism. However, it is fundamentally different from a bottom-up parser. Since, for the generation problem, the order of two edges in their combination is flexible, the search problem is much harder than that of a parser. With no input order specified, no efficient dynamic-programming algorithm is available, and less contextual information is available for disambiguation due to the lack of an input string.

In order to combat the large search space, best-first search is applied, where candidate hypotheses are ordered by their scores and kept in an agenda, and a limited number of accepted hypotheses are recorded in a chart. Here the chart is essentially a set of beams, each of which contains the highest scored edges covering a particular number of words. Initially, all leaf edges are generated and scored, before they are put onto the agenda. During each step in the decoding process, the top edge from the agenda is expanded. If it is a goal edge, it is returned as the output, and the decoding finishes. Otherwise it is extended with unary rules, and combined with existing edges in the chart using binary rules to produce new edges. The resulting edges are scored and put onto the agenda, while the original edge is put onto the chart. The process repeats until a goal edge is found, or a timeout limit is reached. In the latter case, a default output is produced using existing edges in the chart.
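A minimal sketch of the edge hypotheses and the combination check described above is given below; the rule-lookup representation and the class layout are simplifying assumptions for illustration, not the baseline system's actual data structures.

```python
from collections import Counter

class Edge:
    def __init__(self, category, words, surface):
        self.category = category          # CCG category label of the sub-tree
        self.words = Counter(words)       # multiset of input words covered
        self.surface = surface            # surface string in generation order

def leaf_edges(word, lexical_categories):
    """One leaf edge per possible lexical category of an input word."""
    return [Edge(cat, [word], word) for cat in lexical_categories[word]]

def can_combine(left, right, input_counts, binary_rules):
    """Two edges combine if a CCG rule accepts their categories and the
    combined edge does not use any input word more often than supplied."""
    if (left.category, right.category) not in binary_rules:
        return False
    combined = left.words + right.words
    return all(combined[w] <= input_counts[w] for w in combined)
```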
Pseudocode for the decoder is shown as Algorithm 1. Again it is reminiscent of a best-first parser (Caraballo and Charniak, 1998) in the use of an agenda and a chart, but is fundamentally different due to the fact that there is no input order.

Algorithm 1 The decoding algorithm.
  a <- INITAGENDA()
  c <- INITCHART()
  while not TIMEOUT() do
    new <- []
    e <- POPBEST(a)
    if GOALTEST(e) then
      return e
    end if
    for e' in UNARY(e, grammar) do
      APPEND(new, e')
    end for
    for e~ in c do
      if CANCOMBINE(e, e~) then
        e' <- BINARY(e, e~, grammar)
        APPEND(new, e')
      end if
      if CANCOMBINE(e~, e) then
        e' <- BINARY(e~, e, grammar)
        APPEND(new, e')
      end if
    end for
    for e' in new do
      ADD(a, e')
    end for
    ADD(c, e)
  end while

2.3 Statistical model and feature templates

The baseline system uses a linear model to score hypotheses. For an edge e, its score is defined as:

    f(e) = Φ(e) · θ,

where Φ(e) represents the feature vector of e and θ is the parameter vector of the model.

During decoding, feature vectors are computed incrementally. When an edge is constructed, its score is computed from the scores of its sub-edges and the incrementally added structure:

    f(e) = Φ(e) · θ
         = ( Σ_{e_s ⊑ e} Φ(e_s) + φ(e) ) · θ
         = Σ_{e_s ⊑ e} Φ(e_s) · θ + φ(e) · θ
         = Σ_{e_s ⊑ e} f(e_s) + φ(e) · θ

In the equation, e_s ⊑ e represents a sub-edge of e. Leaf edges do not have any sub-edges. Unary-branching edges have one sub-edge, and binary-branching edges have two sub-edges. The feature vector φ(e) represents the incremental structure when e is constructed over its sub-edges. It is called the "constituent-level" feature vector by Z&C. For leaf edges, φ(e) includes information about the lexical category label; for unary-branching edges, φ(e) includes information from the unary rule; for binary-branching edges, φ(e) includes information from the binary rule, and additionally the token, POS and lexical category bigrams and trigrams that result from the surface string concatenation of its sub-edges. The score f(e) is therefore the sum of f(e_s) (for all e_s ⊑ e) plus φ(e) · θ. The feature templates we use are the same as those in the baseline system.

An important aspect of the scoring model is that edges with different sizes are compared with each other during decoding. Edges with different sizes can have different numbers of features, which can make the training of a discriminative model more difficult. For example, a leaf edge with one word can be compared with an edge over the entire input. One way of reducing the effect of the size difference is to include the size of the edge as part of the feature definitions, which can improve the comparability of edges of different sizes by reducing the number of features they have in common. Such features are applied by Z&C, and we make use of them here. Even with such features, the question of whether edges with different sizes are linearly separable is an empirical one.
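The incremental score computation can be sketched as follows; the sparse feature dictionaries and the edge attribute names are assumptions made for the example.

```python
# Incremental edge scoring: sum of sub-edge scores plus the dot product of the
# constituent-level (incremental) feature vector phi(e) with the weights theta.
def dot(features, theta):
    return sum(theta.get(f, 0.0) * v for f, v in features.items())

def score_edge(sub_edges, phi, theta):
    """sub_edges: already-scored edges (0, 1 or 2); phi: new features of this edge."""
    return sum(e.score for e in sub_edges) + dot(phi, theta)
```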
3 Training

The efficiency of the decoding algorithm is dependent on the statistical model, since the best-first search is guided to a solution by the model, and a good model will lead to a solution being found more quickly. In the ideal situation for the best-first decoding algorithm, the model is perfect and the score of any gold-standard edge is higher than the score of any non-gold-standard edge. As a result, the top edge on the agenda is always a gold-standard edge, and therefore all edges on the chart are gold-standard before the gold-standard goal edge is found. In this oracle procedure, the minimum number of edges is expanded, and the output is correct. The best-first decoder is perfect in not only accuracy, but also speed. In practice this ideal situation is rarely met, but it determines the goal of the training algorithm: to produce the perfect model and hence decoder.

If we take gold-standard edges as positive examples, and non-gold-standard edges as negative examples, the goal of the training problem can be viewed as finding a large separating margin between the scores of positive and negative examples. However, it is infeasible to generate the full space of negative examples, which is factorial in the size of the input. Like Z&C, we apply online learning, and generate negative examples based on the decoding algorithm.

Our training algorithm is shown as Algorithm 2. The algorithm is based on the decoder, where an agenda is used as a priority queue of edges to be expanded, and a set of accepted edges is kept in a chart. Similar to the decoding algorithm, the agenda is initialized using all possible leaf edges. During each step, the top of the agenda e is popped. If it is a gold-standard edge, it is expanded in exactly the same way as the decoder, with the newly generated edges being put onto the agenda, and e being inserted into the chart. If e is not a gold-standard edge, we take it as a negative example e-, and take the lowest scored gold-standard edge on the agenda e+ as a positive example, in order to make an update to the model parameter vector θ. Our parameter update algorithm is different from the baseline perceptron algorithm, as will be discussed later. After updating the parameters, the scores of agenda edges above and including e-, together with all chart edges, are updated, and e- is discarded before the start of the next processing step. By not putting any non-gold-standard edges onto the chart, the training speed is much faster; on the other hand a wide range of negative examples is pruned. We leave for further work possible alternative methods to generate more negative examples during training.

Algorithm 2 The training algorithm.
  a <- INITAGENDA()
  c <- INITCHART()
  while not TIMEOUT() do
    new <- []
    e <- POPBEST(a)
    if GOLDSTANDARD(e) and GOALTEST(e) then
      return e
    end if
    if not GOLDSTANDARD(e) then
      e- <- e
      e+ <- MINGOLD(a)
      UPDATEPARAMETERS(e+, e-)
      RECOMPUTESCORES(a, c)
      continue
    end if
    for e' in UNARY(e, grammar) do
      APPEND(new, e')
    end for
    for e~ in c do
      if CANCOMBINE(e, e~) then
        e' <- BINARY(e, e~, grammar)
        APPEND(new, e')
      end if
      if CANCOMBINE(e~, e) then
        e' <- BINARY(e~, e, grammar)
        APPEND(new, e')
      end if
    end for
    for e' in new do
      ADD(a, e')
    end for
    ADD(c, e)
  end while
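The parameter update step used inside this loop can be sketched as below. It follows the 1-best MIRA-style closed form discussed in the next section; the sparse feature interface (global_features, score) is an assumption for illustration, not the system's actual representation.

```python
def update_parameters(theta, e_pos, e_neg):
    """Move theta minimally so that score(e_pos) exceeds score(e_neg) by a margin of 1."""
    diff = {}                                  # Phi(e+) - Phi(e-), sparse
    for feat, val in e_pos.global_features.items():
        diff[feat] = diff.get(feat, 0.0) + val
    for feat, val in e_neg.global_features.items():
        diff[feat] = diff.get(feat, 0.0) - val
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return
    loss = e_neg.score - e_pos.score + 1.0     # f(e-) - f(e+) + 1
    if loss <= 0.0:                            # margin already satisfied, no update
        return
    step = loss / norm_sq
    for feat, val in diff.items():
        theta[feat] = theta.get(feat, 0.0) + step * val
```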
Another way of viewing the training process is that it pushes gold-standard edges towards the top of the agenda, and crucially pushes them above non-gold-standard edges. This is the view described by Z&C. Given a positive example e+ and a negative example e-, they use the perceptron algorithm to penalize the score of φ(e-) and reward the score of φ(e+), but do not update parameters for the sub-edges of e+ and e-. An argument for not penalizing the sub-edge scores for e- is that the sub-edges must be gold-standard edges (since the training process is constructed so that only gold-standard edges are expanded). From the perspective of correctness, it is unnecessary to find a margin between the sub-edges of e+ and those of e-, since both are gold-standard edges.

However, since the score of an edge not only represents its correctness, but also affects its priority on the agenda, promoting the sub-edges of e+ can lead to "easier" edges being constructed before "harder" ones (i.e. those that are less likely to be correct), and therefore improve the output accuracy. This perspective has been observed by other works on learning-guided search (Shen et al., 2007; Shen and Joshi, 2008; Goldberg and Elhadad, 2010). Intuitively, the score difference between easy gold-standard and harder gold-standard edges should not be as great as the difference between gold-standard and non-gold-standard edges. The perceptron update cannot provide such control of separation, because the amount of update is fixed to 1.

As described earlier, we treat parameter update as finding a separation between correct and incorrect edges, in which the global feature vectors Φ, rather than φ, are considered. Given a positive example e+ and a negative example e-, we make a minimum update so that the score of e+ is higher than that of e- with some margin:

    argmin_θ ‖θ − θ₀‖,  s.t. (Φ(e+) − Φ(e-)) · θ ≥ 1

where θ₀ and θ denote the parameter vectors before and after the update, respectively. The update is similar to the update of online large-margin learning algorithms such as 1-best MIRA (Crammer et al., 2006), and has a closed-form solution:

    θ ← θ₀ + ( (f(e-) − f(e+) + 1) / ‖Φ(e+) − Φ(e-)‖² ) · (Φ(e+) − Φ(e-))

In this update, the global feature vectors Φ(e+) and Φ(e-) are used. Unlike Z&C, the scores of sub-edges of e+ and e- are also updated, so that the sub-edges of e- are less prioritized than those of e+. We show empirically that this training algorithm significantly outperforms the perceptron training of the baseline system in Section 5. An advantage of our new training algorithm is that it enables the accommodation of a separately trained N-gram model into the system.

4 Incorporating an N-gram language model

Since the seminal work of the IBM models (Brown et al., 1993), N-gram language models have been used as a standard component in statistical machine translation systems to control output fluency. For the syntax-based generation system, the incorporation of an N-gram language model can potentially improve the local fluency of output sequences. In addition, the N-gram language model can be trained separately using a large amount of data, while the syntax-based model requires manual annotation for training.

The standard method for the combination of a syntax model and an N-gram model is linear interpolation. We incorporate fourgram, trigram and bigram scores into our syntax model, so that the score of an edge e becomes:

    F(e) = f(e) + g(e)
         = f(e) + λ·g_four(e) + µ·g_tri(e) + ν·g_bi(e),

where f is the syntax model score, and g is the N-gram model score. g consists of three components, g_four, g_tri and g_bi, representing the log-probabilities of fourgrams, trigrams and bigrams from the language model, respectively. λ, µ and ν are the corresponding weights.

During decoding, F(e) is computed incrementally. Again, denoting the sub-edges of e as e_s,

    F(e) = f(e) + g(e)
         = Σ_{e_s ⊑ e} F(e_s) + φ(e) · θ + g′(e)

Here g′(e) = λ·g′_four(e) + µ·g′_tri(e) + ν·g′_bi(e) is the sum of log-probabilities of the new N-grams resulting from the construction of e. For leaf edges and unary-branching edges, no new N-grams result from their construction (i.e. g′ = 0). For a binary-branching edge, new N-grams result from the surface-string concatenation of its sub-edges. The sums of log-probabilities of the new fourgrams, trigrams and bigrams contribute to g′ with weights λ, µ and ν, respectively.

For training, there are at least three methods to tune θ, λ, µ and ν. One simple method is to train the syntax model independently, and select λ, µ, and ν empirically from a range of candidate values according to development tests. We call this method test-time interpolation. An alternative is to select λ, µ and ν first, initializing the θ vector as all zeroes, and then run the training algorithm for θ taking into account the N-gram language model.
740
negative examples; the training algorithm finds a CCGBank Sentences Tokens
value of that best suits the precomputed , training 39,604 929,552
and values, together with the N -gram language development 1,913 45,422
model. We call this method g-precomputed in- GigaWord v4 Sentences Tokens
terpolation. Yet another method is to initialize , AFP 30,363,052 684,910,697
, and as all zeroes, and run the training al- XIN 15,982,098 340,666,976
gorithm taking into account the N -gram language
model. We call this method g-free interpolation. Table 1: Number of sentences and tokens by language
The incorporation of an N -gram language model source.
model into the syntax-based generation system is
weakly analogous to N -gram model insertion for standard sequence, assuming that for some prob-
syntax-based statistical machine translation sys- lems the ambiguities can be reduced (e.g. when
tems, both of which apply a score from the N - the input is already partly correctly ordered).
gram model component in a derivation-building Z&C use different probability cutoff levels (the
process. As discussed earlier, polynomial-time parameter in the supertagger) to control the
decoding is typically feasible for syntax-based pruning. Here we focus mainly on the dictionary
machine translation systems without an N -gram method, which leaves lexical category disam-
language model, due to constraints from the biguation entirely to the generation system. For
grammar. In these cases, incorporation of N - comparison, we also perform experiments with
gram language models can significantly increase lexical category pruning. We chose = 0.0001,
the complexity of a dynamic-programming de- which leaves 5.4 leaf edges per word on average.
coder (Bar-Hillel et al., 1961). Efficient search We used the SRILM Toolkit (Stolcke, 2002)
has been achieved using chart pruning (Chiang, to build a true-case 4-gram language model es-
2007) and iterative numerical approaches to con- timated over the CCGBank training and develop-
strained optimization (Rush and Collins, 2011). ment data and a large additional collection of flu-
In contrast, the incorporation of an N -gram lan- ent sentences in the Agence France-Presse (AFP)
guage model into our decoder is more straightfor- and Xinhua News Agency (XIN) subsets of the
ward, and does not add to its asymptotic complex- English GigaWord Fourth Edition (Parker et al.,
ity, due to the heuristic nature of the decoder. 2009), a total of over 1 billion tokens. The Gi-
gaWord data was first pre-processed to replicate
5 Experiments
the CCGBank tokenization. The total number
We use sections 221 of CCGBank to train our of sentences and tokens in each LM component
syntax model, section 00 for development and is shown in Table 1. The language model vo-
section 23 for the final test. Derivations from cabulary consists of the 46,574 words that oc-
CCGBank are transformed into inputs by turn- cur in the concatenation of the CCGBank train-
ing their surface strings into multi-sets of words. ing, development, and test sets. The LM proba-
Following Z&C, we treat base noun phrases (i.e. bilities are estimated using modified Kneser-Ney
NP s that do not recursively contain other NPs) as smoothing (Kneser and Ney, 1995) with interpo-
atomic units for the input. Output sequences are lation of lower n-gram orders.
compared with the original sentences to evaluate
their quality. We follow previous work and use 5.1 Development experiments
the BLEU metric (Papineni et al., 2002) to com- A set of development test results without lexical
pare outputs with references. category pruning (i.e. using the full dictionary) is
Z&C use two methods to construct leaf edges. shown in Table 2. We train the baseline system
The first is to assign lexical categories according and our systems under various settings for 10 iter-
to a dictionary. There are 26.8 lexical categories ations, and measure the output BLEU scores after
for each word on average using this method, cor- each iteration. The timeout value for each sen-
responding to 26.8 leaf edges. The other method tence is set to 5 seconds. The highest score (max
is to use a pre-processing step a CCG supertag- BLEU) and averaged score (avg. BLEU) of each
ger (Clark and Curran, 2007) to prune can- system over the 10 training iterations are shown
didate lexical categories according to the gold- in the table.

741
Method max BLEU avg. BLEU
baseline 38.47 37.36
margin 41.20 39.70
margin +LM (g-precomputed) 41.50 40.84
margin +LM ( = 0, = 0, = 0) 40.83
margin +LM ( = 0.08, = 0.016, = 0.004) 38.99
margin +LM ( = 0.4, = 0.08, = 0.02) 36.17
margin +LM ( = 0.8, = 0.16, = 0.04) 34.74

Table 2: Development experiments without lexical category pruning.

The first three rows represent the baseline sys- 45

tem, our largin-margin training system (margin), 44

and our system with the N -gram model incorpo- 43

rated using g-precomputed interpolation. For in- 42

BLEU
terpolation we manually chose = 0.8, = 0.16 41

and = 0.04, respectively. These values could 40

be optimized by development experiments with 39


baseline
margin
alternative configurations, which may lead to fur- 38 margin +LM

ther improvements. Our system with large-margin 37


1 2 3 4 5 6 7 8 9 10
training gives higher BLEU scores than the base- training iteration

line system consistently over all iterations. The


N -gram model led to further improvements. Figure 1: Development experiments with lexical cate-
gory pruning ( = 0.0001).
The last four rows in the table show results
of our system with the N -gram model added us-
ing test-time interpolation. The syntax model is training. One question that arises is whether g-
trained with the optimal number of iterations, and free interpolation will outperform g-precomputed
different , , and values are used to integrate interpolation. g-free interpolation offers the free-
the language model. Compared with the system dom of , and during training, and can poten-
using no N -gram model (margin), test-time inter- tially reach a better combination of the parameter
polation did not improve the accuracies. values. However, the training algorithm failed to
The row with , , = 0 represents our system converge with g-free interpolation. One possible
with the N -gram model loaded, and the scores explanation is that real-valued features from the
gf our , gtri and gbi computed for each N -gram language model made our large-margin training
during decoding, but the scores of edges are com- harder. Another possible reason is that our train-
puted without using N -gram probabilities. The ing process with heavy pruning does not accom-
scoring model is the same as the syntax model modate this complex model.
(margin), but the results are lower than the row Figure 1 shows a set of development experi-
margin, because computing N -gram probabil- ments with lexical category pruning (with the su-
ities made the system slower, exploring less hy- pertagger parameter = 0.0001). The scores
potheses under the same timeout setting.1 of the three different systems are calculated by
The comparison between g-precomputed inter- varying the number of training iterations. The
polation and test-time interpolation shows that the large-margin training system (margin) gave con-
system gives better scores when the syntax model sistently better scores than the baseline system,
takes into consideration the N -gram model during and adding a language model (margin +LM) im-
proves the scores further.
1
More decoding time could be given to the slower N - Table 3 shows some manually chosen examples
gram system, but we use 5 seconds as the timeout setting
for all the experiments, giving the methods with the N -gram
for which our system gave significant improve-
language model a slight disadvantage, as shown by the two ments over the baseline. For most other sentences
rows margin and margin +LM (, , = 0). the improvements are not as obvious. For each

742
baseline margin margin +LM
as a nonexecutive director Pierre Vinken 61 years old , the board will join as a as a nonexecutive director Pierre Vinken
, 61 years old , will join the board . 29 nonexecutive director Nov. 29 , Pierre , 61 years old , will join the board Nov.
Nov. Vinken . 29 .
Lorillard nor smokers were aware of the of any research who studied Neither the Neither Lorillard nor any research on the
Kent cigarettes of any research on the workers were aware of smokers on the workers who studied the Kent cigarettes
workers who studied the researchers Kent cigarettes nor the researchers were aware of smokers of the researchers
.
you But 35 years ago have to recognize recognize But you took place that these But you have to recognize that these
that these events took place . events have to 35 years ago . events took place 35 years ago .
investors to pour cash into money funds Despite investors , yields continue to Despite investors , recent declines in
continue in Despite yields recent declines pour into money funds recent declines in yields continue to pour cash into money
cash . funds .
yielding The top money funds are cur- The top money funds currently are yield- The top money funds are yielding well
rently well over 9 % . ing well over 9 % . over 9 % currently .
where A buffet breakfast , held in the mu- everyday visitors are banned to where A buffet breakfast , everyday visitors are
seum was food and drinks to . everyday A buffet breakfast was held , food and banned to where food and drinks was
visitors banned drinks in the museum . held in the museum .
A Commonwealth Edison spokesman tracking A Commonwealth Edison an administrative nightmare whose ad-
said an administrative nightmare would spokesman said that the two million cus- dresses would be tracking down A Com-
be tracking down the past 3 12 years that tomers whose addresses have changed monwealth Edison spokesman said that
the two million customers have . whose down during the past 3 12 years would the two million customers have changed
changed be an administrative nightmare . during the past 3 12 years .
The $ 2.5 billion Byron 1 plant , Ill. , was The $ 2.5 billion Byron 1 plant was near The $ 2.5 billion Byron 1 plant near
completed . near Rockford in 1985 completed in Rockford , Ill. , 1985 . Rockford , Ill. , was completed in 1985 .
will ( During its centennial year , The as The Wall Street Journal ( During its During its centennial year events will re-
Wall Street Journal report events of the centennial year , milestones stand of port , The Wall Street Journal that stand
past century that stand as milestones of American business history that will re- as milestones of American business his-
American business history . ) port events of the past century . ) tory ( of the past century ) .

Table 3: Some chosen examples with significant improvements (supertagger parameter = 0.0001).

method, the examples are chosen from the devel- syntactically grammatical, but are semantically
opment output with lexical category pruning, af- anomalous. For example, person names are often
ter the optimal number of training iterations, with confused with company names, verbs often take
the timeout set to 5s. We also tried manually se- unrelated subjects and objects. The problem is
lecting examples without lexical category prun- much more severe for long sentences, which have
ing, but the improvements were not as obvious, more ambiguities. For specific tasks, extra infor-
partly because the overall fluency was lower for mation (such as the source text for machine trans-
all the three systems. lation) can be available to reduce ambiguities.
Table 4 shows a set of examples chosen ran-
domly from the development test outputs of our 6 Final results
system with the N -gram model. The optimal
number of training iterations is used, and a time- The final results of our system without lexical cat-
out of 1 minute is used in addition to the 5s time- egory pruning are shown in Table 5. Row W09
out for comparison. With more time to decode CLE and W09 AB show the results of the
each input, the system gave a BLEU score of maximum spanning tree and assignment-based al-
44.61, higher than 41.50 with the 5s timout. gorithms of Wan et al. (2009); rows margin
While some of the outputs we examined are and margin +LM show the results of our large-
reasonably fluent, most are to some extent frag- margin training system and our system with the
mentary.2 In general, the system outputs are N -gram model. All these results are directly com-
still far below human fluency. Some samples are parable since we do not use any lexical category
2
pruning for this set of results. For each of our
Part of the reason for some fragmentary outputs is the
default output mechanism: partial derivations from the chart
systems, we fix the number of training iterations
are greedily put together when timeout occurs before a goal according to development test scores. Consis-
hypothesis is found. tent with the development experiments, our sys-

743
timeout = 5s timeout = 1m
drooled the cars and drivers , like Fortune 500 executives . over After schoolboys drooled over the cars and drivers , the race
the race like Fortune 500 executives .
One big reason : thin margins . One big reason : thin margins .
You or accountants look around ... and at an eye blinks . pro- blinks nobody You or accountants look around ... and at an eye
fessional ballplayers . professional ballplayers
most disturbing And of it , are educators , not students , for the And blamed for the wrongdoing , educators , not students who
wrongdoing is who . are disturbing , much of it is most .
defeat coaching aids the purpose of which is , He and other gauge coaching aids learning progress can and other critics say
critics say can to . standardized tests learning progress the purpose of which is to defeat , standardized tests .
The federal government of government debt because Congress The federal government suspended sales of government debt
has lifted the ceiling on U.S. savings bonds suspended sales because Congress has nt lifted the ceiling on U.S. savings
bonds .

Table 4: Some examples chosen at random from development test outputs without lexical category pruning.

System BLEU 2011). Unlike our system, and Wan et al. (2009),
W09 CLE 26.8 input dependencies provide additional informa-
W09 AB 33.7 tion to these systems. Although the search space
Z&C11 40.1 can be constrained by the assumption of projec-
margin 42.5 tivity, permutation of modifiers of the same head
margin +LM 43.8 word makes exact inference for tree lineariza-
tion intractable. The above systems typically ap-
Table 5: Test results without lexical category pruning. ply approximate inference, such as beam-search.
While syntax-based features are commonly used
System BLEU by these systems for linearization, Filippova and
Z&C11 43.2 Strube (2009) apply a trigram model to control
local fluency within constituents. A dependency-
margin 44.7
based N-gram model has also been shown effec-
margin +LM 46.1
tive for the linearization task (Guo et al., 2011).
The best-first inference and timeout mechanism
Table 6: Test results with lexical category pruning (su-
of our system is similar to that of White (2004), a
pertagger parameter = 0.0001).
surface realizer from logical forms using CCG.

tem outperforms the baseline methods. The acu- 8 Conclusion


racies are significantly higher when the N -gram
We studied the problem of word-ordering using
model is incorporated.
a syntactic model and allowing permutation. We
Table 6 compares our system with Z&C using
took the model of Zhang and Clark (2011) as the
lexical category pruning ( = 0.0001) and a 5s
baseline, and extended it with online large-margin
timeout for fair comparison. The results are sim-
training and an N -gram language model. These
ilar to Table 5: our large-margin training systems
extentions led to improvements in the BLEU eval-
outperforms the baseline by 1.5 BLEU points, and
uation. Analyzing the generated sentences sug-
adding the N -gram model gave a further 1.4 point
gests that, while highly fluent outputs can be pro-
improvement. The scores could be significantly
duced for short sentences ( 10 words), the sys-
increased by using a larger timeout, as shown in
tem fluency in general is still way below human
our earlier development experiments.
standard. Future work remains to apply the sys-
tem as a component for specific text generation
7 Related Work
tasks, for example machine translation.
There is a recent line of research on text-to-
text generation, which studies the linearization of Acknowledgements
dependency structures (Barzilay and McKeown, Yue Zhang and Stephen Clark are supported by the Eu-
2005; Filippova and Strube, 2007; Filippova and ropean Union Seventh Framework Programme (FP7-
Strube, 2009; Bohnet et al., 2010; Guo et al., ICT-2009-4) under grant agreement no. 247762.

744
References Short Papers, pages 225228, Boulder, Colorado,
June. Association for Computational Linguistics.
Yehoshua Bar-Hillel, M. Perles, and E. Shamir. 1961.
Yoav Goldberg and Michael Elhadad. 2010. An effi-
On formal properties of simple phrase structure
cient algorithm for easy-first non-directional depen-
grammars. Zeitschrift fur Phonetik, Sprachwis-
dency parsing. In Human Language Technologies:
senschaft und Kommunikationsforschung, 14:143
The 2010 Annual Conference of the North American
172. Reprinted in Y. Bar-Hillel. (1964). Language
Chapter of the Association for Computational Lin-
and Information: Selected Essays on their Theory
guistics, pages 742750, Los Angeles, California,
and Application, Addison-Wesley 1964, 116150.
June. Association for Computational Linguistics.
Regina Barzilay and Kathleen McKeown. 2005. Sen-
Yuqing Guo, Deirdre Hogan, and Josef van Genabith.
tence fusion for multidocument news summariza-
2011. Dcu at generation challenges 2011 surface
tion. Computational Linguistics, 31(3):297328. realisation track. In Proceedings of the Generation
Graeme Blackwood, Adria de Gispert, and William Challenges Session at the 13th European Workshop
Byrne. 2010. Fluency constraints for minimum on Natural Language Generation, pages 227229,
Bayes-risk decoding of statistical machine trans- Nancy, France, September. Association for Compu-
lation lattices. In Proceedings of the 23rd Inter- tational Linguistics.
national Conference on Computational Linguistics Julia Hockenmaier and Mark Steedman. 2002. Gen-
(Coling 2010), pages 7179, Beijing, China, Au- erative models for statistical parsing with Combi-
gust. Coling 2010 Organizing Committee. natory Categorial Grammar. In Proceedings of the
Bernd Bohnet, Leo Wanner, Simon Mill, and Alicia 40th Meeting of the ACL, pages 335342, Philadel-
Burga. 2010. Broad coverage multilingual deep phia, PA.
sentence generation with a stochastic multi-level re- Julia Hockenmaier and Mark Steedman. 2007. CCG-
alizer. In Proceedings of the 23rd International bank: A corpus of CCG derivations and dependency
Conference on Computational Linguistics (Coling structures extracted from the Penn Treebank. Com-
2010), pages 98106, Beijing, China, August. Col- putational Linguistics, 33(3):355396.
ing 2010 Organizing Committee.
R. Kneser and H. Ney. 1995. Improved backing-off
Peter F. Brown, Stephen Della Pietra, Vincent J. Della for m-gram language modeling. In International
Pietra, and Robert L. Mercer. 1993. The mathe- Conference on Acoustics, Speech, and Signal Pro-
matics of statistical machine translation: Parameter cessing, 1995. ICASSP-95, volume 1, pages 181
estimation. Computational Linguistics, 19(2):263 184.
311. Philip Koehn, Franz Och, and Daniel Marcu. 2003.
Sharon A. Caraballo and Eugene Charniak. 1998. Statistical phrase-based translation. In Proceedings
New figures of merit for best-first probabilistic chart of NAACL/HLT, Edmonton, Canada, May.
parsing. Comput. Linguist., 24:275298, June. Philipp Koehn, Hieu Hoang, Alexandra Birch,
David Chiang. 2007. Hierarchical Phrase- Chris Callison-Burch, Marcello Federico, Nicola
based Translation. Computational Linguistics, Bertoldi, Brooke Cowan, Wade Shen, Christine
33(2):201228. Moran, Richard Zens, Chris Dyer, Ondrej Bojar,
Stephen Clark and James R. Curran. 2007. Wide- Alexandra Constantin, and Evan Herbst. 2007.
coverage efficient statistical parsing with CCG Moses: Open source toolkit for statistical ma-
and log-linear models. Computational Linguistics, chine translation. In Proceedings of the 45th An-
33(4):493552. nual Meeting of the Association for Computational
Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Linguistics Companion Volume Proceedings of the
Shalev-Shwartz, and Yoram Singer. 2006. Online Demo and Poster Sessions, pages 177180, Prague,
passive-aggressive algorithms. Journal of Machine Czech Republic, June. Association for Computa-
Learning Research, 7:551585. tional Linguistics.
Katja Filippova and Michael Strube. 2007. Gener- Kishore Papineni, Salim Roukos, Todd Ward, and
ating constituent order in german clauses. In Pro- Wei-Jing Zhu. 2002. Bleu: a method for auto-
ceedings of the 45th Annual Meeting of the Asso- matic evaluation of machine translation. In Pro-
ciation of Computational Linguistics, pages 320 ceedings of 40th Annual Meeting of the Associa-
327, Prague, Czech Republic, June. Association for tion for Computational Linguistics, pages 311318,
Computational Linguistics. Philadelphia, Pennsylvania, USA, July. Association
Katja Filippova and Michael Strube. 2009. Tree lin- for Computational Linguistics.
earization in english: Improving language model Robert Parker, David Graff, Junbo Kong, Ke Chen, and
based approaches. In Proceedings of Human Lan- Kazuaki Maeda. 2009. English Gigaword Fourth
guage Technologies: The 2009 Annual Conference Edition, Linguistic Data Consortium.
of the North American Chapter of the Association Alexander M. Rush and Michael Collins. 2011. Exact
for Computational Linguistics, Companion Volume: decoding of syntactic translation models through la-

745
grangian relaxation. In Proceedings of the 49th An-
nual Meeting of the Association for Computational
Linguistics: Human Language Technologies, pages
7282, Portland, Oregon, USA, June. Association
for Computational Linguistics.
Libin Shen and Aravind Joshi. 2008. LTAG depen-
dency parsing with bidirectional incremental con-
struction. In Proceedings of the 2008 Conference
on Empirical Methods in Natural Language Pro-
cessing, pages 495504, Honolulu, Hawaii, Octo-
ber. Association for Computational Linguistics.
Libin Shen, Giorgio Satta, and Aravind Joshi. 2007.
Guided learning for bidirectional sequence classi-
fication. In Proceedings of ACL, pages 760767,
Prague, Czech Republic, June.
Mark Steedman. 2000. The Syntactic Process. The
MIT Press, Cambridge, Mass.
Andreas Stolcke. 2002. SRILM - an extensible lan-
guage modeling toolkit. In Proceedings of the In-
ternational Conference on Spoken Language Pro-
cessing, pages 901904.
Stephen Wan, Mark Dras, Robert Dale, and Cecile
Paris. 2009. Improving grammaticality in statisti-
cal sentence generation: Introducing a dependency
spanning tree algorithm with an argument satisfac-
tion model. In Proceedings of the 12th Conference
of the European Chapter of the ACL (EACL 2009),
pages 852860, Athens, Greece, March. Associa-
tion for Computational Linguistics.
Michael White. 2004. Reining in CCG chart realiza-
tion. In Proc. INLG-04, pages 182191.
Dekai Wu. 1997. Stochastic inversion transduction
grammars and bilingual parsing of parallel corpora.
Computational Linguistics, 23(3).
Yue Zhang and Stephen Clark. 2011. Syntax-
based grammaticality improvement using CCG and
guided search. In Proceedings of the 2011 Confer-
ence on Empirical Methods in Natural Language
Processing, pages 11471157, Edinburgh, Scot-
land, UK., July. Association for Computational Lin-
guistics.

746
Midge: Generating Image Descriptions From Computer Vision
Detections

Margaret Mitchell Jesse Dodge Amit Goyal Kota Yamaguchi Karl Stratosk
Xufeng Han Alyssa Mensch Alex Berg Tamara Berg Hal Daume III

U. of Aberdeen and Oregon Health and Science University, m.mitchell@abdn.ac.uk

Stony Brook University, {aberg,tlberg,xufhan,kyamagu}@cs.stonybrook.edu

U. of Maryland, {hal,amit}@umiacs.umd.edu
k
Columbia University, stratos@cs.columbia.edu

U. of Washington, dodgejesse@gmail.com, MIT, acmensch@mit.edu

Abstract
This paper introduces a novel generation
system that composes humanlike descrip-
tions of images from computer vision de-
tections. By leveraging syntactically in-
formed word co-occurrence statistics, the
generator filters and constrains the noisy
detections output from a vision system to The bus by the road with a clear blue sky
generate syntactic trees that detail what Figure 1: Example image with generated description.
the computer vision system sees. Results
show that the generation system outper- formation from a language model, or to be short
forms state-of-the-art systems, automati- and simple, but as true to the image as possible.
cally generating some of the most natural Rather than using a fixed template capable of
image descriptions to date.
generating one kind of utterance, our approach
therefore lies in generating syntactic trees. We
1 Introduction use a tree-generating process (Section 4.3) simi-
lar to a Tree Substitution Grammar, but preserv-
It is becoming a real possibility for intelligent sys-
ing some of the idiosyncrasies of the Penn Tree-
tems to talk about the visual world. New ways of
bank syntax (Marcus et al., 1995) on which most
mapping computer vision to generated language
statistical parsers are developed. This allows us
have emerged in the past few years, with a fo-
to automatically parse and train on an unlimited
cus on pairing detections in an image to words
amount of text, creating data-driven models that
(Farhadi et al., 2010; Li et al., 2011; Kulkarni et
flesh out descriptions around detected objects in a
al., 2011; Yang et al., 2011). The goal in connect-
principled way, based on what is both likely and
ing vision to language has varied: systems have
syntactically well-formed.
started producing language that is descriptive and
poetic (Li et al., 2011), summaries that add con- An example generated description is given in
tent where the computer vision system does not Figure 1, and example vision output/natural lan-
(Yang et al., 2011), and captions copied directly guage generation (NLG) input is given in Fig-
from other images that are globally (Farhadi et al., ure 2. The system (Midge) generates descrip-
2010) and locally similar (Ordonez et al., 2011). tions in present-tense, declarative phrases, as a
nave viewer without prior knowledge of the pho-
A commonality between all of these ap-
tographs content.1
proaches is that they aim to produce natural-
sounding descriptions from computer vision de- Midge is built using the following approach:
tections. This commonality is our starting point: An image processed by computer vision algo-
We aim to design a system capable of producing rithms can be characterized as a triple <Ai , Bi ,
natural-sounding descriptions from computer vi- Ci >, where:
sion detections that are flexible enough to become 1
Midge is available to try online at:
more descriptive and poetic, or include likely in- http://recognition.cs.stonybrook.edu:8080/mitchema/midge/.

747
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747756,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
stuff: sky .999 a description to select for the kinds of structures it
id: 1
tends to appear in (syntactic constraints) and the
atts: clear:0.432, blue:0.945
grey:0.853, white:0.501 ... other words it tends to occur with (semantic con-
b. box: (1,1 440,141) straints). This is a data-driven way to generate
stuff: road .908 likely adjectives, prepositions, determiners, etc.,
id: 2
taking the intersection of what the vision system
atts: wooden:0.722 clear:0.020 ...
b. box: (1,236 188,94) predicts and how the object noun tends to be de-
object: bus .307 scribed.
id: 3
atts: black:0.872, red:0.244 ... 2 Background
b. box: (38,38 366,293) Our approach to describing images starts with
preps: id 1, id 2: by id 1, id 3: by id 2, id 3: below
a system from Kulkarni et al. (2011) that com-
Figure 2: Example computer vision output and natu-
ral language generation input. Values correspond to
poses novel captions for images in the PASCAL
scores from the vision detections. sentence data set,2 introduced in Rashtchian et
al. (2010). This provides multiple object detec-
tions based on Felzenszwalbs mixtures of multi-
Ai is the set of object/stuff detections with
scale deformable parts models (Felzenszwalb et
bounding boxes and associated attribute
al., 2008), and stuff detections (roughly, mass
detections within those bounding boxes.
nouns, things like sky and grass) based on linear
Bi is the set of action or pose detections as-
SVMs for low level region features.
sociated to each ai Ai .
Appearance characteristics are predicted using
Ci is the set of spatial relationships that hold
trained detectors for colors, shapes, textures, and
between the bounding boxes of each pair
materials, an idea originally introduced in Farhadi
ai , aj Ai .
et al. (2009). Local texture, Histograms of Ori-
Similarly, a description of an image can be char- ented Gradients (HOG) (Dalal and Triggs, 2005),
acterized as a triple <Ad , Bd , Cd > where: edge, and color descriptors inside the bounding
box of a recognized object are binned into his-
Ad is the set of nouns in the description with tograms for a vision system to learn to recognize
associated modifiers. when an object is rectangular, wooden, metal,
Bd is the set of verbs associated to each ad etc. Finally, simple preposition functions are used
Ad . to compute the spatial relations between objects
Cd is the set of prepositions that hold be- based on their bounding boxes.
tween each pair of ad , ae Ad . The original Kulkarni et al. (2011) system gen-
erates descriptions with a template, filling in slots
With this representation, mapping <Ai , Bi , Ci >
by combining computer vision outputs with text
to <Ad , Bd , Cd > is trivial. The problem then
based statistics in a conditional random field to
becomes: (1) How to filter out detections that
predict the most likely image labeling. Template-
are wrong; (2) how to order the objects so that
based generation is also used in the recent Yang et
they are mentioned in a natural way; (3) how to
al. (2011) system, which fills in likely verbs and
connect these ordered objects within a syntacti-
prepositions by dependency parsing the human-
cally/semantically well-formed tree; and (4) how
written UIUC Pascal-VOC dataset (Farhadi et al.,
to add further descriptive information from lan-
2010) and selecting the dependent/head relation
guage modeling alone, if required.
with the highest log likelihood ratio.
Our solution lies in using Ai and Ad as descrip-
Template-based generation is useful for auto-
tion anchors. In computer vision, object detec-
matically generating consistent sentences, how-
tions form the basis of action/pose, attribute, and
ever, if the goal is to vary or add to the text pro-
spatial relationship detections; therefore, in our
duced, it may be suboptimal (cf. Reiter and Dale
approach to language generation, nouns for the
(1997)). Work that does not use template-based
object detections are used as the basis for the de-
generation includes Yao et al. (2010), who gener-
scription. Likelihood estimates of syntactic struc-
ate syntactic trees, similar to the approach in this
ture and word co-occurrence are conditioned on
2
object nouns, and this enables each noun head in http://vision.cs.uiuc.edu/pascal-sentences/

748
black, blue, brown, colorful, golden, gray,
green, orange, pink, red, silver, white, yel-
low, bare, clear, cute, dirty, feathered, flying,
furry, pine, plastic, rectangular, rusty, shiny,
spotted, striped, wooden
Table 1: Modifiers used to extract training corpus.
Kulkarni et al.: This is a pic- Kulkarni et al.: This is
ture of three persons, one bot- a picture of two potted- for naturally varied but well-formed text, generat-
tle and one diningtable. The plants, one dog and one ing syntactic trees rather than filling in a template.
first rusty person is beside the person. The black dog is In addition to these tasks, Midge automatically
second person. The rusty bot- by the black person, and
tle is near the first rusty per- near the second feathered
decides what the subject and objects of the de-
son, and within the colorful pottedplant. scription will be, leverages the collected word co-
diningtable. The second per- occurrence statistics to filter possible incorrect de-
son is by the third rusty per- tections, and offers the flexibility to be as de-
son. The colorful diningtable
scriptive or as terse as possible, specified by the
is near the first rusty person,
and near the second person, user at run-time. The end result is a fully au-
and near the third rusty person. tomatic vision-to-language system that is begin-
Yang et al.: Three people Yang et al.: The person is ning to generate syntactically and semantically
are showing the bottle on the sitting in the chair in the well-formed descriptions with naturalistic varia-
street room
tion. Example descriptions are given in Figures 4
Midge: people with a bottle at Midge: a person in black and 5, and descriptions from other recent systems
the table with a black dog by potted
plants
are given in Figure 3.
Figure 3: Descriptions generated by Midge, Kulkarni
The results are promising, but it is important to
et al. (2011) and Yang et al. (2011) on the same images. note that Midge is a first-pass system through the
Midge uses the Kulkarni et al. (2011) front-end, and so steps necessary to connect vision to language at
outputs are directly comparable. a deep syntactic/semantic level. As such, it uses
basic solutions at each stage of the process, which
paper. However, their system is not automatic, re- may be improved: Midge serves as an illustration
quiring extensive hand-coded semantic and syn- of the types of issues that should be handled to
tactic details. Another approach is provided in automatically generate syntactic trees from vision
Li et al. (2011), who use image detections to se- detections, and offers some possible solutions. It
lect and combine web-scale n-grams (Brants and is evaluated against the Kulkarni et al. system, the
Franz, 2006). This automatically generates de- Yang et al. system, and human-written descrip-
scriptions that are either poetic or strange (e.g., tions on the same set of images in Section 5, and
tree snowing black train). is found to significantly outperform the automatic
A different line of work transfers captions of systems.
similar images directly to a query image. Farhadi
et al. (2010) use <object,action,scene> triples 3 Learning from Descriptive Text
predicted from the visual characteristics of the To train our system on how people describe im-
image to find potential captions. Ordonez et al. ages, we use 700,000 (Flickr, 2011) images with
(2011) use global image matching with local re- associated descriptions from the dataset in Or-
ordering from a much larger set of captioned pho- donez et al. (2011). This is separate from our
tographs. These transfer-based approaches result evaluation image set, consisting of 840 PASCAL
in natural captions (they are written by humans) images. The Flickr data is messier than datasets
that may not actually be true of the image. created specifically for vision training, but pro-
This work learns and builds from these ap- vides the largest corpus of natural descriptions of
proaches. Following Kulkarni et al. and Li et al., images to date.
the system uses large-scale text corpora to esti- We normalize the text by removing emoticons
mate likely words around object detections. Fol- and mark-up language, and parse each caption
lowing Yang et al., the system can hallucinate using the Berkeley parser (Petrov, 2010). Once
likely words using word co-occurrence statistics parsed, we can extract syntactic information for
alone. And following Yao et al., the system aims individual (word, tag) pairs.

749
a cow with sheep with a gray sky people with boats a brown cow people at
green grass by the road a wooden table
Figure 4: Example generated outputs.
Awkward Prepositions Incorrect Detections

a person boats under a black bicycle at the sky a yellow bus cows by black sheep
on the dog the sky a green potted plant with people by the road
Figure 5: Example generated outputs: Not quite right

We compute the probabilities for different 4 Generation


prenominal modifiers (shiny, clear, glowing, ...)
and determiners (a/an, the, None, ...) given a Following Penn Treebank parsing guidelines
head noun in a noun phrase (NP), as well as the (Marcus et al., 1995), the relationship between
probabilities for each head noun in larger con- two head nouns in a sentence can usually be char-
structions, listed in Section 4.3. Probabilities are acterized among the following:
conditioned only on open-class words, specifi- 1. prepositional (a boy on the table)
cally, nouns and verbs. This means that a closed-
2. verbal (a boy cleans the table)
class word (such as a preposition) is never used to
generate an open-class word. 3. verb with preposition (a boy sits on the table)
4. verb with particle (a boy cleans up the table)
In addition to co-occurrence statistics, the 5. verb with S or SBAR complement (a boy
parsed Flickr data adds to our understanding of sees that the table is clean)
the basic characteristics of visually descriptive
text. Using WordNet (Miller, 1995) to automati- The generation system focuses on the first three
cally determine whether a head noun is a physical kinds of relationships, which capture a wide range
object or not, we find that 92% of the sentences of utterances. The process of generation is ap-
have no more than 3 physical objects. This in- proached as a problem of generating a semanti-
forms generation by placing a cap on how many cally and syntactically well-formed tree based on
objects are mentioned in each descriptive sen- object nouns. These serve as head noun anchors
tence: When more than 3 objects are detected, in a lexicalized syntactic derivation process that
the system splits the description over several sen- we call tree growth.
tences. We also find that many of the descriptions Vision detections are associated to a {tag
are not sentences as well (tagged as S, 58% of the word} pair, and the model fleshes out the tree de-
data), but quite commonly noun phrases (tagged tails around head noun anchors by utilizing syn-
as NP, 28% of the data), and expect that the num- tactic dependencies between words learned from
ber of noun phrases that form descriptions will be the Flickr data discussed in Section 3. The anal-
much higher with domain adaptation. This also ogy of growing a tree is quite appropriate here,
informs generation, and the system is capable of where nouns are bundles of constraints akin to
generating both sentences (contains a main verb) seeds, giving rise to the rest of the tree based on
and noun phrases (no main verb) in the final im- the lexicalized subtrees in which the nouns are
age description. We use the term sentence in the likely to occur. An example generated tree struc-
rest of this paper to refer to both kinds of complex ture is shown in Figure 6, with noun anchors in
phrases. bold.

750
NP Unordered Ordered
NP PP bottle, table, person person, bottle, table
road, sky, cow cow, road, sky
NP PP IN NP
Figure 8: Example nominal orderings.
DT NN IN NP at DT NN
pipeline. The hand-built component contains plu-
- people with DT NN the table
ral forms of singular nouns, the list of possible
a bottle spatial relations shown in Table 3, and a map-
Figure 6: Tree generated from tree growth process. ping between attribute values and modifier sur-
face forms (e.g., a green detection for person is to
Midge was developed using detections run on be realized as the postnominal modifier in green).
Flickr images, incorporating action/pose detec-
tions for verbs as well as object detections for 4.2 Content Determination
nouns. In testing, we generate descriptions for 4.2.1 Step 1: Group the Nouns
the PASCAL images, which have been used in
An initial set of object detections must first be
earlier work on the vision-to-language connection
split into clusters that give rise to different sen-
(Kulkarni et al., 2011; Yang et al., 2011), and al-
tences. If more than 3 objects are detected in the
lows us to compare systems directly. Action and
image, the system begins splitting these into dif-
pose detection for this data set still does not work
ferent noun groups. In future work, we aim to
well, and so the system does not receive these de-
compare principled approaches to this task, e.g.,
tections from the vision front-end. However, the
using mutual information to cluster similar nouns
system can still generate verbs when action and
together. The current system randomizes which
pose detectors have been run, and this framework
nouns appear in the same group.
allows the system to hallucinate likely verbal
constructions between objects if specified at run- 4.2.2 Step 2: Order the Nouns
time. A similar approach was taken in Yang et al. Each group of nouns are then ordered to deter-
(2011). Some examples are given in Figure 7. mine when they are mentioned in a sentence. Be-
We follow a three-tiered generation process cause the system generates declarative sentences,
(Reiter and Dale, 2000), utilizing content determi- this automatically determines the subject and ob-
nation to first cluster and order the object nouns, jects. This is a novel contribution for a general
create their local subtrees, and filter incorrect de- problem in NLG, and initial evaluation (Section
tections; microplanning to construct full syntactic 5) suggests it works reasonably well.
trees around the noun clusters, and surface real- To build the nominal ordering model, we use
ization to order selected modifiers, realize them as WordNet to associate all head nouns in the Flickr
postnominal or prenominal, and select final out- data to all of their hypernyms. A description is
puts. The system follows an overgenerate-and- represented as an ordered set [a1 ...an ] where each
select approach (Langkilde and Knight, 1998), ap is a noun with position p in the set of head
which allows different final trees to be selected nouns in the sentence. For the position pi of each
with different settings. hypernym ha in each sentence with n head nouns,
we estimate p(pi |n, ha ).
4.1 Knowledge Base
During generation, the system greedily maxi-
Midge uses a knowledge base that stores models mizes p(pi |n, ha ) until all nouns have been or-
for different tasks during generation. These mod- dered. Example orderings are shown in Figure 8.
els are primarily data-driven, but we also include This model automatically places animate objects
a hand-built component to handle a small set of near the beginning of a sentence, which follows
rules. The data-driven component provides the psycholinguistic work in object naming (Branigan
syntactically informed word co-occurrence statis- et al., 2007).
tics learned from the Flickr data, a model for or-
dering the selected nouns in a sentence, and a 4.2.3 Step 3: Filter Incorrect Attributes
model to change computer vision attributes to at- For the system to be able to extend coverage as
tribute:value pairs. Below, we discuss the three new computer vision attribute detections become
main data-driven models within the generation available, we develop a method to automatically

751
A person sitting on a sofa Cows grazing Airplanes flying A person walking a dog
Figure 7: Hallucinating: Creating likely actions. Straightforward to do, but can often be wrong.

COLOR purple blue green red white ... member of the group.
MATERIAL plastic wooden silver ...
SURFACE furry fluffy hard soft ... 4.2.5 Step 5: Gather Local Subtrees Around
QUALITY shiny rust dirty broken ... Object Nouns
Table 2: Example attribute classes and values. 1 2
NP
group adjectives into broader attribute classes,3 DT{0,1} JJ* NN S
and the generation system uses these classes when
n NP{NN n} VP{VBZ}
deciding how to describe objects. To group adjec-
3 4
tives, we use a bootstrapping technique (Kozareva NP NP
et al., 2008) that learns which adjectives tend to
NP{NN n} VP{VB(G|N)} NP{NN n} PP{IN}
co-occur, and groups these together to form an at-
5 6
tribute class. Co-occurrence is computed using PP VP
cosine (distributional) similarity between adjec-
tives, considering adjacent nouns as context (i.e., IN NP{NN n} VB(G|N|Z) PP{IN}
JJ NN constructions). Contexts (nouns) for adjec- 7
VP
tives are weighted using Pointwise Mutual Infor-
mation and only the top 1000 nouns are selected VB(G|N|Z) NP{NN n}
for every adjective. Some of the learned attribute Figure 9: Initial subtree frames for generation, present-
classes are given in Table 2. tense declarative phrases. marks a substitution site,
In the Flickr corpus, we find that each attribute * marks 0 sister nodes of this type permitted, {0,1}
(COLOR, SIZE, etc.), rarely has more than a single marks that this node can be included of excluded.
Input: set of ordered nouns, Output: trees preserving
value in the final description, with the most com- nominal ordering.
mon (COLOR) co-occurring less than 2% of the
time. Midge enforces this idea to select the most Possible actions/poses and spatial relationships
likely word v for each attribute from the detec- between objects nouns, represented by verbs and
tions. In a noun phrase headed by an object noun, prepositions, are selected using the subtree frames
NP{NN noun}, the prenominal adjective (JJ v) for listed in Figure 9. Each head noun selects for its
each attribute is selected using maximum likeli- likely local subtrees, some of which are not fully
hood. formed until the Microplanning stage. As an ex-
ample of how this process works, see Figure 10,
4.2.4 Step 4: Group Plurals which illustrates the combination of Trees 4 and
How to generate natural-sounding spatial rela- 5. For simplicity, we do not include the selection
tions and modifiers for a set of objects, as opposed of further subtrees. The subject noun duck se-
to a single object, is still an open problem (Fu- lects for prepositional phrases headed by different
nakoshi et al., 2004; Gatt, 2006). In this work, we prepositions, and the object noun grass selects
use a simple method to group all same-type ob- for prepositions that head the prepositional phrase
jects together, associate them to the plural form in which it is embedded. Full PP subtrees are cre-
listed in the KB, discard the modifiers, and re- ated during Microplanning by taking the intersec-
turn spatial relations based on the first recognized tion of both.
3
What in computer vision are called attributes are called
The leftmost noun in the sequence is given a
values in NLG. A value like red belongs to a COLOR at- rightward directionality constraint, placing it as
tribute, and we use this distinction in the system. the subject of the sentence, and so it will only se-

752
a over b a above b b below a b beneath a a by b b by a a on b b under a
b underneath a a upon b a over b
a by b a against b b against a b around a a around b a at b b at a a beside b
b beside a a by b b by a a near b b near a b with a a with b
a in b a in b b outside a a within b a by b b by a
Table 3: Possible prepositions from bounding boxes.

Subtree frames: a given noun as a mass or count noun (not taking a


NP PP
determiner or taking a determiner, respectively) or
NP{NN n1 } PP{IN} IN NP{NN n2 } as a given or new noun (phrases like a sky sound
unnatural because sky is given knowledge, requir-
Generated subtrees:
NP PP ing the definite article the). The selection of de-
terminer is not independent of the selection of ad-
NP PP IN NP
jective; a sky may sound unnatural, but a blue sky
NN IN on, by, over NN is fine. These trees take the dependency between
duck above, on, by grass determiner and adjective into account.
Trees 2 and 3:
Combined trees:
NP NP Collect beginnings of VP subtrees headed by
(VBZ verb), (VBG verb), and (VBN verb), no-
NP PP NP PP
tated here as VP{VBX verb}, where:
NN IN NP NN IN NP
p(VP{VBX verb}|NP{NN noun}=SUBJ) >
duck on NN duck by NN
Tree 4:
grass grass
Collect beginnings of PP subtrees headed by (IN
Figure 10: Example derivation. prep), where:
lect for trees that expand to the right. The right- p(PP{IN prep}|NP{NN noun}=SUBJ) >
most noun is given a leftward directionality con-
Tree 5:
straint, placing it as an object, and so it will only
Collect PP subtrees headed by (IN prep) with
select for trees that expand to its left. The noun in
NP complements (OBJ) headed by (NN noun),
the middle, if there is one, selects for all its local
where:
subtrees, combining first with a noun to its right
or to its left. We now walk through the deriva- p(PP{IN prep}|NP{NN noun}=OBJ) >
tion process for each of the listed subtree frames. Tree 6:
Because we are following an overgenerate-and- Collect VP subtrees headed by (VBX verb) with
select approach, all combinations above a proba- embedded PP complements, where:
bility threshold and an observation cutoff are
created. p(PP{IN prep}|VP{VBX verb}=SUBJ) >

Tree 1: Tree 7:
Collect all NP (DT det) (JJ adj)* (NN noun) Collect VP subtrees headed by (VBX verb) with
and NP (JJ adj)* (NN noun) subtrees, where: embedded NP objects, where:
p((JJ adj)|(NN noun)) > for each adj p(VP{VBX verb}|NP{NN noun}=OBJ) >
p((DT det)|JJ, (NN noun)) > , and the proba- 4.3 Microplanning
bility of a determiner for the head noun is higher
than the probability of no determiner. 4.3.1 Step 6: Create Full Trees
Any number of adjectives (including none) may In Microplanning, full trees are created by tak-
be generated, and we include the presence or ab- ing the intersection of the subtrees created in Con-
sence of an adjective when calculating which de- tent Determination. Because the nouns are or-
terminer to include. dered, it is straightforward to combine the sub-
The reasoning behind the generation of these trees surrounding a noun in position 1 with sub-
subtrees is to automatically learn whether to treat trees surrounding a noun in position 2. Two

753
NP words. We find that the second method produces
VP NP CC NP descriptions that seem more natural and varied
VP* than the n-gram ranking method for our develop-
and
ment set, and so use the longest string method in
Figure 11: Auxiliary trees for generation.
evaluation.
further trees are necessary to allow the subtrees
4.4.2 Step 8: Prenominal Modifier Ordering
gathered to combine within the Penn Treebank
syntax. These are given in Figure 11. If two To order sets of selected adjectives, we use the
nouns in a proposed sentence cannot be combined top-scoring prenominal modifier ordering model
with prepositions or verbs, we backoff to combine discussed in Mitchell et al. (2011). This is an n-
them using (CC and). gram model constructed over noun phrases that
Stepping through this process, all nouns will were extracted from an automatically parsed ver-
have a set of subtrees selected by Tree 1. Prepo- sion of the New York Times portion of the Giga-
sitional relationships between nouns are created word corpus (Graff and Cieri, 2003). With this
by substituting Tree 1 subtrees into the NP nodes in place, blue clear sky becomes clear blue sky,
of Trees 4 and 5, as shown in Figure 10. Verbal wooden brown table becomes brown wooden ta-
relationships between nouns are created by substi- ble, etc.
tuting Tree 1 subtrees into Trees 2, 3, and 7. Verb
5 Evaluation
with preposition relationships are created between
nouns by substituting the VBX node in Tree 6 Each set of sentences is generated with (likeli-
with the corresponding node in Trees 2 and 3 to hood cutoff) set to .01 and (observation count
grow the tree to the right, and the PP node in Tree cutoff) set to 3. We compare the system against
6 with the corresponding node in Tree 5 to grow human-written descriptions and two state-of-the-
the tree to the left. Generation of a full tree stops art vision-to-language systems, the Kulkarni et al.
when all nouns in a group are dominated by the (2011) and Yang et al. (2011) systems.
same node, either an S or NP. Human judgments were collected using Ama-
zons Mechanical Turk (Amazon, 2011). We
4.4 Surface Realization follow recommended practices for evaluating an
In the surface realization stage, the system se- NLG system (Reiter and Belz, 2009) and for run-
lects a single tree from the generated set of pos- ning a study on Mechanical Turk (Callison-Burch
sible trees and removes mark-up to produce a fi- and Dredze, 2010), using a balanced design with
nal string. This is also the stage where punctua- each subject rating 3 descriptions from each sys-
tion may be added. Different strings may be gen- tem. Subjects rated their level of agreement on
erated depending on different specifications from a 5-point Likert scale including a neutral mid-
the user, as discussed at the beginning of Section dle position, and since quality ratings are ordinal
4 and shown in the online demo. To evaluate the (points are not necessarily equidistant), we evalu-
system against other systems, we specify that the ate responses using a non-parametric test. Partici-
system should (1) not hallucinate likely verbs; and pants that took less than 3 minutes to answer all 60
(2) return the longest string possible. questions and did not include a humanlike rating
for at least 1 of the 3 human-written descriptions
4.4.1 Step 7: Get Final Tree, Clear Mark-Up were removed and replaced. It is important to note
We explored two methods for selecting a final that this evaluation compares full generation sys-
string. In one method, a trigram language model tems; many factors are at play in each system that
built using the Europarl (Koehn, 2005) data with may also influence participants perception, e.g.,
start/end symbols returns the highest-scoring de- sentence length (Napoles et al., 2011) and punc-
scription (normalizing for length). In the second tuation decisions.
method, we limit the generation system to select The systems are evaluated on a set of 840
the most likely closed-class words (determiners, images evaluated in the original Kulkarni et al.
prepositions) while building the subtrees, over- (2011) system. Participants were asked to judge
generating all possible adjective combinations. the statements given in Figure 12, from Strongly
The final string is then the one with the most Disagree to Strongly Agree.

754
Grammaticality Main Aspects Correctness Order Humanlikeness
Human 4 (3.77, 1.19) 4 (4.09, 0.97) 4 (3.81, 1.11) 4 (3.88, 1.05) 4 (3.88, 0.96)
Midge 3 (2.95, 1.42) 3 (2.86, 1.35) 3 (2.95, 1.34) 3 (2.92, 1.25) 3 (3.16, 1.17)
Kulkarni et al. 2011 3 (2.83, 1.37) 3 (2.84, 1.33) 3 (2.76, 1.34) 3 (2.78, 1.23) 3 (3.13, 1.23)
Yang et al. 2011 3 (2.95, 1.49) 2 (2.31, 1.30) 2 (2.46, 1.36) 2 (2.53, 1.26) 3 (2.97, 1.23)
Table 4: Median scores for systems, mean and standard deviation in parentheses. Distance between points on the
rating scale cannot be assumed to be equidistant, and so we analyze results using a non-parametric test.
G RAMMATICALITY:
This description is grammatically correct. side. On the computer vision side, incorrect ob-
M AIN A SPECTS : jects are often detected and salient objects are of-
This description describes the main aspects of this ten missed. Midge does not yet screen out un-
image. likely objects or add likely objects, and so pro-
C ORRECTNESS : vides no filter for this. On the language side, like-
This description does not include extraneous or in-
lihood is estimated directly, and the system pri-
correct information.
O RDER : marily uses simple maximum likelihood estima-
The objects described are mentioned in a reasonable tions to combine subtrees. The descriptive cor-
order. pus that informs the system is not parsed with
H UMANLIKENESS : a domain-adapted parser; with this in place, the
It sounds like a person wrote this description. syntactic constructions that Midge learns will bet-
Figure 12: Mechanical Turk prompts. ter reflect the constructions that people use.
In future work, we hope to address these issues
We report the scores for the systems in Table
as well as advance the syntactic derivation pro-
4. Results are analyzed using the non-parametric
cess, providing an adjunction operation (for ex-
Wilcoxon Signed-Rank test, which uses median
ample, to add likely adjectives or adverbs based
values to compare the different systems. Midge
on language alone). We would also like to incor-
outperforms all recent automatic approaches on
porate meta-data even when no vision detection
C ORRECTNESS and O RDER, and Yang et al. ad-
fires for an image, the system may be able to gen-
ditionally on H UMANLIKENESS and M AIN A S -
erate descriptions of the time and place where an
PECTS . Differences between Midge and Kulkarni
image was taken based on the image file alone.
et al. are significant at p < .01; Midge and Yang et
al. at p < .001. For all metrics, human-written de- 7 Conclusion
scriptions still outperform automatic approaches We have introduced a generation system that uses
(p < .001). a new approach to generating language, tying a
These findings are striking, particularly be- syntactic model to computer vision detections.
cause Midge uses the same input as the Kulka- Midge generates a well-formed description of an
rni et al. system. Using syntactically informed image by filtering attribute detections that are un-
word co-occurrence statistics from a large corpus likely and placing objects into an ordered syntac-
of descriptive text improves over state-of-the-art, tic structure. Humans judge Midges output to be
allowing syntactic trees to be generated that cap- the most natural descriptions of images generated
ture the variation of natural language. thus far. The methods described here are promis-
6 Discussion ing for generating natural language descriptions
of the visual world, and we hope to expand and
Midge automatically generates language that is as refine the system to capture further linguistic phe-
good as or better than template-based systems, nomena.
tying vision to language at a syntactic/semantic
level to produce natural language descriptions. 8 Acknowledgements
Results are promising, but, there is more work to Thanks to the Johns Hopkins CLSP summer
be done: Evaluators can still tell a difference be- workshop 2011 for making this system possible,
tween human-written descriptions and automati- and to reviewers for helpful comments. This
cally generated descriptions. work is supported in part by Michael Collins and
Improvements to the generated language are by NSF Faculty Early Career Development (CA-
possible at both the vision side and the language REER) Award #1054133.

755
References

Amazon. 2011. Amazon Mechanical Turk: Artificial artificial intelligence.
Holly P. Branigan, Martin J. Pickering, and Mikihiro Tanaka. 2007. Contributions of animacy to grammatical function assignment and word order during production. Lingua, 118(2):172–189.
Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram version 1.
Chris Callison-Burch and Mark Dredze. 2010. Creating speech and language data with Amazon's Mechanical Turk. NAACL 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.
Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. Proceedings of CVPR 2005.
Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. Proceedings of CVPR 2009.
Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences for images. Proceedings of ECCV 2010.
Pedro Felzenszwalb, David McAllester, and Deva Ramanan. 2008. A discriminatively trained, multiscale, deformable part model. Proceedings of CVPR 2008.
Flickr. 2011. http://www.flickr.com. Accessed 1 Sep. 2011.
Kotaro Funakoshi, Satoru Watanabe, Naoko Kuriyama, and Takenobu Tokunaga. 2004. Generating referring expressions using perceptual groups. Proceedings of the 3rd INLG.
Albert Gatt. 2006. Generating collective spatial references. Proceedings of the 28th CogSci.
David Graff and Christopher Cieri. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia, PA. LDC Catalog No. LDC2003T05.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. MT Summit. http://www.statmt.org/europarl/.
Zornitsa Kozareva, Ellen Riloff, and Eduard Hovy. 2008. Semantic class learning from the web with hyponym pattern linkage graphs. Proceedings of ACL-08: HLT.
Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara Berg. 2011. Baby talk: Understanding and generating image descriptions. Proceedings of the 24th CVPR.
Irene Langkilde and Kevin Knight. 1998. Generation that exploits corpus-based statistical knowledge. Proceedings of the 36th ACL.
Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, and Yejin Choi. 2011. Composing simple image descriptions using web-scale n-grams. Proceedings of CoNLL 2011.
Mitchell Marcus, Ann Bies, Constance Cooper, Mark Ferguson, and Alyson Littman. 1995. Treebank II bracketing guide.
George A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
Margaret Mitchell, Aaron Dunlop, and Brian Roark. 2011. Semi-supervised modeling for prenominal modifier ordering. Proceedings of the 49th ACL: HLT.
Courtney Napoles, Benjamin Van Durme, and Chris Callison-Burch. 2011. Evaluating sentence compression: Pitfalls and suggested remedies. ACL-HLT Workshop on Monolingual Text-To-Text Generation.
Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing images using 1 million captioned photographs. Proceedings of NIPS 2011.
Slav Petrov. 2010. Berkeley parser. GNU General Public License v.2.
Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting image annotations using Amazon's Mechanical Turk. Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.
Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Journal of Natural Language Engineering, pages 57–87.
Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.
Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. Proceedings of EMNLP 2011.
Benjamin Z. Yao, Xiong Yang, Liang Lin, Mun Wai Lee, and Song-Chun Zhu. 2010. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8):1485–1508.

Generation of landmark-based navigation instructions from open-source data

Markus Dräger (Dept. of Computational Linguistics, Saarland University), mdraeger@coli.uni-saarland.de
Alexander Koller (Dept. of Linguistics, University of Potsdam), koller@ling.uni-potsdam.de
Abstract

We present a system for the real-time generation of car navigation instructions with landmarks. Our system relies exclusively on freely available map data from OpenStreetMap, organizes its output to fit into the available time until the next driving maneuver, and reacts in real time to driving errors. We show that female users spend significantly less time looking away from the road when using our system compared to a baseline system.

1 Introduction

Systems that generate route instructions are becoming an increasingly interesting application area for natural language generation (NLG) systems. Car navigation systems are ubiquitous already, and with the increased availability of powerful mobile devices, the wide-spread use of pedestrian navigation systems is on the horizon. One area in which NLG systems could improve existing navigation systems is in the use of landmarks, which would enable them to generate instructions such as "turn right after the church" instead of "after 300 meters". It has been shown in human-human studies that landmark-based route instructions are easier to understand (Lovelace et al., 1999) than distance-based ones and reduce driver distraction in in-car settings (Burnett, 2000), which is crucial for improved traffic safety (Stutts et al., 2001). From an NLG perspective, navigation systems are an obvious application area for situated generation, for which there has recently been increasing interest (see e.g. Lessmann et al., 2006; Koller et al., 2010; Striegnitz and Majda, 2009).

Current commercial navigation systems use only trivial NLG technology, and in particular are limited to distance-based route instructions. Even in academic research, there has been remarkably little work on NLG for landmark-based navigation systems. Some of these systems rely on map resources that have been hand-crafted for a particular city (Malaka et al., 2004), or on a combination of multiple complex resources (Raubal and Winter, 2002), which effectively limits their coverage. Others, such as Dale et al. (2003), focus on non-interactive one-shot instruction discourses. However, commercially successful car navigation systems continuously monitor whether the driver is following the instructions and provide modified instructions in real time when necessary. That is, two key problems in designing NLG systems for car navigation instructions are the availability of suitable map resources and the ability of the NLG system to generate instructions and react to driving errors in real time.

In this paper, we explore solutions to both of these points. We present the Virtual Co-Pilot, a system which generates route instructions for car navigation using landmarks that are extracted from the open-source OpenStreetMap resource (http://www.openstreetmap.org/). The system computes a route plan and splits it into episodes that end in driving maneuvers. It then selects landmarks that describe the locations of these driving maneuvers, and aggregates instructions such that they can be presented (via a TTS system) in the time available within the episode. The system monitors the user's position and computes new, corrective instructions when the user leaves the intended path. We evaluate our system using a driving simulator, and compare it to a baseline that is designed to replicate a typical commercial navigation system. The Virtual Co-Pilot performs comparably to the baseline

on the number of driving errors and on user satisfaction, and outperforms it significantly on the time female users spend looking away from the road. To our knowledge, this is the first time that the generation of landmarks has been shown to significantly improve the instructions of a wide-coverage navigation system.

Plan of the paper. We start by reviewing earlier literature on landmarks, route instructions, and the use of NLG for route instructions in Section 2. We then present the way in which we extract information on potential landmarks from OpenStreetMap in Section 3. Section 4 shows how we generate route instructions, and Section 5 presents the evaluation. Section 6 concludes.

2 Related Work

What makes an object in the environment a good landmark has been the topic of research in various disciplines, including cognitive science, computer science, and urban planning. Lynch (1960) defines landmarks as physical entities that serve as external points of reference that stand out from their surroundings. Kaplan (1976) specified a landmark as a known place for which the individual has a well-formed representation. Although there are different definitions of landmarks, a common theme is that objects are considered landmarks if they have some kind of cognitive salience (both in terms of visual distinctiveness and frequency of interaction).

The usefulness of landmarks in route instructions has been shown in a number of different human-human studies. Experimental results from Lovelace et al. (1999) show that people not only use landmarks intuitively when giving directions, but they also perceive instructions that are given to them to be of higher quality when those instructions contain landmark information. Similar findings have also been reported by Michon and Denis (2001) and Tom and Denis (2003).

Regarding car navigation systems specifically, Burnett (2000) reports on a road-based user study which compared a landmark-based navigation system to a conventional car navigation system. Here the provision of landmark information in route directions led to a decrease of navigational errors. Furthermore, glances at the navigation display were shorter and fewer, which indicates less driver distraction in this particular experimental condition. Minimizing driver distraction is a crucial goal of improved navigation systems, as driver inattention of various kinds is a leading cause of traffic accidents (25% of all police-reported car crashes in the US in 2000, according to Stutts et al. (2001)). Another road-based study conducted by May and Ross (2006) yielded similar results.

One recurring finding in studies on landmarks in navigation is that some user groups are able to benefit more from their inclusion than others. This is particularly the case for female users. While men tend to outperform women in wayfinding tasks, completing them faster and with fewer navigation errors (cf. Allen (2000)), women are likely to show improved wayfinding performance when landmark information is given (e.g. Saucier et al. (2002)).

Despite all of this evidence from human-human studies, there has been remarkably little research on implemented navigation systems that use landmarks. Commercial systems make virtually no use of landmark information when giving directions, relying on metric representations instead (e.g. "Turn right in one hundred meters"). In academic research, there have only been a handful of relevant systems. A notable example is the DEEP MAP system, which was created in the SmartKom project as a mobile tourist information system for the city of Heidelberg (Malaka and Zipf, 2000; Malaka et al., 2004). DEEP MAP uses landmarks as waypoints for the planning of touristic routes for car drivers and pedestrians, while also making use of landmark information in the generation of route directions. Raubal and Winter (2002) combine data from digital city maps, facade images, cultural heritage information, and other sources to compute landmark descriptions that could be used in a pedestrian navigation system for the city of Vienna.

The key to the richness of these systems is a set of extensive, manually curated geographic and landmark databases. However, creation and maintenance of such databases is expensive, which makes it impractical to use these systems outside of the limited environments for which they were created. There have been a number of suggestions for automatically acquiring landmark data from existing electronic databases, for instance cadastral data (Elias, 2003) and airborne laser scans (Brenner and Elias, 2003). But the raw data for these approaches is still hard to obtain;

information about landmarks is mostly limited to geometric data and does not specify the semantic type of a landmark (such as "church"); and updating the landmark database frequently when the real world changes (e.g., a shop closes down) remains an open issue.

The closest system in the literature to the research we present here is the CORAL system (Dale et al., 2003). CORAL generates a text of driving instructions with landmarks out of the output of a commercial web-based route planner. Unlike CORAL, our system relies purely on open-source map data. Also, our system generates driving instructions in real time (as opposed to a single discourse before the user starts driving) and reacts in real time to driving errors. Finally, we evaluate our system thoroughly for driving errors, user satisfaction, and driver distraction on an actual driving task, and find a significant improvement over the baseline.

Figure 1: A graphical representation of some nodes and ways in OpenStreetMap.

Landmark            Type
Street Furniture    stop sign, traffic lights, pedestrian crossing
Visual Landmarks    church, certain video stores, certain supermarkets, gas station, pubs and bars

Figure 2: Landmarks used by the Virtual Co-Pilot.

3 OpenStreetMap

A system that generates landmark-based route directions requires two kinds of data. First, it must plan routes between points in space, and therefore needs data on the road network, i.e. the road segments that make up streets along with their connections. Second, the system needs information about the landmarks that are present in the environment. This includes geographic information such as position, but also semantic information such as the landmark type.

We have argued above that the availability of such data has been a major bottleneck in the development of landmark-based navigation systems. In the Virtual Co-Pilot system, which we present below, we solve this problem by using data from OpenStreetMap, an on-line map resource that provides both types of information mentioned above, in a unified data structure. The OpenStreetMap project is to maps what Wikipedia is to encyclopedias: it is a map of the entire world which can be edited by anyone wishing to participate. New map data is usually added by volunteers who measure streets using GPS devices and annotate them via a Web interface. The decentralized nature of the data entry process means that when the world changes, the map will be updated quickly. Existing map data can be viewed as a zoomable map on the OpenStreetMap website, or it can be downloaded in an XML format for offline use.

Geographical data in OpenStreetMap is represented in terms of nodes and ways. Nodes represent points in space, defined by their latitude and longitude. Ways consist of sequences of edges between adjacent nodes; we call the individual edges segments below. They are used to represent streets (with curved streets consisting of multiple straight segments approximating their shape), but also a variety of other real-world entities: buildings, rivers, trees, etc. Nodes and ways can both be enriched with further information by attaching tags. Tags encode a wide range of additional information using a predefined type ontology. Among other things, they specify the types of buildings (church, cafe, supermarket, etc.); where a shop or restaurant has a name, it too is specified in a tag. Fig. 1 is a graphical representation of some OpenStreetMap data, consisting of nodes and ways for two streets (with two and five segments) and a building which has been tagged as a gas station.
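The node/way/tag model just described maps directly onto a small reader for the downloadable XML extracts. The following sketch, using only Python's standard library, is an illustration of the data model rather than the authors' implementation; the function name and dictionary layout are assumptions.

```python
import xml.etree.ElementTree as ET

def load_osm(path):
    """Read an OpenStreetMap XML extract into node and way dictionaries.

    Nodes are points (lat/lon); ways are ordered node-id sequences.
    Both can carry tags such as building=church or amenity=pub.
    """
    tree = ET.parse(path)
    nodes, ways = {}, {}
    for nd in tree.findall("node"):
        nodes[nd.get("id")] = {
            "lat": float(nd.get("lat")),
            "lon": float(nd.get("lon")),
            "tags": {t.get("k"): t.get("v") for t in nd.findall("tag")},
        }
    for way in tree.findall("way"):
        ways[way.get("id")] = {
            "node_ids": [ref.get("ref") for ref in way.findall("nd")],
            "tags": {t.get("k"): t.get("v") for t in way.findall("tag")},
        }
    return nodes, ways
```

Landmark candidates of the kinds listed in Fig. 2 can then be collected by filtering node and way tags (e.g. building=church or amenity=pub).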

For the Virtual Co-Pilot system, we have chosen a set of concrete landmark types that we consider useful (Fig. 2). We operationalize the criteria for good landmarks sketched in Section 2 by requiring that a landmark should be easily visible, and that it should be generic in that it is applicable not just for one particular city, but for any place for which OpenStreetMap data is available. We end up with two classes of landmark types: street furniture and visual landmarks. Street furniture is a generic term for objects that are installed on streets. In this subset, we include stop signs, traffic lights, and pedestrian crossings. Our assumption is that these objects inherently possess a high salience, since they already require particular attention from the driver. Visual landmarks encompass roadside buildings that are not directly connected to the road infrastructure, but draw the driver's attention due to visual salience. Churches are an obvious member of this group; in addition, we include gas stations, pubs, and bars, as well as certain supermarket and video store chains (selected for wide distribution over different cities and recognizable, colorful signs).

Given a certain location at which the Virtual Co-Pilot is to be used, we automatically extract suitable landmarks along with their types and locations from OpenStreetMap. We also gather the road network information that is required for route planning, and collect information on streets, such as their names, from the tags. We then transform this information into a directed street graph. The nodes of this graph are the OpenStreetMap nodes that are part of streets; two adjacent nodes are connected by a single directed edge for segments of one-way streets and a directed edge in each direction for ordinary street segments. Each edge is weighted with the Euclidean distance between the two nodes.
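A minimal sketch of how such a street graph could be built from the loaded ways, and how the route plan of Section 4.1 below can then be obtained as a shortest path. The highway-tag filter, the planar distance approximation and all function names are illustrative assumptions, not details from the paper.

```python
import heapq
from math import hypot

def distance(n1, n2):
    # Crude planar approximation over lat/lon; projected coordinates or the
    # haversine formula would be more accurate in practice.
    return hypot(n1["lat"] - n2["lat"], n1["lon"] - n2["lon"])

def build_street_graph(nodes, ways):
    """Directed graph: node id -> list of (neighbour id, edge weight, street name)."""
    graph = {}
    for way in ways.values():
        tags = way["tags"]
        if "highway" not in tags:          # keep only ways that represent streets
            continue
        name = tags.get("name", "")
        oneway = tags.get("oneway") == "yes"
        ids = way["node_ids"]
        for a, b in zip(ids, ids[1:]):
            d = distance(nodes[a], nodes[b])
            graph.setdefault(a, []).append((b, d, name))
            if not oneway:                 # two-way streets get edges in both directions
                graph.setdefault(b, []).append((a, d, name))
    return graph

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns the cheapest node sequence from start to goal."""
    queue, seen = [(0.0, start, [start])], set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, d, _name in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + d, nxt, path + [nxt]))
    return None
```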
4 Generation of route directions

We will now describe how the Virtual Co-Pilot generates route directions from OpenStreetMap data. The system generates three types of messages (see Fig. 3). First, at every decision point, i.e. at the intersection where a driving maneuver such as turning left or right is required, the user is told to turn immediately in the given direction ("now turn right"). Second, if the driver has followed an instruction correctly, we generate a confirmation message after the driver has made the turn, letting them know they are still on the right track. Finally, we generate preview messages on the street leading up to the decision point. These preview messages describe the location of the next driving maneuver.

Figure 3: Schematic representation of an episode (dashed red line), with sample trigger positions of preview, turn instruction, and confirmation messages.

Of the three types, preview messages are the most interesting. Our system avoids the generation of metric distance indicators, as in "turn left in 100 meters". Instead, it tries to find landmarks that describe the position of the decision point: "Prepare to turn left after the church". When no landmark is available, the system tries to use street intersections as secondary landmarks, as in "Turn right at the next/second/third intersection". Metric distances are only used when both of these strategies fail.

In-car NLG takes place in a heavily real-time setting, in which an utterance becomes uninterpretable or even misleading if it is given too late. This problem is exacerbated for NLG of speech because simply speaking the utterance takes time as well. One consequence that our system addresses is the problem of planning preview messages in such a way that they can be spoken before the decision point without overlapping each other. We handle this problem in the sentence planner, which may aggregate utterances to fit into the available time. A second problem is that the user's reactions to the generated utterances are unpredictable; if the driver takes a wrong turn, the system must generate updated instructions in real time.

Below, we describe the individual components of the system. We mostly follow a standard NLG pipeline (Reiter and Dale, 2000), with a focus on the sentence planner and an extension to interactive real-time NLG.

Segment123:  From: Node1   To: Node2   On: Main Street
Segment124:  From: Node2   To: Node3   On: Main Street
Segment125:  From: Node3   To: Node4   On: Park Street
Segment126:  From: Node4   To: Node5   On: Park Street

Figure 4: A simple example of a route plan consisting of four street segments.

4.1 Content determination and text planning

The first step in our system is to obtain a plan for reaching the destination. To this end, we compute a shortest path on the directed street graph described in Section 3. The result is an ordered list of street segments that need to be traversed in the given order to successfully reach the destination; see Fig. 4 for an example.

To be suitable as the input for an NLG system, this flat list of OpenStreetMap nodes needs to be subdivided into smaller message chunks. In turn-by-turn navigation, the general delimiter between such chunks are the driving maneuvers that the driver must execute at each decision point. We call each span between two decision points an episode. Episodes are not explicitly represented in the original route plan: although every segment has a street name associated with it, the name of a street sometimes changes as we go along, and because chains of segments are used to model curved streets in OpenStreetMap, even segments that are joined at an angle may be parts of the same street. Thus, in Fig. 4 it is not apparent which segment traversals require any navigational maneuvers.

We identify episode boundaries with the following heuristic. We first assume that episode boundaries occur when the street name changes from one segment to the next. However, staying on the road may involve a driving maneuver (and therefore a decision point) as well, e.g. when the road makes a sharp turn where a minor street forks off. To handle this case, we introduce decision points at nodes with multiple adjacent segments if the angle between the incoming and outgoing segment of the street exceeds a certain threshold. Conversely, our heuristic will sometimes end an episode where no driving maneuver is necessary, e.g. when an ongoing street changes its name. This is unproblematic in practice; the system will simply generate an instruction to keep driving straight ahead. Fig. 3 shows a graphical representation of an episode, with the street segments belonging to it drawn as red dashed lines.
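The episode-splitting heuristic just described can be sketched as follows. The segment representation, the helper functions and the 30-degree value are assumptions made for illustration; the paper only speaks of "a certain threshold".

```python
from math import atan2, degrees

ANGLE_THRESHOLD = 30.0   # assumed value; the paper does not state the threshold

def heading(nodes, seg):
    a, b = nodes[seg["from"]], nodes[seg["to"]]
    return degrees(atan2(b["lat"] - a["lat"], b["lon"] - a["lon"]))

def turn_angle(nodes, seg_in, seg_out):
    diff = heading(nodes, seg_out) - heading(nodes, seg_in)
    return (diff + 180) % 360 - 180      # normalise to [-180, 180]

def split_into_episodes(route, nodes):
    """route: list of segments, each a dict with 'from', 'to' and 'name'.

    Returns a list of episodes; each episode is a list of consecutive
    segments ending at a decision point (driving maneuver).
    """
    episodes, current = [], []
    for prev, seg in zip(route, route[1:]):
        current.append(prev)
        name_change = prev["name"] != seg["name"]
        sharp_turn = abs(turn_angle(nodes, prev, seg)) > ANGLE_THRESHOLD
        if name_change or sharp_turn:
            episodes.append(current)
            current = []
    current.append(route[-1])
    episodes.append(current)
    return episodes
```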
4.2 Aggregation

Because we generate spoken instructions that are given to the user while they are driving, the timing of the instructions becomes a crucial issue, especially because a driver moves faster than the user of a pedestrian navigation system. It is undesirable for a second instruction to interrupt an earlier one. On the other hand, the second instruction cannot be delayed because this might make the user miss a turn or interpret the instruction incorrectly.

We must therefore control at which points instructions are given and make sure that they do not overlap. We do this by always presenting preview messages at trigger positions at certain fixed distances from the decision point. The sentence planner calculates where these trigger positions are located for each episode. In this way, we create time frames during which there is enough time for instructions to be presented.

However, some episodes are too short to accommodate the three trigger positions for the confirmation message and the two preview messages. In such episodes, we aggregate different messages. We remove the trigger positions for the two preview messages from the episode, and instead add the first preview message to the turn instruction message of the previous episode. This allows our system to generate instructions like "Now turn right, and then turn left after the church."
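A rough sketch of this trigger-position and aggregation logic. The concrete offsets and field names are assumptions for illustration; the paper does not report the fixed distances it uses.

```python
PREVIEW_OFFSETS = (100.0, 50.0)   # metres before the decision point (assumed values)
CONFIRM_OFFSET = 50.0             # metres after the decision point (assumed value)

def episode_length(episode):
    return sum(seg["length"] for seg in episode)   # assumes precomputed segment lengths

def plan_triggers(episodes):
    """Attach trigger positions to every episode; aggregate when it is too short."""
    plans = []
    for episode in episodes:
        plan = {"turn_at": 0.0, "confirm_at": CONFIRM_OFFSET, "previews_at": []}
        if episode_length(episode) > max(PREVIEW_OFFSETS):
            plan["previews_at"] = list(PREVIEW_OFFSETS)
        elif plans:
            # Episode too short for separate previews: merge the first preview into
            # the previous turn instruction ("Now turn right, and then turn left ...").
            plans[-1]["aggregate_next_preview"] = True
        plans.append(plan)
    return plans
```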
4.3 Generation of landmark descriptions

The Virtual Co-Pilot computes referring expressions to decision points by selecting appropriate landmarks. To this end, it first looks up landmark candidates within a given range of the decision point from the database created in Section 3. This yields an initial list of landmark candidates.

Some of these landmark candidates may be unsuitable for the given situation because of lack of uniqueness. If there are several visual landmarks of the same type along the course of an episode, all of these landmark candidates are removed. For episodes which contain multiple street furniture landmarks of the same type, the first three in each episode are retained; a referring expression for the decision point might then be "at the second traffic light". If the decision point is no more than three intersections away, we also add a landmark description of the form "at the third intersection". Furthermore, a landmark must be visible from the last segment of the current episode; we only retain a candidate if it is either adjacent to a segment of the current episode or if it is close to the end point of the very last segment of the episode. Among the landmarks that are left over, the system prefers visual landmarks over street furniture, and street furniture over intersections. If no landmark candidates are left over, the system falls back to metric distances.

Second, the Virtual Co-Pilot determines the spatial relationship between the landmark and the decision point so that an appropriate preposition can be used in the referring expression. If the decision point occurs before the landmark along the course of the episode, we use the preposition "in front of"; otherwise, we use "after". Intersections are always used with "at" and metric distances with "in".
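The filtering, preference and preposition rules of this subsection can be summarised compactly. The candidate dictionaries and the visibility helper below are assumptions made for illustration, not the authors' code.

```python
from collections import Counter

def select_landmark(candidates, episode):
    """Filter and rank landmark candidates for one decision point.

    Each candidate is assumed to be a dict with 'type' (e.g. 'church') and
    'kind' ('visual' or 'street_furniture'); is_visible_from_last_segment is
    a placeholder for the adjacency check described in the text.
    """
    type_counts = Counter(c["type"] for c in candidates)
    furniture_seen = Counter()
    usable = []
    for c in candidates:
        if c["kind"] == "visual" and type_counts[c["type"]] > 1:
            continue                      # several visual landmarks of one type: drop them all
        if c["kind"] == "street_furniture":
            furniture_seen[c["type"]] += 1
            if furniture_seen[c["type"]] > 3:
                continue                  # keep at most the first three per type
        if not is_visible_from_last_segment(c, episode):   # assumed helper
            continue
        usable.append(c)
    for preferred_kind in ("visual", "street_furniture"):
        for c in usable:
            if c["kind"] == preferred_kind:
                return c
    return None   # caller falls back to intersections, then to metric distances

def preposition(decision_point_before_landmark, is_intersection=False, is_metric=False):
    """'in front of'/'after' for landmarks, 'at' for intersections, 'in' for metres."""
    if is_metric:
        return "in"
    if is_intersection:
        return "at"
    return "in front of" if decision_point_before_landmark else "after"
```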
Finally, the system decides how to refer to the landmark objects themselves. Although it has access to the names of all objects from the OpenStreetMap data, the user may not know these names. We therefore refer to churches, gas stations, and any street furniture simply as "the church", "the gas station", etc. For supermarkets and bars, we assume that these buildings are more saliently referred to by their names, which are used in everyday language, and therefore use the names to refer to them.

The result of the sentence planning stage is a list of semantic representations, specifying the individual instructions that are to be uttered in each episode; an example is shown in Fig. 5. For each type of instruction, we then use a sentence template to generate linguistic surface forms by inserting the information contained in those plans into the slots provided by the templates (e.g. turn direction + preposition + landmark).

Preview message p1:
  Trigger position: Node3 - 50m
  Turn direction: right
  Landmark: church
  Preposition: after
Preview message p2 = p1, except:
  Trigger position: Node3 - 100m
Turn instruction t1:
  Trigger position: Node3
  Turn direction: right
Confirmation message c1:
  Trigger position: Node3 + 50m

Figure 5: Semantic representations of the different types of instructions in one episode.

4.4 Interactive generation

As a final point, the NLG process of a car navigation system takes place in an interactive setting: as the system generates and utters instructions, the user may either follow them correctly, or they may miss a turn or turn incorrectly because they misunderstood the instruction or were forced to disregard it by the traffic situation. The system must be able to detect such problems, recover from them, and generate new instructions in real time.

Our system receives a continuous stream of information about the position and direction of the user. It performs execution monitoring to check whether the user is still following the intended route. If a trigger position is reached, we present the instruction that we have generated for this position. If the user has left the route, the system reacts by planning a new route starting from the user's current position and generating a new set of instructions. We check whether the user is following the intended route in the following way. The system keeps track of the current episode of the route plan, and monitors the distance of the car to the final node of the episode. While the user is following the route correctly, the distance between the car and the final node should decrease or at least stay the same between two measurements. To accommodate for occasional deviations from the middle of the road, we allow five subsequent measurements to increase the distance; the sixth increase of the distance triggers a recomputation of the route plan and a freshly generated instruction. On the other hand, when the distance

of the car to the final node falls below a certain
threshold, we assume that the end of the episode
has been reached, and activate the next episode.
By monitoring whether the user is now approach-
ing the final node of this new episode, we can in
particular detect wrong turns at intersections.
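The monitoring rule described above (a growing distance to the episode's final node on more than five consecutive updates triggers replanning; a small distance ends the episode) can be sketched as a tiny state machine. The arrival radius, the distance helper and the replanning callback are assumptions made for illustration.

```python
ARRIVAL_RADIUS = 15.0      # metres; assumed value for "falls below a certain threshold"
MAX_INCREASES = 5          # the sixth consecutive increase triggers replanning

class ExecutionMonitor:
    def __init__(self, episode_end, replan):
        self.episode_end = episode_end      # final node of the current episode
        self.replan = replan                # callback: plan a new route from here
        self.last_distance = float("inf")
        self.increases = 0

    def update(self, car_position):
        d = distance(car_position, self.episode_end)   # assumed distance helper
        if d <= ARRIVAL_RADIUS:
            return "episode_completed"
        if d > self.last_distance:
            self.increases += 1
            if self.increases > MAX_INCREASES:
                self.replan(car_position)
                self.increases = 0
                self.last_distance = float("inf")
                return "replanned"
        else:
            self.increases = 0              # occasional deviations are tolerated
        self.last_distance = d
        return "on_route"
```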
Because each instruction carries the risk that it
may not be followed correctly, there is a question
as to whether it is worth planning out all remain-
ing instructions for the complete route plan. After
all, if the user does not follow the first instruc-
tion, the computation of all remaining instructions
was a waste of time. We decided to compute all
future instructions anyway because the aggrega-
Figure 6: Experiment setup. A) Main screen B) Navi-
tion procedure described above requires them. In gation screen C) steering wheel D) eye tracker
practice, the NLG process is so efficient that all
instructions can be done in real time, but this de-
cision would have to be revisited for a slower sys- a separate 7 monitor (B). The driving simula-
tem. tor was controlled by means of a steering wheel
(C), along with a pair of brake and acceleration
5 Evaluation pedals. We recorded user eye movements using
We will now report on an experiment in which we a Tobii IS-Z1 table-mounted eye tracker (D). The
evaluated the performance of the Virtual Co-Pilot. generated instructions were converted to speech
using MARY, an open-source text-to-speech sys-
5.1 Experimental Method tem (Schroder and Trouvain, 2003), and played
5.1.1 Subjects back on loudspeakers.
The task of the user was to drive the car in
In total, 12 participants were recruited through
the virtual environment towards a given destina-
printed ads and mailing lists. All of them were
tion; spoken instructions were presented to them
university students aged between 21 and 27 years.
as they were driving, in real time. Using the
Our experiment was balanced for gender, hence
steering wheel and the pedals, users had full con-
we recruited 6 male and 6 female participants. All
trol over steering angles, acceleration and brak-
participants were compensated for their effort.
ing. The driving speed was limited to 30 km/h, but
5.1.2 Design there were no restrictions otherwise. The driving
The driving simulator used in the experiment simulator sent the NLG system a message with the
replicates a real-world city center using a 3D current position of the car (as GPS coordinates)
model that contains buildings and streets as they once per second.
can be perceived in reality. The street layout 3D Each user was asked to drive three short routes
model used by the driving simulator is based on in the driving simulator. Each route took about
OpenStreetMap data, and buildings were added to four minutes to complete, and the travelled dis-
the virtual environment based on cadastral data. tance was about 1 km. The number of episodes
To increase the perceived realism of the model, per route ranged from three to five. Landmark
some buildings were manually enhanced with candidates were sufficiently dense that the Virtual
photographic images of their real-world counter- Co-Pilot used landmarks to refer to all decision
parts (see Fig. 7). points and never had to fall back to the metric dis-
Figure 6 shows the set-up of the evaluation ex- tance strategy.
periment. The virtual driving simulator environ- There were three experimental conditions,
ment (main picture in Fig. 7) was presented to the which differed with respect to the spoken route
participants on a 20 computer screen (A). In ad- instructions and the use of the navigation screen.
dition, graphical navigation instructions (shown In the baseline condition, designed to replicate the
in the lower right of Fig. 7) were displayed on behavior of an off-the-shelf commercial car nav-

                                                   All Users      Males          Females
                                                   B      VCP     B      VCP     B      VCP
Total Fixation Duration (seconds)                  4.9    3.5     2.7    4.1     7.0    2.9*
Total Fixation Count (N)                           21.8   15.4    13.5   16.5    30.0   14.3*
"The system provided the right amount of
 information at any time"                          3.9    2.9     4.2*   3.3     3.5    2.5
"I was insecure at times about still being
 on the right track."                              2.3    3.2     1.9*   2.8     2.6    3.5
"It was important to have a visual
 representation of route directions"               4.3    4.0     4.2    4.2     4.3    3.7
"I could trust the navigation system"              3.6    3.7     4.1    3.7     3.0    3.7

Figure 8: Mean values for gaze behavior and subjective evaluation, separated by user group and condition (B = baseline, VCP = our system). Significant differences are indicated by *; better values are printed in boldface.
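The starred cells in Figure 8 come from t-tests for dependent samples over per-subject values (reported in Section 5.2). A minimal sketch with SciPy, using invented numbers purely for illustration:

```python
from scipy.stats import ttest_rel

# Hypothetical per-subject total fixation durations (seconds) for six
# participants, baseline (B) vs. Virtual Co-Pilot (VCP); not real data.
baseline = [6.1, 8.3, 7.5, 6.8, 7.2, 6.3]
vcp = [2.4, 3.1, 2.8, 3.3, 2.6, 3.2]

t, p = ttest_rel(baseline, vcp)
print(f"t({len(baseline) - 1}) = {t:.2f}, p = {p:.3f}")
```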

aspects of their cognitive workload (general, vi-


sual, auditive and temporal workload, as well as
perceived stress level). In the second question-
naire, participants were state to rate their agree-
ment with a number of statements about their sub-
jective impression of the system on a 5-point un-
labelled Likert scale, e.g. whether they had re-
ceived instructions at the right time or whether
they trusted the navigation system to give them
the right instructions during trials.

5.2 Results
Figure 7: Screenshot of a scene in the driving simula- There were no significant differences between the
tor. Lower right corner: matching screenshot of navi- Virtual Co-Pilot and the baseline system on task
gation display.
completion time, rate of driving errors, or any of
the questions of the DALI questionnaire. Driv-
ing errors in particular were very rare: there were
igation system, participants were provided with only four driving errors in total, two of which
spoken metric distance-to-turn navigation instruc- were due to problems with left/right coordination.
tions. The navigation screen showed arrows de- We then analyzed the gaze data collected by the
picting the direction of the next turn, along with table-mounted eye tracker, which we set up such
the distance to the decision point (cf. Fig. 7). The that it recognized glances at the navigation screen.
second condition replaced the spoken route in- In particular, we looked at the total fixation dura-
structions by those generated by the Virtual Co- tion (TFD), i.e. the total amount of time that a user
Pilot. In a third condition, the output of the nav- spent looking at the navigation screen during a
igation screen was further changed to display an given trial run. We also looked at the total fixation
icon for the next landmark along with the arrow count (TFC), i.e. the total number of times that a
and distance indicator. The three routes were pre- user looked at the navigation screen in each run.
sented to the users in different orders, and com- Mean values for both metrics are given in Fig. 8,
bined with the conditions in a Latin Squares de- averaged over all subjects and only male and fe-
sign. In this paper, we focus on the first and sec- male subjects, respectively; the VCP column is
ond condition, in order to contrast the two styles for the Virtual Co-Pilot, whereas B stands for
of spoken instruction. the baseline. We found that male users tended
Participants were asked to answer two ques- to look more at the navigation screen in the VCP
tionnaires after each trial run. The first was the condition than in B, although the difference is not
DALI questionnaire (Pauzie, 2008), which asks statistically significant. However, female users
subjects to report how they perceived different looked at the navigation screen significantly fewer

times (t(5) = 3.2, p < 0.05, t-test for dependent other subjective questions. This may partly be due
samples) and for significantly shorter amounts of to the fact that the subjects were familiar with ex-
time (t(5) = 3.2, p < 0.05) in the VCP condition isting commercial car navigation systems and not
than in B. used to landmark-based instructions. On the other
On the subjective questionnaire, most questions hand, this finding is also consistent with results
yielded no significant differences (and are not re- of other evaluations of NLG systems, in which
ported here). However, we found that female an improvement in the objective task usefulness
users tended to rate the Virtual Co-Pilot more pos- of the system does not necessarily correlate with
itively than the baseline on questions concerning improved scores from subjective questionnaires
trust in the system and the need for the navigation (Gatt et al., 2009).
screen (but not significantly). Male users found
that the baseline significantly outperformed the 6 Conclusion
Virtual Co-Pilot on presenting instructions at the
In this paper, we have described a system for gen-
right time (t(5) = 2.7, p < 0.05) and on giving
erating real-time car navigation instructions with
them a sense of security in still being on the right
landmarks. Our system is distinguished from ear-
track (t(5) = 2.7, p < 0.05).
lier work in its reliance on open-source map data
5.3 Discussion from OpenStreetMap, from which we extract both
the street graph and the potential landmarks. This
The most striking result of the evaluation is that demonstrates that open resources are now infor-
there was a significant reduction of looks to the mative enough for use in wide-coverage naviga-
navigation display, even if only for one group tion NLG systems. The system then chooses ap-
of users. Female users looked at the navigation propriate landmarks at decision points, and con-
screen less and more rarely with the Virtual Co- tinuously monitors the drivers behavior to pro-
Pilot compared to the baseline system. In a real vide modified instructions in real time when driv-
car navigation system, this translates into a driver ing errors occur.
who spends less time looking away from the road,
We evaluated our system using a driving simu-
i.e. a reduction in driver distraction and an in-
lator with respect to driving errors, user satisfac-
crease in traffic safety. This suggests that female
tion, and driver distraction. To our knowledge,
users learned to trust the landmark-based instruc-
we have shown for the first time that a landmark-
tions, an interpretation that is further supported
based car navigation system outperforms a base-
by the trends we found in the subjective question-
line significantly; namely, in the amount of time
naire.
female users spend looking away from the road.
We did not find these differences in the male
In many ways, the Virtual Co-Pilot is a very
user group. Part of the reason may be the known
simple system, which we see primarily as a start-
gender differences in landmark use we mentioned
ing point for future research. The evaluation
in Section 2. But interestingly, the two signifi-
confirmed the importance of interactive real-time
cantly worse ratings by male users concerned the
NLG for navigation, and we therefore see this as
correct timing of instructions and the feedback for
a key direction of future work. On the other hand,
driving errors, i.e. issues regarding the systems
it would be desirable to generate more complex
real-time capabilities. Although our system does
referring expressions (the tall church). This
not yet perform ideally on these measures, this
would require more informative map data, as well
confirms our initial hypothesis that the NLG sys-
as a formal model of visual salience (Kelleher and
tem must track the users behavior and schedule
van Genabith, 2004; Raubal and Winter, 2002).
its utterances appropriately. This means that ear-
lier systems such as CORAL, which only com- Acknowledgments. We would like to thank the
pute a one-shot discourse of route instructions DFKI CARMINA group for providing the driv-
without regard to the timing of the presentation, ing simulator, as well as their support. We would
miss a crucial part of the problem. furthermore like to thank the DFKI Agents and
Apart from the exceptions we just discussed, Simulated Reality group for providing the 3D city
the landmark-based system tended to score com- model.
parably or a bit worse than the baseline on the

References A. J. May and T. Ross. 2006. Presence and quality
of navigational landmarks: effect on driver perfor-
G. L. Allen. 2000. Principles and practices for com- mance and implications for design. Human Fac-
municating route knowledge. Applied Cognitive tors: The Journal of the Human Factors and Er-
Psychology, 14(4):333359. gonomics Society, 48(2):346.
C. Brenner and B. Elias. 2003. Extracting land- P. E. Michon and M. Denis. 2001. When and why are
marks for car navigation systems using existing visual landmarks used in giving directions? Spatial
gis databases and laser scanning. International information theory, pages 292305.
archives of photogrammetry remote sensing and
A. Pauzie. 2008. Evaluating driver mental workload
spatial information sciences, 34(3/W8):131138.
using the driving activity load index (DALI). In
G. Burnett. 2000. Turn right at the Traffic Lights: Proc. of European Conference on Human Interface
The Requirement for Landmarks in Vehicle Nav- Design for Intelligent Transport Systems, pages 67
igation Systems. The Journal of Navigation, 77.
53(03):499510. M. Raubal and S. Winter. 2002. Enriching wayfind-
R. Dale, S. Geldof, and J. P. Prost. 2003. Using natural ing instructions with local landmarks. Geographic
language generation for navigational assistance. In information science, pages 243259.
ACSC, pages 3544. E. Reiter and R. Dale. 2000. Building natural lan-
B. Elias. 2003. Extracting landmarks with data min- guage generation systems. Studies in natural lan-
ing methods. Spatial information theory, pages guage processing. Cambridge University Press.
375389. D. M. Saucier, S. M. Green, J. Leason, A. MacFadden,
A. Gatt, F. Portet, E. Reiter, J. Hunter, S. Mahamood, S. Bell, and L. J. Elias. 2002. Are sex differences in
W. Moncur, and S. Sripada. 2009. From data to text navigation caused by sexually dimorphic strategies
in the neonatal intensive care unit: Using NLG tech- or by differences in the ability to use the strategies?.
nology for decision support and information man- Behavioral Neuroscience, 116(3):403.
agement. AI Communications, 22:153186. M. Schroder and J. Trouvain. 2003. The German
S. Kaplan. 1976. Adaption, structure and knowledge. text-to-speech synthesis system MARY: A tool for
In G. Moore and R. Golledge, editors, Environmen- research, development and teaching. International
tal knowing: Theories, research and methods, pages Journal of Speech Technology, 6(4):365377.
3245. Dowden, Hutchinson and Ross. K. Striegnitz and F. Majda. 2009. Landmarks in
J. D. Kelleher and J. van Genabith. 2004. Visual navigation instructions for a virtual environment.
salience and reference resolution in simulated 3-D Online Proceedings of the First NLG Challenge
environments. Artificial Intelligence Review, 21(3). on Generating Instructions in Virtual Environments
A. Koller, K. Striegnitz, D. Byron, J. Cassell, R. Dale, (GIVE-1).
J. Moore, and J. Oberlander. 2010. The First Chal- J. C. Stutts, D. W. Reinfurt, L. Staplin, and E. A. Rodg-
lenge on Generating Instructions in Virtual Environ- man. 2001. The role of driver distraction in traf-
ments. In E. Krahmer and M. Theune, editors, Em- fic crashes. Washington, DC: AAA Foundation for
pirical Methods in Natural Language Generation. Traffic Safety.
Springer. A. Tom and M. Denis. 2003. Referring to landmark
N. Lessmann, S. Kopp, and I. Wachsmuth. 2006. Sit- or street information in route directions: What dif-
uated interaction with a virtual human percep- ference does it make? Spatial information theory,
tion, action, and cognition. In G. Rickheit and pages 362374.
I. Wachsmuth, editors, Situated Communication,
pages 287323. Mouton de Gruyter.
K. Lovelace, M. Hegarty, and D. Montello. 1999. El-
ements of good route directions in familiar and un-
familiar environments. Spatial information theory.
Cognitive and computational foundations of geo-
graphic information science, pages 751751.
K. Lynch. 1960. The image of the city. MIT Press.
R. Malaka and A. Zipf. 2000. DEEP MAP Chal-
lenging IT research in the framework of a tourist in-
formation system. Information and communication
technologies in tourism, 7:1527.
R. Malaka, J. Haeussler, and H. Aras. 2004.
SmartKom mobile: intelligent ubiquitous user in-
teraction. In Proceedings of the 9th International
Conference on Intelligent User Interfaces.

To what extent does sentence-internal realisation reflect discourse context? A study on word order

Sina Zarrieß, Jonas Kuhn (Institut für maschinelle Sprachverarbeitung, University of Stuttgart, Germany), zarriesa,jonas@ims.uni-stuttgart.de
Aoife Cahill (Educational Testing Service, Princeton, NJ 08541, USA), acahill@ets.org
Abstract context (givenness or salience of particular refer-


ents, prior mentioning of particular concepts).
We compare the impact of sentence-
Since so many factors are involved and there is
internal vs. sentence-external features on
word order prediction in two generation further interaction with subtle semantic and prag-
settings: starting out from a discrimina- matic differentiations, lexical choice, stylistics
tive surface realisation ranking model for and presumably processing factors, theoretical ac-
an LFG grammar of German, we enrich counts making reliable predictions for real cor-
the feature set with lexical chain features pus examples have for a long time proven elusive.
from the discourse context which can be As for German, only quite recently, a number of
robustly detected and reflect rough gram- corpus-based studies (Filippova and Strube, 2007;
matical correlates of notions from theoreti-
Speyer, 2005; Dipper and Zinsmeister, 2009) have
cal approaches to discourse coherence. In a
more controlled setting, we develop a con- made some good progress towards a coherence-
stituent ordering classifier that is trained oriented account of at least the left edge of the
on a German treebank with gold corefer- German clause structure, the Vorfeld constituent.
ence annotation. Surprisingly, in both set- What makes the technological application of
tings, the sentence-external features per- theoretical insights even harder is that for most
form poorly compared to the sentence-
internal ones, and do not improve over
relevant factors, automatic recognition cannot be
a baseline model capturing the syntactic performed with high accuracy (e.g., a coreference
functions of the constituents. accuracy in the 70s means there is a good deal
of noise) and for the higher-level notions such
as the information-structural focus, interannotator
1 Introduction agreement on real corpus data tends to be much
The task of surface realization, especially in a rel- lower than for core-grammatical notions (Poesio
atively free word order language like German, is and Artstein, 2005; Ritz et al., 2008).
only partially determined by hard syntactic con- On the other hand, many of the relevant dis-
straints. The space of alternative realizations that course factors are reflected indirectly in proper-
are strictly speaking grammatical is typically con- ties of the sentence-internal material. Most no-
siderable. Nevertheless, for any given choice of tably, knowing the shape of referring expressions
lexical items and prior discourse context, only a narrows down many aspects of givenness and
few realizations will come across as natural and salience of its referent; pronominal realizations
will contribute to a coherent text. Hence, any NLP indicate givenness, and in German there are even
application involving a non-trivial generation step two variants of the personal pronoun (er and der)
is confronted with the issue of soft constraints on for distinguishing salience. So, if the genera-
grammatical alternatives in one way or another. tion task is set in such a way that the actual lex-
There are countless approaches to modelling ical choice, including functional categories such
these soft constraints, taking into account their as determiners, is fully fixed (which is of course
interaction with various aspects of the discourse not always the case), one can take advantage of

these reflexes. This explains in part the fairly high pata (2010) have improved a sentence compres-
baseline performance of n-gram language mod- sion system by capturing prominence of phrases
els in the surface realization task. And the effect or referents in terms of lexical chain information
can indeed be taken much further: the discrimi- inspired by Morris and Hirst (1991) and Center-
native training experiments of Cahill and Riester ing (Grosz et al., 1995). In their system, discourse
(2009) show how effective it is to systematically context is represented in terms of hard constraints
take advantage of asymmetry patterns in the mor- modelling whether a certain constituent can be
phosyntactic reflexes of the discourse notion of deleted or not.
information status (i.e., using a feature set with In the linearisation or surface realisation do-
well-chosen purely sentence-bound features). main, there is a considerable body of work ap-
These observations give rise to the question: in proximating information structure in terms of
the light of the difficulty in obtaining reliable dis- sentence-internal realisation (Ringger et al., 2004;
course information on the one hand and the effec- Filippova and Strube, 2009; Velldal and Oepen,
tiveness of exploiting the reflexes of discourse in 2005; Cahill et al., 2007). Cahill and Riester
the sentence-internal material on the other can (2009) improve realisation ranking for German
we nevertheless expect to gain something from which mainly deals with word order variation by
adding sentence-external feature information? representing precedence patterns of constituents
We propose two scenarios for adressing this in terms of asymmetries in their morphosyntac-
question: first, we choose an approximative ac- tic properties. As a simple example, a pattern ex-
cess to context information and relations between ploited by Cahill and Riester (2009) is the ten-
discourse referents lexical reiteration of head dency of definite elements tend to precede indef-
words, combined with information about their inites, which, on a discourse level, reflects that
grammatical relation and topological positioning given entities in a sentence tend to precede new
in prior sentences. We apply these features in a entities.
rich sentence-internal surface realisation ranking Other work on German surface realisation has
model for German. Secondly, we choose a more highlighted the role of the initial position in the
controlled scenario: we train a constituent order- German sentence, the so-called Vorfeld (or pre-
ing classifier based on a feature model that cap- field). Filippova and Strube (2007) show that
tures properties of discourse referents in terms of once the Vorfeld (i.e. the constituent that precedes
manually annotated coreference relations. As we the finite verb) is correctly determined, the pre-
get the same effect in both setups the sentence- diction of the order in the Mittelfeld (i.e. the con-
external features do not improve over a baseline stituents that follow the finite verb) is very easy.
that captures basic morphosyntactic properties of Cheung and Penn (2010) extend the approach
the constituents we conclude that sentence- of Filippova and Strube (2007) and augment a
internal realisation is actually a relatively accurate sentence-internal constituent ordering model with
predictor of discourse context, even more accurate sentence-external features inspired from the en-
than information that can be obtained from coref- tity grid model proposed by Barzilay and Lapata
erence and lexical chain relations. (2008).

2 Related Work 3 Motivation


In the generation literature, most works on ex- While there would be many ways to construe
ploiting sentence-external discourse information or represent discourse context (e.g. in terms of
are set in a summarisation or content ordering the global discourse or information structure), we
framework. Barzilay and Lee (2004) propose an concentrate on capturing local coherence through
account for constraints on topic selection based on the distribution of discourse referents in a text.
probabilistic content models. Barzilay and Lapata These discourse referents basically correspond to
(2008) propose an entity grid model which repre- the constituents that our surface realisation model
sents the distribution of referents in a discourse has to put in the right order. As the order of refer-
for sentence ordering. Karamanis et al. (2009) ents or constituents is arguably influenced by the
use Centering-based metrics to assess coherence information structure of a sentence given the pre-
in an information ordering system. Clarke and La- vious text, our main assumption was that infor-

(1) a. Kurze Zeit später erklärte ein Anrufer bei Nachrichtenagenturen in Pakistan, die Gruppe Gamaa bekenne sich.
       Shortly after, a caller declared at the news agencies in Pakistan that the group Gamaa avows itself.
    b. Diese Gruppe wird für einen Großteil der Gewalttaten verantwortlich gemacht, die seit dreieinhalb Jahren in Ägypten verübt worden sind.
       This group is made responsible for most of the violent acts that have been committed in Egypt in the last three and a half years.

(2) a. Belgien wünscht, dass sich WEU und NATO darüber einigen.
       Belgium wants WEU and NATO to agree on that.
    b. Belgien sieht in der NATO die beste militärische Struktur in Europa.
       Belgium sees the best military structure of Europe in the NATO.

(3) a. Frauen vom Land kämpften aktiv darum, ein Staudammprojekt zu verhindern.
       Women from the countryside fought actively to block the dam project.
    b. Auch in den Städten fanden sich immer mehr Frauen in Selbsthilfeorganisationen zusammen.
       Also in the cities, more and more women teamed up in self-help organisations.
mation about the prior mentioning of a referent of the noun group is modified by a demonstra-
would be helpful for predicting the position of this tive pronoun such that its known and prominent
referent in a sentence. discourse status is overt in the morpho-syntactic
The idea that the occurence of discourse refer- realisation. In Example (2), both instances of
ents in a text is a central aspect of discourse struc- Belgium are realised as bare proper nouns with-
ture has been systematically pursued by Centering out an overt morphosyntactic clue indicating their
Theory (Grosz et al., 1995). Its most important discourse status.
notions are related to the realisation of discourse Beyond the simple presence of reitered items in
referents (i.e. described as centers) and the way sequences of sentences, we expected that it would
the centers are arranged in a sequence of utter- be useful to look at the position and syntactic
ances to make this sequence a coherent discourse. function of the previous mentions of a discourse
Another important concept is the ranking of dis- referent. In Example (1), the reiterated item is first
course referents which basically determines the introduced in an embedded sentence and realised
prominence of a referent in a certain sentence and in the Vorfeld in the second utterance. In terms
is driven by several factors (e.g. their grammati- of centering, this transition would correspond to
cal function). For free word order languages like a topic shift. In Example (2), both instances are
German, word order has been proposed as one of realised in the Vorfeld, such that the topic of the
the factors that account for the ranking (Poesio et first sentence is carried over to the next.
al., 2004). In a similar spirit, Morris and Hirst In Example (3), we illustrate a further type of
(1991) have proposed that chains of (related) lex- lexical reiteration. In this case, two identical head
ical items in a text are an important indicator of nouns are realised in subsequent sentences, even
text structure. though they refer to two different discourse refer-
Our main hypothesis was that it is possible to ents. While this type of lexical chain is described
exploit these intuitions from Centering Theory as reiteration without identity of referents by
and the idea of lexical chains for word order pre- Morris and Hirst (1991), it would not be captured
diction. Thus, we expected that it would be easier in Centering since this is not a case of strict coref-
to predict the position of a referent in a sentence erence. On the other hand, lexical chains do not
if we have not only given its realisation in the cur- capture types of reiterated discourse referents that
rent utterance but also its prominence in the previ- have distinct morpho-syntactic realisations, e.g.
ous discourse. Especially, we expected this intu- nouns and pronouns.
ition to hold for cases where the morpho-syntactic Originally, we had the hypothesis that strict
realisation of a constituent does not provide many corefence information is more useful and accurate
clues. This is illustrated in Examples (1) and (2) for word order prediction than rather loose lexi-
which both exemplify the reiteration of a lexical cal chains which conflate several types of referen-
item in two subsequent sentences, (reiteration is tial and lexical relations. However, the advantage
one type of lexical chain discussed in Morris and of chains, especially chains of reiteration, is that
Hirst (1991)). In Example (1), the second instance they can be easily detected in any corpus text and

769
that they might capture topics of sentences be- The realisation ranking component is an SVM
yond the identity of referents. Thus, we started ranking model implemented with SVMrank,
out from the idea of lexical chains and added cor- a Support Vector Machine-based learning tool
responding features in a statistical ranking model (Joachims, 2006). During training, each sentence
for surface realisation of German (Section 4). As is annotated with a rank and a set of features ex-
this strategy did not work out, we wanted to assess tracted from the F-structure, its surface string and
whether an ideal coreference annotation would be external resources (e.g. a language model). If
helpful at all for predicting word order. In a sec- the sentence matches the original corpus string,
ond experiment, we use a corpus which is manu- its rank will be highest, the assumption being that
ally annotated for coreference (Section 5). the original sentence corresponds to the optimal
realisation in context. The output of generation,
4 Experiment 1: Realisation Ranking the top-ranked sentence, is evaluated against the
with Lexical Chains original corpus sentence.

In this Section, we present an experiment that in- 4.2 The Feature Models
vestigates sentence-external context in a surface As the aim of this experiment is to better un-
realisation task. The sentence-external context is derstand the nature of sentence-internal features
represented in terms of lexical chain features and reflecting discourse context and compare them
compared to sentence-internal models which are to sentence-external ones, we build several fea-
based on morphosyntactic features. The experi- ture models which capture different aspects of the
ment thus targets a generation scenario where no constituents in a given sentence. The sentence-
coreference information is available and aims at internal features describe the morphosyntacic re-
assessing whether relatively naive context infor- alisation of constituents, for instance their func-
mation is also useful. tion (subject, object), and can be straightfor-
wardly extracted from the f-structure. These fea-
4.1 System Description tures are then combined into discriminative prece-
We carry out our first experiment in a regener- dence features, for instance subject-precedes-
ation set-up with two components: a) a large- object. We implement the following types of
scale hand-crafted Lexical Functional Grammar morphosyntactic features:
(LFG) for German (Rohrer and Forst, 2006), used
syntactic function (arguments and adjuncts)
to parse and regenerate a corpus sentence, b)
a stochastic ranker that selects the most appro- modification (e.g. nouns modified by relative
priate regenerated sentence in context according clauses, genitive etc.)
to an underlying, linguistically motivated feature syntactic category (e.g. adverbs, proper
model. In contrast to fully statistical linearisation nouns, phrasal arguments)
methods, our system first generates the full set definiteness for nouns
of sentences that correspond to the grammatically number and person for nominal elements
well-formed realisations of the intermediate syn- types of pronouns (e.g. demonstrative, re-
tactic representation.1 This representation is an flexive)
f-structure, which underspecifies the order of con- constituent span and number of embedded
stituents and, to some extent, their morphological nodes in the tree
realisation, such that the output sentences contain In addition, we also include language model
all possible combinations of word order permu- scores in our ranking model. In Section 4.4,
tations and morphological variants. Depending we report on results for several subsets of these
on the length and structure of the original corpus features where BaseSyn refers to a model that
sentence, the set of regenerated sentences can be only includes the syntactic function features and
huge (see Cahill et al. (2007) for details on regen- FullMorphSyn includes all features mentioned
erating the German treebank TIGER). above.
1
There are occasional mistakes in the grammar which
For extracting the lexical chains, we check for
sometimes lead to ungrammatical strings being generated, any overlapping nouns in the n sentences previ-
but this is rare. ous to the current one being generated. We check

770
Rank Sentence and Features
% Diese Gruppe wird fur einen Groteil der Gewalttaten verantwortlich gemacht.
% This group is for a major part of the violent acts responsible made.
1 subject-<-pp-object, demonstrative-<-indefinite, overlap-<-no-overlap, overlap-in-vorfeld, lm:-7.89
% Fur einen Groteil der Gewalttaten wird diese Gruppe verantwortlich gemacht.
% For a major part of the violent acts is this group responsible made.
3 pp-object-<-subject, indefinite-<-demonstrative, no-overlap-<-overlap, no-overlap-in-vorfeld, lm:-10.33
% Verantwortlich gemacht wird diese Gruppe fur einen Groteil der Gewalttaten.
% Responsible made is this group for a major part of the violent acts.
3 subject-<-pp-object, demonstrative-<-indefinite, overlap-<-no-overlap, lm:-9.41

Figure 1: Made-up training example for realisation ranking with precedence features

proper and common nouns, considering full and # Sentences % Sentences with overlap
in context Training Dev Test
partial overlaps as shown in Examples (1) and
1 20.96 23.64 20.42
(2), where the (a) example is the previous sen- 2 35.42 40.74 35.00
tence in the corpus. For each overlap, we record 3 45.58 50.00 53.33
the following properties: (i) function in the previ- 4 52.66 53.70 58.75
5 57.45 58.18 64.58
ous sentence, (ii) position in the previous sentence
6 61.42 57.41 68.75
(e.g. Vorfeld), (iii) distance between sentences, 7 64.58 61.11 70.83
(iv) total number of overlaps. 8 67.05 62.96 72.08
These overlap features are then also 9 69.20 64.81 74.17
combined in terms of precedence, e.g. 10 71.16 70.37 75.83

has subject overlap:3-precedes-no overlap, Table 1: The percentage of sentences that have at least
meaning that in the current sentence a noun one overlapping entity in the previous n sentences
that was previously mentioned in a subject 3
sentences ago precedes a noun that was not
mentioned before.
In Figure 1, we give an example of a set of gen- coreference annotation, since we already have a
eration alternatives and their (partial) feature rep- number of resources available to match the syn-
resentation for the sentence (1-b). Precedence is tactic analyses produced by our grammar against
indicated by <. the analyses in the treebank. Thus, in our regen-
Basically, our sentence-external feature model eration system, we parse the sentences with the
is built on the intuition that lexical chains or over- grammar, and choose the parsed f-structures that
laps approximate discourse status in a way which are compatible with the manual annotation in the
is similar to sentence-internal morphosyntactic TIGER treebank as is done in Cahill et al. (2007).
properties. Thus, we would expect that overlaps This compatibility check eliminates noise which
indicate givenness, salience or prominence and would be introduced by generating from incorrect
that asymmetries between overlapping and non- parses (e.g. incorrect PP-attachments typically re-
overlapping entities are helpful in the ranking. sult in unnatural and non-equivalent surface reali-
sations).
4.3 Data
All our models are trained on 7,039 sentences For comparing the string chosen by the mod-
(subdivided into 1259 texts) from the TIGER els against the original corpus sentence, we use
Treebank of German newspaper text (Brants et al., BLEU, NIST and exact match. Exact match is
2002). We tune the parameters of our SVM model a strict measure that only credits the system if it
on a development set of 55 sentences and report chooses the exact same string as the original cor-
the final results for our unseen test set of 240 sen- pus string. BLEU and NIST are more relaxed
tences. Table 1 shows how many sentences in our measures that compare the strings on the n-gram
training, development and test sets have at least level. Finally, we report accuracy scores for the
one textually overlapping phrase in the previous Vorfeld position (VF) corresponding to the per-
110 sentences. centage of sentences generated with a correct Vor-
We choose the TIGER treebank, which has no feld.

771
Sc BLEU NIST Exact VF by morphosyntactic features. However, we cannot
0 0.766 11.885 50.19 64.0
exclude the possibility that the chain features are
1 0.765 11.756 49.78 64.0
2 0.765 11.886 50.01 64.1 too noisy as they conflate several types of lexical
3 0.765 11.885 50.08 63.8 and coreferential relations. This will be adressed
4 0.761 11.723 49.43 63.2 in the following experiment.
5 0.765 11.884 49.71 64.2
6 0.768 11.892 50.42 64.6
5 Experiment 2: Constituent Ordering
7 0.765 11.885 50.01 64.5
8 0.764 11.884 49.78 64.3 with Centering-inspired Features
9 0.765 11.888 49.82 63.6
10 0.764 11.889 49.7 63.5 We now look at a simpler generation setup where
we concentrate on the ordering of constituents in
Table 2: Tenfold-crossvalidation for feature model the German Vorfeld and Mittelfeld. This strat-
FullMorphSyn and different context windows (Sc ) egy has also been adopted in previous investiga-
Model BLEU VF tions of German word order: Filippova and Strube
Language Model 0.702 51.2 (2007) show that once the German Vorfeld is cor-
Language Model + Context Sc = 5 0.715 54.3 rectly chosen, the prediction accuracy for the Mit-
BaseSyn 0.757 62.0
telfeld (the constituents following the finite verb)
BaseSyn + Context Sc = 5 0.760 63.0
FullMorphSyn 0.766 64.0 is in the 90s.
FullMorphSyn + Context Sc = 5 0.763 64.2 In order to eliminate noise introduced from po-
tentially heterogeneous chain features, we look at
Table 3: Evaluation for different feature models; Lan-
coreference features and, again, compare them to
guage Model: ranking based on language model
scores, BaseSyn: precedence between constituent
sentence-internal morphosyntactic features. We
functions, FullMorphSyn: entire set of sentence- target a generation scenario where coreference in-
internal features. formation is available. The aim is to establish an
upper bound concerning the quality improvement
4.4 Results for word order prediction by recurring to manual
In Table 2, we report the performance of the full corefence annotation.
sentence-internal feature model combined with
5.1 Data and Setup
context windows from zero to ten. The scores
have been obtained from tenfold-crossvalidation. We carry out the constituent ordering experiment
For none of the context windows, the model out- on the Tuba-D/Z treebank (v5) of German news-
performs the baseline with a zero context which paper articles (Telljohann et al., 2006). It com-
has no sentence-external features. In Table 3, prises about 800k tokens in 45k sentences. We
we compare the performance of several feature choose this corpus because it is not only annotated
models corresponding to subsets of the features with syntactic analyses but also with coreference
used so far which are combined with sentence- relations (Naumann, 2006). The syntactic annota-
external features respectively. We note that the tion format differs from the TIGER treebank used
function precedence features (i.e. the BaseSyn in the previous experiment, for instance, it ex-
model) are very powerful, leading to a major im- plicitely represents the Vorfeld and Mittelfeld as
provement compared to a language model. The phrasal nodes in the tree. This format is very con-
sentence-external features lead to an improvement venient for the extraction of constituents in the re-
when combined with the language-model based spective positions.
ranking. However, this improvement is leveled The Tuba-D/Z coreference annotation distin-
out in the BaseSyn model. guishes several relations between discourse ref-
On the one hand, the fact that the lexical chain erents, most importantly coreferential relation
features improve a language-model based ranking and anaphoric relation where the first denotes
suggests these features are, to some extent, pre- a relation between noun phrases that refer to the
dictive for certain patterns of German word order. same entity, and the latter refers to a link between
On the other hand, the fact that they dont improve a pronoun and a contextual antecedent, see Nau-
over an informed sentence-internal baseline sug- mann (2006) for further detail. We expected the
gests that these patterns are equally well captured coreferential relation to be particularly useful, as

772
it cannot always be read off the morphosyntac- # VF # MF
Backward Center 3.5% 5.1%
tic realisation of a noun phrase, whereas pronouns
Forward Center 6.8% 6.8%
are almost always used in an anaphoric relation. Coref Link 30.5% 23.4%
The constituent ordering model is implemented
as a classifier that is given a set of constituents Table 4: Backward and forward centers and their posi-
and predicts the constituent that is most likely to tions
be realised in the Vorfeld.
The set of candidate constituents is determined chain model since there is no lexical overlap be-
from the tree of the original corpus sentence. We tween the realisations of the discourse referents.
will assume that all constituents under a Vorfeld These types of coreference features implicitly
and Mittelfeld node can be freely reordered. Thus, carry the information that would also be consid-
we do not check whether the word order variants ered in a Centering formalisation of discourse
we look at are actually grammatical assuming that context. In addition to these, we designed features
most of them are. In this sense, this experiment that explicitly describe centers as these might
is close to fully statistical generation approaches. have a higher weight. In line with Clarke and
As a further simplification, we do not look at mor- Lapata (2010), we compute backward (CB) and
phological generation variants of the constituents forward centers (CF ) in the following way:
or their head verb.
The classifier is implemented with SVMrank 1. Extract all entities from the current sentence
again. In contrast to the previous experiment and the previous sentence.
where we learned to rank sentences, the classi- 2. Rank the entities of the previous sentence ac-
fier now learns to rank constituents. The con- cording to their function (subject < direct
stituents have been extracted using the tool de- object < indirect object ...).
scribed in Bouma (2010). The final data set com- 3. Find the highest ranked entity in the previous
prises 48.513 candidate sets of freely orderable sentence that has a link to an entity in the
constituents. current sentence, this entity is the CB of the
sentence.
5.2 Centering-inspired Feature Model
To compare the discourse context model against a In the same way, we mark entities as forward
sentence-based model, we implemented a number centers that are ranked highest in the current sen-
of sentence-internal features that are very similar tence and have a link to an entity in the following
to the features used in the previous experiment. sentence.2 In Table 4, we report the percentage of
Since we extract them from the syntactic annota- sentences that have backward and forward centers
tion instead of f-structures, some labels and fea- in the Vorfeld or Mittelfeld. While the percentage
ture names will be different, however, the design of sentences that realise a backward center is quite
of the sentence-internal model is identical to the low, the overall proportion of sentences contain-
previous one in Section 4. ing some type of coreference link is in a dimen-
The sentence-external features differ in some sion such that the learner could definitely pick up
aspects from Section 4, since we extract coref- some predictive patterns. Going by the relative
erence relations of several types (see (Naumann, frequencies, coreferential constituents have a bias
2006) for the anaphoric relations annotated in the towards appearing in the Vorfeld rather than in the
Tueba-D/Z). For each type of coreference link, Mittelfeld.
we extract the following properties: (i) function
5.3 Results
of the antecedent, (ii) position of the antecedent,
(iii) distance between sentences, (iv) type of rela- First, we build three coreference-based con-
tion. We also distinguish coreference links anno- stituent classifiers on their entire training set and
tated for the whole phrase (head link) and links compare them to their sentence-internal baseline.
that are annotated for an element embedded by the The most simple baseline records the category of
constituent (contained link). The two types are 2
In Centering, all entities in a given utterance can be seen
illustrated in Examples (4) and (5). Note that both as forward centers, however we thought that this implemen-
cases would not have been captured in the lexical tation would be more useful.

773
(4) a. Die Rechnung geht an die AWO.
The bill goes to the AWO.
b. [Hintergrund der gegenseitigen Vorwurfe in der Arbeiterwohlfahrt] sind offenbar scharfe Konkurrenzen zwischen
Bremern und Bremerhavenern.
Apparently, [the background of the mutual accusations at the labour welfare] are rivalries between people from
Bremen and Bremerhaven.

(5) a. Dies ist die Behauptung, mit der Bremens Hafensenator die Skeptiker davon uberzeugt hat, [...].
This is the claim, which Bremens harbour senator used to convince doubters, [...].
b. Fur diese Behauptung hat Beckmeyer bisher keinen Nachweis geliefert. So far, Beckmeyer has not given a prove of
this claim.

Model VF Model VF
ConstituentLength + HeadPos 47.48% ConstituentLength + HeadPos 46.61%
ConstituentLength + HeadPos + Coref 51.30% ConstituentLength + HeadPos + Coref 52.23%
BaseSyn 54.82% BaseSyn 54.63%
BaseSyn + Coref 56.21% BaseSyn + Coref 56.67%
FullMorphSyn 57.24% FullMorphSyn 55.36%
FullMorphSyn + Coref 57.40% FullMorphSyn + Coref 57.93%

Table 5: Results from Vorfeld classification, training Table 6: Results from Vorfeld classification, training
and evaluation on entire treebank and evaluation on sentences that contain a coreference
link

the constituent head and the number of words that


the constituent spans. Additionally, in parallel to 5.4 Discussion
the experiment in Section 4, we build a BaseSyn The results presented in this Section consis-
model which has the syntactic function features, tently complete the picture that emerged from
and a FullMorphSyn model which comprises the experiments in Section 4. Even if we have
the entire set of sentence-internal features. To high quality information about discourse con-
each of these baseline, we add the coreference text in terms of relations between referents, a
features. The results are reported in Table 5. non-trivial sentence-internal model for word or-
In this experiment, we find an effect of der prediction can be hardly improved. This
the sentence-external features over the simple suggests that sentence-internal approximations of
sentence-internal baselines. However, in the fully discourse context provide a fairly good way of
spelled-out, sentence-internal model, the effect dealing with local coherence in a linearisation
is, again, minimal. Moreover, for each base- task. It is also interesting that the sentence-
line, we obtain higher improvements by adding external features improve over simple baselines,
further sentence-internal features than by adding but get leveled out in rich sentence-internal fea-
sentence-external ones the accuracy of the sim- ture models. From this, we conclude that the
ple baseline (47.48%) improves by 7.34 points sentence-external features we implemented are to
through adding function features (the accuracy some extent predictive for word order, but that
of BaseSyn is 54.82%) and by only 3.48 points they can be covered by sentence-internal features
through adding coreference features. as well.
We run a second experiment in order to so see Our second evaluation concentrating on the
whether the better performance of the sentence- sentences that have coreference information
internal features is related to their coverage. We shows that the better performance of the sentence-
build and evaluate the same set of classifiers on internal features is also related to their cover-
the subset of sentences that contain at least one age. These results confirm our initial intuition
coreference link for one of its constituents (see that coreference information can add to the pre-
Table 4 for the distribution of coreference links dictive power of the morpho-syntactic features in
in our data). The results are given in Table 6. In certain contexts. This positive effect disappears
this experiment, the coreference features improve when sentences with and without coreferential
over all sentence-internal baselines including the constituents are taken together. For future work,
FullMorphSyn model. it would be promising to investigate whether the

774
positive impact of coreference features can be to generation and summarization. In Proceedings of
strengthened if the coreference annotation scheme HLT-NAACL 2004, Boston,MA.
is more exhaustive, including, e.g., bridging and Anja Belz and Ehud Reiter. 2006. Comparing auto-
event anaphora. matic and human evaluation of NLG systems. In
Proceedings of EACL 2006, pages 313320, Trento,
6 Conclusion Italy.
Gerlof Bouma. 2010. Syntactic tree queries in prolog.
We have carried out a number of experiments that In Proceedings of the Fourth Linguistic Annotation
show that sentence-internal models for word order Workshop, ACL 2010.
Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolf-
are hardly improved by features which explicitely
gang Lezius, and George Smith. 2002. The TIGER
represent the preceding context of a sentence in Treebank. In Proceedings of the Workshop on Tree-
terms of lexical and referential relations between banks and Linguistic Theories.
discourse entities. This suggests that sentence- Aoife Cahill and Arndt Riester. 2009. Incorporat-
internal realisation implicitly carries a lot of im- ing information status into generation ranking. In
formation about discourse context. On average, Proceedings of the Joint Conference of the 47th An-
the morphosyntactic properties of constituents in nual Meeting of the ACL and the 4th International
a text are better approximates of their discourse Joint Conference on Natural Language Processing
of the AFNLP, pages 817825, Suntec, Singapore,
status than actual coreference relations.
August. Association for Computational Linguistics.
This result feeds into a number of research Aoife Cahill, Martin Forst, and Christian Rohrer.
questions concerning the representation of dis- 2007. Stochastic Realisation Ranking for a Free
course and its application in generation systems. Word Order Language. In Proceedings of the
Although we should certainly not expect a com- Eleventh European Workshop on Natural Language
putational model to achieve a perfect accuracy in Generation, pages 1724, Saarbrucken, Germany.
the constituent ordering task even humans only DFKI GmbH.
Aoife Cahill. 2009. Correlating human and automatic
agree to a certain extent in rating word order vari-
evaluation of a german surface realiser. In Proceed-
ants (Belz and Reiter, 2006; Cahill, 2009) the ings of the ACL-IJCNLP 2009 Conference Short Pa-
average accuracy in the 60s for prediction of Vor- pers, pages 97100, Suntec, Singapore, August. As-
feld occupance is still moderate. An obvious di- sociation for Computational Linguistics.
rection would be to further investigate more com- Jackie C.K. Cheung and Gerald Penn. 2010. Entity-
plex representations of discourse that take into ac- based local coherence modelling using topological
count the relations between utterances, such as fields. In Proceedings of the 48th Annual Meeting
of the Association for Computational Linguistics
topic shifts. Moreover, it is not clear whether the
(ACL 2010). Association for Computational Lin-
effects we find for linearisation in this paper carry guistics.
over to other levels of generation such as tacti- James Clarke and Mirella Lapata. 2010. Discourse
cal generation where syntactic functions are not constraints for document compression. Computa-
fully specified. In a broader perspective, our re- tional Linguistics, 36(3):411441.
sults underline the need for better formalisations Stefanie Dipper and Heike Zinsmeister. 2009. The
of discourse that can be translated into features for role of the German Vorfeld for local coherence. In
large-scale applications such as generation. Christian Chiarcos, Richard Eckart de Castilho, and
Manfred Stede, editors, Von der Form zur Bedeu-
Acknowledgments tung: Texte automatisch verarbeiten/From Form to
Meaning: Processing Texts Automatically, pages
This work was funded by the Collaborative Re- 6979. Narr, Tubingen.
search Centre (SFB 732) at the University of Katja Filippova and Michael Strube. 2007. The ger-
man vorfeld and local coherence. Journal of Logic,
Stuttgart.
Language and Information, 16:465485.
Katja Filippova and Michael Strube. 2009. Tree Lin-
References earization in English: Improving Language Model
Based Approaches. In Proceedings of Human Lan-
Regina Barzilay and Mirella Lapata. 2008. Modeling guage Technologies: The 2009 Annual Conference
local coherence: An entity-based approach. Com- of the North American Chapter of the Association
putational Linguistics, 34:134. for Computational Linguistics, Companion Volume:
Regina Barzilay and Lillian Lee. 2004. Catching the Short Papers, pages 225228, Boulder, Colorado,
drift: Probabilistic content models with applications June. Association for Computational Linguistics.

775
Barbara J. Grosz, Aravind Joshi, and Scott Weinstein.
1995. Centering: A framework for modeling the
local coherence of discourse. Computational Lin-
guistics, 21(2):203225.
Thorsten Joachims. 2006. Training linear SVMs in
linear time. In Proceedings of the ACM Conference
on Knowledge Discovery and Data Mining (KDD),
pages 217226.
Nikiforos Karamanis, Massimo Poesioand Chris Mel-
lish, and Jon Oberlander. 2009. Evaluating center-
ing for information ordering using corpora. Com-
putational Linguistics, 35(1).
Jane Morris and Graeme Hirst. 1991. Lexical cohe-
sion, the thesaurus, and the structure of text. Com-
putational Linguistics, 17(1):21225.
Karin Naumann. 2006. Manual for the annotation of
in-document referential relations. Technical report,
Seminar fur Sprachwissenschaft, Abt. Computerlin-
guistik, Universitat Tubingen.
Massimo Poesio and Ron Artstein. 2005. The relia-
bility of anaphoric annotation, reconsidered: Taking
ambiguity into account. In Proc. of ACL Workshop
on Frontiers in Corpus Annotation.
Massimo Poesio, Rosemary Stevenson, Barbara di Eu-
genio, and Janet Hitzeman. 2004. Centering: A
parametric theory and its instantiations. Computa-
tional Linguistics, 30(3):309363.
Eric K. Ringger, Michael Gamon, Robert C. Moore,
David Rojas, Martine Smets, and Simon Corston-
Oliver. 2004. Linguistically Informed Statisti-
cal Models of Constituent Structure for Ordering
in Sentence Realization. In Proceedings of the
2004 International Conference on Computational
Linguistics, Geneva, Switzerland.
Julia Ritz, Stefanie Dipper, and Michael Gotze. 2008.
Annotation of information structure: An evaluation
across different types of texts. In Proceedings of the
the 6th LREC conference.
Christian Rohrer and Martin Forst. 2006. Improv-
ing Coverage and Parsing Quality of a Large-Scale
LFG for German. In Proceedings of the Fifth In-
ternational Conference on Language Resources and
Evaluation (LREC), Genoa, Italy.
Augustin Speyer. 2005. Competing constraints on
vorfeldbesetzung in german. In Proceedings of the
Constraints in Discourse Workshop, pages 7987.
Heike Telljohann, Erhard Hinrichs, Sandra Kubler,
and Heike Zinsmeister. 2006. Stylebook for the
tubingen treebank of written german (tuba-d/z).
revised version. Technical report, Seminar fur
Sprachwissenschaft, Universitat Tubingen.
Erik Velldal and Stephan Oepen. 2005. Maximum
entropy models for realization ranking. In Proceed-
ings of the 10th Machine Translation Summit, pages
109116, Thailand.

776
Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages

Oliver Ferschke , Iryna Gurevych and Yevgen Chebotar

Ubiquitous Knowledge Processing Lab (UKP-DIPF)


German Institute for Educational Research and Educational Information
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science
Technische Universitat Darmstadt
http://www.ukp.tu-darmstadt.de

Abstract chronous co-authoring tool. A unique character-


istic of Wikis is the documentation of the edit
In this paper, we propose an annota- history which keeps track of every change that
tion schema for the discourse analysis of
is made to a Wiki page. With this information,
Wikipedia Talk pages aimed at the coor-
dination efforts for article improvement.
it is possible to reconstruct the writing process
We apply the annotation schema to a cor- from the beginning to the end. Additionally, many
pus of 100 Talk pages from the Simple Wikis offer their users a communication platform,
English Wikipedia and make the resulting the Talk pages, where they can discuss the ongo-
dataset freely available for download1 . Fur- ing writing process with other users.
thermore, we perform automatic dialog act
classification on Wikipedia discussions and
The most prominent example for a successful,
achieve an average F1 -score of 0.82 with large-scale Wiki is Wikipedia, a collaboratively
our classification pipeline. created online encyclopedia, which has grown
considerably since its launch in 2001, and con-
tains a total of almost 20 million articles in 282
1 Introduction languages and dialects, as of Sept. 2011. As there
Over the past decade, the paradigm of information is no editorial body that manages Wikipedia top-
sharing in the web has shifted towards participa- down, it is an open question how the huge on-
tory and collaborative content production. Texts line community around Wikipedia regulates and
are no longer exclusively prepared by individuals enforces standards of behavior and article qual-
and then shared with the community. They are in- ity. The user discussions on the article Talk pages
creasingly created collaboratively by multiple au- might shed light on this issue and give an insight
thors and iteratively revised by the community. into the otherwise hidden processes of collabora-
When researchers first conducted surveys on tion that, until now, could only be analyzed via
professional writers in the 1980s, they found that interviews or group observations in experimental
the collaborative writing process differs consider- settings.
ably from the way individual writing is done (Pos- The main goal of the present paper is to analyze
ner and Baecker, 1992). In joint writing, the writ- the content of the discussion pages of the Simple
ers have to externalize processes that are other- English Wikipedia with respect to the dialog acts
wise not made explicit, like the planning and the aimed at the coordination efforts for article im-
organization of the text. The authors have to com- provement. Dialog acts, according to the classic
municate how the text should be written and what speech act theory (Austin, 1962; Searle, 1969),
exactly it should contain. represent the meaning of an utterance at the level
Today, many tools are available that support of illocutionary force, i.e. a dialog act label con-
collaborative writing. A tool that has particu- cisely characterizes the intention and the role of a
larly taken hold is the Wiki, a web-based, asyn- contribution in a dialog. We chose the Simple En-
1
http://www.ukp.tu-darmstadt.de/data/ glish Wikipedia for our initial analysis, because
wikidiscourse we are able to obtain more representative results

777
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 777786,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
by covering almost 15% of all relevant Talk pages, peculiarities of the Switchboard corpus. The re-
as opposed to the much smaller fraction we could sulting SWDB-DAMSL schema contained more
achieve for the English Wikipedia. The long-term than 220 distinct labels which have been clustered
goal of this work is to identify relations between to 42 coarse grained labels. Both schemata have
contributions on the Talk pages and particular arti- often been adapted for special purpose annotation
cle edits. We plan to analyze the relation between tasks.
article discussions and article content and identify With the rise of the social web, the amount of
the edits in the article revision history that react to research analyzing user generated discourse sub-
the problems discussed on the Talk page. In com- stantially increased. In addition to analyzing web
bination with article quality assessment (Yaari et forums (Kim et al., 2010a), chats (Carpenter and
al., 2011), this opens up the possibility to iden- Fujioka, 2011) and emails (Cohen et al., 2004),
tify successful patterns of collaboration which in- Wikipedia Talk pages have recently moved into
crease the article quality. Furthermore, our work the center of attention of the research community.
will enable practical applications. By augment- Viegas et al. (2007) manually annotate 25
ing Wikipedia articles with the information de- Wikipedia article discussion pages with a set of
rived from automatically labeled discussions, arti- 11 labels in order to analyze how Talk pages are
cle readers can be made aware of particular prob- used for planning the work on articles and resolv-
lems that are being discussed on the Talk page ing disputes among the editors. Schneider et al.
behind the article. (2011) extend this schema and manually annotate
Our primary contributions in this paper are: (1) 100 Talk pages with 15 labels. They confirm the
an annotation schema for dialog acts reflecting findings of Viegas et al. that coordination requests
the efforts for coordinating the article improve- occur most frequently in the discussions.
ment; (2) the Simple English Wikipedia Dis- Bender et al. (2011) describe a corpus of 47
cussion (SEWD) corpus, consisting of 100 seg- Talk pages which have been annotated for author-
mented and annotated Talk pages which we make ity claims and alignment moves. With this cor-
freely available for download; and (3) a dialog pus, the authors analyze how the participants in
act classification pipeline that incorporates sev- Wikipedia discussions establish their credibility
eral state of the art machine learning algorithms and how they express agreement and disagree-
and feature selection techniques and achieves an ment towards other participants or topics.
average F1 -score of .82 on our corpus. From a different perspective, Stvilia et al.
(2008) analyze 60 discussion pages in regard to
2 Related Work how information quality (IQ) in Wikipedia arti-
The analysis of speech and dialog acts has its cles is assessed on the Talk pages and which types
roots in the linguistic field of pragmatics. In of IQ problems are identified by the community.
1962, John Austin shifted the focus from the mere They describe a Wikipedia IQ assessment model
declarative use of language as a means for making and map it to established frameworks. Further-
factual statements towards its non-declarative use more, they provide a list of IQ problems along
as a tool for performing actions. The speech act with related causal factors and necessary actions
theory was further systematized by Searle (1969), which has also inspired the design of our annota-
whose classification of illocutionary acts (Searle, tion schema.
1976) is still used as a starting point for creating Finally, Laniado et al. (2011) examine
dialog act classification schemata for natural lan- Wikipedia discussion networks in order to
guage processing. capture structural patterns of interaction. They
A well known, domain- and task-independent extract the thread structure from all Talk pages in
annotation schema is DAMSL (Core and Allen, the English Wikipedia and create tree structures
1997). It was created as the standard annotation of the discussion. The analysis of the graphs
schema for dialog tagging on the utterance level reveals patterns that are unique to Wikipedia
by the Discourse Resource Initiative. It uses a discussions and might be used as a means to
four-dimensional tagset that allows arbitrary label characterize different types of Talk pages.
combinations for each utterance. Jurafsky et al. To the best of our knowledge, there is no
(1997) augmented the DAMSL schema to fit the work yet that uses machine learning to automati-

778
are usually headed by a topic title. Finally, the
thread structure designates the sequence of turns
and their indentation levels on the Talk page. A
structural overview of a Talk page and its con-
stituents can be seen in Figure 1.
We composed an annotation schema that re-
flects the coordination efforts for article improve-
ment. Therefore, we manually analyzed a set
of thirty Talk pages from the Simple English
Wikipedia to identify the types of article defi-
ciencies that are discussed and the way article
Figure 1: Structure of a Talk page: a) Talk page title, improvement is coordinated. We furthermore
b) untitled discussion topic, c) titled discussion topic, incorporated the findings from an information-
d) unsigned turns, e) signed turns, f) topic title scientific analysis of information quality in
Wikipedia (Stvilia et al., 2008), which identifies
cally classify user contributions in Wikipedia Talk twelve types of quality problems, like e.g. Accu-
pages. Furthermore, there is no corpus available racy, Completeness or Relevance. Our resulting
that reflects the efforts of article improvement in tagset consists of 17 labels (cf. Table 1) which can
Wikipedia discussions. This is the subject of our be subdivided into four higher level categories:
work.
Article Criticism Denote comments that iden-
tify deficiencies in the article. The criticism
3 Annotation Schema
can refer to the article as a whole or to indi-
The main purpose of Wikipedia Talk pages is the vidual parts of the article.
coordination of the editing process with the goal Explicit Performative Announce, report or sug-
of improving and sustaining the quality of the re- gest editing activities.
spective article. The criteria for article quality in
Wikipedia are loosely defined in the guidelines for Information Content Describe the direction of
good articles2 and very good articles3 . Ac- the communication. A contribution can be
cording to these guidelines, distinguished articles used to communicate new information to
must be well-written in simple English, compre- others (IP), to request information (IS), or
hensive, neutral, stable, accurate, verifiable and to suggest changes to established facts (IC).
follow the Wikipedia style guidelines4 . These cri- The IP label applies to most of the contri-
teria are the main points of reference in the dis- butions as most comments provide a certain
cussions on the Talk pages. amount of new information.
Discourse analysis, as it is performed in this pa- Interpersonal Describe the attitude that is ex-
per, can be carried out on various levels, depend- pressed towards other participants in the dis-
ing on what is regarded as the smallest unit of the cussion and/or their comments.
discourse. In this work, we focus on turns, not
on individual utterances, as we are interested in a Since a single turn may consist of several utter-
coarse-grained analysis of the discourse-structure ances, it can consequently comprise multiple di-
as a first step towards a finer-grained discourse alog acts. Therefore, we designed the annotation
analysis. We define a turn (or contribution) as the study as a multi-label classification task, i.e. the
body of text that is added by an individual contrib- annotators can assign one or more labels to each
utor in one or more revisions to a single discus- annotation unit. Each label is chosen indepen-
sion topic until another contributor edits the page. dently. Table 1 shows the labels, their respective
Furthermore, a topic (or discussion) is the body definitions and an example from our corpus.
of turns that revolve around a single matter. They
4 Corpus Creation and Analysis
2
http://simple.wikipedia.org/wiki/WP:RGA
3
http://simple.wikipedia.org/wiki/WP:RVGA The SEWD corpus consists of 100 annotated Talk
4
http://simple.wikipedia.org/wiki/WP:STYLE pages extracted from a snapshot of the Simple En-

779
Label Description Example
Article Criticism
It should be added (1) that voters may skip prefer-
CM Content incomplete or lacking detail ences, but (2) that skipping preferences has no impact
on the result of the elections.
Kris Kringle is NOT a Germanic god, but an English
CW Lack of accuracy or correctness mispronunciation of Christkind, a German word that
means the baby Jesus.
The references should be removed. The reason: The
CU Unsuitable or unnecessary content references are too complicated for the typical reader
of simple Wikipedia.
CS Structural problems Also use sectioning, and interlinking
This section needs to be simplified further; there are a
CL Deficiencies in language or style
lot of words that are too complex for this wiki.
This article seems to take a clear pro-Christian, anti-
COBJ Objectivity issues
commercial view.
I have started an article on Google. It needs improve-
CO Other kind of criticism
ment though.
Explicit Performative
PSR Explicit suggestion, recommendation or request This section needs to be simplified further
Got it. The URL is http://www.dmbeatles.com/
PREF Explicit reference or pointer
history.php?year=1968
PFC Commitment to an action in the future Okay, I forgot to add that, Ill do so later tonight.
I took and hopefully simplified the [[en:Prehistoric
PPC Report of a performed action
musicPrehistoric music]] article from EnWP
Information Content
IP Information providing Depression is the most basic term there is.
So what kind of theory would you use for your music
IS Information seeking
composing?
In linguistics and generally speaking, when Talking
about the lexicon in a language, words are usually cat-
IC Information correcting
egorized as nouns, verbs, adjectives and so on.
The term doing word does not exist.
Interpersonal
Positive attitude towards other contributor or
ATT+ Thank you.
acceptance
Okay, I can understand that, but some citations are
ATTP Partial acceptance or partial rejection
going to have to be included for [[WP:V]].
Negative attitude towards other contributor or Now what? You think you know so much about every-
ATT-
rejection thing, and you are not even helping?!

Table 1: Annotation schema for the dialog act classification in Wikipedia discussion pages with examples from
the SEWD Corpus. Some examples have been shortened to fit the table.

glish Wikipedia from Apr 4th 2011.5 Technically pages with 11-20 turns, and (iii) pages with more
speaking, a Talk page is a normal Wiki page lo- than 20 turns. We then randomly extracted 50 dis-
cated in one of the Talk namespaces. In this work, cussion pages from class (i), 40 pages from class
we focus on article Talk pages and do not re- (ii) and 10 pages from class (iii). This decision is
gard User Talk pages. We selected the discussion grounded in the restricted resources for the human
pages according to the number of turns they con- annotation task.
tain. First, we discarded all discussion pages with
less than four contributions. We then analyzed Data Preprocessing Due to a lack of discussion
the distribution of turn counts per discussion page structure, extracting the discussion threads from
in the remaining set of pages and defined three the Talk pages requires a substantial amount of
classes: (i) discussion pages with 4-10 turns, (ii) preprocessing. Laniado et al. (2011) tackle the
5
The snapshot contains 69900 articles and 5783 Talk thread extraction by using text indentation and in-
pages of which 683 contained more than 3 contributions. serted user signatures as clues. We found these

780
attributes to be insufficient for a reliable recon- Annotation Process For our annotation study,
struction of the thread structure.6 we used the freely available MMAX2 annotation
Our preprocessing approach consists of three tool8 . Two annotators were introduced to the an-
steps: data retrieval, topic segmentation and turn notation schema by an instructor and trained on
segmentation. For retrieving the discussion pages, an extra set of ten discussion pages. During the
we use the Java Wikipedia Library (JWPL) (Zesch annotation of the corpus, the annotators were al-
et al., 2008), which offers efficient, database- lowed to discuss difficult cases and could consult
driven access to the contents of Wikipedia. We the instructor if in doubt. They had access to the
segment the individual Talk pages into discus- segmented discussion pages within the MMAX2
sions topics using the MediaWiki parser that tool as well as to the original Wikipedia articles
comes with JWPL. In our corpus, the parser man- and discussion pages on the web.
aged to identify all topic boundaries without any The reconciliation of the annotations was car-
errors. The most complex preprocessing step is ried out by an expert annotator. In order to obtain
the turn segmentation. a consolidated gold standard, the expert decided
First, we use the revision history of the Talk all cases in which the annotations of the two an-
page to identify the author and the creation time notators did not match. Descriptive statistics for
of each paragraph. We use the Wikipedia Revi- the label assignments of each annotator and for
sion Toolkit (Ferschke et al., 2011) to examine the the gold standard can be seen in Table 2 and will
changes between adjacent revisions of the Talk be further discussed in Section 4.2.
page in order to identify the exact time a piece of
text was added as well as the author of the con- Corpus Format We publish our SEWD cor-
tribution. We have to filter out malicious edits pus in two formats9 , the original MMAX format,
from the history, as they would negatively affect and as XMI files for further processing with the
the segmentation process. We therefore disregard Apache Unstructured Information Management
all edits that are reverted in later later revisions. Architecture10 . For the latter format, we also pro-
In contrast to vandalism on article pages, this ap- vide the type system which defines all necessary
proach has proven to be sufficient to detect van- corpus specific types needed for using the data in
dalism in the Talk page history. an NLP pipeline.
Within each discussion topic, we aggregate all 4.1 Inter-Annotator Agreement
adjacent paragraphs with the same author and the
same time stamp to one turn. In order to account To evaluate the reliability of our dataset, we per-
for turns that were written in multiple revisions, form a detailed inter-rater agreement study. For
we regard all time stamps within a window of 10 measuring the agreement of the individual labels,
minutes7 as belonging to the same turn, unless the we report the observed agreement, Kappa statis-
page was edited by another user in the meantime. tics (Carletta, 1996), and F1 -scores. The latter are
Finally, the turn is marked with the indentation computed by treating one annotator as the gold
level of its least indented paragraph. This infor- standard and the other one as predictions (Hripc-
mation is used to identify the relationship between sak and Rothschild, 2005). The scores can be seen
the turns, since indentation is used to indicate a in Table 2.
reply to an existing comment in the discussion. The average observed agreement across all la-
A co-author of this paper evaluated the ac- bels is PO = .94. The individual Kappa scores
ceptability of the boundaries of each turn in the largely fall into the range that Landis and Koch
SEWD corpus and found that 94% of the 1450 (1977) regard as substantial agreement, while
turns were correctly segmented. Turns with seg- three labels are above the more strict .8 thresh-
mentation errors were not included in the gold old for reliable annotations (Artstein and Poesio,
standard. 2008). Furthermore, we obtain an overall pooled
Kappa (De Vries et al., 2008) of pool = .67,
6
Viegas et al. (2007) reported that only 67% of the con-
8
tributions on Wikipedia Talk pages are signed, which makes http://www.mmax2.net
9
signatures an unreliable predictor for turn boundaries. http://www.ukp.tu-darmstadt.de/data/
7
We experimentally tested values between 1 and 60 min- wikidiscourse
10
utes. http://uima.apache.org

781
Annotator 1 Annotator 2 Inter-Annotator Agreement Gold Standard
Label N Percent N Percent NA1 A2 PO F1 N Percent
Article Criticism
CM 183 13.4% 105 7.7% 193 .93 .63 .66 116 8.5%
CW 106 7.8% 57 4.2% 120 .95 .52 .55 70 5.1%
CU 69 5.0% 35 2.6% 83 .95 .38 .40 42 3.1%
CS 164 12.0% 101 7.4% 174 .94 .66 .69 136 9.9%
CL 195 14.3% 199 14.6% 244 .93 .73 .77 219 16.0%
COBJ 27 2.0% 23 1.7% 29 .99 .84 .84 27 2.0%
CO 20 1.5% 59 4.3% 71 .95 .18 .20 48 3.5%
Explicit Performative
PSR 458 33.5% 351 25.7% 503 .86 .66 .76 406 29.7%
PREF 43 3.1% 31 2.3% 51 .98 .61 .62 45 3.3%
PFC 73 5.3% 65 4.8% 86 .98 .76 .77 77 5.6%
PPC 357 26.1% 340 24.9% 371 .97 .92 .94 358 26.2%
Information Content
IP 1084 79.3% 1027 75.1% 1135 .89 .69 .93 1070 78.3%
IS 228 16.7% 208 15.2% 256 .95 .80 .83 220 16.1%
IC 187 13.7% 109 8.0% 221 .89 .46 .51 130 9.5%
Interpersonal
ATT+ 71 5.2% 140 10.2% 151 .94 .55 .58 144 10.5%
ATTP 71 5.2% 30 2.2% 79 .96 .42 .44 33 2.4%
ATT- 67 4.9% 74 5.4% 100 .96 .56 .58 87 6.4%

Table 2: Label frequencies and inter-annotator agreement. NA1 A2 denotes the number of turns that have been
labeled with the given label by at least one annotator. PO denotes the observed agreement.

which is defined as chose this label when they were unsure whether a
particular criticism label would fit a certain turn
PO PE
pool = (1) or not.
1 PE
Labels in the interpersonal category all show
with agreement scores below 0.6. It turned out that the
L L annotators had a different understanding of these
1X 1X
PO = POl , PE = PEl (2) labels. While one annotator assigned the labels
L L for any kind of positive or negative sentiment, the
l=1 l=1
other used the labels to express agreement and
where L denotes the number of labels, PEl the
disagreement between the participants of a dis-
expected agreement and POl the observed agree-
cussion.
ment of the lth label. pool is regarded to be more
A common problem for all labels were contri-
accurate than an averaged Kappa.
butions with a high degree of indirectness and im-
For assessing the overall inter-rater reliabil-
plicitness. Indirect contributions have to be in-
ity of the label set assignments per turn, we
terpreted in the light of conversational implica-
chose Krippendorffs Alpha (Krippendorff, 1980)
ture theory (Grice, 1975), which requires contex-
using MASI, a measure of agreement on set-
tual knowledge for decoding the intentions of a
valued items, as the distance function (Passon-
speaker. For example, the message
neau, 2006). MASI accounts for partial agree-
ment if the label sets of both annotators overlap Is population density allowed to be n/a?
in at least one label. We achieved an Alpha score
of = .75. According to Krippendorff, datasets has the surface form of a question. However, the
with this score are considered reliable and allow context of the discussion revealed that the author
tentative conclusions to be drawn. tried to draw attention to the missing figure in the
The CO label showed the lowest agreement of article and requested it to be filled or removed.
only = .18. The label was supposed to cover The annotators rarely made use of the context,
any criticism that is not covered by a dedicated which was a major source for disagreement in the
label. However, the annotators reported that they study.

782
Another difficulty for the annotators was long discussion turns. While the average turn consists of 42 tokens, the largest contribution in the corpus is 658 tokens long. Turns of this size can cover multiple aspects and potentially comprise many different dialog acts, which increases the probability of disagreement. This issue can be addressed by going from the turn level to the utterance level in future work.

A comparison of our results with the agreement reported for other datasets shows that the reliability of our annotations lies well within the field of the related work. Bender et al. (2011) carried out an annotation study of social acts in 365 discussions from 47 Wikipedia Talk pages. They report Kappa scores for thirteen labels in two categories ranging from .13 to .66 per label. The overall agreement for each category was .50 and .59, respectively, which is considerably lower than our pooled κ of .67. Kim et al. (2010b) annotate pairs of posts taken from an online forum. They use a dialog act tagset with twelve labels customized for modeling troubleshooting-oriented forum discussions. For their corpus of 1334 posts, they report an overall Kappa of .59. Kim et al. (2010a) identify unresolved discussions in student online forums by annotating 1135 posts with five different speech acts. They report Kappa scores per speech act between .72 and .94. Their better results might be due to a coarser-grained label set.

4.2 Corpus Analysis

The SEWD corpus contains 313 discussions consisting of 1367 turns by 337 users. The average length of a turn is 42 words. 208 of the 337 contributors are registered Wikipedia users, 129 wrote anonymously. On average, each contributor wrote 168 words in 4 turns. However, there was a cluster of 16 people with 20 contributions.

Table 2 shows the frequencies of all labels in the SEWD corpus. The most frequent labels are information providing (IP), requests (PSR) and reports of performed edits (PPC). The IP label was assigned to more than 78% of all 1367 turns, because almost every contribution provides a certain amount of information. The label was only omitted if a turn merely consisted of a discussion template but did not contain any text or if it exclusively contained questions.

More than a quarter of the turns are labeled with PSR and PPC, respectively. This indicates that edit requests and reports of performed edits are the main subject of discussion. Generally, it is more common that edits are reported after they have been made than to announce them before they are carried out, as can be seen in the ratio of PPC to PFC labels. The number of turns labeled with PSR is almost the same as the number of contributions labeled with either PPC or PFC. This allows the tentative conclusion that nearly all requests potentially lead to an edit action. As a matter of fact, the most common label adjacency pair in the corpus is PSR→PPC, which substantiates this assumption.
[Footnote 11: A label transition A → B is recorded if two adjacent turns are labeled with A and B, respectively.]

Article criticism labels have been assigned to 39.4% of all turns. Almost half (241) of the labels from this class are assigned to the first turn of a discussion. This shows that it is common to open a discussion in reference to a particular deficiency of the article. The large number of CL labels compared to other labels from the same category is due to the fact that the Simple English Wikipedia requires authors to write articles in a way that they are understandable for non-native speakers of English. Therefore, the use of adequate language is one of the major concerns of the Simple English Wikipedia community.

5 Automatic Dialog Act Classification

For the automatic classification of dialog acts in Wikipedia Talk pages, we transform the multi-label classification problem into a binary classification task (Tsoumakas et al., 2010). We train a binary classifier for each label using the WEKA data-mining software (Hall et al., 2009). We use three learners for the classification task: a Naive Bayes classifier, J48, an implementation of the C4.5 decision tree algorithm (Quinlan, 1992), and SMO, an optimization algorithm for training support vector machines (Platt, 1998). Finally, we combine the best performing learners for each label in a UIMA-based classification pipeline (Ferrucci and Lally, 2004).
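The binary-relevance transformation can be illustrated with a short sketch. The actual pipeline in the paper uses WEKA learners inside a UIMA framework; the snippet below uses scikit-learn stand-ins (MultinomialNB, a CART decision tree in place of J48, a linear SVM in place of SMO), and the label list and data layout are placeholder assumptions, not details taken from the paper.

```python
# Illustrative sketch only: one binary classifier per dialog act label,
# keeping whichever learner performs best in cross-validation.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier   # rough analogue of J48 (C4.5)
from sklearn.svm import LinearSVC                 # rough analogue of an SMO-trained SVM
from sklearn.model_selection import cross_val_score

LABELS = ["IP", "PSR", "PPC"]   # placeholder subset of the label set

LEARNERS = {
    "naive_bayes": MultinomialNB(),        # expects non-negative (e.g. count) features
    "decision_tree": DecisionTreeClassifier(),
    "linear_svm": LinearSVC(),
}

def best_learner_per_label(X, Y):
    """X: feature matrix (turns x features); Y: binary matrix (turns x labels).
    Returns, for every label, the name of the best cross-validated learner."""
    best = {}
    for j, label in enumerate(LABELS):
        y = Y[:, j]   # 1 if the turn carries this label, 0 otherwise
        scores = {name: cross_val_score(clf, X, y, cv=10, scoring="f1").mean()
                  for name, clf in LEARNERS.items()}
        best[label] = max(scores, key=scores.get)
    return best
```

The best learner found for each label would then be the one plugged into the final per-label pipeline.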

Features for Dialog Act Classification  As features, we use all uni-, bi- and trigrams that occurred in at least three different turns. Furthermore, we include the time distance to the previous and the next turn (in seconds), the length of the current, previous and next turn (in tokens), the position of the turn within the discussion, the indentation level of the turn and two binary features indicating whether a turn references or is referenced by another turn. [Footnote 12: A turn Y references a preceding turn X if the indentation level of Y is one level deeper than that of X.] In order to capture the sequential nature of the discussions, we use the n-grams of the previous and the next turn as additional features.
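A minimal sketch of how such a feature matrix could be assembled is given below. The per-turn field names (text, timestamp, length, position, indentation, references, referenced) are hypothetical, scikit-learn and scipy stand in for the tools actually used, and the n-grams of the neighbouring turns are omitted for brevity.

```python
# Sketch of the feature representation: n-grams seen in at least three turns
# (min_df=3) plus a handful of structural / temporal features per turn.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

def build_features(turns):
    """turns: list of dicts with hypothetical keys 'text', 'timestamp' (s),
    'length' (tokens), 'position', 'indentation', 'references', 'referenced'."""
    vec = CountVectorizer(ngram_range=(1, 3), min_df=3)
    ngrams = vec.fit_transform(t["text"] for t in turns)

    meta = []
    for i, t in enumerate(turns):
        prev = turns[i - 1] if i > 0 else t
        nxt = turns[i + 1] if i + 1 < len(turns) else t
        meta.append([
            t["timestamp"] - prev["timestamp"],          # time distance to previous turn
            nxt["timestamp"] - t["timestamp"],           # time distance to next turn
            t["length"], prev["length"], nxt["length"],  # turn lengths in tokens
            t["position"], t["indentation"],
            int(t["references"]), int(t["referenced"]),
        ])
    return hstack([ngrams, csr_matrix(np.array(meta, dtype=float))])
```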
Balancing Positive and Negative Instances  Since the number of positive instances for each label is small compared to the number of negative instances, we create a balanced dataset which contains an equal amount of positive and negative instances. Therefore, we randomly select the appropriate number of negative instances and discard the rest. This improves the classification performance on every label for all three learners.
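The balancing step amounts to random undersampling of the negative class; a sketch, assuming a binary label vector y and any row-indexable feature matrix X:

```python
# Sketch of the balancing step: keep all positives and an equal number of
# randomly chosen negatives, discard the remaining negatives.
import numpy as np

def balance(X, y, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=min(len(pos), len(neg)), replace=False)
    idx = np.concatenate([pos, keep_neg])
    rng.shuffle(idx)
    return X[idx], y[idx]
```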
Feature Selection  Using the full set of features, we achieve the following macro/micro averaged F1-scores: 0.29/0.57 for Naive Bayes, 0.42/0.66 for J48 and 0.43/0.72 for SMO. To further improve the classification performance, we reduce the feature space using two feature selection techniques, the χ² metric (Yang and Pedersen, 1997) and the Information Gain approach (Mitchell, 1997). For each label, we train separate classifiers using the top 100, 200 and 300 features obtained by each feature selection technique and choose the best performing set for our final classification pipeline.
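A sketch of this per-label selection loop is shown below, using scikit-learn's chi-squared scorer and mutual information as an approximation of Information Gain; the actual rankings in the paper are computed in WEKA, so this is only an illustration of the procedure.

```python
# Sketch of per-label feature selection: rank features by chi-squared and by
# (approximate) information gain, keep the top-k set that cross-validates best.
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import cross_val_score

def select_features(X, y, learner):
    """X is assumed to be a non-negative count matrix with at least 300 features."""
    best_score, best_selector = -1.0, None
    for score_fn in (chi2, mutual_info_classif):
        for k in (100, 200, 300):
            selector = SelectKBest(score_fn, k=k)
            X_sel = selector.fit_transform(X, y)
            score = cross_val_score(learner, X_sel, y, cv=10, scoring="f1").mean()
            if score > best_score:
                best_score, best_selector = score, selector
    return best_selector, best_score
```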
Indentation and temporal distance to the preceding turn proved to be the best ranked non-lexical features overall. Additionally, the turn position within the topic was a crucial feature for most labels in the criticism class and for PSR and IS labels. This is not surprising, because article criticism, suggestions and questions tend to occur in the beginning of a discussion. The two reference features have not proven to be useful. The relational information was better covered by the indentation feature. The subjective quality of the lexical features seems to be correlated with the inter-annotator agreement of the respective labels. Features for labels with low agreement contain many n-grams without any recognizable semantic connection to the label. For labels with good agreement, the feature lists almost exclusively contain meaningful lexical cues.

Label   Human   Base   Naive Bayes   J48    SMO    Best
CM      .66     .07    .68           .48    .66    .68
CW      .55     .01    .70           .20    .56    .70
CU      .40     .07    .66           .35    .59    .66
CS      .69     .09    .67           .67    .75    .75
CL      .77     .11    .70           .66    .73    .73
COBJ    .84     .04    .78           .51    .63    .78
CO      .20     .02    .61           .06    .39    .61
PSR     .76     .30    .72           .70    .76    .76
PREF    .62     .00    .76           .41    .64    .76
PFC     .77     .04    .70           .62    .73    .73
PPC     .94     .25    .74           .82    .85    .85
IP      .93     .74    .83           .93    .93    .93
IS      .83     .16    .79           .86    .85    .86
IC      .51     .06    .67           .32    .59    .67
ATT+    .58     .10    .61           .65    .72    .72
ATTP    .44     .03    .72           .25    .62    .72
ATT-    .58     .07    .52           .30    .52    .52
Macro   .65     .13    .70           .52    .68    .73
Micro   .79     .35    .74           .75    .80    .82

Table 3: F1-scores for the balanced set with feature selection on 10-fold cross-validation. Base refers to the baseline performance, Best to our classification pipeline.

Classification Results  Table 3 shows the performance of all classifiers and our final classification pipeline evaluated on 10-fold cross-validation. Naive Bayes performed surprisingly well and showed the best macro averaged scores among the three learners, while SMO showed the best micro averaged performance. We compare our results to a random baseline and to the performance of the human annotators (cf. Table 3 and Figure 2). The baseline assigns the dialog act labels at random according to their frequency distribution in the gold standard. Our classifier outperformed the baseline significantly on all labels.

The comparison with the human performance shows that our system is able to reach the human performance. In most cases, the annotation agreement is reliable, and so are the results of the automatic classification. For the labels CU and CO, the inter-annotator agreement is not high. The comparably good performance of the classifiers on these labels shows that the instances do have shared characteristics. Human raters, however, have difficulties recognizing these labels consistently. Thus, their definitions need to be refined in future work.
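For reference, the frequency-matched random baseline described above can be sketched in a few lines; gold is assumed to be a turns-by-labels binary matrix of the gold-standard assignments.

```python
# Sketch of the random baseline: each label is assigned independently at random,
# matching its relative frequency in the gold standard.
import numpy as np

def random_baseline(gold, n_turns, seed=0):
    rng = np.random.default_rng(seed)
    label_freq = gold.mean(axis=0)   # relative frequency of each label
    return (rng.random((n_turns, gold.shape[1])) < label_freq).astype(int)
```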

[Figure 2: F1-scores per label for our classification pipeline (Best), the human performance and the baseline performance.]
To our knowledge, none of the related work on discourse analysis of Wikipedia Talk pages performed automatic dialog act classification. However, there has been previous work on classifying speech acts in other discourse types. Kim et al. (2010a) use Support Vector Machines (SVM) and Transformation Based Learning (TBL) for the automatic assignment of five speech acts to posts taken from student online forums. They report individual F1-scores per label which result in a macro average of 0.59 for SVM and 0.66 for TBL. Cohen et al. (2004) classify speech acts in emails. They train five binary classifiers using several learners on 1375 emails and report F1-scores per speech act between .44 and .85. Despite the larger tagset, our classification approach achieves an average F1-score of .82 and therefore lies in the top ranks of the related work.

6 Conclusions

In this paper, we proposed an annotation schema for the discourse analysis of Wikipedia discussions aimed at the coordination efforts for article improvement. We applied the annotation schema to a corpus of 100 Wikipedia Talk pages, which we make freely available for download. A thorough analysis of the inter-annotator agreement showed that the dataset is reliable. Finally, we performed automatic dialog act classification on Wikipedia Talk pages. To this end, we combined three machine learning algorithms and two feature selection techniques into a classification pipeline, which we trained on our SEWD corpus. We achieve an average F1-score of .82, which is comparable to the human performance of .79. The ability to automatically classify discussion pages will help to investigate the relations between article discussions and article edits, which is an important step towards understanding the processes of collaboration in large-scale Wikis. Furthermore, it will be the basis for practical applications that bring the hidden content of Talk pages to the attention of article readers.

Acknowledgments

This work has been supported by the Volkswagen Foundation as part of the Lichtenberg-Professorship Program under grant No. I/82806, and by the Hessian research excellence program Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz (LOEWE) as part of the research center "Digital Humanities".

References

Ron Artstein and Massimo Poesio. 2008. Inter-Coder Agreement for Computational Linguistics. Computational Linguistics, 34(4):555-596, December.
John L. Austin. 1962. How to Do Things with Words. Clarendon Press, Cambridge, UK.
Emily M. Bender, Jonathan T. Morgan, Meghan Oxley, Mark Zachry, Brian Hutchinson, Alex Marin, Bin Zhang, and Mari Ostendorf. 2011. Annotating Social Acts: Authority Claims and Alignment Moves in Wikipedia Talk Pages. In Proceedings of the Workshop on Language in Social Media, pages 48-57, Portland, Oregon, USA.
Jean Carletta. 1996. Assessing Agreement on Classification Tasks: The Kappa Statistic. Computational Linguistics, 22(2):249-254.
Tamitha Carpenter and Emi Fujioka. 2011. The Role and Identification of Dialog Acts in Online Chat. In Proceedings of the Workshop on Analyzing Microtext at the 25th AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.
William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell. 2004. Learning to Classify Email into Speech Acts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 309-316, Barcelona, ES.

Mark G. Core and James F. Allen. 1997. Coding dialogs with the DAMSL annotation scheme. In Proceedings of the Working Notes of the AAAI Fall Symposium on Communicative Action in Humans and Machines, pages 28-35, Cambridge, MA, USA.
Han De Vries, Marc N. Elliott, David E. Kanouse, and Stephanie S. Teleki. 2008. Using Pooled Kappa to Summarize Interrater Agreement across Many Items. Field Methods, 20(3):272-282.
David Ferrucci and Adam Lally. 2004. UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering, 10:327-348.
Oliver Ferschke, Torsten Zesch, and Iryna Gurevych. 2011. Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. System Demonstrations, pages 97-102, Portland, OR, USA.
Paul Grice. 1975. Logic and Conversation. In Peter Cole and Jerry L. Morgan, editors, Syntax and Semantics, volume 3. New York: Academic Press.
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11:10-18.
George Hripcsak and Adam S. Rothschild. 2005. Agreement, the f-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12(3):296-298.
Dan Jurafsky, Liz Shriberg, and Debbra Biasca. 1997. Switchboard SWBD-DAMSL Shallow-Discourse-Function Annotation Coders Manual. Technical Report Draft 13, University of Colorado, Institute of Cognitive Science.
Jihie Kim, Jia Li, and Taehwan Kim. 2010a. Towards Identifying Unresolved Discussions in Student Online Forums. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 84-91, Los Angeles, CA, USA.
Su Nam Kim, Li Wang, and Timothy Baldwin. 2010b. Tagging and linking web forum posts. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL '10, pages 192-202, Stroudsburg, PA, USA.
Klaus Krippendorff. 1980. Content Analysis: An Introduction to Its Methodology. Thousand Oaks, CA: Sage Publications.
J. Richard Landis and Gary G. Koch. 1977. An Application of Hierarchical Kappa-type Statistics in the Assessment of Majority Agreement among Multiple Observers. Biometrics, 33(2):363-374, June.
David Laniado, Riccardo Tasso, Yana Volkovich, and Andreas Kaltenbrunner. 2011. When the Wikipedians Talk: Network and Tree Structure of Wikipedia Discussion Pages. In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media, Dublin, IE.
Tom Mitchell. 1997. Machine Learning. McGraw-Hill Education (ISE Editions), 1st edition.
Rebecca Passonneau. 2006. Measuring Agreement on Set-valued Items (MASI) for Semantic and Pragmatic Annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, IT.
John C. Platt. 1998. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning, pages 185-208, Cambridge, MA, USA.
Ilona R. Posner and Ronald M. Baecker. 1992. How People Write Together. In Proceedings of the 25th Hawaii International Conference on System Sciences, pages 127-138, Wailea, Maui, HI, USA.
Ross Quinlan. 1992. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1st edition.
Jodi Schneider, Alexandre Passant, and John G. Breslin. 2011. Understanding and Improving Wikipedia Article Discussion Spaces. In Proceedings of the 26th Symposium on Applied Computing, Taichung, TW.
John R. Searle. 1969. Speech Acts. Cambridge University Press, Cambridge, UK.
John R. Searle. 1976. A classification of illocutionary acts. Language in Society, 5:1-23.
Besiki Stvilia, Michael B. Twidale, Linda C. Smith, and Les Gasser. 2008. Information Quality Work Organization in Wikipedia. Journal of the American Society for Information Science, 59:983-1001.
Grigorios Tsoumakas, Ioannis Katakis, and Ioannis P. Vlahavas. 2010. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667-685. Springer.
Fernanda Viegas, Martin Wattenberg, Jesse Kriss, and Frank Ham. 2007. Talk Before You Type: Coordination in Wikipedia. In Proceedings of the 40th Annual Hawaii International Conference on System Sciences, Waikoloa, Big Island, HI, USA.
Eti Yaari, Shifra Baruchson-Arbib, and Judit Bar-Ilan. 2011. Information quality assessment of community generated content: A user study of Wikipedia. Journal of Information Science, 37:487-498.
Yiming Yang and Jan O. Pedersen. 1997. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 412-420, San Francisco, CA, USA.
Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, MA.

An Unsupervised Dynamic Bayesian Network Approach to Measuring
Speech Style Accommodation

Mahaveer Jain1 , John McDonough1 , Gahgene Gweon2 , Bhiksha Raj1 , Carolyn Penstein Rose1,2
1. Language Technologies Institute; 2. Human Computer Interaction Institute
Carnegie Mellon University
Pittsburgh, PA 15213
{mmahavee,johnmcd,ggweon,bhiksha,cprose}@cs.cmu.edu

Abstract

Speech style accommodation refers to shifts in style that are used to achieve strategic goals within interactions. Models of stylistic shift that focus on specific features are limited in terms of the contexts to which they can be applied if the goal of the analysis is to model socially motivated speech style accommodation. In this paper, we present an unsupervised Dynamic Bayesian Model that allows us to model speech style accommodation in a way that is agnostic to which specific speech style features will shift in a way that resembles socially motivated stylistic variation. This greatly expands the applicability of the model across contexts. Our hypothesis is that stylistic shifts that occur as a result of social processes are likely to display some consistency over time, and if we leverage this insight in our model, we will achieve a model that better captures inherent structure within speech.

1 Introduction

Sociolinguistic research on speech style and its resulting social interpretation has frequently focused on the ways in which shifts in style are used to achieve strategic goals within interactions, for example the ways in which speakers may adapt their speaking style to suppress differences and accentuate similarities between themselves and their interlocutors in order to build solidarity (Coupland, 2007; Eckert & Rickford, 2001; Sanders, 1987). We refer to this stylistic convergence as speech style accommodation. In the language technologies community, one targeted practical benefit of such modeling has been the achievement of more natural interactions with speech dialogue systems (Levitan et al., 2011). Monitoring social processes from speech or language data has other practical benefits as well, such as enabling monitoring how beneficial an interaction is for group learning (Ward & Litman, 2007; Gweon, 2011), how equal participation is within a group (DiMicco et al., 2004), or how conducive an environment is for fostering a sense of belonging and identification with a community (Wang et al., 2011).

Typical work on computational models of speech style accommodation has focused on specific aspects of style that may be accommodated, such as the frequency or timing of pauses or backchannels (i.e., words that show attention like "uh huh" or "ok"), pitch, or speaking rate (Edlund et al., 2009; Levitan & Hirschberg, 2011). In this paper, we present an unsupervised Dynamic Bayesian Model that allows us to model speech style accommodation in a way that does not require us to specify which linguistic features we are targeting. We explore a space of models defined by two independent factors, namely the direct influence of one speaker's style on another speaker's style and the influence of the relational gestalt between the two speakers that motivates the stylistic accommodation, and thus may keep the accommodation moving consistently, with the same momentum. Prior work has explored the influence of the first factor. However, because accommodation reflects social processes that extend over time within an interaction, one may expect a certain consistency of motion within the stylistic shift. Furthermore, we can leverage this consistency of style shift to identify socially meaningful variation without specifying ahead of time which

particular stylistic elements we are focusing on. prior work on modeling emotional speech has
Our evaluation provides support for this hypothe- sought to identify features that themselves have
sis. a social interpretation, such as features that pre-
When stylistic shifts are focused on specific dict emotional states like uncertainty (Liscombe
linguistic features, then measuring the extent of et al., 2005), or surprise (Ang et al., 2002), or
the stylistic accommodation is simple since a social strategies like flirting (Ranganath et al.,
speakers style may be represented on a one or two 2009). However, our goal is to monitor social pro-
dimensional space, and movement can then be cesses that evolve over time and are reflected in
measured precisely within this space using sim- the change in speech dynamics. Examples include
ple linear functions. However, the rich sociolin- fostering trust, forming attachments, or building
guistic literature on speech style accommodation solidarity.
highlights a much greater variety of speech style
characteristics that may be associated with social 2.1 Defining Speech Style Accommmodation
status within an interaction and may thus be bene- The concept of what we refer to as Speech
ficial to monitor for stylistic shifts. Unfortunately, Style Accommodation has its roots in the field
within any given context, the linguistic features of the Social Psychology of Language, where
that have these status associations, which we re- the many ways in which social processes are re-
fer to as indexicality, are only a small subset of flected through language, and conversely, how
the linguistic features that are being used in some language influences social processes, are the ob-
way. Furthermore, which features carry this in- jects of investigation (Giles & Coupland, 1991).
dexicality are specific to a context. Thus, separat- As a first step towards leveraging this broad range
ing the socially meaningful variation from varia- of language processes, we refer to one very spe-
tion in linguistic features occurring for other rea- cific topic, which has been referred to as entrain-
sons is akin to searching for the proverbial needle ment, priming, accommodation, or adaptation in
in a haystack. It is this technical challenge that we other computational work (Levitan & Hirschberg,
address in this paper. 2011). Specifically we refer to the finding that
In the remainder of the paper we review the lit- conversational partners may shift their speaking
erature on speech style accommodation both from style within the interaction, either becoming more
a sociolinguistic perspective and from a techno- similar or less similar to one another.
logical perspective in order to motivate our hy- Our usage of the term accommodation specifi-
pothesis and proposed model. We then describe cally refers to the process of speech style conver-
the technical details of our model. Next, we gence within an interaction. Stylistic shifts may
present an experiment in which we test our hy- occur at a variety of levels of speech or language
pothesis about the nature of speech style accom- representation. For example, much of the early
modation and find statistically significant con- work on speech style accommodation focused on
firming evidence. We conclude with a discussion regional dialect variation, and specifically on as-
of the limitations of our model and directions for pects of pronunciation, such as the occurrence of
ongoing research. post-vocalic r in New York City, that reflected
differences in age, regional identification, and so-
2 Theoretical Framework
cioeconomic status (Labov, 2010a,b). Distribu-
Our research goal is to model the structure of tion of backchannels and pauses have also been
speech in a way that allows us to monitor so- the target of prior work on accommodation (Lev-
cial processes through speech. One common goal itan & Hirschberg, 2011). These effects may be
of prior work on modeling speech dynamics has moderated by other social factors. For example,
been for the purpose of informing the design of Bilous & Krauss (1988) found that females ac-
more natural spoken dialogue systems (Levitan et commodated to their male partners in conversa-
al., 2011). The practical goal of our work is to tion in terms of average number of words uttered
measure the social processes themselves, for ex- per turn. For example, Hecht et al. (1989) re-
ample in order to estimate the extent to which ported that extroverts are more listener adaptive
group discussions show signs of productive con- than introverts and hence extroverts converged
sensus building processes (Gweon, 2011). Much more in their data.

Accommodation could be measured either ity of speech and lexical features either over full
from textual or speech content of a conversation. conversations or by comparing the similarity in
The former relates to what people say whereas the first half and the second half of the conver-
the latter to how they say it. We are only inter- sation. For example, Edlund et al. (2009) mea-
ested in measuring accommodation from speech sure accommodation in pause and gap length us-
in this work. There has been work on convergence ing measures such as synchrony and convergence.
in text such as syntactic adaptation (Reitter et al., Levitan & Hirschberg (2011) found that accom-
2006) and language similarity in online commu- modation is also found in special social behaviors
nities (Huffaker et al., 2006). within conversation such as backchannels. They
show that speakers in conversation tend to use
2.2 Social Interpretation of Speech Style similar kinds of speech cues such as high pitch at
Accommodation the end of utterance to invite a backchannel from
It has long been established that while some their partner. In order to measure accommodation
speech style shifts are subconscious, speakers on these cues, they compute the correlation be-
may also choose to adapt their way of speaking tween the numerical values of these cues used by
in order to achieve social effects within an in- partners.
teraction (Sanders, 1987). One of the main mo- In our work we measure accommodation using
tives for accommodation is to decrease social dis- Dynamic Bayesian Networks (DBNs). Our mod-
tance. On a variety of levels, speech style accom- els are learnt in an unsupervised fashion. What
modation has been found to affect the impression we are specifically interested in is the manner in
that speakers give within an interaction. For ex- which the influence of one partner on the other is
ample, Welkowitz & Feldstein (1970) found that modeled. What is novel in our approach is the
when speakers become more similar to their part- introduction of the concept of an accommodation
ners, they are liked more by partners. Another state, or relational gestalt variable, which essen-
study by Putman & Street Jr (1984) demonstrated tially models the momentum of the influence that
that interviewees who converge to the speaking one partner is having on the other partners speak-
rate and response latency of their interviewers are ing style. It allows us to represent structurally the
rated more favorably by the interviewers. Giles et insight that accommodation occurs over time as a
al. (1987) found that more accommodating speak- reflection of a social process, and thus has some
ers were rated as more intelligent and supportive consistency in the nature of the accommodation
by their partners. Conversely, social factors in within some span of time. The prior work de-
an interaction affect the extent to which speak- scribed in this section can be thought of as tak-
ers engage in, and some times chose not to en- ing the influence of the partners style directly on
gage in, accommodation. For example, Purcell the speakers style within an instant as the floor
(1984) found that Hawaiian children exhibit more shifts from one speaker to the next. Thus, no con-
convergence in interactions with peer groups that sistency in the manner in which the accommoda-
they like more. Bourhis & Giles (1977) found that tion is occurring is explicitly encouraged by the
Welsh speakers while answering to an English model. The major advantage of consistency of
surveyor broadened their Welsh accent when their motion within the style shift over time is that it
ethnic identity was challenged. Scotton (1985) provides a sign post for identifying which style
found that few people hesitated to repeat lexi- variation within the speech is salient with respect
cal patterns of their partners to maintain integrity. to social interpretation within a specific interac-
Nenkova et al. (2008) found that accommodation tion so that the model may remain agnostic and
on high frequency words correlates with natural- may thus be applied to a variety of interactions
ness, task success, and coordinated turn-taking that differ with respect to which stylistic features
behavior. are salient in this respect.

2.3 Computational models of speech style 3 A Dynamic Bayesian Network Model


accommodation for Conversation
Prior research has attempted to quantify accom- Speech stylistic information is reflected in
modation computationally by measuring similar- prosodic features such as pitch, energy, speak-

ing rate etc. In this work, we leverage several of these speech features to quantify accommodation. We propose a series of models that can be trained unsupervised from speech features and can be used for predicting accommodation. The models attempt to capture the dependence of speech features on speaking style, as well as the effect of persistence and accommodation on style. We use a dynamic Bayesian network (DBN) formalism to capture these relationships. Below we briefly review DBNs, and subsequently describe the speech features used, and the proposed models.

3.1 Dynamic Bayesian Networks

The theory of Bayesian networks is well documented and understood (Jensen, 1996; Pearl, 1988). A Bayesian network is a probabilistic model that represents statistical relationships between random variables via a directed acyclic graph (DAG). Formally, it is a directed acyclic graph whose nodes represent random variables (which may be observable quantities, latent unobservable variables, or hypotheses to be estimated). Edges represent conditional dependencies; nodes which are connected by an edge represent random variables that have a direct influence on one another. The entire network represents the joint probability of all the variables represented by the nodes, with appropriate factoring of the conditional dependencies between variables.

Consider, for instance, a joint distribution over a set of random variables x_1, x_2, ..., x_n, modeled by a Bayesian network. Let V = v_1, v_2, ..., v_n represent the set of n nodes in the network, representing the random variables x_1, x_2, ..., x_n respectively. Let π(v_i) represent the set of parent nodes of v_i, i.e. nodes in V that have a directed edge into a node v_i. Then, by the dependencies specified by the network, P(x_i | x_1, x_2, ..., x_n) = P(x_i | x_j : v_j ∈ π(v_i)). In other words, any variable x_i is directly dependent only on its parent variables, i.e. the random variables represented by the nodes in π(v_i), and is independent of all other variables given these variables. The joint probability of x_1, x_2, ..., x_n is hence given by

  p(x_1, x_2, ..., x_n) = ∏_i p(x_i | π_{x_i})        (1)

where π_{x_i} represents {x_j : v_j ∈ π(v_i)}, i.e. the parents of x_i in the network. We note that not all of these variables need to be observable; often in such models several of the variables are unobservable, i.e. they are latent. In order to obtain the joint distribution of the observable variables the latent variables must be marginalized out, i.e. if x_1, ..., x_m are observable and x_{m+1}, ..., x_n are latent, P(x_1, ..., x_m) = Σ_{x_{m+1}, ..., x_n} P(x_1, x_2, ..., x_n).

Dynamic Bayesian networks (DBNs) further represent time-series data through a recurrent formulation of a basic Bayesian network that represents the relationship between variables. Within a DBN a set of random variables at each time instance t is represented as a static Bayesian Network with temporal dependencies to variables at other instants. Namely, the distribution of a variable x_{i,t} at time t is dependent on other variables at times t-τ, x_{j,t-τ}, through conditional probabilities of the form Pr(x_{i,t} | x_{j,t-τ}). An example DBN, consisting of three variables (A, B and C), two of which have temporal dependencies, is shown in Figure 1.

[Figure 1: An example Dynamic Bayesian Network (DBN) showing the temporal relationship between three random variables (A, B and C). A is observed and dependent on two hidden variables B and C. Directed edges across time (t-1 → t) indicate temporal relationships between variables. In this example, the variables A_t and B_t are both dependent on B_{t-1}, with the relationship defined through conditional distributions P(A_t | B_{t-1}) and P(B_t | B_{t-1}).]

One benefit of the DBN formalism is that in addition to providing a compact graphical way of representing statistical relationships between variables in a process, the constrained, directed network structure also allows for simplified inference. Moreover, the conditional distributions associated with the network are often assumed not to vary over time, i.e. Pr(x_{i,t} | x_{j,t-τ}) = Pr(x_{i,t'} | x_{j,t'-τ}) for all t, t'. This allows for a very compact representation of DBNs and allows for efficient Expectation-Maximization (EM) learning algorithms to be applied.
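As a worked instance of Equation (1), the joint distribution of the toy DBN in Figure 1 can be written out explicitly. The exact edge set is only described in the caption, so the factorization below is one plausible reading of it (within-slice edges B_t → A_t and C_t → A_t, temporal edges B_{t-1} → B_t and B_{t-1} → A_t); it is an illustration, not a statement of the authors' example.

```latex
% One plausible factorization of the Figure 1 DBN over T time slices
% (edge set assumed from the caption; P(B_1 | B_0) is read as the initial prior P(B_1)).
P(A_{1:T}, B_{1:T}, C_{1:T})
  = \prod_{t=1}^{T} P(C_t)\, P(B_t \mid B_{t-1})\, P(A_t \mid B_t, C_t, B_{t-1})

% Since only A is observed, the latent B and C are marginalized out
% to score an observation sequence:
P(A_{1:T}) = \sum_{B_{1:T}} \sum_{C_{1:T}} P(A_{1:T}, B_{1:T}, C_{1:T})
```

The marginalization is exactly the step that the junction tree algorithm carries out efficiently during inference, and EM re-estimates the conditional distributions from such marginals.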

In the discussion that follows we do not explicitly specify the random variables and the form of the associated probability distributions, but only present them graphically. The joint distribution of the variables should nevertheless be obvious from the figures. We employ EM to learn the parameters of the models from training data, and the junction tree algorithm (Lauritzen & Spiegelhalter, 1988) to perform inference.

3.2 Speech Features

We characterize conversations as a series of spoken turns by the partners. We characterize the speech in each turn through a vector that captures several aspects of the signal that are salient to style. We used the openSMILE toolkit (opensmile, 2011) to compute the features. Specifically, within each turn the speech was segmented into analysis windows of 50ms, where adjacent windows overlapped by 40ms. From each analysis window a total of 7 features were computed: voice probability, harmonic to noise ratio, voice quality, three measures of pitch (F0, F0_raw, F0_env), and loudness. A 10-bin histogram of feature values was computed for each of these features, which was then normalized to sum to 1.0. The normalized histogram effectively represents both the values and the fluctuation in the features. For instance, a histogram of loudness values captures the variation in the loudness of the speaker within a turn. The logarithms of the normalized 10-bin histograms for the 7 features were concatenated to result in a single 70-dimensional observation vector for the turn. These 70-dimensional observation vectors for each turn of any speaker are represented in our model as o^i_t, where t is the turn index and i is the speaker index.

3.3 Elements of the Models

In this section we formally describe the elements of our model.
Speaking Style State: These states represent the speaking styles of the partners in a conversation. We represent these states as s^i_t, where t represents the turn index and i represents the speaker index. These states are assumed to belong to a finite, discrete set S = {s_1, s_2, ..., s_k}, i.e. s^i_t ∈ S for all (i, t).
Accommodation State: An accommodation state represents the indirect influence of partners on each other in a conversation. In our present design, it can take a value of either 1 or 0. These states are represented as A_t, where t is the turn index.
Observation Vector: The observation vectors are the feature vectors o^i_t computed for each turn.

[Figure 2: The basic generative model.]
[Figure 3: ISM: The dynamics of each speaker are independent of the other speaker.]

3.4 Models for Accommodation

Our models embody two premises. First, a person's speech in any turn is a function of his/her speaking style in that turn. Second, a person's speaking style at any turn depends not only on their own personal biases, but also on their accommodation to their partner. We represent these dependencies as a DBN.
Our basic model to represent the generation of speech (i.e. speech features) by a speaker in the absence of other influences is shown in Figure 2. The speech features o^i_t in any turn depend only on the speaking style s^i_t in that turn. The style s^i_t in any turn depends on the style s^i_{t-1} in the previous turn, to capture the speaker-specific patterns of variation in speaking style. We note that this is a rather simple model and patterns of variation in style are captured only through the statistical dependence between styles in consequent turns.
We now build our models for accommodation on this basic model.

3.4.1 Style-based models

Our first two models assume that accommodation is demonstrated as a direct dependence of a person's speaking style on their partner's style. Therefore the models only consider speaking styles.
The Independent Speaker Model
Our simplest model for a conversation assumes
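Returning to the turn-level representation of Section 3.2, the 70-dimensional observation vector can be sketched as follows. The per-frame feature tracks are assumed to be given (in the paper they come from openSMILE), and the histogram binning and the log-smoothing constant are illustrative assumptions rather than details taken from the paper.

```python
# Sketch of the 70-dimensional turn representation of Section 3.2:
# `frame_features` maps each of the seven per-frame features to a 1-D array of
# values for one turn (50 ms windows with 40 ms overlap, computed upstream).
import numpy as np

FEATURES = ["voice_prob", "hnr", "voice_quality", "F0", "F0_raw", "F0_env", "loudness"]

def turn_vector(frame_features, bins=10, eps=1e-6):
    parts = []
    for name in FEATURES:
        values = np.asarray(frame_features[name], dtype=float)
        hist, _ = np.histogram(values, bins=bins)   # 10-bin histogram of frame values
        hist = hist / max(hist.sum(), 1)            # normalize to sum to 1.0
        parts.append(np.log(hist + eps))            # log of the normalized histogram
        # eps avoids log(0); the paper does not specify how empty bins are handled.
    return np.concatenate(parts)                    # 7 features x 10 bins = 70 dimensions
```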

SY1t-1t-1 S1t S1t+1


Yt+1 SY1t-1t S1t+1
Yt+1

AY 2t AYt-11t AY2t-1t+1

S2t-1 S2t S2t+1 S2t S2t+1

O2t-1 O2t O2t+1 O2t O2t+1

Figure 4: CSDM: A speakers style depends on their Figure 6: AASM: Accommodation state associated
partners style at the previous turn. with every speaker turn

O1t-1 O1t O1t+1


(SASM) we assume that accommodation is a
SY1t-1t-1 S1 t S1
Yt+1
t+1
jointly experienced characteristic of the conversa-
AYt+1
tion at any time, which enjoys some persistence,
AYt t-1

but is also affected by the speaking styles exhib-


S2t
S2t-1 S2t+1 ited by the speakers at each turn. The accom-
modation at any time in turn affects the speaking
O2t-1 O2t O2t+1
styles of both speakers in the next turn. The DBN
for this model is shown in Figure 5.
Figure 5: SASM: Both partners styles depend on mu- The Asymmetric Accommodation State Model
tual accommodation to one another.
The asymmetric accommodation state model
(AASM) represents accommodation as a speaker-
that each persons speaking style evolves indepen- turn-specific characteristic. In any turn, the ac-
dently, uninfluenced by their partner. The DBN commodation for a speaker depends chiefly on
for this is shown in Figure 3. We refer to this their partners most recent speaking style. The ac-
model as the Independent Speaker Model (ISM). commodation state can change after each speaker
Note that the set of values that the style states can turn. Figure 6 shows the DBN for this model.
take is common for both speakers. The speaking Note that this model captures the asymmetric na-
styles for the two speakers may be said to be con- ture of accommodation, e.g. it may be the case
fluent in any turn if both of them are in the same that only one of the speakers is accommodating.
style state at that turn. For instance, if if a1t = 0 and a2t = 1, only
The Cross-speaker Dependence Model speaker2 is accommodating but not speaker1.
Intuitively, in a conversation speakers are influ-
enced by their partners speaking style in previ- 3.4.3 Accommodated style dependence
ous turns. The Cross-Speaker Dependence Model models
(CSDM) represents this dependence as shown in While accommodation state models explicitly
the DBN in Figure 4. In this model a persons models accommodation, they do not explicitly
speaking style depends on both their own and represent how it is expressed. In reality, accom-
their partners speaking styles in the previous turn. modation is a process of convergence an ac-
commodating speakers speaking style may be ex-
3.4.2 Accommodation state models pected to converge toward that of their partner. In
Accommodation state models assume that con- other words, the persons speaking style depends
versations actually have an underlying state of ac- not only on whether they are accommodating or
commodation, and that speakers in fact vary their not, but also on their partners style at the previ-
speaking styles in response to it. We models this ous turn. Accommodated style dependence mod-
through a binary-valued accommodation state that els explicitly represent this dependence.
is embedded into the DBN. We posit two types of The Symmetric Accommodated Style Depen-
accommodation state models. dence Model
The Symmetric Accommodation State Model The Symmetric Accommodated Style Depen-
In the symmetric accommodation state model dence Model (SASDM) extends the SASM, to in-

O1t-1 O1t O1t+1 tures, as represented in the observation vectors.
It is hence reasonable to assume that they are both
SY1t-1t-1 S1t S1t+1
Yt+1
speaking in similar style. Similarly, the accom-
AYt AYt+1
t-1 modation state cannot be expected to actually de-
S2t
pict accommodation; nevertheless, it can capture
S2t-1 S2t+1
the dependencies that govern when the two speak-
O2t-1 O2t O2t+1
ers are likely to be in the same state.

4 Evaluation
Figure 7: SASDM: A speakers style depends both on
mutual accommodation and the partners style in the The model we have just described allows us to in-
previous turn. vestigate two separate aspects of our concept of
speech style accommodation. The first aspect is
O1t O1t+1 that style accommodation occurs as a local influ-
ence of one speakers style on the other speakers
SY1t-1t S1t+1
Yt+1
style, as depicted by direct links between style
AY 2t AYt-11t AY2t-1t+1 states. The second aspect is that although this is a
S2t S2t+1
local phenomenon, because it is a reflection of a
social process that extends over a period of time,
O2t O2t+1 there will be some persistence of accommodation
over longer periods of time, as characterized by
Figure 8: AASDM: The accommodation state associ- the accommodation state. We presented two dif-
ated with every speaker and a speakers style depends ferent operationalizations of the accommodation
on the partners style. state above, namely Asymmetric and Symmetric.
Accommodation is a phenomenon that occurs
within interactions between speakers; we can ex-
dicate that a speakers style in any turn depends
pect not to observe accommodation occurring be-
both on accommodation and on their partners
tween individuals that have never met and are not
style in the previous turn. Figure 7 shows the
interacting. On average, then, we expect to see
DBN for this model.
more evidence of speech style accommodation in
Asymmetric Accommodated Style Dependence pairs of individuals who are interacting (i.e., Real
Model Pairs) than in pairs of individuals who are not in-
The Asymmetric Accommodated Style Depen- teracting and have never met (i.e., Constructed
dence Model (AASDM) extends the AASM by Pairs). Thus, we may evaluate the extent to which
adding a direct dependence between a speakers our model is sensitive to social dynamics within
style and their partners style in their most recent pairs by the extent to which it is able to distinguish
turn. The DBN for this is shown in Figure 8. between true conversation between Real Pairs of
speaker and synthetic conversation between Con-
3.5 Interpreting the states
structed Pairs. A similar experimental paradigm
We note that we have referred to the states in the has been adopted in prior work on speech style
models above as style states. In reality, in all accommodation (Levitan et al., 2011).
cases, we learn the parameters of the model in Hypothesis: Our hypothesis is that models that
an unsupervised manner, since the data we use to explicitly represent the notion that accommoda-
train it do not have either speaking style or ac- tion occurs over a span of time with consistency
commodation indicated (although, if they were la- of momentum will achieve better success at dis-
beled, the labels could be employed within our tinguishing between Real Pairs and Constructed
models). Consequently, we have no assurance Pairs than models that do not.
that the states learned will actually correspond to Experimental Manipulation: Thus, using the
speaking styles. They can only be considered a model we have just described, we are able to
proxy for speaking style. Nevertheless, if both test our hypothesis using a 2 3 factorial design
speakers are in the same state, they can both be in which one factor is the inclusion of direct
expected to be producing similar prosodic fea- links from the style of one speaker to the style

of the other speaker, which we refer to as the factors. Furthermore, because the participants did
DirectInfluence (DI) factor, with values True not know each other before the debate, we can
(T) and False (F), and the second factor is the assume that if accommodation happened, it was
inclusion of links from style states to and from only during the conversation.
Accommodation states, which we refer to as the Real versus Constructed Pairs: In our analy-
IndirectInfluence (II) factor, with values False sis below, we compare measured accommodation
(F), Asymmetric (A), and Symmetric (S). The between pairs of humans who had a real conver-
result of this 2 3 factorial design are the 6 sation and a constructed pair in which one per-
different models described in Section 3, namely son from that conversation is paired with a con-
ISM (DI=False, II=False), CSDM (DI=True, structed partner, where the partners side of the
II=False), SASM (DI=False, II=Symmetric), conversation was constructed from turns that oc-
AASM (DI=False, II=Asymmetric), SASDM curred in other conversations. We set up this com-
(DI=True, II=Symmetric), and AASDM parison in order to isolate speech style conver-
(DI=True, II= Asymmetric). gence from lexical convergence when we evalu-
Corpus: The success criterion in our experiment ate the performance of our model. The difference
is the extent to which models of speech style between the measured accommodation between
accommodation are able to distinguish between real and constructed pairs is treated as a weak op-
Real Pairs and Constructed pairs. In order to set erationalization of model accuracy at measuring
up this comparison, we began with a corpus of de- speech style accommodation.
bates between students about the reasons for the For each of the 20 Real pairs in the test corpus
fall of the Ottoman Empire. We obtained this cor- we composed one Constructed Pair. Each Con-
pus from researchers who originally collected it structed Pair comprised one student from the cor-
to investigate issues related to learning from con- responding Real Pair (i.e., the Real Student) and a
versational interactions (Nokes et al., 2010). The Constructed Partner that resembled the real part-
full corpus contains interactions between 76 pairs ner in content but not necessarily style. We did
of students who interacted for 8 minutes. Within this by iterating through the real partners turns,
each pair, one student was assigned the role of ar- replacing each with a turn that matched as well as
guing that the fall of the Ottoman empire was due possible in terms of lexical content but came from
to internal causes, whereas the other student was a different conversation. Lexical content match
assigned the role of arguing that the fall of the Ot- was measured in terms of cosine similarity. Turns
toman empire was due to external causes. Each were selected from the other Real pairs. Thus, the
student was given a 4 page packet of supporting Constructed Partner had similar content to the cor-
information for their side of the debate to draw responding real partner on a turn by turn basis, but
from in the interaction. the style of expression could not be influenced by
The speech from each participant was recorded the Real Student. Thus, ideally we should not see
on a separate channel. As a first step, we aligned evidence of speech style accommodation within
the speech recordings automatically to their tran- the Constructed Pairs.
scriptions at the word and turn level. After align- Experimental Procedure: For each of the four
ing the corpus at the word level, we identify the models we computed an Accommodation Score
turn interval of each partner in the conversation. for each of the Real Pairs and Constructed Pairs.
We use 66 of the debates out of the complete set In order to obtain a measure that can be used to
of 76 for the experiments discussed in this paper. compute accommodation for all the models con-
We had to eliminate 10 dialogues where the seg- sidered, we compute the accommodation value as
mentation and alignment failed. For each of our the fraction of turns in a session where partners
models, we used the same 3 fold cross-validation. exhibited the same speaking style.
Participants: Participants were all male under- Results: In order to test our hypothesis we con-
graduate students between the ages of 18 and 25. structed an ANOVA model with Accommodation
In prior studies, it has been shown that accommo- Score as the dependent variable and DirectInflu-
dation varies based on gender, age and familiar- ence, IndirectInfluence, RealVsConstructed as in-
ity between partners. This corpus is particularly dependent variables. Additionally we included
appropriate because it controls for most of these the interaction terms between all pairs of inde-

pendent variables. Using this ANOVA model, we find a highly significant main effect of the RealVsConstructed factor that demonstrates the general ability of the models to achieve separation between Real Pairs and Constructed Pairs; on average F(1,780) = 18.22, p < .0001.

Model   DI   II   Real: μ (σ)   Constructed: μ (σ)
SASDM   T    S    .54 (.23)     .44 (.29)
SASM    F    S    .54 (.23)     .44 (.29)
CSDM    T    F    .6 (.26)      .52 (.3)
ISM     F    F    .56 (.25)     .51 (.32)
AASM    F    A    .6 (.24)      .51 (.3)
AASDM   T    A    .61 (.24)     .48 (.3)

Table 1: Accommodation measured using different models. Legend: μ = mean, σ = standard deviation, DI = Direct Influence, II = Indirect Influence.

However, when we look more closely, we find that although the trend is consistently to find more evidence of speech style accommodation in Real Pairs than in Constructed Pairs, we see differentiation among the models in terms of their ability to achieve this separation. When we examine the two-way interactions between DirectInfluence and RealVsConstructed as well as between IndirectInfluence and RealVsConstructed, although we do not find significant interactions, we do find some suggestive patterns when we do the Student's t post-hoc analysis. In particular, when we explore just the interaction between IndirectInfluence links, we find a significant separation between Real vs Constructed pairs for models with Accommodation states, but not for the cases where no Accommodation states are included. However, when we do the same for the interaction between DirectInfluence links and RealVsConstructed, we find significant separation with or without those links. This suggests that IndirectInfluence links are more important than DirectInfluence links. At a finer-grained level, when we examine the models individually, we only find a significant separation between Real and Constructed pairs with the model that includes both DirectInfluence and Symmetric IndirectInfluence links. These results suggest that Symmetric IndirectInfluence links may be slightly better than Asymmetric ones, and that combining DirectInfluence links and Symmetric IndirectInfluence links may be the best combination.

Based on this analysis, we find support for our hypothesis. We find that the model that includes Symmetric IndirectInfluence links and DirectInfluence links is the best balance between representational power and simplicity. The support for the inclusion of DirectInfluence links in the model is weaker than that of IndirectInfluence links, however. On a larger dataset, we may have observed stronger effects of both factors. Even on this small dataset, we find evidence that adding that structure improves the performance of the model without leading to overfitting.

5 Conclusions and Current Directions

In this paper we presented an unsupervised dynamic Bayesian modeling approach to modeling speech style accommodation in face-to-face interactions. Our model was motivated by the idea that because accommodation reflects social processes that extend over time within an interaction, one may expect a certain consistency of motion within the stylistic shift. Our evaluation demonstrated a statistically significant advantage for the models that embodied this idea.

An important motivation for our modeling approach was that it allows us to avoid targeting specific linguistic style features in our measure of accommodation. However, in our evaluation, we only tested our approach on conversations between male undergraduate students discussing the fall of the Ottoman Empire. Thus, while our evaluation provides evidence that we have taken a first important step towards our ultimate goal, we cannot yet claim that we have a model that performs equally effectively across contexts. In our future work, we plan to formally test the extent to which this allows us to accurately measure accommodation within contexts in which very different stylistic elements carry strategic social value.

Another important direction of our current research is to explore how measures of speech style accommodation may predict other important measures such as how positively partners view one another, how successfully partners perform tasks together, or how well students learn together.

6 Acknowledgments

We gratefully acknowledge John Levine and Timothy Nokes for sharing their data with us. This work was funded by NSF SBE 0836012.

References Lauritzen, S. L. & Spiegelhalter, D. J. (1988). Local
computations with probabilities on graphical struc-
Ang, J., Dhillon, R., Krupski, A., Shriberg, E., & Stol- tures and their application to expert systems. Jour-
cke, A. (2002). Prosody-based automatic detection nal of the Royal Statistical Society, 50, 157224.
of annoyance and frustration in human-computer di-
alog. In Proc. ICSLP, volume 3, pages 20372040. Levitan, R. & Hirschberg, J. (2011). Measuring
Citeseer. acoustic-prosodic entrainment with respect to mul-
tiple levels and dimensions. In Proceedings of In-
Bilous, F. & Krauss, R. (1988). Dominance and terspeech.
accommodation in the conversational behaviours
of same-and mixed-gender dyads. Language and Levitan, R., Gravano, A., & Hirschberg, J. (2011).
Communication, 8(3), 4. Entrainment in speech preceding backchannels. In
Proceedings of the 49th Annual Meeting of the As-
Bourhis, R. & Giles, H. (1977). The language of in- sociation for Computational Linguistics: Human
tergroup distinctiveness. Language, ethnicity and Language Technologies: short papers-Volume 2,
intergroup relations, 13, 119. pages 113117. Association for Computational Lin-
Coupland, N. (2007). Style: Language variation and guistics.
identity. Cambridge Univ Pr. Liscombe, J., Hirschberg, J., & Venditti, J. (2005). De-
DiMicco, J., Pandolfo, A., & Bender, W. (2004). Influ- tecting certainness in spoken tutorial dialogues. In
encing group participation with a shared display. In Proceedings of INTERSPEECH, pages 18371840.
Proceedings of the 2004 ACM conference on Com- Citeseer.
puter supported cooperative work, pages 614623. Nenkova, A., Gravano, A., & Hirschberg, J. (2008).
ACM. High frequency word entrainment in spoken dia-
Eckert, P. & Rickford, J. (2001). Style and sociolin- logue. In In Proceedings of ACL-08: HLT. Asso-
guistic variation. Cambridge Univ Pr. ciation for Computational Linguistics.

Edlund, J., Heldner, M., & Hirschberg, J. (2009). opensmile (2011). http://opensmile.sourceforge.net/.
Pause and gap length in face-to-face interaction. In Pearl, J. (1988). Probabilistic Reasoning in Intelligent
Proc. Interspeech. Systems: Networks of Plausible Inference. Morgan
Giles, H. & Coupland, N. (1991). Language: Contexts Kaufmann.
and consequences. Thomson Brooks/Cole Publish- Purcell, A. (1984). Code shifting hawaiian style: chil-
ing Co. drens accommodation along a decreolizing contin-
Giles, H., Mulac, A., Bradac, J., & Johnson, P. (1987). uum. International Journal of the Sociology of Lan-
Speech accommodation theory: The next decade guage, 1984(46), 7186.
and beyond. Communication yearbook, 10, 1348. Putman, W. & Street Jr, R. (1984). The conception
and perception of noncontent speech performance:
Gweon, G. A. P. U. M. R. B. R. C. P. (2011). The
Implications for speech-accommodation theory. In-
automatic assessment of knowledge integration pro-
ternational Journal of the Sociology of Language,
cesses in project teams. In Proceedings of Computer
1984(46), 97114.
Supported Collaborative Learning.
Ranganath, R., Jurafsky, D., & McFarland, D. (2009).
Hecht, M., Boster, F., & LaMer, S. (1989). The ef-
Its not you, its me: detecting flirting and its mis-
fect of extroversion and differentiation on listener-
perception in speed-dates. In Proceedings of the
adapted communication. Communication Reports,
2009 Conference on Empirical Methods in Natural
2(1), 18.
Language Processing: Volume 1-Volume 1, pages
Huffaker, D., Jorgensen, J., Iacobelli, F., Tepper, P., & 334342. Association for Computational Linguis-
Cassell, J. (2006). Computational measures for lan- tics.
guage similarity across time in online communities.
Reitter, D., Keller, F., & Moore, J. D. (2006). Com-
In In ACTS: Proceedings of the HLT-NAACL 2006
putational modelling of structural priming in dia-
Workshop on Analyzing Conversations in Text and
logue. In In Proc. Human Language Technology
Speech, pages 1522.
conference - North American chapter of the Asso-
Jensen, F. V. (1996). An introduction to Bayesian net- ciation for Computational Linguistics annual mtg,
works. UCL Press. pages 121124.
Labov, W. (2010a). Principles of linguistic change: Sanders, R. (1987). Cognitive foundations of calcu-
Internal factors, volume 1. Wiley-Blackwell. lated speech. State University of New York Press.
Labov, W. (2010b). Principles of linguistic change: Scotton, C. (1985). What the heck, sir: Style shifting
Social factors, volume 2. Wiley-Blackwell. and lexical colouring as features of powerful lan-

guage. Sequence and pattern in communicative be-
haviour, pages 103119.
Wang, Y., Kraut, R., & Levine, J. (2011). To stay or
leave? the relationship of emotional and informa-
tional support to commitment in online health sup-
port groups. In Proceedings of the ACM conference
on computer-supported cooperative work. ACM.
Ward, A. & Litman, D. (2007). Automatically measur-
ing lexical and acoustic/prosodic convergence in tu-
torial dialog corpora. In Proceedings of the SLaTE
Workshop on Speech and Language Technology in
Education. Citeseer.
Welkowitz, J. & Feldstein, S. (1970). Relation of ex-
perimentally manipulated interpersonal perception
and psychological differentiation to the temporal
patterning of conversation. In Proceedings of the
78th Annual Convention of the American Psycho-
logical Association, volume 5, pages 387388.

797
Learning the Fine-Grained Information Status of Discourse Entities

Altaf Rahman and Vincent Ng


Human Language Technology Research Institute
University of Texas at Dallas
Richardson, TX 75083-0688
{altaf,vince}@hlt.utdallas.edu

Abstract

While information status (IS) plays a crucial role in discourse processing, there have only been a handful of attempts to automatically determine the IS of discourse entities. We examine a related but more challenging task, fine-grained IS determination, which involves classifying a discourse entity as one of 16 IS subtypes. We investigate the use of rich knowledge sources for this task in combination with a rule-based approach and a learning-based approach. In experiments with a set of Switchboard dialogues, the learning-based approach achieves an accuracy of 78.7%, outperforming the rule-based approach by 21.3%.

1 Introduction

A linguistic notion central to discourse processing is information status (IS). It describes the extent to which a discourse entity, which is typically referred to by noun phrases (NPs) in a dialogue, is available to the hearer. Different definitions of IS have been proposed over the years. In this paper, we adopt Nissim et al.'s (2004) proposal, since it is primarily built upon Prince's (1992) and Eckert and Strube's (2001) well-known definitions, and is empirically shown by Nissim et al. to yield an annotation scheme for IS in dialogue that has good reproducibility.1

Specifically, Nissim et al. (2004) adopt a three-way classification scheme for IS, defining a discourse entity as (1) old to the hearer if it is known to the hearer and has previously been referred to in the dialogue; (2) new if it is unknown to her and has not been previously referred to; and (3) mediated (henceforth med) if it is newly mentioned in the dialogue but she can infer its identity from a previously-mentioned entity. To capture finer-grained distinctions for IS, Nissim et al. allow an old or med entity to have a subtype, which subcategorizes an old or med entity. For instance, a med entity has the subtype set if the NP that refers to it is in a set-subset relation with its antecedent.

IS plays a crucial role in discourse processing: it provides an indication of how a discourse model should be updated as a dialogue is processed incrementally. Its importance can be reflected in part in the amount of attention it has received in theoretical linguistics over the years (e.g., Halliday (1976), Prince (1981), Hajicová (1984), Vallduví (1992), Steedman (2000)), and in part in the benefits it can potentially bring to NLP applications. One task that could benefit from knowledge of IS is identity coreference: since new entities by definition have not been previously referred to, an NP marked as new does not need to be resolved, thereby improving the precision of a coreference resolver. Knowledge of fine-grained or subcategorized IS is valuable for other NLP tasks. For instance, an NP marked as set signifies that it is in a set-subset relation with its antecedent, thereby providing important clues for bridging anaphora resolution (e.g., Gasperin and Briscoe (2008)).

Despite the potential usefulness of IS in NLP tasks, there has been little work on learning the IS of discourse entities. To investigate the plausibility of learning IS, Nissim et al. (2004) annotate a set of Switchboard dialogues with such information,2 and subsequently present a

1 It is worth noting that several IS annotation schemes have been proposed more recently. See Götze et al. (2007) and Riester et al. (2010) for details.
2 These and other linguistic annotations on the Switchboard dialogues were later released by the LDC as part of the NXT corpus, which is described in Calhoun et al. (2010).
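To keep the three-way scheme and its 16 fine-grained labels straight in the rest of the paper, the inventory from Section 2 can be written down as a small lookup structure. This is only an illustrative sketch; the Python names below are ours and are not part of the Switchboard/NXT annotation format.

```python
# The IS inventory used in this paper: three main types and 16 fine-grained labels.
# new has no subtype of its own and is treated as the 16th label.
IS_SUBTYPES = {
    "old": ["identity", "event", "general", "generic", "ident_generic", "relative"],
    "med": ["general", "bound", "part", "situation", "event", "set",
            "poss", "func_value", "aggregation"],
    "new": [],
}

LABELS = [f"{t}/{s}" for t, subs in IS_SUBTYPES.items() for s in subs] + ["new"]

def coarse_type(label):
    """Map a fine-grained label such as 'med/set' back to old/med/new."""
    return label.split("/")[0]

assert len(LABELS) == 16
print(coarse_type("med/set"))  # -> med
```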

rule-based approach and a learning-based ap- the hand-written rules and their predictions di-
proach to acquiring such knowledge (Nissim, rectly as features for the learner. In an evalua-
2006). More recently, we have improved Nissims tion on 147 Switchboard dialogues, our learning-
learning-based approach by augmenting her fea- based approach to fine-grained IS determina-
ture set, which comprises seven string-matching tion achieves an accuracy of 78.7%, substan-
and grammatical features, with lexical and syn- tially outperforming the rule-based approach by
tactic features (Rahman and Ng, 2011; hence- 21.3%. Equally importantly, when employing
forth R&N). Despite the improvements, the per- these linguistically rich features to learn Nissims
formance on new entities remains poor: an F- 3-class IS determination task, the resulting classi-
score of 46.5% was achieved. fier achieves an accuracy of 91.7%, surpassing the
Our goal in this paper is to investigate fine- classifier trained on R&Ns state-of-the-art fea-
grained IS determination, the task of classifying ture set by 8.8% in absolute accuracy. Improve-
a discourse entity as one of the 16 IS subtypes ments on the new class are particularly substan-
defined by Nissim et al. (2004).3 Owing in part tial: its F-score rises from 46.7% to 87.2%.
to the increase in the number of categories, fine-
grained IS determination is arguably a more chal- 2 IS Types and Subtypes: An Overview
lenging task than the 3-class IS determination task
In Nissim et al.s (2004) IS classification scheme,
that Nissim and R&N investigated. To our knowl-
an NP can be assigned one of three main types
edge, this is the first empirical investigation of au-
(old, med, new) and one of 16 subtypes. Below
tomated fine-grained IS determination.
we will illustrate their definitions with examples,
We propose a knowledge-rich approach to fine-
most of which are taken from Nissim (2003) or
grained IS determination. Our proposal is moti-
Nissim et al.s (2004) dataset (see Section 3).
vated in part by Nissims and R&Ns poor per-
formance on new entities, which we hypothesize Old. An NP is marked is old if (i) it is corefer-
can be attributed to their sole reliance on shallow ential with an entity introduced earlier, (ii) it is a
knowledge sources. In light of this hypothesis, generic pronoun, or (iii) it is a personal pronoun
our approach employs semantic and world knowl- referring to the dialogue participants. Six sub-
edge extracted from manually and automatically types are defined for old entities: identity, event,
constructed knowledge bases, as well as corefer- general, generic, ident generic, and relative. In
ence information. The relevance of coreference to Example 1, my is marked as old with subtype
IS determination can be seen from the definition identity, since it is coreferent with I.
of IS: a new entity is not coreferential with any (1) I was angry that he destroyed my tent.
previously-mentioned entity, whereas an old en-
However, if the markable has a verb phrase (VP)
tity may. While our use of coreference informa-
rather than an NP as its antecedent, it will be
tion for IS determination and our earlier claim that
marked as old/event, as can be seen in Example
IS annotation would be useful for coreference res-
2, where the antecedent of That is the VP put my
olution may seem to have created a chicken-and-
phone number on the form.
egg problem, they do not: since coreference reso-
lution and IS determination can benefit from each (2) They ask me to put my phone number
other, it may be possible to formulate an approach on the form. That I think is not needed.
where the two tasks can mutually bootstrap. Other NPs marked as old include (i) relative
We investigate rule-based and learning-based pronouns, which have the subtype relative; (ii)
approaches to fine-grained IS determination. In personal pronouns referring to the dialogue par-
the rule-based approach, we manually compose ticipants, which have the subtype general, and
rules to combine the aforementioned knowledge (iii) generic pronouns, which have the subtype
sources. While we could employ the same knowl- generic. The pronoun you in Example 3 is an in-
edge sources in the learning-based approach, we stance of a generic pronoun.
chose to encode, among other knowledge sources, (3) I think to correct the judicial system,
3
you have to get the lawyer out of it.
One of these 16 classes is the new type, for which no
subtype is defined. For ease of exposition, we will refer to Note, however, that in a coreference chain of
the new type as one of the 16 subtypes to be predicted. generic pronouns, every element of the chain is

assigned the subtype ident generic instead. If an NP is part of a situation set up by a
Mediated. An NP is marked as med if the en- previously-mentioned entity, it is assigned the
tity it refers to has not been previously introduced subtype situation, as exemplified by the NP a few
in the dialogue, but can be inferred from already- horses in the sentence below, which is involved in
mentioned entities or is generally known to the the situation set up by Johns ranch.
hearer. Nine subtypes are available for med en- (7) Mary went to Johns ranch and saw that
tities: general, bound, part, situation, event, set, there were only a few horses.
poss, func value, and aggregation. Similar to old entities, an NP marked as med may
General is assigned to med entities that are be related to a previously mentioned VP. In this
generally known, such as the Earth, China, and case, the NP will receive the subtype event, as ex-
most proper names. Bound is reserved for bound emplified by the NP the bus in the sentence below,
pronouns, an instance of which is shown in Ex- which is triggered by the VP traveling in Miami.
ample 4, where its is bound to the variable of the (8) We were traveling in Miami, and the
universally quantified NP, Every cat. bus was very full.
(4) Every cat ate its dinner. If an NP refers to a value of a previously men-
Poss is assigned to NPs involved in intra-phrasal tioned function, such as the NP 30 degrees in Ex-
possessive relations, including prenominal geni- ample 9, which is related to the temperature, then
tives (i.e., Xs Y) and postnominal genitives (i.e., it is assigned the subtype func value.
Y of X). Specifically, Y will be marked as poss if (9) The temperature rose to 30 degrees.
X is old or med; otherwise, Y will be new. For ex- Finally, the subtype aggregation is assigned to co-
ample, in cases like a friends boat where a friend ordinated NPs if at least one of the NPs involved
is new, boat is marked as new. is not new. However, if all NPs in the coordinated
Four subtypes, namely part, situation, event, phrase are new, the phrase should be marked as
and set, are used to identify instances of bridg- new. For instance, the NP My son and I in Exam-
ing (i.e., entities that are inferrable from a related ple 10 should be marked as med/aggregation.
entity mentioned earlier in the dialogue). As an
(10) I have a son ... My son and I like to
example, consider the following sentences:
play chess after dinner.
(5a) He passed by the door of Jans house
New. An entity is new if it has not been intro-
and saw that the door was painted red.
duced in the dialogue and the hearer cannot infer
(5b) He passed by Jans house and saw that
it from previously mentioned entities. No subtype
the door was painted red.
is defined for new entities.
In Example 5a, by the time the hearer processes
the second occurrence of the door, she has already There are cases where more than one IS value
had a mental entity corresponding to the door (af- is appropriate for a given NP. For instance, given
ter processing the first occurrence). As a result, two occurrences of China in a dialogue, the sec-
the second occurrence of the door refers to an ond occurrence can be labeled as old/identity (be-
old entity. In Example 5b, on the other hand, the cause it is coreferential with an earlier NP) or
hearer is not assumed to have any mental repre- med/general (because it is a generally known
sentation of the door in question, but she can in- entity). To break ties, Nissim (2003) define a
fer that the door she saw was part of Jans house. precedence relation on the IS subtypes, which
Hence, this occurrence of the door should be yields a total ordering on the subtypes. Since
marked as med with subtype part, as it is involved all the old subtypes are ordered before their med
in a part-whole relation with its antecedent. counterparts in this relation, the second occur-
If an NP is involved in a set-subset relation with rence of China in our example will be labeled as
its antecedent, it inherits the med subtype set. old/identity. Owing to space limitations, we refer
This applies to the NP the house payment in Ex- the reader to Nissim (2003) for details.
ample 6, whose antecedent is our monthly budget.
3 Dataset
(6) What we try to do to stick to our
monthly budget is we pretty much have We employ Nissim et al.s (2004) dataset, which
the house payment. comprises 147 Switchboard dialogues. We parti-

tion them into a training set (117 dialogues) and a test set (30 dialogues). A total of 58,835 NPs are annotated with IS types and subtypes.4 The distributions of NPs over the IS subtypes in the training set and the test set are shown in Table 1.

IS subtype           Train (%)      Test (%)
old/identity         10236 (20.1)   1258 (15.8)
old/event             1943 (3.8)     290 (3.6)
old/general           8216 (16.2)   1129 (14.2)
old/generic           2432 (4.8)     427 (5.4)
old/ident generic     1730 (3.4)     404 (5.1)
old/relative          1241 (2.4)     193 (2.4)
med/general           2640 (5.2)     325 (4.1)
med/bound              529 (1.0)      74 (0.9)
med/part               885 (1.7)     120 (1.5)
med/situation         1109 (2.2)     244 (3.1)
med/event               351 (0.7)     67 (0.8)
med/set              10282 (20.2)   1771 (22.3)
med/poss              1318 (2.6)     220 (2.8)
med/func value          224 (0.4)     31 (0.4)
med/aggregation         580 (1.1)    117 (1.5)
new                    7158 (14.1)   1293 (16.2)
total                 50874 (100)    7961 (100)

Table 1: Distributions of NPs over IS subtypes. The corresponding percentages are parenthesized.

4 Rule-Based Approach

In this section, we describe our rule-based approach to fine-grained IS determination, where we manually design rules for assigning IS subtypes to NPs based on the subtype definitions in Section 2, Nissim's (2003) IS annotation guidelines, and our inspection of the IS annotations in the training set. The motivations behind having a rule-based approach are two-fold. First, it can serve as a baseline for fine-grained IS determination. Second, it can provide insight into how the available knowledge sources can be combined into prediction rules, which can potentially serve as sophisticated features for a learning-based approach.

As shown in Table 2, our ruleset is composed of 18 rules, which should be applied to an NP in the order in which they are listed. Rules 1-7 handle the assignment of old subtypes to NPs. For instance, Rule 1 identifies instances of old/general, which comprises the personal pronouns referring to the dialogue participants. Note that this and several other rules rely on coreference information, which we obtain from two sources: (1) chains generated automatically using the Stanford Deterministic Coreference Resolution System (Lee et al., 2011),5 and (2) manually identified coreference chains taken directly from the annotated Switchboard dialogues. Reporting results using these two ways of obtaining chains facilitates the comparison of the IS determination results that we can realistically obtain using existing coreference technologies against those that we could obtain if we further improved existing coreference resolvers. Note that both sources provide identity coreference chains. Specifically, the gold chains were annotated for NPs belonging to old/identity and old/ident generic. Hence, these chains can be used to distinguish between old/general NPs and old/ident generic NPs, because the former are not part of a chain whereas the latter are. However, they cannot be used to distinguish between old/general entities and old/generic entities, since neither of them belongs to any chains. As a result, when gold chains are used, Rule 1 will classify all occurrences of you that are not part of a chain as old/general, regardless of whether the pronoun is generic. While the gold chains alone can distinguish old/general and old/ident generic NPs, the Stanford chains cannot distinguish any of the old subtypes in the absence of other knowledge sources, since it generates chains for all old NPs regardless of their subtypes. This implies that Rule 1 and several other rules are only a very crude approximation of the definition of the corresponding IS subtypes.

The rules for the remaining old subtypes can be interpreted similarly. A few points deserve mention. First, many rules depend on the string of the NP under consideration (e.g., they in Rule 2 and whatever in Rule 4). The decision of which strings are chosen is based primarily on our inspection of the training data. Hence, these rules are partly data-driven. Second, these rules should be applied in the order in which they are shown. For instance, though not explicitly stated, Rule 3 is only applicable to the non-anaphoric you and they pronouns, since Rule 2 has already covered their anaphoric counterparts. Finally, Rule 7 uses non-anaphoricity as a test of old/event NPs. The

4 Not all NPs have an IS type/subtype. For instance, a pleonastic it does not refer to any real-world entity and therefore does not have any IS, and so are nouns such as course in of course, accident in by accident, etc.
5 The Stanford resolver is available from http://nlp.stanford.edu/software/corenlp.shtml.

1. if the NP is I or you and it is not part of a coreference chain, then
subtype := old/general
2. if the NP is you or they and it is anaphoric, then
subtype := old/ident generic
3. if the NP is you or they, then
subtype := old/generic
4. if the NP is whatever or an indefinite pronoun prefixed by some or any (e.g., somebody), then
subtype := old/generic
5. if the NP is an anaphoric pronoun other than that, or its string is identical to that of a preceding NP, then
subtype := old/ident
6. if the NP is that and it is coreferential with the immediately preceding word, then
subtype := old/relative
7. if the NP is it, this or that, and it is not anaphoric, then
subtype := old/event
8. if the NP is pronominal and is not anaphoric, then
subtype := med/bound
9. if the NP contains and or or, then
subtype := med/aggregation
10. if the NP is a multi-word phrase that (1) begins with so much, something, somebody, someone,
anything, one, or different, or (2) has another, anyone, other, such, that, of or type
as neither its first nor last word, or (3) its head noun is also the head noun of a preceding NP, then
subtype := med/set
11. if the NP contains a word that is a hyponym of the word value in WordNet, then
subtype := med/func value
12. if the NP is involved in a part-whole relation with a preceding NP based on information extracted from
ReVerbs output, then
subtype := med/part
13. if the NP is of the form Xs Y or poss-pro Y, where X and Y are NPs and poss-pro is a possessive
pronoun, then
subtype := med/poss
14. if the NP fills an argument of a FrameNet frame set up by a preceding NP or verb, then
subtype := med/situation
15. if the head of the NP and one of the preceding verbs in the same sentence share the same WordNet
hypernym which is not in synsets that appear one of the top five levels of the noun/verb hierarchy, then
subtype := med/event
16. if the NP is a named entity (NE) or starts with the, then
subtype := med/general
17. if the NP appears in the training set, then
subtype := its most frequent IS subtype in the training set
18. subtype := new

Table 2: Hand-crafted rules for assigning IS subtypes to NPs.
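Since the 18 rules in Table 2 are applied in the order in which they are listed, the rule-based classifier is a first-match cascade. The sketch below shows that control flow only; the two condition functions are simplified stand-ins for Rules 1 and 9, not the authors' actual implementations, and the NP representation (a dict with a string and an optional chain id) is assumed for illustration.

```python
# Minimal sketch of the first-match rule cascade from Table 2.

def is_first_or_second_person_outside_chain(np, context):
    # Rule 1 stand-in: "I"/"you" that is not part of a coreference chain.
    return np["string"].lower() in {"i", "you"} and not np.get("chain_id")

def contains_coordination(np, context):
    # Rule 9 stand-in: the NP contains "and" or "or".
    return any(tok in {"and", "or"} for tok in np["string"].lower().split())

ORDERED_RULES = [
    (is_first_or_second_person_outside_chain, "old/general"),   # Rule 1
    # ... Rules 2-16 would be listed here, in the order of Table 2 ...
    (contains_coordination, "med/aggregation"),                  # Rule 9
]

def classify(np, context, most_frequent_subtype=None):
    """Apply the rules in order; the first rule whose condition fires wins."""
    for condition, subtype in ORDERED_RULES:
        if condition(np, context):
            return subtype
    # Rule 17: fall back to the NP's most frequent subtype in the training set.
    key = np["string"].lower()
    if most_frequent_subtype and key in most_frequent_subtype:
        return most_frequent_subtype[key]
    return "new"  # Rule 18: default class

print(classify({"string": "my son and I"}, None))  # -> med/aggregation
```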

reason is that these NPs have VP antecedents, but Rule 10 concerns med/set. The words and
both the gold chains and the Stanford chains are phrases listed in the rule, which are derived manu-
computed over NPs only. ally from the training data, provide suggestive ev-
idence that the NP under consideration is a subset
Rules 8-16 concern med subtypes. Apart from
or a specific portion of an entity or concept men-
Rule 8 (med/bound), Rule 9 (med/aggregation),
tioned earlier in the dialogue. Examples include
and Rule 11 (med/func value), which are arguably
another bedroom, different color, somebody
crude approximations of the definitions of the
else, any place, one of them, and most other
corresponding subtypes, the med rules are more
cities. Condition 3 of the rule, which checks
complicated than their old counterparts, in part
whether the head noun of the NP has been men-
because of their reliance on the extraction of so-
tioned previously, is a good test for identity coref-
phisticated knowledge. Below we describe the ex-
erence, but since all the old entities have suppos-
traction process and the motivation behind them.

edly been identified by the preceding rules, it becomes a reasonable test for set-subset relations.

For convenience, we identify part-whole relations in Rule 12 based on the output produced by ReVerb (Fader et al., 2011), an open information extraction system.6 The output contains, among other things, relation instances, each of which is represented as a triple, <A,rel,B>, where rel is a relation, and A and B are its arguments. To preprocess the output, we first identify all the triples that are instances of the part-whole relation using regular expressions. Next, we create clusters of relation arguments, such that each pair of arguments in a cluster has a part-whole relation. This is easy: since part-whole is a transitive relation (i.e., <A,part,B> and <B,part,C> implies <A,part,C>), we cluster the arguments by taking the transitive closure of these relation instances. Then, given an NP NPi in the test set, we assign med/part to it if there is a preceding NP NPj such that the two NPs are in the same argument cluster.

In Rule 14, we use FrameNet (Baker et al., 1998) to determine whether med/situation should be assigned to an NP, NPi. Specifically, we check whether it fills an argument of a frame set up by a preceding NP, NPj, or verb. To exemplify, let us assume that NPj is capital punishment. We search for punishment in FrameNet to access the appropriate frame, which in this case is rewards and punishments. This frame contains a list of arguments together with examples. If NPi is one of these arguments, we assign med/situation to NPi, since it is involved in a situation (described by a frame) that is set up by a preceding NP/verb.

In Rule 15, we use WordNet (Fellbaum, 1998) to determine whether med/event should be assigned to an NP, NPi, by checking whether NPi is related to an event, which is typically described by a verb. Specifically, we use WordNet to check whether there exists a verb, v, preceding NPi such that v and NPi have the same hypernym. If so, we assign NPi the subtype med/event. Note that we ensure that the hypernym they share does not appear in the top five levels of the WordNet noun and verb hierarchies, since we want them to be related via a concept that is not overly general.

Rule 16 identifies instances of med/general. The majority of its members are generally-known entities, whose identification is difficult as it requires world knowledge. Consequently, we apply this rule only after all other med rules are applied. As we can see, the rule assigns med/general to NPs that are named entities (NEs) and definite descriptions (specifically those NPs that start with the). The reason is simple. Most NEs are generally known. Definite descriptions are typically not new, so it seems reasonable to assign med/general to them given that the remaining (i.e., unlabeled) NPs are presumably either new or med/general.

Before Rule 18, which assigns an NP to the new class by default, we have a memorization rule that checks whether the NP under consideration appears in the training set (Rule 17). If so, we assign to it its most frequent subtype based on its occurrences in the training set. In essence, this heuristic rule can help classify some of the NPs that are somehow missed by the first 16 rules.

The ordering of these rules has a direct impact on performance of the ruleset, so a natural question is: what criteria did we use to order the rules? We order them in such a way that they respect the total ordering on the subtypes imposed by Nissim's (2003) preference relation (see Section 3), except that we give med/general a lower priority than Nissim due to the difficulty involved in identifying generally known entities, as noted above.

5 Learning-Based Approach

In this section, we describe our learning-based approach to fine-grained IS determination. Since we aim to automatically label an NP with its IS subtype, we create one training/test instance from each hand-annotated NP in the training/test set. Each instance is represented using five types of features, as described below.

Unigrams (119704). We create one binary feature for each unigram appearing in the training set. Its value indicates the presence or absence of the unigram in the NP under consideration.

Markables (209751). We create one binary feature for each markable (i.e., an NP having an IS subtype) appearing in the training set. Its value is 1 if and only if the markable has the same string as the NP under consideration.

Markable predictions (17). We create 17 binary features, 16 of which correspond to the 16 IS subtypes and the remaining one corresponds to a dummy subtype.

6 We use ReVerb ClueWeb09 Extractions 1.1, which is available from http://reverb.cs.washington.edu/reverb_clueweb_tuples-1.1.txt.gz.
edu/reverb_clueweb_tuples-1.1.txt.gz. a dummy subtype. Specifically, if the NP un-

803
der consideration appears in the training set, we 6 Evaluation
use Rule 17 in our hand-crafted ruleset to deter-
mine the IS subtype it is most frequently associ- Next, we evaluate the rule-based approach and
ated with in the training set, and then set the value the learning-based approach to determining the IS
of the feature corresponding to this IS subtype to subtype of each hand-annotated NP in the test set.
1. If the NP does not appear in the training set, we Classification results. Table 3 shows the results
set the value of the dummy subtype feature to 1. of the two approaches. Specifically, row 1 shows
Rule conditions (17). As mentioned before, we their accuracy, which is defined as the percent-
can create features based on the hand-crafted rules age of correctly classified instances. For each
in Section 4. To describe these features, let us in- approach, we present results that are generated
troduce some notation. Let Rule i be denoted by based on gold coreference chains as well as auto-
Ai Bi , where Ai is the condition that must matic chains computed by the Stanford resolver.
be satisfied before the rule can be applied and Bi As we can see, the rule-based approach
is the IS subtype predicted by the rule. We could achieves accuracies of 66.0% (gold coreference)
create one binary feature from each Ai , and set its and 57.4% (Stanford coreference), whereas the
value to 1 if Ai is satisfied by the NP under con- learning-based approach achieves accuracies of
sideration. These features, however, fail to cap- 86.4% (gold) and 78.7% (Stanford). In other
ture a crucial aspect of the ruleset: the ordering of words, the gold coreference results are better than
the rules. For instance, Rule i should be applied the Stanford coreference results, and the learning-
only if the conditions of the first i 1 rules are not based results are better than the rule-based results.
satisfied by the NP, but such ordering is not en- While perhaps neither of these results are surpris-
coded in these features. To address this problem, ing, we are pleasantly surprised by the extent to
we capture rule ordering information by defining which the learned classifier outperforms the hand-
binary feature fi as A1 A2 . . . Ai1 Ai , crafted rules: accuracies increase by 20.4% and
where 1 i 16. In addition, we define a fea- 21.3% when gold coreference and Stanford coref-
ture, f18 , for the default rule (Rule 18) in a simi- erence are used, respectively. In other words, ma-
lar fashion, but since it does not have any condi- chine learning has transformed a ruleset that
tion, we simply define f18 as A1 . . . A16 . achieves mediocre performance into a system that
The value of a feature in this feature group is 1 achieves relatively high performance.
if and only if the NP under consideration satis- These results also suggest that coreference
fies the condition defined by the feature. Note that plays a crucial role in IS subtype determination:
we did not create any features from Rule 17 here, accuracies could increase by up to 7.78.6% if
since we have already generated markables and we solely improved coreference resolution perfor-
markable prediction features for it. mance. This is perhaps not surprising: IS and
coreference can mutually benefit from each other.
Rule predictions (17). None of the features fi s
To gain additional insight into the task, we also
defined above makes use of the predictions of our
show in rows 217 of Table 3 the performance
hand-crafted rules (i.e., the Bi s). To make use
on each of the 16 subtypes, expressed in terms of
of these predictions, we define 17 binary features,
recall (R), precision (P), and F-score (F). A few
one for each Bi , where i = 1, . . . , 16, 18. Specif-
points deserve mention. First, in comparison to
ically, the value of the feature corresponding to
the rule-based approach, the learning-based ap-
Bi is 1 if and only if fi is 1, where fi is a rule
proach achieves considerably better performance
condition feature as defined above.
on almost all classes. One that is of particular in-
Since IS subtype determination is a 16-class terest is the new class. As we can see in row 17,
classification problem, we train a multi-class its F-score rises by about 30 points. These gains
SVM classifier on the training instances using are accompanied by a simultaneous rise in recall
SVMmulticlass (Tsochantaridis et al., 2004), and and precision. In particular, recall increases by
use it to make predictions on the test instances.7 about 40 points. Now, recall from the introduc-
7
For all the experiments involving SVMmulticlass , we to overfitting (by setting C to a small value) tends to yield
set C, the regularization parameter, to 500,000, since pre- poorer classification performance. The remaining learning
liminary experiments indicate that preferring generalization parameters are set to their default values.

804
Rule-Based Approach Learning-Based Approach
Gold Coreference Stanford Coreference Gold Coreference Stanford Coreference
1 Accuracy 66.0 57.4 86.4 78.7
IS Subtype R P F R P F R P F R P F
2 old/ident 77.5 78.2 77.8 66.1 52.7 58.7 82.8 85.2 84.0 75.8 64.2 69.5
3 old/event 98.6 50.4 66.7 71.3 43.2 53.8 98.3 87.9 92.8 2.4 31.8 4.5
4 old/general 81.9 82.7 82.3 72.3 83.6 77.6 97.7 93.7 95.6 87.8 92.7 90.2
5 old/generic 55.9 55.2 55.5 39.2 39.8 39.5 76.1 87.3 81.3 39.9 85.9 54.5
6 old/ident generic 48.7 77.7 59.9 27.2 51.8 35.7 57.1 87.5 69.1 47.2 44.8 46.0
7 old/relative 55.0 69.2 61.3 55.1 63.4 59.0 98.0 63.0 76.7 99.0 37.5 54.4
8 med/general 29.9 19.8 23.8 29.5 19.6 23.6 91.2 87.7 89.4 84.0 72.2 77.7
9 med/bound 56.4 20.5 30.1 56.4 20.5 30.1 25.7 65.5 36.9 2.7 40.0 5.1
10 med/part 19.5 100.0 32.7 19.5 100.0 32.7 73.2 96.8 83.3 73.2 96.8 83.3
11 med/situation 28.7 100.0 44.6 28.7 100.0 44.6 68.4 95.4 79.7 68.0 97.7 80.2
12 med/event 10.5 100.0 18.9 10.5 100.0 18.9 46.3 100.0 63.3 46.3 100.0 63.3
13 med/set 82.9 61.8 70.8 78.0 59.4 67.4 90.4 87.8 89.1 88.4 86.0 87.2
14 med/poss 52.9 86.0 65.6 52.9 86.0 65.6 93.2 92.4 92.8 90.5 97.6 93.9
15 med/func value 81.3 74.3 77.6 81.3 74.3 77.6 88.1 85.9 87.0 88.1 85.9 87.0
16 med/aggregation 57.4 44.0 49.9 57.4 43.6 49.6 85.2 72.9 78.6 83.8 93.9 88.6
17 new 50.4 65.7 57.0 50.3 65.1 56.7 90.3 84.6 87.4 90.4 83.6 86.9

Table 3: IS subtype accuracies and F-scores. In each row, the strongest result, as well as those that are statistically
indistinguishable from it according to the paired t-test (p < 0.05), are boldfaced.
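For reference, the numbers reported in Table 3 are straightforward to compute from system output: overall accuracy is the fraction of correctly labeled NPs, and each subtype gets its own recall, precision, and F-score. The gold and predicted label lists below are illustrative only.

```python
# Accuracy plus per-subtype recall, precision, and F-score, as reported in Table 3.
from collections import Counter

def prf_per_class(gold, pred):
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for label in set(gold) | set(pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = (rec, prec, f1)   # same R, P, F ordering as Table 3
    return scores

gold = ["new", "med/set", "new", "old/identity"]
pred = ["new", "new", "new", "old/identity"]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)                          # 0.75
print(prf_per_class(gold, pred)["new"])  # recall 1.0, precision ~0.67, F ~0.8
```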

tion that previous attempts on 3-class IS determi- and 10.5 for event. Nevertheless, the learning
nation by Nissim and R&N have achieved poor algorithm has again discovered a profitable way
performance on the new class. We hypothesize to combine the available features, enabling the F-
that the use of shallow features in their approaches scores of these classes to increase by 35.150.6%.
were responsible for the poor performance they While most classes are improved by machine
observed, and that using our knowledge-rich fea- learning, the same is not true for old/event and
ture set could improve its performance. We will med/bound, whose F-scores are 4.5% (row 3) and
test this hypothesis at the end of this section. 5.1% (row 9), respectively, when Stanford coref-
Other subtypes that are worth discussing erence is employed. This is perhaps not surpris-
are med/aggregation, med/func value, and ing. Recall that the multi-class SVM classifier
med/poss. Recall that the rules we designed for was trained to maximize classification accuracy.
these classes were only crude approximations, or, Hence, if it encounters a class that is both difficult
perhaps more precisely, simplified versions of the to learn and is under-represented, it may as well
definitions of the corresponding subtypes. For aim to achieve good performance on the easier-
instance, to determine whether an NP belongs to to-learn, well-represented classes at the expense
med/aggregation, we simply look for occurrences of these hard-to-learn, under-represented classes.
of and and or (Rule 9), whereas its definition Feature analysis. In an attempt to gain addi-
requires that not all of the NPs in the coordinated tional insight into the performance contribution
phrase are new. Despite the over-simplicity of each of the five types of features used in the
of these rules, machine learning has enabled learning-based approach, we conduct feature ab-
the available features to be combined in such a lation experiments. Results are shown in Table 4,
way that high performance is achieved for these where each row shows the accuracy of the classi-
classes (see rows 1416). fier trained on all types of features except for the
Also worth examining are those classes for one shown in that row. For easy reference, the
which the hand-crafted rules rely on sophisti- accuracy of the classifier trained on all types of
cated knowledge sources. They include med/part, features is shown in row 1 of the table. According
which relies on ReVerb; med/situation, which re- to the paired t-test (p < 0.05), performance drops
lies on FrameNet; and med/event, which relies on significantly whichever feature type is removed.
WordNet. As we can see from the rule-based re- This suggests that all five feature types are con-
sults (rows 1012), these knowledge sources have tributing positively to overall accuracy. Also, the
yielded rules that achieved perfect precision but markables features are the least important in the
low recall: 19.5% for part, 28.7% for situation, presence of other feature groups, whereas mark-

Feature Type Gold Coref Stanford Coref Feature Type Gold Coref Stanford Coref
All features 86.4 78.7 All rules 66.0 57.4
rule predictions 77.5 70.0 memorization 62.6 52.0
markable predictions 72.4 64.7 ReVerb 64.2 56.6
rule conditions 81.1 71.0 cue words 63.8 54.0
unigrams 74.4 58.6
markables 83.2 75.5
Table 6: Accuracies of the simplified ruleset.

Table 4: Accuracies of feature ablation experiments. R&Ns Features Our Features


IS Type R P F R P F
Feature Type Gold Coref Stanford Coref old 93.5 95.8 94.6 93.8 96.4 95.1
rule predictions 49.1 45.2 med 89.3 71.2 79.2 93.3 86.0 89.5
markable predictions 39.7 39.7 new 34.6 71.7 46.7 82.4 72.7 87.2
rule conditions 58.1 28.9 Accuracy 82.9 91.7
unigrams 56.8 56.8
markables 10.4 10.4
Table 7: Accuracies on IS types.

Table 5: Accuracies of classifiers for each feature type.


IS type results. We hypothesized earlier that
the poor performance reported by Nissim and
able predictions and unigrams are the two most R&N on identifying new entities in their 3-class
important feature groups. IS classification experiments (i.e., classifying an
To get a better idea of the utility of each feature NP as old, med, or new) could be attributed to
type, we conduct another experiment in which we their sole reliance on lexico-syntactic features. To
train five classifiers, each of which employs ex- test this hypothesis, we (1) train a 3-class classi-
actly one type of features. The accuracies of these fier using the five types of features we employed
classifiers are shown in Table 5. As we can see, in our learning-based approach, computing the
the markables features have the smallest contribu- features based on the Stanford coreference chains;
tion, whereas unigrams have the largest contribu- and (2) compare its results against those obtained
tion. Somewhat interesting are the results of the via the lexico-syntactic approach in R&N on our
classifiers trained on the rule conditions: the rules test set. Results of these experiments, which are
are far more effective when gold coreference is shown in Table 7, substantiate our hypothesis:
used. This can be attributed to the fact that the when we replace R&Ns features with ours, accu-
design of the rules was based in part on the defini- racy rises from 82.9% to 91.7%. These gains can
tions of the subtypes, which assume the availabil- be attributed to large improvements in identifying
ity of perfect coreference information. new and med entities, for which F-scores increase
by about 40 points and 10 points, respectively.
Knowledge source analysis. To gain some in-
sight into the extent to which a knowledge source 7 Conclusions
or a rule contributes to the overall performance of
the rule-based approach, we conduct ablation ex- We have examined the fine-grained IS determi-
periments: in each experiment, we measure the nation task. Experiments on a set of Switch-
performance of the ruleset after removing a par- board dialogues show that our learning-based ap-
ticular rule or knowledge source from it. Specifi- proach, which uses features that include hand-
cally, rows 24 of Table 6 show the accuracies of crafted rules and their predictions, outperforms its
the ruleset after removing the memorization rule rule-based counterpart by more than 20%, achiev-
(Rule 17), the rule that uses ReVerbs output (Rule ing an overall accuracy of 78.7% when relying on
12), and the cue words used in Rules 4 and 10, automatically computed coreference information.
respectively. For easy reference, the accuracy of In addition, we have achieved state-of-the-art re-
the original ruleset is shown in row 1 of the ta- sults on the 3-class IS determination task, in part
ble. According to the paired t-test (p < 0.05), due to our reliance on richer knowledge sources
performance drops significantly in all three abla- in comparison to prior work. To our knowledge,
tion experiments. This suggests that the memo- there has been little work on automatic IS subtype
rization rule, ReVerb, and the cue words all con- determination. We hope that our work can stimu-
tribute positively to the accuracy of the ruleset. late further research on this task.

Acknowledgments Malvina Nissim, Shipra Dingare, Jean Carletta, and
Mark Steedman. 2004. An annotation scheme for
We thank the three anonymous reviewers for their information status in dialogue. In Proceedings of
detailed and insightful comments on an earlier the 4th International Conference on Language Re-
draft of the paper. This work was supported sources and Evaluation, pages 10231026.
in part by NSF Grants IIS-0812261 and IIS- Malvina Nissim. 2003. Annotation scheme
1147644. for information status in dialogue. Available
from http://www.stanford.edu/class/
cs224u/guidelines-infostatus.pdf.
References Malvina Nissim. 2006. Learning information status of
discourse entities. In Proceedings of the 2006 Con-
Collin F. Baker, Charles J. Fillmore, and John B.
ference on Empirical Methods in Natural Language
Lowe. 1998. The Berkeley FrameNet project.
Processing, pages 94102.
In Proceedings of the 36th Annual Meeting of the
Ellen F. Prince. 1981. Toward a taxonomy of given-
Association for Computational Linguistics and the
new information. In P. Cole, editor, Radical Prag-
17th International Conference on Computational
matics, pages 223255. New York, N.Y.: Academic
Linguistics, Volume 1, pages 8690.
Press.
Sasha Calhoun, Jean Carletta, Jason Brenier, Neil
Ellen F. Prince. 1992. The ZPG letter: Subjects,
Mayo, Dan Jurafsky, Mark Steedman, and David
definiteness, and information-status. In Discourse
Beaver. 2010. The NXT-format Switchboard cor-
Description: Diverse Analysis of a Fund Raising
pus: A rich resource for investigating the syntax, se-
Text, pages 295325. John Benjamins, Philadel-
mantics, pragmatics and prosody of dialogue. Lan-
phia/Amsterdam.
guage Resources and Evaluation, 44(4):387419.
Miriam Eckert and Michael Strube. 2001. Dialogue Altaf Rahman and Vincent Ng. 2011. Learning the
acts, synchronising units and anaphora resolution. information status of noun phrases in spoken dia-
Journal of Semantics, 17(1):5189. logues. In Proceedings of the 2011 Conference on
Empirical Methods in Natural Language Process-
Anthony Fader, Stephen Soderland, and Oren Etzioni.
ing, pages 10691080.
2011. Identifying relations for open information ex-
traction. In Proceedings of the 2011 Conference on Arndt Riester, David Lorenz, and Nina Seemann.
Empirical Methods in Natural Language Process- 2010. A recursive annotation scheme for referential
ing, pages 15351545. information status. In Proceedings of the Seventh
Christiane Fellbaum. 1998. WordNet: An Electronic International Conference on Language Resources
Lexical Database. MIT Press, Cambridge, MA. and Evaluation, pages 717722.
Caroline Gasperin and Ted Briscoe. 2008. Statisti- Mark Steedman. 2000. The Syntactic Process. The
cal anaphora resolution in biomedical texts. In Pro- MIT Press, Cambridge, MA.
ceedings of the 22nd International Conference on Ioannis Tsochantaridis, Thomas Hofmann, Thorsten
Computational Linguistics, pages 257264. Joachims, and Yasemin Altun. 2004. Support vec-
Michael Gotze, Thomas Weskott, Cornelia En- tor machine learning for interdependent and struc-
driss, Ines Fiedler, Stefan Hinterwimmer, Svetlana tured output spaces. In Proceedings of the 21st
Petrova, Anne Schwarz, Stavros Skopeteas, and International Conference on Machine Learning,
Ruben Stoel. 2007. Information structure. In pages 104112.
Working Papers of the SFB632, Interdisciplinary Enric Vallduv. 1992. The Informational Component.
Studies on Information Structure (ISIS). Potsdam: Garland, New York.
Universitatsverlag Potsdam.
Eva Hajicova. 1984. Topic and focus. In Contri-
butions to Functional Syntax, Semantics, and Lan-
guage Comprehension (LLSEE 16), pages 189202.
John Benjamins, Amsterdam.
Michael A. K. Halliday. 1976. Notes on transitiv-
ity and theme in English. Journal of Linguistics,
3(2):199244.
Heeyoung Lee, Yves Peirsman, Angel Chang,
Nathanael Chambers, Mihai Surdeanu, and Dan Ju-
rafsky. 2011. Stanfords multi-pass sieve corefer-
ence resolution system at the CoNLL-2011 shared
task. In Proceedings of the Fifteenth Confer-
ence on Computational Natural Language Learn-
ing: Shared Task, pages 2834.

Composing extended top-down tree transducers
Aurelie Lagoutte
Ecole normale superieure de Cachan, Departement Informatique
alagoutt@dptinfo.ens-cachan.fr

Fabienne Braune and Daniel Quernheim and Andreas Maletti


University of Stuttgart, Institute for Natural Language Processing
{braunefe,daniel,maletti}@ims.uni-stuttgart.de

Abstract RC
C
PREL C 7
A composition procedure for linear and NP VP
nondeleting extended top-down tree trans- that NP VP
ducers is presented. It is demonstrated that C C
the new procedure is more widely applica-
NP VP 7 NP VP
ble than the existing methods. In general,
the result of the composition is an extended VAUX VPART NP VAUX NP VPART
top-down tree transducer that is no longer
linear or nondeleting, but in a number of Figure 1: Word drop [top] and reordering [bottom].
cases these properties can easily be recov-
ered by a post-processing step.
The newswire reported yesterday that the Serbs have
completed the negotiations.
1 Introduction Gestern [Yesterday] berichtete [reported] die [the]
Nachrichtenagentur [newswire] die [the] Serben
Tree-based translation models such as syn- [Serbs] hatten [would have] die [the] Verhandlungen
chronous tree substitution grammars (Eisner, [negotiations] beendet [completed].
2003; Shieber, 2004) or multi bottom-up tree
transducers (Lilin, 1978; Engelfriet et al., 2009; The relation between them can be described
Maletti, 2010; Maletti, 2011) are used for sev- (Yamada and Knight, 2001) by three operations:
eral aspects of syntax-based machine transla- drop of the relative pronoun, movement of the
tion (Knight and Graehl, 2005). Here we consider participle to end of the clause, and word-to-word
the extended top-down tree transducer (XTOP), translation. Figure 1 shows the first two oper-
which was studied in (Arnold and Dauchet, ations, and Figure 2 shows ln-XTOP rules per-
1982; Knight, 2007; Graehl et al., 2008; Graehl forming them. Let us now informally describe
et al., 2009) and implemented in the toolkit the execution of an ln-XTOP on the top rule
T IBURON (May and Knight, 2006; May, 2010). of Figure 2. In general, ln-XTOPs process an in-
Specifically, we investigate compositions of linear put tree from the root towards the leaves using
and nondeleting XTOPs (ln-XTOP). Arnold and a set of rules and states. The state p in the left-
Dauchet (1982) showed that ln-XTOPs compute hand side of controls the particular operation of
a class of transformations that is not closed under Figure 1 [top]. Once the operation has been per-
composition, so we cannot compose two arbitrary formed, control is passed to states pNP and pVP ,
ln-XTOPs into a single ln-XTOP. However, we which use their own rules to process the remain-
will show that ln-XTOPs can be composed into a ing input subtree governed by the variable below
(not necessarily linear or nondeleting) XTOP. To them (see Figure 2). In the same fashion, an ln-
illustrate the use of ln-XTOPs in machine transla- XTOP containing the bottom rule of Figure 2 re-
tion, we consider the following English sentence orders the English verbal complex.
together with a German reference translation: In this way we model the word drop by an ln-

All authors were financially supported by the E MMY
XTOP M and reordering by an ln-XTOP N . The
N OETHER project MA / 4959 / 1-1 of the German Research syntactic properties of linearity and nondeletion
Foundation (DFG). yield nice algorithmic properties, and the mod-

p ()
C
RC q (1) (3)
pNP pVP (2)
q
PREL C (11)
y1 y2 x1 (21) q (22) (31) x1
that y1 y2
(221) p
q C x2 p(311)
x3
C qNP VP (3111)
x3
z1 VP z1 qVA qVP qNP
z2 z3 z4 z2 z4 z3 Figure 3: Linear normalized tree t T (Q(X)) [left]
and t[]2 [right] with var(t) = {x1 , x2 , x3 }. The posi-
Figure 2: XTOP rules for the operations of Figure 1. tions are indicated in t as superscripts. The subtree t|2
is (, q(x2 )).

ular approach is desirable for better design and


the composed XTOP has only bounded overlap-
parametrization of the translation model (May et
ping cuts, post-processing will get rid of them
al., 2010). Composition allows us to recombine
and restore an ln-XTOP. In the remaining cases,
those parts into one device modeling the whole
in which unbounded overlapping is necessary or
translation. In particular, it gives all parts the
occurs in the syntactic form but would not be nec-
chance to vote at the same time. This is especially
essary, we will compute an XTOP. This is still
important if pruning is used because it might oth-
an improvement on the existing methods that just
erwise exclude candidates that score low in one
fail. Since general XTOPs are implemented in
part but well in others (May et al., 2010).
T IBURON and the new composition covers (essen-
Because ln-XTOP is not closed under compo- tially) all cases currently possible, our new com-
sition, the composition of M and N might be out- position procedure could replace the existing one
side ln-XTOP. These cases have been identified in T IBURON. Our approach to composition is the
by Arnold and Dauchet (1982) as infinitely over- same as in (Engelfriet, 1975; Baker, 1979; Maletti
lapping cuts, which occur when the right-hand and Vogler, 2010): We simply parse the right-
sides of M and the left-hand sides of N are un- hand sides of the XTOP M with the left-hand
boundedly overlapping. This can be purely syn- sides of the XTOP N . However, to facilitate this
tactic (for a given ln-XTOP) or semantic (inher- approach we have to adjust the XTOPs M and N
ent in all ln-XTOPs for a given transformation). in two pre-processing steps. In a first step we cut
Despite the general impossibility, several strate- left-hand sides of rules of N into smaller pieces,
gies have been developed: (i) Extension of the which might introduce non-linearity and deletion
model (Maletti, 2010; Maletti, 2011), (ii) online into N . In certain cases, this can also intro-
composition (May et al., 2010), and (iii) restric- duce finite look-ahead (Engelfriet, 1977; Graehl
tion of the model, which we follow. Composi- et al., 2009). To compensate, we expand the rules
tions of subclasses in which the XTOP N has at of M slightly. Section 4 explains those prepa-
most one input symbol in its left-hand sides have rations. Next, we compose the prepared XTOPs
already been studied in (Engelfriet, 1975; Baker, as usual and obtain a single XTOP computing the
1979; Maletti and Vogler, 2010). Such compo- composition of the transformations computed by
sitions are implemented in the toolkit T IBURON. M and N (see Section 5). Finally, we apply a
However, there are translation tasks in which the post-processing step to expand rules to reobtain
used XTOPs do not fulfill this requirement. Sup- linearity and nondeletion. Clearly, this cannot be
pose that we simply want to compose the rules of successful in all cases, but often removes the non-
Figure 2, The bottom rule does not satisfy the re- linearity introduced in the pre-processing step.
quirement that there is at most one input symbol
in the left-hand side. 2 Preliminaries
We will demonstrate how to compose two lin-
ear and nondeleting XTOPs into a single XTOP, Our trees have labels taken from an alphabet
which might however no longer be linear or non- of symbols, and in addition, leaves might be
deleting. However, when the syntactic form of labeled by elements of the countably infinite

qS
S
x1 S

7

[ qV qNP qNP
x3 x1 VP
x2 x1 x1
x2 x2 x2 x3

Figure 4: Substitution where (x1 ) = , (x2 ) = x2 , t


and (x3 ) = ((, , x2 )).
qS t

set X = {x1 , x2 , . . . } of formal variables. For- S S



mally, for every V X the set T (V ) of qV qNP qNP
-trees with V -leaves is the smallest set such that VP
t1
V T (V ) and (t1 , . . . , tk ) T (V ) for all
t2 t1 t1
k N, , and t1 , . . . , tk T (V ). To avoid
excessive universal quantifications, we drop them t2 t3
if they are obvious from the context.
For each tree t T (X) we identify nodes by Figure 5: Rule and its use in a derivation step.
positions. The root of t has position and the po-
sition iw with i N and w N addresses the
for all and t1 , . . . , tk T (X). The effect
position w in the i-th direct subtree at the root.
of a substitution is displayed in Figure 4. Two
The set of all positions in t is pos(t). We write
substitutions , 0 : X T (X) can be com-
t(w) for the label (taken from X) of t at po-
posed to form a substitution 0 : X T (X)
sition w pos(t). Similarly, we use
such that 0 (x) = (x)0 for every x X.
t|w to address the subtree of t that is rooted
Next, we define two notions of compatibility
in position w, and
for trees. Let t, t0 T (X) be two trees. If there
t[u]w to represent the tree that is ob-
exists a substitution such that t0 = t, then t0 is
tained from replacing the subtree t|w at w
an instance of t. Note that this relation is not sym-
by u T (X).
metric. A unifier for t and t0 is a substitution
For a given set L X of labels, we let
such that t = t0 . The unifier is a most gen-
posL (t) = {w pos(t) | t(w) L} eral unifier (short: mgu) for t and t0 if for every
unifier 00 for t and t0 there exists a substitution 0
be the set of all positions whose label belongs such that 0 = 00 . The set mgu(t, t0 ) is the set of
to L. We also write posl (t) instead of pos{l} (t). all mgus for t and t0 . Most general unifiers can be
The tree t T (V ) is linear if |posx (t)| 1 for computed efficiently (Robinson, 1965; Martelli
every x X. Moreover, and Montanari, 1982) and all mgus for t and t0
are equal up to a variable renaming.
var(t) = {x X | posx (t) 6= }
Example 1. Let t = (x1 , ((, , x2 ))) and
collects all variables that occur in t. If the vari- t0 = (, x3 ). Then mgu(t, t0 ) contains such
ables occur in the order x1 , x2 , . . . in a pre-order that (x1 ) = and (x3 ) = ((, , x2 )). Fig-
traversal of the tree t, then t is normalized. Given ure 4 illustrates the unification.
a finite set Q, we write Q(T ) with T T (X)
for the set {q(t) | q Q, t T }. We will treat 3 The model
elements of Q(T ) as special trees of TQ (X).
The previous notions are illustrated in Figure 3. The discussed model in this contribution is an
A substitution is a mapping : X T (X). extension of the classical top-down tree trans-
When applied to a tree t T (X), it will return ducer, which was introduced by Rounds (1970)
the tree t, which is obtained from t by replacing and Thatcher (1970). The extended top-down
all occurrences of x X (in parallel) by (x). tree transducer with finite look-ahead or just
This can be defined recursively by x = (x) for XTOPF and its variations were studied in (Arnold
all x X and (t1 , . . . , tk ) = (t1 , . . . , tk ) and Dauchet, 1982; Knight and Graehl, 2005;

qS S RC
S qS p
S qNP VP
qV qNP qNP S PREL C
x1 VP x1 qV qNP C
x2 x1 x3 x2 x1 x3 that pNP pVP
x2 x3 x2 x3 y1 y2
y1 y2
Figure 6: Rule [left] and reversed rule [right].
Figure 7: Top rule of Figure 2 reversed.
Knight, 2007; Graehl et al., 2008; Graehl et al., 2009). Formally, an extended top-down tree transducer with finite look-ahead (XTOP^F) is a system M = (Q, Σ, Δ, I, R, c) where

• Q is a finite set of states,
• Σ and Δ are alphabets of input and output symbols, respectively,
• I ⊆ Q is a set of initial states,
• R is a finite set of (rewrite) rules of the form ℓ → r where ℓ ∈ Q(T_Σ(X)) is linear and r ∈ T_Δ(Q(var(ℓ))), and
• c : R × X → T_Σ(X) assigns a look-ahead restriction to each rule and variable such that c(ρ, x) is linear for each ρ ∈ R and x ∈ X.

The XTOP^F M is linear (respectively, nondeleting) if r is linear (respectively, var(r) = var(ℓ)) for every rule ℓ → r ∈ R. It has no look-ahead (or it is an XTOP) if c(ρ, x) ∈ X for all rules ρ ∈ R and x ∈ X. In this case, we drop the look-ahead component c from the description. A rule ℓ → r ∈ R is consuming (respectively, producing) if pos_Σ(ℓ) ≠ ∅ (respectively, pos_Δ(r) ≠ ∅). We let Lhs(M) = {l | ∃q, r : q(l) → r ∈ R}.

Let M = (Q, Σ, Δ, I, R, c) be an XTOP^F. In order to facilitate composition, we define sentential forms more generally than immediately necessary. Let Σ′ and Δ′ be such that Σ ⊆ Σ′ and Δ ⊆ Δ′. To keep the presentation simple, we assume that Q ∩ (Σ′ ∪ Δ′) = ∅. A sentential form of M (using Σ′ and Δ′) is a tree of SF(M) = T_{Δ′}(Q(T_{Σ′})). For every ξ, ζ ∈ SF(M), we write ξ ⇒_M ζ if there exist a position w ∈ pos_Q(ξ), a rule ρ = ℓ → r ∈ R, and a substitution θ : X → T_{Σ′} such that θ(x) is an instance of c(ρ, x) for every x ∈ X and ξ = ξ[ℓθ]_w and ζ = ξ[rθ]_w. If the applicable rules are restricted to a certain subset R′ ⊆ R, then we also write ⇒_{R′}. Figure 5 illustrates a derivation step. The tree transformation computed by M is

τ_M = {(t, u) ∈ T_Σ × T_Δ | ∃q ∈ I : q(t) ⇒*_M u}

where ⇒*_M is the reflexive, transitive closure of ⇒_M. It can easily be verified that the definition of τ_M is independent of the choice of Σ′ and Δ′. Moreover, it is known (Graehl et al., 2009) that each XTOP^F can be transformed into an equivalent XTOP preserving both linearity and nondeletion. However, the notion of XTOP^F will be convenient in our composition construction. A detailed exposition to XTOPs is presented by Arnold and Dauchet (1982) and Graehl et al. (2009).

A linear and nondeleting XTOP M with rules R can easily be reversed to obtain a linear and nondeleting XTOP M⁻¹ with rules R⁻¹, which computes the inverse transformation τ_{M⁻¹} = τ_M⁻¹, by reversing all its rules. A (suitable) rule is reversed by exchanging the locations of the states. More precisely, given a rule q(l) → r ∈ R, we obtain the rule q(r′) → l′ of R⁻¹, where l′ = lθ and r′ is the unique tree such that there exists a substitution θ : X → Q(X) with θ(x) ∈ Q({x}) for every x ∈ X and r = r′θ. Figure 6 displays a rule and its corresponding reversed rule. The reversed form of the XTOP rule modeling the insertion operation in Figure 2 is displayed in Figure 7.

Finally, let us formally define composition. The XTOP M computes the tree transformation τ_M ⊆ T_Σ × T_Δ. Given another XTOP N that computes a tree transformation τ_N ⊆ T_Δ × T_Γ, we might be interested in the tree transformation computed by the composition of M and N (i.e., running M first and then N). Formally, the composition τ_M ; τ_N of the tree transformations τ_M and τ_N is defined by

τ_M ; τ_N = {(s, u) | ∃t : (s, t) ∈ τ_M, (t, u) ∈ τ_N}

and we often also use the notion composition for XTOPs with the expectation that the composition of M and N computes exactly τ_M ; τ_N.

4 Pre-processing

We want to compose two linear and nondeleting XTOPs M = (P, Σ, Δ, I_M, R_M) and

Figure 8: Incompatible left-hand sides of Example 3.

Figure 9: Rules used in Example 5.

N = (Q, , , IN , RN ). Before we actually per-


form the composition, we will prepare M and N Intuitively, for every -labeled position w in a
in two pre-processing steps. After these two steps, right-hand side r1 of M and any left-hand side l2
the composition is very simple. To avoid com- of N , we require (ignoring the states) that either
plications, we assume that (i) all rules of M are (i) r1 |w and l2 are not unifiable or (ii) r1 |w is an
producing and (ii) all rules of N are consuming. instance of l2 .
For convenience, we also assume that the XTOPs Example 3. The XTOPs for the English-to-
M and N only use variables of the disjoint sets German translation task in the Introduction are
Y X and Z X, respectively. not compatible. This can be observed on the
left-hand side l1 Lhs(M 1 ) of Figure 7
4.1 Compatibility and the left-hand side l2 Lhs(N ) of Fig-
ure 2[bottom]. These two left-hand sides are il-
In the existing composition results for subclasses
lustrated in Figure 8. Between them there is an
of XTOPs (Engelfriet, 1975; Baker, 1979; Maletti
mgu such that (Y ) 6 X (e.g., (y1 ) = z1 and
and Vogler, 2010) the XTOP N has at most one
(y2 ) = VP(z2 , z3 , z4 ) is such an mgu).
input symbol in its left-hand sides. This restric-
tion allows us to match rule applications of N to Theorem 4. There exists an XTOPF N 0 that is
positions in the right-hand sides of M . Namely, equivalent to N and compatible with M .
for each output symbol in a right-hand side of M ,
Proof. We achieve compatibility by cutting of-
we can select a rule of N that can consume that
fending rules of the XTOP N into smaller pieces.
output symbol. To achieve a similar decompo-
Unfortunately, both linearity and nondeletion
sition strategy in our more general setup, we in-
of N might be lost in the process. We first let
troduce a compatibility requirement on right-hand
N 0 = (Q, , , IN , RN , cN ) be the XTOPF such
sides of M and left-hand sides of N . Roughly
that cN (, x) = x for every RN and x X.
speaking, we require that the left-hand sides of N
If N 0 is compatible with M , then we are done.
are small enough to completely process right-
Otherwise, let l1 Lhs(M 1 ) be a left-hand side,
hand sides of M . However, a comparison of
q(l2 ) r2 RN be a rule, and w pos (l1 )
left- and right-hand sides is complicated by the
be a position such that (y) / X for some
fact that their shape is different (left-hand sides
mgu(l1 |w , l2 ) and y Y . Let v posy (l1 |w )
have a state at the root, whereas right-hand sides
be the unique position of y in l1 |w .
have states in front of the variables). We avoid
Now we have to distinguish two cases: (i) Ei-
these complications by considering reversed rules
ther var(l2 |v ) = and there is no leaf in r2 la-
of M . Thus, an original right-hand side of M is
beled by a symbol from . In this case, we have
now a left-hand side in the reversed rules and thus
to introduce deletion and look-ahead into N 0 . We
has the right format for a comparison. Recall that
replace the old rule = q(l2 ) r2 by the new
Lhs(N ) contains all left-hand sides of the rules
rule 0 = q(l2 [z]v ) r2 , where z X \ var(l2 )
of N , in which the state at the root was removed.
is a variable that does not appear in l2 . In addition,
Definition 2. The XTOP N is compatible to M we let cN (0 , z) = l2 |v and cN (0 , x) = cN (, x)
if (Y ) X for all unifiers mgu(l1 |w , l2 ) for all x X \ {z}.
between a subtree at a -labeled position (ii) Otherwise, let V var(l2 |v ) be a maximal
w pos (l1 ) in a left-hand side l1 Lhs(M 1 ) set such that there exists a minimal (with respect
and a left-hand side l2 Lhs(N ). to the prefix order) position w0 pos(r2 ) with

Figure 10: Additional rule used in Example 5.

Figure 11: Rules replacing the rule in Figure 7.
var(r2|_{w′}) ⊆ var(l2|_v) and var(r2[γ]_{w′}) ∩ V = ∅, where γ is arbitrary. Let z ∈ X \ var(l2) be
a fresh variable, q 0 be a new state of N , and
Example 5. Let us consider the rules illustrated
V 0 = var(l2 |v ) \ V . We replace the rule
in Figure 9. We might first note that y1 has to
= q(l2 ) r2 of RN by
be unified with . Since does not contain any
1 = q(l2 [z]v ) trans(r2 )[q 0 (z)]w0 variables and the right-hand side of the rule of N
does not contain any non-variable leaves, we are
2 = q 0 (l2 |v ) r2 |w0 .
in case (i) in the proof of Theorem 4. Conse-
The look-ahead for z is trivial and other- quently, the displayed rule of N is replaced by a
wise we simply copy the old look-ahead, so variant, in which is replaced by a new variable z
cN (1 , z) = z and cN (1 , x) = cN (, x) for all with look-ahead .
x X \ {z}. Moreover, cN (2 , x) = cN (, x) Secondly, with this new rule there is an mgu,
for all x X. The mapping trans is given for in which y2 is mapped to (z1 , z2 ). Clearly, we
t = (t1 , . . . , tk ) and q 00 (z 00 ) Q(Z) by are now in case (ii). Furthermore, we can select
the set V = {z1 , z2 } and position w0 = . Cor-
trans(t) = (trans(t1 ), . . . , trans(tk )) respondingly, the following two new rules for N
( replace the old rule:
hl2 |v , q 00 , v 0 i(z) if z 00 V 0
trans(q 00 (z 00 )) =
q 00 (z 00 ) otherwise, q((z, z 0 )) q 0 (z 0 )
where v 0 = posz 00 (l2 |v ). q 0 ((z1 , z2 )) (q1 (z1 ), q2 (z2 )) ,
Finally, we collect all newly generated states
of the form hl, q, vi in Ql and for every such where the look-ahead for z remains .
state with l = (l1 , . . . , lk ) and v = iw, let Figure 10 displays another rule of N . There is
l0 = (z1 , . . . , zk ) and an mgu, in which y2 is mapped to (z2 , z3 ). Thus,
we end up in case (ii) again and we can select the
set V = {z2 } and position w0 = 2. Thus, we
(
q(zi ) if w =
hl, q, vi(l0 ) replace the rule of Figure 10 by the new rules
hli , q, wi(zi ) otherwise

be a new rule of N without look-ahead. q((z1 , z)) (q1 (z1 ), q 0 (z), q3 (z)) (?)
0
Overall, we run the procedure until N 0 is com- q ((z2 , z3 )) q2 (z2 )
patible with M . The procedure eventually ter- q3 ((z1 , z2 )) q3 (z2 ) ,
minates since the left-hand sides of the newly
added rules are always smaller than the replaced where q3 = h(z2 , z3 ), q3 , 2i.
rules. Moreover, each step preserves the seman-
Let us use the construction in the proof of The-
tics of N 0 , which completes the proof.
orem 4 to resolve the incompatibility (see Exam-
We note that the look-ahead of N 0 after the con- ple 3) between the XTOPs presented in the Intro-
struction used in the proof of Theorem 4 is either duction. Fortunately, the incompatibility can be
trivial (i.e., a variable) or a ground tree (i.e., a tree resolved easily by cutting the rule of N (see Fig-
without variables). Let us illustrate the construc- ure 7) into the rules of Figure 11. In this example,
tion used in the proof of Theorem 4. linearity and nondeletion are preserved.

4.2 Local determinism q i s
i
p i q q0 ps i
After the first pre-processing step, we have the ps 
ps s0 ps
original linear and nondeleting XTOP M and 
an XTOPF N 0 = (Q0 , , , IN , RN 0 , c ) that is
N y1 y2 y1 y2 y2 y1 y1
equivalent to N and compatible with M . How-
q q0 q
ever, in the first pre-processing step we might i q i
have introduced some non-linear (copying) rules s s s,s0 /0s,s0
ps p ps
in N 0 (see rule (?) in Example 5), and it is known
y1 y2 y1
that nondeterminism [in M ] followed by copy-
y1 y2 y1 y2
ing [in N 0 ] is a feature that prevents composition y1 y2 y3
to work (Engelfriet, 1975; Baker, 1979). How- q0 q0
ever, our copying is very local and the copies
are only used to project to different subtrees. 0s,s0 i i s,s0 i q q0

Nevertheless, during those projection steps, we ps0 p p s0

need to make sure that the processing in M pro- y2 y3 y1 y2 y3 y2 y3 y3
y1 y2 y3
ceeds deterministically. We immediately note that
all but one copy are processed by states of the Figure 12: Useful rules for the composition M 0 ; N 0 of
form hl, q, vi Ql . These states basically pro- Example 8, where s, s0 {, } and P(z2 ,z3 ) .
cess (part of) the tree l and project (with state q)
to the subtree at position v. It is guaranteed that
p(l) M 0 M 0 r for some that is not an
each such subtree (indicated by v) is reached only
instance of t. In other words, we construct each
once. Thus, the copying is resolved once the
rule of Rt by applying existing rules of RM in
states of the form hl, q, vi are left. To keep the
sequence to generate a (minimal) right-hand side
presentation simple, we just add expanded rules
that is an instance of t. We thus potentially make
to M such that any rule that can produce a part of
the right-hand sides of M bigger by joining sev-
a tree l immediately produces the whole tree. A
eral existing rules into a single rule. Note that
similar strategy is used to handle the look-ahead
this affects neither compatibility nor the seman-
of N 0 . Any right-hand side of a rule of M that
tics. In the second step, we add pure -rules
produces part of a left-hand side of a rule of N 0
that allow us to change the state to one that we
with look-ahead is expanded to produce the re-
constructed in the previous step. For every new
quired look-ahead immediately.
state p = p(l) r, let base(p) = p.S Then
Let L T (Z) be the set of trees l such that 0 = R 0
RM M RL RE and P = P tL Pt
hl, q, vi appears as a state of Ql , or where
l = l2 for some 2 = q(l2 ) r2 RN 0
[
0
of N with non-trivial look-ahead (i.e., RL = Rt and Pt = {`() | ` r Rt }
cN (2 , z) / X for some z X), where tL
[
(x) = cN (2 , x) for every x X. RE = {base(p)(x1 ) p(x1 ) | p Pt } .
To keep the presentation uniform, we assume tL
that for every l L, there exists a state of the Clearly, this does not change the semantics be-
form hl, q, vi Q0 . If this is not already the cause each rule of RM 0 can be simulated by a
case, then we can simply add useless states with- chain of rules of RM . Let us now do a full ex-
out rules for them. In other words, we assume that ample for the pre-processing step. We consider a
the first case applies to each l L. nondeterministic variant of the classical example
Next, we add two sets of rules to RM , which by Arnold and Dauchet (1982).
will not change the semantics but prove to be use- Example 6. Let M = (P, , , {p}, RM )
ful in the composition construction. First, for be the linear and nondeleting XTOP such that
every tree t L, let Rt contain all the rules P = {p, p , p }, = {, , , , }, and
p(l) r, where p = p(l) r is a new state RM contains the following rules
with p P , minimal normalized tree l T (X),
and an instance r T (P (X)) of t such that p((y1 , y2 )) (ps (y1 ), p(y2 )) ()

p((y1 , y2 , y3 )) (ps (y1 ), (ps0 (y2 ), p(y3 ))) hq, pi
C
p((y1 , y2 , y3 )) (ps (y1 ), (ps0 (y2 ), p (y3 )))
RC
ps (s0 (y1 )) s(ps (y1 )) hqNP , pNP i hq 0 , pVP i
ps ()  PREL C
x1 x2
that x1 x2
for every s, s0 {, }. Similarly, we let
N = (Q, , , {q}, RN ) be the linear and non- Figure 13: Composed rule created from the rule of Fig-
deleting XTOP such that Q = {q, i} and RN con- ure 7 and the rules of N 0 displayed in Figure 11.
tains the following rules
q((z1 , z2 )) (i(z1 ), i(z2 )) 5 Composition
q((z1 , (z2 , z3 ))) (i(z1 ), i(z2 ), q(z3 )) ()
Now we are ready for the actual composition. For
i(s(z1 )) s(i(z1 )) space efficiency reasons we reuse the notations
i()  used in Section 4. Moreover, we identify trees of
T (Q0 (P 0 (X))) with trees of T ((Q0 P 0 )(X)).
for all s {, }. It can easily be verified that
In other words, when meeting a subtree q(p(x))
M and N meet our requirements. However, N is
with q Q0 , p P 0 , and x X, then we also
not yet compatible with M because an mgu be-
view this equivalently as the tree hq, pi(x), which
tween rules () of M and () of N might map y2
could be part of a rule of our composed XTOP.
to (z2 , z3 ). Thus, we decompose () into
However, not all combinations of states will be
q((z1 , z)) (i(z1 ), q(z), q 0 (z)) allowed in our composed XTOP, so some combi-
q 0 ((z2 , z3 )) q(z3 ) nations will never yield valid rules.
Generally, we construct a rule of M 0 ; N 0 by ap-
q((z1 , z2 )) i(z1 )
plying a single rule of M 0 followed by any num-
where q = h(z2 , z3 ), i, 1i. This newly obtained ber of pure -rules of RE , which can turn states
XTOP N 0 is compatible with M . In addition, we base(p) into p. Then we apply any number of
only have one special tree (z2 , z3 ) that occurs in rules of N 0 and try to obtain a sentential form that
states of the form hl, q, vi. Thus, we need to com- has the required shape of a rule of M 0 ; N 0 .
pute all minimal derivations whose output trees
are instances of (z2 , z3 ). This is again simple Definition 7. Let M 0 = (P 0 , , , IM , RM0 ) and

since the first three rule schemes s , s,s0 , and 0 0 0


N = (Q , , , IN , RN ) beS the XTOPs con-
0s,s0 of M create such instances, so we simply structed in Section 4, where S 0
lL Pl P and
create copies of them: 0 00 0
S
lL Ql Q . Let Q = Q \ lL Ql . We con-
0 0
struct the XTOP M ; N = (S, , , IN IM , R)
s ((y1 , y2 )) (ps (y1 ), p(y2 ))
s,s0 ((y1 , y2 , y3 )) (ps (y1 ), (ps0 (y2 ), p(y3 ))) where
0s,s0 ((y1 , y2 , y3 )) (ps (y1 ), (ps0 (y2 ), p (y3 ))) [
S= (Ql Pl ) (Q00 P 0 )
for all s, s0 {, }. These are all the rules lL
of R(z2 ,z3 ) . In addition, we create the following
and R contains all normalized rules ` r (of the
rules of RE :
required shape) such that
p(x1 ) s (x1 ) p(x1 ) s,s0 (x1 )
` M 0 RE N 0 r
p(x1 ) 0s,s0 (x1 )
for all s, s0 {, }. for some , T (Q0 (T (P 0 (X)))).
Especially after reading the example it might The required rule shape is given by the defi-
seem useless to create the rule copies in Rl [in Ex- nition of an XTOP. Most importantly, we must
ample 6 for l = (z2 , z3 )]. However, each such have that ` S(T (X)), which we identify
rule has a distinct state at the root of the left-hand with a certain subset of Q0 (P 0 (T (X))), and
side, which can be used to trigger only this rule. r T (S(X)), which similarly corresponds to
In this way, the state selects the next rule to apply, a subset of T (Q0 (P 0 (X))). The states are sim-
which yields the desired local determinism. ply combinations of the states of M 0 and N 0 , of

q q

p p i i
i i q
ps ps0 i q q0
ps ps p
y1 y1
y1 y2 y3 y1 y2 ps00 0 0
y2 y3 y2 y3 y4 y3 y4 y4
Figure 14: Successfully expanded rule from Exam-
ple 9. Figure 15: Expanded rule that remains copying (see
Example 9).

which however the combinations of a state q Ql


with a state p
/ Pl are forbidden. This reflects the 6 Post-processing
intuition of the previous section. If we entered a Finally, we will compose rules again in an ef-
special state of the form hl, q, vi, then we should fort to restore linearity (and nondeletion). Since
use a corresponding state p Pl of M , which the composition of two linear and nondeleting
only has rules producing instances of l. We note XTOPs cannot always be computed by a single
that look-ahead of N 0 is checked normally in the XTOP (Arnold and Dauchet, 1982), this method
derivation process. can fail to return such an XTOP. The presented
Example 8. Now let us illustrate the composition method is not a characterization, which means it
on Example 6. Let us start with rule () of M . might even fail to return a linear and nondelet-
q(p((x1 , x2 ))) ing XTOP although an equivalent linear and non-
M 0 q((ps (x1 ), p(x2 ))) deleting XTOP exists. However, in a significant
number of examples, the recombination succeeds
RE q((ps (x1 ), s0 ,s00 (x2 ))) to rebuild a linear (and nondeleting) XTOP.
N 0 (i(ps (x1 )), q(s0 ,s00 (x2 )), q 0 (s0 ,s00 (x2 ))) Let M 0 ; N 0 = (S, , , I, R) be the composed
is a rule of M 0 ; N 0 for every s, s0 , s00 {, }. XTOP constructed in Section 5. We simply in-
Note if we had not applied the RE -step, then we spect each non-linear rule (i.e., each rule with a
would not have obtained a rule of M ; N (be- non-linear right-hand side) and expand it by all
cause we would have obtained the state combina- rule options at the copied variables. Since the
tion hq, pi instead of hq, s0 ,s00 i, and hq, pi is not a method is pretty standard and variants have al-
state of M 0 ; N 0 ). Let us also construct a rule for ready been used in the pre-processing steps, we
the state combination hq, s0 ,s00 i. only illustrate it on the rules of Figure 12.

q(s0 ,s00 ((x1 , x2 , x3 ))) Example 9. The first (top row, left-most) rule of
Figure 12 is non-linear in the variable y2 . Thus,
M 0 q((ps0 (x1 ), (ps00 (x2 ), p(x3 ))))
we expand the calls hq, i(y2 ) and hq 0 , i(y2 ). If
N 0 q 0 (ps0 (x1 )) = s for some s {, }, then the next rules
Finally, let us construct a rule for the state combi- are uniquely determined and we obtain the rule
nation hq 00 , s0 ,s00 i. displayed in Figure 14. Here the expansion was
successful and we could delete the original rule
q 00 (s0 ,s00 ((x1 , x2 , x3 ))) for = s and replace it by the displayed ex-
M 0 q((ps0 (x1 ), (ps00 (x2 ), p(x3 )))) panded rule. However, if = 0s0 ,s00 , then we can
RE q((ps0 (x1 ), (ps00 (x2 ), s (x3 )))) also expand the rule to obtain the rule displayed in
N 0 q((ps00 (x2 ), s (x3 ))) Figure 15. It is still copying and we could repeat
the process of expansion here, but we cannot get
N 0 (q 0 (ps00 (x1 )), q(s (x2 )), q 00 (s (x2 )))
rid of all copying rules using this approach (as ex-
for every s {, }. pected since there is no linear XTOP computing
After having pre-processed the XTOPs in our the same tree transformation).
introductory example, the devices M and N 0 can
be composed into M ; N 0 . One rule of the com-
posed XTOP is illustrated in Figure 13.
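As an illustrative aside (not part of the original construction), the set-level operations used throughout, namely inversion and the composition τ_M ; τ_N defined in Section 3, can be sketched in Python on finite relations. Encoding transformations as finite sets of pairs is an assumption of the sketch, since XTOP-computed transformations are in general infinite.

```python
# Illustrative sketch (not from the paper): tree transformations restricted to finite
# samples are sets of (input_tree, output_tree) pairs, with trees encoded as nested tuples.

def inverse(tau_m):
    """The inverse transformation: swap input and output of every pair."""
    return {(t, s) for (s, t) in tau_m}

def compose(tau_m, tau_n):
    """tau_M ; tau_N = {(s, u) | there is a t with (s, t) in tau_M and (t, u) in tau_N}."""
    outputs_of = {}
    for t, u in tau_n:
        outputs_of.setdefault(t, set()).add(u)
    return {(s, u) for (s, t) in tau_m for u in outputs_of.get(t, ())}
```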

References

André Arnold and Max Dauchet. 1982. Morphismes et bimorphismes d'arbres. Theoretical Computer Science, 20(1):33–93.
Brenda S. Baker. 1979. Composition of top-down and bottom-up tree transductions. Information and Control, 41(2):186–213.
Jason Eisner. 2003. Learning non-isomorphic tree mappings for machine translation. In Proc. ACL, pages 205–208. Association for Computational Linguistics.
Joost Engelfriet, Eric Lilin, and Andreas Maletti. 2009. Composition and decomposition of extended multi bottom-up tree transducers. Acta Informatica, 46(8):561–590.
Joost Engelfriet. 1975. Bottom-up and top-down tree transformations – A comparison. Mathematical Systems Theory, 9(3):198–231.
Joost Engelfriet. 1977. Top-down tree transducers with regular look-ahead. Mathematical Systems Theory, 10(1):289–303.
Jonathan Graehl, Kevin Knight, and Jonathan May. 2008. Training tree transducers. Computational Linguistics, 34(3):391–427.
Jonathan Graehl, Mark Hopkins, Kevin Knight, and Andreas Maletti. 2009. The power of extended top-down tree transducers. SIAM Journal on Computing, 39(2):410–430.
Kevin Knight and Jonathan Graehl. 2005. An overview of probabilistic tree transducers for natural language processing. In Proc. CICLing, volume 3406 of LNCS, pages 1–24. Springer.
Kevin Knight. 2007. Capturing practical natural language transformations. Machine Translation, 21(2):121–133.
Eric Lilin. 1978. Une généralisation des transducteurs d'états finis d'arbres: les S-transducteurs. Thèse 3ème cycle, Université de Lille.
Andreas Maletti and Heiko Vogler. 2010. Compositions of top-down tree transducers with ε-rules. In Proc. FSMNLP, volume 6062 of LNAI, pages 69–80. Springer.
Andreas Maletti. 2010. Why synchronous tree substitution grammars? In Proc. HLT-NAACL, pages 876–884. Association for Computational Linguistics.
Andreas Maletti. 2011. An alternative to synchronous tree substitution grammars. Natural Language Engineering, 17(2):221–242.
Alberto Martelli and Ugo Montanari. 1982. An efficient unification algorithm. ACM Transactions on Programming Languages and Systems, 4(2):258–282.
Jonathan May and Kevin Knight. 2006. Tiburon: A weighted tree automata toolkit. In Proc. CIAA, volume 4094 of LNCS, pages 102–113. Springer.
Jonathan May, Kevin Knight, and Heiko Vogler. 2010. Efficient inference through cascades of weighted tree transducers. In Proc. ACL, pages 1058–1066. Association for Computational Linguistics.
Jonathan May. 2010. Weighted Tree Automata and Transducers for Syntactic Natural Language Processing. Ph.D. thesis, University of Southern California, Los Angeles.
John Alan Robinson. 1965. A machine-oriented logic based on the resolution principle. Journal of the ACM, 12(1):23–41.
William C. Rounds. 1970. Mappings and grammars on trees. Mathematical Systems Theory, 4(3):257–287.
Stuart M. Shieber. 2004. Synchronous grammars as tree transducers. In Proc. TAG+7, pages 88–95.
James W. Thatcher. 1970. Generalized² sequential machine maps. Journal of Computer and System Sciences, 4(4):339–367.
Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proc. ACL, pages 523–530. Association for Computational Linguistics.

Structural and Topical Dimensions in Multi-Task Patent Translation

Katharina Wäschle and Stefan Riezler


Department of Computational Linguistics
Heidelberg University, Germany
{waeschle,riezler}@cl.uni-heidelberg.de

Abstract

Patent translation is a complex problem due to the highly specialized technical vocabulary and the peculiar textual structure of patent documents. In this paper we analyze patents along the orthogonal dimensions of topic and textual structure. We view different patent classes and different patent text sections such as title, abstract, and claims, as separate translation tasks, and investigate the influence of such tasks on machine translation performance. We study multi-task learning techniques that exploit commonalities between tasks by mixtures of translation models or by multi-task meta-parameter tuning. We find small but significant gains over task-specific training by techniques that model commonalities through shared parameters. A by-product of our work is a parallel patent corpus of 23 million German-English sentence pairs.

1 Introduction

Patents are an important tool for the protection of intellectual property and also play a significant role in business strategies in modern economies. Patent translation is an enabling technique for patent prior art search, which aims to detect a patent's novelty and thus needs to be cross-lingual for a multitude of languages. Patent translation is complicated by a highly specialized vocabulary, consisting of technical terms specific to the field of invention the patent relates to. Patents are written in a sophisticated legal jargon ("patentese") that is not found in everyday language and exhibits a complex textual structure. Also, patents are often intentionally ambiguous or vague in order to maximize the coverage of the claims.

In this paper, we analyze patents with respect to the orthogonal dimensions of topic – the technical field covered by the patent – and structure – a patent's text sections –, with respect to their influence on machine translation performance.

The topical dimension of patents is characterized by the International Patent Classification (IPC)1 which categorizes patents hierarchically into 8 sections, 120 classes, 600 subclasses, down to 70,000 subgroups at the leaf level. Table 1 shows the 8 top level sections.

A  Human Necessities
B  Performing Operations, Transporting
C  Chemistry, Metallurgy
D  Textiles, Paper
E  Fixed Constructions
F  Mechanical Engineering, Lighting, Heating, Weapons
G  Physics
H  Electricity

Table 1: IPC top level sections.

Orthogonal to the patent classification, patent documents can be sub-categorized along the dimension of textual structure. Article 78.1 of the European Patent Convention (EPC) lists all sections required in a patent document2:

A European patent application shall contain:

(a) a request for the grant of a European patent;

1 http://www.wipo.int/classifications/ipc/en/
2 Highlights by the authors.

(b) a description of the invention; adapting unsupervised generative modules such
(c) one or more claims; as translation models or language models to new
(d) any drawings referred to in the de- tasks. For example, transductive approaches have
scription or the claims; used automatic translations of monolingual cor-
(e) an abstract, pora for self-training modules of the generative
SMT pipeline (Ueffing et al., 2007; Schwenk,
and satisfy the requirements laid down 2008; Bertoldi and Federico, 2009). Other ap-
in the Implementing Regulations. proaches have extracted parallel data from similar
The request for grant contains the patent title; thus or comparable corpora (Zhao et al., 2004; Snover
a patent document comprises the textual elements et al., 2008). Several approaches have been pre-
of title, description, claim, and abstract. sented that train separate translation and language
We investigate whether it is worthwhile to treat models on task-specific subsets of the data and
different values along the structural and topical combine them in different mixture models (Fos-
dimensions as different tasks that are not com- ter and Kuhn, 2007; Koehn and Schroeder, 2007;
pletely independent of each other but share some Foster et al., 2010). The latter kind of approach is
commonalities, yet differ enough to counter a applied in our work to multiple patent tasks.
simple pooling of data. For example, we con- Multi-task learning efforts in patent transla-
sider different tasks such as patents from different tion have so far been restricted to experimental
IPC classes, or along an orthogonal dimension, combinations of translation and language mod-
patent documents of all IPC classes but consisting els from different sets of IPC sections. For ex-
only of titles or only of claims. We ask whether ample, Utiyama and Isahara (2007) and Tinsley
such tasks should be addressed as separate trans- et al. (2010) investigate translation and language
lation tasks, or whether translation performance models trained on different sets of patent sections,
can be improved by learning several tasks simul- with larger pools of parallel data improving re-
taneously through shared models that are more so- sults. Ceausu et al. (2011) find that language mod-
phisticated than simple data pooling. Our goal is els always and translation model mostly benefit
to learn a patent translation system that performs from larger pools of data from different sections.
well across several different tasks, thus benefits Models trained on pooled patent data are used as
from shared information, but is yet able to address baselines in our approach.
the specifics of each task. The machine learning community has devel-
One contribution of this paper is a thorough oped several different formalizations of the cen-
analysis of the differences and similarities of mul- tral idea of trading off optimality of parameter
tilingual patent data along the dimensions of tex- vectors for each task-specific model and close-
tual structure and topic. The second contribution ness of these model parameters to the average pa-
is the experimental investigation of the influence rameter vector across models. For example, start-
of various such tasks on patent translation perfor- ing from a separate SVM for each task, Evgeniou
mance. Starting from baseline models that are and Pontil (2004) present a regularization method
trained on individual tasks or on data pooled from that trades off optimization of the task-specific pa-
all tasks, we apply mixtures of translation mod- rameter vectors and the distance of each SVM to
els and multi-task minimum error rate training to the average SVM. Equivalent formalizations re-
multiple patent translation tasks. A by-product of place parameter regularization by Bayesian prior
our research is a parallel patent corpus of over 23 distributions on the parameters (Finkel and Man-
million sentence pairs. ning, 2009) or by augmentation of the feature
space with domain independent features (Daume,
2 Related work
2007). Besides SVMs, several learning algo-
Multi-task learning has mostly been discussed un- rithms have been extended to the multi-task sce-
der the name of multi-domain adaptation in the nario in a parameter regularization setting, e.g.,
area of statistical machine translation (SMT). If perceptron-type algorithms (Dredze et al., 2010)
we consider domains as tasks, domain adapta- or boosting (Chapelle et al., 2011). Further vari-
tion is a special two-task case of multi-task learn- ants include different formalizations of norms for
ing. Most previous work has concentrated on parameter regularization, e.g., `1,2 regularization

(Obozinski et al., 2010) or `1, regularization pass alignment. This yields the parallel corpus
(Quattoni et al., 2009), where only the features listed in table 2 with high input-output ratios for
that are most important across all tasks are kept in claims, and much lower ratios for abstracts and
the model. In our experiments, we apply parame- descriptions, showing that claims exhibit a nat-
ter regularization for multi-task learning to mini- ural parallelism due to their structure, while ab-
mum error rate training for patent translation. stracts and descriptions are considerably less par-
allel. Removing duplicates and adding parallel ti-
3 Extraction of a parallel patent corpus tles results in a corpus of over 23 million parallel
from comparable data sentence pairs.

Our work on patent translation is based on the output de ratio en ratio


MAREC3 patent data corpus. MAREC con-
tains over 19 million patent applications and abstract 720,571 92.36% 76.81%
granted patents in a standardized format from claims 8,346,863 97.82% 96.17%
four patent organizations (European Patent Of- descr. 14,082,381 86.23% 82.67%
fice (EP), World Intellectual Property Organiza-
Table 2: Number of parallel sentences in output with
tion (WO), United States Patent and Trademark
input/output ratio of sentence aligner.
Office (US), Japan Patent Office (JP)), from 1976
to 2008. The data for our experiments are ex- Differences between the text sections become
tracted from the EP and WO collections which visible in an analysis of token to type ratios. Ta-
contain patent documents that include translations ble 3 gives the average number of tokens com-
of some of the patent text. To extract such parallel pared to the average type frequencies for a win-
patent sections, we first determine the longest in- dow of 100,000 tokens from every subsection. It
stance, if different kinds4 exist for a patent. We shows that titles contain considerably fewer to-
assume titles to be sentence-aligned by default, kens than other sections, however, the disadvan-
and define sections with a token ratio larger than tage is partially made up by a relatively large
0.7 as parallel. For the language pair German- amount of types, indicated by a lower average
English we extracted a total of 2,101,107 parallel type frequency.
titles, 291,716 parallel abstracts, and 735,667 par-
allel claims sections. tokens types
The lack of directly translated descriptions
poses a serious limitation for patent translation, de en de en
since this section constitutes the largest part of the title 6.5 8.0 2.9 4.8
document. It is possible to obtain comparable de- abstract 37.4 43.2 4.3 9.0
scriptions from related patents that have been filed claims 53.2 61.3 5.5 9.5
in different countries and are connected through description 27.5 35.5 4.0 7.0
the patent family id. We extracted 172,472 patents
that were both filed with the USPTO and the EPO Table 3: Average number of tokens and average type
and contain an English and a German description, frequencies in text sections.
respectively.
We reserved patent data published between
For sentence alignment, we used the Gargan-
1979 and 2007 for training and documents pub-
tua5 tool (Braune and Fraser, 2010) that fil-
lished in 2008 for tuning and testing in SMT.
ters a sentence-length based alignment with IBM
For the dimension of text sections, we sampled
Model-1 lexical word translation probabilities, es-
500,000 sentences distributed across all IPC
timated on parallel data obtained from the first-
sections for training and 2,000 sentences for
3
http://www.ir-facility.org/ each text section for development and testing. Be-
prototypes/marec cause of a relatively high number of identical sen-
4
A patent kind code indicates the document stage in the tences in test and training set for titles, we re-
filing process, e.g., A for applications and B for granted
patents, with publication levels from 1-9. See http://
moved the overlap for this section.
www.wipo.int/standards/en/part\_03.html. Table 4 shows the distribution of IPC sections
5
http://gargantua.sourceforge.net on claims, with the smallest class accounting for

around 300,000 parallel sentences. In order to ob- ison to the task-specific MAREC model, although
tain similar amounts of training data for each task the former has been learned on more than three
along the topical dimension, we sampled 300,000 times the amount of data. An analysis of the out-
sentences from each IPC class for training, and put of both system shows that the Europarl model
2,000 sentences for each IPC class for develop- suffers from two problems: Firstly, there is an ob-
ment and testing. vious out of vocabulary (OOV) problem of the
Europarl model compared to the MAREC model.
A 1,947,542 Secondly, the Europarl model suffers from incor-
B 2,522,995 rect word sense disambiguation, as illustrated by
C 2,263,375 the samples in table 6.
D 299,742
E 353,910 source steuerbar leitet
F 1,012,808 Europarl taxable is in charge of
G 2,066,132 MAREC controllable guiding
H 1,754,573 reference controllable guides
Table 4: Distribution of IPC sections on claims. Table 6: Output of Europarl model on MAREC data.

4 Machine translation experiments Table 7 shows the results of the evaluation


across text sections; we measured the perfor-
4.1 Individual task baselines mance of separately trained and tuned individual
For our experiments we used the phrase-based, models on every section. The results allow some
open-source SMT toolkit Moses6 (Koehn et al., conclusions about the textual characteristics of the
2007). For language modeling, we computed sections and indicate similarities. Naturally, ev-
5-gram models using IRSTLM7 (Federico et ery task is best translated with a model trained
al., 2008) and queried the model with KenLM on the respective section, as the B LEU scores
(Heafield, 2011). B LEU (Papineni et al., 2001) on the diagonal are the highest in every column.
scores were computed up to 4-grams on lower- Accordingly, we are interested in the runner-up
cased data. on each section, which is indicated in bold font.
The results on abstracts suggest that this section
Europarl-v6 MAREC bears the strongest resemblance to claims, since
B LEU OOV B LEU OOV the model trained on claims achieves a respectable
score. The abstract model seems to be the most
abstract 0.1726 14.40% 0.3721 3.00% robust and varied model, yielding the runner-up
claim 0.2301 15.80% 0.4711 4.20% score on all other sections. Claims are easiest to
title 0.0964 26.00% 0.3228 9.20% translate, yielding the highest overall B LEU score
of 0.4879. In contrast to that, all models score
Table 5: B LEU scores and OOV rate for Europarl base-
considerably lower on titles.
line and MAREC model.

Table 5 shows a first comparison of results of test


Moses models trained on 500,000 parallel sen- train abstract claim title desc.
tences from patent text sections balanced over IPC
abstract 0.3737 0.4076 0.2681 0.2812
classes, against Moses trained on 1.7 Million sen-
claim 0.3416 0.4879 0.2420 0.2623
tences of parliament proceedings from Europarl8
title 0.2839 0.3512 0.3196 0.1743
(Koehn, 2005). The best result on each section is
desc. 0.32189 0.403 0.2342 0.3347
indicated in bold face. The Europarl model per-
forms very poorly on all three sections in compar- Table 7: B LEU scores for 500k individual text section
6
http://statmt.org/moses/ models.
7
http://sourceforge.net/projects/
irstlm/ The cross-section evaluation on the IPC classes
8
http://www.statmt.org/europarl/ (table 8) shows similar patterns. Each section

is best translated with a model trained on data section B and C is trained on a data set composed
from the same section. Note that best section of 150,000 sentences from each IPC section. The
scores vary considerably, ranging from 0.5719 on pooled model for pairing data from abstracts and
C to 0.4714 on H, indicating that higher-scoring claims is trained on data composed of 250,000
classes, such as C and A, are more homogeneous sentences from each text section.
and therefore easier to translate. C, the Chem- Another approach to exploit commonalities be-
istry section, presumably benefits from the fact tween tasks is to train separate language and trans-
that the data contain chemical formulae, which lation models9 on the sentences from each task
are language-independent and do not have to be and combine the models in the global log-linear
translated. Again, for determining the relation- model of the SMT framework, following Fos-
ship between the classes, we examine the best ter and Kuhn (2007) and Koehn and Schroeder
runner-up on each section, considering the B LEU (2007). Model combination is accomplished by
score, although asymmetrical, as a kind of mea- adding additional language model and translation
sure of similarity between classes. We can es- model features to the log-linear model and tuning
tablish symmetric relationships between sections the additional meta-parameters by standard mini-
A and C, B and F as well as G and H, which mum error rate training (Bertoldi et al., 2009).
means that the models are mutual runner-up on We try out mixture and pooling for all pairwise
the others test section. combinations of the three structural sections, for
The similarities of translation tasks estab- which we have high-quality data, i.e. abstract,
lished in the previous section can be confirmed claims and title. Due to the large number of pos-
by information-theoretic similarity measures that sible combinations of IPC sections, we limit the
perform a pairwise comparison of the vocabulary experiments to pairs of similar sections, based on
probability distribution of each task-specific cor- the A-distance measure.
pus. This distribution is calculated on the basis of Table 10 lists the results for two combinations
the 500 most frequent words in the union of two of data from different sections: a log-linear mix-
corpora, normalized by vocabulary size. As met- ture of separately trained models and simple pool-
ric we use the A-distance measure of Kifer et al. ing, i.e. concatenation, of the training data. Over-
(2004). If A is the set of events on which the word all, the mixture models perform slightly better
distributions of two corpora are defined, then the than the pooled models on the text sections, al-
A-distance is the supremum of the difference of though the difference is significant only in two
probabilities assigned to the same event. Low dis- cases. This is indicated by highlighting best re-
tance means higher similarity. sults in bold face (with more than one result high-
Table 9 shows the A-distance of corpora spe- lighted if the difference is not significant).10
cific to IPC classes. The most similar section or We investigate the same mixture and pooling
sections apart from the section itself on the di- techniques on the IPC sections we considered
agonal is indicated in bold face. The pairwise pairwise similar (see table 11). Somehow contra-
similarity of A and C, B and F, G and H obtained dicting the former results, the mixture models per-
by B LEU score is confirmed. Furthermore, a close form significantly worse than the pooled model on
similarity between E and F is indicated. G and three sections. This might be the result of inade-
H (electricity and physics, respectively) are very quate tuning, since most of the time the MERT
similar to each other but not close to any other algorithm did not converge after the maximum
section apart from B. number of iterations, due to the larger number of
features when using several models.
4.2 Task pooling and mixture
9
Following Duh et al. (2010), we use the alignment
One straightforward technique to exploit com- model trained on the pooled data set in the phrase extraction
monalities between tasks is pooling data from phase of the separate models. Similarly, we use a globally
separate tasks into a single training set. Instead of trained lexical reordering model.
10
a trivial enlargement of training data by pooling, For assessing significance, we apply the approximate
randomization method described in Riezler and Maxwell
we train the pooled models on the same amount (2005). We consider pairwise differing results scoring a p-
of sentences as the individual models. For in- value smaller than 0.05 as significant; the assessment is re-
stance, the pooled model for the pairing of IPC peated three times and the average value is taken.

test
train A B C D E F G H
A 0.5349 0.4475 0.5472 0.4746 0.4438 0.4523 0.4318 0.4109
B 0.4846 0.4736 0.5161 0.4847 0.4578 0.4734 0.4396 0.4248
C 0.5047 0.4257 0.5719 0.462 0.4134 0.4249 0.409 0.3845
D 0.47 0.4387 0.5106 0.5167 0.4344 0.4435 0.407 0.3917
E 0.4486 0.4458 0.4681 0.4531 0.4771 0.4591 0.4073 0.4028
F 0.4595 0.4588 0.4761 0.4655 0.4517 0.4909 0.422 0.4188
G 0.4935 0.4489 0.5239 0.4629 0.4414 0.4565 0.4748 0.4532
H 0.4628 0.4484 0.4914 0.4621 0.4421 0.4616 0.4588 0.4714

Table 8: B LEU scores for 300k individual IPC section models.
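Purely for illustration (this is not the authors' code), the runner-up analysis described above can be reproduced from the scores in Table 8; the hard-coded matrix below simply restates the table.

```python
# Illustrative sketch: for every test section, find the best training section other than
# the matching one ("runner-up") and report mutual pairs.  bleu[train] lists the scores
# for test sections A-H, restating Table 8.

sections = "ABCDEFGH"
bleu = {
    "A": [0.5349, 0.4475, 0.5472, 0.4746, 0.4438, 0.4523, 0.4318, 0.4109],
    "B": [0.4846, 0.4736, 0.5161, 0.4847, 0.4578, 0.4734, 0.4396, 0.4248],
    "C": [0.5047, 0.4257, 0.5719, 0.4620, 0.4134, 0.4249, 0.4090, 0.3845],
    "D": [0.4700, 0.4387, 0.5106, 0.5167, 0.4344, 0.4435, 0.4070, 0.3917],
    "E": [0.4486, 0.4458, 0.4681, 0.4531, 0.4771, 0.4591, 0.4073, 0.4028],
    "F": [0.4595, 0.4588, 0.4761, 0.4655, 0.4517, 0.4909, 0.4220, 0.4188],
    "G": [0.4935, 0.4489, 0.5239, 0.4629, 0.4414, 0.4565, 0.4748, 0.4532],
    "H": [0.4628, 0.4484, 0.4914, 0.4621, 0.4421, 0.4616, 0.4588, 0.4714],
}

def runner_up(test):
    """Best training section for `test`, excluding the diagonal entry."""
    col = sections.index(test)
    return max((t for t in sections if t != test), key=lambda t: bleu[t][col])

pairs = {(a, b) for a in sections for b in sections
         if a < b and runner_up(a) == b and runner_up(b) == a}
print(pairs)  # mutual runner-ups, e.g. {('A', 'C'), ('B', 'F'), ('G', 'H')}
```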

A B C D E F G H
A 0 0.1303 0.1317 0.1311 0.188 0.186 0.164 0.1906
B 0.1302 0 0.2388 0.1242 0.0974 0.0875 0.1417 0.1514
C 0.1317 0.2388 0 0.1992 0.311 0.3068 0.2506 0.2825
D 0.1311 0.1242 0.1992 0 0.1811 0.1808 0.1876 0.201
E 0.188 0.0974 0.311 0.1811 0 0.0921 0.2058 0.2025
F 0.186 0.0875 0.3068 0.1808 0.0921 0 0.1824 0.1743
G 0.164 0.1417 0.2506 0.1876 0.2056 0.1824 0 0.064
H 0.1906 0.1514 0.2825 0.201 0.2025 0.1743 0.064 0

Table 9: Pairwise A-distance for 300k IPC training sets.
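For illustration only (not the authors' implementation), the A-distance described above can be computed for two tokenized corpora as follows; reading "normalized by vocabulary size" literally is an assumption of this sketch.

```python
from collections import Counter

# Illustrative sketch of the corpus-similarity measure paraphrased above from
# Kifer et al. (2004): the events are the 500 most frequent words in the union of
# the two corpora, and the A-distance is the largest difference in the probabilities
# assigned to the same event.

def a_distance(corpus1, corpus2, num_events=500):
    """corpus1, corpus2: lists of tokens.  Lower values indicate more similar corpora."""
    c1, c2 = Counter(corpus1), Counter(corpus2)
    events = [w for w, _ in (c1 + c2).most_common(num_events)]
    # word probability distributions, normalized by vocabulary size (literal reading)
    p1 = {w: c1[w] / len(c1) for w in events}
    p2 = {w: c2[w] / len(c2) for w in events}
    return max(abs(p1[w] - p2[w]) for w in events)
```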

train test pooling mixture train test pooling mixture


abstract-claim abstract 0.3703 0.3704 A-C A 0.5271 0.5274
claim 0.4809 0.4834 C 0.5664 0.5632
claim-title claim 0.4799 0.4789 B-F B 0.4696 0.4354
title 0.3269 0.328 F 0.4859 0.4769
title-abstract title 0.3311 0.3275 G-H G 0.4735 0.4754
abstract 0.3643 0.366 H 0.4634 0.467

Table 10: Mixture and pooling on text sections. Table 11: Mixture and pooling on IPC sections.

A comparison of the results for pooling and SMT pipeline is not adaptable. Such situations
mixture with the respective results for individual arise if there are not enough data to train transla-
models (tables 7 and 8) shows that replacing data tion models or language models on the new tasks.
from the same task by data from related tasks However, we assume that there are enough paral-
decreases translation performance in almost all lel data available to perform meta-parameter tun-
cases. The exception is the title model that bene- ing by minimum error rate training (MERT) (Och,
fits from pooling and mixing with both abstracts 2003; Bertoldi et al., 2009) for each task.
and claims due to their richer data structure. A generic algorithm for multi-task learning
can be motivated as follows: Multi-task learning
4.3 Multi-task minimum error rate training aims to take advantage of commonalities shared
In contrast to task pooling and task mixtures, the among tasks by learning several independent but
specific setting addressed by multi-task minimum related tasks together. Information is shared be-
error rate training is one in which the generative tween tasks through a joint representation and in-

tuning
test individual pooled average MMERT MMERT-average
abstract 0.3721 0.362 0.3657+ 0.3719 +
0.3685+
claim 0.4711 0.4681 0.4749+ 0.475+ 0.4734+
title 0.3228 0.3152 0.3326+ 0.3268+ 0.3325+

Table 12: Multi-task tuning on text sections.

tuning
test individual pooled average MMERT MMERT-average
A 0.5187 0.5199 0.5213+ 0.5195 0.5196
B 0.4877 0.4885 0.4908+ 0.4911+ 0.4921+
C 0.5214 0.5175 0.5199+ 0.5218+ 0.5162+
D 0.4724 0.4730 0.4733 0.4736 0.4734
E 0.4666 0.4661 0.4679+ 0.4669+ 0.4685+
F 0.4794 0.4801 0.4811 0.4821+ 0.4830+
G 0.4596 0.4576 0.4607+ 0.4606+ 0.4610+
H 0.4573 0.4560 0.4578 0.4581+ 0.4581+

Table 13: Multi-task tuning on IPC sections.
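The significance marks in Tables 12 and 13 are based on the approximate randomization test of Riezler and Maxwell (2005) mentioned in footnote 10. A generic sketch of such a test is given below for illustration; the metric interface and all helper names are assumptions, not the authors' implementation.

```python
import random

# Illustrative sketch of an approximate randomization test for the score difference of
# two systems under a corpus-level metric such as BLEU.  `metric` maps a list of
# (hypothesis, reference) pairs to a single score.

def approximate_randomization(hyps_a, hyps_b, refs, metric, trials=1000, seed=0):
    rng = random.Random(seed)
    observed = abs(metric(list(zip(hyps_a, refs))) - metric(list(zip(hyps_b, refs))))
    hits = 0
    for _ in range(trials):
        # randomly swap the two systems' outputs on each sentence
        shuf_a, shuf_b = [], []
        for a, b in zip(hyps_a, hyps_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        diff = abs(metric(list(zip(shuf_a, refs))) - metric(list(zip(shuf_b, refs))))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # p-value; p < 0.05 is treated as significant here
```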

troduces an inductive bias. Evgeniou and Pontil (2004) propose a regularization method that balances task-specific parameter vectors and their distance to the average. The learning objective is to minimize task-specific loss functions l_d across all tasks d with weight vectors w_d, while keeping each parameter vector close to the average (1/D) Σ_{d=1}^{D} w_d = w_avg. This is enforced by minimizing the norm (here the ℓ1-norm) of the difference of each task-specific weight vector to the average weight vector:

    min_{w_1,...,w_D}  Σ_{d=1}^{D} l_d(w_d) + λ Σ_{d=1}^{D} ||w_d − w_avg||_1        (1)

The MMERT algorithm is given in Figure 1. The algorithm starts with initial weights w^(0). At each iteration step, the average of the parameter vectors from the previous iteration is computed. For each task d ∈ D, one iteration of standard MERT is called, continuing from weight vector w_d^(t−1) and minimizing translation loss function l_d on the data from task d. The individually tuned weight vectors returned by MERT are then moved towards the previously calculated average by adding or subtracting a penalty term λ for each weight component w_d^(t)[k]. If a weight moves beyond the average, it is clipped to the average value. The process is iterated until a stopping criterion is met, e.g. a threshold on the maximum change in the average weight vector. The parameter λ controls the influence of the regularization. A larger λ pulls the weights closer to the average, a smaller λ leaves more freedom to the individual tasks.

MMERT(w^(0), D, {l_d}_{d=1}^{D}):
  for t = 1, . . . , T do
    w_avg^(t) = (1/D) Σ_{d=1}^{D} w_d^(t−1)
    for d = 1, . . . , D parallel do
      w_d^(t) = MERT(w_d^(t−1), l_d)
      for k = 1, . . . , K do
        if w_d^(t)[k] − w_avg^(t)[k] > 0 then
          w_d^(t)[k] = max(w_avg^(t)[k], w_d^(t)[k] − λ)
        else if w_d^(t)[k] − w_avg^(t)[k] < 0 then
          w_d^(t)[k] = min(w_avg^(t)[k], w_d^(t)[k] + λ)
        end if
      end for
    end for
  end for
  return w_1^(T), . . . , w_D^(T), w_avg^(T)

Figure 1: Multi-task MERT.
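For illustration, the loop of Figure 1 can be rendered compactly in Python as follows; the MERT step is treated as an externally supplied black box, all helper names are assumptions of the sketch, and λ = 0.001 corresponds to the value used in the experiments reported below.

```python
# Illustrative sketch of the multi-task MERT loop of Figure 1 (not the authors' code).
# `mert(weights, task)` is an assumed black box running one iteration of standard MERT
# on `task` and returning re-tuned weights; weight vectors are plain lists of floats.

def mmert(w0, tasks, mert, lam=0.001, iterations=10):
    """Multi-task MERT with l1-style clipping of each weight towards the average."""
    weights = [list(w0) for _ in tasks]      # one weight vector per task
    w_avg = list(w0)
    for _ in range(iterations):
        # average of the parameter vectors from the previous iteration
        w_avg = [sum(col) / len(tasks) for col in zip(*weights)]
        for d, task in enumerate(tasks):     # the tasks could be tuned in parallel
            w_d = list(mert(weights[d], task))
            for k in range(len(w_d)):
                # move each component towards the average by lam, clipping at the average
                if w_d[k] > w_avg[k]:
                    w_d[k] = max(w_avg[k], w_d[k] - lam)
                elif w_d[k] < w_avg[k]:
                    w_d[k] = min(w_avg[k], w_d[k] + lam)
            weights[d] = w_d
        # a threshold on the change of w_avg could replace the fixed iteration count
    return weights, w_avg
```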
The weight updates and the clipping strategy for each task, where no information has been
can be motivated in a framework of gradient de- shared between the tasks. The second baseline
scent optimization under `1 -regularization (Tsu- simulates the setting where the sections are not
ruoka et al., 2009). Assuming MERT as algorith- differentiated at all. We tune the model on a
mic minimizer11 of the loss function ld in equa- pooled development set of 2,000 sentences that
tion 1, the weight update towards the average combines the same amount of data from all sec-
follows from the subgradient of the `1 regular- tions (pooled). This yields a single joint weight
(t)
izer. Since wavg is taken as average over weights vector for all tasks optimized to perform well
wd
(t1) (t)
from the step before, the term wavg is con- across all sections. Furthermore, we compare
(t) multi-task MERT tuning with two parameter av-
stant with respect to wd , leading to the follow-
eraging methods. The first method computes the
ing subgradient (where sgn(x) = 1 if x > 0,
arithmetic mean of the weight vectors returned by
sgn(x) = 1 if x < 0, and sgn(x) = 0 if x = 0):
the individual baseline for each weight compo-
nent, yielding a joint average vector for all tasks

D D

X (t) 1 X (t1)
wd ws (average). The second method takes the last av-

(t) D

wr [k] d=1
s=1

1 erage vector computed during multi-task MERT
D tuning (MMERT-average).12
!
1 X (t1)
= sgn wr(t) [k] ws [k] . Tables 12 and 13 give the results for multi-task
D
s=1 learning on text and IPC sections. The latter re-
Gradient descent minimization tells us to move in sults have been presented earlier in Simianer et al.
the opposite direction of the subgradient, thus mo- (2011). The former table extends the technique
tivating the addition or subtraction of the regular- of multi-task MERT to the structural dimension
ization penalty. Clipping is motivated by the de- of patent SMT tasks. In all experiments, the pa-
sire to avoid oscillating parameter weights and in rameter was adjusted to 0.001 after evaluating
order to to enforce parameter sharing. different settings on a development set. The best
Experimental results for multi-task MERT result on each section is indicated in bold face; *
(MMERT) are reported for both dimensions of indicates significance with respect to the individ-
patent tasks. For the IPC sections we trained ual baseline, + the same for the pooled baseline.
a pooled model on 1,000,000 sentences sampled We observe statistically significant improvements
from abstracts and claims from all sections. We of 0.5 to 1% B LEU over the individual baseline for
did not balance the sections but kept their orig- claims and titles; for abstracts, the multi-task vari-
inal distribution, reflecting a real-life task where ant yields the same result as the baseline, while
the distribution of sections is unknown. We then the averaging methods perform worse. Multi-task
extend this experiment to the structural dimen- MERT yields the best result for claims; on titles,
sion. Since we do not have an intuitive notion of a the simple average and the last MMERT average
natural distribution for the text sections, we train dominate. Pooled tuning always performs signifi-
a balanced pooled model on a corpus composed cantly worse than any other method, confirming
of 170,000 sentences each from abstracts, claims that it is beneficial to differentiate between the
and titles, i.e. 510,000 sentences in total. For text section sections.
both dimensions, for each task, we sampled 2,000 Similarly for IPC sections, small but statisti-
parallel sentences for development, development- cally significant improvements over the individual
testing, and testing from patents that were pub- and pooled baselines are achieved by multi-task
lished in different years than the training data. tuning and averaging over IPC sections, except-
We compare the multi-task experiments with ing C and D. However, an advantage of multi-task
two baselines. The first baseline is individual tuning over averaging is hard to establish.
task learning, corresponding to standard separate Note that the averaging techniques implicitly
MERT tuning on each section (individual). This benefit from a larger tuning set. In order to ascer-
results in three separately learned weight vectors tain that the improvements by averaging are not
11 12
MERT as presented in Och (2003) is not a gradient- The aspect of averaging found in all of our multi-task
based optimization techniquem, thus MMERT is strictly learning techniques effectively controls for optimizer insta-
speaking only inspired by gradient descent optimization. bility as mentioned in Clark et al. (2011).

test pooled-6k significance ley et al. (2010) and Utiyama and Isahara (2007).
A caveat in this situation is that data need to be
abstract 0.3628 <
from the general patent domain, as shown by the
claim 0.4696 <
inferior performance of a large Europarl-trained
title 0.3174 <
model compared to a small patent-trained model.
Table 14: Multi-task tuning on 6,000 sentences pooled The goal of this paper is to analyze patent data
from text sections. < denotes a statistically signifi- along the topical dimension of IPC classes and
cant difference to the best result. along the structural dimension of textual sections.
Instead of trying to beat a pooling baseline that
simply increases the data size, our research goal
simply due to increasing the size of the tuning set,
is to investigate whether different subtasks along
we ran a control experiment where we tuned the
these dimensions share commonalities that can
model on a pooled development set of 3 2, 000
fruitfully be exploited by multi-task learning in
sentences for text sections and on a development
machine translation. We thus aim to investigate
set of 8 2, 000 sentences for IPC sections. The
the benefits of multi-task learning in realistic sit-
results given in table 14 show that tuning on a
uations where a simple enlargement of training
pooled set of 6,000 text sections yields only min-
data is not possible.
imal differences to tuning on 2,000 sentence pairs
such that the B LEU scores for the new pooled Starting from baseline models that are trained
models are still significantly lower than the best on individual tasks or on data pooled from all
results in table 12 (indicated by <). However, tasks, we apply mixtures of translation models
increasing the tuning set to 16,000 sentence pairs and multi-task MERT tuning to multiple patent
for IPC sections makes the pooled baseline per- translation tasks. We find small, but statistically
form as well as the best results in table 13, except significant improvements for multi-task MERT
for two cases (indicated by <) (see table 15). tuning and parameter averaging techniques. Im-
This is due to the smaller differences between best provements are more pronounced for multi-task
and worst results for tuning on IPC sections com- learning on textual domains than on IPC domains.
pared to tuning on text sections, indicating that This might indicate that the IPC sections are less
IPC sections are less well suited for multi-task well delimitated than the structural domains. Fur-
tuning than the textual domains. thermore, this is owing to the limited expressive-
ness of a standard linear model including 14-20
test pooled-16k significance features in tuning. The available features are very
coarse and more likely to capture structural dif-
A 0.5177 < ferences, such as sentence length, than the lexi-
B 0.4920 cal differences that differentiate the semantic do-
C 0.5133 < mains. We expect to see larger gains due to multi-
D 0.4737 task learning for discriminatively trained SMT
E 0.4685 models that involve very large numbers of fea-
F 0.4832 tures, especially when multi-task learning is done
G 0.4608 in a framework that combines parameter regular-
H 0.4579 ization with feature selection (Obozinski et al.,
2010). In future work, we will explore a combina-
Table 15: Multi-task tuning on 16,000 sentences
pooled from IPC sections. < denotes a statistically tion of large-scale discriminative training (Liang
significant difference to the best result. et al., 2006) with multi-task learning for SMT.

Acknowledgments
5 Conclusion
The most straightforward approach to improve This work was supported in part by DFG grant
machine translation performance on patents is to Cross-language Learning-to-Rank for Patent Re-
enlarge the training set to include all available trieval.
data. This question has been investigated by Tins-

References

Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the 4th EACL Workshop on Statistical Machine Translation, Athens, Greece.

Nicola Bertoldi, Barry Haddow, and Jean-Baptiste Fouet. 2009. Improved minimum error rate training in Moses. The Prague Bulletin of Mathematical Linguistics, 91:7-16.

Fabienne Braune and Alexander Fraser. 2010. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING'10), Beijing, China.

Alexandru Ceausu, John Tinsley, Jian Zhang, and Andy Way. 2011. Experiments on domain adaptation for patent machine translation in the PLuTO project. In Proceedings of the 15th Conference of the European Association for Machine Translation (EAMT 2011), Leuven, Belgium.

Olivier Chapelle, Pannagadatta Shivaswamy, Srinivas Vadrevu, Kilian Weinberger, Ya Zhang, and Belle Tseng. 2011. Boosted multi-task learning. Machine Learning.

Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL'11), Portland, OR.

Hal Daume. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), Prague, Czech Republic.

Mark Dredze, Alex Kulesza, and Koby Crammer. 2010. Multi-domain learning by confidence-weighted parameter combination. Machine Learning, 79:123-149.

Kevin Duh, Katsuhito Sudoh, and Hajime Tsukada. 2010. Analysis of translation model adaptation in statistical machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT'10), Paris, France.

Theodoros Evgeniou and Massimiliano Pontil. 2004. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Proceedings of Interspeech, Brisbane, Australia.

Jenny Rose Finkel and Christopher D. Manning. 2009. Hierarchical bayesian domain adaptation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT'09), Boulder, CO.

George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic.

George Foster, Pierre Isabelle, and Roland Kuhn. 2010. Translating structured documents. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, CO.

Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation (WMT'11), Edinburgh, UK.

Daniel Kifer, Shai Ben-David, and Johannes Gehrke. 2004. Detecting change in data streams. In Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Ontario, Canada.

Philipp Koehn and Josh Schroeder. 2007. Experiments in domain adaptation for statistical machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL 2007 Demo and Poster Sessions, Prague, Czech Republic.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X, Phuket, Thailand.

Percy Liang, Alexandre Bouchard-Cote, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proceedings of the Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL'06), Sydney, Australia.

Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. 2010. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20:231-252.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Human Language Technology Conference and the 3rd Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL'03), Edmonton, Canada.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2001. Bleu: a method for automatic evaluation of machine translation. Technical Report RC22176 (W0190-022), IBM Research Division, Yorktown Heights, N.Y.
Ariadna Quattoni, Xavier Carreras, Michael Collins, and Trevor Darrell. 2009. An efficient projection for ℓ1,∞ regularization. In Proceedings of the 26th International Conference on Machine Learning (ICML'09), Montreal, Canada.

Stefan Riezler and John Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL-05 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, Ann Arbor, MI.

Holger Schwenk. 2008. Investigations on large-scale lightly-supervised training for statistical machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT'08), Hawaii.

Patrick Simianer, Katharina Wäschle, and Stefan Riezler. 2011. Multi-task minimum error rate training for SMT. The Prague Bulletin of Mathematical Linguistics, 96:99-108.

Matthew Snover, Bonnie Dorr, and Richard Schwartz. 2008. Language and translation model adaptation using comparable corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'08), Honolulu, Hawaii.

John Tinsley, Andy Way, and Paraic Sheridan. 2010. PLuTO: MT for online patent translation. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, CO.

Yoshimasa Tsuruoka, Junichi Tsujii, and Sophia Ananiadou. 2009. Stochastic gradient descent training for ℓ1-regularized log-linear models with cumulative penalty. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP'09), Singapore.

Nicola Ueffing, Gholamreza Haffari, and Anoop Sarkar. 2007. Transductive learning for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL'07), Prague, Czech Republic.

Masao Utiyama and Hitoshi Isahara. 2007. A Japanese-English patent parallel corpus. In Proceedings of MT Summit XI, Copenhagen, Denmark.

Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language model adaptation for statistical machine translation with structured query models. In Proceedings of the 20th International Conference on Computational Linguistics (COLING'04), Geneva, Switzerland.
Not as Awful as it Seems: Explaining German Case through Computational Experiments in Fluid Construction Grammar

Remi van Trijp

Sony Computer Science Laboratory Paris
6 Rue Amyot
75005 Paris (France)
remi@csl.sony.fr
Abstract

German case syncretism is often assumed to be the accidental by-product of historical development. This paper contradicts this claim and argues that the evolution of German case is driven by the need to optimize the cognitive effort and memory required for processing and interpretation. This hypothesis is supported by a novel kind of computational experiments that reconstruct and compare attested variations of the German definite article paradigm. The experiments show how the intricate interaction between those variations and the rest of the German linguistic landscape may direct language change.

1 Introduction

In his 1880 essay, Mark Twain famously complained that the awful German language is "the most slipshod and systemless, and so slippery and elusive to grasp" language of all. A brief look at the literature on the German case system seems to provide sufficient evidence for instantly agreeing with the American author. But what if the German case system were not the accidental by-product of diachronic changes as is often assumed? Are there linguistic forces that are not yet fully appreciated in the field, but which may explain the German case paradigm?

This paper demonstrates that there indeed are such forces through a case study on German definite articles. The experiments reconstruct deep language processing models for different variants of this paradigm, and show how the linguistic landscape of German has allowed its speakers to reduce their definite article system without loss in efficiency for processing and interpretation.

2 The Problem of German Case

German articles, adjectives and nouns are marked for gender, number and case through morphological inflection, as illustrated for definite articles in Table 1.

Case    SG-M    SG-F    SG-N    PL
NOM     der     die     das     die
ACC     den     die     das     die
DAT     dem     der     dem     den
GEN     des     der     des     der

Table 1: German definite articles.

The system is notorious for its syncretism (i.e. the same form can be mapped onto different functions), a riddle that has fascinated many formal and historical linguists looking for explanations.

2.1 Historical Linguistics

Studies in historical linguistics and grammaticalization often propose the following three forces to explain syncretism (Heine and Kuteva, 2005, p. 148):

1. The formal distinction between case markers is lost through phonological changes.

2. One case takes over the functional domain of another case and replaces it.

3. A case marker disappears and its functions are usurped by another marker.

Syncretism is thus considered as the accidental by-product of such forces, and German case syncretism is typically analyzed according to these lines (Barðdal, 2009; Baerman, 2009, p. 229). However, these forces are not explanatory: they only describe what has happened, but not why.
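To make the notion of syncretism in Table 1 concrete, the paradigm can be read as a mapping from paradigm cells to forms; grouping the cells by form shows how many functions each form covers. The following is only an illustrative bookkeeping sketch (the dictionary layout is ours, not part of the paper's implementation):

    # Table 1 as a mapping from (case, gender/number) cells to definite article forms.
    PARADIGM = {
        ("NOM", "SG-M"): "der", ("NOM", "SG-F"): "die", ("NOM", "SG-N"): "das", ("NOM", "PL"): "die",
        ("ACC", "SG-M"): "den", ("ACC", "SG-F"): "die", ("ACC", "SG-N"): "das", ("ACC", "PL"): "die",
        ("DAT", "SG-M"): "dem", ("DAT", "SG-F"): "der", ("DAT", "SG-N"): "dem", ("DAT", "PL"): "den",
        ("GEN", "SG-M"): "des", ("GEN", "SG-F"): "der", ("GEN", "SG-N"): "des", ("GEN", "PL"): "der",
    }

    def cells_per_form(paradigm):
        """Group paradigm cells by the form that realizes them (syncretic forms fill several cells)."""
        grouping = {}
        for cell, form in paradigm.items():
            grouping.setdefault(form, []).append(cell)
        return grouping

    groups = cells_per_form(PARADIGM)
    print(len(groups), "distinct forms for", len(PARADIGM), "cells")  # 6 distinct forms, 16 cells
    for form, cells in sorted(groups.items()):
        print(form, cells)

On this table the six forms cover sixteen case-number-gender cells, which is exactly the many-to-one mapping that the following sections set out to explain.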
Another problem for the "syncretism by accident" hypothesis is the fact that the collapsing of case forms is not randomly distributed over the whole paradigm as would be expected. Hawkins (2004, p. 78) observes that instead there is a systematic tendency for lower cells in the paradigm (e.g. genitive; Table 1) to collapse before cells in higher positions (e.g. nominative) do so.

2.2 Formal Linguistics

Many hidden effects of verbal linguistic theories can be uncovered through explicit formalizations. Unfortunately, formal linguists also typically distinguish between "systematic" and "non-systematic" syncretism when analyzing German case. For instance, in his review of a number of studies on German (a.o. Bierwisch, 1967; Blevins, 1995; Wiese, 1996; Wunderlich, 1997), Müller (2002) concludes that none of these approaches is able to rule out accidental syncretism.

There is however one major stone that has been left unturned by formal linguists: processing. Most formal theories, such as HPSG (Ginzburg and Sag, 2000), assume a strict division between "competence" and "performance" and therefore represent linguistic knowledge in a purely declarative, process-independent way (Sag and Wasow, 2011). While such an approach may be desirable from a mathematical point of view, it puts the burden of efficient processing on the shoulders of computational linguists, who have to develop more intelligent interpreters.

One example of the gap between description and computational implementation is disjunctive feature representation, which became popular in feature-based grammar formalisms in the 1980s (Karttunen, 1984). Disjunctions allow an elegant notation for multiple feature values, as illustrated in example (1) for the German definite article die, which is either assigned nominative or accusative case, and which is either feminine-singular or plural. The feature structure (adopted from Karttunen, 1984, p. 30) represents disjunctions by enclosing the alternatives in curly brackets ({ }).

(1) die:
    AGREEMENT  { [ GENDER f, NUM sg ], [ NUM pl ] }
    CASE       { nom, acc }

However, it is a well-established fact that disjunctions are computationally expensive, which is illustrated in the top of Figure 1. This figure shows the search tree of a small grammar when parsing the utterance Die Kinder gaben der Lehrerin die Zeichnung ("the children gave the drawing to the (female) teacher"), which is unambiguous to German speakers. As can be seen in the figure, the search tree has to explore several branches before arriving at a valid solution. Most of the splits are caused by disjunctions. For example, when a determiner-noun construction specifies that the case features of the definite article die (nominative or accusative) and the noun Kinder ("children"; nominative, accusative or genitive) have to unify, the search tree splits into two hypotheses (a nominative and an accusative reading) even though for native speakers of German, the syntactic context unambiguously points to a nominative reading (because it is the only noun phrase that agrees with the main verb).

It should be no surprise, then, that a lot of work has focused on processing disjunctions more efficiently (e.g. Carter, 1990; Ramsay, 1990). As observed by Flickinger (2000), however, most of these studies implicitly assume that the grammar representation has to remain unchanged. He then demonstrates through computational experiments how a different representation can directly impact efficiency, and argues that revisions of the grammar for efficiency should be discussed more thoroughly in the literature.

The impact of representation on processing is illustrated at the bottom of Figure 1, which shows the performance of a grammar that uses the same processing technique for handling the same utterance, but a different representation than the disjunctive grammar. As can be seen, the alternative grammar (whose technical details are disclosed further below) is able to parse the German definite articles without tears, and the resulting search tree arguably better reflects the actual processing performed by native speakers of German.

2.3 Alternative Hypothesis

The effect of processing-friendly representations on search suggests that answers for the unsolved problems concerning case syncretism have to be sought in performance. This paper therefore rejects the processing-independent approach and explores the alternative hypothesis, following
Steels (2004, 2012b), that grammar evolves in order to optimize communicative success by dampening the search space in linguistic processing and reducing the cognitive effort needed for interpretation, while at the same time minimizing the resources required for doing so. More specifically, this paper explores the following claims:

1. The German definite article system can be processed as efficiently as its Old High German predecessor, which had less syncretism.

2. The presence of other grammatical structures have made it possible to reduce the definite article paradigm without increasing the cognitive effort needed for disambiguating the argument structures that underly German utterances.

3. The decrease of cue-reliability of case for disambiguation encourages the emergence of competing systems (such as word order).

The hypothesis is substantiated through computational experiments that reconstruct three different variants of the German definite article system (the current system, its Old High German predecessor, Wright, 1906; and the Texas German dialect system, Boas, 2009a,b) and compare their performance in terms of processing efficiency and cognitive effort in interpretation.

[Figure 1: two search trees for parsing Die Kinder gaben der Lehrerin die Zeichnung; panel (a): search with disjunctive feature representation, panel (b): search with feature matrices.]

Figure 1: The representation of linguistic information has a direct impact on processing efficiency. The top figure shows a search tree when parsing the unambiguous utterance Die Kinder gaben der Lehrerin die Zeichnung ("The children gave the drawing to the (female) teacher") using disjunctive feature representation. The bottom figure shows the search tree using distinctive feature matrices. Labels in the boxes show the names of the applied constructions; boxes with a bold border are successful end nodes. Both grammars have been implemented in Fluid Construction Grammar (FCG; Steels, 2011, 2012a) and are processed using a standard depth-first search algorithm (Bleys et al., 2011) and general unification (without optimization for particular types or data structures; Steels and De Beule, 2006; De Beule, 2012). The utterance is assumed to be segmented into words. Interested readers can explore the Figure through an interactive web demonstration at http://www.fcg-net.org/demos/design-patterns/07-feature-matrices/.

3 Operationalizing German Case

An adequate operationalization of German case requires a bidirectional grammar (for parsing and production) and easy access to linguistic processing data. All experiments reported in this paper
ing data. All experiments reported in this paper case, the value for dative means that die can-
have therefore been implemented in Fluid Con- not be assigned dative case. We can do the same
struction Grammar (FCG; Steels, 2011, 2012a), a for Kinder (children), which can be nominative
unification-based grammar formalism that comes or accusative, but not dative:
equipped with an interactive web interface and
monitoring tools (Loetzsch, 2012). A second ad- (3) Kinder: nom ?nom

vantage of FCG is that it features strong bidirec- CASE
acc ?acc
tionality: the FCG-interpreter can achieve both dat
parsing and production using the same linguistic
inventory. Other feature structure platforms, such As demonstrated in Figure 1, disjunctive fea-
as the lkb-system (Copestake, 2002), require a ture representation would cause a split in the
separate parser and generator for formalizing bidi- search tree when unifying die and Kinder. Us-
rectional grammars, which make them less suited ing a feature matrix, however, the choice between
for substantiating the claims of this paper. a nominative and accusative reading can simply
be postponed until enough information from the
3.1 Distinctive Feature Matrix rest of the utterance is available. Unifying die and
German case has become the litmus test for Kinder yields the following feature structure:
demonstrating how well a feature-based grammar
formalism copes with multifunctionality, espe- (4) die Kinder: nom ?nom

cially since Ingria (1990) provocatively stated that CASE
acc ?acc
unification is not the best technique for handling dat
it. People have gone to great lengths to counter
Ingrias claim, especially within the HPSG frame- 3.2 A Three-Dimensional Matrix
work (e.g. Mller, 1999; Daniels, 2001; Sag, The German case paradigm is obviously more
2003), and various formalizations have been of- complex than the examples shown so far. Lets
fered for German case (Heinz and Matiasek, consider Table 1 again, but this time we replace
1994; Mller, 2001; Crysmann, 2005). However, every cell in the table by a variable. This leads to
these proposals either do not succeed in avoiding the following feature matrix for the German defi-
inefficient disjunctions or they require a complex nite articles:
double type hierarchy (Crysmann, 2005).
The experiments in this paper use a more Case SG-M SG-F SG-N PL
straightforward solution, called a distinctive fea- ?NOM ?n-s-m ?n-s-f ?n-s-n ?n-pl
ture matrix, which is based on an idea that was ?ACC ?a-s-m ?a-s-f ?a-s-n ?a-pl
first explored by Ingria (1990) and of which a ?DAT ?d-s-m ?d-s-f ?d-s-n ?d-pl
variation has recently also been proposed for ?GEN ?g-s-m ?g-s-f ?g-s-n ?g-pl
Lexical Functional Grammar (Dalrymple et al., Table 2: A distinctive feature matrix for German case.
2009). Instead of treating case as a single-valued
feature, it can be represented as an array of fea- Each cell in this matrix represents a specific
tures, as shown for the definite article die (ignor- feature bundle that collects the features case,
ing the genitive case for the time being): number, and person. For example, the variable
?n-s-m stands for nominative singular mascu-
(2) die: nom ?nom
line. Note that also the cases themselves have

CASE
acc ?acc their own variable (?nom, ?acc, ?dat and
dat ?gen). This allows us to single out a specific di-
mension of the matrix for constructions that only
The case feature includes a paradigm of three care about case distinctions, but abstract away
cases (nom, acc and dat), whose values can ei- from gender or number. Each linguistic item fills
ther be + or , or left unspecified through a in as much information as possible in this case
variable (indicated by a question mark). The two matrix. For example, Table 3 shows how the def-
variables ?nom and ?acc indicate that die can inite article die underspecifies its potential values
potentially be assigned nominative or accusative and rules out all other options through .

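The postponement behaviour of examples (2)-(4) can be sketched in a few lines of code. This is only an illustrative re-implementation of the idea, not the FCG machinery itself: case values are '+', '-' or None (standing in for an unbound variable such as ?nom), and unification succeeds as long as no cell has to be both '+' and '-'.

    # Minimal sketch of distinctive-feature-matrix unification for the case dimension.
    def unify_case(a, b):
        """Unify two case arrays (dicts case -> '+', '-' or None); return None on a clash."""
        result = {}
        for case in ("nom", "acc", "dat"):
            x, y = a.get(case), b.get(case)
            if x is None:
                result[case] = y
            elif y is None or x == y:
                result[case] = x
            else:                      # '+' against '-' is a real conflict
                return None
        return result

    die    = {"nom": None, "acc": None, "dat": "-"}   # example (2)
    kinder = {"nom": None, "acc": None, "dat": "-"}   # example (3), genitive ignored here
    print(unify_case(die, kinder))    # {'nom': None, 'acc': None, 'dat': '-'}: choice postponed
    # A later argument-structure construction can then bind one of the open values:
    print(unify_case(unify_case(die, kinder), {"nom": "+", "acc": "-", "dat": "-"}))

In the actual grammar the open cells are logic variables that can be co-indexed across items (as in Tables 3-5 below), which a plain None cannot express; the sketch only shows why the search tree need not branch on die Kinder.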
Case    SG-M    SG-F     SG-N    PL
?NOM    -       ?n-s-f   -       ?n-pl
?ACC    -       ?a-s-f   -       ?a-pl
?DAT    -       -        -       -
?GEN    -       -        -       -

Table 3: The feature matrix of die.

The feature matrix of Kinder ("children"), which underspecifies for nominative, accusative and genitive, is shown in Table 4. Notice, however, that the same variable names are used for both the column that singles out the case dimension as for the column of the plural feature bundles.

Case     SG-M    SG-F    SG-N    PL
?n-pl    -       -       -       ?n-pl
?a-pl    -       -       -       ?a-pl
-        -       -       -       -
?g-pl    -       -       -       ?g-pl

Table 4: The feature matrix of Kinder ("children").

Unification of die and Kinder can exploit these variable equalities for ruling out a singular value of the definite article. Likewise, the matrix of die rules out the genitive reading of Kinder, as illustrated in Table 5.

Case     SG-M    SG-F    SG-N    PL
?n-pl    -       -       -       ?n-pl
?a-pl    -       -       -       ?a-pl
-        -       -       -       -
-        -       -       -       -

Table 5: The feature matrix of die Kinder.

Argument structure constructions (Goldberg, 2006), such as the ditransitive, can then later assign either nominative or accusative case. The main advantage of feature matrices is that linguistic search only has to commit to specific feature values once sufficient information is available, so the search tree only splits when there is an actual ambiguity. Moreover, they can be handled using standard unification. Interested readers can consult van Trijp (2011) for a thorough description of the approach, as well as a discussion on how the FCG implementation differs from Ingria (1990) and Dalrymple et al. (2009).

4 Experiments

This section describes the experimental set-up and discusses the experimental results.

4.1 Three Paradigms

The experiments compare three different variants of the German definite article paradigm.

Standard German. The Standard German paradigm has been illustrated in Table 1 and its operationalization has been shown in section 3.2. The paradigm has been inherited without significant changes from Middle High German (1050-1350; Walshe, 1974) and features six different forms.

Old High German. The Old High German paradigm is the direct predecessor of the current paradigm of definite articles. It contained at least twelve distinct forms (depending on which variation is taken) that included gender distinctions in plural (Wright, 1906, p. 67). It also included one definite article that marked the now extinct instrumental case, which is ignored in this paper. The variant of the Old High German paradigm that has been implemented in the experiments is summarized in Table 6.

Case    SG-M    SG-F    SG-N
NOM     dër     diu     daz
ACC     dën     die     daz
DAT     dëmu    dëru    dëmu
GEN     dës     dëra    dës

Case    PL-M    PL-F    PL-N
NOM     die     deo     diu
ACC     die     deo     diu
DAT     dëm     dëm     dëm
GEN     dëro    dëro    dëro

Table 6: The Old High German definite article system.

Texas German. The third variant is an American-German dialect called Texas German (Boas, 2009a,b), which evolved a two-way case distinction between nominative and oblique. This type of case system, in which the accusative and dative case have collapsed, is also a common evolution in the Low German dialects (Shrier, 1965). The implemented paradigm of Texas German is shown in Table 7.
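For orientation, the three paradigms differ most visibly in how many distinct forms they spread over the same case-number-gender cells. The following sketch simply counts distinct forms per paradigm; the data are those of Tables 1 and 6 and of Table 7 below, the Old High German spellings follow the reading of Table 6 used here, and the flat list encoding is ours, not the paper's.

    # Distinct article forms per paradigm (cells ordered case by case).
    STANDARD = ["der", "die", "das", "die", "den", "die", "das", "die",
                "dem", "der", "dem", "den", "des", "der", "des", "der"]
    OLD_HIGH_GERMAN = ["dër", "diu", "daz", "dën", "die", "daz",
                       "dëmu", "dëru", "dëmu", "dës", "dëra", "dës",      # singular
                       "die", "deo", "diu", "die", "deo", "diu",
                       "dëm", "dëm", "dëm", "dëro", "dëro", "dëro"]        # plural
    TEXAS_GERMAN = ["der", "die", "das", "die", "den", "die", "den", "die"]  # NOM + ACC/DAT

    for name, forms in [("Standard German", STANDARD),
                        ("Old High German", OLD_HIGH_GERMAN),
                        ("Texas German", TEXAS_GERMAN)]:
        print(name, len(set(forms)), "distinct forms for", len(forms), "cells")

The counts of 6 and 12 distinct forms correspond to the figures the paper quotes for Modern and Old High German; the 4 forms for Texas German follow directly from Table 7.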
Case       SG-M    SG-F    SG-N    PL
NOM        der     die     das     die
ACC/DAT    den     die     den     die

Table 7: The Texas German definite article system.

4.2 Production and Parsing Tasks

Each grammar is tested as to how efficiently it can produce and parse utterances in terms of cognitive effort and search (see section 4.3). There are three basic types of utterances:

1. Ditransitive: NOM Verb DAT ACC

2. Transitive (a): NOM Verb ACC

3. Transitive (b): NOM Verb DAT

The argument roles are filled by noun phrases whose head nouns always have a distinct form for singular and plural (e.g. Mann vs. Männer; "man" vs. "men"), but that are unmarked for case. The combination of arguments is always unique along the dimensions of number and gender (three genders times two numbers, i.e. six possibilities per argument slot), which yields 6 × 6 × 6 = 216 unique utterance types for the ditransitive as follows:

(5) NOM.S.M V DAT.S.M ACC.S.M
    NOM.S.M V DAT.S.F ACC.S.M
    NOM.S.M V DAT.S.N ACC.S.M
    NOM.S.M V DAT.PL.M ACC.S.M
    etc.

In transitive utterances, there is an additional distinction based on animacy for noun phrases in the Object position of the utterance, which yields 72 types in the NOM-ACC configuration and 72 in the NOM-DAT configuration. Together, there are 360 unique utterance types. As can be gleaned from the utterance types, the genitive case is not considered by the experiments, as the genitive is not part of basic German argument structures and it has almost disappeared in most dialects of German (Shrier, 1965).

In production, the grammar is presented with a meaning that needs to be verbalized into an utterance. In parsing, the produced utterance has to be analyzed back into a meaning. Every utterance is processed using a full search, that is, all branches and solutions are calculated.

The experiments exploit types because there are three different language systems, hence it is impossible to use a single, real corpus and its token frequencies. It would also be unwarranted to use different corpora because corpus-specific biases would distort the comparative results. Secondly, as the experiments involve models of deep language processing (as opposed to stochastic models), the use of types instead of tokens is justified in this phase of the research: the first concern of precision grammars is descriptive adequacy, for which types are a more reliable source. Obviously, the effect of token frequency needs to be examined in future research.

4.3 Measuring Cognitive Effort

The experiments measure two kinds of cognitive effort: syntactic search and semantic ambiguity.

Search. The search measure counts the number of branches in the search process that reach an end node, which can either be a possible solution or a dead end (i.e. no constructions can be applied anymore). Duplicate nodes (for instance, nodes that use the same rules but in a different order) are not counted. The search measure is then used as a sanity check to verify whether the three different paradigms can be processed with the same efficiency in terms of search tree length, as hypothesized by this paper. More specifically, the following conditions have to be met:

1. In production, there should only be one branch.

2. In parsing, search has to be equal to the semantic effort.

The single branch constraint in production checks whether the definite articles are sufficiently distinct from one another. Since there is no ambiguity about which argument plays which role in the utterance, the grammar should only come up with one solution. In parsing, the number of branches has to correspond to real semantic ambiguities and not create additional search, as argued in section 2.2.

Semantic Ambiguity. Semantic ambiguity equals the number of possible interpretations of an utterance. For instance, the utterance Der Hund beißt den Mann ("the dog bites the man") is unambiguous in Modern High German,
since der Hund can only be nominative singular-masculine, and den Mann can only be accusative masculine-singular. There is thus only one possible interpretation in which the dog is the biter and the man is being bitten, illustrated as follows using a logic-based meaning representation (also see Steels, 2004, for this operationalization of cognitive effort):

(6) Interpretation 1: Der Hund beißt den Mann.
    dog(?a) ∧ bite(?ev) ∧ man(?b) ∧ biter(?ev, ?x) ∧ ?a=?x ∧ bitten(?ev, ?y) ∧ ?b=?y

However, an utterance such as die Katze beißt die Frau ("the cat bites the woman") is ambiguous because die has both a nominative and accusative singular-feminine reading:

(7) a. Interpretation 1: Die Katze beißt die Frau.
       cat(?a) ∧ bite(?ev) ∧ woman(?b) ∧ biter(?ev, ?x) ∧ ?a=?x ∧ bitten(?ev, ?y) ∧ ?b=?y

    b. Interpretation 2: Die Katze beißt die Frau.
       cat(?a) ∧ bite(?ev) ∧ woman(?b) ∧ biter(?ev, ?x) ∧ bitten(?ev, ?y) ∧ ?b=?x ∧ ?a=?y

Here, German speakers are likely to use word order, intonation and world knowledge (i.e. cats are more likely to bite a person than the other way round) for disambiguating the utterance.

4.4 Experimental Parameters

The experiments (E1-E4) concern the cue-reliability of the definite articles for disambiguating event structure. In all experiments, the different grammars can exploit the case-number-gender information of definite articles, and also the gender and number specifications of nouns, and the syntactic valence of verbs. For instance, the noun form Frauen ("women") is specified as plural-feminine, and verbs like helfen ("to help") are specified to take a dative object, whereas verbs like finden ("to find") take an accusative object. In other experiments, different combinations of grammatical cues become available or not:

Cue                      E1    E2    E3    E4
SV-agreement             -     +     -     +
Selection restrictions   -     -     +     +

SV-agreement restricts the subject to singular or plural nouns, and semantic selection restrictions can disambiguate utterances in which for example the Agent-role has to be animate (e.g. in perception verbs such as sehen "to see"). All other possible cues, such as word order, are ignored.

5 Results

5.1 Search

In all experiments, the constraints of the search measure were satisfied: every grammar only required one branch per utterance in production, and the number of branches in parsing never exceeded the number of possible interpretations. In terms of search length, more syncretism therefore does not automatically harm efficiency, provided that the grammar uses an adequate representation. Arguably, the smaller paradigms are even more efficient because they require less unifications to be performed.

5.2 Semantic Ambiguity

Now that it has been ascertained that more syncretism does not harm processing efficiency, we can compare cue-reliability of the different paradigms for semantic interpretation.

Ambiguous Utterances. Figure 2 shows the number of ambiguous utterances in parsing (in %) per paradigm and per set-up. As can be seen, the Old High German paradigm (black) is the most reliable cue in Experiment 1 (E1; when SV-agreement and selection restrictions are ignored) with 35.56% of ambiguous utterances, as opposed to 55.56% for Modern High German (grey) and 77.78% for Texas German (white).

When SV-agreement is taken into account (E2), the difference between Old and Modern High German becomes smaller, with both paradigms offering a reliability of more than 70%, while Texas German still faces more than 70% of ambiguous utterances.

Ambiguity is even more reduced when using semantic selection restrictions of the verb (set-up
E3). Here, the difference between Old and Modern High German becomes trivial with 4.44% and 6.94% of ambiguous utterances respectively. The difference with Texas German remains apparent, even though its ambiguity is cut by half.

In set-up E4 (case, SV-agreement and selection restrictions), the Old and Modern High German paradigms resolve almost all ambiguities, leaving little difference between them. Using the Texas German dialect, one utterance out of five remains ambiguous and requires additional grammatical cues or inferencing for semantic interpretation.

Number of possible interpretations. Semantic ambiguity can also be measured by counting the number of possible interpretations per utterance. A non-ambiguous language would thus have 1 possible interpretation per utterance. The average number of interpretations per utterance (per paradigm and per set-up) is shown in Table 8.

Paradigm              E1     E2     E3     E4
Old High German       1.56   1.22   1.04   1.03
Modern High German    1.56   1.28   1.07   1.04
Texas German          2.84   2.39   1.36   1.22

Table 8: Average number of interpretations per utterance type.

The Old High German paradigm has the least semantic ambiguity throughout, except in Experiment 1 (E1). Here, Modern High German has the same average effort despite having more ambiguous utterances. This means that the Old High German paradigm provides a better coverage in terms of construction types, but when ambiguity occurs, more possible interpretations exist.

6 Discussion

The experiments compare how well three different paradigms of definite articles perform if they are inserted in the grammar of Modern High German. The results show that, in isolation, Old High German offers the best cue-reliability for retrieving "who's doing what to whom" in events. However, when other grammatical cues are taken into account, it turns out that Modern High German achieves similar results with respect to syntactic search and semantic ambiguity, with a reduced paradigm (using only six instead of twelve forms).

As for the Texas German dialect, which has collapsed the accusative-dative distinction, the amount of ambiguity remains more than 20% using all available cues. One verifiable prediction of the experiments is therefore that this dialect should show an increase in alternative syntactic restrictions (such as word order) in order to make up for the lost case distinctions. Interestingly, such alternatives have been attested in Low German dialects that have evolved a similar two-way case system (Shrier, 1965). Modern High German, on the other hand, has already recruited word order for other purposes (such as information structure; Lenerz, 1977; Micelli, 2012), which may explain why the current paradigm has been able to survive since the Middle Ages.

Instead of an accidental by-product of phonological and morphological changes, then, a new picture emerges for explaining syncretism in Modern High German definite articles: German speakers have been able to reduce their case paradigm without loss in processing and interpretation efficiency. With cognitive effort as a selection criterion, subsequent generations of speakers found no linguistic pressures for maintaining particular distinctions such as gender in plural articles. Especially forms whose acoustic distinctions are harder to perceive are candidates for collapse if they are no longer functional for processing or interpretation. Other factors, such as frequency, may accelerate this evolution, as also argued by Barðdal (2009). For instance, there may be less benefits for upholding a case distinction for infrequent than for frequent forms.

If case syncretism is not randomly distributed over a grammatical paradigm, but rather functionally motivated, a new explanatory model is needed. One candidate is evolutionary linguistics (Steels, 2012b), a framework of cultural evolution in which populations of language users constantly shape and reshape their language in response to their communicative needs. The experiments reported here suggest that this dynamic shaping process is guided by the linguistic landscape of a language. For instance, the presence of grammatical cues such as gender, number and SV-agreement may encourage paradigm reduction. However, reduction may be the start of a self-enforcing loop in which the decreasing cue-reliability of a paradigm may pressure language users into enforcing the alternatives to take on even more of the cognitive load of processing.
[Figure 2: bar chart; x-axis: experimental set-ups E1-E4; y-axis: % of ambiguous utterances (0-100); series: Old High German, Modern High German, Texas German.]

Figure 2: This chart shows the number of ambiguous utterances per paradigm per E(xperimental set-up) in %.

The intricate interactions between grammatical systems also require more sophisticated measures. A promising extension of this paper could lie in an information-theoretic approach to language (Hale, 2003; Jaeger and Tily, 2011), which has recently explored a set of tools for assessing linguistic complexity, processing effort and uncertainty. Unfortunately, only little work has been done on morphological paradigms so far (see e.g. Ackerman et al., 2011), and the approach is typically applied in stochastic or Probabilistic Context Free Grammars, hence it remains unclear how the assumptions of this field fit into models of deep language processing.

7 Conclusions

More than 130 years after Mark Twain's complaints, it seems that the German language is not that awful after all. Through a series of computational experiments, this paper has proposed a different explanation for German case syncretism that answers some of the unsolved riddles of previous studies. First, the experiments have shown that an increase in syncretism does not necessarily lead to an increase in the cognitive effort required for syntactic search, provided that the representation of the grammar is processing-friendly. Secondly, by comparing cue-reliability of different paradigms for semantic disambiguation, the experiments have demonstrated that Modern High German achieves a similar performance as its Old High German predecessor using only half of the forms in its definite article paradigm.

Instead of a series of historical accidents, the German case system thus underwent a systematic and "performance-driven [...] morphological restructuring" (Hawkins, 2004, p. 79), in which linguistic pressures such as cognitive effort decided on the maintenance or loss of certain distinctions. The case study makes clear that formal and computational models of deep language understanding have to reconsider their strict division between competence and performance if the goal is to explain individual language development. This paper proposed that new tools and methodologies should be sought in evolutionary linguistics.

Acknowledgements

This research has been conducted at the Sony Computer Science Laboratory Paris. I would like to thank Luc Steels, director of Sony CSL Paris and the VUB AI-Lab of the University of Brussels, for his support and feedback. I also thank Hans Boas, Jóhanna Barðdal, Peter Hanappe, Manfred Hild and the anonymous reviewers for helping to improve this article. All errors remain of course my own.
References

Farrell Ackerman, James P. Blevins, and Robert Malouf. Parts and wholes: Implicative patterns in inflectional paradigms. In J.P. Blevins and J. Blevins, editors, Analogy in Grammar: Form and Acquisition, pages 54-81. Oxford University Press, Oxford, 2011.

Matthew Baerman. Case syncretism. In Andrej Malchukov and Andrew Spencer, editors, The Oxford Handbook of Case, chapter 14, pages 219-230. Oxford University Press, Oxford, 2009.

J. Barðdal. The development of case in Germanic. In J. Barðdal and S. Chelliah, editors, The Role of Semantics and Pragmatics in the Development of Case, pages 123-159. John Benjamins, Amsterdam, 2009.

Manfred Bierwisch. Syntactic features in morphology: General problems of so-called pronominal inflection in German. In To Honour Roman Jakobson, pages 239-270. Mouton De Gruyter, Berlin, 1967.

James Blevins. Syncretism and paradigmatic opposition. Linguistics and Philosophy, 18:113-152, 1995.

Joris Bleys, Kevin Stadler, and Joachim De Beule. Search in linguistic processing. In Luc Steels, editor, Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

Hans C. Boas. Case loss in Texas German: The influence of semantic and pragmatic factors. In J. Barðdal and S. Chelliah, editors, The Role of Semantics and Pragmatics in the Development of Case, pages 347-373. John Benjamins, Amsterdam, 2009a.

Hans C. Boas. The Life and Death of Texas German, volume 93 of Publication of the American Dialect Society. Duke University Press, Durham, 2009b.

David Carter. Efficient disjunctive unification for bottom-up parsing. In Proceedings of the 13th Conference on Computational Linguistics, pages 70-75. ACL, 1990.

Ann Copestake. Implementing Typed Feature Structure Grammars. CSLI Publications, Stanford, 2002.

Berthold Crysmann. Syncretism in German: A unified approach to underspecification, indeterminacy, and likeness of case. In Stefan Müller, editor, Proceedings of the 12th International Conference on Head-Driven Phrase Structure Grammar, pages 91-107, Stanford, 2005. CSLI Publications.

Mary Dalrymple, Tracy Holloway King, and Louisa Sadler. Indeterminacy by underspecification. Journal of Linguistics, 45:31-68, 2009.

Michael Daniels. On a type-based analysis of feature neutrality and the coordination of unlikes. In Proceedings of the 8th International Conference on HPSG, pages 137-147, Stanford, 2001. CSLI.

Joachim De Beule. A formal deconstruction of Fluid Construction Grammar. In Luc Steels, editor, Computational Issues in Fluid Construction Grammar. Springer Verlag, Berlin, 2012.

Daniel P. Flickinger. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15-28, 2000.

Jonathan Ginzburg and Ivan A. Sag. Interrogative Investigations: the Form, the Meaning, and Use of English Interrogatives. CSLI Publications, Stanford, 2000.

Adele E. Goldberg. Constructions At Work: The Nature of Generalization in Language. Oxford University Press, Oxford, 2006.

John T. Hale. The information conveyed by words in sentences. Journal of Psycholinguistic Research, 32(2):101-123, 2003.

John A. Hawkins. Efficiency and Complexity in Grammars. Oxford University Press, Oxford, 2004.

Bernd Heine and Tania Kuteva. Language Contact and Grammatical Change. Cambridge University Press, Cambridge, 2005.

Wolfgang Heinz and Johannes Matiasek. Argument structure and case assignment in German. In John Nerbonne, Klaus Netter, and Carl Pollard, editors, German in Head-Driven Phrase Structure Grammar, volume 46 of CSLI Lecture Notes, pages 199-236. CSLI Publications, Stanford, 1994.

R.J.P. Ingria. The limits of unification. In Proceedings of the 28th Annual Meeting of the ACL, pages 194-204, 1990.
T. Florian Jaeger and Harry Tily. On language utility: Processing complexity and communicative efficiency. WIREs: Cognitive Science, 2(3):323-335, 2011.

L. Karttunen. Features and values. In Proceedings of the 10th International Conference on Computational Linguistics, Stanford, 1984.

Jürgen Lenerz. Zur Abfolge nominaler Satzglieder im Deutschen. Narr, Tübingen, 1977.

Martin Loetzsch. Tools for grammar engineering. In Luc Steels, editor, Computational Issues in Fluid Construction Grammar. Springer Verlag, Berlin, 2012.

Vanessa Micelli. Field topology and information structure: A case study for German constituent order. In Luc Steels, editor, Computational Issues in Fluid Construction Grammar. Springer Verlag, Berlin, 2012.

Gereon Müller. Remarks on nominal inflection in German. In Ingrid Kaufmann and Barbara Stiebels, editors, More than Words: A Festschrift for Dieter Wunderlich, pages 113-145. Akademie Verlag, Berlin, 2002.

Stefan Müller. An HPSG-analysis for free relative clauses in German. Grammars, 2(1):53-105, 1999.

Stefan Müller. Case in German: towards an HPSG analysis. In Tibor Kiss and Detmar Meurers, editors, Constraint-Based Approaches to Germanic Syntax. CSLI, Stanford, 2001.

Allan Ramsay. Disjunction without tears. Computational Linguistics, 16(3):171-174, 1990.

Ivan A. Sag. Coordination and underspecification. In Jongbok Kim and Stephen Wechsler, editors, Proceedings of the Ninth International Conference on HPSG, Stanford, 2003. CSLI.

Ivan A. Sag and Thomas Wasow. Performance-compatible competence grammar. In Robert D. Borsley and Kersti Börjars, editors, Non-Transformational Syntax: Formal and Explicit Models of Grammar. Wiley-Blackwell, Oxford, 2011.

Martha Shrier. Case systems in German dialects. Language, 41(3):420-438, 1965.

Luc Steels. Constructivist development of grounded construction grammars. In Walter Daelemans, editor, Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 9-19, Barcelona, 2004.

Luc Steels, editor. Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

Luc Steels, editor. Computational Issues in Fluid Construction Grammar. Springer, Berlin, 2012a.

Luc Steels. Self-organization and selection in cultural language evolution. In Luc Steels, editor, Experiments in Cultural Language Evolution. John Benjamins, Amsterdam, 2012b.

Luc Steels and Joachim De Beule. Unify and merge in Fluid Construction Grammar. In P. Vogt, Y. Sugita, E. Tuci, and C. Nehaniv, editors, Symbol Grounding and Beyond, LNAI 4211, pages 197-223, Berlin, 2006. Springer.

Remi van Trijp. Feature matrices and agreement: A case study for German case. In Luc Steels, editor, Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam, 2011.

M. Walshe. A Middle High German Reader: With Grammar, Notes and Glossary. Oxford University Press, Oxford, 1974.

Bernd Wiese. Iconicity and syncretism. On pronominal inflection in Modern German. In Robin Sackmann, editor, Theoretical Linguistics and Grammatical Description, pages 323-344. John Benjamins, Amsterdam, 1996.

Joseph Wright. An Old High German Primer. Clarendon Press, Oxford, 2nd edition, 1906.

Dieter Wunderlich. Der unterspezifizierte Artikel. In Karl Heinz Ramers, Dürscheid, and Monika Schwarz, editors, Sprache im Fokus, pages 47-55. Niemeyer, Tübingen, 1997.
Managing Uncertainty in Semantic Tagging

Silvie Cinková, Martin Holub and Vincent Kríž

Charles University in Prague, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
{cinkova|holub}@ufal.mff.cuni.cz
vincent.kriz@gmail.com
Abstract

Low interannotator agreement (IAA) is a well-known issue in manual semantic tagging (sense tagging). IAA correlates with the granularity of word senses and they both correlate with the amount of information they give as well as with its reliability. We compare different approaches to semantic tagging in WordNet, FrameNet, PropBank and OntoNotes with a small tagged data sample based on the Corpus Pattern Analysis to present the reliable information gain (RG), a measure used to optimize the semantic granularity of a sense inventory with respect to its reliability indicated by the IAA in the given data set. RG can also be used as feedback for lexicographers, and as a supporting component of automatic semantic classifiers, especially when dealing with a very fine-grained set of semantic categories.

1 Introduction

The term semantic tagging is used in two divergent areas:

1) recognizing objects of semantic importance, such as entities, events and polarity, often tailored to a restricted domain, or

2) relating occurrences of words in a corpus to a lexicon and selecting the most appropriate semantic categories (such as synsets, semantic frames, word senses, semantic patterns or framesets).

We are concerned with the second case, which seeks to make lexical semantics tractable for computers. Lexical semantics, as opposed to propositional semantics, focuses on the meaning of lexical items. The disciplines that focus on lexical semantics are lexicology and lexicography rather than logic. By semantic tagging we mean a process of assigning semantic categories to target words in given contexts. This process can be either manual or automatic.

Traditionally, semantic tagging relies on the tacit assumption that various uses of polysemous words can be sorted into discrete senses; understanding or using an unfamiliar word would then be like looking it up in a dictionary. When building a dictionary entry for a given word, the lexicographer sorts a number of its occurrences into discrete senses present (or emerging) in his/her mental lexicon, which is supposed to be shared by all speakers of the same language. The assumed common mental representation of a word's meaning should make it easy for other humans to assign random occurrences of the word to one of the pre-defined senses (Fellbaum et al., 1997).

This assumption seems to be falsified by the interannotator agreement (IAA, sometimes ITA) constantly reported much lower in semantic than in morphological or syntactic annotation, as well as by the general divergence of opinion on which value of which IAA measure indicates a reliable annotation. In some projects (e.g. OntoNotes (Hovy et al., 2006)), the percentage of agreements between two annotators is used, but a number of more complex measures are available (for a comprehensive survey see (Artstein and Poesio, 2008)). Consequently, using different measures for IAA makes the reported IAA values incomparable across different projects.

Even skilled lexicographers have trouble selecting one discrete sense for a concordance (Krishnamurthy and Nicholls, 2000), and, more to say, when the tagging performance of lexicographers and ordinary annotators (students) was
compared, the experiment showed that the mental representations of a word's semantics differ for each group (Fellbaum et al., 1997), and cf. (Jorgensen, 1990). Lexicographers are trained in considering subtle differences among various uses of a word, which ordinary language users do not reflect. Identifying a semantic difference between uses of a word and deciding whether a difference is important enough to constitute a separate sense means presenting a word with a certain degree of semantic granularity. Intuitively, the finer the granularity of a word entry is, the more opportunities for interannotator disagreement there are and the lower IAA can be expected. Brown et al. proved this hypothesis experimentally (Brown et al., 2010). Also, the annotators are less confident in their decisions when they have many options to choose from (Fellbaum et al. (1998) reported a drop in subjective annotators' confidence in words with 8+ senses).

Despite all the known issues in semantic tagging, the major lexical resources (WordNet (Fellbaum, 1998), FrameNet (Ruppenhofer et al., 2010), PropBank (Palmer et al., 2005) and the word-sense part of OntoNotes (Weischedel et al., 2011)) are still maintained and their annotation schemes are adopted for creating new manually annotated data (e.g. MASC, the Manually Annotated Subcorpus (Ide et al., 2008)). More to say, these resources are not only used in WSD and semantic labeling, but also in research directions that in their turn do not rely on the idea of an inventory of discrete senses any more, e.g. in distributional semantics (Erk, 2010) and recognizing textual entailment (e.g. (Zanzotto et al., 2009) and (Aharon et al., 2010)).

It is a remarkable fact that, to the best of our knowledge, there is no measure that would relate granularity, reliability of the annotation (derived from IAA) and the resulting information gain. Therefore it is impossible to say where the optimum for granularity and IAA lies.

2 Approaches to semantic tagging

2.1 Semantic tagging vs. morphological or syntactic analysis

Manual semantic tagging is in many respects similar to morphological tagging and syntactic analysis: human annotators are trained to sort certain elements occurring in a running text according to a reference source. There is, nevertheless, a substantial difference: whereas morphologically or syntactically annotated data exist separately from the reference (tagset, annotation guide, annotation scheme), a semantically tagged resource can be regarded both as a corpus of texts disambiguated according to an attached inventory of semantic categories and as a lexicon with links to example concordances for each semantic category. So, in semantically tagged resources, the data and the reference are intertwined. Such double-faced semantic resources have also been called semantic concordances (Miller et al., 1993a). For instance, one of the earlier versions of WordNet, the largest lexical resource for English, was used in the semantic concordance SemCor (Miller et al., 1993b). More recent lexical resources have been built as semantic concordances from the very beginning (PropBank (Palmer et al., 2005), OntoNotes word senses (Weischedel et al., 2011)).

In morphological or syntactic annotation, the tagset or inventory of constituents are given beforehand and are supposed to hold for all tokens/sentences contained in the corpus. Problematic and theory-dependent issues are few and mostly well-known in advance. Therefore they can be reflected by a few additional conventions in the annotation manual (e.g. where to draw the line between particles and prepositions or between adjectives and verbs in past participles (Santorini, 1990), or where to attach a prepositional phrase following a noun phrase and how to treat specific "financialspeak" structures (Bies et al., 1995)). Even in difficult cases, there are hardly more than two options of interpretation. Data manually annotated for morphology or surface syntax are reliable enough to train syntactic parsers with an accuracy above 80% (e.g. (Zhang and Clark, 2011; McDonald et al., 2006)).

On the other hand, semantic tagging actually employs a different tagset for each word lemma. Even within the same part of speech, individual words require individual descriptions. Possible similarities among them come into relief ex post rather than that they could be imposed on the lexicographers from the beginning. When assigning senses to concordances, the annotator often has to select among more than two relevant options.
syntax tasks. In addition, while a linguistically educated annotator can have roughly the same idea of parts of speech as the author of the tagset, there is no chance that two humans (not even two professional lexicographers) would create identical entries for e.g. a polysemous verb. Any human evaluation of complete entries would be subjective. The maximum to be achieved is that the entry reflects the corpus data in a reasonably granular way on which annotators can still reach reasonable IAA.

2.2 Major existing semantic resources

The granularity vs. IAA equilibrium is of great concern in creating lexical resources as well as in applications dealing with semantic tasks. When WordNet (Fellbaum, 1998) was created, both IAA and subjective confidence measurements served as informal feedback to lexicographers (Fellbaum et al. (1998), p. 200). In general, WordNet has been considered a resource too fine-grained for most annotations (and applications). Navigli (2006) developed a method of reducing the granularity of WordNet by mapping the synsets to senses in a more coarse-grained dictionary. A manual, more coarse-grained grouping of WordNet senses has been performed in OntoNotes (Weischedel et al., 2011). The OntoNotes "90 % solution" (Hovy et al., 2006) actually means a degree of granularity that enables a 90 % IAA. OntoNotes is a reaction to the traditionally poor IAA in WordNet-annotated corpora, caused by the high granularity of senses. The quality of semantic concordances is maintained by numerous iterations between lexicographers and annotators. The categories right/wrong have been, for the purpose of the annotated linguistic resource, defined by the IAA score, which is, in OntoNotes, calculated as the percentage of agreements between two annotators.

Two other, somewhat different, lexical resources have to be mentioned to complete the picture: FrameNet (Ruppenhofer et al., 2010) and PropBank (Palmer et al., 2005). While WordNet and OntoNotes pair words and word senses in a way comparable to printed lexicons, FrameNet is primarily an inventory of semantic frames and PropBank focuses on the argument structure of verbs and nouns (NomBank (Meyers et al., 2008), a related project capturing the argument structure of nouns, was later integrated in OntoNotes).

In FrameNet corpora, content words are associated with the particular semantic frames that they evoke (e.g. "charm" would relate to the Aesthetics frame) and their collocates in relevant syntactic positions (arguments of verbs, head nouns of adjectives, etc.) would be assigned the corresponding frame-element labels (e.g. in "their dazzling charm", "their" would be the Entity for which a particular gradable Attribute is appropriate and under consideration, and "dazzling" would be Degree). Neither IAA nor granularity seem to be an issue in FrameNet. We have not succeeded in finding a report on IAA in the original FrameNet annotation, except one measurement in progress in the annotation of the Manually Annotated Subcorpus of English (Ide et al., 2008).¹

PropBank is a valency (argument structure) lexicon. The current resource lists and labels arguments and obligatory modifiers typical of each (very coarse) word sense (called a frameset). Two core criteria for distinguishing among framesets are the semantic roles of the arguments along with the syntactic alternations that the verb can undergo with that particular argument set. To keep granularity low, this lexicon, among other things, does usually not make special framesets for metaphoric uses. The overall IAA measured on verbs was 94 % (Palmer et al., 2005).

2.3 Semantic Pattern Recognition

From corpus-based lexicography to semantic patterns

The modern, corpus-based lexicology of the 1990s (Sinclair, 1991; Fillmore and Atkins, 1994) has had a great impact on lexicography. There is a general consensus that dictionary definitions need to be supported by corpus examples. Cf. Fellbaum (2001):

"For polysemous words, dictionaries [. . . ] do not say enough about the range of possible contexts that differentiate the senses. [. . . ] On the other hand, texts or corpora [. . . ] are not explicit about the word's meaning. When we first encounter a new word in a text, we can usually form only a vague idea of its meaning; checking a dictionary will clarify the meaning. But the more contexts we encounter for a word, the harder it is to match them against only one dictionary sense."

¹ Checked on the project web www.anc.org/MASC/Home, 2011-10-29.
The lexical description in modern English monolingual dictionaries (Sinclair et al., 1987; Rundell, 2002) explicitly emphasizes contextual clues, such as typical collocates and the syntactic surroundings of the given lexical item, rather than relying on very detailed definitions. In other words, in modern corpus-based lexicography the sense definitions are obtained as syntactico-semantic abstractions of manually clustered corpus concordances: in classical dictionaries as well as in semantic concordances.

Nevertheless, the word senses, even when obtained by a collective mind of lexicographers and annotators, are naturally hard-wired and tailored to the annotated corpus. They may be too fine-grained or too coarse-grained for automatic processing of different corpora (e.g. a restricted-domain corpus). Kilgarriff (1997, p. 115) shows (the "handbag" example) that there is no reason to expect the same set of word senses to be relevant for different tasks and that the corpus dictates the word senses; therefore word sense "was not found to be sufficiently well-defined to be a workable basic unit of meaning" (p. 116). On the other hand, even non-experts seem to agree reasonably well when judging the similarity of use of a word in different contexts (Rumshisky et al., 2009). Erk et al. (2009) showed promising annotation results with a scheme that allowed the annotators graded judgments of similarity between two words or between a word and its definition.

Verbs are the most challenging part of speech. We see two major causes: vagueness and coercion. We neglect ambiguity, since it has proved to be rare in our experience.

CPA and PDEV

Our current work focuses on English verbs. It has been inspired by the manual Corpus Pattern Analysis method (CPA) (Hanks, forthcoming) and its implementation, the Pattern Dictionary of English Verbs (PDEV) (Hanks and Pustejovsky, 2005). PDEV is a semantic concordance built on yet a different principle than FrameNet, WordNet, PropBank or OntoNotes. The manually extracted patterns of frequent and normal verb uses are, roughly speaking, intuitively similar uses of a verb that express, in a syntactically similar form, a similar event in which similar participants (e.g. humans, artifacts, institutions, other events) are involved. Two patterns can be semantically so tightly related that they could appear together under one sense in a traditional dictionary. The patterns are not senses but syntactico-semantically characterized prototypes (see the example verb submit in Table 1). Concordances that match these prototypes well are called norms in Hanks (forthcoming). Concordances that match with a reservation (metaphorical uses, argument mismatch, etc.) are called exploitations. The PDEV corpus annotation indicates the norm/exploitation status for each concordance.

Compared to other semantic concordances, the granularity of PDEV is high and thus discouraging in terms of expected IAA. However, selecting among patterns does not really mean disambiguating a concordance but rather determining to which pattern it is most similar, a task easier for humans than WSD is. This principle seems particularly promising for verbs as words expressing events, which resist the traditional word sense disambiguation the most.
No. 1
Pattern: [[Human 1 | Institution 1] [Human 1 | Institution 1 = Competitor]] submit [[Plan | Document | Speech Act | Proposition | {complaint | demand | request | claim | application | proposal | report | resignation | information | plea | petition | memorandum | budget | amendment | programme | ...}] [Artifact | Artwork | Service | Activity | {design | tender | bid | entry | dance | ...}]] (({to} Human 2 | Institution 2 = authority) ({to} Human 2 | Institution 2 = referee)) ({for} {approval | discussion | arbitration | inspection | designation | assessment | funding | taxation | ...})
Implicature: [[Human 1 | Institution 1]] presents [[Plan | Document]] to [[Human 2 | Institution 2]] for {approval | discussion | arbitration | inspection | designation | assessment | taxation | ...}

No. 2
Pattern: [Human | Institution] submit [THAT-CL | QUOTE]
Implicature: [[Human | Institution]] respectfully expresses {that [CLAUSE]} and invites listeners or readers to accept that {that [CLAUSE]} is true

No. 4
Pattern: [Human 1 | Institution 1] submit (Self) ({to} Human 2 | Institution 2)
Implicature: [[Human 1 | Institution 1]] acknowledges the superior force of [[Human 2 | Institution 2]] and puts [[Self]] in the power of [[Human 2 | Institution 2]]

No. 5
Pattern: [Human 1] submit (Self) [[{to} Eventuality = Unpleasant] [{to} Rule]]
Implicature: [[Human 1]] accepts [[Rule | Eventuality = Unpleasant]] without complaining

No. 6 [passive]
Pattern: [Human | Institution] submit [Anything] [{to} Eventuality]
Implicature: [[Human 1 | Institution 1]] exposes [[Anything]] to [[Eventuality]]

Table 1: Example of patterns defined for the verb submit.
A novel approach to semantic tagging

We present semantic pattern recognition as a novel approach to semantic tagging, which is different from the traditional word-sense assignment tasks. We adopt the central idea of CPA that words do not have fixed senses but that regular patterns can be identified in the corpus that activate different conversational implicatures from the meaning potential of the given verb. Our method draws on a hard-wired, fine-grained inventory of semantic categories manually extracted from corpus data. This inventory represents the maximum semantic granularity that humans are able to recognize in normal and frequent uses of a verb in a balanced corpus. We thoroughly analyze the interannotator agreement to find out which of the highly granular semantic categories are useful in the sense of information gain. Our goal is a dynamic optimization of semantic granularity with respect to given data and target application.

Like Passonneau et al. (2010), we are convinced that IAA is specific to each respective word and reflects its inherent semantic properties as well as the specificity of the contexts the given word occurs in, even within the same balanced corpus. We accept as a matter of fact that interannotator confusion is inevitable in semantic tagging. However, the amount of uncertainty of the right tag differs a lot, and should be quantified. For that purpose we developed the reliable information gain measure presented in Section 3.2.

CPA Verb Validation Sample

The original PDEV had never been tested with respect to IAA. Each entry had been based on concordances annotated solely by the author of that particular entry. The annotation instructions had been transmitted only orally. The data had been evolving along with the method, which implied inconsistencies. We put down an annotation manual (a momentary snapshot of the theory) and trained three annotators accordingly. For practical annotation we use the infrastructure developed at Masaryk University in Brno (Horak et al., 2008), which was also used for the original PDEV development. After initial IAA experiments with the original PDEV, we decided to select 30 verb entries from PDEV along with the annotated concordances. We made a new semantic concordance sample (Cinkova et al., 2012) for the validation of the annotation scheme. We refer to this new collection² as VPS-30-En (Verb Pattern Sample, 30 English verbs).

We slightly revised some entries and updated the reference samples (usually 250 concordances per verb). The annotators were given the entries as well as the reference sample annotated by the lexicographer and a test sample of 50 concordances for annotation. We measured IAA using Fleiss's kappa,³ and analyzed the interannotator confusion manually. IAA varied from verb to verb, mostly reaching safely above 0.6. When the IAA was low and the type of confusion indicated a problem in the entry, the entry was revised. Then the lexicographer revised the original reference sample along with the first 50-concordance sample. The annotators got back the revised entry, the newly revised reference sample and an entirely new 50-concordance annotation batch. The final multiple 50-concordance sample went through one more additional procedure, the adjudication: first, the lexicographer compared the three annotations and eliminated evident errors. Then the lexicographer selected one value for each concordance to remain in the resulting one-value-per-concordance gold standard data and recorded it into the gold standard set.

² This new lexical resource, including the complete documentation, is publicly available at http://ufal.mff.cuni.cz/spr.
³ Fleiss's kappa (Fleiss, 1971) is a generalization of Scott's π statistic (Scott, 1955). In contrast to Cohen's kappa (Cohen, 1960), Fleiss's kappa evaluates agreement between multiple raters. However, Fleiss's kappa is not a generalization of Cohen's kappa, which is a different, yet related, statistical measure. Sometimes the terminology about kappas is confusing in the literature. For a detailed explanation refer e.g. to (Artstein and Poesio, 2008).
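For reference, the agreement statistic used above can be computed with the standard textbook formulation of Fleiss's kappa; the sketch below is not the authors' code, and the variable names are only illustrative.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa. counts[i, j] = number of annotators who assigned
    category j to instance i; every row must sum to the same number of raters m."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    m = counts[0].sum()                                   # raters per instance
    p_j = counts.sum(axis=0) / (n_items * m)              # category proportions
    p_i = (np.square(counts).sum(axis=1) - m) / (m * (m - 1))
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()         # observed vs. chance agreement
    return (p_bar - p_e) / (1 - p_e)
```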
The adjudication protocol has been kept for further experiments. All values except the marked errors are regarded as equally acceptable for this type of experiment. In the end, we get for each verb:

- an entry, which is an inventory of semantic categories (patterns)
- 300+ manually annotated concordances (single values)
- out of which 50 are manually annotated and adjudicated concordances (multiple values without evident errors).

3 Tagging confusion analysis

3.1 Formal model of tagging confusion

To formally describe the semantic tagging task, we assume a target word and a (randomly selected) corpus sample of its occurrences. The tagged sample is $S = \{s_1, \ldots, s_r\}$, where each instance $s_i$ is an occurrence of the target word with its context, and $r$ is the sample size.

For multiple annotation we need a set of $m$ annotators $A = \{A_1, \ldots, A_m\}$ who choose from a given set of semantic categories represented by a set of $n$ semantic tags $T = \{t_1, \ldots, t_n\}$. Generally, if we admitted assigning more tags to one word occurrence, annotators could assign any subset of $T$ to an instance. In our experiments, however, annotators were allowed to assign just one tag to each tagged instance. Therefore each annotator is described as a function that assigns a single-member set to each instance, $A_i(s) = \{t\}$, where $s \in S$, $t \in T$. When a pair of annotators tag an instance $s$, they produce a set of one or two different tags $\{t, t'\} = A_i(s) \cup A_j(s)$.

Detailed information about interannotator (dis)agreement on a given sample $S$ is represented by a set of $\binom{m}{2}$ symmetric matrices

$$C^{A_k A_l}_{ij} = |\{s \in S \mid A_k(s) \cup A_l(s) = \{t_i, t_j\}\}|,$$

for $1 \le k < l \le m$ and $i, j \in \{1, \ldots, n\}$. Note that each of those matrices can be easily computed as $C^{A_k A_l} = C + C^{T} - I_n \circ C$, where $C$ is a conventional confusion matrix representing the agreement between annotators $A_k$ and $A_l$, and $I_n$ is a unit matrix.

Definition: Aggregated Confusion Matrix (ACM)

$$C^{\star} = \sum_{1 \le k < l \le m} C^{A_k A_l}.$$

Properties: The ACM is symmetric, and for any $i \neq j$ the number $C^{\star}_{ij}$ says how many times a pair of annotators disagreed on the two tags $t_i$ and $t_j$, while $C^{\star}_{ii}$ is the frequency of agreements on $t_i$; the sum in the $i$-th row, $\sum_j C^{\star}_{ij}$, is the total frequency of assigned sets $\{t, t'\}$ that contain $t_i$.

An example of an ACM is given in Table 2. The corresponding confusion matrices are shown in Table 3.

        1    1.a    2    4    5
1      85     8     2    0    0
1.a     8     1     2    0    0
2       2     2    34    0    0
4       0     0     0    4    8
5       0     0     0    8    6

Table 2: Aggregated Confusion Matrix.
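The ACM can be built directly from the raw annotations (rather than via the $C + C^{T} - I_n \circ C$ identity); the following Python sketch, with hypothetical variable names, assumes exactly one tag per annotator per instance as in the model above.

```python
from itertools import combinations
import numpy as np

def aggregated_confusion_matrix(annotations, tags):
    """Compute the ACM C* from per-annotator tag assignments.

    annotations: list of length m; annotations[k][j] is the tag that
                 annotator A_k assigned to instance s_j.
    tags:        list of the n distinct tags t_1..t_n (fixed order).
    """
    index = {t: i for i, t in enumerate(tags)}
    n = len(tags)
    acm = np.zeros((n, n), dtype=int)
    for a_k, a_l in combinations(annotations, 2):   # every annotator pair
        for tag_k, tag_l in zip(a_k, a_l):          # every instance s_j
            i, j = index[tag_k], index[tag_l]
            if i == j:
                acm[i, i] += 1                      # agreement on t_i
            else:
                acm[i, j] += 1                      # disagreement {t_i, t_j},
                acm[j, i] += 1                      # stored symmetrically
    return acm

# toy usage: three annotators, four instances, tagset {"1", "2"}
acm = aggregated_confusion_matrix(
    [["1", "1", "2", "2"], ["1", "2", "2", "2"], ["1", "1", "2", "1"]],
    ["1", "2"])
```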
Our approach to exact tagging confusion analysis is based on probability and information theory. Assigning semantic tags by annotators is viewed as a random process. We define a (categorical) random variable $T_1$ as the outcome of one annotator; its values are single-member sets $\{t\}$, and we have $mr$ observations to compute their probabilities. The probability that an annotator will use $t_i$ is denoted by $p_1(t_i) = \Pr(T_1 = \{t_i\})$ and is practically computed as the relative frequency of $t_i$ among all $mr$ assigned tags. Formally,

$$p_1(t_i) = \frac{1}{mr} \sum_{k=1}^{m} \sum_{j=1}^{r} |A_k(s_j) \cap \{t_i\}|.$$

The outcome of two annotators (they both tag the same instance) is described by the random variable $T_2$; its values are single- or double-member sets $\{t, t'\}$, and we have $\binom{m}{2} r$ observations to compute their probabilities. In contrast to $p_1$, the probability that $t_i$ will be used by a pair of annotators is denoted by $p_2(t_i) = \Pr(T_2 \supseteq \{t_i\})$, and is computed as the relative frequency of assigned sets $\{t, t'\}$ containing $t_i$ among all $\binom{m}{2} r$ observations:

$$p_2(t_i) = \frac{1}{\binom{m}{2} r} \sum_{k} C^{\star}_{ik}.$$

We also need the conditional probability that an annotator will use $t_i$ given that another annotator has used $t_j$. For convenience, we use the notation $p_2(t_i \mid t_j) = \Pr(T_2 \supseteq \{t_i\} \mid T_2 \supseteq \{t_j\})$.
A1 vs. A2:
      1   1.a    2    4    5
1    29     1    1    0    0
1.a   0     1    0    0    0
2     0     1   11    0    0
4     0     0    0    2    0
5     0     0    0    3    1

A1 vs. A3:
      1   1.a    2    4    5
1    29     2    0    0    0
1.a   1     0    0    0    0
2     0     0   12    0    0
4     0     0    0    1    1
5     0     0    0    0    4

A2 vs. A3:
      1   1.a    2    4    5
1    27     2    0    0    0
1.a   2     0    1    0    0
2     1     0   11    0    0
4     0     0    0    1    4
5     0     0    0    0    1

Table 3: Example of all confusion matrices for the target word submit and three annotators.

Obviously, it can be computed as

$$p_2(t_i \mid t_j) = \frac{\Pr(T_2 = \{t_i, t_j\})}{\Pr(T_2 \supseteq \{t_j\})} = \frac{C^{\star}_{ij}}{\binom{m}{2} r \, p_2(t_j)} = \frac{C^{\star}_{ij}}{\sum_k C^{\star}_{jk}}.$$

Definition: Confusion Probability Matrix (CPM)

$$C^{p}_{ji} = p_2(t_i \mid t_j) = \frac{C^{\star}_{ij}}{\sum_k C^{\star}_{jk}}.$$

Properties: The sum in any row is 1. The $j$-th row of the CPM contains the probabilities of assigning $t_i$ given that another annotator has chosen $t_j$ for the same instance. Thus, the $j$-th row of the CPM describes the expected tagging confusion related to the tag $t_j$.

An example is given in Table 3 (all confusion matrices for three annotators), in Table 2 (the corresponding ACM), and in Table 4 (the corresponding CPM).

        1      1.a     2      4      5
1     0.895  0.084  0.021  0.000  0.000
1.a   0.727  0.091  0.182  0.000  0.000
2     0.053  0.053  0.895  0.000  0.000
4     0.000  0.000  0.000  0.333  0.667
5     0.000  0.000  0.000  0.571  0.429

Table 4: Example of Confusion Probability Matrix.
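Given the ACM, the quantities $p_1$, $p_2$ and the CPM follow directly from the definitions above. The sketch below is a minimal illustration (helper and variable names are hypothetical); the derivation of $p_1$ from the ACM alone is our own inference from the definitions, not a formula stated in the text.

```python
import numpy as np
from math import comb

def tag_probabilities(acm, m, r):
    """Return p1, p2 and the CPM derived from an aggregated confusion matrix.

    acm: symmetric n x n aggregated confusion matrix C*
    m:   number of annotators, r: sample size
    (assumes every tag occurs at least once in the sample)
    """
    pairs = comb(m, 2)                        # number of annotator pairs
    row_sums = acm.sum(axis=1).astype(float)  # sum_k C*_{ik}
    p2 = row_sums / (pairs * r)               # p2(t_i) = Pr(T2 contains t_i)
    # Single uses of t_i: each assignment appears in (m-1) pairs, and an
    # agreement cell on the diagonal corresponds to two assignments.
    uses = (row_sums + acm.diagonal()) / (m - 1)
    p1 = uses / (m * r)
    # CPM: row j holds p2(t_i | t_j) = C*_{ji} / sum_k C*_{jk}
    cpm = acm / row_sums[:, None]
    return p1, p2, cpm
```

Applied to the ACM in Table 2 (with m = 3 and r = 50), the last line reproduces the CPM of Table 4, and the recovered per-tag use counts match the tag frequencies later listed in Table 5.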
3.2 Semantic granularity optimization

Now, having a detailed analysis of the expected tagging confusion described in the CPM, we are able to compare the usefulness of different semantic tags using a measure of the information content associated with them (in the information-theoretic sense). Traditionally, the amount of self-information contained in a tag (as a probabilistic event) depends only on the probability of that tag, and would be defined as $I(t_j) = -\log p_1(t_j)$. However, intuitively one can say that a good measure of usefulness of a particular tag should also take into consideration the expected tagging confusion related to the tag. Therefore, to exactly measure the usefulness of the tag $t_j$ we propose to compare and measure the similarity of the distribution $p_1(t_i)$ and the distribution $p_2(t_i \mid t_j)$, $i = 1, \ldots, n$.

How much information do we gain when an annotator assigns the tag $t_j$ to an instance? When the tag $t_j$ has once been assigned to an instance by an annotator, one would naturally expect that another annotator will probably tend to assign the same tag $t_j$ to the same instance. Formally, things make good sense if $p_2(t_j \mid t_j) > p_1(t_j)$ and if $p_2(t_i \mid t_j) < p_1(t_i)$ for any $i$ different from $j$. If $p_2(t_j \mid t_j) = 100\,\%$, then there is full consensus about assigning $t_j$ among annotators; then and only then should the measure of usefulness of the tag $t_j$ be maximal, with the value $-\log p_1(t_j)$. Otherwise, the value of usefulness should be smaller. This is our motivation for defining a quantity of reliable information gain obtained from semantic tags as follows:

Definition: Reliable Gain (RG) from the tag $t_j$ is

$$RG(t_j) = \sum_{k} (-1)^{[k \neq j]} \, p_2(t_k \mid t_j) \log \frac{p_2(t_k \mid t_j)}{p_1(t_k)},$$

where $[k \neq j]$ equals 1 if $k \neq j$ and 0 otherwise.

Properties: RG is similar to the well-known Kullback-Leibler divergence (or information gain). If $p_2(t_i \mid t_j) = p_1(t_i)$ for all $i = 1, \ldots, n$, then $RG(t_j) = 0$. If $p_2(t_j \mid t_j) = 100\,\%$, then and only then $RG(t_j) = -\log p_1(t_j)$, which is the maximum. If $p_2(t_i \mid t_j) < p_1(t_i)$ for all $i$ different from $j$, the greater the difference in probabilities, the bigger (and positive) $RG(t_j)$. And vice versa, the inequality $p_2(t_i \mid t_j) > p_1(t_i)$ for all $i$ different from $j$ implies a negative value of $RG(t_j)$.
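A direct transcription of the RG definition into code is short; the sketch below continues the hypothetical helpers above (p1 and the CPM as numpy arrays) and follows the sign convention of the definition. The logarithm base is left as an illustrative choice, since the text does not fix it.

```python
import numpy as np

def reliable_gain(p1, cpm, j, eps=1e-12):
    """Reliable Gain RG(t_j): a KL-like sum whose terms for k != j flip sign."""
    rg = 0.0
    for k in range(len(p1)):
        p = cpm[j, k]                       # p2(t_k | t_j)
        if p <= eps:
            continue                        # treat 0 * log(0/x) as 0
        term = p * np.log2(p / p1[k])
        rg += term if k == j else -term     # (-1)^[k != j]
    return rg
```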
Definition: Average Reliable Gain (ARG) from the tagset $\{t_1, \ldots, t_n\}$ is computed as an expected value of $RG(t_j)$:

$$ARG = \sum_{j} p_1(t_j) \, RG(t_j).$$

Properties: ARG has its maximum value if the CPM is a unit matrix, which is the case of absolute agreement among all annotators. Then ARG has the value of the entropy of the $p_1$ distribution: $ARG_{max} = H(p_1(t_1), \ldots, p_1(t_n))$.

Merging tags with poor RG score

The main motivation for developing the ARG value was the optimization of tagset granularity. We use a semi-greedy algorithm that searches for an optimal tagset. The optimization process starts with the fine-grained list of CPA semantic categories, and the algorithm then merges some tags in order to maximize the ARG value. An example is given in Table 5. Tables 6 and 7 show the ACM and the CPM after merging. The examples relate to the verb submit already shown in Tables 1, 2, 3 and 4.

Original tagset            Optimal merge
Tag    f     RG            Tag       f     RG
1      90   +0.300         1 + 1.a   96   +0.425
1.a     6   -0.001
2      36   +0.447         2         36   +0.473
4       8   -0.071         4 + 5     18   +0.367
5      10   -0.054

Table 5: Frequency and Reliable Gain of tags.

     1    2    4
1   94    4    0
2    4   34    0
4    0    0   18

Table 6: Aggregated Confusion Matrix after merging.

      1      2      4
1   0.959  0.041  0.000
2   0.105  0.895  0.000
4   0.000  0.000  1.000

Table 7: Confusion Probability Matrix after merging.
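The text does not spell out the merge search itself; the following is a minimal sketch of one plausible greedy variant, reusing the hypothetical tag_probabilities and reliable_gain helpers sketched earlier. It repeatedly merges the pair of tags whose merge most improves ARG, stopping when no merge helps.

```python
import numpy as np
from itertools import combinations

def average_reliable_gain(acm, m, r):
    """ARG of the tagset described by an aggregated confusion matrix."""
    p1, _, cpm = tag_probabilities(acm, m, r)
    return sum(p1[j] * reliable_gain(p1, cpm, j) for j in range(len(p1)))

def merge_tags(acm, i, j):
    """Collapse tags i and j of a symmetric ACM into one tag (kept at index i)."""
    a = acm.astype(float).copy()
    a[i, :] += a[j, :]
    a[:, i] += a[:, j]
    # the old disagreement cell C[i,j] was added twice above; after merging
    # it counts as an agreement exactly once
    a[i, i] -= acm[i, j]
    return np.delete(np.delete(a, j, axis=0), j, axis=1)

def greedy_merge(acm, tags, m, r):
    """Greedily merge tag pairs while the merge improves ARG."""
    groups = [{t} for t in tags]
    best = average_reliable_gain(acm, m, r)
    improved = True
    while improved and len(groups) > 1:
        improved = False
        scored = [(average_reliable_gain(merge_tags(acm, i, j), m, r), i, j)
                  for i, j in combinations(range(len(groups)), 2)]
        arg, i, j = max(scored)
        if arg > best:
            best, acm = arg, merge_tags(acm, i, j)
            groups[i] |= groups.pop(j)
            improved = True
    return groups, acm, best
```

On the ACM of Table 2, collapsing 1 with 1.a and 4 with 5 yields exactly the merged matrix of Table 6, so the merge operation itself is grounded in the published example; the search strategy around it is only an assumption.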

3.3 Classifier evaluation with respect to expected tagging confusion

An automatic classifier is considered to be a function $c$ that, the same way as annotators, assigns tags to instances $s \in S$, so that $c(s) = \{t\}$, $t \in T$. The traditional way to evaluate the accuracy of an automatic classifier is to compare its output with the correct semantic tags on a Gold Standard (GS) dataset. Within our formal framework, we can imagine that we have a gold annotator $A_g$, so that the GS dataset is represented by $A_g(s_1), \ldots, A_g(s_r)$. Then the classic accuracy can be computed as $\frac{1}{r} \sum_{i=1}^{r} |A_g(s_i) \cap c(s_i)|$.

However, that approach does not take into consideration the fact that some semantic tags are quite confusing even for human annotators. In our opinion, an automatic classifier should not be penalized for mistakes that would be made even by humans. So we propose a more complex evaluation score using the knowledge of the expected tagging confusion stored in the CPM.

Definition: Classifier evaluation Score with respect to tagging confusion is defined as the proportion $Score(c) = S(c)/S_{max}$, where

$$S(c) = \frac{\alpha}{r} \sum_{i=1}^{r} |A_g(s_i) \cap c(s_i)| + \frac{1-\alpha}{r} \sum_{i=1}^{r} p_2(c(s_i) \mid A_g(s_i)),$$

$$S_{max} = \alpha + \frac{1-\alpha}{r} \sum_{i=1}^{r} p_2(A_g(s_i) \mid A_g(s_i)).$$

Verb      rank  Score (α=1)   rank  Score (α=0.5)   rank  Score (α=0)
halt       1    0.84           2    0.90             4    0.81
submit     2    0.83           1    0.90             1    0.84
ally       3    0.82           3    0.89             5    0.76
cry        4    0.79           4    0.88             2    0.82
arrive     5    0.74           5    0.85             3    0.81
plough     6    0.70           6    0.81             6    0.72
deny       7    0.62           7    0.74             7    0.66
cool       8    0.58           8    0.69             8    0.53
yield      9    0.55           9    0.67             9    0.52

Table 8: Evaluation with different values of α.
Table 8 gives an illustration of the fact that, using different α values, one can get different results when comparing tagging accuracy for different words (a classifier based on a bag-of-words approach was used). The same holds true for the comparison of different classifiers.
parison of different classifiers.
3.4 Related work

In their extensive survey article, Artstein and Poesio (2008) state that word sense tagging is one of the hardest annotation tasks. They assume that making distinctions between semantic categories must rely on a dictionary. The problem is that annotators often cannot consistently make the fine-grained distinctions proposed by trained lexicographers, which is particularly serious for verbs, because verbs generally tend to be polysemous rather than homonymous.

A few approaches have been suggested in the literature that address the problem of fine-grained semantic distinctions by (automatically) measuring sense distinguishability. Diab (2004) computes sense perplexity using the entropy function as a characteristic of training data. She also compares the sense distributions to obtain sense distributional correlation, which can serve as a very good direct indicator of performance ratio, especially together with sense context confusability (another indicator observed in the training data). Resnik and Yarowsky (1999) introduced the communicative/semantic distance between the predicted sense and the correct sense. They then use it for an evaluation metric that provides partial credit for incorrectly classified instances. Cohn (2003) introduces the concept of (non-uniform) misclassification costs. He makes use of the communicative/semantic distance and proposes a metric for evaluating word sense disambiguation performance using the Receiver Operating Characteristics curve that takes the misclassification costs into account. Bruce and Wiebe (1998) analyze the agreement among human judges for the purpose of formulating a refined and more reliable set of sense tags. Their method is based on the statistical analysis of interannotator confusion matrices. An extended study is given in (Bruce and Wiebe, 1999).

4 Conclusion

The usefulness of a semantic resource depends on two aspects:

- reliability of the annotation
- information gain from the annotation.

In practice, each semantic resource emphasizes one aspect: OntoNotes, e.g., guarantees reliability, whereas the WordNet-annotated corpora seek to convey as much semantic nuance as possible. To the best of our knowledge, there has been no exact measure for this optimization, and the usefulness of a given resource can only be assessed when it is finished and used in applications. We propose the reliable information gain, a measure based on information theory and on the analysis of interannotator confusion matrices for each word entry, that can be continually applied during the creation of a semantic resource, and that provides automatic feedback about the granularity of the tagset used. Moreover, the computed information about the amount of expected tagging confusion is also used in the evaluation of automatic classifiers.

Acknowledgments

This work has been supported by the Czech Science Foundation projects GK103/12/G084 and P406/2010/0875 and partly by the project EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic). We thank our friends from Masaryk University in Brno for providing the annotation infrastructure and for their permanent technical support. We thank Patrick Hanks for his CPA method, for the original PDEV development, and for numerous discussions about the semantics of English verbs. We also thank three anonymous reviewers for their valuable comments.
References

Roni Ben Aharon, Idan Szpektor, and Ido Dagan. 2010. Generating entailment rules from FrameNet. In Proceedings of the ACL 2010 Conference Short Papers, pages 241–246, Uppsala, Sweden.
Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596, December.
Ann Bies, Mark Ferguson, Karen Katz, Robert MacIntyre, Victoria Tredinnick, Grace Kim, Mary Ann Marcinkiewicz, and Britta Schasberger. 1995. Bracketing guidelines for Treebank II style. Technical report, University of Pennsylvania.
Susan Windisch Brown, Travis Rood, and Martha Palmer. 2010. Number or nuance: Which factors restrict reliable word sense annotation? In LREC, pages 3237–3243. European Language Resources Association (ELRA).
Rebecca F. Bruce and Janyce M. Wiebe. 1998. Word-sense distinguishability and inter-coder agreement. In Proceedings of the Third Conference on Empirical Methods in Natural Language Processing (EMNLP-98), pages 53–60, Granada, Spain, June.
Rebecca F. Bruce and Janyce M. Wiebe. 1999. Recognizing subjectivity: A case study of manual tagging. Natural Language Engineering, 5(2):187–205.
Silvie Cinkova, Martin Holub, Adam Rambousek, and Lenka Smejkalova. 2012. A database of semantic clusters of verb usages. In Proceedings of the LREC 2012 International Conference on Language Resources and Evaluation. To appear.
Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.
Trevor Cohn. 2003. Performance metrics for word sense disambiguation. In Proceedings of the Australasian Language Technology Workshop 2003, pages 86–93, Melbourne, Australia, December.
Mona T. Diab. 2004. Relieving the data acquisition bottleneck in word sense disambiguation. In Proceedings of the 42nd Annual Meeting of the ACL, pages 303–310, Barcelona, Spain. Association for Computational Linguistics.
Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2009. Investigations on word senses and word usages. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 10–18, Suntec, Singapore, August. Association for Computational Linguistics.
Katrin Erk. 2010. What is word meaning, really? (And how can distributional models help us describe it?). In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, pages 17–26, Uppsala, Sweden, July. Association for Computational Linguistics.
Christiane Fellbaum, Joachim Grabowski, and Shari Landes. 1997. Analysis of a hand-tagging task. In Proceedings of the ACL/Siglex Workshop, Somerset, NJ.
Christiane Fellbaum, J. Grabowski, and S. Landes. 1998. Performance and confidence in a semantic annotation task. In WordNet: An Electronic Lexical Database, pages 217–238. The MIT Press, Cambridge (Mass.).
Christiane Fellbaum, Martha Palmer, Hoa Trang Dang, Lauren Delfs, and Susanne Wolf. 2001. Manual and automatic semantic annotation with WordNet.
Christiane Fellbaum. 1998. WordNet. An Electronic Lexical Database. MIT Press, Cambridge, MA.
Charles J. Fillmore and B. T. S. Atkins. 1994. Starting where the dictionaries stop: The challenge for computational lexicography. In Computational Approaches to the Lexicon, pages 349–393. Oxford University Press.
Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76:378–382.
Patrick Hanks and James Pustejovsky. 2005. A pattern dictionary for natural language processing. Revue Française de linguistique appliquée, 10(2).
Patrick Hanks. Forthcoming. Lexical Analysis: Norms and Exploitations. MIT Press.
Ales Horak, Adam Rambousek, and Piek Vossen. 2008. A distributed database system for developing ontological and lexical resources in harmony. In 9th International Conference on Intelligent Text Processing and Computational Linguistics, pages 1–15. Springer, Berlin.
Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short '06, pages 57–60, Stroudsburg, PA, USA. Association for Computational Linguistics.
Nancy Ide, Collin Baker, Christiane Fellbaum, Charles Fillmore, and Rebecca Passoneau. 2008. MASC: The Manually Annotated Sub-Corpus of American English. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), pages 28–30. European Language Resources Association (ELRA).
Julia Jorgensen. 1990. The psycholinguistic reality of word senses. Journal of Psycholinguistic Research, (19):167–190.
Adam Kilgarriff. 1997. I don't believe in word senses. Computers and the Humanities, 31(2):91–113.
Ramesh Krishnamurthy and Diane Nicholls. 2000. Peeling an onion: The lexicographer's experience of manual sense tagging. Computers and the Humanities, 34:85–97.
Ryan McDonald, Kevin Lerman, and Fernando Pereira. 2006. Multilingual dependency analysis with a two-stage discriminative parser. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X '06), pages 216–220. Association for Computational Linguistics.
Adam Meyers, Ruth Reeves, and Catherine Macleod. 2008. NomBank v 1.0.
G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker. 1993a. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology.
G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker. 1993b. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology.
Roberto Navigli. 2006. Meaningful clustering of senses helps boost word sense disambiguation performance. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 105–112, Sydney, Australia.
Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. The Proposition Bank: A corpus annotated with semantic roles. Computational Linguistics Journal, 31(1).
Rebecca J. Passonneau, Ansaf Salleb-Aoussi, Vikas Bhardwaj, and Nancy Ide. 2010. Word sense annotation of polysemous words by multiple annotators. In LREC Proceedings, pages 3244–3249, Valetta, Malta.
Philip Resnik and David Yarowsky. 1999. Distinguishing systems and distinguishing senses: New evaluation methods for word sense disambiguation. Natural Language Engineering, 5(2):113–133.
Anna Rumshisky, M. Verhagen, and J. Moszkowicz. 2009. The holy grail of sense definition: Creating a sense-disambiguated corpus from scratch. Pisa, Italy.
Michael Rundell. 2002. Macmillan English Dictionary for Advanced Learners. Macmillan Education.
Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, and Jan Scheffczyk. 2010. FrameNet II: Extended Theory and Practice. ICSI, University of Berkeley, September.
Beatrice Santorini. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. University of Pennsylvania, 3rd Revision, 2nd Printing, (MS-CIS-90-47):33.
William A. Scott. 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3):321–325.
John Sinclair, Patrick Hanks, et al. 1987. Collins Cobuild English Dictionary for Advanced Learners (4th edition published in 2003). HarperCollins Publishers 1987, 1995, 2001, 2003; and Collins A–Z Thesaurus (1st edition first published in 1995). HarperCollins Publishers 1995.
John Sinclair. 1991. Corpus, Concordance, Collocation. Describing English Language. Oxford University Press.
Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, Mohammed El-Bachouti, Robert Belvin, and Ann Houston. 2011. OntoNotes release 4.0.
Fabio Massimo Zanzotto, Marco Pennacchiotti, and Alessandro Moschitti. 2009. A machine learning approach to textual entailment recognition. Natural Language Engineering, 15(4):551–582.
Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151.
Parallel and Nested Decomposition for Factoid Questions

Aditya Kalyanpur, Siddharth Patwardhan, Branimir Boguraev, Jennifer Chu-Carroll and Adam Lally
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598, USA
{adityakal,siddharth,bran,jencc,alally}@us.ibm.com
Abstract Largely a legacy of the nature of TREC questions


(Voorhees, 2002), this tactic works in most cases
Typically, automatic Question Answering
where the assumption holds that a question is fo-
(QA) approaches use the question in its en-
tirety in the search for potential answers. cused upon a single fact, and support for it may
We argue that decomposing complex fac- be found in a single resource.
toid questions into separate facts about their Our work deals with more complex factoid
answers is beneficial to QA, since an an- questions, specifically ones containing multiple
swer candidate with support coming from
multiple independent facts is more likely facts related to the correct answer. Because such
to be the correct one. We broadly cate- facts may be independent of each other, they may
gorize decomposable questions as parallel well reside in different resourcesand thus out-
or nested, and we present a novel ques- side of the scope of a single-shot search query.
tion decomposition framework for enhanc-
(2) Which company has origins dating back to the
ing the ability of single-shot QA systems
1870s and became the first U.S. company to
to answer complex factoid questions. Es-
have 1 million stockholders?
sential to the framework are components
for decomposition recognition, question re- Example (2) shows a question with two facts
writing, and candidate answer synthesis about its answer (a company): its origins date
and re-ranking. We discuss the inter- back to the 1870s, and it became the first in U.S.
play among these, with particular empha-
sis on decomposition recognition, a pro-
to have 1 million stockholders. We turn to ques-
cess which, we argue, can be sufficiently in- tion decomposition to leverage the separate facts
formed by lexico-syntactic features alone. within the question, using them to garner support
We validate our decomposition approach by for the correct answer from independent sources
implementing the framework on top of a of evidence. Our hypothesis is that the more in-
state-of-the-art QA system, showing a sta- dependent facts support an answer candidate, the
tistically significant improvement over its
accuracy. more likely it is to be the correct answer.
We focus here on decomposition applied to im-
proving the quality of QA over a broad set of
1 Introduction
factoid questions. In contrast to most work on
Question Answering (QA) systems for factoid decomposition to date, which tends to appeal to
questions typically adopt a single-shot ap- discourse and/or semantic properties of the ques-
proach for the task. Single-shot QA implicitly as- tion (Section 2), we exploit the notion of a fact to
sumes that the question contains a single nugget view decomposition as circumscribed largely by
of information (as in Example (1)). the syntactic shape of questions. Facts are entity-
(1) In which city are the headquarters of GE relationship expressions, where the relation may
located? be an N-ary predicate. Most informative, and thus
To answer the question, these approaches attempt useful, facts are those that contain at least one
to locate the factual information (the location of named entity (including temporal or locative ex-
GEs headquarters) in their underlying resources. pressions).

851
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 851860,
Avignon, France, April 23 - 27 2012. 2012
c Association for Computational Linguistics
The particular relationship between indepen- 2005), lists (Hartrumpf, 2008) or lists of sets (Lin
dent facts in any given question leads us to catego- and Liu, 2008), and so forth.
rize decomposable questions broadly into two In the literature, we find descriptions of pro-
types: parallel and nested. Examples (2) above cesses like local decomposition and meronymy
and (3) below are parallel decomposable: sub- decomposition (Hartrumpf, 2008), semantic de-
questions can be evaluated independently of one composition using knowledge templates (Katz et
another. In contrast, nested questions require their al., 2005), question refocusing (Hartrumpf, 2008;
decompositions to be processed in sequence, with Katz et al., 2005), and textual entailment (Laca-
the answer to an inner sub-question plugged tusu et al., 2006) to connect, through semantics
into the outer. In Example (4), the inner sub- and discourse, the original question with its nu-
question is marked in brackets; its answer, cir- merous decompositions. In general, such pro-
rhosis, then leads to an outer question In the cesses are not limited to using only lexical mate-
treatment of cirrhosis, which drug reduces portal rial explicitly present in the question: a constraint
venous blood inflow, the answer to which is also we place upon our decomposition algorithms in
the answer to the original question. order to retain the ability to do open-domain QA.
(3) Which 2011 tax form do I fill if I need to do Closer to our strategy are notions like the syn-
itemized deductions and I have an IRA rollover tactic decomposition of Katz et al. (2005), and the
from 2010? temporal/spatial analysis of Saquete et al. (2004)
(4) In the treatment of [a condition that causes and Hartrumpf (2008). Still, our approach differs
bleeding esophageal varices], which drug
in at least two significant ways. We offer a prin-
reduces portal venous blood inflow?
cipled solution to the problem of the final combi-
Questions like these are found in domains such as
nation and ranking of candidate answers returned
medical, legal, etc., as they tend to arise in more
from multiple decompositions, by means of train-
dynamic QA system setting. Independently of do-
ing a model to weigh the effects of decomposition
main and type, however, they share a common
recognition rules. We also note that spatial and
characteristic: if a search query is constructed
temporal decomposition are just special cases of
from all the facts collectively describing the an-
solving nested decomposable questions.
swer, it is likely to flood the system with noise,
The closest similarity our fact-based decompo-
and confuse the identification of potential answer-
sition has with an established approach is with the
bearing passages. The notion of decomposition
notion of asking additional questions in order to
thus goes hand in hand with that of recursively
derive constraints on candidate answers (Prager
applying a QA system to the individual facts (sub-
et al., 2004). However, the additional questions
questions), followed by suitable re-composition
there are generated through knowledge of the do-
of the candidate answer lists for the sub-questions.
main, making that technique hard to apply in an
This paper presents a novel decomposition ap- open domain setting. In contrast, we developed
proach for such questions. We discuss the partic- a domain-independent approach to question de-
ular strategies for recognizing and typing decom- composition, in which we use the question con-
posable questions, and the subsequent processing text alone in generating queriable constraints.
of sub-questions, and their candidate answer lists,
in ways which can improve the performance of an 3 Fact-based Decomposition
existing state-of-the-art QA system.
Enhancing a single-shot QA system with a ca-
2 Related work pability for incremental solving of decomposable
questions requires recognizing that a question is
A variety of approaches to QA cite decomposi- decomposable, and engaging in a staged process-
tion, in the context of addressing question com- ing of its sub-question parts. Whether parallel or
plexity. In most work to date, however, com- nested, the system needs to identify the multiple
plex refers to questions requiring non-factoid facts, and configure itself as appropriate. Figure
answers: e.g. multiple sentences or summaries 1 shows our fact-based decomposition meta-
of answers (Lacatusu et al., 2006), connected framework (meta, as it builds on top of an ex-
paragraphs (Soricut and Brill, 2004), explana- isting QA system). It comprises four main com-
tions and/or justification of an answer (Katz et al., ponents as illustrated in the figure.

852
Question
the two different pathways in the figure, multi-
ple parallel facts submitted to the base QA system
vs. inner-outer sub-question pairs, processed via a
feedback loop. The base system is invoked on the
full question, and on its decompositions.

4 Decomposition Recognizers
The primary goal in decomposing questions is to
identify facts involving the entity being asked for
Ranked Ranked Ranked
Candidates Candidates Candidates (henceforth the focus), simpler than the full ques-
tion and solvable independently (Section 1). Most
question decomposition work (Section 2) tends to
defer to semantic, discourse, and other domain-
Final
Answer List specific information; in contrast, we recognize de-
composable questions primarily on the basis of
Figure 1: Fact-based decomposition framework their syntactic shape. This is important for our
claim that the decomposition framework outlined
in Section 3 is generally applicable to multiple
Decomposition Recognizers analyze the input
QA tasks and system configurations.
question and identify decomposable parts using a
In our work, we use a dataset of factoid ques-
set of predominantly lexico-syntactic cues (Sec-
tion/answer pairs from Jeopardy!,1 a popular TV
tion 4). Question Rewriters re-write the sub-
quiz show in the US. The data is particularly chal-
questions found by the recognizer, retaining key
lenging, not least for the broad domain it covers
contextual information (Section 5.1). Underly-
and the complex language used. In addition to
ing QA System generates, for any factoid ques-
making for an excellent test-bed for open-domain
tion, a ranked list of answer candidates, each with
QA, the data offers a wide choice of questions
a confidence corresponding to the probability of
which require decomposing.
the answer being correct. Answer Synthesis and
Re-ranking is a placeholder for the particular pro- 4.1 Decomposition Patterns
cess which tries to combine ranked candidate an-
Our analysis of complex decomposable questions
swers obtained to the original question with so-
highlights numerous syntactic cues that are reli-
lutions for the decomposed facts into a uniform
able indicators for decomposition, and it is pre-
ranked answer list. In general, different combi-
dominantly such cues we exploit for driving the
nation functions may be appropriate for different
recognition and typing of decomposable ques-
types of decomposable questions. Thus, for the
tions. A set of recognition patterns can be formu-
classes of parallel and nested questions, our de-
lated in terms of fine-grained lexico-syntactic in-
composition strategies (described in Sections 5.2
formation, expressed over the predicate-argument
and 5.3) defer to an Answer Merger. Other combi-
structure (PAS) for the syntactic parse of the
nation functions may be required for e.g. selecting
question. We identify three major categories
from or aggregating over lists; cf. Hartrumpfs op-
of configurationally-based patterns: independent
erational decomposition (2008), or Lin and Lius
subtrees, composable units and segments with
multi-focus questions (2008); see also the special
qualifiers. These are general, in the sense
questions solving techniques of (Prager et al., ).
that they capture relationships between configura-
We use a particular QA system (Ferrucci et
tional properties of a question and its status with
al., 2010) as base. However, any system can be
respect to decomposability. The specific rules im-
plugged into our meta-framework, as long as: it
plementing the patterns may, or may not, have to
can solve factoid questions by providing answers
be modified as, for instance, there may be a style
with confidences reflecting correctness probabil-
change, or a shift in the syntactic analysis frame-
ity; and it maintains context/topic information for
work of the base QA system, to a different parser;
the question separately from its main content.
1
Parallel and nested processing are distinct: note http://www.jeopardy.com.

853
Independent Subtrees
(1.P) Parallel
clause Its original name meant bitter water and it was Fact #1: Its original name meant bitter water
made palatable to Europeans after the Spaniards Fact #2: It was made palatable to Europeans after the
added sugar Spaniards added Sugar
complementary American Prometheus is a biography of this physi- Fact #1: this physicist who died in 1967
cist who died in 1967 Fact #2: American Prometheus is a biography of
this physicist
(1.N) Nested
coincidental When 60 Minutes premiered, this man was U.S. Inner Fact: When 60 Minutes premiered
President Outer Fact: When this man was president
based-on A controversial 1979 war film was based on a 1902 Inner Fact: A controversial 1979 war film
work by this author Outer Fact: film was based on a work by this author
named-for Article of clothing named for an old character who Inner Fact: an old character who dressed in loose
dressed in loose trousers in commedia dellarte trousers in commedia dellarte
Outer Fact: Article of clothing named for character
Composable Units
(2.P) Parallel
verb-args He launched his lecturing career in 1866 with a talk Fact #1: He launched his lecturing career in 1866
later titled Our fellow savages of the Sandwich Is-
lands
focus-mod The Mute was the working title of this 1940 novel Fact #1: this 1940 novel by a female author
by a female author
triple His rise began when he upset Robert M. La Follette, Fact #1: he upset Robert M. La Follette, Jr.
Jr. in a 1946 Senate primary
(2.N) Nested
explicit-link To honor his work, this mans daughter took the name Inner Fact: To honor his work, [this] daughter took
Maria Celeste when she became a nun in 1616 the name Maria Celeste, when . . .
Outer Fact: this mans daughter
descriptive-np The word for this congressional job comes from a fox- Inner Fact: a fox-hunting term for someone who
hunting term for someone who keeps the hunting dogs keeps the hunting dogs from straying
from straying Outer Fact: The word for this congressional job
comes from term
Segments with Qualifiers
(3.P) Parallel
qualifier Winning in 1965 and 1966, he was the first man to win Fact #1: he was the first man to win the Masters golf
the Masters golf tournament in 2 consecutive years tournament in 2 consecutive years

Table 1: Decomposition Rule Sets

such implementations do not affect our analysis subtree from the question as a decomposable fact.
of syntactically-cued decomposition recognition.
Table 1 shows example decompositions within
pattern categories; note that within a category,
typically there are rule sets for parallel and nested
decomposition types.2

Independent Subtrees A good source of in-


dependent sub-questions within a question is in For example, the subtree fragment circled is an in-
clauses likely to capture a unique piece of infor- dependent fact (in brackets) identified within the
mation about the answer, distinct from the rest larger question The name of [this character, first
of the question. Relative or subordinate clauses introduced in 1894], comes from the Hindi for
(not in a superlative or ordinal context; see Seg- bear. This category also includes rules using
ments with Qualifiers below) are examples of in- conjunctions as decomposition points (at various
dependent subtrees and are indicative of parallel levels of the syntactic parse), as in Example (3)
decomposition. PAS configurations that connect earlier (Section 1).
such subtrees to the focus are generally good in- Parallel decomposition of this type is captured
dicators of a sub-question: cues to break off a in two rule sets, clause and complementary, which
2
In the data we use, questions are posed in a declarative
differ primarily in that complementary rules at-
format, with stylized marking of question focus. This should tempt to derive two separate sub-questions, while
not detract from referring to them as questions. the clause rules attempt to locate independent

854
sub-questions in the original question. Examples Units rules combine separate parts of the PAS
in Table 1/Row (1.P) illustrate this distinction. into a fact. For instance, a sub-question can be
For nested decomposition, we have three rule created by associating the focus head with its pre-
sets: coincidental, based-on and named-for. modifiers and postmodifiers. If the premodifiers
These use lexical cues to detect specific seman- and postmodifiers are sufficiently specific, we ob-
tic relations within the question that could indi- tain reasonably independent sub-questions, with
cate nestedness. For instance, the coincidental parallel-decomposable behavior.
rules identify sub-questions resolving a tempo- Three parallel decomposition rule sets are de-
ral link with the focus of the original question. fined in this category: verb-args, focus-mod and
The based-on and named-for rules detect sub- triple (see Table 1/row (2.P)). The rules in verb-
questions where the answer to the original ques- args compose a fact from the verb and its ar-
tion is based on or named for the answer to the in- guments (subject, object, PP complements). The
ner sub-question (Table 1/row (1.N)). Note that in focus-mod rules combine the head of the focus
different domains, different relations may corre- NP with its modifiers to generate a sub-question.
late with nestedness, for instance, disease-causes- Similar to verb-args are triple rules, which
symptom in a medical setting; cf. Example (4) in create less constrained sub-questions (in that the
Section 1. The general pattern would still apply, composition always links only two of the argu-
even if we need different rule(s) to implement it. ments to the underlying predicate, e.g. subject-
Configurational information is used to deter- verb-object or subject-verb-complement).
mine whether the question exhibits parallel or Here also, a particular configuration around the
nested decomposition profile. Thus the syntac- focus may indicate a question requiring nested
tic contour of Example (3) shows that two clauses processing. For nested, the Composable Units
characterize the same entity (the focus): a clear category has two rule sets: explicit-link and
indicator that the sub-questions are parallel. Con- descriptive-np (Table 1/row (2.N)).
versely, A controversial 1979 war film was based In contrast to questions where modifiers of the
on a 1902 work by this author exhibits a very focus can be cues for parallel decomposition (i.e.
different set of configurational properties. There the focus-mod rules above), the explicit-link
are two underspecified entities (including the fo- rules detect nested decomposition, signaled by the
cus), both characterized as head-plus-modifiers focus itself being a modifier. For example, in
syntactic units; however, there is no sharing of To honor his work, this mans daughter took the
the separate characterizations (facts) via a com- name Maria Celeste when she became a nun in
mon head. This indicates nestedness: the inner 1616, the focus (this man) is a determiner to
sub-question is the one around the underspecified, an underspecified node (daughter). Traversing
but non-focus, element (a controversial 1979 the tree without descending to the level of the fo-
war film); the outer is [film] was based on a cus would carve out an inner sub-question itself
1902 work by this author. focused on that underspecified node (daughter):
Another cue for nested questions is a sub-tree see Table 1/row (2.N).
labeled by a temporal subordinate conjunction,
or a subordinate clause, away from the focus-
enclosing top level of the question and itself un-
derspecified. Such analysis will motivate the
question When 60 Minutes premiered, this
man was U.S. president to be solved first for the
temporal expression, When did 60 Minutes The descriptive-np rule set finds parenthetical
premiere?, followed by Who was U.S. Presi- descriptions of underspecified nouns in the pri-
dent in 1968?. mary question, as in e.g. This arboreally named
area was made famous by [a prince in the re-
Composable Units An alternate strategy for gion noted for impaling enemies on stakes]:
identifying sub-questions is to compose a fact the nested-decomposable nature of this question
by combining elements from the question. In con- is captured in the descriptive phrase (in square
trast to the previous category, the Composable brackets) functioning as an inner sub-question.

855
Segments with Qualifiers  This category of rules covers cases where the modifier of the focus is a relative qualifier, such as "the first", "only", "the westernmost". In such cases, information from another clause is usually required to complete the relative qualifier: consider e.g. the incomplete "the third man" vs. the fact "the third man . . . to climb Mt. Everest". To deal with these cases, rules in this category combine the characteristics of Composable Units with those of Independent Subtrees rules. We compose the relative qualifier, the focus (along with its modifiers) and the attached supporting clause subtree to generate rules of this type. As illustrated in row (3.P) of Table 1, for parallel decomposition our rule set covers sub-questions expressed as superlatives. We do not have any rules of this type for the nested case.

4.2 Decomposition Filters

All three pattern categories above rely only on a syntactic analysis of the question; this is delivered by the English Slot Grammar (ESG) parser (McCord, 1989). When rules fire, they also identify question segments proposed as sub-questions.

Not surprisingly, the rules over-generate; to mitigate against that, we apply several heuristic filters to the proposed sub-questions. The filters discard sub-questions that do not contain either a named entity, a quoted string, or a time or date expression (these are detected by the ESG parser). Additionally, we discard sub-questions that almost completely overlap the entire question or a sub-question from a prior rule. A partial priority order is imposed on rule application, based on intuitions of how informative the facts generated by a rule are; this order is reflected on a per-type basis in Table 1: e.g. within type (2.P) we prefer verb-args to triple since the latter tends to produce less constrained facts than the former.
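The filtering step can be pictured with a small sketch like the one below. The SubQuestion container, its annotation fields and the 0.9 overlap threshold are assumptions made only for illustration; in the system described here the named entities, quoted strings and date expressions come from the ESG analysis.

# Sketch of the heuristic sub-question filters described above (illustrative;
# annotation fields and the overlap threshold are assumed, not from the paper).
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class SubQuestion:
    text: str
    named_entities: Set[str] = field(default_factory=set)
    quoted_strings: Set[str] = field(default_factory=set)
    date_expressions: Set[str] = field(default_factory=set)

def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta), 1)

def filter_subquestions(question: str,
                        candidates: List[SubQuestion],
                        overlap_threshold: float = 0.9) -> List[SubQuestion]:
    kept: List[SubQuestion] = []
    for sq in candidates:
        # Keep only sub-questions anchored by a named entity, a quoted string,
        # or a time/date expression.
        if not (sq.named_entities or sq.quoted_strings or sq.date_expressions):
            continue
        # Discard sub-questions that almost completely overlap the full
        # question or an already accepted sub-question.
        if token_overlap(sq.text, question) >= overlap_threshold:
            continue
        if any(token_overlap(sq.text, k.text) >= overlap_threshold for k in kept):
            continue
        kept.append(sq)
    return kept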
5 Using Decomposition

In essence, decomposition recognition informs two processes. According to the question type, parallel or nested, the appropriate pathway in the framework (Figure 1) needs to get instantiated; before sub-questions are submitted to the base QA system, they may need augmentation to facilitate the recursive system invocation. The answer sets obtained from sub-question processing then need to be analyzed and rationalized, to determine the final answer to the original question.

5.1 Question Re-Writing

For parallel decomposition, the goal is to solve the original question Q by solving sub-questions independently and combining results appropriately. For example, consider the Jeopardy! question

(5) HISTORIC PEOPLE: The life story of this man who died in 1801 was chronicled in an A&E Biography DVD titled "Triumph and treason"

We get two decompositions:³

Q1: This man who died in 1801
Q2: The life story of this man was chronicled in an A&E Biography DVD titled "Triumph and treason"

³ Jeopardy! questions also contain category information, which further contextualizes the search for the answer.

Submitting sub-questions, unmodified, to the base QA system raises at least two problems. Sub-questions are often much shorter than the original question, and in many cases no longer have a unique answer. Moreover, some of the information from the original question that was dropped in a sub-question may be relevant contextual cues that the QA system needs to come up with the correct answer. Q1 above illustrates these problems: it does not have a unique answer, and suffers from a recall problem (the correct answer is not in the candidate answer list of the base system when it considers this sub-question alone).

Our solution is to insert contextual information into the sub-questions. In a two-step process for a sub-question Qi, we obtain the set of all named entities and nouns (ignoring stopwords) in the original question text outside of Qi, and we insert these keywords into the original question category. In Jeopardy! questions, the category field is the context/topic information which the underlying QA system needs in order to use the decomposition framework, as stated in Section 3. In general, a QA system may derive such information in a variety of ways, e.g. by exploiting the problem description in a technical assistance QA setting, or a patient's medical history in a medical QA setting. What is important here is that the base system treat such information differently from the question itself. Rewriting takes advantage of this differential weighting to ensure that the larger context of the original question is still taken into account when evaluating a sub-question, albeit with less weight.

The re-written Q1/Q2 for Example (5) are:

(5-1) HISTORIC PEOPLE (A&E BIOGRAPHY DVD "TRIUMPH AND TREASON"): This man who died in 1801
(5-2) HISTORIC PEOPLE (1801): The life story of this man was chronicled in an A&E Biography DVD titled "Triumph and treason"

The keywords are inserted in parentheses, to ensure a clear separation between the original category terms and the added context terms. Other systems may need a different re-writing tactic.

The above re-writing technique is used for both parallel and nested decomposable questions. For the nested case, there is an additional re-writing step that needs to be done after solving the inner question: we need to substitute its answer into the outer question when solving for it. Thus the first example in Table 1/row (1.N) would have its inner focus "When 60 Minutes premiered" replaced with "In 1968", creating the outer question "In 1968, this man was U.S. President", whose solution is the answer to the original question.
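A rough sketch of the contextual re-writing step follows. The keyword extractor here (capitalized tokens and digits outside the sub-question, minus a tiny stopword list) is a crude placeholder for the system's linguistic analysis, introduced only so the example runs; only the overall shape, keywords from outside the sub-question appended parenthetically to the category, mirrors the description above.

# Sketch of contextual re-writing for a sub-question Qi: collect named entities
# and content words of the original question that fall outside Qi and append
# them, in parentheses, to the category field. The extractor below is a crude
# stand-in for the real linguistic analysis.
STOPWORDS = {"the", "a", "an", "of", "in", "this", "was", "and", "by", "titled", "who"}

def context_keywords(question: str, subquestion: str) -> list:
    sub_tokens = {t.strip('".,').lower() for t in subquestion.split()}
    keywords = []
    for tok in question.split():
        clean = tok.strip('".,')
        if not clean or clean.lower() in STOPWORDS:
            continue
        if clean.lower() in sub_tokens:
            continue                      # only material outside the sub-question
        if clean[0].isupper() or clean.isdigit():
            keywords.append(clean)        # crude named-entity / content-word proxy
    return keywords

def rewrite(category: str, question: str, subquestion: str) -> tuple:
    kws = context_keywords(question, subquestion)
    new_category = f"{category} ({' '.join(kws)})" if kws else category
    return new_category, subquestion

category = "HISTORIC PEOPLE"
question = ('The life story of this man who died in 1801 was chronicled in an '
            'A&E Biography DVD titled "Triumph and treason"')
q1 = "This man who died in 1801"
print(rewrite(category, question, q1))
# ('HISTORIC PEOPLE (A&E Biography DVD Triumph)', 'This man who died in 1801')

With a proper named-entity annotator the inserted context would match the (5-1) example above more closely; the crude token test here misses lower-cased quoted material such as "treason".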
5.2 Answer Re-Ranking: Parallel

The base QA system will process the re-written category/sub-question pairs, and will produce a set of ranked candidate lists with confidences. These need to be combined into a final answer list for the original question, accounting for information across all sub-question candidate lists.

One way to produce a final score for each candidate answer is simply to take the product of the scores returned by the QA system for each of the sub-questions. This assumes that the sub-questions are typically independent and that the QA system produces a confidence which corresponds to the probability of the answer being correct. However, even if the sub-questions are independent, question re-writing breaks this assumption as it brings information from the remainder of the question into the sub-question context. Also, the sub-questions are generated by decomposition rules that have varying precision and recall, and thus should not be weighted equally.

Finally, if the sub-questions are not of good quality (e.g. due to a bad parse), we need a fallback to the original question, which implies that the confidence for the candidate answer for the entire question should also be considered when making a final decision. Consequently, we use a machine-learning model to combine information across sub-question answer confidences, with features capturing the above information (Table 2). In case a candidate answer is not in the answer list of the full question or any of the decomposed sub-questions, the corresponding feature value is set to missing. If a rule generates multiple sub-questions, its corresponding feature value for the candidate answer is set to the sum of the confidences obtained for that answer across all sub-questions. The model is trained using Weka's (Witten and Frank, 2000) logistic regression algorithm with instance weighting.

Feature Name        Description
Orig. Top Answer    Binary feature signaling whether the candidate was the top answer to the non-decomposed question
Orig. Confidence    Confidence for the candidate answer to the non-decomposed question
# Facts Matched     Number of sub-questions which have the candidate answer in their top 10
Rule-verb-args, Rule-clause, Rule-qualifier, Rule-focus-mod, Rule-complementary, Rule-triple
                    Features corresponding to the rule sets used in parallel decomposition; each feature takes a numeric value, which is the confidence of the QA system on a fact identified by the corresponding rule set

Table 2: Features in Parallel Re-ranking Model
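The feature construction summarized in Table 2 can be sketched as follows. The data layout (dictionaries of per-candidate confidences), the missing-value encoding (zero plus an indicator) and the use of scikit-learn are assumptions made for illustration; the paper trains the model with Weka's logistic regression with instance weighting.

# Sketch of building the Table 2 feature vector for one candidate answer and
# training a logistic-regression re-ranker. Structures and scikit-learn are
# illustrative stand-ins for the Weka-based setup described in the text.
import numpy as np
from sklearn.linear_model import LogisticRegression

RULE_SETS = ["verb-args", "clause", "qualifier", "focus-mod", "complementary", "triple"]

def features(candidate, orig_answers, sub_answers_by_rule, top10_by_sub):
    # orig_answers: {answer: confidence} for the non-decomposed question.
    # sub_answers_by_rule: {rule: [{answer: confidence}, ...]} per sub-question.
    # top10_by_sub: list of top-10 answer sets, one per sub-question.
    f = []
    top = max(orig_answers, key=orig_answers.get) if orig_answers else None
    f.append(1.0 if candidate == top else 0.0)                   # Orig. Top Answer
    f.append(orig_answers.get(candidate, 0.0))                   # Orig. Confidence
    f.append(float(sum(candidate in t for t in top10_by_sub)))   # # Facts Matched
    for rule in RULE_SETS:                                       # per-rule confidences
        confs = [a.get(candidate) for a in sub_answers_by_rule.get(rule, [])]
        confs = [c for c in confs if c is not None]
        f.append(sum(confs) if confs else 0.0)                   # sum over sub-questions
        f.append(0.0 if confs else 1.0)                          # "missing" indicator
    return np.array(f)

# Training over labelled candidates (X: feature rows, y: 1 if the candidate is correct):
# model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)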
5.3 Answer Re-Ranking: Nested

Nested questions decompose into inner/outer question pairs. The task is to solve the inner question first, substitute the answer obtained, based on its confidence, into the outer, and solve that for the final answer. This is contingent upon selecting answers to the inner question which might profitably be plugged into the outer; substituting incorrect answers will only lead to noisy final answers, with negative impact on overall accuracy.

We rely on the ability of the underlying QA system to produce meaningful confidences for its answers, and only consider the top answer to the inner question for substitution into the outer, and only if its confidence exceeds some threshold.

Finally, the answers to the outer question need to be related to the full question answer list, to produce the final ranked answers. For answer re-ranking, we use the following heuristic selection strategy: we compute the aggregate confidence of the answer obtained through decomposition as the product of the inner-question answer confidence and the outer-question answer confidence, and compare this value with that of the top answer confidence for the entire question, selecting the higher-confidence one as our final answer. Note that this re-ranking is different from the one used in parallel decomposition, where we combine results from multiple sub-questions into a single confidence.
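The nested heuristic reduces to a few lines. The sketch below assumes a qa callable that returns a ranked list of (answer, confidence) pairs, and it represents the inner-answer substitution with a simple string template; both are stand-ins for the underlying QA system and the re-writing step of Section 5.1, not the paper's implementation.

# Sketch of the nested answer selection heuristic: answer the inner question,
# substitute its top answer into the outer question if it is confident enough,
# answer the outer question, and fall back to the single-shot answer when the
# product of confidences does not beat it. qa() and the template are assumed.
def answer_nested(question, inner_q, outer_template, qa, threshold=0.3):
    # qa(question) -> list of (answer, confidence), best first.
    # outer_template contains a slot, e.g. "In {}, this man was U.S. President".
    full_answer, full_conf = qa(question)[0]

    inner_answer, inner_conf = qa(inner_q)[0]
    if inner_conf < threshold:
        return full_answer, full_conf            # not confident enough to substitute

    outer_q = outer_template.format(inner_answer)    # substitute inner answer
    outer_answer, outer_conf = qa(outer_q)[0]

    nested_conf = inner_conf * outer_conf        # aggregate confidence
    if nested_conf > full_conf:
        return outer_answer, nested_conf
    return full_answer, full_conf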
6 Evaluation

6.1 Evaluation Data

As we discuss question decomposition in the context of Jeopardy! data (Section 4), our test set contains only Final Jeopardy! (FJ) questions. They are often long and complex, with multiple facts or constraints that need to be satisfied. Also, they are typically much harder to answer than regular Jeopardy! questions, both for humans and for our base QA system. The test set comprises close to 3000 FJ questions, broken into 1138 for training, 517 for development and 1269 questions for testing (as blind data).

6.2 Experiments

The decomposition rules (Section 4) and re-ranking parameters (Sections 5.2, 5.3) were defined and tuned on the development set. The final re-ranking model was trained over the training set, using the features described in Section 5.2, and logistic regression with instance re-weighting. The results of applying the decomposition rules followed by the re-ranking model to the 1269 test questions are shown in Table 3. The baseline is the performance of the underlying QA system used in our meta-framework without any decomposition components, applied to the same test set.

We evaluated separately the impact of our question re-writing strategy (Section 5.1), which carries contextual information from the original question into the sub-questions. For this purpose, we altered our algorithm to issue the sub-question text as-is, using the original category, and re-trained and evaluated the resulting model again on the test set. The results are also shown in Table 3. In the table, PB refers to Parallel Baseline, and NB to Nested Baseline; both are results from running the underlying QA system without any decomposition capabilities. PD and ND refer to Parallel and Nested Decomposition systems respectively, and QR refers to question re-writing. Separate experiments determined end-to-end accuracy for the different system configurations, with respect to the entire test set, and accuracy over the decomposable-question subsets of the test set.

QA System   End-to-End Accuracy   Decomposable Q Accuracy
PB          635/1269 (50.05%)     339/598 (56.68%)
PD-QR       634/1269 (49.96%)     338/598 (56.52%)
PD+QR       643/1269 (50.66%)     347/598 (58.02%)
NB          635/1269 (50.05%)     129/255 (50.58%)
ND+QR       640/1269 (50.43%)     134/255 (52.54%)

Table 3: Evaluating Decomposition

We do not offer a separate analysis of decomposition recognition. Manual creation of a decomposition standard is highly non-trivial, largely due to the numerous alternative ways to decompose a question and synthesize unique facts from the segments. Indeed, this is precisely the motivation for weighting the decomposition rules in a trained re-ranking model (Section 5.2). Given this, we are interested only in measuring the impact of decomposition on end-to-end QA performance.

6.3 Discussion of Results

The results in Table 3 show that the parallel decomposition rules were able to decompose a large fraction of the test set (598 out of 1269 questions: 47%). Interestingly, the performance of the baseline QA system on the decomposable set was 56.6%, which is 6% higher than the overall performance over the entire test set. One reason for this result may be that parallel decomposable questions typically contain a lot more information (more than one fact or constraint that the answer must satisfy) about the same answer, and the system in some cases is able to exploit this redundancy, such as when one fact is strongly associated with the correct answer and there is evidence supporting this in the sources.

A different, important result is that using the decomposition algorithm without question re-writing did not show impact over the baseline. This highlights the importance of contextual information for QA. On the other hand, when using re-writing to maintain context (Section 5.1), our parallel decomposition algorithm was able to achieve a gain of 1.4% on the parallel decomposable question set, which translated to an end-to-end gain of 0.6%.

Separately, the table shows that roughly a fifth (255 out of 1269 questions) of the entire test set was recognized as nested decomposable. Again, interestingly, the performance of the baseline QA system on the nested decomposable set was roughly the same as the overall performance (and much lower than on the parallel decomposable cases). The likely explanation here is that nested questions require solving for an inner fact first, and it is the answer to this which often provides the necessary missing information required to find the correct answer: this makes nested questions much harder to solve than parallel decomposable ones, with their multiple independent facts. Our nested decomposition algorithm using the heuristic re-ranking approach (Section 5.3) was able to achieve a gain of 2% on the nested decomposable question set, which translated to an end-to-end gain of 0.4%.

The aggregate impact of parallel and nested decomposition was a 1.5% gain in accuracy on the decomposable set, and a 1% gain in end-to-end system accuracy (in our case the questions that are classified as parallel or nested form disjoint sets).

To put these results in perspective, we emphasize that the baseline QA system represents the state of the art in solving Jeopardy! questions. The FJ questions, which exclusively comprise our test data, are known to be harder than regular Jeopardy!: qualified Jeopardy! players' accuracy on this kind of question is 48%,⁴ and the underlying QA system has an accuracy close to 51% on previously unseen FJ questions. A gain of 1% end-to-end on such questions, therefore, represents a strong improvement. Also, using the statistical McNemar's test (McNemar, 1947), we found the net end-to-end impact to be statistically significant at a 99% confidence interval.

⁴ Calculated over historical games data, from J-archive (http://www.j-archive.com).
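The significance of such an end-to-end difference can be checked with McNemar's test over the paired per-question outcomes, for example along the lines below. This is a sketch using the chi-squared approximation with continuity correction; the exact variant the authors used is not specified beyond the citation.

# McNemar's test on paired right/wrong outcomes of the baseline and the
# decomposition-enabled system over the same questions; scipy supplies the
# reference chi-squared distribution.
from scipy.stats import chi2

def mcnemar(baseline_correct, system_correct):
    # Both arguments are equal-length lists of booleans, one entry per question.
    b = sum(1 for x, y in zip(baseline_correct, system_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, system_correct) if y and not x)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)      # continuity-corrected statistic
    p_value = chi2.sf(stat, df=1)
    return stat, p_value

# stat, p = mcnemar(baseline_flags, decomposition_flags)
# p < 0.01 corresponds to significance at the 99% confidence level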
Finally, we note that our error analysis of the test questions shows a wide variety of reasons for their failures beyond question decomposition. Further improving the system on this test set would require advances beyond deciding whether to take a single-shot or decomposable approach to questions, which is beyond the scope of this paper.

7 Conclusion

In this paper, we presented a general-purpose decomposition framework for answering complex factoid questions, which consists of three components: 1) a decomposition recognizer, which identifies the subparts of a decomposable question, 2) a question re-writer, which composes new sub-questions from the identified subparts, taking into account context from the original question, and 3) an answer synthesis and re-ranking component, which synthesizes and ranks final answers based on candidate answers to the sub-questions. Additionally, this framework leverages an underlying factoid QA system for producing answers to the sub-questions. Any QA system that can associate confidence scores with its answers and can make distinctions between the question and the context in which the question should be interpreted can be adopted in this decomposition framework.

We applied our decomposition framework to address two broad classes of complex factoid questions: parallel and nested decomposition questions. These are distinguished by how the identified sub-questions relate to each other, which in turn affects how the candidate answers to the sub-questions are combined to form the final answers. In order to maintain generality and facilitate domain adaptation, the rule-based patterns for decomposition recognition leverage syntactic characteristics of the question that are indicative of sub-question boundaries. To optimally leverage these patterns, a machine learning model was trained to properly weigh the possibly overlapping, and occasionally conflicting, patterns.

We demonstrated the impact of our question decomposition approach on a state-of-the-art factoid QA system. On a test set of 1269 Final Jeopardy! questions, 47% of the questions were found to be parallel decomposable and 20% were nested decomposable. Overall, the system achieved a statistically significant gain of 1.5% in accuracy on these questions, further increasing the system's lead over human Jeopardy! players' performance on these questions.

Given that factoid (and, often, complex) questions are typically found in several real-world domains (e.g. medical, legal, technical support), we expect our decomposition framework to have broad impact, both in open- and specialized-domain QA.

References

D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty. 2010. Building Watson: An Overview of the DeepQA Project. AI Magazine, 31(3):59-79, Fall.

S. Hartrumpf. 2008. Semantic Decomposition for Question Answering. In Proceedings of the 18th European Conference on Artificial Intelligence, pages 313-317, Patras, Greece, July.

B. Katz, G. Borchardt, and S. Felshin. 2005. Syntactic and Semantic Decomposition Strategies for Question Answering from Multiple Sources. In Proceedings of the AAAI Workshop on Inference for Textual Question Answering, pages 35-41, Pittsburgh, PA, July.

F. Lacatusu, A. Hickl, and S. Harabagiu. 2006. The Impact of Question Decomposition on the Quality of Answer Summaries. In Proceedings of the Fifth Language Resources and Evaluation Conference, pages 1147-1152, Genoa, Italy, May.

C.J. Lin and R.R. Liu. 2008. An Analysis of Multi-Focus Questions. In Proceedings of the SIGIR 2008 Workshop on Focused Retrieval, pages 30-36, Singapore, July.

M. McCord. 1989. Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars. In Proceedings of the International Symposium on Natural Language and Logic, pages 118-145, Hamburg, Germany, May.

Q. McNemar. 1947. Note on the Sampling Error of the Difference Between Correlated Proportions or Percentages. Psychometrika, 12(2):153-157.

J. Prager, E. Brown, and J. Chu-Carroll. Special Questions and Techniques. Submitted to IBM Journal of Research and Development, Special Issue on DeepQA.

J. Prager, J. Chu-Carroll, and K. Czuba. 2004. Question Answering by Constraint Satisfaction: QA-by-Dossier with Constraints. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 574-581, Barcelona, Spain, July.

E. Saquete, P. Martínez-Barco, R. Muñoz, and J. Vicedo. 2004. Splitting Complex Temporal Questions for Question Answering Systems. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 566-573, Barcelona, Spain, July.

R. Soricut and E. Brill. 2004. Automatic Question Answering: Beyond the Factoid. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 57-64, Boston, MA, May.

E. Voorhees. 2002. Overview of the TREC 2002 Question Answering Track. In NIST Special Publication 500-251: The Eleventh Text REtrieval Conference (TREC 2002), Gaithersburg, MD, November.

I. Witten and E. Frank. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA.

