
Solving Geometry Problems: Combining Text and Diagram Interpretation

Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, Clint Malcolm
University of Washington, Allen Institute for Artificial Intelligence
{minjoon,hannaneh,clintm}@washington.edu,{alif,orene}@allenai.org

Abstract

This paper introduces GeoS, the first automated system to solve unaltered SAT geometry questions by combining text understanding and diagram interpretation. We model the problem of understanding geometry questions as submodular optimization, and identify a formal problem description likely to be compatible with both the question text and diagram. GeoS then feeds the description to a geometric solver that attempts to determine the correct answer. In our experiments, GeoS achieves a 49% score on official SAT questions, and a score of 61% on practice questions.¹ Finally, we show that by integrating textual and visual information, GeoS boosts the accuracy of dependency and semantic parsing of the question text.

¹ The source code, the dataset and the annotations are publicly available at geometry.allenai.org.

Figure 1: Questions (left column) and interpretations (right column) derived by GeoS.
(a) "In the diagram at the left, circle O has a radius of 5, and CE = 2. Diameter AC is perpendicular to chord BD. What is the length of BD?" a) 12 b) 10 c) 8 d) 6 e) 4. Interpretation: Equals(RadiusOf(O), 5), IsCircle(O), Equals(LengthOf(CE), 2), IsDiameter(AC), IsChord(BD), Perpendicular(AC, BD), Equals(what, LengthOf(BD)).
(b) "In isosceles triangle ABC at the left, lines AM and CM are the angle bisectors of angles BAC and BCA. What is the measure of angle AMC?" a) 110 b) 115 c) 120 d) 125 e) 130. Interpretation: IsIsoscelesTriangle(ABC), BisectsAngle(AM, BAC), IsLine(AM), CC(AM, CM), CC(BAC, BCA), IsAngle(BAC), IsAngle(AMC), Equals(what, MeasureOf(AMC)).
(c) "In the figure at the left, the bisector of angle BAC is perpendicular to BC at point D. If AB = 6 and BD = 3, what is the measure of angle BAC?" a) 15 b) 30 c) 45 d) 60 e) 75. Interpretation: IsAngle(BAC), BisectsAngle(line, BAC), Perpendicular(line, BC), Equals(LengthOf(AB), 6), Equals(LengthOf(BD), 3), Equals(what, MeasureOf(BAC)).

1 Introduction
This paper introduces the first fully-automated system for solving unaltered SAT-level geometric word problems, each of which consists of text and the corresponding diagram (Figure 1). The geometry domain has a long history in AI, but previous work has focused on geometric theorem proving (Feigenbaum and Feldman, 1963) or geometric analogies (Evans, 1964). Arithmetic and algebraic word problems have attracted several NLP researchers (Kushman et al., 2014; Hosseini et al., 2014; Roy et al., 2015), but geometric word problems were first explored only last year by Seo et al. (2014). Still, this system merely aligned diagram elements with their textual mentions (e.g., "Circle O"); it did not attempt to fully represent geometry problems or solve them. Answering geometry questions requires a method that interprets question text and diagrams in concert.

The geometry genre has several distinctive characteristics. First, diagrams provide essential information absent from question text. In Figure 1 problem (a), for example, the unstated fact that lines BD and AC intersect at E is necessary to solve the problem. Second, the text often includes difficult references to diagram elements. For example, in the sentence "In the diagram, the longer line is tangent to the circle", resolving the referent of the phrase "longer line" is challenging. Third, the text often contains implicit relations. For example, in the sentence "AB is 5", the relations IsLine(AB) and Equals(LengthOf(AB), 5) are implicit. Fourth, geometric terms can be ambiguous as well. For instance, radius can be a type identifier in "the length of radius AO is 5", or a predicate in "AO is the radius of circle O". Fifth, identifying the correct arguments for each relation is challenging. For example, in the sentence "Lines AB and CD are perpendicular to EF", the parser has to determine what is perpendicular to EF: line AB? line CD? Or both AB and CD? Finally, it is hard to obtain a large number of SAT-level geometry questions; learning from a few examples makes this a particularly challenging NLP problem.

This paper introduces GeoS, a system that maps geometry word problems into a logical representation that is compatible with both the problem text and the accompanying diagram (Figure 1). We cast the mapping problem as the problem of selecting the subset of relations that is most likely to correspond to each question.

We compute the mapping in three main steps (Figure 2). First, GeoS uses text- and diagram-parsing to overgenerate a set of relations that potentially correspond to the question text, and associates a score with each. Second, GeoS generates a set of relations (with scores) that corresponds to the diagram. Third, GeoS selects a subset of the relations that maximizes the joint text and diagram scores. We cast this maximization as a submodular optimization problem, which enables GeoS to use a close-to-optimal greedy algorithm. Finally, we feed the derived formal model of the problem to a geometric solver, which computes the answer to the question.

GeoS is able to solve unseen and unaltered multiple-choice geometry questions. We report on experiments where GeoS achieves a 49% score on official SAT questions, and a score of 61% on practice questions, providing the first results of this kind. Our contributions include: (1) designing and implementing the first end-to-end system that solves SAT plane geometry problems; (2) formalizing the problem of interpreting geometry questions as a submodular optimization problem; and (3) providing the first empirical results on the geometry genre, making the data and software available for future work.

2 Related Work

Semantic parsing is an important area of NLP research (Zettlemoyer and Collins, 2005; Ge and Mooney, 2006; Flanigan et al., 2014; Eisenstein et al., 2009; Kate and Mooney, 2007; Goldwasser and Roth, 2011; Poon and Domingos, 2009; Berant and Liang, 2014; Kwiatkowski et al., 2013; Reddy et al., 2014). However, semantic parsers do not tackle diagrams, a critical element of the geometry genre. In addition, the overall number of available geometry questions is quite small compared to the size of typical NLP corpora, making it challenging to learn semantic parsers directly from geometry questions. Relation extraction is another area of NLP that is related to our task (Cowie and Lehnert, 1996; Culotta and Sorensen, 2004). Again, both diagrams and small corpora are problematic for this body of work.

Our work is part of grounded language acquisition research (Branavan et al., 2012; Vogel and Jurafsky, 2010; Chen et al., 2010; Hajishirzi et al., 2011; Liang et al., 2009; Koncel-Kedziorski et al., 2014; Bordes et al., 2010; Kim and Mooney, 2013; Angeli and Manning, 2014; Hixon et al., 2015; Artzi and Zettlemoyer, 2013) that involves mapping text to a restricted formalism (instead of a full, domain-independent representation). In the geometry domain, we recover the entities (e.g., circles) from diagrams, derive relations compatible with both text and diagram, and re-score relations derived from text parsing using diagram information. Casting the interpretation problem as selecting the most likely subset of literals can be generalized to grounded semantic parsing domains such as navigational instructions.

Coupling images and the corresponding text has attracted attention in both vision and NLP (Farhadi et al., 2010; Kulkarni et al., 2011; Gupta and Mooney, 2010; Gong et al., 2014; Fang et al., 2014). We build on this powerful paradigm, but instead of generating captions we show how processing multimodal information helps improve textual or visual interpretations for solving geometry questions.

Diagram understanding has been explored since the early days of AI (Lin et al., 1985; Hegarty and Just, 1989; Novak, 1995; O'Gorman and Kasturi, 1995; Bulko, 1988; Srihari, 1994; Lovett and Forbus, 2012). Most previous approaches differ from our method because they address the twin problems of diagram understanding and text understanding in isolation. Often, previous work relies on manual identification of visual primitives, or on rule-based systems for text analysis. The closest work to ours is the recent work of Seo et al. (2014) that aligns geometric shapes with their textual mentions, but does not identify geometric relations or solve geometry problems.

3 Problem Formulation

A geometry question is a tuple (t, d, c) consisting of a text t in natural language, a diagram d

in raster graphics, and multiple choice answers c = {c1, . . . , cM} (M = 5 in SAT). Answering a geometry question is to find a correct choice ci.

Our method, GeoS, consists of two steps (Figure 2): (1) interpreting a geometry question by deriving a logical expression that represents the meaning of the text and the diagram, and (2) solving the geometry question by checking the satisfiability of the derived logical expression. In this paper we mainly focus on interpreting geometry questions and use a standard algebraic solver (see Section 7 for a brief description of the solver).

Figure 2: Overview of our method for solving geometry questions. The example question reads: "In triangle ABC, line DE is parallel with line AC, DB equals 4, AD is 8, and DE is 5. Find AC. (a) 9 (b) 10 (c) 12.5 (d) 15 (e) 17". Text parsing (Section 4) produces scored literals such as IsTriangle(ABC) 0.96, Parallel(AC, DE) 0.91, Parallel(AC, DB) 0.74, Equals(LengthOf(DB), 4) 0.97, Equals(LengthOf(AD), 8) 0.94, Equals(LengthOf(DE), 5) 0.94, Equals(4, LengthOf(AD)) 0.31. Diagram parsing (Section 5) produces Colinear(A, D, B) 1.0, Colinear(B, E, C) 1.0, Parallel(AC, DE) 0.99, Parallel(AC, DB) 0.02. Interpretation (Section 6) selects the subset L* ⊂ L: IsTriangle(ABC), Parallel(AC, DE), Equals(LengthOf(DB), 4), Equals(LengthOf(AD), 8), Equals(LengthOf(DE), 5), Find(LengthOf(AC)). The solver (Section 7) returns answer (d).

Definitions: We formally represent logical expressions in the geometry domain with the language Ω, a subset of typed first-order logic that includes:
• constants, corresponding to known numbers (e.g., 5 and 2 in Figure 1) or entities with known geometric coordinates.
• variables, corresponding to unknown numbers or geometrical entities in the question (e.g., O and CE in Figure 1).
• predicates, corresponding to geometric or arithmetic relations (e.g., Equals, IsDiameter, IsTangent).
• functions, corresponding to properties of geometrical entities (e.g., LengthOf, AreaOf) or arithmetic operations (e.g., SumOf, RatioOf).

Each element in the geometry language has either boolean (e.g., true), numeric (e.g., 4), or entity (e.g., line, circle) type. We refer to all symbols in the language Ω as concepts.

We use the term literal to refer to the application of a predicate to a sequence of arguments (e.g., IsTriangle(ABC)). Literals are possibly negated atomic formulas in the language Ω. Logical formulas contain constants, variables, functions, existential quantifiers and conjunctions over literals (e.g., ∃x, IsTriangle(x) ∧ IsIsosceles(x)).

Interpretation is the task of mapping a new geometry question with each choice, (t, d, cm), into a logical formula γ in Ω. More formally, the goal is to find γ* = arg max_{γ∈Γ} score(γ; t, d, cm), where Γ is the set of all logical formulas in Ω and score measures the interpretation score of the formula according to both text and diagram. The problem of deriving the best formula γ* can be modeled as a combinatorial search in the space of literals L (note that each logical formula γ is represented as a conjunction over literals li).

GeoS efficiently searches this combinatorial space taking advantage of a submodular set function that scores a subset of literals using both text and diagram. The best subset of literals is the one that has a high affinity with both text and diagram and is coherent, i.e., does not suffer from redundancy (see Section 6). More formally,²

    L* = arg max_{L′⊂L} λ A(L′, t, d) + H(L′, t, d),    (1)

where the affinity term A(L′, t, d) measures the affinity of the literals in L′ with both the text and the diagram, the coherence term H(L′, t, d) measures the coverage of the literals in L′ compared to the text and discourages redundancies, and λ is a trade-off parameter between A and H.

The affinity A is decomposed into text-based affinity, Atext, and diagram-based affinity, Adiagram. The text-based affinity closely mirrors the linguistic structure of the sentences as well as type matches in the geometry language Ω. For modeling the text score for each literal, we learn a log-linear model. The diagram-based affinity Adiagram grounds literals into the diagram, and scores literals according to the diagram parse. We describe the details of how to compute Atext in Section 4 and Adiagram in Section 5.

² We omit the argument cm for ease of notation.
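To make Equation 1 concrete, the following is a minimal sketch of the objective over a candidate set of literals. The Literal class, the helper names, and the toy scores are our own illustration rather than GeoS's code; only the shape of the objective, F(L′) = λ·A(L′) + H(L′), follows the formulation above.

from dataclasses import dataclass

@dataclass(frozen=True)
class Literal:
    """A predicate applied to arguments, e.g. IsTriangle(ABC)."""
    pred: str
    args: tuple

    def concepts(self):
        # Concept nodes this literal uses: its predicate plus its arguments.
        return {self.pred} | set(self.args)

def affinity(subset, a_text, a_diagram):
    # A(L'): sum of per-literal text and diagram affinities, each in [-inf, 0].
    return sum(a_text[l] + a_diagram[l] for l in subset)

def coherence(subset):
    # H(L'): number of covered concepts minus number of redundant concept uses.
    covered, redundant = set(), 0
    for lit in subset:
        for c in lit.concepts():
            if c in covered:
                redundant += 1
            covered.add(c)
    return len(covered) - redundant

def objective(subset, a_text, a_diagram, lam=0.5):
    # Equation (1): F(L') = lambda * A(L') + H(L').
    return lam * affinity(subset, a_text, a_diagram) + coherence(subset)

# Toy example with two literals from Figure 2 (the scores are made up):
l1 = Literal("Parallel", ("AC", "DE"))
l2 = Literal("Parallel", ("AC", "DB"))
a_text = {l1: -0.09, l2: -0.30}
a_diagram = {l1: -0.01, l2: -3.90}
print(objective({l1}, a_text, a_diagram))      # 2.95: a well-supported literal
print(objective({l1, l2}, a_text, a_diagram))  # -0.15: adding l2 hurts

Note that the redundancy count depends only on how often each concept is reused, not on iteration order, so coherence is well defined on sets.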
4 Text Parser

1. {lj}, Atext ← TEXTPARSING(language Ω, text-choice pair (t, ci)) (Section 4)
   (i) concept identification: initialize a hypergraph G with concept nodes.
   (ii) relation identification: add a hyperedge (relation) rj between two or three related concept nodes and assign a weight Atext(rj, t; θ) based on the learned classifier.
   (iii) literals parsing: obtain all subtrees of G, which are equivalent to all possible literals, {l′j}. Let Atext(lj, t) = Σj Atext(rj, t; θ) for all rj in the literal lj.
   (iv) relation completion: obtain a complete literal lj for each (under-specified) l′j, dealing with implication and coordinating conjunctions.
2. L∆, Adiagram ← DIAGRAMPARSING(diagram image d, literals {lj}) (Section 5)
3. L* ← GREEDYMAXIMIZATION(literals L = {lj}, score functions Atext and Adiagram) (Section 6)
   (i) initialization: L′ ← {}
   (ii) greedy addition: add(L′, lj) s.t. lj = arg max_{lj∈L\L′} F(L′ ∪ {lj}) − F(L′), where F = λA + H
   (iii) iteration: repeat step (ii) while the gain is positive.
4. Answer c* ← one of the choices s.t. L* ∪ L∆ is simultaneously satisfiable according to SOLVER (Section 7)

Figure 3: Solving geometry questions with GeoS.

The text-based scoring function Atext(L, t) computes the affinity score between the set of literals L and the question text t. This score is the sum of the affinity scores of individual literals lj ∈ L, i.e., Atext(L, t) = Σj Atext(lj, t), where Atext(lj, t) ↦ [−∞, 0].³ GeoS learns a discriminative model Atext(lj, t; θ) that scores the affinity of every literal lj ∈ L and the question text t through supervised learning from training data.

We represent literals using a hypergraph (Figure 4) (Klein and Manning, 2005; Flanigan et al., 2014). Each node in the graph corresponds to a concept in the geometry language (i.e., constants, variables, functions, or predicates). The edges capture the relations between concepts; concept nodes are connected if one concept is the argument of the other in the geometry language. In order to interpret the question text (Figure 3, step 1), GeoS first identifies concepts evoked by the words or phrases in the input text. Then, it learns the affinity scores, which are the weights of edges in the hypergraph. It finally completes relations so that type matches are satisfied in the formal language.

Figure 4: Hypergraph representation of the sentence "A tangent line is drawn to circle O with radius of 5". Constants and variables (line, O, 5) sit at the bottom, the function RadiusOf above them, and the predicates (IsTangentTo, IsCircle, Equals) at the top.

³ For ease of notation, we use Atext as a function taking sets of literals or a literal.

4.1 Concept Identification

Concepts are defined as symbols in the geometry language Ω. The concept identification stage maps words or phrases to their corresponding concepts in the geometry language. Note that a phrase can be mapped to several concepts. For instance, in the sentence "ABCD is a square with an area of 1", the word "square" is a noun referring to some object, so it maps to a variable square. In a similar sentence, "square ABCD has an area 1", the word "square" describes the variable ABCD, so it maps to a predicate IsSquare.

GeoS builds a lexicon from training data that maps stemmed words and phrases to the concepts in the geometry language Ω. The lexicon is derived from all correspondences between geometry keywords and concepts in the geometry language, as well as phrases and concepts from manual annotations in the training data. For instance, the lexicon contains ("square", {square, IsSquare}), including all possible concepts for the phrase "square". Note that GeoS does not make any hard decision on which identification is correct at this stage, and defers it to the relation identification stage (Section 4.2). To identify numbers and explicit variables (e.g., "5", "AB", "O"), GeoS uses regular expressions. For an input text t, GeoS assigns one node in the graph (Figure 4) for each concept identified by the lexicon.
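As a rough illustration of this stage, the sketch below applies a toy lexicon and the two regular-expression rules to the sentence of Figure 4. The lexicon entries, the regexes, and the node representation are simplifications of ours; GeoS's actual lexicon is learned from training annotations and operates over stemmed words and phrases.

import re

# Toy lexicon: one phrase may map to several candidate concepts;
# disambiguation is deferred to relation identification (Section 4.2).
LEXICON = {
    "tangent": {"IsTangentTo"},
    "line": {"line", "IsLine"},
    "circle": {"circle", "IsCircle"},
    "radius": {"RadiusOf", "IsRadius"},
    "square": {"square", "IsSquare"},
}

def identify_concepts(text):
    """Return (token, candidate-concepts) node candidates for one sentence."""
    nodes = []
    for token in re.findall(r"[A-Za-z]+|\d+(?:\.\d+)?", text):
        if token.lower() in LEXICON:
            nodes.append((token, LEXICON[token.lower()]))
        elif re.fullmatch(r"\d+(?:\.\d+)?", token):
            nodes.append((token, {"const:" + token}))  # known number
        elif re.fullmatch(r"[A-Z]{1,4}", token):
            # Explicit variable such as AB or O. (A real system would use
            # POS tags to avoid catching the article "A" here.)
            nodes.append((token, {"var:" + token}))
    return nodes

print(identify_concepts("A tangent line is drawn to circle O with radius of 5"))
# [('A', {'var:A'}), ('tangent', {'IsTangentTo'}), ('line', {'line', 'IsLine'}),
#  ('circle', {'circle', 'IsCircle'}), ('O', {'var:O'}),
#  ('radius', {'RadiusOf', 'IsRadius'}), ('5', {'const:5'})]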
4.2 Relation Identification

A relation is a directed hyperedge between concept nodes. A hyperedge connects two nodes (for unary relations, such as the edge between RadiusOf and O in Figure 4) or three nodes (for binary relations, such as the hyperedge between Equals and its two arguments RadiusOf and 5 in Figure 4).

We use a discriminative model (logistic regression) to predict the probability of a relation ri being correct in text t: Pθ(yi | ri, t) = 1 / (1 + exp(ftext(ri, t) · θ)), where yi ∈ {0, 1} is the label for ri being correct in t, ftext(ri, t) is a feature vector of t and ri, and θ is a vector of parameters to be learned.

Dependency tree distance: shortest distance between the words of the concept nodes in the dependency tree. We use indicator features for distances of -3 to 3. A positive distance indicates that the child word is to the right of the parent's in the sentence, a negative one otherwise.
Word distance: distance between the words of the concept nodes in the sentence.
Dependency tree edge label: indicator functions for the outgoing edges of the parent and the child on the shortest path between them.
Part of speech tag: indicator functions for the POS tags of the parent and the child.
Relation type: indicator functions for unary / binary parent and child nodes.
Return type: indicator functions for the return types of the parent and the child nodes. For example, the return type of Equals is boolean, and that of LengthOf is numeric.

Table 1: The features of the unary relations. The features of the binary relations are computed in a similar way.
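The relation scorer of Section 4.2 reduces to a few lines once the Table 1 cues are encoded as a binary feature vector. The sketch below is ours, not the released implementation; we write the sigmoid with the conventional negative sign in the exponent (the sign convention is absorbed into θ), and the L2-regularized maximum-likelihood fit corresponds to, e.g., scikit-learn's LogisticRegression(penalty="l2").

import numpy as np

def relation_affinity(features, theta):
    """A_text(r, t; theta) = log P_theta(y = 1 | r, t) for one candidate relation.

    `features` is a 0/1 vector encoding the Table 1 cues (tree-distance
    indicators, edge labels, POS tags, relation and return types);
    `theta` is the learned weight vector. The result lies in (-inf, 0]
    and becomes the weight of the corresponding hyperedge.
    """
    p = 1.0 / (1.0 + np.exp(-np.dot(features, theta)))
    return float(np.log(p))

# Example: three indicator features firing, with toy weights.
print(relation_affinity(np.array([1.0, 0.0, 1.0]), np.array([1.2, -0.4, 0.7])))
# approximately -0.14, i.e. a confident hyperedge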
(a) sentence: "What is the perimeter of ABCE?"
    intermediate: ∃ what, ABCE : Bridged(what, PerimeterOf(ABCE))
    final: ∃ what, ABCE : Equals(what, PerimeterOf(ABCE))
(b) sentence: "AM and CM bisect BAC and BCA."
    intermediate: ∃ AM, CM, BAC, BCA : BisectsAngle(AM, BAC) ∧ CC(AM, CM) ∧ CC(BAC, BCA)
    final: ∃ AM, CM, BAC, BCA : BisectsAngle(AM, BAC) ∧ BisectsAngle(CM, BCA)

Figure 5: The two-stage learning with the intermediate representation, demonstrating implication (a) and coordinating conjunction (b).

We define the affinity score of ri by Atext(ri, t; θ) = log Pθ(yi | ri, t). The weight of the corresponding hyperedge is the relation's affinity score. We learn θ using maximum likelihood estimation on the training data (details in Section 8), with L2 regularization.

We train two separate models for learning unary and binary relations. The training data consists of sentence-relation-label tuples (t, r, y); for instance, ("A tangent line is drawn to circle O", IsTangent(line, O), 1) is a positive training example. All incorrect relations in the sentences of the training data are negative examples (e.g., ("A tangent line is drawn to circle O", IsCircle(line), 0)).

The features for the unary and binary models are shown in Table 1 for the text t and the relation ri. We use two main feature categories. Structural features capture the syntactic cues of the text in the form of text distance, dependency tree labels, and part of speech tags for the words associated with the concepts in the relation. Geometry language features capture the cues available in the geometry language Ω in the form of the types and the truth values of the corresponding concepts in the relation.

At inference, GeoS uses the learned models to calculate the affinity scores of all the literals derived from the text t. The affinity score of each literal lj is calculated from the edge (relation) weights in the corresponding subgraph, i.e., Atext(lj, t) = Σi Atext(ri, t; θ) for all ri in the literal lj.

4.3 Relation Completion

So far, we have explained how to score the affinities between explicit relations and the question text. Geometry questions usually include implicit concepts. For instance, "Circle O has a radius of 5" implies the Equals relationship between "radius of circle O" and "5". In addition, geometry questions include coordinating conjunctions between entities. In "AM and CM bisect BAC and BCA", "bisect" is shared by two lines and two angles (Figure 5 (b)). Also, consider two sentences: "AB and CD are perpendicular" and "AB is perpendicular to CD". Both have the same semantic annotation but very different syntactic structures.

It is difficult to directly fit the syntactic structure of question sentences into the formal language Ω for implications and coordinating conjunctions, especially with small training data. We instead adopt a two-stage learning approach inspired by recent work in semantic parsing (Kwiatkowski et al., 2013). Our solution assumes an intermediate representation that is syntactically sound but possibly under-specified. The intermediate representation closely mirrors the linguistic structure of the sentences. In addition, it can easily be transferred to the formal representation in the geometry language Ω.

Figure 5 shows how implications and coordinating conjunctions are modeled in the intermediate representation. Bridged in Figure 5 (a) indicates that there is a special relation (edge) between the two concepts (e.g., what and PerimeterOf), but the alignment to the geometry language is not clear. CC in Figure 5 (b) indicates that there is a special relation between two concepts that are connected by "and" in the sentence. GeoS completes the under-specified relations by mapping them to the corresponding well-defined relations in the formal language.

Implication: We train a log-linear classifier to identify whether a Bridged relation (implied concept) exists between two concepts. Intuitively, the classification score indicates the likelihood that two particular concepts (e.g., what and PerimeterOf) are bridged. For training, positive examples are pairs of concepts whose underlying relation is under-specified, and negative examples are all other pairs of concepts that are not bridged. For instance, (what, PerimeterOf) is a positive training example for the bridged relation. We use the same features as in Table 1 for the classifier.

We then use a deterministic rule to map bridged relations in the intermediate representation to the correct completed relations in the final representation. In particular, we map Bridged to Equals if the two child concepts are of type number, and to IsA if the concepts are of type entity (e.g., point, line, circle).

Coordinating Conjunctions: CC relations model coordinating conjunctions in the intermediate representation. For example, Figure 5 (b) shows the conjunction between the two angles BAC and BCA. We train a log-linear classifier for the CC relations, where the setup of the model is identical to that of the binary relation model in Section 4.2.

After we obtain a list of CC(x, y) relations in the intermediate representation, we use deterministic rules to coordinate the entities x and y in each CC relation (Figure 5 (b)). First, GeoS forms a set {x, y} for every two concepts x and y that appear in a CC(x, y) and transforms every x and y in other literals to {x, y}. Second, GeoS transforms the relations with expansion and distribution rules (Figure 3, step 1 (iv)). For instance, Perpendicular({x, y}) is transformed to Perpendicular(x, y) (expansion rule), and LengthOf({x, y}) is transformed to LengthOf(x) ∧ LengthOf(y) (distribution rule).
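A minimal sketch of the two rewriting rules follows. The encoding of a literal as a predicate name plus an argument list, with a Python frozenset standing in for the coordinated set {x, y}, and the list of distributive functions are our own choices; the respective pairing in Figure 5 (b) (AM with BAC, CM with BCA) requires an additional rule that we omit.

DISTRIBUTIVE = {"LengthOf", "MeasureOf"}  # functions split over set members

def complete_cc(pred, args):
    """Rewrite a literal whose arguments contain a coordinated set.

    Expansion:    Perpendicular({AB, CD}) -> [Perpendicular(AB, CD)]
    Distribution: LengthOf({AB, CD})      -> [LengthOf(AB), LengthOf(CD)]
    """
    sets = [a for a in args if isinstance(a, frozenset)]
    if not sets:
        return [(pred, list(args))]
    s = sets[0]
    if pred in DISTRIBUTIVE:
        # Distribution rule: one literal per member of the set.
        return [(pred, [m if a is s else a for a in args]) for m in sorted(s)]
    # Expansion rule: the set members become separate arguments.
    new_args = []
    for a in args:
        if a is s:
            new_args.extend(sorted(a))
        else:
            new_args.append(a)
    return [(pred, new_args)]

print(complete_cc("Perpendicular", [frozenset({"AB", "CD"})]))
# [('Perpendicular', ['AB', 'CD'])]
print(complete_cc("LengthOf", [frozenset({"AB", "CD"})]))
# [('LengthOf', ['AB']), ('LengthOf', ['CD'])]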
5 Diagram Parser

We use the publicly available diagram parser of Seo et al. (2014) to obtain the set of all visual elements (points, lines, circles, etc.), their coordinates, their relationships in the diagram, and their alignment with entity references in the text (e.g., "line AB", "circle O"). The diagram parser serves two purposes: (a) computing the diagram score as a measure of the affinity of each literal with the diagram; (b) obtaining high-confidence visual literals that cannot be obtained from the text.

Diagram score: For each literal lj from the text parsing, we obtain its diagram score Adiagram(lj, d) ↦ [−∞, 0]. GeoS grounds each literal derived from the text by replacing every variable (entity or numerical variable) in the relation with the corresponding variable from the diagram parse. The score function is a relaxed indicator function of whether a literal is true according to the diagram. For instance, in Figure 1 (a), consider the literal l = Perpendicular(AC, BD). In order to obtain its diagram score, we compute the angle between the lines AC and BD in the diagram and compare it with π/2. The closer the two values, the higher the score (closer to 0); the farther apart they are, the lower the score. Note that the variables AC and BD are grounded into the diagram before we obtain the score; that is, they are matched with the actual corresponding lines AC and BD in the diagram.

The diagram parser is not able to evaluate the correctness of some literals, in which case their diagram scores are undefined. For instance, Equals(LengthOf(AB), 5) cannot be evaluated in the diagram because the scales in the diagram (pixels) and the text are different. As another example, Equals(what, RadiusOf(circle)) cannot be evaluated because it contains an ungrounded (query) variable, what. When the diagram score of a literal lj is undefined, GeoS lets Adiagram(lj) = Atext(lj).

If the diagram score of a literal is very low, then it is highly likely that the literal is false. For example, in Figure 2, Parallel(AC, DB) has a very low diagram score, 0.02, and is apparently false in the diagram. Concretely, if for some literal lj, Adiagram(lj) < ε, then GeoS disregards the text score of lj by replacing Atext(lj) with Adiagram(lj). On the other hand, even if the diagram score of a literal is very high, it is still possible that the literal is false, because many diagrams are not drawn to scale. Hence, GeoS adds both text and diagram scores in order to score literals (Section 6).

High-confidence visual literals: Diagrams often contain critical information that is not present in the text. For instance, to solve the question in Figure 1, one has to know that the points A, E, and C are collinear. In addition, diagrams include numerical labels (e.g., one of the labels in Figure 1 (b) indicates that the measure of the angle ABC is 40 degrees). This kind of information is confidently parsed with the diagram parser of Seo et al. (2014). We denote by L∆ the set of high-confidence literals that are passed to the solver (Section 7).
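As an illustration of one relaxed indicator, the sketch below scores Perpendicular(l1, l2) for two grounded lines. The specific expression, a negative absolute deviation of the measured angle from π/2, is our choice; the text above only requires that the score approach 0 as the diagram supports the literal.

import math

def perpendicular_score(p1, p2, q1, q2):
    """Relaxed diagram score in [-inf, 0] for Perpendicular(l1, l2).

    Each line is grounded to two endpoint coordinates (x, y) taken
    from the diagram parse; the score is 0 exactly when the measured
    angle between the lines equals pi/2.
    """
    v = (p2[0] - p1[0], p2[1] - p1[1])
    w = (q2[0] - q1[0], q2[1] - q1[1])
    cos = abs(v[0] * w[0] + v[1] * w[1]) / (math.hypot(*v) * math.hypot(*w))
    angle = math.acos(min(1.0, cos))  # unsigned angle in [0, pi/2]
    return -abs(angle - math.pi / 2)

# A vertical and a horizontal chord: exactly perpendicular, score 0.0.
print(perpendicular_score((0, 0), (0, 4), (-2, 1), (3, 1)))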

ical labels (e.g. one of the labels in Figure 1(b) in- of automated geometry theorem proving in com-
dicates the measure of the angle ABC = 40 degrees). putational geometry (Alvin et al., 2014).
This kind of information is confidently parsed with We use a numerical method to check the satis-
the diagram parser by Seo et al. (2014). We denote fiablity of literals. For each literal lj in L∗ ∪ L∆ ,
the set of the high-confidence literals by L∆ that we define a relaxed indicator function gj : S 7→
are passed to the solver (Section 7). zj ∈ [−∞, 0]. The function zj = gj (S) indi-
cates the relaxed satisfiability of lj given an as-
6 Optimization signment S to the variables X. The literal lj
is completely satisfied if gj (S) = 0. We for-
Here, we describe the details of the objective func-
mulate the problem of satisfiability of literals as
tion (Equation 1) and how to efficiently maximize
the task of finding the assignment S ∗ to X such
it. The integrated affinity score of a set of literals
that sum of all indicator functions
P gj (S ∗ ) is maxi-
L0 (the first term in Equation 1) is defined as:
mized, i.e. S ∗ = arg maxS j gj (S). We use the
X  basing-hopping algorithm (Wales and Doye, 1997)
A(L0 , t, d) = Atext (lj0 , t) + Adiagram (lj0 , d)
with sequential least squares programming (Kraft,
lj0 ∈L0
1988) to globally maximize the sum of the indica-
tor functions.
P If there exists an assignment such
where Atext and Adiagram are the text and dia-
that j gj (S) = 0, then G EO S finds an assign-
gram affinities of lj0 , respectively.
ment to X that satisfies all literals. If such assign-
To encourage G EO S to pick a subset of literals
ment does not exist, then G EO S concludes that the
that cover the concepts in the question text and, at
literals are not satisfiable simultaneously. G EO S
the same time, avoid redundancies, we define the
chooses to answer a geometry question if the lit-
coherence function as:
erals of exactly one answer choice are simultane-
H(L0 , t, d) = Ncovered (L0 ) − Rredundant (L0 ) ously satisfiable.
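Figure 3, step 3 translates almost line for line into code. The sketch below is ours; objective can be any callable implementing F = λA + H, for instance the toy objective sketched in Section 3. Submodularity of the objective is what makes this simple loop a reliable approximation.

def greedy_maximize(literals, objective):
    """Greedy maximization of a submodular F (Figure 3, step 3).

    Starting from the empty set, repeatedly add the literal with the
    largest gain F(L' + {l}) - F(L'); stop when no positive gain remains.
    """
    chosen = set()
    current = objective(chosen)
    while True:
        best_gain, best_lit = 0.0, None
        for lit in literals - chosen:
            gain = objective(chosen | {lit}) - current
            if gain > best_gain:
                best_gain, best_lit = gain, lit
        if best_lit is None:  # all remaining gains are non-positive
            return chosen
        chosen.add(best_lit)
        current += best_gain

With the toy scores from the Section 3 sketch, greedy_maximize({l1, l2}, lambda S: objective(S, a_text, a_diagram)) adds the well-supported Parallel(AC, DE) literal and then stops, since the gain of also adding the contradicted Parallel(AC, DB) is negative.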

7 Solver

We now have the best set of literals L* from the optimization, and the high-confidence visual literals L∆ from the diagram parser. In this step, GeoS determines whether an assignment exists to the variables X in L* ∪ L∆ that simultaneously satisfies all of the literals. This is known as the problem of automated geometry theorem proving in computational geometry (Alvin et al., 2014).

We use a numerical method to check the satisfiability of the literals. For each literal lj in L* ∪ L∆, we define a relaxed indicator function gj : S ↦ zj ∈ [−∞, 0]. The function zj = gj(S) indicates the relaxed satisfiability of lj given an assignment S to the variables X. The literal lj is completely satisfied if gj(S) = 0. We formulate the problem of the satisfiability of the literals as the task of finding the assignment S* to X such that the sum of all indicator functions gj(S*) is maximized, i.e., S* = arg max_S Σj gj(S). We use the basin-hopping algorithm (Wales and Doye, 1997) with sequential least squares programming (Kraft, 1988) to globally maximize the sum of the indicator functions. If there exists an assignment such that Σj gj(S) = 0, then GeoS has found an assignment to X that satisfies all literals. If such an assignment does not exist, then GeoS concludes that the literals are not simultaneously satisfiable. GeoS chooses to answer a geometry question if the literals of exactly one answer choice are simultaneously satisfiable.
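The numerical check maps directly onto SciPy: its basinhopping routine implements Wales and Doye's algorithm and accepts SLSQP (Kraft's sequential least squares programming) as the local minimizer. The sketch below encodes a toy version of Figure 1 (c) with three variables; the squared-penalty indicators and the tolerance are our own choices, not GeoS's actual constraint encoding.

import numpy as np
from scipy.optimize import basinhopping

# Variables X = (AB, BD, angle BAD); squared penalties keep each relaxed
# indicator g_j(S) <= 0 smooth for the SLSQP local minimizer.
def g_ab(S):   return -(S[0] - 6.0) ** 2                  # Equals(LengthOf(AB), 6)
def g_bd(S):   return -(S[1] - 3.0) ** 2                  # Equals(LengthOf(BD), 3)
def g_trig(S): return -(np.sin(S[2]) * S[0] - S[1]) ** 2  # BD = AB * sin(BAD)

indicators = [g_ab, g_bd, g_trig]

def negative_total(S):
    # basinhopping minimizes, so negate the sum we want to maximize.
    return -sum(g(S) for g in indicators)

result = basinhopping(negative_total, x0=np.ones(3),
                      minimizer_kwargs={"method": "SLSQP"})
satisfiable = result.fun < 1e-6  # all g_j(S*) = 0 up to tolerance
print(satisfiable, result.x)

In this toy encoding the recovered angle BAD is π/6, half of the 60° angle BAC asked for in Figure 1 (c); GeoS runs such a check once per answer choice and answers only when exactly one choice is satisfiable.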
8 Experimental Setup

Logical language Ω: Ω consists of 13 types of entities and 94 functions and predicates observed in our development set of geometry questions.

Implementation details: Sentences in geometry questions often contain in-line mathematical expressions, such as "If AB=x+5, what is x?". These mathematical expressions cause general-purpose parsers to fail. GeoS uses an equation analyzer and pre-processes question text by replacing "=" with "equals", and replacing mathematical terms (e.g., "x+5") with a dummy noun so that the dependency parser does not fail.

GeoS uses the Stanford dependency parser (Chen and Manning, 2014) to obtain syntactic information, which is used to compute features for relation identification (Table 1). For diagram parsing, similar to Seo et al. (2014), we assume that GeoS has access to ground-truth optical character recognition for labels in the diagrams. For optimization, we tune the parameter λ to 0.5, based on the training examples.⁴

⁴ In our dataset, the number of all possible literals for each sentence is at most 1000.
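A sketch of the equation-analyzer pre-processing described above follows; the regular expression and the dummy-noun naming scheme are ours, and the returned mapping stands in for the bookkeeping a full analyzer would need to restore the terms after parsing.

import re

# An in-line arithmetic term: operands joined by +, -, *, / or ^.
MATH_TERM = re.compile(r"[A-Za-z0-9]+(?:\s*[+\-*/^]\s*[A-Za-z0-9]+)+")

def preprocess(sentence):
    """Shield a general-purpose dependency parser from in-line math."""
    sentence = sentence.replace("=", " equals ")
    mapping = {}
    def to_dummy(match):
        key = "term%d" % len(mapping)  # dummy noun standing in for the term
        mapping[key] = match.group(0)
        return key
    return MATH_TERM.sub(to_dummy, sentence), mapping

print(preprocess("If AB=x+5, what is x?"))
# ('If AB equals term0, what is x?', {'term0': 'x+5'})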

                  Total  Training  Practice  Official
Questions          186      67        64        55
Sentences          336     121       110       105
Words             4343    1435      1310      1598
Literals           577     176       189       212
Binary relations   337     110       108       119
Unary relations    437     141       150       146

Table 2: Data and annotation statistics.

Dataset: We built a dataset of SAT plane geometry questions where every question has a textual description in English accompanied by a diagram and multiple choices. Questions and answers are compiled from previous official SAT exams and practice exams offered by the College Board (Board, 2014). In addition, we use a portion of the publicly available high-school plane geometry questions (Seo et al., 2014) as our training set.

We annotate ground-truth logical forms for all questions in the dataset. Table 2 shows details of the data and annotation statistics. For evaluating dependency parsing, we annotate 50 questions with the ground-truth dependency tree structures of all sentences in the questions.⁵

Baselines: Rule-based text parsing + GeoS diagram solves geometry questions using literals extracted by a manually defined set of rules over the textual dependency parse, and scored by the diagram. For this baseline, we manually designed 12 high-precision rules based on the development set. Each rule compares the dependency tree of each sentence to pre-defined templates, and if a template pattern is matched, the rule outputs the relation or function structure corresponding to that template. For example, a rule assigns a relation parent(child-1, child-2) for a triplet (parent, child-1, child-2) where child-1 is the subject of the parent and child-2 is the object of the parent.

GeoS without text parsing solves geometry questions using a simple heuristic. With simple textual processing, this baseline extracts numerical relations from the question text and then computes the scale between the units in the question and the pixels in the diagram. This baseline rounds the number to the closest choice available in the multiple choices.

GeoS without diagram parsing solves geometry questions relying only on the literals interpreted from the text. It outputs all literals whose text scores are higher than a threshold, tuned to 0.6 on the training set.

GeoS without relation completion solves geometry questions when text parsing does not use the intermediate representation and does not include the relation completion step.

⁵ The source code, the dataset and the annotations are publicly available at geometry.allenai.org.

9 Experiments

We evaluate our method on three tasks: solving geometry questions, interpreting geometry questions, and dependency parsing.

Solving Geometry Questions: Table 3 compares the score of GeoS in solving practice and official SAT geometry questions with that of the baselines. SAT's grading scheme penalizes a wrong answer with a negative score of 0.25. We report the SAT score as the percentage of correctly answered questions, penalized by the wrong answers. For official questions, GeoS answers 27 questions correctly, 1 question incorrectly, and leaves 27 unanswered, which gives it a score of 26.75 out of 55, or 49%. Thus, GeoS's precision exceeds 96% on the 51% of questions that it chooses to answer. For practice SAT questions, GeoS scores 61%.⁶

In order to understand the effect of individual components of GeoS, we compare the full method with a few ablations. GeoS significantly outperforms the two baselines GeoS without text parsing and GeoS without diagram parsing, demonstrating that GeoS benefits from both text and diagram parsing. In order to understand the text parsing component, we compare GeoS with Rule-based text parsing + GeoS diagram and GeoS without relation completion. The results show that our method of learning to interpret literals from the text is substantially better than the rule-based baseline. In addition, the relation completion step, which relies on the intermediate representation, helps to improve text interpretation.

Error Analysis: In order to understand the errors made by GeoS, we use oracle text parsing and oracle diagram parsing (Table 3). Roughly 38% of the errors are due to failures in text parsing, and about 46% of the errors are due to failures in diagram parsing. Among them, about 15% of the errors are due to failures in both diagram and text parsing. For an example of a text parsing failure, the literals in Figure 6 (a) are not scored accurately due to missing coreference relations (Hajishirzi et al., 2013). The rest of the errors are due to problems that require more complex reasoning (Figure 6 (b)).

⁶ Typically, the 50th percentile (penalized) score in the SAT math section is 27 out of 54 (50%).

Method                                        Practice  Official
GeoS w/o diagram parsing                           7         5
GeoS w/o text parsing                             10        10
Rule-based text parsing + GeoS diagram            31        24
GeoS w/o relation completion                      42        33
GeoS                                              61        49
Oracle text parsing + GeoS diagram parsing        78        75
GeoS text parsing + oracle diagram parsing        81        79
Oracle text parsing + oracle diagram parsing      88        84

Table 3: SAT scores (%) of solving geometry questions.

Method                      P      R      F1
Rule-based text parsing    0.99   0.23   0.37
GeoS w/o diagram           0.57   0.82   0.67
GeoS                       0.92   0.76   0.83

Table 4: Precision and recall of text interpretation.

Method                               Accuracy
Stanford dep parse                     0.05
Stanford dep parse + eq. analyzer      0.64
GeoS                                   0.78

Table 5: Accuracy of dependency parsing.

Figure 6: Examples of failure (reasons in red in the original figure).
(a) "In the figure at the left, the smaller circles each have radius 3. They are tangent to the larger circle at points A and C, and are tangent to each other at point B, which is the center of the larger circle. What is the perimeter of the shaded region?" (a) 6π (b) 8π (c) 9π (d) 8π (e) 15π. Reason: GeoS fails to resolve "they" to "each other".
(b) "In the figure at the left, a shaded polygon which has equal angles is partially covered with a sheet of blank paper. If x+y=80, how many sides does the polygon have?" (a) 10 (b) 9 (c) 8 (d) 7 (e) 6. Reason: requires complex reasoning; GeoS cannot understand that the polygon is "hidden".

Interpreting Question Texts: Table 4 details the precision and recall of GeoS in deriving literals from the texts of official SAT geometry questions. The rule-based text parsing baseline achieves a high precision, but at the cost of lower recall. On the other hand, the baseline GeoS without diagram achieves a high recall, but at the cost of lower precision. Nevertheless, GeoS attains a substantially higher F1 score compared to both baselines, which is the key factor in solving the questions. Direct application of a generic semantic parser (Berant et al., 2013) with full supervision does not perform well in the geometry domain, mainly due to the lack of enough training data. Our initial investigations show a performance of 33% F1 on the official set.

Improving Dependency Parsing: Table 5 shows the results of different methods for dependency parsing. GeoS returns a dependency parse tree by selecting, from the top 50 trees produced by a generic dependency parser, the Stanford parser (Chen and Manning, 2014), the tree that maximizes the text score in the objective function. Note that the Stanford parser cannot handle mathematical symbols and equations. We report the results of a baseline that extends the Stanford dependency parser by adding a pre-processing step to separate the mathematical expressions from the plain sentences (Section 8).

We evaluate the performance of GeoS against the best tree returned by the Stanford parser by reporting the fraction of the questions whose dependency parse structures match the ground-truth annotations. Our results show an improvement of 16% over the Stanford dependency parser when equipped with the equation analyzer. For example, in "AB is perpendicular to CD at E", the Stanford dependency parser predicts that "E" depends on "CD", while GeoS predicts the correct parse in which "E" depends on "perpendicular".
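The selection step itself is a one-liner. In the sketch below, parse_candidates and literal_text_score are hypothetical helpers standing in, respectively, for the Stanford parser's k-best list and for the text score computed over the literals read off a candidate tree.

def rerank_parse(sentence, parse_candidates, literal_text_score, k=50):
    """Among the top-k trees of a generic dependency parser, return the
    tree whose induced literals maximize the text score A_text."""
    return max(parse_candidates(sentence, k), key=literal_text_score)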
10 Conclusion

This paper introduced GeoS, an automated system that combines diagram and text interpretation to solve geometry problems. Solving geometry questions was inspired by two important trends in the current NLP literature. The first is designing methods for grounded language acquisition that map text to a restricted formalism (instead of a full, domain-independent representation); we demonstrate a new algorithm for learning to map text to a geometry language with a small amount of training data. The second is designing methods for coupling language and vision; we show how processing multimodal information helps improve textual or visual interpretations.

Our experiments on unseen SAT geometry problems achieve a score of 49% on official questions and a score of 61% on practice questions, providing a baseline for future work. Future work includes expanding the geometry language and the reasoning to address a broader set of geometry questions, reducing the amount of supervision, learning the relevant geometry knowledge, and scaling up the dataset.

Acknowledgements. The research was supported by the Allen Institute for AI, Allen Distinguished Investigator Award, and NSF (IIS-1352249). We thank Dan Weld, Luke Zettlemoyer, Aria Haghighi, Mark Hopkins, Eunsol Choi, and the anonymous reviewers for helpful comments.

References

Chris Alvin, Sumit Gulwani, Rupak Majumdar, and Supratik Mukhopadhyay. 2014. Synthesis of geometry proof problems. In AAAI.

Gabor Angeli and Christopher D. Manning. 2014. NaturalLI: Natural logic inference for common sense reasoning. In EMNLP.

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. TACL, 1.

J. Berant and P. Liang. 2014. Semantic parsing via paraphrasing. In ACL.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP.

College Board. 2014. The College Board.

Antoine Bordes, Nicolas Usunier, and Jason Weston. 2010. Label ranking under ambiguous supervision for learning semantic correspondences. In ICML.

S. R. K. Branavan, Nate Kushman, Tao Lei, and Regina Barzilay. 2012. Learning high-level planning from text. In ACL.

William C. Bulko. 1988. Understanding text with an accompanying diagram. In IEA/AIE.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP.

David Chen, Joohyun Kim, and Raymond Mooney. 2010. Training a multilingual sportscaster: Using perceptual context to learn language. JAIR, 37.

Jim Cowie and Wendy Lehnert. 1996. Information extraction. Communications of the ACM, 39(1).

Aron Culotta and Jeffrey Sorensen. 2004. Dependency tree kernels for relation extraction. In ACL.

Jacob Eisenstein, James Clarke, Dan Goldwasser, and Dan Roth. 2009. Reading to learn: Constructing features from semantic abstracts. In EMNLP.

Thomas G. Evans. 1964. A heuristic program to solve geometric-analogy problems. In Proceedings of the April 21-23, 1964, Spring Joint Computer Conference.

Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, et al. 2014. From captions to visual concepts and back. In CVPR.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In ECCV.

E. A. Feigenbaum and J. Feldman, editors. 1963. Computers and Thought. McGraw Hill, New York.

Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A discriminative graph-based parser for the abstract meaning representation. In ACL.

Ruifang Ge and Raymond J. Mooney. 2006. Discriminative reranking for semantic parsing. In ACL.

Dan Goldwasser and Dan Roth. 2011. Learning from natural instructions. In IJCAI.

Yunchao Gong, Liwei Wang, Micah Hodosh, Julia Hockenmaier, and Svetlana Lazebnik. 2014. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV.

Sonal Gupta and Raymond J. Mooney. 2010. Using closed captions as supervision for video activity recognition. In AAAI.

Hannaneh Hajishirzi, Julia Hockenmaier, Erik T. Mueller, and Eyal Amir. 2011. Reasoning about RoboCup soccer narratives. In UAI.

Hannaneh Hajishirzi, Leila Zilles, Daniel S. Weld, and Luke S. Zettlemoyer. 2013. Joint coreference resolution and named-entity linking with multi-pass sieves. In EMNLP.

Mary Hegarty and Marcel Adam Just. 1989. Understanding machines from text and diagrams. In Knowledge Acquisition from Text and Pictures.

Ben Hixon, Peter Clark, and Hannaneh Hajishirzi. 2015. Learning knowledge graphs for question answering through conversational dialog. In NAACL.

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In EMNLP.

Rohit J. Kate and Raymond J. Mooney. 2007. Learning language semantics from ambiguous supervision. In AAAI.

Joohyun Kim and Raymond J. Mooney. 2013. Adapting discriminative reranking to grounded language learning. In ACL.

Dan Klein and Christopher D. Manning. 2005. Parsing and hypergraphs. In New Developments in Parsing Technology. Springer.

R. Koncel-Kedziorski, Hannaneh Hajishirzi, and Ali Farhadi. 2014. Multi-resolution language grounding with weak supervision. In EMNLP.

Dieter Kraft et al. 1988. A software package for sequential quadratic programming. DFVLR Oberpfaffenhofen, Germany.
Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2011. Baby talk: Understanding and generating image descriptions. In CVPR.

Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In ACL.

T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP.

Percy Liang, Michael I. Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In ACL-AFNLP.

Xinggang Lin, Shigeyoshi Shimotsuji, Michihiko Minoh, and Toshiyuki Sakai. 1985. Efficient diagram understanding with characteristic pattern detection. CVGIP, 30(1).

A. Lovett and K. Forbus. 2012. Modeling multiple strategies for solving geometric analogy problems. In CCS.

Gordon Novak. 1995. Diagrams for solving physical problems. In Diagrammatic Reasoning: Cognitive and Computational Perspectives.

Lawrence O'Gorman and Rangachar Kasturi. 1995. Document Image Analysis, volume 39. Citeseer.

Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In EMNLP.

Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. TACL, 2(Oct).

S. Roy, T. Vieira, and D. Roth. 2015. Reasoning about quantities in natural language.

Min Joon Seo, Hannaneh Hajishirzi, Ali Farhadi, and Oren Etzioni. 2014. Diagram understanding in geometry questions. In AAAI.

Rohini K. Srihari. 1994. Computational models for integrating linguistic and visual information: A survey. Artificial Intelligence Review, 8(5-6).

Adam Vogel and Daniel Jurafsky. 2010. Learning to follow navigational directions. In ACL.

David J. Wales and Jonathan P. K. Doye. 1997. Global optimization by basin-hopping and the lowest energy structures of Lennard-Jones clusters containing up to 110 atoms. The Journal of Physical Chemistry A, 101(28).

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI.

11 Appendix: Proof of Submodularity of Equation 1

We prove that the objective function in Equation (1), λA(L′) + H(L′), is submodular by showing that A(L′) and H(L′) are submodular functions.

Submodularity of A. Consider L′ ⊂ L, and a new literal to be added, lj ∈ L \ L′. By the definition of A, it is clear that A(L′ ∪ {lj}) = A(L′) + A({lj}). Hence, for all L′′ ⊂ L′ ⊂ L,

    A(L′′ ∪ {lj}) − A(L′′) = A(L′ ∪ {lj}) − A(L′).

Thus A is submodular.

Submodularity of H. We prove that the coverage function, Hcov, and the negation of the redundancy function, −Hred, are submodular independently, and thus derive that their sum is submodular. For both, consider we are given L′′ ⊂ L′ ⊂ L, and a new literal lj ∈ L \ L′. Also, let K′′ and K′ denote the sets of concepts covered by L′′ and L′, respectively, and let Kj denote the set of concepts covered by lj.

Coverage: Since K′′ ⊂ K′, |K′′ ∪ Kj| − |K′′| ≥ |K′ ∪ Kj| − |K′|, which is equivalent to

    Hcov(L′′ ∪ {lj}) − Hcov(L′′) ≥ Hcov(L′ ∪ {lj}) − Hcov(L′).

Redundancy: Note that Hred(L′′ ∪ {lj}) − Hred(L′′) = |K′′ ∩ Kj|, and similarly, Hred(L′ ∪ {lj}) − Hred(L′) = |K′ ∩ Kj|. Since K′′ ⊂ K′, we have |K′′ ∩ Kj| ≤ |K′ ∩ Kj|. Hence,

    Hred(L′′ ∪ {lj}) − Hred(L′′) ≤ Hred(L′ ∪ {lj}) − Hred(L′).

By negating both sides, we derive that the negation of the redundancy function is submodular.
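The diminishing-returns inequality proved above can also be spot-checked numerically. The sketch below uses our own reduction of a literal to its set of concepts, with H written in the equivalent closed form 2·|covered| − (total concept uses), and exhaustively verifies the marginal-gain inequality on a toy ground set.

from itertools import chain, combinations

def powerset(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def is_submodular(ground, f):
    """Check f(S + {x}) - f(S) >= f(T + {x}) - f(T) for all S ⊆ T ⊆ ground, x ∉ T."""
    for T in map(set, powerset(ground)):
        for S in map(set, powerset(T)):
            for x in ground - T:
                if f(S | {x}) - f(S) < f(T | {x}) - f(T):
                    return False
    return True

def coherence(subset):
    # H = #covered concepts - #redundant uses = 2 * |covered| - total uses.
    covered = set().union(*subset) if subset else set()
    return 2 * len(covered) - sum(len(lit) for lit in subset)

# Literals reduced to concept sets: e.g. Perpendicular(AB, CD) -> {Perp, AB, CD}.
ground = {frozenset({"Perp", "AB", "CD"}),
          frozenset({"Len", "AB"}),
          frozenset({"Len", "CD"})}
print(is_submodular(ground, coherence))  # True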

