Supervised by
Dr. Louise Guthrie
August 2006
This report is submitted in partial fulfilment of the requirement for the degree of
MSc in Advanced Computer Science
To
Signed Declaration
All sentences or passages quoted in this dissertation from other people's work have
been specifically acknowledged by clear cross-referencing to author, work and
page(s). Any illustrations which are not the work of the author of this dissertation have
been used with the explicit permission of the originator and are specifically
acknowledged. I understand that failure to do this amounts to plagiarism and will be
considered grounds for failure in this dissertation and the degree examination as a
whole.
Abstract
People rely on information for performing the activities of their daily lives. Much of
this information can be found quickly and accurately online through the World Wide
Web. As the amount of online information grows, better techniques to access the
information are necessary. Techniques for answering specific questions are in high
demand and large research programmes investigating methods for question answering
have been developed to respond to this need.
Question Answering (QA) technology, however, faces some problems that inhibit
its advancement. Typical approaches will first generate many candidate answers for
each question, then attempt to select the correct answer from the set of potential
answers. The techniques for selecting the correct answer are in their infancy, and
further techniques are needed to identify and select the correct answer from the
candidate answers.
This project focuses on multiple choice questions and the development of techniques
for automatically finding the correct answer. In addition to being a novel and
interesting problem on its own, the investigation has identified methods for web based
Question Answering (QA) technology in selecting the correct answers from potential
candidate answers. The project has investigated techniques performed manually and
automatically. The data consists of 600 questions, which were collected from an
online web resource. They are classified into 6 categories, depending on the
questions’ domain, and divided equally between the investigation and the evaluation
stages. The manual experiments were promising, as 45 percent of the answers were
correct, a figure which increased to 95 percent after the form of the queries was
restructured. With automatic techniques, such as using quotation marks and replacing
the question words according to the question type, it was found that the accuracy
ranged between 48.5 and 49 percent. The accuracy also increased to 63 percent and
74 percent in some categories, such as geography and literature.
Acknowledgements
This project would not have existed without the will of Allah, the giver of all good and
perfect gifts. His mercy and blessing have empowered me throughout my life. All
praise is due to Allah for his guidance and grace. Peace and blessings be upon our
prophet Mohammed.
Then, I would like to thank Dr. Louise Guthrie for her supervision, help, guidance and
encouragement throughout this project.
I would also like to thank Dr. Mark Greenwood for providing many good ideas and
resources throughout the project and Mr. David Guthrie for assisting me with the
RASP program.
Many thanks to my dear friend, Mrs. Basma Albuhairan for her assistance in
proofreading.
My deepest thanks to my lovely friends Ebtehal, Arwa and Maha for their friendship,
substantive support and encouragement.
Finally, I would like to thank my compassionate brother Fahad, my sweet sister Ru’a
and my loving aunts Halimah, Hana, Tarfa, and Zahra who through their love
supported me throughout my Masters degree.
Table of Contents
Appendix J ................................................................................................................87
Appendix K ...............................................................................................................88
List of Tables
Table 2. 1 Trigger words and their corresponding thematic roles and attributes..........18
Table 2. 2 Question categories and their possible answer types ..................................19
Table 2. 3 Sample question series from the test set. Series 8 has an organization as a
target, series 10 has a thing as a target, series 27 has a person as a target.............23
Table 4. 1 Number of Google hits for first manual experiment when a given question
and given answer is used as a query in Google ...................................................30
Table 4. 2 Number of Google hits for second manual experiment when a given
question and given answer is used as a query in Google .....................................32
Table 5. 1 Number of correct answers for the First and the Second Manual
Experiments, and the overall accuracy for both of them .....................................36
Table 5. 2 Number of correct answers for every question category and its accuracy, in
addition to the overall accuracy in First Automated Experiment.........................38
Table 5. 3 Number of correct answers for every question category and its accuracy, in
addition to the overall accuracy in Second Automated Experiment.....................39
Table 5. 4 Number of correct answers for every question category and its accuracy
using quotation, in addition to the overall accuracy in the Third Automated
Experiment ........................................................................................................40
Table 5. 5 Number of correct answers for every question type, with its accuracy
with and without using quotation, and the overall accuracy in the Fourth
Automated Experiment ......................................................................................43
Table 5. 6 Number of correct answers for every question type in each category, with
their accuracy with and without using quotation, and the overall accuracy ..........45
Table 6. 1 Evaluation for the participant systems in TREC 2005 QA track.................49
List of Figures
List of Equations
List of Abbreviations
Abbreviation Meaning
AI Artificial Intelligence
IR Information Retrieval
PoS Part-of-Speech
QA Question Answering
Chapter 1: Introduction
1.1 Background
Searching the Web is part of our daily lives, and one of the uses of the Web is to
determine the answer to a specific question, such as “When was Shakespeare born?”
Most people, however, would not choose to spend a lot of time finding the answer to
such a simple question. The tremendous amount of online information is increasing
daily, making the search process for such a specific question even more difficult.
Despite the drastic advances in search engines, they still fall short of understanding
the user’s specific question. As a result, more sophisticated search tools are needed to
reliably provide the correct answers for these questions. The desirability of such
technology has led to a relatively recent increase in funding for the development of
Question Answering (QA) systems (Korfhage, 1997; Levene, 2006).
This project is mainly inspired by the difficulties still present in the QA systems. The
project investigates techniques for selecting accurate answers for multiple choice
questions. This procedure is similar to the answer extraction stage in the traditional
QA systems, since the typical systems first will generate a small set of answers prior to
selecting a candidate answer. This work investigates techniques for choosing the
correct candidate answer from among a small set of possible answers. The results may
therefore be useful for the improvement of the current QA systems.
The aim of this project is to investigate automated techniques for answering multiple
choice questions. Neither commercial nor research QA systems, which attempt to
answer different types of questions and retrieve precise answers, are perfect.
Typically, the system analyses the question, finds a set of candidate answers by
consulting a knowledge resource and selects among the candidate answers before
presenting the correct answer to the user. One area where future research is needed is
with respect to the most appropriate method for selecting the correct answer from a set
of candidate answers. This project therefore focuses on techniques for improving this
stage of the QA systems. In addition, the idea of automatically finding the correct
answers for multiple choice questions is both novel and interesting, and is yet to be
fully explored.
Chapter 2 contains the literature review that explains why QA systems are important.
It also provides a brief history of the QA systems, their types, and the relationship
between QA and the TREC QA track. Details about the evaluation measure for the
QA systems and their metrics are also provided.
Chapter 3 contains the data and resources for this project. It describes in detail the
data set for the investigation experiments and the tools used within the project.
Chapter 4 describes the design and implementation phases, including the experimental
design and the implementation.
Chapter 2: Literature Review
The world today is starving for more information. Organizations and humans are
striving for new or stored information every second in order to improve the
performance of their work and daily activities. People seek information in order to
solve problems, whether booking a seat at the cinema or finding a solution to a
financial problem. They rely on information resources such as their background
knowledge, experts, libraries, private or public information, and the global networks,
which have rapidly become the most important resource and are considered the
cheapest method for easily accessing information from nearly everywhere.
Chowdhury (1999) noticed that electronic information has become readily accessible;
the rate of generating information has also increased. This information explosion has
caused a major problem, which has raised the need for effective and efficient
information retrieval.
Even though Information Retrieval (IR) systems have emerged and improved over the
last decade, users often find the search process difficult and boring, as a simple
question may require an extensive search that is time consuming and exhausting. In
order to meet the need for an efficient search tool that retrieves the correct answer for a
question, the development of accurate QA systems has evolved into an active area of
research. Such systems aim to retrieve answers in the shortest possible time with
minimum expense and maximum efficiency.
This chapter defines what a QA system is, provides a brief history of QA systems and
their main types, illustrated with some important examples. It also explains how the
TREC QA track is involved in promoting QA research, and the metrics needed to
evaluate QA systems.
The recent expansion in the World Wide Web has demanded an increase in high
performance Information Retrieval (IR) systems. These IR systems help the users to
seek whatever they want on the Web and to retrieve the required information. The QA
system is a type of IR system. An IR system provides the user with a collection of
documents containing the search result, and the user extracts the correct answer.
Unlike an IR system, the QA system provides the users with the right answer to their
question using underlying data sources, such as the Web or a local collection of
documents.
Salton (1968) implies that a QA system ideally provides a direct answer in response to
a search query by using stored data covering a restricted subject area (a database).
Thus, if the user asks an efficient QA system “What is the boiling point of water?”, it
would reply “one hundred degrees centigrade” or “100 °C”.
Moreover, Salton (1968) stated that these systems have become important for different
types of applications, varying from standard business-management systems to
sophisticated military or library systems. These applications can help personnel
managers, bank account managers and military officers to easily access varied data
about employees, customer accounts and tactical situations.
“The task of QA system consists in analysing the user query, comparing the analyzed
query with the stored knowledge and assembling a suitable response from the
apparently relevant facts” (Salton and McGill, 1983:9).
The actual beginning for the QA system was in the 1960s within the Artificial
Intelligence (AI) field, developed by researchers who employed knowledge based
techniques such as encyclopaedias, edited lists of frequently asked questions, and sets
of newspaper articles. These researchers attempted to develop rules that enabled a
logical expression to be derived automatically from natural sentences, and vice versa,
and this therefore directed them to the QA system design (Belkin and Vickery, 1985;
Etzioni et al., 2001; Joho, 1999).
QA systems are classified into two types according to the source of the question’s
answer: whether it comes from a structured source or from free text. The former
systems return answers drawn from a database, and these are considered to be the
earliest QA systems. As Belkin and Vickery (1985) declared, this approach does,
however, have some disadvantages, as it is often restricted to a specific domain and
the input questions are often limited to simple forms of English. The latter type
generates answers drawn from free text instead of from a well-structured database.
According to Belkin and Vickery (1985), these systems are not linked to a specific
domain and the text can cover any subject.
The following two sections describe these two types of systems in more detail.
The earliest QA systems are often described as the Natural Language Interface of a
database. They allow the user to access the stored information in a traditional database
by using natural language requests or questions. The request is then translated into a
database query language, e.g. SQL. For instance, if such an interface is connected to
the personnel data of a company and the user asks the system the following (Monz,
2003) (please note that the user’s query is in italics and the system’s response is in
bold):
> Does any employee in the marketing department earn less than Mari Clayton?
No.
Two early examples of this type are ‘Baseball’ (Chomsky et al., 1961) and ‘Lunar’
(Woods, 1973), which were developed in the 1960s and early 1970s. The Baseball
system answers
English questions about baseball (a popular sport in the United States of America).
The database contains information about the teams, scores, dates and locations of these
games, taken from lists summarizing a Major League season’s experience. This
system allows the users, using their natural language, to communicate with an interface
that understands the question and the database structure. The user could only insert a
simple question without any connectives (such as ‘and’, ‘or’, etc.), without
superlatives such as ‘smallest’, ‘youngest’, etc. and without a sequence of events
(Chomsky et al., 1961).
This system answered questions such as “How many games did the Red Sox play in
June?”, “Who did the Yankees lose to on August 10?” or “On how many days in June
did eight teams play?” As Sargaison (2003) explained, this is done by analysing the
question and matching normalised forms of the extracted important key-terms such as
“How many”, “Red Sox” and “June” against a pre-structured database that contained
the baseball data, then returning an answer to the question.
The Lunar system (Woods, 1973) is more sophisticated, providing information about
chemical-analysis data on lunar rock and soil material that were collected during the
Apollo moon missions, enabling lunar geologists to process this information by
comparing or evaluating them. This system answered 78 percent of the geologists’
questions correctly.
Gaizauskas and Hirschman (2001) believed that Lunar was capable of answering
questions such as “What is the average concentration of aluminium in high alkali
rocks?” or “How many Breccias contain Olivine?” by translating the question into a
database-query language; an example of such a translation is given in Monz (2003).
Additionally, Monz (2003) stated that Lunar can manage a sequence of questions such
as “Do any samples have aluminium greater than 12 percent?” followed by
“What are these samples?”
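The translation step can be illustrated with a minimal sketch in Java (the implementation language used elsewhere in this project). The pattern and the table and column names (samples, concentration, mineral, rock_type) are purely illustrative assumptions; they are not Lunar’s actual grammar or schema.

```java
// Sketch: pattern-based translation of a natural-language question into a
// database query, in the spirit of Lunar. The regular expression and the
// schema names below are illustrative assumptions only.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NlToQuery {
    static final Pattern AVG = Pattern.compile(
            "What is the average concentration of (\\w+) in (.+?)\\?");

    // Returns an SQL-like query for a recognised question form, or null.
    static String translate(String question) {
        Matcher m = AVG.matcher(question);
        if (m.matches()) {
            return "SELECT AVG(concentration) FROM samples "
                 + "WHERE mineral = '" + m.group(1) + "' "
                 + "AND rock_type = '" + m.group(2) + "'";
        }
        return null; // question form not recognised
    }
}
```

A real interface of this kind would, of course, need a far richer grammar; the sketch only shows the idea of mapping a fixed question form onto a structured query.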
Despite the significant success of the previous systems, they are restricted to a specific
domain and are associated with a database storing the domain knowledge. The results
of the database based QA systems are not directly comparable to the open domain
question-answering from an unstructured text.
The next generation of QA systems evolved to extract answers from the plain
machine-readable text that is found in a collection of documents or on the Web. Monz
(2003) believed that the need for these systems is a result of the increasing growth in
the Web, and the user’s demand to access the information in a simple and fast fashion.
The feasibility of these systems is due to the advancement in the natural language
processing techniques. As the QA systems are capable of answering questions about
different domains and topics, these systems are more complex than the database based
QA systems, as they need to analyze the data in addition to the question, in order to
find the correct answer.
The QUALM system was one of the early moves toward extracting answers from
text. As Pazzani (1983) revealed, this QA system was able to answer questions about
stories by consulting scripted knowledge constructed during the story understanding.
Ferret et al. (2001) added that it uses complex semantic information,
due to its dependency on the question taxonomy that is provided by the question type
classes. This approach was the inspiration for many QA systems (refer to Appendix A
for the 13 conceptual question categories used in Wendy Lehnert’s QUALM which
was taken from Burger et al. (2001)).
The Wendlandt and Driscoll system (Driscoll and Wendlandt, 1991) is considered to
be one of the early text based QA systems. This system uses National Aeronautics and
Space Administration (NASA) plain text documentation to answer questions about the
NASA Space Shuttle. Monz (2003) noted that it uses thematic roles and attributes
occurring in the question to identify the paragraph that contains the answer. Table 2.1
was taken from Monz (2003) to illustrate examples of the trigger words and the
corresponding thematic roles and attributes.
Murax is another example of this type of QA system. It extracts answers for general
fact questions by using an online version of Grolier’s Academic American
Encyclopaedia as a knowledge base. According to Greenwood (2005), it accesses the
documents that contain the answer via information retrieval and then analyses them to
extract the exact answer. As Monz (2003) explained, this is done by using question
categories that focus on types of questions likely to have short answers. Table 2.2 was
taken from Monz (2003) and illustrates these question categories and their possible
answer type:
More recently, systems have been developed that attempt to find the answer to a
question from very large document collections. These systems attempt to simulate
finding answers on the Web. Katz (2001) states that START (SynTactic Analysis
using Reversible Transformations) is one of the first systems that tried to answer
questions using a web interface. It has been available to users since 1993 on the World
Wide Web (WWW). According to Katz (2001), this system, however, focuses on
answering questions about many subjects, such as geography, movies, corporations
and weather. This requires time and effort in expanding its knowledge base.
On the other hand, MULDER was considered the first QA system to utilize the full
Web. It uses multiple search engine queries and natural language processing to
extract the answer. Figure 2.1 is a mock-up of MULDER based on an example seen
in Etzioni et al. (2001), as the original system is no longer available online:
[Figure 2.1: Mock-up of the MULDER interface (“The Truth is Out There”). For the
question “Who was the first American in space?”, MULDER returns ranked snippets,
e.g. from a “Man in Space” page (“…The first American in space was Alan B.
Shepard, May 5th 1961…”) and from a “Shepard” page (“On May 5, 1961, Shepard
became the first American in space…”).]
According to (WWW2), the general text based QA system architecture contains three
modules: question classification, document retrieval and answer extraction. The
question classification module is responsible for analysing and processing a question
to determine its type, and consequently the type of answer is also determined. The
document retrieval module is used to search through a set of documents in order to
return the documents that might contain the potential answer. Finally, the answer
extraction module identifies the candidate answers in order to extract the accurate
answer. It may apply a statistical approach, for example using the frequency of the
candidate answers in the document collection. An external database could also be
used to verify that the answer and its category are appropriate to the question
classification. For instance, if the question starts with the word ‘where’, its answer
should be a location, and this information is used in selecting the correct answer.
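The question classification and statistical answer extraction steps described above can be sketched in Java as follows. The wh-word-to-answer-type table is a deliberately simplified illustration, not the taxonomy of any particular system.

```java
// Sketch of two of the three modules: question classification by question
// word, and frequency-based answer extraction over candidate occurrences.
import java.util.*;

public class QaPipeline {
    // Simplified mapping from question word to expected answer type.
    static final Map<String, String> TYPES = Map.of(
            "who", "PERSON", "where", "LOCATION", "when", "DATE",
            "how many", "NUMBER", "what", "DEFINITION/OTHER");

    // Question classification: determine the expected answer type.
    static String answerType(String question) {
        String q = question.toLowerCase();
        if (q.startsWith("how many")) return TYPES.get("how many");
        return TYPES.getOrDefault(q.split("\\s+")[0], "UNKNOWN");
    }

    // Statistical answer extraction: choose the candidate answer that
    // occurs most frequently across the retrieved documents.
    static String mostFrequent(List<String> candidateOccurrences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : candidateOccurrences) counts.merge(c, 1, Integer::sum);
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }
}
```

For example, a question beginning with “where” would be typed LOCATION, and only location-like candidates would then be counted.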
This type of QA system is the focus of this project, as techniques for selecting the
correct answers for multiple choice questions from free text over the web are
investigated.
In the beginning, the QA systems were isolated from the outside world in the research
laboratories. Thereafter, they started to appear during the TREC competition.
(WWW3, 2005) states that when the QA track at TREC was started in 1999, “in each
track the task was defined such that the systems were to retrieve small snippets of text
that contained an answer for open-domain, closed-class questions (i.e., fact-based,
short-answer questions that can be drawn from any domain)”. For every track,
therefore, the participants would have a question set to answer and a corpus of
documents, which is a collection of text documents drawn from newswires and
articles, in which the answers to the questions are included in at least one of these
documents. These answers are evaluated against answer patterns, and an answer is
considered correct if it matches the answer pattern. TREC assesses the answers’
accuracy of the competing QA systems to judge their performance.
The first track, TREC-8 (1999), as Voorhees (2004) clarified, focused on answering
factoid questions, which are fact-based questions such as (taken from WWW5, 2005)
“Who is the author of the book, ‘The Iron Lady: A Biography of Margaret
Thatcher’?” The main task was to return a ranked list of five strings containing
the answer.
Sanka (2005) revealed that in TREC-9 (2000) the task was similar to TREC-8 but the
question set was increased from 200 to 693 questions. These questions were
developed by extracting them from a log of questions submitted to the Encarta system
(Microsoft) and questions derived from an Excite query log; whereas the questions in
TREC-8, were taken from a log of questions submitted to the FAQ Finder system.
Moreover, both TREC-8 and TREC-9 used the Mean Reciprocal Rank (MRR) score
to evaluate the answers.
As Voorhees (2004) mentioned, in the TREC 2001 QA track, however, the question
set included 500 questions obtained from the MSN Search logs and AskJeeves logs. It
also contained different tasks, the main task being similar to the previous tracks.
However, the other tasks were to return an unordered list for short answers and to test
the context task through a series of questions. As a result, these different tasks have
different evaluation metrics and reports.
Furthermore, Voorhees (2004) stated that the TREC 2002 QA track had two tasks: the
main task and the list task. In contrast to the previous years, the participants had to
return the exact answer for each question, as text snippets containing the answers were
no longer allowed. Accordingly, a new scoring metric called the confidence-weighted
score was used to evaluate the main task, and another evaluation metric was used for
the list task.
Voorhees (2004) also reported that the TREC 2003 QA track test set contained 413
questions taken from MSN and AOL Search logs. It included two tasks, that is, the
main task and the passage tasks. These tasks have different types of test question. The
main task has a set of list questions, definition questions and factoid questions. The
passage task, however, has a set of factoid questions for which the systems should
return a text snippet containing the answer. This track has different evaluation metrics because
it has different tasks and different types of question.
With regard to the TREC 2004 QA track, Sanka (2005) and Voorhees (2004) mentioned
that the question set was similar to TREC 2003, i.e. drawn from MSN and AOL Search
logs. It contained a series of questions (of several question types, including factoid
questions, list questions and other questions similar to the definition questions in
TREC 2003) asking for a specific target. This target might be a person, an
organization, an event or a thing. Table 2.3 illustrates a sample of a question set
(WWW6, 2005).
Also, TREC 2004 has a single task that is evaluated by the weighted average of the
three components (factoid, list and other).
Moreover, and according to Voorhees (2005), TREC 2005 QA track uses the same
type of question set used in the TREC 2004 (with the 75 series questions). In addition
to the main task, it has other tasks similar to TREC 2004, such as returning a ranked
list of the documents that contain the answers, and a relationship task to return evidence
for the relationship hypothesized in the question, or for the lack of this relationship. The
main task is evaluated by the weighted average of the three components (factoid, list
and other). The other tasks, however, have different evaluation measures.
This task is a vital stage, as it will assess the performance of the systems as well as the
accuracy of the techniques. Subsequently, it will be used to draw a confident
conclusion about the efficiency of their performance.
Joho (1999) stated that TREC started evaluating QA systems in 1999, as a result of
QA system developers’ need for an automated evaluation method that measures a
system’s improvement. Since then, it has included a QA track to evaluate
the open-domain QA system (that provides an exact answer for a question by using a
special corpus). This corpus includes a large collection of newspaper articles where
for each generated question there is at least one document containing the answer
(WWW3, 2005).
These measures will be used to evaluate the whole QA system. Some of them are also
used by the TREC, such as the Mean Reciprocal Rank (MRR).
The MRR evaluates a QA system that retrieves a ranked list of five answers for each
question; each question is scored as the reciprocal of the rank of its first correct
answer. According to Greenwood (2005), MRR is given by Equation 2.1 (where |Q|
is the number of questions and r_i is the answer rank for question i).
MRR = (1 / |Q|) × Σ_{i=1}^{|Q|} (1 / r_i)

Equation 2. 1 MRR
This means that if the second answer in the list is the correct answer for a question,
then the reciprocal rank will be ½. This is computed for every question, after which
the average score is calculated over all of the questions. Consequently, the higher the
value of the metric, the better the performance of the system.
There are also other measures which were specified by Joho (1999). For example,
Mani’s method applies three measures Answer Recall Lenient (ARL), Answer Recall
Strict (ARS) and Answer Recall Average (ARA). These are used to evaluate the
summaries found by a question responding to a task requiring a more detailed
judgment. ARL and ARS are defined in Equation 2.2 and Equation 2.3 (where n1
represents the number of correct answers, n2 is the number of partially correct
answers and n3 is the number of answered questions).
ARL = (n1 + 0.5 × n2) / n3

Equation 2. 2 Answer Recall Lenient (ARL)

ARS = n1 / n3

Equation 2. 3 Answer Recall Strict (ARS)
ARA is the average of ARL and ARS. This measure, however, is not used in this
project, as it needs a summary for the answer, which is not available in this case.
Moreover, it requires determining whether each answer is correct, partially correct or
missing.
The accuracy is defined in Equation 2.4, where |C| is the number of correctly
answered questions and |Q| is the total number of questions:

accuracy = (|C| / |Q|) × 100

Equation 2. 4 Accuracy
These metrics are used throughout the project, thus allowing the accuracy of the
techniques to be determined.
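The accuracy metric (Equation 2.4), together with ARL and ARS (Equations 2.2 and 2.3), can be computed as follows; the class and method names are illustrative only.

```java
// Evaluation metrics from Equations 2.2-2.4.
public class Metrics {
    // Accuracy (Equation 2.4): |C| correct answers out of |Q| questions,
    // expressed as a percentage.
    static double accuracy(int correct, int total) {
        return 100.0 * correct / total;
    }

    // Answer Recall Lenient (Equation 2.2): n1 fully correct answers,
    // n2 partially correct answers, n3 answered questions.
    static double arl(int n1, int n2, int n3) {
        return (n1 + 0.5 * n2) / n3;
    }

    // Answer Recall Strict (Equation 2.3): only fully correct answers count.
    static double ars(int n1, int n3) {
        return (double) n1 / n3;
    }
}
```

For instance, 45 correct answers out of 100 questions give an accuracy of 45 percent, the figure reported for the first manual experiment.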
Chapter 3: Data and Resources Used in Experiments
This section discusses in detail the requirements of the investigation, such as the data
set (question set and the knowledge source) and the tools and evaluation methods used
in this project.
The data set includes the question set and the corpus of knowledge or the knowledge
source. The question set is a set of multiple choice questions with four choices. The
knowledge source, however, is the source which the techniques refer to for finding the
correct answer amongst the four choices.
Throughout the investigative stage the questions were collected from an online quiz
resource, which is Fun Trivia.com (WWW1). This could be relied on, as it has more
than 1000 multiple choice questions and their correct solutions. Also, these questions
cover different domains such as entertainment, history, geography, sport, science/
nature and literature (refer to Appendix B for question examples for each category). A
set of 600 randomly chosen questions that were divided equally between the 6 domains
and their answers were collected for this project. The answers were withheld in the
evaluation stages.
These questions were found in HTML format with their correct answers in separate
HTML files, making the comparison between the techniques’ answers and the correct
answers more complicated. Hence, the 600 multiple choice questions were
reformatted manually into text files so that each question is followed by its correct
answer. Below is an example:
qID= 1
Television was first introduced to the general public at the 1939 Worlds Fair in
a. New York
b. Seattle
c. St. Louis
d. Montreal
answer: a
</q>
A tag was also added to mark the end of each question. This supports automating the
techniques, as the program will know where each question begins and ends.
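A simple parser for this file format might look as follows in Java; the class and method names are illustrative, not those of the project’s actual implementation.

```java
// Sketch: parse the reformatted question files described above. Each
// question consists of a "qID=" line, the question text, choices "a." to
// "d.", an "answer:" line and a closing </q> tag.
import java.util.*;

public class QuestionParser {
    // One parsed multiple choice question.
    static class Question {
        int id;
        String text;
        Map<Character, String> choices = new LinkedHashMap<>();
        char answer;
    }

    static List<Question> parse(String contents) {
        List<Question> out = new ArrayList<>();
        Question q = new Question();
        StringBuilder text = new StringBuilder();
        for (String line : contents.split("\n")) {
            line = line.trim();
            if (line.startsWith("qID=")) {
                q.id = Integer.parseInt(line.substring(4).trim());
            } else if (line.matches("[a-d]\\..*")) {
                q.choices.put(line.charAt(0), line.substring(2).trim());
            } else if (line.startsWith("answer:")) {
                q.answer = line.substring(7).trim().charAt(0);
            } else if (line.equals("</q>")) {        // end-of-question tag
                q.text = text.toString().trim();
                out.add(q);
                q = new Question();
                text.setLength(0);
            } else if (!line.isEmpty()) {
                text.append(line).append(' ');       // question text line
            }
        }
        return out;
    }
}
```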
The knowledge source was derived from the World Wide Web, as these investigated
techniques involved the use of the Google Search Engine in deciding which choice is
the correct answer for multiple choice questions.
Google is currently one of the most popular and largest search engines on the
Internet. It was chosen for two main reasons (mentioned in Etzioni et al. (2001)):
first, it has the widest coverage, having indexed more than one billion web pages;
secondly, it ranks pages so that those with the highest information value appear
first.
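The core selection technique, combining the question with each choice and keeping the choice whose query yields the most hits, can be sketched as below. The hitCount function is a stand-in for a real search-engine lookup; the actual Google query mechanism is not part of this sketch.

```java
// Sketch of hit-count based answer selection: form one query per choice
// and keep the choice whose query returns the most hits. The hit counter
// is injected so that any search backend (or a test stub) can be used.
import java.util.function.ToLongFunction;

public class HitCountSelector {
    static String selectAnswer(String question, String[] choices,
                               ToLongFunction<String> hitCount) {
        String best = null;
        long bestHits = -1;
        for (String choice : choices) {
            long hits = hitCount.applyAsLong(question + " " + choice);
            if (hits > bestHits) {   // keep the highest-scoring choice
                bestHits = hits;
                best = choice;
            }
        }
        return best;
    }
}
```

In the real experiments the injected function would issue a Google query and read off the reported number of results.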
3.2 Evaluation
The techniques are applied automatically over a large set of questions without prior knowledge of the answers. After comparing the technique's answers with the correct ones, as described in section 5.1.2, the precision and accuracy of the technique are determined through the aforementioned accuracy metric.
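As a reminder of the metric, accuracy here is simply the percentage of questions whose selected choice matches the withheld correct answer. A minimal sketch (names are illustrative):

```java
// Minimal sketch of the accuracy metric: the percentage of questions for
// which the technique's choice matches the withheld correct answer.
public class AccuracyMetric {

    public static double accuracy(char[] chosen, char[] correct) {
        int hits = 0;
        for (int i = 0; i < chosen.length; i++) {
            if (chosen[i] == correct[i]) {
                hits++;
            }
        }
        return 100.0 * hits / chosen.length;
    }

    public static void main(String[] args) {
        // 3 of 4 answers match, giving 75 percent accuracy.
        System.out.println(accuracy(new char[]{'a', 'b', 'c', 'd'},
                                    new char[]{'a', 'b', 'c', 'a'}));  // prints 75.0
    }
}
```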
3.3 Tools
The requirements of the project, from the data set through to the evaluation, have now been explored, so the resources needed to complete the project can be specified. An appropriate programming language had to be chosen; Java was selected for its flexibility and familiarity.
Additionally, another program was used for further experiments: Robust Accurate Statistical Parsing (RASP). RASP is a parsing toolkit developed in cooperation between the Department of Informatics at Sussex University and the Computer Laboratory at Cambridge University. First released in January 2002, it runs input text through several processes: tokenisation, tagging, lemmatisation and parsing (Briscoe and Carroll, 2002; Briscoe and Carroll, 2003; WWW7).
Chapter 4: Design and Implementation
The aim of this project is to investigate techniques for finding the correct answer to multiple choice questions, using the Google Search Engine. This chapter explains the design of the manual experiments and the implementation of the automated techniques.
The initial experiments were performed manually and were then automated so that precision and accuracy could be measured over larger sets of questions. In these initial investigations the question set was small (20 questions), consisting of factual questions about famous individuals and historical subjects. In the first experiment, each question is combined with each of its choices and passed to Google manually. The choices are then ranked by the number of search results. For instance, for the question:
The Roanoke Colony of 1587 became known as the 'Lost Colony'. Who was the
governor of this colony?
a. John Smith
b. Sir Walter Raleigh
c. John White
d. John White
The question is first combined with choice (a) and sent to Google as follows:
The Roanoke Colony of 1587 became known as the 'Lost Colony'. Who was the
governor of this colony John Smith
This process is repeated with each choice. The choice with the highest number of search result hits is then selected; if several choices have the same number of hits, the one appearing first in the question is chosen. Table 4.1 illustrates the results of the first experiment. For question 7, the number of search result hits was equal for all four choices, so the first choice was selected as the possible answer. There were 9 correct answers in this experiment, giving an accuracy of 45 percent.
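The selection rule of this first experiment can be sketched as follows. Here the hit counts are supplied directly, whereas in the experiment they came from Google, and the class name is illustrative.

```java
// Sketch of the first experiment's selection rule: pick the choice with the
// highest number of search result hits; on a tie, keep the choice that
// appears earlier in the question.
public class HitCountSelector {

    // hits[i] holds the number of results for the question combined with
    // choice i (0 = a, 1 = b, ...); returns the index of the chosen answer.
    public static int select(long[] hits) {
        int best = 0;
        for (int i = 1; i < hits.length; i++) {
            if (hits[i] > hits[best]) {  // strict '>' keeps earlier choices on ties
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(select(new long[]{120, 5400, 80, 310}));  // prints 1 (choice b)
        System.out.println(select(new long[]{7, 7, 7, 7}));          // prints 0 (tie: choice a)
    }
}
```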
After this experiment, the technique was implemented and a larger number of questions was posed to Google to obtain more reliable precision and accuracy figures. The implementation used Java as the programming language and is described in detail in section 4.2.
In the second experiment, performance was improved by restructuring the form of the query. The same sample used in the first experiment was taken, and the 11 questions that had been answered incorrectly were selected. These questions were modified by using important keywords from the question and combining them with each choice, as follows:
The previous question had a wrong answer in the first experiment; therefore the query
form was restructured and combined with each choice as follows:
“In 1594” Shakespeare became an actor and playwright in “(one of the choices)”
This query was combined with choice (a) and sent to Google as follows:
“In 1594” Shakespeare became an actor and playwright in “The King's Men”
This was then repeated with each choice. In the previous example, keywords in the question were chosen to improve the accuracy of the answer. The year was put in quotation marks, as follows: “In 1594”. It was noticed that the quotation marks enhanced the search results. The question word and the noun following the question word were replaced with a quoted choice as the possible answer.
This question was restructured by replacing “What astronomer” with a quoted choice
and the year was also put in quotation marks, “in 1543”, as follows:
“(one of the choices)” published his theory of a sun- centered universe “in 1543”
Question 6: This Russian ruler was the first to be crowned Czar (Tsar) when he
defeated the boyars (influential families) in 1547. Who was he?
a. Nicholas II
b. Ivan IV (the Terrible)
c. Peter the Great
d. Alexander I
In question 6 “This Russian ruler” was replaced with a quoted choice and some
important words such as “first”, “Czar” and “defeated the boyars” were used.
Moreover, the year was put in quotation marks; hence the query is as follows:
“(one of the choices)” was first Czar defeated the boyars “in 1547”
For the remaining questions, please refer to Appendix C. As a result of the previous changes, 19 questions out of 20 were answered correctly, as highlighted in Table 4.2. Therefore, the changes improved the accuracy of the results to 95 percent.
During these manual experiments, some special and interesting cases about the
structuring of the questions were found. For example, when the choice is the complete
name (first name and surname) for a famous person it was found that it worked better
to combine the query with the surname, as the first name is rarely used (as can be seen
in question 15 in Appendix C). Moreover, it was noticed that the accuracy of the answers could be enhanced in several ways: by restructuring the form of the query with quotation marks (questions 4, 5, 6, 7, 9, 14, 17 and 20 in Appendix C); by selecting the main keywords in the query (questions 6, 7, 11 and 15 in Appendix C); and by replacing the question words (What, Who, Which and Where) and the noun following the question word with one of the choices (questions 4, 5, 7, 9, 11, 15 and 20 in Appendix C). These changes were made manually, and some of them were then applied by automated means, as explained in section 4.2.
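Two of the restructuring rules described above, replacing the question phrase with a quoted choice and quoting the year, can be sketched as simple string substitutions. This is a hypothetical illustration, not the project's actual implementation.

```java
// Illustrative sketch of two query restructuring rules: the question word
// plus its following noun is replaced by the quoted choice, and the year
// phrase is wrapped in quotation marks.
public class QueryRestructurer {

    public static String restructure(String question, String questionPhrase,
                                     String yearPhrase, String choice) {
        return question
                .replace(questionPhrase, "\"" + choice + "\"")
                .replace(yearPhrase, "\"" + yearPhrase + "\"");
    }

    public static void main(String[] args) {
        String q = "What astronomer published his theory of a sun-centered universe in 1543";
        System.out.println(restructure(q, "What astronomer", "in 1543", "Copernicus"));
        // "Copernicus" published his theory of a sun-centered universe "in 1543"
    }
}
```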
In the first automated experiment, the data set contained 180 questions from the different domains, 30 questions per category. At this stage, the multiple choice questions were collected in a text file and reformatted, as explained in section 3.1.1. A program named QueryPost was written in Java, using a Google class written by Dr. Mark Greenwood. The program combines the question with each choice and passes the result to Google to retrieve the search results; the program code is shown in Appendix D. The choice with the highest number of search result hits is selected as the answer to the question. Two text files are generated. The first is the main file, containing the number of hits for the query and for the query combined with each choice, together with the correct choice and the technique's choice (see Appendix E for a sample of the output). The second file records each action the program undertakes and can be used as a reference source when needed.
In the second experiment, the data set was larger than in the previous experiment: it contained 600 questions, 100 per category. These questions were sent to Google, again using the QueryPost program. After this experiment, another manual experiment was undertaken, as previously mentioned in section 4.1, to investigate different methods for finding the correct answer to multiple choice questions.
The handling of answers with and without quotation marks can be seen in the program code in Appendix F. Three text files are generated. The two main files, one per technique, contain the number of search result hits for the query and for the query combined with each choice, together with the correct choice and the technique's choice. The third file records each action the program undertakes and can be used as a reference source when needed. This program was run on the same question set used in the second experiment.
different types of nouns having different tags, as can be seen in Appendix J. These tags were manually normalised to a single tag (_NN1), making them easier for the NounRemove program to recognise. The NounRemove program was written to delete the noun words from the queries, producing a text file as shown in Appendix K. This text file is sent to another program, QueryPost3, which combines each choice with the question using the two techniques (the answer in quotation marks and the answer without quotation marks) and finally passes the queries to Google to retrieve the search results.
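Assuming RASP output where each token carries a suffix tag (e.g. ruler_NN1) and all noun tags have been normalised to _NN1 as described, the noun removal step might look like the following sketch. The names are illustrative, not the actual NounRemove code.

```java
// Sketch of the noun removal step: drop every token tagged _NN1 and strip
// the remaining tags, leaving a query without its noun words.
public class NounRemoveSketch {

    public static String removeNouns(String taggedQuery) {
        StringBuilder out = new StringBuilder();
        for (String token : taggedQuery.split("\\s+")) {
            if (token.endsWith("_NN1")) {
                continue;  // skip noun words entirely
            }
            int underscore = token.lastIndexOf('_');
            out.append(underscore >= 0 ? token.substring(0, underscore) : token)
               .append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(removeNouns(
                "This_DD1 Russian_JJ ruler_NN1 was_VBDZ first_MD crowned_VVN Czar_NN1"));
        // prints: This Russian was first crowned
    }
}
```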
Chapter 5: Results and Discussion
In this chapter, the results of the experiments are presented in two sections: one for the manual results and one for the automated results. An analysis and discussion of the findings is then given, which in turn determines the success or failure of the project's goals. Suggestions are also given for further work that could improve on the project's accomplishments.
Two manual experiments were performed using a small set of multiple choice questions (20 questions). In the first experiment, each question was combined with its choices and sent to Google as a query. As a result, 9 questions out of 20 were answered correctly, giving an accuracy of 45 percent, as can be seen in Table 5.1 and Figure 5.1. Since this is a promising result, it was decided to automate the technique so that it could be applied to a larger set of questions. To improve the technique further, another manual experiment was undertaken, in which the query forms were restructured by using important keywords, adding quotation marks and replacing the question words, as explained in detail in section 4.1. According to Table 5.1, 19 questions out of 20 were then answered correctly, which increased the accuracy to 95 percent, as shown in Figure 5.1.
Figure 5.1 The improvement of the accuracy for the First and the Second Manual Experiments
In the automated experiments, the manual techniques were applied using programs written for this purpose. In the first automated experiment, a small data set containing 180 questions from the different categories was used, as previously mentioned in section 4.2. The questions were read and sent to Google by the QueryPost program to retrieve the search results, and the program then chose an answer from the four choices for each question. The results of this experiment are illustrated in Table 5.2, which shows the number of correct answers and the accuracy for every category, as well as the overall accuracy for the whole question set.
[Figure 5.2: bar chart of accuracy for each question category]
According to Figure 5.2, the highest accuracy percentage was for the sports category,
as it achieved 66.7 percent. The entertainment category, however, was lower, as it
only achieved 23.3 percent. In addition, the accuracy was higher than 50 percent in
both the literature and the geography categories. It decreased, however, to 36.7
percent in the science/nature category. Furthermore, a decrease in accuracy was also
evident in the history category.
In the same manner, and as previously detailed in section 4.2, the second automated experiment was undertaken by running the QueryPost program on a larger set of 600 multiple choice questions (for more reliable results). These results are shown in Table 5.3, which gives the number of correct answers and the accuracy for every category, in addition to the overall accuracy for the whole question set.
The overall accuracy in Table 5.3 changes little compared to that of the first automated experiment; hence the size of the question set does not greatly affect the accuracy.
[Figure 5.3: bar chart of accuracy for each question category]
From Figure 5.3 it can be seen that the results of the second experiment are similar to those of the first: the three categories literature, geography and sport still have the highest accuracy. This time, however, the highest accuracy was in the literature category, at 67 percent, while the sport category dropped to 45 percent. The geography category's accuracy increased to 62 percent, but the lowest accuracy was still in the entertainment category, at 26 percent, below the history category's 35 percent.
The third experiment was designed in light of the findings of the second manual experiment, where it was found that using quotation marks may improve the accuracy of the answers. This technique was therefore applied to the same question set used in the second automated experiment, using the QueryPost2 program. The results are summarised in Table 5.4, which gives the number of correct answers and the accuracy for every category using quotation marks, as well as the overall accuracy for this experiment.
Figure 5.4 The accuracy for each category using quotation marks in the Third Automated Experiment
According to Figure 5.4, literature still has the highest accuracy, at 75 percent, followed by the geography category at 62 percent. The accuracy dropped to 48 percent in sport, though this is high in comparison to history, at 40 percent. Entertainment still remains at the lowest level of accuracy, with 30 percent, while the accuracy in the science/nature category increased to 39 percent.
Figure 5.5 The accuracy for each category with and without quotation marks
Figure 5.5 summarises the results of the second and third experiments, showing the variation in accuracy across all the categories between the two techniques, with and without quotation marks. It can be concluded that finding the correct answer is easier for literature and geography questions than for entertainment questions, which were the most difficult. It was also found that using quotation marks improves the selection of the correct answer in all categories by at least 1 percent, with the exception of the geography category, whose percentage did not change. Overall, the accuracy with quotation marks was 49 percent, higher than the 45.8 percent obtained without them; that is, using quotation marks performs better.
The fourth experiment built on the findings of the second manual experiment: replacing the question words (Who, Where, Which, What/This) in the queries with the choices, in addition to using quotation marks, could improve the selection of the correct answer. Accordingly, several programs were used. The final results from the QueryPost3 program are illustrated in Table 5.5.
Figure 5.6 The accuracy for each question type with and without quotation marks
As shown in Figure 5.6, the highest accuracy occurs for the ‘Who’ questions, both with and without quotation marks, at 54.2 percent and 48.6 percent respectively. This is because the answer to such a question is a person's name, which can be found more easily by quoting the whole name. The accuracy decreased slightly for the ‘Which’ questions: 48.6 percent with quotation marks and 45.7 percent without. The accuracy declined to its lowest rate for the ‘Where’ questions, at 42.9 percent with quotation marks and 38.1 percent without; this could be a result of the small number of questions of this type (21 questions). It increased for the ‘What’/‘This’ questions, to 43.9 percent with quotation marks and 41.4 percent without. Thus, the best performance was for the ‘Who’ and ‘Which’ questions and the worst for the ‘Where’ questions.
Furthermore, the results were analysed according to the question categories and the question types, as can be seen in Table 5.6:
Figure 5.7 The accuracy for every question type in each category with and without using quotation marks
In Figure 5.7, all the applied techniques, such as using quotation marks and replacing the question words with the choices, are summarised for every question category. The highest performance was in the literature and geography categories, which achieved 74 percent and 63 percent accuracy using both of the aforementioned techniques. In contrast, the lowest performance was for the entertainment questions, with 28 percent accuracy. This variation in accuracy between the question categories reflects the type of information available on the WWW. According to these experiments, the WWW contains more information about literature and geography and less about entertainment, which leaves these techniques unable to answer some entertainment questions correctly.
During the manual experiments, it was found that choosing query keywords such as the date or the year, and using only the surname rather than the whole name of a person, may enhance the performance of the QA system in finding the correct answer. Applying these ideas automatically, however, would require more time and more knowledge of natural language processing.
Chapter 6: Conclusions
Knowledge and information, and their uses, have recently expanded and spread, and information has become ever more accessible through the World Wide Web. Increased access to the web, and the growth in the number of users with diverse backgrounds and objectives, require more accurate and effective search over the web. Consequently, many techniques have been developed in response to the need to provide reliable answers to users' searches.
Manual experiments were initially performed on a small set of questions (20 questions). Each question was combined with its choices and sent to Google to find the hit rate for each choice, and the choice with the highest hit rate was selected as the answer. As a result, 45 percent accuracy was achieved. When the previous method was implemented automatically on a larger data set (600 questions), categorised by question domain into 6 categories (entertainment, geography, history, literature, science/nature and sport), almost the same accuracy was obtained, at 45.8 percent, a slight increase.
The overall accuracy of these techniques did not match that of some other QA systems, which achieved more than 70 percent (Voorhees, 2004). It was, however, better than most of the systems that participated in the TREC 2005 QA track: the investigated techniques achieved from 48.5 percent to 49 percent accuracy, higher than the median in that track of 15.2 percent, according to Table 6.1. This table, taken from Greenwood (2006), shows the accuracy of the participating systems in the TREC 2005 QA track.
A direct comparison between our techniques and these systems is not entirely fair, however, because the questions in our question set were much longer than those in the TREC 2005 QA track data set, and those systems are the result of research projects with more human resources, larger funding and far more development time. In contrast, our techniques are not time consuming, require less programming, and still perform better than most of the other systems.
Overall, this project demonstrated that rather simple techniques can be applied to obtain respectable results. The project does not fully solve the problem, yet the results represent an improvement over the baseline in comparison to other systems.
Bibliography
Breck, E. J., Burger, J. D., Ferro, L., Hirschman, L., House, D., Light, M. and Mani, I.
(2000) How to Evaluate Your Question Answering System Every Day and Still Get
Real Work Done. Proceedings of LREC-2000, Second International Conference on
Language Resources and Evaluation. Greece: Athens. [online]. Available: http://www.cs.cornell.edu/~ebreck/publications/docs/breck2000.pdf [1/5/2006].
Burger, J., Cardie, C., Chaudhri, V., Gaizauskas, R., Israel, D., Jacquemin, C., Lin, C.-
Y., Maiorano, S., Miller, G., Moldovan, D., Ogden, B., Prager, J., Riloff, E., Singhal,
A., Shrihari, R., Strzalkowski, T., Voorhees, E. and Weischedel, R. (2001) Issues,
Tasks and Program Structures to Roadmap Research in Question & Answering
(Q&A). [online]. Available: http://www-nlpir.nist.gov/projects/duc/papers/qa.Roadmap-paper_v2.doc [8/5/2006].
Etzioni, O., Kwok, C. and Weld, D.S. (2001) Scaling Question Answering to the Web.
Proc WWW10. Hong Kong. [online]. Available: http://www.iccs.inf.ed.ac.uk/~s0239548/qa-group/papers/kwok.2001.www.pdf [1/5/2006].
Ferret, O., Grau, B., Hurault-Plantet, M., Illouz, G. and Jacquemin, C. (2001)
Document Selection Refinement Based on Linguistic Features for QALC, a Question
Answering System. In Proceedings of RANLP–2001. Bulgaria: Tzigov Chark. [online].
Available: http://www.limsi.fr/Individu/mhp/QA/ranlp2001.pdf [2/8/2006].
Gabbay, I. (2004) Retrieving Definitions from Scientific Text in the Salmon Fish
Domain by Lexical Pattern Matching. MA, University of Limerick. [online].
Available: http://etdindividuals.dlib.vt.edu:9090/archive/00000114/01/chapter2.pdf [2/8/2006].
Katz, B., Lin, J. and Felshin, S. (2001) Gathering Knowledge for a Question
Answering System from Heterogeneous Information Sources. In Proceedings of the
ACL 2001 Workshop on Human Language Technology and Knowledge Management.
France: Toulouse. [online]. Available: http://groups.csail.mit.edu/infolab/publications/
Katz-etal-ACL01.pdf [1/5/2006].
Korfhage, R. R. (1997) Information Storage and Retrieval. Canada: John Wiley &
Sons, Inc.
Liddy, E. D. (2001) When You Want an Answer, Why Settle for a List? Center for Natural Language Processing, School of Information Studies, Syracuse University. [online]. Available: http://www.cnlp.org/presentations/slides/When_You_Want_Answer.pdf [2/8/2006].
WWW7: Robust Accurate Statistical Parsing (RASP) (No Date). [online]. Available:
http://www.informatics.susx.ac.uk/research/nlp/rasp/ [20/7/2006]
Appendix A
The 13 conceptual question categories used in Wendy Lehnert’s QUALM
Appendix B
qID= 2
Who was the first U.S. President to appear on television
a. Franklin Roosevelt
b. Theodore Roosevelt
c. Calvin Coolidge
d. Herbert Hoover
answer: a
</q>
qID= 5
Who was the first English explorer to land in Australia
a. James Cook
b. Matthew Flinders
c. Dirk Hartog
d. William Dampier
answer: d
</q>
qID= 1
What is the capital of Uruguay
a. Assuncion
b. Santiago
c. La Paz
d. Montevideo
answer: d
</q>
qID= 1
Who wrote 'The Canterbury Tales'
a. William Shakespeare
b. William Faulkner
c. Christopher Marlowe
d. Geoffrey Chaucer
answer: d
</q>
qID= 1
Which of these programming languages was invented at Bell Labs
a. COBOL
b. C
c. FORTRAN
d. BASIC
answer: b
</q>
qID= 1
Which of the following newspapers sponsored a major sports stadium/arena
a. Chicago Tribune
b. Arizona Republic
c. Boston Globe
d. St Petersburg Times
answer: d
</q>
Appendix C
These are the questions for which we failed to get the right answers in the first manual experiment. In the second manual experiment we therefore modified them, using important keywords from each question and combining them with each choice, as follows:
The previous question had a wrong answer in the first experiment; therefore the query
form was restructured and combined with each choice as follows:
“In 1594” Shakespeare became an actor and playwright in “(one of the choices)”
This query was combined with choice (a) and sent to Google as follows:
“In 1594” Shakespeare became an actor and playwright in “The King's Men”
This was repeated with each choice. In the previous example, keywords in the question were chosen to improve the accuracy of the answer. The year was put in quotation marks, as follows: “In 1594”. It was noticed that the quotation marks enhanced the search results. The question word and the noun following the question word were replaced with a quoted choice as the possible answer, as follows:
This question was restructured by replacing “What astronomer” with a quoted choice
and the year was also put in quotation marks, “in 1543”, as follows:
“(one of the choices)” published his theory of a sun- centered universe “in 1543”
Question 6: This Russian ruler was the first to be crowned Czar (Tsar) when he
defeated the boyars (influential families) in 1547. Who was he?
e. Nicholas II
f. Ivan IV (the Terrible)
g. Peter the Great
h. Alexander I
In question 6 “This Russian ruler” was replaced with a quoted choice and some
important words such as “first”, “Czar” and “defeated the boyars” were used.
Moreover, the year was put in quotation marks; hence the query is as follows:
“(one of the choices)” was first Czar defeated the boyars “in 1547”
Question 7: Sir Thomas More, Chancellor of England from 1529-1532, was beheaded
by what King for refusing to acknowledge the King as the head of the newly formed
Church of England?
a. James I
b. Henry VI
c. Henry VIII
d. Edward VI
The previous question was modified by using some keywords such as “Sir Thomas
More” and “was beheaded by”; also, “what King” was replaced with a quoted choice
as follows:
In this question, “Which” was replaced with a quoted choice followed by “dynasty” as
follows:
“(one of the choices) dynasty” was in power throughout the 1500's in China
Question 11: Ponce de Leon, 'discoverer' of Florida (1513), was also Governor of
what Carribean island?
a. Cuba
b. Puerto Rico
c. Virgin Islands
d. Bahamas
In question 11, some keywords were chosen, such as “Ponce de Leon”, “was” and “Governor of”, and “what Carribean island” was replaced with a choice as follows:
Question 14: Who did Queen Elizabeth feel threatened by and had executed in 1587?
a. Jane Seymour
b. Margaret Tudor
c. Mary, Queen of Scots
d. James I
In the previous question, however, “Who did” was deleted and other words, such as “Queen Elizabeth”, were quoted. Also, “threatened by” was combined with a choice and the phrase was quoted as follows:
“Queen Elizabeth” feel “threatened by (one of the choices)” and had executed in
1587
The previous question was modified by replacing “What 16th century Cartographer and Mathematician” with the choice, or with the surname of each choice instead of the person's whole name, and by quoting other important words such as “projection map”, “two dimensions” and “latitude and longitude”, as follows:
(one of the choices) developed “projection map” representing the world in “two dimensions” using “latitude and longitude”
Question 16: Which group controlled North Africa throughout most of the 16th
century?
a. French
b. Spanish
c. Egyptians
d. Turks
This was the only question for which the right answer could not be obtained. The query forms were modified several times by replacing “Which” and “Which group” with one of the choices. In addition, the search operators OR and AND were used, and a noun was added for each choice, since the choices were adjectives. Each form was also tried with and without quotation marks, as follows:
• (one of the choices) AND (the noun of the choices) group controlled
North Africa throughout most of the 16th century
• (one of the choices) OR (the noun of the choices) group controlled
North Africa throughout most of the 16th century
• “(one of the choices) AND (the noun of the choices)” group
controlled North Africa throughout most of the 16th century
• “(one of the choices) OR (the noun of the choices)” group controlled
North Africa throughout most of the 16th century
• (one of the choices) AND (the noun of the choices) controlled North
Africa throughout most of the 16th century
• (one of the choices) OR (the noun of the choices) controlled North
Africa throughout most of the 16th century
• “(one of the choices) AND (the noun of the choices)” controlled
North Africa throughout most of the 16th century
• “(one of the choices) OR (the noun of the choices)” controlled North
Africa throughout most of the 16th century
Question 17: Martin Luther started the Protestant Reformation in 1517 by criticizing
the Roman Catholic Church. Where did he nail his 95 Theses?
a. Sistine Chapel
b. Wittenberg Cathedral
c. St. Paul's Cathedral
d. Louvre
Martin Luther nail his "95 Theses in the (one of the choices)"
Question 20: What association, formed in the late 16th century, comprised of five
Native American tribes?
a. Hopi League
b. Native American Council
c. Sioux Confederation
d. Iroquois League
This question was modified by replacing “What association, formed in the late 16th
century” with a quoted choice as follows:
Appendix D
QueryPost program code:
import java.io.*;
import java.util.*;
import javax.swing.*;
outFile.write("qID********hits********a********b********c********d********
CorrectAnswer********answer");
outFile.newLine();
line=inFile.readLine();
while(line!=null)
{
while(!line.startsWith("</q>")){
if(line.startsWith("qID=")){
qID=st.nextToken();
}
query[0]=inFile.readLine();
}
if (line.startsWith("a.")){
choice[0]=line.substring(3);
}
if(line.startsWith("b.")){
choice[1]=line.substring(3);
}
if(line.startsWith("c.")){
choice[2]=line.substring(3);
}
if(line.startsWith("d.")){
choice[3]=line.substring(3);
}
if (line.startsWith("answer:")){
StringTokenizer st=new StringTokenizer(line);
if(st.nextToken().equals("answer:")){
ans=st.nextToken();
}
}
line=inFile.readLine();
}
line=inFile.readLine();
for(int i=0;i<4;i++)
{
query[i+1]=query[0]+" "+choice[i];
}
for(int i=0;i<res.length;i++)
res[i]=0;
for(int i=0;i<query.length;i++)
{
res[i]= g.getNumberHits(query[i]);
System.out.println("Google Search Results for: "+query[i]+" "+res[i]+" documents");
outFile2.write("Google Search Results for: "+query[i]+" "+res[i]+" documents");
outFile2.newLine();
outFile2.write("======================");
outFile2.newLine();
}
if (res[0] == -1)
{
System.out.println(" No hit for this question, so there is not an answer");
outFile2.write(" No hit for this question, so there is not an answer");
outFile2.newLine();
}
else
{
answer=res[1];
System.out.println(" answer = first result"+res[1]);
if (res[1]<res[2])
{
answer= res[2];
System.out.println(" res[1]"+res[1]+" is less than res[2]"+res[2]);
if (res[2]<res[3])
{
answer= res[3];
System.out.println(" res[2]"+res[2]+" is less than res[3]"+res[3]);
if (res[3]<res[4])
{
answer= res[4];
System.out.println(" res[3]"+res[3]+" is less than res[4]"+res[4]);
System.out.println(" The answer for this question is choice d ");
tans="d";
}
else
{
System.out.println(" The answer for this question is choice c ");
tans="c";
}
}
else if (res[2]<res[4])
{
answer= res[4];
System.out.println(" res[2]"+res[2]+" is less than res[4]"+res[4]);
}
else
{
System.out.println(" The answer for this question is choice b ");
tans="b";
}
}
else if (res[1]<res[3])
{
answer= res[3];
System.out.println(" res[1]"+res[1]+" is less than res[3]"+res[3]);
if (res[3]<res[4])
{
answer= res[4];
System.out.println(" res[3]"+res[3]+" is less than res[4]"+res[4]);
}
else if (res[1]<res[4])
{
answer= res[4];
System.out.println(" res[1]"+res[1]+" is less than res[4]"+res[4]);
System.out.println(" The answer for this question is choice d ");
tans="d";
}
else
{
System.out.println(" The answer for this question is choice a ");
tans="a";
}
}
outFile.write(qID+"********"+res[0]+"********"+res[1]+"********"+res[2]+"********"+res[3]+"********"+res[4]+"********"+ans+"********"+tans);
outFile.newLine();
}
outFile.close();
outFile2.close();
}
}
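The nested comparisons above select the choice whose query returns the most hits. The same selection can be written more compactly as a single maximum search over `res[1]` to `res[4]`; this is a sketch of the equivalent logic, not part of the original program.

```java
public class ArgMax {
    // Returns the letter (a-d) of the choice with the highest hit count.
    // As in the listing above, res[0] holds the hits for the question alone
    // and res[1..4] hold the hits for choices a..d. On a tie the earlier
    // choice wins, matching the strict "<" comparisons in the original.
    public static String pick(double[] res) {
        int best = 1;
        for (int i = 2; i <= 4; i++) {
            if (res[i] > res[best]) best = i;
        }
        return String.valueOf((char) ('a' + best - 1));
    }
}
```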
Appendix E
An output sample of the QueryPost program for the sport questions:
qID********hits********a********b********c********d********CorrectAnswer********answer
1********282.0********23.0********15.0********44.0********64.0********d********d
2********8.0********-1.0********-1.0********-1.0********-1.0********c********a
3********54.0********41.0********35.0********52.0********28.0********c********c
4********33.0********30.0********32.0********31.0********22.0********b********b
5********16100.0********10900.0********27600.0********44100.0********34500.0********c********c
6********3.0********-1.0********-1.0********-1.0********-1.0********d********a
7********366.0********291.0********360.0********307.0********127.0********b********b
8********-1.0********-1.0********-1.0********-1.0********-1.0********b********b
9********12200.0********886.0********410.0********845.0********10900.0********c********d
10********93.0********89.0********47.0********81.0********34.0********a********a
11********85100.0********26800.0********23800.0********21600.0********24400.0********d********a
12********114000.0********96700.0********27800.0********45600.0********50200.0********a********a
13********28700.0********862.0********628.0********529.0********-1.0********a********a
14********15200.0********596.0********548.0********809.0********654.0********b********c
15********5.0********-1.0********-1.0********-1.0********-1.0********c********a
16********24400.0********785.0********770.0********927.0********910.0********c********c
17********540000.0********70600.0********67300.0********72500.0********62900.0********d********c
18********628000.0********11.0********110000.0********137000.0********183000.0********d********d
19********20000.0********943.0********15200.0********18500.0********13200.0********c********c
20********15800.0********107.0********331.0********891.0********469.0********d********c
Appendix F
QueryPost2 program code:
import java.io.*;
import java.util.*;
import javax.swing.*;
outFile.write("qID********hits********a********b********c********d********CorrectAnswer********answer");
outFile1.write("qID********hits********a********b********c********d********CorrectAnswer********answer");
outFile.newLine();
outFile1.newLine();
line=inFile.readLine();
while(line!=null)
{
while(!line.startsWith("</q>")){
if(line.startsWith("qID=")){
if (line.startsWith("a.")){
choice[0]=line.substring(3);
}
if(line.startsWith("b.")){
choice[1]=line.substring(3);
}
if(line.startsWith("c.")){
choice[2]=line.substring(3);
}
if(line.startsWith("d.")){
choice[3]=line.substring(3);
}
if (line.startsWith("answer:")){
StringTokenizer st=new StringTokenizer(line);
if(st.nextToken().equals("answer:")){
ans=st.nextToken();
}
}
line=inFile.readLine();
}
line=inFile.readLine();
for(int i=0;i<4;i++)
{
query[i+1]=query[0]+" "+choice[i];
}
for(int i=0;i<4;i++)
{
query[i+5]=query[0]+" "+"\""+choice[i]+"\"";
}
for(int i=0;i<res.length;i++)
res[i]=0;
for(int i=0;i<query.length;i++)
{
res[i]= g.getNumberHits(query[i]);
qanswer=res[5];
System.out.println(" answer = first result"+res[5]);
qtans="a";
if (res[5]<res[6])
{
qanswer= res[6];
System.out.println(" res[1]"+res[5]+" is less than res[2]"+res[6]);
if (res[6]<res[7])
{
qanswer= res[7];
System.out.println(" res[2]"+res[6]+" is less than res[3]"+res[7]);
if (res[7]<res[8])
{
qanswer= res[8];
System.out.println(" res[3]"+res[7]+" is less than res[4]"+res[8]);
System.out.println(" The answer for this question is choice d ");
qtans="d";
}
else
{
System.out.println(" The answer for this question is choice c ");
qtans="c";
}
}
else if (res[6]<res[8])
{
qanswer= res[8];
System.out.println(" res[2]"+res[6]+" is less than res[4]"+res[8]);
System.out.println(" The answer for this question is choice d ");
qtans="d";
}
else
{
System.out.println(" The answer for this question is choice b ");
qtans="b";
}
}
else if (res[5]<res[7])
{
qanswer= res[7];
System.out.println(" res[1]"+res[5]+" is less than res[3]"+res[7]);
if (res[7]<res[8])
{
qanswer= res[8];
System.out.println(" res[3]"+res[7]+" is less than res[4]"+res[8]);
outFile.write(qID+"********"+res[0]+"********"+res[1]+"********"+res[2]+"********"+res[3]+"********"+res[4]+"********"+ans+"********"+tans);
outFile.newLine();
outFile1.write(qID+"********"+res[0]+"********"+res[5]+"********"+res[6]+"********"+res[7]+"********"+res[8]+"********"+ans+"********"+qtans);
outFile1.newLine();
}
outFile.close();
outFile1.close();
outFile2.close();
}
}
Appendix G
QueryFilter program code:
import java.io.*;
import java.util.*;
import javax.swing.*;
String line=null,qID=null;
String[] query= new String[1];
line=inFile.readLine();
System.out.println(line);
while(line!=null)
{
System.out.println(line);
if(line.startsWith("qID=")){
qID=line;
System.out.println(qID);
line=inFile.readLine();
System.out.println(line);
StringTokenizer st=new StringTokenizer(line);
if(line.startsWith("Who ")){
System.out.println("Who if ");
query[0]=line.substring(4);
outFile1.write(qID);
outFile1.newLine();
outFile1.write(query[0]);
outFile1.newLine();
for(int i=0;i<6;i++)
{
outFile1.write(inFile.readLine());
outFile1.newLine();
}
}
else if(line.startsWith("Where ")){
System.out.println("Where if ");
query[0]=line.substring(6);
outFile2.write(qID);
outFile2.newLine();
outFile2.write(query[0]);
outFile2.newLine();
for(int i=0;i<6;i++)
{
outFile2.write(inFile.readLine());
outFile2.newLine();
}
}
query[0]=line.substring(5);
outFile4.write(qID);
outFile4.newLine();
outFile4.write(query[0]);
outFile4.newLine();
for(int i=0;i<6;i++)
{
outFile4.write(inFile.readLine());
outFile4.newLine();
}
}
else
{
System.out.println("Other ");
query[0]=line;
outFile5.write(qID);
outFile5.newLine();
outFile5.write(query[0]);
outFile5.newLine();
for(int i=0;i<6;i++)
{
outFile5.write(inFile.readLine());
outFile5.newLine();
}
}
}
line=inFile.readLine();
}
outFile1.close();
outFile2.close();
outFile3.close();
outFile4.close();
outFile5.close();
}
}
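QueryFilter routes each question to a separate output file according to its leading question word, then strips that word from the query. The routing can be summarised as below; only the "Who", "Where" and default branches are fully visible in the listing above, so the "What" and "When" branches here are assumptions inferred from the `substring(5)` call and the five output files.

```java
public class WhRouter {
    // Classifies a question line by its leading question word, mirroring
    // QueryFilter's if/else chain. Each category corresponds to one of the
    // program's output files (outFile1..outFile5).
    public static String route(String line) {
        if (line.startsWith("Who ")) return "who";
        if (line.startsWith("Where ")) return "where";
        if (line.startsWith("What ")) return "what";   // assumed branch
        if (line.startsWith("When ")) return "when";   // assumed branch
        return "other";
    }
}
```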
Appendix H
QueryPost3 program code:
import java.io.*;
import java.util.*;
import javax.swing.*;
outFile.write("qID********hits********a********b********c********d********CorrectAnswer********answer");
outFile1.write("qID********hits********a********b********c********d********CorrectAnswer********answer");
outFile.newLine();
outFile1.newLine();
line=inFile.readLine();
while(line!=null)
{
while(!line.startsWith("</q>")){
if(line.startsWith("qID=")){
if (line.startsWith("a.")){
choice[0]=line.substring(3);
}
if(line.startsWith("b.")){
choice[1]=line.substring(3);
}
if(line.startsWith("c.")){
choice[2]=line.substring(3);
}
if(line.startsWith("d.")){
choice[3]=line.substring(3);
}
if (line.startsWith("answer:")){
StringTokenizer st=new StringTokenizer(line);
if(st.nextToken().equals("answer:")){
ans=st.nextToken();
}
}
line=inFile.readLine();
}
line=inFile.readLine();
for(int i=0;i<4;i++)
{
query[i+1]=choice[i]+" "+query[0];
}
for(int i=0;i<4;i++)
{
query[i+5]="\""+choice[i]+"\""+" "+query[0];
}
for(int i=0;i<res.length;i++)
res[i]=0;
for(int i=0;i<query.length;i++)
{
res[i]= g.getNumberHits(query[i]);
qanswer=res[5];
System.out.println(" answer = first result"+res[5]);
qtans="a";
if (res[5]<res[6])
{
qanswer= res[6];
System.out.println(" res[1]"+res[5]+" is less than res[2]"+res[6]);
if (res[6]<res[7])
{
qanswer= res[7];
System.out.println(" res[2]"+res[6]+" is less than res[3]"+res[7]);
if (res[7]<res[8])
{
qanswer= res[8];
System.out.println(" res[3]"+res[7]+" is less than res[4]"+res[8]);
System.out.println(" The answer for this question is choice d ");
qtans="d";
}
else
{
System.out.println(" The answer for this question is choice c ");
qtans="c";
}
}
else if (res[6]<res[8])
{
qanswer= res[8];
System.out.println(" res[2]"+res[6]+" is less than res[4]"+res[8]);
System.out.println(" The answer for this question is choice d ");
qtans="d";
}
else
{
System.out.println(" The answer for this question is choice b ");
qtans="b";
}
}
else if (res[5]<res[7])
{
qanswer= res[7];
System.out.println(" res[1]"+res[5]+" is less than res[3]"+res[7]);
if (res[7]<res[8])
{
qanswer= res[8];
System.out.println(" res[3]"+res[7]+" is less than res[4]"+res[8]);
outFile.write(qID+"********"+res[0]+"********"+res[1]+"********"+res[2]+"********"+res[3]+"********"+res[4]+"********"+ans+"********"+tans);
outFile.newLine();
outFile1.write(qID+"********"+res[0]+"********"+res[5]+"********"+res[6]+"********"+res[7]+"********"+res[8]+"********"+ans+"********"+qtans);
outFile1.newLine();
}
outFile.close();
outFile1.close();
outFile2.close();
}
}
Appendix I
An output sample from the RASP system: a text file in which every word is tagged with its part of speech (PoS):
qID=_NN1
3_MC
NBC_NP1
broadcast_VVD
the_AT
first_MD
sportscast_NN1
of_IO
this_DD1
game_NN1
in_II
1939_MC
a._NNU
Wrestling_VVG
b._&FW
Boxing_NN1
c._RR
Football_NN1
d._NNU
Baseball_NN1
answer_NN1
:_:
d_NN2
</q>_(
qID=_NN1
14_MC
What_DDQ
genre_NN1
is_VBZ
the_AT
1993_MC
movie_NN1
'Blue_NP1
'_$
a._NN1
Comedy_NP1
b._&FW
Horror_NN1
c._RR
Documentary_JJ
d._NNU
Western_JJ
answer_NN1
:_:
c_ZZ1
</q>_)
qID=_NN1
17_MC
What_DDQ
is_VBZ
the_AT
first_MD
name_NN1
of_IO
the_AT
lead_NN1
singer_NN1
in_II
the_AT
band_NN1
'Little_NP1
Crunchy_NP1
Blue_NP1
Things_NP1
'_$
a._NN1
Hunter_NP1
b._&FW
Eric_NP1
c._RR
Noah_NP1
d._NNU
Brian_NP1
answer_NN1
:_:
c_ZZ1
</q>_)
Appendix J
These are the noun tags from the 155-tag CLAWS-2 part-of-speech (PoS) tagset, taken from (WWW8):
Tag Description
ND1 singular noun of direction (north, southeast)
NN common noun, neutral for number (sheep, cod)
NN1 singular common noun (book, girl)
NN1$ genitive singular common noun (domini)
NN2 plural common noun (books, girls)
NNJ organization noun, neutral for number (department, council, committee)
NNJ1 singular organization noun (Assembly, commonwealth)
NNJ2 plural organization noun (governments, committees)
NNL locative noun, neutral for number (Is.)
NNL1 singular locative noun (street, Bay)
NNL2 plural locative noun (islands, roads)
NNO numeral noun, neutral for number (dozen, thousand)
NNO1 singular numeral noun (no known examples)
NNO2 plural numeral noun (hundreds, thousands)
NNS noun of style, neutral for number (no known examples)
NNS1 singular noun of style (president, rabbi)
NNS2 plural noun of style (presidents, viscounts)
NNSA1 following noun of style or title, abbreviatory (M.A.)
NNSA2 following plural noun of style or title, abbreviatory
NNSB preceding noun of style or title, abbr. (Rt. Hon.)
NNSB1 preceding sing. noun of style or title, abbr. (Prof.)
NNSB2 preceding plur. noun of style or title, abbr. (Messrs.)
NNT temporal noun, neutral for number (no known examples)
NNT1 singular temporal noun (day, week, year)
NNT2 plural temporal noun (days, weeks, years)
NNU unit of measurement, neutral for number (in., cc.)
NNU1 singular unit of measurement (inch, centimetre)
NNU2 plural unit of measurement (inches, centimetres)
NP proper noun, neutral for number (Indies, Andes)
NP1 singular proper noun (London, Jane, Frederick)
NP2 plural proper noun (Browns, Reagans, Koreas)
NPD1 singular weekday noun (Sunday)
NPD2 plural weekday noun (Sundays)
NPM1 singular month noun (October)
NPM2 plural month noun (Octobers)
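The NounRemove program works on RASP tokens of the form `word_TAG`, using tags like those in the table above. A small sketch of the token handling it relies on (the class and method names are illustrative; the noun test simply exploits the fact that every noun tag in the table begins with `N`):

```java
public class RaspToken {
    // Splits a RASP token such as "game_NN1" into its word and PoS tag.
    // The last underscore is the separator, so words containing
    // underscores would still split correctly.
    public static String[] split(String token) {
        int i = token.lastIndexOf('_');
        return new String[] { token.substring(0, i), token.substring(i + 1) };
    }

    // Heuristic noun test: all noun tags listed in the table above
    // (ND1, NN*, NP*, ...) begin with the letter 'N'.
    public static boolean isNounTag(String tag) {
        return tag.startsWith("N");
    }
}
```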
Appendix K
NounRemove program code:
import java.io.*;
import java.util.*;
import javax.swing.*;
String line=null,qID=null;
line=inFile.readLine();
while(line!=null)
{
System.out.println(line);
if(line.startsWith("qID=")){
qID=inFile.readLine();
outFile.write("qID="+qID);
outFile.newLine();
System.out.println(qID);
line=inFile.readLine();
System.out.println(line);
if((line.startsWith("This"))||(line.startsWith("What"))){
line=inFile.readLine();
while (line.endsWith("_NN1")){
line=inFile.readLine();
System.out.println(line);
}
}
while(!line.startsWith("</q>")){
outFile.write(line);
outFile.newLine();
System.out.println(line);
line=inFile.readLine();
}
outFile.write(line);
outFile.newLine();
}
line=inFile.readLine();
}
outFile.close();
}
}