Supervised by
Dr. Louise Guthrie
August 2006
This report is submitted in partial fulfilment of the requirement for the degree of
MSc in Advanced Computer Science
To
Signed Declaration
All sentences or passages quoted in this dissertation from other people's work have
been specifically acknowledged by clear cross-referencing to author, work and
page(s). Any illustrations which are not the work of the author of this dissertation have
been used with the explicit permission of the originator and are specifically
acknowledged. I understand that failure to do this amounts to plagiarism and will be
considered grounds for failure in this dissertation and the degree examination as a
whole.
Abstract
People rely on information for performing the activities of their daily lives. Much of
this information can be found quickly and accurately online through the World Wide
Web. As the amount of online information grows, better techniques to access the
information are necessary. Techniques for answering specific questions are in high
demand and large research programmes investigating methods for question answering
have been developed to respond to this need.
Question Answering (QA) technology, however, faces some problems that inhibit
its advancement. Typical approaches will first generate many candidate answers for
each question, then attempt to select the correct answer from the set of potential
answers. The techniques for selecting the correct answer are in their infancy, and
further techniques are needed to identify and select the correct answer from the
candidate answers.
This project focuses on multiple choice questions and the development of techniques
for automatically finding the correct answer. In addition to being a novel and
interesting problem on its own, the investigation has identified methods for web based
Question Answering (QA) technology in selecting the correct answers from potential
candidate answers. The project has investigated techniques performed manually and
automatically. The data consists of 600 questions, which were collected from an
online web resource. They are classified into 6 categories, depending on the
questions’ domain, and divided equally between the investigation and the evaluation
stages. The manual experiments were promising, as 45 percent of the answers were
correct, a figure which increased to 95 percent after the form of the queries was
restructured. With automatic techniques, such as using quotation marks and replacing
the question words according to the question type, it was found that the accuracy
ranged between 48.5 and 49 percent. The accuracy also increased to 63 percent and
74 percent in some categories, such as geography and literature.
Acknowledgements
This project would not have existed without the will of Allah, the giver of all good and
perfect gifts. His mercy and blessing have empowered me throughout my life. All
praise is due to Allah for his guidance and grace. Peace and blessings be upon our
prophet Mohammed.
Then, I would like to thank Dr. Louise Guthrie for her supervision, help, guidance and
encouragement throughout this project.
I would also like to thank Dr. Mark Greenwood for providing many good ideas and
resources throughout the project and Mr. David Guthrie for assisting me with the
RASP program.
Many thanks to my dear friend, Mrs. Basma Albuhairan for her assistance in
proofreading.
My deepest thanks to my lovely friends Ebtehal, Arwa and Maha for their friendship,
substantive support and encouragement.
Finally, I would like to thank my compassionate brother Fahad, my sweet sister Ru’a
and my loving aunts Halimah, Hana, Tarfa, and Zahra who through their love
supported me throughout my Masters degree.
Table of Contents
Appendix J ................................................................................................................87
Appendix K ...............................................................................................................88
List of Tables
Table 2. 1 Trigger words and their corresponding thematic roles and attributes..........18
Table 2. 2 Question categories and their possible answer types ..................................19
Table 2. 3 Sample question series from the test set. Series 8 has an organization as a
target, series 10 has a thing as a target, series 27 has a person as a target.............23
Table 4. 1 Number of Google hits for first manual experiment when a given question
and given answer is used as a query in Google ...................................................30
Table 4. 2 Number of Google hits for second manual experiment when a given
question and given answer is used as a query in Google .....................................32
Table 5. 1 Number of correct answers for the First and the Second Manual
Experiments, and the overall accuracy for both of them .....................................36
Table 5. 2 Number of correct answers for every question category and its accuracy, in
addition to the overall accuracy in First Automated Experiment.........................38
Table 5. 3 Number of correct answers for every question category and its accuracy, in
addition to the overall accuracy in Second Automated Experiment.....................39
Table 5. 4 Number of correct answers for every question category and its accuracy
using quotation, in addition to the overall accuracy in the Third Automated
Experiment ........................................................................................................40
Table 5. 5 Number of correct answers for every question type, with its accuracy
with and without using quotation, and the overall accuracy in the Fourth
Automated Experiment ......................................................................................43
Table 5. 6 Number of correct answers for every question type in each category, with
their accuracy with and without using quotation, and the overall accuracy ..........45
Table 6. 1 Evaluation for the participant systems in TREC 2005 QA track.................49
List of Figures
List of Equations
List of Abbreviations
Abbreviation Meaning
AI Artificial Intelligence
IR Information Retrieval
PoS Part-of-Speech
QA Question Answering
Chapter 1: Introduction
1.1 Background
Searching the Web is part of our daily lives, and one of the uses of the Web is to
determine the answer to a specific question, such as “When was Shakespeare born?”
Most people, however, would not choose to spend a lot of time finding the answer to
such a simple question. The tremendous amount of online information is increasing
daily, making the search process for such a specific question even more difficult.
Despite the drastic advances in search engines, they still fall short of understanding
the user’s specific question. As a result, more sophisticated search tools are needed to
reliably provide the correct answers for these questions. The desirability of such
technology has led to a relatively recent increase in funding for the development of
Question Answering (QA) systems (Korfhage, 1997; Levene, 2006).
This project is mainly inspired by the difficulties still present in the QA systems. The
project investigates techniques for selecting accurate answers for multiple choice
questions. This procedure is similar to the answer extraction stage in the traditional
QA systems, since the typical systems first will generate a small set of answers prior to
selecting a candidate answer. This work investigates techniques for choosing the
correct candidate answer from among a small set of possible answers. The results may
therefore be useful for the improvement of the current QA systems.
The aim of this project is to investigate automated techniques for answering multiple
choice questions. Neither commercial nor research QA systems, which attempt to
answer different types of questions and retrieve precise answers, are perfect.
Typically, the system analyses the question, finds a set of candidate answers by
consulting a knowledge resource and selects among the candidate answers before
presenting the correct answer to the user. One area where future research is needed is
with respect to the most appropriate method for selecting the correct answer from a set
of candidate answers. This project therefore focuses on techniques for improving this
stage of the QA systems. In addition, the idea of automatically finding the correct
answers for multiple choice questions is both novel and interesting, and is yet to be
fully explored.
Chapter 2 contains the literature review that explains why QA systems are important.
It also provides a brief history of the QA systems, their types, and the relationship
between QA and the TREC QA track. Details about the evaluation measure for the
QA systems and their metrics are also provided.
Chapter 3 contains the data and resources for this project. It describes in detail the
data set for the investigation experiments and the tools used within the project.
Chapter 4 describes the design and implementation phases, including the experimental
design and the implementation.
Chapter 2: Literature Review
The world today is starving for more information. Organizations and humans are
striving for new or stored information every second in order to improve the
performance of their work and daily activities. People seek information in order to
solve problems, whether booking a seat at the cinema or finding a solution to a
financial problem. They rely on information resources such as their background
knowledge, experts, libraries, private or public information, and the global networks,
which have rapidly become the most important resource and are considered the
cheapest method for easily accessing information from nearly everywhere.
Chowdhury (1999) noticed that electronic information has become readily accessible;
the rate of generating information has also increased. This information explosion has
caused a major problem, which has raised the need for effective and efficient
information retrieval.
Even though Information Retrieval (IR) systems have emerged and improved over the
last decade, users often find the search process difficult and boring, as a simple
question may require an extensive search that is time consuming and exhausting. In
order to meet the need for an efficient search tool that retrieves the correct answer for a
question, the development of accurate QA systems has evolved into an active area of
research. Such systems aim to retrieve answers in the shortest possible time with
minimum expense and maximum efficiency.
This chapter defines what a QA system is, provides a brief history of QA systems and
their main types, illustrated with some important examples. It also explains how the
TREC QA track is involved in promoting QA research, and the metrics needed to
evaluate QA systems.
The recent expansion in the World Wide Web has demanded an increase in high
performance Information Retrieval (IR) systems. These IR systems help the users to
seek whatever they want on the Web and to retrieve the required information. The QA
system is a type of IR system. An IR system provides the user with a collection of
documents containing the search result, and the user extracts the correct answer.
Unlike an IR system, the QA system provides the users with the right answer to their
question using underlying data sources, such as the Web or a local collection of
documents.
Salton (1968) implies that a QA system ideally provides a direct answer in response to
a search query by using stored data covering a restricted subject area (a database).
Thus, if the user asks an efficient QA system “What is the boiling point of water?”, it
would reply “one hundred degrees centigrade” or “100 °C”.
Moreover, Salton (1968) stated that these systems have become important for different
types of applications, varying from standard business-management systems to
sophisticated military or library systems. These applications can help personnel
managers, bank account managers and military officers to easily access varied data
about employees, customer accounts and tactical situations.
“The task of QA system consists in analysing the user query, comparing the analyzed
query with the stored knowledge and assembling a suitable response from the
apparently relevant facts” (Salton and McGill, 1983:9).
The actual beginning for the QA system was in the 1960s within the Artificial
Intelligence (AI) field, developed by researchers who employed knowledge based
techniques such as encyclopaedias, edited lists of frequently asked questions, and sets
of newspaper articles. These researchers attempted to develop rules that enabled a
logical expression to be derived automatically from natural sentences, and vice versa,
and this therefore directed them to the QA system design (Belkin and Vickery, 1985;
Etzioni et al., 2001; Joho, 1999).
QA systems are classified into two types according to the source of the question’s
answer: whether it comes from a structured source or from free text. The former
systems return answers drawn from a database, and these are considered to be the
earliest QA systems. As Belkin and Vickery (1985) declared, this approach does,
however, have some disadvantages, as it is often restricted to a specific domain and
the input questions are often limited to simple forms of English. The latter type
generates answers drawn from free text instead of from a well-structured database.
According to Belkin and Vickery (1985), these systems are not linked to a specific
domain and the text can cover any subject.
The following two sections describe these two types of systems in more detail.
The earliest QA systems are often described as the Natural Language Interface of a
database. They allow the user to access the stored information in a traditional database
by using natural language requests or questions. The request is then translated into a
database query language, e.g. SQL. For instance, if such an interface is connected to
the personnel data of a company and the user asks the system the following (Monz,
2003) (please note that the user’s query is in italics and the system’s response is in
bold):
> Does any employee in the marketing department earn less than Mari Clayton?
No.
Two early examples of this type are ‘Baseball’ (Chomsky et al., 1961) and ‘Lunar’
(Woods, 1973), which were developed in the 1960s and early 1970s. The Baseball
system answers
English questions about baseball (a popular sport in the United States of America).
The database contains information about the teams, scores, dates and locations of these
games, taken from lists summarizing a Major League season’s experience. This
system allows the users, using their natural language, to communicate with an interface
that understands the question and the database structure. The user could only insert a
simple question without any connectives (such as ‘and’, ‘or’, etc.), without
superlatives such as ‘smallest’, ‘youngest’, etc. and without a sequence of events
(Chomsky et al., 1961).
This system answered questions such as “How many games did the Red Sox play in
June?”, “Who did the Yankees lose to on August 10?” or “On how many days in June
did eight teams play?” As Sargaison (2003) explained, this is done by analysing the
question and matching normalised forms of the extracted important key-terms such as
“How many”, “Red Sox” and “June” against a pre-structured database that contained
the baseball data, then returning an answer to the question.
The Lunar system (Woods, 1973) is more sophisticated, providing information about
chemical-analysis data on lunar rock and soil material that were collected during the
Apollo moon missions, enabling lunar geologists to process this information by
comparing or evaluating them. This system answered 78 percent of the geologists’
questions correctly.
Gaizauskas and Hirschman (2001) believed that Lunar was capable of answering
questions such as “What is the average concentration of aluminium in high alkali
rocks?” or “How many Breccias contain Olivine?” by translating the question into a
database-query language; an example of such a translation is given in Monz (2003).
Additionally, Monz (2003) stated that Lunar can manage a sequence of questions such
as “Do any samples have aluminium greater than 12 percent?” followed by
“What are these samples?”
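The translation step can be illustrated with a minimal sketch in Java (the implementation language used elsewhere in this project). The pattern and the table and column names (samples, concentration, mineral, rock_type) are purely illustrative assumptions; they are not Lunar’s actual grammar or schema.

```java
// Sketch: pattern-based translation of a natural-language question into a
// database query, in the spirit of Lunar. The regular expression and the
// schema names below are illustrative assumptions only.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NlToQuery {
    static final Pattern AVG = Pattern.compile(
            "What is the average concentration of (\\w+) in (.+?)\\?");

    // Returns an SQL-like query for a recognised question form, or null.
    static String translate(String question) {
        Matcher m = AVG.matcher(question);
        if (m.matches()) {
            return "SELECT AVG(concentration) FROM samples "
                 + "WHERE mineral = '" + m.group(1) + "' "
                 + "AND rock_type = '" + m.group(2) + "'";
        }
        return null; // question form not recognised
    }
}
```

A real interface of this kind would, of course, need a far richer grammar; the sketch only shows the idea of mapping a fixed question form onto a structured query.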
Despite the significant success of the previous systems, they are restricted to a specific
domain and are associated with a database storing the domain knowledge. The results
of the database based QA systems are not directly comparable to the open domain
question-answering from an unstructured text.
The next generation of QA systems evolved to extract answers from the plain
machine-readable text that is found in a collection of documents or on the Web. Monz
(2003) believed that the need for these systems is a result of the increasing growth in
the Web, and the user’s demand to access the information in a simple and fast fashion.
The feasibility of these systems is due to the advancement in the natural language
processing techniques. As the QA systems are capable of answering questions about
different domains and topics, these systems are more complex than the database based
QA systems, as they need to analyze the data in addition to the question, in order to
find the correct answer.
The QUALM system was one of the early moves toward extracting answers from
text. As Pazzani (1983) revealed, this QA system was able to answer questions about
stories by consulting scripted knowledge constructed during the story understanding.
Ferret et al. (2001) added that it uses complex semantic information,
due to its dependency on the question taxonomy that is provided by the question type
classes. This approach was the inspiration for many QA systems (refer to Appendix A
for the 13 conceptual question categories used in Wendy Lehnert’s QUALM which
was taken from Burger et al. (2001)).
The Wendlandt and Driscoll system (Driscoll and Wendlandt, 1991) is considered to
be one of the early text based QA systems. This system uses National Aeronautics and
Space Administration (NASA) plain text documentation to answer questions about the
NASA Space Shuttle. Monz (2003) noted that it uses thematic roles and attributes
occurring in the question to identify the paragraph that contains the answer. Table 2.1
was taken from Monz (2003) to illustrate examples of the trigger words and the
corresponding thematic roles and attributes.
Murax is another example of this type of QA system. It extracts answers for general
fact questions by using an online version of Grolier’s Academic American
Encyclopaedia as a knowledge base. According to Greenwood (2005), it accesses the
documents that contain the answer via information retrieval and then analyses them to
extract the exact answer. As Monz (2003) explained, this is done by using question
categories that focus on types of questions likely to have short answers. Table 2.2 was
taken from Monz (2003) and illustrates these question categories and their possible
answer type:
More recently, systems have been developed that attempt to find the answer to a
question from very large document collections. These systems attempt to simulate
finding answers on the Web. Katz (2001) states that START (SynTactic Analysis
using Reversible Transformations) is one of the first systems that tried to answer
questions using a web interface. It has been available to users since 1993 on the World
Wide Web (WWW). According to Katz (2001), this system, however, focuses on
answering questions about many subjects, such as geography, movies, corporations
and weather. This requires time and effort in expanding its knowledge base.
On the other hand, MULDER was considered the first QA system to utilize the full
Web. It uses multiple search engine queries and natural language processing to
extract the answer. Figure 2.1 is a mock-up of MULDER based on an example seen
in Etzioni et al. (2001), as the original system is no longer available online:
[Figure 2.1: Mock-up of the MULDER interface (“The Truth is Out There”). For the
question “Who was the first American in space?”, MULDER returns ranked snippets,
e.g. from a “Man in Space” page (“…The first American in space was Alan B.
Shepard, May 5th 1961…”) and from a “Shepard” page (“On May 5, 1961, Shepard
became the first American in space…”).]
According to (WWW2), the general text based QA system architecture contains three
modules: question classification, document retrieval and answer extraction. The
question classification module is responsible for analysing and processing a question
to determine its type, and consequently the type of answer is also determined. The
document retrieval module is used to search through a set of documents in order to
return the documents that might contain the potential answer. Finally, the answer
extraction module identifies the candidate answers in order to extract the accurate
answer. It may apply a statistical approach, for example using the frequency of the
candidate answers in the document collection. An external database could also be
used to verify that the answer and its category are appropriate to the question
classification. For instance, if the question starts with the word ‘where’, its answer
should be a location, and this information is used in selecting the correct answer.
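The question classification and statistical answer extraction steps described above can be sketched in Java as follows. The wh-word-to-answer-type table is a deliberately simplified illustration, not the taxonomy of any particular system.

```java
// Sketch of two of the three modules: question classification by question
// word, and frequency-based answer extraction over candidate occurrences.
import java.util.*;

public class QaPipeline {
    // Simplified mapping from question word to expected answer type.
    static final Map<String, String> TYPES = Map.of(
            "who", "PERSON", "where", "LOCATION", "when", "DATE",
            "how many", "NUMBER", "what", "DEFINITION/OTHER");

    // Question classification: determine the expected answer type.
    static String answerType(String question) {
        String q = question.toLowerCase();
        if (q.startsWith("how many")) return TYPES.get("how many");
        return TYPES.getOrDefault(q.split("\\s+")[0], "UNKNOWN");
    }

    // Statistical answer extraction: choose the candidate answer that
    // occurs most frequently across the retrieved documents.
    static String mostFrequent(List<String> candidateOccurrences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String c : candidateOccurrences) counts.merge(c, 1, Integer::sum);
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }
}
```

For example, a question beginning with “where” would be typed LOCATION, and only location-like candidates would then be counted.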
This type of QA system is the focus of this project, as techniques for selecting the
correct answers for multiple choice questions from free text over the web are
investigated.
In the beginning, the QA systems were isolated from the outside world in the research
laboratories. Thereafter, they started to appear during the TREC competition.
(WWW3, 2005) states that when the QA track at TREC was started in 1999, “in each
track the task was defined such that the systems were to retrieve small snippets of text
that contained an answer for open-domain, closed-class questions (i.e., fact-based,
short-answer questions that can be drawn from any domain)”. For every track,
therefore, the participants would have a question set to answer and a corpus of
documents, which is a collection of text documents drawn from newswires and
articles, in which the answers to the questions are included in at least one of these
documents. These answers are evaluated against answer patterns, and an answer is
considered correct if it matches the answer pattern. TREC assesses the answers’
accuracy of the competing QA systems to judge their performance.
The first track, TREC-8 (1999), as Voorhees (2004) clarified, focused on answering
factoid questions, which are fact-based questions such as (taken from WWW5, 2005)
“Who is the author of the book, ‘The Iron Lady: A Biography of Margaret
Thatcher’?” The main task was to return a ranked list of five strings containing
the answer.
Sanka (2005) revealed that in TREC-9 (2000) the task was similar to TREC-8 but the
question set was increased from 200 to 693 questions. These questions were
developed by extracting them from a log of questions submitted to the Encarta system
(Microsoft) and questions derived from an Excite query log; whereas the questions in
TREC-8, were taken from a log of questions submitted to the FAQ Finder system.
Moreover, both TREC-8 and TREC-9 used the Mean Reciprocal Rank (MRR) score
to evaluate the answers.
As Voorhees (2004) mentioned, in the TREC 2001 QA track, however, the question
set included 500 questions obtained from the MSN Search logs and AskJeeves logs. It
also contained different tasks, the main task being similar to the previous tracks.
However, the other tasks were to return an unordered list for short answers and to test
the context task through a series of questions. As a result, these different tasks have
different evaluation metrics and reports.
Furthermore, Voorhees (2004) stated that the TREC 2002 QA track had two tasks: the
main task and the list task. In contrast to the previous years, the participants had to
return the exact answer for each question, as text snippets containing the answers were
no longer allowed. Accordingly, a new scoring metric called the confidence-weighted
score was used to evaluate the main task, and another evaluation metric was used for
the list task.
Voorhees (2004) also reported that the TREC 2003 QA track test set contained 413
questions taken from MSN and AOL Search logs. It included two tasks, that is, the
main task and the passage tasks. These tasks have different types of test question. The
main task has a set of list questions, definition questions and factoid questions. The
passage task, however, has a set of factoid questions for which the systems should
return a text snippet containing the answer. This track has different evaluation metrics because
it has different tasks and different types of question.
With regard to the TREC 2004 QA track, Sanka (2005) and Voorhees (2004) mentioned
that the question set was similar to TREC 2003, i.e. drawn from MSN and AOL Search
logs. It contained a series of questions (of several question types, including factoid
questions, list questions and other questions similar to the definition questions in
TREC 2003) asking for a specific target. This target might be a person, an
organization, an event or a thing. Table 2.3 illustrates a sample of a question set
(WWW6, 2005).
Also, TREC 2004 has a single task that is evaluated by the weighted average of the
three components (factoid, list and other).
Moreover, and according to Voorhees (2005), TREC 2005 QA track uses the same
type of question set used in the TREC 2004 (with the 75 series questions). In addition
to the main task, it has other tasks similar to TREC 2004, such as returning a ranked
list of the documents that contain the answers, and a relationship task to return evidence
for the relationship hypothesized in the question, or for the lack of this relationship. The
main task is evaluated by the weighted average of the three components (factoid, list
and other). The other tasks, however, have different evaluation measures.
This task is a vital stage, as it will assess the performance of the systems as well as the
accuracy of the techniques. Subsequently, it will be used to draw a confident
conclusion about the efficiency of their performance.
Joho (1999) stated that TREC started evaluating QA systems in 1999, as a result of
QA system developers’ need for an automated evaluation method that measures a
system’s improvement. Since then, it has included a QA track to evaluate
the open-domain QA system (that provides an exact answer for a question by using a
special corpus). This corpus includes a large collection of newspaper articles where
for each generated question there is at least one document containing the answer
(WWW3, 2005).
These measures will be used to evaluate the whole QA system. Some of them are also
used by the TREC, such as the Mean Reciprocal Rank (MRR).
The MRR evaluates a QA system that retrieves a ranked list of five answers for each
question; each question is scored as the reciprocal of the rank of its first correct
answer. According to Greenwood (2005), MRR is given by Equation 2.1 (where |Q|
is the number of questions and r_i is the answer rank for question i).
MRR = (1 / |Q|) × Σ_{i=1}^{|Q|} (1 / r_i)

Equation 2. 1 MRR
This means that if the second answer in the list is the correct answer for a question,
then the reciprocal rank will be ½. This is computed for every question, after which
the average score is calculated over all of the questions. Consequently, the higher the
value of the metric, the better the performance of the system.
There are also other measures which were specified by Joho (1999). For example,
Mani’s method applies three measures Answer Recall Lenient (ARL), Answer Recall
Strict (ARS) and Answer Recall Average (ARA). These are used to evaluate the
summaries found by a question responding to a task requiring a more detailed
judgment. ARL and ARS are defined in Equation 2.2 and Equation 2.3 (where n1
represents the number of correct answers, n2 is the number of partially correct
answers and n3 is the number of answered questions).
ARL = (n1 + 0.5 × n2) / n3

Equation 2. 2 Answer Recall Lenient (ARL)

ARS = n1 / n3

Equation 2. 3 Answer Recall Strict (ARS)
ARA is the average of ARL and ARS. This measure, however, is not used in this
project, as it needs a summary for the answer, which is not available in this case.
Moreover, it requires determining whether each answer is correct, partially correct or
missing.
The accuracy is defined in Equation 2.4, where |C| is the number of correctly
answered questions and |Q| is the total number of questions:

accuracy = (|C| / |Q|) × 100

Equation 2. 4 Accuracy
These metrics are used throughout the project, thus allowing the accuracy of the
techniques to be determined.
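The accuracy metric (Equation 2.4), together with ARL and ARS (Equations 2.2 and 2.3), can be computed as follows; the class and method names are illustrative only.

```java
// Evaluation metrics from Equations 2.2-2.4.
public class Metrics {
    // Accuracy (Equation 2.4): |C| correct answers out of |Q| questions,
    // expressed as a percentage.
    static double accuracy(int correct, int total) {
        return 100.0 * correct / total;
    }

    // Answer Recall Lenient (Equation 2.2): n1 fully correct answers,
    // n2 partially correct answers, n3 answered questions.
    static double arl(int n1, int n2, int n3) {
        return (n1 + 0.5 * n2) / n3;
    }

    // Answer Recall Strict (Equation 2.3): only fully correct answers count.
    static double ars(int n1, int n3) {
        return (double) n1 / n3;
    }
}
```

For instance, 45 correct answers out of 100 questions give an accuracy of 45 percent, the figure reported for the first manual experiment.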
Chapter 3: Data and Resources Used in Experiments
This section discusses in detail the requirements of the investigation, such as the data
set (question set and the knowledge source) and the tools and evaluation methods used
in this project.
The data set includes the question set and the corpus of knowledge or the knowledge
source. The question set is a set of multiple choice questions with four choices. The
knowledge source, however, is the source which the techniques refer to for finding the
correct answer amongst the four choices.
Throughout the investigative stage the questions were collected from an online quiz
resource, which is Fun Trivia.com (WWW1). This could be relied on, as it has more
than 1000 multiple choice questions and their correct solutions. Also, these questions
cover different domains such as entertainment, history, geography, sport, science/
nature and literature (refer to Appendix B for question examples for each category). A
set of 600 randomly chosen questions that were divided equally between the 6 domains
and their answers were collected for this project. The answers were withheld in the
evaluation stages.
These questions were found in HTML format with their correct answers in separate
HTML files, making the comparison between the techniques’ answers and the correct
answers more complicated. Hence, the 600 multiple choice questions were
reformatted manually into text files so that each question is followed by its correct
answer. Below is an example:
qID= 1
Television was first introduced to the general public at the 1939 Worlds Fair in
a. New York
b. Seattle
c. St. Louis
d. Montreal
answer: a
</q>
A tag was also added to mark the end of each question. This supports automating the
techniques, as the program will know where each question begins and ends.
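A simple parser for this file format might look as follows in Java; the class and method names are illustrative, not those of the project’s actual implementation.

```java
// Sketch: parse the reformatted question files described above. Each
// question consists of a "qID=" line, the question text, choices "a." to
// "d.", an "answer:" line and a closing </q> tag.
import java.util.*;

public class QuestionParser {
    // One parsed multiple choice question.
    static class Question {
        int id;
        String text;
        Map<Character, String> choices = new LinkedHashMap<>();
        char answer;
    }

    static List<Question> parse(String contents) {
        List<Question> out = new ArrayList<>();
        Question q = new Question();
        StringBuilder text = new StringBuilder();
        for (String line : contents.split("\n")) {
            line = line.trim();
            if (line.startsWith("qID=")) {
                q.id = Integer.parseInt(line.substring(4).trim());
            } else if (line.matches("[a-d]\\..*")) {
                q.choices.put(line.charAt(0), line.substring(2).trim());
            } else if (line.startsWith("answer:")) {
                q.answer = line.substring(7).trim().charAt(0);
            } else if (line.equals("</q>")) {        // end-of-question tag
                q.text = text.toString().trim();
                out.add(q);
                q = new Question();
                text.setLength(0);
            } else if (!line.isEmpty()) {
                text.append(line).append(' ');       // question text line
            }
        }
        return out;
    }
}
```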
The knowledge source was derived from the World Wide Web, as these investigated
techniques involved the use of the Google Search Engine in deciding which choice is
the correct answer for multiple choice questions.
Google is currently one of the most popular and largest search engines on the
Internet. It was chosen for two main reasons (mentioned in Etzioni et al. (2001)):
first, it has the widest coverage, having indexed more than one billion web pages;
secondly, it ranks pages so that those with the highest information value appear
first.
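The core selection technique, combining the question with each choice and keeping the choice whose query yields the most hits, can be sketched as below. The hitCount function is a stand-in for a real search-engine lookup; the actual Google query mechanism is not part of this sketch.

```java
// Sketch of hit-count based answer selection: form one query per choice
// and keep the choice whose query returns the most hits. The hit counter
// is injected so that any search backend (or a test stub) can be used.
import java.util.function.ToLongFunction;

public class HitCountSelector {
    static String selectAnswer(String question, String[] choices,
                               ToLongFunction<String> hitCount) {
        String best = null;
        long bestHits = -1;
        for (String choice : choices) {
            long hits = hitCount.applyAsLong(question + " " + choice);
            if (hits > bestHits) {   // keep the highest-scoring choice
                bestHits = hits;
                best = choice;
            }
        }
        return best;
    }
}
```

In the real experiments the injected function would issue a Google query and read off the reported number of results.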
3.2 Evaluation
The techniques are applied automatically over a large set of questions without prior knowledge of the answers. After comparing the technique's answers with the correct ones, as described in section 5.1.2, the precision and accuracy of the technique are determined through the aforementioned accuracy metric.
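As a reminder of the metric, accuracy here is simply the percentage of questions whose selected choice matches the withheld correct answer. A minimal sketch (names are illustrative):

```java
// Minimal sketch of the accuracy metric: the percentage of questions for
// which the technique's choice matches the withheld correct answer.
public class AccuracyMetric {

    public static double accuracy(char[] chosen, char[] correct) {
        int hits = 0;
        for (int i = 0; i < chosen.length; i++) {
            if (chosen[i] == correct[i]) {
                hits++;
            }
        }
        return 100.0 * hits / chosen.length;
    }

    public static void main(String[] args) {
        // 3 of 4 answers match, giving 75 percent accuracy.
        System.out.println(accuracy(new char[]{'a', 'b', 'c', 'd'},
                                    new char[]{'a', 'b', 'c', 'a'}));  // prints 75.0
    }
}
```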
3.3 Tools
The requirements of the project, from the data set through to the evaluation, have now been explored, so the resources needed to complete the project can be specified. An appropriate programming language had to be chosen; Java was selected for its flexibility and familiarity.
Additionally, another program was used for further experiments: Robust Accurate Statistical Parsing (RASP). RASP is a parsing toolkit developed in cooperation between the Department of Informatics at Sussex University and the Computer Laboratory at Cambridge University. First released in January 2002, it runs input text through several processes: tokenisation, tagging, lemmatisation and parsing (Briscoe and Carroll, 2002; Briscoe and Carroll, 2003; WWW7).
Chapter 4: Design and Implementation
The aim of this project is to investigate techniques for finding the correct answer to multiple choice questions, using the Google Search Engine. This chapter explains the design of the manual experiments and the implementation of the automated techniques.
The initial experiments were performed manually and were then automated so that precision and accuracy could be measured over larger sets of questions. In these initial investigations the question set was small (20 questions), consisting of factual questions about famous individuals and historical subjects. In the first experiment, each question is combined with each of its choices and passed to Google manually. The choices are then ranked by the number of search results. For instance, for the question:
The Roanoke Colony of 1587 became known as the 'Lost Colony'. Who was the
governor of this colony?
a. John Smith
b. Sir Walter Raleigh
c. John White
d. John White
The question is first combined with choice (a) and sent to Google as follows:
The Roanoke Colony of 1587 became known as the 'Lost Colony'. Who was the
governor of this colony John Smith
This process is repeated with each choice. The choice with the highest number of search result hits is then selected; if several choices have the same number of hits, the one appearing first in the question is chosen. Table 4.1 illustrates the results of the first experiment. For question 7, the number of search result hits was equal for all four choices, so the first choice was selected as the possible answer. There were 9 correct answers in this experiment, giving an accuracy of 45 percent.
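The selection rule of this first experiment can be sketched as follows. Here the hit counts are supplied directly, whereas in the experiment they came from Google, and the class name is illustrative.

```java
// Sketch of the first experiment's selection rule: pick the choice with the
// highest number of search result hits; on a tie, keep the choice that
// appears earlier in the question.
public class HitCountSelector {

    // hits[i] holds the number of results for the question combined with
    // choice i (0 = a, 1 = b, ...); returns the index of the chosen answer.
    public static int select(long[] hits) {
        int best = 0;
        for (int i = 1; i < hits.length; i++) {
            if (hits[i] > hits[best]) {  // strict '>' keeps earlier choices on ties
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(select(new long[]{120, 5400, 80, 310}));  // prints 1 (choice b)
        System.out.println(select(new long[]{7, 7, 7, 7}));          // prints 0 (tie: choice a)
    }
}
```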
After this experiment, the technique was implemented and a larger number of questions was posed to Google to obtain more reliable precision and accuracy figures. The implementation used Java as the programming language and is described in detail in section 4.2.
In the second experiment, performance was improved by restructuring the form of the query. The same sample used in the first experiment was taken, and the 11 questions that had been answered incorrectly were selected. These questions were modified by using important keywords from the question and combining them with each choice, as follows:
The previous question had a wrong answer in the first experiment; therefore the query
form was restructured and combined with each choice as follows:
“In 1594” Shakespeare became an actor and playwright in “(one of the choices)”
This query was combined with choice (a) and sent to Google as follows:
“In 1594” Shakespeare became an actor and playwright in “The King's Men”
This was then repeated with each choice. In the previous example, keywords in the question were chosen to improve the accuracy of the answer. The year was put in quotation marks, as follows: “In 1594”. It was noticed that the quotation marks enhanced the search results. The question word and the noun following the question word were replaced with a quoted choice as the possible answer.
This question was restructured by replacing “What astronomer” with a quoted choice
and the year was also put in quotation marks, “in 1543”, as follows:
“(one of the choices)” published his theory of a sun- centered universe “in 1543”
Question 6: This Russian ruler was the first to be crowned Czar (Tsar) when he
defeated the boyars (influential families) in 1547. Who was he?
a. Nicholas II
b. Ivan IV (the Terrible)
c. Peter the Great
d. Alexander I
In question 6 “This Russian ruler” was replaced with a quoted choice and some
important words such as “first”, “Czar” and “defeated the boyars” were used.
Moreover, the year was put in quotation marks; hence the query is as follows:
“(one of the choices)” was first Czar defeated the boyars “in 1547”
For the remaining questions, please refer to Appendix C. As a result of the previous changes, 19 questions out of 20 were answered correctly, as highlighted in Table 4.2. Therefore, the changes improved the accuracy of the results to 95 percent.
During these manual experiments, some special and interesting cases about the
structuring of the questions were found. For example, when the choice is the complete
name (first name and surname) for a famous person it was found that it worked better
to combine the query with the surname, as the first name is rarely used (as can be seen
in question 15 in Appendix C). Moreover, it was noticed that the accuracy of the answers could be enhanced in several ways: by restructuring the form of the query with quotation marks (questions 4, 5, 6, 7, 9, 14, 17 and 20 in Appendix C); by selecting the main keywords in the query (questions 6, 7, 11 and 15 in Appendix C); and by replacing the question words (What, Who, Which and Where) and the noun following the question word with one of the choices (questions 4, 5, 7, 9, 11, 15 and 20 in Appendix C). These changes were made manually, and some of them were then applied by automated means, as explained in section 4.2.
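Two of the restructuring rules described above, replacing the question phrase with a quoted choice and quoting the year, can be sketched as simple string substitutions. This is a hypothetical illustration, not the project's actual implementation.

```java
// Illustrative sketch of two query restructuring rules: the question word
// plus its following noun is replaced by the quoted choice, and the year
// phrase is wrapped in quotation marks.
public class QueryRestructurer {

    public static String restructure(String question, String questionPhrase,
                                     String yearPhrase, String choice) {
        return question
                .replace(questionPhrase, "\"" + choice + "\"")
                .replace(yearPhrase, "\"" + yearPhrase + "\"");
    }

    public static void main(String[] args) {
        String q = "What astronomer published his theory of a sun-centered universe in 1543";
        System.out.println(restructure(q, "What astronomer", "in 1543", "Copernicus"));
        // "Copernicus" published his theory of a sun-centered universe "in 1543"
    }
}
```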
In the first automated experiment, the data set contained 180 questions from the different domains, 30 questions per category. At this stage, the multiple choice questions were collected in a text file and reformatted, as explained in section 3.1.1. A program named QueryPost was written in Java, using a Google class written by Dr. Mark Greenwood. The program combines the question with each choice and passes the result to Google to retrieve the search results; the program code is shown in Appendix D. The choice with the highest number of search result hits is selected as the answer to the question. Two text files are generated. The first is the main file, containing the number of hits for the query and for the query combined with each choice, together with the correct choice and the technique's choice (see Appendix E for a sample of the output). The second file records each action the program undertakes and can be used as a reference source when needed.
In the second experiment, the data set was larger than in the previous experiment: it contained 600 questions, 100 per category. These questions were sent to Google, again using the QueryPost program. After this experiment, another manual experiment was undertaken, as previously mentioned in section 4.1, to investigate different methods for finding the correct answer to multiple choice questions.
The handling of answers with and without quotation marks can be seen in the program code in Appendix F. Three text files are generated. The two main files, one per technique, contain the number of search result hits for the query and for the query combined with each choice, together with the correct choice and the technique's choice. The third file records each action the program undertakes and can be used as a reference source when needed. This program was run on the same question set used in the second experiment.
different types of nouns having different tags, as can be seen in Appendix J. These tags were manually normalised to a single tag (_NN1), making them easier for the NounRemove program to recognise. The NounRemove program was written to delete the noun words from the queries, producing a text file as shown in Appendix K. This text file is sent to another program, QueryPost3, which combines each choice with the question using the two techniques (the answer in quotation marks and the answer without quotation marks) and finally passes the queries to Google to retrieve the search results.
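Assuming RASP output where each token carries a suffix tag (e.g. ruler_NN1) and all noun tags have been normalised to _NN1 as described, the noun removal step might look like the following sketch. The names are illustrative, not the actual NounRemove code.

```java
// Sketch of the noun removal step: drop every token tagged _NN1 and strip
// the remaining tags, leaving a query without its noun words.
public class NounRemoveSketch {

    public static String removeNouns(String taggedQuery) {
        StringBuilder out = new StringBuilder();
        for (String token : taggedQuery.split("\\s+")) {
            if (token.endsWith("_NN1")) {
                continue;  // skip noun words entirely
            }
            int underscore = token.lastIndexOf('_');
            out.append(underscore >= 0 ? token.substring(0, underscore) : token)
               .append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(removeNouns(
                "This_DD1 Russian_JJ ruler_NN1 was_VBDZ first_MD crowned_VVN Czar_NN1"));
        // prints: This Russian was first crowned
    }
}
```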
Chapter 5: Results and Discussion
In this chapter, the results of the experiments are presented in two sections: one for the manual results and one for the automated results. An analysis and discussion of the findings is then given, which in turn determines the success or failure of the project's goals. Suggestions are also given for further work that could improve on the project's accomplishments.
Two manual experiments were performed using a small set of multiple choice questions (20 questions). In the first experiment, each question was combined with its choices and sent to Google as a query. As a result, 9 questions out of 20 were answered correctly, giving an accuracy of 45 percent, as can be seen in Table 5.1 and Figure 5.1. Since this is a promising result, it was decided to automate the technique so that it could be applied to a larger set of questions. To improve the technique further, another manual experiment was undertaken, in which the query forms were restructured by using important keywords, adding quotation marks and replacing the question words, as explained in detail in section 4.1. According to Table 5.1, 19 questions out of 20 were then answered correctly, which increased the accuracy to 95 percent, as shown in Figure 5.1.
Figure 5.1 The improvement of the accuracy for the First and the Second Manual Experiments
In the automated experiments, the manual techniques were applied using programs written for this purpose. In the first automated experiment, a small data set containing 180 questions from the different categories was used, as previously mentioned in section 4.2. The questions were read and sent to Google by the QueryPost program to retrieve the search results, and the program then chose an answer from the four choices for each question. The results of this experiment are illustrated in Table 5.2, which shows the number of correct answers and the accuracy for every category, as well as the overall accuracy for the whole question set.
[Figure 5.2: bar chart of accuracy for each question category]
According to Figure 5.2, the highest accuracy percentage was for the sports category,
as it achieved 66.7 percent. The entertainment category, however, was lower, as it
only achieved 23.3 percent. In addition, the accuracy was higher than 50 percent in
both the literature and the geography categories. It decreased, however, to 36.7
percent in the science/nature category. Furthermore, a decrease in accuracy was also
evident in the history category.
In the same manner, and as previously detailed in section 4.2, the second automated experiment was undertaken by running the QueryPost program on a larger set of 600 multiple choice questions (for more reliable results). These results are shown in Table 5.3, which gives the number of correct answers and the accuracy for every category, in addition to the overall accuracy for the whole question set.
The overall accuracy in Table 5.3 changes little compared to that of the first automated experiment; hence the size of the question set does not greatly affect the accuracy.
[Figure 5.3: bar chart of accuracy for each question category]
From Figure 5.3 it can be seen that the results of the second experiment are similar to those of the first: the three categories literature, geography and sport still have the highest accuracy. This time, however, the highest accuracy was in the literature category, at 67 percent, while the sport category dropped to 45 percent. The geography category's accuracy increased to 62 percent, but the lowest accuracy was still in the entertainment category, at 26 percent, below the history category's 35 percent.
The third experiment was designed in light of the findings of the second manual experiment, where it was found that using quotation marks may improve the accuracy of the answers. This technique was therefore applied to the same question set used in the second automated experiment, using the QueryPost2 program. The results are summarised in Table 5.4, which gives the number of correct answers and the accuracy for every category using quotation marks, as well as the overall accuracy for this experiment.
Figure 5.4 The accuracy for each category using quotation marks in the Third Automated Experiment
According to Figure 5.4, literature still has the highest accuracy, at 75 percent, followed by the geography category at 62 percent. The accuracy dropped to 48 percent in sport, though this is high in comparison to history, at 40 percent. Entertainment still remains at the lowest level of accuracy, with 30 percent, while the accuracy in the science/nature category increased to 39 percent.
Figure 5.5 The accuracy for each category with and without quotation marks
Figure 5.5 summarises the results of the second and third experiments, showing the variation in accuracy across all the categories between the two techniques, with and without quotation marks. It can be concluded that finding the correct answer is easier for literature and geography questions than for entertainment questions, which were the most difficult. It was also found that using quotation marks improves the selection of the correct answer in all categories by at least 1 percent, with the exception of the geography category, whose percentage did not change. Overall, the accuracy with quotation marks was 49 percent, higher than the 45.8 percent obtained without them; that is, using quotation marks performs better.
The fourth experiment built on the findings of the second manual experiment: replacing the question words (Who, Where, Which, What/This) in the queries with the choices, in addition to using quotation marks, could improve the selection of the correct answer. Accordingly, several programs were used. The final results from the QueryPost3 program are illustrated in Table 5.5.
Figure 5.6 The accuracy for each question type with and without quotation marks
As shown in Figure 5.6, the highest accuracy occurs for the ‘Who’ questions, both with and without quotation marks, at 54.2 percent and 48.6 percent respectively. This is because the answer to such a question is a person's name, which can be found more easily by quoting the whole name. The accuracy decreased slightly for the ‘Which’ questions: 48.6 percent with quotation marks and 45.7 percent without. The accuracy declined to its lowest rate for the ‘Where’ questions, at 42.9 percent with quotation marks and 38.1 percent without; this could be a result of the small number of questions of this type (21 questions). It increased for the ‘What’/‘This’ questions, to 43.9 percent with quotation marks and 41.4 percent without. Thus, the best performance was for the ‘Who’ and ‘Which’ questions and the worst for the ‘Where’ questions.
Furthermore, the results were analysed according to the question categories and the question types, as can be seen in Table 5.6:
Figure 5.7 The accuracy for every question type in each category with and without using quotation marks
In Figure 5.7, all the applied techniques, such as using quotation marks and replacing the question words with the choices, are summarised for every question category. The highest performance was in the literature and geography categories, which achieved 74 percent and 63 percent accuracy using both of the aforementioned techniques. In contrast, the lowest performance was for the entertainment questions, with 28 percent accuracy. This variation in accuracy between the question categories reflects the type of information available on the WWW. According to these experiments, the WWW contains more information about literature and geography and less about entertainment, which leaves these techniques unable to answer some entertainment questions correctly.
During the manual experiments, it was found that choosing query keywords such as the date or the year, and using only the surname rather than the whole name of a person, may enhance the performance of the QA system in finding the correct answer. Applying these ideas automatically, however, would require more time and more knowledge of natural language processing.
Chapter 6: Conclusions
Knowledge and information, and their uses, have recently expanded and spread, and information has become ever more accessible through the World Wide Web. Increased access to the web, and the growth in the number of users with diverse backgrounds and objectives, require more accurate and effective search over the web. Consequently, many techniques have been developed in response to the need to provide reliable answers to users' searches.
Manual experiments were initially performed on a small set of questions (20 questions). Each question was combined with its choices and sent to Google to find the hit rate for each choice, and the choice with the highest hit rate was selected as the answer. As a result, 45 percent accuracy was achieved. When the previous method was implemented automatically on a larger data set (600 questions), categorised by question domain into 6 categories (entertainment, geography, history, literature, science/nature and sport), almost the same accuracy was obtained, at 45.8 percent, a slight increase.
The overall accuracy of these techniques did not match that of some other QA systems, which achieved more than 70 percent (Voorhees, 2004). It was, however, better than most of the systems that participated in the TREC 2005 QA track: the investigated techniques achieved from 48.5 percent to 49 percent accuracy, higher than the median in that track of 15.2 percent, according to Table 6.1. This table, taken from Greenwood (2006), shows the accuracy of the participating systems in the TREC 2005 QA track.
A direct comparison between our techniques and these systems is not entirely fair, however, because the questions in our question set were much longer than those in the TREC 2005 QA track data set, and those systems are the result of research projects with more human resources, larger funding and far more development time. In contrast, our techniques are not time consuming, require less programming, and still perform better than most of the other systems.
Overall, this project demonstrated that rather simple techniques can be applied to obtain respectable results. The project does not fully solve the problem, yet the results represent an improvement over the baseline in comparison to other systems.
Bibliography
Breck, E. J., Burger, J. D., Ferro, L., Hirschman, L., House, D., Light, M. and Mani, I.
(2000) How to Evaluate Your Question Answering System Every Day and Still Get
Real Work Done. Proceedings of LREC-2000, Second International Conference on
Language Resources and Evaluation. Greece: Athens. [online]. Available: http://www.cs.cornell.edu/~ebreck/publications/docs/breck2000.pdf [1/5/2006].
Burger, J., Cardie, C., Chaudhri, V., Gaizauskas, R., Israel, D., Jacquemin, C., Lin, C.-
Y., Maiorano, S., Miller, G., Moldovan, D., Ogden, B., Prager, J., Riloff, E., Singhal,
A., Shrihari, R., Strzalkowski, T., Voorhees, E. and Weischedel, R. (2001) Issues,
Tasks and Program Structures to Roadmap Research in Question & Answering
(Q&A). [online]. Available: http://www-nlpir.nist.gov/projects/duc/papers/qa.Roadmap-paper_v2.doc [8/5/2006].
Etzioni, O., Kwok, C. and Weld, D.S. (2001) Scaling Question Answering to the Web.
Proc WWW10. Hong Kong. [online]. Available: http://www.iccs.inf.ed.ac.uk/~s0239548/qa-group/papers/kwok.2001.www.pdf [1/5/2006].
Ferret, O., Grau, B., Hurault-Plantet, M., Illouz, G. and Jacquemin, C. (2001)
Document Selection Refinement Based on Linguistic Features for QALC, a Question
Answering System. In Proceedings of RANLP–2001. Bulgaria: Tzigov Chark. [online].
Available: http://www.limsi.fr/Individu/mhp/QA/ranlp2001.pdf [2/8/2006].
Gabbay, I. (2004) Retrieving Definitions from Scientific Text in the Salmon Fish
Domain by Lexical Pattern Matching. MA, University of Limerick. [online].
Available: http://etdindividuals.dlib.vt.edu:9090/archive/00000114/01/chapter2.pdf [2/8/2006].
Katz, B., Lin, J. and Felshin, S. (2001) Gathering Knowledge for a Question
Answering System from Heterogeneous Information Sources. In Proceedings of the
ACL 2001 Workshop on Human Language Technology and Knowledge Management.
France: Toulouse. [online]. Available: http://groups.csail.mit.edu/infolab/publications/
Katz-etal-ACL01.pdf [1/5/2006].
Korfhage, R. R. (1997) Information Storage and Retrieval. Canada: John Wiley &
Sons, Inc.
Liddy, E. D. (2001) When You Want an Answer, Why Settle for a List? Center for Natural Language Processing, School of Information Studies, Syracuse University. [online]. Available: http://www.cnlp.org/presentations/slides/When_You_Want_Answer.pdf [2/8/2006].
WWW7: Robust Accurate Statistical Parsing (RASP) (No Date). [online]. Available:
http://www.informatics.susx.ac.uk/research/nlp/rasp/ [20/7/2006]
Appendix A
The 13 conceptual question categories used in Wendy Lehnert’s QUALM
Appendix B
qID= 2
Who was the first U.S. President to appear on television
a. Franklin Roosevelt
b. Theodore Roosevelt
c. Calvin Coolidge
d. Herbert Hoover
answer: a
</q>
qID= 5
Who was the first English explorer to land in Australia
a. James Cook
b. Matthew Flinders
c. Dirk Hartog
d. William Dampier
answer: d
</q>
qID= 1
What is the capital of Uruguay
a. Assuncion
b. Santiago
c. La Paz
d. Montevideo
answer: d
</q>
qID= 1
Who wrote 'The Canterbury Tales'
a. William Shakespeare
b. William Faulkner
c. Christopher Marlowe
d. Geoffrey Chaucer
answer: d
</q>
qID= 1
Which of these programming languages was invented at Bell Labs
a. COBOL
b. C
c. FORTRAN
d. BASIC
answer: b
</q>
qID= 1
Which of the following newspapers sponsored a major sports stadium/arena
a. Chicago Tribune
b. Arizona Republic
c. Boston Globe
d. St Petersburg Times
answer: d
</q>
Appendix C
These are the questions for which we failed to get the right answers in the first manual experiment. In the second manual experiment we therefore modified them, using important keywords from each question and combining them with each choice, as follows:
The previous question had a wrong answer in the first experiment; therefore the query
form was restructured and combined with each choice as follows:
“In 1594” Shakespeare became an actor and playwright in “(one of the choices)”
This query was combined with choice (a) and sent to Google as follows:
“In 1594” Shakespeare became an actor and playwright in “The King's Men”
This was repeated with each choice. In the previous example, keywords in the question were chosen to improve the accuracy of the answer. The year was put in quotation marks, as follows: “In 1594”. It was noticed that the quotation marks enhanced the search results. The question word and the noun following the question word were replaced with a quoted choice as the possible answer, as follows:
This question was restructured by replacing “What astronomer” with a quoted choice
and the year was also put in quotation marks, “in 1543”, as follows:
“(one of the choices)” published his theory of a sun- centered universe “in 1543”
Question 6: This Russian ruler was the first to be crowned Czar (Tsar) when he
defeated the boyars (influential families) in 1547. Who was he?
e. Nicholas II
f. Ivan IV (the Terrible)
g. Peter the Great
h. Alexander I
In question 6 “This Russian ruler” was replaced with a quoted choice and some
important words such as “first”, “Czar” and “defeated the boyars” were used.
Moreover, the year was put in quotation marks; hence the query is as follows:
“(one of the choices)” was first Czar defeated the boyars “in 1547”
Question 7: Sir Thomas More, Chancellor of England from 1529-1532, was beheaded
by what King for refusing to acknowledge the King as the head of the newly formed
Church of England?
a. James I
b. Henry VI
c. Henry VIII
d. Edward VI
The previous question was modified by using some keywords such as “Sir Thomas
More” and “was beheaded by”; also, “what King” was replaced with a quoted choice
as follows:
In this question, “Which” was replaced with a quoted choice followed by “dynasty” as
follows:
“(one of the choices) dynasty” was in power throughout the 1500's in China
Question 11: Ponce de Leon, 'discoverer' of Florida (1513), was also Governor of
what Carribean island?
a. Cuba
b. Puerto Rico
c. Virgin Islands
d. Bahamas
In question 11, some keywords were chosen, such as “Ponce de Leon”, “was” and “Governor of”, and “what Carribean island” was replaced with a choice as follows:
Question 14: Who did Queen Elizabeth feel threatened by and had executed in 1587?
a. Jane Seymour
b. Margaret Tudor
c. Mary, Queen of Scots
d. James I
In the previous question, however, “Who did” was deleted and other words, such as “Queen Elizabeth”, were quoted. Also, “threatened by” was combined with a choice and the phrase was quoted as follows:
“Queen Elizabeth” feel “threatened by (one of the choices)” and had executed in
1587
The previous question was modified by replacing “What 16th century Cartographer and Mathematician” with the choice, or with the surname of each choice instead of the person's whole name, and by quoting other important words such as “projection map”, “two dimensions” and “latitude and longitude”, as follows:
(one of the choices) developed “projection map” representing the world in “two dimensions” using “latitude and longitude”
Question 16: Which group controlled North Africa throughout most of the 16th
century?
a. French
b. Spanish
c. Egyptians
d. Turks
This was the only question for which the right answer could not be obtained. The query forms were modified several times by replacing “Which” and “Which group” with one of the choices. In addition, the search operators OR and AND were used, and a noun was added for each choice, since the choices were adjectives. Each form was also tried with and without quotation marks, as follows:
• (one of the choices) AND (the noun of the choices) group controlled
North Africa throughout most of the 16th century
• (one of the choices) OR (the noun of the choices) group controlled
North Africa throughout most of the 16th century
• “(one of the choices) AND (the noun of the choices)” group
controlled North Africa throughout most of the 16th century
• “(one of the choices) OR (the noun of the choices)” group controlled
North Africa throughout most of the 16th century
• (one of the choices) AND (the noun of the choices) controlled North
Africa throughout most of the 16th century
• (one of the choices) OR (the noun of the choices) controlled North
Africa throughout most of the 16th century
• “(one of the choices) AND (the noun of the choices)” controlled
North Africa throughout most of the 16th century
• “(one of the choices) OR (the noun of the choices)” controlled North
Africa throughout most of the 16th century
Question 17: Martin Luther started the Protestant Reformation in 1517 by criticizing
the Roman Catholic Church. Where did he nail his 95 Theses?
a. Sistine Chapel
b. Wittenberg Cathedral
c. St. Paul's Cathedral
d. Louvre
Martin Luther nail his "95 Theses in the (one of the choices)"
Question 20: What association, formed in the late 16th century, comprised of five
Native American tribes?
a. Hopi League
b. Native American Council
c. Sioux Confederation
d. Iroquois League
This question was modified by replacing “What association, formed in the late 16th
century” with a quoted choice as follows:
Appendix D
QueryPost program code:
import java.io.*;
import java.util.*;
import javax.swing.*;
outFile.write("qID********hits********a********b********c********d********
CorrectAnswer********answer");
outFile.newLine();
line=inFile.readLine();
while(line!=null)
{
while(!line.startsWith("</q>")){
if(line.startsWith("qID=")){
qID=st.nextToken();
}
query[0]=inFile.readLine();
}
if (line.startsWith("a.")){
choice[0]=line.substring(3);
}
if(line.startsWith("b.")){
choice[1]=line.substring(3);
}
if(line.startsWith("c.")){
choice[2]=line.substring(3);
}
if(line.startsWith("d.")){
choice[3]=line.substring(3);
}
if (line.startsWith("answer:")){
StringTokenizer st=new StringTokenizer(line);
if(st.nextToken().equals("answer:")){
ans=st.nextToken();
}
}
line=inFile.readLine();
}
line=inFile.readLine();
for(int i=0;i<4;i++)
{
query[i+1]=query[0]+" "+choice[i];
}
for(int i=0;i<res.length;i++)
res[i]=0;
for(int i=0;i<query.length;i++)
{
res[i]= g.getNumberHits(query[i]);
System.out.println("Google Search Results for: "+query[i]+" "+res[i]+" documents");
outFile2.write("Google Search Results for: "+query[i]+" "+res[i]+" documents");
outFile2.newLine();
outFile2.write("======================");
outFile2.newLine();
}
if (res[0] == -1)
{
System.out.println(" No hit for this question, so there is not an answer");
outFile2.write(" No hit for this question, so there is not an answer");
outFile2.newLine();
}
else
{
answer=res[1];
System.out.println(" answer = first result"+res[1]);
if (res[1]<res[2])
{
answer= res[2];
System.out.println(" res[1]"+res[1]+" is less than res[2]"+res[2]);
if (res[2]<res[3])
{
answer= res[3];
System.out.println(" res[2]"+res[2]+" is less than res[3]"+res[3]);
if (res[3]<res[4])
{
answer= res[4];
System.out.println(" res[3]"+res[3]+" is less than res[4]"+res[4]);
System.out.println(" The answer for this question is choice d ");
tans="d";
}
else
{
System.out.println(" The answer for this question is choice c ");
tans="c";
}
}
else if (res[2]<res[4])
{
answer= res[4];
System.out.println(" res[2]"+res[2]+" is less than res[4]"+res[4]);
}
else
{
System.out.println(" The answer for this question is choice b ");
tans="b";
}
}
else if (res[1]<res[3])
{
answer= res[3];
System.out.println(" res[1]"+res[1]+" is less than res[3]"+res[3]);
if (res[3]<res[4])
{
answer= res[4];
System.out.println(" res[3]"+res[3]+" is less than res[4]"+res[4]);
}
else if (res[1]<res[4])
{
answer= res[4];
System.out.println(" res[1]"+res[1]+" is less than res[4]"+res[4]);
System.out.println(" The answer for this question is choice d ");
tans="d";
}
else
{
System.out.println(" The answer for this question is choice a ");
tans="a";
}
}
outFile.write(qID+"********"+res[0]+"********"+res[1]+"********"+res[2]+"********"+res[3]+"********"+res[4]+"********"+ans+"********"+tans);
outFile.newLine();
}
outFile.close();
outFile2.close();
}
}
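The nested comparisons above select the choice whose query returns the most hits. The same selection can be written more compactly as a single maximum search over `res[1]` to `res[4]`; this is a sketch of the equivalent logic, not part of the original program.

```java
public class ArgMax {
    // Returns the letter (a-d) of the choice with the highest hit count.
    // As in the listing above, res[0] holds the hits for the question alone
    // and res[1..4] hold the hits for choices a..d. On a tie the earlier
    // choice wins, matching the strict "<" comparisons in the original.
    public static String pick(double[] res) {
        int best = 1;
        for (int i = 2; i <= 4; i++) {
            if (res[i] > res[best]) best = i;
        }
        return String.valueOf((char) ('a' + best - 1));
    }
}
```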
Appendix E
An output sample of the QueryPost program for the sport questions:
qID********hits********a********b********c********d********CorrectAnswer********answer
1********282.0********23.0********15.0********44.0********64.0********d********d
2********8.0********-1.0********-1.0********-1.0********-1.0********c********a
3********54.0********41.0********35.0********52.0********28.0********c********c
4********33.0********30.0********32.0********31.0********22.0********b********b
5********16100.0********10900.0********27600.0********44100.0********34500.0********c********c
6********3.0********-1.0********-1.0********-1.0********-1.0********d********a
7********366.0********291.0********360.0********307.0********127.0********b********b
8********-1.0********-1.0********-1.0********-1.0********-1.0********b********b
9********12200.0********886.0********410.0********845.0********10900.0********c********d
10********93.0********89.0********47.0********81.0********34.0********a********a
11********85100.0********26800.0********23800.0********21600.0********24400.0********d********a
12********114000.0********96700.0********27800.0********45600.0********50200.0********a********a
13********28700.0********862.0********628.0********529.0********-1.0********a********a
14********15200.0********596.0********548.0********809.0********654.0********b********c
15********5.0********-1.0********-1.0********-1.0********-1.0********c********a
16********24400.0********785.0********770.0********927.0********910.0********c********c
17********540000.0********70600.0********67300.0********72500.0********62900.0********d********c
18********628000.0********11.0********110000.0********137000.0********183000.0********d********d
19********20000.0********943.0********15200.0********18500.0********13200.0********c********c
20********15800.0********107.0********331.0********891.0********469.0********d********c
Appendix F
QueryPost2 program code:
import java.io.*;
import java.util.*;
import javax.swing.*;
outFile.write("qID********hits********a********b********c********d********CorrectAnswer********answer");
outFile1.write("qID********hits********a********b********c********d********CorrectAnswer********answer");
outFile.newLine();
outFile1.newLine();
line=inFile.readLine();
while(line!=null)
{
while(!line.startsWith("</q>")){
if(line.startsWith("qID=")){
if (line.startsWith("a.")){
choice[0]=line.substring(3);
}
if(line.startsWith("b.")){
choice[1]=line.substring(3);
}
if(line.startsWith("c.")){
choice[2]=line.substring(3);
}
if(line.startsWith("d.")){
choice[3]=line.substring(3);
}
if (line.startsWith("answer:")){
StringTokenizer st=new StringTokenizer(line);
if(st.nextToken().equals("answer:")){
ans=st.nextToken();
}
}
line=inFile.readLine();
}
line=inFile.readLine();
for(int i=0;i<4;i++)
{
query[i+1]=query[0]+" "+choice[i];
}
for(int i=0;i<4;i++)
{
query[i+5]=query[0]+" "+"\""+choice[i]+"\"";
}
for(int i=0;i<res.length;i++)
res[i]=0;
for(int i=0;i<query.length;i++)
{
res[i]= g.getNumberHits(query[i]);
qanswer=res[5];
System.out.println(" answer = first result"+res[5]);
qtans="a";
if (res[5]<res[6])
{
qanswer= res[6];
System.out.println(" res[1]"+res[5]+" is less than res[2]"+res[6]);
if (res[6]<res[7])
{
qanswer= res[7];
System.out.println(" res[2]"+res[6]+" is less than res[3]"+res[7]);
if (res[7]<res[8])
{
qanswer= res[8];
System.out.println(" res[3]"+res[7]+" is less than res[4]"+res[8]);
System.out.println(" The answer for this question is choice d ");
qtans="d";
}
else
{
System.out.println(" The answer for this question is choice c ");
qtans="c";
}
}
else if (res[6]<res[8])
{
qanswer= res[8];
System.out.println(" res[2]"+res[6]+" is less than res[4]"+res[8]);
System.out.println(" The answer for this question is choice d ");
qtans="d";
}
else
{
System.out.println(" The answer for this question is choice b ");
qtans="b";
}
}
else if (res[5]<res[7])
{
qanswer= res[7];
System.out.println(" res[1]"+res[5]+" is less than res[3]"+res[7]);
if (res[7]<res[8])
{
qanswer= res[8];
System.out.println(" res[3]"+res[7]+" is less than res[4]"+res[8]);
outFile.write(qID+"********"+res[0]+"********"+res[1]+"********"+res[2]+"********"+res[3]+"********"+res[4]+"********"+ans+"********"+tans);
outFile.newLine();
outFile1.write(qID+"********"+res[0]+"********"+res[5]+"********"+res[6]+"********"+res[7]+"********"+res[8]+"********"+ans+"********"+qtans);
outFile1.newLine();
}
outFile.close();
outFile1.close();
outFile2.close();
}
}
Appendix G
QueryFilter program code:
import java.io.*;
import java.util.*;
import javax.swing.*;
String line=null,qID=null;
String[] query= new String[1];
line=inFile.readLine();
System.out.println(line);
while(line!=null)
{
System.out.println(line);
if(line.startsWith("qID=")){
qID=line;
System.out.println(qID);
line=inFile.readLine();
System.out.println(line);
StringTokenizer st=new StringTokenizer(line);
if(line.startsWith("Who ")){
System.out.println("Who if ");
query[0]=line.substring(4);
outFile1.write(qID);
outFile1.newLine();
outFile1.write(query[0]);
outFile1.newLine();
for(int i=0;i<6;i++)
{
outFile1.write(inFile.readLine());
outFile1.newLine();
}
}
else if(line.startsWith("Where ")){
System.out.println("Where if ");
query[0]=line.substring(6);
outFile2.write(qID);
outFile2.newLine();
outFile2.write(query[0]);
outFile2.newLine();
for(int i=0;i<6;i++)
{
outFile2.write(inFile.readLine());
outFile2.newLine();
}
}
query[0]=line.substring(5);
outFile4.write(qID);
outFile4.newLine();
outFile4.write(query[0]);
outFile4.newLine();
for(int i=0;i<6;i++)
{
outFile4.write(inFile.readLine());
outFile4.newLine();
}
}
else
{
System.out.println("Other ");
query[0]=line;
outFile5.write(qID);
outFile5.newLine();
outFile5.write(query[0]);
outFile5.newLine();
for(int i=0;i<6;i++)
{
outFile5.write(inFile.readLine());
outFile5.newLine();
}
}
}
line=inFile.readLine();
}
outFile1.close();
outFile2.close();
outFile3.close();
outFile4.close();
outFile5.close();
}
}
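QueryFilter routes each question to a separate output file according to its leading question word, then strips that word from the query. The routing can be summarised as below; only the "Who", "Where" and default branches are fully visible in the listing above, so the "What" and "When" branches here are assumptions inferred from the `substring(5)` call and the five output files.

```java
public class WhRouter {
    // Classifies a question line by its leading question word, mirroring
    // QueryFilter's if/else chain. Each category corresponds to one of the
    // program's output files (outFile1..outFile5).
    public static String route(String line) {
        if (line.startsWith("Who ")) return "who";
        if (line.startsWith("Where ")) return "where";
        if (line.startsWith("What ")) return "what";   // assumed branch
        if (line.startsWith("When ")) return "when";   // assumed branch
        return "other";
    }
}
```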
Appendix H
QueryPost3 program code:
import java.io.*;
import java.util.*;
import javax.swing.*;
outFile.write("qID********hits********a********b********c********d********CorrectAnswer********answer");
outFile1.write("qID********hits********a********b********c********d********CorrectAnswer********answer");
outFile.newLine();
outFile1.newLine();
line=inFile.readLine();
while(line!=null)
{
while(!line.startsWith("</q>")){
if(line.startsWith("qID=")){
if (line.startsWith("a.")){
choice[0]=line.substring(3);
}
if(line.startsWith("b.")){
choice[1]=line.substring(3);
}
if(line.startsWith("c.")){
choice[2]=line.substring(3);
}
if(line.startsWith("d.")){
choice[3]=line.substring(3);
}
if (line.startsWith("answer:")){
StringTokenizer st=new StringTokenizer(line);
if(st.nextToken().equals("answer:")){
ans=st.nextToken();
}
}
line=inFile.readLine();
}
line=inFile.readLine();
for(int i=0;i<4;i++)
{
query[i+1]=choice[i]+" "+query[0];
}
for(int i=0;i<4;i++)
{
query[i+5]="\""+choice[i]+"\""+" "+query[0];
}
for(int i=0;i<res.length;i++)
res[i]=0;
for(int i=0;i<query.length;i++)
{
res[i]= g.getNumberHits(query[i]);
qanswer=res[5];
System.out.println(" answer = first result"+res[5]);
qtans="a";
if (res[5]<res[6])
{
qanswer= res[6];
System.out.println(" res[1]"+res[5]+" is less than res[2]"+res[6]);
if (res[6]<res[7])
{
qanswer= res[7];
System.out.println(" res[2]"+res[6]+" is less than res[3]"+res[7]);
if (res[7]<res[8])
{
qanswer= res[8];
System.out.println(" res[3]"+res[7]+" is less than res[4]"+res[8]);
System.out.println(" The answer for this question is choice d ");
qtans="d";
}
else
{
System.out.println(" The answer for this question is choice c ");
qtans="c";
}
}
else if (res[6]<res[8])
{
qanswer= res[8];
System.out.println(" res[2]"+res[6]+" is less than res[4]"+res[8]);
System.out.println(" The answer for this question is choice d ");
qtans="d";
}
else
{
System.out.println(" The answer for this question is choice b ");
qtans="b";
}
}
else if (res[5]<res[7])
{
qanswer= res[7];
System.out.println(" res[1]"+res[5]+" is less than res[3]"+res[7]);
if (res[7]<res[8])
{
qanswer= res[8];
System.out.println(" res[3]"+res[7]+" is less than res[4]"+res[8]);
outFile.write(qID+"********"+res[0]+"********"+res[1]+"********"+res[2]+"********"+res[3]+"********"+res[4]+"********"+ans+"********"+tans);
outFile.newLine();
outFile1.write(qID+"********"+res[0]+"********"+res[5]+"********"+res[6]+"********"+res[7]+"********"+res[8]+"********"+ans+"********"+qtans);
outFile1.newLine();
}
outFile.close();
outFile1.close();
outFile2.close();
}
}
Appendix I
An output sample from the RASP system: a text file in which every word is tagged with its part of speech (PoS):
qID=_NN1
3_MC
NBC_NP1
broadcast_VVD
the_AT
first_MD
sportscast_NN1
of_IO
this_DD1
game_NN1
in_II
1939_MC
a._NNU
Wrestling_VVG
b._&FW
Boxing_NN1
c._RR
Football_NN1
d._NNU
Baseball_NN1
answer_NN1
:_:
d_NN2
</q>_(
qID=_NN1
14_MC
What_DDQ
genre_NN1
is_VBZ
the_AT
1993_MC
movie_NN1
'Blue_NP1
'_$
a._NN1
Comedy_NP1
b._&FW
Horror_NN1
c._RR
Documentary_JJ
d._NNU
Western_JJ
answer_NN1
:_:
c_ZZ1
</q>_)
qID=_NN1
17_MC
What_DDQ
is_VBZ
the_AT
first_MD
name_NN1
of_IO
the_AT
lead_NN1
singer_NN1
in_II
the_AT
band_NN1
'Little_NP1
Crunchy_NP1
Blue_NP1
Things_NP1
'_$
a._NN1
Hunter_NP1
b._&FW
Eric_NP1
c._RR
Noah_NP1
d._NNU
Brian_NP1
answer_NN1
:_:
c_ZZ1
</q>_)
Appendix J
These are the noun tags from the 155-tag CLAWS-2 part-of-speech (PoS) tagset, taken from (WWW8):
Tag Description
ND1 singular noun of direction (north, southeast)
NN common noun, neutral for number (sheep, cod)
NN1 singular common noun (book, girl)
NN1$ genitive singular common noun (domini)
NN2 plural common noun (books, girls)
NNJ organization noun, neutral for number (department, council, committee)
NNJ1 singular organization noun (Assembly, commonwealth)
NNJ2 plural organization noun (governments, committees)
NNL locative noun, neutral for number (Is.)
NNL1 singular locative noun (street, Bay)
NNL2 plural locative noun (islands, roads)
NNO numeral noun, neutral for number (dozen, thousand)
NNO1 singular numeral noun (no known examples)
NNO2 plural numeral noun (hundreds, thousands)
NNS noun of style, neutral for number (no known examples)
NNS1 singular noun of style (president, rabbi)
NNS2 plural noun of style (presidents, viscounts)
NNSA1 following noun of style or title, abbreviatory (M.A.)
NNSA2 following plural noun of style or title, abbreviatory
NNSB preceding noun of style or title, abbr. (Rt. Hon.)
NNSB1 preceding sing. noun of style or title, abbr. (Prof.)
NNSB2 preceding plur. noun of style or title, abbr. (Messrs.)
NNT temporal noun, neutral for number (no known examples)
NNT1 singular temporal noun (day, week, year)
NNT2 plural temporal noun (days, weeks, years)
NNU unit of measurement, neutral for number (in., cc.)
NNU1 singular unit of measurement (inch, centimetre)
NNU2 plural unit of measurement (inches, centimetres)
NP proper noun, neutral for number (Indies, Andes)
NP1 singular proper noun (London, Jane, Frederick)
NP2 plural proper noun (Browns, Reagans, Koreas)
NPD1 singular weekday noun (Sunday)
NPD2 plural weekday noun (Sundays)
NPM1 singular month noun (October)
NPM2 plural month noun (Octobers)
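The NounRemove program works on RASP tokens of the form `word_TAG`, using tags like those in the table above. A small sketch of the token handling it relies on (the class and method names are illustrative; the noun test simply exploits the fact that every noun tag in the table begins with `N`):

```java
public class RaspToken {
    // Splits a RASP token such as "game_NN1" into its word and PoS tag.
    // The last underscore is the separator, so words containing
    // underscores would still split correctly.
    public static String[] split(String token) {
        int i = token.lastIndexOf('_');
        return new String[] { token.substring(0, i), token.substring(i + 1) };
    }

    // Heuristic noun test: all noun tags listed in the table above
    // (ND1, NN*, NP*, ...) begin with the letter 'N'.
    public static boolean isNounTag(String tag) {
        return tag.startsWith("N");
    }
}
```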
Appendix K
NounRemove program code:
import java.io.*;
import java.util.*;
import javax.swing.*;
String line=null,qID=null;
line=inFile.readLine();
while(line!=null)
{
System.out.println(line);
if(line.startsWith("qID=")){
qID=inFile.readLine();
outFile.write("qID="+qID);
outFile.newLine();
System.out.println(qID);
line=inFile.readLine();
System.out.println(line);
if((line.startsWith("This"))||(line.startsWith("What"))){
line=inFile.readLine();
while (line.endsWith("_NN1")){
line=inFile.readLine();
System.out.println(line);
}
}
while(!line.startsWith("</q>")){
outFile.write(line);
outFile.newLine();
System.out.println(line);
line=inFile.readLine();
}
outFile.write(line);
outFile.newLine();
}
line=inFile.readLine();
}
outFile.close();
}
}