
ISU535 Information Retrieval - Spring 2004 - Quiz #1 examples - Prof. Futrelle
Updated 28 January
This will be a closed-book, half-hour quiz.
Question 1. This is about the Boolean Model, Sec. 2.5.2.
Consider the four terms, in order: park, mountain, trails, difficult. Assume that the query, in
disjunctive normal form, DNF, is the following, where "OR" is the logical disjunction operator:
Q = (1,0,1,0) OR (0,1,1,0) OR (1,1,1,0)
You'll be asked to write an English language description of this, which could be a straightforward
translation from DNF: "Search for a document containing park and trails, but neither mountain nor
difficult. Or, search for a document containing mountain and trails, but neither park nor difficult.
Or, search for a document containing park and mountain and trails, but not difficult."
Another way of saying this is that no retrieved document may contain difficult, every one must
contain trails, and every one must contain park or mountain or both. This latter description is not
in DNF, but it is easier to understand.
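The equivalence of the two descriptions can be checked by brute force over all 16 combinations of the four terms. The following is only a minimal Python sketch; the names DNF, matches_dnf and matches_simplified are illustrative and do not come from the course material.

```python
from itertools import product

# Term order: park, mountain, trails, difficult
DNF = {(1, 0, 1, 0), (0, 1, 1, 0), (1, 1, 1, 0)}

def matches_dnf(vec):
    """True if the term-incidence vector equals one of the conjunctive components."""
    return tuple(vec) in DNF

def matches_simplified(vec):
    """The easier description: trails AND (park OR mountain) AND NOT difficult."""
    park, mountain, trails, difficult = vec
    return bool(trails and (park or mountain) and not difficult)

# Compare the two formulations on every possible combination of the four terms.
equivalent = all(
    matches_dnf(v) == matches_simplified(v) for v in product((0, 1), repeat=4)
)
print(equivalent)  # prints True
```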
Now consider the result of applying the query to the following two (tiny) documents. Which of the
two are retrieved, if either? Explain briefly how you arrived at your conclusions.
Document 1: "Loon park contains a lovely lake and is near Mystery mountain. It's not difficult to get
to from the city."
Document 2: "The Mystery mountain area has many easy trails, but no difficult ones."
Answer: Neither will be retrieved, because they both contain difficult. (Document 1 also lacks
trails.) Oddly, the second one contains difficult in a negated form. But essentially no retrieval
system can take negation in English into account. The intent of the query was probably to find a
park or mountain without difficult trails. But finding just what you want is not easy! Experimenting
with Google shows that even when +difficulty is included, phrases such as "Difficulty level: Easy"
appear. Not easy!
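The retrieval decision for the two documents can be reproduced with a short Python sketch. It assumes exact word matching with no stemming, so "trail" would not count as "trails". The helper names incidence_vector and retrieved are hypothetical.

```python
import re

# Term order: park, mountain, trails, difficult
TERMS = ("park", "mountain", "trails", "difficult")
DNF = {(1, 0, 1, 0), (0, 1, 1, 0), (1, 1, 1, 0)}

def incidence_vector(text):
    """Return a 0/1 tuple marking which index terms occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return tuple(int(term in words) for term in TERMS)

def retrieved(text):
    """A document is retrieved if its vector equals one of the DNF components."""
    return incidence_vector(text) in DNF

doc1 = ("Loon park contains a lovely lake and is near Mystery mountain. "
        "It's not difficult to get to from the city.")
doc2 = "The Mystery mountain area has many easy trails, but no difficult ones."

print(incidence_vector(doc1), retrieved(doc1))  # (1, 1, 0, 1) False
print(incidence_vector(doc2), retrieved(doc2))  # (0, 1, 1, 1) False
```

Both vectors have a 1 in the last position (difficult), so neither matches any component of Q. A purely Boolean match like this one sees only the presence of the word, not the "not" or "no" in front of it.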
Question 2. This is about the Vector Model, Sec. 2.5.3.
I will NOT give you equation 2.1 or 2.3. You have to remember them. If you understand them and
practice doing computations with them, you should easily be able to remember them.
Assume you index the terms "Mars", "landed" and "rover" in the following document:
Document = "After a successful landing on Mars, the Mars rover Opportunity landed on a Mars
plain in Meridiani section of Mars. The ship landed at an excellent landing spot."
Assume that, out of a total collection of 64 documents, 16 contain "Mars", 4 contain "landed" and
8 contain "rover". Using these, compute the three weight vector components for the document.
Ignore the stop words: the, a, an, of, on, in and at. Use lg = log_2 (the base-2 logarithm).
Answer: The highest frequency word is Mars, with 4 occurrences. The absolute frequencies of the
others are landed (2) and rover (1). This gives tf-idf factors of:
For Mars, f = 1 and idf = lg(64/16) = 2 so w = tf-idf = 2 for Mars.
For landed, f = 1/2 and idf = lg(64/4) = 4 so w = tf-idf = 2 for landed.
For rover, f = 1/4 and idf = lg(64/8) = 3 so w = tf-idf = 3/4 for rover.
Note that it is just a coincidence that a keyword, Mars, has the highest absolute frequency in the
document.
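The arithmetic in the answer can be checked with a small Python sketch. It follows the normalization used above: f is a term's count divided by the count of the most frequent non-stop word in the document, and idf = lg(N/n). The function name tf_idf_weights and its parameters are illustrative assumptions, and words are matched exactly, so "landing" is not counted as "landed".

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "on", "in", "at"}

def tf_idf_weights(text, doc_freq, n_docs):
    """Compute w = f * idf for each indexed term.

    f   = term count / count of the most frequent non-stop word
    idf = log2(n_docs / number of documents containing the term)
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    max_count = max(counts.values())
    weights = {}
    for term, n_i in doc_freq.items():
        f = counts[term.lower()] / max_count
        idf = math.log2(n_docs / n_i)
        weights[term] = f * idf
    return weights

document = ("After a successful landing on Mars, the Mars rover Opportunity landed on a Mars "
            "plain in Meridiani section of Mars. The ship landed at an excellent landing spot.")

weights = tf_idf_weights(document, {"Mars": 16, "landed": 4, "rover": 8}, 64)
print(weights)  # {'Mars': 2.0, 'landed': 2.0, 'rover': 0.75}
```

The maximum is taken over all non-stop words, not only over the indexed terms, which is why it matters that Mars happens to be the most frequent word in this document.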