
ISCL Winter Semester 2007: IR Midterm Exam

17 December 2007. Non-electronic documents and calculators are permitted.

Name:                                Semester:

Exercise 1: Definitions
Define the following terms: tokenization, permuterm index, champion list.
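As a pointer for the second term: a permuterm index stores every rotation of each vocabulary term followed by an end marker $, so that a wildcard query such as hel*o can be rotated to o$hel* and answered with a single prefix lookup. A minimal sketch, with permuterm_keys as a hypothetical helper name:

    def permuterm_keys(term):
        """Generate all rotations of term + '$', the keys stored in a permuterm index."""
        augmented = term + "$"
        return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

    # A query like "hel*o" is rotated to "o$hel*" and answered by a prefix
    # lookup for "o$hel" over these keys.
    print(permuterm_keys("hello"))
    # ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']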

Exercise 2: Characteristics of a collection and its index


Consider a collection made of 500 000 documents, each containing on average 800 words. The number of different words (i.e., not counting duplicates) is estimated at 700 000. For all questions, give your computation.

What is the size (in megabytes or gigabytes) of the collection when stored uncompressed on disk?
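A back-of-the-envelope sketch, assuming roughly 6 bytes per word (about 5 characters plus a separator, at 1 byte per character; substitute the constant used in class):

    docs = 500_000
    words_per_doc = 800
    bytes_per_word = 6          # assumption: ~5 chars + 1 separator, 1 byte each

    collection_bytes = docs * words_per_doc * bytes_per_word
    print(collection_bytes / 10**9, "GB")   # 2.4 GB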

Assuming the best dictionary reduction rate achievable with linguistic preprocessing (stop words, stemming), what is the size (number of terms) of the dictionary?

Consider an index where the average length of a non-positional postings list is 200. What is the estimated total number of postings in this index?

How many bytes would you allocate for encoding (without compression) a dictionary term? A non-positional posting?

What are the sizes (in megabytes or gigabytes) of the resulting dictionary and postings lists?

If you compress your dictionary using the dictionary-as-a-string method, what is the new size of the dictionary?
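A sketch tying the remaining estimates together. The reduction rate, per-term and per-posting byte budgets, and average term length below are assumptions in the spirit of the usual textbook conventions; substitute the values used in class. The postings total multiplies the (reduced) dictionary size by the average list length of 200:

    vocabulary = 700_000
    reduction_rate = 1 / 3          # assumed best reduction from stop words + stemming
    dictionary_terms = int(vocabulary * (1 - reduction_rate))

    avg_postings_per_term = 200
    total_postings = dictionary_terms * avg_postings_per_term

    bytes_per_term = 20             # assumed fixed-width term slot
    bytes_per_posting = 4           # assumed 4-byte document ID
    dictionary_bytes = dictionary_terms * bytes_per_term
    postings_bytes = total_postings * bytes_per_posting

    # Dictionary-as-a-string: terms are concatenated into one long string and
    # each entry keeps a 3-byte pointer into it, so short terms no longer
    # waste a fixed-width slot.
    avg_term_length = 8             # assumed average term length in bytes
    string_dictionary_bytes = dictionary_terms * (avg_term_length + 3)

    print(dictionary_terms, total_postings, dictionary_bytes, postings_bytes,
          string_dictionary_bytes)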

Exercise 3: Querying an index


What kinds of queries can be applied to the collection? For each of them, what index is needed?

Exercise 4: Linguistic preprocessing


Are the following statements true or false (justify your answer)?

a) Stemming increases retrieval precision.

b) Stemming only slightly reduces the size of the dictionary.

c) Stop lists contain all of the most frequent terms.

Exercise 5: Porter stemming


What would be the result of applying the Porter stemmer to the following words?

busses, rely, realised

What is the Porter measure of the following words (give your computation)?

crepuscular
rigorous
placement
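A sketch of how the Porter measure m can be computed: every word is mapped to the form [C](VC)^m[V], where C and V are maximal runs of consonants and vowels (with y counting as a vowel when preceded by a consonant), and m counts the VC pairs:

    def is_consonant(word, i):
        c = word[i]
        if c in "aeiou":
            return False
        if c == "y":                     # y is a vowel when preceded by a consonant
            return i == 0 or not is_consonant(word, i - 1)
        return True

    def porter_measure(word):
        classes = ["c" if is_consonant(word, i) else "v" for i in range(len(word))]
        # collapse runs of identical classes into single C / V symbols
        collapsed = [classes[0]]
        for cls in classes[1:]:
            if cls != collapsed[-1]:
                collapsed.append(cls)
        # m is the number of VC pairs in [C](VC)^m[V]
        return sum(1 for a, b in zip(collapsed, collapsed[1:])
                   if a == "v" and b == "c")

    # Examples from Porter's paper: tree -> 0, trouble -> 1, oaten -> 2
    print(porter_measure("tree"), porter_measure("trouble"), porter_measure("oaten"))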

Exercise 6: Index architecture


Propose a MapReduce architecture for creating language-specific indexes from a heterogeneous collection. You may illustrate this architecture with a figure.
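One possible shape for the answer, as a minimal single-process sketch (detect_language and tokenize are hypothetical helpers): the map phase keys each (term, docID) pair by detected language, the shuffle groups by (language, term), and the reduce phase emits the postings lists of one index per language:

    from collections import defaultdict

    def mapper(doc_id, text, detect_language, tokenize):
        # Emit ((language, term), doc_id) pairs; language detection routes
        # each document's postings to the right index.
        lang = detect_language(text)
        for term in tokenize(text, lang):
            yield (lang, term), doc_id

    def reducer(key, doc_ids):
        # One call per (language, term) group after the shuffle phase.
        lang, term = key
        return lang, term, sorted(set(doc_ids))

    def build_indexes(docs, detect_language, tokenize):
        # Single-process simulation of the shuffle between map and reduce.
        grouped = defaultdict(list)
        for doc_id, text in docs.items():
            for key, doc in mapper(doc_id, text, detect_language, tokenize):
                grouped[key].append(doc)
        indexes = defaultdict(dict)      # language -> {term: postings list}
        for key, doc_ids in grouped.items():
            lang, term, postings = reducer(key, doc_ids)
            indexes[lang][term] = postings
        return indexes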

Exercise 7: Index compression


What is the largest gap that can be encoded in 2 bytes using variable-byte encoding?

What postings list can be decoded from the variable-byte code 10001001 00000001 10000010 11111111?
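Under the usual convention (7 payload bits per byte, high bit set only on the last byte of each gap), two bytes carry at most 14 payload bits, and a decoder is a few lines. A sketch:

    def vb_decode(bytestream):
        # Gaps are accumulated 7 bits at a time; a set high bit ends one gap.
        gaps, n = [], 0
        for b in bytestream:
            if b < 128:                  # continuation byte
                n = 128 * n + b
            else:                        # final byte of this gap
                gaps.append(128 * n + (b - 128))
                n = 0
        return gaps

    gaps = vb_decode([0b10001001, 0b00000001, 0b10000010, 0b11111111])
    print(gaps)                          # [9, 130, 127]
    # turn gaps back into document IDs by cumulative sums
    doc_ids = [sum(gaps[:i + 1]) for i in range(len(gaps))]
    print(doc_ids)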

What would be the encoding of the same postings list using a γ-code?
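And a sketch of γ-encoding a gap G ≥ 1: the offset (the binary representation of G with its leading 1 removed) is preceded by the unary code of the offset's length:

    def gamma_encode(gap):
        # gap must be >= 1: gamma codes cannot represent 0
        offset = bin(gap)[3:]            # binary representation minus the leading 1
        length = "1" * len(offset) + "0" # unary code for len(offset)
        return length + offset

    print([gamma_encode(g) for g in (9, 130, 127)])
    # ['1110001', '111111100000010', '1111110111111']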

Exercise 8: Vector Space Model


Consider a collection made of the documents d1, d2, d3 with the following characteristics:

Term       tf_d1   tf_d2   tf_d3   df
actor      12      35      55      123
movie      15      24      48      240
trailer    52      13      12      85

Compute the vector representations of d1, d2, and d3 using tf-idf_{t,d} weighting and Euclidean normalization.

Compute the cosine similarities between these documents.

Give the ranking retrieved by the system for the query "movie trailer".
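A sketch of the whole pipeline. The total number of documents N is not restated in this exercise; N = 500 000 from Exercise 2 is one plausible reading, and the value below is that assumption:

    import math

    N = 500_000                          # assumed total document count (Exercise 2)
    terms = ["actor", "movie", "trailer"]
    df = {"actor": 123, "movie": 240, "trailer": 85}
    tf = {
        "d1": {"actor": 12, "movie": 15, "trailer": 52},
        "d2": {"actor": 35, "movie": 24, "trailer": 13},
        "d3": {"actor": 55, "movie": 48, "trailer": 12},
    }

    def tf_idf(weights):
        return [weights[t] * math.log10(N / df[t]) for t in terms]

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def cosine(u, v):                    # on unit vectors, cosine = dot product
        return sum(a * b for a, b in zip(u, v))

    vectors = {d: normalize(tf_idf(tf[d])) for d in tf}

    for a, b in [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]:
        print(a, b, round(cosine(vectors[a], vectors[b]), 3))

    # query "movie trailer": tf = 1 for each query term, 0 for "actor"
    q = normalize(tf_idf({"actor": 0, "movie": 1, "trailer": 1}))
    print(sorted(tf, key=lambda d: cosine(vectors[d], q), reverse=True))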

Exercise 9: Term weighting


Compute the vector representations of the documents introduced in the previous exercise using the ltn weighting scheme.
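A sketch of the ltn scheme in SMART notation (l: logarithmic term frequency 1 + log10(tf); t: idf log10(N/df); n: no normalization), under the same assumed N as above:

    import math

    N = 500_000                          # same assumption as in the previous sketch
    terms = ["actor", "movie", "trailer"]
    df = {"actor": 123, "movie": 240, "trailer": 85}
    tf = {
        "d1": {"actor": 12, "movie": 15, "trailer": 52},
        "d2": {"actor": 35, "movie": 24, "trailer": 13},
        "d3": {"actor": 55, "movie": 48, "trailer": 12},
    }

    def ltn(weights):
        # l: logarithmic tf, t: idf, n: no length normalization
        return [(1 + math.log10(weights[t])) * math.log10(N / df[t])
                if weights[t] > 0 else 0.0
                for t in terms]

    for d in tf:
        print(d, [round(x, 3) for x in ltn(tf[d])])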

Exercise 10: Index architecture (extra credit)


Consider a hashtable as a structure mapping keys to values using a hash function h such that h(key) = value. What problem may arise from such a structure when inserting new key-value pairs?

What workaround would you propose for this insertion? Give an algorithm for inserting a key-value pair.
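One standard workaround is separate chaining: each bucket holds a small list of key-value pairs, so two keys that hash to the same bucket coexist instead of overwriting each other. A minimal sketch:

    class ChainedHashTable:
        def __init__(self, n_buckets=1024):
            self.buckets = [[] for _ in range(n_buckets)]

        def insert(self, key, value):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for i, (k, _) in enumerate(bucket):
                if k == key:             # key already present: update in place
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))  # collision or empty slot: chain the pair

        def get(self, key):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for k, v in bucket:
                if k == key:
                    return v
            raise KeyError(key)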
