
1)
a) Terms that occur in the document and exactly match the query
b) The frequency of terms that exactly match the query
c) The term that occurs the least across all the documents in the corpus
d) A term that has a high frequency in the given document and a low document frequency in the whole collection of documents/corpus
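Part (d) is the classic tf*idf weighting. A minimal sketch, assuming raw term frequency and a natural-log inverse document frequency (the function name and toy corpus are illustrative):

import math

def tf_idf(term, doc, corpus):
    # tf: raw count of the term in this document
    tf = doc.count(term)
    # df: number of documents in the corpus containing the term
    df = sum(1 for d in corpus if term in d)
    # idf: log of the inverse document frequency; 0 if the term never occurs
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["apple", "banana", "apple"], ["banana", "cherry"], ["cherry", "cherry"]]
print(tf_idf("apple", corpus[0], corpus))  # 2 * ln(3/1), roughly 2.197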

2)
a) No, because as long as one of the query words is missing from the document, the returned score will be zero. For example, missing 1 out of 4 query words scores the same as missing 3 out of 4. I would modify the formula to become (1 - λ)P(qi|D) + λP(qi|C), where:
λ is a parameter that weights the background probability; different forms of estimation come from different values of λ
P(qi|C) is the probability for query word i in the collection language model for collection C (i.e., the background probability)
b) Smoothing is a technique for estimating probabilities for missing (or unseen) words. When the parameter λ is close to 0, documents that contain all of the query's terms are ranked higher (this is better for short queries). When λ approaches 1, the query acts more like a Boolean OR or a coordination-level match: documents that contain a larger number of high-probability words are ranked higher, so missing a word is much less important (this is better for longer queries).
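A minimal sketch of query-likelihood scoring with this linear (Jelinek-Mercer style) smoothing; the function name and toy data are illustrative:

import math

def ql_score(query, doc, collection, lam=0.5):
    # log query likelihood with linear smoothing:
    # sum over query terms of log((1 - lam) * P(qi|D) + lam * P(qi|C))
    coll_tokens = [t for d in collection for t in d]
    score = 0.0
    for q in query:
        p_d = doc.count(q) / len(doc)                  # P(qi|D), document model
        p_c = coll_tokens.count(q) / len(coll_tokens)  # P(qi|C), background model
        score += math.log((1 - lam) * p_d + lam * p_c)
    return score

docs = [["cat", "sat", "mat"], ["dog", "sat", "log"]]
# "cat" is missing from docs[1], yet the background term keeps its score finite
print(ql_score(["cat", "sat"], docs[0], docs))
print(ql_score(["cat", "sat"], docs[1], docs))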

3) For static filtering, the basic operation an indexer must support is: whenever a new document enters the system, the filtering system must decide whether or not the new document is relevant with respect to each profile. If the new document is deemed relevant to a given profile, a link is drawn connecting the new document to that profile and the document is returned to the user.
For adaptive filtering, whenever a new document enters the system, the document is delivered to a profile, the user provides feedback about the document, and the profile is then updated and used for matching future incoming documents. This system responds according to the user's relevance feedback.
For collaborative filtering, the system must be able to associate a single profile with each user. Then, it must provide a rating for every incoming item, as well as for every item in the database for which the current user has not explicitly provided a rating/judgement. In the end, it tries to predict as accurately as possible the ratings that users would give to each document/item.

The similarity/distance between two users can be changed into calculating the similarity/distance between two items by calculating how much the ratings by common users for a pair of items deviate from the average ratings for those items:

sim(i,j) = cos(i,j) = Σu∈U (Ru,i - R̄i)(Ru,j - R̄j) / ( sqrt(Σu∈U (Ru,i - R̄i)^2) * sqrt(Σu∈U (Ru,j - R̄j)^2) )

Where Ru,i = the rating user u gave to item i, R̄i = the average rating of item i, and U = the set of users who rated both items,
and where i = the item to be compared and j = the other item to be compared.
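A small sketch of this item-item similarity, assuming per-item ratings stored as {user: rating} dicts (the data and names are illustrative):

import math

def item_sim(ratings_i, ratings_j):
    # adjusted-cosine similarity between two items over users who rated both
    common = set(ratings_i) & set(ratings_j)
    if not common:
        return 0.0
    avg_i = sum(ratings_i.values()) / len(ratings_i)  # average rating of item i
    avg_j = sum(ratings_j.values()) / len(ratings_j)  # average rating of item j
    num = sum((ratings_i[u] - avg_i) * (ratings_j[u] - avg_j) for u in common)
    den_i = math.sqrt(sum((ratings_i[u] - avg_i) ** 2 for u in common))
    den_j = math.sqrt(sum((ratings_j[u] - avg_j) ** 2 for u in common))
    return num / (den_i * den_j) if den_i and den_j else 0.0

# {user: rating} for each item
item_a = {"u1": 5, "u2": 3, "u3": 4}
item_b = {"u1": 4, "u2": 2, "u3": 5}
print(item_sim(item_a, item_b))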

4)
a) P@5 = 4/5 = 0.8
P@10 = 7/10 = 0.7
b) R@5 = 4/9 = 0.444
R@10 = 7/9 = 0.778
c) Mean average precision (MAP) = (1 + 1 + 3/4 + 4/5 + 5/7 + 6/8 + 7/10 + 8/11 + 9/14)/9 = 0.7872
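These values can be checked with a short script; the 0/1 relevance list below is the ranking implied by the precision terms above (relevant documents at ranks 1, 2, 4, 5, 7, 8, 10, 11, and 14, with 9 relevant documents in total):

def precision_at(k, rels):
    # fraction of the top k results that are relevant
    return sum(rels[:k]) / k

def recall_at(k, rels, total_relevant):
    # fraction of all relevant documents found in the top k
    return sum(rels[:k]) / total_relevant

def average_precision(rels, total_relevant):
    # average of the precision values at each rank holding a relevant document
    hits, ap = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            ap += hits / i
    return ap / total_relevant

# 0/1 relevance of the top 14 results (relevant at ranks 1, 2, 4, 5, 7, 8, 10, 11, 14)
rels = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
print(precision_at(5, rels), precision_at(10, rels))  # 0.8 0.7
print(recall_at(5, rels, 9), recall_at(10, rels, 9))  # 0.444... 0.777...
print(average_precision(rels, 9))                     # roughly 0.7872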

d) DCGp = rel1 + Σi=2..p reli / log2(i)
DCG5 = 1 + 4/log2(2) + 0/log2(3) + 1/log2(4) + 1/log2(5) = 5.931

e) Perfect ranking: 4 4 4 2 1 1 1 1 1 0 0 0 0 0 0
IDEAL DCG1 = 4
IDEAL DCG2 = 8
IDEAL DCG3 = 10.5237
IDEAL DCG4 = 11.5237
IDEAL DCG5= 11.9544
NDCG5 = 5.931 / 11.9544 = 0.496
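The DCG and NDCG arithmetic can be verified with a few lines; the graded relevance of the top 5 results (1, 4, 0, 1, 1) comes from part (d) and the ideal top 5 from part (e):

import math

def dcg(rels):
    # DCGp = rel1 + sum over i = 2..p of rel_i / log2(i)
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

ranked = [1, 4, 0, 1, 1]  # graded relevance of the top 5 results, from part (d)
ideal = [4, 4, 4, 2, 1]   # top 5 of the perfect ranking, from part (e)
print(dcg(ranked))                # roughly 5.931
print(dcg(ranked) / dcg(ideal))   # NDCG5, roughly 0.496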
f)

[Figure: Precision-Recall graph, followed by the interpolated Precision-Recall graph; precision on the y-axis and recall on the x-axis, both marked from 0 to 1 in steps of 1/5.]

g) A macroaverage computes the measure of interest for each query and then averages these measures.
A microaverage combines all the applicable data points from every query and computes the
measure from the combined data.
The microaverage and macroaverage equal each other when the number of queries is 1 or the
number of documents is 1.
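An illustrative comparison of the two averages (the per-query counts are made up):

# Two queries: query A retrieves 4 docs of which 2 are relevant,
# query B retrieves 10 docs of which 8 are relevant.
per_query = [(2, 4), (8, 10)]

# macroaverage: compute precision per query, then average the precisions
macro = sum(rel / ret for rel, ret in per_query) / len(per_query)
# microaverage: pool all the counts, then compute one precision
micro = sum(rel for rel, _ in per_query) / sum(ret for _, ret in per_query)
print(macro)  # (0.5 + 0.8) / 2 = 0.65
print(micro)  # 10 / 14, roughly 0.714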

5)

a) Using the Rocchio algorithm, assuming parameters b = 14 and c = 4:

q' = q + b * (1 / number of relevant docs) * (sum of relevant doc vectors) - c * (1 / number of non-relevant docs) * (sum of non-relevant doc vectors)

The new query weights will definitely be larger than before, since the number of relevant docs outweighs the number of non-relevant docs by a lot in the top 10 results. Thus this becomes:

q' = q + 14 * (1/7) * (sum of relevant doc vectors) - 4 * (1/3) * (sum of non-relevant doc vectors)
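A sketch of this update with the stated b = 14 and c = 4, representing query and document vectors as {term: weight} dicts (the toy vectors are illustrative; many implementations also clip negative weights to zero):

def rocchio(query, rel_docs, nonrel_docs, b=14, c=4):
    # q' = q + b * centroid(relevant docs) - c * centroid(non-relevant docs)
    new_q = dict(query)
    for docs, coeff in ((rel_docs, b / len(rel_docs)),
                        (nonrel_docs, -c / len(nonrel_docs))):
        for d in docs:
            for term, w in d.items():
                new_q[term] = new_q.get(term, 0.0) + coeff * w
    return new_q

# 7 relevant and 3 non-relevant documents in the top 10 (toy one-term vectors)
rel = [{"apple": 1.0}] * 7
nonrel = [{"pear": 1.0}] * 3
print(rocchio({"apple": 1.0}, rel, nonrel))
# {'apple': 15.0, 'pear': -4.0}: relevant terms boosted, non-relevant pushed down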

b) By using tf*idf and the cosine similarity formula.

Assume the set of categories is {c1, ..., cn}.
First, create n prototype vectors: for each category, calculate the tf*idf term vector of each training document, then sum all the document vectors in that category to get the prototype vector for that category.
Then, for each incoming document, calculate its tf*idf term vector and compute the cosine similarity between the document's term vector and each prototype vector. If the highest cosine similarity returned is more than a certain set threshold, update the most similar class prototype and return that class.
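A compact sketch of this prototype-based classifier, assuming the tf*idf vectors are already computed as {term: weight} dicts (the threshold, data, and names are illustrative):

import math

def cosine(v, w):
    # cosine similarity between two sparse {term: weight} vectors
    dot = sum(x * w.get(t, 0.0) for t, x in v.items())
    norm = (math.sqrt(sum(x * x for x in v.values()))
            * math.sqrt(sum(x * x for x in w.values())))
    return dot / norm if norm else 0.0

def build_prototypes(training):
    # sum the tf*idf vectors of each category's training docs into one prototype
    protos = {}
    for category, docs in training.items():
        proto = {}
        for d in docs:
            for term, w in d.items():
                proto[term] = proto.get(term, 0.0) + w
        protos[category] = proto
    return protos

def classify(doc, protos, threshold=0.1):
    # return the category whose prototype is most similar, if above the threshold
    best = max(protos, key=lambda cat: cosine(doc, protos[cat]))
    return best if cosine(doc, protos[best]) >= threshold else None

training = {"fruit": [{"apple": 1.0, "sweet": 0.5}],
            "pets": [{"dog": 1.0, "bark": 0.5}]}
protos = build_prototypes(training)
print(classify({"apple": 0.8, "dog": 0.1}, protos))  # fruit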
