Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 6: Scoring, Term Weighting, The Vector Space Model
Overview
Recap
Term frequency
tf-idf weighting
Heaps' law

Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log10 M = 0.49 log10 T + 1.64 is the best least squares fit.
Thus, M = 10^1.64 T^0.49 ≈ 44 T^0.49.
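As a small sketch (my own, not from the slides), the fitted line can be turned into a vocabulary-size estimator; the default parameters are the Reuters-RCV1 least-squares fit quoted above:

```python
def heaps_vocab_size(num_tokens, k=10 ** 1.64, b=0.49):
    """Estimate vocabulary size via Heaps' law: M = k * T^b.

    Defaults are the Reuters-RCV1 fit from the slide:
    log10 M = 0.49 log10 T + 1.64.
    """
    return k * num_tokens ** b

# For the first million tokens the fit predicts roughly 38,000 distinct terms.
print(round(heaps_vocab_size(1_000_000)))
```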
Zipf's law

The most frequent term (the) occurs cf1 times, the second most frequent term (of) occurs cf1/2 times, the third most frequent term (and) occurs cf1/3 times, etc. In general, the collection frequency of the i-th most frequent term is proportional to 1/i.
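A minimal sketch (mine, not from the slide) of the Zipfian prediction cf_i = cf_1 / i:

```python
def zipf_cf(cf1, rank):
    """Zipf's law: the rank-i most frequent term has collection
    frequency cf_i = cf_1 / i (a power law with slope -1)."""
    return cf1 / rank

# If "the" (rank 1) occurs 1,000,000 times, the law predicts:
print(zipf_cf(1_000_000, 2))  # "of" (rank 2): 500000.0
print(zipf_cf(1_000_000, 3))  # "and" (rank 3): about 333,333
```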
Dictionary as a string
Gap encoding
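To recap the idea behind this slide with a small sketch (my own illustration; the docIDs are made up): a postings list stores the gaps between successive docIDs instead of the docIDs themselves, which keeps the numbers small and highly compressible:

```python
def to_gaps(postings):
    """Turn an ascending list of docIDs into first docID + gaps."""
    return postings[:1] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Recover the docIDs by a cumulative sum over the gaps."""
    docids, running = [], 0
    for g in gaps:
        running += g
        docids.append(running)
    return docids

postings = [283042, 283043, 283044, 283045]
print(to_gaps(postings))  # [283042, 1, 1, 1]
assert from_gaps(to_gaps(postings)) == postings
```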
Compression of Reuters-RCV1

data structure                             size in MB
dictionary, fixed-width                          11.2
dictionary, term pointers into string             7.6
~, with blocking, k = 4                           7.1
~, with blocking & front coding                   5.9
collection (text, xml markup etc)              3600.0
collection (text)                               960.0
T/D incidence matrix                         40,000.0
postings, uncompressed (32-bit words)           400.0
postings, uncompressed (20 bits)                250.0
postings, variable byte encoded                 116.0
postings, γ-encoded                             101.0
Take-away today
Ranking search results: why it is important (as
opposed to just presenting a set of unordered
Boolean results)
Term frequency: This is a key ingredient for
ranking.
Tf-idf ranking: best known traditional ranking
scheme
Vector space model: One of the most important
formal models for information retrieval (along
with Boolean and probabilistic models)
Ranked retrieval

Thus far, our queries have all been Boolean: documents either match or don't.
JACCARD(A, A) = 1
JACCARD(A, B) = 0 if A ∩ B = ∅
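These two properties can be checked with a small sketch of the Jaccard coefficient as a matching score (my own illustration, treating query and document as sets of words):

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    if not (a | b):
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)

assert jaccard("abc", "abc") == 1.0   # JACCARD(A, A) = 1
assert jaccard("ab", "cd") == 0.0     # A ∩ B = ∅  =>  score 0
print(jaccard("ides of march".split(), "march is here".split()))  # 1/5 = 0.2
```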
Binary term-document incidence matrix (Shakespeare collection): each document is represented as a binary vector ∈ {0,1}^|V|.
Term-document count matrix: each document is represented as a count vector ∈ N^|V| (the full matrix appears on the Count matrix slide below).
Term frequency tf

The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how?
Raw term frequency is not what we want because:
A document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence of the term.
But not 10 times more relevant.
Log frequency weighting

The log frequency weight of term t in d is
w_{t,d} = 1 + log10 tf_{t,d} if tf_{t,d} > 0, and 0 otherwise.
tf_{t,d} → w_{t,d}: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
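The mapping above comes directly from the log-tf definition; a one-line sketch:

```python
import math

def log_tf_weight(tf):
    """w = 1 + log10(tf) for tf > 0, else 0 - reproduces the mapping
    0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
```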
Exercise

Compute the Jaccard matching score and the tf matching score for the following query-document pairs.
q: [information on cars]  d: all you've ever wanted to know about cars
q: [information on cars]  d: information on trucks, information on planes, information on trains
q: [red cars and red trucks]  d: cops stop red cars more often
Document frequency
We want high weights for rare terms like
ARACHNOCENTRIC.
We want low (positive) weights for frequent
words like GOOD, INCREASE and LINE.
We will use document frequency to factor this
into computing the matching score.
The document frequency is the number of
documents in the collection that the term
occurs in.
idf weight

df_t is the document frequency, the number of documents that t occurs in.
df_t is an inverse measure of the informativeness of term t.
We define the idf weight of term t as follows:
idf_t = log10 (N / df_t)
(N is the number of documents in the collection.)
Examples for idf (with N = 1,000,000):

df_t         idf_t
1                6
100              4
1,000            3
10,000           2
100,000          1
1,000,000        0
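The table can be reproduced directly from the definition (a sketch, assuming base-10 logs and N = 1,000,000 as in the table):

```python
import math

def idf(df, n_docs=1_000_000):
    """idf_t = log10(N / df_t): rare terms get high weights."""
    return math.log10(n_docs / df)

for df in (1, 100, 1000, 10_000, 100_000, 1_000_000):
    print(f"{df:>9}  {idf(df):.0f}")
```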
collection frequency vs. document frequency

word         collection frequency   document frequency
INSURANCE                   10440                 3997
TRY                         10422                 8760

Collection frequency of t: number of tokens of t in the collection.
Document frequency of t: number of documents t occurs in.
Why these numbers? Which word is a better search term (and should get a higher weight)?
This example suggests that df (and idf) is better for weighting than cf (and icf).
tf-idf weighting

The tf-idf weight of a term is the product of its tf weight and its idf weight:
w_{t,d} = (1 + log10 tf_{t,d}) · log10 (N / df_t)
(tf-weight · idf-weight)
Best known weighting scheme in information retrieval.
Note: the "-" in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf x idf
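Combining the two factors gives a short sketch (using the log-tf variant defined earlier):

```python
import math

def tfidf(tf, df, n_docs):
    """tf-idf weight: (1 + log10 tf) * log10(N / df); 0 if tf == 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# A term with tf = 10 in a document, occurring in 1,000 of N = 1,000,000 docs:
print(tfidf(10, 1000, 1_000_000))  # (1 + 1) * 3 = 6.0
```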
Summary: tf-idf

Assign a tf-idf weight for each term t in each document d:
w_{t,d} = (1 + log10 tf_{t,d}) · log10 (N / df_t)
The tf-idf weight . . .
. . . increases with the number of occurrences within a document. (term frequency)
. . . increases with the rarity of the term in the collection. (inverse document frequency)
Symbol      Definition
tf_{t,d}    term frequency: number of occurrences of t in d
df_t        document frequency: number of documents in the collection that t occurs in
cf_t        collection frequency: total number of occurrences of t in the collection

Relationship between df and cf?
Recap: the binary term-document incidence matrix. Each document is represented as a binary vector ∈ {0,1}^|V|.
Count matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   ...
            Cleopatra     Caesar   Tempest
ANTHONY           157         73        0        0         0         1
BRUTUS              4        157        0        2         0         0
CAESAR            232        227        0        2         1         0
CALPURNIA           0         10        0        0         0         0
CLEOPATRA          57          0        0        0         0         0
MERCY               2          0        3        8         5         8
WORSER              2          0        1        1         1         5

Each document is now represented as a count vector ∈ N^|V|.
tf-idf weight matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth   ...
            Cleopatra     Caesar   Tempest
ANTHONY          5.25       3.18     0.0      0.0      0.0       0.35
BRUTUS           1.21       6.10     0.0      1.0      0.0       0.0
CAESAR           8.59       2.54     0.0      1.51     0.25      0.0
CALPURNIA        0.0        1.54     0.0      0.0      0.0       0.0
CLEOPATRA        2.85       0.0      0.0      0.0      0.0       0.0
MERCY            1.51       0.0      1.90     0.12     5.25      0.88
WORSER           1.37       0.0      0.11     4.15     0.25      1.95

Each document is now represented as a real-valued vector of tf-idf weights ∈ R^|V|.
Documents as vectors

Each document is now represented as a real-valued vector of tf-idf weights ∈ R^|V|.
So we have a |V|-dimensional real-valued vector space.
Terms are axes of the space.
Documents are points or vectors in this space.
Very high-dimensional: tens of millions of dimensions when you apply this to web search engines.
Each vector is very sparse: most entries are zero.
Queries as vectors

Key idea 1: do the same for queries: represent them as vectors in the high-dimensional space.
Key idea 2: rank documents according to their proximity to the query.
proximity = similarity
proximity ≈ negative distance
Recall: we're doing this because we want to get away from the you're-either-in-or-out, feast-or-famine Boolean model.
Instead: rank relevant documents higher than nonrelevant documents.
Cosine similarity between query and document

cos(q, d) = (q · d) / (|q| |d|) = Σ_i q_i d_i / (√(Σ_i q_i²) · √(Σ_i d_i²))

q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document. |q| and |d| are the lengths of q and d.
Length normalization

How do we compute the cosine?
A vector can be (length-) normalized by dividing each of its components by its length; here we use the L2 norm:
||x||_2 = √(Σ_i x_i²)
This maps vectors onto the unit sphere . . .
. . . since after normalization: ||x||_2 = 1.
As a result, longer documents and shorter documents have weights of the same order of magnitude.
Effect on the two documents d and d′ (d appended to itself) from an earlier slide: they have identical vectors after length-normalization.
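A sketch of L2 normalization and the d vs. d′ effect (the raw counts here are made-up illustration values):

```python
import math

def l2_normalize(v):
    """Divide every component by the vector's L2 length ||v||_2."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def cosine(u, v):
    """Cosine similarity: dot product of the two normalized vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

d = [3.0, 0.0, 4.0]    # raw term counts of d
d2 = [6.0, 0.0, 8.0]   # d appended to itself: every count doubles
assert l2_normalize(d) == l2_normalize(d2)  # identical after normalization
print(cosine(d, d2))   # maximal similarity (≈ 1.0)
```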
For length-normalized vectors, cosine similarity is simply the dot product:
cos(q, d) = q · d (if q and d are length-normalized).
Cosine: Example

How similar are these novels?
SaS: Sense and Sensibility
PaP: Pride and Prejudice
WH: Wuthering Heights

term frequencies (counts):

term        SaS   PaP   WH
AFFECTION   115    58   20
JEALOUS      10     7   11
GOSSIP        2     0    6
WUTHERING     0     0   38
Cosine: Example

log frequency weighting (w = 1 + log10 tf, 0 if tf = 0):

term        SaS    PaP    WH
AFFECTION   3.06   2.76   2.30
JEALOUS     2.00   1.85   2.04
GOSSIP      1.30   0      1.78
WUTHERING   0      0      2.58
Cosine: Example

log frequency weighting & cosine normalization:

term        SaS     PaP     WH
AFFECTION   0.789   0.832   0.524
JEALOUS     0.515   0.555   0.465
GOSSIP      0.335   0.0     0.405
WUTHERING   0.0     0.0     0.588

cos(SaS,PaP) ≈ 0.789 · 0.832 + 0.515 · 0.555 + 0.335 · 0.0 + 0.0 · 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
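The example can be checked end-to-end with a short script (my own verification sketch; term order is AFFECTION, JEALOUS, GOSSIP, WUTHERING):

```python
import math

def log_tf(tf):
    """Log frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(v):
    """Cosine (L2) normalization."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Raw term frequencies from the slide.
sas = normalize([log_tf(t) for t in (115, 10, 2, 0)])
pap = normalize([log_tf(t) for t in (58, 7, 0, 0)])
wh  = normalize([log_tf(t) for t in (20, 11, 6, 38)])

dot = lambda u, v: sum(a * b for a, b in zip(u, v))
print(round(dot(sas, pap), 2))  # cos(SaS,PaP) = 0.94
print(round(dot(sas, wh), 2))   # cos(SaS,WH)  = 0.79
print(round(dot(pap, wh), 2))   # cos(PaP,WH)  = 0.69
```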
tf-idf example

We often use different weightings for queries and documents.
Notation: ddd.qqq
Example: lnc.ltn
document: logarithmic tf, no df weighting, cosine normalization
query: logarithmic tf, idf, no normalization
Isn't it bad to not idf-weight the document?
Example query: "best car insurance"
Example document: "car insurance auto insurance"
Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n'lized: document weights after cosine normalization; product: the product of final query weight and final document weight.

Document length ≈ 1.92, so e.g. 1/1.92 ≈ 0.52 and 1.3/1.92 ≈ 0.68.
Final similarity score between query and document:
Σ_i w_qi · w_di = 0 + 0 + 1.04 + 2.04 = 3.08
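The lnc.ltn computation can be reproduced as follows. The document frequencies and N = 1,000,000 are assumptions I chose so that the idf weights come out as 1.3, 2.0, and 3.0; note the slide's 3.08 comes from rounded intermediate weights, while exact arithmetic gives ≈ 3.07:

```python
import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

N = 1_000_000  # assumed collection size

# term: (tf in query, tf in document, assumed df)
terms = {
    "auto":      (0, 1, 5_000),
    "best":      (1, 0, 50_000),
    "car":       (1, 1, 10_000),
    "insurance": (1, 2, 1_000),
}

# ltn query weights: log tf * idf, no normalization
q = {t: log_tf(qtf) * math.log10(N / df) for t, (qtf, _, df) in terms.items()}

# lnc document weights: log tf, no idf, cosine-normalized
d_raw = {t: log_tf(dtf) for t, (_, dtf, _) in terms.items()}
length = math.sqrt(sum(w * w for w in d_raw.values()))
d = {t: w / length for t, w in d_raw.items()}

score = sum(q[t] * d[t] for t in terms)
print(round(score, 2))  # ≈ 3.07 with unrounded weights
```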
Questions?
Take-away today
Ranking search results: why it is important (as
opposed to just presenting a set of unordered
Boolean results)
Term frequency: This is a key ingredient for
ranking.
Tf-idf ranking: best known traditional ranking
scheme
Vector space model: One of the most important
formal models for information retrieval (along
with Boolean and probabilistic models)
Resources

Chapters 6 and 7 of IIR
Resources at http://ifnlp.org/ir
Vector space for dummies
Exploring the similarity space (Moffat and Zobel, 2005)
Okapi BM25 (a state-of-the-art weighting method, Section 11.4.3 of IIR)