Introduction to
Information Retrieval
Dictionaries and tolerant retrieval
This lecture
Dictionary data structures
Tolerant retrieval
Wild-card queries
Spelling correction
Soundex
A naive dictionary
An array of fixed-width structs, one per term:
char[20] (20 bytes) | int (4/8 bytes) | Postings * (4/8 bytes)
How do we store a dictionary in memory efficiently?
How do we quickly look up elements at query time?
Hashtables
Each vocabulary term is hashed to an
integer
Pros:
Lookup is faster than for a tree: O(1)
Cons:
No easy way to find minor variants:
judgment/judgement
No prefix search
[tolerant retrieval]
If vocabulary keeps growing, need to
occasionally do the expensive operation of
rehashing everything
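A minimal sketch of the trade-off listed above, using Python's built-in dict as the hash table. The terms and postings lists are made-up illustrations, and `lookup` and `prefix_scan` are hypothetical helper names.

```python
# Toy dictionary: term -> postings list (document IDs are invented for illustration).
dictionary = {
    "judgment": [1, 4],
    "judgement": [2],
    "hyperbole": [3],
    "hypothesis": [1, 5],
}

def lookup(term):
    """Exact-match lookup: one hash probe, O(1) on average."""
    return dictionary.get(term, [])

def prefix_scan(prefix):
    """A hash table keeps no term order, so a prefix query must test every key."""
    return sorted(t for t in dictionary if t.startswith(prefix))

print(lookup("judgment"))    # exact spelling is found
print(lookup("judgemnt"))    # a minor variant hashes elsewhere and finds nothing
print(prefix_scan("hyp"))    # works, but only by scanning the whole vocabulary
```

The exact lookup is cheap, while the variant and prefix cases show why the hash table alone does not support tolerant retrieval.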
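Tree: binary tree
[Figure: binary search tree over the dictionary. Internal nodes split the vocabulary by range (a-m at the top level, then a-hu, hy-m, n-sh, si-z); the leaves shown are aardvark, huygens, sickle, zygot.]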
Tree: B-tree
[Figure: B-tree node whose children cover the ranges a-hu, hy-m, and n-z]
Trees
Simplest: binary search tree
More usual: B-trees
Trees require a standard ordering of characters, and hence of strings, but we typically have one
Pros:
Solves the prefix problem (e.g., finding all terms starting with hyp)
Cons:
Slower: lookup is O(log M) for M terms, and this requires a balanced tree
Rebalancing binary trees is expensive
But B-trees mitigate the rebalancing problem
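The prefix lookup can be sketched with a sorted list and binary search, which stands in here for a balanced tree or B-tree (an assumption made to keep the example short). The function name `prefix_range` and the sample terms are illustrative, and the prefix is assumed to be non-empty.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return every term w with prefix <= w < upper, where upper is the
    prefix with its last character replaced by the next code point."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)   # "hyp" -> "hyq"
    lo = bisect_left(sorted_terms, prefix)           # O(log M) search
    hi = bisect_left(sorted_terms, upper)            # O(log M) search
    return sorted_terms[lo:hi]

terms = sorted(["huygens", "hyena", "hyperbole", "hyphen",
                "hypothesis", "hysteria", "sickle"])
print(prefix_range(terms, "hyp"))
```

Both boundaries are found in O(log M) comparisons, and the matching terms sit in one contiguous run, which is what an ordered structure buys over a hash table.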
WILD-CARD QUERIES
Wild-card queries: *
mon*: find all docs containing any word beginning with mon.
Easy with a binary tree (or B-tree) lexicon: retrieve all words w in the range mon ≤ w < moo
*mon: find words ending in mon: harder
Maintain an additional B-tree for terms written backwards, i.e., each root-to-leaf path of the B-tree corresponds to a term in the dictionary written backwards. E.g., lemon would be represented as root-n-o-m-e-l.
Can then retrieve all words w in the range nom ≤ w < non.
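A small sketch of the backwards index for trailing-suffix queries such as *mon. A sorted list of reversed terms stands in for the second B-tree (an assumption for brevity); `build_reverse_index`, `suffix_match`, and the vocabulary are hypothetical.

```python
from bisect import bisect_left

def build_reverse_index(terms):
    """Sorted list of reversed terms, standing in for the backwards B-tree."""
    return sorted(t[::-1] for t in terms)

def suffix_match(reverse_index, suffix):
    """Terms ending in `suffix`, found as a prefix range over reversed terms."""
    key = suffix[::-1]                                # "mon" -> "nom"
    upper = key[:-1] + chr(ord(key[-1]) + 1)          # "nom" -> "non"
    lo = bisect_left(reverse_index, key)
    hi = bisect_left(reverse_index, upper)
    return sorted(r[::-1] for r in reverse_index[lo:hi])

vocabulary = ["lemon", "salmon", "sermon", "money", "month", "moon"]
rev = build_reverse_index(vocabulary)
print(suffix_match(rev, "mon"))
```

Note that moon is excluded: reversed it reads noom, which sorts after the upper bound non.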
Query processing
At this point, we have an enumeration of all terms
in the dictionary that match the wild-card query.
We still have to look up the postings for each
enumerated term.
E.g., consider the query:
se*ate AND fil*er
This may result in the execution of many Boolean
AND queries.
How do we handle a * in the middle of a term, as in co*tion?
We could look up co* AND *tion in a B-tree and intersect the two term sets
Expensive
Permuterm index
A special index for general wildcard queries is the
permuterm index.
First, we introduce a special symbol $ into our character set to mark the end of a term, e.g., hello => hello$.
Next, we construct a permuterm index, in which the various rotations of each term (augmented with $) are linked to the original vocabulary term.
[Figure: permuterm vocabulary]
Permuterm index
Consider the wildcard query m*n.
The key is to rotate such a wildcard query so that the * symbol appears at the end of the string; with the end-of-term marker added, m*n$ is rotated to n$m*.
Next, we look up this string in the permuterm index, where seeking
n$m* (via a search tree) leads to rotations of (among others) the terms
man and moron.
Now that the permuterm index enables us to identify the original
vocabulary terms matching a wildcard query, we look up these terms in
the standard inverted index to retrieve matching documents.
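The rotate-then-prefix-search procedure can be sketched as follows. This is a toy version under stated assumptions: the query contains exactly one *, a sorted list of (rotation, term) pairs replaces the search tree, and all function names and the vocabulary are illustrative.

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of the term augmented with the end marker '$'."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(vocabulary):
    """Sorted (rotation, original term) pairs; each rotation links back to its term."""
    return sorted((rot, term) for term in vocabulary for rot in rotations(term))

def wildcard_lookup(permuterm, query):
    """Answer a query with exactly one '*' by a prefix scan over the rotations."""
    head, tail = query.split("*")                 # "m*n" -> ("m", "n")
    key = tail + "$" + head                       # rotated query without '*': "n$m"
    upper = key[:-1] + chr(ord(key[-1]) + 1)      # exclusive upper bound of the range
    lo = bisect_left(permuterm, (key,))
    hi = bisect_left(permuterm, (upper,))
    return sorted({term for _, term in permuterm[lo:hi]})

vocabulary = ["man", "moron", "moon", "mat", "name", "men"]
permuterm = build_permuterm(vocabulary)
print(wildcard_lookup(permuterm, "m*n"))
```

The returned terms would then be looked up in the standard inverted index. The cost of this design is size: a term of length n contributes n + 1 rotations.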
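Enumerate the bigrams of every term. E.g., the terms april, is, the, cruelest, month yield:
$a, ap, pr, ri, il, l$
$i, is, s$
$t, th, he, e$
$c, cr, ru, ue, el, le, es, st, t$
$m, mo, on, nt, th, h$
$ is a special word boundary symbol.
Maintain a second inverted index from bigrams to the dictionary terms that contain each bigram.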
[Figure: bigram index example; each bigram points to the terms containing it]
$m → mace, madden
mo → among, amortize
on → along, among
Processing wild-cards
Query mon* can now be run as
$m AND mo AND on
This gets the terms that match the AND version of our wildcard query.
But we'd also enumerate moon, which contains all three bigrams yet does not match mon*.
Must post-filter these terms against the query.
Surviving enumerated terms are then looked up in
the term-document inverted index.
Fast, space efficient (compared to permuterm).
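A rough sketch of wildcard processing with a bigram index, including the post-filtering step. The helper names and vocabulary are hypothetical, and the standard-library `fnmatchcase` is used as the post-filter on the assumption that terms and queries contain only letters and `*`.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def bigrams(s):
    """Set of all 2-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def build_bigram_index(vocabulary):
    """Second inverted index: bigram -> set of terms containing it ($-padded)."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in bigrams("$" + term + "$"):
            index[gram].add(term)
    return index

def candidate_terms(index, query):
    """AND together the postings of every bigram in the '*'-free query pieces."""
    grams = set()
    for piece in ("$" + query + "$").split("*"):   # "$mon*$" -> "$mon", "$"
        grams |= bigrams(piece)
    if not grams:
        return set()
    return set.intersection(*(index.get(g, set()) for g in grams))

def wildcard_terms(index, query):
    """Post-filter the candidates against the original wildcard pattern."""
    return sorted(t for t in candidate_terms(index, query) if fnmatchcase(t, query))

vocabulary = ["moon", "month", "money", "lemon", "mace"]
index = build_bigram_index(vocabulary)
print(sorted(candidate_terms(index, "mon*")))   # still contains the false positive
print(wildcard_terms(index, "mon*"))            # false positive removed
```

The candidate set is a superset of the true matches, so the filter is required for correctness, not just as an optimization.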
SPELLING CORRECTION
Spell correction
There are two basic principles underlying most
spelling correction algorithms.
Of various alternative correct spellings for a misspelled query, choose the nearest one.
When two correctly spelled queries are tied (or nearly tied), select the one that is more common.
Spell correction
Two principal uses
Correcting document(s) being indexed
Correcting user queries to retrieve the right answers
Context-sensitive correction
Look at surrounding words, e.g., "I flew form Heathrow to Delhi."
Document correction
Especially needed for OCRed documents
Correction algorithms are tuned for this:
Can use domain-specific knowledge
E.g., OCR can confuse O and D more often than it
would confuse O and I (adjacent on the QWERTY
keyboard, so more likely interchanged in typing).
Jaccard coefficient (J.C.) of two sets X and Y: |X ∩ Y| / |X ∪ Y|
Equals 1 when X and Y have the same elements and 0 when they are disjoint
X and Y don't have to be of the same size
Always assigns a number between 0 and 1
Now threshold to decide if you have a match
E.g., if J.C. > 0.8, declare a match
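The overlap measure and threshold test above can be sketched on bigram sets. The function names are illustrative, and returning 0.0 when both sets are empty is an assumption made here, since the ratio is undefined in that case.

```python
def bigrams(term):
    """Set of bigrams of a term (no boundary symbol, for simplicity)."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def jaccard(x, y):
    """Size of the intersection divided by size of the union."""
    union = x | y
    if not union:
        return 0.0   # assumption: treat two empty sets as no match
    return len(x & y) / len(union)

def is_match(query, term, threshold=0.8):
    """Declare a match when the Jaccard coefficient exceeds the threshold."""
    return jaccard(bigrams(query), bigrams(term)) > threshold

print(jaccard(bigrams("lord"), bigrams("lore")))   # 2 shared of 4 distinct bigrams
print(is_match("lord", "lord"))
```

Because the measure normalizes by the union, long terms that merely contain a few of the query's bigrams score low.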
Matching bigrams
Consider the query lord: we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)
lo → alone, lore, sloth
or → border, lore, morbid
rd → ardent, border, card
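The lord example can be reproduced with a single merge over the bigram postings, counting how many of the query's bigrams each term shares. This is a sketch; `build_index` and `overlap_candidates` are hypothetical names, and bigrams are taken without boundary symbols to match the three bigrams listed above.

```python
from collections import Counter, defaultdict

def bigrams(term):
    """Set of bigrams of a term, without boundary symbols."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def build_index(vocabulary):
    """Bigram -> set of vocabulary terms containing that bigram."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in bigrams(term):
            index[gram].add(term)
    return index

def overlap_candidates(index, query, min_overlap):
    """Terms sharing at least `min_overlap` bigrams with the query."""
    counts = Counter()
    for gram in bigrams(query):
        for term in index.get(gram, ()):
            counts[term] += 1
    return sorted(t for t, c in counts.items() if c >= min_overlap)

vocabulary = ["alone", "lore", "sloth", "border", "morbid", "ardent", "card"]
index = build_index(vocabulary)
print(overlap_candidates(index, "lord", 2))
```

Only lore (lo, or) and border (or, rd) reach the threshold; every other term shares a single bigram.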
Context-sensitive correction
Need surrounding context to catch this.
First idea: retrieve dictionary terms close (in
weighted edit distance) to each query term
Now try all possible resulting phrases with one word corrected at a time
flew from heathrow
fled form heathrow
flea form heathrow
Hit-based spelling correction: Suggest the
alternative that has lots of hits.
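The one-word-at-a-time enumeration can be sketched as below. The alternatives table is a toy stand-in for the edit-distance lookup, and `one_word_corrections` is a hypothetical name; ranking by hit counts is left out because it needs real query or index statistics.

```python
def one_word_corrections(query_terms, alternatives):
    """Enumerate phrases that differ from the query in exactly one position.

    `alternatives` maps a query term to nearby dictionary terms; here it is
    supplied by hand, in practice it would come from an edit-distance search.
    """
    phrases = []
    for i, term in enumerate(query_terms):
        for alt in alternatives.get(term, []):
            corrected = query_terms[:i] + [alt] + query_terms[i + 1:]
            phrases.append(" ".join(corrected))
    return phrases

query = ["flew", "form", "heathrow"]
alternatives = {"flew": ["fled", "flea"], "form": ["from"]}
for phrase in one_word_corrections(query, alternatives):
    print(phrase)
```

Each candidate phrase would then be scored, for example by its number of hits, and the best-scoring one suggested to the user.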
Exercise
Suppose that for flew form Heathrow
we have 7 alternatives for flew, 19 for
form and 3 for heathrow.
How many corrected phrases will we
enumerate in this scheme?
Another approach
Break phrase query into a conjunction of
biwords.
Look for biwords that need only one term
corrected.
Enumerate only phrases containing common (popular) biwords.
SOUNDEX
Soundex
Our final technique for tolerant retrieval has to do
with phonetic correction: misspellings that arise
because the user types a query that sounds like
the target term.
Such algorithms are especially applicable to
searches on the names of people.
The main idea here is to generate, for each term,
a phonetic hash so that similar-sounding terms
hash to the same value.
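As an illustration of a phonetic hash, here is a sketch of the classic four-character Soundex code as it is commonly described in IR textbooks. The exact steps are not spelled out in this text, so the letter classes and the handling of the first letter (kept as-is, with only the remaining letters encoded) should be read as assumptions.

```python
# Letter classes of the commonly described Soundex variant (assumed here).
CODES = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(term):
    """Return a 4-character code: first letter plus up to three digits."""
    letters = [c for c in term.upper() if c in CODES]   # drop non-letters
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    digits = [CODES[c] for c in rest]
    collapsed = []
    for d in digits:                       # merge runs of identical digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    code = first + "".join(d for d in collapsed if d != "0")   # drop zeros
    return (code + "000")[:4]              # pad with zeros, keep four positions

print(soundex("Herman"), soundex("Hermann"))
```

Terms that receive the same code are treated as phonetic matches, so a Soundex index maps each code to the vocabulary terms that produce it.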
Soundex continued
4.
Queries such as
(SPELL(moriset) /3 toron*to) OR
SOUNDEX(chaikofski)
Exercise
Draw yourself a diagram showing the various
indexes in a search engine incorporating all
the functionality we have talked about
Identify some of the key design choices in
the index pipeline:
Does stemming happen before the Soundex
index?
What about n-grams?