
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning and Prabhakar Raghavan

Lecture 2: The term vocabulary and postings lists


Recap of the previous lecture (Ch. 1)

- Basic inverted indexes:
  - Structure: dictionary and postings
  - Key step in construction: sorting
- Boolean query processing:
  - Intersection by linear-time "merging" of postings lists
  - Simple optimizations
- Overview of course topics


Plan for this lecture

- Elaborate basic indexing
- Preprocessing to form the term vocabulary
  - Documents
  - Tokenization
  - What terms do we put in the index?
- Postings
  - Faster merges: skip lists
  - Positional postings and phrase queries


Recall the basic indexing pipeline

  Documents to be indexed ("Friends, Romans, countrymen.")
    → Tokenizer → token stream: Friends Romans Countrymen
    → Linguistic modules → modified tokens: friend roman countryman
    → Indexer → inverted index:
         friend     → 2 → 4
         roman      → 1 → 2
         countryman → 13 → 16
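To make the pipeline concrete, here is a minimal sketch in Python. This is an illustration, not the course's reference code: the tokenizer, the "linguistic module" (just case folding and a crude plural chop), and the index layout are toy stand-ins for the components discussed in the rest of this lecture.

```python
from collections import defaultdict

def tokenize(text):
    # Toy tokenizer: split on whitespace, strip surrounding punctuation.
    toks = (t.strip('.,;:!?"') for t in text.split())
    return [t for t in toks if t]

def normalize(token):
    # Toy "linguistic module": case-fold and chop a plural -s
    # (a crude stand-in for the stemming/lemmatization slides below).
    token = token.lower()
    return token[:-1] if token.endswith('s') and len(token) > 3 else token

def build_index(docs):
    # docs: {doc_id: text} -> {term: sorted list of doc_ids}
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            postings[normalize(tok)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

print(build_index({1: "Friends, Romans, countrymen.", 2: "Roman friends!"}))
# {'friend': [1, 2], 'roman': [1, 2], 'countrymen': [1]}
```

Each stage is a separate function precisely so that the choices discussed on the following slides (hyphen handling, stop lists, stemming) can be swapped in independently.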

Parsing a document (Sec. 2.1)

- What format is it in?
  - pdf / word / excel / html?
- What language is it in?
- What character set is in use?

Each of these is a classification problem, which we will study later in
the course. But these tasks are often done heuristically ...


Complications: Format/language (Sec. 2.1)

- Documents being indexed can include docs from many different
  languages
  - A single index may have to contain terms of several languages.
- Sometimes a document or its components can contain multiple
  languages/formats
  - French email with a German pdf attachment.
- What is a unit document?
  - A file?
  - An email? (Perhaps one of many in an mbox.)
  - An email with 5 attachments?
  - A group of files (PPT or LaTeX as HTML pages)


TOKENS AND TERMS


Tokenization (Sec. 2.2.1)

- Input: "Friends, Romans and Countrymen"
- Output: tokens
  - Friends
  - Romans
  - Countrymen
- A token is an instance of a sequence of characters
- Each such token is now a candidate for an index entry, after further
  processing
  - Described below
- But what are valid tokens to emit?


Tokenization (Sec. 2.2.1)

- Issues in tokenization:
  - Finland's capital → Finland? Finlands? Finland's?
  - Hewlett-Packard → Hewlett and Packard as two tokens?
    - state-of-the-art: break up hyphenated sequence.
    - co-education
    - lowercase, lower-case, lower case?
    - It can be effective to get the user to put in possible hyphens
  - San Francisco: one token or two?
    - How do you decide it is one token?


Numbers (Sec. 2.2.1)

- 3/20/91     Mar. 12, 1991     20/3/91
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e
- (800) 234-2333
  - Often have embedded spaces
- Older IR systems may not index numbers
  - But often very useful: think about things like looking up error
    codes/stacktraces on the web
  - (One answer is using n-grams: Lecture 3)
- Will often index "meta-data" separately
  - Creation date, format, etc.
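The policy questions above (split hyphens? index numbers?) can be made explicit parameters. A hedged sketch, with a hand-rolled regex that is only illustrative of the token classes just discussed:

```python
import re

# One alternative per token class; order matters (earlier wins).
TOKEN_RE = re.compile(r"""
    \d{1,2}/\d{1,2}/\d{2,4}        # dates like 3/20/91
  | \(\d{3}\)\s*\d{3}-\d{4}        # phone numbers like (800) 234-2333
  | \w+(?:[-']\w+)*                # words, incl. hyphens / apostrophes
""", re.VERBOSE)

def tokenize(text, split_hyphens=True, index_numbers=True):
    tokens = []
    for tok in TOKEN_RE.findall(text):
        if not index_numbers and any(c.isdigit() for c in tok):
            continue                 # older-IR-style policy: drop numbers
        if split_hyphens and '-' in tok and not any(c.isdigit() for c in tok):
            tokens.extend(tok.split('-'))   # Hewlett-Packard -> two tokens
        else:
            tokens.append(tok)               # but keep B-52 whole
    return tokens

print(tokenize("Hewlett-Packard built the B-52; call (800) 234-2333."))
# ['Hewlett', 'Packard', 'built', 'the', 'B-52', 'call', '(800) 234-2333']
```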


Tokenization: language issues (Sec. 2.2.1)

- French
  - L'ensemble → one token or two?
    - L ? L' ? Le ?
    - Want l'ensemble to match with un ensemble
      - Until at least 2003, it didn't on Google
      - Internationalization!
- German noun compounds are not segmented
  - Lebensversicherungsgesellschaftsangestellter
  - 'life insurance company employee'
  - German retrieval systems benefit greatly from a compound splitter
    module
    - Can give a 15% performance boost for German


Tokenization: language issues (Sec. 2.2.1)

- Chinese and Japanese have no spaces between words:
  - 莎拉波娃现在居住在美国东南部的佛罗里达。
  - Not always guaranteed a unique tokenization
- Further complicated in Japanese, with multiple alphabets
  intermingled
  - Dates/amounts in multiple formats:
    フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
    (Katakana, Hiragana, Kanji, Romaji)
  - End-user can express a query entirely in hiragana!

Tokenization: language issues (Sec. 2.2.1)

- Arabic (or Hebrew) is basically written right to left, but with
  certain items like numbers written left to right
- Words are separated, but letter forms within a word form complex
  ligatures

  [Arabic example sentence, lost in extraction; it reads right to
  left, with the numbers '1962' and '132' read left to right:]
  'Algeria achieved its independence in 1962 after 132 years of
  French occupation.'

- With Unicode, the surface presentation is complex, but the stored
  form is straightforward


Stop words (Sec. 2.2.2)

- With a stop list, you exclude from the dictionary entirely the
  commonest words. Intuition:
  - They have little semantic content: the, a, and, to, be
  - There are a lot of them: ~30% of postings for top 30 words
- But the trend is away from doing this:
  - Good compression techniques (lecture 5) mean the space for
    including stop words in a system is very small
  - Good query optimization techniques (lecture 7) mean you pay little
    at query time for including stop words.
- You need them for:
  - Phrase queries: "King of Denmark"
  - Various song titles, etc.: "Let it be", "To be or not to be"
  - "Relational" queries: "flights to London"
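A minimal sketch of the classical approach, assuming tokens arrive as a plain Python list: take the k commonest terms as the stop list and drop them before indexing.

```python
from collections import Counter

def build_stop_list(token_stream, k=30):
    # The k commonest terms; the top 30 words account for roughly 30%
    # of postings in typical English text.
    return {term for term, _ in Counter(token_stream).most_common(k)}

tokens = "to be or not to be that is the question".split()
stops = build_stop_list(tokens, k=2)
print(stops)                                    # {'to', 'be'}
print([t for t in tokens if t not in stops])    # terms that reach the index
```

Note the slide's caveat: a phrase query like "To be or not to be" is unanswerable from this filtered index, which is one reason the trend is away from stopping.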


Normalization to terms (Sec. 2.2.3)

- We need to "normalize" words in indexed text as well as query words
  into the same form
  - We want to match U.S.A. and USA
- Result is terms: a term is a (normalized) word type, which is an
  entry in our IR system dictionary
- We most commonly implicitly define equivalence classes of terms by,
  e.g.,
  - deleting periods to form a term
    - U.S.A., USA → USA
  - deleting hyphens to form a term
    - anti-discriminatory, antidiscriminatory → antidiscriminatory


Normalization: other languages (Sec. 2.2.3)

- Accents: e.g., French résumé vs. resume.
- Umlauts: e.g., German: Tuebingen vs. Tübingen
  - Should be equivalent
- Most important criterion:
  - How are your users likely to write their queries for these words?
- Even in languages that standardly have accents, users often may not
  type them
  - Often best to normalize to a de-accented term
    - Tuebingen, Tübingen, Tubingen → Tubingen
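A minimal sketch of such implicit equivalence classing (my illustration, not the deck's code): map every token to a canonical form by deleting periods and hyphens and stripping accents, and apply the same map to both documents and queries.

```python
import unicodedata

def normalize(token):
    # Delete periods and hyphens:
    #   U.S.A. -> USA,  anti-discriminatory -> antidiscriminatory
    token = token.replace('.', '').replace('-', '')
    # Strip accents: decompose (NFKD), then drop combining marks, so
    # Tübingen and Tubingen fall into the same equivalence class.
    # (Mapping German 'ue' -> 'u', as in Tuebingen, would need an
    # extra language-specific rule.)
    decomposed = unicodedata.normalize('NFKD', token)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

for t in ['U.S.A.', 'USA', 'anti-discriminatory', 'Tübingen', 'Tubingen']:
    print(t, '->', normalize(t))
```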


Normalization: other languages (Sec. 2.2.3)

- Normalization of things like date forms
  - 7月30日 vs. 7/30
  - Japanese use of kana vs. Chinese characters
- Tokenization and normalization may depend on the language and so is
  intertwined with language detection
  - Morgen will ich in MIT ... is this German "mit"?
- Crucial: need to "normalize" indexed text as well as query terms
  into the same form


Case folding (Sec. 2.2.3)

- Reduce all letters to lower case
  - Exception: upper case in mid-sentence?
    - e.g., General Motors
    - Fed vs. fed
    - SAIL vs. sail
  - Often best to lower case everything, since users will use
    lowercase regardless of 'correct' capitalization...
- Google example:
  - Query C.A.T.
  - #1 result is for "cat" (well, Lolcats), not Caterpillar Inc.
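A sketch of the mid-sentence heuristic hinted at above (names and token conventions are mine): lowercase only sentence-initial tokens, whose capitalization carries no information, and keep mid-sentence capitals.

```python
def fold_case(tokens):
    # Heuristic sketch: lowercase sentence-initial tokens; keep
    # mid-sentence capitals (General Motors, Fed, SAIL).
    # Many engines just lowercase everything instead.
    folded, sentence_start = [], True
    for tok in tokens:
        folded.append(tok.lower() if sentence_start else tok)
        sentence_start = tok in {'.', '!', '?'}
    return folded

print(fold_case("The Fed raises rates . General Motors fell .".split()))
# ['the', 'Fed', 'raises', 'rates', '.', 'general', 'Motors', 'fell', '.']
# Note: sentence-initial 'General' is wrongly folded -- one reason
# simply lowercasing everything is the common choice.
```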


3
Normalization to terms (Sec. 2.2.3)

- An alternative to equivalence classing is to do asymmetric expansion
- An example of where this may be useful
  - Enter: window    Search: window, windows
  - Enter: windows   Search: Windows, windows, window
  - Enter: Windows   Search: Windows
- Potentially more powerful, but less efficient


Thesauri and soundex

- Do we handle synonyms and homonyms?
  - E.g., by hand-constructed equivalence classes
    - car = automobile    color = colour
  - We can rewrite to form equivalence-class terms
    - When the document contains automobile, index it under
      car-automobile (and vice-versa)
  - Or we can expand a query
    - When the query contains automobile, look under car as well
- What about spelling mistakes?
  - One approach is Soundex, which forms equivalence classes of words
    based on phonetic heuristics
- More in lectures 3 and 9
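For reference, a compact sketch of classic American Soundex, the phonetic hashing scheme mentioned above (the textbook algorithm coded straightforwardly, not any production variant):

```python
def soundex(word):
    # Keep the first letter; encode the rest as digits by sound class;
    # collapse adjacent duplicates; vowels reset the duplicate check,
    # h/w do not; pad/truncate to 4 characters.
    codes = {**dict.fromkeys('bfpv', '1'), **dict.fromkeys('cgjkqsxz', '2'),
             **dict.fromkeys('dt', '3'), 'l': '4',
             **dict.fromkeys('mn', '5'), 'r': '6'}
    word = word.lower()
    first, digits = word[0].upper(), []
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        code = codes.get(ch, '')
        if code and code != prev:
            digits.append(code)
        if ch not in 'hw':
            prev = code
    return (first + ''.join(digits) + '000')[:4]

print(soundex('Robert'), soundex('Rupert'))   # R163 R163 -- same class
print(soundex('Herman'))                      # H655
```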


Lemmatization (Sec. 2.2.4)

- Reduce inflectional/variant forms to base form
- E.g.,
  - am, are, is → be
  - car, cars, car's, cars' → car
  - the boy's cars are different colors → the boy car be different
    color
- Lemmatization implies doing "proper" reduction to dictionary
  headword form


Stemming (Sec. 2.2.4)

- Reduce terms to their "roots" before indexing
- "Stemming" suggests crude affix chopping
  - language dependent
  - e.g., automate(s), automatic, automation all reduced to automat.

    Original text:                    After stemming:
    for example compressed and        for exampl compress and
    compression are both accepted     compress ar both accept
    as equivalent to compress.        as equival to compress

Porter's algorithm (Sec. 2.2.4)

- Commonest algorithm for stemming English
  - Results suggest it's at least as good as other stemming options
- Conventions + 5 phases of reductions
  - phases applied sequentially
  - each phase consists of a set of commands
  - sample convention: of the rules in a compound command, select the
    one that applies to the longest suffix.


Typical rules in Porter (Sec. 2.2.4)

- sses → ss
- ies → i
- ational → ate
- tional → tion

- Weight-of-word sensitive rules
  - (m>1) EMENT →
    - replacement → replac
    - cement → cement
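A toy sketch of how such a compound command behaves (my simplification; the real algorithm implements the full measure-m machinery): rules compete for the longest matching suffix, and weight-sensitive rules fire only when enough of the word precedes the suffix.

```python
# (suffix, replacement, min_stem_len) -- min_stem_len crudely stands in
# for Porter's measure condition m > 1.
RULES = [('ational', 'ate', 0), ('tional', 'tion', 0),
         ('sses', 'ss', 0), ('ies', 'i', 0), ('ement', '', 3)]

def apply_rules(word):
    # Longest applicable suffix wins, per the sample convention above.
    for suffix, repl, min_stem in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:len(word) - len(suffix)] + repl
    return word

for w in ['relational', 'ponies', 'caresses', 'replacement', 'cement']:
    print(w, '->', apply_rules(w))
# relational -> relate, ponies -> poni, caresses -> caress,
# replacement -> replac, cement -> cement (stem 'c' is too short)
```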


Other stemmers (Sec. 2.2.4)

- Other stemmers exist, e.g., Lovins stemmer
  - http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
  - Single-pass, longest suffix removal (about 250 rules)
- Full morphological analysis – at most modest benefits for retrieval
- Do stemming and other normalizations help?
  - English: very mixed results. Helps recall for some queries but
    harms precision on others
    - E.g., operative (dentistry) ⇒ oper
  - Definitely useful for Spanish, German, Finnish, ...
    - 30% performance gains for Finnish!


Language-specificity (Sec. 2.2.4)

- Many of the above features embody transformations that are
  - Language-specific and
  - Often, application-specific
- These are "plug-in" addenda to the indexing process
- Both open source and commercial plug-ins are available for handling
  these

Dictionary entries – first cut (Sec. 2.2)

    ensemble.french
    時間.japanese
    MIT.english
    mit.german
    guaranteed.english
    entries.english
    sometimes.english
    tokenization.english

These may be grouped by language (or not...). More on this in
ranking/query processing.


FASTER POSTINGS MERGES:
SKIP POINTERS/SKIP LISTS

Recall basic merge (Sec. 2.3)

- Walk through the two postings lists simultaneously, in time linear
  in the total number of postings entries:

    Brutus:  2 → 4 → 8 → 41 → 48 → 64 → 128
    Caesar:  1 → 2 → 3 → 8 → 11 → 17 → 21 → 31

- If the list lengths are m and n, the merge takes O(m+n) operations.
- Can we do better? Yes (if index isn't changing too fast).


Augment postings with skip pointers (at indexing time) (Sec. 2.3)

    2 → 4 → 8 → 41 → 48 → 64 → 128      (skip pointers: 2 → 41 → 128)
    1 → 2 → 3 → 8 → 11 → 17 → 21 → 31   (skip pointers: 1 → 11 → 31)

- Why? To skip postings that will not figure in the search results.
- How?
- Where do we place skip pointers?
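For reference, a sketch of that linear merge (the standard two-pointer intersection; the function name is mine):

```python
def intersect(p1, p2):
    # Linear-time merge of two sorted postings lists: O(m+n).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 41, 48, 64, 128], [1, 2, 3, 8, 11, 17, 21, 31]))
# [2, 8]
```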


Query processing with skip pointers (Sec. 2.3)

    2 → 4 → 8 → 41 → 48 → 64 → 128      (skips: 2 → 41 → 128)
    1 → 2 → 3 → 8 → 11 → 17 → 21 → 31   (skips: 1 → 11 → 31)

Suppose we've stepped through the lists until we process 8 on each
list. We match it and advance. We then have 41 on the upper list and
11 on the lower; 11 is smaller. But the skip successor of 11 on the
lower list is 31, so we can skip ahead past the intervening postings.


Where do we place skips? (Sec. 2.3)

- Tradeoff:
  - More skips → shorter skip spans ⇒ more likely to skip. But lots
    of comparisons to skip pointers.
  - Fewer skips → few pointer comparisons, but then long skip spans
    ⇒ few successful skips.
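A sketch of the merge with skips, in the spirit of IIR Figure 2.10. The representation is assumed, not the deck's: each postings list is a sorted Python list, an entry at an index divisible by the skip length is treated as carrying a skip pointer, and skips are spaced using the √L heuristic from the "Placing skips" slide below.

```python
import math

def skip_len(postings):
    # Simple heuristic: about sqrt(L) evenly spaced skip pointers.
    return int(math.sqrt(len(postings))) or 1

def intersect_with_skips(p1, p2):
    s1, s2 = skip_len(p1), skip_len(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            moved = False
            while i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1              # follow skip pointers while useful
                moved = True
            if not moved:
                i += 1               # no useful skip: advance one posting
        else:
            moved = False
            while j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
                moved = True
            if not moved:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))   # [2, 8]
```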

Placing skips

- Simple heuristic: for postings of length L, use √L evenly-spaced
  skip pointers.
- This ignores the distribution of query terms.
- Easy if the index is relatively static; harder if L keeps changing
  because of updates.

- This definitely used to help; with modern hardware it may not
  (Bahle et al. 2002) unless you're memory-based
  - The I/O cost of loading a bigger postings list can outweigh the
    gains from quicker in-memory merging!


PHRASE QUERIES AND POSITIONAL INDEXES


Phrase queries (Sec. 2.4)

- Want to be able to answer queries such as "stanford university" –
  as a phrase
- Thus the sentence "I went to university at Stanford" is not a match.
  - The concept of phrase queries has proven easily understood by
    users; one of the few "advanced search" ideas that works
  - Many more queries are implicit phrase queries
- For this, it no longer suffices to store only <term : docs> entries


A first attempt: Biword indexes (Sec. 2.4.1)

- Index every consecutive pair of terms in the text as a phrase
- For example the text "Friends, Romans, Countrymen" would generate
  the biwords
  - friends romans
  - romans countrymen
- Each of these biwords is now a dictionary term
- Two-word phrase query-processing is now immediate.
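A minimal sketch of biword generation (illustrative; it builds on the toy tokenizer earlier in these notes):

```python
def biwords(tokens):
    # Every consecutive pair of terms becomes one dictionary term.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords(["friends", "romans", "countrymen"]))
# ['friends romans', 'romans countrymen']

# A longer phrase query becomes an AND of its biwords -- with the
# false-positive risk discussed on the next slide:
print(biwords("stanford university palo alto".split()))
# ['stanford university', 'university palo', 'palo alto']
```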


Longer phrase queries (Sec. 2.4.1)

- Longer phrases are processed as we did with wild-cards:
- stanford university palo alto can be broken into the Boolean query
  on biwords:

    stanford university AND university palo AND palo alto

- Without the docs, we cannot verify that the docs matching the above
  Boolean query do contain the phrase.
- Can have false positives!


Extended biwords (Sec. 2.4.1)

- Parse the indexed text and perform part-of-speech-tagging (POST).
- Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
- Call any string of terms of the form NX*N an extended biword.
- Each such extended biword is now made a term in the dictionary.
- Example: catcher in the rye
              N     X  X   N
- Query processing: parse it into N's and X's
  - Segment query into enhanced biwords
  - Look up in index: catcher rye

Issues for biword indexes (Sec. 2.4.1)

- False positives, as noted before
- Index blowup due to bigger dictionary
  - Infeasible for more than biwords, big even for them
- Biword indexes are not the standard solution (for all biwords) but
  can be part of a compound strategy


Solution 2: Positional indexes (Sec. 2.4.2)

- In the postings, store, for each term, the position(s) at which
  tokens of it appear:

    <term, number of docs containing term;
     doc1: position1, position2 ... ;
     doc2: position1, position2 ... ;
     etc.>


Positional index example (Sec. 2.4.2)

    <be: 993427;
     1: 7, 18, 33, 72, 86, 231;
     2: 3, 149;
     4: 17, 191, 291, 430, 434;
     5: 363, 367, ...>

  Which of docs 1, 2, 4, 5 could contain "to be or not to be"?

- For phrase queries, we use a merge algorithm recursively at the
  document level
- But we now need to deal with more than just equality


Processing a phrase query (Sec. 2.4.2)

- Extract inverted index entries for each distinct term: to, be, or,
  not.
- Merge their doc:position lists to enumerate all positions with
  "to be or not to be".
  - to:
    - 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
  - be:
    - 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
- Same general method for proximity searches
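A sketch of the document-level merge with a position-level check inside (an illustration of the idea, not IIR's exact pseudocode; positional postings are modeled here as {term: {doc_id: [positions]}}):

```python
def phrase_docs(index, w1, w2):
    # Docs where some occurrence of w2 immediately follows w1:
    # document-level intersection first, then the positional check.
    hits = []
    for doc in sorted(index[w1].keys() & index[w2].keys()):
        pos2 = set(index[w2][doc])
        if any(p + 1 in pos2 for p in index[w1][doc]):
            hits.append(doc)
    return hits

index = {
    'to': {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433],
           7: [13, 23, 191]},
    'be': {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_docs(index, 'to', 'be'))
# [4] -- "to be" at positions 16-17, 429-430, 433-434
```

For a longer phrase such as "to be or not to be" you would chain such checks with relative offsets (or AND successive word pairs and then verify).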


Proximity queries (Sec. 2.4.2)

- LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
  - Again, here, /k means "within k words of".
- Clearly, positional indexes can be used for such queries; biword
  indexes cannot.
- Exercise: Adapt the linear merge of postings to handle proximity
  queries. Can you make it work for any value of k?
  - This is a little tricky to do correctly and efficiently
  - See Figure 2.12 of IIR
  - There's likely to be a problem on it!


Positional index size (Sec. 2.4.2)

- You can compress position values/offsets: we'll talk about that in
  lecture 5
- Nevertheless, a positional index expands postings storage
  substantially
- Nevertheless, a positional index is now standardly used because of
  the power and usefulness of phrase and proximity queries ...
  whether used explicitly or implicitly in a ranking retrieval system.
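A simple sketch of the proximity variant, using the same positional-index model as above. It checks all position pairs, which is quadratic per document in the number of occurrences; IIR Figure 2.12 achieves the same result in one linear pass with a sliding window over positions.

```python
def within_k(index, w1, w2, k):
    # Docs where some occurrence of w1 lies within k words of some w2.
    hits = []
    for doc in sorted(index[w1].keys() & index[w2].keys()):
        if any(abs(a - b) <= k
               for a in index[w1][doc] for b in index[w2][doc]):
            hits.append(doc)
    return hits

index = {'to': {2: [1, 17, 74], 4: [8, 16, 190]},
         'be': {1: [17, 19], 4: [17, 191, 291]}}
print(within_k(index, 'to', 'be', 1))   # [4]
```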


Positional index size (Sec. 2.4.2)

- Need an entry for each occurrence, not just once per document
- Index size depends on average document size (Why?)
  - Average web page has <1000 terms
  - SEC filings, books, even some epic poems ... easily 100,000 terms
- Consider a term with frequency 0.1%: it still contributes only one
  docID per document, but the number of positional postings grows
  with document length.

    Document size | Postings | Positional postings
    --------------|----------|--------------------
    1000          | 1        | 1
    100,000       | 1        | 100


Rules of thumb (Sec. 2.4.2)

- A positional index is 2–4 times as large as a non-positional index
- Positional index size is 35–50% of the volume of the original text
- Caveat: all of this holds for "English-like" languages

Combination schemes (Sec. 2.4.3)

- These two approaches can be profitably combined
  - For particular phrases ("Michael Jackson", "Britney Spears") it
    is inefficient to keep on merging positional postings lists
    - Even more so for phrases like "The Who"
- Williams et al. (2004) evaluate a more sophisticated mixed indexing
  scheme
  - A typical web query mixture was executed in ¼ of the time of
    using just a positional index
  - It required 26% more space than having a positional index alone


Resources for today's lecture

- IIR 2
- MG 3.6, 4.3; MIR 7.2
- Porter's stemmer: http://www.tartarus.org/~martin/PorterStemmer/
- Skip Lists theory: Pugh (1990)
  - Multilevel skip lists give the same O(log n) efficiency as trees
- H.E. Williams, J. Zobel, and D. Bahle. 2004. "Fast Phrase Querying
  with Combined Indexes." ACM Transactions on Information Systems.
  http://www.seg.rmit.edu.au/research/research.php?author=4
- D. Bahle, H. Williams, and J. Zobel. 2002. "Efficient Phrase
  Querying with an Auxiliary Index." SIGIR 2002, pp. 215-221.

