Index Construction
Introduction to Information Retrieval
Index construction
How do we construct an index?
What strategies can we use with limited main
memory?
19-Jul-17 CS F469 2
Indexing
Indexing is a technique borrowed from databases
An index is a data structure that supports efficient
lookups in a large data set
E.g., hash indexes, R-trees, B-trees, etc.
Forward index
What is an inverted index? First, look at the forward index, which maps each document to the words it contains:
Documents     Words
Document 1    Hat, dog, the, cow, is, now
Document 2    Cow, run, away, morning, in, tree
Document 3    What, family, at, some, is, take
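The step from the forward index to an inverted index can be sketched as follows. This is a minimal illustration, not the course's implementation: the `invert` helper and the lowercasing of words are assumptions made for the example.

```python
from collections import defaultdict

# Forward index from the slide: document -> words.
# Words are lowercased here (an assumption for this sketch).
forward_index = {
    1: ["hat", "dog", "the", "cow", "is", "now"],
    2: ["cow", "run", "away", "morning", "in", "tree"],
    3: ["what", "family", "at", "some", "is", "take"],
}

def invert(forward):
    """Turn a doc -> words mapping into a term -> sorted docIDs mapping."""
    inverted = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            inverted[word].add(doc_id)
    # Sort terms and docIDs so the result looks like a dictionary with postings lists.
    return {term: sorted(doc_ids) for term, doc_ids in sorted(inverted.items())}

inverted_index = invert(forward_index)
print(inverted_index["cow"])  # [1, 2]
print(inverted_index["is"])   # [1, 3]
```

The inverted index answers "which documents contain this word?" with one lookup, whereas the forward index would need a scan over every document.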
Postings list
[Figure: a postings list, with one posting marked.]
Index construction
Term      Doc #
I         1
did       1
enact     1
julius    1
Bottleneck
Parse and build postings entries one doc at a time
Now sort postings entries by term (then by doc
within each term)
Doing this with random disk seeks would be too slow:
we must sort T = 100M records
Block Inversion
Inversion involves two steps:
1. We sort the termID-docID pairs.
2. We collect all termID-docID pairs with the same
termID into a postings list, where a posting is
simply a docID.
This results in an inverted index for the block we have just
read.
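The two inversion steps can be sketched in a few lines. This is an illustrative sketch only: the `invert_block` name and the sample termIDs and docIDs are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def invert_block(pairs):
    """Sort (termID, docID) pairs and collect them into postings lists."""
    # Step 1: sort by termID, then by docID; set() drops duplicate pairs.
    sorted_pairs = sorted(set(pairs))
    # Step 2: group pairs that share a termID into one postings list.
    return {
        term_id: [doc_id for _, doc_id in group]
        for term_id, group in groupby(sorted_pairs, key=itemgetter(0))
    }

# Hypothetical block of (termID, docID) pairs in parse order.
block = [(2, 1), (1, 1), (2, 3), (1, 2), (3, 1), (2, 1)]
print(invert_block(block))  # {1: [1, 2], 2: [1, 3], 3: [1]}
```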
One example of external sorting is the external mergesort algorithm. For example, for
sorting 900 megabytes of data using only 100 megabytes of RAM:
1. Read 100 MB of the data into main memory and sort it by some conventional method, like
quicksort.
2. Write the sorted data to disk.
3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks, which now need to
be merged into one single output file.
4. Read the first 10 MB of each sorted chunk into input buffers in main memory and
allocate the remaining 10 MB for an output buffer. (In practice, it might provide better
performance to make the output buffer larger and the input buffers slightly smaller.)
5. Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer is
full, write it to the final sorted file and empty it. Whenever any of the 9 input buffers becomes empty, fill it with the
next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is
available.
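The steps above can be sketched at toy scale. This is a simplified sketch under stated assumptions: the data is a text file with one integer per line, the chunk size is counted in lines rather than megabytes, and Python's buffered file objects stand in for the explicit input and output buffers. The function name `external_sort` and the file names are made up for the example.

```python
import heapq
import os
import tempfile
from itertools import islice

def external_sort(input_path, output_path, chunk_size):
    """Sort a file of one integer per line, reading chunk_size lines at a time."""
    run_paths = []
    # Steps 1-3: read a chunk, sort it in memory, write it to disk as a sorted run.
    with open(input_path) as src:
        while True:
            chunk = [int(line) for line in islice(src, chunk_size)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{n}\n" for n in chunk)
            run_paths.append(path)
    # Steps 4-5: k-way merge of the sorted runs into the final sorted file.
    runs = [open(path) for path in run_paths]
    try:
        streams = [(int(line) for line in run) for run in runs]
        with open(output_path, "w") as out:
            out.writelines(f"{n}\n" for n in heapq.merge(*streams))
    finally:
        for run in runs:
            run.close()
        for path in run_paths:
            os.remove(path)

# Toy usage: 9 records, 3 per chunk, so a 3-way merge.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.txt")
dst_path = os.path.join(workdir, "sorted.txt")
with open(src_path, "w") as f:
    f.writelines(f"{n}\n" for n in [9, 4, 7, 1, 8, 2, 6, 3, 5])
external_sort(src_path, dst_path, chunk_size=3)
with open(dst_path) as f:
    result = [int(line) for line in f]
print(result)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

`heapq.merge` reads its inputs lazily, so only a small part of each run is in memory at any time, which mirrors the role of the input buffers in step 4.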
SPIMI:
Single-pass in-memory indexing
Key idea 1: Generate separate dictionaries for each
block; there is no need to maintain a term-termID mapping
across blocks.
In other words, sub-dictionaries are generated on the
fly.
Key idea 2: Don't sort. Accumulate postings in
postings lists as they occur.
With these two ideas we can generate a complete
inverted index for each block.
These separate indexes can then be merged into one
big index.
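Both key ideas, plus the final merge, can be sketched as follows. This is a simplified in-memory sketch, not the full SPIMI-Invert algorithm: the function names are made up, blocks are plain dictionaries rather than files on disk, and docIDs are assumed to arrive in increasing order within a block. Terms are sorted once when a block is finished, as in the textbook algorithm, so that blocks can be merged in term order.

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Build a complete inverted index for one block from (term, docID) tokens."""
    dictionary = defaultdict(list)  # per-block dictionary: term -> postings list
    for term, doc_id in token_stream:
        postings = dictionary[term]  # key idea 1: new terms are added on the fly
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)  # key idea 2: no sorting of pairs, just append
    # Sort the terms once when the block is finished.
    return dict(sorted(dictionary.items()))

def merge_blocks(blocks):
    """Merge per-block indexes into one big index (in memory, for illustration)."""
    merged = defaultdict(list)
    for block in blocks:
        for term, postings in block.items():
            merged[term].extend(postings)
    return {term: sorted(set(p)) for term, p in sorted(merged.items())}

# Hypothetical token streams for two blocks.
block1 = spimi_invert([("cow", 1), ("dog", 1), ("cow", 1), ("cow", 2)])
block2 = spimi_invert([("cow", 3), ("tree", 3)])
index = merge_blocks([block1, block2])
print(index)  # {'cow': [1, 2, 3], 'dog': [1], 'tree': [3]}
```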
SPIMI-Invert
Merge algorithm
BSBI
[Figure: BSBI. Labels: dictionary, blocks, inverted index, main, disk; pass 1, then pass 2 (phase: merge).]
SPIMI
[Figure: SPIMI. Labels: sub-dictionaries, blocks, inverted index, main, disk; single pass (phase: merge).]
SPIMI vs. BSBI
SPIMI
1. Adds postings directly to the postings list
2. Is faster than BSBI because no sorting of term-docID pairs is necessary
3. Saves memory because no termID needs to be stored
4. Time complexity O(T)
BSBI
1. Collects term-docID pairs, sorts them, and then creates the postings lists
2. Is slower than SPIMI
3. Must store termIDs, so it needs more space
4. Time complexity O(T log T)
Distributed indexing
For web-scale indexing, we
must use a distributed computing cluster
Individual machines are fault-prone
Can unpredictably slow down or fail
How do we exploit such a pool of machines?
Distributed indexing
Maintain a master machine directing the indexing job;
the master is considered safe.
Break up indexing into sets of (parallel) tasks.
Master machine assigns each task to an idle machine
from a pool.
Parallel tasks
We will use two sets of parallel tasks
Parsers
Inverters
Break the input document collection into splits
Each split is a subset of documents (corresponding to
blocks in BSBI/SPIMI)
Parsers
Master assigns a split to an idle parser machine
Parser reads a document at a time and emits (term,
doc) pairs
Parser writes pairs into j partitions
Each partition is for a range of terms' first letters
(e.g., a-f, g-p, q-z); here j = 3.
Now to complete the index inversion
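A parser's work on one split can be sketched as follows. This is an illustration under stated assumptions: the split is a small in-memory dictionary, the tokenizer simply extracts runs of the letters a-z, and the names `partition_of` and `parse_split` are made up for the example. A real parser would write each partition to a segment file.

```python
import re

# Term ranges from the slide: a-f, g-p, q-z, so j = 3.
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]

def partition_of(term):
    """Return the index of the partition that covers the term's first letter."""
    first = term[0]
    for i, (low, high) in enumerate(PARTITIONS):
        if low <= first <= high:
            return i
    raise ValueError(f"no partition for term {term!r}")

def parse_split(split):
    """Read one document at a time and emit (term, doc) pairs into j partitions."""
    segments = [[] for _ in PARTITIONS]
    for doc_id, text in split.items():
        # Simplistic tokenizer (an assumption): lowercase runs of a-z.
        for term in re.findall(r"[a-z]+", text.lower()):
            segments[partition_of(term)].append((term, doc_id))
    return segments

# Hypothetical split with two documents.
segments = parse_split({1: "Caesar came", 2: "Caesar was killed"})
print(segments[0])  # [('caesar', 1), ('came', 1), ('caesar', 2)]
print(segments[1])  # [('killed', 2)]
print(segments[2])  # [('was', 2)]
```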
Inverters
An inverter collects all (term,doc) pairs (= postings)
for one term-partition.
Sorts and writes to postings lists
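An inverter's work on one term partition can be sketched as follows. This is an illustrative sketch: segment files are modeled as in-memory lists of (term, doc) pairs, and the function name and sample data are made up for the example.

```python
from collections import defaultdict

def invert_partition(segment_files):
    """Collect all (term, doc) pairs for one term partition and build postings lists."""
    # Collect the pairs that every parser wrote for this partition.
    pairs = [pair for segment in segment_files for pair in segment]
    postings = defaultdict(list)
    # Sort by term, then by doc; set() drops duplicate pairs.
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)
    return dict(postings)

# Hypothetical a-f segments written by two different parsers.
from_parser_1 = [("caesar", 1), ("came", 1), ("caesar", 2)]
from_parser_2 = [("brutus", 4), ("caesar", 3)]
index_a_f = invert_partition([from_parser_1, from_parser_2])
print(index_a_f)  # {'brutus': [4], 'caesar': [1, 2, 3], 'came': [1]}
```

Because each inverter handles a disjoint range of terms, the inverters can run in parallel without coordinating with each other.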
Data flow
[Figure: data flow. The master assigns tasks in both the map phase and the reduce phase; the map phase writes segment files, and the reduce phase produces postings.]
MapReduce
The index construction algorithm we just described is
an instance of MapReduce.
MapReduce (Dean and Ghemawat 2004) is a robust
and conceptually simple framework for distributed
computing
without having to write code for the distribution
part.
They describe the Google indexing system (ca. 2002)
as consisting of a number of phases, each
implemented in MapReduce.
Dynamic indexing
Up to now, we have assumed that collections are
static.
They rarely are:
Documents come in over time and need to be inserted.
Documents are deleted and modified.
This means that the dictionary and postings lists have
to be modified:
Postings updates for terms already in dictionary
New terms added to dictionary
Simplest approach
Maintain big main index
New docs go into small auxiliary index
Search across both, merge results
Deletions
Invalidation bit-vector for deleted docs
Filter the docs returned in a search result by this invalidation
bit-vector
Periodically, re-index into one main index
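The approach above can be sketched as a small class. This is a minimal sketch under stated assumptions: the class and method names are made up, a Python set stands in for the invalidation bit-vector, both indexes live in memory, and docIDs are never reused after deletion.

```python
class DynamicIndex:
    """Big main index plus a small auxiliary index and a deletion filter."""

    def __init__(self):
        self.main = {}        # term -> sorted list of docIDs
        self.auxiliary = {}   # term -> docIDs added since the last re-index
        self.deleted = set()  # stands in for the invalidation bit-vector

    def add(self, doc_id, terms):
        # New docs go into the small auxiliary index.
        for term in set(terms):
            self.auxiliary.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        # Deletion only marks the doc as invalid.
        self.deleted.add(doc_id)

    def search(self, term):
        # Search across both indexes, merge results, filter out deleted docs.
        hits = set(self.main.get(term, [])) | set(self.auxiliary.get(term, []))
        return sorted(hits - self.deleted)

    def reindex(self):
        # Periodically fold everything into one main index.
        merged = {}
        for index in (self.main, self.auxiliary):
            for term, postings in index.items():
                live = [d for d in postings if d not in self.deleted]
                if live:
                    merged.setdefault(term, []).extend(live)
        self.main = {term: sorted(set(p)) for term, p in merged.items()}
        self.auxiliary = {}
        self.deleted = set()

idx = DynamicIndex()
idx.add(1, ["cow", "hat"])
idx.add(2, ["cow"])
idx.reindex()
idx.add(3, ["cow"])
idx.delete(1)
print(idx.search("cow"))  # [2, 3]
idx.reindex()
print(idx.main)  # {'cow': [2, 3]}
```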
END