Introduction to Information Retrieval

Index Construction

Index construction
How do we construct an index?
What strategies can we use with limited main
memory?


Indexing
Indexing is a technique borrowed from databases
An index is a data structure that supports efficient
lookups in a large data set
E.g., hash indexes, R-trees, B-trees, etc.


Forward index
What is INVERTED INDEX? First look at the FORWARD INDEX!
Documents Words
Document 1 Hat, dog, the, cow, is, now
Document 2 Cow, run, away, morning, in, tree
Document 3 What, family, at, some, is, take

Querying the forward index would require sequential iteration through each document and each word to verify a matching document.
Too much time, memory and resources required!
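As a concrete illustration, here is a minimal Python sketch (toy data taken from the table above, not from the slides) of a forward-index lookup: answering a one-word query means scanning every document's word list.

```python
# Forward index: document -> list of the words it contains (toy data).
forward_index = {
    "Document 1": ["hat", "dog", "the", "cow", "is", "now"],
    "Document 2": ["cow", "run", "away", "morning", "in", "tree"],
    "Document 3": ["what", "family", "at", "some", "is", "take"],
}

def query_forward(word):
    # Every document must be visited: O(total number of words).
    return [doc for doc, words in forward_index.items() if word in words]

print(query_forward("cow"))   # ['Document 1', 'Document 2']
```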


What is an inverted index?

[Figure: a dictionary term pointing to its postings list; each docID entry in the list is one posting]

As opposed to the forward index, store the list of documents per word.
This lets us directly access the set of documents containing the word.
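For contrast, a minimal sketch of the inverted index built from the same toy data (it reuses the hypothetical `forward_index` dict from the previous sketch): each word maps directly to its postings list, so a lookup is a single dictionary access.

```python
from collections import defaultdict

inverted_index = defaultdict(list)            # word -> postings list of documents
for doc, words in forward_index.items():
    for word in words:
        if doc not in inverted_index[word]:
            inverted_index[word].append(doc)

print(inverted_index["cow"])                  # ['Document 1', 'Document 2']
```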


Index construction

Documents are parsed to extract words, and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
i'         1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2

Key step

After all documents have been parsed, the inverted file is sorted by terms.
We focus on this sort step. We have 100M items to sort.

Term, Doc # (as parsed)      Term, Doc # (sorted by term)
I          1                 ambitious  2
did        1                 be         2
enact      1                 brutus     1
julius     1                 brutus     2
caesar     1                 capitol    1
I          1                 caesar     1
was        1                 caesar     2
killed     1                 caesar     2
i'         1                 did        1
the        1                 enact      1
capitol    1                 hath       2
brutus     1                 I          1
killed     1                 I          1
me         1                 i'         1
so         2                 it         2
let        2                 julius     1
it         2                 killed     1
be         2                 killed     1
with       2                 let        2
caesar     2                 me         1
the        2                 noble      2
noble      2                 so         2
brutus     2                 the        1
hath       2                 the        2
told       2                 told       2
you        2                 you        2
caesar     2                 was        1
was        2                 was        2
ambitious  2                 with       2


Scaling index construction


In-memory index construction does not scale.
How can we construct an index for very large
collections?
Taking into account the hardware constraints we just
learned about . . .
Memory, disk, speed, etc.


Sort-based index construction


As we build the index, we parse docs one at a time.
While building the index, we cannot easily exploit
compression tricks (you can, but much more complex)
The final postings for any term are incomplete until the end.
At 12 bytes per non-positional postings entry (4-byte termID + 4-byte docID + 4-byte freq), this demands a lot of space for large collections.
T = 100,000,000 postings in the case of RCV1
Thus: We need to store intermediate results on disk.


Use the same algorithm for disk?


Can we use the same index construction algorithm
for larger collections, but by using disk instead of
memory?
No: Sorting T = 100,000,000 records on disk is too slow: too many disk seeks.
We need an external sorting algorithm.


Bottleneck
Parse and build postings entries one doc at a time
Now sort postings entries by term (then by doc
within each term)
Doing this with random disk seeks would be too slow: we must sort T = 100M records.


BSBI: Blocked sort-based Indexing


(Sorting with fewer disk seeks)
12-byte (4+4+4) records (termID, doc, freq).
These are generated as we parse docs.
Must now sort 100M such 12-byte records by term.
Define a Block ~ 10M such records
Can fit comfortably into memory for in-place sorting (e.g.,
quicksort). Total 100M records
Will have 10 such blocks to start with.
Basic idea of algorithm:
Accumulate postings for each block, sort, write to disk.
Then merge the blocks into one long sorted order.
The term -> termID mapping (= dictionary) must already be available, built from a first pass.

Blocked sort-based indexing

Use termIDs instead of terms
Main memory is insufficient to collect all termID-docID pairs, so we need an external sorting algorithm that uses disk:
Segment the collection into parts of equal size
Sort and group the termID-docID pairs of each part in memory
Store the intermediate results on disk
Merge all intermediate results into the final index
Running time: O(T log T)
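A minimal Python sketch of the BSBI flow described above (helper names are illustrative; the term-to-termID dictionary is assumed to already exist from a first pass, so the stream below yields termID-docID pairs):

```python
import heapq, itertools, os, pickle

BLOCK_SIZE = 10_000_000                      # ~10M (termID, docID) pairs per block

def bsbi(pair_stream, run_dir="runs"):
    """pair_stream yields (termID, docID) pairs as documents are parsed."""
    os.makedirs(run_dir, exist_ok=True)
    runs = []
    for i, block in enumerate(_blocks(pair_stream, BLOCK_SIZE)):
        block.sort()                                     # sort one block in memory
        path = os.path.join(run_dir, f"run{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(block, f)                        # one sorted run per block
        runs.append(path)
    # Merge the sorted runs into the final index: termID -> postings list.
    # (Loading whole runs here is a toy shortcut; a real merge streams from
    #  small disk buffers, as the external mergesort slide shows later.)
    merged = heapq.merge(*(pickle.load(open(p, "rb")) for p in runs))
    index = {}
    for term_id, doc_id in merged:
        index.setdefault(term_id, []).append(doc_id)
    return index

def _blocks(stream, size):
    stream = iter(stream)
    while True:
        block = list(itertools.islice(stream, size))
        if not block:
            return
        yield block
```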

Block Inversion
Inversion involves two steps:
1. We sort the termID-docID pairs.
2. We collect all termID-docID pairs with the same termID into a postings list (each posting here is simply a docID).
This results in an inverted index for the block we have just read.
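A minimal sketch of these two inversion steps for one block, on toy termID-docID pairs:

```python
from itertools import groupby
from operator import itemgetter

pairs = [(7, 2), (3, 1), (7, 1), (3, 4), (11, 2)]        # toy (termID, docID) pairs

pairs.sort()                                             # step 1: sort the pairs
block_index = {
    term_id: [doc_id for _, doc_id in group]             # step 2: group into postings
    for term_id, group in groupby(pairs, key=itemgetter(0))
}
print(block_index)    # {3: [1, 4], 7: [1, 2], 11: [2]}
```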


Postings lists to be merged (on disk):
  Block 1:  brutus: d1,3; d3,2    caesar: d1,2; d2,1; d4,4    noble: d5,2    with: d1,2; d3,1; d5,2
  Block 2:  brutus: d6,1; d8,3    caesar: d6,4    julius: d10,1    killed: d6,4; d7,3

Merged postings lists:
  brutus: d1,3; d3,2; d6,1; d8,3
  caesar: d1,2; d2,1; d4,4; d6,4
  julius: d10,1
  killed: d6,4; d7,3
  noble: d5,2
  with: d1,2; d3,1; d5,2
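A minimal in-memory sketch of this merge, using the same data as the figure (each posting is a docID with its term frequency; a real merge streams the blocks from disk rather than holding them in memory):

```python
block1 = {"brutus": [("d1", 3), ("d3", 2)],
          "caesar": [("d1", 2), ("d2", 1), ("d4", 4)],
          "noble":  [("d5", 2)],
          "with":   [("d1", 2), ("d3", 1), ("d5", 2)]}
block2 = {"brutus": [("d6", 1), ("d8", 3)],
          "caesar": [("d6", 4)],
          "julius": [("d10", 1)],
          "killed": [("d6", 4), ("d7", 3)]}

merged = {}
for block in (block1, block2):
    for term, postings in block.items():
        merged.setdefault(term, []).extend(postings)     # docIDs are already ordered

for term in sorted(merged):
    print(term, merged[term])
```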


Sorting 10 blocks of 10M records


First, read each block, sort in main memory, write back to disk:
Quicksort takes 2N ln N expected steps
In our case 2 x (10M ln 10M) steps
Exercise: estimate the total time to read each block from disk and quicksort it (a rough worked estimate follows below).
10 times this estimate gives us 10 sorted runs of
10M records each on disk. Now, need to merge all!
Done straightforwardly, merge needs 2 copies of data
on disk (one for the lists to be merged, one for the
merged output)
But we can optimize this
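A back-of-the-envelope sketch for the exercise. The hardware constants below are assumptions in the style of the earlier hardware-basics lecture (roughly 0.02 μs to transfer a byte from disk, 0.01 μs per low-level in-memory operation), not figures from this slide:

```python
import math

TRANSFER_PER_BYTE = 0.02e-6      # s/byte, assumed sequential disk transfer rate
OP_TIME           = 0.01e-6      # s per low-level in-memory operation (assumed)

N         = 10_000_000           # records per block
REC_BYTES = 12                   # 4-byte termID + 4-byte docID + 4-byte freq

read_time = N * REC_BYTES * TRANSFER_PER_BYTE    # sequential read of one block
sort_time = 2 * N * math.log(N) * OP_TIME        # quicksort: 2 N ln N steps

print(f"read one block : {read_time:.1f} s")     # ~2.4 s
print(f"sort one block : {sort_time:.1f} s")     # ~3.2 s
print(f"10 blocks      : {10 * (read_time + sort_time):.0f} s (plus write-back)")
```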

How to merge the sorted runs?


External mergesort (one pass): use a 9-element priority queue, repeatedly deleting its smallest element and adding to it from the buffer to which the smallest belonged.

One example of external sorting is the external mergesort algorithm. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:

1. Read 100 MB of the data in main memory and sort by some conventional method, like
quicksort.
2. Write the sorted data to disk.
3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks, which now need to
be merged into one single output file.
4. Read the first 10 MB of each sorted chunk into input buffers in main memory and
allocate the remaining 10 MB for an output buffer. (In practice, it might provide better
performance to make the output buffer larger and the input buffers slightly smaller.)
5. Perform a 9-way merge and store the result in the output buffer. If the output buffer is
full, write it to the final sorted file. If any of the 9 input buffers gets empty, fill it with the
next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is
available.
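A minimal sketch of the merge step (step 5) using a priority queue, via Python's heapq. The fixed 10 MB input buffers are elided: each run is simply streamed, which is the same idea in miniature.

```python
import heapq, pickle

def stream_run(path):
    """Yield (termID, docID) pairs from one sorted run on disk."""
    with open(path, "rb") as f:
        run = pickle.load(f)          # a real system refills a small buffer instead
    yield from run

def merge_runs(run_paths, out_path):
    # heapq.merge keeps one "current" item per run in a small priority queue,
    # repeatedly emitting the smallest and pulling the next item from that run.
    with open(out_path, "w") as out:
        for term_id, doc_id in heapq.merge(*(stream_run(p) for p in run_paths)):
            out.write(f"{term_id}\t{doc_id}\n")   # final sorted postings file

# merge_runs([f"runs/run{i}.pkl" for i in range(10)], "postings.sorted")
```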

Remaining problem with sort-based algorithm

Our assumption was: we can keep the dictionary in memory.
We need the dictionary (which grows dynamically) in order to implement a term to termID mapping.
Actually, we could work with (term, docID) postings instead of (termID, docID) postings.


SPIMI:
Single-pass in-memory indexing
Key idea 1: Generate separate dictionaries for each block: no need to maintain a term-termID mapping across blocks.
In other words, sub-dictionaries are generated on the fly.
Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete
inverted index for each block.
These separate indexes can then be merged into one
big index.

SPIMI-Invert

Dictionary terms are generated on the fly!
[Figure: SPIMI-Invert pseudocode (not reproduced)]
Merging of blocks is analogous to BSBI.
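A minimal sketch of SPIMI-Invert for one block (simplified: the block is flushed after a fixed number of postings rather than when free memory runs out, and terms rather than termIDs are used as keys, as the previous slide suggests):

```python
import pickle

def spimi_invert(token_stream, out_path, max_postings=10_000_000):
    dictionary = {}                                      # term -> postings list (docIDs)
    n_postings = 0
    for term, doc_id in token_stream:                    # (term, docID) pairs
        dictionary.setdefault(term, []).append(doc_id)   # no sorting while accumulating
        n_postings += 1
        if n_postings >= max_postings:                   # block is "full": stop and flush
            break
    with open(out_path, "wb") as f:
        pickle.dump(sorted(dictionary.items()), f)       # sort terms once, write the block
    return out_path
```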

Merge algorithm


BSBI vs. SPIMI

[Figure: in BSBI, a single dictionary is kept in main memory; pass 1 writes sorted blocks to disk, and a second (merge) pass combines them into the inverted index]

BSBI vs. SPIMI

[Figure: in SPIMI, each block gets its own sub-dictionary built on the fly; a single pass writes the blocks to disk, and they are then merged into the inverted index]

Difference between BSBI and SPIMI

SPIMI:
1. Adds postings directly to the postings list.
2. Faster than BSBI because no sorting is necessary.
3. Saves memory because no termIDs need to be stored.
4. Time complexity: O(T).

BSBI:
1. Collects term-docID pairs, sorts them, and then creates postings lists.
2. Slower than SPIMI.
3. Requires storing termIDs, so needs more space.
4. Time complexity: O(T log T).


Distributed indexing
For web-scale indexing
must use a distributed computing cluster
Individual machines are fault-prone
Can unpredictably slow down or fail
How do we exploit such a pool of machines?


Google data centers


Google data centers mainly contain commodity
machines.
Data centers are distributed around the world.
Estimate: a total of 1 million servers, 3 million
processors/cores (Gartner 2007)
Estimate: Google installs 100,000 servers each
quarter.
Based on expenditures of 200-250 million dollars per year
This would be 10% of the computing capacity of the
world!?!

Distributed indexing
Maintain a master machine directing the indexing job; the master is considered "safe".
Break up indexing into sets of (parallel) tasks.
Master machine assigns each task to an idle machine
from a pool.


Parallel tasks
We will use two sets of parallel tasks
Parsers
Inverters
Break the input document collection into splits
Each split is a subset of documents (corresponding to
blocks in BSBI/SPIMI)


Parsers
Master assigns a split to an idle parser machine
Parser reads a document at a time and emits (term,
doc) pairs
Parser writes pairs into j partitions
Each partition is for a range of terms' first letters
(e.g., a-f, g-p, q-z); here j = 3.
Now to complete the index inversion


Inverters
An inverter collects all (term,doc) pairs (= postings)
for one term-partition.
Sorts and writes to postings lists
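A toy single-process sketch of the two task types (illustrative names; in a real deployment the master hands splits to parser machines and term-partitions to inverter machines):

```python
from collections import defaultdict

PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]          # j = 3 term-partitions

def parse(split):
    """Parser: read the docs in one split, emit (term, docID) pairs per partition."""
    partitions = defaultdict(list)
    for doc_id, text in split:
        for term in text.lower().split():
            for lo, hi in PARTITIONS:
                if lo <= term[0] <= hi:
                    partitions[(lo, hi)].append((term, doc_id))
                    break
    return partitions

def invert(pairs):
    """Inverter: collect all pairs of one term-partition, sort, build postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        postings[term].append(doc_id)
    return dict(postings)

split = [(1, "caesar was killed"), (2, "brutus killed caesar")]
print(invert(parse(split)[("a", "f")]))    # {'brutus': [2], 'caesar': [1, 2]}
```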


Data flow
[Figure: the master assigns splits to parsers and term-partitions to inverters; parsers emit (term, doc) pairs into a-f, g-p, q-z segment files (map phase), and each inverter reads one partition and writes its postings (reduce phase)]

MapReduce
The index construction algorithm we just described is
an instance of MapReduce.
MapReduce (Dean and Ghemawat 2004) is a robust
and conceptually simple framework for distributed
computing
without having to write code for the distribution
part.
They describe the Google indexing system (ca. 2002)
as consisting of a number of phases, each
implemented in MapReduce.

Dynamic indexing
Up to now, we have assumed that collections are
static.
They rarely are:
Documents come in over time and need to be inserted.
Documents are deleted and modified.
This means that the dictionary and postings lists have
to be modified:
Postings updates for terms already in dictionary
New terms added to dictionary


Simplest approach
Maintain big main index
New docs go into small auxiliary index
Search across both, merge results
Deletions
Invalidation bit-vector for deleted docs
Filter the docs returned for a search using this invalidation bit-vector
Periodically, re-index into one main index
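A minimal in-memory sketch of this scheme (toy data and illustrative names): queries consult both indexes, deleted docs are filtered by the invalidation set, and the auxiliary index is periodically folded into the main one.

```python
main_index = {"caesar": [1, 2, 4], "brutus": [1, 3]}      # big main index
aux_index  = {"caesar": [7], "noble": [7]}                # small index for new docs
deleted    = {3}                                          # invalidation bit-vector (as a set)

def search(term):
    hits = main_index.get(term, []) + aux_index.get(term, [])
    return [d for d in hits if d not in deleted]          # filter out deleted docs

def reindex():
    """Periodically merge the auxiliary index back into the main index."""
    for term, postings in aux_index.items():
        main_index.setdefault(term, []).extend(postings)
    aux_index.clear()

print(search("caesar"))   # [1, 2, 4, 7]
print(search("brutus"))   # [1]  (doc 3 is filtered out)
```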


Issues with main and auxiliary indexes


Problem of frequent merges: you touch stuff a lot
Poor performance during merge
Actually:
Merging of the auxiliary index into the main index is efficient if we
keep a separate file for each postings list.
Merge is the same as a simple append.
But then we would need a lot of files, which is inefficient for the O/S.
Assumption for the rest of the lecture: The index is one big
file.
In reality: Use a scheme somewhere in between (e.g., split
very large postings lists, collect postings lists of length 1 in one
file etc.)

Dynamic/Positional indexing at search engines


All the large search engines now do dynamic
indexing
Their indices have frequent incremental changes
News items, blogs, new topical web pages (e.g., Sarah Palin, ...)
But (sometimes/typically) they also periodically
reconstruct the index from scratch
Query processing is then switched to the new index, and
the old index is then deleted
Positional indexes
Same sort of sorting problem, just larger

END
