
Modern Information Retrieval

Parallel and Distributed IR

信息检索实验室 (Information Retrieval Laboratory)
Outline

- Parallel Computing
- Performance Measures
- Parallel IR (MIMD)
- Distributed IR

Parallel Computing

- Parallel computing is the simultaneous application of multiple processors to solve a single problem, where each processor works on a different part of the problem.

[Figure: a problem divided into parts 1, 2, and 3, each part assigned to a different processor (Processor 1, Processor 2, Processor 3)]

Four classes of parallel architecture (Flynn's taxonomy):

- SISD: single instruction stream, single data stream
- SIMD: single instruction stream, multiple data stream
- MISD: multiple instruction stream, single data stream
- MIMD: multiple instruction stream, multiple data stream

MIMD machines range from tightly coupled (shared memory) to loosely coupled (message passing).
Performance Measures

Speedup:

    S = (running time of best available sequential algorithm) / (running time of parallel algorithm)

The perfect speedup is S = N, where N is the number of processors.

We cannot achieve the perfect speedup because of:

- problem partitioning
- parallel architecture
- inherently sequential components
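The effect of an inherently sequential component can be quantified by Amdahl's law, which bounds the speedup S given the sequential fraction f of the work; a minimal sketch (the fraction values are illustrative):

```python
# Amdahl's law: with a fraction f of the work inherently sequential,
# the speedup on N processors is bounded by S = 1 / (f + (1 - f) / N).

def speedup(f: float, n: int) -> float:
    """Upper bound on speedup for sequential fraction f on n processors."""
    return 1.0 / (f + (1.0 - f) / n)

# Even a 5% sequential component caps 16 processors well below 16x:
print(round(speedup(0.05, 16), 2))  # 9.14
print(speedup(0.0, 16))             # perfect speedup: 16.0
```

This is why the perfect speedup S = N is unattainable in practice: f > 0 for any real retrieval system.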

Parallel IR

- Two directions:
  - Develop new retrieval strategies that directly lend themselves to parallel implementation (e.g., neural networks)
  - Adapt existing, well-studied information retrieval algorithms to parallel processing

[Figure: multitasking model — a broker receives user queries, forwards each query to several search engines, and collects their results]

Design concerns:
- response time
- resource balance

Parallel IR

[Figure: a broker mediates between the user and multiple parallel search engines, passing the user query to each engine and merging their results]
Parallel IR — Data Partitioning
K1 K2 … Ki … Kt
D1 W1,1 W2,1 … Wi,1 … Wt,1
D2 W1,2 W2,2 … Wi,2 … Wt,2
… … … … … … …
Dj W1,j W2,j … Wi,j … Wt,j
… … … … … … …
Dn W1,n W2,n … Wi,n … Wt,n

Each document is a vector d_j = (w_{1,j}, ..., w_{t,j}), the query is a vector q = (w_{1,q}, ..., w_{t,q}), and retrieval computes the similarity sim(d_j, q).
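With the vector representation above, sim(d_j, q) is commonly computed as the cosine of the angle between the two weight vectors; a minimal sketch:

```python
import math

def cosine_sim(d: list[float], q: list[float]) -> float:
    """Cosine similarity between a document vector and a query vector."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_sim([3.0, 4.0], [3.0, 4.0]))  # identical vectors: 1.0
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: 0.0
```

Computing sim(d_j, q) over all documents is the work that data partitioning divides among processors.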

Partitioning Methods:

- Document partitioning
  - Logical document partitioning
  - Physical document partitioning
- Term partitioning

Parallel IR — Data Partitioning

Logical Document Partitioning

[Figure: the inverted list for term i is divided among processors P0, P1, and P2]

Indexing:
- Partition the documents among the processors
- Run a separate indexing process on each processor in parallel
- Merge all inverted lists into the final inverted file
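The three indexing steps above can be sketched as follows; the multiprocessing pool and the tiny corpus are illustrative choices, not part of the original slides:

```python
from collections import defaultdict
from multiprocessing import Pool

def index_partition(docs):
    """Build an inverted list {term: [doc_id, ...]} for one document partition."""
    inverted = defaultdict(list)
    for doc_id, text in docs:
        for term in set(text.split()):
            inverted[term].append(doc_id)
    return dict(inverted)

def merge_indexes(partial_indexes):
    """Merge per-partition inverted lists into the final inverted file."""
    final = defaultdict(list)
    for partial in partial_indexes:
        for term, postings in partial.items():
            final[term].extend(postings)
    return {term: sorted(postings) for term, postings in final.items()}

if __name__ == "__main__":
    docs = [(0, "parallel information retrieval"), (1, "distributed retrieval"),
            (2, "parallel computing"), (3, "information systems")]
    partitions = [docs[0::2], docs[1::2]]      # step 1: partition the documents
    with Pool(2) as pool:                      # step 2: index partitions in parallel
        partials = pool.map(index_partition, partitions)
    index = merge_indexes(partials)            # step 3: merge into the final index
    print(index["retrieval"])                  # [0, 1]
```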

Parallel IR — Data Partitioning

Physical Document Partitioning

[Figure: the document set is physically split into sub-collections, each with its own inverted file]

Result merging:
- Global term statistics
- Two-phase approach
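One reading of the two bullets above: in a first phase the partitions pool their local term statistics into global ones, and in a second phase every partition scores its documents with those shared statistics, so scores are comparable across partitions. A hypothetical sketch (the data structures are illustrative):

```python
import math

def global_stats(partition_stats):
    """Phase 1: combine per-partition (local_df, n_docs) into global statistics."""
    total_docs = sum(n for _, n in partition_stats)
    df = {}
    for local_df, _ in partition_stats:
        for term, count in local_df.items():
            df[term] = df.get(term, 0) + count
    return df, total_docs

def idf(term, df, total_docs):
    """Phase 2: each partition weights terms with the *global* idf."""
    return math.log(total_docs / df[term]) if df.get(term) else 0.0

# Two partitions of 4 documents each report their local document frequencies:
parts = [({"ir": 2, "web": 1}, 4), ({"ir": 1}, 4)]
df, n = global_stats(parts)
print(df["ir"], n)  # 3 8
```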

Parallel IR — Data Partitioning

Term Partitioning

[Figure: the inverted file is partitioned by term ranges (e.g., terms 1 to 1000 on one processor), over the shared document set]

Document partitioning vs. term partitioning:

- Architecture
- Performance

Distributed IR

- Distributed IR is very similar to MIMD, but:
  - The communication channel is a network protocol
  - Document partitioning is the best choice
- Engineering issues:
  - Search protocol
  - Search server
  - Request broker
- Algorithmic issues:
  - How to distribute documents across the distributed search servers
  - How to select which server should receive a particular search request
  - How to combine the results from the different servers

Distributed IR

- Collection Partitioning
  - Replicate the collection across all of the search servers
    - Each server indexes its replica of the documents
    - Each server indexes a subset of the documents, and all sub-indexes are merged into the final index file at each search server
  - Randomly distribute the documents (as search engines do)
  - Explicitly partition the documents by semantics
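Random distribution is often implemented by hashing each document identifier to a server; a minimal sketch (the choice of MD5 is an assumption for illustration, not from the slides):

```python
import hashlib

def assign_server(doc_id: str, n_servers: int) -> int:
    """Map a document to a server deterministically by hashing its identifier."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % n_servers

# Documents spread roughly evenly across the servers:
counts = [0, 0, 0]
for i in range(3000):
    counts[assign_server(f"doc-{i}", 3)] += 1
print(counts)  # roughly [1000, 1000, 1000]
```

The hash makes each server's sub-collection a statistically similar sample of the whole, which is why random distribution pairs well with broadcasting queries.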

Distributed IR

- Source Selection
  - Always broadcast the query to each search server
  - Use a cosine similarity measure
    - Block technique
  - Build a content model of each collection by training
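Similarity-based source selection can be sketched as scoring each server's collection content model against the query and keeping the top-scoring servers; the term-frequency content models below are illustrative, not from the slides:

```python
import math

def score_collection(query_terms, content_model):
    """Cosine-style score of a query against a collection's term statistics."""
    dot = sum(content_model.get(t, 0.0) for t in query_terms)
    norm = math.sqrt(sum(w * w for w in content_model.values()))
    return dot / norm if norm else 0.0

def select_sources(query_terms, models, k=2):
    """Return the ids of the k collections whose content models best match the query."""
    ranked = sorted(models,
                    key=lambda cid: score_collection(query_terms, models[cid]),
                    reverse=True)
    return ranked[:k]

models = {
    "news":   {"election": 5.0, "sports": 1.0},
    "sports": {"football": 6.0, "sports": 4.0},
    "tech":   {"parallel": 3.0, "retrieval": 2.0},
}
print(select_sources(["parallel", "retrieval"], models, k=1))  # ['tech']
```

Broadcasting to every server avoids this selection step entirely, at the cost of load on servers that hold no relevant documents.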

Distributed IR

- Query Processing
  - Select collections to search
  - Distribute the query to the selected collections
  - Evaluate the query at the distributed collections in parallel
  - Combine the results from the distributed collections into the final result
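The four query-processing steps above can be sketched as a broker loop; the thread pool and the server interface are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def process_query(query, servers, select, merge):
    """Broker: select servers, dispatch the query in parallel, merge the results."""
    chosen = select(query, servers)                           # 1. select collections
    with ThreadPoolExecutor(max_workers=len(chosen)) as pool:
        futures = [pool.submit(s.search, query) for s in chosen]  # 2-3. distribute & evaluate
        partials = [f.result() for f in futures]
    return merge(partials)                                    # 4. combine results

class FakeServer:
    """Stand-in for a remote search server returning (doc_id, score) hits."""
    def __init__(self, hits):
        self.hits = hits
    def search(self, query):
        return self.hits

servers = [FakeServer([("d1", 0.9)]), FakeServer([("d2", 0.7)])]
merged = process_query("ir", servers,
                       select=lambda q, s: s,   # trivial selection: broadcast to all
                       merge=lambda parts: sorted((h for p in parts for h in p),
                                                  key=lambda x: -x[1]))
print(merged)  # [('d1', 0.9), ('d2', 0.7)]
```

In a real system `search` would be a network call, and `merge` is where the ranking problem below arises.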

- Ranking
  - Merge the ranked hit-lists returned by each search server
    - Global term statistics
    - Two-phase search
  - Reranking
    - Weight document scores by the collection similarity computed during the source-selection step
    - Requires term statistics and reranking

    w = 1 + |C| · (s − s̄) / s̄

where:
- |C| — number of collections searched
- s — the collection's score
- s̄ — the mean of the collection scores
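The reranking weight above can be applied directly to rescale each server's document scores; a small sketch (the example selection scores are illustrative):

```python
def collection_weight(s: float, scores: list[float]) -> float:
    """w = 1 + |C| * (s - s_mean) / s_mean for one collection's score s."""
    s_mean = sum(scores) / len(scores)
    return 1 + len(scores) * (s - s_mean) / s_mean

# Three collections with source-selection scores 1.0, 2.0, 3.0 (mean 2.0):
scores = [1.0, 2.0, 3.0]
print(collection_weight(3.0, scores))  # 2.5  -> above-average collection boosted
print(collection_weight(1.0, scores))  # -0.5 -> below-average collection penalized
```

Each document's final score would then be its server-local score multiplied by its collection's weight, pushing hits from well-matching collections up the merged ranking.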
