Information Retrieval Laboratory
Outline
Parallel Computing
Performance Measures
Parallel IR (MIMD)
Distributed IR
Parallel Computing
Parallel computing is the simultaneous application of multiple processors to
solve a single problem, where each processor works on a different part of
the problem.
[Figure: a problem divided into parts 1, 2, and 3, shown with Processor1, Processor2, and Processor3]
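The definition above can be illustrated with a minimal Python sketch. It assumes the problem is summing a list of numbers and that worker threads stand in for the processors; `split_into_parts` and `parallel_sum` are hypothetical names chosen for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(items, num_parts):
    """Split items into num_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def parallel_sum(items, num_workers=3):
    """Each worker sums one part of the problem; partial sums are then combined."""
    parts = split_into_parts(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, parts))
    return sum(partial_sums)
```

Threads are used only to keep the sketch short; a CPU-bound workload in CPython would more likely use separate processes.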
Parallel IR
Two directions
- Develop new retrieval strategies that directly lend themselves to parallel
  processing.
[Figure: Multitasking model. User queries pass through a broker to several search engines, which return results. Figure labels: search response time, resource balance.]
Parallel IR
[Figure: a user query goes to a broker, which passes it to several search engines and returns the result]
Parallel IR — Data Partitioning
      K1     K2     …    Ki     …    Kt
D1    W1,1   W2,1   …    Wi,1   …    Wt,1
D2    W1,2   W2,2   …    Wi,2   …    Wt,2
…     …      …      …    …      …    …
Dj    W1,j   W2,j   …    Wi,j   …    Wt,j
…     …      …      …    …      …    …
Dn    W1,n   W2,n   …    Wi,n   …    Wt,n
$d_j = (w_{1,j}, \ldots, w_{t,j})$, $q = (w_{1,q}, \ldots, w_{t,q})$, $\mathrm{sim}(d_j, q)$
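The slide does not say how sim(d_j, q) is computed. As a sketch, assuming the cosine measure (which the Source Selection slide mentions), the similarity between a document row of the matrix and the query vector could look like this; `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q.

    Both arguments are equal-length lists of term weights (w_1 ... w_t).
    Cosine is an assumed choice here; the slide only names sim(d_j, q).
    """
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_d * norm_q)
```

Each row D_j of the matrix is scored independently of the others, which is what makes partitioning the matrix by document or by term possible.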
Partitioning Methods
- Document Partitioning
    - Logical document partitioning
    - Physical document partitioning
- Term Partitioning
Parallel IR — Data Partitioning
Logical Document Partitioning
[Figure: the inverted list for term i, divided among processors P0, P1, and P2]
Indexing:
- Partition the documents among the processors
- Run a separate indexing process on each processor in parallel
- Merge all inverted lists into the final inverted file
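The three indexing steps above can be sketched as follows. This is a toy illustration under stated assumptions: documents are plain strings keyed by integer ID, assignment to processors is round-robin, threads stand in for processors, and all function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_documents(docs, num_processors):
    """Step 1: assign documents to processors (round-robin, an assumed policy)."""
    parts = [dict() for _ in range(num_processors)]
    for i, doc_id in enumerate(sorted(docs)):
        parts[i % num_processors][doc_id] = docs[doc_id]
    return parts


def build_inverted_lists(part):
    """Step 2: index one partition, mapping each term to a list of doc IDs."""
    index = defaultdict(list)
    for doc_id in sorted(part):
        for term in set(part[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


def merge_inverted_lists(partial_indexes):
    """Step 3: merge the per-processor lists into one inverted file."""
    merged = defaultdict(list)
    for partial in partial_indexes:
        for term, postings in partial.items():
            merged[term].extend(postings)
    return {term: sorted(postings) for term, postings in merged.items()}


def parallel_index(docs, num_processors=3):
    parts = partition_documents(docs, num_processors)
    with ThreadPoolExecutor(max_workers=num_processors) as pool:
        partial_indexes = list(pool.map(build_inverted_lists, parts))
    return merge_inverted_lists(partial_indexes)
```

The merged posting list for a term is the concatenation of the per-processor lists, so each processor's share of the list remains identifiable by document ID.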
Parallel IR — Data Partitioning
Physical Document Partitioning
[Figure: document set and inverted file 1]
Result Merging:
- Global term statistics
- Two-phase approach
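One way to read the two-phase approach is sketched below: in phase 1 the broker gathers document frequencies from every partition and derives global term statistics; in phase 2 each partition scores its own documents with those statistics and the broker merges the local hit lists. The tf-idf scoring and all function names are assumptions made for illustration, not taken from the slides.

```python
import heapq
import math


def local_statistics(partition, query_terms):
    """Phase 1 (per partition): local document count and document frequencies."""
    df = {t: sum(1 for terms in partition.values() if t in terms)
          for t in query_terms}
    return len(partition), df


def global_idf(all_stats, query_terms):
    """Phase 1 (broker): combine local statistics into global idf values."""
    total_docs = sum(n for n, _ in all_stats)
    idf = {}
    for t in query_terms:
        df = sum(local_df[t] for _, local_df in all_stats)
        idf[t] = math.log(total_docs / df) if df else 0.0
    return idf


def local_search(partition, query_terms, idf, k):
    """Phase 2 (per partition): score local documents with the global idf."""
    scores = []
    for doc_id, terms in partition.items():
        score = sum(terms.count(t) * idf[t] for t in query_terms)
        if score > 0:
            scores.append((score, doc_id))
    return heapq.nlargest(k, scores)


def two_phase_search(partitions, query_terms, k=3):
    stats = [local_statistics(p, query_terms) for p in partitions]
    idf = global_idf(stats, query_terms)
    hits = [h for p in partitions for h in local_search(p, query_terms, idf, k)]
    # Scores are comparable across partitions because idf is global.
    return [doc_id for _, doc_id in heapq.nlargest(k, hits)]
```

Here each partition is a dict mapping a document ID to its list of terms. Without the first phase, each partition would weight terms by its own local frequencies and the merged ranking could differ from a single-index ranking.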
Parallel IR — Data Partitioning
Term Partitioning
[Figure: document set and inverted file 1, covering term 1 to term 1000]
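A minimal sketch of term partitioning, assuming each processor holds the inverted lists for a contiguous range of terms (as the "term 1 to term 1000" label suggests) and that a document's score is the sum of its matching term weights. The function names and the scoring rule are hypothetical.

```python
from collections import defaultdict


def partition_terms(inverted_file, num_processors):
    """Give each processor the inverted lists for a contiguous range of sorted terms."""
    terms = sorted(inverted_file)
    chunk = -(-len(terms) // num_processors)  # ceiling division
    return [
        {t: inverted_file[t] for t in terms[i * chunk:(i + 1) * chunk]}
        for i in range(num_processors)
    ]


def evaluate_query(term_partitions, query_terms):
    """Each partition contributes partial scores for the query terms it owns."""
    scores = defaultdict(float)
    for partition in term_partitions:
        for term in query_terms:
            for doc_id, weight in partition.get(term, []):
                scores[doc_id] += weight
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

The input `inverted_file` maps each term to a list of `(doc_id, weight)` pairs. Unlike document partitioning, a query here touches only the processors that own its terms, and partial scores for the same document must be summed across processors.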
Distributed IR
Distributed IR is very similar to MIMD parallel IR, but:
- The communication channel is a network protocol
Engineering issues:
- Search protocol
- Search server
- Request broker
Algorithmic issues:
- How to distribute documents across the distributed search servers
Distributed IR
Collection Partitioning
- Replicate the collection across all of the search servers
    - Each server indexes its replica of the documents
    - Each server indexes a subset of the documents, and all sub-indexes are
      merged into the final index file at each search server
- Randomly distribute the documents (Search Engine)
- Explicitly partition the documents semantically
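The random distribution option can be sketched in a few lines. This assumes documents are identified by ID and that a seeded shuffle followed by round-robin assignment is an acceptable way to randomize; `distribute_randomly` is a hypothetical name.

```python
import random


def distribute_randomly(doc_ids, num_servers, seed=0):
    """Shuffle the document IDs, then deal them out to the servers in turn."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    servers = [[] for _ in range(num_servers)]
    for i, doc_id in enumerate(shuffled):
        servers[i % num_servers].append(doc_id)
    return servers
```

Dealing out a shuffled list keeps the servers within one document of each other in size, which is one reason random distribution tends to balance load without any knowledge of document content.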
Distributed IR
Source Selection
- Always broadcast the query to each search server
- Use a cosine similarity measure
    - Block technique
- Build a content model by training
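As a sketch of the cosine similarity option, assume each collection is summarized by a single term-weight vector (a dict from term to weight) and the broker ranks collections against the query vector. The summary representation and the names `cosine` and `select_sources` are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_sources(query, collection_vectors, top_n=2):
    """Rank collections by similarity to the query and keep the best top_n."""
    scored = [(cosine(query, vec), name)
              for name, vec in collection_vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Collections with zero similarity are never selected.
    return [(name, score) for score, name in scored[:top_n] if score > 0]
```

The returned scores are the collection similarities, which a broker could reuse later when merging results.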
Distributed IR
Query Processing
- Select the collections to search
- Distribute the query to the selected collections
- Evaluate the query at the distributed collections in parallel
- Combine the results from the distributed collections into the final result
Ranking
- Merge the ranked hit-lists returned by each search server
    - Global term statistics
    - Two-phase search
- Reranking
    - Weight document scores based on their collection similarity, computed
      during the source selection step
    - Requires term statistics and reranking
$w = 1 + |C| \cdot (s - \bar{s}) / \bar{s}$
|C|: number of collections searched
s: the collection's score
$\bar{s}$: mean of the collection scores
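The reranking weight can be tried out with a small sketch. It assumes the subtracted and dividing term in the formula is the mean of the collection scores, that collection scores are positive, and that a document's final score is its server score multiplied by its collection's weight. The function names are hypothetical.

```python
def collection_weights(collection_scores):
    """w = 1 + |C| * (s - mean) / mean for each searched collection."""
    num_collections = len(collection_scores)
    mean = sum(collection_scores.values()) / num_collections  # assumed positive
    return {
        name: 1 + num_collections * (s - mean) / mean
        for name, s in collection_scores.items()
    }


def merge_with_reranking(hit_lists, collection_scores, k=10):
    """Weight each document score by its collection's weight, then merge."""
    weights = collection_weights(collection_scores)
    merged = [
        (score * weights[name], doc_id)
        for name, hits in hit_lists.items()
        for doc_id, score in hits
    ]
    merged.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in merged[:k]]
```

With collection scores 0.6 and 0.4, the weights come out as 1.4 and 0.6, so a document from the better-matching collection can overtake a document that scored higher on its own server.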