Presented By
14/7/2009 1
Outline
Introduction
Technical Background
New Approach
Experimental Results
Conclusion
Introduction
Search engines search only the most recent version of a web page
Ranked Queries
Consists of a set of terms
Ranking Function assigns a score to each page
E.g., cosine measure
Document-at-a-time Query Processing (DAAT)
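To illustrate how a ranked query can be evaluated document-at-a-time, here is a minimal sketch. It assumes each term's posting list is already sorted by document id and stores a precomputed term weight, that every query term has weight 1, and that document norms are available; the names `daat_rank`, `postings` and `doc_norms` are hypothetical and not taken from the slides.

```python
import heapq
from itertools import groupby

def daat_rank(postings, doc_norms, query_terms):
    """Rank documents for a query, handling one document at a time.

    postings:  term -> list of (doc_id, weight), sorted by doc_id (assumed layout).
    doc_norms: doc_id -> vector length of the document (assumed precomputed).
    Returns (doc_id, score) pairs, best score first.
    """
    lists = [postings[t] for t in set(query_terms) if t in postings]
    # Merge all posting lists into one stream ordered by doc_id.
    merged = heapq.merge(*lists)
    results = []
    # All postings of one document arrive together, so its score is
    # completed before the next document is touched.
    for doc_id, group in groupby(merged, key=lambda p: p[0]):
        dot = sum(weight for _, weight in group)
        # Cosine-style score: dot product divided by the document norm
        # (the query norm is the same for every document and is omitted).
        results.append((doc_id, dot / doc_norms[doc_id]))
    results.sort(key=lambda r: (-r[1], r[0]))
    return results
```

Because each document is scored completely before the next one is read, only the current partial score needs to be kept in memory.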
Technical Background Cont…
Index Updates
• Search engine updates
Old pages are replaced by their updated versions
Non-existing pages are deleted
New pages are inserted
Efficient updates
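The three update operations can be sketched on a toy in-memory inverted index. This is only an illustration under simplifying assumptions (terms map to plain sets of page ids, no compression, no positions); the class name `TinyIndex` and its methods are hypothetical.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index supporting insert, delete and replace."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of page ids
        self.pages = {}                   # page id -> set of its terms

    def insert(self, page_id, terms):
        """Add a new page and post each of its distinct terms."""
        self.pages[page_id] = set(terms)
        for term in self.pages[page_id]:
            self.postings[term].add(page_id)

    def delete(self, page_id):
        """Remove a page that no longer exists, dropping empty posting lists."""
        for term in self.pages.pop(page_id, set()):
            self.postings[term].discard(page_id)
            if not self.postings[term]:
                del self.postings[term]

    def replace(self, page_id, terms):
        """Replace an old page by its updated version."""
        self.delete(page_id)
        self.insert(page_id, terms)
```

Replacing a page as a full delete followed by a full insert re-indexes content that did not change between versions, which is the cost that fragment sharing aims to avoid.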
Content-dependent partition
Goal: Partition a page into a set of fragments
Winnowing Algorithm
Content-dependent partition cont…
Winnowing Algorithm
Uses two hash functions
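[Figure: winnowing example]
Input: R A K A B A B A B F H M A C …
Hash:  2 1 4 1 4 1 4 8 1 7 2 1 2 1 …
       3 7 5 3 8 3 8 7 9 1 2 9 3
A window of size w slides over the hash values; the resulting cuts split the input into Block 1, Block 2 and Block 3.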
Winnowing Algorithm
1. Choose a hash function to map substrings of some
fixed small size to integer values
2. Choose a larger window size and slide this window
over the hash array. Use the following rules to
partition the file
• If a hash value in the current window is strictly smaller than all other values in the window, cut directly before it
• If several positions in the current window share the same minimum value and a cut was previously made directly before one of these positions, no cut is applied in this step. Otherwise, cut before the rightmost such position
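The two steps above can be sketched as follows. This is a minimal illustration, not the presenters' implementation: it reads the first rule as "cut before the unique minimum of the window", uses CRC32 as an arbitrary stand-in hash function, and the names `cuts_from_hashes` and `winnow_fragments` as well as the defaults `k=3` and `w=4` are assumptions.

```python
import zlib

def cuts_from_hashes(hashes, w):
    """Slide a window of size w over the hash array and return cut positions."""
    cuts = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        lowest = min(window)
        # All positions in this window that hold the minimum value.
        positions = [start + j for j, h in enumerate(window) if h == lowest]
        # Unique minimum: cut before it (if not cut there already).
        # Tied minima: keep an earlier cut at a tied position if there
        # is one, otherwise cut before the rightmost tied position.
        if not any(p in cuts for p in positions):
            cuts.add(positions[-1])
    return sorted(cuts)

def winnow_fragments(data: bytes, k: int = 3, w: int = 4):
    """Partition data into fragments at content-dependent cut points."""
    # Step 1: hash every substring of fixed small size k.
    hashes = [zlib.crc32(data[i:i + k]) for i in range(len(data) - k + 1)]
    # Step 2: choose cuts with the window rules, then slice the data.
    inner = [c for c in cuts_from_hashes(hashes, w) if c > 0]
    bounds = [0] + inner + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]
```

Because the cut positions depend only on the content inside each window, an edit in one part of a page leaves the fragments far from the edit unchanged.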
Sharing policies
Local sharing
Avoid re-indexing of a fragment if it has previously
occurred in a version of the same page
Number of fragments that need to be indexed: 13
Global sharing
If a fragment has previously occurred in any
other page, it is not indexed again
Fragments indexed: 4 + 5 + 2 = 11 fragments (total: 18)
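The difference between the two policies can be sketched by counting how many fragments have to be indexed under each. The function name `count_indexed`, the input layout and the toy data in the usage note are hypothetical; they do not reproduce the example behind the counts on this slide.

```python
def count_indexed(versions_by_page, policy):
    """Count fragments that must be indexed under a sharing policy.

    versions_by_page: page id -> list of versions, each version a list of
                      fragment keys such as fragment hashes (assumed layout).
    policy: "local"  - skip a fragment already seen in the same page;
            "global" - skip a fragment already seen in any page.
    """
    if policy not in ("local", "global"):
        raise ValueError("policy must be 'local' or 'global'")
    seen_everywhere = set()
    indexed = 0
    for versions in versions_by_page.values():
        # Local sharing starts with an empty memory for every page;
        # global sharing reuses one memory across all pages.
        seen = seen_everywhere if policy == "global" else set()
        for version in versions:
            for fragment in version:
                if fragment not in seen:
                    seen.add(fragment)
                    indexed += 1
    return indexed
```

For the toy input `{"p1": [["A", "B", "C"], ["A", "B", "D"]], "p2": [["A", "E"], ["E", "F"]]}`, which contains 10 fragment occurrences, local sharing indexes 7 fragments and global sharing indexes 6.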
Data Structures
Efficient Updates cont…
Experimental Evaluation cont…
Conclusion
Provides a new framework for indexing and query processing on textual collections with significant amounts of redundancy.
References
[1] Jiangong Zhang and Torsten Suel. Efficient search in large textual collections with redundancy. Proceedings of the Sixteenth International World Wide Web Conference, pages 412-420, 2007.
[5] V. Anh and A. Moffat. Index compression using fixed binary codewords. Proceedings of the 15th Int. Australasian Database Conference, pages 61-67, 2004.
Thank You