Seminarppt

Efficient Search in Large Textual Collections with Redundancy
Jiangong Zhang and Torsten Suel
In the Proceedings of the Sixteenth
International World Wide Web Conference
(WWW2007) Banff, CANADA, May 2007
Presented By
14/7/2009 1
Outline
 Introduction
 Technical Background
 New Approach
 Experimental Results
 Conclusion
Introduction
 Search engines search only the most recent version of a web page
 Could full-text search over large web archives be supported?
 Challenge: keeping the index at a reasonable size when there are many similar pages (e.g., different versions of the same page)
 The paper proposes a new and general framework to efficiently index and search large web page collections with redundancy
Technical Background
 Inverted Index
   Consists of a set of inverted lists
   Each list is a sequence of postings
   Posting: (docID, f, p0, ..., pf-1), where f is the frequency of the term in the document and pi is the position of the i-th occurrence (i = 0, ..., f-1)
   Query processing needs to traverse the inverted lists
 Ranked Queries
   A query consists of a set of terms
   A ranking function assigns a score to each page, e.g., the cosine measure
 Document-at-a-time Query Processing (DAAT)
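The posting layout and the document-at-a-time traversal described above can be sketched as follows. This is a minimal illustration, not the implementation from the paper: `build_index` and `daat_and` are hypothetical names, the tokenizer is a plain whitespace split, and the score is simply the sum of term frequencies standing in for a real ranking function such as the cosine measure.

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict docID -> text. Returns term -> list of postings
    (docID, f, positions), each list sorted by docID."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, plist in positions.items():
            index[term].append((doc_id, len(plist), plist))
    return index

def daat_and(index, terms):
    """Document-at-a-time traversal for a conjunctive query:
    all list cursors move forward together in docID order."""
    lists = [index.get(t, []) for t in terms]
    if not lists or any(not plist for plist in lists):
        return []
    cursors = [0] * len(lists)
    results = []
    while all(c < len(plist) for c, plist in zip(cursors, lists)):
        current = [plist[c][0] for c, plist in zip(cursors, lists)]
        target = max(current)
        if all(d == target for d in current):
            # Placeholder score (assumption): sum of term frequencies.
            score = sum(plist[c][1] for c, plist in zip(cursors, lists))
            results.append((target, score))
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the cursors that lag behind the largest docID.
            cursors = [c + (1 if plist[c][0] < target else 0)
                       for c, plist in zip(cursors, lists)]
    return results

docs = {1: "web archive search", 2: "web search engine web", 3: "archive of pages"}
index = build_index(docs)
print(daat_and(index, ["web", "search"]))
```

Because every list is sorted by docID, each posting is visited at most once per query.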
Technical Background Cont…
 Index Updates
  • Search engine updates
     Old pages are replaced by new, updated pages
     Pages that no longer exist are deleted
     New pages are inserted
Fig.1 Search engine update
Technical Background Cont…
 Update of Archival Search
   New and changed pages are inserted
   No pages are deleted
   Multiple versions of one page are kept
Fig.2 Archival search update
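The difference between the two update models can be made concrete with a small sketch. Both function names and the dictionary layout are assumptions made for illustration: a search engine keeps one current copy per URL, while an archive appends a new version whenever the content has changed and never deletes anything.

```python
def search_engine_update(collection, crawl):
    """Keep only what the latest crawl saw: changed pages are replaced,
    new pages are inserted, pages missing from the crawl are dropped."""
    return dict(crawl)

def archival_update(archive, crawl):
    """archive: dict url -> list of versions (oldest first).
    New and changed pages are appended; nothing is ever deleted."""
    for url, content in crawl.items():
        versions = archive.setdefault(url, [])
        if not versions or versions[-1] != content:
            versions.append(content)
    return archive

archive = {}
archival_update(archive, {"a": "v1", "b": "x"})
archival_update(archive, {"a": "v2"})  # "b" is missing from the crawl but stays archived
print(archive)
```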
New Approach
 Content-dependent partition: Winnowing [2]
 Sharing policies: Local Sharing and Global Sharing
 Modified query processing
 Efficient updates
Content-dependent partition
 Goal: Partition a page into a set of fragments
 Two similar pages will have many fragments in common
 Fragments are identified by a fragID
 Index fragments instead of complete documents
 Winnowing Algorithm
Cont…
 Winnowing Algorithm
   Uses two hash functions
Fig.3 Winnowing on a file with b=3 and w=5. A window of size b is used for hashing; a window of size w slides over the hash values to cut the file into blocks (Block 1, Block 2, Block 3).
  File:  R A K A B A B A B F H M A C …
  Hash:  23 17 45 13 48 13 48 87 19 71 22 19 23 …
Cont…
 Winnowing Algorithm
1. Choose a hash function that maps substrings of some fixed small size b to integer values
2. Choose a larger window size w and slide this window over the array of hash values. Use the following rules to partition the file:
   If a hash value is strictly smaller than all other values in the current window, cut directly before it
   If several positions in the current window share the same minimum value: if a cut was previously made directly before one of these positions, no cut is applied in this step; otherwise, cut before the rightmost such position
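The two cutting rules can be sketched in a few lines. This is one reading of the rules as stated on the slide, not the authors' code: the "current hash value" is taken to be the minimum of the current window, `zlib.crc32` is an arbitrary stand-in for the substring hash, and all function names are hypothetical.

```python
import zlib

def substring_hashes(text, b):
    """Hash every substring of length b (crc32 is an assumed stand-in)."""
    return [zlib.crc32(text[i:i + b].encode()) for i in range(len(text) - b + 1)]

def winnow_cuts(hashes, w):
    """Slide a window of size w over the hash array and collect cut positions."""
    cuts = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        positions = [start + k for k, h in enumerate(window) if h == smallest]
        if len(positions) == 1:
            # Rule 1: a strict minimum, cut directly before it.
            cuts.add(positions[0])
        elif not any(p in cuts for p in positions):
            # Rule 2: tied minima and no earlier cut at any of them,
            # so cut before the rightmost one.
            cuts.add(positions[-1])
    return sorted(cuts)

def split_at(text, cuts):
    """Cut the text before each cut position and return the fragments."""
    bounds = [0] + [c for c in cuts if 0 < c < len(text)] + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

text = "the quick brown fox jumps over the lazy dog"
fragments = split_at(text, winnow_cuts(substring_hashes(text, b=3), w=5))
print(fragments)
```

Since each cut depends only on the hash values inside a window, an edit in one part of a page leaves the fragment boundaries elsewhere unchanged, which is why similar pages share many fragments.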
Sharing policies
 Local sharing
   Avoid re-indexing a fragment if it has previously occurred in a version of the same page
   Number of fragments that need to be indexed: 13
Fig.4 Local sharing
Sharing policies cont…
 Global sharing
   If a fragment has previously occurred in any other page, it is not indexed again
   Fragments indexed: 4 + 5 + 2 = 11 (total: 18)
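A small sketch can show how the two sharing policies differ in the number of fragments that must be indexed. The pages and fragments below are made up for illustration and do not correspond to the figures in the slides; `count_indexed` is a hypothetical helper, and fragments are compared by their content directly rather than by a hash.

```python
def count_indexed(pages, policy):
    """pages: dict page -> list of versions, each a list of fragment contents.
    policy: "local" shares fragments only among versions of the same page,
            "global" shares them across all pages."""
    seen_global = set()
    indexed = 0
    for page, versions in pages.items():
        seen_local = set()
        seen = seen_global if policy == "global" else seen_local
        for version in versions:
            for fragment in version:
                if fragment not in seen:
                    seen.add(fragment)
                    indexed += 1
    return indexed

pages = {
    "p1": [["a", "b", "c"], ["a", "b", "d"]],
    "p2": [["a", "x"], ["a", "y"]],
}
total = sum(len(v) for versions in pages.values() for v in versions)
print(total, count_indexed(pages, "local"), count_indexed(pages, "global"))
```

In this toy collection fragment "a" is indexed once per page under local sharing but only once overall under global sharing.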
Data Structures
Fig.6 Standard data structures
 Inverted Index: Consists of inverted lists sorted by docID.
 Dictionary: Stores a pointer to the start of the inverted list for each term, plus other statistical information.
 Page Table: Stores the complete URL, length of the document, PageRank, and other useful information about the document.
Data Structures cont…
Fig.7 Additional data structures
 Doc/Version Table: Stores information about a page and its various versions.
 Hash Table: Stores a hash value of the content of each distinct fragment.
 Reuse Table: Stores information about a fragment, such as the other pages in which the fragment occurs.
Modified Query Processing
 Local Sharing Query Processing
   Identify pages that contain all query words.
   Check whether any version of each such page contains all the words, and compute the actual score for the page or version.
 Global Sharing Query Processing
   Identify pages that contain all query words.
   Use the Reuse Table and the Doc/Version Table, and compute the actual score for the page or version.
Both require additional computation and memory.
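The two-step procedure for local sharing can be sketched as below. The layout of `frag_index` and `version_table`, the function names, and the frequency-sum score are all assumptions for illustration. The example also shows why the second step is needed: a page can contain all query words across its fragments although no single version does.

```python
def candidate_pages(frag_index, terms):
    """Step 1: pages in which every query term occurs in some fragment."""
    if not terms:
        return set()
    pages_per_term = [{pid for (pid, _frag_no) in frag_index.get(t, {})}
                      for t in terms]
    return set.intersection(*pages_per_term)

def local_sharing_query(frag_index, version_table, terms):
    """Step 2: keep only versions whose own fragments cover all terms.
    Returns (page, version number, score); the score is a placeholder."""
    results = []
    for pid in sorted(candidate_pages(frag_index, terms)):
        for vno, frags in enumerate(version_table[pid]):
            freqs = [sum(f for (p, frag_no), f in frag_index[t].items()
                         if p == pid and frag_no in frags)
                     for t in terms]
            if all(freqs):
                results.append((pid, vno, sum(freqs)))
    return results

# term -> {(page, fragment number): frequency}
frag_index = {"web": {(1, 0): 1}, "archive": {(1, 1): 1}, "search": {(1, 2): 1}}
# page -> list of versions, each the set of fragment numbers it contains
version_table = {1: [{0, 1}, {0, 2}]}

print(local_sharing_query(frag_index, version_table, ["web", "archive"]))
print(local_sharing_query(frag_index, version_table, ["archive", "search"]))
```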
Efficient Updates
Fig.8 Index Updates
 Partition the page into fragments.
 Hash the content of each fragment.
Efficient Updates cont…
 Index a fragment only if its hash is not already in the Hash Table.
 Add a posting for each term in the fragment to the inverted index.
   Posting: (fragID, f, p0, ..., pf-1), where f is the frequency of the term in the fragment and pi is the position of the i-th occurrence in the fragment.
   fragID: (docID of the fragment's primary page, fragment number)
 Update or insert the appropriate records in the various tables.
 All new postings are first inserted into a main-memory structure and later periodically merged into disk-based structures.
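The update steps can be sketched with a small in-memory class. This is a simplified illustration under several assumptions: the class and attribute names are hypothetical, SHA-1 is an arbitrary choice of content hash, sharing is global, the page is assumed to be already partitioned into fragments, and the Reuse Table and the periodic merge into disk-based structures are left out.

```python
import hashlib
from collections import defaultdict

class FragmentIndex:
    def __init__(self):
        self.hash_table = {}                  # content hash -> fragID
        self.inverted = defaultdict(list)     # term -> [(fragID, f, positions)]
        self.versions = defaultdict(list)     # docID -> list of versions (lists of fragIDs)
        self.next_frag_no = defaultdict(int)  # docID -> next unused fragment number

    def add_version(self, doc_id, fragments):
        """Add one version of a page, given as a list of fragment texts."""
        frag_ids = []
        for text in fragments:
            digest = hashlib.sha1(text.encode()).hexdigest()
            frag_id = self.hash_table.get(digest)
            if frag_id is None:
                # Unseen content: this page becomes the fragment's primary page.
                frag_id = (doc_id, self.next_frag_no[doc_id])
                self.next_frag_no[doc_id] += 1
                self.hash_table[digest] = frag_id
                positions = defaultdict(list)
                for pos, term in enumerate(text.split()):
                    positions[term].append(pos)
                for term, plist in positions.items():
                    self.inverted[term].append((frag_id, len(plist), plist))
            frag_ids.append(frag_id)
        self.versions[doc_id].append(frag_ids)
        return frag_ids

idx = FragmentIndex()
v1 = idx.add_version(1, ["web archive", "search engine"])
v2 = idx.add_version(1, ["web archive", "new text"])
v3 = idx.add_version(2, ["search engine", "other"])
print(v1, v2, v3)
```

Only fragments with unseen content add postings, so re-crawling a mostly unchanged page costs little more than hashing its fragments.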
Experimental Evaluation
Fig.9 Cumulative percentage of unique fragments versus week of crawl
 The experiment used a data set from Stanford WebBase: a total of 6,356,374 versions of pages from 2,528,362 distinct URLs.
 The number of fragments is reduced when duplicate fragments are eliminated under the local sharing policy.
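As a quick check on how much redundancy the data set offers, the two figures quoted above give the average number of versions per URL. The variable names are illustrative only.

```python
versions = 6_356_374   # total page versions in the data set
urls = 2_528_362       # distinct URLs
avg_versions_per_url = versions / urls
print(f"{avg_versions_per_url:.2f} versions per URL on average")
```

Each URL therefore appears in roughly two and a half versions on average.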
Experimental Evaluation cont…
Fig.10 Comparison of the number of fragments under different policies.
Fig.11 Relative reduction in the number of fragments and positions
 Global sharing yields a smaller index than local sharing.
Conclusion
 Provides a new framework for indexing and query processing on textual collections with significant amounts of redundancy.
 Results in significant reductions in index size and query processing cost.
 Supports highly efficient updates.
 Can be used for applications such as desktop search and indexing of versioning file systems that retain old versions of all files.
References
[1] J. Zhang and T. Suel. Efficient search in large textual collections with redundancy. Proceedings of the Sixteenth International World Wide Web Conference, pages 412-420, 2007.
[2] S. Schleimer, D. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76-85, 2003.
[3] L. Lim, M. Wang, S. Padmanabhan, J. Vitter, and R. Agarwal. Dynamic maintenance of web indexes using landmarks. Proceedings of the 12th Int. World Wide Web Conference, pages 102-111, 2003.
[4] F. Scholer, H. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. Proceedings of the 25th Annual SIGIR Conference on Research and Development in Information Retrieval, pages 222-229, 2002.
[5] V. Anh and A. Moffat. Index compression using fixed binary codewords. Proceedings of the 15th Int. Australasian Database Conference, pages 61-67, 2004.
Thank You
