
The Anatomy of a Large-Scale Hypertextual Web Search Engine


Sergey Brin and Lawrence Page

The Original Google Paper
"Google" is a common spelling of googol, or
10^100, which fits well with the authors' goal of
building very large-scale search engines.

Outline
Design goals
System features
System anatomy
Results and performance
Paper analysis

Design Goals
1. Scale with the rapid growth of the web

[Chart: Webpages Indexed and Queries/day, by year]
1994: 110,000 webpages indexed; 1,500 queries/day
1997: 100,000,000 webpages indexed; 20,000,000 queries/day
2000: 1,000,000,000 webpages indexed; 100,000,000 queries/day

Design Goals
2. Improved Search Quality

The number of documents on the web is increasing
rapidly, but users' ability to look at them lags.

Current search engines return lots of junk results
with little relevance. (Note: We're talking about the
year 1998.)

3. Academic Search Engine Research

Push more development and understanding into the
academic realm.

Build systems that a reasonable number of people can use.

Build an architecture to support novel research
activities on large-scale web data.

System Features
1. Makes use of the link structure of
the Web to calculate a quality
ranking for each page, called the
PageRank.
A probability distribution used to
represent the likelihood that a person
randomly clicking on links will arrive
at any particular page.
It considers the importance of each
page that casts a vote, as votes from
some pages are considered to have
greater value, thus giving the linked
page greater value.

PageRank: Bringing Order to the Web

$$PR(A) = (1 - d) + d \sum_{T_i \in L(A)} \frac{PR(T_i)}{C(T_i)}$$

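PR(A): PageRank of a webpage A
PR(Ti): PageRank of a webpage Ti pointing to A
C(Ti): Number of outbound links for webpage Ti
L(A): Set of webpages linking to A
d: damping factor, a value between 0 and 1; the
probability that a random surfer keeps following links
(with probability 1 - d the surfer stops clicking and
jumps to a random page)
Note that PageRanks are meant to form a probability
distribution over webpages, so the PageRanks of all
webpages sum to 1. (Strictly, this requires dividing the
(1 - d) term by the number of pages N; with the formula
as written, the PageRanks sum to N.)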

PageRank: Bringing Order to the Web
Assume a universe of 4 webpages: A, B,
C, and D

$$PR(A) = \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3}$$

Taking into consideration that a random
surfer will eventually stop clicking, we
assume a damping factor, d, which is
generally assumed to be 0.85:

$$PR(A) = (1 - d) + d \left( \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3} \right)$$
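The four-page example can be checked numerically. The sketch below repeatedly applies the damped formula until the values settle. The slides give only the outbound link counts (2, 1, and 3), so the concrete link targets used here, and the choice that A links nowhere, are assumptions made for illustration; the function name `pagerank` and the iteration count are likewise hypothetical.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = list(out_links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the shares contributed by every page that links to `page`.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in out_links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Assumed link structure matching the outbound counts 2, 1, and 3.
links = {
    "A": [],               # assumption: A has no outbound links
    "B": ["A", "C"],       # C(B) = 2
    "C": ["A"],            # C(C) = 1
    "D": ["A", "B", "C"],  # C(D) = 3
}

ranks = pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

Under these assumptions A ends up with the highest value, since every other page votes for it.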

System Features
2. Makes use of the anchor text of links on
webpages:
E.g. <a href="http://www.yahoo.com">Yahoo!</a>
Text of a link is not only associated with the
webpage it is on, it also gives information
(sometimes more relevant) to the webpage it
points to.
Anchors may exist for documents which generally
cannot be indexed by text-based search engines,
such as images, programs, and databases.
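A minimal sketch of the idea, assuming Python's standard `html.parser` module: collect each link's anchor text and file it under the target URL rather than the page it appears on. The class name `AnchorCollector` and the `anchor_index` mapping are hypothetical; the paper does not prescribe this structure.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects anchor text keyed by the URL the link points to."""

    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)  # target URL -> list of anchor texts
        self._href = None                 # href of the <a> tag being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors[self._href].append("".join(self._text).strip())
            self._href = None


collector = AnchorCollector()
collector.feed('<p>Try <a href="http://www.yahoo.com">Yahoo!</a> today.</p>')
collector.close()
anchor_index = dict(collector.anchors)
print(anchor_index)
```

Because the text is stored against the target, a page can be found by words that never appear in its own content.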

System Features
3. Uses location information for all hits and
thus makes extensive use of proximity in
search.
4. Keeps track of visual presentation details
such as font size. Words in a larger or bolder
font are given more importance.
5. Stores complete raw HTML of webpages in
repository.
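To illustrate why stored positions matter for point 3, the sketch below computes the smallest gap between occurrences of two words from their sorted position lists. It is only an illustration: the helper name `min_distance` and the sample positions are hypothetical, and the paper's actual proximity scoring is more elaborate.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any occurrence of word A and any occurrence of word B.

    Both inputs are sorted lists of word positions within one document.
    """
    i = j = 0
    best = float("inf")
    while i < len(positions_a) and j < len(positions_b):
        best = min(best, abs(positions_a[i] - positions_b[j]))
        # Advance whichever list is currently behind.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


adjacent = min_distance([3, 40], [4, 90])  # the two words appear side by side
far_apart = min_distance([3], [50])        # the two words are unrelated
print(adjacent, far_apart)
```

A ranking function could then favor documents where the gap is small.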

System Anatomy

Major Data Structures


1. BigFiles
Virtual files spanning multiple file systems,
addressable by 64-bit integers.

2. Repository
Contains the full compressed HTML of all pages.
Pages are stored one after another, each prefixed
with docID, length, and URL.
Compressed with zlib, which favors speed, rather
than bzip, which offers a higher compression ratio.
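The repository layout can be sketched with Python's `struct` and `zlib` modules. The header field widths below (8-byte docID, 4-byte URL length, 4-byte compressed page length) and the function names are assumptions for illustration, not the paper's exact on-disk format.

```python
import struct
import zlib

# Assumed header layout: docID (8 bytes), URL length (4), compressed page length (4).
HEADER = struct.Struct("<QII")


def pack_record(doc_id, url, html):
    """Serialize one page as header + URL + zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    compressed = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(compressed)) + url_bytes + compressed


def unpack_records(blob):
    """Walk the records stored one after another and yield (docID, URL, HTML)."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, page_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + page_len]).decode("utf-8")
        offset += page_len
        yield doc_id, url, html


repository = (
    pack_record(1, "http://a.example/", "<html>A</html>")
    + pack_record(2, "http://b.example/", "<html>B</html>")
)
records = list(unpack_records(repository))
print(records)
```

Storing the lengths in the prefix lets a reader skip from one record to the next without decompressing anything.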

Major Data Structures


3. Document Index
Keeps information about each document.
It is a fixed-width index, ordered by docID.
Stores the document status, a pointer into the
repository, and a checksum.
If the document is indexed, it points to a variable-width
file, docinfo, which contains the URL and title. Otherwise
it points to the URLlist, which contains only the URL.

4. Lexicon
Contains a list of null-separated words (about 14
million) and a hash table of pointers.
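A small sketch of the lexicon idea, assuming a byte buffer of null-separated words plus a hash table for lookups. The class `Lexicon`, its method names, and the choice to hand out wordIDs in insertion order are hypothetical.

```python
class Lexicon:
    """Null-separated word list plus a hash table from word to wordID."""

    def __init__(self):
        self.words = bytearray()  # words stored back to back, each ending in a null byte
        self.offsets = []         # wordID -> byte offset of the word in `words`
        self.ids = {}             # hash table: word -> wordID

    def word_id(self, word):
        """Return the wordID for `word`, adding it to the lexicon if it is new."""
        wid = self.ids.get(word)
        if wid is None:
            wid = len(self.offsets)
            self.ids[word] = wid
            self.offsets.append(len(self.words))
            self.words += word.encode("utf-8") + b"\x00"
        return wid

    def word(self, wid):
        """Recover the word for a wordID by reading up to the next null byte."""
        start = self.offsets[wid]
        end = self.words.index(b"\x00", start)
        return self.words[start:end].decode("utf-8")


lexicon = Lexicon()
ids = [lexicon.word_id(w) for w in ["web", "search", "web"]]
print(ids, lexicon.word(1), bytes(lexicon.words))
```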

Major Data Structures


5. Hit Lists
A list of occurrences of a particular word in a
particular document including position, font, and
capitalization information.
Hit lists account for most of the space used in both
the forward and the inverted indices.

6. Forward Index
Stored in a number of barrels.
If a document contains words that fall into a
particular barrel, the docID is recorded into the
barrel followed by a list of wordIDs with their hitlists.
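Since hit lists dominate index size, a compact encoding matters. The paper describes a two-byte plain hit made of a capitalization bit, three bits of font size, and twelve bits of word position; the sketch below packs and unpacks that layout. The function names and the decision to clamp large positions to 4095 are assumptions for illustration.

```python
def pack_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 bit capitalization, 3 bits font, 12 bits position."""
    position = min(position, 4095)  # assumption: clamp positions that do not fit in 12 bits
    return (int(capitalized) << 15) | ((font_size & 0x7) << 12) | position


def unpack_hit(hit):
    """Inverse of pack_hit: return (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


hit = pack_hit(True, 3, 17)
print(hex(hit), unpack_hit(hit))
```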

Major Data Structures


7. Inverted Index
The inverted index consists of the same barrels as
the forward index, except that they have been
processed by the sorter.
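The relationship between the two indices can be sketched as a regrouping step: a forward barrel lists wordIDs per docID, and sorting by wordID yields a doclist per word. The in-memory tuples and the function name `invert` are hypothetical simplifications of the on-disk barrels.

```python
from collections import defaultdict

# Forward barrel: (docID, [(wordID, hit positions), ...]) in docID order.
forward_barrel = [
    (1, [(10, [3, 8]), (42, [5])]),
    (2, [(42, [1, 9])]),
    (3, [(7, [2]), (10, [4])]),
]


def invert(forward):
    """Regroup a forward barrel by wordID, producing wordID -> [(docID, hits), ...]."""
    inverted = defaultdict(list)
    for doc_id, entries in forward:
        for word_id, hits in entries:
            inverted[word_id].append((doc_id, hits))
    # Order by wordID, and keep each doclist ordered by docID.
    return {word_id: sorted(docs) for word_id, docs in sorted(inverted.items())}


inverted_barrel = invert(forward_barrel)
print(inverted_barrel)
```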

Crawling the Web


1. Several distributed crawlers.
The URLserver serves lists of URLs to the crawlers.
Each crawler keeps ~300 connections open at once.
At peak, a system of 4 crawlers can crawl ~100
pages/sec, or ~600 KB/sec of data.
Each crawler maintains its own DNS cache for fast lookups.

2. The parser handles a huge array of possible errors,
including HTML errors, non-ASCII characters,
and HTML tags nested hundreds deep.

Indexing the Web


3. Indexing Documents into Barrels
After each document is parsed, it is encoded into a
number of barrels.
Every word is converted into a wordID using an
in-memory hash table, the lexicon.
Once words are converted into wordIDs, their
occurrences in the current document are translated
into hit lists and are written into the forward barrels.

4. Sorting
The sorter takes each of the forward barrels and sorts it by
wordID to produce an inverted barrel for title and
anchor hits and a full-text inverted barrel.
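The indexing step can be sketched as follows: split a document into words, map each word to a wordID through an in-memory table, collect positions as hits, and append the result to the forward barrel that owns that wordID. Assigning barrels by wordID range, the constant `WORDS_PER_BARREL`, the whitespace tokenizer, and the plain `dict` used as the lexicon are all assumptions for illustration.

```python
from collections import defaultdict

WORDS_PER_BARREL = 2  # hypothetical: each barrel owns a small range of wordIDs


def index_document(doc_id, text, lexicon, barrels):
    """Convert one parsed document into forward-barrel entries."""
    hits = defaultdict(list)  # wordID -> positions of that word in this document
    for position, word in enumerate(text.lower().split()):
        word_id = lexicon.setdefault(word, len(lexicon))  # new words get the next ID
        hits[word_id].append(position)

    # Group this document's words by the barrel that owns their wordID range.
    per_barrel = defaultdict(list)
    for word_id, positions in hits.items():
        per_barrel[word_id // WORDS_PER_BARREL].append((word_id, positions))

    # Each barrel records the docID followed by its wordIDs and hit lists.
    for barrel_no, entries in per_barrel.items():
        barrels[barrel_no].append((doc_id, entries))


lexicon = {}
barrels = defaultdict(list)
index_document(1, "the quick the fox", lexicon, barrels)
print(lexicon)
print(dict(barrels))
```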

Searching
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
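The query steps above can be sketched in a few lines. This is a simplification under stated assumptions: barrels are plain dictionaries from wordID to a list of docIDs, set intersection stands in for scanning the doclists in parallel, and `rank` is a caller-supplied scoring function. All names and the sample data are hypothetical.

```python
def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    """Return up to k docIDs matching every query word, best rank first."""
    words = query.lower().split()                              # step 1: parse the query
    if not words or any(word not in lexicon for word in words):
        return []
    word_ids = [lexicon[word] for word in words]               # step 2: words -> wordIDs

    matches = {}
    for barrel in (short_barrel, full_barrel):                 # steps 3 and 6
        doclists = [set(barrel.get(wid, ())) for wid in word_ids]
        for doc_id in set.intersection(*doclists):             # step 4: match all terms
            matches.setdefault(doc_id, rank(doc_id, word_ids))  # step 5: rank the match

    # Step 8: sort by rank (ties broken by docID) and keep the top k.
    ranked = sorted(matches.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


lexicon = {"bill": 0, "clinton": 1}
short_barrel = {0: [2], 1: [2, 5]}          # title and anchor hits
full_barrel = {0: [1, 2, 3], 1: [2, 3, 5]}  # full-text hits
scores = {2: 0.9, 3: 0.4}                   # made-up ranks for the matching documents

results = search("Bill Clinton", lexicon, short_barrel, full_barrel,
                 lambda doc_id, word_ids: scores[doc_id])
print(results)
```

Checking the short barrels first means documents matching in titles or anchors are found before the much larger full-text barrels are consulted.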

Results and Performance


A qualitative analysis of the search results by
users has generally been positive.
The current version of Google answers most
queries in between 1 and 10 seconds.
Because Google takes the proximity of word
occurrences into account, its results are more
relevant than those of search engines that merely
return pages containing all the query words. (E.g.
a search for "bill clinton" gives lower importance
to results where "bill" and "clinton" appear independently.)

Future Work
Search times in the current version of Google are
dominated by disk I/O. Planned improvements include
query caching and hardware, software, and algorithmic
optimizations.
Improve search efficiency and quickly scale to
~100 million web pages.
Develop Google into a large-scale research tool and
resource for searchers and researchers.

Analysis of the Research Paper
Pros
One of the first descriptions of the PageRank
algorithm, which changed how search engines
ranked and indexed the web.
Using the citation graph and anchor text to rank pages
closely resembles how users themselves rank
websites.
Google is a complete architecture for gathering web
pages, indexing them, and performing search
queries over them.
The paper states that Google does not compromise
PageRank for monetary gain, which gives more
credibility to search results. This holds true to date.

Analysis of the Research Paper
Cons
One of the first flaws found in the PageRank
algorithm was the Google Bomb:
Because of the PageRank, a page will be ranked
higher if the sites that link to that page use
consistent anchor text.
A Google bomb is created if a large number of
sites link to the page in this manner.
Ranking quality is insufficient using only PageRank
and anchor text. (Google today uses more than
200 different parameters to judge quality of a
webpage.)

Thank You
Presented by: Nilay Khandelwal