
The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page

The Original Google Paper
"Google" is a common spelling of googol, or 10^100, which fit well with the authors' goal of
building very large-scale search engines.

Design goals
System features
System anatomy
Results and performance
Paper analysis

Design Goals
1. Scale with the rapid growth of the web



[Chart: Webpages Indexed. Data labels recovered from the slide: 110,000 and 1,500]





Design Goals
2. Improved Search Quality

The number of documents on the web is increasing rapidly, but users' ability to look at them lags.
Current search engines return many junk results with little relevance. (Note: we're talking about the
year 1998.)

3. Academic Search Engine Research

Push more development and understanding into the academic realm.
Build systems that a reasonable number of people can actually use.
Build an architecture to support novel research activities on large-scale web data.


System Features
1. Makes use of the link structure of
the Web to calculate a quality
ranking for each page, called PageRank.
PageRank is a probability distribution that
represents the likelihood that a person
randomly clicking on links will arrive
at any particular page.
It considers the importance of each
page that casts a vote: votes from
some pages carry greater value,
and so give the linked page greater value.

PageRank: Bringing Order to the Web

$$PR(A) = (1 - d) + d \sum_{T_i \in L(A)} \frac{PR(T_i)}{C(T_i)}$$

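PR(A): PageRank of a webpage A
PR(Ti): PageRank of a webpage Ti pointing to A
C(Ti): number of outbound links on webpage Ti
L(A): set of webpages linking to A
d: damping factor, a value between 0 and 1; it is the
probability that a random surfer keeps clicking links, so 1 - d is
the probability that the surfer stops and jumps to a random page
Note: the paper describes PageRanks as a probability distribution over
webpages, so that they sum to 1. Strictly, with a constant (1 - d) term the
ranks sum to the number of pages N; they sum to 1 when that term is (1 - d)/N.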

PageRank: Bringing Order to the Web
Assume a universe of 4 webpages: A, B,
C, and D




Taking into consideration that a random
surfer will eventually stop clicking, we
assume a damping factor, d, which is
generally set to 0.85.

$$PR(A) = (1 - d) + d \sum_{T_i \in L(A)} \frac{PR(T_i)}{C(T_i)}, \quad L(A) \subseteq \{B, C, D\}$$
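
The iteration behind this formula can be sketched in Python. This is only an illustration: the slide's figure of the four-page universe did not survive, so the link graph below is hypothetical, as are the function name `pagerank` and the iteration count. The sketch assumes every page has at least one outbound link and appears as a key in `links`.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    out_count = {page: len(targets) for page, targets in links.items()}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum the shares contributed by every page that links to `page`.
            incoming = sum(
                pr[src] / out_count[src] for src in pages if page in links[src]
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical link graph for the universe A, B, C, D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, rank in sorted(pagerank(links).items()):
    print(page, round(rank, 3))
```

In this form of the formula the ranks sum to the number of pages (here 4) rather than to 1. A page with no inbound links, such as D in this graph, ends up with the minimum rank of 1 - d.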


System Features
2. Makes use of the anchor text of links.
E.g. <a href="http://www.yahoo.com">Yahoo!</a>
The text of a link is associated not only with the
webpage it is on; it also describes
(sometimes more accurately) the webpage it
points to.
Anchors may exist for documents that generally
cannot be indexed by text-based search engines,
such as images, programs, and databases.
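
As a rough illustration of this idea, the sketch below credits each link's anchor text to the page it points to rather than to the page it appears on. It uses Python's standard `html.parser`; the class `AnchorCollector`, the function `index_anchor_text`, and the sample page are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_anchor_text(pages):
    """Map each target URL to the anchor texts of the links pointing at it."""
    anchor_index = defaultdict(list)
    for html in pages.values():
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        for href, text in collector.anchors:
            anchor_index[href].append(text)
    return anchor_index


# Hypothetical crawled page that links to Yahoo!.
pages = {"http://example.com/": '<p>Try <a href="http://www.yahoo.com">Yahoo!</a></p>'}
print(dict(index_anchor_text(pages)))
```

Because the index is keyed by the target URL, a target that was never crawled, or that has no text of its own (an image, say), can still be matched through the words other pages use to describe it.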

System Features
3. Uses location information for all hits and
thus makes extensive use of proximity in search.
4. Keeps track of the visual presentation of text
on webpages, such as font size. Words in a
bolder or larger font are given more importance.
5. Stores the complete raw HTML of webpages in a repository.


Major Data Structures

1. BigFiles
Virtual files spanning multiple file systems and
addressable by 64-bit integers.

2. Repository
Contains full compressed HTML of all pages.
Stored one after another prefixed with docID,
length, and URL.
Compressed using a high-speed compression
technique (zlib) rather than one with a higher compression ratio.
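
A minimal sketch of such a repository, assuming a made-up record layout: a fixed header holding the docID, URL length, and compressed length, followed by the URL and the zlib-compressed HTML. The header format `<QHI` and the helper names are hypothetical; the paper's actual on-disk format differs in detail.

```python
import struct
import zlib

# Hypothetical header: docID (uint64), URL length (uint16), compressed length (uint32).
HEADER = struct.Struct("<QHI")


def pack_record(doc_id, url, html):
    """Serialize one page as header + URL + compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"), 1)  # level 1 favours speed over ratio
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def read_records(blob):
    """Walk records stored one after another and yield (doc_id, url, html)."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, body_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


repository = pack_record(1, "http://a.example/", "<html>first</html>")
repository += pack_record(2, "http://b.example/", "<html>second</html>")
for record in read_records(repository):
    print(record)
```

Because every record carries its own lengths, the repository can be scanned sequentially without any other data structure, which is what allows the other indexes to be rebuilt from it.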

Major Data Structures

3. Document Index
Keeps information about each document.
It is a fixed-width index, ordered by docID.
Stores document status, a pointer into the
repository, and a checksum.
If the document has been crawled, it points to a variable-width
file, docinfo, which contains the URL and title. Otherwise it
points to the URLlist, which contains only the URL.

4. Lexicon
Contains a list of null-separated words (about 14
million) and a hash table of pointers.
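
One way to picture this layout is a single byte string of null-separated words plus a hash table that maps each word to its byte offset in that string. The sketch below is an illustration under that assumption; `build_lexicon` and `word_at` are hypothetical helpers, and using a Python `dict` as the hash table is a simplification.

```python
def build_lexicon(words):
    """Return (blob, offsets): null-separated words and a word -> byte offset table."""
    blob = bytearray()
    offsets = {}
    for word in words:
        if word not in offsets:
            offsets[word] = len(blob)  # pointer to where this word starts
            blob += word.encode("utf-8") + b"\0"
    return bytes(blob), offsets


def word_at(blob, offset):
    """Read the word that starts at `offset`, up to its terminating null byte."""
    end = blob.index(b"\0", offset)
    return blob[offset:end].decode("utf-8")


blob, offsets = build_lexicon(["web", "search", "web", "engine"])
print(blob)
print(offsets)
print(word_at(blob, offsets["search"]))
```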

Major Data Structures

5. Hit Lists
A list of occurrences of a particular word in a
particular document including position, font, and
capitalization information.
Hit lists account for most of the space used in both
the forward and the inverted indices.

6. Forward Index
Stored in a number of barrels.
If a document contains words that fall into a
particular barrel, the docID is recorded in the
barrel, followed by a list of wordIDs with their hit lists.
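
The relationship between hit lists and forward barrels can be sketched as follows. The barrel width, the rule that a barrel covers a contiguous range of wordIDs, and the shape of a hit as a (position, capitalized) pair are assumptions for this example; font information is left out for brevity.

```python
from collections import defaultdict

WORDS_PER_BARREL = 2  # hypothetical width, kept tiny so the example spans two barrels


def add_document(forward_barrels, lexicon, doc_id, tokens):
    """Record one parsed document in the forward barrels."""
    hit_lists = defaultdict(list)
    for position, token in enumerate(tokens):
        # Unseen words get the next free wordID (a stand-in for the real lexicon).
        word_id = lexicon.setdefault(token.lower(), len(lexicon))
        hit_lists[word_id].append((position, token[:1].isupper()))
    for word_id, hits in hit_lists.items():
        # Each barrel covers a range of wordIDs and is keyed by docID.
        barrel = forward_barrels[word_id // WORDS_PER_BARREL]
        barrel.setdefault(doc_id, []).append((word_id, hits))


forward_barrels = defaultdict(dict)
lexicon = {}
add_document(forward_barrels, lexicon, 7, ["Google", "ranks", "pages", "Google"])
print(dict(forward_barrels))
```

Most of the stored data is the hits themselves, which is why hit lists dominate the space used by the indices.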

Major Data Structures

7. Inverted Index
The inverted index consists of the same barrels as
the forward index, except that they have been
processed by the sorter.

Crawling the Web

1. Several distributed crawlers.
A URLserver serves lists of URLs to the crawlers.
Each crawler keeps ~300 connections open at once.
At peak, a system of 4 crawlers can crawl ~100
pages per second, or ~600 KB of data per second.
Each crawler maintains its own DNS cache for fast lookups.

2. The parser handles a huge array of possible errors,
including HTML errors, non-ASCII characters,
and HTML tags nested hundreds deep.

Indexing the Web

3. Indexing Documents into Barrels
After each document is parsed, it is encoded into a
number of barrels.
Every word is converted into a wordID using an in-memory hash table, the lexicon.
Once words are converted into wordIDs, their
occurrences in the current document are translated
into hit lists and are written into the forward barrels.

4. Sorting
The sorter takes each of the forward barrels and sorts it by
wordID to produce an inverted barrel for title and
anchor hits and a full-text inverted barrel.
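
The sorting step can be sketched as turning docID-ordered records into wordID-ordered postings. The record shape used here, `(doc_id, [(word_id, hit_list), ...])`, and the function name `invert_barrel` are assumptions for illustration; the real sorter works on barrels on disk rather than on in-memory lists.

```python
def invert_barrel(forward_barrel):
    """Sort one forward barrel by wordID to obtain an inverted barrel.

    forward_barrel is a list of (doc_id, [(word_id, hit_list), ...]) records.
    The result maps each word_id to a doclist of (doc_id, hit_list) pairs.
    """
    postings = [
        (word_id, doc_id, hit_list)
        for doc_id, entries in forward_barrel
        for word_id, hit_list in entries
    ]
    # Order by wordID first, then by docID within each word.
    postings.sort(key=lambda posting: (posting[0], posting[1]))
    inverted = {}
    for word_id, doc_id, hit_list in postings:
        inverted.setdefault(word_id, []).append((doc_id, hit_list))
    return inverted


forward_barrel = [(2, [(5, [0]), (3, [1])]), (1, [(3, [4, 9])])]
print(invert_barrel(forward_barrel))
```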


1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
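
The query steps above can be sketched in Python under simplifying assumptions: each barrel is a dict mapping a wordID to a sorted list of docIDs, and `rank` is a caller-supplied scoring function. The names `intersect`, `search`, and `rank`, and the sample data, are hypothetical and not taken from the paper.

```python
import heapq


def intersect(doclists):
    """Yield docIDs that appear in every sorted doclist."""
    if not doclists:
        return
    positions = [0] * len(doclists)
    while all(pos < len(dl) for pos, dl in zip(positions, doclists)):
        current = [dl[pos] for pos, dl in zip(positions, doclists)]
        highest = max(current)
        if all(doc == highest for doc in current):
            yield highest
            positions = [pos + 1 for pos in positions]
        else:
            # Advance every list that is still behind the highest docID seen.
            positions = [pos + (doc < highest) for pos, doc in zip(positions, current)]


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    word_ids = [lexicon.get(word) for word in query.lower().split()]
    if not word_ids or None in word_ids:
        return []  # some query word is not in the lexicon
    matched = {}
    for barrel in (short_barrel, full_barrel):  # short barrels first, then full
        doclists = [barrel.get(word_id, []) for word_id in word_ids]
        for doc_id in intersect(doclists):
            if doc_id not in matched:
                matched[doc_id] = rank(doc_id, word_ids)
    # Sort the matches by rank and keep the top k.
    return heapq.nlargest(k, matched, key=matched.get)


lexicon = {"bill": 0, "clinton": 1}
short_barrel = {0: [1, 4], 1: [4, 9]}
full_barrel = {0: [1, 2, 4, 7], 1: [2, 4, 7, 9]}
scores = {2: 0.5, 4: 0.9, 7: 0.1}  # made-up ranks
print(search("Bill Clinton", lexicon, short_barrel, full_barrel,
             lambda doc_id, word_ids: scores[doc_id]))
```

Because the doclists are sorted by docID, the intersection needs only one forward pass over each list.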

Results and Performance

A qualitative analysis of the search results by
users has generally been positive.
The current version of Google answers most
queries in between 1 and 10 seconds.
Because Google takes the proximity of word occurrences
into account, its results are more relevant than those of
search engines that simply return any document
containing all the query words. (E.g. a
search for "bill clinton" gives lower importance
to results where "bill" and "clinton" occur independently.)
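
To make the proximity idea concrete, the sketch below scores a two-word query by how close the two words occur in a document, using the word positions stored in the hit lists. The scoring rule (the reciprocal of the smallest distance) is an assumption chosen for simplicity and is not the paper's ranking function.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any position in sorted list a and sorted list b."""
    i = j = 0
    best = float("inf")
    while i < len(positions_a) and j < len(positions_b):
        best = min(best, abs(positions_a[i] - positions_b[j]))
        # Advance whichever list is behind to look for a closer pair.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_score(positions_a, positions_b):
    """Hypothetical score: 1.0 for adjacent words, decaying with distance."""
    distance = min_distance(positions_a, positions_b)
    if distance == float("inf"):
        return 0.0  # one of the words does not occur in the document
    return 1.0 / max(distance, 1)


# "bill" at position 3 and "clinton" at position 4: the words are adjacent.
print(proximity_score([3], [4]))
# "bill" at position 3 and "clinton" at position 203: unrelated mentions.
print(proximity_score([3], [203]))
```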

Future Works
Search times in the current version of Google are
dominated by disk IO. Introduce query caching,
along with hardware, software, and algorithmic improvements.
Improve search efficiency and quickly scale to
~100 million web pages.
Develop Google into a large-scale research tool
and resource for searchers and researchers.

Analyses of the Research

One of the first descriptions of the PageRank
algorithm, which changed how search engines
rank and index the web.
Using the citation graph and anchor text to rank pages
closely resembled how users themselves rank pages.
Google is a complete architecture for gathering web
pages, indexing them, and performing search
queries over them.
The paper states that Google does not compromise
PageRank for monetary gain, which gives more
credibility to its search results. This holds true to date.

Analyses of the Research

One of the first flaws found in the PageRank
algorithm was the Google Bomb:
Because of PageRank, a page will be ranked
higher if the sites that link to that page use
consistent anchor text.
A Google bomb is created when a large number of
sites link to the page in this manner.
Ranking quality is insufficient using only PageRank
and anchor text. (Google today uses more than
200 different parameters to judge the quality of a page.)
Thank You
Presented by: Nilay Khandelwal