
Google’s Analogy

Altamash R. Jiwani, Government College of Engineering Amravati.

Abstract: The amount of information on the web is growing rapidly, and with it the web creates new challenges for information retrieval, as well as a growing number of users who are new and inexperienced in the art of web research.

A search engine is an information retrieval system designed to help find information stored on the World Wide Web. A search engine allows one to ask for information on the basis of specific criteria and retrieves a list of items that match those criteria. This list is sorted with respect to some measure of the relevance of the results.

In this paper, I present the Google search engine, which has become the prototype of a large-scale search engine. It is in heavy use today and will be used far more in the near future, which can be estimated from the fact that Google has an indexed database of at least 24 million pages.

This paper mainly covers "How Google works?", which includes Google's hardware architecture, its servers, what Google indexes, its features and limitations, Google's ranking principles and tips, the Googleplex and a lot more.

Everybody is running after this amazing thing which has changed the way we surf the net, and technically they are devising new ways to get a high Google PageRank. This paper therefore also provides a lot of tricks and hacks for increasing your page rank in the world's most popular search engine.

Why is Google considered in this paper?

Because Google is the most popular large-scale search engine, and it addresses many of the problems of existing systems. It makes heavy use of the additional structure present in hypertext to provide much higher quality search results. Some of its features include fast crawling technology to gather the web documents and keep them up to date, efficient use of storage space to store the indices, and a query system with minimal response time. In short, it is "the best navigation service": instead of making things easier for the computer, it makes things easier for the user and makes the computer work harder.

As Google users, we are familiar with the speed and accuracy of a Google search. How exactly does Google manage to find the right results for every query as quickly as it does? Questions like this are answered in this paper.

There is something deeper to learn about Google, like a mystery waiting to be solved. One example is that Google is a company that has built a single, very large, custom computer comprising some 100,000 servers, running its own cluster operating system. Google makes this big computer even bigger and faster each month, while lowering the cost of CPU cycles, producing an efficient system with a unique combination of advanced hardware and software. Google has taken the last 10 years of systems software research out of university labs and built its own proprietary, production-quality system. What will they do next with the world's biggest computer and most advanced operating system? That still remains a mystery.
TYPES OF SEARCH ENGINES

There are basically three types of search engines:
1) Those that are powered by robots (called crawlers, ants or spiders).
2) Those that are powered by human submissions.
3) Those that are a hybrid of the two.

Crawler-based search engines are those that use automated software agents (called crawlers) that visit a Web site, read the information on the actual site, read the site's meta tags, and also follow the links that the site connects to, performing indexing on all linked Web sites as well. The crawler returns all that information back to a central depository, where the data is indexed. The crawler will periodically return to the sites to check for any information that has changed.

Human-powered search engines rely on humans to submit information, which is subsequently indexed and catalogued. Only information that is submitted is put into the index.

In both cases, when you query a search engine to locate information, you are actually searching through the index that the search engine has created; you are not actually searching the Web. These indices are giant databases of information that is collected, stored and subsequently searched. This explains why a search on a commercial search engine, such as Yahoo! or Google, will sometimes return results that are, in fact, dead links.

Why will the same search on different search engines produce different results?

Part of the answer is that not all indices are going to be exactly the same: it depends on what the spiders find or what the humans submitted. But more importantly, not every search engine uses the same algorithm to search through the indices.
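To make concrete the point that you search an index rather than the live Web, here is a minimal, purely illustrative Python sketch (the documents and URLs are invented): whatever the crawler or a human submitter hands over is turned into an inverted index mapping each word to the pages that contain it, and every later query consults only that index.

    # Illustrative sketch only: build an inverted index from pages a crawler
    # (or a human submitter) has handed over, then answer lookups from it.
    crawled = {
        "http://example.com/a": "google ranks pages with pagerank",
        "http://example.com/b": "search engines build an index of the web",
    }

    index = {}
    for url, text in crawled.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(url)

    def lookup(word):
        # the search consults the stored index, never the live Web
        return index.get(word.lower(), set())

    print(lookup("pagerank"))   # {'http://example.com/a'}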
Google developers:

Larry Page, Co-Founder & President, Products

Sergey Brin, Co-Founder & President, Technology
Google's Hardware:

To provide sufficient service capacity, Google's physical structure consists of clusters of computers situated around the world, known as server farms. These server farms consist of a large number of commodity-level computers running Linux-based systems that operate with GFS, the Google File System.

It has been speculated that Google has the world's largest computer. The estimate puts Google at up to:
• 899 racks
• 79,112 machines
• 158,224 CPUs
• 316,448 GHz of processing power
• 158,224 GB of RAM
• 6,180 TB of hard drive space
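Taken at face value, this speculative estimate implies fairly modest commodity machines; the quick arithmetic below uses only the figures listed above to work out the rough per-machine numbers.

    # Back-of-the-envelope check of the per-machine figures implied by the
    # (speculative) estimate listed above.
    machines = 79_112
    cpus, ghz, ram_gb, disk_tb = 158_224, 316_448, 158_224, 6_180

    print(cpus / machines)            # ~2 CPUs per machine
    print(ghz / cpus)                 # ~2 GHz per CPU
    print(ram_gb / machines)          # ~2 GB of RAM per machine
    print(disk_tb * 1024 / machines)  # ~80 GB of disk per machine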
How Google Handles Search Queries

When a user enters a query into the search box at Google.com, it is randomly sent to one of many Google clusters, and the query will then be handled solely by that cluster. A load balancer that is monitoring the cluster then spreads the request out over the servers in the cluster to make sure the load on the hardware is even.

Then the following process is carried out:
• Determine the documents pointed to by the keywords.
• Sort these documents using each one's PageRank.
• Provide links to these documents on the Web.
• Provide a link to view the cached version of the document in the doc server farm.
• Pull an excerpt from the page, using the cached version of the page, to give a quick idea of what it is about.
• Return an initial result set of document excerpts and links, with links to retrieve further result sets of matches, rendered as HTML.
• By default, Google returns results in sets of ten matches (as an HTML page); you can change the number of results you want to see on the Google Preferences page.

Google prides itself on the fact that most queries are answered in less than half a second. Considering the number of steps involved in answering a query, you can see that this is quite a technological feat.

Let's see how Google processes a query.
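As a rough, purely illustrative Python sketch of the steps above (the index, PageRank values and cached text are all invented, and this is in no way Google's actual pipeline), a single server in the cluster might do something like this:

    # Toy sketch of the query flow above: look up documents by keyword,
    # order them by PageRank, return excerpts from cached copies.
    # All data here is invented for illustration.
    index = {"google": {1, 2}, "search": {2, 3}}   # keyword -> document ids
    pagerank = {1: 0.9, 2: 2.5, 3: 0.4}            # document id -> PageRank
    cache = {1: "Google indexes billions of pages ...",
             2: "How Google search works under the hood ...",
             3: "Search engine basics for beginners ..."}

    def handle_query(query, page_size=10):
        words = query.lower().split()
        # 1. Determine the documents pointed to by the keywords.
        docs = set.intersection(*(index.get(w, set()) for w in words))
        # 2. Sort these documents using each one's PageRank.
        ranked = sorted(docs, key=lambda d: pagerank[d], reverse=True)
        # 3. Pull an excerpt from the cached copy and return the first result set.
        return [(doc, cache[doc][:40]) for doc in ranked[:page_size]]

    print(handle_query("google search"))   # document 2 comes back first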
The PageRank System

The PageRank algorithm is used to sort pages returned by a Google search request. PageRank, named after Larry Page, who came up with it, is one of the ways in which Google determines the importance of a page, which in turn decides where the page will turn up in the results list.

PageRank is a numeric value that represents how important a page is on the web. Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is.

The crucial element that makes PageRank work is the nature of the Web itself, which depends almost solely on the use of hyperlinking between pages and sites. In the system that makes Google's PageRank algorithm work, links are a Web popularity contest: if Webmaster A thinks Webmaster B's site has good information (or is cool, or looks good, or is funny), Webmaster A may decide to add a link to Webmaster B's site.

PR(A) = (1 - d) + d * (PR(t1)/C(t1) + ... + PR(tn)/C(tn))

That is the equation that calculates a page's PageRank. It is the original one that was published when PageRank was being developed, and it is probable that Google uses a variation of it. In the equation, t1 to tn are the pages linking to page A, C is the number of outbound links that a page has, and d is a damping factor, usually set to 0.85.

We can think of it in a simpler way:

PageRank of a page = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)

This equation shows that a website has a maximum amount of PageRank that is distributed between its pages by internal links. The maximum amount of PageRank in a site increases as the number of pages in the site increases: the more pages a site has, the more PageRank it has.

Let's consider a three-page site (pages A, B and C) with no links coming in from the outside, where page A links to page B. The site's maximum PageRank is the amount of PageRank in the site. Take the starting PageRank of the pages as:
Page A = 0.15
Page B = 1
Page C = 0.15

Page A has "voted" for page B and, as a result, page B's PageRank has increased. This is looking good for page B. After 100 iterations the figures are:
Page A = 0.15
Page B = 0.2775
Page C = 0.15

The total PageRank in the site is now (0.15 + 0.15 + 0.2775) = 0.5775. Hence you can see that very little linking, or poor linking, decreases your PageRank.
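The same three-page figures can be reproduced with a few lines of Python; this is only a sketch of the published equation above (assuming the single link from A to B), not Google's production ranking code. The starting values do not matter much, because the iteration settles on the same figures.

    # Iterative PageRank for the three-page example above (only A links to B).
    # A sketch of the published equation, not Google's production ranking code.
    d = 0.85                                  # damping factor
    links = {"A": ["B"], "B": [], "C": []}    # outbound links per page
    pr = {page: 1.0 for page in links}        # arbitrary starting values

    for _ in range(100):                      # 100 iterations, as in the text
        new_pr = {}
        for page in links:
            share = sum(pr[src] / len(out)    # each linking page passes on a share
                        for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * share
        pr = new_pr

    print(pr)   # roughly {'A': 0.15, 'B': 0.2775, 'C': 0.15}, total 0.5775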

PageRank is also displayed on the toolbar of your browser if you have installed the Google toolbar (http://toolbar.google.com/).

Google's web crawler: The Googlebot

A web crawler (also known as a Web spider or Web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other, less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms.

Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code, and to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

There are two methods that the Googlebot uses to find a web page:
• Either it reaches the webpage after crawling through links,
• Or it gets to the page after it has been submitted by the webmaster to www.google.com/addurl.html.

By submitting the base link of the site, for example http://wiki.media-culture.org.au/, the Googlebot will go through all links in the index page and every subsequent page, until the entire site has been indexed.
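The crawling behaviour described here is essentially a breadth-first walk over links. The following Python sketch imitates it on a small, made-up in-memory link graph (no real HTTP requests); it is only an illustration of the idea, not the Googlebot itself.

    # Follow links from the submitted base page until the whole site is seen.
    from collections import deque

    site_links = {                        # toy site: page -> pages it links to
        "/index.html": ["/about.html", "/products.html"],
        "/about.html": ["/index.html"],
        "/products.html": ["/contact.html"],
        "/contact.html": [],
    }

    def crawl(start):
        indexed, queue = set(), deque([start])
        while queue:
            page = queue.popleft()
            if page in indexed:
                continue
            indexed.add(page)             # hand the page over for indexing
            queue.extend(site_links.get(page, []))
        return indexed

    print(sorted(crawl("/index.html")))   # every page of the site gets indexed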
Google File System:

The Google File System (GFS) is a proprietary file management system developed at Google by Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung as a means to handle the massive number of requests over a large number of server clusters.

Like most other distributed file systems, it was designed for maximum performance, to handle the large number of users; for scalability, to be able to handle inevitable expansion; and for reliability, to ensure maximum uptime and availability, so that computers are always available to handle queries.

Because of Google's decision to use a large number of commodity-level computers instead of a smaller number of server-class systems, the Google File System had to be designed to handle system failures, which resulted in it being designed around constant monitoring of systems, error detection, fault tolerance and automatic recovery. That meant that the clusters would have to hold multiple replicas of the information created by Google's web crawlers. Because of the size of the Google database, the system also had to be designed to handle huge, multi-gigabyte files totalling many terabytes in size. GFS helps ensure that Google has maximum control over the system, while at the same time allowing the system to stay flexible.
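As a very loose illustration of the replication idea described above (the chunk size, replica count and server names here are invented for the demo, not GFS's real parameters), a file can be split into fixed-size chunks and each chunk copied to several commodity servers, so that losing one machine loses no data:

    # Split data into chunks and keep several copies on different servers.
    # Purely illustrative values; not the real GFS layout.
    import random

    CHUNK_SIZE = 8                                   # bytes; tiny for the demo
    REPLICAS = 3
    servers = {f"server{i}": {} for i in range(6)}   # server name -> chunk store

    def store(name, data):
        chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
        placement = {}
        for n, chunk in enumerate(chunks):
            targets = random.sample(sorted(servers), REPLICAS)
            for t in targets:                        # write every replica
                servers[t][(name, n)] = chunk
            placement[n] = targets                   # a master would record this
        return placement

    print(store("crawl.dat", b"pages fetched by the crawler"))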
More on Google

Google almost certainly knows more about you than you would tell your mother. Did you ever search for information about AIDS, cancer, mental illnesses or bomb-making equipment? Google knows, because it has put a unique reference number in a permanent cookie on your hard drive (which doesn't expire until 2038). It also knows your internet (IP) address.

Google's privacy policy says that it "notes and saves information such as time of day, browser type, browser language, and IP address with each query. That information is used to verify our records and to provide more relevant services to users. For example, Google may use your IP address or browser language to determine which language to use when showing search results or advertisements." If you add the Google Toolbar to your Windows browser, then it can send Google information about the pages you view, and Google can update the Toolbar code automatically, without asking you.

Searching Tips and Tricks for Google:

• If you want to search Google for the words that will be on the page you want, not for a description of the page or website, then type your query inside square brackets, e.g. [your query].

• To limit the scope of a search to a particular file type, use the file-type syntax (filetype:). For example, filetype:ppt google finds mentions of Google in PowerPoint slides. Other formats include .pdf (Adobe Acrobat), .doc (Word) and .xls (Excel).

• You can use an asterisk (*) as a wildcard. Example: "George * Bush" finds George W. Bush. Example: "To * * * to be" finds "To be or not to be". You can also use this strategy to find email addresses: "email * * <domain>".

• To find out who links to a Web page, use the link syntax (link:). The search link:www.virtualchase.com would perform a reverse link search on the URL mentioned. This is useful to see how popular your site is.

• Use quotation marks " " to locate an entire string. e.g. "bill gates conference" will only return results with that exact string.

• Mark essential words with a '+'. If a search term must contain certain words or phrases, mark it with a + symbol. e.g. +"bill gates" conference will return all results containing "bill gates", but not necessarily those pertaining to a conference.

• Negate unwanted words with a '-'. You may wish to search for the term bass, pertaining to the fish, and be returned a list of music links as well. To narrow down your search a bit more, try: bass -music. This will return all results with "bass" and NOT "music".

• site:www.cwire.org
This will search only pages which reside on this domain.

• related:www.cwire.org
This will display all pages which Google finds to be related to your URL.

• spell:word
Runs a spell check on your word.

• define:word
Returns the definition of the word.

• stocks: [symbol, symbol, etc.]
Returns stock information. e.g. stocks: msft

• maps:
A shortcut to Google Maps.

• phone: name_here
Attempts to look up the phone number for a given name.

• cache:
If you include other words in the query, Google will highlight those words within the cached document. For instance, cache:www.cwire.org web will show the cached content with the word "web" highlighted.

• info:
The query [info:] will present some information that Google has about that web page. For instance, info:www.cwire.org will show information about the CyberWyre homepage.

• weather:
Used to find the weather in a particular city. e.g. weather: new york

• allinurl:
If you start a query with [allinurl:], Google will restrict the results to those with all of the query words in the URL. For instance, [allinurl: google search] will return only documents that have both "google" and "search" in the URL.

• inurl:
If you include [inurl:] in your query, Google will restrict the results to documents containing that word in the URL. For instance, [inurl:google search] will return documents that mention the word "google" in their URL, and mention the word "search" anywhere in the document (URL or not).

• allintitle:
If you start a query with [allintitle:], Google will restrict the results to those with all of the query words in the title. For instance, [allintitle: google search] will return only documents that have both "google" and "search" in the title.

• intitle:
If you include [intitle:] in your query, Google will restrict the results to documents containing that word in the title. For instance, [intitle:google search] will return documents that mention the word "google" in their title, and mention the word "search" anywhere in the document (title or not). Note there can be no space between the "intitle:" and the following word.

• allinlinks:
Searches only within links, not text or title.

• allintext:
Searches only within the text of pages, but not in the links or page title.
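These operators can be combined in a single query. As a small, hedged illustration (the query string and site are arbitrary examples), the snippet below simply URL-encodes such a combined query into an ordinary Google search URL:

    # Combine several of the operators above into one query string.
    from urllib.parse import urlencode

    query = 'filetype:ppt site:www.cwire.org "bill gates" -music'
    print("https://www.google.com/search?" + urlencode({"q": query}))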

Steps to get a high Google Pagerank:

Server speed: Your website pages must download nearly at the speed of light. Yes, it's true: Google gives more visibility to websites that are resident on fast servers.

Site updating: Googlebot has the ability to check WHEN your pages have been uploaded to the server. Here's a simple hack: upload all your pages every day, even if nothing has changed.

Lots of light HTML pages: Google adores simple websites with hundreds of pages. If you are building a page that (because of its extensive contents) is going to be larger than 50 KB, split it into two or three pages.

Start out slowly. If possible, begin with a new site that has never been submitted to the search engines or directories. Choose an appropriate domain name, and start out by optimizing just the home page.

Learn basic HTML. Many search engine optimization techniques involve editing the behind-the-scenes HTML code. Your high rankings can depend on knowing which codes are necessary, and which aren't.

Choose keywords wisely. The keywords you think might be perfect for your site may not be what people are actually searching for. To find the optimal keywords for your site, use tools such as WordTracker.

Create a killer Title tag. HTML title tags are critical because they're given a lot of weight by all of the search engines. You must put your keywords into this tag and not waste space with extra words. Do not use the Title tag to display your company name or to say "Home Page." Think of it more as a "Title Keyword Tag" and create it accordingly. Add your company name to the end of this tag, if you must use it.

Create meaty Meta tags. Meta tags can be valuable, but they are not a magic bullet. Create a Meta Description tag that uses your keywords and also describes your site. The information in this tag often appears under your Title in the search engine results pages.

Use extra "goodies" to boost rankings. Things like headlines, image alt tags, header tags (<H1>, <H2>, etc.), links from other pages, keywords in file names, and keywords in hyperlinks can cumulatively boost search engine rankings. Use any or all of these where they make sense for your site.
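To make the on-page items above concrete (the title, the meta description, header tags and image alt text), here is a small, purely illustrative Python check built only on the standard library; the sample page is made up, and this is a sketch rather than an official SEO tool:

    # Tiny on-page check for the elements discussed above, using only stdlib.
    from html.parser import HTMLParser

    class OnPageCheck(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""
            self.meta_description = None
            self.h1_count = 0
            self.images_missing_alt = 0

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "title":
                self.in_title = True
            elif tag == "meta" and attrs.get("name", "").lower() == "description":
                self.meta_description = attrs.get("content")
            elif tag == "h1":
                self.h1_count += 1
            elif tag == "img" and not attrs.get("alt"):
                self.images_missing_alt += 1

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data

    # Invented sample page: has a title and an H1, but no meta description
    # and one image without alt text.
    page = ('<html><head><title>PageRank tips</title></head>'
            '<body><h1>PageRank</h1><img src="a.png"></body></html>')
    check = OnPageCheck()
    check.feed(page)
    print(check.title, check.meta_description, check.h1_count, check.images_missing_alt)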
Don't expect quick results. Getting high rankings takes time; there's no getting around that fact. Once your site is added to a search engine or directory, its ranking may start out low and then slowly work its way up the ladder. Google measures "click-through popularity", i.e. the more people that click on a particular site, the higher its ranking will go. Be patient and give your site time to mature.

Search engines don't index framed sites very well, so if frames aren't necessary, simply remove them for the better ranking of your site.

Sites that use dynamic URLs. Most search engines don't list dynamic URLs from database-driven or script-run sites. If your URL contains any of the following elements
? & % + = $ cgi-bin .cgi
it is considered a dynamic URL by search engines; the solution is to make static pages whose URLs contain no elements from this list (a small check along these lines is sketched after these tips).

Sites that use Flash. Flash itself is not a problem; it is the unprofessional use of it that wastes the effort. Mostly homepages, intro pages, etc. are built using Flash to provide a cool and interactive impact in the browser. But a problem arises when most or all of a page is Flash-dependent, because search engines don't index Flash. Another major problem is that links inside Flash can't be crawled by search engines, so they can't be indexed. The solution in such cases is to write highly effective TITLE and META tags. To solve the links problem, the standard way is to create a sitemap and link it to each web page via a standard HTML hyperlink tag.

Sites that use image maps for navigation. The problem arising from these is that the link crawlers of search engines mostly get jumbled by image maps and don't spider most of the links. You can overcome this by adding an alternate simple navigation menu, or by the standard sitemap technique.

Sites that use JavaScript for navigation. Search engines mostly don't follow the links provided in JavaScript. Again, you can overcome this by adding an alternate simple navigation menu, or by the standard sitemap technique.
Eventually, you'll see the fruits of your labor, i.e. your site's listing in Google and the rest of the search engines!

Conclusion:

Google is now the world's most powerful website. Google's mission is to organize the world's information and make it universally accessible and useful.

Google said, "We're about not ever accepting that the way something has been done in the past is necessarily the best way to do it today." Nobody believed it at the time, but by now they have done far more than that.

The Google web site is powered by some amazing technology. People often ask, "What are you working on? Isn't search a solved problem?" And all this is because of Google, the master of the internet, which averages about 250 million searches per day.
