
A PRESENTATION ON

WEB CRAWLING

SUBMITTED TO: MR. YUDHVIR SINGH
(LECTURER IN C.S.E. DEPT.)

SUBMITTED BY: ANSHU
ROLL NO.: 0501302

INTRODUCTION
The World Wide Web (WWW), or the Web for short, is a collection of billions of documents written in a way that enables them to cite each other using hyperlinks, which is why they are a form of hypertext. These documents, or Web pages, are typically a few thousand characters long, written in a diversity of languages, and cover essentially every topic of human endeavor. The Web has become highly popular in the last few years and is now one of the primary means of publishing information on the Internet. Once the Web grew beyond a few sites and a small number of documents, it became clear that manually browsing a significant portion of the hypertext structure was no longer possible, let alone an effective method for resource discovery. Browsing is a useful but restrictive means of finding information: given a page with many links to follow, it is tedious and painstaking to explore them in search of a specific piece of information.

What are Web Crawlers?


A Web crawler or Web robot is a program that traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that it references. These programs are sometimes called "spiders", "web wanderers", or "web worms". These names, while perhaps more appealing, can be misleading: the terms "spider" and "wanderer" give the false impression that the robot itself moves from site to site. In reality, a robot is implemented as a single software system that retrieves information from remote sites using standard Web protocols. Normal Web browsers are not robots, because they are operated by a human and do not automatically retrieve referenced documents (other than inline images).

How Robots Follow Links to Find Pages


How does a crawler fetch all Web pages? Before the advent of the Web, traditional text collections such as bibliographic databases and journal abstracts were provided to the indexing system directly, say on magnetic tape or disk. In contrast, there is no catalog of all accessible URLs on the Web. The only way to collect URLs is to scan already-collected pages for hyperlinks to other pages that have not been collected yet. This is the basic principle behind crawlers, or robots. A crawler starts at a given page, usually the main page of a site, reads the text of the page just like a Web browser, and follows the links to other pages. If you think of a site as a spider's web, the crawler starts at the center and follows links from strand to strand until it has reached every one.
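
As an illustration, the short Python sketch below captures this basic principle: a frontier of URLs still to be fetched, a set of pages already collected, and a loop that scans each fetched page for new hyperlinks. The function and variable names are invented for this example, and the third-party packages requests and beautifulsoup4 are assumed to be available; a real crawler would also honour robots.txt, politeness delays, and many other concerns.

    # Minimal crawler building blocks (an illustrative sketch, not production code).
    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    def fetch_links(url):
        """Download one page and return the absolute URLs it links to."""
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            return []                       # unreachable pages contribute no links
        soup = BeautifulSoup(response.text, "html.parser")
        return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

    def crawl(seed_url, max_pages=100):
        frontier = [seed_url]               # URLs discovered but not yet fetched
        visited = set()                     # URLs already fetched
        while frontier and len(visited) < max_pages:
            url = frontier.pop(0)           # the order of removal is discussed below
            if url in visited:
                continue
            visited.add(url)
            for link in fetch_links(url):   # scan the page for new hyperlinks
                if link not in visited:
                    frontier.append(link)
        return visited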

Breadth-First Crawling
The idea of breadth-first indexing is to retrieve all the pages around the starting point before following links further away from the start. This is the most common way that crawlers follow links. If a robot is indexing several hosts, this approach distributes the load quickly, so that no single server must respond constantly. It is also easier for crawler writers to implement parallel processing with this approach. In the diagram, the starting point is at the center, in the darkest gray. The pages one link away, in medium gray, will be indexed first, followed by the pages they link to (in light gray), and finally the outermost links, in white.

Fig. 1.1: Breadth-first crawling
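
As a rough sketch (reusing the hypothetical fetch_links helper and seed_url from the earlier example), breadth-first order falls out of treating the frontier as a first-in, first-out queue:

    from collections import deque

    # Breadth-first: the frontier is a FIFO queue, so all pages one link away
    # from the start are fetched before any page two links away, and so on.
    frontier, visited = deque([seed_url]), set()
    while frontier:
        url = frontier.popleft()            # oldest (closest to the start) first
        if url in visited:
            continue
        visited.add(url)
        frontier.extend(fetch_links(url))   # new links join the back of the queue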

Depth-First Crawling
The alternative approach, depth-first indexing, follows the first link on the starting page, then the first link on the page it reaches, and so on. Once it has indexed the first link on each page, it goes back to the second and subsequent links and follows those. Some unsophisticated crawlers use this method, as it can be easier to code. In this diagram, the starting point is at the center, in the darkest gray. The first linked page is a dark gray, and the first links from that page are lighter grays.

Fig. 1.2: Depth-first crawling
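
Depth-first order needs only one change to the same sketch: the frontier becomes a last-in, first-out stack, so the most recently discovered link is followed next before the crawler backtracks:

    # Depth-first: the frontier is a LIFO stack, so the crawl goes deep along
    # the first link of each page before backtracking to later links.
    frontier, visited = [seed_url], set()
    while frontier:
        url = frontier.pop()                          # most recently found link first
        if url in visited:
            continue
        visited.add(url)
        frontier.extend(reversed(fetch_links(url)))   # reversed so the page's first
                                                      # link ends up on top of the stack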

Scalability and Extendibility


The two major attributes of a crawler are scalability and extendibility.

Scalability: By scalable, we mean that a Web crawler is designed to scale up to the entire Web and can be used to fetch tens of millions of Web documents. Scalability is achieved by implementing the data structures used during a Web crawl so that they use a bounded amount of memory, regardless of the size of the crawl. Hence, the vast majority of the data structures are stored on disk, and only small parts of them are kept in memory for efficiency.

Extendibility: By extensible, we mean that a Web crawler is designed in a modular way, with the expectation that third parties will add new functionality.
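
A toy illustration of the bounded-memory idea, with invented names and a deliberately simplified spill file (real scalable crawlers such as Mercator use far more sophisticated disk-based structures): only a small buffer of frontier URLs is kept in RAM, and the rest of the queue overflows to disk.

    import os
    from collections import deque

    class DiskBackedFrontier:
        """Keep a bounded number of URLs in memory; spill the rest to a file."""

        def __init__(self, spill_path="frontier.spill", memory_limit=10_000):
            self.memory = deque()
            self.memory_limit = memory_limit
            self.spill_path = spill_path

        def push(self, url):
            if len(self.memory) < self.memory_limit:
                self.memory.append(url)            # fast path: keep in RAM
            else:
                with open(self.spill_path, "a") as f:
                    f.write(url + "\n")            # overflow goes to disk

        def pop(self):
            if not self.memory:
                self._refill_from_disk()
            return self.memory.popleft() if self.memory else None

        def _refill_from_disk(self):
            if not os.path.exists(self.spill_path):
                return
            with open(self.spill_path) as f:
                lines = f.read().splitlines()
            os.remove(self.spill_path)
            self.memory.extend(lines[:self.memory_limit])
            for url in lines[self.memory_limit:]:  # re-spill whatever still does not fit
                self.push(url)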

ADVANCEMENTS IN WEB CRAWLER TECHNOLOGY


Parallel Crawlers

As the size of the Web grows, it becomes more difficult to retrieve the whole Web, or a significant portion of it, using a single process. Therefore, many search engines run multiple processes in parallel to crawl the Web so that the download rate is maximized. We refer to this type of crawler as a parallel crawler. The main design goal for such crawlers is to maximize performance (e.g., the download rate) while minimizing the overhead of parallelization.

The following issues make the study of a parallel crawler challenging and interesting:

Overlap: When multiple processes run in parallel to download pages, it is possible that different processes download the same page multiple times. One process may not be aware that another process has already downloaded the page. Clearly, such multiple downloads should be minimized to save network bandwidth and increase the crawler's effectiveness. How, then, can we coordinate the processes to prevent overlap?

Quality: Often, a crawler wants to download "important" pages first, in order to maximize the "quality" of the downloaded collection. However, in a parallel crawler, each process may not be aware of the whole image of the Web that the processes have collectively downloaded so far. For this reason, each process may make crawling decisions based solely on its own image of the Web (the part that it has downloaded itself) and thus make poor decisions. How, then, can we make sure that the quality of the downloaded pages is as good for a parallel crawler as for a centralized one?

Communication bandwidth: In order to prevent overlap, or to improve the quality of the downloaded pages, crawling processes need to periodically communicate to coordinate with each other. However, this communication may grow significantly as the number of crawling processes increases. Exactly what do they need to communicate and how significant would this overhead be? Can we minimize this communication overhead while maintaining the effectiveness of the crawler?
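
One common way to address the overlap problem is to statically partition the URL space, for example by hashing each URL's host name to the process that owns it; the Python sketch below (with invented names and an illustrative number of processes) shows the idea. URLs that belong to another partition are batched and forwarded rather than fetched, which also helps keep the communication overhead small.

    # Sketch: static hash-based partitioning of the URL space across crawling
    # processes. Each host is owned by exactly one process, so no page can be
    # downloaded twice; URLs for other partitions are forwarded, not fetched.
    import hashlib
    from urllib.parse import urlparse

    NUM_PROCESSES = 4   # illustrative number of parallel crawling processes

    def owner(url):
        """Map a URL's host name to the index of the process responsible for it."""
        host = urlparse(url).netloc
        digest = hashlib.md5(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_PROCESSES

    def handle(url, my_id, my_frontier, outboxes):
        """Keep URLs we own; queue the rest for their owning process."""
        target = owner(url)
        if target == my_id:
            my_frontier.append(url)
        else:
            outboxes[target].append(url)   # batched and sent periodically to limit
                                           # the communication overhead noted above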

ADVANTAGES OF PARALLEL OVER SIMPLE CRAWLER


Scalability: Due to the enormous size of the Web, it is often imperative to run a parallel crawler. A single-process crawler simply cannot achieve the required download rate in certain cases.

Network-load dispersion: Multiple crawling processes of a parallel crawler may run at geographically distant locations, each downloading "geographically adjacent" pages.

Network-load reduction: In addition to dispersing the load, a parallel crawler may actually reduce the network load.

METHODS OF SENDING PAGES


The downloaded pages may need to be transferred later to a central location so that a central index can be built. Even in that case, the transfer can be significantly smaller than the original page download traffic if some of the following methods are used:

Compression: Once the pages are collected and stored, it is easy to compress the data before sending it to a central location.

Difference: Instead of sending the entire image with all downloaded pages, we may first take the difference between the previous image and the current one and send only this difference. Since many pages are static and do not change very often, this scheme can significantly reduce the network traffic.

Summarization: In certain cases, we may need only a central index, not the original pages themselves. In this case, we may extract the information necessary for index construction (e.g., the postings list) and transfer only this.
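
A small Python sketch combining the compression and difference ideas, with invented names and a deliberately naive serialization: pages whose content hash has not changed since the last transfer are skipped, and the remainder is compressed before being sent.

    import hashlib
    import zlib

    def build_transfer(pages, previous_hashes):
        """pages: dict url -> HTML text; previous_hashes: dict url -> hash from the last transfer."""
        changed = {}
        for url, html in pages.items():
            digest = hashlib.sha1(html.encode("utf-8")).hexdigest()
            if previous_hashes.get(url) != digest:   # "difference": only new or changed pages
                changed[url] = html
                previous_hashes[url] = digest
        # Naive serialization for illustration only.
        payload = "\n".join(f"{url}\t{html}" for url, html in changed.items())
        return zlib.compress(payload.encode("utf-8"))  # "compression" before sending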

Focused Crawling
Large-scale, topic-specific information gatherers are called focused crawlers. In contrast to giant, all-purpose crawlers, which must process large portions of the Web in a centralized manner, a distributed federation of focused crawlers can cover specialized topics in more depth and keep its crawls fresher, because each crawler has less to cover.
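
A toy sketch of the core idea, with an invented keyword list standing in for the topic model (real focused crawlers, such as the one by Chakrabarti et al., use trained classifiers rather than keyword counts): a page's out-links are followed only if the page itself scores as relevant to the topic.

    # Score each fetched page against the target topic and only follow links
    # from pages that look relevant.
    TOPIC_TERMS = {"crawler", "spider", "indexing", "hyperlink"}   # illustrative topic

    def relevance(text):
        """Fraction of the page's words that belong to the topic vocabulary."""
        words = text.lower().split()
        return sum(words.count(term) for term in TOPIC_TERMS) / max(len(words), 1)

    def should_expand(page_text, threshold=0.01):
        """Follow a page's out-links only if the page itself is on-topic."""
        return relevance(page_text) >= threshold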

REFERENCES
Avi Rappaport, "Robots & Spiders & Crawlers: How Web and Intranet Search Engines Follow Links to Build Indexes" (white paper).
Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine".
Allan Heydon and Marc Najork, "Mercator: A Scalable, Extensible Web Crawler".
Martijn Koster, "Robots in the Web: Threat or Treat?"
Center for Social Informatics, SLIS, Indiana University, "How Public Is the Web?: Robots, Access, and Scholarly Communication".
Soumen Chakrabarti, Kunal Punera, and Mallela Subramanyam, "Accelerated Focused Crawling through Online Relevance Feedback".
Junghoo Cho, "Parallel Crawlers".
www.searchenginewatch.com
www.botspot.com
www.spiderline.com
