
Web Crawlers

Nutch

Agenda
- What are web crawlers
- Main policies in crawling
- Nutch
- Nutch architecture

Web crawlers
- Crawl (visit) web pages and download them
- Starting from one page, determine which page(s) to visit next
- How well these choices are made determines how good and efficient a crawler is
- This mainly depends on the crawling policies used (a minimal crawl loop is sketched below)
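A minimal sketch of that basic loop, assuming a breadth-first frontier and a made-up seed URL (neither is prescribed by the slides): download a page, extract its links, and decide which page to visit next.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=10):
    frontier = deque([seed])          # pages still to visit
    seen = {seed}                     # avoid queueing the same URL twice
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                  # skip pages that fail to download
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        yield url


if __name__ == "__main__":
    for page in crawl("https://example.com/"):
        print("fetched", page)
```

This sketch visits pages in simple breadth-first order; the crawl policies below are what replace that naive ordering in a real crawler.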

Crawl policies
- Selection policy
- Re-visit policy
- Politeness policy
- Parallelization policy

Selection policy
- PageRank (a scoring sketch follows below)
- Path-ascending crawling
- Focused crawling
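A minimal sketch of how a PageRank-style selection policy can rank candidate pages; crawl the highest-scoring uncrawled pages first. The tiny link graph and the damping factor 0.85 are illustrative assumptions, and dangling pages are ignored for simplicity.

```python
def pagerank(links, damping=0.85, iterations=20):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue              # dangling page: its mass is simply dropped here
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                if target in new_rank:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    scores = pagerank(graph)
    # Visit the highest-scoring pages first.
    for page in sorted(scores, key=scores.get, reverse=True):
        print(page, round(scores[page], 3))
```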

Re-visit policy
- Freshness: whether the local copy still matches the live page
- Age: how long the local copy has been out of date
(both are sketched below)
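A rough sketch of these two quantities under their usual definitions in the crawling literature: freshness is 1 while the local copy matches the live page and 0 once the page has changed; age is how long the copy has been stale. The timestamps in the example are made up.

```python
def freshness(changed_at, crawled_at):
    """1 if our copy is still current, 0 if the page changed after we crawled it."""
    return 1 if crawled_at >= changed_at else 0


def age(changed_at, crawled_at, now):
    """How long the local copy has been out of date (0 while it is still fresh)."""
    return 0 if crawled_at >= changed_at else now - changed_at


if __name__ == "__main__":
    now = 1000.0
    # url -> (time the page last changed, time we last crawled it)
    pages = {"a": (200.0, 100.0), "b": (300.0, 400.0)}
    # Re-visit the pages whose copies have been stale the longest first.
    by_age = sorted(pages, key=lambda u: age(*pages[u], now), reverse=True)
    print(by_age)   # ['a', 'b']: page a changed after we last crawled it
```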

Politeness
- So that crawlers don't overload web servers
- Set a delay between GET requests (a sketch of a polite fetch follows below)
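A minimal sketch of that politeness rule: check robots.txt and wait a fixed delay between successive GET requests to the same server. The 5-second delay and the user-agent name are assumptions, not values taken from the slides.

```python
import time
from urllib.parse import urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

CRAWL_DELAY = 5.0                       # assumed seconds between requests per host
last_request = {}                       # host -> time of our last request to it


def polite_fetch(url, user_agent="example-crawler"):
    parts = urlparse(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()
    if not robots.can_fetch(user_agent, url):
        return None                     # the server asked us not to crawl this URL
    wait = CRAWL_DELAY - (time.time() - last_request.get(parts.netloc, 0.0))
    if wait > 0:
        time.sleep(wait)                # space out requests to the same server
    last_request[parts.netloc] = time.time()
    return urlopen(url, timeout=10).read()
```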

Parallelization
- Distributed web crawling
- To maximize the download rate (a URL-partitioning sketch follows below)
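One common way to parallelize, sketched under the assumption of a fixed pool of workers: partition URLs across crawler workers by hashing the host name, so each host is handled by exactly one worker (which also keeps per-host politeness simple) while downloads from different hosts proceed in parallel. The worker count and URLs are illustrative.

```python
import hashlib
from urllib.parse import urlparse


def assign_worker(url, num_workers):
    """Map a URL to a worker index based on a hash of its host name."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers


if __name__ == "__main__":
    urls = [
        "http://lucene.apache.org/nutch/",
        "https://example.com/a",
        "https://example.com/b",    # same host as above, so same worker
    ]
    for url in urls:
        print(assign_worker(url, num_workers=4), url)
```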

Nutch
- Nutch is an open-source web crawler and web search application
- Maintains a DB of pages and links
- Pages have scores assigned by analysis
- Fetches high-scoring, out-of-date pages (a toy model of this cycle follows below)
- Distributed search front end
- Based on Lucene
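A toy model of the cycle the slide describes: keep a database of pages with scores, and each crawl round generate a fetch list of high-scoring pages whose copies are out of date. This is only an illustration of the idea, not Nutch's actual Java implementation; the field names and the 30-day staleness threshold are assumptions.

```python
import time

THIRTY_DAYS = 30 * 24 * 3600

# page DB: url -> {"score": score assigned by analysis, "last_fetch": unix time}
crawl_db = {
    "http://lucene.apache.org/nutch/": {"score": 1.7, "last_fetch": 0},
    "https://example.com/": {"score": 0.4, "last_fetch": time.time()},
}


def generate_fetch_list(db, top_n=10, max_age=THIRTY_DAYS):
    """Pick the highest-scoring pages that have not been fetched recently."""
    now = time.time()
    due = [url for url, rec in db.items() if now - rec["last_fetch"] > max_age]
    due.sort(key=lambda url: db[url]["score"], reverse=True)
    return due[:top_n]


if __name__ == "__main__":
    print(generate_fetch_list(crawl_db))   # only the stale, high-scoring page is listed
```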

http://lucene.apache.org/nutch/
