Nutch
Agenda
What are web crawlers
Main policies in crawling
Nutch
Nutch architecture
Web crawlers
Crawl (visit) web pages and download them
Starting from one page, determine which page(s) to go to next
This choice determines how good and how efficient a crawler is
It depends mainly on the crawling policies used
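The loop described above can be sketched as follows. This is a minimal illustration, not Nutch's implementation: it assumes a breadth-first order for choosing the next page, and `crawl`, `fetch_links`, and `toy_web` are hypothetical names. A small in-memory dictionary stands in for real HTTP downloads.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from seed.

    fetch_links(url) is a caller-supplied function (an assumption of this
    sketch) that "downloads" url and returns the links found on it.
    """
    frontier = deque([seed])   # pages waiting to be visited
    seen = {seed}              # pages already queued, to avoid repeats
    visited = []               # pages in the order they were downloaded
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# A tiny in-memory "web" standing in for real pages.
toy_web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

order = crawl("a", lambda url: toy_web.get(url, []))
print(order)  # ['a', 'b', 'c', 'd']
```

Replacing the queue with a priority queue is where a selection policy would plug in.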
Crawl policies
Selection policy
Re-visit policy
Politeness policy
Parallelization policy

Selection policy
PageRank
Path-ascending crawling
Focused crawling
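As an illustration of one of these policies: path-ascending crawling, as the term is commonly used, means that for each URL found, the crawler also tries every parent path up to the site root. The sketch below assumes that reading; `ascending_paths` is a hypothetical helper name and the example URL is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def ascending_paths(url):
    """Return the parent URLs of url, from the nearest directory up to the root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    urls = []
    # Drop one trailing path segment at a time until only the root is left.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth > 0:
            path += "/"
        urls.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return urls


for parent in ascending_paths("http://example.com/a/b/page.html"):
    print(parent)
# http://example.com/a/b/
# http://example.com/a/
# http://example.com/
```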
Re-visit policy
Freshness
Age
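The slides do not define these two measures. A common reading in the crawling literature is that freshness is 1 while the local copy still matches the live page and 0 otherwise, and that age is how long the local copy has been out of date. The sketch below assumes those definitions. The function names and the `changed_at` parameter (the time the live page first changed after the last fetch, or `None` if it has not changed) are hypothetical.

```python
def freshness(changed_at):
    """Return 1 if the local copy is still up to date, else 0."""
    return 1 if changed_at is None else 0


def age(now, changed_at):
    """Return how long the local copy has been stale (0 if it is fresh)."""
    if changed_at is None:
        return 0
    return now - changed_at


# Local copy unchanged on the live site: fresh, age 0.
print(freshness(None), age(100, None))   # 1 0
# Live page changed at t=60, checked at t=100: stale for 40 time units.
print(freshness(60), age(100, 60))       # 0 40
```

Under this reading, a re-visit policy might aim to keep average freshness high or average age low across the collection.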
Politeness
Keeps crawlers from overloading web servers
Set a delay between GET requests
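One way to enforce such a delay is to remember, per host, when the last request was sent and wait before the next one. This is a minimal sketch under that assumption, not how any particular crawler does it. `PolitenessGate` is a hypothetical name, and the clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time
from urllib.parse import urlsplit


class PolitenessGate:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until url's host may be contacted; return the seconds waited."""
        host = urlsplit(url).netloc
        last = self.last_request.get(host)
        waited = 0.0
        if last is not None:
            waited = max(0.0, self.delay - (self.clock() - last))
            if waited > 0:
                self.sleep(waited)
        self.last_request[host] = self.clock()
        return waited
```

A crawler would call `gate.wait(url)` immediately before each GET request. Requests to different hosts do not delay each other.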
Parallelization
Distributed web crawling
Goal: maximize the download rate
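The slides do not say how the work is split. One common approach, assumed here, is to assign each URL to a worker by hashing its host name. All pages of a site then go to the same worker, which also keeps per-host politeness delays easy to enforce. `worker_for` and `partition` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def worker_for(url, num_workers):
    """Map a URL to a worker index based on a stable hash of its host."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers):
    """Split urls into one list per worker."""
    buckets = [[] for _ in range(num_workers)]
    for url in urls:
        buckets[worker_for(url, num_workers)].append(url)
    return buckets


urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://c.example/1",
]
for index, bucket in enumerate(partition(urls, 3)):
    print(index, bucket)
```

`hashlib` is used rather than the built-in `hash()` because the latter is randomized per process for strings, so separate crawler processes would disagree on the assignment.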
Nutch
Is an open-source web crawler

Nutch Web Search Application
Maintains a DB of pages and links
Pages have scores, assigned by analysis
Fetches high-scoring, out-of-date pages
Distributed search front end
Based on Lucene
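The "fetches high-scoring, out-of-date pages" step can be pictured as a filter followed by a sort. The sketch below only illustrates that idea; it is not Nutch's code or API. `PageRecord`, `select_for_fetch`, and the `refetch_interval` and `top_n` parameters are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    url: str
    score: float       # score assigned by link analysis
    fetched_at: float  # time of the last fetch


def select_for_fetch(db, now, refetch_interval, top_n):
    """Pick the top_n highest-scoring pages whose last fetch is too old."""
    stale = [p for p in db if now - p.fetched_at >= refetch_interval]
    stale.sort(key=lambda p: p.score, reverse=True)
    return stale[:top_n]


db = [
    PageRecord("a", 0.9, fetched_at=95),  # high score but recently fetched
    PageRecord("b", 0.5, fetched_at=10),
    PageRecord("c", 0.8, fetched_at=20),
    PageRecord("d", 0.1, fetched_at=0),
]
chosen = select_for_fetch(db, now=100, refetch_interval=30, top_n=2)
print([p.url for p in chosen])  # ['c', 'b']
```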
http://lucene.apache.org/nutch/