Hubs Authorities
What any crawler should do
Sec. 20.1.1
Sec. 20.2.1
[Figure: basic crawl architecture. Components: WWW, DNS, Fetch, Parse, Content seen? (Doc FPs), URL filter (robots filters), Dup URL elim (URL set), URL Frontier.]
Sec. 20.2.2
Sec. 20.2.1
Content seen?
Duplication is widespread on the web.
If the page just fetched is already in the index, do not process it further.
This is verified using document fingerprints or shingles.
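
The "content seen?" test can be sketched as follows. This is a minimal illustration, not the method of any particular crawler: the class name `ContentSeen`, the SHA-1 fingerprint, the 4-word shingle size, and the 0.8 Jaccard threshold are all assumptions chosen for the example. Exact duplicates are caught by the fingerprint, near-duplicates by shingle overlap.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Fingerprint for exact duplicates: hash of whitespace-normalized text."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 4) -> set:
    """Set of k-word shingles (contiguous word k-grams) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class ContentSeen:
    """Hypothetical 'content seen?' module (names and defaults are assumptions)."""

    def __init__(self, threshold: float = 0.8, k: int = 4):
        self.threshold = threshold
        self.k = k
        self.fingerprints = set()   # fingerprints of pages already processed
        self.shingle_sets = []      # shingle sets of pages already processed

    def seen(self, text: str) -> bool:
        """Return True if the page duplicates one already processed."""
        fp = fingerprint(text)
        if fp in self.fingerprints:
            return True
        sh = shingles(text, self.k)
        # Linear scan for clarity only; it does not scale to web-sized indexes.
        if any(jaccard(sh, other) >= self.threshold for other in self.shingle_sets):
            return True
        self.fingerprints.add(fp)
        self.shingle_sets.append(sh)
        return False
```

A page for which `seen` returns True would be dropped before its links are extracted; all other pages continue to the URL filter.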
[Figure: distributed crawl architecture. Components: WWW, Fetch, Parse, Content seen?, URL filter, Host splitter, Dup URL elim (also receiving URLs from other nodes), URL Frontier.]
Sec. 20.2.3
Politeness: challenges
Even if we restrict fetching from a host to a single thread, that thread can still hit the host repeatedly.
Common heuristic: insert a time gap between successive requests to a host that is much larger than (>>) the time taken by the most recent fetch from that host.
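
The time-gap heuristic can be sketched as below. The class name `PolitenessGate` and the multiplier of 10 are assumptions for illustration; the slide only requires the gap to be much larger than the last fetch time. The clock is injectable so the behavior can be checked without sleeping.

```python
import time
from urllib.parse import urlsplit


class PolitenessGate:
    """Hypothetical per-host delay tracker (gap_factor=10 is an assumption)."""

    def __init__(self, gap_factor: float = 10.0, clock=time.monotonic):
        self.gap_factor = gap_factor
        self.clock = clock
        self.next_allowed = {}  # host -> earliest time the next request may start

    @staticmethod
    def _host(url: str) -> str:
        return urlsplit(url).hostname or ""

    def wait_time(self, url: str) -> float:
        """Seconds to wait before the host of this URL may be contacted again."""
        earliest = self.next_allowed.get(self._host(url), 0.0)
        return max(0.0, earliest - self.clock())

    def record_fetch(self, url: str, fetch_seconds: float) -> None:
        """After a fetch, block the host for gap_factor times the fetch duration."""
        self.next_allowed[self._host(url)] = (
            self.clock() + self.gap_factor * fetch_seconds
        )
```

A fetch thread would call `wait_time` before each request, sleep for that long (or pick a URL from another host), and call `record_fetch` once the response arrives. Slow hosts are then automatically contacted less often.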
Front queues
[Figure: the prioritizer feeds K front queues, numbered 1 to K.]
The prioritizer assigns each URL an integer priority between 1 and K.
It then appends the URL to the corresponding queue.
Heuristics for assigning priority:
  Refresh rate sampled from previous crawls
  Application-specific (e.g., crawl news sites more often)
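
A minimal sketch of the prioritizer and its K front queues follows. The names `FrontQueues`, `example_priority`, and `NEWS_HOSTS` are hypothetical, and the rule that news hosts receive priority K is just one application-specific heuristic picked for the example; a real prioritizer might also use refresh rates sampled from earlier crawls.

```python
from collections import deque
from urllib.parse import urlsplit

# Assumption: a hand-picked set of hosts that should be crawled more often.
NEWS_HOSTS = {"news.example.com"}


def example_priority(url: str, k: int) -> int:
    """Hypothetical heuristic: news hosts get priority k, everything else 1."""
    host = urlsplit(url).hostname or ""
    return k if host in NEWS_HOSTS else 1


class FrontQueues:
    """K FIFO queues; queue i holds the URLs assigned priority i (1..K)."""

    def __init__(self, k: int, priority_fn=example_priority):
        self.k = k
        self.priority_fn = priority_fn
        self.queues = [deque() for _ in range(k)]  # index 0 holds priority 1

    def add(self, url: str) -> int:
        """Assign a priority, append the URL to that queue, return the priority."""
        priority = int(self.priority_fn(url, self.k))
        priority = min(max(priority, 1), self.k)  # clamp into 1..K
        self.queues[priority - 1].append(url)
        return priority
```

For example, with `fq = FrontQueues(3)`, calling `fq.add("http://news.example.com/story")` places the URL in queue 3, while a URL on any other host lands in queue 1.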
Markov chains
The hope
[Figure: hubs and authorities for the topic "mobile telecom companies". Hubs: Alice, Bob. Authorities: AT&T, ITIM, O2.]