
UNIT-V

SEARCHING AND RANKING


Searching, Structure of the Web and IR
Static and Dynamic Ranking
Web Crawling and Indexing
Link Analysis
XML Retrieval
Multimedia IR: Models and Languages
Indexing and Searching
Parallel and Distributed IR
Digital Libraries
Web characteristics
Web documents
Size of the Web
Web graph
Spam
Advertising as economic model
Search user experiences
Users
User queries
Query distribution
Users' empirical evaluations
Various components of web search engine
Ranking
The primary challenge of a search engine is to return
results that match a user's needs.
A single query word can potentially map to millions of documents.
How should they be ordered?
STATIC RANKING
A search engine returns results for any user query.
Users not only want documents that satisfy their query; they also want
the most relevant document placed at the top of the list and the least
relevant at the bottom.
The earliest ranking model was the Boolean model; later, the vector
and probabilistic models followed.
The vector model represents the query and the document as two vectors
and computes the cosine of the angle between them to measure their
similarity.
The higher the value of the cosine function, the higher the rank of the
document.
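As an illustration of the vector model, the sketch below builds simple term-frequency vectors for a query and a document and computes their cosine similarity; the whitespace tokenization and raw-frequency weighting are simplifying assumptions, not a prescribed implementation.

```python
import math
from collections import Counter

def tf_vector(text):
    # Naive whitespace tokenization and raw term-frequency weights
    # (real systems add stemming, stop-word removal, idf weighting, etc.)
    return Counter(text.lower().split())

def cosine_similarity(query, document):
    q, d = tf_vector(query), tf_vector(document)
    dot = sum(q[t] * d[t] for t in q)                      # dot product
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Documents with a higher cosine score are ranked higher.
print(cosine_similarity("web search ranking", "ranking models for web search engines"))
```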
The probabilistic model assigns higher weight to query terms that appear
in previously retrieved documents, since a previously retrieved document
has a high probability of being more relevant to the user query.
Static ranking algorithms use these models to explore the document
structure and rank documents.
DYNAMIC RANKING
Static ranking does not take interaction with the user into consideration
and faces issues such as query ambiguity and diversity of user intent.
Dynamic ranking provides a way to combine the otherwise contradictory
goals of result diversification and high recall.
These algorithms interact with the user to identify the intended meaning
among the various possible intents, or they reorder the results of the
first retrieval pass and provide refined results to the user.
They focus on both relevance and diversity.
The dynamic ranking tree allows the user to choose any of its nodes and
traverse down the path starting from that node.
Thus the user is provided with more relevant documents once the first
interaction has been recorded.
Rank Tree
HITS
HITS is also commonly used for document ranking.
It gives each page a hub score and an authority score.
A good authority is pointed to by many good hubs.
A good hub points to many good authorities.
Users want good authorities.
Hubs and Authorities
Common community structure
Hubs
Many outward links
Lists of resources
Authorities
Many inward links
Provide resources, content
Hubs and Authorities
[Figure: hub pages on the left pointing to authority pages on the right]
Estimates from link structure suggest over 100,000 Web communities.
These communities are often not categorized by portals.
Issues with Ranking Algorithms
Spurious keywords and META tags
Users reinforcing each other
Increases authority measure
Link similarity vs. content similarity
Topic drift
Many hubs link to more than one topic
Web Crawling and Indexing
A web crawler (also known as a web spider or web robot) is a program or
automated script that browses the World Wide Web in a methodical,
automated manner; this process is called web crawling or spidering.
Many legitimate sites, in particular search engines, use crawling as a
means of providing up-to-date data, typically for the purpose of Web
indexing.
Begin with known seed URLs
Fetch and parse them
Extract URLs they point to
Place the extracted URLs on a queue
Fetch each URL on the queue and repeat
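A minimal sketch of this seed-and-queue loop is shown below; it uses Python's standard urllib, a plain FIFO queue, and a crude regular expression for link extraction, all of which are illustrative simplifications rather than a production design.

```python
import re
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)        # queue of URLs still to fetch
    seen = set(seed_urls)              # avoid re-queuing the same URL
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except Exception:
            continue                   # skip unreachable or malformed pages
        max_pages -= 1
        # Crude link extraction; a real crawler would use an HTML parser.
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
        yield url, html

# Example (hypothetical seed):
# for url, page in crawl(["https://example.com"]):
#     print(url, len(page))
```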

Simple picture: complications

Web crawling isn't feasible with one machine
All of the above steps must be distributed
Malicious pages
Spam pages
Spider traps, including dynamically generated ones
Even non-malicious pages pose challenges
Latency/bandwidth to remote servers vary
Webmasters' stipulations
How deep should you crawl a site's URL hierarchy?
Site mirrors and duplicate pages
Politeness: don't hit a server too often

What any crawler must do, and what any crawler should do

What any crawler must do:
Be robust: be immune to spider traps and other malicious behavior from web servers
Be polite: respect implicit and explicit politeness considerations
Only crawl allowed pages
Respect robots.txt (more on this shortly)

What any crawler should do:
Be capable of distributed operation: designed to run on multiple distributed machines
Be scalable: designed to increase the crawl rate by adding more machines
Performance/efficiency: permit full use of available processing and network resources
Fetch pages of higher quality first
Continuous operation: continue fetching fresh copies of previously fetched pages
Extensible: adapt to new data formats and protocols

Processing steps in crawling

Pick a URL from the frontier (which one?)
Fetch the document at the URL
Parse the fetched document
Extract links from it to other docs (URLs)
Check if the URL has content already seen; if not, add it to the indexes
For each extracted URL:
Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
Check if it is already in the frontier (duplicate URL elimination)

Basic crawl architecture

[Figure: the fetcher pulls pages from the WWW (using DNS resolution); the parser extracts content and links; a "content seen?" test against document fingerprints, the URL filter (with robots.txt), and duplicate URL elimination feed surviving URLs into the URL frontier.]

DNS (Domain Name System)

A lookup service on the internet
Given a URL's hostname, retrieve its IP address
The service is provided by a distributed set of servers, so lookup
latencies can be high (even seconds)
Common OS implementations of DNS lookup are blocking: only one
outstanding request at a time
Solutions:
DNS caching
Batch DNS resolver: collects requests and sends them out together
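The sketch below illustrates the DNS-caching idea; it wraps the standard socket.gethostbyname call with a simple in-memory dictionary, which is an illustrative simplification (real crawlers also honor DNS TTLs and batch their requests).

```python
import socket

_dns_cache = {}  # hostname -> IP address

def resolve(hostname):
    """Return the IP for hostname, consulting a local cache first."""
    if hostname in _dns_cache:
        return _dns_cache[hostname]          # cache hit: no network round trip
    ip = socket.gethostbyname(hostname)      # blocking OS lookup (cache miss)
    _dns_cache[hostname] = ip
    return ip

# Repeated lookups of the same host now avoid the slow DNS round trip.
# print(resolve("example.com")); print(resolve("example.com"))
```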


Parsing: URL normalization

When a fetched document is parsed, some of the extracted links are
relative URLs.
E.g., http://en.wikipedia.org/wiki/Main_Page has a relative link to
/wiki/Wikipedia:General_disclaimer, which is the same as the absolute URL
http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
During parsing, such relative URLs must be normalized (expanded).
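In Python this expansion can be done with urllib.parse.urljoin from the standard library, as in the short sketch below using the Wikipedia URLs from the example.

```python
from urllib.parse import urljoin

base = "http://en.wikipedia.org/wiki/Main_Page"
relative = "/wiki/Wikipedia:General_disclaimer"

# urljoin resolves the relative link against the page it was found on.
absolute = urljoin(base, relative)
print(absolute)  # http://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
```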

Content seen?
Duplication is widespread on the web.
If the page just fetched is already in the index, do not process it further.
This is verified using document fingerprints or shingles.
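A minimal sketch of the idea, assuming a simple word-level shingling scheme: each page is reduced to a set of hashed k-word shingles, and a new page counts as "already seen" if its shingle set is nearly identical (high Jaccard similarity) to one stored earlier. Real systems use more robust schemes such as MinHash or SimHash.

```python
import hashlib

def shingles(text, k=4):
    """Return the set of hashed k-word shingles of a page's text."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()
        for i in range(max(len(words) - k + 1, 1))
    }

seen_fingerprints = []   # shingle sets of pages already indexed

def content_seen(text, threshold=0.9):
    """True if the page is a (near-)duplicate of something already indexed."""
    s = shingles(text)
    for t in seen_fingerprints:
        overlap = len(s & t) / max(len(s | t), 1)   # Jaccard similarity
        if overlap >= threshold:
            return True
    seen_fingerprints.append(s)
    return False
```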


Filters and robots.txt

Filters: regular expressions for URLs to be crawled or not.
Once a robots.txt file is fetched from a site, there is no need to fetch
it repeatedly; doing so burns bandwidth and hits the web server.
Cache robots.txt files.
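Python's standard library offers urllib.robotparser for this; the sketch below fetches a site's robots.txt once, keeps the parser object in a cache keyed by host, and reuses it for later permission checks (the host and user-agent string are illustrative).

```python
from urllib import robotparser
from urllib.parse import urlparse

_robots_cache = {}   # host -> parsed robots.txt

def allowed(url, user_agent="MyCrawler"):
    host = urlparse(url).netloc
    rp = _robots_cache.get(host)
    if rp is None:
        rp = robotparser.RobotFileParser()
        rp.set_url(f"http://{host}/robots.txt")
        rp.read()                 # fetched once per host, then cached
        _robots_cache[host] = rp
    return rp.can_fetch(user_agent, url)

# allowed("http://example.com/some/page")  -> True or False per robots.txt
```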


Duplicate URL elimination

For a non-continuous (one-shot) crawl, test whether an extracted and
filtered URL has already been passed to the frontier.
For a continuous crawl, see the details of the frontier implementation.


Distributing the crawler

Run multiple crawl threads, under different processes, potentially at
different nodes
Geographically distributed nodes
Partition the hosts being crawled among the nodes
A hash function is used for the partition
How do these nodes communicate and share URLs?
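A minimal sketch of hash-based host partitioning, assuming a fixed number of crawler nodes: each extracted URL is assigned to the node responsible for its host, so all URLs of a host are crawled by the same node (which also simplifies politeness).

```python
import hashlib
from urllib.parse import urlparse

NUM_NODES = 4   # assumed number of crawler nodes

def node_for(url):
    """Map a URL's host to one of the crawler nodes."""
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

# URLs from the same host always land on the same node:
# node_for("http://example.com/a") == node_for("http://example.com/b")
```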


Communication between nodes

The output of the URL filter at each node is sent to the duplicate URL
eliminator of the appropriate node.

[Figure: the distributed crawl architecture adds a host splitter after the URL filter; URLs belonging to hosts owned by other nodes are sent to those nodes, and URLs received from other nodes are merged in before duplicate URL elimination and the URL frontier.]

URL frontier: two main considerations

Politeness: do not hit a web server too frequently
Freshness: crawl some pages more often than others
E.g., pages (such as news sites) whose content changes often
These goals may conflict with each other.
(E.g., a simple priority queue fails: many links out of a page go to its
own site, creating a burst of accesses to that site.)

Politeness: challenges
Even if we restrict only one thread to fetch from a host, it can hit the
host repeatedly.
Common heuristic: insert a time gap between successive requests to a host
that is >> the time taken by the most recent fetch from that host.
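A minimal sketch of this heuristic: record when each host was last fetched and how long that fetch took, and make a thread wait until a gap of several times that duration has passed (the multiplier of 10 is an illustrative choice).

```python
import time
from urllib.parse import urlparse

GAP_FACTOR = 10                      # gap >> last fetch time (assumed factor)
_last_fetch = {}                     # host -> (finish_time, fetch_duration)

def wait_politely(url):
    """Sleep until the host may be contacted again, then record the fetch."""
    host = urlparse(url).netloc
    if host in _last_fetch:
        finished, duration = _last_fetch[host]
        earliest = finished + GAP_FACTOR * duration
        delay = earliest - time.time()
        if delay > 0:
            time.sleep(delay)
    start = time.time()
    # ... perform the actual fetch here ...
    duration = max(time.time() - start, 0.1)   # assume the fetch took some time
    _last_fetch[host] = (time.time(), duration)
```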


Front queues

[Figure: the prioritizer assigns each incoming URL to one of K FIFO front queues; a biased front queue selector feeds URLs from the front queues into the back queue router.]

The prioritizer assigns each URL an integer priority between 1 and K and
appends the URL to the corresponding front queue.
Heuristics for assigning priority:
Refresh rate sampled from previous crawls
Application-specific criteria (e.g., crawl news sites more often)

Back queue heap

One entry for each back queue.
The entry is the earliest time t_e at which the host corresponding to that
back queue can be hit again.
This earliest time is determined from:
The last access to that host
Any time-buffer heuristic we choose

Back queue processing

A crawler thread seeking a URL to crawl:
Extracts the root of the heap
Fetches the URL at the head of the corresponding back queue q (looked up
from a table)
Checks if queue q is now empty; if so, pulls a URL v from the front queues
If there is already a back queue for v's host, appends v to that queue and
pulls another URL from the front queues, repeating until q is refilled
Else adds v to q
When q is non-empty again, creates a heap entry for it
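A minimal sketch of the back-queue heap using Python's heapq, under the assumption that each heap entry is an (earliest_time, queue_id) pair; a worker pops the root, waits if needed, and fetches from that back queue. The 10-second gap and the sample queue contents are illustrative.

```python
import heapq
import time

# Each heap entry: (earliest time the host may be hit again, back-queue id)
heap = []
back_queues = {0: ["http://example.com/a", "http://example.com/b"]}  # illustrative

heapq.heappush(heap, (time.time(), 0))     # host 0 may be fetched now

def next_url():
    """Pop the back queue whose host becomes available soonest."""
    earliest, qid = heapq.heappop(heap)
    delay = earliest - time.time()
    if delay > 0:
        time.sleep(delay)                  # respect the politeness gap
    url = back_queues[qid].pop(0)
    # After fetching, re-insert the queue with a new earliest-hit time.
    if back_queues[qid]:
        heapq.heappush(heap, (time.time() + 10, qid))   # 10 s gap (assumed)
    return url
```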
Indexing
Indexing the web
An inverted index is created: the forward index is sorted according to word.
For every valid wordID in the lexicon, a pointer is created to the
appropriate barrel, which points to a list of docIDs and hit lists.
The index thus maps keywords to URLs.
Some wrinkles:
Morphology: stripping suffixes (stemming), singular vs. plural, tense, case folding
Semantic similarity: words with similar meanings share an index entry
Issue: trading coverage (number of hits) for precision (how closely hits match the request)
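A minimal sketch of building an inverted index in Python, assuming documents are already fetched and identified by docID; it maps each term to the list of documents (and positions) containing it, a simplification of the barrel/hit-list layout described above.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, position), ...]}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))      # postings with positions
    return index

docs = {1: "web search and ranking", 2: "web crawling and indexing"}
index = build_inverted_index(docs)
# index["web"] -> [(1, 0), (2, 0)]
```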
Indexing Issues
Indexing techniques were designed for static collections.
How to deal with pages that change?
Periodic crawls, rebuilding the index
Crawls at varied frequencies
Records need a way to be purged
A hash of the page is stored
The text of a link to a page can be used to help label that page;
this helps eliminate the addition of spurious keywords.
Indexing Issues
Availability and speed
Most search engines will cache the page being referenced.
Multiple search terms (see the sketch after this list)
OR: separate searches are concatenated
AND: the intersection of the searches is computed
Regular expressions are not typically handled.
Parsing
Must be able to handle malformed HTML and partial documents
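A minimal sketch of how AND and OR combine posting lists using Python sets, assuming an inverted index like the one sketched earlier; the document IDs are illustrative.

```python
# Postings (doc-ID sets) for two hypothetical terms:
postings = {
    "web":    {1, 2, 4, 7},
    "search": {2, 3, 7, 9},
}

# OR: union of the separate searches.
or_result = postings["web"] | postings["search"]    # {1, 2, 3, 4, 7, 9}

# AND: intersection of the searches.
and_result = postings["web"] & postings["search"]   # {2, 7}

print(sorted(or_result), sorted(and_result))
```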
LINK ANALYSIS
Markov chains

Here P is the transition probability matrix of the chain, with P_ij the
probability of moving from state i to state j.
Clearly, for all i, sum over j = 1..n of P_ij = 1 (each row of P sums to 1).
Markov chains are abstractions of random walks.
Exercise: represent the teleporting random walk as a Markov chain.
One way of computing the steady-state vector a

Recall that, regardless of where we start, we eventually reach the steady
state a.
Start with any distribution (say x = (1 0 ... 0)).
After one step, we are at xP; after two steps at xP^2, then xP^3, and so on.
"Eventually" means that for large k, xP^k = a.
Algorithm: multiply x by increasing powers of P until the product looks
stable.
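A minimal power-iteration sketch of this procedure in Python, assuming a small hard-coded transition matrix P whose rows sum to 1; it repeatedly multiplies the distribution x by P until the result stops changing.

```python
def power_iteration(P, tol=1e-10, max_steps=1000):
    """Return the steady-state distribution a with a = aP."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)          # start with x = (1 0 ... 0)
    for _ in range(max_steps):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(nxt[j] - x[j]) for j in range(n)) < tol:
            return nxt                   # the product looks stable
        x = nxt
    return x

# Illustrative 2-state transition matrix (rows sum to 1):
P = [[0.1, 0.9],
     [0.5, 0.5]]
print(power_iteration(P))               # approx. [0.357, 0.643]
```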

Hyperlink-Induced Topic Search (HITS)

In response to a query, instead of an ordered list of pages each meeting
the query, find two sets of inter-related pages:
Hub pages are good lists of links on a subject, e.g., Bob's list of
cancer-related links.
Authority pages occur recurrently on good hubs for the subject.
Best suited for broad-topic queries rather than for page-finding queries.
Gets at a broader slice of common opinion.

Hubs and Authorities

Thus, a good hub page for a topic points to many authoritative pages for
that topic, and a good authority page for a topic is pointed to by many
good hubs for that topic.
This is a circular definition; we will turn it into an iterative
computation.
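A minimal sketch of that iterative computation, assuming the link graph is given as an adjacency list: hub and authority scores are updated from each other and normalized each round (the normalization scheme and iteration count are illustrative choices).

```python
import math

def hits(links, iterations=50):
    """links: {page: [pages it points to]}. Returns (hub, authority) scores."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages pointing to the page.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        # Hub score: sum of authority scores of pages the page points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

hub, auth = hits({"A": ["B", "C"], "D": ["B", "C"], "B": [], "C": []})
# B and C get high authority scores; A and D get high hub scores.
```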
The hope

[Figure: hub pages (e.g., personal pages of Alice and Bob) pointing to authority pages of mobile telecom companies such as AT&T, ITIM, and O2.]
Applications of Digital Libraries
