
· Crawling: the search engine downloads web pages so that they can be processed and retrieved quickly later.

· Focused crawling: the crawler downloads only the links (pages) relevant to a particular topic or site, starting from the pages it has already visited.

· Pages that cannot be crawled: form result pages, private sites, and scripted pages.

· A sitemap is an XML file that tells the search engine about the pages of your web site and lets the crawler download them.

Distributed Crawling
Why do we need multiple computers for crawling?
· To put each crawler close to the sites it crawls
(same zone).
Long network distances lead to:
lower throughput (fewer bytes copied per second) and
higher latency (more time to transfer bytes across the network).
· To reduce the number of sites each crawler has to remember.
· Each crawler stores its links in a queue. The queues should not contain duplicates of links held by other crawlers, which improves the efficiency of the crawling process.
· To reduce computing resources
such as CPU, network bandwidth, storage, …
· In general, multiple queues are used instead of one queue.
· A hash function is used to identify which URL belongs to each crawler, as in the sketch below.
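A minimal sketch, in Python, of how a hash function might assign URLs to crawler machines; hashing the host name keeps all pages from the same site in the same crawler's queue. The function and parameter names are illustrative, not from the notes.

```python
import hashlib

# Assign a URL to one of num_crawlers machines by hashing its host name,
# so that all URLs from the same site land in the same queue.
def assign_crawler(url: str, num_crawlers: int) -> int:
    host = url.split("/")[2] if "://" in url else url.split("/")[0]
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

print(assign_crawler("http://example.com/page1.html", 4))  # same crawler...
print(assign_crawler("http://example.com/page2.html", 4))  # ...as this one
```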
· Information on personal computers (mail, Excel and Word files, …) can be searched using desktop search.
· Information on enterprise computers (many different types of files) can be searched using enterprise search.
· Both of them (personal and enterprise) use desktop crawling.
· Update speed: data is easier to find than with web crawling, because the area to search is small. Modern file systems can send a notification to the crawler so that it copies new changes immediately.
· Disk space: a web crawler copies every document stored on the web into the search engine, while the desktop crawler only copies the local documents stored on the enterprise or desktop computers.
· Different types of files have to be converted to HTML so they can be understood by the indexer when it builds the index terms.
· Data privacy: several users may share the same computer, each with a profile hidden from the others. The crawler cannot crawl this private data.
· Document feeds
A feed is a technique for accessing documents in real time.
· Documents are created at a fixed time and rarely updated afterwards, such as news articles and mail messages.
· A crawler uses document feeds to find new documents easily, because the documents in a feed are ordered in a sequence.
· Document feeds include:
· Push feeds: alert the subscriber to new documents (example: a mobile ringtone alert).
· Pull feeds: the subscriber has to check for new documents (example: checking a mailbox).
· Advantages of document feeds:
· Easy to parse (all documents have the same format), as in the sketch below.
· A single location to find new information.
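A minimal sketch of reading a pull feed (RSS). The feed content here is a made-up example; a real crawler would fetch the XML from the feed's URL and remember which items it has already seen.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS feed standing in for one fetched over the network.
rss = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>Story one</title><link>http://example.com/1</link></item>
  <item><title>Story two</title><link>http://example.com/2</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
for item in root.iter("item"):
    print(item.findtext("title"), "->", item.findtext("link"))
```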
· Conversion problem
Conversion is a problem for search engines, because text is stored in many different formats such as txt, rtf, html, xml, pdf, ppt, and xls.
· Several formats are not supported by commercial search engines. We need to store the documents in a format that the search engine supports, in order to improve the indexer and the efficiency in general.
· This problem is solved by using a conversion tool that converts the different text formats to HTML or XML.
· Why HTML and XML?
· They are easy to parse.
· They use tags, so some of the important formatting (font size, headings) is preserved; this helps in building the index by giving a higher score to some terms.
· Character encoding
· Text consists of letters, numbers, and special symbols, traditionally represented in ASCII (7 bits) or EBCDIC (8 bits).
· These formats (ASCII, EBCDIC) are fine for English, but not for languages such as Chinese or Arabic, which need more than 8 bits per character. This becomes a problem when multiple languages are used in the same document.
· To solve this, the Unicode representation was developed:
UTF-8, UTF-16, UTF-32, UCS-2.
· Unicode saves space by offering different encoding formats.
· Some symbols or letters need 3 or 4 bytes to represent, such as Chinese characters, as the sketch below shows.
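A small illustration (not from the notes) of how UTF-8 uses a different number of bytes per character depending on the script.

```python
# 'a' takes 1 byte, 'é' and the Arabic 'ع' take 2 bytes,
# the Chinese '中' takes 3 bytes, and the emoji takes 4 bytes.
for ch in ["a", "é", "ع", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, "->", len(encoded), "byte(s):", encoded)
```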
· Storing the Documents
After the documents are converted to the required format (HTML, XML), they have to be stored for indexing.
· In desktop search there is no need to store the documents, because they are already stored locally on the computer; for other kinds of search it is required.
· Why do we need to store the converted documents?
· Crawling is expensive in terms of CPU and network load, so we want to save the time and effort of crawling again.
· Efficient access to the text for information extraction (links, tags, names, headings, anchor text, …).
Basic requirements for a document storage system:
· Database system
· Random access
· Compression
· Updating

· Database system
A database is a good place to store document data because:
· it takes care of all the difficult low-level details;
· it runs on a network server, so documents can be accessed through the network and are available at any time;
· it provides analysis tools to manage and control the data.
· Random Access
· The data of a document is requested based on its URL. To do that, a hashing technique is used, as in the sketch below.
· Hashing is used for both small and large amounts of data. For large amounts, a B-tree or another sorted data structure is used to identify the document's position within the file.
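A minimal sketch (names are illustrative) of random access by URL: the URL is hashed to a key, and the key maps to where the document is stored. Here a plain dict stands in for the document store.

```python
import hashlib

store = {}  # hash of URL -> document content

def url_key(url: str) -> str:
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def put(url: str, content: str) -> None:
    store[url_key(url)] = content

def get(url: str) -> str:
    return store[url_key(url)]

put("http://example.com/a.html", "<html>hello</html>")
print(get("http://example.com/a.html"))
```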

· Compression
· The text in documents is redundant, so it should be compressed. This reduces the storage required per document, which leads to more efficient access.
· Compression works well with large amounts of data.
· Update
· When a new version of a document reaches the crawler, the document storage has to be updated.
· To save storage, the new version of the document is merged with the old version already stored in the database. If the differences between the two versions are large, the merging process becomes expensive.
· Handling anchor text:
new anchor text is added to the stored data so that its content is included in the indexing process, which has a good impact on ranking.

· BigTable
A well-known structure for storing very large document collections.
· Used at Google.
· It is a distributed database built for the tasks of finding, storing, and updating web pages.
· Its size is on the order of a petabyte (2^50, roughly 10^15 bytes).
· Each database is represented by a single table.
· Each table is split into small pieces called
"tablets".
· The tablets are served by thousands of machines
(inexpensive computers,
so adding capacity is cheap).
· BigTable is a distributed database.
· Although BigTable works like a database, there is no query language, only row-level transactions (simple compared to SQL).
· Because BigTable runs on inexpensive machines, it has to handle failure recovery. For this reason, the tablets are stored in a replicated file system that is accessible by all BigTable servers.
· This means that any change to a BigTable tablet is also recorded in a log file; if any tablet server crashes, another BigTable server reads the tablet data and the log file.
· Once file data is written to a BigTable, it is never changed. This helps in failure recovery.
· New data is stored in RAM, while older data is kept in files; to reduce the number of files, the two are merged periodically.
· A BigTable is logically organized into rows;
each row represents a URL (a single web page).
· A BigTable can have a huge number of rows and columns.
· Each cell in the table is referred to by a row key, a column key, and a timestamp, as in the sketch below.
· All rows have the same column groups, but not all rows have the same columns.
· Rows are partitioned into tablets based on the row key.
Example: all URLs beginning with "a" could be located in one tablet, while all those starting with "b" could be in another tablet.
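A minimal sketch (purely illustrative, not Google's API) of the BigTable data model described above: each cell is addressed by (row key, column key, timestamp).

```python
from collections import defaultdict

table = defaultdict(dict)  # row key -> {(column key, timestamp): value}

def put(row, column, timestamp, value):
    table[row][(column, timestamp)] = value

def get_latest(row, column):
    # Return the value with the largest timestamp for this row/column.
    versions = [(ts, v) for (col, ts), v in table[row].items() if col == column]
    return max(versions)[1] if versions else None

put("com.example/index.html", "contents", 1, "<html>old</html>")
put("com.example/index.html", "contents", 2, "<html>new</html>")
print(get_latest("com.example/index.html", "contents"))  # -> "<html>new</html>"
```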
Detecting Duplicates
· Duplicate documents occur for many different reasons: copies, new versions, plagiarism, spam, mirror sites, and several URLs pointing to the same site.
· The result is a high number of duplicated documents; one study found that about 30% of documents are exact or near-duplicates.
· Duplicated documents provide little or no new information.
· Duplication consumes significant resources (CPU, memory, network, …) during crawling, indexing, and search.
· Several algorithms have been developed to remove duplicated documents during indexing and ranking, such as:
· Checksumming, a technique designed for exact-duplicate documents.
A checksum is a value computed from the content of the document (e.g., the sum of the bytes in the file), as in the sketch below.
· A checksum only matches files with exactly the same value. In some cases documents have the same byte sum even though the characters are in a different order; in that case the Cyclic Redundancy Check (CRC) function is used.
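A minimal sketch of a byte-sum checksum and a CRC for exact-duplicate detection; the example strings show why the simple byte sum can collide when the same characters appear in a different order.

```python
import zlib

def byte_sum_checksum(text: str) -> int:
    # The simplest possible checksum: the sum of the bytes in the document.
    return sum(text.encode("utf-8"))

doc1 = "tropical fish"
doc2 = "fish tropical"          # same characters, different order

print(byte_sum_checksum(doc1) == byte_sum_checksum(doc2))       # True  (collision)
print(zlib.crc32(doc1.encode()) == zlib.crc32(doc2.encode()))   # False (CRC tells them apart)
```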

· If the document is a near-duplicate, we need another technique, based on a threshold value that decides whether two documents count as duplicates or not.

Near-duplicate detection
· If the document is a near-duplicate, we need other techniques in order to detect it.
· Two scenarios for near-duplicate detection:
· Search: find near-duplicates of a given document using IR techniques, which requires O(N) comparisons.
· Discovery: find all pairs of near-duplicate documents using fingerprinting techniques, which requires O(N^2) comparisons (comparing all pairs of documents).
How are fingerprints generated?
The document is parsed into words, the words are grouped into contiguous n-grams, a subset of the n-grams is selected to represent the document, and the selected n-grams are hashed; the hash values form the fingerprint.
· Two techniques to select the n-grams from the document:
· Specified combinations of characters.
· 0 mod p (keep an n-gram only if its hash value is divisible by p), as in the sketch below.
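A minimal sketch (illustrative only) of fingerprinting with "0 mod p" selection: hash every word 3-gram and keep only the hash values divisible by p. The hash function and constants are assumptions, not from the notes.

```python
import hashlib

def fingerprint(text: str, n: int = 3, p: int = 4):
    words = text.lower().split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    hashes = [int(hashlib.md5(g.encode()).hexdigest(), 16) % 10**8 for g in ngrams]
    # "0 mod p" selection: keep an n-gram's hash only if it is divisible by p.
    return sorted(h for h in hashes if h % p == 0)

print(fingerprint("tropical fish include fish found in tropical environments"))
```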
· Fingerprints do not capture all of the information in a document. This leads to errors in the detection of near-duplicate documents, which hurts effectiveness.
· To overcome these errors, the simhash fingerprinting technique was developed.
· Simhash combines the advantages of word-based similarity measures with the efficiency of fingerprint-based hashing.
· The similarity of two pages is measured by the cosine correlation measure.

· Stopwords are removed from the content.

Simhash example
· Remove all stopwords, then give each remaining word a weight equal to the number of times it occurs.
· Hash each word to a binary value with as many bits as the requested fingerprint size (for an 8-bit fingerprint, 8 bits per word).
· Build a vector with one entry per bit position. For each word and each bit position, add the word's weight to that entry if the bit is 1 and subtract the weight if the bit is 0, summing over all words.
· The 8-bit fingerprint is then read off the vector: put 1 where the entry is positive and 0 where it is negative.
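A minimal sketch of the simhash procedure described above (8-bit version); the stopword list, hash function, and example text are illustrative assumptions.

```python
import hashlib
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "and", "is"}   # illustrative list

def word_hash(word: str, bits: int = 8) -> int:
    return int(hashlib.md5(word.encode()).hexdigest(), 16) % (1 << bits)

def simhash(text: str, bits: int = 8) -> int:
    # Weight each non-stopword by its frequency.
    weights = Counter(w for w in text.lower().split() if w not in STOPWORDS)
    vector = [0] * bits
    for word, weight in weights.items():
        h = word_hash(word, bits)
        for i in range(bits):
            vector[i] += weight if (h >> i) & 1 else -weight
    # Positive entries become 1 bits, negative entries become 0 bits.
    return sum(1 << i for i in range(bits) if vector[i] > 0)

print(format(simhash("tropical fish include fish found in tropical environments"), "08b"))
```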
Removing Noise
· Many web pages contain text, links, and pictures that are not related to the main content of the page.
· From the point of view of the search engine these contents are noise that will affect the ranking of the page (ranking is partly based on word counts).
· Several techniques have been developed to ignore this content, or to reduce its importance, in the indexing.
Text processing
· Aims to convert all the forms of text into one uniform set of index terms.
· An index term is the representation of the document's content.
· Example: the "find" feature in a word processor.
· Search engines ignore the case of words.
· Text processing techniques include:
· Tokenization
· Stopping
· Stemming
· Structure (bold, headings, …)
· Links
· Text Statistics
Some words occur more frequently than others in documents, such as "and", "the", and so on, so we can predict word occurrence statistics.
· Retrieval models and ranking depend on the statistical properties of words. For example: important words occur in documents, but with low frequency.
· The frequency distribution of words in documents is very skewed.
Zipf's Law
· Describes the distribution of word frequencies
(very skewed).
· Few words occur very often; many words occur rarely.
· Example: the two most common words ("the", "of") make up about 10% of all word occurrences in text documents, the most frequent 6 words make up 20%, and the most frequent 50 words make up 40%.
· In a large text, about one half of the unique words occur only once.
· Zipf's "law":
· the observation that the rank (r) of a word times its frequency (f) is approximately a constant (k)
· assuming words are ranked in order of decreasing frequency
· i.e., r · f ≈ k, or r · Pr ≈ c, where Pr is the probability of word occurrence and c ≈ 0.1 for English; a quick check is sketched below.
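An illustrative check of Zipf's law: rank times probability of occurrence should stay roughly constant (about 0.1 for English). The text here is a tiny placeholder; a real check needs a large corpus to show the effect clearly.

```python
from collections import Counter

text = ("the quick brown fox jumps over the lazy dog and the dog barks at "
        "the fox while the cat watches the dog and the fox").lower().split()
counts = Counter(text)
total = len(text)

# For each of the top-ranked words, print r * Pr.
for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
    print(f"{rank:2d}  {word:6s}  r * Pr = {rank * freq / total:.3f}")
```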

Vocabulary Growth
· As corpus grows, so does vocabulary size
· Fewer new words when corpus is already large
· Observed relationship (Heaps' Law):
v = k · n^β
where v is vocabulary size (number of unique words), n is the number of words in the corpus, and k, β are parameters that vary for each corpus (typical values given are 10 ≤ k ≤ 100 and β ≈ 0.5).
(A corpus is a large and structured set of texts.)

Heaps' Law Predictions
· Heaps' law predicts that the number of new words will increase very rapidly when the corpus is small and will continue to increase indefinitely, but at a slower rate for larger corpora, as the calculation below illustrates.
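A small illustrative Heaps' law prediction, v = k · n^β, with assumed parameter values (k = 62 and β = 0.45 are just plausible examples, not fitted to any particular corpus).

```python
k, beta = 62, 0.45  # assumed parameters for illustration only

for n in [1_000, 1_000_000, 1_000_000_000]:
    v = k * n ** beta
    print(f"corpus of {n:>13,} words -> about {v:>12,.0f} unique words")
```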
Estimating collection and result sizes
· Word occurrence statistics can be used to estimate the size of a result set in web search.
· To estimate the result size, we first define a result as any document (page) that contains all of the words of the query.
· (Some search applications also rank documents that do not include all of the query words.)
· Suppose that each word occurs in documents independently of the others. Then the probability of a document containing all of the words in the query is the product of the probabilities of the individual words occurring in a document:
P(a ∩ b ∩ c) = P(a) · P(b) · P(c)
· So if we have a query of three words (a, b, c), the probability of finding all of them in a document is P(a ∩ b ∩ c) = P(a) · P(b) · P(c).
· Search engines always have access to the total number of documents (N) and the number of documents each word occurs in (fa, fb, fc), so:
P(a) = fa/N, P(b) = fb/N, and P(c) = fc/N.
· How many pages contain all of the query terms?
fabc = N · P(a ∩ b ∩ c) = (fa · fb · fc)/N^2
Result Set Size Estimation
· These are poor estimates, because words are not independent.
· Better estimates are possible if co-occurrence information is available:
P(a ∩ b ∩ c) = P(a ∩ b) · P(c | (a ∩ b))
Estimating collection size
· The total number of documents (N) stored in the search engine can be estimated from the query word statistics:
N = (fa · fb)/fab,
where a and b are two independent words. A worked example is sketched below.
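A worked example of the two estimates above; all of the numbers here are made up for illustration.

```python
N = 1_000_000_000                                   # assumed total number of indexed documents
fa, fb, fc = 50_000_000, 20_000_000, 5_000_000      # assumed document frequencies of a, b, c

# Estimated number of pages containing all three query terms,
# assuming the terms occur independently: fabc = (fa * fb * fc) / N^2
fabc = fa * fb * fc / N**2
print(f"estimated result size: {fabc:,.0f} pages")

# Estimating the collection size from two independent words a and b,
# given fab, the number of documents containing both: N = (fa * fb) / fab
fab = 1_200_000
print(f"estimated collection size: {fa * fb / fab:,.0f} documents")
```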
Tokenizing
· Forming words from sequence of
characters
· Surprisingly complex in English,
can be harder in other languages
· Early IR systems:
· any sequence of alphanumeric
characters of length 3 or more
· terminated by a space or other
special character
· upper-case changed to lower-case
· Example:
· “Bigcorp's 2007 bi-annual report
showed profits rose 10%.” becomes
· “bigcorp 2007 annual report
showed profits rose”
· Too simple for search applications
or even large-scale experiments
· Why? Too much information lost
· Small decisions in tokenizing can
have major impact on effectiveness
of some queries
Tokenizing Problems
· Small words can be important in
some queries, usually in
combinations
· xp, ma, pm, ben e king, el paso, master p,
gm, j lo, world war II
· Both hyphenated and non-
hyphenated forms of many words
are common
· Sometimes hyphen is not needed
· e-bay, wal-mart, active-x, cd-rom, t-shirts
· At other times, hyphens should be
considered either as part of the word
or a word separator
· winston-salem, mazda rx-7, e-cards, pre-
diabetes, t-mobile, spanish-speaking
· Special characters are an important
part of tags, URLs, code in
documents
· Capitalized words can have different
meaning from lower case words
· Bush, Apple
· Apostrophes can be a part of a
word, a part of a possessive, or just
a mistake
· rosie o'donnell, can't, don't, 80's,
1890's, men's straw hats, master's
degree, england's ten largest cities,
shriner's
· Numbers can be important,
including decimals
· nokia 3250, top 10 courses, united
93, quicktime 6.5 pro, 92.3 the
beat, 288358
· Periods can occur in numbers,
abbreviations, URLs, ends of
sentences, and other situations
· I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
· Note: tokenizing steps for queries
must be identical to steps for
documents
Tokenizing Process
· First step is to use parser to identify
appropriate parts of document to
tokenize
· Defer complex decisions to other
components
· word is any sequence of alphanumeric
characters, terminated by a space or
special character, with everything
converted to lower-case
· everything indexed
· example: 92.3 → 92 3 but search finds
documents with 92 and 3 adjacent
· incorporate some rules to reduce
dependence on query transformation
components
· Not that different than simple
tokenizing process used in past
· Examples of rules used with TREC
· Apostrophes in words ignored
· o’connor → oconnor bob’s → bobs
· Periods in abbreviations ignored
· I.B.M. → ibm Ph.D. → ph d
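A minimal sketch of the tokenizing process described above: lower-case everything, drop apostrophes, drop periods inside abbreviations, and split on non-alphanumeric characters. The regular expressions are simplified illustrations of the TREC-style rules, not a full implementation.

```python
import re

def tokenize(text: str):
    text = text.lower()
    text = text.replace("'", "").replace("’", "")            # o'connor -> oconnor
    text = re.sub(r"\b(\w)\.(?=\w\.|\s|$)", r"\1", text)      # i.b.m. -> ibm (roughly)
    return re.findall(r"[a-z0-9]+", text)                     # split on everything else

print(tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%."))
print(tokenize("I.B.M. hired rosie o'donnell"))
```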

Stopping
· Function words (determiners,
prepositions) have little meaning
on their own
· High occurrence frequencies
· Treated as stopwords (i.e.
removed)
· reduce index space, improve
response time, improve
effectiveness
· Can be important in
combinations
· e.g., “to be or not to be”
· Stopword list can be created from
high-frequency words or based
on a standard list
· Lists are customized for
applications, domains, and even
parts of documents
· e.g., “click” is a good stopword for
anchor text
· Best policy is to index all words in
documents, make decisions
about which words to use at
query time
Stemming
· Many morphological variations of
words
· inflectional (plurals, tenses)
· derivational (making verbs nouns
etc.)
· In most cases, these have the
same or very similar meanings
· Stemmers attempt to reduce
morphological variations of
words to a common stem
· usually involves removing suffixes
· Can be done at indexing time or
as part of query processing (like
stopwords)

· Generally a small but significant effectiveness improvement
· can be crucial for some languages
· e.g., 5-10% improvement for English, up to 50% in Arabic
· Two basic types
· Dictionary-based: uses lists of
related words
· Algorithmic: uses program to
determine related words
· Algorithmic stemmers
· suffix-s: remove ‘s’ endings
assuming plural
· e.g., cats → cat, lakes → lake, wiis →
wii
· Many false negatives: supplies →
supplie
· Some false positives: ups → up
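A minimal sketch of the "suffix-s" algorithmic stemmer mentioned above: strip a trailing 's' on the assumption that it marks a plural. The examples show why it produces both false negatives and false positives.

```python
def suffix_s_stem(word: str) -> str:
    # Remove a trailing 's', assuming it marks a plural.
    return word[:-1] if word.endswith("s") else word

for w in ["cats", "lakes", "wiis", "supplies", "ups"]:
    print(w, "->", suffix_s_stem(w))
# supplies -> supplie  (false negative: should be "supply")
# ups -> up            (false positive: "ups" is not a plural of "up")
```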

Porter Stemmer
· Algorithmic stemmer used in IR
experiments since the 70s
· Consists of a series of rules designed to remove the longest possible suffix at each step
· Effective in TREC
· Produces stems, not words
· Makes a number of errors and is difficult to modify
Krovetz Stemmer
· Hybrid algorithmic-dictionary
· Word checked in dictionary
· If present, either left alone or replaced
with “exception”
· If not present, word is checked for suffixes
that could be removed
· After removal, dictionary is checked again
· Produces words not stems
· Comparable effectiveness
· Lower false positive rate, somewhat
higher false negative
Phrases
· Many queries are 2-3 word phrases
· Phrases are
· More precise than single words
· e.g., documents containing “black sea” vs.
two words “black” and “sea”
· Less ambiguous
· e.g., “big apple” vs. “apple”
· Can be difficult for ranking
· e.g., Given query “fishing supplies”, how
do we score documents with
· exact phrase many times, exact phrase just
once, individual words in same sentence,
same paragraph, whole document,
variations on words?
· Text processing issue – how are
phrases recognized?
· Three possible approaches:
· Identify syntactic phrases using a
part-of-speech (POS) tagger
· Use word n-grams
· Store word positions in indexes and
use proximity operators in queries
POS Tagging
· POS taggers use statistical models
of text to predict syntactic tags of
words
· Example tags:
· NN (singular noun), NNS (plural noun),
VB (verb), VBD (verb, past tense), VBN
(verb, past participle), IN
(preposition), JJ (adjective), CC
(conjunction, e.g., “and”, “or”), PRP
(pronoun), and MD (modal auxiliary,
e.g., “can”, “will”).
· Phrases can then be defined as
simple noun groups, for example
Word N-Grams
· POS tagging too slow for large
collections
· Simpler definition – phrase is any
sequence of n words – known as
n-grams
· bigram: 2 word sequence, trigram:
3 word sequence, unigram: single
words
· N-grams also used at character
level for applications such as OCR
N-grams (both character and word) are typically formed from overlapping sequences: choose a value for n and move an n-unit "window" forward one unit (character or word) at a time through the document.
For example, the word "tropical" contains the following character bigrams: tr, ro, op, pi, ic, ca, and al.

· Indexes based on n-grams are obviously larger than word indexes.
Exercise: find the character bigrams of the word "tropical".

tr ro op pi ic ca al
Bigrams = 7

Trigrams?
tro rop opi pic ica cal
Trigrams = 6
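A minimal sketch of generating overlapping character n-grams by sliding an n-character window one character at a time, reproducing the exercise above.

```python
def char_ngrams(word: str, n: int):
    # One n-gram starting at every position, moving the window one character at a time.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("tropical", 2))  # ['tr', 'ro', 'op', 'pi', 'ic', 'ca', 'al']
print(char_ngrams("tropical", 3))  # ['tro', 'rop', 'opi', 'pic', 'ica', 'cal']
```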
· Frequent n-grams are more likely
to be meaningful phrases
· N-grams form a Zipf distribution
· Better fit than words alone
· Could index all n-grams up to
specified length
· Much faster than POS tagging
· Uses a lot of storage
· e.g., document containing 1,000
words would contain 3,990 instances
of word n-grams of length 2 ≤ n ≤ 5
· Most frequent trigram in English is
"all rights reserved"
· In Chinese, “limited liability
corporation”
Document Structure and
Markup
· Some parts of documents are
more important than others
· Document parser recognizes
structure using markup, such as
HTML tags
· Headers, anchor text, bolded text
all likely to be important
· Metadata can also be important
· Links used for link analysis
Link Analysis
· Links are a key component of the
Web
· Important for navigation, but also
for search
· e.g., <a href="http://example.com"
>Example website</a>
· “Example website” is the anchor
text
· “http://example.com” is the
destination link
· both are used by search engines
Anchor Text
· Used as a description of the
content of the destination page
· i.e., collection of anchor text in all
links pointing to a page used as an
additional text field
· Anchor text tends to be short,
descriptive, and similar to query
text
· Retrieval experiments have
shown that anchor text has
significant impact on
effectiveness for some types of
queries
· i.e., more than PageRank
PageRank
· Billions of web pages, some more
informative than others
· Links can be viewed as
information about the popularity
(authority?) of a web page
· can be used by ranking algorithm
· Inlink count could be used as
simple measure
· Link analysis algorithms like
PageRank provide more reliable
ratings
· less susceptible to link spam
Random Surfer Model
· Browse the Web using the following
algorithm:
· Choose a random number r between 0
and 1
· If r < λ:
· Go to a random page
· If r ≥ λ:
· Click a link at random on the current page
· Start again
· PageRank of a page is the
probability that the “random
surfer” will be looking at that page
· links from popular pages will increase
PageRank of pages they point to
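A small sketch of the random surfer model on a toy web graph; the graph and the value of λ are made up for illustration. The visit frequencies of the simulated surfer approximate the PageRank of each page.

```python
import random

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}  # toy link graph
pages = list(links)
lam = 0.15                       # probability of jumping to a random page
visits = {p: 0 for p in pages}

page = random.choice(pages)
for _ in range(100_000):
    visits[page] += 1
    if random.random() < lam or not links[page]:
        page = random.choice(pages)           # jump to a random page
    else:
        page = random.choice(links[page])     # click a random link on the current page

total = sum(visits.values())
for p in pages:
    print(p, round(visits[p] / total, 3))     # approximate PageRank of each page
```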
Link Quality
· Link quality is affected by spam
and other factors
· e.g., link farms to increase
PageRank
· trackback links in blogs can create
loops
· links from comments section of
popular blogs
· Blog services modify comment links to
contain rel=nofollow attribute
· e.g., “Come visit my <a
rel=nofollow
href="http://www.page.com">web
page</a>.”

Information Extraction
· Automatically extract structure
from text
· annotate document using tags to
identify extracted structure
· Named entity recognition
· identify words that refer to
something of interest in a particular
application
· e.g., people names, companies
names, locations, dates, product
names, prices, time, etc.
Named Entity Recognition
· Example showing semantic
annotation of text using XML tags
· Information extraction also
includes document structure and
more complex features such as
relationships and events
· Rule-based
· Uses lexicons (lists of words and
phrases) that categorize names
· e.g., locations, peoples’ names,
organizations, etc.
· Rules also used to verify or find
new entity names
· e.g., “<number> <word> street” for
addresses
· “<street address>, <city>” or “in
<city>” to verify city names
· “<street address>, <city>, <state>” to
find new cities
· “<title> <name>” to find new names
· Rules either developed manually
by trial and error or using
machine learning techniques
· Statistical
· uses a probabilistic model of the
words in and around an entity
· probabilities estimated using
training data (manually annotated
text)
· Hidden Markov Model (HMM) is
one approach
HMM for Extraction
· Markov Model describes a
process as a collection of states
with transitions between them
· each transition has a probability
associated with it
· next state depends only on current
state and transition probabilities
· Hidden Markov Model
· each state has a set of possible
outputs
· outputs have probabilities
HMM Sentence Model
· Each state is associated with a
probability distribution over words
(the output)
· Sequence of states
<start><name><not-an-
entity><location><not-an-entity><end>
Internationalization
· 2/3 of the Web is in English
· About 50% of Web users do not
use English as their primary
language
· Many (maybe most) search
applications have to deal with
multiple languages
· monolingual search: search in one
language, but with many possible
languages
· cross-language search: search in
multiple languages at the same
time
Internationalization
· Many aspects of search engines
are language-neutral
· Major differences:
· Text encoding (converting to
Unicode)
· Tokenizing (many languages have
no word separators)
· Stemming
· Cultural differences may also
impact interface design and
features provided
Indexes
· Indexes are data structures
designed to make search faster
· Text search has unique
requirements, which leads to
unique data structures
· Most common data structure is
inverted index
· general name for a class of
structures
· “inverted” because documents are
associated with words, rather than
words with documents

· Indexes are designed to support search
· faster response time, supports updates
· Text search engines use a particular
form of search: ranking
· documents are retrieved in sorted
order according to a score computed
using the document representation,
the query, and a ranking algorithm
· What is a reasonable abstract
model for ranking?
· enables discussion of indexes without
details of retrieval model
Inverted Index
· Each index term is associated
with an inverted list
· Contains lists of documents, or lists
of word occurrences in documents,
and other information
· Each entry is called a posting
· The part of the posting that refers
to a specific document or location is
called a pointer
· Each document in the collection is
given a unique number
· Lists are usually document-ordered
(sorted by document number)

Simple Inverted Index: for each word, just record the numbers of the sentences (documents) that contain it.

Inverted Index with counts: each posting records the sentence number plus how many times the word occurs in it;
supports better ranking algorithms.

Inverted Index with positions: each posting records the sentence number plus the positions of the word within the sentence;
supports proximity matches. A small sketch of all three follows.
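A minimal sketch (structure and names are illustrative) of an inverted index with counts and positions: each term maps to a list of postings of the form (document number, count, positions). Dropping the positions gives the "with counts" index; keeping only the document numbers gives the simple index.

```python
from collections import defaultdict

docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in water",
}

index = defaultdict(list)  # term -> list of (doc number, count, positions)
for doc_id, text in docs.items():
    positions = defaultdict(list)
    for pos, word in enumerate(text.split(), start=1):
        positions[word].append(pos)
    for word, pos_list in positions.items():
        index[word].append((doc_id, len(pos_list), pos_list))

print(index["fish"])      # [(1, 2, [2, 4]), (2, 1, [1])]
print(index["tropical"])  # [(1, 2, [1, 7])]
```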

Fields and Extents


· Document structure is useful in
search
· field restrictions
· e.g., date, from:, etc.
· some fields more important
· e.g., title
· Options:
· separate inverted lists for each field
type
· add information about fields to
postings
Compression
· Inverted lists are very large
· e.g., 25-50% of collection for TREC
collections using Indri search engine
· Much higher if n-grams are indexed
· Compression of indexes saves disk
and/or memory space
· Typically have to decompress lists to
use them
· Best compression techniques have
good compression ratios and are easy
to decompress
· Lossless compression – no
information lost, store data in less
space
Delta Encoding
· Word count data is good
candidate for compression
· many small numbers and few larger
numbers
· encode small numbers with small
codes
· Document numbers are less
predictable
· but differences between numbers
in an ordered list are smaller and
more predictable
· Delta encoding:
· encoding differences between
document numbers (d-gaps)
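A minimal sketch of delta encoding for an ordered list of document numbers: store the differences between consecutive numbers (the d-gaps), which are small and therefore compress well.

```python
def delta_encode(doc_numbers):
    gaps, previous = [], 0
    for d in doc_numbers:
        gaps.append(d - previous)   # store the difference to the previous document number
        previous = d
    return gaps

def delta_decode(gaps):
    docs, total = [], 0
    for g in gaps:
        total += g                  # rebuild the original numbers by summing the gaps
        docs.append(total)
    return docs

postings = [1, 5, 9, 18, 23, 24, 30]
print(delta_encode(postings))                 # [1, 4, 4, 9, 5, 1, 6]
print(delta_decode(delta_encode(postings)))   # round-trips back to the original list
```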
