
COURSE TITLE:

DISTRIBUTED DATA SYSTEMS


DR. SASIREKHA GVK

Lecture 1

SCOPE & OBJECTIVE:


The objective of the course is to expose the student to an engineering
approach to Information Retrieval in the context of the Web.
The course will take an integrated approach to conceptualizing, designing,
and implementing IR systems.
The scope of the course includes concepts in:

Networked/distributed file systems
Data structures and algorithms for IR
Implementation issues in scalable distributed systems for collecting,
storing, and retrieving large data sets

Text Book:
T1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Pearson Education.

Reference Books:
R1. Andrew Tanenbaum and Maarten van Steen. Distributed Systems. Pearson Education.
R2. Stefano Ceri and Giuseppe Pelagatti. Distributed Databases. McGraw Hill International Edition.
R3. Marshall McKusick et al. The Design and Implementation of the 4.4 BSD Operating System. Addison Wesley.
R4. Robert R. Korfhage. Information Storage and Retrieval. John Wiley
(Indian Edition).
R5. Gerald J. Kowalski and Mark T. Maybury. Information Storage and
Retrieval Systems
R6. Soumen Chakrabarti. Mining the Web. Elsevier India.
R7. Hoffman and Beaumont. Content Networking Architecture,
Protocols and Practice. Elsevier India.
R8. William Frakes and Ricardo Baeza-Yates. Information Retrieval
Data Structures and Algorithms. Prentice Hall.

Lect. | Topics                                                        | Readings
1     | Course Introduction: The Science, Engineering, and Business   | Class Notes
      | of Information Retrieval                                      |
2-3   | Introduction to I/O and File Systems; Unix File Systems       | R3
4-5   | Networked File Systems: NFS; Design and Implementation Issues | R3 Ch. 6, Ch. 7; R1 Sec. 10.1
6-7   | Distributed File Systems: Coda; Design and Implementation     | R1 Secs. 10.2 and 10.3
      | Issues                                                        |
8     | The WWW as a Distributed Data/Document System                 | R1 Sec. 11.1
9     | HTTP-based Systems and Web Servers                            | R6 Ch. 2
10-11 | Markup Languages; XML for Data and Documents; XML Data        | T1 Sec. 6.4
      | Exchange; XML Processing                                      |
12-14 | Crawling and Scraping: Tools and Techniques                   | R6 Ch. 2
15    | Text Retrieval                                                | T1 Secs. 2.5 and 2.9
16-17 | Querying: Keywords, Pattern Matching; Query Operations        | T1 Ch. 4
18-19 | Query Languages: Querying XML Documents                       | AR
20    | Text Operations and Document Operations                       | T1 Ch. 7
21-22 | File Structures for Indexing Text                             | T1 Ch. 8; R8 Ch. 2
23    | Multimedia File Systems                                       | Class Notes
24-25 | Algorithms for Term and Query Operations                      | R8 Ch. 3
26-27 | Distributed Querying: Fragmentation and Localization          | R2 Ch. 5
28-29 | Cluster File Systems: Design and Implementation               | AR
30-32 | Document Ranking Models and Algorithms                        | R8; AR
33-34 | Document Clustering Algorithms                                | R8; AR
35    | Search Engines: Characteristics and Usage                     | T1 Ch. 13
36-37 | Search Engines: Architectures                                 | T1 Ch. 13; AR
38    | Search Engines: Implementation Issues                         | AR
39-40 | Multimedia Information Retrieval                              | T1 Ch. 12
41    | Course Summary                                                |

INFORMATION RETRIEVAL
What is information? examples
How is information stored?

Text (Documents)
XML and structured documents
Images
Audio (sound effects, songs, etc.)
Video
Source code
Applications/Web services

Jimmy Lin's tutorial


Maryam Karimzadehgan

mkarimz2@uiuc.edu

INFORMATION RETRIEVAL VS. DATABASE


SYSTEMS

Information retrieval (IR) systems use a simpler data model than database systems:

Information organized as a collection of documents
Documents are unstructured; no schema

Information retrieval locates relevant documents on the basis of user input
such as keywords or example documents

e.g., find documents containing the words "database systems"

Can be used even on textual descriptions provided with non-textual data such as images
Web search engines are the most familiar example of IR systems

DIFFERENCES FROM DATABASE SYSTEMS

IR systems don't deal with transactional updates (including concurrency
control and recovery)

Database systems deal with structured data, with schemas that define the
data organization

IR systems deal with some querying issues not generally addressed by
database systems:
Approximate searching by keywords
Ranking of retrieved answers by estimated degree of relevance

NOMENCLATURE

Information retrieval (IR)
Focus on textual information (= text/document retrieval)
Other possibilities include image, video, music, ...

What do we search? Generically, "collections"
What do we find? Generically, "documents"
Even though we may be referring to web pages, PDFs, PowerPoint slides,
paragraphs, etc.

SEARCH PROCESS

(Figure, from Jimmy Lin's tutorial: Source Selection yields a Resource;
Query Formulation yields a Query; Search against an Index, built offline by
Indexing the Document Collection, yields a Ranked List; Selection yields
Documents and, finally, Results.)
THE CENTRAL PROBLEM IN SEARCH

(Figure: an Author and a Searcher each start from Concepts. The Searcher's
concepts become Query Terms, e.g. "tragic love story"; the Author's concepts
become Document Terms, e.g. "fateful star-crossed romance".)

Do these represent the same concepts?

Why is IR hard? Because language is hard!

ABSTRACT IR ARCHITECTURE

(Figure: a Query is processed online by a Representation Function into a
Query Representation; Documents are processed offline by a Representation
Function into Document Representations, stored in an Index. A Comparison
Function matches the two and produces Hits.)

EXAMPLE

Google

Web


HOW DO WE REPRESENT TEXT?

Remember: computers don't understand anything!

Bag of words:
Treat all the words in a document as index terms
Assign a weight to each term based on importance (or, in the simplest case,
presence/absence of the word)
Disregard order, structure, meaning, etc. of the words
Simple, yet effective!

Assumptions:
Term occurrence is independent
Document relevance is independent
Words are well-defined

WHAT'S A WORD?

(This slide showed sample text in several non-Latin scripts, lost in
extraction; the point is that what counts as a "word", and how text is
segmented into words, varies greatly across languages.)

SAMPLE DOCUMENT
McDonald's slims down spuds
Fast-food chain to reduce certain types of
fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is
cutting the amount of "bad" fat in its french fries
nearly in half, the fast-food chain said Tuesday as
it moves to make all its fried menu items
healthier.
But does that mean the popular shoestring fries
won't taste the same? The company says no. "It's
a win-win for our customers because they are
getting the same great french-fry taste along with
an even healthier nutrition profile," said Mike
Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not
specifically discuss the kind of oil it plans to use,
but at least one nutrition expert says playing with
the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's
(MCD: down $0.54 to $23.22, Research,
Estimates) were lower Tuesday afternoon. It was
unclear Tuesday whether competitors Burger
King and Wendy's International (WEN: down
$0.80 to $34.91, Research, Estimates) would
follow suit. Neither company could immediately
be reached for comment.

Bag of Words
14 McDonalds
12 fat
11 fries
8 new
7 french
6 company, said, nutrition
5 food, oil, percent, reduce,
taste, Tuesday

INFORMATION RETRIEVAL MODELS

An IR model governs how a document and a query are represented and how the
relevance of a document to a user query is defined.

Main (similarity) models:
Boolean model
Vector space model
Statistical language model

(The abstract IR architecture figure is repeated here: query and document
representation functions feeding a comparison function over an index.)

AN EXAMPLE OF BOOLEAN

A document space is defined by three terms: hardware, software, users

A set of documents is defined as:
A1=(1, 0, 0)   A2=(0, 1, 0)   A3=(0, 0, 1)
A4=(1, 1, 0)   A5=(1, 0, 1)   A6=(0, 1, 1)
A7=(1, 1, 1)   A8=(1, 0, 1)   A9=(0, 1, 1)

If the query is "hardware and software", what documents should be retrieved?

AN EXAMPLE (CONT.)

In Boolean query matching:
AND: documents A4 and A7 will be retrieved
OR: documents A1, A2, A4, A5, A6, A7, A8, A9 will be retrieved
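The example above can be sketched in a few lines; a minimal, illustrative Boolean model over the slide's nine incidence vectors (the names docs, matches, and retrieve are assumptions, not from the text):

```python
# The slide's nine documents as 0/1 incidence vectors over
# the terms (hardware, software, users).
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
TERMS = ("hardware", "software", "users")

def matches(vector, query_terms, mode="and"):
    """True if the document's incidence vector satisfies the query."""
    bits = [vector[TERMS.index(t)] for t in query_terms]
    return all(bits) if mode == "and" else any(bits)

def retrieve(query_terms, mode="and"):
    return sorted(d for d, v in docs.items() if matches(v, query_terms, mode))

print(retrieve(["hardware", "software"], "and"))  # ['A4', 'A7']
print(retrieve(["hardware", "software"], "or"))
```

The OR query returns every document except A3, matching the slide's answer.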

TERM WEIGHTING

Term weights consist of two components:
Local: how important is the term in this document?
Global: how important is the term in the collection?

Here's the intuition:
Terms that appear often in a document should get high weights
Terms that appear in many documents should get low weights

How do we capture this mathematically?
Term frequency (local)
Inverse document frequency (global)
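The intuition above can be captured as a one-line formula. A minimal sketch, assuming raw term frequency and the common textbook idf = log(N / df) (tf.idf has many variants; this is just one):

```python
import math

def tf_idf(tf, df, n_docs):
    """Raw term frequency times inverse document frequency."""
    return tf * math.log(n_docs / df)

# A term appearing 5 times in a doc but in only 1 of 100 docs scores high;
# a term appearing 5 times but present in all 100 docs scores zero.
rare = tf_idf(5, 1, 100)
common = tf_idf(5, 100, 100)
print(rare, common)
```

Frequent-in-document pushes the weight up (local); frequent-in-collection pushes it down (global), exactly the two components above.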

LANGUAGE MODELS FOR RETRIEVAL

(Figure, from ChengXiang Zhai's slides: each document induces a language
model, i.e., a probability for each word. A "food nutrition" paper gives
probabilities to words like food, nutrition, healthy, diet; a "text mining"
paper gives probabilities to words like text, mining, association,
clustering, and only a small probability to food.)

Query = "data mining algorithms"

Which model would most likely have generated this query?
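The question on this slide can be made concrete with a toy sketch of unigram query likelihood; the two document language models and all probabilities below are invented for illustration (they are not Zhai's actual numbers):

```python
# Each document induces a unigram language model: word -> probability.
food_lm = {"food": 0.25, "nutrition": 0.2, "healthy": 0.1, "diet": 0.1}
text_lm = {"text": 0.2, "mining": 0.15, "association": 0.1,
           "clustering": 0.05, "algorithms": 0.05}

def query_likelihood(lm, query, epsilon=1e-6):
    """P(query | document LM); a tiny floor stands in for real smoothing."""
    p = 1.0
    for w in query.split():
        p *= lm.get(w, epsilon)
    return p

q = "data mining algorithms"
print(query_likelihood(text_lm, q) > query_likelihood(food_lm, q))  # True
```

The text-mining model assigns the query a far higher probability, so it "most likely generated" it; that is the query-likelihood ranking criterion.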

ABSTRACT IR ARCHITECTURE

(The architecture figure is shown again: online query representation and
offline document representation feed a comparison function over the index,
producing hits.)

EXAMPLE: SHAKESPEARE'S COLLECTED WORKS

Grep-style pattern matching works for less than a million words
Web search using grep? Millions or trillions of words
Indexing avoids scanning

CONSTRUCTING INDEX (WORD COUNTING)

Documents → (case folding, tokenization, stopword removal, stemming) → Bag of Words → Index

STOPWORDS REMOVAL

Many of the most frequently used words in English are useless in IR and text
mining; these words are called stop words.
the, of, and, to, ...
Typically about 400 to 500 such words
For an application, an additional domain-specific stopword list may be
constructed

Why do we need to remove stopwords?
Reduce indexing (or data) file size
Stopwords account for 20-30% of total word counts
Improve efficiency and effectiveness
Stopwords are not useful for searching or text mining
They may also confuse the retrieval system

STEMMING

Techniques used to find the root/stem of a word. E.g.:
user, users, used, using → stem: use
engineering, engineered, engineer → stem: engineer

Usefulness of stemming:
Improving effectiveness of IR and text mining
Matching similar words
Mainly improves recall
Reducing indexing size
Combining words with the same roots may reduce indexing size by as much as 40-50%
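The preprocessing steps named on the last few slides can be strung together in a short sketch. The stopword list is a tiny sample and the suffix-stripping "stemmer" is deliberately crude (real systems use Porter stemming and much larger lists); all names here are illustrative:

```python
import re

STOPWORDS = {"the", "of", "and", "to", "a", "in", "is"}  # tiny sample list

def crude_stem(word):
    """Naive suffix stripping; a stand-in for a real stemmer like Porter's."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # case folding + tokenization, then stopword removal, then stemming
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The users used the engineered systems"))
# ['user', 'used', 'engineer', 'system']
```

Note the crude stemmer leaves "used" alone while a real stemmer would conflate it with "use"; this is exactly the kind of quality gap stemming algorithms exist to close.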

CONSTRUCTING INDEX (WORD COUNTING)

Documents → (case folding, tokenization, stopword removal, stemming) → Bag of Words → Index

INCIDENCE MATRIX

Term-document incidence (1 = the play contains the word):

            Anthony &  Julius   Tempest  Hamlet  Othello  Macbeth
            Cleopatra  Caesar
Anthony         1         1        0        0       0        1
Brutus          1         1        0        1       0        0
Caesar          1         1        0        1       1        1
Calpurnia       0         1        0        0       0        0
Cleopatra       1         0        0        0       0        0
mercy           1         0        1        1       1        1
worser          1         0        1        1       1        0

INVERTED INDEX

The inverted index of a document collection is basically a data structure
that associates each distinct term with a list of all documents that contain
the term.

AN EXAMPLE

(Figure: a dictionary of terms, each pointing to its postings list.)

SEARCH USING INVERTED INDEX

Given a query q, search has the following steps:
Step 1 (vocabulary search): find each term/word of q in the inverted index
Step 2 (results merging): merge results to find documents that contain all
or some of the words/terms in q
Step 3 (rank score computation): rank the resulting documents/pages using
content-based ranking
link-based ranking
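The first two steps can be sketched in a few lines; a minimal, illustrative build-and-search (intersection for an all-terms query; ranking, step 3, is omitted; the function names are assumptions):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Step 1: look up each query term; step 2: intersect the postings."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "one fish two fish", 2: "red fish blue fish",
        3: "cat in the hat", 4: "green eggs and ham"}
index = build_index(docs)
print(sorted(search_all(index, "fish blue")))  # [2]
```

Swapping the intersection for a union would give "all or some of the terms" semantics as the slide allows.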

INVERTED INDEX: BOOLEAN RETRIEVAL

Doc 1: one fish, two fish        Doc 2: red fish, blue fish
Doc 3: cat in the hat            Doc 4: green eggs and ham

Term → postings (docIDs):
blue → 2        green → 4       one → 1
cat  → 3        ham   → 4       red → 2
egg  → 4        hat   → 3       two → 1
fish → 1, 2
INVERTED INDEX EXAMPLE

Doc 1: "This is a sample document with one sample sentence"
Doc 2: "This is another sample document"

Dictionary (term, # docs, total freq) with postings (doc id, freq):
This     2  2  → (1, 1), (2, 1)
is       2  2  → (1, 1), (2, 1)
sample   2  3  → (1, 2), (2, 1)
another  1  1  → (2, 1)

Slide is from ChengXiang Zhai

BOOLEAN RETRIEVAL

To execute a Boolean query, e.g. ( blue AND fish ) OR ham:

Build the query syntax tree:
        OR
       /  \
     ham   AND
          /   \
       blue   fish

For each clause, look up the postings: blue, fish, ham

Traverse the postings and apply the Boolean operators

Efficiency analysis:
Postings traversal is linear (assuming sorted postings)
Start with the shortest postings first

QUERY PROCESSING: AND (Sec. 1.3)

Consider processing the query: Brutus AND Caesar
Locate Brutus in the dictionary; retrieve its postings
Locate Caesar in the dictionary; retrieve its postings
Merge the two postings:

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34

THE MERGE

Walk through the two postings simultaneously, in time linear in the total
number of postings entries:

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Result → 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.

INTERSECTING TWO POSTINGS LISTS (A MERGE ALGORITHM)

(The algorithm's pseudocode figure was lost in extraction.)
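The merge algorithm can be sketched as a two-pointer walk over docID-sorted lists, running in O(x+y) time as stated above (the function name is an assumption):

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists with two pointers."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer on the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each comparison advances at least one pointer, so the loop runs at most x+y times; this is why sorted postings are crucial.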

INVERTED INDEX: TF.IDF

Doc 1: one fish, two fish        Doc 2: red fish, blue fish
Doc 3: cat in the hat            Doc 4: green eggs and ham

Term → df → postings (docID, tf):
blue  1  → (2, 1)      green  1  → (4, 1)     one  1  → (1, 1)
cat   1  → (3, 1)      ham    1  → (4, 1)     red  1  → (2, 1)
egg   1  → (4, 1)      hat    1  → (3, 1)     two  1  → (1, 1)
fish  2  → (1, 2), (2, 2)

POSITIONAL INDEXES
Store term position in postings
Supports richer queries (e.g., proximity)
Naturally, leads to larger indexes

INVERTED INDEX: POSITIONAL INFORMATION

Doc 1: one fish, two fish        Doc 2: red fish, blue fish
Doc 3: cat in the hat            Doc 4: green eggs and ham

Term → df → postings (docID, tf, [positions]):
blue  1  → (2, 1, [3])       green  1  → (4, 1, [1])     one  1  → (1, 1, [1])
cat   1  → (3, 1, [1])       ham    1  → (4, 1, [3])     red  1  → (2, 1, [1])
egg   1  → (4, 1, [2])       hat    1  → (3, 1, [2])     two  1  → (1, 1, [3])
fish  2  → (1, 2, [2,4]), (2, 2, [2,4])
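Positional postings are what make phrase queries possible. A minimal sketch over a hand-built fragment of a positional index (the positional dict and phrase_docs helper are illustrative names, not from the slides):

```python
# term -> {docID: [positions]}; a small hand-built positional index fragment
positional = {
    "red":  {2: [1]},
    "fish": {1: [2, 4], 2: [2, 4]},
    "one":  {1: [1]},
    "blue": {2: [3]},
}

def phrase_docs(index, first, second):
    """DocIDs where `first` is immediately followed by `second`."""
    hits = []
    for doc, pos1 in index.get(first, {}).items():
        pos2 = set(index.get(second, {}).get(doc, []))
        # a phrase match needs some position of `second` one past `first`
        if any(p + 1 in pos2 for p in pos1):
            hits.append(doc)
    return sorted(hits)

print(phrase_docs(positional, "red", "fish"))  # [2]
print(phrase_docs(positional, "one", "fish"))  # [1]
```

Relaxing the `p + 1` check to a window (e.g. `abs` distance below k) turns the same structure into a proximity query.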

RETRIEVAL IN A NUTSHELL
Look up postings lists corresponding to query terms
Traverse postings for each query term
Store partial query-document scores in
accumulators
Select top k results to return

RANKING

Ranking of documents on the basis of estimated relevance to a query is
critical

Relevance ranking is based on factors such as:
Term frequency
Frequency of occurrence of the query keyword in the document
Inverse document frequency
How many documents the query keyword occurs in
Fewer documents means more importance for the keyword
Hyperlinks to documents
More links to a document means the document is more important

RELEVANCE RANKING USING TERMS

Most systems add to the above model:
Words that occur in the title, author list, section headings, etc. are given
greater importance
Words whose first occurrence is late in the document are given lower
importance
Very common words such as "a", "an", "the", "it", etc. are eliminated
Called stop words

Proximity: if keywords in the query occur close together in the document,
the document has higher importance than if they occur far apart

Documents are returned in decreasing order of relevance score
Usually only the top few documents are returned, not all

RELEVANCE USING HYPERLINKS

The number of documents relevant to a query can be enormous if only term
frequencies are taken into account

Using term frequencies makes spamming easy
E.g., a travel agency can add many occurrences of the word "travel" to its
page to make its rank very high

Most of the time people are looking for pages from popular sites

Idea: use popularity of a Web site (e.g., how many people visit it) to rank
site pages that match given keywords
Problem: hard to find actual popularity of a site
Solution: next slide

Solution: use the number of hyperlinks to a site as a measure of the
popularity or prestige of the site
Count only one hyperlink from each site
The popularity measure is for the site, not for individual pages

But most hyperlinks are to the root of a site
Also, the concept of a "site" is difficult to define, since a URL prefix
like cs.yale.edu contains many unrelated pages of varying popularity

Refinements:
When computing prestige based on links to a site, give more weight to links
from sites that themselves have higher prestige
The definition is circular
Set up and solve a system of simultaneous linear equations
The above idea is the basis of the Google PageRank ranking mechanism

Connections to social networking theories that ranked the prestige of
people:
E.g., the president of the U.S.A. has high prestige since many people know
him
Someone known by multiple prestigious people has high prestige
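The circular prestige definition can also be solved iteratively rather than by explicit linear equations. A minimal power-iteration sketch of the PageRank idea; the three-page link graph, damping value, and function name are invented for illustration:

```python
def pagerank(links, damping=0.85, iters=50):
    """Iterate rank = (1-d)/n + d * (incoming rank shares) to a fixed point."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # each page q splits its rank evenly among its outlinks
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

# c is linked to by both a and b, so it ends up with the highest prestige.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
print(max(r, key=r.get))  # 'c'
```

The fixed point this converges to is the same vector one would get by solving the simultaneous linear equations mentioned above.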

Hub and authority based ranking:
A hub is a page that stores links to many pages (on a topic)
An authority is a page that contains actual information on a topic
Each page gets a hub prestige based on the prestige of the authorities it
points to
Each page gets an authority prestige based on the prestige of the hubs that
point to it
Again, the prestige definitions are cyclic, and can be obtained by solving
linear equations
Use authority prestige when ranking answers to a query

SYNONYMS AND HOMONYMS

Synonyms
E.g., document: "motorcycle repair", query: "motorcycle maintenance"
The system needs to realize that "maintenance" and "repair" are synonyms
The system can extend the query as "motorcycle and (repair or maintenance)"

Homonyms
E.g., "object" has different meanings as noun and verb
Can disambiguate meanings (to some extent) from the context

Extending queries automatically using synonyms can be problematic:
Need to understand the intended meaning in order to infer synonyms
Or verify synonyms with the user
Synonyms may have other meanings as well

CONCEPT-BASED QUERYING

Approach:
For each word, determine the concept it represents from context
Use one or more ontologies:
Hierarchical structures showing relationships between concepts
E.g., the IS-A relationship that we saw in the E-R model

This approach can be used to standardize terminology in a specific field
Ontologies can link multiple languages
Foundation of the Semantic Web (not covered here)

How to evaluate retrieval results?

WHAT TO EVALUATE?

Coverage of information
Form of presentation
Effort required / ease of use
Time and space efficiency

Metrics:
Recall: proportion of relevant material actually retrieved
Precision: proportion of retrieved material actually relevant

PRECISION VS. RECALL

(Venn diagram: within the collection of all docs, the Retrieved set and the
Relevant set overlap in the relevant-retrieved documents.)

Recall = |Relevant ∩ Retrieved| / |Relevant in Collection|

Precision = |Relevant ∩ Retrieved| / |Retrieved|
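The two formulas above translate directly into code; a minimal sketch with an invented judgment set (the document IDs are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a relevant set."""
    rel_retrieved = [d for d in retrieved if d in relevant]
    precision = len(rel_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(rel_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 4 retrieved docs are relevant (precision 0.5);
# 2 of the 3 relevant docs were retrieved (recall 2/3).
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant={"d1", "d3", "d9"})
print(p, r)
```

The two metrics pull in opposite directions: retrieving everything maximizes recall but ruins precision, and vice versa.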

REVIEW QUESTIONS

What is the IR architecture, and what methods are used in it?
How are retrieval results evaluated?
Explain the inverted indexing method with an example.
Explain Boolean retrieval with an example.
Define hub and authority.
Define TF and TF.IDF using an example.

NEXT LECTURE HIGHLIGHTS

What are I/O devices and their types?
How do you access I/O devices?
What are device drivers?
File descriptors
