
COURSE TITLE:

DISTRIBUTED DATA SYSTEMS


DR. SASIREKHA GVK

Lecture 1

SCOPE & OBJECTIVE:


The objective of the course is to expose the student to an engineering
approach to Information Retrieval in the context of the Web.
The course will take an integrated approach to conceptualizing, designing,
and implementing IR systems.
The scope of the course includes concepts in:

Networked/distributed file systems
Data structures and algorithms for IR
Implementation issues in scalable distributed systems for collecting,
storing, and retrieving large data sets

Text Book:
T1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Pearson Education.

Reference Books:
R1. Andrew Tanenbaum and Maarten van Steen. Distributed Systems. Pearson Education.
R2. Stefano Ceri and Giuseppe Pelagatti. Distributed Databases. McGraw Hill International Edition.
R3. Marshall McKusick et al. The Design and Implementation of the 4.4 BSD Operating System. Addison Wesley.
R4. Robert R. Korfhage. Information Storage and Retrieval. John Wiley
(Indian Edition).
R5. Gerald J. Kowalski and Mark T. Maybury. Information Storage and
Retrieval Systems
R6. Soumen Chakrabarti. Mining the Web. Elsevier India.
R7. Hoffman and Beaumont. Content Networking Architecture,
Protocols and Practice. Elsevier India.
R8. William Frakes and Ricardo Baeza-Yates. Information Retrieval
Data Structures and Algorithms. Prentice Hall.

Lect. | Topics                                                        | Readings
1     | Course Introduction: The Science, Engineering, and Business   | Class Notes
      | of Information Retrieval                                      |
2-3   | Introduction to I/O and File Systems; Unix File Systems       | R3
4-5   | Networked File Systems: NFS; Design and Implementation Issues | R3 Ch. 6, Ch. 7; R1 Sec. 10.1
6-7   | Distributed File Systems: Coda; Design and Implementation     | R1 Secs. 10.2 and 10.3
      | Issues                                                        |
8     | The WWW as a Distributed Data/Document System                 | R1 Sec. 11.1
9     | HTTP-based Systems and Web Servers                            | R6 Ch. 2
10-11 | Markup Languages; XML for Data and Documents; XML Data        | T1 Sec. 6.4
      | Exchange; XML Processing                                      |
12-14 | Crawling and Scraping: Tools and Techniques                   | R6 Ch. 2
15    | Text Retrieval                                                | T1 Secs. 2.5 and 2.9
16-17 | Querying: Keywords, Pattern Matching; Query Operations        | T1 Ch. 4
18-19 | Query Languages: Querying XML Documents                       | AR
20    | Text Operations and Document Operations                       | T1 Ch. 7
21-22 | File Structures for Indexing Text                             | T1 Ch. 8; R8 Ch. 2
23    | Multimedia File Systems                                       | Class Notes
24-25 | Algorithms for Term and Query Operations                      | R8 Ch. 3
26-27 | Distributed Querying: Fragmentation and Localization          | R2 Ch. 5
28-29 | Cluster File Systems: Design and Implementation               | AR
30-32 | Document Ranking Models and Algorithms                        | R8; AR
33-34 | Document Clustering Algorithms                                | R8; AR
35    | Search Engines: Characteristics and Usage                     | T1 Ch. 13
36-37 | Search Engines: Architectures                                 | T1 Ch. 13; AR
38    | Search Engines: Implementation Issues                         | AR
39-40 | Multimedia Information Retrieval                              | T1 Ch. 12
41    | Course Summary                                                |

INFORMATION RETRIEVAL
What is information? examples
How is information stored?

Text (Documents)
XML and structured documents
Images
Audio (sound effects, songs, etc.)
Video
Source code
Applications/Web services

Jimmy Lin's tutorial


Maryam Karimzadehgan

mkarimz2@uiuc.edu

INFORMATION RETRIEVAL VS. DATABASE


SYSTEMS

Information retrieval (IR) systems use a simpler data model than database systems:

Information organized as a collection of documents
Documents are unstructured; no schema

Information retrieval locates relevant documents on the basis of user input
such as keywords or example documents

e.g., find documents containing the words "database systems"

Can be used even on textual descriptions provided with non-textual data such as images
Web search engines are the most familiar example of IR systems

DIFFERENCES FROM DATABASE SYSTEMS

IR systems don't deal with transactional updates (including concurrency
control and recovery)

Database systems deal with structured data, with schemas that define the
data organization

IR systems deal with some querying issues not generally addressed by
database systems:
Approximate searching by keywords
Ranking of retrieved answers by estimated degree of relevance

NOMENCLATURE

Information retrieval (IR)
Focus on textual information (= text/document retrieval)
Other possibilities include image, video, music, ...

What do we search? Generically, "collections"
What do we find? Generically, "documents"
Even though we may be referring to web pages, PDFs, PowerPoint slides,
paragraphs, etc.

SEARCH PROCESS

(Figure, from Jimmy Lin's tutorial: Source Selection yields a Resource;
Query Formulation yields a Query; Search against an Index, built offline by
Indexing the Document Collection, yields a Ranked List; Selection yields
Documents and, finally, Results.)
THE CENTRAL PROBLEM IN SEARCH

(Figure: an Author and a Searcher each start from Concepts. The Searcher's
concepts become Query Terms, e.g. "tragic love story"; the Author's concepts
become Document Terms, e.g. "fateful star-crossed romance".)

Do these represent the same concepts?

Why is IR hard? Because language is hard!

ABSTRACT IR ARCHITECTURE

(Figure: a Query is processed online by a Representation Function into a
Query Representation; Documents are processed offline by a Representation
Function into Document Representations, stored in an Index. A Comparison
Function matches the two and produces Hits.)

EXAMPLE

Google

Web


HOW DO WE REPRESENT TEXT?

Remember: computers don't understand anything!

Bag of words:
Treat all the words in a document as index terms
Assign a weight to each term based on importance (or, in the simplest case,
presence/absence of the word)
Disregard order, structure, meaning, etc. of the words
Simple, yet effective!

Assumptions:
Term occurrence is independent
Document relevance is independent
Words are well-defined

WHAT'S A WORD?

(This slide showed sample text in several non-Latin scripts, lost in
extraction; the point is that what counts as a "word", and how text is
segmented into words, varies greatly across languages.)

SAMPLE DOCUMENT
McDonald's slims down spuds
Fast-food chain to reduce certain types of
fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is
cutting the amount of "bad" fat in its french fries
nearly in half, the fast-food chain said Tuesday as
it moves to make all its fried menu items
healthier.
But does that mean the popular shoestring fries
won't taste the same? The company says no. "It's
a win-win for our customers because they are
getting the same great french-fry taste along with
an even healthier nutrition profile," said Mike
Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not
specifically discuss the kind of oil it plans to use,
but at least one nutrition expert says playing with
the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's
(MCD: down $0.54 to $23.22, Research,
Estimates) were lower Tuesday afternoon. It was
unclear Tuesday whether competitors Burger
King and Wendy's International (WEN: down
$0.80 to $34.91, Research, Estimates) would
follow suit. Neither company could immediately
be reached for comment.

Bag of Words
14 McDonalds
12 fat
11 fries
8 new
7 french
6 company, said, nutrition
5 food, oil, percent, reduce,
taste, Tuesday

INFORMATION RETRIEVAL MODELS

An IR model governs how a document and a query are represented and how the
relevance of a document to a user query is defined.

Main (similarity) models:
Boolean model
Vector space model
Statistical language model

(The abstract IR architecture figure is repeated here: query and document
representation functions feeding a comparison function over an index.)

AN EXAMPLE OF BOOLEAN

A document space is defined by three terms: hardware, software, users

A set of documents is defined as:
A1=(1, 0, 0)   A2=(0, 1, 0)   A3=(0, 0, 1)
A4=(1, 1, 0)   A5=(1, 0, 1)   A6=(0, 1, 1)
A7=(1, 1, 1)   A8=(1, 0, 1)   A9=(0, 1, 1)

If the query is "hardware and software", what documents should be retrieved?

AN EXAMPLE (CONT.)

In Boolean query matching:
AND: documents A4 and A7 will be retrieved
OR: documents A1, A2, A4, A5, A6, A7, A8, A9 will be retrieved
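The example above can be sketched in a few lines; a minimal, illustrative Boolean model over the slide's nine incidence vectors (the names docs, matches, and retrieve are assumptions, not from the text):

```python
# The slide's nine documents as 0/1 incidence vectors over
# the terms (hardware, software, users).
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
TERMS = ("hardware", "software", "users")

def matches(vector, query_terms, mode="and"):
    """True if the document's incidence vector satisfies the query."""
    bits = [vector[TERMS.index(t)] for t in query_terms]
    return all(bits) if mode == "and" else any(bits)

def retrieve(query_terms, mode="and"):
    return sorted(d for d, v in docs.items() if matches(v, query_terms, mode))

print(retrieve(["hardware", "software"], "and"))  # ['A4', 'A7']
print(retrieve(["hardware", "software"], "or"))
```

The OR query returns every document except A3, matching the slide's answer.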

TERM WEIGHTING

Term weights consist of two components:
Local: how important is the term in this document?
Global: how important is the term in the collection?

Here's the intuition:
Terms that appear often in a document should get high weights
Terms that appear in many documents should get low weights

How do we capture this mathematically?
Term frequency (local)
Inverse document frequency (global)
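The intuition above can be captured as a one-line formula. A minimal sketch, assuming raw term frequency and the common textbook idf = log(N / df) (tf.idf has many variants; this is just one):

```python
import math

def tf_idf(tf, df, n_docs):
    """Raw term frequency times inverse document frequency."""
    return tf * math.log(n_docs / df)

# A term appearing 5 times in a doc but in only 1 of 100 docs scores high;
# a term appearing 5 times but present in all 100 docs scores zero.
rare = tf_idf(5, 1, 100)
common = tf_idf(5, 100, 100)
print(rare, common)
```

Frequent-in-document pushes the weight up (local); frequent-in-collection pushes it down (global), exactly the two components above.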

LANGUAGE MODELS FOR RETRIEVAL

(Figure, from ChengXiang Zhai's slides: each document induces a language
model, i.e., a probability for each word. A "food nutrition" paper gives
probabilities to words like food, nutrition, healthy, diet; a "text mining"
paper gives probabilities to words like text, mining, association,
clustering, and only a small probability to food.)

Query = "data mining algorithms"

Which model would most likely have generated this query?
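The question on this slide can be made concrete with a toy sketch of unigram query likelihood; the two document language models and all probabilities below are invented for illustration (they are not Zhai's actual numbers):

```python
# Each document induces a unigram language model: word -> probability.
food_lm = {"food": 0.25, "nutrition": 0.2, "healthy": 0.1, "diet": 0.1}
text_lm = {"text": 0.2, "mining": 0.15, "association": 0.1,
           "clustering": 0.05, "algorithms": 0.05}

def query_likelihood(lm, query, epsilon=1e-6):
    """P(query | document LM); a tiny floor stands in for real smoothing."""
    p = 1.0
    for w in query.split():
        p *= lm.get(w, epsilon)
    return p

q = "data mining algorithms"
print(query_likelihood(text_lm, q) > query_likelihood(food_lm, q))  # True
```

The text-mining model assigns the query a far higher probability, so it "most likely generated" it; that is the query-likelihood ranking criterion.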

ABSTRACT IR ARCHITECTURE

(The architecture figure is shown again: online query representation and
offline document representation feed a comparison function over the index,
producing hits.)

EXAMPLE: SHAKESPEARE'S COLLECTED WORKS

Grep-style pattern matching works for less than a million words
Web search using grep? Millions or trillions of words
Indexing avoids scanning

CONSTRUCTING INDEX (WORD COUNTING)

Documents → (case folding, tokenization, stopword removal, stemming) → Bag of Words → Index

STOPWORDS REMOVAL

Many of the most frequently used words in English are useless in IR and text
mining; these words are called stop words.
the, of, and, to, ...
Typically about 400 to 500 such words
For an application, an additional domain-specific stopword list may be
constructed

Why do we need to remove stopwords?
Reduce indexing (or data) file size
Stopwords account for 20-30% of total word counts
Improve efficiency and effectiveness
Stopwords are not useful for searching or text mining
They may also confuse the retrieval system

STEMMING

Techniques used to find the root/stem of a word. E.g.:
user, users, used, using → stem: use
engineering, engineered, engineer → stem: engineer

Usefulness of stemming:
Improving effectiveness of IR and text mining
Matching similar words
Mainly improves recall
Reducing indexing size
Combining words with the same roots may reduce indexing size by as much as 40-50%
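The preprocessing steps named on the last few slides can be strung together in a short sketch. The stopword list is a tiny sample and the suffix-stripping "stemmer" is deliberately crude (real systems use Porter stemming and much larger lists); all names here are illustrative:

```python
import re

STOPWORDS = {"the", "of", "and", "to", "a", "in", "is"}  # tiny sample list

def crude_stem(word):
    """Naive suffix stripping; a stand-in for a real stemmer like Porter's."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # case folding + tokenization, then stopword removal, then stemming
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The users used the engineered systems"))
# ['user', 'used', 'engineer', 'system']
```

Note the crude stemmer leaves "used" alone while a real stemmer would conflate it with "use"; this is exactly the kind of quality gap stemming algorithms exist to close.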

CONSTRUCTING INDEX (WORD COUNTING)

Documents → (case folding, tokenization, stopword removal, stemming) → Bag of Words → Index

INCIDENCE MATRIX

Term-document incidence (1 = the play contains the word):

            Anthony &  Julius   Tempest  Hamlet  Othello  Macbeth
            Cleopatra  Caesar
Anthony         1         1        0        0       0        1
Brutus          1         1        0        1       0        0
Caesar          1         1        0        1       1        1
Calpurnia       0         1        0        0       0        0
Cleopatra       1         0        0        0       0        0
mercy           1         0        1        1       1        1
worser          1         0        1        1       1        0

INVERTED INDEX

The inverted index of a document collection is basically a data structure
that associates each distinct term with a list of all documents that contain
the term.

AN EXAMPLE

(Figure: a dictionary of terms, each pointing to its postings list.)

SEARCH USING INVERTED INDEX

Given a query q, search has the following steps:
Step 1 (vocabulary search): find each term/word of q in the inverted index
Step 2 (results merging): merge results to find documents that contain all
or some of the words/terms in q
Step 3 (rank score computation): rank the resulting documents/pages using
content-based ranking
link-based ranking
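The first two steps can be sketched in a few lines; a minimal, illustrative build-and-search (intersection for an all-terms query; ranking, step 3, is omitted; the function names are assumptions):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Step 1: look up each query term; step 2: intersect the postings."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "one fish two fish", 2: "red fish blue fish",
        3: "cat in the hat", 4: "green eggs and ham"}
index = build_index(docs)
print(sorted(search_all(index, "fish blue")))  # [2]
```

Swapping the intersection for a union would give "all or some of the terms" semantics as the slide allows.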

INVERTED INDEX: BOOLEAN RETRIEVAL

Doc 1: one fish, two fish        Doc 2: red fish, blue fish
Doc 3: cat in the hat            Doc 4: green eggs and ham

Term → postings (docIDs):
blue → 2        green → 4       one → 1
cat  → 3        ham   → 4       red → 2
egg  → 4        hat   → 3       two → 1
fish → 1, 2
INVERTED INDEX EXAMPLE

Doc 1: "This is a sample document with one sample sentence"
Doc 2: "This is another sample document"

Dictionary (term, # docs, total freq) with postings (doc id, freq):
This     2  2  → (1, 1), (2, 1)
is       2  2  → (1, 1), (2, 1)
sample   2  3  → (1, 2), (2, 1)
another  1  1  → (2, 1)

Slide is from ChengXiang Zhai

BOOLEAN RETRIEVAL

To execute a Boolean query, e.g. ( blue AND fish ) OR ham:

Build the query syntax tree:
        OR
       /  \
     ham   AND
          /   \
       blue   fish

For each clause, look up the postings: blue, fish, ham

Traverse the postings and apply the Boolean operators

Efficiency analysis:
Postings traversal is linear (assuming sorted postings)
Start with the shortest postings first

QUERY PROCESSING: AND (Sec. 1.3)

Consider processing the query: Brutus AND Caesar
Locate Brutus in the dictionary; retrieve its postings
Locate Caesar in the dictionary; retrieve its postings
Merge the two postings:

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34

THE MERGE

Walk through the two postings simultaneously, in time linear in the total
number of postings entries:

Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Result → 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings sorted by docID.

INTERSECTING TWO POSTINGS LISTS (A MERGE ALGORITHM)

(The algorithm's pseudocode figure was lost in extraction.)
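The merge algorithm can be sketched as a two-pointer walk over docID-sorted lists, running in O(x+y) time as stated above (the function name is an assumption):

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists with two pointers."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer on the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each comparison advances at least one pointer, so the loop runs at most x+y times; this is why sorted postings are crucial.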

INVERTED INDEX: TF.IDF

Doc 1: one fish, two fish        Doc 2: red fish, blue fish
Doc 3: cat in the hat            Doc 4: green eggs and ham

Term → df → postings (docID, tf):
blue  1  → (2, 1)      green  1  → (4, 1)     one  1  → (1, 1)
cat   1  → (3, 1)      ham    1  → (4, 1)     red  1  → (2, 1)
egg   1  → (4, 1)      hat    1  → (3, 1)     two  1  → (1, 1)
fish  2  → (1, 2), (2, 2)

POSITIONAL INDEXES
Store term position in postings
Supports richer queries (e.g., proximity)
Naturally, leads to larger indexes

INVERTED INDEX: POSITIONAL INFORMATION

Doc 1: one fish, two fish        Doc 2: red fish, blue fish
Doc 3: cat in the hat            Doc 4: green eggs and ham

Term → df → postings (docID, tf, [positions]):
blue  1  → (2, 1, [3])       green  1  → (4, 1, [1])     one  1  → (1, 1, [1])
cat   1  → (3, 1, [1])       ham    1  → (4, 1, [3])     red  1  → (2, 1, [1])
egg   1  → (4, 1, [2])       hat    1  → (3, 1, [2])     two  1  → (1, 1, [3])
fish  2  → (1, 2, [2,4]), (2, 2, [2,4])
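Positional postings are what make phrase queries possible. A minimal sketch over a hand-built fragment of a positional index (the positional dict and phrase_docs helper are illustrative names, not from the slides):

```python
# term -> {docID: [positions]}; a small hand-built positional index fragment
positional = {
    "red":  {2: [1]},
    "fish": {1: [2, 4], 2: [2, 4]},
    "one":  {1: [1]},
    "blue": {2: [3]},
}

def phrase_docs(index, first, second):
    """DocIDs where `first` is immediately followed by `second`."""
    hits = []
    for doc, pos1 in index.get(first, {}).items():
        pos2 = set(index.get(second, {}).get(doc, []))
        # a phrase match needs some position of `second` one past `first`
        if any(p + 1 in pos2 for p in pos1):
            hits.append(doc)
    return sorted(hits)

print(phrase_docs(positional, "red", "fish"))  # [2]
print(phrase_docs(positional, "one", "fish"))  # [1]
```

Relaxing the `p + 1` check to a window (e.g. `abs` distance below k) turns the same structure into a proximity query.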

RETRIEVAL IN A NUTSHELL
Look up postings lists corresponding to query terms
Traverse postings for each query term
Store partial query-document scores in
accumulators
Select top k results to return

RANKING

Ranking of documents on the basis of estimated relevance to a query is
critical

Relevance ranking is based on factors such as:
Term frequency
Frequency of occurrence of the query keyword in the document
Inverse document frequency
How many documents the query keyword occurs in
Fewer documents means more importance for the keyword
Hyperlinks to documents
More links to a document means the document is more important

RELEVANCE RANKING USING TERMS

Most systems add to the above model:
Words that occur in the title, author list, section headings, etc. are given
greater importance
Words whose first occurrence is late in the document are given lower
importance
Very common words such as "a", "an", "the", "it", etc. are eliminated
Called stop words

Proximity: if keywords in the query occur close together in the document,
the document has higher importance than if they occur far apart

Documents are returned in decreasing order of relevance score
Usually only the top few documents are returned, not all

RELEVANCE USING HYPERLINKS

The number of documents relevant to a query can be enormous if only term
frequencies are taken into account

Using term frequencies makes spamming easy
E.g., a travel agency can add many occurrences of the word "travel" to its
page to make its rank very high

Most of the time people are looking for pages from popular sites

Idea: use popularity of a Web site (e.g., how many people visit it) to rank
site pages that match given keywords
Problem: hard to find actual popularity of a site
Solution: next slide

Solution: use the number of hyperlinks to a site as a measure of the
popularity or prestige of the site
Count only one hyperlink from each site
The popularity measure is for the site, not for individual pages

But most hyperlinks are to the root of a site
Also, the concept of a "site" is difficult to define, since a URL prefix
like cs.yale.edu contains many unrelated pages of varying popularity

Refinements:
When computing prestige based on links to a site, give more weight to links
from sites that themselves have higher prestige
The definition is circular
Set up and solve a system of simultaneous linear equations
The above idea is the basis of the Google PageRank ranking mechanism

Connections to social networking theories that ranked the prestige of
people:
E.g., the president of the U.S.A. has high prestige since many people know
him
Someone known by multiple prestigious people has high prestige
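The circular prestige definition can also be solved iteratively rather than by explicit linear equations. A minimal power-iteration sketch of the PageRank idea; the three-page link graph, damping value, and function name are invented for illustration:

```python
def pagerank(links, damping=0.85, iters=50):
    """Iterate rank = (1-d)/n + d * (incoming rank shares) to a fixed point."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # each page q splits its rank evenly among its outlinks
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

# c is linked to by both a and b, so it ends up with the highest prestige.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
print(max(r, key=r.get))  # 'c'
```

The fixed point this converges to is the same vector one would get by solving the simultaneous linear equations mentioned above.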

Hub and authority based ranking:
A hub is a page that stores links to many pages (on a topic)
An authority is a page that contains actual information on a topic
Each page gets a hub prestige based on the prestige of the authorities it
points to
Each page gets an authority prestige based on the prestige of the hubs that
point to it
Again, the prestige definitions are cyclic, and can be obtained by solving
linear equations
Use authority prestige when ranking answers to a query

SYNONYMS AND HOMONYMS

Synonyms
E.g., document: "motorcycle repair", query: "motorcycle maintenance"
The system needs to realize that "maintenance" and "repair" are synonyms
The system can extend the query as "motorcycle and (repair or maintenance)"

Homonyms
E.g., "object" has different meanings as noun and verb
Can disambiguate meanings (to some extent) from the context

Extending queries automatically using synonyms can be problematic:
Need to understand the intended meaning in order to infer synonyms
Or verify synonyms with the user
Synonyms may have other meanings as well

CONCEPT-BASED QUERYING

Approach:
For each word, determine the concept it represents from context
Use one or more ontologies:
Hierarchical structures showing relationships between concepts
E.g., the IS-A relationship that we saw in the E-R model

This approach can be used to standardize terminology in a specific field
Ontologies can link multiple languages
Foundation of the Semantic Web (not covered here)

How to evaluate retrieval results?

WHAT TO EVALUATE?

Coverage of information
Form of presentation
Effort required / ease of use
Time and space efficiency

Metrics:
Recall: proportion of relevant material actually retrieved
Precision: proportion of retrieved material actually relevant

PRECISION VS. RECALL

(Venn diagram: within the collection of all docs, the Retrieved set and the
Relevant set overlap in the relevant-retrieved documents.)

Recall = |Relevant ∩ Retrieved| / |Relevant in Collection|

Precision = |Relevant ∩ Retrieved| / |Retrieved|
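The two formulas above translate directly into code; a minimal sketch with an invented judgment set (the document IDs are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a relevant set."""
    rel_retrieved = [d for d in retrieved if d in relevant]
    precision = len(rel_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(rel_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 4 retrieved docs are relevant (precision 0.5);
# 2 of the 3 relevant docs were retrieved (recall 2/3).
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant={"d1", "d3", "d9"})
print(p, r)
```

The two metrics pull in opposite directions: retrieving everything maximizes recall but ruins precision, and vice versa.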

REVIEW QUESTIONS

What is the IR architecture, and what methods are used in it?
How are retrieval results evaluated?
Explain the inverted indexing method with an example.
Explain Boolean retrieval with an example.
Define hub and authority.
Define TF and TF.IDF using an example.

NEXT LECTURE HIGHLIGHTS

What are I/O devices and their types?
How do you access I/O devices?
What are device drivers?
File descriptors
