Lecture 1
Lect.
Topics
Readings
Class Notes
2-3
R3
4-5
R3 Ch. 6, Ch. 7.
R1 Sec. 10.1
6-7
R1 Secs. 10.2 and 10.3
R1 Sec. 11.1
R6 Ch 2
10-11
12-14
R6 Ch 2.
15
Text Retrieval
Lect.
Topics
Readings
16-17
T1. Ch 4
18-19
AR
20
T1. Ch 7
21-22
T1. Ch 8,
R8. Ch 2
23
Class Notes
24-25
R8. Ch 3
26-27
R2. Ch 5
28-29
AR
30-32
R8. , AR
33-34
R8. , AR
35
T1. Ch 13
36-37
T1. Ch 13
AR
38
AR
39-40
T1. Ch 12
41
Course Summary
INFORMATION RETRIEVAL
What is information? examples
How is information stored?
Text (Documents)
XML and structured documents
Images
Audio (sound effects, songs, etc.)
Video
Source code
Applications/Web services
mkarimz2@uiuc.edu
NOMENCLATURE
What do we search?
Generically, collections
What do we find?
Generically, documents
Even though we may be referring to web pages, PDFs,
PowerPoint slides, paragraphs, etc.
SEARCH PROCESS
[Figure: the search process, with the labels Source Selection, Resource, Query Formulation, Query, Search, Ranked List, Selection, Documents, Results. Indexing builds the Index from the Document Collection.]
4/3/2016
Introduction to Information Retrieval
Searcher
Concepts
Concepts
Query Terms
Document Terms
ABSTRACT IR ARCHITECTURE
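[Figure: the Query (online) and the Documents (offline) each pass through a Representation Function, giving a Query Representation and a Document Representation. The Document Representation is stored in the Index. A Comparison Function matches the Query Representation against the Index and returns Hits.]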
EXAMPLE
Web
Assumptions
Term occurrence is independent
Document relevance is independent
Words are well-defined
WHAT'S A WORD?
SAMPLE DOCUMENT
McDonald's slims down spuds
Fast-food chain to reduce certain types of
fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is
cutting the amount of "bad" fat in its french fries
nearly in half, the fast-food chain said Tuesday as
it moves to make all its fried menu items
healthier.
But does that mean the popular shoestring fries
won't taste the same? The company says no. "It's
a win-win for our customers because they are
getting the same great french-fry taste along with
an even healthier nutrition profile," said Mike
Roberts, president of McDonald's USA.
But others are not so sure. McDonald's will not
specifically discuss the kind of oil it plans to use,
but at least one nutrition expert says playing with
the formula could mean a different taste.
Shares of Oak Brook, Ill.-based McDonald's
(MCD: down $0.54 to $23.22, Research,
Estimates) were lower Tuesday afternoon. It was
unclear Tuesday whether competitors Burger
King and Wendy's International (WEN: down
$0.80 to $34.91, Research, Estimates) would
follow suit. Neither company could immediately
be reached for comment.
Bag of Words
14 McDonalds
12 fat
11 fries
8 new
7 french
6 company, said, nutrition
5 food, oil, percent, reduce, taste, Tuesday
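A bag of words keeps only how often each term occurs and discards word order. The following is a minimal sketch, assuming a simple normalization (lowercasing and dropping apostrophes, so that "McDonald's" becomes "mcdonalds"). The function name `bag_of_words` is hypothetical, and the input is only the opening lines of the sample document, so the counts are smaller than those listed above.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return term counts for a text, ignoring word order."""
    # Assumed normalization: lowercase, drop apostrophes, keep runs of letters.
    normalized = text.lower().replace("'", "")
    return Counter(re.findall(r"[a-z]+", normalized))

excerpt = (
    "McDonald's slims down spuds. "
    "Fast-food chain to reduce certain types of fat in its french fries "
    "with new cooking oil. "
    "McDonald's Corp. is cutting the amount of \"bad\" fat in its french fries "
    "nearly in half"
)

bag = bag_of_words(excerpt)
for term in ("mcdonalds", "fat", "fries", "french"):
    print(bag[term], term)
```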
Boolean model
Vector space model
Statistical language model
AN EXAMPLE OF BOOLEAN
A2=(0, 1, 0),
A5=(1, 0, 1),
A8=(1, 0, 1).
A3=(0, 0, 1)
A6=(0, 1, 1)
A9=(0, 1, 1)
AN EXAMPLE (CONT.)
In
TERM WEIGHTING
Document: Food nutrition paper
food ?
nutrition ?
healthy ?
diet ?
Document: Text mining paper
text ?
mining ?
association ?
clustering ?
food ?
Query = data mining algorithms
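One common way to fill in the question marks is a TF-IDF weight. The slides do not give a formula, so the sketch below assumes the widely used variant tf * log(N / df), and the two token lists are hypothetical stand-ins for the two papers. A term that occurs in every document (here "food") gets weight 0, while a term confined to one document gets a positive weight.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Map doc id -> {term: tf * log(N / df)} (assumed weighting variant)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))          # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

# Hypothetical token lists for the two papers on the slide.
docs = {
    "food_paper": ["food", "nutrition", "healthy", "diet", "food"],
    "mining_paper": ["text", "mining", "association", "clustering", "food"],
}
weights = tf_idf(docs)
print(weights["food_paper"]["food"])      # occurs in both documents
print(weights["mining_paper"]["mining"])  # occurs in one document
```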
ABSTRACT IR ARCHITECTURE
[Figure: the same architecture (Query, Documents, Representation Functions, Comparison Function, Index, Hits), annotated with: Bag of Words (document representation), Index, and the preprocessing steps case folding, tokenization, stopword removal, stemming.]
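The first three preprocessing steps can be sketched in a few lines. This is an illustration under stated assumptions: the stopword list is a tiny hypothetical one, tokens are taken to be runs of letters and digits, and the function name `preprocess` is not from the slides.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def preprocess(text):
    """Case-fold, tokenize, and drop stopwords."""
    folded = text.lower()                              # case folding
    tokens = re.findall(r"[a-z0-9]+", folded)          # tokenization
    return [t for t in tokens if t not in STOPWORDS]   # stopword removal

print(preprocess("The Cat in the Hat"))
```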
STOPWORDS REMOVAL
STEMMING
Techniques for reducing a word to its stem. E.g.,
user, users, used, using -> stem: use
engineering, engineered, engineer -> stem: engineer
Usefulness of stemming:
improving effectiveness of IR and text mining
as much as 40-50%.
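To make the idea concrete, here is a deliberately crude suffix stripper. It is a toy written for this illustration, not the Porter algorithm or any library's stemmer, and the suffix list is an assumption. It conflates the same word groups as the example above, although the stems it produces ("us", "engin") are not dictionary words.

```python
# Assumed suffix list, tried in this order on every pass.
SUFFIXES = ("ing", "ed", "ers", "er", "es", "e", "s")

def crude_stem(word):
    """Strip suffixes repeatedly while at least two characters remain."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

for w in ("user", "users", "used", "using", "engineering", "engineered", "engineer"):
    print(w, "->", crude_stem(w))
```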
INCIDENCE MATRIX
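[Table: term-document incidence matrix; an entry is 1 if the play contains the term and 0 otherwise.]
Plays (columns): Anthony & Cleopatra, Julius Caesar, Tempest, Hamlet, Othello, Macbeth
Terms (rows): Anthony, Brutus, Caesar, Calpurnia, Cleopatra, Mercy, worser
Surviving entries: Calpurnia 0, Cleopatra 1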
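An incidence matrix can be built and queried with bitwise operations. The sketch below uses three hypothetical documents rather than the plays, since the matrix entries are not available here. Each row is a 0/1 vector over the documents, and a Boolean query combines rows entry by entry.

```python
def incidence_matrix(docs):
    """Map each term to a 0/1 list with one entry per document (sorted by id)."""
    doc_ids = sorted(docs)
    vocabulary = sorted(set().union(*docs.values()))
    return {t: [int(t in docs[d]) for d in doc_ids] for t in vocabulary}

# Hypothetical documents, given as sets of terms.
docs = {
    "d1": {"brutus", "caesar"},
    "d2": {"caesar", "calpurnia"},
    "d3": {"brutus", "caesar", "calpurnia"},
}
matrix = incidence_matrix(docs)

# Query: brutus AND caesar AND NOT calpurnia
hits = [
    b & c & (1 - p)
    for b, c, p in zip(matrix["brutus"], matrix["caesar"], matrix["calpurnia"])
]
print(hits)
```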
INVERTED INDEX
AN EXAMPLE
[Figure: example index over the documents labeled Doc 2, Doc 3, Doc 4, mapping each term to the documents that contain it. Terms: blue, cat, egg, fish, green, ham, hat, one, red, two.]
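[Figure: inverted index for a small collection; one document reads "This is another sample document". The Dictionary stores, per term, the number of documents (# docs) and the total frequency (Total freq). Each entry in the Postings stores a Doc id and a Freq.]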
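The dictionary/postings layout described above can be sketched as follows. Document 2 is the sentence from the slide; document 1 and the function name `build_index` are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Return (dictionary, postings).

    dictionary: term -> (number of docs, total frequency)
    postings:   term -> [(doc_id, freq), ...] sorted by doc_id
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term, freq in Counter(docs[doc_id]).items():
            postings[term].append((doc_id, freq))
    dictionary = {
        term: (len(plist), sum(freq for _, freq in plist))
        for term, plist in postings.items()
    }
    return dictionary, dict(postings)

docs = {
    1: "this is a sample".split(),                 # hypothetical document
    2: "this is another sample document".split(),  # sentence from the slide
}
dictionary, postings = build_index(docs)
print(dictionary["sample"], postings["sample"])
```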
BOOLEAN RETRIEVAL
OR
fish
ham
AND
blue
fish
Efficiency analysis
Postings traversal is linear (assuming sorted postings)
Start with shortest posting first
[Figure (Sec. 1.3): postings lists for Brutus and Caesar. Doc IDs recoverable from the figure: 8, 13, 16, 21, 32, 34, 64, 128.]
THE MERGE
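[Figure: merging the Brutus and Caesar postings lists by walking both in doc-ID order. Doc IDs recoverable from the figure: 8, 13, 16, 21, 32, 34, 64, 128.]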
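The linear-time merge for an AND query can be sketched as below. Both lists must be sorted by doc ID; the two example lists are assumed for illustration and were chosen to include the doc IDs that survive in the figure. `intersect_many` (a hypothetical helper) applies the "shortest posting first" heuristic for queries with more than two terms.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2))."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the pointer with the smaller doc ID
        else:
            j += 1
    return result

def intersect_many(postings_lists):
    """AND several lists, starting with the shortest to keep results small."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect(result, plist)
    return result

# Assumed example lists (sorted by doc ID).
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))
```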
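[Figure: inverted index for the documents labeled Doc 2, Doc 3, Doc 4, with the document frequency (df) stored per term and the term frequency (tf) stored per posting. Terms: blue, cat, egg, fish, green, ham, hat, one, red, two.]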
POSITIONAL INDEXES
Store term position in postings
Supports richer queries (e.g., proximity)
Naturally, leads to larger indexes
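[Figure: positional index for the documents labeled Doc 2, Doc 3, Doc 4, with df per term and tf per posting. Positions recoverable from the figure:]
blue: [3]
cat: [1]
egg: [2]
fish: [2,4] and [2,4] (two postings, tf 2 each)
green: [1]
ham: [3]
hat: [2]
one: [1]
red: [1]
two: [3]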
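A positional index and a simple proximity query can be sketched as follows. The four token lists are assumed toy documents (already case-folded, stopped, and stemmed), chosen so that the resulting positions agree with those that survive in the figure. The function names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [1-based positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(position)
    return index

def within(index, term1, term2, k):
    """Doc ids in which term1 and term2 occur at most k positions apart."""
    common = index.get(term1, {}).keys() & index.get(term2, {}).keys()
    return [
        doc_id
        for doc_id in sorted(common)
        if any(abs(a - b) <= k
               for a in index[term1][doc_id]
               for b in index[term2][doc_id])
    ]

# Assumed toy documents.
docs = {
    1: ["one", "fish", "two", "fish"],
    2: ["red", "fish", "blue", "fish"],
    3: ["cat", "hat"],
    4: ["green", "egg", "ham"],
}
index = build_positional_index(docs)
print(index["fish"])
print(within(index, "blue", "fish", 1))
```

Storing a position list per posting is what makes the index larger than a plain doc-ID index.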
RETRIEVAL IN A NUTSHELL
Look up postings lists corresponding to query terms
Traverse postings for each query term
Store partial query-document scores in accumulators
Select top k results to return
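These four steps can be sketched as term-at-a-time scoring. The scoring function is left as a parameter because the slides do not fix one; the example uses raw term frequency purely as a placeholder, and the postings are hypothetical.

```python
import heapq
from collections import defaultdict

def retrieve(query_terms, postings, weight, k=10):
    """Score documents term by term and return the k best (doc_id, score) pairs."""
    accumulators = defaultdict(float)                   # doc_id -> partial score
    for term in query_terms:
        for doc_id, tf in postings.get(term, []):       # look up and traverse postings
            accumulators[doc_id] += weight(term, tf)    # update partial score
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])

# Hypothetical postings: term -> [(doc_id, tf), ...]
postings = {"fish": [(1, 2), (2, 2)], "blue": [(2, 1)]}
top = retrieve(["blue", "fish"], postings, weight=lambda term, tf: tf, k=1)
print(top)
```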
RANKING
Refinements
Definition is circular
Set up and solve system of simultaneous linear equations
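One way to read the circular definition is that a page's score depends on the scores of the pages linking to it. The sketch below assumes a PageRank-style formulation with a damping factor of 0.85 and, instead of solving the linear system directly, iterates toward its fixed point. The link graph, the function name `prestige`, and the parameter values are all assumptions.

```python
def prestige(links, damping=0.85, iterations=50):
    """Iteratively approximate link-based scores; the scores sum to 1."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_score = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            # A page without outgoing links spreads its score over all pages.
            targets = links.get(p) or pages
            share = damping * score[p] / len(targets)
            for q in targets:
                new_score[q] += share
        score = new_score
    return score

# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = prestige(links)
print(sorted(scores, key=scores.get, reverse=True))
```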
Synonyms
Homonyms
CONCEPT-BASED QUERYING
Approach
For each word, determine the concept it represents from context
Use one or more ontologies:
WHAT TO EVALUATE?
Coverage of information
Form of presentation
Effort required/ease of Use
Time and space efficiency
Metrics
Recall
proportion of relevant material actually retrieved
Precision
proportion of retrieved material actually relevant
[Figure: Venn diagram of All docs, with the Retrieved and Relevant sets overlapping]
\[ \text{Recall} = \frac{|\text{RelRetrieved}|}{|\text{Rel in Collection}|} \]
\[ \text{Precision} = \frac{|\text{RelRetrieved}|}{|\text{Retrieved}|} \]
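Both metrics follow directly from set sizes. A minimal sketch, assuming the retrieved and relevant documents are given as sets of doc IDs, and returning 0.0 when a denominator would be empty (a convention chosen for this example):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of document ids."""
    rel_retrieved = len(retrieved & relevant)
    precision = rel_retrieved / len(retrieved) if retrieved else 0.0
    recall = rel_retrieved / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgment: 4 documents retrieved, 3 relevant in the collection.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(precision, recall)
```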
REVIEW QUESTIONS
What is the IR architecture? Explain the methods used in it.
How are retrieval results evaluated?
Explain the inverted indexing method with an example.
Explain Boolean retrieval with an example.
Define hub and authority.
Define TF and TF-IDF using an example.