Multimedia IR - Ppts 50

Information Retrieval (IR)
Deals with the representation, storage and retrieval of unstructured data Topics of interest: systems, languages, retrieval, user interfaces, data visualization, distributed data sets Classical IR deals mainly with text The evolution of multimedia databases and of the web have given new interest to IR
E.G.M Petrakis Multimedia Information Retrieval 1
Multimedia IR
A multimedia information system that can store and retrieve
attributes text 2D grey-scale and color images 1D time series digitized voice or music video
Applications
Financial, marketing: stock prices, sales etc. Scientific databases: sensor data
find companies whose stock prices move similarly
whether, geological, environmental data
Office automation, electronic encyclopedias, electronic books Medical databases X-rays, CT, MRI scans Criminal investigation suspects, fingerprints, Personal archives, text and color images
Queries
Formulation of users information need
Free-text query By example document (e.g., text, image) e.g., in a collection of color photos find those showing a tree close to a house
Retrieval is based on the understanding of the content of documents and of their components
Goals of Retrieval
Accuracy: retrieve documents that the user expects in the answer
With as few incorrect answers as possible All relevant answers are retrieved The system responds in real time
Speed: retrievals has to be fast
E.G.M Petrakis
Multimedia Information Retrieval
Accuracy of Retrieval
Depends on query criteria, query complexity and specificity
Attributes / Content criteria?
Depends on what is matched with the query

How documents content is represented Typically feature vectors
Depends also on matching function

E.G.M Petrakis
Euclidean distance
Speed of Retrieval
The query is compared (matched) with all stored documents The definition of similarity criteria (similarity or distance function between documents) is an important issue:
Similarity or distance function between documents Matching has to fast: document matching has to be computationally efficient and sequential searching must be avoided
Indexing: search documents that are likely to match the query

E.G.M Petrakis Multimedia Information Retrieval
Database
database descriptions
Doc1-FeactureVector
index
query
documents Doc1 Doc2 DocN
Doc2-FeactureVector
DocN-FeatureVector
E.G.M Petrakis
Main Idea
Descriptions are extracted and stored
Same or separate storage with documents Queries address the stored descriptions rather the documents themselves Images & video: the majority of stored data Text retrieval is a well researched area Most systems work with text, attributes
Multimedia Information Retrieval 9
Solution: retrieval by text
E.G.M Petrakis
Problem
Text descriptions for images and video contained in documents are not available
e.g., the text in a web site is not always descriptive of every particular image contained in the web site
Two approaches
human annotations feature extraction
Human Annotations
Humans insert attributes, captions for images and videos
inconsistent or subjective over time and users (different users do not give the same descriptions) retrievals will fail if queries are formulated using different keywords or descriptions time consuming process, expensive or impossible for large databases
E.G.M Petrakis
11
Feature Extraction
Features are extracted from audio, image and video
cheaper and replicable approach consistent descriptions but inexact low level feature (patterns, colors etc.) difficult to extract meaning different techniques for different data
E.G.M Petrakis
Architecture of a IR System for Images

Furht at.al. 96
E.G.M Petrakis
13
Multimedia IR in Practice
Relies on text descriptions Non-text IR
there is nothing similar to text-based retrieval systems automatic IR is sometimes impossible requires human intervention at some level significant progress have been made domain specific interpretations
E.G.M Petrakis
14
Ranking of IR Methods
Ranking in terms of complexity and accuracy
attributes text audio image video
accuracy
complexity
Combined IR based on two or more data types for more accurate retrievals Complex data types (e.g.,video) are rich sources of information
Similarity Searching
IR is not an exact process
two documents are never identical! they can be similar searching must be approximate
The effectiveness of IR depends on the

types and correctness of descriptions used types of queries allowed user uncertainly as to what he is looking for efficiency of search techniques
Approximate IR
A query is specified and all documents up to a pre-specified degree of similarity are retrieved and presented to the user ordered by similarity Two common types of similarity queries:
range queries: retrieve all documents up to a distance threshold T nearest-neighbor queries: retrieve the k best matches
Distance - Similarity
Decide whether two documents are similar
distance: the lower the distance, the more similar the documents are similarity: the higher the similarity, the more similar the documents are
Key issue for successful retrieval: the more accurate the descriptions are, the more accurate the retrieval is
Retrieval Quality Criteria

Retrieval with as few errors as possible Two types of errors:
false dismissals or misses: qualifying but non retrieved documents false positives or false drops: retrieved but not qualifying documents
A good method minimizes both Ranking quality: retrieve qualifying documents before non-qualifying ones
Evaluation of Retrieval
Given a collection of documents, a set of
queries and human experts responses to the above queries, the ideal system will retrieve exactly what the human dictated
number of retrieved relevant in answer precision = number of retrieved number or retrieved relevant in answer recall = total number or relevant in the collection
The deviations from the above are measured
Precision-Recall Diagram
precision
1
the ideal method
0,5
0,5
Measure precision-recall for 1, 2 ..N answers High precision means few false alarms High recall mean few false dismissals
recall
Harmonic Mean
A single measure combining precision and recall
1 r: recall, p: precision r F takes values in [0,1] F 1 as more retrieved documents are relevant F 0 as few retrieved documents are relevant F is high when both precision and recall are high F expresses a compromise between precision and recall
F =
2 1 +p
Ranking Quality
The higher the Rnorm the better the ability of a method to retrieve correct answer before incorrect S S 1 2 (1 ) if S max 0 S max Rnorm 1 otherwise A human expert evaluates the answer set Then take answers in pairs: (relevant,irrelevant) S+ : pairs ranked correctly (the relevant entry has retrieved before the irrelevant one) S- : pairs ranked incorrectly Smax+: total ranked pairs
Searching Method
Sequential scanning: the query is matched with all stored documents
slow if the database is large or if matching is slow
The documents must be indexed
hashing, inverted files, B-trees, R-trees different data types are indexed and searched separately
E.G.M Petrakis
24
Goals of IR
Summarizing, there are two general goals common to all IR systems
effectiveness: IR must be accurate (retrieves what the user expects to see in the answer) efficiency: IR must be fast (faster than sequential scanning)
E.G.M Petrakis
25
Approximate IR
A query is specified and all documents up to a pre-specified degree of similarity are retrieved and presented to the user ordered by similarity Two common types of similarity queries:
range queries: retrieve all documents up to a distance threshold T nearest-neighbor queries: retrieve the k best matches
Query Formulation
Must be flexible and convenient SQL queries constraints on all attributes and data types may become very complex Queries by example e.g., by providing an example document or image Browsing: display headers, summaries, miniatures etc. or for refining the retrieved results
27
Access Methods for Text

Access methods for text are interesting for at least 3 reasons
multimedia documents contain text, e.g., images often have captions text retrieval has several applications in itself (library automation, web search etc.) text retrieval research has led to useful ideas like, vector space model, information filtering, relevance feedback etc.
E.G.M Petrakis
Text Queries
Single or multiple keyword queries
context queries: phrases, word proximity boolean: keywords with and, or, not
Natural language: free text queries Structured search takes also text structure into account and can be:
flat or hierarchical for searching in titles, paragraphs, sections, chapters hypertext: combines content-connectivity
E.G.M Petrakis
29
Keyword Matching
The whole collection is searched
no preprocessing no space overhead updates are easy
The user specifies a string (regular expression) and the text is parsed using a finite state automaton
KMP, BMH algorithms
Error Tolerant Methods

Methods that can tolerate errors Scan text one character at a time by keeping track of matched characters and retrieve strings within a desired editing distance from query
Wu, Manber 92, Baeza-Yates, Connet 92
Extension: regular expressions built-up by strings and operators: pro (blem |

tein) (s | ) | (0 | 1 | 2)
Error Counting Methods

Retrieve similar words or phrases e.g., misspelled words or words with different pronunciation editing distance: a numerical estimate of the similarity between 2 strings phonetic coding: search words with similar pronunciation (Soundex, Phonix) N-grams: count common N-length substrings
32
Editing Distance
Minimum number of edit operations that are needed to transform the first string to the second
edit operation: insert, delete, substitute and (sometimes) transposition of characters
d(si,ti) : distance between characters
usually d(si,ti) = 1 if si < > ti and 0 otherwise edit(cordis,codis) = 1: r is deleted edit(cordis, codris) = 1: transposition of rd
E.G.M Petrakis
j 0 1 2 3 4 5 6 7 8
E.G.M. Petrakis
i 0 1 a 0 1 a 1 0 a 2 1 a 3 2 b 4 3 c 5 4 c 6 5 c 7 6 d 8 7
2 b 2 1 1 2 2 3 4 5 6
3 b 3 2 2 2 2 2 4 5 6
4 c 4 3 3 3 3 2 3 4 5
5 d 5 4 4 4 4 3 3 4 4
6 d 6 5 5 5 5 4 4 4 4
initialization cost
total cost
34
Damerau-Levenstein Algorithm
Basic recurrence relation (DP algorithm)
edit(0,0) = 0;edit(i,0) = i; edit(0,j) = j; edit(i,j) = min{ edit(i-1,j) + 1, edit(i, j-1) + 1, edit(i-1,j-1) + d(si,ti), edit(i-2,j-2) + d(si,tj-1) + d(si-1,sj) + 1 }
Matching Algorithm
Compute D(A,B)
#A, #B lengths of A, B 0: null symbol R: cost of an edit operation
D(0,0) = 0 for i = 0 to #A: D(i:0) = D(i-1,0) + R(A[i]0); for j = 0 to #B: D(0:j) = D(0,j-1) + R(0B[j]); for i = 0 to #A for j = 0 to #B {
1. m1 = D(i,j-1) + R(0B[j]); 2. m2 = D(i-1,j) + R(A[i] 0); 3. m3 = D(i-1,j-1) + R(A[i] B[j]); 4. D(i,j) = min{m1, m2, m3};
}
E.G.M. Petrakis
36
Classical IR Models
Boolean model
simple based on set theory queries as Boolean expressions
Vector space model

queries and documents as vectors in term space
Probabilistic model
a probabilistic approach
Text Indexing
A document is represented by a set of
index terms that summarize document contents adjectives, adverbs, connectives are less useful mainly nouns (lexicon look-up) requires text pre-processing (off-line)
E.G.M Petrakis
38
Indexing
index
Data Repository
Text Preprocessing
Extract index terms from text Processing stages
word separation, sentence splitting change terms to a standard form (e.g., lowercase) eliminate stop-words (e.g. and, is, the, ) reduce terms to their base form (e.g., eliminate prefixes, suffixes) construct mapping between terms and documents (indexing)
E.G.M Petrakis
Text Preprocessing Chart
from Baeza Yates & Ribeiro Neto, 1999

Text Indexing Methods

Main indexing methods:
inverted files signature files bitmaps
Size of index
may exceed size of actual collection compress index to reduce storage typically 40% of size of collection
Inverted Files
Dictionary: terms stored alphabetically Posting list: pointers to documents containing the term Posting info: document id, frequency of occurrence, location in document, etc.
Pros: fast and easy to implement Cons: large space overhead (up to 300%)
dictionary indexed by B-trees, tries or binary searched
Inverted Index
index posting list
(1,2)(3,4) (4,3)(7,5)
documents
1 2 3 4 5 6 7 8 9 10 11
(10,3)
E.G.M Petrakis
44
Query Processing
1. Parse query and extract query terms 2. Dictionary lookup 3. Get postings from inverted file 4. Accumulate postings
one term at a time
record matching document ids, frequencies, etc. compute weights binary search
5. Rank answers by weights (e.g., frequencies)

45
Signature Files
A filter for eliminating most of the nonqualifying documents
signature: short hash-coded representation of document or query the signatures are stored and searched sequentially search returns all qualifying documents plus some false alarms the answer is searched to eliminate false alarms E.G.M Petrakis Multimedia Information 46
Retrieval
Superimposed Coding
Each word yields a bit pattern of size F with m bits set to 1 and the rest left as 0
size of F affects the false drop rate bit patterns are OR-ed to form the doc signature find documents having 1 in the same bits as the query
word data signature 001 000 110 010
base
document signature
E.G.M Petrakis
000 010 101 001

001 010 111 011
47
Bitmaps
For every term, a bit vector is stored
each bit corresponds to a document set to 1 if term appears in document, 0 otherwise e.g., word 10100: the word appears in documents 1 and 3 in a collection of 5 documents
Very efficient for Boolean queries

retrieve bit-vectors for each query term E.G.M Petrakis Multimedia Information 48 combine bit vectors with Boolean operators
Retrieval
Comparison of Text Indexing Methods

Bitmaps:
pros: intuitive, easy to implement and to process cons: space overhead pros: easy to implement, compact index, no lexicon cons: many unnecessary accesses to documents pros: used by most search engines cons: large space overhead, index can be compressed
Signature files:
Inverted files:
E.G.M Petrakis
References
Searching Multimedia Databases by Content, C. Faloutsos, Kluwer Academic Publishers, 1996 Modern Information Retrieval, R. Baeza-Yates, B. Ribeiro-Neto, Addison Wesley, 1999 Automatic Text Processing, Gerard Salton, Addison Wesley, 1989 Information Retrieval Links:
http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/IR.html
E.G.M Petrakis
50

Multimedia IR - Ppts 50

Загружено:

Сведения о документе

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Multimedia IR - Ppts 50

Загружено:

Авторское право:

Доступные форматы

Information Retrieval (IR)

Speed: retrievals has to be fast

Multimedia Information Retrieval

Depends on what is matched with the query

Depends also on matching function

Multimedia Information Retrieval

Indexing: search documents that are likely to match the query

documents Doc1 Doc2 DocN

Multimedia Information Retrieval

Solution: retrieval by text

Architecture of a IR System for Images

Multimedia Information Retrieval

Multimedia Information Retrieval

The effectiveness of IR depends on the

Retrieval Quality Criteria

The deviations from the above are measured

the ideal method

The documents must be indexed

Multimedia Information Retrieval

Multimedia Information Retrieval

Access Methods for Text

Error Tolerant Methods

Extension: regular expressions built-up by strings and operators: pro (blem |

Error Counting Methods

d(si,ti) : distance between characters

Multimedia Information Retrieval

Multimedia Information Retrieval

Vector space model

Multimedia Information Retrieval

Text Preprocessing Chart

from Baeza Yates & Ribeiro Neto, 1999

Text Indexing Methods

dictionary indexed by B-trees, tries or binary searched

Multimedia Information Retrieval

5. Rank answers by weights (e.g., frequencies)

000 010 101 001

Multimedia Information Retrieval

Very efficient for Boolean queries

Comparison of Text Indexing Methods

Multimedia Information Retrieval

Вам также может понравиться