
Multimedia Information Retrieval (CSC 545)

Textual Retrieval
By Dr. Nursuriati Jamil

The problem of IR

Goal = find documents relevant to an information need from a large document set
[Diagram: the user's information need is formulated as a query, the IR system retrieves matching documents from the document collection, and returns a ranked answer list.]

The retrieval problem


Given

N documents (D0, ..., DN-1) and a user query Q.

Problem

Return a ranked list of k documents Dj (0 <= j < N) that match the query sufficiently well, ranked with respect to the relevance of each document to the query.

- Feature extraction (words, phrases, n-grams, stemming, stop words, thesaurus, multimedia)
- Retrieval model (Boolean retrieval, vector space retrieval, LSI, signatures, probabilistic retrieval)
- Index structures (inverted lists, signature files, relational databases, multidimensional index structures)
- Freshness of data (real-time, update every day / week / month)
- Query transformation (AND/OR, expansion, stemming, thesaurus)
- Ranking of retrieved documents (RSV, link structure, phrases)

Text Retrieval - Overview


[Workflow diagram] A user query such as "Penderaan kanak-kanak di Malaysia" ("child abuse in Malaysia") is first expanded by query transformation into a term set, e.g. Q = {dera, kanak-kanak, Malaysia, seksa, pukul, hukum, budak, bayi, remaja}. Offline, feature extraction over the documents in the database produces per-document term lists (e.g. docID HBJ3N129: hukum -> word10, word25; denda -> word2, word35, word100, word123; kena -> word67, ...) and an inverted file (e.g. dera -> HBJ3N129, HBM4N111; budak -> HBJ2N19, HBJ3N129; Malaysia -> HBJ3N129). Online, retrieval looks the query terms up in the inverted file, and relevance ranking assigns retrieval status values such as RSV(Q, HBM4N111) = 0.4 and RSV(Q, HBJ3N129) = 0.2, which determine the order of the result list.
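A minimal sketch of this offline/online split in Python; the document texts and the crude RSV normalisation are made up for illustration, only the slide's example document IDs are reused:

    from collections import defaultdict

    def build_inverted_file(docs):
        """Offline step: map each term to (doc_id, position) pairs."""
        inverted = defaultdict(list)
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.lower().split()):
                inverted[term].append((doc_id, pos))
        return inverted

    def retrieve(inverted, query_terms):
        """Online step: count query-term occurrences per document as a crude RSV."""
        scores = defaultdict(int)
        for term in query_terms:
            for doc_id, _ in inverted.get(term, []):
                scores[doc_id] += 1
        return sorted(((d, s / len(query_terms)) for d, s in scores.items()),
                      key=lambda x: -x[1])

    docs = {"HBJ3N129": "hukum denda kena dera budak",
            "HBM4N111": "dera kanak-kanak malaysia pukul seksa"}
    print(retrieve(build_inverted_file(docs), ["dera", "kanak-kanak", "malaysia"]))
    # HBM4N111 is ranked first (RSV 1.0), HBJ3N129 second (RSV about 0.33)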

Feature (Terms) extraction

A text retrieval system represents documents as sets of terms (e.g., words). The originally structured document thereby becomes an unstructured set of terms, potentially annotated with attributes denoting frequency and position in the text. The transformation comprises several steps:
1. Elimination of structure (i.e., formats)
2. Elimination of frequent/infrequent terms (i.e., stop words)
3. Mapping text to terms (without punctuation)
4. Reduction of terms to their stems (stemming, syllable division)
5. Mapping to index terms
(The order of the steps above may vary; often, steps are even broken into several steps or several steps are combined into a single pass)

Types of terms: words, phrases or n-grams (i.e., sequences of n characters)

Overview of feature extraction

[Pipeline diagram: structure elimination -> removal of frequent/infrequent terms -> text-to-term mapping -> stemming -> mapping to index terms]

Step 1. Structure elimination

HTML contains special markups, so-called tags. They describe meta-information about the document and the layout/presentation of content. An HTML document is split into two parts, a header section and a body section:
Header: contains meta-information about the document and also describes embedded elements such as images.

Body: Encompasses the document enriched with markups for layout. The structure of the document is not always obvious.

Step 1. Structure elimination (cont.)

Meta data: HTML provides several possibilities to define meta-information (the <meta> tag). The most frequent ones are:
- URL of the page: http://www-dbs.ethz.ch/~mmir/
- Title of the document: <title>ETH Zurich - Homepage</title>
- Meta information in the header section: <meta name="keywords" content="ETHZ,ETH,swiss,..."> and <meta name="description" content="This page is about...">
Raw text: the raw text subsumes all text pieces with tags stripped from the original <body> section. A few tags are useful to derive additional information on the importance of a text piece:

Headlines: <h1>2. Information Retrieval </h1> Emphasized: <b>Retrieval model </b>

Special characters: meta data and text data may contain special characters (character entities) which have to be translated, e.g. &nbsp; -> space, &uuml; -> ü. Transformation to Unicode, ASCII or another character set may also be needed. A structure-elimination sketch follows below.
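A minimal structure-elimination sketch using only the Python standard library; the HTML snippet is hypothetical and merely echoes the examples above:

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Strips tags; keeps title and meta keywords separately from the raw body text."""
        def __init__(self):
            super().__init__(convert_charrefs=True)   # &nbsp;, &uuml;, ... become characters
            self.in_title = False
            self.title, self.keywords, self.text = "", "", []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "title":
                self.in_title = True
            elif tag == "meta" and attrs.get("name", "").lower() == "keywords":
                self.keywords = attrs.get("content", "")

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data
            else:
                self.text.append(data)

    parser = TextExtractor()
    parser.feed("<html><head><title>ETH Zurich - Homepage</title>"
                "<meta name='keywords' content='ETHZ,ETH,swiss'></head>"
                "<body><h1>Information Retrieval</h1><p>Raw&nbsp;text</p></body></html>")
    print(parser.title, parser.keywords, " ".join(parser.text).split())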


Step 1. Structure elimination (cont.)

Embedded links and objects, and how to handle them:

Embedded objects (images, plug-ins):

<IMG SRC="img/MeAndCar.jpeg" ALT="picture of me in front of my car">

Links to external references: <a href="http://anywhere.in.the.net/important.html">

Question: what does the link text describe? the document itself or the embedded/referenced object?

Usually, the link text is associated with both the embedding and the linked document. In most cases, the link text is a good summary of the linked document; in a few cases it is meaningless ("click here").

Step 2. Eliminate frequent/infrequent terms

Indexing is used to determine useful answers for user queries. Thus, it is not necessary to consider frequent terms with little or no semantics (e.g., the, a, it) or terms that appear only seldom. Theoretical solution: restrict indexing to terms that have proven useful or that appear interesting based on past, practical experience with the system. However, this requires a feedback mechanism with the user to understand term importance. How to select important terms: Zipf's law states that, given a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.

[Figure: Zipf distribution of term frequencies with an upper and a lower cut-off. Terms to the left of the upper cut-off (e.g., stop words and other very frequent words) and to the right of the lower cut-off (seldom-used words) are treated as insignificant.]

Stop words are terms with little or no semantic meaning and are therefore often not indexed. Examples: English: the, a, is; Bahasa Melayu: ada, iaitu, mana, bersabda, wahai. The rank of these terms typically lies to the left of the upper cut-off line. Stop words generally account for 20% to 30% of the term occurrences in a text, so eliminating them reduces the memory consumption of the index.

Similarly, the most frequent terms in a collection of documents carry little information (their rank is also to the left of the upper cut-off line). For example, in a collection of articles about computer science the term "computer" is almost meaningless as an index term; in a broader collection, however, the same term is important, e.g. to distinguish computer science articles from general articles about careers.

Analogously, one can strip off words that are used only seldom, assuming that users will not use them in their queries (their rank is to the right of the lower cut-off). The additional memory savings from this are, however, rather small. A small sketch of both cut-offs is given below.
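A small sketch of stop-word removal combined with both cut-offs; the stop-word list, the cut-off values and the toy documents are arbitrary assumptions for illustration:

    from collections import Counter

    STOPWORDS = {"the", "a", "is", "ada", "iaitu", "mana"}   # assumed; per-language lists in practice

    def significant_terms(docs, upper_cutoff=0.5, lower_cutoff=2):
        """Keep terms that are neither stop words, nor present in more than `upper_cutoff`
        of all documents (upper cut-off), nor occurring fewer than `lower_cutoff` times
        in the whole collection (lower cut-off)."""
        coll_freq, doc_freq = Counter(), Counter()
        for terms in docs.values():
            coll_freq.update(terms)
            doc_freq.update(set(terms))
        n_docs = len(docs)
        return {t for t in coll_freq
                if t not in STOPWORDS
                and doc_freq[t] / n_docs <= upper_cutoff
                and coll_freq[t] >= lower_cutoff}

    docs = {"D1": ["the", "kanak-kanak", "dera"],
            "D2": ["dera", "budak", "the"],
            "D3": ["the", "hukum", "hukum"],
            "D4": ["the", "dera", "bayi"]}
    print(significant_terms(docs, upper_cutoff=0.8))
    # {'dera', 'hukum'}: 'the' is too frequent (and a stop word), the remaining terms too rare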


Step 3: Mapping text to terms


To select appropriate features for documents, one typically uses linguistic or statistical approaches and defines the features based on words, fragments of words, or phrases. Most search engines use words or phrases as features; some use stemming, some differentiate between upper and lower case, and some support error correction. An interesting option is the use of fragments, so-called n-grams (sequences of n characters). Although not directly related to the semantics of the text, they are very useful for fuzzy retrieval. Example (3-grams):
street -> str, tre, ree, eet
streets -> str, tre, ree, eet, ets
strets -> str, tre, ret, ets
Benefits: simple misspellings or bad recognition otherwise lead to poor retrieval, and fragments significantly improve retrieval quality in such cases. Stemming and syllable division are no longer necessary. No language-specific processing is needed; every language is processed in the same way. A small n-gram sketch follows below.
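A tiny sketch of character n-gram extraction and of the fuzzy matching it enables; the overlap measure is just one plausible choice, not prescribed by the slide:

    def ngrams(word, n=3):
        """All character n-grams of a word (the 3-gram examples above)."""
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    print(ngrams("street"))   # ['str', 'tre', 'ree', 'eet']
    print(ngrams("strets"))   # ['str', 'tre', 'ret', 'ets']

    def ngram_overlap(query_term, doc_term, n=3):
        """Fraction of the query term's n-grams that also occur in the document term."""
        q, d = set(ngrams(query_term, n)), set(ngrams(doc_term, n))
        return len(q & d) / len(q) if q else 0.0

    print(ngram_overlap("strets", "street"))   # 0.5: the misspelling still matches half the grams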

Locations and frequency of terms


Retrieval algorithms often use the number of term occurrences and the positions of terms within the document to identify and rank results.
Term frequency ("feature frequency"): tf(Ti, Dj) is the number of occurrences of feature Ti in document Dj. Term frequency is important for ranking documents.
Term locations ("feature locations"): loc(Ti, Dj) -> P(N), the set of positions at which Ti occurs in Dj. Term locations frequently influence the ranking, and even whether a document appears in the result at all, e.g.:
Condition: Q = "shah NEAR alam" (explicit phrase matching) looks only for documents in which the terms shah and alam occur close to each other.
Ranking: Q = "shah alam" (implicit phrase matching) ranks documents in which shah occurs next to alam at the top of the results.
A positional-index sketch is given below.
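A sketch of a positional index and a NEAR check built on it; the document texts and the distance threshold are hypothetical:

    from collections import defaultdict

    def positional_index(docs):
        """term -> {doc_id: [positions]}, i.e. loc(Ti, Dj) from above."""
        index = defaultdict(dict)
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.lower().split()):
                index[term].setdefault(doc_id, []).append(pos)
        return index

    def near(index, t1, t2, max_dist=1):
        """Documents in which t1 and t2 occur within max_dist positions of each other."""
        common = index.get(t1, {}).keys() & index.get(t2, {}).keys()
        return [doc_id for doc_id in common
                if any(abs(p1 - p2) <= max_dist
                       for p1 in index[t1][doc_id] for p2 in index[t2][doc_id])]

    docs = {"D1": "universiti teknologi mara shah alam",
            "D2": "alam sekitar di shah"}
    print(near(positional_index(docs), "shah", "alam"))   # ['D1']: adjacent in D1, too far apart in D2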

tf*idf weighting schema

tf = term frequency

frequency of a term/keyword in a document

The higher the tf, the higher the importance (weight) for the doc.

df = document frequency

the number of documents containing the term; it reflects the distribution of the term in the corpus, i.e. the unevenness of the term distribution and hence the specificity of the term to a document

idf = inverse document frequency

the more evenly a term is distributed across documents, the less specific it is to any one document

weight(t,D) = tf(t,D) * idf(t)

Example
Term    #docs   postings (Dj, tfj)
Haji    3       (D7, 4), (D26, 10), (D40, 5)
Iman    ...     (D21, 2), ...

Term Haji occurs in three documents, 4 times in doc 7, 10 times in doc 26 and 5 times in doc 40.

Some common tf*idf schemes


tf(t, D) = freq(t, D)
tf(t, D) = log[freq(t, D)]
tf(t, D) = log[freq(t, D)] + 1
tf(t, D) = freq(t, D) / max over t' of freq(t', D)
idf(t) = log(N / n), where n = # of docs containing t and N = # of docs in the corpus

weight(t,D) = tf(t,D) * idf(t)
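A short sketch of the first scheme above (raw frequency times log(N/n)); the natural logarithm is assumed since the slide does not fix a base, and the toy collection merely echoes the Haji example:

    import math
    from collections import Counter

    def tf_idf_weights(docs):
        """weight(t, D) = tf(t, D) * idf(t) with tf = freq(t, D) and idf = log(N / n)."""
        n_docs = len(docs)
        doc_freq = Counter()
        for terms in docs.values():
            doc_freq.update(set(terms))
        return {doc_id: {t: tf * math.log(n_docs / doc_freq[t])
                         for t, tf in Counter(terms).items()}
                for doc_id, terms in docs.items()}

    docs = {"D7":  ["haji"] * 4 + ["iman"],
            "D26": ["haji"] * 10,
            "D40": ["haji"] * 5,
            "D21": ["iman"] * 2}
    print(round(tf_idf_weights(docs)["D26"]["haji"], 2))   # 10 * log(4/3), about 2.88

The other tf variants (log-frequency, maximum-normalised frequency) drop into the same weight formula.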


Overview of feature extraction

[Pipeline diagram, step 3 (text-to-term mapping) highlighted. Example index entries:]

Term    Pos   #Doc   (Dj, tfj)
Abdul   5     2      (10, 1), (21, 2)
Agong   4     3      (2, 3), (6, 5), (31, 2)

Step 4: Stemming

How does word stemming work? Stemming broadens the results to include both word roots and word derivations. It is commonly accepted that removal of word endings (sometimes called suffix stripping) is a good idea; removal of prefixes can be useful in some subject domains. Why do we need word stemming in the context of free-text searching? Free-text searching searches exactly what we type into the search box, without changing it to a thesaurus term, yet morphological variants of words have similar semantic interpretations. A smaller dictionary size also results in savings of storage space and processing time.


Word stemming (cont.)

Algorithms for word stemming: a stemming algorithm converts a word to a related form; one of the simplest such transformations is the conversion of plurals to singulars. Common families are affix removal, successor variety, table lookup, and n-gram methods. In most languages, words have various inflected (or sometimes derived) forms; the different forms should not carry different meanings and should be mapped to a single form. However, in many languages it is not simple to derive the linguistic stem without a dictionary. At least for English, there exist algorithms that do not need a dictionary and still produce good results (the Porter algorithm).

Word stemming (cont.)

Pros and cons: word stemmers are used to conflate terms to improve retrieval effectiveness and/or to reduce the size of index files; they increase recall at the cost of decreased precision. Over-stemming and under-stemming also cause problems when retrieving documents.


Porter's Algorithm

The Porter stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980. The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the more common morphological and inflexional endings from words in English. It is the most effective and most widely used stemmer. Porter's algorithm works with the measure m of a stem, the number of vowel sequences that are followed by a consonant sequence; for many rules m must be greater than one for the rule to apply. A word can have any one of the forms C...C, C...V, V...V, V...C, all of which can be represented as [C](VC){m}[V].

Porter's Algorithm (cont.)

The rules in the Porter algorithm are separated into five distinct steps, numbered 1 to 5. They are applied to the words of the text starting from step 1 and moving on to step 5. Step 1 deals with plurals and past participles; the subsequent steps are much more straightforward. Ex.: plastered -> plaster, motoring -> motor. Step 2 deals with pattern matching on some common suffixes. Ex.: happy -> happi, relational -> relate, callousness -> callous. Step 3 deals with special word endings. Ex.: triplicate -> triplic, hopeful -> hope.


Porter's Algorithm (cont.)

Step 4 checks the stripped word against more suffixes in case the word is compounded. Ex.: revival -> reviv, allowance -> allow, inference -> infer. Step 5 checks whether the stripped word ends in a vowel and fixes it appropriately. Ex.: probate -> probat, cease -> ceas, controll -> control. The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure m. There is no linguistic basis for this approach.
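A brief illustration using the Porter stemmer implementation shipped with NLTK (assumed installed); the word list repeats the slide's examples, and the exact outputs are those of NLTK's implementation rather than claims from the slide:

    from nltk.stem import PorterStemmer   # assumes the nltk package is available

    stemmer = PorterStemmer()
    for word in ["plastered", "motoring", "callousness", "hopeful",
                 "revival", "allowance", "inference", "probate"]:
        print(f"{word:12s} -> {stemmer.stem(word)}")
    # e.g. plastered -> plaster, motoring -> motor, hopeful -> hope, allowance -> allow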

Dictionary-based stemming

A dictionary significantly improves the quality of stemming (note: the Porter algorithm does not derive a linguistically correct stem). It determines the correct linguistic stem for all words, but at the price of additional lookup costs and maintenance costs for the dictionary. The EuroWordNet initiative tries to develop a semantic dictionary for the European languages; next to words, the dictionary shall also contain inflected forms and relations between words (see next section). However, the usage of these dictionaries is not free (with the exception of WordNet for English). Names remain a problem of their own.
Examples of such dictionaries / ontologies:
EuroWordNet: http://www.illc.uva.nl/EuroWordNet/
GermaNet: http://www.sfs.uni-tuebingen.de/lsd/
WordNet: http://wordnet.princeton.edu/
We look at dictionary-based stemming with the example of Morphy, the stemmer of WordNet. Morphy combines two approaches to stemming: a rule-based approach for regular inflections, much like the Porter algorithm but much simpler, and an exception list with strong or irregular inflections of terms.


Stemming process
[Flowchart] An unstemmed word is first checked against the stop-word list (is it a stop word?). If not, the stemming algorithm (e.g. Porter's algorithm, Fatimah's algorithm, or a WordNet dictionary lookup) is applied: check whether the word is already in the word dictionary; if not, apply morphological rules (e.g. ber-...-an, me+, +lah) as prefix-suffix, suffix and infix rules to obtain the stemmed word.

Step 5: Mapping to index terms


Term extraction must further deal with homonyms (equal terms but different semantics) and synonyms (different terms but equal semantics). There are further relations between terms that may be useful to consider. The most common relationships:
- Homonyms (equal terms, different semantics): bank (shore vs. financial institute)
- Synonyms (different terms, equal semantics): walk, go, pace, run, sprint
- Hypernyms (umbrella term) / hyponyms (species): animal -> dog, cat, bird, ...
- Holonyms (is part of) / meronyms (has parts): door -> lock
The relationships above define a network (often denoted as an ontology) with terms as nodes and relations as edges. An occurrence of a term may then also be interpreted as an occurrence of nearby terms in this network (where "nearby" has to be defined appropriately). Example: a document contains the term dog; we may also interpret this as an occurrence of the term animal (with a smaller weight).


Step 5 : (cont.)

Some search engines do not implement steps 4 and 5; Google, for example, only recently improved its search capabilities with stemming. If the collection contains documents in different languages, cross-lingual approaches (automatically) translate or relate terms across languages and make documents retrievable even for queries in a different language than the document. Term extraction for queries is similar to term extraction for documents. If term extraction of the query implements step 5, step 5 can be omitted in term extraction of the documents in the collection. The query terms are then extended with nearby terms:
- Expansion with synonyms: Q = house -> Qnew = house, home, domicile, ...
- If a specialised search returns too few answers, exchange keywords with their hypernyms: e.g., Q = mare (female horse) -> Qnew = horse
- If a general search term returns too many results, let the user choose a more specialised term (relevance feedback) to reduce the result list: e.g., Q = horse -> Qnew = mare, pony, chestnut, pacer
A WordNet-based expansion sketch follows below.
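A small query-expansion sketch using WordNet via NLTK (assumes nltk and its wordnet corpus are installed, e.g. via nltk.download('wordnet')); expanding with synonyms and one level of hypernyms is just one of the strategies listed above:

    from nltk.corpus import wordnet as wn   # assumes nltk + the wordnet corpus are installed

    def expand(term):
        """Expand a noun query term with its synonyms and one level of hypernyms."""
        expanded = {term}
        for synset in wn.synsets(term, pos=wn.NOUN):
            expanded.update(l.replace("_", " ") for l in synset.lemma_names())
            for hyper in synset.hypernyms():
                expanded.update(l.replace("_", " ") for l in hyper.lemma_names())
        return expanded

    print(expand("mare"))   # typically contains 'female horse' lemmas and the hypernym 'horse'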

What is WordNet?

- A large lexical database, or "electronic dictionary", developed and maintained at Princeton University: http://wordnet.princeton.edu
- Includes most English nouns, verbs, adjectives and adverbs
- The electronic format makes it amenable to automatic manipulation
- Used in many Natural Language Processing applications (information retrieval, text mining, question answering, machine translation, AI/reasoning, ...)
- Wordnets are built for many languages


What's special about WordNet?

- Traditional paper dictionaries are organized alphabetically: words that are found together (on the same page) are not related by meaning
- WordNet is organized by meaning: words in close proximity are semantically similar
- Human users and computers can browse WordNet and find words that are meaningfully related to their queries (somewhat like in a hyperdimensional thesaurus)
- Meaning similarity can be measured and quantified to support Natural Language Understanding

A simple picture
animal (animate, breathes, has heart, ...)
  |
bird (has feathers, flies, ...)
  |
canary (yellow, sings nicely, ...)


Hypo-/hypernymy relates noun synsets


Creates relationships among more and less general concepts and thereby creates hierarchies; hierarchies can have up to 16 levels. Example:

{vehicle}
    {car, automobile}
        {convertible}
        {SUV}
    {bicycle, bike}
        {mountain bike}

A car is a kind of vehicle <=> The class of vehicles includes cars, bikes

Hyponymy
Transitivity:
A car is a kind of vehicle.
An SUV is a kind of car.
=> An SUV is a kind of vehicle.


Meronymy/holonymy (part-whole relation)


{car, automobile}
    {engine}
        {spark plug}
        {cylinder}

An engine has spark plugs. Spark plugs and cylinders are parts of an engine.

Meronymy/Holonymy
Inheritance:
A finger is part of a hand.
A hand is part of an arm.
An arm is part of a body.
=> A finger is part of a body.


Structure of WordNet (Nouns)


[Graph excerpt] Hypernym chain: {cruiser; squad car; patrol car; police car; prowl car} and {cab; taxi; hack; taxicab} have the hypernym {car; auto; automobile; machine; motorcar}, whose hypernym is {motor vehicle; automotive vehicle}, whose hypernym is {vehicle}, whose hypernym is {conveyance; transport}. Meronyms of {car}: {bumper}, {car door}, {car window}, {car mirror}; meronyms of {car door}: {hinge; flexible joint}, {doorlock}, {armrest}.

Homework
Select the 5 most frequent noun terms, and find homonyms, synonyms, hypernyms and holonyms of these terms. You may use WordNet at http://wordnet.princeton.edu/ (select "Use Wordnet Online"). Then create the noun ontology.


IR models
Overview Boolean Retrieval Fuzzy Retrieval Vector Space Retrieval Probabilistic Retrieval (BIR Model) Latent Semantic Indexing

Boolean search


Boolean model

Historically, documents were stored on tapes or punched cards, so searching allowed only sequential access. Today, Boolean search is still very frequent but is not state-of-the-art; Google uses it for its simplicity but further improves it by additionally sorting/ranking the result sets.
Model: a document D is represented by a binary vector d with di = 1 if term ti occurs in D. A query q comes from the query space Q; let t be an arbitrary term, and q1 and q2 be queries from Q; then Q is given by queries of the type: t, q1 ∧ q2, q1 ∨ q2, ¬q1.

Boolean model (cont.)


Term-document matrix

Query: Brutus AND Caesar AND NOT Calpurnia Take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100
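The same bitwise evaluation as a runnable sketch; the incidence vectors are taken directly from the example, while the mask and the bit ordering are just Python mechanics:

    # One bit per document, leftmost bit = first document in the collection.
    vectors = {"Brutus":    0b110100,
               "Caesar":    0b110111,
               "Calpurnia": 0b010000}

    n_docs = 6
    mask = (1 << n_docs) - 1                       # keep Python's ~ within 6 bits
    result = vectors["Brutus"] & vectors["Caesar"] & (~vectors["Calpurnia"] & mask)
    print(format(result, f"0{n_docs}b"))           # 100100: the 1st and 4th documents match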

Boolean retrieval
Query: Brutus AND Caesar AND NOT Calpurnia


Fuzzy retrieval

Fuzzy retrieval (cont.)


Vector-space model

Since the Boolean model's binary weights are too limiting, the vector space model supports partial matching. Non-binary weights are assigned to index terms in queries and documents, and these term weights are used to compute the degree of similarity between the documents in the database and the user's query.
[Figure: a document d shown as a vector in a three-dimensional term space with axes term1 = solat, term2 = ibadah, term3 = malam]

Vector-space model (cont.)

The tf metric is considered an indication of how well a term characterizes the content of a document. The idf, in turn, reflects the number of documents in the collection in which the term occurs, irrespective of the number of times it occurs in those documents.


Inverse document frequency

Document-Term-Matrix


Vector-space model (cont.)

Example

N = # of documents, M = # of terms


[Example term-document matrix over terms such as arrived, gold, silver, truck]
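A vector space ranking sketch using cosine similarity; the term weights are plain term frequencies and the three documents are hypothetical, merely echoing the gold/silver/truck terms of the example:

    import math
    from collections import Counter

    def cosine(d, q):
        """Cosine of the angle between two sparse term-weight vectors (dicts)."""
        dot = sum(d[t] * q[t] for t in d.keys() & q.keys())
        norm = (math.sqrt(sum(w * w for w in d.values()))
                * math.sqrt(sum(w * w for w in q.values())))
        return dot / norm if norm else 0.0

    docs = {"D1": Counter({"gold": 1, "truck": 1, "arrived": 1}),
            "D2": Counter({"silver": 2, "truck": 1, "arrived": 1}),
            "D3": Counter({"gold": 1, "silver": 1, "truck": 2})}
    query = Counter({"gold": 1, "silver": 1, "truck": 1})
    print(sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True))   # ['D3', 'D2', 'D1']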

Class exercises

Using the 10 most frequent terms in your story, create a term-document matrix for the Boolean model and for the vector space model.


Remarks

There are many more methods to determine the vector representations and to compute retrieval status values

Advantages:

- Simple model with efficient evaluation algorithms
- Partial-match queries are possible, i.e., it returns documents that only partly contain the query terms (similar to the OR-operator of Boolean retrieval)
- Very good retrieval quality, though not state-of-the-art
- Relevance feedback may further improve vector space retrieval

Disadvantages:

- Main assumption of vector space retrieval, that terms occur independently of each other in documents, is not true: if one writes about "Mercedes", the term "car" is likely to co-occur in the document
- Many heuristics and simplifications; no proof of "correctness" of the result set
- HTML/Web: the occurrence of terms is not the most important criterion for ranking documents (spamming)

