
Information Searching

8/8/2011

Information Search
Traditional Search
Web Search
Metadata-based Search
Semantic Search


Traditional Search
A collection is a set of documents related to a specific context of interest.
The indexing process is applied to the full text of the documents.


Classical Search


Web Information Searching


Search Engine Architecture


Web Information Searching


Web Searching & Information Retrieval, IEEE Web Engineering, 2004
Search engines index each web page by representing it as a set of weighted keywords.
Robots (spiders) crawl the web and pick up useful pages for the search engine.
Indexing of these pages includes:
Removing frequent or non-significant words (stop words such as "and", "be").
Stemming, which removes derivational suffixes and retains the root (think from thinking, thinkers, thinks).
Representing each page found by a set of weighted keywords.
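The two indexing steps above can be sketched as follows. This is only an illustration: the stop-word list, the suffix list, and the function names `stem` and `index_terms` are assumptions made for the example, and the suffix stripping is far cruder than a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word and suffix lists.
STOP_WORDS = {"and", "be", "the", "a", "of", "to", "is"}
SUFFIXES = ("ers", "ing", "er", "s")  # longer suffixes are tried first


def stem(word):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]
```

With these lists, "thinking", "thinkers" and "thinks" all reduce to the same index term, so a query for any one of them can match pages containing the others.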


Crawler/ Robot/ Spider


The crawler is a program, controlled by a crawl module, that browses the web.
It collects documents by recursively fetching links from a set of start pages; the retrieved pages (or parts of them) are compressed and stored in the page repository.
URLs and their links form the web graph, which the crawler control module can use to decide on further crawling.
To save space, a docID represents each page in the index.
The indexer processes the pages collected by the crawler. It decides which pages to index; duplicate documents are discarded.
An inverted index is built which contains, for each word, a sorted list of pairs (such as docID and position in the document).
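A minimal sketch of the indexer step, assuming the page repository is simply a dictionary from docID to page text. The function name `build_inverted_index` and the exact-match duplicate test are assumptions for illustration; real indexers detect near-duplicates and store compressed postings.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each word to a sorted list of (docID, position) pairs.

    pages: dict of docID -> page text. Pages whose text exactly repeats
    an earlier page are discarded as duplicates.
    """
    index = defaultdict(list)
    seen_texts = set()
    for doc_id in sorted(pages):          # ascending docID keeps lists sorted
        text = pages[doc_id]
        if text in seen_texts:            # discard exact duplicates
            continue
        seen_texts.add(text)
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)


pages = {1: "web search engine", 2: "web crawler", 3: "web crawler"}
index = build_inverted_index(pages)
```

Here page 3 duplicates page 2, so only docIDs 1 and 2 appear in the list stored for "web".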


Query Engine
The query engine processes user queries and returns matching answers using a ranking algorithm.
The algorithm produces a numerical score expressing the importance of an answer with respect to the query.
Utility data structures contain lists of related pages, which can facilitate search.
Both query-independent and query-dependent data are used to decide the ranking (date of modification, site, number of links to other pages, or the actual content of the documents).
Query-dependent criteria include the cosine measure of similarity in the vector space model.


TIR(1)
Classical Models: Boolean, Vector, Probabilistic

The vector model is the most popular
Documents and queries are represented as vectors


TIR (2)

Weighting algorithm: tf-idf (term frequency-inverse document frequency).
The weight $w_{ij}$ of the $i$-th component of the vector representing document $d_j$ is

$$w_{ij} = tf_{ij} \times idf_i$$

where the normalized term frequency is

$$tf_{ij} = \frac{f_{ij}}{\max_l f_{lj}}$$

with $f_{ij}$ the frequency of term $k_i$ in $d_j$ and the maximum computed over all terms mentioned in $d_j$. The inverse document frequency of $k_i$ is

$$idf_i = \log \frac{N}{n_i}$$

where $N$ is the total number of documents and $n_i$ is the number of documents containing $k_i$. The relevance ranking is

$$sim(d_j, q) = \cos\theta = \frac{d_j \cdot q}{|d_j|\,|q|}$$

Thus very frequent terms receive a low weight, while uncommon terms appearing in few documents receive a high weight.
The model rests on the assumption that the index terms are independent.
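The weighting and ranking formulas on this slide can be tried out with a short sketch. The function names `tf_idf_vectors` and `cosine` are illustrative, documents are assumed to be non-empty lists of already tokenized terms, and the natural logarithm is used because the slide does not fix a base.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return ({term: weight} per document, idf per term).

    docs: list of non-empty token lists.
    Weight = (f_ij / max_l f_lj) * log(N / n_i).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / n) for term, n in doc_freq.items()}
    vectors = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values())
        vectors.append({t: (f / max_freq) * idf[t] for t, f in freq.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [["web", "search", "web"], ["semantic", "search"], ["web", "crawler"]]
vectors, idf = tf_idf_vectors(docs)
```

In this toy collection "semantic" and "crawler" occur in one document each and so get a higher idf than "web" and "search", which occur in two. A term occurring in every document would get idf 0 and drop out of the ranking entirely.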


Web Searching (2)


The weighting procedure considers:
If a term appears more frequently than other terms, its weight can be increased.
If a term appears in many pages, its weight is decreased (it may not be useful for discriminating between items).
Usually greater weights are assigned to short pages than to longer ones.
The inverted file is updated so that, for each keyword, the system can find a list of all web pages (with associated weights) indexed under this term.
The degree of similarity can be calculated using this data.
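As a rough sketch of how the weighted inverted file supports the similarity calculation, the example below stores (page, weight) pairs per keyword and scores each page by summing the weights of the query terms it contains. The summing rule, the function names, and the page names are assumptions for illustration; the slides do not prescribe a particular scoring formula here.

```python
from collections import defaultdict


def build_weighted_inverted_file(page_weights):
    """Turn {page: {keyword: weight}} into {keyword: [(page, weight), ...]}."""
    inverted = defaultdict(list)
    for page, weights in page_weights.items():
        for keyword, weight in weights.items():
            inverted[keyword].append((page, weight))
    return dict(inverted)


def score_pages(inverted, query_terms):
    """Sum the stored weights of the query terms for every matching page."""
    scores = defaultdict(float)
    for term in query_terms:
        for page, weight in inverted.get(term, []):
            scores[page] += weight
    # Highest score first; ties broken by page name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


page_weights = {
    "a.html": {"web": 0.5, "search": 0.25},
    "b.html": {"web": 0.25},
}
inverted = build_weighted_inverted_file(page_weights)
ranking = score_pages(inverted, ["web", "search"])
```

Only the lists for the query terms are read, which is why the inverted file makes the lookup cheap compared with scanning every page.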


Web searching (3)


To improve search:
Give more credit to words appearing in the title field.
Consider the distance between the search keywords appearing within a page.
Use different models for assigning weights: probabilistic or language-based.
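The first two improvements can be illustrated with a small sketch. The boost factor of 2.0 and the function names are hypothetical choices; the slides do not say how much credit title words should receive or how distance should enter the score.

```python
def min_keyword_distance(tokens, term_a, term_b):
    """Smallest gap, in token positions, between two keywords in a page.

    Returns None when either keyword is absent.
    """
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not positions_a or not positions_b:
        return None
    return min(abs(i - j) for i in positions_a for j in positions_b)


def boosted_score(base_score, title_tokens, query_terms, title_boost=2.0):
    """Multiply the score when any query term appears in the title."""
    if any(term in title_tokens for term in query_terms):
        return base_score * title_boost
    return base_score


tokens = "semantic web search engines index the web".split()
```

A smaller distance suggests the keywords are used together, so a ranking function could favour pages where it is low.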


TIR (3)
Problems with TIR
Keyword-based search
Measure of relevancy of the retrieved documents


Semantic Metadata
Data which may be associated explicitly or implicitly with a given piece of content, and whose relevance for that content is determined by its ontological position (its context) within the domain of knowledge.
Helps in classification and high-precision searching.
Named entity recognition involves finding items of potential interest within a piece of text (person, place, thing, event); these are stored in the ontology.
Metadata is a snapshot of the document's relevant information.
Metadata contained within the snapshot references the instances of the named entities, which are stored in the ontology.


Relevant Information


Types of Specs and Standards (or MetaModels)


Domain Independent: MCF (Meta Content Framework), RDF, MOF (Meta Object Facility), Dublin Core
Media Specific: MPEG4, MPEG7, VoiceXML
Domain/Industry Specific (metamodels): MARC, DCMI, METS (Library), FGDC and UDK (Geographic), NewsML (News), PRISM (Publishing Requirements for Industry Standard Metadata)
Application Specific: ICE (Information & Content Exchange), for communication between sender and receiver
Exchange/Sharing: XCM, XMI
Other Models: RDFS, namespaces, ontologies (DAML, OIL)

Dublin Core Metadata Initiative DCMI

1995-96
Simple element set designed for domain-independent resource description
15 elements are defined by this standard
International, interdisciplinary, W3C community consensus
Semantic interface among resource description communities (a very limited form of semantics)


DCMI (2002)
http://dublincore.org/documents/usageguide/elements.shtml

Title: name given to the resource
Contributor: entity responsible for making contributions to the content of the resource
Creator
Publisher: an entity responsible for making the resource available
Subject & keywords: topic of the content of the resource
Description: an account of the content of the resource
Format: data representation of the resource
Resource identifier: unambiguous reference
Language
Rights: copyright notice/statement
Date, Type, Source, Relation, Coverage
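To make the element set concrete, the sketch below lists the 15 Dublin Core element names and checks a metadata record against them. The record contents, the lowercase key convention, and the helper `unknown_elements` are illustrative assumptions, not part of the standard.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def unknown_elements(record):
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(key for key in record if key not in DC_ELEMENTS)


# Hypothetical record describing this slide deck; every element is optional.
record = {
    "title": "Information Searching",
    "date": "2011-08-08",
    "language": "en",
    "subject": ["information retrieval", "semantic search"],
}
```

Because every element is optional and repeatable, a record may use any subset of the 15 names, as the example does.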

DDI Data Documentation Initiative


DDI Data Documentation Initiative: Technical documentation of social, behavioural, and economic data

SDMX Statistical Data and Metadata Exchange: Used by statisticians for exchange of time series data


Creating and Serving Metadata to Power the Lifecycle of Content

[Figure: the Taalee Semantic MetaBase sits between Taalee Infrastructure Services (Produce, Aggregate, Catalog/Index) and Taalee Content Applications (Personalize, Interactive Marketing, Integrate, Syndicate), with delivery over Broadcast, Wireline, Wireless, and Interactive TV.]

Questions shown in the figure:
Where is the content? Whose is it?
What is this content about?
What other content is it related to?
What is the right content for this user?
What is the best way to monetize this interaction?

Intelligent Search


Intelligent Search using Ontologies

[Figure: User, Ontology, Query, and Answer, connected to Mediator1 (Ontology 1), Mediator2 (Ontology 2), and Mediator3 (Ontology 3).]

SWIR
Use of the Vector Space Model for the Semantic Web: at the semantic level, documents can be represented as vectors in a hyperspace defined by the set of all ontology concepts.
The weight of a concept is the relative importance of that concept.
SWIR needs:
A good domain ontology
An understanding of the semantic relationships among ontological concepts


SWIR(3)

Weights are assigned to links based on certain properties of the ontology, representing the strength of the relation.
The spread activation technique is used to find related concepts in the ontology, given an initial set of concepts and initial weights.


SWIR (4) : Weighting Algorithm


By analogy with the tf-idf strategy of traditional IR, two measures are combined: the first (the cluster measure) gives the degree of similarity between two related concept instances in a relation, and the second (the specificity measure) gives the specificity of the concept relation.

The cluster measure for concept instances $C_j$ and $C_k$ is given by:

$$W(C_j, C_k) = \frac{\sum_i n_{ijk}}{\sum_i n_{ij}}$$

where $n_{ij}$ is 1 if concepts $C_j$ and $C_i$ are related (0 otherwise) and $n_{ijk}$ is 1 if both $C_j$ and $C_k$ are related to concept $C_i$ (0 otherwise). $W(C_j, C_k)$ is therefore the fraction of the concepts related to $C_j$ that are also related to $C_k$.
This measure reflects the fact that concepts sharing common relations are semantically similar.
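A small sketch of the cluster measure, assuming the counts are summed over all concepts Ci so that the result is the share of Cj's related concepts that are also related to Ck. The adjacency-set representation, the function name, and the example concepts are hypothetical.

```python
def cluster_measure(related, cj, ck):
    """Fraction of the concepts related to cj that are also related to ck.

    related: dict mapping each concept to the set of concepts it is
    related to. Returns 0.0 when cj has no related concepts.
    """
    neighbours_j = related.get(cj, set())
    if not neighbours_j:
        return 0.0
    shared = neighbours_j & related.get(ck, set())
    return len(shared) / len(neighbours_j)


# Hypothetical ontology fragment.
related = {
    "Paper": {"Author", "Conference", "Topic", "Journal"},
    "Book": {"Author", "Topic", "Publisher"},
}
```

The measure is not symmetric: "Paper" shares two of its four relations with "Book", while "Book" shares two of its three with "Paper".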


SWIR (5)
The specificity measure is given by:

$$W(C_j, C_k) = \frac{1}{n_k}$$

where $n_k$ is the number of instances of the given relation type that have $C_k$ as their destination node.

The actual measure is the product of cluster and specificity measures
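The specificity measure and the final product can be sketched as below, using 1/n_k exactly as written on the slide. Relations are assumed to be stored as (source, relation type, destination) triples; that representation, the function names, and the example data are illustrative.

```python
def specificity_measure(triples, relation_type, ck):
    """1 / n_k, where n_k counts relation instances of the given type
    whose destination node is ck. Returns 0.0 when there are none."""
    n_k = sum(
        1 for _, rel, dst in triples if rel == relation_type and dst == ck
    )
    return 1.0 / n_k if n_k else 0.0


def link_weight(cluster, specificity):
    """Final link weight: product of cluster and specificity measures."""
    return cluster * specificity


# Hypothetical relation instances.
triples = [
    ("p1", "hasTopic", "IR"),
    ("p2", "hasTopic", "IR"),
    ("p3", "hasTopic", "SemanticWeb"),
    ("p1", "writtenBy", "alice"),
]
```

A destination reached by many instances of the same relation type is less specific, so each link into it gets a lower weight, much as idf lowers the weight of a term found in many documents.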


SWIR (6): Spread Activation Algorithm


Given an initial set of concepts, the algorithm obtains a set of closely (semantically) related concepts by navigating through the linked concepts in the graph.
The algorithm takes as its starting point an initial set of instances in the ontology, each with an initial activation value.
Constrained spread activation applies constraints such as maximum path length and fan-out to the propagation.
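A minimal sketch of constrained spread activation, under several assumptions that the slides leave open: activation passed along a link is the sender's value times the link weight, values arriving at a node are added together, nodes with more outgoing links than `max_fan_out` do not propagate, and contributions below `threshold` are dropped. All names and default values are hypothetical.

```python
def spread_activation(graph, initial, max_path_length=2, max_fan_out=3,
                      threshold=0.1):
    """Propagate activation through a weighted concept graph.

    graph: {node: {neighbour: link_weight}}
    initial: {node: initial activation value}
    Returns the accumulated activation of every reached node.
    """
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(max_path_length):       # path-length constraint
        next_frontier = {}
        for node, value in frontier.items():
            links = graph.get(node, {})
            if len(links) > max_fan_out:   # fan-out constraint: skip hubs
                continue
            for neighbour, weight in links.items():
                passed = value * weight
                if passed < threshold:     # ignore weak contributions
                    continue
                next_frontier[neighbour] = (
                    next_frontier.get(neighbour, 0.0) + passed
                )
        for node, value in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = next_frontier
    return activation


graph = {
    "IR": {"Search": 0.5, "Index": 0.25},
    "Search": {"Ranking": 0.5},
    "Index": {},
}
result = spread_activation(graph, {"IR": 1.0})
```

Starting from "IR", the concept "Ranking" is reached only on the second step, so lowering `max_path_length` to 1 leaves it out of the result.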


Conclusion (1)
Traditional information retrieval: small, static, homogeneous, centrally located, monolingual document collections.
Web information retrieval: huge volumes of data which are volatile, heterogeneous, distributed, and multilingual.
Semantic web information retrieval is ontology-based intelligent information retrieval.
Various semantic search strategies are explored.
Two major differences:
Keyword vs. concept
Response time as a part of the relevancy measure

The most successful semantic search algorithms are the Vector Space Model and the hybrid approach, which combines the classical technique with the spread activation algorithm.

Conclusion (2)
The concepts which form the basis of the semantic domain model are not orthogonal. This issue can be addressed by reassigning the weights of concept links based on the relationship graph of the ontology concepts.
The spread activation algorithm has been used to deduce relationships from a given set of relationships.
Semantic information retrieval (SIR) has been visualized as a 4-layer process: keywords, indexed keywords, semantic concepts, relationships.


References
Berners-Lee T., Hendler J., Lassila O., The Semantic Web, Scientific American, 2001, 284: 35-43
Baeza-Yates R., Ribeiro-Neto B., Modern Information Retrieval, 1st edition, Addison-Wesley, 1999
Pokorny J., Web Searching and Information Retrieval, Web Engineering, July/August 2004, 43-48
