Information Searching

8/8/2011
Information Search
Traditional Search
Web Search
Metadata-based Search
Semantic Search
Traditional Search
A collection is a set of documents related to a specific context of interest.
The indexing process is applied to the full text of the documents.
Classical Search
Web Information Searching

Search Engine Architecture
Web Information Searching

Source: Web Searching & Information Retrieval, IEEE Web Engineering, 2004
Search engines index each web page by representing it as a set of weighted keywords.
Robots (spiders) crawl through the web and pick up useful pages for the search engine.
Indexing of these pages includes:
Removing frequent or non-significant words (stop words such as "and", "be")
Stemming, which removes derivational suffixes and retains the root (thinking, thinkers, thinks all reduce to think)
Representing each page found by a set of weighted keywords
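The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular search engine: the stop-word list, the naive suffix list, and the normalization of weights by the most frequent stem are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"and", "be", "the", "of", "a", "to", "is"}
# Naive suffix stripping (longest suffix first); real stemmers are more careful.
SUFFIXES = ("ers", "ing", "er", "s")

def stem(word):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def weighted_keywords(text):
    """Return {stem: weight}, with weights normalized by the most frequent stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [stem(t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(stems)
    if not counts:
        return {}
    top = max(counts.values())
    return {term: n / top for term, n in counts.items()}

page = "Thinkers think, and thinking is the work of a thinker."
keywords = weighted_keywords(page)
```

Here the four variants of "think" collapse into one keyword, so it receives the highest weight on the page.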
Crawler/ Robot/ Spider

A crawler is a program, controlled by a crawl control module, that browses the web.
It collects documents by recursively fetching links from a set of start pages; the retrieved pages (or parts of them) are compressed and stored in a page repository.
URLs and their links form a web graph, which the crawler control module can use to decide on further crawling.
To save space, a docID represents each page in the index.
The indexer processes the pages collected by the crawler. It decides which pages to index; duplicate documents are discarded.
An inverted index is built, which contains for each word a sorted list of couples (such as docID and position in the document).
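A small sketch of the crawl-then-index pipeline described above. To stay runnable without network access it crawls a hypothetical in-memory "web" (the `WEB` dictionary and its URLs are invented for illustration), follows links breadth-first rather than by literal recursion, skips compression, and treats only exact text matches as duplicates.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("semantic web search", ["b.example", "c.example"]),
    "b.example": ("web search engine", ["a.example"]),
    "c.example": ("semantic web search", []),  # same text as a.example
}

def crawl(start_urls):
    """Fetch every page reachable from the start pages, breadth-first."""
    repository, seen, queue = {}, set(start_urls), deque(start_urls)
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        repository[url] = text
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return repository

def build_index(repository):
    """Assign docIDs, discard duplicate texts, and build the inverted index."""
    doc_ids, index, seen_texts = {}, defaultdict(list), set()
    for url, text in repository.items():
        if text in seen_texts:
            continue  # exact-duplicate documents are discarded
        seen_texts.add(text)
        doc_id = len(doc_ids)
        doc_ids[doc_id] = url
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))  # (docID, position) couple
    return doc_ids, dict(index)

doc_ids, index = build_index(crawl(["a.example"]))
```

Because docIDs are assigned in increasing order and positions are scanned left to right, each posting list comes out already sorted.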
Query Engine
The query engine processes user queries and returns matching answers using a ranking algorithm.
The algorithm produces a numerical score expressing the importance of an answer with respect to the query.
Utility data structures contain lists of related pages, which can facilitate search.
Various query-independent as well as query-dependent data are used to decide ranking (date of modification, site, number of links to other pages, or actual content of documents).
Query-dependent criteria include the cosine measure of similarity in the vector space model.
TIR(1)
Classical Models: Boolean, Vector, Probabilistic
The vector model is the most popular
Documents and queries are represented as vectors
TIR (2)

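Weighting algorithm: tf-idf (term frequency-inverse document frequency)
The weight $w_{ij}$ corresponding to the $i$th component of the vector representation of document $d_j$ is given by
$$w_{ij} = tf_{ij} \cdot idf_i$$
where
$$tf_{ij} = \frac{f_{ij}}{\max_l f_{lj}}$$
Here $f_{ij}$ is the frequency of term $k_i$ in document $d_j$, and the maximum is computed over all terms mentioned in $d_j$.
The inverse document frequency for term $k_i$ is given by
$$idf_i = \log \frac{N}{n_i}$$
where $N$ is the total number of documents and $n_i$ is the number of documents in which $k_i$ appears.
The relevance ranking is
$$sim(d_j, q) = \cos \theta = \frac{d_j \cdot q}{|d_j| \, |q|}$$
where $\theta$ is the angle between the document vector $d_j$ and the query vector $q$.
Thus very frequent terms receive a low weight, while uncommon terms appearing in few documents receive a high weight.
The model rests on the assumption that the index terms are independent.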
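The tf-idf weighting and cosine ranking can be tried out on a toy collection. The sketch below follows the normalized term frequency (divide by the most frequent term in the document) and the logarithmic idf; the three documents and the query are hypothetical, and weighting the query with the same scheme as the documents is an assumption.

```python
import math
from collections import Counter

docs = {  # hypothetical toy collection
    "d1": "semantic web search",
    "d2": "web search engine web",
    "d3": "library metadata",
}

def tf(tokens):
    """Term frequency normalized by the most frequent term in the document."""
    counts = Counter(tokens)
    top = max(counts.values())
    return {t: f / top for t, f in counts.items()}

tokenized = {name: text.split() for name, text in docs.items()}
N = len(tokenized)  # total number of documents
# n[t] = number of documents in which term t appears
n = Counter(t for tokens in tokenized.values() for t in set(tokens))
idf = {t: math.log(N / n_t) for t, n_t in n.items()}

def vector(tokens):
    """tf-idf vector as a sparse dict; terms unknown to the collection are dropped."""
    return {t: w * idf[t] for t, w in tf(tokens).items() if t in idf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = vector("semantic search".split())
ranking = sorted(
    tokenized,
    key=lambda d: cosine(vector(tokenized[d]), query),
    reverse=True,
)
```

In this collection "semantic" occurs in one document and "web" in two, so "semantic" gets the larger idf, and the only document containing both query terms ranks first.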
Web Searching (2)

The weighting procedure considers:
If a term appears more frequently than other terms in a page, its associated weight can be increased.
If a term appears in many pages, its weight is decreased (it may not be useful in discriminating between items).
Usually, greater weights are assigned to short pages than to longer ones.
The inverted file is updated so that, for each keyword, the system can find a list of all web pages (with associated weights) indexed under that term.
The degree of similarity can be calculated using this data.
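As a rough illustration of how a weighted inverted file supports similarity calculation, the sketch below looks up each query keyword and accumulates the stored weights per page. The file contents are hypothetical, and summing weights (with every query term counted once) is a simplifying assumption rather than the only way to compute similarity.

```python
from collections import defaultdict

# Hypothetical weighted inverted file: keyword -> [(page, weight), ...]
inverted_file = {
    "semantic": [("page1", 0.9), ("page3", 0.4)],
    "search": [("page1", 0.3), ("page2", 0.7)],
}

def score(query_terms):
    """Accumulate per-page weights for the query terms, best page first."""
    scores = defaultdict(float)
    for term in query_terms:
        for page, weight in inverted_file.get(term, []):
            scores[page] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

results = score(["semantic", "search"])
```

Only the posting lists of the query keywords are read, which is why the inverted file makes the calculation cheap.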
Web searching (3)

To improve search:
Give more credit to words appearing in the title field
Consider the distance between search keywords appearing within a page
Use different models for assigning weights: probabilistic or language-based
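The first two improvements can be sketched as adjustments to a simple count-based score. Everything numeric here is an assumption for illustration: the `TITLE_BOOST` factor, the `1 / distance` proximity bonus, and the restriction of the proximity bonus to two-keyword queries.

```python
TITLE_BOOST = 2.0  # hypothetical extra credit for a keyword found in the title

def min_distance(tokens, term_a, term_b):
    """Smallest gap (in words) between any occurrences of the two terms."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

def score(title, body, terms):
    """Count keyword hits, boost title hits, and reward keywords that sit close together."""
    title_tokens = title.lower().split()
    body_tokens = body.lower().split()
    total = 0.0
    for term in terms:
        total += body_tokens.count(term) + TITLE_BOOST * title_tokens.count(term)
    if len(terms) == 2:
        distance = min_distance(body_tokens, terms[0], terms[1])
        if distance:  # skips None (term missing) and 0 (identical terms)
            total += 1.0 / distance
    return total

body = "semantic web search uses ontologies for search"
with_title = score("Semantic search", body, ["semantic", "search"])
without_title = score("Ontologies", body, ["semantic", "search"])
```

The same body text scores higher when the keywords also appear in the title.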
TIR (3)
Problems with TIR
Keyword-based search
Measure of relevancy of retrieved documents
Semantic Metadata
Data that may be associated explicitly or implicitly with a given piece of content, and whose relevance for that content is determined by its ontological position (its context) within the domain of knowledge
Helps in classification and high-precision searching
Named entity recognition involves finding items of potential interest within a piece of text (person, place, thing, event); these are stored in the ontology
Metadata is a snapshot of the document's relevant information
The metadata contained within the snapshot references the instances of the named entities, which are stored in the ontology
Relevant Information
Types of Specs and Standards (or MetaModels)

Domain Independent: MCF (Meta Content Framework), RDF, MOF (Meta Object Facility), Dublin Core
Media Specific: MPEG4, MPEG7, VoiceXML
Domain/Industry Specific (metamodels): MARC, DCMI, METS (Library); FGDC and UDK (Geographic); NewsML (News); PRISM (Publishing Requirements for Industry Standard Metadata)
Application Specific: ICE, Information & Content Exchange (communication between sender and receiver)
Exchange/Sharing: XCM, XMI
Other Models: RDFS, namespaces, ontologies (DAML, OIL)
Dublin Core Metadata Initiative DCMI
1995-96
Simple element set designed for domain-independent resource description
15 elements are defined by this standard
International, interdisciplinary, W3C community consensus
Semantic interface among resource description communities (a very limited form of semantics)
DCMI (2002)
http://dublincore.org/documents/usageguide/elements.shtml
Title: name given to the resource
Contributor: entity responsible for making contributions to the content of the resource
Creator:
Publisher: entity responsible for making the resource available
Subject & keywords: topic of the content of the resource
Description: an account of the content of the resource
Format: data representation of the resource
Resource identifier: unambiguous reference to the resource
Language
Rights: copyright notice/statement
Date, Type, Source, Relation, Coverage
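A Dublin Core description can be pictured as a flat record whose keys are drawn from the 15-element set. The sketch below uses the lowercase element names of the Dublin Core element set; the record values and the `unknown_elements` helper are hypothetical, and a plain dictionary is only one possible encoding.

```python
# The 15 elements of the Dublin Core element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

def unknown_elements(record):
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(set(record) - DC_ELEMENTS)

record = {  # hypothetical example values
    "title": "Information Searching",
    "subject": "information retrieval; semantic search",
    "language": "en",
    "format": "application/pdf",
}
```

All elements are optional in this sketch, so a partial record such as the one above passes the check.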
DDI Data Documentation Initiative

DDI Data Documentation Initiative: Technical documentation of social, behavioural, and economic data
SDMX Statistical Data and Metadata Exchange: Used by statisticians for exchange of time series data
Creating and Serving Metadata to Power the Lifecycle of Content
[Figure labels: Taalee Infrastructure Services; Taalee Content Applications; Taalee Semantic MetaBase; Produce, Aggregate, Catalog/Index, Integrate, Syndicate, Personalize, Interactive Marketing; channels: Broadcast, Wireline, Wireless, Interactive TV. Questions shown: Where is the content? Whose is it? What is this content about? What other content is it related to? What is the right content for this user? What is the best way to monetize this interaction?]
Intelligent Search
Intelligent Search using Ontologies
[Figure labels: User, Ontology, Query, Answer; Mediator1: Ontology 1, Mediator2: Ontology 2, Mediator3: Ontology 3]
SWIR
Use of the Vector Space Model for the Semantic Web (SW): at the semantic level, documents can be represented as vectors in a hyperspace defined by the set of all ontology concepts
The weight of a concept is the relative importance of that concept
SWIR needs:
A good domain ontology
An understanding of the semantic relationships among ontological concepts
SWIR(3)
Weights are assigned to links based on certain properties of the ontology; each weight represents the strength of the relation
The spread activation technique is used to find related concepts in the ontology, given an initial set of concepts and initial weights
SWIR (4) : Weighting Algorithm

By analogy with the tf-idf strategy of traditional IR, two measures are combined: the first (cluster) measure gives the degree of similarity between two related concept instances in a relation, and the second (specificity) measure gives the specificity of the concept relation
The cluster measure for concept instances $C_j$ and $C_k$ is given by:

$$W(C_j, C_k) = \frac{\sum_i n_{ijk}}{\sum_i n_{ij}}$$
where $n_{ij}$ is 1 if concepts $C_j$ and $C_i$ are related (0 otherwise), and $n_{ijk}$ is 1 if both $C_j$ and $C_k$ are related to concept $C_i$ (0 otherwise). $W(C_j, C_k)$ therefore represents the fraction of the concepts related to $C_j$ that are also related to $C_k$.
This measure reflects the fact that concepts sharing common relations are semantically similar
SWIR (5)
The specificity measure is given by:
$$W(C_j, C_k) = \frac{1}{n_k}$$
where $n_k$ is the number of instances of the given relation type that have $C_k$ as their destination node
The actual weight is the product of the cluster and specificity measures
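The two measures and their product can be computed on a small concept graph. This is a sketch under stated assumptions: the ontology fragment is hypothetical, all links are treated as instances of a single relation type, two concepts count as "related" when a link exists in either direction, and the cluster measure is read as the share of the concepts related to Cj that are also related to Ck.

```python
# Hypothetical ontology fragment: relation instances as (source, destination).
relations = [
    ("SemanticWeb", "RDF"),
    ("SemanticWeb", "Ontology"),
    ("RDF", "Ontology"),
    ("Search", "Ontology"),
]

def neighbours(concept):
    """Concepts related to the given concept, ignoring link direction."""
    outgoing = {dst for src, dst in relations if src == concept}
    incoming = {src for src, dst in relations if dst == concept}
    return outgoing | incoming

def cluster(cj, ck):
    """Share of Cj's related concepts that are also related to Ck."""
    related_j = neighbours(cj)
    if not related_j:
        return 0.0
    return len(related_j & neighbours(ck)) / len(related_j)

def specificity(ck):
    """1 / n_k, where n_k counts relation instances with Ck as destination."""
    n_k = sum(1 for _, dst in relations if dst == ck)
    return 1.0 / n_k if n_k else 0.0

def link_weight(cj, ck):
    """Product of the cluster and specificity measures."""
    return cluster(cj, ck) * specificity(ck)

w_rdf = link_weight("SemanticWeb", "RDF")
w_ontology = link_weight("SemanticWeb", "Ontology")
```

"Ontology" is the destination of three links, so it is less specific than "RDF" and the link into it receives the smaller weight.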
SWIR (6): Spread Activation Algorithm

Given an initial set of concepts, the algorithm obtains a set of closely (semantically) related concepts by navigating through the linked concepts in the graph
The starting point of the algorithm is an initial set of instances in the ontology, each having an initial activation value
Constrained spread activation applies constraints such as maximum path length and fan-out to the propagation
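A minimal sketch of constrained spread activation over a weighted concept graph. The graph, the link weights, and the propagation rule (activation passed on is the sender's value times the link weight, and incoming activation is summed) are assumptions chosen for illustration; `max_path_length` and `max_fan_out` are hypothetical parameter names for the two constraints mentioned above.

```python
# Hypothetical weighted concept graph: node -> {neighbour: link weight}.
graph = {
    "SemanticWeb": {"RDF": 0.5, "Ontology": 0.8},
    "RDF": {"Ontology": 0.6},
    "Ontology": {"Reasoning": 0.9},
    "Reasoning": {"Logic": 0.9},
    "Logic": {},
}

def spread_activation(initial, max_path_length=2, max_fan_out=3):
    """Propagate activation from the initial concepts for a bounded number of hops."""
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(max_path_length):  # path-length constraint
        next_frontier = {}
        for node, value in frontier.items():
            links = graph.get(node, {})
            if len(links) > max_fan_out:
                continue  # fan-out constraint: do not spread through hub nodes
            for neighbour, weight in links.items():
                passed = value * weight
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + passed
        for node, value in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = next_frontier
    return activation

result = spread_activation({"SemanticWeb": 1.0})
```

With a maximum path length of 2, "Logic" (three hops away) is never activated, while "Ontology" collects activation along two different paths.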
Conclusion (1)
Traditional information retrieval: small, static, homogeneous, centrally located, monolingual document collections
Web information retrieval: huge volumes of data that are volatile, heterogeneous, distributed, and multilingual
Semantic web information retrieval is ontology-based intelligent information retrieval
Various semantic search strategies are explored
Two major differences:
Keyword vs. concept
Response time as a part of the relevancy measure
The most successful semantic search algorithms are the Vector Space Model and the hybrid approach, which uses classical techniques with the spread activation algorithm
Conclusion (2)
The concepts that form the basis of the semantic domain model are not orthogonal. This issue can be addressed by reassigning the weights of concept links based on the relationship graph of the ontology concepts
The spread activation algorithm has been used to deduce relationships from a given set of relationships
SIR has been visualized as a 4-layer process: keywords, indexed keywords, semantic concepts, relationships
References
Berners-Lee T., Hendler J., Lassila O., The Semantic Web, Scientific American, 2001, 284: 35-43
Baeza-Yates R., Ribeiro-Neto B., Modern Information Retrieval, 1st edition, Addison-Wesley, 1999
Pokorny J., Web Searching and Information Retrieval, Web Engineering, July/August 2004, 43-48
