8/8/2011
Information Search
Traditional Search
Web Search
Metadata-based Search
Semantic Search
Traditional Search
A collection of documents is a set of documents related to a specific context of interest.
The indexing process is applied to the full text of the documents.
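Full-text indexing of a collection can be sketched as an inverted index. This is a minimal illustration, not the deck's own implementation: the whitespace tokenizer, the sample documents, and the function names `build_inverted_index` and `boolean_and` are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: naive whitespace tokenization of the full text.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical three-document collection.
docs = {
    "d1": "semantic web search",
    "d2": "web search engine",
    "d3": "metadata for documents",
}
index = build_inverted_index(docs)
print(sorted(boolean_and(index, ["web", "search"])))
```

The index is built once over the whole collection, so a keyword query only has to intersect the posting sets of its terms.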
Classical Search
Query Engine
The query engine processes user queries and returns matching answers using a ranking algorithm.
The algorithm produces a numerical score expressing the importance of an answer with respect to the query.
Utility data structures contain lists of related pages, which can facilitate search.
Both query-independent and query-dependent data are used to decide ranking (date of modification, site, number of links to other pages, or the actual content of documents).
Query-dependent criteria include the cosine measure of similarity in the vector space model.
TIR(1)
Classical Models: Boolean, Vector, Probabilistic
TIR (2)
Weighting algorithm: tf-idf (term frequency-inverse document frequency)
The weight $w_{ij}$ of the $i$th component of the vector representing document $d_j$ is
$$w_{ij} = tf_{ij} \cdot idf_i$$
where the normalized term frequency is
$$tf_{ij} = \frac{f_{ij}}{\max_l f_{lj}}$$
with $f_{ij}$ the frequency of term $k_i$ in $d_j$ and the maximum computed over all terms mentioned in $d_j$.
The inverse document frequency of term $k_i$ is
$$idf_i = \log\frac{N}{n_i}$$
where $N$ is the total number of documents and $n_i$ is the number of documents containing $k_i$.
The relevance ranking is
$$sim(d_j, q) = \cos\theta = \frac{d_j \cdot q}{|d_j|\,|q|}$$
Thus very frequent terms receive a low weight, while uncommon terms appearing in few documents receive a high weight.
The model rests on the assumption that the index terms are independent.
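The weighting and ranking formulas can be tried out with a short sketch. It assumes pre-tokenized documents and, since the slide does not say how query terms are weighted, applies the same tf-idf scheme to the query; the function names and the sample collection are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns (list of {term: weight}, idf table)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # n_i: number of documents containing term i
    idf = {term: math.log(n_docs / n_i) for term, n_i in doc_freq.items()}
    vectors = []
    for tokens in docs:
        freq = Counter(tokens)
        max_f = max(freq.values()) if freq else 1  # max_l f_lj
        vectors.append({t: (f / max_f) * idf[t] for t, f in freq.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, docs):
    """Return (document indices by descending score, score per document)."""
    vectors, idf = tf_idf_vectors(docs)
    # Assumption: the query is weighted like a document; unknown terms are dropped.
    q_freq = Counter(t for t in query_tokens if t in idf)
    max_f = max(q_freq.values()) if q_freq else 1
    q_vec = {t: (f / max_f) * idf[t] for t, f in q_freq.items()}
    scores = [cosine(vec, q_vec) for vec in vectors]
    order = sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)
    return order, scores


# Hypothetical collection of three tokenized documents.
docs = [
    ["semantic", "web", "search"],
    ["web", "search", "engine"],
    ["semantic", "metadata", "ontology"],
]
order, scores = rank(["ontology", "metadata"], docs)
print(order[0], scores)
```

A term that occurs in every document gets $idf_i = \log 1 = 0$, so it contributes nothing to the ranking, which is the low-weight effect described above.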
TIR (3)
Problems with TIR
Keyword-based search
Measure of relevancy of the retrieved documents
Semantic Metadata
Data that may be associated explicitly or implicitly with a given piece of content, and whose relevance for that content is determined by its ontological position (its context) within the domain of knowledge.
Helps in classification and high-precision searching.
Named entity recognition involves finding items of potential interest within a piece of text (person, place, thing, event); these are stored in the ontology.
Metadata is a snapshot of the document's relevant information.
Metadata contained within the snapshot references the instances of the named entities, which are stored in the ontology.
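The relationship between a metadata snapshot and ontology instances can be sketched as follows. This is only an illustration under strong assumptions: a tiny hand-written ontology, a dictionary lookup standing in for real named entity recognition, and hypothetical names (`ONTOLOGY`, `extract_metadata`) and sample text.

```python
import re

# Hypothetical ontology: instance id -> (entity type, surface name).
ONTOLOGY = {
    "person/tim-berners-lee": ("person", "Tim Berners-Lee"),
    "place/geneva": ("place", "Geneva"),
}


def extract_metadata(doc_id, text, ontology=ONTOLOGY):
    """Build a metadata snapshot that references ontology instances found in text."""
    entities = []
    for instance_id, (entity_type, name) in ontology.items():
        # Assumption: exact, case-insensitive name matching instead of a trained recognizer.
        if re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE):
            entities.append({"instance": instance_id, "type": entity_type})
    return {"document": doc_id, "entities": entities}


snapshot = extract_metadata("doc-1", "Tim Berners-Lee gave a talk in Geneva.")
print(snapshot)
```

The snapshot stores references to instances rather than the matched strings, so a search on the ontology instance finds the document regardless of how the name was written.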
Relevant Information
1995-96
description
15 elements are defined by this standard
International, interdisciplinary, W3C community consensus
Semantic interface among resource description communities (a very limited form of semantics)
DCMI (2002)
http://dublincore.org/documents/usageguide/elements.shtml
Title: name given to the resource
Contributor: entity responsible for making contributions to the content of the resource
Creator: entity primarily responsible for making the content of the resource
Publisher: entity responsible for making the resource available
Subject and keywords: topic of the content of the resource
Description: an account of the content of the resource
Format: data representation of the resource
Resource identifier: unambiguous reference to the resource
Language
Rights: copyright notice or statement
Date, Type, Source, Relation, Coverage
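A resource description using these elements can be modelled as a simple record. The sketch below assumes the lowercase element names of the 15-element Dublin Core set; the helper `make_record` and the sample values are hypothetical and only illustrate the idea.

```python
# The 15 Dublin Core elements, written with lowercase names.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def make_record(**fields):
    """Return a metadata record, rejecting names outside the 15 elements."""
    unknown = sorted(set(fields) - set(DC_ELEMENTS))
    if unknown:
        raise ValueError(f"not Dublin Core elements: {unknown}")
    # Keep only the supplied elements, in the canonical order above.
    return {name: fields[name] for name in DC_ELEMENTS if name in fields}


# Hypothetical description of a resource; every element is optional.
record = make_record(
    title="Information Search",
    subject="information retrieval; semantic search",
    language="en",
    format="text/html",
)
print(record)
```

Because every element is optional, a record can be as sparse as a title and a language, which keeps the scheme easy to apply across communities.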
SDMX (Statistical Data and Metadata Exchange): used by statisticians to exchange time series data
Integrate Syndicate
Intelligent Search
[Figure: diagram with the labels User, Query, Ontology, and Mediator 1 (Ontology 1)]
SWIR
Use of the Vector Space Model for the Semantic Web: at the semantic level, documents can be represented as vectors in a hyperspace defined by the set of all ontology concepts.
The weight of a concept is the relative importance of that concept.
SWIR needs:
A good domain ontology
An understanding of the semantic relationships among ontological concepts
SWIR(3)
Weights are assigned to links based on certain properties of the ontology; each weight represents the strength of the relation.
The spread activation technique is used to find related concepts in the ontology, given an initial set of concepts and their initial weights.
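One way to picture spread activation over weighted links is the sketch below. It is a simplified variant under stated assumptions: the decay factor, firing threshold, step limit, and the sample ontology graph are all hypothetical and not taken from the slides.

```python
def spread_activation(graph, initial, decay=0.5, threshold=0.05, max_steps=3):
    """Propagate activation from the initial concepts along weighted links.

    graph:   {concept: {neighbour: link_weight}}
    initial: {concept: starting activation}
    """
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(max_steps):
        next_frontier = {}
        for node, value in frontier.items():
            for neighbour, weight in graph.get(node, {}).items():
                passed = value * weight * decay
                if passed < threshold:
                    continue  # too weak to activate the neighbour
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + passed
        if not next_frontier:
            break
        for node, value in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = next_frontier
    return activation


# Hypothetical ontology fragment with link weights.
graph = {
    "Professor": {"University": 0.8, "Course": 0.6},
    "University": {"City": 0.5},
    "Course": {},
    "City": {},
}
result = spread_activation(graph, {"Professor": 1.0})
print(result)
```

Concepts reached through strong links end up with higher activation, so the result can be read as a ranked set of concepts related to the starting ones. The step limit also guarantees termination on ontologies that contain cycles.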
Here $n_{ij}$ indicates that concepts $C_j$ and $C_i$ are related, and $n_{ijk}$ indicates that both $C_j$ and $C_k$ are related to concept $C_i$.
$W(C_j, C_k)$ therefore represents the percentage of the concepts related to $C_k$ that are also related to $C_j$.
This measure reflects the fact that concepts sharing common relations are semantically similar.
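The idea behind this measure can be sketched with sets of related concepts. The slide text does not reproduce the formula itself, so the normalization used here (shared related concepts divided by the concepts related to $C_k$) is an assumption read off the sentence above; the function name and the sample relations are hypothetical.

```python
def cluster_measure(related, cj, ck):
    """Share of the concepts related to ck that are also related to cj.

    related: {concept: set of concepts it is related to}
    """
    rel_k = related.get(ck, set())
    if not rel_k:
        return 0.0  # ck has no relations, so nothing can be shared
    shared = related.get(cj, set()) & rel_k
    return len(shared) / len(rel_k)


# Hypothetical relations between ontology concepts.
related = {
    "Professor": {"University", "Course", "Paper"},
    "Student": {"University", "Course"},
}
print(cluster_measure(related, "Professor", "Student"))
print(cluster_measure(related, "Student", "Professor"))
```

Under this reading the measure is not symmetric: everything Student is related to is shared with Professor, but only two of Professor's three relations are shared with Student.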
SWIR (5)
The specificity measure is given by:
$$W(C_j, C_k) = \frac{1}{n_k}$$
where $n_k$ is the number of instances of the given relation type that have $C_k$ as their destination node.
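The specificity measure can be computed by counting incoming links of one relation type. The sketch below assumes relation instances are stored as (source, relation, destination) triples; the function name and the sample triples are hypothetical.

```python
from collections import Counter


def specificity_weights(triples, relation_type):
    """Return {(source, destination): 1 / n_k} for one relation type.

    triples: list of (source, relation, destination) tuples.
    n_k counts the instances of relation_type whose destination is k.
    """
    incoming = Counter(dst for _, rel, dst in triples if rel == relation_type)
    return {
        (src, dst): 1.0 / incoming[dst]
        for src, rel, dst in triples
        if rel == relation_type
    }


# Hypothetical relation instances.
triples = [
    ("Alice", "worksAt", "UniA"),
    ("Bob", "worksAt", "UniA"),
    ("Carol", "worksAt", "UniB"),
    ("Alice", "teaches", "Course1"),
]
weights = specificity_weights(triples, "worksAt")
print(weights)
```

A destination that many instances point to yields small weights, so links into very common nodes say less about semantic closeness than links into rarely used ones.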
Conclusion (1)
Traditional information retrieval: small, static, homogeneous, centrally located, monolingual document collections
Web information retrieval: huge volumes of data that are volatile, heterogeneous, distributed, and multilingual
Semantic web information retrieval is ontology-based intelligent information retrieval
Various semantic search strategies are explored
Two major differences:
Keyword vs. concept
Response time as a part of the relevancy measure
The most successful semantic search algorithms are the Vector Space Model and the hybrid approach, which combines the classical technique with the spread activation algorithm
Conclusion (2)
The concepts that form the basis of the semantic domain model are not orthogonal.
This issue can be addressed by reassigning the weights of concept links based on the relationship graph of the ontology concepts.
The spread activation algorithm has been used to deduce relationships from a given set of relationships.
SIR has been visualized as a four-layer process: keywords, indexed keywords, semantic concepts, and relationships.