
International Journal of Advances in Science and Technology, Vol. 3, No. 5, 2011

An Agent Based Framework For Knowledge Extraction Using Semantic Web


B. Saleena1, Dr. S. K. Srivatsa2 and M. Manickasundaram3

1 Research Scholar, School of Information Technology and Engg, Vellore Institute of Technology, Vellore, saleenaameen@gmail.com

2 Senior Professor, Dept of Electronics and Instrumentation Engg, St. Josephs College of Engg, Chennai, profsks@rediffmail.com

3 Senior Test Analyst, Royal Bank of Scotland Group, Chennai, manic25@gmail.com

Abstract
Existing web retrieval systems return a great deal of irrelevant data and lack the ability to provide correct, related information for a particular domain. This paper presents a framework for knowledge extraction that couples the concepts of agents and ontologies. A web of semantically enriched resources is created based on the domain ontology. Software agents annotate the web resources, generate RDF descriptions, and create the knowledge base based on the domain ontology. The Java Agent DEvelopment (JADE) Framework was used to implement the agents. A machine-readable ontology is created, and the annotated documents are categorized according to that ontology and stored in the knowledge base; a query agent then answers queries as per the user's request. The system was evaluated on a few queries, and the results were judged on their relevance to the users' needs.

Keywords: Semantic web, Agents, Knowledge Extraction, Annotation, Ontology, RDF

1. Introduction
The World Wide Web is a repository holding all kinds of information for humans. Because of the rapid growth of the web and its resources, retrieving the required and relevant data is an intimidating process. It is done by visiting a list of web sites, retrieving pieces of information from each of them, and consolidating them manually to assimilate the required knowledge. With the increasing complexity of our systems and our needs, we need to move toward human-level interaction, maximize the amount of semantics we can utilize, and make it easier for users to retrieve meaningful information from the web. Semantic web technologies are a promising way to retrieve relevant and semantically related resources in a single search. A suitable approach is to design a system that extracts information from diverse web resources and presents the knowledge gathered in a structured form. Providing a domain-specific vocabulary and sharing web resources gives better access to relevant knowledge and also helps in describing the contents of that knowledge. The semantic web is a way of representing data on the World Wide Web in a meaningful way: a network of information linked up so as to be easily processable by machines. Agents can reduce the work of users by performing the routine background task of searching through thousands of documents.

1.1 Semantic web and RDF


The Semantic Web is based on a set of standards for representing, querying, and applying rules to data. It helps in efficient data integration, more precise search, and better knowledge management.

November Issue

Page 1 of 88

ISSN 2229 5216

The "Semantic" in the Semantic Web does not mean that computers are going to understand the meaning of anything, but that the logical pieces of meaning can be mechanically manipulated by a machine to useful ends [3]. A standard machine-understandable format called the Resource Description Framework (RDF) is used to represent the data and serves as the foundation on which the semantic web is built. RDF is a language for representing information about web resources. It is designed to be read by computers, not to be displayed to people, and it uses XML as its interchange syntax. The three parts of the name can be read as follows:

Resource: anything identifiable with a URI.
Description: statements about properties of resources.
Framework: a common model for statements using diverse vocabularies.

1.2 Ontology
An ontology is an explicit specification of a conceptualization. It is the key component in using the semantic web approach for searching repositories. The relationships described as part of the ontology allow the user to search on the basis of semantically related terms. At present, web resources associated with the same domain still differ in syntax, structure and semantics. Ontologies help us integrate the relevant documents under a common framework using RDF. They are meant to provide an understanding of the static domain knowledge that facilitates knowledge sharing and reuse. There are four different types of ontologies [5]: domain ontology, generic or common-sense ontology, method ontology, and metadata ontology. The ontology used in this system is a domain ontology that represents the tourism domain. The tourism domain considered in this project covers three areas: type of trips, accommodation and hotel. The type of trips (a sub-domain) is further classified into hill stations, pilgrimage and education trips. Sub-domains can be drilled down further based on an expert's knowledge of the domain.
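The concept hierarchy described above can be sketched as a small tree. This is only an illustration in Python: the dictionary encoding, the name `ONTOLOGY` and the helper `sub_concepts` are assumptions made here, and the paper's actual machine-readable ontology is not reproduced.

```python
from typing import Dict, List

# Hypothetical encoding of the tourism ontology described in the text:
# each concept maps to the list of its direct sub-concepts.
ONTOLOGY: Dict[str, List[str]] = {
    "Tourism": ["Type of Trips", "Accommodation", "Hotel"],
    "Type of Trips": ["Hill Stations", "Pilgrimage", "Education Trips"],
}


def sub_concepts(concept: str) -> List[str]:
    """Return every concept below `concept`, visiting the tree depth first."""
    found = []
    for child in ONTOLOGY.get(concept, []):
        found.append(child)
        found.extend(sub_concepts(child))
    return found


print(sub_concepts("Type of Trips"))
```

A domain expert could drill a sub-domain down further by adding another entry to the dictionary, for example a list of individual hill stations under "Hill Stations".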

1.3 Role of agents in the semantic web


While RDF and ontologies provide the basic infrastructure of the semantic web, it is the intelligent agents that will help us realize its power. An intelligent agent can best be described as an adaptive computer program that is capable of reasoning and that learns from our behaviour. A personal agent on the semantic web receives tasks and preferences from the user and seeks information from web sources by communicating with other agents. It compares the retrieved information with the user's requirements and preferences and presents the aggregated information that is most relevant to the user. In the semantic web, different agents work together to create an 'information value chain' based on the user's request. Agents can be deployed for creating a semantic web, querying the web, and extracting knowledge from the web as per the user's requirements.

An agent-based architecture was developed for information extraction. The Java Agent DEvelopment (JADE) framework was used to implement the agents. Three agents were implemented: an annotation agent, a knowledge agent and a query agent. The agents communicate through structured messages as specified by FIPA (Foundation for Intelligent Physical Agents). The proposed system uses the annotation agent to annotate the information and generate RDF files. The knowledge agent classifies the RDF files based on the domain ontology and creates a knowledge library, where the knowledge is stored in a meaningful way. The query agent fetches the relevant results based on the user's query. The retrieved results are refined by applying the Porter stemming algorithm to the user's query keyword. The main advantage of the proposed system is that it displays all the related information about a particular tourist place in a single search.

The paper is organized as follows. Section 1 introduces the concepts used in the paper and the problems of information retrieval with commonly used search engines, in particular in the tourism domain. Section 2 discusses related research in this field. Section 3 describes the architecture for information retrieval using agents and the phases of development. Section 4 discusses the experimental results, and Section 5 concludes the paper and discusses our future work.


2. Related Work
The growth of the semantic web has led to effective knowledge management of the diverse information present on the web. A few research works carried out in the area of the semantic web and agents are discussed here. The Artequakt project of Harith Alani et al. (2003) links a knowledge-extraction tool with an ontology to achieve continuous knowledge support and guide information extraction. The extraction tool searches online documents and extracts knowledge that matches the given classification structure [1]. It provides this knowledge in a machine-readable format that is automatically maintained in a knowledge base. Problems related to this task, such as the identification and consolidation of duplicated knowledge and the verification of inconsistent knowledge, are highlighted. The work of Joshua Tauberer (2006) deals with the need for the Resource Description Framework (RDF) to represent knowledge in the semantic web [3], in which computer applications make use of distributed, decentralized, structured information spread throughout the current web. In the work of Yi Xiao et al. (2007), an agent-based intelligent retrieval framework for the semantic web is proposed [8]. It is combined with other technologies such as information retrieval, knowledge modeling and ontology construction to perform the retrieval.

There is much ongoing research on the semantic web and intelligent agents, and on combining the two to extract information and respond to users' search queries, including in the tourism domain. But none of these systems focuses on providing details about the different amenities available at a tourist spot in a single search to guide the tourist efficiently. Traditional search engines require many searches to gather information about a single tourist spot. For instance, to retrieve information about a particular hill station and to learn about the hospitals or ATMs in that location, different search queries have to be posted. The proposed system is designed so that the agents gather all the information pertaining to a particular tourist spot to guide the tourist efficiently. To accomplish this task, a combination of semantic web and agent technologies is used.

3. System Architecture
This paper aims to use semantic web technologies and intelligent agents to design a framework for knowledge extraction, as shown in Figure 1. To design this framework, the following three phases were implemented using semantic web technologies and agents: (1) annotation of web resources, (2) creating a knowledge base, and (3) knowledge extraction. The system consists of an annotation agent, a knowledge agent and a query agent. Together they convert the web resources to RDF, a machine-readable format used to create a web of interrelated and meaningful information (a semantic web) based on the domain ontology. The domain used to implement this prototype is tourism.
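To make the interplay of the three agents more concrete, the following sketch models the kind of structured message they might exchange. It is written in plain Python rather than JADE, it keeps only a few of the fields a FIPA ACL message can carry, and the agent names, the `AclMessage` class and the assumption that the query agent sends a request to the knowledge agent are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclMessage:
    """A few fields of a FIPA-style ACL message; real messages carry more."""
    performative: str  # communicative act, e.g. "inform" or "request"
    sender: str
    receiver: str
    content: str
    ontology: str = "tourism"


def pipeline_messages(url: str, keyword: str) -> list:
    """List the messages one registration and one query might produce."""
    return [
        # Annotation agent tells the knowledge agent about newly annotated content.
        AclMessage("inform", "annotation-agent", "knowledge-agent", url),
        # Query agent asks for links matching the user's keyword (assumed flow).
        AclMessage("request", "query-agent", "knowledge-agent", keyword),
    ]


for message in pipeline_messages("http://www.kodaikanal.com/hospital.html", "Kodaikanal"):
    print(message.performative, message.sender, "->", message.receiver)
```

In JADE itself the corresponding class is `ACLMessage`; the dataclass above only mirrors the idea of a performative plus sender, receiver, content and ontology fields.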


Figure 1: Framework for knowledge Extraction using agents

3.1 Annotation of Web resources


In this phase the website owner registers the links of his website, and the annotation agent performs the work of annotating the website. The URLs, which describe the web page contents, are registered in the template provided, and the metadata collected is stored appropriately. If there are additional links that do not have predefined templates, the annotation agent provides an interface to add new templates and the appropriate metadata about the website. The agent gathers all the metadata, generates a machine-readable description in RDF, and distributes it across the web. For example, one website may have links regarding tourist spots and accommodation, while another website owner may register links about hospitals and resorts at the same spot, or details about a new spot. Redundant links registered by different website owners are eliminated by a check performed after the RDF triples are generated. Information related to the same spot is grouped based on keywords and the domain ontology. Figure 2 depicts the template provided for annotation of web resources.

Figure 2: Annotation of Web Resources
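The redundancy check and the grouping by spot described in this phase can be sketched as follows. This is a minimal Python illustration, not the annotation agent's actual code: the sample registrations, the tuple layout (spot, template field, URL) and the function names are assumptions.

```python
from collections import defaultdict

# Hypothetical registrations from two site owners: (spot, template field, URL).
registrations = [
    ("Kodaikanal", "Accommodation", "http://www.kodaikanal.com/wheretostay/hotel.html"),
    ("Kodaikanal", "Hospitals", "http://www.kodaikanal.com/hospital.html"),
    ("Kodaikanal", "Hospitals", "http://www.kodaikanal.com/hospital.html"),  # registered twice
]


def deduplicate(triples):
    """Drop repeated triples while keeping first-seen order."""
    seen = set()
    unique = []
    for triple in triples:
        if triple not in seen:
            seen.add(triple)
            unique.append(triple)
    return unique


def group_by_spot(triples):
    """Group (field, URL) pairs under the tourist spot they describe."""
    grouped = defaultdict(list)
    for spot, field, url in triples:
        grouped[spot].append((field, url))
    return dict(grouped)


unique = deduplicate(registrations)
print(len(registrations), "registered,", len(unique), "kept")
print(group_by_spot(unique))
```

Here duplicates are detected by exact equality of the whole triple; the paper does not say how its check compares links, so a real check might normalize URLs first.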


A sample RDF file is generated by annotating the details of a website in the tourism domain. The website contains the details of hill stations in Tamilnadu. The example in Figure 3 shows the RDF generated for a website registered by an owner for the Ooty and Kodaikanal hill stations. The sample RDF is not designed to be displayed to people; it is included only to show what an RDF file looks like. The main goal of the annotation agent is to provide a machine-readable description of the contents of web-accessible resources.

Figure 3: Sample RDF Generated

3.2 Creating a Knowledge base


A knowledge agent collaborates with the annotation agent and generates a knowledge base for the system. The underlying structure of any RDF generated by the annotation agent is viewed as a collection of triples. RDF [3] provides a flexible method to decompose any knowledge into triples, with some rules about the semantics (meaning) of those pieces. An RDF triple is a combination of three parts: subject, predicate and object. The declaration of an RDF triple says that some relationship, indicated by the predicate, holds between the things denoted by the subject and the object of the triple. A sample format of triples is shown in Figure 4.
SUBJECT      PREDICATE      OBJECT
Kodaikanal   WebsiteUrl     http://www.kodaikanal.com/emergencynumbers.html
Kodaikanal   PlacetoVisit   http://www.kodaikanal.com/travelinformation.html
Kodaikanal   Accommodation  http://www.kodaikanal.com/wheretostay/hotel.html
Kodaikanal   Hospitals      http://www.kodaikanal.com/hospital.html
Kodaikanal   Bank           http://www.kodaikanal.com/bank.html

Figure 4: RDF triples
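A triple such as those in Figure 4 can be held in a three-field record and written out as one line of N-Triples, a plain-text RDF syntax. The sketch below is illustrative Python: the namespace `http://example.org/tourism#` is hypothetical, since the paper does not give the URIs of its vocabulary, and the `Triple` class and `to_ntriples` helper are names introduced here.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate and object."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple) -> str:
    """Serialize a triple whose three terms are all URIs as one N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


# Hypothetical namespace for the tourism vocabulary.
TOURISM = "http://example.org/tourism#"

statement = Triple(
    subject=TOURISM + "Kodaikanal",
    predicate=TOURISM + "Hospitals",
    obj="http://www.kodaikanal.com/hospital.html",
)
line = to_ntriples(statement)
print(line)
```

The printed line states that the relationship "Hospitals" holds between the resource Kodaikanal and the registered hospital page, which is the reading of a triple given in the text.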


The knowledge agent takes the generated RDF and decomposes it into triples. The objects are then grouped according to the subject and the predicate of the triples. This grouping is based on the ontology. A small portion of the domain ontology is shown in Figure 5.

Figure 5: Sample Domain Ontology

Based on the domain ontology, an ontology library is created. The ontology library (repository) automatically creates folders that follow the tree structure of the ontology. The objects in the triples are matched to the corresponding folders. In this way the knowledge agent gathers all the related information from different websites and groups it into a common domain, so that the end user receives only relevant information during a search.
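The folder creation and matching step can be sketched as follows. This Python illustration keeps the library in memory and only computes folder paths instead of creating directories; the ontology slice, the folder layout (one sub-folder per subject) and the function names are assumptions, since Figure 5 is not reproduced here.

```python
from pathlib import PurePosixPath

# Hypothetical slice of the domain ontology: concept -> direct sub-concepts.
ONTOLOGY = {
    "Tourism": ["HillStations"],
    "HillStations": ["Accommodation", "Hospitals", "Bank"],
}


def folder_paths(root: str) -> dict:
    """Map every concept to a folder path that mirrors the ontology tree."""
    paths = {root: PurePosixPath(root)}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in ONTOLOGY.get(parent, []):
            paths[child] = paths[parent] / child
            stack.append(child)
    return paths


def file_triples(triples, paths):
    """File each triple's object under its predicate's folder, per subject."""
    library = {}
    for subject, predicate, obj in triples:
        if predicate in paths:  # predicates outside the ontology are skipped
            library.setdefault(str(paths[predicate] / subject), []).append(obj)
    return library


triples = [
    ("Kodaikanal", "Hospitals", "http://www.kodaikanal.com/hospital.html"),
    ("Kodaikanal", "Bank", "http://www.kodaikanal.com/bank.html"),
]
library = file_triples(triples, folder_paths("Tourism"))
for folder, links in library.items():
    print(folder, links)
```

Because the folder for a triple is derived from its predicate and subject, links about the same spot that were registered by different site owners end up side by side in the library.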

3.3 Knowledge Extraction


The role of the query agent is to analyze the keyword entered by the end user and to ask for a secondary keyword that narrows down the search and improves the accuracy of the results. The query agent then reasons over the related concepts based on the primary and secondary keywords and displays all the related information links. The novelty of this search is that it fetches not only the information matching the keyword but also all the sub-concepts relating to the keyword, and displays them to the user. The Porter stemming algorithm was implemented to improve the search. By extracting knowledge from a particular domain in this way, the framework is intended to achieve higher precision and recall than existing methods.
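The two-keyword lookup can be sketched as below. This is an illustrative Python fragment under several assumptions: the knowledge base is a small hand-written dictionary, and `crude_stem` is a deliberately simple suffix stripper standing in for the Porter stemming algorithm, which has many more rules and is not reproduced here.

```python
def crude_stem(word: str) -> str:
    """Strip a few common suffixes. A stand-in, NOT the Porter algorithm."""
    word = word.lower()
    for suffix in ("ies", "es", "s", "ing"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical knowledge base: spot -> concept -> registered links.
KNOWLEDGE_BASE = {
    "Kodaikanal": {
        "Hospitals": ["http://www.kodaikanal.com/hospital.html"],
        "Bank": ["http://www.kodaikanal.com/bank.html"],
    },
}


def search(primary: str, secondary: str = "") -> list:
    """Return links for the spot named by `primary`, narrowed by `secondary` if given."""
    wanted = crude_stem(secondary) if secondary else None
    links = []
    for spot, concepts in KNOWLEDGE_BASE.items():
        if crude_stem(spot) != crude_stem(primary):
            continue
        for concept, urls in concepts.items():
            if wanted is None or crude_stem(concept) == wanted:
                links.extend(urls)
    return links


print(search("Kodaikanal"))              # all sub-concepts of the spot
print(search("kodaikanal", "hospital"))  # narrowed by the secondary keyword
```

Stemming both the stored concept names and the query keywords is what lets "hospital", "Hospitals" and "hospitals" meet on the same stem.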

4. Experimental results and Discussion


The proposed approach is implemented in Java, and the experiments were performed on a 3.0 GHz Core 2 Duo PC with 2 GB of main memory. The agents were implemented using the JADE environment. The system gathers knowledge from the users and stores it in the form of RDF files. A user interface was developed for users to search for information relating to the domain. Over 55 attempts to extract relevant information, Table 1 shows what was retrieved without stemming and after the stemming algorithm was implemented. The relevancy of the knowledge retrieved is judged based on specific users and their current needs. Figure 6 shows the percentage of exact, relevant and irrelevant information retrieved from the knowledge base (KB) during different attempts.


Table 1: Results of Information Extracted from Knowledge Base

Nature of Information Extracted   Exact   Relevant   Irrelevant
Retrieval without Stemming        43      39         20
Retrieval with Stemming           49      45         15

Figure 6: Percentage of Type of Information retrieved from KB
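One way to read Table 1 is to express each category as a share of its row total. The Python sketch below does this under a stated assumption: the paper does not give the unit of the Table 1 entries, so treating them as counts of retrieved items is an interpretation made here, and the resulting percentages are not claimed to match Figure 6.

```python
# Figures copied from Table 1; treating them as counts is an assumption.
TABLE_1 = {
    "without stemming": {"exact": 43, "relevant": 39, "irrelevant": 20},
    "with stemming": {"exact": 49, "relevant": 45, "irrelevant": 15},
}


def shares(counts: dict) -> dict:
    """Express each category as a percentage of the row total, to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * value / total, 1) for name, value in counts.items()}


for setting, counts in TABLE_1.items():
    print(setting, shares(counts))
```

Under this reading, the irrelevant share falls from 19.6% without stemming to 13.8% with stemming.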

5. Conclusion and Future Enhancements


The main objective of this system is to improve the extraction of knowledge from diverse web sources using an ontology and agents. The system does not require the user to explicitly specify all the keywords needed to extract information regarding the tourism domain, which enhances the efficiency of search. Since the ontology is enriched with a variety of information pertaining to the domain, all the relevant information can be extracted. Users are provided with all the relevant information related to a particular tourist location in a single click. The search results are classified based on the user's relevancy and needs. The Porter stemming algorithm is applied to refine the search and improve the accuracy of the results. The system can be extended to perform automatic annotation of the web documents that are registered. The searching techniques at the user interface can also be enhanced.

6. References
[1] Harith Alani, Sanghee Kim, David E. Millard, Mark J. Weal, Wendy Hall, Paul H. Lewis, and Nigel R. Shadbolt, Automatic Ontology-Based Knowledge Extraction from Web Documents, IEEE Intelligent Systems, 2003.
[2] Natalya F. Noy and Deborah L. McGuinness, Ontology Development 101: A Guide to Creating Your First Ontology, http://protege.stanford.edu/publications/ontology_development/ontology101-noymcguinness.html


[3] Joshua Tauberer, What is RDF and what is it good for?, http://www.rdfabout.com/intro/, July 2006.
[4] Phil Cross, Libby Miller, Sean Palmer, Using RDF to Annotate the (Semantic) Web, http://www.ilrt.bris.ac.uk/publications/researchreport/rr1015/report_html, June 2003.
[5] A. Aldea, R. Banares, J. Bocio, J. Gramajo, D. Isern, A. Kokossis, L. Jimenez, A. Moreno, D. Riano, An Ontology-Based Knowledge Management Platform, http://www.isi.edu/info-agents/workshops/ijcai03/papers/DIsern-article-ijcai.pdf, 2003.
[6] White, M., Korelsky, T., Cardie, C., Ng, V., Pierce, D., Wagstaff, K.: Multi document Summarization via Information Extraction. Proc. of Human Language Technology Conf. (HLT 2001), San Diego, CA, 2000.
[7] Reidsma, D., Kuper, J., Declerck, T., Saggion, H., Cunningham, H.: Cross document annotation for multimedia retrieval. EACL Workshop on Language Technology and the Semantic Web, Budapest, 2003.
[8] Yi Xiao, Ming Xiao, Fan Zhang, Agents-based Intelligent Retrieval Framework for the Semantic Web, International Conference on Wireless Communications, Networking and Mobile Computing (IEEE), 21-25 Sept. 2007, pp. 5357-5360.

Authors Profile
Dr. S. K. Srivatsa was born in Bangalore on 21st July 1945. He received his Bachelor of Electronics and Telecommunication Engineering degree (Honors) from Jadavpur University (securing first rank and two medals), and his Master's degree in Electrical Communication Engineering (with distinction) and Ph.D from the Indian Institute of Science, Bangalore. He retired as a Professor of Electronics Engineering from Anna University in 2005 and has been working as a Senior Professor at St. Josephs College of Engineering since August 2005. He has taught twenty-two different courses at the P.G level during the last 34 years. He has served as a member of the Board of Studies in some educational institutions. He is a life Fellow/Member of about two dozen registered professional societies and has received about a dozen awards. He is the author of well over 450 publications in reputed journals and conferences. He has produced 34 Ph.Ds. His research interests pertain to Electronics and Computer Science.

B. Saleena is currently pursuing her research at Vellore Institute of Technology (V.I.T) University. She received her MCA degree from Madras University and her M.E. (Comp. Science) from Anna University (with distinction). For the past 14 years she has been working at B.S. Abdur Rahman Crescent Engineering College, having progressed to the level of Asst. Professor (Sel. Gr) in the Department of Computer Applications. Her research interests are in the area of semantic web and ontology engineering. She has taught ten different courses at the P.G level during the last 14 years. Her research works have been presented and published in both national and international conferences and journals.

M. Manickasundaram is currently working as a Senior Test Analyst in the Royal Bank of Scotland Group. He received his MCA degree from Madras University. For the past 7 years he has been working in the software testing field. Some of his major accomplishments during this period were establishing a standardized test methodology for testing data warehouses, publishing a factory model for validation of Business Intelligence reports, and formulating and standardizing a test strategy for validating iPhone applications. His research interests are in the area of Semantic Web and Agent Programming.
