HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG
Abstract— Much of the data available on the web is unstructured, and constructing an ontology for an unexplored domain is a difficult
task. Automatic generation of ontologies from unstructured data is a very important part of the Semantic Web. In this paper we present a
methodology to automatically construct an ontology from the information extracted from the web for a given keyword. This ontology
represents a taxonomy of classes for the specified keyword's domain and helps the user choose the most significant sites that can be found
on the Web. The suggested mechanism generates and renews the ontology automatically whenever a search is completed. A key resource
in our work is the Google Ajax Search API for extracting information, and JSON is used to parse the output for the construction of the
ontology. The obtained classification, a hierarchically structured list of the most representative web sites for each ontology class, is a
great help in finding and accessing the desired web resources.
Index Terms— Semantic Web, Ontology, Google AJAX API, JSON, RDF, OWL, Information Extraction, Knowledge Base.
—————————— ——————————
1 INTRODUCTION
Much of the data available on the web is unstructured, and there is no standard way of representing information. The main problem in extracting structured information from the web is mainly due to web designers who design websites in their own way, embedding information about visual representation and using a large number of different file types such as .doc, .ps, .ppt and .pdf. This becomes a serious drawback when one tries to create structured information representations, like ontologies, from unstructured ones.

The concepts extracted for a keyword are used as classes in constructing the final ontology; OWL is used as the language to construct ontologies. For each concept which is used as a class in the ontology, a URL from which the concept was extracted is associated. The process is repeated recursively using combinations of the new concepts in order to build an appropriate hierarchy.
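A minimal sketch of this recursive construction, with a stubbed lookup table standing in for the Google Ajax Search API (all data, class and method names here are illustrative, not the paper's code):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HierarchySketch {
    // Stub standing in for the Google Ajax Search API: maps a query to
    // the candidate concepts "found" for it (illustrative data only).
    static final Map<String, List<String>> SEARCH = new HashMap<>();
    static {
        SEARCH.put("Network", Arrays.asList("LAN", "Broadcast"));
        SEARCH.put("LAN Network", Arrays.asList("Ethernet"));
        // queries absent from the map return nothing, ending the recursion
    }

    // Each concept found becomes a class; the combined "concept + keyword"
    // query is searched again to build the next level of the hierarchy.
    static String build(String keyword, String indent) {
        StringBuilder sb = new StringBuilder();
        for (String concept : SEARCH.getOrDefault(keyword, Arrays.asList())) {
            sb.append(indent).append(concept).append('\n');
            sb.append(build(concept + " " + keyword, indent + "  "));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("Network", ""));
    }
}
```

In the real mechanism the stub is replaced by an actual search call, and each class additionally records the URLs it was extracted from.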
Though several tools are available to ease web search, as given below, we will see why they do not satisfy the user requirement:
1. Search engines like Google [7], Yahoo [8] and Bing [9] do a great job of indexing web sites, but the way they obtain results is quite limited: they simply check the presence or absence of a keyword on each web page. The list of results is sorted using an effective rating mechanism according to each website's relevance.
2. API search tools and directory services like the Google API Search Tool [10][11], the Yahoo Developer Network API [12] and Yahoo Directory services [13] provide options such as structuring the list of websites into different categories of search. Many projects have been exploring this path, like Guided Google [14] and the Google API Proximity Search (GAPS) developed by Staggernation.com [15], but in many cases the result set is quite reduced and sometimes the information is outdated.
These tools can be useful when one knows exactly what to search for and the domain where it belongs, but in most cases the amount of returned results makes it difficult to obtain the desired information. However, we will see in this paper how these difficulties are addressed by constructing ontologies.

[Figure: flowchart of the proposed mechanism: keyword input, Google Ajax API search, parsing using JSON, class selection for the ontology, classes with their associated URLs, and the combined class name + keyword fed back into the search]
a) Web Search: a traditional search input field where, when a query is entered, a series of text search results appear on the page.
b) Local Search: a Google Map is mashed together with a search input field, and the search results are based on a specific location.
c) Video Search: the AJAX Video Search provides the ability to offer compelling video search along with accompanying video-based search results.
Once connected, the application will be able to issue search requests to Google's index of more than two billion web pages and receive results as structured data, access information in the Google cache, and check the spelling of words. Google Web APIs support the same search syntax as the Google.com site.
In short, the Google AJAX APIs serve as an enhanced conduit to several of Google's most popular hosted services. Hosted services such as Google Search or Google Maps can be accessed directly, but the AJAX APIs bring the ability to integrate these hosted services into anyone's custom web pages. The AJAX APIs work by allowing any web page hosted on the Internet to access Google search (or feed) data through JavaScript code.
JSON Schema specifies what JSON data is required for a given application and how it can be modified, much like what XML Schema provides for XML. JSON Schema is intended to provide validation, documentation, and interaction control of JSON data. It is based on concepts from XML Schema, RelaxNG, and Kwalify, but is intended to be JSON-based, so that JSON data in the form of a schema can be used to validate JSON data, the same serialization/deserialization tools can be used for the schema and the data, and it can be self-descriptive.
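Since each response arrives as JSON text, the fields used later (title, URL, content) have to be pulled out of it. The paper does this with a JSON parser [16]; purely as an illustration of the extraction step, here is a JDK-only sketch using regular expressions (class and method names are ours, not the paper's):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResultFieldExtractor {

    // Returns the string value of the first "name":"value" pair found,
    // or null if absent. Illustrative only: a real implementation should
    // use a proper JSON parser, as the paper does.
    static String extractField(String json, String name) {
        Pattern p = Pattern.compile(
                "\"" + Pattern.quote(name) + "\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shape mirrors the Google Ajax Search response shown later
        String sample = "{\"responseData\":{\"results\":[{"
                + "\"url\":\"http://www.webopedia.com/TERM/N/network.html\","
                + "\"visibleUrl\":\"www.webopedia.com\"}]}}";
        System.out.println(extractField(sample, "url"));
        System.out.println(extractField(sample, "visibleUrl"));
    }
}
```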
4. The response data must be parsed to select the appropriate class and the associated URL for constructing the ontology. JSON is used to parse the response data. After parsing, the program is able to separate the websites returned for the keyword and to catch the similar or relevant words associated with the main keyword (here "Network"), which is explained in the next point. The code snippet used to parse the response data using JSON is as follows:

// Get the JSON response from the open connection
String line;
StringBuilder builder = new StringBuilder();
BufferedReader reader = new BufferedReader(new
        InputStreamReader(connection.getInputStream()));
while ((line = reader.readLine()) != null) {
    builder.append(line);
    out.println("\n\n" + line);
}
String response1 = builder.toString();

// For each result object j obtained from the parsed response,
// print the fields of interest
out.println("Title: " + j.getString("titleNoFormatting"));
out.println("<br>");
out.println("URL: " + j.getString("url"));
out.println("<br>");
out.println("Content: " + j.getString("content"));

Figure 6: Array output of the result after searching Google using the Google Ajax API with the search keyword "network":

{"responseData": {"results": [
  {"GsearchResultClass": "GwebSearch",
   "unescapedUrl": "http://www.webopedia.com/TERM/N/network.html",
   "url": "http://www.webopedia.com/TERM/N/network.html",
   "visibleUrl": "www.webopedia.com",
   "cacheUrl": "http://www.google.com/search?q\u003dcache:HGyueh94nIkJ:www.webopedia.com",
   "title": "What is \u003cb\u003enetwork\u003c/b\u003e? - A Word Definition From the Webopedia Computer \u003cb\u003e...\u003c/b\u003e",
   "titleNoFormatting": "What is network? - A Word Definition From the Webopedia Computer ...",
   "content": "This page describes the term \u003cb\u003enetwork\u003c/b\u003e and lists other pages on the Web where you can find additional information."},

5. The URLs for a class are chosen on the basis of the number of occurrences of the keyword in the content of the web page (the "content" field is extracted from the response data using JSON, as shown above). After the URLs for the class have been selected, candidate words, or subclasses, for constructing the hierarchy of the main keyword are selected according to their association with the main keyword. The program chooses words that are relevant to the main keyword, i.e. that are not prepositions, determiners etc., that have a minimum size (more than two characters), and that are represented in standard ASCII.

6. For each candidate word selected, a detailed analysis is conducted. The detailed analysis includes checking the number of occurrences of the word in the content of the website. After this analysis, an appropriate candidate key is selected on the basis of the depth (total number of occurrences) of the word.

7. After the candidate word is selected, a new keyword is formed by joining the candidate word and the initial keyword (e.g. "LAN Network"), and the entire process is repeated recursively. Each repetition has its own selection of candidates based on the constraints mentioned above. The recursive process is stopped when no search results are found for that word.

8. The final result is the hierarchy of classes and subclasses, which is the ontology. Each candidate word is chosen as a class name, and the URLs from which it was selected are stored with it. The websites associated with each class are the most appropriate ones.

5 EVALUATION AND RESULTS

As an example, "Network" is chosen as the initial keyword. As mentioned in Section 4, different constraints are set, such as the minimum size of the candidate word (more than two characters), the maximum number of appearances in the web page, etc. With these constraints the search is performed, and the resulting class hierarchy, visualized in Protégé, is shown in Figure 7. As an example, take the candidate keyword broadcast; the combined keyword
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 3, MARCH 2011, ISSN 2151-9617
Figure 7: Network hierarchy as shown in Protégé

Figure 8: URLs associated with the Facebook class as shown in Protégé
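The candidate-word constraints used throughout (more than two characters, no prepositions or determiners, plain ASCII) can be expressed as a small predicate. A sketch, with an illustrative stop-word list since the paper does not enumerate one:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CandidateFilter {
    // Illustrative stop-word list; the paper excludes prepositions,
    // determiners etc. without listing them explicitly.
    static final Set<String> STOP =
        new HashSet<>(Arrays.asList("the", "a", "an", "of", "in", "on", "for"));

    // A word qualifies as a candidate class name if it is not a stop
    // word, is longer than two characters, and is plain ASCII.
    static boolean isCandidate(String w) {
        return !STOP.contains(w.toLowerCase())
            && w.length() > 2
            && w.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        System.out.println(isCandidate("broadcast")); // long enough, not a stop word
        System.out.println(isCandidate("of"));        // rejected: stop word
    }
}
```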
The mechanism also stores the appropriate URLs along with the class names, allowing the user to access the most representative websites for the keyword. As an example of URLs stored with the classes, the URLs of the subclass mail of web and social network are shown in Figure 8.
The Jambalaya plugin of Protégé shows a complete graphical view of the entire ontology in different formats. Example views, such as the nested tree map, the class and individual tree map, and the class tree, are shown in Figures 9, 10 and 11. The generated OWL file can be updated by repeating the whole process at regular intervals for the same main keyword; the URLs are updated accordingly. A small part of the OWL file generated by the current mechanism is given below:
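The OWL listing itself did not survive reproduction here. As an indication of the kind of fragment such a mechanism could emit, namely one class, its parent, and the stored source URL, here is a JDK-only sketch (element names and layout are illustrative, not the paper's actual output):

```java
public class OwlSketch {

    // Emits one OWL/XML-style class with a parent link and the source
    // URL kept as an rdfs:seeAlso annotation. Structure is illustrative.
    static String owlFragment(String parent, String child, String url) {
        return "<owl:Class rdf:ID=\"" + child + "\">\n"
             + "  <rdfs:subClassOf rdf:resource=\"#" + parent + "\"/>\n"
             + "  <rdfs:seeAlso rdf:resource=\"" + url + "\"/>\n"
             + "</owl:Class>";
    }

    public static void main(String[] args) {
        System.out.println(owlFragment("Network", "BroadcastNetwork",
                "http://www.webopedia.com/TERM/N/network.html"));
    }
}
```

A production implementation would build these axioms with the OWL API [22] rather than by string concatenation.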
Figure 10: "Class and Individual Tree map" as shown under Jambalaya in Protégé

Figure 11: "Class Tree" as shown under Jambalaya in Protégé

6 CONCLUSIONS & FUTURE WORK

Many authors have been working on ontology learning and construction from different kinds of structured information sources, like databases, knowledge bases or dictionaries [24, 25]; some authors are putting their effort into processing natural language texts [26]. Most ontologies are constructed on the basis of an explored domain or on structured information like databases. However, taking into consideration the amount of resources easily available on the Internet, the automatic creation of ontologies from unstructured documents like web pages is interesting and important. In this paper a methodology is chosen to automatically construct and update the ontology based on unstructured data from the web using a low-cost approach. The low-cost approach, for automatic construction of the ontology, uses publicly available search engines like Google through its API, the Google Ajax API search, and JSON to parse the data. The entire mechanism is implemented using Java and JSP.

ACKNOWLEDGMENT

This research has been supported and funded by CSIR, India under the Empower Scheme, grant no. OLP-2104-28.

REFERENCES
[1] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Scientific American, 284(5):28-37, 2001.
[2] D. Fensel, Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce, Springer Verlag, 2001.
[3] T.R. Gruber, "A Translation Approach to Portable Ontology Specification," Knowledge Acquisition, 1993.
[4] D. Urbansky, "Automatic Construction of a Semantic, Domain-Independent Knowledge Base."
[5] Extensible Markup Language (XML), W3C. http://www.w3.org/XML/
[6] Simple HTML Ontology Extensions (SHOE). http://www.cs.umd.edu/projects/plus/SHOE/
[7] Google. http://google.com
[8] Yahoo. http://yahoo.com
[9] Bing. http://bing.com
[10] Google AJAX Search API. http://code.google.com/apis/ajaxsearch/
[11] Google Groups (web-apis). http://groups.google.com/groups?group=google
[12] Yahoo Developer Network. http://developer.yahoo.com/
[13] Yahoo Directory Services. http://dir.yahoo.com/computers_and_internet/internet/directory_services/
[14] D.C. Hoong and R. Buyya, "Guided Google: A Meta Search Engine and its Implementation using the Google Distributed Web Services."
[15] Google API Proximity Search (GAPS). http://www.staggernation.com/gaps/readme.html
[16] JSON. http://www.json.org
[17] JSON Schema. http://json-schema.org
[18] Introducing RDF, part 2. http://www.linkeddatatools.com/introducing-rdf-part-2
[19] Introducing RDF, part 2. http://www.linkeddatatools.com/introducing-rdf-part-2
[20] Protégé 3.4.1. http://protege.stanford.edu/
[21] Jambalaya. http://www.thechiselgroup.org/jambalaya
[22] OWL API. http://sourceforge.net/projects/owlapi/
[23] http://img.shopping.com/cctool/WhatsIs/1/1399_20943.epi.html
[24] T. Finin and Z. Syed, "Creating and Exploiting a Web of Semantic Data Using Wikitology."
[25] D. Mukhopadhyay, A. Banik, and S. Mukherjee, "A Technique for Automatic Construction of Ontology from Existing Database to Facilitate Semantic Web."
[26] M.Y. Dahab, H.A. Hassan, and A. Rafea, "TextOntoEx: Automatic Ontology Construction from Natural English Text."

Kalyan Netti was born in Andhra Pradesh, India. He obtained a Master of Technology (M.Tech) in Computer Science and Engineering with specialization in Database Management Systems from JNTU, Andhra Pradesh, India, in 2004, and is currently pursuing a Ph.D (Computer Science and Engineering) in data-mining related areas. Kalyan Netti is interested in the following areas: Semantic Web technologies, ontologies, data interoperability, web mining, semantic heterogeneity, relational database systems, temporal databases and temporal data modeling.