HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG
Abstract— Much of the data available on the web is unstructured, and constructing an ontology for an unexplored domain is a difficult
task. Automatic generation of ontologies from unstructured data is a very important part of the Semantic Web. In this paper we present a
methodology to automatically construct an ontology from the information extracted from the web for a given keyword. This ontology
represents a taxonomy of classes for the specified keyword's domain and helps the user choose the most significant sites that can be found
on the Web. The suggested mechanism generates and renews the ontology automatically whenever a search is completed. A key resource
in our work is the Google Ajax Search API for extracting information, and JSON is used to parse the output for the construction of the
ontology. The obtained classification, a hierarchically structured list of the most representative web sites for each ontology class, is a
great help in finding and accessing the desired web resources.
Index Terms— Semantic Web, Ontology, Google AJAX API, JSON, RDF, OWL, Information Extraction, Knowledge Base.
—————————— ——————————
1 INTRODUCTION
Much of the data available on the web is unstructured, and there is no standard way of representing information. The main problem in extracting structured information from the web is mainly due to web designers who design websites in their own way, embedding information about visual representation and using a large number of different file types such as .doc, .ps, .ppt and .pdf. This becomes a serious drawback when one tries to create structured information representations, like ontologies, from unstructured ones.

The concepts extracted for a keyword are used as classes in constructing the final ontology; OWL is used as the language to construct ontologies. For each concept which is used as a class in the ontology, a URL from which the concept was extracted is associated. The process is repeated recursively using combinations of the new concepts in order to build an appropriate hierarchy.
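A minimal sketch of this recursive construction, with a stubbed lookup table standing in for the Google Ajax Search API (all data, class and method names here are illustrative, not the paper's code):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HierarchySketch {
    // Stub standing in for the Google Ajax Search API: maps a query to
    // the candidate concepts "found" for it (illustrative data only).
    static final Map<String, List<String>> SEARCH = new HashMap<>();
    static {
        SEARCH.put("Network", Arrays.asList("LAN", "Broadcast"));
        SEARCH.put("LAN Network", Arrays.asList("Ethernet"));
        // queries absent from the map return nothing, ending the recursion
    }

    // Each concept found becomes a class; the combined "concept + keyword"
    // query is searched again to build the next level of the hierarchy.
    static String build(String keyword, String indent) {
        StringBuilder sb = new StringBuilder();
        for (String concept : SEARCH.getOrDefault(keyword, Arrays.asList())) {
            sb.append(indent).append(concept).append('\n');
            sb.append(build(concept + " " + keyword, indent + "  "));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(build("Network", ""));
    }
}
```

In the real mechanism the stub is replaced by an actual search call, and each class additionally records the URLs it was extracted from.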
Though several tools are available to ease web search, as given below, we will see why they do not satisfy the user requirement:
1. Search engines like Google [7], Yahoo [8] and Bing [9] do a great job of indexing web sites, but the way they obtain results is quite limited: they simply check the presence or absence of a keyword on each web page. The list of results is sorted using an effective rating mechanism according to each website's relevance.
2. API search tools and directory services like the Google API Search Tool [10][11], the Yahoo Developer Network API [12] and Yahoo Directory services [13] provide options such as structuring the list of websites into different categories of search. Many projects have been exploring this path, like Guided Google [14] and the Google API Proximity Search (GAPS) developed by Staggernation.com [15], but in many cases the result set is quite reduced and sometimes the information is outdated.
These tools can be useful when one knows exactly what to search for and the domain where it belongs, but in most cases the amount of returned results makes it difficult to obtain the desired information. However, we will see in this paper how these difficulties are addressed by constructing ontologies.

[Figure: flowchart of the proposed mechanism: keyword input, Google Ajax API search, parsing using JSON, class selection for the ontology, classes with their associated URLs, and the combined class name + keyword fed back into the search]
a) Web Search: a traditional search input field where, when a query is entered, a series of text search results appear on the page.
b) Local Search: a Google Map is mashed together with a search input field, and the search results are based on a specific location.
c) Video Search: the AJAX Video Search provides the ability to offer compelling video search along with accompanying video-based search results.
Once connected, the application will be able to issue search requests to Google's index of more than two billion web pages and receive results as structured data, access information in the Google cache, and check the spelling of words. Google Web APIs support the same search syntax as the Google.com site.
In short, the Google AJAX APIs serve as an enhanced conduit to several of Google's most popular hosted services. Hosted services such as Google Search or Google Maps can be accessed directly, but the AJAX APIs bring the ability to integrate these hosted services into anyone's custom web pages. The AJAX APIs work by allowing any web page hosted on the Internet to access Google search (or feed) data through JavaScript code.
JSON Schema specifies what JSON data is required for a given application and how it can be modified, much like what XML Schema provides for XML. JSON Schema is intended to provide validation, documentation, and interaction control of JSON data. It is based on concepts from XML Schema, RelaxNG, and Kwalify, but is intended to be JSON-based, so that JSON data in the form of a schema can be used to validate JSON data, the same serialization/deserialization tools can be used for the schema and the data, and it can be self-descriptive.
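Since each response arrives as JSON text, the fields used later (title, URL, content) have to be pulled out of it. The paper does this with a JSON parser [16]; purely as an illustration of the extraction step, here is a JDK-only sketch using regular expressions (class and method names are ours, not the paper's):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResultFieldExtractor {

    // Returns the string value of the first "name":"value" pair found,
    // or null if absent. Illustrative only: a real implementation should
    // use a proper JSON parser, as the paper does.
    static String extractField(String json, String name) {
        Pattern p = Pattern.compile(
                "\"" + Pattern.quote(name) + "\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");
        Matcher m = p.matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shape mirrors the Google Ajax Search response shown later
        String sample = "{\"responseData\":{\"results\":[{"
                + "\"url\":\"http://www.webopedia.com/TERM/N/network.html\","
                + "\"visibleUrl\":\"www.webopedia.com\"}]}}";
        System.out.println(extractField(sample, "url"));
        System.out.println(extractField(sample, "visibleUrl"));
    }
}
```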
4. The response data must be parsed to select the appropriate class and the associated URL for constructing the ontology. JSON is used to parse the response data. After parsing, the program is able to separate the websites returned for the keyword and to catch the similar or relevant words associated with the main keyword (here "Network"), which is explained in the next point. The code snippet used to parse the response data using JSON is as follows:

// Get the JSON response from the open connection
String line;
StringBuilder builder = new StringBuilder();
BufferedReader reader = new BufferedReader(new
        InputStreamReader(connection.getInputStream()));
while ((line = reader.readLine()) != null) {
    builder.append(line);
    out.println("\n\n" + line);
}
String response1 = builder.toString();

// For each result object j obtained from the parsed response,
// print the fields of interest
out.println("Title: " + j.getString("titleNoFormatting"));
out.println("<br>");
out.println("URL: " + j.getString("url"));
out.println("<br>");
out.println("Content: " + j.getString("content"));

Figure 6: Array output of the result after searching Google using the Google Ajax API with the search keyword "network":

{"responseData": {"results": [
  {"GsearchResultClass": "GwebSearch",
   "unescapedUrl": "http://www.webopedia.com/TERM/N/network.html",
   "url": "http://www.webopedia.com/TERM/N/network.html",
   "visibleUrl": "www.webopedia.com",
   "cacheUrl": "http://www.google.com/search?q\u003dcache:HGyueh94nIkJ:www.webopedia.com",
   "title": "What is \u003cb\u003enetwork\u003c/b\u003e? - A Word Definition From the Webopedia Computer \u003cb\u003e...\u003c/b\u003e",
   "titleNoFormatting": "What is network? - A Word Definition From the Webopedia Computer ...",
   "content": "This page describes the term \u003cb\u003enetwork\u003c/b\u003e and lists other pages on the Web where you can find additional information."},

5. The URLs for a class are chosen on the basis of the number of occurrences of the keyword in the content of the web page (the "content" field is extracted from the response data using JSON, as shown above). After the URLs for the class have been selected, candidate words, or subclasses, for constructing the hierarchy of the main keyword are selected according to their association with the main keyword. The program chooses words that are relevant to the main keyword, i.e. that are not prepositions, determiners etc., that have a minimum size (more than two characters), and that are represented in standard ASCII.

6. For each candidate word selected, a detailed analysis is conducted. The detailed analysis includes checking the number of occurrences of the word in the content of the website. After this analysis, an appropriate candidate key is selected on the basis of the depth (total number of occurrences) of the word.

7. After the candidate word is selected, a new keyword is formed by joining the candidate word and the initial keyword (e.g. "LAN Network"), and the entire process is repeated recursively. Each repetition has its own selection of candidates based on the constraints mentioned above. The recursive process is stopped when no search results are found for that word.

8. The final result is the hierarchy of classes and subclasses, which is the ontology. Each candidate word is chosen as a class name, and the URLs from which it was selected are stored with it. The websites associated with each class are the most appropriate ones.

5 EVALUATION AND RESULTS

As an example, "Network" is chosen as the initial keyword. As mentioned in Section 4, different constraints are set, such as the minimum size of the candidate word (more than two characters), the maximum number of appearances in the web page, etc. With these constraints the search is performed, and the resulting class hierarchy, visualized in Protégé, is shown in Figure 7. As an example, take the candidate keyword broadcast; the combined keyword
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 3, MARCH 2011, ISSN 2151-9617
Figure 7: Network hierarchy as shown in Protégé

Figure 8: URLs associated with the Facebook class as shown in Protégé
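The candidate-word constraints used throughout (more than two characters, no prepositions or determiners, plain ASCII) can be expressed as a small predicate. A sketch, with an illustrative stop-word list since the paper does not enumerate one:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class CandidateFilter {
    // Illustrative stop-word list; the paper excludes prepositions,
    // determiners etc. without listing them explicitly.
    static final Set<String> STOP =
        new HashSet<>(Arrays.asList("the", "a", "an", "of", "in", "on", "for"));

    // A word qualifies as a candidate class name if it is not a stop
    // word, is longer than two characters, and is plain ASCII.
    static boolean isCandidate(String w) {
        return !STOP.contains(w.toLowerCase())
            && w.length() > 2
            && w.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        System.out.println(isCandidate("broadcast")); // long enough, not a stop word
        System.out.println(isCandidate("of"));        // rejected: stop word
    }
}
```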
The mechanism also stores the appropriate URLs along with the class names, allowing the user to access the most representative websites for the keyword. As an example of URLs stored with the classes, the URLs of the subclass mail of web and social network are shown in Figure 8.
The Jambalaya plugin of Protégé shows a complete graphical view of the entire ontology in different formats. Example views, such as the nested tree map, the class and individual tree map, and the class tree, are shown in Figures 9, 10 and 11. The generated OWL file can be updated by repeating the whole process at regular intervals for the same main keyword; the URLs are updated accordingly. A small part of the OWL file generated by the current mechanism is given below:
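The OWL listing itself did not survive reproduction here. As an indication of the kind of fragment such a mechanism could emit, namely one class, its parent, and the stored source URL, here is a JDK-only sketch (element names and layout are illustrative, not the paper's actual output):

```java
public class OwlSketch {

    // Emits one OWL/XML-style class with a parent link and the source
    // URL kept as an rdfs:seeAlso annotation. Structure is illustrative.
    static String owlFragment(String parent, String child, String url) {
        return "<owl:Class rdf:ID=\"" + child + "\">\n"
             + "  <rdfs:subClassOf rdf:resource=\"#" + parent + "\"/>\n"
             + "  <rdfs:seeAlso rdf:resource=\"" + url + "\"/>\n"
             + "</owl:Class>";
    }

    public static void main(String[] args) {
        System.out.println(owlFragment("Network", "BroadcastNetwork",
                "http://www.webopedia.com/TERM/N/network.html"));
    }
}
```

A production implementation would build these axioms with the OWL API [22] rather than by string concatenation.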
Figure 10: "Class and Individual Tree map" as shown under Jambalaya in Protégé

Figure 11: "Class Tree" as shown under Jambalaya in Protégé

6 CONCLUSIONS & FUTURE WORK

Many authors have been working on ontology learning and construction from different kinds of structured information sources, like databases, knowledge bases or dictionaries [24, 25]; some authors are putting their effort into processing natural language texts [26]. Most ontologies are constructed on the basis of an explored domain or on structured information like databases. However, taking into consideration the amount of resources easily available on the Internet, the automatic creation of ontologies from unstructured documents like web pages is interesting and important. In this paper a methodology is chosen to automatically construct and update the ontology based on unstructured data from the web using a low-cost approach. The low-cost approach, for automatic construction of the ontology, uses publicly available search engines like Google through its API, the Google Ajax API search, and JSON to parse the data. The entire mechanism is implemented using Java and JSP.

ACKNOWLEDGMENT

This research has been supported and funded by CSIR, India under the Empower Scheme, grant no. OLP-2104-28.

REFERENCES
[1] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Scientific American, 284(5):28-37, 2001.
[2] D. Fensel, Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce, Springer Verlag, 2001.
[3] T.R. Gruber, "A Translation Approach to Portable Ontology Specification," Knowledge Acquisition, 1993.
[4] D. Urbansky, "Automatic Construction of a Semantic, Domain-Independent Knowledge Base."
[5] Extensible Markup Language (XML), W3C. http://www.w3.org/XML/
[6] Simple HTML Ontology Extensions (SHOE). http://www.cs.umd.edu/projects/plus/SHOE/
[7] Google. http://google.com
[8] Yahoo. http://yahoo.com
[9] Bing. http://bing.com
[10] Google AJAX Search API. http://code.google.com/apis/ajaxsearch/
[11] Google Groups (web-apis). http://groups.google.com/groups?group=google
[12] Yahoo Developer Network. http://developer.yahoo.com/
[13] Yahoo Directory Services. http://dir.yahoo.com/computers_and_internet/internet/directory_services/
[14] D.C. Hoong and R. Buyya, "Guided Google: A Meta Search Engine and its Implementation using the Google Distributed Web Services."
[15] Google API Proximity Search (GAPS). http://www.staggernation.com/gaps/readme.html
[16] JSON. http://www.json.org
[17] JSON Schema. http://json-schema.org
[18] Introducing RDF, part 2. http://www.linkeddatatools.com/introducing-rdf-part-2
[19] Introducing RDF, part 2. http://www.linkeddatatools.com/introducing-rdf-part-2
[20] Protégé 3.4.1. http://protege.stanford.edu/
[21] Jambalaya. http://www.thechiselgroup.org/jambalaya
[22] OWL API. http://sourceforge.net/projects/owlapi/
[23] http://img.shopping.com/cctool/WhatsIs/1/1399_20943.epi.html
[24] T. Finin and Z. Syed, "Creating and Exploiting a Web of Semantic Data Using Wikitology."
[25] D. Mukhopadhyay, A. Banik, and S. Mukherjee, "A Technique for Automatic Construction of Ontology from Existing Database to Facilitate Semantic Web."
[26] M.Y. Dahab, H.A. Hassan, and A. Rafea, "TextOntoEx: Automatic Ontology Construction from Natural English Text."

Kalyan Netti was born in Andhra Pradesh, India. He obtained a Master of Technology (M.Tech) in Computer Science and Engineering with specialization in Database Management Systems from JNTU, Andhra Pradesh, India, in 2004, and is currently pursuing a Ph.D (Computer Science and Engineering) in data-mining related areas. Kalyan Netti is interested in the following areas: Semantic Web technologies, ontologies, data interoperability, web mining, semantic heterogeneity, relational database systems, temporal databases and temporal data modeling.