
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 12, DECEMBER 2010 1781

XCDSearch: An XML Context-Driven Search Engine
Kamal Taha, Member, IEEE, and Ramez Elmasri, Member, IEEE

Abstract—We present in this paper, a context-driven search engine called XCDSearch for answering XML Keyword-based queries as
well as Loosely Structured queries, using a stack-based sort-merge algorithm. Most current research is focused on building
relationships between data elements based solely on their labels and proximity to one another, while overlooking the contexts of the
elements, which may lead to erroneous results. Since a data element is generally a characteristic of its parent, its context is determined
by its parent. We observe that we could treat each set of elements consisting of a parent and its children data elements as one unified
entity, and then use a stack-based sort-merge algorithm employing context-driven search techniques for determining the relationships
between the different unified entities. We evaluated XCDSearch experimentally and compared it with five other search engines. The
results showed marked improvement.

Index Terms—XML keyword search, keyword search, keyword-based querying, XML search engine.

1 INTRODUCTION

THERE is an ever-growing availability of semistructured information on the Web and in digital libraries. Increasingly, users, both expert and nonexpert, have access to text documents equipped with some semantic hints through XML markup. How can we query such data? We could query the documents using a database approach, which performs exact-match (e.g., XQuery [37]). But here recall is often too low, and we also need to learn the query language in order to issue a query. There are two alternatives: to issue either a Keyword-based query or a Loosely Structured query. The popularity of Keyword-based querying stems from the fact that it is user-friendly and does not require the user to learn a query language or to be aware of the structure of the underlying data. Loosely Structured querying allows combining some structural constraints within a Keyword query by specifying the context where a search term should appear (combining keywords and element names). That is, it requires the user to know only the labels of elements containing the keywords, but does not require him or her to be aware of the structure of the underlying data. Thus, Loosely Structured querying combines the convenience of Keyword-based querying while enriching queries by adding structural conditions, which leads to performance enhancement. For example, the difference between the queries “Julie” and “Julie as author” can be clearly stated. Due to the ambiguity of the first query and the disambiguation of the second, a reduction in the number of nonrelevant documents results in increased precision. Kamps et al. [18] found that over half of the queries can be expressed in a fielded search-like format that does not use the hierarchical structure of the documents. They require only that certain keywords occur in elements with certain labels. Thus, they can be expressed using Loosely Structured querying.

Business executives and employees need flexible access to vast quantities of information. Since they are likely to be aware of some labels of elements/attributes containing data and are unlikely to be fully aware of the structure of the underlying data, they can access the data using Loosely Structured queries. On the other hand, business customers are most likely not aware of the elements’ labels or the structure of the data. Thus, Keyword-based querying meets their needs. Some Internet-based businesses let their customers issue Loosely Structured queries by providing them with graphical user interfaces containing menus (e.g., drop-down menus and check boxes) and search fields (for submitting keywords). Each entry in a menu represents an element, and its name depicts the element’s label. For example, if the user selects from a drop-down menu the entry “book” and then enters in the search field a book’s title, the system will search the XML document for an element labeled “book” containing the specified title. This search behavior is directly analogous to the user’s providing keywords and also supplying the target elements. Most Internet searches are based on this type of controlled Loosely Structured querying. Examples of existing applications are Books247 [9] and mydeco [23]. The granularity of search on Books247 is three levels of the document tree: book, chapter, and section. In mydeco, the user can choose an entry such as “Designer” from a list and then choose a value such as “Charles” from check boxes. This represents the query “Charles as Designer,” where “Charles” is a value contained in an element labeled “Designer.” Online digital libraries also typically provide advanced search forms, allowing users to narrow their search by specifying structural hints in their queries. For example, sites such as Wiley InterScience [31] allow searching by author, article, title, and publication information.

• K. Taha is with the Department of Software Engineering, Khalifa University of Science, Technology & Research (KUSTAR), PO Box 127788, Abu Dhabi, UAE. E-mail: kamal.taha@kustar.ac.ae.
• R. Elmasri is with the Department of Computer Science and Engineering, The University of Texas at Arlington, 335 Nedderman Hall, 416 Yates, Box 19015, Arlington, TX 76019. E-mail: elmasri@uta.edu.

Manuscript received 6 June 2009; revised 29 Oct. 2009; accepted 4 Nov. 2009; published online 9 Dec. 2009.
Recommended for acceptance by T. Grust.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2009-06-0487.
Digital Object Identifier no. 10.1109/TKDE.2009.210.
1041-4347/10/$26.00 © 2010 IEEE Published by the IEEE Computer Society

The above type of controlled Loosely Structured querying is infeasible for querying data with large and deep schema structures. Research works such as [10], [19] propose techniques for XML Loosely Structured querying, where the user provides search terms consisting of label-keyword pairs. Computing the Lowest Common Ancestor (LCA) of elements containing keywords is the common denominator among these proposed techniques. Despite the success of the proposed search engines, they suffer from recall and precision limitations. The reason is that they employ mechanisms for building relationships between data elements based solely on their labels and proximity to one another while overlooking the contexts of the elements. The context of a data element is determined by its parent, because a data element is generally a characteristic of its parent. If, for example, a data element is labeled “title,” we cannot determine whether it refers to a book title or a job title without referring to its parent. Consider as another example an XML document containing two elements labeled “name,” one of them referring to the name of a student and the other to the student’s school name. Building a relationship between these two “name” elements without consideration of their parents may lead to the incorrect conclusion that they belong to the same type. We propose in this paper a search engine called XCDSearch that avoids the pitfalls of non-context-driven search engines such as the ones cited above. It answers XML Keyword-based queries and Loosely Structured queries, and it employs novel context-driven search techniques using a stack-based sort-merge algorithm.

The framework of XCDSearch treats each set of elements consisting of a parent and its children data elements as one unified entity. It then uses a stack-based sort-merge algorithm employing context-driven search techniques to determine the semantic relationships among the different unified entities. Let $e_i$ be a unified entity containing a data element labeled $n_x$ and let $e_j$ be a unified entity containing a data element labeled $n_y$. If $e_i$ and $e_j$ are semantically related, data elements $n_x$ and $n_y$ are also related, and vice versa. Consider an XML tree containing interior elements labeled book, job, student, and school. Also consider that each of the elements book and job has a child data element labeled title and that each of the elements student and school has a child data element labeled name. XCDSearch will treat each of the following sets of elements as one unified entity: {book, title}, {job, title}, {student, name}, and {school, name}. Therefore, XCDSearch will be able to determine that the two elements labeled title are not semantically identical since they refer to two different types of entities; likewise, the two name elements refer to two different types of entities. We make the following contributions in this paper:

• We propose novel mechanisms for determining the semantic relationships between different unified entities by using a stack-based sort-merge algorithm and employing context-driven search techniques.
• We propose mechanisms for answering Loosely Structured queries and Keyword-based queries.
• We experimentally evaluate the quality and efficiency of XCDSearch and compare it with five systems.

The rest of this paper is organized as follows: In Section 2, we define basic concepts used in the paper. In Section 3, we describe our context-driven search techniques. In Section 4, we describe an algorithm that determines the relationships between unified entities. Section 5 shows how a query is answered. In Section 6, we analyze our algorithm and existing algorithms. Section 7 shows the system architecture. In Section 8, we present the experimental results. Section 9 presents our conclusion and related work.

2 CONCEPTS USED IN XCDSEARCH

We model XML documents as rooted and labeled trees. A tree $t$ is a tuple $t = (n, e, r, \lambda_t)$, where $n$ is the set of nodes, $e \subseteq n \times n$ is the set of edges, $r$ is the root node of $t$, and $\lambda_t : n \to \Sigma$ is a node-labeling function where $\Sigma$ is an alphabet of node labels. A node in a tree represents an element in an XML document. We use the term “data node” to denote a node of a tree data structure that has no child node and always has a value. Nodes are numbered for easy reference. XCDSearch accepts Keyword-based queries with the form Q(“$k_1$”, “$k_2$”, ..., “$k_n$”) and Loosely Structured queries with the form Q($l_{k_1}$ = “$k_1$”, ..., $l_{k_n}$ = “$k_n$”, $R_1$?, ..., $R_n$?), where $k_i$ denotes a keyword, $l_{k_i}$ denotes the label of the node containing the keyword $k_i$, and $R_i$ denotes a return/result node.

The structure of an XML document can be partitioned into multiple units (e.g., subtrees), where each unit is associated with some document contents [20]. Thus, the framework of XCDSearch partitions XML trees into subtrees, where each consists of a parent and its children data nodes and is treated as one unit. The idea of viewing each such set as one logical entity is useful as it enables filtering many unrelated groups of nodes when answering a query. Compared to filtering individual nodes, this methodology leads to more accurate results and less processing time. Each subtree is treated as a unified entity called a Canonical Tree (CT), which is a metaphor of real-world entities. Two real-world entities may have different names but belong to the same type, or they may have the same name but refer to two different types. To overcome that labeling ambiguity, we observe that if we cluster Canonical Trees based on the ontological concepts of the parent nodes’ component of the Canonical Trees, we will identify a number of clusters. That is, each cluster contains Canonical Trees whose parent nodes’ component belong to the same ontological concept. Consider, for example, Fig. 2. Using this clustering scheme, we will be able to determine that the two Canonical Trees whose parent nodes are nodes 4 (“paper”) and 24 (“article”) fall under the same cluster, since both “paper” and “article” belong to the same ontological concept of “publication.” We will also be able to determine that the two data nodes labeled “name” (nodes 2 and 7) are not semantically identical (they refer to two different types of entities), since they belong to Canonical Trees falling under different clusters: the ontological concepts of “student” and “conference” are “person” and “publication proceedings,” respectively. We use the term Ontology Label (OL) to refer to the ontological concept of a parent node. We now formalize the Ontology Label and Canonical Tree concepts.

Definition 2.1 (OL and Ontology Label Abbreviation (OLA)). Let $m$ “is-a” $m'$ denote that class $m$ is a subclass of class $m'$ in an Object-Oriented (OO) ontology.

For example, a student “is a” person. $m'$ is the most general superclass (root node) of $m$ in a defined ontology hierarchy. If $m$ is an interior node’s label, $m'$ is called the Ontology Label of $m$ and is expressed as $OL(m) = m'$. Fig. 1 shows an example of ontology hierarchy. For example, since

student is a subclass of person (see Fig. 1b), the Ontology Label of node student(1) in Fig. 2 is expressed as OL(student) = person. $m'$ is a cluster set that contains entities sharing the same domain, properties, and cognitive characteristics (e.g., the cluster person contains the entities of student, author, etc.). The framework of XCDSearch applies the above clustering concept to all parent nodes in an XML tree, and the label of each of these clusters is an OL. Table 1 shows the Ontology Labels and clusters of parent nodes in the XML tree in Fig. 2. The table is an alternative representation of the information in Fig. 1. We abbreviate each OL to a letter called an OLA. Table 1 shows the OLAs of the OLs in the table.

Fig. 1. Example of ontology hierarchy.

TABLE 1. OLs and OLAs of the Parent Nodes in Fig. 2

Fig. 2. A graduate school’s authors and coauthors bibliography XML tree. The paper (node 4) was authored by a student (node 1) and coauthored by a contributing student (node 9) and a reviewing professor (node 20). The paper (node 12) was authored only by the contributing student (node 9). The article (node 24) was authored only by the reviewing professor (node 20).

Definition 2.2 (CT). A Canonical Tree $T$ is a pair, $T = (OL(n'), N)$, where $OL(n')$ is the Ontology Label of an interior node $n'$ and $N$ is a finite set of data nodes and/or attributes. Let $(n', n)$ denote that there is an edge from node $n'$ to node $n$ in the XML tree. $N = \{n \mid n \text{ is a data node and } (n', n), \text{ or } n \text{ is an attribute of } n'\}$. In Fig. 2, for example, the parent node student(1) and its child data node name(2) constitute a Canonical Tree. The parent node component student(1) is represented in the Canonical Tree by its Ontology Label “person” (see $T_1$ in Fig. 3). The Ontology Label of a Canonical Tree is the Ontology Label of the parent node component of the Canonical Tree. For example, the Ontology Label of Canonical Tree $T_1$ in Fig. 3 is the Ontology Label of the parent node component student(1), which is “person.” A Canonical Tree is represented by a rectangle and is labeled with a numeric ID. For example, in Fig. 3, the label of the root Canonical Tree is $T_1$. We use the abbreviation “CT” throughout the paper to denote “Canonical Tree.”

A CTG is a hierarchical representation depicting the relationships between CTs. Fig. 3 shows a CTG depicting the relationships between the CTs constructed from the XML tree in Fig. 2. Let $n$ and $n'$ be two interior nodes in an XML tree. They are the parent nodes of CTs $T$ and $T'$, respectively. CTs $T$ and $T'$ have a parent-child relationship in the CTG if there does not exist in the XML tree any node $n''$ on the path from $n$ to $n'$ where $n''$ has children data nodes or attributes. In Fig. 2, for example, since node paper(4) is a descendant of node student(1) and since there is no interior node in the path from student(1) to paper(4) that has children data nodes or attributes, the CT whose parent node component is node 4 ($T_2$ in Fig. 3) is a child of the CT whose parent node component is node 1 ($T_1$ in Fig. 3).

Definition 2.3 (CTG). A CTG is a pair of sets, $CTG = (T_S, E)$, where $T_S$ is a finite set of CTs and $E$, the set of edges, is a binary relation on $T_S$, so that $E \subseteq T_S \times T_S$. $T_S = \{T_i \mid T_i \text{ is a CT and } 1 \le i \le |T_S|\}$. The CTG is constructed as follows: If the two interior nodes $n, n'$ in the XML tree both have children data nodes and/or attributes, and if either 1) $(n, n')$ is an edge in the XML tree or 2) $n$ is an ancestor of $n'$, and there does not exist any node $n''$ on the path from $n$ to $n'$ where $n''$ has children data nodes or attributes, then $(T_1, T_2)$ will be an edge in the CTG where $T_1 = (OL(n), N_1)$ and $T_2 = (OL(n'), N_2)$. $N_1$ and $N_2$ are the sets of children data nodes/attributes of $n$ and $n'$, respectively.

Definition 2.4 (Dewey ID). Each CT is labeled with a Dewey number-like label called a Dewey ID. A Dewey ID of CT $T_i$ is a sequence of components, each having the form $OLA_x$ where

OLA is an Ontology Label Abbreviation (recall Definition 2.1 and Table 1) and $x$ denotes a subscript digit. Each OLA represents the Ontology Label of an ancestor CT $T_j$ of a CT $T_i$, and the digit $x$ represents the number of CTs preceding $T_j$ in the graph using Depth First Search, whose Ontology Labels are the same as the Ontology Label of $T_j$. When the sequence of components $OLA_x$ in the Dewey ID of CT $T_i$ are read from left to right, they reveal the chain of ancestors of $T_i$ and their Ontology Labels, starting from the root CT. The last component reveals the Ontology Label of $T_i$ itself. Consider, for example, CT $T_4$ in Fig. 3. Its Dewey ID $p_0.b_0.p_1$ reveals that the Dewey ID of the root CT is $p_0$ and its Ontology Label is “person.” It also reveals that the Dewey ID of the parent of $T_4$ is $p_0.b_0$ and its Ontology Label is “publication.” The last component $p_1$ reveals the Ontology Label of $T_4$, which is “person,” and the subscript 1 attached to $p$ indicates that there is one CT preceding $T_4$ in the graph using Depth First Search, whose Ontology Label is “person.”

Fig. 3. Canonical Trees Graph (CTG) depicting the relationships between the CTs constructed from the XML tree presented in Fig. 2.

Definition 2.5 (Semantically Related CTs). CTs $T$ and $T'$ are semantically related if the paths from $T$ and $T'$ to their LCA, not including $T$ and $T'$, do not contain more than one CT with the same Ontology Label. The LCA of $T$ and $T'$ is the only CT that contains the same OL in the two paths to $T$ and $T'$.

3 CONTEXT-DRIVEN TECHNIQUES

We present in this section our context-driven search techniques. We first present notations of key concepts.

Notation 3.1 (Keyword Context (KC)). KC is a CT containing a keyword of a query. That is, one of the data nodes of the KC holds a value matching one of the query’s keywords. Consider, for example, Fig. 3 and the query Q(“XML”). CTs $T_2$ and $T_9$ are the KCs for the query.

Notation 3.2 (Intended Answer Node (IAN)). IAN is a data node in the XML tree containing the data that the user is looking for. Consider, for example, Fig. 2 and the query Q(“Julie Smith”, “VLDB”). As the semantics of the query implies, the user wants to know information about the publication(s) authored by “Julie Smith” (node 10) and appeared in “VLDB” conference proceedings (node 15). This information is contained in nodes 13 and 16. Therefore, each of these two nodes is called an IAN.

Notation 3.3 ($OL_T$ and $OL_{KC}$). $OL_T$ denotes the Ontology Label of CT $T$. For example, $OL_{T_1}$ is “person.” $OL_{KC}$ denotes the Ontology Label of the KC.

We call each CT that can contain an IAN for a KC an Immediate Relative of the KC. Consider, for example, Figs. 2 and 3 and the Keyword-based query Q(“XQuery”). XQuery is a title of a paper and is contained in node 13. It is intuitive that data nodes 10, 15, and/or 16 be IANs, but it is not intuitive that data node 2 be an IAN since “Tom Wilson” (node 2) did not author that paper. Since “XQuery” is contained in CT $T_5$, we can determine that the CTs containing nodes 10, 15, and 16 are Immediate Relatives of $T_5$, while the CT containing node 2 is not an Immediate Relative of $T_5$. We denote the set of CTs that are Immediate Relatives of a KC by $IR_{KC}$. We use the notation “$T \in IR_{KC}$” to denote that CT $T$ is an Immediate Relative of the KC. A CT can contain an IAN if it has strong association with the KC. Thus, $IR_{KC}$ is a set of CTs that have strong associations with the KC. The $IR_{KC}$ concept enables the search to be focused on a specific part of the XML document, which enhances the accuracy of results and efficiency of the search.

Because there are many abbreviations of concepts in the paper, we summarize them in Table 2 for easy reference.

Semistructured data encompass most of the modeling features of the OO model [21], and static aspects from the OO modeling semantics can be found in XML data modeling [32], [12]. Therefore, many researchers took advantage of the richness of OO conceptual modeling to model XML data and describe its complex interrelationships [12], [25], [32], [33], [36]. The concept of existence dependency was first proposed for Entity-Relationship modeling [13], and has been adapted for use in OO conceptual modeling [32]. Existence dependency has some correspondences with the $IR_{KC}$ concept. An object $x$ is existence dependent on an object $y$ if the existence of $x$ is dependent on the existence of $y$ [32]. Following this

definition, we can see the closeness between the existence dependency and $IR_{KC}$ concepts: both denote that an object has a strong association with another object (in our case, the KC). All CTs that are existence dependent on the KC belong to $IR_{KC}$, but not necessarily all CTs belonging to the $IR_{KC}$ are existence dependent on the KC. In this section, we propose three rules for the determination of $IR_{KC}$. The first rule will be validated through the existence dependency concept. The three rules will be verified heuristically. A formal strategy of deriving $IR_{KC}$ will be given later in the section.

TABLE 2. Abbreviations of Concepts (Abb Denotes Abbreviation)

The first rule we propose is as follows: If CT $T \in IR_{KC}$, then $OL_T \neq OL_{KC}$. We are going to validate this rule by checking whether it conforms to the structural characteristics of existence dependency. Snoeck and Dedene [27] argue that the existence dependency relation is a partial ordering of object types. The authors transform an OO schema into a graph consisting of the object types found in the schema and their relations. The object types in the graph are related only through associations that express existence dependency. Notice the resemblance between the concept of this graph and the concept of the CTG, where both are type-oriented (an Ontology Label can be viewed as a type of a CT). The authors demonstrated through the graph that an object type is never existence dependent on itself. That is, if the two objects $O_i$ and $O_j$ belong to the same type, $O_i$ cannot be dependent on $O_j$ and vice versa. This finding is in agreement with our proposed rule if we view a CT as an object and its Ontology Label as the object’s type. Thus, if a CT $T$ has the same Ontology Label as the KC, $T$ can never be existence dependent on the KC; therefore, it can never be its Immediate Relative.

After validating the first rule, we now verify it heuristically: verify that CT $T$ may contain an IAN for a KC (which means $T \in IR_{KC}$) if $OL_T \neq OL_{KC}$. Let $T_i$ and $T_j$ be two distinct CTs having the same Ontology Label. Therefore, the two CTs share common entity characteristics, and some of their data nodes are likely to have the same labels. Let $n_1, n_2, n_3, n_4, n_5$, and $n_6$ be data nodes, where $n_1, n_2, n_3 \in T_i$ and $n_4, n_5, n_6 \in T_j$. Let $n_1$ and $n_4$ have the same label $l_1$, $n_2$ and $n_5$ have the same label $l_2$, $n_3$ has the label $l_3$, and $n_6$ has the label $l_4$. Let $d_{m}^{m'}$ denote the distance between data nodes $m$ and $m'$ in the XML tree. Now consider the query Q($l_1$ = “$k_i$”, $l_2$?): The keyword “$k_i$” is contained in data node $n_1 \in T_i$ (the KC is $T_i$) and $l_2$ is the label of the IAN. Intuitively, the IAN is $n_2 \in T_i$ and not $n_5 \in T_j$, because $d_{n_1}^{n_2} < d_{n_1}^{n_5}$. If the label of the IAN in the same query is $l_3$ (instead of $l_2$), then obviously the IAN is $n_3 \in T_i$. However, if the label of the IAN in the same query is $l_4$, then the query is meaningless and unintuitive. Now consider the query Q($l_3$ = “$k_i$”, $l_1$?). Intuitively, the IAN is $n_1$ and not $n_4$ due to the proximity factor. If the label of the IAN in the same query is $l_2$ (instead of $l_1$), intuitively the IAN is $n_2$ and not $n_5$. Thus, we can conclude that in order for the query to be meaningful and intuitive, the IAN cannot be contained in CT $T_j$ if the keyword is contained in CT $T_i$. In other words, an IAN of a query cannot be contained in a CT whose Ontology Label is the same as the Ontology Label of the KC, unless this CT is the KC itself. If the IAN is contained in a CT $T$, then either $OL_T \neq OL_{KC}$ or $T$ is the KC.

The following is the second rule we propose for determining $IR_{KC}$. If CT $T$ is an Immediate Relative of a KC, then $OL_T \neq OL_{T'}$ where $T'$ is a CT located between $T$ and the KC in the CTG. We can verify this rule as follows: Let: 1) CT $T' \in IR_{KC}$, 2) $T'$ be a descendant of the KC, and 3) CT $T$ be a descendant of $T'$. In order for $T$ to be an Immediate Relative of the KC, intuitively $T$ has to be an Immediate Relative of $T'$, because $T'$ relates (connects) $T$ with the KC. If $T$ and $T'$ have the same OL, then $T \notin IR_{T'}$ (according to the first rule). Therefore, in order for $T$ to be an Immediate Relative of the KC, $OL_T \neq OL_{T'}$.

The following is the third rule we propose for determining $IR_{KC}$. If CT $T' \notin IR_{KC}$ and CT $T$ is related to the KC through $T'$ then $T \notin IR_{KC}$. We can validate this rule as follows: A KC has a domain of influence. This domain covers CTs whose degree of relativity to the KC is strong. Actually these CTs are the Immediate Relatives of the KC. If CT $T' \notin IR_{KC}$, then the degree of relativity between $T'$ and the KC is weak. Intuitively, the degree of relativity between any other CT $T$ and the KC is even weaker if $T$ is related to the KC through $T'$, due to the proximity factor.

Based on the above, we now present Definition 3.1 to formalize the Immediate Relatives concept.

Definition 3.1 (Immediate Relatives of a KC ($IR_{KC}$)). The Immediate Relatives of a KC is a set $IR_{KC}$; $IR_{KC} = \{T \mid T$ is a CT, where $OL_T \neq OL_{KC}$ and $OL_T \neq OL_{T'}$, where $T'$ is a CT located between $T$ and the KC in the CTG$\}$.

We can determine $IR_{KC}$ by pruning from the CTG all CTs $\notin IR_{KC}$, and the remaining ones would be $IR_{KC}$. We present below three properties that regulate the pruning process. The properties are inferred from Definition 3.1.

Property 1. When computing $IR_{KC}$, we prune from the CTG any CT whose Ontology Label is the same as the Ontology Label of the KC.

Property 2. When computing $IR_{KC}$, we prune CT $T'$ from the CTG if: 1) there is another CT $T''$ located between $T'$ and the KC, and 2) the Ontology Label of $T''$ is the same as the Ontology Label of $T'$.

Property 3. When computing $IR_{KC}$, we prune from the CTG any CT that is related (connected) to the KC through a CT $T$, $T \notin IR_{KC}$.

We now present examples to show how the Immediate Relatives of a CT can be determined.

Example 1. Let us determine the Immediate Relatives of CT $T_8$ (recall Fig. 3). By applying Property 1, CTs $T_1$ and $T_4$ are pruned because their Ontology Labels are the same as the Ontology Label of $T_8$. By applying Property 3, CT $T_{12}$ is pruned because it relates to CT $T_8$ through the pruned CT $T_1$, and CTs $T_5$, $T_6$, and $T_7$ are pruned because they relate to CT $T_8$ through the pruned CT $T_4$. The remaining CTs in the CTG are the IR of $T_8$ (see Fig. 4).

Fig. 4. $IR_{T_8}$.

Example 2. Let us determine the IR of $T_{10}$. By applying Property 2, CT $T_2$ is pruned because it is located in the path $T_1 \to T_2 \to T_8 \to T_9 \to T_{10}$ and its Ontology Label is the same as the Ontology Label of CT $T_9$, which is closer to $T_{10}$. By applying Property 3, CTs $T_1$, $T_{12}$, $T_3$, $T_4$, $T_5$, $T_6$, and $T_7$ are pruned because they relate to $T_{10}$ through the pruned $T_2$. The remaining CTs in the CTG are $IR_{T_{10}}$ (see Fig. 5).

Example 3. Figs. 6, 7, 8, and 9 show $IR_{T_3}$, $IR_{T_1}$, $IR_{T_6}$, and $IR_{T_4}$.

Fig. 7. $IR_{T_1}$.
Fig. 8. $IR_{T_6}$.

The naïve approach for computing $IR_{KC}$ is to apply the three properties to all CTs in the CTG. The time complexity of this approach for computing the IR of all CTs in the graph is $O(|T|^2)$. We constructed an efficient algorithm called ComputeIR (see Fig. 10), which works as follows: To compute $IR_{KC}$, instead of examining each CT in the graph, we only examine the CTs that are adjacent to any CT $T'$, $T' \in IR_{KC}$. That is, if CT $T' \in IR_{KC}$, the algorithm will examine the CTs that are adjacent to $T'$; otherwise, it will not examine any CT $T''$ that is connected to the KC through $T'$, because $T''$ will not be an IR of the KC according to Property 3. The algorithm’s time complexity is $O\left(\sum_{i=1}^{|T|} |IR_{T_i}|\right)$.

Fig. 5. $IR_{T_{10}}$.
Fig. 6. $IR_{T_3}$.
Fig. 9. $IR_{T_4}$.
Fig. 10. Algorithm ComputeIR.

4 DETERMINING RELATED KCS

The process of answering a query goes through three phases. In the first phase, the system locates the KCs. In the second phase, which is described in this section, the system selects from these KCs subsets, where each subset contains the smallest number of KCs that: 1) are closely related to each other, and 2) contain at least one occurrence of each keyword. The KCs in each subset are called Related Keyword Contexts (RKC). In the third phase, the system locates the IANs in the intersection of the Immediate Relatives of the KCs composing the RKC.

Definition 4.1 (RKC). RKC is a set of KCs, where 1) for each two Keyword Contexts $KC_i, KC_j \in RKC$, $KC_i \in IR_{KC_j}$, 2) the set contains at least one occurrence of each keyword, and 3) there are no two distinct KCs $\in RKC$ that contain the same keywords.

We constructed an algorithm called RKClookup (see Fig. 11) for computing RKC. It employs a stack-based



sort-merge approach and the three properties described in Section 3. The input to the algorithm is an array called contexts, which contains the Dewey IDs of the KCs. Each iteration of the algorithm produces a new stack state. Each stack entry has a pair of components (a Dewey ID component and a keywords component). A Dewey ID component is an $OLA_x$ (recall Definition 2.4). If $OLA_i, OLA_j, \ldots$, and $OLA_k$ are the Dewey ID components in the stack from the bottom entry up to a given stack entry, then 1) that stack entry represents the CT whose Dewey ID is $OLA_i.OLA_j.\ldots.OLA_k$ and 2) the bottom entry represents the root CT, whose Dewey ID is $OLA_i$. Consider Fig. 3 and the stack in Fig. 17b. The top stack entry represents CT $p_0.b_0.p_1$, the middle entry represents CT $p_0.b_0$, and the bottom entry represents CT $p_0$ (the root of Fig. 3). The keywords component is defined as an array of length $m$, where $m$ is the number of keywords of the query. Each field of keywords represents one of the query’s keywords. Let $k_1, k_2, \ldots, k_m$ be the query’s keywords. keywords[i] represents keyword $k_i$. If a CT $T$ contains the keyword $k_i$, we represent that by storing the top Dewey ID component of $T$ in keywords[i]. For example, CT $p_0.b_0.p_1$ in Fig. 3 is represented by the top stack entry in Fig. 17b, and the keyword “Smith” that the CT contains is represented by storing the top Dewey ID component of the CT (the component $p_1$) in the first field of keywords.

Fig. 11. Algorithm RKClookup.
Fig. 12. Subroutine PushEntries.
Fig. 13. Subroutine OLAin.
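The stack layout described above can be illustrated with a small sketch. It assumes that a Dewey ID is serialized as a dot-separated string such as "p0.b0.p1" and uses 0-based list indices; the names `StackEntry`, `build_stack`, `dewey_id_of`, and `record_keyword` are hypothetical and are not taken from Fig. 11 or its subroutines.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StackEntry:
    component: str  # one Dewey ID component (an OLA with its digit), e.g. "p1"
    # keywords[i] holds a Dewey ID component if keyword k_i was found, else None
    keywords: List[Optional[str]] = field(default_factory=list)


def build_stack(dewey_id: str, num_keywords: int) -> List[StackEntry]:
    """One entry per Dewey ID component; index 0 is the bottom (root CT)."""
    return [StackEntry(c, [None] * num_keywords) for c in dewey_id.split(".")]


def dewey_id_of(stack: List[StackEntry], index: int) -> str:
    """Dewey ID of the CT represented by stack[index]: the components read
    from the bottom entry up to that entry."""
    return ".".join(entry.component for entry in stack[: index + 1])


def record_keyword(stack: List[StackEntry], keyword_index: int) -> None:
    """Mark that the CT represented by the top entry contains keyword k_i by
    storing its top Dewey ID component in keywords[i]."""
    top = stack[-1]
    top.keywords[keyword_index] = top.component


# Query keywords borrowed from Example 4; "Smith" is the first keyword.
query = ["Smith", "XML", "VLDB"]
stack = build_stack("p0.b0.p1", len(query))
record_keyword(stack, query.index("Smith"))
```

Here the top entry stands for CT p0.b0.p1, the middle one for p0.b0, and the bottom one for the root p0, and the first field of the top entry's keywords array holds p1, mirroring the "Smith" example in the text.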
In each iteration of the algorithm, the KC currently being processed is called the Current Context (CC) (line 2), and the previously processed KC is called the Prior Context (PC). Line 3 computes $q$, the number of leading Dewey ID components that the CC and PC have in common. For example, if the CC is $p_0.b_0.p_1$ and the PC is $p_0.b_0$, then $q = 2$. Entry stack[q] represents the LCA of the CC and PC. If $q$ equals the stack size (line 4), the CC is a descendant of the PC, and subroutine PushEntries is called (line 5); otherwise, subroutine PopAndPushEntries is called (line 6). The time complexity of the algorithm is $O(h\,|KC|)$, where $h$ is the maximum depth of the CTG.

Fig. 14. Subroutine PopAndPushEntries.
We now describe the algorithm’s subroutines shown in
Figs. 12, 13, 14, 15, and 16. Subroutine PushEntries (Fig. 12)
pushes into the stack the top components of CC that do not
match the top components of PC. By applying Properties 1
and 2, subroutine PopAndPushEntries (Fig. 14) first finds
the closest descendant CT T of the LCA of CC and PC that is
not an Immediate Relative of the LCA ðT 62 IRLCA Þ. It then
pops from the stack the entry representing T and all entries
located above it (to satisfy Property 3). Subroutine popIR
(Fig. 15) pops the remaining entries located above stack[q],
which represent CTs that are Immediate Relatives of the Fig. 15. Subroutine popIR.
LCA. If all the fields of the keywords of a popped entry are
occupied, the array is output (line 4) and its contents
represent RKC. Example 4 illustrates these concepts.
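The stack-entry bookkeeping described above can be sketched in Python as follows. This is only an illustrative sketch, not the pseudocode of Figs. 15 and 16: the names StackEntry, new_entry, is_answer, and pass_keywords_to are hypothetical, and the rule that a popped entry never overwrites a field its parent already holds is an assumption made here for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StackEntry:
    """One stack entry: a Dewey ID component plus a keywords array."""
    component: str                                   # e.g. "p1"
    keywords: List[Optional[str]] = field(default_factory=list)

    def is_answer(self) -> bool:
        # An entry yields an RKC only when every keyword field is occupied.
        return bool(self.keywords) and all(k is not None for k in self.keywords)

    def pass_keywords_to(self, parent: "StackEntry") -> None:
        # Copy occupied fields into the parent entry; fields the parent
        # already holds are left untouched (assumption of this sketch).
        for i, k in enumerate(self.keywords):
            if k is not None and parent.keywords[i] is None:
                parent.keywords[i] = k


def new_entry(component: str, m: int) -> StackEntry:
    """Create an entry for a query with m keywords, all fields empty."""
    return StackEntry(component, [None] * m)


if __name__ == "__main__":
    # Three-keyword query; each entry holds one keyword of its own CT.
    low, mid, top = new_entry("p2", 3), new_entry("b2", 3), new_entry("s2", 3)
    low.keywords[0], mid.keywords[1], top.keywords[2] = "p2", "b2", "s2"
    top.pass_keywords_to(mid)   # pop the top entry, hand its keywords down
    mid.pass_keywords_to(low)   # pop the next entry
    print(low.keywords, low.is_answer())
```

Storing the Dewey ID component (rather than a Boolean flag) in each field means the occupied array itself identifies which CTs contributed each keyword.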
Fig. 16. Subroutine isAnswer.

Example 4. Consider Fig. 2 and the Keyword-based query Q(“Smith”, “XML”, “VLDB”), which asks for information about an author whose last name is “Smith” and who authored a publication titled “XML” that appeared in “VLDB” conference proceedings/journal. There are two candidate answer subtrees. The first is rooted at node 4 and includes nodes 5, 10, and 15. The second is rooted at node 20 and includes nodes 21, 25, and 27. The first tree is an incorrect answer because the publication titled “XML” (node 5) and authored by “Julie Smith” (node 10) was published in “EDBT” (node 7) and not in “VLDB.” Instead, Julie’s publication titled “XQuery” (node 13) is the one that was published in “VLDB.” The second answer subtree is correct because “Joe Smith” (node 21) authored a publication titled “XML” (node 25) that appeared in “VLDB” (node 27). First, the Dewey IDs of the KCs are stored in the array contexts: contexts = [p0.b0, p0.b0.p1, p0.b0.p1.b1.s1, p0.b0.p2, p0.b0.p2.b2, p0.b0.p2.b2.s2]. The stack is initially empty. Below are the algorithm’s steps that created the stack states in Fig. 17.
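The push-or-pop decision taken for each KC in this example can be traced with a short Python sketch. It covers only the computation of q and the choice of subroutine, not the subroutines themselves. The function names are hypothetical, and the sketch assumes that, before a CC is processed, the stack holds exactly one entry per Dewey component of the PC, as is the case in the stack states of this example.

```python
def common_prefix_len(cc, pc):
    """Number of leading Dewey components shared by CC and PC (the value q)."""
    q = 0
    for a, b in zip(cc, pc):
        if a != b:
            break
        q += 1
    return q


def trace(contexts):
    """For each KC, report q and which subroutine would be called."""
    steps = []
    pc = []  # components of the Prior Context; empty before the first KC
    for dewey in contexts:
        cc = dewey.split(".")
        q = common_prefix_len(cc, pc)
        # Assumption: stack size equals the number of components of the PC.
        action = "PushEntries" if q == len(pc) else "PopAndPushEntries"
        steps.append((dewey, q, action))
        pc = cc
    return steps


if __name__ == "__main__":
    contexts = ["p0.b0", "p0.b0.p1", "p0.b0.p1.b1.s1",
                "p0.b0.p2", "p0.b0.p2.b2", "p0.b0.p2.b2.s2"]
    for step in trace(contexts):
        print(step)
```

Under these assumptions, only the fourth KC (p0.b0.p2) triggers PopAndPushEntries, with q = 2; every other KC is a descendant of its PC.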

Fig. 17. States of stack, where S stands for “Smith,” X stands for “XML,” and V stands for “VLDB.” (a) CKC is p0.b0. (b) CKC is p0.b0.p1. (c) CKC is p0.b0.p1.b1.s1. (d) CKC is p0.b0.p2. (e) CKC is p0.b0.p2.b2. (f) CKC is p0.b0.p2.b2.s2. (g) Answer state. (h) RKC corresponding to the answer state.

First state (Fig. 17a): The CC is p0.b0. The components p0 and b0 are pushed into the stack. Keyword “XML” that the CC contains is represented by storing the top component b0 in the second field of the array keywords.
Second state (Fig. 17b): Line 2: the CC is p0.b0.p1. Line 3: since the two components p0 and b0 in the stack match the first two components of the CC (stack[2] = CC[2]), then q = 2. Line 4: since stack.length = q, the CC is a descendant of the PC, and subroutine PushEntries is called (line 5), which proceeds as follows: Line 2: the nonmatching component p1 (CC[3]) is pushed into the stack. Line 3: keyword “Smith” that the CC contains is represented by storing the top component p1 in the array keywords.
Third state (Fig. 17c): The CC is p0.b0.p1.b1.s1, and it is a descendant of the PC. The steps are similar to those of the second state.
Fourth state (Fig. 17d): Line 2: The CC is p0.b0.p2. Line 3: q = 2. Line 4: Since stack.length ≠ q, subroutine PopAndPushEntries is called (line 6), which proceeds as follows: Line 3: Since OLAin(stack[4]) = OLAin(stack[2]), CT p0.b0.p1.b1 (stack[4]) ∉ IR_LCA (Property 1). Line 5 pops from the stack the top two entries without passing their keyword information to the current top entries. Line 7 calls subroutine popIR, which proceeds as follows: Line 2 pops the current top entry, and lines 5-8 pass its keyword information to the current top entry. Line 8 of subroutine PopAndPushEntries pushes the top component.
Fifth state (Fig. 17e) and sixth state (Fig. 17f): In each of the two states, the CC is a descendant of the PC. The steps are similar to the ones described in the second state.
Answer state (Fig. 17g): Line 7 calls subroutine popIR, which proceeds as follows: The top 2 entries are popped (line 2). Line 3 will discover that all the fields of the array keywords at entry p0.b0.p2 are occupied. Therefore, this array is output as RKC (line 4). Each component in the array represents the rightmost component of the Dewey ID of one of the KCs composing the RKC. Thus, RKC = {p0.b0.p2, p0.b0.p2.b2, p0.b0.p2.b2.s2} (see Fig. 17h). As can be seen, the RKC contains the nodes that form the answer.

5 ANSWERING A QUERY
We describe in this section how the IANs are determined. We describe in Section 5.1 how the IANs for a Loosely Structured query are determined, and we describe in Section 5.2 how the answer subtree for a Keyword-based query is constructed.

5.1 Answering a Loosely Structured Query
A Loosely Structured query has two forms. We call the first “query type A” and the second “query type B.”

5.1.1 Query Type A
In query type A, the label of the search-term element is different than the label of the requested return element. If the keyword(s) of the query is/are contained in one CT, the IAN should be contained in the Immediate Relatives of this CT. Consider, for example, Fig. 2 and the query Q(name = “Joe Smith,” title?). The query asks for the titles of the publications authored by “Joe Smith” (node 21). Since “Joe Smith” is contained in CT T8, the IANs should be contained in IR_T8 (recall Fig. 4). The answer is nodes 5 and 25 contained in CTs T2 and T9, respectively.
If the keywords of the query are contained in more than one CT, the IANs are located in the intersection IR_KC1 ∩ IR_KC2 ∩ ... ∩ IR_KCn, where RKC = {KC1, KC2, ..., KCn}. That is, each IAN is located in a CT Ti that belongs to the intersection of the sets IR_KCj over all KCj ∈ RKC, where CT Ti is an IR of each KCj ∈ RKC.

Example 5. Consider Figs. 2 and 3 and the query Q(name = “Julie Smith,” name = “EDBT,” title?). The query asks for the title of the publication that was authored by “Julie Smith” (node 10) that appeared in “EDBT” conference proceedings (node 7). Algorithm RKClookup (recall Fig. 11) will return the following RKC: RKC = {T3, T4}. The IAN should be located in intersection IR_T3 ∩ IR_T4 = {T2, T7}. Recall Figs. 6 and 9 for IR_T3 and IR_T4, respectively. The IAN is the node 2 “title” contained in CT T2 (see Fig. 18).

Example 6. Consider Figs. 2 and 3 and the query Q(name = “Julie Smith,” name = “VLDB,” title?). The query asks for the title of the publication that was authored by “Julie Smith” (node 10) and appeared in “VLDB” conference proceedings (node 15). RKC = {T4, T6}. IR_T4 ∩ IR_T6 = {T5, T7}. Recall Figs. 9 and 8 for IR_T4 and IR_T6, respectively. The IAN is node 13 contained in CT T5 (see Fig. 19).

5.1.2 Query Type B
In query type B, the search-term element and the requested return element have the same label. Consider, for example, Fig. 2. A user who knows that “XQuery” is a title of one of

the papers authored by “Julie Smith” and who wants to know the titles of the other publication(s) authored by her can submit the query Q(title = “XQuery,” title?). When searching for the CT containing the IAN “title,” we only search for ones whose Ontology Label is “publication.” So, when answering query type B, we search only for CTs whose Ontology Labels are the same as the Ontology Label of the CT containing the search-term element. CTs whose Ontology Labels are the same behave as rivals: they either cooperatively have done something to a CT T′, or something has been done to them collectively by T′. We call CT T′ a pivoting entity (denoted by T_piv). In Fig. 3, for example, CTs T1, T4, and T8 cooperatively authored the pivoting entity CT T2. On the other hand, CTs T2 and T5 are collectively authored by the pivoting entity CT T4.

Fig. 18. IR_T3 ∩ IR_T4.

Fig. 19. IR_T4 ∩ IR_T6.

In query type B, the IAN(s) is/are located as follows: Starting from CT Ti in the CTG that contains the search-term element, we search the ascendants and descendants of Ti for the closest CT Tj in the graph, whose Immediate Relatives include Ti (Ti ∈ IR_Tj) and at least one other CT, whose Ontology Label is the same as the Ontology Label of Ti. CT Tj is T_piv, and one or more of its Immediate Relatives must contain the IAN(s). That is, the IAN(s) must be contained in a CT ∈ IR_Tpiv. Example 7 illustrates this case.

Example 7. Q(title = “XQuery,” title?). The keyword “XQuery” is contained in CT T5. The closest CT to T5, whose Immediate Relatives contain T5 and also another CT whose Ontology Label is the same as that of T5 is CT T4. So, T4 is T_piv. The IAN should be contained in IR_T4 (recall Fig. 9). The IAN is node 5 contained in T2.

5.2 Answering a Keyword-Based Query
The answer subtree for a Keyword-based query is composed of the following CTs: 1) the RKC, and 2) each CT Ti that belongs to the intersection of the sets IR_KCj over all KCj ∈ RKC, where CT Ti is an IR of each KCj ∈ RKC. This methodology guarantees that each node in the answer subtree is semantically related to all nodes containing keywords and the other connective nodes in the subtree. If algorithm RKClookup outputs n RKCs, there would be n answer subtrees.

Example 8. Let us construct the answer subtree for the query presented in Example 4. The RKC = {T8, T9, T10} (recall Fig. 17h). The answer subtree is composed from the RKC and the CTs located in the intersection IR_T8 ∩ IR_T9 ∩ IR_T10. Recall Figs. 4 and 5 for IR_T8 and IR_T10. As for IR_T9, it contains the same CTs contained in IR_T10. The intersection IR_T8 ∩ IR_T9 ∩ IR_T10 = T11. So, the CTs composing the answer subtree are CTs T8, T9, T10, and T11. Thus, the answer subtree is rooted at node 20 and contains nodes 21, 22, 24, 25, 26, 27, 28, 29, and 30. This answer subtree conveys the following information: “Joe Smith” (node 21), whose area of expertise is “databases” (node 30), coauthored an article titled “XML” (node 25) that was published in “VLDB” journal (node 27) in “May 2002.”

5.3 Ranking Function
For ranking results, we use the ranking function proposed in XRANK [14]. For the query $Q(k_1, k_2, \ldots, k_n)$, the overall ranking of the result element $v_1$, which is an ancestor of the element containing the keyword $k_i$, is computed as follows:

$$r(v_1, Q) = \Big(\sum_{1 \le i \le n} \hat{r}(v_1, k_i)\Big) \times p(v_1, k_1, k_2, \ldots, k_n),$$

where $p(v_1, k_1, k_2, \ldots, k_n)$ is a measure of keyword proximity and $\hat{r}(v_1, k_i)$ is the sum of the ranks of element $v_1$ with respect to the m occurrences of keyword $k_i$, where the rank of each of these occurrences is computed as $r(v_1, k_i) = e(v_t) \times decay^{t-1}$, where decay is a parameter that can be set to a value between 0 and 1, and

$$e(v) = \frac{1 - d_1 - d_2 - d_3}{N_d \times N_{de}(v)} + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)} + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)} + d_3 \sum_{(u,v) \in CE^{-1}} e(u),$$

where $d_1$, $d_2$, and $d_3$ are the probabilities of navigating through hyperlinks, forward-containment edges, and reverse-containment edges, respectively. $N_d$ is the total number of documents; $N_{de}(v)$ is the number of elements containing element v; $N_h(u)$ is the number of outgoing hyperlinks from the document; $N_c(u)$ is the number of subelements of u; HE and CE are the sets of hyperlink and containment edges, respectively; $CE^{-1}$ is the set of reverse-containment edges.

6 ANALYZING ALGORITHM RKCLOOKUP AND EXISTING ALGORITHMS
6.1 Non-Context-Driven Search Algorithms
A number of studies [10], [19], [35] employ a semantic search over XML documents modeled as trees. They build relationships between data nodes based solely on their labels and proximity to one another while overlooking their contexts. We take [10], [19], [35] as samples of non-context-driven search engines and analyze the behavior of their algorithms as well as algorithm RKClookup. First, we describe below the three algorithms.
XKSearch [35]. Xu and Papakonstantinou [35] use an algorithm called the Stack Algorithm to compute the Smallest Lowest Common Ancestor (SLCA) of nodes containing keywords. The algorithm is based on a stack sort-merge approach, but it does not employ context-driven search techniques. It computes the longest common prefix of each node and that denoted by the top entry of the stack.

It then pops the top entries containing Dewey components that are not part of the common prefix. If a popped entry contains all keywords, it is an SLCA, which is a root of a subtree, in which the nodes of the subtree contain all the query’s keywords and have no descendant nodes that also contain all keywords.
Schema-Free XQuery [19]. Li et al. [19] use an algorithm called MLCAS for computing the Meaningful Lowest Common Ancestor (MLCA) of nodes containing keywords. Nodes a and b are considered related and their LCA node c is considered the MLCA of a and b, if c is not an ancestor of some node d, which is an LCA of node b and another node that has the same label as a. Algorithm MLCAS uses a stack, with the head of each stack node being a descendant of the stack node below it. The basic idea is to perform one single-merge pass over the nodes and conceptually merge them into rooted trees containing MLCAs.
XSEarch [10]. Cohen et al. [10] use an algorithm called ComputeInterconnectionIndex, which employs dynamic programming to compute the relationships between all pairs of nodes in an XML tree. If the relationship tree of nodes a and b (the set of nodes in the path from a to b) contains two or more nodes with the same label, then nodes a and b are unrelated. Otherwise, they are related.
We classified the XML documents of INEX [16], [17] into categories based on their structural characteristics and on the similarity of their nodes’ labels and types. We then classified INEX queries into seven criteria based on 1) the classifications of the XML documents they are intended to query, and/or 2) the labels of the queries’ search term and return elements. Table 3 shows these criteria. We now analyze the behavior of the algorithms of RKClookup and [10], [19], [35] under these criteria.

TABLE 3
Classification of Queries (C# Denotes Criterion Number)

Criterion # 1. Algorithm RKClookup performs well under this criterion. If the search-term element or return element of a Loosely Structured query has the same label as one of the nodes in the XML tree that have the same label and type, the algorithms of [10], [19] may return a faulty answer for the query. If one of the nodes that have the same label and type contains a value matching the keyword of a Keyword-based query, the algorithm of [35] may return a faulty answer. The reason is that these algorithms are unable to account for the contexts of these nodes. We cite again the query Q(“Smith”, “XML”, “VLDB”) presented in Example 4. The stack algorithm of [35] will return two answer subtrees for this query, one of which is incorrect. It will return node 4 as the root of the (incorrect) answer subtree. The semantics of the returned answer are incorrect because “Julie Smith” authored a paper titled “XML” but it did not appear in “VLDB.”
Criterion # 2. Algorithm RKClookup performs well under this criterion. The algorithms of [19], [35] may return a faulty answer for a query if the XML tree contains 1) a set S of nodes having ancestor-descendant relationships and containing all the keywords, and 2) a node(s) ∉ S containing a keyword. Consider that node 30 is pruned from Fig. 2, and there is a query (name = “Joe Smith,” area?) that asks for the area of expertise of “Joe Smith” (node 21). The correct answer is null. But Li et al. [19] will return node 19, because the LCA of nodes 21 and 19 is not a descendant of an LCA of node 21 and another node labeled “area.”
Answer of algorithm RKClookup: RKC = {T8}. IR_T8 = {T2, T3, T9, T10, T11} (recall Fig. 4). Therefore, XCDSearch will return null, because IR_T8 has no data node labeled “area.”
Criterion # 3. Algorithm RKClookup performs well under this criterion. The algorithms of [10], [19], [35] cannot reason that nodes having different labels may belong to the same type, since they do not employ ontological concepts. Consider Fig. 2 and the query (name = “Tom Wilson,” area?), which asks for the research interest area of Tom Wilson (node 2). Instead of returning only node 32, Cohen et al. [10] will return also nodes 19 and 30, because their relationship trees with node 2 do not include two or more nodes with the same label. Cohen et al. [10] cannot reason that nodes 1, 9, and 20 belong to the same type.
Answer of algorithm RKClookup: RKC = {T1}. IR_T1 = {T2, T3, T12} (see Fig. 7). Thus, XCDSearch returns only node 32 ∈ T12.
Criterion # 4. Algorithm RKClookup performs well under this criterion. The algorithms of [10], [19], [35] cannot reason that nodes having the same label may refer to entities with different types.
Criterion # 5. Let n denote the search-term element of a query falling under this criterion. To answer this query, a search engine needs to be able to identify 1) the subset of data nodes in the XML tree that have the same label and type as element n, and 2) the subset of data nodes that have the same label as element n but have different types. Unlike algorithm RKClookup, the algorithms of [10], [19], [35] are unable to determine these subsets correctly. Consider Fig. 2 and the query (name = “Tom Wilson,” name?), which could be interpreted as “What are the names of the coauthors of the paper that was authored by Tom Wilson?” or “What is the name of the conference that published Tom Wilson’s paper?” The correct answer comprises two separate result sets: one is nodes {10, 21}, and the other is node {7}. However, [19] will return null. It will not return node 10 because: 1) node 1 is the LCA of nodes 2 and 10, and it is also an ancestor of node 4, which is the LCA of nodes 10, 21, and 7,

and 2) the labels of nodes 21 and 7 are the same as the label of node 2. Therefore, node 10 is related to nodes 21 and 7 and not to node 2. The same thing applies to nodes 21 and 7.
XCDSearch answer: The pivoting entity (recall Section 5.1.2) is CT T2. IR_T2 = {T1, T12, T4, T7, T8, T11, T3}. Therefore, XCDSearch will return nodes 10 ∈ T4 and 21 ∈ T8 in one result set (since OL_T4 = OL_T8) and return node 7 ∈ T3 in a separate result set (since OL_T3 ≠ OL_T4, OL_T8). IR_T = {T′, T″}. Thus, XCDSearch returns node 8 ∈ T″.
Criterion # 6. The four algorithms perform well in this criterion because there is no ambiguity in selecting an IAN.
Criterion # 7. Algorithm RKClookup does not perform well under this criterion. We conducted a statistical survey using the test data of [16], [17], [37] to determine the probability of encountering an XML tree that met the characteristics causing this criterion. We found that only about 9 percent of the surveyed XML documents meet these characteristics. The pitfall caused by this criterion could be overcome by labeling the CT that has the same OL as its parent CT with a dummy OL, which is different than each OL of a CT in the CTG.

6.2 Context-Driven Search Algorithms
We proposed previously context-driven search engines called OOXSearch [26] and CXLEngine [28]. The following are the key differences between [26], [28] and XCDSearch: 1) the algorithms of [26], [28] employ Object-Oriented techniques for answering queries, while algorithm RKClookup employs a stack-based sort-merge algorithm; 2) while algorithm RKClookup considers all CTs in a CTG when computing RKC, [26], [28] consider only the CTs that connect the KCs; and 3) [26], [28] answer only Loosely Structured queries.

7 SYSTEM IMPLEMENTATION AND ARCHITECTURE
Fig. 20 shows the XCDSearch system architecture. We describe below the processing steps and modules in the system architecture that create Ontology Labels, CTGs, and IR_T.

Fig. 20. XCDSearch system architecture.

7.1 Creating Ontology Labels
Many ontologies are already available in electronic form and can be imported into an ontology-development environment. Module OntologyBuilder (see Fig. 20) uses the Protégé ontology editor [24], which allows the importing of ontologies available in electronic form (ontologies done by others) from locations specified by the namespaces URI (URL) by ticking the “import” tickbox. The Ontology Labels for 77 percent of the tag names in the test data we used in the experiments were created automatically from existing ontologies that had been imported electronically to the system. The formality in which an ontology is expressed often does not matter since many knowledge-representation systems can import and export ontologies. The imported ontologies are stored in database Ontology_DB for future reference. After an XML schema describing the structure of an XML document is input to module OntologyBuilder, the module will access database Ontology_DB to determine the Ontology Labels corresponding to the interior nodes of the XML schema. If the ontology for creating the Ontology Label of a node is not found in the database, the module will first tokenize the node’s label by parsing it into a set of tokens using delimiters. In Fig. 2, for example, the label researchInterest will be tokenized into {research, interest}. The module then checks the database to determine whether there is an ontology for the suffix token name. If not, the system administrator can use Protégé to build a corresponding taxonomy of concepts and relations. The resultant OLs will be added to database Ontology_DB for future reference.

7.2 Constructing CTGs
Module OntologyBuilder outputs to module GraphBuilder a table called OL_TBL, which stores the OLs of the interior nodes in the input XML schema. Module GraphBuilder creates a CTG corresponding to the input schema using algorithm BuildCTreesGraph (see Fig. 21). The inputs to the algorithm are table OL_TBL and the list of nodes adjacent to each node. Lines 1-12 determine CTs, and lines 13-18 connect them with edges. Line 6: if an adjacent node n′ to an interior node n is a leaf data node, this node will be contained in a CT Tz (lines 8 or 12). Function setParentComp in line 9 sets node n as the parent component of CT Tz. Line 10 stores the parent nodes of all CTs in a set called ParentComps. When function getCT (line 15) inputs the closest ancestor interior node m′ to interior node m, it returns the ID of the CT whose parent component is m′. Function setCTparent (line 17) connects with an edge the CTs whose parent nodes are m′ and m. Line 18 assigns the OL of m to CT Ty. The algorithm’s time complexity is O(|V| + |E|), where |V| and |E| are the number of nodes and edges, respectively, in the XML tree.

7.3 Constructing IR_T
Using the input XML document, CTG, and the keywords, the KCdeterminer identifies the KCs. Module IRdeterminer uses algorithm ComputeIR (recall Fig. 10) to compute for each CT T in the CTG its IR_T and saves it in a table called IR_TBL for future reference.

7.4 Constructing Results
XCDSearch Query Engine uses algorithm RKClookup (recall Fig. 11) to compute the query’s RKC and then accesses table IR_TBL to construct the answer. The engine extracts the data from each answer data node n using an XQuery engine [29].
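The lookup that combines an RKC with table IR_TBL (the intersection of Immediate-Relative sets from Section 5.1.1) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the dictionary IR_TBL below is a hypothetical stand-in whose set contents are illustrative placeholders, chosen only so that the intersections agree with Examples 5 and 6. The actual sets are those of Figs. 6, 8, and 9. The function name candidate_cts is likewise hypothetical.

```python
from functools import reduce

# Hypothetical stand-in for table IR_TBL: CT id -> set of Immediate Relatives.
# The members are placeholders, not the real contents of Figs. 6, 8, and 9.
IR_TBL = {
    "T3": {"T2", "T4", "T7"},
    "T4": {"T2", "T3", "T5", "T6", "T7"},
    "T6": {"T4", "T5", "T7"},
}


def candidate_cts(rkc, ir_tbl):
    """Return the CTs that are Immediate Relatives of every KC in the RKC."""
    if not rkc:
        return set()
    return reduce(set.intersection, (ir_tbl[kc] for kc in rkc))


if __name__ == "__main__":
    print(sorted(candidate_cts(["T3", "T4"], IR_TBL)))  # RKC of Example 5
    print(sorted(candidate_cts(["T4", "T6"], IR_TBL)))  # RKC of Example 6
```

Because IR_TBL is computed once per CTG and stored, the per-query cost of this step is limited to set intersections over the KCs of the RKC.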

Fig. 21. Algorithm BuildCTreesGraph.

8 EXPERIMENTAL RESULTS
We experimentally evaluated the quality and efficiency of XCDSearch and compared it with XSEarch [10], Schema-Free XQuery [19], XKSearch [35], OOXSearch [26], and CXLEngine [28]. We have implemented XCDSearch in Java and ran it on an Intel(R) Core(TM)2 Duo CPU processor, with a CPU of 2.1 GHz and 3 GB of RAM, under Windows Vista. The implementation of [19] has been released as part of the TIMBER project [30], which we used for part of the evaluation of [19]. We implemented the entire proposed systems of [10], [19], [35] from scratch. As for [26], [28], we already have their implementations.
Test Data: We used the following test data:

• INEX 2005 [16]. Some of the documents in [16] are scientific articles (marked up with XML tags) from publications of the IEEE Computer Society. There are two types of topics (i.e., queries) included in the test collection, Content-and-structure (CAS) topics and Content-only (CO) topics. CAS topics are Loosely Structured queries, and CO topics are Keyword-based queries. We used in the experiments all the 87 queries in the test collection with query numbers 202-288.
• INEX 2006 [17]. The documents in [17] are made up of English documents from the Wikipedia project (marked up with XML tags) and contain hyperlinks between individual documents and their elements. The CAS and CO topics are combined into topics called Content Only + Structure (CO+S). We used in the experiments all the 125 queries in the collection with query numbers 289-413.

Creating Ontology Labels: INEX [16], [17] contains a total of 1,411 distinct tag names. XCDSearch created the Ontology Labels for 77 percent of the distinct tag names automatically from existing ontologies (done by others), which had been imported to the system electronically. For the creation of the Ontology Labels for the remaining 23 percent of tag names, we built taxonomies of concepts and relations using [24] and saved the resultant OLs in database Ontology_DB (recall Fig. 20) for future reference.

8.1 Recall and Precision Evaluation
We measured the recall and precision of XCDSearch and of [10], [19], [35], [26], [28] using the test data of INEX [16], [17]. An INEX assessment records for a given topic (query) and a given document the degree of relevance of the document component to the INEX topic. We compared the answers obtained by each of the six systems to the answers deemed relevant by an INEX assessment. For a given topic and assessment, we measured how many of the XML nodes that are deemed relevant in the assessment are missing (for measuring recall) and how many more XML nodes are retrieved (for measuring precision).
Using the 212 queries of [16], [17], we computed the average recall and precision of XCDSearch and of [10], [19], [35] under each of the seven query criteria described in Section 6.1 and shown in Table 3. The results are shown in Table 4. As the table shows, XCDSearch outperforms the other three non-context-driven search systems in query criteria 1-6, while it does not perform well in criterion 7. Its performance in query criteria 1-5 is due to its ontological labeling of interior nodes and to the fact that each of these criteria requires a system to account for the contexts of nodes. The context-driven search techniques employed by algorithm RKClookup enable XCDSearch to build semantic relationships among nodes based on their contexts and Ontology Labels, while the algorithms of [10], [19], [35] build relationships based solely on the nodes’ labels and proximity to one another. The determination of nodes’ contexts and Ontology Labels is needed in query criterion 5 more than in the other criteria, which explains the substantial performance of XCDSearch over the other three systems in this criterion (see Table 4). As the table shows, XCDSearch does not perform well in criterion 7. This problem could be overcome by relabeling the CT that has the same OL as its parent CT.
Fig. 22 shows the overall average of recall and precision among the four systems. As the figure shows, 1) XCDSearch outperformed the other three systems; 2) [19] outperformed [10] and slightly [35] in recall, because the technique it uses for building relationships among nodes relies on the hierarchical relationships between the nodes, which reduces node-labeling conflicts; and 3) [35] outperformed [10] and slightly [19] in precision.
To compare the search quality of XCDSearch with our previously proposed search engines, OOXSearch [26] and CXLEngine [28], we measured their average recall and precision using all the 47 CAS topics of [16] (query numbers 242-288). The results showed that 1) the average recall of XCDSearch is higher than that of [28] by 34 percent and higher than that of [26] by 39 percent; and 2) the average precision of XCDSearch is higher than that of [28] by 37 percent and higher than that of [26] by 46 percent.

8.2 Evaluating the Impact of Context Consideration and Ontological Labeling Components on XCDSearch Quality
We modified a copy of XCDSearch by removing its nodes’ ontological labeling component. Our objectives were 1) to measure the decline in recall and precision to determine the impact of the ontological labeling component on

XCDSearch’s search quality; and 2) to determine the impact of the context-consideration component on XCDSearch’s search quality by comparing the modified copy’s recall and precision with those of [10], [19], [35]. If the modified copy outperformed the other three systems, this performance would be attributed to XCDSearch’s context-consideration component. The modified copy labels CTs with the labels of the CTs’ parent nodes rather than with the parents’ Ontology Labels. For example, the copy would label CT T1 in Fig. 3 with the label “student” instead of the Ontology Label “person.” Fig. 23 shows the average recall and precision of the modified copy, using all the queries of [16], [17]. (For ease of comparison, we show also in the figure the average recall and precision of the original version of XCDSearch and of the other three systems.) As the figure shows, the average recall and precision of the modified copy are less than those of the original version by 8.3 and 18.2 percent, respectively, but they are higher than those of [10], [19], [35]. Thus, we conclude that 1) the ontological labeling component has important but not critical impact on XCDSearch’s search quality; and 2) XCDSearch outperforms the other three systems using only its context-consideration component.

TABLE 4
Average Recall and Precision under Seven Criteria
“C#,” “R,” and “P” denote criterion number, Recall, and Precision, respectively.

Fig. 22. Overall average recall and precision of the four systems.

Fig. 23. Average Recall and Precision of a copy of XCDSearch with no ontological labeling component as well as the average Recall and Precision of the original versions of XCDSearch and [10], [19], [35].

We modified copies of [10], [19], [35] so that they label interior nodes with the nodes’ Ontology Labels prior to building relationships between the nodes. Our objectives were: 1) to measure the copies’ recall and precision to determine the impact of the ontological labeling component on the search quality of [10], [19], [35]; and 2) to determine whether the original version of XCDSearch outperformed the modified copies. If so, the performance would be attributed to XCDSearch’s context-consideration component. Fig. 24 shows the average recall and precision of the modified copies using all the queries of [16], [17]. (For ease of comparison, we show also in the figure the average recall and precision of the original versions of the four systems.) As the figure shows, the average recall and precision of the modified copies are higher than those of the original versions, but they are much less than those of the original version of XCDSearch. Thus, we conclude that the context-consideration component has significant impact on XCDSearch’s search quality.

Fig. 24. Average Recall and Precision of copies of [10], [19], [35] with ontological labeling component as well as the average Recall and Precision of the original versions of XCDSearch and [10], [19], [35].

8.3 Evaluating Initial Precision
To evaluate the performance of XCDSearch and [10], [19], [35] on initially retrieved elements, we computed the Mean Average Precision (MAP), which is the fraction of relevant elements retrieved, at three ranks: ranks 1 (first answer retrieved), 5, and 10. For the sake of fair comparisons, we used the same ranking function for the four systems, which is the ranking function of XRANK [14] described in Section 5.3. We set the parameters at d1 = 0.35, d2 = 0.25, and d3 = 0.25. Table 5 shows the results. As the table shows, the MAPs of XCDSearch are higher than those of the other three systems, and as the rank increases, the decrease in the MAP of XCDSearch is not as significant as in the other systems.

8.4 Statistical Test of Significance
We aim at using z-test [34] to:

1. determine whether the differences between individual recall and precision scores of each of the four systems are large enough to be statistically significant and
2. test our hypothesis on specific recall and precision values of the population mean.

The z-score is the distance between the sample mean and the population mean in units of the standard error. It is calculated as Z = (X − M)/SE where X is the mean sample,
M is the population mean, SE = D/sqrt(n) is the standard error of the mean in which D is the Average Standard Deviation of the mean, and n is the sample size. Table 6 shows the mean (M) of recall and precision and the Average Standard Deviation (D) of the mean for the four systems. As the values of D in Table 6 show, the retrieval performance of XCDSearch and [19] does not vary substantially with queries. Tables 7 and 8 show the z-scores for recall/precision. Using the z-scores, we determined the probability of a randomly selected query from the 212 queries used in the experiments to achieve recall/precision equal or higher than a sample of mean (X). Column (R ≥ X) in Table 7 shows the probabilities using a sample of eight recall mean (R), and column (P ≥ X) in Table 8 shows the probabilities using a sample of eight precision mean (P). These probabilities were determined from a standard normal distribution table by using the z-scores as entries. For example, the probabilities for the systems to return a query with recall ≥ 0.40 (shown in Table 7) are as follows: XCDSearch: 100 percent; Schema-Free XQuery: 97 percent; XSEarch: 39 percent; and XKSearch: 77 percent. As the z-scores in Tables 7 and 8 show, the distances from the sample mean to the population mean are smaller for XCDSearch and [19]. The tables show also that XCDSearch has a much higher probability for achieving recall/precision equal to or higher than the sample mean for a randomly selected query.

TABLE 5
MAP of the Systems
LO stands for loosely structured query and KE stands for keyword query. [35] accepts only KE queries.

TABLE 6
Average Standard Deviation of the Mean
R stands for recall and P stands for precision.

TABLE 8
z-Scores and the Probability of a Randomly Selected Query to Achieve P ≥ X

8.5 Evaluating Search Efficiency
We measured the average query-execution time of XCDSearch and [10], [19], [35] using the 212 queries of [16], [17]. We first used each system to precompute the relationships between all nodes in each document (before submitting the queries), saving the results for future accesses by the system and recording the computation time “t.” For the sake of fair comparisons among the systems, we considered “t” as constant for each system in the computation of the average execution time: average query-execution time = (t + execution time of all queries)/number of queries. For XCDSearch, “t” included the time for creating CTGs, IRs, and Ontology Labels, but it did not include the time for building the taxonomies of concepts for the 23 percent of the tag names mentioned previously, whose ontologies had not been found in XCDSearch’s databases.

We generated 100, 150, 200, 250, and 300 MB documents from each document in [16], [17] using [29]. We measured the average time that each system took to execute queries containing one, two, three, and four keywords submitted to the 100 MB documents. Fig. 25a shows the results. We also measured the average time that each system took to execute all the queries using documents of variable sizes. Fig. 25b shows the results. As the figures show, the running times of XCDSearch are less than those of [10], [19] and slightly higher than those of [35]. Even though both XCDSearch and [35] use stack-based sort-merge algorithms, the running times of XCDSearch are slightly higher than those of [35] because of the overhead of applying context-driven techniques. The reason for the expensive running times of [10], [19] stems from the fact that they check the relationships of each node containing keywords and then filter results according to the search terms.
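
The average query-execution time defined in Section 8.5, (t + execution time of all queries)/number of queries, can be illustrated with a minimal Python sketch. The function name and the timing values below are hypothetical and serve only to show how the one-off precomputation time “t” is amortized over the queries; they are not measurements from our experiments.

```python
def average_query_execution_time(precompute_time, query_times):
    """Return (t + total execution time of all queries) / number of queries.

    precompute_time -- the one-off precomputation time "t" (seconds)
    query_times     -- per-query execution times (seconds)
    """
    if not query_times:
        raise ValueError("at least one query time is required")
    return (precompute_time + sum(query_times)) / len(query_times)


# Hypothetical timings in seconds, chosen only for illustration.
t = 12.0
times = [0.5, 1.0, 1.5, 1.0]
print(average_query_execution_time(t, times))  # (12.0 + 4.0) / 4 = 4.0
```

Because “t” is added once and divided by the number of queries, its contribution to the average shrinks as the query workload grows.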
TABLE 7
z-Scores and the Probability of a Randomly Selected Query to Achieve R ≥ X

Fig. 25. (a) Execution times using variable number of keywords. (b) Execution times using variable documents sizes.
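
The z-score calculation of Section 8.4, Z = (X − M)/SE with SE = D/sqrt(n), and the lookup of the probability of achieving a value equal to or higher than X can be sketched in Python as follows. This is only an illustration: the function names and the numeric inputs are hypothetical, and the upper-tail probability is computed with the complementary error function as an assumed stand-in for the standard normal distribution table.

```python
import math


def standard_error(d, n):
    """SE = D / sqrt(n): D is the standard deviation, n the sample size."""
    return d / math.sqrt(n)


def z_score(x, m, d, n):
    """Z = (X - M) / SE: distance between X and M in units of SE."""
    return (x - m) / standard_error(d, n)


def prob_at_least(z):
    """Upper-tail probability P(Z >= z) of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical inputs, not values taken from Tables 6, 7, or 8.
x, m, d, n = 0.5, 0.6, 0.2, 16
z = z_score(x, m, d, n)
p = prob_at_least(z)
print(f"z = {z:.2f}, P(>= x) = {p:.4f}")  # z = -2.00, P(>= x) = 0.9772
```

A strongly negative z-score means that X lies well below the mean M, so the probability of reaching at least X approaches 100 percent.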
9 RELATED WORK AND CONCLUSION
Keyword-based querying in relational databases has been studied extensively [3], [4], [15]. These studies model the database as a graph in which tuples are regarded as the graph’s nodes and the relationships between the tuples are regarded as the graph’s edges. Then, a keyword query is answered by returning a subgraph that satisfies the search keywords. A number of studies [5], [6], [7], [8], [11] propose modeling XML documents as graphs, and keyword queries are answered by processing the graphs based on given schemas. Some studies [5], [7], [8] propose techniques for ranking results of XML keyword queries based on importance and relevance. Others [2] use an RDBMS for answering XML keyword queries.

Non-context-driven XML search engines build relationships among data nodes based solely on their labels and proximity to one another while overlooking their contexts (parents), which may cause these engines to return faulty answers. In this paper, we have introduced XCDSearch, an XML context-driven search engine, which answers Keyword-based and Loosely Structured queries. XCDSearch accounts for nodes’ contexts by considering each set consisting of a parent and its children data nodes in the XML tree as one entity (CT). We proposed mechanisms for determining the semantic relationships among different CTs. We also proposed an efficient stack-based sort-merge algorithm that selects from the set of CTs containing keywords (KCs) subsets, wherein each subset contains the smallest number of KCs that are closely related to one another and contain at least one occurrence of each keyword. We took [10], [19], [35] as samples of non-context-driven XML search engines and compared them heuristically and experimentally with XCDSearch. The experimental results show that XCDSearch outperforms significantly the three other systems.

REFERENCES
[1] S. Amer-Yahia, E. Curtmola, and A. Deutsch, “Flexible and Efficient XML Search with Complex Full-Text Predicates,” Proc. ACM SIGMOD ’06, 2006.
[2] D. Alorescu and I. Manolescu, “Integrating Keyword Search in XML Query Processing,” Computer Networks, vol. 33, pp. 119-135, 2000.
[3] C. Agrawal and G. Das, “DBXplorer: A System for Keyword-Based Search over Relational Databases,” Proc. Int’l Conf. Data Eng. (ICDE ’02), 2002.
[4] B. Aditya and S. Sudarshan, “BANKS: Browsing and Keyword Searching in Relational Databases,” Proc. Int’l Conf. Very Large Data Bases (VLDB ’02), 2002.
[5] B. Balmin, V. Hristidis, and N. Koudas, “A System for Keyword Proximity Search on XML Databases,” Proc. Int’l Conf. Very Large Data Bases (VLDB ’03), 2003.
[6] B. Balmin, V. Hristidis, and Y. Papakonstantinon, “Keyword Proximity Search on XML Graphs,” Proc. Int’l Conf. Data Eng. (ICDE ’03), 2003.
[7] B. Balmin and V. Hristidis, “ObjectRank: Authority-Based Keyword Search in Databases,” Proc. Int’l Conf. Very Large Data Bases (VLDB ’04), 2004.
[8] C. Botev, L. Guo, and F. Shao, “XRANK: Ranked Keyword Search over XML Documents,” Proc. ACM SIGMOD ’03, 2003.
[9] Books247, http://www.books24x7.com/books24x7.asp, 2010.
[10] S. Cohen, J. Mamou, and Y. Sagiv, “XSEarch: A Semantic Search Engine for XML,” Proc. Int’l Conf. Very Large Data Bases (VLDB ’03), 2003.
[11] S. Cohen and Y. Kanza, “Interconnection Semantics for Keyword Search in XML,” Proc. Int’l Conf. Information and Knowledge Management (CIKM ’05), 2005.
[12] R. Conrad, D. Scheffner, and C. Freytag, “XML Conceptual Modeling Using UML,” Proc. Int’l Conf. Conceptual Modeling (ER ’00), 2000.
[13] R. Elmasri and S. Navathe, Fundamentals of Database Systems. Addison-Wesley, 2007.
[14] L. Guo, F. Shao, and C. Botev, “XRANK: Ranked Keyword Search over XML Documents,” Proc. ACM SIGMOD ’03, 2003.
[15] V. Hristidis and Y. Papakonstantinou, “DISCOVER: Keyword Search in Relational Databases,” Proc. Int’l Conf. Very Large Data Bases (VLDB ’02), 2002.
[16] Initiative for the Evaluation of XML Retrieval (INEX), http://inex.is.informatik.uni-duisburg.de/2005/, 2005.
[17] Initiative for the Evaluation of XML Retrieval (INEX), http://inex.is.informatik.uni-duisburg.de/2006/, 2006.
[18] J. Kamps, M. Marx, M. Rijke, and B. Sigurbjornsson, “Structured Queries in XML Retrieval,” Proc. Int’l Conf. Information and Knowledge Management (CIKM ’05), 2005.
[19] Y. Li, C. Yu, and H. Jagadish, “Schema-Free XQuery,” Proc. Int’l Conf. Very Large Data Bases (VLDB ’04), 2004.
[20] H. Leung, F. Chung, and C. Chan, “On the Use of Hierarchical Information in Sequential Mining Based XML Document Similarity Computation,” Knowledge and Information Systems, vol. 4, no. 7, pp. 476-498, 2004.
[21] G. Li, S. Bressan, G. Dobbie, and B. Wadhwa, “XOO7: Applying O7 Benchmark to XML Query Processing Tools,” Proc. Int’l Conf. Information and Knowledge Management (CIKM ’01), 2001.
[22] http://www.xml.com/2002/11/06/Ontology_Editor_Survey.html, 2010.
[23] mydeco, http://mydeco.com/, 2010.
[24] Protégé Ontology Editor, http://protege.stanford.edu/, 2010.
[25] E. Pardede, J. Rahayu, and D. Taniar, “On Using Collection for Aggregation and Association Relationships in XML Object-Relational Storage,” Proc. ACM Symp. Applied Computing (SAC ’04), 2004.
[26] K. Taha and R. Elmasri, “OOXSearch: A Search Engine for Answering Loosely Structured XML Queries Using OO Programming,” Proc. 24th British Nat’l Conf. Databases (BNCOD ’07), 2007.
[27] M. Snoeck and G. Dedene, “Existence Dependency: The Key to Semantic Integrity between Structural and Behavioral Aspects of Object Types,” IEEE Trans. Software Eng., vol. 24, no. 24, pp. 233-251, Apr. 1998.
[28] K. Taha and R. Elmasri, “CXLEngine: A Comprehensive XML Loosely Structured Search Engine,” Proc. Int’l Conf. Extending Database Technology (EDBT) Workshop Database Technologies for Handling XML Information on the Web (DataX ’08), 2008.
[29] ToXgene, a Template-Based Generator for Large XML Documents, http://www.cs.toronto.edu/tox/toxgene/, 2010.
[30] TIMBER, http://www.eecs.umich.edu/db/timber/, 2010.
[31] Wiley InterScience, http://www3.interscience.wiley.com/cgi-bin/home, 2010.
[32] N. Widjaya, D. Taniar, and W. Rahayu, “Aggregation Transformation of XML Schema to Object-Relational Databases,” Proc. Int’l Workshop Innovative Internet Community Systems, pp. 251-262, 2003.
[33] N. Widjaya and W. Rahayu, “Association Relationship Transformation of XML Schemas to Object-Relational Databases,” Proc. Int’l Conf. Information Integration and Web-Based Applications and Services (iiWAS ’02), 2002.
[34] R. Warner, Applied Statistics: From Bivariate through Multivariate Techniques. Sage Publications, 2007.
[35] Y. Xu and Y. Papakonstantinou, “Efficient Keyword Search for Smallest LCAs in XML Databases,” Proc. ACM SIGMOD ’05, 2005.
[36] R. Xiaou, T. Dillon, and L. Feng, “Modeling and Transformation of Object-Oriented Conceptual Models into XML Schema,” Proc. Int’l Conf. Database and Expert Systems Applications (DEXA ’01), 2001.
[37] XML Query Use Cases, W3C Working Draft, 2007.
Kamal Taha received the MS degree in software engineering in 2002 from the University of St. Thomas in Minnesota. He received the PhD degree in computer science from the Department of Computer Science and Engineering at the University of Texas at Arlington in 2010. Dr. Taha is an Assistant Professor of Software Engineering at Khalifa University of Science, Technology & Research (KUSTAR), Abu Dhabi, UAE. Before joining KUSTAR, he was an instructor of Computer Science and Engineering at The University of Texas at Arlington. He worked as an engineering specialist for Seagate Technology (a leading computer disc drive manufacturer in the US) from 1996 to 2005. Dr. Taha authored a book and coauthored two book chapters. In addition, he has 13 refereed publications that have appeared (or are forthcoming) in journals, conferences, and workshop proceedings. His scholarly interests span databases, information retrieval, and data mining. He was selected by Marquis Who’s Who to be included in the 2011-2012 (11th) Edition of Who’s Who in Science and Engineering. He serves as a member of the program committee of two international conferences and as a reviewer for a number of conferences and academic journals. He was a GAANN Fellow (US Department of Education Graduate Assistance in Areas of National Need). He is a member of the IEEE.

Ramez Elmasri received the MS and PhD degrees in computer science at Stanford University in 1980. He is a professor of computer science and engineering at the University of Texas at Arlington, since 1990. He was an assistant professor (1982-1987) and an associate professor (1987-1990) at the University of Houston, Texas. He has more than 130 refereed publications in journals and conference proceedings. He is a coauthor of the textbooks Fundamentals of Database Systems (fifth edition, Addison-Wesley, 2007) and Operating Systems: A Spiral Approach (first edition, McGraw-Hill Science/Engineering/Math, 2009). His research interests are in database systems, XML, network management information systems, web modeling and ontologies, e-commerce, temporal databases, conceptual modeling, object-oriented databases, distributed object computing, operating systems, systems integration, database models and languages, DBMS system implementation, indexing techniques, and software engineering environments. He served on numerous conference and workshop program committees. He is a member of the IEEE, the IEEE Computer Society, and the ACM SIGMOD.