
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 11, NOVEMBER 2012

Holistic Boolean-Twig Pattern Matching for Efficient XML Query Processing


Dunren Che, Tok Wang Ling, Senior Member, IEEE, and Wen-Chi Hou
Abstract: Twig pattern matching is a critical operation for XML query processing, and the holistic computing approach has shown superior performance over other methods. Since Bruno et al. introduced the first holistic twig join algorithm, TwigStack, numerous so-called holistic twig join algorithms have been proposed. Yet practical XML queries often require support for more general twig patterns, such as the ones that allow arbitrary occurrences of an arbitrary number of logical connectives (AND, OR, and NOT); such types of twigs are referred to as B-twigs (i.e., Boolean-twigs) or AND/OR/NOT-twigs. We have seen interesting work on generalizing the holistic twig join approach to AND/OR-twigs and AND/NOT-twigs, but have not seen any further effort addressing the problem of AND/OR/NOT-twigs at the full scale, which therefore forms the main theme of this paper. In this paper, we investigate novel mechanisms for efficient B-twig pattern matching. In particular, we introduce B-twig normalization as an important first step in our approach toward eventually conquering the complexity of B-twigs, and then present BTwigMerge, the first holistic twig join algorithm designed for B-twigs. Both analytical and experimental results show that BTwigMerge is optimal for B-twig patterns with AD (Ancestor-Descendant) edges and/or PC (Parent-Child) edges.

Index Terms: Query processing, database management, XML data querying, twig join, Boolean twig, logical predicate.

• D. Che and W.-C. Hou are with the Department of Computer Science, Southern Illinois University, Faner 2125, Mail Code 4511, 1000 Faner Drive, Carbondale, IL 62901. E-mail: {dche, hou}@cs.siu.edu.
• T.W. Ling is with the Department of Computer Science, School of Computing, National University of Singapore, Computing 1 (COM1), 13 Computing Drive, Singapore 117417. E-mail: lingtw@comp.nus.edu.sg.

Manuscript received 8 Nov. 2010; revised 11 Mar. 2011; accepted 16 May 2011; published online 7 June 2011. Recommended for acceptance by T. Grust. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2010-11-0589. Digital Object Identifier no. 10.1109/TKDE.2011.128.
1041-4347/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society

1 INTRODUCTION

An XML database stores a collection of data trees. An XML query describes a tree-shaped search pattern, which is often referred to as a twig pattern [5], with additional query conditions (if any) described as predicates on the tree nodes. XML queries are thus called tree queries or twig queries. Answering a twig query essentially means finding all instances in a database that match the twig pattern implied by the query and satisfy all additional predicates (if any) in the query. A naive way of finding the matches for a twig pattern is to scan the database (usually many times). A better way uses structural joins [14], [4] in a bulk way to compute the matches for each individual edge, and then stitches the matches found for individual edges together to form the answers for the whole twig. This approach typically creates large sets of unused intermediate results, even when the final result set is quite small. A much more efficient approach, called holistic twig join, computes the matches for the whole twig in a holistic way so that irrelevant intermediate results (which need to be output and input, and thus are most detrimental to query performance) can be avoided. The first holistic twig join algorithm, TwigStack, was proposed by Bruno et al. [5] in 2002. Since then the holistic join approach has been broadly extended by numerous followers [6], [8], [7], [9], [10], [11], [13]. Most of these holistic join algorithms deal only with plain twig patterns, where each query node corresponds to an element type name and the sibling nodes originating from the same parent node naturally imply the AND logic between them.

However, practical XML queries naturally extend such plain twigs to more general ones, for example, allowing arbitrary combinations of any number of ANDs, ORs, and NOTs. For example, the following query,

/dblp/paper[title = "TwigJoin" or year = "2006" and conf = "SIGMOD"]//author

selects the authors who have papers either titled Twig Join or published in SIGMOD 2006. This query contains both OR and AND operations. The next query,

/dblp/paper[NOT reference]

finds papers that do not have references. This query contains a NOT operation. A twig that may contain arbitrary combinations of ANDs, ORs, and NOTs is referred to as an AND/OR/NOT-twig or Boolean-twig (or simply B-twig).

The importance of B-twigs for XML queries is obvious and well recognized [7], [13]. So far, we have only seen that Jiang et al. [7] studied the holistic twig join issue for AND/OR-twigs (i.e., twigs with only AND and OR predicates) and that Yu et al. [13] tackled the problem for AND/NOT-twigs (i.e., twigs with only AND and NOT predicates). No integral method has been reported that faces the full challenge of holistic B-twig computing. The challenge with full B-twigs lies in the arbitrary occurrences of an arbitrary number of AND/OR/NOT predicates in a B-twig (we refer to this challenge as the double arbitrariness challenge of B-twigs). This challenge makes programmatic handling of B-twigs in the framework of holistic computing extremely hard (if not impossible). The severity of this challenge, we believe, is the reason why a holistic join solution for B-twigs has not been developed


nearly 9 years after Bruno et al. first proposed the promising holistic join approach [5], which quickly inspired the solutions for AND/OR-twigs [7] and AND/NOT-twigs [13] separately proposed by different researchers. From AND/OR-twigs and AND/NOT-twigs to full B-twigs appears to be just one step. However, as the complexity implied by the double arbitrariness blows up, a holistic join approach for full B-twigs cannot be obtained simply by combining the methods separately designed for AND/OR-twigs and AND/NOT-twigs. Rather, a more creative strategy with more powerful supporting mechanisms must be invented for B-twigs.

Solving the challenge of holistic B-twig computing has both practical and academic significance. From the practical perspective, this effort helps to mature the promising holistic twig join approach and can immediately find use in real XML query applications; from the academic side, it solves an important technical problem, and the obtained result can be generalized to any data source based on a tree data model (XML is just one use case of the general tree data model). We are thus motivated to sort out the complications involved in holistic computing of B-twig pattern matches.

In this paper, we present our complete approach, including the techniques we developed for systematically solving holistic B-twig computing. The contributions of our work reported here can be summarized as follows:

• We propose a novel facility, i.e., B-twig normalization, that serves as the first milestone in our approach toward eventually solving the increased complexity of B-twigs.
• We expound a sound method for automatically performing B-twig normalization, which is an important pre-step in our overall approach.
• We present BTwigMerge, the first holistic join algorithm ever designed for (normalized) B-twigs, including numerous original supporting mechanisms.
• BTwigMerge performs optimal matching [5] for both AD (Ancestor-Descendant) edges and PC (Parent-Child) edges, while prior algorithms claim optimality only for AD edges.

The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 sets forth the preliminaries for the subsequent discussion, including the data model, B-twig representation, and normalization. Section 4 describes our B-twig pattern matching approach in a general way and introduces the original supporting mechanisms of this approach. Section 5 presents our algorithm, BTwigMerge, including its various supporting functions (each of which implements an important supporting mechanism). Section 6 provides experimental results, demonstrating the superiority of our approach and algorithm. The paper is concluded in Section 7.

2 RELATED WORK

Twig pattern matching is a core operation in XML query processing. Naive navigation (or pointer-chasing), structural joins, and holistic twig joins have all been studied for twig pattern matching. In the following, we review representative works on structural joins and particularly on holistic twig joins.

The first structural join (called containment join) algorithm was proposed by Zhang et al. [14], which extends the traditional merge join to multipredicate merge join (MPMGJN). Al-Khalifa et al. [4] later proposed two families of structural join algorithms, i.e., tree-merge and stack-based structural joins, as primitives for XML twig query processing.

In 2002, Bruno et al. [5] first proposed the holistic twig join approach for XML twig queries in order to overcome the drawback of structural joins that usually generate large sets of unused intermediate results. Bruno et al. designed the first holistic twig join algorithm, named TwigStack, which is optimal for twigs with only AD edges (but not with PC edges). The work of Lu et al. [9] aimed at making up this flaw and they presented a new holistic twig join algorithm, TwigStackList, in which a list structure is used to cache limited elements in order to identify a larger optimal query class. Chen et al. [6] studied the relationship between different data partition strategies and the optimal query classes for holistic twig joins. Lu et al. [10] proposed a new labeling scheme, called extended Dewey, and an interesting algorithm, named TJFast, for efficient processing of XML twig patterns. Unlike all previous algorithms based on region encoding, to answer a twig query, TJFast only needs to access the labels of the leaf query nodes. The result of Lu et al. [10] includes enhanced functionality (can process limited wildcard), reduced disk access, and increased total query performance. The same group [11] also studied efficient processing for ordered XML twig patterns using their region encoding scheme.

In an ordinary twig, the multiple sibling nodes under a common parent node automatically signify the AND logic relationship among them, and all previously proposed holistic twig join algorithms already support this implied AND logic in their implementation schemes.
Users would take all three commonly used logical predicates, AND, OR, and NOT, as granted facilities in formulating their XML queries, and thus would expect full support from a query engine for unlimited use of all these predicates in their XML queries. Jiang et al. [7] made the first effort toward incorporating support for OR predicates into the holistic twig join approach pioneered by Bruno et al. [5], and Yu et al. [13] made an effort to support NOT predicates in XML twig queries.

Jiang et al. [7] presented an interesting framework for holistic processing of AND/OR-twigs based on the concept of OR-block. By resorting to OR-blocks, an AND/OR-twig is transformed to an AND-only twig carrying special OR-blocks. This work is inspiring to us: we find it is possible to substantially extend their framework so that the NOT logic can be seamlessly incorporated. Nevertheless, this extension is not straightforward and requires creatively reworking the existing mechanisms. In order to harness the complexity of B-twigs, we resort to B-twig normalization; then, based on normalized B-twigs, we are able to extend and adapt the OR-block concept with new supporting mechanisms for handling the NOT predicates involved in B-twigs.

The recent publication of Xu et al. [12] proposed another interesting algorithm that claims to be able to efficiently compute the answers to XML queries without holistically computing the twig patterns; the answers obtained contain individual elements corresponding to designated output


query nodes. So basically this work does not belong to the category of holistic twig join algorithms. What is interesting about their work [12], however, is the proposed path-partitioned element encoding scheme, which has potential for efficiency gains and may be considered in the future for further improving the performance of holistic B-twig pattern matching.

3 PRELIMINARIES
Fig. 1. Example twigs involving NOT.

In this section, we first address the data model issue, B-twig representation and normalization, and then introduce the notations and operations needed in our subsequent discussion.

3.1 Data Model

We adopt the general perspective [5] that an XML database is a forest of rooted, ordered, and labeled trees, in which each node corresponds to a data element/value and each edge represents an element-subelement or element-value relation. The order among sibling nodes implicitly defines a total order on the tree nodes. Node labels are important for efficient processing of a twig pattern, as properly designed node labels may eliminate the need to access the node contents during query evaluation. This is especially true for twig pattern matching, which is at the core of XML query processing. Node labels typically encode the region information of data elements, which reflects the relative positional relationships among the elements in the source data file. We assume a simple encoding scheme using a triplet region code, (start, end, level), which is assigned to each data element in a tree database as a label. When multiple documents are present, the document-id is added to the labels to differentiate the documents. Region codes can be conveniently obtained through a preorder traversal of the document tree.

3.2 Tree Representation

Each XML query implies a twig pattern, small or large. The smallest twig may contain just a single node, but a typical twig usually comprises a number of nodes. The target of our investigation is the B-twigs that allow arbitrary combinations of AND, OR, and NOT predicates, each of which may have multiple occurrences. Each B-twig may consist of two general categories of nodes: ordinary query nodes standing for element types (or tags), and special connective nodes denoting the logical predicates AND, OR, and NOT. More specifically, we represent a B-twig using the following specific types of nodes:
• QNode. An ordinary query node, associated with an element type (or tag name) in a tree database. For programmatic purposes (as in [7]), a QNode records its location step axis, // or /, for the edge test, and a tag name for the node test. Therefore, the content of a QNode takes the general format of /tag or //tag. A nonroot QNode in a B-twig may be conveniently called a d-child or c-child (of its parent) depending on whether a // or a / symbol is recorded in the QNode's content. (Notice that in the sequel we may not always show the location step axes in our illustrations when the emphasis is on something else.)

• ANode. An AND predicate node, which always takes the text AND as its content. It connects two or more child subtrees through the AND logic.
• ONode. An OR predicate node, which always takes the text OR as its content. It connects two or more child subtrees through the OR logic.
• NNode. The NOT predicate node, which always takes the text NOT as its content. Functionally, an NNode negates the predicate denoted by the subtree immediately underneath it. An NNode is commonly combined with the node underneath it in the B-twig, forming a composite node. We have the following three kinds of composite nodes related to NOT:

  • NQNode. The combined form of a NOT node with a subsequent QNode child (such a combination just makes the representation of a B-twig more compact, and does not affect the semantics or interpretation of the twig pattern). For example, in the query Q1 (shown in Fig. 1), the NOT and the subsequent child QNode /B can be combined and replaced by a single NQNode with content ¬/B.
  • NANode. The combined form of a NOT node with its sole ANode child.
  • NONode. The combined form of a NOT node with its sole ONode child.

We could also have a fourth type of composite node that represents a NOT node combined with another (child) NOT node (i.e., a double negation node that could be named NNNode). As the net effect of double negation is the same as no negation at all, double negation nodes are not actually used in our representation for B-twigs.

From now on, we generally refer to QNodes and NQNodes as query nodes, and to the other (plain or composite connective) nodes in a B-twig as nonquery nodes. With the above mechanisms introduced, our representation scheme for B-twigs covers a superset of what can be represented by the scheme adopted by Jiang et al. [7] for the simpler AND/OR-twigs.

Considering NOT as a new element added to B-twigs (compared with the AND/OR-twigs studied in [7]), we next illustrate the four typical cases in which NOT predicates may appear in an XML twig query. The following four queries exemplify these four representative cases:

Q1. A[NOT /B/C]
Q2. A[NOT (/B AND /C)]//D
Q3. A[NOT (/B OR /C)]//D
Q4. A[NOT /B[NOT /C]]

The B-twig representations of these queries using the various types of nodes introduced earlier are depicted in


Fig. 1, where predicate nodes are boxed for highlighting and output nodes are specially marked. We adopt XPath-like expressions to represent a B-twig XML query; these expressions are accordingly called B-twig expressions. In such a B-twig expression, [ ] still denotes the interpolation of a predicate filter into a path expression, and ( ) signifies a subexpression or an other-than-default association between an operator and its operands that needs to be enforced. Q1 returns all A nodes that do not have any child node of type B that contains a child node of type C. Q2 returns all D nodes that are descendants of an A node that must not have both a child node of type B and a child node of type C. Q3 returns all D nodes that are descendants of an A node that has neither a child node of type B nor a child node of type C. Q4 returns all A nodes that do not have a child node of type B that does not contain a child node of type C.

In a B-twig, there may exist two kinds of AND logic, implied AND and explicit AND, of which the latter is represented through an explicit ANode. For example, in Fig. 1 (Q2), the AND logic between the entire left branch and the right branch (which contains a single leaf node //D) is implied, while the logic between the left leaf /B and the right leaf /C (in the left branch of Q2) is expressed through an explicit ANode, i.e., the AND node right above them.

The answer to a B-twig query is a set of qualified twig instances (i.e., the embeddings of the twig pattern into the tree database). In previous algorithms, an output twig instance contains elements corresponding to all the QNodes appearing in the twig pattern. After generalizing to B-twigs, we need to adjust our output model accordingly, since it makes little sense to output those elements corresponding to a QNode that is inside a predicate filter (especially those QNodes below a NOT predicate).
We assume the following simple output model for B-twigs: each output twig instance of a B-twig query comprises elements from only those QNodes that are not inside any predicate. The subtwig resulting from the original input B-twig query after pruning all predicate branches (which are subtrees rooted at a predicate node) is called the output twig of the query. Each leaf on the output twig is called an output leaf. All QNodes on the output twig require output if matching elements are eventually found. For the sake of simplicity, we no longer mark the output nodes in subsequent illustrations.
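The data model and representation described in Sections 3.1 and 3.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `Region`, `edge_holds`, `Node`, and `output_twig` are hypothetical, the containment test on (start, end, level) is the standard one used with region encoding rather than a formula given in the text, and giving the root QNode a "/" axis is an arbitrary choice.

```python
from dataclasses import dataclass, field
from typing import List, NamedTuple


class Region(NamedTuple):
    """Region label (start, end, level), as produced by a preorder traversal."""
    start: int
    end: int
    level: int


def edge_holds(parent: Region, child: Region, axis: str) -> bool:
    """Test an AD ('//') or PC ('/') edge using only the two labels."""
    # Assumed standard containment test: the child's interval lies
    # strictly inside the parent's interval.
    contained = parent.start < child.start and child.end < parent.end
    if axis == "//":
        return contained
    return contained and child.level == parent.level + 1


@dataclass
class Node:
    kind: str                      # "Q", "AND", "OR", or "NOT"
    tag: str = ""                  # element tag (QNodes only)
    axis: str = "/"                # "/" or "//" (QNodes only)
    children: List["Node"] = field(default_factory=list)


def output_twig(node: Node) -> Node:
    """Copy a QNode-rooted twig, pruning every branch rooted at a predicate node."""
    kept = [output_twig(c) for c in node.children if c.kind == "Q"]
    return Node("Q", node.tag, node.axis, kept)


# Q2 from the text: A[NOT (/B AND /C)]//D
q2 = Node("Q", "A", "/", [
    Node("NOT", children=[
        Node("AND", children=[Node("Q", "B", "/"), Node("Q", "C", "/")]),
    ]),
    Node("Q", "D", "//"),
])
out = output_twig(q2)
print(out.tag, [(c.axis, c.tag) for c in out.children])
```

For Q2, pruning the NOT branch leaves the output twig A//D, so D is the only output leaf. Keeping the axis on the child QNode, as the text describes, lets `edge_holds` decide an edge from two labels without touching element contents.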

Fig. 2. A motivating example.

3.3 B-Twig Normalization

On the one hand, as a basic formatting mechanism, the three common logical operations, AND, OR, and NOT, bear obvious importance to XML queries (and other types of queries). On the other hand, the holistic twig join approach pioneered by Bruno et al. [5] has repeatedly been shown to outperform other methods by various researchers in different research contexts [6], [8], [7], [9], [10], [11], [13]. Naturally, one would expect the overall holistic twig join approach to be extended to more general twigs such as B-twigs, which indeed has been a goal of ours (and of others) for many years since the original work of Bruno et al. [5]. The groups at HKUST [7] and NUS [13] each investigated a different aspect of this issue: the former focused on AND/OR-twigs [7] and the latter on AND/NOT-twigs [13]. However, 5 years have passed since then, and no further progress has been made on holistically computing B-twig patterns.

We have spent several years trying to conquer the full challenge of B-twigs. We could not easily sort out the complications caused by the double arbitrariness of B-twigs, and thus could not systematically and programmatically solve the problem of B-twigs with a clean algorithm. However, our effort has helped us gain in-depth insight into the challenge of holistic B-twig pattern computing. Let us now look closer at the technical consequence of the double arbitrariness of B-twigs: any node (whether a regular query node or a logical predicate node) in a B-twig must be interpreted in the context of the B-twig as a whole, and every query node functions as a (partial) filter for the others. For example, in the simple B-twig of Fig. 2a, the qualification of an A node instance is verified via the existence of a B node instance, and vice versa; moreover, nodes E and F, via the logical predicates OR and NOT, are in turn used to qualify a C node instance, and together to qualify a B node instance, and then all together to qualify an A node instance. In the context of holistic B-twig computing, due to the double arbitrariness inherent in a B-twig, it is simply too hard to sort things out programmatically in a direct way and then deal with each case, each subcase, each subsubcase, and so on. NOT predicates are the most problematic. For example, with twig query Q4 (see Fig. 1), at the lower part, the existence of a C node instance disqualifies the parent B node instances; however, after passing another NOT node, what had been negated (rejected) is now required. Remember that Q4 is a fairly simple example with only a single path and merely two NOT predicates; now imagine how overwhelmingly the complexity would add up with a general B-twig that allows arbitrary combinations of an arbitrary number of AND/OR/NOT predicates in an arbitrarily shaped twig structure.

The challenge we perceive leads us to believe that, without taking a special action first, programmatically sorting things out and designing a holistic algorithmic solution are going to be extremely hard, if at all possible (we tend to believe that this is the very reason why no satisfactory holistic solution has been proposed for B-twigs until now). Our insight into the essentials of holistic B-twig computing inspired us to focus directly on the key point of the challenge: double arbitrariness in B-twigs. We turn to normalization, in the hope that through proper normalization the arbitrariness (and thus the complexity) in a B-twig can be effectively restrained, eventually leading to a satisfactory solution. Normalization of Boolean logical expressions has successfully aided automated theorem proving because normalized forms effectively reduce the complexity of the combination of operations in a Boolean


expression. We next discuss how to incorporate normalization into B-twigs.

Before we proceed, let's look at a motivating example that helps exclude further consideration of an alternative approach: the decomposition-based approach for B-twigs. From now on, we choose not to explicitly show the location step axis symbols, / and //, in our illustrations when our discussion uniformly applies to both cases and there is no foreseen confusion. The motivating example is illustrated in Fig. 2, where (a) is the input B-twig. A decomposition approach would decompose this simple B-twig into three parts, c1, c2, and c3, as shown in Fig. 2c, then separately compute the three simple twigs using existing algorithms (either structural joins [4] or holistic joins such as TwigStack [5]) without any concern for logical predicates, and finally logically combine the obtained subtwig results to form the whole answer. Specifically, with this example, the final answer is computed by the following expression,

$$\mathit{Path}^{A/B/C}_{A/B/C} \;-\; \mathit{Path}^{A/B/C}_{A/B/C/E} \;-\; \mathit{Path}^{A/B/C}_{A/B/C/F}$$

where each $\mathit{Path}$ term returns a set of paths, of which the superscript describes the return pattern and the subscript describes the filter twig pattern that the returned paths must satisfy. In this example, all three filter twig patterns happen to be paths; in the first term, the return pattern and the filter pattern happen to be the same path pattern, /A/B/C. This expression simply excludes all path solutions of pattern A/B/C satisfying twig pattern A/B/C/E or A/B/C/F from the total set of paths matching pattern A/B/C. A rather obvious problem of this decomposition approach is that the input streams corresponding to types A, B, and C need to be scanned three times each, the (intermediate) I/O cost is accordingly tripled, and the final combination step is also very expensive, as it includes two set-difference operations (note the complexity involved in checking equality among twig instances rather than simple numeric values).
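To make the cost of the decomposition approach concrete, the following toy Python sketch evaluates the three path terms separately and then combines them with two set differences. Everything in it is assumed for illustration: the document `DOC`, the helper `match_paths`, the use of parent-child edges only, and the element ids. It is a naive tree walk, not a structural or holistic join.

```python
from typing import List, Set, Tuple

# A toy document: each element is (id, tag, children). Hypothetical data.
DOC = (1, "A", [
    (2, "B", [
        (3, "C", [(4, "E", [])]),
        (5, "C", []),
    ]),
    (6, "B", [
        (7, "C", [(8, "F", [])]),
    ]),
])


def match_paths(root, tags: List[str], keep: int) -> Set[Tuple[int, ...]]:
    """Scan the document once and return the first `keep` element ids of
    every root-anchored parent-child path whose tags equal `tags`."""
    results: Set[Tuple[int, ...]] = set()

    def walk(elem, depth, ids):
        eid, tag, children = elem
        if tag != tags[depth]:
            return
        ids = ids + (eid,)
        if depth == len(tags) - 1:
            results.add(ids[:keep])
            return
        for child in children:
            walk(child, depth + 1, ids)

    walk(root, 0, ())
    return results


# Three separate scans, one per term of the decomposition expression.
all_abc = match_paths(DOC, ["A", "B", "C"], keep=3)
with_e = match_paths(DOC, ["A", "B", "C", "E"], keep=3)
with_f = match_paths(DOC, ["A", "B", "C", "F"], keep=3)

# Two set-difference operations over whole path instances.
answer = all_abc - with_e - with_f
print(sorted(answer))  # [(1, 2, 5)]
```

Even on this tiny input the A, B, and C elements are visited three times, and the final combination compares entire path tuples rather than single values, which is the overhead the text attributes to decomposition.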
Before us, Jiang et al. [7] carefully compared the holistic approach with the decomposition approach and concluded that, for AND/OR-twigs, the decomposition approach typically causes 100 percent more disk I/Os than their holistic algorithm, GTwigMerge. We expect a much greater ratio of extra I/Os for more complex B-twigs if a decomposition approach is used (our example above shows an I/O increase ratio of 200 percent; obviously, the more complex a B-twig is, the greater the increase in I/O cost).

Therefore, we choose to first transform an input B-twig into an equivalent alternative form, such as the one shown in Fig. 2b, that resembles the DNF (Disjunctive Normal Form) of Boolean logic. Since the arbitrariness of AND, OR, and NOT predicates in the transformed form is restrained (e.g., NOTs appear only at leaves), developing a holistic join scheme for such normalized B-twigs becomes more manageable. Regulating the occurrences of logical predicates in an arbitrary B-twig is the direct objective of

Fig. 3. Illustration of normalized B-twig patterns.

normalization, and the regulated occurrences of ANDs, ORs, and NOTs in a normalized B-twig make programmatic and holistic computation of B-twig patterns possible. The profile of our overall approach is now becoming clear: first, we regulate the occurrences of logical predicates in an arbitrary B-twig through an important preprocessing step, B-twig normalization; then, we send the normalized B-twig to a specially designed holistic algorithm for evaluation. Next, we discuss B-twig normalization in detail.

The XPath-like B-twig expressions introduced earlier preserve the essentials of ordinary Boolean logical expressions well. Our B-twig normalization borrows ideas from the normalization of Boolean logical expressions. The following definition incorporates DNF from Boolean logic into the context of a B-twig and forms our concept of normalized B-twigs. Notice that we choose DNF instead of CNF (Conjunctive Normal Form) because of the potential efficiency of DNF evaluation, which returns as soon as one branch is successfully verified.

Definition 3.1 (Normalized B-Twig). A normalized B-twig is a query tree that has only four types of nodes, QNodes, NQNodes, ONodes, and ANodes, and satisfies the following conditions: 1) every OR predicate branch can be mapped to a DNF; 2) every NQNode must be a leaf; 3) every explicit ANode must appear within an OR predicate branch.

In the above definition, condition 1) requires mapping every predicate branch to a DNF, meaning that the logic embedded in a predicate branch needs to satisfy the requirement of a DNF; condition 2) states that all NQNodes must be pushed down to leaves (this condition is stipulated in order to fulfill the requirement of DNF and to facilitate the processing of NOT predicates, which are otherwise troublesome in the framework of holistically processing B-twigs); condition 3) further requires that no ANode may appear above an ONode.
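As a rough illustration of Definition 3.1, the check below encodes the three conditions as a table of permitted child kinds. It is a sketch under assumptions rather than the authors' procedure: the `Node` class and `is_normalized` function are hypothetical, composite negations other than NQNode are treated as simply disallowed, and QNodes inside an OR branch are assumed to follow the same child rules as any other QNode.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                      # "Q", "NQ", "AND", or "OR"
    tag: str = ""
    children: List["Node"] = field(default_factory=list)


# Which child kinds each node kind may have in a normalized B-twig.
ALLOWED_CHILDREN = {
    "Q": {"Q", "NQ", "OR"},     # siblings carry the implied AND; no explicit ANode here
    "OR": {"AND", "Q", "NQ"},   # an OR branch is a disjunction ...
    "AND": {"Q", "NQ"},         # ... of conjunctions, so no ANode sits above an ONode
    "NQ": set(),                # condition 2: negated query nodes are leaves
}


def is_normalized(node: Node, is_root: bool = True) -> bool:
    """Return True if the twig rooted at `node` has the normalized shape."""
    if is_root and node.kind != "Q":
        return False
    allowed = ALLOWED_CHILDREN.get(node.kind)
    if allowed is None:            # any other kind, e.g. a bare NOT node
        return False
    return all(c.kind in allowed and is_normalized(c, False)
               for c in node.children)


# A[(B AND NOT C) OR D]: already in the normalized shape.
good = Node("Q", "A", [
    Node("OR", children=[
        Node("AND", children=[Node("Q", "B"), Node("NQ", "C")]),
        Node("Q", "D"),
    ]),
])

# A[(B OR C) AND D]: an explicit ANode above an ONode violates condition 3.
bad = Node("Q", "A", [
    Node("AND", children=[
        Node("OR", children=[Node("Q", "B"), Node("Q", "C")]),
        Node("Q", "D"),
    ]),
])

print(is_normalized(good), is_normalized(bad))  # True False
```

Because an explicit ANode is permitted only as a child of an ONode, the table enforces conditions 1) and 3) together, while the implied AND among siblings of a QNode is left unrestricted.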
Notice that condition 3) does not apply to the implied AND logic between the sibling nodes originating from a common parent QNode. We envision the general pattern of normalized B-twigs to be like the illustration in Fig. 3, where the desired form for an OR predicate branch is enclosed in a dashed circle for highlighting. If we look only at the logical predicate nodes and ignore all QNodes in an OR branch, and imagine that we can somehow flatten the twig structure of the OR branch, then we would see the same DNF as we would get from the embedded Boolean logic expression, except that the OR and AND predicates in the context of a B-twig may be n-ary with subtrees as operands. As a short summary, in a


Fig. 4. Illustration of Rule 4.
Fig. 5. Illustration of Rules 6 and 6'.

normalized B-twig, the root node must be a QNode, which may have some QNode children, some NQNode children, and some ONode children; in turn, the root ONode (such as node n5 in Fig. 3) of an OR branch may only have some ANode children, some QNode children, and some NQNode children, in whatever order; in addition, any explicit ANode within an OR branch (e.g., node n6 in Fig. 3) may only have QNodes and NQNodes as children, and all NQNodes (if any) must be pushed down to leaves.

What we need next is a procedure for automatically obtaining the normal form of an arbitrary input B-twig. Once again we borrow ideas from the normalization of Boolean logical expressions and adapt them to the context of a B-twig. We envision that such a procedure consists of repeatedly pushing down NOT and AND predicates, similar to the computation of a regular Boolean DNF. More specifically, given an arbitrary B-twig, our normalization procedure carries out the following three consecutive transformation steps: 1) NOT-pushdown, 2) AND-pushdown, 3) simplification. Step 1 pushes NOT predicates all the way down to leaves (at leaves, NNodes and their negated QNodes are conveniently combined to form composite NQNodes). Step 2 then pushes down all explicit ANodes that are above an ONode. Step 3 cleans up and simplifies the resultant B-twigs from the prior steps by removing any unnecessary ANodes and ONodes.

Each transformation step is implemented via a set of transformation rules. In the following, we present these transformation rules using our XPath-like B-twig expressions. We assume the following additional conventions: for binary AND and OR predicates we use the infix form, such as A AND B, and for n-ary predicates we switch to the convenient functional form, such as AND(A, B, C).
In addition, as our rules are generic in the sense that they uniformly apply to both types of location step axes, / and //, we omit the axis symbols in these rules; for example, the term A[B] may be interpreted as either A[/B] or A[//B].

Rule 1. If a QNode n has a NANode child ni, then we push down the NOT logic by transforming ni to an ONode and all its children to their corresponding negated forms, as shown below:

A[NOT (B AND C)] ⇒ A[NOT B OR NOT C].

Rule 2. If a QNode n has a NONode child ni, then we push down the NOT logic by transforming ni to an ANode and all its children to their corresponding negated forms, such as:

A[NOT (B OR C)] ⇒ A[NOT B AND NOT C].

Rule 3. If a QNode n has a NQNode child n_i that has a QNode child n_j, then we push down the NOT logic by performing a transformation as exemplified below:

    A[NOT(B[C])] ⇒ A[NOT(B) OR B[NOT(C)]].

Rule 4. An explicit ANode appearing above an ONode can be pushed down (below the ONode) via the following transformation:

    A[(B OR C) AND D] ⇒ A[(B AND D) OR (C AND D)].

The transformation induced by Rule 4 (from left-hand side to right-hand side) is illustrated in Fig. 4; it packs together the two operations, split and union, used by Jiang et al. in [7].

Rule 5. If a QNode n has an NNNode child n_i, then we simply remove the double negation, as shown below:

    A[NOT(NOT(B))] ⇒ A[B].

Rule 5 appears trivial and unnecessary, since double negation is an obvious case of redundancy and an ordinary user would typically not use double negation in a meaningful query. But double negation may be formed as a result of prior transformations carried out for B-twig normalization, so we cannot ignore this rule. Rule 5 is explicitly called upon at the last normalization step (simplification) for cleaning up. For added efficiency, Rule 5 in our current implementation is enforced implicitly and instantly: whenever a NOT predicate is pushed down by a relevant rule and meets another NOT predicate, Rule 5 is immediately fired to remove the double negation.

Rule 6. If an ANode n has an ANode child n_i, we merge these two ANodes by removing n_i and directly linking the children of n_i to node n as the new parent, as follows:

    A AND (B AND C) ⇒ AND(A, B, C).

There is a variant of Rule 6, which we consider a special case of Rule 6 and name Rule 6′:

    A[B AND C] ⇒ A[B][C].

The input pattern of Rule 6′ (see Fig. 5b) can be regarded as a special case of the input pattern of Rule 6 (Fig. 5a) when envisioning the AND logic implied by the QNode A in Fig. 5b. After removing the unnecessary AND node in both cases, Rule 6 causes the transformation from Fig. 5a to Fig. 5c, while Rule 6′ induces the transformation from Fig. 5b to Fig. 5d.

Rule 7. If an ONode n has an ONode child n_i, we merge these two ONodes by removing n_i and directly linking the children of n_i to n (as the new parent), such as:

    A OR (B OR C) ⇒ OR(A, B, C).


Notice that Rule 7 is symmetric to Rule 6 if we consider OR as the counterpart of AND, but Rule 7 does not have a variant matching Rule 6′. The correctness of the above rules is self-evident. Take Rule 3 as an example: the left-hand side (typically a subquery) requires all A nodes that do not have any B child that has a C child, while the right-hand side asks for those A nodes that either do not have any B child at all or whose B child does not have any C child. The right-hand side restates the same filtering criterion, and the equivalence between the two sides holds. For a rule-based system, good performance comes from a good rule control strategy. Our normalization procedure logically consists of three consecutive steps: Step 1 repeatedly calls Rules 1 to 3 to push NOTs down (to the leaves), Step 2 calls Rule 4 to push (only explicit) ANDs down (below ONodes), and Step 3 calls Rules 5 to 7 to remove unnecessary predicate nodes. In the actual implementation, we made two small optimizations: 1) Step 3 is fused into Steps 1 and 2, e.g., whenever two NOTs are made to meet by a prior transformation, Rule 5 is instantly invoked to remove both of them, and analogously for the case of two ANDs (invoking Rule 6) and the case of two ORs (invoking Rule 7); 2) Rule 4 is applied to an input B-twig bottom-up, which can be more efficient, e.g., assuming node D on the left of Fig. 4 is a subtwig embedding another match of the input pattern of Rule 4, bottom-up application of Rule 4, i.e., processing the D subtwig before it is distributed, saves one application of the rule. Basically, our normalization performs a linear search for matching rules and fires them when found, and the time taken is O(n · m), where n is the size of the input B-twig in terms of its total number of nodes and m is the number of rules.
Since in practice an input twig rarely has more than a dozen nodes and our rule set contains only seven rules, the normalization step can be performed extremely efficiently (in all the experiments we have done so far, the most expensive normalization took 55 milliseconds).

Theorem 3.1. Every B-twig has an equivalent normal form (per Definition 3.1), which can be obtained by the seven normalization rules.

Proof. First, Rules 1, 2, and 3 together cover all three cases in which a NOT predicate may appear in a B-twig, i.e., respectively, above an AND, above an OR, and above an ordinary query node (the case of double NOTs is left for Rule 5 to handle subsequently). In each case, the corresponding rule is applied to push the NOT logic down, and repeated application of these rules guarantees that all NOTs are pushed down to the leaves. Then, at Step 2, Rule 4 is (repeatedly) applied and is sufficient to push down every explicit AND node that is above an OR node. Finally, at Step 3, Rules 5, 6/6′, and 7 remove all redundant NOTs, ORs, and ANDs and produce a normalized form compliant with Definition 3.1, which is equivalent to the original input B-twig since each transformation is backed by an equivalence-preserving rule. □

The above theorem implies the validity of Definition 3.1 and the completeness of our rule set. A more formal proof can be constructed by structural induction on the tree structure of input B-twigs (we omit it for space reasons).
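The three normalization steps can be illustrated with a small Python sketch. The tuple encoding of B-twig nodes and the function names push_not, dnf, simplify, and normalize are assumptions made for this illustration; they do not come from the paper. Note also that Steps 2 and 3 are folded here into a direct DNF computation using a Cartesian product, which is intended to reach the same normal form as repeated application of Rules 4, 6, 6′, and 7 but is not the rule-firing procedure described above.

```python
from itertools import product

# Hypothetical encoding (not from the paper):
#   ("Q", name, [preds])   query node; its predicate children are implicitly ANDed
#   ("NOT", x), ("AND", [xs]), ("OR", [xs])   logical nodes

def push_not(node, neg=False):
    """Step 1 (Rules 1-3 and 5): push NOTs down to leaf query nodes."""
    kind = node[0]
    if kind == "NOT":                       # Rule 5: a double NOT cancels via the flag
        return push_not(node[1], not neg)
    if kind in ("AND", "OR"):
        if neg:                             # Rules 1 and 2: swap AND/OR, negate children
            kind = "OR" if kind == "AND" else "AND"
        return (kind, [push_not(c, neg) for c in node[1]])
    _, name, preds = node                   # kind == "Q"
    if not neg:
        return ("Q", name, [push_not(p) for p in preds])
    if not preds:                           # negated leaf QNode: an NQNode
        return ("NOT", ("Q", name, []))
    # Rule 3: NOT(B[C]) => NOT(B) OR B[NOT(C)]
    inner = preds[0] if len(preds) == 1 else ("AND", preds)
    return ("OR", [("NOT", ("Q", name, [])),
                   ("Q", name, [push_not(inner, True)])])

def dnf(pred):
    """Return a predicate as a list of conjunctions (each a list of atoms)."""
    kind = pred[0]
    if kind == "OR":                        # Rule 7: nested ORs are flattened
        return [conj for c in pred[1] for conj in dnf(c)]
    if kind == "AND":                       # Rules 4 and 6: distribute AND over OR
        return [sum(combo, []) for combo in product(*(dnf(c) for c in pred[1]))]
    if kind == "Q":
        return [[simplify(pred)]]
    return [[pred]]                         # NQNode

def simplify(q):
    """Steps 2 and 3 for the twig rooted at query node q."""
    _, name, preds = q
    out = []
    for p in preds:
        conjs = dnf(p)
        if len(conjs) == 1:                 # Rule 6': splice a pure AND into the QNode
            out.extend(conjs[0])
        else:
            out.append(("OR", [c[0] if len(c) == 1 else ("AND", c) for c in conjs]))
    return ("Q", name, out)

def normalize(q):
    return simplify(push_not(q))

B, C, D = ("Q", "B", []), ("Q", "C", []), ("Q", "D", [])
twig = ("Q", "A", [("AND", [("OR", [B, C]), D])])   # A[(B OR C) AND D], input of Rule 4
print(normalize(twig))
```

Carrying the negation as a flag during the descent mirrors the fused handling of Rule 5 mentioned above: two NOTs that meet simply flip the flag back.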

Remarks.

1. Of the seven rules, only Rules 3 and 4 cause query expansion. We ignore added logical nodes because, in our implementation, logical nodes barely contribute to the runtime of our holistic processing (cf. Section 5.3), while the number of query nodes (each is associated with an input stream and induces significant I/O and CPU costs) and the total size of the input streams together dominate the CPU cost. Each application of Rule 3 causes an increase of one query node (the expansion ratio is 1/3 according to the input pattern of Rule 3), which is marginal. Each application of Rule 4 duplicates one query node in the best case, as shown in Fig. 4 where node D is duplicated, but in the worst case, i.e., when D itself is a subtree, the query expansion can be exponential in the number of query nodes.

2. However, the worst case rarely happens in practice because, first, a practical XML query rarely has more than a dozen query nodes; second, normalization is only required at subtrees rooted at a predicate node (that is, either an ANode containing some ONodes or an NNode that negates a nonleaf node), so Rule 4 is hardly ever applied more than once with real queries; third, we expect most users to formulate their queries in a format that is close to the normalized form (this is a realistic expectation, since DNF as a logical expression format is naturally appealing to anyone who has received even minimal training).

3. In the case when expansion cannot be avoided, we note that the duplicate query nodes do not come with any extra I/O cost, as we only need to read their streams into memory once and use in-memory buffers that maintain separate stream cursors for the duplicated query nodes. In addition, these duplicated query nodes only appear in predicate subtrees and are nonoutput nodes according to our output model, so they do not cause extra operations on any stack under the processing strategy embodied in our algorithms.

4. So, although it appears to be problematic, query expansion caused by normalization does not have serious consequences, and it enables an optimal holistic solution for (normalized) B-twigs. We have done experiments with numerous (though still limited) queries and observed that the maximum expansion ratio after normalization is about 50 percent in terms of duplicated query nodes, and the average expansion ratio is about 30 percent.

3.4 Auxiliary Operations

Given a twig query Q, we use q (and its variants such as q_i and q′) to denote a query node (i.e., a QNode or NQNode; see footnote 1) in Q or the subtree rooted at q, and use n (and its variants such as n_i and n′) to refer generally to any node allowed to appear in Q.

Footnote 1. When it is an NQNode, what we really care about is the QNode being negated and its associated stream, which needs to be physically accessed.


We define a series of auxiliary operations on a B-twig and its nodes. The operations used by our algorithm are listed below.

- children(n) returns all child nodes of n.
- parent(n) returns the parent node of n.
- Qchildren(n) returns the set of QNodes in the subtree rooted at n that are reachable from n without traversing other QNodes.
- NQchildren(n) returns the set of NQNodes in the subtree rooted at n that are reachable from n without traversing other tree nodes.
- Qparent(n) returns the nearest ancestor QNode of n.
- Qsibling(q) returns all sibling QNodes of q (excluding q itself).
- subtreeQNodes(q) returns all QNodes in the subtree rooted at q (including q).
- isLeaf(n) tests whether node n is a leaf.
- isOutNode(n) tests whether node n is an output node.
- isOutLeaf(n) tests whether node n is an output leaf.
- isRoot(n) tests whether node n is the root.
- isQNode(n) tests whether node n is a QNode.
- isNQNode(n) tests whether node n is a NQNode.
- isONode(n) tests whether node n is an ONode.
- isANode(n) tests whether node n is an ANode.

All test operations return the Boolean value true or false.

Furthermore, we assume that each element in our tree database is assigned a triplet region code as its label; each query node q in a B-twig is associated with an element stream, named T_q; each stream maintains a list of elements that satisfy the node test and any relevant predicate; the elements in a stream are sorted by their labels in ascending order; and each stream T_q is equipped with a cursor, denoted by C_q, to facilitate element access. We define the following operations on every stream and its cursor:

- end(C_q) tests whether cursor C_q has reached the end of the stream T_q.
- C_q → advance() advances cursor C_q forward by one position.

In addition, for each output QNode q in B-twig Q, a stack, named S_q, is allocated. Similar to what is described by Bruno et al. in [5], each item in stack S_q consists of a pair (e-label, p-pointer), where e-label is the label of element e from stream T_q and p-pointer points to the entry of a matching parent element of e in the parent stack, S_parent(q). The three common stack operations, pop, push, and top, are assumed. Furthermore, the nodes in each stack S_q (from bottom to top) must lie on a root-to-leaf path in the given tree database.
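The stream, cursor, and stack conventions above can be pictured with the following minimal Python sketch. The layout of the region code as (start, end, level), the class names Region and Stream, and the helper push are assumptions for illustration only; the p-pointer is modeled as an index to the top of the parent stack at push time, in the style of TwigStack [5].

```python
from typing import List, NamedTuple, Tuple

class Region(NamedTuple):
    """Hypothetical triplet region code; the paper only says 'triplet'."""
    start: int
    end: int
    level: int

class Stream:
    """Element stream T_q together with its cursor C_q."""
    def __init__(self, elements: List[Region]):
        self.elements = sorted(elements)   # ascending by label (start first)
        self.pos = 0                       # cursor C_q

    def end(self) -> bool:                 # end(C_q)
        return self.pos >= len(self.elements)

    def head(self) -> Region:              # element currently under the cursor
        return self.elements[self.pos]

    def advance(self) -> None:             # C_q -> advance()
        self.pos += 1

# A stack item is (e-label, p-pointer); p-pointer is -1 when there is no parent entry.
StackItem = Tuple[Region, int]

def push(stack: List[StackItem], parent_stack: List[StackItem], e: Region) -> None:
    """Push e with a pointer to the current top of the parent stack."""
    stack.append((e, len(parent_stack) - 1))

t_a = Stream([Region(10, 19, 2), Region(1, 20, 1)])   # deliberately unsorted input
s_root: List[StackItem] = []
s_a: List[StackItem] = []
push(s_root, [], Region(0, 100, 0))
while not t_a.end():
    push(s_a, s_root, t_a.head())
    t_a.advance()
```

After the loop, the elements in s_a are nested from bottom to top, which is the root-to-leaf path property required of every stack.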

4 B-TWIG MATCHING: APPROACH AND SUPPORTING MECHANISMS

The normalization presented earlier helps to sort out the otherwise arbitrary combinations of logical predicates in B-twigs, and thus significantly facilitates solving the holistic join problem of B-twigs programmatically (part of the complexity is transferred to the normalization step, where it is absorbed). From now on, we focus on normalized B-twigs and present our methodology accordingly. Also, the term B-twig generally refers to a normalized B-twig in the sequel.

We attain our objective in two steps: first, we define what a match of a B-twig pattern is; second, we develop a valid strategy for obtaining the matches of a B-twig in a given tree database. Per Definition 3.1, a normalized B-twig may contain four different types of nodes: QNode, NQNode, ONode, and ANode, of which explicit ANodes can only appear within an ONode branch. Thus, our definition of a match of a B-twig requires checking the condition induced by each of these node types.

In a normalized B-twig, an edge leading into a QNode still represents a child or descendant location step originating from the parent node, and the specific axis of the location step is recorded in the child QNode. So the edgeTest function introduced by Jiang et al. [7] is still useful for testing the location step between a pair of candidate data elements. However, NQNodes (which combine NNodes and the subsequent QNodes) are unique to B-twigs and bring a new challenge for holistic B-twig computing. At first glance, it seems that simply reversing the result of the edgeTest function would suffice to solve the problem of NQNodes. This is deceptive, because evaluating an NQNode actually requires excluding all the matching elements of the negated QNode. We must design a dedicated mechanism for the evaluation of NQNodes in a B-twig. This mechanism is another primitive function, named nEdgeTest. As for the evaluation of an ONode in a B-twig, the matter is more complicated than for AND/OR-twigs because of the new NQNodes, so a novel evaluation strategy for OR predicates is needed.

For ease of presentation, we adopt the following conventions: each query node q_i is associated with a data element e_i (by changing q to e) such that tag(e_i) = tag(q_i); similarly for the symbol n_i when it refers to a query node (QNode or NQNode).

Definition 4.1 (edgeTest(e′, e) or edgeTest(e′, q)). Let q be a QNode in a B-twig and q′ be Qparent(q), and let e and e′ be the associated elements of q and q′, respectively. Boolean function edgeTest(e′, e) or edgeTest(e′, q) evaluates to true if element e′ is an ancestor (respectively, a parent) of element e when q is a d-child (respectively, a c-child) of q′.

Definition 4.2 (nEdgeTest(e′, q)). Let q be an NQNode in a B-twig and q′ be Qparent(q), and let e′ be the associated element of q′. Boolean function nEdgeTest(e′, q) evaluates to true if, for all elements e_i (if any) associated to the negated QNode q, function edgeTest(e′, e_i) returns false.

Definition 4.3 (ONodeTest(e, n)). Let ONode n be the root of an OR predicate subtree, and let q be Qparent(n), associated to element e. Boolean function ONodeTest(e, n) evaluates to true if e satisfies the OR predicate represented by the ONode n.

Definition 4.3 will become more concrete after we present Definition 4.6, which explains how to satisfy an OR predicate.

Definition 4.4 (Match for a B-Twig). Let Q be a query tree with N nodes n_1, n_2, ..., n_N, where n_1 is the root QNode. By convention, e_i is the associated element of n_i if n_i is a QNode or NQNode. We say element e_1 has a match for the query tree rooted at n_1 if the following holds for each child subtree n_{k_i} of n_1: 1) if n_{k_i} is an ONode, then ONodeTest(e_1, n_{k_i}) evaluates to true; 2) if n_{k_i} is an NQNode, then nEdgeTest(e_1, n_{k_i}) evaluates to true; 3) otherwise (i.e., n_{k_i} is a QNode), edgeTest(e_1, n_{k_i}) evaluates to true and element e_{k_i} has a match for the subtree rooted at n_{k_i} in case n_{k_i} is not a leaf.

Note that in the above definition ANodes are not treated as a relevant case, because in a normalized B-twig a QNode can never have an explicit ANode child: all ANodes (if any) must be inside an ONode subtree. Definition 4.4 is a sound recursive definition.

Fig. 6. A nonnormalized B-twig example.

Fig. 7. A normalized B-twig with OR-blocks.

Definition 4.4 implies that, in order to identify a match for a B-twig, we need to call upon the three primitive functions: ONodeTest, nEdgeTest, and edgeTest. Their implementations thus become critical and need to be addressed before we proceed to our main algorithms. Solving edgeTest and nEdgeTest (per their respective definitions) is relatively easy, while solving ONodeTest is a little tricky. Recall that, in order to evaluate the OR predicates in AND/OR-twigs efficiently, Jiang et al. [7] introduced OR-blocks as a mechanism to simplify the representation of AND/OR-twigs. We carry the idea of OR-blocks a step further for evaluating the OR predicates in normalized B-twigs. As we have to deal with NOT predicates as well in the context of a B-twig, we need a more powerful OR-block mechanism. With our redefined OR-block mechanism, a B-twig can be viewed and treated at a higher level as an AND-twig containing ordinary query nodes and special nodes (OR-blocks). As a result, the existing and efficient holistic twig join approach can be leveraged (adapted) for B-twigs.
This is the key to retaining the efficient framework of existing holistic algorithms while incorporating the new mechanisms for AND/OR/NOT predicate evaluation needed for B-twigs. In the following, we first redefine the OR-block notion and then develop a correspondingly more sophisticated evaluation strategy for the more complex OR predicates contained in B-twigs.

Definition 4.5 (OR-Block). Given a twig query Q, an OR-block is a tree t embedded in Q such that the root of t is an ONode n, parent(n) is a QNode, and the leaf nodes of t are Qchildren(n) or NQchildren(n). In addition, a logical formula, denoted as P(n), is recorded in the root structure of the OR-block.

An OR-block encapsulates the details inside an OR predicate branch. The logical formula P(n) recorded in an OR-block provides the needed information when the OR predicate needs to be evaluated w.r.t. a parent element node. Serving the evaluation of an OR predicate in the context of normalized B-twigs, our OR-block notion differs from that in [7] in two ways: 1) it does not contain nested OR-blocks; 2) it may contain NQNodes in addition to ONodes, and they must all appear as leaves in the OR-block structure. Given a normalized B-twig, after all OR predicate branches have been replaced by their corresponding OR-blocks, the B-twig is said to be OR-block represented (and the process is called OR-block rerepresentation). In an OR-block represented B-twig, there may only be QNodes, NQNodes, and OR-blocks. For example, Fig. 6 shows an original B-twig with all three types of logical predicates; after normalization and OR-block rerepresentation, the resulting form of the same B-twig is shown in Fig. 7, which now consists of only QNodes, NQNodes, and OR-blocks (the nesting structure of the OR predicates is flattened).

We introduced OR-blocks to facilitate the evaluation of OR predicates. In the following, we define the notion of OR predicate evaluation in the context of normalized B-twigs utilizing the mechanism of OR-blocks.

Definition 4.6 (OR Predicate Evaluation). Given a normalized B-twig Q, let ONode n be the root of an OR predicate branch in Q and QNode q be the parent of n, where q is currently associated to element e. We say element e satisfies the OR predicate rooted at n, or ONodeTest(e, n) evaluates to true, if P(n) in the corresponding OR-block is true after replacing each QNode or NQNode n_i in P(n) with a respective Boolean function as follows: if n_i is a leaf QNode or a leaf NQNode, replace n_i with edgeTest(e, n_i) or nEdgeTest(e, n_i) accordingly; otherwise (n_i is a nonleaf QNode), replace n_i with the Boolean value of [edgeTest(e, n_i) AND (e_i has a match for subtree n_i)].
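To make Definitions 4.4 and 4.6 concrete, the following Python sketch evaluates them naively over small in-memory element lists. It is a declarative reading of the definitions only: it ignores streams, cursors, and skipping, so it is not the holistic algorithm. The tuple encoding of query nodes, the (start, end, level) region layout, and all function names are assumptions made for this illustration.

```python
from typing import Dict, List, NamedTuple

class Region(NamedTuple):
    start: int
    end: int
    level: int

# Hypothetical query encoding:
#   ("Q", tag, axis, [children])   QNode; axis is "/" (PC) or "//" (AD)
#   ("NQ", tag, axis)              NQNode, i.e., a negated leaf QNode
#   ("OR", [children]), ("AND", [children])   ONode and ANode

Data = Dict[str, List[Region]]

def edge_test(ep: Region, e: Region, axis: str) -> bool:
    """Definition 4.1: ep is an ancestor (AD) or the parent (PC) of e."""
    inside = ep.start < e.start and e.end < ep.end
    return inside and (axis == "//" or e.level == ep.level + 1)

def n_edge_test(ep: Region, nq, data: Data) -> bool:
    """Definition 4.2: no element of the negated QNode passes the edge test."""
    _, tag, axis = nq
    return all(not edge_test(ep, e, axis) for e in data.get(tag, []))

def q_test(ep: Region, q, data: Data) -> bool:
    """Some element of q passes the edge test and has a match for subtree q."""
    _, tag, axis, _children = q
    return any(edge_test(ep, e, axis) and has_match(e, q, data)
               for e in data.get(tag, []))

def formula(ep: Region, n, data: Data) -> bool:
    """Definition 4.6: evaluate P(n) with leaves replaced by the edge tests."""
    kind = n[0]
    if kind == "OR":
        return any(formula(ep, c, data) for c in n[1])
    if kind == "AND":
        return all(formula(ep, c, data) for c in n[1])
    if kind == "NQ":
        return n_edge_test(ep, n, data)
    return q_test(ep, n, data)

def has_match(e: Region, q, data: Data) -> bool:
    """Definition 4.4: every child subtree of q must be satisfied by e."""
    return all(formula(e, child, data) for child in q[3])

# Query a[b[c] OR NOT(d)] with PC edges below a.
query = ("Q", "a", "//", [("OR", [("Q", "b", "/", [("Q", "c", "/", [])]),
                                  ("NQ", "d", "/")])])
data = {"a": [Region(1, 20, 1), Region(21, 30, 1), Region(31, 40, 1)],
        "b": [Region(2, 5, 2)],
        "c": [Region(3, 4, 3)],
        "d": [Region(22, 23, 2)]}
matches = [e for e in data["a"] if has_match(e, query, data)]
```

In this toy document the first a element qualifies through its b/c branch, the second fails because it has a d child and no b child, and the third qualifies because it has no d child.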
The above definition embodies our evaluation strategy for OR and NOT predicates in normalized B-twigs, and function ONodeTest implements this strategy. Per Definition 4.6, ONodeTest needs support from two other important functions, edgeTest and nEdgeTest. The implementations of these three supporting functions are given in Figs. 8, 9, and 10, respectively. Together, these three functions provide the basic supporting mechanisms for our holistic B-twig join approach.

Fig. 8. Function edgeTest.

Fig. 10. Function ONodeTest.

As shown in Fig. 8, function edgeTest performs the edge test by checking the region containment relationship between two candidate elements. Our algorithm differentiates between an AD edge and a PC edge. The edge type information, or location step axis, is obtained from the content of the child query node. The main structure of function edgeTest (and of function nEdgeTest as well) is a while loop, which at first glance appears unnecessary but (at lines 8 and 9, see Fig. 8) brings an important optimization: fast skipping of noncontributing elements in stream T_q until the cursor moves over the range of the parent element e. (This fast skipping has the effect of instant performance optimization; otherwise those noncontributing elements would stay in their streams, causing extra iterations and consuming extra CPU time.) The implementation of function nEdgeTest relies on repeated calls to function edgeTest (see Fig. 9). The implementation of function ONodeTest (see Fig. 10) follows Definition 4.6 almost directly. It is based on edgeTest, nEdgeTest, and yet another function, hasExtension, which realizes Definition 4.4, as will be shown in Fig. 15.

Holistic twig joins typically disallow backtracking of stream cursors to guarantee linear time complexity. Notice that the evaluation of a NOT predicate in a B-twig requires disproving all elements in the negated stream (the stream associated to the query node negated by the NOT predicate). This seems to imply that we need to scan to the very end of the negated stream to disprove all elements, and then backtrack the stream cursor to get ready for evaluating the subsequent parent elements. In fact, such backtracking can be avoided. Let us illustrate this: assuming we are processing the B-twig subexpression X[NOT(Y)], and elements x_i and y_j are currently under the cursors of their respective streams, T_X and T_Y, there are two cases to consider: 1) y_j falls within the region of x_i, which immediately disproves x_i, and the evaluation returns immediately; 2) y_j is not in the range of x_i, which leads to two possible subcases: a) y_j is ahead of x_i, in which case we just advance cursor C_Y and start the next iteration to evaluate the next element following y_j; b) y_j falls behind x_i, which is enough to qualify x_i, since all subsequent elements after y_j (if any) can only be farther away from the coverage of x_i due to the sortedness of the elements in stream T_Y. In all cases, the cursor C_Y never advances beyond the range covered by x_i, and backtracking is never needed. The code in Fig. 9 embodies the idea discussed above.
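The case analysis for X[NOT(Y)] can be sketched in Python as follows. This is not the code of Fig. 9; it is a minimal illustration under stated assumptions: only AD edges are handled, region codes are reduced to (start, end) pairs with properly nested regions, streams are plain sorted lists, and the names Region, Cursor, and n_edge_test_ad are hypothetical.

```python
from dataclasses import dataclass
from typing import List, NamedTuple

class Region(NamedTuple):
    start: int
    end: int

@dataclass
class Cursor:
    pos: int = 0   # position of C_Y in stream T_Y; it only ever moves forward

def n_edge_test_ad(x: Region, t_y: List[Region], c_y: Cursor) -> bool:
    """Return True if no element of T_Y lies inside x (AD edge only)."""
    while c_y.pos < len(t_y):
        y = t_y[c_y.pos]
        if y.start < x.start:
            # Case 2a: y starts before x, so it cannot be inside x or inside
            # any later element of T_X; it is safe to skip it for good.
            c_y.pos += 1
        elif y.end < x.end:
            # Case 1: y lies within the region of x, so x is disproved.
            # The cursor stays put; y may also disprove a later, nested x.
            return False
        else:
            # Case 2b: y starts after x ends; every later y is even farther
            # away because T_Y is sorted, so x qualifies.
            return True
    return True    # T_Y is exhausted: nothing can disprove x

t_x = [Region(1, 10), Region(11, 20), Region(21, 30)]
t_y = [Region(2, 3), Region(25, 26)]
c_y = Cursor()
qualified = [x for x in t_x if n_edge_test_ad(x, t_y, c_y)]
```

The same cursor object is shared across all parent elements, which is what makes the absence of backtracking visible: the cursor position after the scan is never reset.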

5 A HOLISTIC B-TWIG JOIN ALGORITHM

With the supporting mechanisms set forth in the preceding sections, we now present our novel holistic B-twig join algorithm, BTwigMerge, in this section.

5.1 BTwigMerge: The Main Algorithm

The structure of our main algorithm, BTwigMerge, as shown in Fig. 11, is not much different from that of most other holistic twig join algorithms [5], [6], [7], [8], [9], [10], [11], [13]. It is a merge-based, two-phase algorithm. However, as we confront a different and more complex problem, that of B-twigs, which was not considered by previous holistic algorithms, the processing strategy of our BTwigMerge must be accordingly different. The difference mainly lies in the key supporting function, GetQNode (detailed in the next section). In addition to feeding the main algorithm the next query node to be processed, our GetQNode function thoroughly investigates the candidacy of the elements in the input streams and guarantees that, for the next QNode returned to the main algorithm, the current element in the associated stream is fully qualified, i.e., the element satisfies all relevant criteria (including all predicates and edge tests). Functionally, algorithm GTwigMerge [7] is the closest to our BTwigMerge, but at the main algorithm level BTwigMerge is more concise: with each valid query node q returned by GetQNode (the validity is checked at line 3, see Fig. 11), BTwigMerge cleans up the relevant stacks (lines 6 and 7), moves the element associated to q from stream to stack if q is not an output leaf (at lines 8 to 10), and otherwise (q is an output leaf) outputs the path solutions currently on the stacks (lines 9 and 10).

It is worth pointing out that BTwigMerge does not explicitly process OR or any other predicates at the main algorithm level (differing from GTwigMerge [7]). Instead, all critical processing logic is encapsulated in the key supporting function, GetQNode, and other lower level supporting functions. GetQNode also checks whether each involved PC or AD edge is satisfied by the stream head element associated to the q that is to be returned to the main algorithm as the next QNode for processing. Therefore, our BTwigMerge achieves matching optimality not only with AD edges but also with PC edges (this is in strong contrast to all previous holistic algorithms, including TwigStack [5] and GTwigMerge [7]).

Fig. 9. Function nEdgeTest.

Fig. 11. Main algorithm BTwigMerge.

Fig. 12. Function ORBlockMax.

Some important features of BTwigMerge are highlighted as follows:

1. BTwigMerge receives (from GetQNode) either a valid output QNode q or an invalid QNode, denoted by null. An invalid QNode is typically generated by GetQNode when a non-top-level recursive call into this function fails to find a QNode associated with a fully qualified element. But since noncontributing elements encountered during this process have been skipped, the main algorithm quickly jumps to its next iteration (at line 4) to start a new call to GetQNode for getting the next valid QNode.

2. No stacks are allocated for nonoutput QNodes, nor for any output leaves (QNodes), since the contributing elements corresponding to an output leaf can be directly taken from the associated stream for output.

3. GetQNode performs the specific edge test (PC or AD), which renders both I/O and CPU optimality for both AD and PC edges involved in a B-twig.

4. Stack cleaning is needed in BTwigMerge solely because, each time after outputting path solutions, some elements on the stacks may become irrelevant for future path solutions and must be cleaned out. (In most prior algorithms such as TwigStack, stack cleaning is required to get rid of elements that may have been tentatively added to the stacks but are actually noncontributing.)

5. BTwigMerge does not explicitly (at the main algorithm level) deal with any AND/OR/NOT predicates, nor with any nonoutput QNodes.
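The stack cleaning mentioned in the features above can be illustrated with a short Python sketch. The pop condition used here (discard stack entries whose region ends before the next element starts) is the one used by TwigStack's cleanStack [5]; whether Fig. 11 uses exactly this condition is an assumption, and the names Region, clean_stack, and on_one_path are hypothetical.

```python
from typing import List, NamedTuple

class Region(NamedTuple):
    start: int
    end: int

def clean_stack(stack: List[Region], act_start: int) -> None:
    """Pop entries that end before act_start; they cannot be ancestors of
    the next element and so are irrelevant for future path solutions."""
    while stack and stack[-1].end < act_start:
        stack.pop()

def on_one_path(stack: List[Region]) -> bool:
    """Check that the entries, bottom to top, are nested (a root-to-leaf path)."""
    return all(a.start < b.start and b.end < a.end
               for a, b in zip(stack, stack[1:]))

stack = [Region(1, 100), Region(2, 30), Region(3, 10)]
next_element = Region(40, 50)
clean_stack(stack, next_element.start)   # drops (3, 10) and (2, 30)
stack.append(next_element)
```

After cleaning and pushing, the stack again satisfies the root-to-leaf path property stated for S_q in Section 3.4.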

5.2 GetQNode: The Key Supporting Algorithm

GetQNode is an essential subroutine called by the main algorithm BTwigMerge to decide the next QNode for processing. It is GetQNode that guarantees that the stream head element associated to the returned QNode is part of the final output, since all the relevant predicates (if any) are thoroughly checked by GetQNode or its lower level primitive subroutines such as edgeTest, nEdgeTest, ONodeTest, and hasExtension. While feeding BTwigMerge with the next QNode to be processed, some elements in the stream under consideration may be found to be noncontributing to the final answer and should thus be skipped right away. The term largest threshold value, introduced by Jiang et al. [7], refers to the start label of a subelement e_max of another element, say e, such that e_max maximizes the start label among all the offspring elements of e. Such a threshold value can be used to skip e and all its successors if their end labels are smaller than this threshold value. It still makes sense to carry out this type of optimization for B-twig joins, but we need to redefine the mechanism to fit the particular needs of B-twigs. The largest threshold value is computed by a special supporting function, called ORBlockMax(n), in [7]. We extend this function for our purpose as shown in Fig. 12, which conforms to our revised notion of OR-blocks. Understanding the structural features of OR-blocks in normalized B-twigs is the key to understanding how our ORBlockMax function works. This algorithm traverses the structure of an OR-block and computes the maximum threshold value to help effectively skip disqualified elements in the parent stream. Line 1 initializes the variable q0 to a special (imaginary) query node, denoted by 0, which is always associated to a special (imaginary) element identified by the region code (0, 0, 0). When the input node is a NQNode, line 3 returns this special query node 0 (associated to the imaginary element (0, 0, 0)).
Variable q0 is reinitialized at line 8 to n and is used at line 16 when choosing q_max from all the QNodes q_i under consideration such that q_max gives the maximal start value. At line 13, function arg min_{q_i} {e_i.start} selects q_min from all the QNodes q_i under consideration such that q_min has the minimal start value. Notice that at this point (line 13), the imaginary element with region code (0, 0, 0) is excluded because all NQNodes are irrelevant to the purpose of function ORBlockMax, i.e., to help skip disqualified elements in the parent stream.

Fig. 13. Function GetQNode.

The implementation of function GetQNode is shown in Fig. 13. The QNode q_x returned by GetQNode(q) falls into one of the following two cases: 1) q_x = null (here null denotes an invalid query node), signaling the main algorithm to immediately start another call to GetQNode for quickly getting the next valid QNode if the streams are not yet exhausted; 2) q_x is a valid output QNode, which is the dominating case and is handled as in all other holistic twig join algorithms. Compared with GTwigMerge [7], the holistic join algorithm most closely related to BTwigMerge, the structure of our main algorithm (Fig. 11) is more succinct: we pushed all important tests, including AD and PC edge tests and tests on any AND/OR/NOT predicate, down to the core subroutine, GetQNode, or its lower level primitive supporting functions. The advantage is early skipping of disqualified elements in streams, leading to improved algorithm performance.

In subroutine GetQNode(q), the information provided by getMaxQChild(q) (line 8 in Fig. 13) is used to skip disqualified elements in stream T_q. Unlike its counterpart in GTwigMerge [7], our getMaxQChild(q) (see Fig. 14) additionally considers NQNodes, which do not exist in the simpler AND/OR-twigs that GTwigMerge was designed for. Another critical supporting function, hasExtension (see Fig. 15), implements our definition of a match for a B-twig (Definition 4.4). The hasExtension function in turn calls three other supporting functions, ONodeTest, nEdgeTest, and edgeTest (see Section 4), of which the nEdgeTest function is designed for dealing with NQNodes and is unique to our BTwigMerge algorithm.

Fig. 15. Function hasExtension.

Proposition 5.1. Algorithm BTwigMerge is optimal with both AD edges and PC edges.

Proof. The optimality of BTwigMerge is guaranteed by the key subroutine GetQNode, which provides QNodes with fully qualified stream elements (at lines 15 and 20, Fig. 13) to BTwigMerge, while function hasExtension is called beforehand (at lines 14 and 19, Fig. 13) to check the qualification w.r.t. the specifics of PC edges and AD edges. Therefore, noncontributing elements never get pushed onto a stack, and unused path solutions are never produced. Hence BTwigMerge remains optimal whether the edges are AD edges or PC edges. □

5.3 Cost Analysis of BTwigMerge
We now analyze the I/O and CPU cost of our algorithm BTwigMerge. For ease of presentation, given a B-twig query Q, we first introduce the following parameters:

• $|QNodes|$ is the total number of QNodes in Q.
• $|NQNodes|$ is the total number of NQNodes in Q.
• Query size $|Q| = |QNodes| + |NQNodes|$. Notice that we do not count other logical predicate nodes toward the query size.
• $|Input|$ stands for the total size of all the input streams relevant to query Q.
• $|List|$ stands for the average stream length.
• $|Output|$ stands for the total count of the data elements included in all output B-twig instances produced for query Q.

Fig. 14. Function getMaxQChild.

In terms of the set of twig patterns that can be processed, BTwigMerge is a superset of GTwigMerge, and GTwigMerge is a superset of TwigStack. At the main algorithm level, the three algorithms share great similarity, and so do the methods for analyzing their costs. In the following, we therefore provide only a compact analysis of the I/O and CPU cost of BTwigMerge.

The I/O cost of BTwigMerge consists of three parts: the I/O cost for accessing all the relevant input stream elements, the I/O cost for dealing with the intermediate path solutions, and the I/O cost for outputting the final twig solutions. Since BTwigMerge always advances the stream cursors and never backtracks, the first part of the I/O cost is the total size of all relevant input streams. For the second part, since BTwigMerge is optimal with both AD and PC edges (i.e., it never produces useless intermediate path solutions), the I/O cost is two times the total final output size (once for writing and once for reading the path solutions), i.e., $2 \times |Output|$. The third part, for outputting the final results, is $|Output|$. The total I/O cost of BTwigMerge is the sum of these three parts, which gives the following equations:

$$\text{I/O cost} = (|QNodes| + |NQNodes|) \times |List| + 3 \times |Output| = |Q| \times |List| + 3 \times |Output| = |Input| + 3 \times |Output|.$$

The CPU cost analysis for BTwigMerge is analogous. The CPU cost also consists of three parts. The first part is the time spent on computing the path solutions, the second part is the time spent on dealing with the obtained intermediate path solutions (output, input, and merging), and the third part is the time spent on outputting the final twig solutions. The main structure of BTwigMerge is a loop that repeats no more than $|Input|$ times, which is the total number of elements in all the input streams, because noncontributing elements are skipped at lines 10, 17, and 22 of GetQNode (see Fig. 13) or by the optimization rendered by the two primitive functions, edgeTest and nEdgeTest (see Figs. 8 and 9, respectively). So the first part of the CPU cost is linear in the input size.
The second part depends on how many intermediate path solutions are produced and how many of them are going to be merged to form the final output twig solutions. As BTwigMerge does not produce any unused intermediate path solutions (it actually does not push any noncontributing elements onto any stack), the second part of the cost is linear in, and solely decided by, the output size $|Output|$. The third part is, of course, also linear in the output size. Added together, the overall CPU cost of BTwigMerge has exactly the same form as the I/O cost (cost equations omitted). It is worth pointing out that the query size $|Q|$ is counted slightly differently in the CPU cost than in the I/O cost: for the former, $|Q|$ counts the duplicated query nodes caused by normalization, but for the latter it does not, because duplicate query nodes do not cause extra physical I/O. The above cost analysis shows that BTwigMerge has both optimal I/O cost and optimal CPU cost for normalized B-twigs with both AD and PC edges. Our experimental study provides empirical evidence to further support this conclusion.
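The I/O cost model of this section can be made concrete with a small worked example. The sketch below simply evaluates the three parts of the cost (reading the input streams, writing and re-reading the path solutions, and writing the final output); the function name `io_cost` and the stream sizes are hypothetical and chosen only for illustration.

```python
from typing import Dict


def io_cost(stream_lengths: Dict[str, int], output_size: int) -> int:
    """Evaluate |Input| + 3 * |Output| for the given stream sizes.

    stream_lengths maps each distinct query node to the length of its input
    stream; duplicate nodes introduced by normalization are not listed,
    since they share the stream buffer of the original node.
    """
    input_size = sum(stream_lengths.values())   # each stream is scanned once
    path_solution_io = 2 * output_size          # path solutions written, then read
    final_output_io = output_size               # final twig solutions written
    return input_size + path_solution_io + final_output_io


# Hypothetical query with three stream-backed nodes and 100 output elements.
streams = {"a": 1000, "b": 500, "c": 250}
print(io_cost(streams, 100))  # 1750 + 3 * 100 = 2050
```

Since the average stream length is the total input size divided by the number of streams, the product of query size and average stream length in the cost equation equals the total input size, which is what the function sums directly.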

6 EXPERIMENTS

In this section, we present the experiment results. As our BTwigMerge is the only algorithm of its kind, designed for holistic B-twig pattern matching, there is no real competitor to compare with. In this case, one plausible baseline is a decomposition-based approach. As discussed in Section 3.3, a decomposition-based approach typically first splits an input B-twig at every predicate node into a series of subtwigs, then separately computes the partial solutions to the subtwigs, and finally combines the obtained partial solutions to form the whole solutions to the original B-twig. Such an approach suffers a severe performance disadvantage, which Jiang et al. [7] empirically demonstrated years ago with a subclass of B-twigs. For more general B-twigs, the problem of the decomposition-based approach can only become worse. We therefore have no intention of empirically reproving the conclusion of Jiang et al. [7] at the scale of full B-twigs; instead, we comparatively study the performance of our algorithm and other related algorithms on various common subclasses of B-twigs.

As the first holistic twig join algorithm, TwigStack [5] is designed for simple AND-only twigs. In terms of the categories of twigs being processed, GTwigMerge [7] generalizes TwigStack and is a superset of it, capable of handling AND/OR-twigs; TwigStackList¬ [13] also generalizes TwigStack [5], but in a different direction, and is thus a superset of TwigStack as well, capable of handling AND/NOT-twigs; our BTwigMerge significantly extends the approach embodied in GTwigMerge and becomes a superset of both GTwigMerge and TwigStackList¬, capable of handling full B-twigs, i.e., AND/OR/NOT-twigs. The theme of our experimental study is thus to compare BTwigMerge with each of these predecessor algorithms on a common subset of twig queries that all of the compared algorithms can handle.

6.1 Experimental Setup
Before proceeding to the details of our experimental study, we first address a few related issues.

Platform setup. The platform of our experiments is an Intel Core 2 Duo 2.2 GHz machine running Windows XP with 4 GB of memory and a 75 GB hard disk. Java SE is the software platform on which the algorithms are implemented and tested. The data sets used for this study are kept as external files on the hard disk. JDK 1.6 (with DOM) is used to access the XML data elements in the data sets, and Apache Xerces (v2.9) is the XML parser adopted for this study. Each set of test queries is kept in an external query file, which is parsed and transformed into in-memory query trees before being sent for execution. As a convenient platform, JUnit 1.4 was used for concise timing of the algorithms on the test queries.

Data preparation. To avoid the potential bias of using a single data set, we chose three popular XML data sets for this study. The first data set is an XMark data set [3] stored in a single XML file. This data set takes roughly 100 MB and contains about 100 thousand elements (or nodes). The second data set was generated by Stylus XML Generator [1] from a given XML Schema. Stylus XML Generator allows users to specify the expected structure and size of the XML data via separate XML Schema files. For this purpose, we carefully designed an XML schema with varied tree structures to avoid biased results. The adopted schema simulates various online shopping orders, and the generated data set takes about 120 MB, including more than 10,000 order elements. The data tree is not very deep, but the size of each order is fairly big. The third data set is the TreeBank data set, downloaded from the University of Washington XML Repository website [2]. The XML document of the TreeBank data set is deep and has many recursions in its structure. This data set takes 82 MB and consists of 2.4 million data nodes. The average depth of TreeBank is 7.8 and the maximum depth is 36. Preprocessing is performed beforehand to obtain the region code label for each data element and to produce the input streams needed by all the algorithms. Because the experiment results with the three data sets are qualitatively similar, we present only the results obtained with the TreeBank data set in this paper.

TABLE 1 Test Queries

Fig. 16. Performance of AND-only twigs (AD version).

Performance metric. This study focuses on the key performance metric, CPU cost. The reason why a holistic approach can outperform a decomposition-based nonholistic approach (e.g., through edge-by-edge structural joins) by more than an order of magnitude is that the former avoids generating unused intermediate results, which must also be output and input again due to the two-phase nature of the overall approach to holistic twig joins. The CPU cost and I/O cost are correlated for all known holistic twig join algorithms. For BTwigMerge, they are both optimal and linear in the input and output size. So, for simplicity, in this paper we look only at the CPU cost of the tested algorithms.
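The preprocessing step mentioned under data preparation, which assigns a region code to every element and groups elements into per-tag input streams, can be sketched as follows. This is a minimal illustration of the standard (start, end, level) region encoding, not the authors' preprocessing code; the tree representation, the names `Region`, `build_streams`, `is_ancestor`, and `is_parent`, and the TreeBank-like tags are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Tree = Tuple[str, list]  # (tag, list of child trees); assumed toy representation


@dataclass(frozen=True)
class Region:
    start: int
    end: int
    level: int


def build_streams(root: Tree) -> Dict[str, List[Region]]:
    """Label each element by a depth-first walk and group labels by tag."""
    streams: Dict[str, List[Region]] = defaultdict(list)
    counter = 0

    def visit(node: Tree, level: int) -> None:
        nonlocal counter
        tag, children = node
        counter += 1
        start = counter                  # position of the opening tag
        for child in children:
            visit(child, level + 1)
        counter += 1                     # position of the closing tag
        streams[tag].append(Region(start, counter, level))

    visit(root, 1)
    for regions in streams.values():
        regions.sort(key=lambda r: r.start)  # streams are ordered by start
    return dict(streams)


def is_ancestor(a: Region, d: Region) -> bool:
    """AD edge test: the region of a strictly contains the region of d."""
    return a.start < d.start and d.end < a.end


def is_parent(p: Region, c: Region) -> bool:
    """PC edge test: containment plus a level difference of exactly one."""
    return is_ancestor(p, c) and c.level == p.level + 1


doc = ("S", [("NP", [("NN", [])]), ("VP", [("NP", [])])])
streams = build_streams(doc)
s = streams["S"][0]
print(is_ancestor(s, streams["NN"][0]), is_parent(s, streams["NN"][0]))
```

The two predicates show why a PC edge is strictly more restrictive than an AD edge: every parent is an ancestor, but only ancestors one level above qualify as parents.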

6.2 Experiment Results
We have done four parallel experiments with the TreeBank data set, each aimed at a distinctive subclass of twigs: Experiment 1 tests all four related algorithms on AND-only twigs. Experiment 2 tests GTwigMerge and BTwigMerge on AND/OR-twigs. Experiment 3 tests TwigStackList¬ and BTwigMerge on AND/NOT-twigs. Experiment 4 tests BTwigMerge on AND/OR/NOT-twigs, as BTwigMerge is the only algorithm that can be called upon in this case. Accordingly, we have four sets of specially designed queries: AND-only twigs, AND/OR-twigs, AND/NOT-twigs, and AND/OR/NOT-twigs. As no standard B-twig queries are available for our test, we designed our test queries with balanced consideration of the following factors: depth, width, skewness, balance, and the mixture of relevant logical predicates. Besides, BTwigMerge is the only algorithm in our experimental study that extends matching optimality to PC edges in a twig query. In order to empirically confirm this advantage of BTwigMerge over the other algorithms, our test queries come in two versions: an AD version (containing only AD edges) and a PC version (with all AD edges replaced by PC edges). The AD versions of the four sets of test queries are shown in Table 1. The PC versions can be trivially derived from the corresponding AD versions by replacing the edges accordingly, and are thus omitted.

6.2.1 Experiment 1: AND-Only Twig Queries
The performance result of this set of queries (AD version) is plotted in Fig. 16, and the result with the PC counterparts is shown in Fig. 17 (notice that in Figs. 16, 17, 20, and 21, algorithm TwigStackList¬ is denoted by TwigStackListN). With the AD version of this set of simple AND-only queries, all four algorithms show highly comparable performance. BTwigMerge slightly outperforms the others because of an extra optimization, namely skipping over noncontributing stream elements at an earlier stage via the primitive functions edgeTest and nEdgeTest, which was not explored by any other algorithm. With the PC version of this set of queries, we observe that BTwigMerge consistently and significantly outperforms the other algorithms due to the matching optimality that BTwigMerge extends to PC edges. We also observe that with query Q1, BTwigMerge does not outperform the other algorithms so obviously. We traced the details of the evaluation of Q1 and found that in this particular case all algorithms actually produced very few intermediate results because the PC edges in Q1 are very strong conditions, making the other three algorithms appear to perform better (but still not as well as BTwigMerge).

Fig. 17. Performance of AND-only twigs (PC version).

6.2.2 Experiment 2: AND/OR-Twig Queries
With AND/OR-twig queries, our experiment is limited to BTwigMerge and GTwigMerge only. The performance result with the AD version of these queries is plotted in Fig. 18, and that with the PC version is shown in Fig. 19. Overall, BTwigMerge moderately outperforms GTwigMerge (except for Q8, see Fig. 18) because of the aforementioned optimization that only BTwigMerge possesses. For the PC versions of these queries, BTwigMerge is consistently the winner, and for some queries (Q7 and Q10) it even outperforms GTwigMerge by more than an order of magnitude, due to the optimality with PC edges that BTwigMerge has and GTwigMerge lacks. In the cases of Q7 and Q10, GTwigMerge produced 131,133 and 151,128 unused intermediate path solutions, respectively, while for the other queries (Q6, Q8, Q9) it produced relatively fewer unused intermediate path solutions. This accounts for the worst performance degradation of GTwigMerge, which happened on the PC versions of Q7 and Q10.

Fig. 18. Performance of AND/OR-twigs (AD version).

Fig. 19. Performance of AND/OR-twigs (PC version).

6.2.3 Experiment 3: AND/NOT-Twig Queries
With AND/NOT-twig queries, our experiment is limited to BTwigMerge and TwigStackList¬ only. The performance result with the AD version of these queries is plotted in Fig. 20, and that with the PC version is shown in Fig. 21. For BTwigMerge and TwigStackList¬ on AND/NOT-twig queries, we reach a conclusion generally similar to the one drawn from the comparison between BTwigMerge and GTwigMerge on AND/OR-twig queries. For the AD version of this set of test queries, we see (in Fig. 20) a consistent performance improvement of BTwigMerge over TwigStackList¬ on all five queries, owing to the optimization pursued only by BTwigMerge. For the PC version of this set of queries, the outperformance ratio of BTwigMerge over TwigStackList¬ ranges from 1.15 (with Q13) to 15.21 (with Q12). The performance degradation of TwigStackList¬, per our understanding, is partially because of its suboptimality with PC edges and more likely due to the high cost of caching every element into a list structure during processing. As a matter of fact, TwigStackList¬ did not produce any unused intermediate path solutions with Q12 and Q14 (this confirms that the algorithm indeed extends I/O optimality to a larger subclass of twigs than the canonical TwigStack). However, the in-memory cost of caching and handling the stream elements in a list structure gets amplified with Q12 and Q14.

Fig. 20. Performance of AND/NOT-twigs (AD version).

Fig. 21. Performance of AND/NOT-twigs (PC version).

6.2.4 Experiment 4: AND/OR/NOT-Twig Queries
For AND/OR/NOT-twig queries, BTwigMerge is the only algorithm that is designed for them and actually works. The performance results for both the AD and PC versions of the tested queries (see Table 1) are plotted together in Fig. 22.

Fig. 22. Performance of AND/OR/NOT-twigs (AD & PC versions).


Generally, as PC edges are more restrictive than AD edges, when all AD edges in a B-twig query are replaced by corresponding PC edges, far fewer matches will be found in a given database. Because BTwigMerge extends its matching optimality to PC edges as well, meaning that it produces no unused intermediate path solutions, we expect an obvious improvement in execution time. Indeed, Fig. 22 shows that the PC versions of these queries are faster than their AD counterparts, except for Q19. The slowdown of Q19 after all AD edges are replaced by PC edges seems to contradict, but actually supports, the above reasoning. This apparent contradiction is explained as follows: the output model of Q19 is a single node, the root node S, which is restricted by a NOT predicate that introduces a predicate subtree rooted at the NOT predicate node. After all AD edges are replaced by PC edges, the condition represented by the branch below the NOT predicate node becomes more restrictive, as a PC edge is more specific than an AD edge; however, the negation of this condition becomes less restrictive on the node above, i.e., the output node S. Consequently, the PC version of Q19 produces more matches, which explains why more time is consumed.

Remarks.

1. To make our comparison with other algorithms fair, the normalization time spent on all tested queries is counted into the total execution time of BTwigMerge (except for Q1 to Q5, which do not need normalization), although the normalization time is just a negligible fraction. Of these 20 test queries, Q20 took the longest for normalization, 49 milliseconds.

2. In the four sets of experiments we have done, all (intermediate) path solutions were kept in memory. However, for larger (or real-world) databases, the intermediate result sets are expected to be much bigger, and thus must be written to and later read back from disk. This means that BTwigMerge is expected to further outperform the other three algorithms.

3. Given an input B-twig, normalization may increase the query size, which, however, has no real consequence. First, as pointed out in Section 3.3, the added query nodes are all duplicate nodes and do not incur any extra I/O (as each duplicate node shares the same in-memory buffer of the input stream as the original query node). Second, normalization has a positive effect on evaluation performance: since OR nodes are pushed up and sit high in a normalized twig, evaluation of an OR predicate can return immediately once one branch evaluates to true. In conclusion, B-twig normalization pays off its cost well, and in return it facilitates the design and implementation of an optimal, holistic B-twig join algorithm, BTwigMerge.

4. We also ran a scalability test with all four algorithms. All showed basically linear scalability. BTwigMerge stands out with its sublinear scalability (e.g., when the data set scales to double its size, query execution time increases by 80 percent on average in the current setting of our experiments). The scalability is sublinear (better than linear) because, when the data set scales up, noncontributing data elements do not consume proportionally more CPU time, due to the various optimization means in our algorithms (basically, quickly getting rid of the noncontributing items). For the sake of space, further details are omitted.

To wrap up, through experiments we have demonstrated the validity of BTwigMerge and its outperformance over other related algorithms on various common subsets of B-twig queries that all of the compared algorithms are able to process. The outperformance comes primarily from two sources: the various optimization mechanisms (pursued in edgeTest and nEdgeTest) and the matching optimality on PC edges that only BTwigMerge possesses. Besides, our overall approach works with arbitrary B-twigs, which no other approach is designed for.
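The Q19 observation, that tightening an edge under a NOT predicate can increase the number of matches, can be reproduced on a toy document. The sketch below is only an illustration of that monotonicity argument; the document, the tag names S, X, and Y, and the helper functions are hypothetical and unrelated to the actual Q19 of Table 1.

```python
from typing import List, Tuple

Tree = Tuple[str, list]  # (tag, list of child trees); assumed toy representation


def has_child(node: Tree, tag: str) -> bool:
    """PC condition: some direct child carries the tag."""
    return any(child[0] == tag for child in node[1])


def has_descendant(node: Tree, tag: str) -> bool:
    """AD condition: some node at any depth below carries the tag."""
    return any(child[0] == tag or has_descendant(child, tag) for child in node[1])


def nodes_with_tag(root: Tree, tag: str) -> List[Tree]:
    """Collect every node in the tree that carries the given tag."""
    found = [root] if root[0] == tag else []
    for child in root[1]:
        found.extend(nodes_with_tag(child, tag))
    return found


doc = ("ROOT", [
    ("S", [("X", [])]),            # X is a direct child
    ("S", [("Y", [("X", [])])]),   # X is only a deeper descendant
    ("S", [("Y", [])]),            # no X at all
])

s_nodes = nodes_with_tag(doc, "S")
ad_matches = [s for s in s_nodes if not has_descendant(s, "X")]  # S with NOT(descendant X)
pc_matches = [s for s in s_nodes if not has_child(s, "X")]       # S with NOT(child X)
print(len(ad_matches), len(pc_matches))  # 1 2
```

The second S element fails the AD version of the negated condition but passes the PC version, so the PC query returns more output nodes and correspondingly has more work to do.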

7 SUMMARY


Holistic twig joins are critical operations for XML queries. The three basic logical predicates, AND, OR, and NOT, are natural expression mechanisms that people would want to apply to general XML queries. However, all previously proposed holistic twig join algorithms failed to provide an integral solution for efficient and uniform processing of B-twig queries (with arbitrary combinations of these logical predicates) in a single algorithmic framework. Consequently, given a general B-twig query, all prior holistic algorithms become inapplicable. In this paper, we presented a novel approach for holistic computing of B-twig patterns and described an original algorithm, called BTwigMerge, which is the first of its kind: it holistically computes the more general class of twig patterns represented by B-twigs. The second distinctive feature of BTwigMerge is that it gracefully extends the I/O and CPU optimality to twigs with PC edges as well.

In order to reduce the intrinsic complexity of arbitrary B-twigs, we proposed B-twig normalization, which sorts out the arbitrary combination of the logical predicates in B-twigs. We designed a valid procedure to automatically transform input B-twigs into normalized forms. The normalized B-twigs are then sent to BTwigMerge, which embodies our holistic twig join strategy and contains numerous novel supporting mechanisms. We have conducted analytical and experimental studies of the validity and performance of our approach and its accompanying algorithms, and concluded that BTwigMerge is so far the most powerful and most efficient holistic twig join algorithm: it is the sole one designed for B-twigs, with optimal I/O and optimal CPU cost on twigs with arbitrary AD and/or PC edges.

As future work, the following is on our agenda:

• Minimize the normalization needed at the preprocessing step of our approach in two respects: 1) relax the strictness of the normal form definition for B-twigs, e.g., by accepting both DNF- and CNF-based normal B-twigs; 2) identify interesting, nonnormative B-twig patterns that lead to a big expansion ratio and yet can be handled without first being normalized. Accordingly, generalize our algorithms and supporting mechanisms.
• Investigate the potential incorporation of new label encoding schemes, such as Extended Dewey [10] and the path partitioned encoding scheme [12], into the framework of BTwigMerge. Such advanced encoding schemes have the potential of deriving the labels of ancestor elements without physically accessing them, and thus may improve the overall performance of holistic B-twig pattern matching.
• Study novel indexing mechanisms that can fit in the framework of our approach for more effectively skipping disqualified elements (reducing I/O cost).
• Our current study has not yet addressed XML queries that involve IDREF and the anonymous node *. Incorporating support for IDREF and the wildcard * into the holistic B-twig computing framework presented in this paper is another interesting topic to study in the future.

ACKNOWLEDGMENTS
The authors would like to thank Haijia Zhou, a former graduate student at SIUC (Southern Illinois University Carbondale), and Dabin Ding, a current PhD student at SIUC. Haijia Zhou helped build the initial testbed for this study, and Dabin Ding updated it recently and completed the experiments reported in this paper.

REFERENCES
[1] Stylus Studio XML Generator, http://www.stylusstudio.com/xml_generator.html, 2012.
[2] Univ. of Washington XML Repository, http://www.cs.washington.edu/research/xmldatasets/, 2012.
[3] XMark - An XML Benchmark Project, http://www.xmlbenchmark.org/, 2012.
[4] S. Al-Khalifa et al., "Structural Joins: A Primitive for Efficient XML Query Pattern Matching," Proc. 18th Int'l Conf. Data Eng. (ICDE '02), pp. 141-152, 2002.
[5] N. Bruno, N. Koudas, and D. Srivastava, "Holistic Twig Joins: Optimal XML Pattern Matching," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '02), pp. 310-321, June 2002.
[6] T. Chen, J. Lu, and T.W. Ling, "On Boosting Holism in XML Twig Pattern Matching Using Structural Indexing Techniques," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '05), pp. 455-466, June 2005.
[7] H. Jiang, H. Lu, and W. Wang, "Efficient Processing of Twig Queries with OR-Predicates," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '04), pp. 59-70, 2004.
[8] H. Jiang, W. Wang, H. Lu, and J.X. Yu, "Holistic Twig Joins on Indexed XML Documents," Proc. 29th Int'l Conf. Very Large Data Bases (VLDB '03), pp. 273-284, Sept. 2003.
[9] J. Lu, T. Chen, and T.W. Ling, "Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach," Proc. 13th ACM Int'l Conf. Information and Knowledge Management (CIKM '04), pp. 533-542, Nov. 2004.
[10] J. Lu, T.W. Ling, C.-Y. Chan, and T. Chen, "From Region Encoding to Extended Dewey: On Efficient Processing of XML Twig Pattern Matching," Proc. 31st Int'l Conf. Very Large Data Bases (VLDB '05), pp. 193-204, Aug. 2005.
[11] J. Lu et al., "Efficient Processing of Ordered XML Twig Pattern," Proc. 16th Int'l Conf. Database and Expert Systems Applications (DEXA '05), pp. 300-309, 2005.
[12] X. Xu, Y. Feng, and F. Wang, "Efficient Processing of XML Twig Queries with All Predicates," Proc. IEEE/ACIS Int'l Conf. Computer and Information Science (ICIS '09), pp. 457-462, June 2009.
[13] T. Yu, T.W. Ling, and J. Lu, "TwigStackList¬: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data," Proc. 11th Int'l Conf. Database Systems for Advanced Applications (DASFAA '06), pp. 249-263, 2006.
[14] C. Zhang et al., "On Supporting Containment Queries in Relational Database Management Systems," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '01), pp. 425-436, May 2001.

Dunren Che received the PhD degree in computer science and engineering from Beihang University (formerly known as BUAA, Beijing University of Aeronautics and Astronautics), China, in 1994. He is currently an associate professor of computer science at Southern Illinois University Carbondale. His research interests include databases (especially query processing and optimization, with a recent focus on XML queries), cloud computing, data mining, and bioinformatics. He has produced nearly 90 peer-reviewed research papers.

Tok Wang Ling received the PhD degree in computer science from the University of Waterloo, Canada. He is a professor in the Department of Computer Science at the National University of Singapore. His research interests include data modeling, the ER approach, normalization theory, the semistructured data model, XML query processing, and XML keyword query processing. He has published more than 180 papers, coauthored a book, coedited a book, and coedited 9 conference proceedings. He is a senior member of the IEEE, an ACM distinguished scientist, and an ER fellow.

Wen-Chi Hou received the MS and PhD degrees in computer science and engineering from Case Western Reserve University, Cleveland, Ohio, in 1985 and 1989, respectively. He is currently a professor of computer science at Southern Illinois University Carbondale. His research interests include statistical databases, mobile databases, XML databases, data stream processing, and query optimization.
