
A comparison of labeling schemes for ancestor queries


Haim Kaplan 

Tova Milo 

Ronen Shabo 

July 29, 2001

Abstract
XML documents are often viewed as trees (basically the parse tree of the document), and queries
over such documents typically test for ancestor relationships among tree nodes. Search engines process
such queries using an index structure summarizing the ancestor relations. In the index, each document
item (tree node) is identified using some logical id (node label), such that, given two labels, the engine
can determine the ancestor relationship between the corresponding nodes, without accessing the actual
document. The length of the labels is a main factor of the index size. Therefore, reducing this length,
even by a constant factor, is a critical issue.
Labelings currently being used by actual systems are all variants of the following interval scheme:
number the leaves from left to right and label each node with a pair consisting of the numbers of its
smallest and largest leaf descendants. An ancestor test then amounts to an interval containment test
on the labels. The maximum label length with this scheme is $2 \log n$, where $n$ is the number of nodes
in the tree. A considerable amount of theoretical research has been devoted recently to reducing the
worst case bounds on the label length, with the current best scheme producing labels with maximum
length of about $\log n + O(\sqrt{\log n})$. In contrast, we focus here on finding a scheme that works best
in practice on real XML data. For that we propose several new prefix-based labeling schemes, where
an ancestor query roughly amounts to testing whether one label is a prefix of the other. We analyze
our new schemes both theoretically and empirically, comparing their performance to that of previously
suggested schemes. Our experimental study shows that prefix-based schemes considerably improve the
space consumption on real XML data.
The problem with the current theoretically best labeling schemes is that the constant multiplying
the additive $\sqrt{\log n}$ is large and makes this term the dominant one when $n$ is the current average size of
an XML file. Nevertheless, to evaluate the practicality of the approach we have also implemented and
tested two simplified versions with smaller additive factors (at the price of inferior asymptotic bounds).
Our study shows that even this smaller factor is still dominating and the schemes did not perform on
XML data as well as our new prefix-based schemes. To obtain a better grasp of the additive constants
we also tested the performance on some families of random trees (rather than just XML trees). Our
experiments show that these schemes become competitive on trees with at least 20K nodes.

School of Computer Science, Faculty of Exact Sciences, Tel-Aviv University, Tel Aviv 69978, Israel.
E-mail: {haimk,milo,ronens}@post.tau.ac.il.

1 Introduction
The Web is constantly growing and contains a huge amount of useful information. To retrieve such
data, people typically use search engines like Altavista [6] or Google [9] which provide full-text indexing
services (the user gives a few words and the engine returns documents containing those words). The
newly emerging XML Web standard [23] allows more sophisticated queries of documents. XML makes it possible to
describe the semantic nature of the document components, enabling users not only to ask full-text queries
(e.g. find documents containing the word "Fielding") but also to utilize the document structure to ask for
more specific data (e.g. find all book items containing "Fielding" as an author and a price less than 12$)
[12, 24, 1, 10, 3]. The key observation guiding the design of a search engine that supports structural
queries is that an XML document can be viewed as a tree whose nodes are the document items and
whose edges correspond to the component-of relationship among data items. With this view, a structural
query amounts to finding nodes with particular tags (book, price, author, etc.) having certain ancestor
relationships between them (e.g. book nodes that are ancestors of qualifying author and price nodes).
The heart of a typical search engine [26, 25] is a big hash table whose entries are the item tag names
and the words of the indexed documents. For each such tag or word the table contains the identifiers of
all the documents containing it. To allow structural queries one adds to each such document identifier an
additional label associated with the particular node which contains the item within the document (tree).
These labels are given such that one can decide whether one node is an ancestor of the other based on the
labels of the two nodes alone. Thus structural queries can be answered by using the index only, without
access to the actual document.
To allow for good performance it is essential that the index structure (or at least a large part of it)
resides in main memory. Observe that we are talking here about an extremely large memory size. Just
to give a rough measure, it is estimated that the current number of Web pages is around one billion
and will grow to 100 billion by 2002 [11]. Search engines typically index 15% of these pages, consuming
around 0.3 of the original document size (even when only the text with the document ids is indexed,
not including separation to items or ancestor information). With 10KB of text per page on average,
this already leads to about 500 gigabytes for now and 50,000 gigabytes for 2002. Since the length of the node labels is
a main factor of the index size, reducing this, even by a constant factor, is a critical issue, contributing
directly to hardware cost reduction and performance improvement. Therefore our goal in this research is
to design a compact labeling scheme for the nodes of a tree such that given the labels of two nodes one
can determine whether one is the ancestor of the other.
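The rough index-size estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name `index_size_gb` and its parameter names are hypothetical, and a gigabyte is taken to be 10^9 bytes. With the figures quoted in the text it gives roughly 450 and 45,000 gigabytes, which the text rounds to 500 and 50,000.

```python
def index_size_gb(pages, indexed_fraction=0.15, text_bytes_per_page=10_000, index_ratio=0.3):
    """Rough text-index size in gigabytes (1 GB taken as 10**9 bytes).

    All defaults are the figures quoted in the text: 15% of pages indexed,
    10KB of text per page, index size about 0.3 of the indexed text.
    """
    indexed_text_bytes = pages * indexed_fraction * text_bytes_per_page
    return indexed_text_bytes * index_ratio / 1e9


now = index_size_gb(1e9)        # about one billion pages today
later = index_size_gb(100e9)    # projected 100 billion pages
print(f"now: about {now:.0f} GB, projected: about {later:.0f} GB")
```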
Note however that the exact minimization objectives depend on the specific physical representation
of the labels in the index:
• Using a fixed-length representation every label is allocated the same amount of space. Therefore,
the longest possible label determines the length of every label, and the size of the index. Using a
fixed-length representation, the lower the worst-case guarantee on the length of a label, the smaller
the space we need to allocate for the index.
• Using a variable-length representation each label may have a different length, consisting of a fixed-length
prefix stating the actual length of the label, followed by the label itself. Here, the size of the
index is essentially determined by the average length of a label (rather than the maximum length
of a label). However, the maximum length of a label also has some importance (even if smaller).
This is because the length of the fixed prefix of every label is the logarithm of the maximum length
of a label.
Labeling schemes which are currently being used by actual systems [26, 25] are variants of the following
interval scheme first suggested by Santoro and Khatib [21]. We number the leaves from left to right and
label each node with a pair consisting of the numbers of its smallest and largest leaf descendants. An ancestor query
then amounts to an interval containment test on the labels. It is easy to see that the size of the labels
generated by this scheme is bounded by $2 \log n$ where $n$ is the number of nodes in the tree. A variant of
this scheme is integrated in the Xyleme XML warehouse system [26]. A considerable amount of work has
been devoted recently to developing labeling schemes with worst case bounds smaller than $2 \log n$. The best
such labeling schemes use a recursive construction, and generate labels of length $\log n + O(\sqrt{\log n})$ [5]
([5] builds upon the work of [2]; a related scheme is also suggested in [22]). Although theoretically best,
these recursive schemes are not expected to improve on the simple interval scheme in practice. This is
because the constant multiplying the additive $\sqrt{\log n}$ is large and makes this term the dominant one when
$n$ is the current average size of an XML file. In this paper we suggest labeling schemes which are likely
to be winners in practice. We compare our schemes experimentally, using real XML data, to the interval
scheme as well as to the recursive schemes of [2, 5, 22]. We also analyze our new schemes theoretically
and prove worst case guarantees on the maximum label length.
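The interval scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any of the cited systems: the tree is assumed to be given as a dict `children` mapping each node to an ordered list of its children, every internal node is assumed to have at least two children (so that intervals are distinct), and the names `interval_labels` and `is_ancestor` are hypothetical. Under this test a node also counts as an ancestor of itself.

```python
def interval_labels(children, root):
    """Label each node with (smallest, largest) leaf number among its descendants."""
    labels = {}
    next_leaf = 0

    def visit(v):
        nonlocal next_leaf
        kids = children.get(v, [])
        if not kids:
            # Leaves are numbered from left to right, starting at 1.
            next_leaf += 1
            labels[v] = (next_leaf, next_leaf)
        else:
            spans = [visit(c) for c in kids]
            labels[v] = (spans[0][0], spans[-1][1])
        return labels[v]

    visit(root)
    return labels


def is_ancestor(label_a, label_b):
    """Interval containment test: does label_a's interval contain label_b's?"""
    return label_a[0] <= label_b[0] and label_b[1] <= label_a[1]


# Small example tree: r has children a and x; a has children y and z.
tree = {"r": ["a", "x"], "a": ["y", "z"]}
labels = interval_labels(tree, "r")
print(labels, is_ancestor(labels["a"], labels["y"]), is_ancestor(labels["a"], labels["x"]))
```

Each label stores two leaf numbers, which is where the $2 \log n$ bound on the label length comes from.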
Our Results. In contrast to the interval scheme we suggest using a prefix-based approach, where the
labeling is such that an ancestor query roughly amounts to testing whether one label is a prefix of the
other. We start with a simple scheme that assigns a prefix-free collection of binary strings to the edges
outgoing from each node. The label of a node $v$ is the concatenation of the labels of the edges on the path
from the root to $v$. It is easy to see that $v$ is an ancestor of $u$ iff the label of $v$ is a prefix of the label of $u$.
The way that the binary strings are assigned to the edges depends on the desired physical representation.
For variable length representation we assign the binary strings so as to minimize the average length of a
label. For fixed length representation we assign the strings so as to minimize the maximum length of a
label.
This simple prefix scheme already exploits the structure of real XML files. It reduces the size of the
index by 10% compared to the interval scheme using variable length representation, and by more than
20% using fixed length representation. Nevertheless this scheme suffers from bad worst case guarantees.
For example for a tree which is a long path the labels are of length $\Omega(n)$. Therefore, a single skewed
XML file may increase the size of the index considerably, in particular for fixed length representation.
To remedy this we introduce the compressed prefix scheme that first transforms the tree into a balanced
one. We perform this balancing by partitioning the tree into paths and contracting each path into a single
virtual node. Then we assign a collection of prefix free binary strings to the edges of the compressed tree
outgoing from each virtual node. We label the nodes of the original tree using the assignment in the compressed
tree. The labels and the ancestor test are only slightly more involved than without compression.
As for the simple prefix scheme, the way we assign binary strings to the edges of the compressed tree
depends upon whether we use variable length representation or fixed length representation. Using compression
we obtain an $O(\log n)$ worst case upper bound on the length of a label for both variable length
representation and fixed length representation. As a result we can handle the indexing of isolated skewed
trees without a large penalty in the size of the index. Furthermore, for fixed length representation, where
the goal is to minimize the maximum length of a label, we give an algorithm which uses compression and
produces labels of length at most $2 \log n$, thus matching the upper bound of the simple interval scheme.
The compressed prefix scheme not only has good worst case guarantees but it also performs well in
practice on our data. For variable length representation we obtain using the compressed prefix scheme
an additional 10% reduction in the size of the index compared to the simple prefix scheme. Summing up,
this is a total of 20% reduction with respect to the interval scheme. Recall that in terms of the actual
numbers we previously mentioned this means reducing the current index size by over 100 gigabytes.
For fixed length representation our experiments show that the compressed prefix scheme never produced
labels of length larger than $1.5 \log n$. The total reduction of the index size compared to the interval
scheme was about 25%. We point out that our collection of XML files did not contain any particularly
skewed tree. Therefore the advantage of the compressed prefix scheme over the simple prefix scheme has
not been as large as it would have been in case even a single such tree appears.
The implementations of the simple prefix scheme and the compressed prefix scheme are straightforward.
Although the asymptotic worst case time complexity of our labeling algorithms is slightly larger
than that of the interval scheme¹, our experiments show that the actual running time of all labeling
algorithms is negligible relative to the time required to load and parse the document. Therefore, the use
of prefix labeling contributes no significant overhead to the time needed to prepare the index.
We also compared our prefix labeling schemes to a simplified version of the recursive labeling scheme
with the currently best asymptotic bound of $\log n + O(\sqrt{\log n})$ bits on the length of the labels [5].
Rather than using a recursion of depth $O(\sqrt{\log n})$ as in [5] (see also [22]) our implementation limits
the recursion to a two level partition of the tree. This is done in order to reduce the constant factors while
compromising for an asymptotic guarantee of $\frac{3}{2} \log n + O(1)$ bits. We also present a new 2-level scheme
which combines our compressed prefix labeling scheme with ideas from [2, 5]. We prove that the new
hybrid 2-level scheme has an upper bound of $\frac{3}{2} \log n + O(\log \log n)$ on the length of the labels. We also
implemented and evaluated this new scheme in practice.
Despite careful tuning of the implementations of the 2-level schemes to avoid large constant factors,
our experiments show that these schemes perform worse than the compressed prefix scheme and even
the simple prefix scheme on real XML data. To get a better idea of the additive constant factors of
the 2-level schemes, and to compare the performance of all algorithms on different distributions of trees
besides XML, we used all algorithms to label random trees (from two different distributions) of varying
sizes. This experiment shows that 2-level schemes become competitive on trees with at least 200K nodes.
One degenerate variant of the 2-level schemes, which prunes every leaf which is a single child of its
parent², performed particularly well on the XML data using variable length representation. This scheme
exploits the fact that in a typical XML file most of the leaves are single children of their parents.
Additional Related work: Peleg [20] considered informative labeling schemes of trees and graphs for
several types of problems. Maybe the problem closest to ours among the ones that Peleg studies is finding
a labeling scheme that given the labels of two nodes $u$ and $v$ allows finding the identifier of the lowest
common ancestor of $u$ and $v$. Peleg described an $O(\log^2 n)$ labeling scheme for this problem and proves
that this bound is tight up to a constant factor if the identifiers are predetermined.
Milo and Kaplan [17] have recently shown how to add the ability to answer parent queries to the
theoretically best ancestor schemes with no asymptotic overhead. In another recent work Alstrup et al.
[4] show a labeling scheme that makes it possible to identify nearest common ancestors.
The structure of the paper is as follows. Section 2 describes the basic prefix labeling scheme and
Section 3 presents the compressed version. The implications of full text indexing are considered in Section
4. Section 5 describes the two-level labeling schemes. Finally, the experimental results are presented in
Section 6. We conclude in Section 7. For lack of space we omit the proofs of our theoretical results and
some details of our implementations from this extended abstract.
2 Simple prefix labeling


We assume in the sequel that all trees are such that each internal node has at least two children³. We
denote by $n$ the number of nodes in a tree $T$. We denote the subtree of $T$ rooted at a node $v$ by $T_v$, and
the number of nodes in $T_v$ by size($v$).
We call an assignment of binary strings to the edges of the tree, such that the collection of strings associated
with the outgoing edges from any node is prefix free, a prefix free assignment. The simple prefix scheme
first finds a prefix free assignment to the tree. Then it labels every node $v$ with the concatenation of
the strings assigned to the edges on the path from the root to $v$. It is easy to see that with such an
assignment labels are unique, and node $v$ is an ancestor of $u$ if and only if the label of $v$ is a prefix of
the label of $u$.

Figure 1: (a) MSL vs. MML (b) long prefix labels (c) Path decomposition

¹ The worst case running time of our prefix labeling algorithms is $O(n \log n)$. The running time of the interval scheme is $O(n)$.
² It then labels them with the label of the parent and an additional bit to distinguish them from their parent.
³ If this is not the case to begin with we add a child to each internal node having only one child. This transformation
at most doubles the number of nodes, and since the bounds on the label length will be in terms of $\log n$ this only means
the addition of at most one single bit.
For variable-length physical representation, when different labels have different lengths, our goal is
to find an assignment that minimizes the sum of the lengths of the labels (or equivalently the average
label length). We call this optimization problem the Minimum Sum of Labels (MSL) problem. When a
fixed-length physical representation is used, namely all labels have the same length (determined by the
longest label), our goal is to find an assignment that minimizes the maximum length of a label. We call
this optimization problem the Minimum Maximum Label (MML) problem.
To see the difference between the two problems consider for example the tree shown in Figure 1(a).
To minimize the sum of the lengths of the labels we assign to the edges outgoing from the root the strings
00, 01, 10, 11 respectively. All other nodes have exactly two outgoing edges to which we assign zero and one respectively.
(It will follow from Theorem 2.1 below that this is indeed the optimum labeling.) On the other hand, to
minimize the maximum length of a label we will assign to the edges outgoing from the root the strings
000, 001, 01, 1 respectively, which will make the maximum label length 5 rather than the 6 with the previous
assignment. (Again, it will follow from Theorem 2.2 below that this is indeed the optimum labeling.)
We solve the MSL problem using Huffman's algorithm [16]. Recall that given a set of weights
$w_1, \ldots, w_n$, Huffman's algorithm finds a set $b_1, \ldots, b_n$ of prefix-free binary strings that minimizes the
sum $\sum_{i=1}^{n} (w_i \cdot |b_i|)$. To find the strings, the algorithm constructs a binary tree, whose leaves are the
given weights, that minimizes $\sum_{i=1}^{n} (w_i \cdot l_i)$, where $l_i$ is the number of edges on the path from the root
to $w_i$. The corresponding strings are then obtained from the tree by tagging the outgoing edges of each
node by zero and one and concatenating the tags along the path from the root to each $w_i$. To construct
the desired binary tree we start with a forest of $n$ singleton nodes of weights $w_1, \ldots, w_n$. At each step
the algorithm combines the two lightest trees into one tree (by adding a new node as a parent of the two
roots) whose weight is the sum of weights of the two lightest trees. The algorithm ends when there is
only one tree in the forest.
The MSL problem breaks into a collection of independent subproblems, one for each internal node
of the tree. Let $v$ be an internal node whose children are $u_1, \ldots, u_k$. To obtain the labels of the edges
outgoing from $v$ we apply Huffman's algorithm to the (multi) set of weights $w_1, \ldots, w_k$ where $w_i$ = size($u_i$),
$1 \leq i \leq k$. (Recall that size($u_i$) is the size of the subtree rooted at $u_i$.) The following theorem establishes
the correctness of this algorithm.
Theorem 2.1 For a given tree $T$, the above algorithm computes a prefix free assignment that solves the
MSL problem.
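The per-node use of Huffman's algorithm described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tree is a dict `children` mapping a node to the list of its children, each internal node has at least two children, and the names `huffman_codes`, `simple_prefix_labels` and `is_ancestor` are hypothetical. Instead of building an explicit Huffman tree, the sketch prepends a bit to every string in a merged group, which yields the same strings.

```python
import heapq
from itertools import count


def huffman_codes(weights):
    """Prefix-free binary strings, one per weight, minimizing sum(w * len(code))."""
    tie = count()  # tie-breaker so the heap never compares the index lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    codes = [""] * len(weights)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)   # lightest tree
        w2, _, group2 = heapq.heappop(heap)   # second lightest tree
        for i in group1:
            codes[i] = "0" + codes[i]
        for i in group2:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (w1 + w2, next(tie), group1 + group2))
    return codes


def simple_prefix_labels(children, root):
    """Label every node by concatenating edge strings on the path from the root."""
    size = {}

    def measure(v):
        size[v] = 1 + sum(measure(c) for c in children.get(v, []))
        return size[v]

    measure(root)
    labels = {root: ""}
    stack = [root]
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        edge_strings = huffman_codes([size[c] for c in kids])
        for child, edge in zip(kids, edge_strings):
            labels[child] = labels[v] + edge
            stack.append(child)
    return labels


def is_ancestor(label_v, label_u):
    """v is an ancestor of u (or equal to it) iff label_v is a prefix of label_u."""
    return label_u.startswith(label_v)


tree = {"r": ["a", "x"], "a": ["y", "z"]}
print(simple_prefix_labels(tree, "r"))
```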

We solve the MML problem using a variant of Huffman's algorithm that computes a prefix-free collection
of binary strings $b_1, \ldots, b_n$ that minimizes $\max_{i=1,\ldots,n} (w_i + |b_i|)$ rather than the sum $\sum_{i=1}^{n} (w_i \cdot |b_i|)$.
We call this variant the H-max algorithm. Given a set of weights $w_1, \ldots, w_n$ the H-max algorithm works
as Huffman's algorithm, merging at each iteration the two lightest trees. The difference is in the weight
which H-max assigns to the new tree: Rather than taking the sum of the weights of the two merged
trees, the H-max algorithm sets the weight of the new tree to be 1 + the largest weight among the two
trees being merged. It is not hard to prove (see [14]) that indeed the H-max algorithm finds a string
assignment which minimizes $\max_{i=1,\ldots,n} (w_i + |b_i|)$.
As in the solution of the MSL problem we apply the H-max algorithm to each internal node in
order to label its outgoing edges. In contrast to the MSL algorithm, here we cannot handle each
node independently but have to handle them bottom-up. Let $v$ be an internal node whose children
are $u_1, \ldots, u_k$. (Assume we already labeled the subtrees $T_{u_i}$.) To obtain the labels of the edges outgoing
from $v$ we apply the H-max algorithm to the (multi) set of weights $w_1, \ldots, w_k$ where $w_i$ is the length of the
longest concatenation of strings assigned to the edges of a root to leaf path in $T_{u_i}$. The following theorem
establishes the correctness of this algorithm. The proof is by simple induction on the depth of $T_v$.
Theorem 2.2 Given a tree $T$, the above algorithm computes a prefix free assignment that solves the
MML problem.
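The H-max variant differs from Huffman's algorithm only in the weight given to a merged tree, which the following sketch makes explicit. It is an illustration only: the name `h_max_codes` is hypothetical, and the example weights 2, 2, 3, 4 are assumed values (not read off Figure 1(a)) chosen so that the resulting string lengths 3, 3, 2, 1 match the shape of the strings 000, 001, 01, 1 mentioned earlier.

```python
import heapq
from itertools import count


def h_max_codes(weights):
    """Prefix-free binary strings minimizing max(w_i + len(code_i)).

    Works like Huffman's algorithm, except that a merged tree gets weight
    1 + max(w1, w2) instead of w1 + w2.
    """
    tie = count()  # tie-breaker so the heap never compares the index lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    codes = [""] * len(weights)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        for i in group1:
            codes[i] = "0" + codes[i]
        for i in group2:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (1 + max(w1, w2), next(tie), group1 + group2))
    return codes


# Hypothetical weights: the longest label length inside each child subtree.
weights = [2, 2, 3, 4]
codes = h_max_codes(weights)
worst = max(w + len(c) for w, c in zip(weights, codes))
print(codes, worst)
```

In a full MML labeling this routine would be applied bottom-up, with the weight of each child set to the longest label length already obtained inside that child's subtree.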

Remark: Note that in contrast with the interval scheme the prefix based schemes do not have a logarithmic
worst-case upper bound on the maximum label length. As a simple example consider the tree in
Figure 1(b). The label of the deepest leaf has $n/2$ bits. The maximum label length cannot in general
exceed $n$ since an arbitrary binarization⁴ of the tree clearly gives labels of length bounded by $n$.

3 Compressed prefix labeling


In this section we enhance the prefix labeling with a preprocessing step that first compresses the tree. As
a result labels of nodes even in highly skewed trees cannot get too large. The penalty is a slightly more
complicated ancestor test which incorporates another comparison in addition to the prefix decision. We
will use the following notation. Given strings $b_1, b_2, b_3$ such that $b_2 = b_1 \cdot b_3$ we denote $b_3$ by suffix($b_2, b_1$).
We say that $s_1$ is smaller than $s_2$, denoted $s_1 < s_2$, when $s_1$ precedes $s_2$ in the lexicographic order of
binary strings.
At a high level the algorithm works as follows. We decompose the tree $T$ into disjoint paths, and
construct a (partially ordered) compressed tree $\hat{T}$ by contracting each path into a big virtual node (and
defining a partial order on the children of the node). We first label $\hat{T}$ using a scheme similar to the
simple prefix labeling scheme of Section 2. Once $\hat{T}$ is labeled we label each original node based on the
label of the virtual node (path) to which it belongs. The label of a node $v$ in $T$ will be composed of
two substrings code($v$) and separator($v$). All nodes on the same path will have the same code (which
is essentially the label of the node corresponding to the path in $\hat{T}$), and will be distinguished by their
separator. Furthermore, we assign codes and separators such that for every two nodes $v, u \in T$, $v$ is an ancestor
of $u$ iff the following two conditions hold: (i) code($v$) is a prefix of code($u$), and (ii) separator($v$) $\leq$
suffix(code($u$), code($v$)) $\cdot$ separator($u$).
We show how to choose the labels such that the length of the code concatenated with the separator
of each node is bounded by $2 \log n$. One can then encode each label using $2 \log n + \log \log n + 2$ bits: The
encoding consists of two fixed size fields, one of size $2 \log n + 1$ and the other of size $\log \log n + 1$. The
first field contains the concatenation code($v$) $\cdot$ separator($v$), padded at the beginning with a sequence of
'0's ending with a single '1'. (The '1' denotes where in the field the actual data starts.) The second field
tells us how to decode the first field, namely, in which position the code ends and the separator starts.
Compressing the tree: We compress the tree by a quite standard partition of the tree into heavy
paths. For a node $v$ we define the path of $v$, denoted by path($v$), as the set consisting of $v$ and the nodes $w$
in $T_v$ such that size($w$) $\geq$ size($v$)/2 (note that these are all nonleaf nodes). Note that this set of nodes
indeed forms a path since each of $v$'s descendants can have at most one child with weight $\geq$ size($v$)/2.
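The definition of path($v$) translates directly into a top-down partition: starting from a head node, keep following the unique child whose subtree has at least half the size of the head's subtree, and start a new path at every other child. The sketch below is a minimal illustration under stated assumptions: the tree is a dict `children` mapping nodes to child lists, the names `subtree_sizes` and `heavy_paths` are hypothetical, and the recursive size computation is only suitable for trees of moderate depth.

```python
def subtree_sizes(children, root):
    """Number of nodes in the subtree rooted at each node."""
    size = {}

    def visit(v):
        size[v] = 1 + sum(visit(c) for c in children.get(v, []))
        return size[v]

    visit(root)
    return size


def heavy_paths(children, root):
    """Partition the tree into paths; each path is a list starting at its head."""
    size = subtree_sizes(children, root)
    paths = []
    heads = [root]  # nodes at which a new path starts
    while heads:
        head = heads.pop()
        path = [head]
        v = head
        while True:
            heavy = None
            for c in children.get(v, []):
                # c stays on the path iff size(c) >= size(head) / 2;
                # at most one child of v can satisfy this.
                if 2 * size[c] >= size[head]:
                    heavy = c
                else:
                    heads.append(c)  # c becomes the head of a new path
            if heavy is None:
                break
            path.append(heavy)
            v = heavy
        paths.append(path)
    return paths


tree = {"r": ["a", "x"], "a": ["y", "z"]}
print(heavy_paths(tree, "r"))
```

Note that the threshold is relative to the head of the current path, as in the definition above, rather than to the immediate parent.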

⁴ Replace the outgoing edges of each internal node $v$ by an arbitrary binary tree with the children of
$v$ at its leaves.

We partition the tree into paths recursively as follows. We start from the root $r$ of $T$ and work top
down: We first find path($r$) and mark it as a new path. Then we continue recursively from every node
not on path($r$) that is a child of a node on path($r$) and decompose all the subtrees rooted at these nodes
in the same way. For a path $p$, we denote by head($p$) the node on $p$ closest to the root, and by parent($p$)
the parent of head($p$) in $T$. Our decomposition algorithm is illustrated in Figure 1(c). The decomposed
paths are marked with thick dark lines. The head and the parent of one of the paths are also marked in the figure.
We obtain the compressed tree $\hat{T}$ by contracting each path into one single node. Thus, the nodes of
$\hat{T}$ are the paths of the path decomposition and a node $q \in \hat{T}$ is a child of a node $p \in \hat{T}$ if and only if
parent($q$) $\in p$. It is easy to see that the depth of $\hat{T}$ is at most $\log n$. (Since if $p$ is a parent of $q$ in $\hat{T}$
then size(head($q$)) $<$ size(head($p$))/2.) The representative in $\hat{T}$ of a node $v \in T$, denoted by rep($v$), is
the node corresponding to the path that contains $v$. The next lemma shows that $\hat{T}$ preserves the ancestor
relationship among the nodes it represents.
Lemma 3.1 If $v_1$ is an ancestor of $v_2$ in $T$ then either $v_1$ and $v_2$ are on the same path, so rep($v_1$) =
rep($v_2$), or they belong to distinct paths, in which case rep($v_1$) is an ancestor of rep($v_2$) in $\hat{T}$.
For every internal node $p \in \hat{T}$ we define a partial order $\prec_p$ on the children of $p$ such that $q_1 \prec_p q_2$ if and
only if parent($q_1$) is a proper ancestor of parent($q_2$). We will omit the subscript from $\prec_p$ without confusion
since every node in $\hat{T}$ participates in exactly one such partial order. For example in the compressed tree
corresponding to the tree of Figure 1(c), $K, L, M, N, O, P$ are children of $Q$ and the partial order on them
is $K \prec \{L, M\} \prec N \prec \{O, P\}$.
Assigning labels: We first label the nodes of $\hat{T}$ with a prefix free assignment but in addition we require
that this assignment respect the partial order defined on the edges outgoing from each node of $\hat{T}$.
Specifically, if $q$ and $q'$ are children of $p$ and $q \prec q'$, then we require that the string assigned to the edge
$(p, q)$ is smaller than the string assigned to $(p, q')$. We call such an assignment to the edges of $\hat{T}$ an ordered
prefix free assignment.
Once we find an ordered prefix free assignment to the edges of $\hat{T}$ we label each node $p \in \hat{T}$ by the
concatenation of the strings assigned to the edges of $\hat{T}$ along the path from the root to $p$. Finally, we
define the labels of the nodes of $T$ based on the labels of the compressed tree $\hat{T}$. Recall that the label of
a node $v$ consists of two parts, code($v$) and separator($v$). We define code($v$) to be the label of rep($v$) $\in \hat{T}$,
and separator($v$) to be the smallest (lexicographically) string assigned to the outgoing edges of rep($v$)
which point to a representative of a child of $v$, or the empty string if rep($v$) is a leaf node.
Our ancestor test is specified by the following theorem. The proof of this theorem follows from Lemma
3.1 and the fact that the codes and separators were generated based on an ordered prefix free assignment.
Theorem 3.2 For two nodes $v_1, v_2$ in $T$, $v_1$ is an ancestor of $v_2$ iff code($v_1$) is a prefix of code($v_2$), and
separator($v_1$) $\leq$ suffix(code($v_2$), code($v_1$)) $\cdot$ separator($v_2$).
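The ancestor test of Theorem 3.2 needs only string operations on the two labels. The sketch below is an illustration under stated assumptions: labels are (code, separator) pairs of '0'/'1' strings, the comparison is taken to be the non-strict lexicographic order (so a node tests as an ancestor of itself), and the function name `is_ancestor` as well as the small hand-labeled example are hypothetical. In the example, a root r has children a and x, and a has leaf children y and z; the heavy path {r, a} is contracted, its outgoing edges to x, y, z receive the ordered prefix free strings 0, 10, 11, and codes and separators are then read off as defined above.

```python
def is_ancestor(label_v, label_u):
    """Test of Theorem 3.2 on labels given as (code, separator) pairs."""
    code_v, sep_v = label_v
    code_u, sep_u = label_u
    if not code_u.startswith(code_v):
        return False
    suffix = code_u[len(code_v):]       # suffix(code(u), code(v))
    # Python compares strings lexicographically, a proper prefix being smaller.
    return sep_v <= suffix + sep_u


# Hand-built labels for the example tree described in the lead-in.
labels = {
    "r": ("", "0"),     # on path {r, a}; smallest edge to a child of r is "0"
    "a": ("", "10"),    # on path {r, a}; smallest edge to a child of a is "10"
    "x": ("0", ""),     # leaf paths have an empty separator
    "y": ("10", ""),
    "z": ("11", ""),
}

for v, u in [("r", "a"), ("a", "r"), ("a", "x"), ("a", "z"), ("x", "y")]:
    print(v, "ancestor of", u, ":", is_ancestor(labels[v], labels[u]))
```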

3.1 Finding an ordered prefix free assignment of $\hat{T}$

As in Section 2 the algorithm is different for variable length representation and for fixed length
representation. Notice that for each $v \in T$ the concatenation code($v$) $\cdot$ separator($v$) is a label of a node in
$\hat{T}$, and each label of a node in $\hat{T}$ is code($v$) for some $v \in T$. Therefore for fixed length representation our
strategy is to find an ordered prefix free assignment to $\hat{T}$ that minimizes the maximum length of a label
of a node in $\hat{T}$. This also minimizes the maximum length of a label of a node in $T$. For variable length
representation we look for an ordered prefix free assignment that minimizes the sum of the lengths of the
code parts of the labels of the nodes of $T$.
Formally we define optimization problems that we call Ordered Minimum Sum of Labels (OMSL)
and Ordered Minimum Maximum Label (OMML) which are analogous to MSL and MML of Section 2.

The input to ea h of these problems is a tree together with a partial order de ned on the hildren of
every node. The obje tive is to nd an ordered pre x free assignment that either minimizes the sum of
the lengths of the ode parts of the labels (OMSL) in , or minimizes the maximum length of a label
(OMML), depending on the desired physi al representation.
As for MSL and MML (see Se tion 2) we solve the OMSL and OMML problems by breaking them
into little pie es and solving an independent problem for ea h node. The problem we have to solve at
every node 2 ^ for the OMSL problem is the following. Let =
(head( )), for 1   ,
where is the th hild of . Find a binary tree with leaves
that minimizes P subje t
to the onstraint that must pre eded in an inorder traversal of the tree if  . Similarly,
for the OMML problem we are looking for a binary tree that minimizes maxf + g, and satis es the
order onstraints. Here is the length of the longest on atenation of the binary strings assigned to a
root to leaf path in the (re ursive) solution to the problem rooted at . It is straightforward to nd the
binary string orresponding to ea h edge outgoing from from the binary tree. Theorems analogous to
Theorem 2.1 and Theorem 2.2 hold for these algorithms, assuming we obtain an optimal solution to the
subproblem of ea h node .
The only missing link in what we des ribed so far is an algorithm to nd the desired binary tree
for ea h node . We annot use Hu man's algorithm (or H-max) sin e these assign binary string for
unordered set of weights without any guarantees on the lexi ographi order among them. Re ently,
motivated by this work, Barkan and Kaplan [8 des ribe an algorithm to nd the best tree (a ording to
both measures mentioned above) whi h is onsistent with a partial order de ned on the weights. Their
algorithm works for a family of partial orders that includes those that arise in the OMSL and OMML problems, and it is polynomial when the weights are polynomially bounded (the weights here are bounded by $n$).
We can prove that for both the OMSL and OMML problems the length of the resulting labels is $O(\log n)$ bits. For OMML we can use the special structure of the partial orders that arise at each node to prove the following stronger bound.
Theorem 3.3 Let $T$ be a tree, and let $\hat{T}$ be the compressed tree of $T$. For the OMML problem the algorithm finds an ordered prefix-free assignment of $\hat{T}$ such that the length of the resulting labels of nodes in $T$ is bounded by $2 \log n$.
The proof of Theorem 3.3 relies on the following special structure of the partial order at every node $p \in \hat{T}$. We can partition the children of $p$ into blocks $B_1, \ldots, B_{l(p)}$, where $l(p)$ is the length of the path that corresponds to $p$. The block $B_i$ contains all the children of the $i$-th node on the path, and all nodes in the same block are unrelated. Furthermore, by the definition of the heavy paths, the total weight of nodes in the last block is at least half the total weight. We call a partial order on a set of weights that has these properties a balanced partial order. Our proof of Theorem 3.3 is constructive and suggests another algorithm (in addition to the optimal algorithm of [8]) that labels $\hat{T}$ with labels of length at most $2 \log n$ bits.
Unfortunately, both the algorithm of [8] and the algorithm suggested by the proof of Theorem 3.3 are quite complicated to implement. Therefore we implemented the following simple heuristics that still guarantee short labels.
Method 1 (Choose arbitrary total order): For the special case where the partial order on the set of weights $w_1, \ldots, w_k$ is in fact a total order, i.e. the weights are totally ordered as $w_1, w_2, \ldots, w_k$, there are simple polynomial time algorithms [13, 18, 15] for finding an optimal binary tree with $w_1, \ldots, w_k$ at the leaves from left to right that minimizes either $\sum_{i=1}^{k} w_i l_i$ or $\max_{i=1,\ldots,k} (w_i + l_i)$ (where $l_i$ is the number of edges from the root to the leaf containing $w_i$). A binary tree containing $w_1, \ldots, w_k$ at the leaves from left to right is called an alphabetic tree, and the corresponding binary strings are an alphabetic code. Our first heuristic picks an arbitrary total order consistent with the partial order and finds an optimal alphabetic code consistent with this total order using one of the algorithms mentioned above. We can prove that for both OMSL and OMML this heuristic still guarantees $O(\log n)$ size labels. In particular, using a result of Mehlhorn [19], we can prove the following theorem.
Theorem 3.4 The length of the labels obtained by the above scheme adapted to solve the OMML problem is bounded by $\log(n) + 2d$, where $d$ is the depth of $\hat{T}$.
Since the depth of $\hat{T}$ is at most $\log(n)$, Theorem 3.4 implies a bound of $3 \log(n)$ on the maximum label length.
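To make the notion of an optimal alphabetic code concrete, the following is a minimal sketch of the sum-of-weighted-lengths variant, written as a plain cubic-time interval dynamic program. It is an illustration under stated assumptions rather than any of the cited algorithms: the function name `optimal_alphabetic_code` is hypothetical, and giving a single child the string "0" is a choice made here only so that its label differs from its parent's.

```python
from functools import lru_cache
from itertools import accumulate


def optimal_alphabetic_code(weights):
    """Return codewords, in left-to-right order, minimizing sum(w_i * l_i)
    over all alphabetic (order-preserving) binary codes.

    Hypothetical helper: a plain O(k^3) dynamic program, for illustration.
    """
    k = len(weights)
    if k == 0:
        return []
    if k == 1:
        return ["0"]  # assumption: a lone child still gets one bit
    prefix = [0] + list(accumulate(weights))

    @lru_cache(maxsize=None)
    def best(i, j):
        """(minimum cost, split point) for the leaves i..j inclusive."""
        if i == j:
            return 0, None
        # Splitting i..j pushes every leaf in the range one level deeper.
        total = prefix[j + 1] - prefix[i]
        cost, split = min(
            (best(i, r)[0] + best(r + 1, j)[0], r) for r in range(i, j)
        )
        return cost + total, split

    codes = [""] * k

    def assign(i, j, code):
        if i == j:
            codes[i] = code
            return
        r = best(i, j)[1]
        assign(i, r, code + "0")      # left subtree gets a 0
        assign(r + 1, j, code + "1")  # right subtree gets a 1

    assign(0, k - 1, "")
    return codes


print(optimal_alphabetic_code([1, 1, 10]))  # the heavy last leaf gets 1 bit
```

Because left subtrees always receive a 0 and right subtrees a 1, the codewords come out in lexicographic order, which is what makes the code alphabetic.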
Method 2 (Exploiting the structure of the partial order): In this heuristic we try to exploit the fact that the partial orders are balanced, using a simple algorithm. Since the partial order at a node $p$ is balanced, we can split the children of $p$ into two groups: (i) an unordered group consisting of the children in the last partition set $B_{l(p)}$, and (ii) a partially ordered group consisting of all other children. Furthermore, the weight of the first group is at least as large as that of the second group. Now we can treat each group separately: (i) run Huffman's algorithm to assign strings to nodes in the unordered first group, and (ii) complete the partial order defined on the second group to an arbitrary total order and find the optimal string assignment consistent with this total order using one of the algorithms mentioned above. Finally we have to combine the string assignments of the two parts into one string assignment consistent with the original partial order. We do that by adding a leading zero to the strings of the ordered part and a leading one to the strings of the unordered part. Intuitively we are able to get shorter labels since the weight of the unordered set is large and we avoid imposing unnecessary constraints on its elements. The algorithm which we use for solving the ordered and the unordered problems at each node depends on whether we solve the OMSL problem or the OMML problem.
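The combination step of Method 2 can be sketched as follows for the sum-of-lengths (OMSL-style) objective. All function names are hypothetical. Huffman's algorithm is used for the unordered group as in the text, while the ordered group is coded here with a simple weight-balanced split, which is only a stand-in for the optimal alphabetic-code algorithms and is not claimed to be optimal.

```python
import heapq
from itertools import count


def huffman_codes(weights):
    """Huffman codewords for an unordered group, index-aligned with weights."""
    codes = [""] * len(weights)
    tie = count()  # tie-breaker so the heap never has to compare lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, zeros = heapq.heappop(heap)
        w1, _, ones = heapq.heappop(heap)
        for i in zeros:
            codes[i] = "0" + codes[i]
        for i in ones:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (w0 + w1, next(tie), zeros + ones))
    return codes


def balanced_alphabetic_codes(weights):
    """Order-preserving codewords from a weight-balanced split.

    Assumption: a simple heuristic replacing the optimal algorithms.
    """
    codes = [""] * len(weights)

    def build(lo, hi, code):
        if hi - lo == 1:
            codes[lo] = code
            return
        total, left = sum(weights[lo:hi]), 0
        best_cut, best_gap = lo + 1, None
        for cut in range(lo + 1, hi):
            left += weights[cut - 1]
            gap = abs(total - 2 * left)  # |right weight - left weight|
            if best_gap is None or gap < best_gap:
                best_cut, best_gap = cut, gap
        build(lo, best_cut, code + "0")
        build(best_cut, hi, code + "1")

    if weights:
        build(0, len(weights), "")
    return codes


def method2_codes(ordered_weights, unordered_weights):
    """Leading 0 for the ordered part, leading 1 for the unordered part."""
    ordered = ["0" + c for c in balanced_alphabetic_codes(ordered_weights)]
    unordered = ["1" + c for c in huffman_codes(unordered_weights)]
    return ordered, unordered


print(method2_codes([1, 2], [5, 1, 1]))
```

The leading bit keeps the two groups prefix-free with respect to each other, and every string of the ordered part precedes every string of the unordered part lexicographically, which matches the unordered group being the last block.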

4 Labeling a weighted tree


A typical XML index contains entries not only for the XML items appearing in the documents (i.e. the nodes in the document tree) but also for the words appearing in the document text. Looking at the document parse tree, this text appears at the leaves. It follows that while an internal node of an XML tree appears in the index just once, a leaf may appear several times, once per word occurring in the text at that leaf.
When the physical representation of labels is variable-length, so that labels have different lengths, we may want to take the number of words occurring in each leaf into account: we would like the labels of leaves containing a large number of words to be shorter than those of leaves containing fewer words (and perhaps even shorter than the labels of some internal nodes). In contrast with the interval labeling scheme, it is easy to adapt our prefix schemes to a setup where nodes have different weights reflecting the number of words that they contain. The core of our simple and compressed prefix schemes, tailored for variable length representation, were the algorithms for solving the MSL and OMSL problems, respectively. If the weights of nodes vary we can modify these algorithms such that they minimize the weighted sum of the lengths of the labels rather than their unweighted sum. As a result, nodes with high weights will get shorter labels.
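One way to write down the modified objective is given below. The symbols $w(v)$ and $L(v)$ are introduced here for illustration only, and the specific choice of weights (1 for an internal node, the word count for a leaf) is an assumption about how index entries are counted.

```latex
% w(v): number of index entries carrying the label of node v
%       (assumed: 1 for an internal node, the word count for a leaf)
% L(v): label assigned to v, |L(v)| its length in bits
\min_{L} \; \sum_{v \in T} w(v)\,|L(v)|
\qquad \text{instead of} \qquad
\min_{L} \; \sum_{v \in T} |L(v)|
```

For example, for two leaves holding 1 and 9 words, giving the 3-bit label to the first and the 1-bit label to the second costs $1 \cdot 3 + 9 \cdot 1 = 12$ bits in the index, whereas the opposite assignment costs $1 \cdot 1 + 9 \cdot 3 = 28$ bits, although both have the same unweighted sum.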

5 Two-level algorithms
As mentioned in the introduction, the theoretically best scheme produces labels of length at most $\log n + O(\sqrt{\log n})$. This scheme is recursive. It partitions the tree into $\sqrt{\log n}$ levels of forests, labels the forests independently using the interval scheme, and then carefully assembles the label of a node from the labels of its representatives in the different forests. Although theoretically best, this recursive scheme is not expected to improve on the simple interval scheme in practice. This is because the constant multiplying the additive $\sqrt{\log n}$ term is large and makes this term the dominant one when $n$ is the current average size of an XML file or even much larger. To check whether some of the ideas behind this scheme can generate significant practical improvements we implemented a simplified version of this recursive scheme that partitions the tree into only two levels, rather than $\sqrt{\log n}$ levels.
The schemes of [2, 5], when restricted to two levels, work roughly as follows. We prune subtrees of size smaller than $\sqrt{n}$. Then, even if the original tree did not have unary nodes, such nodes may exist after pruning. To eliminate these nodes we contract induced paths of length about $\sqrt{n}$ into single vertices. We also add leaves such that each group of pruned subtrees of size at most $2\sqrt{n}$ has a representative leaf in the tree which remains after pruning. We denote by $T'$ the tree that remains after these transformations. We label the pruned subtrees, and $T'$, independently, using the interval scheme. We also assign to each node $v$ on each contracted path a binary string whose length is inversely proportional to the size of the pruned subtrees hanging off of $v$. This is done such that the collection of binary strings associated with each path forms an alphabetic code where the order corresponds to the order of the vertices on the path (using the algorithm of [15]). The codeword corresponding to a vertex $v$ is concatenated to all interval labels of nodes in the subtrees hanging off of $v$. Finally we obtain the label of a pruned node $u$ by concatenating the label of its representative leaf in $T'$ with the interval of $u$ (with the alphabetic code attached to it if $u$ is pruned off of a contracted path). The label of a node $v$ on a contracted path is the label of the corresponding contracted node in $T'$ together with $v$'s alphabetic codeword. We will refer to this scheme in the sequel as the two-level interval scheme. Although we limited the recursion to a 2-level partition of the tree, one can still prove an upper bound of $\frac{3}{2}\log n + O(1)$ bits on the maximum length of a label. For more details see [2, 5].
When we determine whether a node $u$ is an ancestor of a node $v$ there are three possible cases, where $T'$ denotes the tree that remains after pruning and contraction. If $u$ and $v$ are in different subtrees (hence their labels contain different ancestor leaf labels from $T'$) then they are clearly unrelated. If $u$ and $v$ belong to the same subtree we can determine whether $u$ is an ancestor of $v$ by comparing their labels within the subtree. Finally, if $u$ is in $T'$ or on a contracted path and $v$ is in a pruned subtree, we can decide by comparing the label of $v$'s leaf ancestor in $T'$ (that is incorporated in the label of $v$) to the label of $u$ (we also have to use the alphabetic codes in case $v$ is in a pruned subtree and $u$ is on the corresponding contracted path).
Alternatively, we can use the same pruning and contraction process described above to split the tree, but instead of using the interval scheme to label the pruned subtrees we can use the compressed prefix scheme (see Section 3). We call this scheme in the sequel the mixed two-level scheme. Once Theorem 3.3 is established it is not hard to prove the following performance guarantee on the mixed 2-level scheme (the proof is omitted from this extended abstract).
Theorem 5.1 The lengths of the labels generated by the mixed 2-level scheme are at most $\frac{3}{2}\log n + O(\log \log n)$.
Intuitively, these 2-level schemes exploit the fact that, in the interval scheme, the length of a leaf-label is half the length of a label of an internal node. They obtain the better bound on the maximum label length by balancing out the length of a leaf-label with the length of the label of an internal node.
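The asymmetry between leaf labels and internal labels is easy to see in a small sketch of the plain interval scheme (number the leaves from left to right and label each node with its smallest and largest leaf descendants). The function names and the dictionary-based tree representation are hypothetical, and the sketch assumes a tree without unary nodes, so that distinct nodes receive distinct intervals.

```python
def interval_labels(children, root):
    """Label each node with (smallest, largest) leaf number below it.

    `children` maps a node to its ordered list of children; leaves are
    absent from the map or map to an empty list.
    """
    labels = {}
    next_leaf = 0
    # Iterative post-order traversal, to avoid recursion limits on deep trees.
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # A leaf's interval is a single number, so one number suffices.
            labels[node] = (next_leaf, next_leaf)
            next_leaf += 1
        elif visited:
            labels[node] = (labels[kids[0]][0], labels[kids[-1]][1])
        else:
            stack.append((node, True))
            stack.extend((kid, False) for kid in reversed(kids))
    return labels


def is_ancestor(a, b):
    """True if the node labeled `a` is a proper ancestor of the one labeled `b`."""
    return a != b and a[0] <= b[0] and b[1] <= a[1]


tree = {"r": ["x", "c"], "x": ["a", "b"]}
labels = interval_labels(tree, "r")
print(labels["x"], is_ancestor(labels["x"], labels["a"]))
```

An internal node must store both endpoints, while a leaf needs only one of them; this is the factor of two that the 2-level schemes balance out.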

6 Experimental results
We have implemented the labeling schemes described in this paper and compared their performance on real XML data and random trees. We used two independent sets of XML files picked randomly from files collected by the XML Web-crawler of [26]. The first set contained 1822 files, with an average of 526 nodes per file, and was collected by the crawler by 22/5/2000. The second set contained 509 files, with an average of 338 nodes per file, and was collected by 12/10/2000. The bottom line is that prefix labeling schemes, even simple ones, perform about 20%-30% better than traditional interval labeling on the XML data. As we expected, the performance of the prefix labeling schemes in practice is much better than their worst case guarantees. The typical XML file is relatively balanced and does not have the pathological structure required to make our algorithms achieve their worst case bounds. Notice that the average file size in both sets of files was relatively small. We expect that on collections with larger files our algorithms will perform even better, as the contributions of the small additive constants that they incorporate become more negligible. The two-level schemes did not perform well on the real XML data.
We compared the labelings produced by the various algorithms both when a fixed length representation is used and when a variable length representation is used. When using a fixed length representation we assumed that all labels for a single file (tree) have the same length, which is determined by the largest among them. Labels in different files may have different lengths. When using variable length representation, different labels can have different lengths (even within the same tree). Each label then essentially consists of two parts: a fixed length prefix specifying the overall length of the label (in bits), followed by the label value itself. Using variable length representation we can gain, for example, from allocating smaller labels to leaves when using the interval scheme, or allocating less space to shorter labels when using prefix based schemes.
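The two accounting rules can be sketched as follows for the labels of a single file. The function names are hypothetical, and the width chosen for the length header (just enough bits to write the largest label length in the file) is an assumption made here; the text only says that the header has a fixed length.

```python
def fixed_length_size(label_lengths):
    """Total bits when every label in the file is padded to the longest one."""
    return len(label_lengths) * max(label_lengths)


def variable_length_size(label_lengths):
    """Total bits when each label carries a fixed-width length header.

    Assumption: the header is as wide as the binary form of the longest
    label length in the file.
    """
    header_bits = max(label_lengths).bit_length()
    return sum(header_bits + n for n in label_lengths)


skewed = [2, 2, 3, 12]   # a few short labels and one long one
uniform = [6, 6, 6, 6]   # all labels equally long
print(fixed_length_size(skewed), variable_length_size(skewed))
print(fixed_length_size(uniform), variable_length_size(uniform))
```

With skewed lengths the variable length representation is smaller despite the header, while with uniform lengths the header is pure overhead, which is why the two representations favor different labeling schemes.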
The results which we show for both representations are for the case where the label length could be any integral number of bits. We also compared the performance of the algorithms when each label must occupy an integral number of bytes. The results were similar and are therefore omitted. For variable length representation we also compared the algorithms while giving different weights to the nodes, according to the amount of text they contained (see Section 4). Again the results were similar and are therefore omitted.
Figure 2: XML trees, the size of the index relative to the interval scheme. Panels: (a) fixed length, (b) variable length.
Our results for the first data set of XML files are shown in Figure 2. Figure 2(a) shows the results using fixed length representation and Figure 2(b) shows the results using variable length representation. For fixed length representation we compared the prefix labeling schemes which are tailored to minimize the maximum length of a label (algorithms of the MML family). For variable length representation we used the prefix labeling schemes tailored to minimize the sum of the lengths of the labels (algorithms of the MSL family). The implementations of the compressed prefix schemes that exploit the partial order (Method 2 of Section 3) performed slightly better than the implementations that just impose an arbitrary total order, so only the results of the former are shown (C-P-OMSL and C-P-OMML). The total size of the index (total size of the labels for all files) is shown relative to the interval scheme (I). Both the 2-level interval scheme (2L-I) and the mixed 2-level scheme (2L-MML), described in Section 5, could not significantly beat the interval scheme. They typically perform much worse due to their large additive constants and the small average file size. The prefix schemes, however, both without compression (MML and MSL) and with compression (C-P-OMML and C-P-OMSL), performed better than the interval scheme. The compressed prefix schemes slightly outperform the simple prefix schemes on the first data set, while it is the other way around for fixed length representation on the second data set (not shown in this extended abstract). The reason is probably the larger average tree size in the first data set: as the file gets larger the compression is more likely to have some balancing effect on the tree, which pays for the overhead of a more complicated label structure. Overall we see that prefix labeling is about 20%-30% more compact than the conventional interval scheme.
Footnote 6: To parse a label we need to know to which type of file it belongs. Recall that the actual identification of a node in the index consists of the file id plus the node label in the file. One possibility is to incorporate into the file id information about the length of its labels. Alternatively, in a distributed environment, we can group the files based on their maximum label length and assign a different set of machines to manage each group. Then, all labels stored on a single machine have the same length.
The last algorithm whose performance is shown in Figure 2 is a degenerate 2-level interval scheme (2l-1). This algorithm first prunes leaves which are single children of their parents and then applies the interval scheme to the remaining tree. Finally it labels the pruned leaves with the label of their parent, and adds an additional bit to all labels to distinguish between pruned leaves and the other nodes. This scheme performed better than the interval scheme, in particular using variable length representation, where it is even slightly better than the prefix-based schemes. This is due to the large fraction of leaves which are only children of their parents in the XML trees.
We also compared the algorithms on two distributions of random trees. The first are trees drawn uniformly at random from the set of ordered full binary trees with $n$ vertices (for varying values of $n$) [7]. The second are trees drawn uniformly at random from the set of ordered labeled trees with $n$ vertices and no a priori bound on the degree. (Our algorithms of course did not use the labels or the order of the nodes.) We carried out this experiment for two main purposes. The first purpose was to better evaluate the additive constants of the 2-level schemes and check how large a tree needs to be for these schemes to win. The second purpose was to check to what extent the relative performance of the algorithms on the XML data depends on the particular structure of the XML files.
Figure 3: Binary trees (number of nodes in thousands vs. label bit length per node). Panels: (a) fixed length, (b) variable length.
Our results for the random binary trees are shown in Figure 3, and the results for the ordered labeled trees are shown in Figure 4. We can see that the 2-level interval scheme performs slightly better than the 2-level mixed scheme on these distributions of trees, in both representations. The additive constant of the 2-level interval scheme with respect to $\frac{3}{2}\log n$ is about 7. On both distributions of trees the 2-level schemes become competitive when the tree has over 20K nodes for fixed length representation, and over 200K nodes for variable length representation.
In contrast with the XML files, the random trees are often quite deep (for a random labeled tree with about 200K nodes the depth varies from 680 to 2030). Therefore plain prefix schemes without compression do not perform well (results are omitted). The performance of the compressed prefix schemes, however, is more interesting. For both families of random trees the appropriate compressed prefix scheme performs better than the interval scheme when using variable length representation but performs worse than the interval scheme when using fixed length representation. In particular this behavior holds when the average size of the random tree is the same as the average size of the XML files. Thus we conclude that, at least with respect to fixed length representation, the good performance of the compressed prefix scheme is due to the particular structure of the XML files.
Footnote 7 (full binary trees): Each node is either a leaf or has both a left child and a right child.

Figure 4: Labeled trees (number of nodes in thousands vs. label bit length per node). Panels: (a) fixed length, (b) variable length.

7 Conclusions and suggestions for further research

Motivated by the design of a compact index for massive amounts of XML data, we study in this paper the problem of giving short labels to the nodes of XML trees such that, given the labels of two nodes, one can determine whether one node is the ancestor of the other. In contrast to the traditional interval labeling scheme, we introduce a prefix-based approach, demonstrating its adequacy for the labeling of XML data.
One promising line for further research is to adapt the prefix labeling schemes to support updates. When an XML file is modified the search engine would want to attach labels to the new nodes without changing the labels of existing nodes. By doing that it can refrain from deleting labels from its hash table or rebuilding it altogether. Furthermore, by labeling the XML file incrementally through updates, the search engine can potentially index and answer queries about multiple versions simultaneously. One can enhance a prefix labeling scheme to be dynamic by inserting a place-holder node as an immediate child of each node that may gain descendants during updates. When insertions indeed occur the labels of these place-holders are expanded to labels of the new nodes, and new place-holders are incorporated in the new subtrees. Note that this dynamic generation of new labels is not possible with the interval scheme: even if we leave gaps in the numbering of the leaves so as to account for newly added leaves, an intensive update to one specific part of the document may consume all the vacant ids, and no additional nodes could be added without relabeling existing nodes.
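The place-holder idea can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the class name `DynamicPrefixLabels`, the choice to start from a lone root, and the particular expansion rule (the reserved string followed by 0 becomes the new node's label, and followed by 1 becomes the new place-holder). The rule is not tuned for label length; it only shows that existing labels never change.

```python
class DynamicPrefixLabels:
    """Prefix labels that survive insertions via one reserved
    place-holder string per node (illustrative sketch only)."""

    def __init__(self, root="root"):
        self.label = {root: ""}         # node -> label (bit string)
        self.placeholder = {root: "1"}  # node -> string reserved for new children

    def insert(self, parent, child):
        """Attach `child` under `parent` without touching existing labels."""
        reserved = self.placeholder[parent]
        self.label[child] = reserved + "0"         # expand the place-holder
        self.placeholder[parent] = reserved + "1"  # fresh place-holder sibling
        self.placeholder[child] = self.label[child] + "1"
        return self.label[child]

    def is_ancestor(self, a, b):
        """Ancestor test: a's label is a proper prefix of b's label."""
        la, lb = self.label[a], self.label[b]
        return la != lb and lb.startswith(la)


t = DynamicPrefixLabels()
t.insert("root", "a")
t.insert("root", "b")
t.insert("a", "c")
print(t.label, t.is_ancestor("a", "c"), t.is_ancestor("b", "c"))
```

Since each new label and each new place-holder extend the old place-holder by different bits, the strings handed out under one parent stay prefix-free, so the prefix test keeps answering ancestor queries correctly after any sequence of insertions.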

References
[1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan-Kaufmann, 340 Pine Street, Sixth Floor, San Francisco, CA 94104, October 1999.
[2] S. Abiteboul, H. Kaplan, and T. Milo. Compact labeling schemes for ancestor queries. In SODA'01, January 2001.
[3] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1, 1997.
[4] S. Alstrup, C. Gavoille, H. Kaplan, and T. Rauhe. Identifying nearest common ancestors in a distributed environment. Submitted, July 2001.
[5] Stephen Alstrup and Theis Rauhe, January 2001. Private communication at SODA'01.
[6] Altavista. http://www.altavista.com.
[7] D. B. Arnold and M. R. Sleep. Uniform random generation of balanced parenthesis strings. ACM Trans. Program. Lang. Syst., 2:122-128, 1980.
[8] A. Barkan and H. Kaplan. Partial alphabetic trees, 2001. Submitted.
[9] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In 7th WWW, 1998. http://www.google.com.
[10] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query language and optimization techniques for unstructured data. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages 505-516, 1996.
[11] D. Butler. Souped-up search engines. Nature, 405:112-115, May 2000.
[12] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for XML. In International World Wide Web Conference, 1999.
[13] E. N. Gilbert and E. F. Moore. Variable length binary encoding. Bell System Technical Journal, 38:933-968, 1959.
[14] T. C. Hu, D. J. Kleitman, and J. K. Tamaki. Binary trees optimum under various criteria. SIAM J. Appl. Math., 37(2):514-532, October 1979.
[15] T. C. Hu and C. Tucker. Optimum computer search trees. SIAM J. Appl. Math., 21:514-532, 1971.
[16] D. A. Huffman. A method for the construction of minimum-redundancy codes. In Proc. IRE, 40, pages 1098-1101, 1952.
[17] H. Kaplan and T. Milo. Short and simple labels for small distances and other functions. In 7th International Workshop on Algorithms and Data Structures (WADS), volume 2125 of LNCS. Springer, August 2001.
[18] D. E. Knuth. Optimum binary search trees. Acta Informatica, 1:14-25, 1971.
[19] K. Mehlhorn. A best possible bound for the weighted path length of binary search trees. SIAM J. Comput., 6(2):235-239, June 1977.
[20] D. Peleg. Informative labeling schemes for graphs. In Mathematical Foundations of Computer Science (MFCS), pages 579-588, 2000.
[21] N. Santoro and R. Khatib. Labeling and implicit routing in networks. The Computer J., 28:5-8, 1985.
[22] M. Thorup and U. Zwick. Compact routing schemes. In 13th ACM Symposium on Parallel Algorithms and Architectures (SPAA), 2001.
[23] W3C. Extensible markup language (XML) 1.0. http://www.w3.org/TR/REC-xml.
[24] W3C. Extensible stylesheet language (XSL). http://www.w3.org/Style/XSL/.
[25] Xdex. http://www.xmlindex.com.
[26] Xyleme. A dynamic data warehouse for the XML data of the web. http://www.xyleme.com.