Tova Milo
Ronen Shabo
Abstract
XML documents are often viewed as trees (basically the parse tree of the document), and queries over such documents typically test for ancestor relationships among tree nodes. Search engines process such queries using an index structure summarizing the ancestor relations. In the index, each document item (tree node) is identified using some logical id (node label), such that, given two labels, the engine can determine the ancestor relationship between the corresponding nodes without accessing the actual document. The length of the labels is a main factor of the index size. Therefore, reducing this length, even by a constant factor, is a critical issue.
Labelings currently being used by actual systems are all variants of the following interval scheme: number the leaves from left to right and label each node with a pair consisting of the numbers of its smallest and largest leaf descendants. An ancestor test then amounts to an interval containment test on the labels. The maximum label length with this scheme is $2 \log n$, where $n$ is the number of nodes in the tree. A considerable amount of theoretical research has recently been devoted to reducing the worst-case bounds on the label length, with the current best scheme producing labels with maximum length of about $\log n + O(\sqrt{\log n})$. In contrast, we focus here on finding a scheme that works best in practice on real XML data. For that we propose several new prefix-based labeling schemes, where an ancestor query roughly amounts to testing whether one label is a prefix of the other. We analyze our new schemes both theoretically and empirically, comparing their performance to that of previously suggested schemes. Our experimental study shows that prefix-based schemes considerably improve the space consumption on real XML data.
The problem with the current theoretically best labeling schemes is that the constant multiplying the additive $\sqrt{\log n}$ term is large and makes this term the dominant one when $n$ is the current average size of an XML file. Nevertheless, to evaluate the practicality of the approach we have also implemented and tested two simplified versions with smaller additive factors (at the price of inferior asymptotic bounds). Our study shows that even this smaller factor is still dominating and the schemes did not perform on XML data as well as our new prefix-based schemes. To obtain a better grasp of the additive constants we also tested the performance on some families of random trees (rather than just XML trees). Our experiments show that these schemes become competitive on trees with at least 20K nodes.
School of Computer Science, Faculty of Exact Sciences, Tel-Aviv University, Tel Aviv 69978, Israel.
E-mail: {haimk,milo,ronens}@post.tau.ac.il.
1 Introduction
The Web is constantly growing and contains a huge amount of useful information. To retrieve such data, people typically use search engines like Altavista [6] or Google [9], which provide full-text indexing services (the user gives a few words and the engine returns documents containing those words). The newly emerging XML Web standard [23] allows more sophisticated queries of documents. XML makes it possible to describe the semantic nature of the document components, enabling users not only to ask full-text queries (e.g. find documents containing the word "Fielding") but also to utilize the document structure to ask for more specific data (e.g. find all book items containing "Fielding" as an author and a price less than $12) [12, 24, 1, 10, 3]. The key observation guiding the design of a search engine that supports structural queries is that an XML document can be viewed as a tree whose nodes are the document items and whose edges correspond to the component-of relationship among data items. With this view, a structural query amounts to finding nodes with particular tags (book, price, author, etc.) having certain ancestor relationships between them (e.g. book nodes that are ancestors of qualifying author and price nodes).
The heart of a typical search engine [26, 25] is a big hash table whose entries are the item tag names and the words of the indexed documents. For each such tag or word the table contains the identifiers of all the documents containing it. To allow structural queries one adds to each such document identifier an additional label associated with the particular node which contains the item within the document (tree). These labels are assigned such that one can decide whether one node is an ancestor of the other based on the labels of the two nodes alone. Thus structural queries can be answered by using the index only, without access to the actual document.
To allow for good performance it is essential that the index structure (or at least a large part of it) resides in main memory. Observe that we are talking here about an extremely large memory size. Just to give a rough measure, it is estimated that the current number of Web pages is around one billion and will grow to 100 billion by 2002 [11]. Search engines typically index 15% of these pages, consuming around 0.3 of the original document size (even when only the text with the document ids is indexed, not including separation into items or ancestor information). With 10KB of text per page on average, this already leads to about 500 gigabytes for now and 50,000 gigabytes for 2002. Since the length of the node labels is a main factor of the index size, reducing it, even by a constant factor, is a critical issue, contributing directly to hardware cost reduction and performance improvement. Therefore our goal in this research is to design a compact labeling scheme for the nodes of a tree such that given the labels of two nodes one can determine whether one is the ancestor of the other.
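The rough estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name and the choice of decimal units (1KB = 1000 bytes, 1 gigabyte = 10^9 bytes) are assumptions made here, while the four input figures are the ones quoted in the text.

```python
# Figures quoted in the text; the unit conventions are an assumption.
PAGES_NOW = 1_000_000_000        # about one billion Web pages
PAGES_2002 = 100_000_000_000     # projected 100 billion pages
INDEXED_FRACTION = 0.15          # share of pages a search engine indexes
BYTES_PER_PAGE = 10 * 1000       # 10KB of text per page (decimal KB assumed)
INDEX_RATIO = 0.3                # index size relative to the original text


def index_size_gigabytes(pages):
    """Estimated text-only index size in gigabytes (10**9 bytes)."""
    indexed_bytes = pages * INDEXED_FRACTION * BYTES_PER_PAGE
    return indexed_bytes * INDEX_RATIO / 1e9


if __name__ == "__main__":
    print(round(index_size_gigabytes(PAGES_NOW)))
    print(round(index_size_gigabytes(PAGES_2002)))
```

Under these assumptions the two values come out slightly below the rounded figures of 500 and 50,000 gigabytes used in the text.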
Note however that the exact minimization objectives depend on the specific physical representation of the labels in the index:
Using a fixed-length representation, every label is allocated the same amount of space. Therefore, the longest possible label determines the length of every label, and the size of the index. With this representation, the lower the worst-case guarantee on the length of a label, the smaller the space we need to allocate for the index.
Using a variable-length representation, each label may have a different length, consisting of a fixed-length prefix stating the actual length of the label, followed by the label itself. Here, the size of the index is essentially determined by the average length of a label (rather than the maximum length of a label). However, the maximum length of a label also has some importance (even if smaller). This is because the length of the fixed prefix of every label is the logarithm of the maximum length of a label.
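The two cost models can be compared on a concrete list of label lengths. The following is a minimal sketch, assuming the length prefix is stored in just enough bits to write any value between 0 and the longest label length; the function names are illustrative and not taken from the paper.

```python
def fixed_length_bits(label_lengths):
    """Every label is padded to the length of the longest one."""
    return len(label_lengths) * max(label_lengths)


def variable_length_bits(label_lengths):
    """Each label is stored as a fixed-size length field plus the label itself."""
    longest = max(label_lengths)
    # Bits needed to write any length in 0..longest (assumed encoding).
    prefix_bits = max(1, longest.bit_length())
    return sum(prefix_bits + length for length in label_lengths)


if __name__ == "__main__":
    lengths = [3, 3, 4, 12]  # one skewed label among short ones
    print(fixed_length_bits(lengths), variable_length_bits(lengths))
```

With one long label among short ones, the fixed-length total is driven by the maximum, while the variable-length total is driven by the sum plus a small per-label overhead that grows only logarithmically with the maximum.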
Labeling schemes which are currently being used by actual systems [26, 25] are variants of the following interval scheme, first suggested by Santoro and Khatib [21]. We number the leaves from left to right and label each node with a pair consisting of the numbers of its smallest and largest leaf descendants. An ancestor query then amounts to an interval containment test on the labels. It is easy to see that the size of the labels generated by this scheme is bounded by $2 \log n$, where $n$ is the number of nodes in the tree. A variant of this scheme is integrated in the Xyleme XML warehouse system [26]. A considerable amount of work has recently been devoted to developing labeling schemes with worst-case bounds smaller than $2 \log n$. The best such labeling schemes use a recursive construction and generate labels of length $\log n + O(\sqrt{\log n})$ [5] ([5] builds upon the work of [2]; a related scheme is also suggested in [22]). Although theoretically best, these recursive schemes are not expected to improve on the simple interval scheme in practice. This is because the constant multiplying the additive $\sqrt{\log n}$ term is large and makes this term the dominant one when $n$ is the current average size of an XML file. In this paper we suggest labeling schemes which are likely to be winners in practice. We compare our schemes experimentally, using real XML data, to the interval scheme as well as to the recursive schemes of [2, 5, 22]. We also analyze our new schemes theoretically and prove worst-case guarantees on the maximum label length.
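The interval scheme described above can be sketched as follows. The tree representation (a dictionary mapping each node to its ordered list of children) and the function names are assumptions made for illustration. The sketch also assumes that every internal node has at least two children; otherwise a node and its only child would receive the same interval.

```python
def interval_labels(children, root):
    """Label each node with (smallest, largest) leaf number among its descendants.

    `children` maps a node to the ordered list of its children; leaves may be
    missing from the mapping. Leaves are numbered left to right from 0.
    """
    labels = {}
    next_leaf = 0
    stack = [(root, False)]  # (node, children already processed?)
    while stack:
        node, done = stack.pop()
        kids = children.get(node, [])
        if not kids:
            labels[node] = (next_leaf, next_leaf)
            next_leaf += 1
        elif done:
            # All children are labeled: take the span of first and last child.
            labels[node] = (labels[kids[0]][0], labels[kids[-1]][1])
        else:
            stack.append((node, True))
            for kid in reversed(kids):
                stack.append((kid, False))
    return labels


def is_ancestor(label_a, label_b):
    """Interval containment test (a node counts as its own ancestor here)."""
    return label_a[0] <= label_b[0] and label_b[1] <= label_a[1]


if __name__ == "__main__":
    tree = {"r": ["a", "b"], "a": ["c", "d"]}
    labels = interval_labels(tree, "r")
    print(labels)
    print(is_ancestor(labels["a"], labels["d"]), is_ancestor(labels["a"], labels["b"]))
```

Each label holds two numbers of at most $\log n$ bits, which is where the $2 \log n$ bound comes from.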
Our Results. In contrast to the interval scheme we suggest using a prefix-based approach, where the labeling is such that an ancestor query roughly amounts to testing whether one label is a prefix of the other. We start with a simple scheme that assigns a prefix-free collection of binary strings to the edges outgoing from each node. The label of a node $v$ is the concatenation of the labels of the edges on the path from the root to $v$. It is easy to see that $v$ is an ancestor of $u$ iff the label of $v$ is a prefix of the label of $u$. The way that the binary strings are assigned to the edges depends on the desired physical representation. For variable-length representation we assign the binary strings so as to minimize the average length of a label. For fixed-length representation we assign the strings so as to minimize the maximum length of a label.
This simple prefix scheme already exploits the structure of real XML files. It reduces the size of the index by 10% compared to the interval scheme using variable-length representation, and by more than 20% using fixed-length representation. Nevertheless this scheme suffers from bad worst-case guarantees. For example, for a tree which is a long path the labels have length linear in $n$. Therefore, a single skewed XML file may increase the size of the index considerably, in particular for fixed-length representation. To remedy this we introduce the compressed prefix scheme, which first transforms the tree into a balanced one. We perform this balancing by partitioning the tree into paths and contracting each path into a single virtual node. Then we assign a collection of prefix-free binary strings to the edges of the compressed tree outgoing from each virtual node. We label the nodes of the original tree using the assignment in the compressed tree. The labels and the ancestor test are only slightly more involved than without compression. As for the simple prefix scheme, the way we assign binary strings to the edges of the compressed tree depends upon whether we use variable-length or fixed-length representation. Using compression we obtain an $O(\log n)$ worst-case upper bound on the length of a label for both variable-length representation and fixed-length representation. As a result we can handle the indexing of isolated skewed trees without a large penalty in the size of the index. Furthermore, for fixed-length representation, where the goal is to minimize the maximum length of a label, we give an algorithm which uses compression and produces labels of length at most $2 \log n$, thus matching the upper bound of the simple interval scheme.
The compressed prefix scheme not only has good worst-case guarantees but also performs well in practice on our data. For variable-length representation we obtain, using the compressed prefix scheme, an additional 10% reduction in the size of the index compared to the simple prefix scheme. Summing up, this is a total of 20% reduction with respect to the interval scheme. Recall that in terms of the actual numbers we previously mentioned this means reducing the current index size by over 100 gigabytes. For fixed-length representation our experiments show that the compressed prefix scheme never produced labels of length larger than $1.5 \log n$. The total reduction of the index size compared to the interval scheme was about 25%. We point out that our collection of XML files did not contain any particularly skewed tree. Therefore the advantage of the compressed prefix scheme over the simple prefix scheme has not been as large as it would have been had even a single such tree appeared.
The implementations of the simple prefix scheme and the compressed prefix scheme are straightforward. Although the asymptotic worst-case time complexity of our labeling algorithms is slightly larger than that of the interval scheme, our experiments show that the actual running time of all labeling algorithms is negligible relative to the time required to load and parse the document. Therefore, the use of prefix labeling contributes no significant overhead to the time needed to prepare the index.
We also compared our prefix labeling schemes to a simplified version of the recursive labeling scheme with the currently best asymptotic bound of $\log n + O(\sqrt{\log n})$ bits on the length of the labels [5]. Rather than using a recurrence of depth $O(\sqrt{\log n})$ as in [5] (see also [22]), our implementation limits the recurrence to a two-level partition of the tree. This is done in order to reduce the constant factors while compromising for an asymptotic guarantee of $\frac{3}{2} \log n + O(1)$ bits. We also present a new 2-level scheme which combines our compressed prefix labeling scheme with ideas from [2, 5]. We prove that the new hybrid 2-level scheme has an upper bound of $\frac{3}{2} \log n + O(\log \log n)$ on the length of the labels. We also implemented and evaluated this new scheme in practice.
Despite careful tuning of the implementations of the 2-level schemes to avoid large constant factors, our experiments show that these schemes perform worse than the compressed prefix scheme and even the simple prefix scheme on real XML data. To get a better idea of the additive constant factors of the 2-level schemes, and to compare the performance of all algorithms on different distributions of trees besides XML, we used all algorithms to label random trees (from two different distributions) of varying sizes. This experiment shows that 2-level schemes become competitive on trees with at least 200K nodes.
One degenerate variant of the 2-level schemes, which prunes every leaf that is a single child of its parent, performed particularly well on the XML data using variable-length representation. This scheme exploits the fact that in a typical XML file most of the leaves are single children of their parents.
Additional Related Work: Peleg [20] considered informative labeling schemes of trees and graphs for several types of problems. Perhaps the problem closest to ours among the ones that Peleg studies is finding a labeling scheme that, given the labels of two nodes $u$ and $v$, allows finding the identifier of the lowest common ancestor of $u$ and $v$. Peleg described an $O(\log^2 n)$ labeling scheme for this problem and proved that this bound is tight up to a constant factor if the identifiers are predetermined.
Milo and Kaplan [17] have recently shown how to add the ability to answer parent queries to the theoretically best ancestor schemes with no asymptotic overhead. In another recent work Alstrup et al. [4] show a labeling scheme that allows identifying nearest common ancestors.
The structure of the paper is as follows. Section 2 describes the basic prefix labeling scheme and Section 3 presents the compressed version. The implications of full-text indexing are considered in Section 4. Section 5 describes the two-level labeling schemes. Finally, the experimental results are presented in Section 6. We conclude in Section 7. For lack of space we omit the proofs of our theoretical results and some details of our implementations from this extended abstract.
Footnote 1: The worst-case running time of our prefix labeling algorithms is $O(n \log n)$. The running time of the interval scheme is $O(n)$.
Footnote 2: The scheme then labels these pruned leaves with the label of the parent and an additional bit to distinguish them from their parent.
Footnote 3: If this is not the case to begin with, we add a child to each internal node having only one child. This transformation at most doubles the number of nodes, and since the bounds on the label length will be in terms of $\log n$ this only means the addition of at most a single bit.
Figure 1: (a) MSL vs. MML (b) long prefix labels (c) Path decomposition
assignment labels are unique, and node $v$ is an ancestor of $u$ if and only if the label of $v$ is a prefix of the label of $u$.
For variable-length physical representation, when different labels have different lengths, our goal is to find an assignment that minimizes the sum of the lengths of the labels (or equivalently the average label length). We call this optimization problem the Minimum Sum of Labels (MSL) problem. When a fixed-length physical representation is used, namely all labels have the same length (determined by the longest label), our goal is to find an assignment that minimizes the maximum length of a label. We call this optimization problem the Minimum Maximum Label (MML) problem.
To see the difference between the two problems consider for example the tree shown in Figure 1(a). To minimize the sum of the lengths of the labels we assign to the edges outgoing from the root the strings 00, 01, 10, 11, respectively. All other nodes have exactly two outgoing edges, to which we assign zero and one, respectively. (It will follow from Theorem 2.1 below that this is indeed the optimum labeling.) On the other hand, to minimize the maximum length of a label we assign to the edges outgoing from the root the strings 000, 001, 01, 1, respectively, which makes the maximum label length 5 rather than the 6 obtained with the previous assignment. (Again, it will follow from Theorem 2.2 below that this is indeed the optimum labeling.)
We solve the MSL problem using Huffman's algorithm [16]. Recall that given a set of weights $w_1, \ldots, w_n$, Huffman's algorithm finds a set $b_1, \ldots, b_n$ of prefix-free binary strings that minimizes the sum $\sum_{i=1}^{n} w_i |b_i|$. To find the strings, the algorithm constructs a binary tree, whose leaves are the given weights, that minimizes $\sum_{i=1}^{n} w_i l_i$, where $l_i$ is the number of edges on the path from the root to $w_i$. The corresponding strings are then obtained from the tree by tagging the outgoing edges of each node by zero and one and concatenating the tags along the path from the root to each $w_i$. To construct the desired binary tree we start with a forest of singleton nodes of weights $w_1, \ldots, w_n$. At each step the algorithm combines the two lightest trees into one tree (by adding a new node as a parent of the two roots) whose weight is the sum of the weights of the two lightest trees. The algorithm ends when there is only one tree in the forest.
The MSL problem breaks into a collection of independent subproblems, one for each internal node of the tree. Let $u$ be an internal node whose children are $u_1, \ldots, u_k$. To obtain the labels of the edges outgoing from $u$ we apply Huffman's algorithm to the (multi)set of weights $w_1, \ldots, w_k$ where $w_i = \mathrm{size}(u_i)$, $1 \le i \le k$. (Recall that $\mathrm{size}(u_i)$ is the size of the subtree rooted at $u_i$.) The following theorem establishes the correctness of this algorithm.
Theorem 2.1 For a given tree $T$, the above algorithm computes a prefix-free assignment that solves the MSL problem.
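The per-node use of Huffman's algorithm can be sketched as below. This is an illustrative sketch rather than the authors' implementation: the tree representation (a dictionary from node to child list), the function names, and the handling of a node with a single child (it receives the one-bit string "0") are assumptions made here.

```python
import heapq
from itertools import count


def huffman_codes(weights):
    """Prefix-free binary strings b_i minimizing sum(w_i * len(b_i))."""
    if len(weights) == 1:
        return ["0"]  # assumption: a lone child still gets one bit
    tie = count()  # tie-breaker so heap entries never compare their lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    codes = [""] * len(weights)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        for i in left:
            codes[i] = "0" + codes[i]
        for i in right:
            codes[i] = "1" + codes[i]
        # The merged tree weighs the sum of the two lightest trees.
        heapq.heappush(heap, (w1 + w2, next(tie), left + right))
    return codes


def subtree_sizes(children, root):
    """Number of nodes in the subtree rooted at each node."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    size = {}
    for v in reversed(order):  # children are processed before their parent
        size[v] = 1 + sum(size[c] for c in children.get(v, []))
    return size


def msl_labels(children, root):
    """Label every node by concatenating Huffman edge strings from the root."""
    size = subtree_sizes(children, root)
    labels = {root: ""}
    stack = [root]
    while stack:
        v = stack.pop()
        kids = children.get(v, [])
        if not kids:
            continue
        edge_strings = huffman_codes([size[k] for k in kids])
        for kid, edge in zip(kids, edge_strings):
            labels[kid] = labels[v] + edge
            stack.append(kid)
    return labels


def is_ancestor(label_a, label_b):
    """Prefix test (a node counts as its own ancestor here)."""
    return label_b.startswith(label_a)


if __name__ == "__main__":
    tree = {"r": ["a", "b", "c"], "a": ["d", "e"]}
    print(msl_labels(tree, "r"))
```

Children with large subtrees receive short edge strings, so the many nodes below them inherit short prefixes.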
We solve the MML problem using a variant of Huffman's algorithm that computes a prefix-free collection of binary strings $b_1, \ldots, b_n$ that minimizes $\max_{i=1,\ldots,n} (w_i + |b_i|)$ rather than the sum $\sum_{i=1}^{n} w_i |b_i|$. We call this variant the H-max algorithm. Given a set of weights $w_1, \ldots, w_n$ the H-max algorithm works as Huffman's algorithm, merging at each iteration the two lightest trees. The difference is in the weight which H-max assigns to the new tree: rather than taking the sum of the weights of the two merged trees, the H-max algorithm sets the weight of the new tree to be 1 + the largest weight among the two trees being merged. It is not hard to prove (see [14]) that indeed the H-max algorithm finds a string assignment which minimizes $\max_{i=1,\ldots,n} (w_i + |b_i|)$.
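The H-max merging rule can be sketched in the same style as Huffman's algorithm. The function name and the example weights below are hypothetical; the weights were chosen so that the outcome is in line with the 5-versus-6 comparison discussed earlier, but they are not taken from Figure 1.

```python
import heapq
from itertools import count


def h_max_codes(weights):
    """Prefix-free strings b_i minimizing max(w_i + len(b_i)).

    Returns the strings and the achieved maximum.
    """
    if len(weights) == 1:
        return [""], weights[0]
    tie = count()  # tie-breaker so heap entries never compare their lists
    heap = [(w, next(tie), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    codes = [""] * len(weights)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        for i in left:
            codes[i] = "0" + codes[i]
        for i in right:
            codes[i] = "1" + codes[i]
        # Unlike Huffman: the merged tree weighs 1 + the larger of the two.
        heapq.heappush(heap, (1 + max(w1, w2), next(tie), left + right))
    return codes, heap[0][0]


if __name__ == "__main__":
    weights = [2, 2, 3, 4]  # hypothetical longest-path lengths below each child
    codes, worst = h_max_codes(weights)
    print(codes, worst)
```

For these weights, giving every child a two-bit string would yield a maximum of 4 + 2 = 6, whereas the H-max assignment gives the heaviest child a one-bit string and reaches 5.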
As in the solution of the MSL problem we apply the H-max algorithm to each internal node in order to label its outgoing edges. In contrast to the MSL algorithm, here we cannot handle each node independently but have to handle them bottom-up. Let $u$ be an internal node whose children are $u_1, \ldots, u_k$. (Assume we already labeled the subtrees $T_{u_i}$.) To obtain the labels of the edges outgoing from $u$ we apply the H-max algorithm to the (multi)set of weights $w_1, \ldots, w_k$ where $w_i$ is the length of the longest concatenation of strings assigned to the edges of a root-to-leaf path in $T_{u_i}$. The following theorem establishes the correctness of this algorithm. The proof is by simple induction on the depth of the tree.
Theorem 2.2 Given a tree $T$, the above algorithm computes a prefix-free assignment that solves the MML problem.
Note that in contrast with the interval scheme the prefix-based schemes do not have a logarithmic worst-case upper bound on the maximum label length. As a simple example consider the tree in Figure 1(b). The label of the deepest leaf has $n/2$ bits. The maximum label length cannot in general exceed $n$, since an arbitrary binarization of the tree clearly gives labels of length bounded by $n$.
Footnote 4: Replace the outgoing edges of each internal node $v$ by an arbitrary binary tree with the children of $v$ at its leaves.
We partition the tree into paths recursively as follows. We start with $v$ being the root of $T$ and work top down: we first find path($v$) and mark it as a new path. Then we continue recursively from every node not on path($v$) that is a child of a node on path($v$), and decompose all the subtrees rooted at these nodes in the same way. For a path $p$, we denote by head($p$) the node on $p$ closest to the root, and by parent($p$) the parent of head($p$) in $T$. Our decomposition algorithm is illustrated in Figure 1(c). The decomposed paths are marked with thick dark lines. The head and parent of one path are marked in the figure.
We obtain the compressed tree $\hat{T}$ by contracting each path into one single node. Thus, the nodes of $\hat{T}$ are the paths of the path decomposition, and a node $q \in \hat{T}$ is a child of a node $p \in \hat{T}$ if and only if parent($q$) $\in p$. It is easy to see that the depth of $\hat{T}$ is at most $\log(n)$. (Since if $p$ is a parent of $q$ in $\hat{T}$ then size(head($q$)) $\le$ size(parent($q$))$/2$.) The representative in $\hat{T}$ of a node $v \in T$, denoted by rep($v$), is the node corresponding to the path that contains $v$. The next lemma shows that $\hat{T}$ preserves the ancestor relationship among the nodes it represents.
Lemma 3.1 If $v_1$ is an ancestor of $v_2$ in $T$ then either $v_1$ and $v_2$ are on the same path, so rep($v_1$) = rep($v_2$), or they belong to distinct paths, in which case rep($v_1$) is an ancestor of rep($v_2$) in $\hat{T}$.
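The path decomposition and the compressed tree can be sketched as follows. The definition of path($v$) is not reproduced in this part of the text, so the sketch assumes the heavy-path rule suggested by the later reference to heavy paths: from each node, the path continues to the child with the largest subtree. The data layout and all function names are illustrative.

```python
def subtree_sizes(children, root):
    """Number of nodes in the subtree rooted at each node."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    size = {}
    for v in reversed(order):  # children are processed before their parent
        size[v] = 1 + sum(size[c] for c in children.get(v, []))
    return size


def decompose(children, root):
    """Split the tree into paths; return (paths, rep, path_parent).

    paths[i] lists the nodes of path i from its head downward, rep[v] is the
    index of the path containing v, and path_parent[i] is parent(p) in T
    (None for the path that starts at the root).
    """
    size = subtree_sizes(children, root)
    paths, rep, path_parent = [], {}, {}
    stack = [(root, None)]  # (head of a new path, its parent in T)
    while stack:
        head, parent = stack.pop()
        index = len(paths)
        path_parent[index] = parent
        path, v = [], head
        while True:
            path.append(v)
            rep[v] = index
            kids = children.get(v, [])
            if not kids:
                break
            heavy = max(kids, key=lambda c: size[c])  # assumed heavy-child rule
            for c in kids:
                if c != heavy:
                    stack.append((c, v))  # every other child heads a new path
            v = heavy
        paths.append(path)
    return paths, rep, path_parent


def compressed_depth(rep, path_parent, index):
    """Depth of path `index` in the compressed tree."""
    depth = 0
    while path_parent[index] is not None:
        index = rep[path_parent[index]]
        depth += 1
    return depth


if __name__ == "__main__":
    # A long spine 0..7 with one extra leaf hanging off each spine node.
    tree = {i: [i + 1, "leaf%d" % i] for i in range(7)}
    paths, rep, path_parent = decompose(tree, 0)
    print(len(paths), max(compressed_depth(rep, path_parent, i) for i in range(len(paths))))
```

On this skewed example the original tree has depth 7, while every path of the decomposition sits at depth at most 1 in the compressed tree.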
For every internal node $p \in \hat{T}$ we define a partial order $<_p$ on the children of $p$ such that $q <_p q'$ if and only if parent($q$) is a proper ancestor of parent($q'$). We will omit the subscript from $<_p$ without confusion since every node in $\hat{T}$ participates in exactly one such partial order. For example, in the compressed tree corresponding to the tree of Figure 1(c), $K, L, M, N, O, P$ are children of $Q$, and the partial order on them satisfies $\{L, M\} < \{O, P\}$.
Assigning labels: We first label the nodes of $\hat{T}$ with a prefix-free assignment, but in addition we require that this assignment respect the partial order defined on the edges outgoing from each node of $\hat{T}$. Specifically, if $q$ and $q'$ are children of $p$ and $q < q'$, then we require that the string assigned to the edge $(p, q)$ is smaller than the string assigned to $(p, q')$. We call such an assignment to the edges of $\hat{T}$ an ordered prefix-free assignment.
Once we find an ordered prefix-free assignment to the edges of $\hat{T}$ we label each node $p \in \hat{T}$ by the concatenation of the strings assigned to the edges of $\hat{T}$ along the path from the root to $p$. Finally, we define the labels of the nodes of $T$ based on the labels of the compressed tree $\hat{T}$. Recall that the label of a node $v$ consists of two parts, code($v$) and separator($v$). We define code($v$) to be the label of rep($v$) $\in \hat{T}$, and separator($v$) to be the smallest (lexicographically) string assigned to the outgoing edges of rep($v$) which point to a representative of a child of $v$, or the empty string if rep($v$) is a leaf node.
Our ancestor test is specified by the following theorem. The proof of this theorem follows from Lemma 3.1 and the fact that the codes and separators were generated based on an ordered prefix-free assignment.
Theorem 3.2 For two nodes $v_1, v_2$ in $T$, $v_1$ is an ancestor of $v_2$ iff code($v_1$) is a prefix of code($v_2$), and separator($v_1$) $\le$ suffix(code($v_2$) $-$ code($v_1$)) $\cdot$ separator($v_2$).
As in Section 2 the algorithm is different for variable-length representation and for fixed-length representation. Notice that for each $v \in T$ the concatenation code($v$) $\cdot$ separator($v$) is a label of a node in $\hat{T}$, and each label of a node in $\hat{T}$ is code($v$) for some $v \in T$. Therefore for fixed-length representation our strategy is to find an ordered prefix-free assignment to $\hat{T}$ that minimizes the maximum length of a label of a node in $\hat{T}$. This also minimizes the maximum length of a label of a node in $T$. For variable-length representation we look for an ordered prefix-free assignment that minimizes the sum of the lengths of the code parts of the labels of the nodes of $T$.
Formally we define optimization problems that we call Ordered Minimum Sum of Labels (OMSL) and Ordered Minimum Maximum Label (OMML), which are analogous to MSL and MML of Section 2. The input to each of these problems is a tree together with a partial order defined on the children of every node. The objective is to find an ordered prefix-free assignment that either minimizes the sum of the lengths of the code parts of the labels in $T$ (OMSL), or minimizes the maximum length of a label (OMML), depending on the desired physical representation.
As for MSL and MML (see Section 2) we solve the OMSL and OMML problems by breaking them into small pieces and solving an independent problem for each node. The problem we have to solve at every node $p \in \hat{T}$ for the OMSL problem is the following. Let $w_i = \mathrm{size}_T(\mathrm{head}(q_i))$, for $1 \le i \le k$, where $q_i$ is the $i$-th child of $p$. Find a binary tree with leaves $w_1, \ldots, w_k$ that minimizes $\sum_i w_i l_i$ subject to the constraint that $w_i$ must precede $w_j$ in an inorder traversal of the tree if $q_i < q_j$. Similarly, for the OMML problem we are looking for a binary tree that minimizes $\max_i \{w_i + l_i\}$ and satisfies the order constraints. Here $w_i$ is the length of the longest concatenation of the binary strings assigned to a root-to-leaf path in the (recursive) solution to the problem rooted at $q_i$. It is straightforward to find the binary string corresponding to each edge outgoing from $p$ from the binary tree. Theorems analogous to Theorem 2.1 and Theorem 2.2 hold for these algorithms, assuming we obtain an optimal solution to the subproblem of each node $p$.
The only missing link in what we have described so far is an algorithm to find the desired binary tree for each node $p$. We cannot use Huffman's algorithm (or H-max) since these assign binary strings to an unordered set of weights without any guarantees on the lexicographic order among them. Recently, motivated by this work, Barkan and Kaplan [8] described an algorithm to find the best tree (according to both measures mentioned above) which is consistent with a partial order defined on the weights. Their algorithm works for a family of partial orders that includes those that arise in the OMSL and the OMML problems and is polynomial when the weights are polynomially bounded.
We can prove that both for the OMSL and OMML problems the length of the resulting labels is $O(\log n)$ bits. For OMML we can use the special structure of the partial orders that arise at each node to prove the following stronger bound.
Theorem 3.3 Let $T$ be a tree, and let $\hat{T}$ be the compressed tree of $T$. For the OMML problem the algorithm finds an ordered prefix-free assignment of $\hat{T}$ such that the length of the resulting labels of nodes in $T$ is bounded by $2 \log n$.
The proof of Theorem 3.3 relies on the following special structure of $<_p$ for every node $p \in \hat{T}$. We can partition the children of $p$ into blocks $B_1, \ldots, B_{l(p)}$, where $l(p)$ is the length of the path that corresponds to $p$. The block $B_i$ contains all the children of the $i$-th node on the path, and all nodes in the same block are unrelated. Furthermore, by the definition of the heavy paths, the total weight of nodes in the last block is at least half the total weight. We call a partial order on a set of weights that has these properties a balanced partial order. Our proof of Theorem 3.3 is constructive and suggests another algorithm (in addition to the optimal algorithm of [8]) that labels $\hat{T}$ with labels of length at most $2 \log n$ bits.
Unfortunately, both the algorithm of [8] and the algorithm suggested by the proof of Theorem 3.3 are quite complicated to implement. Therefore we implemented the following simple heuristics that still guarantee short labels.
Method 1 (Choose arbitrary total order): For the special case where the partial order on the set of weights $w_1, \ldots, w_k$ is in fact a total order, i.e. $w_1 < w_2 < \cdots < w_k$, there are simple polynomial-time algorithms [13, 18, 15] for finding an optimal binary tree with $w_1, \ldots, w_k$ at the leaves from left to right that minimizes either $\sum_{i=1}^{k} w_i l_i$ or $\max_{i=1,\ldots,k} (w_i + l_i)$ (where $l_i$ is the number of edges from the root to the leaf containing $w_i$). A binary tree containing $w_1, w_2, \ldots, w_k$ at the leaves from left to right is called an alphabetic tree and the corresponding binary strings are an alphabetic code. Our first heuristic picks an arbitrary total order consistent with $<$ and finds an optimal alphabetic code consistent with this total order using one of the algorithms mentioned above. We can prove that both for OMSL and for OMML this heuristic still guarantees $O(\log n)$ size labels. In particular, using a result of Mehlhorn [19] we can prove the following theorem.
Theorem 3.4 The length of the labels obtained by the above scheme adapted to solve the OMML problem is bounded by $\log(n) + 2d$, where $d$ is the depth of $\hat{T}$.
Since the depth of $\hat{T}$ is at most $\log(n)$, Theorem 3.4 implies a bound of $3 \log(n)$ on the maximum label length.
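To make the notion of an alphabetic code concrete, the sketch below computes one that minimizes $\sum_i w_i l_i$ with a textbook cubic-time dynamic program over contiguous ranges of weights. It is not claimed to be one of the algorithms of [13, 18, 15], which are more efficient; the function name is illustrative and at least two weights are assumed.

```python
from functools import lru_cache


def optimal_alphabetic_code(weights):
    """Binary strings in left-to-right order minimizing sum(w_i * len(code_i)).

    Plain O(k^3) dynamic program over contiguous ranges; assumes k >= 2.
    """
    k = len(weights)
    prefix = [0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    @lru_cache(maxsize=None)
    def best(i, j):
        """(cost, split point) of the best alphabetic tree over weights[i:j]."""
        if j - i == 1:
            return 0, None
        total = prefix[j] - prefix[i]
        cost, split = min(
            (best(i, m)[0] + best(m, j)[0], m) for m in range(i + 1, j)
        )
        # Every leaf in the range moves one level deeper under the new root.
        return cost + total, split

    codes = [""] * k

    def assign(i, j, code):
        if j - i == 1:
            codes[i] = code
            return
        split = best(i, j)[1]
        assign(i, split, code + "0")
        assign(split, j, code + "1")

    assign(0, k, "")
    return codes


if __name__ == "__main__":
    weights = [5, 1, 1, 5]
    codes = optimal_alphabetic_code(weights)
    print(codes, sum(w * len(c) for w, c in zip(weights, codes)))
```

Because left subtrees always receive a 0 and right subtrees a 1, the resulting strings are lexicographically increasing from left to right, which is the property an ordered prefix-free assignment needs.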
Method 2 (Exploiting the structure of the partial order): In this heuristic we try to exploit the fact that the partial orders are balanced with a simple algorithm. Since the partial order is balanced we can split the children of $p$ into two groups: (i) an unordered group consisting of the children in the last partition set $B_{l(p)}$, and (ii) a partially ordered group consisting of all other children. Furthermore, the weight of the first group is at least as large as that of the second group. Now we can treat each group separately: (i) run Huffman's algorithm to assign strings to nodes in the unordered first group, and (ii) complete the partial order defined on the second group to an arbitrary total order and find the optimal string assignment consistent with this total order using one of the algorithms mentioned above. Finally we have to combine the string assignments of the two parts into one string assignment consistent with the original partial order $<$. We do that by adding a leading zero to the strings of the ordered part and a leading one to the strings of the unordered part. Intuitively we are able to get shorter labels since the weight of the unordered set is large and we avoid imposing unnecessary constraints on its elements. The algorithm which we use for solving the ordered and the unordered problems at each node depends on whether we solve the OMSL problem or the OMML problem.
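The final combination step of Method 2 is small enough to sketch directly. The function names are illustrative, and the two input lists stand for string assignments that were already computed separately for the ordered and the unordered group.

```python
def combine_groups(ordered_codes, unordered_codes):
    """Prepend 0 to the ordered group's strings and 1 to the unordered group's."""
    return (["0" + c for c in ordered_codes],
            ["1" + c for c in unordered_codes])


def is_prefix_free(codes):
    """True if no string in the collection is a prefix of a different one."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)


if __name__ == "__main__":
    ordered = ["0", "10", "11"]   # e.g. an alphabetic code for the earlier blocks
    unordered = ["0", "1"]        # e.g. a Huffman code for the last block
    first, last = combine_groups(ordered, unordered)
    print(first, last)
```

Every string of the ordered part now starts with 0 and every string of the unordered part with 1, so all children in the earlier blocks compare lexicographically smaller than all children in the last block, and the combined collection remains prefix-free whenever the two inputs were.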
5 Two-level algorithms
As mentioned in the introduction the theoretically best scheme produces labels of length at most $\log n + O(\sqrt{\log n})$. This scheme is recursive. It partitions the tree into $\sqrt{\log n}$ levels of forests, labels the forests independently using the interval scheme, and then carefully assembles the label of a node from the labels of its representatives in the different forests. Although theoretically best, this recursive scheme is not expected to improve on the simple interval scheme in practice. This is because the constant multiplying the additive $\sqrt{\log n}$ term is large and makes this term the dominant one when $n$ is the current average size of an XML file or even much larger. To check whether some of the ideas behind this scheme can generate significant practical improvements we implemented a simplified version of this recursive scheme that partitions the tree into only two levels, rather than $\sqrt{\log n}$ levels.
The schemes of [2, 5], when restricted to two levels, work roughly as follows. We prune subtrees of size smaller than $\sqrt{n}$. Then even if the original tree did not have unary nodes, such nodes may exist after pruning. To eliminate these nodes we contract induced paths of length about $\sqrt{n}$ into single vertices. We also add leaves such that each group of pruned subtrees of size at most $2\sqrt{n}$ has a representative leaf in the tree which remains after pruning. We denote by $T'$ the tree that remains after these transformations. We label the pruned subtrees, and $T'$, independently, using the interval scheme. We also assign to each node $v$ on each contracted path a binary string whose length is inversely proportional to the size of the pruned subtrees hanging off of $v$. This is done such that the collection of binary strings associated with each path forms an alphabetic code where the order corresponds to the order of the vertices on the path (using the algorithm of [15]). The codeword corresponding to a vertex $v$ is concatenated to all interval labels of nodes in the subtrees hanging off of $v$. Finally we obtain the label of a pruned node $v$ by concatenating the label of its representative leaf in $T'$ with the interval of $v$ (with the alphabetic code attached to it if $v$ is pruned off of a contracted path). The label of a node $v$ on a contracted path is the label of the corresponding contracted node in $T'$ together with $v$'s alphabetic codeword. We will refer to this scheme in the sequel as the two-level interval scheme. Although we limited the recurrence to a 2-level partition of the tree one can still prove an upper bound of $\frac{3}{2} \log n + O(1)$ bits on the maximum length of a label. For more details see [2, 5].
When we determine whether a node $v$ is an ancestor of a node $u$ there are three possible cases. If $v$
and $u$ are in different subtrees (hence their labels contain different ancestor leaf labels from the
remaining tree $T'$) then they are clearly unrelated. If $v$ and $u$ belong to the same subtree we can
determine whether $v$ is an ancestor of $u$ by comparing their labels within the subtree. Finally, if $v$
is in $T'$ or on a contracted path and $u$ is in a pruned subtree we can decide by comparing the label of
$u$'s leaf ancestor in $T'$ (that is incorporated in the label of $u$) to the label of $v$ (we also have
to use the alphabetic codes in case $u$ is in a pruned subtree and $v$ is on the corresponding
contracted path).
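The case analysis above can be sketched in code under strong simplifying assumptions: contracted paths and their alphabetic codewords are ignored, every label is kept as a pair of integer intervals rather than a bit string, and a pruned node stores the one-point interval of its representative leaf in the remaining tree. The class and function names are hypothetical and not taken from [2, 5].

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Interval = Tuple[int, int]


@dataclass(frozen=True)
class TwoLevelLabel:
    # Interval in the remaining tree. For a pruned node this is the
    # one-point interval (i, i) of its representative leaf.
    top: Interval
    # Interval inside the pruned group, or None for a node of the remaining tree.
    local: Optional[Interval] = None


def contains(a: Interval, b: Interval) -> bool:
    """Interval containment test used by the interval scheme."""
    return a[0] <= b[0] and b[1] <= a[1]


def is_ancestor(v: TwoLevelLabel, u: TwoLevelLabel) -> bool:
    """True if v is an ancestor of u (a node counts as its own ancestor)."""
    if v.local is None:
        # v lies in the remaining tree: compare with u itself, or with
        # u's representative leaf when u is pruned.
        return contains(v.top, u.top)
    if u.local is None:
        # A pruned node is never an ancestor of a node of the remaining tree.
        return False
    # Both pruned: same representative leaf, then compare inside the group.
    return v.top == u.top and contains(v.local, u.local)


# Tiny example: remaining tree with leaves 1..3; leaf 2 represents a pruned group.
root = TwoLevelLabel((1, 3))
a = TwoLevelLabel((1, 2))
r = TwoLevelLabel((3, 3))
p = TwoLevelLabel((2, 2), (1, 2))   # root of a pruned subtree
q = TwoLevelLabel((2, 2), (1, 1))   # a leaf below p
```

Handling contracted paths would add one more comparison on the alphabetic codewords, which this sketch leaves out.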
Alternatively, we can use the same pruning and contraction process described above to split the tree,
but instead of using the interval scheme to label the pruned subtrees we can use the compressed prefix
scheme (see Section 3). We call this scheme in the sequel the mixed two-level scheme. Once Theorem 3.3
is established it is not hard to prove the following performance guarantee on the mixed 2-level scheme
(the proof is omitted from this extended abstract).
Theorem 5.1 The lengths of the labels generated by the mixed 2-level scheme are at most
$\frac{3}{2}\log n + O(\log\log n)$.
Intuitively, these 2-level schemes exploit the fact that, in the interval scheme, the length of a
leaf-label is half the length of a label of an internal node. They obtain the better bound on the maximum
label length by balancing out the length of a leaf-label with the length of the label of an internal node.
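A rough accounting shows where a factor of 3/2 can come from. It assumes that the tree remaining after pruning and contraction (called $T'$ here) has $O(\sqrt{n})$ nodes and that each pruned group has at most $2\sqrt{n}$ nodes, and it ignores the alphabetic codewords; these are simplifications for illustration, not the proof from [2, 5].

```latex
\begin{align*}
\text{leaf of } T' &:\; \log O(\sqrt{n}) = \tfrac{1}{2}\log n + O(1)
  \text{ bits (one number)},\\
\text{internal node of } T' &:\; 2\cdot\tfrac{1}{2}\log n + O(1) = \log n + O(1)
  \text{ bits (two numbers)},\\
\text{pruned node} &:\; \underbrace{\tfrac{1}{2}\log n}_{\text{representative leaf}}
  + \underbrace{2\log\!\left(2\sqrt{n}\right)}_{\text{interval in its group}}
  = \tfrac{3}{2}\log n + O(1) \text{ bits}.
\end{align*}
```

The short leaf label of the top level is paired with the long interval label of the bottom level, so no node pays for two full intervals.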
6 Experimental results
We have implemented the labeling schemes described in this paper and compared their performance on
real XML data and random trees. We used two independent sets of XML files picked randomly from
files collected by the XML Web-crawler of [26]. The first set contained 1822 files, with an average of
526 nodes per file, and was collected by the crawler by 22/5/2000. The second set contained 509
files, with an average of 338 nodes per file, and was collected by 12/10/2000. The bottom line is that
prefix labeling schemes, even simple ones, perform about 20%-30% better than traditional interval
labeling on the XML data. As we expected, the performance of the prefix labeling schemes in practice is
much better than their worst case guarantees. The typical XML file is relatively balanced and does not
have the pathological structure required to make our algorithms achieve their worst case bounds. Notice
that the average file size in both sets of files was relatively small. We expect that on collections with
larger files our algorithms will perform even better, as the contributions of the small additive
constants that they incorporate become more negligible. The two-level schemes did not perform well on
the real XML data.
We compared the labelings produced by various algorithms both when a fixed length representation
is used and when a variable length representation is used. When using a fixed length representation
we assumed that all labels for a single file (tree) have the same length, which is determined by the
largest among them. Labels in different files may have different lengths (footnote 6). When using
variable length representation, different labels can have different lengths (even within the same tree).
Each label then essentially consists of two parts: a fixed length prefix specifying the overall length
of the label (in bits), followed by the label value itself. Using variable length representation we can
gain, for example, from allocating smaller labels to leaves when using the interval scheme, or
allocating smaller space to shorter labels when using prefix-based schemes.
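The two accounting rules can be sketched as follows. The sketch assumes that the fixed length prefix of the variable length representation is just wide enough to encode the longest label length in the file; that width rule, the function names and the sample lengths are assumptions made for illustration.

```python
def fixed_length_bits(label_lengths):
    """Total bits when every label in the file is padded to the longest one."""
    return len(label_lengths) * max(label_lengths)


def variable_length_bits(label_lengths):
    """Total bits when each label carries a fixed-width length prefix."""
    # Assumed prefix width: enough bits to write the largest label length.
    prefix_width = max(1, max(label_lengths).bit_length())
    return sum(prefix_width + n for n in label_lengths)


# Hypothetical label lengths (in bits) for a four-node file.
lengths = [3, 3, 4, 10]
fixed_total = fixed_length_bits(lengths)        # 4 labels * 10 bits
variable_total = variable_length_bits(lengths)  # 4-bit prefix on every label
```

With one long label and several short ones, the variable length total is smaller despite the per-label prefix; when all labels have similar lengths the prefix overhead can outweigh the saving.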
The results which we show for both representations are for the case where the label length can be any
integral number of bits. We also compared the performance of the algorithms when each label must occupy
an integral number of bytes. The results were similar and are therefore omitted. For variable length
representation we also compared the algorithms while giving different weights to the nodes (according
to the amount of text they contained) (see Section 4). Again the results were similar and are therefore
omitted.
6 To parse a label we need to know to which type of file it belongs. Recall that the actual
identification of a node in the index consists of the file id plus the node label in the file. One
possibility is to incorporate into the file id information about the length of its labels.
Alternatively, in a distributed environment, we can group the files based on their maximum label
length and assign a different set of machines to manage each group. Then, all labels stored on a single
machine have the same length.
the small average file size. The prefix schemes, however, both without compression (MML and MSL)
and with compression (C-P-OMML and C-P-OMSL), performed better than the interval scheme. The
compressed prefix schemes slightly outperform the simple prefix schemes on the first data set, while it
is the other way around for fixed length representation on the second data set (not shown in this
extended abstract). The reason is probably the larger average tree size in the first data set: as the
file gets larger the compression is more likely to have some balancing effect on the tree, which pays
for the overhead of a more complicated label structure. Overall we see that prefix labeling is about
20%-30% more compact than the conventional interval scheme.
The last algorithm whose performance is shown in Figure 2 is a degenerate 2-level interval scheme
(2l-1). This algorithm first prunes leaves which are single children of their parents and then applies
the interval scheme to the remaining tree. Finally it labels the pruned leaves with the label of their
parent, and adds an additional bit to all labels to distinguish between pruned leaves and the other
nodes. This scheme performed better than the interval scheme, in particular using variable length
representation, where it is even slightly better than the prefix-based schemes. This is due to the large
fraction of leaves which are only children of their parents in the XML trees.
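The three steps of this degenerate scheme can be sketched as below. Because the tree that remains after pruning may contain unary nodes, the sketch swaps in a preorder-number variant of the interval scheme (first and last preorder number in a subtree) instead of leaf numbers; that substitution, the tuple layout of the labels and all names are assumptions made for illustration.

```python
def interval_labels(children, root):
    """Map each node to (first, last) preorder number in its subtree."""
    labels, counter = {}, 0

    def visit(node):
        nonlocal counter
        first = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        labels[node] = (first, counter - 1)

    visit(root)
    return labels


def degenerate_two_level(children, root):
    """Label nodes as (flag, first, last); flag 1 marks a pruned only-child leaf."""
    pruned, remaining = {}, {}
    for node, kids in children.items():
        if len(kids) == 1 and not children.get(kids[0]):
            pruned[kids[0]] = node        # step 1: prune the only-child leaf
            remaining[node] = []
        else:
            remaining[node] = list(kids)
    base = interval_labels(remaining, root)               # step 2
    labels = {node: (0,) + iv for node, iv in base.items()}
    for leaf, parent in pruned.items():                   # step 3
        labels[leaf] = (1,) + base[parent]
    return labels


def is_ancestor(a, b):
    """True if label a belongs to an ancestor of (or the same node as) label b."""
    if a[0] == 1:                 # a pruned leaf is an ancestor only of itself
        return a == b
    return a[1] <= b[1] and b[2] <= a[2]


# Hypothetical tree: x is the only child of a.
tree = {"r": ["a", "b"], "a": ["x"], "b": ["c", "d"]}
labels = degenerate_two_level(tree, "r")
```

A pruned leaf reuses its parent's interval, so the only extra cost is the single flag bit carried by every label.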
We also compared the algorithms on two distributions of random trees. The first are trees drawn
uniformly at random from the set of ordered full binary trees (footnote 7) with $n$ vertices (for
varying values of $n$) [7]. The second are trees drawn uniformly at random from the set of ordered
labeled trees with $n$ vertices and no a priori bound on the degree. (Our algorithms of course did not
use the labels or the order of the nodes.) We carried out this experiment for two main purposes. The
first purpose was to better evaluate the additive constants of the 2-level schemes and check how large
a tree needs to be for these schemes to win. The second purpose was to check to what extent the relative
performance of the algorithms on the XML data depends on the particular structure of the XML files.
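For readers who want to reproduce a comparable experiment, one standard way to draw an ordered tree uniformly at random is to sample a uniformly random balanced parenthesis string with the cycle lemma and read it as a tree. The sketch below covers only the second distribution (unbounded degree, node labels ignored) and is not necessarily the generator used in the paper; all names are hypothetical.

```python
import random


def random_dyck_word(pairs, rng):
    """Uniformly random balanced parenthesis string with the given number of pairs."""
    steps = [1] * pairs + [-1] * (pairs + 1)
    rng.shuffle(steps)
    # Locate the first position where the running sum reaches its minimum.
    total, lowest, cut = 0, 0, 0
    for i, step in enumerate(steps):
        total += step
        if total < lowest:
            lowest, cut = total, i + 1
    # Rotating to start right after that position leaves a balanced word
    # followed by one extra closing step, which is dropped.
    rotated = steps[cut:] + steps[:cut]
    return "".join("(" if s == 1 else ")" for s in rotated[:-1])


def is_balanced(word):
    """Check that every prefix has at least as many '(' as ')' and totals match."""
    depth = 0
    for ch in word:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0


def dyck_to_parents(word):
    """Parent array of the ordered tree encoded by word; node 0 is the root."""
    parents, current = [-1], 0
    for ch in word:
        if ch == "(":
            parents.append(current)
            current = len(parents) - 1
        else:
            current = parents[current]
    return parents


rng = random.Random(2001)                     # fixed seed for repeatability
sample = dyck_to_parents(random_dyck_word(9, rng))   # a random tree with 10 nodes
```

A word with k pairs of parentheses yields a tree with k + 1 nodes, so a tree with n vertices needs n - 1 pairs.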
7 Each node is either a leaf or has both a left child and a right child.
do not perform well (results are omitted). The performance of the compressed prefix schemes, however,
is more interesting. For both families of random trees the appropriate compressed prefix scheme performs
better than the interval scheme when using variable length representation but performs worse than the
interval scheme when using fixed length representation. In particular this behavior holds when the
average size of the random tree is the same as the average size of the XML files. Thus we conclude that,
at least with respect to fixed length representation, the good performance of the compressed prefix
scheme is due to the particular structure of the XML files.
References
[1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML.
Morgan-Kaufmann, 340 Pine Street, Sixth Floor, San Francisco, CA 94104, October 1999.
[2] S. Abiteboul, H. Kaplan, and T. Milo. Compact labeling schemes for ancestor queries. In SODA'01, January
2001.
[3] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured
data. International Journal on Digital Libraries, 1, 1997.
[4] S. Alstrup, C. Gavoille, H. Kaplan, and T. Rauhe. Identifying nearest common ancestors in a distributed
environment. Submitted, July 2001.
[5] Stephen Alstrup and Theis Rauhe, January 2001. Private communication at SODA'01.
[6] Altavista. http://www.altavista.com.
[7] D. B. Arnold and M. R. Sleep. Uniform random generation of balanced parenthesis strings. ACM Trans.
Program. Lang. Syst., 2:122-128, 1980.
[8] A. Barkan and H. Kaplan. Partial alphabetic trees, 2001. Submitted.
[9] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In 7th WWW, 1998.
http://www.google.com.
[10] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query language and optimization techniques for
unstructured data. In Proceedings of ACM-SIGMOD International Conference on Management of Data, pages
505-516, 1996.
[11] D. Butler. Souped-up search engines. Nature, 405:112-115, May 2000.
[12] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for XML. In International
World Wide Web Conference, 1999.
[13] E. N. Gilbert and E. F. Moore. Variable length binary encoding. Bell System Technical Journal, 38:933-968,
1959.
[14] T. C. Hu, D. J. Kleitman, and J. K. Tamaki. Binary trees optimum under various criteria. SIAM J. Appl.
Math., 37(2):514-532, October 1979.
[15] T. C. Hu and C. Tucker. Optimum computer search trees. SIAM J. Appl. Math., 21:514-532, 1971.
[16] D. A. Huffman. A method for the construction of minimum-redundancy codes. In Proc. IRE, 40, pages
1098-1101, 1952.
[17] H. Kaplan and T. Milo. Short and simple labels for small distances and other functions. In 7th International
Workshop on Algorithms and Data Structures (WADS), volume 2125 of LNCS. Springer, August 2001.
[18] D. E. Knuth. Optimum binary search trees. Acta Informatica, 1:14-25, 1971.
[19] K. Mehlhorn. A best possible bound for the weighted path length of binary search trees. SIAM J. Comput.,
6(2):235-239, June 1977.
[20] D. Peleg. Informative labeling schemes for graphs. In Mathematical Foundations of Computer Science (MFCS),
pages 579-588, 2000.
[21] N. Santoro and R. Khatib. Labeling and implicit routing in networks. The Computer J., 28:5-8, 1985.
[22] M. Thorup and U. Zwick. Compact routing schemes. In 13th ACM Symposium on Parallel Algorithms and
Architectures (SPAA), 2001.
[23] W3C. Extensible markup language (XML) 1.0. http://www.w3.org/TR/REC-xml.
[24] W3C. Extensible stylesheet language (XSL). http://www.w3.org/Style/XSL/.
[25] Xdex. http://www.xmlindex.com.
[26] Xyleme. A dynamic data warehouse for the XML data of the web. http://www.xyleme.com.