LSD

The LSD tree: spatial access to multidimensional point and non-point objects *
Andreas Henrich, FernUniversität Hagen, 5800 Hagen, West Germany
Hans-Werner Six, FernUniversität Hagen, 5800 Hagen, West Germany
Peter Widmayer, Universität Freiburg, 7800 Freiburg, West Germany

Abstract
We propose the Local Split Decision tree (LSD tree, for short), a data structure supporting efficient spatial access to geometric objects. Its main advantages over other structures are that it performs well for all reasonable data distributions, cover quotients (which measure the overlapping of the data objects), and bucket capacities, and that it maintains multidimensional points as well as arbitrary geometric objects. These properties, demonstrated by an extensive performance evaluation, make the LSD tree extremely suitable for the implementation of spatial access paths in geometric databases. The paging algorithm for the binary tree directory is interesting in its own right because a practical solution for the problem of how to page a (multidimensional) binary tree without access path degeneration is presented.

1. Introduction

In non-standard applications such as cartography, CAD and robotics, Database Management Systems have to organize large sets of multidimensional geometric objects on secondary storage such that these objects can quickly be retrieved according to their spatial locations. Typical spatial queries are the retrieval of an object by its coordinates (exact match) and range queries where all objects geometrically intersecting the query region are selected for further processing or presentation on a screen. Since the set of objects varies over time, insertions and deletions have to be performed as well.

Data structures which efficiently support spatial access to geometric objects usually divide the data space into cells and store all objects located in a cell in an associated data bucket. As far as multidimensional points are concerned, various efficient data structures have been proposed (see e.g. [Free87], [HSW88a], [HSW88b], [KS86], [KS88], [KW85], [NHS84], [Otoo86], [Rob81]). In typical applications, however, most objects are arbitrary geometric, i.e. non-point, objects. In many situations, it has proven to be useful to represent non-point objects by their (minimal) bounding boxes, serving as simple geometric keys. We therefore concentrate on multidimensional intervals, as far as non-point objects are concerned.

An obvious approach for storing intervals uses data structures for points. Here, intervals need not be entirely contained in a cell, but may instead intersect several cells. Hence, in order to perform range queries efficiently, information about an interval must be stored in each bucket whose corresponding cell intersects the interval. Using this so-called clipping technique, the space requirements increase substantially due to the redundant information.

The clipping approach is based on data structures which partition the data space into pairwise disjoint cells. To avoid clipping problems, in R-Trees ([Gut84], [FSR87]), respectively multilayer grid files [SW88], the data space is divided into overlapping cells, such that all, respectively most, intervals are entirely contained in a cell, i.e. need not be clipped. Unfortunately, the R-Tree suffers from a poor exact match performance and often from inefficient range queries because cells may overlap considerably in a dynamic setting. In the multilayer grid file a grid file is associated with each layer, inducing a directory overhead which deteriorates the efficiency of operations concerning only a few objects.

In the so-called transformation technique ([Hin85], [SK88]), k-dimensional intervals are interpreted as points in a 2k-dimensional space, in order to use point data structures in a standard way. For instance, a 1-dimensional interval [a, b] may be interpreted as point (a, b). Since a ≤ b holds, the image space is a triangle. The main drawbacks of this approach are that the point distribution in the triangular image space is extremely skew and that a (bounded) range query on intervals becomes a partly unbounded range query on image points.

From a wide spectrum of performance tests we have learned that the efficiency of spatial data structures depends at least on the object distribution, the cover quotient defined as the sum of all object areas divided by the area of the data space, and the bucket capacity, i.e. the maximal number of objects in a bucket. For an increasing cover quotient as well as for small bucket capacities, clipping and overlapping cell techniques deteriorate substantially. On the other hand, all non-tree structures (see e.g. [HSW88a], [KS86], [KS88], [NHS84]) degenerate for skew object distributions.

In this paper, we propose a data structure supporting spatial access to k-dimensional points as well as k-dimensional intervals. The access to intervals is based on the transformation technique. A sophisticated directory tree together with a refined splitting technique eliminates the previous drawbacks of this technique. The performance evaluation convincingly demonstrates that the new structure is well qualified for maintaining large sets of geometric objects. However, the main advantage of the new structure is not only the efficient spatial access but also its robustness. By robustness we mean that the new structure behaves well for all reasonable data distributions, cover quotients and bucket capacities.

Section 2 explains the new data structure for point objects while in section 3 the generalization to non-point objects is provided. In section 4 a performance evaluation of the new structure and a comparison with the multilayer grid file is presented. Section 5 concludes the paper.

* This work has been supported by DFG grants Si 374/1 and Wi 810/2.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the Fifteenth International Conference on Very Large Data Bases, Amsterdam, 1989

Figure 2.1: Possible partition of the data space for an LSD tree
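The transformation technique mentioned in the introduction can be illustrated for the 1-dimensional case. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, and intervals are assumed to be closed, so that touching intervals count as intersecting. It shows why an intersection query on intervals becomes a partly unbounded region on the image points: the first coordinate is bounded only from above and the second only from below.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower bound a, upper bound b) with a <= b
Point2D = Tuple[float, float]


def to_image_point(interval: Interval) -> Point2D:
    """Interpret the interval [a, b] as the 2-dimensional point (a, b)."""
    a, b = interval
    if a > b:
        raise ValueError("lower bound must not exceed upper bound")
    return (a, b)


def intersects_directly(interval: Interval, query: Interval) -> bool:
    """Reference test on the original intervals (closed intervals assumed)."""
    (a, b), (lo, hi) = interval, query
    return a <= hi and b >= lo


def image_in_query_region(point: Point2D, query: Interval) -> bool:
    """Same test on the image point (x, y): x <= hi and y >= lo.

    x has no lower limit and y has no upper limit, so the image query
    region is only partly bounded.
    """
    x, y = point
    lo, hi = query
    return x <= hi and y >= lo


def range_query(intervals: List[Interval], query: Interval) -> List[Interval]:
    """Report all intervals whose image point lies in the image query region."""
    return [iv for iv in intervals
            if image_in_query_region(to_image_point(iv), query)]


intervals = [(0.0, 2.0), (3.0, 5.0), (6.0, 9.0), (1.0, 7.0)]
query = (4.0, 6.0)
result = range_query(intervals, query)
```

Because every image point satisfies x ≤ y, all points fall on or above the diagonal, which is the triangular image space described above.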
2. The LSD tree for points
Figure 2.2: LSD tree associated with the data space partition of Figure 2.1
2.1 Basic ideas and properties
Like most structures, the new structure partitions the data space into pairwise disjoint cells with associated buckets of fixed size. In contrast to the grid file [NHS84], however, it is not grid oriented, i.e. all cell boundaries may occur at arbitrary positions. The free choice of split positions is the basis of the graceful adaptation to arbitrary skew object distributions. Since a new split position can be chosen locally optimal, i.e. optimal with respect only to the cell to be split and independent from other existing cell boundaries, we call the new structure Local Split Decision tree (LSD tree, for short). Figure 2.1 shows a possible partition of a 2-dimensional data space for an LSD tree.

The LSD directory maintaining the flexible data space partition is a binary tree similar to a kd tree [Ben75]. Each node of this tree represents one split decision by storing the split dimension and the split position. Figure 2.2 illustrates the LSD tree associated with the data space partition of Figure 2.1. It should be obvious that the directory provides the freedom for using the split strategy best suitable for the actual application. This is an important advantage over other structures (see e.g. [Free87], [HSW88a], [KS86], [KS88], [NHS84]) where split decisions are more or less influenced by previous split decisions. Furthermore, the size of the directory is directly related to the number of buckets, i.e. for n buckets the directory contains n-1 nodes. This is in contrast to the grid file where several entries in the directory may point to the same bucket.

Besides the advantages of the LSD tree directory there are some drawbacks which are typical for multidimensional binary tree structures:
1. A multidimensional binary tree may become unbalanced, i.e. may contain long paths with almost no branches, and
2. no suitable method for paging a multidimensional binary tree is known. (The interesting paging technique presented in [LZL88] is suitable only for the one-dimensional case.)

We overcome these problems by introducing a paging algorithm which preserves the following external balancing

Property: The number of external directory pages which are traversed on any two paths from the directory root to a bucket differs by at most 1.
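The external balancing property can be checked mechanically on a small directory. The sketch below is illustrative only: the `Node` and `Bucket` classes and the `page` attribute are hypothetical names, and it assumes that each external directory page holds a connected subtree, so a path enters a page exactly when the page id changes from father to son.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class Bucket:
    name: str


@dataclass
class Node:
    left: "Tree"
    right: "Tree"
    # None: the node is kept in the internal (main memory) directory;
    # otherwise the id of the external directory page storing the node.
    page: Optional[int] = None


Tree = Union[Bucket, Node]


def external_page_range(tree: Tree,
                        father_page: Optional[int] = None) -> Tuple[int, int]:
    """Return (min, max) number of external pages on paths down to a bucket."""
    if isinstance(tree, Bucket):
        return (0, 0)
    # A new external page is entered when the page id differs from the father's.
    entered = 1 if (tree.page is not None and tree.page != father_page) else 0
    lmin, lmax = external_page_range(tree.left, tree.page)
    rmin, rmax = external_page_range(tree.right, tree.page)
    return (entered + min(lmin, rmin), entered + max(lmax, rmax))


def is_externally_balanced(root: Tree) -> bool:
    """True if the page counts of any two root-to-bucket paths differ by <= 1."""
    lo, hi = external_page_range(root)
    return hi - lo <= 1


# Balanced: one path touches no external page, the others touch one.
balanced = Node(Node(Bucket("a"), Bucket("b"), page=1), Bucket("c"))

# Not balanced: paths with 0, 1 and 2 external pages coexist.
unbalanced = Node(
    Node(Node(Bucket("a"), Bucket("b"), page=2), Bucket("c"), page=1),
    Bucket("d"),
)
```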
When geometric objects are inserted into an initially empty LSD tree the directory grows up to a size when it cannot be kept in the dedicated part of the main memory any longer. Then the paging algorithm determines a subtree to be paged on secondary storage such that the external balancing property is preserved. If the subtree consists of n_s nodes, the main memory is then able to receive an additional n_s nodes until a further invocation of the paging algorithm must take place. Figure 2.3 shows the overall structure of the LSD tree.
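To make the directory of this section concrete, here is a minimal in-memory sketch of a binary directory whose nodes store a split dimension and a split position. It is an illustration under stated assumptions, not the authors' Modula-2 implementation: paging is ignored, the bucket capacity is a hypothetical constant, and the split rule (dimension of largest spread, mean coordinate as split position) is just one possible choice. The final count illustrates that n buckets go with n-1 directory nodes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Point = Tuple[float, ...]
CAPACITY = 2  # hypothetical bucket capacity, chosen small to force splits


@dataclass
class Bucket:
    points: List[Point] = field(default_factory=list)


@dataclass
class Node:
    dim: int       # split dimension
    pos: float     # split position
    left: "Tree"   # cell with coordinate <= pos in dimension dim
    right: "Tree"  # cell with coordinate > pos in dimension dim


Tree = Union[Bucket, Node]


def split(bucket: Bucket) -> Tree:
    """Replace an overflowing bucket by a node with two new buckets."""
    pts = bucket.points
    spread = lambda d: max(p[d] for p in pts) - min(p[d] for p in pts)
    dim = max(range(len(pts[0])), key=spread)  # assumed choice of dimension
    if spread(dim) == 0:
        return bucket  # all points coincide; this sketch leaves them together
    pos = sum(p[dim] for p in pts) / len(pts)  # mean coordinate
    return Node(dim, pos,
                Bucket([p for p in pts if p[dim] <= pos]),
                Bucket([p for p in pts if p[dim] > pos]))


def insert(tree: Tree, p: Point) -> Tree:
    """Descend as in a kd tree, store p, and split the bucket on overflow."""
    if isinstance(tree, Node):
        if p[tree.dim] <= tree.pos:
            tree.left = insert(tree.left, p)
        else:
            tree.right = insert(tree.right, p)
        return tree
    tree.points.append(p)
    return split(tree) if len(tree.points) > CAPACITY else tree


def find_bucket(tree: Tree, p: Point) -> Bucket:
    """Follow the split decisions down to the bucket responsible for p."""
    while isinstance(tree, Node):
        tree = tree.left if p[tree.dim] <= tree.pos else tree.right
    return tree


def count(tree: Tree) -> Tuple[int, int]:
    """Return (number of directory nodes, number of buckets)."""
    if isinstance(tree, Bucket):
        return (0, 1)
    ln, lb = count(tree.left)
    rn, rb = count(tree.right)
    return (ln + rn + 1, lb + rb)


pts = [(1.0, 1.0), (2.0, 5.0), (8.0, 3.0), (9.0, 9.0), (4.0, 7.0), (6.0, 2.0)]
tree: Tree = Bucket()
for p in pts:
    tree = insert(tree, p)
nodes, buckets = count(tree)
```

Since every split replaces one bucket by one node and two buckets, the relation between node and bucket counts holds for any split rule plugged into `split`.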
2.2 A closer look
In this section, we discuss the LSD tree in more detail by explaining the insertion of a new geometric object into the structure.
Figure 2.3: Overall structure of the LSD tree

The search for the bucket b which will receive the new object is guided by the directory as in kd trees. If b does not overflow, the insertion is finished; otherwise the bucket split algorithm creates two new buckets b_1 and b_2 from b according to a split strategy we describe afterwards. In case of a bucket split the pointer in the directory referencing b is changed to a pointer referencing a new directory node q representing the split decision concerning b, i.e. the new node q is inserted into the directory by calling the directory insertion algorithm explained later. The new node q points to the new buckets b_1 and b_2. Figure 2.4 depicts the effect of a bucket split.

Figure 2.4: Effect of a bucket split

We distinguish between two inherently distinct types of split strategies:

1. Data dependent split strategies
These strategies depend only on the objects stored in the bucket to be split. A typical example for such a strategy is to choose for the split position the average of all object coordinates with respect to a certain dimension.

2. Distribution dependent split strategies
These strategies choose the split dimension and the split position independently of the actual objects stored in the bucket to be split. A typical example for such a strategy is to split a cell into two cells of equal areas. Note that this halving split strategy relies on the assumption of a uniform distribution of the objects.

Since the LSD directory is a binary tree, any type of split strategy can be implemented in an easy and efficient manner. Note that data dependent split strategies cannot be realized by data structures based on hashing (see e.g. [HSW88a], [KS86], [KS88], [NHS84]).

We now turn our attention to the directory of the LSD tree. As already mentioned in the previous section, if the number of nodes in the directory T exceeds the maximal possible number of internal nodes, a subtree of T is written onto secondary storage, i.e. stored in a directory page. In such a directory page a subtree is organized as a sequential heap of fixed height h_p. Hence, whenever the height of the associated subtree exceeds h_p after an additional insertion, a directory page split has to be performed. The directory page split algorithm is simple: the left and right subtree of the root are stored in two distinct directory pages and the root is inserted into the directory T by calling the directory insertion algorithm.

We are now in a position to describe how a new node q, resulting from a bucket or a directory page split, is inserted into the directory T by the directory insertion algorithm. The heart of this algorithm is the paging algorithm we explain afterwards. In the following, we assume that the main memory capacity reserved for the directory T is n_i. The directory insertion algorithm assures that the internal prefix tree T_i of the directory T contains at most n_i-1 nodes. Two cases may occur:

case 1: The father p of the new node q is an internal node.
case 1.1: The number of internal nodes is less than n_i-1. Then insert q into T_i and finish.
case 1.2: The number of internal nodes is equal to n_i-1. Then insert q into T_i, call the paging algorithm for T, and finish. Note that after the execution of the paging algorithm the number of internal nodes is at most n_i-1.

case 2: The father p of the new node q is a node in a subtree T_p of T stored in an external directory page.
case 2.1: After the insertion of q the height of T_p is at most h_p; then finish.
case 2.2: After the insertion of q the height of T_p is greater than h_p. Then call the directory page split algorithm for T_p and finish.

The paging algorithm is called when after an insertion of an additional node the size of the internal prefix tree T_i reaches the maximal possible number n_i of internal nodes. The algorithm searches for a subtree T_s in T_i such that paging T_s preserves the external balancing property defined in section 2.1. This property is preserved if T_s is a paging candidate, i.e. T_s fulfills the following properties:
1. Any path from the root of T_s down to a bucket contains the minimal number of external directory pages (of all paths in the directory T).
2. The height of T_s is at most h_p.

Figure 2.5 shows a directory before and after paging the subtree T_s.

Figure 2.5: Directory before and after paging subtree T_s

If more than one paging candidate occurs in T_i, the paging algorithm chooses a candidate with the maximal possible number of nodes. In order to direct the search for a paging candidate in T_i the following numbers are attached to each node v in T_i:

ne_min(v), resp. ne_max(v): the minimal, resp. maximal, number of external directory pages occurring on any path in T containing v.
s(v): the number of nodes of the biggest paging candidate which can be reached from v.
h(v): the height of the subtree with root v in T_i.

The paging algorithm moves down the internal directory T_i branching at each node w on the search path according to the following criteria:
1. If ne_min(left son of w) ≠ ne_min(right son of w), continue with the son with lower ne_min.
2. If ne_min(left son of w) = ne_min(right son of w), continue with the son with greater s.

The root r of a paging candidate T_s is determined if
1. h(r) ≤ h_p and
2. ne_min(r) = ne_max(r).

The second condition assures that after paging T_s the external balancing property is preserved for T.

It should be clear that after the insertion of a new node q into the internal directory T_i, resp. the paging of a subtree T_s of T_i, the numbers ne_min, ne_max, s and h must be updated for each node w on the path P from the root of T_i to the father of q, resp. to the father of the root of T_s (now stored in an external directory page). These numbers can easily be recomputed from the existing numbers of the nodes (and their direct sons) on the path P.

2.3 The operations

2.3.1 Exact match
In an exact match operation the directory is traversed until the corresponding bucket is determined. The bucket is scanned until the object searched for is located or the search ends unsuccessfully.

2.3.2 Insertion
The insertion algorithm has been explained in detail in the previous section.

2.3.3 Deletion
The deletion of a node in the directory T basically works inversely to the insertion of a node. Due to space limitations we cannot discuss this topic in more detail.

2.3.4 Range query
In a range query all points located in the query region are reported. According to Fredman [Fred80], a query region may be a rectangle (orthogonal range query), a circle (circular range query) or a polygon (polygonal range query). In the following, we restrict the discussion to orthogonal range queries, because the algorithm is the same for all query types, except for the procedures evaluating whether a data region is enclosed by, intersected by, or disjoint from the query region. But these are details left to the implementation level.

In order to report all points located in the query region Q we have to traverse the LSD tree to determine all buckets whose associated cells intersect Q. The query algorithm moves down the LSD tree branching at each directory node w according to the following criteria (here D(w) denotes the data region which is the union of all data cells whose corresponding buckets can be reached from w):
1. If Q ∩ D(right son of w) = ∅, continue with the left son of w.
2. If Q ∩ D(left son of w) = ∅, continue with the right son of w.
3. Otherwise continue with both sons of w.

Note that Q ∩ D(w) ≠ ∅ is the invariant condition of the loop of the query algorithm. Hence, in 1., resp. 2., Q ∩ D(left son of w) ≠ ∅, resp. Q ∩ D(right son of w) ≠ ∅.

3. The LSD tree for non-point objects
We explain the non-point situation for k-dimensional intervals which serve as bounding boxes for arbitrary geometric objects in many applications. We restrict the discussion to the 2-dimensional situation, i.e. to rectangles in the plane, because a generalization to higher dimensions is straightforward. To store a set of rectangles in the LSD tree we use the transformation technique ([Hin85], [SK88]), i.e. 2-dimensional rectangles are stored as 4-dimensional points. We choose the simple corner representation [SK88] which considers for each of the two dimensions the lower and upper bounds of the rectangles to be distinct dimensions. The idea is simple but several severe problems arise from this approach. First, there is a strong correlation between upper and lower bounds, because for each dimension the upper bound of a rectangle is always greater than (or
equal to) the lower bound. Because of the correlation all points are located in a triangular shaped subspace of the image space. Furthermore, since in almost all applications all rectangles are small compared to the data space, the points are located in a small strip above the diagonal.

Data structures which rely on a rectangular shaped data space and partition the data space into rectangular cells tend to degenerate for such applications, especially if they are based on hashing techniques. However, the LSD tree overcomes the drawbacks of the transformation technique if a refined bucket split strategy is used. Since the split strategy is crucial to the efficiency of the LSD tree for non-point objects, we devote the next section to this topic.

3.1 The split strategies

In this section, we discuss split strategies suitable for the skew data distributions induced by the transformation technique. First of all, a suitable split strategy must take into account the correlation between lower and upper bounds of rectangles in the original dimension 1, resp. 2, stored in dimensions 1 and 2, resp. 3 and 4.

We will explain two different split strategies, a data dependent and a distribution dependent one. The data dependent split strategy is simple: For the split dimension under concern the average over all coordinates of objects stored in the bucket to be split, including the object to be inserted, is chosen as the split position.

The distribution dependent split strategy is a combination of two basic (distribution dependent) split strategies, each of them designed for an extreme situation. The first split strategy relies on the (fictitious) assumption that all rectangles are degenerated to points, i.e. the upper and lower bounds coincide for each dimension. Here, all image points are located on the diagonal w.r.t. the dimensions 1 and 2, resp. 3 and 4. A suitable split strategy for this case is to split the data cell into two cells containing equally long parts of the diagonal. The split position achieved by this split strategy is denoted by SP_1 in Figure 3.1.

The second basic split strategy relies on the assumption that all image points are uniformly distributed over the triangular subspace of the image space built from dimensions 1 and 2, resp. 3 and 4. Here, a suitable split strategy halves the data cell into two cells of equal areas. The split position achieved by this split strategy is denoted by SP_2 in Figure 3.1.

Figure 3.1: Split positions achieved by two basic split strategies

The split position SP calculated by the combined split strategy is the weighted sum of SP_1 and SP_2:

$$SP = \alpha \, SP_1 + (1 - \alpha) \, SP_2$$

where α ∈ [0, 1] is a weight derived from the size of the data cell relative to the data space. (Here L_d denotes the lower and U_d the upper bound of the data space w.r.t. the split dimension d.) The effect of the choice of α is that for large data cells SP approaches SP_1 while for small cells SP approaches SP_2. This effect is desirable for the usual situation where rectangles tend to be small compared to the data space and hence the image points tend to be located in a strip above the diagonal. We performed simulations with other roots but the 10th root behaved well in all cases.

3.2 The operations

In this section, we discuss the LSD tree operations for non-point objects. The operations exact match, insertion and deletion are identical to the corresponding operations for 4-dimensional points. In the case of a range query the situation is different, because the original 2-dimensional query region and the 4-dimensional image query region differ substantially because of the different representations of the objects.

In a range query, the query region can either be an orthogonal rectangle, a circle or a polygon. Independent of the three kinds of query regions we distinguish between two query types for a set of rectangles $\mathcal{R}$ (see [SK88]):
1. Rectangle intersection: Given a query region Q find all $R \in \mathcal{R}$ s.t. $Q \cap R \neq \emptyset$.
2. Rectangle enclosure: Given a query region Q find all $R \in \mathcal{R}$ s.t. $R \subseteq Q$.

In contrast to the point situation, the algorithm for circular and polygonal range queries is different from the algorithm for orthogonal range queries. However, due to space limitations of the paper we discuss only orthogonal range queries.

We begin explaining the rectangle intersection problem for orthogonal query regions. In this case, the original 2-dimensional query region for rectangles [l_1, u_1] × [l_2, u_2] is transformed into a 4-dimensional query region for points. For each original dimension d ∈ {1, 2} we define

$$\varphi([l_d, u_d]) \stackrel{\text{def}}{=} [L_d, u_d] \times [l_d, U_d] .$$

Then for the original query region [l_1, u_1] × [l_2, u_2] the image region is given by $\varphi([l_1, u_1]) \times \varphi([l_2, u_2])$. Figure 3.2 illustrates the transformation of a query interval [l_d, u_d].

The area of the image region can be reduced by using the transformation $\varphi'$ instead of $\varphi$. Let E_d denote the
greatest extension of an inserted rectangle for dimension d, then

$$\varphi'([l_d, u_d]) \stackrel{\text{def}}{=} [l_d - E_d, u_d] \times [l_d, u_d + E_d] .$$

Figure 3.3 illustrates the improved transformation of a query interval [l_d, u_d]. For the image region the range query algorithm for points can directly be used.

We continue the discussion with the rectangle enclosure problem for orthogonal query regions. Since in this case for each dimension d both the lower and the upper bound of a rectangle must be enclosed in the query interval [l_d, u_d], we use the simple transformation

$$\vartheta([l_d, u_d]) \stackrel{\text{def}}{=} [l_d, u_d] \times [l_d, u_d] .$$

The image query region for the rectangle enclosure problem is smaller than for the rectangle intersection problem. Hence, an enclosure query can be performed more efficiently than an intersection query. Note that this holds only for transformation techniques and not for clipping or overlapping cell techniques.

Figure 3.2: Transformation of query interval [l_d, u_d]

Figure 3.3: Improved transformation of query interval [l_d, u_d]

4. Performance evaluation

Figure 4.1: Uniformly distributed rectangles
To assess the merits of the LSD tree we have evaluated the performance for rectangles in the plane. We do not discuss the efficiency of the LSD tree for points, because the performance for rectangles is an upper bound of the performance for points. We have implemented an LSD tree on a SUN workstation in Modula-2. The internal directory is stored in an array storing 1000 nodes. An external directory page of size 512 bytes contains subtrees up to a height of 6 organized as sequential heaps. We choose bucket capacities of 5 and 50 rectangles.

The simulations are based on a sophisticated random rectangle generator creating sets of 10,000 and 100,000 rectangles according to two different distributions. These distributions are illustrated for 1,000 rectangles in Figures 4.1 and 4.2. Since the cover quotient remains constant at 2.5, in the case of 10,000, resp. 100,000, rectangles the average area of a rectangle is 10, resp. 100, times smaller than for 1,000 rectangles.

Bucket splits are performed according to the split strategies described in section 3.1. In the case of the data dependent split strategy we used a refined insertion procedure: If a new rectangle causes the split of a bucket b_1 which has a brother bucket b_2, i.e. both buckets stem from the same bucket split with split line S, and the capacity of b_2 is not exhausted, the object which is closest to S in b_1 is moved to b_2 and S is updated in the directory. Then the new rectangle can be inserted without a bucket split. Otherwise, b_1 is split.

First, we focus on the directory evaluation. We have randomly inserted 10,000, resp. 100,000, uniformly distributed rectangles and 10,000, resp. 100,000, skew distributed rectangles into an initially empty LSD tree. To simulate a worst case situation, 100,000 uniformly distributed rectangles have been inserted in sorted order. The sorting has been carried out by random insertions into an LSD
storage utilization buckets
bucket utilization
overall storage
utilization (100
zky 2,865 282 2,838 290 392 3266 25,883 2593 69.8 % 70.9 % 70.5 % 69.0 % 60.6 % 61.2 % 77.3 % 77.2 % 68.2 %
68.1%
68.8 % 66.3 8 59.0 % 59.6 % 71.8 % 71.7 %
10,000
100,000
distrib.
3.419
23,085 33,787
994
997 985
21
204 16
1
2 2
221 4,836 3501
18.7 46 8.7 % 16.2 %
Table 4.1: Size of the directory and storage utilization (directory page size = 512 bytes; max. number of internal nodes = 1000)
tree with bucket capacity 5 followed by a left to right scan through the LSD leaves. The results are shown in Table 4.1. It comes out very clearly that the size of the directory does not depend on the data distribution but on the split strategy (and of course on the size of the data set and the bucket capacity). For unsorted situations, the data dependent split strategy performs significantly better than the distribution dependent variant, while, as expected, in the sorted case the distribution dependent split strategy is the winner. For the sorted case and the data dependent split strategy the unbalance of the directory is reflected mainly by the height of the internal directory. Because of the external balancing property the number of external directory levels is 2 as for the distribution dependent split strategy. The utilization of the directory pages is mainly influenced by the split strategy and the bucket capacity (and, for the data dependent split strategy, of course by the order of insertion).

Figure 4.2: Skew distributed rectangles

For the same test set we have also measured the bucket utilization which is defined as

$$\frac{\text{number of stored objects}}{\text{number of buckets} \times \text{bucket capacity}}$$

and the overall storage utilization defined as

$$\frac{\text{number of stored objects} \times 100 \text{ bytes}}{\text{bytes needed for the LSD tree}}$$

which includes the storage space needed for the directory and some administrative information. Empty buckets are not allocated but represented by nil-pointers in the directory. The results are shown in Table 4.1. For the data dependent split strategy the bucket utilization is independent of the object distribution and slightly above the theoretical value of 69.3% (ln 2). Due to the refinement of the insertion procedure which prefers small bucket capacities the utilization is even higher for bucket capacity 5. The bucket utilization of 86.6% for the sorted situation is a consequence of the same effect. For the distribution dependent split strategy the bucket utilization is
Table 4.2: Range query performance (directory page size = 512 bytes; max. number of internal nodes = 1000)
below 69% but still above 60% and hence not bad at all. A comparison of the overall storage utilization and the bucket utilization convincingly demonstrates that the storage space needed to accommodate the directory is rather small compared to the data storage space.

We now turn our attention to the performance of the LSD tree operations. Clearly, in an exact match at most i directory pages and one bucket must be read if the directory contains i external levels. The performance of the insertion procedure is easy to estimate, too. For bucket capacity 5, resp. 50, we have between 4 and 5, resp. less than 3, external accesses (directory page and bucket I/O) per inserted object, if 100,000 objects are inserted into an initially empty LSD tree irrespective of the split strategy used. Hence, we focus on range queries. We concentrate on the intersection query because its performance is an upper bound of the enclosure query performance. (Experiments show that the enclosure queries can be carried out 10% faster than intersection queries on the average.)

Table 4.2 shows the average number of external accesses for two types of range queries. For square regions of sizes 0.5% and 5% of the size of the data space, we have performed 20 range queries each, at random positions. As with all other data structures, larger query regions lead to fewer disk accesses per found object, because the number of buckets completely contained in the query region grows faster than the number of buckets intersected by the region boundary. Another important characteristic number is the hit ratio, defined as

$$\text{hit ratio} = \frac{\text{number of objects found}}{\text{bucket capacity} \times \text{disk accesses}}$$
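The two ratios used in this section are simple to compute. The sketch below is illustrative only: the function names are hypothetical and the sample numbers are made up for the example, not measurements from the tables. It also evaluates ln 2, the theoretical bucket utilization quoted in this section.

```python
import math


def bucket_utilization(stored_objects: int, buckets: int,
                       bucket_capacity: int) -> float:
    """Stored objects divided by the total capacity of all buckets."""
    if buckets <= 0 or bucket_capacity <= 0:
        raise ValueError("buckets and bucket_capacity must be positive")
    return stored_objects / (buckets * bucket_capacity)


def hit_ratio(objects_found: int, bucket_capacity: int,
              disk_accesses: int) -> float:
    """Objects found divided by the objects that the accessed pages could hold."""
    if bucket_capacity <= 0 or disk_accesses <= 0:
        raise ValueError("bucket_capacity and disk_accesses must be positive")
    return objects_found / (bucket_capacity * disk_accesses)


# Made-up sample values, for illustration only.
example_utilization = bucket_utilization(7000, 2000, 5)
example_hit_ratio = hit_ratio(300, 5, 100)

# Theoretical bucket utilization ln 2, roughly 69.3%.
theoretical_utilization = math.log(2)
```

If only bucket accesses are counted, a query that reads nothing but buckets filled to the average level and finds every object in them has a hit ratio equal to the bucket utilization, which is why the utilization serves as the yardstick for the hit ratio.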
The hit ratio is higher for smaller bucket capacities, because of the higher selectivity. For the data dependent split strategy, bucket capacity 5, skew distribution, and query type 2, the hit ratio is 66.4% if only bucket accesses are counted. This is nearly optimal with respect to a normal bucket utilization of 69.3%. For smaller query regions and larger bucket capacities the hit ratio deteriorates: Changing the bucket capacity to 50 yields 54.2%. The performance results in the sorted situation can be explained by the fact that the data cells tend to be long and narrow for the data dependent split strategy in this case.

For the remainder of this section we compare the performance of the LSD tree and the multilayer grid file [SW88] using 5 layers (5L-GF for short). According to the multilayer philosophy layer 5 is implemented as a clipping grid file. We have inserted 50,000 uniformly distributed rectangles in random order into both initially empty structures. The cover quotient is 2.5 and the bucket capacity varies from 5 to 30 in steps of 5. It turns out that the 5L-GF is not able to work with bucket capacity 5. After the insertion of 20,974 rectangles a bucket of the clipping layer could not be split because each rectangle stored in this bucket covered the whole corresponding data cell. (For the skew data distribution the 5L-GF runs into a similar error situation even for bucket capacity 10.)

Figure 4.3 shows the bucket utilization for the LSD tree with the data dependent split strategy and the LSD tree with the distribution dependent split strategy. The range query performance is illustrated in Figures 4.4 and 4.5. We have used the same query types as before. For the first, resp. second type, 308, resp. 2712, objects are selected on the average.
Figure 4.3: Bucket utilization

Figure 4.4: Range query performance (0.5% of the data space)

Figure 4.5: Range query performance (5% of the data space)

It turns out that the LSD tree with the data dependent split strategy clearly outperforms the 5L-GF while the LSD tree with the distribution dependent split strategy is at least as efficient as the 5L-GF. We have not compared the exact match and insertion performance because it is obvious that the 5 layers of the 5L-GF do not allow a competitive performance.

It should be noted that besides its better overall performance the LSD tree is much easier to implement than the 5L-GF and does not need an additional (completely different) overflow data structure for storing objects which do not fit into the main structure.

5. Conclusion

We have proposed the LSD tree, a data structure supporting efficient spatial access to geometric objects. Its main advantages over other structures are that it performs well for all reasonable data distributions, cover quotients, and bucket capacities, and that it maintains multidimensional points as well as arbitrary geometric objects. These properties make the LSD tree extremely suitable for the implementation of spatial access paths in geometric databases. In addition to the performance evaluation an analysis of the expected storage utilization and the expected external height proves the efficiency of the LSD tree [HSW89].

At the moment, we are implementing more general (spatial) operations, like non-orthogonal range queries, point queries [SK88] and queries where geometric as well as standard attributes are qualified. Furthermore, we are embedding the LSD tree as spatial access path into the geometric database system Gral [Güt89]. Hence, an empirical study about the benefits of the LSD tree in such an environment can be carried out in the near future.

References

[Ben75] Bentley, J.L.: Multidimensional Binary Search Trees Used in Database Applications, Communications of the ACM, Vol. 18, 9, 509-517, 1975
[FSR87] Faloutsos, C., Sellis, T., Roussopoulos, N.: Analysis of Object Oriented Spatial Access Methods, Proc. ACM SIGMOD Int. Conf. on Management of Data, 426-439, 1987
[Fred80] Fredman, M.L.: The Inherent Complexity of Dynamic Data Structures Which Accommodate Range Queries, IEEE, CH1498-5/80, 1980
[Free87] Freeston, M.: The BANG file: a new kind of grid file, Proc. ACM SIGMOD Int. Conf. on Management of Data, 260-269, 1987
[Gut84] Guttman, A.: R-Trees: A Dynamic Index Structure for Spatial Searching, Proc. ACM SIGMOD Int. Conf. on Management of Data, 47-57, 1984
[Güt89] Güting, R.H.: Gral: An Extensible Relational Database System for Geometric Applications, Proc. 15th Int. Conf. on VLDB (1989), to appear
[Hin85] Hinrichs, K.: The Grid File System: Implementation and Case Studies of Applications, Doctoral Thesis No. 7734, ETH Zürich, 1985
[HSW88a] Hutflesz, A., Six, H.-W., Widmayer, P.: Globally Order Preserving Multidimensional Linear Hashing, Proc. IEEE 4th Int. Conf. on Data Engineering, 572-579, 1988
[HSW88b] Hutflesz, A., Six, H.-W., Widmayer, P.: Twin Grid Files: Space Optimizing Access Schemes, Proc. ACM SIGMOD Int. Conf. on Management of Data, 183-190, 1988
[HSW89] Henrich, A., Six, H.-W., Widmayer, P.: Paging binary trees with external balancing, Proc. Int. Workshop on Graphtheoretic Concepts in Computer Science (WG 89), Springer Lecture Notes in Comp. Science, to appear
[KS86] Kriegel, H.-P., Seeger, B.: Multidimensional Order Preserving Linear Hashing with Partial Expansions, Proc. Int. Conf. on Database Theory, 203-220, 1986
[KS88] Kriegel, H.-P., Seeger, B.: PLOP-Hashing: A Grid File without Directory, Proc. IEEE 4th Int. Conf. on Data Engineering, 369-376, 1988
[KW85] Krishnamurthy, R., Whang, K.-Y.: Multilevel Grid Files, IBM Research Report, Yorktown Heights, 1985
[LZL88] Litwin, W., Zegour, D., Levy, G.: Multilevel Trie Hashing, Proc. Int. Conference Extending Database Technology (EDBT 88), Springer Lecture Notes in Comp. Science, 309-335, 1988
[NHS84] Nievergelt, J., Hinterberger, H., Sevcik, K.C.: The Grid File: An Adaptable, Symmetric Multikey File Structure, ACM Transactions on Database Systems, Vol. 9, 1, 38-71, 1984
[Otoo86] Otoo, E.J.: Balanced Multidimensional Extendible Hash Tree, Proc. 5th ACM SIGACT / SIGMOD Symposium on Principles of Database Systems, 100-113, 1986
[Rob81] Robinson, J.T.: The K-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes, Proc. ACM SIGMOD Int. Conf. on Management of Data, 10-18, 1981
[SK88] Seeger, B., Kriegel, H.-P.: Techniques for Design and Implementation of Efficient Spatial Access Methods, Proc. 14th Int. Conf. on VLDB, 360-371, 1988
[SW88] Six, H.-W., Widmayer, P.: Spatial Searching in Geometric Databases, Proc. IEEE 4th Int. Conf. on Data Engineering, 496-503, 1988