
Multidimensional Index Structures

Professor Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Introduction
• The index structures discussed so far are one-dimensional.
• One-dimensional index structures assume a single search key and retrieve records that
match a given search-key value. (The search key can be a single field or a combination of
fields.)
• Many applications, e.g. GIS and OLAP, require us to view data as existing in a space of
two or more dimensions.

© Prof. Navneet Goyal, BITS, Pilani


Multidimensional Queries
• Geographical Information Systems (GIS)
• Online Analytical Processing (OLAP)

Sales (day, store, item, color, size)

Query: summarize the sales of pink shirts by day and store

SELECT day, store, COUNT(*)
FROM sales
WHERE item = 'shirt' AND color = 'pink'
GROUP BY day, store

Multidimensional Queries
• Partial Match Queries
all points with specified values in a subset of the dimensions
• Range Queries
all points within a given range in each dimension



Multidimensional Queries
• Nearest Neighbor Queries
closest point to a given point
- If points represent cities, we might want to find the city with a population of over
1 lakh (100,000) that is closest to a given small city
- Find the nearest ATM
• Region Queries: all retail shops within the geographic boundaries of a locality
• Where-am-I Queries
in which shape a particular point is located
- when you click your mouse, the system determines which of the displayed elements
you are clicking
Multidimensional Indexes
• Find tuples with
Dept = 'toy' AND Sal > 100K
• (Bad) solution: create an index on (Dept, Sal).
• It does not help a query on Sal alone, e.g.
Sal > 100K
• We'll look at:
grid files, partitioned hashing, multiple-key indexes, kd-trees, quad-trees, R-trees,
bitmap indexes
MD-queries in SQL
• GIS database POINTS(x,y,…)
• Nearest neighbor to point
(10.0,20.0)

[Figure taken from the textbook]
MD-queries in SQL
• GIS database Rectangles(id,xll,yll,xur,yur)
• Rectangles enclosing point
(10.0,20.0)

[Figure taken from the textbook]
B+-trees for Range Queries
§ Database of 10^6 points evenly distributed in a 1000×1000 square
§ 100 point records fit in one block
§ B+-tree indexes with pointer lists on x and on y
§ A B+-tree leaf holds 200 key-pointer pairs on average
§ Range query {(x,y) : 450 ≤ x ≤ 550, 450 ≤ y ≤ 550}
§ Find all points within the small square
§ How many points have 450 ≤ x ≤ 550?
§ 100,000 (one tenth of the 10^6 points)
§ Assume all 100,000 pointers fit in memory
§ How many disk I/Os to bring the 100,000 pointers into main memory?
§ 500 leaf blocks (100,000/200) + 1 intermediate block = 501
§ Same for y
[Figure: the 1000×1000 square, with X and Y axes running from 0 to 1000]
B+-trees for Range Queries
§ How many points are in the intersection of the two pointer sets? 10,000
§ Root node is in memory
§ Leaves hold 200 keys on average ➨ 500 leaf I/Os in each tree to get the pointer lists,
+ 1 intermediate block per tree ➨ 1002 disk I/Os
§ Retrieve 10,000 records ➨ 10,000 I/Os (random access)
§ Total: 11,002 disk I/Os
§ Sequential scan of the file = 10,000 I/Os (100 tuples per block)
§ Hence, the index is of no help
§ If the range were smaller, would there be an advantage in using the index?
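The I/O arithmetic above can be rechecked with a short calculation. This is only a sketch under the slide's assumptions (root in memory, one intermediate block per tree, one random I/O per retrieved record); the function and parameter names are illustrative, and the smaller range 495..505 is a hypothetical case added to explore the closing question.

```python
import math

def index_range_query_ios(matches_per_dim, keys_per_leaf, result_records):
    """Estimate disk I/Os for a 2-D range query answered with two B+-trees."""
    leaf_reads = math.ceil(matches_per_dim / keys_per_leaf)  # leaves scanned per tree
    per_tree = leaf_reads + 1            # plus one intermediate block; root is in memory
    return 2 * per_tree + result_records  # one random I/O per retrieved record

def sequential_scan_ios(total_records, records_per_block):
    """Every block of the data file is read exactly once."""
    return math.ceil(total_records / records_per_block)

# Slide example: 450 <= x, y <= 550 over 10^6 uniformly spread points.
large = index_range_query_ios(100_000, 200, 10_000)
scan = sequential_scan_ios(1_000_000, 100)

# Hypothetical smaller range 495 <= x, y <= 505: about 10,000 points per
# strip and about 100 points in the 10 x 10 square.
small = index_range_query_ios(10_000, 200, 100)

print(large, scan, small)
```

Under these assumptions the smaller range costs far fewer I/Os than a scan, which suggests the indexes pay off only when the query selects a small fraction of the file.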
NN-query using B+-trees

§ Find the point closest to (10,20)

§ Turn the NN query for (10,20) into the range query
{(x,y): 10-d ≤ x ≤ 10+d, 20-d ≤ y ≤ 20+d}
§ Possible problems:
§ No point in the selected range
§ The closest point within the range might not be the closest point overall

§ Solution: re-execute the range query with a slightly larger d


NN-queries
§ Choose d = 1 ➨ range query = {(x,y): 9 ≤ x ≤ 11, 19 ≤ y ≤ 21}
§ 2000 points have x in [9,11]; the same number have y in [19,21]
§ Each dimension: 10 leaf blocks (most likely 11, since the range rarely starts on a
block boundary) + 1 intermediate block = 12 I/Os to get the pointers
§ Expected number of points in the answer = 4
§ Determine from the x,y coordinates associated with the pointers which is the NN
§ One more I/O to get the answer ➨ 25 I/Os
§ If d is too small, we have to run another range query with a larger d
§ So conventional indexes are not too bad for this type of query, but how do we
choose d?
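The "retry with a larger d" strategy can be sketched as follows. This is an in-memory illustration only: the list comprehension stands in for the two B+-tree lookups and the pointer intersection, and the names `nearest_neighbor`, `d` and `grow` are assumptions made for the example.

```python
import math

def nearest_neighbor(points, q, d=1.0, grow=2.0):
    """Find the point nearest to q by repeated square range queries of half-width d."""
    if not points:
        return None
    while True:
        # Stand-in for the index-based range query on x and on y.
        candidates = [p for p in points
                      if abs(p[0] - q[0]) <= d and abs(p[1] - q[1]) <= d]
        if candidates:
            best = min(candidates, key=lambda p: math.dist(p, q))
            # Any point outside the square is more than d away, so the
            # candidate is final only if it lies within distance d.
            if math.dist(best, q) <= d:
                return best
        d *= grow  # empty range, or the candidate sits in a corner: widen and retry

# The corner point (10.9, 20.9) is inside the first square but 1.27 away,
# while (10, 21.05) lies just outside it and is closer.
pts = [(10.9, 20.9), (10, 21.05)]
result = nearest_neighbor(pts, (10, 20))
print(result)
```

The example shows both problems listed on the earlier slide: a point found in the range need not be the true nearest neighbor, so the range has to grow until the candidate is provably closest.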
Limitations
§ Conventional indexes not too bad for NN
queries as compared to range queries
§ NN query was in fact converted into a range
query
§ Would suffer from same problems if range were
large
§ The MD aggregate query of slide 3 is also not
well supported.
§ Most similar queries would require that records
from all or almost all of the blocks of the data
file be retrieved
§ Methods we plan to discuss next will provide
better performance and are used in specialized
DBMSs that support multidimensional data
Linearization
When we create a B+-tree on <age, sal>, we effectively
linearize the 2-d space since we sort entries first by age
and then by sal.
Consider entries:
<11, 80>, <12, 10>
<12, 20>, <13, 75>
A multidimensional index clusters entries so as to
exploit “nearness” in multidimensional space.
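A small check of the linearization effect, using the four entries above. The sketch simply sorts the entries as a B+-tree on <age, sal> would order them and compares Euclidean distances; nothing here is specific to any particular DBMS.

```python
import math

entries = [(12, 20), (11, 80), (13, 75), (12, 10)]

# A B+-tree on <age, sal> stores entries in lexicographic order.
linear = sorted(entries)
print(linear)

# Adjacent in the linear order, but far apart in the 2-D space:
adjacent_gap = math.dist(linear[0], linear[1])   # <11,80> vs <12,10>
# Far apart in the linear order, but close in the 2-D space:
distant_gap = math.dist(linear[0], linear[3])    # <11,80> vs <13,75>
print(round(adjacent_gap, 1), round(distant_gap, 1))
```

The entry <11, 80> is stored next to <12, 10> although <13, 75> is much nearer to it in the plane, which is the kind of "nearness" a multidimensional index tries to preserve.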



MD Index Structures
§ Hash table like approaches
§ Grid Files
§ Partitioned Hashing
§ Tree like approaches
§ Kd-trees
§ Quad trees
§ R-trees
Grid Files
§ Divide data into stripes in each dimension
§ Rectangle in grid points to bucket
§ Example: database records (age,salary) for
people who buy gold jewelry

Data:
(25,60) (45,60) (50,75) (50,100)
(50,120) (70,110) (85,140) (30,260)
(25,400) (45,350) (50,275) (60,260)

Figure taken from the text book


Grid Files
§ Outperform single-dimensional indexes for answering MD queries
§ Each region into which the space is partitioned can be thought of as a bucket of a
hash table; each point in that region has its record placed in a block belonging
to that bucket
§ Overflow blocks can be added to increase the size of a bucket
Grid Files: Lookup
§ Instead of a 1-D array of buckets, grid files use an n-dimensional array of buckets
§ To locate a point, we need to know the positions of the grid lines in each dimension
§ The positions of the point in each of the dimensions together determine the bucket
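A minimal lookup sketch for the gold-jewelry data. The grid-line positions (age 40 and 55, salary 90 and 225) are not stated on the slides and are assumed here, as is the convention that a point lying on a grid line belongs to the higher stripe.

```python
from bisect import bisect_right

# Assumed grid lines; a real grid file keeps these in main memory.
AGE_LINES = [40, 55]
SAL_LINES = [90, 225]

DATA = [(25, 60), (45, 60), (50, 75), (50, 100),
        (50, 120), (70, 110), (85, 140), (30, 260),
        (25, 400), (45, 350), (50, 275), (60, 260)]

def grid_cell(age, sal):
    """Return the (age stripe, salary stripe) of a point via binary search."""
    return bisect_right(AGE_LINES, age), bisect_right(SAL_LINES, sal)

# Place every record in the bucket of its cell.
buckets = {}
for point in DATA:
    buckets.setdefault(grid_cell(*point), []).append(point)

central = buckets[grid_cell(50, 120)]
print(grid_cell(50, 120), central)
```

With these assumed lines, the new customer (52, 200) from the insertion example falls into the same central cell as (50, 100) and (50, 120).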
Grid Files: Lookup
Grid (main mem)

Buckets (disk)

Figure taken from the text book


Grid Files: Insertion
§ Follow the procedure for lookup and place the
record in the appropriate bucket
§ If there is room – done
§ If no room –
§ Add overflow blocks to the buckets. Works well as long
as chains do not get too long
§ Reorganize the structure by adding or moving grid
lines
§ Suppose a customer with age 52 & salary $200k
buys gold jewelry
Grid Files: Insertion
§ The central bucket is already full
§ Either add an overflow bucket,
§ or split the bucket along the age or the sal. dimension
§ Possible splits
1. Vertical line at age = 51: separates the two 50's from the 52
2. Horizontal line at sal = 130
3. Horizontal line at sal = 115
§ Option 1 is not advised: it does not split any other bucket, so we are left with
more empty buckets and have not reduced the size of any other occupied bucket
Grid Files: Insertion
§ Options 2 & 3 are equally good
[Figure: grid with a line at sal = 130K]
Performance of Grid Files
§ For high dimensional problems, the number of buckets
grows exponentially with the dimension
§ Many empty buckets if large portions of a space are
empty
§ If there is high correlation between age and salary, then
all points will lie along the diagonal and irrespective of the
position of grid lines, buckets off the diagonal will be
empty
§ Works well for uniform distribution
§ Grid lines can be chosen so that:
§ Bucket matrix can be kept in memory
§ In memory indexes on values of grid lines or do MM binary
search
§ Overflow blocks are limited
Performance of Grid Files
§ Lookup: we are directed to the proper bucket, so only 1 disk I/O. If we are inserting
or deleting, an additional write is needed. An insert that requires creating an
overflow block needs one more write
§ Partial Match Queries
§ Find all customers aged 50, or all customers with sal. $200K
§ Need to look at all the buckets in a row or column of the bucket matrix
§ Only a small fraction of all the buckets would be accessed
§ Range Queries
§ Defines a rectangular region of the grid
§ 35-45 & 50-100
§ NN Queries
Grid Files: Range Queries
§ 35-45 & 50-100
Performance of Grid Files
§ NN Queries
§ Start by searching the bucket to which point P belongs
§ If at least one point is found, we have a candidate Q
§ Points in adjacent buckets may still be closer to P than Q
§ Check whether the distance between P and a border of its bucket is < dist(P,Q);
if so, the buckets across that border must be searched too
§ Example: P = (45, 200); (50,120) is the closest point in P's bucket, at distance 80.2
§ No need to search the lower 3 buckets; the other 5 must be searched
§ (30, 260) & (60, 260) are both at 61.8 from P, closer than (50,120)
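The pruning step in this example can be checked numerically. The sketch below assumes the same grid lines as before (age 40 and 55, salary 90 and 225, which the slides show only in a figure) and treats the outer stripes as unbounded; `min_dist_to_cell` is an illustrative helper name.

```python
import math

# Assumed cell boundaries of the 3 x 3 grid; outer stripes are unbounded.
AGE_EDGES = [-math.inf, 40, 55, math.inf]
SAL_EDGES = [-math.inf, 90, 225, math.inf]

def min_dist_to_cell(p, i, j):
    """Smallest possible distance from p to any point of grid cell (i, j)."""
    dx = max(AGE_EDGES[i] - p[0], 0, p[0] - AGE_EDGES[i + 1])
    dy = max(SAL_EDGES[j] - p[1], 0, p[1] - SAL_EDGES[j + 1])
    return math.hypot(dx, dy)

P = (45, 200)
best = math.dist(P, (50, 120))  # candidate found in P's own bucket (1, 1)

# A neighboring bucket needs a visit only if it could hold a closer point.
to_search = [(i, j) for i in range(3) for j in range(3)
             if (i, j) != (1, 1) and min_dist_to_cell(P, i, j) < best]

print(round(best, 1), to_search)
print(round(math.dist(P, (30, 260)), 1), round(math.dist(P, (60, 260)), 1))
```

The three buckets in the lowest salary stripe are at least 110 away from P, more than the candidate distance, so they drop out and five neighbors remain.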
Partitioned Hash Functions
§ Design a hash function that generates k bits
§ The k bits are divided among the n attributes
§ k_i bits of the hash value come from the i-th attribute
§ Σ k_i = k
§ Hash function = list of hash functions (h_1, h_2, …, h_n)
§ The bucket for a tuple (v_1, v_2, …, v_n) is computed by concatenating the bit
sequences:
h_1(v_1) h_2(v_2) … h_n(v_n)
§ Example: a hash table with 10-bit bucket numbers has 1024 buckets;
give 4 bits to the 1st attribute & 6 bits to the 2nd
Partitioned hashing
• Fixed bits for each dimension
• Example: Gold jewelry with
• first bit = age mod 2
• bits 2 and 3: salary mod 4
• Works well for partial
match, bad for range
and NN queries
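The gold-jewelry example can be written out as a tiny sketch. It uses exactly the two hash functions named above (age mod 2 for the first bit, salary mod 4 for bits 2 and 3); the helper names and the "unknown attribute means try all bit patterns" treatment of partial-match queries are illustrative assumptions.

```python
from itertools import product

def bucket(age, sal):
    """Concatenate 1 bit from the age hash with 2 bits from the salary hash."""
    return f"{age % 2:01b}{sal % 4:02b}"

def partial_match_buckets(age=None, sal=None):
    """Buckets that may hold matches when only some attributes are specified."""
    age_bits = [f"{age % 2:01b}"] if age is not None else ["0", "1"]
    sal_bits = [f"{sal % 4:02b}"] if sal is not None else ["00", "01", "10", "11"]
    return [a + s for a, s in product(age_bits, sal_bits)]

print(bucket(50, 120), bucket(50, 75), bucket(25, 60))
print(partial_match_buckets(age=50))   # 4 of the 8 buckets
print(partial_match_buckets(sal=120))  # 2 of the 8 buckets
```

A partial-match query fixes some of the bits and so visits only a fraction of the buckets. A range or NN query gets no such help, because nearby values hash to unrelated bit patterns.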
Kd-trees
• kd-tree: an early structure used for indexing in multiple dimensions.
• Each level of a k-d tree partitions the space into two.
  - Choose one dimension for partitioning at the root level of the tree.
  - Choose another dimension for partitioning in nodes at the next level, and so on,
    cycling through the dimensions.
• In each node, approximately half of the points stored in the sub-tree fall on one
side and half on the other.
• Partitioning stops when a node has fewer than a given maximum number of points.
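The construction described above can be sketched in a few lines. This is a simplified in-memory version: it splits at the median value of the current dimension, cycles age and salary, and stops at an assumed leaf capacity of 2. The dictionary layout and the names `build` and `lookup` are choices made for the example.

```python
DATA = [(25, 60), (45, 60), (50, 75), (50, 100),
        (50, 120), (70, 110), (85, 140), (30, 260),
        (25, 400), (45, 350), (50, 275), (60, 260)]

def build(points, depth=0, leaf_size=2):
    """Build a kd-tree as nested dicts, alternating the split dimension per level."""
    if len(points) <= leaf_size:
        return {"points": list(points)}
    axis = depth % 2                                  # cycle through the dimensions
    split = sorted(p[axis] for p in points)[len(points) // 2]  # median value
    left = [p for p in points if p[axis] < split]
    right = [p for p in points if p[axis] >= split]
    if not left:   # every value ties at the median; keep an oversized leaf
        return {"points": list(points)}
    return {"axis": axis, "split": split,
            "left": build(left, depth + 1, leaf_size),
            "right": build(right, depth + 1, leaf_size)}

def lookup(node, point):
    """Follow one root-to-leaf path, as in a binary search tree."""
    while "points" not in node:
        side = "left" if point[node["axis"]] < node["split"] else "right"
        node = node[side]
    return point in node["points"]

tree = build(DATA)
print(tree["axis"], tree["split"], lookup(tree, (50, 120)), lookup(tree, (35, 500)))
```

Because values equal to the split go right, a lookup never has to examine both subtrees. Falling back to an oversized leaf on ties is a shortcut for this sketch, where a fuller implementation would try the other dimension.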
KD-trees
n Main-memory data structure generalizing the
binary search tree to multi-dimensional data

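[Figure, Example 1: the points a to l in the Age (0 to 100) by Sal (0 to 400) plane, and the
corresponding kd-tree with interior nodes Sal 150 (root), Age 60, Age 47, Sal 80, Sal 300
and Age 38; the leaves hold the points a to l.]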
Example 2

• Each line in the figure (other than the outside box) corresponds to a node in the
k-d tree.
  - The maximum number of points in a leaf node has been set to 1.
• The numbering of the lines in the figure indicates the level of the tree at which
the corresponding node appears.
KD-tree operations
• Lookup: as in binary trees
• range query: age [35,55] & sal [100,200]
• Insertion: insert(35,500)
KD-trees in secondary storage
• File in a kd-tree with n leaves
• Avg. length of path from root to a leaf = log2 n
• Each node in 1 block ➨ 1 disk I/O per node
• 1000 leaves ➨ log2(1000) ≈ 10 disk I/Os on the average root-to-leaf path: too many
• Much more than the 2 or 3 disk I/Os of a B-tree, even for a much larger file
• Nodes of a kd-tree hold little information, so one block per node wastes space
• Solution for long paths & wasted space
• Group nodes into blocks
• Say, 3 nodes in 1 block
• Lookup for (25,60) takes 2 blocks!
Quad trees
• Each node of a quadtree is associated with a rectangular region of space; the top node
is associated with the entire target space.
• Each non-leaf node divides its region into four equal-sized quadrants.
  - Correspondingly, each such node has four child nodes, one per quadrant, and so on.
• Leaf nodes have between zero and some fixed maximum number of points (set to 1 in
the example).
Quad trees
• Nodes split at all dimensions at once
• Division fixed
• Cannot be balanced
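[Figure, Example: the points a to l in the Age (0 to 100) by Sal (0 to 400) plane, and the
corresponding quad tree with root (Sal 200, Age 50) and interior nodes (Sal 100, Age 75)
and (Sal 300, Age 25); the leaves hold the points a to l.]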
Quad-tree operations
• k dimensions ➨ a node has 2^k children, e.g.
k = 7 ➨ 128 children per node
• Insertion: look up the leaf and split it if necessary.
Problem: high-dimensional data
• Range query: age in [50,75], salary in
[100,200]
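A compact region quad tree for the two-dimensional case, as a sketch of "look up and split if necessary". The target space (age 0 to 100, salary 0 to 400) follows the example figure; the leaf capacity of 2, the class name `QuadNode` and the rule that points on a dividing line go to the right or upper quadrant are assumptions made for this illustration.

```python
class QuadNode:
    """Region quad tree node covering the box (x0, y0)-(x1, y1)."""

    def __init__(self, x0, y0, x1, y1, capacity=2):
        self.box = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []
        self.children = None          # None while this node is a leaf

    def _mid(self):
        x0, y0, x1, y1 = self.box
        return (x0 + x1) / 2, (y0 + y1) / 2

    def _child_for(self, p):
        mx, my = self._mid()
        # Children are ordered SW, SE, NW, NE.
        return self.children[(p[0] >= mx) + 2 * (p[1] >= my)]

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = self._mid()
        c = self.capacity
        self.children = [QuadNode(x0, y0, mx, my, c), QuadNode(mx, y0, x1, my, c),
                         QuadNode(x0, my, mx, y1, c), QuadNode(mx, my, x1, y1, c)]
        pending, self.points = self.points, []
        for p in pending:             # redistribute among all four quadrants at once
            self._child_for(p).insert(p)

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        # Note: more than `capacity` identical points would split forever.
        if len(self.points) > self.capacity:
            self._split()

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

DATA = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
        (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
root = QuadNode(0, 0, 100, 400)
for point in DATA:
    root.insert(point)
print(len(root.leaves()), [leaf.points for leaf in root.leaves() if leaf.points])
```

The division points are fixed by the region rather than by the data, so crowded areas produce deep subtrees and many leaves stay empty, which is why the tree cannot be balanced.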
R-Trees
• Generalization of B+-trees to multi-
dimensional data
• Adaptation of the B+-trees to handle
spatial data
• Height balanced data structure
• Search key for an R-tree is a collection of
intervals, with one interval per dimension
• Search key is a BOX, with each side
parallel to the axis
• Search key is a bounding box or rectangle
R-Trees
• Supported in many modern database systems, along with variants like R+-trees and
R*-trees
• We will consider only the two-dimensional case (N = 2)
  - Generalization to N > 2 is straightforward, although R-trees work well only for
    relatively small N
R-Trees
• A rectangular bounding box is associated with each tree node.
  - The bounding box of a leaf node is the minimum-sized rectangle that contains all the
    rectangles/polygons/regions associated with the leaf node.
  - The bounding box associated with a non-leaf node contains the bounding boxes
    associated with all its children.
  - The bounding box of a node serves as its key in its parent node (if any).
  - Bounding boxes of children of a node are allowed to overlap.
• A polygon/region is stored in only one node, and the bounding box of that node must
contain the polygon.
R-Trees: Example
• A set of rectangles (solid lines) and the bounding boxes (dashed lines) of the nodes of
an R-tree for the rectangles.
R-Trees
• An R-tree is a height-balanced tree similar to a B-tree.
• Leaf nodes contain pointers to data objects.
• Insertions and deletions are dynamic and can be done in any order.
• A spatial database contains a list of tuples.
• Each tuple has a unique identifier.
R-Trees
• Leaf nodes contain index records of the form (I, tuple-id), where I is the
n-dimensional bounding rectangle of the spatial object:
I = (I_0, I_1, …, I_{n-1})
where each I_i is the extent along dimension i.
• Non-leaf entries are of the form (I, child pointer), where I covers all rectangles in
the lower node's entries.
R-Trees
R-Trees: Queries
• To search for a point, compute its bounding box (BB), which is just the point itself
• Start at the root
• Check the BB of each child of the root to see if it overlaps the query box B, and if
so, search the subtree rooted at that child
• If more than one child of the root has a BB that overlaps B, then all the corresponding
subtrees must be searched (an important difference from B+-trees)
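The search procedure can be sketched as a short recursion. The node layout (dicts with a `leaf` flag and a list of `(box, child)` entries), the box format `(x_low, y_low, x_high, y_high)` and the tiny hand-built tree are all assumptions for illustration; they are not taken from any particular R-tree library.

```python
def overlaps(a, b):
    """True if boxes a and b, given as (x_low, y_low, x_high, y_high), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, query):
    """Collect the ids of all stored rectangles whose boxes overlap the query box."""
    found = []
    for box, child in node["entries"]:
        if overlaps(box, query):
            if node["leaf"]:
                found.append(child)                  # child is an object id
            else:
                found.extend(search(child, query))   # descend into every hit
    return found

# Hand-built two-level tree whose two leaf bounding boxes overlap.
leaf1 = {"leaf": True, "entries": [((0, 0, 6, 6), "A"), ((1, 7, 3, 9), "B")]}
leaf2 = {"leaf": True, "entries": [((4, 4, 10, 10), "C"), ((8, 0, 12, 3), "D")]}
root = {"leaf": False,
        "entries": [((0, 0, 6, 9), leaf1), ((4, 0, 12, 10), leaf2)]}

point_query = (5, 5, 5, 5)   # the bounding box of a point is the point itself
print(search(root, point_query))
```

The point (5, 5) lies inside both leaf bounding boxes, so both subtrees are visited, unlike a B+-tree lookup, which follows a single path.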
R-Trees
• [Guttman 84] Main idea: allow parents to overlap!
• => guaranteed 50% utilization
• => easier insertion/split algorithms
• (only deal with Minimum Bounding Rectangles - MBRs)
R-trees
• e.g., with fanout 4: group nearby rectangles into parent MBRs; each group -> disk page
[Figures, over three slides: the rectangles A to J are grouped into parent MBRs P1 to P4;
the root holds (P1, P2, P3, P4) and the leaves hold (A, B, C), (H, I, J), (D, E) and (F, G).]
R-trees - format of nodes
• {(MBR; obj-ptr)} for leaf nodes
[Figure: each leaf entry, e.g. in the leaf (A, B, C), stores x-low, x-high, y-low, y-high
and an object pointer.]
R-trees - format of nodes
• {(MBR; node-ptr)} for non-leaf nodes
[Figure: each non-leaf entry, e.g. in the root (P1, P2, P3, P4), stores x-low, x-high,
y-low, y-high and a node pointer.]
R-trees - range search?
[Figure, shown on two slides: the rectangles A to J with parent MBRs P1 to P4, and the tree
with root (P1, P2, P3, P4) and leaves (A, B, C), (H, I, J), (D, E) and (F, G).]
R-trees - range search
Observations:
• every parent node completely covers its ‘children’
• a child MBR may be covered by more than one parent; it is stored under ONLY ONE
of them
• a point query may follow multiple branches
• everything works for any dimensionality
R-trees - insertion
• e.g., rectangle ‘X’
[Figure: X is added to the leaf that holds D and E, which becomes (D, E, X).]
R-trees - insertion
• e.g., rectangle ‘Y’
[Figure: Y is added to the leaf that holds D and E, which becomes (D, E, Y).]
R-trees - insertion
• e.g., rectangle ‘Y’: extend a suitable parent.
• Q: how to measure ‘suitability’?
• A: minimize the increase in area (volume)
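The "least enlargement" rule can be sketched as below. Boxes are assumed to be `(x_low, y_low, x_high, y_high)` tuples, and ties are broken in favor of the smaller box, as in Guttman's original description; the function names are illustrative.

```python
def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def enlarge(box, rect):
    """Smallest box that covers both `box` and `rect`."""
    return (min(box[0], rect[0]), min(box[1], rect[1]),
            max(box[2], rect[2]), max(box[3], rect[3]))

def choose_subtree(boxes, rect):
    """Index of the parent MBR that needs the least area increase to hold rect."""
    def cost(i):
        growth = area(enlarge(boxes[i], rect)) - area(boxes[i])
        return (growth, area(boxes[i]))   # tie-break: prefer the smaller box
    return min(range(len(boxes)), key=cost)

parents = [(0, 0, 4, 4), (6, 0, 10, 4)]
new_rect = (4, 1, 5, 2)
chosen = choose_subtree(parents, new_rect)
print(chosen, enlarge(parents[chosen], new_rect))
```

Here the first parent grows by 4 units of area and the second by 8, so the new rectangle goes under the first. A rectangle already inside a parent costs zero enlargement.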

NODE SPLITTING
• Add a new entry to a full node with M entries by dividing the M+1 entries between two
nodes
• Criterion: minimize the total area of the resulting rectangles after the split

Bitmap indexes
• Fixed record positions 1,2, ...,n
• bitvector of length n for each key
• Example. File = 1:(30,foo), 2:(30,bar),
3:(40,baz), 4:(50,foo), 5:(40,bar),
6:(30,baz)
• Bitvector index on second field:
Key vector
-----------------
foo 100100
bar 010010
baz 001001
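The bit-vectors above can be reproduced with a few lines. The sketch keeps bit-vectors as strings for readability (a real index would pack them into machine words), and the helper names are illustrative.

```python
def build_bitmap_index(values):
    """Map each distinct key to a bit-string with a 1 at every record holding it."""
    return {key: "".join("1" if v == key else "0" for v in values)
            for key in dict.fromkeys(values)}   # dict.fromkeys keeps first-seen order

def bit_and(a, b):
    """Bitwise AND of two equal-length bit-strings."""
    return "".join("1" if x == y == "1" else "0" for x, y in zip(a, b))

records = [(30, "foo"), (30, "bar"), (40, "baz"),
           (50, "foo"), (40, "bar"), (30, "baz")]
first = build_bitmap_index([r[0] for r in records])
second = build_bitmap_index([r[1] for r in records])

# Records with first field 30 AND second field foo: only record 1.
match = bit_and(first[30], second["foo"])
print(second, match)
```

Combining conditions on several fields reduces to ANDing or ORing the corresponding vectors, which is what makes multi-attribute queries cheap with this structure.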
Bitmap indexes ...
• Possibly takes too much space: n records and m values ➨ n × m bits needed
• Operations are easy: for the gold jewelry data, find
{(age,sal): age in [45,50], sal in [100,200]}
[Figure: the data with the bit-vectors of its age index and its sal index]
Compressed Bitmaps:
Run-length Encoding
§ We represent a run, that is, a sequence of i 0's followed by a 1, by some suitable
binary encoding of the integer i.
§ Concatenate the codes for each run together, and that sequence of bits represents the
entire bit-vector. This is done for all bit-vectors in a bitmap.
§ Consider the bit-vector 000101. It contains 2 runs, 0001 and 01, of lengths 3 and 1
respectively. 3 = (11)_2 and 1 = (1)_2
§ Simply concatenating these binary codes gives 111 for 000101
§ But 010001 (runs of lengths 1 and 3) also gives 111, so plain binary codes cannot be
decoded unambiguously
Run-length Encoding
j = number of bits in the binary representation of i, so j ≈ log2 i.
j is represented in unary by j-1 1's and a single 0. Then we follow with i in binary.

§ Consider the bit-vector 00000000000001
§ i = 13, so j = 4. Thus the encoding of i begins with 1110. We follow this by i in
binary, or 1101. Thus the encoding for 13 is 11101101.
RLE: Example
Encode the bit-vector 000101 using RLE
Two runs of 3 and 1. For i=3, j=2 therefore 0001
is encoded as 1011
For i=1, j=1 therefore 01 is encoded as 01.
So 000101 will be encoded as 1011 01
Now decode 101101
First zero at 2nd bit therefore j=2. Next 2 bits are
11. So the first four bits are decoded as 0001.
Consider the second part. First bit is 0 therefore
j=1. Next bit is 1. So the last 2 bits are decoded
as 01
So 101101 is decoded to 000101
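The encoding and decoding steps traced above can be sketched as follows. Bit-vectors are handled as strings of '0' and '1' for clarity; the function names are illustrative, and a run of length 0 is encoded as 00, which is consistent with the encoded vectors used later in the slides.

```python
def encode_run(i):
    """Encode a run of i zeros followed by a 1: unary length prefix, then i in binary."""
    binary = format(i, "b")                 # "0" when i == 0
    return "1" * (len(binary) - 1) + "0" + binary

def encode(bits):
    """Run-length encode a bit-string; zeros after the last 1 are not encoded."""
    out, run = [], 0
    for bit in bits:
        if bit == "0":
            run += 1
        else:
            out.append(encode_run(run))
            run = 0
    return "".join(out)

def decode(code):
    """Invert encode(); the result always ends in a 1."""
    out, pos = [], 0
    while pos < len(code):
        j = code.index("0", pos) - pos + 1  # unary prefix: j-1 ones, then a zero
        pos += j
        i = int(code[pos:pos + j], 2)       # next j bits hold the run length
        pos += j
        out.append("0" * i + "1")
    return "".join(out)

print(encode("000101"), decode("101101"), encode_run(13))
```

The unary prefix tells the decoder how many bits to read next, which removes the ambiguity of plain binary codes.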
RLE: Example
Note that every decoded bit-vector ends in a 1. Trailing zeros are not encoded, but
they can be restored using the cardinality of the relation, which gives the length of
every bit-vector.
RLE: Example
Operations (AND/OR) on RLE Bit-Vectors
The encoded bit-vector for Age 25 is 00110111
The encoded bit-vector for Age 30 is 110111
To perform AND/OR operation on these encoded bit-
vectors, we do not need to decode them fully.
Decoding first runs of both bit-vectors we see that the
runs are 0 & 7 respectively.
First bit-vector has first 1 at position 1
Second bit-vector has first 1 at position 8
Decoding the second run of the first bit-vector, we see that the run is 7, so its
next 1 is at position 9
Therefore the bit-vector generated by OR operation is
100000011.
Result: 1st, 8th, & 9th records are retrieved
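The steps above amount to decoding one run at a time and tracking the position of each 1. A minimal sketch, assuming the same run-length code as before; for brevity it gathers the positions into sets rather than merging the two streams run by run, and the function names are illustrative.

```python
def one_positions(code):
    """Yield the 1-based positions of the 1s in an RLE bit-vector, run by run."""
    pos, record = 0, 0
    while pos < len(code):
        j = code.index("0", pos) - pos + 1       # length of the unary prefix
        run = int(code[pos + j:pos + 2 * j], 2)  # number of zeros before the next 1
        pos += 2 * j
        record += run + 1
        yield record

def rle_or(a, b):
    """Positions set in either vector."""
    return sorted(set(one_positions(a)) | set(one_positions(b)))

def rle_and(a, b):
    """Positions set in both vectors."""
    return sorted(set(one_positions(a)) & set(one_positions(b)))

age25, age30 = "00110111", "110111"
hits = rle_or(age25, age30)
as_bits = "".join("1" if k in hits else "0" for k in range(1, hits[-1] + 1))
print(list(one_positions(age25)), list(one_positions(age30)), hits, as_bits)
```

Since the two age vectors never share a position, their AND is empty, while their OR selects the 1st, 8th and 9th records as on the slide.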
Curse of Dimensionality

12 dimensional data (only 2 dims shown)


Figure taken from PRML by Bishop (Springer)
Curse of Dimensionality

Figure taken from PRML by Bishop (Springer)
