Index Structures
© Prof. Navneet Goyal, BITS, Pilani
Multidimensional Queries
• Partial Match Queries: all points with specified values in a subset of dimensions
• Range Queries: all points within a range in each dimension
Figure taken from the textbook
MD-queries in SQL
• GIS database Rectangles(id, xll, yll, xur, yur), where (xll, yll) is the lower-left corner and (xur, yur) the upper-right corner
• Query: rectangles enclosing the point (10.0, 20.0)
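The point-enclosure query can be tried out with Python's built-in sqlite3 module. This is only a sketch: the table layout follows the slide, while the three sample rows and the helper name `enclosing` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Rectangles (id INTEGER, xll REAL, yll REAL, xur REAL, yur REAL)"
)
# Hypothetical sample rectangles (not from the slides).
conn.executemany(
    "INSERT INTO Rectangles VALUES (?, ?, ?, ?, ?)",
    [
        (1, 0.0, 0.0, 15.0, 25.0),
        (2, 12.0, 18.0, 30.0, 40.0),
        (3, 5.0, 5.0, 50.0, 50.0),
    ],
)

def enclosing(conn, x, y):
    """Ids of rectangles whose x-range and y-range both contain the point."""
    rows = conn.execute(
        "SELECT id FROM Rectangles "
        "WHERE xll <= ? AND ? <= xur AND yll <= ? AND ? <= yur "
        "ORDER BY id",
        (x, x, y, y),
    )
    return [row[0] for row in rows]

enclosing_ids = enclosing(conn, 10.0, 20.0)
print(enclosing_ids)
```

The WHERE clause is a range condition on four columns at once, which is why a single one-dimensional index helps little with this kind of query.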
Figure taken from the textbook
B+-trees for Range Queries
§ Database of 10^6 points evenly distributed in a 1000×1000 square
§ 100 point records fit in one block
§ B+-tree indexes with pointer lists on x and on y
§ B+-tree leaf has 200 key-pointer pairs on average
§ Range query {(x,y) : 450 ≤ x ≤ 550, 450 ≤ y ≤ 550}
§ Find all points within the small square
§ How many points in 450 ≤ x ≤ 550?
§ 100000 (one tenth of the 10^6 points)
§ Assume all 100000 pointers fit in memory
§ How many disk I/Os for bringing 100000 pointers to main memory?
§ 500 leaf blocks (100000/200) + 1 intermediate block = 501
§ Same for y
[Figure: the 1000×1000 square with X and Y axes from 0 to 1000]
B+-trees for Range Queries
§ How many points in the intersection of the two pointer sets? 10000
§ 500 leaf blocks
§ Root node in memory
§ Leaves have avg. 200 keys ➨ 500 disk I/Os in each tree to get the pointer lists + 1 intermediate block ➨ 1002 disk I/Os
§ Retrieve 10000 records ➨ 10000 I/Os (random access)
§ Sum: 11,002 disk I/Os
§ Sequential scan of file = 10,000 I/Os (100 tuples per block)
§ Hence, the index is of no help
§ If the range were smaller, would there be an advantage in using the index?
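The arithmetic above can be replayed for other range widths. The sketch below keeps the slide's assumptions (uniform data, root in memory, one intermediate block per tree, one random I/O per matching record); the function names `index_plan_ios` and `scan_ios` are illustrative only.

```python
def index_plan_ios(n_points, side, width, keys_per_leaf):
    """Disk I/Os for intersecting two B+-tree pointer lists, then fetching records."""
    ptrs_per_dim = n_points * width // side           # pointers in one 1-D range
    leaf_blocks = -(-ptrs_per_dim // keys_per_leaf)   # ceiling division
    per_tree = leaf_blocks + 1                        # leaves + 1 intermediate block
    matches = n_points * width * width // (side * side)
    return 2 * per_tree + matches                     # each match = 1 random I/O

def scan_ios(n_points, records_per_block):
    """Disk I/Os for reading the whole file sequentially."""
    return -(-n_points // records_per_block)

print(index_plan_ios(10**6, 1000, 100, 200))  # slide's query: 11002
print(scan_ios(10**6, 100))                   # 10000
print(index_plan_ios(10**6, 1000, 10, 200))   # 10x narrower range: 202
```

Under these assumptions a range ten times narrower in each dimension needs about 202 I/Os, so the index does pay off for sufficiently small ranges.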
NN-query using B+-trees
Data:
(25,60) (45,60) (50,75) (50,100)
(50,120) (70,110) (85,140) (30,260)
(25,400) (45,350) (50,275) (60,260)
[Figure: buckets (disk)]
Performance of Grid Files
§ For high-dimensional problems, the number of buckets grows exponentially with the dimension
§ Many empty buckets if large portions of the space are empty
§ If there is high correlation between age and salary, all points lie along the diagonal and, irrespective of the position of the grid lines, buckets off the diagonal will be empty
§ Works well for uniform distributions
§ Grid lines can be chosen so that:
§ The bucket matrix can be kept in memory
§ In-memory indexes on the values of the grid lines can be used, or a main-memory binary search can be done
§ Overflow blocks are limited
Performance of Grid Files
§ Lookup: we are directed to the proper bucket, so only 1 disk I/O is needed. If we are inserting or deleting, an additional write is needed. Inserts that require creation of an overflow block need one more write
§ Partial Match Queries
§ Find all customers aged 50, or all customers with salary $200K
§ Need to look at all the buckets in a row or column of the bucket matrix
§ Only a small fraction of all the buckets would be accessed
§ Range Queries
§ Defines a rectangular region of the grid
§ Example: age 35-45 & salary 50-100
§ NN Queries
Grid Files: Range Queries
§ Age 35-45 & salary 50-100
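A minimal sketch of this range query on a grid file, using the twelve (age, salary) points listed earlier. The grid lines (age 40 and 55, salary 90 and 225) are an assumption, since the figure with the actual lines did not survive; the bucket matrix is modelled as a plain dictionary.

```python
from bisect import bisect_right
from collections import defaultdict

POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
AGE_LINES = [40, 55]    # assumed grid lines, not given in the slides
SAL_LINES = [90, 225]   # assumed grid lines, not given in the slides

def cell(age, sal):
    """Row/column of the bucket matrix that (age, sal) falls into."""
    return bisect_right(AGE_LINES, age), bisect_right(SAL_LINES, sal)

def build(points):
    buckets = defaultdict(list)
    for p in points:
        buckets[cell(*p)].append(p)
    return buckets

def range_query(buckets, age_lo, age_hi, sal_lo, sal_hi):
    """Visit every bucket overlapping the query rectangle, then filter its points."""
    (i_lo, j_lo), (i_hi, j_hi) = cell(age_lo, sal_lo), cell(age_hi, sal_hi)
    hits, visited = [], 0
    for i in range(i_lo, i_hi + 1):
        for j in range(j_lo, j_hi + 1):
            visited += 1
            hits += [p for p in buckets.get((i, j), [])
                     if age_lo <= p[0] <= age_hi and sal_lo <= p[1] <= sal_hi]
    return sorted(hits), visited

print(range_query(build(POINTS), 35, 45, 50, 100))
```

With the assumed grid lines the query rectangle overlaps four buckets, and only the point (45, 60) lies inside it.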
Performance of Grid Files
§ NN Queries
§ Start by searching the bucket to which point P belongs
§ If at least one point is found, we have a candidate Q
§ Points in adjacent buckets may be closer to P than Q
§ Check whether the distance between P and a border of its bucket is < dist(P,Q); if so, the buckets across that border must also be searched
§ Example: P(45, 200); (50,120) is the closest point in P's bucket, at distance 80.2
§ No need to search the lower 3 buckets; the other 5 must be searched
§ (30, 260) & (60, 260) are at 61.8 from P
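The pruning argument can be checked numerically. The sketch below reuses the twelve data points and, as an assumption, grid lines at age 40 and 55 and salary 90 and 225 (the original figure is missing); `nearest` and `dist_to_cell` are illustrative names.

```python
import math
from bisect import bisect_right

POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]
AGE_LINES = [40, 55]    # assumed grid lines, not given in the slides
SAL_LINES = [90, 225]   # assumed grid lines, not given in the slides
INF = float("inf")

def cell(p):
    return bisect_right(AGE_LINES, p[0]), bisect_right(SAL_LINES, p[1])

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def dist_to_cell(p, c):
    """Distance from p to the closest point of grid cell c (0 if p is inside)."""
    ages = [-INF] + AGE_LINES + [INF]
    sals = [-INF] + SAL_LINES + [INF]
    dx = max(ages[c[0]] - p[0], 0, p[0] - ages[c[0] + 1])
    dy = max(sals[c[1]] - p[1], 0, p[1] - sals[c[1] + 1])
    return math.hypot(dx, dy)

def nearest(p):
    home = cell(p)
    # Candidate Q: closest point in P's own bucket (assumed non-empty here).
    q = min((r for r in POINTS if cell(r) == home), key=lambda r: dist(p, r))
    bound = dist(p, q)
    cells = [(i, j) for i in range(len(AGE_LINES) + 1)
             for j in range(len(SAL_LINES) + 1)]
    # A bucket can hold a closer point only if it lies within `bound` of P.
    to_search = [c for c in cells if c != home and dist_to_cell(p, c) < bound]
    skipped = [c for c in cells if c != home and c not in to_search]
    best = min((r for r in POINTS if cell(r) == home or cell(r) in to_search),
               key=lambda r: dist(p, r))
    return best, dist(p, best), skipped

best, d, skipped = nearest((45, 200))
print(best, round(d, 1), skipped)
```

Under the assumed grid lines, the three buckets in the lowest salary band are skipped because the lower border of P's bucket is 110 away, which exceeds the candidate distance of about 80.2.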
Partitioned Hash Functions
§ Design a hash function that generates k bits
§ The k bits are divided among the n attributes
§ k_i bits of the hash value come from the i-th attribute
§ Σ k_i = k
§ Hash function = list of hash functions (h_1, h_2, …, h_n)
§ The bucket for a tuple (v_1, v_2, …, v_n) is computed by concatenating the bit sequences: h_1(v_1) h_2(v_2) … h_n(v_n)
§ Example: a hash table with 10-bit bucket numbers has 2^10 = 1024 buckets, e.g. 4 bits for the 1st attribute & 6 bits for the 2nd
Partitioned hashing
• Fixed bits for each dimension
• Example: Gold jewelry with
• first bit = age mod 2
• bits 2 and 3 = salary mod 4
• Works well for partial match, bad for range and NN queries
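A small sketch of the example above: one bit from age mod 2 and two bits from salary mod 4 give 3-bit bucket numbers, i.e. 8 buckets. The function names `bucket` and `candidate_buckets` are illustrative, not part of the slides.

```python
from itertools import product

def bucket(age, sal):
    """3-bit bucket number: 1 bit from age, 2 bits from salary."""
    return format(age % 2, "01b") + format(sal % 4, "02b")

def candidate_buckets(age=None, sal=None):
    """Buckets a partial-match query must read; unspecified parts stay open."""
    age_bits = [format(age % 2, "01b")] if age is not None else ["0", "1"]
    sal_bits = ([format(sal % 4, "02b")] if sal is not None
                else ["00", "01", "10", "11"])
    return [a + s for a, s in product(age_bits, sal_bits)]

print(bucket(50, 120))             # age 50, salary 120
print(candidate_buckets(age=50))   # partial match on age only
print(candidate_buckets(sal=200))  # partial match on salary only
```

A partial-match query fixes some of the bits and so touches only a fraction of the buckets. A range query gains nothing, because nearby values hash to unrelated buckets.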
Kd-trees
• kd-tree: an early structure used for indexing in multiple dimensions.
• Each level of a kd-tree partitions the space into two.
• Choose one dimension for partitioning at the root level of the tree.
• Choose another dimension for partitioning in nodes at the next level, and so on, cycling through the dimensions.
• In each node, approximately half of the points stored in the sub-tree fall on one side and half on the other.
• Partitioning stops when a node has fewer than a given maximum number of points.
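The construction described above can be sketched as follows. This is a minimal in-memory version under stated assumptions: it splits at the median position of the sorted points, stops once a node holds at most `max_leaf` points, and represents nodes as dictionaries. None of these details come from the slides.

```python
POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

def build(points, max_leaf=2, depth=0):
    """Split on dimensions in rotation; about half the points go to each side."""
    if len(points) <= max_leaf:
        return {"points": list(points)}
    axis = depth % len(points[0])          # cycle through the dimensions
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"axis": axis, "split": pts[mid][axis],
            "left": build(pts[:mid], max_leaf, depth + 1),    # values <= split
            "right": build(pts[mid:], max_leaf, depth + 1)}   # values >= split

def range_query(node, lo, hi):
    """All points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if "points" in node:
        return [p for p in node["points"]
                if all(l <= v <= h for l, v, h in zip(lo, p, hi))]
    out, a, s = [], node["axis"], node["split"]
    if lo[a] <= s:
        out += range_query(node["left"], lo, hi)
    if hi[a] >= s:
        out += range_query(node["right"], lo, hi)
    return out

tree = build(POINTS)
print(range_query(tree, (35, 50), (45, 100)))
```

Because ties on the split value may end up on either side, the range search descends into both children whenever the query interval touches the split value.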
KD-trees
• Main-memory data structure generalizing the binary search tree to multi-dimensional data
[Figure: Example 1, a kd-tree over points a to l in the Age/Sal plane, with splits at Sal 150, Age 60, Age 47 and Age 38]
[Figure: Example 2]
Quad-tree operations
• k dimensions ➨ a node has 2^k children, e.g. k = 7 ➨ 128 children of a node
• Insertion: lookup and split if necessary. Problem: high-dimensional data
• Range query: age in [50,75], salary in [100,200]
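The 2^k fanout follows from one below/above decision per dimension at each node. A minimal sketch, where the node centre (50, 200) is a made-up value and `child_index` is an illustrative name:

```python
def fanout(k):
    """Number of children of a node in a k-dimensional quad-tree."""
    return 2 ** k

def child_index(point, center):
    """One bit per dimension: 1 if the coordinate is >= the node's centre."""
    idx = 0
    for v, c in zip(point, center):
        idx = (idx << 1) | int(v >= c)
    return idx

print(fanout(2), fanout(7))                 # 4 and 128
print(child_index((60, 150), (50, 200)))    # age above, salary below the centre
```

The exponential fanout is the reason quad-trees become impractical for high-dimensional data.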
R-Trees
• Generalization of B+-trees to multi-
dimensional data
• Adaptation of the B+-trees to handle
spatial data
• Height balanced data structure
• Search key for an R-tree is a collection of
intervals, with one interval per dimension
• Search key is a BOX, with each side
parallel to the axis
• Search key is a bounding box or rectangle
R-Trees
• Supported in many modern database systems, along with variants like R+-trees and R*-trees
• Will consider only the two-dimensional case (N = 2)
• Generalization for N > 2 is straightforward, although R-trees work well only for relatively small N
R-Trees
• A rectangular bounding box is associated with each tree node.
• The bounding box of a leaf node is the minimum-sized rectangle that contains all the rectangles/polygons/regions associated with the leaf node.
• The bounding box associated with a non-leaf node contains the bounding boxes of all its children.
• The bounding box of a node serves as its key in its parent node (if any)
• Bounding boxes of children of a node are allowed to overlap
• A polygon/region is stored in only one node, and the bounding box of that node must contain the polygon
R-Trees: Example
• A set of rectangles (solid line) and the bounding boxes (dashed line) of the nodes of an R-tree for the rectangles.
R-Trees
• An R-tree is a height-balanced tree similar to a B-tree.
• Leaf nodes contain pointers to data objects.
• Insertions and deletions are dynamic and can be done in any order.
• A spatial database contains a list of tuples
• Each tuple has a unique identifier.
R-Trees
• Leaf nodes hold index records of the form (I, tuple-id), where I is the n-dimensional bounding rectangle of the spatial object:
I = (I_0, I_1, …, I_{n-1})
where each I_i is the extent along dimension i.
[Figure: rectangles A to J]
R-trees
• e.g., with fanout 4:
[Figure: rectangles A to J grouped under parent MBRs P1 to P4, with the leaf nodes shown]
R-trees
• e.g., with fanout 4:
[Figure: the same rectangles with the root node (P1, P2, P3, P4) above the leaf nodes]
R-trees - format of nodes
• {(MBR; obj-ptr)} for leaf nodes
• Each entry: x-low, x-high; y-low, y-high; ...; obj ptr
[Figure: leaf node (A, B, C) under the root (P1, P2, P3, P4)]
R-trees - format of nodes
• {(MBR; node-ptr)} for non-leaf nodes
• Each entry: x-low, x-high; y-low, y-high; ...; node ptr
[Figure: root node (P1, P2, P3, P4) pointing to leaf node (A, B, C)]
R-trees - range search?
[Figure: rectangles A to J under parents P1 to P4, with the root node (P1, P2, P3, P4)]
R-trees - range search
Observations:
• every parent node completely covers its 'children'
• a child MBR may be covered by more than one parent; it is stored under ONLY ONE of them
• a point query may follow multiple branches
• everything works for any dimensionality
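The observations above can be illustrated with a minimal R-tree search sketch. The tiny two-leaf tree, the rectangle coordinates and the `Node` class are all hypothetical; MBRs are written as (x-low, y-low, x-high, y-high).

```python
from dataclasses import dataclass

def intersects(a, b):
    """True if rectangles a and b (x_low, y_low, x_high, y_high) overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class Node:
    is_leaf: bool
    entries: list   # leaf: (MBR, obj-id); non-leaf: (MBR, child Node)

def search(node, query, visited):
    """Descend into every child whose MBR intersects the query rectangle."""
    visited.append(node)
    hits = []
    for mbr, item in node.entries:
        if intersects(mbr, query):
            if node.is_leaf:
                hits.append(item)
            else:
                hits.extend(search(item, query, visited))
    return hits

# Hypothetical data: two leaves whose bounding boxes overlap.
leaf1 = Node(True, [((0, 0, 4, 4), "A"), ((3, 3, 6, 6), "B")])
leaf2 = Node(True, [((5, 5, 9, 9), "C"), ((8, 1, 10, 3), "D")])
root = Node(False, [((0, 0, 6, 6), leaf1), ((5, 1, 10, 9), leaf2)])

visited = []
hits = search(root, (5.5, 5.5, 5.5, 5.5), visited)   # point query
print(hits, len(visited))
```

The point (5.5, 5.5) lies in both parent MBRs, so even a point query has to follow two branches here.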
R-trees - insertion
• e.g., rectangle 'X'
[Figure: new rectangle X inserted among rectangles A to J under parents P1 to P4]
R-trees - insertion
• e.g., rectangle 'Y'
[Figure: new rectangle Y inserted among rectangles A to J under parents P1 to P4]
R-trees - insertion
• e.g., rectangle 'Y': extend a suitable parent.
• Q: how to measure 'suitability'?
• A: minimize the increase in area (volume)
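A minimal sketch of the "least enlargement" rule. The parent rectangles and the new rectangles are made-up values, and ties are broken by smaller area, which is one common choice rather than something stated in the slides.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    """Smallest rectangle (x_low, y_low, x_high, y_high) covering a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(mbr, rect):
    """Extra area the parent MBR needs in order to cover rect."""
    return area(union(mbr, rect)) - area(mbr)

def choose_parent(parents, rect):
    """Parent whose MBR grows least; ties go to the smaller MBR (assumption)."""
    return min(parents,
               key=lambda name: (enlargement(parents[name], rect),
                                 area(parents[name])))

# Hypothetical parent MBRs.
parents = {"P1": (0, 0, 4, 4), "P2": (5, 0, 9, 4)}
print(choose_parent(parents, (6, 1, 7, 2)))       # already inside P2
print(choose_parent(parents, (4.5, 1, 5.5, 2)))   # straddles the gap
```

A rectangle that already lies inside a parent causes zero enlargement, so it goes there without changing any bounding box.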
NODE SPLITTING
• Add a new entry to a full node with M entries by dividing the M+1 entries between two nodes
• Criterion: the total area of the resulting rectangles after the split is minimized
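For a small M the criterion can be applied by brute force. The sketch below tries every way of dividing the M+1 rectangles into two non-empty groups and keeps the one with the smallest total bounding area; it is exponential in M, so it only illustrates the criterion. The rectangles and the `min_fill` parameter are assumptions.

```python
from itertools import combinations

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def mbr(rects):
    """Minimum bounding rectangle (x_low, y_low, x_high, y_high) of rects."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def best_split(rects, min_fill=1):
    """Exhaustively find the two-group split with minimal total MBR area."""
    n, best = len(rects), None
    for k in range(min_fill, n - min_fill + 1):
        for group in combinations(range(n), k):
            if 0 not in group:        # pin entry 0 to group 1: skips mirror images
                continue
            g1 = [rects[i] for i in group]
            g2 = [rects[i] for i in range(n) if i not in group]
            cost = area(mbr(g1)) + area(mbr(g2))
            if best is None or cost < best[0]:
                best = (cost, g1, g2)
    return best

# Hypothetical full node (M = 4) plus one new entry: two obvious clusters.
rects = [(0, 0, 1, 1), (1, 1, 2, 2), (0, 1, 1, 2), (8, 8, 9, 9), (9, 8, 10, 9)]
cost, g1, g2 = best_split(rects)
print(cost, g1, g2)
```

Practical R-tree implementations replace this exhaustive search with cheaper heuristics, since the number of candidate splits grows exponentially with M.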
Bitmap indexes
• Fixed record positions 1,2, ...,n
• bitvector of length n for each key
• Example. File = 1:(30,foo), 2:(30,bar),
3:(40,baz), 4:(50,foo), 5:(40,bar),
6:(30,baz)
• Bitvector index on second field:
Key vector
-----------------
foo 100100
bar 010010
baz 001001
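The bit-vectors in the example can be reproduced with a few lines. A minimal sketch, with `bitmap_index` as an illustrative name; bit-vectors are kept as strings so they print exactly as on the slide.

```python
def bitmap_index(records, field):
    """One bit-vector per distinct key: bit i is 1 if record i+1 has that key."""
    n = len(records)
    index = {}
    for pos, rec in enumerate(records):
        bits = index.setdefault(rec[field], ["0"] * n)
        bits[pos] = "1"
    return {key: "".join(bits) for key, bits in index.items()}

records = [(30, "foo"), (30, "bar"), (40, "baz"),
           (50, "foo"), (40, "bar"), (30, "baz")]

second = bitmap_index(records, 1)
first = bitmap_index(records, 0)
print(second)
print(first)
```

The same function applied to the first field gives one vector each for the keys 30, 40 and 50.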
Bitmap indexes ...
• Possibly takes too much space: n records and m values ➨ n × m bits needed
• Operations are easy: on the Gold jewelry data, find {(age,sal): age in [45,50], sal in [100,200]}
[Figure: Data, Age index, Sal index]
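A sketch of how such a query reduces to bitwise operations: OR together the bit-vectors of all keys inside each range, then AND the two results. It assumes the Gold jewelry data are the twelve (age, salary) points listed earlier, and it stores each bit-vector as a Python integer, which is an implementation choice for illustration.

```python
POINTS = [(25, 60), (45, 60), (50, 75), (50, 100), (50, 120), (70, 110),
          (85, 140), (30, 260), (25, 400), (45, 350), (50, 275), (60, 260)]

def build(values):
    """Map each distinct value to an integer whose bit i marks record i."""
    index = {}
    for pos, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << pos)
    return index

def or_range(index, lo, hi):
    """OR of the bit-vectors of all keys in [lo, hi]."""
    acc = 0
    for v, bits in index.items():
        if lo <= v <= hi:
            acc |= bits
    return acc

def positions(bits):
    """Record positions (0-based) whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

age_index = build([age for age, _ in POINTS])
sal_index = build([sal for _, sal in POINTS])

matches = or_range(age_index, 45, 50) & or_range(sal_index, 100, 200)
result = [POINTS[i] for i in positions(matches)]
print(result)
```

Only the two vectors that survive the AND need to be turned into record fetches, so no data record is read unless it satisfies both conditions.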
Compressed Bitmaps:
Run-length Encoding
§ We represent a run, that is, a sequence of i 0’s
followed by a 1, by some suitable binary
encoding of the integer i.
§ Concatenate the codes for each run together,
and that sequence of bits represents the entire
bit-vector. This is done for all bit-vectors in a
bitmap.
§ Consider the bit-vector 000101. It contains 2 runs, 0001 and 01, of lengths 3 and 1 respectively. 3 = (11)_2 and 1 = (1)_2
§ RLE of 000101 would become 111
§ RLE of 010001 would also become 111, so simply concatenating the binary codes cannot be decoded unambiguously
Run-length Encoding
j = number of bits in the binary representation of i
j ≈ log2 i is represented in unary by j-1 1's and a single 0. Then we can follow it with i in binary.
Example: i = 3 has j = 2, so its code is 10 followed by 11, i.e. 1011.
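The scheme above can be sketched as an encoder and decoder. Two details are assumptions of this sketch: trailing 0's after the last 1 are not encoded, so the decoder is given the vector length n, and a run of length 0 is written with j = 1 (code 00).

```python
def encode(bits):
    """Encode each run (i zeros then a 1) as unary(j) followed by i in binary."""
    out, run = [], 0
    for b in bits:
        if b == "0":
            run += 1
        else:
            binary = format(run, "b")               # i in binary, j = len(binary)
            out.append("1" * (len(binary) - 1) + "0" + binary)
            run = 0
    return "".join(out)                             # trailing zeros are dropped

def decode(code, n):
    """Rebuild a bit-vector of length n from its run-length code."""
    out, pos = [], 0
    while pos < len(code):
        j = 1
        while code[pos] == "1":                     # unary part: j-1 ones
            j += 1
            pos += 1
        pos += 1                                    # skip the terminating 0
        run = int(code[pos:pos + j], 2)             # next j bits hold i
        pos += j
        out.append("0" * run + "1")
    bits = "".join(out)
    return bits + "0" * (n - len(bits))             # restore trailing zeros

print(encode("000101"))   # runs 3 and 1
print(encode("010001"))   # runs 1 and 3
print(decode(encode("000101"), 6))
```

With the unary length prefix the two vectors 000101 and 010001 now get different codes, which removes the ambiguity of the plain binary encoding.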