
Principles of Database Management Systems

4.2: Hashing Techniques


Pekka Kilpeläinen
(after Stanford CS245 slide originals by
Hector Garcia-Molina, Jeff Ullman and
Jennifer Widom)

DBMS 200 Notes 4.2: Hashi 1


Hashing?
- Locating the storage block of a record by the hash value h(k) of its key k
- Normally really fast: records are (often) located with a single disk access



Hashing

[Figure: a key is mapped by h(key) to one of the buckets; a bucket is typically 1 disk block]



Two alternatives
(1) Hash value determines the storage block directly

[Figure: key -> h(key) -> block holding the records]

Used to implement a primary index



Two alternatives
(2) Records located indirectly via index buckets

[Figure: key -> h(key) -> index bucket entry for the key -> record]

Used for a secondary index



Example hash function

Key = x1 x2 ... xn, an n-byte character string
Have b buckets
h = (x1 + x2 + ... + xn) mod b, so h is in {0, 1, ..., b-1}
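
The byte-sum function above can be sketched in a few lines. This is only an illustration: the name `byte_sum_hash` and the choice of UTF-8 encoding are assumptions, not part of the original slides.

```python
def byte_sum_hash(key: str, b: int) -> int:
    """Add up the bytes x1 ... xn of the key and reduce the sum modulo b."""
    data = key.encode("utf-8")  # assumption: keys are stored as UTF-8 bytes
    return sum(data) % b        # always in {0, 1, ..., b-1}


# Keys made of the same bytes in a different order land in the same bucket.
print(byte_sum_hash("abc", 4), byte_sum_hash("cab", 4))
```

Because the sum ignores byte order, anagrams always collide, which is one reason this may not be the best function.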



This may not be the best function

Good hash function: the expected number of keys/bucket is the same for all buckets

Read Knuth Vol. 3 if you really need to select a good function.



Next: example to illustrate inserts, overflows, deletes



EXAMPLE: 2 records/bucket

INSERT:
h(a) = 1
h(b) = 2
h(c) = 1
h(d) = 0
h(e) = 1

Resulting buckets:
0: d
1: a, c -> overflow block: e
2: b
3: (empty)
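
The insert example can be replayed with a minimal sketch of a static hash file. The class `StaticHashFile`, the lookup table `H`, and the representation of a bucket as a chain of small lists are assumptions made for illustration; the hash values themselves are the ones given in the example.

```python
BLOCK_CAPACITY = 2  # records per block, as in the example

# Hash values taken from the example; a real h(k) would compute them.
H = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}


class StaticHashFile:
    def __init__(self, num_buckets: int):
        # Each bucket is a chain of blocks: the primary block first,
        # followed by any overflow blocks.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def insert(self, key: str) -> None:
        chain = self.buckets[H[key]]
        if len(chain[-1]) == BLOCK_CAPACITY:
            chain.append([])  # last block is full: add an overflow block
        chain[-1].append(key)

    def lookup(self, key: str) -> bool:
        # Scan the primary block and then the overflow chain.
        return any(key in block for block in self.buckets[H[key]])


table = StaticHashFile(num_buckets=4)
for key in "abcde":
    table.insert(key)
print(table.buckets)
```

Inserting e finds the primary block of bucket 1 full, so e goes into a new overflow block, and a lookup of e has to read two blocks instead of one.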



EXAMPLE: deletion

Delete: e, f

[Figure: buckets 0 to 3 holding records a to g; after the deletions, maybe move g up]



Rule of thumb:
- Try to keep space utilization between 50% and 80%
- Utilization = (# keys used) / (total # keys that fit)
- If < 50%, wasting space
- If > 80%, overflows significant (depends on how good the hash function is and on # keys/bucket)
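
As a quick arithmetic check of the rule of thumb, the sketch below computes the utilization and compares it with the 50% and 80% limits. The function names are hypothetical, and it assumes that "total # keys that fit" counts primary blocks only.

```python
def utilization(keys_used: int, num_buckets: int, keys_per_bucket: int) -> float:
    """Utilization = (# keys used) / (total # keys that fit)."""
    # Assumption: capacity counts the primary blocks only, not overflow blocks.
    return keys_used / (num_buckets * keys_per_bucket)


def rule_of_thumb(u: float) -> str:
    if u < 0.5:
        return "wasting space"
    if u > 0.8:
        return "overflows significant"
    return "within 50%-80%"


# Five keys stored in 4 buckets of 2 keys each.
u = utilization(keys_used=5, num_buckets=4, keys_per_bucket=2)
print(u, rule_of_thumb(u))
```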



How do we cope with growth?
- Overflows and reorganizations
- Dynamic hashing: # of buckets may vary
  - Extensible
  - Linear
  - also others ...



Extensible hashing: two ideas

(a) Use i of the b bits output by the hash function (for example, b = 32)

[Figure: h(K) = 00110101 is b bits long; the first i bits are used, and i grows over time]
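
Taking the first i of b bits is a single shift. The sketch below uses the 8-bit pattern from the figure, so b = 8 here is an assumption made for the demonstration, and `leading_bits` is a hypothetical helper name.

```python
def leading_bits(h: int, i: int, b: int) -> int:
    """Return the first (most significant) i bits of a b-bit hash value."""
    return h >> (b - i)


h = 0b00110101  # the bit pattern shown on the slide, read as b = 8 bits
for i in (1, 2, 3, 4):
    print(i, format(leading_bits(h, i, b=8), f"0{i}b"))
```

As i grows, each prefix extends the previous one by one bit, which is what lets a bucket be split in two without rehashing the whole file.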



(b) Use directory

[Figure: the first i bits of h(K), written h(K)[i], select a directory entry that points to a bucket]

- Directory contains 2^i pointers to buckets, and stores i.
- Each bucket stores j, indicating the # of bits used for placing the records in this block (j <= i)



Extensible Hashing: Insertion
- If there's room in bucket h(k)[i], place the record there; otherwise:
  - If j = i, set i = i+1 and double the directory (now j < i, so continue with the split)
  - If j < i, split the block in two and distribute the records among them, now using j+1 bits of h(k) (repeat until some records end up in the new bucket); update the pointers of the bucket array
- See the next example
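Example: h(k) is 4 bits; 2 keys/block

Before (i = 1):
  0 -> bucket (j = 1): 0001
  1 -> bucket (j = 1): 1001, 1100

Insert 1010: bucket 1 is full and j = i, so double the directory and split the bucket.

New directory (i = 2):
  00, 01 -> bucket (j = 1): 0001
  10     -> bucket (j = 2): 1001, 1010
  11     -> bucket (j = 2): 1100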
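Example continued

Insert: 0111, 0000

0111 fills the bucket shared by 00 and 01; 0000 then splits it (j = 1 < i = 2).

Directory (i = 2):
  00 -> bucket (j = 2): 0000, 0001
  01 -> bucket (j = 2): 0111
  10 -> bucket (j = 2): 1001, 1010
  11 -> bucket (j = 2): 1100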



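Example continued

Insert: 1001

Bucket 10 is full and j = i = 2, so the directory doubles and the bucket is split.

Directory (i = 3):
  000, 001 -> bucket (j = 2): 0000, 0001
  010, 011 -> bucket (j = 2): 0111
  100      -> bucket (j = 3): 1001, 1001
  101      -> bucket (j = 3): 1010
  110, 111 -> bucket (j = 2): 1100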
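
The insertion rule and the example can be replayed with a small in-memory sketch. The class and method names are hypothetical, hash values are inserted directly in place of records, and letting a block grow once all b bits are in use is an assumption standing in for an overflow block.

```python
B = 4               # bits in h(k), as in the example
BLOCK_CAPACITY = 2  # keys per block, as in the example


class Bucket:
    def __init__(self, j: int):
        self.j = j      # number of bits used to place records in this block
        self.keys = []


class ExtensibleHash:
    def __init__(self):
        self.i = 1
        self.directory = [Bucket(1), Bucket(1)]  # 2^i pointers

    def insert(self, h: int) -> None:
        while True:
            bucket = self.directory[h >> (B - self.i)]  # first i bits of h
            # Assumption: once all B bits are used, the block simply grows.
            if len(bucket.keys) < BLOCK_CAPACITY or bucket.j == B:
                bucket.keys.append(h)
                return
            if bucket.j == self.i:
                # Double the directory: every old pointer appears twice.
                self.directory = [bk for bk in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(bucket)  # here j < i always holds

    def _split(self, bucket: Bucket) -> None:
        bucket.j += 1
        new = Bucket(bucket.j)
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            # Bit number j of the key (counted from the left) picks the block.
            target = new if (k >> (B - bucket.j)) & 1 else bucket
            target.keys.append(k)
        for slot, bk in enumerate(self.directory):
            # Redirect the directory entries whose j-th bit is 1.
            if bk is bucket and (slot >> (self.i - bucket.j)) & 1:
                self.directory[slot] = new


table = ExtensibleHash()
for h in (0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1001):
    table.insert(h)
for slot, bk in enumerate(table.directory):
    print(format(slot, "03b"), bk.j, [format(k, "04b") for k in bk.keys])
```

The `while` loop covers the "repeat" step: if a split leaves all keys on one side and the target block is still full, the next iteration splits or doubles again.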



Extensible hashing: deletion

Reverse insert procedure

Example:
Walk thru insert example in reverse!



Summary: Extensible hashing
+ Can handle growing files without full reorganizations
+ Only one data block examined
- Indirection (not bad if the directory is in memory)
- Directory doubles in size (first it fits in memory, then it does not: sudden performance degradation)



Linear hashing: grow # of buckets by one

Two ideas:
(a) Use the i low-order bits of the b-bit hash value, e.g. 01110101; i grows over time
(b) File grows linearly

No bucket directory needed



Linear Hashing: Parameters
- n: number of buckets in use; buckets are numbered 0 ... n-1
- i: number of bits of h(k) used to address buckets; i = ceil(log2 n)
- r: number of records in the hash table
  - ratio r/n is limited so that an average bucket fits in a block
  - next example: r <= 1.7n, and a block holds 2 records
    => average bucket occupancy is at most 1.7/2 = 0.85 of a block
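
The two parameter relations can be checked with a few lines of arithmetic. The helper names are hypothetical; the 1.7 limit is the one used in the example that follows.

```python
MAX_FILL = 1.7  # limit on r/n, from the example


def bits_needed(n: int) -> int:
    """i = ceil(log2 n) for n >= 1, computed without floating point."""
    return (n - 1).bit_length()


def needs_new_bucket(r: int, n: int) -> bool:
    """True when r records in n buckets exceed the allowed r/n ratio."""
    return r > MAX_FILL * n


for n in (2, 3, 4, 5):
    print(n, bits_needed(n))
print(needs_new_bucket(r=4, n=2), needs_new_bucket(r=5, n=3))
```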



Example: 2 keys/block, b = 4 bits, n = 2, i = 1

Insert 0101:
  bucket 0: 0000, 1010
  bucket 1: 0101, 1111

Now r = 4 > 1.7n, so get new bucket 10 and distribute keys between buckets 00 and 10.

Rule: If h(k)[i] = (a1 a2 ... ai)_2 < n, then look at bucket h(k)[i];
else look at bucket h(k)[i] - 2^(i-1) = (0 a2 ... ai)_2
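
The addressing rule translates almost directly into code. In this sketch `bucket_for` is a hypothetical name, and h(k)[i] is taken to mean the i low-order bits of the hash value.

```python
def bucket_for(h: int, i: int, n: int) -> int:
    """Map a hash value to one of the n buckets currently in use."""
    m = h & ((1 << i) - 1)     # h(k)[i]: the i low-order bits
    if m < n:
        return m               # the bucket exists: use it
    return m - (1 << (i - 1))  # not yet in use: clear the leading bit


# With n = 3 and i = 2, bucket 11 is not in use yet.
print(bucket_for(0b1010, i=2, n=3))  # low bits 10: bucket 10 exists
print(bucket_for(0b0111, i=2, n=3))  # low bits 11: falls back to bucket 01
```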



n = 3, i = 2; distribute keys between buckets 00 and 10:

  bucket 00: 0000
  bucket 01: 0101, 1111
  bucket 10: 1010



n = 3, i = 2; insert 0001:

  bucket 00: 0000
  bucket 01: 0101, 1111 -> overflow block: 0001
  bucket 10: 1010

Can have overflow chains!



n = 3, i = 2; insert 0111:

Bucket 11 is not in use, so redirect to 01:
  bucket 00: 0000
  bucket 01: 0101, 1111 -> overflow block: 0001, 0111
  bucket 10: 1010

Now r = 6 > 1.7n -> get new bucket 11



n = 4, i = 2; distribute keys between 01 and 11:

  bucket 00: 0000
  bucket 01: 0101, 0001
  bucket 10: 1010
  bucket 11: 1111, 0111
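
The whole linear hashing example can be replayed with a short sketch. The class `LinearHash` is hypothetical, hash values stand in for records, and each bucket is modelled as one Python list in which any keys beyond the block capacity are understood to sit in overflow blocks.

```python
BLOCK_CAPACITY = 2  # keys per block, as in the example
MAX_FILL = 1.7      # limit on r/n, as in the example


class LinearHash:
    def __init__(self):
        self.buckets = [[], []]  # n = 2 buckets to start with
        self.i = 1               # bits of h(k) used for addressing
        self.r = 0               # number of records stored

    @property
    def n(self) -> int:
        return len(self.buckets)

    def _address(self, h: int) -> int:
        m = h & ((1 << self.i) - 1)  # i low-order bits
        return m if m < self.n else m - (1 << (self.i - 1))

    def insert(self, h: int) -> None:
        self.buckets[self._address(h)].append(h)
        self.r += 1
        if self.r > MAX_FILL * self.n:
            self._grow()

    def _grow(self) -> None:
        new_index = self.n
        self.buckets.append([])  # the file grows by exactly one bucket
        if self.n > (1 << self.i):
            self.i += 1          # one more address bit is needed
        # Split the bucket whose number is new_index without its leading bit.
        source = new_index - (1 << (self.i - 1))
        keys, self.buckets[source] = self.buckets[source], []
        for k in keys:
            self.buckets[self._address(k)].append(k)

    def overflow_blocks(self, bucket: int) -> int:
        extra = max(0, len(self.buckets[bucket]) - BLOCK_CAPACITY)
        return -(-extra // BLOCK_CAPACITY)  # ceiling division


table = LinearHash()
for h in (0b0000, 0b1010, 0b1111, 0b0101, 0b0001, 0b0111):
    table.insert(h)
for number, keys in enumerate(table.buckets):
    print(format(number, "02b"), [format(k, "04b") for k in keys])
```

The bucket that gets split is chosen by position, not by which bucket overflowed, so an overflow chain can persist until the split reaches that bucket.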



Example Continued: How to grow beyond this?

i = 2 -> 3

[Figure: buckets 00, 01, 10, 11 are readdressed with 3 bits as 000, 001, 010, 011; new buckets 100, 101, 110, 111 are added one at a time, and m (max used block) advances 11 -> 100 -> 101 -> ...]



Summary: Linear Hashing
+ Can handle growing files without full reorganizations
+ No indirection (unlike the directory of extensible hashing)
- Can have overflow chains, but the probability of long chains can be kept low by controlling the r/n fill ratio (?)



Summary

Hashing
- How it works
- Dynamic hashing
- Extensible
- Linear



Next:

Indexing vs Hashing
Index definition in SQL



Indexing vs Hashing

Hashing is good for probes given a key, e.g.,
  SELECT ...
  FROM R
  WHERE R.A = 5



Indexing vs Hashing

INDEXING (including B-trees) is good for range searches, e.g.,
  SELECT ...
  FROM R
  WHERE R.A > 5
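
The contrast can be illustrated in memory: a Python `dict` plays the role of a hash structure and a sorted list searched with `bisect` plays the role of an ordered index. The data and function names below are made up for the illustration, and a sorted list is only a stand-in for a B-tree.

```python
import bisect

# Hypothetical values of R.A; the list position serves as the row id.
values = [7, 3, 5, 9, 5, 1]

# Hash-style structure: attribute value -> row ids.
hash_index = {}
for row, a in enumerate(values):
    hash_index.setdefault(a, []).append(row)

# Ordered structure: (attribute value, row id) pairs in sorted order.
sorted_index = sorted((a, row) for row, a in enumerate(values))


def probe_equal(a):
    """WHERE R.A = a: one lookup in the hash structure."""
    return hash_index.get(a, [])


def probe_greater(a):
    """WHERE R.A > a: find the boundary once, then read sequentially."""
    start = bisect.bisect_right(sorted_index, (a, float("inf")))
    return [row for _, row in sorted_index[start:]]


print(probe_equal(5))
print(probe_greater(5))
```

Answering the range query from `hash_index` would mean testing every stored value, because hashing deliberately scatters neighbouring keys.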



Index definition in SQL

CREATE INDEX name ON rel (attr)

CREATE UNIQUE INDEX name ON rel (attr)
  (defines a candidate key)

DROP INDEX name
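
These statements can be tried out with Python's built-in `sqlite3` module. SQLite is used here only because it needs no setup; the table, column and index names are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rel (attr INTEGER, other TEXT)")

# Plain index and unique index, following the syntax on the slide.
conn.execute("CREATE INDEX rel_attr_idx ON rel (attr)")
conn.execute("CREATE UNIQUE INDEX rel_other_uq ON rel (other)")

conn.execute("INSERT INTO rel VALUES (5, 'x')")
try:
    # Same value of 'other' again: the unique index rejects the row.
    conn.execute("INSERT INTO rel VALUES (6, 'x')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX rel_attr_idx")
index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(duplicate_rejected, index_names)
```

The rejected insert shows the unique index enforcing the candidate key.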



Note: CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, ...) OR PARAMETERS (e.g. Load Factor, Size of Hash, ...)
... at least in SQL
- Oracle and IBM DB2 UDB provide a PCTFREE clause to indicate the proportion of B-tree blocks initially left unfilled
- Oracle: Hash clusters with built-in or DBA-specified hash function



The BIG picture.
Chapters 2 & 3: Storage, records,
blocks...
Chapter 4: Access Mechanisms
- Indexes
- B trees
- Hashing
NEXT
Chapters 6 & 7: Query Processing

