Principles of Database Management Systems Hashing Techniques

Principles of Database
Management Systems
4.2: Hashing Techniques

Pekka Kilpeläinen
(after Stanford CS245 slide originals by
Hector Garcia-Molina, Jeff Ullman and
Jennifer Widom)
DBMS 200 Notes 4.2: Hashi 1

Hashing?
Locating the storage block of a record by the hash value h(k) of its key k
Normally really fast: records are (often) located by a single disk access

Hashing
[Figure: key -> h(key) -> buckets (typically 1 disk block each)]

Two alternatives
(1) Hash value determines the storage block directly
[Figure: key -> h(key) -> block of records]
Used to implement a primary index

Two alternatives
(2) Records located indirectly via index buckets
[Figure: key -> h(key) -> index bucket (entry for key 1) -> record]
Used for a secondary index

Example hash function
Key = x1 x2 ... xn, an n-byte character string
Have b buckets
h = (x1 + x2 + ... + xn) mod b, so h is in {0, 1, ..., b-1}
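
The byte-sum function above can be sketched in a few lines of Python. This is only an illustration: the function name byte_sum_hash and the choice of UTF-8 encoding for the key are assumptions, not part of the slides.

```python
def byte_sum_hash(key: str, b: int) -> int:
    """h = (x1 + x2 + ... + xn) mod b, where x1..xn are the bytes of the key."""
    data = key.encode("utf-8")  # assumption: keys are stored as UTF-8 bytes
    return sum(data) % b        # always falls in {0, 1, ..., b-1}


# 'a' = 97, 'b' = 98, 'c' = 99, so the sum is 294 and 294 mod 4 = 2
print(byte_sum_hash("abc", 4))

# Weakness: the sum ignores byte order, so permutations of a key collide.
print(byte_sum_hash("abc", 7) == byte_sum_hash("cba", 7))
```

The second print shows one reason this may not be the best function: any two keys made of the same bytes in a different order land in the same bucket.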

This may not be the best function
Good hash function: expected number of keys/bucket is the same for all buckets
Read Knuth Vol. 3 if you really need to select a good function.

Next: example to illustrate inserts, overflows, deletes

EXAMPLE: 2 records/bucket
INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0, h(e) = 1
Result:
bucket 0: d
bucket 1: a, c -> overflow block: e
bucket 2: b
bucket 3: (empty)

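EXAMPLE: deletion
Delete: e, f, c
[Figure: buckets 0 to 3 holding keys a to g, with one overflow block. After a deletion, remaining records may be moved up into the freed slot (the slide marks g with "maybe move g up").]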
Rule of thumb:
Try to keep space utilization between 50% and 80%
Utilization = (# keys used) / (total # keys that fit)
If < 50%, wasting space
If > 80%, overflows significant (depends on how good the hash function is and on # keys/bucket)
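
As a quick arithmetic check of the rule of thumb, the following sketch computes utilization for a file and classifies it. The function names and the assumption that "total # keys that fit" counts only the primary buckets (not overflow blocks) are illustrative choices, not taken from the slides.

```python
def utilization(keys_used: int, num_buckets: int, keys_per_bucket: int) -> float:
    """Utilization = (# keys used) / (total # keys that fit)."""
    # assumption: capacity counts primary buckets only
    return keys_used / (num_buckets * keys_per_bucket)


def assess(u: float) -> str:
    if u < 0.5:
        return "wasting space"
    if u > 0.8:
        return "overflows likely significant"
    return "within the 50%-80% rule of thumb"


u = utilization(keys_used=5, num_buckets=4, keys_per_bucket=2)
print(u, assess(u))  # 5 keys in 4 buckets of 2 records each: 5/8 = 0.625
```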

How do we cope with growth?
Overflows and reorganizations
Dynamic hashing: # of buckets
may vary
Extensible
Linear
also others ...

Extensible hashing: two ideas
(a) Use i of the b bits output by the hash function (for example, b = 32)
[Figure: h(K) = 00110101, b bits in total; i of them are in use, and i grows over time]

(b) Use directory
[Figure: h(K)[i] -> directory entry -> pointer to bucket]
Directory contains 2^i pointers to buckets, and stores i.
Each bucket stores j, indicating the # of bits used for placing the records in this block (j <= i)

Extensible Hashing: Insertion
If there's room in bucket h(k)[i], place the record there; otherwise:
If j = i, set i = i+1 and double the directory (after this, j < i)
If j < i, split the block in two and distribute its records among them, now using j+1 bits of h(k) (repeat until some records end up in the new bucket); update the pointers of the bucket array
See the next example
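Example: h(k) is 4 bits; 2 keys/block
Before: i = 1
bucket 0 (j = 1): 0001
bucket 1 (j = 1): 1001, 1100
Insert 1010: bucket 1 is full and j = i, so the directory doubles (new directory: 00, 01, 10, 11; i = 2) and bucket 1 splits:
bucket 10 (j = 2): 1001, 1010
bucket 11 (j = 2): 1100
Directory entries 00 and 01 both still point to bucket 0 (j = 1)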
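Example continued
i = 2; Insert: 0111, 0000
0111 goes to bucket 0 (j = 1), which now holds 0001, 0111
0000: bucket 0 is full and j < i, so it splits without doubling the directory:
bucket 00 (j = 2): 0000, 0001
bucket 01 (j = 2): 0111
bucket 10 (j = 2): 1001, 1010
bucket 11 (j = 2): 1100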

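Example continued
i = 2; Insert: 1001
Bucket 10 (1001, 1010) is full and j = i, so the directory doubles (000, 001, ..., 111; i = 3) and bucket 10 splits:
entries 000, 001 -> bucket (j = 2): 0000, 0001
entries 010, 011 -> bucket (j = 2): 0111
entry 100 -> bucket (j = 3): 1001, 1001
entry 101 -> bucket (j = 3): 1010
entries 110, 111 -> bucket (j = 2): 1100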
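
The insertion procedure can be sketched as follows and replayed on the example keys. This is an in-memory illustration under stated assumptions: the class and method names are made up, keys are taken to be their own 4-bit hash values, the directory is indexed by the leading i bits (as in the example's directory labels), and a bucket that is still full when j = b raises an error instead of getting an overflow block.

```python
class Bucket:
    def __init__(self, j):
        self.j = j        # number of hash bits used to place records here
        self.keys = []


class ExtensibleHash:
    def __init__(self, b=4, capacity=2):
        self.b = b                    # bits output by the hash function
        self.capacity = capacity      # keys per block
        self.i = 1                    # bits used by the directory
        self.directory = [Bucket(1), Bucket(1)]

    def _prefix(self, h, bits):
        return h >> (self.b - bits)   # leading `bits` bits of h

    def lookup(self, h):
        return h in self.directory[self._prefix(h, self.i)].keys

    def insert(self, h):
        while True:
            bucket = self.directory[self._prefix(h, self.i)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(h)
                return
            if bucket.j == self.b:
                raise OverflowError("no hash bits left; overflow block needed")
            if bucket.j == self.i:
                # double the directory: entry p becomes entries 2p and 2p+1
                self.directory = [bkt for bkt in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(bucket)       # then retry the insert

    def _split(self, old):
        old.j += 1
        new = Bucket(old.j)
        stay = []
        for k in old.keys:            # bit number j of h(k) decides the side
            (new.keys if self._prefix(k, old.j) & 1 else stay).append(k)
        old.keys = stay
        shift = self.i - old.j
        for p, bkt in enumerate(self.directory):
            if bkt is old and (p >> shift) & 1:
                self.directory[p] = new   # update directory pointers


table = ExtensibleHash(b=4, capacity=2)
for h in [0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1001]:
    table.insert(h)
for p, bucket in enumerate(table.directory):
    print(format(p, f"0{table.i}b"), "j =", bucket.j,
          [format(k, "04b") for k in bucket.keys])
```

The while loop in insert corresponds to the "repeat until some records end up in the new bucket" step: if a split leaves the target bucket still full, the next iteration splits again.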

Extensible hashing: deletion
Reverse insert procedure
Example:
Walk thru insert example in reverse!

Summary: Extensible hashing
+ Can handle growing files without full reorganizations
+ Only one data block examined
- Indirection (not bad if the directory is in memory)
- Directory doubles in size (first it fits in memory, then it does not: sudden performance degradation)

Linear hashing: grow # of buckets by one
Two ideas:
(a) Use the i low-order bits of the hash value
[Figure: h(K) = 01110101, b bits in total; the i low-order bits are used, and i grows]
(b) File grows linearly
No bucket directory needed

Linear Hashing: Parameters
n: number of buckets in use; buckets numbered 0 ... n-1
i: number of bits of h(k) used to address buckets; i = ceil(log2(n))
r: number of records in hash table
ratio r/n limited to fit an avg bucket in a block
next example: r <= 1.7n, and a block holds 2 records
=> AVG bucket occupancy is at most 1.7/2 = 0.85 of a block

Example: 2 keys/block, b = 4 bits, n = 2, i = 1
bucket 0: 0000, 1010
bucket 1: 1111
insert 0101 -> bucket 1: 0101, 1111
now r = 4 > 1.7n = 3.4
get new bucket 10 and distribute keys btw buckets 00 and 10
Rule: If h(k)[i] = (a1 ... ai)_2 < n, then look at bucket h(k)[i];
else look at bucket h(k)[i] - 2^(i-1) = (0 a2 ... ai)_2
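
The addressing rule can be written as a small function. The name bucket_address and the use of Python integers for hash values are assumptions for illustration; the logic follows the rule as stated, with h(k)[i] taken as the i low-order bits of h(k).

```python
def bucket_address(h: int, i: int, n: int) -> int:
    """Bucket to look at for hash value h, given i address bits and n buckets."""
    m = h & ((1 << i) - 1)       # h(k)[i]: the i low-order bits (a1 ... ai)
    if m < n:
        return m                 # bucket m is in use
    return m - (1 << (i - 1))    # not in use yet: clear a1, giving (0 a2 ... ai)


# With n = 3 and i = 2, bucket 11 is not in use, so 0111 is redirected to 01.
print(format(bucket_address(0b0111, i=2, n=3), "02b"))
# 1010 ends in 10, and bucket 10 is in use.
print(format(bucket_address(0b1010, i=2, n=3), "02b"))
```

Because i = ceil(log2(n)), any m that is not yet in use is at least 2^(i-1), so the subtraction never goes negative.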

n = 3, i = 2; distribute keys btw buckets 00 and 10:
bucket 00: 0000
bucket 01: 0101, 1111
bucket 10: 1010 (moved from bucket 00)

n = 3, i = 2; insert 0001:
bucket 00: 0000
bucket 01: 0101, 1111 -> overflow block: 0001
bucket 10: 1010
can have overflow chains!

n = 3, i = 2; insert 0111:
bucket 11 not in use, redirect to 01
bucket 00: 0000
bucket 01: 0101, 1111 -> overflow block: 0001, 0111
bucket 10: 1010
now r = 6 > 1.7n = 5.1
-> get new bucket 11

n = 4, i = 2; distribute keys btw 01 and 11:
bucket 00: 0000
bucket 01: 0101, 0001
bucket 10: 1010
bucket 11: 1111, 0111
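
The whole growth process of this example can be replayed with a short sketch. Assumptions, all illustrative: the class name LinearHash, keys acting as their own 4-bit hash values, each bucket kept as one Python list (records beyond 2 per bucket would sit in overflow blocks on disk), and the threshold test r > 1.7n checked after every insert.

```python
class LinearHash:
    def __init__(self, max_ratio=1.7):
        self.max_ratio = max_ratio  # limit on r/n
        self.n = 2                  # buckets in use, numbered 0 ... n-1
        self.i = 1                  # address bits, i = ceil(log2(n))
        self.r = 0                  # records stored
        self.buckets = [[], []]     # one list per bucket (overflow included)

    def _address(self, h):
        m = h & ((1 << self.i) - 1)            # i low-order bits of h
        return m if m < self.n else m - (1 << (self.i - 1))

    def lookup(self, h):
        return h in self.buckets[self._address(h)]

    def insert(self, h):
        self.buckets[self._address(h)].append(h)
        self.r += 1
        if self.r > self.max_ratio * self.n:
            self._grow()

    def _grow(self):
        new = self.n                # number of the bucket being added
        self.n += 1
        if self.n > (1 << self.i):
            self.i += 1             # one more address bit is needed
        # new bucket (1 a2 ... ai) takes its keys from bucket (0 a2 ... ai)
        source = new - (1 << (self.i - 1))
        mask = (1 << self.i) - 1
        old_keys = self.buckets[source]
        self.buckets[source] = [k for k in old_keys if k & mask == source]
        self.buckets.append([k for k in old_keys if k & mask == new])


table = LinearHash()
for h in [0b0000, 0b1010, 0b1111, 0b0101, 0b0001, 0b0111]:
    table.insert(h)
for number, keys in enumerate(table.buckets):
    print(format(number, f"0{table.i}b"), [format(k, "04b") for k in keys])
```

Note that the bucket that gets split is chosen by the bucket count, not by where the insert landed: inserting 0101 into bucket 1 triggers a split of bucket 00, which is why overflow chains can persist for a while.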

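Example continued: How to grow beyond this?
i = 2 -> 3
[Figure: the existing buckets 00, 01, 10, 11 are now addressed as 000, 001, 010, 011; m = 11 is the max used block, and new buckets 100, 101, ... (up to 111) are added one at a time. A key such as 0101 moves to bucket 101 once that bucket is in use.]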

Summary: Linear Hashing
+ Can handle growing files without full reorganizations
+ No indirection (no directory as in extensible hashing)
- Can have overflow chains, but probability of long chains can be kept low by controlling the r/n fill ratio (?)

Summary
Hashing
- How it works
- Dynamic hashing
- Extensible
- Linear

Next:
Indexing vs Hashing
Index definition in SQL

Indexing vs Hashing
Hashing good for probes given key
e.g., SELECT ...
      FROM R
      WHERE R.A = 5

Indexing vs Hashing
INDEXING (including B-trees) good for range searches:
e.g., SELECT ...
      FROM R
      WHERE R.A > 5

Index definition in SQL
CREATE INDEX name ON rel (attr)
CREATE UNIQUE INDEX name ON rel (attr)
  (defines a candidate key)
DROP INDEX name
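
To try these statements without a database server, here is a small sketch using Python's built-in sqlite3 module. The table R(A, B) and the index names r_a_idx and r_b_uniq are hypothetical, and SQLite is simply a convenient stand-in; other systems accept the same three statements with minor variations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE R (A INTEGER, B TEXT)")

# Plain index on attribute A, unique index on attribute B
conn.execute("CREATE INDEX r_a_idx ON R (A)")
conn.execute("CREATE UNIQUE INDEX r_b_uniq ON R (B)")

conn.execute("INSERT INTO R VALUES (5, 'x')")
try:
    # Same B value again: the unique index makes B behave like a candidate key
    conn.execute("INSERT INTO R VALUES (6, 'x')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print("duplicate rejected:", duplicate_rejected)

conn.execute("DROP INDEX r_a_idx")
remaining = [row[0] for row in
             conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")]
print("remaining indexes:", remaining)
```

None of the statements says whether the index is a B-tree or a hash table; that choice is left to the system.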

Note: CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, ...)
OR PARAMETERS (e.g. Load Factor, Size of Hash, ...)
... at least in SQL
Oracle and IBM DB2 UDB provide a PCTFREE clause to indicate the proportion of B-tree blocks initially left unfilled
Oracle: Hash clusters with built-in or DBA-specified hash function

The BIG picture
Chapters 2 & 3: Storage, records, blocks...
Chapter 4: Access Mechanisms
- Indexes
- B trees
- Hashing
NEXT: Chapters 6 & 7: Query Processing
