
File Structures by Folk, Zoellick and Riccardi

Chap12. Extendible Hashing

SNU-OOPSLA-LAB
File Structures SNU-OOPSLA Lab. 1

Chapter Objectives

Describe the problem solved by extendible hashing and related approaches
Explain how extendible hashing works; show how it combines tries with conventional, static hashing
Use the buffer, file, and index classes of previous chapters to implement extendible hashing, including deletion
Review studies of extendible hashing performance
Examine alternative approaches to the same problem, including dynamic hashing, linear hashing, and hashing schemes that control splitting by allowing for overflow buckets


Contents

12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches


12.1 Introduction

Dynamic files

undergo a lot of growth

Static hashing

described in chapter 11 (direct hashing)
typically worse than B-tree for dynamic files
eventually requires file reorganization

Extendible hashing

hashing for dynamic files
Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)


Overview(1)

Direct access (hashing) files have a static size, so they are not suitable for files whose size is unknown in advance

A dynamic file structure is desired that retains fast retrieval by primary key and also expands and contracts as the number of records in the file fluctuates (without reorganizing the whole file)

Similar motivation!
Indexed-sequential file ==> B-tree
Hashing ==> Extendible hashing


Overview(2)

Extendible Hashing

Primary key -> hashing function H(key) -> extract first d bits -> directory index -> table look-up -> file pointer


12.2 How Extendible Hashing works

Idea from tries (radix searching)

The branching factor of the tree is equal to the number of alternative symbols in each position of the key
e.g.) Radix 26 trie: able, abrahms, adams, anderson, andrews, baird
Use the first n characters for branching


[Figure: radix-26 trie for the keys able, abrahms, adams, anderson, andrews, and baird, branching on the leading characters of each key]

Extendible Hashing

H maps keys to a fixed address space whose size is the largest prime less than a power of 2 (65521 < 2^16)
File pointers point to blocks of records known as buckets; an entire bucket is read by one physical data transfer, and buckets may be added to or removed from the file dynamically
The first d bits of H(key) are used as an index into a directory array containing 2^d entries, which usually resides in primary memory
The value d, the directory size (2^d), and the number of buckets change automatically as the file expands and contracts


Extendible Hashing Example


Directory with d=3 and 4 buckets

Directory cells (d=3)   Bucket   Bucket depth   H(key) begins with
000, 001, 010, 011      B0       d=1            0
100                     B100     d=3            100
101                     B101     d=3            101
110, 111                B11      d=2            11


Turning the trie into a directory

Using Trie for extendible hashing


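(1) Use Radix 2 Trie:
Keys in A: beginning with 0
Keys in B: beginning with 10
Keys in C: beginning with 11
[Figure: radix-2 trie; branch 0 leads to bucket A, branch 1 leads to a node whose branches 0 and 1 lead to buckets B and C]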

(2) Retrieving from secondary storage the buckets containing keys, instead of individual keys


Representation of Trie (1)


Tree is not preferable (directory is not big)
A flattened array
1. Make a complete full binary tree
2. Collapse it into the directory structure

Directory cell   Bucket
00               A
01               A
10               B
11               C


Representation of Trie(2)

Directory is a complete binary tree
Directory entry: a pointer to the associated bucket
Given an address beginning with the bits 10, directory entry 2 (binary 10) gives a pointer to the associated bucket
Hashing is introduced to obtain a uniform distribution of addresses


Retrieve a record

Steps in retrieving a record with a given key

1. find H(given key)
2. extract the first d bits of H(given key)
3. use this value as an index into the directory to find a pointer
4. use this pointer to read a bucket into primary memory
5. locate the desired record within the bucket (scan)
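
The retrieval steps can be sketched in memory as follows. This is only an illustration under stated assumptions: the types Record, Bucket, and Directory, the 16-bit placeholder hash H, and the member names are hypothetical and are not the classes used in the chapter; following a pointer stands in for reading a bucket from disk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record and bucket layout, for illustration only.
struct Record { std::string key; int recAddr; };
struct Bucket { std::vector<Record> records; };

// Placeholder 16-bit hash H(key); any well-mixing function could stand in.
inline std::uint16_t H(const std::string& key) {
    std::uint32_t h = 0;
    for (unsigned char c : key) h = h * 31u + c;
    return static_cast<std::uint16_t>(h ^ (h >> 16));
}

struct Directory {
    int d;                       // number of leading hash bits in use
    std::vector<Bucket*> cells;  // 2^d bucket pointers

    // Steps 1-2: hash the key and keep its first (highest) d bits.
    std::size_t index(const std::string& key) const {
        assert(d >= 0 && d <= 16);
        return static_cast<std::size_t>(H(key)) >> (16 - d);
    }

    // Steps 3-5: follow the directory pointer, then scan the bucket.
    const Record* find(const std::string& key) const {
        const Bucket* b = cells[index(key)];  // stands in for one bucket read
        for (const Record& r : b->records)
            if (r.key == key) return &r;
        return nullptr;
    }
};
```

With the directory in memory, the only storage access in find is the bucket read, which is why the scan inside the bucket is the last step.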


Expansion & Contraction(1)

A pair of adjacent buckets with the same value of d that share a common value of the first d-1 bits of H(key) can be combined if their average load is below 50%, so that all their records fit into one bucket
File contraction is the reverse of expansion; the directory can be compacted and d decremented whenever every pair of adjacent pointers has the same value


Expansion & Contraction(2)


Bucket B0 overflows, then splits into B00 and B01

Directory cells (d=3)   Bucket   Bucket depth   H(key)
000, 001                B00      d=2            00..
010, 011                B01      d=2            01..
100                     B100     d=3            100..
101                     B101     d=3            101..
110, 111                B11      d=2            11..

Expansion & Contraction(3)


Bucket B100 overflows, d increases to 4

Directory cells (d=4)     Bucket   Bucket depth   H(key)
0000, 0001, 0010, 0011    B00      d=2            00..
0100, 0101, 0110, 0111    B01      d=2            01..
1000                      B1000    d=4            1000..
1001                      B1001    d=4            1001..
1010, 1011                B101     d=3            101..
1100, 1101, 1110, 1111    B11      d=2            11..

Splitting to Handle Overflow (1)

When overflow occurs


e.g.1) Overflowing of bucket A
Split A into A and D
Use the additional, previously unused address bit
No need to expand the directory

Directory cell   Before   After
00               A        A
01               A        D
10               B        B
11               C        C


Splitting to Handle Overflow(2)

e.g.2) Overflowing of bucket B

Do not have additional unused bits (need to expand the directory)

1. Divide B using 3 bits of hash address
2. Make a complete full binary tree
3. Collapse it into the directory structure

Figure (slide 19): overflow of bucket B

Before (2-bit directory): 00 -> A, 01 -> A, 10 -> B, 11 -> C

1. Result of overflow of bucket B: B splits into B and D, which are told apart by a third address bit
2. Complete binary tree: every path is extended to 3 bits
3. Directory:

Directory cell         Bucket
000, 001, 010, 011     A
100                    B
101                    D
110, 111               C
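
Both overflow cases (an unused bit is available, or the directory must double first) can be sketched in one in-memory insert routine. The sketch below is an illustration, not the chapter's Bucket and Directory classes: the class ExtendibleHash and its members are hypothetical, keys are supplied directly as 8-bit hash values, and the directory is indexed by the leading bits of that value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// In-memory sketch: keys are given directly as 8-bit hash values.
struct Bucket {
    int depth = 0;                    // address bits this bucket depends on
    std::vector<std::uint8_t> keys;
};

class ExtendibleHash {
public:
    explicit ExtendibleHash(std::size_t capacity) : cap_(capacity) {
        pool_.push_back(std::make_unique<Bucket>());
        dir_.push_back(pool_.back().get());   // d = 0: a single cell
    }

    void insert(std::uint8_t h) {
        for (;;) {
            Bucket* b = dir_[index(h)];
            if (b->keys.size() < cap_) { b->keys.push_back(h); return; }
            if (b->depth == depth_) doubleDirectory();  // no unused bit left
            split(b);                                   // then retry
        }
    }

    bool contains(std::uint8_t h) const {
        for (std::uint8_t k : dir_[index(h)]->keys)
            if (k == h) return true;
        return false;
    }

    int depth() const { return depth_; }
    const std::vector<Bucket*>& directory() const { return dir_; }

private:
    // First depth_ bits of the 8-bit hash value.
    std::size_t index(std::uint8_t h) const { return h >> (8 - depth_); }

    // Each cell becomes a pair of cells pointing to the same bucket.
    void doubleDirectory() {
        assert(depth_ < 8 && "more equal hash values than one bucket holds");
        std::vector<Bucket*> bigger;
        for (Bucket* b : dir_) { bigger.push_back(b); bigger.push_back(b); }
        dir_.swap(bigger);
        ++depth_;
    }

    void split(Bucket* b) {
        pool_.push_back(std::make_unique<Bucket>());
        Bucket* fresh = pool_.back().get();
        fresh->depth = ++b->depth;
        // Cells whose newly used bit is 1 now point to the new bucket.
        for (std::size_t i = 0; i < dir_.size(); ++i)
            if (dir_[i] == b && ((i >> (depth_ - b->depth)) & 1u)) dir_[i] = fresh;
        // Redistribute the old keys between the two buckets.
        std::vector<std::uint8_t> old;
        old.swap(b->keys);
        for (std::uint8_t k : old) dir_[index(k)]->keys.push_back(k);
    }

    std::size_t cap_;
    int depth_ = 0;
    std::vector<std::unique_ptr<Bucket>> pool_;
    std::vector<Bucket*> dir_;
};
```

The retry loop in insert covers the case where every key lands in the same half after a split, so one insertion can trigger several splits in a row.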

Creating Address

Function hash(KEY)

Fold/Add hashing algorithm
Do not take the hash value modulo the address space size, since no fixed address space exists
Output from the hash function for a number of keys

Key         Hash value (binary)
bill        0000 0011 0110 1100
lee         0000 0100 0010 1000
pauline     0000 1111 0110 0101
alan        0100 1100 1010 0010
julie       0010 1110 0000 1001
mike        0000 0111 0100 1101
elizabeth   0010 1100 0110 1010
mark        0000 1010 0000 0111


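#include <cstring> // strlen

int Hash (char * key)
{
    int sum = 0;
    int len = strlen(key);
    if (len % 2 == 1) len ++; // make len even
    // for an odd-length key, key[j+1] reads the terminating '\0' on the last pass
    for (int j = 0; j < len; j += 2)
        sum = (sum + 100 * key[j] + key[j+1]) % 19937;
    return sum;
}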

Figure 12.7 Function Hash (key) returns an integer hash value for key for a 15 bit


int MakeAddress (char * key, int depth)
{
    int retval = 0;
    int hashVal = Hash(key);
    // reverse the bits: the low-order bits of the hash become the leading address bits
    for (int j = 0; j < depth; j++)
    {
        retval = retval << 1;
        int lowbit = hashVal & 1;
        retval = retval | lowbit;
        hashVal = hashVal >> 1;
    }
    return retval;
}

Figure 12.9 Function MakeAddress(key,depth)
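
A short worked example may help connect the two functions to the table of hash outputs. The block restates Hash and MakeAddress compactly (with const char* parameters, an assumption made so that string literals can be passed) so that it compiles on its own; the values in the closing comments were worked out by hand from the arithmetic in the functions.

```cpp
#include <cassert>
#include <cstring>

// Compact restatement of Figures 12.7 and 12.9 so the example compiles alone.
inline int Hash(const char* key) {
    int sum = 0;
    int len = static_cast<int>(std::strlen(key));
    if (len % 2 == 1) len++;  // odd length: the last pair uses the '\0'
    for (int j = 0; j < len; j += 2)
        sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
    return sum;
}

inline int MakeAddress(const char* key, int depth) {
    assert(depth >= 0 && depth <= 15);
    int retval = 0;
    int hashVal = Hash(key);
    for (int j = 0; j < depth; j++) {  // reverse the low-order bits
        retval = (retval << 1) | (hashVal & 1);
        hashVal >>= 1;
    }
    return retval;
}

// Hash("bill") == 876             (binary 0000 0011 0110 1100)
// MakeAddress("bill", 3) == 1     (low three bits 100, reversed: 001)
// MakeAddress("bill", 4) == 3     (low four bits 1100, reversed: 0011)
// Hash("lee") == 1064             (binary 0000 0100 0010 1000)
// MakeAddress("lee", 4) == 1      (low four bits 1000, reversed: 0001)
```

Because the address is built from the low-order end of the hash value, increasing depth by one appends a bit to the right of the existing address, which is what lets a bucket split without rehashing its neighbors.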

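class Bucket: protected TextIndex
{
protected:
    Bucket (Directory & dir, int maxKeys = defaultMaxKeys);
    int Insert (char * key, int recAddr);
    int Remove (char * key);
    Bucket * Split ();                       // split the bucket and redistribute the keys
    int NewRange (int & newStart, int & newEnd);
    int Redistribute (Bucket & newBucket);
    int FindBuddy ();                        // find the bucket that is paired with this one
    int TryCombine ();                       // attempt to combine buckets
    int Combine (Bucket * buddy, int buddyIndex);
    int Depth;                               // number of address bits used by this bucket
    Directory & Dir;                         // directory that contains the bucket
    int BucketAddr;                          // address of the bucket in the file
    friend class Directory;
    friend class BucketBuffer;
};

Figure 12.10 Main members of class Bucket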

class Directory
{
public:
    Directory (..);
    ~Directory();
    int Open (..);
    int Create();
    int Close();
    int Insert();
    int Delete();
    int Search();
protected:
    int DoubleSize();
    int Collapse();
    int InsertBucket (.);
    int Find ();
    int StoreBucket();
    int LoadBucket();
    ..
};

Figure 12.11 Definition of class Directory

12.4 Deletion

When to combine buckets

Buddy buckets: the buckets are siblings at the leaf level of the tree (buddy means something like friend)
e.g., B and D on page 19 (Splitting to Handle Overflow(2)) are buddy buckets

Examine the directory to see if we can make changes there

Shrink the directory if none of the buckets requires the depth of address information that is currently available in the directory


Buddy Bucket

Given a bucket with an address uvwxy, where u, v, w, x, and y have values of either 0 or 1, the buddy bucket, if it exists, has the value uvwxz, such that

z = y XOR 1

If enough keys are deleted, the contents of buddy buckets can be combined into a single bucket
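
The buddy rule can be sketched as a small function. The name findBuddy and its parameters are hypothetical, and the sketch assumes that only a bucket using all of the directory's address bits (a leaf at the deepest level) can have a buddy, which is how the sibling condition stated earlier reads.

```cpp
#include <cassert>

// Returns the directory index of the buddy, or -1 if the bucket has none.
// bucketAddr is the bucket's address written with bucketDepth bits.
inline int findBuddy(int bucketAddr, int bucketDepth, int dirDepth) {
    assert(bucketDepth >= 0 && bucketDepth <= dirDepth);
    if (dirDepth == 0) return -1;           // a single bucket has no buddy
    if (bucketDepth < dirDepth) return -1;  // not at the deepest level
    return bucketAddr ^ 1;                  // flip the last bit: z = y XOR 1
}
```

For buckets B and D at addresses 100 and 101 in a 3-bit directory, findBuddy(4, 3, 3) gives 5 and findBuddy(5, 3, 3) gives 4, while a bucket of depth 2 in the same directory reports no buddy.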


Collapsing the Directory

Collapse condition

If the directory has a single cell, downsizing is impossible
If there is a pair of directory cells that do not both point to the same bucket, collapsing is impossible

Allocating space

Allocate half the size of the original
Copy the bucket references shared by each cell pair to a single cell in the new directory
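
A minimal sketch of the collapse test and the copy step, assuming the directory is held in memory as a vector of bucket identifiers (plain integers here, a stand-in for bucket addresses) whose size is a power of two. The function name collapse is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Halves the directory if every cell pair shares one bucket; returns
// false and leaves the directory unchanged otherwise.
inline bool collapse(std::vector<int>& dir) {
    assert(!dir.empty() && (dir.size() & (dir.size() - 1)) == 0);
    if (dir.size() == 1) return false;              // single cell
    for (std::size_t i = 0; i < dir.size(); i += 2)
        if (dir[i] != dir[i + 1]) return false;     // a pair disagrees
    std::vector<int> half(dir.size() / 2);          // half the original size
    for (std::size_t i = 0; i < half.size(); ++i)
        half[i] = dir[2 * i];                       // one reference per pair
    dir.swap(half);
    return true;
}
```

Calling collapse repeatedly until it returns false shrinks the directory as far as the current buckets allow.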


12.5 Extendible Hashing Performance

Time : O(1)

If the directory can be kept in RAM: a single access
Otherwise: two accesses are necessary

Space utilization of the bucket


r (# of records), b (block size), N (# of blocks)
Utilization = r / bN
Average utilization ==> 0.69

Space utilization for the directory


How large a directory should we expect to have, given an expected number of keys?
Expected value for the directory size by Flajolet (1983)
$\text{Estimated directory size} = \frac{3.92}{b}\, r^{(1 + 1/b)}$
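
The estimate is easy to evaluate for a planned file. The helper below is a sketch that evaluates the formula as reconstructed here, with r the number of records and b the block size; the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Flajolet's estimate as quoted on the slide: (3.92 / b) * r^(1 + 1/b).
inline double estimatedDirectorySize(double r, double b) {
    assert(r > 0 && b > 0);
    return 3.92 / b * std::pow(r, 1.0 + 1.0 / b);
}
```

Since the exponent 1 + 1/b is greater than 1, the estimate grows slightly faster than linearly in r, and for a fixed r it shrinks as the block size b grows.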



Space utilization for buckets

Periodic and fluctuating


With uniformly distributed addresses, all the buckets tend to fill up at the same time -> split at the same time
As buckets fill up: 90%
After a concentrated series of splits: 50%
N ~= r / (b ln 2)
Utilization = r / bN ~= ln 2 = 0.69
Average utilization of 69%

r : # of records, b : block size, N : # of buckets


B tree space utilization

Normal B-tree: 67%
B-tree with redistribution in insertion: 85%


12.6 Alternative Approaches(1): Dynamic Hashing

Similar to extendible hashing

Use a directory to track bucket addresses
Extend the directory through the use of tries

Start with a hash function that covers an address space of a fixed size

When overflow occurs

the bucket splits, and the resulting buckets form the leaves of a trie that grows down from the original address node


Alternative Approaches(2): Dynamic Hashing

Two kinds of nodes


External node: references a data bucket
Internal node: points to two child index nodes
When a node splits, it changes from an external node to an internal node

Two hash functions


Apply the first hash function to reach a node in the original address space
if an external node is found: the search is completed
if an internal node is found: apply the second hash function
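
The two-stage search can be sketched as a walk over a forest of binary tries. Everything named here is an assumption for illustration: the Node layout, the helper splitNode, and in particular the pathBits argument, which stands in for the bit sequence that the second hash function would produce for the key.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct Node {
    int bucket = -1;                 // meaningful only for an external node
    std::unique_ptr<Node> child[2];  // both set only for an internal node
    bool isExternal() const { return !child[0]; }
};

// Turn an external node into an internal node with two external children.
inline void splitNode(Node& n, int bucket0, int bucket1) {
    assert(n.isExternal());
    n.child[0] = std::make_unique<Node>();
    n.child[1] = std::make_unique<Node>();
    n.child[0]->bucket = bucket0;
    n.child[1]->bucket = bucket1;
    n.bucket = -1;
}

// h1 selects a node in the original address space; pathBits supplies one
// branching decision per trie level, lowest bit first.
inline int findBucket(const std::vector<Node>& roots, std::size_t h1,
                      unsigned pathBits) {
    const Node* n = &roots[h1 % roots.size()];
    while (!n->isExternal()) {
        n = n->child[pathBits & 1u].get();
        pathBits >>= 1;
    }
    return n->bucket;
}
```

The walk follows one pointer per level, which is the linked structure that distinguishes this index from the flat directory array of extendible hashing.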


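[Figure: growth of the index in dynamic hashing. (a) The original address space. (b) Node 4 of the original address space splits into children 40 and 41. (c) Node 2 splits into children 20 and 21, and node 41 splits into children 410 and 411.]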


Dynamic Hashing vs. Extendible Hashing(1)

Overflow handling
Both schemes extend the hash function locally, as a binary search trie

Both schemes use a directory structure
Dynamic hashing: a linked structure
Extendible hashing: perfect tree expressible as an array

Space utilization
Both schemes are the same (space utilization: 69%)


Dynamic Hashing and Extendible Hashing(2)

Growth of directory
Dynamic hashing: slower, more gradual growth
Extendible hashing: extend directory by doubling it

Actual size of an index node
A node in dynamic hashing is larger than a directory cell in extendible hashing (because of pointers)

Page fault
Dynamic hashing: more than one page fault (with linked structure for the directory)
Extendible hashing: single page fault


Alternative Approaches(3): Linear Hashing

Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory.

The actual address space is extended one bucket at a time as buckets overflow

Because the extension of the address space does not necessarily correspond to the bucket that is overflowing, linear hashing necessarily involves the use of overflow buckets, even as the address space expands

No directories: avoid the additional seek resulting from the additional layer
Use more bits of the hashed value
h_d(k): depth-d hashing function (using function MakeAddress)
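
Without a directory, the address of a key has to be computed from the depth and from how far the current round of splitting has progressed. The sketch below shows the usual linear hashing rule under two assumptions that are not spelled out on the slide: h_d(k) is taken as the d low-order bits of the hash value, and splitCount (a hypothetical name) is the number of buckets, counted from address 0, that have already been split in the current round.

```cpp
#include <cassert>

// h_d(k): the d low-order bits of the hash value (assumption of this sketch).
inline unsigned hd(unsigned hash, int d) { return hash & ((1u << d) - 1u); }

// Buckets 0 .. splitCount-1 have been split in this round, so their keys
// are addressed with d+1 bits; all other buckets still use d bits.
inline unsigned linearAddress(unsigned hash, int d, unsigned splitCount) {
    assert(d >= 0 && d < 31 && splitCount <= (1u << d));
    unsigned a = hd(hash, d);
    if (a < splitCount) a = hd(hash, d + 1);  // bucket already split
    return a;
}
```

With d = 2 and one bucket split, a hash value ending in 100 is sent to the new bucket 100, while a value ending in 101 still goes to bucket 01 until that bucket's turn to split arrives.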


The growth of address space in linear hashing(1)


[Figure: growth of the address space in linear hashing, panels (a) to (e). (a) Buckets a, b, c, d at addresses 00, 01, 10, 11. (b) Bucket a is readdressed as 000 and new bucket A is added at 100. (c) New bucket B is added at 101. (d) New bucket C is added at 110. (e) New bucket D is added at 111. Overflow buckets w, x, and y appear in the panels while the address space grows.]


Alternative Approaches(4): Approaches to Controlling Splitting

Postpone splitting: increase space utilization


B-tree: redistribution rather than splitting
Hashing: placing records in chains of overflow buckets to postpone splitting

Triggering event for splitting

Linear hashing: a split occurs every time any bucket overflows, but the bucket that is split is not necessarily the one that overflowed
Litwin (1980): use the overall load factor of the file as the triggering event

Below 2 seeks, 75% ~ 80% storage utilization


Alternative Approaches(5): Approaches to Controlling Splitting

Postpone splitting for extendible hashing

Use chaining overflow buckets
Avoid doubling directory space
1.1 seeks, 76% ~ 81% storage utilization


Let's Review!

12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches
