
Hash-Based Indexing

Hashing computes the address of a data item by applying a function to the search key value.

Hash indexes are best for equality searches; they cannot support range queries.

Hash function h(key): maps the key to the bucket where the key is expected to belong.

A good hash function has the following properties:

• Uniform and random distribution: keys are spread uniformly, so all buckets are equally likely to be picked, and similar keys should be hashed to very different buckets.
• Low cost: plain hash functions (as opposed to cryptographic hash functions such as MD5 or SHA-1) usually have a low computational cost.
• Determinism: for a given input value it always generates the same hash value.

We shall utilize a trivial hash function, i.e., the data itself interpreted as an integer in binary notation. E.g., 44 (decimal) = 101100 (binary).

Which bucket does key k belong to? Bucket h(k) mod N (N = # of buckets). If N is a power of two (i.e., N = 2^d for some natural number d), then this is simply the d least significant bits of h(k).
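
For illustration, a small Python sketch of the trivial hash function and this bucket-selection rule (the function names are assumptions made for the example):

def h(key: int) -> int:
    # Trivial hash function: the key itself, interpreted as an integer.
    return key

def bucket_of(key: int, n_buckets: int) -> int:
    # Bucket index = h(key) mod N; when N = 2^d this keeps the d low bits.
    return h(key) % n_buckets        # equivalently: h(key) & (n_buckets - 1)

print(bin(44))              # 0b101100
print(bucket_of(44, 8))     # 4, i.e. the 3 least significant bits 100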

Static Hashing:

• Builds a fixed structure at index construction time.


• Data entries are stored on a number of successive primary pages.
• Primary pages are fixed, allocated sequentially during index construction. Overflow pages are utilized when primary pages get full.
• Primary pages are never de-allocated during deletions.
• This is similar to the ISAM file organization.

• Search: Ideally 1 I/O (unless record is located in overflow chain).


• Insert/Delete: 2 I/Os (read and write the page).
• Drawback: Long overflow chains can develop and degrade performance.
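
As an illustration, a minimal in-memory Python sketch of a static hash file with fixed primary pages and overflow chains (the page capacity, class names and in-memory representation are assumptions made for this example; a real index stores each page on disk):

PAGE_CAPACITY = 4                      # data entries per page (assumed)

class Page:
    def __init__(self):
        self.entries = []
        self.overflow = None           # next page in the overflow chain

class StaticHashFile:
    def __init__(self, n_buckets):
        # Primary pages are allocated once and never de-allocated.
        self.primary = [Page() for _ in range(n_buckets)]

    def insert(self, key):
        page = self.primary[key % len(self.primary)]
        while len(page.entries) == PAGE_CAPACITY:      # primary page full:
            if page.overflow is None:
                page.overflow = Page()                 # add an overflow page
            page = page.overflow
        page.entries.append(key)

    def search(self, key):
        page = self.primary[key % len(self.primary)]   # ideally 1 I/O ...
        while page is not None:                        # ... plus the overflow chain
            if key in page.entries:
                return True
            page = page.overflow
        return False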

How to avoid overflow chains?

1. 80% Occupancy: by initially keeping pages only 80% full, we can avoid overflow pages if the file does not grow too much.
2. Rehashing: hash the file with a different hash function to achieve 80% occupancy and no overflows.
Drawback: this takes time (we need to rehash the complete DB)!
3. Dynamic Hashing: allow the hash function to be modified dynamically to accommodate the growth/shrinkage of the database (i.e., essentially rehash selected items rather than all of them):
• Extendible Hashing
• Linear Hashing

Extendible Hashing:

To understand the concept of extendible hashing, consider the following question:

• A bucket (primary page) becomes full. Why not re-organize the file by doubling the number of buckets?
• Answer: the entire file has to be read and then written back to disk to achieve the reorganization, which is expensive.

• Basic idea: use a directory of pointers to buckets, and double the directory instead of doubling the data file.
• The directory is much smaller than the file, so doubling the directory is much cheaper.
• Only the bucket that overflowed is split, not all of them, i.e., only one page of data entries is split and no overflow pages are created.

Example:
Extendible hashing – SEARCH:

Locate the data entry r with hash value h(r) = 5 (binary 101). Look at directory element 01, i.e., the global-depth (here 2) least significant bits of h(r). We then follow the pointer to the data page (bucket B in the example figure).
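
A tiny Python sketch of this directory lookup (the function name is an assumption made for the example):

def directory_slot(hash_value: int, global_depth: int) -> int:
    # Keep the global-depth least significant bits of the hash value.
    return hash_value & ((1 << global_depth) - 1)

print(directory_slot(5, 2))   # h(r) = 5 = 101 in binary -> slot 01 = 1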

Extendible Hashing – INSERT:

Algorithm:
• Find the target bucket: done similarly to search.
• If the target bucket is not full, insert and finish (e.g., h(r) = 9, which in binary is 1001, can be inserted into bucket B).
• If the target bucket is full, split it (allocate a new page and re-distribute the entries). Insertion of h(r) = 20 (10100) causes the split of bucket A and redistribution between A and A2.

• Inserting h(r) = 20 causes directory doubling.

• When the target bucket is full and its local depth is equal to the global depth, the bucket split causes directory doubling.
• Notice that after doubling some pointers are redundant; these are utilized in subsequent inserts.

Global depth of directory: tells us how many least significant bits to utilize when selecting the target bucket.
o Initially equal to log2(#Buckets), e.g., log2(8) = 3.
o Directory doubles => increment the global depth.

Local depth of bucket: tells us how many least significant bits are utilized to determine whether an entry belongs to a given bucket.
o Bucket is split => increment its local depth.

During an insertion into a full bucket, if local depth = global depth, then the split also requires doubling the directory.

Global depth minus local depth can be larger than 1 (if some buckets are split repeatedly, the global depth keeps increasing while the local depth of other buckets stays the same).
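
Putting search, insert, bucket splitting and directory doubling together, the following is a compact in-memory Python sketch of an extendible hashing index (the bucket capacity, class and method names, and the trivial hash function are assumptions made for this example; duplicate hash values and disk pages are not handled):

BUCKET_CAPACITY = 4                                   # entries per bucket (assumed)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self, global_depth=2):
        self.global_depth = global_depth
        # One directory slot per bit pattern of length global_depth.
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _slot(self, key):
        # Directory slot = global-depth least significant bits of h(key) = key.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < BUCKET_CAPACITY:
            bucket.keys.append(key)                   # target bucket has room
            return
        if bucket.local_depth == self.global_depth:   # full and depths equal:
            self.directory *= 2                       # double the directory
            self.global_depth += 1                    # (new half mirrors the old)
        bucket.local_depth += 1                       # split the overflowed bucket
        image = Bucket(bucket.local_depth)            # its new "split image"
        high_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):        # redirect half of its slots
            if b is bucket and i & high_bit:
                self.directory[i] = image
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys + [key]:                    # redistribute; may split again
            self.insert(k)

As in the description above, the directory is doubled only when the overflowed bucket's local depth equals the global depth; otherwise only the redundant pointers for the new split image are redirected.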

Extendible Hashing – DELETE:

• Delete: Essentially the reverse operation of insertion.


• If removal of a data entry makes the bucket empty, it can be merged with its `split image' (e.g., after deleting 32 and 16, merge with A2).
• If each directory element points to the same bucket as its split image, we can halve the directory (although this is not necessary for correctness).
• Equality search cost: if the directory fits in memory, a search is answered with 1 disk access; otherwise 2.
• Static Hashing, on the other hand, performs equality searches with 1 I/O (assuming no overflow chains).
• Yet the extendible hashing directory can usually fit easily in main memory.
• The directory can grow large if the distribution of hash values is skewed (some buckets are used by many keys, while others remain empty).
• Multiple entries with the same hash value (collisions) cause problems, as splitting cannot redistribute such entries across buckets.

Linear Hashing:

• Linear Hashing is another dynamic hashing scheme.


• Linear Hashing avoids a directory by splitting buckets in round-robin fashion and using overflow pages.
• Overflow chains are not likely to be long. Space utilization could be lower than in EH, since splits are not concentrated on dense data areas.
• The criterion for triggering splits can be tuned.
• LH handles the problem of long overflow chains (present in Static Hashing) without using a directory.
• Idea: use a family of hash functions h0, h1, h2, ..., where each hash function maps the elements to twice the range of its predecessor, i.e., if hi maps a data entry into M buckets, then hi+1 maps a data entry into one of 2M buckets. The hash functions are (see the sketch below):
hi(key) = h(key) mod (2^i * N), i = 0, 1, 2, ..., where N = initial # of buckets
• We proceed in rounds of splits: During round Level only hLevel(r) and
hLevel+1(r) are in use.
• The buckets in the file are split (every time we have an overflow), one-
by-one from the first to the last bucket, thereby doubling the number
of buckets.
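
A small Python sketch of this family of hash functions, assuming the trivial hash function and an initial number of buckets N = 4 (both assumptions made for the example):

N = 4                                  # initial number of buckets (assumed)

def h(key: int) -> int:
    return key                         # trivial hash function

def h_i(key: int, i: int) -> int:
    # h_i(key) = h(key) mod (2^i * N): each level doubles the range.
    return h(key) % (2 ** i * N)

print(h_i(43, 0))   # 43 mod 4 = 3
print(h_i(43, 1))   # 43 mod 8 = 3
print(h_i(44, 1))   # 44 mod 8 = 4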

INSERTION Algorithm:

Insert Algorithm Outline:


• Find the target bucket (similarly to search, using hLevel(r) and hLevel+1(r)).
• If the target bucket is NOT full, insert and finish (e.g., h(r) = 9, which in binary is 1001, can be inserted into bucket B).
• If the target bucket is full:
o Add an overflow page and insert the data entry there (e.g., inserting h(r) = 43 (101011) causes the split of bucket A and redistribution between A and A2).
o Split the Next bucket and increment Next (this can be performed in batch mode).
Note that 32 (100000), 44 (101100), 36 (100100) …

• The buckets in the file are split (every time we have an overflow), one by one, from the first bucket to the last of the NR buckets of the round (tracked by the Next index), thereby doubling the number of buckets.
• Since buckets are split round-robin, long overflow chains don't develop (unlike static hashing), as eventually every bucket is split!
• LH can choose any criterion to `trigger' a split:
– e.g., split whenever an overflow page is added.
– e.g., split whenever the index is 75% full.
– Many other heuristics could be utilized.
Increase Level after Insert:
• If Next = NR (after a split), then Level is increased by 1 (thus the next pair of hash functions, e.g., h1 and h2, will be utilized) and Next becomes 0.
• In the example, the addition of 50* (110010) causes Next to become equal to NR, and thus the Level is increased by one.
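
The following is a compact in-memory Python sketch of a linear hash file whose split criterion is "split whenever an overflow occurs" (the page capacity, names, trivial hash function, and the modelling of a primary page plus its overflow chain as one flat list are assumptions made for this example):

PAGE_CAPACITY = 4                      # entries per primary page (assumed)
N = 4                                  # initial number of buckets (assumed)

class LinearHashFile:
    def __init__(self):
        self.level = 0
        self.next = 0                                  # next bucket to be split
        self.buckets = [[] for _ in range(N)]          # primary page + overflow chain

    def _h(self, key, level):
        return key % (2 ** level * N)                  # h_level(key)

    def _bucket_of(self, key):
        b = self._h(key, self.level)
        if b < self.next:                              # bucket already split this round:
            b = self._h(key, self.level + 1)           # the next hash function decides
        return b

    def insert(self, key):
        bucket = self.buckets[self._bucket_of(key)]
        bucket.append(key)
        if len(bucket) > PAGE_CAPACITY:                # overflow: split bucket Next
            self._split()

    def _split(self):
        nr = 2 ** self.level * N                       # NR = buckets at start of round
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])                        # image bucket Next + NR
        for k in old:                                  # redistribute with h_{level+1}
            self.buckets[self._h(k, self.level + 1)].append(k)
        self.next += 1
        if self.next == nr:                            # end of round: all NR buckets split
            self.level += 1
            self.next = 0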

Overview of LH File:
Assume that we are in the middle of a round of splitting. The Linear Hash file then has the following structure: buckets in [0, Next) have already been split in this round, buckets in [Next, NR) have not yet been split, and the split images of the already-split buckets occupy positions [NR, NR + Next).

SEARCHING:

Search:
To find the bucket for data entry r, compute hLevel(r):

Unsplit Bucket: If hLevel(r) is in the range [Next, NR), then r belongs to bucket hLevel(r).

Split Bucket: If hLevel(r) maps to a bucket smaller than Next (i.e., a bucket that was already split in this round), then r could belong either to bucket hLevel(r) or to bucket hLevel(r) + NR; we must apply hLevel+1(r) to find out.
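
The same rule, shown standalone as a small Python sketch (names are assumptions; NR denotes the number of buckets at the start of the current round):

def bucket_for_search(hash_value: int, level: int, next_: int, n_initial: int) -> int:
    nr = 2 ** level * n_initial        # NR
    b = hash_value % nr                # h_level
    if b < next_:                      # bucket b was already split this round,
        b = hash_value % (2 * nr)      # so h_{level+1} decides: b or b + NR
    return b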

LH described as a variant of EH:

The two schemes are actually quite similar:


• Begin with an Extendible Hashing index where directory has N
elements.
• Use overflow pages (rather than splitting the overflowed page, as happens in EH) and split buckets in round-robin fashion (as in linear hashing).
• First split is at bucket 0. (Imagine directory being doubled at this
point.)
• But directory elements 1 and N+1, 2 and N+2, etc. would point to the same buckets (recall the EH redundant pointers).
• So, we need only create directory element N, which differs from element 0 for now.
• When bucket 1 splits, create directory element N+1, etc.

• So, directory can double gradually.


• Also, primary bucket pages are created in order (consequently the correct page can be located with a simple offset calculation, as in the sketch below). Thus, we actually don't need a directory.
• And that is exactly LH (without a directory).
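
A tiny Python sketch of that offset calculation (the page size and names are assumptions made for the example):

PAGE_SIZE = 4096                       # bytes per bucket page (assumed)

def page_offset(bucket_no: int) -> int:
    # Pages are allocated in sequence, so bucket i starts at i * PAGE_SIZE
    # from the beginning of the file; no directory lookup is needed.
    return bucket_no * PAGE_SIZE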
