
Hashing I Theory & Implementation

DECEMBER 1, 2017
LATIF SIDDIQ SUNNY

Hashing
Hashing is a technique that is used to uniquely identify a specific object from a group of similar objects.

Suppose we have a large table of data in which we want to insert, remove, and search for items.

If we use a sorted array and keep the data sorted, an item can be found in O(log(n)) time using
Binary Search, but insert and remove operations become costly because we have to maintain the sorted order.

Table 1: Sorted Sequential Array Table Operations

If we use a sorted or unsorted linked list, insert, remove and search operations become costly, since
locating an item takes O(n) time.

Table 2: Linked List Table Operations



With a balanced binary search tree (for example, an AVL tree or a red-black tree), we get moderate search,
insert, and delete times. These operations are guaranteed to run in O(log(n)) time.

Table 3: Balanced Binary Tree Table Operations

Insertion, search and removal in O(log(n)) time is good, but as the table grows larger,
even this cost becomes significant. We would like to be able to find an item in O(1) time. In
this case, we have to use hashing.
So, hashing is a technique for workloads dominated by insertion and search: it lets us insert
data and search for it in O(1) time on average. With this technique, we store the data in a hash table.

Hash Function
A hash function maps a big number or a string to a small integer that can be used as an index into the hash table.

A good hash function should have the following properties:

1. Efficiently computable.

2. Should uniformly distribute the keys (Each table position equally likely for each key)

Figure: Hash Function



The hash function is used to map the search key to an index; the index gives the place in the hash table
where the corresponding record should be stored and where it should later be looked up.
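As a minimal sketch of such a mapping, the function below reduces an integer or a string to an index in a table of size m. The polynomial rolling scheme, the base 31, and the name `simple_hash` are assumptions chosen for illustration; the text does not prescribe a particular hash function at this point.

```python
def simple_hash(key, m, base=31):
    """Map an int or str key to an index in range(m).

    Illustrative only: integers are reduced with key % m, strings with a
    polynomial rolling hash (base 31 is an arbitrary, assumed choice).
    """
    if isinstance(key, int):
        return key % m
    h = 0
    for ch in str(key):
        # Fold each character into the running value, keeping it below m.
        h = (h * base + ord(ch)) % m
    return h


print(simple_hash(13, 7))        # index for an integer key
print(simple_hash("hello", 7))   # index for a string key
```

Both calls return a value between 0 and m - 1, so the result can be used directly as an array index.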

Hashing Techniques

Direct Access Table

We can use a Direct Access Table for hashing. We build a large array and use the following hash
function:

h(k) = k, where k is the key.


If space is not a concern, we can build such a Direct Access Table. If T[1, 2, ..., m] is our table, where m is the
highest key we can store, we can do insert, remove and search operations in O(1) time.

Insert(x)        Search(x)        Remove(x)

T[x] = x         return T[x]      T[x] = NIL

Advantages:

1. We can insert, search and remove in O(1) time.

2. No collision will occur.

Disadvantages:

1. Extra space is required.

2. We cannot store large keys, because there is a limit on how large an array we can allocate.
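The three operations above can be sketched as a small class. The class name `DirectAccessTable` and the use of `None` for NIL are assumptions made for this example.

```python
class DirectAccessTable:
    """Direct access table with h(k) = k (illustrative sketch)."""

    def __init__(self, max_key):
        # One slot per possible key; None plays the role of NIL.
        self.slots = [None] * (max_key + 1)

    def insert(self, x):
        self.slots[x] = x

    def search(self, x):
        return self.slots[x]

    def remove(self, x):
        self.slots[x] = None


table = DirectAccessTable(max_key=100)
table.insert(42)
print(table.search(42))   # the stored key
table.remove(42)
print(table.search(42))   # None once removed
```

Note that the array needs `max_key + 1` slots no matter how few keys are stored, which is the space cost listed among the disadvantages.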

Direct Access Table with Modified Hash Function

Because we cannot allocate an arbitrarily large array, we can use a small array and modify the hash function:

h(k) = k % m, where k is the key and m is the size of the array.

But now there is a chance of collision. For example:

Let m = 7. If we insert 6, the hash value of 6 is 6, so we store 6 at index 6 of the array. If we then
insert 13, the hash value of 13 is 13 % 7 = 6, but this slot is not empty, so a collision occurs. Although
we overcome the limitation of needing a very large array, we cannot avoid such collisions.
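The collision in this example can be reproduced in a few lines. The helper name `try_insert` is hypothetical; it simply refuses to overwrite an occupied slot.

```python
m = 7
table = [None] * m


def try_insert(key):
    """Store key at index key % m; return False if that slot is taken."""
    index = key % m
    if table[index] is not None and table[index] != key:
        return False  # collision: another key already lives here
    table[index] = key
    return True


print(try_insert(6))    # 6 % 7 = 6, slot is free
print(try_insert(13))   # 13 % 7 = 6, slot is already occupied by 6
```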

Separate chaining

In this method we use the same hash function described above, but this time each array slot holds a
pointer to the head of a linked list. If a slot has no data, its head is null. Whenever a key hashes
to a slot, we insert it at the end of that slot's linked list.

Suppose we have an array of size 7, so m = 7.

Figure: Separate chaining insertion

In this scheme, colliding keys no longer overwrite each other, but within a slot we have to search the chain linearly.
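A minimal sketch of separate chaining with m = 7 follows. As an assumption for brevity, a Python list stands in for each linked list (appending to it corresponds to inserting at the end of the chain); the class name `ChainedHashTable` is also made up for this example.

```python
class ChainedHashTable:
    """Separate chaining with h(k) = k % m (illustrative sketch)."""

    def __init__(self, m=7):
        self.m = m
        # One chain per slot; an empty chain corresponds to a null head.
        self.slots = [[] for _ in range(m)]

    def _index(self, key):
        return key % self.m

    def insert(self, key):
        chain = self.slots[self._index(key)]
        if key not in chain:
            chain.append(key)  # insert at the end of the chain

    def search(self, key):
        # Linear scan of a single chain only.
        return key in self.slots[self._index(key)]

    def remove(self, key):
        chain = self.slots[self._index(key)]
        if key in chain:
            chain.remove(key)
            return True
        return False


table = ChainedHashTable(m=7)
for key in (6, 13, 20):
    table.insert(key)       # all three keys hash to slot 6
print(table.slots[6])
print(table.search(13))
```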

Advantages:
1. Simple to implement.
2. The hash table never fills up; we can always add more elements to a chain.
3. Less sensitive to the hash function and the load factor.
4. It is mostly used when it is unknown how many keys will be inserted or deleted, and how
frequently.

Disadvantages:
1. The cache performance of chaining is poor because keys are stored in linked lists. Open addressing
provides better cache performance because everything is stored in the same table.
2. Wastage of space (some parts of the hash table are never used).
3. If a chain becomes long, search time can become O(n) in the worst case.
4. Uses extra space for links.

Analysis:

Since the chains do not all have the same length, we work with the expected (average) chain length.

Define the collision indicator

$$c_{x,y} = \begin{cases} 1 & \text{if } h(x) = h(y) \\ 0 & \text{otherwise} \end{cases}$$

and the length function

$$l(x) = \sum_{y \in T} c_{x,y},$$

where x and y are both in T, and T is the set of elements stored in the table.

So, the expected value of the length is

$$E(l(x)) = E\Big(\sum_{y \in T} c_{x,y}\Big) = \sum_{y \in T} E(c_{x,y}).$$

Now,

$$E(c_{x,y}) = 1 \cdot P(c_{x,y} = 1) + 0 \cdot P(c_{x,y} = 0) = P(h(x) = h(y)) = \frac{1}{m},$$

where P(h(x) = h(y)) = 1/m is the probability that two keys have the same hash value, that is, land in the same slot.

So now

$$E(l(x)) = \sum_{y \in T} \frac{1}{m} = \frac{n}{m} = \alpha,$$

where n is the number of elements in T and α is the load factor.

Search Complexity

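If we insert some element x when there are already i elements in the table, then

search complexity = 1 + i/m, where i can vary from 0 to n − 1.

So, the average search complexity is

$$\frac{1}{n} \sum_{i=0}^{n-1} \left(1 + \frac{i}{m}\right) = \frac{1}{n}\left(n + \frac{n(n-1)}{2m}\right) = 1 + \frac{n-1}{2m} < 1 + \frac{n}{2m} = 1 + \frac{\alpha}{2}.$$

This gives a tight bound of Θ(1 + α) for a successful search.

An unsuccessful search has to scan a whole chain, whose expected length is α, so it also costs Θ(1 + α). When n is
proportional to m (for example n = m, so α = 1), the search complexity is O(c), where c is a constant.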
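As a quick sanity check of the average above, the sketch below evaluates the sum (1/n) Σ (1 + i/m) exactly and compares it with the closed form 1 + (n − 1)/(2m). The function names are hypothetical, and `fractions.Fraction` is used only to avoid rounding.

```python
from fractions import Fraction


def average_successful_search(n, m):
    """Average of 1 + i/m over i = 0 .. n-1, computed term by term."""
    total = sum(1 + Fraction(i, m) for i in range(n))
    return total / n


def closed_form(n, m):
    """The simplified expression 1 + (n - 1) / (2m)."""
    return 1 + Fraction(n - 1, 2 * m)


n, m = 10, 7
print(average_successful_search(n, m))   # exact value of the sum
print(closed_form(n, m))                 # same value from the closed form
print(1 + Fraction(n, 2 * m))            # the upper bound 1 + alpha/2
```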

Open Addressing

Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method
a hash collision is resolved by probing, or searching through alternate locations in the array (the probe
sequence) until either the target record is found, or an unused array slot is found, which indicates that
there is no such key in the table.

Insert(k): Keep probing until an empty slot is found. Once an empty slot is found, insert k.
Search(k): Keep probing until the slot's key equals k or an empty slot is reached.
Delete(k): If we simply empty the slot of a deleted key, a later search may stop too early and fail. So, slots of
deleted keys are marked specially as "deleted".
Insert can place an item in a deleted slot, but search does not stop at a deleted slot.
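The three operations, including the special "deleted" marker, can be sketched as below. This is one possible arrangement rather than a definitive implementation: the class name, the sentinel objects `EMPTY` and `DELETED`, and the choice of a step-by-one probe sequence are all assumptions made for the example.

```python
EMPTY = object()    # slot that has never held a key
DELETED = object()  # slot whose key was removed (tombstone)


class OpenAddressingTable:
    def __init__(self, m=7):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        # Assumed probe sequence: start at key % m, then step by one.
        start = key % self.m
        for i in range(self.m):
            yield (start + i) % self.m

    def insert(self, key):
        first_free = None
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                if first_free is None:
                    first_free = idx
                break  # key cannot be stored beyond an empty slot
            if slot is DELETED:
                if first_free is None:
                    first_free = idx  # a deleted slot can be reused
            elif slot == key:
                return True  # already present
        if first_free is None:
            return False  # table is full
        self.slots[first_free] = key
        return True

    def search(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                return False  # an empty slot ends the search
            if slot is not DELETED and slot == key:
                return True   # deleted slots are skipped, not stopped at
        return False

    def delete(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                return False
            if slot is not DELETED and slot == key:
                self.slots[idx] = DELETED
                return True
        return False


table = OpenAddressingTable(m=7)
for key in (6, 13, 20):
    table.insert(key)      # 6 -> slot 6, 13 -> slot 0, 20 -> slot 1
table.delete(13)           # slot 0 becomes a tombstone
print(table.search(20))    # still found, because search passes the tombstone
```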

Linear Probing

In linear probing, we probe the following slots one after another.

Let h(x) be the slot index computed using the hash function and S be the table size.

If slot h(x) % S is full, then we try (h(x) + 1) % S.


If (h(x) + 1) % S is also full, then we try (h(x) + 2) % S.
...

Figure: Linear Probing

Performance Analysis of Linear Probing

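Let the size of the hash table be m and the current number of elements be n.

Let T(m, n) be the number of probes for an unsuccessful search.

$$P[h(k) \text{ is occupied}] = \frac{n}{m}$$

So,

$$E[T(m, n)] = 1 + \frac{n}{m} \cdot E[T(m-1, n-1)],$$

where the 1 accounts for the first probe, and E[T(m − 1, n − 1)] covers the case in which the current slot is
filled, leaving n − 1 elements in the remaining m − 1 slots. This treats every remaining slot as equally likely
to be occupied, so it does not model clustering.

Bounding the recurrence gives

$$\begin{aligned}
E[T(m, n)] &\le 1 + \frac{n}{m} \cdot \frac{m-1}{(m-1)-(n-1)} \\
&< 1 + \frac{n}{m-n} \\
&= \frac{m}{m-n} \\
&= \frac{1}{1 - \frac{n}{m}} \\
&= \frac{1}{1-\alpha} \\
&= 1 + \alpha + \alpha^2 + \alpha^3 + \alpha^4 + \dots
\end{aligned}$$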

If α = 1, then n = m (the array is full).

So, the bound 1/(1 − α) is unbounded: E[T(m, n)] < ∞.

If α = 1/2, then m = 2n (the array is half full).

So, E[T(m, n)] < 2 (a constant).
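To see how the bound behaves for concrete table sizes, the sketch below evaluates the recurrence E[T(m, n)] = 1 + (n/m) · E[T(m − 1, n − 1)] directly and compares it with m/(m − n) = 1/(1 − α). The function names and the base case E[T(m, 0)] = 1 (an empty table needs one probe) are assumptions of this example.

```python
def expected_probes(m, n):
    """Evaluate E[T(m, n)] = 1 + (n/m) * E[T(m-1, n-1)], with E[T(m, 0)] = 1."""
    if n == 0:
        return 1.0
    return 1 + (n / m) * expected_probes(m - 1, n - 1)


def probe_bound(m, n):
    """The upper bound m / (m - n) = 1 / (1 - alpha); requires n < m."""
    return m / (m - n)


for m, n in [(10, 1), (10, 5), (10, 9)]:
    alpha = n / m
    print(alpha, expected_probes(m, n), probe_bound(m, n))
```

The gap between the two columns widens as α approaches 1, where the bound grows without limit.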

The main problem with linear probing is clustering: many consecutive occupied slots form groups, and it
then takes longer to find a free slot or to search for an element.

Quadratic Probing

In this probing, h(x, i) = (h(x) + i*i) % m is used, where i is the attempt number and m is the table size.

Let h(x) be the slot index computed using the hash function.

If slot h(x) % m is full, then we try (h(x) + 1*1) % m.

If (h(x) + 1*1) % m is also full, then we try (h(x) + 2*2) % m.

If (h(x) + 2*2) % m is also full, then we try (h(x) + 3*3) % m.

..................................................

..................................................

With quadratic probing, some positions may never be probed for a given key, even when they are free.
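A short sketch makes this visible by listing the slots that a quadratic probe sequence reaches. The function name `quadratic_probe_sequence` and the example values (start slot 0, table size 7) are chosen for illustration.

```python
def quadratic_probe_sequence(start, m):
    """Slots visited by (start + i*i) % m for i = 0 .. m-1."""
    return [(start + i * i) % m for i in range(m)]


m = 7
sequence = quadratic_probe_sequence(0, m)
unreached = sorted(set(range(m)) - set(sequence))
print(sequence)    # [0, 1, 4, 2, 2, 4, 1]
print(unreached)   # [3, 5, 6]
```

Starting from slot 0 in a table of size 7, only slots 0, 1, 2 and 4 are ever probed, so a key hashing to 0 could fail to find a home even while slots 3, 5 and 6 are empty.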

Claim: the first ceil(m/2) probes are distinct when m is a prime number.

We prove it by contradiction.

Assume the first ceil(m/2) probes are not distinct: the i-th and j-th probes go to the same location, with 0 ≤ i < j < ceil(m/2).

So, (h(k) + i*i) % m = (h(k) + j*j) % m

⇒ i*i ≡ j*j (mod m)

⇒ i*i − j*j ≡ 0 (mod m)

⇒ (i + j)(i − j) ≡ 0 (mod m)

Because m is prime, a product is divisible by m only when at least one of its factors is divisible by m.

But 0 < j − i < m and 0 < i + j < m, since i < j < ceil(m/2), so neither (i + j) nor (i − j) is divisible by m.
This is a contradiction.

So, the first ceil(m/2) probes are distinct.
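The claim can also be checked numerically for small table sizes. The helper below is a hypothetical name, and the list of primes is just a sample; the non-prime size 8 is included to show that the guarantee depends on m being prime.

```python
def first_half_probes_distinct(m, start=0):
    """True if the first ceil(m/2) quadratic probes hit distinct slots."""
    half = (m + 1) // 2  # integer form of ceil(m / 2)
    probes = [(start + i * i) % m for i in range(half)]
    return len(set(probes)) == half


for m in (3, 5, 7, 11, 13, 101):
    print(m, first_half_probes_distinct(m))   # True for each prime

print(8, first_half_probes_distinct(8))       # False: probes 0, 1, 4, 1 repeat slot 1
```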

For an array of size m, m! different probe sequences are possible in principle. Linear and quadratic probing
can each produce only m of them, one per starting slot.

Double Hashing

In this probing, h(x, i) = (hash(x) + i*hash2(x)) % m is used, where i is the attempt number, m is the table
size, and hash2 is a second hash function. hash2(x) must never be 0, otherwise the probe never moves.

Let hash(x) be the slot index computed using the first hash function.

If slot hash(x) % m is full, then we try (hash(x) + 1*hash2(x)) % m.

If (hash(x) + 1*hash2(x)) % m is also full, then we try (hash(x) + 2*hash2(x)) % m.

If (hash(x) + 2*hash2(x)) % m is also full, then we try (hash(x) + 3*hash2(x)) % m.

..................................................

..................................................

Double hashing requires more computation time as two hash functions need to be computed.

Double hashing can produce on the order of m^2 probe sequences, since the sequence depends on both hash(x) and hash2(x).
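The sketch below generates a double hashing probe sequence. The text does not specify the two hash functions, so the choices here are assumptions: hash(x) = x % m and hash2(x) = 1 + (x % (m − 1)), which keeps the step between 1 and m − 1 so it is never zero.

```python
def double_hash_sequence(x, m):
    """Slots probed for key x under the assumed hash functions."""
    h1 = x % m               # assumed first hash: starting slot
    h2 = 1 + (x % (m - 1))   # assumed second hash: step size, never 0
    return [(h1 + i * h2) % m for i in range(m)]


m = 7
print(double_hash_sequence(6, m))    # [6, 0, 1, 2, 3, 4, 5]
print(double_hash_sequence(13, m))   # [6, 1, 3, 5, 0, 2, 4]
```

Keys 6 and 13 start at the same slot but then follow different sequences, which is what distinguishes double hashing from linear and quadratic probing. With a prime m, every step size from 1 to m − 1 visits all m slots.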

Open Addressing vs. Separate Chaining


Advantages of Chaining:
1. Chaining is simpler to implement.
2. In chaining, the hash table never fills up; we can always add more elements to a chain. In open
addressing, the table may become full.
3. Chaining is less sensitive to the hash function and the load factor.
4. Chaining is mostly used when it is unknown how many keys will be inserted or deleted, and how
frequently.
5. Open addressing requires extra care to avoid clustering and to keep the load factor low.
Advantages of Open Addressing:
1. The cache performance of chaining is poor because keys are stored in linked lists. Open addressing
provides better cache performance because everything is stored in the same table.
2. Chaining wastes space (some parts of the hash table are never used). In open addressing, a
slot can be used even if no input maps to it.
3. Chaining uses extra space for links.

Perfect Hashing
In perfect hashing, we can insert, remove and search in O(1) time in the worst case. This is possible if
we have some domain knowledge about the data.

It uses an idea similar to double hashing: whenever a collision could occur, a second-step hash function
gives the key a unique slot, so the data can be found in O(1) time.
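One way to picture how knowing the keys in advance helps is a brute-force search for a collision-free hash function over a fixed key set. This is only a single-level illustration, not the two-step scheme described above; the hash family ((a * k) % p) % m, the prime p = 101, the function name, and the sample keys are all assumptions of this sketch.

```python
def find_collision_free_hash(keys, p=101):
    """Search for (m, a) such that ((a * k) % p) % m has no collisions.

    Assumes the keys are distinct non-negative integers smaller than p.
    Tries the smallest table sizes first.
    """
    n = len(keys)
    for m in range(n, p + 1):
        for a in range(1, p):
            slots = {((a * k) % p) % m for k in keys}
            if len(slots) == n:
                return m, a
    raise ValueError("no collision-free parameters found")


keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]   # sample key set, known up front
m, a = find_collision_free_hash(keys)
table = [None] * m
for k in keys:
    table[((a * k) % 101) % m] = k            # every key gets its own slot
print(m, a)
```

Once such parameters are found, every lookup inspects exactly one slot, which is the worst-case O(1) behaviour the section describes.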