
Week 14: Hashing

STIA2024
Data Structures & Algorithm
Analysis

Chapter Contents
 What is Hashing?
 Hash Functions
 Computing Hash Codes
 Compressing a Hash Code into an Index for the Hash Table
 Resolving Collisions
 Open Addressing with Linear Probing
 Open Addressing with Quadratic Probing

 Separate Chaining

Learning Objectives
 To describe the basic idea of hashing,
 To describe the purpose of a hash table and a hash function,
 To describe how a hash function compresses a hash code into an index to the hash table,
 To explain what collisions are and why they occur,
 To describe open addressing as a method to resolve collisions,
 To describe linear probing and quadratic probing as particular open addressing schemes,
 To describe separate chaining as a method to resolve collisions, and
 To describe the relative efficiencies of various collision resolution techniques.
Chapter Contents (ctd.)

 Efficiency
 The Load Factor
 The Cost of Open Addressing
 The Cost of Separate Chaining

What is Hashing?
 A technique that determines an index or location
for storage of an item in a data structure
 The hash function receives the search key
 Returns the index of an element in an array
called the hash table
 The index is known as the hash index
 Hashing can be an excellent choice when searching is the primary task.
 Ideally, hashing can result in O(1) search time.
 A perfect hash function maps each search key into
a different integer suitable as an index to the hash
table

What is Hashing?

Fig. 1: A hash function indexes its hash table.


What is Hashing?
 Two steps of the hash function
 Convert the search key into an integer
called the hash code
 Compress the hash code into the range
of indices for the hash table
 Typical hash functions are not perfect
 They can allow more than one search
key to map into a single index
 This is known as a collision
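The two steps and a resulting collision can be sketched in a few lines of Java. This is only an illustration: the class name `CollisionDemo`, the helper `indexFor`, and the table size 11 are assumptions, not part of the course material.

```java
// Hypothetical demo: distinct search keys that end up at the same index.
class CollisionDemo {
    // Step 1: search key -> hash code. Step 2: hash code -> index in [0, tableSize).
    static int indexFor(Object key, int tableSize) {
        int hashCode = key.hashCode();              // step 1
        return Math.floorMod(hashCode, tableSize);  // step 2, never negative
    }

    public static void main(String[] args) {
        int tableSize = 11; // assumed prime table size
        // "Aa" and "BB" have the same hash code under Java's String.hashCode.
        System.out.println(indexFor("Aa", tableSize));
        System.out.println(indexFor("BB", tableSize));
        // 5 and 16 have different hash codes but compress to the same index.
        System.out.println(indexFor(5, tableSize));
        System.out.println(indexFor(16, tableSize));
    }
}
```

The first pair collides already at the hash code step, the second pair only at the compression step; either way the table sees two keys competing for one location.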

What is Hashing?

Fig. 2: A collision caused by the hash function h


Hash Functions

 General characteristics of a good hash function
 Minimize collisions
 Distribute entries uniformly
throughout the hash table
 Be fast to compute

Computing Hash Codes
 We will override the hashCode method of
Object
 Guidelines
 If a class overrides the method equals, it should
override hashCode
 If the method equals considers two objects equal,
hashCode must return the same value for both
objects
 If an object invokes hashCode more than once during the execution of a program on the same data, it must return the same hash code each time
 An object's hash code during one execution of a program can differ from its hash code during another execution of the same program
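A minimal sketch of a class that follows these guidelines is shown here. The class `StudentId` and its fields are hypothetical and exist only for illustration; the point is that hashCode is computed from exactly the fields that equals compares.

```java
import java.util.Objects;

// Hypothetical key class; the name and fields are assumptions for illustration.
final class StudentId {
    private final String faculty; // assumed to be non-null
    private final int number;

    StudentId(String faculty, int number) {
        this.faculty = faculty;
        this.number = number;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof StudentId)) return false;
        StudentId that = (StudentId) other;
        return number == that.number && faculty.equals(that.faculty);
    }

    // Uses exactly the fields that equals compares, so two objects that are
    // equal always produce the same hash code.
    @Override
    public int hashCode() {
        return Objects.hash(faculty, number);
    }
}
```

Because the fields are final, repeated calls to hashCode on the same object return the same value for as long as the program runs.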
Computing Hash Codes
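 The hash code for a string, s
int g = 31; // g is a positive constant; 31 is the value Java's own String.hashCode uses
int hash = 0;
int n = s.length();
for (int i = 0; i < n; i++)
    hash = g * hash + s.charAt(i); // Horner's method; int overflow is expected and harmless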

 Hash code for a primitive type
 Use the primitive-typed key itself (e.g. int)
 Manipulate the internal binary representation
 Use folding (combine parts of the representation with a bit-wise boolean operation such as exclusive or)
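The options for primitive keys might look as follows in Java. The class `PrimitiveHashCodes` and its `hashOf` overloads are hypothetical names; the folding formula is the common one of XOR-ing the upper and lower 32 bits of a 64-bit value.

```java
// Hypothetical helpers showing common ways to hash primitive keys.
class PrimitiveHashCodes {
    // An int key can serve as its own hash code.
    static int hashOf(int key) {
        return key;
    }

    // Folding: XOR the upper 32 bits of a long into the lower 32 bits.
    static int hashOf(long key) {
        return (int) (key ^ (key >>> 32));
    }

    // Reinterpret the double's internal bits as a long, then fold.
    static int hashOf(double key) {
        return hashOf(Double.doubleToLongBits(key));
    }
}
```

Folding lets every bit of the key influence the hash code, whereas a plain cast from long to int would discard the upper half.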
Compressing a Hash Code
 Must compress the hash code so it fits into
the index range
 Typical method for a code c is to compute
c modulo n (c%n)
 n is a prime number (the size of the
table)
 Index will then be between 0 and n – 1
private int getHashIndex(Object key)
{
    int hashIndex = key.hashCode() % hashTable.length;
    if (hashIndex < 0)
        hashIndex = hashIndex + hashTable.length;
    return hashIndex;
} // end getHashIndex
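To see why the adjustment for a negative result is needed, the same logic can be tried on concrete numbers. In this sketch the class `CompressDemo` and the method `compress` are assumed names, and the table size is passed in explicitly instead of being read from a hashTable field.

```java
// Hypothetical stand-alone version of the compression step.
class CompressDemo {
    // Same logic as getHashIndex, with the table size as a parameter.
    static int compress(int hashCode, int tableSize) {
        int index = hashCode % tableSize; // in Java, % keeps the sign of hashCode
        if (index < 0)
            index = index + tableSize;
        return index;
    }

    public static void main(String[] args) {
        System.out.println(compress(2112, 11)); // 2112 = 11 * 192, so the index is 0
        System.out.println(compress(-7, 11));   // -7 % 11 is -7, adjusted to 4
    }
}
```

Adding the table size is safer than calling Math.abs on the hash code, because Math.abs(Integer.MIN_VALUE) is still negative. The library method Math.floorMod gives the same result as `compress` in one call.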
Resolving Collisions

 Options when the hash function returns a location already used in the table
 Use another location in the table
(open addressing)
 Change the structure of the hash table
so that each array location can
represent multiple values (separate
chaining)

Open Addressing with Linear
Probing
 An open addressing scheme locates an alternate location in the hash table that is available, or open.
 Locating an open location in the hash table is called probing.
 Linear probing
 Resolves a collision by examining consecutive locations in the hash table, beginning at the original hash index, until it finds the next available location.
 If a collision occurs at hashTable[k], look successively at locations k + 1, k + 2, …, wrapping around to the beginning of the table after the last location.
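The probing rule can be sketched as below. The class `LinearProbingDemo`, the method `add`, and the choice of non-negative int keys in a table of size 11 are assumptions made to keep the example short; removal and duplicate keys are deliberately ignored here.

```java
import java.util.Arrays;

// Hypothetical sketch of adding keys with linear probing.
class LinearProbingDemo {
    // Adds key to the table and returns the index used, or -1 if the table
    // is full. Assumes non-negative keys and that null marks an open location.
    static int add(Integer[] table, int key) {
        int n = table.length;
        int start = key % n; // original hash index k
        for (int i = 0; i < n; i++) {
            int index = (start + i) % n; // k, k + 1, k + 2, ... with wrap-around
            if (table[index] == null) {
                table[index] = key;
                return index;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[11];
        // 9, 20, 31 and 42 all hash to index 9 in a table of size 11.
        for (int key : new int[] {9, 20, 31, 42})
            System.out.println(key + " -> " + add(table, key));
        System.out.println(Arrays.toString(table));
    }
}
```

The four colliding keys land at indices 9, 10, 0 and 1: the probe runs off the end of the array and continues from the beginning.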
Open Addressing with Linear
Probing

Fig. 3: The effect of linear probing after adding four entries whose search keys hash to the same index.
Open Addressing with Linear
Probing

Fig. 4: A revision of the hash table shown in 19-3 when linear probing resolves collisions; each entry contains a search key and its associated value.
Removals

Fig. 5: A hash table if remove used null to remove entries.
Removals
 We need to distinguish among three
kinds of locations in the hash table
1. Occupied
 The location references an entry in the
dictionary
2. Empty
 The location contains null and always did
3. Available
 The location's entry was removed from the
dictionary
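A small experiment shows why a removed location must not simply be reset to null. The class `RemovalDemo`, the marker value -1 for the available state, and the restriction to non-negative int keys are all assumptions of this sketch.

```java
// Hypothetical demo: removing with null breaks a later search,
// removing with an "available" marker does not.
class RemovalDemo {
    static final Integer AVAILABLE = -1; // assumed marker; real keys must be non-negative

    // Linear-probing search that stops at the first empty (null) location.
    static boolean contains(Integer[] table, int key) {
        int n = table.length;
        int start = key % n;
        for (int i = 0; i < n; i++) {
            Integer entry = table[(start + i) % n];
            if (entry == null) return false; // empty location ends the search
            if (entry == key) return true;   // unboxes entry and compares values
        }
        return false;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[7];
        table[3] = 3; table[4] = 10; table[5] = 17; // all three keys hash to 3
        table[4] = null;                            // "remove" 10 by writing null
        System.out.println(contains(table, 17));    // search stops early at index 4
        table[4] = AVAILABLE;                       // mark the location as available instead
        System.out.println(contains(table, 17));    // search continues past index 4
    }
}
```

With null at index 4 the search for 17 gives up before reaching index 5; with the available marker it keeps probing and finds the key.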

Open Addressing with Linear
Probing

Fig. 6: A linear probe sequence (a) after adding an entry; (b) after removing two entries;
Open Addressing with Linear
Probing

Fig. 6: A linear probe sequence (c) after a search; (d) during the search while adding an entry; (e) after an addition to a formerly occupied location.
Searches that Dictionary
Operations Require
 To retrieve an entry
 Search the probe sequence for the key
 Examine entries that are present; skip over locations in the available state
 Stop the search when the key is found or a null (empty) location is reached
 To remove an entry
 Search the probe sequence same as for retrieval
 If key is found, mark location as available
 To add an entry
 Search probe sequence same as for retrieval
 Note first available slot
 Use available slot if the key is not found
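The three searches above could be combined into a small dictionary class along the following lines. This is a sketch under stated assumptions, not the textbook's implementation: the class `ProbingDictionary`, its method names, the fixed capacity, and the sentinel object `AVAILABLE` are all hypothetical, and no resizing is done.

```java
// Hypothetical sketch of a linear-probing dictionary with an available state.
class ProbingDictionary {
    private static final Object AVAILABLE = new Object(); // marks a removed entry
    private final Object[] keys;   // null = empty, AVAILABLE = removed
    private final Object[] values;

    ProbingDictionary(int capacity) {
        keys = new Object[capacity];
        values = new Object[capacity];
    }

    private int home(Object key) {
        return Math.floorMod(key.hashCode(), keys.length);
    }

    // Returns the index holding key, or -1 if an empty location is reached
    // (or every location has been examined) without finding it.
    private int locate(Object key) {
        int start = home(key);
        for (int i = 0; i < keys.length; i++) {
            int index = (start + i) % keys.length;
            if (keys[index] == null) return -1; // empty: stop searching
            if (keys[index] != AVAILABLE && keys[index].equals(key)) return index;
        }
        return -1;
    }

    Object getValue(Object key) {
        int index = locate(key);
        return index < 0 ? null : values[index];
    }

    Object remove(Object key) {
        int index = locate(key);
        if (index < 0) return null;
        Object old = values[index];
        keys[index] = AVAILABLE; // not null, so later searches keep probing
        values[index] = null;
        return old;
    }

    // Adds or replaces an entry; returns false only if no location is usable.
    boolean add(Object key, Object value) {
        int start = home(key);
        int firstFree = -1; // first available or empty location seen so far
        for (int i = 0; i < keys.length; i++) {
            int index = (start + i) % keys.length;
            if (keys[index] == null) {
                if (firstFree < 0) firstFree = index;
                break; // the key is not in the table
            }
            if (keys[index] == AVAILABLE) {
                if (firstFree < 0) firstFree = index;
            } else if (keys[index].equals(key)) {
                values[index] = value; // key found: replace its value
                return true;
            }
        }
        if (firstFree < 0) return false;
        keys[firstFree] = key;
        values[firstFree] = value;
        return true;
    }
}
```

Note that `add` keeps searching after it has seen an available location, because the key may still appear later in the probe sequence; only when the key is not found does it reuse the first available location.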
Open Addressing, Quadratic
Probing
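 Change the probe sequence
 Given hash index k for the search key
 Probe k + 1, k + 2², k + 3², …, k + j², each taken modulo the table size
 Reaches at least half of the locations in the hash table if the table size is a prime number, so an addition can always find a location while the table is at most half full
 Avoids primary clustering
 But can lead to secondary clustering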
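The quadratic probe sequence can be generated with a few lines of Java. The class `QuadraticProbeDemo`, the method `probeSequence`, and the example values (hash index 3, table size 11) are assumptions chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch that lists the locations examined by quadratic probing.
class QuadraticProbeDemo {
    // Returns the first `count` indices examined for a key whose hash index
    // is k: k, k + 1, k + 4, k + 9, ..., each taken modulo the table size.
    static int[] probeSequence(int k, int tableSize, int count) {
        int[] sequence = new int[count];
        for (int j = 0; j < count; j++)
            sequence[j] = (k + j * j) % tableSize;
        return sequence;
    }

    public static void main(String[] args) {
        // Probe sequence of length 5 starting at index 3 in a table of size 11.
        System.out.println(Arrays.toString(probeSequence(3, 11, 5)));
    }
}
```

For these values the sequence is 3, 4, 7, 1, 8. Extending it to 11 probes visits only 6 distinct locations, which is why quadratic probing works best when the table is kept at most half full.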

Open Addressing, Quadratic
Probing

Fig. 7: A probe sequence of length 5 using quadratic probing.
Separate Chaining
 Alter the structure of the hash table
 Each location can represent multiple
values
 Each location called a bucket
 Bucket can be a/an
 List
 Sorted list
 Chain of linked nodes
 Array
 Vector
 Resolving collisions by using buckets that are linked chains.
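A minimal sketch of separate chaining with linked-chain buckets follows. The class `ChainedDictionary`, its inner `Node` class, and the decision to insert new entries at the head of the chain are assumptions of this example; other insertion positions are equally possible.

```java
// Hypothetical sketch of separate chaining; each bucket is a chain of nodes.
class ChainedDictionary {
    private static final class Node {
        final Object key;
        Object value;
        Node next;

        Node(Object key, Object value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node[] buckets;
    private int size; // number of entries currently stored

    ChainedDictionary(int capacity) {
        buckets = new Node[capacity];
    }

    private int indexFor(Object key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    // Replaces the value if the key is already in its chain, otherwise
    // links a new node at the head of the chain.
    void add(Object key, Object value) {
        int index = indexFor(key);
        for (Node node = buckets[index]; node != null; node = node.next) {
            if (node.key.equals(key)) {
                node.value = value;
                return;
            }
        }
        buckets[index] = new Node(key, value, buckets[index]);
        size++;
    }

    // Searches only the one chain that the key hashes to.
    Object getValue(Object key) {
        for (Node node = buckets[indexFor(key)]; node != null; node = node.next)
            if (node.key.equals(key)) return node.value;
        return null;
    }

    double loadFactor() {
        return (double) size / buckets.length;
    }
}
```

A collision here simply lengthens one chain, so the table never becomes full and the ratio of entries to buckets can grow beyond 1.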
Separate Chaining

Fig. 9: A hash table for use with separate chaining; each bucket is a chain of linked nodes.
Separate Chaining

Fig. 10: Where a new entry is inserted into a linked bucket when integer search keys are (a) duplicate and unsorted;
Separate Chaining

Fig. 10: Where a new entry is inserted into a linked bucket when integer search keys are (b) distinct and unsorted;
Separate Chaining

Fig. 10: Where a new entry is inserted into a linked bucket when integer search keys are (c) distinct and sorted
Separate Chaining

 Separate chaining provides an efficient and simple way to resolve collisions.
 However, separate chaining requires more
memory than open addressing.

Efficiency Observations

 Successful retrieval or removal


 Same efficiency as successful search
 Unsuccessful retrieval or removal
 Same efficiency as unsuccessful search
 Successful addition
 Same efficiency as unsuccessful search
 Unsuccessful addition
 Same efficiency as successful search

Load Factor

 A perfect hash function is not always possible or practical
 Thus, collisions are likely to occur
 As the hash table fills
 Collisions occur more often
 Measure of table fullness: the load factor (the ratio of the number of entries in the dictionary to the size of the hash table)

Load Factor

 λ is zero when the hash table is empty
 For open addressing, the maximum value of λ is 1, reached when the hash table is full
 λ does not measure the number of locations in the available state
 For separate chaining, λ has no maximum value
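Written as a formula, the load factor and its range look as follows. The symbol names for the two counts are illustrative assumptions and do not appear on the slides.

```latex
\lambda = \frac{N_{\text{entries}}}{N_{\text{locations}}}, \qquad
0 \le \lambda \le 1 \ \text{(open addressing)}, \qquad
\lambda \ge 0 \ \text{(separate chaining)}
```

For example, 7 entries in a table of 11 locations give a load factor of 7/11, or about 0.64.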

Cost of Open Addressing

Note: Reasonable efficiency requires only λ < 0.5

Fig. 11: The average number of comparisons required by a search of the hash table for given values of the load factor when using linear probing.
Cost of Open Addressing

Note: for quadratic probing or double hashing, should have λ < 0.5

Fig. 12: The average number of comparisons required by a search of the hash table for given values of the load factor when using either quadratic probing or double hashing.
Cost of Separate Chaining

Note: Reasonable efficiency requires only λ < 1

Fig. 13: Average number of comparisons required by search of hash table for given values of load factor when using separate chaining.
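The figures themselves are not reproduced here. As a rough guide, the standard textbook approximations for the average number of comparisons can be tabulated with the sketch below. Treat the formulas as an assumption about what Fig. 11 to Fig. 13 plot, since the slides do not state them; the class `HashingCosts` and its method names are hypothetical, and the parameter `lf` stands for the load factor.

```java
// Hypothetical helper using the standard approximations for search cost.
class HashingCosts {
    // Linear probing.
    static double linearUnsuccessful(double lf) { return 0.5 * (1 + 1 / ((1 - lf) * (1 - lf))); }
    static double linearSuccessful(double lf)   { return 0.5 * (1 + 1 / (1 - lf)); }

    // Quadratic probing or double hashing.
    static double doubleUnsuccessful(double lf) { return 1 / (1 - lf); }
    static double doubleSuccessful(double lf)   { return (1 / lf) * Math.log(1 / (1 - lf)); }

    // Separate chaining.
    static double chainingUnsuccessful(double lf) { return lf; }
    static double chainingSuccessful(double lf)   { return 1 + lf / 2; }

    public static void main(String[] args) {
        System.out.println("load  lin-u  lin-s  dbl-u  dbl-s  chn-u  chn-s");
        for (double lf : new double[] {0.1, 0.3, 0.5, 0.7, 0.9})
            System.out.printf("%4.1f %6.1f %6.1f %6.1f %6.1f %6.1f %6.1f%n", lf,
                    linearUnsuccessful(lf), linearSuccessful(lf),
                    doubleUnsuccessful(lf), doubleSuccessful(lf),
                    chainingUnsuccessful(lf), chainingSuccessful(lf));
    }
}
```

At a load factor of 0.5, linear probing needs about 2.5 comparisons for an unsuccessful search and 1.5 for a successful one, which is consistent with the advice to keep the load factor below 0.5 for open addressing. The costs for open addressing grow sharply as the load factor approaches 1, while those for separate chaining grow only linearly.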
References
 Carrano, F. M., & Savitch, W. Data Structures and Abstractions with Java. Chapter 19.
 Hubbard, J. R., & Huray, A. Data Structures with Java. Chapter 9.
Conclusion

Q & A Session
