You are on page 1of 50

# Chapter 9: Hashing

• Basic concepts
• Hash functions
• Collision resolution
• Bucket hashing

1
Basic Concepts
• Sequential search: O(n) Requiring several
key comparisons
• Binary search: O(log2n) before the target is
found

2
Basic Concepts
• Search complexity:
Size Binary Sequential Sequential
(Average) (Worst Case)
16 4 8 16
50 6 25 50
256 8 128 256
1,000 10 500 1,000
10,000 14 5,000 10,000
100,000 17 50,000 100,000
1,000,000 20 500,000 1,000,000

3
Basic Concepts
• Is there a search algorithm whose
complexity is O(1)?

4
Basic Concepts
• Is there a search algorithm whose
complexity is O(1)?
YES.

5
Basic Concepts
memory

hashing

## Each key has only one

Basic Concepts
001 Harry Lee

005 Vu Nguyen
005
Vu Nguyen
102002 Hash 100 007 Ray Black

107095
Sarah Trapp

7
Basic Concepts
function.

## • Prime area: memory that contains all the

8
Basic Concepts
• Synonyms: a set of keys that hash to the
same location.

## • Collision: the location of the data to be

inserted is already occupied by the
synonym data.

9
Basic Concepts
• Ideal hashing:
– No location collision

10
Basic Concepts
Insert A, B, C
hash(A) = 9
hash(B) = 9
hash(C) = 17

A
[1] [5] [9] [17]

11
Basic Concepts
Insert A, B, C
hash(A) = 9
hash(B) = 9 B and A
hash(C) = 17 collide at 9

A B
[1] [5] [9] [17]

Collision
Resolution

12
Basic Concepts
Insert A, B, C
hash(A) = 9
hash(B) = 9 B and A C and B
hash(C) = 17 collide at 9 collide at 17

C A B
[1] [5] [9] [17]

Collision
Resolution

13
Basic Concepts
Searh for B
hash(A) = 9
hash(B) = 9
hash(C) = 17

C A B
[1] [5] [9] [17]

Probing

14
Hash Functions
• Direct hashing
• Modulo division
• Digit extraction
• Mid-square
• Folding
• Rotation
• Pseudo-random

15
Direct Hashing
• The address is the key itself:

hash(Key) = Key

16
Direct Hashing
• Advantage: there is no collision.

size) is as large as the key space

17
Modulo Division

## • Fewer collisions if listSize is a prime

number

• Example:
Numbering system to handle 1,000,000 employees
Data space to store up to 300 employees

## hash(121267) = 121267 MOD 307 + 1 = 2 + 1 = 3

18
Digit Extraction
Address = selected digits from Key

• Example:
379452 → 394
121267 → 112
378845 → 388
160252 → 102
045128 → 051

19
Mid-square

## Address = middle digits of Key2

• Example:
9452 * 9452 = 89340304 → 3403

20
Mid-square
• Disadvantage: the size of the Key2 is too
large

## • Variations: use only a portion of the key

379452: 379 * 379 = 143641 → 364
121267: 121 * 121 = 014641 → 464
045128: 045 * 045 = 002025 → 202

21
Folding
• The key is divided into parts whose size
Key = 123|456|789

fold shift
123 + 456 + 789 = 1368
⇒ 368

22
Folding
• The key is divided into parts whose size
Key = 123|456|789

## fold shift fold boundary

123 + 456 + 789 = 1368 321 + 456 + 987 = 1764
⇒ 368 ⇒ 764

23
Rotation
• Hashing keys that are identical except for
the last character may create synonyms.

## • The key is rotated before hashing.

original key rotated key
600101 160010
600102 260010
600103 360010
600104 460010
600105 560010

24
Rotation
• Used in combination with fold shift

## original key rotated key

600101 → 62 160010 → 26
600102 → 63 260010 → 36
600103 → 64 360010 → 46
600104 → 65 460010 → 56
600105 → 66 560010 → 66

25
Pseudorandom

## Pseudorandom Rando Modulo

Number Generator m Division
Number

y = ax +
c
For maximum efficiency, a and c should be prime
numbers

26
Pseudorandom
• Example:

## Address = ((17*121267 + 7) MOD 307 + 1

= (2061539 + 7) MOD 307 + 1
= 2061546 MOD 307 + 1
= 41 + 1
= 42

27
Collision Resolution
• Except for the direct hashing, none of the
others are one-to-one mapping
⇒ Requiring collision resolution methods

## • Each collision resolution method can be

used independently with each hash function

28
Collision Resolution
• A rule of thumb: a hashed list should not be
allowed to become more than 75% full.

α = (k/n) x 100
n = list size
k = number of filled elements

29
Collision Resolution
• As data are added and collisions are
resolved, hashing tends to cause data to
group within the list
⇒ Clustering: data are unevenly distributed across
the list

## • High degree of clustering increases the

number of probes to locate an element
⇒ Minimize clustering

30
Collision Resolution
• Primary clustering: data become clustered

## Insert A9, B9, C9, D11, E12

A B C D E
[1] [9] [10] [11] [12] [13]

31
Collision Resolution
• Secondary clustering: data become
grouped along a collision path throughout a
list.

## Insert A9, B9, C9, D11, E12, F9

A B D E C F
[1] [9] [10] [11] [12] [13] [14] [23]

32
• When a collision occurs, an unoccupied element is
searched for placing the new element in.

## • There are different methods:

 Linear probe
 Double hashing
 Key offset

33
Linear Probe
• When a home address is occupied, go to the next

...

34
Linear Probe
001 Mary Dodd (379452)
002 Sarah Trapp (070918)
003 Bryan Devaux (121267)

002
Harry Eagle Hash
166702
008 John Carver (378845)

## 306 Tuan Ngo (160252)

307 Shouli Feldman (045128)

35
Linear Probe
001 Mary Dodd (379452)
002 Sarah Trapp (070918)
003 Bryan Devaux (121267)
004 Harry Eagle (166702)

002
Harry Eagle Hash
166702
008 John Carver (378845)

## 306 Tuan Ngo (160252)

307 Shouli Feldman (045128)

36
Linear Probe
 quite simple to implement
 data tend to remain near their home address

 produces primary clustering

37
• The address increment is the collision probe
number squared (12, 22, 32, ...).

## • Modulo is used not to run off the end of the

38
• Disadvantage: time required to square the probe
number.

(n + 1)2 = n2 + 2n + 1

Increment
Factor

39

## Probe Collision Probe2 and New Increment Next

Number Location Increment Address Factor Increment

1 1 12 = 1 02 1 1
2 2 22 = 4 06 3 4
3 6 32 = 9 15 5 9
4 15 42 = 16 31 7 16
5 31 52 = 25 56 9 25
6 56 62 = 36 92 11 36
7 92 72 = 49 41 13 49

40
• Limitation: it is not possible to generate a new
address for every element in the list.

## ⇒ Use a list size that is a prime number

(more elements reachable)

41
Pseudorandom Probe

## a should be a large prime number

42
Pseudorandom Probe

## • Significant limitation (of linear and quadratic

probes as well): there is only one collision path for
all keys.
⇒ produces secondary clustering

43
Secondary Clustering

## hash(A) = hash(B) = hash(C) = 9

hash(D) = 11

A C B D
[1] [9] [11] [13] [15]

Linear probes
Search/find location for D

44
Key Offset

45
Key Offset

## Key Home Key Offset Probe 1 Probe 2

166702 2 543 239 169
572556 2 1,865 026 050
067234 2 219 222 135

## offset = [166702 / 307 = 543]

address = (543 + 02) MOD 307 + 1 = 239

## address = (543 + 239) MOD 307 + 1 = 169

46
collision resolution increases the probability for
future collisions.
⇒ use linked lists to store synonyms

47
001 Mary Dodd (379452)
002 Sarah Trapp (070918) Harry Eagle (166702)
003 Bryan Devaux (121267)
Chris Walljasper (572556)

overflow
008 John Carver (378845)
area
prime
area
306 Tuan Ngo (160252)
307 Shouli Feldman (045128)

48
Bucket Hashing
• Hashing data to buckets that can hold multiple
pieces of data.

## • Each bucket has an address and collisions are

postponed until the bucket is full.

49
Bucket Hashing
Mary Dodd (379452)
001

## Sarah Trapp (070918)

002 Harry Eagle (166702)
Ann Georgis (367173)
Bryan Devaux (121267)
003
linear
Chris Walljasper(572556)
probe

307

50