
CS 2604 Spring 2004

The Cost of Searching

Hashing

Given a collection of N equally-likely data values, any search algorithm that proceeds by
comparing data values to each other must, on average, perform Ω(log N)
comparisons in carrying out a search.

There are several simple ways to achieve Θ(log N) in the worst case as well, including:
- binary search on a sorted array
- search in a balanced binary search tree
- search in a skip list (where the bound holds in expectation rather than in the strict worst case)

But, is there some way to beat the limit in the theoretical statement above?
There seem to be two possible openings:
- what if the data values are not equally-likely to be the target of a random search?
- what if the search process does not compare data elements?
In either case, the theorem would not apply.

Computer Science Dept Va Tech November 2005

Data Structures & File Management

2000-2005 McQuain WD

Searching without Comparisons


How could a search algorithm proceed without comparing data elements?


What if we had some sort of oracle that could take the key for a data value and
compute, in constant-bounded time, the location at which that key would occur within the
data collection?

[Figure: a data key K is passed to the oracle, which returns, in Θ(1) time, the location L of the matching record within the data collection.]

If the container storing the collection supports random access with Θ(1) cost, as an
array does, then we would have a total search cost of Θ(1).

William D McQuain January 2004

Hash Functions and Hash Tables

hash function: a function that can take a key value and compute an integer value (or an
index in a table) from it

A hash table employs a hash function, H, that maps key values to non-negative integers
(or to table index values).
When a record is added to the table, its key is processed by the hash function to produce
a location within the table, and the record will be stored at that location.
For example, student records for a class could be stored in an array C of dimension
10000 by truncating the student's ID number to its last four digits:
H(IDNum) = IDNum % 10000
Given an ID number X, the corresponding record would be inserted at C[H(X)].
Of course, there's a problem: what if the function produces the same location in the
table for two different key values? That is a collision.

Ideally, we can find a perfect hash function, one which never does that.
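
As a minimal sketch of the student-ID example (the name hashId and the sample ID numbers in the note below are made up for illustration), truncating to the last four digits might look like this:

```cpp
#include <cstddef>

// Table dimension taken from the student-record example.
const std::size_t TABLE_SIZE = 10000;

// H(IDNum) = IDNum % 10000: keep only the last four decimal digits.
// The name hashId is hypothetical, chosen for this sketch.
std::size_t hashId(unsigned long idNum) {
    return idNum % TABLE_SIZE;
}
```

With this function, the hypothetical IDs 904123456 and 904993456 both map to index 3456, so their records would compete for C[3456]: a collision.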
Hash Table Insertion

Simple insertion of an entry to a hash table involves two phases:

[Figure: a record with fields IDNumber, Name, LocalAddress, HomeAddress, Major, and Level. Its IDNumber is passed to H(), which yields a table index K. Each table slot is marked Vacant or Filled.]

The appropriate record key value must be hashed to yield a table index. If that slot is
vacant, the record is inserted there. If not, we have a collision.
Collisions

The table slot that is produced by the hash function (and subsequent mod operation) is
usually referred to as the home slot for the key value.
When two different key values are mapped to the same home slot in the hash table, we
say that a collision has occurred.
- type I:   H(K1) == H(K2) but K1 != K2
- type II:  H(K1) != H(K2) but H(K1) % N == H(K2) % N   (here N denotes the table size)

Whether collisions occur depends both on the way in which the key values are hashed,
and on the table size.
In general-purpose hash situations, it is often impractical (or impossible) to choose the
hash function and the table size so that collisions are logically impossible.

Basic Hashing Issues

How can a good hash function be found?


- clearly the logic of the function must depend upon the type of the key
- but, it should also depend upon the set of key values that will actually be hashed,
not just on the set of all possible key values
- it would seem to be good if the function was one-to-one
- it must be possible to compute the function efficiently
How should we choose the size of the table?
- if we know (in advance) how many records will be stored, is that sufficient?
- if we have no idea how many records will be stored, what do we do?
- there may be too many logically-possible key values to make the table large
enough to handle all of them at once (e.g., social security numbers, alpha-numeric
strings of length 10)
- should each slot in the table store a single element, or more than one?

Hash Functions

Suppose we have N records, and a table of M slots, where N ≤ M.

- there are M^N different ways to map the records into the table, if we don't worry
about mapping two records to the same slot
- the number of different perfect mappings of the records into different slots in the
table would be
      P(M, N) = M! / (M - N)!
- for instance, if N = 50 and M = 100, there are 10^100 different possible hash
mappings, only about 10^94 of which are perfect (roughly 1 in 1,000,000)
- so, there is no shortage of potential perfect hash functions (in theory)
- however, we need one that is effectively computable, that is, it must be possible to
compute it (so we need a formula for it) and it must be efficiently computable
- there are a number of common approaches, but the design of good, practical hash
functions must still be considered a topic of research and experiment
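
The counts above overflow any built-in integer type, so one way to check them is to work with base-10 logarithms. This is a sketch only; the function names are invented here, and std::lgamma (the log of the gamma function, with lgamma(k + 1) = ln k!) stands in for the factorials:

```cpp
#include <cmath>

// log10 of M^N: the number of ways to map N records into M slots
// when two records are allowed to share a slot.
double log10AllMappings(unsigned n, unsigned m) {
    return n * std::log10(static_cast<double>(m));
}

// log10 of P(M, N) = M! / (M - N)!, the number of perfect mappings.
// Uses lgamma to avoid overflow; assumes n <= m.
double log10PerfectMappings(unsigned n, unsigned m) {
    double lnP = std::lgamma(m + 1.0) - std::lgamma(m - n + 1.0);
    return lnP / std::log(10.0);
}
```

For N = 50 and M = 100 the first function returns 100 and the second returns a value near 93.5, which agrees with the order-of-magnitude figures quoted in the list.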

Simple Hash Example

It is usually desirable to have the entire key value affect the hash result (so simply
chopping off the last k digits of an integer key is NOT a good idea in most cases).
Consider the following function to hash a string value into an integer range:
unsigned int strHash(string toHash) {
   unsigned int hashValue = 0;
   // add the character code of every element of the string
   for (int Pos = 0; Pos < toHash.length(); Pos++) {
      hashValue = hashValue + int(toHash.at(Pos));
   }
   return hashValue;
}

Hashing: hash
   h: 104
   a:  97
   s: 115
   h: 104
   Sum: 420
Mod by table size to get the index.

This takes every element of the string into account. A string hash function that
truncated to the last three characters would compute the same integer for "hash",
"stash", "mash", and "trash".

Hash Function Techniques

Division
- the first order of business for a hash function is to compute an integer value
- if we expect the hash function to produce a valid index for our chosen table size,
that integer will probably be out of range
- that is easily remedied by modding the integer by the table size
- there is some reason to believe that it is better if the table size is a prime, or at least
has no small prime factors
Folding
- portions of the key are often recombined, or folded together
- shift folding:     123-45-6789 → 123 + 456 + 789
- boundary folding:  123-45-6789 → 123 + 654 + 789
- can be efficiently performed using bitwise operations
- the characters of a string can be XORed together, but small numbers result
- chunks of characters can be XORed instead, say in integer-sized chunks
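
The two folding variants can be sketched for a nine-digit key such as 123456789. The function names and the fixed three-digit chunk size are assumptions made for this illustration:

```cpp
// Reverse the decimal digits of a three-digit chunk, e.g. 456 -> 654.
unsigned long reverse3(unsigned long chunk) {
    return (chunk % 10) * 100 + ((chunk / 10) % 10) * 10 + chunk / 100;
}

// Shift folding: split a nine-digit key into three 3-digit chunks and add them.
unsigned long shiftFold(unsigned long key) {
    unsigned long hi  = key / 1000000;        // first three digits
    unsigned long mid = (key / 1000) % 1000;  // middle three digits
    unsigned long lo  = key % 1000;           // last three digits
    return hi + mid + lo;
}

// Boundary folding: same chunks, but the middle chunk is reversed before adding.
unsigned long boundaryFold(unsigned long key) {
    unsigned long hi  = key / 1000000;
    unsigned long mid = (key / 1000) % 1000;
    unsigned long lo  = key % 1000;
    return hi + reverse3(mid) + lo;
}
```

For the key 123456789, shiftFold gives 123 + 456 + 789 = 1368 and boundaryFold gives 123 + 654 + 789 = 1566. Either result would still be reduced mod the table size.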
Hash Function Techniques

Mid-square function
- square the key, then use the middle part as the result
- e.g., 3121 → 9740641 → 406 (with a table size of 1000)
- a string would first be transformed into a number, say by folding
- idea is to let all of the key influence the result
- if table size is a power of 2, this can be done efficiently at the bit level:
  3121 → 100101001010000101100001 → 0101000010 (with a table size of 1024)
Extraction
- use only part of the key to compute the result
- motivation may be related to the distribution of the actual key values, e.g., VT
student IDs almost all begin with 904, so it would contribute no useful separation
Radix transformation
- change the base-of-representation of the numeric key, mod by table size
- not much of a rationale for it
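
The mid-square example can be sketched in both its decimal and its bit-level form. The function names are hypothetical, and the caller is assumed to know how wide the squared key is (7 decimal digits, or 24 bits, for 3121):

```cpp
#include <cstdint>

// Decimal mid-square for a table of size 1000 and a 7-digit square:
// drop the two low-order digits, then keep the next three.
unsigned midSquareDecimal(std::uint64_t key) {
    std::uint64_t sq = key * key;
    return static_cast<unsigned>((sq / 100) % 1000);
}

// Bit-level mid-square: take indexBits bits from the middle of a square
// that is squareBits wide. The table size is then 2^indexBits.
unsigned midSquareBits(std::uint64_t key, unsigned squareBits, unsigned indexBits) {
    std::uint64_t sq = key * key;
    unsigned shift = (squareBits - indexBits) / 2;  // low-order bits to discard
    return static_cast<unsigned>((sq >> shift) & ((1ULL << indexBits) - 1));
}
```

midSquareDecimal(3121) yields 406, and midSquareBits(3121, 24, 10) yields 322, which is 0101000010 in binary.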
Hash Function Design

A good hash function should:


- be easy and quick to compute
- achieve an even distribution of the key values that actually occur across the
index range supported by the table
- ideally be mathematically one-to-one on the set of relevant key values

Note: hash functions are NOT random in any sense.

Reducing Collisions

A simple hash function is likely to map two or more key values to the same integer
value, in at least some cases.
A little bit of design forethought can often reduce this:
unsigned int strHash(string toHash) {
   unsigned int hashValue = 0;
   for (int Pos = 0; Pos < toHash.length(); Pos++) {
      // shift first, so each character's contribution depends on its position
      hashValue = (hashValue << 2) + int(toHash.at(Pos));
   }
   return hashValue;
}
Hashing: hash                Hashing: shah
   h: 104                       s: 115
   a:  97                       h: 104
   s: 115                       a:  97
   h: 104                       h: 104
   Result: 8772                 Result: 9516

The original version would have hashed both of these strings to the same table index.
Flaw: it didn't take element position into account.
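
To compare the two versions side by side, here is a small sketch. The names sumHash and shiftHash are invented for this comparison; the logic follows the two strHash versions shown in these slides:

```cpp
#include <string>

// Additive version: the order of the characters is ignored.
unsigned int sumHash(const std::string& s) {
    unsigned int h = 0;
    for (unsigned char c : s) h += c;
    return h;
}

// Shifted version: the running value is multiplied by 4 before each
// character is added, so position affects the result.
unsigned int shiftHash(const std::string& s) {
    unsigned int h = 0;
    for (unsigned char c : s) h = (h << 2) + c;
    return h;
}
```

sumHash returns 420 for both "hash" and "shah", while shiftHash returns 8772 and 9516 respectively.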

A Classic Hash Function for Strings

Consider the following function to hash a string value into an integer:


unsigned int elfHash(const string& toHash) {
   unsigned int hashVal = 0;
   for (int Pos = 0; Pos < toHash.length(); Pos++) {      // use all elements
      hashVal = (hashVal << 4) + int(toHash.at(Pos));     // shift/mix
      unsigned int hiBits = hashVal & 0xF0000000;         // get high "nibble"
      if (hiBits != 0) {
         hashVal ^= hiBits >> 24;   // xor high nibble into second-lowest nibble
      }
      hashVal &= ~hiBits;                                 // clear high nibble
   }
   return hashVal;
}

This was developed originally during the design of the UNIX operating system, for use
in building system-level hash tables.
Details

Here's a trace of elfHash("distribution"). Character codes and hash values are in hexadecimal:

Character    hashVal
----------------------
d: 64        00000064
i: 69        000006a9
s: 73        00006b03
t: 74        0006b0a4
r: 72        006b0ab2
i: 69        06b0ab89
b: 62        0b0ab892
u: 75        00ab8925
t: 74        0ab892c4
i: 69        0b892c09
o: 6f        0892c04f
n: 6e        092c05de

distribution: 153880030 (decimal)

Detail of the step that processes 'b':

hashVal                    : 06b0ab89
hashVal << 4               : 6b0ab890
add 62                     : 6b0ab8f2
hiBits                     : 60000000
hiBits >> 24               : 00000060
hashVal ^ (hiBits >> 24)   : 6b0ab892     (f: 1111, 6: 0110, ^: 1001)
hashVal & ~hiBits          : 0b0ab892
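
The mask 0xF0000000 presumes a 32-bit unsigned int. A sketch that makes this assumption explicit, and then reduces the result to a home slot, could look like the following. The names elfHash32 and homeSlot are hypothetical; the algorithm is the same as elfHash:

```cpp
#include <cstdint>
#include <string>

// Same steps as elfHash, with the 32-bit width spelled out in the type.
std::uint32_t elfHash32(const std::string& s) {
    std::uint32_t h = 0;
    for (unsigned char c : s) {
        h = (h << 4) + c;                    // shift/mix
        std::uint32_t hi = h & 0xF0000000u;  // high nibble
        if (hi != 0) h ^= hi >> 24;          // fold it into bits 4..7
        h &= ~hi;                            // clear the high nibble
    }
    return h;
}

// Reduce the hash value to a home slot in a table with tableSize slots.
std::uint32_t homeSlot(const std::string& key, std::uint32_t tableSize) {
    return elfHash32(key) % tableSize;
}
```

elfHash32("distribution") returns 0x092c05de, matching the last row of the trace.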

Perfect Hash Functions

In most general applications, we cannot know exactly what set of key values will need
to be hashed until the hash function and table have been designed and put to use.
At that point, changing the hash function or changing the size of the table will be
extremely expensive since either would require re-hashing every key.
A perfect hash function is one that maps every key value to a different table cell than
any other key value.
A minimal perfect hash function does so using a table that has only as many slots as
there are key values to be hashed.
If the set of keys IS known in advance, it is possible to construct a specialized hash
function that is perfect, perhaps even minimal perfect.

Algorithms for constructing perfect hash functions tend to be tedious, but a number are
known.

Cichelli's Method

This is used primarily when it is necessary to hash a relatively small collection of keys,
such as the set of reserved words for a programming language.
The basic formula is:
h(S) = S.length() + g(S[0]) + g(S[S.length()-1])
where g() is constructed using Cichelli's algorithm so that h() will return a different
hash value for each word in the set.
The algorithm has three phases:
- computation of the letter sequences in the words
- ordering the words
- searching

Cichelli's Method

Suppose we need to hash the words in the list below:

   calliope, clio, erato, euterpe, melpomene, polyhymnia, terpsichore, thalia, urania

Determine the frequency with which each first and last letter occurs:

   letter:  e  a  c  o  t  m  p  u
   freq:    6  3  2  2  2  1  1  1

Score the words by summing the frequencies of their first and last letters, and then sort
them in that order:

   word          score
   euterpe         12
   calliope         8
   erato            8
   terpsichore      8
   melpomene        7
   thalia           5
   clio             4
   polyhymnia       4
   urania           4

Cichelli's Method

Finally, consider the words in order and define g(x) for each possible first and last
letter in such a way that each of the words will have a distinct hash value:
(There are 9 words, so the table has 9 slots and the table slot is h(word) % 9.)

word          g_value assigned   h(word)   table slot
euterpe       e-->0                 7      7 ok
calliope      c-->0                 8      8 ok
erato         o-->0                 5      5 ok
terpsichore   t-->0                11      2 ok
melpomene     m-->0                 9      0 ok
thalia        a-->0                 6      6 ok
clio          none                  4      4 ok
polyhymnia    p-->0                10      1 ok
urania        u-->0                 6      6 reject
              u-->1                 7      7 reject
              u-->2                 8      8 reject
              u-->3                 9      0 reject
              u-->4                10      1 reject

Cichelli's Method

Cichelli's method imposes a limit on the search at this point (we're assuming it's 5
steps), and so we back up to the previous word and redefine the mapping there:

word          g_value assigned   h(word)   table slot
polyhymnia    p-->0                10      1 reject
              p-->1                11      2 reject
              p-->2                12      3 ok
urania        u-->0                 6      6 reject
              u-->1                 7      7 reject
              u-->2                 8      8 reject
              u-->3                 9      0 reject
              u-->4                10      1 ok

So, if we define g() as determined above, then h() will be a minimal perfect hash
function on the given set of words.
The primary difficulty is the cost, because the search phase can degenerate to
exponential performance, and so it is only practical for small sets of words.
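
The result of the search can be checked mechanically. The sketch below hard-codes the g() values the trace ends with (p maps to 2, u maps to 4, every other letter to 0) and assumes a 9-slot table, which is inferred from the slot numbers in the trace. It does not implement the search itself, and all names are hypothetical:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// g() values found by the traced search; letters not listed map to 0.
int g(char c) {
    static const std::map<char, int> table = {{'p', 2}, {'u', 4}};
    auto it = table.find(c);
    return it == table.end() ? 0 : it->second;
}

// h(S) = S.length() + g(first letter) + g(last letter), reduced mod table size.
int cichelliSlot(const std::string& s, int tableSize) {
    int h = static_cast<int>(s.length()) + g(s.front()) + g(s.back());
    return h % tableSize;
}

// True when every word lands in a distinct slot.
bool isPerfect(const std::vector<std::string>& words, int tableSize) {
    std::set<int> used;
    for (const auto& w : words) {
        if (!used.insert(cichelliSlot(w, tableSize)).second) return false;
    }
    return true;
}
```

Calling isPerfect on the nine words with a table size of 9 returns true, and since nine words fill nine slots, the function is minimal as well as perfect.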
Table Size

There is some reason to believe that making the table size a prime integer produces
better results by reducing the probability of "clustering" of index values.
- if gcd(a,m) = d and r = a % m, then d | r
- so, if two hash values have a common divisor with the table size, the corresponding
table index values will both be multiples of that common divisor
There is also some empirical evidence that it is better if the table size not be close to any
power of 2.
So, there are commonly available tables of prime numbers that are considered suitable
candidates to be used as hash table sizes.
On the other hand, there are also performance advantages if the table size is chosen to be
a power of 2 (see the mid-square method, for example).
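
The effect of a common divisor can be seen with a small experiment. This is a sketch with made-up inputs (hash values that are all multiples of 4) and invented function names:

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Count how many distinct table indices a set of hash values reaches
// after reduction mod tableSize.
std::size_t distinctSlots(const std::vector<unsigned>& hashes, unsigned tableSize) {
    std::set<unsigned> slots;
    for (unsigned h : hashes) slots.insert(h % tableSize);
    return slots.size();
}

// Hypothetical input: the first `count` multiples of `step`.
std::vector<unsigned> multiplesOf(unsigned step, unsigned count) {
    std::vector<unsigned> v;
    for (unsigned i = 0; i < count; ++i) v.push_back(i * step);
    return v;
}
```

With 100 hash values that are multiples of 4, a table of size 100 uses only 25 distinct slots (every index is itself a multiple of 4), while a table of prime size 101 spreads them over 100 distinct slots.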

Resolving Collisions

When collisions occur, the hash table implementation must provide some mechanism to
resolve the collision:
- no strategy: just reject the insertion. Unacceptable.
- open hashing (called open addressing in many texts): place the record somewhere other than its home slot
- requires some method for finding the alternate location
- method must be reproducible
- method must be efficient
- chaining: view each slot as a container, storing all records that collide there
- requires an appropriate, efficient container for each table slot
- overhead is a concern (e.g., pointers needed by container)

Open Hashing

If the home slot for the record that is being inserted is already occupied, then simply
choose a different location within the table:

[Figure: the record is passed to the hash function, which yields home slot K; that slot is already filled (F).]

linear probing:     start with the original hash index, say K, and search the table
                    sequentially from there until an empty slot is found. If no empty
                    slot is found, the table is full.

quadratic probing:  search from the original hash index by considering indices K + 1,
                    K + 4, K + 9, etc., until an empty slot is found (or not).

other:              apply some other strategy for computing a sequence of alternative
                    table index values.

Linear Probing

If the hash location is not vacant, then we must find another location for the record:

[Figure: the record's IDNumber is hashed by H() to table index K. Slots K, K+1, and K+2 are Filled; slot K+3 is Vacant, so the linear probe stops there.]

Quadratic Probing

Quadratic probing is an attempt to scatter the effect of collisions across the table in a
more distributed way:

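[Figure: a table with two nearby home slots K and M. The probe sequence from K visits K, K+1, K+4, K+9, K+16; the probe sequence from M visits M, M+1, M+4, M+9, M+16.]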

Now, the probe sequences from near misses don't overlap completely.

Problem: will this eventually try every slot in the hash table? No. If the table size is
prime, this will try approximately half the table slots.
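
The coverage claim can be checked by counting the distinct slots that the sequence K, K+1, K+4, K+9, ... reaches. The function name is invented for this sketch:

```cpp
#include <cstddef>
#include <set>

// Count the distinct slots visited by the probe sequence
// home, home+1, home+4, home+9, ... (all mod tableSize) over tableSize probes.
std::size_t quadraticCoverage(std::size_t home, std::size_t tableSize) {
    std::set<std::size_t> seen;
    for (std::size_t i = 0; i < tableSize; ++i) {
        seen.insert((home + i * i) % tableSize);
    }
    return seen.size();
}
```

For a prime table size of 97 the sequence reaches 49 slots, just over half. For a table size of 16 it reaches only 4, which is one reason the choice of table size matters for this strategy.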

General Increment Probing

Quadratic probing can be generalized by picking some function, say S(i), to generate the
step sizes during the probe. Then we have the index sequence:
K, K + S(1), K + S(2), K + S(3), K + S(4), etc.

Letting S(i) = i^2 yields quadratic probing.

The primary concern is that the probe sequence not cycle back to K too soon, because
once that happens the same sequence of index values will be generated a second time.

Key Dependent Probing

It also seems reasonable to use a probe function that takes into account the original key
value:
K, S(1, K), S(2, K), S(3, K), S(4, K), etc.

If done well, this could prevent the adjacencies that increment probing creates in the
probe sequences for adjacent slots (see the Quadratic Probing slide again).

One must be careful to make sure that the calculation of the probe indices is relatively
cheap, and that the probe sequence covers a reasonable fraction of the table slots.
The more complex the function used becomes, the harder it is to satisfy those concerns.
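
One widely used key-dependent scheme, not named in these slides, is double hashing: the step size is derived from the key itself. The sketch below assumes a prime table size, and the function name and the particular step formula are choices made for this illustration:

```cpp
#include <cstddef>

// i-th probe index for a key whose raw hash value is `hash`, in a table of
// prime size tableSize. Two keys with the same home slot usually get
// different step sizes, so their probe sequences diverge.
std::size_t doubleHashProbe(std::size_t hash, std::size_t i, std::size_t tableSize) {
    std::size_t home = hash % tableSize;
    std::size_t step = 1 + (hash / tableSize) % (tableSize - 1);  // never zero
    return (home + i * step) % tableSize;
}
```

With a table size of 11, the hash values 5 and 16 share home slot 5, but their first probes go to slots 6 and 7 respectively. Because the step is nonzero and the table size is prime, each sequence visits every slot before repeating, and each probe costs only a few arithmetic operations.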

Clustering

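Linear probing is guaranteed to find a slot for the insertion if there is still an empty slot in
the table.
However, linear probing also tends to promote clustering within the table:

[Figure: the table before the new insertions.]

Suppose we now inserted some records that collided with records c and d:

[Figure: the same table after the colliding records c2, c3, c4, d2, c5, and d3 have been placed in consecutive slots.]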

The problem here is that the probabilities that a slot will be hit are no longer uniform. If
there are 20 slots in the table, the probability that the slot just after the last d will be hit
next is 9/20 instead of the ideal 1/20.
In other words, the slot that is most likely to be filled is one that's at the end of a cluster
of filled cells, which will make the cluster even longer, and make it even more likely
that the cluster will grow in the future.

Deletions

Deleting a record poses a special problem: what if there has been a collision with the
record being deleted? In that case, we must be careful to ensure that future searches will
not be disrupted.

Solution: replace the deleted record with a "tombstone" entry that indicates the cell is
available for an insertion, but that it was once filled so that a search will proceed past it
if necessary.
Problem: this increases the average search cost since probe sequences will be longer
than strictly necessary.
We could periodically re-hash the table or use some other reorganization scheme.

Question: how do tombstones affect the logic of hash table searching?

Question: can tombstones be "recycled" when new elements are inserted?
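
One possible answer to both questions is sketched below for a linear-probing table of non-negative integer keys. Everything here (the class name, the trivial hash key % size, the three slot states) is an assumption made for illustration. A search treats a tombstone as occupied and keeps probing; an insertion may reuse the first tombstone or empty slot it meets, after first confirming the key is not already present:

```cpp
#include <cstddef>
#include <vector>

// Minimal linear-probing table of non-negative int keys with tombstones.
class ProbeTable {
public:
    explicit ProbeTable(std::size_t size)
        : keys_(size, 0), state_(size, EMPTY) {}

    bool insert(int key) {
        if (find(key)) return false;               // no duplicates
        std::size_t home = static_cast<std::size_t>(key) % keys_.size();
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            std::size_t slot = (home + i) % keys_.size();
            if (state_[slot] != FULL) {            // EMPTY or TOMBSTONE: reuse it
                keys_[slot] = key;
                state_[slot] = FULL;
                return true;
            }
        }
        return false;                              // table is full
    }

    bool find(int key) const { return locate(key) != keys_.size(); }

    bool remove(int key) {
        std::size_t slot = locate(key);
        if (slot == keys_.size()) return false;
        state_[slot] = TOMBSTONE;                  // keeps later probe chains intact
        return true;
    }

private:
    enum State { EMPTY, FULL, TOMBSTONE };

    // Returns the slot holding key, or keys_.size() if the key is absent.
    std::size_t locate(int key) const {
        std::size_t home = static_cast<std::size_t>(key) % keys_.size();
        for (std::size_t i = 0; i < keys_.size(); ++i) {
            std::size_t slot = (home + i) % keys_.size();
            if (state_[slot] == EMPTY) break;      // a never-used slot ends the search
            if (state_[slot] == FULL && keys_[slot] == key) return slot;
        }
        return keys_.size();
    }

    std::vector<int> keys_;
    std::vector<State> state_;
};
```

In a 7-slot table, inserting 3, 10, and 17 fills slots 3, 4, and 5. After remove(10), find(17) still succeeds because the search passes over the tombstone in slot 4, and a later insert(24) reuses that slot.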
Chaining

Design the table so that each slot is actually a container that can hold multiple records.

Here, the "chains" are linked lists which could hold any number of colliding records.
Alternatively, each table slot could be large enough to store several records directly; in
that case the slot may overflow, requiring a fallback.
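
A chained table can be sketched with one std::list per slot. The class and member names are hypothetical, and the hash is simply key % size for non-negative integer keys:

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Minimal chained table of non-negative int keys; each slot is a linked list.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t size) : slots_(size) {}

    void insert(int key) {
        if (!find(key)) chainFor(key).push_front(key);
    }

    bool find(int key) const {
        for (int k : slots_[index(key)]) {
            if (k == key) return true;
        }
        return false;
    }

    bool remove(int key) {
        std::list<int>& chain = chainFor(key);
        std::size_t before = chain.size();
        chain.remove(key);                 // no tombstones needed with chaining
        return chain.size() != before;
    }

    std::size_t chainLength(std::size_t slot) const { return slots_[slot].size(); }

private:
    std::size_t index(int key) const {
        return static_cast<std::size_t>(key) % slots_.size();
    }
    std::list<int>& chainFor(int key) { return slots_[index(key)]; }

    std::vector<std::list<int>> slots_;
};
```

Colliding keys simply share a chain: in a 7-slot table, 3, 10, and 17 all end up in the list at slot 3, and deleting one of them does not disturb searches for the others.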
Design Considerations

Aside from the theoretical issues already presented, there are practical considerations
that will influence the design of a hash table implementation.
- The table should, of course, be encapsulated as a template.
- The size of the table should be configurable via a constructor.
- The hash function does NOT, perhaps surprisingly, logically belong as a member
function of the hash table template.
The table implementation should be as general as possible, but the choice of a
particular hash function should take into account both the type and the expected
range of the key values.
Hence, it is natural to make the choice of hash function the responsibility of the
designer of the data element (key) being hashed. From this perspective, the table
simply asks a data element to hash itself. The hash function may either return an
integer, which the table then mods by its size, or it may take the table size as a
parameter.
- The probing strategy, if appropriate, IS the responsibility of the hash table, since it
requires knowing the internal table configuration.

A HashTable Class

Here is a sample interface for a hash table class using open hashing:
enum probeOption {LINEAR, QUADRATIC};

template <typename T, typename H> class HashTableT {
private:
   enum slotState {EMPTY, FULL, TOMBSTONE};
   T*           Table;
   slotState*   Status;
   int          Size;
   int          Usage;
   probeOption  Opt;
   unsigned int Probe(int Step);
   // continues . . .

Naturally the table is allocated dynamically.


The hash table does not provide a hash function; the second template parameter is
required to implement a public function with the interface:
unsigned int H::Hash(T);

Two client-selectable probe strategy options are provided.
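
A type that could serve as the second template parameter might look like the sketch below. The Student record, the StudentHasher class, and the homeSlot helper are all hypothetical. The slides do not show how the table invokes Hash, so it is declared static here, which lets both H::Hash(x) and H().Hash(x) compile:

```cpp
#include <string>

// Hypothetical record type, keyed by name.
struct Student {
    std::string name;
    unsigned long idNumber;
};

// Hypothetical hasher satisfying the interface  unsigned int H::Hash(T).
class StudentHasher {
public:
    static unsigned int Hash(Student s) {
        unsigned int h = 0;
        for (unsigned char c : s.name) h = (h << 2) + c;  // shifted sum of characters
        return h;
    }
};

// How a table might turn the returned integer into a home slot:
// the element supplies the hash value, the table mods by its own size.
template <typename T, typename H>
unsigned int homeSlot(const T& elem, unsigned int tableSize) {
    return H::Hash(elem) % tableSize;
}
```

This keeps the division of labor described under Design Considerations: the designer of the data element chooses the hash function, and the table only needs to know its own size.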


A HashTable Class

The public interface provides the expected functions, as well as table resizing:
// . . .continued . . .
public:
HashTableT(unsigned int Sz = 97, probeOption O = LINEAR);
HashTableT(const HashTableT<T, H>& Source);
HashTableT<T, H>& operator=(const HashTableT<T, H>& RHS);
~HashTableT();
T* Insert(const T& Elem);
T* Find(const T& Elem);
T* Delete(const T& Elem);
void Clear();
bool Resize(unsigned int Sz);
unsigned int tableSize() const;
void Display(ostream& Out);
};

For testing purposes it would also be natural to provide instrumentation.

Hash Table as an Index

The entries in the hash table do not have to be data records.

For example, if we have a large disk file of data records we could use a hash table to
store an index for the file. Each hash table entry might store a key value and the byte
offset within the file to the beginning of the corresponding record.
Or, the hash table entries could be pointers to data records allocated dynamically on the
system heap.
