DS2

Hash tables
We have weighed the advantages of ordered versus unordered lists and found that ordered lists
like the DSArray have fast search capability while unordered lists like the linked list have fast
update capability. It would be nice to have a data structure that features both advantages. Does
one exist? Yes! Two standard data structures are worthy of study: the hash table and the binary
tree.

What is a hash table?
A hash table is an array of linked lists. It attempts to make up for the disparities of the linked list
and the array by combining the best of both to provide fast insert, delete and find, but
unfortunately not sorted access to the data.
A hash table uses a key field (or fields) of each data item to compute, or "hash", an
integer slot in an array, otherwise known as a bucket. A bucket is an array cell holding a
linked list of objects. Pictorially we have:
  0    1    2        n-2  n-1      <= N Buckets

|    |    |    | ... |    |    |

  ||        ||        ||           Elements
  \/        \/        \/           A,B hash to bucket 0
  A         J         F            J,Q,C hash to bucket 2
  ||        ||                     F hashes to bucket n-2
  \/        \/
  B         Q
            ||
            \/
            C
How does the hash table provide fast search?
That is on account of the array part. If a user wishes to find where an element exists, we call a
function h(x) that decides which bucket, or array index, element x should exist in. Particularly,
this function is called a hash function.

Element x -> | Hash function h(x) | -> Slot K

             A black box which
             decides which slot x
             should be stored in
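As a minimal sketch of the black box above, the following C functions map a string key to a slot. The names sumChars and slotFor are hypothetical and do not belong to the HashTable code in these notes; summing character codes is just one simple choice of h(x).

```c
#include <stddef.h>

/* Hypothetical helper: add up the character codes of a key. */
static unsigned int sumChars(const char *key)
{
    unsigned int sum = 0;
    size_t i;

    for (i = 0; key[i] != '\0'; i++)
        sum += (unsigned char)key[i];
    return sum;
}

/* h(x): reduce the sum to a slot in the range 0..nBuckets-1. */
static unsigned int slotFor(const char *key, unsigned int nBuckets)
{
    return sumChars(key) % nBuckets;
}
```

For example, "Abe" sums to 65 + 98 + 101 = 264, so with 7 buckets slotFor("Abe", 7) selects slot 264 mod 7 = 5.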
How does the hash table provide fast update?
That is on account of the linked lists. Once the bucket that an element should reside in is found,
it is simple enough to traverse the linked list belonging to that bucket to find or simply add the
element. This is a fast process provided the linked lists are relatively short. If they are too long,
then we end up resorting to linear search, the linked list's tragic flaw.
As a general rule, we have to keep the average length of the linked lists to a minimum. If the lists
in a hash table grow too long, performance decreases:
Linked List x            Linked List y

|A  |                    |A  |
|B  |   Nice!            |B  |   Poor!
|C  |                    |C  |
                         |...|
                         |Y  |
                         |Z  |

Rehashing
An intelligent hash table monitors its load, the ratio of the total number of elements to the
number of buckets. If this ratio exceeds a certain threshold, say 75%, then the hash table should
reorganize or rehash itself, increasing the total number of buckets and redistributing all existing
elements across the new buckets.
Here is an example: Suppose our hash table stores names in three buckets. The first bucket
stores names whose first character ranges from A..G, the second from H..O, the third from P..Z:
  A..G      H..O      P..Z

|   0    |    1    |    2    |        As it stands, the table is
                                      severely overloaded:
    |         |         |
 Bernie      Iris     Samson          load = total elements/# buckets
    |         |         |                  = 15/3
 Alfred     Karla     Sally                = 5
    |         |         |                  = 500 % !!
  Fred      Ozna      Robert
    |         |         |             At most we want a 75% load
 Connie    Herbert     Tony
                        |
                       Tia
                        |
                      Tanya
                        |
                      Teisha
Upon rehashing the table, say into 8 buckets, each bucket covering three letters (the last covers
five), we get:
  ABC      DEF      GHI      JKL      MNO      PQR      STU     VWXYZ

|   0    |   1    |   2    |   3    |   4    |   5    |   6    |   7    |

    |        |        |        |        |        |        |
 Alfred    Fred     Iris     Karla    Ozna    Robert   Samson
    |                 |                                   |
 Connie            Herbert                              Tony
    |                                                     |
 Bernie                                                  Tia
                                                          |
                                                        Tanya
                                                          |
 new load = 15/8                                        Teisha
          = 1.875                                         |
          = 187%                                        Sally
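The load arithmetic in this example can be sketched in C as follows. The function names loadFactor and needsRehash are illustrative assumptions, not part of the HashTable code in these notes; the key point is that the division is done in floating point.

```c
/* Load = total elements / number of buckets, as a fraction
   (5.0 means 500%, 1.875 means 187%). */
static double loadFactor(int nElements, int nBuckets)
{
    return (double)nElements / (double)nBuckets;
}

/* Nonzero when the load exceeds the chosen threshold,
   e.g. 0.75 for a 75% acceptable load. */
static int needsRehash(int nElements, int nBuckets, double acceptableLoad)
{
    return loadFactor(nElements, nBuckets) > acceptableLoad;
}
```

With the figures above, loadFactor(15, 3) is 5.0 and loadFactor(15, 8) is 1.875, so even the rehashed table still exceeds a 75% threshold and would need more buckets.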
Choosing the number of buckets
In order to maintain an efficient hash table, we wish to preserve the most even distribution of
elements as we can throughout the linked lists throughout the lifetime of the hash table. The
worst case is that all the elements hash into one bucket, the best is that we have perfect
distribution of elements amongst linked lists:
  0   1  ...  N-1               0   1   2  ...  N-1

|   |   | ... |   |           |   |   |   | ... |   |

  ||                            ||  ||  ||       ||
  A                             A   B   C        Z
  B
  C
  ...
  Z
WORST case: linear search     BEST case: instantaneous O(1)
to add, delete, find          performance to add, delete, find
elements                      elements

Certain choices of N yield better distributions than others. A common recommendation is to select
N as a prime number. The reason can be seen by examining a few sample hash tables. Suppose we
choose an even size for a hash table, say 10. Notice 10 has two divisors other than 1 and itself:
5 and 2. This means that cycles can occur at intervals of multiples of 2 and multiples of 5 when
we hash an element and chop it into 1 of 10 slots via the modulus function.
If the hash values share a factor with the table size (for instance, they are all even, or all
multiples of 5), they can land only in buckets {0,2,4,6,8} or {0,5} respectively, and never in
{1,3,7,9}. Bucket 0 is the most exposed because a hash value that is a multiple of 5 OR 2 can
potentially be slotted there:
 0   1   2   3   4   5   6   7   8   9
 *       *       *       *       *        <- cycles of 2
 $                   $                    <- cycles of 5
Now consider a hash table of prime size, say 7. A prime has no divisors other than 1 and itself,
so hash values that step by 2, 3, 5, or any other amount that is not a multiple of 7 still cycle
through every bucket. No bucket is favored over any other unless the hash values are themselves
multiples of 7.
 0   1   2   3   4   5   6                <- no cycles!
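The cycle argument can be checked with a small experiment. The sketch below is an illustration only: bucketsUsed and MAX_BUCKETS are hypothetical names, and the keys are assumed to be the patterned sequence 0, step, 2*step, and so on.

```c
#define MAX_BUCKETS 64

/* Count how many distinct buckets receive at least one key when the
   keys 0, step, 2*step, ... (nKeys of them) are reduced modulo nBuckets.
   nBuckets must not exceed MAX_BUCKETS. */
static int bucketsUsed(int nBuckets, int step, int nKeys)
{
    int used[MAX_BUCKETS] = {0};
    int i, count = 0;

    for (i = 0; i < nKeys; i++)
        used[(i * step) % nBuckets] = 1;
    for (i = 0; i < nBuckets; i++)
        count += used[i];
    return count;
}
```

With 10 buckets, even keys reach only 5 buckets and multiples of 5 reach only 2. With 7 buckets the same key patterns reach all 7; only keys that are multiples of 7 collapse into a single bucket.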

Formal definition
We are now ready to design the hash table. Because we have already perfected the LinkedList
data structure, our job is actually quite easy. We can rely on this data structure and the C array to
function perfectly, while we write the high level code to manage them.
Data definition

HashTable
nElements : current number of elements
nBuckets : number of array slots
shouldRehash : flag deciding if rehashing should occur
acceptableLoad : load threshold before rehash occurs
ll[] : array of linked lists
Operations

init()
add()
search()
remove()
nElements()
nBuckets()
LLBucket(i) : returns linked list associated with slot <i>
wrapup()
Helper functions
rehash(newSize): rehashes hash list with <newSize>
Pseudocode for the operations
****************************************************************
* Initialize a hash table
init(ht,initialSize,shouldRehash,acceptableLoad)
    ht.nElements <- 0
    ht.nBuckets <- initialSize
    ht.shouldRehash <- shouldRehash
    ht.acceptableLoad <- acceptableLoad
    allocate ht.ll as <ht.nBuckets> linked lists
    initialize each linked list in <ht.ll[]>
****************************************************************
* Deallocate all structural information associated with <ht>
* if <shouldDelete> is set, all node data is deallocated as well
wrapup(ht,shouldDelete)
    for i = 0 to ht.nBuckets-1
        LLWrapup(ht.ll[i],shouldDelete)
    deallocate ht.ll
****************************************************************
* Add <data> to <ht> given a hash function <hasher>
* If <ht.shouldRehash> is on, then possibly <ht> is rehashed
* into a table of roughly double size if the load exceeds
* <ht.acceptableLoad>
add(ht,data,hasher)
    if (ht.shouldRehash
        AND (ht.nElements/ht.nBuckets > ht.acceptableLoad))
        rehash(ht,ht.nBuckets*2+1,hasher)
    bucket <- hasher(data) MOD ht.nBuckets
    LLAdd(ht.ll[bucket],data,NULL)
    ht.nElements <- ht.nElements + 1
****************************************************************
* Search for <data> element in <ht> given a hash function
* <hasher> and a compare function <comparer>
* If <data> is found the appropriate <LLNode> is returned
* otherwise NULL
search(ht,data,hasher,comparer)
    bucket <- hasher(data) MOD ht.nBuckets
    ll <- ht.ll[bucket]
    lnode <- LLSearch(ll,data,comparer)
    return (lnode)
****************************************************************
* Remove <data> from <ht> given a hash function
* <hasher> and a compare function <comparer>
* If <data> is found the appropriate <LLNode> is removed
* from the appropriate linked list and TRUE is returned
* otherwise FALSE is returned
* if <shouldDelete> is on then the actual <data> is deallocated
* otherwise it is not
remove(ht,data,hasher,comparer,shouldDelete)
    bucket <- hasher(data) MOD ht.nBuckets
    ll <- ht.ll[bucket]
    if (LLRemove(ll,data,comparer,shouldDelete))
        ht.nElements <- ht.nElements - 1
        return(TRUE)
    else
        return(FALSE)
****************************************************************
* Rehash hash table <ht> into a table of <newSize> buckets
rehash(ht,newSize,hasher)
    init(newHT,newSize,ht.shouldRehash,ht.acceptableLoad)
    for each element in each bucket of <ht>
        add(newHT,element,hasher)
    wrapup(ht,FALSE)
    ht <- newHT
    deallocate the temporary <newHT> structure (not its lists)
The HashTable definition (HT.H)
#ifndef HASHTH
#define HASHTH
#include "ll.h"
/* Data description */
typedef struct
{
int nElements;
int nBuckets;
int shouldRehash;
float acceptableLoad;
LinkedList **ll;
} HashTable;
/* Hash function definition */
typedef int (*HashFunction)(void *data);
/* Function prototypes */
extern void HTInit(HashTable *ht,
int _initialSize,
int _shouldRehash,
float _acceptableLoad);
extern void HTWrapup(HashTable *ht,int shouldDelete);
extern void HTAdd(HashTable *ht,void *data,HashFunction hx);
extern LLNode* HTSearch(HashTable *ht,void *data,
HashFunction hx,
CompareFunction cmp);
extern int HTRemove(HashTable *ht,void *data,
HashFunction hx,
CompareFunction cmp,
int shouldDelete);
extern int HTNElements(HashTable *ht);
extern int HTNBuckets(HashTable *ht);
extern LinkedList* HTLLBucket(HashTable *ht,int bucket);
#endif
The HashTable operations (HT.C)
#include <stdlib.h>
#include "ht.h"

static void HTRehash(HashTable *ht,int newSize,HashFunction hasher);

void HTInit(HashTable *ht,int _initialSize,
            int _shouldRehash,float _acceptableLoad)
{
    int i;
    int nBytes;

    ht->nElements = 0;
    ht->shouldRehash = _shouldRehash;
    ht->acceptableLoad = _acceptableLoad;
    ht->nBuckets = _initialSize;

    /* allocate the bucket array, then one list header per bucket */
    nBytes = ht->nBuckets * sizeof(LinkedList*);
    ht->ll = (LinkedList**)malloc(nBytes);
    for (i = 0; i < ht->nBuckets; i++)
    {
        ht->ll[i] = (LinkedList*)malloc(sizeof(LinkedList));
        LLInit(ht->ll[i]);
    }
}

void HTWrapup(HashTable *ht,int shouldDelete)
{
    int i;

    /* wrap up every bucket's list, then release the bucket array */
    for (i = 0; i < ht->nBuckets; i++)
    {
        LinkedList *ll = HTLLBucket(ht,i);
        LLWrapup(ll,shouldDelete);
        free(ll);   /* the list header was malloc'ed in HTInit */
    }
    free(ht->ll);
}
void HTAdd(HashTable *ht,void *data,HashFunction hx)
{
    int bucket;

    /* compare the load as a float: integer division would truncate it */
    if (ht->shouldRehash
        && ((float)ht->nElements/(float)ht->nBuckets > ht->acceptableLoad))
        HTRehash(ht,ht->nBuckets*2+1,hx);
    bucket = (*hx)(data) % ht->nBuckets;
    LLAdd(ht->ll[bucket],data,0);
    ht->nElements++;
}
LLNode* HTSearch(HashTable *ht,void *data,
                 HashFunction hx,
                 CompareFunction cmp)
{
    /* HTLLBucket reduces the raw hash value modulo nBuckets */
    int bucket = (*hx)(data);
    LinkedList *ll = HTLLBucket(ht,bucket);
    LLNode *lnode = LLSearch(ll,data,cmp);
    return (lnode);
}

int HTRemove(HashTable *ht,void *data,
             HashFunction hx,
             CompareFunction cmp,
             int shouldDelete)
{
    int bucket = (*hx)(data);
    LinkedList *ll = HTLLBucket(ht,bucket);
    if (LLRemove(ll,data,cmp,shouldDelete))
    {
        ht->nElements--;
        return(TRUE);
    }
    else
        return(FALSE);
}
/* copy the members of <nHT> into <ht> */
static void HTCopy(HashTable *nHT,HashTable *ht)
{
    ht->nElements = nHT->nElements;
    ht->nBuckets = nHT->nBuckets;
    ht->shouldRehash = nHT->shouldRehash;
    ht->acceptableLoad = nHT->acceptableLoad;
    ht->ll = nHT->ll;
}

static void HTRehash(HashTable *ht,int newSize,HashFunction hasher)
{
    int i;
    HashTable *nHT = (HashTable*)malloc(sizeof(HashTable));
    LinkedList *llist;
    LLNode *node;
    void *element;

    /* create a new HashTable definition of <newSize> */
    HTInit(nHT,newSize,ht->shouldRehash,ht->acceptableLoad);

    /* copy over data elements into new HashTable */
    for (i = 0; i < ht->nBuckets; i++)
    {
        llist = HTLLBucket(ht,i);
        LLBegin(llist);
        while ((node = LLNext(llist)) != NULL)
        {
            element = node->data;
            HTAdd(nHT,element,hasher);
        }
    }

    /* get rid of <ht> HashTable linked lists */
    /* but don't delete the elements ! */
    HTWrapup(ht,FALSE);

    /* copy hashTable members from <nHT> -> <ht> */
    HTCopy(nHT,ht);

    /* done with temporary <nHT> */
    free(nHT);
}
int HTNElements(HashTable *ht) { return ht->nElements; }
int HTNBuckets(HashTable *ht) { return ht->nBuckets; }

LinkedList* HTLLBucket(HashTable *ht,int bucket)
{
    bucket = bucket % ht->nBuckets;
    /* valid slots are 0..nBuckets-1 */
    if ((bucket >= 0) && (bucket < ht->nBuckets))
        return ht->ll[bucket];
    else
        return 0;
}
Code to test the hash table (HTTEST.C)
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include "ht.h"

#define INITIAL_SIZE 10
#define ACCEPTABLE_LOAD 0.75

/* map the first letter of a name onto 0..INITIAL_SIZE-1 */
static int hasher(void *data)
{
    char *name = (char*)data;
    if (!strlen(name))
        return 0;
    else
    {
        char firstChar;
        int bucket;
        float ratio;
        firstChar = toupper(name[0]);
        ratio = (float)(firstChar - 'A');
        ratio = ratio/(float)26.0;
        bucket = (int) (ratio * (float)INITIAL_SIZE);
        return bucket;
    }
}

static int comparer(void *data1,void *data2)
{
    char *d1 = (char*)data1;
    char *d2 = (char*)data2;
    int result = strcmp(d1,d2);
    if (result == 0)
        return (EQUAL);
    else if (result < 0)
        return (LESS_THAN);
    else
        return (GREATER_THAN);
}

static void populateNames(HashTable *hT)
{
HTAdd(hT,(void*)"Carol",hasher);
HTAdd(hT,(void*)"Zoey",hasher);
HTAdd(hT,(void*)"Abe",hasher);
HTAdd(hT,(void*)"Jonah",hasher);
HTAdd(hT,(void*)"Saserna",hasher);
HTAdd(hT,(void*)"Murgoy",hasher);
HTAdd(hT,(void*)"Carlos",hasher);
}
static void displayNamesInLL(LinkedList *llist)
{
    int areNames = FALSE;
    LLNode *node;
    char *element;

    /* walk the list from its first node, printing each name */
    LLBegin(llist);
    while ((node = LLNext(llist)) != NULL)
    {
        element = (char*)node->data;
        printf("<%s> ",element);
        areNames = TRUE;
    }
    if (areNames) printf("\n");
}
static void displayNames(HashTable *hT)
{
int i,nBuckets = HTNBuckets(hT);
for (i = 0; i < nBuckets; i++)
{
LinkedList* ll = HTLLBucket(hT,i);
printf("slot:%d\n",i);
displayNamesInLL(ll);
}
printf("Total elements: %d\n\n",HTNElements(hT));
}
int main(void)
{
    HashTable ht,*hT = &ht;
    LLNode *node;

    HTInit(hT,(int)INITIAL_SIZE,(int)FALSE,(float)ACCEPTABLE_LOAD);
    populateNames(hT);
    displayNames(hT);

    /* search for an existing entry */
    node = HTSearch(hT,(void*)"Murgoy",hasher,comparer);
    if (node)
        printf("%s found\n",(char*)node->data);
    else
        printf("%s not found\n","Murgoy");

    /* search for nonexistent entry */
    node = HTSearch(hT,(void*)"Clown",hasher,comparer);
    if (node)
        printf("%s found\n",(char*)node->data);
    else
        printf("%s not found\n","Clown");

    /* remove an existing element */
    if (HTRemove(hT,(void*)"Abe",hasher,comparer,FALSE))
        printf("%s removed","Abe");

    /* remove a nonexisting element */
    if (HTRemove(hT,(void*)"Clown",hasher,comparer,FALSE))
        printf("%s removed","Clown");

    printf("\nAfter remove:\n");
    displayNames(hT);

    HTWrapup(hT,FALSE);
    return 0;
}
Output of HTTEST.C
slot:0
<Carlos> <Abe> <Carol>
slot:1
slot:2
slot:3
<Jonah>
slot:4
<Murgoy>
slot:5
slot:6
<Saserna>
slot:7
slot:8
slot:9
<Zoey>
Total elements: 7

Murgoy found
Clown not found
Abe removed
After remove:
slot:0
<Carlos> <Carol>
slot:1
slot:2
slot:3
<Jonah>
slot:4
<Murgoy>
slot:5
slot:6
<Saserna>
slot:7
slot:8
slot:9
<Zoey>
Total elements: 6

Testing the rehash() function
If we are to successfully test the rehashing capabilities of our hash table, we must redesign our
hasher function. The reason is that it currently hashes only into slots 0..INITIAL_SIZE-1. When
our hash table increases in size, we want our hasher to hash into the extra slots.
A generalized string hasher adds up the ASCII codes of all the characters in the string
and returns the sum as an integer hash key. The hash table then chops this big number down into a
slot from 0..nBuckets-1.
We can test HTRehash() by simply overloading our hash table. When the hash table recognizes
the overload, an HTRehash() call is automatically triggered. We can then display the reorganized
hash table's contents.

| ht         |   ->                     | ht          |
|            |   Force Rehash           |             |
| nBuckets=7 |   by adding elements     | nBuckets=15 |

Code to test rehashing (HTTEST2.C)
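#include <stdio.h>
#include <string.h>
#include "ht.h"

/* assumed values: the listing omits these definitions; 7 matches the
   initial bucket count shown in the sample output */
#define INITIAL_SIZE 7
#define ACCEPTABLE_LOAD 0.75

/* generalized string hasher: sum of all character codes */
static int hasher(void *data)
{
    char *name = (char*)data;
    int i,hashValue = 0;
    int len = (int)strlen(name);

    for (i = 0; i < len; i++)
        hashValue += name[i];
    return(hashValue);
}

static int comparer(void *data1,void *data2)
{
    char *d1 = (char*)data1;
    char *d2 = (char*)data2;
    int result = strcmp(d1,d2);
    if (result == 0)
        return (EQUAL);
    else if (result < 0)
        return (LESS_THAN);
    else
        return (GREATER_THAN);
}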

static void populateNames(HashTable *hT)
{
HTAdd(hT,(void*)"Carol",hasher);
HTAdd(hT,(void*)"Zoey",hasher);
HTAdd(hT,(void*)"Abe",hasher);
HTAdd(hT,(void*)"Jonah",hasher);
HTAdd(hT,(void*)"Saserna",hasher);
HTAdd(hT,(void*)"Murgoy",hasher);
HTAdd(hT,(void*)"Carlos",hasher);
}
static void displayNamesInLL(LinkedList *llist)
{
    int areNames = FALSE;
    LLNode *node;
    char *element;

    /* walk the list from its first node, printing each name */
    LLBegin(llist);
    while ((node = LLNext(llist)) != NULL)
    {
        element = (char*)node->data;
        printf("<%s> ",element);
        areNames = TRUE;
    }
    if (areNames) printf("\n");
}
static void displayNames(HashTable *hT)
{
int i,nBuckets = HTNBuckets(hT);
for (i = 0; i < nBuckets; i++)
{
LinkedList* ll = HTLLBucket(hT,i);
printf("slot:%d\n",i);
displayNamesInLL(ll);
}
printf("Total elements: %d\n\n",HTNElements(hT));
}
static void forceRehash(HashTable *hT)
{
HTAdd(hT,(void*)"Ken",hasher);
HTAdd(hT,(void*)"Ned",hasher);
HTAdd(hT,(void*)"Fred",hasher);
HTAdd(hT,(void*)"Sal",hasher);
HTAdd(hT,(void*)"Sonya",hasher);
HTAdd(hT,(void*)"Geo",hasher);
HTAdd(hT,(void*)"Greg",hasher);
}
int main(void)
{
    HashTable ht,*hT = &ht;

    HTInit(hT,(int)INITIAL_SIZE,(int)TRUE,(float)ACCEPTABLE_LOAD);
    printf("Initial list..\n");
    printf("\n");
    populateNames(hT);
    displayNames(hT);

    /* force a rehash by increasing load factor */

    forceRehash(hT);
    printf("Rehashed list..\n");
    printf("\n");
    displayNames(hT);

    HTWrapup(hT,FALSE);
    return 0;
}
Output of HTTEST2.C (sample run; slot assignments depend on the hash values produced)
Initial list..

slot:0
slot:1
<Abe>
slot:2
<Murgoy> <Jonah>
slot:3
<Carol>
slot:4
slot:5
slot:6
<Carlos> <Saserna> <Zoey>
Total elements: 7

Rehashed list..

slot:0
<Fred>
slot:1
slot:2
<Sonya> <Saserna> <Carlos>
slot:3
<Geo> <Murgoy>
slot:4
<Greg>
slot:5
slot:6
<Ken> <Jonah>
slot:7
<Carol>
slot:8
<Sal> <Zoey>
slot:9
slot:10
slot:11
slot:12
slot:13
slot:14
<Ned> <Abe>
Total elements: 14

Car dealership search
Let's look at an application that stores automobile data for a used car dealership in a hash table
and provides a simple reporting mechanism for a particular make and model of car. The data to
record is:
Auto
Make : String
Model : String
Year : integer
Mileage : long integer
Condition : integer [1..5] 1=Poor, 5=Excellent
Buy price : long integer
Sell price : long integer
State : [1=On lot, 2=Sold]
Here is a sample search session:
Search on make,model: => Honda civic
Matching units:
Honda    Year   State      Buy$    Sell$   Profit$

Civic    1992   Sold      11000    13800      2800
Civic    1995   On lot    16000
Civic    1989   Sold       3200     5500      2300
Civic    1989   On lot     4000

Total                     34200    19300      5100
Design
We want to search on the combination of keys <make> and <model> so we can design the hash
table to hash on these two fields, specifically through the hash function.
hasher(auto)
return stringHash(auto.make)+stringHash(auto.model)
Our reporting mechanism merely cycles through all elements in the bucket that <make> <model>
hashes to, comparing <make> and <model> against each element, since other cars may share the bucket.
Suppose we want to report on Honda civics and <Honda> <Civic> hashed to the third bucket in
our hash table:
   0         1         2             N-1

|       |        |        |  ...  |       |

   ||        ||        ||
   \/        \/        \/
  Chev     Toyota    Honda
  Sprint   Tercel    Civic
   ||                  ||
   \/                  \/
  Pontiac             Ford
  Grand AM            Tempo
                       ||
                       \/
                      Honda
                      Civic
                       ||
                       \/
                      ...
Once we have isolated the bucket to search in, we can pick out all the Honda civic instances in
the linked list.
AUTO.H
#ifndef AUTOH
#define AUTOH
#define NCONDITIONS 5
#define NSTATES 2
#define NCHARS 30
typedef struct
{
    char make[NCHARS],model[NCHARS]; /* search keys */
    int year;
    long int mileage;
    int condition;      /* [1..5] 1=Poor, 5=Excellent */
    long int buyPrice;
    long int sellPrice;
    int state;          /* [1=On lot, 2=Sold] */
} Auto;
#endif
AUTO.DAT
Chev Malibu 81 175000 2 400 800 1
Pontiac GrandAm 92 67000 3 7500 12000 2
Honda Civic 92 65000 4 11000 13800 2
Honda Civic 95 23000 5 16000 19000 1
Honda Civic 89 223000 2 3200 5500 2
Honda Civic 89 194000 3 4000 5500 1
Acura Integra 87 201000 3 3500 4900 1
Ford Tempo 91 145000 2 1400 3000 1
Ford Tempo 90 87000 4 5700 8650 2
AUTOTEST.C
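#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "ht.h"
#include "auto.h"

#define AUTO_FILE "auto.dat"
/* assumed values: the listing omits these; same as in HTTEST.C */
#define INITIAL_SIZE 10
#define ACCEPTABLE_LOAD 0.75

static int stringHash(char *s)
{
    int hashValue = 0;
    unsigned int i,len = strlen(s);

    for (i = 0; i < len; i++)
        hashValue += s[i];
    return hashValue;
}

/* hash on the combined search keys <make> and <model> */
static int hasher(void *data)
{
    int hashCode;
    Auto *autoMB;

    autoMB = (Auto*)data;
    hashCode = stringHash(autoMB->make)+stringHash(autoMB->model);
    return hashCode;
}

/* only equality matters here: anything unequal reports LESS_THAN */
static int comparer(void *data1,void *data2)
{
    Auto *auto1 = (Auto*)data1;
    Auto *auto2 = (Auto*)data2;

    if (!strcmp(auto1->make,auto2->make)
        && !strcmp(auto1->model,auto2->model))
        return EQUAL;
    else
        return LESS_THAN;
}

static Auto* readAuto(FILE *fp)
{
    Auto *autoMB = (Auto*)malloc(sizeof(Auto));

    /* stop at end of file or on a malformed record;
       %29s keeps make and model within NCHARS */
    if (!autoMB
        || fscanf(fp,"%29s %29s %d %ld %d %ld %ld %d",
                  autoMB->make,autoMB->model,&autoMB->year,
                  &autoMB->mileage,&autoMB->condition,
                  &autoMB->buyPrice,&autoMB->sellPrice,
                  &autoMB->state) != 8)
    {
        free(autoMB);
        return NULL;
    }
    return autoMB;
}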
static void readAutos(FILE *fp,HashTable *hT)
{
Auto *autoMB;
autoMB = readAuto(fp);
while (autoMB)
{
HTAdd(hT,(void*)autoMB,hasher);
autoMB = readAuto(fp);
}
}
static char* condition[] = {"?? ","Very poor",
"Poor ",
"Mediocre ",
"Good ",
"Excellent"};
static char* state[] = {"?? ","On lot","Sold "};
static void displayAuto(Auto *autoMB)
{
    int stateIndex,conditionIndex;

    /* out-of-range codes map to the "??" entry at index 0 */
    if ((autoMB->state <= 0) || (autoMB->state > NSTATES))
        stateIndex = 0;
    else
        stateIndex = autoMB->state;
    if ((autoMB->condition <= 0) || (autoMB->condition > NCONDITIONS))
        conditionIndex = 0;
    else
        conditionIndex = autoMB->condition;
    printf("%s %s %d %ld %s %ld %ld %s\n",
           autoMB->make,autoMB->model,autoMB->year,
           autoMB->mileage,condition[conditionIndex],
           autoMB->buyPrice,autoMB->sellPrice,
           state[stateIndex]);
}
int main(void)
{
    HashTable ht,*hT = &ht;
    LLNode *node;
    char searchMake[NCHARS],searchModel[NCHARS];
    int bucket,foundTheItem;
    LinkedList *llist;
    Auto searchAuto,*element;
    FILE *fp;

    HTInit(hT,(int)INITIAL_SIZE,(int)FALSE,(float)ACCEPTABLE_LOAD);
    fp = fopen(AUTO_FILE,"r");
    if (!fp)
        exit(1);
    readAutos(fp,hT);
    fclose(fp);

    while (1)
    {
        printf("Enter make and model\n");
        /* %29s keeps the input within NCHARS; stop on end of input */
        if (scanf("%29s",searchMake) != 1)
            break;
        if (!strcmp(searchMake,"quit"))
            break;
        if (scanf("%29s",searchModel) != 1)
            break;
        strcpy(searchAuto.make,searchMake);
        strcpy(searchAuto.model,searchModel);

        /* isolate the bucket, then pick out every matching car */
        bucket = (*hasher)((void*)&searchAuto);
        llist = HTLLBucket(hT,bucket);
        LLBegin(llist);
        foundTheItem = FALSE;
        while ((node = LLNext(llist)) != NULL)
        {
            element = (Auto*)node->data;
            if ((*comparer)(element,(void*)&searchAuto) == EQUAL)
            {
                foundTheItem = TRUE;
                displayAuto(element);
            }
        }
        if (!foundTheItem)
        {
            printf("No such item...\n");
        }
        printf("\n");
    }
    HTWrapup(hT,TRUE);
    return 0;
}
Output of AUTOTEST.C
Enter make and model
Chev Malibu
Chev Malibu 81 175000 Poor 400 800 On lot

Ford Tempo
Ford Tempo 90 87000 Good 5700 8650 Sold
Ford Tempo 91 145000 Poor 1400 3000 On lot

Honda Civic
Honda Civic 89 194000 Mediocre 4000 5500 On lot
Honda Civic 89 223000 Poor 3200 5500 Sold
Honda Civic 95 23000 Excellent 16000 19000 On lot
Honda Civic 92 65000 Good 11000 13800 Sold

Pointiac TransAm
No such item...

quit
Designing hash functions
We have already seen a couple of implementations of hashing a string. What happens when the
hash table data becomes more complicated?
Ideally we should use as much information from the data as possible to maintain uniqueness or
randomness of the hash value for best distribution, but not at the expense of speed! If we look
closely at our second implementation of hasher() we can see some flaws. Suppose strings are up
to 1K long and we look at each character of the string to compute the hash value. The worst case
is that we have to process a K of data just to finger a bucket. Not good! A hash should be a
quick operation. Perhaps then to improve efficiency of the string hasher, we should look only at
the first N characters of the string.
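A minimal sketch of this idea in C follows. The name prefixStringHash and its maxChars parameter are assumptions made for illustration; they are not part of the test programs above.

```c
#include <stddef.h>

/* Sum the character codes of at most maxChars leading characters of s,
   so the cost of hashing is bounded no matter how long the string is. */
static int prefixStringHash(const char *s, size_t maxChars)
{
    int hashValue = 0;
    size_t i;

    for (i = 0; i < maxChars && s[i] != '\0'; i++)
        hashValue += (unsigned char)s[i];
    return hashValue;
}
```

The trade-off is that strings sharing a prefix collide: prefixStringHash("Carol", 3) and prefixStringHash("Carlos", 3) both hash only "Car", so N should be large enough to tell typical keys apart.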
Let's consider an example of a record of four fields.
ForestryRegister
LandOwner : String
DatePurchased : String
NumberOfHectares : Integer
PredominantForest : {D=Deciduous,
C=Coniferous,S=Scrub}
We have a lot of fields to hash on, yielding potentially a plethora of successful (and
unsuccessful!) hashing functions. We could make a mistake by picking a bad field to hash on,
say NumberOfHectares.
Why is it bad?
Let's say the average number of hectares on file is 12. This means we will get clustering of
records on and near bucket 12. What is more, bucket 0 will never be used, for why would the
forestry department register a landowner with zero forest land?
Which field(s) to hash on?
If possible, better that we do a quick calculation on all the fields. We have no worry of
LandOwner or DatePurchased being large enough to reduce a complete string hash to
inefficiency. To really mangle up the key, we could create the following hash function and throw
in a few prime multiplications for safety:
forestryRegisterRecordHasher(ForestryRegister fr)
    hashValue <- fullStringHash(fr.LandOwner)
               + fullStringHash(fr.DatePurchased)
               + 13 * fr.NumberOfHectares
               + 17 * fr.PredominantForest
    return (hashValue)
fullStringHash(String s)
    hashValue <- 0
    for each character c in s
        hashValue <- hashValue + c
    return (hashValue)
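The pseudocode above might translate into C roughly as follows. This is a sketch under assumptions: the field sizes and the lower-camel-case field names are invented for illustration, and PredominantForest is assumed to be stored as one of the characters 'D', 'C' or 'S'.

```c
typedef struct
{
    char landOwner[40];        /* field sizes are assumptions */
    char datePurchased[12];
    int  numberOfHectares;
    char predominantForest;    /* 'D', 'C' or 'S' */
} ForestryRegister;

/* Sum the character codes of the whole string. */
static int fullStringHash(const char *s)
{
    int hashValue = 0;

    for (; *s != '\0'; s++)
        hashValue += (unsigned char)*s;
    return hashValue;
}

/* Combine every field, scaling the numeric ones by small primes. */
static int forestryRegisterRecordHasher(const ForestryRegister *fr)
{
    return fullStringHash(fr->landOwner)
         + fullStringHash(fr->datePurchased)
         + 13 * fr->numberOfHectares
         + 17 * fr->predominantForest;
}
```

For a record {"Abe", "1990", 12, 'D'} the four terms are 264, 211, 156 and 1156, giving a hash value of 1787 before the table reduces it modulo the number of buckets.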
This is nice in that we get a pretty even distribution, but the disadvantage is that upon a search
for a ForestryRegister record, the user is forced to supply all the details of the record:
{LandOwner, DatePurchased, NumberOfHectares, PredominantForest}. Most of the time, users just
want to type in a particular LandOwner and get the details.
If we wish to provide this basic operation efficiently, we have to hash only on the key search
field, which logically should be LandOwner. Our hash function is reduced to:
forestryRegisterRecordHasher(ForestryRegister fr)
    return fullStringHash(fr.LandOwner)
Multiple hashing
We have established the fact that good hashing performance comes from using as much
information from the key data as possible within the limits of time and the search characteristics
desired.
Depending on the application, we may opt for a hashing solution that chains a series of hashes
together, or multiple-hashes, rather than performing one "end-all" hash.
Suppose we are trying to record titles of books. We could organize the bucket calculation into
two levels; the first level being an initial general hash on the title, for simplicity let's say on the
title's first character. The second hash hashes on the second character of the title. Our data
organization looks something like:
   A     B     C     D            Y      Z

|  0  |  1  |  2  |  3  | ... |  24  |  25  |

   ||    ||
   \/    \/    ...

Sub-table for bucket 0 (titles starting with 'A'):
|   |
| A |
| B | -> "Absence of malice"
| C |
| D | -> "Add up your cash"
 ...
| X | -> "Axes and allies","Axeman [The]",...
| Y |
| Z |

Sub-table for bucket 1 (titles starting with 'B'):
|   |
| A | -> "Bad moon's rising",...
| B |
| C |
 ...
| O | -> "Boycott the bridge game",...
 ...
| U | -> "Buddy Holly years",...
 ...
| Y | -> "Bye bye birdie",...
| Z |

. We end up with a hash table for each letter of the alphabet, each with 27 buckets, a total of
26*27 = 702 buckets.
. We need two hashing functions to perform this mapping: h1(x) and h2(x).
Analysis of the multiple hashing example
Note that this example has some basic flaws.
If we check statistically the characteristics of the titles going into the system, we will see that
the many books starting with "The" produce heavy clustering of titles in bucket 'T', sub-bucket 'H'*.
Conversely, we would expect there to be no titles like "ZZ..", "HH..", etc. (amongst many others),
which makes for wasted buckets and thus further overloads the more "popular" buckets.

| A | B | C | ... | T |

  ||                ||
  \/                \/

Sub-table 'A'                                Sub-table 'T'
|   | -> many books starting with "A .."     | A |
| A | -> literally no books                  | B |
| B |    starting with "AA"                  | C |
| C |                                        | D |
| D |                                         ...
 ...                                         | H | -> many books
| X |                                         ...     starting with
| Y |                                        | Y |    "Th"
| Z | -> very few books                      | Z |
         starting with
         "Az"
*Yet many library databases omit redundant prefixing articles in the search such as "The", "A",
"An".
A better solution...
The previous method is a poor attempt at multiple hashing, not because of the concept of
multiple hashing but rather in the choice of h1(x) and h2(x). Hashing on a single character is
generally not a good idea. To create better distribution we can revise h1(x) to do a full hash on
title, h2(x) to do a full hash on author. The number of buckets we can supply for the title hash
can be much greater than 26, say N, a prime. The number of buckets for the author hash can also
be large, say M, also a prime. Pictorially, our data organization consists of an N x M matrix of
buckets:

| 0 | 1 | 2 | ... | N |

  ||  ||  ||        ||
  \/  \/  \/  ...   \/

| 0 | 0 | 0 |     | 0 |
| 1 | 1 | 1 |     | 1 |
| 2 | 2 | 2 |     | 2 |
| 3 | 3 | 3 |     | 3 |
 ... ... ...       ...
| M | M | M |     | M |

We should expect a more even distribution throughout our implementation because full string
hashing draws on every character of the key and so generally scatters the keys well.
The advantage of multiple hashing over single hashing is that it avoids the creation of a very,
very large initial table at application startup. The [1..N] array can be of moderate size, and over
the course of the application run, the sub-hash tables [1..M] (also moderately sized) are created,
populated, and possibly rehashed over time according to the user's needs.
Multiple hashing on the ForestryRegister
As another solution to the ForestryRegister example from before we could chain two hashes. We
still would like to use all the record information; we can just split it up amongst hashes.
h1(x) we will keep to a simple hash on LandOwner solely. This keeps the design open so that
applications can perform a standard search on the "major" or most common key.
But at the same time, by providing h2(x), we extend capability to a more restrictive search and
the possibility of multiple hashing.
h1(ForestryRecord fr)
    hashValue <- fullStringHash(fr.LandOwner)
    return (hashValue)
h2(ForestryRecord fr)
    hashValue <- fullStringHash(fr.DatePurchased)
               + 17 * fr.PredominantForest
               + 13 * fr.NumberOfHectares
    return (hashValue)
How to implement multiple hashing
We will have to:
. Create an array of N hash tables
. Write high-level versions of our hash table user functions add, remove, find,...
Sample pseudocode:
[ FIRST_HASH = 0, SECOND_HASH = 1]
type hashers (hasher)[2] : create a new type, an array
                           of hasher functions
HashTable mht[N] : array of hash tables
****************************************************************
* Add <data> to hash table <mht> supporting multiple hashing
* <hashers> is an array of two hash functions
MHTAdd(mht,data,hashers)
    h1 <- hashers[FIRST_HASH]
    bucket1 <- h1(data) MOD N
    hashTable1 <- mht[bucket1]
    return HTAdd(hashTable1,data,hashers[SECOND_HASH])
...
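To make the two-level lookup concrete, here is a small C sketch of how a pair of hash functions selects a cell in the N x M matrix of buckets. Everything in it is a hypothetical illustration: the Book record, the sizes N_OUTER and M_INNER, and the names locate, titleHash and authorHash are assumptions, not part of the HashTable code.

```c
#define N_OUTER 11   /* first-level table size, an assumed prime */
#define M_INNER 7    /* second-level table size, an assumed prime */

typedef struct { const char *title; const char *author; } Book;
typedef struct { int outer; int inner; } Slot;
typedef int (*BookHash)(const Book *b);

/* Sum the character codes of the whole string. */
static int sumString(const char *s)
{
    int sum = 0;

    for (; *s != '\0'; s++)
        sum += (unsigned char)*s;
    return sum;
}

static int titleHash(const Book *b)  { return sumString(b->title); }
static int authorHash(const Book *b) { return sumString(b->author); }

/* h1 picks the sub-table, h2 picks the bucket inside that sub-table. */
static Slot locate(const Book *b, BookHash h1, BookHash h2)
{
    Slot s;

    s.outer = h1(b) % N_OUTER;
    s.inner = h2(b) % M_INNER;
    return s;
}
```

A call such as locate(&book, titleHash, authorHash) yields the pair of indices. A fuller version would create the sub-table at s.outer on first use, which is what keeps the initial allocation small.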
Unit IV Exercises
(1) Industrial Tool Corporation wishes to cross-reference patterns on certain "important"
customers, those who have in the past put in large orders of equipment. Cross-reference as in
"which bigwig is buying what and how much and when from us so we can be better prepared to
control our inventory in the next year". Currently, the company stores general inventory stats of
their clients in yearly files: it1992.dat, it1993.dat, ..., up until the current year.
Upper management wishes a system that produces a chronological report of the major orderers,
who will be specified in a control file bigcust.ctl:
Heneca Industries <= Sample major orderers
Jackson Quarry
Sutherland Mills
Because the nature of the application is primarily cross-referencing (or searching), system
analysts have agreed that hash tables are a suitable means of looking up certain customers.
You are requested to design the system that produces the following report:
Purchase:
Year   Company              Month   Qty   Description

1992 :
       Sutherland Mills     May     45    Sander
       Heneca Industries    July    21    Oil gun
       Jackson Quarry       Aug     11    Drill press
       Jackson Quarry       Oct     9
1993 :
       Jackson Quarry       Jan     17    Crane lift
       Heneca Industries    Nov     11    Chain saw
1994 :
       Sutherland Mills     May     55    Sander
       Sutherland Mills     June    10    Chain saw
...
Read the files and store the records in separate hash tables for each year. There may be many
matches for a particular company depending on what commodities they have purchased in any
year. Provide the ability to update bigcust.ctl from within your application. Here is a sample
menu:
1 Produce Report
2 Update major orderer list (bigcust.ctl)
3 Exit
(2a) Implement an online library search system using conventional hashing. Your program
should allow "Title" or "Author" search. Note you need only one copy of the data in memory, but
two hash tables, one that records hashed references to titles, another to authors.
Assume library entries are of the format:
LibraryBook
Title : String
Author : String
YearPublished : String
OnLoan : Boolean [Y/N]
Populate your database on file with at least 50 titles. Test your program by searching on various
titles and authors, some existent, some nonexistent.
Sample run
T = Title search, A = Author search
=> T Bayshore blues
No such title
=> A Finigree, Wilson
Jacobs ladder 1987 On loan
Johnny Be Good 1988 Available
=> T Betty's cookery
Calhill, Sally 1976 Available
(b) Fancy up your application to provide "narrow down" search on the first two characters of the
title or author. Hint: Create a third and a fourth hash table that contain hash references to the first
two characters of title and author respectively. You can use conventional hashing or the multiple
hashing method as described in the notes (or a superior hybrid thereof).
Sample run
TN = Title narrowed down search,
AN = Author narrowed down search
=> TN Da
Dandy Lion and the wicked witch 1976 On loan
Davie Crockett 1954 On loan
Dad and Mom 1997 On loan
...
(3a) Revise the HashTable implementation to allow the user to optionally supply a growth delta
in the case of rehashing. If the growthDelta is nonzero then it overrides the "double the current
size" rule.
void HTInit(...,int growthDelta);
(b) Because a prime number of buckets creates a better element distribution, enhance your
solution in part (a) to bump the bucket number count up to the next prime number of buckets
after the addition of growth delta. For example, if the initial size was 7 and growth delta was 20,
27 is not the best choice for the next bucket size because it has divisors 3 and 9; a better choice is
29, the next prime after (initialSize + growthDelta).

(4) Come up with an expression for the average and worst case performance of a hash table of
bucket size b and number of elements n. Assume that the hash function produces even
distribution of elements.
(5) Write a program to monitor the number of comparisons required for average search time for a
hash table of load 100%, 200%, .., 500%, .., 1000%.
For simplicity, assume keys are integers, and the hash function is a simple modulus function:
h(x) = x mod NBUCKETS
Your program produces a chart of the following specifications:
HashTable size: 1301
Load | # searches | # comparisons | Average # of comparisons/search

 100 |    1000    |     1000      |  1.00
 200 |    1000    |     2224      |  2.24
 300 |    1000    |     2983      |  2.98
 400 |            |               |
 500 |            |     5671      |  5.67
...
1000 |    1000    |    14022      | 14.02
The user can modify the hash table size, the number of searches, and the range and increment of
loads to be tested within the program.

Test various values of prime and nonprime hash table sizes and study the output.
