Вы находитесь на странице: 1из 84

Bilgisayar Mühendisliği Bölümü

DATA STRUCTURES
AND
ALGORITHMS

Lecture Notes 11
Sets and Maps

Spring 2008

GIT – Computer Engineering Department


Chapter Outline
 The Set and Map containers and how to use them
 Hash coding and its use in efficient search &
retrieval
 Two forms of hash tables:
– Open addressing
– Chaining
– Their relative benefits and performance tradeoffs
 Implementing both hash table forms
 Introduction to implementation of Maps and Sets
 Applying Maps and Sets to previous problems

GIT – Computer Engineering Department 2


Standard Library Containers

GIT – Computer Engineering Department 3


Associative Containers

GIT – Computer Engineering Department 4


The Set Abstraction
 A set is a collection containing no duplicate
elements
 Operations on sets include:
– Testing for membership
– Adding elements
– Removing elements
– Union
– Intersection
– Difference
– Subset
GIT – Computer Engineering Department 5
Set Implementation
 set is a template class in standard library
with template parameters
– Key_Type : type of the items
– Compare : function class to order items
 Comparison is not required for set abstraction
– Compare has default value

GIT – Computer Engineering Department 6


The set class
Function Behavior
template<typename II> Construct a set from the sequence of objects
set(II first, II last) represented by the iterator range first ..
last.
iterator begin() Returns an iterator to the first item in the
const iterator begin() const set.
iterator end() Returns an iterator to one past the last item
const iterator end() const in the set.
bool empty() Returns true if the set is empty.
size_t size() Returns the number of item in the set.
pair<iterator, bool> insert(const Inserts an item into the set. If the item is not
Key_Type& item) in the set, the iterator will reference the
inserted item and the bool will be true. If
the item is already in the set the iterator
will reference the item currently in the set
and the bool will be false.
template<typename II> void Inserts the items from the range first ..
insert(II first, II last) last into the set. Duplicates are ignored.
void erase(const Key_Type& item) Removes the item from the set.
iterator find(const Key_Type& Returns an iterator to the item in the set. If

GIT – Computer Engineering Department 7


Implementing the Union Operator
/** Construct the union of two sets. */
template <typename Key_Type, typename Compare>
std::set<Key_Type, Compare> operator+(
const std::set<Key_Type, Compare>& left,
const std::set<Key_Type, Compare>& right){
typename std::set<Key_Type, Compare> result(left);
result.insert(right.begin(), right.end());
return result;
}

GIT – Computer Engineering Department 8


Implementing the Set Difference
/** Construct the difference of two sets. */
template <typename Key_Type, typename Compare>
std::set<Key_Type, Compare> operator-(
const std::set<Key_Type, Compare>& left,
const std::set<Key_Type, Compare>& right) {
typename std::set<Key_Type, Compare> result(left);
for (typename std::set<Key_Type,
Compare>::const_iterator itr = right.begin();
itr != right.end(); ++itr)
result.erase(*itr);
return result;
}

GIT – Computer Engineering Department 9


Implementing Intersection
/** Construct the intersection of two sets */
template <typename Key_Type, typename Compare>
std::set<Key_Type, Compare> operator*(const
std::set<Key_Type, Compare>& left,
const std::set<Key_Type, Compare>& right)
{
return left - (left - right);
}

GIT – Computer Engineering Department 10


Overloaded ostream extraction operator
// Overloading the ostream insertion operator
template<typename Item_Type>
std::ostream& operator<<(std::ostream& out, const
std::set<Item_Type>& a_set) {
out << "{";
bool first = true;
for (typename std::set<Item_Type>::const_iterator itr =
a_set.begin(); itr != a_set.end(); ++itr) {
if (first)
out << *itr;
else
out << ", " << *itr;
first = false;
}
return out << "}";
}

GIT – Computer Engineering Department 11


Set Functions in <algorithm>
Function Behavior
template<typename II1, typename Forms the set difference of the elements
II2, typename OI, typename Compare> from the range first1 .. last1 and
OI set_difference(II1 first1, II1 first2 .. last2. The result is stored
last1, II2 first2, II2 last2, OI in the sequence beginning with result.
result, Compare less)
The end of the result sequence is
returned.
template<typename II1, typename Forms the set intersection of the
II2, typename OI, typename Compare> elements from the range first1 ..
OI set_intersection(II1 first1, II1 last1 and first2 .. last2. The
last1, II2 first2, II2 last2, OI result is stored in the sequence
result, Compare less)
beginning with result. The end of the
result sequence is returned.
template<typename II1, typename Forms the set union of the elements
II2, typename OI, typename Compare> from the range first1 .. last1 and
OI set_union(II1 first1, II1 last1, first2 .. last2. The result is stored
II2 first2, II2 last2, OI result, in the sequence beginning with result.
Compare less)
The end of the result sequence is
returned.

GIT – Computer Engineering Department 12


Set Example
int main()
{
set<string> set1;
set<string> set2;
set<string> set_u;
set<string> set_d;
set<string> set_i;

string data1[] = {"Apples", "Oranges", "Pineapples"};


string data2[] = {"Peaches", "Apples", "Grapes"};

set1.insert(data1, data1+3);
set2.insert(data2, data2+3);
cout << "set1 is " << set1 << endl;
cout << "set2 is " << set2 << endl;

set_union(set1.begin(), set1.end(),
set2.begin(), set2.end(),
inserter(set_u, set_u.begin()));
cout << "set1 + set2 is " << set_u << endl;

GIT – Computer Engineering Department 13


Set Example (2)
set_difference(set1.begin(), set1.end(),
set2.begin(), set2.end(),
inserter(set_d, set_d. begin()));
cout << "set1 - set2 is " << set_d << endl;

set_intersection(set1.begin(), set1.end(),
set2.begin(), set2.end(),
inserter(set_i, set_i. begin()));
cout << "set1 * set2 is " << set_i << endl;

bool is_member = (set1.find(string("Apples")) != set1.end());


cout << "\"Apples\" is an element of set1 is "
<< boolalpha << is_member << endl;

return 0;
}

GIT – Computer Engineering Department 14


vector vs. set
 Sets allow no duplicates
 Sets do not have positions, so no
operator[]
 Set iterator can produces elements in sorted
order

 How to implement a set???

GIT – Computer Engineering Department 15


The multiset
 The multiset is the same as the set except
that duplicate entries are allowed.
 The insert always inserts a new item.
 The erase removes all occurrences of an
item.
 The functions lower_bound and upper_bound
may be used to select a range from a set.
– In a multiset these functions can also return all
occurrences of a given value

 How to implement a multiset??


GIT – Computer Engineering Department 16
Example of a multiset
int count_occurences(const multiset<string>& words_set,
const string& target) {
multiset<string>::const_iterator first_itr =
words_set.lower_bound(target);
cout << "*first_itr == " << (first_itr !=
words_set.end() ? *first_itr : "end()") << endl;
multiset<string>::const_iterator last_itr =
words_set.upper_bound(target);
cout << "*last_itr == " << (last_itr !=
words_set.end() ? *last_itr : "end()") << endl;
int count = 0;
for (multiset<string>::const_iterator itr = first_itr;
itr != last_itr; ++itr)
++count;
return count;
}

GIT – Computer Engineering Department 17


The std::pair
 The std::pair is defined in <utility>
 It is a struct (data members are public) that
contains two data items: first and second.
 There is a template function make_pair that
can be used to construct pair objects.

GIT – Computer Engineering Department 18


Definition of pair
template<typename Type1, typename Type2>
struct pair {
Type1 first;
Type2 second;
pair(const Type1& x, const Type2& y): first(x),
second(y) {}
pair() : first(Type1()), second(Type2()) {}
template<typename Other_T1, typename Other_T2>
pair(const pair<Other_T1, Other_T2>& other) {
first = other.first; second = other.second;
}
};
template<typename Type1, typename Type2>
pair<Type1, Type2> make_pair(const Type1& x, const
Type2& y) {return pair<Type1, Type2>(x, y);}

GIT – Computer Engineering Department 19


Maps and the Map Interface
 Map is related to Set: it is a set of ordered pairs
 Ordered pair: (key, value)
– In a given map, there are no duplicate keys
– Values may appear more than once
 Can think of key as “mapping to” a particular value
 Maps support efficient organization of information in
tables
 Mathematically, these maps are:
– Many-to-one (not necessarily one-to-one)
– Onto (every value in the map has a key)

GIT – Computer Engineering Department 20


Map Picture

GIT – Computer Engineering Department 21


The map functions
 Template parameters
– Key_Type: The type of the keys
– Value_Type: The type of the values
– Compare: The function class that compares the
keys.
 All functions defined for the set are defined for
the map, taking a pair
– pair<Key_Type, Value_Type>
 If set is defined map can be defined easily
 Except the index operator: the index operator
(operator[]) is also defined for the map
GIT – Computer Engineering Department 22
The map operator[]
 The map can be used like an array, except
that the Key_Type is the index.
– Example:
map<string, string> a_map;
a_map["J"] = "Jane";
a_map["B"] = "Bill";
a_map["S"] = "Sam";
a_map["B1"] = "Bob";
a_map["B2"] = "Bill";

 Note:
– If a mapping exists, assignment will replace it.
– If a mapping does not exist, a reference will
create one with a default value
GIT – Computer Engineering Department 23
EX: Using a map to build an index
 Index the words in a text with their line
numbers
– Use a map (map<string, list<int>>)
– Each word (string) is a key
– List of line numbers (list<int>) is a value

GIT – Computer Engineering Department 24


Using a map to build an index
typedef map<string, list<int> > map_type;

void build_index(istream& in, map_type& index) {


string next_line; // Each data line
int line_num = 0; // Line number
// Keep reading lines until done
while (getline(in, next_line)) {
line_num++;
String_Tokenizer tokenizer(next_line, " ,.:-!?/%\'\"");
// Insert each token in the index
while (tokenizer.has_more_tokens()) {
string word = tokenizer.next_token();
to_lower(word);
index[word].push_back(line_num);
}
}
}

GIT – Computer Engineering Department 25


Maps
 How to implement Maps?
– What is the efficiency?

GIT – Computer Engineering Department 26


Chapter Outline
 The Set and Map containers and how to use them
 Hash coding and its use in efficient search &
retrieval
 Two forms of hash tables:
– Open addressing
– Chaining
– Their relative benefits and performance tradeoffs
 Implementing both hash table forms
 Introduction to implementation of Maps and Sets
 Applying Maps and Sets to previous problems

GIT – Computer Engineering Department 27


Hash Tables
 Goal: access item given its key (not its position)
 Therefore, want to locate it directly from the key
 In other words, we wish to avoid much searching

 Hash tables provide this capability


– Constant time in the average case! O(1)
– Linear time in the worst case O(n)

 Searching an array: O(n) Searching BST: O(log n)

GIT – Computer Engineering Department 28


Hash Codes
 Suppose we have a table of size N
 A hash code is a number in the range 0 to N-1
 We compute the hash code from the key
– a “default position” when inserting
– a “position hint” when looking up
 A hash function is a way of computing a hash code

 Desire: The set of keys should spread evenly over


the N values
 When two keys have the same hash code: collision

GIT – Computer Engineering Department 29


EX: Character Frequencies in a Text
 To build a Huffman tree, we need the number of
occurrences of each character
 Hash key: character
 Hash value: character frequency
 Assume text includes ASCII characters
– There are 128 characters in ASCII
 Hash table size is 128
– Character ASCII value is the index in the table
• HashTable[65] is frequency of character A
 Hash function: returns the character value

 What if the characters are Unicode


GIT – Computer Engineering Department 30
EX: Character Frequencies in a Text
 There are 216 possible characters, but ...
– Maybe only 100 or so occur in a text
 Choosing table size of 216 results in waste of space
 Approach: hash characters to range 0-199
– That is, use a hash table of size 200
 A possible hash function for this example:
int hash = uni_char % 200;

 Collisions are certainly possible (see later)


– Mapping 216 possible characters to 200 indices

GIT – Computer Engineering Department 31


EX: An ideal hash table
 Each key is mapped to a
different index !
– Not always possible
• many keys, finite indices
 Even distribution

Considerations :
 Devising a hash function
 Decide on table size
 Decide what to do when
collision

GIT – Computer Engineering Department 32


Devising Hash Functions
 Simple functions often produce many
collisions
– ... but complex functions may not be good either!
 It is often an empirical process
 If keys are integers
– hash function : key % table.size()
• Ex: TableSize = 10
Keys = 120, 330, 1000
– TableSize should be prime
• Provides even distribution

GIT – Computer Engineering Department 33


Devising Hash Functions
For strings:
 Adding letter values in a string
– same hash for strings with same letters in different order
– If TableSize is large and number of characters is small
• TableSize = 10000 & number of characters in a key = 8
127*8=1016 < 10000
 Use all characters
∑ 32i Key [KeySize -i -1 ]
– Early characters does not count
• Use only some number of characters
• Use characters in odd spaces
 Use first three characters
729*key[2] + 27*key[1] + key[0]
– If the keys are not random some part of the table is not used

GIT – Computer Engineering Department 34


Devising Hash Functions
For strings:
 Better approach:

size_t hash = 0;
for (size_t i = 0; i < s.size(); ++i)
hash = hash * 31 + s[i];
hash = has % table.size()

 This is the hash function used by Java in its String


class

GIT – Computer Engineering Department 35


Devising Hash Functions
 The String hash is good in that:
– Every letter affects the value
– The order of the letters affects the value
– Hash function distributes the hash values well
over the integers
• Probability of two strings having the same hash value is
quite small
• The hash function can be assumed to produce random
hash values
 The hash value may not be unique
– Too many possible strings
 Collision is still possible
GIT – Computer Engineering Department 36
Devising Hash Functions
 Guidelines for good hash functions:

– Spread values evenly: as if “random”

– Cheap to compute

 Generally, number of possible values much


greater than table size

GIT – Computer Engineering Department 37


Collision Problem
 Will consider two ways to organize hash
tables
– Open addressing
– Chaining

GIT – Computer Engineering Department 38


Open Addressing
 Hashed items are in a single array
 Hash code gives position “hint”
 Handle collisions by checking multiple positions
 Each check is called a probe of the table

If collision  try an alternate cell


h0(x), h1(x), h2(x), …

hi(x) = (hash(x) + F(i)) % table.size()


F(0) = 0

GIT – Computer Engineering Department 39


Linear Probing
 Probe by incrementing the index
– F(i) = i
 If “fall off end”, wrap around to the beginning
– Take care, not to cycle forever!
1. Compute
index as hash_fcn() % table.size()
2. if table[index] == NULL, item is not in the table
3. if table[index] matches item, found item (done)
4. Increment index circularly and go to 2
 Why must we probe repeatedly?
– hashCode may produce collisions
– remainder by table.size may produce collisions
GIT – Computer Engineering Department 40
Search Termination
Ways to obtain proper termination
– Stop when you come back to your starting point
– Stop after probing N slots, where N is table size
– Stop when you reach the bottom the second time
– Ensure table never full
• Reallocate when occupancy exceeds threshold

GIT – Computer Engineering Department 41


Hash Table Example
 Table of strings, initial size 5
 Add “Tom”, hash 84274  4 Slot 4
 Add “Dick”, hash 2129869  4 Slot 0 (wraps)
 Add “Harry”, hash 69496448  3 Slot 3
 Add “Sam”, hash 82879  4 Slot 1 (wraps)
 Add “Pete”, hash 2484038  3 Slot 2 (wraps)

 Note: many lookups will probe a lot!


 Size 11 gives these slots: 3, 5, 10, 56, 7

GIT – Computer Engineering Department 42


Hash Table Example
Insert keys
{89, 18, 49, 58, 69}
 When 49 is
inserted
collision
occurs
– Put into the
next
available
index 0
 58 collides with
18, 89, and 49
– Inserted at
index 1

GIT – Computer Engineering Department 43


Hash Table Considerations
 Cannot traverse a hash table
– Order of stored values is arbitrary
– Can use an iterator to produce in arbitrary order
 When item is deleted, cannot just set its entry to null
– Doing so would break probing
– Must store a “dummy value” instead
– Deleted items waste space and reduce efficiency
• Need to go through dummy values during search and even during
insertion
– Operations take O(n) time in the worst case
 Higher occupancy causes more collisions

GIT – Computer Engineering Department 44


Reducing Collisions By Growing
 Choose a new larger size, e.g., doubling
 Similar to reallocating a vector
– But, elements can move around in reinsertion
• Simple copy is not possible

 (Re)insert non-deleted items into new array


 Install the new array and drop the old

 Rehashing distributes items at least as well


 Deleted items (dummy values) are not inserted

GIT – Computer Engineering Department 45


Rehashing Example
 If 23 is inserted, the
table is over 70 percent
full.


A new table is created

17 is the first prime


twice as large as the
old one; so

Hnew (X) = X mod 17

GIT – Computer Engineering Department 46


Rehashing
 Rehashing is an expensive operation
– Running time is O(N)
 Rehashing frees the programmer from
worrying about table size
 Amortized Analysis: Average over N
operations
– Operations take: O(1) time

GIT – Computer Engineering Department 47


Quadratic Probing
 Linear probing
– Tends to form long clusters of keys in the table
– This causes longer search chains
 Ex:

GIT – Computer Engineering Department 48


Quadratic Probing
Quadratic probing can reduce the effect of primary
clustering
 Index increments form a quadratic series
– F(i) = i2
 Hash values: s, s+12, s+22, s+32, s+42, etc.
– (all % TableSize)
 Direct calculation involves multiply, add, remainder
– Incremental calculation better
Initially:
size_t index = ... 1st probe slot ...
int k = -1;
At each iteration:
k += 2;
index = (index + k) % table.size();
GIT – Computer Engineering Department 49
Quadratic Probing
 When 49
collides with
89, next
position
attemped is
one cell away

 58 collides at
position 8. The
cell one away
is tried,
another
collision
occurs. It is
inserted into
the cell 22=4
away

GIT – Computer Engineering Department 50


Quadratic Probing
 Probe sequence may not produce all table
slots
– A loop around full cells may happen
– Hash table not full but empty space not found

 Theorem : If the table size is prime and more


than half of the table is empty new element
can always be inserted.

 Problem : Secondary clustering!...


GIT – Computer Engineering Department 51
Double Hashing
Double Hashing solves the clustering problem
 Use second hash function
 F(i) = i * hash2(x)

 Poor example :
hash2(x) = x mod 9
hash1(x) = x mod 10
TableSize = 10

If x = 99 what happens ?
hash2(x) ≠ 0 for any x

GIT – Computer Engineering Department 52


Double Hashing
 Good choice :
hash2(x) = R – (x mod R)
R is a prime and < TableSize

GIT – Computer Engineering Department 53


Double Hashing

hash2(x) = 7 – (X mod 7)
GIT – Computer Engineering Department 54
Chaining
 Alternative to open addressing
 Each table slot references a linked list
– List contains all items that hash to that slot
– The linked list is often called a bucket
– So sometimes called bucket hashing
 Examines only items with same hash code
– As opposed to open addressing (search chains may
overlap)
 Insertion about as complex
 Deletion is simpler
 Linked list can become long  rehash

GIT – Computer Engineering Department 55


Chaining Picture

Two items hashed to bucket 3

Three items hashed to bucket 4

GIT – Computer Engineering Department 56


Performance of Hash Tables
 Load factor = # filled cells / table size
– Between 0 and 1
 Load factor has greatest effect on
performance
 Lower load factor  better performance
– Reduce collisions in sparsely populated tables

GIT – Computer Engineering Department 57


Performance of Hash Tables
 For open addressing, linear probing
– Knuth gives expected # probes
1 1 
I & U ⇒ 1 + 

2  (1 − λ )2 

1 1 
S ⇒ 1 + 

2  (1 − λ ) 
‫ג‬
– As approaches 1
• Number of probes increases
• insertions might fail
– Rehashing with larger TableSize
• if ‫ > ג‬0.5
• if insertion fails
GIT – Computer Engineering Department 58
Performance of Hash Tables
 For chaining
– ‫ ג‬is the avarage length of a list
– Successful Find  ‫ג‬/2 comparisons + time to
evaluate hash function
– Unsuccessful Find & Insert  ‫ ג‬comparisons +
time to evaluate hash function
 Good choice ‫ ~ ג‬1
 Here ‫ ג‬can be greater than 1

Disadvantage of separate chaining is


allocate/deallocate memory !
GIT – Computer Engineering Department 59
Performance of Hash Tables

L Number of Probes
Linear Probing Chaining
0 1.00 1.00
0.25 1.17 1.13
0.5 1.50 1.25
0.75 2.50 1.38
0.83 3.38 1.43
0.9 5.50 1.45
0.95 10.50 1.48

GIT – Computer Engineering Department 60


Random Collision Resolution
 Performance of double hashing is better than linear probing
 Assume: Random collision resolution
– Probes are independent
– No clustering problem
 Unsuccessful search and Insert
– Number of probes until an empty cell is found
(1- ‫ = )ג‬fraction of cells that are empty
1 / (1- ‫ = )ג‬expected number of probes until an empty cell
 Successful search
P(X)=Number of probes when the element X is inserted
1/N∑ P(X) approximately

1 λ 1 1 1
λ ∫0
dx = ln
1− x λ 1− λ

GIT – Computer Engineering Department 61


Performance of Hash Tables (3)
 Hash table:
– Insert: average O(1)
– Search: average O(1)
 Sorted array:
– Insert: average O(n)
– Search: average O(log n)
 Binary Search Tree:
– Insert: average O(log n)
– Search: average O(log n)
 But balanced trees can guarantee O(log n)
GIT – Computer Engineering Department 62
Implementing Hash Tables
 Class hash_map: used for both
implementations
 typedef std::pair<const Key_Type, Value_Type>
Entry_Type;

 File Hash_Table_Open.h: implements open


addressing
 Class Hash_Table_Chain: implements
chaining
 Further implementation concerns

GIT – Computer Engineering Department 63


Class hash_map
Function Behavior
iterator begin() Returns an iterator to the first entry in the
const_iterator begin() const map.
iterator end() Returns an iterator to one past the last entry
const_iterator end() const in the map.
bool empty() Returns true if the map is empty.
size_t size() Returns the number of entries in the map.
pair<iterator bool> insert(const Inserts an entry into the map. If the key is
Entry_Type& entry> not in the map, the returned iterator
references the inserted entry, and the bool
will be true. If the key is currently in the
map, the returned iterator references the
existing entry, and the bool will be false.
void erase(const Key_Type& key) Removes the item from the map.
iterator find(const Key_Type& Returns an iterator that references the item in
key) the map. If not present, end() is returned.
Value_Type& operator[](const Returns a reference to the value mapped to
Key_Type& key) the key. If the key is not currently in the
map, a default value is inserted, and a
reference to the inserted item is returned.

GIT – Computer Engineering Department 64


Hash_Table_Open
Data Fields and Private Functions
Data Field Attribute
hash<Key_Type> hash_fcn The function object that will compute the
hash function for the Key_Type.
size_t num_keys The number of keys in the table,
excluding deleted keys.
size_t num_deletes The number of deleted keys.
std::vector<Entry_Type*> the_table The vector to hold the hash table.
const double LOAD_THRESHOLD The maximum load factor.
static const size_t The initial capacity
INITIAL_CAPACITY
const Entry_Type dummy A dummy entry to represent deleted keys
const Entry_Type* DELETED A pointer to the dummy entry.
Function Behavior
size_t locate(const Key_Type& key) Returns the index of the specified key if
present in the table; otherwise; returns
the index of the first free slot.
void rehash() Doubles the capacity of the table and
permanently removes deleted items.

GIT – Computer Engineering Department 65


The hash template class
 The hash template class is a function class
that implements the hash function for its
template parameter.
 It is defined as follows:
template<typename Key_Type>
struct hash {
size_t operator()(const Key_Type&);
};

 We will describe some implementations later

GIT – Computer Engineering Department 66


The locate function
size_t locate(const Key_Type& key) {
size_t index = hash_fcn(key) % the_table.size ();
while (the_table[index] != NULL
&& (the_table[index] == DELETED
|| the_table[index]->first != key))
index = (index + 1) % the_table.size ();
return index;
}

GIT – Computer Engineering Department 67


The insert function
std::pair<iterator, bool> insert(const Entry_Type& entry) {
double load_factor = double(num_keys + num_deletes) /
the_table.size();
if (load_factor > LOAD_THRESHOLD){
rehash(); // Double the size of the table.
}
// Find the position in the table.
size_t index = locate(entry.first);
// See whether it is empty.
if (the_table[index] == NULL) {
// Create a new entry.
the_table[index] = new Entry_Type(entry);
num_keys++;
return std::make_pair(iterator(this, index), true);
} else {
// Item is already in the table.
return std::make_pair(iterator(this, index), false);
}
}

GIT – Computer Engineering Department 68


The index operator
Value_Type& operator[](const Key_Type& key) {
// Try to insert a dummy item.
std::pair<iterator, bool> ret = insert(Entry_Type(key,
Value_Type()));
// Return a reference to the value found or inserted.
return ret.first->second;
}

GIT – Computer Engineering Department 69


The rehash function
void rehash() {
// Create a new table whose size is double the current table.
std::vector<Entry_Type*> other_table(the_table.size() * 2, NULL);

// Swap this table with the current table.


the_table.swap(other_table);

// Reinsert all items from old table to new.


num_deletes = 0;
for (size_t i = 0; i < other_table.size(); i++) {
if ((other_table[i] != NULL) && (other_table[i] != DELETED)) {
size_t index = locate(other_table[i]->first);
the_table[index] = other_table[i];
}
}
}

GIT – Computer Engineering Department 70


The erase function
Algorithm for erase
1. Find the first table element that is empty or that
contains the key.
2. if an empty element is found
3. done
4. else
5. Delete the Entry_Type object pointed to.
6. Set the pointer in this entry to DELETED
7. Increment num_deletes
8. Decrement num_keys
Implementation is left as an exercise

GIT – Computer Engineering Department 71


Copy Constructor, etc.
 Because the vector<Entry_Type*> contains pointers
to dynamically created objects, we must define the
copy constructor, assignment operator, and
destructor.
hash_map(const hash_map<Key_Type, Value_Type>& other) :
hash_fcn(hash<Key_Type>()), num_keys(0),
the_table(other.the_table.size(), NULL),
LOAD_THRESHOLD(0.75),
num_deletes(0) {
for (size_t i = 0; i < other.the_table.size(); i++) {
if (other.the_table[i] != NULL &&
other.the_table[i] != DELETED)
insert(Entry_Type(other.the_table[i]->first,
other.the_table[i]->second));
}
}

GIT – Computer Engineering Department 72


Hash_Table_Chain.h
Data Field Attribute
hash<Key_Type> hash_fcn The hash function object.
size_t num_keys The number of keys
std::vector<std::list<Entry_Type> > A vector of lists containing the items
the_buckets
static const size_t INITIAL_CAPACITY The initial capacity
static double LOAD_THRESHOLD The maximum load threshold before
rehashing.

GIT – Computer Engineering Department 73


The insert function
std::pair<iterator, bool>
insert(const Entry_Type& entry) {
// Check for the need to rehash.
double load_factor = double(num_keys) / the_buckets.size();
if (load_factor > LOAD_THRESHOLD){rehash();}
// Find the position in the table.
size_t index = hash_fcn(entry.first) % the_buckets.size();
// Search for the key.
typename std::list<Entry_Type>::iterator pos =
the_buckets[index].begin();
while (pos != the_buckets[index].end()
&& pos->first != entry.first) ++pos;
if (pos == the_buckets[index].end()) { // Not in table
the_buckets[index].push_back(Entry_Type(entry));
num_keys++;
return std::make_pair(iterator(this, index,
--(the_buckets[index].end())),
true);
} else { // Already there
return std::make_pair(iterator(this, index, pos), false);
}
}

GIT – Computer Engineering Department 74


Copy Constructor, etc.
 Since Hash_Table_Chain.h uses a
std::vector<std::list<Entry_Type> > to hold the
entries:
– The vector and list both define the copy
constructor, etc. to make a deep copy and to
clean up any dynamically allocated objects when
destroyed.
 Therefore, no special implementation is
required for Hash_Table_Chain.h

GIT – Computer Engineering Department 75


Implementation Considerations
 The hash template class defines the hash function,
but provides no implementation.
 The user of the hash_map (either
Hash_Table_Open.h or Hash_Table_Chain.h) must
define a specialization for the Key_Type.
 Example for string:
template<>
struct hash<std::string> {
size_t operator()(const std::string& s) {
size_t result = 0;
for (size_t i = 0; i < s.length(); i++) {
result = result * 31 + s[i];
}
return result;
}
};

GIT – Computer Engineering Department 76


Specialization for int
 One possibility for int is to merely return the
integer value, cast to size_t.
– This does not produce a “random” distribution of
keys.
 An alternative is to multiply by some large
prime and take the result modulo the
computer’s word size.
template<>
struct hash<int> {
size_t operator()(int i) {
return size_t(4262999287U * i);
}
};

GIT – Computer Engineering Department 77


Defining Your Own hash function
 The hash_map class implementations use the
hash<Key_Type> function to locate the initial search
position.
 It then they use the objects equality operator
(operator ==) to determine if there is a match.
 Therefore, your hash function should obey the
following constraint:
– If obj1 == obj2 then
hash<type>()(obj1) == hash<type>()(obj2)

GIT – Computer Engineering Department 78


Applying Maps: Phone Directory
string Phone_Directory::add_or_change_entry(
const string& name, const string& number) {
string old_number = the_directory[name];
the_directory[name] = number;
modified = true;
return old_number;
}

string Phone_Directory::lookup_entry(const string& name)


const
{
const_iterator itr = the_directory.find(name);
if (itr != the_directory.end())
return itr->second;
else
return "";
}

GIT – Computer Engineering Department 79


Applying Maps: Phone Directory (2)
string Phone_Directory::remove_entry(const
string& name)
{
string old_number = the_directory[name];
the_directory.erase(name);
modified = old_number != string();
return old_number;
}

GIT – Computer Engineering Department 80


Applying Maps: Huffman Coding
Algorithm for build_frequency_table
1. while there are more characters in the input file
2. Read a character
3. Increment the entry in the map associated with this character
4. for each entry in the map
5. Store its data as the weight-symbol pair in the
vector<Huff_Data>
6. Return the vector<Huff_Data>

GIT – Computer Engineering Department 81


The build_frequency_table function
vector<Huff_Data> Huffman_Tree::build_frequency_table(istream&
in) {
map<char, int> frequencies;
char c;
while (in.get(c)) {
frequencies[c]++;
}
vector<Huff_Data> result;
for (map<char, int>::iterator itr = frequencies.begin();
itr != frequencies.end(); ++itr) {
result.push_back(Huff_Data(itr->second, itr->first));
}
return result;
}

GIT – Computer Engineering Department 82


build_code
 Additional data field:
std::map<char, Bit_String> code_map;

 Starter function
void build_code() {
code_map.clear();
build_code(huff_tree, Bit_String()); }

 Recursive function
void Huffman_Tree::build_code(const Binary_Tree<Huff_Data>& tree,
const Bit_String& code) {
if (tree.is_leaf()) {
Huff_Data datum = tree.get_data();
code_map[datum.symbol] = code;
} else {
Bit_String left_code(code);
left_code.append(false);
build_code(tree.get_left_subtree(), left_code);
Bit_String right_code(code);
right_code.append(true);
build_code(tree.get_right_subtree(), right_code);
}
}

GIT – Computer Engineering Department 83


encode
void Huffman_Tree::encode(std::istream& in, std::ostream& out)
{
Bit_String result;
char next_char;
while (in.get(next_char)) {
result += code_map[next_char];
}
result.write(out);
}

GIT – Computer Engineering Department 84