
Hashing

Hashing can be used to build, search, or delete from a table.


The basic idea behind hashing is to take a field in a record, known as the key, and convert it
through some fixed process to a numeric value, known as the hash key, which represents the
position to either store or find an item in the table. The numeric value will be in the range of
0 to n-1, where n is the maximum number of slots (or buckets) in the table.
The fixed process to convert a key to a hash key is known as a hash function. This function
will be used whenever access to the table is needed.
One common method of determining a hash key is the division method of hashing. The
formula that will be used is:
hash key = key % number of slots in the table

The division method is generally a reasonable strategy, unless the keys happen to have some
undesirable properties. For example, if the table size is 10 and all of the keys end in zero,
every key hashes to slot 0. In this case, the choice of hash function and table size needs to be
carefully considered. The best table sizes are prime numbers.
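
The effect of the table size can be checked with a small sketch. The function names divisionHash and distinctSlots are made up for this illustration, and keys are assumed to be non-negative integers.

```cpp
#include <vector>

// Division-method hash: maps a non-negative key to a slot in [0, tableSize).
int divisionHash(int key, int tableSize) {
    return key % tableSize;
}

// Counts how many distinct slots a set of keys lands in for a given table size.
// (Hypothetical helper, used only to compare table sizes.)
int distinctSlots(const std::vector<int>& keys, int tableSize) {
    std::vector<bool> used(tableSize, false);
    int count = 0;
    for (int key : keys) {
        int slot = divisionHash(key, tableSize);
        if (!used[slot]) {
            used[slot] = true;
            ++count;
        }
    }
    return count;
}
```

With the keys 10, 20, 30, 40 and 50, distinctSlots reports a single slot for a table size of 10, but five different slots for the prime table size 11.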
One problem though is that keys are not always numeric. In fact, it's common for them to be
strings.

One possible solution: add up the ASCII values of the characters in the string to get a
numeric value and then perform the division method.
// Sum the character codes of the key, then apply the division method.
int hashValue = 0;
for( size_t j = 0; j < stringKey.length(); j++ )
    hashValue += stringKey[j];
int hashKey = hashValue % tableSize;

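The previous method is simple, but it is flawed if the table size is large. For example, assume
a table size of 10007 and that all keys are eight or fewer characters long. Since an ASCII
character value is at most 127, the sum is at most 127 * 8 = 1016, so only slots 0 through 1016
are ever used and most of the table stays empty.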
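
The sketch below contrasts the additive method with one possible alternative. The polynomial hash and its multiplier 37 are assumptions brought in for illustration and are not part of the notes above; the function names are hypothetical.

```cpp
#include <string>

// The additive hash from the text: sum of character codes, then division method.
int additiveHash(const std::string& key, int tableSize) {
    int sum = 0;
    for (unsigned char c : key) sum += c;
    return sum % tableSize;
}

// Polynomial alternative (an assumption, not from the text): each character is
// weighted by its position using Horner's rule with multiplier 37.
// Unsigned arithmetic wraps around instead of overflowing.
int polynomialHash(const std::string& key, int tableSize) {
    unsigned long long h = 0;
    for (unsigned char c : key) h = h * 37 + c;
    return static_cast<int>(h % tableSize);
}
```

With a table size of 10007, additiveHash gives "ab" and "ba" the same slot (195), since the sum ignores character order, while polynomialHash separates them (3687 and 3723).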
No matter what the hash function, there is the possibility that two keys could resolve to the
same hash key. This situation is known as a collision.
When this occurs, there are two simple solutions:
1. chaining
2. linear probe (aka linear open addressing)
And two slightly more difficult solutions:
3. quadratic probe
4. double hashing

Hashing with Chains


When a collision occurs, elements with the same hash key will be chained together. A chain
is simply a linked list of all the elements with the same hash key.
The hash table slots will no longer hold a table element. They will now hold the address of a
table element.

Searching a hash table with chains:


Compute the hash key
If slot at hash key is null
    Key not found
Else
    Search the chain at hash key for the desired key
Endif

Inserting into a hash table with chains:


Compute the hash key
If slot at hash key is null
    Insert as first node of chain
Else
    Search the chain for a duplicate key
    If duplicate key
        Don't insert
    Else
        Insert into chain
    Endif
Endif

Deleting from a hash table with chains:


Compute the hash key
If slot at hash key is null
    Nothing to delete
Else
    Search the chain for the desired key
    If key is not found
        Nothing to delete
    Else
        Remove node from the chain
    Endif
Endif
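
The three procedures above can be sketched in C++ as follows. This is a minimal illustration rather than a definitive implementation: the class name ChainedHashTable is hypothetical, keys are assumed to be non-negative integers, and an empty std::list stands in for a null slot.

```cpp
#include <algorithm>
#include <list>
#include <vector>

class ChainedHashTable {
public:
    explicit ChainedHashTable(int slots) : table_(slots) {}

    // Search: walk the chain at the hash key (an empty chain means "not found").
    bool contains(int key) const {
        const std::list<int>& chain = table_[slot(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    // Insert: refuse duplicates, otherwise add the key to its chain.
    bool insert(int key) {
        if (contains(key)) return false;
        table_[slot(key)].push_back(key);
        return true;
    }

    // Delete: remove the node from its chain if the key is present.
    bool remove(int key) {
        std::list<int>& chain = table_[slot(key)];
        auto it = std::find(chain.begin(), chain.end(), key);
        if (it == chain.end()) return false;
        chain.erase(it);
        return true;
    }

private:
    // Division method; assumes non-negative keys.
    int slot(int key) const { return key % static_cast<int>(table_.size()); }

    std::vector<std::list<int>> table_;
};
```

For example, in a table of 10 slots the keys 89 and 49 both hash to slot 9 and end up in the same chain, so removing 89 leaves 49 searchable.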

Hashing with Linear Probe


When using a linear probe, the item will be stored in the next available slot in the table,
assuming that the table is not already full.
This is implemented via a linear search for an empty slot, from the point of collision. If the
physical end of table is reached during the linear search, the search will wrap around to the
beginning of the table and continue from there.
If an empty slot is not found before reaching the point of collision, the table is full.

A problem with the linear probe method is that it is possible for blocks of data to form when
collisions are resolved. This is known as primary clustering.
This means that any key that hashes into the cluster will require several attempts to resolve
the collision.
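For example, insert the nodes 89, 18, 49, 58, and 69 into a hash table that holds 10 items
using the division method. 89 and 18 go directly to slots 9 and 8. 49 collides at slot 9 and
wraps around to slot 0. 58 collides at slots 8, 9, and 0 and lands in slot 1. 69 collides at
slots 9, 0, and 1 and lands in slot 2.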
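
A minimal sketch of linear probe insertion, which can be used to trace the example above. The sentinel EMPTY_SLOT and the function name linearProbeInsert are assumptions for this illustration, and keys are assumed to be non-negative.

```cpp
#include <vector>

const int EMPTY_SLOT = -1;  // hypothetical marker for an unused slot

// Inserts key using a linear probe. Returns the slot used, or -1 if the
// search wraps all the way around without finding an empty slot (table full).
int linearProbeInsert(std::vector<int>& table, int key) {
    int size = static_cast<int>(table.size());
    int home = key % size;  // division method
    for (int i = 0; i < size; ++i) {
        int slot = (home + i) % size;  // wrap around at the end of the table
        if (table[slot] == EMPTY_SLOT) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

Starting from a table of 10 empty slots, inserting 89, 18, 49, 58 and 69 in that order places them in slots 9, 8, 0, 1 and 2, which forms one cluster running from slot 8 through slot 2.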

Hashing with Quadratic Probe


To resolve the primary clustering problem, quadratic probing can be used. With quadratic
probing, rather than always moving one spot, move i^2 spots from the point of collision, where
i is the number of attempts to resolve the collision.

Limitation: at most half of the table can be used as alternative locations to resolve collisions.
This means that once the table is more than half full, it's difficult to find an empty spot, and
an insertion can fail even though empty slots remain. Quadratic probing also introduces a new
problem known as secondary clustering: elements that hash to the same hash key will always
probe the same alternative cells.
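
A minimal sketch of quadratic probe insertion. The sentinel FREE_SLOT and the function name quadraticProbeInsert are assumptions for this illustration; the keys used in the note that follows are borrowed from the linear probe example.

```cpp
#include <vector>

const int FREE_SLOT = -1;  // hypothetical marker for an unused slot

// Inserts key using a quadratic probe: attempt i looks i*i slots past the
// point of collision. Returns the slot used, or -1 if no free slot was found
// within table.size() attempts (which can happen before the table is full).
int quadraticProbeInsert(std::vector<int>& table, int key) {
    int size = static_cast<int>(table.size());
    int home = key % size;  // division method
    for (int i = 0; i < size; ++i) {
        int slot = (home + i * i) % size;
        if (table[slot] == FREE_SLOT) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

With a table of 10 slots and the keys 89, 18, 49, 58 and 69, the keys land in slots 9, 8, 0, 2 and 3. 58 and 69 need two probes each, but they no longer pile up directly behind the cluster at slots 8 and 9.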

Hashing with Double Hashing


Double hashing uses the idea of applying a second hash function to the key when a collision
occurs. The result of the second hash function will be the number of positions from the point
of collision to insert.
There are a couple of requirements for the second function:
1. it must never evaluate to 0
2. it must make sure that all cells can be probed

A popular second hash function is: Hash2(key) = R - ( key % R ) where R is a prime number
that is smaller than the size of the table.
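
A minimal sketch of double hashing using the second hash function above. The sentinel NO_KEY and the function names are assumptions for this illustration, as is the choice of R = 7 for a table of 10 slots in the note that follows.

```cpp
#include <vector>

const int NO_KEY = -1;  // hypothetical marker for an unused slot

// Second hash function from the text: the result is in 1..R, never 0.
int secondHash(int key, int r) {
    return r - (key % r);
}

// Inserts key using double hashing: attempt i looks i * secondHash(key)
// slots past the point of collision. Returns the slot used, or -1 if no
// free slot was found within table.size() attempts.
int doubleHashInsert(std::vector<int>& table, int key, int r) {
    int size = static_cast<int>(table.size());
    int home = key % size;  // division method
    int step = secondHash(key, r);
    for (int i = 0; i < size; ++i) {
        int slot = (home + i * step) % size;
        if (table[slot] == NO_KEY) {
            table[slot] = key;
            return slot;
        }
    }
    return -1;
}
```

With a table of 10 slots, R = 7 and the keys 89, 18, 49, 58 and 69, the keys land in slots 9, 8, 6, 3 and 0. Since 10 is not prime, a step such as 5 only ever reaches two cells, which is one reason a prime table size is preferred.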

Hashing with Rehashing


Once the hash table gets too full, operations will start to take too long and insertions may
fail. To solve this problem, a table at least twice the size of the original will be built
and the elements will be transferred to the new table.
The new size of the hash table:
1. should also be prime
2. will be used to calculate the new insertion spot (hence the name rehashing)

This is a very expensive operation: it takes O(N) time, since there are N elements to rehash and
the table size is roughly 2N. This is acceptable, though, since it doesn't happen that often.
The question becomes: when should the rehashing be applied?
Some possible answers:
1. once the table becomes half full
2. once an insertion fails
3. once a specific load factor has been reached, where load factor is the ratio of the
number of elements in the hash table to the table size
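
A minimal sketch of rehashing triggered by a load factor. Several choices here are assumptions made for illustration: the threshold of 0.5, the use of a linear probe, the sentinel VACANT, and all function names.

```cpp
#include <vector>

const int VACANT = -1;  // hypothetical marker for an unused slot

bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime that is greater than or equal to n.
int nextPrime(int n) {
    while (!isPrime(n)) ++n;
    return n;
}

// Places key with a linear probe; assumes at least one slot is vacant.
void place(std::vector<int>& table, int key) {
    int size = static_cast<int>(table.size());
    int slot = key % size;
    while (table[slot] != VACANT) slot = (slot + 1) % size;
    table[slot] = key;
}

// Builds a table whose size is the first prime >= twice the old size and
// re-inserts every key using the new size.
std::vector<int> rehash(const std::vector<int>& old) {
    std::vector<int> bigger(nextPrime(2 * static_cast<int>(old.size())), VACANT);
    for (int key : old)
        if (key != VACANT) place(bigger, key);
    return bigger;
}

// Inserts key, rehashing first if the insert would push the load factor
// (count / table size) above 0.5.
void insertWithRehash(std::vector<int>& table, int& count, int key) {
    if (2 * (count + 1) > static_cast<int>(table.size())) table = rehash(table);
    place(table, key);
    ++count;
}
```

Starting with 7 slots, the keys 13, 15 and 24 fit without rehashing. Inserting 6 would make the table more than half full, so the table grows to 17 slots and 24 moves from slot 3 to slot 7.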

Deletion from a Hash Table


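The method of deletion depends on the method of insertion. In any of the cases, the same
hash function(s) will be used to find the location of the element in the hash table.
There is a major problem that can arise if a collision occurred when inserting: it's possible
to "lose" an element. With probing, a search stops at the first empty slot, so emptying a slot
in the middle of a probe sequence hides every element stored after it. A common remedy is to
mark the slot as deleted rather than empty.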
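
One way to avoid losing elements in a probed table is lazy deletion, where a removed slot is marked as deleted instead of empty. The sketch below assumes a linear probe and non-negative keys; the type and function names are hypothetical.

```cpp
#include <vector>

enum class SlotState { Empty, Occupied, Deleted };

struct Slot {
    int key = 0;
    SlotState state = SlotState::Empty;
};

// Returns the slot holding key, or -1 if it is absent. The search passes
// over Deleted slots and stops only at a slot that was never used.
int findSlot(const std::vector<Slot>& table, int key) {
    int size = static_cast<int>(table.size());
    int home = key % size;
    for (int i = 0; i < size; ++i) {
        int slot = (home + i) % size;
        if (table[slot].state == SlotState::Empty) return -1;
        if (table[slot].state == SlotState::Occupied && table[slot].key == key)
            return slot;
    }
    return -1;
}

// Inserts key into the first slot that is not occupied (Empty or Deleted).
// Assumes the key is not already present. Returns false if the table is full.
bool insertKey(std::vector<Slot>& table, int key) {
    int size = static_cast<int>(table.size());
    int home = key % size;
    for (int i = 0; i < size; ++i) {
        int slot = (home + i) % size;
        if (table[slot].state != SlotState::Occupied) {
            table[slot].key = key;
            table[slot].state = SlotState::Occupied;
            return true;
        }
    }
    return false;
}

// Lazy deletion: mark the slot as Deleted rather than Empty.
bool removeKey(std::vector<Slot>& table, int key) {
    int slot = findSlot(table, key);
    if (slot < 0) return false;
    table[slot].state = SlotState::Deleted;
    return true;
}
```

In a table of 10 slots, 89 goes to slot 9 and 49 collides there and wraps to slot 0. After removing 89, a search for 49 still succeeds because slot 9 is marked as deleted; had it been reset to empty, the search would have stopped at slot 9.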

Operating system
What is pre-emptive and non-preemptive scheduling?
Tasks are usually assigned priorities. At times it is necessary to run a task that has a higher
priority before another task, even though that other task is already running. The running task
is therefore interrupted for some time and resumed later, when the higher-priority task has
finished its execution. This is called preemptive scheduling.
Example: round robin
In non-preemptive scheduling, a running task is executed until completion. It cannot be
interrupted.
Example: first in, first out

What is pre-emptive and non-preemptive scheduling?


Preemptive scheduling: Preemptive scheduling is priority based. The highest-priority
process should always be the process that is currently running.
Non-preemptive scheduling: When a process enters the running state, it is not removed
from the processor until it finishes its service time.
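
The difference between the two approaches can be seen by comparing completion times. The sketch below is a simplified model and not a real scheduler: all tasks are assumed to arrive at time 0 with positive run times, the quantum is assumed to be positive, and the function names are hypothetical. It uses round robin, which preempts on a time slice rather than on priority.

```cpp
#include <algorithm>
#include <deque>
#include <vector>

// Non-preemptive first in, first out: each task runs to completion in order.
std::vector<int> fifoCompletionTimes(const std::vector<int>& burst) {
    std::vector<int> done(burst.size(), 0);
    int clock = 0;
    for (std::size_t i = 0; i < burst.size(); ++i) {
        clock += burst[i];
        done[i] = clock;
    }
    return done;
}

// Preemptive round robin: a task runs for at most one quantum, then is
// interrupted and sent to the back of the ready queue if it is not finished.
std::vector<int> roundRobinCompletionTimes(const std::vector<int>& burst, int quantum) {
    std::vector<int> remaining = burst;
    std::vector<int> done(burst.size(), 0);
    std::deque<int> ready;
    for (int i = 0; i < static_cast<int>(burst.size()); ++i) ready.push_back(i);
    int clock = 0;
    while (!ready.empty()) {
        int id = ready.front();
        ready.pop_front();
        int slice = std::min(quantum, remaining[id]);
        clock += slice;
        remaining[id] -= slice;
        if (remaining[id] > 0) ready.push_back(id);  // preempted, resumes later
        else done[id] = clock;                       // finished
    }
    return done;
}
```

For run times of 5, 2 and 3 time units, first in, first out completes the tasks at times 5, 7 and 10. Round robin with a quantum of 2 completes them at times 10, 4 and 9, so the short tasks no longer wait behind the long one.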

What is page fault and when does it occur?


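When the page (data) requested by a program is not available in memory, it is called a
page fault. The operating system normally handles the fault by loading the page from
secondary storage and resuming the program. Only an access to an invalid address results in
the application being shut down.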

What is page fault and when does it occur?


A page is a fixed length memory block used as a transferring unit between physical memory
and an external storage. A page fault occurs when a program accesses a page that has been
mapped in address space, but has not been loaded in the physical memory.

What is dirty bit?


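A dirty bit is a flag attached to a block of memory, such as a page or a cache line. It is set
when the CPU modifies the block and the change has not yet been written back to storage. The
bit is kept in the memory cache or in the page table of the virtual storage space.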

Define compaction.
Compaction is a process in which allocated memory is moved together so that the free space is
collected into one large contiguous chunk, making space available for processes.

Best-Fit, First-Fit and Worst-Fit Memory Allocation Method for Fixed Partition

The following jobs are loaded into memory using fixed partitions, following one of three memory
allocation methods (best-fit, first-fit or worst-fit).

Memory Block    Size     List of Jobs    Size     Turnaround
Block 1         50k      Job 1           100k
Block 2         200k     Job 2           10k
Block 3         70k      Job 3           35k
Block 4         115k     Job 4           15k
Block 5         15k      Job 5           23k
                         Job 6           6k
                         Job 7           25k
                         Job 8           55k
                         Job 9           88k
                         Job 10          100k

BEST-FIT
Best-fit memory allocation makes the best use of memory space but is slower in making
allocations. In the illustration, on the first processing cycle, jobs 1 to 5 are submitted
and processed first. After the first cycle, job 2 and job 4, located on block 5 and block 3
respectively and both having a turnaround of 1, are replaced by job 6 and job 7, while job 1,
job 3 and job 5 remain on their designated blocks. In the third cycle, job 1 remains on block 4,
while job 8 and job 9 replace job 7 and job 5 respectively (both having a turnaround of 2). On
the next cycle, job 9 and job 8 remain on their blocks while job 10 replaces job 1 (which has a
turnaround of 3). On the fifth cycle, only job 9 and job 10 remain to be processed and there are
3 free memory blocks for incoming jobs. Since there are only 10 jobs, these blocks remain free.
On the sixth cycle, job 10 is the only remaining job to be processed. Finally, on the seventh
cycle, all jobs have been successfully processed and executed and all the memory blocks are
free.

FIRST-FIT
First-fit memory allocation is faster in making allocations but leads to memory waste. The
illustration shows that on the first cycle, job 1 to job 4 are submitted first, while job 6
occupies block 5 because the remaining memory space is enough for its required memory
size. Job 5 is in the waiting queue because the memory size of block 5 is not enough for
job 5 to be processed. On the next cycle, job 5 replaces job 2 on block 1 and job 7 replaces
job 4 on block 4, after both job 2 and job 4 finish processing. Job 8 is in the waiting queue
because the remaining block is not large enough to accommodate the memory size of job 8. On
the third cycle, job 8 replaces job 3 and job 9 occupies block 4 after job 7 has been processed,
while job 1 and job 5 remain on their designated blocks. After the third cycle, block 1 and
block 5 are free to serve incoming jobs, but since there are only 10 jobs, they remain free.
Job 10 occupies block 2 after job 1 finishes its turns, while job 8 and job 9 remain on their
blocks. On the fifth cycle, only job 9 and job 10 remain to be processed and there are 3 memory
blocks free. In the sixth cycle, job 10 is the only remaining job to be processed. Lastly, in
the seventh cycle, all jobs have been successfully processed and executed and all the memory
blocks are free.

WORST-FIT

Worst-fit memory allocation is the opposite of best-fit. It allocates the largest free available
block to the new job, and it is not the best choice for an actual system. In the illustration, on
the first cycle, job 5 is in the waiting queue while job 1 to job 4 and job 6 are the first jobs
to be processed. After that, job 5 occupies the free block, replacing job 2. Block 5 is now free
to accommodate the next job, which is job 8, but since the size of block 5 is not enough for
job 8, job 8 is placed in the waiting queue. On the next cycle, block 3 accommodates job 8 while
job 1 and job 5 remain on their memory blocks. In this cycle, 2 memory blocks are free. In the
fourth cycle, only job 8 on block 3 remains, while job 1 and job 5 are replaced by job 9 and
job 10 respectively. Just as in the previous cycle, there are still two free memory blocks. On
the fifth cycle, job 8 finishes while job 9 and job 10 are still on block 2 and block 4
respectively, and there is an additional free memory block. The same scenario occurs on the
sixth cycle. Lastly, on the seventh cycle, both job 9 and job 10 finish processing, so all jobs
have been successfully processed and executed and all the memory blocks are free.
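
The block selection rule behind the three methods can be sketched as follows. This is a minimal illustration that covers only the first cycle: it does not model turnaround or later cycles, it assumes that a job that does not fit waits while later jobs are still tried, and the names Fit, chooseBlock and firstCycle are hypothetical.

```cpp
#include <vector>

enum class Fit { First, Best, Worst };

// Returns the index of the free block chosen for a job, or -1 if none fits.
// First: the first free block that is large enough.
// Best:  the smallest free block that is large enough.
// Worst: the largest free block that is large enough.
int chooseBlock(const std::vector<int>& blockSize, const std::vector<bool>& busy,
                int jobSize, Fit strategy) {
    int chosen = -1;
    for (int i = 0; i < static_cast<int>(blockSize.size()); ++i) {
        if (busy[i] || blockSize[i] < jobSize) continue;
        if (chosen == -1) {
            chosen = i;
            if (strategy == Fit::First) break;
        } else if (strategy == Fit::Best && blockSize[i] < blockSize[chosen]) {
            chosen = i;
        } else if (strategy == Fit::Worst && blockSize[i] > blockSize[chosen]) {
            chosen = i;
        }
    }
    return chosen;
}

// Places jobs in order on initially free blocks. Returns, for each job, the
// index of its block, or -1 if the job has to wait.
std::vector<int> firstCycle(const std::vector<int>& blockSize,
                            const std::vector<int>& jobSize, Fit strategy) {
    std::vector<bool> busy(blockSize.size(), false);
    std::vector<int> placement(jobSize.size(), -1);
    for (std::size_t j = 0; j < jobSize.size(); ++j) {
        int b = chooseBlock(blockSize, busy, jobSize[j], strategy);
        if (b >= 0) {
            busy[b] = true;
            placement[j] = b;
        }
    }
    return placement;
}
```

Using the block sizes 50, 200, 70, 115 and 15 and the sizes of jobs 1 to 6 from the table (100, 10, 35, 15, 23 and 6), with block indices starting at 0, best-fit places jobs 1 to 5 on blocks 4, 5, 1, 3 and 2 and leaves job 6 waiting. First-fit and worst-fit both leave job 5 waiting and give block 5 to job 6, which agrees with the first cycle described for each method.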