Sándor Juhász
Budapest University of Technology and Economics, Department of Automation and Applied Informatics
1111 Budapest, Goldmann György tér 3., Hungary
Ákos Dudás
Budapest University of Technology and Economics, Department of Automation and Applied Informatics
1111 Budapest, Goldmann György tér 3., Hungary
ABSTRACT
Hash tables can provide fast mapping between keys and values even for voluminous data sets. Our main goal is to find a
suitable implementation with a compact structure and an efficient collision avoidance method. Our attention is focused on
maximizing lookup performance when handling several million data items. This paper suggests a new, memory-consumption-oriented
way of comparing the significantly different approaches and analyses various types of hash table
implementations in order to answer the question of which structure should be used, and how its parameters must be chosen,
to achieve maximal lookup performance with the lowest possible memory consumption.
KEYWORDS
bucket hashing, open hashing, lookup performance, data transformation, cache memory
1. INTRODUCTION
Hashing is a well-known approach for realizing code tables and lookup charts. Hash tables are suitable for
data transformations where it is more efficient to store the output values belonging to repeating input
values than to use a complex or computation-intensive conversion algorithm. Hash tables provide fast
searching capabilities combined with a compact storage structure. The reason for their efficiency lies in the
classification of the items according to a hash function applied to their keys, considerably reducing the
number of items that need to be tested. The result of the hash function becomes the identifier of the group
the item will belong to. Forming these groups helps to minimize the number of items that have to be
compared when searching for a specific key.
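As a minimal illustration of this grouping (the names and structure below are ours, not the paper's implementations), a table can keep one list of items per group and scan only the group a key hashes to:

```cpp
#include <functional>
#include <string>
#include <vector>

// Minimal sketch of hash-based grouping; illustrative only, not taken from
// the paper. Items are classified into groups by a hash of their key, so a
// lookup only scans the single group the key maps to.
struct Item { std::string key; int value; };

struct GroupedTable {
    std::vector<std::vector<Item>> groups;
    explicit GroupedTable(std::size_t group_count) : groups(group_count) {}

    std::size_t group_of(const std::string& key) const {
        return std::hash<std::string>{}(key) % groups.size();
    }
    void insert(const std::string& key, int value) {
        groups[group_of(key)].push_back(Item{key, value});
    }
    const int* find(const std::string& key) const {
        for (const Item& it : groups[group_of(key)])  // only one group scanned
            if (it.key == key) return &it.value;
        return nullptr;
    }
};
```

The fewer items share a group, the shorter the scan inside `find` becomes, which is exactly the effect the paragraph above describes.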
The distribution of the items over the available groups is mainly influenced by factors such as the choice
of the hash function, the number of groups created and the uniformity of the distribution of the hashed values.
The more items are mapped to the same group, the slower the search is, because of the increased number of
items the search has to examine.
The appropriate choice of the hash function and of the number of groups significantly affects the length of the
search path: unbalanced hashing or a low number of groups results in many items being mapped to the same
slot. Besides the reduction of the average search path length, the storage structure itself also has a serious
performance impact. The main factors to be considered here are memory footprint and cache friendliness.
Adding indirections to the storage structure usually decreases memory consumption, but has a negative
effect on lookup speed. Increased locality of the structure allows faster collision handling because of
better cache usage, but maintaining this compactness usually calls for a heavier administrative load.
Our paper proposes and analyses different storage structures, trying to highlight the balance point of the
above-mentioned basic options. A number of recent papers have considered the effect of the modern computer
memory hierarchy as an influencing factor, and examined how caching affects the performance of algorithms
(Heileman and Luo 2005). We continue along this path, proposing different implementation approaches and
tuning the free parameters to further increase cache friendliness. The focus is set on cases where the number of
items to be stored is in the domain of tens of millions or higher, and the number of search queries to complete
can exceed hundreds of millions.
ISBN: 978-972-8924-62-1 © 2008 IADIS
Our main contribution is to introduce a new comparison method for evaluating hash table performance,
and to use this method to test and examine different variations of hash tables. Most papers and studies about
hash tables use either the load factor or the number/size of buckets as the variable (Heileman and Luo 2005; Bell
1970), neglecting memory consumption. Our new method compares the execution times as a function of the
measured or calculated amount of reserved memory, allowing a fairer comparison of the algorithms under
the same memory conditions.
The rest of the paper is organized as follows: Section 2 presents the related literature by giving a short
introduction to hash tables and to the aspects of their performance tuning. Section 3 outlines our hash table
implementations and the calculation of their memory consumption. The experimental tests and results are
shown in Section 4, where the outcome of synthetic tests is used to draw conclusions. We conclude in
Section 5 by summarizing the presented methods and results.
2. RELATED WORKS
Our work was inspired by a real-life data mining project dealing with the processing of web log data (Juhász and
Iváncsy 2007). The web logs in question record the activity of a few million people surfing the web during a
time interval of several months, providing several terabytes of raw data. Because of the huge size, compression
is applied to the raw logs as a preprocessing step, which includes recoding the original verbose
text field formats to 4-byte integers to save space. Hash tables are used to complete this task as they
promise a fast, lookup-table-based transformation.
The history of hash tables dates back to the 1960s, when their ancestor, the key-to-address transformation,
was studied as a means of finding a record in a background file using a key (Lum 1971; Brent 1973). In the meantime,
the capacity improvement of system memories made it increasingly possible to store the data themselves in
main memory as well, but the basic principle remains the same: the possible location of an item must be guessed as
precisely as possible by using its key.
Hash tables have two basic types differing in the collision handling method. The first one is called open
addressing, where the number of slots is fixed, and one slot contains exactly zero or one item. If more than
one item is mapped to the same slot by the hash function, the algorithm finds another free slot inside the table
with the help of a secondary function. This approach allocates a fixed amount of memory, independently of the
number of items it will eventually store. Open hashing is well studied both analytically and in practice.
After studying some of the many suggested collision resolving methods, Heileman and Luo (2005) claimed
that even though data locality and cache usage favored linear probing, it showed no significant
performance gain because of its long probe lengths. As we will show later, this drawback can be overcome by
increasing the number of slots, which efficiently shortens the long probing paths. To examine this aspect, both
double hashing and the so-called quadratic quotient method of Bell (1970) will be tested and compared
to linear probing in Section 4.
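For concreteness, the following sketch shows open addressing with linear probing. It is our own illustrative code, not the paper's ILinProbe implementation, and it assumes non-zero 32-bit keys and a table that never fills up completely:

```cpp
#include <cstdint>
#include <vector>

// Sketch of open addressing with linear probing: V fixed slots, each holding
// zero or one item; on a collision the next slots (h+1, h+2, ... mod V) are
// probed until a free slot or the searched key is found. Illustrative only;
// assumes keys are non-zero and the table is never completely full.
class LinearProbeTable {
    static constexpr uint32_t kEmpty = 0;      // sentinel: keys must be non-zero
    std::vector<uint32_t> keys_;
    std::vector<uint32_t> values_;
public:
    explicit LinearProbeTable(std::size_t slot_count)
        : keys_(slot_count, kEmpty), values_(slot_count, 0) {}

    void insert(uint32_t key, uint32_t value) {
        std::size_t i = key % keys_.size();    // modulo-division hash
        while (keys_[i] != kEmpty && keys_[i] != key)
            i = (i + 1) % keys_.size();        // linear probe step
        keys_[i] = key;
        values_[i] = value;
    }
    const uint32_t* find(uint32_t key) const {
        std::size_t i = key % keys_.size();
        while (keys_[i] != kEmpty) {
            if (keys_[i] == key) return &values_[i];
            i = (i + 1) % keys_.size();
        }
        return nullptr;                        // empty slot reached: not stored
    }
};
```

Double hashing and the quadratic quotient method differ from this only in the probe step: instead of `i + 1`, the increment is derived from a second hash of the key, which breaks up the clusters that linear probing tends to form.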
Bucket hashing, as opposed to the previous case of open addressing, reserves additional external space
(outside the slots of the main table) for the colliding items, and links them to the corresponding original slots.
As several items can be linked to one single slot, the slots are now called buckets. This solution still leaves other
options to consider, which will be detailed in Section 3.
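A bucket hash with external chaining can be sketched as follows (again our own illustration, not the paper's code): each colliding item is stored in a node allocated outside the main table and linked into the chain of the bucket its key hashes to.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Sketch of bucket hashing with external chaining: the main table holds one
// chain head per bucket; colliding items live in externally allocated nodes.
// Illustrative only; a production table would use a custom allocator.
struct Node {
    uint32_t key;
    uint32_t value;
    std::unique_ptr<Node> next;
};

class BucketTable {
    std::vector<std::unique_ptr<Node>> buckets_;
public:
    explicit BucketTable(std::size_t bucket_count) : buckets_(bucket_count) {}

    void insert(uint32_t key, uint32_t value) {
        std::size_t b = key % buckets_.size();
        buckets_[b] = std::unique_ptr<Node>(
            new Node{key, value, std::move(buckets_[b])});  // prepend to chain
    }
    const uint32_t* find(uint32_t key) const {
        for (const Node* n = buckets_[key % buckets_.size()].get();
             n != nullptr; n = n->next.get())
            if (n->key == key) return &n->value;
        return nullptr;
    }
};
```

Unlike open addressing, the memory grows with the number of items actually inserted, but every chain step is an extra pointer indirection, which is exactly the cache-friendliness trade-off discussed above.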
As our original task consists of speeding up the transformation process, performance becomes our highest
priority. Open hashing is presumed to be the faster solution; although when studying bucket hashing, Lum et
al. (1971) found that with careful design (setting the number of buckets in the order of magnitude of the
number of items) bucket hashing can provide competitive performance. As shown in Section 4, according to our
measurements this claim is tenable in some cases, but not as a general statement.
IADIS International Conference Informatics 2008
To understand the basic dimensions of parameter tuning of hash tables, the relevant aspects influencing
the performance must be selected. The execution time is mostly determined by the number of steps the search
algorithm takes to find an item, which is directly related to the number of items sharing the same hash value.
To reduce the number of colliding items, the hash function is the first place to optimize. Hash functions have
to be fast, and should produce values with a uniform distribution. Lum et al. (1971) analyzed various
commonly used hash functions, such as modulo division, the mid-square method and algebraic coding, and found
that modulo division is one of the best solutions. That is why this hash function was chosen for our studies.
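Modulo division hashing can be sketched as follows; for the multi-byte keys of our tests the key must first be folded into an integer (the polynomial folding below is only an illustrative choice, not the paper's exact scheme):

```cpp
#include <cstdint>
#include <string>

// Fold a multi-byte key into a single integer. The polynomial fold with
// base 257 is an illustrative assumption, not the paper's scheme.
uint64_t fold_key(const std::string& key) {
    uint64_t h = 0;
    for (unsigned char c : key)
        h = h * 257 + c;
    return h;
}

// Modulo division hash: the folded key reduced modulo the table size.
// A prime table size tends to spread the remainders more uniformly.
std::size_t slot_of(const std::string& key, std::size_t table_size) {
    return fold_key(key) % table_size;
}
```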
The second parameter having a considerable performance effect is the number of slots. Increasing the
number of slots causes fewer items to be mapped to the same group; however, the memory requirements
grow as well. Probability theory can prove helpful when seeking the optimal size of the buckets.
When the distribution of hashed keys is uniform, the sizes of the buckets follow a binomial distribution, thus
the final bucket sizes can be predicted with a particular certainty. Mitzenmacher (2002) gives a formula for
calculating the probability of a bucket ending up with k items; however, that applies only to cases where the
number of buckets is less than the number of items. In this paper, instead of emphasizing the probabilities of
the individual bucket sizes, the expected bucket size is considered.
The third aspect strongly influencing the performance, only marginally referred to before, is cache friendliness. It
is known that the way algorithms utilize memory can seriously impact their performance (Pfister 1998). As a
consequence, not only the number of steps (the number of executed CPU instructions) an algorithm takes is
important, but also the memory access pattern and the nature of the memory accesses (Wulf and McKee 1995;
Heileman and Luo 2005; Anderson 2006). Utilization of the cache highly determines the performance,
requiring the algorithms and storage structures to be designed accordingly. This aspect acts as the source of
distinction between the several implementations of hash tables suggested in Section 3.
With N items distributed uniformly over V buckets, the probability of a given bucket i remaining empty is

P(v_i = 0) = \left(1 - \frac{1}{V}\right)^{N}. \qquad (3.1)

The expected value of the number of empty buckets is:

u = E\!\left(\sum_{i=1}^{V} \mathbf{1}_{\{v_i = 0\}}\right) = V\left(1 - \frac{1}{V}\right)^{N} = V\left(\left(1 - \frac{1}{V}\right)^{V}\right)^{N/V} \approx V\left(\frac{1}{e}\right)^{N/V} = V e^{-N/V}. \qquad (3.2)
As s is the expected value of the number of items in the non-empty buckets, we know that

E(v_i) = E(v_i \mid v_i \ge 1) \cdot P(v_i \ge 1) + 0 \cdot P(v_i = 0) = s \, P(v_i \ge 1). \qquad (3.3)

To calculate E(v_i), we use

\sum_{i=1}^{V} E(v_i) = E\!\left(\sum_{i=1}^{V} v_i\right) = E(N) = N = V \, E(v_i). \qquad (3.4)

Combining the results of (3.3) and (3.4) provides

s = E(v_i \mid v_i \ge 1) = \frac{E(v_i)}{P(v_i \ge 1)} = \frac{N/V}{1 - P(v_i = 0)} = \frac{N/V}{1 - \left(1 - \frac{1}{V}\right)^{N}}. \qquad (3.5)
The above formulas for the number u of empty buckets and the average size s of the non-empty buckets will
be used for calculating the expected memory cost in the next subsection.
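As a quick numeric illustration of (3.2) and (3.5), consider N = 10 million items in V = 20 million buckets (N/V = 0.5): the formulas predict u ≈ 0.607·V empty buckets and an average non-empty bucket size of s ≈ 1.27.

```cpp
#include <cmath>

// Expected number of empty buckets, formula (3.2) in its exponential form.
double expected_empty(double N, double V) {
    return V * std::exp(-N / V);
}

// Expected size of the non-empty buckets, formula (3.5).
double expected_size(double N, double V) {
    return (N / V) / (1.0 - std::pow(1.0 - 1.0 / V, N));
}
```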
[Figure: storage structures of the hash table implementations. Left (ks + ds bytes per entry): the open-addressing tables ILinProbe, IDoubleHash and IQuadQuot, and the chained IList, whose bucket slots point to key | ptr | value nodes. Right (ks + ds + 4 bytes): a bucket layout whose slots point to length-prefixed arrays storing the keys and values contiguously (length | key ... | value ...).]
4. OPTIMIZING PERFORMANCE
The nature of the project described in Section 2 determines the most decisive use case of the hash tables,
which is efficient searching for known elements. Based on this fact, our attention is focused on searching for
known items, ignoring the behavior in the case of unsuccessful lookups and the time taken by inserting the items.
The hash tables were implemented in C++ compiled with the Microsoft Visual Studio 2008 compiler. All
implementations were optimized using a custom memory manager and the best techniques known to us to
reduce the number of instructions. To measure the performance, 10 million random 20-byte keys were
generated, and a 4-byte value was assigned to each of them. After inserting the
above items into the table, 200 million random lookup operations were performed on the previously
inserted items. The items themselves and the order in which the items are searched for are pseudo-random
values generated with a Mersenne Twister (Matsumoto and Nishimura 1998). The tests were carried out on an
Intel Pentium 4 processor @ 3.2 GHz with a 2 MB L2 cache and 4 GB of memory, running Windows Server 2003 R2.
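At a much smaller scale, the measurement loop can be sketched like this. Here std::unordered_map stands in for our hand-tuned tables and 32-bit keys replace the 20-byte ones; only the structure of the benchmark matters:

```cpp
#include <chrono>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

// Sketch of the measurement loop, at reduced scale (the paper inserts
// 10 million keys and performs 200 million lookups). Keys and the lookup
// order come from std::mt19937, a Mersenne Twister generator.
double measure_lookup_seconds(std::size_t n_keys, std::size_t n_lookups) {
    std::mt19937 gen(42);                          // fixed seed for repeatability
    std::vector<uint32_t> keys(n_keys);
    std::unordered_map<uint32_t, uint32_t> table;
    for (std::size_t i = 0; i < n_keys; ++i) {
        keys[i] = gen();
        table[keys[i]] = static_cast<uint32_t>(i); // 4-byte value per key
    }
    std::uniform_int_distribution<std::size_t> pick(0, n_keys - 1);
    auto start = std::chrono::steady_clock::now();
    uint64_t checksum = 0;
    for (std::size_t q = 0; q < n_lookups; ++q)
        checksum += table.at(keys[pick(gen)]);     // successful lookups only
    auto stop = std::chrono::steady_clock::now();
    (void)checksum;                                // keep the loop live
    return std::chrono::duration<double>(stop - start).count();
}
```

Plotting this time against the measured reserved memory of each table variant, rather than against the load factor, is the comparison method proposed in the introduction.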
[Figures 5 and 6: performance plots; axis labels: number of buckets [million], reserved memory [MB].]
Figure 5. Focusing on the interesting region of bucket numbers. Figure 6. Testing the robustness of the hash tables.
4.2 Robustness
So far the figures have only shown how the different variations perform when the number of items is fixed, so that
the bucket number can be chosen accordingly. The number of items, however, may not always be known a
priori. A good hash table is robust, which means it should handle an increase in the number of items with a
graceful degradation of lookup time; otherwise frequent reallocations would become too expensive.
The robustness test is shown in Figure 6. The number of items the hash tables are designed for is 2.5
million, thus the number of buckets was chosen to be 5 million. The number of items is varied from 1
million (40% of the expected amount) to 20 million (8 times as much as expected).
The original ILinProbe reaches its maximum at 5 million items; at this point its execution time becomes
very high compared to the others. IList, on the other hand, behaves very well: its execution time increases
linearly with the number of items, which is an acceptable behavior. One interesting case is PArray, which
gains an advantage compared to the others as the number of items per bucket increases.
5. CONCLUSION
Various hash tables and implementation considerations were presented and analyzed in this paper. Our
intention was to provide an overview of the different structures and to present ways towards the optimization
of such lookup tables. When analyzing storage structures we came to the conclusion that the traditional open
addressing is a good choice, regardless of the fact that empty slots seemingly waste memory. When using
pointers to decrease the cost of empty buckets, the extra indirection causes a performance slowdown that
has to be compensated by a higher number of buckets, which ends up consuming even more memory than
the original open hash approach. We also presented a hybrid open-bucket hash, IList, which uses chaining as
the collision resolving method, allowing the hash table to be tolerant towards changes in the number of items
without the frequent need to resize the whole table, at the cost of a slight increase in memory.
Version ILinProbe (open addressing with linear probing) turned out to be the fastest, although linear
probing is presumed to be the best choice only with a near-uniform distribution of hash values. In other cases
clustering of the items may increase the probe path length, weakening the benefits of cache friendliness.
To overcome the above-mentioned problem and the robustness problem we suggest the use of an IList type hash
table, which is able to cope with a wider range of circumstances. The best performance is obtained when the
number of slots (buckets) is twice the number of items, but choosing the number of slots to be equal to
the number of items provides nearly the same performance (10% lower) for 55% of the memory cost.
We may also note that the 30-70% saturation level said to be optimal for open addressing was justified,
although according to our results reducing the saturation below 50% does not increase the performance, and
going below 30% can indeed cause a performance loss for large data sets.
ACKNOWLEDGEMENT
This work was supported by the Mobile Innovation Center, Hungary. Their help is kindly acknowledged.
REFERENCES
Anderson, B., 2006. Processor Cache 101: How Cache Works. AMD Developer Central, [Online]
http://developer.amd.com/article_print.jsp?id=84.
Bell, James R., 1970. The quadratic quotient method: A hash code eliminating secondary clustering. Communications of
the ACM, Vol. 13, No. 2, pp 107-109.
Brent, R. P., 1973. Reducing the retrieval time of scatter storage techniques. Communications of the ACM, Vol. 16, No.
2, pp 105-109.
Heileman, G. L. and Luo, W., 2005. How caching affects hashing. In Proceedings of the 7th Workshop on Algorithm
Engineering and Experiments, Vancouver, Canada, pp 141-154.
Intel Corporation, 2007. Intel® 64 and IA-32 Architectures Optimization Reference Manual. Vol. 1, Rev. 2.0, [Online]
http://developer.intel.com/design/processor/manuals/248966.pdf.
Juhász, S. and Iváncsy, R., 2007. Tracking Activity of Real Individuals in Web Logs. International Journal of Computer
Science, Vol. 2, No. 3, pp 172-177.
Lum, V. Y. et al., 1971. Key-to-address transform techniques: A fundamental performance study on large existing
formatted files. Communications of the ACM, Vol. 14, No. 4, pp 228-239.
Mitzenmacher, M., 2002. Good Hash Tables & Multiple Hash Functions. Dr. Dobbs Journal, No. 336, pp 28-32.
Munro, J. I. and Celis, P., 1986. Techniques for collision resolution in hash tables with open addressing. In Proceedings
of 1986 ACM Fall Joint Computer Conference, Dallas, United States, pp. 601-610.
Matsumoto, M. and Nishimura, T., 1998. Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random
number generator. ACM Transactions on Modeling and Computer Simulation, Vol. 8, Issue 1, pp 3-30.
Pagh, R. et al., 2007. Linear probing with constant independence. In Proceedings of the Thirty-Ninth Annual ACM
Symposium on theory of Computing, San Diego, United States, pp 318-327.
Pfister, G. F., 1998. In Search of Clusters (Second Edition). Prentice Hall, Upper Saddle River, New Jersey, USA.
Wulf, W. A. and McKee, S. A., 1995. Hitting the Memory Wall: Implications of the Obvious. Computer Architecture
News, Vol. 23, pp 20-24.