Вы находитесь на странице: 1из 16

Retrieval and Perfect Hashing

using Fingerprinting
Ingo Mller12 , Peter Sanders1 , Robert Schulze2 , and Wei Zhou12
Department of Informatics of KIT1 and HANA Core of SAP AG2

Sanders,
Schulze, Zhou:
KIT Mller,
University of
the State of Baden-Wuerttemberg
and
National
Laboratoryand
of the Perfect
Helmholtz Association
Retrieval
Hashing using

Fingerprinting

Institute for Theoretical Informatics


www.kit.edu
Algorithms II

The Retrieval Problem


Perfect Hash Function (PHF)
Map each key s P S to unique integer i P ID
Retrieval data structure
Associate value v P V to each s P S
Classical implementation: hash table

S V q t . . .
flag noun, verb,
flare n, v,
flask n,
flash n, v, . . .

Store key/ID or key/value pairs

Optimization: do not store S


Undefined behavior for s S

Vq

optimization

Applications
Look-up in dictionaries of in-memory DBMSs (like the SAP HANA database [1])
Many more. . . (see Botelho et al. [2])
1

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Known Methods
Perfect Hash Functions
Practical implementations exist: BPZ [3], CHD [4], etc.
Store only constant, sometimes optimal number of extra bits
Retrieval: use a PHF to index an array of values
Direct Retrieval Data Structures
CHM [5], etc.

Construction

Dynamic operations
Cache misses per query

Prior work

Our solution

complicated
inherently sequential
slow
no (rebuild)
2

simpler
easily parallelizable
faster
yes
1

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Fingerprint Retrieval (FiRe)


Overview
Level 1

Level 2

Level L

Level L 1

..
.

PHFbased
retrieval
data
structure

..
.

n
buckets
b

A bucket
..
.

fingerprints v1

v2

va

a cells

bucket hash1 keyq t1, . . . , m, fingerprint hash2 keyq t1, . . . , k


Recursively overflow to next level on fingerprint collision/full bucket
Fingerprints implemented as bit vector for simplicity and speed

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Fingerprint Retrieval (FiRe)


Asymptotics

L levels

m buckets

fingerprints

PHF
fallback

a cells

n elements
m nb buckets
a cells per bucket
r-bit values

Expected linear construction time


L FiRe levels, Onq for each level
Even for L : geometric series, as only a constant fraction of the
elements overflow

Constant worst-case query time, since L is constant

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Fingerprint Retrieval (FiRe)


Formulae
L levels

m buckets

PHF
fallback

fingerprints

n elements
m nb buckets
a cells per bucket
r-bit values

a cells

Let a1 be the expected number of elements in a bucket


Space overhead per element s
Cache misses per query l

qsizefingerprintsq

r a a1

a1

bits

b
a1

Calculation of a1 : see our paper

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Fingerprint Retrieval (FiRe)


Space/Time Trade-Off

How to choose parameters?


a and k such that a bucket
fits into a cache line
Depending on desired
trade-off

s -- space overhead [bits]

Space overhead s and expected number of cache misses l depend on


a: #cells per bucket
n
b: average #elements per bucket ( m
)
k: #possible fingerprint values
r = 32 bits
r: size of each value
40

opt. encoding a=14,


bit vector a=15,
bit vector a=14,
bit vector a=13,
bit vector a=12,

35
30
25
20
15
10
5
0
1

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

k=149
k= 32
k= 64
k= 96
k=128

1.5

2
2.5
l -- cache misses

Institute for Theoretical Informatics


Algorithms II

Fingerprint Retrieval (FiRe)


Dynamization

updates

queries

Updates and deletions


(easy)
Update associated
value in place
Ignore deletions

static part

S Vq t . . .
flag noun, verb,
flare n, v,
flask n,
flash n, v, . . .
dynamic part

Insertions
Needs a dynamic part with keys + some book-keeping information
Answer queries with the static part (FiRe)
Idea:
Overflow new and old element if fingerprints collide
Block fingerprint for future inserts
Rebuild when some stability criterion is violated

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Fingerprint Perfect Hashing (FiPHa)


L levels

m buckets

fingerprints

PHF
fallback

a cells

Fingerprint-Based Perfect Hashing (FiPHa)


Special case with large buckets of empty values
Associated ID is calculated as rankbucket vq a rankfingerprint vq
Very space efficient (2.79 bits overhead with 2.78 cache misses)
8

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Experimental Results
Settings

Configurations of a, b, k such that


l 1.05 cache misses: FiRe5
l 1.25 cache misses: FiRe25
l 1.50 cache misses: FiRe50
Retrieval data structure with FiPHa as PHF (3.78 cache misses)
Base lines
BPZ, CHD-0.5/0.99 from the C Minimal Perfect Hashing Library [6]
CHM-2/3 from our implementation
Datasets
Keys: 100 million unique random 32-bit integers
Values: integers of size r 8 bits
9

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Experimental Results
Build Times
r 8 bits
build time [ns / element]

3000
2500
2000
1500
1000
500
0
.
-0

-3

99

.
-0

Z
BP

-2

PH
Fi

e5

e2

Fi

R
Fi

e5

Fi

417 times faster sequential construction for FiRe


FiPHa slower, but faster than competitors
10

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Experimental Results
Space Overhead

space overhead [bits / element]

r 8 bits
35
30
25
20
15
10
5
0
.
-0

.
-0

BP
-3

-2

PH

Fi

e5

Fi

e2

Fi

e5

Fi

99

FiRe50 has comparable overhead to most competitors


FiPHa almost on par with best competitor (CHD-0.99)
11

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Experimental Results
Query Times
r 8 bits
query time [ns / element]

400
350
300
250
200
150
100
50
0
C

BP

.
-0

.
-0

-3

-2

C
M

PH

Fi

e5

R
Fi

e2

Fi

e5

R
Fi

99

FiRe has the best query times due to low number of cache misses
FiPHa comparable to CHD-0.99, but has much faster construction
12

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Summary and Future Work


Fingerprint Retrieval (FiRe) and Perfect Hashing (FiPHa)
Simple concept, easy implementation
Fast evaluation due to low number of cache-misses
Extremely fast construction, even with sequential implementation
Small space overhead
Highly configurable trade-off
Support for updates, insertions, and deletions
Future Work
Find more compact, yet practical representation of fingerprints
Adapt idea of cuckoo-hashing to fingerprinting
Improve trade-off with different settings for each level
Adapt fingerprinting idea to other data structures
13

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Thank you

14

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

References I

[1] F. Frber et al., SAP HANA Database: Data management for modern business applications, SIGMOD
Rec., vol. 40, no. 4, pp. 4551, 2012. [Online]. Available: http://doi.acm.org/10.1145/2094114.2094126
[2] F. C. Botelho and N. Ziviani, External perfect hashing for very large key sets, in Proceedings of the
sixteenth ACM conference on Conference on information and knowledge management. ACM, 2007, pp.
653662.
[3] F. C. Botelho, R. Pagh, and N. Ziviani, Simple and space-efficient minimal perfect hash functions, in
Algorithms and Data Structures. Springer, 2007, pp. 139150.
[4] D. Belazzougui, F. C. Botelho, and M. Dietzfelbinger, Hash, displace, and compress, in Algorithms-ESA
2009. Springer, 2009, pp. 682693.
[5] Z. J. Czech, G. Havas, and B. S. Majewski, An optimal algorithm for generating minimal perfect hash
functions, Information Processing Letters, vol. 43, no. 5, pp. 257264, 1992.
[6] D. de Castro Reis, D. Belazzougui, F. C. Botelho, and N. Ziviani, CMPH C Minimal Perfect Hashing
Library. [Online]. Available: http://cmph.sourceforge.net

15

Mller, Sanders, Schulze, Zhou:


Retrieval and Perfect Hashing using Fingerprinting

Institute for Theoretical Informatics


Algorithms II

Вам также может понравиться