Goal
The goal of today's lecture is to introduce the concept of a hash index. Hashing is an alternative to tree structures that allows near-constant access time to ANY record in a very large database.
Presentation Outline
Introduction
A hash function maps key values to positions; a hash table is an array that holds the records.
Searching in a hash table can be done in O(1) time regardless of the hash table size.
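As a minimal sketch of these two ingredients (illustrative Python; the table size and the modulo hash function are assumptions, not from the slides):

TABLE_SIZE = 11                      # the array that holds the records
table = [None] * TABLE_SIZE

def h(key: int) -> int:
    # Hash function: maps a key value to a position in the array.
    return key % TABLE_SIZE

def insert(key, record):
    table[h(key)] = record           # ignores collisions; see later slides

def search(key):
    return table[h(key)]             # O(1): one computation, one array access

insert(62538194, "some record")
print(search(62538194))              # -> "some record"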
Introduction to Hashing
Cryptography was once known only to key people in the National Security Agency and a few academics. Until 1996, it was illegal to export strong cryptography from the United States. Fast forward to 2006, and the Payment Card Industry Data Security Standard (PCI DSS) requires merchants to encrypt cardholder information. Visa and MasterCard can levy fines of up to $500,000 for not complying! Among the recommended methods are:
- strong one-way hash functions (hashed indexes)
- truncation
- index tokens and pads (pads must be securely stored)
- strong cryptography
[Hashing for fun and profit: Demystifying encryption for PCI DSS, Roger Nebel]
The Rivest, Shamir, and Adleman (RSA) public key algorithm is used for the TLS key exchange and authentication, and the Secure Hashing Algorithm 1 (SHA-1) for the key exchange and hashing.
[System cryptography: Use FIPS compliant algorithms for encryption, hashing, and signing, Microsoft TechNews, 2005]
We design a perfect hash function to losslessly pack sparse data while retaining efficient random access.
[Figure: sparse domain data packed into a compact hash table, addressed through a much smaller offset table, with 2D and 3D examples]
Hash function, simply: h(p) = p + Φ[p mod r] (modulo the table sizes), where Φ is the offset table and r is its size.
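A minimal 1D sketch of this lookup, assuming the reconstructed form h(p) = (p + Φ[p mod r]) mod m (illustrative only: the real scheme works on 2D/3D grids, and the tiny offset table here is assumed to have been precomputed so that h is collision-free on the defined entries):

m, r = 7, 3                          # hash table size and offset table size
offset = [0, 1, 4]                   # assumed precomputed offset table (Phi)
sparse_data = {0: "a", 1: "b", 3: "c", 8: "d"}   # sparse defined entries

def h(p: int) -> int:
    # one lookup into the small offset table, then one add and one modulo
    return (p + offset[p % r]) % m

hash_table = [None] * m
for p, value in sparse_data.items():
    assert hash_table[h(p)] is None  # perfect: no collisions on defined data
    hash_table[h(p)] = value

print(sorted(h(p) for p in sparse_data))   # four distinct slots: [0, 2, 3, 5]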
Applications
[Figure: 2D and 3D domains, with sparse defined points p in the domain mapped via s = h(p) through a small offset table into a compact hash table H]
- 3D painting: 2048³, 56 MB, 200 fps
- Sprite maps: +900 KB, 200 fps
- Alpha compression: 0.9 bits/pixel, 800 fps
- Simulation: 256³, 100 fps
- Collision detection: 1024³, 12 MB, 140 fps
Properties:
- Perfect hash on multidimensional data: no collisions, ideal for the GPU
- Single lookup into a small offset table
- Offsets only ~4 bits per defined data entry
- Access costs only ~4 instructions on the GPU
- Optimized spatial coherence
Hashing is a stronger tool for database and password protection. http://msdn.microsoft.com/msdnmag/issues/03/08/SecurityBriefs/ [Security Briefs, MSDN Magazine]
For this, we take the given key and produce a hash location by using portions of the key (truncating the key).
Example: If a hash table can hold 1000 entries and an 8-digit number is used as the key, the 3rd, 5th and 7th digits counting from the left could be used to produce the index. E.g., for key 62538194 the hash location is 589.
Advantage: simple and easy to implement. Problems: clustering and repetition.
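A sketch of this digit-extraction scheme (illustrative Python):

def digit_extraction_hash(key: int) -> int:
    # Use the 3rd, 5th and 7th digits (from the left) of an 8-digit key
    # as a 3-digit index in the range 0..999.
    digits = f"{key:08d}"                  # e.g. 62538194 -> "62538194"
    return int(digits[2] + digits[4] + digits[6])

print(digit_extraction_hash(62538194))     # -> 589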
Probing: if the table position given by the hashed key is already occupied, increase the position by some amount until an empty position is found. Two main open addressing collision resolution techniques:
- Linear probing: increase the position by 1 each time [mod table size!]
- Quadratic probing: to the original position, add 1, 4, 9, 16, ...; in some cases a key-dependent increment technique is used instead.
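A minimal sketch of both probe sequences (illustrative Python; the table size and the base hash key % TABLE_SIZE are assumptions):

TABLE_SIZE = 11
table = [None] * TABLE_SIZE

def insert_linear(key: int) -> int:
    # Linear probing: try h(key), h(key)+1, h(key)+2, ... [mod table size!]
    pos = key % TABLE_SIZE
    for step in range(TABLE_SIZE):
        probe = (pos + step) % TABLE_SIZE
        if table[probe] is None:
            table[probe] = key
            return probe
    raise RuntimeError("no empty position found")

def insert_quadratic(key: int) -> int:
    # Quadratic probing: to the original position add 1, 4, 9, 16, ...
    pos = key % TABLE_SIZE
    for i in range(TABLE_SIZE):
        probe = (pos + i * i) % TABLE_SIZE
        if table[probe] is None:
            table[probe] = key
            return probe
    raise RuntimeError("no empty position found")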
Collision Resolution
Problem: Clustering occurs; that is, the used spaces tend to appear in groups which tend to grow, increasing the search time needed to reach an open space.
Problem: Overflow may occur even when there is still space in the hash table.
Key-dependent increments are determined by using the key to calculate a new value, which is then used as the increment that determines successive probes.
Collision Resolution
new position = current position + (key DIV 11) MOD 11
Example: before key-dependent increments: [table omitted]
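A sketch of probing with this key-dependent increment (illustrative Python; the initial position key MOD 11 is an assumption, and the guard against a 0 increment anticipates the note below):

TABLE_SIZE = 11
table = [None] * TABLE_SIZE

def insert_key_dependent(key: int) -> int:
    # Increment derived from the key: (key DIV 11) MOD 11.
    increment = (key // TABLE_SIZE) % TABLE_SIZE
    if increment == 0:               # an increment of 0 must never arise
        increment = 1
    pos = key % TABLE_SIZE           # assumed initial position: key MOD 11
    for _ in range(TABLE_SIZE):
        if table[pos] is None:
            table[pos] = key
            return pos
        pos = (pos + increment) % TABLE_SIZE
    raise RuntimeError("no empty position found")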
In all of the closed hash functions it is important to ensure that an increment of 0 does not arise.
If the increment equals the hash table size, the same position will be probed every time, so this value cannot be used. If we ensure that the hash size is prime, that the divisors for the open and closed hash are prime, and that the rehash function does not produce a 0 increment, then this method will usually access all positions, as the linear probe does. Using a key-dependent method usually reduces clustering, so searches for an empty position should not be as long as for the linear method.
After chaining: [figure showing the table with a linked list of entries attached to each hash address]
In analyzing search efficiency, the average is usually used. Searching with hash tables is highly dependent on how full the table is, since as the table approaches a full state, more rehashes are necessary. The proportion of the table that is full is called the Load Factor.
- When collisions are resolved using open addressing, the maximum load factor is 1.
- Using chaining, however, the load factor can exceed 1, since when the table is full the linked list attached to a hash address can hold more than one element.
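A minimal chaining sketch making the load factor explicit (illustrative Python; the table size is an assumption):

TABLE_SIZE = 11
buckets = [[] for _ in range(TABLE_SIZE)]    # a chain (list) per hash address

def insert(key: int) -> None:
    buckets[key % TABLE_SIZE].append(key)    # never 'full': the chain grows

def search(key: int) -> bool:
    return key in buckets[key % TABLE_SIZE]  # walk one chain only

def load_factor() -> float:
    # Proportion of the table that is full; with chaining it can exceed 1.
    return sum(len(chain) for chain in buckets) / TABLE_SIZE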
Also, hashing is very slow for any operation that requires the entries to be sorted (e.g., a query to find the minimum key).
Perfect Hashing
A perfect hashing function maps a key into a unique address. If the range of potential addresses equals the number of keys, the function is a minimal (in space) perfect hashing function. What makes perfect hashing distinctive is that it maps a key space to unique addresses in a smaller address space, that is, hash(key) → unique address.
Not only does a perfect hashing function improve retrieval performance, but a minimal perfect hashing function also provides 100 percent storage utilization.
Perfect Hashing
Process of creating a perfect hash function: a general form of a perfect hashing function is
h(key) = h0(key) + g(h1(key)) + g(h2(key))
Cichelli's Algorithm
In Cichelli's algorithm:
h0 = length (key)
h1 = first_character (key)
h2 = last_character (key)
and
g = T (x)
where T is the table of values associated with the individual characters x that may appear in a key. The time-consuming part of Cichelli's algorithm is determining T.
Cichelli's Algorithm
Table 1: Values associated with the characters of the Pascal reserved words
When we apply Cichelli's perfect hashing function to the keyword 'begin' using Table 1, we get h(begin) = length('begin') + g('b') + g('n') = 33.
The keyword 'begin' would therefore be stored in location 33. Since the hash values run from 2 through 37 for this set of data, the hash function is a minimal perfect hashing function.
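A sketch of the computation (illustrative Python; the two character values are assumptions, chosen to be consistent with Table 1 and the result 33):

T = {"b": 15, "n": 13}    # assumed entries of Table 1; only these are needed

def cichelli_hash(key: str) -> int:
    # Cichelli: key length plus the values of the first and last characters.
    return len(key) + T[key[0]] + T[key[-1]]

print(cichelli_hash("begin"))   # 5 + 15 + 13 = 33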
A hash-based index maps a search key value (of a field) into a record or bucket of records. As for any index, there are 3 alternatives for data entries k*:
- the data record itself, with key value k
- <k, rid of data record with search key value k>
- <k, list of rids of data records with search key k>
Hash-based indexes are best for equality searches; they cannot support range searches. Static and dynamic hashing techniques exist; the trade-offs are similar to ISAM vs. B+ trees.
Static Hashing
# primary pages fixed, allocated sequentially, never de-allocated; overflow pages allowed if needed. h(k) mod M = bucket to which the data entry with key k belongs (M = # of buckets).
[Figure: key → h → h(key) mod N selects one of the primary bucket pages 0 .. N-1, each with a chain of overflow pages]
h(key) = (a * key + b) mod M usually works well. a and b are constants; lots known about how to tune h.
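A sketch of this bucket assignment (illustrative Python; the values of a, b and M are arbitrary):

M = 8                                  # number of buckets (primary pages)
a, b = 31, 7                           # assumed constants for h
buckets = [[] for _ in range(M)]       # primary pages; overflow not modeled

def bucket_of(key: int) -> int:
    # h(key) = (a * key + b) mod M picks the bucket for the data entry.
    return (a * key + b) % M

for k in (5, 12, 20, 33):
    buckets[bucket_of(k)].append(k)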
Long overflow chains can develop and degrade performance.
Rule of thumb:
- Try to keep space utilization between 50% and 80%: if < 50%, we are wasting space; if > 80%, overflows become significant.
- Depends on how good the hash function is and on the # of keys/bucket.
Extendible Hashing (Fagin et al. 1979)
Expandable Hashing (Knott 1971)
Dynamic Hashing (Larson 1978)
Extendible Hashing
Assume that a hashing technique is applied to a dynamically changing file composed of buckets, and each bucket can hold only a fixed number of items. Extendible hashing accesses the data stored in buckets indirectly through an index that is dynamically adjusted to reflect changes in the file. The characteristic feature of extendible hashing is the organization of the index, which is an expandable table.
Extendible Hashing
A hash function applied to a certain key indicates a position in the index, not in the file (or table of keys). Values returned by such a hash function are called pseudokeys. The database/file requires no reorganization when data are added or deleted, since these changes are reflected in the index. Only one hash function h is used, but depending on the size of the index, only a portion of the address h(K) is utilized. A simple way to achieve this is to view the address as a string of bits of which only the i leftmost bits are used.
The number i is the depth of the directory. In Figure 1(a) (on the next slide), the depth is equal to two.
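A sketch of extracting a directory position from a pseudokey (illustrative Python; the 32-bit width is an assumption):

def directory_position(pseudokey: int, i: int, bits: int = 32) -> int:
    # Use only the i leftmost bits of the pseudokey h(K);
    # i is the depth of the directory.
    return pseudokey >> (bits - i)

# With depth i = 2, a pseudokey whose leading bits are 10... lands in
# directory entry 0b10 = 2.
print(directory_position(0b10110000000000000000000000000000, 2))   # -> 2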
Example
Reading and writing all pages is expensive! Idea: use a directory of pointers to buckets; double the # of buckets by doubling the directory, splitting just the bucket that overflowed! The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split, and no overflow page is needed. The trick lies in how the hash function is adjusted!
Example
The directory is an array of size 4, with entries 00, 01, 10, 11. To find the bucket for r, take the last `global depth' # of bits of h(r); we denote r by h(r). If h(r) = 5 = binary 101, it is in the bucket pointed to by 01.
[Figure: DIRECTORY entries 00..11 pointing to DATA PAGES, Buckets A through D; e.g. 10* sits in Bucket C]
Insert: If bucket is full, split it (allocate new page, re-distribute). If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)
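A runnable sketch of this insert logic (illustrative Python: bucket capacity 4 as in the example, least-significant-bit directory indexing; duplicates and overflow pages are not handled):

class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.entries = []

class ExtendibleHash:
    CAPACITY = 4                               # entries per bucket (page)

    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def insert(self, h: int) -> None:          # h = h(r), the hashed key
        bucket = self.directory[h & ((1 << self.global_depth) - 1)]
        if len(bucket.entries) < self.CAPACITY:
            bucket.entries.append(h)
            return
        if bucket.local_depth == self.global_depth:
            self.directory += self.directory   # doubling via copying
            self.global_depth += 1
        self._split(bucket)
        self.insert(h)                         # retry; may split again

    def _split(self, bucket: "Bucket") -> None:
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)     # the `split image'
        mask = 1 << (bucket.local_depth - 1)   # the newly examined bit
        old = bucket.entries
        bucket.entries = [e for e in old if not e & mask]
        image.entries = [e for e in old if e & mask]
        # fix the directory pointers whose index has the new bit set
        for i, b in enumerate(self.directory):
            if b is bucket and i & mask:
                self.directory[i] = image

Inserting 4, 12, 32, 16 and then 20 reproduces the slide example: the directory doubles and Bucket A2 ends up holding 4*, 12*, 20*. Note the doubling step is literally a copy of the directory, which is the least-significant-bit trick discussed below.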
[Figure: after inserting 20*, the directory has doubled to entries 000..111; Bucket A2 (4*, 12*, 20*), with local depth 2, is the `split image' of Bucket A; Bucket C still holds 10*, and another bucket holds 15*, 7*, 19*]
Points to Note
20 = binary 10100. The last 2 bits (00) tell us r belongs in A or A2; the last 3 bits are needed to tell which.
Global depth of directory: Max # of bits needed to tell which bucket an entry belongs to. Local depth of a bucket: # of bits used to determine if an entry belongs to this bucket.
When does a bucket split cause directory doubling? Before the insert, local depth of the bucket = global depth. The insert causes the local depth to become > global depth; the directory is doubled by copying it over and `fixing' the pointers to split image pages. (Use of least significant bits enables efficient doubling via copying of the directory!)
Directory Doubling
Why use least significant bits in the directory? It allows doubling via copying! For example, 6 = 110.
[Figure: doubling a 2-bit directory (00..11) to a 3-bit directory (000..111) with entry 6*: with least significant bits the new directory is the old one copied, while with most significant bits the entries must be interleaved]
If the directory fits in memory, an equality search is answered with one disk access; else two. A 100MB file with 100-byte records and 4K pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that the directory will fit in memory. The directory grows in spurts and, if the distribution of hash values is skewed, it can grow large. Multiple entries with the same hash value cause problems!
Delete: if removing a data entry makes a bucket empty, the bucket can be merged with its `split image'. If each directory element points to the same bucket as its split image, we can halve the directory.
Hybrid methods
Expandable Hashing: a similar idea to extendible hashing, but a binary tree is used to store an index on the buckets.
Dynamic Hashing: multiple binary trees are used. Outcome:
- shorter searches;
- based on the key, we select which tree to search.
Linear Hashing
This is another dynamic hashing scheme, an alternative to Extendible Hashing. LH handles the problem of long overflow chains without using a directory, and handles duplicates. Idea: use a family of hash functions h0, h1, h2, ...
- h_i(key) = h(key) mod (2^i * N), where N = the initial # of buckets and h is some hash function (its range is not just 0 to N-1).
- If N = 2^d0 for some d0, then h_i consists of applying h and looking at the last d_i bits, where d_i = d0 + i.
- h_{i+1} doubles the range of h_i (similar to directory doubling).
Splitting proceeds in `rounds'. A round ends when all N_R initial (for round R) buckets have been split. Buckets 0 to Next-1 have been split; buckets Next to N_R have yet to be split. The current round number is called Level.
Search: to find the bucket for data entry r, compute h_Level(r):
- if h_Level(r) is in the range `Next to N_R', r belongs there;
- else, r could belong to bucket h_Level(r) or to bucket h_Level(r) + N_R; we must apply h_{Level+1}(r) to find out.
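A sketch of this search rule (illustrative Python; the base hash h and N = 4 are assumptions):

N = 4                                 # initial number of buckets
level, next_to_split = 0, 0           # `Level' and `Next' from the slides

def h(key: int) -> int:
    return key                        # identity suffices for a sketch

def h_i(key: int, i: int) -> int:
    # h_i(key) = h(key) mod (2^i * N)
    return h(key) % ((2 ** i) * N)

def bucket_for(key: int) -> int:
    n_r = (2 ** level) * N            # buckets at the start of round `level'
    b = h_i(key, level)
    if b >= next_to_split:            # in range `Next to N_R': not yet split
        return b
    # bucket b was already split this round: h_{Level+1} decides between
    # bucket b and its split image b + n_r
    return h_i(key, level + 1)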
Overview of LH File
In the middle of a round.
[Figure: the range of h_Level covers the buckets that existed at the beginning of this round; Next marks the bucket to be split; for buckets already split in this round, if h_Level(search key value) falls in their range, h_{Level+1}(search key value) must be used to decide whether the entry is in the `split image' bucket; `split image' buckets are created (through splitting of other buckets) during this round]
Since buckets are split round-robin, long overflow chains don't develop! The doubling of the directory in Extendible Hashing is similar; the switching of hash functions is implicit in how the # of bits examined is increased.
On a split, h_{Level+1} is used to re-distribute entries.
[Figures: Linear Hashing example with Level=0 and N=4: h0 examines the last 2 bits (00..11), h1 the last 3 (000..111). Initially Next=0, with primary pages such as 32*, 44*, 36* and 9*, 25*, 5* (a data entry r with h(r)=5 goes into the primary bucket page for 01). After the first split, 44* and 36* move to the split-image bucket 100 and Next=1. A later snapshot shows Next=3, primary pages 32* | 9*, 25* | 66*, 18*, 10*, 34* | 31*, 35*, 7*, 11* | 44*, 36* | 5*, 37*, 29* | 14*, 30*, 22*, and overflow pages holding 43* and 50*]
LH Described as a Variant of EH
The two schemes are actually quite similar. Begin with an EH index where the directory has N elements. Use overflow pages, and split buckets round-robin, with the first split at bucket 0. (Imagine the directory being doubled at this point.) But directory elements <1, N+1>, <2, N+2>, ... point to the same buckets, so we need only create directory element N, which now differs from element 0.
Moreover, primary bucket pages are created in order. If they are allocated in sequence too (so that finding the i-th page is easy), we actually don't need a directory!
Useful Links
http://www.cs.ucla.edu/classes/winter03/cs143/l1/han
Summary
Hash-based indexes: best for equality searches; cannot support range searches.
Static Hashing can lead to long overflow chains.
Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)
- A directory keeps track of the buckets and doubles periodically; it can get large with skewed data, costing additional I/O if it does not fit in main memory.
- Directoryless schemes (linear dynamic hashing) are also available.
Summary
Linear Hashing avoids a directory by splitting buckets round-robin and using overflow pages.
- Overflow pages are not likely to be long.
- Duplicates are handled easily.
- Space utilization could be lower than with Extendible Hashing, since splits are not concentrated on `dense' data areas.
Check List
What is the intuition behind hash-structured indexes?
Why are they especially good for equality searches but useless for range selections?
What is Extendible Hashing? How does it handle search, insert, and delete?
What is Linear Hashing?
What are the similarities and differences between Extendible and Linear Hashing?
How does a perfect hash function work?