
Sergei Winitzki

2001-10-13 to December 30, 2008

1 Statistics of random hash function

1.1 Formulation of the problem

A $p$-bit hash function is a function from $\mathbb{N}$ to the integer range $\{0, 1, \dots, 2^p - 1\}$. Such functions are used as check sums on data files. A data file is considered as a stream of bits, that is, a binary representation of a nonnegative integer number. If the hash function gives different results on two files, the files are surely different. For example, the MD5 sum is a 128-bit hash function frequently used to verify file integrity. A good hash function will yield different results for even slightly different files; heuristically, a good hash function yields a random value. However, it is clear that there will be, by pure chance, some cases where different inputs yield the same hash function value. These are called hash collisions. The problem is to estimate the frequency of hash collisions, assuming a perfect hash, i.e. that the hash values are perfectly random, uniformly distributed numbers in the hash range.

Therefore, the problem of finding the frequency of hash collisions is equivalent to the following mathematical problem. Suppose $x_1, \dots, x_n$ are independent, uniformly randomly chosen integers, each ranging from 1 to $N$ (in the case of a $p$-bit hash function, we choose $N = 2^p$). We need to compute the average number of different integers in the set $\{x_1, \dots, x_n\}$. We would also like to compute the average number of pair collisions, triple collisions, etc.

1.2 The basic generating function

One drawing of $n$ integers can be described if we specify how many times each possible integer from the set $\{1, \dots, N\}$ is selected. Consider the probability $p(n; s_1, \dots, s_N)$ that the integer $i$ is selected $s_i$ times ($i = 1, \dots, N$). The generating function for this probability can be defined as

$$ G(n; q_1, \dots, q_N) = \sum_{s_i \ge 0} q_1^{s_1} \dots q_N^{s_N} \, p(n; s_1, \dots, s_N). \tag{1} $$

For $n = 1$ we have

$$ p(1; s_1, \dots, s_N) = \begin{cases} \dfrac{1}{N}, & \text{if only one of the } s_i \text{ is 1}, \\ 0, & \text{otherwise.} \end{cases} $$

So the generating function for $n = 1$ is simply

$$ G(1; q_1, \dots, q_N) = \frac{1}{N} (q_1 + \dots + q_N). \tag{2} $$

For $n > 1$ the generating function is equal to the product of the $n$ (identical) generating functions (2):

$$ G(n; q_1, \dots, q_N) = \frac{1}{N^n} (q_1 + \dots + q_N)^n = \frac{1}{N^n} \sum_{s_i \ge 0;\ \sum_i s_i = n} \frac{n!}{s_1! \dots s_N!} \, q_1^{s_1} \dots q_N^{s_N}. \tag{3} $$

This generating function contains, in principle, the complete information about the probabilities of drawing various sets of integers. Our task now is to use this generating function for the computations we need to perform.

1.3 Average number of different integers

Each possible drawing of the $n$ random integers is represented in the generating function $G$ by a term such as $q_1 q_3^2 q_4$, which signifies a drawing of $\{1, 3, 3, 4\}$. The number of different integers in this drawing is 3. The generating function $G$ is the sum of all these terms with the coefficients equal to the probabilities of the drawings. The average number of different integers will be computed if we replace in $G(n; q_1, \dots, q_N)$ every term $q_1^{s_1} \dots q_N^{s_N}$ by the number of different $q_i$'s in that term.

The number of different $q_i$'s in the term $q_1^{s_1} \dots q_N^{s_N}$ can be computed as $f(s_1) + \dots + f(s_N)$, where the function $f(s)$ is defined as

$$ f(s) = \begin{cases} 0, & s = 0, \\ 1, & s \ge 1. \end{cases} \tag{4} $$

So we only need to replace $q_1^{s_1} \dots q_N^{s_N}$ by $f(s_1) + \dots + f(s_N)$.

An elegant way of doing this is to find an explicit formula for a linear map from polynomials in $\{q_i\}$ to numbers, so that $q_1^{s_1} \dots q_N^{s_N}$ is mapped to $f(s_1) + \dots + f(s_N)$. This map can be found as follows.

First let us try to find the map for just one variable. We need a formula for a linear map such that $q^s$ is mapped into $f(s)$. In particular, we need $f(s) = 1$ for all $s \ge 1$. In other words, $q^2$ is equivalent to $q$ after the map; this suggests that $q$ should be replaced by a projection matrix. However, once we have the idea of using a matrix, we do not need to limit ourselves to a particular choice of $f(s)$. Let us keep $f(s)$ general and substitute instead of $q$ some matrix $T$ such that $T^s$ is mapped into $f(s)$. This can be arranged if we choose
some vector $u \in V$ and some covector $v^* \in V^*$ such that

$$ \langle v^*, T^s u \rangle = f(s), \tag{5} $$

where the operator $T$ acts in the vector space $V$. This construction yields a linear map from polynomials in $q$ into numbers, such that $q^s$ is mapped into $f(s)$.

Now let us generalize to $N$ variables $\{q_i\}$. We need a linear map that yields $f(s_1) + \dots + f(s_N)$. This suggests that we use a direct sum of $N$ copies of the linear space $V$ and substitute instead of $q_i$ the operators

$$ T_i \equiv 1_V \oplus \dots \oplus T \oplus \dots \oplus 1_V \in \mathrm{End}(V \oplus \dots \oplus V), \tag{6} $$

where the operator $T$ acts on the $i$-th copy of $V$ and $1_V$ is the identity operator in $V$. We now define the vector $\tilde{u}$ and the covector $\tilde{v}^*$,

$$ \tilde{u} \equiv u \oplus \dots \oplus u, \qquad \tilde{v}^* \equiv v^* \oplus \dots \oplus v^*, \tag{7} $$

and verify that

$$ \langle \tilde{v}^*, T_i^s \tilde{u} \rangle = f(s). \tag{8} $$

When we substitute $T_i$ instead of $q_i$ in a polynomial term $q_1^{s_1} \dots q_N^{s_N}$, we obtain an operator $T^{s_1} \oplus \dots \oplus T^{s_N}$, which will yield

$$ \langle \tilde{v}^*, (T^{s_1} \oplus \dots \oplus T^{s_N}) \tilde{u} \rangle = f(s_1) + \dots + f(s_N). \tag{9} $$

Therefore, we constructed a linear map that can be applied directly to the polynomial $G(n; q_1, \dots, q_N)$ to yield the average number of different integers if $f(s)$ is chosen as shown above.

Let us perform this computation using the explicit form of $G(n; q_1, \dots, q_N)$. We substitute $T_i$ instead of $q_i$ and obtain

$$ G(n; T_1, \dots, T_N) = \left( \frac{T_1 + \dots + T_N}{N} \right)^n. $$

The operator $T_1 + \dots + T_N$ can be simplified to

$$ T_1 + \dots + T_N = [(N-1) 1_V + T] \oplus \dots \oplus [(N-1) 1_V + T]. $$

Let us denote for brevity

$$ Q \equiv \frac{N-1}{N} 1_V + \frac{1}{N} T. $$

Then we can write

$$ G(n; T_1, \dots, T_N) = Q^n \oplus \dots \oplus Q^n. \tag{10} $$

Now we can evaluate the application

$$ \langle \tilde{v}^*, G(n; T_1, \dots, T_N) \tilde{u} \rangle = N \langle v^*, Q^n u \rangle. $$

Consider the function $f(s)$ defined by Eq. (4). One can certainly choose an operator $T$ and vectors $u$, $v^*$ such that Eq. (5) holds for this $f(s)$. Then we find

\begin{align}
\langle v^*, Q^n u \rangle &= \frac{1}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N-1)^{n-k} \langle v^*, T^k u \rangle \notag \\
&= \frac{1}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N-1)^{n-k} f(k) \tag{11} \\
&= \frac{1}{N^n} \sum_{k=1}^{n} \binom{n}{k} (N-1)^{n-k} = \frac{N^n - (N-1)^n}{N^n}. \notag
\end{align}

Therefore the average number of distinct integers is

$$ n_d = N \left[ 1 - \left( 1 - N^{-1} \right)^n \right]. \tag{12} $$

This formula determines the average number of collisions, $n - n_d$, in a perfect hash function.

As a realistic example, let us assume that we have computed the 32-bit hash sums of one million different files. How many different hash sums do we have on average? We substitute $N = 2^{32}$ and $n = 10^6$ into Eq. (12) and find

$$ n_d(2^{32}, 10^6) \approx 10^6 - 116.4, $$

which means that about 116 files will have the same hash sum as some other file even though the files are different. So we need to use a larger hash range; with $N = 2^{64}$ we find

$$ n_d(2^{64}, 10^6) \approx 10^6 - 2.7 \times 10^{-8}. \tag{13} $$

This indicates a negligible chance of hash collisions. Therefore, a 64-bit hash sum is sufficient for a million files.

Let us perform an asymptotic estimate of the collision rate for very large $N$. We may expand Eq. (12) as

$$ n_d \approx N \left[ 1 - 1 + \frac{n}{N} - \frac{n(n-1)}{2N^2} \right] = n - \frac{n(n-1)}{2N}. $$

Therefore, the collision rate is negligible ($n - n_d \ll 1$) when $N \gg n^2$.

1.4 Average number of pairs, triples, etc.

If we wanted to find the average number of pairs, we could replace the term $q_1^{s_1} \dots q_N^{s_N}$ in the generating function $G(n; q_1, \dots, q_N)$ by $f_2(s_1) + \dots + f_2(s_N)$, where $f_2(s)$ is defined as

$$ f_2(s) \equiv \delta_{s2} = \begin{cases} 1, & s = 2; \\ 0, & \text{otherwise.} \end{cases} $$

We can similarly consider the triples or, more generally, $p$-tuples of coincident integers, by taking the function $f_p(s) \equiv \delta_{sp}$. We can describe all these $p$-tuples at once if we consider the generating function of the average number of $p$-tuples; this means introducing an additional formal parameter $t$ and defining

$$ f(s; t) \equiv \sum_{p \ge 0} f_p(s) t^p = t^s. $$

Hence, we use the same derivation as in the previous section up to Eq. (11), but now we substitute the function $f(s) = t^s$ instead of the previously used $f(s)$ in Eq. (11). Then we find

\begin{align}
\langle \tilde{v}^*, G(n; T_1, \dots, T_N) \tilde{u} \rangle &= N \langle v^*, Q^n u \rangle \notag \\
&= \frac{N}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N-1)^{n-k} t^k = \frac{(N - 1 + t)^n}{N^{n-1}}. \tag{14}
\end{align}
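As a sanity check on Eq. (14), the following minimal sketch enumerates every one of the $N^n$ equally likely drawings for small $N$ and $n$, averages $\sum_i t^{s_i}$ as a polynomial in $t$, and compares the result with the binomial expansion of $(N - 1 + t)^n / N^{n-1}$. The function names are hypothetical and chosen only for this illustration; the brute-force loop is practical only for very small parameters.

```python
from fractions import Fraction
from itertools import product
from math import comb


def average_tuple_polynomial(N, n):
    """Average of sum_i t**s_i over all N**n equally likely drawings.

    Returns a list c where c[k] is the exact coefficient of t**k,
    i.e. the average number of integers drawn exactly k times.
    """
    coeffs = [Fraction(0)] * (n + 1)
    weight = Fraction(1, N ** n)  # probability of one particular drawing
    for drawing in product(range(1, N + 1), repeat=n):
        for value in range(1, N + 1):
            s = drawing.count(value)  # multiplicity s_i of this integer
            coeffs[s] += weight
    return coeffs


def closed_form_polynomial(N, n):
    """Coefficients of (N - 1 + t)**n / N**(n - 1), expanded binomially."""
    return [Fraction(comb(n, k) * (N - 1) ** (n - k), N ** (n - 1))
            for k in range(n + 1)]


if __name__ == "__main__":
    N, n = 4, 5  # small illustrative values (an assumption of this sketch)
    assert average_tuple_polynomial(N, n) == closed_form_polynomial(N, n)
    print(closed_form_polynomial(N, n))
```

Exact rational arithmetic is used so that the comparison is an equality test and not a floating-point tolerance check. Setting $t = 1$ sums the coefficients to $N$, since every integer has some multiplicity (possibly zero).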

The average number of pairs is read off from Eq. (14) as the coefficient at $t^2$. The average number of $p$-tuples is

$$ n_p = \binom{n}{p} \frac{(N-1)^{n-p}}{N^{n-1}}. $$

For example, with $p = 2$ we find

$$ n_2 = \frac{n(n-1)}{2} \frac{(N-1)^{n-2}}{N^{n-1}} = \frac{n(n-1)}{2N} \left( 1 - \frac{1}{N} \right)^{n-2}. $$
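The closed forms can be evaluated numerically for the hash sizes discussed in Section 1.3. The sketch below is one possible way to do this, not part of the original derivation: it uses Python's `decimal` module with 60 significant digits, because ordinary double-precision floats cannot represent $1 - 2^{-64}$ and would lose the collision count to cancellation. The function names and the chosen precision are assumptions made for this illustration.

```python
from decimal import Decimal, getcontext
from math import comb

# 60 significant digits: enough to resolve 1 - 2**-64 (an assumed setting).
getcontext().prec = 60


def average_distinct(N, n):
    """Eq. (12): n_d = N * (1 - (1 - 1/N)**n)."""
    N = Decimal(N)
    return N * (1 - (1 - 1 / N) ** n)


def average_p_tuples(N, n, p):
    """n_p = C(n, p) * (N - 1)**(n - p) / N**(n - 1).

    Evaluated in the equivalent form C(n, p) * (1 - 1/N)**(n - p) / N**(p - 1)
    to avoid forming huge integer powers.
    """
    N = Decimal(N)
    return Decimal(comb(n, p)) * (1 - 1 / N) ** (n - p) / N ** (p - 1)


if __name__ == "__main__":
    n = 10 ** 6
    for bits in (32, 64):
        N = 2 ** bits
        collisions = n - average_distinct(N, n)
        pairs = average_p_tuples(N, n, 2)
        print(f"{bits}-bit: n - n_d = {collisions:.4E}, n_2 = {pairs:.4E}")
```

For one million files this should reproduce the figures quoted in Section 1.3, roughly 116.4 collisions for $N = 2^{32}$ and about $2.7 \times 10^{-8}$ for $N = 2^{64}$. The pair count $n_2$ is slightly smaller than $n - n_d$ because triples and higher multiplicities also contribute to the collision total.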

1.5 Remarks

Perhaps the calculation can be performed directly, without the procedure of substituting some complicated operators into the generating function. Maybe one can directly consider the generating function of the average number of p-tuples, starting with Eq. (11).
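One way such a direct calculation might go is sketched below. It is not taken from the derivation above; it assumes only that each integer $i$ is drawn $s_i$ times with $s_i$ binomially distributed (n independent trials, success probability $1/N$), and it uses linearity of expectation over the $N$ integers. The results agree with Eq. (12) and with the formula for $n_p$.

```latex
% Indicator-variable sketch (an assumption of this note, not the author's method).
% s_i = number of times the integer i is drawn; s_i ~ Binomial(n, 1/N).
\begin{align*}
\Pr(s_i = p) &= \binom{n}{p} \frac{1}{N^{p}} \left( 1 - \frac{1}{N} \right)^{n-p}, \\
n_p &= \sum_{i=1}^{N} \Pr(s_i = p)
     = N \binom{n}{p} \frac{(N-1)^{n-p}}{N^{n}}
     = \binom{n}{p} \frac{(N-1)^{n-p}}{N^{n-1}}, \\
n_d &= \sum_{i=1}^{N} \Pr(s_i \ge 1)
     = N \left[ 1 - \left( 1 - \frac{1}{N} \right)^{n} \right].
\end{align*}
```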
