
Average number of collisions in a hash function

Sergei Winitzki
2001-10-13 to December 30, 2008

1 Statistics of random hash

1.1 Formulation of the problem

A $p$-bit hash function is a function from $\mathbb{N}$ (the nonnegative integers) to the integer range $\{0, 1, \dots, 2^p - 1\}$. Such functions are used as checksums on data files. A data file is considered as a stream of bits, that is, a binary representation of a nonnegative integer. If the hash function gives different results on two files, the files are surely different. For example, the MD5 sum is a 128-bit hash function frequently used to verify file integrity. A good hash function will yield different results for even slightly different files; heuristically, a good hash function yields a random value. However, it is clear that there will be, by pure chance, some cases where different inputs yield the same hash value. These are called hash collisions. The problem is to estimate the frequency of hash collisions, assuming a perfect hash, i.e. that the hash values are perfectly random, uniformly distributed numbers in the hash range.

Therefore, the problem of finding the frequency of hash collisions is equivalent to the following mathematical problem. Suppose $x_1, \dots, x_n$ are independent, uniformly randomly chosen integers, each ranging from 1 to $N$ (in the case of a $p$-bit hash function, we choose $N = 2^p$). We need to compute the average number of different integers in the set $\{x_1, \dots, x_n\}$. We would also like to compute the average number of pair collisions, triple collisions, etc.

1.2 The basic generating function

One drawing of $n$ integers can be described if we specify how many times each possible integer from the set $\{1, \dots, N\}$ is selected. Consider the probability $p(n; s_1, \dots, s_N)$ that the integer $i$ is selected $s_i$ times ($i = 1, \dots, N$). The generating function for this probability can be defined as
\[
G(n; q_1, \dots, q_N) = \sum_{s_1, \dots, s_N} q_1^{s_1} \dots q_N^{s_N}\, p(n; s_1, \dots, s_N). \tag{1}
\]
For $n = 1$ we have
\[
p(1; s_1, \dots, s_N) = \begin{cases} \frac{1}{N}, & \text{if exactly one of the } s_i \text{ is 1 and the others are 0,} \\ 0, & \text{otherwise.} \end{cases}
\]
So the generating function for $n = 1$ is simply
\[
G(1; q_1, \dots, q_N) = \frac{1}{N} (q_1 + \dots + q_N). \tag{2}
\]
For $n > 1$ the generating function is equal to the product of the $n$ (identical) generating functions (2):
\[
G(n; q_1, \dots, q_N) = \left[ \frac{1}{N} (q_1 + \dots + q_N) \right]^n = \frac{1}{N^n} \sum_{s_i \ge 0;\ \sum_i s_i = n} \frac{n!}{s_1! \dots s_N!}\, q_1^{s_1} \dots q_N^{s_N}. \tag{3}
\]
This generating function contains, in principle, the complete information about the probabilities of drawing various sets of integers. Our task now is to use this generating function for the computations we need to perform.

1.3 Average number of different integers

Each possible drawing of the $n$ random integers is represented in the generating function $G$ by a term such as $q_1 q_3^2 q_4$, which signifies a drawing of $\{1, 3, 3, 4\}$. The number of different integers in this drawing is 3. The generating function $G$ is the sum of all these terms with the coefficients equal to the probabilities of the drawings. The average number of different integers will be computed if we replace in $G(n; q_1, \dots, q_N)$ every term $q_1^{s_1} \dots q_N^{s_N}$ by the number of different $q_i$'s in that term.

The number of different $q_i$'s in the term $q_1^{s_1} \dots q_N^{s_N}$ can be computed as $f(s_1) + \dots + f(s_N)$, where the function $f(s)$ is defined as
\[
f(s) = \begin{cases} 0, & s = 0, \\ 1, & s \ge 1. \end{cases} \tag{4}
\]
So we only need to replace $q_1^{s_1} \dots q_N^{s_N}$ by $f(s_1) + \dots + f(s_N)$.

An elegant way of doing this is to find an explicit formula for a linear map from polynomials in $\{q_i\}$ to numbers, so that $q_1^{s_1} \dots q_N^{s_N}$ is mapped to $f(s_1) + \dots + f(s_N)$. This map can be found as follows.

First let us try to find the map for just one variable. We need a formula for a linear map such that $q^s$ is mapped into $f(s)$. In particular, we need $f(s) = 1$ for all $s \ge 1$. In other words, $q^2$ is equivalent to $q$ after the map; this suggests that $q$ should be replaced by a projection matrix. However, once we have the idea of using a matrix, we do not need to limit ourselves to a particular choice of $f(s)$. Let us keep $f(s)$ general and substitute instead of $q$ some matrix $T$ such that $T^s$ is mapped into $f(s)$. This can be arranged if we choose
some vector $u \in V$ and some covector $v^* \in V^*$ such that
\[
\langle v^*, T^s u \rangle = f(s), \tag{5}
\]
where the operator $T$ acts in the vector space $V$. This construction yields a linear map from polynomials in $q$ into numbers, such that $q^s$ is mapped into $f(s)$.

Now let us generalize to $N$ variables $\{q_i\}$. We need a linear map that yields $f(s_1) + \dots + f(s_N)$. This suggests that we use a direct sum of $N$ copies of the linear space $V$ and substitute instead of $q_i$ the operators
\[
T_i \equiv 1_V \oplus \dots \oplus T \oplus \dots \oplus 1_V \in \mathrm{End}(V \oplus \dots \oplus V), \tag{6}
\]
where the operator $T$ acts on the $i$-th copy of $V$ and $1_V$ is the identity operator in $V$. We now define the vector $\tilde{u}$ and the covector $\tilde{v}^*$,
\[
\tilde{u} \equiv u \oplus \dots \oplus u, \qquad \tilde{v}^* \equiv v^* \oplus \dots \oplus v^*, \tag{7}
\]
and verify that
\[
\langle \tilde{v}^*, T_i^s \tilde{u} \rangle = f(s). \tag{8}
\]
(This uses $\langle v^*, u \rangle = f(0) = 0$, so the $N - 1$ copies of $V$ on which $T_i$ acts as the identity do not contribute.) When we substitute $T_i$ instead of $q_i$ in a polynomial term $q_1^{s_1} \dots q_N^{s_N}$, we obtain an operator $T^{s_1} \oplus \dots \oplus T^{s_N}$, which will yield
\[
\langle \tilde{v}^*, (T^{s_1} \oplus \dots \oplus T^{s_N})\, \tilde{u} \rangle = f(s_1) + \dots + f(s_N). \tag{9}
\]
Therefore, we constructed a linear map that can be applied directly to the polynomial $G(n; q_1, \dots, q_N)$ to yield the average number of different integers if $f(s)$ is chosen as shown above.

Let us perform this computation using the explicit form of $G(n; q_1, \dots, q_N)$. We substitute $T_i$ instead of $q_i$ and obtain
\[
G(n; T_1, \dots, T_N) = \left[ \frac{T_1 + \dots + T_N}{N} \right]^n.
\]
The operator $T_1 + \dots + T_N$ can be simplified to
\[
T_1 + \dots + T_N = [(N-1) 1_V + T] \oplus \dots \oplus [(N-1) 1_V + T].
\]
Let us denote for brevity
\[
Q \equiv \frac{N-1}{N} 1_V + \frac{1}{N} T.
\]
Then we can write
\[
G(n; T_1, \dots, T_N) = Q^n \oplus \dots \oplus Q^n. \tag{10}
\]
Now we can evaluate the application
\[
\langle \tilde{v}^*, G(n; T_1, \dots, T_N)\, \tilde{u} \rangle = N \langle v^*, Q^n u \rangle.
\]
Consider the function $f(s)$ defined by Eq. (4). One can certainly choose an operator $T$ and vectors $u$, $v^*$ such that Eq. (5) holds for this $f(s)$ (for instance, $V = \mathbb{R}^2$ with the projection $T = \mathrm{diag}(1, 0)$, $u = (1, 1)^T$, and $v^* = (1, -1)$). Then we find
\begin{align}
\langle v^*, Q^n u \rangle &= \frac{1}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N-1)^{n-k} \langle v^*, T^k u \rangle \nonumber \\
&= \frac{1}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N-1)^{n-k} f(k) \tag{11} \\
&= \frac{1}{N^n} \sum_{k=1}^{n} \binom{n}{k} (N-1)^{n-k} = \frac{N^n - (N-1)^n}{N^n}. \nonumber
\end{align}
Therefore the average number of distinct integers is
\[
n_d = N \left[ 1 - \left( 1 - \frac{1}{N} \right)^n \right]. \tag{12}
\]
This formula determines the average number of collisions, $n - n_d$, in a perfect hash function.

As a realistic example, let us assume that we have computed the 32-bit hash sums of one million different files. How many different hash sums do we have on the average? We substitute $N = 2^{32}$ and $n = 10^6$ into Eq. (12) and find
\[
n_d(2^{32}, 10^6) \approx 10^6 - 116.4,
\]
which means that, on average, about 116 hash sums will be lost to collisions even though the files are all different. So we need to use a larger hash range; with $N = 2^{64}$ we find
\[
n_d(2^{64}, 10^6) \approx 10^6 - 2.7 \cdot 10^{-8}. \tag{13}
\]
This indicates a negligible chance of hash collisions. Therefore, a 64-bit hash sum is sufficient for a million files.

Let us perform an asymptotic estimate of the collision rate for very large $N$. We may expand Eq. (12) as
\[
n_d \approx N \left[ 1 - 1 + \frac{n}{N} - \frac{n(n-1)}{2N^2} \right] = n - \frac{n(n-1)}{2N}.
\]
Therefore, the collision rate is negligible ($n - n_d \ll 1$) when $N \gg n^2$.

1.4 Average number of pairs, triples, etc.

If we wanted to find the average number of pairs (that is, of integers drawn exactly twice), we could replace the term $q_1^{s_1} \dots q_N^{s_N}$ in the generating function $G(n; q_1, \dots, q_N)$ by $f_2(s_1) + \dots + f_2(s_N)$, where $f_2(s)$ is defined as
\[
f_2(s) = \delta_{s2} = \begin{cases} 1, & s = 2; \\ 0, & \text{otherwise.} \end{cases}
\]
We can similarly consider the triples or, more generally, $p$-tuples of coincident integers, by taking the function $f_p(s) = \delta_{sp}$. We can describe all these $p$-tuples at once if we consider the generating function of the average number of $p$-tuples; this means introducing an additional formal parameter $t$ and defining
\[
f(s; t) \equiv \sum_p f_p(s)\, t^p = t^s.
\]
Hence, we use the same derivation as in the previous section up to Eq. (11), but now we substitute the function $f(s) = t^s$ instead of the previously used $f(s)$ in Eq. (11). Then we find
\begin{align}
\langle \tilde{v}^*, G(n; T_1, \dots, T_N)\, \tilde{u} \rangle &= N \langle v^*, Q^n u \rangle \nonumber \\
&= \frac{N}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N-1)^{n-k}\, t^k = \frac{(N - 1 + t)^n}{N^{n-1}}. \tag{14}
\end{align}

The average number of pairs is read off from Eq. (14) as the coefficient of $t^2$. The average number of $p$-tuples is
\[
n_p = \binom{n}{p} \frac{(N-1)^{n-p}}{N^{n-1}}.
\]
For example, with $p = 2$ we find
\[
n_2 = \frac{n(n-1)}{2} \frac{(N-1)^{n-2}}{N^{n-1}} = \frac{n(n-1)}{2N} \left( 1 - \frac{1}{N} \right)^{n-2}.
\]
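
The closed-form results can be evaluated and cross-checked numerically. The Python sketch below evaluates Eq. (12) and the formula for the average number of $p$-tuples in floating point, and compares both with a small Monte Carlo simulation of uniform draws. The function names, the use of `log1p`/`expm1` for numerical stability, and the simulation parameters are assumptions made for this illustration; they do not come from the original text.

```python
import math
import random
from collections import Counter


def expected_distinct(N, n):
    """Eq. (12): N * (1 - (1 - 1/N)**n), written with log1p/expm1 to stay accurate for huge N."""
    return -N * math.expm1(n * math.log1p(-1.0 / N))


def expected_collisions(N, n):
    """Average number of hash values lost to collisions, n - n_d."""
    return n - expected_distinct(N, n)


def expected_p_tuples(N, n, p):
    """Average number of integers drawn exactly p times: C(n, p) (N-1)^(n-p) / N^(n-1).

    Rewritten as C(n, p) * N^(1-p) * (1 - 1/N)^(n-p); intended for small p only.
    """
    return math.comb(n, p) * (1.0 / N) ** (p - 1) * (1.0 - 1.0 / N) ** (n - p)


def simulate(N, n, trials, seed=0):
    """Monte Carlo estimate of (average distinct values, average exact pairs)."""
    rng = random.Random(seed)
    total_distinct = 0
    total_pairs = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(N) for _ in range(n))
        total_distinct += len(counts)
        total_pairs += sum(1 for c in counts.values() if c == 2)
    return total_distinct / trials, total_pairs / trials


if __name__ == "__main__":
    # The 32-bit and 64-bit examples with one million files.
    print(expected_collisions(2**32, 10**6))
    print(expected_collisions(2**64, 10**6))
    # Small case where simulation is cheap (hypothetical parameters).
    print(simulate(50, 20, trials=20000), expected_distinct(50, 20),
          expected_p_tuples(50, 20, 2))
```

For the one-million-file examples, the first two printed values should be close to the 116.4 and $2.7 \cdot 10^{-8}$ quoted in the text. The simulated averages should agree with the formulas only up to statistical noise.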

1.5 Remarks
Perhaps the calculation can be performed directly, without substituting operators into the generating function. Maybe one can directly consider the generating function of the average number of $p$-tuples, starting with Eq. (11).
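
One possible direct route is sketched here, under the same assumption of independent, uniformly distributed draws; it is offered as a cross-check rather than as the method used above. For a fixed integer $i$, each of the $n$ draws selects $i$ with probability $1/N$, so the count $s_i$ is binomially distributed. By linearity of expectation, the average number of integers drawn exactly $p$ times is
\[
n_p = \sum_{i=1}^{N} \Pr(s_i = p) = N \binom{n}{p} \frac{1}{N^p} \left( 1 - \frac{1}{N} \right)^{n-p} = \binom{n}{p} \frac{(N-1)^{n-p}}{N^{n-1}},
\]
which agrees with the coefficient of $t^p$ in Eq. (14). In the same way,
\[
n_d = \sum_{i=1}^{N} \Pr(s_i \ge 1) = N \left[ 1 - \left( 1 - \frac{1}{N} \right)^n \right],
\]
which reproduces Eq. (12).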