Operational Research and Optimization (Master EEJSI)
Teacher Coordinator: Prof. Emil Simion
Erika Batista - batista.erika@gmail.com
Gal Canal - g.canal@lgpse.com
Karim Ziadeh - karim.ziadeh@gmail.com
December 2012
Table of contents
I. Introduction
II. The Birthday Paradox
    A. Explanation
    B. Estimating the probability
    C. Birthday paradox vs. Same day probability
III. The Birthday Paradox in Cryptography
    A. What is the probability of having a conflict?
    B. Practical uses
        1) Hash
        2) MAC (Message Authentication Code)
IV. Other applications
    A. The pseudo random number generator (PRNG) application
    B. Applications of the birthday paradox in communication networks
V. Reality check
    A. Challenging the hypothesis of uniformity
    B. Description of the experimental data
    C. Experimental process
    D. Preparation
    E. Experimental results
        1) Sample of 10 dates
        2) Sample of 23 dates
        3) Sample of 30 dates
    F. Numerical results
    G. Observations
    H. Explanation of the results
        1) General distribution of births
        2) Distribution of experimental data
    I. Lessons learned
VI. Conclusion
VII. Annex
VIII. Bibliography
I. Introduction
This paper studies in detail the problem known in probability theory as the Birthday Paradox (or the Birthday problem), as well as its implications and the related aspects we consider relevant. The present study treats the following topics:
- Definition of the problem
- Concrete check tests of the Birthday Paradox
- Survey, description, explanation and simulation of real-life applications of the Birthday Paradox:
  o In cryptography (hash algorithms)
  o When unique identifiers are needed
  o In communication networks
  o In pseudo-random number generators
The purpose of this paper is thus to give the reader an overview of the general aspects and applications of the Birthday Paradox.
In order to calculate P(A'), that is, the probability that no two people share a birthday, we first examine the first two persons. If there are only two people in the room, the first can have any birthday (365/365 possibilities), but the second must have been born on one of the remaining 364 days. Therefore, the probability of those two persons not having the same birthday can be calculated and expressed as follows:
$$P(A') = \frac{365}{365} \times \frac{364}{365} = \frac{364}{365} \approx 0.99726$$
[2] Several studies show that the distribution of birthdays throughout the year is not homogeneous. See: Eric Weisstein's World of Astronomy. Source: http://scienceworld.wolfram.com/astronomy/LeapDay.html (Last visited: December 8th, 2012); An Analysis of the Distributions of Birthdays in a Calendar Year, MURPHY, Roy. Source: http://www.panix.com/~murphy/bday.html (Last visited: December 8th, 2012); Birthday distribution, David Gleich's Notebook. Stanford University, 2010. Source: http://www.stanford.edu/~dgleich/notebook/2009/04/birthday_distribution.html (Last visited: December 8th, 2012); Infographic Illustrates Most Common Birthdays, Baby-Making Days. ZIMMERMAN, Neetzan. Gawker, 2012. Source: http://gawker.com/5910778/infographic-illustrates-most-common-birthdaysbaby+making-days (Last visited: December 8th, 2012).
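In order to determine the probability in a room of n people, in this case 23, we then assume that the events "a given pair does not share a birthday" are independent from each other, and we multiply their probability together C(n,2) times (once per pair) in order to obtain an approximation of P(A'), the probability that no two people share a birthday. Therefore:
$$P(A') \approx \left(\frac{364}{365}\right)^{C(n,2)}$$
So if $n = 23$, then $C(23,2) = \frac{23 \times 22}{2} = 253$, and that means:
$$P(A') \approx \left(\frac{364}{365}\right)^{253} \approx 0.4995, \qquad P(A) = 1 - P(A') \approx 0.5005$$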
The probability that two people share a birthday in a room of 23 people is thus about 50% according to this approximation.

Now for a method that is slightly longer, but more accurate. We are still calculating P(A'), the probability that no one shares a birthday, in order to get to P(A). We keep the same group of 23 people and compute P(A') as the product P(1) x P(2) x P(3) x ... x P(n), where n is 23 and P(i) is the probability that Person i does not share a birthday with any of the previously analyzed people. We analyze person by person, starting with Person 1 as in the first calculation: since there are no previously analyzed people, P(1) = 365/365 = 1. Person 2 must be born on one of the remaining 364 days, which has a probability of 364/365. Likewise, Person 3 must be born on one of the remaining 363 days, because he or she can't share a birthday with Person 1 or 2, an event which has a probability of 363/365, and so on. We do the same analysis for each of the following people, up to Person n, whose probability of not sharing a birthday with anyone before is $(365 - n + 1)/365$, in this case P(23) = 343/365.
$$P(A') = \frac{365}{365} \times \frac{364}{365} \times \frac{363}{365} \times \cdots \times \frac{343}{365}$$
Or
$$P(A') = \frac{365!}{365^{23}\,(365 - 23)!}$$
The equation above gives us the following result: P(A') = 0.492703. Therefore, P(A) = 1 - 0.492703 = 0.507297 (50.7297%). We have thus shown that with only 23 people in a room we pass the 50% probability threshold that two of them share a birthday.
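The two calculations above can be checked numerically. The following is a minimal sketch, assuming 365 equally likely birthdays; the function names are our own and are only illustrative.

```python
from math import comb


def p_no_shared_birthday(n, days=365):
    """Exact P(A'): probability that n people all have distinct birthdays."""
    p = 1.0
    for i in range(n):
        # Person i+1 must avoid the i birthdays already taken.
        p *= (days - i) / days
    return p


def p_no_shared_birthday_pairs(n, days=365):
    """Approximate P(A'): treat the C(n, 2) pairs as independent events."""
    return ((days - 1) / days) ** comb(n, 2)


exact = 1 - p_no_shared_birthday(23)
approx = 1 - p_no_shared_birthday_pairs(23)
print(f"exact  P(A) for n = 23: {exact:.6f}")
print(f"approx P(A) for n = 23: {approx:.6f}")
```

Both values land just above 0.5 for n = 23, while the exact value for n = 22 is still below it.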
We end up with a probability of a bit more than 6%. In order to have a probability greater than 50%, we would need at least 253 people in the room. Note that 253 is considerably more than half the number of days in a year: because the other people may share birthdays among themselves, they cover fewer distinct dates, which decreases the chance that one of them shares a birthday with you. The following chart illustrates the stark difference between the probabilities we just compared:
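The "same day" figures can be reproduced with a few lines of Python. This is a sketch under the same uniform 365-day assumption; `others` denotes the number of other people compared against one fixed birthday, and the helper names are illustrative.

```python
def p_someone_shares_my_birthday(others, days=365):
    """Probability that at least one of `others` people has my birthday."""
    return 1 - ((days - 1) / days) ** others


def min_others_for(threshold, days=365):
    """Smallest number of other people for which the probability exceeds threshold."""
    others = 0
    while p_someone_shares_my_birthday(others, days) <= threshold:
        others += 1
    return others


print(f"23 others:  {p_someone_shares_my_birthday(23):.4f}")
print(f"more than 50% needs {min_others_for(0.5)} others")
```

The contrast with the birthday paradox comes from the number of comparisons: here each person is compared with one fixed date only, whereas 23 people form 253 pairs among themselves.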
$P(A) > \tfrac{1}{2}$ for $n = 23$.

Now take a hash function H with n possible outputs and a known hash value H(x). If H is applied to k random inputs, what is the smallest k such that the probability of having at least one input y among these k inputs satisfying H(y) = H(x) is 0.5?

If k = 1, the probability that the single input y satisfies H(y) = H(x) is $1/n$. Conversely, the probability that H(y) ≠ H(x) is $1 - 1/n$.

If k > 1 random inputs are generated, the probability that none of them satisfies H(y) = H(x) is the product of the probabilities that each of them satisfies H(y) ≠ H(x), that is, $1 - 1/n$ multiplied by itself k times:
$$\left(1 - \frac{1}{n}\right)^{k}$$
Thus, the probability of having at least one match is $1 - (1 - 1/n)^k$, which for large n can be approximated by:
$$1 - \left(1 - \frac{1}{n}\right)^{k} \approx 1 - e^{-k/n}$$
This reduces to $k/n$ when k is much smaller than n. Setting the probability to 0.5 gives $k \approx n \ln 2 \approx 0.69\,n$: matching a given hash value requires a number of tries proportional to n itself.
This is the main reason why the size of the hash value of modern hash functions is required to be large enough to make a birthday attack computationally infeasible.
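For comparison with the birthday attack, the single-target search described above (finding some y with H(y) = H(x) for a known H(x)) can be evaluated numerically. This is a sketch assuming an ideal hash function whose n outputs are equally likely; the function names are illustrative.

```python
import math


def p_match_given_hash(k, n):
    """Probability that at least one of k random inputs hits one given hash value."""
    return 1 - (1 - 1 / n) ** k


def smallest_k_for(threshold, n):
    """Smallest k with p_match_given_hash(k, n) >= threshold."""
    # Solve (1 - 1/n)^k <= 1 - threshold for k.
    return math.ceil(math.log(1 - threshold) / math.log(1 - 1 / n))


n = 2 ** 16  # a deliberately small 16-bit output space
k = smallest_k_for(0.5, n)
print(f"n = {n}: k = {k} tries for a 50% chance, k/n = {k / n:.3f}")
```

The ratio k/n stays close to ln 2, so this search grows linearly with the size of the output space, unlike the search for an arbitrary colliding pair.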
B. Practical uses
1) Hash
If hash values have n bits, it should take about $2^n$ tries to find a key mapping to a given hash value, and about $2^{n/2}$ tries to find two keys that map to some common hash value. A one-way function for which either task can be done in significantly fewer tries is therefore considered broken.
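The low cost of finding two keys with a common hash value can be observed on a deliberately weakened hash. The sketch below truncates SHA-256 to 16 bits (an assumption made only to keep the search small) and looks for any two distinct messages with the same truncated value; the message format and function names are illustrative.

```python
import hashlib


def truncated_hash(message: bytes, bits: int) -> int:
    """Keep only the top `bits` bits of SHA-256(message)."""
    digest = hashlib.sha256(message).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def find_collision(bits: int):
    """Return (m1, m2, tries) with m1 != m2 and equal truncated hashes."""
    seen = {}  # truncated hash value -> first message that produced it
    counter = 0
    while True:
        message = f"message-{counter}".encode()
        h = truncated_hash(message, bits)
        if h in seen:
            return seen[h], message, counter + 1
        seen[h] = message
        counter += 1


m1, m2, tries = find_collision(16)
print(f"collision after {tries} tries: {m1!r} and {m2!r}")
```

With 16 bits there are 65536 possible values, yet a collision typically appears after a few hundred tries, which is the order of $2^{16/2} = 256$ rather than $2^{16}$.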
2) MAC (Message Authentication Code)
A MAC is a short piece of information, computed from a message and a secret key, that allows the receiver to check the authenticity and integrity of the message. Just like hash functions, MAC functions have their own security requirements. This is how an authentication tag is computed and used:
Message authentication:
- tag = MAC(key, M)
- Sender sends (M, tag)
- Receiver verifies that tag matches M
Authentication with HMAC:
- tag = H(key | M): subject to extension attacks
- tag = H(M | key): relies on collision resistance
- HMAC: compute tag = H(key | H(key | M))
- M → H(key | M) looks random when key is secret
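The scheme above can be tried with Python's standard `hmac` module. Note that the standardized HMAC construction (RFC 2104) does not use the key twice verbatim: it derives an inner and an outer key by XOR-ing the padded key with two constants. The sketch below spells this out for SHA-256 and checks it against the library; the key and message are made-up example values.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 processes 64-byte blocks


def hmac_sha256_manual(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 written out as H(outer_key | H(inner_key | M))."""
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()  # long keys are hashed first
    key = key.ljust(BLOCK_SIZE, b"\x00")    # then padded to the block size
    inner_key = bytes(b ^ 0x36 for b in key)
    outer_key = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(inner_key + message).digest()
    return hashlib.sha256(outer_key + inner).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Receiver side: recompute the tag and compare in constant time."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)


key = b"example shared secret"         # hypothetical key
message = b"transfer 100 to account 42"  # hypothetical message
tag = hmac_sha256_manual(key, message)
print(verify(key, message, tag))
print(verify(key, message + b"0", tag))
```

The sender transmits (M, tag); a receiver holding the same key accepts the first message and rejects the altered one.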
For secure applications, one would select a PRNG that has the longest possible period and the most uniform distribution for any possible seed. The birthday paradox can help check the randomness of a PRNG: if the observed frequency of collisions between generated values converges to the theoretical value, the output is consistent with a uniform distribution, which makes the generator behave close to a true random source in this respect. The birthday paradox can also be used to attack weak implementations of PRNGs. An example of this problem is the DNS cache poisoning attack, which relies on various weaknesses of DNS server implementations, among which the PRNG used to generate a 16-bit security value meant to authenticate an authoritative answer [4]. Coupled with deterministic UDP port allocation, this weak 16-bit PRNG is a good example of a target for a birthday attack: with no connection (UDP) and a guessable answer port, an attacker would only need as few as 300 tries to get a 50% chance of a matching guess, and 500 to get nearly 100%.
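The order of magnitude of the 50% figure can be checked with the classic birthday computation over a 16-bit space. This sketch models only k values drawn uniformly from 65536 possibilities; it is not a model of the DNS attack itself, and the function names are illustrative.

```python
def p_collision(k, space):
    """Probability of at least one repeated value among k uniform draws."""
    p_none = 1.0
    for i in range(k):
        p_none *= (space - i) / space
    return 1 - p_none


def tries_for(threshold, space):
    """Smallest number of draws whose collision probability reaches threshold."""
    k = 0
    p_none = 1.0
    while 1 - p_none < threshold:
        p_none *= (space - k) / space
        k += 1
    return k


space = 2 ** 16
k50 = tries_for(0.5, space)
print(f"{k50} draws from {space} values give p = {p_collision(k50, space):.3f}")
```

The threshold is a few hundred draws, close to $1.18 \sqrt{65536} \approx 301$, which is why a 16-bit identifier offers little protection against this kind of attack.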
[4] SZMIT et al. Domain Name Servers Pseudo-Random Number Generators and DNS Cache Poisoning Attack. Source: http://maciej.szmit.info/documents/szmit_tomaszewski_szmit_DNS_ebook.pdf (Last visited: December 8th, 2012).
- the bigger m/n, the more probable the meet-up events are;
- the smaller n, the less time is necessary before a first match.
Thus, in this case, the algorithm takes advantage of the birthday paradox by randomly generating a local sleep/idle map, with carefully chosen n and m values, to offer a good statistical guarantee of a matching non-sleep state between communicating nodes. This map is then used in conjunction with the SIM (Sleep Indication Map) received from the network to generate a SIT (Sleep Indication Table), which is in turn used to select the communication slots (the matching non-sleep states). The algorithm also takes care of the special cases (e.g. if no matching slot is found, transmit immediately).
V. Reality check
A. Challenging the hypothesis of uniformity
As previously mentioned, for the sake of simplicity, the Birthday Paradox equations assume that the 365 possible dates of the year are equally likely, and simply ignore the 29 February exception. With this in mind, and provided one used a true random generator to create the sample space (thereby respecting the uniform distribution), one would expect experimental results to converge to the theoretical values. However, can we safely assume that the probabilities are uniformly distributed? What would be the consequences otherwise?
[6] Of course, no statistical conclusion can be drawn from this, but the list of birth dates matches theoretical assumption no. 2, thus leaving only assumption no. 1 as a variable.
C. Experimental process
Our approach was to divide a given list (births or deaths) into smaller sets of 10, 23, and 30 dates, then to apply the binomial model to find the value towards which the observed frequency converges (the chosen sample size has to leave a reasonable number of experiments for the convergence to be observed with an acceptable error margin). In this model, the event we are looking for is that of having at least one matching pair within the sample; it is given the value 1.
D. Preparation
The original data source was first prepared according to our needs (flat files with each line containing a day+month date). The program below, written in Python, does the calculations and graph operations:
- While it is possible to read a sample of n dates from the file:
  o Check if the sample has duplicates, and increment the total of events if so
  o Calculate and store mean(n) and sigma(n) in an array
- Draw the graph (red dots for the mean, blue line for sigma, green line for the expected value, legend)
    import matplotlib.pyplot as plt
    import numpy as np


    def stdDev(X):
        """Population standard deviation of the values in X."""
        mean = sum(X) / float(len(X))
        tot = 0.0
        for x in X:
            tot += (x - mean) ** 2
        return (tot / float(len(X))) ** 0.5


    def hasDup(X):
        """Return 1 if the sample X contains at least one repeated date, else 0."""
        seen = set()
        for y in X:
            if y in seen:
                return 1
            seen.add(y)
        return 0


    def expected(sample_len, days=365):
        """Theoretical probability of a match, assuming uniform birthdays."""
        p_none = 1.0
        for i in range(sample_len):
            p_none *= (days - i) / float(days)
        return 1 - p_none


    def graph(datafile, sample_len):
        # One numeric day+month value per line.
        source = np.loadtxt(datafile)

        # Cut the list into consecutive samples of sample_len dates;
        # an incomplete trailing sample is dropped.
        data = [source[i:i + sample_len]
                for i in range(0, len(source) - sample_len + 1, sample_len)]

        xAxis, ratios, means, sigmas = [], [], [], []
        total = 0  # number of samples containing a match so far

        for num_sample, sample in enumerate(data):
            xAxis.append(num_sample)
            # outcome = 1 if matching birthdays exist, else 0
            result = hasDup(sample)
            # odds (matches / non-matches) over the samples seen before this one
            ratios.append(total / float(1 + num_sample - total))
            total += result
            # running frequency of samples with a match
            means.append(total / float(num_sample + 1))
            # running standard deviation of the odds
            sigmas.append(stdDev(ratios))

        # Averages of the running values over all samples.
        print(sum(means) / float(len(means)))
        print(sum(sigmas) / float(len(sigmas)))

        plt.figure()
        p1, = plt.plot(xAxis, means, "ro")                  # red dots: mean
        p2, = plt.plot(xAxis, sigmas, "b")                  # blue line: sigma
        p3 = plt.axhline(expected(sample_len), color="g")   # green line: expected value
        plt.title(datafile + " - " + str(len(data)) + " samples of "
                  + str(sample_len) + " dates")
        plt.xlabel("sample")
        plt.ylabel("frequency")
        plt.ylim(0, 1)
        plt.legend([p1, p2, p3], ["mean", "sigma", "expected"],
                   loc="best", numpoints=1)


    graph("naissances_2.txt", 10)
    graph("deces_2.txt", 10)
    graph("naissances_2.txt", 23)
    graph("deces_2.txt", 23)
    graph("naissances_2.txt", 30)
    graph("deces_2.txt", 30)
    graph("mariage_2.txt", 23)
    plt.show()
E. Experimental results
Graphical results (the green line represents the theoretical limit):
1) Sample of 10 dates
2) Sample of 23 dates
3) Sample of 30 dates
F. Numerical results
File    Size of samples   # of samples   Experimental mean   Theoretical   Experimental sigma
Birth   10                622            22.76%              11.7%         0.0499416788462
Death   10                488            16.32%              11.7%         0.0469627974932
Birth   23                270            67.79%              50.7%         0.827653217839
Death   23                212            57.75%              50.7%         0.29933503001
Birth   30                207            81.91%              70.6%         1.23287674973
Death   30                162            83.50%              70.6%         2.89535314161
G. Observations
At first glance, one can easily see that the calculated means are:
- significantly above the theoretical values in each situation;
- nevertheless of the same order of magnitude as the theoretical ones.
The number of samples decreases as their size grows, since we have a fixed set of dates, and this affects the quality of the convergence in the latter sets. Also, the values for the death dates appear to be closer to the expected values than those for the birth dates, although in the last experiment the reverse is true (but this could be due to the relatively small number of tries).
[7] RÉGNIER-LOILIER, Arnaud & ROHRBASSER, Jean-Marc. Population et société : Y a-t-il une saison pour faire des enfants ?. INED, January 2011. Source: http://www.ined.fr/fichier/t_telechargement/51706/telechargement_fichier_fr_publi_pdf1_pop_soc474.pdf (Last visited: December 8th, 2012).
[8] Also, our sample data consists of dates from past centuries up to now.
Clearly, birth dates are not the most uniformly distributed kind of data. Although less influenced by human will, and less dispersed, the probabilities of the death dates in our sample aren't really uniform either, as shown below.
I. Lessons learned
This experimental result raises the question of how, and how much, a non-homogeneous distribution of outcomes affects the results. Qualitatively, this can be understood with set theory: applying a non-homogeneous probability distribution to the target set of a mapping effectively reduces the cardinality of this set, thus raising the chance of collision, since the cardinality of the source set doesn't change. This argument indicates that any non-homogeneous distribution results in a higher chance of finding collisions in the set of outcomes, which corroborates our experimental results. Incidentally, it means that the uniform distribution is the situation in which collisions are hardest to find in this scenario. Quantitatively, however, the general case is a more complicated matter, although one could probably simplify the problem by identifying a well-known distribution model applicable to the data, or by limiting the deviation from a uniform distribution to a small value [9].
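The qualitative argument can be illustrated with an exact computation. For n independent draws from a distribution $p_1, \dots, p_d$, the probability that all draws are distinct is $n!$ times the elementary symmetric polynomial of degree n in the $p_i$. The sketch below uses this identity; the skewed distribution is a made-up example and is not fitted to our data.

```python
import math


def p_collision_nonuniform(probs, n):
    """Exact probability of at least one repeated outcome among n draws."""
    # e[k] accumulates the elementary symmetric polynomial of degree k.
    e = [1.0] + [0.0] * n
    for p in probs:
        for k in range(n, 0, -1):
            e[k] += p * e[k - 1]
    return 1 - math.factorial(n) * e[n]


days = 365
uniform = [1 / days] * days

# Hypothetical skew: 182 "busy" days three times as likely as the other 183.
weights = [1.5] * 182 + [0.5] * 183
total = sum(weights)
skewed = [w / total for w in weights]

print(f"uniform, n = 23: {p_collision_nonuniform(uniform, 23):.4f}")
print(f"skewed,  n = 23: {p_collision_nonuniform(skewed, 23):.4f}")
```

Under these assumptions the skewed distribution gives a higher collision probability than the uniform one for the same 23 draws, in line with the experimental means being above the theoretical values.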
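[9] See NUNNIKHOVEN, Thomas. A birthday problem solution for non uniform birth frequency. The American Statistician, Vol. 46, No. 4, November 1992. Also available: http://www.bioforensics.com/conference06/Related_DB/Bday_problem.pdf (Last visited: December 8th, 2012).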
VI. Conclusion
As we have seen throughout this work, the probability that, in a set of n random people, two of them share the same birthday (otherwise known as the Birthday Paradox) is a problem pertaining to probability theory with many unexpected applications.

In the field of cryptography, namely hash functions, we have seen how it is computationally possible to find two messages with the same hash value, H(x) = H(y), even when a strongly collision-free hash function is being used. This represents a very important risk, given that one of the main purposes of cryptographic hash functions is to help ensure the integrity and authenticity of messages. A contemporary solution to this problem has been to increase the size of the hash value, which needs to be large enough to make a birthday attack computationally infeasible.

We have also discussed how the chance of a collision varies depending on whether we're dealing with homogeneous or non-homogeneous probability distributions. In this sense, we observe that the risk of collision is higher when the target distribution is not uniform than when it is uniform.

With regard to other applications of the Birthday Paradox, we have also studied Pseudo-Random Number Generators (PRNGs) as well as applications in communication networks. PRNGs make the task of generating data in computer-based applications more efficient by omitting some of the many requirements of a true random number generator. This is particularly useful for applications that require a large amount of random data. PRNGs provide a reasonably random set of numbers while preserving a good rate of (pseudo-)randomness availability. In communication networks, the birthday paradox is used to take advantage of the predictable risk of collision to achieve certain goals, as we have seen above.

With regard to the reality check test, our experiments and the qualitative argument both indicate that a non-homogeneous probability distribution results in a greater chance of collision. Consequently, a uniform distribution is the most difficult scenario in which to find collisions.
VII. Annex
Exercise 1.6.3
VIII. Bibliography
Birthday distribution, David Gleich's Notebook. Stanford University, 2010. Source: http://www.stanford.edu/~dgleich/notebook/2009/04/birthday_distribution.html (Last visited: December 8th, 2012).
Cryptographic Hashes, CS461/ECE422. Source: https://wiki.engr.illinois.edu/download/attachments/183272958/cryptohash.pptx?version=1&modificationDate=1315004111000 (Last visited: December 8th, 2012).
Definitions of Managed Objects for Asymmetric Digital Subscriber Line 2 (ADSL2). The Internet Society (2006). Source: http://www.ietf.org/rfc/rfc4429.txt (Last visited: December 8th, 2012).
Leap Day. Eric Weisstein's World of Astronomy. Source: http://scienceworld.wolfram.com/astronomy/LeapDay.html (Last visited: December 8th, 2012).
The Pigeonhole Principle, The Hong Kong University of Science and Technology, Department of Mathematics. Source: http://www.math.ust.hk/~mabfchen/Math391I/Pigeonhole.pdf (Last visited: December 8th, 2012).
HALEVI, Shai. Cryptographic Hash Functions and their many applications, IBM Research. USENIX Security, August 2009. Source: http://static.usenix.org/events/sec09/tech/slides/halevi.pdf (Last visited: December 8th, 2012).
MURPHY, Roy. An Analysis of the Distributions of Birthdays in a Calendar Year. Source: http://www.panix.com/~murphy/bday.html (Last visited: December 8th, 2012).
NUNNIKHOVEN, Thomas. A birthday problem solution for non uniform birth frequency. The American Statistician, Vol. 46, No. 4, November 1992. Also available: http://www.bioforensics.com/conference06/Related_DB/Bday_problem.pdf (Last visited: December 8th, 2012).
RAMAKRISHNA, Siva and BHAVANI, Ganga. An Efficient Continuous Neighbour Discovery in Asynchronous Sensor Networks. International Journal of Computer Application, Issue 2, Volume 1, February 2012. Available: http://rspublication.com/ijca/feb-12/21.pdf (Last visited: December 8th, 2012).
RÉGNIER-LOILIER, Arnaud & ROHRBASSER, Jean-Marc. Population et société : Y a-t-il une saison pour faire des enfants ?. INED, January 2011. Source: http://www.ined.fr/fichier/t_telechargement/51706/telechargement_fichier_fr_publi_pdf1_pop_soc474.pdf (Last visited: December 8th, 2012).
STAMM, Stephanie. Hash Functions and the Birthday Attack. United States Naval Academy. Midshipman 1/C. April 23, 2010. Source: http://www.dean.usma.edu/departments/math/courses/ma498/SASMC/slides/Stamm%20USNA.pdf (Last visited: December 8th, 2012).
STERLING GRAH, Joseph. Hash functions in cryptography. Master thesis, The University of Bergen, 2008. Source: https://bora.uib.no/bitstream/handle/1956/3206/47401627.pdf?sequence=1 (Last visited: December 8th, 2012).
SZMIT et al. Domain Name Servers Pseudo-Random Number Generators and DNS Cache Poisoning Attack. Source: http://maciej.szmit.info/documents/szmit_tomaszewski_szmit_DNS_ebook.pdf (Last visited: December 8th, 2012).
The Rice University Monarch Project. Source: http://monarch.cs.rice.edu/~santa/research/powermode/powermode.pdf (Last visited: December 8th, 2012).
ZIMMERMAN, Neetzan. Infographic Illustrates Most Common Birthdays, Baby-Making Days. Gawker, 2012. Source: http://gawker.com/5910778/infographic-illustrates-most-commonbirthdays-baby+making-days (Last visited: December 8th, 2012).