TR Sriram, EE12B056
Suppose the best player wins each game independently with probability $\frac{1}{2} + \epsilon$, and each match consists of $2k$ games, the best player losing the match if he wins at most $k$ of them. Let $S \sim \mathrm{Bin}(2k, \frac{1}{2} + \epsilon)$ be the number of games he wins, with mean $\mu = 2k(\frac{1}{2} + \epsilon)$. Then $k = (1 - \gamma)\mu$ for $\gamma = \frac{2\epsilon}{2\epsilon + 1}$, and by the Chernoff lower-tail bound:

$$\Pr[\text{Losing a match}] = \Pr[S \le k] = \Pr\left[S \le (1 - \gamma) \cdot 2k\left(\tfrac{1}{2} + \epsilon\right)\right] \le \exp\left(-\frac{\gamma^2}{2} \cdot 2k\left(\tfrac{1}{2} + \epsilon\right)\right) = \exp(-\lambda k),$$

where $\lambda = \gamma^2\left(\tfrac{1}{2} + \epsilon\right) = \frac{2\epsilon^2}{2\epsilon + 1}$.

This implies

$$\Pr[\text{Best player losing a match}] \le \exp(-\lambda k)$$
$$\Pr[\text{Best player losing the tournament}] \le \log_2(n)\exp(-\lambda k),$$

by a union bound over the $\log_2(n)$ rounds of the knockout tournament.
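As a sanity check on this bound (a minimal sketch, assuming the per-game win probability $\frac{1}{2} + \epsilon$ model above), the exact binomial tail can be compared against $\exp(-\lambda k)$:

```python
import math

def match_loss_prob(k, eps):
    """Exact Pr[S <= k] for S ~ Bin(2k, 1/2 + eps): the probability
    that the best player loses a 2k-game match."""
    p = 0.5 + eps
    return sum(math.comb(2 * k, s) * p**s * (1 - p)**(2 * k - s)
               for s in range(k + 1))

def chernoff_bound(k, eps):
    """The bound exp(-lambda*k) with lambda = 2*eps^2 / (2*eps + 1)."""
    lam = 2 * eps**2 / (2 * eps + 1)
    return math.exp(-lam * k)
```

For example, with $\epsilon = 0.1$ and $k = 100$ the exact loss probability is far below the Chernoff bound, as expected: the bound is valid but not tight.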
2. Number of games to be conducted in order to get a probability of $1 - \delta$ of winning:

$$\Pr[\text{Best player winning the tournament}] \ge 1 - \log_2(n)\exp(-\lambda k) \ge 1 - \delta$$
$$\Leftarrow \log_2(n)\exp(-\lambda k) \le \delta$$
$$\Leftarrow \exp(-\lambda k) \le \frac{\delta}{\log_2(n)}$$
$$\Leftarrow k \ge \frac{1}{\lambda}\log\left(\frac{\log_2(n)}{\delta}\right), \text{ where } \lambda = \gamma^2\left(\tfrac{1}{2} + \epsilon\right) = \frac{2\epsilon^2}{2\epsilon + 1}.$$

Thus the number of games per match is of the order $O(\log(\frac{\log(n)}{\delta}))$ for constant $\epsilon$, and the total number of games conducted in the tournament is at most $2kn$, which is of the order $O(n \log(\frac{\log(n)}{\delta}))$.
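A small numeric illustration of this choice of $k$ (the values of $n$, $\epsilon$, $\delta$ below are hypothetical, chosen only for the example):

```python
import math

def games_per_match(n, eps, delta):
    """Smallest integer k with lambda*k >= log(log2(n)/delta),
    where lambda = 2*eps^2 / (2*eps + 1)."""
    lam = 2 * eps**2 / (2 * eps + 1)
    return math.ceil(math.log(math.log2(n) / delta) / lam)

n, eps, delta = 1024, 0.1, 0.01
k = games_per_match(n, eps, delta)
total_games = 2 * k * n   # at most 2k games in each of fewer than n matches
```

By construction $\log_2(n)\exp(-\lambda k) \le \delta$, so the best player wins with probability at least $1 - \delta$.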
(c) (5 points) Now design a tournament with a total of $O_\epsilon(n)$ games to get $1 - \delta$ probability of the best team winning eventually. $O_\epsilon(n)$ means $O(n)$ for every constant $\epsilon > 0$.
Solution:
Proof. Instead of playing a constant number of games in each round, we increase the number of games from round to round: in round $i$ (for $i = 1, \ldots, \log_2(n)$) each match consists of $2(k + i - 1)$ games. By the analysis above, the best player loses his round-$i$ match with probability at most $\exp(-\lambda(k + i - 1))$, so by a union bound

$$\Pr[\text{Best player losing the tournament}] \le \sum_{i=1}^{\log_2(n)} \exp(-\lambda(k + i - 1)) = \exp(-\lambda k)\left(\frac{1 - \exp(-\lambda \log_2(n))}{1 - \exp(-\lambda)}\right) < \frac{\exp(-\lambda k)}{1 - \exp(-\lambda)}.$$

Setting this to be at most $\delta$ gives

$$k \ge \frac{1}{\lambda}\log\left(\frac{1}{\delta(1 - \exp(-\lambda))}\right),$$

which depends only on $\epsilon$ and $\delta$. Since round $i$ consists of $n/2^i$ matches of $2(k + i - 1)$ games each, the total number of games is

$$\sum_{i=1}^{\log_2(n)} \frac{n}{2^i} \cdot 2(k + i - 1) \le 2n\left(k\sum_{i \ge 1} \frac{1}{2^i} + \sum_{i \ge 1} \frac{i - 1}{2^i}\right) = 2n(k + 1) = O_\epsilon(n).$$
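The geometric-series bound on the total game count can also be checked directly (a sketch; round $i$ is assumed to hold $n/2^i$ matches of $2(k + i - 1)$ games, as above):

```python
import math

def total_games_growing_rounds(n, k):
    """Total games when round i (of log2(n) rounds) has n/2^i matches
    of 2*(k + i - 1) games each."""
    rounds = int(math.log2(n))
    return sum((n // 2**i) * 2 * (k + i - 1) for i in range(1, rounds + 1))
```

Since $\sum_{i \ge 1} i/2^i$ converges, the total stays below $2n(k+1)$, i.e., linear in $n$ once $k$ is fixed by $\epsilon$ and $\delta$.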
2. (25 points) (Counting Distinct Elements) Universal hash families come in handy
when designing streaming algorithms that make a single pass and use very little space in
comparison to the size of the input stream. Here, we will construct a streaming algorithm
to estimate the number of distinct elements in the stream. The technique here is from a
seminal work of Alon, Matias and Szegedy (STOC 1996); the authors were awarded the
Gödel Prize in 2005 for this work.
(a) (5 points) Recall that for a prime $p$ and an integer $k$, where $1 \le k \le p$, the collection:
$$\mathcal{H}_k = \{h_{a,b} : x \mapsto ((ax + b) \bmod p) \bmod k \mid a, b \in \mathbb{F}_p,\ a \neq 0\}$$
is a 2-universal hash family. Prove the following lemma:
Lemma 1. For every set $S \subseteq \mathbb{F}_p$,
1. if $|S| < k$, then $\Pr_{h \in \mathcal{H}_{4k}}[\exists x \in S : h(x) = 0] < 1/4$;
2. while, if $|S| \ge 2k$, then $\Pr_{h \in \mathcal{H}_{4k}}[\exists x \in S : h(x) = 0] \ge 3/8$.
Solution:
Proof. For the first part, suppose $|S| < k$. For every fixed $x$, $\Pr_{h \in \mathcal{H}_{4k}}[h(x) = 0] = \frac{1}{4k}$, so by the union bound:

$$\Pr_{h \in \mathcal{H}_{4k}}[\exists x \in S : h(x) = 0] \le \sum_{x \in S} \Pr_{h \in \mathcal{H}_{4k}}[h(x) = 0] = \sum_{x \in S} \frac{1}{4k} = \frac{|S|}{4k} < \frac{k}{4k} = \frac{1}{4}.$$

For the second part, suppose $|S| \ge 2k$; since the event is monotone in $S$, it suffices to prove the bound for a subset of size exactly $2k$, so assume $|S| = 2k$. By pairwise independence, $\Pr[h(x) = 0 \wedge h(y) = 0] \le \frac{1}{16k^2}$ for distinct $x, y$, so by inclusion–exclusion (Bonferroni):

$$\Pr_{h \in \mathcal{H}_{4k}}[\exists x \in S : h(x) = 0] \ge \sum_{x \in S} \Pr[h(x) = 0] - \sum_{\{x,y\} \subseteq S : x \neq y} \Pr[h(x) = 0 \wedge h(y) = 0]$$
$$\ge \frac{|S|}{4k} - \binom{|S|}{2}\frac{1}{16k^2} \ge \frac{2k}{4k} - \frac{(2k)^2}{2} \cdot \frac{1}{16k^2} = \frac{1}{2} - \frac{1}{8} = \frac{3}{8}.$$
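Since the family is small, both bounds of Lemma 1 can be verified exhaustively for a toy instance (a sketch assuming the $h_{a,b}(x) = ((ax + b) \bmod p) \bmod 4k$ form above; $p = 31$ and $k = 2$ are arbitrary illustrative choices):

```python
from itertools import product

def prob_zero_hit(p, k, S):
    """Pr over all h in H_{4k} (i.e., all pairs (a, b) with a != 0) that
    some x in S satisfies h(x) = ((a*x + b) % p) % (4*k) == 0."""
    hits = sum(1 for a, b in product(range(1, p), range(p))
               if any(((a * x + b) % p) % (4 * k) == 0 for x in S))
    return hits / ((p - 1) * p)
```

For $p = 31$, $k = 2$ (range size $4k = 8$): a singleton set hits $0$ with probability $4/31 \approx 0.13 < 1/4$, while a set of size $2k = 4$ does so with probability above $3/8$.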
(b) (10 points) Consider a stream of integers: $x_1, x_2, \ldots$; where each $x_i \in [n]$. For a
parameter $k$, where $1 \le k \le n$, design a streaming algorithm with the following
guarantees:
1. if the stream contains strictly less than $k$ distinct elements, the output is No
with probability at least $1 - 1/n^2$;
2. while, if the stream contains at least $2k$ distinct elements, then the output is
Yes with probability at least $1 - 1/n^2$.
The algorithm should proceed in three phases as in the template solution below.
Fill in the details of the algorithm in the given template; you need not provide
implementation details but should be precise in your description.
Solution:
Initialization: (pre-processing done before making the pass)
1. Pick $t = 384 \log n$ independent random hash functions $g_1, \ldots, g_t$ from the 2-universal hash family $\mathcal{H}_{4k} : [n] \to [4k]$ mentioned in the above question
2. Allocate $t$ bits of memory $b_1, \ldots, b_t$ and set them to 0
Processing: ($x \in [n]$ is the element to process)
1. For each $i \in \{1, \ldots, t\}$, compute the hash value $g_i(x)$; if $g_i(x) = 0$ then set the corresponding bit $b_i$ to 1
Output: (called after the input is processed)
Find the number of bits set; if it is less than $\frac{5t}{16}$ output No, else output Yes
Prove the guarantees of the algorithm designed above and calculate the space used
by the algorithm (in $O(\cdot)$ notation).
Solution:
Proof. At the end of the pass, the bit $b_i$ is set exactly when some element of the stream hashes to 0 under $g_i$. By Lemma 1, if the number of distinct elements is less than $k$, each bit is set with probability less than $1/4$; if it is at least $2k$, each bit is set with probability at least $3/8$. We can exploit this probability gap: count the number of set bits and say Yes exactly when it exceeds the threshold $\frac{5t}{16}$, which sits midway between $\frac{t}{4}$ and $\frac{3t}{8}$. Since the $g_i$ are drawn independently, the bits are independent and Chernoff bounds apply. Let $X$ be the number of set bits.

When the number of distinct elements is $< k$: here $\mathbb{E}[X] < \frac{t}{4}$, and by the Chernoff upper-tail bound (with deviation $\frac{1}{4}$),

$$\Pr\left[X > \frac{5t}{16}\right] = \Pr\left[X > \left(1 + \frac{1}{4}\right)\frac{t}{4}\right] \le \exp\left(-\frac{(1/4)^2}{2 + 1/4} \cdot \frac{t}{4}\right) = \exp\left(-\frac{t}{144}\right) = \exp\left(-\frac{8}{3}\log n\right) \le \frac{1}{n^2}.$$

When the number of distinct elements is $\ge 2k$: here $\mathbb{E}[X] \ge \frac{3t}{8}$, and by the Chernoff lower-tail bound (with deviation $\frac{1}{6}$),

$$\Pr\left[X < \frac{5t}{16}\right] = \Pr\left[X < \left(1 - \frac{1}{6}\right)\frac{3t}{8}\right] \le \exp\left(-\frac{(1/6)^2}{2} \cdot \frac{3t}{8}\right) = \exp\left(-\frac{t}{192}\right) = \exp(-2\log n) = \frac{1}{n^2}.$$

So both guarantees hold with probability at least $1 - \frac{1}{n^2}$. The bit array uses $t = 384 \log n = O(\log n)$ bits, and storing each hash function takes a further $O(\log n)$ bits.
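A runnable sketch of the algorithm above. The hash family here is a stand-in (a random linear map over one fixed large prime rather than the exact $\mathcal{H}_{4k}$ over $\mathbb{F}_p$ tied to $n$), and $t$ is passed in explicitly, so treat the constants as illustrative:

```python
import random

def distinct_threshold_test(stream, k, t=400, seed=0):
    """Single-pass test: 'Yes' if the stream likely contains >= 2k
    distinct elements, 'No' if it likely contains fewer than k."""
    rng = random.Random(seed)
    p = 2**61 - 1                     # a prime far larger than any element
    hashes = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(t)]
    bits = [0] * t
    for x in stream:                  # the single pass over the stream
        for i, (a, b) in enumerate(hashes):
            if ((a * x + b) % p) % (4 * k) == 0:
                bits[i] = 1           # some element hashed to 0 under g_i
    return "Yes" if sum(bits) > 5 * t / 16 else "No"
```

For example, with $k = 8$ a stream with a single distinct element should report No, while a stream with 64 distinct elements should report Yes.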
Solution:
Initialization: (pre-processing done before making the pass)
1. For each $i \in \{1, \ldots, \log_2(n)\}$, instantiate the part-(b) algorithm with parameter $k = 2^{i-1}$, i.e., pick its hash functions from the 2-universal hash family $\mathcal{H}_{4 \cdot 2^{i-1}} : [n] \to [4 \cdot 2^{i-1}]$ mentioned in the first part. The range ("field") sizes of the instances are thus the powers of two $1 \cdot 4, 2 \cdot 4, 4 \cdot 4, \ldots, 2^{\log_2(n)-1} \cdot 4$
2. Allocate $O(\log n)$ bits for each instance, a total of $O((\log n)^2)$ bits, and set them to 0
Processing: ($x \in [n]$ is the element to process)
1. Feed $x$ to each of the $\log_2(n)$ instances: for every hash function of every instance, if the hash value is 0, set the corresponding bit of that instance to 1
Output: (called after the input is processed)
Query the part-(b) instances in increasing order of range size (i.e., $1 \cdot 4, 2 \cdot 4, 4 \cdot 4, \ldots$), stop at the first one that says No, and output the range size of the corresponding instance divided by 4 (i.e., its parameter $k$) as the estimate
Solution:
Proof. Basically we are using a set of parallel estimators and then outputting the best estimate: the previous algorithm is combined with a search over scales to estimate the number of distinct elements. Run the part-(b) algorithm in parallel for each of the $\log_2(n)$ scales; after processing the data, check the outputs in increasing order of range size and stop at the first No. This pins the number of distinct elements between a lower and an upper bound with very high probability. To prove that the algorithm works we bound the probability that it fails:

$$\Pr[\text{The above algorithm fails}] \le \Pr[\text{any one of the parallel estimators fails}] \le \sum_{i=1}^{\log_2(n)} \frac{1}{n^2} = \frac{\log_2(n)}{n^2} \le \frac{1}{n}.$$

Thus the algorithm works with a very high probability guarantee of $1 - \frac{1}{n}$. The space used is just the sum of the space used by the individual estimators, $\sum_{i=1}^{\log_2(n)} O(\log n)$, which is of the order of $O((\log n)^2)$.
2. Abort the above search if more than 10l data-points have already been looked
at among all the hash tables.
(a) (3 points) Calculate the value of $k$ so that, if $x, y \in X$ satisfy $d(x, y) > cR$, then:
$$\Pr_{g \in \mathcal{H}^k}[g(x) = g(y)] < 1/n.$$
Show hence that the probability of aborting the query procedure is at most 1/5.
Solution: Let $x, y \in X$ be such that $d(x, y) > cR$. For $g(x) = g(y)$, each of the $k$ independent coordinate hash values must match, and each matches with probability at most $p_2$:

$$\Pr_{g \in \mathcal{H}^k}[g(x) = g(y)] = \prod_{j=1}^{k} \Pr_{g_j \in \mathcal{H}}[g_j(x) = g_j(y)] \le p_2^k.$$

Requiring $p_2^k < \frac{1}{n}$, i.e., $k \log(p_2) < \log(\frac{1}{n})$, it suffices to take $k = \lceil \log_{1/p_2}(n) \rceil$.

With this $k$, the expected number of far points (at distance more than $cR$ from the query) that collide with the query across all $l$ hash tables is at most $l \cdot n \cdot \frac{1}{n} = l$. By Markov's inequality,

$$\Pr[\text{more than } 10l \text{ data-points are looked at}] \le \frac{l}{10l} = \frac{1}{10} \le \frac{1}{5}.$$
(b) (6 points) For the value of k calculated above, calculate the collision probability,
under a random g, of a data-point x that is indeed within distance R from y.
Calculate $l$ such that the probability that none of the hash functions $g_i$ cause a
collision between $x$ and $y$ is at most 1/5.
Solution: A data-point $x$ within distance $R$ of the query $y$ collides with it if any one of the $l$ hash tables produces a collision, i.e., $\exists i$ s.t. $g_i(x) = g_i(y)$. For each table:

$$\Pr[g_i(x) = g_i(y)] \ge (p_1)^k$$
$$\Pr[g_i(x) \neq g_i(y)] \le 1 - (p_1)^k$$
$$\Pr[g_i(x) \neq g_i(y) \text{ for all } i \in \{1, \ldots, l\}] \le (1 - (p_1)^k)^l,$$

and thus the probability that there is a collision is at least $1 - (1 - (p_1)^k)^l$. For this part, we need $\Pr(\text{no collision}) \le (1 - (p_1)^k)^l \le \frac{1}{5}$, which gives

$$l \ge \frac{\log(1/5)}{\log(1 - (p_1)^k)}.$$
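Putting the two parts together (a sketch; the values of $n$, $p_1$, $p_2$ below are hypothetical and chosen only to illustrate the formulas):

```python
import math

def lsh_parameters(n, p1, p2):
    """k from part (a): smallest k with p2^k < 1/n;
    l from part (b): smallest l with (1 - p1^k)^l <= 1/5."""
    k = math.ceil(math.log(n) / math.log(1 / p2))
    l = math.ceil(math.log(1 / 5) / math.log(1 - p1**k))
    return k, l
```

For instance, $n = 10^6$, $p_1 = 0.9$, $p_2 = 0.5$ gives $k = 20$ and $l = 13$: far points collide under a fixed $g$ with probability below $10^{-6}$, while a near point collides in some table with probability at least $4/5$.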
(c) (6 points) State the guarantees of the algorithm: output, space-complexity, time-complexity and confidence, for the values of $k$ and $l$ calculated above.
Solution: The guarantees of the above algorithm are:
Time complexity:
Precomputation $= O(n \cdot l \cdot k \cdot F)$, where $F$ is the time taken to evaluate an individual hash function: there are $n$ data points, and for each of them we compute $k \cdot l$ hash values.
Query time $= O(k \cdot l \cdot F + 10l \cdot D)$, where $D$ is the time taken to compute one distance: we compute $k \cdot l$ hash values for the query point and examine at most $10l$ candidate points.
Space complexity: $O(n \cdot l)$: there are $l$ hash tables and $n$ data points; each point occupies unit space in each table, so the total space occupied is $n \cdot l$.