
CS 6841 - Assignment 1

TR Sriram, EE12B056

Collaborators: EE12B056 (myself) and EE12B128 (Shijin)


1. (10 points) (Designing a Tournament) This question will help you understand the
applications of the Chernoff bound and the union bound, as you'll deal with tuning parameters to optimize the efficiency of the system. You'll also get an understanding of
why NBA tournaments are designed the way they are, with more repeated matches being
played as we get closer to the finals. For example, in the NBA, the early rounds are
best-of-five, and the later rounds become best-of-seven. You will understand why, and
also learn how to design tournaments with n participants, for large n.
Suppose there are n teams, and they are totally ranked. That is, there is a well-defined
best team, second-ranked team and so on. It's just that we (the algorithm designer) don't
know the ranking. Moreover, assume that for any given match between two teams, the
better-ranked team wins the match with probability p = 1/2 + ε, independent of all
other matches between these teams and all other teams. Here ε is a small positive
constant.
(a) (2 points) Let n be a power of two, and fix an arbitrary tournament tree starting
with n/2 matches, then n/4 matches and so on. What is the probability that the
best team wins the tournament?
Solution: (1/2 + ε)^log2(n)
Proof. The best team wins each match with probability 1/2 + ε, and it must win
log2(n) matches to win the tournament; by independence, the probability of the
best team winning the tournament is (1/2 + ε)^log2(n).
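The closed form above can be checked numerically. Below is a small Monte Carlo sketch of the bracket model (a sanity check, not part of the required solution; the helper names are my own):

```python
import random

def play_match(a, b, eps, rng):
    # lower index = better rank; the better team wins with probability 1/2 + eps
    better, worse = (a, b) if a < b else (b, a)
    return better if rng.random() < 0.5 + eps else worse

def tournament_winner(teams, eps, rng):
    # single-elimination bracket: n/2 matches, then n/4 matches, and so on
    while len(teams) > 1:
        teams = [play_match(teams[i], teams[i + 1], eps, rng)
                 for i in range(0, len(teams), 2)]
    return teams[0]

def estimate_best_wins(n, eps, trials=20000, seed=0):
    # empirical probability that the best team (rank 0) wins the tournament
    rng = random.Random(seed)
    wins = sum(tournament_winner(list(range(n)), eps, rng) == 0
               for _ in range(trials))
    return wins / trials
```

For n = 8 and ε = 0.1 the theory gives (0.6)³ = 0.216, and the simulation lands close to it.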
(b) (3 points) Use Chernoff bounds to bound the probability that the best team will
not win the tournament, if each match-up occurs as a best-of-k series. How many
games do you end up conducting in total to get a 1 − δ probability of the best team
winning?
Solution:
Proof. 1. Let S = number of matches the best team wins in a best-of-k series; the
best team loses the series only if S ≤ k/2.
E[S] = μ = k(1/2 + ε)
Let θ = 2ε/(2ε + 1), so that (1 − θ)μ = k(1/2 + ε)/(2ε + 1) = k/2.
By using the Chernoff bound for the lower tail we get
Pr[best team loses the series] = Pr[S ≤ k/2] = Pr[S ≤ (1 − θ)μ]
≤ exp(−θ²μ/2)
= exp(−(θ²/2)(1/2 + ε) · k)
= exp(−αk), where α = (θ²/2)(1/2 + ε) = ε²/(2ε + 1)
This implies
Pr[best team loses one series] ≤ exp(−αk)
Pr[best team loses the tournament] ≤ log2(n) · exp(−αk), by a union bound over
its log2(n) series.
2. Number of games conducted in order to get a probability 1 − δ of winning:
Pr[best team wins the tournament] ≥ 1 − log2(n) · exp(−αk) ≥ 1 − δ
⟸ log2(n) · exp(−αk) ≤ δ
⟺ αk ≥ log(log2(n)/δ)
⟺ k ≥ (1/α) · log(log2(n)/δ), where α = ε²/(2ε + 1) as above.
Thus each series has O(log(log(n)/δ)) games, and since a single-elimination bracket
has n − 1 series, the total number of games conducted in the tournament is at most
k(n − 1), which is of the order of O(n log(log(n)/δ)).

(c) (5 points) Now design a tournament with a total of O_δ(n) games to get a 1 − δ
probability of the best team winning eventually. O_δ(n) means O(n) for all constant
δ > 0.
Solution:
Proof. Instead of playing a constant number of games in each round, we increase
the series length each round: round i (for i = 1, ..., log2(n)) is played as a
best-of-(k + i − 1) series. Round i has n/2^i matches, so the total number of
games is
Σ_{i=1}^{log2(n)} (n/2^i)(k + i − 1) ≤ kn + n · Σ_{i≥1} (i − 1)/2^i ≤ (k + 1)n ≤ (2k + 1)n.
If the lower bound on k is just a function of ε and δ, independent of n, the
number of games will be of the order of O(n).
Pr[best team loses round i] ≤ exp(−α(k + i − 1)), with α = ε²/(2ε + 1) as in part (b).
Pr[best team loses the tournament] ≤ Σ_{i=1}^{log2(n)} exp(−α(k + i − 1))
= exp(−αk) · (1 − exp(−α · log2(n)))/(1 − exp(−α)) < exp(−αk) · 1/(1 − exp(−α))
Requiring this to be at most δ gives
k ≥ (1/α) · log(1/(δ(1 − exp(−α))))
We can take a k which is just above this lower bound, and it is independent of n.
The total number of games played is at most (2k + 1)n, which is of the order O(n)
since k is independent of n.
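A quick numerical check of the growing-series design (a sketch; the helper name is mine): round i gets a best-of-(k + i − 1) series, and both the game count and the failure bound can be computed directly.

```python
import math

def growing_series_tournament(n, eps, k):
    # round i has n / 2**i matches, each a best-of-(k + i - 1) series
    alpha = eps ** 2 / (2 * eps + 1)
    rounds = int(math.log2(n))
    total_games = sum((n // 2 ** i) * (k + i - 1) for i in range(1, rounds + 1))
    # union bound over rounds, as in the derivation above
    fail = sum(math.exp(-alpha * (k + i - 1)) for i in range(1, rounds + 1))
    return total_games, fail
```

Even with k in the hundreds the game count stays well under (2k + 1)·n while the failure probability stays small.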

2. (25 points) (Counting Distinct Elements) Universal hash families come in handy
when designing streaming algorithms that make a single pass and use very little space in
comparison to the size of the input stream. Here, we will construct a streaming algorithm
to estimate the number of distinct elements in the stream. The technique here is from a
seminal work of Alon, Matias and Szegedy (STOC 1996); the authors were awarded the
Gödel prize in 2005 for this work.
(a) (5 points) Recall that for a prime p, and an integer k, where 1 ≤ k ≤ p, the
collection:
H_k = { h_ab : x ↦ ((ax + b) mod p) mod k | a, b ∈ F_p, a ≠ 0 }
is a 2-universal hash family. Prove the following lemma:
Lemma 1. For every set S ⊆ F_p,
1. if |S| < k, then
Pr_{h∈H_4k}[∃x ∈ S : h(x) = 0] < 1/4;
2. while, if |S| ≥ 2k, then
Pr_{h∈H_4k}[∃x ∈ S : h(x) = 0] ≥ 3/8.
Solution:


Proof. When |S| < k: by the union bound,
Pr_{h∈H_4k}[∃x ∈ S : h(x) = 0] ≤ Σ_{x∈S} Pr_{h∈H_4k}[h(x) = 0]
= Σ_{x∈S} 1/(4k) = |S|/(4k) < k/(4k) = 1/4.

When |S| ≥ 2k: it suffices to consider a subset of S of size exactly 2k. By using the
Bonferroni (inclusion-exclusion) inequality we get:
Pr_{h∈H_4k}[∃x ∈ S : h(x) = 0]
≥ Σ_{x∈S} Pr_{h∈H_4k}[h(x) = 0] − Σ_{{x,y}⊆S: x≠y} Pr_{h∈H_4k}[h(x) = 0 ∧ h(y) = 0]
≥ Σ_{x∈S} 1/(4k) − Σ_{{x,y}⊆S: x≠y} 1/(16k²)
= |S|/(4k) − C(|S|, 2)/(16k²)
≥ 2k/(4k) − (2k · 2k)/(2 · 16k²)
= 1/2 − 1/8 = 3/8.
(The pairwise term uses 2-universality: Pr[h(x) = 0 ∧ h(y) = 0] ≤ 1/(16k²) for x ≠ y.)
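Lemma 1 can be verified exhaustively for small parameters. The sketch below (my own check, not part of the solution) enumerates every h_ab in H_4k for p = 101 and k = 2 and computes the exact probability that some element of S hashes to 0:

```python
def exists_zero_prob(S, p, m):
    # exact Pr over h_ab(x) = ((a*x + b) % p) % m, a != 0, that some x in S maps to 0
    hits = 0
    for a in range(1, p):
        for b in range(p):
            if any(((a * x + b) % p) % m == 0 for x in S):
                hits += 1
    return hits / ((p - 1) * p)
```

With p = 101 and range m = 4k = 8, a set of size 1 < k triggers a zero with probability about 0.13 < 1/4, while a set of size 4 = 2k does so with probability well above 3/8, matching the two cases of the lemma.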

(b) (10 points) Consider a stream of integers: x1, x2, ...; where each xi ∈ [n]. For a
parameter k, where 1 ≤ k ≤ n, design a streaming algorithm with the following
guarantees:
1. if the stream contains strictly fewer than k distinct elements, the output is No
with probability at least 1 − 1/n²;
2. while, if the stream contains at least 2k distinct elements, then the output is
Yes with probability at least 1 − 1/n².
The algorithm should proceed in three phases as in the template solution below.
Fill in the details of the algorithm in the given template; you need not provide
implementation details but should be precise in your description.
Solution:
Initialization: (pre-processing done before making the pass)
1. Pick t = 384 log n independent random hash functions g1, ..., gt from the
2-universal hash family H_4k : [n] → [4k] mentioned in the above question
2. Allocate t bits of memory and set them to 0
Processing: (x ∈ [n] is the element to process)
1. For each j = 1, ..., t, compute gj(x); if gj(x) = 0, set the j-th bit to 1
Output: (called after the input is processed)
Count the number of bits set; if it is at least 5t/16 then output Yes, else output
No

Prove the guarantees of the algorithm designed above and calculate the space used
by the algorithm (in O(·) notation).
Solution:
Proof. Bit j is set at the end of the pass iff some stream element hashes to 0
under gj. By Lemma 1, each bit is therefore set with probability < 1/4 when the
number of distinct elements is < k, and with probability ≥ 3/8 when it is ≥ 2k;
the bits are independent since the hash functions are. We can exploit this
probability gap: count the set bits and say Yes iff the count exceeds a threshold
placed between the two cases. To get a probability guarantee of 1/n² we set the
threshold to 5t/16.
When the number of distinct elements is < k:
Each bit is set with probability < 1/4, so μ = E[number of bits set] < t/4.
By using the Chernoff bound for the upper tail, with deviation δ = 1/4 (so that
(1 + δ) · t/4 = 5t/16), we get
Pr[number of bits set ≥ (1 + 1/4) · t/4] ≤ exp(−δ²μ/(2 + δ))
= exp(−(1/16)(t/4)/(2 + 1/4)) = exp(−t/144)
= exp(−384 log n/144) ≤ exp(−2 log n) = 1/n²

When the number of distinct elements is ≥ 2k:
Each bit is set with probability ≥ 3/8, so μ ≥ 3t/8.
By using the Chernoff bound for the lower tail, with deviation δ = 1/6 (so that
(1 − δ) · 3t/8 = 5t/16), we get
Pr[number of bits set < 5t/16] ≤ exp(−δ²μ/2)
= exp(−(1/36)(3t/8)/2) = exp(−t/192)
= exp(−384 log n/192) = exp(−2 log n) = 1/n²

The space used by this algorithm is O(t) = O(384 log n) = O(log n) bits.
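A runnable sketch of the tester above (the structure follows the template; t is set to a small constant here rather than 384 log n, and the helper names are mine):

```python
import random

def make_hash(p, m, rng):
    # one member of the 2-universal family: x -> ((a*x + b) mod p) mod m
    a, b = rng.randrange(1, p), rng.randrange(p)
    return lambda x: ((a * x + b) % p) % m

def distinct_threshold_test(stream, k, t=200, p=10**9 + 7, seed=0):
    # Yes if the stream has >= 2k distinct elements, No if fewer than k
    # (the answer is unspecified for counts in between)
    rng = random.Random(seed)
    hashes = [make_hash(p, 4 * k, rng) for _ in range(t)]
    bits = [0] * t
    for x in stream:                      # a single pass over the stream
        for j, h in enumerate(hashes):
            if h(x) == 0:
                bits[j] = 1
    return "Yes" if sum(bits) >= 5 * t / 16 else "No"
```

Note the tester never stores the stream: it keeps only the t bits (plus the hash coefficients), which is what makes it a streaming algorithm.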


(c) (10 points) Using the above, design a streaming algorithm that outputs an estimate
of the number of distinct elements, up to a factor of 2, with probability at least 1 − 1/n
(n is a parameter passed to the algorithm). As before, the algorithm should be
in three phases (and may use the above procedures). Prove the guarantees and
calculate the total space used by the algorithm.


Solution:
Initialization: (pre-processing done before making the pass)
1. For each i = 0, 1, ..., log2(n), instantiate the part (b) tester with parameter
k_i = 2^i, picking its random hash functions from the 2-universal family
H_4k_i : [n] → [4k_i] of the first part (so the range sizes run over
4 · 1, 4 · 2, 4 · 4, ..., 4 · 2^log2(n))
2. Allocate O(log n) bits for each tester, a total of O((log n)²) bits, and set
them to 0
Processing: (x ∈ [n] is the element to process)
1. Feed x to every tester, i.e., for each tester, set its j-th bit whenever its
j-th hash function maps x to 0, as in part (b)
Output: (called after the input is processed)
Query the testers in increasing order of range size (i.e., 4 · 1, 4 · 2, 4 · 4, ...),
stop at the first one that outputs No, and output its parameter k_i (its range
size divided by 4)
Solution:
Proof.
Basically we are running a set of parallel estimators and then outputting the best
estimate: the previous algorithm combined with a search over powers of two. Run
the part (b) tester in parallel for each of the log2(n) + 1 values k_i = 2^i; after
processing the data, check the outputs in increasing order of k_i and stop at the
first No. Let d be the true number of distinct elements, and suppose every tester
succeeds. A tester outputs Yes whenever d ≥ 2k_i and No whenever d < k_i, so the
first No occurs at some k̂ with d < 2k̂, while the preceding tester's Yes gives
d ≥ k̂/2. Hence k̂/2 ≤ d < 2k̂, i.e., the output k̂ estimates d up to a factor of 2.
To prove that this algorithm works we need to bound the probability of it failing
and show that it is very small:
Pr[the above algorithm fails] ≤ Pr[some parallel tester fails]
≤ Σ_{i=0}^{log2(n)} Pr[i-th tester fails] ≤ (log2(n) + 1) · (1/n²) ≤ 1/n
The above algorithm thus works with a very high probability guarantee of 1 − 1/n.
The space used is just the sum of the space used by the individual testers,
Σ_{i=0}^{log2(n)} O(log n), which is of the order of O((log n)²).
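Putting parts (b) and (c) together, here is a self-contained sketch of the whole estimator (names mine; t is a small constant instead of 384 log n, and the sequential loop over testers stands in for the single parallel pass):

```python
import random

def make_hash(p, m, rng):
    # one member of the 2-universal family: x -> ((a*x + b) mod p) mod m
    a, b = rng.randrange(1, p), rng.randrange(p)
    return lambda x: ((a * x + b) % p) % m

def threshold_test(stream, k, rng, t=200, p=10**9 + 7):
    # part (b): True (Yes) if >= 2k distinct elements, False (No) if < k
    hashes = [make_hash(p, 4 * k, rng) for _ in range(t)]
    bits = [0] * t
    for x in stream:
        for j, h in enumerate(hashes):
            if h(x) == 0:
                bits[j] = 1
    return sum(bits) >= 5 * t / 16

def estimate_distinct(stream, n, seed=0):
    # query the testers in increasing order of k; the first No gives the estimate
    rng = random.Random(seed)
    stream = list(stream)   # the real algorithm runs all testers in one parallel pass
    k = 1
    while k <= n:
        if not threshold_test(stream, k, rng):
            return k
        k *= 2
    return n
```

If the true count d falls between k and 2k for the tester at k, that tester's answer is unspecified, which is why the final estimate is only guaranteed up to a factor of 2.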



(d) (5 extra credit points) Show how the above algorithm can be modified to obtain a
factor (1 + ε)-approximation with high probability.
(e) (10 extra credit points) Can you reduce the space even further while maintaining
the guarantees as in part c? The improvement should be asymptotic, and not just
by a constant factor.
3. (15 points) (Locality Sensitive Hashing) Given a distance metric d : X × X → R≥0,
an LSH family is a collection:
H = { h : X → [M] }
such that, for all x, y ∈ X:
1. if d(x, y) ≤ R,
Pr_{h∈H}[h(x) = h(y)] ≥ p1;
2. while, if d(x, y) > cR,
Pr_{h∈H}[h(x) = h(y)] < p2;
3. moreover, this (collision) probability is a non-increasing function of d(x, y).
Here, R, c, p1, p2 are parameters that determine the quality of the hash family. As was
outlined in class, Indyk and Motwani designed an algorithm to store n data-points so
that c-approximate near-neighbor queries may be answered in sub-linear time.
Recall that the algorithm consists of two components:
Preprocessing: (given the data-points x1, ..., xn, and two integer parameters k, l
chosen appropriately.)
1. Choose l hash functions, g1, ..., gl ∈ H^k and initialize hash tables for each of
them. In other words, each gi is the concatenation of k (randomly chosen) hash
functions from H.
2. Insert each data point into each of the l hash tables.
Query: (given data-point y.)
1. Iterate over the points mapped to gi(y) for 1 ≤ i ≤ l in the hash tables and
output all points xj at distance at most cR.
2. Abort the above search if more than 10l data-points have already been looked
at among all the hash tables.
(a) (3 points) Calculate the value of k so that, if x, y ∈ X satisfy d(x, y) > cR, then:
Pr_{g∈H^k}[g(x) = g(y)] < 1/n.
Show hence that the probability of aborting the query procedure is at most 1/5.
Solution: Let x, y ∈ X be such that d(x, y) > cR. For g(x) = g(y), each of the k
independent component hash values must match:
Pr_{g∈H^k}[g(x) = g(y)] = Π_{j=1}^{k} Pr_{h_j∈H}[h_j(x) = h_j(y)] < (p2)^k
Setting (p2)^k ≤ 1/n:
k · log(p2) ≤ log(1/n) ⟺ k ≥ log_{1/p2}(n), so take k = ⌈log_{1/p2}(n)⌉.
With this k, the expected number of points at distance > cR in y's bucket of one
table is at most n · (1/n) = 1, hence at most l in expectation over all the tables.
By Markov's inequality,
Pr[far points examined over all the tables ≥ 10l]
≤ E[far points over all the tables]/(10l) ≤ l/(10l) = 1/10 < 1/5.
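A one-liner sketch for this choice of k (assuming the derivation above; the function name is mine):

```python
import math

def lsh_k(n, p2):
    # smallest k with p2**k <= 1/n, i.e. k = ceil(log n / log(1/p2))
    return math.ceil(math.log(n) / math.log(1 / p2))
```

For example, with n = 10^6 and p2 = 1/2 this gives k = 20, and indeed 2^−20 < 10^−6.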

(b) (6 points) For the value of k calculated above, calculate the collision probability,
under a random g, of a data-point x that is indeed within distance R from y.
Calculate l such that the probability that none of the hash functions gi cause a
collision between x and y is at most 1/5.
Solution: A point x within distance R of y collides with it if any one of the l hash
values agrees, i.e., ∃i s.t. gi(x) = gi(y). For each i,
Pr[gi(x) = gi(y)] ≥ (p1)^k
Pr[gi(x) ≠ gi(y)] ≤ 1 − (p1)^k
Pr[gi(x) ≠ gi(y) for all i = 1, ..., l] ≤ (1 − (p1)^k)^l,
and thus the probability that there is a collision is at least 1 − (1 − (p1)^k)^l. For
this part, we need Pr[no collision] ≤ (1 − (p1)^k)^l ≤ 1/5
⟹ l ≥ log(1/5)/log(1 − (p1)^k)
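The corresponding choice of l, as a sketch (helper name mine; it recomputes k as in part (a)):

```python
import math

def lsh_l(n, p1, p2):
    # number of tables so that Pr[no table gives a collision] <= 1/5
    k = math.ceil(math.log(n) / math.log(1 / p2))
    return math.ceil(math.log(1 / 5) / math.log(1 - p1 ** k))
```

Since (1 − p1^k)^l ≤ exp(−l · p1^k), taking l ≈ p1^−k · ln 5 also suffices, and with k = log_{1/p2}(n) this is O(n^ρ) for ρ = log(1/p1)/log(1/p2) < 1, which is what makes the query time sublinear.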


(c) (6 points) State the guarantees of the algorithm: output, space-complexity, time-complexity and confidence for the values of k and l calculated above.
Solution: The guarantees of the above algorithm are:
Time complexity:
Preprocessing: O(n · l · k · F), where F is the time taken to evaluate an individual
hash function from H. There are n data points, and for each of them we evaluate
k · l component hash functions, so the order is O(n · l · k · F).
Query time: O(k · l · F + 10l · D), where D is the time taken to compute a
distance. We compute k · l component hash values for the query point and examine
at most 10l candidate points.
Space complexity: O(n · l). There are l hash tables and n data points; each point
occupies unit space in each table, so the total space occupied is O(n · l).
Confidence: by parts (a) and (b), the query aborts with probability at most 1/5
and misses a near point with probability at most 1/5, so if some point lies within
distance R of y, the query outputs a point within distance cR with probability at
least 1 − 1/5 − 1/5 = 3/5.

