1. Coin flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips required.
(a) Find the entropy H(X) in bits. The following expressions may be useful:
$$\sum_{n=1}^{\infty} r^n = \frac{r}{1-r}, \qquad \sum_{n=1}^{\infty} n r^n = \frac{r}{(1-r)^2}.$$
(b) A random variable X is drawn according to this distribution. Find an "efficient" sequence of yes-no questions of the form, "Is X contained in the set S?" Compare H(X) to the expected number of questions required to determine X.
Solution:
(a) The number X of tosses till the first head appears has the geometric distribution with parameter p = 1/2, where P(X = n) = p q^{n-1}, n ∈ {1, 2, ...}. Hence the entropy of X is
$$H(X) = -\sum_{n=1}^{\infty} p q^{n-1} \log (p q^{n-1})
= -\left[\sum_{n=0}^{\infty} p q^{n} \log p + \sum_{n=0}^{\infty} n p q^{n} \log q\right]
= \frac{-p \log p}{1-q} - \frac{p q \log q}{(1-q)^2}
= \frac{-p \log p - q \log q}{p} = H(p)/p \text{ bits.}$$
If p = 1/2, then H(X) = 2 bits.
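As a quick numerical sanity check (not part of the original solution), we can truncate the series for H(X) and compare it with the closed form H(p)/p derived above; for p = 1/2 the truncated sum should be essentially 2 bits. The function names here are ours, not the text's.

```python
import math

# Truncated series for the entropy of a geometric(p) random variable,
# H(X) = -sum_n p q^(n-1) log2(p q^(n-1)); the tail beyond `terms` is negligible.
def geometric_entropy(p, terms=500):
    q = 1 - p
    return -sum(p * q ** (n - 1) * math.log2(p * q ** (n - 1))
                for n in range(1, terms + 1))

# Closed form from the solution: H(X) = H(p)/p.
def closed_form(p):
    q = 1 - p
    return (-p * math.log2(p) - q * math.log2(q)) / p

print(geometric_entropy(0.5), closed_form(0.5))
```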
Let y = g(x). Then the probability mass function of Y is
$$p(y) = \sum_{x:\, y=g(x)} p(x).$$
Consider any set of x's that map onto a single y. For this set
$$\sum_{x:\, y=g(x)} p(x) \log p(x) \le \sum_{x:\, y=g(x)} p(x) \log p(y) = p(y) \log p(y),$$
since log is a monotone increasing function and $p(x) \le \sum_{x:\, y=g(x)} p(x) = p(y)$. Extending this argument to the entire range of X (and Y), we obtain
$$H(X) = -\sum_x p(x) \log p(x) = -\sum_y \sum_{x:\, y=g(x)} p(x) \log p(x) \ge -\sum_y p(y) \log p(y) = H(Y),$$
with equality iff g is one-to-one with probability one.
(a) Y = 2^X is one-to-one and hence the entropy, which is just a function of the probabilities (and not the values of a random variable), does not change, i.e., H(X) = H(Y).
(b) Y = cos(X) is not necessarily one-to-one. Hence all that we can say is that H(X) ≥ H(Y), with equality if cosine is one-to-one on the range of X.
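A small numerical sketch of parts (a) and (b) above, under an assumed example distribution (X uniform on {0, π, 2π}, chosen so that cos is not one-to-one); the helper names are ours:

```python
import math
from collections import defaultdict

def entropy(dist):
    """Entropy in bits of a {value: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def pushforward(dist, g):
    """Distribution of Y = g(X) when X has distribution `dist`."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[g(x)] += p
    return dict(out)

# Assumed example: X uniform on {0, pi, 2*pi}.
px = {0.0: 1/3, math.pi: 1/3, 2 * math.pi: 1/3}
h_x = entropy(px)
h_2x = entropy(pushforward(px, lambda x: 2 * x))      # one-to-one: entropy preserved
# rounding guards against float noise when cos(0) and cos(2*pi) should collide
h_cos = entropy(pushforward(px, lambda x: round(math.cos(x), 9)))
```

Here h_2x equals h_x exactly, while h_cos is strictly smaller because two mass points merge.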
3. Minimum entropy. What is the minimum value of H(p_1, ..., p_n) = H(p) as p ranges over the set of n-dimensional probability vectors? Find all p's which achieve this minimum.

Solution: We wish to find all probability vectors p = (p_1, p_2, ..., p_n) which minimize
$$H(\mathbf{p}) = -\sum_i p_i \log p_i.$$
Now $-p_i \log p_i \ge 0$, with equality iff p_i = 0 or 1. Hence the only possible probability vectors which minimize H(p) are those with p_i = 1 for some i and p_j = 0, j ≠ i. There are n such vectors, i.e., (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, ..., 0, 1), and the minimum value of H(p) is 0.

4. Axiomatic definition of entropy. If we assume certain axioms for our measure of information, then we will be forced to use a logarithmic measure like entropy. Shannon used this to justify his initial definition of entropy. In this book, we will rely more on the other properties of entropy rather than its axiomatic derivation to justify its use. The following problem is considerably more difficult than the other problems in this section.

If a sequence of symmetric functions H_m(p_1, p_2, ..., p_m) satisfies the following properties,
- Normalization: $H_2\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = 1$,
- Continuity: H_2(p, 1-p) is a continuous function of p,
- Grouping: $H_m(p_1, p_2, \ldots, p_m) = H_{m-1}(p_1 + p_2, p_3, \ldots, p_m) + (p_1 + p_2)\, H_2\!\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\right)$,

prove that H_m must be of the form
$$H_m(p_1, \ldots, p_m) = -\sum_{i=1}^m p_i \log p_i, \qquad m = 2, 3, \ldots \tag{2.1}$$
There are various other axiomatic formulations which also result in the same definition of entropy. See, for example, the book by Csiszár and Körner [3].

Solution: Axiomatic definition of entropy. This is a long solution, so we will first outline what we plan to do. First we will extend the grouping axiom by induction and prove that
$$H_m(p_1, \ldots, p_m) = H_{m-k+1}(p_1 + \cdots + p_k, p_{k+1}, \ldots, p_m) + (p_1 + \cdots + p_k)\, H_k\!\left(\frac{p_1}{p_1 + \cdots + p_k}, \ldots, \frac{p_k}{p_1 + \cdots + p_k}\right). \tag{2.2}$$
We will then show that for any two integers r and s, f(rs) = f(r) + f(s), where f(m) denotes H_m(1/m, ..., 1/m). We use this to show that f(m) = log m. We then show, for rational p = r/s, that
$$H_2(p, 1-p) = -p \log p - (1-p) \log (1-p); \tag{2.3}$$
by continuity this extends to all p, and an induction using the grouping axiom completes the proof for general m. To begin, we define
$$S_k = \sum_{i=1}^k p_i \tag{2.4}$$
and we will denote H_2(q, 1-q) as h(q). Then we can write the grouping axiom as
$$H_m(p_1, \ldots, p_m) = H_{m-1}(S_2, p_3, \ldots, p_m) + S_2\, h\!\left(\frac{p_2}{S_2}\right). \tag{2.5}$$
Applying the grouping axiom repeatedly, we have
$$H_m(p_1, \ldots, p_m) = H_{m-2}(S_3, p_4, \ldots, p_m) + S_3\, h\!\left(\frac{p_3}{S_3}\right) + S_2\, h\!\left(\frac{p_2}{S_2}\right)$$
$$\vdots$$
$$= H_{m-(k-1)}(S_k, p_{k+1}, \ldots, p_m) + \sum_{i=2}^k S_i\, h\!\left(\frac{p_i}{S_i}\right). \tag{2.9}$$
Now, we apply the same grouping axiom repeatedly to H_k(p_1/S_k, ..., p_k/S_k), to obtain
$$H_k\!\left(\frac{p_1}{S_k}, \ldots, \frac{p_k}{S_k}\right) = \sum_{i=2}^k \frac{S_i}{S_k}\, h\!\left(\frac{p_i}{S_i}\right). \tag{2.11}$$
Combining (2.9) and (2.11), we have
$$H_m(p_1, \ldots, p_m) = H_{m-(k-1)}(S_k, p_{k+1}, \ldots, p_m) + S_k\, H_k\!\left(\frac{p_1}{S_k}, \ldots, \frac{p_k}{S_k}\right), \tag{2.12}$$
which is the extended grouping axiom. Now we need to use an axiom that is not explicitly stated in the text, namely that the function H_m is symmetric with respect to its arguments. Using this, we can combine any set of arguments of H_m using the extended grouping axiom.

Let f(m) denote $H_m\!\left(\frac{1}{m}, \frac{1}{m}, \ldots, \frac{1}{m}\right)$. Consider
$$f(mn) = H_{mn}\!\left(\frac{1}{mn}, \frac{1}{mn}, \ldots, \frac{1}{mn}\right). \tag{2.13}$$
By repeatedly applying the extended grouping axiom, we have
$$f(mn) = H_{mn-n}\!\left(\frac{1}{m}, \frac{1}{mn}, \ldots, \frac{1}{mn}\right) + \frac{1}{m} H_n\!\left(\frac{1}{n}, \ldots, \frac{1}{n}\right)$$
$$= H_{mn-2n}\!\left(\frac{1}{m}, \frac{1}{m}, \frac{1}{mn}, \ldots, \frac{1}{mn}\right) + \frac{2}{m} H_n\!\left(\frac{1}{n}, \ldots, \frac{1}{n}\right)$$
$$\vdots$$
$$= H_m\!\left(\frac{1}{m}, \ldots, \frac{1}{m}\right) + H_n\!\left(\frac{1}{n}, \ldots, \frac{1}{n}\right) = f(m) + f(n).$$
We can immediately use this to conclude that f(m^k) = k f(m).

Now, we will argue that H_2(1, 0) = h(1) = 0. We do this by expanding H_3(p_1, p_2, 0) (with p_1 + p_2 = 1) in two different ways using the grouping axiom:
$$H_3(p_1, p_2, 0) = H_2(p_1, p_2) + p_2\, H_2(1, 0) \tag{2.20}$$
$$= H_2(1, 0) + (p_1 + p_2)\, H_2(p_1, p_2). \tag{2.21}$$
Thus p_2 H_2(1, 0) = H_2(1, 0) for all p_2, and therefore H_2(1, 0) = 0.

We will also need to show that f(m+1) - f(m) → 0 as m → ∞. To prove this, we use the extended grouping axiom and write
$$f(m+1) = H_{m+1}\!\left(\frac{1}{m+1}, \ldots, \frac{1}{m+1}\right) = h\!\left(\frac{1}{m+1}\right) + \frac{m}{m+1} H_m\!\left(\frac{1}{m}, \ldots, \frac{1}{m}\right) = h\!\left(\frac{1}{m+1}\right) + \frac{m}{m+1} f(m),$$
and therefore
$$f(m+1) - \frac{m}{m+1} f(m) = h\!\left(\frac{1}{m+1}\right). \tag{2.25}$$
Thus $\lim \left[f(m+1) - \frac{m}{m+1} f(m)\right] = \lim h\!\left(\frac{1}{m+1}\right)$. But by the continuity of H_2, the limit on the right is h(0) = 0.

Let us define
$$a_{n+1} = f(n+1) - f(n) \tag{2.26}$$
and
$$b_n = h\!\left(\frac{1}{n}\right). \tag{2.27}$$
Then
$$a_{n+1} = -\frac{1}{n+1} f(n) + b_{n+1} = -\frac{1}{n+1} \sum_{i=2}^n a_i + b_{n+1}, \tag{2.29}$$
and therefore
$$\sum_{n=2}^N n b_n = \sum_{n=2}^N \left( n a_n + a_{n-1} + \cdots + a_2 \right) = N \sum_{i=2}^N a_i. \tag{2.31}$$
Since $\sum_{n=1}^N n = N(N+1)/2$, we obtain
$$\frac{2}{N+1} \sum_{n=2}^N a_n = \frac{\sum_{n=2}^N n b_n}{\sum_{n=1}^N n}. \tag{2.32}$$
Now by continuity of H_2 and the definition of b_n, it follows that b_n → 0 as n → ∞. Since the right hand side is essentially an average of the b_n's, it also goes to 0 (this can be proved more precisely using ε's and δ's). Thus the left hand side goes to 0. We can then see that
$$a_{N+1} = b_{N+1} - \frac{1}{N+1} \sum_{n=2}^N a_n \tag{2.33}$$
also goes to 0 as N → ∞. Thus
$$f(n+1) - f(n) \to 0 \quad \text{as } n \to \infty. \tag{2.34}$$
We will now prove the following lemma.
Lemma 2.0.1 Let the function f(m) satisfy the following assumptions:
- f(mn) = f(m) + f(n) for all integers m, n,
- lim_{n→∞} (f(n+1) - f(n)) = 0,
- f(2) = 1.
Then f(m) = log_2 m.
Proof of the lemma: Let P be an arbitrary prime number and let
$$g(n) = f(n) - \frac{f(P) \log_2 n}{\log_2 P}. \tag{2.35}$$
Then g(n) satisfies the first assumption of the lemma. Also g(P) = 0. Also, if we let
$$\alpha_n = g(n+1) - g(n) = f(n+1) - f(n) + \frac{f(P)}{\log_2 P} \log_2 \frac{n}{n+1}, \tag{2.36}$$
then the second assumption in the lemma implies that lim α_n = 0.
For any integer n, define
$$n^{(1)} = \left\lfloor \frac{n}{P} \right\rfloor. \tag{2.37}$$
Then it follows that
$$n = n^{(1)} P + l, \qquad 0 \le l < P. \tag{2.38}$$
From the fact that g(P) = 0, it follows that $g(P n^{(1)}) = g(n^{(1)})$, and therefore
$$g(n) = g(n^{(1)}) + g(n) - g(P n^{(1)}) = g(n^{(1)}) + \sum_{i=P n^{(1)}}^{n-1} \alpha_i. \tag{2.39}$$
Just as we defined n^{(1)} from n, we can define n^{(2)} from n^{(1)}, and continuing this process we can write
$$g(n) = g(n^{(k)}) + \sum_{j=1}^{k} \left( \sum_{i=P n^{(j)}}^{n^{(j-1)}-1} \alpha_i \right). \tag{2.41}$$
Since $n^{(k)} \le n/P^k$, after
$$k = \frac{\log n}{\log P} + 1 \tag{2.42}$$
terms, we have n^{(k)} = 0, and g(0) = 0 (this follows directly from the additive property of g). Thus we can write g(n) as a sum
$$g(n) = \sum_{i=1}^{t_n} \alpha_{b_i}, \qquad t_n \le \frac{\log n}{\log P} + 1, \tag{2.43}$$
of terms α_i. Since α_n → 0 and g(n) has at most O(log n) terms, it follows that g(n)/log_2 n → 0. Thus it follows that
$$\lim_{n \to \infty} \frac{f(n)}{\log_2 n} = \frac{f(P)}{\log_2 P}. \tag{2.44}$$
Since P was arbitrary, it follows that f(P)/log_2 P = c for every prime number P. Applying the third axiom in the lemma, it follows that the constant is 1, and f(P) = log_2 P. For composite numbers N = P_1 P_2 ... P_l, we can apply the first property of f and the prime number factorization of N to show that
$$f(N) = \sum_i f(P_i) = \sum_i \log_2 P_i = \log_2 N. \tag{2.45}$$
Thus the lemma is proved.

The lemma can be simplified considerably if, instead of the second assumption, we replace it by the assumption that f(n) is monotone in n. We will now argue that monotonicity, together with the additivity f(mn) = f(m) + f(n), also forces f(m) = log_2 m. For arbitrary integers m and r, choose k so that $2^k \le m^r < 2^{k+1}$, i.e.,
$$\frac{k}{r} \le \log_2 m < \frac{k+1}{r}.$$
By monotonicity and additivity, $f(2^k) \le f(m^r) \le f(2^{k+1})$; writing c = f(2), this is $ck \le r f(m) \le c(k+1)$, i.e.,
$$\frac{ck}{r} \le f(m) < \frac{c(k+1)}{r}.$$
Combining these two equations, we obtain
$$\left| \frac{f(m)}{c} - \log_2 m \right| < \frac{1}{r}.$$
Since r was arbitrary, we must have f(m) = c log_2 m, and we can identify c = 1 from the last assumption of the lemma.

Now we are almost done. We have shown that for any uniform distribution on m outcomes, f(m) = H_m(1/m, ..., 1/m) = log_2 m. We will now show that
$$H_2(p, 1-p) = -p \log p - (1-p) \log (1-p). \tag{2.51}$$
To begin, let p be a rational number, r/s, say. Consider the extended grouping axiom for H_s:
$$f(s) = H_s\!\left(\frac{1}{s}, \ldots, \frac{1}{s}\right) = H_2\!\left(\frac{r}{s}, \frac{s-r}{s}\right) + \frac{r}{s} f(r) + \frac{s-r}{s} f(s-r).$$
Substituting f(s) = log_2 s, etc., we obtain
$$\log_2 s = H_2\!\left(\frac{r}{s}, \frac{s-r}{s}\right) + \frac{r}{s} \log_2 r + \frac{s-r}{s} \log_2 (s-r),$$
and therefore
$$H_2\!\left(\frac{r}{s}, \frac{s-r}{s}\right) = -\frac{r}{s} \log_2 \frac{r}{s} - \frac{s-r}{s} \log_2 \frac{s-r}{s}.$$
Thus (2.51) is true for rational p . By the continuity assumption, (2.51) is also true at irrational p .
To complete the proof, we have to extend the definition from H_2 to H_m, i.e., we have to show that
$$H_m(p_1, \ldots, p_m) = -\sum_i p_i \log p_i \tag{2.55}$$
for all m. This is a straightforward induction. We have just shown that this is true for m = 2. Now assume that it is true for m = n-1. By the grouping axiom,
$$H_n(p_1, \ldots, p_n) = H_{n-1}(p_1 + p_2, p_3, \ldots, p_n) + (p_1 + p_2)\, H_2\!\left(\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\right)$$
$$= -(p_1+p_2) \log (p_1+p_2) - \sum_{i=3}^n p_i \log p_i - p_1 \log \frac{p_1}{p_1+p_2} - p_2 \log \frac{p_2}{p_1+p_2}$$
$$= -\sum_{i=1}^n p_i \log p_i.$$
Thus the statement is true for m = n, and by induction, it is true for all m. Thus we have finally proved that the only symmetric function that satisfies the axioms is
$$H_m(p_1, \ldots, p_m) = -\sum_{i=1}^m p_i \log p_i. \tag{2.61}$$
The proof above is due to Rényi [10].

5. Entropy of functions of a random variable. Let X be a discrete random variable. Show that the entropy of a function of X is less than or equal to the entropy of X by justifying the following steps:
$$H(X, g(X)) \stackrel{(a)}{=} H(X) + H(g(X) \mid X) \stackrel{(b)}{=} H(X);$$
$$H(X, g(X)) \stackrel{(c)}{=} H(g(X)) + H(X \mid g(X)) \stackrel{(d)}{\ge} H(g(X)).$$
Thus H(g(X)) ≤ H(X).

Solution: Entropy of functions of a random variable.
(a) H(X, g(X)) = H(X) + H(g(X)|X) by the chain rule for entropies.
(b) H(g(X)|X) = 0, since for any particular value of X, g(X) is fixed, and hence $H(g(X)|X) = \sum_x p(x) H(g(X)|X=x) = \sum_x 0 = 0$.
(c) H(X, g(X)) = H(g(X)) + H(X|g(X)) again by the chain rule.
(d) H(X|g(X)) ≥ 0, with equality iff X is a function of g(X), i.e., g(·) is one-to-one.
Hence H(X, g(X)) ≥ H(g(X)), and combining the two chains, H(g(X)) ≤ H(X).
$$H(Y|X) = -\sum_x p(x) \sum_y p(y|x) \log p(y|x).$$
If Y is not a function of X, there exist x_0 with p(x_0) > 0 and two values y_1 ≠ y_2 with p(y_1|x_0) > 0 and p(y_2|x_0) > 0. Then, keeping only the x_0 term,
$$H(Y|X) \ge p(x_0) \left( -p(y_1|x_0) \log p(y_1|x_0) - p(y_2|x_0) \log p(y_2|x_0) \right) > 0,$$
since -t log t ≥ 0 for 0 ≤ t ≤ 1, and is strictly positive for t not equal to 0 or 1. Therefore the conditional entropy H(Y|X) is 0 if and only if Y is a function of X.

7. Pure randomness and bent coins. Let X_1, X_2, ..., X_n denote the outcomes of independent flips of a bent coin. Thus Pr{X_i = 1} = p, Pr{X_i = 0} = 1 - p, where p is unknown. We wish to obtain a sequence Z_1, Z_2, ..., Z_K of fair coin flips from X_1, X_2, ..., X_n. Toward this end let f : 𝒳^n → {0, 1}*, where {0, 1}* = {Λ, 0, 1, 00, 01, ...} is the set of all finite length binary sequences, be a mapping f(X_1, X_2, ..., X_n) = (Z_1, Z_2, ..., Z_K), where Z_i ~ Bernoulli(1/2), and K may depend on (X_1, ..., X_n). In order that the sequence Z_1, Z_2, ... appear to be fair coin flips, the map f from bent coin flips to fair flips must have the property that all 2^k sequences (Z_1, Z_2, ..., Z_k) of a given length k have equal probability (possibly 0), for k = 1, 2, .... For example, for n = 2, the map f(01) = 0, f(10) = 1, f(00) = f(11) = Λ (the null string), has the property that Pr{Z_1 = 1|K = 1} = Pr{Z_1 = 0|K = 1} = 1/2. Give reasons for the following inequalities:
$$nH(p) \stackrel{(a)}{=} H(X_1, \ldots, X_n) \stackrel{(b)}{\ge} H(Z_1, Z_2, \ldots, Z_K, K) \stackrel{(c)}{=} H(K) + H(Z_1, \ldots, Z_K \mid K) \stackrel{(d)}{=} H(K) + E(K) \stackrel{(e)}{\ge} EK.$$
Thus no more than nH(p) fair coin tosses can be derived from (X_1, ..., X_n), on the average.
(f) Exhibit a good map f on sequences of length 4.

Solution: Pure randomness and bent coins.
(a) Since X_1, X_2, ..., X_n are i.i.d. with probability of X_i = 1 being p, the entropy H(X_1, X_2, ..., X_n) is nH(p).
(b) Z_1, ..., Z_K is a function of X_1, X_2, ..., X_n, and since the entropy of a function of a random variable is less than the entropy of the random variable, H(Z_1, ..., Z_K) ≤ H(X_1, X_2, ..., X_n). K is a function of Z_1, Z_2, ..., Z_K, so its conditional entropy given Z_1, Z_2, ..., Z_K is 0. Hence H(Z_1, Z_2, ..., Z_K, K) = H(Z_1, ..., Z_K) + H(K|Z_1, Z_2, ..., Z_K) = H(Z_1, Z_2, ..., Z_K).
(c) Follows from the chain rule for entropy.
(d) By assumption, Z_1, Z_2, ..., Z_K are pure random bits (given K), with entropy 1 bit per symbol. Hence
$$H(Z_1, Z_2, \ldots, Z_K \mid K) = \sum_k \Pr(K = k)\, H(Z_1, \ldots, Z_k \mid K = k) = \sum_k \Pr(K = k)\, k = EK.$$
(e) Follows from the non-negativity of discrete entropy.
(f) Since we do not know p, the only way to generate pure random bits is to use the fact that all sequences with the same number of ones are equally likely. For example, the sequences 0001, 0010, 0100 and 1000 are equally likely and can be used to generate 2 pure random bits. An example of a mapping to generate random bits is

0000 → Λ        1111 → Λ
0001 → 00       0010 → 01       0100 → 10       1000 → 11
1110 → 11       1101 → 10       1011 → 01       0111 → 00
0011 → 00       0110 → 01       1100 → 10       1001 → 11
1010 → 0        0101 → 1
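The fairness property of the length-4 map above can be checked numerically: for any bias p, within each length class K = k all 2^k output strings must be equally likely. A small sketch (the bias 0.3 is an arbitrary test value):

```python
from itertools import product
from collections import defaultdict

# The mapping from the solution; "" stands for the null string.
f = {"0000": "", "1111": "",
     "0001": "00", "0010": "01", "0100": "10", "1000": "11",
     "1110": "11", "1101": "10", "1011": "01", "0111": "00",
     "0011": "00", "0110": "01", "1100": "10", "1001": "11",
     "1010": "0", "0101": "1"}

def output_dist(p):
    """Distribution of f(X1..X4) when the Xi are i.i.d. Bernoulli(p)."""
    probs = defaultdict(float)
    for bits in product("01", repeat=4):
        x = "".join(bits)
        probs[f[x]] += p ** x.count("1") * (1 - p) ** x.count("0")
    return probs

dist = output_dist(0.3)
by_len = defaultdict(list)          # group output probabilities by K = k
for z, pr in dist.items():
    by_len[len(z)].append(pr)
```

For every k, the list by_len[k] should contain equal values, confirming Pr{Z_1..Z_k | K = k} is uniform.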
Now for sufficiently large n, the probability that the number of 1's in the sequence is close to np is near 1 (by the weak law of large numbers). For such sequences, k/n is close to p, and hence there exists an ε such that
$$\binom{n}{k} \ge 2^{n(H(p) - 2\epsilon)}, \tag{2.82}$$
using Stirling's approximation for the binomial coefficients and the continuity of the entropy function. If we assume that n is large enough so that the probability that n(p-ε) ≤ k ≤ n(p+ε) is greater than 1-ε, then we see that EK ≥ (1-ε) n (H(p) - 2ε) - 2, which is very good since nH(p) is an upper bound on the number of pure random bits that can be produced from the bent coin sequence.

8. World Series. The World Series is a seven-game series that terminates as soon as either team wins four games. Let X be the random variable that represents the outcome of a World Series between teams A and B; possible values of X are AAAA, BABABAB, and BBBAAAA. Let Y be the number of games played, which ranges from 4 to 7. Assuming that A and B are equally matched and that the games are independent, calculate H(X), H(Y), H(Y|X), and H(X|Y).
Solution: World Series. Two teams play until one of them has won 4 games.
There are 2 (AAAA, BBBB) World Series with 4 games. Each happens with probability (1/2)^4.
There are $8 = 2\binom{4}{3}$ World Series with 5 games. Each happens with probability (1/2)^5.
There are $20 = 2\binom{5}{3}$ World Series with 6 games. Each happens with probability (1/2)^6.
There are $40 = 2\binom{6}{3}$ World Series with 7 games. Each happens with probability (1/2)^7.

The probability of a 4 game series (Y = 4) is 2(1/16) = 1/8.
The probability of a 5 game series (Y = 5) is 8(1/32) = 1/4.
The probability of a 6 game series (Y = 6) is 20(1/64) = 5/16.
The probability of a 7 game series (Y = 7) is 40(1/128) = 5/16.

$$H(X) = \sum_x p(x) \log \frac{1}{p(x)} = 2\,\frac{1}{16} \log 16 + 8\,\frac{1}{32} \log 32 + 20\,\frac{1}{64} \log 64 + 40\,\frac{1}{128} \log 128 = 5.8125 \text{ bits.}$$
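The series counts and the value of H(X) can be confirmed by brute-force enumeration; a sketch (our own enumeration, not the solution's method):

```python
import math
from itertools import product

# A valid series ends exactly when the winner collects the 4th win on the last game.
series = set()
for n in range(4, 8):
    for g in product("AB", repeat=n):
        s = "".join(g)
        if s[:-1].count(s[-1]) == 3:   # winner had 3 wins before the final game
            series.add(s)

probs = {s: 0.5 ** len(s) for s in series}          # equally matched teams
counts = {n: sum(1 for s in series if len(s) == n) for n in range(4, 8)}
h_x = -sum(p * math.log2(p) for p in probs.values())
```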
9. Infinite entropy. This problem shows that the entropy of a discrete random variable can be infinite. Let $A = \sum_{n=2}^{\infty} (n \log^2 n)^{-1}$. (It is easy to show that A is finite by bounding the infinite sum by the integral of $(x \log^2 x)^{-1}$.) Show that the integer-valued random variable X defined by $\Pr(X = n) = (A n \log^2 n)^{-1}$ for n = 2, 3, ..., has H(X) = +∞.

Solution: Infinite entropy. By definition, $p_n = \Pr(X = n) = 1/(A n \log^2 n)$ for n ≥ 2. Therefore
$$H(X) = -\sum_{n=2}^{\infty} p_n \log p_n = \sum_{n=2}^{\infty} \frac{\log (A n \log^2 n)}{A n \log^2 n} = \log A + \sum_{n=2}^{\infty} \frac{\log n}{A n \log^2 n} + \sum_{n=2}^{\infty} \frac{2 \log \log n}{A n \log^2 n}.$$
The first term is finite. For base 2 logarithms, all the elements in the sum in the last term are nonnegative. (For any other base, the terms of the last sum eventually all become positive.) So all we have to do is bound the middle sum, which we do by comparing with an integral:
$$\sum_{n=2}^{\infty} \frac{1}{A n \log n} > \int_2^{\infty} \frac{1}{A x \log x}\, dx = K \ln \ln x \Big|_2^{\infty} = +\infty.$$
We conclude that H(X) = +∞.

10. Conditional mutual information vs. unconditional mutual information. Give examples of joint random variables X, Y and Z such that
(a) I(X; Y | Z) < I(X; Y),
(b) I(X; Y | Z) > I(X; Y).
Solution: Conditional mutual information vs. unconditional mutual information.
(a) The last corollary to Theorem 2.8.1 in the text states that if X → Y → Z, that is, if p(x, y | z) = p(x | z) p(y | z), then I(X; Y) ≥ I(X; Y | Z). Equality holds if and only if I(X; Z) = 0, i.e., X and Z are independent. A simple example of random variables satisfying the inequality conditions above is: X is a fair binary random variable, Y = X and Z = Y. In this case,
$$I(X; Y) = H(X) - H(X | Y) = H(X) = 1$$
and
$$I(X; Y | Z) = H(X | Z) - H(X | Y, Z) = 0,$$
so that I(X; Y) > I(X; Y | Z).
(b) This example is also given in the text. Let X, Y be independent fair binary random variables and let Z = X + Y. In this case we have that
$$I(X; Y) = 0$$
and
$$I(X; Y | Z) = H(X | Z) - H(X | Y, Z) = H(X | Z) = \tfrac{1}{2} \text{ bit,}$$
since X is determined when Z = 0 or Z = 2, and is a fair bit when Z = 1 (which has probability 1/2). So I(X; Y | Z) > I(X; Y).
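Both examples can be verified mechanically from their joint distributions; a sketch (helper names are ours):

```python
import math
from collections import defaultdict

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marg(joint, idx):
    """Marginal of the coordinates in `idx` from a {(x,y,z): p} joint."""
    out = defaultdict(float)
    for k, p in joint.items():
        out[tuple(k[i] for i in idx)] += p
    return dict(out)

def mi(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return H(marg(joint, (0,))) + H(marg(joint, (1,))) - H(marg(joint, (0, 1)))

def cond_mi(joint):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)."""
    return (H(marg(joint, (0, 2))) + H(marg(joint, (1, 2)))
            - H(marg(joint, (2,))) - H(joint))

ja = {(x, x, x): 0.5 for x in (0, 1)}                          # (a): Y = X, Z = Y
jb = {(x, y, x + y): 0.25 for x in (0, 1) for y in (0, 1)}     # (b): Z = X + Y
```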
11. Average entropy. Let H(p) = -p log_2 p - (1-p) log_2 (1-p) be the binary entropy function.
(a) Evaluate H(1/4) using the fact that log_2 3 ≈ 1.585. Hint: Consider an experiment with four equally likely outcomes, one of which is more interesting than the others.
(b) Calculate the average entropy H(p) when the probability p is chosen uniformly in the range 0 ≤ p ≤ 1.
(c) (Optional) Calculate the average entropy H(p_1, p_2, p_3) where (p_1, p_2, p_3) is a uniformly distributed probability vector. Generalize to dimension n.

Solution: Average entropy.
(a) We can generate two bits of information by picking one of four equally likely alternatives. This selection can be made in two steps. First we decide whether the first outcome occurs. Since this has probability 1/4, the information generated is H(1/4). If not the first outcome, then we select one of the three remaining outcomes; with probability 3/4, this produces log_2 3 bits of information. Thus
$$H(1/4) + (3/4) \log_2 3 = 2,$$
and so H(1/4) = 2 - (3/4) log_2 3 = 2 - (0.75)(1.585) = 0.811 bits.
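The two-step decomposition can be checked directly against the binary entropy formula; a one-line sketch:

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

direct = h2(0.25)                       # straight from the definition
two_step = 2 - 0.75 * math.log2(3)      # H(1/4) = 2 - (3/4) log2(3)
```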
(b) If p is chosen uniformly on [0, 1], the average entropy is
$$-\int_0^1 \left[ p \log_2 p + (1-p) \log_2 (1-p) \right] dp = -2 \int_0^1 p \log_2 p\, dp = -\frac{2}{\ln 2} \int_0^1 x \ln x\, dx = -\frac{2}{\ln 2} \left[ \frac{x^2}{2} \ln x - \frac{x^2}{4} \right]_0^1 = \frac{1}{2} \log_2 e.$$
Therefore the average entropy is $\frac{1}{2} \log_2 e = 1/(2 \ln 2) \approx 0.721$ bits.
(c) Choosing a uniformly distributed probability vector (p_1, p_2, p_3) is equivalent to choosing a point (p_1, p_2) uniformly from the triangle 0 ≤ p_1 ≤ 1, p_1 ≤ p_2 ≤ 1. The probability density function has the constant value 2 because the area of the triangle is 1/2. So the average entropy H(p_1, p_2, p_3) is
$$-2 \int_0^1 \int_{p_1}^1 \left[ p_1 \log_2 p_1 + (p_2 - p_1) \log_2 (p_2 - p_1) + (1 - p_2) \log_2 (1 - p_2) \right] dp_2\, dp_1.$$
After some enjoyable calculus, we obtain the final result 5/(6 ln 2) = 1.202 bits.

12. Venn diagrams. Using Venn diagrams, we can see that the mutual information common to three random variables X, Y and Z should be defined by
$$I(X; Y; Z) = I(X; Y) - I(X; Y | Z).$$
This quantity is symmetric in X, Y and Z, despite the preceding asymmetric definition. Unfortunately, I(X; Y; Z) is not necessarily nonnegative. Find X, Y and Z such that I(X; Y; Z) < 0, and prove the following two identities:
$$I(X; Y; Z) = H(X, Y, Z) - H(X) - H(Y) - H(Z) + I(X; Y) + I(Y; Z) + I(Z; X)$$
$$I(X; Y; Z) = H(X, Y, Z) - H(X, Y) - H(Y, Z) - H(Z, X) + H(X) + H(Y) + H(Z)$$
The first identity can be understood using the Venn diagram analogy for entropy and mutual information. The second identity follows easily from the first.

Solution: Venn diagrams. To show the first identity,
$$I(X; Y; Z) = I(X; Y) - I(X; Y | Z) \quad \text{(by definition)}$$
$$= I(X; Y) - (I(X; Y, Z) - I(X; Z)) \quad \text{(by the chain rule)}$$
$$= I(X; Y) + I(X; Z) - I(X; Y, Z)$$
$$= I(X; Y) + I(X; Z) - (H(X) + H(Y, Z) - H(X, Y, Z))$$
$$= I(X; Y) + I(X; Z) - H(X) + H(X, Y, Z) - H(Y, Z)$$
$$= I(X; Y) + I(X; Z) - H(X) + H(X, Y, Z) - (H(Y) + H(Z) - I(Y; Z))$$
$$= I(X; Y) + I(X; Z) + I(Y; Z) + H(X, Y, Z) - H(X) - H(Y) - H(Z).$$
To show the second identity, simply substitute for I(X; Y), I(X; Z), and I(Y; Z) using equations like I(X; Y) = H(X) + H(Y) - H(X, Y). These two identities show that I(X; Y; Z) is a symmetric (but not necessarily nonnegative) function of three random variables.
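A standard example with I(X; Y; Z) < 0 (our choice, not stated in the text above) is X, Y independent fair bits with Z = X ⊕ Y; a sketch checking both identities give -1 bit:

```python
import math
from collections import defaultdict

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marg(joint, idx):
    out = defaultdict(float)
    for k, p in joint.items():
        out[tuple(k[i] for i in idx)] += p
    return dict(out)

# X, Y independent fair bits, Z = X xor Y.
j = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
hx, hy, hz = (H(marg(j, (i,))) for i in (0, 1, 2))
hxy, hyz, hxz = H(marg(j, (0, 1))), H(marg(j, (1, 2))), H(marg(j, (0, 2)))
hxyz = H(j)

# Second identity for I(X;Y;Z).
i3 = hxyz - hxy - hyz - hxz + hx + hy + hz
# Definition: I(X;Y) - I(X;Y|Z).
i3_def = (hx + hy - hxy) - (hxz + hyz - hz - hxyz)
```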
13. Coin weighing. Suppose one has n coins, among which there may or may not be one counterfeit coin. If there is a counterfeit coin, it may be either heavier or lighter than the other coins. The coins are to be weighed by a balance.
(a) Find an upper bound on the number of coins n so that k weighings will find the counterfeit coin (if any) and correctly declare it to be heavier or lighter.
(b) (Difficult) What is the coin weighing strategy for k = 3 weighings and 12 coins?

Solution: Coin weighing.
(a) For n coins, there are 2n + 1 possible situations or "states": one of the n coins is heavier, one of the n coins is lighter, or they are all of equal weight. Each weighing has three possible outcomes: equal, left pan heavier or right pan heavier. Hence with k weighings, there are 3^k possible outcomes and hence we can distinguish between at most 3^k different "states". Hence 2n + 1 ≤ 3^k or n ≤ (3^k - 1)/2.
Looking at it from an information theoretic viewpoint, each weighing gives at most log_2 3 bits of information. There are 2n + 1 possible "states", with a maximum entropy of log_2(2n + 1) bits. Hence in this situation, one would require at least log_2(2n + 1)/log_2 3 weighings to extract enough information for determination of the odd coin, which gives the same result as above.
(b) There are many solutions to this problem. We will give one which is based on the ternary number system. We may express the numbers {-12, -11, ..., -1, 0, 1, ..., 12} in a ternary number system with alphabet {-1, 0, 1}. For example, the number 8 is (-1, 0, 1), where -1 · 3^0 + 0 · 3^1 + 1 · 3^2 = 8. We form the matrix with the representation of the positive numbers as its columns:

        1  2  3  4  5  6  7  8  9 10 11 12   row sum
3^0     1 -1  0  1 -1  0  1 -1  0  1 -1  0   Σ_1 = 0
3^1     0  1  1  1 -1 -1 -1  0  0  0  1  1   Σ_2 = 2
3^2     0  0  0  0  1  1  1  1  1  1  1  1   Σ_3 = 8

Note that the row sums are not all zero. We can negate some columns to make the row sums zero.
For example, negating columns 7, 9, 11 and 12, we obtain:

        1  2  3  4  5  6  7  8  9 10 11 12   row sum
3^0     1 -1  0  1 -1  0 -1 -1  0  1  1  0   Σ_1 = 0
3^1     0  1  1  1 -1 -1  1  0  0  0 -1 -1   Σ_2 = 0
3^2     0  0  0  0  1  1 -1  1 -1  1 -1 -1   Σ_3 = 0

Now place the coins on the balance according to the following rule: for weighing #i, place coin n
- On the left pan, if n_i = -1.
- Aside, if n_i = 0.
- On the right pan, if n_i = 1.
The outcome of the three weighings will find the odd coin if any and tell whether it is heavy or light. The result of each weighing is 0 if both pans are equal, -1 if the left pan is heavier, and 1 if the right pan is heavier. Then the three weighings give the ternary expansion of the index of the odd coin. If the expansion is the same as the expansion in the matrix, it indicates that the coin is heavier. If the expansion is of the opposite sign, the coin is lighter. For example, (0, -1, -1) indicates (0)3^0 + (-1)3^1 + (-1)3^2 = -12, hence coin #12 is heavy; (1, 0, -1) indicates #8 is light; (0, 0, 0) indicates there is no odd coin.
Why does this scheme work? It is a single error correcting Hamming code for the ternary alphabet (discussed in Section 8.11 in the book). Here are some details. First note a few properties of the matrix above that was used for the scheme: all the columns are distinct and no two columns add to (0, 0, 0). Also, if any coin is heavier, it will produce the sequence of weighings that matches its column in the matrix. If it is lighter, it produces the negative of its column as a sequence of weighings. Combining all these facts, we can see that any single odd coin will produce a unique sequence of weighings, and that the coin can be determined from the sequence.
One of the questions that many of you had was whether the bound derived in part (a) was actually achievable. For example, can one distinguish 13 coins in 3 weighings? No, not with a scheme like the one above. Yes, under the assumptions under which the bound was derived. The bound did not prohibit the division of coins into halves, neither did it disallow the existence of another coin known to be normal. Under both these conditions, it is possible to find the odd coin of 13 coins in 3 weighings. You could try modifying the above scheme to these cases.

14. Drawing with and without replacement. An urn contains r red, w white, and b black balls.
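The decoding property of the 12-coin scheme (a heavy coin reproduces its column, a light coin its negation, and all 25 states decode uniquely) can be simulated; a sketch, with the columns of the negated matrix written as (3^0, 3^1, 3^2) digit tuples:

```python
# Columns of the negated matrix from the solution above.
cols = {1: (1, 0, 0), 2: (-1, 1, 0), 3: (0, 1, 0), 4: (1, 1, 0),
        5: (-1, -1, 1), 6: (0, -1, 1), 7: (-1, 1, -1), 8: (-1, 0, 1),
        9: (0, 0, -1), 10: (1, 0, 1), 11: (1, -1, -1), 12: (0, -1, -1)}

def outcome(odd=None, heavy=True):
    """Three weighing results (-1: left heavier, +1: right heavier, 0: equal).
    A heavy odd coin reproduces its column; a light one, its negation."""
    if odd is None:
        return (0, 0, 0)
    s = 1 if heavy else -1
    return tuple(s * e for e in cols[odd])

def decode(o):
    """Recover (coin, is_heavy); (None, None) means no counterfeit coin."""
    if o == (0, 0, 0):
        return (None, None)
    for c, col in cols.items():
        if o == col:
            return (c, True)
        if o == tuple(-e for e in col):
            return (c, False)
```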
Which has higher entropy, drawing k ≥ 2 balls from the urn with replacement or without replacement? Set it up and show why. (There is both a hard way and a relatively simple way to do this.)

Solution: Drawing with and without replacement. Intuitively, it is clear that if the balls are drawn with replacement, the number of possible choices for the i-th ball is larger, and therefore the conditional entropy is larger. But computing the conditional distributions is slightly involved. It is easier to compute the unconditional entropy.

With replacement. In this case the conditional distribution of each draw is the same for every draw. Thus
$$X_i = \begin{cases} \text{red} & \text{with prob. } \frac{r}{r+w+b} \\ \text{white} & \text{with prob. } \frac{w}{r+w+b} \\ \text{black} & \text{with prob. } \frac{b}{r+w+b} \end{cases} \tag{2.83}$$
and therefore
$$H(X_i | X_{i-1}, \ldots, X_1) = H(X_i) = \log (r+w+b) - \frac{r}{r+w+b} \log r - \frac{w}{r+w+b} \log w - \frac{b}{r+w+b} \log b. \tag{2.85}$$

Without replacement. The unconditional probability of the i-th ball being red is still r/(r+w+b), etc. Thus the unconditional entropy H(X_i) is still the same as with replacement. The conditional entropy H(X_i | X_{i-1}, ..., X_1) is less than the unconditional entropy, and therefore the entropy of drawing without replacement is lower.

15. A metric. A function ρ(x, y) is a metric if for all x, y:
- ρ(x, y) ≥ 0,
- ρ(x, y) = ρ(y, x),
- ρ(x, y) = 0 if and only if x = y,
- ρ(x, y) + ρ(y, z) ≥ ρ(x, z).
(a) Show that ρ(X, Y) = H(X|Y) + H(Y|X) has the above properties, and is therefore a metric. Note that ρ(X, Y) is the number of bits needed for X and Y to communicate their values to each other.
(b) Verify that ρ(X, Y) can also be expressed as
$$\rho(X, Y) = H(X) + H(Y) - 2 I(X; Y) \tag{2.86}$$
$$= H(X, Y) - I(X; Y) \tag{2.87}$$
$$= 2 H(X, Y) - H(X) - H(Y). \tag{2.88}$$
Solution: A metric.
(a) Let
$$\rho(X, Y) = H(X|Y) + H(Y|X). \tag{2.89}$$
Then:
- Since conditional entropy is always ≥ 0, ρ(X, Y) ≥ 0.
- The symmetry of the definition implies that ρ(X, Y) = ρ(Y, X).
- By Problem 2.6, H(Y|X) is 0 iff Y is a function of X, and H(X|Y) is 0 iff X is a function of Y. Thus ρ(X, Y) is 0 iff X and Y are functions of each other, and therefore are equivalent up to a reversible transformation.
- Consider three random variables X, Y and Z. Then
$$H(X|Z) \le H(X, Y|Z) = H(Y|Z) + H(X|Y, Z) \le H(Y|Z) + H(X|Y),$$
and similarly H(Z|X) ≤ H(Z|Y) + H(Y|X). Adding the two inequalities gives ρ(X, Z) ≤ ρ(X, Y) + ρ(Y, Z).
Note that the inequality is strict unless X → Y → Z forms a Markov chain and Y is a function of X and Z.
(b) Since H(X|Y) = H(X) - I(X; Y), the first equation follows. The second relation follows from the first equation and the fact that H(X, Y) = H(X) + H(Y) - I(X; Y). The third follows on substituting I(X; Y) = H(X) + H(Y) - H(X, Y).

16. Example of joint entropy. Let p(x, y) be given by
            X = 0   X = 1
   Y = 0     1/3      0
   Y = 1     1/3     1/3

Find
(a) H(X), H(Y).
(b) H(X | Y), H(Y | X).
(c) H(X, Y).
(d) H(Y) - H(Y | X).
(e) I(X; Y).
(f) Draw a Venn diagram for the quantities in (a) through (e).
since the second term is always negative. Hence, letting y = 1/x, we obtain
$$-\ln y \le \frac{1}{y} - 1,$$
or
$$\ln y \ge 1 - \frac{1}{y},$$
with equality iff y = 1.

18. Entropy of a sum. Let X and Y be random variables that take on values x_1, x_2, ..., x_r and y_1, y_2, ..., y_s, respectively. Let Z = X + Y.
(a) Show that H(Z|X) = H(Y|X). Argue that if X, Y are independent, then H(Y) ≤ H(Z) and H(X) ≤ H(Z). Thus the addition of independent random variables adds uncertainty.
(b) Give an example (of necessarily dependent random variables) in which H(X) > H(Z) and H(Y) > H(Z).
(c) Under what conditions does H(Z) = H(X) + H(Y)?
Solution: Entropy of a sum.
(a) We have
$$H(Z|X) = \sum_x p(x)\, H(Z|X = x) = -\sum_x p(x) \sum_z p(Z = z|X = x) \log p(Z = z|X = x)$$
$$= -\sum_x p(x) \sum_y p(Y = y|X = x) \log p(Y = y|X = x) \qquad (\text{substituting } z = x + y)$$
$$= \sum_x p(x)\, H(Y|X = x) = H(Y|X).$$
If X and Y are independent, then H(Y|X) = H(Y). Since I(X; Z) ≥ 0, we have H(Z) ≥ H(Z|X) = H(Y|X) = H(Y). Similarly we can show that H(Z) ≥ H(X).
(b) Consider the following joint distribution for X and Y. Let
$$X = -Y = \begin{cases} 1 & \text{with probability } 1/2 \\ 0 & \text{with probability } 1/2. \end{cases}$$
Then H(X) = H(Y) = 1, but Z = 0 with probability 1 and hence H(Z) = 0.
(c) We have H(Z) ≤ H(X, Y) ≤ H(X) + H(Y), because Z is a function of (X, Y) and H(X, Y) = H(X) + H(Y|X) ≤ H(X) + H(Y). We have equality iff (X, Y) is a function of Z and H(Y) = H(Y|X), i.e., X and Y are independent.

19. Entropy of a disjoint mixture. Let X_1 and X_2 be discrete random variables drawn according to probability mass functions p_1(·) and p_2(·) over the respective alphabets 𝒳_1 = {1, 2, ..., m} and 𝒳_2 = {m+1, ..., n}. Let
$$X = \begin{cases} X_1 & \text{with probability } \alpha \\ X_2 & \text{with probability } 1 - \alpha. \end{cases}$$
(a) Find H(X) in terms of H(X_1), H(X_2) and α.
(b) Maximize over α to show that $2^{H(X)} \le 2^{H(X_1)} + 2^{H(X_2)}$, and interpret using the notion that 2^{H(X)} is the effective alphabet size.

Solution: Entropy of a disjoint mixture. We could do this problem by writing down the definition of entropy and expanding the various terms. Instead, we will use the algebra of entropies for a simpler proof. Since X_1 and X_2 have disjoint support sets, we can write X as above and define a function of X,
$$\theta = f(X) = \begin{cases} 1 & \text{when } X = X_1 \\ 2 & \text{when } X = X_2. \end{cases}$$
20. A measure of correlation. Let X_1 and X_2 be identically distributed, but not necessarily independent. Let
$$\rho = 1 - \frac{H(X_2 \mid X_1)}{H(X_1)}.$$
(a) Show ρ = I(X_1; X_2)/H(X_1).
(b) Show 0 ≤ ρ ≤ 1.
(c) When is ρ = 0?
(d) When is ρ = 1?

Solution: A measure of correlation. X_1 and X_2 are identically distributed and
$$\rho = 1 - \frac{H(X_2|X_1)}{H(X_1)}.$$
(a)
$$\rho = \frac{H(X_1) - H(X_2|X_1)}{H(X_1)} = \frac{H(X_2) - H(X_2|X_1)}{H(X_1)} \quad (\text{since } H(X_1) = H(X_2)) \quad = \frac{I(X_1; X_2)}{H(X_1)}.$$
(b) Since 0 ≤ H(X_2|X_1) ≤ H(X_2) = H(X_1), we have
$$0 \le \frac{H(X_2|X_1)}{H(X_1)} \le 1, \qquad \text{so } 0 \le \rho \le 1.$$
(c) ρ = 0 iff I(X_1; X_2) = 0 iff X_1 and X_2 are independent.
(d) ρ = 1 iff H(X_2|X_1) = 0 iff X_2 is a function of X_1. By symmetry, X_1 is a function of X_2, i.e., X_1 and X_2 have a one-to-one relationship.

21. Data processing. Let X_1 → X_2 → X_3 → ··· → X_n form a Markov chain in this order; i.e., let
$$p(x_1, x_2, \ldots, x_n) = p(x_1)\, p(x_2|x_1) \cdots p(x_n|x_{n-1}).$$
Reduce I(X_1; X_2, ..., X_n) to its simplest form.

Solution: Data processing. By the chain rule for mutual information,
$$I(X_1; X_2, \ldots, X_n) = I(X_1; X_2) + I(X_1; X_3|X_2) + \cdots + I(X_1; X_n|X_2, \ldots, X_{n-1}). \tag{2.95}$$
By the Markov property, the past and the future are conditionally independent given the present and hence all terms except the first are zero. Therefore
$$I(X_1; X_2, \ldots, X_n) = I(X_1; X_2). \tag{2.96}$$
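The data-processing identity I(X_1; X_2, ..., X_n) = I(X_1; X_2) can be checked on a toy chain; a sketch with assumed binary-symmetric transition kernels (our parameters 0.1 and 0.2 are arbitrary):

```python
import math
from collections import defaultdict

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marg(joint, idx):
    out = defaultdict(float)
    for k, p in joint.items():
        out[tuple(k[i] for i in idx)] += p
    return dict(out)

def mi(joint, a, b):
    """I(A;B) for the coordinate groups a and b."""
    return H(marg(joint, a)) + H(marg(joint, b)) - H(marg(joint, a + b))

def flip(bit, e):
    """Binary symmetric step: keep the bit w.p. 1-e, flip it w.p. e."""
    return [(bit, 1 - e), (1 - bit, e)]

# Assumed chain X1 -> X2 -> X3: X1 fair, then BSC(0.1), then BSC(0.2).
joint = defaultdict(float)
for x1, p1 in ((0, 0.5), (1, 0.5)):
    for x2, p2 in flip(x1, 0.1):
        for x3, p3 in flip(x2, 0.2):
            joint[(x1, x2, x3)] += p1 * p2 * p3

i_full = mi(joint, (0,), (1, 2))   # I(X1; X2, X3)
i_12 = mi(joint, (0,), (1,))       # I(X1; X2)
```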
22. Bottleneck. Suppose a (non-stationary) Markov chain starts in one of n states, necks down to k < n states, and then fans back to m > k states. Thus X_1 → X_2 → X_3, with X_1 ∈ {1, 2, ..., n}, X_2 ∈ {1, 2, ..., k}, X_3 ∈ {1, 2, ..., m}.
(a) Show that the dependence of X_1 and X_3 is limited by the bottleneck by proving that I(X_1; X_3) ≤ log k.
(b) Evaluate I(X_1; X_3) for k = 1, and conclude that no dependence can survive such a bottleneck.

Solution: Bottleneck.
(a) From the data processing inequality, and the fact that entropy is maximum for a uniform distribution, we get
$$I(X_1; X_3) \le I(X_1; X_2) \le H(X_2) \le \log k.$$
Thus, the dependence between X_1 and X_3 is limited by the size of the bottleneck. That is, I(X_1; X_3) ≤ log k.
(b) For k = 1, I(X_1; X_3) ≤ log 1 = 0 and since I(X_1; X_3) ≥ 0, I(X_1; X_3) = 0. Thus, for k = 1, X_1 and X_3 are independent.

23. Run length coding. Let X_1, X_2, ..., X_n be (possibly dependent) binary random variables. Suppose one calculates the run lengths R = (R_1, R_2, ...) of this sequence (in order as they occur). For example, the sequence X = 0001100100 yields run lengths R = (3, 2, 2, 1, 2). Compare H(X_1, X_2, ..., X_n), H(R) and H(X_n, R). Show all equalities and inequalities, and bound all the differences.

Solution: Run length coding. Since the run lengths are a function of X_1, X_2, ..., X_n, H(R) ≤ H(X). Any X_i together with the run lengths determine the entire sequence X_1, X_2, ..., X_n. Hence
$$H(\mathbf{X}) = H(X_i, \mathbf{R}) = H(\mathbf{R}) + H(X_i | \mathbf{R}) \le H(\mathbf{R}) + H(X_i) \le H(\mathbf{R}) + 1.$$
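The chain of (in)equalities for run lengths can be verified exhaustively for short sequences; a sketch for n = 4 with an assumed i.i.d. bias p = 0.3 (any distribution would do):

```python
import math
from collections import defaultdict
from itertools import product, groupby

def runs(x):
    """Run lengths of a binary tuple, e.g. (0,0,1,0) -> (2,1,1)."""
    return tuple(len(list(g)) for _, g in groupby(x))

def H(dist):
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

p = 0.3
px = {x: p ** sum(x) * (1 - p) ** (4 - sum(x)) for x in product((0, 1), repeat=4)}
pr = defaultdict(float)      # distribution of R
px1r = defaultdict(float)    # joint distribution of (X1, R)
for x, q in px.items():
    pr[runs(x)] += q
    px1r[(x[0], runs(x))] += q

h_x, h_r, h_x1r = H(px), H(pr), H(px1r)
```

Since (X_1, R) and X determine each other, h_x equals h_x1r, while h_r ≤ h_x ≤ h_r + 1.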
24. Markov's inequality for probabilities. Let p(x) be a probability mass function. Prove, for all d ≥ 0, that
$$\Pr\{p(X) \le d\}\, \log \frac{1}{d} \le H(X). \tag{2.101}$$

Solution: Markov's inequality for probabilities.
$$H(X) = \sum_x p(x) \log \frac{1}{p(x)} \ge \sum_{x:\, p(x) \le d} p(x) \log \frac{1}{p(x)} \ge \sum_{x:\, p(x) \le d} p(x) \log \frac{1}{d} = \Pr\{p(X) \le d\}\, \log \frac{1}{d}.$$
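A quick spot-check of inequality (2.101) on an assumed pmf, over a grid of d values in (0, 1]:

```python
import math

# Assumed example pmf (dyadic, so H(X) is easy to eyeball).
pmf = [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125]
Hx = -sum(p * math.log2(p) for p in pmf)

def lhs(d):
    """Pr{p(X) <= d} * log2(1/d)."""
    return sum(p for p in pmf if p <= d) * math.log2(1 / d)

checks = [lhs(d) <= Hx + 1e-12 for d in (0.01, 0.05, 0.1, 0.3, 0.6, 1.0)]
```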
25. Logical order of ideas. Ideas have been developed in order of need, and then generalized if necessary. Reorder the following ideas, strongest first, implications following:
(a) Chain rule for I(X_1, ..., X_n; Y), chain rule for D(p(x_1, ..., x_n) || q(x_1, x_2, ..., x_n)), and chain rule for H(X_1, X_2, ..., X_n).
(b) D(f || g) ≥ 0, Jensen's inequality, I(X; Y) ≥ 0.
27. Conditional mutual information. Consider a sequence of n binary random variables X1, X2, ..., Xn. Each sequence with an even number of 1's has probability 2^{−(n−1)} and each sequence with an odd number of 1's has probability 0. Find the mutual informations

    I(X1; X2), I(X2; X3|X1), ..., I(Xn−1; Xn|X1, ..., Xn−2).
Solution: Conditional mutual information. Each sequence of length n with an even number of 1's is equally likely and has probability 2^{−(n−1)}. Any n − 1 or fewer of these variables are independent. Thus, for k ≤ n − 1,

    I(Xk−1; Xk | X1, X2, ..., Xk−2) = 0.

However, given X1, X2, ..., Xn−2, once we know either Xn−1 or Xn we know the other:

    I(Xn−1; Xn | X1, X2, ..., Xn−2) = H(Xn | X1, X2, ..., Xn−2) − H(Xn | X1, X2, ..., Xn−1)
                                    = 1 − 0 = 1 bit.
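Both claims can be verified by enumerating the even-parity sequences for a small n; the helper functions below are ad hoc and n = 4 is an arbitrary choice:

```python
import itertools, math
from collections import defaultdict

def H(dist):
    """Entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def Hm(joint, idx):
    """Entropy of the marginal on coordinates idx."""
    m = defaultdict(float)
    for k, p in joint.items():
        m[tuple(k[i] for i in idx)] += p
    return H(m)

n = 4   # uniform distribution on even-parity sequences, each with prob 2^-(n-1)
joint = {x: 2.0 ** -(n - 1)
         for x in itertools.product([0, 1], repeat=n) if sum(x) % 2 == 0}

def cond_mi(a, b, c):
    """I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return Hm(joint, c + a) + Hm(joint, c + b) - Hm(joint, c + a + b) - Hm(joint, c)

assert abs(cond_mi((1,), (2,), (0,))) < 1e-12          # I(X2; X3 | X1) = 0
assert abs(cond_mi((2,), (3,), (0, 1)) - 1.0) < 1e-12  # I(X4; X3 | X1, X2) = 1 bit
```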
28. Mixing increases entropy. Show that the entropy of the probability distribution (p1, ..., pi, ..., pj, ..., pm) is less than the entropy of the distribution (p1, ..., (pi+pj)/2, ..., (pi+pj)/2, ..., pm). Show that in general any transfer of probability that makes the distribution more uniform increases the entropy.

Solution: Mixing increases entropy. This problem depends on the convexity of the log function. Let

    P1 = (p1, ..., pi, ..., pj, ..., pm),
    P2 = (p1, ..., (pi+pj)/2, ..., (pi+pj)/2, ..., pm).

Then, by the log sum inequality,

    H(P2) − H(P1) = −2 ((pi+pj)/2) log((pi+pj)/2) + pi log pi + pj log pj
                  = −(pi+pj) log((pi+pj)/2) + pi log pi + pj log pj
                  ≥ 0.

Thus, H(P2) ≥ H(P1).
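Both statements are easy to test numerically; the distribution, the chosen indices i, j, and the transfer amounts t below are arbitrary illustrative choices (with t small enough that the ordering of the two components is preserved):

```python
import math

def H(p):
    """Entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)

p1 = [0.6, 0.1, 0.2, 0.1]            # arbitrary example distribution
i, j = 0, 2                           # average components i and j
p2 = list(p1)
p2[i] = p2[j] = (p1[i] + p1[j]) / 2
assert H(p2) >= H(p1)

# More generally, any transfer of mass t from the larger component
# toward the smaller one makes the vector more uniform:
for t in [0.05, 0.1, 0.2]:
    q = list(p1)
    q[i] -= t
    q[j] += t
    assert H(q) >= H(p1)
```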
29. Inequalities. Let X, Y and Z be joint random variables. Prove the following inequalities and find conditions for equality.
(a) H(X, Y|Z) ≥ H(X|Z).
(b) I(X, Y; Z) ≥ I(X; Z).
(c) H(X, Y, Z) − H(X, Y) ≤ H(X, Z) − H(X).
(d) I(X; Z|Y) ≥ I(Z; Y|X) − I(Z; Y) + I(X; Z).

Solution: Inequalities.
(a) Using the chain rule for conditional entropy,
    H(X, Y|Z) = H(X|Z) + H(Y|X, Z) ≥ H(X|Z),
with equality iff H(Y|X, Z) = 0, that is, when Y is a function of X and Z.
(b) Using the chain rule for mutual information,
    I(X, Y; Z) = I(X; Z) + I(Y; Z|X) ≥ I(X; Z),
with equality iff I(Y; Z|X) = 0, that is, when Y and Z are conditionally independent given X.
(c) Using first the chain rule for entropy and then the definition of conditional mutual information,

    H(X, Y, Z) − H(X, Y) = H(Z|X, Y) = H(Z|X) − I(Y; Z|X) ≤ H(Z|X) = H(X, Z) − H(X),

with equality iff I(Y; Z|X) = 0, as in part (b).
(d) Expanding I(X, Y; Z) by the chain rule in two ways,
    I(X; Z|Y) + I(Z; Y) = I(X, Y; Z) = I(Z; Y|X) + I(X; Z),
and therefore

    I(X; Z|Y) = I(Z; Y|X) − I(Z; Y) + I(X; Z).

We see that this inequality is actually an equality in all cases.

30. Maximum entropy. Find the probability mass function p(x) that maximizes the entropy H(X) of a non-negative integer-valued random variable X subject to the constraint

    EX = Σ_{n=0}^∞ n p(n) = A

for a fixed value A > 0. Evaluate this maximum H(X).

Solution: Maximum entropy.
Recall that for any two probability mass functions {pi} and {qi},

    −Σ_{i=0}^∞ pi log pi ≤ −Σ_{i=0}^∞ pi log qi.

Let qi = α β^i. Then

    −Σ_{i=0}^∞ pi log pi ≤ −Σ_{i=0}^∞ pi log(α β^i)
                        = −log α (Σ_{i=0}^∞ pi) − log β (Σ_{i=0}^∞ i pi)
                        = −log α − A log β.

Notice that the final right hand side expression is independent of {pi}, and that the inequality

    −Σ_{i=0}^∞ pi log pi ≤ −log α − A log β

holds for all α, β such that

    Σ_{i=0}^∞ α β^i = 1 = α · 1/(1−β),
    Σ_{i=0}^∞ i α β^i = A = α · β/(1−β)^2.

The inequality is attained with equality when pi = qi, so the entropy is maximized by the geometric distribution pi = α β^i with

    α = 1/(A+1),  β = A/(A+1),

and the maximum entropy is

    H(X) = −log α − A log β = (A+1) log(A+1) − A log A.
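The closed form can be sanity-checked by truncating the infinite sums; the mean A, the truncation point N, and the comparison distribution below are arbitrary illustrative choices:

```python
import math

A = 3.0                                   # target mean (example value)
alpha = 1 / (A + 1)                       # solves sum(alpha * beta**i) = 1
beta = A / (A + 1)                        #   and sum(i * alpha * beta**i) = A

N = 2000                                  # truncation of the infinite sums
p = [alpha * beta ** i for i in range(N)]
mean = sum(i * q for i, q in enumerate(p))
Hp = -sum(q * math.log2(q) for q in p if q > 0)

closed_form = (A + 1) * math.log2(A + 1) - A * math.log2(A)
assert abs(sum(p) - 1) < 1e-9             # valid pmf
assert abs(mean - A) < 1e-9               # mean constraint satisfied
assert abs(Hp - closed_form) < 1e-9       # matches (A+1)log(A+1) - A log A

# Any other pmf with mean A should have smaller entropy,
# e.g. the uniform distribution on {1, ..., 5} (also mean 3):
q = [0.0, 0.2, 0.2, 0.2, 0.2, 0.2]
assert abs(sum(i * v for i, v in enumerate(q)) - A) < 1e-9
assert -sum(v * math.log2(v) for v in q if v > 0) < Hp
```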
Alternatively, the Lagrangian

    F(p; λ1, λ2) = −Σ_{i=0}^∞ pi log pi + λ1 (Σ_{i=0}^∞ pi − 1) + λ2 (Σ_{i=0}^∞ i pi − A)
is the function whose gradient we set to 0. Many of you used Lagrange multipliers, but failed to argue that the result obtained is a global maximum. An argument similar to the above should have been used. On the other hand, one could simply argue that since −H(p) is convex, it has only one local minimum and no local maxima, and therefore the Lagrange multiplier method actually gives the global maximum of H(p).

31. Shuffles increase entropy. Argue that for any distribution on shuffles T and any distribution on card positions X,

    H(TX) ≥ H(TX|T)         (2.112)
          = H(T^{−1}TX|T)   (2.113)
          = H(X|T)          (2.114)
          = H(X),           (2.115)

if X and T are independent.

Solution: Shuffles increase entropy.

    H(TX) ≥ H(TX|T)         (2.116)
          = H(T^{−1}TX|T)   (2.117)
          = H(X|T)          (2.118)
          = H(X).           (2.119)

The inequality follows from the fact that conditioning reduces entropy, and the first equality follows from the fact that given T, we can reverse the shuffle.

32. Conditional entropy. Under what conditions does H(X|g(Y)) = H(X|Y)?

Solution: Conditional entropy. If H(X|g(Y)) = H(X|Y), then H(X) − H(X|g(Y)) = H(X) − H(X|Y), i.e., I(X; g(Y)) = I(X; Y). This is the condition for equality in the data processing inequality. From the derivation of the inequality, we have equality iff X → g(Y) → Y forms a Markov chain. Hence H(X|g(Y)) = H(X|Y) iff X → g(Y) → Y. This condition includes many special cases, such as g being one-to-one, and X and Y being independent. However, these two special cases do not exhaust all the possibilities.
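A small worked example of the two regimes in problem 32. The variables and the two choices of g below are invented for illustration: g1 is one-to-one (so conditioning on g1(Y) is as good as conditioning on Y), while g2 destroys exactly the information about X:

```python
import math
from collections import defaultdict

def H_cond(joint):
    """H(A|B) in bits, from a joint dict {(a, b): p}."""
    pb = defaultdict(float)
    for (a, b), p in joint.items():
        pb[b] += p
    return -sum(p * math.log2(p / pb[b]) for (a, b), p in joint.items() if p > 0)

# Y uniform on {0,1,2,3}; X = Y mod 2 (a hypothetical example), so H(X|Y) = 0.
joint_xy = {(y % 2, y): 0.25 for y in range(4)}

# g1(y) = 3 - y is a bijection, so H(X | g1(Y)) = H(X | Y):
joint_xg1 = defaultdict(float)
for (x, y), p in joint_xy.items():
    joint_xg1[(x, 3 - y)] += p
assert abs(H_cond(joint_xg1) - H_cond(joint_xy)) < 1e-12

# g2(y) = y // 2 erases the parity of Y, so H(X | g2(Y)) > H(X | Y):
joint_xg2 = defaultdict(float)
for (x, y), p in joint_xy.items():
    joint_xg2[(x, y // 2)] += p
assert H_cond(joint_xg2) > H_cond(joint_xy)
```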
33. Fano's inequality. Let Pr(X = i) = pi, i = 1, 2, ..., m, and let p1 ≥ p2 ≥ p3 ≥ ··· ≥ pm. The minimal probability of error predictor of X is X̂ = 1, with resulting probability of error Pe = 1 − p1. Maximize H(p) subject to the constraint 1 − p1 = Pe to find a bound on Pe in terms of H. This is Fano's inequality in the absence of conditioning.

Solution: Fano's inequality. The minimal probability of error predictor when there is no information is X̂ = 1, the most probable value of X. The probability of error in this case is Pe = 1 − p1. Hence if we fix Pe, we fix p1. We maximize the entropy of X for a given Pe to obtain an upper bound on the entropy for a given Pe. The entropy is

    H(p) = −p1 log p1 − Σ_{i=2}^m pi log pi
         = −(1 − Pe) log(1 − Pe) − Σ_{i=2}^m Pe (pi/Pe) (log(pi/Pe) + log Pe)
         = H(Pe) + Pe H(p2/Pe, p3/Pe, ..., pm/Pe)
         ≤ H(Pe) + Pe log(m − 1),

since the maximum of H(p2/Pe, p3/Pe, ..., pm/Pe) is attained by a uniform distribution. Hence any X that can be predicted with a probability of error Pe must satisfy

    H(X) ≤ H(Pe) + Pe log(m − 1),    (2.124)

which is the unconditional form of Fano's inequality. We can weaken this inequality to obtain an explicit lower bound for Pe,

    Pe ≥ (H(X) − 1) / log(m − 1).    (2.125)
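A numeric spot-check of (2.124) and (2.125); the pmf below is an arbitrary example:

```python
import math

def H(p):
    """Entropy in bits of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def h2(x):
    """Binary entropy H(Pe) in bits."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

p = sorted([0.5, 0.2, 0.15, 0.1, 0.05], reverse=True)   # example pmf, p1 largest
m = len(p)
Pe = 1 - p[0]                                           # error of guessing X = 1
assert H(p) <= h2(Pe) + Pe * math.log2(m - 1) + 1e-12   # (2.124)
assert Pe >= (H(p) - 1) / math.log2(m - 1) - 1e-12      # (2.125)
```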
34. Monotonic convergence of the empirical distribution. Let p̂n denote the empirical probability mass function corresponding to X1, X2, ..., Xn i.i.d. ∼ p(x), x ∈ X. Specifically,

    p̂n(x) = (1/n) Σ_{i=1}^n I(Xi = x)    (2.126)

is the proportion of times that Xi = x in the first n samples, where I is an indicator function.
(a) Show for X binary that

    E D(p̂2n || p) ≤ E D(p̂n || p).    (2.127)

Thus the expected relative entropy "distance" from the empirical distribution to the true distribution decreases with sample size. Hint: Write p̂2n = ½ p̂n + ½ p̂′n and use the convexity of D.
(b) Show for an arbitrary discrete X that

    E D(p̂n || p) ≤ E D(p̂n−1 || p).
Solution: Monotonic convergence of the empirical distribution.
(a) Write p̂2n = ½ p̂n + ½ p̂′n, where p̂n is the empirical distribution of X1, ..., Xn and p̂′n is the empirical distribution of Xn+1, ..., X2n. Then by the convexity of relative entropy,

    D(p̂2n || p) ≤ ½ D(p̂n || p) + ½ D(p̂′n || p).    (2.128)

Taking expectations and using the fact that the Xi's are identically distributed, we get

    E D(p̂2n || p) ≤ E D(p̂n || p).

(b) Define the leave-one-out empirical distributions

    p̂_{n−1}^{(j)}(x) = (1/(n−1)) Σ_{i≠j} I(Xi = x),  j = 1, ..., n.

Summing over j,

    Σ_{j=1}^n p̂_{n−1}^{(j)}(x) = (1/(n−1)) Σ_{j=1}^n Σ_{i≠j} I(Xi = x) = n p̂n(x),

or

    p̂n = (1/n) Σ_{j=1}^n p̂_{n−1}^{(j)}.

By the convexity of D(p||q),

    D(p̂n || p) ≤ (1/n) Σ_{j=1}^n D(p̂_{n−1}^{(j)} || p).
Again taking expectations and using the fact that the D(p̂_{n−1}^{(j)} || p) are identically distributed for all j and hence have the same expected value, we obtain the final result

    E D(p̂n || p) ≤ E D(p̂n−1 || p).

35. Entropy of initial conditions. Prove that H(X0|Xn) increases with n for any Markov chain.

Solution: Entropy of initial conditions. For a Markov chain, by the data processing theorem, we have

    I(X0; Xn−1) ≥ I(X0; Xn).    (2.129)

Therefore

    H(X0) − H(X0|Xn−1) ≥ H(X0) − H(X0|Xn),    (2.130)

or H(X0|Xn) increases with n.
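The monotonicity of H(X0|Xn) can be observed numerically on a small chain; the initial distribution and transition matrix below are arbitrary illustrative choices:

```python
import math

def H_cond(joint):
    """H(A|B) in bits, from a joint dict {(a, b): p}."""
    pb = {}
    for (a, b), p in joint.items():
        pb[b] = pb.get(b, 0.0) + p
    return -sum(p * math.log2(p / pb[b]) for (a, b), p in joint.items() if p > 0)

p0 = {0: 0.8, 1: 0.2}                                # initial distribution of X0
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}       # transition matrix p(x'|x)

def joint_after(n):
    """Joint pmf of (X0, Xn) via n-step transition probabilities."""
    step = {a: dict(P[a]) for a in P}                # one-step transitions
    for _ in range(n - 1):
        step = {a: {c: sum(step[a][b] * P[b][c] for b in P) for c in P} for a in P}
    return {(a, c): p0[a] * step[a][c] for a in P for c in P}

hs = [H_cond(joint_after(n)) for n in range(1, 6)]   # H(X0|X1), ..., H(X0|X5)
assert all(hs[i] <= hs[i + 1] + 1e-12 for i in range(len(hs) - 1))
```

As n grows, H(X0|Xn) climbs toward H(X0): the chain forgets its initial condition.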