
Chapter 2 Entropy, Relative Entropy and Mutual Information

1. Coin flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips required.

(a) Find the entropy H(X) in bits. The following expressions may be useful:

    Σ_{n=1}^∞ r^n = r/(1 - r),        Σ_{n=1}^∞ n r^n = r/(1 - r)^2.

(b) A random variable X is drawn according to this distribution. Find an "efficient" sequence of yes-no questions of the form, "Is X contained in the set S?" Compare H(X) to the expected number of questions required to determine X.

Solution:

(a) The number X of tosses until the first head appears has the geometric distribution with parameter p = 1/2, where P(X = n) = p q^{n-1}, n ∈ {1, 2, ...}. Hence the entropy of X is

    H(X) = -Σ_{n=1}^∞ p q^{n-1} log(p q^{n-1})
         = -[ Σ_{n=0}^∞ p q^n log p + Σ_{n=0}^∞ n p q^n log q ]
         = (-p log p)/(1 - q) - (p q log q)/p^2
         = (-p log p - q log q)/p
         = H(p)/p bits.

If p = 1/2, then H(X) = 2 bits.
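As a quick numerical check (my own addition, not part of the original solution), the sketch below truncates the infinite sum and verifies that the entropy of the geometric distribution equals H(p)/p, which is 2 bits at p = 1/2. The function names are mine.

```python
import math

def geometric_entropy(p, terms=500):
    """Entropy (bits) of P(X=n) = p*q^(n-1), n = 1, 2, ..., truncated at `terms`."""
    q = 1 - p
    return -sum(p * q ** (n - 1) * math.log2(p * q ** (n - 1)) for n in range(1, terms))

def binary_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.5, 0.25):
    # The closed form derived above is H(p)/p; both columns should agree.
    print(p, round(geometric_entropy(p), 6), round(binary_entropy(p) / p, 6))
# p = 0.5 gives 2.0 bits, matching the solution.
```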



(b) Intuitively, it seems clear that the best questions are those that have equally likely chances of receiving a yes or a no answer. Consequently, one possible guess is that the most "efficient" series of questions is: Is X = 1? If not, is X = 2? If not, is X = 3? ... with a resulting expected number of questions equal to Σ_{n=1}^∞ n (1/2^n) = 2. This should reinforce the intuition that H(X) is a measure of the uncertainty of X. Indeed in this case, the entropy is exactly the same as the average number of questions needed to define X, and in general E(number of questions) ≥ H(X). This problem has an interpretation as a source coding problem. Let 0 = no, 1 = yes, X = Source, and Y = Encoded Source. Then the set of questions in the above procedure can be written as a collection of (X, Y) pairs: (1, 1), (2, 01), (3, 001), etc. In fact, this intuitively derived code is the optimal (Huffman) code minimizing the expected number of questions.

2. Entropy of functions. Let X be a random variable taking on a finite number of values. What is the (general) inequality relationship of H(X) and H(Y) if
(a) Y = 2^X?
(b) Y = cos X?

Solution: Let y = g(x). Then

    p(y) = Σ_{x: y = g(x)} p(x).

Consider any set of x's that map onto a single y. For this set,

    Σ_{x: y = g(x)} p(x) log p(x) ≤ Σ_{x: y = g(x)} p(x) log p(y) = p(y) log p(y),

since log is a monotone increasing function and p(x) ≤ Σ_{x: y = g(x)} p(x) = p(y). Extending this argument to the entire range of X (and Y), we obtain

    H(X) = -Σ_x p(x) log p(x) = -Σ_y Σ_{x: y = g(x)} p(x) log p(x) ≥ -Σ_y p(y) log p(y) = H(Y),

with equality iff g is one-to-one with probability one.

(a) Y = 2^X is one-to-one and hence the entropy, which is just a function of the probabilities (and not the values of a random variable), does not change, i.e., H(X) = H(Y).
(b) Y = cos(X) is not necessarily one-to-one. Hence all that we can say is that H(X) ≥ H(Y), with equality if cosine is one-to-one on the range of X.
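To make part (b) concrete, here is a small illustrative check (my own example, not from the original text): for X uniform on {-1, 0, 1}, Y = cos X collapses -1 and 1 to the same value, so H(Y) < H(X).

```python
import math
from collections import defaultdict

def entropy(probs):
    """Entropy in bits of a pmf given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

px = {x: 1 / 3 for x in (-1, 0, 1)}          # X uniform on {-1, 0, 1}
py = defaultdict(float)
for x, p in px.items():
    py[round(math.cos(x), 12)] += p          # push the mass through y = cos(x)

print(round(entropy(px.values()), 3))        # H(X) = log2(3) ≈ 1.585 bits
print(round(entropy(py.values()), 3))        # H(Y) ≈ 0.918 bits, since cos(-1) = cos(1)
```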


3. Minimum entropy. What is the minimum value of H(p_1, ..., p_n) = H(p) as p ranges over the set of n-dimensional probability vectors? Find all p's which achieve this minimum.

Solution: We wish to find all probability vectors p = (p_1, p_2, ..., p_n) which minimize

    H(p) = -Σ_i p_i log p_i.

Now -p_i log p_i ≥ 0, with equality iff p_i = 0 or 1. Hence the only possible probability vectors which minimize H(p) are those with p_i = 1 for some i and p_j = 0 for j ≠ i. There are n such vectors, i.e., (1, 0, ..., 0), (0, 1, 0, ..., 0), ..., (0, ..., 0, 1), and the minimum value of H(p) is 0.

4. Axiomatic definition of entropy. If we assume certain axioms for our measure of information, then we will be forced to use a logarithmic measure like entropy. Shannon used this to justify his initial definition of entropy. In this book, we will rely more on the other properties of entropy rather than its axiomatic derivation to justify its use. The following problem is considerably more difficult than the other problems in this section.

If a sequence of symmetric functions H_m(p_1, p_2, ..., p_m) satisfies the following properties,

Normalization: H_2(1/2, 1/2) = 1,
Continuity: H_2(p, 1-p) is a continuous function of p,
Grouping: H_m(p_1, p_2, ..., p_m) = H_{m-1}(p_1 + p_2, p_3, ..., p_m) + (p_1 + p_2) H_2(p_1/(p_1+p_2), p_2/(p_1+p_2)),

prove that H_m must be of the form

    H_m(p_1, p_2, ..., p_m) = -Σ_{i=1}^m p_i log p_i,   m = 2, 3, ....   (2.1)

Let f(m) be the entropy of a uniform distribution on m symbols, i.e., f(m) = H_m(1/m, 1/m, ..., 1/m). There are various other axiomatic formulations which also result in the same definition of entropy. See, for example, the book by Csiszár and Körner [3].

Solution: Axiomatic definition of entropy. This is a long solution, so we will first outline what we plan to do. First we will extend the grouping axiom by induction and prove that

    H_m(p_1, p_2, ..., p_m) = H_{m-k}(p_1 + p_2 + ⋯ + p_k, p_{k+1}, ..., p_m)
        + (p_1 + p_2 + ⋯ + p_k) H_k(p_1/(p_1 + ⋯ + p_k), ..., p_k/(p_1 + ⋯ + p_k)).   (2.2)

We will then show that for any two integers r and s, f(rs) = f(r) + f(s). We use this to show that f(m) = log m. We then show for rational p = r/s that

    H_2(p, 1-p) = -p log p - (1-p) log(1-p).   (2.3)

By continuity, we will extend it to irrational p, and finally by induction and grouping, we will extend the result to H_m for m ≥ 2.

To begin, we extend the grouping axiom. For convenience in notation, we will let

    S_k = Σ_{i=1}^k p_i   (2.4)

and we will denote H_2(q, 1-q) as h(q). Then we can write the grouping axiom as

    H_m(p_1, ..., p_m) = H_{m-1}(S_2, p_3, ..., p_m) + S_2 h(p_2/S_2).   (2.5)

Applying the grouping axiom again, we have

    H_m(p_1, ..., p_m) = H_{m-1}(S_2, p_3, ..., p_m) + S_2 h(p_2/S_2)   (2.6)
                       = H_{m-2}(S_3, p_4, ..., p_m) + S_3 h(p_3/S_3) + S_2 h(p_2/S_2)   (2.7)
                         ⋮   (2.8)
                       = H_{m-(k-1)}(S_k, p_{k+1}, ..., p_m) + Σ_{i=2}^k S_i h(p_i/S_i).   (2.9)

Now, we apply the same grouping axiom repeatedly to H_k(p_1/S_k, ..., p_k/S_k), to obtain

    H_k(p_1/S_k, ..., p_k/S_k) = H_2(S_{k-1}/S_k, p_k/S_k) + Σ_{i=2}^{k-1} (S_i/S_k) h((p_i/S_k)/(S_i/S_k))   (2.10)
                               = (1/S_k) Σ_{i=2}^k S_i h(p_i/S_i).   (2.11)

From (2.9) and (2.11), it follows that

    H_m(p_1, ..., p_m) = H_{m-k}(S_k, p_{k+1}, ..., p_m) + S_k H_k(p_1/S_k, ..., p_k/S_k),   (2.12)

which is the extended grouping axiom. Now we need to use an axiom that is not explicitly stated in the text, namely that the function H_m is symmetric with respect to its arguments. Using this, we can combine any set of arguments of H_m using the extended grouping axiom.

Let f(m) denote H_m(1/m, 1/m, ..., 1/m). Consider

    f(mn) = H_{mn}(1/mn, 1/mn, ..., 1/mn).   (2.13)

By repeatedly applying the extended grouping axiom, we have

    f(mn) = H_{mn}(1/mn, ..., 1/mn)   (2.14)
          = H_{mn-n}(1/m, 1/mn, ..., 1/mn) + (1/m) H_n(1/n, ..., 1/n)   (2.15)
          = H_{mn-2n}(1/m, 1/m, 1/mn, ..., 1/mn) + (2/m) H_n(1/n, ..., 1/n)   (2.16)
            ⋮   (2.17)
          = H_m(1/m, ..., 1/m) + H_n(1/n, ..., 1/n)   (2.18)
          = f(m) + f(n).   (2.19)

We can immediately use this to conclude that f(m^k) = k f(m).

Now, we will argue that H_2(1, 0) = h(1) = 0. We do this by expanding H_3(p_1, p_2, 0) (with p_1 + p_2 = 1) in two different ways using the grouping axiom:

    H_3(p_1, p_2, 0) = H_2(p_1, p_2) + p_2 H_2(1, 0)   (2.20)
                     = H_2(1, 0) + (p_1 + p_2) H_2(p_1, p_2).   (2.21)

Thus p_2 H_2(1, 0) = H_2(1, 0) for all p_2, and therefore H_2(1, 0) = 0.

We will also need to show that f(m+1) - f(m) → 0 as m → ∞. To prove this, we use the extended grouping axiom and write

    f(m+1) = H_{m+1}(1/(m+1), ..., 1/(m+1))   (2.22)
           = h(1/(m+1)) + (m/(m+1)) H_m(1/m, ..., 1/m)   (2.23)
           = h(1/(m+1)) + (m/(m+1)) f(m),   (2.24)

and therefore

    f(m+1) - (m/(m+1)) f(m) = h(1/(m+1)).   (2.25)

Thus lim [f(m+1) - (m/(m+1)) f(m)] = lim h(1/(m+1)). But by the continuity of H_2, it follows that the limit on the right is h(0) = 0. Thus lim h(1/(m+1)) = 0.

Let us define

    a_{n+1} = f(n+1) - f(n)   (2.26)

and

    b_n = h(1/n).   (2.27)

Then

    a_{n+1} = -(1/(n+1)) f(n) + b_{n+1}   (2.28)
            = -(1/(n+1)) Σ_{i=2}^n a_i + b_{n+1},   (2.29)

and therefore

    (n+1) b_{n+1} = (n+1) a_{n+1} + Σ_{i=2}^n a_i.   (2.30)

Therefore, summing over n, we have

    Σ_{n=2}^N n b_n = Σ_{n=2}^N (n a_n + a_{n-1} + ⋯ + a_2) = N Σ_{i=2}^N a_i.   (2.31)

Dividing both sides by Σ_{n=2}^N n ≈ N(N+1)/2, we obtain

    (2/(N+1)) Σ_{n=2}^N a_n = ( Σ_{n=2}^N n b_n ) / ( Σ_{n=2}^N n ).   (2.32)

Now by continuity of H_2 and the definition of b_n, it follows that b_n → 0 as n → ∞. Since the right hand side is essentially an average of the b_n's, it also goes to 0 (this can be proved more precisely using ε's and δ's). Thus the left hand side goes to 0. We can then see that

    a_{N+1} = b_{N+1} - (1/(N+1)) Σ_{n=2}^N a_n   (2.33)

also goes to 0 as N → ∞. Thus

    f(n+1) - f(n) → 0 as n → ∞.   (2.34)

We will now prove the following lemma.

Lemma 2.0.1 Let the function f(m) satisfy the following assumptions:
- f(mn) = f(m) + f(n) for all integers m, n,
- lim_{n→∞} (f(n+1) - f(n)) = 0,
- f(2) = 1;
then the function f(m) = log_2 m.

Proof of the lemma: Let P be an arbitrary prime number and let

    g(n) = f(n) - (f(P)/log_2 P) log_2 n.   (2.35)

Then g(n) satisfies the first assumption of the lemma. Also g(P) = 0. Also if we let

    α_n = g(n+1) - g(n) = f(n+1) - f(n) + (f(P)/log_2 P) log_2 (n/(n+1)),   (2.36)

then the second assumption in the lemma implies that lim α_n = 0.

For an integer n, define

    n^(1) = ⌊n/P⌋.   (2.37)

Then it follows that n^(1) < n/P, and

    n = n^(1) P + l,   (2.38)

where 0 ≤ l < P. From the fact that g(P) = 0, it follows that g(P n^(1)) = g(n^(1)), and

    g(n) = g(n^(1)) + g(n) - g(P n^(1)) = g(n^(1)) + Σ_{i = P n^(1)}^{n-1} α_i.   (2.39)

Just as we have defined n^(1) from n, we can define n^(2) from n^(1). Continuing this process, we can then write

    g(n) = g(n^(k)) + Σ_{j=1}^k ( Σ_{i = P n^(j)}^{n^(j-1) - 1} α_i ).   (2.40)

Since n^(k) ≤ n/P^k, after k = (log n / log P) + 1 terms we have n^(k) = 0, and g(0) = 0 (this follows directly from the additive property of g). Thus we can write g(n) as a sum of b_n of the α_i's,

    g(n) = Σ_{i=1}^{b_n} α_{t_i},   (2.42)

where

    b_n ≤ P ( (log n / log P) + 1 ).   (2.43)

Since α_n → 0, it follows that g(n)/log_2 n → 0, since g(n) is the sum of at most O(log n) terms, each of which goes to 0. Thus it follows that

    lim_{n→∞} f(n)/log_2 n = f(P)/log_2 P.   (2.44)

Since P was arbitrary, it follows that f(P)/log_2 P = c for every prime number P. Applying the third axiom in the lemma, it follows that the constant is 1, and f(P) = log_2 P. For composite numbers N = P_1 P_2 ⋯ P_l, we can apply the first property of f and the prime number factorization of N to show that

    f(N) = Σ_i f(P_i) = Σ_i log_2 P_i = log_2 N.   (2.45)

Thus the lemma is proved.

The lemma can be simplified considerably if, instead of the second assumption, we replace it by the assumption that f(n) is monotone in n. We will now argue that the

only function f(m) such that f(mn) = f(m) + f(n) for all integers m, n is of the form f(m) = log_a m for some base a. Let c = f(2). Now f(4) = f(2 · 2) = f(2) + f(2) = 2c. Similarly, it is easy to see that f(2^k) = kc = c log_2 2^k. We will extend this to integers that are not powers of 2. For any integer m, let r > 0 be another integer and let 2^k ≤ m^r < 2^{k+1}. Then by the monotonicity assumption on f, we have

    kc ≤ r f(m) < (k+1)c   (2.46)

or

    (k/r) c ≤ f(m) < ((k+1)/r) c.   (2.47)

Now by the monotonicity of log, we have

    k/r ≤ log_2 m < (k+1)/r.   (2.48)

Combining these two equations, we obtain

    | f(m)/c - log_2 m | < 1/r.   (2.49)

Since r was arbitrary, we must have

    f(m) = c log_2 m,   (2.50)

and we can identify c = 1 from the last assumption of the lemma.

Now we are almost done. We have shown that for any uniform distribution on m outcomes, f(m) = H_m(1/m, ..., 1/m) = log_2 m. We will now show that

    H_2(p, 1-p) = -p log p - (1-p) log(1-p).   (2.51)

To begin, let p be a rational number, r/s, say. Consider the extended grouping axiom for H_s:

    f(s) = H_s(1/s, ..., 1/s) = H_{r+1}(1/s, ..., 1/s, (s-r)/s) + ((s-r)/s) f(s-r)   (2.52)
         = H_2(r/s, (s-r)/s) + (r/s) f(r) + ((s-r)/s) f(s-r).   (2.53)

Substituting f(s) = log_2 s, etc., we obtain

    H_2(r/s, (s-r)/s) = -(r/s) log_2 (r/s) - (1 - r/s) log_2 (1 - r/s).   (2.54)

Thus (2.51) is true for rational p. By the continuity assumption, (2.51) is also true at irrational p.

To complete the proof, we have to extend the definition from H_2 to H_m, i.e., we have to show that

    H_m(p_1, ..., p_m) = -Σ_i p_i log p_i   (2.55)

for all m. This is a straightforward induction. We have just shown that this is true for m = 2. Now assume that it is true for m = n - 1. By the grouping axiom,

    H_n(p_1, ..., p_n) = H_{n-1}(p_1 + p_2, p_3, ..., p_n) + (p_1 + p_2) H_2(p_1/(p_1+p_2), p_2/(p_1+p_2))   (2.56)
        = -(p_1 + p_2) log(p_1 + p_2) - Σ_{i=3}^n p_i log p_i
          - (p_1 + p_2) [ (p_1/(p_1+p_2)) log(p_1/(p_1+p_2)) + (p_2/(p_1+p_2)) log(p_2/(p_1+p_2)) ]   (2.57)
        = -Σ_{i=1}^n p_i log p_i.   (2.58)

Thus the statement is true for m = n, and by induction, it is true for all m. Thus we have finally proved that the only symmetric function that satisfies the axioms is

    H_m(p_1, ..., p_m) = -Σ_{i=1}^m p_i log p_i.   (2.61)

The proof above is due to Rényi [10].

5. Entropy of functions of a random variable. Let X be a discrete random variable. Show that the entropy of a function of X is less than or equal to the entropy of X by justifying the following steps:

    H(X, g(X)) =(a) H(X) + H(g(X) | X)   (2.62)
               =(b) H(X);   (2.63)
    H(X, g(X)) =(c) H(g(X)) + H(X | g(X))   (2.64)
               ≥(d) H(g(X)).   (2.65)

Thus H(g(X)) ≤ H(X).

Solution: Entropy of functions of a random variable.
(a) H(X, g(X)) = H(X) + H(g(X)|X) by the chain rule for entropies.
(b) H(g(X)|X) = 0 since for any particular value of X, g(X) is fixed, and hence H(g(X)|X) = Σ_x p(x) H(g(X)|X = x) = Σ_x 0 = 0.
(c) H(X, g(X)) = H(g(X)) + H(X|g(X)) again by the chain rule.
(d) H(X|g(X)) ≥ 0, with equality iff X is a function of g(X), i.e., g(·) is one-to-one. Hence H(X, g(X)) ≥ H(g(X)).



Combining parts (b) and (d), we obtain H(X) ≥ H(g(X)).

6. Zero conditional entropy. Show that if H(Y|X) = 0, then Y is a function of X, i.e., for all x with p(x) > 0, there is only one possible value of y with p(x, y) > 0.

Solution: Zero Conditional Entropy. Assume that there exists an x, say x_0, and two different values of y, say y_1 and y_2, such that p(x_0, y_1) > 0 and p(x_0, y_2) > 0. Then p(x_0) ≥ p(x_0, y_1) + p(x_0, y_2) > 0, and p(y_1|x_0) and p(y_2|x_0) are not equal to 0 or 1. Thus

    H(Y|X) = -Σ_x p(x) Σ_y p(y|x) log p(y|x)   (2.66)
           ≥ p(x_0) ( -p(y_1|x_0) log p(y_1|x_0) - p(y_2|x_0) log p(y_2|x_0) )   (2.67)
           > 0,   (2.68)

since -t log t ≥ 0 for 0 ≤ t ≤ 1, and is strictly positive for t not equal to 0 or 1. Therefore the conditional entropy H(Y|X) is 0 if and only if Y is a function of X.

7. Pure randomness and bent coins. Let X_1, X_2, ..., X_n denote the outcomes of independent flips of a bent coin. Thus Pr{X_i = 1} = p, Pr{X_i = 0} = 1 - p, where p is unknown. We wish to obtain a sequence Z_1, Z_2, ..., Z_K of fair coin flips from X_1, X_2, ..., X_n. Toward this end let f : X^n → {0, 1}*, where {0, 1}* = {Λ, 0, 1, 00, 01, ...} is the set of all finite length binary sequences, be a mapping f(X_1, X_2, ..., X_n) = (Z_1, Z_2, ..., Z_K), where Z_i ~ Bernoulli(1/2), and K may depend on (X_1, ..., X_n). In order that the sequence Z_1, Z_2, ... appear to be fair coin flips, the map f from bent coin flips to fair flips must have the property that all 2^k sequences (Z_1, Z_2, ..., Z_k) of a given length k have equal probability (possibly 0), for k = 1, 2, .... For example, for n = 2, the map f(01) = 0, f(10) = 1, f(00) = f(11) = Λ (the null string) has the property that Pr{Z_1 = 1 | K = 1} = Pr{Z_1 = 0 | K = 1} = 1/2. Give reasons for the following inequalities:

    nH(p) =(a) H(X_1, ..., X_n) ≥(b) H(Z_1, Z_2, ..., Z_K, K) =(c) H(K) + H(Z_1, ..., Z_K | K) =(d) H(K) + E(K) ≥(e) EK.

Thus no more than nH(p) fair coin tosses can be derived from (X_1, ..., X_n), on the average.
(f) Exhibit a good map f on sequences of length 4.

Solution: Pure randomness and bent coins.



    nH(p) =(a) H(X_1, ..., X_n) ≥(b) H(Z_1, Z_2, ..., Z_K) =(c) H(Z_1, Z_2, ..., Z_K, K) =(d) H(K) + H(Z_1, ..., Z_K | K) =(e) H(K) + E(K) ≥(f) EK.

(a) Since X_1, X_2, ..., X_n are i.i.d. with probability of X_i = 1 being p, the entropy H(X_1, X_2, ..., X_n) is nH(p).
(b) Z_1, ..., Z_K is a function of X_1, X_2, ..., X_n, and since the entropy of a function of a random variable is less than the entropy of the random variable, H(Z_1, ..., Z_K) ≤ H(X_1, X_2, ..., X_n).
(c) K is a function of Z_1, Z_2, ..., Z_K, so its conditional entropy given Z_1, Z_2, ..., Z_K is 0. Hence H(Z_1, Z_2, ..., Z_K, K) = H(Z_1, ..., Z_K) + H(K | Z_1, Z_2, ..., Z_K) = H(Z_1, Z_2, ..., Z_K).
(d) Follows from the chain rule for entropy.
(e) By assumption, Z_1, Z_2, ..., Z_K are pure random bits (given K), with entropy 1 bit per symbol. Hence

    H(Z_1, Z_2, ..., Z_K | K) = Σ_k p(K = k) H(Z_1, Z_2, ..., Z_k | K = k)   (2.69)
                              = Σ_k p(k) k   (2.70)
                              = EK.   (2.71)

(f) Follows from the non-negativity of discrete entropy.
(g) Since we do not know p, the only way to generate pure random bits is to use the fact that all sequences with the same number of ones are equally likely. For example, the sequences 0001, 0010, 0100 and 1000 are equally likely and can be used to generate 2 pure random bits. An example of a mapping to generate random bits is

    0000 → Λ     1111 → Λ
    0001 → 00    0010 → 01    0100 → 10    1000 → 11
    1110 → 11    1101 → 10    1011 → 01    0111 → 00
    0011 → 00    0110 → 01    1100 → 10    1001 → 11
    1010 → 0     0101 → 1                                   (2.72)
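As a quick sanity check on the table above (my own illustration, not part of the original solution), the sketch below verifies that, conditioned on output length K = k, all 2^k output strings are equally likely for any p, and computes EK.

```python
from collections import defaultdict

# The length-4 map from the table above (empty string = null output).
f = {"0000": "", "1111": "",
     "0001": "00", "0010": "01", "0100": "10", "1000": "11",
     "1110": "11", "1101": "10", "1011": "01", "0111": "00",
     "0011": "00", "0110": "01", "1100": "10", "1001": "11",
     "1010": "0",  "0101": "1"}

def check(p):
    q = 1 - p
    prob = lambda s: p ** s.count("1") * q ** s.count("0")
    by_output = defaultdict(float)
    for x, z in f.items():
        by_output[z] += prob(x)
    # Conditioned on output length k, all 2^k strings must be equally likely.
    for k in (1, 2):
        probs = [v for z, v in by_output.items() if len(z) == k]
        assert max(probs) - min(probs) < 1e-12
    EK = sum(prob(x) * len(z) for x, z in f.items())
    return EK, 8*p*q**3 + 10*p**2*q**2 + 8*p**3*q   # closed form derived just below

print(check(0.5))   # (1.625, 1.625)
print(check(0.3))   # the two entries agree for any p
```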



The resulting expected number of bits is

    EK = 4pq^3 · 2 + 4p^3 q · 2 + 4p^2 q^2 · 2 + 2p^2 q^2 · 1   (2.73)
       = 8pq^3 + 10p^2 q^2 + 8p^3 q.   (2.74)

For example, for p ≈ 1/2, the expected number of pure random bits is close to 1.625. This is substantially less than the 4 pure random bits that could be generated if p were exactly 1/2.

We will now analyze the efficiency of this scheme of generating random bits for long sequences of bent coin flips. Let n be the number of bent coin flips. The algorithm that we will use is the obvious extension of the above method of generating pure bits using the fact that all sequences with the same number of ones are equally likely.

Consider all sequences with k ones. There are C(n, k) such sequences, which are all equally likely. If C(n, k) were a power of 2, then we could generate log C(n, k) pure random bits from such a set. However, in the general case, C(n, k) is not a power of 2, and the best we can do is to divide the set of C(n, k) elements into subsets of sizes which are powers of 2. The largest set would have size 2^⌊log C(n,k)⌋ and could be used to generate ⌊log C(n, k)⌋ random bits. We could divide the remaining elements into the largest set which is a power of 2, etc. The worst case would occur when C(n, k) = 2^{l+1} - 1, in which case the subsets would be of sizes 2^l, 2^{l-1}, 2^{l-2}, ..., 1.

Instead of analyzing the scheme exactly, we will just find a lower bound on the number of random bits generated from a set of size C(n, k). Let l = ⌊log C(n, k)⌋. Then at least half of the elements belong to a set of size 2^l and would generate l random bits, at least a quarter belong to a set of size 2^{l-1} and generate l - 1 random bits, etc. On the average, the number of bits generated is

    E[K | k 1's in sequence] ≥ (1/2) l + (1/4)(l-1) + ⋯ + (1/2^l) · 1   (2.75)
                             = l - 1 + 2^{-l}   (2.76)
                             ≥ l - 1,   (2.77)

where (2.76) follows by summing the geometric-weighted series exactly. Hence the fact that C(n, k) is not a power of 2 will cost at most 1 bit on the average in the number of random bits that are produced. Hence, the expected number of pure random bits produced by this algorithm is

    EK ≥ Σ_{k=0}^n C(n, k) p^k q^{n-k} ( ⌊log C(n, k)⌋ - 1 )   (2.78)
       ≥ Σ_{k=0}^n C(n, k) p^k q^{n-k} ( log C(n, k) - 2 )   (2.79)
       = Σ_{k=0}^n C(n, k) p^k q^{n-k} log C(n, k) - 2   (2.80)
       ≥ Σ_{k = n(p-ε)}^{n(p+ε)} C(n, k) p^k q^{n-k} log C(n, k) - 2.   (2.81)

Now for sufficiently large n, the probability that the number of 1's in the sequence is close to np is near 1 (by the weak law of large numbers). For such sequences, k/n is close to p, and hence there exists a δ such that

    C(n, k) ≥ 2^{n(H(k/n) - δ)} ≥ 2^{n(H(p) - 2δ)},   (2.82)

using Stirling's approximation for the binomial coefficients and the continuity of the entropy function. If we assume that n is large enough so that the probability that n(p - ε) ≤ k ≤ n(p + ε) is greater than 1 - ε, then we see that EK ≥ (1 - ε) n (H(p) - 2δ) - 2, which is very good since nH(p) is an upper bound on the number of pure random bits that can be produced from the bent coin sequence.

8. World Series. The World Series is a seven-game series that terminates as soon as either team wins four games. Let X be the random variable that represents the outcome of a World Series between teams A and B; possible values of X are AAAA, BABABAB, and BBBAAAA. Let Y be the number of games played, which ranges from 4 to 7. Assuming that A and B are equally matched and that the games are independent, calculate H(X), H(Y), H(Y|X), and H(X|Y).
Solution: World Series. Two teams play until one of them has won 4 games.

There are 2 (AAAA, BBBB) World Series with 4 games. Each happens with probability (1/2)^4.
There are 8 = 2 C(4, 3) World Series with 5 games. Each happens with probability (1/2)^5.
There are 20 = 2 C(5, 3) World Series with 6 games. Each happens with probability (1/2)^6.
There are 40 = 2 C(6, 3) World Series with 7 games. Each happens with probability (1/2)^7.

The probability of a 4-game series (Y = 4) is 2(1/2)^4 = 1/8.
The probability of a 5-game series (Y = 5) is 8(1/2)^5 = 1/4.
The probability of a 6-game series (Y = 6) is 20(1/2)^6 = 5/16.
The probability of a 7-game series (Y = 7) is 40(1/2)^7 = 5/16.

    H(X) = Σ p(x) log(1/p(x)) = 2(1/16) log 16 + 8(1/32) log 32 + 20(1/64) log 64 + 40(1/128) log 128 = 5.8125 bits.

    H(Y) = Σ p(y) log(1/p(y)) = (1/8) log 8 + (1/4) log 4 + (5/16) log(16/5) + (5/16) log(16/5) = 1.924 bits.

H(Y|X) = 0, since Y is a deterministic function of X: if you know X, there is no randomness in Y. Since H(X) + H(Y|X) = H(X, Y) = H(Y) + H(X|Y), it is easy to determine

    H(X|Y) = H(X) + H(Y|X) - H(Y) = 3.889 bits.
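As a numerical cross-check (my own addition, not part of the original solution), the short enumeration below lists every possible series, builds the distributions of X and Y, and confirms H(X) ≈ 5.8125, H(Y) ≈ 1.924, and H(X|Y) ≈ 3.889 bits.

```python
import math
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Enumerate all win patterns, truncating each at the point a team reaches 4 wins.
series = set()
for pattern in product("AB", repeat=7):
    wins = {"A": 0, "B": 0}
    outcome = []
    for g in pattern:
        wins[g] += 1
        outcome.append(g)
        if wins[g] == 4:
            break
    series.add("".join(outcome))

# Each game is a fair independent coin flip, so P(series x) = (1/2)^len(x).
px = {x: 0.5 ** len(x) for x in series}
py = {}
for x, p in px.items():
    py[len(x)] = py.get(len(x), 0) + p

HX, HY = entropy(px.values()), entropy(py.values())
print(HX, HY, HX - HY)   # 5.8125, 1.924, 3.889  (H(X|Y) = H(X) - H(Y) since H(Y|X) = 0)
```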

9. Infinite entropy. This problem shows that the entropy of a discrete random variable can be infinite. Let A = Σ_{n=2}^∞ (n log^2 n)^{-1}. (It is easy to show that A is finite by bounding the infinite sum by the integral of (x log^2 x)^{-1}.) Show that the integer-valued random variable X defined by Pr(X = n) = (A n log^2 n)^{-1} for n = 2, 3, ... has H(X) = +∞.

Solution: Infinite entropy. By definition, p_n = Pr(X = n) = 1/(A n log^2 n) for n ≥ 2. Therefore

    H(X) = -Σ_{n=2}^∞ p(n) log p(n)
         = -Σ_{n=2}^∞ (1/(A n log^2 n)) log(1/(A n log^2 n))
         = Σ_{n=2}^∞ log(A n log^2 n) / (A n log^2 n)
         = Σ_{n=2}^∞ (log A + log n + 2 log log n) / (A n log^2 n)
         = log A + Σ_{n=2}^∞ 1/(A n log n) + Σ_{n=2}^∞ (2 log log n)/(A n log^2 n).

The first term is finite. For base 2 logarithms, all the elements in the sum in the last term are nonnegative. (For any other base, the terms of the last sum eventually all become positive.) So all we have to do is bound the middle sum, which we do by comparing with an integral:

    Σ_{n=2}^∞ 1/(A n log n) > ∫_2^∞ 1/(A x log x) dx = K ln ln x |_2^∞ = +∞.

We conclude that H(X) = +∞.

10. Conditional mutual information vs. unconditional mutual information. Give examples of joint random variables X, Y and Z such that
(a) I(X; Y | Z) < I(X; Y),



(b) I(X; Y | Z) > I(X; Y).

Solution: Conditional mutual information vs. unconditional mutual information.

(a) The last corollary to Theorem 2.8.1 in the text states that if X → Y → Z, that is, if p(x, z | y) = p(x | y) p(z | y), then I(X;Y) ≥ I(X;Y | Z). Equality holds if and only if I(X;Z) = 0, i.e., X and Z are independent.
A simple example of random variables satisfying the inequality conditions above is: X is a fair binary random variable, Y = X and Z = Y. In this case,

    I(X;Y) = H(X) - H(X | Y) = H(X) = 1

and

    I(X;Y | Z) = H(X | Z) - H(X | Y, Z) = 0,

so that I(X;Y) > I(X;Y | Z).

(b) This example is also given in the text. Let X, Y be independent fair binary random variables and let Z = X + Y. In this case we have

    I(X;Y) = 0

and

    I(X;Y | Z) = H(X | Z) = 1/2.

So I(X;Y) < I(X;Y | Z). Note that in this case X, Y, Z do not form a Markov chain.

11. Average entropy. Let H(p) = -p log_2 p - (1-p) log_2(1-p) be the binary entropy function.
(a) Evaluate H(1/4) using the fact that log_2 3 ≈ 1.584. Hint: Consider an experiment with four equally likely outcomes, one of which is more interesting than the others.
(b) Calculate the average entropy H(p) when the probability p is chosen uniformly in the range 0 ≤ p ≤ 1.
(c) (Optional) Calculate the average entropy H(p_1, p_2, p_3) where (p_1, p_2, p_3) is a uniformly distributed probability vector. Generalize to dimension n.

Solution: Average Entropy.

(a) We can generate two bits of information by picking one of four equally likely alternatives. This selection can be made in two steps. First we decide whether the first outcome occurs. Since this has probability 1/4, the information generated is H(1/4). If not the first outcome, then we select one of the three remaining outcomes; with probability 3/4, this produces log_2 3 bits of information. Thus

    H(1/4) + (3/4) log_2 3 = 2,

and so H(1/4) = 2 - (3/4) log_2 3 = 2 - (0.75)(1.585) = 0.811 bits.

(b) If p is chosen uniformly in the range 0 ≤ p ≤ 1, then the average entropy (in nats) is

    -∫_0^1 [ p ln p + (1-p) ln(1-p) ] dp = -2 ∫_0^1 x ln x dx = -2 [ (x^2/2) ln x - x^2/4 ]_0^1 = 1/2.

Therefore the average entropy is (1/2) log_2 e = 1/(2 ln 2) = 0.721 bits.
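A Monte Carlo sketch (my own check, not part of the original solution) of the result in part (b):

```python
import math, random

def Hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0, 1) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

random.seed(0)
N = 500_000
avg = sum(Hb(random.random()) for _ in range(N)) / N
print(round(avg, 3))   # ≈ 0.721 = 1/(2 ln 2) bits
```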

(c) Choosing a uniformly distributed probability vector (p_1, p_2, p_3) is equivalent to choosing a point (p_1, p_2) uniformly from the triangle 0 ≤ p_1 ≤ 1, 0 ≤ p_2 ≤ 1 - p_1. The probability density function has the constant value 2 because the area of the triangle is 1/2. So the average entropy H(p_1, p_2, p_3) (in nats) is

    -2 ∫_0^1 ∫_0^{1-p_1} [ p_1 ln p_1 + p_2 ln p_2 + (1 - p_1 - p_2) ln(1 - p_1 - p_2) ] dp_2 dp_1.

After some enjoyable calculus, we obtain the final result 5/(6 ln 2) = 1.202 bits.

12. Venn diagrams. Using Venn diagrams, we can see that the mutual information common to three random variables X, Y and Z should be defined by

    I(X;Y;Z) = I(X;Y) - I(X;Y|Z).

This quantity is symmetric in X, Y and Z, despite the preceding asymmetric definition. Unfortunately, I(X;Y;Z) is not necessarily nonnegative. Find X, Y and Z such that I(X;Y;Z) < 0, and prove the following two identities:

    I(X;Y;Z) = H(X,Y,Z) - H(X) - H(Y) - H(Z) + I(X;Y) + I(Y;Z) + I(Z;X)
    I(X;Y;Z) = H(X,Y,Z) - H(X,Y) - H(Y,Z) - H(Z,X) + H(X) + H(Y) + H(Z)

The first identity can be understood using the Venn diagram analogy for entropy and mutual information. The second identity follows easily from the first.

Solution: Venn Diagrams. To show the first identity,

    I(X;Y;Z) = I(X;Y) - I(X;Y|Z)                                  (by definition)
             = I(X;Y) - (I(X;Y,Z) - I(X;Z))                        (by the chain rule)
             = I(X;Y) + I(X;Z) - I(X;Y,Z)
             = I(X;Y) + I(X;Z) - (H(X) + H(Y,Z) - H(X,Y,Z))
             = I(X;Y) + I(X;Z) - H(X) + H(X,Y,Z) - H(Y,Z)
             = I(X;Y) + I(X;Z) - H(X) + H(X,Y,Z) - (H(Y) + H(Z) - I(Y;Z))
             = I(X;Y) + I(X;Z) + I(Y;Z) + H(X,Y,Z) - H(X) - H(Y) - H(Z).

To show the second identity, simply substitute for I(X;Y), I(X;Z), and I(Y;Z) using equations like I(X;Y) = H(X) + H(Y) - H(X,Y). These two identities show that I(X;Y;Z) is a symmetric (but not necessarily nonnegative) function of three random variables.

13. Coin weighing. Suppose one has n coins, among which there may or may not be one counterfeit coin. If there is a counterfeit coin, it may be either heavier or lighter than the other coins. The coins are to be weighed by a balance.
(a) Find an upper bound on the number of coins n so that k weighings will find the counterfeit coin (if any) and correctly declare it to be heavier or lighter.
(b) (Difficult) What is the coin weighing strategy for k = 3 weighings and 12 coins?

Solution: Coin weighing.
(a) For n coins, there are 2n + 1 possible situations or "states":
- one of the n coins is heavier,
- one of the n coins is lighter,
- they are all of equal weight.
Each weighing has three possible outcomes: equal, left pan heavier, or right pan heavier. Hence with k weighings, there are 3^k possible outcomes and hence we can distinguish between at most 3^k different "states". Hence 2n + 1 ≤ 3^k or n ≤ (3^k - 1)/2.
Looking at it from an information-theoretic viewpoint, each weighing gives at most log_2 3 bits of information. There are 2n + 1 possible "states", with a maximum entropy of log_2(2n + 1) bits. Hence in this situation, one would require at least log_2(2n + 1)/log_2 3 weighings to extract enough information for determination of the odd coin, which gives the same result as above.
(b) There are many solutions to this problem. We will give one which is based on the ternary number system. We may express the numbers {-12, -11, ..., -1, 0, 1, ..., 12} in a ternary number system with alphabet {-1, 0, 1}. For example, the number 8 is (-1, 0, 1), where -1·3^0 + 0·3^1 + 1·3^2 = 8. We form the matrix with the representation of the positive numbers as its columns:

           1   2   3   4   5   6   7   8   9  10  11  12   row sum
    3^0    1  -1   0   1  -1   0   1  -1   0   1  -1   0   Σ_1 = 0
    3^1    0   1   1   1  -1  -1  -1   0   0   0   1   1   Σ_2 = 2
    3^2    0   0   0   0   1   1   1   1   1   1   1   1   Σ_3 = 8

Note that the row sums are not all zero. We can negate some columns to make the row sums zero. For example, negating columns 7, 9, 11 and 12, we obtain

           1   2   3   4   5   6   7   8   9  10  11  12   row sum
    3^0    1  -1   0   1  -1   0  -1  -1   0   1   1   0   Σ_1 = 0
    3^1    0   1   1   1  -1  -1   1   0   0   0  -1  -1   Σ_2 = 0
    3^2    0   0   0   0   1   1  -1   1  -1   1  -1  -1   Σ_3 = 0

Now place the coins on the balance according to the following rule. For weighing #i, place coin n:
- on the left pan, if n_i = -1;
- aside, if n_i = 0;


- on the right pan, if n_i = 1.
The outcome of the three weighings will find the odd coin if any and tell whether it is heavy or light. The result of each weighing is 0 if both pans are equal, -1 if the left pan is heavier, and 1 if the right pan is heavier. Then the three weighings give the ternary expansion of the index of the odd coin. If the expansion is the same as the expansion in the matrix, it indicates that the coin is heavier. If the expansion is of the opposite sign, the coin is lighter. For example, (0, -1, -1) indicates (0)3^0 + (-1)3 + (-1)3^2 = -12, hence coin #12 is heavy; (1, 0, -1) indicates #8 is light; (0, 0, 0) indicates no odd coin.
Why does this scheme work? It is a single error correcting Hamming code for the ternary alphabet (discussed in Section 8.11 in the book). Here are some details. First note a few properties of the matrix above that was used for the scheme. All the columns are distinct and no two columns add to (0, 0, 0). Also, if any coin is heavier, it will produce the sequence of weighings that matches its column in the matrix. If it is lighter, it produces the negative of its column as a sequence of weighings. Combining all these facts, we can see that any single odd coin will produce a unique sequence of weighings, and that the coin can be determined from the sequence.
One of the questions that many of you had was whether the bound derived in part (a) is actually achievable. For example, can one distinguish 13 coins in 3 weighings? No, not with a scheme like the one above. Yes, under the assumptions under which the bound was derived. The bound did not prohibit the division of coins into halves, neither did it disallow the existence of another coin known to be normal. Under both these conditions, it is possible to find the odd coin of 13 coins in 3 weighings. You could try modifying the above scheme to these cases.

14. Drawing with and without replacement. An urn contains r red, w white, and b black balls. Which has higher entropy, drawing k ≥ 2 balls from the urn with replacement or without replacement? Set it up and show why. (There is both a hard way and a relatively simple way to do this.)

Solution: Drawing with and without replacement. Intuitively, it is clear that if the balls are drawn with replacement, the number of possible choices for the i-th ball is larger, and therefore the conditional entropy is larger. But computing the conditional distributions is slightly involved. It is easier to compute the unconditional entropy.

With replacement. In this case the conditional distribution of each draw is the same for every draw. Thus

    X_i = red with prob. r/(r+w+b), white with prob. w/(r+w+b), black with prob. b/(r+w+b),   (2.83)

and therefore

    H(X_i | X_{i-1}, ..., X_1) = H(X_i)   (2.84)
        = log(r+w+b) - (r/(r+w+b)) log r - (w/(r+w+b)) log w - (b/(r+w+b)) log b.   (2.85)

Without replacement. The unconditional probability of the i-th ball being red is still r/(r+w+b), etc. Thus the unconditional entropy H(X_i) is still the same as with replacement. The conditional entropy H(X_i | X_{i-1}, ..., X_1) is less than the unconditional entropy, and therefore the entropy of drawing without replacement is lower.

15. A metric. A function ρ(x, y) is a metric if for all x, y,
- ρ(x, y) ≥ 0,
- ρ(x, y) = ρ(y, x),
- ρ(x, y) = 0 if and only if x = y,
- ρ(x, y) + ρ(y, z) ≥ ρ(x, z).
(a) Show that ρ(X, Y) = H(X|Y) + H(Y|X) has the above properties, and is therefore a metric. Note that ρ(X, Y) is the number of bits needed for X and Y to communicate their values to each other.
(b) Verify that ρ(X, Y) can also be expressed as

    ρ(X, Y) = H(X) + H(Y) - 2 I(X;Y)   (2.86)
            = H(X, Y) - I(X;Y)   (2.87)
            = 2 H(X, Y) - H(X) - H(Y).   (2.88)

Solution: A metric
(a) Let

    ρ(X, Y) = H(X|Y) + H(Y|X).   (2.89)

Then:
- Since conditional entropy is always ≥ 0, ρ(X, Y) ≥ 0.
- The symmetry of the definition implies that ρ(X, Y) = ρ(Y, X).
- By Problem 2.6, it follows that H(Y|X) is 0 iff Y is a function of X, and H(X|Y) is 0 iff X is a function of Y. Thus ρ(X, Y) is 0 iff X and Y are functions of each other, and therefore are equivalent up to a reversible transformation.
- Consider three random variables X, Y and Z. Then

    H(X|Y) + H(Y|Z) ≥ H(X|Y, Z) + H(Y|Z)   (2.90)
                    = H(X, Y|Z)            (2.91)
                    = H(X|Z) + H(Y|X, Z)   (2.92)
                    ≥ H(X|Z),              (2.93)

from which it follows that

    ρ(X, Y) + ρ(Y, Z) ≥ ρ(X, Z).   (2.94)

Note that the inequality is strict unless X → Y → Z forms a Markov chain and Y is a function of X and Z.
(b) Since H(X|Y) = H(X) - I(X;Y), the first equation follows. The second relation follows from the first equation and the fact that H(X, Y) = H(X) + H(Y) - I(X;Y). The third follows on substituting I(X;Y) = H(X) + H(Y) - H(X, Y).

16. Example of joint entropy. Let p(x, y) be given by

              Y
    X        0      1
    0       1/3    1/3
    1        0     1/3

Find
(a) H(X), H(Y).
(b) H(X | Y), H(Y | X).
(c) H(X, Y).
(d) H(Y) - H(Y | X).
(e) I(X; Y).
(f) Draw a Venn diagram for the quantities in (a) through (e).

Solution: Example of joint entropy


(a) H(X) = (2/3) log(3/2) + (1/3) log 3 = 0.918 bits = H(Y).
(b) H(X|Y) = (1/3) H(X|Y=0) + (2/3) H(X|Y=1) = (1/3)·0 + (2/3)·1 = 0.667 bits = H(Y|X).
(c) H(X, Y) = 3 · (1/3) log 3 = 1.585 bits.
(d) H(Y) - H(Y|X) = 0.918 - 0.667 = 0.251 bits.
(e) I(X;Y) = H(Y) - H(Y|X) = 0.251 bits.
(f) See Figure 2.1.



[Figure 2.1: Venn diagram illustrating the relationships among H(X), H(Y), H(X|Y), H(Y|X), and I(X;Y).]
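As a numeric cross-check of (a) through (e) (my own addition, not part of the original solution), the joint pmf above can be fed through a few lines of Python:

```python
import math
from collections import defaultdict

pxy = {(0, 0): 1/3, (0, 1): 1/3, (1, 1): 1/3}      # p(1, 0) = 0

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

px, py = defaultdict(float), defaultdict(float)
for (x, y), p in pxy.items():
    px[x] += p
    py[y] += p

HX, HY, HXY = H(px.values()), H(py.values()), H(pxy.values())
print(round(HX, 3), round(HXY - HY, 3), round(HXY, 3), round(HX + HY - HXY, 3))
# 0.918 (= H(X) = H(Y)), 0.667 (= H(X|Y) = H(Y|X)), 1.585 (= H(X,Y)), 0.251 (= I(X;Y))
```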

17. Inequality. Show ln x ≥ 1 - 1/x for x > 0.

Solution: Inequality. Using the remainder form of the Taylor expansion of ln(x) about x = 1, we have for some c between 1 and x

    ln(x) = ln(1) + (1/t)|_{t=1} (x - 1) + (-1/t^2)|_{t=c} (x - 1)^2 / 2 ≤ x - 1,

since the second term is always negative. Hence letting y = 1/x, we obtain

    -ln y ≤ 1/y - 1,

or

    ln y ≥ 1 - 1/y,

with equality iff y = 1.

18. Entropy of a sum. Let X and Y be random variables that take on values x_1, x_2, ..., x_r and y_1, y_2, ..., y_s, respectively. Let Z = X + Y.
(a) Show that H(Z|X) = H(Y|X). Argue that if X, Y are independent, then H(Y) ≤ H(Z) and H(X) ≤ H(Z). Thus the addition of independent random variables adds uncertainty.
(b) Give an example (of necessarily dependent random variables) in which H(X) > H(Z) and H(Y) > H(Z).
(c) Under what conditions does H(Z) = H(X) + H(Y)?

Solution: Entropy of a sum.


(a) Z = X + Y. Hence p(Z = z | X = x) = p(Y = z - x | X = x), and

    H(Z|X) = Σ_x p(x) H(Z|X = x)
           = -Σ_x p(x) Σ_z p(Z = z|X = x) log p(Z = z|X = x)
           = -Σ_x p(x) Σ_y p(Y = y|X = x) log p(Y = y|X = x)
           = Σ_x p(x) H(Y|X = x) = H(Y|X).

If X and Y are independent, then H(Y|X) = H(Y). Since I(X;Z) ≥ 0, we have H(Z) ≥ H(Z|X) = H(Y|X) = H(Y). Similarly we can show that H(Z) ≥ H(X).



(b) Consider the following joint distribution for X and Y. Let

    X = -Y = 1 with probability 1/2, 0 with probability 1/2.

Then H(X) = H(Y) = 1, but Z = X + Y = 0 with probability 1 and hence H(Z) = 0.
(c) We have H(Z) ≤ H(X, Y) ≤ H(X) + H(Y), because Z is a function of (X, Y) and H(X, Y) = H(X) + H(Y|X) ≤ H(X) + H(Y). We have equality iff (X, Y) is a function of Z and H(Y) = H(Y|X), i.e., X and Y are independent.

19. Entropy of a disjoint mixture. Let X_1 and X_2 be discrete random variables drawn according to probability mass functions p_1(·) and p_2(·) over the respective alphabets {1, 2, ..., m} and {m+1, ..., n}. Let

    X = X_1 with probability α, X_2 with probability 1 - α.

(a) Find H(X) in terms of H(X_1), H(X_2) and α.
(b) Maximize over α to show that 2^{H(X)} ≤ 2^{H(X_1)} + 2^{H(X_2)}, and interpret this using the notion that 2^{H(X)} is the effective alphabet size.

Solution: Entropy of a disjoint mixture. We could do this problem by writing down the definition of entropy and expanding the various terms. Instead, we will use the algebra of entropies for a simpler proof. Since X_1 and X_2 have disjoint support sets, we can write

    X = X_1 with probability α, X_2 with probability 1 - α.

Define a function of X,

    θ = f(X) = 1 when X = X_1, 2 when X = X_2.

Then, since θ is a function of X, we have

    H(X) = H(X, f(X)) = H(θ) + H(X | θ)
         = H(θ) + p(θ = 1) H(X | θ = 1) + p(θ = 2) H(X | θ = 2)
         = H(α) + α H(X_1) + (1 - α) H(X_2),

where H(α) = -α log α - (1-α) log(1-α).
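A small numerical illustration (my own, with an arbitrary α and arbitrary pmfs, not from the original text) of the identity H(X) = H(α) + αH(X_1) + (1-α)H(X_2) for distributions with disjoint supports:

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

alpha = 0.3
p1 = [0.5, 0.5]            # X1 on {1, 2}
p2 = [0.2, 0.3, 0.5]       # X2 on {3, 4, 5}, disjoint from the support of X1

# Mixture: X = X1 with prob alpha, X2 with prob 1 - alpha.
px = [alpha * p for p in p1] + [(1 - alpha) * p for p in p2]
lhs = H(px)
rhs = H([alpha, 1 - alpha]) + alpha * H(p1) + (1 - alpha) * H(p2)
print(round(lhs, 6), round(rhs, 6))   # the two values agree
```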


20. A measure of correlation. Let X_1 and X_2 be identically distributed, but not necessarily independent. Let

    ρ = 1 - H(X_2 | X_1) / H(X_1).

(a) Show ρ = I(X_1; X_2) / H(X_1).
(b) Show 0 ≤ ρ ≤ 1.
(c) When is ρ = 0?
(d) When is ρ = 1?

Solution: A measure of correlation. X_1 and X_2 are identically distributed and ρ = 1 - H(X_2|X_1)/H(X_1).
(a) ρ = (H(X_1) - H(X_2|X_1)) / H(X_1) = (H(X_2) - H(X_2|X_1)) / H(X_1) (since H(X_1) = H(X_2)) = I(X_1; X_2) / H(X_1).
(b) Since 0 ≤ H(X_2|X_1) ≤ H(X_2) = H(X_1), we have 0 ≤ H(X_2|X_1)/H(X_1) ≤ 1 and hence 0 ≤ ρ ≤ 1.
(c) ρ = 0 iff I(X_1; X_2) = 0 iff X_1 and X_2 are independent.
(d) ρ = 1 iff H(X_2|X_1) = 0 iff X_2 is a function of X_1. By symmetry, X_1 is a function of X_2, i.e., X_1 and X_2 have a one-to-one relationship.

21. Data processing. Let X_1 → X_2 → X_3 → ⋯ → X_n form a Markov chain in this order; i.e., let

    p(x_1, x_2, ..., x_n) = p(x_1) p(x_2|x_1) ⋯ p(x_n|x_{n-1}).

Reduce I(X_1; X_2, ..., X_n) to its simplest form.

Solution: Data Processing. By the chain rule for mutual information,

    I(X_1; X_2, ..., X_n) = I(X_1; X_2) + I(X_1; X_3|X_2) + ⋯ + I(X_1; X_n|X_2, ..., X_{n-1}).   (2.95)

By the Markov property, the past and the future are conditionally independent given the present, and hence all terms except the first are zero. Therefore

    I(X_1; X_2, ..., X_n) = I(X_1; X_2).   (2.96)


22. Bottleneck. Suppose a (non-stationary) Markov chain starts in one of n states, necks down to k < n states, and then fans back to m > k states. Thus X_1 → X_2 → X_3, X_1 ∈ {1, 2, ..., n}, X_2 ∈ {1, 2, ..., k}, X_3 ∈ {1, 2, ..., m}.
(a) Show that the dependence of X_1 and X_3 is limited by the bottleneck by proving that I(X_1; X_3) ≤ log k.
(b) Evaluate I(X_1; X_3) for k = 1, and conclude that no dependence can survive such a bottleneck.

Solution: Bottleneck.

(a) From the data processing inequality, and the fact that entropy is maximum for a uniform distribution, we get

    I(X_1; X_3) ≤ I(X_1; X_2) = H(X_2) - H(X_2 | X_1) ≤ H(X_2) ≤ log k.

Thus, the dependence between X_1 and X_3 is limited by the size of the bottleneck. That is, I(X_1; X_3) ≤ log k.
(b) For k = 1, I(X_1; X_3) ≤ log 1 = 0, and since I(X_1; X_3) ≥ 0, I(X_1; X_3) = 0. Thus, for k = 1, X_1 and X_3 are independent.

23. Run length coding. Let X_1, X_2, ..., X_n be (possibly dependent) binary random variables. Suppose one calculates the run lengths R = (R_1, R_2, ...) of this sequence (in order as they occur). For example, the sequence X = 0001100100 yields run lengths R = (3, 2, 2, 1, 2). Compare H(X_1, X_2, ..., X_n), H(R) and H(X_n, R). Show all equalities and inequalities, and bound all the differences.

Solution: Run length coding. Since the run lengths are a function of X_1, X_2, ..., X_n, H(R) ≤ H(X). Any X_i together with the run lengths determines the entire sequence X_1, X_2, ..., X_n. Hence

    H(X_1, X_2, ..., X_n) = H(X_i, R)   (2.97)
                          = H(R) + H(X_i | R)   (2.98)
                          ≤ H(R) + H(X_i)   (2.99)
                          ≤ H(R) + 1.   (2.100)
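To make the comparison concrete, here is a small exhaustive check (my own addition, under the assumption of i.i.d. Bernoulli(0.3) bits with n = 4, which the problem does not require); it verifies H(R) ≤ H(X) ≤ H(R) + 1 and that H(X_1, R) = H(X).

```python
import math
from itertools import product, groupby
from collections import defaultdict

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

n, q = 4, 0.3                                      # i.i.d. Bernoulli(q) bits, for illustration only
px, pr, pxr = {}, defaultdict(float), defaultdict(float)
for x in product((0, 1), repeat=n):
    p = math.prod(q if b else 1 - q for b in x)
    r = tuple(len(list(g)) for _, g in groupby(x))  # run lengths of the sequence
    px[x] = p
    pr[r] += p
    pxr[(x[0], r)] += p                             # (X_1, R) determines the whole sequence

HX, HR, HXR = H(px.values()), H(pr.values()), H(pxr.values())
print(HR <= HX <= HR + 1, abs(HX - HXR) < 1e-12)    # True True
print(round(HX, 4), round(HR, 4))
```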

24. Markov's inequality for probabilities. Let p(x) be a probability mass function. Prove, for all d ≥ 0,

    Pr{p(X) ≤ d} log(1/d) ≤ H(X).   (2.101)

Solution: Markov inequality applied to entropy.

    P(p(X) < d) log(1/d) = Σ_{x: p(x) < d} p(x) log(1/d)   (2.102)
                         ≤ Σ_{x: p(x) < d} p(x) log(1/p(x))   (2.103)
                         ≤ Σ_x p(x) log(1/p(x))   (2.104)
                         = H(X).   (2.105)

25. Logical order of ideas. Ideas have been developed in order of need, and then generalized if necessary. Reorder the following ideas, strongest first, implications following:
(a) Chain rule for I(X_1, ..., X_n; Y), chain rule for D(p(x_1, ..., x_n) || q(x_1, x_2, ..., x_n)), and chain rule for H(X_1, X_2, ..., X_n).
(b) D(f || g) ≥ 0, Jensen's inequality, I(X;Y) ≥ 0.

Solution: Logical ordering of ideas.

(a) The following orderings are subjective. Since I(X;Y) = D(p(x, y) || p(x)p(y)) is a special case of relative entropy, it is possible to derive the chain rule for I from the chain rule for D. Since H(X) = I(X;X), it is possible to derive the chain rule for H from the chain rule for I. It is also possible to derive the chain rule for I from the chain rule for H, as was done in the notes.
(b) In class, Jensen's inequality was used to prove the non-negativity of D. The inequality I(X;Y) ≥ 0 followed as a special case of the non-negativity of D.

26. Second law of thermodynamics. Let X_1, X_2, X_3, ... be a stationary first-order Markov chain. In Section 2.9, it was shown that H(X_n | X_1) ≥ H(X_{n-1} | X_1) for n = 2, 3, .... Thus conditional uncertainty about the future grows with time. This is true although the unconditional uncertainty H(X_n) remains constant. However, show by example that H(X_n | X_1 = x_1) does not necessarily grow with n for every x_1.

Solution: Second law of thermodynamics.

    H(X_n | X_1) ≥ H(X_n | X_1, X_2)   (conditioning reduces entropy)   (2.106)
                 = H(X_n | X_2)        (by Markovity)                    (2.107)
                 = H(X_{n-1} | X_1)    (by stationarity).                (2.108)

Alternatively, by an application of the data processing inequality to the Markov chain X_1 → X_{n-1} → X_n, we have

    I(X_1; X_{n-1}) ≥ I(X_1; X_n).   (2.109)

Expanding the mutual informations in terms of entropies, we have

    H(X_{n-1}) - H(X_{n-1}|X_1) ≥ H(X_n) - H(X_n|X_1).   (2.110)

By stationarity, H(X_{n-1}) = H(X_n), and hence we have

    H(X_{n-1}|X_1) ≤ H(X_n|X_1).   (2.111)

27. Conditional mutual information. Consider a sequence of n binary random variables X_1, X_2, ..., X_n. Each sequence with an even number of 1's has probability 2^{-(n-1)} and each sequence with an odd number of 1's has probability 0. Find the mutual informations

    I(X_1; X_2), I(X_2; X_3|X_1), ..., I(X_{n-1}; X_n|X_1, ..., X_{n-2}).

Solution: Conditional mutual information.

Consider a sequence of n binary random variables X_1, X_2, ..., X_n. Each sequence of length n with an even number of 1's is equally likely and has probability 2^{-(n-1)}. Any n - 1 or fewer of these are independent. Thus, for k ≤ n - 1,

    I(X_{k-1}; X_k | X_1, X_2, ..., X_{k-2}) = 0.

However, given X_1, X_2, ..., X_{n-2}, we know that once we know either X_{n-1} or X_n we know the other:

    I(X_{n-1}; X_n | X_1, X_2, ..., X_{n-2}) = H(X_n|X_1, X_2, ..., X_{n-2}) - H(X_n|X_1, X_2, ..., X_{n-1}) = 1 - 0 = 1 bit.
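A quick enumeration for n = 3 (my own check, not from the original text) confirms that pairs of bits are independent, so I(X_1;X_2) = 0, while I(X_2;X_3|X_1) = 1 bit.

```python
import math
from itertools import product
from collections import defaultdict

# n = 3: each even-parity sequence has probability 2^-(n-1) = 1/4.
n = 3
p = {x: 2 ** -(n - 1) for x in product((0, 1), repeat=n) if sum(x) % 2 == 0}

def H(joint):
    return -sum(q * math.log2(q) for q in joint.values() if q > 0)

def marginal(joint, idx):
    out = defaultdict(float)
    for x, q in joint.items():
        out[tuple(x[i] for i in idx)] += q
    return out

p12, p13, p1, p2 = marginal(p, (0, 1)), marginal(p, (0, 2)), marginal(p, (0,)), marginal(p, (1,))
I_12 = H(p1) + H(p2) - H(p12)
# I(X2;X3|X1) = H(X1,X2) + H(X1,X3) - H(X1) - H(X1,X2,X3)
I_23_given_1 = H(p12) + H(p13) - H(p1) - H(p)
print(round(I_12, 6), round(I_23_given_1, 6))       # 0.0 and 1.0
```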

28. Mixing increases entropy. Show that the entropy of the probability distribution (p_1, ..., p_i, ..., p_j, ..., p_m) is less than the entropy of the distribution (p_1, ..., (p_i+p_j)/2, ..., (p_i+p_j)/2, ..., p_m). Show that in general any transfer of probability that makes the distribution more uniform increases the entropy.

Solution: Mixing increases entropy. This problem depends on the convexity of the log function. Let

    P_1 = (p_1, ..., p_i, ..., p_j, ..., p_m),
    P_2 = (p_1, ..., (p_i+p_j)/2, ..., (p_j+p_i)/2, ..., p_m).

Then, by the log sum inequality,

    H(P_2) - H(P_1) = -2 ((p_i+p_j)/2) log((p_i+p_j)/2) + p_i log p_i + p_j log p_j
                    = -(p_i+p_j) log((p_i+p_j)/2) + p_i log p_i + p_j log p_j
                    ≥ 0.

Thus, H(P_2) ≥ H(P_1).


29. Inequalities. Let X, Y and Z be joint random variables. Prove the following inequalities and find conditions for equality.
(a) H(X, Y | Z) ≥ H(X | Z).
(b) I(X, Y; Z) ≥ I(X; Z).
(c) H(X, Y, Z) - H(X, Y) ≤ H(X, Z) - H(X).
(d) I(X; Z | Y) ≥ I(Z; Y | X) - I(Z; Y) + I(X; Z).

Solution: Inequalities.
(a) Using the chain rule for conditional entropy,

    H(X, Y | Z) = H(X | Z) + H(Y | X, Z) ≥ H(X | Z),

with equality iff H(Y | X, Z) = 0, that is, when Y is a function of X and Z.
(b) Using the chain rule for mutual information,

    I(X, Y; Z) = I(X; Z) + I(Y; Z | X) ≥ I(X; Z),

with equality iff I(Y; Z | X) = 0, that is, when Y and Z are conditionally independent given X.
(c) Using first the chain rule for entropy and then the definition of conditional mutual information,

    H(X, Y, Z) - H(X, Y) = H(Z | X, Y) = H(Z | X) - I(Y; Z | X) ≤ H(Z | X) = H(X, Z) - H(X),

with equality iff I(Y; Z | X) = 0, that is, when Y and Z are conditionally independent given X.
(d) Using the chain rule for mutual information,

    I(X; Z | Y) + I(Z; Y) = I(X, Y; Z) = I(Z; Y | X) + I(X; Z),

and therefore

    I(X; Z | Y) = I(Z; Y | X) - I(Z; Y) + I(X; Z).

We see that this inequality is actually an equality in all cases.

30. Maximum entropy. Find the probability mass function p(x) that maximizes the entropy H(X) of a non-negative integer-valued random variable X subject to the constraint

    EX = Σ_{n=0}^∞ n p(n) = A

for a fixed value A > 0. Evaluate this maximum H(X).

Solution: Maximum entropy.

Recall that

    -Σ_{i=0}^∞ p_i log p_i ≤ -Σ_{i=0}^∞ p_i log q_i.

Let q_i = β λ^i. Then we have that

    -Σ_{i=0}^∞ p_i log p_i ≤ -Σ_{i=0}^∞ p_i log q_i
                           = -log β ( Σ_{i=0}^∞ p_i ) - log λ ( Σ_{i=0}^∞ i p_i )
                           = -log β - A log λ.

Notice that the final right hand side expression is independent of {p_i}, and that the inequality

    -Σ_{i=0}^∞ p_i log p_i ≤ -log β - A log λ

holds for all β, λ such that

    Σ_{i=0}^∞ β λ^i = 1 = β / (1 - λ).

The constraint on the expected value also requires that

    Σ_{i=0}^∞ i β λ^i = A = β λ / (1 - λ)^2.

Combining the two constraints we have

    β λ / (1 - λ)^2 = λ / (1 - λ) = A,

which implies that

    λ = A / (A + 1),    β = 1 / (A + 1).

So the entropy maximizing distribution is

    p_i = (1/(A+1)) (A/(A+1))^i.

Plugging these values into the expression for the maximum entropy,

    -log β - A log λ = (A + 1) log(A + 1) - A log A.
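As a quick numerical check (my own addition, not from the original solution), the sketch below builds the maximizing distribution for A = 3, confirms its mean is A, and compares its entropy with the closed form (A+1) log(A+1) - A log A.

```python
import math

A = 3.0
lam, beta = A / (A + 1), 1 / (A + 1)

# p_i = beta * lam**i is the entropy-maximizing pmf on {0, 1, 2, ...} with mean A.
p = [beta * lam ** i for i in range(2000)]          # truncated; the tail is negligible
mean = sum(i * pi for i, pi in enumerate(p))
H = -sum(pi * math.log2(pi) for pi in p)
closed_form = (A + 1) * math.log2(A + 1) - A * math.log2(A)

print(round(mean, 6), round(H, 6), round(closed_form, 6))   # mean ≈ 3, entropies agree
```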

The general form of the distribution,

    p_i = β λ^i,

can be obtained either by guessing or by Lagrange multipliers, where

    F(p_i, λ_1, λ_2) = -Σ_{i=0}^∞ p_i log p_i + λ_1 ( Σ_{i=0}^∞ p_i - 1 ) + λ_2 ( Σ_{i=0}^∞ i p_i - A )

is the function whose gradient we set to 0. Many of you used Lagrange multipliers, but failed to argue that the result obtained is a global maximum. An argument similar to the above should have been used. On the other hand, one could simply argue that since -H(p) is convex, it has only one local minimum and no local maxima, and therefore the Lagrange multiplier method actually gives the global maximum for H(p).

31. Shuffles increase entropy. Argue that for any distribution on shuffles T and any distribution on card positions X that

    H(TX) ≥ H(TX | T)   (2.112)
          = H(T^{-1}TX | T)   (2.113)
          = H(X | T)   (2.114)
          = H(X),   (2.115)

if X and T are independent.

Solution: Shuffles increase entropy.

    H(TX) ≥ H(TX | T)   (2.116)
          = H(T^{-1}TX | T)   (2.117)
          = H(X | T)   (2.118)
          = H(X).   (2.119)

The inequality follows from the fact that conditioning reduces entropy, and the first equality follows from the fact that given T, we can reverse the shuffle.

32. Conditional entropy. Under what conditions does H(X | g(Y)) = H(X | Y)?

Solution: (Conditional Entropy). If H(X|g(Y)) = H(X|Y), then H(X) - H(X|g(Y)) = H(X) - H(X|Y), i.e., I(X; g(Y)) = I(X; Y). This is the condition for equality in the data processing inequality. From the derivation of the inequality, we have equality iff X → g(Y) → Y forms a Markov chain. Hence H(X|g(Y)) = H(X|Y) iff X → g(Y) → Y. This condition includes many special cases, such as g being one-to-one, and X and Y being independent. However, these two special cases do not exhaust all the possibilities.


33. Fano's inequality. Let Pr(X = i) = p_i, i = 1, 2, ..., m, and let p_1 ≥ p_2 ≥ p_3 ≥ ⋯ ≥ p_m. The minimal probability of error predictor of X is X̂ = 1, with resulting probability of error P_e = 1 - p_1. Maximize H(p) subject to the constraint 1 - p_1 = P_e to find a bound on P_e in terms of H. This is Fano's inequality in the absence of conditioning.

Solution: (Fano's Inequality.) The minimal probability of error predictor when there is no information is X̂ = 1, the most probable value of X. The probability of error in this case is P_e = 1 - p_1. Hence if we fix P_e, we fix p_1. We maximize the entropy of X for a given P_e to obtain an upper bound on the entropy for a given P_e. The entropy

    H(p) = -p_1 log p_1 - Σ_{i=2}^m p_i log p_i   (2.120)
         = -p_1 log p_1 - Σ_{i=2}^m P_e (p_i/P_e) log(p_i/P_e) - P_e log P_e   (2.121)
         = H(P_e) + P_e H(p_2/P_e, p_3/P_e, ..., p_m/P_e)   (2.122)
         ≤ H(P_e) + P_e log(m - 1),   (2.123)

since the maximum of H(p_2/P_e, p_3/P_e, ..., p_m/P_e) is attained by a uniform distribution. Hence any X that can be predicted with a probability of error P_e must satisfy

    H(X) ≤ H(P_e) + P_e log(m - 1),   (2.124)

which is the unconditional form of Fano's inequality. We can weaken this inequality to obtain an explicit lower bound for P_e:

    P_e ≥ (H(X) - 1) / log(m - 1).   (2.125)
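For a concrete illustration (my own check, not part of the original solution), the snippet below evaluates both sides of (2.124) for an arbitrarily chosen pmf.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.5, 0.25, 0.15, 0.1]          # p1 >= p2 >= ... >= pm, m = 4 (an arbitrary example)
m, Pe = len(p), 1 - p[0]            # best guess is X_hat = 1, so Pe = 1 - p1
lhs = H(p)
rhs = H([Pe, 1 - Pe]) + Pe * math.log2(m - 1)
print(lhs <= rhs, round(lhs, 3), round(rhs, 3))   # True: H(X) <= H(Pe) + Pe*log(m-1)
```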

34. Monotonic convergence of the empirical distribution. Let p̂_n denote the empirical probability mass function corresponding to X_1, X_2, ..., X_n i.i.d. ~ p(x), x ∈ 𝒳. Specifically,

    p̂_n(x) = (1/n) Σ_{i=1}^n I(X_i = x)   (2.126)

is the proportion of times that X_i = x in the first n samples, where I is an indicator function.
(a) Show for X binary that

    E D(p̂_{2n} || p) ≤ E D(p̂_n || p).   (2.127)

Thus the expected relative entropy "distance" from the empirical distribution to the true distribution decreases with sample size. Hint: Write p̂_{2n} = (1/2) p̂_n + (1/2) p̂'_n and use the convexity of D.
(b) Show for an arbitrary discrete 𝒳 that

    E D(p̂_n || p) ≤ E D(p̂_{n-1} || p).   (2.128)

Solution: Monotonic convergence of the empirical distribution.

(a) Note that

    p̂_{2n}(x) = (1/2n) Σ_{i=1}^{2n} I(X_i = x)
              = (1/2)(1/n) Σ_{i=1}^{n} I(X_i = x) + (1/2)(1/n) Σ_{i=n+1}^{2n} I(X_i = x)
              = (1/2) p̂_n(x) + (1/2) p̂'_n(x).

Using the convexity of D(p||q) we have that

    D(p̂_{2n} || p) = D( (1/2)p̂_n + (1/2)p̂'_n || (1/2)p + (1/2)p ) ≤ (1/2) D(p̂_n || p) + (1/2) D(p̂'_n || p).

Taking expectations and using the fact that the X_i's are identically distributed, we get

    E D(p̂_{2n} || p) ≤ E D(p̂_n || p).


(b) The trick to this part is similar to part (a) and involves rewriting p̂_n in terms of p̂_{n-1}. We see that

    p̂_n = (1/n) Σ_{i=1}^{n-1} I(X_i = x) + I(X_n = x)/n,

or in general,

    p̂_n = (1/n) Σ_{i ≠ j} I(X_i = x) + I(X_j = x)/n,

where j ranges from 1 to n. Summing over j we get

    n p̂_n = ((n-1)/n) Σ_{j=1}^n p̂^j_{n-1} + p̂_n,

or

    p̂_n = (1/n) Σ_{j=1}^n p̂^j_{n-1},

where

    p̂^j_{n-1} = (1/(n-1)) Σ_{i ≠ j} I(X_i = x).

Again using the convexity of D(p||q) and the fact that the D(p̂^j_{n-1} || p) are identically distributed for all j and hence have the same expected value, we obtain the final result.

35. Entropy of initial conditions. Prove that H(X_0|X_n) increases with n for any Markov chain.

Solution: Entropy of initial conditions. For a Markov chain, by the data processing theorem, we have

    I(X_0; X_{n-1}) ≥ I(X_0; X_n).   (2.129)

Therefore

    H(X_0) - H(X_0|X_{n-1}) ≥ H(X_0) - H(X_0|X_n),   (2.130)

so H(X_0|X_n) increases with n.
