A Generalization of the Shannon-McMillan-Breiman Theorem and the Kelly Criterion
Abstract
The relative information result proven herein can be used to extend the information-theoretic
analysis of the Kelly Criterion and its generalization, the horse race, an analysis
of securities market trading strategies presented in Cover and Thomas (1991). We show,
in particular, that their results for statistically independent horse races also apply to a
series of races where the stochastic process of winning horses, payoffs, and strategies
depends on some ergodic process, including, but not limited to, the history of previous
races. Also, if the bettor is receiving messages (side information) about the probability
distribution of winners, the doubling rate of the bettor's winnings can be interpreted as the
pragmatic information of the messages.
Both the theorem proven herein and the application to trading make a compelling case for
Weinberger’s definition of pragmatic information.
Introduction
Shannon began the seminal paper of information theory (Shannon and Weaver, 1962)
with a demonstration that his celebrated entropy measure was, given a few reasonable
conditions, unique. Yet Shannon thought that the ultimate importance of this quantity
was not that it obeyed the uniqueness theorem, but that it represented the minimum
compressed length of a message, a result now familiar as the Noiseless Coding Theorem.
Similar considerations inform this paper. In 2002, the present author published a
definition of “pragmatic information,” the amount of information in a message that is
actually used by its receiver, quantified via the relative entropy

\[
D(P; Q) = \sum_{\alpha} P(\alpha) \log \frac{P(\alpha)}{Q(\alpha)},
\]

where P(α) is the probability of α subsequent to the receipt of the message m and Q(α) is the
probability of α beforehand. Indeed, a theorem similar to the Noiseless Coding
Theorem, albeit less well known, states that a message erroneously compressed using the
wrong symbol probabilities will require an additional compressed length equal in
expected value to the relative entropy between the wrong distribution and the right one.
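The size of this penalty is easy to verify numerically. The following sketch (Python, with hypothetical probabilities, and base-2 logarithms standing in for base |A|) confirms that the expected excess code length equals the relative entropy:

```python
import math

def expected_code_length(p, code_dist):
    # Expected length (in bits) of an ideal code built for `code_dist`,
    # when symbols are actually drawn from p: sum_i p_i * (-log2 code_dist_i).
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, code_dist))

def relative_entropy(p, q):
    # D(p; q) = sum_i p_i log2(p_i / q_i)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.125, 0.125]   # true symbol probabilities (hypothetical)
q = [0.25, 0.25, 0.25, 0.25]    # erroneous (uniform) probabilities

# Extra expected length incurred by coding for q instead of p:
penalty = expected_code_length(p, q) - expected_code_length(p, p)
assert abs(penalty - relative_entropy(p, q)) < 1e-12
```

Here the penalty works out to 0.25 bits per symbol, exactly D(p; q).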
Weinberger’s measure,

\[
\sum_{m \in M} \sum_{o \in O} \Pr[m, o] \log \frac{\Pr[m, o]}{\Pr[m]\,\Pr[o]}
= \sum_{m \in M} \Pr[m] \sum_{o \in O} \Pr[o \mid m] \log \frac{\Pr[o \mid m]}{\Pr[o]},
\]
has the all-important property that the joint pragmatic information from independent
message ensembles is the sum of the pragmatic information values from each ensemble.
It is also the relative information of the probability distribution of outputs subsequent to
the receipt of each message with respect to the prior distribution, averaged over all
possible messages. Thus, the “wrong distribution” theorem cited above remains valid
under Weinberger’s definition.
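Both properties are easy to check numerically. The sketch below (hypothetical message and outcome distributions) computes Weinberger's measure and confirms that it is additive over independent message ensembles:

```python
import math

def pragmatic_information(pr_m, pr_o_given_m):
    # Weinberger's measure: sum_m Pr[m] sum_o Pr[o|m] log2( Pr[o|m] / Pr[o] ),
    # i.e. the relative entropy of the post-message outcome distribution with
    # respect to the prior, averaged over messages.
    n_outcomes = len(pr_o_given_m[0])
    pr_o = [sum(pm * cond[o] for pm, cond in zip(pr_m, pr_o_given_m))
            for o in range(n_outcomes)]
    total = 0.0
    for pm, cond in zip(pr_m, pr_o_given_m):
        for po_m, po in zip(cond, pr_o):
            if po_m > 0:
                total += pm * po_m * math.log2(po_m / po)
    return total

# Two independent message/outcome ensembles (all numbers hypothetical):
pm1, c1 = [0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]]
pm2, c2 = [0.3, 0.7], [[0.6, 0.4], [0.1, 0.9]]

# Their joint ensemble: product message probabilities, product conditionals.
pm12 = [a * b for a in pm1 for b in pm2]
c12 = [[x * y for x in ca for y in cb] for ca in c1 for cb in c2]

lhs = pragmatic_information(pm12, c12)
rhs = pragmatic_information(pm1, c1) + pragmatic_information(pm2, c2)
assert abs(lhs - rhs) < 1e-12   # additivity over independent ensembles
```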
Shannon’s statement of the noiseless coding theorem was not the most general form of
the theorem, in that
1. the bounds apply only to the expected length of the encoding, thus providing no
assurance that any individual sequence will not have a markedly different compressed
length.
2. the symbol sequences were assumed to be generated by a so-called discrete Markov
source, i.e. one in which the (stationary) conditional probability of observing each
symbol depends only on a finite number of immediately preceding symbols.
¹ Here, and in the sequel, all logarithms are to the base |A|, where |A| is the size of the alphabet A, unless otherwise indicated.
Subsequent attempts to generalize the Noiseless Coding Theorem beyond the above
restrictions culminated in the Shannon-McMillan-Breiman Theorem. However, it
appears that there is no result of corresponding generality for the theorem involving
relative information. It is the purpose of this paper to provide one.
The paper is therefore organized as follows: The section after this introduction reviews
the formal statements of the above-mentioned theorems and the ideas surrounding them.
Before actually stating and proving our generalization, we also briefly discuss the nature
and importance of the stochastic processes covered by this generalization, showing, in
particular, that these stochastic processes are natural choices for price series of financial
assets. The section following the formal statement and proof of the theorem applies it to
the Kelly Criterion and its generalization, the “horse race”, a model of securities market
trading strategies. We summarize the paper in a concluding section by arguing that the
results contained herein make the case for Weinberger’s definition of pragmatic
information.
Background
In other words, the transmitter/receiver pair has, by learning the correct probabilities,
reduced the expected encoded length by D(P; Q) symbols.
Many processes of practical importance can be modeled by Markov processes, but many
others, i.e. the ones with so-called “long range correlations,” cannot. The distinction
can be characterized by the correlation, ρ(k), between α_n and α_{k+n}, the nth and the (k+n)th
symbols of α, defined as

\[
\rho(k) = \frac{\Pr\{\alpha_n = \alpha_{k+n}\} - 1/|A|}{1 - 1/|A|},
\]

where |A| is the size of the alphabet, A, from which symbols can be selected.
As is well known, ρ(k) for a Markov process is, asymptotically, a decaying exponential in
k, but for processes with long range correlations, many other forms of ρ(k) are possible,
most notably k^{−ν}, for some constant ν. Such processes have received so much attention
in recent years that semi-popular books, such as the excellent review by Schroeder
(1991), have been written about them. The Shannon-McMillan-Breiman Theorem,
discussed below, establishes that the Noiseless Coding Theorem is true for these long
range processes as well.
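To make the distinction concrete, the following sketch estimates the correlation of a symmetric binary Markov chain from a simulated sequence (the match-frequency estimator normalized by |A| is an assumption of this sketch); the estimates track the exponential decay (1 − 2ε)^k characteristic of a Markov source:

```python
import random

def rho_hat(seq, k, alphabet_size):
    # Sample estimate of the correlation between symbols k apart:
    # (match frequency - 1/|A|) / (1 - 1/|A|).
    matches = sum(a == b for a, b in zip(seq, seq[k:]))
    frac = matches / (len(seq) - k)
    return (frac - 1 / alphabet_size) / (1 - 1 / alphabet_size)

random.seed(0)
eps = 0.1                      # flip probability of a symmetric binary Markov chain
seq, state = [], 0
for _ in range(200_000):
    if random.random() < eps:
        state = 1 - state
    seq.append(state)

# For this chain rho(k) should decay like (1 - 2*eps)**k, i.e. exponentially.
for k in (1, 2, 4, 8):
    print(k, round(rho_hat(seq, k, 2), 3), round((1 - 2 * eps) ** k, 3))
```

A long-range-correlated process would instead show ρ(k) falling off like a power of k.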
As an example of how these long range processes might arise, various authors in the
financial literature have claimed, albeit using very different language, that long range
correlations appear in the daily closing prices of common stocks (See, for example,
Murphy, 1999). These authors, as proponents of “technical analysis”, purport to see
patterns such as “support levels”, “resistance levels”, and “trends” in charts, tables, etc.
that can sometimes span decades. While these claims appear to violate the well known
“efficient market hypothesis,” a claim that has been buttressed by libraries of statistics
(See, for example, Malkiel, 1996), more recent work (such as Lo and MacKinlay, 1999
and Lo, Mamaysky and Wang, 2000) suggests that some of the price signals used by
technical analysts may have some predictive value after all. Thus, it is reasonable to
expect long range correlations in stock price patterns, especially in the form of complex
patterns that elude the simple statistical tests comprising most of the evidence supporting
the efficient market hypothesis.
If such correlations are present, they should be considered in the analysis of the so-called
“Kelly Criterion,” a celebrated formula for determining the optimal fraction of a
portfolio’s capital to allocate to each bet in an ongoing trading strategy. Cover and Thomas (1991)²
consider a generalization of the Kelly Criterion to the “horse race,” in which exactly one
of M possible future states of the world will obtain (exactly one of M horses can win) in
each race. We invest a fraction, bi(n) ≥ 0, of our wealth on each horse, i, for the nth race,
with the reward of b_i(n) R_i if the ith horse wins the race, and the penalty of losing b_i(n)
otherwise³. If the value of the portfolio after the nth race is S(n), the value of the portfolio
after the n + 1st race is thus
\[
S(n+1) = S(n) \sum_{i=1}^{M} b_i R_i X_i(n),
\]
² Some readers may know of these results through the paper of Algoet and Cover (1988), where they first
appeared. The results relevant to this paper are summarized in the textbook cited here.
³ For M = 2, this model corresponds to the situation where a trader has decided to commit a portion b < 1 of
his wealth in the hope of realizing a gain of bR, with R > 1. It is the experience of the author that
professional traders actually think this way. Sometimes, a trader might decide to close out a trade at one or
more intermediate points, i.e. before either b is lost or bR is won, a situation that can be incorporated into
the present model by assuming that M > 2.
with a probability, p_i, that, in general, may be unknown to the bettor. If the X ’s are
independent and identically distributed, the strong law of large numbers guarantees that
[log₂ S(n)]/n converges with probability one as n → ∞ to

\[
W = \frac{1}{n} E[\log_2 S(n)] = \sum_{i=1}^{M} p_i \log_2 (b_i R_i),
\]
where p_i is the probability that X_i = 1. That S(n) ≈ 2^{nW} almost surely as n → ∞ is the
justification for calling W a “doubling rate” in the literature. Per Cover and Thomas
(1991), the hypothesis of independent, identically distributed “horse race” X_i’s and
constant R_i’s can be replaced by the hypothesis of ergodic R_i’s with arbitrary distribution
when the doubling rate is optimal (See below).
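The convergence of [log₂ S(n)]/n to W can be illustrated by simulation; in the sketch below (hypothetical probabilities and payoffs), the bettor wagers the true probabilities:

```python
import math
import random

random.seed(1)
p = [0.5, 0.3, 0.2]        # true win probabilities (hypothetical)
R = [2.5, 3.5, 6.0]        # payoffs per unit bet (hypothetical)
b = [0.5, 0.3, 0.2]        # bet the true probabilities

# Theoretical doubling rate W = sum_i p_i log2(b_i R_i):
W = sum(pi * math.log2(bi * Ri) for pi, bi, Ri in zip(p, b, R))

log_S, n = 0.0, 100_000
for _ in range(n):
    winner = random.choices(range(3), weights=p)[0]
    log_S += math.log2(b[winner] * R[winner])   # wealth multiplies by b_i R_i

print(log_S / n, W)   # the empirical doubling rate approaches W
```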
By setting q_i = 1/(R_i T), with T = Σ_k (1/R_k), we ensure that 0 ≤ q_i ≤ 1 for all i, and that
Σ_k q_k = 1. We can therefore interpret the q’s as “track probabilities,” the odds that the
bookies are prepared to offer. We then have
\[
\begin{aligned}
W &= \sum_{i=1}^{M} p_i \log_2 (b_i R_i) \\
  &= \sum_{i=1}^{M} p_i \log_2 \left( \frac{b_i}{p_i} \cdot \frac{p_i}{T q_i} \right) \\
  &= D(p; q) - D(p; b) - \log_2 T.
\end{aligned}
\]
The special case where T = Σ i (1/Ri) = 1 corresponds to the situation where the game is
fair with respect to the track probabilities (and is the case considered by Cover and
Thomas). Indeed, for T = 1 and the bet allocations bi = qi, S(n + 1) = S(n) for all n and for
all values of the X’s. When T > 1, the race is rigged against the bettor, who can only
make money (via a positive doubling rate) if his betting allocation is a better estimate of p
than the track probabilities q. This is the situation of a trader who is a “price taker” in the
securities markets, because he must buy at the market maker’s “offered” price, which is
higher than the “bid” price at which the market maker buys. On the other hand, when T < 1, the race is rigged
in favor of the bettor, who can make money even if his estimate of p is no better than, and
perhaps even somewhat worse than q. This is the situation of the market maker, who
buys at the bid and sells at the offer. However, regardless of the magnitude of T, the best
policy for the bettor is always to choose b as close to p as possible.
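The decomposition of W can be checked numerically; the sketch below (hypothetical probabilities and payoffs) compares the direct computation of W with D(p; q) − D(p; b) − log₂ T:

```python
import math

def D(p, x):
    # relative entropy D(p; x), in bits
    return sum(pi * math.log2(pi / xi) for pi, xi in zip(p, x))

p = [0.5, 0.3, 0.2]            # true win probabilities (hypothetical)
R = [2.0, 3.0, 5.0]            # payoffs; here T > 1, a race rigged against the bettor
b = [0.6, 0.2, 0.2]            # the bettor's allocation (hypothetical)

T = sum(1.0 / r for r in R)
q = [1.0 / (r * T) for r in R]  # track probabilities

W_direct = sum(pi * math.log2(bi * ri) for pi, bi, ri in zip(p, b, R))
W_decomp = D(p, q) - D(p, b) - math.log2(T)
assert abs(W_direct - W_decomp) < 1e-12
```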
With b = p, we then have

\[
E[\log_2 R] = W^* + H(p).
\]

In other words, H(p), the entropy of p, plus the optimal doubling rate, W*, is the
expected log return per bet, E[log₂ R].
Also, both Shannon’s result and the relative entropy result apply only to the expected
length of the encoding, thus providing no assurance that any individual sequence will not
have a markedly different compressed length. (Similarly, the Weak Law of Large
Numbers guarantees only that the average of a series of independent, identically
distributed random variables converges, with higher and higher probability, to the
common expected value of the random variables. It leaves open the possibility that a
particular realization of that random series has a sample average wildly different from
this expected value, provided that such realizations are sufficiently rare. In fact, this
cannot happen, but we know it only because of the Strong Law of Large Numbers, a
different theorem with a stronger conclusion and a more sophisticated proof.)
The most general statement of the Noiseless Coding Theorem that could be hoped for is
that effectively every symbol sequence would have the right encoded length, provided
that it is generated from an ergodic source 4 . That this is, in fact, the case is the content of
the following
Theorem (Shannon-McMillan-Breiman).

\[
\lim_{n \to \infty} -\frac{1}{n} \log P\{\alpha_0, \alpha_1, \ldots, \alpha_{n-1}\}
= \lim_{n \to \infty} -\frac{1}{n} E_P\big[\log P\{\alpha_0, \alpha_1, \ldots, \alpha_{n-1}\}\big],
\]

almost surely under P.
Note that log[P{α₀, α₁, α₂, … , α_{n−1}}]/n is a random variable because α₀, α₁, α₂, … , α_{n−1} is
a particular realization of a stochastic process. Yet the Shannon-McMillan-Breiman
(SMB) Theorem states that, in the limit as the length of the substring approaches infinity,
the per-symbol length of the optimally encoded version of the substring is minus the
per-symbol log probability of the observed substring, almost surely under P. This is precisely the result
needed to establish the strong form of the Noiseless Coding Theorem. Indeed, per Cover
⁴ An ergodic source is, for our purposes, a stationary stochastic process in which no set, f, of symbol
sequences is mapped via time translation onto itself, except for sets of zero probability and sets of
probability one. Per Birkhoff’s ergodic theorem (Billingsley, 1978), these conditions guarantee that
averages taken over a sufficiently long individual sequence will correspond to averages taken over the
ensemble of all such sequences. Otherwise, it is possible that even stationary processes could get trapped
in some subset of the ensemble.
and Thomas (1991), the length of the optimal encoding of a sequence α₀, α₁, … , α_{n−1} drawn
from P is essentially −log P{α₀, α₁, … , α_{n−1}}, so

\[
\lim_{n \to \infty} -\frac{1}{n} \log P\{\alpha_0, \alpha_1, \ldots, \alpha_{n-1}\}
= \lim_{n \to \infty} \frac{1}{n}\, l_P(\alpha_0, \alpha_1, \ldots, \alpha_{n-1}),
\]

almost surely under P.
We now formally ask a similar question about relative information: Initially, the receiver
has assigned the probability Q(α0n-1) to receiving the subsequence {α0, α1, α2, … , αn-1} =
α0n-1. If P(α0n-1) is the true distribution of α0n-1, then the SMB Theorem establishes that
the per-symbol length of the optimal encoding of α0n-1 is, in fact, the per-symbol entropy
of P. We would like to claim that the decrease in length of the optimal encoding required
for each sequence is the per-symbol relative entropy, n-1 ^ (P; Q), almost surely with
respect to the true probability distribution P for any ergodic source.
One problem that could arise is that the ratio P(α) ⁄ Q(α) might be undefined because
Q(α) might approach zero as the length of α becomes infinite. Fortunately, this problem
can be avoided by the natural condition that sets of sequences that occur with Q-
probability zero must also occur with P-probability zero (i.e. P is absolutely continuous
with respect to Q).
Theorem Let α0, α1, α2, … , αn-1 = α0n-1 be a sequence of characters from some
alphabet A, generated by a source that is ergodic with respect to the probability
distribution P(α0n-1), and let the optimal encoding of that sequence have length
lP(α0n-1). If a second distribution Q(α0n-1) is defined on the sequence and if the
encoding of that sequence that would be optimal assuming Q(α0n-1), has length
lQ(α0n-1) > lP(α0n-1), then
\[
\lim_{n \to \infty} \frac{1}{n} \left[ l_Q(\alpha_0^{n-1}) - l_P(\alpha_0^{n-1}) \right]
= \lim_{n \to \infty} \frac{1}{n} E_P\left[ \log \frac{P(\alpha_0^{n-1})}{Q(\alpha_0^{n-1})} \right],
\]
almost surely under P, as well. The proof parallels the proof of the Shannon-McMillan-
Breiman Theorem in Billingsley (1978). Following Billingsley, then, we write
\[
\begin{aligned}
g_0(\alpha) &= -\log Q(\alpha_0), \\
g_1(\alpha) &= -\log \frac{Q(\alpha_1, \alpha_0)}{Q(\alpha_1)}, \\
&\;\;\vdots \\
g_k(\alpha) &= -\log \frac{Q(\alpha_k, \ldots, \alpha_1, \alpha_0)}{Q(\alpha_k, \ldots, \alpha_1)},
\end{aligned}
\]

so that

\[
\frac{1}{n} \sum_{k=0}^{n-1} g_k(T^k \alpha) = -\frac{1}{n} \log Q(\alpha_0^{n-1}),
\]

for finite n, where we have introduced the shift, T, that maps the sequence α₀, α₁, α₂, … , α_{n−1}
into α₁, α₂, … , αₙ.
The sum on the left side of the above would be the kind of sum to which Birkhoff’s
pointwise Ergodic Theorem applies, except that the gk’s depend on k. However, in
Billingsley’s proof of the Shannon-McMillan-Breiman Theorem, he establishes that, for
all probability measures Q,
\[
\lim_{k \to \infty} g_k(\alpha)
= \lim_{k \to \infty} -\log \frac{Q(\alpha_k, \ldots, \alpha_1, \alpha_0)}{Q(\alpha_k, \ldots, \alpha_1)}
= \lim_{k \to \infty} -\log Q(\alpha_0 \mid \alpha_1, \alpha_2, \ldots, \alpha_k)
\equiv g(\alpha),
\]
with Q -probability 1. Since convergence fails, at worst, on a set of measure zero under
Q, and P >> Q, the gk’s must also converge to a finite limit with probability 1 under P.
We therefore write

\[
-\frac{1}{n} \log Q(\alpha_0^{n-1})
= \frac{1}{n} \sum_{k=0}^{n-1} g(T^k \alpha)
+ \frac{1}{n} \sum_{k=0}^{n-1} \left[ g_k(T^k \alpha) - g(T^k \alpha) \right],
\]

and observe that:

1. The first sum on the right side of the above equation has the same limit as
−(1/n) E_P[log Q(α₀ⁿ⁻¹)] as n → ∞, and
2. The second sum on the right side of the above equation converges almost
everywhere-P to zero.

To establish 1., note that, by Birkhoff’s ergodic theorem,

\[
E_P[g(\alpha)] = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} g(T^k \alpha),
\]

almost surely under P, while, because E_Q[sup_k g_k(α)] is finite, Lebesgue’s Dominated
Convergence Theorem guarantees that

\[
E_P[g(\alpha)] = \lim_{n \to \infty} E_P[g_n(\alpha)].
\]
Furthermore, because the gₙ’s converge almost surely, for every ε_K > 0 we can find K < n
sufficiently large that |gₙ(α) − g_k(α)| < ε_K for all k between K and n. The error in replacing
E_P[gₙ(α)] by (1/n) Σ_{k=0}^{n−1} E_P[g_k(α)] is therefore bounded by

\[
\frac{1}{n} \sum_{k=0}^{n-1} E_P\left| g_n(\alpha) - g_k(\alpha) \right|
\le \frac{1}{n} \sum_{k=0}^{K-1} E_P\left| g_n(\alpha) - g_k(\alpha) \right|
+ \frac{n-K}{n}\, \varepsilon_K.
\]
So, by fixing K and letting n ∞, we conclude that this error is less than εK plus a term
that vanishes. We now observe that K, itself, can be made arbitrarily large, allowing εK to
become arbitrarily small, and implying that
\[
E_P[g(\alpha)]
= \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} E_P[g_k(\alpha)]
= \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} E_P\left[g_k(T^k \alpha)\right]
= \lim_{n \to \infty} -\frac{1}{n} E_P\left[\log Q(\alpha_0^{n-1})\right],
\]

where the second equality follows from the stationarity of α, and the third is justified by
another application of Lebesgue’s Dominated Convergence Theorem.
To establish 2., we must show that

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \left[ g_k(T^k \alpha) - g(T^k \alpha) \right] = 0.
\]

This sum is bounded in absolute value, up to finitely many initial terms, by

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} G_N(T^k \alpha) = E_P[G_N(\alpha)]
\]

for all N (Here we have defined G_N(α) ≡ sup_{k ≥ N} |g_k(α) − g(α)| and invoked the ergodic
theorem to justify the equality of the terms involving G_N.). We already know that
\[
\lim_{N \to \infty} G_N(\alpha) = 0
\]

almost surely-P, so, given that G_N(α) is dominated by the P-integrable function g(α) +
sup_k g_k(α), Lebesgue’s Dominated Convergence Theorem allows us to conclude that

\[
\lim_{N \to \infty} E_P[G_N(\alpha)] = E_P\left[\lim_{N \to \infty} G_N(\alpha)\right] = 0.
\]
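For the special case of an i.i.d. source, the conclusion of the theorem can be illustrated by simulation: the per-symbol log-likelihood ratio between the true distribution P and an erroneous distribution Q converges to the relative entropy D(P; Q). A minimal sketch (hypothetical distributions, base-2 logarithms):

```python
import math
import random

random.seed(2)
P = [0.6, 0.3, 0.1]    # true i.i.d. symbol distribution (hypothetical)
Q = [1/3, 1/3, 1/3]    # receiver's erroneous distribution

# Relative entropy D(P; Q) in bits per symbol:
D = sum(p * math.log2(p / q) for p, q in zip(P, Q))

n, log_ratio = 200_000, 0.0
for _ in range(n):
    a = random.choices(range(3), weights=P)[0]
    log_ratio += math.log2(P[a] / Q[a])   # log P(alpha)/Q(alpha) accumulates per symbol

print(log_ratio / n, D)  # per-symbol log-likelihood ratio vs. D(P; Q)
```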
The theorem contributed by this paper can be used to conclude that the formula for the
doubling rate in the T = 1 case,

\[
W = D(p; q) - D(p; b),
\]

remains true if b and q for race n are functions of some ergodic symbol source, Yₙ,
including, but not limited to, the outcomes of previous races; that is, if b(n) = b(Y₁, Y₂, …,
Y_{n−1}) and q(n) = q(Y₁, Y₂, …, Y_{n−1}). In this case, the stochastic doubling rate,
lim_{n→∞} n⁻¹ log₂ S(n), converges almost surely to the same expression.
Furthermore, suppose that the bettor comes up with a way of using the various messages
in an ensemble M to ensure that his choice of b(k) is p(k) in the kth race, rather than the
track probabilities. The expected doubling rate, averaged over all of the messages in M, will
be precisely the quantity defined above as the pragmatic information of p with respect to
the track probabilities.
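For fair odds set at the prior distribution (T = 1 and q equal to the prior), this claim can be checked directly; in the sketch below (hypothetical numbers), the average doubling rate of a bettor who bets the posterior implied by each message equals the mutual information between messages and winners:

```python
import math

# Two equally likely messages, each shifting the winner distribution (hypothetical):
pr_m = [0.5, 0.5]
p_given_m = [[0.7, 0.2, 0.1],
             [0.2, 0.3, 0.5]]
prior = [sum(pm * cond[i] for pm, cond in zip(pr_m, p_given_m)) for i in range(3)]

# Fair odds set at the prior: q_i = prior_i, R_i = 1/prior_i, so T = 1.
R = [1.0 / pi for pi in prior]

def doubling_rate(p_true, b):
    return sum(pi * math.log2(bi * ri) for pi, bi, ri in zip(p_true, b, R))

# Betting the posterior after each message, averaged over messages:
W_side = sum(pm * doubling_rate(cond, cond) for pm, cond in zip(pr_m, p_given_m))

# The same average computed as the pragmatic information (mutual information):
I = sum(pm * sum(c * math.log2(c / pr) for c, pr in zip(cond, prior))
        for pm, cond in zip(pr_m, p_given_m))
assert abs(W_side - I) < 1e-12
```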
In earlier work (Weinberger, 2002, and Weinberger, 2004), the present author suggested
that the mutual information between the output of a decision maker with and without
benefit of an ensemble of input messages can reasonably be applied to measure pragmatic
information. This paper makes a similar argument to the one that ultimately proved
definitive for Shannon’s entropy measure: If a sufficiently long uncompressed symbol
string has length n, it can always be compressed to a length nh, where h is the per-symbol
entropy. Similarly, because of the theorem proven here, we know that the improvement in
the optimally compressed length of a sufficiently long symbol string obtained by using the
actual symbol distribution instead of an erroneous one is the relative entropy between the
two distributions. The expected improvement, when averaged over an ensemble of the
messages that would update the symbol distribution in a variety of ways, is thus the
proposed pragmatic information measure. Furthermore, the proposed measure has an
intuitive interpretation as the expected doubling rate of a trading strategy when presented
with a series of trading signals. Both of these results present a compelling argument that
Weinberger’s proposed definition of pragmatic information is indeed correct.
REFERENCES
Billingsley, P. (1978). Ergodic Theory and Information, Robert E. Krieger Publishing
Company, Huntington, NY.
Cover, T. and Thomas, J. (1991). Elements of Information Theory, John Wiley & Sons, New
York.
Lo, A. and MacKinlay, A. C. (1999). A Non-Random Walk Down Wall Street, Princeton
University Press, Princeton, N. J.
Lo, A., Mamaysky, H. and Wang, J. (2000). “Foundations of Technical Analysis: Computational
Algorithms, Statistical Inference, and Empirical Implementation,” Journal of Finance, 55, 1705-
1765.
Schroeder, M. (1991). Fractals, Chaos, Power Laws, W.H. Freeman & Co., New York.
Weinberger, E. (2006). “Pragmatic Information and Gaian Development,” Mind and Matter, 4,
#2, 219-234. Electronic version: http://arxiv.org/abs/nlin.AO/0606012.