A Generalization of the Shannon-McMillan-Breiman Theorem and the Kelly Criterion
Abstract
The relative information result proven herein can be used to extend the information-theoretic
analysis of the Kelly Criterion and its generalization, the horse race, an analysis
of securities market trading strategies presented in Cover and Thomas (1991). We show,
in particular, that their results for statistically independent horse races also apply to a
series of races where the stochastic process of winning horses, payoffs, and strategies
depends on some ergodic process, including, but not limited to, the history of previous
races. Also, if the bettor is receiving messages (side information) about the probability
distribution of winners, the doubling rate of the bettor's winnings can be interpreted as the
pragmatic information of the messages.
Both the theorem proven herein and the application to trading make a compelling case for
Weinberger’s definition of pragmatic information.
Introduction
Shannon began the seminal paper of information theory (Shannon and Weaver, 1962)
with a demonstration that his celebrated entropy measure was, given a few reasonable
conditions, unique. Yet Shannon thought that the ultimate importance of this quantity
was not that it obeyed the uniqueness theorem, but that it represented the minimum
compressed length of a message, a result now familiar as the Noiseless Coding Theorem.
Similar considerations inform this paper. In 2002, the present author published a
definition of “pragmatic information,” the amount of information in a message that is
actually used by its receiver, quantified via the relative entropy

\[
D(P; Q) = \sum_{\alpha} P(\alpha) \log \frac{P(\alpha)}{Q(\alpha)},
\]

where P(α) is the probability of α subsequent to the receipt of the message m and Q(α) is the
probability of α beforehand. Indeed, a theorem similar to the Noiseless Coding
Theorem, albeit less well known, states that a message erroneously compressed using the
wrong symbol probabilities will require an additional compressed length equal in
expected value to the relative entropy between the wrong distribution and the right one.
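The size of this penalty is easy to verify numerically. The following sketch (Python, with hypothetical probabilities, and base-2 logarithms standing in for base |A|) confirms that the expected excess code length equals the relative entropy:

```python
import math

def expected_code_length(p, code_dist):
    # Expected length (in bits) of an ideal code built for `code_dist`,
    # when symbols are actually drawn from p: sum_i p_i * (-log2 code_dist_i).
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, code_dist))

def relative_entropy(p, q):
    # D(p; q) = sum_i p_i log2(p_i / q_i)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.125, 0.125]   # true symbol probabilities (hypothetical)
q = [0.25, 0.25, 0.25, 0.25]    # erroneous (uniform) probabilities

# Extra expected length incurred by coding for q instead of p:
penalty = expected_code_length(p, q) - expected_code_length(p, p)
assert abs(penalty - relative_entropy(p, q)) < 1e-12
```

Here the penalty works out to 0.25 bits per symbol, exactly D(p; q).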
Weinberger’s measure,

\[
\sum_{m \in M} \sum_{o \in O} \Pr[m, o] \log \frac{\Pr[m, o]}{\Pr[m]\,\Pr[o]}
= \sum_{m \in M} \Pr[m] \sum_{o \in O} \Pr[o \mid m] \log \frac{\Pr[o \mid m]}{\Pr[o]},
\]
has the all-important property that the joint pragmatic information from independent
message ensembles is the sum of the pragmatic information values from each ensemble.
It is also the relative information of the probability distribution of outputs subsequent to
the receipt of each message with respect to the prior distribution, averaged over all
possible messages. Thus, the “wrong distribution” theorem cited above remains valid
under Weinberger’s definition.
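Both properties are easy to check numerically. The sketch below (hypothetical message and outcome distributions) computes Weinberger's measure and confirms that it is additive over independent message ensembles:

```python
import math

def pragmatic_information(pr_m, pr_o_given_m):
    # Weinberger's measure: sum_m Pr[m] sum_o Pr[o|m] log2( Pr[o|m] / Pr[o] ),
    # i.e. the relative entropy of the post-message outcome distribution with
    # respect to the prior, averaged over messages.
    n_outcomes = len(pr_o_given_m[0])
    pr_o = [sum(pm * cond[o] for pm, cond in zip(pr_m, pr_o_given_m))
            for o in range(n_outcomes)]
    total = 0.0
    for pm, cond in zip(pr_m, pr_o_given_m):
        for po_m, po in zip(cond, pr_o):
            if po_m > 0:
                total += pm * po_m * math.log2(po_m / po)
    return total

# Two independent message/outcome ensembles (all numbers hypothetical):
pm1, c1 = [0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]]
pm2, c2 = [0.3, 0.7], [[0.6, 0.4], [0.1, 0.9]]

# Their joint ensemble: product message probabilities, product conditionals.
pm12 = [a * b for a in pm1 for b in pm2]
c12 = [[x * y for x in ca for y in cb] for ca in c1 for cb in c2]

lhs = pragmatic_information(pm12, c12)
rhs = pragmatic_information(pm1, c1) + pragmatic_information(pm2, c2)
assert abs(lhs - rhs) < 1e-12   # additivity over independent ensembles
```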
Shannon’s statement of the noiseless coding theorem was not the most general form of
the theorem, in that
1. the bounds apply only to the expected length of the encoding, thus providing no
assurance that any individual sequence will not have a markedly different compressed
length.
2. the symbol sequences were assumed to be generated by a so-called discrete Markov
source, i.e. one in which the (stationary) conditional probability of observing each
symbol depends only on a finite number of immediately preceding symbols.
¹ Here, and in the sequel, all logarithms are to the base |A|, where |A| is the size of the alphabet A, unless otherwise indicated.
Subsequent attempts to generalize the Noiseless Coding Theorem beyond the above
restrictions culminated in the Shannon-McMillan-Breiman Theorem. However, it
appears that there is no result of corresponding generality for the theorem involving
relative information. It is the purpose of this paper to provide one.
The paper is therefore organized as follows: The section after this introduction reviews
the formal statements of the above-mentioned theorems and the ideas surrounding them.
Before actually stating and proving our generalization, we also briefly discuss the nature
and importance of the stochastic processes covered by this generalization, showing, in
particular, that these stochastic processes are natural choices for price series of financial
assets. The section following the formal statement and proof of the theorem applies it to
the Kelly Criterion and its generalization, the “horse race”, a model of securities market
trading strategies. We summarize the paper in a concluding section by arguing that the
results contained herein make the case for Weinberger’s definition of pragmatic
information.
Background
In other words, the transmitter/receiver pair has, by learning the correct probabilities,
reduced the expected encoded length by D(P; Q) symbols.
Many processes of practical importance can be modeled by Markov processes, but many
others, i.e. the ones with so-called “long range correlations,” cannot. The distinction
can be characterized by the correlation, ρ(k), between α_n and α_{k+n}, the nth and the (k+n)th
symbols of α, defined as

\[
\rho(k) = \frac{\Pr\{\alpha_n = \alpha_{k+n}\} - 1/|A|}{1 - 1/|A|},
\]

where |A| is the size of the alphabet, A, from which symbols can be selected.
As is well known, ρ(k) for a Markov process is, asymptotically, a decaying exponential in
k, but for processes with long range correlations, many other forms of ρ(k) are possible,
most notably k^{−ν}, for some constant ν. Such processes have received so much attention
in recent years that semi-popular books, such as the excellent review by Schroeder
(1991), have been written about them. The Shannon-McMillan-Breiman Theorem,
discussed below, establishes that the Noiseless Coding Theorem is true for these long
range processes as well.
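To make the distinction concrete, the following sketch estimates the correlation of a symmetric binary Markov chain from a simulated sequence (the match-frequency estimator normalized by |A| is an assumption of this sketch); the estimates track the exponential decay (1 − 2ε)^k characteristic of a Markov source:

```python
import random

def rho_hat(seq, k, alphabet_size):
    # Sample estimate of the correlation between symbols k apart:
    # (match frequency - 1/|A|) / (1 - 1/|A|).
    matches = sum(a == b for a, b in zip(seq, seq[k:]))
    frac = matches / (len(seq) - k)
    return (frac - 1 / alphabet_size) / (1 - 1 / alphabet_size)

random.seed(0)
eps = 0.1                      # flip probability of a symmetric binary Markov chain
seq, state = [], 0
for _ in range(200_000):
    if random.random() < eps:
        state = 1 - state
    seq.append(state)

# For this chain rho(k) should decay like (1 - 2*eps)**k, i.e. exponentially.
for k in (1, 2, 4, 8):
    print(k, round(rho_hat(seq, k, 2), 3), round((1 - 2 * eps) ** k, 3))
```

A long-range-correlated process would instead show ρ(k) falling off like a power of k.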
As an example of how these long range processes might arise, various authors in the
financial literature have claimed, albeit using very different language, that long range
correlations appear in the daily closing prices of common stocks (See, for example,
Murphy, 1999). These authors, as proponents of “technical analysis”, purport to see
patterns such as “support levels”, “resistance levels”, and “trends” in charts, tables, etc.
that can sometimes span decades. While these claims appear to violate the well known
“efficient market hypothesis,” a claim that has been buttressed by libraries of statistics
(See, for example, Malkiel, 1996), more recent work (such as Lo and MacKinlay, 1999
and Lo, Mamaysky and Wang, 2000) suggests that some of the price signals used by
technical analysts may have some predictive value after all. Thus, it is reasonable to
expect long range correlations in stock price patterns, especially in the form of complex
patterns that elude the simple statistical tests comprising most of the evidence supporting
the efficient market hypothesis.
If such correlations are present, they should be considered in the analysis of the so-called
“Kelly Criterion,” a celebrated formula for determining the optimal fraction of a
portfolio’s capital to allocate to each bet in an ongoing trading strategy. Cover and Thomas (1991)²
consider a generalization of the Kelly Criterion to the “horse race,” in which exactly one
of M possible future states of the world will obtain (exactly one of M horses can win) in
each race. We invest a fraction, bi(n) ≥ 0, of our wealth on each horse, i, for the nth race,
with the reward of b_i(n) R_i if the ith horse wins the race, and the penalty of losing b_i(n)
otherwise³. If the value of the portfolio after the nth race is S(n), the value of the portfolio
after the n + 1st race is thus
\[
S(n+1) = S(n) \sum_{i=1}^{M} b_i R_i X_i(n),
\]
² Some readers may know of these results through the paper of Algoet and Cover (1988), where they first
appeared. The results relevant to this paper are summarized in the textbook cited here.
³ For M = 2, this model corresponds to the situation where a trader has decided to commit a portion b < 1 of
his wealth in the hope of realizing a gain of bR, with R > 1. It is the experience of the author that
professional traders actually think this way. Sometimes, a trader might decide to close out a trade at one or
more intermediate points, i.e. before either b is lost or bR is won, a situation that can be incorporated into
the present model by assuming that M > 2.
with a probability, p_i, that, in general, may be unknown to the bettor. If the X ’s are
independent and identically distributed, the strong law of large numbers guarantees that
[log₂ S(n)]/n converges with probability one as n → ∞ to

\[
W = \frac{1}{n} E[\log_2 S(n)] = \sum_{i=1}^{M} p_i \log_2 (b_i R_i),
\]
where p_i is the probability that X_i = 1. That S(n) ≈ 2^{nW} almost surely as n → ∞ is the
justification for calling W a “doubling rate” in the literature. Per Cover and Thomas
(1991), the hypothesis of independent, identically distributed “horse race” X_i’s and
constant R_i’s can be replaced by the hypothesis of ergodic R_i’s with arbitrary distribution
when the doubling rate is optimal (See below).
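The convergence of [log₂ S(n)]/n to W can be illustrated by simulation; in the sketch below (hypothetical probabilities and payoffs), the bettor wagers the true probabilities:

```python
import math
import random

random.seed(1)
p = [0.5, 0.3, 0.2]        # true win probabilities (hypothetical)
R = [2.5, 3.5, 6.0]        # payoffs per unit bet (hypothetical)
b = [0.5, 0.3, 0.2]        # bet the true probabilities

# Theoretical doubling rate W = sum_i p_i log2(b_i R_i):
W = sum(pi * math.log2(bi * Ri) for pi, bi, Ri in zip(p, b, R))

log_S, n = 0.0, 100_000
for _ in range(n):
    winner = random.choices(range(3), weights=p)[0]
    log_S += math.log2(b[winner] * R[winner])   # wealth multiplies by b_i R_i

print(log_S / n, W)   # the empirical doubling rate approaches W
```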
By setting q_i = 1/(R_i T), with T = Σ_k (1/R_k), we ensure that 0 ≤ q_i ≤ 1 for all i, and that
Σ_k q_k = 1. We can therefore interpret the q’s as “track probabilities,” the odds that the
bookies are prepared to offer. We then have
\[
\begin{aligned}
W &= \sum_{i=1}^{M} p_i \log_2 (b_i R_i) \\
  &= \sum_{i=1}^{M} p_i \log_2 \left( \frac{b_i}{p_i} \cdot \frac{p_i}{T q_i} \right) \\
  &= D(p; q) - D(p; b) - \log_2 T.
\end{aligned}
\]
The special case where T = Σ i (1/Ri) = 1 corresponds to the situation where the game is
fair with respect to the track probabilities (and is the case considered by Cover and
Thomas). Indeed, for T = 1 and the bet allocations bi = qi, S(n + 1) = S(n) for all n and for
all values of the X’s. When T > 1, the race is rigged against the bettor, who can only
make money (via a positive doubling rate) if his betting allocation is a better estimate of p
than the track probabilities q. This is the situation of a trader who is a “price taker” in the
securities markets, because he must buy at the market maker’s “offered” price, which is
higher than the “bid” price at which the market maker buys. On the other hand, when T < 1, the race is rigged
in favor of the bettor, who can make money even if his estimate of p is no better than, and
perhaps even somewhat worse than q. This is the situation of the market maker, who
buys at the bid and sells at the offer. However, regardless of the magnitude of T, the best
policy for the bettor is always to choose b as close to p as possible.
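The decomposition of W can be checked numerically; the sketch below (hypothetical probabilities and payoffs) compares the direct computation of W with D(p; q) − D(p; b) − log₂ T:

```python
import math

def D(p, x):
    # relative entropy D(p; x), in bits
    return sum(pi * math.log2(pi / xi) for pi, xi in zip(p, x))

p = [0.5, 0.3, 0.2]            # true win probabilities (hypothetical)
R = [2.0, 3.0, 5.0]            # payoffs; here T > 1, a race rigged against the bettor
b = [0.6, 0.2, 0.2]            # the bettor's allocation (hypothetical)

T = sum(1.0 / r for r in R)
q = [1.0 / (r * T) for r in R]  # track probabilities

W_direct = sum(pi * math.log2(bi * ri) for pi, bi, ri in zip(p, b, R))
W_decomp = D(p, q) - D(p, b) - math.log2(T)
assert abs(W_direct - W_decomp) < 1e-12
```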
With b = p, we then have

\[
E[\log_2 R] = W^* + H(p).
\]

In other words, H(p), the entropy of p, plus the optimal doubling rate, W*, is the
expected log return per bet, E[log₂ R].
Also, both Shannon’s result and the relative entropy result apply only to the expected
length of the encoding, thus providing no assurance that any individual sequence will not
have a markedly different compressed length. (Similarly, the Weak Law of Large
Numbers guarantees only that the average of a series of independent, identically
distributed random variables converges, with higher and higher probability, to the
common expected value of the random variables. It leaves open the possibility that a
particular realization of that random series has a sample average wildly different from
this expected value, provided that such realizations are sufficiently rare. In fact, this
cannot happen, but we know it only because of the Strong Law of Large Numbers, a
different theorem with a stronger conclusion and a more sophisticated proof.)
The most general statement of the Noiseless Coding Theorem that could be hoped for is
that effectively every symbol sequence would have the right encoded length, provided
that it is generated from an ergodic source 4 . That this is, in fact, the case is the content of
the following
Theorem (Shannon-McMillan-Breiman).

\[
\lim_{n \to \infty} -\frac{1}{n} \log P\{\alpha_0, \alpha_1, \ldots, \alpha_{n-1}\}
= \lim_{n \to \infty} -\frac{1}{n} E_P\big[\log P\{\alpha_0, \alpha_1, \ldots, \alpha_{n-1}\}\big],
\]

almost surely under P.
Note that log[P{α₀, α₁, α₂, … , α_{n−1}}]/n is a random variable because α₀, α₁, α₂, … , α_{n−1} is
a particular realization of a stochastic process. Yet the Shannon-McMillan-Breiman
(SMB) Theorem states that, in the limit as the length of the substring approaches infinity,
the per-symbol length of the optimally encoded version of the substring is minus the
per-symbol log probability of the observed substring, almost surely under P. This is precisely the result
needed to establish the strong form of the Noiseless Coding Theorem. Indeed, per Cover
⁴ An ergodic source is, for our purposes, a stationary stochastic process in which no set, f, of symbol
sequences is mapped via time translation onto itself, except for sets of zero probability and sets of
probability one. Per Birkhoff’s ergodic theorem (Billingsley, 1978), these conditions guarantee that
averages taken over a sufficiently long individual sequence will correspond to averages taken over the
ensemble of all such sequences. Otherwise, it is possible that even stationary processes could get trapped
in some subset of the ensemble.
and Thomas (1991), the length of the optimal encoding of a sequence α₀, α₁, … , α_{n−1} drawn
from P is essentially −log P{α₀, α₁, … , α_{n−1}}, so

\[
\lim_{n \to \infty} -\frac{1}{n} \log P\{\alpha_0, \alpha_1, \ldots, \alpha_{n-1}\}
= \lim_{n \to \infty} \frac{1}{n}\, l_P(\alpha_0, \alpha_1, \ldots, \alpha_{n-1}),
\]

almost surely under P.
We now formally ask a similar question about relative information: Initially, the receiver
has assigned the probability Q(α0n-1) to receiving the subsequence {α0, α1, α2, … , αn-1} =
α0n-1. If P(α0n-1) is the true distribution of α0n-1, then the SMB Theorem establishes that
the per-symbol length of the optimal encoding of α0n-1 is, in fact, the per-symbol entropy
of P. We would like to claim that the decrease in length of the optimal encoding required
for each sequence is the per-symbol relative entropy, n-1 ^ (P; Q), almost surely with
respect to the true probability distribution P for any ergodic source.
One problem that could arise is that the ratio P(α) ⁄ Q(α) might be undefined because
Q(α) might approach zero as the length of α becomes infinite. Fortunately, this problem
can be avoided by the natural condition that sets of sequences that occur with Q-
probability zero must also occur with P-probability zero (i.e. P is absolutely continuous
with respect to Q).
Theorem Let α0, α1, α2, … , αn-1 = α0n-1 be a sequence of characters from some
alphabet A, generated by a source that is ergodic with respect to the probability
distribution P(α0n-1), and let the optimal encoding of that sequence have length
lP(α0n-1). If a second distribution Q(α0n-1) is defined on the sequence and if the
encoding of that sequence that would be optimal assuming Q(α0n-1), has length
lQ(α0n-1) > lP(α0n-1), then
\[
\lim_{n \to \infty} \frac{1}{n} \left[ l_Q(\alpha_0^{n-1}) - l_P(\alpha_0^{n-1}) \right]
= \lim_{n \to \infty} \frac{1}{n} E_P\left[ \log \frac{P(\alpha_0^{n-1})}{Q(\alpha_0^{n-1})} \right],
\]
almost surely under P, as well. The proof parallels the proof of the Shannon-McMillan-
Breiman Theorem in Billingsley (1978). Following Billingsley, then, we write
\[
\begin{aligned}
g_0(\alpha) &= -\log Q(\alpha_0), \\
g_1(\alpha) &= -\log \frac{Q(\alpha_1, \alpha_0)}{Q(\alpha_1)}, \\
&\;\;\vdots \\
g_k(\alpha) &= -\log \frac{Q(\alpha_k, \ldots, \alpha_1, \alpha_0)}{Q(\alpha_k, \ldots, \alpha_1)},
\end{aligned}
\]

so that

\[
\frac{1}{n} \sum_{k=0}^{n-1} g_k(T^k \alpha) = -\frac{1}{n} \log Q(\alpha_0^{n-1}),
\]

for finite n, where we have introduced the shift, T, that maps the sequence α₀, α₁, α₂, … , α_{n−1}
into α₁, α₂, … , αₙ.
The sum on the left side of the above would be the kind of sum to which Birkhoff’s
pointwise Ergodic Theorem applies, except that the gk’s depend on k. However, in
Billingsley’s proof of the Shannon-McMillan-Breiman Theorem, he establishes that, for
all probability measures Q,
\[
\lim_{k \to \infty} g_k(\alpha)
= \lim_{k \to \infty} -\log \frac{Q(\alpha_k, \ldots, \alpha_1, \alpha_0)}{Q(\alpha_k, \ldots, \alpha_1)}
= \lim_{k \to \infty} -\log Q(\alpha_0 \mid \alpha_1, \alpha_2, \ldots, \alpha_k)
\equiv g(\alpha),
\]
with Q -probability 1. Since convergence fails, at worst, on a set of measure zero under
Q, and P >> Q, the gk’s must also converge to a finite limit with probability 1 under P.
We therefore write

\[
-\frac{1}{n} \log Q(\alpha_0^{n-1})
= \frac{1}{n} \sum_{k=0}^{n-1} g(T^k \alpha)
+ \frac{1}{n} \sum_{k=0}^{n-1} \left[ g_k(T^k \alpha) - g(T^k \alpha) \right],
\]

and observe that:

1. The first sum on the right side of the above equation has the same limit as
−(1/n) E_P[log Q(α₀ⁿ⁻¹)] as n → ∞, and
2. The second sum on the right side of the above equation converges almost
everywhere-P to zero.

To establish 1., note that, by Birkhoff’s ergodic theorem,

\[
E_P[g(\alpha)] = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} g(T^k \alpha),
\]

almost surely under P, while, because E_Q[sup_k g_k(α)] is finite, Lebesgue’s Dominated
Convergence Theorem guarantees that

\[
E_P[g(\alpha)] = \lim_{n \to \infty} E_P[g_n(\alpha)].
\]
Furthermore, because the gₙ’s converge almost surely, for every ε_K > 0 we can find K < n
sufficiently large that |gₙ(α) − g_k(α)| < ε_K for all k between K and n. The error in replacing
E_P[gₙ(α)] by (1/n) Σ_{k=0}^{n−1} E_P[g_k(α)] is therefore bounded by

\[
\frac{1}{n} \sum_{k=0}^{n-1} E_P\left| g_n(\alpha) - g_k(\alpha) \right|
\le \frac{1}{n} \sum_{k=0}^{K-1} E_P\left| g_n(\alpha) - g_k(\alpha) \right|
+ \frac{n-K}{n}\, \varepsilon_K.
\]
So, by fixing K and letting n ∞, we conclude that this error is less than εK plus a term
that vanishes. We now observe that K, itself, can be made arbitrarily large, allowing εK to
become arbitrarily small, and implying that
\[
E_P[g(\alpha)]
= \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} E_P[g_k(\alpha)]
= \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} E_P\left[g_k(T^k \alpha)\right]
= \lim_{n \to \infty} -\frac{1}{n} E_P\left[\log Q(\alpha_0^{n-1})\right],
\]

where the second equality follows from the stationarity of α, and the third is justified by
another application of Lebesgue’s Dominated Convergence Theorem.
To establish 2., we must show that

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \left[ g_k(T^k \alpha) - g(T^k \alpha) \right] = 0.
\]

This sum is bounded in absolute value, up to finitely many initial terms, by

\[
\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} G_N(T^k \alpha) = E_P[G_N(\alpha)]
\]

for all N (Here we have defined G_N(α) ≡ sup_{k ≥ N} |g_k(α) − g(α)| and invoked the ergodic
theorem to justify the equality of the terms involving G_N.). We already know that
\[
\lim_{N \to \infty} G_N(\alpha) = 0
\]

almost surely-P, so, given that G_N(α) is dominated by the P-integrable function g(α) +
sup_k g_k(α), Lebesgue’s Dominated Convergence Theorem allows us to conclude that

\[
\lim_{N \to \infty} E_P[G_N(\alpha)] = E_P\left[\lim_{N \to \infty} G_N(\alpha)\right] = 0.
\]
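For the special case of an i.i.d. source, the conclusion of the theorem can be illustrated by simulation: the per-symbol log-likelihood ratio between the true distribution P and an erroneous distribution Q converges to the relative entropy D(P; Q). A minimal sketch (hypothetical distributions, base-2 logarithms):

```python
import math
import random

random.seed(2)
P = [0.6, 0.3, 0.1]    # true i.i.d. symbol distribution (hypothetical)
Q = [1/3, 1/3, 1/3]    # receiver's erroneous distribution

# Relative entropy D(P; Q) in bits per symbol:
D = sum(p * math.log2(p / q) for p, q in zip(P, Q))

n, log_ratio = 200_000, 0.0
for _ in range(n):
    a = random.choices(range(3), weights=P)[0]
    log_ratio += math.log2(P[a] / Q[a])   # log P(alpha)/Q(alpha) accumulates per symbol

print(log_ratio / n, D)  # per-symbol log-likelihood ratio vs. D(P; Q)
```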
The theorem contributed by this paper can be used to conclude that the formula for the
doubling rate in the T = 1 case,

\[
W = D(p; q) - D(p; b),
\]

remains true if b and q for race n are functions of some ergodic symbol source, Yₙ,
including, but not limited to, the outcomes of previous races; that is, if b(n) = b(Y₁, Y₂, …,
Y_{n−1}) and q(n) = q(Y₁, Y₂, …, Y_{n−1}). In this case, the stochastic doubling rate,
lim_{n→∞} n⁻¹ log₂ S(n), converges almost surely to the same expression.
Furthermore, suppose that the bettor comes up with a way of using the various messages
in an ensemble M to ensure that his choice of b(k) is p(k) in the kth race, rather than the
track probabilities. The expected doubling rate, averaged over all of the messages in M, will
be precisely the quantity defined above as the pragmatic information of p with respect to
the track probabilities.
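For fair odds set at the prior distribution (T = 1 and q equal to the prior), this claim can be checked directly; in the sketch below (hypothetical numbers), the average doubling rate of a bettor who bets the posterior implied by each message equals the mutual information between messages and winners:

```python
import math

# Two equally likely messages, each shifting the winner distribution (hypothetical):
pr_m = [0.5, 0.5]
p_given_m = [[0.7, 0.2, 0.1],
             [0.2, 0.3, 0.5]]
prior = [sum(pm * cond[i] for pm, cond in zip(pr_m, p_given_m)) for i in range(3)]

# Fair odds set at the prior: q_i = prior_i, R_i = 1/prior_i, so T = 1.
R = [1.0 / pi for pi in prior]

def doubling_rate(p_true, b):
    return sum(pi * math.log2(bi * ri) for pi, bi, ri in zip(p_true, b, R))

# Betting the posterior after each message, averaged over messages:
W_side = sum(pm * doubling_rate(cond, cond) for pm, cond in zip(pr_m, p_given_m))

# The same average computed as the pragmatic information (mutual information):
I = sum(pm * sum(c * math.log2(c / pr) for c, pr in zip(cond, prior))
        for pm, cond in zip(pr_m, p_given_m))
assert abs(W_side - I) < 1e-12
```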
In earlier work (Weinberger, 2002, and Weinberger, 2004), the present author suggested
that the mutual information between the output of a decision maker with and without
benefit of an ensemble of input messages can reasonably be applied to measure pragmatic
information. This paper makes a similar argument to the one that ultimately proved
definitive for Shannon’s entropy measure: If a sufficiently long uncompressed symbol
string has length n, it can always be compressed to a length nh, where h is the per-symbol
entropy. Similarly, because of the theorem proven here, we know that the improvement in
the optimally compressed length of a sufficiently long symbol string obtained by using the
actual symbol distribution instead of an erroneous one is the relative entropy between the
two distributions. The expected improvement, when averaged over an ensemble of the
messages that would update the symbol distribution in a variety of ways, is thus the
proposed pragmatic information measure. Furthermore, the proposed measure has an
intuitive interpretation as the expected doubling rate of a trading strategy when presented
with a series of trading signals. Both of these results present a compelling argument that
Weinberger’s proposed definition of pragmatic information is indeed correct.
REFERENCES
Billingsley, P. (1978). Ergodic Theory and Information, Robert E. Krieger Publishing
Company, Huntington, NY.
Cover, T. and Thomas, J. (1991). Elements of Information Theory, John Wiley & Sons, New
York.
Lo, A. and MacKinlay, A. C. (1999). A Non-Random Walk Down Wall Street, Princeton
University Press, Princeton, N. J.
Lo, A., Mamaysky, H. and Wang, J. (2000). “Foundations of Technical Analysis: Computational
Algorithms, Statistical Inference, and Empirical Implementation,” Journal of Finance, 55, 1705-
1765.
Schroeder, M. (1991). Fractals, Chaos, Power Laws, W.H. Freeman & Co., New York.
Weinberger, E. (2006). “Pragmatic Information and Gaian Development,” Mind and Matter, 4,
#2, 219-234. Electronic version: http://arxiv.org/abs/nlin.AO/0606012.