
LIVING INFORMATION THEORY

The 2002 Shannon Lecture


by
Toby Berger
School of Electrical and Computer Engineering
Cornell University, Ithaca, NY 14853

1 Meanings of the Title

The title, "Living Information Theory," is a triple entendre. First and foremost, it pertains to the information theory of living systems. Second, it symbolizes the fact that our research community has been living information theory for more than five decades, enthralled with the beauty of the subject and intrigued by its many areas of application and potential application. Lastly, it is intended to connote that information theory is decidedly alive, despite sporadic protestations to the contrary. Moreover, there is a thread that ties together all three of these meanings for me. That thread is my strong belief that one way in which information theorists, both new and seasoned, can assure that their subject will remain vitally alive deep into the future is to embrace enthusiastically its applications to the life sciences.

2 Early History of Information Theory in Biology

In the 1950's and early 1960's a cadre of scientists and engineers were adherents of the premise that information theory could serve as a calculus for living systems. That is, they believed information theory could be used to build a solid mathematical foundation for biology, which always had occupied a peculiar middle ground between the hard and the soft sciences. International meetings were organized by Colin Cherry and others to explore this frontier, but by the mid-1960's the effort had dissipated. This may have been due in part to none other than Claude Shannon himself, who in his guest editorial, The Bandwagon, in the March 1956 issue of the IRE Transactions on Information Theory stated:

Information theory has ... perhaps ballooned to an importance beyond its actual accomplishments. Our fellow scientists in many different fields, attracted by the fanfare and by the new avenues opened to scientific analysis, are using these ideas in ... biology, psychology, linguistics, fundamental physics, economics, the theory of the organization, ... Although this wave of popularity is certainly pleasant and exciting for those of us working in the field, it carries at the same time an element of danger. While we feel that information theory is indeed a valuable tool in providing fundamental insights into the nature of communication problems and will continue to grow in importance, it is certainly no panacea for the communication engineer or, a fortiori, for anyone else. Seldom do more than a few of nature's secrets give way at one time.
More devastating was Peter Elias's scathing 1958 editorial in the same journal, Two Famous Papers, which in part read:

The first paper has the generic title Information Theory, Photosynthesis and Religion ... written by an engineer or physicist ... I suggest we stop writing [it], and release a supply of man power to work on ... important problems which need investigation.
The demise of the nascent community that was endeavoring to inject information theory into mainstream biology probably was occasioned less by these "purist" information theory editorials than by the relatively primitive state of quantitative biology at the time. Note in this regard that:

1. The structure of DNA was not determined by Crick and Watson until five years after Shannon published A Mathematical Theory of Communication.

2. It was not possible to measure even a single neural pulse train with millisecond accuracy; contrastingly, today it is possible simultaneously to record accurately in vivo the pulse trains of many neighboring neurons as an aid to developing an information theory of real neural nets.

3. It was not possible to measure time variations in the concentrations of chemicals at sub-millisecond speeds in volumes of submicron dimensions such as those which constitute ion channels in neurons. This remains a stumbling block, but measurement techniques capitalizing on fluorescence and other phenomena are steadily progressing toward this goal.

We offer arguments below to support the premise that matters have progressed to a stage at which biology is positioned to profit meaningfully from an invasion by information theorists. Indeed, during the past decade some biologists have equipped themselves with more than a surface knowledge of information theory and are applying it correctly and fruitfully to selected biological subdisciplines, notable among which are genomics and neuroscience. Since our interest here is in the information theory of sensory perception, we will discuss neuroscience and eschew genomics.

3 Information Within Organisms

At a fundamental level information in a living organism is instantiated in the time variations of the concentrations of chemical and electrochemical species (ions, molecules and compounds) in the compartments that comprise the organism. Chemical thermodynamics and statistical mechanics tell us that these concentrations are always tending toward a multiphase equilibrium characterized by minimization of the Helmholtz free energy functional. On the other hand, complete equilibrium with the environment never is attained, both because the environment constantly changes and because the organism must exhibit homeostasis in order to remain "alive". A fascinating dynamic prevails in which the organism sacrifices internal energy in order to reduce its uncertainty about the environment, which in turn permits it to locate new sources of energy and find mates with whom to perpetuate the species. This is one of several considerations that strongly militate in favor of looking at an information gain by a living system never in absolute terms but rather always relative to the energy expended to achieve it.

There is, in addition, an intriguing mathematical analogy between the equations that govern multiphase equilibrium in chemical thermodynamics and those which specify points on Shannon's rate-distortion function of an information source with respect to a fidelity criterion [9]. This analogy is not in this writer's opinion just a mathematical curiosity but rather is central to fruitfully "bringing information theory to life." We shall not be exploring this analogy further here, however. This is because, although it provides an overarching theoretical framework, it operates on a level which does not readily lead to concrete results apropos our goal of developing an information-theoretically based formulation of sensory perception.
An information theorist venturing into new territory must treat that territory with respect. In particular, one should not assume that, just because the basic concepts and methods developed by Shannon and his disciples have proved so effective in describing the key features of man-made communication systems, they can be applied en masse to render explicable the long-standing mysteries of another discipline. Rather, one must think critically about information-theoretic concepts and methods and then apply only those that genuinely transfer to the new territory. My endeavors in this connection to date have led me to the following two beliefs:

• Judicious application of Shannon's fundamental concepts of entropy, mutual information, channel capacity and rate-distortion is crucial to gaining an elevated understanding of how living systems handle sensory information.

• Living systems have little if any need for the elegant block and convolutional coding theorems and techniques of information theory because, as will be explained below, organisms have found ways to perform their information handling tasks in an effectively Shannon-optimum manner without having to employ coding in the information-theoretic sense of the term.
Is it necessary to learn chemistry, biochemistry, biophysics, neuroscience, and such before one can make any useful contributions? The answer, I feel, is "Yes, but not deeply." The object is not to get to the point where you can think like a biologist. The object is to get to the point where you can think like the biology. The biology has had hundreds of millions of years to evolve via natural selection to a point at which much of that which it does is done in a nearly optimum fashion. Hence, thinking about how the biology should do things is often effectively identical to thinking about how the biology does do things and is perhaps even a more fruitful endeavor.
Information theorists are fond of figuring out how best to transmit information over a "given" channel. When trespassing on biological turf, however, an information theorist must abandon the tenet that the channel is given. Quite to the contrary, nature has evolved the channels that function within organisms in response to needs for specific information residing either in the environment or in the organism itself - channels for sight, channels for sound, for olfaction, for touch, for taste, for blood alcohol and osmolality regulation, and so on. Common sense strongly suggests that biological structures built to sense and transfer information from certain sites located either outside or inside the organism to other such sites will be efficiently "matched" to the data sources they service. Indeed, it would be ill-advised to expect otherwise, since natural selection rarely chooses foolishly, especially in the long run. The compelling hypothesis, at least from my perspective, is that all biological channels are well matched to the information sources that feed them.

Footnote: Elwyn Berlekamp related at IEEE ISIT 2001 in Washington, DC, a conversation he had with Claude Shannon in an MIT hallway in the 1960's, the gist of which was:
CES: Where are you going, Elwyn?
EB: To the library to study articles, including some of yours.
CES: Oh, don't do that. You'd be better off to just figure it out for yourself.

4 Double Matching of Sources and Channels

Matching a channel to a source has a precise mathematical meaning in information theory. Let us consider the simplest case of a discrete memoryless source (dms) with instantaneous letter probabilities {p(u), u ∈ U} and a discrete memoryless channel (dmc) with instantaneous transition probabilities {p(y|x), x ∈ X, y ∈ Y}. Furthermore, let us suppose that the channel's purpose is to deliver a signal {Y} to its output terminal on the basis of which one could construct an approximation {V} to the source data {U} that is accurate enough for satisfactory performance in some application of interest. Following Shannon, we shall measure said accuracy by means of a distortion measure d : U × V → [0, ∞). {V} will be considered to be a sufficiently accurate approximation of {U} if and only if the average distortion does not exceed a level deemed to be tolerable which we shall denote by D. Stated mathematically, our requirement for an approximation to be sufficiently accurate is

$$\lim_{n \to \infty} E\, n^{-1} \sum_{k=1}^{n} d(U_k, V_k) \le D.$$

In order for the dmc {p(y|x)} to be instantaneously matched to the combination of the dms {p(u)} and the distortion measure {d(u, v)} at fidelity D, the following requirements must be satisfied:

1. The number of source letters produced per second must equal the number of times per second that the channel is available for use.

2. There must exist two transition probability matrices {r(x|u), u ∈ U, x ∈ X} and {w(v|y), y ∈ Y, v ∈ V}, such that the end-to-end transition probabilities

$$q(v|u) := \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} r(x|u)\, p(y|x)\, w(v|y), \qquad (u, v) \in \mathcal{U} \times \mathcal{V}$$

solve the variational problem that defines the point (D, R(D)) on Shannon's rate-distortion function of the dms {p(u)} with respect to the distortion measure {d(u, v)}.
Readers not conversant with rate-distortion theory should refer to Section 11 below. If that does not suffice, they should commune at their leisure with Shannon [4], Jelinek [10], Gallager [11] or Berger [9]. However, the two key examples that follow should be largely accessible to persons unfamiliar with the content of any of these references. Each example is constructed on a foundation comprised of two of Shannon's famous formulas.

5 Two Key Examples of Double Matching

Example 1. This example uses: (1) the formula for the capacity of a binary symmetric channel (BSC) with crossover probability ε, namely

$$C = 1 - h(\varepsilon) = 1 + \varepsilon \log \varepsilon + (1 - \varepsilon)\log(1 - \varepsilon) \ \ \text{bits/channel use},$$

where we assume without loss of essential generality that ε ≤ 1/2, and (2) the formula for the rate-distortion function of a Bernoulli-1/2 source with respect to the error frequency distortion measure d(x, y) = 1 - δ(x, y), namely

$$R(D) = 1 - h(D) = 1 + D \log D + (1 - D)\log(1 - D) \ \ \text{bits/source letter}, \quad 0 \le D \le 1/2.$$
Shannon's converse channel coding theorem [1] establishes that it is not possible to send more than nC bits of information across the channel via n channel uses. Similarly, his converse source coding theorem [4] establishes that it is not possible to generate an approximation V_1, ..., V_n to source letters U_1, ..., U_n that has an average distortion E n⁻¹ Σ_k d(U_k, V_k) of D or less unless that representation is based on nR(D) or more bits of information about these source letters. Accordingly, assuming the source resides at the channel input, it is impossible to generate an approximation to it at the channel output that has an average distortion any smaller than the value of D for which R(D) = C, even if the number n of source letters and channel uses is allowed to become large. Comparing the above formulas for C and R(D), we see that no value of average distortion less than ε can be achieved. This is true regardless of how complicated an encoder we place between the source and the channel, how complicated a decoder we place between the channel and the recipient of the source approximation, and how large a finite delay we allow the system to employ.
It is easy to see, however, that D = ε can be achieved simply by connecting the source directly to the channel input and using the channel output as the approximate reconstruction of the source output. Hence, this trivial communication system, which is devoid of any source or channel coding and operates with zero delay, is optimum in this example. There are two reasons for this:

Reason One: The channel is instantaneously matched to the source as defined above with the particularly simple structure that X = U, V = Y, r(x|u) = δ(u, x) and w(v|y) = δ(y, v). That is, the source is instantaneously and deterministically fed into the channel, and the channel output directly serves as the approximation to the source.

Reason Two: The source also is matched to the channel in the sense that the distribution of each U_k, and hence of each X_k, is p(0) = p(1) = 1/2, which distribution maximizes the mutual information between a channel input and the corresponding output. That is, the channel input letters are i.i.d. with their common distribution being the one that solves the variational problem that defines the channel's capacity.
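The following numerical sketch (not part of the original lecture; the crossover probability ε = 0.1 is an assumed illustrative value) evaluates the two formulas of Example 1 and confirms by simulation that the uncoded, zero-delay connection already attains the limiting distortion D = ε.

```python
import numpy as np

def h(p):
    """Binary entropy in bits; h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

eps = 0.1                       # assumed BSC crossover probability
C = 1 - h(eps)                  # capacity of the BSC, bits/channel use
R = lambda D: 1 - h(D)          # rate-distortion function of the Bernoulli-1/2 source
print(f"C = {C:.4f} bits/use,  R(D=eps) = {R(eps):.4f} bits/letter")   # equal, so D_min = eps

# Direct source-to-channel connection (no coding, zero delay):
rng = np.random.default_rng(0)
n = 200_000
U = rng.integers(0, 2, n)                       # Bernoulli-1/2 source
V = U ^ (rng.random(n) < eps).astype(int)       # BSC output used directly as the reconstruction
print(f"empirical distortion = {np.mean(U != V):.4f}   (target D = {eps})")
```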
Example 2. The channel in this example is a time-discrete, average-power-constrained additive white Gaussian noise (AWGN) channel. Specifically, its k-th output Y_k equals X_k + N_k, where X_k is the k-th input and the additive noises N_k are i.i.d. N(0, N) for k = 1, 2, .... Also, the average signaling power cannot exceed S, which we express mathematically by the requirement

$$\lim_{n \to \infty} E\, n^{-1} \sum_{k=1}^{n} X_k^2 \le S.$$

Shannon's well-known formula for this channel's capacity is

$$C = \frac{1}{2}\log\Bigl(1 + \frac{S}{N}\Bigr) \ \ \text{bits/channel use}.$$
The source in this example produces i.i.d. N(0, σ²) symbols {U_k}. The squared error distortion measure, d(u, v) = (v - u)², is employed, so the end-to-end accuracy is the mean-squared error,

$$MSE = \lim_{n \to \infty} E\, n^{-1} \sum_{k=1}^{n} (V_k - U_k)^2.$$

Shannon's celebrated formula for the rate-distortion function of this source and distortion measure combination is

$$R(D) = (1/2)\log(\sigma^2 / D), \quad 0 \le D \le \sigma^2.$$

The minimum achievable value of the MSE is, as usual, the value of D that satisfies R(D) = C, which in this example is

$$D = \sigma^2 / (1 + S/N).$$
As in Example 1, we find that this minimum value of D is trivially attainable without any source or channel coding and with zero delay. However, in this instance the source symbols must be scaled by α := √(S/σ²) before being put into the channel in order to ensure compliance with the power constraint. Similarly, V is produced by multiplying Y by the constant β := √(Sσ²)/(S + N), since this produces the minimum MSE estimate of U based on the channel output. Hence, the channel is instantaneously matched to the source via the deterministic transformations r(x|u) = δ(x - αu) and w(v|y) = δ(v - βy). Moreover, the source is matched to the channel in that, once scaled by α, it becomes the channel input which, among all those that comply with the power constraint, maximizes mutual information between itself and the channel output that it elicits. Thus, the scaled source drives the constrained channel at its capacity.
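A companion sketch for Example 2 (again illustrative, with assumed values σ² = 4, S = 2, N = 1) scales the source by α, applies the MMSE scaling β at the output, and checks that the resulting mean-squared error matches σ²/(1 + S/N), i.e., the D at which R(D) = C.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
sigma2, S, N = 4.0, 2.0, 1.0                 # assumed source variance, power limit, noise variance

U = rng.normal(0.0, np.sqrt(sigma2), n)      # i.i.d. N(0, sigma^2) source
alpha = np.sqrt(S / sigma2)                  # scale so that E[X^2] = S
X = alpha * U
Y = X + rng.normal(0.0, np.sqrt(N), n)       # AWGN channel
beta = np.sqrt(S * sigma2) / (S + N)         # MMSE scaling of the channel output
V = beta * Y

D_theory = sigma2 / (1.0 + S / N)            # value of D that solves R(D) = C
print(f"empirical MSE = {np.mean((V - U) ** 2):.4f},  theoretical D = {D_theory:.4f}")
C = 0.5 * np.log2(1 + S / N)                 # capacity, bits/channel use
R = 0.5 * np.log2(sigma2 / D_theory)         # rate-distortion at D_theory, bits/source letter
print(f"C = {C:.4f},  R(D) = {R:.4f}")
```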
It can be argued validly that, notwithstanding the fact that Examples 1 and 2 deal with source models, channel models, and distortion measures all dear to information theorists, these examples are exceptional cases. Indeed, if one were to modify {p(u)} or {p(y|x)} or {d(u, v)} even slightly, there no longer would be an absolutely optimum system that is both coding-free and delay-free. Achieving optimal performance would then require the use of coding schemes whose complexity and delay diverge as their end-to-end performance approaches the minimum possible average distortion attainable between the source and an approximation of it based on information delivered via the channel. However, if the perturbations to the source, to the channel and/or to the distortion measure were minor, then an instantaneous system would exist that is only mildly suboptimum. Because of its simplicity and relatively low operating costs, this mildly suboptimum scheme likely would be deemed preferable in practice to a highly complicated system that is truly optimum in the pure information theory sense.
I have argued above for why it is reasonable to expect biological channels to have evolved so as to be matched to the sources they monitor. I further believe that, as in Examples 1 and 2, the data selected from a biological source to be conveyed through a biological channel will drive that channel at a rate effectively equal to its resource-constrained capacity. That is, I postulate that double matching of channel to source and of source to channel in a manner analogous to that of Examples 1 and 2 is the rule rather than the exception in the information theory of living systems. Indeed, suppose selected stimuli were to be conditioned for transmission across one of an organism's internal channels in such a way that information failed to be conveyed at a rate nearly equal to the channel's capacity calculated for the level of resources being expended. This would make it possible to select additional data and then properly condition and transmit both it and the original data through the channel in a manner that does not increase the resources consumed. To fail to use such an alternative input would be wasteful either of information or of energy, since energy usually is the constrained resource in question. As explained previously, a fundamental characteristic of an efficient organism is that it always should be optimally trading information for energy, or vice versa, as circumstances dictate. The only way to assure that pertinent information will be garnered at low latency at the maximum rate per unit of power expended is not only to match the channel to the source but also to match the source to the channel.

6 Bit Rate and Thermodynamic Efficiency

We shall now discuss how increasing the number of bits handled per second unavoidably increases the number of joules expended per bit (i.e., decreases thermodynamic efficiency). To establish this in full generality requires penetrating deeply into thermodynamics and statistical mechanics. We shall instead content ourselves with studying the energy-information tradeoff implicit in Shannon's celebrated formula for the capacity of an average-power-constrained bandlimited AWGN channel, namely

$$C(S) = W \log\Bigl(1 + \frac{S}{N_0 W}\Bigr),$$

where S is the constrained signaling power, W is the bandwidth in positive frequencies, and N_0 is the one-sided power spectral density of the additive white Gaussian noise. Like all capacity-cost functions, C(S) is concave in S. Hence, its slope decreases as S increases; specifically, C'(S) = W/(S + N_0 W). The slope of C(S) has the dimensions of capacity per unit of power, which is to say (bits/second)/(joules/second) = bits/joule. Since the thermodynamic efficiency of the information-energy tradeoff is measured in bits/joule, it decreases steadily as the power level S and the bit rate C(S) increase. This militates in favor of gathering information slowly in any application not characterized by a stringent latency demand. To be sure, there are circumstances in which an organism needs to gather and process information rapidly and therefore does so. However, energy conservation dictates that information handling always should be conducted at as leisurely a pace as the application will tolerate. For example, recent experiments have shown that within the neocortex a neural region sometimes transfers information at a high rate and accordingly expends energy liberally, while at other times it conveys information at a relatively low rate and thereby expends less than proportionately much energy. In both of these modes, and others in between, our hypothesis is that these coalitions of neurons operate in an information-theoretically optimum manner. We shall attempt to describe below how this is accomplished.
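A brief numerical illustration (not from the lecture; the bandwidth and noise density below are arbitrary assumed values) shows the effect quantitatively: as the signaling power S grows, the capacity C(S) increases but its slope C'(S), the marginal number of bits bought per joule, falls steadily.

```python
import numpy as np

W = 1e3        # assumed bandwidth, Hz
N0 = 1e-9      # assumed one-sided noise spectral density, W/Hz

def C(S):
    """Capacity of the band-limited AWGN channel, bits/second."""
    return W * np.log2(1.0 + S / (N0 * W))

def marginal_bits_per_joule(S):
    """Slope C'(S) in (bits/s) per watt = bits/joule (log base 2)."""
    return W / ((S + N0 * W) * np.log(2.0))

for S in [1e-8, 1e-7, 1e-6, 1e-5]:
    print(f"S = {S:.0e} W:  C = {C(S):9.1f} bit/s,  C'(S) = {marginal_bits_per_joule(S):.3e} bit/J")
# The marginal bits/joule shrinks as S grows, which is why gathering information
# slowly is thermodynamically cheaper whenever latency permits.
```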

7 Feedforward and Feedback: Bottom-Up and Top-Down

Before turning in earnest to information handling by neural regions, we first need to generalize and further explicate the phenomenon of double matching of sources and channels. So far, we have discussed this only in the context of sources and channels that are memoryless. We could extend to sources and/or channels with memory via the usual procedure of blocking successive symbols into a "supersymbol" and treating long supersymbols as nearly i.i.d., but this would increase the latency by a factor equal to the number of symbols per supersymbol, thereby defeating one of the principal advantages of double matching. We suggest an alternative approach below which leads to limiting the memory of many crucial processes to at most first-order Markovianness.
It has long been appreciated that neuromuscular systems and metabolic regulatory mechanisms exhibit masterful use of feedback. Physiological measurements of the past fifteen or so years have incontrovertibly established that the same is true of neurosensory systems. Describing signaling paths in the primate visual cortex, for example, Woods and Krantz [8] tell us that "In addition to all the connections from V1 and V2 to V3, V4 and V5, each of these regions connects back to V1 and V2. These seemingly backward or reentrant connections are not well understood. ... Information, instead of flowing in one direction, flows in both directions. Thus, later levels do not simply receive information and send it forward, but are in an intimate two-way communication with other modules." Of course, it is not that information flowed unidirectionally in the visual system until some time in the 1980's and then began to flow bidirectionally. Rather, as is so often the case in science, measurements made possible by new instrumentation and methodologies have demanded that certain cherished paradigms be seriously revised. In this case, those mistaken paradigms espoused so-called "bottom-up" unidirectionality of signaling pathways in the human visual system (HVS) [6], [5].
Instead of speaking about feedforward and feedback signaling, neuroscientists refer to bottom-up and top-down signaling, respectively. Roughly speaking, neurons whose axons carry signals principally in a direction that moves from the sensory organs toward the "top brain" are called bottom-up neurons, while those whose axons propagate signals from the top brain back toward the sensory organs are called top-down neurons. Recent measurements have revealed that there are roughly as many top-down neurons in the HVS as there are bottom-up neurons. Indeed, nested feedback loops operate at the local, regional and global levels. We shall see that a theory of sensory perception which embraces rather than eschews feedback reaps rewards in the form of analytical results that are both simpler to obtain and more powerful in scope.

Footnote 2: Shannon's topic for the inaugural Shannon Lecture in June 1973 was Feedback. Biological considerations, one of his many interests [3], may have in part motivated this choice.

8 Neurons and Coalitions

The roughly 10^10 neurons in the human visual system (HVS) constitute circa one-tenth of all the neurons in the brain. HVS neurons are directly interconnected with one another via an average of 10^4 synapses per neuron. That is, a typical HVS neuron has on its dendritic tree about 10^4 synapses at each of which it taps off the spike signal propagating along the axon of one of the roughly 10^4 other HVS neurons that are afferent (i.e., incoming) to it. Via processing of this multidimensional input in a manner to be discussed below, it generates an efferent (i.e., outgoing) spike train on its own axon which propagates to the circa 10^4 other HVS neurons with which it is in direct connection. The 10^10 × 10^10 matrix whose (i, j) entry is 1 if neuron i is afferent to neuron j and 0 otherwise thus has a 1's density of 10^-6. However, there exist special subsets of the HVS neurons that are connected much more densely than this. These special subsets, among which are the ones referred to as V1, V2, ..., V5 in the above quote, consist of a few million to as many as a few tens of millions of neurons and have connectivity submatrices whose 1's densities range from 0.1 to as much as 0.5. Common sense suggests and experiments have verified that the neurons comprising such a subset work together to effect one or more crucial functions in the processing of visual signals. We shall henceforth refer to such subsets of neurons as "coalitions". Alternative names for them include neural "regions", "groupings", "contingents", and "assemblies".
Figure 1 shows a schematic representation of a neural coalition. Whereas real neural spike trains occur in continuous time and are asynchronous, Figure 1 is a time-discrete model. Its time step is circa 2.5 ms, which corresponds to the minimal duration between successive instants at which a neuron can generate a spike; spikes are also known as action potentials. The time-discrete model usually captures the essence of the coalition's operation as regards information transfer. Any spike traveling along an axon afferent to a coalition of neurons in the visual cortex will reach all the members of that coalition within the same time step. That is, although the leading edge of the spike arrives serially at the synapses to which it is afferent, the result is as it were multicast to them all during a single time slot of the time-discrete model.
The input to the coalition in Figure 1 at time k is a random binary vector X(k) which possesses millions of components. Its i-th component, X_i(k), is 1 if a spike arrives on the i-th axon afferent to the coalition during the k-th time step and is 0 otherwise. The afferent neurons in Figure 1 have been divided into two groups indexed by BU and TD, standing respectively for bottom-up and top-down. The vertical lines represent the neurons of the coalition. The presence (absence) of a dark dot where the i-th horizontal line and the m-th vertical line cross indicates that the i-th afferent axon forms (does not form) a synapse with the m-th neuron of the coalition. The strength, or weight, of synapse (i, m) will be denoted by W_im; if afferent axon i does not form a synapse with coalition neuron m, then W_im = 0. If W_im > 0, the connection (i, m) is said to be excitatory; if W_im < 0, the connection is inhibitory. In primate visual cortex about five-sixths of the connections are excitatory.

The so-called post-synaptic potential (PSP) of neuron m is built up during time step k as a weighted linear combination of all the signals that arrive at its synapses during this time step. If this sum exceeds the threshold T_m(k) of neuron m at time k, then neuron m produces a spike at time k and we write Y_m(k) = 1; if not, then Y_m(k) = 0. The thresholds do not vary much with m and k, with one exception. If Y_m(k) = 1, a refractory period of duration about equal to a typical spike width follows during which the threshold is extremely high, making it virtually impossible for the neuron to spike. In real neurons, the PSP is reset to its rest voltage after a neuron spikes. One shortcoming of our time-discrete model is that it assumes that a neuron's PSP is reset between the end of one time step and the beginning of the next even if the neuron did not fire a spike. In reality, if the peak PSP during the previous time step did not exceed threshold and hence no action potential was produced, contributions to this PSP will partially carry over into the next time step. Because of capacitative leakage they generally will have decayed to one-third or less of their peak value a time step ago, but they will not have vanished entirely.

Footnote 3: The time-discrete model mirrors reality with insufficient accuracy in certain saturated or near-saturated conditions characterized by many of the neurons in a coalition spiking almost as fast as they can. Such instances, which occur rarely in the visual system but relatively frequently in the auditory system, are characterized by successive spikes on an axon being separated by short, nearly uniform durations whose sample standard deviation is less than 1 millisecond. Mathematical methods based on Poisson limit theorems (a.k.a. mean field approximations) and PDE's can be used to distinguish and quantify these exceptional conditions [25], [26].

Footnote 4: Most neuroscientists today agree that the detailed shape of the action potential spike as a function of time is inconsequential for information transmission purposes, all that matters being whether there is or is not a spike in the time slot in question.

Every time-discrete system diagram must include a unit delay element in order to allow time to advance. In Figure 1, unit delays occur in the boxes so marked. Note, therefore, that the random binary vector Y(k-1) of spikes and non-spikes produced by the coalition's neurons during time step k-1 gets fed back to the coalition's input during time step k. This reflects the high interconnection density of the neurons in question that is responsible for why they constitute a coalition. Also note that, after the spike trains on the axons of each neuron in the coalition are delivered to synapses on the dendrites of selected members of the coalition itself, they then proceed in either a top-down or a bottom-up direction toward other HVS coalitions. In the case of a densely connected coalition, about half of its efferent axons' connections are local ones with other neurons in the coalition. On average, about one quarter of its connections provide feedback, mostly to the coalition immediately below in the HVS hierarchy although some are directed farther down. The remaining quarter are fed forward to coalitions higher in the hierarchy, again mainly to the coalition directly above. This elucidates why we distinguished the external input X to the coalition as being comprised of both a bottom-up (BU) and a top-down (TD) subset; these subsets come, respectively, mainly from the coalitions directly below and directly above.

Neuroscientists refer not only to bottom-up and top-down connections but also to horizontal connections [15], [16], [17]. Translated into feedforward-feedback terminology, horizontal connections are local feedback such as Y(k-1) in Figure 1, top-down connections are regional feedback, and bottom-up connections are regional feedforward. Bottom-up and top-down signals also can be considered to constitute instances of what information theorists call side information.
In information theory parlance, the neural coalition of Figure 1 is a time-discrete, finite-state channel whose state is the previous channel output vector. At the channel input appear both the regional feedforward signal {X_BU(k)} and the regional feedback signal {X_TD(k)}. However, there is no channel encoder in the information-theoretic sense of that term that is able to operate on these two signals in whatever manner suits its fancy in order to generate the channel input. Rather, the composite binary vector process {X(k)} := {(X_BU(k), X_TD(k))} simply enters the channel by virtue of the axons carrying it being afferent to the synapses of the channel's neurons. We shall subsequently see that there is no decoder in the information-theoretic sense either; as foretold, the system is coding-free.

My use of the adjective "coding-free" is likely to rile both information theorists and neuroscientists - information theorists because they are deeply enamored of coding and neuroscientists because they are accustomed to thinking about how an organism's neural spike trains serve as coded representations of aspects of its environment. In hopes of not losing both halves of my audience at once, allow me to elaborate. Certainly, sensory neurons' spike trains constitute an encoding of environmental data sources. However, unless they explicitly say otherwise, information theorists referring to coding usually mean channel coding rather than source coding. Channel coding consists of the intentional insertion of cleverly selected redundant parity check symbols into a data stream in order to provide error detection and error correction capabilities for applications involving noisy storage or transmission of data. I do not believe that the brain employs error control coding (ECC).
Footnote 5: In [2] Shannon wrote, "Channels with feedback from the receiving to the transmitting point are a special case of a situation in which there is additional information available at the transmitter which may be used as an aid in the forward transmission system."

Footnote 6: There is a possibility that the brain employs a form of space-time coding, with the emphasis heavily on space as opposed to time. Here, the space dimension means the neurons themselves, the cardinality of which dwarfs that of the paucity of antennas that comprise the space dimension of the space-time codes currently under development for wireless communications. Think of it this way. In order to evolve more capable sensory systems, organisms needed to expand the temporal and/or the spatial dimensionality of their processing. Time expansion was not viable because the need to respond to certain stimuli in only a few tens of milliseconds precluded employing temporal ECC techniques of any power, because these require long block lengths or long convolutional constraint lengths which impose unacceptably long latency. Pulse widths conceivably could have been narrowed (i.e., bandwidths increased), but the width of a neural pulse appears to have held steady at circa 2 ms over all species over hundreds of millions of years, no doubt for a variety of compelling reasons. (Certain owls' auditory systems have spikes only about 1 ms wide, but we are not looking for factor-of-2 explanations here.) The obvious solution was to expand the spatial dimension. Organisms have done precisely that, relying on parallel processing by more and more neurons in order to progress up the phylogenic tree. If there is any ECC coding done by neurons, it likely is done spatially over the huge numbers of neurons involved. Indeed, strong correlations have been observed in the spiking behaviors of neighboring neurons, but these may simply be consequences of the need to obtain high resolution of certain environmental stimuli that are themselves inherently correlated and/or the need to direct certain spike trains to more locations, or more widely dispersed locations, than it is practical for a single neuron to visit. There is not yet any solid evidence that neurons implement ECC. Similarly, although outside the domain of this paper, we remark that there is not yet any concrete evidence that redundancies in the genetic code play an ECC role; if it turns out they do, said ECC capability clearly also will be predominately spatial as opposed to temporal in nature.

9 Mathematical Model of a Neural Coalition

It's time for some mathematical information theory. Let's see what has to happen in order for an organism to make optimum use of the channel in Figure 1, i.e., to transmit information through it at a rate equal to its capacity. (Of course, we are not interested just in sending any old information through the channel - we want to send the "right" information through the channel, but we shall temporarily ignore this requirement.) A channel's capacity depends on the level of resources expended. What resources, if any, are being depleted in the course of operating the channel of Figure 1? The answer lies in the biology. Obviously, energy is consumed every time one of the channel's neurons generates an action potential, or spike. It is also true that when a spike arrives at a synapse located on a dendrite of one of the channel's neurons, energy usually is expended in order to convert it into a contribution to the post-synaptic potential. This is effected via a sequence of electrochemical processes the end result of which is that vesicles containing neurotransmitter chemicals empty their contents for transportation across the synaptic cleft. This, in turn, either increases or decreases the post-synaptic potential (equivalently, the post-synaptic current), respectively as the synapse is an excitatory or an inhibitory one. The expected energy dissipated in the synapses in Figure 1 at time k therefore depends on X(k) and Y(k-1), while that dissipated in the axons depends on Y(k). The average energy dissipated in the coalition at time k therefore is the expected value of one function of (X(k), Y(k-1)) and another function of Y(k), with the effect of quantal synaptic failures (see the footnote concerning QSF) usually well approximated by multiplying the first of these two functions by s. For purposes of the theorem we are about to present, it suffices to make a less restrictive assumption that the average resources expended at time k are the expected value of some function solely of (X(k), Y(k-1), Y(k)). We may impose either a schedule of expected resource depletion constraints as a function of k or simply constrain the sum of the expected resource depletions over some appropriate range of the discrete time index k. Average energy expenditure, which we believe to be the dominant operative constraint in practice, is an important special case of this general family of resource constraint functions.

Footnote 7: Actually, the energy gets consumed principally during the process of re-setting chemical concentrations in and around the neuron after each time it spikes so as to prepare it to fire again should sufficient excitation arrive. The actual transmission of a spike is more a matter of energy conversion than of energy dissipation.

Footnote 8: Spikes arriving at synapses often are ignored, a phenomenon known as quantal synaptic failure (QSF). Its name notwithstanding, QSF actually is one of natural selection's finer triumphs, enhancing the performance of neural coalitions in several ingenious respects, the details of which can be found in the works of Levy and Baxter [13], [14]. Let S_im(k) be a binary random variable that equals 1 if QSF does not occur at synapse (i, m) at time k and equals 0 if it does; that is, S_im(k) = 1 denotes a quantal synaptic success at synapse (i, m) at time k. Often, the S_im(k)'s can be well modeled as Bernoulli-s random variables, i.e., as being i.i.d. over i, m and k with common distribution P(S = 1) = 1 - P(S = 0) = s; in practice, s ∈ [0.25, 0.9]. The phenomenon of QSF then may be modeled by multiplying the spike, if any, afferent to synapse (i, m) at time k by S_im(k). This is seen to be equivalent to installing what information theorists call a Z-channel [12] at every synapse. Were it not for QSF, the coalition channel would be effectively deterministic when viewed as an operator that transforms {X} into {Y}, since the signal-to-noise ratio on neural channels usually is quite strong. However, if the channel is viewed as an operator only from {X_BU} to {Y}, with {X_TD} considered to be random side information, QSF may no longer be its dominant source of randomness.
Let PSP_m(k) denote the post-synaptic potential of the m-th neuron in the coalition at time k. Then the output Y_m(k) of this neuron, considered to be 1 if there is a spike at time k and 0 if there isn't, is given by

$$Y_m(k) = U\bigl(PSP_m(k) - T_m\bigr),$$

where U(·) is the unit step function. The above discussion of Figure 1 lets us write

$$PSP_m(k) = \sum_i X_i(k)\, W_{im}\, Q_{im}(k)\, S_{im}(k) + \sum_l Y_l(k-1)\, W_{lm}\, Q_{lm}(k)\, S_{lm}(k),$$

where W_im is the signed weight of synapse (i, m), Q_im(k) is the random size of the quantity of neurotransmitter that will traverse the synaptic cleft in response to an afferent spike at synapse (i, m) at time k if synaptic quantal failure does not occur there then, and S_im(k) equals 0 or 1, respectively, in accordance with whether said quantal synaptic failure does or does not occur. The spiking threshold T_m = T_m(k) of neuron m at time k varies with m and k, though usually only mildly.

Note that this channel model is such that successive channel output vectors Y(k) are generated independently, conditional on their corresponding input vectors X(k) and local feedback vectors Y(k-1); that is,

$$p(y_1^n \mid x_1^n, y_0) = \prod_{k=1}^{n} p(y_k \mid x_k, y_{k-1}). \qquad (1)$$
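To make the model concrete, here is a minimal simulation of a single time step of such a coalition. It is an illustrative sketch only: the dimensions, weight statistics, quantal-size law and threshold are assumptions chosen for readability, not values taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy dimensions; real coalitions contain millions of neurons.
n_aff, n_coal = 50, 20       # afferent axons, coalition neurons
s = 0.6                      # assumed quantal synaptic success probability
T = 1.0                      # assumed spiking threshold, same for all m and k

def random_weights(rows, cols):
    """Signed synaptic weights; roughly five-sixths excitatory, the rest inhibitory."""
    sign = np.where(rng.random((rows, cols)) < 5 / 6, 1.0, -1.0)
    return sign * rng.exponential(0.1, (rows, cols))

W_aff = random_weights(n_aff, n_coal)    # W_im: afferent axon i -> coalition neuron m
W_loc = random_weights(n_coal, n_coal)   # W_lm: coalition neuron l -> coalition neuron m

def coalition_step(X_k, Y_prev):
    """PSP_m(k) = sum_i X_i W_im Q_im S_im + sum_l Y_l(k-1) W_lm Q_lm S_lm,
    then Y_m(k) = U(PSP_m(k) - T)."""
    Q_aff = rng.gamma(2.0, 0.5, W_aff.shape)     # random quantal sizes (assumed law)
    Q_loc = rng.gamma(2.0, 0.5, W_loc.shape)
    S_aff = rng.random(W_aff.shape) < s          # quantal synaptic successes (Bernoulli-s)
    S_loc = rng.random(W_loc.shape) < s
    PSP = X_k @ (W_aff * Q_aff * S_aff) + Y_prev @ (W_loc * Q_loc * S_loc)
    return (PSP > T).astype(int)                 # unit step of PSP_m(k) - T

X_k = rng.integers(0, 2, n_aff)          # afferent spikes at time k (BU and TD combined)
Y_prev = rng.integers(0, 2, n_coal)      # coalition output at time k-1 (local feedback)
print("Y(k) =", coalition_step(X_k, Y_prev))
```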

As information theorists, one of our inclinations would be to investigate conditions sufficient to ensure that such a finite-state channel model with feedback has a Shannon capacity. That is, we might seek conditions under which the maximum over channel input processes {X(k)} of the mutual information rate between said input and the output process {Y(k)} it generates, subject to whatever constraints are imposed on the input and/or the output, equals the maximum number of bits per channel use at which information actually can be sent reliably over the constrained channel. Instead, we shall focus on the mutual-information-rate-maximizing constrained input process and the structure of the joint (input, output) stochastic process it produces.

Temporarily assume that there is a genuine encoder at the channel input which, for purposes of generating input $x_k$ at time k, remembers all the past inputs $x_1^{k-1}$ and all the past local feedback (i.e., past output) values $y_0^{k-1}$; here, $y_0$ represents the "initial" state. Obviously, the maximum mutual information rate achievable under these circumstances is an upper bound to that which could be achieved when only $y_{k-1}$ is available at the channel input at time k, with $x_1^{k-1}$ and $y_0^{k-2}$ by then no longer being available there. We will show that this upper bound can be met even when access is denied to said past input and local feedback vectors. This, in turn, helps demystify how a real neural network structured as in Figure 1 can be information-theoretically optimum despite its not possessing a classical encoder at the network's input.
The postulated full-memory encoder can generate any input process whose probabilistic structure is described by a member of the set $\mathcal{P}(X_1^n)$ of probability distributions of the form

$$\prod_{k=1}^{n} p(x_k \mid x_1^{k-1}, y_0^{k-1}). \qquad (2)$$

Now define another set of input distributions on $X_1^n$, $\mathcal{P}^*(X_1^n)$, having all probability mass functions of the form

$$\prod_{k=1}^{n} p(x_k \mid y_{k-1}). \qquad (3)$$

Compared with (2), this set contains only those input distributions for which, given $Y_{k-1}$, $X_k$ becomes conditionally independent of all the previous inputs $X_1^{k-1}$ and all the previous outputs $Y_0^{k-2}$.

10 Statement and Proof of Main Theorem


Our main result is stated as the following theorem.

Theorem 1. The maximum mutual information rate between the channel's input and output processes is attained inside $\mathcal{P}^*(X_1^n)$, uniformly in the initial conditions $Y_0$. Moreover, if we restrict the distribution of the inputs $X_1^n$ to $\mathcal{P}^*(X_1^n)$, let $Y_1^n$ denote the corresponding output, and let $Y_0$ denote the initial channel state, then we have

1. $\{Y_k,\ k = 0, 1, \ldots, n\}$ is a first-order Markov chain,

2. $\{(X_k, Y_k),\ k = 1, 2, \ldots, n\}$ also is a first-order Markov chain.

Remarks: (i) $\{X_k\}$ is not necessarily a first-order Markov chain, though we have not yet endeavored to construct an example in which it fails to be. Since $\{X_k\}$ depends, in part, on bottom-up information derived from the environment, it is unrealistic to expect it to exhibit Markovianness, especially at precisely the time step duration of the model. (ii) The theorem's Markovian results help explain how many neural regions can be hierarchically stacked, as is the case in the human visual system, without necessarily engendering unacceptably large response times. (iii) The theorem reinforces a view of sensory brain function as a procedure for recursively estimating quantities of interest in a manner that becomes increasingly informed and accurate.

Proof of Theorem 1

Footnote: This proof is joint work with Yuzheng Ying [7].

We suppress underlining of vectors and abuse notation by writing $X \in \mathcal{P}(X)$ to indicate that $X$ is distributed according to some distribution in $\mathcal{P}(X)$. Furthermore, the expression $(X, Y) \in \mathcal{P}(X, Y)$ means that $X \in \mathcal{P}(X)$ and $Y$ is the output that corresponds to input $X$ and channel initial state $Y_0$. First we establish the Markovianness of the output process.

Lemma 1. If $(X_1^n, Y_1^n) \in \mathcal{P}^*(X_1^n, Y_1^n)$, then $Y_0^n$ is a first-order Markov chain.

Proof: For all k we have

$$p(y_k \mid y_0^{k-1}) = \sum_{x_k} p(y_k \mid x_k, y_0^{k-1})\, p(x_k \mid y_0^{k-1}).$$

Since $X_1^n \in \mathcal{P}^*(X_1^n)$,

$$p(x_k \mid y_0^{k-1}) = p(x_k \mid y_{k-1}).$$

Thus, with reference to the conditional memorylessness of the channel (cf. equation (1)), we have

$$p(y_k \mid y_0^{k-1}) = \sum_{x_k} p(y_k \mid x_k, y_{k-1})\, p(x_k \mid y_{k-1}) = p(y_k \mid y_{k-1}). \qquad (4)$$

Remark: Depending on the input pmf's $p(x_k \mid y_{k-1})$, $k = 1, 2, \ldots$, $Y_0^n$ can be either a homogeneous or a nonhomogeneous Markov chain. If the pmf $p(x_k \mid y_{k-1})$ does not vary with k, then $Y_0^n$ is homogeneous; otherwise, it's nonhomogeneous.

We next derive an upper bound on the mutual information for any $(X_1^n, Y_1^n) \in \mathcal{P}(X_1^n, Y_1^n)$, which is needed for the proof.
$$\begin{aligned}
I(X_1^n; Y_1^n \mid Y_0) &= H(Y_1^n \mid Y_0) - H(Y_1^n \mid X_1^n, Y_0)\\
&\overset{(a)}{=} \sum_{k=1}^{n} H(Y_k \mid Y_0^{k-1}) - \sum_{k=1}^{n} H(Y_k \mid X_1^n, Y_0^{k-1})\\
&\overset{(b)}{=} \sum_{k=1}^{n} H(Y_k \mid Y_0^{k-1}) - \sum_{k=1}^{n} H(Y_k \mid X_k, Y_{k-1})\\
&\overset{(c)}{\le} \sum_{k=1}^{n} H(Y_k \mid Y_{k-1}) - \sum_{k=1}^{n} H(Y_k \mid X_k, Y_{k-1})\\
&= \sum_{k=1}^{n} I(X_k; Y_k \mid Y_{k-1}), \qquad (5)
\end{aligned}$$

where (a) is the chain rule; (b) follows from the channel property $p(y_k \mid x_1^n, y_0^{k-1}) = p(y_k \mid x_k, y_{k-1})$ for all k and all n > k; and (c) follows from the fact that increasing conditioning can only reduce entropy. Notice that the inequality in (c) becomes equality when $Y_0^n$ is a Markov chain. Therefore, it follows from Lemma 1 that, for any $(X_1^n, Y_1^n) \in \mathcal{P}^*(X_1^n, Y_1^n)$,

$$I(X_1^n; Y_1^n \mid Y_0) = \sum_{k=1}^{n} I(X_k; Y_k \mid Y_{k-1}). \qquad (6)$$

We now show that, for any $(X_1^n, Y_1^n) \in \mathcal{P}(X_1^n, Y_1^n)$, there is an $(\hat{X}_1^n, \hat{Y}_1^n) \in \mathcal{P}^*(X_1^n, Y_1^n)$ such that

$$I(\hat{X}_1^n; \hat{Y}_1^n \mid Y_0) \ge I(X_1^n; Y_1^n \mid Y_0). \qquad (7)$$

This assertion says that the maximum mutual information given $Y_0$ is attained inside $\mathcal{P}^*(X_1^n)$. Since $(X_1^n, Y_1^n)$ is a pair of channel (input, output) sequences under the initial state $Y_0$,

$$p_{X_1^n, Y_1^n \mid Y_0}(x_1^n, y_1^n \mid y_0) = \prod_{k=1}^{n} p_{X_k \mid X_1^{k-1}, Y_0^{k-1}}(x_k \mid x_1^{k-1}, y_0^{k-1})\, p_{Y_k \mid X_1^k, Y_0^{k-1}}(y_k \mid x_1^k, y_0^{k-1})$$
$$= \prod_{k=1}^{n} p_{X_k \mid X_1^{k-1}, Y_0^{k-1}}(x_k \mid x_1^{k-1}, y_0^{k-1})\, p_{Y_k \mid X_k, Y_{k-1}}(y_k \mid x_k, y_{k-1}), \qquad (8)$$

where the last equality follows from (1). We now construct a pair $(\hat{X}_1^n, \hat{Y}_1^n)$ distributed according to the pmf

$$p_{\hat{X}_1^n, \hat{Y}_1^n \mid \hat{Y}_0}(x_1^n, y_1^n \mid y_0) = \prod_{k=1}^{n} p_{\hat{X}_k \mid \hat{Y}_{k-1}}(x_k \mid y_{k-1})\, p_{\hat{Y}_k \mid \hat{X}_k, \hat{Y}_{k-1}}(y_k \mid x_k, y_{k-1}), \qquad (9)$$

where we set $P_{\hat{Y}_0}(\cdot)$ equal to $P_{Y_0}(\cdot)$, so that

$$p_{\hat{X}_k \mid \hat{Y}_{k-1}}(x_k \mid y_{k-1}) = p_{X_k \mid Y_{k-1}}(x_k \mid y_{k-1}), \quad \forall k \ge 1 \qquad (10)$$

and

$$p_{\hat{Y}_k \mid \hat{X}_k, \hat{Y}_{k-1}}(y_k \mid x_k, y_{k-1}) = p_{Y_k \mid X_k, Y_{k-1}}(y_k \mid x_k, y_{k-1}), \quad \forall k \ge 1. \qquad (11)$$

$\hat{Y}$ therein is just an alias for the random variable $Y$. Unlike $X_k$ in (8), $\hat{X}_k$ is restricted to be conditionally independent of $(\hat{X}_1^{k-1}, \hat{Y}_0^{k-2})$, given $\hat{Y}_{k-1}$. Thus, $\hat{X}_1^n \in \mathcal{P}^*(X_1^n)$. Equation (11), together with $\hat{Y}_0 = Y_0$, assures us that $\hat{Y}_1^n$ is indeed the output from our channel in response to input $\hat{X}_1^n$ and initial state $Y_0$; i.e., $(\hat{X}_1^n, \hat{Y}_1^n) \in \mathcal{P}^*(X_1^n, Y_1^n)$. It is obvious that the joint pmf of $(\hat{X}_1^n, \hat{Y}_1^n)$ is different from that of $(X_1^n, Y_1^n)$. However, we have the following lemma.
Lemma 2. Assume $(X_1^n, Y_1^n) \in \mathcal{P}(X_1^n, Y_1^n)$. Let $(\hat{X}_1^n, \hat{Y}_1^n)$ be defined as in (9) and let $\hat{Y}_0 = Y_0$. Then

$$p_{\hat{Y}_k, \hat{X}_k, \hat{Y}_{k-1}}(y_k, x_k, y_{k-1}) = p_{Y_k, X_k, Y_{k-1}}(y_k, x_k, y_{k-1}), \quad \forall k \ge 1. \qquad (12)$$

Proof: It follows from (10) and (11) that

$$p_{\hat{Y}_k, \hat{X}_k \mid \hat{Y}_{k-1}}(y_k, x_k \mid y_{k-1}) = p_{Y_k, X_k \mid Y_{k-1}}(y_k, x_k \mid y_{k-1}), \quad \forall k \ge 1. \qquad (13)$$

Since $\hat{Y}_0 = Y_0$, we may write

$$p_{\hat{Y}_1, \hat{X}_1, \hat{Y}_0}(y_1, x_1, y_0) = p_{\hat{Y}_1, \hat{X}_1 \mid \hat{Y}_0}(y_1, x_1 \mid y_0)\, p_{\hat{Y}_0}(y_0) = p_{Y_1, X_1 \mid Y_0}(y_1, x_1 \mid y_0)\, p_{Y_0}(y_0) = p_{Y_1, X_1, Y_0}(y_1, x_1, y_0). \qquad (14)$$

That is, the lemma statement (12) holds for k = 1, from which it follows by marginalization that

$$p_{\hat{Y}_1}(y_1) = p_{Y_1}(y_1).$$

The same arguments as in (14) now can be used to verify that (12) holds for k = 2. Repeating this argument for k = 3, and so on, establishes the desired result for all k ≥ 1.

Since $(\hat{X}_1^n, \hat{Y}_1^n) \in \mathcal{P}^*(X_1^n, Y_1^n)$, we know from (6) that

$$I(\hat{X}_1^n; \hat{Y}_1^n \mid \hat{Y}_0) = \sum_{k=1}^{n} I(\hat{X}_k; \hat{Y}_k \mid \hat{Y}_{k-1}).$$

Next, recall from (5) that $I(X_1^n; Y_1^n \mid Y_0) \le \sum_{k=1}^{n} I(X_k; Y_k \mid Y_{k-1})$, and observe from Lemma 2 that

$$I(\hat{X}_k; \hat{Y}_k \mid \hat{Y}_{k-1}) = I(X_k; Y_k \mid Y_{k-1}), \quad \forall k \ge 1, \qquad (15)$$

so

$$\sum_{k=1}^{n} I(\hat{X}_k; \hat{Y}_k \mid \hat{Y}_{k-1}) = \sum_{k=1}^{n} I(X_k; Y_k \mid Y_{k-1}). \qquad (16)$$

Therefore, $I(\hat{X}_1^n; \hat{Y}_1^n \mid \hat{Y}_0) \ge I(X_1^n; Y_1^n \mid Y_0)$, which is (7).


To show that the joint (input, output) process is Markovian when $X_1^n \in \mathcal{P}^*(X_1^n)$, we write

$$\begin{aligned}
\Pr(X_{k+1}, Y_{k+1} \mid X_1^k, Y_0^k) &= \Pr(Y_{k+1} \mid X_{k+1}, X_1^k, Y_0^k)\, \Pr(X_{k+1} \mid X_1^k, Y_0^k)\\
&\overset{(a)}{=} \Pr(Y_{k+1} \mid X_{k+1}, Y_k)\, \Pr(X_{k+1} \mid Y_k)\\
&= \Pr(Y_{k+1} \mid X_{k+1}, X_k, Y_k)\, \Pr(X_{k+1} \mid X_k, Y_k)\\
&= \Pr(X_{k+1}, Y_{k+1} \mid X_k, Y_k),
\end{aligned}$$

where (a) follows from (1) and the condition $X_1^n \in \mathcal{P}^*(X_1^n)$. Theorem 1 is proved.
For a broad class of constraints on the channel input and/or output, the maximum mutual information given $Y_0$ still is attained inside $\mathcal{P}^*(X_1^n)$. Specifically, for all constraints on expected values of functions of triples of the form $(Y_{k-1}, X_k, Y_k)$, imposed either as a schedule of such constraints versus k or as sums or arithmetic averages over k of functions of said triples, the constrained value of the maximum mutual information is attained by an input process whose distribution conforms to (3). To see this, for any $(X_1^n, Y_1^n) \in \mathcal{P}(X_1^n, Y_1^n)$ satisfying one or more constraints of this type, we construct $(\hat{X}_1^n, \hat{Y}_1^n)$ as in (9). By Lemma 2, $(X_1^n, Y_1^n)$ and $(\hat{X}_1^n, \hat{Y}_1^n)$ are such that for each fixed k, $(X_k, Y_k, Y_{k-1})$ and $(\hat{X}_k, \hat{Y}_k, \hat{Y}_{k-1})$ are identically distributed. This assures us that the expected value of any function of $(\hat{X}_k, \hat{Y}_k, \hat{Y}_{k-1})$ is the same as that for $(X_k, Y_k, Y_{k-1})$, so $(\hat{X}_1^n, \hat{Y}_1^n)$ also is admissible for the same values of the constraints. Energy constraints on the inputs and outputs are a special case. Thus, processes which communicate information among the brain's neurons in an energy-efficient manner will exhibit the Markovian properties cited in Theorem 1, at least to the degree of accuracy to which the model of Figure 1 reflects reality.
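As a sanity check on Theorem 1, the following sketch simulates a deliberately tiny finite-state channel (binary alphabets, arbitrary assumed transition probabilities; nothing here is taken from the lecture) driven by an input law of the form (3), and then verifies empirically that the output process is first-order Markov.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)

# Toy finite-state channel: binary input X_k, binary output Y_k, state Y_{k-1}.
# These conditional laws are assumptions chosen only for illustration.
p_spike = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.9}   # P(Y_k=1 | X_k, Y_{k-1})
p_input = {0: 0.3, 1: 0.8}                                       # P(X_k=1 | Y_{k-1}), a member of P*

n = 400_000
Y = [0]
for k in range(1, n + 1):
    state = Y[-1]
    x = int(rng.random() < p_input[state])
    Y.append(int(rng.random() < p_spike[(x, state)]))

# Theorem 1, part 1: P(Y_k=1 | Y_{k-1}, Y_{k-2}) should depend on Y_{k-1} only,
# so rows below that share the same Y_{k-1} should nearly coincide.
counts = defaultdict(lambda: [0, 0])
for k in range(2, n + 1):
    counts[(Y[k - 2], Y[k - 1])][Y[k]] += 1
for (y2, y1), (n0, n1) in sorted(counts.items()):
    print(f"P(Y_k=1 | Y_k-1={y1}, Y_k-2={y2}) ≈ {n1 / (n0 + n1):.3f}")
```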

11 Review of Cost-Capacity and Rate-Distortion

In preparation for addressing the topic of decoding, or the lack thereof, let us recall Shannon's formulas characterizing the probability distributions that solve the variational problems for calculating capacity-cost functions of channels and rate-distortion functions of sources.

The cost-capacity variational problem is defined as follows. We are given the transition probabilities {p(y|x), (x, y) ∈ X × Y} of a discrete memoryless channel (dmc) and a set of nonnegative numbers {c(x) ≥ 0, x ∈ X}, where c(x) is the cost incurred each time the symbol x is inserted into the channel. We seek the probability distribution {p(x), x ∈ X} that maximizes the mutual information subject to the constraint that the average input cost does not exceed S. We denote this maximum by

$$C(S) := \max_{\{p(x)\} \in \mathcal{S}} \sum_{x} \sum_{y} p(x)\, p(y|x)\, \log\bigl(p(y|x)/\tilde{p}(y)\bigr), \qquad (17)$$

where $\mathcal{S} = \{\{p(x)\} : \sum_x p(x)\, c(x) \le S\}$ and

$$\tilde{p}(y) = \sum_{x} p(x)\, p(y|x). \qquad (18)$$

C(S) is a concave function that usually satisfies C(0) = 0 and lim_{S→∞} C(S) = C. The constant C, called either the unconstrained capacity or simply the capacity of the channel, is finite if X and/or Y have finite cardinality but may be infinite otherwise.
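The text defines C(S) only variationally. For readers who want to see numbers, the classical Blahut-Arimoto iteration sketched below is one standard way of evaluating the unconstrained capacity C of a small dmc; the channel matrix is an assumed toy example, and a cost constraint as in (17) could be folded in via a Lagrange-multiplier weight on the update.

```python
import numpy as np

# Assumed toy channel matrix p(y|x): rows indexed by x, columns by y.
P = np.array([[0.80, 0.10, 0.10],
              [0.10, 0.80, 0.10],
              [0.25, 0.25, 0.50]])

p_x = np.full(P.shape[0], 1.0 / P.shape[0])     # start from the uniform input distribution
for _ in range(1000):
    q_y = p_x @ P                                        # output distribution p~(y)
    kl = np.sum(P * np.log2(P / q_y[None, :]), axis=1)   # D(p(.|x) || p~) for each x, in bits
    w = p_x * np.exp2(kl)                                # Blahut-Arimoto reweighting
    p_x = w / w.sum()

q_y = p_x @ P
kl = np.sum(P * np.log2(P / q_y[None, :]), axis=1)
print("capacity-achieving input:", np.round(p_x, 4))
print(f"capacity ≈ {np.sum(p_x * kl):.4f} bits/channel use")
```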
The rate-distortion variational problem is defined as follows. We are given the letter probabilities {p(u), u ∈ U} of a discrete memoryless source (dms) and a set of nonnegative numbers {d(u, v) ≥ 0, (u, v) ∈ U × V}. Here, d(u, v) measures the distortion that occurs whenever the dms produces the letter u ∈ U and the communication system delivers to a user located at its output the letter v ∈ V as its approximation of said u. The alphabets U and V may or may not be identical. In fact, the appropriate V and distortion measure vary from user to user. Alternatively, and more apropos of application to a living organism, they vary over the different uses a single organism has for the information. In what follows we therefore speak of (source,use)-pairs instead of the usual terminology of (source,user)-pairs. In rate-distortion theory we seek the transition probability assignment {q(v|u), (u, v) ∈ U × V} that minimizes the average mutual information subject to the constraint that the average distortion does not exceed D. We denote this minimum by

$$R(D) := \min_{\{q(v|u)\} \in \mathcal{D}} \sum_{u} \sum_{v} p(u)\, q(v|u)\, \log\bigl(q(v|u)/q(v)\bigr), \qquad (19)$$

where $\mathcal{D} = \{\{q(v|u)\} : \sum_u \sum_v p(u)\, q(v|u)\, d(u, v) \le D\}$ and

$$q(v) = \sum_{u} p(u)\, q(v|u). \qquad (20)$$

Viewed as a function of D, R(D) is called the rate-distortion function. It is convex on the range [D_min, D_max], where D_min = Σ_u p(u) min_v d(u, v) and D_max = min_v Σ_u p(u) d(u, v). R(D) = 0 for D ≥ D_max and is undefined for D < D_min. R(D_min) equals the source entropy H = -Σ_u p(u) log p(u) if for each u ∈ U there is a unique v ∈ V, call it v(u), that minimizes d(u, v) and v(u) ≠ v(u') if u ≠ u'; otherwise, R(D_min) < H.
In each of these variational problems, Lagrange optimization yields a necessary condition that relates the extremizing distribution to the constraint function. For the cost-capacity problem this condition, displayed as an expression for c(x) in terms of the given channel transition probabilities and the information-maximizing {p(x)}, reads

$$c(x) = \gamma \sum_{y} p(y|x)\, \log\bigl(p(y|x)/\tilde{p}(y)\bigr) + c_0, \qquad (21)$$

where p̃(y) is given linearly in terms of said {p(x)} by (18). The constant c_0 represents a fixed cost per channel use that simply translates the C(S) curve vertically, so no loss in generality results from setting c_0 = 0. The constant γ is interchangeable with the choice of the logarithm base, so we may set γ = 1, again without any essential loss of generality. It follows that in order for optimality to prevail the cost function must equal the Kullback-Leibler distance between the conditional distribution {p(y|x), y ∈ Y} of the channel's output when the channel input equals letter x and the unconditional output distribution {p̃(y), y ∈ Y} obtained by averaging as in (18) over the information-maximizing {p(x), x ∈ X}. Since the expectation over X of the K-L distance between {p(y|X)} and {p̃(y)} is the mutual information between the channel's input and output, we see that applying the constraint Σ_x p(x) c(x) ≤ S is equivalent, when the C(S)-achieving input distribution is in force, to maximizing the average mutual information subject to the average mutual information not exceeding a specified amount, call it I. Obviously, this results in C(I) = I, a capacity-cost function that is simply a straight line at 45°.

Footnote 10: The K-L distance, or relative entropy, of two probability distributions {p(w), w ∈ W} and {q(w), w ∈ W} is given by D(p||q) := Σ_w p(w) log(p(w)/q(w)). It is nonnegative and equals 0 if and only if q(w) = p(w) for all w ∈ W.
This perhaps confusing state of affairs requires further explanation, since information theorists are justifiably not accustomed to the capacity-cost curve being a straight line. Since studying a well-known example often sheds light on the general case, let us consider again Shannon's famous formula C(S) = (1/2) log(1 + S/N) for the capacity-cost function of a time-discrete, average-power-limited memoryless AWGN channel; clearly, it is a strictly concave function of S. Of course, in this formula S is a constraint on the average power expended to transmit information across the channel, not on the average mutual information between the channel's input and output. Next, recall that for this AWGN channel, whose transition probabilities are given by p(y|x) = exp(-(y-x)^2/(2N)) / √(2πN), the optimum power-constrained input distribution is known to be Gaussian, namely p(x) = exp(-x^2/(2S)) / √(2πS). When D(p(y|x)||p~(y)) is evaluated in the case of this optimum input, it indeed turns out to be proportional to x^2 plus a constant. Hence, there is no difference in this example between considering the constraint to be imposed on the expected value of X^2 or considering it to be imposed on the expected value of D(p(y|X)||p~(y)). The physical significance of an average power constraint is evident, but what, if anything, is the physical meaning of an average K-L distance constraint? First observe that, if for some x it were to be the case that {p(y|x), y ∈ Y} is the same as {p~(y), y ∈ Y}, then one would be utterly unable to distinguish on the basis of the channel output between transmission and non-transmission of the symbol x. Little if any resources would need to be expended to build and operate a channel in which {p(y|x), y ∈ Y} does not change with x, since the output of such a channel is independent of its input. In order to assure that the conditional distributions on the channel output space given various input values are well separated from one another, resources must be expended. We have seen that in the case of an AWGN channel the ability to perform such discriminations is constrained by the average transmission power available; in a non-Gaussian world, the physically operative quantity to constrain likely would be something other than power. In this light I believe (21) is telling us, among other things, that if one is not sure a priori to what use(s) the information conveyed through the channel output will be put, one should adopt the viewpoint that the task is to keep the various inputs as easy to discriminate from one another as possible, subject to whatever physical constraint(s) are in force. We adopt this viewpoint below in our treatment of the decoding problem.
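The claim that D(p(y|x)||p~(y)) is affine in x^2 for the Gaussian-input AWGN channel is easy to check numerically. The sketch below is purely illustrative and is not part of the lecture; the power S, noise variance N, the grids, and the Python/NumPy realization are all assumptions made only for the check. It fits a line to the K-L distance as a function of x^2 and then averages the K-L distance over the Gaussian input, which should recover C(S) = (1/2) log(1 + S/N) nats.

```python
# Numerical check of condition (21) for the average-power-limited AWGN channel:
# under the capacity-achieving Gaussian input, D(p(y|x) || p~(y)) is affine in
# x^2, and its average over p(x) equals C(S) = (1/2) log(1 + S/N) nats.
# S, N, and the grid sizes are illustrative choices, not values from the text.
import numpy as np

S, N = 2.0, 1.0                          # input power constraint and noise variance
y = np.linspace(-30.0, 30.0, 8001)       # grid over the output alphabet
dy = y[1] - y[0]

def kl_given_x(x):
    """D(p(y|x) || p~(y)) in nats, by numerical integration over the y grid."""
    log_p_cond = -(y - x) ** 2 / (2 * N) - 0.5 * np.log(2 * np.pi * N)
    log_p_out = -(y ** 2) / (2 * (S + N)) - 0.5 * np.log(2 * np.pi * (S + N))
    return np.sum(np.exp(log_p_cond) * (log_p_cond - log_p_out)) * dy

# The K-L distance is (numerically) an affine function of x^2 ...
xs = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
slope, intercept = np.polyfit(xs ** 2, [kl_given_x(x) for x in xs], 1)
print("slope    :", slope, " (theory: 1/(2(S+N)) =", 1 / (2 * (S + N)), ")")
print("intercept:", intercept)

# ... and averaging it over the Gaussian input p(x) recovers the capacity C(S).
xg = np.linspace(-12.0, 12.0, 2001)
dx = xg[1] - xg[0]
p_in = np.exp(-xg ** 2 / (2 * S)) / np.sqrt(2 * np.pi * S)
avg_kl = np.sum(p_in * np.array([kl_given_x(xi) for xi in xg])) * dx
print("average K-L:", avg_kl, " C(S) =", 0.5 * np.log(1 + S / N))
```

The printed slope should be close to 1/(2(S+N)) and the printed average close to C(S), in agreement with the affine-in-x^2 behavior described above.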
For the rate-distortion problem Shannon [4] observed that Lagrange minimization over {q(v|u)} leads to the necessary condition

    q_s(v|u) = λ_s(u) q_s(v) exp(s d(u,v)),          (22)

where s ∈ [-∞, 0] is a parameter that equals the slope R'(D_s) of the rate-distortion function at the point (D_s, R(D_s)) that it generates, and {q_s(v), v ∈ V} is an appropriately selected probability distribution over the space V of source approximations. Since q_s(v|u) must sum to 1 over v for each fixed u ∈ U, we have

    λ_s(u) = [ Σ_v q_s(v) exp(s d(u,v)) ]^(-1).          (23)

Footnote 11: This does happen for an AWGN channel for all S << N_0 W, and hence for all practical values of S in the case of a time-continuous AWGN channel of extremely broad bandwidth.

Except in some special but important examples, given a parameter value s it is difficult to find the optimum {q_s(v)}, and hence the optimum {q_s(v|u)} from (22). We can recast (22) as an expression for d(u,v) in terms of q_s(v|u), namely

    d(u,v) = -(1/|s|) log( q_s(v|u) / q_s(v) ) + (1/|s|) log λ_s(u).          (24)

Since v does not appear in the second term on the right-hand side of (24), that term reflects only indirectly on the way in which a system that is end-to-end optimum, in the sense of achieving the point (D_s, R(D_s)) on the rate-distortion function, probabilistically reconstructs the source letter u as the various letters v ∈ V. Recalling that log(q_s(v|u)/q_s(v)) is the mutual information i_s(u,v) between symbols u and v for the end-to-end (i.e., source-to-use) optimum system, we see that the right way to build said system is to make the mutual information between the letters of pairs (u,v) decrease as the distortion between them increases. Averaging over the joint distribution p(u)q_s(v|u) that is optimum for parameter value s affirms the inverse relation between average distortion and average mutual information that pertains in rate-distortion theory. This relationship is analogous to the directly-varying relation between average cost and average mutual information in the channel variational problem.
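Footnote 12 below notes that recursive algorithms of Blahut [19] and Rose [20] compute rate-distortion functions numerically. The following sketch is mine, not the lecture's: for a fixed slope parameter s < 0 it alternates between (22) and re-averaging to update {q_s(v)}, then reports the (average distortion, mutual information) pair that this {q_s(v)} generates. The binary-source/Hamming-distortion test case and all parameter values are illustrative assumptions.

```python
# Blahut-style alternating minimization for a fixed slope parameter s < 0,
# iterating (22) and (23): q_s(v|u) = lambda_s(u) q_s(v) exp(s d(u,v)) with
# lambda_s(u) the normalizer, then q_s(v) = sum_u p(u) q_s(v|u).
import numpy as np

def rd_point(p_u, d, s, iters=1000):
    """p_u: source distribution over U; d: |U| x |V| distortion matrix; s < 0."""
    nV = d.shape[1]
    q_v = np.full(nV, 1.0 / nV)               # initial output distribution q_s(v)
    A = np.exp(s * d)                          # A[u, v] = exp(s d(u, v))
    for _ in range(iters):
        lam = 1.0 / (A @ q_v)                  # lambda_s(u), as in (23)
        q_vu = lam[:, None] * q_v[None, :] * A # q_s(v|u), as in (22)
        q_v = p_u @ q_vu                       # re-average to get the new q_s(v)
    D = np.sum(p_u[:, None] * q_vu * d)        # average distortion at slope s
    R = np.sum(p_u[:, None] * q_vu *
               np.log(q_vu / q_v[None, :]))    # mutual information, in nats
    return D, R

# Illustrative test: equiprobable binary source with Hamming distortion,
# for which R(D) = ln 2 - h(D) nats.
p_u = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
D, R = rd_point(p_u, d, s=-2.0)
h = lambda x: -x * np.log(x) - (1 - x) * np.log(1 - x)
print("D =", D, " R =", R, " check ln2 - h(D) =", np.log(2) - h(D))
```

For this test case the printed R should agree with ln 2 - h(D), confirming that the fixed point of the iteration lies on the rate-distortion curve at slope s.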

12 Low-Latency Multiuse Decoding

For purposes of the present exposition, a key consequence of the preceding paragraph is that it helps explain how a channel p(y|x) can be matched to many (source,use)-pairs at once. Specifically, under our definition channel p(y|x) is matched to source p(u) and distortion measure d(u,v) at slope s on their rate-distortion function if, and only if, there exists a pair of conditional probability distributions {r(x|u)} and {w(v|y)} such that the optimum end-to-end system transition probabilities {q_s(v|u)} in the rate-distortion problem can be written in the form

    q_s(v|u) = Σ_x Σ_y r(x|u) p(y|x) w(v|y).          (25)

It should be clear that (25) often can hold for many (source,use) pairs that are of interest to an organism. In such instances it will be significantly more efficient computationally for the organism to share the p(y|x) part of the construction of the desired transition probability assignments for these (source,use) pairs rather than, in effect, to build and then operate in parallel a separate version of it for each of said applications. This will be all the more so the case if it is not known, until after p(y|x) has been exercised, just which potential uses appear to be intriguing enough to bother computing their w(v|y)-parts and which do not.
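To make the computational sharing concrete, here is a toy numerical illustration; the stochastic matrices are invented for the example and are not taken from the lecture. Writing r(x|u), p(y|x), and two use-specific decoders w_1(v|y) and w_2(v|y) as row-stochastic matrices turns (25) into a matrix product in which the channel part R P is computed once and then reused by every use.

```python
# Toy illustration of the factorization in (25): the end-to-end transition
# matrix Q = R P W is the product of an encoder R = {r(x|u)}, a shared channel
# P = {p(y|x)}, and a use-specific decoder W = {w(v|y)}.  Two hypothetical uses
# share the factor R P and differ only in their W-parts.
import numpy as np

R = np.array([[0.9, 0.1],            # r(x|u): rows indexed by u, columns by x
              [0.2, 0.8]])
P = np.array([[0.8, 0.1, 0.1],       # p(y|x): rows indexed by x, columns by y
              [0.1, 0.1, 0.8]])
W1 = np.array([[1.0, 0.0],           # w_1(v|y): a binary "hard decision" use
               [0.5, 0.5],
               [0.0, 1.0]])
W2 = np.array([[0.7, 0.2, 0.1],      # w_2(v|y): a different, three-valued use
               [0.3, 0.4, 0.3],
               [0.1, 0.2, 0.7]])

shared = R @ P                       # computed once, reused by every use
Q1 = shared @ W1                     # end-to-end q_1(v|u) for use 1
Q2 = shared @ W2                     # end-to-end q_2(v|u) for use 2
print(Q1, Q1.sum(axis=1))            # rows sum to 1: valid transition matrices
print(Q2, Q2.sum(axis=1))
```

Each use pays only for its own W-part; the shared R P factor is what would have to be computed regardless of which uses are eventually explored.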
I conjecture that the previous paragraph has much to say about why neural coalitions are constructed and interconnected the way they are. Namely, the coalitionwise transition probabilities effected are common to numerous potential applications, only some of which actually get explored. The situation is sketched schematically in Figure 2, from which the reader can see how a given neural coalition, viewed as a channel, might both be matched by the source that drives it and at the same time could help match that source to many potential uses via interconnection with other coalitions and subcoalitions.

Whether or not use i, associated with some i-th neural subcoalition described by transition probabilities {w_{s,i}(v|y), v ∈ V_i}, gets actively explored at a given instant depends on what side information, in addition to part of {Y_t}, gets presented to it. That side information depends, in turn, on Bayesian-style prior probabilities that are continually being recursively updated as the bottom-up and top-down processing of data from stimuli proceeds. When said side information is relatively inhibitory rather than excitatory, the subregion does not "ramp up." Then energy is saved but of course less information is conveyed.

Footnote 12: Kuhn-Tucker theory tells us that the necessary and sufficient condition for {q_s(v)} to generate, via (22), a point on the rate-distortion curve at which the slope is s is c_s(v) := Σ_u λ_s(u) p(u) exp(s d(u,v)) ≤ 1 for all v, where λ_s(u) is given by equation (23) and equality prevails for every v for which q_s(v) > 0. Recursive algorithms developed by Blahut [19] and by Rose [20] allow rate-distortion functions to be calculated numerically with great accuracy at moderate computational intensity.

Footnote 13: Equations (21) and (24) perhaps first appeared together in the paper by Gastpar et al. [18]. Motivated by exposure to my Examples 1 and 2, they derived conditions for double matching of more general sources and channels, confining attention to the special case of deterministic p(x|u) and p(v|y).

ACKNOWLEDGEMENT
The author is indebted to Professors William B. "Chip" Levy and James W. Mandell of the University of Virginia Medical School's neurosurgery and neuropathology departments, respectively, for frequent, lengthy, and far-ranging discussions of neuroscience over the past seven years. In his role as my neuroscience mentor, Chip has graciously enlightened me regarding his own and others' theoretical approaches to computational neuroscience. This presentation is in substantial measure a reflection of Chip Levy's unrelenting pursuit of neuroscientific theory rooted in biological reality. This work was supported in part by NIH Grant RR15205, for which Professor Levy is the Principal Investigator.

Footnote 14: Recursive estimation is the name of the game in sensory signal processing by neurons. A Kalman formalism [21] is appropriate for this, subject to certain provisos. One is that it seems best to adopt the conditional error entropy minimization criterion [22] [23] [24] as opposed to, say, a minimum MSE criterion; this is in keeping with our view that an information criterion should be retained for as long as possible before specializing to a more physical criterion associated with a particular use. Another is that the full Kalman solution requires inverting matrices of the form I + MM^T in order to update the conditional covariance matrix. Matrix inversion is not believed to be in the repertoire of mathematical operations readily amenable to realization via neurons unless the effective rank of the matrix M is quite low.

Footnote 15: We remark that neurons are remarkably sensitive in this respect. They idle at a mean PSP level that is one or two standard deviations below their spiking threshold. In this mode they spike only occasionally, when the random fluctuations in their effectively Poisson synaptic bombardment happen to bunch together in time to build the PSP up above threshold. However, a small percentage change in the bombardment (e.g., a slightly increased overall intensity of bombardment and/or a shift toward a higher excitatory-to-inhibitory ratio of the synapses being bombarded) can significantly increase the spiking frequency. Given the predominantly positive feedback among the members of a coalition, many of its members can be made to ramp up their spiking intensities nearly simultaneously. This helps explain why coalitions of neural cortex exhibit dramatic variations, on a time scale of several tens of milliseconds, in their rates of spiking and hence in their rates of information transmission and of energy depletion.

References

[1] C. E. Shannon, A Mathematical Theory of Communication. Bell Syst. Tech. J., vol. 27, 379-423, 623-656, July and October, 1948. (Also in Claude Elwood Shannon: Collected Papers, N. J. A. Sloane and A. D. Wyner, eds., IEEE Press, Piscataway, NJ, 1993, 5-83.)

[2] C. E. Shannon, Channels with Side Information at the Transmitter. IBM J. Research and Development, 2, 289-293, 1958.


[3] C. E. Shannon, An Algebra for Theoretical Genetics, Ph.D. Dissertation, Department of Mathematics, MIT, Cambridge, Massachusetts, April 15, 1940.

[4] C. E. Shannon, Coding Theorems for a Discrete Source with a Fidelity Criterion. IRE Convention Record, Vol. 7, 142-163, 1959. (Also in Information and Decision Processes, R. E. Machol, ed., McGraw-Hill, Inc., New York, 1960, 93-126, and in Claude Elwood Shannon: Collected Papers, N. J. A. Sloane and A. D. Wyner, eds., IEEE Press, Piscataway, NJ, 1993, 325-350.)

[5] H. B. Barlow, Sensory mechanisms, the reduction of redundancy, and intelligence. In: The Mechanization of Thought Processes, National Physical Laboratory Symposium No. 10, Her Majesty's Stationery Office, London, 537-559, 1958.

[6] D. H. Hubel and T. N. Wiesel, Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex. J. Physiol., 160, 106-154, 1962.

[7] T. Berger and Y. Ying, Characterizing Optimum (Input, Output) Processes for Finite-State Channels with Feedback, submitted to ISIT 2003, Yokohama, Japan, June-July, 2003.

[8] C. B. Woods and J. H. Krantz, Lecture notes at Lemoyne University based on Chapter 9 of Human Sensory Perception: A Window into the Brain, 2001.

[9] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice-Hall, Englewood Cliffs, NJ, 1971.

[10] F. Jelinek, Probabilistic Information Theory, McGraw-Hill, New York, 1968.

[11] R. G. Gallager, Information Theory and Reliable Communication, Wiley, New York, 1968.

[12] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley, New York, 1991.

[13] W. B. Levy and R. A. Baxter, Energy Efficient Neural Codes. Neural Comp., 531-543. (Also in: Neural Codes and Distributed Representations, L. Abbott and T. J. Sejnowski, Eds., MIT Press, Cambridge, MA, 105-117, 1999.)

[14] W. B. Levy and R. A. Baxter, Energy-Efficient Neuronal Computation Via Quantal Synaptic Failures. J. Neuroscience, 22, 4746-4755, 2002.

[15] H. K. Hartline, H. G. Wagner, and F. Ratliff, Inhibition in the eyes of Limulus. J. Gen. Physiol., 39, 651-673.

[16] A. H. Burkhalter, Cortical Circuits for Bottom-Up and Top-Down Processing. Presented in Symposium 3, Cortical Feedback in the Visual System, Paper 3.2, Neuroscience 2002 - Society for Neuroscience 32nd Annual Meeting, Orlando, FL, November 2-7, 2002.

[17] J. Bullier, The Role of Feedback Cortical Connections: Spatial and Temporal Aspects. Presented in Symposium 3, Cortical Feedback in the Visual System, Paper 3.3, Neuroscience 2002 - Society for Neuroscience 32nd Annual Meeting, Orlando, FL, November 2-7, 2002.

[18] M. Gastpar, B. Rimoldi and M. Vetterli, To Code or Not to Code. Preprint, EPFL, Lausanne, Switzerland, 2001.

[19] R. E. Blahut, Computation of Channel Capacity and Rate-Distortion Functions. IEEE Trans. Inform. Theory, 18, 460-473, 1972.

[20] K. Rose, A Mapping Approach to Rate-Distortion Computation and Analysis. IEEE Trans. Inform. Theory, 42, 1939-1952, 1996.

[21] T. Kailath, Lectures on Wiener and Kalman Filtering, CISM Courses and Lectures No. 140, Springer-Verlag, Vienna and New York, 1981.

[22] H. L. Weidemann and E. B. Stear, Entropy analysis of parameter estimation. Info. and Control, 14, 493-506, 1969.

[23] N. Minamide and P. N. Nikiforuk, Conditional entropy theorem for recursive parameter estimation and its application to state estimation problems. Int. J. Systems Sci., 24, 53-63, 1993.

[24] M. Janzura and T. Koski, Minimum entropy of error principle in estimation. Info. Sciences, 79, 123-144, 1994.

[25] A. Buonocore, V. Giorno, A. G. Nobile and L. M. Ricciardi, A Neural Modeling Paradigm in the Presence of Refractoriness. BioSystems, 67, 35-43, 2002.

[26] T. Berger, Interspike Interval Analysis via a PDE. Preprint, Electrical and Computer Engineering Department, Cornell University, Ithaca, New York, 2002.
