The title, "Living Information Theory," is a triple entendre. First and foremost, it pertains to the information theory of living systems. Second, it symbolizes the fact that our research community has been living information theory for more than five decades, enthralled with the beauty of the subject and intrigued by its many areas of application and potential application. Lastly, it is intended to connote that information theory is decidedly alive, despite sporadic protestations to the contrary. Moreover, there is a thread that ties together all three of these meanings for me. That thread is my strong belief that one way in which information theorists, both new and seasoned, can assure that their subject will remain vitally alive deep into the future is to embrace enthusiastically its applications to the life sciences.
In the 1950's and early 1960's a cadre of scientists and engineers were adherents of the premise that information theory could serve as a calculus for living systems. That is, they believed information theory could be used to build a solid mathematical foundation for biology, which always had occupied a peculiar middle ground between the hard and the soft sciences. International meetings were organized by Colin Cherry and others to explore this frontier, but by the mid-1960's the effort had dissipated. This may have been due in part to none other than Claude Shannon himself, who in his guest editorial, The Bandwagon, in the March 1956 issue of the IRE Transactions on Information Theory stated:
Information theory has ... perhaps ballooned to an importance beyond its actual accomplishments. Our fellow scientists in many different fields, attracted by the fanfare and by the new avenues opened to scientific analysis, are using these ideas in ... biology, psychology, linguistics, fundamental physics, economics, the theory of the organization, ... Although this wave of popularity is certainly pleasant and exciting for those of us working in the field, it carries at the same time an element of danger. While we feel that information theory is indeed a valuable tool in providing fundamental insights into the nature of communication problems and will continue to grow in importance, it is certainly no panacea for the communication engineer or, a fortiori, for anyone else. Seldom do more than a few of nature's secrets give way at one time.
More devastating was Peter Elias's scathing 1958 editorial in the same journal, Two Famous Papers, which in part read:

The first paper has the generic title Information Theory, Photosynthesis and Religion... written by an engineer or physicist ... I suggest we stop writing [it], and release a supply of man power to work on ... important problems which need investigation.
The demise of the nascent community that was endeavoring to inject information theory into mainstream biology probably was occasioned less by these "purist" information theory editorials than by the relatively primitive state of quantitative biology at the time. Note in this regard that:

1. The structure of DNA was not determined by Crick and Watson until five years after Shannon published A Mathematical Theory of Communication.

2. It was not possible to measure even a single neural pulse train with millisecond accuracy; contrastingly, today it is possible simultaneously to record accurately in vivo the pulse trains of many neighboring neurons as an aid to developing an information theory of real neural nets.

3. It was not possible to measure time variations in the concentrations of chemicals at sub-millisecond speeds in volumes of submicron dimensions such as those which constitute ion channels in neurons. This remains a stumbling block, but measurement techniques capitalizing on fluorescence and other phenomena are steadily progressing toward this goal.
We offer arguments below to support the premise that matters have progressed to a stage at which biology is positioned to profit meaningfully from an invasion by information theorists. Indeed, during the past decade some biologists have equipped themselves with more than a surface knowledge of information theory and are applying it correctly and fruitfully to selected biological subdisciplines, notable among which are genomics and neuroscience. Since our interest here is in the information theory of sensory perception, we will discuss neuroscience and eschew genomics.
At a fundamental level information in a living organism is instantiated in the time variations of the concentrations of chemical and electrochemical species (ions, molecules and compounds) in the compartments that comprise the organism. Chemical thermodynamics and statistical mechanics tell us that these concentrations are always tending toward a multiphase equilibrium characterized by minimization of the Helmholtz free energy functional. On the other hand, complete equilibrium with the environment never is attained, both because the environment constantly changes and because the organism must exhibit homeostasis in order to remain "alive". A fascinating dynamic prevails in which the organism sacrifices internal energy in order to reduce its uncertainty about the environment, which in turn permits it to locate new sources of energy and find mates with whom to perpetuate the species. This is one of several considerations that strongly militate in favor of looking at an information gain by a living system never in absolute terms but rather always relative to the energy expended to achieve it.
There is, in addition, an intriguing mathematical analogy between the equations that govern multiphase equilibrium in chemical thermodynamics and those which specify points on Shannon's rate-distortion function of an information source with respect to a fidelity criterion [9]. This analogy is not in this writer's opinion just a mathematical curiosity but rather is central to fruitfully "bringing information theory to life." We shall not be exploring this analogy further here, however. This is because, although it provides an overarching theoretical framework, it operates on a level which does not readily lead to concrete results apropos our goal of developing an information-theoretically based formulation of sensory perception.
An information theorist venturing into new territory must treat that territory with respect. In particular, one should not assume that, just because the basic concepts and methods developed by Shannon and his disciples have proved so effective in describing the key features of man-made communication systems, they can be applied en masse to render explicable the long-standing mysteries of another discipline. Rather, one must think critically about information-theoretic concepts and methods and then apply only those that genuinely transfer to the new territory. My endeavors in this connection to date have led me to the following two beliefs:
Judicious application of Shannon's fundamental concepts of entropy, mutual information, channel capacity and rate-distortion is crucial to gaining an elevated understanding of how living systems handle sensory information.

Living systems have little if any need for the elegant block and convolutional coding theorems and techniques of information theory because, as will be explained below, organisms have found ways to perform their information handling tasks in an effectively Shannon-optimum manner without having to employ coding in the information-theoretic sense of the term.
Is it necessary to learn chemistry, biochemistry, biophysics, neuroscience, and such before one can make any useful contributions? The answer, I feel, is "Yes, but not deeply." The object is not to get to the point where you can think like a biologist. The object is to get to the point where you can think like the biology. The biology has had hundreds of millions of years to evolve via natural selection to a point at which much of that which it does is done in a nearly optimum fashion. Hence, thinking about how the biology should do things is often effectively identical to thinking about how the biology does do things, and is perhaps even a more fruitful endeavor.
(Footnote: Elwyn Berlekamp related at IEEE ISIT 2001 in Washington, DC, a conversation he had with Claude Shannon in an MIT hallway in the 1960's, the gist of which was:
CES: Where are you going, Elwyn?
EB: To the library to study articles, including some of yours.
CES: Oh, don't do that. You'd be better off to just figure it out for yourself.)

Information theorists are fond of figuring out how best to transmit information over a "given" channel. When trespassing on biological turf, however, an information theorist must abandon the tenet that the channel is given. Quite to the contrary, nature has evolved the channels that function within organisms in response to needs for specific information residing either in the environment or in the organism itself - channels for sight, channels for sound, for olfaction, for touch, for taste, for blood alcohol and osmolality regulation, and so on. Common sense strongly suggests that biological structures built to sense and transfer information from certain sites located either outside or inside the organism to other such sites will be efficiently "matched" to the data sources they service. Indeed, it would be ill-advised to expect otherwise, since natural selection rarely chooses foolishly, especially in the long run. The compelling hypothesis, at least from my perspective, is that all biological channels are well matched to the information sources that feed them.
Matching a channel to a source has a precise mathematical meaning in information theory. Let us consider the simplest case of a discrete memoryless source (dms) with instantaneous letter probabilities {p(u), u ∈ U} and a discrete memoryless channel (dmc) with instantaneous transition probabilities {p(y|x), x ∈ X, y ∈ Y}. Furthermore, let us suppose that the channel's purpose is to deliver a signal {Y} to its output terminal, on the basis of which one could construct an approximation {V} to the source data {U} that is accurate enough for satisfactory performance in some application of interest. Following Shannon, we shall measure said accuracy by means of a distortion measure d : U × V → [0, ∞). {V} will be considered to be a sufficiently accurate approximation of {U} if and only if the average distortion does not exceed a level deemed to be tolerable, which we shall denote by D. Stated mathematically, our requirement for an approximation to be sufficiently accurate is
$$\lim_{n\to\infty} E\left[\frac{1}{n}\sum_{k=1}^{n} d(U_k, V_k)\right] \le D.$$
In order for the dmc {p(y|x)} to be instantaneously matched to the combination of the dms {p(u)} and the distortion measure {d(u,v)} at fidelity D, the following requirements must be satisfied:

1. The number of source letters produced per second must equal the number of times per second that the channel is available for use.

2. There must exist two transition probability matrices {r(x|u), u ∈ U, x ∈ X} and {w(v|y), y ∈ Y, v ∈ V} such that the end-to-end transition probabilities
$$q(v|u) := \sum_{x\in X}\sum_{y\in Y} r(x|u)\, p(y|x)\, w(v|y), \qquad (u,v) \in U \times V,$$
solve the variational problem that defines the point (D, R(D)) on Shannon's rate-distortion function of the dms {p(u)} with respect to the distortion measure {d(u,v)}; a small computational illustration of this composition follows the list.
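For readers who want to see requirement 2 concretely, here is a minimal sketch of the composition q(v|u) = Σ_x Σ_y r(x|u) p(y|x) w(v|y). The alphabets, the particular matrices, and the function name are illustrative assumptions of mine, not anything prescribed by the theory.

```python
import numpy as np

def end_to_end(r, p, w):
    """Compose q(v|u) = sum_{x,y} r(x|u) p(y|x) w(v|y).

    r: |U| x |X| row-stochastic matrix, r[u, x] = r(x|u)
    p: |X| x |Y| row-stochastic matrix, p[x, y] = p(y|x)
    w: |Y| x |V| row-stochastic matrix, w[y, v] = w(v|y)
    Returns the |U| x |V| matrix q[u, v] = q(v|u).
    """
    return r @ p @ w

# Toy numbers (hypothetical): binary alphabets, a BSC with crossover 0.1,
# and identity pre-/post-processors, as Example 1 below will exploit.
eps = 0.1
p = np.array([[1 - eps, eps], [eps, 1 - eps]])
r = np.eye(2)          # x = u deterministically
w = np.eye(2)          # v = y deterministically
print(end_to_end(r, p, w))   # equals the BSC itself
```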
Readers not conversant with rate-distortion theory should refer to Section 11 below. If that does not suffice, they should commune at their leisure with Shannon [4], Jelinek [10], Gallager [11] or Berger [9]. However, the two key examples that follow should be largely accessible to persons unfamiliar with the content of any of these references. Each example is constructed on a foundation comprised of two of Shannon's famous formulas.
Example 1. This example uses: (1) the formula for the capacity of a binary symmetric channel (BSC) with crossover probability ε, namely
$$C = 1 - h(\varepsilon) = 1 + \varepsilon\log\varepsilon + (1-\varepsilon)\log(1-\varepsilon) \quad \text{bits/channel use},$$
where we assume without loss of essential generality that ε ≤ 1/2, and (2) the formula for the rate-distortion function of a Bernoulli-1/2 source with respect to the error frequency distortion measure d(x,y) = 1 − δ(x,y), namely
$$R(D) = 1 - h(D) = 1 + D\log D + (1-D)\log(1-D) \quad \text{bits/source letter}, \qquad 0 \le D \le 1/2.$$
Shannon's converse channel coding theorem [1] establishes that it is not possible to send more than nC bits of information across the channel via n channel uses. Similarly, his converse source coding theorem [4] establishes that it is not possible to generate an approximation V_1, …, V_n to source letters U_1, …, U_n that has an average distortion E[n^{-1} Σ_{k=1}^n d(U_k, V_k)] of D or less unless that representation is based on nR(D) or more bits of information about these source letters. Accordingly, assuming the source resides at the channel input, it is impossible to generate an approximation to it at the channel output that has an average distortion any smaller than the value of D for which R(D) = C, even if the number n of source letters and channel uses is allowed to become large. Comparing the above formulas for C and R(D), we see that no value of average distortion less than ε can be achieved. This is true regardless of how complicated an encoder we place between the source and the channel, how complicated a decoder we place between the channel and the recipient of the source approximation, and how large a finite delay we allow the system to employ.
It is easy to see, however, that D = ε can be achieved simply by connecting the source directly to the channel input and using the channel output as the approximate reconstruction of the source output. Hence, this trivial communication system, which is devoid of any source or channel coding and operates with zero delay, is optimum in this example. There are two reasons for this:
Reason One: The channel is instantaneously matched to the source as defined above with the particularly simple structure that X = U, V = Y, r(x|u) = δ(u,x) and w(v|y) = δ(y,v). That is, the source is instantaneously and deterministically fed into the channel, and the channel output directly serves as the approximation to the source.
Reason Two: The source also is matched to the channel in the sense that the distribution of each U_k, and hence of each X_k, is p(0) = p(1) = 1/2, which distribution maximizes the mutual information between a channel input and the corresponding output. That is, the channel input letters are i.i.d. with their common distribution being the one that solves the variational problem that defines the channel's capacity.
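A quick numerical sanity check of Example 1 is straightforward. The sketch below assumes the Bernoulli-1/2 source and a BSC with ε = 0.1; the sample size and helper names are arbitrary choices of mine.

```python
import numpy as np

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*np.log2(p) - (1-p)*np.log2(1-p)

eps = 0.1                        # BSC crossover probability
C = 1 - h(eps)                   # capacity, bits per channel use
R = lambda D: 1 - h(D)           # rate-distortion function of the Bernoulli-1/2 source

# Uncoded transmission: feed U straight in, take V = Y; distortion is just eps.
rng = np.random.default_rng(0)
u = rng.integers(0, 2, 1_000_000)
y = u ^ (rng.random(u.size) < eps)       # BSC flips each bit with probability eps
print("empirical distortion:", np.mean(u != y))   # approximately eps
print("C =", C, "  R(eps) =", R(eps))             # equal: D = eps achieves R(D) = C
```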
Example 2. The channel in this example is a time-discrete, average-power-constrained additive white Gaussian noise (AWGN) channel. Specifically, its k-th output Y_k equals X_k + N_k, where X_k is the k-th input and the additive noises N_k are i.i.d. N(0, N) for k = 1, 2, …. Also, the average signaling power cannot exceed S, which we express mathematically by the requirement
$$\lim_{n\to\infty} E\left[\frac{1}{n}\sum_{k=1}^{n} X_k^2\right] \le S.$$
The source is a memoryless Gaussian source of variance σ², and the distortion measure is mean-squared error,
$$MSE = \lim_{n\to\infty} E\left[\frac{1}{n}\sum_{k=1}^{n} (V_k - U_k)^2\right].$$
Shannon's celebrated formula for the rate-distortion function of this source and distortion measure combination is
$$R(D) = (1/2)\log(\sigma^2/D), \qquad 0 \le D \le \sigma^2.$$
The minimum achievable value of the MSE is, as usual, the value of D that satisfies R(D) = C, which in this example is
$$D = \sigma^2/(1 + S/N).$$
As in Example 1, we find that this minimum value of D is trivially attainable without any source or channel coding and with zero delay. However, in this instance the source symbols must be scaled by α := √(S/σ²) before being put into the channel in order to ensure compliance with the power constraint. Similarly, V_k is produced by multiplying Y_k by the constant β := √(Sσ²)/(S + N), since this produces the minimum MSE estimate of U_k based on the channel output. Hence, the channel is instantaneously matched to the source via the deterministic transformations r(x|u) = δ(x − αu) and w(v|y) = δ(v − βy). Moreover, the source is matched to the channel in that, once scaled by α, it becomes the channel input which, among all those that comply with the power constraint, maximizes mutual information between itself and the channel output that it elicits. Thus, the scaled source drives the constrained channel at its capacity.
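Example 2 invites the same kind of spot-check. The Monte Carlo sketch below uses arbitrary illustrative values of σ², S, and N.

```python
import numpy as np

sigma2, S, N = 4.0, 2.0, 1.0            # source variance, power limit, noise variance
rng = np.random.default_rng(1)
n = 1_000_000
u = rng.normal(0.0, np.sqrt(sigma2), n) # memoryless Gaussian source
alpha = np.sqrt(S / sigma2)             # scale the source up to the power limit
beta = np.sqrt(S * sigma2) / (S + N)    # MMSE coefficient for U given Y
y = alpha * u + rng.normal(0.0, np.sqrt(N), n)
v = beta * y
mse = np.mean((u - v) ** 2)
D = sigma2 * N / (S + N)                # the distortion at which R(D) = C
print("empirical MSE:", mse, "  predicted D:", D)
print("C =", 0.5 * np.log2(1 + S / N), "  R(D) =", 0.5 * np.log2(sigma2 / D))
```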
It can be argued validly that, notwithstanding the fact that Examples 1 and 2 deal with source models, channel models, and distortion measures all dear to information theorists, these examples are exceptional cases. Indeed, if one were to modify {p(u)} or {p(y|x)} or {d(u,v)} even slightly, there no longer would be an absolutely optimum system that is both coding-free and delay-free. Achieving optimal performance would then require the use of coding schemes whose complexity and delay diverge as their end-to-end performance approaches the minimum possible average distortion attainable between the source and an approximation of it based on information delivered via the channel. However, if the perturbations to the source, to the channel and/or to the distortion measure were minor, then an instantaneous system would exist that is only mildly suboptimum. Because of its simplicity and relatively low operating costs, this mildly suboptimum scheme likely would be deemed preferable in practice to a highly complicated system that is truly optimum in the pure information theory sense.
I have argued above for why it is reasonable to expect biological channels to have evolved so as to be matched to the sources they monitor. I further believe that, as in Examples 1 and 2, the data selected from a biological source to be conveyed through a biological channel will drive that channel at a rate effectively equal to its resource-constrained capacity. That is, I postulate that double matching of channel to source and of source to channel in a manner analogous to that of Examples 1 and 2 is the rule rather than the exception in the information theory of living systems. Indeed, suppose selected stimuli were to be conditioned for transmission across one of an organism's internal channels in such a way that information failed to be conveyed at a rate nearly equal to the channel's capacity calculated for the level of resources being expended. This would make it possible to select additional data and then properly condition and transmit both it and the original data through the channel in a manner that does not increase the resources consumed. To fail to use such an alternative input would be wasteful either of information or of energy, since energy usually is the constrained resource in question. As explained previously, a fundamental characteristic of an efficient organism is that it always should be optimally trading information for energy, or vice versa, as circumstances dictate. The only way to assure that pertinent information will be garnered at low latency at the maximum rate per unit of power expended is not only to match the channel to the source but also to match the source to the channel.
We shall now discuss how increasing the number of bits handled per second unavoidably increases the number of joules expended per bit (i.e., decreases thermodynamic efficiency). To establish this in full generality requires penetrating deeply into thermodynamics and statistical mechanics. We shall instead content ourselves with studying the energy-information tradeoff implicit in Shannon's celebrated formula for the capacity of an average-power-constrained bandlimited AWGN channel, namely
$$C(S) = W \log\left(1 + \frac{S}{N_0 W}\right),$$
where S is the constrained signaling power, W is the bandwidth in positive frequencies, and N_0 is the one-sided power spectral density of the additive white Gaussian noise. Like all capacity-cost functions, C(S) is concave in S. Hence, its slope decreases as S increases; specifically, C'(S) = W/(S + N_0 W). The slope of C(S) has the dimensions of capacity per unit of power, which is to say (bits/second)/(joules/second) = bits/joule. Since the thermodynamic efficiency of the information-energy tradeoff is measured in bits/joule, it decreases steadily as the power level S and the bit rate C(S) increase. This militates in favor of gathering information slowly in any application not characterized by a stringent latency demand. To be sure, there are circumstances in which an organism needs to gather and process information rapidly and therefore does so. However, energy conservation dictates that information handling always should be conducted at as leisurely a pace as the application will tolerate. For example, recent experiments have shown that within the neocortex a neural region sometimes transfers information at a high rate and accordingly expends energy liberally, while at other times it conveys information at a relatively low rate and thereby expends less than proportionately much energy. In both of these modes, and others in between, our hypothesis is that these coalitions of neurons operate in an information-theoretically optimum manner. We shall attempt to describe below how this is accomplished.
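The declining bits-per-joule claim can be tabulated directly from C(S) = W log(1 + S/(N₀W)). The bandwidth, noise density, and power grid below are arbitrary illustrative values.

```python
import numpy as np

W, N0 = 1.0e3, 1.0e-9            # bandwidth (Hz) and one-sided noise density (W/Hz)
S = np.logspace(-8, -4, 5)       # candidate signaling powers in watts

C = W * np.log2(1.0 + S / (N0 * W))          # bits/second
avg_eff = C / S                              # average bits/joule
marg_eff = W / ((S + N0 * W) * np.log(2))    # C'(S), marginal bits/joule

for s, a, m in zip(S, avg_eff, marg_eff):
    print(f"S = {s:.1e} W   C/S = {a:.3e}   C'(S) = {m:.3e}  bits/joule")
# Both efficiencies fall monotonically as S grows: handling more bits per
# second unavoidably costs more joules per bit.
```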
Before turning in earnest to information handling by neural regions, we first need to generalize and further explicate the phenomenon of double matching of sources and channels. So far, we have discussed this only in the context of sources and channels that are memoryless. We could extend to sources and/or channels with memory via the usual procedure of blocking successive symbols into a "supersymbol" and treating long supersymbols as nearly i.i.d., but this would increase the latency by a factor equal to the number of symbols per supersymbol, thereby defeating one of the principal advantages of double matching. We suggest an alternative approach below which leads to limiting the memory of many crucial processes to at most first-order Markovianness.
It has long been appreciated that neuromuscular systems and metabolic regulatory mechanisms exhibit masterful use of feedback. Physiological measurements of the past fifteen or so years have incontrovertibly established that the same is true of neurosensory systems. Describing signaling paths in the primate visual cortex, for example, Woods and Krantz [8] tell us that "In addition to all the connections from V1 and V2 to V3, V4 and V5, each of these regions connects back to V1 and V2. These seemingly backward or reentrant connections are not well understood. Information, instead of flowing in one direction, flows in both directions. Thus, later levels do not simply receive information and send it forward, but are in an intimate two-way communication with other modules." Of course, it is not that information flowed unidirectionally in the visual system until some time in the 1980's and then began to flow bidirectionally. Rather, as is so often the case in science, measurements made possible by new instrumentation and methodologies have demanded that certain cherished paradigms be seriously revised. In this case, those mistaken paradigms espoused so-called "bottom-up" unidirectionality of signaling pathways in the human visual system (HVS) [6], [5].
Instead of speaking about feedforward and feedback signaling, neuroscientists refer to bottom-up and top-down signaling, respectively. Roughly speaking, neurons whose axons carry signals principally in a direction that moves from the sensory organs toward the "top brain" are called bottom-up neurons, while those whose axons propagate signals from the top brain back toward the sensory organs are called top-down neurons. Recent measurements have revealed that there are roughly as many top-down neurons in the HVS as there are bottom-up neurons. Indeed, nested feedback loops operate at the local, regional and global levels. We shall see that a theory of sensory perception which embraces rather than eschews feedback reaps rewards in the form of analytical results that are both simpler to obtain and more powerful in scope.
The roughly 10^10 neurons in the human visual system (HVS) constitute circa one-tenth of all the neurons in the brain. HVS neurons are directly interconnected with one another via an average of 10^4 synapses per neuron. That is, a typical HVS neuron has on its dendritic tree about 10^4 synapses, at each of which it taps off the spike signal propagating along the axon of one of the roughly 10^4 other HVS neurons that are afferent (i.e., incoming) to it. Via processing of this multidimensional input in a manner to be discussed below, it generates an efferent (i.e., outgoing) spike train on its own axon which propagates to the circa 10^4 other HVS neurons with which it is in direct connection. The 10^10 × 10^10 matrix whose (i,j) entry is 1 if neuron i is afferent to neuron j and 0 otherwise thus has a 1's density of 10^{-6}. However, there exist special subsets of the HVS neurons that are connected much more densely than this. These special subsets, among which are the ones referred to as V1, V2, …, V5 in the above quote, consist of a few million to as many as a few tens of millions of neurons and have connectivity submatrices whose 1's densities range from 0.1 to as much as 0.5.

(Footnote 2: Shannon's topic for the inaugural Shannon Lecture in June 1973 was Feedback. Biological considerations, one of his many interests [3], may have in part motivated this choice.)
Common sense suggests and experiments have verified that the neurons comprising such a subset work together to effect one or more crucial functions in the processing of visual signals. We shall henceforth refer to such subsets of neurons as "coalitions". Alternative names for them include neural "regions", "groupings", "contingents", and "assemblies".
Figure 1 shows a schematic representation of a neural coalition. Whereas real neural spike trains occur in continuous time and are asynchronous, Figure 1 is a time-discrete model. Its time step is circa 2.5 ms, which corresponds to the minimal duration between successive instants at which a neuron can generate a spike; spikes are also known as action potentials. The time-discrete model usually captures the essence of the coalition's operation as regards information transfer. Any spike traveling along an axon afferent to a coalition of neurons in the visual cortex will reach all the members of that coalition within the same time step. That is, although the leading edge of the spike arrives serially at the synapses to which it is afferent, the result is as if it were multicast to them all during a single time slot of the time-discrete model.
The input to the coalition in Figure 1 at time k is a random binary vector X(k) which possesses millions of components. Its i-th component, X_i(k), is 1 if a spike arrives on the i-th axon afferent to the coalition during the k-th time step and is 0 otherwise. The afferent neurons in Figure 1 have been divided into two groups indexed by BU and TD, standing respectively for bottom-up and top-down. The vertical lines represent the neurons of the coalition. The presence (absence) of a dark dot where the i-th horizontal line and m-th vertical line cross indicates that the i-th afferent axon forms (does not form) a synapse with the m-th neuron of the coalition. The strength, or weight, of synapse (i,m) will be denoted by W_{im}; if afferent axon i does not form a synapse with coalition neuron m, then W_{im} = 0. If W_{im} > 0, the connection (i,m) is said to be excitatory; if W_{im} < 0, the connection is inhibitory. In primate visual cortex about five-sixths of the connections are excitatory.
The so-called post-synaptic potential (PSP) of neuron m is built up during time step k as a weighted linear combination of all the signals that arrive at its synapses during this time step. If this sum exceeds the threshold T_m(k) of neuron m at time k, then neuron m produces a spike at time k and we write Y_m(k) = 1; if not, then Y_m(k) = 0. The thresholds do not vary much with m and k, with one exception. If Y_m(k) = 1, a refractory period of duration about equal to a typical spike width follows during which the threshold is extremely high, making it virtually impossible for the neuron to spike. In real neurons, the PSP is reset to its rest voltage after a neuron spikes. One shortcoming of our time-discrete model is that it assumes that a neuron's PSP is reset between the end of one time step and the beginning of the next even if the neuron did not fire a spike. In reality, if the peak PSP during the previous time step did not exceed threshold and hence no action potential was produced, contributions to this PSP will partially carry over into the next time step. Because of capacitative leakage they generally will have decayed to one-third or less of their peak value a time step ago, but they will not have vanished entirely.
(Footnote 3: The time-discrete model mirrors reality with insufficient accuracy in certain saturated or near-saturated conditions characterized by many of the neurons in a coalition spiking almost as fast as they can. Such instances, which occur rarely in the visual system but relatively frequently in the auditory system, are characterized by successive spikes on an axon being separated by short, nearly uniform durations whose sample standard deviation is less than 1 millisecond. Mathematical methods based on Poisson limit theorems (a.k.a. mean field approximations) and PDE's can be used to distinguish and quantify these exceptional conditions [25], [26].)
(Footnote 4: Most neuroscientists today agree that the detailed shape of the action potential spike as a function of time is inconsequential for information transmission purposes, all that matters being whether there is or is not a spike in the time slot in question.)
Every time-discrete system diagram must include a unit delay element in order to allow time to advance. In Figure 1 unit delays occur in the boxes so marked. Note, therefore, that the random binary vector Y(k−1) of spikes and non-spikes produced by the coalition's neurons during time step k−1 gets fed back to the coalition's input during time step k. This reflects the high interconnection density of the neurons in question that is responsible for why they constitute a coalition. Also note that, after the spike trains on the axons of each neuron in the coalition are delivered to synapses on the dendrites of selected members of the coalition itself, they then proceed in either a top-down or a bottom-up direction toward other HVS coalitions. In the case of a densely connected coalition, about half of its efferent axons' connections are local ones with other neurons in the coalition. On average, about one quarter of its connections provide feedback, mostly to the coalition immediately below in the HVS hierarchy although some are directed farther down. The remaining quarter are fed forward to coalitions higher in the hierarchy, again mainly to the coalition directly above. This elucidates why we distinguished the external input X to the coalition as being comprised of both a bottom-up (BU) and a top-down (TD) subset; these subsets come, respectively, mainly from the coalitions directly below and directly above.
Neuroscientists refer not only to bottom-up and top-down connections but also to horizontal connections [15], [16], [17]. Translated into feedforward-feedback terminology, horizontal connections are local feedback such as Y(k−1) in Figure 1, top-down connections are regional feedback, and bottom-up connections are regional feedforward. Bottom-up and top-down signals also can be considered to constitute instances of what information theorists call side information.
In information theory parlance, the neural coalition of Figure 1 is a time-discrete, finite-state channel whose state is the previous channel output vector. At the channel input appear both the regional feedforward signal {X_BU(k)} and the regional feedback signal {X_TD(k)}. However, there is no channel encoder in the information-theoretic sense of that term that is able to operate on these two signals in whatever manner suits its fancy in order to generate the channel input. Rather, the composite binary vector process {X(k)} := {(X_BU(k), X_TD(k))} simply enters the channel by virtue of the axons carrying it being afferent to the synapses of the channel's neurons. We shall subsequently see that there is no decoder in the information-theoretic sense either; as foretold, the system is coding-free.
My use of the adjective "coding-free" is likely to rile both information theorists and neuroscientists - information theorists because they are deeply enamored of coding and neuroscientists because they are accustomed to thinking about how an organism's neural spike trains serve as coded representations of aspects of its environment. In hopes of not losing both halves of my audience at once, allow me to elaborate. Certainly, sensory neurons' spike trains constitute an encoding of environmental data sources. However, unless they explicitly say otherwise, information theorists referring to coding usually mean channel coding rather than source coding. Channel coding consists of the intentional insertion of cleverly selected redundant parity check symbols into a data stream in order to provide error detection and error correction capabilities for applications involving noisy storage or transmission of data. I do not believe that the brain employs error control coding (ECC).
(Footnote 5: In [2] Shannon wrote, "Channels with feedback from the receiving to the transmitting point are a special case of a situation in which there is additional information available at the transmitter which may be used as an aid in the forward transmission system.")

(Footnote 6: There is a possibility that the brain employs a form of space-time coding, with the emphasis heavily on space as opposed to time. Here, the space dimension means the neurons themselves, the cardinality of which dwarfs that of the paucity of antennas that comprise the space dimension of the space-time codes currently under development for wireless communications. Think of it this way. In order to evolve more capable sensory systems, organisms needed to expand the temporal and/or the spatial dimensionality of their processing. Time expansion was not viable because the need to respond to certain stimuli in only a few tens of milliseconds precluded employing temporal ECC techniques of any power, because these require long block lengths or long convolutional constraint lengths which impose unacceptably long latency. Pulse widths conceivably could have been narrowed (i.e., bandwidths increased), but the width of a neural pulse appears to have held steady at circa 2 ms over all species over hundreds of millions of years, no doubt for a variety of compelling reasons. (Certain owls' auditory systems have spikes only about 1 ms wide, but we are not looking for factor of 2 explanations here.) The obvious solution was to expand the spatial dimension. Organisms have done precisely that, relying on parallel processing by more and more neurons in order to progress up the phylogenic tree. If there is any ECC coding done by neurons, it likely is done spatially over the huge numbers of neurons involved. Indeed, strong correlations have been observed in the spiking behaviors of neighboring neurons, but these may simply be consequences of the need to obtain high resolution of certain environmental stimuli that are themselves inherently correlated and/or the need to direct certain spike trains to more locations, or more widely dispersed locations, than it is practical for a single neuron to visit. There is not yet any solid evidence that neurons implement ECC. Similarly, although outside the domain of this paper, we remark that there is not yet any concrete evidence that redundancies in the genetic code play an ECC role; if it turns out they do, said ECC capability clearly also will be predominately spatial as opposed to temporal in nature.)
It's time for some mathematical information theory. Let's see what has to happen in order for an organism to make optimum use of the channel in Figure 1, i.e., to transmit information through it at a rate equal to its capacity. (Of course, we are not interested just in sending any old information through the channel - we want to send the "right" information through the channel, but we shall temporarily ignore this requirement.) A channel's capacity depends on the level of resources expended. What resources, if any, are being depleted in the course of operating the channel of Figure 1? The answer lies in the biology. Obviously, energy is consumed every time one of the channel's neurons generates an action potential, or spike. It is also true that when a spike arrives at a synapse located on a dendrite of one of the channel's neurons, energy usually is expended in order to convert it into a contribution to the post-synaptic potential. This is effected via a sequence of electrochemical processes the end result of which is that vesicles containing neurotransmitter chemicals empty them for transportation across the synaptic cleft. This, in turn, either increases or decreases the post-synaptic potential (equivalently, the post-synaptic current), respectively as the synapse is an excitatory or an inhibitory one.
The expected energy dissipated in the synapses in Figure 1 at time k therefore depends on X(k) and Y(k−1), while that dissipated in the axons depends on Y(k). The average energy dissipated in the coalition at time k therefore is the expected value of one function of (X(k), Y(k−1)) and another function of Y(k), with the effect of quantal synaptic failures (see footnote 8 concerning QSF) usually well approximated by multiplying the first of these two functions by s. For purposes of the theorem we are about to present, it suffices to make the less restrictive assumption that the average resources expended at time k are the expected value of some function solely of (X(k), Y(k−1), Y(k)).

(Footnote 7: Actually, the energy gets consumed principally during the process of re-setting chemical concentrations in and around the neuron after each time it spikes so as to prepare it to fire again should sufficient excitation arrive. The actual transmission of a spike is more a matter of energy conversion than of energy dissipation.)

(Footnote 8: Spikes arriving at synapses often are ignored, a phenomenon known as quantal synaptic failure (QSF). Its name notwithstanding, QSF actually is one of natural selection's finer triumphs, enhancing the performance of neural coalitions in several ingenious respects, the details of which can be found in the works of Levy and Baxter [13], [14]. Let S_im(k) be a binary random variable that equals 1 if QSF does not occur at synapse (i,m) at time k and equals 0 if it does; that is, S_im(k) = 1 denotes a quantal synaptic success at synapse (i,m) at time k. Often, the S_im(k)'s can be well modeled as Bernoulli-s random variables, i.e., as being i.i.d. over i, m and k with common distribution P(S = 1) = 1 − P(S = 0) = s; in practice, s ∈ [0.25, 0.9]. The phenomenon of QSF then may be modeled by multiplying the spike, if any, afferent to synapse (i,m) at time k by S_im(k). This is seen to be equivalent to installing what information theorists call a Z-channel [12] at every synapse. Were it not for QSF, the coalition channel would be effectively deterministic when viewed as an operator that transforms {X} into {Y}, since the signal-to-noise ratio on neural channels usually is quite strong. However, if the channel is viewed as an operator only from {X_BU} to {Y}, with {X_TD} considered to be random side information, QSF may no longer be its dominant source of randomness.)
We may impose either a schedule of expected resource depletion constraints as a function of k, or simply constrain the sum of the expected resource depletions over some appropriate range of the discrete time index k. Average energy expenditure, which we believe to be the dominant operative constraint in practice, is an important special case of this general family of resource constraint functions.
Let PSP_m(k) denote the post-synaptic potential of the m-th neuron in the coalition at time k. Then the output Y_m(k) of this neuron, considered to be 1 if there is a spike at time k and 0 if there isn't, is given by
$$Y_m(k) = U\big(PSP_m(k) - T_m\big),$$
where U(·) is the unit step function. The above discussion of Figure 1 lets us write
$$PSP_m(k) = \sum_i X_i(k)\, W_{im}\, Q_{im}(k)\, S_{im}(k) + \sum_l Y_l(k-1)\, W_{lm}\, Q_{lm}(k)\, S_{lm}(k),$$
where W_{im} is the signed weight of synapse (i,m), Q_{im}(k) is the random size of the quantity of neurotransmitter that will traverse the synaptic cleft in response to an afferent spike at synapse (i,m) at time k if synaptic quantal failure does not occur there then, and S_{im}(k) equals 0 or 1, respectively, in accordance with whether said quantal synaptic failure does or does not occur. The spiking threshold T_m = T_m(k) of neuron m at time k varies with m and k, though usually only mildly.
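A minimal simulation of one time step of this threshold model may help fix ideas. Everything numerical here (the sizes, the weight statistics, the gamma model for Q, the success probability s, the threshold) is a hypothetical choice of mine; only the functional form Y_m(k) = U(PSP_m(k) − T_m) comes from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n_aff, n_coal = 200, 50                 # afferent axons, coalition neurons (toy sizes)
s = 0.6                                 # quantal synaptic success probability
T = 1.0                                 # spiking threshold, taken constant here

W_in = rng.normal(0.0, 0.2, (n_aff, n_coal))    # signed synaptic weights, afferent
W_fb = rng.normal(0.0, 0.2, (n_coal, n_coal))   # signed weights, local feedback

def step(x, y_prev):
    """One time step: PSP_m = sum_i X_i W Q S + sum_l Y_l(k-1) W Q S, then threshold."""
    Q_in = rng.gamma(2.0, 0.5, W_in.shape)      # random neurotransmitter quantity
    Q_fb = rng.gamma(2.0, 0.5, W_fb.shape)
    S_in = rng.random(W_in.shape) < s           # 1 = no quantal synaptic failure
    S_fb = rng.random(W_fb.shape) < s
    psp = x @ (W_in * Q_in * S_in) + y_prev @ (W_fb * Q_fb * S_fb)
    return (psp > T).astype(int)                # Y_m(k) = U(PSP_m(k) - T)

x = rng.integers(0, 2, n_aff)           # afferent spikes X(k)
y_prev = rng.integers(0, 2, n_coal)     # previous coalition output Y(k-1)
print(step(x, y_prev))
```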
Note that this channel model is such that successive channel output vectors Y(k) are generated independently, conditional on their corresponding input vectors X(k) and local feedback vectors Y(k−1); that is,
$$p(y_1^n \mid x_1^n, y_0) = \prod_{k=1}^{n} p(y_k \mid x_k, y_{k-1}). \qquad (1)$$
...input at time k, with x_1^{k-1} and y_0^{k-1} by then no longer being available there. We will show that this upper bound can be met even when access is denied to said past input and local feedback vectors. This, in turn, helps demystify how a real neural network structured as in Figure 1 can be information-theoretically optimum despite its not possessing a classical encoder at the network's input.
The postulated full-memory encoder can generate any input process whose probabilistic structure is described by a member of the set P(X_1^n) of probability distributions of the form
$$\prod_{k=1}^{n} p(x_k \mid x_1^{k-1}, y_0^{k-1}). \qquad (2)$$
Let P*(X_1^n) denote the subset of P(X_1^n) consisting of distributions of the form
$$\prod_{k=1}^{n} p(x_k \mid y_{k-1}). \qquad (3)$$
Compared with (2), this set contains only those input distributions for which, given Y_{k−1}, X_k becomes conditionally independent of all the previous inputs X_1^{k−1} and all the previous outputs Y_0^{k−2}.
Theorem 1. The maximum mutual information rate between the channel's input and output processes is attained inside P*(X_1^n), uniformly in the initial conditions Y_0. Moreover, if we restrict the distribution of the inputs X_1^n to P*(X_1^n), let Y_1^n denote the corresponding output, and let Y_0 denote the initial channel state, then Y_0, Y_1, …, Y_n is a Markov chain.

Proof of Theorem 1. If (X_1^n, Y_1^n) is such that X_1^n ∈ P*(X_1^n), then
$$p(x_k \mid x_1^{k-1}, y_0^{k-1}) = p(x_k \mid y_{k-1}).$$
Thus, with reference to the conditional memorylessness of the channel (cf. equation (1)), we have
$$p(y_k \mid y_0^{k-1}) = \sum_{x_k} p(y_k \mid x_k, y_{k-1})\, p(x_k \mid y_{k-1}) = p(y_k \mid y_{k-1}). \qquad (4)$$

Remark: Depending on the input pmf's p(x_k|y_{k-1}), k = 1, 2, …, Y can be either a homogeneous or a nonhomogeneous Markov chain. If pmf p(x_k|y_{k-1}) does not vary with k, then Y is homogeneous; otherwise, it's nonhomogeneous.

We next derive an upper bound on the mutual information between any (X_1^n, Y_1^n) ∈ P(X_1^n, Y_1^n), which is needed for the proof:
$$\begin{aligned}
I(X_1^n; Y_1^n \mid Y_0) &= H(Y_1^n \mid Y_0) - H(Y_1^n \mid X_1^n, Y_0) \\
&\overset{(a)}{=} \sum_{k=1}^{n} H(Y_k \mid Y_0^{k-1}) - \sum_{k=1}^{n} H(Y_k \mid X_1^n, Y_0^{k-1}) \\
&\overset{(b)}{=} \sum_{k=1}^{n} H(Y_k \mid Y_0^{k-1}) - \sum_{k=1}^{n} H(Y_k \mid X_k, Y_{k-1}) \\
&\overset{(c)}{\le} \sum_{k=1}^{n} H(Y_k \mid Y_{k-1}) - \sum_{k=1}^{n} H(Y_k \mid X_k, Y_{k-1}) \\
&= \sum_{k=1}^{n} I(X_k; Y_k \mid Y_{k-1}), \qquad (5)
\end{aligned}$$
where (a) is the chain rule; (b) follows from the channel property p(y_k | x_1^n, y_0^{k-1}) = p(y_k | x_k, y_{k-1}) for all k and all n ≥ k; and (c) follows from the fact that increasing conditioning can only reduce entropy. Notice that the inequality in (c) becomes equality when Y is a Markov chain. Therefore, it follows from Lemma 1 that, for any (X_1^n, Y_1^n) ∈ P*(X_1^n, Y_1^n),
$$I(X_1^n; Y_1^n \mid Y_0) = \sum_{k=1}^{n} I(X_k; Y_k \mid Y_{k-1}). \qquad (6)$$
For any (X_1^n, Y_1^n) ∈ P(X_1^n, Y_1^n), the joint pmf may be factored, using (1), as
$$p_{X_1^n, Y_1^n \mid Y_0}(x_1^n, y_1^n \mid y_0) = \prod_{k=1}^{n} p_{X_k \mid X_1^{k-1}, Y_0^{k-1}}(x_k \mid x_1^{k-1}, y_0^{k-1})\; p_{Y_k \mid X_k, Y_{k-1}}(y_k \mid x_k, y_{k-1}), \qquad (8)$$
where the last factor follows from (1). We now construct a $(\hat X_1^n, \hat Y_1^n) \in P^*(X_1^n, Y_1^n)$ distributed according to the pmf
$$p_{\hat X_1^n, \hat Y_1^n \mid \hat Y_0}(x_1^n, y_1^n \mid y_0) = \prod_{k=1}^{n} p_{\hat X_k \mid \hat Y_{k-1}}(x_k \mid y_{k-1})\; p_{\hat Y_k \mid \hat X_k, \hat Y_{k-1}}(y_k \mid x_k, y_{k-1}), \qquad (9)$$
where we set
$$p_{\hat X_k \mid \hat Y_{k-1}}(x \mid y) = p_{X_k \mid Y_{k-1}}(x \mid y) \qquad (10)$$
and
$$p_{\hat Y_k \mid \hat X_k, \hat Y_{k-1}}(y \mid x, y') = p_{Y_k \mid X_k, Y_{k-1}}(y \mid x, y'), \quad \forall k \ge 1. \qquad (11)$$
$\hat Y$ therein is just an alias for the random variable Y. Unlike $X_k$ in (8), $\hat X_k$ is restricted to be conditionally independent of $(\hat X_1^{k-1}, \hat Y_0^{k-2})$, given $\hat Y_{k-1}$. Thus, $\hat X_1^n \in P^*(X_1^n)$. Equation (11), together with $\hat Y_0 = Y_0$, assures us that $\hat Y_1^n$ is indeed the output from our channel in response to input $\hat X_1^n$ and initial state $Y_0$; i.e., $(\hat X_1^n, \hat Y_1^n) \in P^*(X_1^n, Y_1^n)$. It is obvious that the joint pmf of $(\hat X_1^n, \hat Y_1^n)$ is different from that of $(X_1^n, Y_1^n)$. However, we have the following lemma.

Lemma 2. Assume (X_1^n, Y_1^n) ∈ P(X_1^n, Y_1^n). Let $(\hat X_1^n, \hat Y_1^n)$ be defined as in (9) and let $\hat Y_0 = Y_0$. Then
$$p_{\hat Y_k, \hat X_k, \hat Y_{k-1}}(y, x, y') = p_{Y_k, X_k, Y_{k-1}}(y, x, y'), \quad \forall k \ge 1. \qquad (12)$$

Proof: It follows from (10) and (11) that
$$p_{\hat Y_k, \hat X_k \mid \hat Y_{k-1}}(y, x \mid y') = p_{Y_k, X_k \mid Y_{k-1}}(y, x \mid y'), \quad \forall k \ge 1. \qquad (13)$$
Since $\hat Y_0 = Y_0$, we may write
$$p_{\hat Y_1, \hat X_1, \hat Y_0}(y, x, y_0) = p_{\hat Y_1, \hat X_1 \mid \hat Y_0}(y, x \mid y_0)\, p_{\hat Y_0}(y_0) = p_{Y_1, X_1 \mid Y_0}(y, x \mid y_0)\, p_{Y_0}(y_0) = p_{Y_1, X_1, Y_0}(y, x, y_0). \qquad (14)$$
That is, the lemma statement (12) holds for k = 1, from which it follows by marginalization that
$$p_{\hat Y_1}(y) = p_{Y_1}(y).$$
The same arguments as in (14) now can be used to verify that (12) holds for k = 2. Repeating this argument for k = 3, and so on, establishes the desired result for all k ≥ 1.

Since $(\hat X_1^n, \hat Y_1^n) \in P^*(X_1^n, Y_1^n)$, we know from (6) that
$$I(\hat X_1^n; \hat Y_1^n \mid \hat Y_0) = \sum_{k=1}^{n} I(\hat X_k; \hat Y_k \mid \hat Y_{k-1}), \qquad (15)$$
so, by Lemma 2,
$$\sum_{k=1}^{n} I(\hat X_k; \hat Y_k \mid \hat Y_{k-1}) = \sum_{k=1}^{n} I(X_k; Y_k \mid Y_{k-1}). \qquad (16)$$
In preparation for addressing the topic of decoding, or the lack thereof, let us recall Shannon's formulas characterizing the probability distributions that solve the variational problems for calculating capacity-cost functions of channels and rate-distortion functions of sources.
The cost-capacity variational problem is defined as follows. We are given the transition probabilities {p(y|x), (x,y) ∈ X × Y} of a discrete memoryless channel (dmc) and a set of nonnegative numbers {c(x) ≥ 0, x ∈ X}, where c(x) is the cost incurred each time the symbol x is inserted into the channel. We seek the probability distribution {p(x), x ∈ X} that maximizes the mutual information subject to the constraint that the average input cost does not exceed S. We denote this maximum by
$$C(S) := \max_{\{p(x)\} \in \mathcal{S}} \sum_x \sum_y p(x)\, p(y|x)\, \log\big(p(y|x)/\tilde p(y)\big), \qquad (17)$$
where
$$\tilde p(y) := \sum_x p(x)\, p(y|x). \qquad (18)$$
C(S) is a concave function that usually satisfies C(0) = 0 and lim_{S→∞} C(S) = C_max. The constant C_max, called either the unconstrained capacity or simply the capacity of the channel, is finite if X and/or Y have finite cardinality but may be infinite otherwise.
The rate-distortion variational problem is defined as follows. We are given the letter probabilities {p(u), u ∈ U} of a discrete memoryless source (dms) and a set of nonnegative numbers {d(u,v) ≥ 0, (u,v) ∈ U × V}. Here, d(u,v) measures the distortion that occurs whenever the dms produces the letter u ∈ U and the communication system delivers to a user located at its output the letter v ∈ V as its approximation of said u. The alphabets U and V may or may not be identical. In fact, the appropriate V and distortion measure vary from user to user. Alternatively, and more apropos of application to a living organism, they vary over the different uses a single organism has for the information. In what follows we therefore speak of (source,use)-pairs instead of the usual terminology of (source,user)-pairs. In rate-distortion theory we seek the transition probability assignment {q(v|u), (u,v) ∈ U × V} that minimizes the average mutual information subject to the constraint that the average distortion does not exceed D. We denote this minimum by
$$R(D) := \min_{\{q(v|u)\} \in \mathcal{D}} \sum_u \sum_v p(u)\, q(v|u)\, \log\big(q(v|u)/q(v)\big), \qquad (19)$$
where
$$q(v) := \sum_u p(u)\, q(v|u). \qquad (20)$$
Lagrange maximization in the problem (17) leads to the necessary condition
$$c(x) = \mu\, D\big(p(y|x)\,\|\,\tilde p(y)\big) + \lambda, \qquad (21)$$
where p̃(y) is given linearly in terms of said {p(x)} by (18). The constant λ represents a fixed cost per channel use that simply translates the C(S) curve vertically, so no loss in generality results from setting λ = 0. The constant µ is interchangeable with the choice of the logarithm base, so we may set µ = 1, again without any essential loss of generality. It follows that in order for optimality to prevail the cost function must equal the Kullback-Leibler distance between the conditional distribution {p(y|x), y ∈ Y} of the channel's output when the channel input equals letter x and the unconditional output distribution {p̃(y), y ∈ Y} obtained by averaging as in (18) over the information-maximizing {p(x), x ∈ X}. Since the expectation over X of the K-L distance between {p(y|X)} and {p̃(y)} is the mutual information between the channel's input and output, we see that applying the constraint Σ_x p(x)c(x) ≤ S is equivalent, when the C(S)-achieving input distribution is in force, to maximizing the average mutual information subject to the average mutual information not exceeding a specified amount, call it I. Obviously, this results in C(I) = I, a capacity-cost function that is simply a straight line at 45°.

(Footnote 10: The K-L distance, or relative entropy, of two probability distributions {p(w), w ∈ W} and {q(w), w ∈ W} is given by D(p||q) := Σ_w p(w) log(p(w)/q(w)). It is nonnegative and equals 0 if and only if q(w) = p(w) for all w ∈ W.)

(Footnote 11: This does happen for an AWGN channel for all S ≪ N_0 W, and hence for all practical values of S in the case of a time-continuous AWGN of extremely broad bandwidth.)

This perhaps confusing state of affairs requires further explanation, since information theorists are justifiably not accustomed to the capacity-cost curve being a straight line.
Since studying a well-known example often sheds light on the general case, let us consider again Shannon's famous formula C(S) = (1/2)log(1 + S/N) for the capacity-cost function of a time-discrete, average-power-limited memoryless AWGN channel; clearly, it is a strictly concave function of S. Of course, in this formula S is a constraint on the average power expended to transmit information across the channel, not on the average mutual information between the channel's input and output. Next, recall that for this AWGN channel, whose transition probabilities are given by p(y|x) = exp(−(y−x)²/2N)/√(2πN), the optimum power-constrained input distribution is known to be Gaussian, namely p(x) = exp(−x²/2S)/√(2πS). When D(p(y|x)||p̃(y)) is evaluated in the case of this optimum input, it indeed turns out to be proportional to x² plus a constant. Hence, there is no difference in this example between considering the constraint to be imposed on the expected value of X² or considering it to be imposed on the expected value of D(p(y|X)||p̃(y)). The physical significance of an average power constraint is evident, but what, if anything, is the physical meaning of an average K-L distance constraint? First observe that, if for some x it were to be the case that {p(y|x), y ∈ Y} is the same as {p̃(y), y ∈ Y}, then one would be utterly unable to distinguish on the basis of the channel output between transmission and non-transmission of the symbol x. Little if any resources would need to be expended to build and operate a channel in which {p(y|x), y ∈ Y} does not change with x, since the output of such a channel is independent of its input. In order to assure that the conditional distributions on the channel output space given various input values are well separated from one another, resources must be expended. We have seen that in the case of an AWGN channel the ability to perform such discriminations is constrained by the average transmission power available; in a non-Gaussian world, the physically operative quantity to constrain likely would be something other than power. In this light I believe (21) is telling us, among other things, that if one is not sure a priori to what use(s) the information conveyed through the channel output will be put, one should adopt the viewpoint that the task is to keep the various inputs as easy to discriminate from one another as possible subject to whatever physical constraint(s) are in force. We adopt this viewpoint below in our treatment of the decoding problem.
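For the Gaussian case the proportionality claim can be checked in closed form: D(N(x,N) ‖ N(0,S+N)) = ½[ln((S+N)/N) + (N + x²)/(S+N) − 1] nats, a constant plus x²/(2(S+N)). A small numerical confirmation follows; the particular values are arbitrary.

```python
import numpy as np

S, N = 2.0, 1.0

def kl_gauss(m1, v1, m2, v2):
    """D( N(m1, v1) || N(m2, v2) ) in nats."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

for x in [0.0, 1.0, 2.0, 3.0]:
    kl = kl_gauss(x, N, 0.0, S + N)
    affine = 0.5 * (np.log((S + N) / N) + N / (S + N) - 1.0) + x**2 / (2 * (S + N))
    print(x, kl, affine)   # identical: the K-L "cost" is a constant plus x^2/(2(S+N))
```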
For the rate-distortion problem Shannon [4] observed that Lagrange minimization over {q(v|u)} leads to the necessary condition
$$q_s(v|u) = \lambda_s(u)\, q_s(v)\, \exp\big(s\, d(u,v)\big), \qquad (22)$$
where s ∈ [−∞, 0] is a parameter that equals the slope R'(D_s) of the rate-distortion function at the point (D_s, R(D_s)) that it generates, and {q_s(v), v ∈ V} is an appropriately selected probability distribution over the space V of source approximations. Since q_s(v|u) must sum to 1 over v for each fixed u ∈ U, we have
$$\lambda_s(u) = \Big[\sum_v q_s(v)\, \exp\big(s\, d(u,v)\big)\Big]^{-1}. \qquad (23)$$
Except in some special but important examples, given a parameter value s it is difficult to find the optimum {q_s(v)}, and hence the optimum {q_s(v|u)} from (22). We can recast (22) as an expression for d(u,v) in terms of q_s(v|u), namely
$$d(u,v) = (-1/|s|)\log\big(q_s(v|u)/q_s(v)\big) + (1/|s|)\log \lambda_s(u). \qquad (24)$$
Since v does not appear in the second term on the right-hand side of (24), that term reflects only indirectly on the way in which a system that is end-to-end optimum in the sense of achieving the point (D_s, R(D_s)) on the rate-distortion function probabilistically reconstructs the source letter u as the various letters v ∈ V. Recalling that log(q_s(v|u)/q_s(v)) is the mutual information i_s(u;v) between symbols u and v for the end-to-end (i.e., source-to-use) optimum system, we see that the right way to build said system is to make the mutual information between the letters of pairs (u,v) decrease as the distortion between them increases. Averaging over the joint distribution p(u)q_s(v|u) that is optimum for parameter value s affirms the inverse relation between average distortion and average mutual information that pertains in rate-distortion. This relationship is analogous to the directly-varying relation between average cost and average mutual information in the channel variational problem.
For purposes of the present exposition, a key consequence of the preceding paragraph is that it helps explain how a channel p(y|x) can be matched to many (source,use)-pairs at once. Specifically, under our definition channel p(y|x) is matched to source p(u) and distortion measure d(u,v) at slope s on their rate-distortion function if, and only if, there exists a pair of conditional probability distributions {r(x|u)} and {w(v|y)} such that the optimum end-to-end system transition probabilities {q_s(v|u)} in the rate-distortion problem can be written in the form
$$q_s(v|u) = \sum_x \sum_y r_s(x|u)\, p(y|x)\, w_s(v|y). \qquad (25)$$
It should be clear that (25) often can hold for many (source,use) pairs that are of interest to an organism. In such instances it will be significantly more efficient computationally for the organism to share the p(y|x) part of the construction of the desired transition probability assignments for these (source,use) pairs rather than to have to in effect build and then operate in parallel a separate version of it for each of said applications. This will be all the more so the case if it is not known until after p(y|x) has been exercised just which potential uses appear to be intriguing enough to bother computing their w(v|y)-parts and which do not.
I conjecture that the previous paragraph has much to say about why neural coalitions are constructed and interconnected the way they are. Namely, the coalitionwise transition probabilities effected are common to numerous potential applications, only some of which actually get explored. The situation is sketched schematically in Figure 2, from which the reader can see how a given neural coalition, viewed as a channel, might both be matched by the source that drives it and at the same time could help match that source to many potential uses via interconnection with other coalitions and subcoalitions.

(Footnote 12: Kuhn-Tucker theory tells us that the necessary and sufficient condition for {q_s(v)} to generate, via (22), a point on the rate-distortion curve at which the slope is s is
$$c_s(v) := \sum_u \lambda_s(u)\, p(u)\, \exp\big(s\, d(u,v)\big) \le 1 \quad \text{for all } v,$$
where λ_s(u) is given by equation (23) and equality prevails for every v for which q_s(v) > 0. Recursive algorithms developed by Blahut [19] and by Rose [20] allow rate-distortion functions to be calculated numerically with great accuracy at moderate computational intensity.)

(Footnote 13: Equations (21) and (24) perhaps first appeared together in the paper by Gastpar et al. [18]. Motivated by exposure to my Examples 1 and 2, they derived conditions for double matching of more general sources and channels, confining attention to the special case of deterministic p(x|u) and p(v|y).)
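Blahut's recursion mentioned in footnote 12 is compact enough to sketch: it alternates (22)-(23) style updates of q_s(v|u) and q_s(v) for a fixed slope parameter s < 0. The source, the distortion matrix, and the particular s below are arbitrary illustrative choices of mine.

```python
import numpy as np

def blahut_rd(p_u, d, s, iters=500):
    """One point (D_s, R(D_s)) of the rate-distortion curve via Blahut's algorithm.

    p_u: source letter probabilities; d: |U| x |V| distortion matrix; s < 0: slope.
    """
    A = np.exp(s * d)                      # exp(s d(u,v)), as in (22)
    q_v = np.full(d.shape[1], 1.0 / d.shape[1])
    for _ in range(iters):
        lam = 1.0 / (A @ q_v)              # lambda_s(u), eq. (23)
        q_vu = A * q_v * lam[:, None]      # q_s(v|u) = lambda_s(u) q_s(v) e^{s d(u,v)}
        q_v = p_u @ q_vu                   # re-average the output distribution
    D = np.sum(p_u[:, None] * q_vu * d)
    R = np.sum(p_u[:, None] * q_vu * np.log(q_vu / q_v)) / np.log(2)
    return D, R

p_u = np.array([0.5, 0.5])                 # Bernoulli-1/2 source
d = 1.0 - np.eye(2)                        # error-frequency distortion measure
print(blahut_rd(p_u, d, s=-2.0))           # compare with R(D) = 1 - h(D) from Example 1
```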
Whether or not use i, associated with some i-th neural subcoalition described by transition probabilities {w_{s,i}(v_i|y), v_i ∈ V_i}, gets actively explored at a given instant depends on what side information in addition to part of {Y} gets presented to it. That side information depends, in turn, on Bayesian-style prior probabilities that are continually being recursively updated as the bottom-up and top-down processing of data from stimuli proceeds. When said side information is relatively inhibitory rather than excitatory, the subregion does not "ramp up." Then energy is saved but of course less information is conveyed.

(Footnote 14: Recursive estimation is the name of the game in sensory signal processing by neurons. A Kalman formalism [21] is appropriate for this, subject to certain provisos. One is that it seems best to adopt the conditional error entropy minimization criterion [22], [23], [24] as opposed to, say, a minimum MSE criterion; this is in keeping with our view that an information criterion should be retained for as long as possible before specializing to a more physical criterion associated with a particular use. Another is that the full Kalman solution requires inverting matrices of the form I + MMᵀ in order to update the conditional covariance matrix. Matrix inversion is not believed to be in the repertoire of mathematical operations readily amenable to realization via neurons unless the effective rank of the matrix M is quite low.)

(Footnote 15: We remark that neurons are remarkably sensitive in this respect. They idle at a mean PSP level that is one or two standard deviations below their spiking threshold. In this mode they spike only occasionally, when the random fluctuations in their effectively Poisson synaptic bombardment happen to bunch together in time to build the PSP up above threshold. However, a small percentage change in the bombardment (e.g., a slightly increased overall intensity of bombardment and/or a shift toward a higher excitatory-to-inhibitory ratio of the synapses being bombarded) can significantly increase the spiking frequency.)

(Footnote 16: Given the predominantly positive feedback among the members of a coalition, many of its members can be made to ramp up their spiking intensities nearly simultaneously. This helps explain why coalitions of neural cortex exhibit dramatic variations, on a time scale of several tens of milliseconds, in their rates of spiking and hence in their rates of information transmission and of energy depletion.)
ACKNOWLEDGEMENT

The author is indebted to Professors William B. "Chip" Levy and James W. Mandell of the University of Virginia Medical School's neurosurgery and neuropathology departments, respectively, for frequent, lengthy, and far-ranging discussions of neuroscience over the past seven years. In his role as my neuroscience mentor, Chip has graciously enlightened me regarding his own and others' theoretical approaches to computational neuroscience. This presentation is in substantial measure a reflection of Chip Levy's unrelenting pursuit of neuroscientific theory rooted in biological reality. This work was supported in part by NIH Grant RR15205 for which Professor Levy is the Principal Investigator.
References

[1] C. E. Shannon, A Mathematical Theory of Communication. Bell Syst. Tech. J., vol. 27, 379-423, 623-656, July and October, 1948. (Also in Claude Elwood Shannon: Collected Papers, N. J. A. Sloane and A. D. Wyner, eds., IEEE Press, Piscataway, NJ, 1993, 5-83.)
[2] C. E. Shannon, Channels with Side Information at the Transmitter. IBM J. Research and Development, vol. 2, 289-293, 1958.
[18] M. Gastpar, B. Rimoldi and M. Vetterli, To Code or Not to Code. Preprint, EPFL, Lausanne, Switzerland, 2001.

[19] R. E. Blahut, Computation of Channel Capacity and Rate-Distortion Functions. IEEE Trans. Inform. Theory, vol. 18, 460-473, 1972.

[20] K. Rose, A Mapping Approach to Rate-Distortion Computation and Analysis. IEEE Trans. Inform. Theory, vol. 42, 1939-1952, 1996.

[21] T. Kailath, Lectures on Wiener and Kalman Filtering, CISM Courses and Lectures No. 140, Springer-Verlag, Vienna and New York, 1981.

[22] H. L. Weidemann and E. B. Stear, Entropy analysis of parameter estimation. Info. and Control, vol. 14, 493-506, 1969.

[23] N. Minamide and P. N. Nikiforuk, Conditional entropy theorem for recursive parameter estimation and its application to state estimation problems. Int. J. Systems Sci., vol. 24, 53-63, 1993.

[24] M. Janzura and T. Koski, Minimum entropy of error principle in estimation. Info. Sciences, vol. 79, 123-144, 1994.

[25] A. Buonocore, V. Giorno, A. G. Nobile and L. M. Ricciardi, A Neural Modeling Paradigm in the Presence of Refractoriness. BioSystems, vol. 67, 35-43, 2002.

[26] T. Berger, Interspike Interval Analysis via a PDE. Preprint, Electrical and Computer Engineering Department, Cornell University, Ithaca, New York, 2002.