A CERTAIN UNCERTAINTY:
NATURE'S RANDOM WAYS
MARK P. SILVERMAN
Trinity College, Connecticut
And Yet It Moves: Strange Systems and Subtle Questions in Physics (Cambridge
University Press, 1993)
More Than One Mystery: Explorations in Quantum Interference (Springer, New
York, 1995)
Waves and Grains: Reflections on Light and Learning (Princeton University
Press, 1998)
Probing the Atom: Interactions of Coupled States, Fast Beams, and Loose Electrons
(Princeton University Press, 2000)
A Universe of Atoms, an Atom in the Universe (Springer, New York, 2002)
Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement
and Interference (Springer, Heidelberg, 2008)
Contents

Preface
Acknowledgments

1  Tools of the trade
   1.1  Probability: The calculus of uncertainty
   1.2  Rules of engagement
   1.3  Probability density function and moments
   1.4  The binomial distribution: bits [Bin(1, p)] and pieces [Bin(n, p)]
   1.5  The Poisson distribution: counting the improbable
   1.6  The multinomial distribution: histograms
   1.7  The Gaussian distribution: measure of normality
   1.8  The exponential distribution: Waiting for Godot
   1.9  Moment-generating function
   1.10 Moment-generating function of a linear combination of variates
   1.11 Binomial moment-generating function
   1.12 Poisson moment-generating function
   1.13 Multinomial moment-generating function
   1.14 Gaussian moment-generating function
   1.15 Central Limit Theorem: why things seem mostly normal
   1.16 Characteristic function
   1.17 The uniform distribution
   1.18 The chi-square (χ²) distribution
   1.19 Student's t distribution
   1.20 Inference and estimation
   1.21 The principle of maximum entropy
   1.22 Shannon entropy function
   1.23 Entropy and prior information
   1.24 Method of maximum likelihood
   1.25 Goodness of fit: maximum likelihood, chi-square, and P-values
   1.26 Order and extremes
   4.5  The split-beam experiment: photon correlations
   4.6  Bits, secrecy, and photons
   4.7  Correlation experiment with down-converted photons
   4.8  Theory of recurrent runs
   4.9  Runs and the single photon: lessons and implications
   Appendices
   4.10 Chemical potential of massless particles
   4.11 Evaluation of Bose-Einstein and Fermi-Dirac integrals
   4.12 Variation in thermal photon energy with photon number (∂E/∂N)|T,V
   4.13 Combinatorial derivation of the Bose-Einstein probability
   4.14 Generating function for probability [Pr(Nn = k)] of k successes in n trials
5  A certain uncertainty
   5.1  Beyond the beginning of knowledge
   5.2  Simple rules: error propagation theory
   5.3  Distributions of products and quotients
   5.4  The uniform distribution: products and ratios
   5.5  The normal distribution: products and ratios
   5.6  Generation of negative moments
   5.7  Gaussian negative moments
   5.8  Quantum test of composite measurement theory
   5.9  Cautionary remarks
   5.10 Diagnostic medical indices: what do they signify?
   5.11 Secular equilibrium
   5.12 Half-life determination by statistical sampling: a mysterious Cauchy distribution
   Appendix
   5.13 The distribution of W = XY/Z
   9.4  Seeking a solution: the construction of models
   9.5  Autoregressive (AR) time series
   9.6  Moving average (MA) time series
   9.7  Combinations: autoregressive moving average time series
   9.8  Phase one: exploration of autoregressive solutions
   9.9  Phase two: adaptive and deterministic oscillations
   9.10 Phase three: exploration of moving average solutions
   9.11 Phase four: judgment – which model is best?
   9.12 Electric shock!
   9.13 Two scenarios: coincidence or conspiracy?
   Appendices
   9.14 Solution of the AR(12)1,12 master equation
   9.15 Maximum likelihood estimate of AR(n) parameters
   9.16 Akaike information criterion and log-likelihood
   9.17 Line of regression to 12-month moving average
Bibliography
Index
How is it possible that mathematics, which is indeed a product of human thought independent
of all experience, accommodates so well the objects of reality?
Here, in my view, is a short answer: In so far as mathematical statements concern reality, they
are not certain, and in so far as they are certain, they do not refer to reality.
Albert Einstein¹

¹ Albert Einstein, from the lecture Geometrie und Erfahrung [Geometry and Experience] given in Berlin on 27 January 1921. (Translation from German by M. P. Silverman.)
Preface
An overview – start here
I have heard it said that a preface is the part of a book that is written last, placed first,
and never read. Still, I will take my chances; this is, after all, a book about probability
and uncertainty. The purpose of this preface is to explain what kind of book this is,
why I wrote it, for whom I wrote it, and what I hope the reader will gain by it.
This book is a technical narrative. It is not a textbook (although you can certainly
use it that way); there are no end-of-chapter questions or tests, and the level of
material does not presuppose the reader to have reached some envisioned state of
preparedness. It is not a monograph; it does not survey an entire field of intellectual
activity, and there is no list of references apart from a few key sources that aided me
in my own work. It is not a popularization; the writing does not sensationalize its
subject matter, and explanations may in part be heuristic or analytical, but (I hope)
never shallow and hand-waving.
A narrative is a story, albeit in this book one that is meant to instruct as well as
amuse. Each chapter, apart from some background material in the beginning, is an
account of a scientific investigation I have undertaken, sometimes because the
questions at issue are of utmost scientific importance, other times on a whim out of
pure curiosity. The various narratives are different, but through each runs a common
thread of probability, uncertainty, randomness, and, often enough, serendipity.
Why, you may be thinking, should my scientific investigations interest you? To this
thought, I can give two answers: one brief, the other longer.
The short answer is that I have written six previous books of the same format
(narrative descriptions of my researches), which have sold well. Many people who
bought (and presumably read) the books found the diversity of subject matter
interesting and the expositions clear and informative, to judge from their unsolicited
correspondence. It seems reasonable to me, therefore, that a Bayesian forecast of a
reader's response to this book would employ a favorably biased prior.
The longer answer concerns how people learn things. The principal objective of
this book, after all, is to share with anyone who reads it part of what I have
learned in some 50 years (and still counting) as an experimental and theoretical
physicist.
In the course of a long and somewhat unusual scientific career, my researches have
taken me into nearly every field of physics. In broad outline, I study the structure of
matter, the behavior of light, and the dynamics of stars and galaxies. My investigations of quantum phenomena have employed electron interferometry, radiofrequency and microwave spectroscopy, laser spectroscopy, magnetic resonance, atomic
beams, and nuclear spectroscopy. I have examined the reflection, refraction, diffraction, polarization and scattering of light as a classical wave, and the absorption,
emission, and correlation of light as a quantum particle (photon). I have reported on
the quantum statistics of neutron fluids and Bose-Einstein condensates in exploded,
collapsed stars, and the classical statistics of fragments of exploded glass in my
laboratory. I have studied the interactions (electromagnetic, nuclear, and gravitational) of real matter on Earth and of dark matter in the cosmos. My interests
embraced projects of high scientific significance (such as tests of quantum electrodynamics, of the theory of nuclear decay, of Newtonian gravity and of general
relativity) and projects to understand the workings of physically simple, yet surprisingly complicated, physics toys (such as a motor comprising only an AA battery, small
cylindrical magnet, and a paper clip; or a passive hollow tube that is fed room
temperature air at the center and emits hot air from one end and cold air from
the other).
The point of the preceding partial enumeration of research interests is simply this:
I was not trained to do all the above and more; I had to teach myself, and the
motivation for learning what I needed to know in each instance derived from the
desire to solve a particular problem that interested me. I did not undertake my
physics self-instruction out of a desire to absorb abstract principles!
A narrative – a story – humanizes the starkness of physical principles and abstraction of mathematical expressions, and thereby helps provide motivation to learn
both. While the personal situations that prompted me to undertake the studies
narrated here are unlikely to pertain to you, the reader, I cannot help but believe
that the issues involved are as relevant to you as they were to me.
Do you travel and fly in an airplane? Then you may want to read my analysis of
the survival of a pilot who fell five miles without a parachute and how, from that,
I developed a protocol for bringing down safely a jumbo jet whose engines all fail.
Do you invest in the stock market to save for retirement? Then you may want to
read my statistical analysis of how common stocks behave and what you can expect
the market to do for you.
Do you take medications of some kind or have an annual physical exam with a
blood test? Then you will be interested in what my statistical analysis reveals about
the reliability of the clinical laboratory reports.
Have you ever served on a jury or a committee or some group required to reach a
collective judgment? Then you will surely be interested in my theoretical analysis
and experimental tests (aided by collaboration with a BBC television show) of the
so-called wisdom-of-crowds phenomenon.
Preface
xv
Do you pay a power company each month for use of electric energy? Are
you confident that the meter readings are accurate and that you are being
charged correctly? Before answering the second question, perhaps you should
read the chapter detailing the statistical analysis of my own electric energy
consumption.
Do you enjoy sports, in particular ball games of one kind or another? Then you
may be intrigued by my analysis of the ways in which a baseball can move if struck
appropriately or, perhaps of more practical consequence, how I inferred that a
certain prominent US ballplayer was probably enhancing his performance with drugs
long before the media became aware of it.
Are you concerned about global climate change? Then my statistical study of the
climate under ground will give you a perspective on what is likely to be the most
serious consequence to occur soonest – a consequence that has rarely been given
public exposure.
And if you are a scientist yourself – especially a physicist – then you may be utterly
astounded, as I was initially, to learn of persistent claims in the peer-reviewed physics
literature of processes that, had they actually occurred, would turn nuclear physics (if
not, in fact, all laws of physics) upside down. You should therefore find particularly
interesting the chapter that describes my experiments and analyses that lay these
extraordinary claims to rest.
The foregoing abbreviated descriptions should not disguise the fact that, as
mentioned at the outset, this book is a technical narrative. The book can be read,
I suppose, simply for the stories, skipping over the lines of mathematics. However, if
your goal is to develop some proficiency in the use of probability and statistical
reasoning, then you will want to follow the analyses carefully. I start the book with
basic principles of probability and show every step to the conclusions reached in the
detailed explanations of the empirical studies. (Some of the detailed calculations are
deferred to appendices.)
A textbook, in which material is laid out in a linear progression of topics, may
teach statistics more efficiently, but this book teaches the application of statistical
reasoning in context, i.e. the use of principles as they are needed to solve specific
problems. This means there will be a certain redundancy, but that is a good thing. In
many years as a teacher, I have found that an important part of retention and
mastery is to encounter the same ideas more than once but in different applications
and at increasing levels of sophistication.
Virtually every standard topic of statistical analysis is encountered in this book, as
well as a number of topics you are unlikely to find in any textbook. Furthermore, the
book is written from the perspective of a practical physicist, not a mathematician
or statistician and, where useful, my viewpoint is offered, schooled by some five
decades of experimentation and analysis, concerning issues over which confusion or
controversy have arisen in the past: for example, issues relating to sample size and
uncertainty, use and significance of chi-square tests and P-values, the class
Acknowledgments
I would like to thank my son Chris for his invaluable help in formatting the text of
many of the figures in the book, for designing the beautiful cover of the book, and for
his advice on the numerous occasions when my computers or software suddenly
refused to co-operate. It is also a pleasure to acknowledge my long-time colleague,
Wayne Strange, whose participation in our collaborative efforts to explore the
behavior of radioactive nuclei was essential to the successful outcome of that work.
I very much appreciate the efforts of Dr. Simon Capelin, Elizabeth Horne,
Samantha Richter, and Elizabeth Davey of Cambridge University Press to find
practical solutions to a number of seemingly insurmountable problems in bringing
this project to fruition. And I am especially grateful to my copy-editor, Beverley
Lawrence, for her thorough reading and perceptive comments and advice.
1
Tools of the trade
Quoted by Mark Kac, Probability in The Mathematical Sciences (MIT Press, Cambridge, 1969) 239.
Let it suffice, therefore, to say that, if you are reading this book, you are already
familiar with the basic idea of probability in at least two contexts.
(a) The first is as the relative frequency of occurrence of an event. Suppose the
sample space, i.e. the list of all possible outcomes of some process, comprises
events A, B, C whose frequencies of occurrence in N = 100 observations are
respectively NA = 20, NB = 50, and NC = 30. (The total number must sum to N.)
Then, assuming a random process generated these events, one can estimate the
probability of event A by the ratio P(A) = NA/N = 1/5, with corresponding
expressions for the other events. We read this as one chance in five or a probability of 20%.
(b) The second is as a statement of the plausibility of occurrence of an event. Thus,
given meteorological data such as the current temperature, humidity, cloud
cover, wind speed and direction, etc., a meteorologist might pronounce a 40%
chance of rain for tomorrow. Tomorrow's weather occurs but once; one cannot
replay it one hundred times and construct a table of outcomes and frequencies.
The probability estimate relies in part on prior knowledge of the occurrences of
similar past weather patterns.
The two senses of probability reflect the two schools of thought, referred to
usually as "frequentist" and "Bayesian". There are subtle issues connected with
both understandings of probability. In the frequentist case (a), for example, a
more complete and accurate definition of probability would have N approach
infinity, which is no problem for a mathematician, but would pose a crushing
burden on an experimental physicist. The Bayesian case (b) avoids resorting to
multiple hypothetical replications of an experiment in order to deduce the
desired probabilities for a particular experiment, but the method seems to entail
a hunch or guess dependent on the analyst's prior knowledge. Since different
analysts may have different states of knowledge, the subjectivity of a Bayesian-derived estimate of probability appears to clash with a general expectation
that probability should be a well-defined mathematical quantity. (One would
hesitate to use calculus if he thought the value of an integral depended on who
calculated it.)
At this point I will simply state that both approaches to the calculation of
probability are employed in the sciences (and elsewhere); both are mathematically justifiable; both often lead to the same or comparable results in straightforward cases. For all the philosophical differences between the two
approaches, it may be argued that the frequentist deduction of probability is
actually a special case of the Bayesian method. Thus, when the two methods
lead to significantly divergent outcomes, the underlying cause (if all calculations
were executed correctly) arises from different underlying assumptions regarding
the process or system under scrutiny. With that conclusion for the moment, let
us move on.
The conditional probability of an event A, given the occurrence of an event B, is defined by

P(A|B) = P(AB)/P(B).   (1.2.1)
From a frequentist point of view, the foregoing expression may be interpreted as the
ratio (theoretically, in the limit of an infinitely large number of trials; practically,
for a reasonably large number of trials) of the number of events in which A and
B occur together to the number of events in which B occurred irrespective of the
occurrence of A.
It is common symbolism to represent the non-occurrence of an event by an overbar; thus Ā represents all outcomes that do not include event A. From the foregoing
considerations, therefore, we can succinctly express two fundamental rules of conditional probability:
inclusivity

P(A|B) + P(Ā|B) = 1,   (1.2.2)

Bayes' theorem

P(B|A) = P(A|B) P(B) / P(A).   (1.2.3)
The first rule (1.2.2) signifies that, after B occurs, A either occurs or it does not; those
are the two mutually exclusive outcomes that exhaust all possibilities. Note that it
is not generally true that P(A|B) + P(A|B̄) = 1. Rather, given P(A|B) and Bayes'
theorem, it is demonstrable that

P(A|B) + P(A|B̄) = [P(A) + P(A|B) − 2P(AB)] / [1 − P(B)],   (1.2.4)

as shown in an appendix.
The second rule (1.2.3), although called Bayes' theorem, is a logical consequence
of the laws of probability accepted by frequentists and Bayesians alike. It is regularly
used in the sciences to relate P(H|D), the probability of a particular hypothesis or
model, given known data, to P(D|H), the more readily calculable probability that a
process of interest produces the known data, given the adoption of a particular
hypothesis. In this way, Bayes' theorem is the basis for scientific inference, used to
test or compare different explanations of some phenomenon.
The parts of Eq. (1.2.3), relabeled as

P(H|D) = P(D|H) P(H) / P(D),   (1.2.5)
are traditionally identified as follows. P(H) is the prior probability; it is what one
believes about hypothesis H before doing an experiment or making observations to
acquire more information. P(D|H) is the likelihood function of the hypothesis H.
P(H|D) is the posterior probability. The flow of terms from right to left is a
mathematical representation of how science progresses. Thus, by doing another
experiment to acquire more data – let us refer to the outcomes of the two experiments
as D1 and D2 – one obtains the chain of inferences

P(H|D2 D1) = P(D2|H, D1) P(H|D1) / P(D2|D1),   (1.2.6)

with the new posterior on the left and the sequential acquisition of information
shown on the right.
As an example, consider the problem of inferring whether a coin is two-headed
(i.e. biased) or fair without being able to examine it, i.e. to decide only by means of
the outcomes of tosses. Before any experiment is done, it is reasonable to assign
a probability of 1/2 to both hypotheses: (a) H0, the coin is fair; (b) H1, the coin is
biased. Thus

ratio of priors:   P(H0)/P(H1) = 1.
Suppose the outcome of the first toss is a head h. Then the posterior relative
probability becomes

first toss:   P(H1|h)/P(H0|h) = [P(h|H1) P(H1)] / [P(h|H0) P(H0)] = (1 × ½)/(½ × ½) = 2.

Let the outcome of the second toss also be h. Assuming the tosses to be independent
of one another, we then have

second toss:   P(H1|h2, h1)/P(H0|h2, h1) = (1 × 1 × ½)/(½ × ½ × ½) = 2² = 4.

It is evident, then, that the ratio of posteriors following n consecutive tosses resulting
in h would be

nth toss:   P(H0|hn ... h1)/P(H1|hn ... h1) = 1/2ⁿ.

Thus, although without direct examination one could not say with 100% certainty
that the coin was biased, it would be a good bet (odds of H0 over H1 of 1:4096) if
12 tosses led to 12 straight heads.
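The posterior odds in the coin example are easy to check directly; a minimal sketch (the function name and structure are mine, not the author's), using exact fractions so no rounding intrudes:

```python
from fractions import Fraction

def posterior_odds_fair_vs_biased(n_heads):
    """Odds of H0 (fair coin) over H1 (two-headed coin) after
    observing n_heads consecutive heads, starting from equal
    priors P(H0) = P(H1) = 1/2."""
    p_data_h0 = Fraction(1, 2) ** n_heads  # fair coin: each head has probability 1/2
    p_data_h1 = Fraction(1, 1)             # two-headed coin: heads are certain
    return (p_data_h0 * Fraction(1, 2)) / (p_data_h1 * Fraction(1, 2))

print(posterior_odds_fair_vs_biased(1))    # 1/2
print(posterior_odds_fair_vs_biased(12))   # 1/4096
```

The equal priors cancel in the ratio, so the odds reduce to the likelihood ratio 1/2ⁿ, as in the text.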
It is important to note, however, that unlikely events can and do occur. No law of
physics prevents a random process from leading to 12 straight heads. Indeed, the
larger the number of trials, the more probable it will be that a succession of heads of
any specified length will eventually turn up. In the nuclear decay experiments we
consider later in the book, the equivalent of 20 h in a row occurred.
The probability of an outcome can be highly counter-intuitive if thought about in the
wrong way. Consider a different application of Bayes theorem. Suppose the probability
of being infected with a particular disease is 5 in 1000 and your diagnostic test comes back
positive. This test is not 100% reliable, however, but let us say that it registers accurately
in 95% of the trials. By that I mean that it registers positive (+) if a person is sick (s) and
negative (−) if a person is not sick (s̄). What is the probability that you are sick?
From the given information and the rules of probability, we have the following
numerical assignments.

Probability of infection: P(s) = 0.005
Probability of no infection: P(s̄) = 0.995
Probability of correct positive: P(+|s) = 0.95
Probability of false negative: P(−|s) = 1 − P(+|s) = 0.05
Probability of correct negative: P(−|s̄) = 0.95
Probability of false positive: P(+|s̄) = 1 − P(−|s̄) = 0.05.

Then from Bayes' theorem it follows that the probability of being sick, given a
positive test, is

P(s|+) = P(+|s) P(s) / [P(+|s) P(s) + P(+|s̄) P(s̄)] = (0.95)(0.005) / [(0.95)(0.005) + (0.05)(0.995)] = 0.087
or 8.7%, which is considerably less worrisome than one might have anticipated on
the basis of the high reliability of the test. Bayes' theorem, however, takes account as
well of the low incidence of infection.
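The same computation is a few lines of script; the sketch below (function and parameter names are mine) reproduces the 8.7% figure:

```python
def posterior_prob_sick(p_sick=0.005, sensitivity=0.95, specificity=0.95):
    """P(sick | positive test) via Bayes' theorem, Eq. (1.2.3)."""
    p_healthy = 1.0 - p_sick
    p_pos_sick = sensitivity              # P(+|s)
    p_pos_healthy = 1.0 - specificity     # P(+|s-bar): false-positive rate
    num = p_pos_sick * p_sick
    return num / (num + p_pos_healthy * p_healthy)

print(round(posterior_prob_sick(), 3))    # 0.087
```

Raising the base rate p_sick toward 0.5 pushes the posterior toward the test's nominal 95% reliability, which makes the role of the prior explicit.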
1.3 Probability density function and moments
In the investigation of stochastic2 (i.e. random) processes, the physical quantity being
measured or counted is often represented mathematically by a random variable.
2 The word "stochastic" derives from a Greek root meaning "to aim at", referring to a guess or conjecture.
The average, i.e. mean value, of some function of the outcomes, f(X), is expressed
symbolically by angular brackets

⟨f(X)⟩ = Σ_{i=1}^{N} f(x_i) p_i.   (1.3.1)

In particular, the nth moment μ_n is the mean of X^n:

μ_n ≡ ⟨X^n⟩ = Σ_{i=1}^{N} x_i^n p_i.   (1.3.2)

The lowest moments define the familiar descriptive statistics:

mean:   X̄ ≡ ⟨X⟩ = Σ_{i=1}^{N} x_i p_i,   (1.3.3)

variance:   var X ≡ σ_X² = ⟨(X − X̄)²⟩ = μ₂ − μ₁²,   (1.3.4)

from which the standard deviation σ_X is calculated. We also have

skewness:   Sk X = ⟨(X − X̄)³⟩/σ_X³ = (μ₃ − 3μ₂μ₁ + 2μ₁³)/σ_X³.   (1.3.5)
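These moment relations can be verified numerically for any discrete distribution; a small sketch (the helper name is mine), checked here against a Bernoulli trial, whose skewness should equal (q − p)/√(pq):

```python
def moments(values, probs):
    """Mean, variance, and skewness of a discrete distribution,
    following the moment relations of Section 1.3."""
    mu1 = sum(x * p for x, p in zip(values, probs))      # first moment
    mu2 = sum(x**2 * p for x, p in zip(values, probs))   # second moment
    mu3 = sum(x**3 * p for x, p in zip(values, probs))   # third moment
    var = mu2 - mu1**2                                   # mu2 - mu1^2
    skew = (mu3 - 3*mu2*mu1 + 2*mu1**3) / var**1.5       # third central moment / sigma^3
    return mu1, var, skew

# Bernoulli trial Bin(1, p) with p = 0.1: mean p, variance pq, skewness (q - p)/sqrt(pq)
m, v, s = moments([0, 1], [0.9, 0.1])
print(m, v, s)    # 0.1, 0.09, ~2.667
```
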
For a continuous random variable X with probability density function (pdf) p(x),
normalization requires

∫ p(x) dx = 1,

and the moments become integrals

m_n = ∫ x^n p(x) dx.   (1.3.7)
The range of integration can always be taken to span the full real axis by requiring,
if necessary, the pdf to vanish for specific segments. Thus, if X is a non-negativevalued random variable, then one defines p(x) 0 for x < 0.
The cumulative distribution function (cdf) F(x) – sometimes referred to simply as
"the distribution" – is the probability Pr(X ≤ x), which, geometrically, is the area under
the plot of the pdf up to the point x:

Pr(X ≤ x) = F(x) = ∫_{−∞}^{x} p(x′) dx′.   (1.3.8)
It then follows from Leibniz's rule for differentiating an integral,

(d/dx) ∫_{a(x)}^{b(x)} F(x, y) dy = F(x, b) db/dx − F(x, a) da/dx + ∫_{a(x)}^{b(x)} [∂F(x, y)/∂x] dy,   (1.3.9)

that differentiation of the cdf yields the pdf: p(x) = dF/dx. This is a practical way to
obtain the pdf, as we shall see later, under circumstances where it is easier to
determine the cdf directly.
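The relation p(x) = dF/dx is easy to illustrate numerically; a sketch (using the exponential distribution of Section 1.8 as the test case, with an arbitrarily chosen rate λ = 2):

```python
from math import exp

# Check p(x) = dF/dx numerically for the exponential distribution,
# where F(x) = 1 - e^(-lam*x) and p(x) = lam*e^(-lam*x).
lam = 2.0
F = lambda x: 1.0 - exp(-lam * x)
p = lambda x: lam * exp(-lam * x)

x, h = 0.7, 1e-6
derivative = (F(x + h) - F(x - h)) / (2 * h)   # central-difference derivative of the cdf
print(abs(derivative - p(x)) < 1e-6)           # True
```
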
1.4 The binomial distribution: bits [Bin(1, p)] and pieces [Bin(n, p)]
The binomial distribution, designated Bin(n, p), is perhaps the most widely encountered discrete distribution in physics, and it plays an important role in the research
described in this book. Consider a binomial random variable X with two outcomes
per trial:

X = 1 (success) with probability p, or 0 (failure) with probability q = 1 − p.   (1.4.1)

The number of distinct ways of getting k successes in n independent trials, which is
represented by the random variable Y = X1 + X2 + ... + Xn, where each subscript
Fig. 1.1 Probability of x successes out of n trials for binomial distribution (solid) Bin(n, p) =
Bin(60, 0.1) and corresponding approximate normal distribution (dotted) N(μ, σ²) = N(6, 5.4).
var Y = npq

Sk Y = (q − p)/√(npq)

K Y = [3(n − 2)pq + 1]/(npq)   (1.4.3)
and others as needed. If the probability of obtaining either outcome is the same,
p = q = ½, the distribution is symmetric and the skewness vanishes. For p < q the
skewness is positive, which means the distribution skews to the right as shown in
Figure 1.1. In the limit of infinitely large n, the kurtosis approaches 3, which is the value
for the standard normal distribution (to be considered shortly). A distribution with high
kurtosis is more sharply peaked than one with low kurtosis; the tails are fatter (in
statistical parlance), signifying a higher probability of occurrence of outlying events.
In calculating statistical moments with the binomial probability function, the trick
to performing the ensuing summations is to transform them into operations on the
binomial expression (p + q)^n, whose numerical value is 1. For illustration, consider
the steps in calculation of the mean

⟨X⟩ = Σ_{x=0}^{n} x C(n, x) p^x q^{n−x} = p (d/dp) Σ_{x=0}^{n} C(n, x) p^x q^{n−x} = p (d/dp) (p + q)^n = np(p + q)^{n−1} → np,

where only in the final step does one actually substitute the value of the sum: p + q = 1.
For higher moments, one applies p(d/dp) the requisite number of times. There is a
more convenient way to achieve the same goal (with additional advantages as well) by
means of a generating function, which will be introduced shortly.
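The closed-form moments can be checked against direct summation over the Bin(n, p) probabilities; a sketch (function name mine) using the parameters of Figure 1.1:

```python
from math import comb, sqrt

def binomial_moments(n, p):
    """Mean, variance, and skewness of Bin(n, p) by direct summation,
    for comparison with the closed forms np, npq, (q - p)/sqrt(npq)."""
    q = 1.0 - p
    pmf = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]
    mean = sum(x * pr for x, pr in enumerate(pmf))
    var = sum((x - mean)**2 * pr for x, pr in enumerate(pmf))
    skew = sum((x - mean)**3 * pr for x, pr in enumerate(pmf)) / var**1.5
    return mean, var, skew

n, p = 60, 0.1                       # the parameters of Figure 1.1
mean, var, skew = binomial_moments(n, p)
print(mean, var, skew)
print(n*p, n*p*(1-p), (1 - 2*p)/sqrt(n*p*(1-p)))   # closed forms: 6, 5.4, ...
```
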
The Poisson probability function

P(x|μ) = (μ^x/x!) e^{−μ},   x = 0, 1, 2, ...,   (1.5.1)

can be obtained directly from P(x|n, p) by appropriately taking limits p → 0 and n → ∞ such that the
mean μ = np remains constant. This is a tedious calculation, and a more efficient way
is again afforded by use of a generating function.
The moments of the Poisson distribution are calculable from relation (1.3.2) with
substitution of probability function (1.5.1). The sums are completed by the same
device employed in the previous section, except that now one operates with μ(d/dμ) on the
expression Σ_{x=0}^{∞} μ^x/x! = e^{μ}. For example, consider the first and second moments

⟨X⟩ = e^{−μ} Σ_{x=0}^{∞} x μ^x/x! = e^{−μ} μ (d/dμ) Σ_{x=0}^{∞} μ^x/x! = e^{−μ} μ (d/dμ) e^{μ} = μ

⟨X²⟩ = e^{−μ} Σ_{x=0}^{∞} x² μ^x/x! = e^{−μ} (μ d/dμ)² e^{μ} = μ² + μ,

from which follow

⟨X⟩ = var X = μ,   (1.5.2)

which is a characteristic feature of the Poisson distribution. By analogous manipulations one obtains the skewness and kurtosis

Sk = μ^{−1/2},   K = 3 + μ^{−1}.   (1.5.3)
Both exceed the corresponding standard normal values, indicating that for small mean μ the Poisson
distribution is more sharply peaked and has fatter tails than a standard normal
distribution. The above two expressions suggest, however, that as the mean gets
larger, the Poisson distribution approaches the shape of the normal distribution.
That this is indeed the case will be shown more rigorously by means of generating
functions.
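A quick numerical check of (1.5.2) and (1.5.3) by truncated summation (the function name and truncation point are mine; the truncation is ample for a modest mean):

```python
from math import exp

def poisson_moments(mu, kmax=100):
    """Mean, variance, skewness, and kurtosis of Poisson(mu),
    by direct summation of the pmf truncated at kmax."""
    pmf = []
    p = exp(-mu)                 # P(0) = e^(-mu)
    for k in range(kmax):
        pmf.append(p)
        p *= mu / (k + 1)        # recursion: P(k+1) = P(k) * mu/(k+1)
    mean = sum(k * pk for k, pk in enumerate(pmf))
    var = sum((k - mean)**2 * pk for k, pk in enumerate(pmf))
    skew = sum((k - mean)**3 * pk for k, pk in enumerate(pmf)) / var**1.5
    kurt = sum((k - mean)**4 * pk for k, pk in enumerate(pmf)) / var**2
    return mean, var, skew, kurt

mean, var, skew, kurt = poisson_moments(4.0)
print(mean, var)     # both ~4: the mean equals the variance
print(skew, kurt)    # ~1/2 and ~3.25, matching Eq. (1.5.3)
```
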
(n; n1, ..., nr) = n! / Π_{i=1}^{r} n_i!,   with   Σ_{i=1}^{r} n_i = n.   (1.6.2)
Table 1.1

y_i   (x1, x2)                                      g(y_i)   P(y_i) = g(y_i)/g
2     (1,1)                                         1        1/36
3     (1,2), (2,1)                                  2        2/36 = 1/18
4     (1,3), (3,1), (2,2)                           3        3/36 = 1/12
5     (1,4), (4,1), (3,2), (2,3)                    4        4/36 = 1/9
6     (1,5), (5,1), (2,4), (4,2), (3,3)             5        5/36
7     (1,6), (6,1), (2,5), (5,2), (3,4), (4,3)      6        6/36 = 1/6
8     (2,6), (6,2), (3,5), (5,3), (4,4)             5        5/36
9     (3,6), (6,3), (4,5), (5,4)                    4        4/36 = 1/9
10    (4,6), (6,4), (5,5)                           3        3/36 = 1/12
11    (5,6), (6,5)                                  2        2/36 = 1/18
12    (6,6)                                         1        1/36

Total                                               36       Σ_i P(y_i) = 1
The multinomial coefficient can be built up as a product of binomial coefficients, each assigning one group of outcomes at a time:

(n1, n2, ..., nr | n) = C(n, n1) C(n − n1, n2) C(n − n1 − n2, n3) ... C(n − n1 − n2 − ... − n_{r−1}, nr).   (1.6.3)

The product of the first two factors reduces to

C(n, n1) C(n − n1, n2) = [n!/(n1!(n − n1)!)] × [(n − n1)!/(n2!(n − n1 − n2)!)] = n!/(n1! n2! (n − n1 − n2)!).   (1.6.4)

This pattern carries through for all subsequent factors, and by induction one obtains

(n1, n2, ..., nr | n) = n!/(n1! n2! ... nr!).   (1.6.5)
As an illustration useful to the discussion of histograms later, consider a game in which
two dice are tossed simultaneously. Each die has six faces with outcomes x_i = i (i =
1, 2, ..., 6). The outcomes of two dice are then y_i = i (i = 2, 3, ..., 12). What is the probability
of each outcome y_i, assuming the dice to be unbiased? Since there are 6 × 6 =
36 possible outcomes, the probability that a toss of two dice yields a particular value
of y is the ratio of the number of ways to achieve y, i.e. the multiplicity g(y), to the
overall multiplicity g: P(y_i) = g(y_i)/g. By direct counting, we obtain Table 1.1.
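Table 1.1 can also be reproduced by brute-force enumeration; a short sketch (variable names mine):

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 6 x 6 = 36 equally likely outcomes of two fair dice
# and tally the multiplicity g(y) of each sum y.
multiplicity = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
total = sum(multiplicity.values())    # overall multiplicity: 36

for y in range(2, 13):
    print(y, multiplicity[y], Fraction(multiplicity[y], total))
```

The multiplicities 1, 2, ..., 6, ..., 2, 1 and probabilities g(y)/36 agree with the direct count in the table.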
If we were to cast the two dice 100 times, what would be the expected outcome in
each category defined by the value yi, and what fluctuations about the expected
values would be considered reasonable? We would therefore want to know the
theoretical means and variances in order to ascertain whether the dice were in fact
unbiased. To determine means, variances and other statistics directly from a
y_i    ⟨n_i⟩     σ(n_i)
2      2.78      1.64
3      5.56      2.29
4      8.33      2.76
5      11.11     3.14
6      13.89     3.46
7      16.67     3.73
8      13.89     3.46
9      11.11     3.14
10     8.33      2.76
11     5.56      2.29
12     2.78      1.64

Total  100.00
engineering, economics, and any other field of study where random phenomena are
involved. The principal underlying reason for this (not always justified in the
application) is the mathematical proposition known as the Central Limit Theorem
(CLT), which shows the normal distribution to be the limiting form of numerous
other probability distributions used to model the behavior of random phenomena.
In particular, the normal distribution is most often employed as the law of
errors i.e. the distribution of fluctuations in some measured quantity about its
mean. It has been written in jest (perhaps) that physicists believe in the law of
errors because they think mathematicians have proved it, and that mathematicians
believe in the law of errors because they think physicists have established it experimentally. There is some truth to the first assertion in that the Gaussian distribution
emerges from a general principle of reasoning (referred to as the principle of
maximum entropy) which addresses the question: Given certain information about
a random process, what probability distribution describes the process in the most
unbiased (i.e. least speculative) way? We will examine this question later. Suffice it
to say at this point that the normal distribution does indeed apply widely, but,
when it does not, one can be led astray with disastrous consequences by drawing
conclusions from it.
The Gaussian distribution of a continuous random variable X whose values span
the real axis takes the form

P(x|μ, σ) = (1/√(2πσ²)) e^{−(x − μ)²/2σ²}.   (1.7.1)
By evaluating the moments of X one can show, after a not insignificant amount of
labor, that the parameters μ and σ² are respectively the mean and variance. From the
symmetry of P(x|μ, σ) about the mean, it follows that the skewness is identically zero.
Evaluation of the fourth moment leads to a kurtosis of 3.
One can transform any Gaussian distribution to standard normal form N(0, 1) by
defining the new dimensionless random variable Z = (X − μ)/σ. The cumulative
distribution function (often represented by Φ) then takes the form

Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−u²/2} du,   (1.7.2)

which, in terms of the error function

erf(z) = (2/√π) ∫_{0}^{z} e^{−u²} du,   (1.7.3)

becomes

Φ(z) = ½[1 + erf(z/√2)].   (1.7.4)
For example, the probability of a score at least one standard deviation above the mean is

Pr(Z ≥ 1) = 1 − Pr(Z < 1) = (1/√(2π)) ∫_{1}^{∞} e^{−u²/2} du = 0.159.
Thus, if test scores were normally distributed, I would expect about 15% of the class
to receive a grade of A. Such an assumption might hold for a class of large enrollment
(perhaps 50 or more), but not for small-enrollment classes. If I graded on a curve in
an advanced physics class of six bright students, there would be one A, two Bs, two
Cs, one D and a great deal of dissatisfaction.
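Relation (1.7.4) makes the grading estimate a one-line computation with Python's built-in error function; a sketch (the helper name phi is mine):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf via the error function:
    Phi(z) = (1/2) * [1 + erf(z / sqrt(2))]."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Probability of scoring more than one standard deviation above the mean
print(round(1.0 - phi(1.0), 3))   # 0.159
```
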
var T = λ^{−2},   Sk T = 2,   K T = 9.   (1.8.2)
The significance of the parameter λ is seen to be the inverse of the mean waiting time,
which is equivalent to a frequency or rate. Though continuous, the exponential
distribution has a direct connection to the discrete Poisson distribution, in which
the same parameter λ represents the intrinsic decay rate of a system. For example, if
the number of occurrences of some phenomenon in a fixed window of observation
time t is described by a Poisson distribution with parameter μ = λt, then the
probability that 0 events will be observed in that time interval is

P_Poi(0|λt) = e^{−λt},   (1.8.3)

and therefore the probability that at least 1 event will be observed in the time interval
is the cumulative probability F_Poi(t) = Pr(X ≤ t) = 1 − e^{−λt}. The derivative of F_Poi(t)
with respect to time,

dF_Poi(t)/dt = P_exp(t|λ) = λ e^{−λt},   (1.8.4)

is then precisely the exponential density of waiting times.
Now let us suppose that T units of time have passed, and we seek the conditional
probability that there is no decay before time t + T given that there was no decay
before time T

\Pr(X > t + T \mid X > T) = \frac{e^{-\lambda(t+T)}}{e^{-\lambda T}} = e^{-\lambda t}.    (1.8.5)
The probability is the same independent of the passage of time following creation of
the particles. Note, in obtaining the preceding result we used the definition of
conditional probability: P(A|B) = P(AB)/P(B). As applied to the case of waiting
times, the numerator P(AB) is the probability that the waiting time is longer than
both t + T and T. But clearly if the first condition is satisfied, then the second must
also be, and so in this case P(AB) = P(A).
The lack of memory displayed by the exponential distribution has a discrete
counterpart in the geometric distribution P_{geo}(k|p) = pq^{k-1}, in which an event occurs
precisely at the kth trial (with probability p) after having failed to occur k - 1 times
(each with probability q = 1 - p). The probability of an eventual occurrence is 100%

\Pr(X \ge 1) = \sum_{k=1}^{\infty} q^{k-1} p = p \sum_{k=0}^{\infty} q^k = \frac{p}{1-q} = \frac{p}{p} = 1,    (1.8.6)
and the mean number of trials to the first occurrence is

\langle k \rangle = \sum_{k=1}^{\infty} k\, q^{k-1} p = p \frac{d}{dq} \sum_{k=0}^{\infty} q^k = p \frac{d}{dq}\left(\frac{1}{1-q}\right) = \frac{p}{(1-q)^2} = \frac{1}{p},    (1.8.7)
where use was made in both calculations of the Taylor series expansion

\frac{1}{1-x} = \sum_{k=0}^{\infty} x^k.    (1.8.8)
There are other continuous distributions that play important roles in the physics
discussed in this book, but we will discuss them as they arise. Let us turn next to the
important topic of generating functions.
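The memoryless property (1.8.5) lends itself to a quick Monte Carlo check. The sketch below (the parameter values are arbitrary choices of mine) draws exponential waiting times and compares the conditional survival probability with the unconditional one:

```python
import random

random.seed(1)
lam = 3.0        # decay rate lambda (arbitrary illustrative value)
T, t = 0.4, 0.2  # elapsed time T and additional wait t

samples = [random.expovariate(lam) for _ in range(200_000)]
survived_T = [x for x in samples if x > T]

# Pr(X > t + T | X > T), estimated from survivors of the first T units of time
p_cond = sum(x > t + T for x in survived_T) / len(survived_T)
# unconditional Pr(X > t), estimated from all samples
p_uncond = sum(x > t for x in samples) / len(samples)
```

Both estimates hover near e^{-\lambda t} \approx 0.549 irrespective of T, as (1.8.5) requires.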
1.9 Moment-generating function

The moment-generating function (mgf) of a random variable X is the expectation

g_X(t) \equiv \left\langle e^{Xt} \right\rangle =
\begin{cases}
\displaystyle\sum_{x=a}^{b} e^{xt}\, p(x) & \text{discrete } X \\[2ex]
\displaystyle\int e^{xt}\, p(x)\, dx & \text{continuous } X.
\end{cases}    (1.9.1)
Expansion of the exponential in a Taylor series shows that the mgf is a power series
in t whose coefficients are the moments of X

g_X(t) = \left\langle e^{Xt} \right\rangle = \left\langle \sum_{n=0}^{\infty} \frac{(xt)^n}{n!} \right\rangle = \sum_{n=0}^{\infty} \frac{\langle x^n \rangle}{n!}\, t^n.    (1.9.2)

The nth moment is then obtained by taking the nth derivative of g_X(t) with respect
to t and setting t = 0

\left\langle X^n \right\rangle = \left.\frac{d^n g_X(t)}{dt^n}\right|_{t=0}.    (1.9.3)

Note that the zeroth moment is just the completeness relation: g_X(0) = 1.
In statistical analysis, it is often the case that the moments about the mean are the
quantities of interest. Moreover, in my experience, one rarely needs to go beyond the
third or fourth moment. In such circumstances the natural log of the generating
function is useful to work with because it follows from sequential differentiation that

\left.\frac{d \ln g_X(t)}{dt}\right|_{t=0} = \langle X \rangle = \mu_X

\left.\frac{d^2 \ln g_X(t)}{dt^2}\right|_{t=0} = \left\langle (X - \mu_X)^2 \right\rangle = \sigma_X^2    (1.9.4)

\left.\frac{d^3 \ln g_X(t)}{dt^3}\right|_{t=0} = \left\langle (X - \mu_X)^3 \right\rangle = \sigma_X^3\, Sk_X.
Regrettably, the progression does not extend to the fourth moment or beyond.
Nevertheless, the expansion of ln g(t) yields useful quantities referred to as the
cumulants of a distribution. We shall not need them in this book.
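Relations (1.9.4) are easy to verify numerically for a specific distribution. The sketch below applies central finite differences to ln g(t) for a Poisson variate, whose mgf (derived in Section 1.12) is g(t) = e^{\mu(e^t - 1)}; all three derivatives should return \mu, since the mean, variance, and third central moment of a Poisson distribution are all equal to \mu:

```python
import math

mu = 4.0  # Poisson parameter (illustrative value of mine)

def ln_g(t):
    # natural log of the Poisson mgf: ln g(t) = mu (e^t - 1)
    return mu * (math.exp(t) - 1.0)

h = 1e-3
# central finite differences for the first three derivatives at t = 0
d1 = (ln_g(h) - ln_g(-h)) / (2 * h)                                   # mean
d2 = (ln_g(h) - 2 * ln_g(0.0) + ln_g(-h)) / h**2                      # variance
d3 = (ln_g(2*h) - 2*ln_g(h) + 2*ln_g(-h) - ln_g(-2*h)) / (2 * h**3)   # sigma^3 Sk
```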
1.10 Moment-generating function of a linear combination of variates

Consider next a linear combination S_n = \sum_{i=1}^{n} a_i X_i of n independent random
variables. Its mgf is

g_{S_n}(t) = \left\langle e^{S_n t} \right\rangle = \left\langle e^{\left(\sum_{i=1}^{n} a_i X_i\right) t} \right\rangle = \left\langle \prod_{i=1}^{n} e^{a_i t X_i} \right\rangle = \prod_{i=1}^{n} g_{X_i}(a_i t) \xrightarrow{\ \text{iid}\ } \left[g_X(at)\right]^n,    (1.10.1)

where the third equality is permitted because the random variables are independent.
Recall: if A and B are independent, then \langle AB \rangle = \langle A \rangle\, \langle B \rangle. The arrow above shows the
reduction of g_{S_n}(t) in the case of independent identically distributed (iid) random
variables all combined with the same coefficient a.
Two widely occurring special cases are those involving the sum (a_1 = a_2 = 1) or
difference (a_1 = 1, a_2 = -1) of two iid random variables, for which (1.10.1) yields

g_{X_1 + X_2}(t) = \left[g_X(t)\right]^2 \qquad g_{X_1 - X_2}(t) = g_X(t)\, g_X(-t).    (1.10.2)
Another useful set of relations comes from evaluating the mean and variance of the
general linear superposition S_n by differentiating \ln g_{S_n}(t) = \sum_{i=1}^{n} \ln g_{X_i}(a_i t):

\left.\frac{d \ln g_{S_n}(t)}{dt}\right|_{t=0} = \sum_{i=1}^{n} \left.\frac{a_i\, g'_{X_i}(a_i t)}{g_{X_i}(a_i t)}\right|_{t=0} \;\Rightarrow\; \mu_{S_n} = \sum_{i=1}^{n} a_i \mu_i

\left.\frac{d^2 \ln g_{S_n}(t)}{dt^2}\right|_{t=0} = \sum_{i=1}^{n} a_i^2 \left.\left[\frac{g_{X_i}(a_i t)\, g''_{X_i}(a_i t) - \left(g'_{X_i}(a_i t)\right)^2}{\left(g_{X_i}(a_i t)\right)^2}\right]\right|_{t=0} \;\Rightarrow\; \sigma_{S_n}^2 = \sum_{i=1}^{n} a_i^2\, \sigma_{X_i}^2.    (1.10.3)
Another special case of particular utility is the equivalence relation for a normal
variate X

N(\mu, \sigma^2) = \mu + \sigma\, N(0, 1),    (1.10.4)

which will be demonstrated later in the chapter.
A situation may arise (I have encountered it often) in which the mgf of some
random variable X is a fairly complicated function of its argument and therefore does
not correspond to any of the tabulated forms of known distributions. A useful
procedure in that case may be to expand the mgf in a Taylor series to obtain an
expression of the form

g(t) = e^{\sum_{n \ge 0} a_n t^n}.    (1.10.5)
Sequential differentiation of (1.10.5), with the derivatives evaluated at t = 0,

g^{(1)}(0) = a_1
g^{(2)}(0) = 2a_2 + a_1^2
g^{(3)}(0) = 6a_3 + 6a_1 a_2 + a_1^3
g^{(4)}(0) = 24a_4 + 24a_1 a_3 + 12a_2^2 + 12a_2 a_1^2 + a_1^4
g^{(5)}(0) = 120a_5 + 120a_1 a_4 + 120a_2 a_3 + 60a_1^2 a_3 + 60a_1 a_2^2 + 20a_2 a_1^3 + a_1^5    (1.10.6)

reveals a pattern that suggests a systematic way of calculating the moments of the
distribution (and subsequently an approximation to the pdf if so desired). The form
of the nth derivative is n! times the sum over all partitions of the integer n, weighted by
a divisor k! for each term in the partition that occurs k times. A partition of a positive
integer n is a set of positive integers that sum to n. We can represent a particular
partition n = \sum_{j=1}^{n} j\,\nu_j by the notation \{1^{\nu_1}\, 2^{\nu_2}\, 3^{\nu_3} \cdots n^{\nu_n}\}.
Consider, for example, n = 3. There are three ways to satisfy the integer relation
k + 2l + 3m = 3, namely

(k, l, m) = (0, 0, 1),\ (1, 1, 0),\ (3, 0, 0) \;\Rightarrow\; \{3\},\ \{2, 1\},\ \{1^3\},

which leads to the weighted sum 3!\left(a_3 + a_2 a_1 + \frac{a_1^3}{3!}\right) for the entry g^{(3)}(0) in (1.10.6).
There is a graphical technique to construct the partitions of an integer relatively
quickly by means of diagrams known as Young tableaux. Each term in a partition
is represented by a horizontal row of square boxes of length equal to the term; the
boxes are stacked vertically, starting with the longest row. Thus, considering again
the three partitions of n = 3, we have the three diagrams labeled (3), (2, 1), and (1^3).
The preceding ideas were drawn from the theory of symmetric groups,3 which tells
us that the total number r(n) of partitions of an integer n is the coefficient of x^n in the
power series expansion of Euler's generating function

E(x) = \prod_{j=1}^{\infty} \frac{1}{1 - x^j} = 1 + x + 2x^2 + 3x^3 + 5x^4 + 7x^5 + 11x^6 + \cdots.    (1.10.7)

Examination of the first few terms verifies what could be easily determined by
drawing the Young tableaux. Should one need to know r(n) for large n, there is
3 J. S. Lomont, Applications of Finite Groups (Academic Press, New York, 1959) 258-261.
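The partition counts r(n) and the coefficients of Euler's generating function (1.10.7) can both be computed directly; the sketch below (the function names are mine) confirms that they agree:

```python
def partitions(n, max_part=None):
    """Yield every partition of n as a non-increasing tuple of parts."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for j in range(min(n, max_part), 0, -1):
        for rest in partitions(n - j, j):
            yield (j,) + rest

def euler_coeffs(N):
    """First N+1 series coefficients of E(x) = prod_j 1/(1 - x^j)."""
    c = [1] + [0] * N
    for j in range(1, N + 1):      # multiply the series by 1/(1 - x^j)
        for k in range(j, N + 1):
            c[k] += c[k - j]
    return c

r = [len(list(partitions(n))) for n in range(7)]
print(r)                # [1, 1, 2, 3, 5, 7, 11]
print(euler_coeffs(6))  # the same sequence, from the generating function
```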
1.11 Binomial moment-generating function

Consider a Bernoulli trial (for example, the toss of a biased coin) in which the
outcome X = 1 occurs with probability p and X = 0 with probability q = 1 - p. The
mgf of X is then

g_X(t) = \left\langle e^{Xt} \right\rangle = p\, e^t + q\, e^0 = p\, e^t + q.    (1.11.1)

If the coin is tossed n times, or n coins are tossed independently and simultaneously
once, the outcome is describable by a random variable Y = \sum_{i=1}^{n} X_i whose mgf follows
from (1.10.1)

g_Y(t) = \left(p\, e^t + q\right)^n.    (1.11.2)
Differentiation in accordance with (1.9.4) yields the familiar binomial mean and variance

\langle Y \rangle = \left.\frac{d \ln g_Y(t)}{dt}\right|_{t=0} = \left.\frac{n p\, e^t}{p\, e^t + q}\right|_{t=0} = np
    (1.11.3)
\sigma_Y^2 = \left.\frac{d^2 \ln g_Y(t)}{dt^2}\right|_{t=0} = \left[\frac{n p\, e^t}{p\, e^t + q} - \frac{n \left(p\, e^t\right)^2}{\left(p\, e^t + q\right)^2}\right]_{t=0} = npq.
After the third or fourth derivative, the procedure becomes tedious to do by hand,
but symbolic mathematical software (like Maple or Mathematica) can generate higher
moments nearly instantly.
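Hand differentiation can also be sidestepped numerically. Assuming illustrative values n = 10 and p = 0.3 of my choosing, finite differences of ln g_Y(t) from Eq. (1.11.2) reproduce np and npq:

```python
import math

n, p = 10, 0.3
q = 1.0 - p

def ln_g(t):
    # log of the binomial mgf, Eq. (1.11.2): g(t) = (p e^t + q)^n
    return n * math.log(p * math.exp(t) + q)

h = 1e-4
mean = (ln_g(h) - ln_g(-h)) / (2 * h)               # expect np  = 3.0
var = (ln_g(h) - 2 * ln_g(0.0) + ln_g(-h)) / h**2   # expect npq = 2.1
```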
Although we arrived at the binomial mgf by starting with probabilities p and q of
the Bernoulli random variable X and then calculating the generating function for the
composite random variable Y, we could equally well have begun with the binomial
probability function (1.4.2) and calculated the expectation value directly:

g_Y(t) = \left\langle e^{Yt} \right\rangle = \sum_{y=0}^{n} e^{yt} \binom{n}{y} p^y q^{n-y} = \sum_{y=0}^{n} \binom{n}{y} \left(p\, e^t\right)^y q^{n-y} = \left(p\, e^t + q\right)^n.    (1.11.4)
If, however, we already have the mgf from the procedure leading to (1.11.2), but do
not know the binomial probability function, we can derive it from the mgf by a
method to be demonstrated shortly.
A point worth noting about the procedure leading to Eq. (1.11.2) is that the sum
of the elemental Bernoulli random variables (the Xs) produces a random variable
Y which is also governed by a binomial distribution, or symbolically:

\underbrace{\mathrm{Bin}(1, p) + \cdots + \mathrm{Bin}(1, p)}_{n\ \text{terms}} = \mathrm{Bin}(n, p).

From the mathematical form of the binomial
mgf, one can see generally that the addition of independent random variables of type
Bin(n, p) and Bin(m, p) generates a random variable of type Bin(n + m, p). There are
relatively few distributions that have the property that a sum of two random
variables of a particular kind produces a random variable of the same kind. Moreover, as is easily demonstrated, this property does not hold for the difference of two
binomial random variables. If Y = X_1 - X_2, where the two variates are independent
and of type Bin(n, p), then
g_Y(t) = \left(p\, e^t + q\right)^n \left(p\, e^{-t} + q\right)^n = \left[2pq(\cosh t - 1) + 1\right]^n,    (1.11.5)

in which the second equality was obtained after some algebraic manipulation
employing the identity p + q = 1. The resulting mgf differs from that of a binomial
random variable and, in fact, does not correspond to any of the standard types
ordinarily tabulated in statistics books. Nevertheless, knowing the mgf, one can
calculate from it all the moments of the difference of two independent binomial
random variables of like kind. Although knowledge of the mgf affords a means to
determine the probability function (we shall examine shortly how to do this in
the present case), it is better to proceed differently. We seek the probability \Pr(X_1 -
X_2 = z) that the difference is equal to some fixed value -n \le z \le n. This can be
expressed by the suite of probability statements
\Pr(X_1 - X_2 = z) = \sum_{x_2=0}^{n} \Pr(X_1 = z + x_2)\, \Pr(X_2 = x_2)    (1.11.6)

= \sum_{x_2=0}^{n} \binom{n}{z + x_2} p^{z + x_2} q^{n - z - x_2} \binom{n}{x_2} p^{x_2} q^{n - x_2}

= p^z q^{2n - z} \sum_{y=0}^{n-z} \binom{n}{y + z} \binom{n}{y} \left(\frac{p}{q}\right)^{2y}.    (1.11.7)
Note that the upper limit to the sum over the dummy index y must be n - z, since the
first binomial coefficient vanishes when its lower index exceeds the upper index. The expression
in (1.11.7) can be reduced to closed form in terms of a hypergeometric function {}_2F_1
\Pr(X_1 - X_2 = z) = \binom{n}{z}\, p^z (1-p)^{2n-z}\; {}_2F_1\!\left(-n,\ z - n;\ z + 1;\ \left(\frac{p}{1-p}\right)^2\right).    (1.11.8)
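Since the hypergeometric form (1.11.8) is a closed expression for the finite sum in (1.11.7), the sum itself can be checked against a brute-force convolution of the two binomial probability functions. A sketch, with illustrative parameters of my choosing:

```python
from math import comb

n, p = 6, 0.35
q = 1.0 - p

def pmf_bin(k):
    # binomial probability function Bin(n, p)
    return comb(n, k) * p**k * q**(n - k)

def pr_diff_direct(z):
    # brute-force double sum over the joint pmf of X1 and X2
    return sum(pmf_bin(x1) * pmf_bin(x2)
               for x1 in range(n + 1) for x2 in range(n + 1)
               if x1 - x2 == z)

def pr_diff_formula(z):
    # finite-sum form of Eq. (1.11.7); valid as written for z >= 0
    return p**z * q**(2*n - z) * sum(
        comb(n, y + z) * comb(n, y) * (p / q)**(2 * y)
        for y in range(n - z + 1))
```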
1.12 Poisson moment-generating function

The mgf of a Poisson random variable X of mean \mu follows directly from the Poisson
probability function

g_X(t) = \left\langle e^{Xt} \right\rangle = e^{-\mu} \sum_{x=0}^{\infty} e^{xt}\, \frac{\mu^x}{x!} = e^{-\mu} \sum_{x=0}^{\infty} \frac{\left(\mu\, e^t\right)^x}{x!} = e^{\mu(e^t - 1)},    (1.12.1)

and leads to
\mu = \left.\frac{d \ln g_X(t)}{dt}\right|_{t=0} = \left.\frac{d^2 \ln g_X(t)}{dt^2}\right|_{t=0},

which confirms the equality of \langle X \rangle and var(X). Moreover, if X_1 and X_2 are independent Poisson random variables of respective means \mu_1 and \mu_2, then the mgf of their
sum Y = X_1 + X_2
g_Y(t) = g_{X_1}(t)\, g_{X_2}(t) = e^{(\mu_1 + \mu_2)(e^t - 1)}    (1.12.2)

identifies Y as a Poisson variate of mean \mu_1 + \mu_2. The same conclusion follows from
direct calculation of the probability function of Y:

\Pr(Y = y) = \sum_{x_1=0}^{y} \Pr(Y = y \mid X_1 = x_1)\, \Pr(X_1 = x_1) = \sum_{x_1=0}^{y} \Pr(X_1 = x_1)\, \Pr(X_2 = y - x_1)

= \sum_{x_1=0}^{y} \left(e^{-\mu_1} \frac{\mu_1^{x_1}}{x_1!}\right) \left(e^{-\mu_2} \frac{\mu_2^{y - x_1}}{(y - x_1)!}\right) = \frac{e^{-(\mu_1 + \mu_2)}}{y!} \sum_{x=0}^{y} \frac{y!}{x!\,(y - x)!}\, \mu_1^{x}\, \mu_2^{y - x}

= \frac{e^{-(\mu_1 + \mu_2)}}{y!} \left(\mu_1 + \mu_2\right)^{y}.
2 Hypergeometric functions occur in the solution of second-order differential equations that describe a variety of physical
systems. One of the most important examples is the radial part of the wave function of the electron in a hydrogen atom
(i.e. the Coulomb problem).
The first step is in effect a statement of the sought-for probability by means of Bayes'
theorem. The transition from the first to the second is permitted because the Poisson
variates X_1 and X_2 are assumed independent. In the third step the explicit form of the
Poisson probability function is employed. In the fourth step the expression is
rearranged so as to take the form of a binomial expansion, which, when summed,
yields in the fifth step the Poisson probability function with parameter \mu_Y = \mu_1 + \mu_2.
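The five-step chain above can be verified numerically in a few lines; the sketch below convolves two Poisson probability functions (with illustrative means of my choosing) and compares the result with a single Poisson probability function of mean \mu_1 + \mu_2:

```python
import math

def pois(k, mu):
    # Poisson probability function
    return math.exp(-mu) * mu**k / math.factorial(k)

mu1, mu2 = 2.5, 4.0   # illustrative means
conv = [sum(pois(x, mu1) * pois(y - x, mu2) for x in range(y + 1))
        for y in range(15)]
direct = [pois(y, mu1 + mu2) for y in range(15)]
```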
The difference of two independent Poisson random variables, however, is not
governed by a Poisson distribution. One could have foreseen this without performing
any calculation because a Poisson variate must be non-negative, yet the difference of
two such variates can be negative. (An identical argument applies to the difference of
two binomial random variables.) Such a difference is encountered fairly often in
experimental atomic, nuclear, and particle physics, as well as in other disciplines,
whenever it is necessary to subtract a random background noise from a signal of
interest. The mgf of the difference Y = X_1 - X_2 takes the form

g_Y(t) = e^{-(\mu_1 + \mu_2)}\, e^{\mu_1 e^t + \mu_2 e^{-t}},    (1.12.3)

which identifies a Skellam distribution,5 whose name and probability function are
not widely known in physics. Nevertheless, from the mgf one can quickly obtain the
mean, variance, or any other desired statistic by differentiation:
\langle Y \rangle = \langle X_1 - X_2 \rangle = \mu_1 - \mu_2    (1.12.4)

\mathrm{var}(Y) = \sigma_{X_1}^2 + \sigma_{X_2}^2 = \mu_1 + \mu_2.    (1.12.5)

The probability function of the Skellam distribution can be obtained from the
generating function

e^{(z/2)\left(t + t^{-1}\right)} = \sum_{n=-\infty}^{\infty} I_n(z)\, t^n    (1.12.6)

of modified Bessel functions I_n(z) = i^{-n} J_n(iz), in which J_n(z) is the more familiar
Bessel function of the first kind. The probability
5 J. G. Skellam, "The frequency distribution of the difference between two Poisson variates belonging to different
populations," Journal of the Royal Statistical Society: Series A, 109 (1946) 296.
\Pr(X_1 - X_2 = y) = e^{-(\mu_1 + \mu_2)} \left(\frac{\mu_1}{\mu_2}\right)^{y/2} I_y\!\left(2\sqrt{\mu_1 \mu_2}\right)    (1.12.7)

then follows by matching powers of t in the expansion. The Bessel functions of
integer order have the series representations

J_n(x) = \sum_{m=0}^{\infty} \frac{(-1)^m}{m!\, \Gamma(m+n+1)} \left(\frac{x}{2}\right)^{2m+n}    (1.12.9)

I_n(x) = \sum_{m=0}^{\infty} \frac{1}{m!\, \Gamma(m+n+1)} \left(\frac{x}{2}\right)^{2m+n} \qquad \Gamma(m+n+1) \to (m+n)!\,.    (1.12.10)

The factorial value shown by the arrow in the case of an integer argument derives
from the property of the gamma function

\Gamma(x + 1) = x\, \Gamma(x),    (1.12.11)

together with \Gamma(1) = 1.
1.13 Multinomial moment-generating function

Consider n independent trials, each of which can result in one of r mutually exclusive
classes with probabilities (p_1 \ldots p_r). The probability of obtaining the set of class
frequencies (n_1 \ldots n_r) is given by the multinomial probability function

\Pr(n_1 \ldots n_r) = \frac{n!}{n_1! \cdots n_r!}\, p_1^{n_1} \cdots p_r^{n_r}.    (1.13.1)
The moment generating function, in which t now stands for the set of r dummy
variables (t_1 \ldots t_r), is the expectation

g(\mathbf{t}) = \left\langle e^{\mathbf{N} \cdot \mathbf{t}} \right\rangle = \sum_{\{n\}} \frac{n!}{n_1! \cdots n_r!}\, p_1^{n_1} \cdots p_r^{n_r}\, e^{n_1 t_1} \cdots e^{n_r t_r} \qquad \text{subject to } \sum_{i=1}^{r} n_i = n    (1.13.2)

= \left(p_1 e^{t_1} + \cdots + p_r e^{t_r}\right)^n.    (1.13.3)

The set of probabilities \{p_i\} are not all independent because p_r = 1 - \sum_{i=1}^{r-1} p_i. The factor
e^{t_r}, the equivalent of which is absent in the generating function of a binomial distribution, was included for symmetry to permit all classes to be handled equivalently.
In most instances it is considerably simpler to work with the generating function
than to carry out complex summations with the multinomial probability function.
For example, by differentiating Eq. (1.13.3) we immediately obtain the means,
variances, and covariances of the random variables \{N_i\} representing the frequencies
of each class:

\left.\frac{\partial g}{\partial t_i}\right|_{t=0} = \langle N_i \rangle = n p_i

\left.\frac{\partial^2 g}{\partial t_i^2}\right|_{t=0} = \left\langle N_i^2 \right\rangle = n p_i + n(n-1) p_i^2 \;\Rightarrow\; \mathrm{var}(N_i) = \sigma_i^2 = \left\langle N_i^2 \right\rangle - \langle N_i \rangle^2 = n p_i (1 - p_i)    (1.13.4)

\left.\frac{\partial^2 g}{\partial t_i\, \partial t_j}\right|_{t=0} = \left\langle N_i N_j \right\rangle = n(n-1) p_i p_j \;\Rightarrow\; \mathrm{cov}(N_i, N_j) = \left\langle N_i N_j \right\rangle - \langle N_i \rangle \langle N_j \rangle = -n p_i p_j.
A dimensionless measure of the degree of correlation between outcomes in two
classes is provided by the correlation coefficient

\rho_{ij} \equiv \frac{\mathrm{cov}(N_i, N_j)}{\sigma_i\, \sigma_j} = -\sqrt{\frac{p_i\, p_j}{(1 - p_i)(1 - p_j)}}.    (1.13.5)
As noted before, the negative sign in the covariance or correlation coefficient signifies
that on average the change in one frequency results in an opposite change in another
frequency because of the constraint on the sum of all frequencies. The binomial
distribution, where p_2 = 1 - p_1, provides an illustrative special case; Eq. (1.13.5) leads
to \rho_{12} = -1, i.e. 100% anti-correlation, as would be expected.
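For a small number of trials, the formulas in (1.13.4) can be confirmed exactly by enumerating every outcome of a multinomial distribution. A sketch with illustrative parameters n = 4 and r = 3 of my choosing:

```python
from math import factorial
from itertools import product

n = 4
probs = [0.2, 0.3, 0.5]   # hypothetical class probabilities

def pmf(counts):
    # multinomial probability function, Eq. (1.13.1)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    pr = float(coef)
    for c, pk in zip(counts, probs):
        pr *= pk**c
    return pr

# every frequency vector (n1, n2, n3) with n1 + n2 + n3 = n
outcomes = [c for c in product(range(n + 1), repeat=3) if sum(c) == n]

def moment(f):
    return sum(pmf(c) * f(c) for c in outcomes)

mean1 = moment(lambda c: c[0])                                        # n p1 = 0.8
var1 = moment(lambda c: c[0]**2) - mean1**2                           # n p1 (1 - p1) = 0.64
cov12 = moment(lambda c: c[0] * c[1]) - mean1 * moment(lambda c: c[1])  # -n p1 p2 = -0.24
```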
A multinomial distribution can arise sometimes in unexpected ways. Consider the
following situation, which will be of interest to us later when we examine means of
judging the credibility of models (also referred to as hypothesis testing) with particular focus on examining the properties of radioactive decay. Suppose a random
process has generated K independent Poisson variates \{N_k = \mathrm{Poi}(\mu_k),\ k = 1 \ldots K\}. The
probability of getting the sequence of outcomes \{n_1, n_2, \ldots, n_K\} is then

\Pr(\{n_k\} \mid \{\mu_k\}) = \prod_{k=1}^{K} e^{-\mu_k}\, \frac{\mu_k^{n_k}}{n_k!}.    (1.13.6)

If, however, the total number of events is constrained to a fixed value \sum_{k=1}^{K} n_k = n,
then the conditional probability of the sequence becomes (with \mu \equiv \sum_{k=1}^{K} \mu_k)

\Pr\left(\{n_k\} \,\Big|\, \sum_{k=1}^{K} n_k = n\right) = \frac{\prod_{k=1}^{K} e^{-\mu_k}\, \mu_k^{n_k} / n_k!}{e^{-\mu}\, \mu^{n} / n!} = n! \prod_{k=1}^{K} \frac{\left(\mu_k / \mu\right)^{n_k}}{n_k!},    (1.13.7)

which is seen to be a multinomial probability function with parameters p_k = \mu_k / \mu. The
substitution of the Poisson probability function for \Pr\left(\sum_{k=1}^{K} n_k = n\right) is justified
because the sum of K independent Poisson variates is itself a Poisson random variable.
1.14 Gaussian moment-generating function
The moment generating function of the normal or Gaussian distribution is of
particular significance in the statistical analysis of physical processes. Besides generating the moments of the distribution, it provides a reliable means of ascertaining
how well an unknown probability distribution may be approximated by a normal
one. Designate, as before, X to be a Gaussian random variable with mean \mu and
variance \sigma^2. Calculation of the mgf then leads to the integral

g(t) = \left\langle e^{Xt} \right\rangle = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} e^{xt}\, e^{-(x-\mu)^2/2\sigma^2}\, dx,    (1.14.1)

which, upon completion of the square in the exponent (with z = x - \mu - \sigma^2 t), reduces to
a Gaussian integral satisfying the completeness relation

\frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{\infty} e^{-z^2/2\sigma^2}\, dz = 1    (1.14.2)
and the result

g(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}.    (1.14.3)

Alternatively, one can use the equivalence relation (1.10.4), X = \mu + \sigma Y with Y = N(0, 1), whereby

g_X(t) = \left\langle e^{(\mu + \sigma Y)t} \right\rangle = g_\mu(t)\, g_{\sigma Y}(t).    (1.14.4)

In going from the first equality to the second, note that the mgf of a constant is simply

g_a(t) = \left\langle e^{at} \right\rangle = e^{at},    (1.14.5)

and the mgf of a constant times a random variable Y takes the form

g_{bY}(t) = \left\langle e^{bYt} \right\rangle = \left\langle e^{Y(bt)} \right\rangle = g_Y(bt).    (1.14.6)

However, for Y = N(0,1), the mgf (1.14.3) applied to relation (1.14.6) yields
g_Y(bt) = e^{\frac{1}{2} b^2 t^2}. Thus, the product of the factors in (1.14.4) leads to

g_X(t) = e^{at}\, e^{\frac{1}{2} b^2 t^2} = e^{at + \frac{1}{2} b^2 t^2}    (1.14.7)

with a = \mu and b = \sigma, in agreement with (1.14.3). As an application, consider the
limiting form of the binomial mgf obtained by expanding the logarithm of (1.11.2)
in powers of t

\ln g_{Bin}(t) = n \ln\left[1 + p\left(e^t - 1\right)\right] = n\left[pt + \tfrac{1}{2} pq\, t^2 + \cdots\right],    (1.14.8)
taking care to include all contributions of the same order in t. For vanishing p, but
np \gg 1, we truncate the expansion after the quadratic term to obtain the limiting form

g_{Bin}(t) \to e^{npt + \frac{1}{2} npq\, t^2} = g_{Gaus}(t),    (1.14.9)

which is the mgf of a Gaussian distribution of mean np and variance npq.
1.15 Central Limit Theorem: why things seem mostly normal

Consider n independent, identically distributed measurements \{X_i,\ i = 1 \ldots n\}, each with mgf g_X(t). From Eq. (1.10.1), the mgf of
the mean \overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i takes the form g_{\overline{X}}(t) = \left[g_X(t/n)\right]^n, the natural log of which can be expressed in terms
of the moments of X by expanding g_X(t) in a Taylor series about t = 0

\ln g_{\overline{X}}(t) = n \ln g_X\!\left(\frac{t}{n}\right) = n \ln\left[1 + \sum_{k=1}^{\infty} \mu_k\, \frac{(t/n)^k}{k!}\right] \equiv n \ln\left[1 + \varepsilon(t)\right].    (1.15.1)
Here \mu_k = \left. d^k g_X(t)/dt^k \right|_{t=0} is the kth moment of X, and the term \varepsilon(t) is to be regarded as a
small quantity since t will eventually be set to 0. A Taylor series expansion of the
logarithm

n \ln\left[1 + \varepsilon(t)\right] = n\left[\varepsilon(t) - \tfrac{1}{2}\varepsilon(t)^2 + \tfrac{1}{3}\varepsilon(t)^3 - \cdots\right]    (1.15.2)

then leads, upon collection of like powers of t, to

\ln g_{\overline{X}}(t) = \mu_1 t + \left(\mu_2 - \mu_1^2\right)\frac{t^2}{2n} + \left(\mu_3 - 3\mu_2\mu_1 + 2\mu_1^3\right)\frac{t^3}{6n^2} + \cdots    (1.15.3)

= \langle X \rangle\, t + \sigma_X^2\, \frac{t^2}{2n} + \left\langle \left(X - \langle X \rangle\right)^3 \right\rangle \frac{t^3}{6n^2} + \cdots.    (1.15.4)

To order t^2, therefore, the mgf of \overline{X} is that of a Gaussian random variable of mean
\langle X \rangle and variance \sigma_X^2 / n.
If the condition that the variables \{X_i\} be identically distributed is relaxed, then the
foregoing analysis carries through in the same way, albeit with some extra summations, leading to a Gaussian distribution N\!\left(\overline{\mu}_X, \overline{\sigma}_X^2\right) with parameters

\overline{\mu}_X = \frac{1}{n} \sum_{i=1}^{n} \mu_{X_i} \qquad \overline{\sigma}_X^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sigma_{X_i}^2.    (1.15.5)
It is worth noting explicitly that the only requirement on the distributions of the
original variables \{X_i\} is the existence of first and second moments. This modest
requirement is usually met by the distributions one is likely to encounter in physics,
although the Cauchy distribution, which appears in spectroscopy as the Lorentzian
lineshape, is an important exception. A Cauchy distribution has a median, but the
mean, variance, and higher moments do not exist.
A significant outcome of the foregoing calculation is that the standard deviation of
the mean of n observations is smaller than the standard deviation of a single
observation by the factor \sqrt{n}. This statistical prediction is the justification for
repetition and combination of measurements in experimental work. Perhaps it is
intuitively obvious to the reader that the greater the number of measurements taken,
the greater would be the precision of the result, but historically this was not at all
obvious.7
Suppose, then, that we perform a counting experiment comprising n bins, where the
content of one bin is the number of counts recorded in one second. From the record
of counts \{x_i\} we calculate the sample mean and variance

\overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad s_X^2 = \frac{1}{n} \sum_{i=1}^{n} \left(x_i - \overline{x}\right)^2    (1.15.6)
and report our measurement as \overline{x} \pm s_X. The empirical value s_X^2 should correspond
approximately to the value \overline{x} if the underlying distribution is truly Poissonian. Note
that this is still the estimate of the variance of a single trial (the count in one bin);
only now we have the variable counts in n bins to verify its value directly.
7 S. M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900 (Harvard University Press,
Cambridge MA, 1986) 28.
But the n-bin experiment gives us more. Equation (1.15.4) tells us that the variance
of the mean of the n measurements is s_{\overline{X}}^2 = s_X^2 / n, which for large n is a much smaller
variation. The quantity s_{\overline{X}} is referred to in statistics as the standard error, where
"error" connotes uncertainty, not mistake. However, as before, we cannot check this
prediction on the basis of a single set of n measurements, which, for need of a word
that is short and alliteratively parallels "bin," I shall refer to as one "bag."
Suppose we collect b bags of data, each bag containing n bins where the content of
one bin is the count in one second. Treating the means of each bag \{\overline{x}_j,\ j = 1 \ldots b\} in
the way that we formerly treated the bins \{x_i,\ i = 1 \ldots n\} in a bag, we calculate the
mean of the bag means and the variance of the bag means

m_X = \frac{1}{b} \sum_{j=1}^{b} \overline{x}_j \qquad s_{\overline{X}}^2 = \frac{1}{b} \sum_{j=1}^{b} \left(\overline{x}_j - m_X\right)^2    (1.15.7)
and report the experimental result in the form m_X \pm s_{\overline{X}}. The numerical value of s_{\overline{X}}
should satisfy (approximately) the relation s_{\overline{X}} \approx s_X / \sqrt{n}. But this experiment also
gives more: the estimated variance of the mean of b bags of data is s_{m_X}^2 = s_{\overline{X}}^2 / b = s_X^2 / nb.
Of course, we cannot actually verify this without performing another set of experiments, each one comprising b bags of data. And so it goes.
However, we could equally well have interpreted the set of b bags of data as a
single large bag of nb realizations \{y_k,\ k = 1 \ldots nb\} of a random variable Y. Estimates
of the mean \overline{y}, variance s_Y^2, and variance of the mean s_{\overline{Y}}^2 are then given by the
expressions

\overline{y} = \frac{1}{nb} \sum_{k=1}^{nb} y_k \qquad s_Y^2 = \frac{1}{nb} \sum_{k=1}^{nb} \left(y_k - \overline{y}\right)^2 \qquad s_{\overline{Y}}^2 = \frac{s_Y^2}{nb}.    (1.15.8)

The two groupings of the data are statistically consistent in that

s_Y \approx s_X \qquad s_{\overline{Y}} \approx s_{m_X}.    (1.15.9)
The greater number of bins per bag does not reduce the variance of the count in a
single bin, but yields a mean whose variance is as small as previously found for the
variance of the means of the b bags of data. The two ways of handling the data give
equivalent overall estimates for the mean and variance of the stochastic process
generating the data. There are advantages, however, to partitioning the data into
bags if the objective, for example, is to test whether the distribution of counts is
actually Poissonian, or to examine whether the mean or variance of the source of
data may be varying in time.
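The bins-and-bags bookkeeping is easy to reproduce in simulation. The sketch below (a scaled-down version of the experiment described next, with parameters of my choosing) draws Poisson counts using Knuth's multiplication algorithm and compares the variance of the bag means with s_X^2 / n:

```python
import math
import random

random.seed(42)
mu, n_bins, n_bags = 100.0, 400, 16

def poisson(mu):
    # Knuth's algorithm: count uniform draws until their product falls below e^(-mu)
    L, k, prod = math.exp(-mu), 0, 1.0
    while prod > L:
        prod *= random.random()
        k += 1
    return k - 1

bags = [[poisson(mu) for _ in range(n_bins)] for _ in range(n_bags)]
bag_means = [sum(bag) / n_bins for bag in bags]
m = sum(bag_means) / n_bags                       # mean of the bag means

all_counts = [c for bag in bags for c in bag]
xbar = sum(all_counts) / len(all_counts)
s2 = sum((c - xbar)**2 for c in all_counts) / len(all_counts)   # single-bin variance
s2_bagmeans = sum((x - m)**2 for x in bag_means) / n_bags
# prediction of Eq. (1.15.4): s2_bagmeans should be close to s2 / n_bins
```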
Table 1.3 shows the results of 25 600 outcomes, ordered sequentially into 16 bags
of 1600 bins each, of a Poisson random number generator (RNG) set for \mu = 100.
The table reports the mean \overline{x}_j and variance s_X^2 of each bag (j = 1 \ldots 16). From the
table one calculates directly the mean of all bags (m_X) and the variance of bag
means \left(s_{\overline{X}}^2\right). Comparing theoretical
Table 1.3

Bag No. j   Mean x_j   Variance s_x^2     Bag No. j   Mean x_j   Variance s_x^2
1           99.6       99.0               9           99.6       97.5
2           99.9       100.7              10          100.1      100.0
3           100.1      100.7              11          99.9       99.7
4           100.4      101.5              12          99.9       96.7
5           100.1      100.8              13          99.9       105.1
6           100.0      104.1              14          100.1      105.6
7           100.4      101.0              15          100.3      101.5
8           100.1      98.5               16          100.0      100.6
expectations and empirical outcomes, we find excellent agreement with the principles outlined above.

THEORY
\sigma_X = \sqrt{\mu} = \sqrt{100} = 10
\sigma_{\overline{X}} = \sigma_X / \sqrt{n} = 10/40 = 0.250
\sigma_{\overline{Y}} = \sigma_X / \sqrt{nb} = 10/160 = 0.0625

EMPIRICAL
s_X = 10.040
s_{\overline{X}} = 0.251
s_{\overline{Y}} = s_X / \sqrt{nb} = 0.0628
A final point (for the moment) in regard to Eq. (1.15.4) or Eq. (1.15.5) is that the
expression for the variance of the mean is a general property of variances irrespective of
the Central Limit Theorem. Without the CLT, however, we would not necessarily
know what to do with this information. The theorem tells us, for example, that, if the
process generating the particle counts can be approximated by a Gaussian distribution, then we should expect about 68.3% of the bins to contain counts that fall within
a range \pm s_X about the observed mean \overline{x}.
1.16 Characteristic function
The characteristic function (cf ) of a statistical distribution is closely related to the
moment generating function (mgf ) when the latter exists and can be used in its place
when the mgf does not exist. It is a complex-valued function defined by
h_X(t) \equiv \left\langle e^{iXt} \right\rangle = g_X(it),    (1.16.1)

where i = \sqrt{-1} is the unit imaginary number. For a random variable X characterized
by a pdf p_X(x), the characteristic function takes the form

h_X(t) = \int_{-\infty}^{\infty} e^{ixt}\, p_X(x)\, dx,    (1.16.2)

which is recognizable as the Fourier transform of the pdf. In this capacity lies its
primary utility, for it permits one to calculate the probability density (or probability
function) by an inverse transform

p_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ixt}\, h_X(t)\, dt,    (1.16.3)
which cannot always be done so straightforwardly by means of the mgf itself. One can,
of course, also calculate moments of a distribution by expansion of hX (t) in a Taylor
series about t 0 to obtain an alternating progression of real and imaginary valued
quantities, but I have found little advantage to using it this way when gX(t) is available.
As an illustration of the inverse problem of determining the pdf from the cf, consider
the standard normal distribution, for which the generating function is g_X(t) = e^{t^2/2} and
therefore h_X(t) = e^{-t^2/2}. The probability density then follows from the integral

p_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ixt}\, e^{-t^2/2}\, dt = \frac{e^{-x^2/2}}{2\pi} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(t + ix)^2}\, dt = \frac{e^{-x^2/2}}{\sqrt{2\pi}} \left[\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(t + ix)^2}\, dt\right] = \frac{e^{-x^2/2}}{\sqrt{2\pi}},    (1.16.4)

since the bracketed Gaussian integral is unity.
The same inversion applied to the binomial characteristic function h_X(t) = \left(p\, e^{it} + q\right)^n
recovers the binomial probability function

p_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-ixt} \left(p\, e^{it} + q\right)^n dt = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k}\; \underbrace{\frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i(k - x)t}\, dt}_{\delta(k - x)} = \binom{n}{x} p^x q^{n-x}.    (1.16.5)
The last step bears some comment. A Dirac delta function \delta(x) is technically not a
function, but a mathematical structure with numerous representations whose value is
zero everywhere except where its argument is zero, at which point its value is infinite;
yet the area under the delta function (that is, the integral of the delta function over
the real axis) is 1. The object was introduced into physics by P. A. M. Dirac to the
horror of mathematicians (or so I have read) but eventually was legitimized by
Laurent Schwartz in a theory of generalized functions (referred to as distribution
theory, although the concept of distribution is unrelated to that in statistics).
Ordinarily, the delta function has meaning only in an integral where it serves to
sift out selected values of the argument of the integrand, for example:
\int f(x)\, \delta(x - a)\, dx = f(a). One gets a sense of how this occurs from the integral
representation
\delta(x) = \lim_{K \to \infty} \left(\frac{1}{2\pi} \int_{-K}^{K} e^{ixt}\, dt\right) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{ixt}\, dt    (1.16.6)
identified in (1.16.5) by the horizontal bracket. The second equality expresses the
familiar form one usually sees for the representation of the delta function. If
the argument is not zero, then the integrand oscillates wildly with average value
of 0. The proof that the foregoing representation satisfies the property of unit area is
best accomplished by means of contour integration in the complex plane and will not
be given here. To perform that integral rigorously, however, one must employ the
correct representation of (x) as a limiting process expressed in the first equality.
In the calculation (1.16.5) of the binomial probability function, the Dirac delta
function causes the right side of the equation to vanish for all values of the discrete
summation index k except for k = x. It is therefore assuming the role of the discrete
Kronecker delta \delta_{kx}, which by definition equals 1 if k = x and zero otherwise. There is
no inconsistency here, however, because the inverse transform of the characteristic
function is a probability density, and the Dirac delta function, which in general is a
dimensioned quantity (with dimension equal to the reciprocal dimension of the
integration variable), is required for the left-hand side of (1.16.5) to be a density,
even though it is defined only for discrete values of x. In short, the method works, and
we shall not worry about mathematical refinements to make the analysis more
elegant, only to end up with the same result.
1.17 The uniform distribution

The probability density of a uniformly distributed random variable X

p_X(x|b, a) =
\begin{cases}
\dfrac{1}{b - a} & a \le x \le b \\[1ex]
0 & \text{otherwise}
\end{cases}    (1.17.1)

is constant over the entire interval within which the variable can fall. The value of the
constant is the reciprocal of the interval, as determined by the completeness relation.
Use of pdf (1.17.1) leads to the moment-generating function

g_X(t) = \left\langle e^{xt} \right\rangle = \frac{1}{b - a} \int_{a}^{b} e^{xt}\, dx = \frac{e^{bt} - e^{at}}{(b - a)t}.    (1.17.2)
The uniform distribution is perhaps one of very few distributions where it is considerably easier to determine statistical moments directly by integrating the pdf than by
differentiating the mgf. Performing the integrations, we obtain

\mu_X = \langle X \rangle = \frac{1}{2}(b + a) \qquad \sigma_X^2 = \left\langle (X - \mu_X)^2 \right\rangle = \frac{1}{12}(b - a)^2

\left\langle X^2 \right\rangle = \frac{1}{3}\left(b^2 + ab + a^2\right) \qquad Sk = \left\langle (X - \mu_X)^3 \right\rangle / \sigma_X^3 = 0    (1.17.3)

K = \left\langle (X - \mu_X)^4 \right\rangle / \sigma_X^4 = \frac{9}{5} = 1.8.
Since the distribution is symmetric (being constant over the entire interval), the
skewness is expected to vanish. The kurtosis turns out to be a number independent
of the interval boundaries and much smaller than 3 (the value for a normal distribution) signifying a comparatively broader peak about the center, which is one way of
looking at a completely flat distribution.
The difficulty with using the mgf for a uniform variate is that substitution of t = 0
into g_X(t) and its derivatives leads to an indeterminate expression 0/0. In such cases,
we must apply L'Hopital's rule from elementary calculus to differentiate separately
the numerator and denominator (more than once, if necessary) before taking the
limit. Consider, for example, calculation of the mean

\langle X \rangle = \left.\frac{dg_X(t)}{dt}\right|_{t=0} = \left[\frac{b\, e^{bt} - a\, e^{at}}{(b - a)t} - \frac{e^{bt} - e^{at}}{(b - a)t^2}\right]_{t=0}

= \left.\frac{b^2 e^{bt} - a^2 e^{at}}{b - a}\right|_{t=0} - \left.\frac{b^2 e^{bt} - a^2 e^{at}}{2(b - a)}\right|_{t=0} = \frac{b^2 - a^2}{2(b - a)} = \frac{b + a}{2}.    (1.17.4)

To avoid indeterminacy, the numerator and denominator of the second term in the second
line had to be differentiated twice. Clearly, use of the mgf to determine moments of the
uniform distribution is a tedious procedure to be avoided if possible. However, there
are other uses, more pertinent to our present focus, in which the mgf is indispensable.
Suppose we want to determine the statistical properties of a random variable
Y = \sum_{i=1}^{n} X_i, which is a sum of n independent random variables each distributed
uniformly over the unit interval, i.e. X_i = U(0,1). Y, therefore, spans the range
(0 \le Y \le n). The mgf of Y and correspondingly the characteristic function
h_Y(t) = g_Y(it) are immediately deducible from (1.10.1)

g_Y(t) = \left(\frac{e^t - 1}{t}\right)^n \;\Rightarrow\; h_Y(t) = \left(\frac{e^{it} - 1}{it}\right)^n.    (1.17.5)
Although at this point we do not have the pdf of Y, we can determine the moments
from the derivatives of g_Y(t)

\langle Y \rangle = \mu_Y = \frac{n}{2}

\left\langle Y^2 \right\rangle = \frac{n^2}{4} + \frac{n}{12} \;\Rightarrow\; \sigma_Y^2 = \frac{n}{12}

\left\langle Y^3 \right\rangle = \frac{n^3}{8} + \frac{n^2}{8} \;\Rightarrow\; Sk = 0    (1.17.6)

\left\langle Y^4 \right\rangle = \frac{n^4}{16} + \frac{n^3}{8} + \frac{n^2}{48} - \frac{n}{120} \;\Rightarrow\; K = 3 - \frac{6}{5n}.
As expected, the skewness vanishes and the kurtosis approaches 3 in the limit of
infinite n. Moreover, expansion of \ln g_Y(t) to order t^3 leads to an approximate mgf
of Gaussian form

g_Y(t) \approx e^{\frac{n}{2} t + \frac{n}{24} t^2} = e^{\mu_Y t + \frac{1}{2} \sigma_Y^2 t^2}.    (1.17.7)
The pdf of Y follows from the inverse Fourier transform (1.16.3) of the characteristic
function

p_Y(y) = \frac{1}{2\pi} \int_{-\infty}^{\infty} h_Y(t)\, e^{-iyt}\, dt = \frac{1}{(n-1)!} \sum_{k=0}^{[y]} (-1)^k \binom{n}{k} (y - k)^{n-1}.    (1.17.8)
I have used the symbol [y] in the upper limit of the sum above to represent the
greatest integer less than or equal to y. Recall that Y is a continuous random variable
over the interval 0 to n, but the numbers in the binomial coefficient must be integers.
The calculation leading from the first equality to the second in (1.17.8) is most
easily performed by contour integration in the complex plane and will be left to an
appendix. To verify that p_Y(y) satisfies the completeness relation, we calculate the
cumulative distribution function

F_Y(y) = \int_{-\infty}^{y} p_Y(y')\, dy' = \frac{1}{n!} \sum_{k=0}^{[y]} (-1)^k \binom{n}{k} (y - k)^n.    (1.17.9)
Fig. 1.2 Probability density of the sum of n uniform variates (solid), for n = 2, 3, 4, 5, with superposed
Gaussian densities (dashed) of corresponding mean n/2 and variance n/12.
Setting y = n yields

F_Y(n) = \frac{1}{n!} \sum_{k=0}^{n} (-1)^k \binom{n}{k} (n - k)^n = 1.    (1.17.10)

The preceding identity is by no means obvious, but it can be proven fairly simply by
comparing the nth derivative of the function (e^t - 1)^n and its binomial expansion,
both evaluated at t = 0.8
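Both the density (1.17.8) and the identity (1.17.10) can be verified directly; a sketch (the function names are mine):

```python
from math import comb, factorial

def irwin_hall_pdf(y, n):
    # Eq. (1.17.8): density of the sum of n independent U(0,1) variates
    return sum((-1)**k * comb(n, k) * (y - k)**(n - 1)
               for k in range(int(y) + 1)) / factorial(n - 1)

def irwin_hall_cdf(y, n):
    # Eq. (1.17.9): the corresponding cumulative distribution function
    return sum((-1)**k * comb(n, k) * (y - k)**n
               for k in range(int(y) + 1)) / factorial(n)

# completeness, Eq. (1.17.10): F_Y(n) = 1 for every n
checks = [irwin_hall_cdf(n, n) for n in range(1, 8)]
```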
Figure 1.2 shows a sequence of plots of p_Y(y) (solid trace), calculated from
Eq. (1.17.8), for sums of two to five uniformly distributed variates over the unit
interval. Superposed over each plot is a plot (dashed trace) of the corresponding
Gaussian pdf N(n/2, n/12). It is remarkable that the addition of as few as three uniform
random variables already generates a probability distribution reasonably well
approximated by a normal distribution. In fact, the sum of just two uniform variates,
which produces a triangular distribution, is matched very closely by the corresponding Gaussian curve in width, height, and inflection points. Of course, convergence
can be much slower for other probability distributions, and some may never
approach Gaussian form at all because their first and second moments are undefined.
Besides serving as an interesting illustration of the Central Limit Theorem, the
uniform distribution is of particular interest in its own right because it is the distribution of cumulative distribution functions (cdf). To see this, suppose p_X(x) to be an
arbitrary (but well-behaved) pdf, with cdf F(x)
F(x) = \Pr(X \le x) = \int_{-\infty}^{x} p_X(x')\, dx'.    (1.17.11)
8 W. Feller, An Introduction to Probability Theory and its Applications, Vol. 1 (Wiley, New York, 1950) 63. The identity
arises in the classical occupancy problem (i.e. Maxwell-Boltzmann statistics) of r balls distributed among n cells such
that none of the cells is empty.
How, then, is the random variable Y, defined by Y = F(X), distributed? From the
following sequence of relations

\Pr(Y \le y) = \Pr(F(X) \le y) = \Pr\left(X \le F^{-1}(y)\right) = F\left(F^{-1}(y)\right) = y,    (1.17.12)

it follows that Y must be a uniform random variable over the interval 0 to 1, i.e. Y =
U(0,1). The fact that a cdf is governed by a distribution U(0,1) plays an important role
in statistical tests of significance, such as goodness-of-fit tests to be discussed shortly.
One can also use a uniform distribution to generate random numbers distributed
in an arbitrarily desired way. Start by generating n realizations \{y_i,\ i = 1 \ldots n\} of Y =
U(0,1). If we suppose that Y is the cdf of a random variable X

y = F(x) = \int_{-\infty}^{x} f(x')\, dx',    (1.17.13)

then the inverted values

x_i = F^{-1}(y_i)    (1.17.14)

constitute n realizations of a random variable with pdf f(x). In general, the inversion
will have to be done numerically.
Consider, as an example, the exponential distribution for which an analytical solution is easily obtained

y_i = \int_0^{x_i} \lambda e^{-\lambda x}\, dx = 1 - e^{-\lambda x_i}   (1.17.15)

x_i = -\lambda^{-1} \ln(1 - y_i)   \qquad (1 \ge y_i \ge 0).
The upper panel of Figure 1.3 shows a histogram of 10 000 numbers {y_i} generated by a U(0,1) random number generator, and the lower panel shows the corresponding values {x_i} obtained from (1.17.15) for an exponential distribution with parameter \lambda = 3. The dashed curve superposed on each histogram is the theoretical pdf.
Fig. 1.3 Top panel: histogram of 10 000 samples from a U(0,1) random number generator. Lower panel: histogram of exponential variates E(\lambda) generated by transformation (1.17.15) with parameter \lambda = 3. Dashed curves are theoretical densities.
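The inverse-transform recipe of Eq. (1.17.15) is short enough to state directly in code (an illustrative Python sketch; the parameter value, sample size, and seed are arbitrary choices matching the figure):

```python
import math
import random

def exponential_variates(lam, n, seed=1):
    """Generate exponential variates by the inverse-transform relation
    x = -ln(1 - y)/lam, with y drawn from U(0,1), as in Eq. (1.17.15)."""
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

xs = exponential_variates(3.0, 10_000)
print(sum(xs) / len(xs))   # sample mean should approach 1/lam = 1/3
```

A histogram of `xs` reproduces the lower panel of Figure 1.3.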
To start with, consider a standard normal random variable Z = N(0,1), for which the probability density is p_Z(z) = (2\pi)^{-1/2} e^{-z^2/2}. Under a transformation W = Z^2, the new pdf can be deduced by the following chain of steps

p_W(w)\, dw = 2\, p_Z(z)\, dz = 2\, p_Z\big(z(w)\big)\left|\frac{dz}{dw}\right| dw,   (1.18.1)

leading to

p_W(w) = \frac{e^{-w/2}}{(2\pi w)^{1/2}},   (1.18.2)

since |dz/dw| = \frac{1}{2} w^{-1/2}.
in the expectation \langle e^{Wt} \rangle into the form of the gamma function \Gamma(1) = 1. (See
Eqs. (1.12.10) and (1.12.11).)
Given the mgf for a single variate Z^2, it follows immediately that the superposition of k independent random variables, W = \sum_{i=1}^{k} Z_i^2, each the square of a standard normal variate, has the moment-generating function

g_W(t) = (1 - 2t)^{-k/2},   (1.18.3)

from which follow the moments

\langle W \rangle = k \qquad \mathrm{var}(W) = 2k \qquad \langle W^3 \rangle = k^3 + 6k^2 + 8k \qquad \langle W^4 \rangle = k^4 + 12k^3 + 44k^2 + 48k

and the skewness and kurtosis

\mathrm{Sk} = \sqrt{\frac{8}{k}}   (1.18.4)

K = 3 + \frac{12}{k}.   (1.18.5)
With increasing k, the skewness of the distribution function approaches 0 and the
kurtosis approaches that of a standard normal variate.
The inverse Fourier transform of the characteristic function h_W(t) = g_W(it) yields the pdf

p_W(w|k) = \frac{w^{k/2-1}\, e^{-w/2}}{2^{k/2}\,\Gamma(k/2)},   (1.18.6)
but this calculation, like that of the integral encountered in the previous section, also entails contour integration in the complex plane, and the demonstration will be left to an appendix. Figure 1.4 shows the variation in the \chi^2_k density function (1.18.6) for a set of low degrees of freedom (k = 1-5) (upper panel) and a set of relatively high degrees of freedom (k = 50-66 in intervals of 4) (lower panel). For k = 1, the pdf is infinite at the origin although the area under the curve is of course finite. For k = 2, the curve is a pure exponential, as can be seen from the expression in (1.18.6). As k increases beyond 2, the plot approaches (although with slow convergence) the shape of a Gaussian pdf with mean k and variance 2k.
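That approach to Gaussian form is easy to check numerically. The sketch below (illustrative Python, not from the text) evaluates the density (1.18.6) for k = 66 and compares it with a normal density of mean k and variance 2k near the peak:

```python
import math

def chi2_pdf(w, k):
    """Chi-square density of Eq. (1.18.6) with k degrees of freedom."""
    return w**(k/2 - 1) * math.exp(-w/2) / (2**(k/2) * math.gamma(k/2))

def normal_pdf(x, mu, var):
    """Gaussian density N(mu, var)."""
    return math.exp(-(x - mu)**2 / (2*var)) / math.sqrt(2*math.pi*var)

k = 66
# near the center the two densities agree to within about one percent
for w in (k - 10, k, k + 10):
    print(w, chi2_pdf(w, k), normal_pdf(w, k, 2*k))
```

The residual discrepancy reflects the slight skewness \sqrt{8/k} that survives at finite k.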
Although ubiquitously used in its own right to test how well a set of data is accounted for by a theoretical expression, the chi-square pdf can also be considered a special case of a more general class of gamma distribution Gam(\lambda, k) with defining probability density

p_X(x|\lambda, k) = \frac{\lambda^k\, x^{k-1}\, e^{-\lambda x}}{\Gamma(k)} \qquad (x > 0,\ \lambda > 0).   (1.18.7)
Fig. 1.4 Probability density of \chi^2_k (solid) for low k (top panel) and high k (bottom panel). The dashed plot is the density of a normal variate N(k, 2k) for k = 66.
The corresponding moment-generating function is

g_X(t) = \left(1 - \frac{t}{\lambda}\right)^{-k} \qquad (t < \lambda).   (1.18.8)
\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \Rightarrow U = N(0,1)   (1.19.2)

t \equiv \frac{u\sqrt{n-1}}{v}   (1.19.3)

= \frac{(\bar{x} - \mu)\sqrt{n-1}}{s}   (1.19.4)
does not contain the unknown population variance \sigma, or the population mean \mu as well, if the parent population is hypothesized to have a mean of 0, a situation characterizing a null test (e.g. a test that some process has produced no effect distinguishable from pure chance).
The derivation of the pdf p_T(t) from the component pdfs

p_U(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2} \qquad p_{V^2}(v^2) = \frac{(v^2)^{d/2-1}\, e^{-v^2/2}}{2^{d/2}\,\Gamma(d/2)}   (1.19.5)
proceeds easily if one ignores the constant factors (i.e. just designates all constant factors by a single symbol c) and focuses attention only on the variables. In a subsequent chapter I discuss the distribution of products and quotients of random variables more generally, but for the present the solution can be worked out by a straightforward transformation of variables. The idea is to

(a) start with the joint probability distribution f_{UV^2}(u, v^2) = p_U(u)\, p_{V^2}(v^2),
(b) transform to a new probability distribution f_{TV}(t, v) where t = u\sqrt{d}/v,
(c) integrate over v to obtain the marginal distribution p_T(t) of t alone, and
(d) determine the normalization constant c from the completeness relation \int f_T(t)\, dt = 1.
Execution of steps (a) and (b) by means of the transformation

f_{TV}(t,v) = f_{UV}(u,v)\left|\frac{\partial(u,v)}{\partial(t,v)}\right| = \frac{v}{\sqrt{d}}\, f_{UV}\big(u(t,v), v\big)   (1.19.6)

leads to

f(t,v) = c\, v^d\, e^{-\frac{v^2}{2}\left(1 + \frac{t^2}{d}\right)}   (1.19.7)

p_T(t) = c \int_0^{\infty} v^d\, e^{-\frac{v^2}{2}\left(1 + \frac{t^2}{d}\right)}\, dv.   (1.19.8)
The integral in step (d) is not elementary, but can be worked out by means of contour
integration in the complex plane with use of the residue theorem. This calculation,
deferred to an appendix, leads to the density
p_T(t) = \frac{2^{d-1}\,\Gamma\!\left(\frac{d+1}{2}\right)^2}{\pi\,\sqrt{d}\,\Gamma(d)}\left(1 + \frac{t^2}{d}\right)^{-\frac{d+1}{2}} \qquad (-\infty \le t \le \infty).   (1.19.9)
To my initial surprise when I first obtained it, relation (1.19.9) is not the Student t pdf
p_T(t) = \frac{\Gamma\!\left(\frac{d+1}{2}\right)}{\sqrt{\pi d}\,\Gamma(d/2)}\left(1 + \frac{t^2}{d}\right)^{-\frac{d+1}{2}} \qquad (-\infty \le t \le \infty)   (1.19.10)
one gets by keeping track of all the constants in steps (a)-(d) above, and which is the
form usually found in statistics textbooks. The two expressions (1.19.9) and
(1.19.10) are entirely equivalent, however, although they do not look it. The
demonstration of their equivalence, which requires showing that
\Gamma\!\left(\frac{d+1}{2}\right)\Gamma\!\left(\frac{d}{2}\right) = 2^{1-d}\,\pi^{1/2}\,\Gamma(d)   (1.19.11)

(note: \Gamma(\tfrac{1}{2}) = \sqrt{\pi}), immerses one in the fascinating, if not bewildering, relationships of the beta function B(x, y), which we define and use in the next chapter in consideration of Bayes' problem (i.e. the problem that divided probabilists into two warring
camps). The expression (1.19.11) is one form of an identity referred to as the
Legendre duplication formula,10 often seen in the cryptic form
x!\left(x + \tfrac{1}{2}\right)! = 2^{-2x-1}\,\pi^{1/2}\,(2x+1)!,   (1.19.12)

where fractional factorials are defined by means of the gamma function x! = \Gamma(x+1), or the alternative form

\left(x + \tfrac{1}{2}\right)! = 2^{-x-1}\,\pi^{1/2}\,(2x+1)!!,   (1.19.13)

where the double factorial satisfies

(2n+1)!! = \frac{(2n+1)!}{2^n\, n!}.   (1.19.14)

(1.19.15)
10 G. B. Arfken and H. J. Weber, Mathematical Methods for Physicists (Elsevier, Amsterdam, 2005) 522-523.
\sigma_T^2 = \langle T^2 \rangle = \frac{d}{d-2} \qquad K_T = \frac{\langle T^4 \rangle}{\sigma_T^4} = 3\left(\frac{d-2}{d-4}\right).   (1.19.16)
In the limit d \to \infty, which in practical terms means d \gtrsim 10, the moments approach those of a standard normal distribution.
An alternative way to compute the moments of the t distribution is to exploit the fact that the numerator and denominator of the ratio (1.19.1) are independent variates and therefore the expectation of the quotient is expressible as the product of two expectations

\langle T^n \rangle = \left\langle \frac{d^{n/2}\, U^n}{V^n} \right\rangle = d^{n/2}\, \langle U^n \rangle\, \langle V^{-n} \rangle.   (1.19.17)
Note that (1.19.17) is not equal to d^{n/2} \langle U^n \rangle / \langle V^n \rangle. Rather, one must calculate the negative moments of V, which can be done by integration of the pdf or, as I discuss in Chapter 4, by integration of the moment-generating function. Not every distribution has finite negative moments. The chi-square distribution is one that does.
In the upper panel of Figure 1.5 Student t distributions with d = 3 and 10 degrees of freedom are compared to normal distributions of the same mean (0) and variances (3, 5/4). Over the range d \ge 2, the appearance of the t distribution does not change greatly. At the scale of the figure, the Gaussian distribution of same variance looks wider, but the appearance is deceiving. In the lower panel, which shows the tails of the two distributions for d = 3, the tail of the t distribution for coordinates |t| > 5 is fatter, i.e. decreases more slowly and predicts a higher probability than a normal distribution for the same t values.
Fig. 1.5 Top panel: Student t (solid) and Gaussian (dashed) densities for degrees of freedom d = 10 (plots (a), (b)) and d = 3 (plots (c), (d)). Bottom panel: tails of the Student t (solid) and Gaussian (dashed) densities for d = 3.
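The fat-tail comparison in the lower panel can be reproduced directly from the density (1.19.10). The sketch below (illustrative Python; the sample points t = 5, 7 are arbitrary) pits the t density for d = 3 against a Gaussian of equal variance d/(d-2) = 3:

```python
import math

def student_t_pdf(t, d):
    """Student t density, Eq. (1.19.10), with d degrees of freedom."""
    coef = math.gamma((d + 1)/2) / (math.sqrt(math.pi * d) * math.gamma(d/2))
    return coef * (1 + t*t/d) ** (-(d + 1)/2)

def normal_pdf(x, var):
    """Zero-mean Gaussian density with the given variance."""
    return math.exp(-x*x/(2*var)) / math.sqrt(2*math.pi*var)

d = 3
# beyond |t| ~ 5 the power-law tail of the t density dominates
for t in (5, 7):
    print(t, student_t_pdf(t, d), normal_pdf(t, 3.0))
```

At t = 7 the t density exceeds the Gaussian by more than an order of magnitude, which is why outliers are far less "surprising" under a t model.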
11 H. B. Callen, Thermodynamics (Wiley, New York, 1960) 24.
12 Callen, op. cit. p. 7.
13 K. G. Denbigh, "Note on Entropy, Disorder, and Disorganization", The British Journal for the Philosophy of Science 40 (1989) 323-332.
where the sum is over all states of the system. Apart from a universal constant (Boltzmann's constant k_B) chosen so that corresponding statistically and thermodynamically derived quantities agree, S depends explicitly only on the occupation
probabilities. Implicitly, S is also a function of measurable physical properties of the
system because the equilibrium probabilities themselves depend in general on the
energy eigenvalues, the equilibrium temperature, and the chemical potential (which
itself may be a function of temperature, volume, and number of particles in the
system). Nevertheless, the connection between entropy and probability is striking.
One can in fact interpret the expression for S as proportional to the expectation value
of the logarithm of the occupation probability.
The identical expression, made dimensionless and stripped of all ties to heat, work,
and energy, was proposed by Claude Shannon in 1948 as a measure of the uncertainty in information transmitted by a communications channel.14 This was the key
advance that, nearly ten years later, permitted Ed Jaynes, in one of the most fruitful
and far-reaching reversals of reasoning I have seen, to develop an alternative way15 of
understanding and deriving all of equilibrium statistical mechanics from the concept
of entropy as expressed by Shannon's information function

H = -\sum_i p_i \ln p_i.   (1.21.2)
The significance of Jaynes' perspective was the realization that the structure of statistical mechanics did not in any way depend on the details of the physics it described. Rather, it was a consequence of a general form of pure mathematical reasoning that could be employed on countless problems totally unrelated to thermodynamics. In particular, this mode of reasoning, subsequently termed the principle
of maximum entropy (PME), can be used to answer Question I: What is the most
unbiased probability distribution that takes account of known information but
makes no further speculations or hypotheses? We have seen how the Central Limit
Theorem explains the apparently ubiquitous occurrence of the normal distribution.
The PME, as will be demonstrated, provides another reason.
14 C. E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal 27 (1948) 379-423, 623-656.
15 E. T. Jaynes, "Information Theory and Statistical Mechanics", Physical Review 106 (1957) 620-630; "Information Theory and Statistical Mechanics II", Physical Review 108 (1957) 171-190.
H_{AB} = -\sum_i \sum_j p_i q_j \ln(p_i q_j) = H_A + H_B,   (1.22.1)
where the completeness relation was used to reduce the sums above the horizontal
brackets to unity. No other functional form has this property.
H = -\sum_{i=1}^{n} p_i \ln p_i + \lambda_0\left(1 - \sum_{i=1}^{n} p_i\right)   (1.23.1)

F \equiv \langle f(x) \rangle = \sum_{i=1}^{n} p_i f(x_i) \equiv \sum_{i=1}^{n} p_i f_i.   (1.23.2)

H = -\sum_{i=1}^{n} p_i \ln p_i + \lambda_0\left(1 - \sum_{i=1}^{n} p_i\right) + \lambda_1\left(F - \sum_{i=1}^{n} p_i f_i\right),   (1.23.3)
which now contains two Lagrange multipliers, one for each constraint, leads to an
exponential distribution
p_j = e^{-1-\lambda_0}\, e^{-\lambda_1 f_j} = \frac{e^{-\lambda_1 f_j}}{Z(\lambda_1)},   (1.23.4)
where the second equality, obtained by substitution of the first expression into the
completeness relation, displays the so-called partition function
Z(\lambda_1) = \sum_{i=1}^{n} e^{-\lambda_1 f_i}.   (1.23.5)
The value of the Lagrange multiplier \lambda_1 is determined (implicitly) from the second constraint
F = \langle f \rangle = \frac{\sum_{i=1}^{n} f_i\, e^{-\lambda_1 f_i}}{\sum_{i=1}^{n} e^{-\lambda_1 f_i}} = -\frac{\partial \ln Z(\lambda_1)}{\partial \lambda_1}.   (1.23.6)
In the general case of m constraint functions the corresponding relations become

F_k = \langle f_k(x) \rangle = \frac{\sum_{i=1}^{n} f_{ki}\, e^{-\sum_{j=1}^{m} \lambda_j f_{ji}}}{\sum_{i=1}^{n} e^{-\sum_{j=1}^{m} \lambda_j f_{ji}}} = -\frac{\partial \ln Z(\lambda_1 \ldots \lambda_m)}{\partial \lambda_k}   (1.23.7)

Z(\lambda_1 \ldots \lambda_m) = \sum_{i=1}^{n} e^{-\sum_{j=1}^{m} \lambda_j f_{ji}}.   (1.23.8)
The term partition function, which a reader versed in physics will instantly recognize, is not misused here. It is, in fact, the partition function encountered in statistical mechanics, the symbol Z standing for the German expression Zustandssumme (sum over states). In statistical mechanics the Lagrange multipliers have physical significance, being related to the temperature of the system (if the mean energy is part of the
prior information), the chemical potential of the system (if the mean number of
particles is part of the prior information), and other physical quantities depending on
the nature of the system and the assumed prior information. The partition function
contains all the statistical information one can know about a system in equilibrium.
For example, the second moments, cross-correlation, and covariance in a system for which the mean values of two functions {f_1(x), f_2(x)} are known and Z = Z(\lambda_1, \lambda_2) take the forms

\langle f_k(x)^2 \rangle = \frac{\sum_{i=1}^{n} f_{ki}^2\, e^{-(\lambda_1 f_{1i} + \lambda_2 f_{2i})}}{\sum_{i=1}^{n} e^{-(\lambda_1 f_{1i} + \lambda_2 f_{2i})}} = \frac{1}{Z}\frac{\partial^2 Z}{\partial \lambda_k^2} = \frac{\partial^2 \ln Z}{\partial \lambda_k^2} + F_k^2 \qquad (k = 1, 2)   (1.23.9)
\langle f_1(x) f_2(x) \rangle = \frac{1}{Z}\sum_{i=1}^{n} f_{1i} f_{2i}\, e^{-(\lambda_1 f_{1i} + \lambda_2 f_{2i})} = \frac{1}{Z}\frac{\partial^2 Z}{\partial \lambda_1 \partial \lambda_2}   (1.23.10)

\mathrm{cov}\big(f_1(x) f_2(x)\big) = \langle f_1(x) f_2(x) \rangle - \langle f_1(x) \rangle \langle f_2(x) \rangle = \frac{1}{Z}\frac{\partial^2 Z}{\partial \lambda_1 \partial \lambda_2} - \left(\frac{1}{Z}\frac{\partial Z}{\partial \lambda_1}\right)\left(\frac{1}{Z}\frac{\partial Z}{\partial \lambda_2}\right) = \frac{\partial^2 \ln Z}{\partial \lambda_1 \partial \lambda_2}.   (1.23.11)
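The relations tying moments to derivatives of ln Z are easy to verify numerically. The sketch below (illustrative Python; the outcome values f_i and multiplier \lambda are arbitrary choices) builds a one-constraint maximum-entropy distribution and checks F = -\partial \ln Z/\partial \lambda and \mathrm{var}(f) = \partial^2 \ln Z/\partial \lambda^2 by finite differences:

```python
import math

# A discrete maximum-entropy distribution p_i ~ exp(-lam*f_i) over
# hypothetical outcome values f_i, with Z(lam) as in Eq. (1.23.5)
f = [0.0, 1.0, 2.0, 3.0]
lam = 0.7

def logZ(lam):
    return math.log(sum(math.exp(-lam * fi) for fi in f))

def moments(lam):
    Z = math.exp(logZ(lam))
    p = [math.exp(-lam*fi)/Z for fi in f]
    mean = sum(pi*fi for pi, fi in zip(p, f))
    var = sum(pi*fi*fi for pi, fi in zip(p, f)) - mean**2
    return mean, var

# central differences approximate the derivatives of ln Z
h = 1e-5
dlnZ = (logZ(lam + h) - logZ(lam - h)) / (2*h)
d2lnZ = (logZ(lam + h) - 2*logZ(lam) + logZ(lam - h)) / h**2
mean, var = moments(lam)
print(mean, -dlnZ)   # mean equals -d lnZ / d lam
print(var, d2lnZ)    # variance equals d^2 lnZ / d lam^2
```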
Consider, as a concrete illustration, a random variable that takes just two values f_1 and f_2, for which the prior information comprises a known mean

F = \langle f \rangle   (1.23.12)

together with (a) completeness

p_1 + p_2 = 1   (1.23.13)

(b) mean

F = p_1 f_1 + p_2 f_2.   (1.23.14)

The maximum-entropy solution (1.23.4) then takes the form

p_1 = \frac{e^{-\lambda f_1}}{e^{-\lambda f_1} + e^{-\lambda f_2}} \qquad p_2 = \frac{e^{-\lambda f_2}}{e^{-\lambda f_1} + e^{-\lambda f_2}}   (1.23.15)

F = \frac{f_1\, e^{-\lambda f_1} + f_2\, e^{-\lambda f_2}}{e^{-\lambda f_1} + e^{-\lambda f_2}}.   (1.23.16)
The relations above permit one to solve for e^{-\lambda(f_2 - f_1)} and hence obtain

\lambda = \frac{1}{f_2 - f_1}\ln\left(\frac{f_2 - F}{F - f_1}\right),   (1.23.17)
which is positive for f_2 - f_1 > 0 and negative for the reverse. Elimination of \lambda then leads to a partition function expressed directly in terms of F

Z(F) = e^{-\lambda f_1} + e^{-\lambda f_2} = \left(\frac{f_2 - F}{F - f_1}\right)^{-\frac{f_1}{f_2 - f_1}} + \left(\frac{f_2 - F}{F - f_1}\right)^{-\frac{f_2}{f_2 - f_1}}   (1.23.18)
and to probabilities

p_1 = \frac{f_2 - F}{f_2 - f_1} \qquad p_2 = \frac{F - f_1}{f_2 - f_1}.   (1.23.19)
Note that once the partition function is expressed in terms of the mean values of
observables, then one cannot calculate moments, as in Eq. (1.23.7), simply by taking
derivatives of Z with respect to the Lagrange multipliers. In that case, the straightforward thing to do is construct the moment-generating function, which in the
present case becomes
g(t) = \langle e^{ft} \rangle = p_1 e^{f_1 t} + p_2 e^{f_2 t}   (1.23.20)

and readily generates the moments

\langle f \rangle = F \qquad \sigma_f^2 = \langle f^2 \rangle - \langle f \rangle^2 = (f_2 - F)(F - f_1).   (1.23.21)
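The closed-form results of the two-state example check out numerically. The sketch below (illustrative Python; the values f_1 = 1, f_2 = 4, F = 2 are arbitrary choices) verifies Eqs. (1.23.17), (1.23.19), and the variance of (1.23.21):

```python
import math

# Two-state maximum-entropy distribution constrained by a mean F
f1, f2, F = 1.0, 4.0, 2.0

p1 = (f2 - F) / (f2 - f1)                        # Eq. (1.23.19)
p2 = (F - f1) / (f2 - f1)
lam = math.log((f2 - F)/(F - f1)) / (f2 - f1)    # Eq. (1.23.17)

# the exponential (Boltzmann-like) form reproduces the same probabilities
Z = math.exp(-lam*f1) + math.exp(-lam*f2)
print(math.exp(-lam*f1)/Z, p1)   # equal
print(p1*f1 + p2*f2)             # recovers the constraint F
print((f2 - F)*(F - f1))         # variance from Eq. (1.23.21)
```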
For prior information consisting of the first two moments \mu_1 = \langle x \rangle and \mu_2 = \langle x^2 \rangle, the entropy functional takes the form

H = -\sum_{i=1}^{n} p_i \ln p_i + \lambda_0\left(1 - \sum_{i=1}^{n} p_i\right) + \lambda_1\left(\mu_1 - \sum_{i=1}^{n} p_i x_i\right) + \lambda_2\left(\mu_2 - \sum_{i=1}^{n} p_i x_i^2\right).   (1.23.22)
However, this is not the most convenient form in which to find the extremum. Often (perhaps even most often) the analyst's interest is in moments about the mean. There is no loss of generality, then, in defining the Lagrange multipliers differently in order to rewrite the entropy functional in a way that reflects that interest
H = -\sum_{i=1}^{n} p_i \ln p_i + \lambda_0'\left(1 - \sum_{i=1}^{n} p_i\right) + \lambda_1'\left(\mu - \sum_{i=1}^{n} p_i x_i\right) + \frac{1}{2}\lambda_2'\left(\sigma^2 - \sum_{i=1}^{n} p_i (x_i - \mu)^2\right).   (1.23.23)
For notational simplicity I dropped the subscript 1 from the label of the first moment and combined the prior information to form a variance \sigma^2 = \mu_2 - \mu_1^2. Since the sum in the second bracket vanishes identically (by virtue of the expression in the first bracket) irrespective of the probability distribution, it provides no new information and therefore one loses nothing in simply setting \lambda_1' to zero. The procedure to
maximize the reduced entropy functional
H = -\sum_{i=1}^{n} p_i \ln p_i + \lambda_0'\left(1 - \sum_{i=1}^{n} p_i\right) + \frac{1}{2}\lambda_2'\left(\sigma^2 - \sum_{i=1}^{n} p_i (x_i - \mu)^2\right)   (1.23.24)

then leads to a distribution of Gaussian form

p_j = \frac{e^{-\frac{1}{2}\lambda_2' (x_j - \mu)^2}}{Z(\lambda_2')}.   (1.23.25)
view based on Bayes' theorem, which dispenses with the philosophical encumbrance of ensembles and focuses exclusively on the data to hand, not those that did not materialize. This divergence of thought constitutes one of the battlefronts in the probability wars alluded to at the beginning of the chapter. Estimates based on the two approaches do not always turn out to be the same. (Indeed, estimates made by different orthodox procedures do not necessarily turn out to be the same either.)
Philosophy aside, the differences between orthodox and Bayesian estimates derive
principally from what one does with the likelihood function. I will come back to this
point later in the chapter.
From the orthodox perspective, the likelihood function of n independent random
variables is defined as their joint probability density. Thus, if {x_i, i = 1...n} is a realization of the set of random variables introduced above, the corresponding likelihood function would be

L(\theta | \{x_i\}) = f(x_1|\theta)\, f(x_2|\theta) \cdots f(x_n|\theta) = \prod_{i=1}^{n} f(x_i|\theta),   (1.24.1)
where, in the general case, \theta may stand for a set of parameters. The method of maximum likelihood (ML), due primarily to geneticist and statistician R. A. Fisher,16 may be expressed somewhat casually as follows: The best estimate (usually) of the parameter \theta is the value \hat\theta that maximizes the likelihood L(\theta|\{x_i\}). This immediately raises the question of what is meant by best.
It is said that a spoken language has many words of varying nuances for something of particular importance in the culture of the people who speak the language. If that is true, then the concept of estimate is to a statistician what the perception of snow is to an Eskimo (. . . or perhaps to a meteorologist). To start with, the statistician distinguishes between an estimator \hat\Theta, which is a random variable used to estimate some quantity, and the estimate \hat\theta, which is a value that the estimator can take. The orthodox statistician considers the quantity to be estimated to have a fixed, but unknown, value, whereas the estimates of the estimator are governed by some probability density function of supposedly finite mean and variance. The goal of estimation is therefore to find an estimator whose expectation value yields the sought-for parameter with the least uncertainty possible. With those points in mind:
An estimator is unbiased if its expectation value \langle \hat\Theta \rangle equals the estimated parameter \theta.
An estimator is close if its distribution is concentrated about the true value of the parameter with small variance.
An estimator is consistent if the value of the estimation gets progressively closer to the estimated parameter as the sample size increases.
16 R. A. Fisher, "Theory of Statistical Estimation", Proceedings of the Cambridge Philosophical Society 22 (1925) 700-725.
\mathcal{L} \equiv \ln L = \sum_{i=1}^{n} \ln f(x_i|\theta),   (1.24.2)
a quantity that some statisticians have termed the support function, but which I will refer to simply as the log-likelihood. In the general case of m parameters {\theta_1 ... \theta_m} one must then solve the set of equations

\frac{\partial \mathcal{L}}{\partial \theta_j} = \sum_{i=1}^{n} \frac{\partial f(x_i|\theta)/\partial \theta_j}{f(x_i|\theta)} = 0 \qquad (j = 1 \ldots m).   (1.24.3)
The variance of each ML estimate and covariance of pairs of estimates are given by the elements of a covariance matrix C = H^{-1}, where C_{jj} = \sigma_j^2, C_{jk} = cov(\theta_j, \theta_k) are derived from the second derivatives of the log-likelihood

H_{jk} \equiv -\left(\frac{\partial^2 \mathcal{L}}{\partial \theta_j\, \partial \theta_k}\right)_{\hat\theta} \qquad (j, k = 1 \ldots m).   (1.24.4)
The symbol \hat\theta appended to the bracket signifies that the second derivatives are to be evaluated by substitution of the ML values of the parameters {\hat\theta_j}.
The preceding method for estimating uncertainty of the parameters follows
straightforwardly from the structure and interpretation of the log-likelihood function
expanded in a Taylor series about the ML values of its argument. For simplicity,
consider the example of two parameters:
\mathcal{L} \equiv \ln L(\theta_1, \theta_2) = \mathcal{L}(\hat\theta_1, \hat\theta_2) + \sum_{i=1}^{2} \left(\frac{\partial \mathcal{L}}{\partial \theta_i}\right)_{\hat\theta}(\theta_i - \hat\theta_i) + \frac{1}{2}\sum_{i,j=1}^{2} \left(\frac{\partial^2 \mathcal{L}}{\partial \theta_i \partial \theta_j}\right)_{\hat\theta}(\theta_i - \hat\theta_i)(\theta_j - \hat\theta_j) + \cdots

= \mathcal{L}(\hat\theta_1, \hat\theta_2) - \frac{1}{2}\sum_{i,j=1}^{2} H_{ij}(\theta_i - \hat\theta_i)(\theta_j - \hat\theta_j)

= \mathcal{L}(\hat\theta_1, \hat\theta_2) - \frac{1}{2} U^T H U = \mathcal{L}(\hat\theta_1, \hat\theta_2) - \frac{1}{2} U^T C^{-1} U.   (1.24.5)
In the first line of the expansion, the term involving a sum over first derivatives of
L vanishes by virtue of the ML maximization procedure. The second line shows
the reduced expression with matrix elements of H substituted for the second
derivatives of L. The third line shows the equivalent expression in terms of the
parameter vector

U = \begin{pmatrix} \theta_1 - \hat\theta_1 \\ \theta_2 - \hat\theta_2 \end{pmatrix}   (1.24.6)
(and its transpose U^T) and the inverse of the covariance matrix C

C = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},   (1.24.7)

where the correlation coefficient is defined by \rho \equiv \mathrm{cov}(\hat\theta_1, \hat\theta_2)/\sigma_1\sigma_2. The matrices H and
C are related as follows

H = \begin{pmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{pmatrix} = C^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix} 1/\sigma_1^2 & -\rho/\sigma_1\sigma_2 \\ -\rho/\sigma_1\sigma_2 & 1/\sigma_2^2 \end{pmatrix}.   (1.24.8)
Upon neglect of derivatives higher than second, the likelihood function then becomes proportional to the negative exponential of a quadratic form

L(\hat\theta_1, \hat\theta_2 | D) \propto e^{-\frac{1}{2} U^T C^{-1} U},   (1.24.9)

which is recognized as a multivariable Gaussian function of the ML parameters (\hat\theta_1, \hat\theta_2) and data D. For a single variable, the exponential (1.24.9)
reduces to the familiar form p(\theta|D) \propto e^{-(\theta - \hat\theta)^2/2\sigma^2}; for two parameters the likelihood becomes

L(\hat\theta_1, \hat\theta_2 | D) \propto e^{-\frac{1}{2}\left[H_{11}(\theta_1 - \hat\theta_1)^2 + H_{22}(\theta_2 - \hat\theta_2)^2 + 2H_{12}(\theta_1 - \hat\theta_1)(\theta_2 - \hat\theta_2)\right]}

= e^{-\frac{1}{2(1-\rho^2)}\left[\frac{(\theta_1 - \hat\theta_1)^2}{\sigma_1^2} + \frac{(\theta_2 - \hat\theta_2)^2}{\sigma_2^2} - \frac{2\rho(\theta_1 - \hat\theta_1)(\theta_2 - \hat\theta_2)}{\sigma_1\sigma_2}\right]}   (1.24.10)
and shows explicitly the connection between second derivatives of the likelihood
function and the uncertainties in parameter estimates. The preceding formalism is
readily generalizable to any number of parameters.
For illustration and later use consider a set of data {x_i, i = 1...n} presumed to be a sample from a Gaussian distribution of unknown mean (\theta_1 = \mu) and variance (\theta_2 = \sigma^2). The likelihood and its log take the forms
L = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2}\, e^{-(x_i - \mu)^2/2\sigma^2} = (2\pi\sigma^2)^{-n/2}\, e^{-\sum_{i=1}^{n}(x_i - \mu)^2/2\sigma^2}   (1.24.11)
\mathcal{L} = \ln L = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2,   (1.24.12)
whose maximization

\frac{\partial \mathcal{L}}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \qquad \frac{\partial \mathcal{L}}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0

leads to the ML estimates

\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i   (1.24.13)

\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat\mu)^2.   (1.24.14)
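The estimates (1.24.13) and (1.24.14) can be exercised on synthetic data (an illustrative Python sketch; the population parameters, sample size, and seed are arbitrary choices):

```python
import random

# ML estimates for a Gaussian sample: the sample mean and the 1/n variance
rng = random.Random(42)
data = [rng.gauss(5.0, 2.0) for _ in range(50_000)]

n = len(data)
mu_hat = sum(data) / n
sig2_hat = sum((x - mu_hat)**2 for x in data) / n   # note 1/n, not 1/(n-1)

print(mu_hat)     # near the population mean 5
print(sig2_hat)   # near the population variance 4
```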
The ML estimator for \mu is therefore the sample mean, a random variable defined by the expression17

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.   (1.24.15)
The expectation of \bar{X} gives an unbiased estimate of the population mean (i.e. of the location parameter \mu in the theoretical probability density):

\langle \bar{X} \rangle = \left\langle \frac{1}{n}\sum_{i=1}^{n} X_i \right\rangle = \frac{1}{n}\sum_{i=1}^{n} \langle X_i \rangle = \frac{1}{n}(n\mu) = \mu.   (1.24.16)
17
The overbar is used to represent both a sample average (random variable) and negation (hypothesis); the two uses are
very different and should cause no confusion.
This is not the case, however, for the ML estimator of population variance defined by

S'^2 \equiv \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{n-1}{n}\, S^2   (1.24.17)

in contrast to the sample variance

S^2 \equiv \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2   (1.24.18)
whose expectation gives the unbiased value \sigma^2. The bias in S'^2 arises from the presence in the sum of squares of the statistic \bar{X}, which is a random variable, in contrast to the population mean \mu, which is an unknown, but fixed, parameter. To see this, note that the partition of the sum18

\sum_{i=1}^{n}(X_i - \mu)^2 = \sum_{i=1}^{n}\left[(X_i - \bar{X}) + (\bar{X} - \mu)\right]^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2 + n(\bar{X} - \mu)^2   (1.24.19)

has the expectation

n\sigma^2 = \left\langle \sum_{i=1}^{n}(X_i - \bar{X})^2 \right\rangle + n\langle(\bar{X} - \mu)^2\rangle = \left\langle \sum_{i=1}^{n}(X_i - \bar{X})^2 \right\rangle + \sigma^2   (1.24.20)
that reduces to

\langle (n-1)S^2 \rangle = \langle n S'^2 \rangle = (n-1)\sigma^2 \quad \Longrightarrow \quad \langle S^2 \rangle = \sigma^2, \qquad \langle S'^2 \rangle = \frac{n-1}{n}\sigma^2.   (1.24.21)
The second derivatives of the log-likelihood (1.24.12) evaluated at the ML values are

\left(\frac{\partial^2 \mathcal{L}}{\partial \mu^2}\right)_{\hat\mu, \hat\sigma^2} = -\frac{n}{\hat\sigma^2} \qquad \left(\frac{\partial^2 \mathcal{L}}{\partial (\sigma^2)^2}\right)_{\hat\mu, \hat\sigma^2} = \left[\frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^{n}(x_i - \hat\mu)^2\right]_{\hat\mu, \hat\sigma^2} = -\frac{n}{2\hat\sigma^4},   (1.24.22)

with vanishing cross derivative,
18
The cross term in the binomial expansion of the expression in square brackets vanishes identically as a consequence of
the definition of relation (1.24.16).
the negative inverse of which yields the covariance matrix whose elements constitute
the variances of the ML parameters
\mathrm{var}(\hat\mu) = \frac{\hat\sigma^2}{n}   (1.24.23)

\mathrm{var}(\hat\sigma^2) = \frac{2\hat\sigma^4}{n}   (1.24.24)
with zero covariance. This means that the ML estimators derived above are independent, asymptotically normal random variables of the forms

\theta_1 = N\!\left(\hat\mu,\ \frac{\hat\sigma^2}{n}\right) \qquad \theta_2 = N\!\left(\hat\sigma^2,\ \frac{2\hat\sigma^4}{n}\right).   (1.24.25)
Note, as pointed out previously, that the variance of the mean is smaller than the
variance of a single observation by the factor n [a relation also contributing to
Eq. (1.24.21)].
The property of normality and the variance (1.24.23) of the ML estimator \bar{X} are actually valid statements irrespective of the size n of the sample. However, the exact variance of the ML estimator S'^2 can be shown to be 2\sigma^4(n-1)/n^2, which asymptotically reduces to the expression in (1.24.24). The explanation for this is that the exact distribution of the sample variance \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2, which is proportional to a form \sum_{i=1}^{n}\left[\frac{X_i - \bar{X}}{\sigma/\sqrt{n}}\right]^2 constructed to be the sum of the squares of n standard normal random variables, is not Gaussian, but a chi-square distribution \chi^2_{n-1}. There are n - 1, rather than n, degrees of freedom because the sample mean \bar{X} is itself calculated from the data and, once known, signifies that only n - 1 of the set of variates {X_i} are independent.
One last point of interest in regard to the variances of the ML estimates for \mu and \sigma^2 is to see how they compare with the lower bound of the Cramer-Rao theorem, which can take either of the two forms below for an estimate of a function \tau(\theta):

\mathrm{var}_{CR}(\hat\tau) \ge \frac{(d\tau/d\theta)^2}{n\left\langle\left[\partial \ln f(X|\theta)/\partial\theta\right]^2\right\rangle}   (1.24.26)

= -\frac{(d\tau/d\theta)^2}{n\left\langle \partial^2 \ln f(X|\theta)/\partial\theta^2 \right\rangle}.   (1.24.27)
1:24:27
Applied to the Gaussian parameters, the bounds are

\mathrm{var}_{CR}(\hat\mu) = \frac{1}{n\left\langle\left[(X - \mu)/\sigma^2\right]^2\right\rangle} = \frac{\sigma^2}{n}   (1.24.28)

\mathrm{var}_{CR}(\hat\sigma^2) = \frac{1}{n\left\langle\left[-\frac{1}{2\sigma^2} + \frac{(X - \mu)^2}{2\sigma^4}\right]^2\right\rangle} = \frac{2\sigma^4}{n},   (1.24.29)
where use was made of the expectations \langle Z^2 \rangle = 1 and \langle Z^4 \rangle = 3 of the standard normal variable Z = (X - \mu)/\sigma. Comparison with (1.24.23) shows that the ML variances of the Gaussian parameters are as small as theoretically possible. The same minimum variances would have been obtained had we used the second equality in (1.24.26).
\Pr(\{n_k\}|\{p_k\}) = n! \prod_{k=1}^{K} \frac{p_k^{n_k}}{n_k!}   (1.25.1)

for the totality of n trials. In general, apart from the completeness relation

\sum_{k=1}^{K} p_k = 1,
we might not know the probability pk for an outcome to take the value Ak, but we can
do two things: (a) estimate the maximum likelihood (ML) probabilities from the
frequency data, and (b) make a theoretical model of the random process that has
generated the data. Consider first the ML estimate.
In the case of a large sample size n, the log-likelihood function of the multinomial expression (1.25.1) can be written and simplified as shown below

\mathcal{L} = \ln L = \ln\left(n! \prod_{k=1}^{K} \frac{p_k^{n_k}}{n_k!}\right) = \sum_{k=1}^{K} n_k \ln p_k - \sum_{k=1}^{K} \ln n_k! + \ln n!

\approx \sum_{k=1}^{K} n_k \ln p_k - \sum_{k=1}^{K} n_k \ln n_k + n \ln n,   (1.25.2)
where we have approximated the natural log of a factorial n! by the two largest terms [\ln n! \approx n \ln n - n] in Stirling's approximation
n! = \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\left(1 + \frac{1}{12n} + \frac{1}{288n^2} + \cdots\right).   (1.25.3)
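Even at very modest n the series (1.25.3) is strikingly accurate, as a short check shows (illustrative Python; the tested values of n are arbitrary):

```python
import math

def stirling(n):
    """Stirling series of Eq. (1.25.3) through the 1/(288 n^2) term."""
    correction = 1.0 + 1.0/(12*n) + 1.0/(288*n**2)
    return math.sqrt(2*math.pi*n) * (n/math.e)**n * correction

for n in (5, 10):
    print(n, math.factorial(n), stirling(n))
```

For n = 5 the two-term correction already reproduces 5! = 120 to better than 0.003.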
To maximize \mathcal{L} with respect to the set of parameters {p_k} given only the completeness relation, we introduce a single Lagrange multiplier \lambda to form the functional

\mathcal{L}' = \sum_{k=1}^{K} n_k \ln p_k + \lambda\left(1 - \sum_{k=1}^{K} p_k\right)   (1.25.4)
with omission of all terms not containing the parameters since they would vanish anyway from the ML equations

\frac{\partial \mathcal{L}'}{\partial p_k} = \frac{n_k}{p_k} - \lambda = 0 \qquad (k = 1 \ldots K).   (1.25.5)

Substitution into the completeness relation yields \lambda = n and therefore the ML probabilities

\hat{p}_k = \frac{n_k}{n}.   (1.25.6)
L_{max} = L\big(\{n_k\}\,\big|\,\{\hat p_k = n_k/n\}\big) = n! \prod_{k=1}^{K} \frac{(n_k/n)^{n_k}}{n_k!}.   (1.25.7)

The ratio of the likelihood L' of a hypothesized model {\hat f_k} to L_{max} simplifies because the products of factorials in numerator and denominator cancel. The log of the ratio then yields a relation
\ln\frac{L'}{L_{max}} = \sum_{k=1}^{K} n_k \ln\left(\frac{\hat f_k}{n_k}\right) + n \ln n = -\sum_{k=1}^{K} n_k \ln\left(\frac{n_k}{n\hat f_k}\right)   (1.25.8)
from which one can calculate how likely the null hypothesis is in comparison to the
maximum likelihood.
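As an illustration (hypothetical counts and model probabilities, not data from the text), the log-likelihood ratio (1.25.8) can be evaluated directly and compared with the chi-square statistic introduced below in (1.25.13):

```python
import math

# Hypothetical histogram of n = 100 counts in K = 4 classes, and a uniform
# model f_hat for the class probabilities
counts = [22, 31, 27, 20]
f_hat = [0.25, 0.25, 0.25, 0.25]
n = sum(counts)

# Eq. (1.25.8): log of the ratio of model likelihood to maximum likelihood
log_ratio = -sum(nk * math.log(nk / (n*fk)) for nk, fk in zip(counts, f_hat))

# Pearson chi-square statistic over the same classes
chi2 = sum((nk - n*fk)**2 / (n*fk) for nk, fk in zip(counts, f_hat))

print(log_ratio)     # ln(L'/Lmax)
print(-0.5 * chi2)   # nearly the same when all deviations are small
```

The near-equality of the two printed values is the quadratic approximation derived in the next few equations.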
An advantage to the use of the likelihood ratio for comparison of two hypotheses or models is that it is invariant under a transformation of parameters. For example, if you wanted to test whether the parameter \theta_1 or \theta_2 characterized a set of data believed to be drawn from a distribution with pdf \propto e^{-x/\theta}, the likelihood ratio would be the same if, instead, you transformed the distribution by \theta \to \theta^2 and then tested for parameters \theta_1^2 and \theta_2^2. The example is a trivial one, but the conclusion still holds in the general case of more complicated transformations of a multi-component parameter vector. The reason for the invariance is that the likelihood ratio is a value at a point, rather than an integral over a range.
That same asset can become a disadvantage, however, to using Eq. (1.25.8) for inference because the distribution function associated with the likelihood ratio in specific cases may be difficult or impossible to determine, and so to say that one model is 50% as likely as another does not tell us how probable either is. The power of a statistical test of inference is defined to be the probability of rejecting a hypothesized model when it is correct, i.e. when the parameters of the model are the true but unknown parameters of the distribution from which the data were obtained. A test is the more powerful if it can reject the null hypothesis with a lower probability of making a false judgment. In a significance test of a model, an ideal power function would be 0 if the parameters of the model corresponded to the true parameters, and 1 otherwise. In general, the likelihood function is not a probability but a conditional probability density, a fact that is a virtue to some and a liability to others.
With the adoption of a few approximations and some algebraic rearrangements, the final expression in (1.25.8) can be worked into a form with a known distribution irrespective of the null hypothesis. To see this, start by

(a) adding and subtracting 1 in the argument of the logarithm,
(b) adding and subtracting n\hat f_k in the pre-factor, and
(c) dividing and multiplying the entire summand by n\hat f_k
\ln\frac{L'}{L_{max}} = -\sum_{k=1}^{K}\left[(n_k - n\hat f_k) + n\hat f_k\right]\ln\left(1 + \frac{n_k - n\hat f_k}{n\hat f_k}\right) = -\sum_{k=1}^{K} n\hat f_k (1 + \epsilon_k)\ln(1 + \epsilon_k)   (1.25.9)

so as to express the log-likelihood ratio in terms of a quantity \epsilon_k \equiv \frac{n_k - n\hat f_k}{n\hat f_k}, expected to be small when the model fits the data well. Expansion of the logarithm then leads to

\ln\frac{L'}{L_{max}} = -\sum_{k=1}^{K} n\hat f_k\left(\epsilon_k + \frac{1}{2}\epsilon_k^2 - \frac{1}{6}\epsilon_k^3 + \cdots\right).   (1.25.10)
The first-order term vanishes identically,

\sum_{k=1}^{K} n\hat f_k\, \epsilon_k = \sum_{k=1}^{K}(n_k - n\hat f_k) = n - n\sum_{k=1}^{K}\hat f_k = n - n = 0,   (1.25.11)

so that to leading order

\ln\frac{L'}{L_{max}} \approx -\frac{1}{2}\sum_{k=1}^{K} n\hat f_k\, \epsilon_k^2,   (1.25.12)

and the statistic

\chi^2 \equiv \sum_{k=1}^{K} \frac{(n_k - n\hat f_k)^2}{n\hat f_k} \Rightarrow \chi^2_{d = K - 1 - m}.   (1.25.13)
50% would then be z = \frac{|1.386\,d - d|}{\sqrt{2d}} \gg 1 for large d, which would indicate a highly improbable result for the hypothetical model. Clearly, the same value of likelihood can lead to radically different values of probability and therefore to different inferences regarding statistical significance.
From the foregoing discussion and in particular Eq. (1.25.12) it is seen that the
chi-square test of a theoretical model is not independent of the maximum likelihood
method, but follows from it as an approximation. Maximization of the log-likelihood
ratio corresponds to minimization of the resulting chi-square statistic. If, as an additional approximation, there was justification in assuming that the variance was the same for each class, then the denominator could be taken outside the sum in relation (1.25.13). Moreover, if the parameters had not been estimated by the ML method at the outset, then the function f_k(\theta) \equiv f(A_k|\theta) would replace \hat f_k \equiv f(A_k|\hat\theta) in the sum. Under these conditions, maximizing the log-likelihood ratio corresponds to minimizing the statistic Q'_K = \sum_{k=1}^{K}\left[n_k - n f_k(\theta)\right]^2 with respect to the unknown parameters.
19 Statisticians may cringe, but, as a physicist, I am using the same symbol \chi^2_d for the random variable, its observed value in a given situation, and the associated distribution. Hopefully this economy of notation will add to the clarity of the text rather than to confusion.
P = \Pr(\chi^2 > \chi^2_{obs}) = \int_{\chi^2_{obs}}^{\infty} \frac{x^{d/2-1}\, e^{-x/2}}{2^{d/2}\,\Gamma(d/2)}\, dx   (1.25.14)
that subsequent sets of observations would yield chi-square values at least as large as the observed value \chi^2_{obs} for the given number d of degrees of freedom. A small value of P, corresponding to \chi^2_{obs} \gg d, is ordinarily interpreted as signifying that the discrepancy between theory and observation is not likely to have arisen by chance alone and therefore the proposed model may not be a good one. Fisher had initially adopted, and statisticians have subsequently largely followed, a standard of 5% (i.e. P < 0.05) as a threshold for rejecting a particular model. So entrenched is the use of P-values in the scientific literature that any manuscript containing a statistical analysis of experimental results is likely to be rejected by a reviewer or editor if P-values are not part of the tests of significance.
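Eq. (1.25.14) is straightforward to evaluate numerically. The sketch below (illustrative Python; in practice one would call a library's regularized incomplete gamma routine, and the cutoff and step count here are arbitrary choices) integrates the chi-square density directly:

```python
import math

def chi2_pvalue(chi2_obs, d, steps=100_000):
    """P-value of Eq. (1.25.14): area of the chi-square density with d
    degrees of freedom above chi2_obs, by simple trapezoidal integration."""
    def pdf(x):
        return x**(d/2 - 1) * math.exp(-x/2) / (2**(d/2) * math.gamma(d/2))
    upper = chi2_obs + 40*d + 100      # tail beyond this is negligible
    h = (upper - chi2_obs) / steps
    s = 0.5*(pdf(chi2_obs) + pdf(upper))
    for i in range(1, steps):
        s += pdf(chi2_obs + i*h)
    return s*h

print(chi2_pvalue(3.84, 1))   # close to the conventional 5% threshold
```

The value 3.84 is the familiar 5% critical point of \chi^2_1, consistent with the Fisher convention quoted above.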
There are a number of issues, however, surrounding the concept of P-values and the
use of chi-square tests in general, that have generated over the years a vast volume of
commentary by statisticians. I will summarize briefly what to me are the most pertinent
and provocative criticisms, which indeed need to be borne in mind if errors of analysis
and interpretation are to be avoided, and I will append my own commentary at the end.
At root, the issue raised by Jeffreys is this: In judging the credibility of a model and
its alternative(s) is it more appropriate to take (a) the area under the tail of the
20 H. Jeffreys, Theory of Probability, 3rd Edition (Oxford University Press, London, 1961) (1st Edition 1939).
chi-square distribution (i.e. the P-value) or (b) a ratio of the ordinates (i.e. point-value) of the probability density of the statistic?
W. G. Cochran, "The \chi^2 Test of Goodness of Fit", The Annals of Mathematical Statistics 23 (1952) 337.
G. U. Yule and M. G. Kendall, An Introduction to the Theory of Statistics (Griffin, London, 1940) 423.
A. W. F. Edwards, Likelihood (Johns Hopkins, Baltimore, 1992) 188. Original Cambridge edition 1972.
H. B. Mann and A. Wald, "On the choice of the number of class intervals in the application of the chi square test", Annals of Mathematical Statistics 13 (1942) 306-317.
1.25.4 Why bother with \chi^2 anyway since all models would fail if the sample is large enough?
The claim has been made that, in testing a null hypothesis which is not expected to be
exactly true, but credible to a good approximation, the hypothesis will always fail a
chi-square test applied to a sufficiently large sample of experimental data. Phrased
provocatively, one statistician wrote25
I make the following dogmatic statement, referring for illustration to the normal curve: If the normal curve is fitted to a body of data representing any real observations whatever of quantities in the physical world, then if the number of observations is extremely large, for instance, on the order of 200,000, the chi-square P will be small beyond any usual limit of significance.
Before adding my own two cents, first an admission: I have selectively quoted comments from statisticians so as to frame their remarks in the most confrontational way to highlight issues that I believe really are important and deserve careful attention. No statistician, however (at least none whose papers I have read), actually recommended discarding the chi-square test. No experimental physicist would in any event do that because the test is far too useful and easily implemented (. . . and required for publication).
Much of the confusion that may accompany use of a chi-square test can be
avoided by keeping in mind that the original test statistic followed a multinomial
distribution (1.25.1) from which the chi-square statistic arose in consequence of three
approximations: (1) Stirlings approximation of factorials; (2) Taylor expansion of a
natural logarithm; and (3) substitution of a continuous integral for a discrete summation. So long as each expectation nfk of the tested model f (xj) is reasonably large,
the reduction is reasonably valid, and the chi-square statistic (1.25.13) is distributed
as 2d to good approximation. If necessary, one may combine classes to achieve a
suitable expectation, which for satisfactory testing should be no fewer than about
510 as a rule of thumb. There was nothing in the derivation, as far as I can see, that
subsequently restricted the chi-square test of significance to the variance of a model
to the exclusion of all other attributes.
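As a concrete illustration of the class-combining rule of thumb, here is a minimal Python sketch (the counts and model probabilities are invented for illustration) that merges low-expectation classes before applying the test:

```python
import numpy as np
from scipy import stats

def merge_classes(observed, expected, min_expectation=5.0):
    """Merge each low-expectation class into its left neighbor until
    every expectation n*f_k reaches min_expectation (the 5-10 rule)."""
    obs, exp = list(observed), list(expected)
    for i in range(len(exp) - 1, 0, -1):
        if exp[i] < min_expectation:
            exp[i - 1] += exp[i]
            obs[i - 1] += obs[i]
            del exp[i], obs[i]
    return np.array(obs), np.array(exp)

# Hypothetical sample of 100 events sorted into 8 classes, with class
# probabilities taken from a hypothetical model.
observed = np.array([6, 18, 30, 24, 12, 6, 3, 1])
expected = 100 * np.array([0.05, 0.15, 0.28, 0.25, 0.15, 0.07, 0.03, 0.02])

obs_m, exp_m = merge_classes(observed, expected)
chi2, p = stats.chisquare(obs_m, f_exp=exp_m)
print(len(exp_m), round(chi2, 3), round(p, 3))
```

Note that `scipy.stats.chisquare` uses k − 1 degrees of freedom by default; if parameters of the model were estimated from the data, the `ddof` argument should be set accordingly.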
25 J. Berkson, "Some difficulties of interpretation encountered in the application of the chi-square test", Journal of the American Statistical Association 33 (1938) 526-536.
26 W. G. Cochran, op. cit. p. 336.
The arbitrariness of classes and boundaries arises only in testing the significance of
a continuous distribution, for in the case of a discrete distribution where specific
objects are counted (e.g. photons, electrons, phone calls . . . whatever), there is a
natural, irreducible assignment of classes whereby each class differs in integer value
from the one that comes before or after by one unit. This may not be the most
practical choice for every test, since it may require a very large sample size, but
conceptually, at least, it establishes a non-arbitrary standard.
In the case where data arising from a discrete distribution have been approximated by (or transformed into) continuous random variables, there is a simple procedure for avoiding a ridiculously large and statistically unwarranted chi-square. Statisticians pointed this out long ago,27 but, unaware of their papers, I discovered it for myself in testing a distribution of counts from a radioactive source. The experience makes for a lesson worth relating. The counts, which were all integers believed on theoretical grounds to be Poisson variates, decreased (on average) in time as the experiment progressed because of the diminishing sample of nuclei. In the next chapter I will discuss in detail the statistics of nuclear decay. For now, however, suffice it to say that a standard procedure in the analysis of nuclear data is to remove the negative trend line in order to examine the variation in counts as if the population of radioactive nuclei were infinite. In de-trending the data, however, the transformed numbers were no longer integers. Sorted into 90 classes, the data were tested for goodness of fit by a Poisson distribution of known mean, leading to an astounding result of $\chi^2_{89} > 1600$, where a number around 90 was expected. A previous test on the original (not de-trended) data had given highly satisfactory results. What went wrong?

The 90 classes $\{A_k,\ k = 1 \ldots 90\}$ were labeled by the number of counts obtained in a specified window of time (one bin of data); thus $A_1 = 150$, $A_2 = 151$, $A_3 = 152$, etc. In the test on the de-trended data, the frequency of outcomes $x$ for $A_{k+1} > x \ge A_k$ was compared with the Poisson probability for $A_k$, and this gave a very high chi-square, suggesting that the null hypothesis (namely, that the data were Poisson variates) was untenable. However, if the class values were shifted by 0.5, so that the central value of each class was an integer, i.e. $A_k + \tfrac12 > x \ge A_k - \tfrac12$, the chi-square of the de-trended data became 85.14 for 90 classes, corresponding to $P = 0.596$, which was entirely reasonable.
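The class-boundary effect can be mimicked in a small simulation. The sketch below is schematic, not the author's actual data: the mean, sample size, and de-trending factor are invented. Each integer count is multiplied by a slowly drifting factor near 1, so the corrected values straddle the integers by less than half a unit; classes with edges at the integers then misassign roughly half the values, while integer-centered classes recover the exact counts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam, n = 150.0, 50000
counts = rng.poisson(lam, size=n)          # integer Poisson counts

# Hypothetical slow de-trending correction: a multiplicative factor near 1
# whose sign of displacement varies through the run (|displacement| < 0.5).
drift = 1.0 + 0.0025 * np.sin(np.linspace(0.0, 20.0, n))
y = counts * drift                         # "de-trended", no longer integers

ks = np.arange(100, 201)                   # integer class labels
p = stats.poisson.pmf(ks, lam)
expect = n * p / p.sum()

def chi2_for(assigned):
    obs = np.array([(assigned == k).sum() for k in ks])
    return ((obs - expect) ** 2 / expect).sum()

chi2_edges = chi2_for(np.floor(y))    # classes k <= y < k+1 (edges at integers)
chi2_center = chi2_for(np.round(y))   # classes k-1/2 <= y < k+1/2 (integer centers)
print(round(chi2_edges, 1), round(chi2_center, 1))
```

With integer-centered classes the chi-square is near the number of classes, as it should be; with edges at the integers it is systematically inflated.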
One must likewise be aware of the circumstances under which a discrete distribution is approximated by a continuous one. Return to the previous example where data originated as integer counts of particles from a sample of radioactive nuclei. The mean number of counts $\bar{x}$ per bin being much larger than 1, the hypothesized Poisson distribution $\mathrm{Poi}(\mu)$, with population mean $\mu$ estimated by the sample mean $\bar{x}$, should have been well approximated by a normal distribution $N(\bar{x}, \bar{x})$. However, a chi-square test of the goodness of fit of $N(0, 1)$ to the data in standard normal form

27 M. G. Kendall and A. Stuart, The Advanced Theory of Statistics Vol. 2: Inference and Relationship (Griffin, London, 1961) 508-509.
$z = (x - \bar{x})/\sqrt{\bar{x}}$ led to so high a value of $\chi^2$ that the presumed model would have been unambiguously rejected. Again, what went wrong?
The problem in this instance lay not with locations of class boundaries, but with the widths of class intervals. The transformed data $z$ are not integers, but neither are they continuously distributed. Since the values of the counts $x$ are always integers, the values of $z$ can have a minimum separation of $\bar{x}^{-1/2}$. Thus, if one makes the bin width smaller than that minimum, there can result numerous bins of 0 count, which causes failure of the chi-square test. With adequately sized bin widths, a value of chi-square and associated P-value were obtained that did not justify rejection of the null hypothesis. Note that there was nothing intrinsically wrong with applying the test to a continuous distribution so long as one took steps to ensure that the data being tested actually were continuously distributed. Nor does the fact that I could get either a high P or low P by changing the size of the bins imply that the test outcomes were arbitrary and therefore meaningless. On the contrary, the low P-value resulted from executing the test under conditions that were inappropriate in two related ways: (a) testing goodness of fit of a continuous distribution to quasi-discrete data, which resulted in (b) violation of an approximation leading to the chi-square statistic (i.e. no empty bins).
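A sketch of the bin-width effect, again with invented parameters: the standardized values lie on a lattice of spacing $\bar{x}^{-1/2}$, so bins narrower than that spacing are guaranteed to be empty, while bins of exactly that width (centered on the lattice) behave sensibly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.poisson(150, size=20000)           # integer counts
xbar = x.mean()
z = (x - xbar) / np.sqrt(xbar)             # standardized, quasi-discrete
s = 1.0 / np.sqrt(xbar)                    # minimum separation of the z values

def gof(width):
    """Chi-square of an N(0,1) fit using bins of the given width,
    aligned so that z = 0 sits at a bin center."""
    m = int(np.ceil(3.0 / width))
    edges = (np.arange(-m, m + 2) - 0.5) * width
    obs, _ = np.histogram(z, bins=edges)
    p = np.diff(stats.norm.cdf(edges))
    exp = obs.sum() * p / p.sum()
    chi2 = ((obs - exp) ** 2 / exp).sum()
    return chi2, int((obs == 0).sum())

chi2_ok, empty_ok = gof(s)        # width equal to the lattice spacing
chi2_bad, empty_bad = gof(s / 4)  # width below the minimum separation
print(round(chi2_ok, 1), empty_ok, round(chi2_bad, 1), empty_bad)
```

The undersized bins produce hundreds of empty classes and a runaway chi-square; the properly sized bins do not.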
The same suite of investigations convinced me that the assertion that any model fitted to "a body of data representing . . . quantities in the physical world" would fail a chi-square test, given a sufficiently large (e.g. > 200 000) number of observations, was entirely without foundation. If the model is a true representation of the body of data (i.e. the model captures the essential features of the stochastic process that generates the data) then a chi-square test can yield a respectable P-value for any sample size. In testing, for example, 1 000 000 standard normal variates, sorted into 400 classes, for goodness of fit to $N(0, 1)$, I have obtained $\chi^2_{399} = 419$, giving $P = 0.236$.
However (and here is a point of critical importance that all too often seems to have been overlooked in the confused wrangle over the meaning or worth of P-values) the quantity P is itself a random variable. As a cumulative probability [see Eq. (1.25.14)], P is governed by a uniform distribution [see (1.17.12)] with mean $\tfrac12$ and variance $\tfrac{1}{12}$. Therefore, obvious though it may be to state this, one should not expect too much from a single P, any more than is to be expected from a single nuclear count or the reply of a single respondent to a poll. That does not mean that either P or $\chi^2$ is not useful. Rather, if an inference to be made is important, then it is incumbent upon the investigator to collect sufficient data (even if that means more time-consuming experiments and fewer publications) to determine how the P or $\chi^2$ is distributed. If discrepancies between the hypothetical model (null hypothesis) and the data are due to pure chance, then, although a range of P values from low to high will be obtained from numerous experimental repetitions, they should nevertheless follow a uniform distribution. By contrast, if a proposed model is a poor one, the P-values should nearly all be low.
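The uniformity of P under a true null hypothesis is easy to exhibit by simulation (a sketch with arbitrary sample sizes): repeat the experiment many times on data that genuinely follow the null, and collect the P-value of each repetition.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# 20 equiprobable classes for an N(0,1) null; every expectation is n/20.
edges = np.concatenate(([-np.inf],
                        stats.norm.ppf(np.linspace(0.05, 0.95, 19)),
                        [np.inf]))
p = np.diff(stats.norm.cdf(edges))       # = 0.05 for each class

pvals = []
for _ in range(200):                     # 200 repetitions of the "experiment"
    sample = rng.standard_normal(2000)
    obs, _ = np.histogram(sample, bins=edges)
    chi2 = ((obs - 2000 * p) ** 2 / (2000 * p)).sum()
    pvals.append(stats.chi2.sf(chi2, 19))
pvals = np.array(pvals)

# Under a true null, P is uniform on (0,1): mean 1/2, variance 1/12.
print(round(pvals.mean(), 3), round(pvals.var(), 3))
```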
Table 1.4

Statistics            χ²₈₉      P
Mean                  87.03     0.541
Standard Error        2.22      0.046
Median                84.83     0.605
Standard Deviation    15.69     0.322
Skewness              0.172     0.219
Kurtosis              0.203     1.376
Minimum               53.82     0.0062
Maximum               125.87    0.999
Count                 50        50
The statistics in Table 1.4 support the null hypothesis that the distribution of chi-square values arose through pure chance. Had I performed only a single test (rather than 50) of the Poisson variates and obtained a particularly low or high P-value, statisticians (e.g. those writing cautionary philosophical commentaries) would have had grave doubts about the randomness of the nuclear decays. And yet, because P is distributed uniformly (see lower panel of Figure 1.6), a P-value is just as likely to fall between 0.0 and 0.1 as between 0.4 and 0.5.
The lesson in all this (if there is one) is that ambiguous or troubling outcomes to chi-square tests often stem from insufficient data, a problem that can be solved by experiment, not philosophy.
[Figure 1.6. Histograms of the results of the 50 chi-square tests of the de-trended nuclear counts: upper panel, frequency vs chi-square for the 50 values of $\chi^2_{89}$; lower panel, frequency vs P-value for the 50 corresponding P-values.]
$$F_{Y_i}(y) = \sum_{j=i}^{n} \Pr\left(Y_j \le y,\ Y_{j+1} > y\right) \qquad (1.26.1)$$

In words, the preceding expression conveys the idea that the total probability for the ith order statistic to be less than or equal to y is a sum over the probabilities of mutually exclusive events whereby the condition $Y_i \le y$ is met at most by $Y_i$, or by $Y_{i+1}$ as well, or by $Y_{i+2}$ as well, . . . or by $Y_n$ as well. The occurrence of any one of these events (for example, the event that $Y_j \le y$, but $Y_{j+1} > y$) signifies that $j$ out of the $n$ random variables $\{X_i\}$ satisfy this inequality, and $n - j$ do not, a condition that could have occurred in $\binom{n}{j}$ different ways. Since the probability that a specific selection of $j$ variates is less than or equal to $y$ and the remaining $n - j$ of them are greater than $y$ is $F(y)^j\,[1 - F(y)]^{n-j}$, the probability for all such selections is given by the binomial expression

$$\Pr\left(Y_j \le y,\ Y_{j+1} > y\right) = \binom{n}{j} F(y)^j \left[1 - F(y)\right]^{n-j}, \qquad (1.26.2)$$

whereupon the total probability in (1.26.1) is given by the sum (starting with index i)

$$F_{Y_i}(y) = \sum_{j=i}^{n} \binom{n}{j} F(y)^j \left[1 - F(y)\right]^{n-j}. \qquad (1.26.3)$$
Of particular interest are the extreme order statistics $Y_1$ and $Y_n$, which are deducible immediately from (1.26.3):

$$F_{Y_1}(y) = \sum_{j=1}^{n} \binom{n}{j} F(y)^j \left[1 - F(y)\right]^{n-j} = \sum_{j=0}^{n} \binom{n}{j} F(y)^j \left[1 - F(y)\right]^{n-j} - F(y)^0 \left[1 - F(y)\right]^n = 1 - \left[1 - F(y)\right]^n \qquad (1.26.4)$$

$$F_{Y_n}(y) = \binom{n}{n} F(y)^n \left[1 - F(y)\right]^0 = F(y)^n. \qquad (1.26.5)$$
The cdfs for the lowest and highest order statistics, in fact, could have been deduced immediately. Consider $Y_1$. The probability that a variable $Y$ (one of the Xs) is greater than or equal to $y$ is $1 - F(y)$, and so the probability that all $n$ variables are greater than $y$ is $(1 - F(y))^n$. Thus, the probability $F_{Y_1}(y)$ that at least one of the variates is less than or equal to $y$ is $1 - (1 - F(y))^n$. We will see this kind of reasoning again when we examine the elementary theory of nuclear decay. As for $Y_n$, since the probability that one of the variables is less than or equal to $y$ is $F(y)$, it should be fairly evident that the probability that all $n$ variables are less than or equal to $y$ is $(F(y))^n$.
Table 1.5

Statistic   Density                        Expectations                            Theory (n = 50)   Observed (n = 50)
Y₁          f_{Y₁}(y) = n(1 − y)^(n−1)     ⟨Y₁⟩ = 1/(n + 1)                        0.0196            0.006 17
                                           ⟨Y₁²⟩ = 2/[(n + 1)(n + 2)]              0.000 75
                                           σ_{Y₁} = √(n/(n + 2))/(n + 1)           0.0192
Yₙ          f_{Yₙ}(y) = n y^(n−1)          ⟨Yₙ⟩ = n/(n + 1)                        0.9804            0.999
                                           ⟨Yₙ²⟩ = n/(n + 2)                       0.9615
                                           σ_{Yₙ} = √(n/(n + 2))/(n + 1)           0.0192
The probability density of the ith order statistic is

$$f_{Y_i}(y) = \frac{n!}{(i-1)!\,(n-i)!}\, F(y)^{i-1}\, \left[1 - F(y)\right]^{n-i}\, f(y). \qquad (1.26.6)$$

The details are left to an appendix. The pdfs of the extreme order statistics, however, can be calculated directly and easily from (1.26.4) and (1.26.5).
Consider the circumstance, pertinent to tests of significance, where variates $\{X_i\}$ are distributed uniformly as $U(0, 1)$, in which case the cdf is simply $F(x) = x$. The pdf and first two moments of the lowest and highest order statistics may then be summarized in Table 1.5 above. Returning to the example in the previous section of the 50 chi-square variates and corresponding P-values, one sees from Table 1.5 that the observed lowest and highest Ps fall within one standard deviation of the predicted expectations. Statistical principles, more than intuition and hunches, provide the better guide for judging whether extreme events are too extreme to have occurred by chance.
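The theoretical moments of the extreme order statistics of U(0, 1) variates are easily checked by simulation (a sketch; the trial count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 50, 50000
u = rng.random((trials, n))
y1 = u.min(axis=1)                       # lowest order statistic
yn = u.max(axis=1)                       # highest order statistic

sigma = np.sqrt(n / (n + 2)) / (n + 1)   # same for both extremes: ~0.0192
print(round(y1.mean(), 4), 1 / (n + 1))  # ~0.0196
print(round(yn.mean(), 4), n / (n + 1))  # ~0.9804
print(round(y1.std(), 4), round(yn.std(), 4), round(sigma, 4))
```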
of the maximum likelihood method to special cases, I prefer to think of the maximum likelihood method itself as a particular application of Bayes' theorem. For one thing, this is a friendlier perspective in discussing the matter with colleagues, since the use of Bayes' theorem has been the source of much contention in the theory of statistical inference. But more importantly, it is basically accurate to do so, since Bayes' theorem, without the accumulated emotional overburden, is an uncontested fundamental principle in probability theory and therefore a starting point for nearly all methods of statistical estimation and inference.
Recall the structure of Bayes' theorem, Eq. (1.2.5). Given a set of experimental data D and various models (hypotheses) $H_i$ proposed to account for the data, then

$$P(H_i|D) = \frac{P(D|H_i)\,P(H_i)}{P(D)} = \frac{P(D|H_i)\,P(H_i)}{\sum_i P(D|H_i)\,P(H_i)}. \qquad (1.27.1)$$

As discussed earlier,

(1) $P(H_i)$ is the prior probability of a model based on whatever initial information may be pertinent;
(2) $P(D|H_i)$ is the likelihood, i.e. the conditional probability of obtaining the experimental results given a particular model; and
(3) $P(H_i|D)$ is the posterior probability of a particular model after the results of the experiment have been taken into account.
In comparing two models $H_1$, $H_2$, one way to use Bayes' theorem would be to evaluate the ratio

$$\frac{P(H_1|D)}{P(H_2|D)} = \frac{P(D|H_1)\,P(H_1)}{P(D|H_2)\,P(H_2)} \qquad (1.27.2)$$
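A minimal numerical illustration of this posterior-odds ratio, with invented inputs (two Poisson models for a single hypothetical count, and equal priors):

```python
from scipy import stats

# Hypothetical: a detector registers 12 events; model H1 predicts a
# Poisson mean of 9, model H2 a mean of 15; the priors are taken equal.
like1 = stats.poisson.pmf(12, 9)
like2 = stats.poisson.pmf(12, 15)
prior1 = prior2 = 0.5

odds = (like1 * prior1) / (like2 * prior2)          # posterior odds H1:H2
post1 = like1 * prior1 / (like1 * prior1 + like2 * prior2)
print(round(odds, 4), round(post1, 4))
```

With equal priors the odds reduce to the likelihood ratio, and the posterior probability of H1 follows as odds/(1 + odds).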
or
(ii) the mean value $\langle\theta\rangle$

$$\langle\theta\rangle = \frac{\displaystyle\int \theta\, P(D|\theta)\, p(\theta)\, d\theta}{\displaystyle\int P(D|\theta)\, p(\theta)\, d\theta} \qquad (1.27.4)$$

or
(iii) the root-mean-square (rms) value $\theta_{\mathrm{rms}}$

$$\theta_{\mathrm{rms}} = \sqrt{\langle\theta^2\rangle}, \qquad (1.27.5)$$

or
(iv) the value $\tilde\theta$ that minimizes the squared error

$$\frac{d\,\langle(\theta - \tilde\theta)^2\rangle}{d\tilde\theta} = 0, \qquad (1.27.6)$$

the solution of which works out to be $\langle\theta\rangle$:

$$\frac{d\,\langle(\theta - \tilde\theta)^2\rangle}{d\tilde\theta} = -2\langle\theta\rangle + 2\tilde\theta = 0 \ \Rightarrow\ \tilde\theta = \langle\theta\rangle. \qquad (1.27.7)$$
The impediment to using these expressions, however, and the flashpoint for much of the contention over Bayesian methods of inference, is the prior probability $p(\theta)$. In particular, what functional form does $p(\theta)$ take to represent the condition of no prior information about $\theta$, i.e. the state of ignorance?28 It is to be stressed (and this is another critical point whose misunderstanding has been the source of much contentious discussion in the past) that the prior does not assign probability to the value of the unknown parameter, which is not a random variable, but to our prior knowledge of that parameter. There have been other potentially divisive issues as well, such as repudiation by some statisticians of the very idea that the probability of a hypothesis makes any sense, but I will dispense with all that here. From my own perspective as a practical physicist, any set of non-negative numbers summing to unity and conforming to the rules of probability theory can be considered legitimate probabilities, whether they arose from frequencies or not. The essential point is that the set of numbers be testable, reproducible (statistically), and help elucidate the problem being investigated.
28 Ignorance derives from a root word meaning "not to know" and, as used in statistics, does not carry the vernacular connotations of stupidity or incompetence.
It would seem, at first, that the logical course of action would be to assume a uniform distribution for unknown parameters in those instances where one has no prior information about them. There are difficulties with this course, however. The most serious is that the estimate then depends on an arbitrary choice of how the model is parameterized. For example, if the random variables of a model are believed to be generated by a pdf of the form $p(x|\theta) \propto e^{-x^2/\theta^2}$, and one assumes a uniform distribution $p(\theta) = \mathrm{constant}$ for the prior, then one cannot assume the transformed parameter $\phi = \theta^2$ in the pdf $p(x|\phi) \propto e^{-x^2/\phi}$ to be uniformly distributed as well, because

$$p(\phi) = \frac{p(\theta)}{|d\phi/d\theta|} = \frac{\mathrm{constant}}{2\theta} \propto \phi^{-1/2}. \qquad (1.27.8)$$

And yet an analyst, having no more prior information about $\phi$ than about $\theta$, could have begun the analysis by assuming $\phi$ to be uniformly distributed. Clearly, then, there is a logical inconsistency here somewhere, since the same state of prior knowledge should lead to the same posterior estimate no matter how one chooses to label the parameters of a model.
The maximum likelihood (ML) method provides a way around the problem of priors by disregarding them and basing the estimate on the mode of the likelihood, i.e. the maximum of the conditional probability $P(D|\theta)$. The method is invariant to a transformation of parameters since, by the chain rule of calculus,

$$\frac{d}{d\phi} P(D|\phi) = \frac{d}{d\theta} P(D|\theta)\, \frac{d\theta}{d\phi}, \qquad (1.27.9)$$

and therefore $\dfrac{d}{d\phi}P(D|\phi) = 0$ if $\dfrac{d}{d\theta}P(D|\theta) = 0$, which leads to the same point estimate whether the model is formulated in terms of $\theta$ or $\phi$.
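The invariance of the ML estimate under reparameterization can be demonstrated numerically. The sketch below uses the pdf $p(x|\theta) \propto e^{-x^2/\theta^2}$ discussed above and its reparameterization $\phi = \theta^2$ (the data are simulated for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
x = rng.normal(0.0, 2.0, size=400)   # simulated data

# Negative log-likelihood of p(x|theta) = exp(-x^2/theta^2)/(theta*sqrt(pi)),
# written in the two parameterizations (constants dropped).
def nll_theta(theta):
    return np.sum(x**2) / theta**2 + len(x) * np.log(theta)

def nll_phi(phi):                    # phi = theta^2
    return np.sum(x**2) / phi + 0.5 * len(x) * np.log(phi)

t_hat = minimize_scalar(nll_theta, bounds=(0.1, 10), method="bounded").x
p_hat = minimize_scalar(nll_phi, bounds=(0.01, 100), method="bounded").x
print(round(t_hat, 4), round(np.sqrt(p_hat), 4))   # identical point estimates
```

Both maximizations land on the same point, $\hat\phi = \hat\theta^2$, with the analytic solution $\hat\theta^2 = 2\overline{x^2}$.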
A secondary difficulty with assuming that a parameter about which no prior information is known is distributed uniformly is that Bayes' theorem then leads to some odd results in comparison with corresponding ML estimates. For example, consider the set of observations $\{x_i,\ i = 1 \ldots n\}$ believed to have arisen from a Poisson process with unknown parameter $\theta$. As worked out previously, the parameter dependence of the likelihood function is $e^{-n\theta}\theta^{n\bar{x}}$, maximization of which gives the ML estimate $\hat\theta = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, the mean value of the observations, a reasonable result. Contrast this with the Bayes estimate obtained by calculating the expectation $\langle\theta\rangle$ under assumption of a uniform prior $p(\theta) = \mathrm{constant}$:

$$\theta_B = \langle\theta\rangle = \frac{\displaystyle\int_0^\infty \theta\, e^{-n\theta}\theta^{n\bar{x}}\, d\theta}{\displaystyle\int_0^\infty e^{-n\theta}\theta^{n\bar{x}}\, d\theta} = \frac{\Gamma(n\bar{x}+2)}{n\,\Gamma(n\bar{x}+1)} = \bar{x} + \frac{1}{n}. \qquad (1.27.10)$$
Although the asymptotic values of $\theta_B$ and $\hat\theta$ are the same, the preceding result is puzzling and not good for a small sample.

The problem of how in general to determine what prior distribution best represents ignorance is a very important one, as it is crucial to having a completely consistent and reliable theory of scientific inference. Various ad hoc solutions have been proposed in the past, such as Jeffreys' prior, which is to take $p(\theta)\,d\theta \propto d\theta$ for a parameter that ranges infinitely in both directions ($-\infty \le \theta \le \infty$) and $p(\theta)\,d\theta \propto d\theta/\theta$ for a parameter that ranges infinitely in one direction ($0 \le \theta \le \infty$). However, there was no fundamental reason to adopt such rules of thumb, nor guidance on how to deduce the most suitable prior in other circumstances where neither of the preceding two choices may be adequate.
As of this writing, there may yet be no universally agreed solution to the problem of priors for all circumstances, but it seems reasonable to me to expect that the state of ignorance be defined by a principle of invariance (i.e. a group theoretical concept) as recognized and illustrated by Ed Jaynes29 over four decades ago. Jaynes' starting point, to counter the objection that use of Bayes' theorem for inference led to subjective results dependent on personal judgments of the analysts, was to insist that, given the same initial information (or lack thereof), all analysts should be able to arrive at the same prior probability function for the parameters being estimated. To achieve a unique prior (and therefore a unique posterior), it was essential for a problem to be worded precisely so as to make clear what attributes of the parameters of the underlying model were not specified. It was this lack of clarity that led to some well-known paradoxes, such as Bertrand's paradox,30 in the history of probability theory.
Bertrand's paradox, proposed in 1889, provides a good illustration of how an ambiguity in the statement of a problem can lead to multiple solutions. What is the probability P that the length of a chord of a circle, selected randomly, will be greater than the side of the equilateral triangle inscribed in that circle? The solution depends on what is meant by choosing the chord randomly. Here are three such ways.

(1) Random length: The linear distance between the midpoint of the chord and the center of the circle is random. (The solution is $P = \tfrac12$.)
(2) Random arc: The arc length between two endpoints of a chord chosen randomly on the perimeter of the circle is random. (The solution is $P = \tfrac13$.)
(3) Random point: The location of the midpoint of the chord anywhere within the area of the circle is random. (The solution is $P = \tfrac14$.)
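All three solutions emerge from a short Monte Carlo sketch, with the sampling in each case following the corresponding definition of "random" (unit circle; the inscribed equilateral triangle has side √3):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200000
side = np.sqrt(3.0)          # side of the inscribed equilateral triangle

# (1) Random length: distance of the chord's midpoint from the center
d = rng.random(N)
p1 = (2 * np.sqrt(1 - d**2) > side).mean()

# (2) Random arc: two endpoints chosen uniformly on the circumference
a, b = rng.uniform(0, 2 * np.pi, (2, N))
chord = 2 * np.abs(np.sin((a - b) / 2))
p2 = (chord > side).mean()

# (3) Random point: midpoint uniform over the area of the circle
r = np.sqrt(rng.random(N))   # radial density 2r gives area-uniform points
p3 = (2 * np.sqrt(1 - r**2) > side).mean()

print(round(p1, 3), round(p2, 3), round(p3, 3))   # ~1/2, ~1/3, ~1/4
```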
29 E. T. Jaynes, "Prior Probabilities", IEEE Transactions on Systems Science and Cybernetics 4 (1968) 227-241.
30 E. Parzen, Modern Probability Theory and Its Applications (Wiley, New York, 1960) 302-304. At least six solutions were proposed by the mathematician Czuber in 1908; cited in Parzen, p. 303.
$$P(x|\lambda) = \frac{e^{-\lambda t}(\lambda t)^x}{x!}. \qquad (1.27.11)$$

The unit of time, however, is arbitrary, and adopting another unit so that the count interval becomes $t'$ and the decay parameter becomes $\lambda'$ would not change in any way our prior knowledge about the unknown decay rate. It would then follow that under a transformation $(\lambda, t) \to (\lambda', t')$ subject to $\lambda t = \lambda' t'$, that is,

$$\lambda \to \lambda' = q\lambda, \qquad t \to t' = q^{-1} t, \qquad (1.27.12)$$

the functional form of the density (call it $f(\lambda)$) that represents what we know about $\lambda$ is precisely the same as the density that represents what we know about $\lambda'$. Hence

$$f(\lambda)\,d\lambda = f(\lambda')\,d\lambda' = q f(q\lambda)\,d\lambda \quad\text{or}\quad f(\lambda) = q f(q\lambda). \qquad (1.27.13)$$

Differentiating (1.27.13) with respect to $q$ and then setting $q = 1$ gives

$$\lambda \frac{df}{d\lambda} + f = 0 \ \Rightarrow\ \frac{d\ln f}{d\ln\lambda} = -1 \ \Rightarrow\ f(\lambda) = \frac{\mathrm{constant}}{\lambda}. \qquad (1.27.14)$$
Equation (1.27.14), which corresponds to the Jeffreys prior (but now with a theoretical justification based on symmetry, independent of anyone's personal opinion) is the only expression that objectively represents a state of prior knowledge compatible with the mathematical invariance that ignorance of $\lambda$ implies. Used in Eq. (1.27.4) to estimate the value of the unknown parameter $\theta$, one now obtains

$$\theta_B = \langle\theta\rangle = \frac{\displaystyle\int_0^\infty \theta\, e^{-n\theta}\theta^{n\bar{x}}\,\theta^{-1}\, d\theta}{\displaystyle\int_0^\infty e^{-n\theta}\theta^{n\bar{x}}\,\theta^{-1}\, d\theta} = \frac{\Gamma(n\bar{x}+1)}{n\,\Gamma(n\bar{x})} = \bar{x} = \hat\theta, \qquad (1.27.15)$$

the same result as the estimate by maximum likelihood.
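Both posterior means, (1.27.10) for the uniform prior and (1.27.15) for the Jeffreys prior, can be checked by direct numerical integration (the counts below are invented):

```python
import numpy as np
from scipy import integrate

x = np.array([3, 1, 4, 1, 5, 0, 2, 2])       # hypothetical Poisson counts
n, xbar = len(x), x.mean()

def post_mean(a):
    """Posterior mean of theta for likelihood e^(-n theta) theta^(n xbar)
    with prior p(theta) proportional to theta^a (a = 0: uniform; a = -1: Jeffreys)."""
    num = integrate.quad(lambda t: t ** (n * xbar + 1 + a) * np.exp(-n * t), 0, 50)[0]
    den = integrate.quad(lambda t: t ** (n * xbar + a) * np.exp(-n * t), 0, 50)[0]
    return num / den

uniform = post_mean(0)     # Eq. (1.27.10): xbar + 1/n
jeffreys = post_mean(-1)   # Eq. (1.27.15): xbar, the ML estimate
print(round(uniform, 6), xbar + 1 / n, round(jeffreys, 6), xbar)
```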
The same kind of reasoning can be applied to distributions with more than one parameter, such as the important case of the normal distribution, where one may be ignorant of the location parameter $\mu$ and scale parameter $\sigma$. If one has no prior information about these parameters, then the posterior probability, corresponding to a set of observations $\{x_i,\ i = 1 \ldots n\}$,

$$P(d\mu, d\sigma|D) \propto \prod_{i=1}^{n} \left[(2\pi\sigma^2)^{-1/2}\, e^{-(x_i - \mu)^2/2\sigma^2}\right] f(\mu, \sigma)\, d\mu\, d\sigma, \qquad (1.27.16)$$
that the unknown parameters fall within the ranges $(\mu, \mu + d\mu)$, $(\sigma, \sigma + d\sigma)$ must be invariant under simultaneous transformation of location and scale $(x, \mu, \sigma) \to (x', \mu', \sigma')$:

$$\mu' = a\mu + b, \qquad \sigma' = a\sigma, \qquad x' - \mu' = a(x - \mu). \qquad (1.27.17)$$

If we truly have no prior information about their values, then merely relocating the mean and changing the variance cannot provide new information, and thus the prior density function $f(\mu, \sigma)$ must have the same dependence on its argument after the transformation as before, from which it follows that

$$f(\mu, \sigma)\, d\mu\, d\sigma = f(\mu', \sigma')\, d\mu'\, d\sigma'. \qquad (1.27.18)$$

Substituting expressions (1.27.17) for the transformed parameters into the argument of the right side and evaluating the Jacobian of the transformation leads to the functional equation

$$f(\mu, \sigma) = f(b + a\mu,\ a\sigma)\, \frac{\partial(\mu', \sigma')}{\partial(\mu, \sigma)} = a\, f(b + a\mu,\ a\sigma). \qquad (1.27.19)$$

To solve Eq. (1.27.19), take derivatives of both sides sequentially with respect to each of the two arbitrary transformation parameters, and then set the parameters to their values for the identity transformation ($a = 1$, $b = 0$). Start with $b$:

$$0 = \frac{\partial f(\mu, \sigma)}{\partial b} = a\, \frac{\partial f(b + a\mu,\ a\sigma)}{\partial b}. \qquad (1.27.20)$$

The vanishing of the right side of (1.27.20) tells us that $f$ cannot be a function of $b$, whereupon we can write $f(\mu, \sigma) = a f(\mu, a\sigma)$. But this is the same functional relation that we encountered before in (1.27.13), with solution (1.27.14). The posterior probability (1.27.16) therefore takes the form
$$P(d\mu, d\sigma|D) \propto \prod_{i=1}^{n} \left[(2\pi\sigma^2)^{-1/2}\, e^{-(x_i - \mu)^2/2\sigma^2}\right] \frac{d\mu\, d\sigma}{\sigma} \propto (2\pi)^{-n/2}\, \sigma^{-(n+1)}\, \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right) d\mu\, d\sigma \qquad (1.27.21)$$

$$\propto \sigma^{-(n+1)}\, e^{-n(\mu - \bar{x})^2/2\sigma^2}\, e^{-nS'^2/2\sigma^2}\, d\mu\, d\sigma \qquad (1.27.22)$$

in terms of sufficient statistics: the sample mean $\bar{x}$ and (biased) sample variance $S'^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
From Eq. (1.27.22) we can determine the marginal probabilities of each parameter (i.e. the density of one irrespective of the value of the other) by integrating over the undesired parameter:

$$P(d\mu|D) = d\mu \int_0^\infty p(\mu, \sigma|D)\, d\sigma \propto \left[(\mu - \bar{x})^2 + S'^2\right]^{-n/2} d\mu \qquad (1.27.23)$$

$$P(d\sigma|D) = d\sigma \int_{-\infty}^{\infty} p(\mu, \sigma|D)\, d\mu \propto \sigma^{-n}\, e^{-nS'^2/2\sigma^2}\, d\sigma. \qquad (1.27.24)$$
The proportionality constants in the foregoing three equations can be worked out exactly if needed, but, depending on how the equations are used, they may simply drop out of the calculation. For example, to estimate the parameter $\mu$ from the data by using (1.27.22) to calculate the expectation $\langle\mu\rangle$, one has

$$\langle\mu\rangle = \frac{\displaystyle\int_{-\infty}^{\infty} \mu\, e^{-n(\mu - \bar{x})^2/2\sigma^2}\, d\mu}{\displaystyle\int_{-\infty}^{\infty} e^{-n(\mu - \bar{x})^2/2\sigma^2}\, d\mu} = \frac{\displaystyle\int_{-\infty}^{\infty} \left(\bar{x} + \frac{\sigma}{\sqrt{n}}\, y\right) e^{-y^2/2}\, dy}{\displaystyle\int_{-\infty}^{\infty} e^{-y^2/2}\, dy} = \bar{x} \qquad (1.27.25)$$

upon change of variable to standard form $y = (\mu - \bar{x})\sqrt{n}/\sigma$ and cancellation of factors common to numerator and denominator. The integral corresponding to the first moment of $y$ vanishes identically because of symmetry, whereupon the quotient reduces immediately to $\bar{x}$, the same value as the ML estimate.
Similarly, one can use Eq. (1.27.24) to estimate the parameter $\sigma^2$ from the expectation $\langle\sigma^2\rangle$:

$$\langle\sigma^2\rangle = \frac{\displaystyle\int_0^\infty \sigma^2\, \sigma^{-n}\, e^{-nS'^2/2\sigma^2}\, d\sigma}{\displaystyle\int_0^\infty \sigma^{-n}\, e^{-nS'^2/2\sigma^2}\, d\sigma} = \frac{nS'^2}{2}\, \frac{\Gamma\!\left(\frac{n-3}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)} = \frac{n S'^2}{n-3} = \frac{1}{n-3}\sum_{i=1}^{n}(x_i - \bar{x})^2. \qquad (1.27.26)$$

The Jeffreys prior is already incorporated through the marginal density (1.27.24); the substitution $y = nS'^2/2\sigma^2$ transforms the integrals into gamma functions. The estimate differs from the corresponding ML estimate ($S'^2$), although the results are asymptotically equivalent.
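A numerical check of the estimate (the sample size and biased variance below are chosen arbitrarily):

```python
import numpy as np
from scipy import integrate

n, S2 = 12, 2.0          # hypothetical sample size and biased variance S'^2
A = n * S2 / 2.0

def moment(k):
    # integral of sigma^k against the marginal posterior sigma^(-n) e^(-A/sigma^2)
    f = lambda s: s ** (k - n) * np.exp(-A / s ** 2)
    return integrate.quad(f, 1e-9, 30)[0]

sig2_bayes = moment(2) / moment(0)
print(round(sig2_bayes, 4), n * S2 / (n - 3), S2)   # Bayes nS'^2/(n-3) vs ML S'^2
```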
Appendices

$$P(A|\bar{B}) = \frac{P(A\bar{B})}{P(\bar{B})} = \frac{P(A) - P(AB)}{1 - P(B)} = \frac{P(A) - P(A|B)\,P(B)}{1 - P(B)}, \qquad (1.28.2)$$

in which the second line follows from the completeness relations

$$P(B) + P(\bar{B}) = 1, \qquad P(AB) + P(A\bar{B}) = P(A), \qquad (1.28.4)$$

and the third line from combining the two terms and recognizing the expression for conditional probability $P(A|B)$.
$$p(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} h(t)^n\, e^{-ixt}\, dt = \frac{1}{2\pi}\int_{-\infty}^{\infty} \left(\frac{e^{it} - 1}{it}\right)^{\!n} e^{-ixt}\, dt; \qquad (1.29.1)$$

expand the binomial expression and interchange the order of integration and summation to obtain

$$p(x) = \sum_{k=0}^{n} \binom{n}{k} (-1)^{n-k}\, i^{-n}\, \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{e^{i(k - x)t}}{t^n}\, dt. \qquad (1.29.2)$$

Consider next the contour integral

$$\frac{1}{2\pi}\oint_C \frac{e^{iaz}}{z^n}\, dz, \qquad (1.29.3)$$

where $a > 0$ is a constant and the contour C, to be traversed in the positive (i.e. counterclockwise) sense, is a semicircle of radius R in the upper-half complex plane with diameter along the real axis. On the semicircular portion of the contour the integration variable takes the form $z = Re^{i\theta} = R\cos\theta + iR\sin\theta$, and therefore the magnitude of the integrand vanishes exponentially as $e^{-aR\sin\theta}$ in the limit that $R \to \infty$. The integrals in (1.29.2) and (1.29.3) are then related by

$$\frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{e^{iat}}{t^n}\, dt = \lim_{R\to\infty} \frac{1}{2\pi}\oint_C \frac{e^{iaz}}{z^n}\, dz = a^{n-1} \lim_{R\to\infty} \frac{1}{2\pi}\oint_C \frac{e^{iu}}{u^n}\, du, \qquad (1.29.4)$$

where the second equality results from a change of integration variable $u = az$.
The contour integral can be evaluated immediately by means of the residue theorem of complex analysis

$$\oint_C f(u)\, du = 2\pi i \sum_i \mathrm{Res}(f;\, u_i) \qquad (1.29.5)$$

in which the sum is over the poles of the function in the integrand. The integral in (1.29.4) has a single pole of order n at $u = 0$. Recall that the residue of a function $f(u)$ expanded in a Laurent series is the coefficient of the term $u^{-1}$. Thus, expansion of $e^{iu}/u^n$ in powers of u generates an infinite sum of terms

$$a^{n-1}\, \frac{1}{2\pi}\oint_C \frac{e^{iu}}{u^n}\, du = a^{n-1}\, \frac{1}{2\pi} \sum_{j=0}^{\infty} \frac{i^j}{j!}\oint_C u^{\,j-n}\, du = \frac{a^{n-1}\, i^n}{(n-1)!}, \qquad (1.29.6)$$

of which the only nonvanishing term is the one for which $j = n - 1$, giving the result shown above. Substitution of (1.29.6) into (1.29.2) with identification of $a = k - x$ leads to the final expression in (1.17.8).
Note that, if $a < 0$, then the path of integration from $-\infty$ to $\infty$ would have to be closed by a semicircle in the lower half complex plane in order for the contribution to the contour integral along this portion to fall off exponentially as $R \to \infty$. The integral must then be multiplied by $-1$ since the traversal of the contour would then be in a negative (i.e. clockwise) sense.
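Although Eq. (1.17.8) itself is not reproduced in this excerpt, the derivation above leads to the standard uniform-sum (Irwin-Hall) density for the sum of n independent U(0, 1) variates. Assuming that conventional closed form, a quick check at points where the density is known exactly:

```python
import math

def uniform_sum_pdf(x, n):
    """Standard (Irwin-Hall) density of the sum of n independent U(0,1)
    variates, the closed form to which the contour-integral result reduces."""
    return sum((-1) ** k * math.comb(n, k) * (x - k) ** (n - 1)
               for k in range(int(math.floor(x)) + 1)) / math.factorial(n - 1)

print(uniform_sum_pdf(1.0, 2))    # peak of the triangular n = 2 density: 1.0
print(uniform_sum_pdf(1.5, 3))    # center of the n = 3 density: 0.75
```

The density is symmetric about n/2, which provides a further sanity check.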
$$p(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} h(t)\, e^{-ixt}\, dt = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ixt}\, (1 - 2it)^{-k/2}\, dt; \qquad (1.30.1)$$

expand the integrand in a negative binomial series and interchange summation and integration to obtain

$$p(x) = \sum_{j=0}^{\infty} \binom{-k/2}{j} (-2i)^{-\left(\frac{k}{2} + j\right)} \left[\frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{e^{-ixt}}{t^{\,\frac{k}{2} + j}}\, dt\right]. \qquad (1.30.2)$$

The integral above has the same general form as the one worked out in Section 1.29 and therefore can likewise be evaluated as a contour integral, but along a semicircular contour of radius R in the lower half complex plane because of the negative sign in the exponent. It then follows from the residue theorem that

$$\frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{e^{-ixt}}{t^n}\, dt = \lim_{R\to\infty}\left[-\frac{(-x)^{n-1}}{2\pi}\oint_C \frac{e^{iu}}{u^n}\, du\right] = \frac{(-i)^n\, x^{n-1}}{(n-1)!}. \qquad (1.30.3)$$

Substitution of the result (1.30.3) for $n = \frac{k}{2} + j$ into (1.30.2) yields the sum of terms

$$p(x) = 2^{-k/2}\, x^{\frac{k}{2} - 1} \sum_{j=0}^{\infty} \binom{-k/2}{j} \frac{(x/2)^j}{\left(\frac{k}{2} + j - 1\right)!} \qquad (1.30.4)$$

$$\binom{-k/2}{j} \frac{1}{\left(\frac{k}{2} + j - 1\right)!} = \frac{(-1)^j}{j!\left(\frac{k}{2} - 1\right)!} \qquad (1.30.5)$$

$$p(x) = \frac{x^{\frac{k}{2} - 1}}{2^{k/2}\left(\frac{k}{2} - 1\right)!}\sum_{j=0}^{\infty} \frac{(-x/2)^j}{j!} = \frac{x^{\frac{k}{2} - 1}\, e^{-x/2}}{2^{k/2}\, \Gamma(k/2)} \qquad (1.30.6)$$
where the factorial in the denominator was replaced by its equivalent representation
as a gamma function.
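The result (1.30.6) can be checked against a library implementation of the chi-square density:

```python
import numpy as np
from scipy import stats
from math import gamma

def chi2_pdf(x, k):
    """Eq. (1.30.6): p(x) = x^(k/2-1) e^(-x/2) / (2^(k/2) Gamma(k/2))."""
    return x ** (k / 2 - 1) * np.exp(-x / 2) / (2 ** (k / 2) * gamma(k / 2))

diffs = [abs(chi2_pdf(3.7, k) - stats.chi2.pdf(3.7, k)) for k in (1, 4, 9)]
print(max(diffs))
```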
The identity in (1.30.5) is easily demonstrated when one recalls that a combinatorial coefficient can be written as a quotient of two product sequences of equal length. For example:

$$\binom{7}{3} = \frac{7\cdot 6\cdot 5}{1\cdot 2\cdot 3}.$$

Applied to a negative binomial coefficient, one has

$$\binom{-n}{j} = \frac{\overbrace{(-n)(-n-1)(-n-2)\cdots(-n-j+1)}^{j\ \text{terms}}}{1\cdot 2\cdot 3\cdots j} = (-1)^j\, \frac{n(n+1)(n+2)\cdots(n+j-1)}{1\cdot 2\cdot 3\cdots j} = (-1)^j\, \frac{(n+j-1)!}{j!\,(n-1)!} \qquad (1.30.7)$$

where the final expression, equivalent to the right side of (1.30.5), is obtained by multiplying both numerator and denominator by $(n-1)!$.
1.31 Probability density of the order statistic Y(i)

(A) The straightforward plodding method requires taking and simplifying the derivative of the cumulative distribution function

$$F_{Y_i}(y) = \sum_{j=i}^{n} \binom{n}{j} F(y)^j\, [1 - F(y)]^{n-j} = 1 - \sum_{j=0}^{i-1} \binom{n}{j} F^j (1 - F)^{n-j} = 1 - (1 - F)^n \sum_{j=0}^{i-1} \binom{n}{j} \theta^{\,j} \qquad (1.31.1)$$

in which $F(y)$ is the cdf of the unordered random variable Y and $\theta(y) \equiv F(y)/(1 - F(y))$. (To keep the notation as simple as possible, the functional dependence on y will be omitted whenever it is not needed for clarity.) The derivative of the defined quantity $\theta$ is

$$\frac{d\theta}{dy} = \left[\frac{1}{1 - F} + \frac{F}{(1 - F)^2}\right] f = \frac{f}{(1 - F)^2} \qquad (1.31.2)$$

where $f(y) = dF(y)/dy$ is the pdf of the unordered random variable Y.
Taking the derivative of (1.31.1) with insertion of (1.31.2) and a little rearranging produces

$$f_{Y_i}(y) = n f\, (1 - F)^{n-1}\left[\sum_{j=0}^{i-1}\binom{n}{j}\theta^{\,j} - \frac{1}{n(1 - F)}\sum_{j=0}^{i-1}\binom{n}{j}\, j\, \theta^{\,j-1}\right] = n f\, (1 - F)^{n-1}\left[\sum_{j=0}^{i-1}\binom{n}{j}\theta^{\,j} - \frac{1}{1 - F}\sum_{j=0}^{i-2}\binom{n-1}{j}\theta^{\,j}\right]. \qquad (1.31.3)$$

In the transition from the first line to the second, note that

(a) $\dfrac{j}{n}\dbinom{n}{j} = \dbinom{n-1}{j-1}$, and

(b) the first nonvanishing term in the sum $\sum_{j=0}^{i-1}\binom{n-1}{j-1}\theta^{\,j-1}$ must start with $j = 1$.

One can therefore change the summation index so that the first nonvanishing term actually begins with $j = 0$ by also changing the upper limit of the sum to $i - 2$, leading to the second term of the second line in (1.31.3).
Substitution into (1.31.3) of the combinatorial identity

$$\binom{n}{j} = \binom{n-1}{j-1} + \binom{n-1}{j} \qquad (1.31.4)$$

allows one, after some more algebraic manipulation, to subtract the second sum from the first, yielding the result

$$f_{Y_i}(y) = n f\, (1 - F)^{n-1} \binom{n-1}{i-1}\, \theta^{\,i-1}, \qquad (1.31.5)$$

which reduces to the pdf given in (1.26.6).
The combinatorial identity in (1.31.4), known as Pascal's rule, can be demonstrated algebraically by manipulating the sum of factorial expressions represented by the right side and showing the equivalence to the factorial expression represented by the left side. However, a simple combinatorial argument avoids such tedious calculation. Suppose we have n distinguishable objects, and we focus attention on one of them. We then consider the number of ways to select j objects, which of course is $\binom{n}{j}$. However, there are $\binom{n-1}{j-1}$ ways to choose j objects that include the originally designated one and $\binom{n-1}{j}$ ways to choose j objects that do not include the designated one. Since the two groups are mutually exclusive and exhaustive, the identity (1.31.4) follows.
(B) The insightful method33 makes clever use of the multinomial distribution. Start with the defining relation

$$f_{Y_i}(y) = \frac{dF_{Y_i}(y)}{dy} = \lim_{\Delta y\to 0} \frac{F_{Y_i}(y + \Delta y) - F_{Y_i}(y)}{\Delta y} = \lim_{\Delta y\to 0} \frac{\Pr(y + \Delta y \ge Y_i > y)}{\Delta y}. \qquad (1.31.6)$$

Then recognize that the probability in the numerator is the product of the probabilities of three mutually exclusive and exhaustive events:

(a) $(i - 1)$ of the set of originally unordered variates are $< y$;
(b) one of the variates falls in the range $(y, y + \Delta y)$;
(c) $(n - i)$ of the set are $\ge y + \Delta y$.

The number of ways to achieve this threefold grouping is given by the multinomial coefficient $\frac{n!}{(i-1)!\,1!\,(n-i)!}$. It then follows that the derivative in (1.31.6) can be expressed as

$$\frac{dF_{Y_i}(y)}{dy} = \frac{n!}{(i-1)!\,(n-i)!}\, F^{\,i-1}\, (1 - F)^{n-i} \lim_{\Delta y\to 0} \frac{F(y + \Delta y) - F(y)}{\Delta y} = n\binom{n-1}{i-1} f(y)\, F^{\,i-1}\, (1 - F)^{n-i}, \qquad (1.31.7)$$

which is precisely the result (1.26.6) previously obtained with considerable effort.
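For U(0, 1) variates the result (1.31.7) coincides with the Beta(i, n − i + 1) density, a standard fact that provides a quick numerical check:

```python
import numpy as np
from math import factorial
from scipy import stats

def order_stat_pdf(y, i, n):
    """Eq. (1.31.7) specialized to U(0,1) variates, where F(y) = y, f(y) = 1."""
    c = factorial(n) / (factorial(i - 1) * factorial(n - i))
    return c * y ** (i - 1) * (1 - y) ** (n - i)

n, i = 10, 3
ys = np.linspace(0.05, 0.95, 7)
# The i-th order statistic of n uniform variates is Beta(i, n-i+1):
diff = np.abs(order_stat_pdf(ys, i, n) - stats.beta.pdf(ys, i, n - i + 1)).max()
print(diff)
```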
t2
1
d
d1
2
dt d
1=2
1 x2
d1
2
dx
1:32:1
2 a
1z
dz
C
dz
z ia z ia
1:32:2
A. M. Mood, F. A. Graybill, and D. C. Boes, Introduction to the Theory of Statistics 3rd Edition (McGraw-Hill, New
York, 1974) 253.
90
Res
;
z
z
,
1:32:3
0
z z0 n
n dzn1
n
zz0
which can be readily verified by expanding f(z) in a Taylor or Laurent series about z0 and identifying the coefficient of the term (z − z0)⁻¹. Application of (1.32.3) to the function g(z) = (z + ia)⁻ᵃ in (1.32.2) to obtain the residue, followed by use of the residue theorem (1.29.5), yields the result
$$I_0 \equiv \int_{-\infty}^{\infty}\left(1+\frac{t^2}{d}\right)^{-\frac{d+1}{2}}dt = 2\int_{0}^{\infty}\left(1+\frac{t^2}{d}\right)^{-\frac{d+1}{2}}dt = \frac{(\pi d)^{1/2}\,\Gamma(d/2)}{\Gamma\left((d+1)/2\right)}. \tag{1.32.4}$$
2
The fundamental problem of a practical physicist
The truth is, the science of Nature has been already too long made
only a work of the brain and the fancy: It is now high time that it
should return to the plainness and soundness of observations on
material and obvious things.
Robert Hooke1
Robert Hooke, Micrographia (first published by the Royal Society 1665) unnumbered page from The Preface.
K. Pearson, "The Fundamental Problem of Practical Statistics," Biometrika XIII (1920) 1–16.
T. Bayes, "An Essay towards solving a problem in the doctrine of chances," Philosophical Transactions of the Royal Society of London 53 (1764) 370–418.
T. M. Porter, Karl Pearson: The Scientific Life in a Statistical Age (Princeton University Press, Princeton NJ, 2004) 42, 181–183.
limited interest, Bayes' problem served as a surrogate for one of the most important tasks of statistics and physics: inference, or the prediction of future outcomes based on past observation. In Pearson's words:
None of the early writers on this topic, all approaching the subject from the mathematical theory of games of chance, seem to have had the least inkling of the enormous extension of their ideas, which would result in recent times from the application of the theory of random sampling to every phase of our knowledge and experience (economic, social, medical, and anthropological) and to all branches of observation, whether astronomical, physical or psychical.
In the history of probability, Bayes' method came to be known as "inverse probability" since, in contrast to the forward direction of calculating probabilities of outcomes from an assumed model, Bayes' theorem could be used to predict the probability of a model (or hypothesis) from observed outcomes. We have already examined aspects of this issue in the previous chapter in regard to estimating the parameters of models. Inference, however, goes beyond estimation, for it concerns not only how to extract values of parameters from data, but what to do with them afterward. Inference is as much an art as a science because there is generally no unique right answer; different methods can differ in their predictive utility. But all methods have to deal with the ancient and seemingly intractable problem of ignorance: how is an unknown probability to be represented mathematically?
Bayes' clever solution to that conundrum, reminiscent of physicists' practice (particularly in the nineteenth century) of making mechanical models to help derive or illustrate mathematical laws, was to devise a gedanken-experiment (thought-experiment) involving the motion of a ball on a billiard table. The ball was presumed to stop with equal probability at any location within the width of the table, and each instance of rolling the ball was an event independent of preceding ones. In this way, Bayes arrived at assigning a uniform distribution to a parameter whose value determined the probability of success, for example a head in the coin-toss problem defined at the beginning of the section. What Pearson came to understand, however, was that Bayes' solution to Bayes' problem led to the same result whether one assumed a uniform distribution on this parameter or not. This is an interesting and significant observation, worth looking at in detail.
Define a binary random variable X, where a success (S) corresponds to X = 1 and a failure (F) to X = 0, in the following way:

$$X = \begin{cases} 1 & \text{for } \theta < \theta_0 \\ 0 & \text{for } \theta \ge \theta_0 \end{cases} \tag{2.1.1}$$

where θ0 is the decision variable and θ is the deciding variable. In other words, choose a value of θ randomly from a distribution with continuous pdf f(θ). If θ < θ0, the outcome of the coin toss will be S; if θ ≥ θ0, the outcome will be F. The experiment is repeated n times and the result is a successes [aS] and (n − a) failures [(n − a)F]. The decision variable θ0 is chosen at the start from the same pdf f(θ) and remains fixed for the first n trials as well as the m trials to follow, for which the possible number of successes is in the range (m ≥ b ≥ 0).
From Eq. (2.1.1) the probability of a success or failure is given by

$$\Pr(X=1|\theta_0) \equiv P = \int_{-\infty}^{\theta_0} f(\theta)\,d\theta = F(\theta_0) \qquad \Pr(X=0|\theta_0) = \int_{\theta_0}^{\infty} f(\theta)\,d\theta = 1 - P = 1 - F(\theta_0) \tag{2.1.2}$$

where F(θ) (not to be confused with the symbol F for Failure) is the cumulative distribution function (cdf) defined above by the integral of an arbitrary probability density constrained only by the requirements that it be non-negative and normalizable,
$$\begin{aligned}
\Pr(a;n|I) &= \int \Pr(a|n,\theta)\,f(\theta)\,d\theta \\
&= \int_0^1 \binom{n}{a}P^a(1-P)^{n-a}\,dP \\
&= \binom{n}{a}B(a+1,\,n-a+1) = \frac{n!}{a!\,(n-a)!}\,\frac{a!\,(n-a)!}{(n+1)!} \\
&= \frac{1}{n+1}
\end{aligned} \tag{2.1.3}$$
where the beta function B(x, y) appearing in the third line of (2.1.3) is defined in terms of gamma functions or an equivalent integral⁵

$$B(a,b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)} = \int_0^1 x^{a-1}(1-x)^{b-1}\,dx. \tag{2.1.4}$$
G. Arfken and H. Weber, Mathematical Methods for Physicists (Elsevier, New York, 2005) 520–526.
(b) the final result in line 4 of (2.1.3) is a constant 1/(n + 1) independent of the number of successes a, which is precisely what one might have expected given the range of outcomes (n ≥ a ≥ 0) comprising n + 1 possibilities and the absence of any prior information on θ.
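The surprising constancy of (2.1.3) is easy to see in simulation. The following sketch of Bayes' billiard experiment (sample size, seed, and n are arbitrary choices) draws P uniformly and counts successes:

```python
import random
from collections import Counter

# Monte Carlo version of Bayes' billiard argument: draw P uniformly, then
# count successes a in n Bernoulli(P) trials.  Eq. (2.1.3) predicts every
# outcome a = 0..n occurs with the same probability 1/(n+1).
def outcome_frequencies(n=10, sets=200_000, seed=2):
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(sets):
        p = rng.random()
        counts[sum(rng.random() < p for _ in range(n))] += 1
    return [counts[a] / sets for a in range(n + 1)]

freqs = outcome_frequencies()
```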
Now for the inference part, i.e. to calculate the conditional probability, which I shall symbolize by Pr((b; m) | (a; n)), of getting b successes out of a further m trials, given the previous result of a successes out of n trials. The new prior density function is

$$\Pr(dP|a;n) = \binom{n}{a}P^a(1-P)^{n-a}\,dP, \tag{2.1.5}$$

and the likelihood is

$$\Pr(P|b;m) = \binom{m}{b}P^b(1-P)^{m-b}. \tag{2.1.6}$$
2:1:6
1
1
n a
na
Pa 1Pna dP
P 1P
dP
a
0
0
m
Bba1,mnba1
b
Ba1,na1
m!
ab!nmab! n1!
b!mb!
nm1!
a!na!
ab
nmab
a
na
:
nm1
n1
2:1:7
The first line in (2.1.7) expresses the definition of conditional probability under the two conditions that
(a) the two sets of coin tosses (actual and unrealized) are independent (hence the product of the two binomial expressions in square brackets) and
6
S. M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900 (Harvard University Press, Cambridge MA, 1986) 103–105.
(b) all values of the probability of success P can occur with equal weights (hence the
integral over P).
The resulting integrals are recognized as beta functions in line 2 and expanded into
the defining factorial expressions in line 3. With a little rearrangement, the factorials
can be grouped to form the product of combinatorial coefficients in the final form
in line 4.
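Line 4 of (2.1.7) can be evaluated directly with exact integer arithmetic; the sketch below (the test values of a, n, m are arbitrary) also confirms that the probabilities over b = 0..m sum to 1:

```python
from math import comb

# Line 4 of Eq. (2.1.7): predictive probability of b successes in m further
# trials given a successes in n prior trials (uniform prior on P).
def pr_bayes(b, m, a, n):
    return comb(a + b, a) * comb(n + m - a - b, n - a) / comb(n + m + 1, n + 1)

# Completeness: summing over all outcomes b = 0..m must give 1, and the
# degenerate case n = a = 0 reproduces the flat 1/(m+1) of (2.1.3).
total = sum(pr_bayes(b, 50, 25, 50) for b in range(51))
flat = [pr_bayes(b, 10, 0, 0) for b in range(11)]
```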
If the numbers of trials (n, m) and successes (a, b) are sufficiently larger than 1 that the 1s can be omitted in (2.1.7), then the factorials can be rearranged into the approximate expression

$$\Pr(b;m|a;n) \approx \frac{\dbinom{m}{b}\dbinom{n}{a}}{\dbinom{n+m}{a+b}} = \frac{\dbinom{m}{b}\dbinom{N-m}{K-b}}{\dbinom{N}{K}}, \tag{2.1.8}$$
which is the form of a hypergeometric distribution with total number of trials N = n + m and total number of successes K = a + b. This is the distribution that results from random sampling (e.g. of balls in an urn) without replacement. The mean and variance of the variate b in the distribution (2.1.8) are respectively

$$\langle b\rangle = K\frac{m}{N} \qquad \mathrm{var}(b) = K\frac{m}{N}\left(1-\frac{m}{N}\right)\frac{N-K}{N-1} \tag{2.1.9}$$
and reduce to the corresponding approximate expressions

$$\langle b\rangle \approx Kp \qquad \mathrm{var}(b) \approx Kp(1-p) \tag{2.1.10}$$
$$\frac{n^n\,m^m\,(a+b)^{a+b}\,(n+m-a-b)^{n+m-a-b}}{a^a\,b^b\,(n-a)^{n-a}\,(m-b)^{m-b}\,(n+m)^{n+m}} \tag{2.1.11}$$
A. W. F. Edwards, Likelihood (Johns Hopkins University Press, Baltimore, 1992) 216–217 gives a justification of Fisher's reasoning.
(2.2.1)

and then use that estimate (call it θ̃) to predict the probability of a future outcome (b; m) as follows

$$\Pr(b;m|a;n;\tilde\theta) = \binom{m}{b}\,\tilde\theta^{\,b}\left(1-\tilde\theta\right)^{m-b}. \tag{2.2.2}$$

One could have generalized the marginal probability (2.1.3) to

$$\int_0^1 \binom{n}{a}\,\theta^a(1-\theta)^{n-a}f(\theta)\,d\theta \tag{2.2.3}$$
if there were justification for a particular pdf f(θ). Pearson circumvented the issue by using Bayes' theorem in such a way that the distribution f(θ) did not appear. Jaynes' theory of invariance, however, yields a specific functional form for f(θ) through the following argument.⁸
Suppose we attempted to estimate the initial probability of success p(S), about which nothing was known beforehand, by performing an experiment that yielded data D. Then we could relate the posterior probability p(S|D) to the prior probability p(S) by Bayes' theorem

$$\theta' \equiv p(S|D) = \frac{p(D|S)\,p(S)}{p(D)} = \frac{c\,\theta}{(c-1)\theta + 1}, \tag{2.2.4}$$

where θ′ and θ are respectively the posterior and prior probabilities of success and c ≡ p(D|S)/p(D|F) is a ratio of likelihoods. If, however, the experiment was ill-chosen and did not teach us anything new, then our state of knowledge afterward would be the same as before, in which case the distributions of θ′ and θ must be the same,

$$f(\theta')\,d\theta' = f(\theta)\,d\theta. \tag{2.2.5}$$
2:2:5
Combined with Eq. (2.2.4), the preceding equation leads to a functional relation
c
2:2:6
1 c 12 f cf
1 c 1
that can be solved in the manner employed in the previous chapter. Take derivatives of
both sides with respect to the parameter (c) and then set the parameter equal to its identity
element (c 1). The calculation is straightforward and leads to the distribution
f d
d
:
1
2:2:7
The mean of the resulting posterior density,

$$\langle\theta\rangle = \frac{\displaystyle\int_0^1 \theta^{a}(1-\theta)^{n-a-1}\,d\theta}{B(a,\,n-a)} = \frac{B(a+1,\,n-a)}{B(a,\,n-a)} = \frac{a}{n}, \tag{2.2.8}$$

gives a Bayesian estimate θ̃ = a/n, the same as the ML estimate. Thus the solution, Eq. (2.2.2), to Bayes' problem with the Jaynes prior reduces to
8
E. T. Jaynes, "Prior Probabilities," IEEE Transactions on Systems Science and Cybernetics 4 (1968) 227–241.
$$\Pr(b;m|a;n;\tilde\theta) = \binom{m}{b}\left(\frac{a}{n}\right)^b\left(1-\frac{a}{n}\right)^{m-b}. \tag{2.2.9}$$
There are some oddities to the use of (2.2.7) that are worth noting. First, the function f(θ) becomes singular at the points θ = 0, 1. In other words, the prior in (2.2.7) weights the endpoints more heavily than the midsection. Jeffreys, whose approach to probability theory served as inspiration to Jaynes, did not himself find the suggestion of (2.2.7) (which he attributed to the biologist Haldane) appealing, although his own logarithmic prior for a scale parameter is also singular at the point 0. Having recorded his skepticism of the Bayes–Laplace uniform prior, which "might appeal to a meteorologist. . .but hardly to a Mendelian," he then wrote⁹
Certainly if we take the Bayes–Laplace rule right up to the extremes we are led to results that do not correspond to anybody's way of thinking. The rule dx/[x(1 − x)] goes too far the other way. It would lead to the conclusion that if a sample is of one type with respect to some property there is probability 1 that the whole population is of that type.
It would seem, therefore, that use of (2.2.7) must entail either exclusion of the endpoints θ = 0, 1, with failure of the prior to be complete, or inclusion of the endpoints, with failure of the prior to be normalizable. It can be argued, however, that the density function of a parameter about which nothing is presumed known beforehand ought, in fact, to exclude from its range points indicative of certainty.
A second, related peculiarity is that, if (2.2.7) is substituted into Eq. (2.2.1), the conditional probability

$$\Pr(d\theta|a;n) = \frac{\theta^{a-1}(1-\theta)^{n-a-1}\,d\theta}{B(a,\,n-a)} = \frac{(n-1)!}{(a-1)!\,(n-a-1)!}\,\theta^{a-1}(1-\theta)^{n-a-1}\,d\theta \tag{2.2.10}$$
Fig. 2.1 Bayes' solution for the probability PB(b) of b successes in m = 50 trials given a prior of a successes in n trials at constant ratio (a/n) equal to (a) 1/2, (b) 5/10, (c) 25/50, (d) 50/100. Solid curve (e) shows the binomial probability distribution Bin(50, 1/2).
uniform prior (I shall call it PB(b)) are not particularly transparent. Knowing that the asymptotic form approaches that of a hypergeometric distribution is not all that helpful since this is a complicated function. Visualization, however, is helpful, and Figure 2.1 shows the variation in form of PB(b) as a function of b for m = 50 and different values of a and n such that the ratio a/n is a fixed quantity which (for no particular reason except aesthetics) I chose to be 1/2.
The distribution is discrete, so the dashed lines tracing the different curves are there only to guide the eye. In order not to encumber the figure, I have omitted symbols showing the discrete point values, except in the tallest curve depicted with a solid line. This is the binomial distribution corresponding to PJ(b). The behavior of PB(b) now becomes evident and in a certain sense very reasonable. When the prior information is deduced from a small sample, for example (a; n) = (1; 2), the prediction of future outcomes (b; m) is highly uncertain, and the plot of PB(b) spreads widely and with low amplitude over the range (m ≥ b ≥ 0). As the sample from which the prior estimate of the probability of a success is inferred increases, the predictions become sharper, and the plot of PB(b) approaches asymptotically in form the binomial distribution PJ(b). Although readily apparent, I note explicitly that the function PJ(b) depends only on the ratio of a to n and not on the two values separately.
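The convergence just described can be quantified numerically; a sketch (the prior samples are chosen for illustration) comparing PB with the binomial PJ at fixed a/n = 1/2:

```python
from math import comb

# Convergence of the Bayes-Laplace distribution P_B(b) toward the binomial
# P_J(b) = C(m,b)(1/2)^m as the prior sample grows at fixed ratio a/n = 1/2.
def p_bayes(b, m, a, n):
    return comb(a + b, a) * comb(n + m - a - b, n - a) / comb(n + m + 1, n + 1)

def p_binom(b, m, p=0.5):
    return comb(m, b) * p ** b * (1 - p) ** (m - b)

m = 50

def max_deviation(a, n):
    return max(abs(p_bayes(b, m, a, n) - p_binom(b, m)) for b in range(m + 1))

d_small = max_deviation(1, 2)        # prior sample (a; n) = (1; 2)
d_large = max_deviation(500, 1000)   # prior sample (a; n) = (500; 1000)
```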
To demonstrate analytically what the figure suggests graphically, it is better to work directly with the probability function (2.1.7) than with a moment-generating function or characteristic function, neither of which is easily calculable or useful in the case of a hypergeometric distribution. Since a and n are related by a = θ̂n, substitute for a in PB(b) to obtain the form

$$P_B(b) = \frac{\dbinom{x}{b}\dbinom{y}{m-b}}{\dbinom{z}{m}} \qquad \left(x \equiv \hat\theta n + b,\quad y \equiv n(1-\hat\theta) + m - b,\quad z \equiv n + m + 1\right). \tag{2.3.1}$$

For x >> b, each combinatorial coefficient takes the approximate form

$$\binom{x}{b} = \frac{x!}{(x-b)!\,b!} \;\xrightarrow{\;x \gg b\;}\; \frac{x^b}{b!}. \tag{2.3.2}$$

Approximating each of the three combinatorial coefficients in (2.3.1) with the corresponding form given by (2.3.2) yields the result

$$\lim_{\text{large}\ n} P_B(b) = \frac{x^b\,y^{m-b}\big/\left[b!\,(m-b)!\right]}{z^m/m!} \approx \binom{m}{b}\frac{(\hat\theta n)^b\left[n(1-\hat\theta)\right]^{m-b}}{(n+1)^m} \approx \binom{m}{b}\left(\frac{a}{n}\right)^b\left(1-\frac{a}{n}\right)^{m-b} = P_J(b). \tag{2.3.3}$$
(2.4.2)
Fig. 2.2 Silverman–Bayes experiment with randomly chosen decision parameter θ0 = 0.179. The histogram shows the frequency of successes per set of 20 trials in 5000 sets of trials implemented by a N(0,1) RNG. (a) Binomial distribution PJ = Bin(20, 1/2) (Eq. (2.2.8)) based on Jaynes' solution with parameter (a/n) = (2/4). (b) Bayes' distribution PB (Eq. (2.1.7)) based on prior F(θ0) = 0.571. (c) Binomial distribution Bin(20, 0.570) with observed mean probability of success p̄ = 0.570. Dashed lines connecting discrete points serve only as visual guides.
$$\mathrm{var}(\hat p) = \frac{\langle B^2\rangle - \langle B\rangle^2}{m^2} = \frac{\left\langle\Bigl(\sum_{i=1}^{m} X_i\Bigr)^{2}\right\rangle - \left\langle\sum_{i=1}^{m} X_i\right\rangle^{2}}{m^2} = \frac{m\langle X^2\rangle + m(m-1)\langle X\rangle^2 - m^2\langle X\rangle^2}{m^2} = \frac{\langle X^2\rangle - \langle X\rangle^2}{m} = \frac{F(\theta_0) - F(\theta_0)^2}{m} = \frac{F(\theta_0)\left[1 - F(\theta_0)\right]}{m} \tag{2.4.3}$$
Since the prior information was obtained from a small sample (n = 4), it was to be expected (as explained in the previous section) that the Bayes–Laplace predictions would be dispersed much more widely about p0 than the Jaynes–Haldane predictions. Figure 2.2 confirms this.
If the prior estimate p0 were identical to F(θ0), the curve of PJ(b) would form a tight envelope about the histogram (as illustrated by the light dashed line), and the error would be close to zero. Looking at the figure, which summarizes an outcome with prior estimate p0 = 0.50 (about which the theoretical curves are centered) not too far from p̄ = 0.57 (about which the histogram is centered), one would likely conclude that PJ(b) has made overall better predictions than PB(b). The measures of error (we will use the absolute error (2.4.1)) confirm this: $\epsilon^{J}_{\mathrm{abs}} = 0.153$, $\epsilon^{B}_{\mathrm{abs}} = 0.202$.
Since the function PJ(b) takes the form of the empirical distribution irrespective of prior sample size n, whereas the function PB(b) takes the form of the empirical distribution only for very large values of n, one might be tempted to conclude that the better bet would be on the Jaynes–Haldane predictions. This, however, would be a mistake, and possibly a costly one if a wager were actually involved. I have run the experiment many thousands of times (it took but a few seconds with a computer), and in most instances by far the Bayes–Laplace predictions led to smaller overall errors than the Jaynes–Haldane predictions. This turnabout occurs because the prior estimate p0 is more likely not to be sufficiently close to the true probability of success, F(θ0), in which case the more diffuse PB(b) distribution overlaps the empirical distribution to a greater extent than does the PJ(b) distribution.
Figure 2.3 illustrates this point quantitatively in the case of a prior sample of size n = 20 and posterior test of size m = 50 in which the outcomes to be inferred span the range b = 0, 1, ...50. For the example shown, the true probability of success was taken to be F(θ0) = 0.5. The top two plots with dashed lines show the absolute errors (i.e. summed over all outcomes b) incurred by predictions based on PJ and PB as a function of the prior number of successes a or, equivalently, of the prior estimate of the probability of success p0 = a/n. Except for just three values of a (9, 10, 11) out of the entire range a = 0, 1...20, the Bayes–Laplace error lies below the Jaynes–Haldane error. The difference of the two errors $\epsilon^{J}_{\mathrm{abs}} - \epsilon^{B}_{\mathrm{abs}}$ as a function of a is shown by the solid line. This conclusion holds for both small and large prior and posterior sample sizes.
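The wager can be rehearsed in a few lines; the following sketch (sample sizes, number of repetitions, and seed are my own choices, smaller than those quoted above) tallies how often the Bayes–Laplace predictions beat the Jaynes–Haldane predictions:

```python
import random
from math import comb

# Repeated Silverman-Bayes wager: draw the true success probability P0
# uniformly, take a prior sample of n tosses, then score how well the
# Jaynes-Haldane binomial P_J and the Bayes-Laplace P_B predict the
# empirical distribution of b successes in repeated sets of m further
# tosses, using summed absolute deviations as the error measure.
def p_bayes(b, m, a, n):
    return comb(a + b, a) * comb(n + m - a - b, n - a) / comb(n + m + 1, n + 1)

def run_once(rng, n=4, m=20, sets=1000):
    p0 = rng.random()
    a = sum(rng.random() < p0 for _ in range(n))
    hist = [0] * (m + 1)
    for _ in range(sets):
        hist[sum(rng.random() < p0 for _ in range(m))] += 1
    emp = [h / sets for h in hist]
    pj = [comb(m, b) * (a / n) ** b * (1 - a / n) ** (m - b) for b in range(m + 1)]
    pb = [p_bayes(b, m, a, n) for b in range(m + 1)]
    err_j = sum(abs(p - e) for p, e in zip(pj, emp))
    err_b = sum(abs(p - e) for p, e in zip(pb, emp))
    return err_j, err_b

rng = random.Random(3)
results = [run_once(rng) for _ in range(100)]
wins_b = sum(eb < ej for ej, eb in results)   # how often Bayes-Laplace wins
```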
[Fig. 2.3 Prediction errors of the PJ and PB predictions plotted against the prior number of successes a; panels (a) and (b).]
$$g_Y(t) = \left(pe^t + q\right)^n = \left[1 + p\left(e^t - 1\right)\right]^n, \tag{2.5.1}$$

which in the present case can be written to show explicitly the dependence on n and θ0

$$g_Y(t|n,\theta_0) = \left[1 + F(\theta_0)\left(e^t - 1\right)\right]^n. \tag{2.5.2}$$

If the number of trials n is sufficiently large that the values selected randomly for θ0 induce P = F(θ0) to cover its range (0, 1) evenly, then the population of successes would be characterized by the P-averaged mgf
Table 2.1

Quantity    U[0, n] (discrete)                                  U(0, n) (continuous)
mgf         [e^{(n+1)t} − 1]/[(n+1)(e^t − 1)]                   (e^{tn} − 1)/(tn)
⟨Y⟩         n/2                                                 n/2
⟨Y²⟩        n²/3 + n/6                                          n²/3
⟨Y³⟩        n³/4 + n²/4                                         n³/4
⟨Y⁴⟩        n⁴/5 + 3n³/10 + n²/30 − n/30                        n⁴/5
σ²_Y        n(n+2)/12                                           n²/12
Sk_Y        0                                                   0
K_Y         (9/5)[n(n+2) − 4/3]/[n(n+2)]                        9/5

$$g_Y(t|n) = \int_0^1 \left[1 + P\left(e^t - 1\right)\right]^n dP = \frac{e^{(n+1)t} - 1}{(n+1)\left(e^t - 1\right)}. \tag{2.5.3}$$
The right side of Eq. (2.5.3) actually takes the form of one of the familiar discrete distributions, but, if we were not aware of this, we could obtain the probability function {p_j, j = 0, 1, 2...} for j successes by replacing e^t by s and examining the probability generating function (pgf) f(s)

$$f_Y(s|n) = \sum_{j\ge 0} p_j s^j = \frac{1 - s^{n+1}}{(n+1)(1-s)}. \tag{2.5.4}$$

From the geometric-series identity $\sum_{j=0}^{n} s^j = \frac{1-s^{n+1}}{1-s}$ it follows that

$$p_j = \frac{1}{n+1} \qquad (j = 0, 1, \ldots, n). \tag{2.5.5}$$
Thus, expression (2.5.3) is the mgf of a discrete uniform distribution over the range (0, n), as shown explicitly by the probability function in (2.5.5). I will distinguish the discrete uniform distribution symbolically from a continuous uniform distribution over the same range by writing U[0, n] in contrast to U(0, n) (see Table 2.1). The first few moments of both distributions can be calculated readily either by summing/integrating over the probability function or by differentiating the moment-generating function. (As discussed in the previous chapter, the latter operation requires taking the limit t → 0 by means of L'Hôpital's rule.)
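The discrete-uniform column of Table 2.1 can be confirmed by direct summation over the flat probability function; a minimal sketch (n is an arbitrary choice):

```python
# Raw moments of U[0, n] from p_j = 1/(n+1), j = 0..n, for comparison with
# the closed forms n/2, n^2/3 + n/6, n^3/4 + n^2/4, and n(n+2)/12.
def raw_moments(n, k_max=4):
    pj = 1.0 / (n + 1)
    return [sum(j ** k * pj for j in range(n + 1)) for k in range(1, k_max + 1)]

n = 20
m1, m2, m3, m4 = raw_moments(n)
variance = m2 - m1 ** 2
```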
Figure 2.4 shows a stochastic confirmation of the predicted probability function
(2.5.5). The histogram records the distribution of number of successes per set of
Fig. 2.4 Silverman–Bayes experiment with variable decision parameter θ0. Histogram of the number of successes per set of 20 trials obtained from 4000 sets of trials in which θ0 is a U(0, 1) random variable realized before each set. The dashed line at 190.48 marks the expected mean number of successes for the predicted posterior distribution U[0, 20].
20 trials obtained from 4000 sets of trials in which the decision variable θ0 was selected randomly before each set of trials from a U(0, 1) random number generator, followed by the selection of the deciding variables θ from the same RNG. The dashed line marks the theoretically expected mean number of events (4000/21 = 190.48) in each outcome category. A chi-square test of the goodness-of-fit of the distribution U[0, n] yielded χ² = 20.7 for d = 21 degrees of freedom, corresponding to the P-value 0.48 (not to be confused with the symbol P for probability of success).
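The experiment of Figure 2.4 is easily re-created; a sketch (the seed is an arbitrary choice) that also computes the chi-square statistic against the flat expectation of Eq. (2.5.5):

```python
import random
from collections import Counter

# Variable-theta0 Silverman-Bayes experiment: a fresh P = F(theta0) ~ U(0,1)
# before each set of n = 20 trials should make the success count uniform
# on {0,...,20}, i.e. about 4000/21 events per category.
def simulate(sets=4000, n=20, seed=4):
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(sets):
        p = rng.random()
        counts[sum(rng.random() < p for _ in range(n))] += 1
    return counts

counts = simulate()
expected = 4000 / 21
chi2 = sum((counts[j] - expected) ** 2 / expected for j in range(21))
```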
Suppose next, as a further variation of the Bayes problem, that the number n of
trials per set was also to be chosen randomly. In other words, before tossing the coin,
the number of trials n is selected randomly by a RNG and then the decision variable
0 for that set is selected randomly by the same RNG. How, under these conditions,
would the number of successes be distributed? In the course of repeating the experiment numerous times, how many heads on the average would you expect to obtain
per set of trials? From a Bayesian perspective, the only prior information that is now
available is the type of random number generator and its associated parameter(s).
To be specific, let us adopt first a Poisson distribution with parameter μ for the probability of the number of trials n

$$\Pr(N = n|\mu) = \frac{e^{-\mu}\,\mu^n/n!}{1 - e^{-\mu}} \qquad n = 1, 2\ldots \tag{2.5.6}$$

Note that n = 0 is not part of the range; each set is required to comprise at least 1 trial. The factor (1 − e^{−μ})⁻¹ is therefore required in order for the completeness relation to be satisfied. Having already averaged the mgf (2.5.2) over P, we must
now average the result (2.5.3) over n, where n = 1, 2.... The averaging is not difficult to do, although it is a little tedious. It would be easier, however, to perform the average over n first and then over θ0. The final result, as one can demonstrate, does not depend on the order in which the averages are taken.
Starting again, therefore, with the mgf (2.5.2) and performing the average

$$g_Y(t|\theta_0) = \left(1 - e^{-\mu}\right)^{-1}\sum_{n=1}^{\infty}\frac{\mu^n e^{-\mu}}{n!}\left[1 + F(\theta_0)\left(e^t - 1\right)\right]^n = \frac{e^{-\mu}}{1 - e^{-\mu}}\left[\sum_{n=0}^{\infty}\frac{\left[\mu\left(1 + F(\theta_0)(e^t-1)\right)\right]^n}{n!} - 1\right] = \frac{e^{\mu F(\theta_0)\left(e^t - 1\right)} - e^{-\mu}}{1 - e^{-\mu}}, \tag{2.5.7}$$
we see from the form of the resulting mgf that the random variable describing the distribution of successes would itself be a Poisson variate with mean parameter μF(θ0), except for the probability of getting zero successes. This is an interesting result in itself, representative of a class of compound distributions that can occur in physics (for example, in testing photon emissions for randomness, to be discussed in a later chapter) and in commercial risk assessment (for example, the probability of damage to structures by lightning strikes, where the number of hits is a Poisson variate to be folded into the probability of damage per hit).¹⁰
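The compound process behind (2.5.6) and (2.5.7) can be simulated directly; in the sketch below (the sampler and parameter choices are mine) the mean success count should approach μP/(1 − e^{−μ}), the first moment implied by (2.5.7):

```python
import random
from math import exp

# Number of trials is a zero-truncated Poisson(mu) variate; the success
# probability P = F(theta0) is held fixed while successes are counted.
def truncated_poisson(rng, mu):
    while True:                       # redraw until n >= 1 (Eq. (2.5.6))
        n, p, u = 0, exp(-mu), rng.random()
        while u > p:                  # inverse-CDF sampling of Poisson(mu)
            u -= p
            n += 1
            p *= mu / n
        if n >= 1:
            return n

def mean_successes(mu=10.0, P=0.5, sets=20_000, seed=5):
    rng = random.Random(seed)
    total = 0
    for _ in range(sets):
        total += sum(rng.random() < P for _ in range(truncated_poisson(rng, mu)))
    return total / sets

avg = mean_successes()   # theory: 10 * 0.5 / (1 - e^-10), just above 5
```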
The average of (2.5.7) over θ0 (or P) produces the mgf

$$g_Y(t) = \int g_Y(t|\theta_0)\,f(\theta_0)\,d\theta_0 = \frac{1}{1 - e^{-\mu}}\left[\int_0^1 e^{\mu P\left(e^t - 1\right)}\,dP - e^{-\mu}\right] = \frac{e^{\mu\left(e^t - 1\right)} - 1}{\mu\left(e^t - 1\right)\left(1 - e^{-\mu}\right)} - \frac{e^{-\mu}}{1 - e^{-\mu}}, \tag{2.5.8}$$
which is not a particularly familiar generating function. Nevertheless, one can determine the first few moments from (2.5.8), which turn out to be

$$\langle Y\rangle = \frac{\mu}{2\left(1 - e^{-\mu}\right)} \qquad \mathrm{var}(Y) = \frac{\mu^2/3 + \mu/2}{1 - e^{-\mu}} - \frac{\mu^2}{4\left(1 - e^{-\mu}\right)^2}. \tag{2.5.9}$$
Replacing e^t by s in (2.5.8) gives the corresponding pgf

$$f_Y(s) = \frac{e^{\mu(s-1)} - 1}{\mu(s-1)\left(1 - e^{-\mu}\right)} - \frac{e^{-\mu}}{1 - e^{-\mu}}, \tag{2.5.10}$$

10
W. Feller, An Introduction to Probability Theory and its Applications Vol. 1, 2nd Edition (Wiley, New York, 1950) 270–271.
(2.5.12)

$$\Gamma(m+1,\,x) = \int_x^{\infty} t^m e^{-t}\,dt. \tag{2.5.13}$$

If instead the number of trials n is drawn uniformly from {1, 2, ..., N}, averaging the mgf (2.5.3) over n gives

$$g_Y(t) = \frac{1}{N\left(e^t - 1\right)}\sum_{n=1}^{N}\frac{e^{(n+1)t} - 1}{n+1}, \tag{2.5.15}$$
2:5:15
N1
4
hY 2 i
4N 5N 1
36
varY
N 2 18N 11
:
144
2:5:16
The corresponding pgf is

$$f_Y(s) = \sum_{j\ge 0} p_j s^j = \frac{1}{N(1-s)}\sum_{n=1}^{N}\frac{1 - s^{n+1}}{n+1} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{n+1}\sum_{k=0}^{n}s^k \tag{2.5.17}$$

11
$\sum_{x=1}^{N} x = \frac{N(N+1)}{2}$ and $\sum_{x=1}^{N} x^2 = \frac{N(N+1)(2N+1)}{6}$.
in which it is to be noted that the series in s is finite, the highest-order term being s^N. The pattern of the set {p_j} generated by (2.5.17) is an interesting one, as seen by writing it explicitly for N = 3:
$$\begin{aligned}
p_0 &= \tfrac{1}{4}\left(1 + \tfrac{1}{2} + \tfrac{1}{3} + \tfrac{1}{4}\right) \\
p_0 &= p_1 = \tfrac{1}{3}\left(\tfrac{1}{2} + \tfrac{1}{3} + \tfrac{1}{4}\right) \\
p_2 &= \tfrac{1}{3}\left(\tfrac{1}{3} + \tfrac{1}{4}\right) \\
p_3 &= \tfrac{1}{3}\left(\tfrac{1}{4}\right)
\end{aligned} \tag{2.5.18}$$
The first p0 (in the inner box) is the probability of no successes under the condition that n = 0 is included in the number of trials. The other probabilities (in the outer box) are pertinent to the problem we are examining and are seen to conform to the relation

$$p_j = \frac{1}{N}\sum_{k=j+1}^{N+1}\frac{1}{k} \quad (j \ge 1) \qquad p_0 = p_1. \tag{2.5.19}$$
There is, however, a closed-form expression for the sum in (2.5.19) deriving from the identity

$$\sum_{k=1}^{N}\frac{1}{k} = \psi(N+1) + \gamma, \tag{2.5.20}$$

in which the digamma function is

$$\psi(x) \equiv \frac{d}{dx}\ln\Gamma(x) \tag{2.5.21}$$

and γ is the constant

$$\gamma = \lim_{N\to\infty}\left(\sum_{n=1}^{N}\frac{1}{n} - \ln N\right), \tag{2.5.22}$$
The Euler or Euler–Mascheroni constant is an unending decimal number which (to my knowledge) has not been proven to be either algebraic or transcendental (i.e. not a solution of an algebraic equation with rational coefficients). e and π are examples of transcendental numbers.
Fig. 2.5 Silverman–Bayes experiment with variable parameters θ0 (decision) and n (trial size). Distribution of the number of successes in which n is a random variable (a) U[1,20], (b) U[1,50], (c) Poi(10), (d) Poi(25); the posterior probability is averaged over allowable values of both n and θ0. The mean number of successes is (a) 5.25, (b) 12.75, (c) 5.00, (d) 12.50.
defined as the limiting difference between the harmonic series and the natural logarithm. We may therefore express the probability function (2.5.19) in the form

$$p_j = \frac{1}{N}\left[\psi(N+2) - \psi(j+1)\right] \quad (j \ge 1) \qquad p_0 = p_1. \tag{2.5.23}$$
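Relation (2.5.19) can be verified exactly with rational arithmetic; a sketch (the value of N is an arbitrary choice) comparing the harmonic-sum formula with the direct average over n:

```python
from fractions import Fraction

# Eq. (2.5.19): for n uniform on {1..N}, the probability of j successes is
# p_j = (1/N) * sum_{k=j+1}^{N+1} 1/k for j >= 1, with p0 = p1.
def pj_formula(j, N):
    j = max(j, 1)                     # p0 = p1
    return Fraction(1, N) * sum(Fraction(1, k) for k in range(j + 1, N + 2))

def pj_direct(j, N):
    # average over n of the flat conditional p_j|n = 1/(n+1), valid for j <= n
    return Fraction(1, N) * sum(Fraction(1, n + 1) for n in range(max(j, 1), N + 1))

N = 12
agree = all(pj_formula(j, N) == pj_direct(j, N) for j in range(N + 1))
total = sum(pj_formula(j, N) for j in range(N + 1))
```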
A comparison of the outcomes of using either a Poisson or uniform RNG for the random selection of n is illustrated in Figure 2.5 for two values of the Poisson mean (μ = 10, 25) and two values of the uniform upper limit (N = 20, 50), which lead, correspondingly, to approximately the same mean number of successes (respectively ~5 and 12.5). In the first case (Poisson RNG), the probability pj of obtaining j successes in a set of trials is maximum for j = 0, drops to approximately half maximum near j = μ, and has fallen essentially to 0 by j = 2μ. The larger the value of μ, the longer the function remains close to its maximum value before descending rapidly, somewhat like a Fermi–Dirac occupation probability curve. In the second case (uniform RNG), pj is maximum at j = 0 and drops monotonically to 0 at j = N. The two distributions illustrated bear no similarity whatever to a normal distribution, and therefore it should be no surprise that the significance of the standard deviation as a measure of uncertainty is very different.
$$\begin{aligned}
\Pr(b;m|a;n) &= \frac{\displaystyle\int_0^1 \binom{m}{b}\theta^{b}(1-\theta)^{m-b}\,\theta^{a-1}(1-\theta)^{n-a-1}\,d\theta}{\displaystyle\int_0^1 \theta^{a-1}(1-\theta)^{n-a-1}\,d\theta} \\
&= \binom{m}{b}\frac{B(a+b,\,n+m-a-b)}{B(a,\,n-a)} = \frac{\dbinom{a+b-1}{a-1}\dbinom{n+m-a-b-1}{n-a-1}}{\dbinom{n+m-1}{n-1}}
\end{aligned} \tag{2.5.24}$$
when plotted as a function of b for fixed a, n, m is not significantly different from the Bayes solution (2.1.7) already discussed, except in the case of n = 1 trial, which produces a completely flat distribution rather than the broad rounded curve of Figure 2.1(a). We may interpret this, as before, by arguing that, until one has executed a minimum of two trials and obtained one success and one failure, one cannot even presume that outcomes follow a binomial distribution. With increasing n, the posterior probability of success (2.5.24) approaches the same hypergeometric distribution (2.1.8) as Bayes' solution.
3
Mother of all randomness
Part I
The random disintegration of matter
2 Paul Kammerer, Das Gesetz der Serie [The Law of Seriality] (Deutsche Verlags-Anstalt, 1919) 456. (Translation from German by M. P. Silverman).
3 Arthur Koestler, The Roots of Coincidence: An Excursion Into Parapsychology (Vintage, New York, 1972) 85–86.
M. P. Silverman and W. Strange, "Search for correlated fluctuations in the decay of Na-22," Europhysics Letters 87 (2009) 32001 p1–p5.
Before commenting on the proffered evidence, let me make the foregoing paragraph explicitly clear in the context of a field in which I have some expertise (nuclear physics). Suppose I set up a nuclear counting apparatus to count what in effect amounts to the number of radioactive nuclei of type X disintegrating in some specified window of time (let us say one second), and a colleague also sets up elsewhere (for example in some other part of the building, city, state, country...wherever) an apparatus to count decaying radioactive nuclei of type Y, which need not be the same species (or "nuclide" in the terminology of physics) that I count. The two sets of apparatus begin their tasks and record chronologically in a long time series many one-second intervals (bins) of counts. From bin to bin the number of counts will vary, some bins showing counts greater than the mean, others less, since the transmutation of nuclei is an archetypical quantum process (a transition between different quantum states) whose individual occurrences are believed to be random and unpredictable.
If my colleague's apparatus were to record consistently a greater-than-average number of decays of nuclide Y whenever my apparatus recorded a greater-than-average number of decays of nuclide X, then statistically we would say that the fluctuations (variations about the average) of the two stochastic processes were positively correlated. If his counts were consistently lower than average whenever mine were greater than average, then the fluctuations would be negatively correlated. In any event, the two independent decay processes would be correlated, and such correlations would be contrary to quantum theory as it is currently understood. It has long been known that spontaneous nuclear decay is virtually insensitive to the gravitational, chemical, or electromagnetic environment in which the unstable nucleus finds itself. This broad remark calls for some qualifications, which I will address later.
A reproducible observation, therefore, of correlated fluctuations in the time series
of identical disintegrating nuclei would call for a most unusual explanation, quite
possibly beyond currently known physical principles. Moreover, for time series of
S. E. Shnoll et al., "Realization of discrete states during fluctuations in macroscopic processes," Physics–Uspekhi 41 (10) (1998) 1025–1035. [Uspekhi Fizicheskikh Nauk, Russian Academy of Sciences]
5 See, for example: M. P. Silverman, W. Strange, C. R. Silverman, and T. C. Lipscombe, "Tests for Randomness of Spontaneous Quantum Decay," Physical Review A 61 (2000) 042106 (1–10).
6 For historical reasons the manifolds (shells) of atomic electrons are named (from innermost outward) K, L, M, etc.
the weak nuclear interaction. The researches were undertaken to test one of the most
basic features of quantum physics, namely, that transmutations of unstable nuclei
occur randomly and without memory. A finding that this is not so would have a
twofold significance, at least. First, it would represent a striking violation of quantum
mechanics, the theory that accounts most comprehensively for the structure of matter
and interaction of matter and energy. And second, there would be repercussions for
practical use of nuclear decay as a means of generating true random numbers, in
contrast to pseudo-random numbers created by mathematical algorithms run on
computers, for the wide range of applications that require them, such as cryptography, statistical modeling (in science, medicine, economics, etc.), Monte Carlo
methods of simulation, computer gaming, and others.
The series of investigations into the randomness of nuclear decay (which I have
described in my earlier book A Universe of Atoms, An Atom in the Universe)7 was,
I believe, the most comprehensive study of its kind undertaken to that time. The
outcome was to conclude that the data (temporal sequences of nuclear decays) were
thoroughly compatible with what was to be expected on the basis of pure chance
or, in statistical parlance, to say that the results were "under statistical control". It is
worth emphasizing at this point that one can never prove that some process is
random, for no matter how many statistical tests the data generated by the process
may satisfy, there is always a possibility of producing yet another test that the data
(or a larger sample of data) may fail. What is ultimately demonstrable is that a
stochastic process may be non-random. A non-random process can furnish information by which future outcomes are predictable to an extent greater than that due
to pure chance alone.
Let us now examine what evidence justified the claims of correlated fluctuations between stochastic processes and the extrapolation to the existence of a new cosmic force. The observations were of two principal kinds, both based on visual inspection of the shapes of histograms.
The first observation purportedly manifested what the quoted authors believed to be discrete states during macroscopic fluctuations. The histograms (exhibited in the cited article for a variety of nuclear processes such as the alpha decay of plutonium-239 (239Pu) and K-capture in iron-55 (55Fe)) were constructed of layers in which the first layer recorded frequencies of events i = 1...I, the second recorded frequencies of events i = 1...2I, the third recorded frequencies of events i = 1...3I, and so on, the jth layer recording frequencies of events i = 1...jI for some integer I. A striking pattern of well-defined articulations in the layers resulted, like those shown in the upper panel of Figure 3.1. I will discuss shortly the significance of this finding, which I have easily duplicated, but let us first move on to the second piece of evidence.
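The layering procedure just described is easy to duplicate; in the sketch below (a rounded-Gaussian stand-in for the Poi(200) RNG, adequate at so large a mean, and arbitrary seed) successive layers necessarily reproduce one another's features:

```python
import random
from collections import Counter

# Elemental histograms H_a come from successive blocks of 1000 counts, and
# layer L_i is the running superposition H_1 + ... + H_i.  Each layer
# contains all previous ones, so every bump recurs from layer to layer:
# the striking articulations reflect overlapping event histories, not data.
def poisson_like(rng, mu=200):
    # normal approximation to Poisson(mu), reasonable for mu = 200
    return max(0, round(rng.gauss(mu, mu ** 0.5)))

rng = random.Random(6)
layers, running = [], Counter()
for a in range(1, 21):                          # 20 layers
    running += Counter(poisson_like(rng) for _ in range(1000))
    layers.append(running.copy())

# every class value present in layer i appears in layer i+1 with a count
# at least as large
nested = all(
    all(layers[i + 1][v] >= c for v, c in layers[i].items())
    for i in range(len(layers) - 1)
)
```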
The second kind of reported observation was a perceived recurrence in time of
histograms of similar shapes. Presented in the article were two composite figures (not
7 M. P. Silverman, A Universe of Atoms, An Atom in the Universe (Springer, New York, 2002).
Fig. 3.1 Twenty-layered histograms (unit intervals between classes) with overlapping (top
panel) and non-overlapping (bottom panel) event histories. Elemental histograms H_a (a = 1,
2…210) were constructed from 1000 random numbers produced by a Poi(200) RNG.
Correlated-layer histograms L_i (i = 1…20) were formed by superpositions L_i = Σ_{a=1}^{i} H_a.
(Axes: Class Value 160–240 vs. Frequency 0–600.)
reproduced here) comprising 12 or 18 panels of histograms, each panel a superposition of
two histograms separated in time. The histograms, which had been smoothed to
facilitate visual comparison, showed recurrent coarse features such as broad single
peaks, rabbit ears, rolling ridges, and other geometrical structures.
Before examining the matter of correlated nuclear fluctuations rigorously and
comprehensively, I want to stress that the so-called shape of a histogram is an
ill-defined geometric feature and not an invariant characteristic of a multinomial
distribution. It can take widely differing forms for a given set of events depending
on the number and widths of the arbitrary classes into which events are assigned.
Irrespective of the validity of the claims made in the article, the visual observation of
patterns is too fraught with human bias to be accepted as evidence of a scientific
phenomenon, especially one at such variance with prevailing physical theory. Indeed,
²²Na → ²²Ne + e⁺ + ν_e   (3.2.1)
p → n + e⁺ + ν_e.   (3.2.2)
R. Graham, B. Rothschild, and J. H. Spencer, Ramsey Theory (John Wiley and Sons, New York, 1990).
takes place within the nucleus when the binding energy (i.e. the energy released when
individual nucleons, protons and neutrons, combine to form a nucleus) of the
mother nucleus (²²Na) is less than that of the daughter nucleus (²²Ne). The energy
difference is then partitioned among the mass and kinetic energies of the product
particles.
There were several reasons for the choice of sodium decay. First, the process
should be governed by Poisson statistics, a hypothesis that I will discuss in more
detail in due course; thus the parent probability function was known and all other
pertinent statistical quantities could be determined analytically. Second, this
transmutation is, as mentioned, an example of a weak nuclear interaction with long
half-life; thus the time series of decays over the period of our experiment was very nearly
stationary. In other words, the number of radioactive nuclei in the sample was
sufficiently large throughout the duration of the experiment that the mean number
of decays per counting interval (bin) remained nearly constant. In reality, the mean
count decreased slightly in time, but we could detect this and correct for it. Third, the
decay summarized by (3.2.1) yielded a stable nuclide of neon and a single outgoing
positron, which immediately interacted with an ambient electron leading to
electron-positron annihilation
e⁺ + e⁻ → γ + γ   (3.2.3)
to produce two counter-propagating 511 keV gamma photons. The simplicity of the
final state together with spatial correlation and narrow energy uncertainty of the γs
permitted us to make gamma photon coincidence measurements with very low
background and high signal-to-noise ratio.
From a philosophical perspective, it would not be an exaggeration to note that the
process (3.2.3) exemplifies the two conceptual pillars upon which rests the entire
edifice of physics. First is the complete annihilation of matter to pure energy, a
process impossible to imagine before Einstein's theory of special relativity. Each
photon carries the energy equivalent of the mass of one electron, which amounts to
511 keV.9 Second is the entanglement of the two counter-propagating gamma
photons, a quantum mechanical two-particle state that does not factor into products
of single-particle states no matter how great the separation between the particles, and
which subsequently can manifest correlations inexplicable on the basis of classical
physics. Erwin Schrödinger, who developed the form of quantum mechanics initially
known as wave mechanics, coined the term entanglement [Verschränkung], referring
to this feature as "the characteristic trait of quantum mechanics, the one that
enforces its entire departure from classical lines of thought."10 Interesting as the
9 The mass m of an electron is approximately 9.109 × 10⁻³¹ kg. The energy equivalent is mc², where c is the speed of light
≈ 2.998 × 10⁸ m/s. This leads to 8.187 × 10⁻¹⁴ J or ~511 × 10³ eV.
10 E. Schrödinger, "Discussion of Probability Relations Between Separated Systems," Proceedings of the Cambridge
Philosophical Society 31 (1935) 555–563.
Fig. 3.2 Scatter plot of gamma coincidence counts as a function of time (i.e. bin number),
where each point represents 10 bins, or a time window of 4.39 s.
(Axes: Bin 0–1000 vs. Count 1800–2100.)
a plot such as this one, however, is the seemingly contradictory nature of its two
outstanding features. On the one hand, there is the visual suggestion of pure randomness, the points dotting the plane of the graph like gray snowballs thrown
against a wall. On the other hand, random though the snowball impacts may be,
they are governed by a statistical law, as represented graphically by the three
histograms in Figure 3.3.
The histograms, which respectively summarize the frequencies of occurrence of
gamma coincidences at the very start of the experiment (Bag 1), the middle of the
experiment (Bag 83), and at the end of the experiment (Bag 167), were each subjected to
a chi-square test for goodness of fit to a Poisson distribution of corresponding mean
parameter {λ_a, a = 1…167} obtained for each bag by a maximum likelihood line of
regression fit to the scatter plot. The histograms comprised K = 91 classes of unit
width, each class identified by an integer number of gamma coincidences, spanning a
range centered on the integer closest to the mean. The figure reveals to the eye
graphically how well the histograms conform to a Poisson distribution. However,
the aggregate of chi-square tests, which matches the distribution of resulting values
{χ²_a, a = 1…167} against the theoretical density χ²(d = 89), shows the mind's eye
analytically whether the data support the null hypothesis. A chi-square test of this fit with
14 classes yielded P = 0.26 for χ²(d = 13) = 15.76, which is unambiguously acceptable.
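A chi-square goodness-of-fit test of this kind is straightforward to reproduce. The sketch below (my own illustration using scipy, not the experiment's actual pipeline) bins simulated Poisson counts into unit-width classes, lumps outlying counts into the edge classes, and compares observed with expected frequencies; here the mean is taken as known rather than fitted, so the degrees of freedom are K − 1:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam, n = 200.0, 1000
counts = rng.poisson(lam, n)

# Unit-width classes spanning the mean, with all outlying counts
# lumped into the two edge classes so the expected total matches n.
lo, hi = 170, 230
k = np.arange(lo, hi + 1)
observed = np.array([(counts == v).sum() for v in k], dtype=float)
observed[0] += (counts < lo).sum()
observed[-1] += (counts > hi).sum()

expected = n * stats.poisson.pmf(k, lam)
expected[0] += n * stats.poisson.cdf(lo - 1, lam)
expected[-1] += n * stats.poisson.sf(hi, lam)

chi2 = ((observed - expected) ** 2 / expected).sum()
pval = stats.chi2.sf(chi2, df=len(k) - 1)   # mean taken as known here
print(f"chi-square = {chi2:.1f}, P = {pval:.3f}")
```

When, as in the text, the mean is itself estimated from the data, one further degree of freedom is subtracted (hence d = 89 for K = 91 classes).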
Thus, although the number of coincidences per bin is randomly scattered about
the mean in Figure 3.2, that scatter falls within fairly well-defined limits set by the
single parameter (mean = variance) of the Poisson probability law. The noted
nuclear physicist turned quantum philosopher, J. A. Wheeler, who (so it seemed to
me) liked to speak and write in riddles, apparently a consequence of his early
exposure to Niels Bohr, conjured up the phrase "law without law"12 to describe
12 J. A. Wheeler and W. H. Zurek, Eds., Quantum Theory and Measurement (Princeton University Press, Princeton NJ,
1983), Chapter I.13 "Law Without Law," 182–213.
Fig. 3.3 Histogram of counts for bags of data collected at the beginning (Bag 1), middle (Bag
83), and end (Bag 167) of the experiment. Each bag represents about one hour of data
accumulation. Superposed on each histogram is a Poisson probability function of
corresponding mean parameter λ. The dashed curve is the Gaussian density N(λ, λ) ≈ Poi(λ),
valid for λ ≫ 1. (Axes: Count Class 140–240 vs. Relative Frequency 0–0.025.)
quantum interference phenomena. I cannot say that I ever really fathomed what he
meant by it or by his other cryptic expressions such as "magic without magic,"
"higgledy-piggledy universe," "great smokey dragon," and more, but perhaps an
interpretation that gives some sense to the phrase is the highly structured nature of
randomness. The idea of randomness to someone who has not thought about it too
deeply is the occurrence of events willy-nilly, without plan or choice, without
organization or direction, without pattern or connection, without identifiable cause,
without predictability, to put together some of the common associations I have
encountered. On the contrary, as we proceed with this investigation, it will become
clearer how tightly the patterns of randomness are constrained.
composite histogram in the figure comprises 20 layers, each of which, in the notation
just developed, can be represented symbolically by L_I = Σ_{a=1}^{I} H_a, where I = 1…20. The
M. P. Silverman, Probing the Atom: Interactions of Coupled States, Fast Beams, and Loose Electrons (Princeton
University Press, Princeton NJ, 2000).
The physical content of the null hypothesis is that the probability that a nucleus
decays within a short time interval δt is proportional to δt
Pr(1|λ; δt) = p = λ δt,   (3.4.1)
where the intrinsic decay rate λ is the constant of proportionality. It then follows that
the probability that this nucleus does not decay within the composite time interval t =
mδt is
Pr(0|λ; t) = (1 − λδt)^m = (1 − λt/m)^m → e^{−λt}  (m → ∞),   (3.4.2)
which, in the limit of an infinitely large number of infinitesimally short time intervals,
takes the form of an exponential as shown in (3.4.2). Therefore, the probability that
this nucleus does decay within the time interval t is 1 − e^{−λt}. Under the assumption
that all such decays are mutually independent events, the probability that n out of
N nuclei decay within time t is equivalent to asking for the probability of n successes
out of N independent trials, which immediately calls to mind a binomial distribution
Pr(n|N, λ; t) = C(N, n) (1 − e^{−λt})^n (e^{−λt})^{N−n},   (3.4.3)
where C(N, n) = N!/[n!(N − n)!] is the binomial coefficient.
Equation (3.4.3) gives the exact probability distribution for the disintegration of
radioactive nuclei or, indeed, for the irreversible transition of any quantum system
out of its initial state, subject to the null hypothesis. Although exact, the binomial
expression is not practically useful because the number of nuclei in a macroscopic
sample is astronomically large, even if macroscopic may actually entail a very small
mass. For example, an approximate 0.08 Ci sample of ²²Na has a mass close to
13 μg and contains about 10¹⁷ sodium atoms. I will justify this assertion shortly.
Without further simplification or approximation, no computer can evaluate
combinatorial coefficients C(N, n) with N ~ 10¹⁷. However, the intrinsic decay rate λ
for a weak interaction process is very low. For a process with half-life of ~2.6 years,
the corresponding decay rate is λ ~ 8.4 × 10⁻⁹ transitions per second. This relation
will also be justified in a moment. We have seen, therefore, that when the number of
trials is very large and the probability of success very small, a binomial distribution
Bin(N, p) reduces to a Poisson distribution Poi(μ) of mean parameter μ = Np.
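The reduction is easy to check numerically. The short sketch below (an illustration of the limit with arbitrary round numbers, not the experiment's actual N and p) compares Bin(N, p) with Poi(Np) through their total variation distance:

```python
import numpy as np
from scipy import stats

N, p = 1000, 0.01          # many trials, small success probability
mu = N * p                 # Poisson mean parameter, here mu = 10
n = np.arange(0, 101)      # range covering essentially all probability

b = stats.binom.pmf(n, N, p)
q = stats.poisson.pmf(n, mu)

# Total variation distance: half the summed absolute difference.
tvd = 0.5 * np.abs(b - q).sum()
print(f"TVD between Bin({N}, {p}) and Poi({mu:g}): {tvd:.5f}")
assert tvd < 0.02          # the two laws are nearly indistinguishable
```

Increasing N while holding Np fixed drives the distance toward zero, which is the content of the binomial-to-Poisson limit.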
For the case of sodium decay, the binomial-to-Poisson reduction leads to a
mean number of decays at time t given by
μ(t) = N₀(1 − e^{−λt}),   (3.4.4)
where the subscript 0 on N signifies the initial number of radioactive nuclei in the
sample. The mean number of counts N̄_c that one detects (assuming 100% efficiency
of detectors) in a bin of temporal width Δt, which is equal to the number −ΔN of
radioactive nuclei lost from the sample (hence the minus sign) during this interval, is
N̄_c = −ΔN = [dμ(t)/dt] Δt = λN₀ e^{−λt} Δt ≈ λN₀ Δt = N₀ p,   (3.4.5)
where the approximation in the second line holds for λt ≪ 1. Treating the discretely
small quantities in (3.4.5) as differentially small leads to the rate equation for loss of
radioactive nuclei
dN(t)/dt = −λN₀ e^{−λt},   (3.4.6)
whose solution for N(t) with initial condition N₀ gives the familiar result
N(t) = N₀ e^{−λt}.   (3.4.7)
The preceding relation makes apparent why λ is termed the decay constant. The
relation of λ to the half-life τ₁/₂ is derivable from (3.4.7) by setting N(τ₁/₂) = N₀/2 to
obtain
λ = ln 2 / τ₁/₂.   (3.4.8)
We are now in a position to tie some loose ends together. Adopting the half-life
τ₁/₂ = 2.603 years (8.21 × 10⁷ s) for ²²Na yields λ ~ 8.44 × 10⁻⁹ s⁻¹. The number of
radioactive nuclei in a sample of activity
|dN/dt| = λN₀ = 0.08 Ci = 2.96 × 10⁹ nuclei s⁻¹   (3.4.9)
is then
N₀ = |dN/dt| / λ = (2.96 × 10⁹ nuclei s⁻¹)/(8.44 × 10⁻⁹ s⁻¹) ≈ 3.5 × 10¹⁷ nuclei.   (3.4.10)
From knowledge of Avogadro's number, N_Av ~ 6.02 × 10²³ atoms per molar mass of
substance, one readily calculates the mass m of N₀ atoms of molar mass 22 g:
m = (22 g mol⁻¹) × N₀/N_Av = 22 × (3.5 × 10¹⁷)/(6.02 × 10²³) g ≈ 1.28 × 10⁻⁵ g ≈ 12.8 μg.   (3.4.11)
R_n = λ(N₀ − n)   (3.4.12)
R_{n−1} = λ(N₀ − n + 1) ≥ R_n.   (3.4.13)
P_n(t + δt) = R_{n−1} δt P_{n−1}(t) + (1 − R_n δt) P_n(t).   (3.4.14)
14 W. Feller, An Introduction to Probability Theory and its Applications, Vol. 1 (John Wiley & Sons, New York, 1950)
402–404.
The interval δt is sufficiently narrow that we can ignore the possibility of more than
one decay within it. Rearranging terms in Eq. (3.4.14) and taking the limit δt → 0 leads
to the differential equation
dP_n(t)/dt = lim_{δt→0} [P_n(t + δt) − P_n(t)]/δt   (3.4.15)
= R_{n−1} P_{n−1}(t) − R_n P_n(t).   (3.4.16)
Upon substitution of expressions (3.4.12) and (3.4.13) for the rates, there results a set
of master equations
dP_n(t)/dt = −λ(N − n)P_n(t) + λ(N − n + 1)P_{n−1}(t)   (n > 0)   (3.4.17)
dP₀(t)/dt = −λN P₀(t),   initial condition: P_n(0) = δ_{n0}.   (3.4.18)
and then substitute this result into the differential equation for P1(t). Having found
P1(t), substitute it into the differential equation for P2(t), and so on up the line until
one recognizes a general pattern and can obtain the general solution by induction.
This is a somewhat tedious way to proceed, and I will use a different approach
employing a generating function.
Let us define a probability generating function
G(z, t) = Σ_{n=0}^{N} z^{N−n} P_n(t),   (3.4.19)
whose initial value is
G(z, 0) = z^N,   (3.4.21)
since P_n(0) = δ_{n0}, i.e. at the initial instant of time no nucleus has yet decayed (n = 0).
We will determine the functional form of G(z, t) by finding the differential equation
that governs it. To this end note the following temporal and spatial derivatives
(actually, z is just an expansion variable, not connected in any way to the spatial
degrees of freedom of the nuclei)
∂G(z, t)/∂t = Σ_{n=0}^{N} z^{N−n} dP_n/dt = Σ_{n=0}^{N} z^{N−n} [−λ(N − n)P_n + λ(N − n + 1)P_{n−1}]
= −λ(z − 1) Σ_{n=0}^{N−1} z^{N−n−1} (N − n)P_n   (3.4.22)
∂G(z, t)/∂z = Σ_{n=0}^{N} (N − n) z^{N−n−1} P_n = Σ_{n=0}^{N−1} (N − n) z^{N−n−1} P_n.   (3.4.23)
In the first line in (3.4.22) the time derivative of P_n(t) was replaced by its equivalent
from the master equation (3.4.17). Note that the two terms in the square brackets
have opposite signs; after some algebraic rearrangement and relabeling of indices the
term in z^N drops out and the expression in the second line results.
Examination of the final expressions in (3.4.22) and (3.4.23) reveals the following
equality
∂G/∂t = −λ(z − 1) ∂G/∂z.   (3.4.24)
The simplest approach to solving the differential equation above might be to try separation
of variables, i.e. to express the generating function in the form G(z, t) = Z(z)T(t).
Although this ansatz15 does not work (the initial condition (3.4.21) cannot be satisfied),
the outcome suggests that G(z, t) might be a function of ln(z − 1) − λt, so let us write
G(z, t) = G(ln(z − 1) − λt).   (3.4.25)
Imposing the initial condition (3.4.21) then requires
G(ln(z − 1)) = z^N,   (3.4.26)
and the substitution u = ln(z − 1), i.e. z = e^u + 1, yields
G(u) = (e^u + 1)^N,   (3.4.27)
which now gives us the precise functional form to use for G(z, t), i.e. with time-dependence
included:
G(z, t) = [e^{ln(z−1)−λt} + 1]^N = [ze^{−λt} + 1 − e^{−λt}]^N.   (3.4.28)
15 An ansatz, a word borrowed from German, is a mathematical expression assumed to apply in some situation, but
without a rigorous justification for its use at the outset.
One can readily establish that (3.4.28) satisfies Eq. (3.4.24). Moreover, by expanding
the binomial expression in the second equality, it follows immediately that
G(z, t) = [ze^{−λt} + 1 − e^{−λt}]^N = Σ_{n=0}^{N} C(N, n)(1 − e^{−λt})^n (e^{−λt})^{N−n} z^{N−n} = Σ_{n=0}^{N} z^{N−n} P_n(t),   (3.4.29)
thereby leading to the same binomial probability distribution P_n(t) as obtained
previously in (3.4.3). In hindsight, replacement of z^{N−n} by s^n in the generating
function in (3.4.28) yields precisely what we would have obtained from the binomial
mgf derived in Chapter 1.
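The equivalence of (3.4.28) and the binomial distribution (3.4.3) can be checked numerically for a small N; in the sketch below (stdlib only) the symbol q stands for e^{−λt}:

```python
from math import comb

N, q = 5, 0.37                 # q stands for exp(-lambda*t), any value in (0, 1)

def G(z):
    """Closed-form generating function, Eq. (3.4.28)."""
    return (z * q + 1 - q) ** N

def G_series(z):
    """Sum over n of z^(N-n) P_n(t), with P_n from Eq. (3.4.3)."""
    return sum(comb(N, n) * (1 - q) ** n * q ** (N - n) * z ** (N - n)
               for n in range(N + 1))

for z in (0.0, 0.5, 1.0, 2.0, -1.3):
    assert abs(G(z) - G_series(z)) < 1e-12
```

At z = 1 both expressions equal 1, confirming that the probabilities P_n(t) sum to unity.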
Although we already know the conditions under which a binomial distribution
reduces to a Poisson distribution, we can rediscover this connection in a different way
by examining the master equation (3.4.17). If the number N of nuclei in the sample is
enormously greater than the number n decaying within any time interval throughout
the experiment, which is assuredly the case in the experiment I am discussing, then
one can ignore the dependence of the decay rate R_n on n to obtain a constant decay
rate R = λN. The master equation (3.4.17) for n ≥ 1 then simplifies to
dP_n(t)/dt = −λN[P_n(t) − P_{n−1}(t)] = −R[P_n(t) − P_{n−1}(t)],   (3.4.30)
with generating function
G(z, t) = Σ_{n=0}^{N} z^n P_n(t),   (3.4.31)
where, for all practical purposes, the upper limit of the sum is effectively infinite. It is
then not difficult to establish that G(z, t) satisfies the differential equation
∂G/∂t = R(z − 1)G,   (3.4.32)
which follows upon neglect (in combining two sums) of a vanishingly small term
∝ P_N in the limit N → ∞. The solution to Eq. (3.4.32), subject to initial condition
G(z, 0) = 1, is the exponential
G(z, t) = e^{R(z−1)t},   (3.4.33)
Fig. 3.4 Scatter plot of the natural log of mean counts per bag (1 bag = 8192 bins) as a
function of time. Solid line is the maximum likelihood line of regression for the entire set of
data with slope 193.8 ± 0.16.
R_Y(t, t + τ) ≡ ⟨Y(t)Y(t + τ)⟩ = lim_{N→∞} (1/N) Σ_{k=1}^{N} y_k(t) y_k(t + τ),   (3.6.1)
where inclusion of the time variable t in the argument of R_Y signifies that the outcome
may depend on when, within the time history, the ensemble average is taken. The
autocorrelation function describes quantitatively how closely the values of the data
at one time t depend on the values at another time t + τ; the increment τ is referred to
as the delay time or lag. In the case of two different random signals X(t) and Y(t), one
can define by an analogous ensemble average the cross-correlation function
R_XY(t, t + τ) = lim_{N→∞} (1/N) Σ_{k=1}^{N} x_k(t) y_k(t + τ),   (3.6.2)
where, in general, order matters and R_XY(t, t + τ) ≠ R_YX(t, t + τ). We shall not be
cross-correlating different functions in this chapter, so as a matter of nomenclature
I will refer simply to the correlation function, which is to be understood to mean
autocorrelation.
If the ensemble mean Ȳ(t) ≡ ⟨Y(t)⟩, defined by the same kind of limiting process shown
in (3.6.1), is not zero, then one is usually interested in the covariance function defined by
C_Y(t, t + τ) = ⟨[Y(t) − Ȳ(t)][Y(t + τ) − Ȳ(t + τ)]⟩.   (3.6.3)
For zero delay, Eq. (3.6.3) defines the variance of the stochastic process.
In the special case that the mean and correlation are independent of time translations,
the process is said to be weakly stationary, and R_Y(τ) and C_Y(τ) depend only on
the delay. If all probability distributions relating to the random process Y(t) are
time-independent, the process is termed strongly stationary. For a Gaussian random
process the two concepts coincide because the mean and covariance determine all
other probability distributions. We shall see in due course that the stochastic model
best fitting nuclear decay is an example of a strongly stationary process. The
autocorrelation function of a stationary process is characterized by the symmetry
R_Y(τ) = R_Y(−τ).   (3.6.4)
One can also form the time average of a single sample function
⟨Y⟩ = lim_{T→∞} (1/T) ∫₀^T y(t) dt,   (3.6.5)
where the y(t) in the integrand could be any one of the sample functions of the set
{y_k(t)}. In general, time and ensemble averages are not equivalent. If, however, a
random process is stationary and the time average does not depend on the specific
sample function y_k(t), i.e. it is independent of k, then the process is said to be
ergodic, from Greek roots signifying path of work or action. For a stationary
ergodic process Y(t), time and ensemble averages are equivalent, and we can express
the correlation function (for Ȳ = 0) as
R_Y(τ) = ⟨Y(t)Y(t + τ)⟩ = lim_{T→∞} (1/T) ∫₀^T y(t) y(t + τ) dt.   (3.6.6)
It is often convenient to use the normalized correlation function
ρ(τ) = ⟨Y(t)Y(t + τ)⟩ / ⟨Y(t)²⟩ = R_Y(τ)/R_Y(0),   (3.6.7)
that falls within the range −1 ≤ ρ(τ) ≤ 1. Nomenclature is not consistent, and one will
find the normalized expression (3.6.7) referred to as the autocorrelation function, and
the non-normalized expression (3.6.1) as the autocovariance function.
When a random signal being sampled is not continuous, but is discrete as in the
counting of quantum particles, the serial correlation function of lag k is defined by16
R_k = (1/(N−k)) Σ_{t=1}^{N−k} y_t y_{t+k} − [(1/(N−k)) Σ_{t′=1}^{N−k} y_{t′}][(1/(N−k)) Σ_{t′=1}^{N−k} y_{t′+k}]   (3.6.8)
and normalized as follows
r_k = R_k / {[(1/(N−k)) Σ_{t=1}^{N−k} y_t² − ((1/(N−k)) Σ_{t′=1}^{N−k} y_{t′})²]^{1/2} × [(1/(N−k)) Σ_{t=1}^{N−k} y_{t+k}² − ((1/(N−k)) Σ_{t′=1}^{N−k} y_{t′+k})²]^{1/2}}.   (3.6.9)
16 M. Kendall, A. Stuart, and J. K. Ord, The Advanced Theory of Statistics, Vol. 3 (Macmillan, New York, 1983) 443–445.
r_k = [(1/(N−k)) Σ_{t=1}^{N−k} (y_t − ȳ)(y_{t+k} − ȳ)] / s′²_Y,   (3.6.10)
where the mean and variance
ȳ = (1/N) Σ_{t=1}^{N} y_t,   s′²_Y = (1/N) Σ_{t=1}^{N} (y_t − ȳ)²   (3.6.11)
are computed for the entire series. A disadvantage to the form of r_k, however, is that it can
take values greater than 1, in contrast to the behavior of a true correlation
coefficient. Thus, an alternative form often employed for the correlation
function and correlation coefficient is obtained by approximating (N − k)⁻¹
by N⁻¹:
R′_k = (1/N) Σ_{t=1}^{N−k} (y_t − ȳ)(y_{t+k} − ȳ),
r′_k = [Σ_{t=1}^{N−k} (y_t − ȳ)(y_{t+k} − ȳ)] / [Σ_{t=1}^{N} (y_t − ȳ)²],   (3.6.12)
where r′_k now lies strictly within the range (−1, 1). The serial correlation function
and coefficient of a stationary random process of zero mean are then
R′_k = (1/N) Σ_{t=1}^{N−k} y_t y_{t+k},   r′_k = [Σ_{t=1}^{N−k} y_t y_{t+k}] / [Σ_{t=1}^{N} y_t²].   (3.6.13)
The adjusted time series obtained in the ²²Na gamma-coincidence experiment represents
a process of this kind.
For a long time series of data, the calculation of the correlation function
directly from relation (3.6.13) is extraordinarily time consuming, even when a
fast desktop computer is employed. There is, however, a more efficient way to
perform the calculation, based on a relation, known as the Wiener-Khinchin
theorem, between the correlation function and the power spectrum of the time
series. Beyond facilitating calculation, the power spectrum plays an important
role in this research because it, together with other functions to be described,
reveals what hidden periodicities may lurk within the time series of decaying
nuclei.
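The speed-up is easy to see in practice. The sketch below (my own illustration) computes the serial correlation coefficient r′_k of a zero-mean series both directly from (3.6.13) and by the FFT-based route suggested by the Wiener-Khinchin theorem (power spectrum of the zero-padded series, inverse-transformed); the two agree to machine precision, but the FFT route costs O(N log N) rather than O(N²):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.poisson(200, 4096).astype(float)
y -= y.mean()                      # adjusted (zero-mean) series
N = len(y)

def r_direct(k):
    """Eq. (3.6.13): r'_k = sum_t y_t y_{t+k} / sum_t y_t^2."""
    return np.dot(y[:N - k], y[k:]) / np.dot(y, y)

# FFT route: zero-pad to length 2N to avoid circular wrap-around,
# form the power spectrum |FFT|^2, then inverse-transform.
f = np.fft.rfft(y, n=2 * N)
acf = np.fft.irfft(np.abs(f) ** 2)[:N]
r_fft = acf / acf[0]

for k in (0, 1, 5, 50):
    assert abs(r_fft[k] - r_direct(k)) < 1e-9
```

For a genuinely uncorrelated (white) series such as this one, all the r′_k beyond lag zero fluctuate about zero with standard deviation of order 1/√N.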
Any periodic function f(t) of period T can be represented by the Fourier series
f(t) = a₀ + Σ_{n=1}^{∞} a_n cos(2πnt/T) + Σ_{n=1}^{∞} b_n sin(2πnt/T) = Σ_{n=−∞}^{∞} c_n e^{i2πnt/T},   (3.7.1)
where, by means of Euler's relation
e^{±iθ} = cos θ ± i sin θ,   (3.7.2)
the two sets of coefficients are related by
c_n = a₀ (n = 0),   c_n = (a_n − ib_n)/2 (n > 0),   c_n = (a_{−n} + ib_{−n})/2 (n < 0).   (3.7.3)
To determine the coefficients from the original function f(t), it is generally easier to
use the complex form of the series because the basis functions satisfy a very simple
ortho-normalization relation
∫₀^{2π} e^{i(m−m′)θ} dθ = 2π δ_{mm′}.   (3.7.4)
Thus, by multiplying both sides of (3.7.1) by a basis function e^{−i(2πkt/T)} for integer k
and integrating t over the range (0, T), one obtains
c₀ = (1/T) ∫₀^T f(t) dt = (1/2π) ∫₀^{2π} f(θ) dθ
c_{k≠0} = (1/T) ∫₀^T f(t) e^{−i2πkt/T} dt = (1/2π) ∫₀^{2π} f(θ) e^{−ikθ} dθ,   (3.7.5)
where the second equality follows from the change of variable θ = 2πt/T.
The time series of nuclear disintegrations, which was analyzed for hidden periodic
structure, is not continuous, but a discrete series comprising samples taken every Δt
seconds (the width of one bin) for a total duration T = NΔt. To represent a discrete
time sequence in a Fourier series, the expression in (3.7.1) must be modified as
follows
f(nΔt) ≡ f_n = a₀ + Σ_{j=1}^{N/2} a_j cos(2πnj/N) + Σ_{j=1}^{N/2} b_j sin(2πnj/N) = Σ_{j=−N/2}^{N/2} c_j e^{i2πnj/N}.   (3.7.6)
c_j = (1/N) Σ_{n=1}^{N} f_n e^{−i2πnj/N}.   (3.7.7)
In the discrete series (3.7.6) the ratio t/T became a ratio of integers n/N, the unit Δt
having canceled from numerator and denominator, and the integrals over t in (3.7.5)
were replaced by sums in (3.7.7). The calculation leading to the Fourier coefficients in
(3.7.7) makes use of a discrete ortho-normalization relation (the sum of a complex-valued
geometric series)
Σ_{n=1}^{N} e^{i2πmn/N} = e^{iπm(N+1)/N} [sin(πm)/sin(πm/N)] = N δ_{m0}   (integer |m| < N).   (3.7.8)
C. E. Shannon, "Communication in the presence of noise," Proceedings of the Institute of Radio Engineers 37 (1949)
10–21. Reprinted in the Proceedings of the IEEE 86 (February 1998) 447–457.
Now the more technical explanation. Consider a time-varying signal x(t) that is
sampled (measured) periodically every Δt seconds for a sampling time δt. If there
are no gaps in the sampling process, then δt = Δt; if, however, δt ≪ Δt, then the
signal is being sampled for only a relatively small fraction of the time. In any
event, since the sampling is periodic, we can represent the sampling function by a
Fourier series
S(t) = Σ_k c_k e^{i2πkt/Δt} = Σ_k c_k e^{ikω_s t}   (3.7.9)
with fundamental period Δt or sampling angular frequency ω_s = 2π/Δt. The
functional form of the sampled signal is then x_s(t) = x(t)S(t), whose spectral content is
given by its Fourier transform
X_s(ω) = ∫ x_s(t) e^{−iωt} dt = Σ_k c_k ∫ x(t) e^{−i(ω − kω_s)t} dt = Σ_k c_k X(ω − kω_s).   (3.7.10)
The physical significance of the final expression in (3.7.10) is that the sampling
process has generated an infinite number of replicas of the original signal in
frequency space, the replicated spectral lineshapes being spaced at intervals of
ω_s = 2π/Δt or ν_s = ω_s/2π = 1/Δt.
Let us now suppose that the original signal is band-limited, which means that its
frequency content is confined to a frequency interval 2B about the central frequency.
If the replicas are not to overlap, then the highest frequency of one replica (e.g. B for
the spectrum k = 0 centered at the origin) must be less than the lowest frequency of
the succeeding replica (e.g. ν_s − B for k = 1), which places a lower limit on the
sampling frequency
ν_s > 2B ⟹ 1/Δt > 2B ⟹ ν_c ≡ 1/(2Δt) > B.   (3.7.11)
In other words, as long as the sampling frequency is greater than the bandwidth or,
equivalently, the cut-off frequency exceeds the highest frequency contained in the
frequency spectrum of the signal, one can reproduce exactly the original time-varying
signal from a single replica of the Fourier transform of the sampled signal,
even if the sampling time δt is much shorter than the dead time between samples. This
is actually a remarkable theorem when one thinks about it.
All physical signals are band-limited because the signal must have finite starting
and ending times, but the highest frequencies in the spectrum may exceed a practical
sampling frequency. In that case the replicated spectral lineshapes overlap and an
effect known as aliasing occurs. Signals at frequencies greater than ν_c contribute to
Fig. 3.5 Square pulse of unit amplitude (small circles) of period 100 (arbitrary unit). Fourier
reconstructions (solid) comprise frequencies up to maximum harmonic number n of (a) 1, (b) 5,
(c) 99. (Axes: Time 0–100 vs. Amplitude.)
the sampled signal at frequencies below ν_c. Specifically, for any frequency ν in the
range (0, ν_c), the frequencies (2ν_c ± ν), (4ν_c ± ν), … (2nν_c ± ν) are aliased with ν,
as is readily demonstrable from the set of relations below:
cos[2π(2nν_c ± ν)t] = cos(±2πνt + 4πnν_c t)   (integer n)
4πnν_c t = 4πn t/(2Δt) = 2πn × integer   (t an integer multiple of Δt)
⟹ cos[2π(2nν_c ± ν)t] = cos(2πνt).   (3.7.12)
An example that ties these various ideas together is the representation in a Fourier
series of a square pulse of period T_p = 2π
f(t) = 1 (π > t > 0),   f(t) = 0 (t = 0, π, 2π),   f(t) = −1 (2π > t > π),   (3.7.13)
sampled discretely in bins of unit width (Δt = 1) over a total time NΔt with
2π ≡ NΔt = 100. The function is odd over the period, in which case only the coefficients
of the sine series are nonvanishing
a_n = 0,   b_{n>0} = 2[1 − (−1)^n]/(πn)   (n = 0, 1 …),   (3.7.14)
as readily determined from (3.7.7). Thus the square pulse can be reconstructed from a
series of the form
f(t) = (4/π)[sin t + (1/3) sin 3t + (1/5) sin 5t + ⋯].   (3.7.15)
The plots in Figure 3.5 show the original square wave and Fourier reconstructions
with maximum frequencies ν_n = n/T_p marked by harmonic indices n = 1, 5, and 99.
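A reconstruction like the one in Figure 3.5 takes only a few lines. The sketch below (illustrative, with time rescaled so that the period is 100 sampling units) sums the odd-harmonic sine series for the square pulse and checks that adding harmonics improves the fit:

```python
import numpy as np

Tp = 100.0                      # pulse period in sampling units
t = np.arange(100) + 0.5        # sample at bin centers, avoiding the jumps
square = np.where(t < Tp / 2, 1.0, -1.0)

def reconstruct(t, n_max):
    """Partial Fourier sum (4/pi) sum over odd n <= n_max of sin(2*pi*n*t/Tp)/n."""
    f = np.zeros_like(t)
    for n in range(1, n_max + 1, 2):      # only odd harmonics survive
        f += (4 / np.pi) * np.sin(2 * np.pi * n * t / Tp) / n
    return f

err1 = np.max(np.abs(reconstruct(t, 1) - square))
err99 = np.max(np.abs(reconstruct(t, 99) - square))
assert err99 < err1             # more harmonics, closer reconstruction
```

The residual error with 99 harmonics is concentrated near the discontinuities, the familiar Gibbs overshoot.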
Fig. 3.6 Unit square pulse (gray circles) and Fourier amplitudes b_n(t) for n equal to (a) 1,
(b) 49, (c) 99. For sampling time Δt = 1, n = 49 corresponds to the highest discernible
frequency, whereas n = 99 is aliased with the fundamental n = 1.
(Axes: Time 0–100 vs. Amplitude.)
As expected, the greater the number of harmonics included in the series for f(t), the
closer the reconstruction resembles the square wave. The figure also illustrates a
point made in the heuristic explanation of the sampling theorem. If the original
function were sampled once every 50 units of time (half the period), then the cut-off
frequency would be the reciprocal of the period, ν_c = 1/(2Δt) = 1/100, and aliasing
would occur for harmonics with frequencies
ν_n = n/T_p = n/100 > ν_c = 1/100 ⟹ n > 1.
In other words, all frequency components beyond the fundamental would be
aliased. This result accords physically with the observation that if we had only
two sample points, one in each bin of width 50Δt, we could not tell to which
harmonic a point belonged. However, if the original function were sampled once
every Δt = 1 unit of time, as was actually the case in construction of the figure, then
the cut-off frequency would be ν_c = 1/(2Δt) = 1/2, and aliasing would occur for
harmonics with frequencies
ν_n = n/T_p = n/(100Δt) > ν_c = 1/(2Δt) ⟹ n > 50.
In other words, only those terms with harmonic numbers n > 50 would be aliased.
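The aliasing relation (3.7.12) can be verified directly at the sample points. In this illustrative sketch (Δt = 1, so ν_c = 1/2), harmonic n = 99 of a period-100 series is indistinguishable from the fundamental apart from a 180-degree phase shift:

```python
import numpy as np

Tp, dt = 100, 1                 # pulse period and sampling interval
t = np.arange(0, 200, dt)       # integer sample times, two full periods

fund = np.sin(2 * np.pi * 1 * t / Tp)    # fundamental, n = 1
alias = np.sin(2 * np.pi * 99 * t / Tp)  # n = 99 > 50, beyond cut-off

# At the integer sample points, sin(2*pi*99*t/100) = -sin(2*pi*t/100):
# the n = 99 harmonic is the fundamental with a 180-degree phase shift,
# exactly as relation (3.7.12) predicts for 2*nu_c - nu_1 = 99/100.
assert np.allclose(alias, -fund, atol=1e-9)
```

Sampled more finely (non-integer t), the two sinusoids would of course separate; the degeneracy exists only on the sampling grid.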
Figure 3.6 illustrates the aliasing phenomenon explicitly. Because the Fourier
spectrum of the square wave pulse (3.7.13) contains only odd-integer sine waves as
represented by (3.7.14), the cut-off frequency ν_c = 1/2 actually corresponds to
harmonic n = 49 (since there is no contribution from n = 50). Superposed over the
square pulse (gray circles) are the sine waves b_n(t) corresponding to harmonic
numbers n = 1, 49, 99. For a sampling interval Δt = 1 and pulse period 100Δt, the
highest discernible frequency corresponds to harmonic n = 49: ν₄₉ = 49/T_p = 49/100 < 1/2
(oscillatory black). The harmonic n = 99 at frequency ν₉₉ = 99/T_p = 99/100 (dashed black)
appears to have the same frequency as the fundamental (solid black), ν₁ = 1/T_p = 1/100,
with phase shift of 180° or π radians. This agrees with relation (3.7.12): frequency
2ν_c − ν₁ = 1 − 1/100 = 99/100 is aliased with ν₁ = 1/100. To distinguish the aliased harmonic
X(ν) = ∫ x(t) e^{−2πiνt} dt,   (3.8.1)
the autocorrelation function
R_X(τ) = ⟨x(t) x(t + τ)⟩,   (3.8.2)
and its Fourier transform, the power spectral density S_X(ν) at frequency ν,
S_X(ν) = ∫ R_X(τ) e^{−2πiντ} dτ.   (3.8.3)
The function S_X(ν) is a measure of the energy content or, more accurately, the
rate of energy transfer or power in the frequency range (ν, ν + dν). The term
power calls to mind an electromagnetic wave (think Poynting vector), but the
terminology is applied as well to any stochastic time record such as the record of
gamma coincidence counts obtained from the decay of ²²Na. Although defined by
(3.8.2), the autocorrelation function is also deducible from the inverse Fourier
transform of (3.8.3)
R_X(τ) = ∫ S_X(ν) e^{2πiντ} dν.   (3.8.4)
18
It is common notation in physics to represent a time series by a lower-case letter and its Fourier transform by the
corresponding upper-case letter. This contrasts with our previous notation, also in common usage, of representing a
random variable by an upper-case letter and its realization in a sample by the corresponding lower-case letter. It is
impossible to remain entirely consistent in all matters of notation, as one would soon exhaust the supply of familiar
symbols.
From relations (3.8.2) and (3.8.4) it follows that at zero delay the autocorrelation
R_X(0) = ⟨x(t)²⟩ = ∫ S_X(ν) dν   (3.8.5)
gives the mean square value of a time series, which is equivalent to the integrated
power spectrum. If the mean ⟨x(t)⟩ = 0, the integrated power spectrum equals the
variance σ²_X.
The pair of relations (3.8.3) and (3.8.4) are known as the Wiener-Khinchin (W-K)
theorem. Together, they provide an indispensable set of tools for investigating
correlations and periodicities that may be hidden in a noisy signal. What makes the
W-K theorem a theorem and not merely a trivial Fourier transform pair is that it
remains valid even in the case of a non-square-integrable signal x(t) for which the
Fourier transform X(ν) does not exist. We shall be working with signals, however,
that do have a Fourier transform.
An alternative way of arriving at S_X(ν) is to substitute the expressions
x(t) = ∫ X(ν) e^{2πiνt} dν,   x*(t) = ∫ X*(ν′) e^{−2πiν′t} dν′   (3.8.6)
into the ensemble average ⟨x*(t) x(t + τ)⟩, which leads to
R_X(τ) = ∫ X*(ν) X(ν) e^{2πiντ} dν = ∫ S_X(ν) e^{2πiντ} dν   (3.8.7)
and thereby to the identification
S_X(ν) = |X(ν)|².   (3.8.8)
The functions R_X(τ) and S_X(ν) are even functions of their arguments
R_X(τ) = R_X(−τ),   S_X(ν) = S_X(−ν).   (3.8.9)
The second symmetry is a consequence of the first, and, as noted in (3.6.4), the first
follows from the hypothesis of a real-valued stationary random process.
Putting the pieces together, one can analyze a stochastic time record in either of
two ways as symbolized by the chain of steps:
(A) x(t) → X(ν) → S_X(ν) → R_X(τ)
(B) x(t) → R_X(τ) → S_X(ν).
It is instructive to apply the WK relations to two examples (a) white noise and (b) a
purely harmonic process since these examples arise in our search for hidden
correlations and periodicities in nuclear decay (and other spontaneous quantum
processes). The first arises because, if our null hypothesis is true, then the disintegration of nuclei is a white-noise process. And if the null hypothesis is not true, then
the second process may possibly lie hidden in the time record of decays.
White noise refers to a stochastic process with power uniformly distributed over the
entire frequency spectrum. Alternatively, it may be regarded as the ultimate expression
of randomness whereby no two distinct points of a time-varying function are correlated, no matter how close in time they occur. The consistency of these two viewpoints
follows from the WK theorem. Consider the first (constant spectral density):

S_WN(ν) = σ²  ⇒  R_WN(τ) = σ² ∫ e^{2πiντ} dν = σ² δ(τ),   (3.8.10)

R_WN(τ) = σ² δ(τ)  ⇒  S_WN(ν) = σ² ∫ δ(τ) e^{−2πiντ} dτ = σ².   (3.8.11)
The assumption in (3.8.11) that the mean value of the noise is zero identifies the constant in (3.8.10) as the variance of the noise. We have also made use of the familiar representation of the delta function

δ(τ) = ∫_{−∞}^{∞} e^{2πiντ} dν = (1/2π) ∫_{−∞}^{∞} e^{iωτ} dω   (3.8.12)

in which the absence or presence of the factor 1/2π depends on whether integration is over frequency ν or angular frequency ω.
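The two WK routes can be checked numerically for white noise. The sketch below is my own illustration (Python/NumPy, not the book's code): it generates Gaussian white noise of unit variance and confirms that the autocorrelation is delta-like while the power spectrum is flat at the level σ².

```python
import numpy as np

# Sketch (not from the text): verify the WK picture for white noise.
rng = np.random.default_rng(1)
N = 4096
x = rng.normal(0.0, 1.0, N)        # white noise with variance sigma^2 = 1

# Route B: autocorrelation first
R = np.correlate(x, x, mode="full")[N - 1:] / N   # R_k for k = 0..N-1
print(R[0])                        # approximately sigma^2 = 1
print(np.abs(R[1:50]).max())       # near zero: no correlation at any lag > 0

# Route A: Fourier transform first
S = np.abs(np.fft.rfft(x))**2 / N  # periodogram
print(S.mean())                    # flat spectrum at the level sigma^2 = 1
```

The delta-like zero-lag spike and the flat spectrum are the two equivalent faces of the same randomness.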
The opposite of a completely random process is a perfectly deterministic one. Consider the harmonic function x(t) = A cos(2πν₀t) of constant amplitude A and frequency ν₀. Invoking the ergodic theorem, which equates ensemble and time averages, one obtains the autocorrelation

R_X(τ) = lim_{T→∞} (1/T) ∫_0^T A² cos(2πν₀t) cos(2πν₀(t + τ)) dt = (A²/2) cos(2πν₀τ)   (3.8.13)

and, from (3.8.3), the spectral density

S_X(ν) = ∫ e^{−2πiντ} (A²/2) cos(2πν₀τ) dτ = (A²/4) ∫ (e^{2πiν₀τ} + e^{−2πiν₀τ}) e^{−2πiντ} dτ = (A²/4) [δ(ν − ν₀) + δ(ν + ν₀)].   (3.8.14)
For a stationary process sampled at intervals Δt, the autocorrelation takes the discrete form

R_X(τ) = R_0 δ(τ) + Σ_{k=1}^{m} R_k [δ(τ − kΔt) + δ(τ + kΔt)]   (3.8.15)

where the delta functions restrict τ to integer multiples {k = 0, 1,...m} of the sampling time Δt. Substitution of (3.8.15) into expression (3.8.3) for S_X(ν) and integration over τ leads to the relation
S_X(ν) ∝ R_0 + 2 Σ_{k=1}^{m} R_k cos(2πνkΔt) = R_0 + 2 Σ_{k=1}^{m} R_k cos(πνk/ν_c)
∝ R_0 { 1 + 2 Σ_{k=1}^{m} r_k cos(πνk/ν_c) }   (3.8.16)

in which the definition of the cut-off (Nyquist) frequency ν_c = 1/2Δt was used in the second expression, and the definition of the discrete correlation coefficient r_k ≡ R_k/R_0 was used in the third. Because a delta function of time in (3.8.15) is a density, it has units of inverse time. The proportionality constant in (3.8.16) must then be proportional to Δt, the only temporal parameter available. If we restrict the frequency to physically
meaningful positive values only, then the proportionality constant linking the physically realizable power spectral density G_X(ν) to the elements of the autocorrelation function is 2Δt. Since most applications are concerned only with the content of the spectrum and the relative strengths of the spectral amplitudes, the value of the proportionality constant is usually of no consequence, and, unless otherwise indicated, we will regard (3.8.16) as an equality.
An alternative way¹⁹ to introduce the power spectrum that will prove useful later is to construct from a time series {y_t | t = 1...N} of zero mean the two functions

A(ν) = (1/√N) Σ_{t=1}^{N} y_t cos(ωt),
B(ν) = (1/√N) Σ_{t=1}^{N} y_t sin(ωt)   (3.8.17)

with ω = 2πνΔt, which resemble the coefficients of a Fourier series, and define the power spectrum by
S(ν) = A(ν)² + B(ν)²
= (1/N) { Σ_{t=1}^{N} y_t² + Σ_{t=1}^{N} Σ_{t'=1, t'≠t}^{N} y_t y_{t'} cos ω(t − t') }
= (1/N) { Σ_{t=1}^{N} y_t² + 2 Σ_{k=1}^{N−1} Σ_{t=1}^{N−k} y_t y_{t+k} cos ωk }
= R_0 + 2 Σ_{k=1}^{N−1} R_k cos ωk
= R_0 { 1 + 2 Σ_{k=1}^{N−1} r_k cos ωk }.   (3.8.18)
The transition from the second line to the third is made by a change in summation index t' = t + k, where k is the lag (of which the unit Δt has been absorbed in the definition of the angle ω). The final expression in (3.8.18) is identical in form to (3.8.16) except that the sum includes all possible correlation coefficients and not just those up to an arbitrarily set maximum lag m. The question of what value should be taken for the maximum lag in a particular application will be discussed later.
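The identity (3.8.18) between the two routes to S(ν), via the coefficients A(ν) and B(ν) of (3.8.17) or via the full-lag autocorrelation sum, can be verified directly. A minimal NumPy sketch (my illustration, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 256
y = rng.normal(size=N)
y -= y.mean()                      # zero-mean record, as the text requires

def power_AB(y, omega):
    """S(omega) = A^2 + B^2 with the 1/sqrt(N) normalization of (3.8.17)."""
    t = np.arange(1, len(y) + 1)
    A = (y * np.cos(omega * t)).sum() / np.sqrt(len(y))
    B = (y * np.sin(omega * t)).sum() / np.sqrt(len(y))
    return A**2 + B**2

def power_R(y, omega):
    """Same quantity via the full-lag autocorrelation sum of (3.8.18)."""
    N = len(y)
    R = np.array([(y[:N - k] * y[k:]).sum() / N for k in range(N)])
    return R[0] + 2.0 * sum(R[k] * np.cos(omega * k) for k in range(1, N))

omega = 2 * np.pi * 7 / N          # an arbitrary test frequency
print(power_AB(y, omega), power_R(y, omega))   # the two routes agree exactly
```

The agreement is exact (to machine precision) because (3.8.18) is an algebraic identity, not an approximation.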
I call attention to the fact that the normalization constant in (3.8.17) is 1/√N and not 1/N, as is usually the case in summing over elements of a statistical set (e.g. in forming a sample mean or variance). The virtue of this choice, which is a convention adopted in certain algorithms for rapid computation of Fourier series (to be elaborated later), is that it leads to the following large-sample (N ≫ 1) limits (for ω ≠ 0, π)

(1/N) Σ_{k=1}^{N} sin²(ωk) → 1/2,   (1/N) Σ_{k=1}^{N} cos²(ωk) → 1/2,
(1/N) Σ_{k=1}^{N} cos(ωk) sin(ωk) → 0,   (1/N) Σ_{k=1}^{N} cos(ωk) → 0,   (1/N) Σ_{k=1}^{N} sin(ωk) → 0   (3.8.19)

that will facilitate determining how various functions of Fourier amplitudes are distributed.

¹⁹ M. Kendall, A. Stuart, and J. K. Ord, The Advanced Theory of Statistics, Vol. 3 (Macmillan, New York, 1983) 510-511.
Although the discrete power spectrum (3.8.16) can be evaluated for any frequency up to the cut-off frequency ν_c, the set of discrete frequencies {ν_j = jν_c/m} for j = 0, 1,...m is particularly convenient²⁰ as it leads to m/2 independent spectral estimates. This follows from recognizing that points in the discrete time series R_k separated by intervals less than mΔt = m/2ν_c can be correlated. In the frequency domain, therefore, points separated by less than the reciprocal interval 2ν_c/m can be correlated. Evaluated at this special set of discrete frequencies, the spectral density becomes
S_j ∝ R_0 + 2 Σ_{k=1}^{m} R_k cos(πjk/m) ∝ R_0 { 1 + 2 Σ_{k=1}^{m} r_k cos(πjk/m) }   (3.8.20)
and can be shown to satisfy a kind of completeness relation

(1/m) { (1/2) S_0 + Σ_{j=1}^{m−1} S_j + (1/2) S_m } = R_0.   (3.8.21)

The demonstration of (3.8.21), which uses complex exponentials to achieve a remarkable reduction in what appears at first sight to be a complex function (in both senses of the word), will be left to an appendix.
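Pending that demonstration, the completeness relation (3.8.21) is easy to confirm numerically. The following sketch (an illustration of mine, not taken from the text) builds the autocorrelation of an arbitrary zero-mean record and checks that the trapezoidal average of the S_j recovers R_0:

```python
import numpy as np

# Sketch (mine): numerical check of the completeness relation (3.8.21)
rng = np.random.default_rng(3)
m, N = 16, 512
y = rng.normal(size=N)
y -= y.mean()

R = np.array([(y[:N - k] * y[k:]).sum() / N for k in range(m + 1)])

def S(j):
    k = np.arange(1, m + 1)
    return R[0] + 2.0 * (R[1:] * np.cos(np.pi * j * k / m)).sum()

total = (0.5 * S(0) + sum(S(j) for j in range(1, m)) + 0.5 * S(m)) / m
print(total, R[0])    # equal: the trapezoidal average of S_j recovers R_0
```

The relation holds for any record whatever, because the trapezoidal sum of cos(πjk/m) over j vanishes identically for every k = 1...m.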
To appreciate the utility of autocorrelation and power spectral analysis for recovering information from a noisy signal, consider a sampled signal of the form

x(t) = A cos(2πt/T_a) + B cos(2πt/T_b) + ε(t)   (3.8.22)

where ε(t) is a random variable of type U(0, 1). In other words, at each instant t at which the signal of interest is sampled, the measurement includes a randomly fluctuating component of magnitude between 0 and 1. The objective of the measurement is to confirm the existence of any periodic terms and to estimate their periods.
This would be impossible to do by looking at the observed time series in the top panel of Figure 3.7, which shows (in gray) the signal sampled at unit intervals Δt = 1 for a total recording time of N = 512 intervals. The parameters of the non-random part of the theoretical waveform (3.8.22), shown in black and displaced downward by one unit for visibility, are T_a = 10, T_b = 20, A = 0.20, B = 0.15.
The middle panel of Figure 3.7 shows the discrete autocorrelation function (black
points) of the observed signal (3.8.22) calculated directly from the defining relation
²⁰ J. S. Bendat and A. G. Piersol, Measurement and Analysis of Random Data (Wiley, New York, 1966) 292.
Fig. 3.7 Upper panel: periodic signal (black) x(t) = 0.20 cos(2πt/10) + 0.15 cos(2πt/20) displaced downward by one unit for clarity; empirical signal (gray) made noisy by superposition of U(0, 1) noise sampled at intervals of one time unit for a total of 512 time units. Middle panel: autocorrelation of the periodic signal (gray) and empirical signal (black points). Lower panel: power spectrum of the empirical signal, showing harmonics at j = 10, 20, corresponding to periods of 20 and 10 time units, derived from the autocorrelation function of maximum lag 100.
(3.6.12) up to lag m = 100, after first having been transformed to the corresponding record y(t) of zero mean. Also shown (gray curve) is the theoretical autocorrelation of the noiseless signal

R(τ) = (A²/2) cos(2πτ/T_a) + (B²/2) cos(2πτ/T_b),
r(τ) = R(τ)/R(0) = [A² cos(2πτ/T_a) + B² cos(2πτ/T_b)] / (A² + B²)   (3.8.23)

where

R(τ) = lim_{T→∞} (1/T) ∫_0^T x(t) x(t + τ) dt.   (3.8.24)
(Note that the time average of the noiseless signal is zero.) Although it is now clear
from the plot that the noisy signal contains periodic terms, their periods and amplitudes are not evident.
This information is provided by the power spectrum, shown in the bottom panel of Figure 3.7, which was calculated by (3.8.20) at the special set of discrete frequencies {ν_j = jν_c/m}. The abscissa, labeled by harmonic index j, unambiguously shows harmonics at j = 10 and 20. The period corresponding to a particular harmonic is obtained from the reciprocal relation

ν_j = jν_c/m = j/(2mΔt) = 1/T_j   ⇒   T_j = 2mΔt/j.   (3.8.25)
Thus, for maximum lag m = 100, the power spectrum correctly reveals periods of T₁₀ = 20 and T₂₀ = 10 time units with a ratio of power spectral amplitudes S₂₀/S₁₀ ≈ 2.3, close to the theoretically exact value A²/B² = (0.20/0.15)² ≈ 1.8.
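The entire analysis of this example, construction of the noisy signal (3.8.22), the discrete autocorrelation, and the power spectrum (3.8.20), can be reproduced in a few lines. The sketch below is my own illustration (NumPy, arbitrary seed), not the author's code; the two largest spectral ordinates should fall at harmonics j = 10 and j = 20:

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 512, 100
Ta, Tb, A, B = 10, 20, 0.20, 0.15
t = np.arange(1, N + 1)
x = A * np.cos(2 * np.pi * t / Ta) + B * np.cos(2 * np.pi * t / Tb) \
    + rng.uniform(0, 1, N)                         # U(0, 1) noise
y = x - x.mean()                                   # zero-mean record

# discrete autocorrelation up to maximum lag m
R = np.array([(y[:N - k] * y[k:]).sum() / N for k in range(m + 1)])
r = R / R[0]

# power spectrum (3.8.20) at the discrete harmonics j = 0..m
k = np.arange(1, m + 1)
Sj = np.array([1 + 2 * (r[1:] * np.cos(np.pi * j * k / m)).sum()
               for j in range(m + 1)])

# peaks should appear at j = 2m/T: j = 10 (T = 20) and j = 20 (T = 10)
peaks = np.argsort(Sj[1:])[-2:] + 1
print(sorted(peaks))
```

Despite a noise amplitude larger than either signal amplitude, the spectrum isolates both periods cleanly, as in Figure 3.7.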
The necessity, or at least advantage, of working with a time series y(t) of zero mean may be seen by considering the relation between the autocorrelation function R_k^(Y) and the corresponding function R_k^(X) of the stationary random process of sample mean x̄ ≠ 0:

R_k^(Y) = (1/N) Σ_{t=1}^{N−k} y_t y_{t+k} = (1/N) Σ_{t=1}^{N−k} (x_t − x̄)(x_{t+k} − x̄) ≈ (1/N) Σ_{t=1}^{N−k} x_t x_{t+k} − x̄² = R_k^(X) − x̄².   (3.8.26)
Had we performed the analysis instead with a series of significant non-zero mean, the autocorrelation coefficient as a function of delay would have been a downward sloping line weakly modulated by oscillations of low contrast. Correspondingly, the power spectrum of this autocorrelation would have been dominated by a strong peak at index j = 0, which could have distorted the power distribution at other frequencies.
The necessity of working with a time series of zero mean applies as well to an alternative approach to recovering information from a noisy signal. We could have proceeded by calculating first the Fourier spectral amplitudes {a_n, b_n | n = 0...N/2} of the (zero-mean) time record y(t) by use of a fast Fourier transform (FFT) algorithm, and then obtained the discrete autocorrelation function R_k from the inverse FFT of the power spectrum S_n = a_n² + b_n². Although stating this procedure in words may make it sound complicated and time-consuming, in practice as applied to long time series (e.g. of nuclear decay data) it has led to results in seconds that otherwise would have required hours to compute.
Perhaps the most familiar FFT algorithm is the one developed by Cooley and Tukey²¹ in 1965. The description of this and other algorithms goes beyond the intended scope of this chapter, but several points are worth noting. The Cooley-Tukey method calculates a discrete Fourier transform (DFT) of length N by performing a number of operations of order N log N, rather than the much larger N², which typifies the calculation of Fourier amplitudes directly from the defining integrals. Also, the Cooley-Tukey FFT algorithm relies on a factorization technique that requires N to be a power of 2, hence the choice N = 512 = 2⁹ in the preceding illustration. For a time record of 512 points, it matters little in terms of efficiency which method one employs. However, for a record containing more than 1 million bins of nuclear decay data, the relative efficiency of using the Cooley-Tukey FFT algorithm instead of direct evaluation of the defining sums or integrals goes as 10⁶/log 10⁶ ≈ 1.7 × 10⁵. Thus, a one-second calculation by FFT could take more than 40 hours by the direct computation.
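The FFT route to the autocorrelation function described above can be sketched as follows (my illustration, not the book's implementation). Zero-padding the record avoids the circular wrap-around of the DFT, so the inverse FFT of the power spectrum reproduces the directly computed R_k:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 2**12
y = rng.normal(size=N)
y -= y.mean()

# Direct evaluation of R_k: O(N^2) operations
R_direct = np.array([(y[:N - k] * y[k:]).sum() / N for k in range(N)])

# FFT route: power spectrum of the zero-padded record, then inverse FFT:
# O(N log N) operations.  Padding prevents circular wrap-around.
z = np.concatenate([y, np.zeros(N)])
S = np.abs(np.fft.fft(z))**2
R_fft = np.fft.ifft(S).real[:N] / N

print(np.max(np.abs(R_direct - R_fft)))   # agreement to machine precision
```

For records of a million bins, the O(N log N) route is the only practical one, exactly as the efficiency estimate above indicates.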
3.9 Spectral resolution and uncertainty
There is a connection between the duration of a time series and its spectral bandwidth
analogous to the quantum mechanical uncertainty principle governing the measurement of location and momentum of a particle. Indeed, because the mathematics of
waves describes the statistical behavior of quantum particles, the latter uncertainty
principle may in some ways be regarded as arising from the former.22 This constraint
plays a role in the sampling theorem previously described and has consequences for
calculation of autocorrelation functions and power spectra.
²¹ J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation 19 (1965) 297-301.
²² M. P. Silverman, Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement, and Interference (Springer, Heidelberg, 2008).
Consider a time series of zero mean y(t) that is nonvanishing only within the interval (0, T). If this record were repeated multiple times, it would constitute a periodic function of period T and therefore of fundamental frequency ν₀ = 1/T. A Fourier representation of this series would then take the form

y(t) = Σ_{n=−∞}^{∞} c_n e^{2πint/T} = Σ_{n=0}^{∞} a_n cos(2πnt/T) + Σ_{n=1}^{∞} b_n sin(2πnt/T)   (3.9.1)

with coefficient

c_n = (1/T) ∫_0^T y(t) e^{−2πint/T} dt,   c₀ = 0.   (3.9.2)

Note the form of (3.9.2); this coefficient is identical to T⁻¹ Y(n/T), where Y(ν) is the Fourier transform of y(t). Thus the specific set of samples {Y(n/T)} determines the set of coefficients {c_n}, which determines the function y(t), which determines the transform Y(ν) for all frequencies ν. Symbolically:
{Y(n/T)} ⇒ {c_n} ⇒ y(t) ⇒ Y(ν) for all ν.   (3.9.3)
If Y(ν) were confined to the frequency band (−B, B), then the minimal number of discrete samples of Y(ν) needed to describe y(t), a quantity referred to as the number of degrees of freedom, would be

2B/ν₀ = 2B/(1/T) = 2BT.   (3.9.4)
In like manner, a transform Y(ν) confined to the band (−B, B) may be represented by the Fourier series

Y(ν) = Σ_{n=−∞}^{∞} C_n e^{−2πinν/2B}   (3.9.5)

with coefficient

C_n = (1/2B) ∫_{−B}^{B} Y(ν) e^{2πinν/2B} dν,   C₀ = 0   (3.9.6)
that is seen to be identical to (1/2B) y(n/2B). By the same reasoning as before, therefore, it follows that the specific set of samples {y(n/2B)} determines the set of coefficients {C_n}, which determines the function Y(ν), which determines the transform y(t) for all t. Again, symbolically

{y(n/2B)} ⇒ {C_n} ⇒ Y(ν) ⇒ y(t) for all t.   (3.9.7)
If y(t) were confined to the range (0, T), then the minimal number of discrete samples of y(t) needed to describe Y(ν) leads to the same number of degrees of freedom

T/(1/2B) = 2BT.   (3.9.8)
In short, the sampling theorem links the duration of a signal, the highest frequency
in its spectrum, and the number of samples required to characterize the signal
completely. However, it is, in fact, not possible for a function of finite duration to
have a finite bandwidth. For example, a pure sine wave extends infinitely in time.
Correspondingly, a delta-function pulse, which vanishes at all times except for a
single instant, has an infinite spectral content.
In general, it can be shown by means of Parseval's theorem²³

∫_{−∞}^{∞} |Y(ν)|² dν = ∫_{−∞}^{∞} y(t)² dt   (3.9.9)

(which gives equivalent expressions for total power in terms of integration over time or over frequency) and the Schwartz inequality

∫ f(t)² dt ∫ g(t)² dt ≥ [ ∫ f(t) g(t) dt ]²   (3.9.10)

that the spectral width Δν and temporal width Δt of a signal satisfy the uncertainty relation²⁴

Δν Δt ≥ 1/4π,   (3.9.11)
²³ Although Parseval's theorem can be interpreted in terms of an integrated power, the functions in the integrand are not random variables and no ensemble or time averages are involved.
²⁴ A corresponding relation in quantum mechanics is ΔE ΔT ≥ ½ℏ, in which ΔE = 2πℏ Δν is the uncertainty of the energy of a quantum system whose duration is uncertain by ΔT. The universal constant ℏ (pronounced h-bar) is Planck's constant divided by 2π.
where

(Δν)² = ∫ ν² |Y(ν)|² dν / ∫ |Y(ν)|² dν,   (Δt)² = ∫ (t − T/2)² y(t)² dt / ∫ y(t)² dt.   (3.9.12)

Applied to band-limited white noise of constant spectral density within (−B, B), the WK theorem yields the normalized autocorrelation

R_WN(τ) = sin(2πBτ) / 2πBτ,   (3.9.14)
whose form is the so-called sinc function [sin(x)/x] commonly encountered in the analysis of optical diffraction phenomena. The function is maximum, R_WN(0) = 1, at τ = 0, drops to zero at τ = ±1/2B, and undulates thereafter with small decreasing amplitude. Points in the time record separated by a delay equal to 1/2B (and, practically speaking, for all delays greater than 1/2B) are uncorrelated. Thus, the number of degrees of freedom expressed in (3.9.8) is the number of statistically uncorrelated samples within the recorded time T of the signal. These samples are statistically independent in the special case of band-limited Gaussian white noise.
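This sinc-shaped decorrelation is easy to exhibit. The sketch below (my own construction, not from the text) synthesizes band-limited white noise by assigning random phases to a flat spectrum in (0, B), and checks that the normalized autocorrelation vanishes near τ = 1/2B and takes the sinc value 2/π ≈ 0.64 at τ = 1/4B:

```python
import numpy as np

# Sketch (mine): band-limited white noise decorrelates on the scale 1/2B
rng = np.random.default_rng(6)
N, B = 8192, 0.05                  # band limit B in cycles per time unit

freqs = np.fft.rfftfreq(N)
amp = ((freqs > 0) & (freqs < B)).astype(float)   # flat spectrum in (0, B)
phase = rng.uniform(0, 2 * np.pi, len(freqs))
y = np.fft.irfft(amp * np.exp(1j * phase), N)
y -= y.mean()

R = np.array([(y[:N - k] * y[k:]).sum() for k in range(60)])
r = R / R[0]
print(r[10])   # tau = 1/2B = 10: first zero of the sinc
print(r[5])    # tau = 1/4B = 5: sinc value 2/pi ~ 0.64
```

Samples drawn at intervals of 1/2B are therefore effectively uncorrelated, which is the content of the degrees-of-freedom count 2BT.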
The matter of correlation and independence is important and somewhat subtle. If two random variables are independent, by which is meant that their joint probability density factors p(x, y) = p(x) p(y), then they are uncorrelated according to the general defining relation (the Pearson correlation coefficient)

ρ_{X,Y} = ⟨(X − μ_X)(Y − μ_Y)⟩ / (σ_X σ_Y)  →  0   (X, Y independent).
The converse is not always true: if X and Y are uncorrelated, by which is meant that ρ_{X,Y} = 0, they are not necessarily independent. Consider the two functions X(t) = sin t and Y(t) = sin²(t). The latter is completely dependent on the former, but
Fig. 3.8 Power spectral density S_X(ν) (Eq. (3.9.15)) of the periodic signal in Figure 3.7 as a function of ν with step size 0.1 for maximum lag m = 100 (upper panel) and m = 50 (lower panel). Peaks occur at harmonics j_i = 2m/T_i (i = 1, 2), where T₁ = 10 and T₂ = 20 time units. Peak width, measured between zero crossings flanking the central maximum, is precisely Δj = 2.
ρ_{X,Y} = 0 because of the respective odd and even symmetries of the two functions. Zero correlation implies independence, however, in the case of two jointly normal variables.
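A one-line numerical check of this example (mine, not the book's):

```python
import numpy as np

t = np.linspace(-np.pi, np.pi, 10001)
X = np.sin(t)
Y = np.sin(t)**2            # completely determined by X, yet uncorrelated

rho = np.corrcoef(X, Y)[0, 1]
print(rho)                  # essentially zero: odd versus even symmetry
```

The covariance integrand sin t (sin²t − ½) is odd, so it cancels exactly over the symmetric interval, even though Y is a deterministic function of X.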
The number of degrees of freedom has implications for the maximum lag m at which to evaluate the autocorrelation of a time series. Recall that m is also the number of correlation functions R_k or coefficients r_k contributing to the power spectrum (3.8.20). Figure 3.8 shows the normalized power spectral density

S_X(j) = (1/m) { 1 + 2 Σ_{k=1}^{m} r_k cos(πjk/m) }   (3.9.15)
of the noiseless part of x(t) in the previous example (3.8.22), calculated for maximum lags of m = 50 and m = 100 time units and signal duration of N = 512 time units. To enhance visibility, the computation was performed at more points than just the special set of discrete harmonics ν_j previously introduced. The larger the value of m, the greater is the separation of the peaks. Thus, higher m affords higher resolution.
As a heuristic explanation of this feature, consider that the autocorrelation function {R_k | k = 0, 1...m} constitutes a time record of length mΔt and therefore an effective bandwidth B_e ≈ 1/mΔt. The higher m is, the narrower is B_e.
There is a downside, however, to making the maximum lag too large. For a bandwidth B_e ≈ 1/mΔt and record length T = NΔt, the number of degrees of freedom (3.9.8) becomes ν = 2B_eT = 2N/m. The expression (3.6.13) for the sample autocorrelation was derived under the condition N ≫ m. As m increases for fixed N, ν decreases, and the sample estimate of the correlation function as a whole may be poor (although estimates of individual points may remain good).
One also runs up against the uncertainty principle (3.9.11). The variance in the measurement of S(ν) within any narrow range about some specific frequency ν₀ is inversely proportional to ν = 2N/m. I will discuss shortly with greater rigor the statistical distribution of the power spectrum and other random variables arising from Fourier analysis of the time series of nuclear decays. For now, however, the uncertainty principle affords a complementary way to understand why fluctuations in the measurement of S(ν) become greater with increasing m. The maximum time lag mΔt corresponds to an effective frequency bandwidth B_e ≈ (mΔt)⁻¹. As m is increased for fixed total record length NΔt, the effective bandwidth decreases, and one eventually violates the uncertainty principle

Δν ΔT = (1/mΔt)(NΔt) = N/m ≥ 1/4π.   (3.9.16)
Five independent trials gave the following values of the power G_j^m at the harmonics j = 10, 20 (maximum lag m = 100) and j = 2, 4 (m = 20):

Trial   G_10^100   G_20^100   G_2^20   G_4^20
1       1.04       0.87       1.37     1.06
2       0.62       1.61       0.86     1.20
3       2.00       1.80       0.66     0.56
4       0.25       0.40       0.55     1.04
5       1.39       0.88       1.05     1.03
Calculation of the mean and standard error (SE) of the five measurements for each choice of j and m confirms empirically, upon comparing the ratio SE/Mean of G_10^100 with G_2^20 and of G_20^100 with G_4^20, the greater degree of uncertainty in the power spectral amplitudes obtained from autocorrelation functions of higher maximum lag.
                 G_10^100   G_20^100   G_2^20   G_4^20
Mean             1.060      1.112      0.898    0.978
Standard error   0.304      0.259      0.146    0.109
Ratio (SE/mean)  28.6%      23.3%      16.2%    11.1%
Thus the choice of maximum lag requires a compromise to achieve both good
spectral resolution and statistical reliability.
If, however, harmonics of amplitudes significantly above the noise level are actually present in the time series, the choice of maximum lag is less influential.
A repetition of the preceding experiment when the original amplitudes A 0.20,
B 0.15 were retained, led to the following outcome.
                 G_10^100   G_20^100   G_2^20   G_4^20
Mean             10.28      16.84      3.36     3.81
Standard error   0.49       1.03       0.12     0.30
Ratio (SE/mean)  4.8%       6.1%       3.7%     6.8%
This shows relatively little difference in the ratio of standard error to the mean for
corresponding spectral peaks.
In the search for periodicities in a time series of nuclear decays, no such harmonics
are expected to be present.
The application of autocorrelation and power spectral analysis to search for hidden correlations and periodicities introduces other random variables and their associated statistical distributions. These non-elementary statistics arise in asking how the Fourier amplitudes (real part, imaginary part, modulus, phase) of the time series and the elements of the correlation function, correlation coefficient, and power spectrum are distributed. Since these are random variables, different ensembles (the bags of data) will almost certainly produce different amplitudes for the same harmonics and different correlation coefficients for the same lag values.
The fact that a Poisson distribution may provide a good description of a time record of nuclear disintegrations, which one expects to be the case in the absence of external forces, is no guarantee that the data will fit well the other statistical distributions predictable on the basis of the null hypothesis. Recall that the null hypothesis is that the probability of a single nuclear decay in a short sampling interval is proportional to that interval and independent of outcomes in previous or subsequent time intervals. Therefore, a test of these distributions constituted the next step in looking for evidence of non-random behavior that violated physical laws.
The time record {x_t | t = 1...N} of disintegrations of radioactive ²²Na, with stationary mean μ_X, was transformed to a record {y_t | t = 1...N} of zero mean and zero trend in preparation for calculation of the autocorrelation function and power spectrum. Under the condition μ_X ≫ 1, which pertained in these experiments, the Poisson distribution characterizing the time record {y_t} is very closely approximated by the corresponding Gaussian distribution N(0, μ_X). From the null hypothesis and Gaussian approximation there then follow all the statistical distributions summarized in Table 3.1.
The technical details of the derivations of these distributions, which make use of generating functions as developed in Chapter 1 as well as relations concerning products and quotients of random variables to be discussed in subsequent chapters, will be left to an appendix. Of significance now is the fact that each of the distributions, which test different facets of the time record and Fourier amplitudes of the decaying nuclei, is determined exclusively by a single empirical parameter: the mean count per bin μ_X of the original time record, fixed at the outset of the experiment. The means and variances of these distributions, which give perspective to measurements that will be discussed shortly, are summarized in Table 3.2.
Figure 3.9 shows histograms of four of the statistical quantities in Table 3.1 [real part of amplitude, spectral power, modulus, and phase (defined by the ratio of imaginary to real parts of the amplitude)] with corresponding theoretical densities superposed. It is to be emphasized that the histograms of the figure, which display virtually no discernible deviations from theory at the scale of viewing, are not computer simulations, but the actual experimentally derived frequencies. Bear in mind in examining the figure that, apart from μ_X, there are no adjustable parameters. The excellent agreement sensed by the eye is substantiated by analysis, as shown by the results of chi-square tests summarized in Table 3.3.
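The origin of this family of distributions can be illustrated by simulation (a sketch of mine, not the experiment): for a Poisson record of mean μ ≫ 1, the real and imaginary Fourier amplitudes come out Gaussian with variance μ/2, the power exponential with mean (and standard deviation) μ, and the amplitude ratio Cauchy:

```python
import numpy as np

# Sketch (mine): the Table 3.1 pattern emerges from any Poisson record
rng = np.random.default_rng(11)
N, mu = 2**16, 100.0
y = rng.poisson(mu, N) - mu                  # zero-mean record, mu >> 1

amps = np.fft.rfft(y)[1:-1] / np.sqrt(N)     # drop DC and Nyquist bins
a, b = amps.real, amps.imag

print(a.var())                               # ~ mu/2: Gaussian amplitudes
power = a**2 + b**2
print(power.mean(), power.std())             # both ~ mu: exponential power
ratio = b / a
print(np.median(np.abs(ratio)))              # ~ 1: Cauchy amplitude ratio
```

Everything is fixed by the single parameter μ, just as the text emphasizes for the experimental histograms of Figure 3.9.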
Table 3.1

Statistic                                  Distribution       Symbol                          Probability density*
Counts                                     Poisson ≈ Normal   X = Poi(μ_X) ≈ N(μ_X, μ_X)      f_P(x; μ) = e^{−μ} μ^x / x!
Amplitude (real or imaginary)              Normal             {a, b} = N(0, ½μ_X)             f_N(x; μ, σ²) = (2πσ²)^{−1/2} e^{−(x−μ)²/2σ²}
Squared amplitude                          Gamma              {|a|², |b|²} = Gam(½, 1/μ_X)    f_G(x; r, s) = s^r x^{r−1} e^{−sx} / Γ(r)
Power                                      Exponential                                        f_E(x; θ) = θ^{−1} e^{−x/θ}
Modulus                                    Rayleigh                                           f_R(x; σ) = (2x/σ) e^{−x²/σ}
Amplitude ratio                            Cauchy                                             f_C(x; r, s) = {πs [1 + ((x−r)/s)²]}^{−1}
Autocorrelation function and coefficient   Normal                                             f_N(x; μ, σ²) = (2πσ²)^{−1/2} e^{−(x−μ)²/2σ²}
Power via WK theorem                       Exponential                                        f_E(x; θ) = θ^{−1} e^{−x/θ}

* The order of parameters in the density functions is the same as in the symbols identifying the types of random variables.
Table 3.2

Distribution   Parameters         Mean             Variance         MGF or CF
Poi(μ)         μ = μ_X            μ                μ                exp[μ(e^t − 1)]
N(μ, σ²)       μ = 0, σ² = ½μ_X   μ                σ²               e^{μt + σ²t²/2}
Gam(r, s)      r = ½, s = 1/μ_X   r/s              r/s²             (1 − t/s)^{−r}
E(θ)           θ = μ_X            θ                θ²               (1 − θt)^{−1}
Ray(σ)         σ = μ_X            ½√(πσ)           σ(1 − π/4)
Cau(r, s)      r = 0, s = 1       does not exist   does not exist   e^{irt − s|t|}
To this point, therefore, there is nothing in the statistics of the decay of 22Na that
would suggest a deviation from the prevailing theory (the null hypothesis). However,
it is possible that a periodic component of weak amplitude could remain undetected
within the histograms of Figure 3.9. Let us examine more closely, therefore, the
matter of recurrence, autocorrelation and periodicity.
3.11 Recurrence, autocorrelation, and periodicity
Recall that a histogram is a graphical representation of a multinomial distribution M({n_k}, {p_k}) ≡ M(n, p) of outcomes k = 1...K with frequencies {n_k} and probabilities {p_k} governed by the (discrete) probability function
Table 3.3

Distribution               χ²_obs   d (degrees of freedom)   P
Real part                  45.6     45                       0.45
Imaginary part             50.7     40                       0.12
Square of real part        13.3     13                       0.43
Square of imaginary part   28.3     26                       0.34
Power                      38.8     40                       0.48
Modulus                    44.7     40                       0.28
Ratio: imaginary/real      46.9     40                       0.21
Fig. 3.9 Empirical (bars) and theoretically predicted (solid) distributions of Fourier amplitudes {γ_j = α_j + iβ_j} of the ²²Na decay time series: (a) Gaussian distribution of real part {α_j}; (b) exponential distribution of power spectral density {α_j² + β_j²}; (c) Rayleigh distribution of modulus (α_j² + β_j²)^{1/2}; (d) Cauchy distribution of amplitude ratio {β_j/α_j}.
f_M(n; p) = n! Π_{k=1}^{K} (p_k^{n_k} / n_k!)   (3.11.1)

subject to the constraint Σ_{k=1}^{K} n_k = n.
p_k = f_P(x_k; μ) = e^{−μ} μ^{x_k} / x_k!  ≈  (2πμ)^{−1/2} e^{−(x_k − μ)²/2μ}   (x_k ≫ 1)   (3.11.2)

for the decay of x_k nuclei. The mean frequency and variance of the kth class are respectively

⟨n_k⟩ = n p_k   (3.11.3)

var(n_k) = n p_k (1 − p_k).   (3.11.4)
Fig. 3.10 Top panel: autocorrelation r_k/σ_r as a function of lag k (671 ≥ k ≥ 0) of the ²²Na decay time series comprising N = 2¹⁸ bins with lag interval Δτ_k = 512 bins ≈ 224.77 s. Lower panel: distribution of r_k fit by a Gaussian density (solid) N(0, σ_X²/N).
The middle panel, a double-log plot of power against harmonic number, displays a triangular wedge of points whose upper surface is more or less flat with zero slope. I will explain the significance of this plot momentarily. The bottom panel of Figure 3.11 plots the imaginary part of the Fourier amplitude against the real part. The most striking feature of this plot is the isotropic distribution of points with nearly uniform density, except for the foamy periphery, again indicative of significant fluctuations. The three plots were constructed from the FFT amplitudes of a trend-adjusted time series of N = 10⁵ bins of gamma coincidence counts from decaying ²²Na nuclei, with bin interval of 4.39 s. In keeping with the Shannon sampling theorem, the FFT amplitudes and derived power spectrum comprise N/2 = 5 × 10⁴ harmonics.
Look carefully at the upper panel of Figure 3.11, in particular at the flecks of foam, which represent numerous statistical outliers beyond the mean. How is one to know whether any of these points actually represents a periodic component of the nuclear decay or whether all are just noise? If the null hypothesis is valid, the ordinates {S_j | j = 1...N/2} of the power spectrum of {y_t} should be distributed exponentially (see Table 3.1) with a standard deviation equal to the mean: σ_S = ⟨S⟩ = σ_X² = μ_X. In other words, as pointed out earlier, the fluctuations are of comparable size to the signal.
Fig. 3.11 Three perspectives in the display of the discrete Fourier transform (FFT) of the ²²Na decay time series. Top panel: power spectral density against harmonic number; middle panel: double-log plot of power against harmonic number; bottom panel: imaginary part against real part of the complex Fourier amplitude. Plots comprise J = 2¹⁵ harmonics obtained from the first 2¹⁶ bins of a time series of length 10⁵ bins with bin interval Δt = 4.39 s. The frequency corresponding to harmonic j is ν_j = j/(JΔt).
²⁵ M. G. Kendall, A. Stuart, and J. K. Ord, The Advanced Theory of Statistics, Vol. 3: Design and Analysis and Time-Series (Macmillan, New York, 1983) 589-590.
²⁶ G. T. Walker, "Correlation in seasonal variation in weather. III: On the criterion for the reality of relationships or periodicities," Indian Meteorological Department (Simla) Memoirs 21 (1914) 22.
Now consider again the pattern shown in the middle panel of Figure 3.11. The rationale for constructing the plots in the top and bottom panels is probably clear, but the reason for the double-log plot may perhaps be less evident. This kind of plot, however, reveals very useful information about the underlying stochastic process. It is often the case that the power spectral density of a random process can be represented by a power law

S(ν) ∝ |ν|^(−α)   (ν₂ ≥ |ν| ≥ ν₁),   S(ν) = 0 otherwise   (3.11.7)

within some limited range of frequencies. It then follows that the slope of the double-log plot is a constant: d ln S / d ln ν = −α. The exponent α defines the type of stochastic process and provides a quantitative measure by which to gauge the degree of predictability of future outcomes. It may seem at first like an oxymoron that the outcomes of a random process are predictable, but there are, in fact, different degrees of randomness. We will look into this question later when we consider randomness in the stock market. For the present, suffice it to say that white noise, defined by α = 0, is the most random process in the universe. It contains no information at all useful for prediction.
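The exponent α can be estimated directly as minus the slope of the double-log plot. The sketch below is my illustration (the synthesis trick of shaping a white spectrum by ν^(−α/2) is a standard device, not taken from the text); it generates noise with S(ν) ∝ ν^(−1) and recovers α ≈ 1:

```python
import numpy as np

# Sketch (mine): synthesize noise with S(nu) ~ nu^(-alpha), recover alpha
rng = np.random.default_rng(8)
N, alpha = 2**16, 1.0                      # alpha = 1: flicker ("pink") noise

freqs = np.fft.rfftfreq(N)
shape = np.zeros_like(freqs)
shape[1:] = freqs[1:]**(-alpha / 2)        # amplitude ~ nu^(-alpha/2)
white = rng.normal(size=len(freqs)) + 1j * rng.normal(size=len(freqs))
y = np.fft.irfft(shape * white, N)

# estimate alpha as minus the slope of log S versus log nu
S = np.abs(np.fft.rfft(y))**2
slope = np.polyfit(np.log(freqs[1:]), np.log(S[1:]), 1)[0]
print(-slope)                              # close to alpha = 1
```

A flat fitted slope (α ≈ 0), as in the middle panel of Figure 3.11, is precisely the signature of white noise.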
In summary, neither the power spectrum nor the autocorrelation spectrum gave evidence of a statistically significant component of period T = 83.5 hours in the time series of coincident counts arising from the decay of ²²Na. For all practical purposes, the analyses so far have shown the nuclear decay of radioactive sodium to be equivalent to white noise.
²⁷ R. A. Fisher, "Tests of significance in harmonic analysis," Proceedings of the Royal Society of London A 125 (1929) 54-59.
radioactive nuclide by simulating the decay time series with a Poisson RNG of time-varying mean

λ_X(t) = λ_X0 [1 + α cos(2πt/T₀)]   (3.12.1)

and decreasing the amplitude α of the harmonic until its presence is no longer discernible in either the autocorrelation or power spectrum. The information obtainable from such a simulation depends on whether T₀ is less than or greater than the duration T of the time series. Let us examine each in turn.
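A minimal version of such a simulation (my own sketch, with arbitrary parameters μ₀ = 100 counts per bin, T₀ = 256 bins, and modulation amplitude α = 2%, not the values used in the book's study) shows how a weak harmonic in the Poisson mean stands out against the exponential background of the power spectrum:

```python
import numpy as np

# Sketch of the simulation idea (my parameters, not the book's):
rng = np.random.default_rng(9)
N, mu0, T0, alpha = 2**14, 100.0, 256.0, 0.02    # 2% harmonic modulation

t = np.arange(N)
mu_t = mu0 * (1 + alpha * np.cos(2 * np.pi * t / T0))
x = rng.poisson(mu_t)             # simulated counts per bin
y = x - x.mean()

S = np.abs(np.fft.rfft(y))**2 / N
j0 = round(N / T0)                # harmonic index of the injected period
noise_level = np.median(S)        # exponential background level
print(S[j0] / noise_level)        # the injected line towers above the noise
```

Shrinking α toward zero in such a sketch reproduces the threshold behavior described below, where the spectral line gradually merges with the noise.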
Figure 3.12 shows the progressive change in the power spectrum (right panels)
and autocorrelation (left panels) as the harmonic amplitude takes on the
Fig. 3.12 Autocorrelation r_k/σ_r and power spectrum S_j for a Poisson RNG simulated time series with periodic mean of harmonic amplitude α = 0.0% (panels a, b); 0.3% (panels c, d); 0.5% (panels e, f).
Fig. 3.13 Autocorrelation function of (a) x_t = cos(2πt/(N/10)), (b) x_t = cos(2πt/25N), (c) x_t = t (exact calculation of r_k), (d) x_t = t (linear approximation to r_k), for N = 2⁹ and t = 0...N−1.
sequential values α = 0 (Figures 3.12a,b), 0.003 (Figures 3.12c,d), and 0.005 (Figures 3.12e,f) for a period T₀ less than the duration T of the time series. The top two panels are indicative of white noise. In the bottom two panels, the periodic waveform in the autocorrelation and the delta-function-like spike in the power spectrum are so strong as to be practically blinding, even for so weak a relative amplitude as 0.5%. The middle two panels display results at an approximate threshold value α = 0.3%, at which the power ordinate S_max just passes the WF test, signifying no departure from statistical control, and the harmonic variation in r_k merges with the noise. Thus, if a harmonic component with amplitude α > 0.3% were present in the time series {y_t}, it would have been revealed by statistical analysis even though visual inspection of the sequence of 167 histograms would show no statistically significant recurrences.
A time series of duration T does not permit one to measure a period T0 > T.
Nevertheless, the data may contain sufficient information to reveal the possible
presence of a harmonic component even if its period could not be measured. One
can see why immediately from the example illustrated in Figure 3.13. Plot 3.13a
shows the autocorrelation of the periodic time series xt cos(2t/T1) (t 0. . .N 1)
with short period T1 N/10 for N 29 512 time units. Contrast this plot with that
of plot 3.13b, which shows the autocorrelation of the same function with long period
T2 25N. The former oscillates with decreasing amplitude many times over the range
of lag values, whereas the latter diminishes virtually linearly over the same range.
Now consider the discrete linear time series xt at b over the same range, where a
and b are constants. It is not difficult (although a little tedious) to show that the
163
There is actually a name apophenia given to this state of mind. See Apophenia, http://en.wikipedia.org/wiki/
Apophenia.
M. P. Silverman, A Universe of Atoms, An Atom in the Universe (Springer, New York, 2002) 285294.
164
20
Original xt
15
10
5
0
-5
-10
0.000
0.005
0.010
0.015
0.020
20
Power
15
Transformed yt
10
5
0
-5
-10
0.000
0.005
0.010
0.015
0.020
20
15
Simulation
10
5
0
-5
-10
0.000
0.005
0.010
0.015
0.020
Relative Frequency
Fig. 3.14 Power spectrum of autocorrelation of experimental gamma coincidence time series
{xt} unadjusted for negative trend due to natural lifetime (top panel); experimental series {yt}
adjusted for zero-trend (middle panel); time series simulated by a Poisson RNG for 1/4-cycle
variation in mean with harmonic amplitude 2.0% (bottom panel). The lag interval k
83 bins ~ 36.44 s. A relative frequency 1.0 corresponds to (k)1.
nuclear decay processes, such as the alpha decay of 214Po, beta decay of 137Cs, and
electron-capture decay of 54Mn, for evidence of non-random behavior.30,31 No
statistically significant evidence was found.
A run is an unbroken sequence of outcomes of the same kind usually of binary
alternatives like (1, 0), or (head H, tail T) whose length is the number of elements
defining the run. There are different ways to count runs, depending on what one
30
31
M. P. Silverman, W. Strange, C. R. Silverman, and T. C. Lipscombe, Tests of alpha-, beta-, and electron capture
decays for randomness, Physics Letters A 262 (1999), 265273.
M. P. Silverman, W. Strange, C. R. Silverman, and T. C. Lipscombe, Tests of spontaneous quantum decay, Physical
Review A 61 (2000) 042106 110.
165
regards as the unbroken sequence. For example, one might count the run of 1s in the
sequence 01110 in one or more of the following ways:
(a)
(b)
(c)
(d)
1 run of length 3,
3 runs of length 1,
2 runs of length 2 (where the middle 1 contributes to both runs), or
1 run of length 1 and 1 run of length 2.
Success [1]
Failure [0]
1
2
3
4
2
0
2
0
2
1
0
1
A stochastic process ideally suited for runs analysis is a coin toss, which, if performed
with an unbiased coin, is a realization of a Bernoulli process. Each trial is independent of the others, and the probability p of a success say head H remains constant
for all trials. Most people think they know what a random sequence of coin tosses
should look like, but they are probably mistaken. As an exercise32 to see whether this
32
M. P. Silverman, W. Strange, C. R. Silverman, and T. C. Lipscombe, On the Run: Unexpected Outcomes of Random
Events, The Physics Teacher 37 (1999) 218225.
166
was the case with college students, I divided the students in various physics classes of
mine over the years into two groups and assigned one group the task of tossing a coin
256 times and writing down in sequence the outcome of each toss; the other group
was told to write down what they imagined a typical sequence of 256 tosses to be, but
not actually to do the tossing. The students would then turn in their papers without
indicating to which group they belonged and I told them that I could predict with a
success of about 90% or higher which sets of data were obtained experimentally and
which imaginatively.
The key to this apparent feat of clairvoyance is the inherent disbelief of those
unschooled in the properties of randomness that long runs of heads or tails can occur
in a truly random sequence of coin tosses. This disbelief underlies many a false
gambling strategy whereby the gambler, having lost n times in a row in some game
equivalent to a Bernoulli process feels certain that his luck will turn since the event of
yet another loss must be highly improbable. But that is not so if the events are
independent. The probability of a loss on the (n 1)th trial is the same as it was
on each of the previous n trials. Analogous reasoning applies to runs in a Bernoulli
sequence. The probability of a sequence of nH is the same as the probability of any
specified ordering of n binary alternatives that is, (1/2)n provided the coin is
unbiased ( p 1/2). Thus, the trick to distinguishing experiment from imagination is to look for long runs.
A person unaware that a random sequence of coin tosses must include long runs
will almost invariably write a sequence of 256 outcomes like the following
HTHHTTHHHTTHTHH, etc. with a lot of reversals and therefore too many
short runs. A simple binomial argument, however, leads in the case of an unbiased
coin to the approximate expectation
rkH rkT
n
2k2
3:13:1
for the mean number of runs of heads or of tails of exactly length k out of n trials, and
RkH RkT
n
k1
3:13:2
for the corresponding mean number of runs of length k or longer out of n trials, in the
limit of large n k. To arrive at (3.13.1), one multiplies the number of trials n by the
probability 12k2 of getting a T to start the run, k Hs in a row, and a final T to end
the run. The formula (3.13.2) then follows by summing rkH over all values of run
length k
n
n
X
X
1
n
1
n
rjH n
k1 1 nk1 ! k1 :
3:13:3
j
nk
2
2
2
jk
jk 2
Thus, I would expect to find at least two runs of 6H or four runs of 5H in a sequence
of 256 coin tosses, with the same statistics for tails. Equations (3.13.1) and (3.13.2)
167
are approximate in that runs occurring at the start or closure of a sequence have been
ignored. For a long sequence, the contribution of these end runs becomes
negligible.
The exact theory of Bernoulli runs is a more complicated exercise in combinatorial
reasoning than can be taken up here, and so derivations of the theory outlined below
will be left to the literature.33,34,35,36 Given a random sequence of na events of type a
[successes] and nb events of type b [failures], with a total sample size n na nb, the
mean number of success runs of length precisely k (where k 1) is
rak
na !nb nb 1n k 1!
;
na k!n!
3:13:4
na !nb 1n k!
;
na k!n!
3:13:5
n 2na nb
:
n
3:13:6
The theory of runs usually proceeds by determining (3.13.5) first and then (3.13.4)
through the relation
r ck Rck Rck1
c a, b:
3:13:7
Using the exact relation (3.13.5), I would expect a mean of 1.90 success runs of length
6 or longer and 3.9 success runs of length 5 or longer in a sequence of 256 trials
which is close to the previous expectations obtained from the approximate relations
(3.13.2).
For long sequences n k and approximate equality na nb, the variances of the
preceding expectation values are closely approximated by the following expressions
2 r ak 3r ak
2 Rak Rak
2 Ra
n1
:
4
3:13:8
Exact expressions for the variances are quite complicated and will not be needed nor
given here.
The exact probability Pn, k Pr Rak 1jn of occurrence of at least one success
run of length k or longer in a Bernoulli sequence of length n, where p is the
33
34
35
36
A. Wald and J. Wolfowitz, On a test whether two samples are from the same population, The Annals of Mathematical
Statistics 11 (1940) 147162.
A. M. Mood, The distribution theory of runs, The Annals of Mathematical Statistics 11 (1940) 367392.
H. Levene and J. Wolfowitz, Covariance matrix of runs up and down, The Annals of Mathematical Statistics 15 (1944)
5869.
A. Hald, Statistical Theory with Engineering Applications (Wiley, New York, 1959) 338373.
168
n, k
n
k1
X
1j
j0
n jk
1 pj pjk
j
3:13:9
h i
n
where the upper limit k1
in the sum signifies the greatest integer less than or equal
to n/(k 1). This formula becomes impractically difficult for large n, however,
because of the factorial products that make up the binomial coefficients. An alternative approach is to employ directly the generating function from which (3.13.9) was
derived
Gk s
X
1 pk sk
1 Pn, k sn
1 s 1 ppk sk1 n0
3:13:10
3:13:11
where V(s) is an mth order polynomial such that the numerator and denominator do
not have a common root. Recall that a partial fraction expansion converts a product
of factors into a sum of terms
W s
an sn an1 sn1 a0
w1
w2
wm
, 3:13:12
s s1 s s2 s sm s s1 s s2
s sm
X
X
U sj =V 0 sj
Gs
n sn ,
3:13:13
s
s
j
j1
n1
37
169
where V 0 (s)dV/ds and {sj j 1. . .m} are the roots of V(s). A series expansion in s of
(3.13.13), such as illustrated in (3.13.10) for a particular generator, is then readily
made, provided one has been able to find the solutions sj of the equation V(s)0.
Actually, for large expansion order n i.e. under the very circumstances for which
exact methods may become impractical it is not necessary to find all the roots, but
only the least positive one sL, in which case the sum over j in (3.13.13) can be
approximated by a single term
Gs
U sL =V 0 sL
s sL
3:13:14
UsL =V 0 sL
:
sn1
L
3:13:15
Consider again the coin-toss task I posed my students. For p 1/2 and run length
k 6 the generator (3.13.10) reduces to
Gs
1 6
1 64
s
U s
,
1 7
V s 1 s 128
s
3:13:16
and one finds (by using the RootOf function of Maple) that
sL 1:008 276 516 723 31
1 7
s 0. The expansion coefficient
is the least positive root to the equation 1 s 128
(3.13.15) then takes the form
n 1 Pn, 6
1 6
1 64
sL
,
7 6
1 128
sL
sn1
L
3:13:17
which leads to probability P256,6 87.45% as before. Equation (3.13.17) allows one
to determine nearly effortlessly the variation in Pn,6 with increasing number of
trials n. Thus, for n 1024 tosses, the probability of finding at least one run
of successes of length 6 or longer is P1024,6 99.98%.
For very large n and run lengths k ~ 8, the distribution of Rak can be approximated by a Poisson distribution, since, by (3.13.8), the variance of Rak is approximately equal to the mean. Thus, if
p n, k m
em mm
m!
3:13:18
3:13:19
170
38
2n 1
3
3:13:22
H. Levene and J. Wolfowitz, The Covariance Matrix of Runs Up and Down, The Annals of Mathematical Statistics 15
(1944) 5869.
171
with variances
"
r n, k 2n
2
2 2k5 15k4 41k3 55k2 48k 26
k 3! 2
22k2 9k 12
2k 32k 5k 3!k 1!
#
44k3 18k2 23k 7
2
k2 3k 1
3:13:23
2k 5!
k 3!
2k 1k!2
"
2k 12k2 4k 1
2
2
2
Rn, k 2n
k 2! 2
2k 1k!2 2k 3k 2!k!
#
4k 1
k1
3:13:24
2k 3! k 2!
2 R
16n 29
:
90
3:13:25
Equation (3.13.25) is exact, whereas Eqs. (3.13.23) and (3.13.24) retain only terms
proportional to n in what otherwise are very long expressions. Under the condition
n k pertinent to the experiments discussed in this chapter, the omitted terms are
insignificantly small. The exact distribution of the cumulative up/down runs is not
easily derived or expressed. To good approximation, however, the total number of
runs R is normally distributed for a sample length n > 20.39
Although the derivations of the preceding relations must be left to the references,
it is again possible, as in the case of partition runs (e.g. runs with respect to the
median) to deduce the n-dependent terms in (3.13.20) and (3.13.21) which are by far
the major contributions by a simple combinatorial argument. Under the same
condition as before (n k), the mean number of difference runs (both up and down)
of length k or greater in a sequence of n random numbers takes the form
Rn, k 2nPrD kjn, in which Pr(D kjn) is the probability of k positive differences D in a sequence of k 1 ascending numbers in a set of k 2 numbers. The set
cannot begin with the lowest number since there would then be k 2 ascending
numbers and k 1 positive differences. Thus, of the (k 2)! ways to order k 2
numbers, only k 1 of the orderings lead to runs up of length k. Hence, Pr(D kjn)
(k 1)/(k 2)!, from which follows Rn, k 2nk 1=k 2!, which approximates
(3.13.21). Application
of the identity (3.13.7) then yields straightforwardly
r n, k 2n k2 3k 1 =k 3!, which approximates (3.13.20).
39
A. Hald, Statistical Theory with Engineering Applications (Wiley, New York, 1952) 354.
172
Table 3.4
Run length k
Exact var(Rn,k)
2
3
4
5
6
7
8
9
10
2047.6
546.0
113.7
19.5
2.8
0.36
0.041
0.0041
0.000 38
160.9
271.9
83.9
16.2
2.5
0.32
0.036
0.0037
0.000 34
Although perhaps not immediately apparent, the variance (3.13.24) can be shown
to approach in value the cumulative mean (3.13.21) for long runs (k 5) of long
sequences (n 1). To see this, factor (k 1)/(k 2)! from the terms in the square
bracket; only the last term, equal to 1, survives in the limit of large k. Thus, we can
again estimate the probability for cumulative up and down runs by a Poisson
approximation
PrRn, k 1 1 eRn, k
k e5
3:13:26
as was done previously [(3.13.19)] for target runs. An illustration of this equivalence
is shown in Table 3.4 for a sequence of length n 8192, corresponding to the number
of bins in 1 bag of data. The unshaded portions of the table include values of run
length for which the Poisson approximation is good.
Statistical analyses based on runs were intended originally for examining the
quality control of processes for example, the manufacture of mechanical parts
leading to outcomes describable by continuous, rather than discrete, random
variables. If the observed fluctuations of a specified measurement were predictable by probability theory (i.e. without an assignable cause), then the process
was said to be under statistical control. Mathematical treatments of difference
runs were generally based on the assumption that occurrence of identical adjacent elements in a time series was sufficiently improbable to be disregarded, but
this is not necessarily the case for a series of integers. The mean number of
occurrences of adjacent identical integers 0,1,2. . .n 1 uniformly distributed
over a range n in a sample of length N is N/n, a result that may well be
statistically significant.
In the case of a Poisson process of mean , the probability Pr(kkj) of two identical
adjacent outcomes is
173
Table 3.5
Length k
Rn, k
1
2
3
4
5
6
7
8
917 129
343 923
91 713
19 107
3275
478
60.7
6.82
22
Na coincidence counts
Exp
theory
Rn, k
experiment
916 214
344 158
92 060
19 263
3383
526
81
8
Prkkj
pkj2
k0
2
I 0 2,
X
e
k0
k
k!
Rn, k
R n, k R n, k
495
330
255
132
56.7
21.8
7.78
2.61
1.85
0.712
1.36
1.18
1.90
2.21
2.61
0.45
Exp
Thy
2
3:13:27
where I0(x) is a modified Bessel function of the first kind. (We encountered this
function in the discussion of the Skellam distribution in Chapter 1.) Although the
sum extends over all non-negative integers, counts tend to cluster about the mean ,
p
thereby creating an effective range k 4 4 . Consider 100 for example,
for which the effective range would comprise ~ 40 integers between 80 and 120:
exact:
X
k0
120
X
k80
In a bag of 8192 bins, one would expect approximately 8192 0.028 229 occurrences of adjacent bins with identical counts.
To handle the case of identical adjacent outcomes, one can consider all runs
distributions that result from replacing each null (0) with each binary value (,).
If all of these distributions are incompatible with statistical control, the hypothesis of
randomness is to be rejected. Alternatively, one can assign to each null a randomly
chosen binary value for example, by using an alternative random process or reliable
pseudo-RNG and analyze the runs of the resulting sequence.
The time series of gamma coincidence counts arising from the decay of 22Na was
analyzed for runs up and down, with results tabulated in Table 3.5 and shown
graphically in Figure 3.15 for a sequence of length n 1.376 106 with mean count
per bin X ~ 194.
For a time record of so many bins, up/down runs of length up to k 8 were
observed. The square plotting symbols in the top panel of the figure mark the (log of
the) observed number of runs of each length. The solid line (to help guide the eye)
is the theoretical prediction of runs for a random distribution of integers (null
174
6
5
4
3
2
1
Run Length
1000
R(Exp) - R(Thy)
500
500
1000
1
Run Length
Fig. 3.15 Upper panel: observed values (squares) and theoretical values (solid) of the log of
cumulative up/down runs Rn,k of length k and record length n 1 375 694 bins. Error bars mark
2 standard deviations. Lower panel: difference (circles) of observed and predicted runs
Rn, kExp Rn, kThy . Dashed lines mark intervals of
2 standard deviations at each run length.
175
x
x!
3:14:1
is the probability of x decays occurring within a bin irrespective of where in the time
record the bin is located, provided the process is stationary (i.e. is constant). We will
assume the record is stationary or has been so adjusted, as described previously.
To facilitate examination, let us denote by EX the event defined by the occurrence of
x decays. If we designate by 0 the bin in which EX last occurred, then the waiting time in
units of t is the number (call it n) of the bin in which EX next occurs. This bin would
then be designated 0, and the process repeated to determine the subsequent waiting
time. In this manner by working ones way through a long chronological record of
nuclear decays and tallying the times of first occurrence of event EX one can obtain a
sample estimate of the mean waiting time hTXi and associated variance 2T X .
The probability that the waiting time TX n follows a geometric distribution
PrEX PrT X n qn1
X pX
qX 1 pX ,
3:14:2
where EX has failed to occur with probability qX in the first n 1 bins and then occurs
with probability pX in the nth bin. It is then easy to calculate from (3.14.2) the
moment-generating function (mgf )
gT X t heT X t i
nt
pX qn1
X e
n1
pX X
pX e t
n
qX et
,
qX n1
1 q X et
3:14:3
which provides an expedient means of deducing statistical moments (by differentiation with respect to the expansion variable t)
2T X
1
pX
hT 2X i g00T X 0
00 q
hT 2X i hT X i2 lngT X 0 X2 :
pX
hT X i g0TX 0
1 qX
p2X
3:14:4
The concept of waiting time can be generalized so as to apply to the rth ocurrence
r
(rather than the first occurrence) of a count x in the nth bin. Designate this event EX .
176
r
Then EX is achieved when a success (a count x) occurs in bin n and a failure (count
other than x) occurs in (n 1) (r 1) n r of the remaining n 1 bins. The
r
number of ways in which EX can take place is then given by the binomial coefficient
n1
r
, whereupon it follows that the probability Pr EX in words, the probnr
r
ability that the waiting time T X between rth occurrences of success takes the value
nt is
n 1 r nr
r
r
pX q X :
3:14:5
Pr EX Pr T X n
nr
It is more convenient to work with the formula in (3.14.5) if the bin variable n is
replaced by the number of failures k n r (where k 0,1,2. . .). Then (3.14.5) takes
the form
rk1 r k
r r
r
Pr T X r k
3:14:6
pX qX
pX qX k ,
k
k
where the second equality expresses what is called a negative binomial distribution.40
The equivalence between the two binomial expressions in (3.14.6) is established by
r
according to the rule defining a binomial
explicitly writing the factors in
k
coefficient
!
r
r r 1 r k 1
r r 1 r k 1
1k
1
2
k
k!
k
r 1!
r r 1 r k 1
r k 1!
1k
1k
r 1!
k!
k!r 1!
!
rk1
1k
k
3:14:7
and then multiplying numerator and denominator by (r 1)! (as shown in the second
rk1
.
line) in order to create the factorials that define
k
Given the negative binomial form of the probability function in (3.14.6), it is
straightforward to demonstrate the completeness relation
X
X
pr
r r
r
k
r
qX k prX 1 qX r Xr 1
3:14:8
pX qX pX
k
k
pX
k0
k0
40
W. Feller, An Introduction to Probability Theory and its Applications Vol. 1 (Wiley, New York, 1957) 155156.
177
gT r t e
n
r
k
X
nr
k0
pet
r
X
pet
r
k
qX et
k
1 qX et
3:14:9
k0
hT X i
pX
p2X
2T r
X
rqX
p2X
3:14:10
are obtained.
r
There is a simple, instructive explanation for the form of the mgf (3.14.9) of T X ,
r
which is seen to be the rth power of the mgf (3.14.3) of TX. The random variable T X is
interpretable as the sum
r
TX TX TX TX
3:14:11
r terms
of r independent random variables, each representing the waiting time for the first
occurrence of a bin with count x. Since the mgf of a sum of independent
random
r
variables is the product of the component mgfs, it follows that gT r t gT X t .
X
The entire time record of gamma coincidence counts arising from decay of 22Na
was analyzed for the intervals of count values X 190 through X 198 where the
mean count per bag was approximately 194. Additionally, in order to examine
whether the statistics of the record may have changed throughout the duration of
the experiment, the intervals were examined as well for the first 10 hours of counting,
for a middle period of hours 50 through 60, and for the hours 150 through 160
towards the end of the experiment. In calculating the theoretical mean waiting time
and variance from (3.14.4), it was necessary to take account of the variation in due
to natural lifetime, since this parameter determines the probability pX in (3.14.1).
A histogram of the intervals of recurrences of X 194 counts per bin is shown in
Figure 3.16 with a plot of the theoretical geometric probability function superposed.
This histogram is typical of the results obtained for the other count values as well. So
close is the agreement of experiment with theory (i.e. the null hypothesis) that I again
remind readers they are looking at real data and not a computer simulation. Table 3.6
summarizes the goodness of fit for a range of class values.
178
Table 3.6
T Thy
X
T Exp
X
Class
X
dof
d
P
value
hT Thy
X i
(Thy)
(Exp)
(Thy)
(Exp)
190
191
192
193
194
195
196
197
198
153
140
153
148
159
154
176
181
189
168
170
172
171
166
161
178
167
171
0.790
0.955
0.848
0.898
0.638
0.640
0.528
0.216
0.164
36.2
35.6
35.2
35.0
35.0
35.0
35.4
35.9
36.6
36.2
35.6
35.2
34.7
34.9
35.2
35.4
35.9
36.7
35.7
35.1
34.7
34.5
34.4
34.6
34.9
35.4
36.1
35.7
35.0
34.9
34.3
34.4
34.6
34.9
35.5
35.9
Relative Frequency
0.025
0.020
0.015
0.010
0.005
0.000
0
50
100
150
200
179
Suppose, for example, that at the end of an initial period of counting nuclear
decays the classes corresponding to count values X . . .80, 90, 100, 110, 120. . . in the
resulting histogram showed a set of frequencies . . .130, 628, 1000, 587, 140. . ., where
the highest frequency (1000) corresponded to the central class, i.e. the center of a
more or less Gaussian-looking shape. If a histogram of similar shape were to occur
again, we would expect the same classes to exhibit frequencies very close to the
preceding set. This idea is illustrated below in tabular form with classes of unit width
and count frequencies arrayed chronologically in bags, where the bag number serves
as a measure of increasing time.
Bags !
Classes
Value
1
2
3
.
.
.
.
.
.
.
K
X1
X2
X3
.
193
194
195
.
.
.
XK
n11
n21
n31
.
85
100
94
.
.
.
nK1
n12
nK2
...
...
...
n1k
85
100
94
nKk
...
n1l
85
100
94
nKl
M
n1M
nKM
The central class defined by X 194 corresponds closest to the mean number of
counts per bin in the time series.
The frequencies nkm (k 1. . .K classes, m 1. . .M bags) entered in the table are
hypothetical and meant only to show the kind of pattern that would occur if the
histograms generated by the bags of frequencies exhibited a recurrent shape with
perfect regularity. Since nuclear decay is a stochastic process whereby the number of
decays fluctuates randomly from bin to bin, one would not expect the sequence of
histograms each histogram corresponding to one bag to manifest a pattern as
striking as the one shown above. The question, therefore, is how to detect amidst
statistical noise an underlying pattern of recurring shapes. . .if such a pattern were
actually present.
It has already been demonstrated in the previous two sections that the observed
runs up and down and the observed waiting times of different count values in the time
series of coincident gamma counts were in complete accord with theory (the null
hypothesis). It is difficult to imagine how the numbers of a time series can pass such
tests of randomness and yet occur with frequencies that display the temporal
180
Table 3.7
Count
class Ck
Sequence
length n
RnObs (observed)
RnThy (theory)
Normalized
residual zR
Probability
Pr(z jzRj)
190
191
192
193
195
196
197
198
199
37 977
38 606
39 050
39 464
39 029
38 839
38 280
37 430
36 728
25 223
25 800
26 031
26 366
26 099
25 928
25 466
25 015
24 467
25 318
25 737
26 033
26 309
26 019
25 892
25 520
24 953
24 485
1.15
0.76
0.02
0.68
0.96
0.43
0.65
0.76
0.22
0.25
0.45
0.98
0.50
0.34
0.67
0.52
0.45
0.82
regularity illustrated above. Nevertheless, nature has led to surprises before especially in matters involving quantum mechanics with outcomes that seemed counterintuitive,41 if not unreasonable. The test I devised to examine this possibility
employed again but simultaneously the concepts of waiting times (intervals)
and runs up and down.
The conceptual basis of the test was the following. If there is a causal periodicity to
the recurrence of histograms {Ha a 1. . .167}, then not only must the frequency of
occurrence of a particular class value (e.g. the count X 194) recur with some
regularity i.e. manifest intervals whose frequency of repetition is unaccountable
on the basis of pure chance but the intervals for different classes of count values
must be correlated or, again, there would be no meaning to the idea of equivalent
histogram shapes. The test was implemented, therefore, in two stages.
In the first stage, tests of up/down runs were made on the intervals in the
frequencies of a range of classes Ck 194 k (5 k 4) about the central class
C0 194 to establish that the results were all as expected on the basis of pure chance
i.e. under statistical control. This was indeed established.
In the second stage, the intervals of C0 were then arranged in descending order,
and the intervals of the other classes were sorted in the corresponding order. Runs
up/down tests were again performed on the intervals of Ck (k 6 0) to test whether the
sequences of intervals were still under statistical control or whether they were
correlated with the now highly improbable rank ordering of the intervals of C0.
The results, summarized in Table 3.7 for the total number of runs R Rn,1, confirmed
that the re-ordered intervals of Ck60 still conformed completely to what one would
expect on the basis of pure chance, signifying no correlation with the intervals of C0
or with each one another.
41
181
Recall that for a sample size n > 20, R is normally distributed to a good approximation with mean and variance given by (3.13.22) and (3.13.25). The fifth column
expresses the observed number of runs in standard normal form as the residual zR
(R(obs) R(thy))/R, with the corresponding P-value listed in the sixth column.
As is evident by the close agreement of observation with prediction, the runs-ofintervals test showed no evidence whatever that the histograms of a long time series
of 22Na decays gave rise to recurrent shapes. In all likelihood, any such appearance to
the contrary particularly with histograms massaged to generate smooth peaks,
valleys, and rabbit ears only reflects the intrinsic capacity of the human mind
(apophenia) to seek coherent patterns out of random noise.
The research described in this chapter proceeded over a period of more than five
years. For one thing, carrying out a sustained program of research within an
undergraduate institution devoted primarily to teaching and therefore without
the support of graduate students or postdoctoral assistants greatly restricted the
times when work on the project could be done. This is simply a statement of fact, not
a complaint, as there are compensating features to being at a liberal arts college, and
I am there by choice. For another, the project evolved in complexity, as I began to
understand better the experimental and analytical dimensions of what needed to be
done, and learning can be a relatively slow process. Fortunately if one chooses to
think of it this way I had little competition since belief in the randomness of nuclear
decay is so ingrained in the psyche of most physicists that few if any other labs (none
to my knowledge at the outset) probably thought this experimental fact was sufficiently in doubt to be worth checking.
Actually, compared to the plethora of experimental studies of quantum phenomena relating to interference and entanglement, I was aware of only relatively few tests
to examine specifically whether quantum transitions occurred nonrandomly. It was
this surprising paucity that prompted me to investigate nuclear decay in the first place
in the 1990s. Physicists who believed that the statistics of radioactive nuclei had long
ago been established beyond doubt probably conflated such a demonstration with
exponential decay. In fact, quantum theory predicts a non-exponential decay of
quasi-stationary states for times short compared with the coherence time of the
42
A. S. Eddington, The Nature of the Physical World (University of Michigan Press, 1963) 229. [Originally the
1927 Gifford Lectures published by Cambrdige University Press.]
182
system or long compared with the mean lifetime.43 In the case of 22Na, the former
time domain would be roughly 1018 seconds (time required for a nucleon to cross
the diameter of the nucleus), and the latter about 2.5 years. The time scale of the data
collection was about 167 hours, well outside both time domains. The experimental
conditions therefore were consistent with the null hypothesis, which does lead to
exponential decay as explained earlier in this chapter.
My motivation for examining one particular nuclide (22Na) in such detail and in
the specific ways described in this chapter was, as I have written in the introduction,
primarily due to the repeated extraordinary claims published by certain groups of
researchers. Since much of the daily work in science, as in other professions generally,
does not rise above the mundane, the challenge posed by these published assertions,
although apparently dismissed out of hand by many in the physics community or at
least by those who wrote me to debunk them made my own scientific life more
exciting. In fact, despite harboring the biases that most physicists do in favor of a
prevailing theory that had remained inviolable since its inception in the 1920s,
I secretly wished that the claims were valid, that nuclear decay somehow managed
to disguise a deterministic component beneath an outer appearance of randomness
or more likely (although still highly unlikely) that some external influence of
unknown origin, a cosmogenic force, existed that exerted a subtle control over
what otherwise would be independent, random processes.
Alas or perhaps fortunately the experiments and analyses described here found
no such thing. Rather, all the tests pointed unwaveringly to the following
conclusions.44
The discrete states in histograms of nuclear decay reflected only correlations
introduced artificially by construction.
Visual inspection of shapes of histograms provided no reliable test of correlations
in the underlying stochastic processes.
Nuclear decay (at least that of radioactive sodium)
was completely consistent with white noise;
showed no correlations in fluctuations of counts in the time series;
showed no correlations in fluctuations of frequencies in the histograms;
showed no periodicity in either nuclear counts or count frequencies for time
intervals ~167 hours;
showed no unexplained trends over a period ~35 days.
There are, of course, countless other nuclear decays to examine, as well as random
fluctuations arising from electromagnetic waves and chemical reactions. But, if the
point was to search for a cosmic influence of universal reach, then presumably such
43 M. P. Silverman, Probing the Atom: Interactions of Coupled States, Fast Beams, and Loose Electrons (Princeton University Press, Princeton NJ, 2000).
44 M. P. Silverman and W. Strange, Search for Correlated Fluctuations in the Decay of Na-22, Europhysics Letters 87 (2009) 32001 p1-p6.
an influence did not exist, if it had no effect on the decay of 22Na (which is a weak nuclear interaction), the subsequent electron-positron annihilation into gamma rays (which is a relativistic quantum electrodynamic interaction), and all the electronic activities that went on within our detection and data-processing instrumentation (which constitute non-relativistic classical electromagnetic interactions).
Nevertheless, the question of whether it is conceivable within the framework of the
known laws of physics for ostensibly independent nuclear processes to be correlated
by some kind of universal interaction is an interesting one. The Standard Model of
particles and forces predicts an ever-present background field (Higgs field) pervading
all space, which determines particle masses. Similarly, the Standard Cosmological
Model (big bang inflation) requires an all-pervasive field (dark matter) to account
for the cosmic distribution of mass and a second such field (dark energy) to account
for a perceived increase in the expansion rate of the universe. Whether such fields
could lead to correlated fluctuations in nuclear decay is highly dubious. Indeed, the
very existence of such fields in cosmology has come into question, and, despite my
having also contributed to this genre of speculation,45,46,47 I am progressively evolving to the view that these mysterious entities will eventually go the way of the
nineteenth-century aether once the nature of gravity is better understood.
This statistical narrative of what (at least to me) was a fascinating undertaking could well have ended with the preceding sentence were it not for a surprising revelation at least as extraordinary, and possibly more believable, than the claims of discrete structures and recurrent histograms. As the project recounted in this chapter neared conclusion, reports came to my attention of data showing variable nuclear decay rates correlated with the Earth's orbital position about the Sun48 and influenced by variable solar activity such as solar flares.49 The data from which such drastic conclusions were drawn were not recent, but comprised measurements of the half-life of radioactive silicon (32Si) made at the Brookhaven National Laboratory (BNL) over the period 1982-1986 and of radioactive europium (152Eu, 154Eu) made at the Physikalisch-Technische Bundesanstalt (PTB) in Germany over a period of approximately 15 years from 1984-1999, and a measurement of the decay of radioactive manganese (54Mn) made in 2006. These measurements, I understood, were undertaken for purposes of metrology and calibration and not with an eye to testing fundamental principles of nuclear physics. Indeed, I was later to learn that when the long-duration measurements manifested variable nuclear decay rates, the
45 M. P. Silverman and R. L. Mallett, Coherent degenerate dark matter: a galactic superfluid?, Classical and Quantum Gravity 18 (2001) L103-L108.
46 M. P. Silverman and R. L. Mallett, Dark matter as a cosmic Bose-Einstein condensate and possible superfluid, General Relativity & Gravitation 34 (2002) 633-649; Erratum 35 (2002) 335.
47 M. P. Silverman, A Universe of Atoms, An Atom in the Universe (Springer, New York, 2002) 325-385.
48 J. H. Jenkins et al., Evidence of correlations between nuclear decay rates and Earth-Sun distance, Astroparticle Physics 32 (2009) 42-46.
49 J. H. Jenkins and E. Fischbach, Perturbation of nuclear decay rates during the solar flare of 2006 December 13, Astroparticle Physics 31 (2009) 407-411.
experimenters largely retired the data out of concern that the experiments had
somehow gone awry.
Once resurrected, however, the claims of variable nuclear decay rates did not go
unchallenged, and evidence against small periodic annual variations modulating the
exponential decay curve was obtained from re-examination of a variety of nuclear
decays obtained both terrestrially50 and from radioisotope thermoelectric generators
aboard the Cassini spacecraft.51
Around the time I learned of this interesting controversy, I was co-organizing a workshop on the fundamental physics of charged-particle (e.g. electron) and heavy-particle (e.g. neutron) interferometry to be held at the Harvard-Smithsonian Center for Astrophysics in Cambridge, Massachusetts in April 2010. The conference, it seemed to me, could provide an excellent opportunity to understand better the conflicting conclusions regarding decay of radioactive nuclei, which, while not exactly interferometry, nevertheless involved some very heavy quantum particles. And so I invited two speakers on opposite sides of the issue. Only one accepted the invitation, and therefore only one side was presented, that in support of variable nuclear decay. I found the presentation thought-provoking, but not convincing.
If a variation in nuclear decay rate were indisputably shown to be correlated with either the position of the Earth about the Sun or short-term violent activity at the Sun's surface, what possibly could be the cause? The processes that go on within a nucleus have long been thought to be impervious to non-nuclear activities outside the nucleus, that is, to all local environmental variables like temperature, humidity, molecular bonding, chemical reactions, ambient light, and even fairly strong laser fields directed at the nucleus. The only exceptions I know of are processes that involve the density of exterior bound electrons at the nucleus.
The most familiar case, previously mentioned, is that of electron-capture decay
whereby a neutron-deficient nucleus can convert a proton to a neutron by capturing
an orbital electron, usually from the innermost (K) shell. This can occur if the
daughter product has a mass greater than the threshold mass for positron emission.
Ordinarily, this effect is very weak. An isotope of beryllium (7Be), for example,
decays with a half-life of about 54 days by capturing a K-shell electron to form an
isotope of lithium (7Li), a process that plays a role in dating geologic samples.
Depending on the molecular bonding by which 7Be is bound in a molecule (hydrated
ion, hydroxide, or oxide), the half-life of 7Be can vary by about 1%.52
A less common case, however, known as bound-state beta decay, can lead to
spectacular modifications. This is a weak-interaction process in which the emitted
electron (beta particle) remains in a bound atomic state rather than being emitted
50 E. B. Norman et al., Evidence against correlations between nuclear decay rates and Earth-Sun distance, Astroparticle Physics 31 (2009) 135-137.
51 P. S. Cooper, Searching for modifications to the exponential radioactive decay law with the Cassini spacecraft, Astroparticle Physics 31 (2009) 267-269.
52 R. A. Kerr, Tweaking the clock of radioactive decay, Science 286 No. 5441 (29 October 1999) 882-883.
into a continuum of unbound states. For a neutral atom, only weakly unbound states with low density at the nucleus are available, and the process is insignificant compared to other decay pathways. However, for the fully ionized atom (such as can occur in a stellar interior) deeply bound states close to the nucleus are available. One example, produced terrestrially in a heavy-ion synchrotron, is the bound-state beta decay of an isotope of rhenium (187Re), for which the half-life of the neutral atom is 42 billion years and that of the fully ionized atom is only 14 years!53
Of various mechanisms that have been proposed for the alleged correlation of nuclear decay and solar activity, some (e.g. novel force fields emanating from the Sun that cause changes in the magnitudes of fundamental parameters such as the fine-structure constant) involved entirely new physics, whereas others (e.g. variations in the flux of neutrinos emitted by the Sun) involved known particles, but interacting with matter with cross sections far greater than those predicted by the Standard Model. Nevertheless, the neutrino mechanism is of particular interest because it can be tested more readily than others that require new force fields. If the neutrino mechanism is applicable, one would not expect to find an annual variation in the rate of decay of nuclei that disintegrate by emission of alpha particles, which is an electromagnetic rather than weak nuclear process. As of this writing, the issue is still ambiguous.
How utterly remarkable it would be if neutrinos from the Sun actually affected the decay of radioactive elements on Earth may be glimpsed from the following back-of-the-envelope (or front-of-the-computer-screen) estimate of (a) the rate of absorption of neutrinos by radioactive manganese (54Mn) compared with (b) the natural decay rate due to electron capture. Manganese-54, whose decay I investigated years earlier with my colleague Wayne Strange, is one of the nuclides claimed to be affected by solar flares. The rate of process (a) is given by the expression

$$\begin{pmatrix}\text{number of neutrinos}\\ \text{absorbed per second}\end{pmatrix}_a = \begin{pmatrix}\text{neutrino flux}\\ \text{at the Earth}\end{pmatrix} \times \begin{pmatrix}\text{neutrino}\\ \text{cross section}\end{pmatrix} \times \begin{pmatrix}\text{number of}\\ \text{nucleons}\end{pmatrix} = \Phi_\nu\, \sigma_\nu\, N_n. \qquad (3.16.1)$$
The term flux, derived from the Latin root for flow, means number of particles passing through a unit area in a unit of time. The neutrino flux at the Earth has been measured to be ~7 × 10^10 m^-2 s^-1. The cross section is a measure (in terms of area) of the probability that an interaction occurs. The neutrino cross section increases with energy, but at the low energies of beta decay (MeV compared with GeV),54 a representative value of the neutrino cross section is ~10^-47 m^2. It is this extremely small number that underlies the frequently cited remark that a neutrino can pass
53 F. Bosch et al., Observation of bound-state beta decay of fully ionized 187Re: 187Re-187Os cosmochronometry, Physical Review Letters 77 (1996) 5190-5193.
54 The electron volt (eV) is a common measure of particle energy. The MeV (mega or million eV) is characteristic of low-energy nuclear phenomena like beta decay, whereas GeV (giga or billion eV) is characteristic of high-energy elementary particle phenomena.
undeflected through a light-year of lead. The rate of process (b), which was derived earlier in the chapter, takes the form

$$\begin{pmatrix}\text{number of }^{54}\text{Mn nuclei}\\ \text{decaying per second}\end{pmatrix}_b = \underbrace{\frac{\ln 2}{t_{1/2}}}_{\substack{\text{intrinsic}\\ \text{electron-capture decay rate}}} \times \begin{pmatrix}\text{number of}\\ ^{54}\text{Mn nuclei}\end{pmatrix}, \qquad (3.16.3)$$

with $\ln 2 \approx 0.693$ and half-life $t_{1/2}$,
where it is to be noted that the number of nucleons drops out. Unless there is
something about neutrinos that physicists really do not understand, it is difficult to
reconcile with current theory how solar neutrinos could affect the rate of weak
nuclear decay processes on Earth.
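The comparison of rates (a) and (b) is easily reproduced numerically. In the sketch below the flux and cross section are the representative values quoted above, the half-life of 54Mn is taken as ~312 days, and the sample size N is a hypothetical number that cancels out of the final ratio:

```python
import math

flux = 7e10           # solar neutrino flux at Earth (m^-2 s^-1), value from the text
sigma = 1e-47         # neutrino cross section at beta-decay energies (m^2)
A = 54                # nucleons per 54Mn nucleus
t_half = 312 * 86400  # half-life of 54Mn (~312 days, in seconds)
N = 1e20              # hypothetical number of 54Mn nuclei in the sample

rate_a = flux * sigma * (A * N)      # neutrinos absorbed per second, Eq. (3.16.1)
rate_b = (math.log(2) / t_half) * N  # electron-capture decays per second, Eq. (3.16.3)

print(f"{rate_a / rate_b:.1e}")      # ~1.5e-27: absorption is utterly negligible
```

With these inputs the absorption rate is smaller than the natural decay rate by some 27 orders of magnitude, which is the quantitative content of the remark above.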
Because the modulation of nuclear decay by solar activity would have momentous
theoretical and practical consequences (e.g. for nuclear-based geological dating), it
goes without saying (but I will say it anyway) that such claims are unlikely to be
accepted until careful studies are done to understand just how the Sun affects the
apparatus employed in experiments for detecting particles and processing electronic
signals. It has long been known that the Sun emits a veritable wind of charged
particles and a broad spectrum of electromagnetic radiation. Sporadic violent solar
activity has damaged orbiting satellites and affected communications on Earth. What is perhaps not widely appreciated (and only relatively recently investigated in depth) is that the Sun has more than ten million normal modes of oscillation that can affect virtually every electrical device imaginable. As one recent report on the Sun's ubiquitous influence concluded:55
. . .we have shown a series of examples where data encountered in the engineering environment
is not of the form that most texts prepare engineering students to expect. The majority of this
data is nonstationary on a variety of scales, definitely not white, and contains many discrete
modes. . . .these modes begin as normal modes of the sun, are often further split by Earth's
rotation and possibly other causes. These modes are ubiquitous in space physics data, in the
magnetosphere and ionosphere, in barometric pressure data, in induced voltages on ocean
cables, and even in the solid Earth. Although the specific physical coupling mechanism is not
understood, the solar modes appear to be a major driver of dropped calls in cellular phone
systems. We currently do not know how many different kinds of systems are directly or
indirectly affected by phenomena arising from these modes. . .
55 D. J. Thomson et al., Solar modal structure of the engineering environment, Proceedings of the IEEE 95 (May 2007) 1085-1132.
Appendices

$$S_0 = \tfrac{1}{2}R_0 + \sum_{k=1}^{m} R_k, \qquad S_m = \tfrac{1}{2}R_0 + \sum_{k=1}^{m} (-1)^k R_k \qquad (3.17.2)$$

and

$$\sum_{j=1}^{m-1} S_j = (m-1)R_0 + 2\sum_{k=1}^{m} R_k \underbrace{\sum_{j=1}^{m-1} \cos\!\left(\frac{\pi j k}{m}\right)}_{C \,=\, -\frac{1}{2}\left[1+(-1)^k\right]} = (m-1)R_0 - \sum_{k=1}^{m} \left[1+(-1)^k\right] R_k. \qquad (3.17.3)$$

The sum beneath the bracket follows from the geometric series (with $r = e^{i\pi k/m}$, so that $r^m = e^{i\pi k} = \cos \pi k$)

$$\sum_{j=1}^{m-1} r^j = \frac{r - r^m}{1 - r} = \frac{e^{i\pi k/m} - \cos \pi k}{1 - e^{i\pi k/m}}. \qquad (3.17.5)$$

Taking the real part of (3.17.5) leads to the expression beneath the bracket in (3.17.3).
$$\sigma_A^2(\nu) = \frac{\sigma^2}{N}\sum_{i=1}^{N} \cos^2(2\pi\nu t_i) \;\xrightarrow[\Delta t \to 0]{N \to \infty}\; \frac{\sigma^2}{2}, \qquad \sigma_B^2(\nu) = \frac{\sigma^2}{N}\sum_{i=0}^{N-1} \sin^2(2\pi\nu t_i) \;\xrightarrow[\Delta t \to 0]{N \to \infty}\; \frac{\sigma^2}{2} \qquad (3.18.3)$$

that reduce in the limits shown to one-half the variance of the original time series. Thus, the real and imaginary amplitudes of the DFT of the time series are normally distributed

$$\begin{matrix} A(\nu) \\ B(\nu) \end{matrix} \;\sim\; N\!\left(0, \tfrac{1}{2}\sigma^2\right) = N\!\left(0, \tfrac{1}{2}\right). \qquad (3.18.4)$$
3.18.3 Squared amplitudes A²(ν), B²(ν)

The probability density function (pdf) of z = y² is related to the pdf of y by the chain of steps

$$p_Z(z)\,dz = p_Y(y)\,dy + p_Y(-y)\,dy = \left[p_Y(y) + p_Y(-y)\right]dy = \left[p_Y(y) + p_Y(-y)\right]\left|\frac{dy}{dz}\right|dz \qquad (3.18.5)$$

from which follows the transformation

$$p_Z(z) = \frac{p_Y(y) + p_Y(-y)}{|dz/dy|} = \frac{p_Y(\sqrt{z}\,) + p_Y(-\sqrt{z}\,)}{2\sqrt{z}}. \qquad (3.18.6)$$

Application of (3.18.6) to the pdf $p_Y(y)$ of a normal variate $N(0, \tfrac{1}{2})$ leads to the pdf $p_Z(z)$ of a gamma-distributed random variable $\mathrm{Gam}(r, \lambda) = \mathrm{Gam}(\tfrac{1}{2}, 1)$. Thus, the square moduli of the DFT are gamma distributed

$$\begin{matrix} A^2 \\ B^2 \end{matrix} \;\sim\; \mathrm{Gam}\!\left(\tfrac{1}{2}, 1\right). \qquad (3.18.7)$$
3.18.4 Power spectrum S(ν) = |F(ν)|² = A(ν)² + B(ν)²

The mgf of a gamma variate $\mathrm{Gam}(r, \lambda)$ takes the form $g(u) = (1 - \lambda^{-1}u)^{-r}$. It then follows that the mgf of the sum of two iid gamma variates $\mathrm{Gam}(\tfrac{1}{2}, 1)$ is the square $g(u)^2 = (1 - u)^{-1}$, which is identical to the mgf of an exponential distribution with parameter $\lambda = 1$. Thus, the ordinates of the power spectrum are distributed exponentially

$$S(\nu) = A(\nu)^2 + B(\nu)^2 \sim E(1). \qquad (3.18.8)$$
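The chain of results from normally distributed DFT amplitudes to exponentially distributed power-spectrum ordinates can be checked by simulation. The sketch below assumes a unit-variance Gaussian white-noise series and normalizes the DFT by 1/√n so that A and B each have variance one-half; the series length and number of trials are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1024, 2000

# DFT of unit-variance Gaussian white noise, normalized so that the real
# and imaginary amplitudes satisfy A, B ~ N(0, 1/2)
y = rng.standard_normal((trials, n))
F = np.fft.rfft(y, axis=1)[:, 1:n // 2] / np.sqrt(n)  # interior modes only
A, B = F.real, F.imag
S = A**2 + B**2  # power-spectrum ordinates, expected ~ E(1)

print(round(A.var(), 2), round(B.var(), 2))   # each ~0.5
print(round(S.mean(), 2), round(S.var(), 2))  # exponential: mean ~1, variance ~1
```

The variances of A and B come out near one-half, and the ordinates S show the equal mean and variance characteristic of an exponential distribution with unit parameter.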
The pdf of $w = \sqrt{A^2 + B^2} = \sqrt{z}$ is related to the pdf of z by a chain of steps

$$p_W(w)\,dw = p_W(w)\left|\frac{dw}{dz}\right|dz = p_Z(z)\,dz \qquad (3.18.9)$$

whence

$$p_W(w) = \frac{p_Z(z)}{|dw/dz|} = 2\sqrt{z}\, p_Z(z) = \underbrace{2w\, p_Z(w^2)}_{\text{Rayleigh pdf}}. \qquad (3.18.10)$$
$$p_Z(z) = \int_{-\infty}^{\infty} p_X(yz)\, p_Y(y)\, |y|\, dy \qquad (3.18.14)$$

Equation (3.18.14) reduces to the pdf of a Cauchy Cau(0, 1) distribution centered on z = 0 with unit width parameter when the pdfs of x and y are Gaussian functions of zero mean and unit variance.
3.18.7 Autocovariance R_k

We employ the discrete estimate

$$R_k = \frac{1}{n}\sum_{t=1}^{n-k} y_t\, y_{t+k} \qquad (k > 0) \qquad (3.18.15)$$

of the autocovariance of the adjusted time series {y_i}, each element of which is representable by an independent normal random variable $Y = N(0, \sigma^2)$.
The pdf of the product $z = y_t\, y_{t+k}$ of two independent zero-mean normal variables follows from the transformation

$$p_Z(z) = \int_{-\infty}^{\infty} p_X(x)\, p_Y(z/x)\, |x|^{-1}\, dx \qquad (3.18.16)$$

which becomes

$$p_Z(z) = \frac{1}{\pi\sigma^2}\int_0^{\infty} e^{-\left(x^2 + z^2 x^{-2}\right)/2\sigma^2}\, x^{-1}\, dx \qquad (3.18.17)$$

when the pdfs of x and y are densities for $N(0, \sigma^2)$. Rather than evaluate the integral (3.18.17) directly, it is actually more convenient to use it to calculate the corresponding mgf

$$g_Z(u) = \int_{-\infty}^{\infty} e^{zu}\, p_Z(z)\, dz = \frac{1}{\pi\sigma^2}\int_0^{\infty} e^{-x^2/2\sigma^2}\, x^{-1} \underbrace{\left(\int_{-\infty}^{\infty} e^{zu - z^2/2\sigma^2 x^2}\, dz\right)}_{\sqrt{2\pi}\,\sigma x\, e^{\sigma^2 x^2 u^2/2}}\, dx = \sqrt{\frac{2}{\pi\sigma^2}}\int_0^{\infty} e^{-\frac{1}{2}x^2\left(\sigma^{-2} - \sigma^2 u^2\right)}\, dx = \left(1 - \sigma^4 u^2\right)^{-1/2} \qquad (3.18.18)$$

which reduces to a deceptively simple form that is not one of the familiar distributions to be found in books. At first glance it may resemble the mgf of a gamma distribution (with $r = \tfrac{1}{2}$), but the latter is a function of u, not of u².

The pdf resulting from (3.18.17) is in fact a Bessel function (a modified Bessel function of the second kind, $(\pi\sigma^2)^{-1} K_0(z/\sigma^2)$, to be exact), but we do not need to know or use this information. Everything of relevance is contained in the mgf (3.18.18), from which it follows that the mgf of a sum of $n - k$ iid products $y_t\, y_{t+k}$ is $\left(1 - \sigma^4 u^2\right)^{-(n-k)/2}$, and therefore the mgf of the autocovariance $R_{k>0}$ is

$$g_{R_k}(u) = \left[g_Z\!\left(\frac{u}{n}\right)\right]^{n-k} = \left(1 - \frac{\sigma^4 u^2}{n^2}\right)^{-(n-k)/2}. \qquad (3.18.19)$$
Expansion of the natural logarithm of (3.18.19) with truncation after the first term (of order u²), which is entirely adequate for large sample size $n \gg 1$, leads to the mgf $g_{R_k}(u) = e^{(\sigma^4/n)\,u^2/2}$ of Gaussian form with variance $\sigma_{R_k}^2 = \sigma^4/n$. Thus, the autocovariance function of lag k > 0 is distributed as

$$R_{k>0} \sim N\!\left(0, \sigma^4/n\right). \qquad (3.18.20)$$
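The prediction that the lagged autocovariance of a white-noise series is Gaussian with variance σ⁴/n is likewise easy to check by simulation; the series length, number of trials, lag, and variance below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, k = 500, 20000, 3
sigma2 = 2.0  # variance of the white-noise series (an assumed value)

y = rng.normal(0.0, np.sqrt(sigma2), (trials, n))
Rk = (y[:, :-k] * y[:, k:]).sum(axis=1) / n  # lag-k estimate, Eq. (3.18.15)

# Eq. (3.18.20) predicts Rk ~ N(0, sigma^4/n) for k << n
print(round(Rk.mean(), 3))                 # ~0
print(round(Rk.var() * n / sigma2**2, 1))  # ~1, i.e. Var(Rk) ~ sigma^4/n
```

The sample variance of the 20 000 simulated estimates, scaled by n/σ⁴, comes out close to unity, as (3.18.20) requires.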
For the lag-zero autocovariance one needs the mgf of a scaled chi-square variate

$$g_{\chi_n^2/n}(u) = \left(1 - \frac{2u}{n}\right)^{-n/2}. \qquad (3.18.21)$$

By the same, somewhat tedious, analytical procedure one now arrives at the mgf

$$g_{R_0}(u) = \left(1 - \frac{2\sigma^2 u}{n}\right)^{-n/2}. \qquad (3.18.22)$$
There is, however, an easier way to arrive at (3.18.22), which shows the connection between the random variable $Z = \frac{1}{n}\sum_{i=1}^{n} N_i(0, \sigma^2)^2$ and the chi-square variate $\chi_n^2 = \sum_{i=1}^{n} N_i(0, 1)^2$. By exploiting the equivalence $N(0, \sigma^2) = \sigma N(0, 1)$, we can evaluate

$$g_{R_0}(u) = \left\langle e^{u\,(\sigma^2/n)\,\chi_n^2} \right\rangle = g_{\chi_n^2}\!\left(\frac{\sigma^2 u}{n}\right) = \left(1 - \frac{2\sigma^2 u}{n}\right)^{-n/2}. \qquad (3.18.23)$$
4

Mother of all randomness
Part II: The random creation of light

1 Sir Isaac Newton, Opticks (first printed by The Royal Society, 1704) 1.
2 See M. P. Silverman and W. Strange, The Newton two-knife experiment: intricacies of wedge diffraction, American Journal of Physics 64 (1996) 773-787. I have reproduced Newton's experiment in my lab using lasers (rather than sunlight) and a CCD (charge-coupled device) camera (rather than the human eye) and analyzed the patterns in detail using the scalar theory of Fresnel diffraction.
the amount of energy and linear momentum (4.1.1) and undergoes a transition
to a higher-energy quantum state. Excited matter can undergo a transition to
a lower-energy state by emitting a photon that carries away the amount of
energy and momentum (4.1.1). Quantum mechanically, this is how light is
produced, in contrast to the classical mechanism whereby accelerating charged
particles emit electromagnetic radiation. Because photons are massless, their
range is infinite. That is why we can receive electromagnetic signals from
galaxies millions of light years distant.
Combining relations (4.1.1) and (4.1.2) leads to familiar relations linking frequency, wavelength, and speed

$$\nu\lambda = \frac{\omega}{k} = \frac{E}{p} = c. \qquad (4.1.3)$$
One might wonder why, if light either moves through vacuum at speed c or else vanishes, the speed of light through matter differs from c. The answer is statistics. From a quantum perspective, the movement of light through matter is in some ways analogous to the flight of a drunk through a forest. By continually colliding with trees, falling down, getting up, and recommencing running, the drunk's mean speed is less than his instantaneous speed. Likewise, photons moving through matter are virtually absorbed and re-emitted in collisions with atoms and molecules, but otherwise move at speed c through the interstices of the material. Macroscopically, the effect of a material on the propagation of light is represented by the index of refraction, a wavelength-dependent function that could also depend on the polarization and direction of propagation of the light.
(c) The photon has no electric charge. Thus, although the photon is a carrier of the electromagnetic interaction, it does not interact directly with electric or magnetic fields. A consequence of this property for classical electromagnetism is the linearity of Maxwell's equations and the fact, therefore, that any linear superposition of solutions to these equations is also a solution. In QED, however, ultra-strong electric or magnetic fields can destabilize the vacuum, leading to virtual (i.e. transient) electron-positron pairs that scatter photons. From a macroscopic perspective, the vacuum has acquired a field-dependent refractive index. One such exotic process, for example, is the magnetic birefringence of the vacuum, which requires an ultra-strong magnetic field of such strength B that the work expended in displacing an electron by its Compton wavelength ($\lambda_C = h/m_e c$) is at least equal to the electron rest-mass energy

$$\underbrace{ecB}_{\substack{\text{relativistic magnetic}\\ \text{force}}} \times \underbrace{\frac{h}{m_e c}}_{\substack{\text{electron Compton}\\ \text{wavelength}}} = m_e c^2 \qquad \text{or} \qquad B = \frac{m_e^2 c^2}{eh} \approx 10^9\ \text{Tesla}, \qquad (4.1.4)$$
The calculation in (4.1.4) is of a heuristic nature only, intended to arrive at a dimensionally correct, numerically valid
estimate of the critical field strength. Technically, a magnetic field does no work because it acts perpendicular to
displacement. According to special relativity, however, the magnetic field in the moving frame of a particle results in an
electric field in the instantaneous rest frame of the particle. Electric fields can do work on a particle.
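The heuristic estimate (4.1.4) can be reproduced with a few lines of arithmetic; the constants below are rounded standard values:

```python
# Heuristic critical field of (4.1.4): the work e c B exerted over one
# Compton wavelength h/(m_e c) equals the electron rest energy m_e c^2.
m_e = 9.109e-31  # electron mass (kg)
c = 2.998e8      # speed of light (m/s)
e = 1.602e-19    # elementary charge (C)
h = 6.626e-34    # Planck constant (J s)

B = m_e**2 * c**2 / (e * h)  # solve e c B (h / m_e c) = m_e c^2 for B
print(f"{B:.1e} T")          # ~7e8 T, i.e. of order 10^9 Tesla
```

As the footnote above cautions, this is a dimensional estimate of the critical field strength, not a rigorous QED calculation.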
would detect LCP waves if the electric vector of the approaching wave was rotating counter-clockwise, i.e. from the 12:00 position toward the observer's left shoulder. Appropriate linear superpositions of LCP and RCP waves produce vertical and horizontal linearly polarized (LP) light or, more generally, elliptically polarized light in which the tip of the electric vector traces out an ellipse in time. The polarization of electromagnetic waves is utilized in numerous ways (for example, by optical rotatory dispersion or optical circular dichroism or polarization-based imaging) to study the structure, composition, and concentration of materials. Every characteristic of light (wave vector, frequency or wavelength, and polarization) can be exploited in practical ways to provide information about physical systems.4
Without any desire to get caught up in a controversy of semantics, I will nevertheless express an opinion, based on numerous experimental and theoretical investigations of quantum systems5 over many years, that notwithstanding the oft-repeated expression "wave-particle duality", the building blocks of nature are particles, not waves. An individual entity with invariant mass, charge, and spin (or helicity) is a particle. A photon is a quantum particle.
Quantum mechanics is an irreducibly statistical theory. In quantum mechanics the
aspect of waves enters only statistically either in the aggregate behavior of similarly
prepared particles or in the repetitive observations of a single particle. The quantum
wave function is not the physical entity itself, but only a mathematical tool for
calculating probabilities and expectation values. It is a mathematical expression of
the information available about a quantum system.
Statistically, there is a relation between the uncertainty in location and the uncertainty in momentum of a quantum particle. A perfectly monochromatic photon (which, in reality, does not exist) would be completely delocalized, although its linear momentum would be measurable with no uncertainty. Photons produced by real sources are describable mathematically by a wave packet, i.e. a linear superposition of monochromatic components; they have a finite spectral width and a calculable spatial uncertainty as characterized by the uncertainty relation derived in Chapter 3 [Eqs. (3.9.11) and (3.9.12)] for a general function and Fourier transform pair.
Although quantum theory has been subjected to many tests since its inception at
the turn of the twentieth century, relatively few have been expressly designed to probe
the underlying randomness of nature. The photon is an ideal particle with which to
conduct such tests. The previous chapter was devoted to the randomness of nuclear
disintegration and a comprehensive set of tests on radioactive sodium in response to
a longstanding nuclear controversy. In this chapter I focus on the randomness of
4 I discuss my experiments covering all facets of physical optics in M. P. Silverman, Waves and Grains: Reflections on Light and Learning (Princeton University Press, New York, 1998).
5 I discuss my experimental and theoretical investigations of quantum systems in M. P. Silverman, Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement, and Interference (Springer, Heidelberg, 2008).
photon creation and interaction at a beam splitter. The discussion will bring to light
(pun intended) important elements of statistical physics such as the concept of a
compound distribution, the relation between Poisson statistics (a hallmark of classical physics) and the distributions of quantum particles, the correlations intrinsic
to the statistics of bosons and fermions, and the statistics of recurrent events, which
play an important part in an experimental test of photon statistics to be described in
due course.
$$Z(T, V, N) = \sum e^{-\beta E}, \qquad \beta \equiv 1/k_B T \qquad (4.2.1)$$

$$p(E) = \frac{e^{-\beta E}}{Z(T, V, N)} \qquad (4.2.2)$$

$$\sum_{k=0,1,2,\ldots} n_k = N \qquad (4.2.3)$$

$$\sum_{k=0,1,2,\ldots} n_k\, \varepsilon_k = E \qquad (4.2.4)$$

$$p\{n_k\} = \frac{g\{n_k\}\; e^{-\beta \sum_{k=0,1,2,\ldots} n_k \varepsilon_k}}{Z} \qquad (4.2.5)$$
where g{n_k} is the degeneracy or statistical weight of the set of occupation numbers. The numerical values that an occupation number n_k can take and the mathematical form of the degeneracy factor g{n_k} depend on the assigned statistics. Here the term statistics as used by a physicist is different from that of a statistician.
In the lexicon of a statistician, a statistic is an observable random variable
(or function of such variables) that does not contain any unknown parameters;
as a discipline, statistics, according to the American Statistical Association,
is the science of collection, analysis, and presentation of data. To a physicist,
however, the term statistics may also refer to the solution of a particular
kind of occupancy problem: in how many ways can one distribute indistinguishable balls over distinguishable cells? The answer depends on the imposed
constraints.
- If the distribution is inclusive, i.e. no constraints are imposed apart from (4.2.3) and (4.2.4), then the solution leads to Bose-Einstein statistics.
- If the distribution is exclusive, i.e. so constrained that no more than one ball can be placed in any given cell, then the solution leads to Fermi-Dirac statistics.
These two solutions (which, so far as experiment has revealed and relativistic quantum theory has deduced, are the only two solutions that nature permits6) may be summarized by the relations in Table 4.1. Disregard for the moment the third column.

6 Proposals have been made from time to time of the existence of entities that follow other kinds of statistics. I have myself investigated hypothetical quantum systems comprising particles in bound association with magnetic flux tubes. These composite quasi-particles manifest statistical behavior that interpolates between that of fermions and bosons, depending on the value of the magnetic flux. I know of no fundamental particles, however, that behave this way.
Table 4.1

  Bose-Einstein statistics   Fermi-Dirac statistics                          Maxwell-Boltzmann statistics
  n_k = 0, 1, 2, ..., N      n_k = 0, 1                                      n_k = 0, 1, 2, ..., N
  g_BE{n_k} = 1              g_FD{n_k} = 1 (each n_k = 0 or 1),              g_MB{n_k} = Π_k 1/n_k!
                             g_FD{n_k} = 0 (otherwise)
The fundamental lesson of the first and second columns of Table 4.1 is that the
statistical weight of each allowed set of occupancy numbers is 1; if the particles
are identical, then rearranging them among the same quantum states does not
provide any new information since one cannot tell which particle is in which state
by any distinguishing feature.
The partition function in (4.2.5)

$$Z = \sum_{\{n_k\}} g\{n_k\}\; e^{-\beta \sum_{k=0,1,2,\ldots} n_k \varepsilon_k} \qquad (4.2.6)$$

is the sum over all allowed partitions of N subject to the constraints (4.2.3) and (4.2.4). For purposes of illustration, consider a system of three spinless particles to be distributed over three non-degenerate states. The resulting configurations can be represented by a triad of numbers (n1 n2 n3) giving the occupancy of states of energy (ε1 ε2 ε3). Consider each of the three types of occupancy shown in Table 4.1.
4.2.1 Bose-Einstein (BE) statistics

The case of three bosons is tallied as shown. Configurations 1-3 include a 3-particle state. Configurations 4-9 include a 2-particle state. Configuration 10 includes only 1-particle states.

Bose-Einstein
   #    (n1 n2 n3)
   1    (3 0 0)
   2    (0 3 0)
   3    (0 0 3)
   4    (0 1 2)
   5    (0 2 1)
   6    (1 0 2)
   7    (2 0 1)
   8    (1 2 0)
   9    (2 1 0)
  10    (1 1 1)

The resulting partition function (with β temporarily set equal to 1 for simplicity) takes the form

$$Z_{BE} = e^{-3\varepsilon_1} + e^{-3\varepsilon_2} + e^{-3\varepsilon_3} + e^{-(\varepsilon_1 + 2\varepsilon_2)} + e^{-(\varepsilon_1 + 2\varepsilon_3)} + e^{-(\varepsilon_2 + 2\varepsilon_1)} + e^{-(\varepsilon_2 + 2\varepsilon_3)} + e^{-(\varepsilon_3 + 2\varepsilon_1)} + e^{-(\varepsilon_3 + 2\varepsilon_2)} + e^{-(\varepsilon_1 + \varepsilon_2 + \varepsilon_3)}. \qquad (4.2.7)$$
Each term in the sum (4.2.7) has an exponent equal to the total system energy for the associated particle configuration. Note, however, that the partition function can also be factored into the product of three sums in which each sum is over the occupancies of a single energy state

$$Z_{BE} = \left(1 + e^{-\varepsilon_1} + e^{-2\varepsilon_1} + e^{-3\varepsilon_1}\right)\left(1 + e^{-\varepsilon_2} + e^{-2\varepsilon_2} + e^{-3\varepsilon_2}\right)\left(1 + e^{-\varepsilon_3} + e^{-2\varepsilon_3} + e^{-3\varepsilon_3}\right)\delta(N - 3). \qquad (4.2.8)$$

The delta-function in (4.2.8) is there to remind us to exclude from the expanded product any terms with total number of particles $N \neq 3$. Thus, the term $1 = e^{-0\varepsilon_1} e^{-0\varepsilon_2} e^{-0\varepsilon_3}$, which represents no particles in any of the three states, would be excluded; likewise, we would exclude the term $e^{-2\varepsilon_1} e^{-2\varepsilon_2} e^{-2\varepsilon_3}$, which represents a 6-particle configuration in a system defined to have only three particles.
Generalizing to arbitrary numbers of particles and states, one can write the partition function (4.2.6) for identical bosons as a product of single-mode partition functions

$$Z_{BE} = \prod_{k=1,2,\ldots}\left(\sum_{n_k = 0,1,2,\ldots} e^{-n_k \varepsilon_k}\right)\delta\!\left(\sum_k n_k - N\right) \qquad (4.2.9)$$

where each single-particle energy state defines a mode. The product index k enumerates modes; the sum index n_k specifies the number of particles in mode k.
The total number of ways of distributing N particles over r states with no restriction on the number of particles per state is given by the multiplicity

$$\Omega_{BE}(N, r) = \binom{N + r - 1}{N}. \qquad (4.2.10)$$

To see this, draw r + 1 vertical lines to represent the boundaries of r cells and put in N dots to represent the N particles. Keeping the first and last lines fixed, one obtains all possible distributions of dots in cells by making all possible rearrangements of the N dots and r − 1 interior lines. The number of ways to partition N + r − 1 items into two groups of size N and r − 1 is given by the combinatorial expression (4.2.10). In the given example of three bosons distributed over three states, one finds $\Omega_{BE}(3, 3) = \binom{5}{3} = 10$ as expected.
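The tally of boson configurations and the multiplicity (4.2.10) can be confirmed by brute-force enumeration; the helper function below is illustrative only:

```python
from itertools import product
from math import comb

def be_configs(N, r):
    """All occupation r-tuples (n1..nr), unrestricted occupancy, summing to N."""
    return [n for n in product(range(N + 1), repeat=r) if sum(n) == N]

configs = be_configs(3, 3)
print(len(configs))        # 10, the ten configurations tallied above
print(comb(3 + 3 - 1, 3))  # multiplicity (4.2.10): C(N+r-1, N) = 10
```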
4.2.2 Fermi-Dirac (FD) statistics

Consider next the distribution of N identical particles over r states with the restriction that no more than one particle can occupy a state. For the illustrative case of three particles in three states, there is only one way to distribute the particles: the Fermi-Dirac configuration (n1 n2 n3) = (1 1 1). The corresponding partition function (again with β temporarily set to 1) has only one term

$$Z_{FD} = e^{-(\varepsilon_1 + \varepsilon_2 + \varepsilon_3)}. \qquad (4.2.11)$$

In general, the total number of ways of distributing N particles over r states with exclusion is given by

$$\Omega_{FD}(N, r) = \binom{r}{N}, \qquad (4.2.12)$$

which obviously requires $r \geq N$. Expression (4.2.12) is almost self-evident. There are r ways to choose a state for the first particle, r − 1 choices for the second particle, and on down the line until there remain r − N + 1 choices for the Nth particle. However, since the particles are identical, the N! ways of assigning the N particles to a given set of states provide no new information. Thus the total number of arrangements is

$$\Omega_{FD}(N, r) = \frac{r(r-1)(r-2)\cdots(r-N+1)}{N!} = \frac{r!}{N!\,(r-N)!}. \qquad (4.2.13)$$
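The same enumeration confirms the Fermi-Dirac count (4.2.12); again the helper function is illustrative only:

```python
from itertools import product
from math import comb

def fd_configs(N, r):
    """Occupation r-tuples with at most one particle per state, summing to N."""
    return [n for n in product((0, 1), repeat=r) if sum(n) == N]

# Exclusion leaves a single configuration for N = r = 3 ...
print(len(fd_configs(3, 3)))  # 1
# ... and C(r, N) configurations in general, e.g. 2 particles in 4 states:
print(len(fd_configs(2, 4)), comb(4, 2))  # 6 6
```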
As with the bosons, the partition function for identical fermions factors into a product of single-mode partition functions, each mode now holding at most one particle

$$Z_{FD} = \left(1 + e^{-\varepsilon_1}\right)\left(1 + e^{-\varepsilon_2}\right)\left(1 + e^{-\varepsilon_3}\right)\cdots\;\delta\!\left(\sum_{k=0,1,2,\ldots} n_k - N\right). \qquad (4.2.14)$$
The constraint on counting (in both the BE and FD cases) posed by a fixed
number of particles is lifted if one works with the grand canonical, rather than
canonical, ensemble although at the expense of introducing an additional system
parameter, the chemical potential, which depends in general on extensive and
intensive variables of the system and cannot be expressed in a closed form for
numerical evaluation.
If the N particles were distinguishable, rearranging them among the states would produce distinct configurations, counted by the multinomial degeneracy factor

$$g'_{MB}\{n_k\} = \frac{N!}{n_1!\, n_2! \cdots} \qquad (4.2.15)$$

(for example, 3!/(1!\,1!\,1!) = 6 for the configuration (1 1 1)), which leads to the partition function

$$Z'_{MB} = \sum_{\{n_k\}} \frac{N!}{n_1!\, n_2! \cdots}\; e^{-\sum_k n_k \varepsilon_k}\; \delta\!\left(\sum_{k=0,1,2,\ldots} n_k - N\right) = \Lambda^N, \qquad (4.2.16)$$

where

$$\Lambda = \sum_{k=1,2,\ldots} e^{-\varepsilon_k} \qquad (4.2.17)$$

is the single-particle partition function. The symbol $Z'_{MB}$ for the partition function is adorned with a prime because, although derived rigorously, the result (4.2.16) turns
out to be physically inadmissible, a fact well known before the advent of quantum mechanics. The problem, referred to historically as the Gibbs paradox, is that application of $Z'_{MB}$ to classical statistical thermodynamics led to an entropy function that did not scale properly with size.
Volume, particle number, entropy, and all the various thermodynamic energies
(internal energy U, Helmholtz free energy F, etc.) are extensive quantities, i.e.
additive over non-interacting subsystems. Mathematically, an extensive function,
e.g. entropy, must satisfy a relation of the form

$$S(nU, nV, nN) = n\,S(U, V, N) \qquad (4.2.18)$$

when the energy U, volume V, and particle number N are scaled by a factor n. A function that obeys relation (4.2.18) is said to be a homogeneous function of first degree. If the scale factor on the right side were $n^d$, then the function would be homogeneous of degree d. Thus, an intensive function, one like temperature T or pressure P, that does not scale with size, satisfies

$$P(nU, nV, nN) = n^0\, P(U, V, N) = P(U, V, N). \qquad (4.2.19)$$
Under the transformation

Σ_k → (1/h^3) ∫ d^3x d^3p = (4πV/h^3) ∫_0^∞ p^2 dp    (4.2.20)
the single-particle partition function becomes proportional to volume V. Thermodynamic functions of state are calculated from the logarithm (and its derivatives) of
the partition function. However, the relation
ln Z'_MB = N ln Λ = N(ln V + constants)
is immediately seen to fail the test of (4.2.18). For example, if the volume V is
doubled, the left side, to which the Helmholtz free energy is proportional, is not
doubled.
The Gibbs correction rectifying this problem was to insert by hand a factor 1/N!
into the degeneracy function (4.2.15) to give the empirical (rather than theoretically
deducible) degeneracy function shown in Table 4.1. Applied to (4.2.6), the Gibbs
correction leads to the MB partition function
Z_MB = Σ_{n_1, n_2, ...} [(e^{-βε_1})^{n_1}/n_1!] [(e^{-βε_2})^{n_2}/n_2!] ⋯    (Σ_k n_k = N)
     = Λ^N/N!,    (4.2.21)
where now the transformation from a sum over discrete states to an integral over
phase space leads to

ln Z_MB = N ln Λ - ln N! = N ln(V/N) + N(1 + constants)    (4.2.22)

upon substitution of Stirling's approximation for N!. If the extensive variables (V, N)
of the system are doubled, then ln Z_MB is also doubled, as are all the thermodynamic functions that derive from ln Z_MB.
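The failure and repair of extensivity can be checked numerically: with Λ = cV for some constant c, ln Z'_MB = N ln Λ does not double when (V, N) are doubled, while the Gibbs-corrected ln Z_MB = N ln Λ - ln N! does. A minimal sketch (the constant c = 5.0 is an arbitrary stand-in for the momentum integral):

```python
from math import log, lgamma

c = 5.0  # stand-in constant: Lambda = c*V (arbitrary, illustration only)

def ln_z_uncorrected(N, V):
    # ln Z'_MB = N ln Lambda
    return N * log(c * V)

def ln_z_gibbs(N, V):
    # ln Z_MB = N ln Lambda - ln N!   (ln N! computed exactly via lgamma)
    return N * log(c * V) - lgamma(N + 1)

N, V = 10**6, 1.0
r_gibbs = ln_z_gibbs(2*N, 2*V) / ln_z_gibbs(N, V)   # extensive: ratio -> 2
r_plain = ln_z_uncorrected(2*N, 2*V) / ln_z_uncorrected(N, V)
assert abs(r_gibbs - 2) < 1e-3
assert abs(r_plain - 2) > 0.01
```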
The purpose of the Gibbs correction is to account for the indistinguishability of
the particles by dividing out the number of permutations of particle distribution over
states that do not lead to new information. As we have just seen, however, this
correction works only for configurations with one particle per state [such as (1, 1, 1)],
but would incorrectly adjust for configurations with more than one particle per state
[such as (3, 0, 0)]. Thus, MB statistics is an approximation to exact quantum statistics
valid for conditions leading to a low mean state occupancy.
It is worth noting at this point that the transformation (4.2.20) defines the statistical density of states ρ(p)

ρ(p)dp = (4πgV/h^3) p^2 dp,    (4.2.23)
which is valid if the volume V is sufficiently large that particle energy levels are spaced
closely enough to be treated as a continuous variable. The relation (4.2.23) includes
a degeneracy factor g to account for additional degrees of freedom due to spin (g = 2s + 1) or helicity (g = 2). In the case of ultra-relativistic particles such as the photon,
substitution of ε = pc from (4.1.2) yields the density as a function of energy

ρ(ε)dε = [4πgV/(hc)^3] ε^2 dε,    (4.2.24)
zero chemical potential; matter absorbs and emits it in arbitrary numbers of quanta
with a mean state occupancy determined only by the equilibrium temperature. The
questions of (a) whether the chemical potential of photons is always zero, and
(b) whether the chemical potential of massless fermions would likewise be zero, are
intricate ones, which I consider further in an appendix.
Evaluating the BE and MB partition functions [(4.2.9) and (4.2.21)] for photons
leads to the following expressions
Z_BE = Π_{k=1,2,...} Σ_{n_k=0,1,2,...} e^{-βε_k n_k} = Π_k (1 - e^{-βε_k})^{-1}    (4.3.1)
Z_MB = Π_k Σ_{n_k=0}^∞ (e^{-βε_k})^{n_k}/n_k! = Π_k exp(e^{-βε_k})    (4.3.2)
in which the first entails summing a geometric series and the second an exponential
series. Although the chemical potential of fermions in thermal equilibrium is ordinarily not zero, there are circumstances, discussed in the appendix, where it does
vanish. For the sake of comparison, therefore, we evaluate (4.2.14) in the same way
to arrive at the expression

Z_FD = Π_k (1 + e^{-βε_k}).    (4.3.3)
To facilitate discussion of split-beam experiments in the next section, focus attention on a single mode of energy ε and occupation number n.⁷ The three partition
functions and corresponding probability functions then reduce to the following
expressions
z_BE = (1 - e^{-βε})^{-1}    p_n = e^{-nβε}(1 - e^{-βε})    n = 0, 1, 2, ...    (4.3.4)

z_MB = exp(e^{-βε})    p_n = (e^{-nβε}/n!) exp(-e^{-βε})    n = 0, 1, 2, ...    (4.3.5)

z_FD = 1 + e^{-βε}    p_n = { (1 + e^{-βε})^{-1}   (n = 0);   (e^{βε} + 1)^{-1}   (n = 1) }    (4.3.6)
where the large number of other modes may be considered part of the environment
with which the mode of interest is in equilibrium.
The mean occupation number ⟨n⟩ of the mode, as well as fluctuations about the
mean, can be calculated in at least three different ways.
⁷ To keep notation simple and consistent with standard usage in physics, I will not make a distinction, as in previous
chapters, between a random variable for occupation number and its realizations n.
Differentiation of the partition function. From (4.2.6) one obtains the general
relations

⟨n⟩ = k_B T (∂ ln z/∂μ)_{T,V}    (4.3.7)

σ_n^2 = ⟨n^2⟩ - ⟨n⟩^2 = (k_B T)^2 (∂^2 ln z/∂μ^2)_{T,V}.    (4.3.8)
Direct evaluation of expectation values from the probability functions

⟨n⟩ = Σ_n n p_n = Σ_n g(n) n e^{-nβε}/z
⟨n^2⟩ = Σ_n n^2 p_n = Σ_n g(n) n^2 e^{-nβε}/z.    (4.3.9)
Differentiation of the moment-generating function.

⟨n⟩ = dG(t)/dt|_{t=0} = d ln G(t)/dt|_{t=0}    (4.3.10)

σ_n^2 = d^2 ln G(t)/dt^2|_{t=0},    (4.3.11)

where

G(t) = ⟨e^{nt}⟩ = Σ_n p_n e^{nt}    (4.3.12)
     = Σ_n g(n)(e^{-βε} e^t)^n / z.    (4.3.13)
Applying any of the three methods to the three types of physical statistics leads to the
expressions in Table 4.2 for the single-mode mean occupation numbers, variances,
and mgfs.
Figure 4.1 shows plots of ⟨n⟩ as a function of mode energy ε for the three physical
statistics. The distinction between classical and quantum phenomena lies in the size
of the ratio ε/k_B T. For thermal energy low in comparison with the quantum
of energy, βε >> 1, the means and variances of BE and FD statistics approach
the MB values
Table 4.2

Stats   Partition function z(ε, β)   Moment-generating function G(t)   Mean ⟨n⟩          Variance σ_n^2
BE      (1 - e^{-βε})^{-1}           (e^{βε} - 1)/(e^{βε} - e^t)       1/(e^{βε} - 1)    e^{βε}/(e^{βε} - 1)^2
MB      exp(e^{-βε})                 exp[e^{-βε}(e^t - 1)]             e^{-βε}           e^{-βε}
FD      1 + e^{-βε}                  (1 + e^{-βε}e^t)/(1 + e^{-βε})    1/(e^{βε} + 1)    e^{βε}/(e^{βε} + 1)^2
Fig. 4.1 Mean occupation number ⟨n⟩ as a function of energy ε/k_B T for a canonical ensemble of
particles obeying Bose-Einstein (BE), Maxwell-Boltzmann (MB), and Fermi-Dirac (FD)
statistics. At ε high compared with thermal energy k_B T, ⟨n⟩_BE and ⟨n⟩_FD approach the ⟨n⟩_MB
characteristic of classical particles. At zero ε, ⟨n⟩_BE becomes singular, indicative of quantum
condensation, and ⟨n⟩_FD approaches 1/2.
⟨n⟩ ≈ σ_n^2 → e^{-βε}    (βε >> 1)    (4.3.14)
corresponding to the classical limit. The asymptotic equality of mean and variance in
(4.3.14) suggests a connection between the statistics of classical particles and the
Poisson distribution, a point that will emerge directly in the next step of the discussion. The form of the MB mgf in Table 4.2 also reveals this connection; recall that the
mgf of a Poisson variate X is g_X(t) = exp[μ(e^t - 1)], in which μ is the mean ⟨X⟩ (not to
be confused with the chemical potential).
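The approach to the classical limit (4.3.14) is easy to confirm numerically from the closed-form means of Table 4.2; a minimal sketch:

```python
import math

def means(beta_eps):
    """Closed-form single-mode means: BE, MB, FD (Table 4.2)."""
    e = math.exp(beta_eps)
    return 1 / (e - 1), math.exp(-beta_eps), 1 / (e + 1)

# Classical regime, beta*eps >> 1: the three means coincide
n_be, n_mb, n_fd = means(10.0)
assert abs(n_be / n_mb - 1) < 1e-4
assert abs(n_fd / n_mb - 1) < 1e-4

# Quantum regime, beta*eps << 1: BE is strongly enhanced over MB
n_be, n_mb, n_fd = means(0.1)
assert n_be > 2 * n_mb
```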
From the relations of Table 4.2, we can express the Boltzmann factor e^{-βε}, and
therefore the three probability functions (4.3.4)-(4.3.6), in terms of the mean occupation number ⟨n⟩, which is useful because the latter is a measurable experimental
quantity:

BE:  e^{-βε} = ⟨n⟩/(⟨n⟩ + 1)    p_BE(n) = ⟨n⟩^n/(⟨n⟩ + 1)^{n+1}    (4.3.15)

MB:  e^{-βε} = ⟨n⟩              p_MB(n) = ⟨n⟩^n e^{-⟨n⟩}/n!         (4.3.16)

FD:  e^{-βε} = ⟨n⟩/(1 - ⟨n⟩)    p_FD(n) = { 1 - ⟨n⟩  (n = 0);  ⟨n⟩  (n = 1) }    (4.3.17)
P_n = [⟨n⟩/(⟨n⟩ + 1)]^n [1/(⟨n⟩ + 1)],
which is precisely the BE probability function (4.3.15) for emission of n photons. The
preceding derivation is merely an interpretation of the form of the BE distribution,
not an explanation of its origin, which at the most fundamental level derives from the
connection between spin and statistics.
Figure 4.2 shows how the MB and BE distribution functions of a monomode light
source with fixed mean ⟨n⟩ vary as a function of the actual number of emitted
photons n. From an experimental standpoint, ⟨n⟩ corresponds to the mean number
of particles received in a counting interval (bin), previously represented by the
symbol μ when there was no confusion with chemical potential. For a weak (i.e.
classical) light source with ⟨n⟩ < 1, the MB distribution looks very much like the BE
Fig. 4.2 Emission probability as a function of number n of emitted photons for MB (Poisson) and BE
light sources with mean count ⟨n⟩ equal to (a) 0.1, (b) 1, (c) 2, (d) 4.
Fig. 4.3 Comparison of BE (gray) and MB (black) emission probability as a function of
photon number n for a weak source (solid) of mean count 0.5 and a strong source (dashed)
of mean count 4.0.
distribution, a comparison that shows up better in Figure 4.3 where the two distributions are presented together for ⟨n⟩ = 0.5 and 4.0. For mean counts ⟨n⟩ > 1, the
BE distribution decreases monotonically with n in a fat tail, whereas the MB distribution takes a bell-shaped form centered
at n ≈ ⟨n⟩. One sees, then, that in the classical domain of low ⟨n⟩, it is the
monotonically decreasing portion of the MB probability curve that correctly
approximates nature's BE distribution.
Fig. 4.4 Maxwell-Boltzmann (Poisson) and Bose-Einstein emission probability plotted against mean count ⟨n⟩ for fixed photon numbers n; curves (a)-(e).
Another perspective is given in Figure 4.4, which shows plots of the MB and BE
distributions as a function of ⟨n⟩ for fixed numbers n of emitted photons. In an
experiment, one is often interested in certain kinds of events, e.g. 1-photon or
2-photon emissions, and needs to know how to adjust the source intensity or some
other experimental parameter to minimize contamination by unwanted events. Here
we see from the approximate matching of the pre-peak portion of plots (a) that
the MB approximation to the exact BE distribution is really valid only for single-photon emission from a source of low ⟨n⟩.
G_BE(t) = ⟨e^{nt}⟩ = [1/(⟨n⟩ + 1)] Σ_{n=0}^∞ [⟨n⟩ e^t/(⟨n⟩ + 1)]^n = [1 - ⟨n⟩(e^t - 1)]^{-1}.    (4.4.1)
Successive differentiation of G_BE(t) in the usual way leads to the moments of the
occupation number, in particular the first through third, from which the variance
and skewness follow

⟨n^2⟩ = 2⟨n⟩^2 + ⟨n⟩
σ_N^2 = ⟨n⟩(1 + ⟨n⟩)
Sk = (2⟨n⟩ + 1)/√(⟨n⟩(1 + ⟨n⟩)).    (4.4.2)
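The moments (4.4.2) can be verified by brute-force summation over the BE probability function (4.3.15); a minimal sketch with an arbitrary ⟨n⟩ = 1.5:

```python
nbar = 1.5
p = [nbar**n / (nbar + 1)**(n + 1) for n in range(600)]

m1 = sum(n * q for n, q in enumerate(p))
m2 = sum(n * n * q for n, q in enumerate(p))
m3 = sum(n**3 * q for n, q in enumerate(p))

var = m2 - m1**2
skew = (m3 - 3 * m1 * m2 + 2 * m1**3) / var**1.5   # central 3rd moment / sigma^3

assert abs(m2 - (2 * nbar**2 + nbar)) < 1e-9       # <n^2> = 2<n>^2 + <n>
assert abs(var - nbar * (1 + nbar)) < 1e-9         # sigma^2 = <n>(1+<n>)
assert abs(skew - (2 * nbar + 1) / (nbar * (1 + nbar))**0.5) < 1e-9   # Sk
```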
The limiting cases of the variance are especially interesting, as they reveal in stark
contrast the wave and particle properties of light from a statistical standpoint.
For a classical light source with ⟨n⟩ << 1, the variance in occupation number is
proportional to ⟨n⟩, and thus

σ_n/⟨n⟩ ≈ 1/√⟨n⟩    (⟨n⟩ << 1).    (4.4.3)
This is the kind of fluctuation in a light signal that results in shot noise in a
photodetector. Increasing the number of photons (or strength of the optical field)
will smooth out these statistical fluctuations. An example would be the emission of
red light of wavelength λ = 633 nm and frequency ν = c/λ ≈ 4.74 × 10^14 Hz from an
incandescent filament at temperature T = 1000 K. The ratio of a quantum of light
energy to the mean thermal energy per mode is⁸

hν/k_B T = 1.96 eV/0.086 eV ≈ 22.77,

⁸ An electron volt (eV) is the energy acquired by an electron (or positron) moving through a potential difference of 1 volt.
The conversion to MKS units is 1 eV = 1.6 × 10^{-19} joules.
σ_n/⟨n⟩ → 1    (⟨n⟩ >> 1).    (4.4.4)
f d:
4:4:6
Since the energy or intensity of the optical field is proportional to the square of the
amplitude in the classical picture and to the number of photons in the quantum
picture, we can write
⟨n⟩ ∝ ⟨|E|^2⟩.    (4.4.7)
In the explanation that follows, what matters most is the independence of the
phases of the wavelets, and so for simplicity I will have all sources emit wavelets of
unit amplitude. If φ_j is the phase of the jth wavelet, then the net complex amplitude
E of the optical field is given by

E = Σ_{j=1}^N e^{iφ_j},    (4.4.8)
⟨I⟩ = ⟨|E|^2⟩ = ⟨ Σ_{j=1}^N 1 + 2 Σ_{j>k}^N cos(φ_j - φ_k) ⟩ = N.    (4.4.9)
We encountered this characteristic previously in the investigation of nuclear decay with variates (like the power spectral
amplitude) that follow an exponential distribution.
⟨I^2⟩ = ⟨ [N + 2 Σ_{j>k}^N cos(φ_j - φ_k)]^2 ⟩
      = N^2 + 4 ⟨ Σ_{j>k}^N cos^2(φ_j - φ_k) ⟩ + 4 ⟨ Σ_{j>k>l>m}^N cos(φ_j - φ_k) cos(φ_l - φ_m) ⟩
      = N^2 + (1/2)[4 N(N - 1)/2] = N^2 + N^2 - N = 2N^2 - N → 2N^2    (N >> 1)    (4.4.10)
which takes the final value shown above in the limit of a large number of sources.
From relations (4.4.9) and (4.4.10) the fluctuations in intensity of a monochromatic
classical light wave can be expressed as

σ_I^2/⟨I⟩^2 = (⟨|E|^4⟩ - ⟨|E|^2⟩^2)/⟨|E|^2⟩^2 = (2N^2 - N^2)/N^2 = 1,    (4.4.11)
which, in contrast to the result for thermal light, shows that the fluctuations are not
smoothed out as the number of emitters, and therefore the intensity of the light wave,
increases.
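The chain (4.4.8)-(4.4.11) is easy to test by Monte Carlo: superpose N unit-amplitude wavelets with independent random phases and sample the intensity. A minimal sketch (N = 50 sources and 20000 trials are arbitrary choices):

```python
import math, random

random.seed(1)
Nsrc, trials = 50, 20000

samples = []
for _ in range(trials):
    phases = [random.uniform(0, 2 * math.pi) for _ in range(Nsrc)]
    re = sum(math.cos(ph) for ph in phases)
    im = sum(math.sin(ph) for ph in phases)
    samples.append(re * re + im * im)          # I = |E|^2

mean_I = sum(samples) / trials
mean_I2 = sum(I * I for I in samples) / trials
rel_var = (mean_I2 - mean_I**2) / mean_I**2

assert abs(mean_I / Nsrc - 1) < 0.05           # <I> ~ N           (4.4.9)
assert abs(mean_I2 / (2 * Nsrc**2) - 1) < 0.1  # <I^2> ~ 2N^2      (4.4.10)
assert abs(rel_var - 1) < 0.1                  # sigma_I^2/<I>^2 ~ 1  (4.4.11)
```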
The preceding heuristic interpretation of (4.4.4) lies at the heart of a classical
explanation of an experimental procedure devised initially to measure stellar diameters and termed intensity interferometry by its developers, R. Hanbury Brown and
R.Q. Twiss.10 The procedure elicited considerable controversy at its inception
because many physicists could not accept that a nonvanishing time-averaged interference could occur between the intensities of independent light sources. Indeed, the
name of the measurement technique was an unfortunate choice because that was not,
in fact, what was being observed. Rather, the nonvanishing signal was related to the
second term of the middle line of relation (4.4.10), in which products of components with identical phase differences were averaged. I have subsequently introduced
variations of this procedure into quantum physics to study the nature of entangled
quantum states and the statistics of fermions and bosons in novel ways quite distinct
from those of traditional matter-wave interferometry.11
To conclude this section, it is instructive to examine several aspects of the fluctuations of the full multi-mode field of a thermal light source, which raise some subtle
¹⁰ R. Hanbury Brown and R. Q. Twiss, "A new type of interferometer for use in radio-astronomy," Philosophical Magazine 45 (1954) 663.
¹¹ (a) M. P. Silverman, More Than One Mystery: Explorations in Quantum Interference (Springer, New York, 1995), and (b) M. P. Silverman, Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement, and Interference (Springer, Heidelberg, 2008).
issues, apparent paradoxes, and pitfalls to avoid. Although the various aspects are
related, for clarity of emphasis I will take them up as separate issues.
ISSUE 1
From relation (4.3.1), which expresses the partition function Z_BE of a system of ultra-relativistic BE particles (chemical potential μ = 0) as a product of the partition
functions {z_k} of the individual modes, we can write

ln Z_BE = -Σ_k ln(1 - e^{-βε_k}) = Σ_k ln z_k.    (4.4.12)
Applied to the canonical ensemble of photons as a whole, the reasoning by which the
variance (4.3.8) in photon number of a single mode was derived leads to the general
expression for variance in total internal energy

σ_E^2 = ⟨E^2⟩ - ⟨E⟩^2 = (∂^2 ln Z/∂β^2)_V = -(∂⟨E⟩/∂β)_V,    (4.4.13)
a relation obtained previously in Chapter 1 (Eq. (1.23.10)) in the broader context of
the principle of maximum entropy. Substitution of (4.4.12) into (4.4.13) with replacement of the sum over states by integration over the density of states (4.2.24) (and
change of integration variable x = βε) yields the expression
σ_E^2 = -∂⟨E⟩/∂β = [8πV(k_B T)^5/(hc)^3] ∫_0^∞ x^4 e^x dx/(e^x - 1)^2 = 32π^5 V(k_B T)^5/15(hc)^3.    (4.4.14)

(The integral has the value 4π^4/15.)
Looking at the result (4.4.14) and comparing it with (4.4.2) or the comparable
quantity

σ_n^2 = ⟨n⟩(⟨n⟩ + 1) = e^{βε}/(e^{βε} - 1)^2    (4.4.15)
from Table 4.2, one might be inclined to ask: "Where in (4.4.14) does one now find
the distinction between shot noise and wave noise?" The answer is: nowhere.
The factor (e^{βε} - 1)^{-1} that appears in the first integral on the right side of (4.4.14)
is the mean occupation number ⟨n⟩ of the mode ε. Integration over all mode
energies effectively combines the fluctuations from states of low ⟨n⟩ and states of
high ⟨n⟩ to produce a net variance in internal energy proportional to (k_B T)^5.
Nevertheless, comparing the size of the fluctuations (4.4.14) to the mean internal
energy ⟨E⟩

⟨E⟩ = -(∂ ln Z/∂β)_V = [8πV(k_B T)^4/(hc)^3] ∫_0^∞ x^3 dx/(e^x - 1) = 8π^5 V(k_B T)^4/15(hc)^3    (4.4.16)

(the integral has the value π^4/15) yields
σ_E/⟨E⟩ = (15/2π^5)^{1/2} (hc/k_B T)^{3/2} V^{-1/2},    (4.4.17)
which does show that the fluctuations in energy diminish as the geometric size or the
temperature of the system increases.
There is, however, another way to interpret the fluctuations (4.4.14) in internal
energy. Although the occupation number of each mode of the field of thermal
radiation is unconstrained, the system nonetheless has a mean number of photons
obtained by summing (i.e. integrating) over all modes

⟨N⟩ = [8πV/(hc)^3] ∫_0^∞ ε^2 dε/(e^{βε} - 1) = 8πV (k_B T/hc)^3 ∫_0^∞ x^2 dx/(e^x - 1) = 16πζ(3)V (k_B T/hc)^3    (4.4.18)

(the integral has the value 2ζ(3)), where ζ(s) is the Riemann zeta function,¹² equivalent to the sum over modes

⟨N⟩ = Σ_{k=1}^∞ ⟨n_k⟩.    (4.4.19)
From relations (4.4.14), (4.4.16), and (4.4.18) we can then make the following
associations

⟨N⟩ ∝ (k_B T)^3 V
⟨E⟩ ∝ (k_B T)^4 V
σ_E^2 ∝ (k_B T)^5 V    (4.4.20)

and therefore

σ_E/⟨E⟩ ∝ 1/√⟨N⟩.    (4.4.21)
¹² The Riemann zeta function (over the complex field) has long been an object of fascination to mathematicians and is
intimately connected with one of the most fundamental unsolved problems of mathematics. See J. Derbyshire, Prime
Obsession (Penguin, New York, 2003), and the review M. P. Silverman, American Journal of Physics 73 (2005) 287-288.
σ_E^2 = k_B T^2 (∂⟨E⟩/∂T)_V = k_B T^2 C_V    (4.4.22)

σ_E^2 = ⟨E^2⟩ - ⟨E⟩^2.    (4.4.23)
Thus, one should get the same result as (4.4.14) by calculating separately the terms
⟨E^2⟩, ⟨E⟩^2 and taking the difference, rather than evaluating directly the second
derivative of the partition function. We have already calculated the mean energy
⟨E⟩, so there remains the task of calculating ⟨E^2⟩. There is, however, a pitfall to
avoid. If by analogy to
⟨E⟩ = ∫_0^∞ ε ρ(ε) dε/(e^{βε} - 1)

one writes

⟨E^2⟩ = ∫_0^∞ ε^2 ρ(ε) dε/(e^{βε} - 1)^2    [NO!]

or

⟨E^2⟩ = ∫_0^∞ ε^2 ρ(ε) dε/(e^{βε} - 1)    [NO!]

the result will turn out to be incorrect. Indeed, a cursory examination of the foregoing expressions for ⟨E^2⟩ and ⟨E⟩^2 shows immediately that the combination (4.4.23)
would lead to two terms with different powers of V and T. In no way could they be
combined to yield (4.4.14).
Return to the basic definition (4.2.4) of the internal energy E as a sum over discrete
modes, and consider the simplest case of just two modes
⟨E^2⟩ = ⟨(n_1 ε_1 + n_2 ε_2)^2⟩ = (1/z_1 z_2) Σ_{n_1, n_2} (n_1 ε_1 + n_2 ε_2)^2 e^{-βn_1 ε_1} e^{-βn_2 ε_2}    (4.4.24)

= (1/z_1) Σ_{n_1} (n_1 ε_1)^2 e^{-βn_1 ε_1} + (1/z_2) Σ_{n_2} (n_2 ε_2)^2 e^{-βn_2 ε_2}
  + 2 [(1/z_1) Σ_{n_1} n_1 ε_1 e^{-βn_1 ε_1}] [(1/z_2) Σ_{n_2} n_2 ε_2 e^{-βn_2 ε_2}]    (4.4.25)

= ⟨E_1^2⟩ + ⟨E_2^2⟩ + 2⟨E_1⟩⟨E_2⟩.    (4.4.26)

Generalized to any number of modes, and combined with

⟨E⟩^2 = (Σ_k ⟨E_k⟩)^2 = Σ_k ⟨E_k⟩^2 + 2 Σ_{k>l} ⟨E_k⟩⟨E_l⟩,

this leads to

σ_E^2 = Σ_k σ_{E_k}^2,    (4.4.27)
where it is seen that cross terms in the mean of the square and the square of the
mean drop out, leading to a total variance that is the sum of all modal variances as
would be expected for a system of independent modes and unconstrained occupation
numbers.
From (4.4.15) for the variance in photon number of a single mode, it follows that

σ_E(ε)^2 = ε^2 σ_n^2 = ε^2 e^{βε}/(e^{βε} - 1)^2    (4.4.28)
is the variance in energy of that mode, and therefore the total variance in energy is

σ_E^2 = ∫_0^∞ σ_E(ε)^2 ρ(ε) dε = [4πgV(k_B T)^5/(hc)^3] ∫_0^∞ x^4 e^x dx/(e^x - 1)^2
      = [4πgV(k_B T)^5/(hc)^3] 4 ∫_0^∞ x^3 dx/(e^x - 1) = 16π^5 gV(k_B T)^5/15(hc)^3    (4.4.29)
      = 4 k_B T ⟨E⟩
in agreement with the result (4.4.14) obtained previously (with g = 2). The transition
from the first to the second line above is made by integration by parts. Comparing the
form of the variance in energy as expressed in the third line with (4.4.22) identifies the
heat capacity of thermal radiation as
C_V = 4⟨E⟩/T.    (4.4.30)
σ_E^2 = k_B T^2 C_V + (∂⟨E⟩/∂⟨N⟩)_{T,V}^2 σ_N^2,    (4.4.31)
which is a sum of two terms: (a) the variance calculated for a canonical ensemble of
fixed number of particles and (b) the variance due to fluctuation in number of
particles. If the fluctuation in particle number is very large, then by (4.4.31) it appears
that the fluctuation in system energy could be greatly enhanced. We will see shortly
whether or not this is the case for thermal photons.
First, quantifying the suggestions in the first paragraph shows that we really are
in a quandary. The chemical potential of the thermal photon gas is μ = 0. The
fluctuation in particle number, determined directly from the partition function, takes
the form

σ_N^2 = β^{-2} (∂^2 ln Z/∂μ^2)_{T,V} = β^{-1} (∂⟨N⟩/∂μ)_{T,V},    (4.4.32)
which evaluates immediately to 0 because ⟨N⟩, given by (4.4.18), is a function only of
T and V and not μ. However, by use of various thermodynamic relations, one can also
express σ_N^2 as

σ_N^2 = (⟨N⟩^2 k_B T/V) κ_T    (4.4.33)

in terms of the experimentally measurable isothermal compressibility

κ_T = -(1/V)(∂V/∂P)_{T,N}.    (4.4.34)
The pressure of the photon gas is

P = ⟨E⟩/3V = 4π^5 g(k_B T)^4/45(hc)^3.    (4.4.35)
4:4:35
Since P is not a function of V, the derivative (∂P/∂V)|_{T,N} vanishes; the compressibility
and therefore σ_N^2 are now infinite. A physical quantity cannot be both 0 and ∞.
A third approach to calculating the variance in photon number is to calculate all
the pertinent thermodynamic quantities for a grand canonical ensemble of non-zero
chemical potential μ and then take the limit μ → 0. The starting point is the grand
canonical partition function for a Bose gas
ln Z_BE = PV/k_B T = -Σ_k ln(1 - e^{-β(ε_k - μ)})
        = (4πgV/3)(k_B T/hc)^3 ∫_0^∞ x^3 dx/(e^{-βμ} e^x - 1)
        = 16πV (k_B T/hc)^3 Li_4(e^{βμ}),    (4.4.36)
where

Li_s(z) = Σ_{n=1}^∞ z^n/n^s = z + z^2/2^s + z^3/3^s + ⋯    (4.4.37)

defines the polylogarithm, which reduces to elementary functions only for certain
values of the order s. Before proceeding further, it is useful to understand the origin
of the steps leading to the final expression in (4.4.36).
Each term -ln(1 - e^{-β(ε_k - μ)}) in the sum in the first line resulted from the sum over
occupancy numbers of a particular mode k as in (4.3.1), only now the single-mode
Boltzmann factor e^{-β(ε_k - μ)} includes the chemical potential, which permits one to
sum all modes independently even if the total number of particles is conserved,
because the chemical potential enforces a constraint on the mean number of
particles in the system. The first line is an exact relation for all BE particles.
The second equality in the first line, relating ln Z_BE to PV/k_B T, is established by
comparing the statistical entropy S = -k_B Σ_{E,N} p(E, N) ln p(E, N), where
p(E, N) = e^{-β(E - μN)}/Z, with the thermodynamic entropy expressed in the First Law
U = TS - PV + μN. The thermodynamic extensive variables are equated to
expectation values of the corresponding statistical variates, as e.g. internal energy
U = ⟨E⟩ and particle number N = ⟨N⟩.
The expression in the second line was obtained by replacing the sum over states by
an integral over the density of states for ultra-relativistic bosons and
performing an integration by parts. The integrand was made dimensionless by
the change of variables x = βε.
The evaluation of the integral (for degeneracy factor g = 2) as an infinite sum in the
third line was performed by a method explained in an appendix, which leads to the
general form

∫_0^∞ x^k dx/(a^{-1} e^x - 1) = Γ(k + 1) Li_{k+1}(a) = k! Σ_{n=1}^∞ a^n/n^{k+1},    (4.4.38)

where the gamma function Γ(k + 1) = k! for integer k. For a = 1, the first few integrals
pertinent to this discussion become
∫_0^∞ x dx/(e^x - 1) = π^2/6 ≈ 1.6449
∫_0^∞ x^2 dx/(e^x - 1) = 2ζ(3) ≈ 2.4041
∫_0^∞ x^3 dx/(e^x - 1) = π^4/15 ≈ 6.4939    (4.4.39)
∫_0^∞ x^4 dx/(e^x - 1) = 24ζ(5) ≈ 24.8863.
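The tabulated values (4.4.39) follow from the series form (4.4.38) with a = 1, i.e. ∫_0^∞ x^k dx/(e^x - 1) = k! ζ(k + 1); summing the zeta series numerically reproduces them. A minimal sketch:

```python
import math

def bose_integral(k, terms=200000):
    """Integral_0^inf x^k/(e^x - 1) dx = k! * zeta(k+1), via the zeta series."""
    zeta = sum(1.0 / n**(k + 1) for n in range(1, terms))
    return math.factorial(k) * zeta

assert abs(bose_integral(1) - math.pi**2 / 6) < 1e-4    # 1.6449
assert abs(bose_integral(2) - 2.4041) < 1e-3            # 2*zeta(3)
assert abs(bose_integral(3) - math.pi**4 / 15) < 1e-4   # 6.4939
assert abs(bose_integral(4) - 24.8863) < 1e-3           # 24*zeta(5)
```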
All the physical quantities needed are derivable from ln Z_BE. Thus, before and after
setting g = 2 and taking the limit μ → 0, we have the following relations.
Pressure

P = (k_B T/V) ln Z_BE = [8πg(k_B T)^4/(hc)^3] Li_4(e^{βμ}) → (μ→0, g=2) 8π^5(k_B T)^4/45(hc)^3 = ⟨E⟩/3V.    (4.4.40)

Mean particle number

⟨N⟩ = k_B T (∂ ln Z_BE/∂μ)_{T,V} = 8πgV (k_B T/hc)^3 Li_3(e^{βμ}) → (μ→0, g=2) 16πζ(3)V (k_B T/hc)^3.    (4.4.41)
Variance in particle number

σ_N^2 = (k_B T)^2 (∂^2 ln Z_BE/∂μ^2)_{T,V} = 8πgV (k_B T/hc)^3 Li_2(e^{βμ}) → (μ→0, g=2) (8π^3/3) V (k_B T/hc)^3.    (4.4.42)
It then follows from (4.4.41) and (4.4.42) that the relative fluctuation in particle
number for a system of relativistic bosons in the limit of zero chemical potential is

σ_N^2/⟨N⟩^2 = (1/⟨N⟩) Li_2(e^{βμ})/Li_3(e^{βμ}) → (μ→0) [π^2/6ζ(3)] (1/⟨N⟩) ≈ 1.3684/⟨N⟩,    (4.4.43)
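The limiting ratio in (4.4.43) can be reproduced directly from the defining series (4.4.37); a minimal sketch:

```python
import math

def polylog(s, z, terms=2000):
    """Li_s(z) = sum_{n>=1} z^n/n^s   (truncated series, cf. (4.4.37))."""
    return sum(z**n / n**s for n in range(1, terms))

# mu -> 0 corresponds to fugacity e^{beta*mu} -> 1, where Li_s(1) = zeta(s)
ratio = polylog(2, 1.0) / polylog(3, 1.0)     # zeta(2)/zeta(3)
assert abs(ratio - 1.3684) < 1e-3
assert abs(polylog(2, 1.0) - math.pi**2 / 6) < 1e-3
```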
(a) From the differential of the Helmholtz free energy F(T, V, N)

dF = (∂F/∂T)_{V,N} dT + (∂F/∂V)_{T,N} dV + (∂F/∂N)_{T,V} dN = -S dT - P dV + μ dN    (4.4.44)

we can write the Maxwell relation (which signifies that second derivatives can be
taken in either order)

(∂μ/∂V)_{T,N} = -(∂P/∂N)_{T,V}.    (4.4.45)

We will use this relation shortly.
(b) Now consider the defining property of an intensive function, i.e. a homogeneous
function of degree 0, expressed in (4.2.19). Suppose f(x, y) is an intensive
function of extensive variables x and y. Then

0 = d f(nx, ny)/dn|_{n=1} = [∂f/∂(nx)] [d(nx)/dn]|_{n=1} + [∂f/∂(ny)] [d(ny)/dn]|_{n=1} = x ∂f/∂x + y ∂f/∂y.    (4.4.46)
With f equated first to the chemical potential and then to the pressure P, and the
replacements x V, y N, the relation (4.4.46) leads to the following two equalities
(∂μ/∂V)_{T,N} = -(N/V)(∂μ/∂N)_{T,V}
(∂P/∂N)_{T,V} = -(V/N)(∂P/∂V)_{T,N}    (4.4.47)

which, when substituted into (4.4.45), yield an alternative expression for the variation
in particle number with chemical potential

(∂N/∂μ)_{T,V} = (N/V)/(∂P/∂N)|_{T,V}.    (4.4.48)
Replacing, in accordance with standard procedure, the thermodynamic N by the
statistical ⟨N⟩ and employing (4.4.48) in (4.4.32), we obtain an expression for
the variance in particle number

σ_N^2 = (⟨N⟩/V) k_B T/(∂P/∂⟨N⟩)|_{T,V} = ⟨N⟩/[∂(PV/k_B T)/∂⟨N⟩]|_{T,V} = ⟨N⟩^2 [∂ ln Z/∂ ln⟨N⟩|_{T,V}]^{-1}    (4.4.49)
exclusively in terms of the partition function and mean particle number. The transition from the first equality to the second, where V and T are included in the function
to be differentiated, is legitimate because the partial derivative must hold temperature and volume constant. The third equality made use of relation (4.4.36) expressing
the grand canonical partition function in terms of system variables. As a quick check
of (4.4.49), we can apply the first relation to the equation of state of an ideal gas
PV = ⟨N⟩ k_B T  ⟹  (∂P/∂⟨N⟩)_{T,V} = k_B T/V  ⟹  σ_N^2 = (⟨N⟩/V) k_B T (V/k_B T) = ⟨N⟩.    (4.4.50)
The outcome is as expected for a classical system of grainy constituents.
Now consider the implication of the last equality in (4.4.49). The variance in
particle number is the reciprocal of the slope of a plot of ln Z against ln⟨N⟩. If
circumstances are such that all points in the plot were obtained for the same
temperature and volume, i.e. there is at least one additional extensive or intensive
variable X that distinguishes one point from another, then the slope exists, as does
the variance in particle number. However, if no additional variable influences the
system, e.g. the chemical potential is a constant irrespective of whether that constant
is 0 or not, then each point in the plot must represent a different temperature and/or
volume. Since no slope can be associated with a single point, the function σ_N^2 is then
neither zero nor infinite nor anything; it is not defined.
If the mass of a photon were nonvanishing, however small, the chemical potential
would not necessarily be zero, and the limiting process represented by (4.4.42) could
be implemented to yield, as deduced above, a variance in particle number proportional to the number of particles, just like an ideal gas. However, if the chemical
potential of the photon gas were identically zero, and the partition function depended
on no other parameters than temperature and volume, then the photon number
distribution would have a first moment, but not a second moment. (We have
encountered distributions before, like the Cauchy distribution, which has no first
or second moment.) This somewhat unusual circumstance perhaps makes one
wonder how, from a physical (rather than purely mathematical) standpoint, there
can be a difference in the properties of a physical system depending on whether a
measurable quantity like the chemical potential is arbitrarily small as opposed to
being identically zero.
The answer is that the transition from arbitrarily small to identically zero is
not always smooth. There is an abrupt difference, for example, in the allowed
polarizations of zero-mass and small-mass particles. A particle with nonvanishing
mass, however small, will have 2s + 1 spin substates, i.e. directions of polarization,
whereas a particle with zero mass can have no more than two independent states of
polarization (helicity components) irrespective of the spin s. This difference is a
fundamental outcome of any theory of particles invariant under a proper Lorentz
transformation and mirror-reflection.¹³ Stated differently but equivalently, for a
particle with mass, however small, the polarization depends on the reference frame
of the observer; for a massless particle, the spin direction is a relativistic invariant,
always either parallel or anti-parallel to the velocity.
As a final thought on this matter, I return to the question of whether the
magnitude of photon number fluctuations in a thermal photon gas has an observable
consequence on the energy fluctuations, as given by (4.4.31). The answer depends on
the coefficient
(∂⟨E⟩/∂⟨N⟩)_{T,V} = (1/⟨N⟩)[⟨E⟩ + PV - TV(∂P/∂T)_{V,N}],    (4.4.51)
the derivation of which I leave to an appendix. Upon substitution of the previously
obtained relations [(4.4.40), (4.4.30)]
P = ⟨E⟩/3V    (∂P/∂T)_{V,N} = (1/3V)(∂⟨E⟩/∂T)_{V,N} = C_V/3V = 4⟨E⟩/3VT,    (4.4.52)
one finds that (4.4.51) is identically zero. For a thermal photon gas, therefore, the
fluctuation in energy is given by
¹³ E. P. Wigner, "Relativistic invariance and quantum phenomena," in Symmetries and Reflections: Scientific Essays
(Indiana University Press, 1967) 51-81.
σ_E^2 = k_B T^2 (∂⟨E⟩/∂T)_V = k_B T^2 C_V.    (4.4.53)
Fig. 4.5 Schematic diagram of split-beam counting and correlation experiments. The source
supplies particles randomly according to a specified (BE, MB, or FD) emission probability
function (PF). The particles are randomly directed to counter A or counter B by the splitter
according to a binomial PF. The numbers of particles (N_A, N_B) received by detectors
A and B are counted and correlated to obtain the correlation function C(N_A, N_B) = ⟨N_A N_B⟩ - ⟨N_A⟩⟨N_B⟩.
(4.5.1)

specified counting interval (bin), given that the source emitted n photons, is calculable
from the binomial probability function

Pr(S_n = k|N = n) = \binom{n}{k} p^k q^{n-k},    (4.5.2)
where the random variable N represents the number of particles emitted from the
source. Since N is determined independently by the Poisson probability function of
the source (where we again represent the mean particle count by μ since there will be
no confusion with chemical potential in this section),
Pr(N = n) = e^{-μ} μ^n/n!,    (4.5.3)

the joint probability for emission of n photons and detection of k of them at A is

p(n, k) = Pr(S_n = k|N = n) Pr(N = n) = \binom{n}{k} p^k q^{n-k} e^{-μ} μ^n/n!.    (4.5.4)
Table 4.3 records explicit values of the probability (4.5.4) for the emission of 0, 1, or
2 photons by the source. Note that different photon emissions can all lead to the
same number of arrivals at detector A. For example, a photon can arrive at
A because 1 or 2 or 3 or more photons were emitted from the source. Unless the
experimenter has taken specific measures to create a photon source that emits a predetermined number of photons,14 the photon number N is not an experimentally
controllable parameter. One needs to know, therefore, the marginal probability of
detecting a photon at A irrespective of the number of emitted photons
14
This, in fact, can be done. A single-atom fluorescence source emits one photon at a time. An atomic hydrogen source
radiating from the 2S1/2 metastable state emits two photons.
Table 4.3

Number emitted n   Probability of N_A = 0     Probability of N_A = 1   Probability of N_A = 2
0                  e^{-μ}                     -                        -
1                  e^{-μ} μq                  e^{-μ} μp                -
2                  e^{-μ} (μ^2/2) q^2         e^{-μ} μ^2 pq            e^{-μ} (μ^2/2) p^2

p_{*,k} ≡ Σ_{n=0}^∞ p(n, k).    (4.5.5)
Direct summation

p_{*,k} = Σ_{n=k}^∞ p(n, k) = [e^{-μ}(μp)^k/k!] Σ_{n=k}^∞ (μq)^{n-k}/(n - k)! = [(μp)^k/k!] e^{-μ} e^{μq} = (μp)^k e^{-μp}/k!    (4.5.6)

shows that this marginal probability is governed by a Poisson distribution with mean μp.
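This thinning result can be confirmed numerically by summing the joint probability (4.5.4) over the emitted number n and comparing with a Poisson function of mean μp; a minimal sketch with arbitrary μ = 2.5, p = 0.3:

```python
import math

mu, p = 2.5, 0.3
q = 1 - p

def joint(n, k):
    """p(n,k): Poisson emission of n photons, binomial routing of k to A."""
    return (math.exp(-mu) * mu**n / math.factorial(n)
            * math.comb(n, k) * p**k * q**(n - k))

for k in range(6):
    marginal = sum(joint(n, k) for n in range(k, 120))
    poisson = math.exp(-mu * p) * (mu * p)**k / math.factorial(k)
    assert abs(marginal - poisson) < 1e-12     # Poisson with mean mu*p
```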
An alternative approach to ascertaining the probability function is to determine
the moment-generating function (mgf) of the compound distribution (4.5.2)

g_{N_A}(t) = ⟨e^{N_A t}⟩ = Σ_{n=0}^∞ Pr(N = n) g_{S_n}(t),    (4.5.7)

where

g_{S_n}(t) = (pe^t + q)^n    (4.5.8)
is the mgf of the binomial distribution Bin(n, p). Substitution of (4.5.8) into (4.5.7)
g_{N_A}(t) = Σ_{n=0}^∞ e^{-μ} μ^n (pe^t + q)^n/n! = e^{-μ} e^{μ(pe^t + q)} = e^{μp(e^t - 1)}    (4.5.9)

generates the mgf of a Poisson distribution with mean μp, in agreement with the
probability function (4.5.6). From the mgf we can readily determine the mean, mean
square, and variance although we already know what they are in this case and from
the symmetry of probability function (4.5.4) can also immediately determine the
corresponding quantities for detector B:
⟨N_A⟩ = σ_A^2 = μp    ⟨N_B⟩ = σ_B^2 = μq.    (4.5.10)
cov(N_A, N_B) = ⟨(N_A - μp)(N_B - μq)⟩ = ⟨N_A N_B⟩ - ⟨N_A⟩⟨N_B⟩,    (4.5.11)
where

⟨N_A N_B⟩ = Σ_{n=0}^∞ Σ_{k=0}^n p(n, k) k(n - k).    (4.5.12)
⟨N_A N_B⟩ = e^{-μ} Σ_{n=0}^∞ (μ^n/n!) Σ_{k=0}^n k(n - k) \binom{n}{k} p^k q^{n-k}
          = e^{-μ} (p d/dp)(q d/dq) Σ_{n=0}^∞ (μ^n/n!) Σ_{k=0}^n \binom{n}{k} p^k q^{n-k}
          = e^{-μ} (p d/dp)(q d/dq) Σ_{n=0}^∞ μ^n (p + q)^n/n! = e^{-μ} (p d/dp)(q d/dq) e^{μ(p+q)}|_{p+q=1}
          = μ^2 pq,    (4.5.13)
which is equal to the product ⟨N_A⟩⟨N_B⟩. (Note that one must not impose the
constraint p + q = 1 in (4.5.13) until after the derivatives with respect to p and q
have been taken.) Thus, C(N_A, N_B) = 0 for particles subject to MB statistics.
This result is not surprising; one would not expect Poisson-distributed particles
arriving randomly at one location to be synchronized in any way with the random
arrival of such particles at a different location. Were that actually to occur, the fall of
raindrops (whose distribution is claimed to be Poissonian) at two locations under the
same cloud would be correlated.
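The vanishing correlation can also be confirmed by direct summation of (4.5.12) with the Poisson-binomial joint probability; a minimal sketch with arbitrary μ = 1.8, p = 0.4:

```python
import math

mu, p = 1.8, 0.4
q = 1 - p

EAB = sum(math.exp(-mu) * mu**n / math.factorial(n)
          * sum(k * (n - k) * math.comb(n, k) * p**k * q**(n - k)
                for k in range(n + 1))
          for n in range(100))

assert abs(EAB - mu**2 * p * q) < 1e-10        # (4.5.13)
assert abs(EAB - (mu * p) * (mu * q)) < 1e-10  # <N_A N_B> = <N_A><N_B>: C = 0
```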
4.5.2 BoseEinstein (BE) particles (photons)
We proceed as in the previous section except that now the probability for emission
of n particles within some specified time interval is given by the BE function
\Pr(N=n) = \frac{\mu^{n}}{(1+\mu)^{n+1}},    (4.5.14)

so that the joint probability of emission of n particles and receipt of k at detector A is

p(n,k) = \frac{\mu^{n}}{(1+\mu)^{n+1}}\binom{n}{k}p^{k}q^{n-k}.    (4.5.15)
It follows then that the mgf for the distribution of particles at detector A is
g_{N_A}(t) = \sum_{n=0}^{\infty}\frac{\mu^{n}}{(1+\mu)^{n+1}}(pe^{t}+q)^{n} = \frac{1}{1+\mu}\left[1-\frac{\mu(pe^{t}+q)}{1+\mu}\right]^{-1} = \frac{1}{1-\mu p(e^{t}-1)},    (4.5.16)
from which follow the marginal probability

p_{*,k} \equiv \sum_{n=0}^{\infty} p(n,k) = \frac{(\mu p)^{k}}{(1+\mu p)^{k+1}}    (4.5.17)
and the means and variances for particle arrivals at each detector

\langle N_A\rangle = \mu p \qquad \langle N_B\rangle = \mu q \qquad \sigma_A^{2} = \mu p(1+\mu p) \qquad \sigma_B^{2} = \mu q(1+\mu q).    (4.5.18)
As in the previous section, we could also derive the probability p*,k directly by
summing (4.5.15) over the index n. The algebraic steps are a little more involved
than in the previous case (where the sum led to an exponential function) but are
worth examining because the calculation reveals further connections between the BE
distribution and the distribution of waiting times. In implementing the sum (4.5.15)
over n, we soon arrive at the form
p_{*,k} = \frac{(\mu p)^{k}}{(1+\mu)^{k+1}\,k!}\sum_{n=k}^{\infty}\left(\frac{\mu q}{1+\mu}\right)^{n-k}\frac{n!}{(n-k)!},    (4.5.19)
where the index begins at n = k if the factorial in the denominator is to make any sense. Reset the index to start at 0 again by defining r = n − k to obtain
p_{*,k} = \frac{(\mu p)^{k}}{(1+\mu)^{k+1}\,k!}\sum_{r=0}^{\infty}\left(\frac{\mu q}{1+\mu}\right)^{r}\frac{(k+r)!}{r!}.    (4.5.20)
A sum of the form (4.5.20) can be closed as follows

\sum_{r=0}^{\infty}\frac{(k+r)!}{k!\,r!}\,a^{r} = \sum_{r=0}^{\infty}\binom{k+r}{r}a^{r} = \frac{1}{(1-a)^{k+1}}    (4.5.21)
to yield a final form for p_{*,k} identical to (4.5.17) upon the substitution a = μq/(μ + 1).
The closure in (4.5.21) is associated with the negative binomial distribution defined by

f(k|r,p) = \binom{r+k-1}{k}p^{r}q^{k} = \binom{-r}{k}p^{r}(-q)^{k} \qquad k = 0, 1, 2, \ldots    (4.5.22)
The generalized binomial coefficient satisfies

\binom{-r}{k} = (-1)^{k}\binom{r+k-1}{k}    (4.5.23)

so that

(1-q)^{-r} = \sum_{k=0}^{\infty}\binom{-r}{k}(-q)^{k} = \sum_{k=0}^{\infty}\binom{r+k-1}{k}q^{k}.    (4.5.24)
That the distribution defined by (4.5.22) satisfies the completeness relation is seen immediately by multiplying both sides of (4.5.24) by p^r = (1 − q)^r.
The negative binomial distribution (4.5.22) gives the probability for the rth occurrence of an event (or success) at the (r + k)th trial, where k can be 0, 1, 2, etc. Thus, it is interpretable as the probability for the waiting time to the rth success. We have seen that the BE single-mode occupation probability p_BE of (4.3.15) was interpretable as a probability for the first occurrence of success at the kth trial. The negative binomial distribution solves the more general problem.
The coefficient \binom{r+k-1}{k} will also be recognized from (4.2.10) as the BE multiplicity factor, i.e. the number of ways that k indistinguishable objects can be sorted into r cells. The two relations come together in addressing the combinatorial problem: What is the probability that a set of r out of s specified cells [e.g. quantum states or photon modes] are filled with exactly k out of m indistinguishable balls [e.g. photons]? The solution is given by
P(k, r|m, s) = \frac{\binom{r+k-1}{r-1}\,\binom{m-k+s-r-1}{s-r-1}}{\binom{m+s-1}{s-1}},    (4.5.25)

in which the first factor of the numerator is the number of ways to distribute k indistinguishable balls over the r specified cells, the second factor is the number of ways to distribute the remaining m − k balls over the remaining s − r cells, and the denominator is the total number of ways to distribute all m balls over s cells. In the limit that the number s of cells and number m of balls become infinitely large while the ratio m/s becomes the mean occupation number ⟨n⟩ of a cell, the right side of (4.5.25) reduces to
P(k, r|m, s) \;\xrightarrow[\;m/s\to\langle n\rangle\;]{\;m,\,s\to\infty\;}\; \binom{r+k-1}{k}\frac{\langle n\rangle^{k}}{(1+\langle n\rangle)^{k+r}},    (4.5.26)
which is the appropriate BE probability for this case. Demonstration of this reduction is given in an appendix. In the special case corresponding to photon emission into a single mode (r = 1), relation (4.5.26) yields
P(k, 1|m, s) \to \frac{\langle n\rangle^{k}}{(1+\langle n\rangle)^{k+1}},    (4.5.27)
which reproduces the probability (4.3.15) obtained previously by means of the
canonical partition function.
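The reduction of (4.5.25) to the BE limit (4.5.26)-(4.5.27) can also be checked numerically for a large but finite system. A sketch with the arbitrary choice m/s = 2 (the deviation from the limit is of order 1/s):

```python
from math import comb

def P(k, r, m, s):
    """Eq. (4.5.25): probability that r specified cells hold exactly k of the
    m indistinguishable balls distributed over s cells."""
    return (comb(r + k - 1, r - 1) * comb(m - k + s - r - 1, s - r - 1)
            / comb(m + s - 1, s - 1))

m, s = 20_000, 10_000            # large system with mean occupation <n> = 2
nbar = m / s

def be_limit(k, r):
    """Eq. (4.5.26): the limiting BE form."""
    return comb(r + k - 1, k) * nbar**k / (1 + nbar)**(k + r)

# Single-mode case r = 1, eq. (4.5.27)
max_diff = max(abs(P(k, 1, m, s) - be_limit(k, 1)) for k in range(8))
```

Python's exact integer arithmetic handles the enormous binomial coefficients without overflow, and the finite-size probabilities already agree with the limiting BE values to a few parts in ten thousand.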
There are at least two ways to calculate the correlation of BE particles in a split-beam experiment. The most direct way is to evaluate the expectation (4.5.12) by substituting the BE probability function, closing the sum and taking derivatives with respect to p and q in the manner employed before [(4.5.13)], and then applying the constraint p + q = 1.
\langle N_A N_B\rangle = \sum_{n=0}^{\infty}\frac{\mu^{n}}{(1+\mu)^{n+1}}\left(\sum_{k=0}^{n}k(n-k)\binom{n}{k}p^{k}q^{n-k}\right)
= \left(p\frac{d}{dp}\right)\left(q\frac{d}{dq}\right)\left[\frac{1}{1+\mu}\sum_{n=0}^{\infty}\left(\frac{\mu}{1+\mu}\right)^{n}\left(\sum_{k=0}^{n}\binom{n}{k}p^{k}q^{n-k}\right)\right]
= \left(p\frac{d}{dp}\right)\left(q\frac{d}{dq}\right)\frac{1}{1+\mu-\mu(p+q)}\bigg|_{p+q=1} = 2\mu^{2}pq.    (4.5.28)
The resulting correlation in (4.5.28) is twice that for MB particles, leading to a positive correlation coefficient

C(N_A, N_B) = \frac{\langle N_A N_B\rangle - \langle N_A\rangle\langle N_B\rangle}{\sigma_A\,\sigma_B} = \frac{\mu\sqrt{pq}}{\sqrt{(1+\mu p)(1+\mu q)}}    (4.5.29)
indicative of a photon arrival pattern referred to as photon bunching. Figure 4.6
shows the variation in C(NA, NB) as a function of mean count for different values of
the decision probability p.
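Formula (4.5.29) is easy to corroborate by simulation: draw the emitted number from the BE (geometric) distribution (4.5.14), split it binomially, and form the sample correlation coefficient. A sketch with illustrative values μ = 2 and p = 1/2, for which (4.5.29) gives C = 1/2:

```python
import math
import random

rng = random.Random(7)
mu, p, trials = 2.0, 0.5, 200_000
theta = mu / (1 + mu)      # Pr(N = n) = (1 - theta) * theta**n, the BE form (4.5.14)

na, nb = [], []
for _ in range(trials):
    u = 1.0 - rng.random()                          # uniform on (0, 1]
    n = int(math.log(u) / math.log(theta))          # geometric (BE) variate
    k = sum(rng.random() < p for _ in range(n))     # binomial split
    na.append(k)
    nb.append(n - k)

ma, mb = sum(na) / trials, sum(nb) / trials
cov = sum(a * b for a, b in zip(na, nb)) / trials - ma * mb
sa = (sum(a * a for a in na) / trials - ma**2) ** 0.5
sb = (sum(b * b for b in nb) / trials - mb**2) ** 0.5
C_sim = cov / (sa * sb)
C_thy = mu * math.sqrt(p * (1 - p)) / math.sqrt((1 + mu * p) * (1 + mu * (1 - p)))
```

The positive sample correlation, absent for the Poisson source of the previous section, is the statistical face of photon bunching.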
A less direct but more general and powerful procedure that provides complete
statistical information about individual and correlated responses of the two detectors
is obtainable from the two-variate moment generating function
g_{A,B}(t_1, t_2) \equiv \langle e^{N_A t_1 + N_B t_2}\rangle = \sum_{n_1=0}^{\infty}\sum_{n_2=0}^{\infty} e^{n_1 t_1}\,e^{n_2 t_2}\binom{n_1+n_2}{n_1}\frac{\mu^{n_1+n_2}}{(1+\mu)^{n_1+n_2+1}}\,p^{n_1}q^{n_2},    (4.5.30)
Fig. 4.6 Correlation function in split-beam counting experiment with BE emission probability function (PF) and binomial decision PF with probability p = (a) 1/2, (b) 1/10, (c) 1/20, (d) 1/50 of transmission to detector A.
where each sum now spans the range (0, ∞) of the number of photons that can arrive
at each detector. By separating the factors according to their summation index
g_{A,B}(t_1, t_2) = \frac{1}{1+\mu}\sum_{n_2=0}^{\infty}\left(\frac{\mu q e^{t_2}}{1+\mu}\right)^{n_2}\sum_{n_1=0}^{\infty}\binom{n_1+n_2}{n_1}\left(\frac{\mu p e^{t_1}}{1+\mu}\right)^{n_1}    (4.5.31)
and completing the sums sequentially (which again requires use of the negative
binomial summation identity (4.5.24)), one obtains after a little work the simple
expression
g_{A,B}(t_1, t_2) = \frac{1}{1-\mu\left(pe^{t_1}+qe^{t_2}-1\right)}.    (4.5.32)
We encountered a structure like that defined in (4.5.30) previously in the discussion of the multinomial distribution (relations (1.13.2) and (1.13.3)). To recapitulate, all desired moments, variances, and cross-correlations can be determined
from partial derivatives of gA,B(t1, t2) of appropriate order with respect to the
two arguments:
\langle N_A\rangle = \frac{\partial g_{A,B}(t_1,t_2)}{\partial t_1}\bigg|_{t_1=t_2=0} = \mu p    (4.5.33)

\langle N_B\rangle = \frac{\partial g_{A,B}(t_1,t_2)}{\partial t_2}\bigg|_{t_1=t_2=0} = \mu q    (4.5.34)

\langle N_A^{2}\rangle = \frac{\partial^{2} g_{A,B}(t_1,t_2)}{\partial t_1^{2}}\bigg|_{t_1=t_2=0} = \mu p(1+2\mu p)    (4.5.35)

\langle N_B^{2}\rangle = \frac{\partial^{2} g_{A,B}(t_1,t_2)}{\partial t_2^{2}}\bigg|_{t_1=t_2=0} = \mu q(1+2\mu q)    (4.5.36)

\langle N_A N_B\rangle = \frac{\partial^{2} g_{A,B}(t_1,t_2)}{\partial t_1\,\partial t_2}\bigg|_{t_1=t_2=0} = 2\mu^{2}pq    (4.5.37)

\sigma_A^{2} = \frac{\partial^{2}\ln g_{A,B}(t_1,t_2)}{\partial t_1^{2}}\bigg|_{t_1=t_2=0} = \mu p(1+\mu p)    (4.5.38)

\sigma_B^{2} = \frac{\partial^{2}\ln g_{A,B}(t_1,t_2)}{\partial t_2^{2}}\bigg|_{t_1=t_2=0} = \mu q(1+\mu q).    (4.5.39)
I discuss photon bunching and fermion anti-bunching in quantitative detail in M. P. Silverman, Quantum Superposition:
Counterintuitive Consequences of Coherence, Entanglement, and Interference (Springer, 2008).
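As a concrete check of these derivative relations, one can differentiate (4.5.32) numerically. The sketch below (illustrative values μ = 1.2 and p = 0.35) approximates the partial derivatives at the origin by central differences:

```python
import math

mu, p = 1.2, 0.35
q = 1 - p

def g(t1, t2):
    """Two-variate mgf (4.5.32) for the BE split-beam experiment."""
    return 1.0 / (1.0 - mu * (p * math.exp(t1) + q * math.exp(t2) - 1.0))

h = 1e-5
mean_a = (g(h, 0) - g(-h, 0)) / (2 * h)       # (4.5.33): mu*p = 0.42
mean_b = (g(0, h) - g(0, -h)) / (2 * h)       # (4.5.34): mu*q = 0.78
# Mixed second partial, (4.5.37): <N_A N_B> = 2*mu^2*p*q = 0.6552
cross = (g(h, h) - g(h, -h) - g(-h, h) + g(-h, -h)) / (4 * h * h)
```

Central differences of a smooth generating function converge quadratically in h, so even this crude step size reproduces the analytic moments to several digits.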
particles where each detection event is independent of any other. It is this third
phenomenon that evokes the image of bunching, since a time record of detection
events would show random regions of enhanced density. Most striking is the fact that
the theoretical conditional probability of a second event with zero time delay after the
first event is twice that for a coincidental second arrival predicted by Poisson statistics. That, in fact, is the implication of relation (4.5.28) compared with (4.5.13).
From a quantum perspective, the bunching of thermal photons, as manifested by a
positive correlation coefficient, is a consequence of BoseEinstein statistics. From a
classical perspective, it is attributable to the wave noise arising from the random
fluctuations in net amplitude of the classical wave comprising independently emitted
wavelets of random phase from numerous atomic or molecular sources. The shot
noise that is always present as a consequence of the grainy nature of photons is
averaged out by the correlator.
There are, however, non-classical states of the optical field, i.e. states of light not described by solutions to Maxwell's electromagnetic equations, that display a
different type of statistical behavior referred to as anti-bunching. These states play
an important part in the motivation and execution of the experiments to be described
shortly that test sequences of photon measurements for non-randomness. As the
name suggests, the conditional probability of a second detection event, given that a first has occurred, is lower than that predicted by a Poisson distribution. This is the kind of statistical behavior expected for fermions as a consequence of
the Pauli exclusion principle.
\langle N_A\rangle = \langle N_A^{2}\rangle = p \qquad\quad \langle N_B\rangle = \langle N_B^{2}\rangle = q    (4.5.42)

\langle N_A N_B\rangle = 0    (4.5.43)

\sigma_A^{2} = p(1-p) \qquad\quad \sigma_B^{2} = q(1-q)    (4.5.44)

C(N_A, N_B) = \frac{\langle N_A N_B\rangle - \langle N_A\rangle\langle N_B\rangle}{\sigma_A\,\sigma_B} = -\sqrt{\frac{pq}{(1-p)(1-q)}}.    (4.5.45)
The negative sign of the correlation coefficient C(N_A, N_B) is indicative of anti-bunching. The origin of the negative correlation is perhaps obvious, but nevertheless worth commenting on. Since no more than one particle can be in the emitted mode within any counting interval, the receipt of a particle at one detector means that the other detector cannot have received a particle; hence the signals at the two detectors are negatively correlated, as shown explicitly in (4.5.45), because the instantaneous product N_A N_B (and therefore the mean ⟨N_A N_B⟩) is 0.
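The same numbers emerge from a three-outcome enumeration. The sketch below assumes the natural model implied by the text: in each counting interval at most one particle is emitted, reaching detector A with probability p, detector B with probability q, and neither detector with probability 1 − p − q (the emission probabilities are illustrative assumptions, not values from the text):

```python
import math

p, q = 0.3, 0.5
# (N_A, N_B) outcomes in one counting interval and their probabilities
outcomes = {(1, 0): p, (0, 1): q, (0, 0): 1 - p - q}

def mean(f):
    return sum(w * f(a, b) for (a, b), w in outcomes.items())

ma, mb = mean(lambda a, b: a), mean(lambda a, b: b)
cov = mean(lambda a, b: a * b) - ma * mb             # <N_A N_B> = 0, so cov = -pq
sa = math.sqrt(mean(lambda a, b: a * a) - ma**2)     # sqrt(p(1-p))
sb = math.sqrt(mean(lambda a, b: b * b) - mb**2)     # sqrt(q(1-q))
C = cov / (sa * sb)
C_thy = -math.sqrt(p * q / ((1 - p) * (1 - q)))      # eq. (4.5.45)
```

Because the two outcomes (1, 0) and (0, 1) are mutually exclusive, the covariance is forced negative for any admissible p and q.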
As mentioned briefly, there are non-classical photon states that manifest anti-bunching even though the particles are intrinsically governed by BE statistics. One example is a single-photon particle-number or Fock state, represented in Dirac notation by |E, k̂, σ⟩, where E is the energy, k̂ is a unit vector in the direction of propagation, and σ is a two-valued label of the state of light polarization, which could take such forms as (V, H) for vertical and horizontal planes of polarization, (D₊, D₋) for planes of diagonal polarization at angles +45° or −45° to the vertical, (R, L) for right and left circular polarizations, or simply (1, 2) for two unspecified polarization states. Single-photon states give rise to anti-bunching in a split-beam experiment for the same reason that FD particles do: namely, because there is at most only one particle in the emitted mode, the two detectors cannot each receive a particle within the same counting interval and therefore the product N_A N_B is zero.
However, just as certain kinds of photon states can exhibit statistical behavior
ordinarily attributable to fermions, there are fermionic states that are predicted to
exhibit statistical behavior usually attributable to bosons a finding quite surprising
when first reported.16
M. P. Silverman, Fermion ensembles that show statistical bunching, Physics Letters A 124 (1987) 27–31. Further information is given in M. P. Silverman, Quantum Superposition, op. cit.
emission and scattering of light, occur non-randomly are more than just tests of
quantum mechanics. Like other physical theories before it, quantum mechanics may
someday be revised or replaced but it is very unlikely, in my opinion, that any such
modification will reflect a discovery that nature is, after all, less random than we
currently believe. Nevertheless, only experiment can reveal whether or not this is so.
There is another and more practical objective of increasing importance in an era of
increasing digitalization. Although individuals might strive for order and predictability in their personal lives, governments, businesses, and organizations seek randomness in a mathematical manner of speaking. In particular, they need a steady flow of
random numbers to ensure the security of communications and transactions.
Cryptography is the science of rendering a message unintelligible except to an
authorized recipient. To do this, the message of interest is encrypted by combination
with a random message, the key, to form a cryptogram. If the key is truly random, of
length (in bits) no less than that of the message to be protected, and used only once,
then it can be demonstrated mathematically as was done by Claude Shannon in
1949 that the cryptogram is impossible to decipher.17 As an example, consider the
transmission of the secret message MEET TONIGHT.
The assignment of decimal numbers to letters is A = 1, B = 2, C = 3, ..., Z = 26, and each decimal value is written as a 5-bit binary number (columns 16, 8, 4, 2, 1). The message, the random key, and the resulting encryption are:

Message  Decimal  Binary cipher  Random bits  Encryption  Decimal   Cryptogram
M        13       01101          11001        10100       20        T
E         5       00101          01010        01111       15        O
E         5       00101          11110        11011       27 → 1    A
T        20       10100          00101        10001       17        Q
T        20       10100          01100        11000       24        X
O        15       01111          11100        10011       19        S
N        14       01110          01010        00100        4        D
I         9       01001          10100        11101       29 → 3    C
G         7       00111          10000        10111       23        W
H         8       01000          01001        00001        1        A
T        20       10100          10001        00101        5        E
Start by assigning to each letter (with no space between the two words since the message as so written is quite clear) the decimal number of its placement in the Roman alphabet. Thus, we assign 13 to M, 5 to E, and so on. Our cipher or algorithm is then to write the decimal number in binary. Recall that if a number is written in base b, then the first column, starting at the right, represents b⁰, the second column b¹, the third column b², etc. Thus the decimal symbol for the letter M is coded as the binary number 01101. The initial 0 is included so that each letter of a message will comprise 5 bits (binary digits) since no fewer than 5 bits can express all 26 letters of the Roman alphabet. (Z, for example, is 26 = 16 + 8 + 2 = 11010.) Moreover, since 5 bits is sufficient to code 2⁵ = 32 letters, we will interpret the code as
17 C. Shannon, Communication theory of secrecy systems, Bell System Technical Journal 28 (1949) 656–715.
Among the newest applications of quantum physics are those obtained by putting
the adjective quantum in front of almost any noun relating to computers and
communication: quantum computing, quantum information, quantum cryptography, quantum key distribution, and the like. These are topics that lie outside the
intended scope of this book except for the following few pertinent remarks. No one,
to my knowledge, has yet made a quantum computer in the sense that physicists
ordinarily understand the term: a device that computes by means of a large number
of entangled quantum states (qubits) to perform multiple calculations in parallel.18
Such a computer, if it existed, could conceivably factor large numbers (and perform
other mathematical operations) fast enough to render present encryption methods
obsolete. The simplest counter-measures would be to adopt longer encryption keys or
a protocol like the one for perfect secrecy.
There would still remain the problem of key distribution. It is in this area that
quantum mechanics is providing practical solutions already available for exploitation. Quantum cryptography is less a matter of encryption than of supplying a
means, based on quantum principles, for securely transmitting a key over a public
channel. Although in theory any elementary particle may be used, in practice the
most suitable means of transmission are single-photon states. Photons are readily
created, have an infinite range, and can be relatively easily transformed into
desired states of polarization. The classical and quantum interpretations of what
occurs when polarized light encounters a polarizing device are profoundly
different.
Consider, for example, a monochromatic classical light wave, which in quantum terms means an astronomically large number of photons in a narrow range of energy-momentum states. If a D-polarized light wave is incident on a linear polarizer that
passes V-polarized light, the relative intensity of the transmitted light is given by the
square of the scalar product of the unit electric vectors of the incident and transmitted waves:
\frac{I_{\text{transmitted}}}{I_{\text{incident}}} = \left|\hat{e}_{D}\cdot\hat{e}_{V}\right|^{2} = \cos^{2}\left(\frac{\pi}{4}\right) = \frac{1}{2},    (4.6.1)
a relation historically known as Malus' law.19 In other words, half the light energy,
regarded as a continuous quantity, is transmitted and half is not. The half that is not
may be absorbed or reflected, depending on the type of polarizer.
From a quantum perspective, a single photon can be described as a linear superposition of quantum states in a (V, H) basis or a (D₊, D₋) basis. The two sets of bases are related by
18
19
The term quantum computing is to be distinguished from biomolecular computing in which massively parallel computation is provided by individual molecules like DNA to solve problems in graph theory such as the traveling-salesman problem.
A comprehensive discussion of light polarization and the interaction of polarized light with different optical
components is given in M. P. Silverman, Waves and Grains: Reflections on Light and Learning (Princeton, 1998).
|D_{+}\rangle = \frac{1}{\sqrt{2}}\big(|V\rangle + |H\rangle\big) \qquad\quad |D_{-}\rangle = \frac{1}{\sqrt{2}}\big(|V\rangle - |H\rangle\big)

or the inverse

|V\rangle = \frac{1}{\sqrt{2}}\big(|D_{+}\rangle + |D_{-}\rangle\big) \qquad\quad |H\rangle = \frac{1}{\sqrt{2}}\big(|D_{+}\rangle - |D_{-}\rangle\big)    (4.6.2)
where only the polarization labels are shown since the other characteristics (energy,
momentum) are assumed to be the same for all states. Neither basis is more fundamental than the other. In the symbolism of quantum mechanics, a photon in the state
D arriving at a V or H polarizer has a probability
P(V|D_{+}) = |\langle V|D_{+}\rangle|^{2} = \frac{1}{2} \qquad\quad P(H|D_{+}) = |\langle H|D_{+}\rangle|^{2} = \frac{1}{2}    (4.6.3)
of passing or not. For photons, the quantum scalar product expressed in Dirac notation
as ⟨A|B⟩ is equivalent to the scalar product (4.6.1) of unit electric-field vectors. The
seminal point is that with a single incident photon, a polarizer does not pass a fraction
of a photon; the outcomes are discrete, mutually exclusive, and unpredictable.
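It is precisely this unpredictability that a polarization-based random number generator exploits. A toy simulation, in which a seeded Bernoulli draw stands in for the quantum measurement of (4.6.3) (the physical device, unlike this pseudo-random sketch, is not algorithmic):

```python
import random

rng = random.Random(2024)

def measure_D_photon():
    """One D-polarized photon at a V/H analyzer: the outcome is V or H,
    each with probability 1/2 by (4.6.3); the photon is never split."""
    return 'V' if rng.random() < 0.5 else 'H'

bits = [1 if measure_D_photon() == 'V' else 0 for _ in range(100_000)]
fraction_v = sum(bits) / len(bits)    # close to 1/2 for a fair process
```

A near-50/50 bit fraction is necessary but not sufficient for randomness, which is why the runs tests developed later in the chapter are needed.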
The security of quantum key distribution relies on (a) the unpredictability of
outcomes of measurements made on single-photon polarization states like (4.6.2)
and (b) the fact that attempts by an eavesdropper to intercept and copy transmitted
single-photon states would alter, in the aggregate, the correlation between states sent
and received. Thus, when sender and receiver compare a sample string of bits, they
would discover that an intrusion had occurred and could take measures to protect against further compromise of information. The reliance on single-photon states for unconditional security is crucial because these states can thwart a particularly effective
method of eavesdropping referred to as a photon-number splitting attack whereby
the attacker intercepts a pulse, splits off a photon, and makes measurements useful to
deciphering the key. There are no surplus photons in a single-photon Fock state.
To test whether measurements on single-photon states lead to outcomes consistent
with what one expects for a random process, one needs to generate a long sequence of
measurement outcomes and test them for revealing patterns. If the patterns are there, then the sequences are not random. One highly effective way to do this is by a
process originally known as parametric fluorescence, but referred to now by the
longer, cryptic name of spontaneous parametric down-conversion.
4.7 Correlation experiment with down-converted photons
In contrast to thermal light, the creation of single-photon states is not a simple matter
of switching on an incandescent light bulb. In the words of one review20
20
S. Scheel, Single-photon sources: an introduction, Journal of Modern Optics 56 (2009) 141–160 (quotation from p. 141).
Generating one and only one photon at a well-defined instance . . . proves to be a formidable task. It
amounts to producing a highly nonclassical state of light with strongly nonclassical properties.
Single photons on demand must therefore originate from a source that operates deep in the quantum
regime and that is capable of exerting a high degree of quantum control to achieve sufficient purity
and quantum efficiency of photon production.
D. C. Burnham and D. L. Weinberg, Observation of simultaneity in parametric production of optical photon pairs, Physical Review Letters 25 (1970) 84–87.
polarizations and emerge along the surface of two different cones (one for the
ordinary or o-ray; the other for the extraordinary or e-ray) whose intersection
is centered on the pump beam.
Mathematically, the paired (or conjugate) single-photon states of the first type can
be represented by direct products of single-photon states
\Psi_{\mathrm{I}}(1,2) = \begin{cases} |V\rangle_{S}\,|V\rangle_{I} & \text{for pump } |H\rangle_{P} \\ |H\rangle_{S}\,|H\rangle_{I} & \text{for pump } |V\rangle_{P} \end{cases}    (4.7.1)

whereas the conjugate single-photon states of the second type take the form of a superposition of such products, for example

\Psi_{\mathrm{II}}(1,2) = \frac{1}{\sqrt{2}}\big(|V\rangle_{I}|H\rangle_{S} + |H\rangle_{I}|V\rangle_{S}\big).    (4.7.2)
Although the energy (frequency) and linear momentum (wave vector) need not be the
same for the S and I photons, only the polarizations, which are the information-carrying degrees of freedom of interest here, are displayed in the states of (4.7.1) and
(4.7.2). Nevertheless, the constraints posed by energy and momentum conservation
lead to correlations in the frequency and propagation direction of the S and
I photons. In the special case of Type II PDC in which the S and I photons have
the same frequency and emerge along the two directions corresponding to the
intersection of the two cones, it is not possible to tell which photon is the signal
and which is the idler.
A quantum state of the form (4.7.2), one of four so-called Bell states22 named for the theorist J. S. Bell, is said to show quantum entanglement. The significance of an
entangled state is that it preserves quantum correlations among component states
even when the particles described by those states are far apart. Thus, if the signal
polarization of a PDC state represented by (4.7.2) was measured to be V, then one
would know with 100% certainty (barring non-ideal conditions like imperfect
detector efficiency, detector dark current, and other intrusions of the real world)
that the idler polarization was H, irrespective of the separation between the detectors.
Likewise, if the signal polarization were measured to be H, then without doubt the
idler polarization would have to be V. Since the correlations intrinsic to entangled
states cannot be reproduced by physical models based on classical theories of local
hidden variables (a point proved by Bell), entangled states of various kinds have
been created experimentally for purposes of testing predictions of quantum theory.
To date, I know of no replicable test in which the correlations predicted by quantum
theory were not confirmed.
It is worth mentioning, because the subject is fundamental and still a source of
contention among some philosophically minded physicists, that the properties of
22 The four Bell states are: |\Phi^{\pm}\rangle = \frac{1}{\sqrt{2}}\big(|0\rangle_1|0\rangle_2 \pm |1\rangle_1|1\rangle_2\big), \; |\Psi^{\pm}\rangle = \frac{1}{\sqrt{2}}\big(|0\rangle_1|1\rangle_2 \pm |1\rangle_1|0\rangle_2\big).
Fig. 4.7 Arrangement for measuring single-photon polarizations. A cw laser pump photon is
converted to simultaneous signal and idler photons within a BBO parametric down-conversion
crystal (PDC). The idler photon acquires diagonal polarization (D) after passing through a
half-wave plate (/2), and is either transmitted with horizontal polarization (H) or reflected
with vertical polarization (V) by the polarization beam splitter (PBS). Optical fibers, lenses and
filters (LF) transmit the signal and idler photons to detectors AB or AC, depending on the
outcome of the PBS measurement. The photon pairs are counted in coincidence and each such
event assigned the appropriate binary label.
23 For elaboration of this point, see M. P. Silverman, Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement, and Interference (Springer, 2008).
24 D. Branning, A. Katcher, W. Strange, and M. P. Silverman, Search for patterns in sequences of single-photon polarization measurements, Journal of the Optical Society of America B 28 (2011) 1423–1430.
|\psi\rangle = \frac{1}{\cosh|\zeta|}\sum_{n=0}^{\infty}\big(\tanh|\zeta|\big)^{n}\,|n\rangle_{S}\,|n\rangle_{I},    (4.7.4)

so that the probability of n photons per bin in either beam takes the BE form

\Pr(N=n) = \frac{\mu^{n}}{(1+\mu)^{n+1}},    (4.7.5)

in which the mean number μ of photons per bin and the effective photon temperature T of the mode are related to the parameter ζ of the pump by

\mu = \sinh^{2}|\zeta|    (4.7.6)

\frac{\hbar\omega}{k_{B}T} = 2\ln\coth|\zeta|.    (4.7.7)
25 B. Yurke and M. Potasek, Obtainment of thermal noise from a pure quantum state, Physical Review A 36 (1987) 3464–3466.
26 B. Blauensteiner et al., Photon bunching in parametric down-conversion with continuous-wave excitation, Physical Review A 79 (2009) 063846 (1–6).
Table 4.4 Characteristics of the Lo-μ (μ1 = 0.0111 ± 0.0002) and Hi-μ (μ2 = 0.364 ± 0.002) sequences

                               Lo-μ                              Hi-μ
Pr(0 events per bin)     0.98920 (Obs)  0.98896 (Thy)     0.694 (Obs)   0.695 (Thy)
Pr(1 event per bin)      0.01074 (Obs)  0.01098 (Thy)     0.253 (Obs)   0.253 (Thy)
Pr(≥2 events per bin)    0.00006 (Obs)  0.00006 (Thy)     0.053 (Obs)   0.052 (Thy)
Pr(≥2)/Pr(1)             0.0056 (Obs)   0.0056 (Thy)      0.209 (Obs)   0.206 (Thy)

Single-photon events
  Sequence length n           8 919 341                       16 797 012
  Pr(1) = p                   0.478 50                        0.500 37
  Number of bags M            1088                            2050
  (1 bag = 8192 bins)

Including multiple-photon events
  Sequence length n                                           20 258 816
  Pr(1) = p                                                   0.414 87
  Number of bags M            1094                            2473
  (1 bag = 8192 bins)
a thermal source. The thermal noise arises from amplification of the vacuum fluctuations that lead to the spontaneous emission of signal and idler photons.
Table 4.4 summarizes the characteristics of the sequences of signal-idler events experimentally obtained for two values μ1 = 0.0111 and μ2 = 0.364 of the mean count per bin, which will be referred to as the Lo-μ and Hi-μ sequences.
For beams of mean occupation number μ < 1 counts per bin, where the bin
width (counting interval) is long compared to the coherence time of the source, the
emission statistics should be well approximated by a Poisson distribution. The theoretical probabilities in Table 4.4 were calculated with Poisson statistics. By far, the
largest number of outcomes is 0 events per bin. Since there are no coincidence counts
to be assigned a 0 or 1, these events were removed from the sequences to be tested.
The singly occupied bins are the ones that form the random number sequences of
interest in polarization-based random number generators. Multiple-photon events
are usually excluded from a random bit sequence because they also cannot be
assigned a binary label in addition to the fact that such events can compromise
the security of the key. As seen in Table 4.4, the fraction of events with two or more
246
photons within the same data collection time is very low if μ is sufficiently low; but μ can still be less than 1 and yet lead to a significant presence of multiple-photon events.
To test whether the observed sequences of photon measurements with or without
inclusion of multiple-photon events gave evidence of non-randomness, a statistical
method was needed that could be applied to sequences of data comprising events of
more than two kinds. The theory of recurrent runs provides a mathematically
interesting and statistically effective solution to this need.
4.8 Theory of recurrent runs
A stochastic process generates random outcomes in time or space. Despite their random occurrence (indeed, precisely because of it, as discussed in the previous chapter), the outcomes of a stochastic process will display ordered patterns, which a statistically
naïve observer may mistakenly interpret as predictively useful information. Although it
is not possible to prove with certainty that a particular process is random, various
statistical tests can demonstrate within specified confidence limits that it is not random.
Among these, runs tests are especially useful because they are easily implemented, do
not depend on the form or parameters of the distribution of the sampled population,
and are sensitive to deviations from the statistics expected for a random sample.
Recall that a run was defined to be an unbroken sequence of similar events of a binary
nature, as, for example, a sequence of 0s and 1s. A runs test, then, is a test of randomness
in permutational ordering along a single dimension, either spatial or temporal. The
applicability of runs tests is more general than might be inferred at first glance because the
original data, which can be any discrete or continuous series of real numbers, can be
mapped to a set of binary elements in various ways. The different mappings generally
produce different sets of frequencies of runs of specified length, thereby independently
mining the information inherent in the data. Runs tests are distribution free because they
rely on ordinal or categorical relationships between the elements of the sequence to be
tested, rather than on the exact magnitudes of the elements themselves.
To apply a runs test one must know, or at least be able to approximate closely, the
distribution of the chosen statistic. The statistics of interest have traditionally been
the total number of runs (of both types of symbols) and the frequency of the shortest
and longest runs. However, the data are much more effectively utilized by determining for each run length t the probability pn,k,t for occurrence of k runs in n trials.
Although I discussed certain kinds of runs (exclusive runs) in the previous
chapter, it will be useful to consider the subject again from a broader perspective.
Generally speaking, runs tests are of three types. The first is based on categorical
relationships, by which is meant that a variate is assigned a symbol such as a or b
depending on whether it was greater or lesser than a specified threshold, e.g. the
median. The null hypothesis, against which the resulting series containing n_a elements of one kind and n_b elements of the other is compared, is that each of the \binom{n_a+n_b}{n_a}
27 W. Feller, An Introduction to Probability Theory and its Applications (Wiley, 1950) 299–300.
The waiting time S_r to the rth occurrence of a recurrent event E is the sum of r independent recurrence times,

S_r = \sum_{k=1}^{r} T_k,    (4.8.4)

and f_n denotes the probability that E occurs for the first time at the nth trial. The generating function of the probabilities of first occurrence is expressed by the series expansion

F(s) = \sum_{n=0}^{\infty} f_n s^{n},    (4.8.5)
from which it follows that the generating function of the rth occurrence of E is

F^{(r)}(s) \equiv \sum_{n=0}^{\infty} f_n^{(r)} s^{n} = F(s)^{r} = \left(\sum_{n=0}^{\infty} f_n s^{n}\right)^{r},    (4.8.6)

where

\Pr(S_r = n) = f_n^{(r)}    (4.8.7)

is the probability that the rth occurrence of E first takes place at the nth trial. [See Eq. (3.14.11) and the discussion preceding it.]
Although the derivation of the generating function (4.8.5) is not difficult, it is
somewhat lengthy, and I will simply give the result28
F(s) = \frac{p^{t}s^{t}(1-ps)}{1-s+qp^{t}s^{t+1}}    (4.8.8)
28 W. Feller, Fluctuation theory of recurrent events, Transactions of the American Mathematical Society 67 (1949) 98–119.
from which the mean and variance of the recurrence times of runs of length t follow
by differentiation
\mu_T = \frac{dF(s)}{ds}\bigg|_{s=1} = \frac{1-p^{t}}{qp^{t}}    (4.8.9)

\sigma_T^{2} = \left\{\frac{d^{2}F(s)}{ds^{2}} + \frac{dF(s)}{ds} - \left(\frac{dF(s)}{ds}\right)^{2}\right\}\Bigg|_{s=1} = \frac{1}{(qp^{t})^{2}} - \frac{2t+1}{qp^{t}} - \frac{p}{q^{2}}.    (4.8.10)
\Pr(N_n \ge k) = \Pr(S_k \le n).    (4.8.11)
In words: if the total waiting time to the kth success is less than n, then the number of
successes in time n must be at least k. The probability pn,k that exactly k events
E occur in n trials is then expressible as
p_{n,k} \equiv \Pr(N_n = k) = \Pr(S_k \le n) - \Pr(S_{k+1} \le n)    (4.8.12)
The statistics of the number of runs follow from the generating functions

\sum_{k=0}^{\infty} p_{n,k}\,z^{k}    (4.8.13)

and

F_k(s) \equiv \sum_{n=1}^{\infty} p_{n,k}\,s^{n} = \frac{F(s)^{k}\big(1-F(s)\big)}{1-s}.    (4.8.14)
Note that the summation in (4.8.13) is over the number of occurrences k, whereas the summation in (4.8.14) is over the number of trials n. The second equality in (4.8.14) follows directly from Eq. (4.8.11), and is derived in an appendix. Multiplying both sides of (4.8.13) by s^n and summing over n leads to the bivariate generating function
H(s,z) \equiv \sum_{n=0}^{\infty}\sum_{k=0}^{\infty} p_{n,k}\,z^{k}s^{n} = \frac{1-F(s)}{(1-s)\big(1-zF(s)\big)}    (4.8.15)
from which the probabilities pn,k are obtained by series expansion of both sides of the
equality.
A sense of the structure of the formalism can be obtained by considering the case
of recurrent runs of length t = 3 for a stochastic process with p = 1/2. Substitution of
these conditions into Eq. (4.8.8) for F(s) yields the following expression for the right
side of Eq. (4.8.15) and its corresponding Taylor-series expansion to order s6
H(s,z) = \frac{2\left(s^{2}+2s+4\right)}{8-4s-2s^{2}-(1+z)s^{3}}
= 1 + s + s^{2} + \left(\frac{7}{8}+\frac{1}{8}z\right)s^{3} + \left(\frac{13}{16}+\frac{3}{16}z\right)s^{4} + \left(\frac{3}{4}+\frac{1}{4}z\right)s^{5} + O(s^{6}).    (4.8.16)
Recall that the powers of s designate the number of trials, and the powers of z designate the number of occurrences of runs of length 3. For a fixed power of s, the coefficients of the powers of z within each bracketed expression sum to unity, as they must by the completeness relation for the probability of mutually exclusive outcomes. Note that the first three terms (s⁰, s¹, s²) are independent of z, i.e. contain only the power z⁰, since there cannot be runs of length 3 in a sequence of no more than two trials. For three trials, the probability of zero runs of length 3 is 7/8 and the probability of one run of length 3 is 1/8. For five trials, however, the probability of zero runs is 3/4 and the probability of one run is 1/4. This pattern persists: (a) to obtain a run of length t, the sequence of trials must be of length n ≥ t, and (b) the greater the number of trials, the higher is the probability of obtaining longer runs.
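These coefficients can be generated mechanically. The sketch below expands F(s) of (4.8.8) by long division of power series in exact rational arithmetic and extracts p_{n,k} from (4.8.14); for t = 3 and p = 1/2 it reproduces the bracketed coefficients of (4.8.16):

```python
from fractions import Fraction

def F_series(t, p, nmax):
    """Taylor coefficients of F(s) = p^t s^t (1 - ps) / (1 - s + q p^t s^(t+1)),
    eq. (4.8.8), by long division of power series (assumes t <= nmax)."""
    p = Fraction(p)
    q = 1 - p
    num = [Fraction(0)] * (nmax + 1)
    den = [Fraction(0)] * (nmax + 1)
    num[t] = p ** t
    if t + 1 <= nmax:
        num[t + 1] = -(p ** (t + 1))
        den[t + 1] += q * p ** t
    den[0] += 1
    den[1] += -1
    f = [Fraction(0)] * (nmax + 1)
    for n in range(nmax + 1):
        # den[0] = 1, so no division is needed in the recurrence
        f[n] = num[n] - sum(den[j] * f[n - j] for j in range(1, n + 1))
    return f

def p_nk(n, k, t, p):
    """p_{n,k}: coefficient of s^n in F(s)^k (1 - F(s)) / (1 - s), eq. (4.8.14)."""
    f = F_series(t, p, n)

    def mul(a, b):
        c = [Fraction(0)] * (n + 1)
        for i, ai in enumerate(a):
            if ai:
                for j in range(n + 1 - i):
                    c[i + j] += ai * b[j]
        return c

    series = [Fraction(1)] + [Fraction(0)] * n
    for _ in range(k):
        series = mul(series, f)
    series = mul(series, [1 - f[0]] + [-x for x in f[1:]])
    total = Fraction(0)
    for c in series[:n + 1]:      # dividing by (1 - s) = cumulative sum
        total += c
    return total

half = Fraction(1, 2)
p30 = p_nk(3, 0, 3, half)    # 7/8
p41 = p_nk(4, 1, 3, half)    # 3/16
p51 = p_nk(5, 1, 3, half)    # 1/4
```

Because `Fraction` arithmetic is exact, the coefficients come out as the precise rationals of the expansion rather than floating-point approximations.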
It is not necessary to know the individual p_{n,k} to determine the mean number of recurrent runs

⟨N_n⟩ = Σ_{k=0}^∞ k p_{n,k}.    (4.8.17)
Multiplying both sides of Eq. (4.8.17) by s^n and summing n over the range (1, ∞) leads to the generating function for the distribution of N_n

M₁(s) = Σ_{n=1}^∞ ⟨N_n⟩ s^n = F(s) / [(1 − s)(1 − F(s))].    (4.8.18)
Defining in like manner the second moment

⟨N_n²⟩ = Σ_{k=0}^∞ k² p_{n,k}    (4.8.19)

and following the same procedure that led to (4.8.18) yields the generating function for the second moment of N_n

M₂(s) = Σ_{n=1}^∞ ⟨N_n²⟩ s^n = F(s)[1 + F(s)] / [(1 − s)(1 − F(s))²].    (4.8.20)
From the expansion coefficients of M₁(s) and M₂(s) to order s^n one obtains the variance

var(N_n) = ⟨N_n²⟩ − ⟨N_n⟩² ≈ n σ_T²/T³,    (4.8.21)

where the approximate equality holds in the asymptotic limit of large n (T is the mean recurrence time of a run and σ_T² the variance of the recurrence time). The generators for higher-order moments of N_n can be derived in the same manner, but are not needed in this chapter.
The statistics (probabilities and expectation values) for any physically meaningful
choice of probability of success p, run length t, and number of trials n are deducible
exactly from the generator (4.8.15) and derived generators such as (4.8.18) and
(4.8.20). For many applications, however, particularly where it is possible to accumulate long sequences of data (as is often the case in atomic, nuclear, and elementary particle physics experiments or investigations of stock market time series), the tests for evidence of non-random behavior are best made by examining long runs. Suppose, for example, one wanted the probability of obtaining the number of occurrences of runs of length 50 in a sequence of 100 trials. This would require extracting the hundredth term
G₁₀₀(z) = (1 − 13/2⁴⁹) + (13/2⁴⁹ − 2⁻¹⁰⁰)z + 2⁻¹⁰⁰ z²,

whose coefficients, expressed as exact fractions with 2⁴⁹ = 562 949 953 421 312, are the probabilities p_{100,k} of k = 0, 1, 2 occurrences.
More explicitly, the individual probabilities and moments

p_{n,k} = (1/n!) (dⁿ/dsⁿ) [ (1 − F(s)) F(s)^k / (1 − s) ] |_{s=0}    (4.8.22)

⟨N_n⟩ = (1/n!) (dⁿ/dsⁿ) [ F(s) / ((1 − s)(1 − F(s))) ] |_{s=0}    (4.8.23)
can be derived from the associated generators, but direct execution of these expressions by differentiation is not in general computationally economic. The computer, in
fact, performs the series expansion of the generators H(s,z) and M1(s) more rapidly
than it performs symbolic differentiation.
As an example that employed Maple and a Mac laptop computer, I calculated the exact mean number of runs of length t = 4 in a sequence of one million trials with probability of success p = 0.5 by the following steps.
1. Form the generating function (4.8.18) for t = 4, p = 1/2:

   M₁(s) = s⁴ / [2(s − 1)²(s³ + 2s² + 4s + 8)].

2. Convert the rational function numerically to a power series Σ_{j=0}^∞ ⟨N_j⟩ s^j, but do not display the result since, after all, there are 1 000 001 terms.

3. Extract the desired term by summing the series from term n to term n; there is only one term in the sum. Extraction of this element for the specified conditions led to ⟨N₁₀₀₀₀₀₀⟩ = 33 333.258 in a fraction of a second.
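The same extraction can be sketched in Python without symbolic algebra: expanding the denominator, 2(s − 1)²(s³ + 2s² + 4s + 8) = 16 − 24s + 4s² + 2s³ + 2s⁵, turns M₁(s) into a short linear recurrence for its series coefficients (the function name and the floating-point representation are my own choices; the book's calculation used Maple):

```python
# Denominator of M1(s) = s^4 / (2 (s-1)^2 (s^3 + 2 s^2 + 4 s + 8)),
# expanded in powers of s: 16 - 24 s + 4 s^2 + 2 s^3 + 0 s^4 + 2 s^5.
D = [16.0, -24.0, 4.0, 2.0, 0.0, 2.0]

def mean_runs_t4(nmax):
    """<N_n> for recurrent runs of length t = 4, p = 1/2, n = 0..nmax,
    from the linear recurrence implied by the rational generator."""
    m = [0.0] * (nmax + 1)
    for n in range(4, nmax + 1):
        acc = 1.0 if n == 4 else 0.0   # numerator contributes only at s^4
        for j in range(1, min(5, n) + 1):
            acc -= D[j] * m[n - j]
        m[n] = acc / D[0]
    return m

m = mean_runs_t4(1_000_000)
print(round(m[100], 3), round(m[1_000_000], 3))  # 3.258  33333.258
```

The n = 100 value reproduces the exact entry 3.258 of Table 4.5, and the millionth coefficient reproduces the value 33 333.258 quoted above.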
One can show by application of the Central Limit Theorem to relation (4.8.11) that for large n, the number N_n of runs of length t produced in n trials is approximately normally distributed with mean

μ_N = n/T    (4.8.24)

and variance

σ_N² = n σ_T²/T³.    (4.8.25)
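Relations (4.8.24) and (4.8.25) are easy to check numerically, with T and σ_T² themselves estimated from simulated recurrence times rather than taken from theory. A sketch (the seed, sample sizes, and tolerances are arbitrary choices of mine):

```python
import random

def runs_in(n, t, p, rng):
    """Count recurrent (Feller) runs of 1s of length t in n Bernoulli trials:
    each completed run is tallied and the streak counter restarts."""
    k = streak = 0
    for _ in range(n):
        streak = streak + 1 if rng.random() < p else 0
        if streak == t:
            k += 1
            streak = 0
    return k

rng = random.Random(1)
t, p = 3, 0.5
# empirical recurrence-time mean T and variance var_T from one long sequence
times, last, streak = [], 0, 0
for i in range(1, 1_000_001):
    streak = streak + 1 if rng.random() < p else 0
    if streak == t:
        times.append(i - last)
        last = i
        streak = 0
T = sum(times) / len(times)
varT = sum((x - T) ** 2 for x in times) / len(times)

n, trials = 1500, 1500
Ns = [runs_in(n, t, p, rng) for _ in range(trials)]
mean = sum(Ns) / trials
var = sum((x - mean) ** 2 for x in Ns) / trials
print(T, mean, n / T, var, n * varT / T ** 3)
```

For t = 3, p = 1/2 the estimated recurrence-time mean comes out near T = 14, and the sample mean and variance of N_n track n/T and nσ_T²/T³ closely.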
Table 4.5

Run length t | Mean number of runs (exact) | Mean number n/T
 1 | 50          | 50
 2 | 16.556      | 16.667
 3 | 7.041       | 7.143
 4 | 3.258       | 3.333
 5 | 1.562       | 1.613
 6 | 7.611 (−1)  | 7.937 (−1)
 7 | 3.738 (−1)  | 3.937 (−1)
 8 | 1.842 (−1)  | 1.961 (−1)
 9 | 9.098 (−2)  | 9.785 (−2)
10 | 4.496 (−2)  | 4.888 (−2)
11 | 2.222 (−2)  | 2.443 (−2)
12 | 1.099 (−2)  | 1.221 (−2)
13 | 5.433 (−3)  | 6.104 (−3)
14 | 2.686 (−3)  | 3.052 (−3)
15 | 1.328 (−3)  | 1.526 (−3)
16 | 6.561 (−4)  | 7.630 (−4)
17 | 3.243 (−4)  | 3.815 (−4)
18 | 1.602 (−4)  | 1.907 (−4)
19 | 7.916 (−5)  | 9.537 (−5)
20 | 3.910 (−5)  | 4.768 (−5)
40 | 2.819 (−11) | 4.547 (−11)
60 | 1.820 (−17) | 4.337 (−17)

[A number in parentheses denotes a power of 10, e.g. 7.611 (−1) = 7.611 × 10⁻¹.]
Expand H(s, z) in powers of s and collect, for each n, the coefficient of s^n:

G_n(z) = p_{n,0} + p_{n,1} z + p_{n,2} z² + ⋯ + p_{n,⌊n/t⌋} z^{⌊n/t⌋},

where ⌊n/t⌋ is the largest integer k such that kt ≤ n, and the coefficients p_{n,k} are given as exact fractions.
Evaluate the set {p_{n,k}} as floating-point numbers, if desired.
As an example, the procedure led in under ten seconds to the full set p_{1000,k} {k = 0…200} for the probability of k occurrences of runs of length 4 in a sequence of 1000 trials. The calculations were again performed with a Mac laptop running Maple.
Using the above methods to obtain exact numerical probabilities becomes impractical for very long sequences and long run lengths, since the evaluation time grows nonlinearly with n. For example, calculation of the distribution for n = 8192, t = 6 required more than 550 hours of computation. However, it was realized (by an undergraduate working on the project) that after G_n(z) is calculated for small n, it can be used to approximate G_n(z) for larger n by treating the longer sequence as a concatenation of smaller ones, and applying a correction for loss of runs at the boundaries. For the distributions arising from the PDC single-photon experiment, which involved sequences of length 8192 trials, this method yielded values for p_{n,k} that differed from the exact probabilities by a theoretical bound no greater than 10⁻⁶. In cases where direct comparison was possible, this discrepancy never exceeded 10⁻⁸.
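The concatenation idea can be illustrated by simulation: the distribution of run counts in a doubled sequence is nearly the convolution of the half-sequence distribution with itself, the small deficit in the mean coming from runs lost at the seam. A sketch (arbitrary seed and sample sizes; the boundary correction described above is deliberately omitted):

```python
import random

def pmf_runs(n, t, p, trials, rng):
    """Empirical pmf of the number of recurrent runs of length t in n trials."""
    counts = {}
    for _ in range(trials):
        k = streak = 0
        for _ in range(n):
            streak = streak + 1 if rng.random() < p else 0
            if streak == t:
                k += 1
                streak = 0
        counts[k] = counts.get(k, 0) + 1
    return {k: c / trials for k, c in counts.items()}

def convolve(f, g):
    """pmf of the sum of two independent counts (polynomial product in z)."""
    h = {}
    for a, pa in f.items():
        for b, pb in g.items():
            h[a + b] = h.get(a + b, 0.0) + pa * pb
    return h

rng = random.Random(5)
half = pmf_runs(600, 3, 0.5, 3000, rng)
approx = convolve(half, half)          # concatenation, boundary ignored
full = pmf_runs(1200, 3, 0.5, 3000, rng)
mean = lambda d: sum(k * v for k, v in d.items())
print(mean(approx), mean(full))
```

Both means lie near n/T = 1200/14; the uncorrected concatenation slightly undercounts runs that straddle the seam between the two half-sequences.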
Table 4.6

Run length t | N_obs (single-photon events) | N_n (single-photon events) | N_obs (all non-null events) | N_n (all non-null events)
 2 | 1381157 | 1381300 ± 900 | 1375944 | 1376000 ± 900
 3 | 572187  | 572300 ± 700  | 567578  | 567600 ± 700
 4 | 257254  | 257300 ± 500  | 253947  | 257400 ± 500
 5 | 119840  | 119700 ± 300  | 117665  | 117500 ± 300
 6 | 56511   | 56500 ± 200   | 55200   | 55200 ± 200
 7 | 26863   | 26870 ± 160   | 26044   | 26110 ± 160
 8 | 12878   | 12820 ± 110   | 12443   | 12390 ± 110
 9 | 6125    | 6120 ± 80     | 5887    | 5890 ± 80
10 | 2966    | 2930 ± 50     | 2850    | 2800 ± 50
11 | 1452    | 1400 ± 30     | 1384    | 1330 ± 40
12 | 730     | 670 ± 30      | 699     | 630 ± 30
13 | 365     | 321 ± 18      | 343     | 301 ± 17
14 | 157     | 153 ± 12      | 142     | 143 ± 12
15 | 69      | 73 ± 9        | 63      | 68 ± 8
16 | 32      | 35 ± 6        | 27      | 32 ± 6
17 | 16      | 17 ± 4        | 13      | 15 ± 4
18 | 9       | 8 ± 2         | 7       | 7 ± 2
19 | 7       | 3 ± 2         | 4       | 3.5 ± 1.9
20 | 4       | 1.8 ± 1.4     | 2       | 1.7 ± 1.3
21 | 2       | 0.9 ± 0.9     | 1       | 0.8 ± 0.9
22 | 2       | 0.4 ± 0.6     | 1       | 0.4 ± 0.6
23 | 1       | 0.2 ± 0.4     | 0       | 0.2 ± 0.4
24 | 1       | 0.1 ± 0.3     |         |
25 | 1       | 0.05 ± 0.2    |         |
26 | 0       | 0.02 ± 0.15   |         |
Table 4.7

Run length t | N_obs (single-photon events) | N_n (single-photon events) | N_obs (all non-null events) | N_n (all non-null events)
 2 | 2802820 | 2803000 ± 1300 | 2464125 | 2464000 ± 1300
 3 | 1202758 | 1202000 ± 900  | 911663  | 911500 ± 90
 4 | 561638  | 561300 ± 700   | 361732  | 361900 ± 600
 5 | 271694  | 271800 ± 500   | 147779  | 147500 ± 400
 6 | 133684  | 133900 ± 400   | 60917   | 60800 ± 200
 7 | 66255   | 66400 ± 300    | 25164   | 25130 ± 160
 8 | 33002   | 33110 ± 180    | 10473   | 10400 ± 100
 9 | 16568   | 16530 ± 130    | 4337    | 4320 ± 70
10 | 8219    | 8270 ± 90      | 1775    | 1790 ± 40
11 | 4047    | 4130 ± 60      | 725     | 740 ± 30
12 | 1987    | 2070 ± 50      | 273     | 308 ± 18
13 | 979     | 1030 ± 30      | 99      | 127 ± 11
14 | 493     | 520 ± 20       | 38      | 53 ± 8
15 | 237     | 259 ± 16       | 13      | 22 ± 5
16 | 119     | 130 ± 11       | 8       | 9 ± 3
17 | 63      | 65 ± 8         | 2       | 3.8 ± 1.9
18 | 34      | 32 ± 6         | 1       | 1.6 ± 1.3
19 | 19      | 16 ± 4         | 0       | 0.7 ± 0.8
20 | 9       | 8 ± 3          |         |
21 | 3       | 4 ± 2          |         |
22 | 1       | 2 ± 1.4        |         |
23 | 1       | 1 ± 1          |         |
24 | 1       | 0.5 ± 0.7      |         |
25 | 1       | 0.3 ± 0.5      |         |
26 | 1       | 0.1 ± 0.4      |         |
27 | 0       | 0.06 ± 0.25    |         |
Two sets of analyses were performed. In the first, only single-photon events were retained with the binary classification depicted in Figure 4.7. In the second, all non-null events were retained and assigned to the categories

    1 : single-photon coincidence AC
    0 : single-photon coincidence AB
    X : multiple-photon coincidence event.
Examination of Tables 4.6 and 4.7 shows that, as expected, the presence of multiple-photon events led to fewer runs of 1s at each length, because a run of 1s could be terminated by occurrence of either a 0 or an X. The empirical values of the probability
Fig. 4.8 Observed frequencies of runs of length t = 6 for 8192-bit bags of the Hi- sequence with single-photon events only (upper panel) and with all non-null events (lower panel). Theoretical distributions (solid), obtained by means of the concatenation method, are superposed. Multiple-photon events, which comprise about 17% of the non-null events, shift the distribution by decreasing the frequency of runs.
of success p obtained for the two sets of analyses for the Hi- and Lo- experiments are given in Table 4.4. Comparison of observed and theoretically expected numbers of runs for both sets of analyses shows close agreement.

For a thorough comparison of observation with theory, the two data sequences (Lo- and Hi-) were partitioned into M subsequences (bags) of length n = 8192 bits, as was done with the sequences of nuclear decay counts in the previous chapter. As summarized in Table 4.4, the number of bags was slightly above 1000 for the Lo- experiment and above 2000 for the Hi- experiment. Histograms were made of the number of occurrences of each run length from 2 to 13, an example of which is shown in Figure 4.8 for runs of length t = 6 in the Hi- sequence.

Each of the 12 histograms of run frequencies obtained from the sequences for each of the two experiments was then tested against the theoretical distributions N_{n,k,t} = M p_{n,k,t} with a χ² analysis. Recall that the outcome of a χ² analysis is the cumulative probability, or P-value, of obtaining a value of the tested variate greater than or equal to the observed value. If the null hypothesis (that the tested
Fig. 4.9 Distribution of P-values as a function of run length for runs of 1s in the Hi- sequence
of bits.
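The conversion of a χ² statistic into a P-value used in these tests can be sketched compactly. For an even number of degrees of freedom the survival function has a closed form (a standard identity, not code from the text; the function name is my own):

```python
import math

def chi2_pvalue_even(x, k):
    """P-value = P(chi^2_k >= x) for even degrees of freedom k:
    exp(-x/2) * sum_{j=0}^{k/2 - 1} (x/2)^j / j!"""
    assert k > 0 and k % 2 == 0
    term, total = 1.0, 0.0
    for j in range(k // 2):
        if j:
            term *= (x / 2) / j   # build (x/2)^j / j! incrementally
        total += term
    return math.exp(-x / 2) * total

print(chi2_pvalue_even(4.605, 2))   # ~0.10
```

For k = 2 this reduces to exp(−x/2), so a statistic of 4.605 gives a P-value of about 0.10; for odd k one would fall back on the regularized incomplete gamma function.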
29. A. Rukhin et al., A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications, National Institute of Standards and Technology, Special Publication 800-22 (Revised 2010), http://csrc.nist.gov/publications/nistpubs/800-22-rev1a/SP800-22rev1a.pdf.
30. J. Soto, Statistical testing of random number generators, Proceedings of the 22nd National Information Systems Security Conference (National Institute of Standards and Technology, 1999), csrc.nist.gov/grops/ST/toolkit/rng/documents/nissc-paper.pdf.
31. S. Chatterjee, M. R. Yilmaz, M. Habibullah, and M. Laudato, An approximate entropy test for randomness, Communications in Statistics: Theory and Methods 29 (2000) 655–675.
be preferred over the entropy test. Moreover, the entropy test yields a single statistic,
whereas runs tests yield a statistic for each run length.
Finally, in tests I have carried out on the randomness of first differences of closing stock prices of a score or more of listed companies of the New York Stock Exchange (an investigation that I will discuss in Chapter 6), a number of the resulting series passed tests of randomness based on autocorrelation, periodicity (by means of power spectra), and entropy, but failed runs tests for nearly all values of run length.
Appendices
The First Law of thermodynamics takes the form

dU = T dS − P dV + Σᵢ μᵢ dNᵢ,    (4.10.1)

where the extensive variables of the system are internal energy U, entropy S, volume V, and number of particles Nᵢ of constituent i, and the intensive variables are absolute temperature T, pressure P, and chemical potential μᵢ of constituent i. In thermal equilibrium (dS = 0) in a system of constant volume (dV = 0) and constant energy (dU = 0), the First Law reduces to

Σᵢ μᵢ dNᵢ = 0.    (4.10.2)

If the variations dNᵢ are arbitrary, then it must follow that μᵢ = 0.
Suppose, however, that the variation in particle number is not arbitrary, but
governed by a reversible chemical reaction of the form
a₁X₁ + a₂X₂ ⇌ a₃X₃ + a₄X₄.    (4.10.3)
260
a1 d )
X
a2 d
)
c i i 0
a3 d
i
a4 d
Reactant : ci ai
,
Product :
ci ai
4:10:4
261
which, upon substitution into (4.10.2), leads to the constraint (4.10.4) on chemical
potentials. This constraint does not pertain only to chemical reactions.
Consider, for example, the creation and annihilation of electronpositron pairs
e⁻ + e⁺ ⇌ 2γ,    (4.10.5)

for which condition (4.10.4) reads

μ_{e⁻} + μ_{e⁺} = 2μ_γ.    (4.10.6)
Is the chemical potential of the gamma photons zero even though now there
appears to be a specific number of photons produced in each reaction? The answer
is yes. In this case the reason given is that the photons escape the system and
therefore do not contribute to the thermodynamic equilibrium of the remaining
particles. (For the same reason, the chemical potential of neutrinos, once thought
to be massless, is usually taken to be zero in weak nuclear processes like the beta
decay of the neutron.) This was the case with the experiment I described in the
previous chapter in which detection of each pair of back-to-back gamma rays from
reaction (4.10.5) provided a signal for the decay of one radioactive sodium nucleus.
The reaction proceeded exclusively in the forward direction because the photon
density within the volume that defined the system was too insignificant to sustain
the reverse reaction.
Suppose, therefore, that we have a system of interacting electrons, positrons,
and gamma rays confined to a fixed volume with 100% reflective boundaries
so the photons cannot be absorbed by the walls. Since photons do not interact
with one another (leaving aside nonlinear QED processes) and do not exchange
energy with the walls of the container, they come into equilibrium with the
electrons and positrons only. The system is a charged or neutral plasma,
depending on whether or not there is an initial imbalance of charged particles.
The reaction (4.10.5) proceeds in both directions, and the three components of the
plasma come to the same equilibrium temperature. Is the number of photons still
uncertain? Is the chemical potential still zero? The answer to both questions is
again yes.
The number of photons is uncertain because the electron-positron annihilation reaction can actually generate any number of photons consistent with the conservation of total energy, linear momentum, and angular momentum (spin). Thus, an e⁻e⁺ pair in a singlet state³² (state with anti-parallel spins) at rest can decay to photon pairs that also have total linear momentum 0 and net spin 0. This would include any integer number (n ≥ 1) of pairs of back-to-back photons of opposite helicities (circular polarizations)
32. The multiplicity of states of a particle of spin quantum number s is 2s + 1. Thus s = 0 for e⁻ and e⁺ with anti-parallel spins, and s = 1 for parallel spins.
e⁻ + e⁺ (S = 0) → 2γ, 4γ, 6γ, …, 2nγ  (n = 1, 2, …),    (4.10.7)

e⁻ + e⁺ (S = 1) → 3γ, 5γ, 7γ, …, (2n + 1)γ  (n = 1, 2, …),    (4.10.8)
that preserve energy and have zero net linear momentum. The pair cannot decay to just one photon because that photon could not have zero linear momentum. The various decay modes have different probabilities; the greater the number of photons produced, the lower is the probability. Thus the half-life of singlet positronium (the bound state of one electron and one positron) is much shorter (1.24 × 10⁻¹⁰ s) than the half-life of triplet positronium (1.39 × 10⁻⁷ s).
In any event, since the photon number for processes (4.10.7) and (4.10.8) is indefinite, the relation (4.10.4) for chemical potentials can be satisfied only if μ_γ = 0, which also implies that

μ_{e⁺} = −μ_{e⁻}.    (4.10.9)
Suppose, however, that the confined plasma were actually generated by a process of the form (4.10.5) in which each annihilation rigorously produced only two gammas. Would one still have μ_γ = 0 if photons were produced in definite numbers? Again: yes, because at the fundamental level of quantum field theory a particle and its anti-particle have chemical potentials of equal magnitude and opposite sign. Thus relation (4.10.9) holds generally (where, by convention, the electron is usually taken to be the particle and the positron the anti-particle). By this argument, the chemical potential of the photon would have to be zero in any reaction of the form A + B ⇌ … + γ, because the photon is its own anti-particle; thus 2μ_γ = 0.
There have been claims in the research literature³³ of non-zero photon chemical potentials in complex systems such as semiconductors, in which the particle is an electron in a conduction band, and the anti-particle is a hole in the valence band. Among the various interactions that can occur is a process like that of (4.10.5) in which an electron and hole combine to produce radiation. One can then write an equation like that of (4.10.6), where the claim is made that the chemical potentials for the electron and hole do not sum to zero, and therefore μ_γ ≠ 0. A detailed analysis is beyond the intended scope of this chapter. Let it suffice to say that, although the claim may not be incorrect, it may also be more a matter of semantics than physics. The definition and application of thermo-statistical quantities such as the chemical potential apply rigorously only to systems in equilibrium, that means

33. P. Wurfel, The chemical potential of radiation, Journal of Physics C: Solid State Physics 15 (1982) 3967–3985.
thermal, chemical, and hydrostatic equilibrium, and it is not apparent that electrons, holes, and radiation in a semiconductor satisfy those conditions. There is no confinement or photon reservoir to supply photons to sustain the reverse reaction necessary for equilibrium; photons immediately leave the system as in the case of electron-positron annihilations of the previous chapter. Moreover, electrons and holes can have different temperatures. Finally, the chemical potentials of the electrons and holes were equated with corresponding Fermi energies. The two concepts, however, are not synonymous. The chemical potential, defined by the First Law (4.10.1) and its equivalent representation in terms of the other thermodynamic potentials H (enthalpy), F (Helmholtz free energy), and G (Gibbs free energy),
μᵢ = (∂U/∂Nᵢ)|_{S,V} = (∂H/∂Nᵢ)|_{S,P} = (∂F/∂Nᵢ)|_{T,V} = (∂G/∂Nᵢ)|_{T,P},    (4.10.10)
is the amount of energy required to add one particle of type i to a system already containing Nᵢ particles of that kind. In contrast, the Fermi energy ε_F is the highest occupied energy level in a system of fermions. The chemical potential and Fermi energy are equivalent only at T = 0.
At the most basic level, distinct from all the physical arguments, there is a purely
mathematical setting that defines the nature of the chemical potential. From the
general perspective of the opening chapter of this book i.e. the derivation of
equilibrium statistical physics from the principle of maximum entropy the chemical
potential is at root a Lagrange multiplier for a particular constraint, which in physics
is usually associated with a conservation principle. Where there is no conserved
quantity, there is no meaningful, independent chemical potential. Consider again
electron-positron pair production and annihilation (4.10.5), by which is really meant all processes (4.10.7) and (4.10.8). The electron, positron, and photon number densities n_e, n_p, n_γ are not conserved quantities; assigning a chemical potential to each of these quantities individually is not physically meaningful. However, if a certain number of electrons and positrons is initially introduced into a fixed volume with reflective walls (or into a system interacting with a photon reservoir), then the net electrical charge is a conserved quantity, and a chemical potential μ ≡ μ_e = −μ_p (leading to μ_γ = 0) can be associated with a constraint on the difference in fermion densities n_d ≡ n_e − n_p.
It is interesting physically and instructive mathematically to work out the statistical physics of this system (a charged or neutral plasma) a little more deeply in the
case of ultra-relativistic electrons and positrons; the photons, of course, are intrinsically ultra-relativistic. By ultra-relativistic is meant that the total energy of a particle
is sufficiently higher than its rest-mass energy (mec2) that it can be approximated by
the asymptotic form of (4.1.2). With this approximation, the integral for the energy
of electrons or positrons is the same as for photons, apart from the term
1 in the
denominator, which distinguishes fermions from bosons. For specified initial
264
conditions of net electric charge and mean energy, the equilibrium state subsequently reached by the plasma is described by equations that can be cast in the form

(charge)  n_d (hc)³ / [4πg(k_BT)³] = ∫₀^∞ x² [1/(z⁻¹eˣ + 1) − 1/(zeˣ + 1)] dx,    (4.10.11)

(energy)  u₀ (hc)³ / [4πg(k_BT)⁴] = ∫₀^∞ x³ [1/(z⁻¹eˣ + 1) + 1/(zeˣ + 1)] dx + ∫₀^∞ x³ dx/(eˣ − 1),    (4.10.12)

where u₀ is the initial mean energy density, z = exp(μ/k_BT) is termed the fugacity, and g = 2 is the degeneracy factor, the same for both the fermions (spin states ±1/2) and the photons (helicity states ±1). Solution of coupled equations (4.10.11) and (4.10.12)
then leads to values for the chemical potential μ and temperature T. The second integral in (4.10.12), which evaluates exactly to π⁴/15, derives from the photon energy density; there is no fugacity factor because the chemical potential of the gamma photon is 0. Although individual expressions for fermion occupation numbers and
energies can be reduced no further than to infinite series, the combined integrals above (deriving from the difference of fermion occupation numbers and the sum of fermion energies) can be evaluated exactly in closed form to yield the coupled equations

(μ/k_BT)³ + π²(μ/k_BT) = 3 n_d (hc)³ / [4πg(k_BT)³],    (4.10.13)

11/7 + [30/(7π²)](μ/k_BT)² + [15/(7π⁴)](μ/k_BT)⁴ = 15 u₀ (hc)³ / [14π⁵(k_BT)⁴].    (4.10.14)
The term 11/7 in (4.10.14) includes a contribution of 1 from the fermion energy density and a contribution of 4/7 from the photon energy density.
As a practical point to keep in mind, the coupled equations are nonlinear, and the sought-for parameters μ and T have vastly different magnitudes: the temperature of the plasma may be billions of kelvins, whereas the chemical potential (expressed in standard MKS units) could be trillionths of a joule. Thus, solving directly for μ and T may not work unless one starts with initial estimates very close to the correct solution. A workable strategy in that case is to solve the original set of equations (4.10.11) and (4.10.12) for the temperature T and the fugacity z, whose value is close to 1, and then determine the chemical potential from μ = k_BT ln z. I have tried both methods and found that the computer (using Maple) implemented both methods quickly and arrived at the same solutions, although solving the integral equations was far less sensitive to initial estimates.
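As an illustration of the direct route, the charge equation (4.10.13) taken alone is a monotone cubic in η = μ/k_BT and can be solved robustly by bisection. A Python sketch (the function name is my own; the scaled right-hand side R = 3n_d(hc)³/[4πg(k_BT)³] is treated as given):

```python
import math

def eta_from_charge(R):
    """Solve eta^3 + pi^2 * eta = R for eta = mu/(kB T), Eq. (4.10.13).
    The left side is strictly increasing, so bisection always converges."""
    f = lambda e: e ** 3 + math.pi ** 2 * e - R
    lo, hi = -abs(R) - 1.0, abs(R) + 1.0   # bracket guaranteed by monotonicity
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(eta_from_charge(1.0 + math.pi ** 2))   # recovers eta = 1
```

In a full solution one would iterate this together with the energy equation (4.10.14) to obtain μ and T simultaneously, or, as suggested above, solve for T and the fugacity z instead.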
Examples of the results are plotted in Figure 4.10 for both charged and neutral relativistic plasmas. The upper panel shows the equilibrium temperature (in units of 10⁹ K) for a neutral plasma (μ = 0); the lower panel shows the corresponding results for a charged plasma (n_e − n_p = 100 n₀).
In the massless (ultra-relativistic) limit the ratio of photon to electron number densities reduces to

lim_{m_ec²/k_BT → 0} (n_γ/n_e) = [∫₀^∞ x² dx/(eˣ − 1)] / [∫₀^∞ x² dx/(eˣ + 1)] = 4/3.    (4.10.15)
In contrast to the photon, which is a massless boson, the neutrino, of which three kinds or flavors (electron, muon, and tau) are currently known, is a spin-1/2 fermion initially believed to be massless. Observations of neutrino oscillations, i.e. periodic transitions between neutrino flavors, strongly indicate that neutrinos have mass, although exact values are not known.³⁴ As inferred from other experiments (e.g. maximum electron energy in beta decay), the electron-neutrino mass, if non-zero, must be very low, about 1 eV/c² or less. The mass of the hitherto lowest-mass particle known, the electron, is 511 keV/c².
If we assume for the sake of discussion that neutrinos are massless, then the question posed at the outset for photons can also be asked of neutrinos: is the neutrino chemical potential zero? Because the neutrino is a fermion, the answer turns out to be more intricate and more interesting than the case of a massless boson. First, as weakly interacting particles whose mean free path through lead is about 1 light-year (≈ 10¹⁶ m), neutrinos ordinarily escape from reactions in which they are produced terrestrially and therefore have insignificant influence on the thermodynamic equilibrium of laboratory experiments. In such cases, one can take the chemical potential to be 0. There are exotic conditions, however, such as in the early stages of the universe or in supernova explosions or within the interior of a neutron star, where neutrinos become trapped by dense matter. Does μ_ν = 0 then? Note that neutrino production and absorption, contrary to analogous processes for photons, are constrained by a conservation principle: conservation of lepton number (in the nuclear weak interactions). The reaction describing neutron beta decay
n → p + e⁻ + ν̄_e    (4.10.16)

therefore imposes on the chemical potentials the condition

μ_n = μ_p + μ_e + μ_ν̄    (4.10.17)

(where the p now stands for proton, not positron). In general, one would expect μ_ν̄, and therefore μ_ν, to be non-zero.
34. Measurement of the neutrino oscillation parameters permits inference of the difference of the squares of neutrino masses.
The answer, however, depends on what kind of neutrino a real neutrino is. Theory admits two possibilities: a Dirac neutrino (ν_D ≠ ν̄_D), which is distinct from its anti-particle, or a Majorana neutrino (ν_M = ν̄_M), which, like the photon, is identical to its anti-particle. The chemical potential of a Dirac neutrino is not necessarily zero, but the chemical potential of a Majorana neutrino must necessarily be zero for a fundamental reason (the Pauli exclusion principle) different from that which applies to photons. In contrast to photons, an arbitrary number of which can occupy a state, neutrinos, like electrons, fill quantum states pairwise with opposite spins. If the neutrino is its own anti-particle, then there is a non-zero probability that two neutrinos in a given quantum state can annihilate one another. The number of Majorana neutrinos, therefore, cannot be a conserved quantity, unlike the number of Dirac neutrinos, and consequently there can be no chemical potential to impose this constraint.
∫₀^∞ x^k dx/(a⁻¹eˣ − 1) = a ∫₀^∞ x^k e⁻ˣ dx/(1 − ae⁻ˣ) = Σ_{n=0}^∞ a^{n+1} ∫₀^∞ x^k e^{−(n+1)x} dx
  = Σ_{n=0}^∞ [a^{n+1}/(n+1)^{k+1}] ∫₀^∞ y^k e⁻ʸ dy = Γ(k+1) Σ_{n=1}^∞ aⁿ/n^{k+1},    (4.11.1)
in which the second form is obtained from the first by multiplying numerator and denominator of the integrand by ae⁻ˣ. The integral is then worked as follows.

- Replace (1 − ae⁻ˣ)⁻¹ by the equivalent infinite series Σ_{n=0}^∞ (ae⁻ˣ)ⁿ.
- Change the integration variable x to y = x(n + 1) to bring the integral into the form of a gamma function.
- Relabel the dummy index so the sum begins with n = 1 and takes the form of a polylogarithm.
In the special case a = 1 (corresponding to chemical potential μ = 0), the integral is expressible as a zeta function

∫₀^∞ x^k dx/(eˣ − 1) = Γ(k+1) Σ_{n=1}^∞ 1/n^{k+1} = Γ(k+1) ζ(k+1).    (4.11.2)
The Fermi–Dirac integral takes a general form differing from the Bose–Einstein integral only by the +1 (rather than −1) in the denominator

∫₀^∞ x^k dx/(a⁻¹eˣ + 1) = a ∫₀^∞ x^k e⁻ˣ dx/(1 + ae⁻ˣ) = Γ(k+1) Σ_{n=1}^∞ (−1)^{n+1} aⁿ/n^{k+1},    (4.11.3)
and results, by the same procedure, in a sum with terms of alternating signs. The special case a = 1 leads to the relation

∫₀^∞ x^k dx/(eˣ + 1) = Γ(k+1) Σ_{n=1}^∞ (−1)^{n+1}/n^{k+1} = Γ(k+1) η(k+1),    (4.11.4)

where the Dirichlet eta function η(k+1) is defined by the sum with alternating signs.
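Both special cases are easily verified numerically. The sketch below checks (4.11.2) and (4.11.4) for k = 3, where Γ(4)ζ(4) = π⁴/15 and Γ(4)η(4) = 7π⁴/120; a crude trapezoid rule over a truncated interval stands in for proper quadrature:

```python
import math

def trapezoid(f, a, b, n=200_000):
    """Composite trapezoid rule on [a, b] with n panels."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

# Bose-Einstein, a = 1, k = 3:  Gamma(4) zeta(4) = pi^4/15
be = trapezoid(lambda x: x ** 3 / (math.exp(x) - 1.0), 1e-8, 60.0)
# Fermi-Dirac, a = 1, k = 3:  Gamma(4) eta(4) = 7 pi^4/120
fd = trapezoid(lambda x: x ** 3 / (math.exp(x) + 1.0), 0.0, 60.0)
print(be, math.pi ** 4 / 15)
print(fd, 7 * math.pi ** 4 / 120)
```

The tail beyond x = 60 is of order x³e⁻ˣ ≈ 10⁻²¹ and is negligible at this precision.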
4.12 Variation in thermal photon energy with photon number (∂E/∂N)|_{T,V}
The internal energy U of a system in thermodynamic equilibrium is ordinarily
considered a function of entropy, volume, and particle number, from which the First
Law takes the standard form
dU = T dS − P dV + μ dN.    (4.12.1)
The variation in energy with volume at fixed temperature and particle number is then given by

(∂U/∂V)|_{T,N} = T (∂S/∂V)|_{T,N} − P,    (4.12.2)

where use of the Maxwell relation

(∂S/∂V)|_{T,N} = (∂P/∂T)|_{V,N}    (4.12.3)

leads to

(∂U/∂V)|_{T,N} = T (∂P/∂T)|_{V,N} − P.    (4.12.4)
Now consider the internal energy as a function of temperature, volume, and particle number

U = U(T, V, N) = V (∂U/∂V)|_{T,N} + N (∂U/∂N)|_{T,V},    (4.12.5)
which is expressible as shown on the right side because U, V, N are extensive variables (homogeneous of degree one) and T is an intensive variable (homogeneous of degree zero). This is an example of Euler's theorem for homogeneous functions. Rearranging the terms in (4.12.5) to isolate (∂U/∂V)|_{T,N} and then substituting into (4.12.4) leads to the expression

(∂U/∂N)|_{T,V} = (1/N) [U + PV − TV (∂P/∂T)|_{V,N}],    (4.12.6)

which is equivalent to (4.4.51) upon replacement of thermodynamic variables by equivalent statistical expectations.
Re-ordering the factorials in the factor labeled by the symbol X leads to the equation

X = [s!/(s − r)!] [m!/(m − k)!] [(m + s − k − r)!/(m + s)!] ≅ s^r m^k/(m + s)^{r+k},    (4.13.2)

where the approximate final expression was deduced in the following way. Consider just the first quotient which, by definition of the factorial operation, becomes

s!/(s − r)! = (s − r + 1)(s − r + 2) ⋯ s ≅ s^r  (r factors).    (4.13.3)

For s very much larger than r, we can treat each of the r factors as approximately equal to s, leading to the final result in (4.13.3). Applying the same reasoning to all factorial ratios in X yields the final expression in (4.13.2), which can be factored as follows
s^r m^k/(m + s)^{r+k} = [s/(s + m)]^r [m/(s + m)]^k = [1/(1 + m/s)]^r [(m/s)/(1 + m/s)]^k = ⟨v⟩^k/(1 + ⟨v⟩)^{k+r},    (4.13.4)

where ⟨v⟩ ≡ m/s.
(4.14.1)

where it is again understood that a success is the occurrence of a run (let us say of 1s) of length t. Then from relation (4.8.12), it follows that

p_{n,k} = Σ_{ν=1}^n (f_ν^{(k)} − f_ν^{(k+1)}),    (4.14.2)

and therefore

Σ_{n=1}^∞ p_{n,k} sⁿ = Σ_{n=1}^∞ [Σ_{ν=1}^n (f_ν^{(k)} − f_ν^{(k+1)})] sⁿ.    (4.14.3)

Consider a double sum of the general form

Q(s) = Σ_{n=1}^∞ Σ_{ν=1}^n h_ν sⁿ,    (4.14.4)
which, when expanded, leads to the following expressions for each value of the index n:

n = 1:  h₁ (s¹ + s² + s³ + s⁴ + ⋯)
n = 2:  h₂ (s² + s³ + s⁴ + ⋯)
n = 3:  h₃ (s³ + s⁴ + ⋯)
n = 4:  h₄ (s⁴ + ⋯)
  ⋮    (4.14.5)
Each of the infinite sums in (4.14.5) is easily closed to yield the following pattern:

Σ_{n=1}^∞ sⁿ = s/(1 − s)
Σ_{n=2}^∞ sⁿ = s²/(1 − s)
Σ_{n=3}^∞ sⁿ = s³/(1 − s)
  ⋮
Σ_{n=r}^∞ sⁿ = s^r/(1 − s),    (4.14.6)
so that

Q(s) = [1/(1 − s)] Σ_{n=1}^∞ h_n sⁿ.    (4.14.7)
Applying this result to (4.14.3) with h_n = f_n^{(k)} − f_n^{(k+1)} gives

Σ_{n=1}^∞ p_{n,k} sⁿ = [1/(1 − s)] Σ_{n=1}^∞ (f_n^{(k)} − f_n^{(k+1)}) sⁿ = [F_k(s) − F_{k+1}(s)]/(1 − s) = F_k(s)[1 − F(s)]/(1 − s),    (4.14.8)

the expression given in (4.8.14).
5
A certain uncertainty
I often say that when you can measure what you are speaking about
and express it in numbers you know something about it; but when
you cannot measure it, when you cannot express it in numbers, your
knowledge is of a meager and unsatisfactory kind: it may be the
beginning of knowledge, but you have scarcely, in your thoughts,
advanced to the stage of science, whatever the matter may be.
Lord Kelvin (William Thomson)1
Lord Kelvin (William Thomson), from the lecture "Electrical Units of Measurement" given to the Institution of Civil Engineers, 3 May 1883; quoted in S. P. Thompson, The Life of William Thomson, Vol. 2 (Macmillan, 1919) 792.
2. M. P. Silverman, W. Strange, and T. C. Lipscombe, Quantum test of the distribution of composite physical measurements, Europhysics Letters 57 (2004) 572–578.
3. M. P. Silverman, W. Strange, and T. C. Lipscombe, The distribution of composite measurements: How to be certain of the uncertainties in what we measure, American Journal of Physics 72 (2004) 1068–1081.
(5.2.1)
and the succeeding analysis is made much more tractable. Therefore, expanding f(X, Y) in a series about (x̄, ȳ) and truncating at the second order leads to the relation

f(X, Y) ≅ f|_{x̄,ȳ} + f_x|_{x̄,ȳ}(X − x̄) + f_y|_{x̄,ȳ}(Y − ȳ)
  + (1/2) f_xx|_{x̄,ȳ}(X − x̄)² + (1/2) f_yy|_{x̄,ȳ}(Y − ȳ)² + f_xy|_{x̄,ȳ}(X − x̄)(Y − ȳ),    (5.2.2)

where the function f and its partial derivatives f_x ≡ ∂f/∂x, f_xx ≡ ∂²f/∂x², f_xy ≡ ∂²f/∂x∂y, etc. are all evaluated at the mean values X = x̄, Y = ȳ. The expectation of f(X, Y), approximated by (5.2.2), immediately yields

⟨Z⟩ ≅ f|_{x̄,ȳ} + (1/2) f_xx|_{x̄,ȳ} σ_X² + (1/2) f_yy|_{x̄,ȳ} σ_Y².    (5.2.3)
The corresponding approximation to the variance is

σ_Z² ≅ f_x²|_{x̄,ȳ} σ_X² + f_y²|_{x̄,ȳ} σ_Y².    (5.2.5)
The two most common applications of the preceding theory, particularly relations (5.2.3) and (5.2.5), are to products Z = XY and quotients Z = X/Y. For the former, in fact, exact expressions are derivable irrespective of the distributions of X and Y. We start with the identity

Z = XY = [x̄ + (X − x̄)][ȳ + (Y − ȳ)] = x̄ȳ + x̄(Y − ȳ) + ȳ(X − x̄) + (X − x̄)(Y − ȳ),    (5.2.6)

whose expectation is

⟨XY⟩ = x̄ȳ + cov(X, Y) = x̄ȳ,    (5.2.7)

where the covariance vanishes if, as assumed, the variates are independent. Upon squaring (5.2.6) and taking the expectation, all cross terms linear in X − x̄ or Y − ȳ vanish, and the expression reduces to

⟨(XY)²⟩ = x̄²ȳ² + x̄²σ_Y² + ȳ²σ_X² + σ_X²σ_Y².    (5.2.8)

The exact variance of the product of independent variates is therefore

var(XY) = ⟨(XY)²⟩ − ⟨XY⟩² = x̄²σ_Y² + ȳ²σ_X² + σ_X²σ_Y²,    (5.2.9)
which is precisely the result given by the approximation (5.2.5). The two calculations agree exactly because partial derivatives of order higher than two of the product XY vanish. The mean and variance of Z = XY can be cast in the form

σ_Z²/μ_Z² = σ_X²/μ_X² + σ_Y²/μ_Y² + (σ_X²/μ_X²)(σ_Y²/μ_Y²) ≅ σ_X²/μ_X² + σ_Y²/μ_Y²,    (5.2.10)

which simplifies, as shown, for sharp distributions with σ_X/μ_X, σ_Y/μ_Y ≪ 1.
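The exact product-variance relation (5.2.9) is easily checked by Monte Carlo simulation; this sketch uses arbitrarily chosen means, widths, seed, and sample size:

```python
import random

rng = random.Random(7)
mx, my, sx, sy = 3.0, 5.0, 0.4, 0.7     # means and standard deviations
N = 200_000
zs = [rng.gauss(mx, sx) * rng.gauss(my, sy) for _ in range(N)]
mean = sum(zs) / N
var = sum((z - mean) ** 2 for z in zs) / N
# Eq. (5.2.9): var(XY) = xbar^2 sy^2 + ybar^2 sx^2 + sx^2 sy^2
exact = mx ** 2 * sy ** 2 + my ** 2 * sx ** 2 + sx ** 2 * sy ** 2
print(mean, var, exact)
```

The sample mean converges to x̄ȳ = 15 and the sample variance to the exact value 8.4884, in accordance with (5.2.7) and (5.2.9); the relation holds for any independent variates with finite second moments, not only normal ones.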
Exact relations for the mean and variance of the quotient Z = X/Y of two independent random variables do not exist in general. From the approximate expressions derived from the series expansion, it is straightforward to show that

μ_Z = ⟨X/Y⟩ ≅ (x̄/ȳ)(1 + σ_Y²/ȳ²),    (5.2.11)

σ_Z² = var(X/Y) ≅ (1/ȳ²)σ_X² + (x̄²/ȳ⁴)σ_Y² = (x̄/ȳ)²(σ_X²/x̄² + σ_Y²/ȳ²),    (5.2.12)

which can also be cast into a compact form

σ_Z²/μ_Z² ≅ (σ_X²/x̄² + σ_Y²/ȳ²)/(1 + σ_Y²/ȳ²)²    (5.2.13)

that reduces for sharply distributed variates to the same relation (5.2.10) as for the product.
Generalization of the series expansion (5.2.2) to a function Z = f(X₁, X₂, …, X_N) of more than two independent variates, each with well-defined mean μᵢ and variance σᵢ², leads to a simple estimate for the variance

σ_Z² ≅ Σ_{i=1}^N (∂Z/∂Xᵢ)²|_{Xᵢ=μᵢ} σᵢ²    (5.2.14)

(under the condition σᵢ ≪ μᵢ for each Xᵢ) that one frequently finds in elementary books of data analysis. If the composite measurement Z is a product of powers of independent elementary variates

Z = X₁^{α₁} ⋯ X_n^{α_n} = Π_{i=1}^n Xᵢ^{αᵢ},    (5.2.15)
then

\langle Z \rangle \approx \mu_Z = \prod_{i=1}^{n} \mu_i^{\alpha_i},   (5.2.16)

and

\left( \frac{\partial \ln Z}{\partial X_i} \right)_{\{X_i = \mu_i\}} = \frac{\alpha_i}{\mu_i},

from which follows the relation

\frac{\sigma_Z^2}{\mu_Z^2} = \sum_{i=1}^{N} \alpha_i^2\, \frac{\sigma_i^2}{\mu_i^2}.   (5.2.17)

For a composite expression comprising only products and quotients of the elementary variates, all the powers \alpha_i are either +1 or -1, whereupon (5.2.17) reproduces the familiar reduced expression of (5.2.10) or (5.2.13).
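The power-law rule (5.2.17) is simple to encode. In this sketch the measurement values are illustrative placeholders, and the helper name relative_sigma is hypothetical.

```python
import math

# Eq. (5.2.17): for Z = prod X_i^alpha_i, the relative variance is
# sigma_Z^2 / mu_Z^2 = sum_i alpha_i^2 * (sigma_i / mu_i)^2.
def relative_sigma(terms):
    """terms: list of (alpha_i, mu_i, sigma_i); returns sigma_Z / mu_Z."""
    return math.sqrt(sum((a * s / m) ** 2 for a, m, s in terms))

# Example: Z = X * Y / W (powers +1, +1, -1), each factor with a 1%
# relative uncertainty, so sigma_Z/mu_Z = sqrt(3) * 0.01.
rel = relative_sigma([(1, 10.0, 0.1), (1, 5.0, 0.05), (-1, 2.0, 0.02)])
```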
The rules of standard error propagation theory, such as (5.2.14), (5.2.16), and
(5.2.17), are perhaps so familiar that one rarely expects to see users justify them, but
in fact they may not be adequate for at least three reasons. First, they are approximations derived from a Taylor-series expansion to first order in the variances that
can fail entirely for certain parent distributions that do not have first and second
moments. Second, without knowledge of the actual distribution of a composite
measurement Z, one cannot rigorously associate some degree of confidence or
probability with the uncertainties derived from these rules. To assume, as is often done, that the standard deviation \sigma_Z of a composite measurement (for example, Z = XY or Z = X/Y) represents a two-sided confidence interval of 68% is not a priori valid
unless Z is distributed normally, and this is not generally the case even when X and
Y are normally distributed. And third, the approximations obtained from a truncated
series expansion may fail to provide accurate statistical moments of a composite
measurement.
The case of a linear superposition of independent random variables ordinarily poses no difficulty, since it is subject to the Central Limit Theorem (CLT) provided the mean and variance of the distributions of the superposed variates exist; estimates of uncertainty can then be made with a normal distribution irrespective of how
extend to products and quotients of random variables.
The field of nuclear physics provides useful systems by which to examine this
important, yet not widely appreciated aspect of statistical analysis. Radioactive
nuclei ordinarily decay by more than one pathway, and the ratio of the rate of decay by a particular mode to the total decay rate is termed the branching ratio of that mode. An example of this phenomenon is furnished by radioactive bismuth (^{212}_{83}Bi), which can undergo transmutation by emission of either an alpha particle to become thallium (^{208}_{81}Tl) or a beta particle to become polonium (^{212}_{84}Po). The branching of the
alpha and beta decays can be used to create parent populations of random variables
from which experimentally derived product and quotient distributions are generated
and compared with theoretical predictions. Later in the chapter I will discuss this
quantum test of the distribution of composite physical measurements.
The study of the statistics of composite measurements is not just an undertaking
of academic interest. At the end of the chapter I will return to the introductory
question regarding the measurement of diagnostic medical indices to voice a concern that I have with the way such tests have been (and are being) conducted and
reported.
Consider first a discrete example: the probability that the product Z = XY of the numbers shown by two fair dice X and Y equals 12 is

P_Z(12) = \sum_{\substack{n=2 \\ 12/n\ \text{integer}}}^{6} P_X(12/n)\, P_Y(n),   (5.3.1)

where the condition below the summation sign restricts the summation index to integer values for which 12 is factorable. If we assume that the numerical outcome of each die has the same probability 1/6, then Eq. (5.3.1) leads to P_Z(12) = 4(1/6)^2 = 1/9.
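The dice example can be verified by exhaustive enumeration, a sketch of the sum in (5.3.1) carried out exactly for every possible product:

```python
from fractions import Fraction
from collections import defaultdict

# Exact distribution of the product of two fair dice.
probs = defaultdict(Fraction)
for x in range(1, 7):
    for y in range(1, 7):
        probs[x * y] += Fraction(1, 36)

# The four admissible factorizations of 12 with both factors <= 6
# are 2*6, 3*4, 4*3, 6*2, so P(12) = 4/36 = 1/9.
```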
Equation (5.3.1) can be rewritten to express more generally the probability of any outcome of the product or ratio of two discrete random variables

P_{Z=XY}(z) = \sum_{y} P_X(z/y)\, P_Y(y)   (5.3.2)

P_{Z=X/Y}(z) = \sum_{y} P_X(zy)\, P_Y(y),   (5.3.3)

where the sum over y may require some restrictive condition, depending on the sets of numbers (real, rational, integer . . .) to which X and Y belong and their specified ranges.
The counts of radioactive decay are governed by the Poisson distribution

P(k|\mu) = e^{-\mu}\, \frac{\mu^k}{k!},   (5.3.4)

which gives the probability of k counts in a specified time interval (bin) from a sample for which the mean count per bin is \mu. A random variable governed by the probability law (5.3.4) will be designated by Poi(\mu). From Eqs. (5.3.2) and (5.3.3) the distributions of Poi(\mu_1) \times Poi(\mu_2) and Poi(\mu_1)/Poi(\mu_2), products and ratios generated from two independent decay modes with mean parameters \mu_1 and \mu_2, are

P_{Z=XY}(z|\mu_1, \mu_2) = e^{-(\mu_1 + \mu_2)} \sum_{\substack{y \ge 1 \\ z/y\ \text{integer}}} \frac{\mu_1^{z/y}\, \mu_2^{y}}{(z/y)!\; y!}   (5.3.5)

P_{Z=X/Y}(z|\mu_1, \mu_2) = e^{-(\mu_1 + \mu_2)} \sum_{\substack{y \ge 1 \\ zy\ \text{integer}}} \frac{\mu_1^{zy}\, \mu_2^{y}}{(zy)!\; y!}.   (5.3.6)

Note that the elements z in the set of products are positive integers and the elements in the set of quotients are positive rational numbers (ratios of integers). We will see the consequence of these conditions in due course.
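Equation (5.3.5) can be checked against a direct simulation. The sketch below uses Knuth's multiplication method for Poisson sampling; the mean parameters are illustrative, not the experimental values discussed later.

```python
import math
import random

# Pmf of the product of two independent Poisson variates, Eq. (5.3.5).
def product_pmf(z, mu1, mu2):
    """P(XY = z) for X = Poi(mu1), Y = Poi(mu2), z a positive integer."""
    pref = math.exp(-(mu1 + mu2))
    total = 0.0
    for y in range(1, z + 1):
        if z % y == 0:
            k = z // y
            total += mu1**k * mu2**y / (math.factorial(k) * math.factorial(y))
    return pref * total

def poisson(mu, rng):
    # Knuth's multiplication method (adequate for small mu)
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

rng = random.Random(7)
n = 200_000
hits = sum(1 for _ in range(n)
           if poisson(3.0, rng) * poisson(2.0, rng) == 6) / n
theory = product_pmf(6, 3.0, 2.0)
```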
The reasoning leading to Eqs. (5.3.2) and (5.3.3) applies as well to distributions of continuous variables, although the formalism is a little different, for we must work with probability densities. Because the probability that a continuous random variable takes on precisely a given value is zero, we consider instead the cpf expressed by an integral over the probability density p_Z(z)

F_Z(z) = \Pr(Z \le z) = \int_{-\infty}^{z} p_Z(z')\, dz'.   (5.3.7)

Thus, for the product Z = XY and quotient Z = X/Y of independent variates, the cumulative probability functions equivalent to (5.3.2) and (5.3.3) are

F_{Z=XY}(z) = \int F_X(z/y)\, p_Y(y)\, dy = \int_0^{\infty} p_Y(y)\, dy \int_{-\infty}^{z/y} p_X(x)\, dx + \int_{-\infty}^{0} p_Y(y)\, dy \int_{z/y}^{\infty} p_X(x)\, dx   (5.3.8)

F_{Z=X/Y}(z) = \int F_X(zy)\, p_Y(y)\, dy = \int_0^{\infty} p_Y(y)\, dy \int_{-\infty}^{zy} p_X(x)\, dx + \int_{-\infty}^{0} p_Y(y)\, dy \int_{zy}^{\infty} p_X(x)\, dx.   (5.3.9)

The partitioning of the integral in (5.3.8) comes about because the hyperbola xy = z has two segments, one lying in the first (or NE) quadrant and the other in the third (or SW) quadrant for z > 0. Therefore the condition xy < z is satisfied for all points below the NE segment and all points above the SW segment. Analogous reasoning with corresponding change of quadrants can be applied to the case for z < 0.

From the definition of the cpf, it follows that the probability that Z falls within the differential range (z, z + dz) is given by

F_Z(z + dz) - F_Z(z) = p_Z(z)\, dz \quad\Rightarrow\quad p_Z(z) = \frac{dF_Z(z)}{dz}.   (5.3.10)

Differentiation of (5.3.8) and (5.3.9) then yields

p_{Z=XY}(z) = \int p_X(z/y)\, p_Y(y)\, |y|^{-1}\, dy   (5.3.11)

p_{Z=X/Y}(z) = \int p_X(zy)\, p_Y(y)\, |y|\, dy.   (5.3.12)

The absolute magnitude in the integrand reflects the condition that the probability density must always be non-negative.
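A worked instance of (5.3.12) with a simple closed form: for independent Exp(1) variates, p_{X/Y}(z) = \int_0^\infty e^{-zy} e^{-y}\, y\, dy = 1/(1+z)^2, with cpf F(z) = z/(1+z). The sketch below confirms both the quadrature and the sampling view; the grid parameters are illustrative.

```python
import math
import random

# Midpoint-rule evaluation of Eq. (5.3.12) for X, Y ~ Exp(1):
# p_{X/Y}(z) = int_0^inf exp(-z*y) * exp(-y) * y dy = 1/(1+z)^2.
def quad_pdf(z, steps=4000, ymax=40.0):
    h = ymax / steps
    return sum(math.exp(-z * (k + 0.5) * h) * math.exp(-(k + 0.5) * h)
               * (k + 0.5) * h for k in range(steps)) * h

rng = random.Random(3)
n = 200_000
ratios = [rng.expovariate(1.0) / rng.expovariate(1.0) for _ in range(n)]
frac_below_1 = sum(r <= 1.0 for r in ratios) / n   # theory: F(1) = 1/2
```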
There is an alternative way to determine product and quotient pdfs that is more expedient to employ if one is familiar with the properties of the Dirac delta function, in particular the identity

\delta(ax) = \frac{1}{|a|}\, \delta(x).   (5.3.13)

Applied to the product, it gives

p_{Z=XY}(z) = \iint_{xy=z} p_{XY}(x, y)\, dx\, dy = \iint p_X(x)\, p_Y(y)\, \delta(xy - z)\, dx\, dy
= \int p_X(x)\, dx \int p_Y(y)\, \frac{1}{|x|}\, \delta\!\left( y - \frac{z}{x} \right) dy = \int p_X(x)\, p_Y\!\left( \frac{z}{x} \right) |x|^{-1}\, dx.   (5.3.14)

In the first line, use of the delta function restricts the region of integration to only those points (x, y) satisfying the condition xy = z and thereby permits unrestricted
upper and lower bounds on the integral over the joint probability density p_{XY}(x, y) = p_X(x)\, p_Y(y), which factors because X and Y are assumed independent. In the second line, the argument of the delta function is factored into a factor that depends on the integration variable y and a factor x which is constant within the integral over y. The identity (5.3.13) is then invoked, which allows the integral over y to be performed trivially, leading to the final expression in (5.3.14) that is completely equivalent to (5.3.11) because of the symmetry between x and y in the relation z = xy. Indeed, the delta function in (5.3.14) could have been factored so as to yield precisely (5.3.11).
Because x and y do not occur symmetrically in the quotient x/y, evaluation of the integral for the pdf

p_{Z=X/Y}(z) = \iint_{x/y=z} p_{XY}(x, y)\, dx\, dy = \iint p_X(x)\, p_Y(y)\, \delta(x/y - z)\, dx\, dy   (5.3.15)

by factoring the delta function to express its argument in terms of either x or y does not lead to expressions of the same form as did the pdf of the product. Rather, the factorization

\delta\!\left( \frac{x}{y} - z \right) = \delta\!\left( \frac{x - yz}{y} \right) = |y|\, \delta(x - yz)   (5.3.16)

leads to Eq. (5.3.12), whereas the factorization

\delta\!\left( \frac{x}{y} - z \right) = \delta\!\left( x \left( \frac{1}{y} - \frac{z}{x} \right) \right) = \frac{1}{|x|}\, \delta\!\left( \frac{1}{y} - \frac{z}{x} \right)   (5.3.17)

leads (after change of integration variable u = y^{-1}) to the different, but equivalent, form

p_{Z=X/Y}(z) = \frac{1}{z^2} \int p_X(x)\, p_Y(x/z)\, |x|\, dx.   (5.3.18)

(Keep in mind that the magnitude of the Jacobian must be used in transforming integration variables because a pdf must be non-negative.) Equation (5.3.18) is also derivable by means of the previous geometry-based method, which starts with the cumulative probability function.
Once the pdf for the product or ratio of two independently measured quantities is known, the pdf for a composite measurement of any combination of products and quotients of independent direct measurements can be constructed by iteration. Consider, for example, the random variable W = XY/Z, which might represent measurement of the gas constant R = PV/T from determinations of pressure, molar volume, and absolute temperature. Iterative use of (5.3.11) and (5.3.12) then leads to the pdf for W

p_W(w) = \int p_Z(z)\, p_{XY}(wz)\, |z|\, dz = \int p_Z(z)\, |z|\, dz \int p_Y(y)\, p_X\!\left( \frac{wz}{y} \right) |y|^{-1}\, dy,   (5.3.19)

which satisfies the normalization requirement \int p_W(w)\, dw = 1.
There is a simple way to express a relation of the form (5.4.1) by means of the interval function I_{(a,b)}(x),

I_{(a,b)}(x) = \begin{cases} 1 & a \le x \le b \\ 0 & \text{otherwise}, \end{cases}   (5.4.2)

which facilitates the evaluation of integrals over restricted regions. Consider, for example, the integral \int I_{(a,b)}(x)\, I_{(a,b)}(z/x)\, \frac{dx}{|x|} where b > a > 0, which occurs in the derivation of the pdf of a product of uniform variates. The first interval function reduces the integral to \int_a^b I_{(a,b)}(z/x)\, \frac{dx}{x}. The remaining interval function restricts the integration range to a \le z/x \le b, which is equivalent to 1/b \le x/z \le 1/a and therefore to z/b \le x \le z/a. Comparing these limits with the limits imposed by the first interval function generates the four conditions below.

Condition on z:    z \le ab                z \ge ab
Lower limit:       x_{min} = a             x_{min} = z/b
Upper limit:       x_{max} = z/a           x_{max} = b

Once the limits are determined, the integral can be easily evaluated

\int I_{(a,b)}(x)\, I_{(a,b)}\!\left( \frac{z}{x} \right) \frac{dx}{|x|} = \begin{cases} \displaystyle\int_a^{z/a} d\ln x = \ln z - 2\ln a & z \le ab \\[1ex] \displaystyle\int_{z/b}^{b} d\ln x = 2\ln b - \ln z & z \ge ab. \end{cases}   (5.4.3)
Substitution of the standard uniform densities p_X(x) = I_{(0,1)}(x), p_Y(y) = I_{(0,1)}(y) into (5.3.8), (5.3.9), (5.3.11) and (5.3.12), with application of the preceding reasoning for evaluating the integrals, leads to the following cpfs and pdfs of the area Z = XY

F_{Z=XY}(z) = z(1 - \ln z), \qquad p_{Z=XY}(z) = -\ln z \qquad (0 \le z \le 1)   (5.4.4)

and aspect ratio Z = X/Y

F_{Z=X/Y}(z) = \begin{cases} \dfrac{z}{2} & z \le 1 \\[1ex] 1 - \dfrac{1}{2z} & z \ge 1 \end{cases} \qquad p_{Z=X/Y}(z) = \begin{cases} \dfrac{1}{2} & z \le 1 \\[1ex] \dfrac{1}{2z^2} & z \ge 1. \end{cases}   (5.4.5)

The area probability relations (5.4.4) are not partitioned over the interval (0,1) because the partition point ab = 0 lies at a boundary. In the more general case X = U_1(a,b), Y = U_2(a,b), the resulting expressions are

F_{Z=XY}(z) = \begin{cases} \dfrac{z(\ln z - 2\ln a - 1) + a^2}{(b-a)^2} & z \le ab \\[1ex] \dfrac{z(2\ln b - \ln z + 1) + a^2 - 2ab}{(b-a)^2} & z \ge ab \end{cases}

p_{Z=XY}(z) = \begin{cases} \dfrac{\ln z - 2\ln a}{(b-a)^2} & z \le ab \\[1ex] \dfrac{2\ln b - \ln z}{(b-a)^2} & z \ge ab. \end{cases}
For reference, the theoretical parent statistics in Fig. 5.1 are \mu_X = \mu_Y = \tfrac{1}{2} = 0.500, \sigma_X = \sigma_Y = 1/\sqrt{12} \approx 0.289, and cov(X, Y) = 0; the corresponding sample values were \mu_X = 0.500, \mu_Y = 0.501, \sigma_X = 0.289, \sigma_Y = 0.288, and cov(X, Y) = 2.94 \times 10^{-4}.

Fig. 5.1 Upper panel: histogram of U_1(0,1) \times U_2(0,1) comprising 10^5 samples partitioned among 500 bins over the interval (0,1); bin width 0.002. Lower panel: histogram of U_1(0,1)/U_2(0,1) comprising 10^5 samples partitioned among 10^4 bins over the interval (0,1000); bin width 0.1. Dashed traces are theoretical densities.
The figure shows excellent agreement between the theoretical probability densities
and the computer-simulated distributions of product and quotient.
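The comparison in Fig. 5.1 can be reproduced in miniature by checking the cpfs (5.4.4) and (5.4.5) against simulated cumulative fractions. A minimal sketch, with illustrative test points:

```python
import math
import random

# Cpfs of area and aspect ratio of a random rectangle with U(0,1) sides.
F_area = lambda z: z * (1.0 - math.log(z))            # Eq. (5.4.4), 0 < z <= 1
F_ratio = lambda z: z / 2 if z <= 1 else 1 - 1 / (2 * z)   # Eq. (5.4.5)

rng = random.Random(11)
n = 200_000
# use 1 - random() for y so the denominator is never exactly zero
pairs = [(rng.random(), 1.0 - rng.random()) for _ in range(n)]
emp_area = sum(x * y <= 0.25 for x, y in pairs) / n
emp_ratio = sum(x / y <= 2.0 for x, y in pairs) / n
```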
It is important to note that, although a square is a degenerate rectangle, the corresponding area distribution cannot be calculated from the foregoing equations because the two sides of a square are correlated 100%, i.e. selection of the length completely determines the width. The case can be treated, however, in a manner previously demonstrated by starting with the cumulative probability Pr(X^2 \le z), or

F_{Z=X^2}(z) = \Pr(-\sqrt{z} \le X \le \sqrt{z}) = F_X(\sqrt{z}) - F_X(-\sqrt{z}),   (5.4.6)

and taking the derivative

p_{Z=X^2}(z) = \frac{dF_Z(z)}{dz} = \frac{1}{2\sqrt{z}} \left[ p_X(\sqrt{z}) + p_X(-\sqrt{z}) \right].   (5.4.7)

For X = U(0,1) this yields

p_{Z=X^2}(z) = \frac{1}{2\sqrt{z}} \qquad (0 \le z \le 1).   (5.4.8)
From (5.4.8) and (5.4.4) one can demonstrate that 50% of the squares but nearly
60% of the rectangles whose side lengths fall randomly within the range (0,1) have
areas less than 0.25. For squares, this is obviously consistent with the fact that 50%
of the side lengths are shorter than 0.5 (which generates the area 0.25). The different
fraction of rectangles is due to the fact that there are infinitely many combinations of
lengths and widths that lead to a given area but only one length for a square. It may
seem reasonable therefore that this fraction should be larger for rectangles, but this is not the case for very small areas with z < 0.081. The reason for this behavior follows from the geometric circumstance that when x is close to 0, say in the region 0 \le x \le 0.1, then z = x^2 falls in the compressed range 0 \le z \le 0.01, whereas when x is close to 1, say in the region 0.9 \le x \le 1, then z = x^2 falls in the stretched range 0.81 \le z \le 1.
In developing the statistics of composite measurements, the general procedure is to begin with the cpf F(z), calculate the pdf p_Z(z) = dF(z)/dz, and from the latter derive the moments

m_n = \langle Z^n \rangle = \int z^n\, p_Z(z)\, dz,   (5.4.9)

of which the most significant are ordinarily the first few as used in the combinations of mean, variance, skewness, and kurtosis:

Mean:      \mu_Z = m_1.   (5.4.10)

Variance:  \mathrm{var}(Z) = \sigma_Z^2 = \langle (Z - m_1)^2 \rangle = m_2 - m_1^2.   (5.4.11)

Skewness:  \mathrm{Sk}(Z) = \frac{\langle (Z - m_1)^3 \rangle}{\sigma_Z^3} = \frac{m_3 - 3 m_2 m_1 + 2 m_1^3}{\sigma_Z^3}.   (5.4.12)

Kurtosis:  K(Z) = \frac{\langle (Z - m_1)^4 \rangle}{\sigma_Z^4} = \frac{m_4 - 4 m_3 m_1 + 6 m_2 m_1^2 - 3 m_1^4}{\sigma_Z^4}.   (5.4.13)
For the standard uniform parent U(0,1) these quantities are

\mu_X = \tfrac{1}{2}, \qquad \sigma_X = \frac{1}{\sqrt{12}}, \qquad \mathrm{Sk}(X) = 0, \qquad K(X) = \tfrac{9}{5},   (5.4.14)

while the moments of the product Z = XY of two such variates, m_n = (n+1)^{-2}, yield

\mu_Z = \tfrac{1}{4} = 0.250, \qquad \sigma_Z = \frac{\sqrt{7}}{12} \approx 0.220,   (5.4.15)

\mathrm{Sk}(Z) = \frac{18\sqrt{7}}{49} \approx 0.972.   (5.4.16)

The estimates by error propagation theory (EPT), Eqs. (5.2.7) (with zero covariance) and (5.2.9) (neglecting the product of variances), yield the same mean as does (5.4.15) and a standard deviation \sigma_Z^{EPT} = 1/(2\sqrt{6}) \approx 0.204, which is suitably close. However, what exactly would it mean to report the outcome of measurements of the rectangular area as Z_{exp} = \mu_Z \pm \sigma_Z = 0.25 \pm 0.22? To answer this question, one must resort to the exact cpf to calculate the probability F(\mu_Z + \sigma_Z) - F(\mu_Z - \sigma_Z) = 69.2\% that a subsequent measurement would fall within the range \pm\sigma_Z about \mu_Z. To assume, if one had not determined the distribution of the composite measurement beforehand, that it was the same as the parent distribution, or that it was a normal distribution, could lead to a very different and incorrect estimate of measurement uncertainty. Now it so happens that for a normal distribution the probability \Pr(|Z - \mu_Z| \le \sigma_Z) is 68.3%, which is quite close to the value just obtained for a standard uniform
product distribution.

The situation is otherwise for the ratio Z = X/Y of two standard uniform variates. The EPT estimates, Eqs. (5.2.11) and (5.2.12) (neglecting the product of variances), are \mu_Z^{EPT} = 4/3 and \sigma_Z^{EPT} = \sqrt{2/3}, and therefore \sigma_Z^{EPT}/\mu_Z^{EPT} = \sqrt{6}/4 \approx 0.61. Here is a case where EPT fails entirely, because the exactly determined moments \langle Z^n \rangle = \langle X^n \rangle \langle Y^{-n} \rangle diverge: \langle Y^{-n} \rangle diverges. One might think that eliminating 0 from the range of Y will improve matters, but this is not so.

Consider the more general parent distribution U(a, b) (b > a > 0) for X and Y, in which case the ratio now falls within the range a/b \le Z \le b/a. The cpf and pdf deduced from (5.3.9) and (5.3.12) are
F_{Z=X/Y}(z) = \begin{cases} \dfrac{(bz - a)^2}{2(b-a)^2\, z} & \dfrac{a}{b} \le z \le 1 \\[1.5ex] 1 - \dfrac{(az - b)^2}{2(b-a)^2\, z} & 1 \le z \le \dfrac{b}{a} \end{cases}   (5.4.17)

p_{Z=X/Y}(z) = \begin{cases} \dfrac{b^2 - (a/z)^2}{2(b-a)^2} & \dfrac{a}{b} \le z \le 1 \\[1.5ex] \dfrac{(b/z)^2 - a^2}{2(b-a)^2} & 1 \le z \le \dfrac{b}{a}. \end{cases}

Calculation of the exact moments of the distribution from (5.4.9), with \varepsilon \equiv a/b, leads to a mean

\mu_Z = \frac{1 + \varepsilon}{2(1 - \varepsilon)}\, \ln\frac{1}{\varepsilon}   (5.4.18)

and variance

\sigma_Z^2 = \frac{1 + \varepsilon + \varepsilon^2}{3\varepsilon} - \left[ \frac{1 + \varepsilon}{2(1 - \varepsilon)}\, \ln\frac{1}{\varepsilon} \right]^2.   (5.4.19)

The corresponding EPT estimates

\mu_Z^{EPT} = 1 + \frac{(1 - \varepsilon)^2}{3(1 + \varepsilon)^2}   (5.4.20)

\sigma_Z^{2\;EPT} = \frac{2(1 - \varepsilon)^2}{3(1 + \varepsilon)^2}   (5.4.21)

are functionally quite different from Eqs. (5.4.18) and (5.4.19), although they approach the corresponding exact expressions in the limit \varepsilon \to 1.
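The divergence between exact and EPT estimates for the U(a,b) ratio can be made concrete numerically. This sketch uses \varepsilon = a/b with illustrative bounds a = 1, b = 2 and checks the exact formulas against simulation.

```python
import math
import random

# Exact mean and variance of X/Y for X, Y ~ U(a,b), eps = a/b,
# Eqs. (5.4.18)-(5.4.19), versus the EPT estimates (5.4.20)-(5.4.21).
def exact_mean_var(eps):
    mu = (1 + eps) / (2 * (1 - eps)) * math.log(1 / eps)
    var = (1 + eps + eps**2) / (3 * eps) - mu**2
    return mu, var

def ept_mean_var(eps):
    mu = 1 + (1 - eps)**2 / (3 * (1 + eps)**2)
    var = 2 * (1 - eps)**2 / (3 * (1 + eps)**2)
    return mu, var

a, b = 1.0, 2.0
eps = a / b
rng = random.Random(5)
n = 300_000
zs = [rng.uniform(a, b) / rng.uniform(a, b) for _ in range(n)]
m = sum(zs) / n
v = sum((z - m)**2 for z in zs) / n
mu_ex, var_ex = exact_mean_var(eps)
```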
Consider next composite measurements built from normal variates. An elementary measurement X distributed normally is governed by the pdf

p_N(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \mu)^2/2\sigma^2}   (5.5.1)

with mean \mu and variance \sigma^2. Many of the composite measurements one is likely to make in science entail multiplying or dividing elementary measurements that are distributed at least approximately normally. For example, the nuclear decay experiments I discussed in Chapter 3 involved Poisson-distributed random variables, but the Poisson distribution is well approximated by a normal distribution for sufficiently large mean.
Figure 5.2 gives a graphical overview of what typical distributions of the
product and ratio of normal variates might look like. The second and third histograms were constructed respectively from 50 000 independent pairs of N(4,1) and
N(8,1) variates drawn from a Gaussian RNG. The ratios of the pairs make up the
high, narrow first histogram, and the products of the pairs comprise the low, broad
fourth histogram. Solid lines enveloping the histograms were calculated from theoretically exact expressions that will be discussed shortly. Visually, the histogram of
Fig. 5.2 Panoramic display of histograms of X/Y, XY, and parent distributions X = N(8,1), Y = N(4,1) with superposition (solid) of respective theoretical densities. Parent histograms comprise 50 000 samples from a Gaussian RNG with observed covariance cov(X, Y) = 1.173 \times 10^{-3}. Outcomes are distributed in 1000 bins over the interval (0, 100).
quotients looks sharper and more asymmetric, with marked skewness to the right,
than the parent distributions. The product histogram is much broader than the
parent distributions, but, at the scale of the figure, still resembles a normal distribution with barely any skewness. The quotient and product histograms look centered
more or less at the respective numerical quotient and product of the means of the
parent distributions, namely 8/4 = 2 and 4 \times 8 = 32. All four histograms have unit
area in accord with the completeness relation for probability.
We will examine first the product Z = XY. Substituting the pdf (5.5.1) for X and Y into (5.3.11) yields the product pdf

p_{Z=XY}(z) = \frac{1}{2\pi\sigma_1\sigma_2} \int_{-\infty}^{\infty} e^{-(x - \mu_1)^2/2\sigma_1^2}\, e^{-(z/x - \mu_2)^2/2\sigma_2^2}\, |x|^{-1}\, dx
= \frac{1}{2\pi\sigma_2\mu_1} \int_{-\infty}^{\infty} \exp\left\{ -\frac{1}{2}\left[ w^2 + \frac{1}{\sigma_2^2}\left( \frac{z}{\mu_1\left( 1 + \frac{\sigma_1}{\mu_1} w \right)} - \mu_2 \right)^2 \right] \right\} \frac{dw}{\left| 1 + \frac{\sigma_1}{\mu_1} w \right|},   (5.5.2)

where the expression in the second line results from the transformation w = (x - \mu_1)/\sigma_1. Equation (5.5.2) is exact and cannot, to my knowledge, be reduced further to some recognized special function for general values of the parameters. However, under the condition that the parent distributions are sharp, (\mu_i/\sigma_i) \gg 1 (i = 1, 2), one can neglect the integration variable w in each factor \left( 1 + \frac{\sigma_1}{\mu_1} w \right) because the entire integrand, which decreases exponentially with w, will have become negligibly small when w is of comparable size to \mu_1/\sigma_1 or \mu_2/\sigma_2. [Note that the distribution of Z = XY is actually symmetric in the statistical parameters of each factor, and one could have begun the calculation by integrating over y, rather than x, in the first line of (5.5.2).] This approximation allows for completing the square in the numerator of the exponential to obtain, after some algebraic rearrangement and integration of a Gaussian density, a product pdf of normal form

p_{Z=XY}(z) \approx \frac{1}{\sqrt{2\pi\sigma_Z^2}}\, e^{-(z - \mu_Z)^2/2\sigma_Z^2} \qquad \begin{cases} \mu_Z = \mu_1\mu_2 \\ \sigma_Z^2 = \mu_1^2\sigma_2^2 + \mu_2^2\sigma_1^2 \end{cases} \qquad \left( \frac{\mu_i}{\sigma_i} \gg 1 \right)   (5.5.3)

with mean and variance the same as that of error propagation theory.
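The sharpness condition behind (5.5.3) can be probed numerically: integrate the exact first line of (5.5.2) by quadrature and compare with the Gaussian approximation. The parents N(8,1) and N(4,1) match the figure; the grid parameters are illustrative.

```python
import math

# Exact product pdf (first line of Eq. (5.5.2)) by midpoint quadrature.
def product_pdf(z, mu1=8.0, s1=1.0, mu2=4.0, s2=1.0, steps=8000):
    lo, hi = mu1 - 8 * s1, mu1 + 8 * s1   # x-range where the integrand lives
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += (math.exp(-(x - mu1)**2 / (2 * s1**2))
                  * math.exp(-(z / x - mu2)**2 / (2 * s2**2)) / abs(x))
    return total * h / (2 * math.pi * s1 * s2)

# Gaussian approximation (5.5.3): mu_Z = 32, sigma_Z^2 = 8^2*1 + 4^2*1 = 80.
def gauss_approx(z, muZ=32.0, varZ=80.0):
    return math.exp(-(z - muZ)**2 / (2 * varZ)) / math.sqrt(2 * math.pi * varZ)
```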
In cases where the approximation leading to (5.5.3) does not hold, the pdf of a product of normal variates can differ markedly from a Gaussian. Consider, for example, the special case of normal variates of zero mean, X = N_1(0, \sigma_1^2), Y = N_2(0, \sigma_2^2), where the variables span the entire real axis. The first expression in Eq. (5.5.2) yields the density

p_{Z=XY}(z) = \frac{1}{\pi\sigma_1\sigma_2} \int_0^{\infty} e^{-\frac{|z|\cosh u}{\sigma_1\sigma_2}}\, du.   (5.5.4)
Fig. 5.3 Histogram of N_1(0,1) \times N_2(0,1) (upper panel) and N_1(0,1)/N_2(0,1) (lower panel) comprising 50 000 pairs of samples from a Gaussian RNG partitioned among 2000 bins over the range (-20, 20); bin width 0.02. Solid traces are theoretical densities.
M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions (Dover, New York, 1972) 376.
The integral in (5.5.4) is a representation of a modified Bessel function of the second kind, by virtue of the identity

K_\nu(z) = \frac{\sqrt{\pi}\,(z/2)^\nu}{\Gamma\!\left( \nu + \tfrac{1}{2} \right)} \int_0^{\infty} e^{-z\cosh u}\, \sinh^{2\nu}\! u\, du,   (5.5.5)

so that for \nu = 0 one obtains p_{Z=XY}(z) = \dfrac{1}{\pi\sigma_1\sigma_2}\, K_0\!\left( \dfrac{|z|}{\sigma_1\sigma_2} \right).
The moments of the product follow directly from the factorization

m_n^Z = \int z^n\, p_{Z=XY}(z)\, dz = \int x^n p_X(x)\, dx \int y^n p_Y(y)\, dy = m_n^X\, m_n^Y,   (5.5.7)

where the moments of a normal variate X_i = N(\mu_i, \sigma_i^2) are

m_n^{X_i} = \sum_{k=0}^{n} \binom{n}{k}\, \mu_i^{n-k}\, \sigma_i^{k} \times \begin{cases} 0 & \text{odd } k \\ (k-1)!! & \text{even } k. \end{cases}   (5.5.8)

The first three moments

m_1 = \mu_1\mu_2
m_2 = (\mu_1^2 + \sigma_1^2)(\mu_2^2 + \sigma_2^2)
m_3 = \mu_1\mu_2\,(\mu_1^2 + 3\sigma_1^2)(\mu_2^2 + 3\sigma_2^2)

lead to the variance

V = \mu_1^2\sigma_2^2 + \mu_2^2\sigma_1^2 + \sigma_1^2\sigma_2^2   (5.5.9)

and skewness

\mathrm{Sk} = \frac{6\,\mu_1\mu_2\,\sigma_1^2\sigma_2^2}{\left( \mu_1^2\sigma_2^2 + \mu_2^2\sigma_1^2 + \sigma_1^2\sigma_2^2 \right)^{3/2}}.   (5.5.10)
We turn next to the quotient Z = X/Y. Substituting the normal pdfs into (5.3.12) yields

p_{Z=X/Y}(z) = \frac{1}{2\pi\sigma_1\sigma_2} \int_{-\infty}^{\infty} e^{-(zy - \mu_1)^2/2\sigma_1^2}\, e^{-(y - \mu_2)^2/2\sigma_2^2}\, |y|\, dy,   (5.5.11)

which can be evaluated in closed form (5.5.12) in terms of exponentials and the error function, a result that looks rather intimidating. The error function erf(x) (a transformation of the Gaussian cumulative probability function) is defined by the expression

\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^{x} e^{-u^2}\, du.   (5.5.13)

To obtain (5.5.12), first complete the square in the exponential of (5.5.11) with the abbreviations a = z/\sigma_1, b = \mu_1/\sigma_1, c = 1/\sigma_2, d = \mu_2/\sigma_2:

p_{Z=X/Y}(z) = \frac{1}{2\pi\sigma_1\sigma_2}\, e^{-\frac{(ad - bc)^2}{2(a^2 + c^2)}} \int_{-\infty}^{\infty} \exp\left\{ -\frac{a^2 + c^2}{2} \left( y - \frac{ab + cd}{a^2 + c^2} \right)^2 \right\} |y|\, dy,

which exposes clearly the important combinations of parameters. Next, account for the absolute-value restriction by re-writing the integral as two integrals, both with range of integration (0, \infty). And last, transform the integration variable to

w = \sqrt{a^2 + c^2}\, \left( y - \frac{ab + cd}{a^2 + c^2} \right),

which again doubles the number of integrals: one involves an exact differential and is immediately integrable in closed form; the other takes the form of an error function. The four integrals resulting from (5.5.11) can be combined to yield (5.5.12).
If X and Y are sharply distributed with \mu_i/\sigma_i \gg 1 (i = 1, 2), the error functions in (5.5.12) (second line) reduce to 1 - (-1) = 2, the exponential function (third line) becomes negligible, and Eq. (5.5.12) can be approximated by the non-Gaussian expression

p_{Z=X/Y}(z) \approx \frac{1}{\sqrt{2\pi}}\, \frac{\mu_2\sigma_1^2 + \mu_1\sigma_2^2 z}{\left( \sigma_1^2 + \sigma_2^2 z^2 \right)^{3/2}}\, \exp\left\{ -\frac{(\mu_1 - \mu_2 z)^2}{2\left( \sigma_1^2 + \sigma_2^2 z^2 \right)} \right\},   (5.5.14)

whereas in the special case of zero means, \mu_1 = \mu_2 = 0, the exact expression (5.5.12) reduces to a Cauchy distribution of variable z

p_{Z=X/Y}(z|0, \gamma) = \frac{1}{\pi\gamma\left[ 1 + (z/\gamma)^2 \right]} \qquad \gamma \equiv \frac{\sigma_1}{\sigma_2},   (5.5.15)

centered on z = 0 with width parameter \gamma = \sigma_1/\sigma_2. The lower panel of Figure 5.3 tests the theoretical density (5.5.15) for \gamma = 1 against a histogram generated from 50 000 pairs of samples from a Gaussian RNG.
The Cauchy distribution has been mentioned several times so far under various circumstances, and it is pertinent at this point to examine its properties more thoroughly. Of especial interest is the fact that it is a distribution to which the Central Limit Theorem (CLT) does not apply. Although centered on 0 with a well-defined width parameter \gamma, the mean, variance, and moment generating function (mgf) resulting from (5.5.15) do not exist. The characteristic function (cf), which does exist,
Fig. 5.4 Histograms of the ratio (upper panel) and product (lower panel) of parent distributions X = N(8,1), Y = N(4,1) with respective theoretical densities (solid). Two theoretical curves, one from the defining integral (5.5.11) and the other from the closed-form expression (5.5.12), overlap to produce the solid trace in the upper panel. Dashed traces show Gaussian densities of corresponding mean and variance.
h_Z(t) = \langle e^{iZt} \rangle = e^{-\gamma|t|},   (5.5.16)

helps explain why. Recall that the moments are given by the coefficients in the Taylor expansion of the cf

h_Z(t) = 1 + i\langle Z \rangle t - \langle Z^2 \rangle \frac{t^2}{2!} - i\langle Z^3 \rangle \frac{t^3}{3!} + \langle Z^4 \rangle \frac{t^4}{4!} - \cdots,   (5.5.17)

where the nth moment is calculated by taking the nth derivative of h_Z(t) and then setting the expansion variable t = 0. However, at t = 0 the function (5.5.16) presents a cusp, and the first derivative does not exist (since it is -\gamma as t \to 0^+ and +\gamma as t \to 0^-).
If the mean does not exist, then neither does the variance. The cf does have a unique second derivative, but this leads to \langle Z^2 \rangle = -\gamma^2, which is not physically meaningful because the mean square of a real-valued variate must be positive. In fact, the calculation of \langle Z^2 \rangle directly by means of the pdf,

\langle Z^2 \rangle = \int_{-\infty}^{\infty} z^2\, p_Z(z)\, dz = \frac{2\gamma^2}{\pi} \int_0^{\infty} \frac{u^2}{1 + u^2}\, du = \frac{2\gamma^2}{\pi} \left[ \int_0^{\infty} du - \int_0^{\infty} \frac{du}{1 + u^2} \right],   (5.5.18)

leads to an infinite result. Thus, although the cf exists, it does not lead to acceptable first and second moments.
There are physical as well as mathematical consequences to the fact that there is
no mean or variance to a Cauchy distribution. In contrast to distributions subject to
the CLT in which the variance of the mean of N measurements is smaller by a factor
N^{-1} than the variance of one measurement, one obtains no greater precision by
making N measurements than in making one measurement of a Cauchy-distributed
quantity. I have found that many scientists, including physicists, are surprised to
learn this. A heuristic explanation is that the more samples one draws from a Cauchy
distribution, the more often there will occur values from the fat tails, and the
fluctuations in the mean of the measurements will lead to a resultant of no greater
predictive value than that of a single sample.
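The futility of averaging Cauchy samples can be seen directly: the mean of N Cau(0,1) variates is itself Cau(0,1), so the spread of the sample mean does not shrink with N. A sketch comparing interquartile spans (the trial counts are illustrative):

```python
import random

def cauchy(rng):
    return rng.gauss(0, 1) / rng.gauss(0, 1)   # Cau(0,1) via a normal ratio

def iqr_of_means(N, trials, rng):
    # interquartile range of the sample mean of N Cauchy variates
    means = sorted(sum(cauchy(rng) for _ in range(N)) / N
                   for _ in range(trials))
    return means[3 * trials // 4] - means[trials // 4]

rng = random.Random(9)
iqr1 = iqr_of_means(1, 4000, rng)      # theory: IQR of Cau(0,1) = 2
iqr100 = iqr_of_means(100, 4000, rng)  # still about 2; no CLT shrinkage
```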
Nevertheless, a Cauchy distribution does have a well-defined median, the moments of which are deducible from the pdf of the corresponding order statistic. Suppose, for example, that N independent measurements \{x_i\; i = 1 \ldots N\} were made of a Cauchy-distributed variate, where N is an odd integer. The median is then the middle order statistic Y_m with m = (N+1)/2, for which the pdf, as determined in Chapter 1, is

p_{Y_m}(y) = \frac{N!}{\left[ \left( \frac{N-1}{2} \right)! \right]^2}\, \{ F(y)\,[1 - F(y)] \}^{\frac{N-1}{2}}\, p(y).   (5.5.19)

(If N were even, then the median would be the average of the two middle order statistics.) Substituting into (5.5.19) the pdf of a Cauchy random variable Y = Cau(\theta, \gamma) (see Table 3.1) of location parameter \theta and scale parameter \gamma

p_{Cau}(y|\theta, \gamma) = \frac{1}{\pi\gamma \left[ 1 + \left( \frac{y - \theta}{\gamma} \right)^2 \right]}   (5.5.20)

and its associated cpf

F_{Cau}(y) = \int_{-\infty}^{y} p_{Cau}(y'|\theta, \gamma)\, dy' = \frac{1}{2} + \frac{1}{\pi} \tan^{-1}\!\left( \frac{y - \theta}{\gamma} \right)   (5.5.21)
Fig. 5.5 Probability density of the median of a Cauchy variate Cau(0, a) (black) and corresponding Gaussian density N(0, (\pi a/2)^2/N) (gray) as a function of a for sample size N: (a) 1, (b) 3, (c) 5, (d) 11.
leads to the density of the median

p_{Y_m}(y) = \frac{N!}{\left[ \left( \frac{N-1}{2} \right)! \right]^2} \left[ \frac{1}{4} - \frac{1}{\pi^2} \left( \tan^{-1}\!\left( \frac{y - \theta}{\gamma} \right) \right)^2 \right]^{\frac{N-1}{2}} \frac{1}{\pi\gamma \left[ 1 + \left( \frac{y - \theta}{\gamma} \right)^2 \right]}.   (5.5.22)
Complicated though the preceding expression might appear, one sees at once from plotting it for (odd) sample size N > 1 that (5.5.22) looks very much like a Gaussian distribution in the limit of large N. See Figure 5.5. Indeed, expansion of \ln p_{Y_m}(u), where u = (y - \theta)/\gamma, with truncation at order u^2 leads to the simple Gaussian expression

p_{Y_m}(u) \approx \frac{1}{\sqrt{2\pi\sigma_{Y_M}^2}}\, e^{-u^2/2\sigma_{Y_M}^2} \qquad \sigma_{Y_M} = \frac{\pi}{2\sqrt{N}}.   (5.5.23)
The moments of the quotient Z = X/Y of independent variates factor in analogy to (5.5.7),

m_n = \int z^n\, p_{Z=X/Y}(z)\, dz = \int x^n p_X(x)\, dx \int y^{-n} p_Y(y)\, dy = m_n^X\, m_{-n}^Y,   (5.6.1)

evaluated with the parent pdfs. The latter case, however, requires calculation of the negative-power moments of a random variable.
The question of negative-power moments of random variables is one that I have rarely found mentioned in statistical references. And yet, there are instances in which the need for such moments arises naturally. For example, in the problem of waiting times (which constituted one of the tests of the randomness of nuclear decay discussed previously) one may be interested in the statistics of the number N of trials required to achieve a specified number r of successes for constant probability of success p at each trial. If not known beforehand, the probability p can be estimated by its maximum likelihood value \hat{p} given by the expectation \hat{p} = \langle r/N \rangle = r \langle N^{-1} \rangle, which requires the first negative moment of N. For another example (introduced in Chapter 1), the variance of a Student t variate, T = d^{1/2} U/V, where U = N(0,1) and V^2 = \chi_d^2 are independent random variables, is given by \sigma_T^2 = d\, \langle U^2 \rangle \langle V^{-2} \rangle, which requires the second negative moment of V, actually to be calculated as the first negative moment of V^2.
A general procedure for calculating the negative moments of a random variable Y again makes use of the moment generating function g_Y(t). Recall that the positive nth moment is obtained by calculating the nth derivative g_Y^{(n)}(0). This suggests that the negative nth moment may be obtained by the inverse procedure, i.e. an n-fold integral, and, indeed, this can be shown to be the case.^5 The exact expression takes the form

\langle Y^{-n} \rangle = \int_0^{\infty} dt_1 \int_0^{t_1} dt_2 \cdots \int_0^{t_{n-1}} g_Y(-t_n)\, dt_n = \frac{1}{\Gamma(n)} \int_0^{\infty} t^{n-1}\, g_Y(-t)\, dt,   (5.6.2)

where the gamma function \Gamma(n) = (n-1)! for integer argument. Underlying the first equality in (5.6.2) is an assumption that one can interchange the order of integration over any t-variable and the variable y occurring in the definition of the mgf

g_Y(t) = \langle e^{Yt} \rangle = \int p_Y(y)\, e^{yt}\, dy.   (5.6.3)
^5 N. Cressie et al., "The moment-generating function and negative integer moments," The American Statistician 35 (1981) 148-150. I learned of this reference in 2010, long after I had worked out the method for myself. Prior to that, I had the vanity to think I may have been the first to discover it. Such is life . . . aptly expressed in the Ecclesiastic maxim "Sub sole nihil novi est."
297
The basis for the transition from the multiple integral in the first equality to the
single integral in the second becomes more transparent if one examines a diagram of
the integration region for the simple cases of n 1, 2 and judiciously transforms the
integration variables and limits. It would be seen then how each integration over a
t-variable contributes one factor of t to the integrand of the second integral. The
occurrence of t, rather than t, in the argument of the mgf may be understood by
following the steps in the derivation of the first negative moment
2
3
0
0
1
yt
hY 1 i y pYydy 4 e dt5pYydy dt eyt pYydy
0
0
gYt
5:6:4
0
1
p
1 m1 m
q :
q mr mn m r
n
5:6:6
X
p qm p
q2 q3 q4 . . .
q
hN i
2
3
4
q mr m q
p
p
ln p,
ln1 q
q
1p
1
5:6:7
which is easily summed by recognizing it as the Taylor series expansion of the natural
logarithm ln(1 q).
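The closed form (5.6.7) can be checked both against its defining series and against a direct simulation of geometric waiting times. The value of p below is illustrative.

```python
import math
import random

# Eq. (5.6.7): <1/N> = -p*ln(p)/(1-p) for a geometric waiting time (r = 1).
p = 0.2
q = 1 - p
exact = -p * math.log(p) / q

# partial sum of sum_{m>=1} p q^(m-1) / m (terms beyond m=400 are negligible)
series = sum(p * q**(m - 1) / m for m in range(1, 400))

def geom(rng, p):
    # number of Bernoulli trials up to and including the first success
    m = 1
    while rng.random() >= p:
        m += 1
    return m

rng = random.Random(6)
n = 200_000
mc = sum(1.0 / geom(rng, p) for _ in range(n)) / n
```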
Generally speaking, one must be careful to avoid the error of confounding the mean \langle Y^{-n} \rangle with the reciprocal \langle Y^n \rangle^{-1}. Nevertheless, curious to compare the estimate \hat{p} obtained from solving (5.6.7) with the alternative estimate \bar{p} = 1/\langle N \rangle, I generated with an RNG 50 000 Poisson variates of specified mean \mu and determined the intervals between occurrences of a previously designated target value X. For example, for \mu = X = 100, the theoretical Poisson probability is p_X = 0.039 86. In one experiment the mean interval between occurrences of X was \langle N_X \rangle = 24.9406, from which followed \bar{p}_X = 1/\langle N_X \rangle = 0.040 10. The corresponding mean reciprocal interval was \langle N_X^{-1} \rangle = 0.1377, which when substituted into (5.6.7) yielded the solution \hat{p}_X = 0.041 47. Interestingly, in trying numerous values of \mu and X, the values of \bar{p}_X always came out a little closer than \hat{p}_X to the theoretical Poisson value p_X, irrespective of whether the specified mean or actual sample mean was used to calculate \bar{p}_X.
Consider next the negative moments of a chi-square variate V^2 = \chi_d^2 of d degrees of freedom, for which the moment generating function is

g_{V^2}(t) = (1 - 2t)^{-d/2}.   (5.6.8)

Application of (5.6.2) gives

\langle V^{-2n} \rangle = \frac{1}{\Gamma(n)} \int_0^{\infty} t^{n-1} (1 + 2t)^{-d/2}\, dt = \frac{1}{2^n\, \Gamma(n)} \int_1^{\infty} u^{-d/2} (u - 1)^{n-1}\, du = \frac{\Gamma\!\left( \frac{d}{2} - n \right)}{2^n\, \Gamma\!\left( \frac{d}{2} \right)} \qquad (d > 2n),   (5.6.9)

which reduces to

\langle V^{-2} \rangle = \frac{1}{d - 2}, \qquad \langle (V^2)^{-2} \rangle = \frac{1}{(d - 2)(d - 4)}   (5.6.10)

for the two lowest negative moments. Combining the even moments of a standard normal variate U,

\langle U^{2m} \rangle = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^{2m}\, e^{-x^2/2}\, dx = \frac{2^m\, \Gamma\!\left( m + \frac{1}{2} \right)}{\Gamma\!\left( \frac{1}{2} \right)}   (5.6.11)

(since the odd moments vanish) with the negative moments of a chi-square variate, we obtain the nonvanishing (even) moments of the Student t variate T = d^{1/2}\, V^{-1} U

\langle T^{2m} \rangle = d^m\, \langle U^{2m} \rangle \langle V^{-2m} \rangle = \frac{d^m\, \Gamma\!\left( m + \frac{1}{2} \right) \Gamma\!\left( \frac{d}{2} - m \right)}{\Gamma\!\left( \frac{1}{2} \right) \Gamma\!\left( \frac{d}{2} \right)} = \frac{d^m\, B\!\left( m + \frac{1}{2},\, \frac{d}{2} - m \right)}{B\!\left( \frac{1}{2},\, \frac{d}{2} \right)} \qquad (m = 1, 2, 3, \ldots;\; d > 2m),   (5.6.12)

where the relation \Gamma\!\left( \frac{1}{2} \right) = \sqrt{\pi} was employed in the first line of (5.6.12). Using (5.6.10) and (5.6.11) we obtain the variance of the t distribution

\sigma_T^2 = \langle T^2 \rangle = d\, \langle U^2 \rangle \langle V^{-2} \rangle = \frac{d}{d - 2}   (5.6.13)

by a different method than that employed in Chapter 1.
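The result (5.6.13) is easy to test by constructing T from its ingredients. A minimal sketch, with illustrative degrees of freedom and sample size:

```python
import math
import random

# Check var(T) = d/(d-2) for T = sqrt(d)*U/V with U = N(0,1)
# and V^2 a chi-square variate of d degrees of freedom.
rng = random.Random(8)
d, n = 10, 300_000

def chi2(rng, d):
    return sum(rng.gauss(0, 1)**2 for _ in range(d))

ts = [math.sqrt(d) * rng.gauss(0, 1) / math.sqrt(chi2(rng, d))
      for _ in range(n)]
var_t = sum(t * t for t in ts) / n   # the mean of T is 0 by symmetry
theory = d / (d - 2)                 # = 1.25 for d = 10
```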
For a normal variate Y = N(\mu, \sigma^2) the integral (5.6.2) diverges, since g_Y(-t) = e^{-\mu t + \sigma^2 t^2/2} grows without bound; strictly speaking, a normal random variable has no finite negative moments. Expanding the factor e^{\sigma^2 t^2/2} in a power series and integrating term by term nevertheless generates the asymptotic series

\langle Y^{-n} \rangle = \frac{1}{\Gamma(n)} \int_0^{\infty} t^{n-1}\, e^{-\mu t + \sigma^2 t^2/2}\, dt \sim \frac{1}{\Gamma(n)} \sum_{k=0}^{\infty} \frac{(\sigma^2/2)^k}{k!\, \mu^{n+2k}} \int_0^{\infty} z^{n+2k-1} e^{-z}\, dz = \frac{1}{\mu^n} \sum_{k=0}^{\infty} \frac{(n + 2k - 1)!}{k!\, (n-1)!}\, \frac{(\sigma/\mu)^{2k}}{2^k}.   (5.7.1)

Although (5.7.1) still diverges for an upper limit of k_{max} = \infty, it can provide useful and practically stable estimates for a suitably chosen finite k_{max}. A practical convergence criterion, that the (k+1)th term be smaller than the kth term, yields the inequality

(\sigma/\mu)^2 < \frac{2(k + 1)}{(n + 2k)(n + 2k + 1)}.   (5.7.2)
Table 5.1 Estimates of the negative moments of Y = N(\mu, \sigma^2) with \mu = 10 from the truncated series (5.7.1), compared with \langle \tilde{Y}^{-n} \rangle calculated from the characteristic function (5.7.3).

\sigma/\mu   n   k_max = 10   k_max = 20   k_max = 30   k_max = 40   \langle \tilde{Y}^{-n} \rangle
0.1          1   0.101        0.101        0.101        0.101        0.101
0.1          2   0.0103       0.0103       0.0103       0.0103       0.0103
0.1          3   0.001 065    0.001 065    0.001 065    0.001 065    0.001 065
0.2          1   0.105        0.105        0.111        142          0.105
0.2          2   0.0116       0.0116       0.0471       1136         0.0116
0.2          3   0.001 369    0.001 45     0.109        4605         0.001 369
We can also attack the problem of negative moments by means of the characteristic function (cf) which, in contrast to the mgf, leads to useful expressions in terms of the real part of a convergent integral

\langle \tilde{Y}^{-n} \rangle = \mathrm{Re}\left\{ \frac{(-i)^n}{\Gamma(n)} \int_0^{\infty} \left( \cos\mu z + i \sin\mu z \right) z^{n-1}\, e^{-\frac{1}{2}\sigma^2 z^2}\, dz \right\},   (5.7.3)

where the tilde above the symbol Y distinguishes the moments from those calculated from (5.7.1). Eq. (5.7.3) leads to the following explicit expressions for the first four negative moments

\langle \tilde{Y}^{-1} \rangle = \int_0^{\infty} \sin(\mu z)\, e^{-\frac{1}{2}\sigma^2 z^2}\, dz

\langle \tilde{Y}^{-2} \rangle = -\int_0^{\infty} z \cos(\mu z)\, e^{-\frac{1}{2}\sigma^2 z^2}\, dz

\langle \tilde{Y}^{-3} \rangle = -\frac{1}{2} \int_0^{\infty} z^2 \sin(\mu z)\, e^{-\frac{1}{2}\sigma^2 z^2}\, dz

\langle \tilde{Y}^{-4} \rangle = \frac{1}{6} \int_0^{\infty} z^3 \cos(\mu z)\, e^{-\frac{1}{2}\sigma^2 z^2}\, dz,   (5.7.4)
the numerical values of which were included in Table 5.1 as a standard of comparison
with the values obtained by series truncation.
The relation between the moments ⟨Y⁻ⁿ⟩ and ⟨Ỹ⁻ⁿ⟩ for the relatively large ratio σ/μ = 0.25 is illustrated in Figure 5.6 as a function of cut-off kmax for n = 1, 2, 3. The two modes of estimating the negative moment of a normal variate yield the same numerical values (to three decimal places) over a wide range of cut-off limits that satisfy the convergence criterion.
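The convergent integrals (5.7.4) are easy to check numerically. The following sketch is my own construction, using a plain composite trapezoidal rule (rather than any particular quadrature library); for μ = 10, σ = 2 it reproduces the characteristic-function values ⟨Ỹ⁻¹⟩ ≈ 0.105 and ⟨Ỹ⁻²⟩ ≈ 0.0116 of Table 5.1:

```python
import math

def neg_moment_cf(mu, sigma, n, upper=None, steps=200_000):
    """Negative moments <Y~^-n> of Y = N(mu, sigma^2) from the convergent
    integrals (5.7.4), evaluated by the composite trapezoidal rule."""
    if upper is None:
        upper = 10.0 / sigma   # exp(-sigma^2 z^2 / 2) is negligible beyond this
    integrands = {
        1: lambda z: math.sin(mu * z),
        2: lambda z: -z * math.cos(mu * z),
        3: lambda z: -0.5 * z ** 2 * math.sin(mu * z),
        4: lambda z: (z ** 3 / 6.0) * math.cos(mu * z),
    }
    f = integrands[n]
    h = upper / steps
    total = 0.5 * (f(0.0) + f(upper) * math.exp(-0.5 * (sigma * upper) ** 2))
    for i in range(1, steps):
        z = i * h
        total += f(z) * math.exp(-0.5 * (sigma * z) ** 2)
    return total * h
```

Unlike the truncated series, these integrals are insensitive to any cut-off choice, which is the point of passing from the mgf to the cf.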
Fig. 5.6 Plot of the variation in negative moment m₋ₙ = ⟨Y⁻ⁿ⟩ of normal variate Y = N(10, (2.5)²) for n = 1 (top panel), 2 (middle panel), and 3 (bottom panel) as a function of cut-off kmax in Eq. (5.7.1) (solid) and corresponding moment ⟨Ỹ⁻ⁿ⟩ calculated from the characteristic function, Eq. (5.7.3) (dashed).
When I first published these results, a few readers could not get past the point that
a normal random variable theoretically has no finite negative moments, despite this
point having been made explicitly in the published papers. It may therefore be
worthwhile to stress again here, lest a mathematical purist begin to reach for his or
her keyboard, that although mathematical expressions may manifest singularities,
finite real physical systems generally do not. The task of a practically minded
physicist, in contrast to a mathematician's, is to devise ways of drawing information
from the real world of things, not the hypothetical world of numbers. In such cases,
expediency may take precedence over rigor. Lord Rayleigh (John William Strutt) said this aptly in his preface to The Theory of Sound:6
In the mathematical investigations I have usually employed such methods as present themselves naturally to a physicist. The pure mathematician will complain . . . of deficient rigour.
But to this question there are two sides. For, however important it may be to maintain a
uniformly high standard in pure mathematics, the physicist may occasionally do well to rest
content with arguments which are fairly satisfactory and conclusive from his point of view. To
his mind, exercised in a different order of ideas, the more severe procedure of the pure
mathematician may appear not more but less demonstrative.
Substitution of the series (5.7.1), truncated at (σ/μ)⁸, into (5.6.1) yields the following explicit estimations for moments of the ratio of two normal variates N(μ₁, σ₁²) and N(μ₂, σ₂²) under the condition (σ₂/μ₂)² < 1:
m₁ = (μ₁/μ₂)[1 + (σ₂/μ₂)² + 3(σ₂/μ₂)⁴ + 15(σ₂/μ₂)⁶ + 105(σ₂/μ₂)⁸]   (5.7.5)

m₂ = ((μ₁² + σ₁²)/μ₂²)[1 + 3(σ₂/μ₂)² + 15(σ₂/μ₂)⁴ + 105(σ₂/μ₂)⁶ + 945(σ₂/μ₂)⁸]   (5.7.6)

m₃ = (μ₁(μ₁² + 3σ₁²)/μ₂³)[1 + 6(σ₂/μ₂)² + 45(σ₂/μ₂)⁴ + 420(σ₂/μ₂)⁶ + 4725(σ₂/μ₂)⁸]   (5.7.7)

m₄ = ((μ₁⁴ + 6μ₁²σ₁² + 3σ₁⁴)/μ₂⁴)[1 + 10(σ₂/μ₂)² + 105(σ₂/μ₂)⁴ + 1260(σ₂/μ₂)⁶ + 17325(σ₂/μ₂)⁸].   (5.7.8)
We will see shortly the utility of these estimated negative moments and the values obtained from the integral expressions (5.7.4).
Table 5.2 compares the values of the first four moments, standard deviation, skewness, and kurtosis of the quotient distribution N(10,1)/N(5,1) obtained from 50 000 samples of a Gaussian RNG with corresponding statistics calculated by means of the Gaussian mgf (for positive moments) and cf (for negative moments), which, according to the convergence criterion, gives results equivalent to the series expansion in (σ₂/μ₂)² to eighth order at least. The match between theory and computer experiment is impressive.
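A Monte Carlo cross-check of this comparison is straightforward. This sketch (stdlib only; the sample size and seed are arbitrary choices of mine, not the book's) sets the simulated first moment of N(10, 1)/N(5, 1) against the series estimate (5.7.5):

```python
import math
import random

def quotient_m1_mc(mu1, sig1, mu2, sig2, n_samples, seed=1):
    """Monte Carlo first moment of the quotient N(mu1, sig1^2)/N(mu2, sig2^2)."""
    rng = random.Random(seed)
    s = 0.0
    for _ in range(n_samples):
        s += rng.gauss(mu1, sig1) / rng.gauss(mu2, sig2)
    return s / n_samples

def quotient_m1_series(mu1, sig1, mu2, sig2):
    """Estimate (5.7.5): m1 = (mu1/mu2)[1 + r^2 + 3r^4 + 15r^6 + 105r^8], r = sig2/mu2."""
    r2 = (sig2 / mu2) ** 2
    return (mu1 / mu2) * (1 + r2 + 3 * r2 ** 2 + 15 * r2 ** 3 + 105 * r2 ** 4)
```

Both routes give m₁ ≈ 2.092 for N(10, 1)/N(5, 1), the value listed in Tables 5.2 and 5.3.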
Nevertheless, some cautionary remarks are in order. In the real world of physical
things, the samples one draws are unlikely to be distributed exactly like a Gaussian
random variable, and therefore negative moments or, more generally, the moments
of products and quotients of normal variates that represent composite measurements will ordinarily be finite. To the extent, however, that one is actually dealing
6 J. W. S. Rayleigh, The Theory of Sound, Vol. 1 (Dover, New York, 1945) xxxv. [First Edition 1877.]
Table 5.2 Statistics of the quotient distribution N(10, 1)/N(5, 1): 50 000-sample Gaussian RNG experiment compared with theory (mgf for positive moments, cf for negative moments).

Statistic            RNG sample   Theory
m₁                   2.091        2.092
m₂                   4.662        4.669
m₃                   11.237       11.278
m₄                   29.992       30.101
Variance             0.289        0.291
Standard deviation   0.538        0.539
Skewness             1.790        1.856
Kurtosis             11.365       10.146
Table 5.3 Moments of the quotient N(10, 1)/N(5, 1) as a function of sample size.

Sample size   m₁      m₂       m₃       m₄
Theory        2.092   4.669    11.278   30.101
50 000        2.092   4.662    11.226   29.962
100 000       2.093   4.681    10.920   41.690
150 000       2.094   4.675    11.284   30.229
200 000       2.091   4.657    11.216   30.041
500 000       2.092   4.678    11.883   69.686
1 000 000     2.092   4.667    11.32    32.458
2 000 000     2.092   4.666    11.343   33.698
with normally distributed variates drawn, for example, from an acceptable pseudo-random number generator, the evidence of divergence of the moments will eventually show up in samples of sufficient size. This occurs because the larger the number
of trials, the more likely there will occur outlying values in the tail of the distribution.
These have low probability, but in the aggregate are responsible for the divergence of
the moments hYni. Moreover, the higher the order n, the more prone is the moment
hYni to diverging.
Table 5.3 summarizes the moments of the quotient N(10,1)/N(5,1) obtained by
drawing two sets of Gaussian variates from samples of increasing size. The ratio
σ₂/μ₂ = 0.2 of the denominator distribution is small enough that the sample
moments m1, m2, m3 of the quotient distribution remained finite, stable, and close
to the theoretical estimate for sample sizes in the millions. The moment m4, however,
although in agreement with theory for most samples, was too high for sample
sizes of 100 000 and 500 000. There was nothing special about these sample sizes.
The moments of the sample are themselves random numbers formed by quotients of
random numbers. Had samples been drawn again (which, in fact, was done), the
fourth moments might well have disagreed with theory for some other choices of
sample size.
Sample size   m₁      m₂       m₃           m₄
1000          2.693   8.047    27.383       108.751
5000          2.701   8.532    46.156       868
10 000        2.684   8.442    8.383        1319
25 000        2.701   8.272    32.653       236
50 000        2.712   8.763    82.132       6265
100 000       2.515   3440     6.3 × 10⁷    1.2 × 10¹²
150 000       2.702   13.686   2193         2.4 × 10⁶
M. P. Silverman, W. Strange, C. R. Silverman, and T. C. Lipscombe, Tests of alpha-, beta-, and electron capture
decays for randomness, Physics Letters A 262 (1999) 265; M. P. Silverman and W. Strange, Experimental tests for
randomness of quantum decay examined as a Markov process, Physics Letters A 272 (2000) 1.
able to ascertain, appears to occur entirely randomly and without regard to past influences. Until shown otherwise, the process of nuclear decay is nature's most perfect random number generator. I recognize, of course, that nuclear processes clearly have physical causes tied to the weak, electromagnetic, and strong interactions. By non-deterministic I mean that, as a consequence of physical laws and not merely a technologically remediable lack of information, it is not possible to predict which nuclei of a sample will decay or when.
It is useful, therefore, to turn to nuclear physics to test the statistical distributions of composite measurements developed in the preceding sections. Consider an experiment8 to obtain the distributions of the product and ratio of radioactive decays occurring through the two branching decay modes of 212Bi, each of which leads (directly or secondarily) to an alpha particle:
(a) 212Bi → 212Po → 208Pb (β branch, ratio 64.06%)
(b) 212Bi → 208Tl (α branch, ratio 35.94%).
8 M. P. Silverman, W. Strange, and T. C. Lipscombe, Quantum test of the distribution of composite physical measurements, Europhysics Letters 57 (2004) 572–578.
M. P. Silverman, W. Strange, C. Silverman, and T. C. Lipscombe, Tests for randomness of spontaneous quantum decay, Physical Review A 61 (2000) 042106 (1–10).
The phrase "dwell time" originally signified the time cargo remains in a terminal's in-transit storage area while awaiting clearance for shipment.
[Figure: histograms of 212Bi and 212Po decay counts with Gaussian envelopes N(9.8, 9.8) and N(14.7, 14.7).]
          m₁      m₂      m₃       Standard deviation   Skewness
Sample    1.683   3.660   12.744   0.909                5.069
Theory    1.638   3.520   12.349   0.914                5.033
Fig. 5.8 Distribution of 212Po/212Bi decays. Upper panel: 4096 samples of 2-bin data with theoretical density (solid) for NPo(14.7, 14.7)/NBi(9.8, 9.8). Lower panel: 1023 samples of 8-bin data with superposition of theoretical density (solid) for NPo(58.9, 58.9)/NBi(39.4, 39.4).
          m₁        m₂            m₃            Standard deviation   Skewness
Sample    145.322   2.475 × 10⁴   4.820 × 10⁶   60.264               0.769
Theory    145.307   2.483 × 10⁴   3.608 × 10⁶   60.986               0.559
Fig. 5.9 Distribution of 212Po × 212Bi decays: 4096 samples of 2-bin data with theoretical density for NPo(14.7, 14.7) × NBi(9.8, 9.8) (solid).
increasing sample size. This is illustrated in Figure 5.10, which compares the 2-bin
quotient data and the corresponding exact theoretical distribution calculated from
Eq. (5.3.6). To preserve the striking visual identity of the two histograms, they are
plotted in separate panels rather than as superposed figures in one panel.
The appearance of pseudo fluctuations is even more conspicuous in a distribution
of Poisson products in which the samples are all integer although the discrete parent
distributions are enveloped closely by smooth Gaussian functions. Nevertheless, the
spikey pattern in the product distribution is reproduced precisely by the exact
distribution law, Eq. (5.3.5), as illustrated in Figure 5.11 for the case Poi(10) Poi(5)
simulated by a Poisson RNG. The influence of such pseudo fluctuations on the
statistics of a composite measurement (product or ratio) diminishes, however, with
Fig. 5.10 Histogram of 4096 samples of experimental 212Po/212Bi 2-bin data (upper panel) compared with the exact theoretical probability function (lower panel) for Poi₁(14.7)/Poi₂(9.8). Apparent fluctuations are not random, but are stable reproducible features resulting from mathematical constraints on the sum in Eq. (5.3.6).
increasing mean values of the parent distributions, which, as expected, is the condition
under which a Poisson distribution tends toward a normal distribution.
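The "pseudo fluctuations" can in fact be reproduced without any random sampling at all. The sketch below is a direct enumeration of Poisson pairs (my own construction, standing in for the closed-form sum of Eq. (5.3.6), which is not shown in this excerpt); it builds the exact probability mass of the quotient Poi(14.7)/Poi(9.8). Because the support is the fixed set of rationals k₁/k₂, the spiky structure is deterministic and reproducible, exactly as the figure captions state:

```python
import math
from collections import defaultdict

def poisson_pmf(lam, k):
    """Poisson probability mass exp(-lam) lam^k / k!, in log form for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def quotient_pmf(lam1, lam2, kmax=80):
    """Exact probability mass of Q = K1/K2 for independent K1 ~ Poi(lam1),
    K2 ~ Poi(lam2), conditioned on K2 > 0 and truncated at kmax counts."""
    pmf = defaultdict(float)
    for k2 in range(1, kmax + 1):
        p2 = poisson_pmf(lam2, k2)
        for k1 in range(0, kmax + 1):
            pmf[k1 / k2] += poisson_pmf(lam1, k1) * p2
    return dict(pmf)
```

Rerunning the enumeration always yields the same spikes; only a histogram of random samples fluctuates around them.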
Fig. 5.11 Histogram of 50 000 samples (bin width 0.125) of Poi₁(14.7) × Poi₂(9.8) simulated by a Poisson RNG (upper panel) compared with the exact theoretical distribution (lower panel). The apparent fluctuations are again stable reproducible features deriving from mathematical constraints on the sum in Eq. (5.3.5).
Consider, for example, variables X and Y governed by pdfs pX(x) = a²x e^(−ax) and pY(y) = b²y e^(−by), respectively. The exact distribution of the ratio Z = X/Y,

p_{Z=X/Y}(z) = 6a²b² z/(az + b)⁴,   (5.9.1)

given by Eq. (5.3.12), does not tend toward a normal distribution for any choice of parameters a and b. The reason for this may be understood as follows. The parent distributions are actually a special case of the gamma distribution Gam(λ, m),

p(u | λ, m) = (λ^m/Γ(m)) u^(m−1) e^(−λu),   (5.9.2)

for which the mean and standard deviation are respectively m/λ and √m/λ, leading to a ratio σ/μ = 1/√m that is independent of the dimensioned parameter λ. In the example leading to (5.9.1), σ/μ will always be 1/√2 irrespective of a and b. The ratio μ/σ increases with the index m, however, and the pdf of the general quotient Z = Gam(m, a)/Gam(m, b),

p_Z(z) = (a^m b^m/B(m, m)) z^(m−1)/(az + b)^2m,   (5.9.3)

then approaches a normal density with increasing m. Such a limit, however, might be irrelevant to the study of a specific physical phenomenon whose law entails a particular value of m. For example, Planck's radiation law in the high-frequency domain takes the form of a gamma distribution (5.9.2) of radiation frequency with fixed index m = 3.
Another such circumstance (for the applicability of error propagation theory) is the Central Limit Theorem (CLT). It is frequently the case that those who make measurements do not need to deal with the distribution functions of the quantities they measure, but only with the distribution functions of the averages of those quantities over a large number of measurements N. The basic message of the CLT is that under certain specified conditions the mean of an infinite number of measurements approaches a Gaussian distribution with a standard error that decreases as N^(−1/2). Since the ratio of mean to standard deviation of the mean becomes large, we are back again to the first-mentioned circumstance where the conditions for validity of customary EPT apply.
The specified conditions for validity of the CLT are almost always taken to mean the existence of the first and second moments, a condition commonly met in practice except for a small class of functions like the Cauchy distribution. Often
11 M. P. Pignone et al., Screening and treating adults for lipid disorders, American Journal of Preventive Medicine (April 2001) 3S, 53–69.
12 Cholestech Corporation Technical Brief: Clinical Performance of the BioScanner 2000TM and the Cholestech LDX System Compared to a Clinical Diagnostic Laboratory Reference Method for the Determination of Lipid Profiles (Cholestech Corporation, Hayward, CA, 2001).
Men    Women
<3.4   <3.3
4.0    3.8
5.0    4.5
9.5    7.0
>23    >11
distribution where R̄ = N⁻¹ Σ_{k=1}^N R_k.
14 National Institutes of Health, High blood cholesterol: what you need to know, http://www.nhlbi.nih.gov/health/public/heart/chol/wyntk.htm
Exercise Prescription on the Net, Blood cholesterol, http://www.exrx.net/Testing/LDL%26HDL.html
From the computed cpf we find that Pr(0.95R̄ ≤ R ≤ 1.05R̄) = 20.2%, a value that does not inspire much confidence.
How many measurements, then, would have to be made on a particular blood sample for the physician to be 90% confident that R is within 10% of the observed mean value R̄? To answer this question, we can make use of the transformation ξ following Eq. (5.5.14), which yielded a standard normal variate N(0, N⁻¹). The values of ξ corresponding to the limits (0.95R̄, 1.05R̄) are (ξ_min = −0.264, ξ_max = 0.247). The sought-for number of trials, obtained by equating the integrated probability function of a standard normal variable to 90%,

(N/2π)^(1/2) ∫_{ξ_min}^{ξ_max} e^(−Nξ²/2) dξ = 0.90,

is N = 41. In general, however, only one or two tests are performed per patient per year.
In short, given the virtually explosive growth in prescriptions and sales of statin
drugs, conceivably on the basis of single annual determinations of a ratio whose
uncertainty is not ordinarily known or understood by physicians, it would seem that
the measurement, reporting, and diagnostic interpretation of lipid panel analytes are
matters for serious reevaluation.
And it is not just lipid panel tests that involve composite diagnostic indices whose
distribution and uncertainty are unknown or incorrectly determined or omitted in the
summary report of results. Other examples may include the ratio of blood urea
nitrogen (BUN) to creatinine, which is used to ascertain the likelihood of prerenal
injury, or the ratio of albumin to globulin, which is an indicator of a potential kidney
or liver disorder. Although test reports often contain what a particular laboratory considers a range of "normals," they do not usually provide any interpretation of what "normal" means. Moreover, if by "normal" is meant that the composite index follows a normal distribution, that assumption is almost certainly incorrect. And the CLT is of no help in cases like these because the number of measurements is too few.
To be sure, a competent physician is unlikely to prescribe a life-long medication
on the basis of a single test. Nor is it my intention to sow seeds of distrust of the
blood tests that are performed. Rather, as a physicist I know that no measurement is
significant or interpretable without an understanding of the uncertainty with which
it was obtained. This knowledge is no less applicable to diagnostic medicine than it is to physics.
5.11 Secular equilibrium
Within the Earth's interior, the radioactive nuclei comprising each of three distinct decay chains beginning with uranium-238 (238U), thorium-232 (232Th), or uranium-235 (235U) are in a state of secular equilibrium. This means that the activities of all the
radioactive species within a series are nearly equal. The activity of a radionuclide refers to the product of its decay constant λ and quantity (or concentration); the decay constant is equal to ln 2 divided by the half-life τ. Secular equilibrium can occur under the conditions that
(a) the half-life of the parent nucleus is much longer than the half-life of any of the daughter products in the series, and
(b) a sufficiently long time has elapsed to allow for the daughter products to develop.
The radium (224Ra) used as a source in the nuclear test of composite measurements discussed in Section 5.8 was produced over a period of approximately 40 years through a chain of transmutations starting from an oxide of 232Th and eventually ending with a stable isotope of lead (208Pb), as shown in part below up to the short-lived isotope of radon (220Rn):
232Th --(14 Gy)--> 228Ra --(5.8 y)--> 228Ac --(6.1 h)--> 228Th --(1.9 y)--> 224Ra --(3.7 d)--> 220Rn --(55.6 s)--> . . .
X₀                 X₁                 X₂                 X₃                X₄                 X₅
λ₀ = 5.0 × 10⁻¹¹   λ₁ = 0.12          λ₂ = 995.4         λ₃ = 0.37         λ₄ = 68.4          λ₅ = 3.9 × 10⁵
(5.11.1)
The numbers above the arrows give the half-life τᵢ of each transition in a convenient time unit (s = second, h = hour, d = day, y = year, Gy = 1 billion years). The numbers below the arrows give the decay constants (in units of y⁻¹)

λᵢ = ln 2/τᵢ   (5.11.2)

that enter the rate equations

dXᵢ/dt = −λᵢXᵢ + λᵢ₋₁Xᵢ₋₁   (5.11.3)

that account, like a financial balance sheet, for the creation and destruction of the nuclide on the left side. When secular equilibrium occurs, there is no net production or loss of nuclei (i.e. dXᵢ/dt = 0) and it then follows from (5.11.3) that

λ₀X₀ = λᵢXᵢ   (i = 1 . . . n).   (5.11.4)

The substitution

Xᵢ(t) = Yᵢ(t) e^(−λᵢt)   (5.11.5)

transforms (5.11.3) into the coupled equations

dYᵢ/dt = λᵢ₋₁ e^((λᵢ − λᵢ₋₁)t) Yᵢ₋₁,   (5.11.6)
Fig. 5.12 Approach to secular equilibrium of the first four daughter products in the 232Th decay series. Initial 232Th activity (dashed) has not perceptibly changed in 50 y. The nuclides 228Ra, 228Ac (solid gray) and 228Th, 224Ra (solid black) follow two different decay curves, which merge about 40 years after preparation of the parent sample of 232Th.
which are solved iteratively,

Yᵢ(t) = λᵢ₋₁ ∫₀^t e^((λᵢ − λᵢ₋₁)t′) Yᵢ₋₁(t′) dt′,   (5.11.7)

starting with the solution Y₀ = constant (here taken to be 1) and boundary conditions Yᵢ(0) = 0. The method of analysis leading to (5.11.7) is analogous to one I have devised in quantum mechanics to solve harmonically driven transitions among the states of a multi-state atom.15 It is interesting how widely disparate physical systems can be studied by a few fundamental mathematical methods.
A plot (Figure 5.12) of the relative concentrations Xᵢ(t)/X₀(t) in (5.11.5) for the given decay rates shows that the concentration curve of 224Ra (i = 4 in the sequence) flattens at around 30 years and achieves close to 99% of its secular equilibrium value at 40 years. The thorium-232 half-life is so long that the concentration X₀(t) is effectively constant for the time period under consideration. Note that there are actually four plots in Figure 5.12, but only two distinct curves are seen because two of the plots overlap two others. This curious feature is a fortuitous outcome of the numerical values of the various decay constants, which reduce the exact solutions for the activities to the two nearly exact approximate expressions
15
M. P. Silverman, Probing The Atom: Interactions of Coupled States, Fast Beams, and Loose Electrons (Princeton
University Press, Princeton NJ, 2000), Chapter 3.
Aᵢ(t) ≈ λ₀(1 − e^(−λ₁t)),   i = 1, 2
Aᵢ(t) ≈ λ₀[1 − (λ₃ e^(−λ₁t) − λ₁ e^(−λ₃t))/(λ₃ − λ₁)],   i = 3, 4   (5.11.8)
16 The statistical appellation connotes a process whereby a state Eₙ can change only to the state Eₙ₋₁. See W. Feller, An Introduction to Probability Theory and its Applications, Vol. 1 (Wiley, New York, 1957) 402–403.
17 S. Pommé, Problems with the uncertainty budget of half-life measurements, in: T. M. Semkow et al. (eds.), Applied Modeling and Computations in Nuclear Science, ACS Symposium Series 945 (American Chemical Society, Washington, DC, 2007) 282–292.
τᵢⱼ = tᵢⱼ ln 2/ln Zᵢⱼ   (5.12.1)

N = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} 1 = ½ n(n − 1).   (5.12.2)
Note: the requirement that tⱼ occur after tᵢ means that Aⱼ is theoretically smaller than Aᵢ, in which case Zᵢⱼ > 1 and ln Zᵢⱼ is a positive number. However, because the activities are random variables, some observed values of Zᵢⱼ can in fact turn out to be less than 1, whereupon Eq. (5.12.1) would yield a physically unacceptable negative value for the corresponding half-life. If such cases occur in the data, do not include them in the analysis.
Zᵢⱼ = N(μᵢ, σᵢ²)/N(μⱼ, σⱼ²) ≈ N(μᵢⱼ, σᵢⱼ²)   (j > i),   (5.12.3)

where

μᵢⱼ = μᵢ/μⱼ = e^(λ̂(tⱼ − tᵢ)) = e^(tᵢⱼ ln 2/τ̂)   (5.12.4)

σᵢⱼ² = (e^(tⱼ ln 2/τ̂)/μ₀)(1 + e^(tᵢⱼ ln 2/τ̂)) e^(tᵢⱼ ln 2/τ̂),   (5.12.5)

with μᵢ = μ₀ e^(−λ̂tᵢ) and σᵢ² = μᵢ.
The true intrinsic decay rate λ̂ and half-life τ̂ are constant parameters not to be confused with the estimates (5.12.1) calculated from pairs of activities. The explicit time dependences in (5.12.4) and (5.12.5) follow from the Poissonian character of nuclear decay.
To analyze the histogram of two-point half-life estimates, we need the pdf of the random variable τ, whose functional dependence on Z is given by (5.12.1). The procedure for transforming pdfs should now be familiar. Given the Gaussian pdf p_Z(z), the pdf p_T(τ) is calculable from

p_T(τ) = p_Z(z(τ)) |dz/dτ|,   (5.12.6)

where the transformation function (or Jacobian) for a particular variate Zᵢⱼ is

dZᵢⱼ/dτ = −(tᵢⱼ ln 2/τ²) e^(tᵢⱼ ln 2/τ).   (5.12.7)
The composite pdf representative of the entire sample of n independent measurements is the normalized sum of the pdfs of the individual variates. Note that the sum of the pdfs is not the pdf of the sum of variates, which would represent an entirely different quantity, namely a measurement comprising the sum of all two-point half-life estimates.
Putting the pieces of the preceding analysis together leads to the exact expression

p_T(τ) = [n(n − 1)/2]⁻¹ (ln 2/√(2π)) Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} (tᵢⱼ/(σᵢⱼ τ²)) e^(tᵢⱼ ln 2/τ) exp{−(e^(tᵢⱼ ln 2/τ) − μᵢⱼ)²/(2σᵢⱼ²)},   (5.12.8)

which looks (and is) quite complicated. The basic structure, however, can be interpreted as follows. The first factor is the constant normalizing the pdf to unit area
when integrated over τ. The second factor contains (in the numerator) the constant relating decay rate and half-life and (in the denominator) the normalization constant from a Gaussian distribution. The sums are over all observations such that tⱼ > tᵢ > 0. The next factor (within the sums) includes factors from the Jacobian (5.12.7) and the standard deviation from the denominator of the Gaussian distribution. The final factor is the exponential function of the Gaussian distribution. The exponential exp(tᵢⱼ ln 2/τ) appearing within the argument of the Gaussian exponential and as a prefactor is the functional relation Z(τ).
Equation (5.12.8) bears no resemblance to a Cauchy distribution. To see how this extraordinary evolution comes about, I will strip away all inessential factors from (5.12.8) and express time in units tᵢ = iΔt with Δt = 1. Then, after substitution of the explicit time-dependent expressions (5.12.4) for μᵢ and μᵢⱼ, the function in (5.12.8) takes the skeletal form

f(τ) = Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} ((j − i)/τ²) e^((j−i)/τ) exp{−μ₀ (e^((j−i)/τ) − e^((j−i)/τ̂))²},   (5.12.9)

where μ₀ is the initial mean number of counts per bin and τ̂ is the sought-for true value of the half-life. The following conditions are then imposed.
where 0 is the initial mean number of counts per bin and ^ is the sought-for true
value of the half-life. The following conditions are then imposed.
(1) and ^ are long compared to the intervals (j i).
(2) The source is strong: 0
1.
(3) Numerous measurements are made: n and N are
1.
Under these conditions a plot of (5.12.9) generates a curve that is well fit by a Cauchy
probability density.
Condition (1) is the critical step in the deconstruction, for it allows us to approximate

e^((j−i)/τ) − e^((j−i)/τ̂) ≈ (j − i)(τ⁻¹ − τ̂⁻¹)   (5.12.10)

τ⁻¹ − τ̂⁻¹ ≈ (τ̂ − τ)/τ̂².   (5.12.11)
One will also find that the form of the lineshape is not changed significantly if the variable τ is replaced by the constant τ̂ in the denominators of the prefactors ((j − i)/τ²) e^((j−i)/τ). At this point, we have transformed the exact function (5.12.9) into a sum of Gaussians

f(τ) ≈ Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} ((j − i)/τ̂²) e^((j−i)/τ̂) exp{−(μ₀/τ̂⁴)(j − i)²(τ − τ̂)²}.   (5.12.12)
The exponential function falls off rapidly outside a narrow interval around τ̂ and has an argument smaller than 1 close to τ̂. Thus, one can further approximate (5.12.12) by a Taylor series expansion

exp{−(μ₀/τ̂⁴)(j − i)²(τ − τ̂)²} ≈ 1 − (μ₀/τ̂⁴)(j − i)²(τ − τ̂)² ≈ {1 + (μ₀/τ̂⁴)(j − i)²(τ − τ̂)²}⁻¹,   (5.12.13)

which, apart from a normalization factor, leads directly to the form of a Cauchy function

f_C(τ) = 1/[1 + ((τ − τ̂)/γ)²].   (5.12.14)
Now to this point we have transformed (5.12.9) into a sum of Cauchy functions of different widths

f(τ) ≈ Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} [(j − i) e^((j−i)/τ̂)/τ̂²] / [1 + (μ₀/τ̂⁴)(j − i)²(τ − τ̂)²].   (5.12.15)

By ignoring the time-dependent, but non-resonant, exponential in the numerator, which computer analysis confirms to have little consequence, we can, in fact, go one step further and judiciously approximate the variable quantities (j − i), (j − i)² (e.g. by their means), and thereby collapse the double sum in (5.12.15) to a single Cauchy function.
To return to the problem with which we began: upon restoration of the physical constants, the exact expression for the pdf of two-point half-life measurements (5.12.8) can be accurately represented by a Cauchy density (5.12.14) centered on the true half-life τ̂ with approximate width parameter

γ = √6 τ̂²/(n ln 2 √μ₀).   (5.12.16)

The greater the number of measurements n, the narrower is the lineshape, and the better the empirical Cauchy pdf matches the theoretically exact pdf.
Figure 5.13 compares the theoretically exact and empirical Cauchy densities for
different numbers of two-point activity measurements of a hypothetical radioactive
nucleus with half-life of 1000 time units. For a single pair of activity measurements
n 2, the exact pdf (5.12.8) skews markedly to the right and looks nothing like either
a Cauchy or Gaussian function. For a set of 11 activity measurements (N 55 pairs),
the exact pdf begins to resemble a Cauchy function displaced a little to the left.
However, for a set of only 26 activity measurements (N 325 pairs), the exact pdf
and Cauchy densities are indistinguishable over the range of half-life values
Fig. 5.13 Plot of exact probability density (solid gray) of the half-life distribution compared with a single Cauchy density (dashed black) of width given by Eq. (5.12.16) for sample size n: (a) 2, (b) 10, (c) 25. Parameters of the calculation are: true half-life τ₀ = 1000 Δt, initial mean count rate μ₀ = 10⁷/Δt, counting interval Δt = 1. The time unit is arbitrary, but 1 day has been used in application to long-lived radionuclides.
displayed. The Cauchy lineshapes in Figure 5.13 are actually centered on 996.5,
rather than on 1000. The small displacement, however, vanishes in the limit of
increasing N.
In short, location of the center of the histogram of two-point half-life estimates
leads directly to the true value of the half-life without the need for curve fitting. We
can estimate the uncertainty in the value of the half-life by compounded use of the
approximation (5.2.5) for variance of a function of a random variable, starting with
the relation between half-life and activity
τ = t ln 2/ln Z ⇒ var(τ) ≈ ((t ln 2)²/[Z (ln Z)²]²) var(Z),   (5.12.17)

where t is the interval between measurements of the two activities comprising the ratio Z = Aᵢ/Aⱼ. Next, one applies (5.2.5) again to obtain var(Z) in terms of the variances of the two activities (or counts)

Z = Aᵢ/Aⱼ ⇒ var(Z) ≈ σᵢ²/μⱼ² + (μᵢ²/μⱼ⁴) σⱼ²,   (5.12.18)
which, by Poisson statistics, are equal to the respective means. Combining (5.12.17) and (5.12.18) leads to the variance of one two-point estimate

var(τᵢⱼ) ≈ (τ̂⁴/(tᵢⱼ² (ln 2)²))(μᵢ⁻¹ + μⱼ⁻¹),   (5.12.19)

whereupon the mean variance for the entire set of N samples is obtained by summing over all pairs

var(τ) = (τ̂⁴/(N (ln 2)²)) Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} tᵢⱼ⁻² (μᵢ⁻¹ + μⱼ⁻¹).   (5.12.20)
For a strong source whose activity decays little over the total counting period, the mean counts are all close to μ₀, so that

μᵢ⁻¹ + μⱼ⁻¹ ≈ 2/μ₀,   (5.12.21)

and, with tᵢⱼ = (j − i)Δt,

var(τ) ≈ (2τ̂⁴/((ln 2)² Δt² μ₀)) (1/N) Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} (j − i)⁻²,   (5.12.22)

which takes a particularly simple form

var(τ) ≈ 12τ̂⁴/((ln 2)² Δt² μ₀ n(n + 1))   (5.12.23)

upon judiciously replacing (j − i)⁻² by the reciprocal of its mean over all pairs, calculated from

Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} (j − i)² = (1/12) n²(n² − 1).   (5.12.24)
Appendix
The pdf of the composite variate W = XY/Z follows from

p_{W=XY/Z}(w) = ∫∫ p_U(u) p_Z(z) δ(w − u/z) dz du,   (5.13.1)

which leads to

p_{W=XY/Z}(w) = ∫ p_U(zw) p_Z(z) |z| dz.   (5.13.2)

The pdf of the product U = XY is

p_{XY}(u) = ∫ p_X(x) p_Y(u/x) |x|⁻¹ dx,   (5.13.3)

which, when substituted into (5.13.2), gives the final result

p_{W=XY/Z}(w) = ∫ |z| p_Z(z) dz ∫ |x|⁻¹ p_X(x) p_Y(zw/x) dx.   (5.13.4)
For example, for X, Y, Z all uniform on (0, 1), the integrations yield the piecewise form

p_W(w) = ∫ z dz ∫ y⁻¹ dy p_Z(z) p_Y(y) p_X(wz/y) = (1/4)(1 − 2 ln w) for 0 < w ≤ 1, and 1/(4w²) for w ≥ 1.   (5.13.5)
Fig. 5.14 Probability density functions of Gaussian variates X = N(4, 1), Y = N(10, 1), Z = N(8, 2) (dashed) and the composite variate W = XY/Z (solid) obtained from Eq. (5.13.8) by numerical integration.
Table 5.8 Statistics of the composite variate W = XY/Z.

Quantity                    Symbol   Value     EPT
Mean of X: ⟨X⟩              μ₁       10
Standard deviation of X     σ₁       1
Mean of Y: ⟨Y⟩              μ₂       4
Standard deviation of Y     σ₂       1
Mean of Z: ⟨Z⟩              μ₃       8
Standard deviation of Z     σ₃       2
Expectation ⟨W⟩             m₁       5.37      5.31
Expectation ⟨W²⟩            m₂       33.99
Expectation ⟨W³⟩            m₃       256.11
Expectation ⟨W⁴⟩            m₄       2321.23
Standard deviation of W     σ_W      2.27      1.84
Skewness                    Sk_W     696.28
Kurtosis                    K_W      7.76
Substitution of the corresponding pdfs into (5.13.4) leads to a relation (the integrations can be performed in either order)

p_W(w) = (1/((2π)^(3/2) σ₁σ₂σ₃)) ∫ |z| e^(−(z−μ₃)²/2σ₃²) dz ∫ |x|⁻¹ e^(−(x−μ₁)²/2σ₁²) e^(−(zw/x−μ₂)²/2σ₂²) dx,   (5.13.8)
which would result in a very unwieldy expression if evaluated further analytically. Figure 5.14, obtained by performing the integration in (5.13.8) numerically, shows the density p_W(w) for W = N(4,1) × N(10,1)/N(8,2). The moments of W, also obtained by numerical integration using (5.13.8), are summarized in Table 5.8 and compared with the mean and variance predicted by the approximate relations (5.2.3) and (5.2.5) of error propagation theory. As seen in the figure, p_W(w) is skewed markedly forward, signifying that the probability of drawing outlying events in samples of large size will be significantly greater than in the case of a normal distribution. The large value of kurtosis further quantifies the fat tail and narrow peak of the distribution.
6 Doing the numbers: nuclear physics and the stock market
Louis Bachelier, from Théorie de la Spéculation [Theory of Speculation], a thesis presented to the Faculty of Sciences of the Academy of Paris on 29 March 1900, unnumbered page of the Introduction. (Translation from French by M. P. Silverman.)
non-random behavior. The results did not overthrow quantum mechanics, but they were published because in science it is always important to test one's beliefs carefully and thoroughly.
The daily closing prices of stocks in a stock market provide another time series of numbers. For a while the series may rise; shareholders are happy, and economists will tell you why the market is doing well. Then things change; the series may fall; shareholders are unhappy, and economists will now tell you why the market is doing poorly. Economists will always have reasons for why the market is not doing well and can always propose solutions to fix the problem. Unlike physicists, who (for the most part) are in accord over the fundamental principles of their discipline and can agree when a problem is solved correctly, no two economists will likely ever agree on a solution to a problem; but, if the stock market is not doing well, they are all sure there is a problem.
But maybe there isn't a problem.
I have examined the change in daily closing prices of numerous stocks and funds with the same tests I used to search for non-random behavior in the disintegration of radioactive nuclei and in the emission of photons from excited atoms. I looked at these records for periods before the latest financial meltdown as well as afterward. And here are the results: in no instance did I find convincing evidence of non-random behavior. Correlations, periodicities, numerical patterns, and other statistical indices showed that, for all practical purposes of prediction,2 each company or fund could well have been some kind of radioactive nucleus. Physicists call this kind of randomness "white noise," a name that refers to a broad, flat spectrum of frequencies, rather than to the preponderance of Caucasian traders in the New York Stock Exchange. Moreover, the white-noise character of these track records seemed to be largely independent of economic, political, or social perturbations. The implications of these results, if they accurately characterize stock price fluctuations, are consequential.
First, you have undoubtedly read or been told whenever you invest that "Past
performance is no guarantee of future results." Believe it! Nevertheless, if your experience is like mine, you can sense the prospectus winking at you as it offers this warning
(usually in very small font size) because the company executives or brokerage firm really
don't want you to believe it. In one flyer I received from one of the largest financial
services companies in the USA, the warning was followed by another sentence (in larger
font size) that claimed to offer me a track record of competitive investment performance. Competitive in regard to what other random processes? If the prospectus were
completely truthful, the warning would read more transparently: "Our track record is no
more correlated with future results than is the record of decay of radioactive nuclei." Of
course that might not mean much unless the reader was a physicist.
2 I refer to prediction by an ordinary investor, not by an ultra-fast computer designed to make trades in fractions of a
second. We will come to this point in due course.
4 M. P. Silverman, "Computers, coins, and quanta: unexpected outcomes of random events," A Universe of Atoms, An
Atom in the Universe (Springer, 2002) 279–324.
5 A. J. Ziobrowski, P. Cheng, J. W. Boyd, and B. J. Ziobrowski, "Abnormal returns from the common stock investments
of the U.S. Senate," Journal of Financial and Quantitative Analysis 39 (2004) 661–676.
W. Buffett, Buy American. I am, http://www.nytimes.com/2008/10/17/opinion/17buffett.html
What Buffett neglected to mention, however, is that his wealth enabled him to undertake
risks that would be unwise for the average investor, so that irrespective of the
outcome, he would still end up wealthy whereas those of modest means who followed
his advice might lose most of their savings.
In short, the result of doing the numbers with nuclear physics is to realize with
near mathematical certainty that investing in the stock market is no different from
gambling in a casino but for one important distinction. The latter is done by choice
for amusement with money people can afford to lose (if they gamble responsibly).
However, for the increasing number of workers who must secure their retirement
income from some kind of defined contribution plan, the gambling is done out of
necessity with money they will need to live on.
To the question What can you expect to gain in the long term from investing in
the stock market? the mathematical answer is this: nothing.
Before proceeding further I will make the same disclaimer here that I make at my
lectures: I am not a financial advisor; I do not give (and am not now giving) financial
advice. I am only relating what I learned from a limited study of certain statistical
features of the stock market. Any decision readers may make or forego on the basis
of something they read in this book is their own responsibility.
Now, as the Marketplace man says, we'll have the details.
6.2 The details: CREF, AAPL, and GRNG
The time record of a stock or stock fund (I will refer to either simply as a stock) can
take an infinite variety of appearances of which three such records are shown in the
upper panel of Figures 6.1, 6.2, and 6.3 for the Stock mutual fund of the College
Retirement Equities Fund (CREF), the Apple Computer Company (AAPL), and the
Grange Information Services Corporation (GRNG). I have examined the records of
many stocks, but have chosen these three for illustration for both instructional and
personal reasons.
Most teachers and researchers in the USA who read this book have probably
invested in CREF Stock and therefore have a personal interest in the statistics of its
track record. However, I chose it also because it is an example of an actively managed
stock fund. In other words, there is a department of financial experts whose full-time job presumably is to determine which companies and how many shares of each
to include in the fund. One might expect, therefore, that if active management should
lead to results superior to "dart throwing" or nuclear decay, we should assuredly
see this in the performance of CREF Stock.
I chose AAPL because I like Apple computers. All of my books (except the first,
when I didn't have an Apple computer) and nearly all of my scientific publications
were written on one kind of Apple computer or another. However, apart from
familiarity bred of long association, I chose AAPL because nearly everyone (I would
think) has heard of the company, which at least up to the death of Steve Jobs had
[Figure 6.1 panels: closing price vs. Time (d); CREF STOCK - Log Power vs. Log Harmonic; CREF STOCK - Autocorrelation vs. Lag (d).]
Fig. 6.1 Statistics of CREF STOCK over a 4096-day period from 27 December 1994 to
4 October 2010. Top panel: time series of closing prices: raw data (upper solid); detrended
data (lower solid); lines of regression (dashed). Middle panel: log of the power spectrum of
the detrended series (solid) and line of regression (dashed) with slope characteristic of
Brownian noise. Bottom panel: autocorrelation of the original (solid) and detrended
(dashed) time series.
a global reputation for producing innovative products that captured a large loyal
customer base. By such measure it is a successful company. One might expect,
therefore, that if technological innovation accompanied by artistic flair and a sharp
instinct for what appeals to a device-buying public should lead to results superior to
"dart throwing" or nuclear decay, we should find this in the performance of
AAPL stock.
I chose GRNG for the opposite reasons. Virtually no one but a relatively small
number (compared to the general population) of technical specialists has heard of it,
and it is not actively managed.
[Figure 6.2 panels: closing price vs. Time (d); AAPL - Log Power vs. Log Harmonic; AAPL - Autocorrelation vs. Lag (d).]
Fig. 6.2 Statistics of AAPL stock over a 4096-day period from 21 July 1994 to 22 October
2010. Top panel: time series of closing prices: raw data (upper solid); detrended data (lower
solid); lines of regression (dashed). Middle panel: log of the power spectrum of the detrended
series (solid) and line of regression (dashed) with slope characteristic of Brownian noise.
Bottom panel: autocorrelation of the original (solid) and detrended (dashed) time series.
Each time record in the top panel of Figures 6.1–6.3 consists of the closing prices
on 2^12 = 4096 consecutive days, a period a little longer than 11 years. The reason for
selecting a period that is a power of 2 is that it permitted calculation of the discrete
Fourier transform (DFT) by means of a fast algorithm as explained previously.
I have taken the time unit to be Δt = 1 day because records of daily opening and
closing stock prices are readily available at no charge to the ordinary investor from
[Figure 6.3 panels: closing price vs. Time (d); GRNG - Log Power vs. Log Harmonic; GRNG - Autocorrelation vs. Lag (d).]
Fig. 6.3 Statistics of GRNG stock over a 4096-day period. Top panel: time series of closing
prices: raw data (upper solid); detrended data (lower solid); lines of regression (dashed).
Middle panel: log of the power spectrum of the detrended series (solid) and line of regression
(dashed) with slope characteristic of Brownian noise. Bottom panel: autocorrelation of the
original (solid) and detrended (dashed) time series.
the internet, and an ordinary investor (. . . works all day, comes home at night. . .)
would not likely have time or inclination to monitor a stock portfolio throughout the
day anyway.
The CREF time record in Figure 6.1 shows two plots: (a) the upper trace shows
the original time record with calculated trend line; (b) the lower trace shows the
detrended time record, i.e. after the mean and trend have been removed. The
original record shows a net increase in share value over the observed period with a
stationary component that appears to exhibit wavelike returns to zero.
The single AAPL time series in Figure 6.2 shows that the AAPL share price
continued largely unchanged for more than 2000 days before beginning (around
mid 2004) a leap upward in value, despite several sharp reversals. Perhaps the surge
in share price followed the release of some spectacular new Apple device.
The two time series of GRNG share prices in Figure 6.3 again show original and
detrended records. For some 700 days following the point of origin, the share price
trended downward until something in the economy (perhaps the release of favorable
statistics by some federal agency) triggered a steady climb upward that lasted more
than 2500 days (~ 7.1 years) or greater than 61% (2500/4096) of the total displayed
record. Alas, the economic climate again triggered a reversal (. . .Another government
report? Refusal of the Chinese government to export rare-earth elements needed for
manufacture of computer chips? Attempted takeover by Google? Indictment of company
executives for securities fraud?. . .) and the share price began another long trek
downward.
Three different stocks, three very different time records. Nevertheless, there is a
common dynamic associated with all three, as suggested by a plot of the log of the
power spectrum (of the detrended time series) against the log of the frequency (i.e.
harmonic number) shown in the middle panel of Figures 6.1, 6.2, and 6.3. All three
plots are virtually identical. The maximum likelihood lines of regression to the three
traces have respective slopes of −1.791 (CREF), −1.798 (AAPL), and −1.887 (GRNG).
The value of this slope, approximately −1.8, reveals important information concerning how well
one can forecast future price changes based on the past history. I will take up this
matter in due course.
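The slope of such a double-log spectral plot is easy to estimate numerically. The sketch below is not the author's code: it uses a simulated random walk as a stand-in for a real price record, detrends it, computes the DFT power spectrum, and fits the log-log regression line. For Brownian noise the slope comes out near −2, in the neighborhood of the roughly −1.8 found for the three stock records.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4096
t = np.arange(N)

# Hypothetical stand-in for a stock record: a pure random walk (Brownian
# noise) built from iid Gaussian daily shocks, then detrended as in the text.
walk = np.cumsum(rng.normal(size=N))
detrended = walk - np.polyval(np.polyfit(t, walk, 1), t)

# One-sided power spectrum from the discrete Fourier transform
# (harmonic 0 and the Nyquist point are excluded from the fit).
power = np.abs(np.fft.rfft(detrended))[1:N // 2] ** 2
harmonic = np.arange(1, N // 2)

# Slope of log(power) vs log(harmonic): roughly -2 for Brownian noise.
slope = np.polyfit(np.log10(harmonic), np.log10(power), 1)[0]
print(round(slope, 2))
```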
The similarity of the dynamics underlying the three time records also shows up in
the autocorrelation function, as it must, since the autocorrelation and power spectrum are related by the Wiener–Khinchin theorem. The dual traces in the lower panel
of Figures 6.1–6.3 show the sample serial correlations calculated for both the original
and detrended time records. Note that irrespective of whether the original record was
rising, falling, or flat during various parts of its history, the serial correlations are all
very similar in exhibiting a long-range correlation decreasing slowly with delay time.
The serial correlations were calculated, as described in Chapter 3, from the DFT of
the power spectrum. The maximum delay is k = N/2 for a time record of length
N, beyond which the serial correlation r_k begins to repeat. Leaving aside technical
details such as the accuracy or reliability of values of r_k for large lag times k (e.g.
values of k for which r_k becomes negative), and even overlooking the differences in r_k
between the original and detrended time series, it is apparent from these figures that
positive correlations persisted for several hundred days.
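As a sketch of the Wiener–Khinchin route just described (hypothetical code, not taken from the book), the serial correlations can be obtained as the inverse DFT of the power spectrum of the mean-removed series:

```python
import numpy as np

def autocorrelation_via_fft(x):
    """Serial correlations r_k obtained, per the Wiener-Khinchin theorem,
    as the inverse DFT of the power spectrum of the mean-removed series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    power = np.abs(np.fft.fft(x)) ** 2
    acov = np.fft.ifft(power).real / len(x)   # circular autocovariance
    return acov / acov[0]                     # normalize so that r_0 = 1

rng = np.random.default_rng(1)
walk = np.cumsum(rng.normal(size=4096))   # Brownian stand-in for a price record
r = autocorrelation_via_fft(walk)
print(round(r[0], 3), round(r[1], 3))     # r_0 = 1; r_1 is close to 1
```

For a random-walk-like record the correlations decay only slowly with lag, which is the long-range behavior seen in the lower panels of Figures 6.1–6.3.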
It is not the original or raw time record of share prices, however, that matters to
the typical investor for retirement, but the record of price changes. After all, who
337
would invest in the stock market if he did not expect the value of the purchased
shares to climb higher? Figures 6.4, 6.5, and 6.6 respectively show the time record
(upper panel), power spectrum (middle panel), and correlation function (lower panel)
of the first difference, i.e. the day-to-day price change, corresponding to the price
histories in Figures 6.1–6.3. In other words, if {x_i, i = 1...N} is the time series of daily
closing prices, then {w_j = x_j − x_{j−1}, j = 2...N} is the time series of first differences.
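In code, forming the first-difference series is a one-line operation; the prices below are made-up numbers purely for illustration:

```python
import numpy as np

# Closing prices x_1..x_N -> first differences w_j = x_j - x_{j-1}, j = 2..N.
# The prices here are invented solely to illustrate the operation.
closing = np.array([100.0, 101.5, 101.0, 103.2, 102.8])
w = np.diff(closing)                # day-to-day price changes

# Empirical +/- 2 sigma band about the (near-zero) mean of the changes,
# like the dashed limits drawn in the first-difference plots.
band = (w.mean() - 2 * w.std(ddof=1), w.mean() + 2 * w.std(ddof=1))
print(w, band[0] < 0 < band[1])
```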
The pair of dashed lines in the upper and lower panels mark ±2 standard deviations
[Figure 6.4 panels: CREF STOCK - First Difference vs. Time (d); CREF STOCK - First Difference Log Power vs. Log Harmonic; CREF STOCK - First Difference Autocorrelation vs. Lag (d).]
Fig. 6.4 Daily price change of the CREF time series in Figure 6.1. Top panel: time series
(solid) of first-difference closing prices and ±2σ limits (dashed). Middle panel: log of the
first-difference power spectrum (black points) and line of regression (dashed) characteristic of white
noise. Bottom panel: autocorrelation function of the first-difference time series (solid) and ±2σ
limits (dashed).
[Figure 6.5 panels: AAPL - First Difference vs. Time (d); AAPL - First Difference Log Power vs. Log Harmonic; AAPL - First Difference Autocorrelation vs. Lag (d).]
Fig. 6.5 Daily price change of the AAPL time series in Figure 6.2. Top panel: time series
(solid) of first-difference closing prices and ±2σ limits (dashed). Middle panel: log of the
first-difference power spectrum (black points) and line of regression (dashed) characteristic of white
noise. Bottom panel: autocorrelation function of the first-difference time series (solid) and ±2σ
limits (dashed).
about the mean (≈0) as obtained either empirically from the sample or theoretically
from a model distribution function to be discussed soon; the two calculations lead to
the same standard deviation. The dashed line in the middle panel shows the maximum likelihood line of regression, whose slope in all three cases (CREF 0.018,
AAPL 0.355, GRNG 0.056) is close to 0. As I pointed out before, the slope of
the double-log plot of the power spectrum tells us much about the nature of a time
series. Apart from an apparent excess of fluctuations, or "volatility" in the parlance
[Figure 6.6 panels: GRNG - First Difference vs. Time (d); GRNG - First Difference Log Power vs. Log Harmonic; GRNG - First Difference Autocorrelation vs. Lag (d).]
Fig. 6.6 Daily price change of the GRNG time series in Figure 6.3. Top panel: time series
(solid) of first-difference closing prices and ±2σ limits (dashed). Middle panel: log of the
first-difference power spectrum (black points) and line of regression (dashed) characteristic of white
noise. Bottom panel: autocorrelation function of the first-difference time series (solid) and ±2σ
limits (dashed).
of economists, in the time records of CREF and AAPL, the statistical behavior in the
three displays of first differences is very close to what one expects for white noise.
I have a confession to make. There is no company whose stock symbol is GRNG,
at least none when I checked the list of market symbols at the time of writing this
chapter. GRNG is my designation of "Gaussian Random Number Generator." The time
record in Figure 6.3 and associated statistical panels were obtained by a stochastic
algorithm that simulates nuclear decay. Any mathematical algorithm to generate
with constraint

$$\sum_{i=1}^{m} n_i = N,$$

where the multiplicity is

$$M_N = \frac{N!}{n_1!\,n_2!\cdots n_m!} = \frac{N!}{\prod_{i=1}^{m} n_i!}. \qquad (6.3.1)$$

With Stirling's approximation

$$\ln x! \approx x\ln x - x + \tfrac{1}{2}\ln 2\pi x, \qquad (6.3.2)$$

one can show that the set of probabilities {p_i} that maximizes M_N or, equivalently,
log M_N (to any base) is the distribution that maximizes the statistical entropy6

$$H \equiv \frac{1}{N}\log_2 M_N = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad (6.3.3)$$

designated H by Shannon, who used base 2. [Note: To go from (6.3.1) to (6.3.3), one
can ignore the third term of the approximation (6.3.2) since it is much smaller than
the first two terms.] The entropy unit is then in bits: 1 bit = the information acquired
in a single binary decision. Were H in (6.3.3) to be expressed in terms of natural
logarithms, as is usually the case in statistical mechanics, the unit of information
would be the nat.

6 C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal 27 (1948) 379–423, 623–656.
If transmission of a particular symbol, say A_k, were certain (p_i = δ_ik, i = 1...m),
the information or uncertainty H in (6.3.3) would vanish. (Recall that δ_ik is the
Kronecker delta symbol.) On the other hand, if the transmission of all symbols were
equally likely (p_i = 1/m), then H would assume its maximum value H_max =
log_2 m bits.
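Both limiting cases are easy to check numerically with a small sketch (hypothetical code; the function name is mine):

```python
import math

def shannon_entropy_bits(p):
    """H = -sum p_i log2 p_i, the uncertainty per symbol in bits (Eq. 6.3.3).
    Zero-probability symbols contribute nothing to the sum."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

m = 8
certain = [1.0] + [0.0] * (m - 1)        # p_i = delta_ik: no uncertainty
uniform = [1.0 / m] * m                  # all symbols equally likely

print(shannon_entropy_bits(certain))     # 0.0  (certainty)
print(shannon_entropy_bits(uniform))     # 3.0  (= log2 8, the maximum)
```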
In the general case between certainty and total uncertainty, the expression (6.3.3)
for H represents the expectation ⟨−log_2 P⟩ in which P is a probability function with
realizations comprising the set {p_i}.
An alternative, axiomatic approach7 to defining information employed by
Shannon was to require a function H( p1. . .pm) with the properties that
H should be a continuous function of the p's,
H should be a monotonically increasing function of the number m of symbols,
H should satisfy a certain linearity criterion, best explained by means of an
example.
Suppose a decision is to be made with three possible outcomes with associated
probabilities {p_1, p_2, p_3} that sum to unity. The uncertainty or information inherent
in this triad of choices is H(p_1, p_2, p_3). Now suppose instead that the same decision is
to be made in two stages, where the first entails two choices with probabilities {p_1, p_4}
and the second, which occurs a fraction p_4 of the time, entails two choices with
probabilities {p_5, p_6}. Overall, there are still three outcomes with respective probabilities {p_1, p_4 p_5, p_4 p_6} that sum to unity, as illustrated in the decision tree in Figure 6.7.
Equivalence of the two procedures requires that
(a) p_4 = p_2 + p_3 (since the new second choice replaces the original second and third
choices),
(b) p_4 p_5 = p_2,
(c) p_4 p_6 = p_3.
According to Shannon, the information inherent in the second procedure must take
the form H(p_1, p_4) + p_4 H(p_5, p_6), where the coefficient p_4 is a weighting factor
7 C. E. Shannon and W. Weaver, The Mathematical Theory of Communication (Illini Books, Urbana IL, 1964) 49.
[Figure 6.7: two decision trees, a 1-step decision with branch probabilities p1, p2, p3 and a 2-step decision with first-stage branches p1, p4 and second-stage branches p5, p6, yielding outcome probabilities p1, p4 p5, p4 p6.]
Fig. 6.7 Decision trees for a decision with three possible outcomes of probability p_i (i = 1, 2, 3)
to be made either in one step (left panel) or in two steps (right panel), in which the outcomes
2 and 3 occur in a fraction p_4 of the cases.
introduced because the second choice occurs only that fraction of the time. Shannon
was able to demonstrate that the only function satisfying the equality

$$H(p_1, p_2, p_3) = H(p_1, p_4) + p_4\, H(p_5, p_6) \qquad (6.3.4)$$

is of the form

$$H = -\sum_{i=1}^{m} p_i \log p_i \qquad (6.3.5)$$
up to an arbitrary scale factor which determines the units of information. If the units
are in bits, then the base of the logarithm, which was also arbitrary, is chosen to be 2.
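The grouping (linearity) property in (6.3.4) can be verified numerically for any concrete triad of probabilities; the values below are arbitrary illustrative choices:

```python
import math

def H(*p):
    """Shannon entropy in bits of a probability list (zeros contribute nothing)."""
    return sum(-x * math.log2(x) for x in p if x > 0)

# Three outcomes decided in one step...
p1, p2, p3 = 0.5, 0.3, 0.2
one_step = H(p1, p2, p3)

# ...or in two steps: first {p1, p4}, then {p5, p6} a fraction p4 of the time,
# with p4 = p2 + p3, p4*p5 = p2, p4*p6 = p3 as required for equivalence.
p4 = p2 + p3
p5, p6 = p2 / p4, p3 / p4
two_step = H(p1, p4) + p4 * H(p5, p6)

print(abs(one_step - two_step) < 1e-12)    # the two procedures agree
```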
I have on occasion been asked by students and colleagues why the letter H (and
not I, for example) was chosen to symbolize information. It is not uncommon in
statistical physics to find quantities symbolized by the first letter of the corresponding
word in German. Thus, for example, one will often see W for probability
(Wahrscheinlichkeit) and almost always Z for partition function (Zustandssumme =
"sum over states"). So what about H? I cannot say with certainty (this is, after all, a
book about chance and uncertainty) but I would speculate8 that Shannon chose the
letter with Boltzmann's H-theorem in mind. Boltzmann's quantity H was an early
attempt (~1870s) at describing entropy. Why not E, then, since entropy in German
is Entropie? From what I have learned second-hand, Boltzmann did use a script
8 I have also speculated (correctly, I believe) on the origin of James Clerk Maxwell's strange choice of electromagnetic
field symbols in my earlier book, Waves and Grains: Reflections on Light and Learning (Princeton University Press,
Princeton NJ, 1998).
$$H(A) = -\sum_{i=1}^{m} p(A_i)\log p(A_i), \qquad (6.3.6)$$
written to indicate explicitly that it is the uncertainty of the events symbolized by the set {A_i},
can be generalized to address the matter of conditional information (or conditional
entropy). Circumstances often arise, as will be the case in discussing the stock
market, in which we may want to know how much information is provided
by events {A_i}, given that events of another kind, represented by the set of symbols
{B_j, j = 1...m′}, are known to have occurred. The entropy of the second set takes the
same form as (6.3.6)
$$H(B) = -\sum_{j=1}^{m'} p(B_j)\log p(B_j), \qquad (6.3.7)$$

and the two sets of probabilities each satisfy a completeness relation

$$\sum_{i=1}^{m} p(A_i) = \sum_{j=1}^{m'} p(B_j) = 1. \qquad (6.3.8)$$

The conditional probability of A_i given B_j is

$$p(A_i|B_j) = \frac{p(A_i B_j)}{p(B_j)}, \qquad (6.3.9)$$
where p(AiBj) is the probability of joint occurrence of the two events. The marginal
probabilities p(Ai), p(Bj) of individual events are derived from the joint probabilities
by summation over the irrelevant elements
$$p(A_i) = \sum_{j=1}^{m'} p(A_i B_j), \qquad p(B_j) = \sum_{i=1}^{m} p(A_i B_j). \qquad (6.3.10)$$
9 D. Lindley, Boltzmann's Atom: The Great Debate That Launched a Revolution in Physics (The Free Press, New York,
2001) 75.
$$H(A|B) = \sum_{j=1}^{m'} p(B_j)\left[-\sum_{i=1}^{m} p(A_i|B_j)\log p(A_i|B_j)\right] = -\sum_{i,j} p(A_i B_j)\log\frac{p(A_i B_j)}{p(B_j)}, \qquad (6.3.11)$$
in which the quantity in square brackets summed over index i is the entropy of the set
A conditioned on element Bj. When multiplied by p(Bj) and summed over index j, the
resulting expression is the entropy of A conditioned on the full set of symbols
B. Substitution of (6.3.9) leads to the second relation in (6.3.11) in terms of joint
and marginal probabilities.
Entropy (and therefore information), like energy in physics, is additive over
subsystems; that is, the entropy H(A + B) of the combined system of m + m′ events
{A_i, B_j; i = 1...m, j = 1...m′} is just the sum of the entropies, H(A) + H(B), of the
separate parts. This total entropy should not be confused, however, with the entropy
H(AB) of the mm′ joint events {A_iB_j}, which is defined in the first relation below
$$H(AB) = -\sum_{i,j} p(A_i B_j)\log p(A_i B_j) = H(A) + H(B|A) = H(B) + H(A|B) \qquad (6.3.12)$$
and expressible, after substitution of (6.3.9), in either of two equivalent ways. Note
that H(AB) is generally not an additive function since it is the sum of an entropy and a
conditional entropy. As a consequence, H(AB) is always less than or equal to the sum
of the component entropies
$$H(AB) \le H(A) + H(B) \qquad (6.3.13)$$

because

$$H(A|B) \le H(A), \qquad H(B|A) \le H(B). \qquad (6.3.14)$$

That is, the uncertainty H(A|B) (or H(B|A)) conditioned on the acquisition of new
information is always less than or equal to the unconditional uncertainty H(A) (or
H(B)). The equality in (6.3.13) or (6.3.14) holds only in the case of statistical independence of sets A and B, whereby the joint probability factors: p(A_iB_j) = p(A_i)p(B_j).
A formal demonstration of relation (6.3.14) is easily made but will be left to an
appendix.
The difference between the right and left sides of the inequalities in (6.3.13) and
(6.3.14)
$$H(A,B) \equiv H(A) + H(B) - H(AB) = H(A) - H(A|B) = H(B) - H(B|A) = \sum_{i,j} p(A_i B_j)\log\frac{p(A_i B_j)}{p(A_i)\,p(B_j)} \qquad (6.3.15)$$

is a measure of the information shared between the two sets. Expressed as a fraction of the initial uncertainty,

$$\frac{H(A,B)}{H(A)} = \frac{\displaystyle\sum_{i,j} p(A_i B_j)\log\frac{p(A_i B_j)}{p(A_i)\,p(B_j)}}{\displaystyle -\sum_{i} p(A_i)\log p(A_i)} \qquad (6.3.16)$$

$$= 1 - \frac{H(A|B)}{H(A)}. \qquad (6.3.17)$$
$$p(A) = \frac{114}{365} \approx 0.31, \qquad p(\bar A) = 1 - p(A) = 0.69. \qquad (6.3.18)$$
Based on the alleged NWS success rate, the conditional probabilities are taken to be

$$p(A|B) = p(\bar A|\bar B) = 0.84, \qquad p(A|\bar B) = p(\bar A|B) = 1 - p(A|B) = 0.16. \qquad (6.3.19)$$
The unconditional prior uncertainty about precipitation is therefore10

$$H(A) = -p(A)\log_2 p(A) - p(\bar A)\log_2 p(\bar A) = 0.893 \text{ bits}. \qquad (6.3.20)$$
This is the information that the NWS forecast would provide if it were 100%
accurate.
To find the actual unconditional information provided by the NWS, we must first
calculate p(B) from the completeness relation

$$p(A) = p(A|B)\,p(B) + p(A|\bar B)\,p(\bar B), \qquad (6.3.21)$$

whence

$$p(B) = \frac{p(A) - p(A|\bar B)}{p(A|B) - p(A|\bar B)} = \frac{0.31 - 0.16}{0.84 - 0.16} = 0.221 \qquad (6.3.22)$$

and

$$p(\bar B) = 1 - p(B) = 0.779. \qquad (6.3.23)$$

The joint probabilities then follow from (6.3.9)

$$p(AB) = p(A|B)\,p(B), \qquad p(A\bar B) = p(A|\bar B)\,p(\bar B), \qquad (6.3.24)$$

$$p(\bar A B) = p(\bar A|B)\,p(B), \qquad p(\bar A\bar B) = p(\bar A|\bar B)\,p(\bar B). \qquad (6.3.25)$$
Using the preceding figures, we obtain from (6.3.11) the uncertainty in the weather
(i.e. precipitation) given the NWS forecast

$$H(A|B) = -p(AB)\log_2\frac{p(AB)}{p(B)} - p(\bar A B)\log_2\frac{p(\bar A B)}{p(B)} - p(A\bar B)\log_2\frac{p(A\bar B)}{p(\bar B)} - p(\bar A\bar B)\log_2\frac{p(\bar A\bar B)}{p(\bar B)} = 0.634 \text{ bits}, \qquad (6.3.26)$$

whence the information provided by the forecast is

$$H(A,B) = H(A) - H(A|B) = 0.259 \text{ bits}. \qquad (6.3.27)$$
10 The exact unit is bits per symbol. The additional two words may seem redundant, but this is not the case when a
message consists of a series of symbols.
The fractional reduction in uncertainty is therefore

$$\frac{H(A) - H(A|B)}{H(A)} = \frac{0.259}{0.893} = 29.0\%. \qquad (6.3.28)$$
A 29% decrease in uncertainty seems like a respectable number for an 84% success
rate in prediction. In any event, every bit helps.
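The entire worked example can be reproduced in a few lines (a sketch, using the rounded figures p(A) = 0.31 and 84% forecast accuracy taken from the text):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p_A = 0.31     # prior probability of precipitation (114/365, rounded as in the text)
acc = 0.84     # forecast accuracy: p(A|B) = p(not-A|not-B)

# Completeness relation (6.3.21) solved for p(B), as in (6.3.22).
p_B = (p_A - (1 - acc)) / (acc - (1 - acc))

H_A = h2(p_A)                  # unconditional uncertainty, ~0.893 bits

# Given either forecast, precipitation occurs with probability 0.84 or 0.16,
# so each branch contributes the same binary entropy h2(0.84) = h2(0.16).
H_A_given_B = p_B * h2(acc) + (1 - p_B) * h2(1 - acc)   # ~0.634 bits

reduction = (H_A - H_A_given_B) / H_A                   # ~0.29
print(round(H_A, 3), round(H_A_given_B, 3), round(reduction, 3))
```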
I started with the full 4096-day time records, which were used to calculate the
autocorrelation and power spectrum statistics of CREF, AAPL, and GRNG, and
examined how the information content (equivalently, the reduction in uncertainty)
varied as I took shorter intervals (32, 16, 8, 4 days) closer to the present day, i.e. the
day on which a hypothetical investor intended to take some action. The longer the
time interval, the more closely the probabilities estimated from frequencies reflected
the true probabilities of the system, but, of course, the more remote was the
preponderance of price variations from the present.
An example of the conditional probability matrix p(A_i|B_j) for CREF (N = 4096) is
shown below. To simplify notation, only the price-change symbol (i = +, −, 0) is
shown, it being understood that A occupies the first slot and B the second:

$$p(A_i|B_j) = \begin{pmatrix} p_{++} & p_{+-} & p_{+0} \\ p_{-+} & p_{--} & p_{-0} \\ p_{0+} & p_{0-} & p_{00} \end{pmatrix} = \begin{pmatrix} 0.539 & 0.506 & 0.500 \\ 0.446 & 0.475 & 0.444 \\ 0.015 & 0.019 & 0.056 \end{pmatrix}. \qquad (6.4.2)$$
As required by the completeness relation, each column sums to unity. The unconditional price-change probabilities for the same time period were

$$p(B_+) = 0.523, \qquad p(B_-) = 0.459, \qquad p(B_0) = 0.018. \qquad (6.4.3)$$
The numbers show that the probability that the price will remain unchanged from
one day to the next is low. The anticipated unconditional price-change probabilities
p(Ai) are obtained by a generalization of (6.3.21)
$$p(A_i) = \sum_{j=1}^{3} p(A_i|B_j)\,p(B_j) \qquad (6.4.4)$$
and turn out to be identical (to three decimal places) to the set p(Bj). Such agreement
is expected; in a sufficiently long time series of events such that the probabilities of
different states are estimated from frequencies of occurrence, it would be odd indeed
if a different set of probabilities were obtained merely by counting the same numbers
partitioned into categories (rise, fall, same). The agreement, however, deteriorates as
the time period over which the statistics are obtained shortens. This is an indication
that the data are too few to provide adequate estimates of probability and
therefore estimates of entropy derived from these frequencies are not statistically
meaningful.
The joint probabilities p(A_iB_j) follow from (6.4.2) and (6.4.3) by the same relations
employed in (6.3.25)

$$p(A_i B_j) = p(A_i|B_j)\,p(B_j) \qquad (6.4.5)$$

and lead to the joint probability matrix (with the same symbol convention used
previously)
Table 6.1

Stock   H(A|B) (bits)   H(A) (bits)   H(A) − H(A|B) (bits)   [H(A) − H(A|B)]/H(A) (%)   P(B+)   P(B−)   P(B0)
CREF    1.105 45        1.107 03      1.581 × 10^−3          0.143                      0.523   0.459   0.018
AAPL    1.140 96        1.141 60      6.366 × 10^−4          0.056                      0.502   0.473   0.025
GRNG    0.999 63        0.999 84      2.180 × 10^−4          0.022                      0.507   0.493   0.000
$$p(A_i B_j) = \begin{pmatrix} p_{++} & p_{+-} & p_{+0} \\ p_{-+} & p_{--} & p_{-0} \\ p_{0+} & p_{0-} & p_{00} \end{pmatrix} = \begin{pmatrix} 0.282 & 0.232 & 9.0\times10^{-3} \\ 0.233 & 0.218 & 8.0\times10^{-3} \\ 7.814\times10^{-3} & 8.791\times10^{-3} & 1.0\times10^{-3} \end{pmatrix}.$$
Knowledge of p(AiBj), p(Ai) and p(Bj) permits one to calculate, by means of the
relations given in the previous section, the initial uncertainty of price variation
and the extent to which this uncertainty is diminished by knowing how the price
had varied in the past (with a delay of one day). The results for CREF, AAPL, and
GRNG are summarized in Table 6.1. The fractional acquisition of information (or,
equivalently, decrease in uncertainty) is minute and of no practical statistical use to
an investor.
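A reader can replicate the flavor of this calculation on synthetic data. The sketch below (my code, not the author's) builds a GRNG-like random walk, estimates the rise/fall symbol probabilities from frequencies, and computes the initial and lag-1 conditional entropies; the information gain comes out negligibly small, as in Table 6.1:

```python
import numpy as np

rng = np.random.default_rng(2)
# GRNG-like record: a random walk built from iid Gaussian daily shocks.
prices = np.cumsum(rng.normal(size=4096))
signs = np.sign(np.diff(prices)).astype(int)   # +1 (rise) or -1 (fall)

# Today's symbol conditioned on yesterday's (delay of one day).
today, yesterday = signs[1:], signs[:-1]
p_B = {s: np.mean(yesterday == s) for s in (+1, -1)}

def H_bits(probs):
    probs = np.array([p for p in probs if p > 0])
    return float(-(probs * np.log2(probs)).sum())

H_A = H_bits([np.mean(today == s) for s in (+1, -1)])
H_A_given_B = sum(
    p_B[b] * H_bits([np.mean(today[yesterday == b] == s) for s in (+1, -1)])
    for b in (+1, -1)
)
info_gain = H_A - H_A_given_B   # tiny: yesterday tells you almost nothing
print(round(H_A, 3), info_gain < 0.01)
```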
Although the CREF and AAPL time series appear to provide a minuscule amount
of information more than GRNG, the initial and conditional entropies in the table
are nearly identical to what one would expect statistically for a coin toss with a slightly
biased coin. If we ignore the small probability of a share price remaining unchanged
over a one-day interval, then there are just two states (price rise, price fall) of
approximately equal probability. Each symbol (+, −) in the time series of first
differences then contributes an uncertainty of 1 bit (log_2 2 = 1), which is precisely
the entropy obtained for GRNG and very close to the entropies of CREF and
AAPL. (Recall that entropy is the uncertainty per symbol in a long sequence of
symbols.) For time series shorter than 4096 days, but long enough to generate valid
estimates of probability (e.g. 100 days), the event "0" did not occur and the initial
and conditional entropies of CREF and AAPL turned out to be 1 bit within a few
parts in 10^4.
Indeed, the NWS meteorologists' forecast of my local weather provided about
200 times more information to reduce my initial uncertainty of precipitation
than did the CREF time series of price changes. If I (still) had a financial
advisor who based his advice on examination of stock time charts, I think
I would do better to consult instead my local weatherman. In any event,
I couldn't do worse.
12 L. Bachelier, "Théorie de la Spéculation," Annales Scientifiques de l'École Normale Supérieure 3 (1900) 21–86.
Translated into English by P. H. Cootner, The Random Character of Stock Market Prices (MIT Press, Cambridge
MA, 1964) 17–78.
J.-M. Courtault et al., "Louis Bachelier: On the Centenary of Théorie de la Spéculation," Mathematical Finance 10
(2000) 341–353.
$$p(x, t_1 + t_2) = \int_{-\infty}^{\infty} p(x', t_1)\, p(x - x', t_2)\, dx'. \qquad (6.5.1)$$

$$p(x, t) = \frac{e^{-x^2/2f(t)^2}}{\sqrt{2\pi}\, f(t)} \qquad (6.5.2)$$
The functional equation

$$f(u + v)^2 = f_1(u)^2 + f_2(v)^2 \qquad (6.5.3)$$

can be approached in the same way I illustrated previously in regard to the functional
equations that arose in determination of Bayesian priors satisfying certain invariance
relations. Differentiate (once) both sides of (6.5.3), first with respect to u and next
with respect to v, to obtain the two first-order differential equations below, expressed
as a single statement

$$f(u + v)\,f'(u + v) = f_1(u)\,f_1'(u) = f_2(v)\,f_2'(v) = \text{constant}, \qquad (6.5.4)$$

where the prime signifies differentiation with respect to the indicated argument.
Equation (6.5.4) is readily integrated to yield f(t)^2 = 2Dt, where D is a constant to
be interpreted shortly (and encountered again in Chapter 10). Applied to the variance
in pdf (6.5.2), one obtains the complete expression (in my notation) found by Bachelier

$$p(x, t) = \frac{e^{-x^2/4Dt}}{\sqrt{4\pi D t}} \qquad (6.5.5)$$
for the spatial and temporal dependence of the probability law governing fluctuations in stock market prices. Keep in mind, however, that the coordinate x refers to
price displacement from the mean or true current market price, not physical length.
It is of significance to remark, although Bachelier did not take notice of it, that
the form of the right side of Eq. (6.5.1) is a convolution integral, whence the moment-generating (or characteristic) functions of the variates multiply:

$$g_{X(t_1+t_2)}(u) = g_{X(t_1)}(u)\, g_{X(t_2)}(u) \qquad (6.5.6)$$

or

$$h_{X(t_1+t_2)}(u) = h_{X(t_1)}(u)\, h_{X(t_2)}(u), \qquad (6.5.7)$$

where u is just an expansion variable. The variates are governed by the same
probability law, hence the same function g_X or h_X, but with a time-dependent
parameter. Normalized solutions to the functional equations in (6.5.7) are the exponential functions

$$g_X(u) = e^{\frac{1}{2}\sigma_t^2 u^2} \quad \text{or} \quad h_X(u) = e^{-\frac{1}{2}\sigma_t^2 u^2}, \qquad (6.5.8)$$
which the reader will recognize immediately as the mgf and cf of a Gaussian
distribution. Bachelier made no use of generating functions in his thesis.
The probability density (6.5.5) does not represent a stationary stochastic process,
since the variance σ_t^2 = 2Dt increases with time. The root-mean-square spread in
price therefore grows as the square root of the time,

$$\sigma_t = \sqrt{2Dt}. \qquad (6.5.9)$$
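The linear growth of the variance is simple to confirm by simulation (a sketch under the assumption of unit-variance iid shocks, so that 2D = 1):

```python
import numpy as np

rng = np.random.default_rng(3)
n_paths, T = 10000, 400

# Ensemble of Bachelier-style price displacements: each path is a running sum
# of iid shocks, so var(x_t) = 2*D*t with 2*D = var of one shock (here 1.0).
shocks = rng.normal(size=(n_paths, T))
paths = np.cumsum(shocks, axis=1)

v100 = paths[:, 99].var()       # ~100
v400 = paths[:, 399].var()      # ~400
print(round(v400 / v100, 1))    # ~4: quadrupling t quadruples the variance
```

The rms spread at day 400 is thus only twice that at day 100, the square-root law of Eq. (6.5.9).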
$$\langle \varepsilon_t\, \varepsilon_{t'} \rangle = \sigma_\varepsilon^2\, \delta_{tt'}, \qquad (6.6.1)$$

where δ_tt′ is the Kronecker delta, symbolizing no correlation between perturbations at different times.
If the perturbation is a Gaussian random variable ε_t ~ N(0, σ^2), we can obtain
results equivalent to Bachelier's, but it is not necessary to make such a choice at this

13 A. Einstein, (a) "On the movement of small particles suspended in stationary liquids required by the molecular-kinetic
theory of heat," Annalen der Physik 17 (1905) 549–560; (b) "On the theory of Brownian movement," Annalen der Physik
19 (1906) 371–381. Einstein's papers on Brownian motion are collected in the book Albert Einstein: Investigations on the
Theory of the Brownian Movement, Eds. R. Fürth and A. D. Cowper (Dover, New York, 1956).
point. Indeed, one of the objectives of the exercise I undertook was to determine what
kind of random shock adequately describes the random walk of stock prices. My
model, of a type labeled AR(1) for autoregressive process of order 1, was defined by
the master equation

$$x_t = \phi\, x_{t-1} + \varepsilon_t, \qquad (6.6.2)$$

where the parameter φ gauges the influence of the past (lag 1 day) on the present.
For the model to be useful when theory is matched to data, one must find that |φ| ≤ 1;
otherwise Eq. (6.6.2) results in run-away solutions that diverge exponentially with
time. Once Eq. (6.6.2) is solved, φ can be estimated from a time series in various ways,
and the value is highly informative in regard to the nature of the stochastic process.
I will describe shortly what resulted for the rather different looking CREF, AAPL,
and GRNG time series.
Equation (6.6.2) is an example of a Markov process, i.e. a stochastic process in
which the future depends only on the present, a characteristic experimentally tested
in the decay of radioactive nuclei.14 A more formal way of expressing this point is to
state that the conditional probability of obtaining the present state xt given all the
past values of the random variable X
Pr(X_t = x_t | {x_{t′}} for all t′ < t) = Pr(X_t = x_t | x_{t−1})    (6.6.3)

equals the probability of the present variate conditioned on only the most recent past variate, i.e. with time lag 1. In a more general autoregressive model AR(n)

x_t = a_1 x_{t−1} + a_2 x_{t−2} + a_3 x_{t−3} + ... + a_n x_{t−n} + ε_t,    (6.6.4)
the influence on the present reaches further into the past and there are more parameters to be determined from the available time series data.
Time series analysis by means of equations of the autoregressive type, as well as other defined types that go by names (ARMA, ARIMA, ARCH, GARCH, etc.) that sound either like a government agency or a person choking, has been widely investigated15 for its utility as a forecasting tool. From the preceding section, one would not expect stock price forecasts to provide useful information. Actually, the model (6.6.2) does tell us something immediately: the mean price expected for tomorrow is a times the price today. We shall see that a ≈ 1 to within statistical uncertainty.
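As a concrete illustration, the AR(1) master equation is easy to simulate. The following sketch (my own; the function names are illustrative, not from the text) generates a series with a known lag parameter and recovers it with the ratio estimator Σ x_t x_{t−1} / Σ x_{t−1}²:

```python
import random

def simulate_ar1(a, sigma, n, seed=0):
    """Generate x_t = a*x_{t-1} + eps_t with iid N(0, sigma^2) shocks, x_0 = 0."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(a * x[-1] + rng.gauss(0.0, sigma))
    return x

def estimate_lag(x):
    """Ratio (maximum-likelihood) estimator of the lag parameter a."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den

x = simulate_ar1(a=0.95, sigma=1.0, n=5000)
a_hat = estimate_lag(x)   # close to the true value 0.95
```

Applied to an actual stock record in place of the simulated series, the same estimator produces the lag parameters reported later in this section.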
Equation (6.6.2) can be solved formally by writing a column of time-lagged versions of the equation, each successive row multiplied by an additional power of a,
14 M. P. Silverman and W. Strange, Experimental tests for randomness of quantum decay examined as a Markov process, Physics Letters A 272 (2000) 1-9.
15 G. E. P. Box, G. M. Jenkins, and G. C. Reinsel, Time Series Analysis: Forecasting and Control (Prentice-Hall, New York, 1994).
x_t = a x_{t−1} + ε_t
a x_{t−1} = a² x_{t−2} + a ε_{t−1}
a² x_{t−2} = a³ x_{t−3} + a² ε_{t−2}
a³ x_{t−3} = a⁴ x_{t−4} + a³ ε_{t−3}
⋮    (6.6.5)

Summing the column and canceling the terms common to both sides yields (with initial value x_0 = 0)

x_t = Σ_{k=1}^{t} a^{t−k} ε_k = Σ_{k=0}^{t−1} a^k ε_{t−k}.    (6.6.6)

In terms of the backward shift (backshift) operator B, defined by B ε_t = ε_{t−1}, the solution (6.6.6) is the action

x_t = Σ_{k=0}^{t−1} (aB)^k ε_t    (6.6.7)

of the generating function

φ(B) = Σ_{k=0}^{t−1} (aB)^k = [1 − (aB)^t]/(1 − aB)    (6.6.8)

of the operator B acting on the present perturbation. The generating function φ(B) will prove very useful shortly.
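The telescoping solution can be checked numerically; this short sketch (my own illustration) confirms that iterating the recursion and evaluating the sum Σ a^k ε_{t−k} give the same value:

```python
import random

rng = random.Random(1)
a, T = 0.7, 50
eps = [rng.gauss(0.0, 1.0) for _ in range(T + 1)]   # eps[1..T] are the shocks

# Direct recursion x_t = a*x_{t-1} + eps_t, starting from x_0 = 0
x = [0.0]
for t in range(1, T + 1):
    x.append(a * x[-1] + eps[t])

# Closed-form solution: x_T = sum over k of a^k * eps_{T-k}
x_closed = sum(a ** k * eps[T - k] for k in range(T))

assert abs(x[-1] - x_closed) < 1e-9
```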
To model the time series of stocks, such as those shown in Figures 6.1-6.3, I chose the perturbations to be independent, identically distributed (iid) normal variates of mean 0 and variance σ_ε². Recall that one of the properties of the normal distribution is its strict stability,16 a term signifying that a linear combination of iid normal variates also results in a normal variate according to the relations
Σ_{i=1}^{n} a_i N_i(μ_i, σ_i²) = N(Σ_{i=1}^{n} a_i μ_i, Σ_{i=1}^{n} a_i² σ_i²) ≡ N(μ, σ²)

μ = Σ_{i=1}^{n} a_i μ_i,   σ² = Σ_{i=1}^{n} a_i² σ_i².    (6.6.9)
Thus, the linear combination in (6.6.6) can be collapsed to a single normal variate
x_t = N(0, σ_t²),   σ_t² = σ_ε² (1 − a^{2t})/(1 − a²),    (6.6.10)
of mean 0 and time-dependent variance. Note that the form of the variance in (6.6.10), which follows rigorously from (6.6.9), is also obtained simply from the operator expression φ(B) in (6.6.8) by replacing the backshift operator B with a. In the limit a → 1 the variance reduces to

σ_t² = σ_ε² t,    (6.6.11)

increasing linearly with time like the variance of a Brownian walk.
The covariance of two elements of the time series separated by lag k is

γ_k ≡ ⟨x_t x_{t+k}⟩ = ⟨x_{t−k} x_t⟩ = ⟨x_t x_{t−k}⟩,    (6.6.12)
where the second equality follows from time-translation invariance and the third
equality merely states that the expectation value does not depend on the order of the
variates in the brackets. Multiplying the two sides of the master equation (6.6.2) by
xtk and then taking the expectation leads to the relation
⟨x_t x_{t−k}⟩ = a ⟨x_{t−1} x_{t−k}⟩ + ⟨ε_t x_{t−k}⟩
γ_k = a γ_{k−1} + ⟨ε_t x_{t−k}⟩.    (6.6.13)
6:6:13
16 A stable distribution is one characterized by the linear relation a_1 X_1 + a_2 X_2 = a_3 X + a_4, where the a's are constant coefficients and the X's are variates of the same kind (e.g. Gaussian). For a strictly stable distribution a_4 = 0.
⟨ε_t x_{t−k}⟩ = σ_ε² δ_{k0}.    (6.6.14)

For lag k ≥ 1, the expectation ⟨ε_t x_{t−k}⟩ vanishes because the random perturbation ε_t occurs later than the variate x_{t−k} and therefore cannot influence it. For lag k = 0, one makes use of the master equation (6.6.2) and perturbation covariance (6.6.1) to find that the expectation

⟨ε_t x_t⟩ = ⟨ε_t (a x_{t−1} + ε_t)⟩ = ⟨ε_t²⟩ = σ_ε².    (6.6.15)
The recursion (6.6.13) then yields for the AR(1) process the autocorrelation coefficients

r_k ≡ γ_k/γ_0 = a^{|k|}.    (6.6.16)

For the general AR(n) process (6.6.4) the analogous set of relations

r_k = Σ_{j=1}^{n} a_j r_{|j−k|}    (k ≠ 0)
⋮
r_n = a_1 r_{|n−1|} + a_2 r_{|n−2|} + ... + a_n r_0,    (6.6.17)
known as the YuleWalker equations, affords one way of estimating the model
parameters from the sample autocorrelation coefficients of an observed time series.
(The coefficient r_0, although equal to 1 by construction, is shown explicitly in (6.6.17) to maintain the pattern of indices.)
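The geometric decay of the autocorrelation coefficients can be verified directly from a simulated series; the sketch below (my own illustration) estimates r_1 and r_2 for a known lag parameter:

```python
import random

def sample_autocorr(x, k):
    """Sample autocorrelation coefficient r_k of a time series."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t - k] - m) for t in range(k, n))
    den = sum((v - m) ** 2 for v in x)
    return num / den

rng = random.Random(2)
a = 0.6
x = [0.0]
for _ in range(20000):
    x.append(a * x[-1] + rng.gauss(0.0, 1.0))

r1 = sample_autocorr(x, 1)   # approximately a   = 0.6
r2 = sample_autocorr(x, 2)   # approximately a^2 = 0.36
```

For an AR(1) process the sample coefficients trace out r_k ≈ a^k, which is one practical way of estimating a from data.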
From the closed-form solution of the AR(1) process in (6.6.6) the power spectrum S(ν) at frequency ν_c ≥ ν ≥ 0, where ν_c = 1/2Δt is the cut-off frequency, can be deduced in a simple way. As a matter of notation, however, note that the dimensionless product νΔt falls in the range 1/2 ≥ νΔt ≥ 0. Because Δt = 1, I will omit writing it in the ensuing mathematical expressions, whereupon a dimensionless frequency (actually a phase) within the above range will be designated by ν. The simple procedure for obtaining the power spectrum then consists of replacing the backward shift operator B in φ(B) by the phase factor e^{−2πiν}, whereupon one finds

S(ν) = σ_ε² |φ(e^{−2πiν})|².    (6.6.18)
S(ν) = σ_ε² |1 − (a e^{−2πiν})^t|² / |1 − a e^{−2πiν}|²  →(t → ∞)  σ_ε² / (1 + a² − 2a cos 2πν),    (6.6.19)
which is stationary in the asymptotic limit, since it is assumed that a < 1, even if only by an infinitesimal amount. For frequencies well below cut-off, a Taylor series expansion of the denominator results in the relation

S(ν) ∝ 1/ν²,    (6.6.20)
Since the conditional probability density of the AR(1) process with normal perturbations is

p(x_t | x_{t−1}) = (2πσ_ε²)^{−1/2} exp[−(x_t − a x_{t−1})²/2σ_ε²],    (6.6.21)

the likelihood of the sample is the product

L(a, σ_ε²) = Π_{t=2}^{N} p(x_t | x_{t−1}),    (6.6.22)

and the log-likelihood function of the sampled time series {x_t; t = 1 ... N} takes the form

L̄ = ln L = −(N/2) ln(2πσ_ε²) − (1/2σ_ε²) Σ_{t=2}^{N} (x_t − a x_{t−1})² + constant.    (6.6.23)
Solution of the maximum likelihood equations

∂L̄/∂a = ∂L̄/∂σ_ε² = 0    (6.6.24)

leads to the estimators

â = Σ_{t=2}^{N} x_t x_{t−1} / Σ_{t=2}^{N} x_{t−1}²,   σ̂_ε² = (1/N) Σ_{t=2}^{N} (x_t − â x_{t−1})².    (6.6.25)
The corresponding covariance matrix derived from the mixed second derivatives of
the log-likelihood is diagonal
C = −( ⟨∂²L̄/∂a²⟩         ⟨∂²L̄/∂a ∂σ_ε²⟩  )⁻¹  =  ( var(â)        0        )
     ( ⟨∂²L̄/∂σ_ε² ∂a⟩    ⟨∂²L̄/∂(σ_ε²)²⟩ )        (   0      var(σ̂_ε²)   ),    (6.6.26)

with variances

var(â) = σ_ε² / (N σ_t²)    (6.6.27)

var(σ̂_ε²) = 2σ_ε⁴ / N.    (6.6.28)
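The maximum-likelihood recipe can be exercised end to end on a simulated record; this sketch (my own, with illustrative names) recovers both the lag parameter and the shock variance:

```python
import random

rng = random.Random(10)
a_true, sigma_true, N = 0.98, 1.5, 8000

x = [0.0]
for _ in range(N - 1):
    x.append(a_true * x[-1] + rng.gauss(0.0, sigma_true))

# ML estimators of the AR(1) parameters
a_hat = (sum(x[t] * x[t - 1] for t in range(1, N))
         / sum(x[t - 1] ** 2 for t in range(1, N)))
sigma2_hat = sum((x[t] - a_hat * x[t - 1]) ** 2 for t in range(1, N)) / N
```

With N = 8000 points the estimates should land within a few parts per thousand of a = 0.98 and σ_ε² = 2.25, consistent with variances that shrink as 1/N.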
Table 6.2

Time series          N      Lag parameter a     Shock parameter     Correlation
AAPL  original     4096     1.001 ± 3.5(−4)     1.986 ± 0.087       2.9(−3)
      detrended    4096     1.001 ± 6.7(−4)     1.988 ± 0.087       6.7(−3)
      original      512     1.003 ± 8.2(−4)     3.640 ± 0.829       9.4(−3)
      detrended     512     0.966 ± 0.012       3.630 ± 0.825       0.011
CREF  original     4096     1.000 ± 1.6(−4)     2.064 ± 0.094       8.1(−3)
      detrended    4096     0.998 ± 9.5(−4)     2.064 ± 0.094       5.0(−3)
      original      512     1.002 ± 7.7(−4)     3.339 ± 0.697       0.011
      detrended     512     0.974 ± 0.010       3.343 ± 0.699       0.012
GRNG  original     4096     1.000 ± 3.3(−4)     1.981 ± 0.087       8.1(−3)
      detrended    4096     1.001 ± 5.6(−4)     1.982 ± 0.087       7.8(−3)
      original      512     1.002 ± 1.3(−3)     1.923 ± 0.23        0.020
      detrended     512     0.999 ± 5.5(−3)     1.919 ± 0.23        0.017

[x(−n) denotes x × 10⁻ⁿ.]
If the lag parameter is truly a = 1, one would expect the sample variance

s²_{X_t} = (1/t) Σ_{n=1}^{t} (x_n − x̄_t)²   with   x̄_t = (1/t) Σ_{n=1}^{t} x_n    (6.6.29)

of the time series to increase approximately linearly with t according to (6.6.11), and the sample variance

s²_{W_t} = (1/t) Σ_{n=2}^{t} (w_n − w̄_t)²   where   w̄_t = (1/t) Σ_{n=2}^{t} w_n    (6.6.30)

of the series of first differences w_t = x_t − x_{t−1} to remain approximately constant, as borne out in Figure 6.8.
Fig. 6.8 Time variation of the variance of the AAPL (a) detrended time series in Figure 6.1
(solid) and corresponding line of regression (dashed line); (b) first-difference of the detrended
time series multiplied by 100 for visibility (solid). The variance of pure Brownian noise
increases linearly in time, whereas that of white noise is constant.
In short, taking account of all the comparisons made so far, the CREF and AAPL
time series both of which typify other stock time records I have examined appear
to derive from stochastic processes largely characterizable as a Gaussian random
walk.
If you are wondering whether practically useful departures from a random-walk process might have been discerned by analyzing the stock time series with an AR(n) model with n > 1, or some other more complicated model that permitted a deeper slice of the past to influence the present, the answer is almost assuredly no. The basic strategy of deciding which, if any, of many possible linear models to apply to a non-stationary time series consists of transforming the series and examining the resulting
difference series until one attains a difference series whose autocorrelation function
resembles that of white noise. One then reverses direction putting all the components
together (integrating instead of differencing) to arrive at the identity of the
process characterizing the original time series. It is by such means that the strange
names of the time series were derived (e.g. ARIMA Autoregressive Integrated
Moving Average process).
The salient feature of all the stock market time series I have examined is the immediate arrival at white noise with the first difference. There would be little point, therefore, in employing a model more complicated than AR(1) if one were simply a typical investor saving for retirement rather than a speculator or hedge-fund manager faced with the statistical uncertainties of some futures contract or other
Fig. 6.9 Histogram of first differences of the CREF STOCK time series of Fig 6.1 (black bars) superposed by a Cauchy distribution (solid) Cau(θ, γ) = Cau(0.2, 0.76) with visually fit location and scale parameters, and a normal distribution (dashed) N(w̄, var w) = N(5.19 × 10⁻³, 4.26) determined by the sample mean and variance. Excluded from the w = 0 bin of the histogram are contributions in which the price change was null because of market closure.
derivative product. Over-modeling a time series does not provide new or more precise
information.
Nevertheless, there is one striking difference between actual stock time series and
Gaussian-simulated time series when one looks more closely at the distribution of
first differences of the raw or detrended time series as illustrated by the CREF data in
Figure 6.9. Although the GRNG first-difference series is well characterized by a
Gaussian distribution by virtue of its construction, histograms of CREF and AAPL
first differences have much narrower peaks and fatter tails that more closely resemble
a Cauchy distribution (although not to an extent that passes a chi-square test). The
statistical implication of fat tails, in comparison to the exponentially decreasing tails
of a Gaussian distribution, is greater volatility i.e. higher probability of the
occurrence of outlying or extreme events.
This greater volatility is apparent in the top panel of Figures 6.4 and 6.5 in the
rare, but not negligible, occurrence of first-difference excursions extending beyond
the region bounded by 5 standard deviations. For a random variable strictly
governed by a normal distribution, the probability of attaining by pure chance a
value of at least five standard deviations beyond the mean is about 6 × 10⁻⁷. Thus the
mean number of occurrences of such events in a time period of 4096 days (with one
Fig. 6.10 Comparison of one-dimensional Gaussian (upper panel) and Cauchy (lower panel)
random walks (RW) with location parameter 0 and width parameter 2. The N(0,22) path is
characterized by numerous small fluctuations; the Cau(0,2) path shows a relatively smooth
evolution punctuated intermittently by large fluctuations.
trial per day) would be 0.0023. Nevertheless, the autocorrelation of the first differences, shown in the bottom panels of Figures 6.4 and 6.5, is well characterized by white noise, and both original time series look very much more like a Gaussian random walk than a Cauchy random walk, as illustrated in Figure 6.10.
The upper panel in Figure 6.10 shows the cumulative displacements in 4096 steps of a normal variate N(0, 4) (i.e. with width parameter 2) simulated by a Gaussian RNG. The lower panel shows the corresponding displacements of a Cauchy variate Cau(0, 2) with width parameter 2 simulated by a Cauchy RNG. In
contrast to the Gaussian random walk, which appears to fluctuate rapidly on a fine
scale but proceed more or less smoothly on a coarse scale, the Cauchy random walk
appears to proceed with little change on a fine scale for long intervals and then
undergo large changes in scale suddenly. Stock market time series are approximate Lévy processes that fall somewhere between Gaussian and Cauchy random walks. In appearance they look much more like the former than the latter, but the higher-than-normal incidence of stock market melt-downs serves as a graphic reminder of the impact of extreme events residing in those fat tails.
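The contrast in Figure 6.10 can be generated in a few lines; in this sketch (my own) Cauchy variates are drawn by the standard inverse-CDF recipe X = x₀ + γ tan(π(U − 1/2)):

```python
import math
import random

rng = random.Random(4)
n = 4096

gauss_steps = [rng.gauss(0.0, 2.0) for _ in range(n)]
# Cauchy variate with location 0 and width 2 via the inverse CDF
cauchy_steps = [2.0 * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

max_gauss = max(abs(s) for s in gauss_steps)
max_cauchy = max(abs(s) for s in cauchy_steps)
# The fat-tailed Cauchy walk is dominated by rare, enormous steps,
# while the Gaussian extremes stay within a few widths of the mean.
```

Accumulating the two step sequences reproduces the qualitative difference between the panels: fine-grained jitter for the Gaussian walk, long quiet stretches punctuated by huge jumps for the Cauchy walk.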
We have seen that there is no information (in the Shannon sense) in a time series or
record of past performance of stock prices. Nevertheless, to know that stock price
movements are characterized reasonably well by a Gaussian random walk enables
one to understand how that movement exemplifies one of the most astonishing,
counter-intuitive properties of a random walk a behavior responsible for much
delusion in regard to long-term returns of the stock market.
On a number of occasions I have asked diverse groups of people audiences at my
seminars, students in my classes, associates at work or at other activities what they
thought would be the cumulative gain (positive or negative) of tossing a fair coin a
few hundred times and receiving a dollar for every head and paying a dollar for every
tail. The reply almost invariably was to suggest that the net accumulation would be
close to zero since, after all, if the coin were unbiased, then there was a 50-50 chance
of getting either a head or tail. The cumulative gain of a coin toss with a fair coin is
equivalent to the net displacement of a Bernoulli random walk of equal step size (let
us say 1 unit) with probability p to step right equal to the probability q to step left.
Even physicists who recognize this equivalence may not be fully aware of how
awesomely wrong the usual reply is.17
The four panels in Figure 6.11 show four realizations of a one-dimensional 1000-step random walk simulated by a N(0, 1) Gaussian RNG. A Gaussian random walk is
different from a Bernoulli random walk in that the step size is continuous over an
infinite range with a probability determined by a Gaussian distribution. Nevertheless,
there are remarkable connections between the two types of random walk. Note first
how much time the random walker spends either above the zero ordinate (breakeven line) or below it. In the top panel, for example, the random walker remained at
or below the breakeven line for 978 trials out of 1000 that is, for 97.8% of the time.
What is the theoretical probability that a Gaussian random walk will remain nonpositive (or, equivalently, non-negative) for at least 97.8% of the time?
To answer that question generally, represent the displacement of the nth step by X_n = N_n(0, σ²), each step being independent of preceding steps, and the cumulative displacement at the conclusion of the nth step by S_n = Σ_{j=1}^{n} X_j = N(0, nσ²). Then
17
See M. P. Silverman, Computers, coins, and quanta: unexpected outcomes of random events in my book A Universe of Atoms, An Atom in the Universe (Springer, New York, 2002) 279-324.
Fig. 6.11 Four realizations of the cumulative displacement of a one-dimensional Gaussian
random walk of 1000 steps simulated by a N (0,1) Gaussian RNG. Note the large fraction of
each trajectory spent either above or below the origin.
Pr(S_1 > 0, S_2 > 0 ... S_n > 0) is the probability that all n steps lie above the breakeven line. For n = 1 we obtain the expected result

Pr(S_1 > 0) = Pr(X_1 > 0) = (2πσ²)^{−1/2} ∫_{0}^{∞} e^{−x_1²/2σ²} dx_1 = (2π)^{−1/2} ∫_{0}^{∞} e^{−z_1²/2} dz_1 = 1/2,    (6.7.1)

where the transformation z = x/σ was made to make the integral dimensionless. For n = 2

Pr(S_1 > 0, S_2 > 0) = Pr(X_1 + X_2 > 0, X_1 > 0) = Pr(X_2 > −X_1, X_1 > 0)
= (2π)^{−1} ∫_{0}^{∞} e^{−z_1²/2} dz_1 ∫_{−z_1}^{∞} e^{−z_2²/2} dz_2.    (6.7.2)
Continuation of the pattern to arbitrary n,

Pr(S_1 > 0, ... S_n > 0) = (2π)^{−n/2} ∫_{0}^{∞} e^{−z_1²/2} dz_1 ∫_{−z_1}^{∞} e^{−z_2²/2} dz_2 ∫_{−(z_1+z_2)}^{∞} e^{−z_3²/2} dz_3 ... ∫_{−Σ_{j=1}^{n−1} z_j}^{∞} e^{−z_n²/2} dz_n
= (2π)^{−n/2} Π_{k=1}^{n} ∫_{−Σ_{j=1}^{k−1} z_j}^{∞} e^{−z_k²/2} dz_k,    (6.7.3)

results in a formidable-looking multiple integral of Gaussian functions with constraints on the lower limits.
Although it may seem incredible, the value of the integral in (6.7.3) is given by a very simple combinatorial formula

Pr(S_1 > 0, S_2 > 0 ... S_n > 0) = (1/2^{2n}) (2n choose n)  →  1/√(πn)    (n = 1, 2, ...),    (6.7.4)
derived for a one-dimensional Bernoulli random walk of 2n steps of equal size and probability p = q.18 The asymptotic (large n) reduction to the right of the arrow is obtained by applying Stirling's approximation (6.3.2), this time including the square-root term which is often omitted in statistical thermodynamics.
The probability of a non-negative cumulative displacement for k steps out of n and a non-positive cumulative displacement for the remaining n − k steps is then the product of two factors of the form of (6.7.4)
18 W. Feller, An Introduction to Probability Theory and Its Applications Vol. 1 (Wiley, New York, 1950) 74-75. Feller derives the formula for a Bernoulli random walk; he does not discuss at all the Gaussian random walk.
Fig. 6.12 [Probability Pr(k|n) vs. k for n = 10; plot not reproduced.]
Pr(k|n) = (1/2^{2n}) (2k choose k) (2n−2k choose n−k)  →  1/(π √(k(n − k))).    (6.7.5)
Apart from the end points (k = 0, n) where it becomes singular, the asymptotic expression is quite good even for relatively low n, as shown in Figure 6.12 for n = 10 steps. Although powerful mathematical software like Maple or Mathematica permits one to evaluate (6.7.5) numerically for arbitrarily large n, the asymptotic formula shows more transparently that a plot of Pr(k|n) as a function of k is concave upward with a minimum at k = n/2. In other words, the probability that the random walker
remains above (or below) the breakeven line for most of the n steps is much higher
than the probability of it spending time nearly equally in both domains. This
outcome may seem obvious when one thinks of the random walker as a molecule
diffusing from a bottle of perfume, but it is much less obvious when the image in
mind is the cumulative gain of tossing a fair coin.
The probability that the random walker remains in the positive domain for at least
k out of n time units is obtained by summing (6.7.5) over the range from k to n
Pr(K ≥ k|n) = (1/2^{2n}) Σ_{j=k}^{n} (2j choose j) (2n−2j choose n−j).    (6.7.6)
One can approximate (6.7.6) by integrating the asymptotic expression in (6.7.5) to
obtain a form of the so-called arc sine law
Pr(K ≥ k|n) ≈ (1/π) ∫_{k/n}^{1} dx/√(x(1 − x)) = 1 − (2/π) sin⁻¹(√(k/n)) = 1/2 − (1/π) sin⁻¹(2k/n − 1),    (6.7.7)
where the last equality in (6.7.7) follows from the trigonometric identity cos 2θ = 1 − 2 sin² θ. The answer to the question of how probable it is for a Gaussian random
walker to remain in the positive domain for 978 steps out of 1000 is then close to
9.64% by (6.7.6) or (approximately) 9.48% by the arc-sine law (6.7.7). In other
words, what may have seemed initially to be an extremely improbable event can
occur by pure chance in approximately 1 out of 10 time records.
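The arc-sine behavior is straightforward to confirm by simulation; this sketch (my own) counts how often a Gaussian walk spends at least 95% of its steps on one side of the breakeven line:

```python
import random

rng = random.Random(5)

def fraction_positive(n):
    """Fraction of steps an n-step Gaussian random walk spends above zero."""
    s, pos = 0.0, 0
    for _ in range(n):
        s += rng.gauss(0.0, 1.0)
        if s > 0:
            pos += 1
    return pos / n

trials, n = 2000, 500
lopsided = sum(1 for _ in range(trials)
               if not 0.05 < fraction_positive(n) < 0.95)
frac = lopsided / trials
# Arc-sine law prediction: 2*(2/pi)*asin(sqrt(0.05)) ≈ 0.287
```

Close to 29% of the simulated records are that lopsided, in line with the arc-sine law and wholly at odds with the intuition that a fair game hovers near breakeven.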
The appearance, therefore, of a more or less upward trending movement of a stock
price over a substantial period of time is no reason necessarily to believe that it is the
consequence of anything other than pure chance i.e. the unpredictable outcome of a
myriad of uncontrollable influences. The trend will eventually reverse and a correspondingly persistent downward trending price movement will ensue. Such patterns of
persistence and reversal can lead to waves in a price track record like those
appearing in the raw or detrended CREF time series in Figure 6.1. The undulations
give the illusion of a predictable cyclical movement but that is only an illusion, as
established, for example, by examination of the power spectrum. There is no deterministic periodicity only the random outcomes of a stochastic process.
To justify the preceding assertion, consider the probability that a Gaussian
random walker, which began at the origin (location 0 at time 0) and remained in
the positive domain for n 1 steps then crosses the zero line into the negative domain
for the first time at the nth step. We are seeking the probability of first passage
through the origin
Pr(S_1 ≥ 0, ... S_{n−1} ≥ 0, S_n < 0)
= (2π)^{−n/2} ∫_{0}^{∞} e^{−z_1²/2} dz_1 ∫_{−z_1}^{∞} e^{−z_2²/2} dz_2 ∫_{−(z_1+z_2)}^{∞} e^{−z_3²/2} dz_3 ... ∫_{−Σ_{j=1}^{n−2} z_j}^{∞} e^{−z_{n−1}²/2} dz_{n−1} ∫_{−∞}^{−Σ_{j=1}^{n−1} z_j} e^{−z_n²/2} dz_n
= (2π)^{−n/2} Π_{k=1}^{n−1} ∫_{−Σ_{j=1}^{k−1} z_j}^{∞} e^{−z_k²/2} dz_k ∫_{−∞}^{−Σ_{j=1}^{n−1} z_j} e^{−z_n²/2} dz_n,    (6.7.8)

which differs critically from the expression in (6.7.3) by the limits on variable z_n, i.e. on the rightmost integral, where the condition S_n = Σ_{j=1}^{n} X_j < 0 poses the constraint z_n < −Σ_{j=1}^{n−1} z_j. The rightmost integral can be re-expressed as

∫_{−∞}^{−Σ_{j=1}^{n−1} z_j} e^{−z_n²/2} dz_n = ∫_{−∞}^{∞} e^{−z_n²/2} dz_n − ∫_{−Σ_{j=1}^{n−1} z_j}^{∞} e^{−z_n²/2} dz_n = √(2π) − ∫_{−Σ_{j=1}^{n−1} z_j}^{∞} e^{−z_n²/2} dz_n.    (6.7.9)
Since the first term on the right side is just the Gaussian normalization constant,
substitution of (6.7.9) into (6.7.8) reduces the probability of first passage through the
origin into the difference of two probabilities of non-negative displacement, the first
of n 1 steps and the second of n steps (as one might have anticipated)
Pr(S_1 ≥ 0, ... S_{n−1} ≥ 0, S_n < 0) = Pr(S_1 ≥ 0, ... S_{n−1} ≥ 0) − Pr(S_1 ≥ 0, ... S_{n−1} ≥ 0, S_n ≥ 0).    (6.7.10)
Since, by (6.7.4), the probability that the walk remains non-negative through m steps is 2^{−2m} (2m choose m), the first-passage probabilities (6.7.10) summed over steps 2 through n telescope to the cumulative probability

Pr(first passage by step n) = 1/2 − (1/2^{2n}) (2n choose n),    (6.7.12)

with asymptotic (large n) form

1/2 − 1/√(πn).    (6.7.13)
As the number of steps becomes infinitely large, the probability approaches 50% that
the random walker will cross from the positive into the negative domain (or from the
negative into the positive domain). Looking at the waves in the detrended CREF
time record, one can estimate a period of about 1100 days between the point where
the record initially crossed the breakeven axis into the positive domain and the first
return crossing into the negative domain. From either (6.7.12) or (6.7.13) the probability of this event is 48.3%. On average, the probability is about 35% that a
Gaussian random walker will make a first passage through the origin within 15 time
units i.e. about two weeks in terms of a record of daily stock prices. Note that for a
stationary time series there is nothing special about the point of origin. Any moment
in a long time series can be taken to be the origin of time, and the subsequent stock
prices measured with respect to that starting value.
At first thought, it may seem contradictory that (a) the probability of first passage
through the origin (which in the context of stock prices means loss of any accumulated gain) approaches 50% as the time period gets longer and (b) a Gaussian
random walker is far more likely to spend most of its time either in the positive
domain or negative domain and return only rarely to the breakeven point. However,
there is no inconsistency. The explanation is that the probability of second, third . . .
and all higher passages through the origin decreases rapidly. We can see how this
comes about by examining a relatively simple relation derived for a one-dimensional
Bernoulli random walk for which cumulative displacements from the origin are
integer multiples of a fixed step size, but which nonetheless leads to a general
property also exhibited by a one-dimensional Gaussian random walk. The probability that the random walker returns to the origin k times in a period of 2n time units
(which is expressed as such because it must be an even integer) is19
p_k = (1/2^{2n−k}) (2n−k choose n).    (6.7.14)
The probability function (6.7.14) satisfies the completeness relation when the
variable k is summed over the range (0, n) since a Bernoulli random walker can make
at most n returns to the origin in 2n steps. (A minimum of two steps is required to
leave the origin and return.)
The mean number hRi and mean square number hR2i of returns to the origin can
be calculated exactly from (6.7.14)
⟨R⟩ = Σ_{k=0}^{n} k (1/2^{2n−k}) (2n−k choose n) = 2Γ(n + 3/2)/[√π Γ(n + 1)] − 1  →  2√(n/π) ≈ 1.128 √n    (6.7.15)

⟨R²⟩ = Σ_{k=0}^{n} k² (1/2^{2n−k}) (2n−k choose n) = (2n + 3) − 6Γ(n + 3/2)/[√π Γ(n + 1)]  →  2n    (6.7.16)
and leads to the asymptotic expressions to the right of the arrows upon substitution of the asymptotic formula (Stirling's approximation) for the gamma functions. There are several unusual features to this distribution. Note first that ⟨R⟩ is proportional to
the square root of the time period. Since the return of a Bernoulli random walker to
the origin is tantamount to a tie score in the coin-tossing game, one might have
supposed that the number of such ties would be proportional to the duration of the
game i.e. if the playing time were twice as long, the number of ties would be
doubled. But that is not the case. The number of ties increases only as the square
root of the playing time. This curious fluctuation behavior virtually ensures that the
random walker will make long excursions in either the positive or negative domain
and return only relatively rarely to the origin. However, the random walker does
return eventually with a 100% probability. Second, the standard deviation
σ_R = √(⟨R²⟩ − ⟨R⟩²) = √(2(1 − 2/π) n) ≈ 0.853 √n    (6.7.17)
is also proportional to the square root of the time period, so that the fluctuation in
number of returns to the origin is of the same order as the mean a property we
encountered before (in Chapter 3) with the exponential distribution.
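The square-root growth of the number of ties is easy to confirm by simulation; this sketch (my own illustration) counts returns to the origin in Bernoulli walks of 2n steps:

```python
import random

rng = random.Random(6)

def returns_to_origin(n):
    """Number of returns to the origin in a Bernoulli walk of 2n unit steps."""
    s, r = 0, 0
    for _ in range(2 * n):
        s += 1 if rng.random() < 0.5 else -1
        if s == 0:
            r += 1
    return r

n, trials = 400, 2000
mean_R = sum(returns_to_origin(n) for _ in range(trials)) / trials
# Asymptotically <R> ~ 1.128*sqrt(400) ≈ 22.6; the exact mean is slightly lower
```

Doubling n to 800 raises the mean number of ties by only a factor √2, not 2, exactly the counterintuitive fluctuation behavior described above.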
Because the distribution (6.7.14) is not Gaussian, one must be careful to interpret correctly the statistical implications of the standard deviation (6.7.17). Since the distribution is one-sided, the probability of an observation R falling between 0 and 2⟨R⟩, i.e. within r ≡ ⟨R⟩/σ_R = 1/√(π/2 − 1) ≈ 1.324 standard deviations of the mean, is

Pr(0 ≤ R ≤ 2⟨R⟩) ≈ (2π)^{−1/2} ∫_{−r}^{r} e^{−u²/2} du ≈ 0.814.    (6.7.18)
The dispersion of the cumulative gain in a fair coin-tossing game of N trials is D = 2√(Npq), which increases with √N. Thus, if the game is fair and you bet $100 at $1 per game, your capital at the end of 100 games would lie between $90 and $110 with a probability of about 73% or between $80 and $120 with a probability of about 96%. Details are left to an appendix.
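The quoted probabilities can be computed exactly from the binomial distribution; the sketch below (my own) does the arithmetic for the $1-per-game example:

```python
from math import comb

N = 100
# Gain after N fair $1 wagers: G = 2H - N, with H ~ Binomial(N, 1/2) heads
pmf = [comb(N, h) / 2 ** N for h in range(N + 1)]

def prob_gain_within(d):
    """Probability that the net gain G lies in [-d, +d]."""
    return sum(p for h, p in enumerate(pmf) if abs(2 * h - N) <= d)

p10 = prob_gain_within(10)   # capital between $90 and $110: about 0.73
p20 = prob_gain_within(20)   # capital between $80 and $120: about 0.96
```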
Many people think that the law of averages guarantees that the more games they
play the closer their gain G will be to their expectation hGi, but that idea is wrong.
Rather, the law states that as the number of games increases the closer will be the
fraction G/N to the fraction ⟨G⟩/N, although the gap as measured by D gets wider. It is the ratio D/N that tends toward 0 with increasing N (as 1/√N), not D. Moreover, there is nothing in the law of averages to suggest a tendency for
the leads of two players to equalize within any specific game. The law implies only
that if a game is fair, then at the outset each player has equal chance of winning it,
and in the long run should win about half of the games.
In the context of the stock market, the law of averages tells you this: if you want to
be assured of a positive gain over the course of a lifetime of investing for retirement,
you will have to invest in stocks with a positive expectation. No company, of course,
can guarantee that.
Consider a linear predictor formed from the present and recent past elements of the time series

x̃_{t+τ} = c_0 x_t + c_1 x_{t−1} + c_2 x_{t−2},    (6.9.1)

which, rearranged into level, trend, and curvature contributions, is equivalent to

x̃_{t+τ} = α x_t + β (x_t − x_{t−1}) + γ (x_{t−1} − x_{t−2}).    (6.9.2)
Since the elements of the series on the right-hand side of (6.9.1) are known, the
predicted element ~x t1 of lead time units can be calculated once the constants have
been determined. In general, we would expect that the prediction is best for 1 and
gets progressively more uncertain the further one tries to forecast the future so we
will consider 1. To be able to predict the performance of stocks one day in
advance would be a marvelous boon to an average investor who knew how to do it.
We shall determine the constants by minimizing the mean square error of the prediction

ε_p² = ⟨(x̃_{t+1} − x_{t+1})²⟩
     = ⟨[(α − 1) x_t + β (x_t − x_{t−1}) + γ (x_{t−1} − x_{t−2}) − (x_{t+1} − x_t)]²⟩,    (6.9.3)
6:9:3
where the expression in the second line, obtained by substitution of (6.9.1), has been
expressly arranged to contain differences of pairs of elements. The expectation of the
square of such a difference, known as a variogram,20
Γ̄_k ≡ ⟨(x_t − x_{t−k})²⟩,    (6.9.4)

is expressible as either a difference of covariance elements γ_k = ⟨x_t x_{t−k}⟩ or an integral of the power spectrum S(ν) over a specified frequency range (ν_1, ν_2)

Γ̄_k = ⟨x_t²⟩ + ⟨x_{t−k}²⟩ − 2⟨x_t x_{t−k}⟩ = 2γ_0 − 2γ_k = 2γ_0 (1 − r_k)
    = 4 ∫_{ν_1}^{ν_2} S(ν) (1 − cos 2πkν) dν.    (6.9.5)
20 The variogram is usually represented by the Greek letter gamma. However, to avoid confusion with either the gamma function Γ(x) or the covariance function γ_k, I represent the variogram by a gamma with an overbar: Γ̄_k.
The second line in (6.9.5) is obtained from the defining relation (6.9.4) in two steps by (a) substitution of the Fourier transforms

x_t = ∫ a(ν) e^{2πiνt} dν,   x_{t−k} = x*_{t−k} = ∫ a*(ν′) e^{−2πiν′(t−k)} dν′    (6.9.6)

and (b) use of the identity

⟨a(ν) a*(ν′)⟩ = S(ν) δ(ν − ν′),    (6.9.7)
which defines the power spectral amplitudes and recognizes that Fourier amplitudes of different frequencies are uncorrelated. In the first line of (6.9.5) the assumption was made that the time series is stationary so that ⟨x_t²⟩ = ⟨x_{t−k}²⟩. For reasons explained previously, one usually detrends, i.e. transforms to zero overall mean and slope, a discrete, finite time series.
If the time series is continuous and of infinite extent, the integral in (6.9.5) theoretically extends over the range (0, ∞), provided it converges. However, a finite time series is necessarily band-limited and therefore the actual upper limit is the cut-off frequency ν_c = 1/2Δt and the lower limit (fundamental) is the reciprocal of the time period, ν_0 = 1/NΔt. For a long time series (N ≫ 1), one can take ν_0 ≈ 0, provided the integral converges. Upon substitution in (6.9.5) of the dimensionless variable u = kν/ν_c, the theoretical variogram of lag kΔt becomes

Γ̄_k = (4ν_c/k) ∫_{2k/N}^{k} S(ν_c u/k) (1 − cos πu) du.    (6.9.8)
2k=N
Determined empirically from the power spectrum of a discrete time series, the sample estimate of the variogram is

Γ̄_k^{(X)} = 4 Σ_{j=0}^{N/2} S_j (1 − cos(2πjk/N)),    (6.9.9)

where the phase of the cosine is the product of harmonic ν_j = j/NΔt and lag kΔt. The subscript X on Γ̄_k^{(X)} denotes the time series.
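The first line of the variogram relation, Γ̄_k = 2(γ_0 − γ_k), can be verified numerically from a simulated series; the sketch below (my own illustration) compares the direct definition with the covariance difference:

```python
import random

rng = random.Random(7)
a = 0.9
x = [0.0]
for _ in range(21000):
    x.append(a * x[-1] + rng.gauss(0.0, 1.0))
x = x[1000:]    # discard start-up transient

def variogram(x, k):
    """Sample variogram: average of (x_t - x_{t-k})^2."""
    return sum((x[t] - x[t - k]) ** 2 for t in range(k, len(x))) / (len(x) - k)

def cov(x, k):
    """Sample autocovariance at lag k."""
    n = len(x)
    m = sum(x) / n
    return sum((x[t] - m) * (x[t - k] - m) for t in range(k, n)) / (n - k)

g1 = variogram(x, 1)
check = 2.0 * (cov(x, 0) - cov(x, 1))   # variogram as a covariance difference
```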
Evaluation of (6.9.3) with use of the relation

γ_k = γ_0 − (1/2) Γ̄_k    (6.9.10)

leads to a mean square error

ε_p² = (α − 1)² γ_0 + (α − 1)[(β + 1) Γ̄_1 + γ (Γ̄_2 − Γ̄_1)] + (1 + β² + γ²) Γ̄_1 + β(γ − 1)(Γ̄_2 − 2Γ̄_1) − γ (Γ̄_1 + Γ̄_3 − 2Γ̄_2)    (6.9.11)

expressed entirely in terms of the variance γ_0 = ⟨x_t²⟩ and the variograms. The unknown constants are determined by solving the set of equations

∂ε_p²/∂α = ∂ε_p²/∂β = ∂ε_p²/∂γ = 0.    (6.9.12)
However, before undertaking this task, we can save much work at the outset by
examining qualitatively the behavior of the variograms in two limiting cases characterizing the kinds of stochastic processes we know have arisen.
In the first limiting case, that of quasi-white noise, applicable to nuclear decay and first differences of stock prices, the correlation coefficients r_k are all close to 0 for k ≠ 0, and therefore the variograms Γ̄_k ≈ 2γ_0. The mean square error then reduces to

ε_p²/γ_0 = 1 + α² + 2αβ + 2β² + 2γ² − 2βγ,    (6.9.13)

for which (6.9.12) leads to a set of homogeneous algebraic equations (i.e. with no constant term) with unique solution α = β = γ = 0. The forecast equation (6.9.1) is then x̃_{t+1} = 0. The past record provides no useful information; the process is unpredictable.
In the opposite limiting case, that of quasi-Brownian noise, applicable to molecular diffusion and stock price movement, the correlation coefficients r_k are all close to unity, i.e. r_0 = 1, r_1 = 1 − δ, r_2 = 1 − 2δ, and so on for some small quantity δ. Then the variograms Γ̄_k ≈ 2kδγ_0. Neglecting terms linear in δ yields a mean square error

ε_p²/γ_0 ≈ (α − 1)²,    (6.9.14)
which attains its minimum value for α = 1. If we set α = 1 in (6.9.11), the first two terms vanish and the values of β and γ that solve the second and third equations in (6.9.12) are

β = (Γ̄_2 − 2Γ̄_1)(Γ̄_1 + 2Γ̄_2 − Γ̄_3) / [Γ̄_2 (4Γ̄_1 − Γ̄_2)]
γ = (2Γ̄_1 Γ̄_3 − 2Γ̄_1² − Γ̄_2²) / [Γ̄_2 (4Γ̄_1 − Γ̄_2)].    (6.9.15)
Had curvature been neglected in the forecasting equation (6.9.1), then the slope parameter minimizing the mean square error would be

β_0 = Γ̄_2/(2Γ̄_1) − 1.    (6.9.16)
The conditions for pre-selecting α, which determine the nature of the solution, can be given a more rigorous, general basis by examining quantitatively the properties of the integral that defines the variogram in (6.9.8). It is often the case that the power spectrum of a random process takes the form of a power law S(ν) ∝ |ν|^{−η} for some exponent η over an applicable frequency range. The exponent defines the character of the stochastic process and can be estimated empirically by fitting a line of regression to a plot of log S(ν) vs log ν, the slope of which is independent of the choice of logarithmic base.
For a power-law spectrum the integral defining the variogram (6.9.8) takes the form

I(k, η) = ∫_{2k/N}^{k} (1 − cos πu) u^{−η} du,    (6.9.17)

whence Γ̄_k ∝ k^{η−1} I(k, η). If the variogram follows the pure power-law behavior Γ̄_k ∝ k^{η−1}, then (6.9.16) yields the slope parameter

β_0 = 2^{η−2} − 1    (6.9.18)

for Eq. (6.9.1). These parameters are plotted as a function of η in Figure 6.13. From the figure, one discerns a qualitative difference in predicted behavior depending on whether η is larger than 2 or less than 2. If the former pertains, then β_0 and β_1 are positive, and a time series that was increasing in the (recent) past is predicted to increase further in the (near) future, a property referred to as persistence. If the latter pertains, then β_0 and β_1 are negative, and a time series that was increasing is now predicted to reverse and decrease, a property termed anti-persistence.
Fig. 6.13 [Prediction coefficients plotted as a function of the spectral exponent η; plot not reproduced.]
The empirical variogram ratios

J(k) ≡ Γ̄_k / Γ̄_1    (6.9.19)

calculated from (6.9.9) are well reproduced theoretically for η = 1.8, as summarized in Table 6.3 together with the estimated forecast parameters.
Prediction parameters β_0, β_1, and γ_1 all close to 0 mean that contributions from trend alone or from trend and curvature vanish to any practically useful degree, again signifying that the records of price changes were essentially white noise. Thus
Table 6.3

Time series   J(2)    J(3)    Estimated forecast parameters
CREF          1.933   2.786   0.034       0.0102   0.0410
AAPL          2.021   3.042   0.011       0.0103   3.753 × 10⁻⁴
GRNG          2.016   3.036   7.81 × 10⁻³ 0.0102   2.378 × 10⁻³

Theoretical values for α = 1.8: I(2; 1.8) = 2.118, I(3; 1.8) = 3.062.
the prediction equation (6.9.1) reduces simply to x̃_(t+1) = x_t, re-confirming that the best prediction for the price tomorrow is the price today.

A stochastic process with the property that the conditional expectation of the next value, given the current and preceding values, is always the current value is known as a martingale, a term originally referring to a betting strategy whereby the gambler doubles his bet after every loss in the expectation of regaining the losses plus a profit equal to the original bet. The betting strategy is unsound; the gambler who follows it has no long-term advantage over any other betting strategy, including the placement of bets of random amounts.
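The unsoundness of the doubling strategy is easy to check numerically. Below is a minimal sketch (my illustration, not from the text): a fair-coin gambler with a finite bankroll either doubles the stake after each loss or always bets one unit; both strategies leave the expected final wealth at its starting value, exactly as a martingale demands. Function names and parameters are illustrative.

```python
import random

def play(strategy, rounds=100, bankroll=1000, seed=None):
    """Simulate one gambler betting on fair coin flips.
    strategy: 'double' doubles the stake after each loss;
              'flat' always bets 1 unit."""
    rng = random.Random(seed)
    wealth = bankroll
    stake = 1
    for _ in range(rounds):
        if wealth <= 0:
            break
        stake = min(stake, wealth)          # cannot bet more than we have
        if rng.random() < 0.5:              # win: gain the stake
            wealth += stake
            stake = 1                       # doubling resets after a win
        else:                               # loss: forfeit the stake
            wealth -= stake
            if strategy == 'double':
                stake *= 2
    return wealth

def mean_final(strategy, trials=20000):
    return sum(play(strategy, seed=i) for i in range(trials)) / trials

# Both strategies hover near the initial bankroll of 1000: the game is a
# martingale, so no betting scheme produces a long-run expected gain.
print(mean_final('double'), mean_final('flat'))
```

The doubling bettor merely trades many small wins for rare catastrophic losses; the expectation is unchanged.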
The movement of stock prices in a stock market is a martingale. The expectation of tomorrow's price is today's price. When I wrote at the end of Section 6.1 that an investor can expect to gain nothing from the stock market in the long run, I was not making a snide comment, but summarizing accurately what a statistical analysis of stock time series taught me.
B. G. Malkiel, A Random Walk Down Wall Street (Norton, New York, 2003), First Edition 1973.
strategy may seem to be a reasonable one, provided that such meaningful criteria can actually be established and pertinent data acquired, with the result that ensuing assessments are found to be empirically (i.e. predictively) valid. Whether this is the case or not is controversial.

Princeton University professor and former member of the Council of Economic Advisors Burton Malkiel challenged both schools of analysts by writing (page 24 in the 1996 edition of his book) that "a blindfolded monkey throwing darts at a newspaper's financial pages could select a portfolio that would do just as well as one carefully selected by the experts." The Wall Street Journal (WSJ) took up the challenge in 1988 with a contest to match the performance of four stocks selected by pros against four stocks chosen by WSJ staff who threw darts at a stock table. Competitions, initiated at one-month intervals, ran initially for one month and later were extended to six months, at the end of which the price appreciation of the stock picks in a given contest was compared. In 1998 the WSJ published the results of the hundredth dartboard contest.
Of the 100 competitions, according to the WSJ, experts beat dartboarders 61 to 39. However, according to Malkiel and other academic researchers, the competition was seriously flawed as a test of the dartboard hypothesis, and the outcomes were erroneously interpreted. The most serious shortcoming was that the contests, structured more for entertainment than for research, were not double-blind. As the gold standard of statistical testing of a product (e.g. a new drug) or hypothesis, a double-blind method is one in which neither the administrators nor the subjects of a test know its ongoing development and results until the test is completed. Otherwise, human nature being what it is, personal biases, conscious or unintended, may strongly influence the outcome. The WSJ, apparently ignorant of proper protocols, published the experts' stock picks and explanations of these selections at the start of each contest, thereby biasing the subsequent investments of their many readers and inflating the returns of the selected stocks. After the contests, the values of the expert-selected stocks fell, whereas the dartboard-selected stocks continued to do well. All in all, detailed analysis of the contests showed that the experts did not outperform either the dartboarders or the market.24
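Whether the experts' 61-to-39 margin is even statistically distinguishable from coin flipping is itself a quick binomial calculation; the sketch below applies a standard one-sided binomial test to the WSJ tally (the test is my illustration, not part of Malkiel's analysis):

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """Pr(X >= k) for X ~ Bin(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Under the null hypothesis that experts and darts are equally good (p = 1/2),
# the chance of the experts winning 61 or more of 100 contests:
p_one_sided = binom_tail(100, 61)
print(round(p_one_sided, 4))   # a bit under 2%, nominally "significant" --
                               # which is why the contest-design biases matter
```

A nominally significant margin can thus be produced entirely by the publication bias described above, without any genuine stock-picking skill.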
There is a curious irony to the randomness of stock prices that reflects the fundamental hypothesis of market behavior: the so-called efficient-market hypothesis. As interpreted by Malkiel,

The efficient-market theory does not . . . state that stock prices move aimlessly and erratically and are insensitive to changes in fundamental information. On the contrary, the reason prices move in a random walk is just the opposite: The market is so efficient (prices move so quickly when new information does arise) that no one can consistently buy or sell quickly enough to
24. B. Liang, "The Dartboard Column: The Pros, the Darts, and the Market," http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1068.
benefit. And real news develops randomly, that is, unpredictably. It cannot be predicted by
studying either past technical or fundamental information.
In a way, this is the same reasoning that underlies the application of statistical
physics to multi-particle systems like a (classical) gas. The molecules in a macroscopic
quantity of gas move in direct response to the forces acting on them. Metaphorically
speaking, these forces, arising from both the environment and intermolecular interactions, are like the new information in the quotation from Malkiel. They change
rapidly (due to temperature fluctuations, variations in intermolecular distances, etc.)
and the molecular responses occur rapidly. No computer, however large its memory,
could keep track of these changes, nor could any human or computer make use of
this astronomically vast amount of information in a timely way even if it were
possible (which it isnt) to catalog it. So the whole system behaves for all practical
purposes randomly, describable by coarse-grained statistical state variables (temperature, pressure, density, etc.) and not by the Newtonian coordinates and momenta
of each molecule.
The financial lesson that Malkiel and others draw from acceptance of the efficient-market hypothesis is that it is fruitless to time the market, i.e. to try to guess an optimal time to get in or out of a particular stock. The best strategy, we are told, is to
buy an index stock and hold it for the long term. In that way the investor can do as
well as the market does, since there is no viable strategy for beating the market. In
support of this conclusion, Malkiel cites25 H. N. Seybun of the University of
Michigan who
. . .found that 95 percent of the significant market gains over the thirty-year period from the
mid-1960s through the mid-1990s came on 90 of the roughly 7,500 trading days. If you
happened to miss those 90 days, just over 1 percent of the total, the generous long-run stock
market returns of the period would have been wiped out. The point is that market timers risk
missing the infrequent large sprints that are the big contributors to performance.
n = P/kT,   (6.10.1)

in which k = 1.38 × 10⁻²³ J/K is the universal Boltzmann constant. Using (6.10.1) one readily finds that there are approximately n ≈ 2.45 × 10¹⁹ molecules/cm³ at 1 atm pressure and room temperature (about 300 K). To a good approximation, therefore, the mean distance d between molecules is d ≈ n^(−1/3), or about 3.4 × 10⁻⁷ cm under the presumed conditions. The molecules move about like a swarm of mad bees with a root-mean-square (rms) speed derivable from the classical equipartition theorem (i.e. mean energy ½kT for each translational degree of freedom)

v_rms = √(3kT/m).   (6.10.2)

Under the stated conditions, this speed is about 515 m/s.
In an equilibrium state the gas molecules are uniformly distributed macroscopically, but because of statistical fluctuations in the exchange of energy and momentum
with the environment there will occur randomly from time to time and place to place
pockets of higher or lower density or pressure than the mean. For a system to be in
stable equilibrium means that fluctuations of this kind subsequently damp out, rather
than intensify. How quickly do they damp out? Fluctuations in the gas dissipate on a
time scale comparable to the time between collisions of gas molecules because it
is these collisions that are responsible in the first place for creating equilibrium
conditions throughout the gas. From the kinetic theory of gases we find that the mean free path λ (i.e. the mean distance between collisions) of molecules of diameter a is given by the expression

λ = 1/(√2 π a² n),   (6.10.3)

and therefore the mean time between collisions can be estimated from the relation

τ_c = λ/v_rms.   (6.10.4)
Evaluating (6.10.4) in the case of nitrogen gas (N₂), for which the molecular mass (28 atomic mass units) is m ≈ 4.7 × 10⁻²⁶ kg and the molecular size is a ≈ d/8.74 ≈ 3.9 × 10⁻⁸ cm (inferred from the ratio of densities of nitrogen in the liquid state, where the molecules are assumed contiguous, and the gaseous state, where they are separated by d), we find that τ_c ≈ 1.2 × 10⁻¹⁰ seconds.
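The chain of estimates in this passage is short enough to verify end to end. The sketch below reproduces the quoted order-of-magnitude values (SI units; the effective molecular diameter is the inferred a ≈ 3.9 × 10⁻¹⁰ m):

```python
import math

k_B = 1.38e-23        # Boltzmann constant (J/K)
T = 300.0             # room temperature (K)
P = 1.013e5           # 1 atm (Pa)
m = 4.7e-26           # mass of an N2 molecule (kg)
a = 3.9e-10           # effective molecular diameter (m)

n = P / (k_B * T)                                # number density, Eq. (6.10.1)
d = n ** (-1/3)                                  # mean intermolecular distance
v_rms = math.sqrt(3 * k_B * T / m)               # rms speed, Eq. (6.10.2)
lam = 1 / (math.sqrt(2) * math.pi * a**2 * n)    # mean free path, Eq. (6.10.3)
tau_c = lam / v_rms                              # collision time, Eq. (6.10.4)

print(f"n     = {n/1e6:.3g} molecules/cm^3")     # ~2.4e19
print(f"d     = {d*100:.2g} cm")                 # ~3.4e-7
print(f"v_rms = {v_rms:.0f} m/s")                # ~515
print(f"tau_c = {tau_c:.2g} s")                  # ~1.2e-10
```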
In other words, as long as we observe the sample of nitrogen gas on time scales long
compared to fractions of a nanosecond, departures from equilibrium damp out
sufficiently rapidly that our apparatus is insensitive to their presence. A few tenths
of a nanosecond is a very short time interval; most laboratories do not have the means
to measure or in any way exploit processes on such a short time scale. But some do.
The 1999 Nobel Prize in Chemistry was awarded to a scientist (Ahmed Zewail of
Caltech) for showing that it is possible with rapid laser techniques to see how atoms
in a molecule move during a chemical reaction.26 The technique is known as
26. http://nobelprize.org/nobel_prizes/chemistry/laureates/1999/press.html
27. A. L. Cavalieri et al., "Attosecond spectroscopy in condensed matter," Nature 449 (2007) 1029–1032.
28. TIAA-CREF, http://en.wikipedia.org/wiki/TIAA-CREF.
29. See T. A. Bass, The Predictors (Henry Holt and Co., New York, 1999) for the fascinating story of "How a band of maverick physicists used chaos theory to trade their way to a fortune on Wall Street" (blurb on front cover).
30. S. Kroft, CBS News, "How Speed Traders Are Changing Wall Street," http://www.cbsnews.com/stories/2010/10/07/60minutes/main6936075.shtml.
Appendices

Consider the joint probabilities p(A_iB_j) of two systems A and B, with marginals

p(A_i) = Σ_(j=1)^(m′) p(A_iB_j),    Σ_(i=1)^m p(A_i) = Σ_(i=1)^m Σ_(j=1)^(m′) p(A_iB_j) = 1,   (6.11.1)

and the conditional entropy (6.3.11) for system A given knowledge of events in system B,

H(A|B) = −Σ_(i,j) p(A_iB_j) log [p(A_iB_j)/p(B_j)].   (6.11.2)

The joint entropy is

H(A,B) = −Σ_(i,j) p(A_iB_j) log p(A_iB_j).   (6.11.3)

From the inequality log x ≤ x − 1 it follows that

H(A|B) − H(A) = Σ_(i,j) p(A_iB_j) log [p(A_i)p(B_j)/p(A_iB_j)] ≤ Σ_(i,j) p(A_iB_j) [p(A_i)p(B_j)/p(A_iB_j) − 1],   (6.11.4)

and the right-hand side vanishes identically:

Σ_(i,j) p(A_iB_j) [p(A_i)p(B_j)/p(A_iB_j) − 1] = Σ_i p(A_i) Σ_j p(B_j) − 1 = 0.   (6.11.5)
A stationary series governed by an autoregressive model can be expressed in moving-average form

y_t ≡ x_t = Σ_(j=0)^∞ ψ_j ε_(t−j) = (Σ_(j=0)^∞ ψ_j B^j) ε_t = Ψ(B) ε_t,   (6.12.1)

with coefficients ψ_j determinable from the AR(n) master equation. Use of the backshift operator B permits one to represent the series succinctly by means of the generator Ψ(B) defined above.

The covariance elements of the series can likewise be written in terms of the expansion coefficients

γ_k = ⟨y_t y_(t−k)⟩ = σ² Σ_(j=0)^∞ ψ_j ψ_(j+k),   (6.12.2)

so that the covariance generating function becomes (setting h = j + k)

Γ(B) ≡ Σ_k γ_k B^k = σ² Σ_k Σ_(j=0)^∞ ψ_j ψ_(j+k) B^k = σ² (Σ_(j=0)^∞ ψ_j B^(−j))(Σ_(h=0)^∞ ψ_h B^h) = σ² Ψ(B)Ψ(B⁻¹),   (6.12.3)

from which the power spectral density follows as

S(ν) = γ₀ + 2 Σ_(k=1)^∞ γ_k cos 2πkν.   (6.12.4)
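As a concrete check of (6.12.2), consider an AR(1) process, for which the MA(∞) weights are ψ_j = φ^j; the sum σ² Σ_j ψ_j ψ_(j+k) then has the closed form σ²φ^k/(1 − φ²). A minimal sketch (the truncation length and the values of φ and σ² are illustrative):

```python
def acov_from_psi(phi, sigma2, k, terms=2000):
    """Autocovariance gamma_k = sigma^2 * sum_j psi_j psi_{j+k}, Eq. (6.12.2),
    for an AR(1) process whose MA(infinity) weights are psi_j = phi**j."""
    return sigma2 * sum(phi**j * phi**(j + k) for j in range(terms))

phi, sigma2 = 0.6, 1.0
for k in range(4):
    truncated = acov_from_psi(phi, sigma2, k)
    closed = sigma2 * phi**k / (1 - phi**2)   # closed-form gamma_k
    print(k, round(truncated, 6), round(closed, 6))
```

For |φ| < 1 the truncated sum converges geometrically, so a few thousand terms already agree with the closed form to machine precision.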
x_t = φ x_(t−1) + ε_t,   (6.13.1)

for which the two parameters φ and σ² are to be estimated by the principle of maximum likelihood. Thus, the complete likelihood function (i.e. posterior probability in a Bayesian analysis) is

Pr({x_t}|φ, σ²) = Pr({ε_t, t = 2 … n}|φ, σ², x₁) Pr(x₁|φ, σ²),   (6.13.2)

where

Pr(x₁|φ, σ²) = (2πσ_t²)^(−1/2) e^(−x₁²/2σ_t²)   (6.13.3)

is the conditional probability that the first element of the series is x₁ and

Pr({ε_t, t = 2 … n}|φ, σ², x₁) = (2πσ²)^(−(n−1)/2) e^(−Σ_(t=2)^n ε_t²/2σ²)   (6.13.4)

is the conditional probability of the n − 1 subsequent shocks given x₁. Note that the variance σ² appearing in (6.13.4) is that of the shock ε_t, whereas the variance σ_t² in (6.13.3) is that of x_t. The two variances were shown to be related to one another and to the covariance element γ₀ by

γ₀ = ⟨x_t²⟩ = σ_t² = σ²/(1 − φ²).   (6.13.5)

Upon combining relations (6.13.3) through (6.13.5), the complete likelihood function (6.13.2) becomes

Pr({x_t}|φ, σ²) = (2πσ²)^(−n/2) (1 − φ²)^(1/2) exp{−[(1 − φ²)x₁² + Σ_(t=2)^n (x_t − φx_(t−1))²]/2σ²},   (6.13.6)

from which follows the log-likelihood function

L = −(n/2) ln 2πσ² + ½ ln(1 − φ²) − [(1 − φ²)x₁² + Σ_(t=2)^n (x_t − φx_(t−1))²]/2σ².   (6.13.7)
⎛ γ₀        γ₁        ⋯  γ_(m−1) ⎞
⎜ γ₁        γ₀        ⋯  γ_(m−2) ⎟
⎜ ⋮                        ⋮     ⎟
⎝ γ_(m−1)   γ_(m−2)   ⋯  γ₀     ⎠   (6.13.8)
Setting the derivatives of (6.13.7) with respect to φ and σ² equal to zero leads to the pair of equations

[Eq. (6.13.9): the likelihood equation ∂L/∂φ = 0 for φ̂.]

σ̂² = (1/n)[(1 − φ̂²)x₁² + Σ_(t=2)^n x_t² − 2φ̂ Σ_(t=2)^n x_t x_(t−1) + φ̂² Σ_(t=2)^n x_(t−1)²]   (6.13.10)
in the parameters, which must be solved numerically. These numerical solutions for
the parameters of the CREF, AAPL, and GRNG time series and the values of the
parameters obtained previously from the conditional likelihood function are in
excellent agreement.
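One simple way to carry out the numerical solution mentioned above is to profile out σ²: for fixed φ the log-likelihood (6.13.7) is maximized by σ̂²(φ) from (6.13.10), leaving a one-dimensional search over φ ∈ (−1, 1). The sketch below applies this to a simulated series; it illustrates the method only (it is not the author's code), and a grid search stands in for a proper optimizer.

```python
import math, random

def simulate_ar1(phi, sigma, n, seed=1):
    """Generate a stationary AR(1) series x_t = phi*x_{t-1} + eps_t."""
    rng = random.Random(seed)
    x = [rng.gauss(0, sigma / math.sqrt(1 - phi**2))]   # stationary start
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0, sigma))
    return x

def profile_loglike(phi, x):
    """Exact AR(1) log-likelihood (6.13.7) with sigma^2 replaced by its
    conditional maximizer from (6.13.10)."""
    n = len(x)
    ss = (1 - phi**2) * x[0]**2 + sum((x[t] - phi * x[t-1])**2
                                      for t in range(1, n))
    s2 = ss / n
    return -0.5 * n * math.log(2 * math.pi * s2) \
           + 0.5 * math.log(1 - phi**2) - 0.5 * n

x = simulate_ar1(phi=0.5, sigma=1.0, n=2000)
grid = [i / 500 for i in range(-499, 500)]          # phi in (-1, 1)
phi_hat = max(grid, key=lambda p: profile_loglike(p, x))
print(round(phi_hat, 3))   # close to the true value 0.5
```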
The matrix H of second derivatives of L with elements

H₁₁ = ∂²L/∂φ² = −(1 + φ̂²)/(1 − φ̂²)² − (1/σ̂²)(Σ_(t=2)^n x_(t−1)² − x₁²)

H₁₂ = H₂₁ = ∂²L/∂φ∂σ² = −(1/σ̂⁴)[Σ_(t=2)^n (x_t − φ̂x_(t−1)) x_(t−1) + φ̂ x₁²]   (6.13.11)

H₂₂ = ∂²L/∂(σ²)² = n/(2σ̂⁴) − (1/σ̂⁶)[Σ_(t=2)^n (x_t − φ̂x_(t−1))² + (1 − φ̂²) x₁²]

yields the covariance matrix of the parameters through the relation C = −H⁻¹.
[Eqs. (6.13.12)–(6.13.13): setup of the coin-toss gain; the outcome of a single game is a random variable X_i = ±1 with probabilities p and q = 1 − p.]

G = Σ_(i=1)^N X_i.   (6.13.14)
g_G(t) = ⟨e^(Gt)⟩ = (p e^t + q e^(−t))^N.   (6.13.15)

The expectation and variance of the gain are readily calculated from the mgf:

⟨G⟩ = dg_G(t)/dt |_(t=0) = N(p − q)   (6.13.16)

var G = ⟨G²⟩ − ⟨G⟩² = d² ln g_G(t)/dt² |_(t=0) = 4Npq.   (6.13.17)

It then follows from Eq. (6.13.17) that the standard deviation σ_D of the difference D is

σ_D = 2√(Npq).   (6.13.18)
Alternatively, one can calculate the mgf of the random variable D/N:

g_(D/N)(t) = ⟨e^(Dt/N)⟩ = e^(−⟨G⟩t/N) (p e^(t/N) + q e^(−t/N))^N   (6.13.19)

or

ln g_(D/N)(t) = −⟨G⟩t/N + N ln(p e^(t/N) + q e^(−t/N)) = −⟨G⟩t/N − t + N ln[1 + p(e^(2t/N) − 1)],   (6.13.20)

which, when expanded in a power series to third order in t, yields a mgf of Gaussian form leading to the asymptotic identification

D/N = N(0, 4pq/N).   (6.13.21)

Thus D/N tends toward 0 as 1/√N.
The probability of losing k games and winning N − k games to achieve a gain of G = N − 2k is given exactly by the binomial expression

Pr(k|N, p) = C(N, k) p^(N−k) q^k.   (6.13.22)

Under conditions of a fair game (p = q = ½), the probability that one's gain after 100 games lies in the range (−10 ≤ G ≤ 10) or in the range (−20 ≤ G ≤ 20) is obtained from the following sums of (6.13.22) over k (performed by Maple):

Pr(−10 ≤ G ≤ 10) = 2^(−100) Σ_(k=45)^55 C(100, k) = 0.729

Pr(−20 ≤ G ≤ 20) = 2^(−100) Σ_(k=40)^60 C(100, k) = 0.965.   (6.13.23)
For comparison, the corresponding Gaussian probabilities at one and two standard deviations are

(1/√(2π)) ∫_(−1)^1 e^(−x²/2) dx = 0.683

(1/√(2π)) ∫_(−2)^2 e^(−x²/2) dx = 0.954.   (6.13.24)
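The Maple sums in (6.13.23) and the Gaussian integrals (6.13.24) take only a few lines in any language with exact integer arithmetic; a sketch:

```python
from math import comb, erf, sqrt

def prob_k_range(lo, hi, n=100):
    """Pr(lo <= k <= hi) for a fair binomial Bin(n, 1/2), per Eq. (6.13.22)."""
    return sum(comb(n, k) for k in range(lo, hi + 1)) / 2**n

p1 = prob_k_range(45, 55)    # gain |G| <= 10  (G = N - 2k)
p2 = prob_k_range(40, 60)    # gain |G| <= 20
gauss1 = erf(1 / sqrt(2))    # Gaussian mass within +-1 sigma, 0.683
gauss2 = erf(2 / sqrt(2))    # Gaussian mass within +-2 sigma, 0.954
print(round(p1, 3), round(p2, 3), round(gauss1, 3), round(gauss2, 3))
```

The exact binomial masses slightly exceed the Gaussian values because the discrete sum effectively includes a half-bin continuity correction at each boundary.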
7

On target: uncertainties of projectile flight

In his satirical song about WW II German rocket engineer Wernher von Braun, songwriter and erstwhile mathematician Tom Lehrer sings (in a German accent): "Vonce ze rockets are up, who cares vere zey come down?" Physicists, of course, care very much where they come down. Indeed, the study of projectile motion is ordinarily a fundamental part of any study of classical dynamics, introductory or advanced, where it serves primarily to illustrate the laws of motion applied to objects in free fall in a uniform gravitational field. In this context, as evidenced by numerous textbooks beginning with Galileo's own, first published in 1638,1 students are taught to solve problems that fall into certain standard categories such as
(1) ground-to-ground targeting (e.g. a missile is fired with speed v at an angle θ to the horizontal at a target a horizontal distance d away),
(2) air-to-ground targeting (e.g. a package is dropped at a height h above the ground from an airplane traveling horizontally at speed v),
(3) ground-to-air targeting (e.g. a projectile is launched at speed v and angle θ to the horizontal at a pie plate simultaneously dropped from a height h and horizontal distance d),
and possibly others.
In the commonly encountered textbook and classroom examples the specified
dynamical variables are exact, rather than distributed, quantities, and the objective
1. Galileo Galilei, Discorsi e Dimostrazioni Matematiche Intorno a Due Nuove Scienze [Discourses and Mathematical Demonstrations Relating to Two New Sciences] (Elzevir, 1638), Fourth Day: Theorem I, Proposition I. (Translation from Italian by M. P. Silverman.)
2. Tom Lehrer, "Wernher von Braun," performed in 1965 in San Francisco, CA. A recorded performance is available at http://www.youtube.com/watch?v=QEJ9HrZq7Ro.
See also M. P. Silverman, W. Strange, and T. C. Lipscombe, "The distribution of composite measurements: How to be certain of the uncertainties in what we measure," American Journal of Physics 72 (2004) 1068–1081.
R = V² sin 2θ / g,   (7.2.1)

where the acceleration of gravity is g ≈ 9.8 m/s² near the Earth's surface. If V and θ vary randomly from launch to launch as governed by pdfs p_V(v) and p_Θ(θ), then the range will also follow a pdf p_R(r), which can be deduced from p_V(v) and p_Θ(θ) by methods developed in Chapter 5. Unless otherwise noted, we will again employ standard notation whereby an upper-case letter represents a random variable and the corresponding lower-case letter signifies its realization in a sample.
Once the pdf p_X(x) of a random variable X (where a ≤ X ≤ b) is known, all the statistical moments m_k,

m_k = ⟨X^k⟩ = ∫_a^b x^k p_X(x) dx,   (7.2.2)

can be calculated, among them (c) the skewness

Sk_X = ⟨((X − ⟨X⟩)/σ_X)³⟩ = (m₃ − 3m₂m₁ + 2m₁³)/σ_X³,   (7.2.4)

which is a measure of the asymmetry of the distribution about the mean, and (d) the kurtosis

K_X = ⟨((X − ⟨X⟩)/σ_X)⁴⟩,   (7.2.5)

which is a measure of the flatness of the distribution near the mean, and the variances in these (and other) functions of X.

The pdf, which must satisfy the completeness relation

∫_a^b p_X(x) dx = 1,   (7.2.6)

is the derivative

p_X(x) = dF_X(x)/dx   (7.2.7)

of the cumulative probability function

F_X(x) = Pr(X ≤ x) = ∫_a^x p_X(x′) dx′.   (7.2.8)
In considering the distribution of projectile range, we will examine the case in which the angle of launch is sufficiently well defined that for all practical purposes it can be taken to be a constant θ₀. This not only makes the resulting mathematics more tractable, but corresponds reasonably well to the conditions of many experiments for which the launch angle can be set precisely and the uncertainty in the force of discharge leads to a spread in projectile speed. In that case, the range takes the form R = cV² with constant c = sin 2θ₀/g. The more general case will be deferred to an appendix.

The pdf p_R(r) is deducible by first determining the distribution F_R(r) in terms of the distribution of V:

F_R(r) = Pr(R ≤ r) = Pr(cV² ≤ r) = Pr(|V| ≤ √(r/c)) = Pr(V ≤ √(r/c)) − Pr(V ≤ −√(r/c)),   (7.2.9)

from which the pdf follows by differentiation:

p_R(r) = dF_R(r)/dr = [p_V(√(r/c)) + p_V(−√(r/c))] / (2√(cr)).   (7.2.10)
Let us suppose for the present that the speed is distributed as a normal random variable V = N(V₀, σ_V²) with mean V₀ and standard deviation σ_V. As a reminder, the designation N(V₀, σ_V²) signifies a pdf of the form

p_X(x) = (1/(√(2π) σ_X)) e^(−(x−⟨X⟩)²/2σ_X²).   (7.2.11)

Substitution of (7.2.11) into (7.2.10) then yields

p_R(r) = [e^(−(√(r/c)−V₀)²/2σ_V²) + e^(−(√(r/c)+V₀)²/2σ_V²)] / (2√(2π) σ_V √(cr)),   (7.2.12)
which can be written compactly as

p_R(r) = [e^(−(√r−√R₀)²/2σ_p²) + e^(−(√r+√R₀)²/2σ_p²)] / (2√(2πr) σ_p),   (7.2.13)

where

R₀ = V₀² sin 2θ₀/g,    σ_p = σ_V √(sin 2θ₀/g)   (7.2.14)

are conveniently defined quantities and not a mean and standard deviation of the non-Gaussian distribution (7.2.13). For (V₀/σ_V) ≫ 1 the second exponential in (7.2.12) is numerically insignificant, although its presence is required theoretically for correct normalization.4
Since R = cV², the expectation values of functions of R can be evaluated using (7.2.13) or, equivalently, as expectations of functions of V using (7.2.11). Either way, one obtains the following useful statistical relations:

mean: ⟨R⟩ = c⟨V²⟩ = c(V₀² + σ_V²)   (7.2.15)

second moment: ⟨R²⟩ = c²⟨V⁴⟩ = c²(V₀⁴ + 6V₀²σ_V² + 3σ_V⁴)   (7.2.16)

variance: var R = c² var(V²) = 2c²σ_V²(2V₀² + σ_V²)   (7.2.17)

standard deviation: σ_R = √(var R) = 2cV₀σ_V [1 + ½(σ_V/V₀)²]^(1/2)   (7.2.18)

skewness: Sk_R = ⟨((R − ⟨R⟩)/σ_R)³⟩ = ⟨((V² − ⟨V²⟩)/σ_(V²))³⟩
   = (σ_V/V₀)[3 + (σ_V/V₀)²] / [1 + ½(σ_V/V₀)²]^(3/2)
   = 3(σ_V/V₀) − (5/4)(σ_V/V₀)³ + (21/32)(σ_V/V₀)⁵ + …   (7.2.19)

in which the series expansion of the exact expression for skewness is truncated at O((σ_V/V₀)⁶). We will not need the kurtosis or moments m_k for k > 3 in this chapter.
The mode of a distribution is the argument x at which the pdf p_X(x) is maximum. By ignoring the second term in (7.2.13) under the assumption that (V₀/σ_V) ≫ 1, and solving the equation d ln p_R(r)/dr = 0, one obtains the modal range R_m:

R_m = (cV₀²/4)[1 + √(1 − 4(σ_V/V₀)²)]² ≈ cV₀²[1 − 2(σ_V/V₀)²].   (7.2.20)

The approximate second relation results from a Taylor-series expansion to first nonvanishing order in (σ_V/V₀)².

4. For zero mean and unit variance, relation (7.2.12) reduces to the pdf of a chi-square distribution of one degree of freedom, as expected for the square of a standard normal variate.
The upper and lower panels of Figure 7.1 respectively show histograms of the speed V = N(V₀, σ_V²) and range R = cV² for parameters V₀ = 100 m/s, σ_V = 15 m/s (and therefore σ_V/V₀ = 0.15), and θ₀ = 50°. The histogram of speeds was compiled from a sample of 50 000 trials with a Gaussian RNG. The superposed pdf p_R(r) in (7.2.13) closely delineates the envelope of the corresponding histogram of ranges, which yields sample moments in excellent agreement with values theoretically predicted by relations (7.2.15)–(7.2.19), as shown in Table 7.1. For comparison, the Gaussian-approximated pdf N(⟨R⟩, σ_R²) is also shown. The broader the distribution of speed (as gauged by the ratio σ_V/V₀), the greater the departure of the distribution of range from a normal distribution, with concomitantly increasing skewness. Because of skewness, the mean ⟨R⟩ and mode R_m depart significantly from the kinematic range R₀ (which is not a stochastic variable). For a distribution of speeds sufficiently narrow that all powers of σ_V/V₀ beyond the first can be neglected, the expressions for ⟨R⟩ and σ_R reduce to the expressions of standard error propagation theory (EPT)

⟨R⟩ ≈ R₀ = V₀² sin 2θ₀/g   (7.2.21)

σ_R ≈ 2V₀σ_V sin 2θ₀/g   (7.2.22)

σ_R/R₀ ≈ 2σ_V/V₀   (7.2.23)

from (7.2.17).

5. From the identity N(a, b²) = a + bN(0, 1) proven in Chapter 1, it follows that cN(a, b²) = c[a + bN(0, 1)] = ca + cbN(0, 1) = N(ca, c²b²).
Fig. 7.1 Top panel: histogram of 50 000 samples of speed V drawn from a Gaussian RNG V = N(100, 15²) and sorted into 50 bins (bin width ΔW = 2.4). Bottom panel: corresponding histogram of range R = cV² (bin width ΔW = 48.5) calculated for launch angle θ₀ = 50°. Proportionality constant c = sin(2θ₀)/g = 0.1005; kinematic range R₀ = cV₀² = 1005 m; standard deviation σ_R = 303.2 m. Solid curves trace the exact probability density functions; the dashed curve is the Gaussian approximation N(R₀, σ_R²).
Table 7.1

                                Sample              Theory
Mean (m) ⟨R⟩                    1028                1028
Mode (m) R_m                    956                 960
Kinematic range (m) R₀          (not a statistic)   1005
Standard deviation (m) σ_R      302.7               303.2
Skewness Sk_R                   0.422               0.446
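The sample column of a table like Table 7.1 can be regenerated with a few lines of simulation; the sketch below (seed and sample size are arbitrary) compares sample statistics of R = cV² with the theoretical values from (7.2.15), (7.2.18), and (7.2.19):

```python
import math, random

V0, sigV, theta0, g = 100.0, 15.0, math.radians(50), 9.8
c = math.sin(2 * theta0) / g

rng = random.Random(0)
r = [c * rng.gauss(V0, sigV)**2 for _ in range(50000)]   # ranges R = c V^2

n = len(r)
mean = sum(r) / n
sd = math.sqrt(sum((x - mean)**2 for x in r) / n)
skew = sum(((x - mean) / sd)**3 for x in r) / n

# Theoretical values with rho = sigV/V0:
rho = sigV / V0
mean_th = c * (V0**2 + sigV**2)                         # Eq. (7.2.15)
sd_th = 2 * c * V0 * sigV * math.sqrt(1 + 0.5 * rho**2)  # Eq. (7.2.18)
skew_th = rho * (3 + rho**2) / (1 + 0.5 * rho**2)**1.5   # Eq. (7.2.19)
print(round(mean), round(mean_th), round(sd, 1), round(sd_th, 1),
      round(skew, 3), round(skew_th, 3))
```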
For the case to be considered here, in which σ_V/V₀ < 1, we can approximate the preceding mean and standard deviation by the simpler expressions

⟨Y⟩ ≈ V₀²   (7.2.24)

σ_Y ≈ 2V₀σ_V.   (7.2.25)

The pdf of the range R = cY would then simply be a Gaussian of mean cV₀² and standard deviation 2cV₀σ_V, with c = sin 2θ₀/g as before, were it not for the fact that r must be non-negative. There arises, therefore, the matter of proper normalization so that

∫₀^∞ p_R(r) dr = 1.   (7.2.26)

From the integral

∫₀^∞ e^(−(x−a)²/2b²) dx = √(π/2) b [1 + erf(a/√2 b)],   (7.2.27)

the appropriately normalized truncated Gaussian pdf is

p_X(x) = √(2/π) e^(−(x−a)²/2b²) / {b [1 + erf(a/√2 b)]} I_(0,∞)(x),   (7.2.28)

where I_(0,∞)(x) is the interval function introduced in Chapter 5. In the limit (a/b) → ∞, the error function in the denominator approaches 1, and (7.2.28) reduces to the familiar Gaussian pdf. Applied to (7.2.25), we arrive at the pdf of the range

p_R(r) = √(2/π) e^(−(r−cV₀²)²/8c²V₀²σ_V²) / {2cV₀σ_V [1 + erf(V₀/(2√2 σ_V))]} I_(0,∞)(r).   (7.2.29)
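That (7.2.29) integrates to unity over (0, ∞) is worth a direct numerical check, since the erf factor supplies exactly the mass removed by truncation. A sketch using the Figure 7.1 parameters (the quadrature settings are arbitrary):

```python
import math

V0, sigV, c = 100.0, 15.0, 0.1005
a, b = c * V0**2, 2 * c * V0 * sigV    # mean and SD of the untruncated Gaussian

def p_R(r):
    """Truncated-Gaussian range pdf, Eq. (7.2.29)."""
    norm = b * (1 + math.erf(a / (math.sqrt(2) * b)))
    return math.sqrt(2 / math.pi) * math.exp(-(r - a)**2 / (2 * b**2)) / norm

# trapezoidal integration over (0, a + 10 b), far into the upper tail
N, hi = 200000, a + 10 * b
h = hi / N
total = h * (sum(p_R(i * h) for i in range(1, N)) + 0.5 * (p_R(0) + p_R(hi)))
print(round(total, 6))   # should be 1 to within quadrature error
```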
Table 7.2

                                Sample    Theory
Mean (m) ⟨R⟩ = R₀               1003      1005
Standard deviation (m) σ_R      301.48    301.47
Skewness Sk_R                   0.013     0
The corresponding pdf of the speed V = √Y follows from the change of variable

p_V(v) = 2v p_Y(v²),   (7.2.30)

which yields

p_V(v) = √(2/π) v e^(−(v²−V₀²)²/8V₀²σ_V²) / {V₀σ_V [1 + erf(V₀/(2√2 σ_V))]} I_(0,∞)(v).   (7.2.31)
The upper and lower panels of Figure 7.2 respectively show histograms of the range R = N(cV₀², 4c²V₀²σ_V²) and speed V = √(N(V₀², 4V₀²σ_V²)) corresponding to the hypothesis that the square of the projectile launch speed is normally distributed; the parameters of the distribution are, as before, V₀ = 100 m/s, σ_V = 15 m/s, and θ₀ = 50°. The histograms were again compiled from samples of 50 000 trials with a Gaussian RNG N(V₀², (2V₀σ_V)²) sorted into 50 bins. In both panels the theoretical pdfs are superposed and fit the envelopes of the histograms like a glove. Also shown for comparison in the lower panel is the pdf of a non-truncated Gaussian distribution centered on V₀ with standard deviation σ_V. This Gaussian function, although unskewed, approximates the center of the histogram reasonably well for the parameter ratio σ_V/V₀ = 0.15.

Because the distribution of V is skewed, the mean and modal speeds differ. Calculating the modal speed V_m by the same procedure used to calculate the modal range in (7.2.20) leads to the expression
Fig. 7.2 Top panel: histogram of range R = cN(100², 3000²) with theoretical Gaussian density (solid). Bottom panel: histogram of corresponding speed V = √(N(100², 3000²)) with exact density function (solid) and Gaussian approximation (dashed). The kinematic range and sample size are the same as for Figure 7.1.
V_m = V₀ {½[1 + √(1 + 8(σ_V/V₀)²)]}^(1/2) ≈ V₀[1 + (σ_V/V₀)²].   (7.2.32)
Table 7.3. Sample and theoretical statistics of the speed distribution for parameters V₀ = 100 m/s, σ_V = 15 m/s

                                Sample    Theory
Mean (m/s) ⟨V⟩                  97.4      98.7
Mode (m/s) V_m                  101.5     102.2
Standard deviation (m/s)        15.7      15.7
Skewness Sk_V                   −0.543    −0.549
Fig. 7.3 Schematic diagram of the uncertainty region of size 2d about the range R₀ within which a projectile launched at variable speed V and sharp angle θ₀ may land.
projectile must land, as illustrated in Figure 7.3. The probability that the projectile
lands within specified tolerances is calculable by the chain of steps
R R
d 0:95, 7:2:33
Prd R R0 d PrjR R0 j d Pr
R R
in which the quantity ZR (R R)/R is a standard normal variate, i.e. ZR N(0,1).
The requirement in (7.2.33) is then met numerically by d/R 1.96 2, and
substitution of R 2cV 20 V =V 0 2R0 V =V 0 leads to the inequality
V =V 0 d=4R0 :
401
7:2:34
Suppose, for example, the projectile is a small ball shot from a spring-loaded
launcher to hit a 10 cm diameter pie plate lying on the floor a distance 3 m away.
Then d 5 cm, R0 300 cm, and from (7.2.34) one must have V/V0 1/24 0.042.
By contrast, if the projectile were a ground-to-ground missile intended to hit a 5 m
wide target a distance 1 km away, (7.2.34) would require V/V0 1/800 0.00125.
Inequality (7.2.34) helps make clear why a ballistic missile is a guided missile, i.e.
requires a guidance system.
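Condition (7.2.34) reduces each targeting example to one line of arithmetic; a sketch:

```python
def max_speed_spread(d, R0):
    """Largest tolerable sigma_V/V0 for 95% single-shot accuracy, Eq. (7.2.34)."""
    return d / (4 * R0)

print(max_speed_spread(d=0.05, R0=3.0))      # pie plate: 1/240 ~ 0.0042
print(max_speed_spread(d=5.0, R0=1000.0))    # missile: 1/800 = 0.00125
```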
Alternatively, for a given precision σ_V/V₀ we can ask the following question: how many projectiles must be launched in order to be 95% confident that the target is hit?

To answer this question, we utilize the previously demonstrated result that the mean Z̄_R of n samples of a normal random variable Z_R = N(0,1) is distributed normally with mean 0 and standard deviation n^(−1/2), i.e. Z̄_R = N(0, n⁻¹). The desired condition

Pr(|Z̄_R| ≤ d/(σ_R/√n)) ≥ 0.95   (7.2.35)

then leads to the requirement that √n d/σ_R ≥ 1.96 ≈ 2 or, equivalently,

n ≥ 16(R₀/d)²(σ_V/V₀)².   (7.2.36)

For the precision σ_V/V₀ = 0.15 used in the examples illustrated in Figures 7.1 and 7.2, the minimum number of strikes needed to assure 95% confidence of landing within 5 m of a target 1 km away is n = 14 400. Improving the precision by a factor of 10 would reduce the number of trials by a factor of 100.
We take as our null hypothesis H₀ the statement that the square of the speed, and therefore the range of the projectile, is distributed normally. One way to proceed is to do a chi-square analysis, which, as discussed previously, entails sorting measurements of the range among k = 1, 2, …, n classes of specified width and calculating

χ²_obs = Σ_(k=1)^n (O_k − E_k)²/E_k,   (7.3.1)

where O_k is the observed frequency of elements of the kth class and E_k is the corresponding theoretically expected frequency. The sum in (7.3.1) approaches a χ²_d distribution for d = n − 1 degrees of freedom (dof) if the number of elements in each bin is reasonably large, a criterion usually taken to mean greater than 5. From chi-square tables or use of computational software one then ascertains a P-value, i.e. the probability Pr(χ²_d ≥ χ²_obs).
Although widely used, the χ² test may not be incisive enough to distinguish between a normal distribution and a distribution that closely approximates a normal distribution over much of the range centered on the mean. In that case, which is the case we are faced with, an alternative and possibly better approach is to examine a statistic that is sensitive to some critical symmetry that distinguishes the two distributions. If H₀ is true, then the distribution should have a vanishing skewness. The skewness of the alternative distribution considered here is given by (7.2.19) and is approximately 3σ_V/V₀.

Consider the random variable W = Z³ where Z ≡ (X − ⟨X⟩)/σ_X = N(0,1). By symmetry (i.e. odd parity), as well as by direct calculation, we know that ⟨W⟩ = 0. To use W in a statistical test, we need also to determine its variance, which is given by

var W = ⟨W²⟩ − ⟨W⟩² = ⟨Z⁶⟩ − ⟨Z³⟩² = ⟨Z⁶⟩ = 15,   (7.3.2)

from which follows the standard deviation

σ_W = √15 ≈ 3.87.   (7.3.3)
The pdf of W, obtained by transforming the standard normal density of Z, is

p_W(w) = (1/(3√(2π))) |w|^(−2/3) e^(−|w|^(2/3)/2).   (7.3.5)

Requiring that the predicted mean skewness 3σ_V/V₀ exceed twice the standard error σ_W/√n of the sample mean of W leads to the sample-size condition

n ≥ 20/[3(σ_V/V₀)²].   (7.3.7)

In the cases we examined previously, where σ_V/V₀ = 0.15, Eq. (7.3.7) yields n ≈ 296.
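Both the variance ⟨Z⁶⟩ = 15 in (7.3.2) and the sample-size estimate (7.3.7) are easily checked; a sketch (sample size and seed are arbitrary):

```python
import random

rng = random.Random(0)
N = 200000
w = [rng.gauss(0, 1)**3 for _ in range(N)]       # W = Z^3
mean_w = sum(w) / N
var_w = sum((x - mean_w)**2 for x in w) / N
print(round(var_w, 1))         # close to <Z^6> = 15

rho = 0.15                     # sigma_V / V0
n_needed = 20 / (3 * rho**2)   # Eq. (7.3.7)
print(round(n_needed))         # ~296
```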
With a little imaginative thinking, however, we can do better than (7.3.7) by creating another distribution (a binomial distribution) with a variance smaller than that of (7.3.4). Consider a sample of n independent variates Z_i = (X_i − ⟨X⟩)/σ_X, where (i = 1…n), and assign B_i(Z) = +1 for Z_i > 0 and B_i(Z) = −1 for Z_i < 0. Since Z is a continuous random variable, the probability is 0 that it assumes precisely the value 0, and indeed this was found to be the case in the empirical study described in the next section. B_i(Z) follows a binomial distribution with mean p − q and variance 4pq, where p is the probability Pr(Z_i ≥ 0) and q = 1 − p is the probability Pr(Z_i < 0). Thus, by the CLT the mean B̄_n = (1/n) Σ_(i=1)^n B_i is distributed approximately as a variate N(p − q, 4pq/n) for sufficiently large n. [Note that, statistically, this test is identical to the coin-toss game of the preceding chapter (Sections 6.8, 6.14).] If the hypothesis H₀ is true, then p = q, ⟨B̄_n⟩ = 0, and σ²_(B̄_n) = 1/n.

The probability requirement

Pr(|B̄_n/σ_(B̄_n)| ≥ 1.96) ≤ 0.05,   (7.3.8)

applied to the alternative hypothesis, then leads to the condition

n ≥ 4/(σ_V/V₀)².   (7.3.9)

For σ_V/V₀ = 0.15, (7.3.9) yields n ≈ 178, a reduction in sample size of about 40% compared to (7.3.7).
These requirements may be compared with the bound provided by the general Chebyshev (Markov) inequality

Pr(f(X) ≥ k) ≤ ⟨f(X)⟩/k,   (7.3.10)

where f(X) is a non-negative function of a random variable X with domain the real line and k > 0, which permits estimation of sample size irrespective of the exact distribution of X. The Weak Law of Large Numbers, derivable from (7.3.10), takes the form6

Pr(|X̄ − ⟨X⟩| ≥ ε) ≤ δ for n ≥ σ_X²/(δε²),   (7.3.11)

with ε > 0 and 0 < δ < 1 any two specified numbers within their respective ranges. Applying this relation to X = B̄_n with δ = 0.05, ε = 0.15, and σ_B² = 1 leads to the inequality n ≥ 889. Although consistent with (7.3.8) and (7.3.9), the Weak Law can be highly inefficient.
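For the working precision σ_V/V₀ = 0.15 the three sample-size estimates, from the skewness test (7.3.7), the sign test (7.3.9), and the Weak Law bound (7.3.11), can be compared directly; a sketch:

```python
rho = 0.15              # sigma_V / V0
eps, delta = 0.15, 0.05

n_skew = 20 / (3 * rho**2)        # Eq. (7.3.7): skewness test
n_sign = 4 / rho**2               # Eq. (7.3.9): sign (binomial) test
n_weak = 1.0 / (delta * eps**2)   # Eq. (7.3.11) with sigma_B^2 = 1
print(round(n_skew), round(n_sign), round(n_weak))
```

The ordering 178 < 296 < 889 quantifies the text's point: a test tailored to the distinguishing symmetry beats a generic moment test, and both beat the distribution-free Chebyshev bound.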
For each of the three data sets7 one computes the sample means

X̄_k = (1/n_k) Σ_(i=1)^(n_k) X_(k,i)   (k = 1, 2, 3),   (7.4.1)

6. A. M. Mood, F. A. Graybill, and D. C. Boes, Introduction to the Theory of Statistics, 3rd Edition (McGraw-Hill, New York, 1974) 71, 232–233.
7. Data are available at http://www.aw-bc.com/info/triola/tes09_02_eoc.pdf
Table 7.4 Home run distances (ft)

McGwire (1998): 360, 370, 380, 360, 425, 370, 450, 350, 510, 430, 369, 460, 430, 341, 370, 350, 480, 450, 450, 390, 385, 430, 527, 390, 430, 452, 510, 410, 420, 380, 430, 461, 420, 500, 420, 340, 550, 388, 430, 380, 450, 380, 460, 478, 423, 470, 470, 470, 400, 410, 420, 410, 440, 398, 430, 440, 440, 390, 360, 400, 409, 458, 377, 410, 420, 410, 390, 385, 380, 370

Sosa (1998): 371, 350, 420, 460, 350, 420, 390, 400, 400, 380, 388, 440, 480, 480, 430, 400, 410, 364, 380, 414, 434, 420, 430, 415, 430, 400, 482, 344, 430, 410, 430, 450, 370, 364, 410, 434, 370, 380, 440, 420, 370, 420, 370, 370, 380, 365, 360, 400, 420, 410, 366, 420, 368, 405, 440, 380, 500, 350, 430, 433, 410, 340, 380, 420, 433, 390

Bonds (2001): 420, 417, 370, 420, 415, 436, 410, 450, 320, 360, 410, 380, 488, 361, 442, 404, 440, 400, 430, 320, 375, 430, 394, 385, 410, 360, 410, 430, 370, 415, 410, 390, 410, 400, 380, 440, 380, 411, 417, 420, 390, 375, 400, 375, 365, 420, 391, 420, 375, 405, 400, 360, 410, 416, 410, 347, 430, 435, 440, 380, 440, 420, 380, 350, 420, 435, 430, 410, 410, 429, 396, 420, 454
The sample variances of the means,

S²_(X̄_k) = S²_(X_k)/n_k = [1/(n_k(n_k − 1))] Σ_(i=1)^(n_k) (X_(k,i) − X̄_k)²   (k = 1, 2, 3),   (7.4.2)

and the unbiased sample third central moments,

M³_k = [n_k/((n_k − 1)(n_k − 2))] Σ_(i=1)^(n_k) (X_(k,i) − X̄_k)³,   (7.4.3)

were computed with the indicated normalizations, which are not the factors 1/n_k one might
Table 7.5

Player     Sample mean (ft)   Sample SD s_(X_k) (ft)   Sample skewness   Sample size n_k
McGwire    418.5 ± 5.4        45.5                     0.572             70
Sosa       404.8 ± 4.7        38.3                     0.293             66
Bonds      403.7 ± 4.0        34.1                     0.296             73
have expected and which in fact result from a maximum-likelihood estimation. Rather, these are the values that lead to the unbiased estimations

⟨S²_X⟩ = ⟨(x − μ_X)²⟩,    ⟨M³_X⟩ = ⟨(x − μ_X)³⟩.   (7.4.4)

The origin of bias is the appearance in the sums of the sample mean X̄, which is a random variable, and not the population mean μ_X, which is a theoretical parameter. Derivation of the estimator for the third moment is given in an appendix.
If we make the reasonable assumption that the projectile range is normally
distributed, how probable is it that these three sample means could have arisen from
random sampling of the same normal population? Suppose variates X1 and X2 are
distributed normally with means μ1 and μ2 and with the same population variance σ²_X.
Random samples of size n1 and n2 are taken from these two populations, and we
denote the sample means and sample variances by X̄1, X̄2, S²1, S²2. Then the sample
means are also normal variates

X̄1 = N(μ1, σ²_X/n1)        X̄2 = N(μ2, σ²_X/n2),    (7.4.5)

as is their difference

X̄2 − X̄1 = N(μ2 − μ1, σ²_X(1/n1 + 1/n2)),    (7.4.6)

so that

Z = [(X̄2 − X̄1) − (μ2 − μ1)] / [σ_X √(1/n1 + 1/n2)]    (7.4.7)

is a standard normal variate. Moreover, the scaled sum of sample variances is a chi-square variate

(n1 S²1 + n2 S²2)/σ²_X = χ²(n1 − 1) + χ²(n2 − 1) = χ²(n1 + n2 − 2).    (7.4.8)
Table 7.6
Player pair       dof d = n1 + n2 − 2   T value   Pr(T ≥ t_obs)   Z value   Pr(Z ≥ z_obs)
McGwire–Sosa      134                   1.890     0.030           1.914     0.028
McGwire–Bonds     141                   2.214     0.014           2.216     0.013
Sosa–Bonds        137                   0.192     0.424           0.192     0.424
The comparison employs the Student t statistic

t = (X̄1 − X̄2) / √{[(n1 s²1 + n2 s²2)/(n1 + n2 − 2)](1/n1 + 1/n2)}    (7.4.9)

and a standard normal variate Z = (X̄ − Ȳ)/√(s²_X/n1 + s²_Y/n2) applicable to large samples (which,
to a good approximation, is what we have). Columns 4 and 6 show that the normal
distribution leads to slightly lower probabilities than the Student t distribution in
confirmation of the statement made in Chapter 1 based on plots of these two pdfs.
Thus, while it is probable that Sosa and Bonds are equivalent hitters, McGwire is
clearly in a different class. If the data of Table 7.4 are representative, then the
hypothesis that McGwire's mean home run distance is statistically equivalent to
those of the other two hitters could be rejected with a probability better than 97%.
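The entries of Table 7.6 can be recomputed from the summary statistics of Table 7.5 alone. A minimal standard-library sketch (using the pooled two-sample t of (7.4.9), written here with (n − 1)-weighted sample variances, and the one-sided normal tail for the Z column; an exact Student t tail probability would require a table or an external library):

```python
import math

def two_sample(m1, s1, n1, m2, s2, n2):
    """Pooled t statistic in the spirit of Eq. (7.4.9) and the large-sample normal Z."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    z = (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)
    p_z = 0.5 * math.erfc(z / math.sqrt(2))     # one-sided tail Pr(Z >= z)
    return t, z, p_z

# McGwire vs Sosa summary statistics from Table 7.5
t, z, p = two_sample(418.5, 45.5, 70, 404.8, 38.3, 66)
print(round(t, 2), round(z, 2), round(p, 3))   # close to Table 7.6: 1.89, 1.91, 0.028
```

Small discrepancies with the table arise only from rounding of the summary statistics.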
My comparative analysis of home run distances was originally completed in
2003 and submitted to the American Journal of Physics. Seven years later, in January
2010, news services reported:
Mark McGwire admitted Jan. 11 that he used steroids on and off for nearly a decade, including
during the 1998 season when he broke the then single-season home run record.8
8 ESPN.com news services, "McGwire apologizes to La Russa, Selig" (January 12, 2010, 2:01 PM ET), http://sports.espn.go.com/mlb/news/story?id=4816607
Table 7.7
Player     χ²_obs   Pr(χ²(d = 4) ≥ χ²_obs)
McGwire    2.77     59.8%
Sosa       3.51     47.7%
Bonds      5.52     23.8%
Well . . . not everybody. Although correlation is not proof of cause and effect,
statistical analysis of sports-related projectile ranges such as distances of home
runs, shot puts, golf drives, long jumps, and the like, may nevertheless help identify
instances where superior athletic performance was achieved through the aid of
performance-enhancing drugs. Of course this will become less helpful if nearly every
participant in some popular projectile sport is on drugs.
There remains a loose end to tie: use of the Student t test presumed that the home
run distances were distributed normally. To test the normality of the random variable
(X_k − X̄_k)/S_{Xk}, where k = 1, 2, 3 refers to McGwire, Sosa, and Bonds respectively,
a chi-square analysis was made in which the three samples of size n_k were grouped in
N = 7 classes of width equal to S_{Xk}. The number of degrees of freedom of the statistic

χ² = Σ_{j=1}^{N} (O_j − E_j)²/E_j,

in which O_j is the observed frequency and E_j is the theoretical frequency

E_j = [n/(√(2π) S_X)] ∫_{X_{j−1}}^{X_j} e^{−(x − X̄)²/2S²_X} dx,   X_j = X̄ + (j − N/2) S_X,    (7.4.11)

is d = N − 1 − 2, and therefore d = 4. The results in Table 7.7 provide no reason to reject the
hypothesis that the range of the home runs was distributed normally.
An examination of the skewness of each hitter's home run distribution using the
estimator in (7.4.3) for the third moment about the mean and (7.4.2) for the variance
leads to the results in Table 7.8, which are again consistent with the zero skewness of
a normal distribution.
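The binning and expected-frequency construction just described can be sketched as follows (a minimal stand-in on synthetic normal data with invented mean 400 and SD 40; the class edges X̄ + (j − N/2)S follow the seven-class scheme above, with the open tails folded into the end classes):

```python
import math, random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

random.seed(1)
n, N = 70, 7
sample = [random.gauss(400.0, 40.0) for _ in range(n)]   # invented normal data
xbar = sum(sample) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))

# N classes of width s with edges at xbar + (j - N/2)*s; tails folded into end classes
lo = xbar - (N / 2) * s
O = [0] * N
for x in sample:
    j = int(min(max((x - lo) // s, 0), N - 1))
    O[j] += 1
E = [n * (phi(j + 1 - N / 2) - phi(j - N / 2)) for j in range(N)]
E[0] += n * phi(-N / 2)
E[-1] += n * (1.0 - phi(N / 2))

chi2 = sum((o - e) ** 2 / e for o, e in zip(O, E))
print(round(chi2, 2), "with d =", N - 1 - 2, "degrees of freedom")
```

With the actual distance data of Table 7.4 in place of the synthetic sample, this procedure yields the χ²_obs values of Table 7.7.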
Table 7.8
Player     Sk_obs   Pr(Sk ≥ |Sk_obs|)
McGwire    0.572    10.4%
Sosa       0.293    25.9%
Bonds      0.296    25.7%
Re = ρvℓ/η,    (7.5.1)

where ρ is the fluid density, v the flow speed, and ℓ a characteristic length of the
problem. The viscosity η is defined by

σ_xy = −η dv_x/dy,    (7.5.2)

relating the shear stress (i.e. frictional force per unit area) σ_xy at the interface between
contiguous layers of fluid moving horizontally, let us say along the x axis, to the rate
dv_x/dy at which the velocity v_x decreases in a direction (along the y axis) transverse to
the flow. A fluid that obeys (7.5.2) is a Newtonian fluid. Air and water are good
Newtonian fluids; treacle (molasses) is not. From the definition (7.5.1) and the
dimension of η inferred from (7.5.2), it is not difficult to show that the Reynolds number
is indeed a dimensionless parameter
[Re] = [ρ][v][ℓ]/[η] = (M L⁻³)(L T⁻¹)(L)/(M L⁻¹ T⁻¹) = 1,

with [η] = [σ_xy]/[dv_x/dy] = (M L⁻¹ T⁻²)/(T⁻¹) = M L⁻¹ T⁻¹,
where [M], [L], [T] commonly signify the fundamental dimensions of mass, length, and time.
For fluids at very low Reynolds number, the drag force is linearly proportional to
the relative speed, a situation referred to as creeping flow or Stokes flow. Viscosity
dominates inertia in this regime, and the drag force is large. Small particles in
air (such as nascent rain droplets or the tiny oil droplets in Millikan's renowned
experiment to determine the charge of an electron) and aquatic micro-organisms
ordinarily experience creeping flow. Baseballs, motor cars, and aircraft experience
Newtonian flow.
For fluids at high Reynolds number but nevertheless at subsonic speeds, the drag
on an object is proportional to the square of the relative speed and usually written in
the form
F_d = ½ ρ v² C_d A,    (7.5.3)

in which C_d is the (dimensionless) drag coefficient and A is the cross-sectional area
presented to the flow. In Bernoulli's principle

p + ½ ρ v² = constant,    (7.5.4)
the dynamic pressure is interpretable as the gauge pressure (i.e. pressure above the
ambient static value) exerted by a moving fluid at a point on a stationary surface
where the fluid is brought to rest.
In general, the drag coefficient is not constant, but varies with Reynolds number
in a manner dependent on the shape of the object. For a sphere, the drag coefficient
has a very interesting behavior, remaining approximately constant at Cd ~ 0.47 over
the wide range 10³ < Re < 10⁵ and then precipitously dropping to Cd ~ 0.1 over a
narrow range around (2–3) × 10⁵.⁹ The sudden decrease in drag is due to a kind of phase
transition in the flow pattern about the sphere.
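To appreciate how consequential quadratic drag is for a batted ball, Eq. (7.5.3) can be evaluated directly (a quick sketch using the ball and air parameters adopted later in this section: Cd = 0.2 above the transition, a launch speed near 110 mph):

```python
import math

rho = 1.204                   # air density (kg/m^3) at 20 C, 1 atm
Cd = 0.2                      # drag coefficient above the transition
r = 0.0375                    # baseball radius (m)
A = math.pi * r ** 2          # cross-sectional area (m^2)
m, g = 0.145, 9.81            # ball mass (kg) and gravitational acceleration (m/s^2)
v = 49.2                      # launch speed (m/s), ~110 mph

Fd = 0.5 * rho * v ** 2 * Cd * A    # drag force, Eq. (7.5.3)
print(round(Fd, 2), "N;", round(Fd / (m * g), 2), "of the ball's weight")
# about 1.29 N, roughly 0.9 of the weight
```

At launch the drag force is already about 90% of the ball's weight, which is why the vacuum trajectory so badly overestimates the real range.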
In the ideal (and totally unrealistic) case of an inviscid (frictionless) fluid passing in
steady flow around a smooth sphere, the particles of fluid follow precise trajectories
or streamlines in well-defined layers. This steady laminar flow gives rise to zero drag;
a sphere placed at rest in such a flow would remain at rest. Prior to a more complete
understanding of fluid dynamics in the early years of the twentieth century, which
incorporated the effects of viscosity in a thin boundary layer over the surface of an
object, this strikingly false prediction was known as d'Alembert's paradox. The
paradox results from the front-to-back symmetry of the streamlines. By Bernoulli's
principle (7.5.4), the faster a fluid moves (in constant gravitational potential) over an
9 P. Wegener, What Makes Airplanes Fly? (Springer, New York, 1991) 90–94.
object, the lower is the pressure it exerts on the object. The flow speed is greatest and
the pressure is least over the circumference of the midsection (defining the boundary
between the front and rear hemispheres) normal to the flow. In ideal steady inviscid
flow, the speed of a fluid particle at some location in front of the sphere is exactly the
same as at its mirror-reflected location behind the sphere. Collectively, therefore,
the pressure distribution in the hemisphere facing the flow is the same as in the rear
hemisphere, and no net momentum is transferred to the sphere by the fluid. Hence,
the sphere remains at rest.
More realistically, however, at Reynolds numbers below the transition, air flow
around a sphere remains more or less laminar until it reaches the boundary of the
midsection, after which it separates from the sphere and degrades into turbulent eddies
behind the sphere. Now the front-to-back pressure distribution is asymmetric, and the
pressure over the rear hemisphere remains lower than that over the forward hemisphere.
Looked at from a reference frame in which the air is at rest and the sphere is moving,
the sphere experiences a backward push or drag by the air. At Reynolds numbers above
the transition, however, the layer of air closely enveloping the sphere (i.e. the so-called
boundary layer) becomes turbulent and air flow remains attached to the sphere over a
greater portion of the rear surface. The pressure behind the sphere rises to a greater
extent than was the case when flow separated at the midsection, and the pressure
differential between the two hemispheres is consequently reduced, leading to lower drag.
Generalizing (7.5.3) to two-dimensional motion along direction v within a plane
(with the x axis horizontal and the y axis vertical, as shown in Figure 7.4), we can write
Newton's second law of motion
Fig. 7.4 Two-dimensional projectile motion in the presence of air drag. Drag deceleration aD
opposes the velocity v; lift acceleration aL is normal to the velocity; the acceleration of gravity g
acts vertically downward. Solution of the two-dimensional equations of motion (without lift)
leads to a trajectory that has a shorter ascent than descent time. As a consequence, the
projectile covers a greater horizontal distance in the ascent than in the descent.
dv_x/dt = −(ρ C_d A/2m)(v_x² + v_y²)^{1/2} v_x

dv_y/dt = −(ρ C_d A/2m)(v_x² + v_y²)^{1/2} v_y − g    (7.5.5)

or, in vector form, m dv/dt = −½ ρ C_d A |v| v − m g ŷ,
for a non-rotating sphere of mass m in freefall through air of density ρ with gravitational acceleration g. Problems in fluid dynamics (or specifically aerodynamics) are
almost always facilitated by transforming dynamical laws into relations of dimensionless quantities. From the material parameters that occur in (7.5.5) the following
characteristic velocity, time, and displacement parameters can be constructed. (Later
it will be useful to construct another characteristic velocity associated with lift.)
Velocity:        v_d = [2mg/(ρ C_d A)]^{1/2}    (7.5.6)

Time:            t_d = [2m/(g ρ C_d A)]^{1/2}    (7.5.7)

Displacement:    λ_d = v_d t_d = 2m/(ρ C_d A)    (7.5.8)
By rescaling velocity and time and, for later use, displacement in the following way

V_x = v_x/v_d    V_y = v_y/v_d    T = t/t_d    S_x = x/λ_d    S_y = y/λ_d    (7.5.9)

one obtains the dimensionless equations of motion

dV_x/dT = −(V_x² + V_y²)^{1/2} V_x

dV_y/dT = −(V_x² + V_y²)^{1/2} V_y − 1.    (7.5.10)
It is worth noting that such rescaling, when it can be done, is of wide utility in
physics, not only because it simplifies the appearance of equations to be solved, but
because it permits experimental results from different physical systems to be fit to a
single mathematical expression. This, in fact, is the basis for simulating forces on
large-scale aerodynamic or hydrodynamic structures by small-scale models testable
in wind tunnels or water basins. One of the earliest examples of such a procedure in
the study of fluids is the Law of Corresponding States (LCS) pertaining to application of van der Waals equation to the thermodynamics of real gases. The LCS, which
is actually more general than van der Waals equation, holds that all gases, when
compared at the same reduced temperature and reduced pressure (i.e. temperature
and pressure scaled by suitable parameters) have the same compressibility and
deviate from ideal gas behavior to about the same extent.
Solution of the two-dimensional (2D) projectile problem in vacuum is quite simple
because the equations of horizontal and vertical motion are uncoupled. The resulting two
one-dimensional (1D) equations of motion are immediately integrable to yield expressions
v_x(t) = v_x0 = constant        v_y(t) = v_y0 − g t
s_x(t) = v_x0 t                 s_y(t) = v_y0 t − ½ g t²        (no drag)    (7.5.11)
for horizontal and vertical velocity and displacement. It is the coupling of horizontal
and vertical components because drag depends on relative speed of the air flow
that makes (7.5.10) difficult to solve analytically. Nevertheless, an analytical solution
to (7.5.10) is achievable by first transforming the set of equations to polar form
V = (V_x² + V_y²)^{1/2}        V_x = V cos θ
tan θ = V_y/V_x                V_y = V sin θ    (7.5.12)
and then algebraically manipulating the resulting pair of equations to exploit the
trigonometric identity cos2 sin2 1 and thereby to obtain the transformed pair
dV/dT = −(V² + sin θ)

V dθ/dT = −cos θ    (7.5.13)
where is the angle relative to the (horizontal) ground. Next, by using the second
relation in (7.5.13) to express the differential dT in terms of d, one can combine the
two equations into a single differential equation as a function of angle
d(V cos θ)/dθ − V³ = 0.    (7.5.14)
And last, the variable change U = 1/(V cos θ) permits one to separate variables and
integrate separately over U and θ. The final result is the somewhat complicated but
useful closed-form expression

V(θ) = {cos²θ [C₀ − ln((1 + sin θ)/cos θ)] − sin θ}^{−1/2},    (7.5.15)

where C₀ incorporates the initial conditions of speed V₀ and angle θ₀

C₀ = (1 + V₀² sin θ₀)/(V₀² cos²θ₀) + ln((1 + sin θ₀)/cos θ₀).    (7.5.16)
The corresponding components of velocity along the x and y axes can then be
calculated from (7.5.12).
Alternatively, the original Cartesian form of Eqs. (7.5.10) is particularly convenient for an iterative numerical solution by computer and generates solutions directly
as a function of time, rather than angle. I have used a Levenberg-Marquardt
algorithm (developed by the Argonne National Laboratory), which is incorporated
in the Mathcad symbolic computational software. Both approaches lead to the same
numerical results, which we shall examine shortly. Before doing so, it is instructive to
consider the stationary or asymptotic behavior of the drag equations in Cartesian
form. Setting the time derivatives in (7.5.10) to zero, one obtains solutions for
velocity
V x T ! 0
V y T ! 1
)
)
vx 0
vy vd
7:5:17
in which the horizontal component vanishes and the magnitude of the vertical
component, referred to as the terminal speed, equals the scale factor vd in (7.5.6).
Under stationary conditions, therefore, the projectile is descending vertically downward at a constant rate vd. For a baseball, however, the ground may intervene well
before it reaches terminal velocity.
Corresponding horizontal and vertical displacements cannot be reduced to closed-form expressions but must be obtained numerically from the integrals
S_x(T) = ∫₀^T V_x(T′) dT′        S_y(T) = ∫₀^T V_y(T′) dT′.    (7.5.18)
Consequently, the range of the projectile R, i.e. the value of S_x for which S_y = 0,
must be worked out numerically or graphically for each case of interest.
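The numerical route can be sketched with a simple fixed-step fourth-order Runge-Kutta integrator for the scaled equations (7.5.10). This is my own minimal stand-in for the Mathcad computation described earlier; for the chapter's 45° launch it should reproduce the 2D values quoted later in this section:

```python
import math

def deriv(s):
    """Scaled 2D drag equations (7.5.10): s = (Vx, Vy, Sx, Sy)."""
    vx, vy, sx, sy = s
    v = math.hypot(vx, vy)
    return (-v * vx, -v * vy - 1.0, vx, vy)

def range_2d(V0, theta0, dt=1e-3):
    """Integrate by RK4 until Sy returns to zero; return (flight time T, range R)."""
    s = [V0 * math.cos(theta0), V0 * math.sin(theta0), 0.0, 0.0]
    t = 0.0
    while True:
        k1 = deriv(s)
        k2 = deriv([si + 0.5 * dt * ki for si, ki in zip(s, k1)])
        k3 = deriv([si + 0.5 * dt * ki for si, ki in zip(s, k2)])
        k4 = deriv([si + dt * ki for si, ki in zip(s, k3)])
        s_new = [si + dt * (a + 2 * b + 2 * c + d) / 6
                 for si, a, b, c, d in zip(s, k1, k2, k3, k4)]
        t += dt
        if s_new[3] < 0.0:                     # crossed Sy = 0 on the way down
            f = s[3] / (s[3] - s_new[3])       # linear interpolation to the crossing
            return t - dt + f * dt, s[2] + f * (s_new[2] - s[2])
        s = s_new

# scaled launch speed and 45-degree launch angle from the text
T, R = range_2d(49.17 / 51.71, math.radians(45))
print(round(T, 3), round(R, 3))   # the text's 2D values: 1.160, 0.553
```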
An alternative approximate measure worth examining at this point is to decouple
the equations (7.5.10) by neglecting Vy in the horizontal motion and Vx in the vertical
motion. Then, as in the case of no drag, the resulting equations can be integrated
relatively easily, leading to
V_x(T) = V_x0/(1 + V_x0 T)

V_y(T) = tan(T_H − T)   (T ≤ T_H);        V_y(T) = −tanh(T − T_H)   (T ≥ T_H)
                                                                 (drag 1D)    (7.5.19)
S_x(T) = ln(V_x0 T + 1)

S_y(T) = ln[cos(T_H − T)/cos T_H]   (T ≤ T_H);   S_y(T) = −ln[cos T_H cosh(T − T_H)]   (T ≥ T_H)

in which

T_H = tan⁻¹(V_y0) = tan⁻¹(V₀ sin θ₀)    (7.5.20)
is the time for the projectile to reach maximum height H. Prior to TH, the ball is
traveling upward, V_y > 0, and both air drag and gravity act in the same direction; after T_H
the ball is descending, V_y < 0, and drag opposes gravity:

dV_y/dT = −(1 + V_y²)   (T ≤ T_H)        dV_y/dT = −(1 − V_y²)   (T ≥ T_H)    (7.5.21)
in which the difference in sign of Vy in the two time periods must be borne in mind in
setting the appropriate limits of integration. One does not need to take account
explicitly of this transition point in the exact analytical solution to the coupled
equations (7.5.10) because the horizontal component Vx never vanishes.
To estimate what might be close to the ultimately achievable home run distance in
air, let us assume an atmosphere at about room temperature (20 °C) and 1 atm pressure,
for which the air density is approximately ρ_air = 1.204 kg/m³ and the air viscosity is
η_air = 1.85 × 10⁻⁵ kg/m·s.¹⁰ The fastest pitched baseballs have been clocked at a little over
100 miles per hour (mph), so it is perhaps not unreasonable to assume that such a ball
leaving the bat of a powerful hitter may be launched with an initial speed of about
110 mph or 49.2 m/s. The corresponding Reynolds number (7.5.1) with ball diameter
d = 0.075 m as the characteristic length is Re = 2.40 × 10⁵, a value occurring very
close to the transition region for the drag coefficient of a sphere. I will adopt,
therefore, the lower value Cd ~ 0.2 for the region of higher Reynolds numbers. Given
a baseball mass of 0.145 kg, the scale factors in (7.5.6)–(7.5.8) then become v_d = 51.71 m/s,
t_d = 5.27 s, and λ_d = 272.60 m.
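A few lines suffice to verify these scale factors and the Reynolds number (a sketch assuming the stated ball and air parameters):

```python
import math

rho, eta = 1.204, 1.85e-5    # air density (kg/m^3) and viscosity (kg/m.s)
Cd = 0.2                     # drag coefficient above the transition
m, g = 0.145, 9.81           # baseball mass (kg), gravitational acceleration (m/s^2)
d_ball = 0.075               # ball diameter (m)
A = math.pi * (d_ball / 2) ** 2   # cross-sectional area (m^2)
v0 = 49.2                    # launch speed (m/s), ~110 mph

Re = rho * v0 * d_ball / eta                  # Eq. (7.5.1)
vd = math.sqrt(2 * m * g / (rho * Cd * A))    # Eq. (7.5.6)
td = math.sqrt(2 * m / (g * rho * Cd * A))    # Eq. (7.5.7); equals vd/g
ld = vd * td                                  # Eq. (7.5.8)
print(f"Re = {Re:.3g}, vd = {vd:.2f} m/s, td = {td:.2f} s, lambda_d = {ld:.1f} m")
# Re ~ 2.4e5, vd ~ 51.71 m/s, td ~ 5.27 s, lambda_d ~ 272.6 m
```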
Figure 7.5 shows the variation with (scaled) time T of the (scaled) velocity
components Vx, Vy, and speed V for a ball hit at an initial angle of 45° to the ground,
which in vacuum would yield the greatest range for a given initial speed. The solid
and dashed black lines trace the variation in speed and velocity components derived
from the exact analytical two-dimensional (2D) solution. The traces bear out the
previously deduced asymptotic limits Vx → 0, Vy → −1, shown in the figure as
horizontal light dashed lines. Vx and Vy decrease monotonically as expected and, in
practical terms, reach 98% of their asymptotic limits after about 3.0 (for Vy) and 4.0
(for Vx) time units td. The dashed gray lines, lying just above the dashed black lines,
trace the exact 1D solutions to the decoupled 2D equations. It is perhaps surprising
how closely the 1D solutions replicate the physically more realistic coupled 2D
solutions for the specified initial conditions. The largest apparent discrepancy occurs
in the calculation of Vx for which 2D coupling leads to a more rapid decrease in time.
I have simulated projectile motions for a wide range of initial conditions and obtain
10 The MKS unit of viscosity (kg/m·s) comes directly from Eq. (7.5.2) and can be written equivalently as Pascal-second (Pa·s). The CGS counterpart is the poise: 1 P = 0.1 Pa·s.
Fig. 7.5 Speed (solid) and velocity components (dashed) of a baseball (mass 0.145 kg, radius
0.0375 m) plotted against time as obtained from solution of coupled 2D equations (dark black)
and uncoupled 1D equations (gray). Light black dashed traces show solutions in absence of air
drag. The drag coefficient is Cd = 0.2 (for Reynolds number Re = 2.4 × 10⁵). Variables
are scaled by parameters vd = 51.71 m/s, td = 5.27 s, λd = 272.60 m. Initial conditions are
v0 = 49.17 m/s, θ0 = 45°.
comparable outcomes. The light dashed lines trace the velocity components in the
absence of drag in which Vx is constant and Vy decreases linearly in time.
The trajectory of the ball, i.e. plot of vertical against horizontal displacement, is
shown in the upper trace of Figure 7.6 for the three model solutions: no drag,
uncoupled 1D and coupled 2D equations with drag. In vacuum, the path of the ball
is symmetric about the midpoint (point of greatest altitude), but in the resistive
medium of air the ball covers a greater horizontal distance in its ascent than its
descent. This gives the visual impression that the ball descends more quickly than it
rises, but this is a spurious inference as made apparent in the lower trace of Figure 7.6
in which the vertical displacements are plotted against time. Solid lines depict the
coupled 2D solution and dotted lines represent the uncoupled 1D solutions. Again
the latter shadow closely the former. The rise-time of the 2D solution to reach
maximum altitude H is about T_r = 0.55 time units, whereas the fall-time to descend
from H to ground is T_f = 1.16 − 0.55 = 0.61 units. Thus, the ball takes more time on
the way down than on the way up because the speed of the ball is greatest at launch,
as is likewise the drag, which is proportional to the square of speed.
Fig. 7.6 Upper panel: plot of trajectory: altitude against horizontal displacement. Lower
panel: plot of altitude against time. Parameters are the same as for Figure 7.5. Traces are
derived from: solution of coupled 2D equations with drag (solid), solution of uncoupled 1D
equations with drag (dotted), solution of uncoupled 1D equations without drag (dashed).
maximum height, and the distance ΔS_x = 0.55 − 0.30 = 0.25 units from the point of
maximum height to the point of impact with the ground.
Although the total time of flight T = 1.16 units is short compared to the time
required to approach terminal velocity, air drag has nevertheless made a huge impact
on the trajectory of the ball. As shown in the upper panel, drag has reduced the
theoretically achievable range by about 39% from R = 0.90 in vacuum to R = 0.55 in
air. In MKS units, obtained by multiplying the preceding numbers by the associated
scale factor λ_d = 272.6 m, the range is respectively 246.7 m in vacuum and 150.7 m in
air. Drag has also reduced the theoretically achievable height of the trajectory from
H = 0.23 (62.7 m) in vacuum to H = 0.17 (46.3 m) in air.
The 1D drag solutions provide a way to estimate the range of the ball with
reasonable accuracy. From (7.5.19) one can find the time T₀ which satisfies the
relation S_y(T₀) = 0, i.e. the time at which the projectile has returned to ground

T₀ = T_H + ln[(1 + sin T_H)/cos T_H],    (7.5.22)

whereupon it follows that the range is then

R_1D = S_x(T₀) = ln(V_x0 T₀ + 1).    (7.5.23)
For the illustrative initial conditions (v₀ = 110 mph, θ₀ = π/4), Eqs. (7.5.22) and
(7.5.23) yield T₀ = 1.223, R_1D = 0.599 in comparison with results of the 2D calculation T₀ = 1.160, R = 0.553. The 1D solution provides an upper limit of range, as is
apparent from Figure 7.6.
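Equations (7.5.20), (7.5.22), and (7.5.23) are easily evaluated; the sketch below reproduces the quoted 1D numbers and, scanning integer launch angles, shows that even the 1D estimate puts its (very flat) maximum range a few degrees below 45°:

```python
import math

def range_1d(V0, theta0):
    """1D drag estimate: flight time T0 and range R, Eqs. (7.5.20), (7.5.22), (7.5.23)."""
    TH = math.atan(V0 * math.sin(theta0))                      # Eq. (7.5.20)
    T0 = TH + math.log((1 + math.sin(TH)) / math.cos(TH))      # Eq. (7.5.22)
    return T0, math.log(V0 * math.cos(theta0) * T0 + 1)        # Eq. (7.5.23)

V0 = 49.17 / 51.71                 # scaled launch speed from the text
T0, R = range_1d(V0, math.pi / 4)
print(round(T0, 3), round(R, 3))   # close to the text's 1.223 and 0.599

# scan integer launch angles: the 1D maximum sits slightly below 45 degrees
best = max(range(30, 51), key=lambda deg: range_1d(V0, math.radians(deg))[1])
print(best)
```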
Besides leading to an asymmetric trajectory and reducing the altitude and range of
the ball, air drag also decreases the launch angle at which the longest range results.
For the specified set of air parameters and launch speed (110 mph), computer
simulation of the trajectories resulting from different launch angles led to a maximum range at θ₀ ~ 41°, rather than 45°. The results are summarized in Table 7.9,
Table 7.9
θ₀ (deg)   R_vac    R_drag   R_drag+lift
15         0.452    0.351    0.664
20         0.582    0.425    0.706
25         0.693    0.481    0.719
30         0.783    0.521    0.709
35         0.850    0.545    0.679
40         0.891    0.556    0.633
41         0.896    0.557    0.622
42         0.900    0.556    0.611
45         0.904    0.553    0.573
50         0.891    0.536    0.500
which includes, besides the conditions of vacuum and air drag, a third column of
numbers to be discussed after we take up the case of the flying ball, that is, a fly
ball that literally flies.¹¹ The longest range under each set of flight conditions (in
vacuum, with air drag, or with air drag and lift) is set in bold font.
11 For the reader unfamiliar with baseball terminology, a fly ball is a ball hit high into the air.
I. Newton, Principia Vol. 1: The Motion of Bodies (University of California Press, Berkeley, 1966) 334–336. [Motte's translation of 1729 revised by Cajori.]
Table 7.10
Jumbo Jet (Boeing 747-100) characteristics (MKS units): 74.2; 6.5; 59.6 (wing span, m); 511 (wing area, m²); 162 386 (empty mass, kg); 333 390 (maximum take-off mass, kg); 6.9; 0.031; 252.0 (cruising speed, m/s)
F_N = ρ v² bℓ sin²α,    (7.6.1)

where the subscript N refers to Newton. Thus, if FN equals W, the board flies,
provided there is a means to keep it moving forward.
provided there is a means to keep it moving forward.
To my knowledge, Newton never made such a calculation. However, I have read
or listened to explanations of powered flight in terms of such a model. The basic idea
that a heavier-than-air object can be made to rise if (by Newtons Third Law) there is
a counter-motion of the medium downward is correct, but the details are not. An
airplane does not fly because the lower surface of its wings bats the air downward.
Such a picture leads quantitatively to a lift proportional to sin²α, which, for small
angles of inclination required to avert stall, would be so small a quantity as to require
wings of impractically great length for powered flight to be achievable. Or, equivalently, for wings of practical length, the Newtonian model would require an
unacceptably high cruising speed v (in the frame of reference where the undisturbed
air is at rest). Rather, an airplane wing in virtue of its shape and orientation relative
to the air stream serves more like a huge pump than a bat, capturing air flow from
above the upper surface and directing it downward. Worked out correctly, the theory
of air flow over a long airfoil leads to a lift proportional to sin α . . . not sin²α.
An appreciation of the remarkable capacity of an airfoil to deflect air and generate
lift can come only from consideration of some pertinent numbers. Consider what is
required for steady level flight of a Boeing 747 [B747] jet airliner cruising at v = 907 km/h
with an air stream incidence of α = 5°. The relevant characteristics (in MKS units) of
the aircraft are summarized in Table 7.10 for the earliest model (Boeing 747-100).¹³
If the lift generated by the wings is to sustain the maximum weight upon take-off
W = mg = (333 400 kg)(9.81 m/s²) = 3 270 654 N,    (7.6.2)

the momentum imparted downward to the air each second must equal W,¹⁴

W = (dm/dt) v sin α,    (7.6.3)

so that the rate of mass transport of air vertically downward at speed v sin α must be

dm/dt = W/(v sin α) = (3 270 654 N)/(252 m/s × sin 5°) = 149.0 tonnes/s.    (7.6.4)
In one second, then, the wings deflect downward a column of air of height

(dm/dt)(Δt)/(ρ_air b ℓ) = (149.0 × 10³ kg/s)(1 s)/[(1.204 kg/m³)(8.6 m)(59.6 m)] = 242.1 m.    (7.6.5)
To accentuate my point: the long thin wings of the B747 are deflecting air from
within at least one hundred meters above its surface to create the reaction that lifts
the aircraft. The term deflecting actually does not conjure up the appropriate
image. A better term perhaps would be sucking . . . like a pump. It is the angle of
inclination of the wing, more than the wing shape, that is responsible for this sucking
action. Parcels of air passing over the wing would leave behind a vacuum were it not
the case that more air is drawn down and over the wing. This continues so long as
there is relative motion of the wing in air.
If we were to use the Newtonian model (7.6.1) of lift (which, as I noted previously, was not used in this way by Newton himself) arising from deflection of air
from the underside of an airfoil, the cruising speed vN required to sustain weight
W would be

v_N = [W/(ρ_air bℓ sin²α)]^{1/2} = [(3.3 × 10⁶ N)/((1.2 kg/m³)(8.6 m)(59.6 m) sin²5°)]^{1/2} = 836.6 m/s,
which greatly exceeds the speed of sound in air: v_s = 343 m/s at 20 °C. Interestingly,
supersonic aerodynamics leads to a pattern of air flow for air speeds much in excess of the
speed of sound that is similar to what Newton imagined in his study of air resistance.15
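The arithmetic of (7.6.2)–(7.6.5) and of the Newtonian estimate can be checked in a few lines (a sketch; small differences from the printed values reflect rounding of ρ and W):

```python
import math

rho = 1.204                  # air density (kg/m^3)
W = 333400 * 9.81            # maximum take-off weight (N), Eq. (7.6.2)
v = 252.0                    # cruising speed (m/s)
alpha = math.radians(5)      # air-stream incidence
b, span = 8.6, 59.6          # mean chord and wing span (m)

dm_dt = W / (v * math.sin(alpha))            # Eq. (7.6.4): mass transport rate
height = dm_dt * 1.0 / (rho * b * span)      # Eq. (7.6.5): air column per second
vN = math.sqrt(W / (rho * b * span * math.sin(alpha) ** 2))  # Newtonian-model speed

print(round(dm_dt / 1000, 1), "t/s,", round(height, 1), "m,", round(vN, 1), "m/s")
```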
14 This is an example where the time rate of change of momentum mv arises from the variation in mass (dm/dt) at constant velocity rather than acceleration (dv/dt) of a constant mass. There are subtleties to the use of such a relation, which I have overlooked now to avoid distraction.
15 T. von Karman, Aerodynamics (McGraw-Hill, New York, 1954) 122.
Besides the erroneous Newtonian explanation, I have also seen or heard misleading explanations of flight based on Bernoulli's principle as applied to a stationary
airfoil in a uniformly moving fluid. Again, the basic idea is correct, but not the
details. The argument goes as follows. Air rushes over the cambered upper surface
faster than over the flatter lower surface of the wing in order that the two flows join at
the trailing edge. By (7.5.4) the faster air stream exerts a lower pressure on the upper
wing surface and therefore the plane is pushed upward by the greater pressure on the
lower surface.
This explanation fails on several accounts. First, there is no aerodynamic principle
requiring parcels of air to time their flow so as to meet at the trailing edge if they
simultaneously arrived at the leading edge. In fact, it is an essential condition of flight
that the part of a divided air stream passing over the top surface arrive at the trailing
edge before the part passing under the lower surface. This creates a nonvanishing
circulation about the wing to be discussed shortly. Second, the primary impulse for
the wing to rise comes more from a low gauge pressure topside (near the leading edge) than
from a high gauge pressure under the wing. As I pointed out previously, it is more
accurate to think of the wing as being sucked up than pushed up. Finally, this
Bernoulli explanation has its cause and effect backward. It is the low pressure
created by deflection of the air stream downward that leads to a higher air speed over
the wing than under the wing . . . not the reverse statement.
Although Newton's laws and Bernoulli's principle (which is derived from
Newton's Second Law) are essential ingredients to understanding flight, the correct
way in which they come together to give a quantitative account of lift is through two
seemingly remote and abstract concepts: circulation and vorticity. Consider, as
before, a reference frame with the airfoil at rest and the air stream moving to the
right with a uniform upstream (i.e. initially undisturbed) velocity v0. In a nutshell, lift
arises from a bound vortex (i.e. tornado-like whirlwind) of air induced at the initial
moments of relative motion between the wing and the air. During these moments, air
moving over the top surface of the wing does not flow smoothly over the entire
surface, but separates turbulently at some point above, but close to, the trailing edge
due to friction in a thin boundary layer encompassing the wing. Airflow along
the undersurface of the wing passes around the trailing edge and up over the top
surface (upwash) to the separation region. This counter-flow generates a starting
vortex, i.e. an anticlockwise circulation of air for an initial air stream directed to the
right, which the wing sheds and the air stream carries away. As a result of angular
momentum conservation in the fluid, a clockwise-circulating bound vortex is induced
around the wing.
The (clockwise) circulatory motion of the bound vortex, superposed on the
original horizontal air stream moving to the right at velocity v0, generates a faster
air stream over the top surface and a slower air stream over the bottom surface,
which, by Bernoullis principle, produces the differential pressure leading to a lifting
force sustained as long as relative motion of the wing in air continues. Theoretically,
the lift per unit span of the airfoil is

f_l = ρ v₀ Γ,    (7.6.6)

where ρ is the density of the medium (air), v₀ is the uniform wind speed of the
undisturbed air far in front of the airfoil, and

Γ = ∮_C v · ds = ∫∫_S Ω · n dS,    Ω ≡ ∇ × v,    (7.6.7)
termed the circulation, is a line integral of the net wind velocity v over an arbitrarily
shaped planar contour C about the airfoil. For an airfoil of infinite span, the location
of the plane of the contour does not matter. For a finite airfoil, however, circulation
can vary with location and the total lift will require integrating fl over the span,
which we will do shortly when we apply these results to a baseball in the following
section.
The equivalent second expression in (7.6.7), resulting from use of Stokes theorem
of vector calculus, is an integral over an open surface bound by C (with appropriate
orientation of the outward normal to correspond to positive traversal of the contour)
of the vorticity Ω, defined as the curl of the fluid velocity. The expressions in (7.6.7)
are suggestive of Ampère's law in electromagnetism relating the current I (analogous
to Γ), magnetic induction B (analogous to v), and vector potential A.
Indeed, there is a BiotSavart law in aerodynamics by means of which the velocity
field associated with curved vortex lines can be calculated (although we shall not need
to do so in this book).
A general derivation of the KuttaJoukowski theorem can be found in advanced
aerodynamics references, but the basic ingredients can be understood from examining the lift on a long thin board of width (chord) b, such as illustrated in Figure 7.7.
Let boldfaced letters with carets x̂, ŷ, ẑ signify unit vectors along the corresponding
Cartesian axes. The span of the board is normal to the page (along the z axis).
Flowing to the right over and under the board is a steady horizontal wind of velocity
v₀ = v₀x̂, and circulating around the board in a clockwise sense is a bound vortex
moving with velocity ux̂ over the top and velocity −ux̂ over the bottom. Because
the board is thin (in principle, infinitesimally thin), we can neglect the velocity of the
vortex at the leading and trailing edges. The total velocity at any point is the vector
sum v = v₀ + u. From Eq. (7.6.7), the circulation is then
16 Irrotational flow does not mean that the fluid cannot rotate. Rather, it signifies that an object immersed in the fluid does not change its orientation relative to fixed axes as it is carried by the fluid. Illustrative of such motion would be the passenger cars on a Ferris wheel at an amusement park.
17 L.M. Milne-Thomson, Theoretical Aerodynamics (Macmillan, London, 1958) 91–92.
Fig. 7.7 Schematic diagram of air flow over a stationary airfoil with bound vortex. The far-field air stream and circulating air are largely parallel above the airfoil and anti-parallel below.
Thus, the net air speed (v₀ + u) is greater above the airfoil than below (v₀ − u), giving rise to lift
as described by the KuttaJoukowski theorem.
Γ = (v₀ + u)b − (v₀ − u)b = 2ub.    (7.6.8)
The vertical component of the force on the board due to air pressure p takes the form
of a closed surface integral
F_l = −∮ p n · ŷ dS = (p_lower − p_upper) bℓ,    (7.6.9)
in which p is the pressure on a patch of differential area dS = dx dz with outward
normal unit vector n = ±ŷ for the upper and lower surfaces, respectively. (As before,
we ignore the surfaces at the leading and trailing edges, but the force on them does
not contribute anyway because the scalar product of the outward unit normals
n = ±x̂ with ŷ vanishes.) The minus sign in (7.6.9) shows that the direction of the
pressure force on each patch is along the inward normal. By Bernoulli's formula
(7.5.4), we can replace pressure in (7.6.9) by
p = (p₀ + ½ρv₀²) − ½ρv² = constant − ½ρv²    (7.6.10)

where the term in parentheses is the sum of the upstream static and dynamic pressures.
Substitution of (7.6.10) into (7.6.9) then leads directly to the Kutta-Joukowski lift

F_l = (p_lower − p_upper)b = ½ρb(v²_upper − v²_lower) = ½ρb[(v₀ + u)² − (v₀ − u)²] = 2ρbuv₀ = ρv₀Γ.    (7.6.11)
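As a quick numerical sanity check of the chain of equalities in (7.6.11), the lift per unit span obtained from the Bernoulli pressure difference can be compared with ρv₀Γ. The parameter values below are illustrative assumptions, not values taken from the text.

```python
# Check that the Bernoulli pressure difference over the board reproduces the
# Kutta-Joukowski lift rho*v0*Gamma, Eqs. (7.6.8)-(7.6.11).
rho = 1.29   # air density (kg/m^3), assumed
v0  = 10.0   # free-stream speed (m/s), assumed
u   = 1.5    # bound-vortex speed at the board surface (m/s), assumed
b   = 0.8    # chord (m), assumed

Gamma = 2 * u * b                                        # circulation, Eq. (7.6.8)
F_bernoulli = 0.5 * rho * b * ((v0 + u)**2 - (v0 - u)**2)  # Eq. (7.6.11), middle form
F_kj = rho * v0 * Gamma                                  # Kutta-Joukowski lift per unit span

assert abs(F_bernoulli - F_kj) < 1e-9
```

The agreement is exact (to rounding) because (v₀ + u)² − (v₀ − u)² = 4uv₀ identically.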
19 G. Magnus, Über die Abweichung der Geschosse [Concerning the deviation of projectiles], Abhandlungen der Königlichen Akademie der Wissenschaften zu Berlin (1852) 123.
Letter of Isaac Newton reproduced by I.B. Cohen, Isaac Newton's Papers and Letters on Natural Philosophy and Related Documents (Harvard University Press, Cambridge MA, 1958). Republished as Isaac Newton, "A new theory about light and colors," American Journal of Physics 61 (1993) 108-112.
Fig. 7.8 Magnus effect on a baseball backspinning at angular frequency ω about an axis perpendicular to its translational velocity v, thereby producing a vertical lift force F_L. The effect is due to the greater airspeed (relative to the surface of the ball), and consequently lower pressure, at points in the upper hemisphere compared to corresponding points in the lower hemisphere.
A readable (if you read German) and comprehensive discussion of the Magnus effect was published in 1925 by the aerodynamicist Ludwig Prandtl,20 creator of boundary-layer theory (and therefore in effect the father of the science of aerodynamics), to explain the Windkraftschiff (wind-power ship) invented by the German engineer Anton Flettner, which employed two vertical rotating cylinders in place of sails.
To calculate the circulation rigorously, and therefore the lift, of a sphere spinning
in a viscous fluid, it is necessary to determine the velocity field of the fluid. In general,
this is very difficult to do analytically since it entails solving the nonlinear Navier-Stokes equation. We can avoid the necessity of doing so, however, by making a few
assumptions, adequate for the present purposes, that give insight into the quantities
that matter most.
Figure 7.8 shows a schematic diagram of a sphere of radius a backspinning at angular frequency ω, i.e. with angular velocity ω such that the cross product ω × v, where v is the velocity of the center of mass through the air (or v is the velocity of the air stream in the rest frame of the sphere), is in the direction of the aerodynamic lift F_L. A point on the sphere at a horizontal distance x from the origin and radial distance r = (a² − x²)^(1/2) from the rotation axis moves with a linear speed v(r) = ωr. The no-slip condition requires that air molecules adhere to the surface of the sphere and move with it at the same angular frequency. If we consider only 2D flow within
20 L. Prandtl, Magnuseffekt und Windkraftschiff, Die Naturwissenschaften 6 (1925) 93-104. An English translation is available online as a NASA Technical Report: NACA Technical Memorandum 367, http://ntrs.nasa.gov/search.jsp
planar sections perpendicular to the rotation axis and assume that molecules at the surface entrain those in successive layers within a thin boundary layer likewise to follow the motion of the sphere's surface (i.e. neglect effects of viscosity in the bulk fluid outside the boundary layer), then the circulation about a contour of radius r would be

Γ(r) = 2πr v(r) = 2πωr².    (7.7.1)
The circulation (7.7.1) contributes a vertically upward force dF_l(x) on a section of width dx about x of

dF_l(x) = ρvΓ(r(x)) dx = 2πρω(a² − x²)v dx.    (7.7.2)
The total lift on the sphere, obtained by integrating (7.7.2) over the range −a ≤ x ≤ a, is then easily shown to be

F_l = (8π/3)ρa³ωv = ½ρv²C_l(πa²)        C_l = 16aω/(3v),    (7.7.3)
where the second expression defines the coefficient of lift C_l by a relation analogous to the one defining the coefficient of drag C_d. Note that C_l is a dimensionless constant expressed, as might be expected, as a number of order unity times the ratio of rotational and translational velocities. Based on dimensional considerations alone, fluid flow about a smooth rotating sphere in translational motion should be characterized by two parameters: the Reynolds number

Re = 2av/ν    (7.7.4)

in which ν is the kinematic viscosity of the fluid, and the roll (or spin) parameter

J = aω/v.    (7.7.5)
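The integration leading from (7.7.2) to the closed form (7.7.3) can be checked numerically. The parameter values below (ball radius, speed, spin rate) are assumptions chosen for illustration.

```python
import math

# Numerical check of Eq. (7.7.3): sum the sectional lift
# dF_l = rho*v*Gamma(r) dx, with Gamma = 2*pi*omega*(a^2 - x^2),
# over -a <= x <= a by the midpoint rule and compare with the closed form.
rho   = 1.29               # air density (kg/m^3)
a     = 0.037              # ball radius (m), assumed
v     = 49.2               # translational speed (m/s)
omega = 2*math.pi*15.0     # backspin at 15 Hz (rad/s)

n = 100_000
dx = 2*a/n
F = sum(rho*v*2*math.pi*omega*(a*a - (-a + (i + 0.5)*dx)**2)*dx for i in range(n))

F_closed = (8*math.pi/3)*rho*a**3*omega*v   # Eq. (7.7.3)
Cl = 16*a*omega/(3*v)                       # coefficient of lift
assert abs(F - F_closed)/F_closed < 1e-6
```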
The essential feature of (7.7.3) is a lift proportional to the first power of the
relative speed of the ball and medium. Although early investigators of the transverse
force on a spinning sphere reported a force proportional to the square of the speed,
more recent systematic investigations of golf balls21 and baseballs22 are more or less
consistent with a linear force for high Reynolds number and low roll parameter.
Nevertheless, the problem is complex and experiments are not fully in agreement with
one another or with theory. In arriving at Eq. (7.7.3), I have made assumptions that
are not rigorously self-consistent. The velocity field, obtained by superposing a
uniform free-stream flow and the field of an ideal vortex centered on the sphere,
neglects viscosity even though viscosity is what engendered the fluid circulation.
More realistically, the flow around a spinning sphere at high Reynolds numbers
undoubtedly produces turbulence and eddies behind the sphere, which cause drag
and affect lift.
The same inconsistency can be found in attempts at a tractable theoretical analysis
of a rotating cylinder for which the validity of the result is easier to estimate (or at
least speculate).23 To the extent that inferences drawn from a cylinder may have
relevance to a sphere, one may conclude the following. For sufficiently large values
of J, there is a surface dividing the fluid into an irrotational part that flows past the
cylinder in the main stream and a part trapped near the cylinder that co-rotates with
it. If the boundary layer within which viscosity is important is small compared with
the thickness of this surface of entrapment, then vorticity cannot readily diffuse into
the main stream, whereupon the circulation given by (7.7.1) (with constant r) is
believed accurate to a good approximation. The criterion for validity of (7.7.1) can
be shown to be J ≫ Re^(−1/3), from which it follows from (7.7.5) that the spin frequency ω/2π must satisfy

ω/2π ≫ Re^(−1/3) v/(2πa).    (7.7.6)
Thus, for a baseball traveling through room-temperature air at 110 miles per hour (49.2 m/s), we have seen that Re ~ 2.4 × 10⁵, whereupon (7.7.6) yields ω/2π ≫ 3.6 Hz, which is easily achievable for a spinning baseball. In the analysis of cylindrical flow, however, the condition J ≫ 1 was assumed, and this is not the case for a batted baseball. For example, J ≈ 0.072 for a baseball translating at 110 mph and spinning
at 15 Hz. There are, as well, other features of a baseball, such as the seam that
meanders over the surface, that may (or may not) contribute further complexity and
uncertainty to a rigorous analysis of lift. For purposes of illustration, therefore, I will adopt (7.7.3) as our working relation since it embodies the maximum circulation for a fixed ω (i.e. the circulation arising from rigid-body rotation) and should presumably
lead to the most optimistic estimates of achievable home run length.
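The dimensionless parameters quoted above can be checked directly. The ball radius and the kinematic viscosity of air below are assumed values, so the results only approximate the numbers in the text.

```python
import math

# Reynolds number (7.7.4), roll parameter (7.7.5), and the spin threshold (7.7.6)
# for a baseball at 110 mph. Radius and viscosity are assumed standard values.
v  = 49.2      # 110 mph in m/s
a  = 0.037     # baseball radius (m), assumed
nu = 1.5e-5    # kinematic viscosity of room-temperature air (m^2/s), assumed

Re    = v*(2*a)/nu                      # ~2.4e5
J     = a*(2*math.pi*15.0)/v            # roll parameter at 15 Hz spin, ~0.07
f_min = Re**(-1/3) * v/(2*math.pi*a)    # criterion (7.7.6): spin frequency >> f_min

print(Re, J, f_min)
```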
Taking account of the lift and induced drag resulting from the Magnus effect (7.7.3) on a sphere, as depicted geometrically in Figure 7.4, leads to a new characteristic speed v_s associated with spin

v_s = 3mg/(8πρa³ω)    (7.7.7)

and thus to a modification of the (dimensionless) equations of motion (7.5.10)

dV_x/dT + (V_x² + V_y²)^(1/2) V_x + λV_y = 0
dV_y/dT + (V_x² + V_y²)^(1/2) V_y − λV_x = −1    (7.7.8)
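Although the set (7.7.8) has no closed-form solution, it is straightforward to integrate numerically. The sketch below uses a fixed-step fourth-order Runge-Kutta scheme (one standard choice, not necessarily the solver used in the text); the value of λ and the launch conditions are assumptions for illustration.

```python
import math

def deriv(s, lam):
    # scaled equations (7.7.8): drag ~ |V|V, Magnus lift ~ lam*(-Vy, Vx)
    Vx, Vy, Sx, Sy = s
    V = math.hypot(Vx, Vy)
    return (-V*Vx - lam*Vy, -V*Vy + lam*Vx - 1.0, Vx, Vy)

def scaled_range(theta_deg, lam, V0=1.0, dT=1e-3):
    # integrate until the ball returns to launch height (Sy < 0);
    # report the scaled horizontal range Sx
    th = math.radians(theta_deg)
    s = [V0*math.cos(th), V0*math.sin(th), 0.0, 0.0]
    while s[3] >= 0.0:
        k1 = deriv(s, lam)
        k2 = deriv([s[i] + 0.5*dT*k1[i] for i in range(4)], lam)
        k3 = deriv([s[i] + 0.5*dT*k2[i] for i in range(4)], lam)
        k4 = deriv([s[i] + dT*k3[i] for i in range(4)], lam)
        s = [s[i] + dT*(k1[i] + 2*k2[i] + 2*k3[i] + k4[i])/6.0 for i in range(4)]
    return s[2]

R_nospin = scaled_range(25.0, 0.0)
R_spin   = scaled_range(25.0, 1.0)   # order-unity lambda, a moderate backspin
assert R_spin > R_nospin > 0.0       # backspin lengthens the low-angle range
```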
23
T.E. Faber, Fluid Dynamics for Physicists (Cambridge University Press, New York, 1995) 279283.
429
[Fig. 7.9 Scaled trajectories (altitude vs. horizontal range) of balls launched at 110 mph: no drag/no spin (dashed), drag/no spin (dotted), and drag + spin at ω/2π = 15 Hz (solid).]
for the velocity of the ball. The additional dimensionless parameter λ = v_d/v_s quantifies the relative influence of lift and drag. For a non-spinning ball, v_s → ∞ and λ = 0. In contrast to the equations of motion without spin, the set (7.7.8) cannot be solved analytically, but is readily solvable numerically by the Levenberg-Marquardt algorithm to which I have referred before.
Figure 7.9 shows a comparative illustration of the trajectories (solid lines) of balls launched at 110 mph at initial angles θ₀ of 25° (gray lines) and 45° (black lines) and a moderate backspin of ω/2π = 15 Hz. To put this in perspective, dashed lines mark the flight of the ball in vacuum (no drag, no spin) and dotted lines mark the flight in air (drag, no spin) with a form drag coefficient again chosen to be C_d ~ 0.2 for Reynolds numbers beyond the transition region. As we have already seen, without spin the range in vacuum for fixed initial speed is always greatest for θ₀ = 45°, and the range in air for the given parameters was greatest for θ₀ ~ 41°. With a backspin of 15 Hz under the same conditions, the longest home run range was obtained for a much lower initial angle, θ₀ ~ 25.3°. Now that spin has been discussed, a re-examination of the results in Table 7.9, showing the launch angles that lead to maximum ranges for all three sets of conditions (no drag/no spin; drag/no spin; drag/spin), would be informative.
Figure 7.10 shows a comparison of trajectories for balls launched at 110 mph at initial angles θ₀ of 0° (gray line) and 25° (black lines) and a higher backspin of ω/2π = 30 Hz. At θ₀ = 25° the trajectory looks like the cone of a volcano, the ascending and
Table 7.11

Frequency ω/2π (Hz)    Range (scaled), θ₀ = 0°    Range (scaled), θ₀ = 25°
15                     0                           0.719
20                     0.389                       0.765
25                     0.804                       0.755
30                     0.930                       0.694
35                     0.902                       0.607
40                     0.821                       0.518
45                     0.741                       0.447
[Fig. 7.10 Scaled trajectories (altitude vs. horizontal range) of balls launched at 110 mph: no drag/no spin, drag only, and drag + spin at ω/2π = 30 Hz.]
[Fig. 7.11 Scaled trajectories of balls launched at 110 mph with drag and spin at ω/2π = 100 Hz, compared with the vacuum trajectory.]

Figure 7.11 shows the trajectories of balls launched at 110 mph at θ₀ = 0° (gray plot) or 25° (black plot) and spinning at 100 Hz. At this high
spin rate the ball is undergoing a sustained, oscillatory flight, reminiscent of (and indeed related to) the behavior referred to as phugoid motion24 by F.W. Lanchester, one of the first to understand the principles of heavier-than-air flight. The term phugoid is actually a linguistic blunder. Lanchester sought a Greek word for flight in the sense of flying, but chose a word that had the sense of fleeing. For perspective, the figure also shows the vacuum trajectory (dashed black plot) of the ball launched at 25° without drag or spin.
To see clearly the mathematical origin of the looping motion, consider the equations of motion (7.7.8) in the limit of very high spin parameter λ, such that the drag terms quadratic in velocity are omitted:

dV_x/dT + λV_y = 0
dV_y/dT − λV_x = −1.    (7.7.9)
The set (7.7.9) can again be solved exactly, leading to components of velocity

V_x(T) = (V_x0 − λ⁻¹) cos λT + V_y0 sin λT + λ⁻¹
V_y(T) = V_y0 cos λT − (V_x0 − λ⁻¹) sin λT    (7.7.10)
and of displacement
S_x(T) = (V_x0 − λ⁻¹)(sin λT)/λ + V_y0 (1 − cos λT)/λ + T/λ
S_y(T) = V_y0 (sin λT)/λ − (V_x0 − λ⁻¹)(1 − cos λT)/λ.    (7.7.11)
I recall playing with such a toy as a child, although I am not familiar with the name
Rotorang.
Incidentally, the principal feature of interest to the author of the report was not
the phugoid (i.e. looped) motion of the Rotorang, but the fact that at a particular
spin rate its drag dropped so low that the toy hung motionless in the wind for an
unusually long period of time. This, the author claimed, was illustrative of what he
called the Barkley phenomenon (named for a man who pointed the characteristic
out to him during model-basin tests of rotary rudders): the drop in drag on a rotor
just prior to its reaching equality of surface and flow velocities i.e. a roll parameter
J = 1. Recall that one of the earliest failures of the theory of ideal fluids was the d'Alembert paradox: in effect, the false prediction that an object placed at some location in a rapidly flowing stream of ideal fluid would just remain there at rest.
A cylinder spinning in air at the appropriate rate actually appears to do that,
although for complex reasons related to the spin and boundary layer within which
the fluid is not ideal, but viscous. Moreover, the end caps on the Rotorang were not
meant to serve principally as wheels, but to block air flow between the external and internal regions that would reduce the circulation Γ.
7.8 Falling out of the sky is a drag
In January 1945, First Lieutenant Federico Gonzales of the US 8th Air Force was
piloting the lead B-17 Flying Fortress in an air raid over Germany when half of the
25 J. Borg, The Magnus Effect: An Overview of its Past and Future Practical Applications, Vol. 1, Report AD-A165 902 (Department of the Navy, Washington DC, 1986) 21-22.
left wing of his aircraft was shot off by ground fire. The plane, spinning rapidly,
split amidships and plunged 27000 feet (8.23 km) to the ground with the unconscious
pilot wedged under the instrument panel. Although severely injured, the pilot
survived, alone of the ten-man crew. So began the narrative of a book26 written by the pilot's son that, a few years after its publication, came to my attention in a serendipitous way: as a book-sale discard that my wife brought home to me.
It was a chance happening that re-directed my research focus for several years
afterward.
The pilot, having received medical attention in captivity, was eventually able,
despite some permanent injury, to resume a normal life after the war and, interestingly enough, became a professor of biophysics. When I learned these spare details of
Gonzales's extraordinary fall and subsequent recovery, two questions immediately
came to mind. First, how was it possible for any human to survive a fall of ~8 km
without a parachute? And second, feeling a certain kinship with the man through our
common pursuit of physics as a profession, I could not help wondering whether he,
himself, ever wondered about his survival, other than to regard it as an exceptionally
lucky outcome, if not a miracle.
The maximum acceleration that a human can endure has long been a subject of
fascination, as well as practical interest, particularly to insurance companies, automotive safety agencies, national space agencies, the military, and the like. Estimates
have ranged from about 10g to a little over 100g, depending on the duration and orientation of impact, where g is the acceleration of gravity: g = 9.81 m/s². Particularly striking was the case of racing driver David Purley, who survived a crash
estimated to have produced 178g as he decelerated from 173 km/h to 0 in a distance
of 66 cm.27 The most comprehensive study I am aware of concerning human impact
tolerance is a 324-page report prepared for the Insurance Institute for Highway
Safety.28 Among the findings of the authors, who investigated vertical falls up to
about 275 feet (. . . the height of the Golden Gate Bridge in San Francisco, California,
from which numerous suicide attempts have been made . . .) as a proxy for horizontal
car crashes, was that 350g for 2.5-3.0 ms was the approximate survival limit of
children under age 8 subject to head impacts. From such data it seems likely that a
few hundred g over a period of a few seconds would be a liberal upper limit to human
impact tolerance under most circumstances.
Straightforward application of the kinematics of uniform acceleration, such as is
taught in elementary mechanics, would tell us that, starting from rest, an object (irrespective of its mass) falling a vertical distance h = 8.23 km through vacuum (or
26 L. Gonzales, Deep Survival (W. W. Norton, New York, 2005) 9-15. Other accounts of the mission in which Lt. Gonzales's plane was shot down were recorded in diaries of various members of the squadron, excerpted online at the website of the 398th Bomb Group Memorial Association http://www.398th.org/Missions/Dates/1945/January/MIS_450123.html
27 David Purley, http://en.wikipedia.org/wiki/David_Purley
28 R. G. Snyder, D. R. Foust, and B. M. Bowman, Study of Impact Tolerance Through Free-Fall Investigations (December 1977), Highway Safety Research Institute of the University of Michigan, Ann Arbor, Michigan.
v = (2gh)^(1/2)    (7.8.1)

In steady level flight, by contrast, the upward lift balances the weight

F_l = ½ρv₀²C_l S = mg    (7.8.2)

and the combined forward thrust F_t = P_e/v₀ of all the engines of combined power P_e is balanced by the rearward drag of air resistance

F_d = ½ρv₀²C_d S = F_t    (7.8.3)
so that there is no net acceleration. Besides balance of forces, there is the absence of
moments. The upward lift of the port wing (i.e. on the pilot's left), which would roll
29 http://en.wikipedia.org/wiki/Boeing_B-17_Flying_Fortress
Table 7.12

Wingspan: 32 m
Wing area S: 131.9 m²
Aspect ratio AR: 7.58
Empty mass: 16 391 kg
Loaded mass m: 24 495 kg
Cruising speed v₀: 81.4 m/s
Maximum speed: 128.3 m/s
Number of engines: 4
Engine power P_e (each): 895 kW
[additional extracted entries not assignable: 23, 2.4, 13.6, 3.4]
the plane clockwise about the long axis of the fuselage is balanced by the upward lift of the starboard wing (to the pilot's right), which would roll the plane anti-clockwise.
The upward lift of both wings, which could pitch the nose of the plane upward in a
rotation about a horizontal axis through the wings, is balanced by the upward lift of
the rear horizontal stabilizers (the winglets in the tail assembly or empennage),
which would pitch the nose downward. And the thrust of the port engines, which
would yaw the plane to starboard about a vertical axis through the craft's center of gravity, is balanced by the counter-thrust of the starboard engines, which would yaw
the plane to port.
In the expressions (7.8.2) and (7.8.3), the forces of lift and drag are expressed in standard Newtonian form, i.e. proportional to the square of the relative wind speed, in which ρ is again the air density, S is again a reference area defined as the projected area of the planform (which for all practical purposes may be taken to be the wing area), and the engine power P_e is equal to the product of velocity and thrust.
The lift-to-drag ratio then follows simply as
F_l/F_d = C_l/C_d = mgv₀/P_e.    (7.8.4)
The drag on a subsonic airplane arises from various sources. Skin (or friction)
drag results from viscous shearing stresses over the surface. Pressure (or form) drag
results from the integrated effect of static pressure normal to the surface. Together,
skin drag and pressure drag constitute profile drag. Other forms of drag, not pertinent to the present discussion, arise from shock waves associated with relative speeds at or beyond the speed of sound. For a sleek aerodynamically
shaped object like an airplane cruising horizontally (and subsonically) through the
air at high Reynolds number, the profile drag is primarily skin drag, and the drag
coefficient may be expressed as the sum of two terms

C_d = C_d0 + C_l²/(πℓ²/S)    (7.8.5)

where the profile drag C_d0 (first term) depends primarily on shape, and the induced drag (second term) depends on lift and aspect ratio AR = ℓ²/S.
Given the aerodynamic complexities of a real aircraft in flight, the coefficients of
lift and drag are generally not amenable to theoretical prediction, but must be
obtained empirically, which we can do from the data in Table 7.12. Thus, solving
for Cl and Cd from (7.8.2) and (7.8.3) and substituting the appropriate quantities
from the table lead to
C_l = 2mg/(ρv₀²S) = 2(24 495 kg)(9.81 m/s²) / [(1.29 kg/m³)(81.4 m/s)²(131.9 m²)] = 0.426    (7.8.6)

C_d = 2P_e/(ρv₀³S) = 2(4 × 895 × 10³ W) / [(1.29 kg/m³)(81.4 m/s)³(131.9 m²)] = 0.078    (7.8.7)
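These two computations are easily reproduced from the tabulated B-17 data:

```python
# Lift and drag coefficients of the cruising B-17, Eqs. (7.8.6)-(7.8.7).
m, g = 24495.0, 9.81        # loaded mass (kg), gravitational acceleration (m/s^2)
rho  = 1.29                 # air density at STP (kg/m^3)
v0   = 81.4                 # cruising speed (m/s)
S    = 131.9                # wing area (m^2)
Pe   = 4 * 895e3            # total power of four 895-kW engines (W)

Cl = 2*m*g/(rho*v0**2*S)    # Eq. (7.8.6)
Cd = 2*Pe/(rho*v0**3*S)     # Eq. (7.8.7)
print(round(Cl, 3), round(Cd, 3), round(Cl/Cd, 1))   # → 0.426 0.078 5.5
```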
30 L. K. Loftin, Jr, Quest for Performance: The Evolution of Modern Aircraft, NASA SP-468 (NASA Scientific and Technical Information Branch, Washington DC, 1985) Appendix A, Table II, Characteristics of Illustrative Aircraft 1918-1939, http://www.hq.nasa.gov/pao/History/SP-468/app-a.htm
31 The aspect ratio AR of a rectangular airfoil is the ratio of the wingspan ℓ to the chord length b. For wings of variable width, the aspect ratio is defined by AR = ℓ²/S, where S is the wing area.
32 R. von Mises, Theory of Flight (Dover, New York, 1959) 140-142, 165.
and therefore to a ratio C_l/C_d ≈ 5.5 for a fully loaded B-17 cruising at about 80 m/s in level flight. NASA reports a maximum ratio (C_l/C_d)_max = 12.7.
Since the temperature and density of the atmosphere are not homogeneous, the
value of the air density employed above warrants comment. It is the density of dry air at 0 °C and 1 atm pressure, a set of conditions referred to as Standard Temperature and Pressure (STP). Within the troposphere, i.e. the first ~11 km of the atmosphere above sea level, the temperature of dry air rising adiabatically (i.e. without heat exchange) decreases linearly with altitude at approximately 10 °C/km. This variation,
which characterizes convective isentropic equilibrium in the atmosphere, is known as
the (dry air) adiabatic lapse rate. As the air temperature changes, so too does the
density and pressure according to the ideal gas law and the adiabatic expansion
equation which relate all three variables. Additionally, even in an isothermal atmosphere, the pressure, and therefore the density, decrease exponentially with altitude in
accordance with the barometric equation. Given that Lt. Gonzales's B-17 descended over Germany in winter (January), it may be reasonable to assume that the ground-level temperature was about 0 °C, and therefore his descent at 27 000 feet (8.2 km) began in an ambient temperature of about −80 °C. The variation in density affects the descent
rate, but I will deal with the complexities of a thermodynamically inhomogeneous
atmosphere in the next section where it is more pertinent to the content. For the
present purpose, however, of accounting for Lt. Gonzaless survival, it is sufficient
simply to adopt the STP value of air density.
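The dry adiabatic lapse rate and the ambient temperature at the altitude where the fall began follow from the standard relation Γ = g/c_p (a consequence of the adiabatic and hydrostatic equations); c_p for dry air and the 0 °C ground temperature are the assumed values.

```python
# Dry adiabatic lapse rate and the temperature at 8.2 km altitude.
g   = 9.81      # m/s^2
c_p = 1005.0    # specific heat of dry air at constant pressure (J/(kg K)), assumed
z   = 8.2       # altitude (km)

lapse = g/c_p * 1000      # K per km; ~9.8, which the text rounds to 10
T_ground = 0.0            # assumed winter ground temperature (deg C)
T_alt = T_ground - lapse*z
print(round(lapse, 1), round(T_alt))   # → 9.8 -80
```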
With sudden destruction of the port wing on Lt. Gonzales's fateful day, this
perfect balance was instantly shattered. The lift on the starboard wing, now
unopposed by the lift of its counterpart, generated a torque about the long axis,
rolling the starboard wing upward and the remnant of the port wing downward. The
roll, according to the narrative I read, was violent enough to invert the plane. With a
port engine missing, the uncompensated torque of the starboard engines about the
vertical axis through the plane's center of gravity yawed the nose of the plane to
the port side, initiating a spin. The lift of the horizontal stabilizers, now exceeding the
lift of the wings, forced the nose of the plane downward. Rolling, yawing, pitching,
the doomed B-17, its weight no longer supported by lift, plunged earthward, quickly
settling into a stable helical spiral described in aeronautical terms as a flat spin.
In a spinning descent, the aerodynamic variables are all out of kilter. With the
nose declined below horizontal and the aircraft falling downward, the relative wind is
primarily vertically upward, flowing over the wings at an incidence (angle between air stream and wing chord) a little below 90° for a flat spin, far above the stalling angle. Under such conditions, the net aerodynamic force on the plane is
perpendicular to the wing chord. Drag, which in steady flight is horizontal, opposing
thrust, is now vertically upward, opposing weight. Lift, which in steady flight is
vertically upward, opposing weight, is now horizontal and radially inward, creating
the centripetal acceleration of the spin. Of the two kinds of spin, steep spin (the extreme form of which is a spinning nose dive) and flat spin (the extreme form of which resembles the descent of a Frisbee), the flat spin is the more dangerous because it is
stable. The wings are stalled, the control surfaces, particularly the rudder, cease to
function, and, once transients of the motion have decayed away, the flat spin persists at a
steady rate. Nevertheless, hazardous and irrecoverable as it is reputed to be, I believe
that a flat spin saved the life of Lt. Gonzales.
In modeling the violent transition from steady horizontal flight to a helical flat
spin descent, I will consider first an intact B-17 (because its geometry is unambiguous) and assume that
(a) the cruising speed v0 of the aircraft became (at least approximately) the tangential
speed of the planes center of gravity about the axis of the helix, and
(b) the vertical descent started with an initial axial component v_z0 = 0.
It then follows from the equation for lift as the source of centripetal acceleration

F_l = ma_cent = mv₀²/r    (7.8.8)

and the steady-state equation for drag

F_d = mg    (7.8.9)

that the ratio

a_cent/g = F_l/F_d = v₀²/(gr)    (7.8.10)

gives what in the pilot's rest frame would be interpreted as the centrifugal force,
once the radius r of the helical trajectory is known. The radius for extreme flat spin is
ordinarily not larger than one-half of the wingspan,33 which from Eq. (7.8.10) and
Table 7.12 would lead to a centrifugal force of about 43g. This is a little below the
maximum acceleration experienced by a human on a rocket sled. Beyond ~50g, sustained spin could lead to death or serious injury.
In the narrative about his father's descent, the author wrote that the plane was spinning "hard enough to suck your eyeballs out,"34 a literary embellishment that might well apply to a centrifugal force of about 43g. The corresponding spin rate for a radius ℓ/2 = 15.8 m is 0.82 Hz, or very close to 50 revolutions per minute (rpm). As
an interesting comparison, the Guinness World Record for spinning on ice skates
was (at the time of writing) 308 rpm set by Natalia Kanounnikova of Russia
on 27 March 2006.35 However, the maximum centrifugal force experienced by a
point on her body if she were modeled as a vertical cylinder of radius ~25 cm
would be ~26g.
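The numbers quoted above can be re-derived from (7.8.10); the spin radius is half the wingspan, as in the text, and the skater's body radius is the text's modeling assumption.

```python
import math

# Centrifugal acceleration in the flat spin, Eq. (7.8.10), and the comparison
# with the record ice-skating spin.
g, v0 = 9.81, 81.4          # gravity (m/s^2), tangential speed (m/s)
r = 15.8                    # spin radius = half the B-17 wingspan (m)

a_over_g = v0**2/(g*r)      # ~43
f_spin   = v0/(2*math.pi*r) # spin rate (Hz), ~0.82

rpm  = 308.0                # skater's record spin rate (rpm)
r_sk = 0.25                 # modeled body radius (m), assumed
w = rpm*2*math.pi/60
a_sk = w**2*r_sk/g          # roughly 26g

print(a_over_g, f_spin, a_sk)
```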
33 B. N. Pamadi, Performance, Stability, Dynamics, and Control of Airplanes (AIAA, Reston VA, 1998) 650.
34 L. Gonzales, op. cit. p. 271.
35 Guinness World Records, http://community.guinnessworldrecords.com/_GUINNESS-WORLD-RECORDSHOLDING-FIGURE-SKATERS-GO-FOR-THE-GOLD-IN-VANCOUVER/blog/1866731/7691.html
The assumption underlying (7.8.9) that the plane fell at a steady terminal speed
will now be justified as we examine the descent. Although the nonlinear Newtonian
drag force couples vertical and horizontal components of velocity, we have seen in
the analysis of a baseball trajectory that treating horizontal and vertical motions
independently led to results in surprisingly good agreement with those obtained by
solution of the exact equations of motion. With adoption of the same approximate
procedure, the equation of motion for freefall from height h in a retarding atmosphere takes the dimensionless form
dV_z/dT + V_z² = 1        V_z ≡ v_z/v_d,  T ≡ t/t_d    (7.8.11)
where z designates the vertical axis whose origin is at the initial location of the plane. The (scaled) vertical displacement (S_z = s_z/s_d) is measured from this origin. For a loaded B-17 (m = 24 495 kg) falling like a flat plate (C_plate = 1.3), the scale factors in terms of which dynamical variables are expressed take values
velocity:      v_d = [2mg/(ρC_plate S)]^(1/2) = 46.61 m/s    (7.8.12)
time:          t_d = v_d/g = 4.75 s    (7.8.13)
displacement:  s_d = v_d t_d = 221.48 m    (7.8.14)
Two significant features distinguish Eq. (7.8.11) from the equation employed previously for the vertical motion of a baseball. The first is the +1, rather than −1, on the right-hand side, the sign of which reflects that positive displacement along the vertical axis occurs downward (in the direction of g) rather than upward. The second is the initial condition v_z0 = 0, instead of v_z0 = v₀ sin θ₀. With these two differences taken into account, the equation can be integrated to yield expressions
V_z(T) = v_z(t)/v_d = [tanh T + V_z0] / [1 + V_z0 tanh T]    →(V_z0 = 0)→    tanh T    (7.8.15)

V_z(S_z) = v_z(s_z)/v_d = [1 − (1 − V_z0²) e^(−2S_z)]^(1/2)    →(V_z0 = 0)→    (1 − e^(−2S_z))^(1/2)    (7.8.16)

S_z(T) = s_z(t)/s_d = ln[cosh T + V_z0 sinh T]    →(V_z0 = 0)→    S_z(T) = ln cosh T,    T(S_z) = ln[e^(S_z) + (e^(2S_z) − 1)^(1/2)]    (7.8.17)
The corresponding impact deceleration a_c, for a stopping (crush) distance l_c, is

a_c/g = v_d²/(2g l_c) = 55.4 for l_c = 2 m,

values which, though severe, are within past precedents of survival.
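The scale factors (7.8.12)-(7.8.14) and the impact deceleration for an assumed 2-m crush distance are reproduced below:

```python
import math

# Scale factors for the B-17 falling flat like a plate, Eqs. (7.8.12)-(7.8.14),
# and the impact deceleration in g's for an assumed crush distance l_c.
m, g   = 24495.0, 9.81
rho    = 1.29               # air density at STP (kg/m^3)
Cplate = 1.3                # drag coefficient of a flat plate
S      = 131.9              # wing area as reference area (m^2)

v_d = math.sqrt(2*m*g/(rho*Cplate*S))   # terminal speed, Eq. (7.8.12)
t_d = v_d/g                             # Eq. (7.8.13)
s_d = v_d*t_d                           # Eq. (7.8.14)

l_c = 2.0                               # crush distance (m), assumed
a_c = v_d**2/(2*g*l_c)                  # impact deceleration in g's

print(round(v_d, 2), round(t_d, 2), round(s_d, 1), round(a_c, 1))   # → 46.61 4.75 221.5 55.4
```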
It is to be noted that the element critical to Lt. Gonzales's survival was that the plane came down flat like a Frisbee and not steep like an arrow. Replacement
of the form drag of a flat plate broadside to the wind with the friction drag of a
B-17 yields a value for ac/g in the thousands for any reasonable impact length,
whereupon the pilots future son would not have been around to write the
narrative.
The question remains as to whether the results of the preceding analysis for an
intact plane are valid given the extensive damage (with loss of structures) to the
aircraft. As seen from (7.8.12), the terminal freefall velocity is reduced by a loss of
mass but is increased by a reduction in wing area. Thus, to estimate reliably the
dynamical effects of damage requires some detailed anatomical information which
may no longer be available. In the narrative, half of one wing was shot away and
shortly afterward the plane broke in two amidships. To my knowledge there is
no photographic record of Lt. Gonzales's downed B-17, but from photographs
I have seen of other damaged B-17 aircraft, I would speculate that the fuselage
fractured just fore of the empennage at about three-quarters the distance from
the nose. Since the bombs on a B-17 were stored in racks in a bomb bay behind
the cockpit, the loss of the rear quarter of the fuselage did not mean loss of the
principal load.
To determine the mass of a B-17 missing one-half a wing and one-quarter the
fuselage requires knowing the masses of the wings and fuselage separately. Since no
technical specifications of the B-17 available to me gave these data, I estimated them
7.9 Descent without power: how to rescue a jumbo jet disabled in flight
As a physicist conducting research in laboratories all over the world, I have spent a
lot of time in airplanes some 10 km above the ground. In all my travels, I have yet to
36 M. D. Ardema, M. C. Chambers, A. P. Patron, A. S. Hahn, H. Miura, and M. D. Moore, Analytical Fuselage and Wing Weight Estimation of Transport Aircraft, NASA Technical Memorandum 110392 (May 1996), pp. 19, 22. The eight aircraft whose fuselage and wing weights were given are: B-720, B-727, B-737, B-747, DC-8, MD-11, MD-83, and L-1011, where B = Boeing, DC = Douglas, MD = McDonnell-Douglas, L = Lockheed.
37 F. Gonzales and M. J. Karnovsky, Electron microscopy of osteoclasts in healing fractures of rat bone, Journal of Biophysical and Biochemical Cytology 9 (1961) 299-316.
38 M.P. Silverman, Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement, and Interference (Springer, Heidelberg, 2008).
meet a fellow air traveler who has not at least for a moment reflected on the
possibility that the plane may go down. The trend in design of modern commercial
aircraft, driven in part by rising costs of fuel, construction materials, and labor, is to
larger, heavier planes that transport ever greater numbers of passengers. Aerodynamicists now routinely contemplate design models capable of carrying 800 or more
people.39 Although air travel is presently considered very safe, no human-made
machine is 100% reliable, and it is therefore certain that at least one of these airplanes will eventually fail in service with a huge number of fatalities. It is of
prime interest therefore to investigate how the laws of physics may be used to avert
such a catastrophe.
The fall and survival of Lt. Gonzales prompted me to consider more generally the
controlled descent of fragile loads, a topic of vital concern to space agencies, cargo
transporters, and general aviation. In regard to the latter, in particular, I was able to
demonstrate40 analytically how a large passenger airliner, having suffered total loss of
power, may be brought to ground by means of a sequentially released, parachute-assisted descent with impact deceleration below 10g. The idea of protecting an entire aircraft, rather than individual persons, with a parachute, unusual as it may seem, has in fact been implemented commercially since about 1980 for small craft with maximum masses in the range of 270-1410 kg and deploy speeds of about 65-85 m/s.41
For large general aviation aircraft, however, the greater weights, speeds, and altitudes
are believed to make in-air recovery virtually impossible. Nevertheless, I have found
that in-air recovery of large general-aviation aircraft should be aerodynamically
feasible with decelerators of a size that currently exist and without necessarily requiring new materials.
The air resistance (drag force) on an object descending through an atmosphere
depends, as discussed in the previous section, on the air density, square of the relative
air speed, effective area presented to the air stream, and drag coefficient. The air density
in turn is a function of altitude and air temperature. In an atmosphere in isentropic
equilibrium, such as characterizes the Earth's troposphere (depth of 8–16 km
from poles to tropics), the density varies adiabatically with altitude. The drag coefficient is largely independent of size, but depends weakly on Reynolds number (for
high Reynolds numbers) and sensitively on shape and origin (i.e. from pressure or
friction).
Although it is usually an acceptable approximation to regard air as an incompressible fluid for horizontal flight at subsonic speeds, the effect of compressibility on air
resistance will be significant at any speed for a sufficiently large vertical excursion.
39. A. Bowers (Senior Aerodynamicist for NASA), The Wing is The Thing (TWITT) Meeting, NASA Dryden Flight Research Center, Edwards AFB, California USA (16 September 2000). Presentation available at http://www.twitt.org/BWBBowers.html
40. M. P. Silverman, Two-dimensional descent through a compressible atmosphere: Sequential deceleration of an unpowered load, Europhysics Letters 89 (2010) 48002 p1–p6.
41. Ballistic Recovery Systems, http://www.usairborne.com/brs_parachute.htm
Air is a poor conductor of heat. To say that air density varies adiabatically means
that over the brief time interval that a parcel of air expands or contracts in an
environment at different temperature, there is no heat flow into it from the immediate
surroundings. Thus, the work done in adiabatic expansion or contraction comes
from the internal energy of the parcel, which subsequently must cool (for expansion)
or become warmer (for compression). Combined application of the equation of state
of an ideal gas of molar mass M
p RT=M,
7:9:1
which relates pressure p, absolute (or Kelvin) temperature T, and density , with the
equation for an adiabatic transformation of an ideal gas
p1 T constant
7:9:2
derived from (7.9.1) with use of the Second Law of Thermodynamics, and the
barometric equation
dp
g
dz
7:9:3
that governs the decrease in pressure with altitude in a uniform gravitational field,
leads to expressions for the adiabatic lapse rate
dT
Mg 1
T0
z
) Tz T 0 1
7:9:4
hatm
dz
R
hatm
and the adiabatic variation of density with altitude
z 0 1
1
1
hatm
7:9:5
in which
hatm
RT 0
1Mair g
7:9:6
The adiabatic (theoretical) lapse rate for dry air is about 9.8 °C/km, whereas the mean lapse rate actually observed in the troposphere is about
6.5 °C/km, but I will use the theoretical value for maximal influence of altitude on air
temperature and density.
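As a numerical check of Eqs. (7.9.4)–(7.9.6), the short script below (a sketch; the parameter values for dry air are standard textbook values, not taken from the text) evaluates the adiabatic height, the theoretical lapse rate, and the density profile:

```python
import math

# Standard parameters for dry air (assumed values, not from the text)
gamma = 1.4          # ratio of specific heats
R = 8.314            # molar gas constant (J/(mol K))
M_air = 0.02896      # molar mass of air (kg/mol)
g = 9.81             # gravitational acceleration (m/s^2)
T0 = 273.0           # ground-level temperature (K)

# Adiabatic height of the atmosphere, Eq. (7.9.6)
h_atm = gamma * R * T0 / ((gamma - 1) * M_air * g)

# Adiabatic lapse rate, Eq. (7.9.4): dT/dz = -T0/h_atm
lapse = T0 / h_atm * 1000.0   # K per km

def density_ratio(z):
    """Adiabatic density variation with altitude, Eq. (7.9.5)."""
    return (1.0 - z / h_atm) ** (1.0 / (gamma - 1.0))

print(h_atm / 1000.0)       # ~28 km
print(lapse)                # ~9.8 K/km, the theoretical adiabatic value
print(density_ratio(10e3))  # air density at 10 km relative to ground level
```

The computed lapse rate of about 9.8 K/km is the theoretical value contrasted above with the observed mean of 6.5 °C/km.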
The z axis in Eqs. (7.9.3) and (7.9.4) is oriented vertically upward with the origin at
ground level. However, when (in due course) we consider the descent of an aircraft, it
will also prove useful to employ a vertical axis oriented downward with the origin at
the initial height h of the falling object. Displacements measured downward from the
initial location will then be represented, as before, by s_z (or dimensionless scaled
equivalent S_z) and displacements measured upward from the ground will be represented by z (or a scaled equivalent Z). The two sets of vertical coordinates are related
by s_z = h − z or, as scaled variables, S_z = H − Z. With attention to symbolism, there
should be no confusion.
Consider now the application of Newton's Second Law of Motion with air drag,
Eq. (7.5.5), applied to a structure of total mass m comprising several separate but
attached plates, as in Figure 7.12, each of which contributes drag independently of
the others with a surface either perpendicular to the horizontal (x axis) or facing
downward (negative direction along z axis).

Fig. 7.12 Schematic diagram of an airfoil with horizontal and vertical decelerators, modeled as
plates with respective plan areas S0, S1, S2 moving relative to the air stream with velocity v and
incidence angle θ (as seen from the rest frame of the airfoil).

In the rest frame of the structure the
wind blows with velocity of magnitude v at an incidence θ to the x axis. Decomposing
the equation into its horizontal and vertical components, one obtains a set of first-order nonlinear equations

dv_x/dt + β_x v_x² + β_z v_x v_z = 0
dv_z/dt + β_z v_z² + β_x v_x v_z = g    (7.9.7)

dx/dt = v_x = v cos θ        ds_z/dt = −dz/dt = v_z = v sin θ    (7.9.8)

in which

β_x = ρC_x A_x/2m        β_z = ρC_z A_z/2m    (7.9.9)

(with A_x, A_z the effective plate areas and C_x, C_z the corresponding drag coefficients)
are the drag parameters (distinct from drag coefficients, which are dimensionless) of
the x- and z-oriented plates.
The structure in Figure 7.12 is a plate model of a falling unpowered aircraft: in
other words, just an elaborate projectile comprising only the essential components
of wings (w), a single horizontal (or drogue) parachute (hp), and one or more vertical
parachutes (vp). The components are characterized aerodynamically as plates of
projective area and drag coefficient (S0, C_w), (S1, C_hp), (S2, C_vp), respectively. Air
resistance on these decelerators is due primarily to form drag (pressure) rather than
skin drag (friction). For such a configuration, the drag parameters (7.9.9) take the
simplified form

β_x = ρ(z) C_hp S1/2m        β_z = ρ(z)[C_w S0 + n_p C_vp S2]/2m    (7.9.10)

where n_p is the number of deployed vertical parachutes, and with the adiabatic
density profile (7.9.5) they become

β_x(s_z) = β_x0 [1 − (h − s_z)/h_atm]^(1/(γ−1))        β_z(s_z) = β_z0 [1 − (h − s_z)/h_atm]^(1/(γ−1))    (7.9.11)

in which the altitude-dependence is shown explicitly and (β_x0, β_z0) are the ground-level drag parameters defined by (7.9.10) for density ρ0. Equations (7.9.11) are
expressed in terms of a single independent variable s_z, the time derivatives in (7.9.7)
having been eliminated by use of the chain rule

d/dt = (ds_z/dt)(d/ds_z) = v_z d/ds_z.    (7.9.12)

The final step in expressing the equations of motion (and their eventual solutions)
is to transform them, as was done previously, into dimensionless form (V = v/v_d,
Z = z/s_d, H = h/s_d)

V_z dV_x/dS_z + (εV_x² + V_x V_z)[1 − (H − S_z)λ]^(1/(γ−1)) = 0
V_z dV_z/dS_z + (V_z² + εV_x V_z)[1 − (H − S_z)λ]^(1/(γ−1)) = 1    (7.9.13)

whereby velocity, time, and displacement are scaled by factors

v_d = [2mg/ρ0(C_w S0 + n_p C_vp S2)]^(1/2)        t_d = v_d/g        s_d = v_d t_d = v_d²/g    (7.9.14)

and

ε ≡ β_x0/β_z0 = C_hp S1/(C_w S0 + n_p C_vp S2)        λ ≡ s_d/h_atm = v_d²/(g h_atm)    (7.9.15)

are dimensionless parameters. The aero-thermodynamic parameter λ can be interpreted as the ratio of the distance fallen from rest to about 93% of vertical terminal
velocity in a homogeneous atmosphere to the adiabatic height of the atmosphere.42

42. In the approximation of uncoupled 1D motion, the distance S_z = s_z/s_d = 1 fallen vertically from rest leads to vertical velocity V_z(S_z = 1) = (1 − e⁻²)^(1/2) ≈ 0.93.

In full generality Eqs. (7.9.13) require numerical solution. They can be solved
analytically, however, for several important special cases. One such case is quasi-stationary descent, in which the velocity components relax to

V_s,x = 0        V_s,z = [1 − (H − S_z)λ]^(−1/2(γ−1))    (7.9.16)

so that the vertical displacement is governed by

dS_z/dT = [1 − (H − S_z)λ]^(−1/2(γ−1)).    (7.9.17)

The quasi-stationary vertical velocity V_s,z is in effect a local terminal velocity
which itself varies in time, and one cannot take a limit t → ∞ because of the
restriction S_z ≤ H. Rather, we will see from exact numerical solutions of the equations
that the vertical velocity V_z can reach a maximum magnitude greater than 1 before
decreasing toward the limit V_s,z = 1.
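That overshoot is easy to see by integrating Eqs. (7.9.13) numerically. The sketch below integrates the equivalent scaled-time system dV_x/dT = −(εV_x² + V_xV_z)f, dV_z/dT = 1 − (V_z² + εV_xV_z)f, dS_z/dT = V_z, with f = [1 − (H − S_z)λ]^(1/(γ−1)), using a hand-rolled fourth-order Runge–Kutta step; the parameter values are illustrative, not the book's aircraft values:

```python
# Illustrative scaled parameters (assumed, not the book's B747 values)
eps, lam, gamma, H = 0.5, 0.03, 1.4, 10.0

def f(Sz):
    """Scaled air-density factor [1 - (H - Sz)*lam]^(1/(gamma-1))."""
    return (1.0 - (H - Sz) * lam) ** (1.0 / (gamma - 1.0))

def rhs(state):
    Vx, Vz, Sz = state
    ff = f(Sz)
    return (-(eps * Vx**2 + Vx * Vz) * ff,       # dVx/dT
            1.0 - (Vz**2 + eps * Vx * Vz) * ff,  # dVz/dT
            Vz)                                  # dSz/dT

def rk4_step(state, dt):
    k1 = rhs(state)
    k2 = rhs([s + 0.5 * dt * k for s, k in zip(state, k1)])
    k3 = rhs([s + 0.5 * dt * k for s, k in zip(state, k2)])
    k4 = rhs([s + dt * k for s, k in zip(state, k3)])
    return [s + dt / 6.0 * (a + 2*b + 2*c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

# Initial scaled state: fast horizontal flight, negligible sink rate
state = [2.5, 0.01, 0.0]
t, dt = 0.0, 0.005
Vz_max = 0.0
while state[2] < H and t < 50.0:   # integrate until the ground is reached
    state = rk4_step(state, dt)
    Vz_max = max(Vz_max, state[1])
    t += dt

Vx_final, Vz_final, Sz_final = state
```

With these parameters V_z climbs above 1 (the scaled ground-level terminal value) while the air is still thin, then decreases toward V_s,z = 1 as the density builds near the ground, which is precisely the behavior described above.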
For purely vertical descent (V_x = 0), and with the density factor linearized as
[1 − λZ]^(1/(γ−1)) ≅ 1 − λ′Z, where λ′ ≡ λ/(γ − 1), the second of Eqs. (7.9.13)
reduces to a first-order linear equation for V_z²

dV_z²/dZ − 2(1 − λ′Z)V_z² = −2    (7.9.19)

expressible directly in terms of altitude Z. Eq. (7.9.19) can be solved exactly by means
of an integrating factor to yield the expression

V_z²(Z) = V_z0² e^(−2(H−Z)+λ′(H²−Z²)) + (2/√λ′) e^(2Z−λ′Z²−1/λ′) ∫ from √λ′(Z−1/λ′) to √λ′(H−1/λ′) of e^(u²) du    (7.9.20)

with initial condition V_z(H) = V_z0. The relation between velocity and time must be
obtained by integration, T = ∫ from Z to H of V_z(u)⁻¹ du.
Table 7.13 Closed-form solutions for uncoupled horizontal and vertical motion in a uniform atmosphere

Component                  Scaled variables
Horizontal velocity        V_x(T) = v_x(t)/v_d = V_x0/(εV_x0 T + 1)
                           V_x(S_x) = v_x(s_x)/v_d = V_x0 e^(−εS_x)
Horizontal acceleration    A_x(T) = a_x/g = −εV_x0²/(εV_x0 T + 1)²
Horizontal displacement    S_x(T) = s_x/s_d = ε⁻¹ ln(εV_x0 T + 1)
Vertical velocity          V_z(T) = v_z(t)/v_d = (tanh T + V_z0)/(1 + V_z0 tanh T)
                           V_z(S_z) = v_z(s_z)/v_d = [1 − (1 − V_z0²)e^(−2S_z)]^(1/2)
Vertical acceleration      A_z(T) = a_z/g = (1 − V_z0²)/(cosh T + V_z0 sinh T)²
Vertical displacement      S_z(T) = s_z/s_d = ln(cosh T + V_z0 sinh T)

43. G-11 Cargo Parachute Assembly Technical Data Sheet, Mills Manufacturing Corporation, http://www.millsmanufacturing.com/files/G-11%20Tech%20Data%20Sheet.pdf/view
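The closed-form entries of Table 7.13 can be checked against the uncoupled equations of motion, dV_x/dT = −εV_x² and dV_z/dT = 1 − V_z², by finite differences (a consistency check only; ε and the initial values below are arbitrary):

```python
import math

eps, Vx0, Vz0 = 0.3, 2.0, 0.2   # arbitrary drag ratio and initial scaled velocities

def Vx(T):   # horizontal velocity, Table 7.13
    return Vx0 / (eps * Vx0 * T + 1.0)

def Vz(T):   # vertical velocity, Table 7.13
    return (math.tanh(T) + Vz0) / (1.0 + Vz0 * math.tanh(T))

def Sz(T):   # vertical displacement, Table 7.13
    return math.log(math.cosh(T) + Vz0 * math.sinh(T))

h = 1e-6
residual = 0.0
for T in (0.1, 0.5, 1.0, 2.0):
    dVx = (Vx(T + h) - Vx(T - h)) / (2 * h)   # central finite differences
    dVz = (Vz(T + h) - Vz(T - h)) / (2 * h)
    dSz = (Sz(T + h) - Sz(T - h)) / (2 * h)
    residual = max(residual,
                   abs(dVx + eps * Vx(T) ** 2),    # dVx/dT = -eps*Vx^2
                   abs(dVz - (1.0 - Vz(T) ** 2)),  # dVz/dT = 1 - Vz^2
                   abs(dSz - Vz(T)))               # dSz/dT = Vz
print(residual)   # tiny: the closed forms satisfy the drag equations
```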
Fig. 7.13 Time variation of horizontal (black) and vertical (gray) components of velocity for
an unpowered B747-100 with air drag provided by: (1) airfoil and drogue parachute calculated
by the exact 2D theory (solid) and uncoupled 1D theory (dashed); (2) airfoil without drogue
calculated by exact 2D theory (dotted). The initial altitude is h = 10 km; initial velocity
components (m/s) are v_x = 250, v_y = 1. Plan areas (m²) are S_airfoil = 511, S_drogue = 182.4.
Drag parameters (s⁻¹) are β_x0 = 0.0722, β_y0 = 0.1125 with aero-thermodynamic parameter
λ = 0.0278. The dashed line at the ordinate 1 marks the terminal vertical velocity.
113.4 kg would suffice, with a corresponding parachute of radius R/2 for the drogue.
At the time I first looked into the matter, the manufacturer packaged these parachutes in clusters of up to 8. Sequential deployment symmetrically over the fuselage
makes it possible to reduce vertical impact with the ground to a level below that of
individual military parachutists, (10–15)g.44
The drag coefficient of a parachute C_p depends on shape and venting, and the
spread of values I found in the literature ranged from about 1.3 to 2.4 depending on
the mode of descent.45 For illustrative purposes, I adopted C_p = 1.5, which is a little
larger than the drag coefficient C_plate = 1.3 of a plate (the wings) of aspect ratio 7.0 at
high Reynolds number.46 Given the maximum take-off mass in Table 7.10 and STP
44. J. R. Davis, R. Johnson, and J. Stepanek, Fundamentals of Aerospace Medicine (Lippincott Williams and Wilkins, Philadelphia, 2008) 675–676.
45. See, for example, (a) P. Wegener, What Makes Airplanes Fly?: History, Science, and Applications of Aerodynamics (Springer, New York, 1991) 107; (b) Parachute Descent Calculations http://my.execpc.com/~culp/rockets/descent.html#Velocity
46. R. W. Fox and A. T. McDonald, Introduction to Fluid Mechanics 4th Edition (Wiley, New York, 1992) 442, 468.
ground-level values for air density (ρ0 = 1.294 kg/m³) and temperature (273 K), the
parameters in (7.9.13) become β_x0 = 0.0722 s⁻¹ and β_z0 = (0.0126 + 0.0208 n_p) s⁻¹. Since C_d/
C_p ~ 0.021, we can ignore the contribution of the frictional drag on the aircraft when
treating the horizontal deceleration.
Upon solving the drag equations (7.9.13) with use of the foregoing parameters,
one finds that an unpowered B747 in unaided freefall would decelerate horizontally
to 50 m/s in 25.3 s while descending 2.01 km from an initial height of 10 km, and
attain a vertical velocity of 117 m/s. Deployment at that point of 24 G-11 parachutes
would bring the plane to a terminal velocity of 13.7 m/s, thereby subjecting passengers to an initial deceleration a0/g ≈ 33.4, which, while not necessarily life-threatening,
is nevertheless beyond the assumed level of tolerance. In a safe recovery, the parachutes must be deployed sequentially and in a manner that keeps the wings parallel to
the ground (flat descent) so as to avoid unduly large initial accelerations.
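The order of magnitude of that initial jolt can be recovered from the drag law alone: at the moment of deployment the vertical drag deceleration is a0 = β_z(z)v_z² − g, with β_z evaluated at the local air density. The sketch below assumes deployment near 8 km (10 km minus the 2.01 km of freefall) and the adiabatic density profile of this section; it reproduces the quoted value to within roughly 10%:

```python
# Rough check of the initial deceleration a0/g at parachute deployment.
# Assumed inputs: ground-level terminal velocity 13.7 m/s with 24 chutes,
# vertical speed 117 m/s at deployment, altitude ~8 km (all from the text).
g = 9.81
h_atm = 27.96e3          # adiabatic height of the atmosphere (m), Eq. (7.9.6)
gamma = 1.4

v_terminal = 13.7                    # m/s, with 24 G-11s at ground level
beta_z0 = g / v_terminal**2          # ground-level drag parameter (1/s)

z = 8.0e3                            # deployment altitude (m)
density_ratio = (1.0 - z / h_atm) ** (1.0 / (gamma - 1.0))

v = 117.0                            # vertical speed at deployment (m/s)
a0_over_g = beta_z0 * density_ratio * v**2 / g - 1.0
print(a0_over_g)                     # ~30, same order as the quoted 33.4
```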
An example of such a protocol, again obtained from numerical solution of Eqs.
(7.9.13), might unfold as follows. A B747, cruising at 250 m/s at 10 km, becomes
disabled; all engines fail or are shut off intentionally to effect the recovery. The
drogue is deployed while the plane drops 4 km, which reduces the horizontal velocity
to 12.6 m/s and increases the vertical velocity to 118 m/s in about 40.6 s, at which
time six vertical parachutes are deployed symmetrically in three groups of two along
the fuselage. These decelerate the aircraft vertically to 30.5 m/s and horizontally to
nearly 0 m/s, with peak deceleration a_max/g ≈ 10, which decreases rapidly in time; the
5-second time-averaged deceleration is a_av(5s) ≈ 1.6g. Then 18 more G-11 parachutes
are deployed symmetrically in three groups of six, the total of 24 decelerating the
aircraft (a_max ≈ 2.7g; a_av(5s) ≈ 0.3g) to a terminal velocity of 13.7 m/s, at which it falls the
remaining distance to ground. The plane strikes the ground flat, compressing the
cargo hold 2 m to produce an impact deceleration of less than 5g. Table 7.14
summarizes the kinematic details of the vertical descent from an initial altitude of
10 km both with and without use of a drogue. The two cases result in nearly the same
maximum decelerations and a difference in cumulative horizontal displacement of
less than 2 km.
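The impact figures follow from uniform-deceleration kinematics: a load striking the ground at speed v and stopping over a crush length L_c experiences a_c = v²/(2L_c). A quick check with the terminal velocity of 13.7 m/s (g = 9.81 m/s² assumed):

```python
g = 9.81
v = 13.7                       # terminal velocity at impact (m/s)

def crush_decel(v, Lc):
    """Uniform deceleration (in units of g) over crush length Lc (m)."""
    return v**2 / (2.0 * g * Lc)

print(round(crush_decel(v, 2.0), 1))   # 4.8 g for a 2 m crush length
print(round(crush_decel(v, 3.0), 1))   # 3.2 g for a 3 m crush length
```

These are the a_c/g values listed at the bottom of Table 7.14, and both are below the 5g quoted above.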
The preceding summary does not take account of the opening time of the parachute canopy, for which the mean delay Δt of a G-11 is about 5.3 s.47 In numerous
simulations, however, I found that taking account of the delay by including suitable
time-dependent opening functions in Eqs. (7.9.13) did not change perceptibly the
numerical results of Table 7.14, since the delay is very much less than the descent time
at each deployment stage.
It is worth noting that calculations were also performed for lower initial altitudes.
At lower altitudes the density of the air, and therefore the drag on the parachutes,
is greater, but the distance for recovery is of course shorter. Nevertheless, I found
that the recovery protocol is still sufficient. A recovery initiated at an altitude of only
4 km, with deployments at 2.5 km and 1.5 km, also led to landings with peak
deployment accelerations and impact decelerations below 10g.

47. W. R. Lewis, Minimum Airdrop Altitudes for Mass Parachute Delivery of Personnel and Material Using Existing Standard Parachute Equipment, ADED Report 642 (US Army Natick Laboratories, Natick, Massachusetts, April 1964) 11.

Table 7.14 Kinematic details of the vertical descent of a B747 from an initial altitude of 10 km, with and without a drogue

With drogue:
Action                 n_p   β_x (1/s)   β_y (1/s)   y_initial (km)   v_x (m/s)     v_y (m/s)     ΔT (s)   Δs_x (km)   Δs_y (km)   a_0/g
Freefall w. drogue      0    0.072       0.112       10               250 → 12.6    1.0 → 118     40.6     3.4         4           1.5
Deploy 6 (2, 2, 2)      6    0.072       0.371       6                12.6 → 0      118 → 30.5    88.0     0           3           10.0
Deploy 18 (6, 6, 6)    24    0.072       0.716       3                0 → 0         30.5 → 13.7   203.0    0           3           2.7
Accumulated intervals                                                                            331.6    3.4         10

Without drogue:
Freefall w/o drogue     0    0           0.112       10               250 → 26.9    1.0 → 123     38.7     5.1         4           1.0
Deploy 6 (2, 2, 2)      6    0           0.371       6                26.9 → 0      123 → 30.6    87.8     0.1         3           11.1
Deploy 18 (6, 6, 6)    24    0           0.716       3                0 → 0         30.5 → 13.7   203      0           3           2.7
Accumulated intervals                                                                            329.5    5.2         10

Impact: deceleration length L_c = 2 m gives a_c/g = 4.8; L_c = 3 m gives a_c/g = 3.2.
The computer simulation of numerous airplane recoveries bolstered my confidence in the idea that sequential, symmetric deployment of vertical and horizontal
decelerators with drag parameters comparable to those of available parachutes can
bring a large general aviation airplane down safely in flat descent without subjecting
passengers to accelerations exceeding ~10g. Current barriers to such recovery are not
aerodynamic, but at most material. The peak horizontal drag exerted by an air
stream at 10 km altitude with relative velocity of 250 m/s on a 7.62 m radius drogue
is ~3.7 MN, which amounts to a tension of 30.5 kN in each of the 120 suspension
lines of diameter about 3.175 mm (1/8 inch), thereby requiring a tensile strength of
about 3.9 GPa. Although a drogue may in fact be dispensable, peak drag on a G-11
vertical parachute corresponding to a relative vertical wind speed of ~123 m/s is
~3.5 MN, thereby requiring nearly the same tensile strength of 3.7 GPa. The tensile
strength of the currently used Type III nylon cord is about 309 MPa.48 (Pressure
constraints on the canopies are much less severe; peak drag overpressure on the
drogue was ~0.20 atm in the preceding analysis).
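The tensile-strength requirement can be reproduced directly: dividing the peak drag among the 120 suspension lines and dividing the per-line tension by the cross-sectional area of a 3.175 mm cord gives roughly 3.9 GPa, an order of magnitude above the ~309 MPa of the Type III nylon cord:

```python
import math

F_drag = 3.7e6                   # peak drag on the drogue (N), from the text
n_lines = 120                    # number of suspension lines
d = 3.175e-3                     # line diameter (m), 1/8 inch

tension = F_drag / n_lines       # tension per line (N)
area = math.pi * d**2 / 4.0      # cross-sectional area of one line (m^2)
stress = tension / area          # required tensile strength (Pa)

print(tension / 1e3)             # ~31 kN per line
print(stress / 1e9)              # ~3.9 GPa required
print(stress / 309e6)            # ~13x the strength of Type III nylon cord
```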
There exist other materials, however, whose tensile strength is already within the
range needed and which may serve as precursors to suitable replacements for Nylon,
such as
(a) Vectran (2.9–3.3 GPa) an aromatic polyester spun from a liquid-crystal
polymer,49
(b) Zylon (5.8 GPa) a thermoset liquid crystalline polybenzoxazole,50 and
(c) fiber glasses such as E-Glass (3.5 GPa) and S-Glass (4.7 GPa).
Potentially new materials of extraordinary tensile strength may eventually be fabricated from allotropes of carbon with cylindrical nanostructure (C-nanotubes), which
have the highest tensile strength of any known material (composites 2.3–14.2 GPa;
single fibers of 22.2 GPa).51 Successful implementation of the recovery protocols may
also call for distributing the reaction force of the suspension lines over space or time
to avoid structural damage at sites of attachment. This should be achievable by
appropriate design and canopy shapes, controlled timing of canopy opening, and use
of extensible materials.
48. Nylon Cord PIA-C-5040/MilC-5040 Technical Data Sheet, Mills Manufacturing Corporation, http://www.millsmanufacturing.com/files/Miltex-Tech%20Sheet.pdf/view
49. R. B. Fette and M. F. Sovinski, Vectran Fiber Time-Dependent Behavior and Additional Static Loading Properties (NASA/TM-2004-212773) 13.
50. Tensile strength http://en.wikipedia.org/wiki/Tensile_strength
51. F. Li, H. M. Cheng, S. Bai, G. Su, and M. S. Dresselhaus, Tensile strength of single-walled carbon nanotubes directly measured from their macroscopic ropes, Applied Physics Letters 77 (2000) 3161–3163.
Appendices

The range of a projectile launched with speed V at inclination Θ is

R = (V²/g) sin 2Θ.    (7.10.1)

To express p_R(r) directly in terms of r, set R = XY where X = V²/g and Y = sin(2Θ)
and apply the rules for transforming pdfs to obtain

p_R(r) = ∫ from 0 to ∞ of p_X(x) p_Y(r/x) dx/|x| = ∫ from 0 to ∞ of p_X(x) [p_Θ(½ sin⁻¹(r/x)) / (2x√(1 − (r/x)²))] dx,    (7.10.3)

where

p_Y(y) = p_Θ(½ sin⁻¹ y) / (2√(1 − y²)).    (7.10.4)
This procedure can be applied again if it is desired to express pX(x) in terms of the
density pV(v). Substitution of specific densities for speed and angle into (7.10.3) and
(7.10.4) will generally lead to complicated mathematical expressions and, depending
on the choice of parameters, may also raise subtle issues regarding the range of the
angle variable.
It is not necessary, however, to use (7.10.3) to calculate the variance of the range

σ_R² = g⁻²[⟨V⁴⟩⟨sin² 2Θ⟩ − ⟨V²⟩²⟨sin 2Θ⟩²].    (7.10.5)

Let us suppose as an illustration that the speed and angle are distributed normally
according to

p_V(v) = N(V0, σ_V²)        p_Θ(θ) = N(θ0, σ_θ²)    (7.10.6)

where σ_V/V0 and σ_θ/θ0 are both small enough that we need not be concerned with the
occurrence of nonphysical negative values of the variables. Then, as has been shown
previously, the second and fourth moments of speed lead to the relations

⟨V²⟩ = V0²[1 + (σ_V/V0)²]    (7.10.7)

⟨V⁴⟩ = V0⁴[1 + 6(σ_V/V0)² + 3(σ_V/V0)⁴].    (7.10.8)
To evaluate the Gaussian integral of a sine or cosine function, make use of the
characteristic function h_Θ(t) of the random variable Θ, defined by the expectation
⟨e^(iΘt)⟩. With a range of integration (−∞, ∞), the Gaussian integral yields the
closed-form expression

⟨e^(iΘt)⟩ = (2πσ_θ²)^(−1/2) ∫ e^(iθt) e^(−(θ−θ0)²/2σ_θ²) dθ = e^(iθ0 t − σ_θ² t²/2)    (7.10.9)

from which follow, by means of the identities

sin nθ = (e^(inθ) − e^(−inθ))/2i        cos nθ = (e^(inθ) + e^(−inθ))/2    (7.10.10)

the expectation values

⟨cos nΘ⟩ = e^(−n²σ_θ²/2) cos nθ0    (7.10.11)

⟨sin nΘ⟩ = e^(−n²σ_θ²/2) sin nθ0    (7.10.12)

⟨sin² nΘ⟩ = ½[1 − e^(−2n²σ_θ²) cos 2nθ0].    (7.10.13)
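The closed-form expectation values (7.10.11)–(7.10.13) are easy to confirm by direct numerical quadrature against the Gaussian density (a consistency check; θ0 and σ_θ below are arbitrary):

```python
import math

theta0, sigma = 0.6, 0.15   # arbitrary mean and spread of the angle
n = 2                       # harmonic order

def gauss_expect(func, steps=20000):
    """Trapezoidal quadrature of <func(Theta)> over N(theta0, sigma^2)."""
    lo, hi = -8.0, 8.0      # integration range in standard units
    du = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * du
        w = 0.5 if i in (0, steps) else 1.0
        total += w * func(theta0 + sigma * u) * math.exp(-u * u / 2.0)
    return total * du / math.sqrt(2.0 * math.pi)

cos_num = gauss_expect(lambda t: math.cos(n * t))
sin_num = gauss_expect(lambda t: math.sin(n * t))
sin2_num = gauss_expect(lambda t: math.sin(n * t) ** 2)

cos_an = math.exp(-n**2 * sigma**2 / 2) * math.cos(n * theta0)   # (7.10.11)
sin_an = math.exp(-n**2 * sigma**2 / 2) * math.sin(n * theta0)   # (7.10.12)
sin2_an = 0.5 * (1 - math.exp(-2 * n**2 * sigma**2)
                 * math.cos(2 * n * theta0))                     # (7.10.13)
```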
Equations (7.10.5) and (7.10.11)–(7.10.13) lead to the following exact expression for
the variance of the range

σ_R² = (V0⁴/g²){[1 + 6(σ_V/V0)² + 3(σ_V/V0)⁴] ½[1 − e^(−8σ_θ²) cos 4θ0] − [1 + (σ_V/V0)²]² e^(−4σ_θ²) sin² 2θ0}.    (7.10.14)

Upon retaining terms to first order in the variances σ_θ² and σ_V², Eq. (7.10.14)
reduces to

σ_R² = (4V0²/g²)[σ_V² sin² 2θ0 + V0² σ_θ² cos² 2θ0],    (7.10.15)

which could also have been obtained more simply by taking the differential of
(7.10.1) and then arbitrarily combining the two terms in quadrature, a procedure
that is not rigorous, but frequently justified by heuristic arguments. The derivation
given here, however, leads directly to the correct combination of component variances. Note, too, that σ_R², in the absence of the corresponding distribution function,
gives no information about confidence limits, i.e. the probability that a measurement
of R falls within some specified interval (e.g. ±σ_R) about the true mean (i.e. population mean). The Central Limit Theorem can be used to estimate the sample mean R̄ of
a large number (in principle, infinite number) of measurements, and one could
employ the Weak Law of Large Numbers, as we have done in Section 7.3, to make
such an estimate, but it will usually lead to a broader inequality than necessary, as
was shown in regard to skewness.
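The agreement between the exact variance (7.10.14) and its first-order reduction (7.10.15) is easily checked numerically; the parameter values below are illustrative (small relative spreads, as assumed in the derivation):

```python
import math

V0, sV = 50.0, 2.0        # mean speed and spread (m/s)
th0, sth = 0.6, 0.02      # mean launch angle and spread (rad)
g = 9.81

r2 = (sV / V0) ** 2
# Exact variance of the range, Eq. (7.10.14)
exact = (V0**4 / g**2) * (
    (1 + 6*r2 + 3*r2**2) * 0.5 * (1 - math.exp(-8*sth**2) * math.cos(4*th0))
    - (1 + r2)**2 * math.exp(-4*sth**2) * math.sin(2*th0)**2
)
# First-order approximation, Eq. (7.10.15)
approx = (4 * V0**2 / g**2) * (
    sV**2 * math.sin(2*th0)**2 + V0**2 * sth**2 * math.cos(2*th0)**2
)
print(exact, approx)   # agree to well under a percent for these spreads
```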
The skewness of the sample mean Z̄ involves the third central moment

⟨(Z̄ − ⟨Z⟩)³⟩ = ∫ (z̄ − ⟨Z⟩)³ p_Z̄(z̄) dz̄    (7.11.1)

of the statistic

Z̄ = (1/n) Σ from i=1 to n of Z_i.    (7.11.2)

Expansion of the cube leads to the triple sum

⟨(Σ from i=1 to n of Z_i)³⟩ = Σ over i,j,k of ⟨Z_i Z_j Z_k⟩,    (7.11.3)

which separates, for independent variates, into

Σ over i,j,k of ⟨Z_i Z_j Z_k⟩ = Σ over i of ⟨Z_i³⟩ + 3 Σ over i≠j of ⟨Z_i²⟩⟨Z_j⟩ + Σ over i≠j≠k of ⟨Z_i⟩⟨Z_j⟩⟨Z_k⟩,    (7.11.4)

which is reducible to

Σ over i,j,k of ⟨Z_i Z_j Z_k⟩ = n⟨Z³⟩ + 3n(n − 1)⟨Z²⟩⟨Z⟩ + n(n − 1)(n − 2)⟨Z⟩³    (7.11.5)

when the variates are identically distributed. Applied to the centered sum, the result is

⟨(Σ from i=1 to n of Z_i − n⟨Z⟩)³⟩ = n⟨(Z − ⟨Z⟩)³⟩,    (7.11.6)

whence the skewness of the sample mean decreases with sample size as n^(−1/2).
8
The guesses of groups

Modern statisticians are familiar with the notion that any finite
body of data contains only a limited amount of information on
any point under examination; that this limit is set by the nature of
the data themselves, and cannot be increased by any amount of
ingenuity expended in their statistical examination: that the statistician's task, in fact, is limited to the extraction of the whole of the
available information on any particular issue.
R. A. Fisher1
Richard Feynman, as the reader probably knows, was one of the most colorful
American physicists of the twentieth century. Creator of his own version of quantum
mechanics based on path integrals, and seminal contributor to the formulation of
quantum electrodynamics, Feynman was also an entertaining raconteur of his life's
experiences. In one of his narratives4 describing the tribulations of serving on a
California state commission charged with the selection of high school mathematics
textbooks, he related a brief fable about the length of the Emperor of China's nose.
So exalted was the Emperor of China that no one was permitted to see him, and the
question in people's minds was: how long is the Emperor's nose? To find out,
someone (according to the narrative) asked people all over China what they thought
was the length and then averaged all the results. Evidently, this average was considered to be accurate because the sample was large and representative.
Feynman's message, however, which would seem the embodiment of common
sense, was that averaging a lot of uninformed guesses does not provide reliable
information. Yet, in a nutshell, this was exactly what the book I read appeared to
advocate as the most reliable way to acquire information.
The book, a New York Times Business Bestseller titled The Wisdom of Crowds (to
be abbreviated in this essay as WOC), was not concerned with finding the length of
the Emperor's nose. It began instead with an anecdote relating to the weight of a
dressed ox, which the visitors to the annual West of England Fat Stock and Poultry
Exhibition could bet on for a sixpence ticket. The 1906 competition is noteworthy in
that it was attended by the English polymath and statistical innovator, Francis
Galton, well known for his anthropometric studies of human physical and mental
characteristics and their correlation with good breeding. Galton's experiments did
not give him a high opinion of the average person, whose "stupidity and wrongheadedness . . . [was] . . . so great as to be scarcely credible". Not to miss an opportunity to reconfirm his opinion, Galton borrowed the tickets after the awarding of
prizes and made a simple statistical analysis to determine the shape of the distribution (a bell-shaped curve? . . . we are not told) and the mean value of the participants'
guesses. According to WOC, "The crowd had guessed that the ox, after it had been
slaughtered and dressed, would weigh 1197 pounds. After it had been slaughtered
and dressed, the ox weighed 1198 pounds. In other words, the crowd's judgment was
essentially perfect."
What is one to make of that agreement: that the story was apocryphal, an
exaggeration, a coincidence? Curious about the authenticity of the event,
I researched Galton's published articles, and indeed I found that he described his
experiment at the Exhibition in a short paper published in Nature in 1907 under the
title "Vox Populi", i.e. the voice of the people.5 Galton began his account with the
words

4. R. P. Feynman, Surely You're Joking, Mr. Feynman! (W. W. Norton, New York, 1985) 295–296.
5. F. Galton, Vox Populi, Nature 75, No. 1949 (March 7, 1907) 450–451.
In these democratic days, any investigation into the trustworthiness and peculiarities of
popular judgments is of interest.
In less descriptive and more modern terminology, Galton had tallied the guesses and
found the median, which turned out to be 1207 lbs, a value too high by a mere 0.8%.
Surprised and impressed, he concluded
This result is, I think, more creditable to the trustworthiness of a democratic judgment than
might have been expected.
Having accepted, therefore, the WOC account of Galton at the fair to be a more
or less accurate description of an actual incident, it seemed to me not unreasonable
at first to believe that few people at the Exhibition were likely to have had any
experience in slaughtering and dressing oxen. Many were probably tradesmen or
professionals from town (carpenters, blacksmiths, coopers, lawyers, bankers, physicians, and the like) or maybe vegetable or poultry farmers. Thus, one might have
expected, as perhaps Galton did, the group average to deviate widely from the
true weight. In the words of WOC, ". . . mix a few very smart people with some
mediocre people and a lot of dumb people, and it seems likely you'd end up with a
dumb answer."
On further reflection, however, the reasonableness of the assumption vanished,
replaced by the question: why should it be assumed that few of the ticket purchasers
knew anything about the dressed weight of an ox? This was, after all, an exhibition
of fat stock, and it took place annually, as presumably did the contest. So maybe a
substantial number of visitors were well-informed about the size, shape, and weight
of oxen. Maybe they frequented regional exhibitions or farms or slaughter houses
or butcher shops. Maybe they had participated before in this contest. This was rural
England in 1906, not urban England in the twenty-first century. People were accustomed either to growing their own food or to purchasing it, one step removed, at
the shops, markets, and farms where the food was produced and processed. I was
not around then, but my great grandmother was, and she did not purchase her food
at a local West of England supermarket where hundreds of small cuts of meat lay
neatly packaged on Styrofoam slabs wrapped with cellophane, having traveled by
refrigerated railcar or airplane hundreds or thousands of miles from wherever it
was that the animals were raised. So maybe the 1906 Exhibition crowd was not
dumb.
I returned to researching Galton's papers to see whether he had further thoughts
on the matter. He did, but even more interestingly so did one of his contemporaries
who wrote his objections to the editor of Nature.6 The perceptive gentleman was a
Mr. F. Perry-Coste of Cornwall, who, nearly one hundred years before I put pen to
paper (or fingers to keyboard) on the subject, had apparently had the identical
thought:

. . . Mr Galton says that "the average competitor was probably as well fitted for making a just
estimate of the dressed weight of the ox as an average voter is of judging the merits of most
political issues on which he votes". . . . I do not think that Mr. Galton at all realizes how large a
percentage of the voters – the great majority, I should suspect – are butchers, farmers, or men
otherwise occupied with cattle. To these men the ability to estimate the meat-equivalent weight
of a living animal is an essential part of their business . . . Now the point of all this is that, in so
far as this state of things prevails, we have to deal with, not a vox populi, but a vox expertorum.
[The] majority of such competitors know far more of their business, are far better trained, and
are better fitted to form a judgment, than are the majority of voters of any party, and of either
the uneducated or the so-called educated classes. I heartily wish that the case were otherwise.

6. The Ballot Box, Nature 75, No. 1952 (March 28, 1907) 509. [Letters to the Editor from Galton and others.]
7. The website has since been changed to http://www.despair.com/
8. C. Hoffman, The mad genius from the bottom of the sea, Wired Magazine, http://www.wired.com/wired/archive/13.06/craven_pr.html
9. http://en.wikipedia.org/wiki/John_Pina_Craven
W. Allison, A. Kumar, and C. Pittman, Foam insulation has history of damaging shuttle, St. Petersburg Times (4 February 2003).
From a publisher's standpoint, however, a book titled The Information of Groups would probably not sell as well except perhaps to physicists and mathematicians who mistook the meaning of the title.
The WOC explanation is relatively simple. Each person's guess contains information and error. In the average of a large number of diverse, independent estimates
or predictions, the errors effectively cancel and, according to WOC, "you're left with
information". The information is useful because we are all products of evolution and
therefore equipped to make sense of the world. In short, WOC tells us that the
answer rests on a mathematical truism, but none is explicitly indicated. Since the
author of WOC was not a mathematician, I could think of only two such truisms
that might have come to his attention and a third one of which he was unlikely to
be aware.
The first is the law of large numbers, in which the mean of a sample of
independent observations from a given population approaches the population mean
as the sample size increases. Consider, for example, an election poll. In a population
of 1 million people, a sample of 10 000 will give a more accurate representation of
opinion than a sample of 100. Indeed, the spread about the mean (the standard
deviation of the mean σ_m) varies inversely with the square root of the sample size, so
that the result of the larger sample would be more sharply defined (and presumably
more reliable) if the samples
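The 1/√n shrinkage of the spread of the sample mean is easy to demonstrate by simulation. The sketch below uses a synthetic 51/49 "poll" (an illustrative construction, not data from the text) with a fixed seed for reproducibility:

```python
import random
import statistics

random.seed(1)

def sample_mean(n, p=0.51):
    """Mean of n simulated yes/no poll responses, Pr(yes) = p."""
    return sum(1 if random.random() < p else 0 for _ in range(n)) / n

# Spread of the sample mean over 300 repeated polls of each size
sd_100 = statistics.stdev([sample_mean(100) for _ in range(300)])
sd_10000 = statistics.stdev([sample_mean(10000) for _ in range(300)])

print(sd_100, sd_10000)   # the second is roughly 10 times smaller
```

The ratio of the two spreads is close to √(10000/100) = 10, as the law of large numbers predicts.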
Marquis de Condorcet, Essai sur l'application de l'analyse à la probabilité des décisions à la pluralité des voix (L'Imprimerie Royale, Paris, 1785), reproduced online by the Bibliothèque Nationale de France: http://gallica.bnf.fr/ark:/12148/bpt6k417181
The probability that at least n_m of the n group members decide correctly, each
independently with probability p, can be written as an incomplete beta integral

Pr(S_n ≥ n_m | n) = n_m C(n, n_m) ∫ from 0 to p of x^(n_m − 1)(1 − x)^(n − n_m) dx    (8.3.4)

= [B(n_m, n − n_m + 1)]⁻¹ ∫ from 0 to p of x^(n_m − 1)(1 − x)^(n − n_m) dx.    (8.3.5)

It has already been established in Chapter 1 that the cumulative distribution function
of the kth order statistic is

F_Y[k](y) = Pr(Y[k] ≤ y) = Σ from j=k to n of C(n, j) F(y)^j [1 − F(y)]^(n−j),    (8.3.6)

whereupon the probability that at least k of the set {y_i, i = 1 . . . n} is less than or equal
to some number p, where 1 ≥ p ≥ 0, is given by

Pr(Y[k] ≤ p) = Σ from j=k to n of C(n, j) p^j (1 − p)^(n−j),    (8.3.7)

which is precisely the sum that appears in (8.3.2) if one sets k = n_m.
Second part There is, however, another way to arrive at the probability Pr(Y[k] ≤ p)
by clever use of the multinomial distribution as employed in Section 1.31 of
Chapter 1. The probability that the order statistic y[k] lies between x and x + dx,
where x ≤ p, is the probability that
(a) k − 1 elements of the set {y_i, i = 1 . . . n} fall between 0 and x, and
(b) 1 element falls in the range (x, x + dx), and
(c) n − k elements exceed x + dx.
Because all the elements were drawn independently from a uniform distribution, this
probability is proportional to the product x^(k−1)(dx)(1 − x − dx)^(n−k). In the limit of an
infinitesimal interval dx, the sum (i.e. integral) over all values of x ≤ p yields the
probability

Pr(y[k] ≤ p) ∝ ∫ from 0 to p of x^(k−1)(1 − x)^(n−k) dx.    (8.3.8)
The constant of proportionality C can be obtained in either of two ways. The first
way is by a combinatorial argument: the total number of ways to partition n
distinguishable elements into three categories respectively containing k − 1, 1, and
n − k elements is given by the multinomial coefficient

C = (n; k − 1, 1, n − k) = n!/[(k − 1)! 1! (n − k)!] = n(n − 1)!/[(k − 1)!(n − k)!] = n C(n − 1, k − 1).    (8.3.9)

The second way is to normalize the integral in (8.3.8), which is known as an incomplete beta function. The normalization constant will then be the reciprocal of a beta
function

C = [∫ from 0 to 1 of x^(k−1)(1 − x)^(n−k) dx]⁻¹ = 1/B(k, n − k + 1) = Γ(n + 1)/[Γ(k)Γ(n − k + 1)] = n!/[(k − 1)!(n − k)!] = n C(n − 1, k − 1).    (8.3.10)
Consequently,

Pr(S_n ≥ n_m | n) = n C(n − 1, n_m − 1) ∫ from 0 to p of x^(n_m − 1)(1 − x)^(n − n_m) dx    (8.3.11)

is the probability that a majority decision of the group will be correct if p is the
probability that an individual in the group votes correctly.
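The equivalence of the binomial tail sum (8.3.7) and the beta-integral form (8.3.11) is easily verified numerically; the values of n, n_m, and p below are illustrative, and Simpson's rule serves for the integral:

```python
import math

n, nm, p = 11, 6, 0.51    # 11 jurors, majority 6, individual accuracy 0.51

# Binomial tail: probability that at least nm of n vote correctly, Eq. (8.3.7)
tail = sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
           for j in range(nm, n + 1))

# Beta-integral form, Eq. (8.3.11), via Simpson's rule on [0, p]
def integrand(x):
    return x**(nm - 1) * (1 - x)**(n - nm)

steps = 2000                       # even number of subintervals
h = p / steps
simpson = integrand(0.0) + integrand(p)
for i in range(1, steps):
    simpson += (4 if i % 2 else 2) * integrand(i * h)
simpson *= h / 3.0

beta_form = n * math.comb(n - 1, nm - 1) * simpson
print(tail, beta_form)    # identical to within quadrature error
```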
For the sake of illustration, consider the case of an odd number of jurors, i.e. n =
2m + 1, where n_m = m + 1 is the median. Eq. (8.3.11) then simplifies to

Pr(S_n ≥ m + 1) = (2m + 1) C(2m, m) ∫ from 0 to p of [x(1 − x)]^m dx,    (8.3.12)

where m = (n − 1)/2. In the limit of large n (or m), the integrand in (8.3.12) becomes a
sharply peaked function of x with a width inversely proportional to √n. This, in a
nutshell, is the reason why the probability (8.3.2) depends sensitively on whether
p exceeds ½ or not. To see in detail how this occurs, let us evaluate the integral
(which cannot be reduced to an exact closed-form expression) by the method of steepest
descent. This entails
expansion of the logarithm of the integrand about its maximum at x = ½,

[x(1 − x)]^m = 4^(−m) exp{−m[4(x − ½)² + 8(x − ½)⁴ + (64/3)(x − ½)⁶ + 64(x − ½)⁸ + (1024/5)(x − ½)¹⁰ + · · ·]},    (8.3.13)

together with the Stirling series for the factorial

n! ≅ √(2πn)(n/e)^n [1 + 1/(12n) + 1/(288n²) − 139/(51840n³) − 571/(2488320n⁴)].    (8.3.16)

The expression (8.3.16) is surprisingly accurate even for values of m as low as 1.¹³
It then follows that
$$(2m+1)\binom{2m}{m} = \frac{(2m+1)\,(2m)!}{(m!)^2} \approx \frac{(2m+1)\,\sqrt{4\pi m}\,(2m)^{2m}e^{-2m}}{\left(\sqrt{2\pi m}\;m^m e^{-m}\right)^2} = \frac{(2m+1)\,2^{2m}}{\sqrt{\pi m}}. \qquad (8.3.17)$$
Combining (8.3.17) and (8.3.14) leads to a Gaussian cumulative distribution function

$$\Pr\{S_n \ge m+1\} \cong \frac{1}{\sqrt{2\pi}}\int_{-\sqrt{2m}}^{\sqrt{2m}\,(2p-1)} e^{-z^2/2}\,dz \;\xrightarrow{\;\sqrt{2m}\,\to\,\sqrt{n}\;}\; \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\sqrt{n}\,(2p-1)} e^{-z^2/2}\,dz. \qquad (8.3.18)$$
¹³ The error E(n) = n! − fac(n), where fac(n) is the Stirling series to the order shown above, is on the order of $10^{-4}$, $10^{-5}$,
$10^{-6}$ respectively for n = 1, 2, 3. As n increases, the absolute error eventually becomes much larger than 1, but the
relative error RE(n) = [n! − fac(n)]/n! decreases rapidly. Thus, for n = 1, 10, 100, RE(n) ≈ $10^{-4}$, $10^{-8}$, $10^{-12}$.
In the limit of infinitely large n, the upper limit $\sqrt{n}(2p-1)$ in (8.3.18) goes to $+\infty$ or $-\infty$ according to the sign of $p - \tfrac{1}{2}$, whereupon

$$\lim_{n\to\infty}\Pr\left\{S_n > \frac{n}{2}\right\} = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\displaystyle\int_{-\infty}^{\infty} e^{-z^2/2}\,dz \to 1 & \left(p-\tfrac{1}{2}\right) > 0 \\[6pt] \dfrac{1}{\sqrt{2\pi}}\displaystyle\int_{-\infty}^{-\infty} e^{-z^2/2}\,dz \to 0 & \left(p-\tfrac{1}{2}\right) < 0. \end{cases} \qquad (8.3.19)$$
$$S_n = \sum_{i=1}^{n} X_i = \mathrm{Bin}(n,p) \;\xrightarrow{\;n\,\gg\,1\;}\; N\big(np,\;np(1-p)\big) \qquad (8.3.20)$$

with mean np and standard deviation $\sqrt{np(1-p)}$. This can be readily proven by
use of the moment-generating function, or by invoking the Central Limit
Theorem.
Suppose the group size to be 10 000 and each individual in the group to be only
51% likely to answer a particular question correctly. The mean number of correct
voters is 5100 with a standard deviation σ = 50. Thus, the number of group members
voting correctly will fall within a 2σ range 5100 ± 100 = (5000, 5200) with a probability of 95%. In other words, the majority decision will be correct about 95% of
the time.
Suppose, however, the group size to be 1 000 000 and each individual, as before,
has a 51% chance of being correct. The mean number of correct voters is then
510 000 with a standard deviation σ = 500. Now the number of group members voting
correctly will fall within a 20σ range 510 000 ± 10 000 = (500 000, 520 000) with a
probability very close to 100%. (The exact value is 1 − 5.5 × 10⁻⁸⁹.) Thus, the
majority decision is likely to be correct 100% of the time.
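These figures follow directly from the normal approximation (8.3.20); a minimal sketch in plain Python is below. Note that the 2σ interval quoted above is a conservative two-sided bound; the one-sided probability that the majority votes correctly comes out slightly higher (about 98% for n = 10 000).

```python
import math

def majority_by_clt(n, p):
    """Normal approximation (8.3.20): S_n ~ N(np, np(1-p)).
    Returns (mean, sigma, Pr{S_n > n/2})."""
    mean = n * p
    sigma = math.sqrt(n * p * (1 - p))
    z = (n / 2 - mean) / sigma
    pr = 0.5 * (1 - math.erf(z / math.sqrt(2)))   # 1 - Phi(z)
    return mean, sigma, pr

print(majority_by_clt(10_000, 0.51))     # mean 5100, sigma ≈ 50, Pr ≈ 0.977
print(majority_by_clt(1_000_000, 0.51))  # mean 510000, sigma 500, Pr ≈ 1
```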
In short, the larger the sample size n, the wider is the range (in units of σ) of
voters beyond the sample median who give correct answers (for p > 0.5). How
large, in fact, must a group be (considering for illustration an odd number of
members) in order that the majority decision be correct 99% of the time if
individual members have a probability of being correct only 51% of the time?
Comparison of the Stirling–Gauss approximation (8.3.18) with a more exact
higher-order calculation based on the Stirling series (8.3.16) and expansion
(8.3.13) of the incomplete beta integral to tenth order leads to the results shown
in Table 8.1.
Table 8.1

Group size n    High-order calculation    Stirling–Gauss approximation
13 525          0.989 995 99              0.989 986 85
13 527          0.990 000 57              0.989 991 44
13 529          0.990 005 15              0.989 996 02
13 531          0.990 009 73              0.990 000 06
Under the assumptions of the jury theorem a minimum group size of 13 527
members would be needed to assure a 99% chance of the majority decision being
correct. (The Stirling–Gauss approximation yielded a group larger by four members.)
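The threshold in Table 8.1 can be checked directly, since the incomplete beta integral in (8.3.11) equals the exact binomial tail probability. A sketch in plain Python, summing the binomial tail in log space (via lgamma) to avoid overflow for n in the tens of thousands:

```python
import math

def majority_prob(n, p):
    """Exact Pr{S_n >= (n+1)/2} for odd n, with S_n ~ Bin(n, p)."""
    total = 0.0
    for k in range((n + 1) // 2, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))
        total += math.exp(log_term)
    return total

# 99% reliability is first reached between n = 13 525 and n = 13 527
print(majority_prob(13_525, 0.51))   # just under 0.99
print(majority_prob(13_527, 0.51))   # just over 0.99
```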
Interesting as the jury theorem may be as a mathematical exercise, the key
question remains as to whether this theorem can serve as the truism upon which
to assert the inevitability of the WOC hypothesis. The answer in my opinion is a
negative one. Real-world decision making rarely, if ever, conforms to the conditions of the theorem. Most decisions are not of a binary nature, but require
qualitative judgments to be made from among many choices or numerical estimates that can fall within a wide range of real numbers. Moreover, it is entirely
unrealistic to expect all the members of a group to be equally informed (or
uninformed) so that a single probability p represents their state of knowledge. As
mentioned previously, extensions of Condorcet's theorem have been published, but
I have seen none that would serve to justify the broad claims of WOC. Perhaps the
author of WOC had some other mathematical truism in mind, but, if so, I do not
know what it was.
8.4 Epimenides' paradox of experts
There is a glaring, if not humorous, logical inconsistency in the indiscriminate
debunking of experts and expertise that one finds in WOC. For example, one reads
that
. . . expertise and accuracy are unrelated.
. . . experts' decisions are seriously flawed.
. . . experts' judgments [are] neither consistent with the judgments of other experts
in the field nor internally consistent.
. . . experts are surprisingly bad at . . . calibrating their judgments.
. . . experts . . . routinely overestimate the likelihood that they're right.
From whom, one may ask, has the author acquired these professional insights on
experts?
WOC acknowledges, among others, James Shanteau ". . . one of the country's
leading thinkers on the nature of expertise . . ." That would make Shanteau an
expert, right? And if he is an expert, then would not the above objections apply to
him too, which would mean that his critique of experts is not reliable, which would
then imply that the opinions of experts can be trusted? This is one of those self-referential paradoxes like the well-known logical paradox enunciated by Epimenides
the Cretan that "All Cretans are liars." If you ask a Cretan whether he is lying and he
replies truthfully in the affirmative, then he was actually not lying, which meant that
he did not reply truthfully. And so on.
In any event, the fact that the author's sources, by virtue of their expert status,
may be questionable does not necessarily mean that they were wrong, as evidenced by
a few past examples of expert opinion:
"Who the hell wants to hear actors talk?" (Harry Warner, 1927)
"I think there is a world market for maybe 5 computers." (Thomas Watson, 1943)
"Computers in the future may weigh no more than 1.5 tons." (Popular
Mechanics, 1949)
and one of my favorites:
"640 K ought to be enough for anybody." (Bill Gates, 1981)
(I found these in WOC and in lists of quotes on the internet and cannot vouch for
their authenticity, but they do sound good.)
Personally, my own experiences with experts, especially those providing advice
on financial, legal, or educational matters, were much in accord with the critical
remarks above. It is, in fact, one of the character traits of good physicists to be
skeptical of authority and to try to find things out for themselves. And so I initiated a
set of experiments to investigate the Guesses Of Groups or GOG.
Phase 3 involved making predictions. Students were asked first to predict something connected with an activity (taking tests) with which they, as a group, were
thoroughly familiar: the class mean score on a quiz to be taken at the end of the
week (Trial 8). At another time they were asked to predict something in regard
to an activity (the stock market) with which I assumed few, if any, were
familiar: the change in the Dow Jones Industrial Average (DJIA) by the close
of day at the end of the week (Trial 9).
The preceding exercises took place one trial per class meeting, usually during the first
or last ten minutes of the period. Each student received from me a sheet of paper
stating what was to be estimated or predicted and asking for a numerical response as
well as a qualitative estimate of the student's confidence in his or her answer (None,
Low, Medium, High). The purpose of the latter was to see whether there was any
correlation between accuracy and confidence. (There wasn't.) Since one of the
conditions alleged to be necessary for the validity of the WOC hypothesis is
the independence of individual guesses, students were instructed not to discuss the
exercise with their neighbors or to glance at their neighbors' answers.
To avoid purely random guessing, in which case the situation would degenerate to
the one in Feynman's fable about the length of the Emperor of China's nose, it was
necessary that students gain something personally from answering accurately. In this
regard, my policy was to offer whoever came closest to the exact answer some extra-credit points toward their cumulative course score. The amount offered was quite
modest, but, if you have ever taught at a college or university, you probably have a
good idea of what students would do for almost any amount of extra credit. Suffice it
to say that my students were satisfied with the offer. It should also be noted that
students were not required to put their names on the response sheets they turned in
(in case some would have felt embarrassed at submitting a wildly incorrect estimate),
but obviously they could earn extra credit only if I knew to whom to award the
points. Every participant revealed his or her identity.
Phase 4, the final phase, entailed an exercise of a kind different from the preceding
in which the participants merely had a few moments to view, hold, or think
about something before writing down their responses. Bearing in mind
Craven's search for the lost Scorpion, I wanted to see for myself whether
laymen working in groups or experts working individually were more
successful at a problem-solving activity. Having defined (privately14) a set of
criteria on the basis of which to identify student experts in the class, I divided
the class into five groups of six students each. In four of the groups comprising the
non-experts, the students within each group were to solve the problem as a team;
14
The criteria were not announced to the class since that could have seriously affected their attitude toward the exercise
and their performance.
in the fifth group, however, the six chosen experts were to sit far apart and work
on the problem individually. The five groups were set to work simultaneously in
different classrooms for the same amount of time (15 minutes).
Here is the first problem they were given (Trial 10):
Two former high school friends A and B, who had not seen one another for many years, met by
chance.
A:
B:
A:
B:
A:
B:
I had initially intended for the foregoing problem to be the final trial, but the idea of
actually having the students search for something missing, as Craven did, strongly
appealed to me. Since (to my knowledge) the US military services had not lost
another boat or bomb in the intervening years since Craven was assigned to look
for such things, I settled on a simpler problem of lost treasure (Trial 11). Here it is.
During the early nineteenth century a wealthy professor buried his fortune on the Trinity College
campus, and you have come into possession of the following instructions found in a book once
belonging to his personal library.
1. Count thy steps from the door of the College Alehouse to the door of the Metaphysics Building,
turn left by a right angle, take the same number of steps and place a spike in the ground.
2. Count thy steps from the door of the College Alehouse to the door of the Alchemy Building, turn
right by a right angle, take the same number of steps and place a second spike in the ground.
3. At the point halfway between the two spikes dig for treasure.
Now the old Metaphysics Building still exists (it became McCook and presently houses the Physics,
Philosophy, and Religion departments), and the old Alchemy Building still exists (it became
Clement and houses the Chemistry Department and College Cinema), but the College Alehouse
has long since been demolished and no one alive today remembers where it used to be, although
there is much speculation.
Draw a map showing the Metaphysics and Alchemy Buildings and place a small cross (×) precisely
where on the map you believe the treasure is located. Explain your reasoning on the back of this
page.
Table 8.2

Phase                     No.  Description                               Correct value  Closest value(s)  Group mean  Group median
I. Estimation (nearby)    1    Number of shot in a jar (viewed only)     229            230               178.1       152.5
                          2    Number of shot in a jar (viewed & held)   229            230               199         152
                          3    Mass of shot in a jar (viewed & held)     1016.8 g       1000.0 g          2012.1 g    1818.2 g
II. Estimation (distant)  4    Height of ceiling                         …              …                 …           …
                          5    Periphery of campus                       …              …                 …           …
                          6    Number of banks                           39             …                 54.9        34
                          7    Number of restaurants                     311            …                 127.3       102–103
III. Prediction           8    Mean quiz score                           78.2%          …                 78.9%       78.3%
                          9    Change in DJIA                            26.4 pts       …                 14.9 pts    5 pts
IV. Deduction             10   Age of children                           …              …                 …           …
                          11   Treasure hunt                             …              …                 …           …
5 pts
I should explain that the two contemporary buildings mentioned above really do exist
and for reasons of political or financial expediency really do house the irrationally
eclectic combination of departments. Thus the term Metaphysics Building by
which I have long referred to the edifice within whose basement my research laboratory is located is an apt appellation. Furthermore, the conditions of this trial differed
from those of the preceding trial in that I did not attempt to identify experts, but
decided to let the students themselves partition their number into groups and individuals. This turned out to be a mistake (perhaps) in that no students organized
themselves into groups, and the mathematically most adept students in the class
submitted their results as individuals. Nevertheless, this outcome was itself
informative.
A summary of the 11 trials and their outcomes is displayed in Table 8.2. I leave the
solutions to the two logic problems to appendices.
To interpret the significance of the outcomes, I would like to reiterate, even at
the risk of redundancy, one of the key components of the WOC hypothesis.
Among the assertions defended in The Wisdom of Crowds was the statement by
economist Kenneth Arrow that average opinions of groups are frequently more
accurate than most individuals in the group. Since all the participating students in
Table 8.3

Test                   Fractional error of the mean  Fractional error of the median  Superiority of group mean?  Superiority of group median?
Estimate shot          22.3% / 13.1%                 33.6%                           No                          No
Estimate mass          97.9%                         78.8%                           No                          No
Estimate height        9.4%                          6.3%                            No                          No
Estimate periphery     14.4%                         2.8%                            No                          OK
Estimate banks         40.8%                         12.8%                           No                          OK
Estimate restaurants   59.1%                         67.0%                           No                          No
Predict quiz average   0.90%                         0.1%                            OK                          OK
Predict DJIA change    43.56%                        81.1%                           No                          No
Solve logic problem    50.00%                        n/a                             50–50                       n/a
single statistic like the mean or median, but the actual sample distributions, two of
which are shown as histograms in the upper and lower panels of Figure 8.1. Recall
that a histogram is an approximate graphical representation of the probability
distribution of the parent population from which a sample is taken. One divides
the range of sample outcomes into non-overlapping classes or bins into which the
outcomes are distributed. The histogram is then a plot of the frequency (i.e. number)
of outcomes in each bin as a function of the bin value.
Underlying the WOC hypothesis is an implicit assumption that group responses are
distributed more or less normally, i.e. in a bell-shaped curve with the preponderance of
samples clustered symmetrically about the mean and decreasing in frequency fairly rapidly
in the wings. In keeping with Galton's democratic principle of "one vote, one value," WOC
identified the collective judgment (wisdom) of a group with the sample mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad (8.6.1)$$
where {x_i, i = 1…n} is the set of individual responses. Although the assumption of
normality may seem reasonable, since, after all, the ubiquity of the Gaussian
distribution (as Galton had doggedly revealed) is precisely why one refers to it as
"normal," the GOG experiments suggested to me that it is a flawed assumption.
Fig. 8.1 Top panel: histogram of the estimates of the perimeter of Trinity College by physics
students given the point of departure and connecting streets. Bottom panel: histogram of
responses by the same students asked to estimate the mass or weight of a jar of steel shot. For
all such exercises, respondents could use whatever units they preferred; the estimates were
subsequently converted into standard units.
The sample distribution in each of the GOG experiments did not resemble a
normal distribution. True, the blockish quality of the resulting histograms, like the
two shown in Figure 8.1, reflected in part the fact that a sample of about 30 to 60
participating students (depending on the particular trial) did not comprise a large
population, but the sample was nonetheless large enough to draw meaningful conclusions. A much larger sample, it seemed to me, might smooth the envelope of the
histograms, but not necessarily eliminate the marked skewness in some or conspicuously small kurtosis in others.15
Having examined numerous histograms of the GOG trials, I reached a conclusion
very different from what Galton would have believed. The samples were not drawn from
a hypothetical normal population. Rather, in regard to each specific question or task put
to the group, the members fell roughly into a subgroup of those who were more or less
informed (i.e. had at least some idea of what constituted a reasonable response) and a
subgroup of those who were essentially clueless about the matter. Replies from the
informed subgroup would be distributed approximately normally (although not necessarily centered on the correct answer), whereas the widely scattered replies from the
uninformed subgroup would be better modeled by a uniform distribution.
For example, virtually all college and university students have taken numerous
tests in their lives, and so when asked to predict the class mean of a forthcoming quiz,
their replies could be expected to follow more or less a normal distribution centered
on a correct prediction. And this was the case. On the other hand, most students
probably ate at the college dining facility rather than at city restaurants, and therefore when asked to estimate the number of restaurants their replies were all over the
board, so to speak, ranging from about 25 to greater than 250.
The upper and lower panels of Figure 8.1 reflect this categorical division for two
other trials. Few students in my classes, even though they may have taken science
courses before, had background experiences preparing them to estimate the weight of
an object by holding it. There may have been a few students, perhaps, who went
grocery shopping at home or who cooked or baked in their kitchens and lifted so
many ounces of this or a half pound of that. But most had little kinesthetic sense of
weight or mass; the estimates of the mass of a jar of steel shot were all over the chart.
Contrast that with the quasi-Gaussian-looking histogram of estimates of the college
perimeter, attributable, I believe, to the fact that many students jogged that route
regularly or traversed it by car. Thus, the histogram took the shape of a normal
distribution from the informed group skewed to the right by a long tail (outliers of
excessive length) from those who simply guessed erratically. In general, then, the
distribution of group replies should be a mixture in varying proportions of the
informed and uninformed distributions.
15
Recall that kurtosis K (from the Greek root for bulging) is a measure of the fourth moment about the mean. It is a
gauge of the sharpness of the peak and heaviness of the tails. For a normal distribution K = 3. A distribution with
lower kurtosis has a more rounded peak, narrower shoulders, and shorter tails.
Before continuing with this thought, however, a brief word of explanation is called
for in view of my previous criticism (in Chapter 3) of inferences about nuclear decay
drawn by certain researchers from the shapes of histograms. What, then, would
justify at this point my own deductions based on histogram shape? First, it is most
certainly the case that a probability density function (pdf ), when plotted against the
variable upon which it depends, has a definite shape. An observer with appropriate
experience would surely recognize the bell shape of a Gaussian distribution, the Eiffel
Tower-like shape of a Cauchy distribution, the skewed ski-slope shape of a Rayleigh
distribution, and other more or less familiarly shaped pdfs. A histogram approximates the shape of a pdf, provided that the bins are not so numerous as to result in a
statistically insignificant number of events in each, nor so few as to reveal no shape at
all. When the resulting form of a histogram is not essentially altered by varying the
number and boundaries of the bins within a statistically permissible range, then it is
meaningful to speak of the histogram shape as an empirical approximation to the
true underlying pdf. What is not meaningful (as I demonstrated in Chapter 3) is to
assign significance to the shape of fluctuations in the numbers of events in these
(arbitrarily designated) bins, since such secondary spatial features (e.g. rabbit ears)
can change radically (. . . they are, after all, fluctuations . . .) with a change in the
number and value of the bins. And now, let us return to my GOG deductions.
The lesson I drew from the GOG experiments with my physics students was that in
seeking to optimize the information one can extract from a group, one should not
weight equally everyones response. Rather, a better strategy would be to give more
weight to the members of the informed group and less to those of the uninformed
group. Yet how could that be done, given that the members of each subgroup are not
individually identifiable? How could one tell whether a nearly correct response
actually came from the completely random guess of someone in the uninformed
group, or that an outlying incorrect response came from careful consideration by
someone in the tail of the informed subgroup? What was needed was a completely
objective statistical model that utilized only the sample of data without making any
attempt to assess the knowledge of individual respondents.
I will present such a model shortly, but first let us examine under what circumstances the sample mean is justified as an expression of the wisdom of the crowd.
Technically, from a Bayesian point of view, there is always prior information, even if it consists of total ignorance.
We considered in Chapter 2 the statistical representation of ignorance.
Recall that Galton initially identified the collective judgment of
the crowd at the 1906 Fat Stock and Poultry Exhibition with the sample median,
most likely because the median is not sensitive to outliers, and then later favored the
group mean when it was pointed out (at least in that one instance) that it gave a more
accurate prediction. Although Galton presumably knew nothing of entropy at the
time, he made a statistically reasonable choice. In the absence of prior information
concerning the distribution of guesses, this choice can in fact be justified by the
principle of maximum entropy, introduced in Chapter 1.
Let us designate by {x_i, i = 1…n} the set of independent guesses submitted by a
group of n members in response to some query. Group the guesses into K categories
(bins) {X_k} with frequencies {n_k}, k = 1…K. It then follows that

$$\sum_{k=1}^{K} n_k = n \qquad (8.7.1)$$

$$\sum_{k=1}^{K} n_k X_k = \sum_{i=1}^{n} x_i, \qquad (8.7.2)$$

and the sample mean can be expressed in either of two equivalent ways

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\sum_{k=1}^{K} n_k X_k. \qquad (8.7.3)$$

Suppose p_k to be the (unknown) probability that an outcome (guess) falls in the kth
bin. The sorting of n items into K bins constitutes a multinomial distribution for
which the probability P({n_k}|{p_k}) of an observed configuration of outcomes {n_k} is
given by

$$P(\{n_k\}\,|\,\{p_k\}) = n!\prod_{k=1}^{K} \frac{p_k^{\,n_k}}{n_k!} \qquad (8.7.4)$$

with associated entropy

$$H = \ln P(\{n_k\}\,|\,\{p_k\}) \qquad (8.7.5)$$

and completeness relation

$$\sum_{k=1}^{K} p_k = 1. \qquad (8.7.6)$$
As discussed in Chapter 1, the most objective (least biased) assignment of probabilities {p_k}, given the set {n_k}, is obtained by maximizing the entropy (8.7.5). In other words,
one must solve the equations

$$\frac{\partial H}{\partial p_k} = \frac{\partial \ln P}{\partial p_k} = \frac{1}{P}\,\frac{\partial P}{\partial p_k} = 0 \qquad k = 1\ldots K, \qquad (8.7.7)$$
or, equivalently,

$$\frac{\partial P}{\partial p_k} = 0 \;\Longrightarrow\; \frac{\partial \ln P}{\partial p_k} = 0 \qquad k = 1\ldots K. \qquad (8.7.8)$$
The second relation in (8.7.8) follows because a function and its logarithm are
extremized at the same points.
Substitution of (8.7.4) into (8.7.8) leads to a set of equations

$$\frac{\partial L}{\partial p_j} = \frac{\partial}{\partial p_j}\sum_{k=1}^{K} n_k \ln p_k = 0 \qquad j = 1\ldots K \qquad (8.7.9)$$
which is the same set of equations that would follow from application of the method
of maximum likelihood (ML). The log-likelihood function L of this system is

$$L = \sum_{k=1}^{K} n_k \ln p_k, \qquad (8.7.10)$$
but, because of constraints (8.7.1) and (8.7.6), only K − 1 of the K terms in L are
independent. We can deal with the situation, as we have before, by use of a Lagrange
multiplier, or more simply in this instance by rewriting L in terms of independent
quantities only, in the following way

$$L = \sum_{k=1}^{K-1} n_k \ln p_k + n_K \ln p_K = \sum_{k=1}^{K-1} n_k \ln p_k + \left(n - \sum_{k=1}^{K-1} n_k\right)\ln\left(1 - \sum_{k=1}^{K-1} p_k\right). \qquad (8.7.11)$$

The extremum conditions then take the form

$$\frac{\partial L}{\partial p_j} = \frac{n_j}{p_j} - \frac{n - \sum_{k=1}^{K-1} n_k}{1 - \sum_{k=1}^{K-1} p_k} = \frac{n_j}{p_j} - \frac{n_K}{p_K} = 0 \;\Longrightarrow\; \frac{n_j}{p_j} = \frac{n_K}{p_K} = \text{constant} \qquad j = 1\ldots K-1. \qquad (8.7.12)$$
The constant is determined to be n from the completeness relation (8.7.6), and one
thereby obtains the maximum-entropy (ME) set of probabilities

$$p_k^{\mathrm{ME}} = \frac{n_k}{n} \qquad k = 1\ldots K \qquad (8.7.13)$$
It then follows that the sample mean (8.7.3) is the first moment

$$m_1 = \sum_{k=1}^{K} p_k^{\mathrm{ME}} X_k = \sum_{k=1}^{K} \frac{n_k X_k}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (8.7.14)$$

of the distribution obtained from the principle of maximum entropy for the case
where no prior information (other than completeness) is known about how the
guesses of a group are distributed. This is the most unbiased value one can obtain by
querying the group once, but there is no predictive value to it. Put the same question
to the same group again, as I have done with students, and, as individuals change
their estimates or guesses, a different set of frequencies {n_k} and therefore probabilities {p_k} will likely emerge.
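The identity (8.7.13)–(8.7.14) is easily illustrated with a short sketch (plain Python; the simulated guesses and the bin count K = 8 are illustrative assumptions). If each bin value X_k is taken as the mean of the samples falling in that bin, the first moment computed from p_k = n_k/n reproduces the sample mean exactly:

```python
import random
random.seed(1)

# illustrative "guesses": an informed (Gaussian) and a clueless (uniform) subgroup
guesses = ([random.gauss(200, 50) for _ in range(30)]
           + [random.uniform(40, 600) for _ in range(10)])
n = len(guesses)

K = 8                                   # number of bins (arbitrary choice)
lo, hi = min(guesses), max(guesses)
width = (hi - lo) / K

# sort guesses into bins: n_k = frequency, X_k = mean of samples in bin k
sums = [0.0] * K
freq = [0] * K
for x in guesses:
    k = min(int((x - lo) / width), K - 1)
    freq[k] += 1
    sums[k] += x

p_me = [nk / n for nk in freq]          # maximum-entropy probabilities (8.7.13)
m1 = sum(p * (s / nk)                   # first moment (8.7.14)
         for p, s, nk in zip(p_me, sums, freq) if nk > 0)

print(m1, sum(guesses) / n)             # identical: the first moment is the sample mean
```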
Although the frequencies may fluctuate from trial to trial, examination of the
resulting histograms of responses suggests a discernible pattern or form which, if
real, would constitute information beyond pure ignorance. Knowing that form (i.e.
probability function) would permit one to ascertain more reliably the information
contained in the collective response of a group. It would approximate the collective
judgment of a group of infinite number, or, equivalently, the mean response of a
finite-size group to the same question posed an infinite number of times. This does
not mean, of course, that the calculated wisdom of the crowd would necessarily lie
closer to the true answer to the question put to the crowd, but that the value
obtained would be more stable and therefore consistent.
The pdf of the group response is then modeled as the weighted sum of a Gaussian (informed) component and a uniform (uninformed) component

$$p_X(x\,|\,f,\mu,\sigma^2) = \frac{f}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/2\sigma^2} + \frac{1-f}{b-a}\,I_{[a,b]}(x), \qquad (8.8.1)$$

referred to as a mixed distribution. Recall that the interval function $I_{[a,b]}(x)$ appearing
in (8.8.1) restricts its argument to the range $a \le x \le b$
Fig. 8.2 Bottom panel: scatter plot of N = N_G + N_U = 10 000 samples from a mixed distribution
of Gaussian N(200, 50²) and Uniform U(40, 600) variates. The population comprises
N_G = 7500 Gaussian (dense points) and N_U = 2500 uniform (diffuse points) samples. Top
panel: histogram of all samples enveloped by theoretical probability density (black). The
sample mean and standard deviation are respectively x̄ = 230.8, s_x = 105.1, in close
agreement with the population mean and standard deviation μ_x = 230.0, σ_x = 105.4.
$$I_{[a,b]}(x) = \begin{cases} 1 & a \le x \le b \\ 0 & x > b \ \text{or}\ x < a. \end{cases} \qquad (8.8.2)$$
Figures 8.2 and 8.3 show examples of mixed distributions (8.8.1) obtained empirically
by drawing a total of 10 000 samples of which a fraction f came from a Gaussian
RNG and 1 f came from a uniform RNG. The fractions f in the two figures are
respectively 0.75 and 0.25. As expected, the shape in Figure 8.2 looks predominantly
Gaussian with a long tail skewed to the right, as in the GOG histogram for college
perimeter, whereas the shape in Figure 8.3 resembles more what the histogram for
mass estimation might have been if the number of samples were closer to 7500, rather
than 60.
I will explain shortly how the unknown parameters (f, μ, σ²) can be determined
from data. For now, assuming they are known, I identify the collective judgment of
Fig. 8.3 Bottom panel: scatter plot of N = N_G + N_U = 10 000 samples from a mixed
distribution of Gaussian N(200, 50²) and Uniform U(40, 600) variates. The population
comprises N_G = 2500 Gaussian (dense points) and N_U = 7500 uniform (diffuse points)
samples. Top panel: histogram of all samples enveloped by theoretical probability density
(black). The sample mean and standard deviation are respectively x̄ = 289.6, s_x = 151.2, in close
agreement with the population mean and standard deviation μ_x = 290.0, σ_x = 151.4.
the group with the expectation of X calculated with pdf (8.8.1), which is readily
shown to be

$$\bar{X} = \langle X \rangle = f\mu + (1-f)\,\frac{a+b}{2}. \qquad (8.8.3)$$
Correspondingly, the uncertainty in group response is taken to be $\sigma_X = \sqrt{\mathrm{var}\,X}$,
where the variance

$$\begin{aligned}\mathrm{var}\,X = \sigma_X^2 = \langle X^2\rangle - \bar{X}^2 &= f\left(\sigma^2+\mu^2\right) + (1-f)\,\frac{b^2+ab+a^2}{3} - \left[f\mu + (1-f)\,\frac{a+b}{2}\right]^2 \\ &= f\sigma^2 + (1-f)\,\frac{(b-a)^2}{12} + f(1-f)\left[\mu - \frac{a+b}{2}\right]^2 \end{aligned} \qquad (8.8.4)$$

is neatly expressible as a weighted sum of the variances of the two components in the
mix and the square of the difference of their means.
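A quick simulation confirms (8.8.3) and (8.8.4); sampling the mixed distribution means drawing from the Gaussian with probability f and from the uniform otherwise. The sketch below (plain Python) uses parameter values that mirror Figure 8.2 but are otherwise arbitrary assumptions.

```python
import random
random.seed(42)

f, mu, sigma, a, b = 0.75, 200.0, 50.0, 40.0, 600.0   # as in Fig. 8.2
N = 200_000

# draw from the mixed distribution (8.8.1)
samples = [random.gauss(mu, sigma) if random.random() < f
           else random.uniform(a, b) for _ in range(N)]

mean_th = f * mu + (1 - f) * (a + b) / 2                       # (8.8.3): 230.0
var_th = (f * sigma**2 + (1 - f) * (b - a)**2 / 12
          + f * (1 - f) * (mu - (a + b) / 2)**2)               # (8.8.4): ~11108

m = sum(samples) / N
v = sum((x - m)**2 for x in samples) / N
print(m, mean_th)    # sample mean ≈ 230
print(v, var_th)     # sample variance ≈ 1.11e4, i.e. sigma_X ≈ 105.4 as in Fig. 8.2
```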
The mixed model raises a subtle, but essential, distinction regarding the statistical
description of the group that may have escaped the reader's attention and should be
clarified, even though it entails a brief digression. Given the hypothesized pdf (8.8.1),
it would be wrong to think that the random variable X itself takes the mixed form

$$X_{\mathrm{mixed}} = f\,N(\mu,\sigma^2) + (1-f)\,U(a,b). \qquad (8.8.5)$$

Although the expectation of the variate defined by (8.8.5) yields precisely the same
result (8.8.3), the pdf of X_mixed is not that of a mixed distribution, and the theoretical
variance (and higher moments) differ substantially from those calculable from
(8.8.1). A mixed random variable is not the same as a mixed distribution.
To see this, one can employ the methods introduced earlier in the book to show
that the pdf of $X_{\mathrm{mixed}}$ in (8.8.5) takes the form

$$p_{X_{\mathrm{mixed}}}(x) = \frac{1}{\sqrt{2\pi}\,(1-f)(b-a)} \int_{[x-b(1-f)-f\mu]/f\sigma}^{[x-a(1-f)-f\mu]/f\sigma} e^{-u^2/2}\,du = \frac{\mathrm{erf}\!\left(\dfrac{x-a(1-f)-f\mu}{\sqrt{2}\,f\sigma}\right) - \mathrm{erf}\!\left(\dfrac{x-b(1-f)-f\mu}{\sqrt{2}\,f\sigma}\right)}{2\,(1-f)(b-a)}, \qquad (8.8.6)$$

where

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-z^2}\,dz \qquad (8.8.7)$$

is the error function, and that the variance of (8.8.5) reduces to

$$\mathrm{var}(X_{\mathrm{mixed}}) = f^2\sigma^2 + (1-f)^2\,\frac{(b-a)^2}{12}. \qquad (8.8.8)$$
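The distinction is easy to see numerically. Sampling the single variate (8.8.5) (a sketch with the same illustrative parameters used above for the mixed distribution) gives the same mean as (8.8.3) but the much smaller variance (8.8.8):

```python
import random
random.seed(7)

f, mu, sigma, a, b = 0.75, 200.0, 50.0, 40.0, 600.0
N = 200_000

# X_mixed = f*N + (1-f)*U : one random variable, NOT a two-component population
xm = [f * random.gauss(mu, sigma) + (1 - f) * random.uniform(a, b)
      for _ in range(N)]

m = sum(xm) / N
v = sum((x - m)**2 for x in xm) / N

mean_th = f * mu + (1 - f) * (a + b) / 2                   # same as (8.8.3): 230
var_th = f**2 * sigma**2 + (1 - f)**2 * (b - a)**2 / 12    # (8.8.8): ~3040
print(m, v)   # mean ≈ 230, but variance ≈ 3.0e3 (vs ~1.1e4 for the mixed pdf)
```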
Fig. 8.4 Variation in shape of the pdf of a mixed random variable X_mixed = f N(200, 40²) +
(1 − f )U(40, 600) as a function of mixing coefficient f: (a) 0.05, (b) 0.25, (c) 0.50, (d) 0.75, (e) 0.95.
coefficients fi can be uniquely described by a density matrix [f1V1( (1)); f2V2( (2)); . . .
fnVn( (n))] or simply [f; V1( (1)); V2( (2))] for a binary mixed distribution, where it is
understood that the fraction f refers to the first variate in the bracket and the
complementary coefficient must be 1 f.
There are various ways to estimate the parameters (f, μ, σ²) in the mixed model
[f; N(μ,σ²); U(a,b)] of group judgment. I have generally employed the method of
maximum likelihood, which follows directly from Bayes' theorem (for a uniform
prior) and possesses a number of statistically desirable properties, as discussed in
Chapter 1. Note that the range values (a, b) of the uniform distribution are not
unknown parameters to be solved for, but can be established well enough at the
outset from the sample of responses. The data x_i (i = 1…n) must be grouped, i.e. the
n samples sorted into K bins X_k (k = 1…K) with frequency of n_k samples in the kth
bin. The log-likelihood function L of n samples drawn from a population governed
by the probability density (8.8.1) then takes the form (discussed in Chapter 1)
$$L = \sum_{k=1}^{K} n_k \ln p(X_k\,|\,\theta_1,\theta_2,\theta_3) = \sum_{k=1}^{K} n_k \ln p(X_k\,|\,f,\mu,\sigma^2) \qquad (8.8.9)$$

and the ML equations to be solved are

$$\frac{\partial L}{\partial \theta_j} = \sum_{k=1}^{K} \frac{n_k}{p(X_k\,|\,\{\theta\})}\,\frac{\partial p(X_k\,|\,\{\theta\})}{\partial \theta_j} = 0 \qquad j = 1, 2, 3. \qquad (8.8.10)$$
Note that the third parameter to be solved for can be either σ or σ². Because the
resulting equations (8.8.10) are highly nonlinear and require a numerical procedure
that calls for an initial guess, one choice may be less sensitive to the starting values
than the other and thereby lead more readily to convergence. As a general guideline,
choose parameters that are neither too large nor too small. In applications to be
discussed shortly (the BBC–Silverman experiments), rapidly convergent solutions
were obtained for either choice when σ was of order 1; σ was the preferable parameter, however, when its value was of order 10². Taking θ₃ = σ, the ML equations of
the Mixed-NU Model become
$$\frac{\partial L}{\partial f} = \sum_{k=1}^{K} n_k\left[\frac{e^{-(X_k-\mu)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} - \frac{1}{b-a}\right] p(X_k\,|\,f,\mu,\sigma^2)^{-1} = 0$$

$$\frac{\partial L}{\partial \mu} = 0 \;\Rightarrow\; \sum_{k=1}^{K} n_k\,(X_k-\mu)\,e^{-(X_k-\mu)^2/2\sigma^2}\;p(X_k\,|\,f,\mu,\sigma^2)^{-1} = 0 \qquad (8.8.11)$$

$$\frac{\partial L}{\partial \sigma} = 0 \;\Rightarrow\; \sum_{k=1}^{K} n_k\left[\frac{(X_k-\mu)^2}{\sigma^2} - 1\right] e^{-(X_k-\mu)^2/2\sigma^2}\;p(X_k\,|\,f,\mu,\sigma^2)^{-1} = 0$$

with

$$p(X_k\,|\,f,\mu,\sigma^2) = f\,\frac{e^{-(X_k-\mu)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} + \frac{1-f}{b-a}\,I_{[a,b]}(X_k). \qquad (8.8.12)$$
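The equations (8.8.11) must be solved numerically. As a minimal stand-in for a proper root-finder, the sketch below (plain Python; synthetic data with assumed true values f = 0.75, μ = 200, σ = 50) simply maximizes the binned log-likelihood (8.8.9) over a coarse parameter grid, which is enough to recover the generating parameters.

```python
import math, random
random.seed(3)

a, b = 40.0, 600.0
data = [random.gauss(200.0, 50.0) if random.random() < 0.75
        else random.uniform(a, b) for _ in range(5000)]

# group the data into K bins as required by (8.8.9)
K = 40
w = (b - a) / K
freq = [0] * K
for x in data:
    if a <= x <= b:
        freq[min(int((x - a) / w), K - 1)] += 1
mid = [a + (k + 0.5) * w for k in range(K)]

def pdf(x, f, mu, s):                       # mixed density (8.8.12)
    g = math.exp(-(x - mu)**2 / (2 * s * s)) / math.sqrt(2 * math.pi * s * s)
    return f * g + (1 - f) / (b - a)

def loglike(f, mu, s):                      # binned log-likelihood (8.8.9)
    return sum(nk * math.log(pdf(x, f, mu, s)) for nk, x in zip(freq, mid))

best = max(((fi / 100, mu, s) for fi in range(5, 100, 5)
            for mu in range(150, 255, 10) for s in range(20, 90, 10)),
           key=lambda t: loglike(*t))
print(best)   # close to the true (0.75, 200, 50)
```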
Fig. 8.5 Top panel: fruit cake used by the BBC's The One Show in 2007 to test the ability of a
crowd in London's Borough Market to guess the weight of a cake. The winner received the
cake as a prize. Bottom panel: scale used at Borough Market to weigh the cake, showing true
mass of 5.315 kg.
value than the sample mean (Galton's model), provided the informed subgroup of
respondents (the Gaussian component) made up a sufficiently large fraction of
the group.
The first trial took place in the large outdoor Borough Market, one of London's major food markets, located in the Borough of Southwark close to the famed London Bridge. Carrying a large rectangular fruit cake with a bright red question mark in the center of the icing, as shown in Figure 8.5, BBC reporter Michael Mosley randomly queried 123 shoppers for their estimates of the weight of the cake. As incentive to guess accurately, the person coming closest to the true value M_cake = 5.315 kg, as shown on the scale in Figure 8.5, would receive the cake as a prize. Truth be told, I would not, myself, want to eat a cake that was carried around Borough Market all day and held by more than a hundred people . . . but perhaps that's being too fastidious.
In any event, like Galton at the West of England Exhibition, I received from Ms. Freeman the complete record of guesses. Also like Galton, I found the crowd's judgment to be surprisingly good. Estimates ranged from 1.000 kg to 14.700 kg with
Fig. 8.6 BBC-Silverman cake experiment in Borough Market, London. Histogram of N = 123 estimates of the mass of a cake sorted into 29 bins of width 0.5 kg over the range 1-15 kg. Superposed is the theoretical Mixed-NU model probability density (solid) with maximum likelihood parameters (f, \mu, \sigma^2) given in Table 8.4, and a normal density (dashed) based on the unweighted group mean and variance. The two outcomes are respectively M_MNU = 5.345 ± 0.239 kg and M_Gaus = 5.416 ± 0.223 kg. The true mass was M_cake = 5.315 kg.
a sample mean M_sample = 5.416 kg, a median of 5.100 kg, standard deviation S_sample = 2.471 kg, and standard error of 0.223 kg. In short, the crowd missed the exact mass with a fractional error (M_sample − M_cake)/M_cake = 1.90%, or about 1 part in 50.
Figure 8.6 shows a histogram of the mass estimates, overlaid by a Gaussian distribution N(M_sample, S_sample^2), reflecting Galton's (and WOC's) democratic belief that the collective wisdom of a crowd resides in the sample mean of normally distributed independent guesses. However, in keeping with the results of my GOG experiments with physics students, the Borough Market histogram, with its concentrated density in the vicinity of 5 kg and a long flat tail skewed to the right, again strongly resembled a mixed Gaussian-Uniform distribution. The Mixed-NU distribution (8.8.12) with parameters determined by the method of maximum likelihood (8.8.11) provided a better match to the data, as also shown in Figure 8.6 and summarized in Table 8.4.
Having watched a video recording of the experiment in progress that Ms. Freeman sent me, I thought this outcome made perfect sense. The population queried by Mr. Mosley included seasoned housewives, who undoubtedly made and lifted many cakes over the years, as well as some young people in their teens or twenties, who probably never lifted, let alone made, a fruit cake. Nevertheless,
since the venue was a major food market, it is not unreasonable to expect the sampled population to comprise more mature, knowledgeable food preparers than clueless youths. If so, that would account for why the mean judgment of the crowd was quite accurate, and why a normal distribution captured the information of the crowd almost as well (percent error 1.90%) as my Mixed-NU Model (percent error 0.56%). Nevertheless, a normal distribution alone fails to account for the high-end fat tail.

Table 8.4

                                      Cake experiment                      Coin experiment
Population size n                     123                                  1706
Number of bins                        29                                   71
Bin width                             0.5 kg                               100 coins
Mixed-NU parameters:
  Uniform range (a, b)                (1, 15) kg                           (0, 7000) coins
  Gaussian fraction f                 0.806                                0.868
  Gaussian mean \mu                   4.705                                736.40
  Gaussian standard deviation \sigma  1.627                                354.62
Exact value                           M_cake = 5.315 kg                    N_coin = 1111 coins
Crowd (sample) mean value             M_sample = 5.416 ± 0.223 kg          N_sample = 982 ± 39 coins
Silverman Mixed-NU mean               M_MNU = 5.345 ± 0.239 kg             N_MNU = 1100 ± 30 coins
Crowd (sample) percent error          (M_sample − M_cake)/M_cake = 1.90%   (N_sample − N_coin)/N_coin = −11.60%
Silverman Mixed-NU percent error      (M_MNU − M_cake)/M_cake = 0.56%      (N_MNU − N_coin)/N_coin = −0.99%
Of particular interest to me was the second trial, which took place in the BBC studio and entailed an exercise that was probably not part of the experiences of many viewers who emailed in their guesses: to estimate the number of £1 coins in a large open, transparent glass, as shown in Figure 8.7. The true value was N_coin = 1111. The 1706 guesses received were all over the board, ranging from a low of 42 to a high of 43 200 with a sample mean N_sample = 982, a median of 695, standard deviation S_sample = 1593, and standard error of 39. Although the judgment of the crowd was worse than in the cake experiment, it was not terribly bad, missing the exact value with a fractional error of (N_sample − N_coin)/N_coin = −11.6%, or about 1 part in 9.
Actually, the largest value submitted was 25 million, but it was excluded since there was reason to believe (as Ms. Freeman wrote me) that it was intended to sabotage the experiment. Indeed, as a matter of common sense, where would The One Show get £25 000 000, or more than $40 000 000, to put in a jar for the purpose of a brief television segment?17

Fig. 8.7 Glass full of £1 coins used by BBC's The One Show in 2007 to test the ability of viewers to estimate the number of items in a set.

17 This is about 15% of the entire BBC One network annual budget.
Fig. 8.8 BBC-Silverman coin experiment. Histogram of N = 1706 estimates of the number of £1 coins in the glass of Figure 8.7 sorted into 71 bins of width 100 over the range 0-7000. Superposed is the Mixed-NU model probability density (solid) with maximum likelihood parameters given in Table 8.4, and a normal density (dashed) based on the sample mean and variance. The two outcomes are respectively N_Mixed = 1065 ± 30 and N_sample = 982 ± 39 coins. The true count was N_coin = 1111.
Application of the Mixed-NU model to these data yielded a Gaussian fraction (the "informed" subgroup) of 86.8%, a little higher, in fact, than the 80.6% Gaussian component of the crowd of cake-weight estimators. The use of the term "informed" does not necessarily imply that most individuals in the group were especially skilled at estimating numbers, only that they had a sense of what might be a reasonable number in contrast to a preposterous one. Recall the lesson of the Condorcet jury theorem: a group, if sufficiently large, can produce a correct majority vote with near certainty, even if individual members were correct only marginally more than half the time.
It is also possible that the closeness with which the two group estimates, as
extracted by the Mixed-NU model, matched the true values is partly fortuitous. In
this regard, several aspects of the analysis bear brief commentary.
First, the match between frequencies calculated from the model pdf and the data would fail a chi-square test for goodness of fit, i.e. give rise to a P-value smaller than the conventionally set threshold of 5%. That failure does not in itself constitute a failure of the model, because the sole purpose of the model was to provide a
better rationale than pure ignorance for gauging information (as expressed
through the mean and variance) available in the collective judgment of a group.
Unlike physical systems like atoms or stars, which are subject to well-founded
theories grounded in quantum mechanics and yield predictable line shapes of
one kind or another, there may well be no general theory for the response of a
group to some query. In that case an empirical model would be the best one can do.
The modified Mixed-NU model employs the probability density

p(X \,|\, f, \mu, \sigma^2) = f\, C_{\mu,\sigma}\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(X-\mu)^2/2\sigma^2}\, I_{(0,\infty)}(X) + \frac{1-f}{b-a}\, I_{(a,b)}(X)    (8.10.1)

constructed from a truncated Gaussian defined over the range (\infty > x \ge 0). Correct normalization (to obtain unit area under the pdf) is obtained by insertion of the normalization constant

C_{\mu,\sigma} = \frac{2}{1 + \mathrm{erf}(\mu/\sqrt{2}\,\sigma)}.    (8.10.2)
The mean (8.8.3) and variance (8.8.4) are then replaced by the relations

\langle X \rangle = f \left[ \mu + \sqrt{\frac{2}{\pi}}\, \frac{\sigma\, e^{-\mu^2/2\sigma^2}}{1 + \mathrm{erf}(\mu/\sqrt{2}\,\sigma)} \right] + (1-f)\, \frac{a+b}{2}    (8.10.3)

\sigma_X^2 = f \sigma^2 \left[ 1 - \sqrt{\frac{2}{\pi}}\, \frac{(\mu/\sigma)\, e^{-\mu^2/2\sigma^2}}{1 + \mathrm{erf}(\mu/\sqrt{2}\,\sigma)} - \frac{2}{\pi}\, \frac{e^{-\mu^2/\sigma^2}}{\left[ 1 + \mathrm{erf}(\mu/\sqrt{2}\,\sigma) \right]^2} \right] + (1-f)\, \frac{(b-a)^2}{12}
\quad + f(1-f) \left[ \mu + \sqrt{\frac{2}{\pi}}\, \frac{\sigma\, e^{-\mu^2/2\sigma^2}}{1 + \mathrm{erf}(\mu/\sqrt{2}\,\sigma)} - \frac{1}{2}(b+a) \right]^2    (8.10.4)
derived from (8.10.1). Applied to the BBC-Silverman coin experiment with data sorted into 51 bins and other parameters as listed in Table 8.4, Equations (8.10.3) and (8.10.4) lead to 1114 ± 30 coins, yielding a fractional error of 0.31%.
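The truncated-Gaussian mean inside (8.10.3) is easy to verify numerically. The following sketch is a check of my own (not part of the original analysis): it draws from N(\mu, \sigma^2), discards negative values, and compares the sample mean with the closed form.

```python
# Numerical check of the truncated-Gaussian mean in (8.10.3): draws of
# N(mu, sigma^2) restricted to x >= 0 should average to
# mu + sqrt(2/pi) * sigma * exp(-mu^2/2sigma^2) / (1 + erf(mu/(sqrt(2)*sigma)))
import math
import random

random.seed(4)
mu, sigma = 1.0, 2.0
xs = [x for x in (random.gauss(mu, sigma) for _ in range(800_000)) if x >= 0]
sample_mean = sum(xs) / len(xs)
theory = mu + math.sqrt(2 / math.pi) * sigma * math.exp(-mu**2 / (2 * sigma**2)) \
         / (1 + math.erf(mu / (math.sqrt(2) * sigma)))
print(round(sample_mean, 3), round(theory, 3))
```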
Although the modified Mixed-NU model with correct normalization apparently
extracted an even closer (still fortuitous?) estimate of the true number of coins, there
remain some undesirable features. For one thing, the model pdf does not vanish at
the origin as it must, since no respondent who takes the exercise seriously would look
at a jar full of coins and estimate its number to be zero. Second, and perhaps more
important from an aesthetic perspective, is what may appear to some as an artificial
distinction between informed and uninformed subgroups. In mathematical
terms, it would be more satisfying if one could find a single pure density function
that generated the statistical features of a mixed-distribution model without assuming that respondents actually comprised two discretely different distributions.
This second desideratum raises a general (and, if you think about it, profound)
question of whether or not there may be a universal probability function for the
responses from a large (in principle, infinitely large) group of independent (i.e. noncoordinating) participants, each with a unique set of background experiences and
variable (in kind and amount) knowledge. This hypothetical group is, of course, an idealization, but perhaps one that might be realized in a practical way by means of the internet. Under the foregoing conditions, it would seem that orthodox statistical procedure already tells us what this universal probability function should be. If each respondent's guess is represented by a random variable of arbitrary kind (provided that its first and second moments exist), the Central Limit Theorem (CLT), as we have seen in Chapter 1, asserts that the variate representing the sum or mean of the set is distributed normally. However, this prediction is not supported by the results of either my GOG or BBC experiments. The CLT is a rigorous statistical law, but it can lead to less familiar results under unusual circumstances. I will return to this point at the end of the section.
To find and test a hypothetical universal probability function, it is crucial that the
group to be sampled be large. What one might expect from a truly large and variable
group could perhaps be anticipated from the data set for the coin experiment, which,
for want of a larger data set at the time of writing, will have to serve as a proxy for an
ideal infinite group of respondents.
The top panel of Figure 8.9 plots the histogram frequencies as points (rather than
as bars) as a function of bin value. Having examined numerous shapes assumed by
various skewed distributions for different parametric choices, I found that nearly all
of them failed to depict convincingly either the central concentration of points or the
fat tail or both. One striking exception, however, was the log-normal distribution, the
pdf of which takes the forms
p_X(x \,|\, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-[\ln(x/x_0)]^2/2\sigma^2} = \frac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-(\ln x - \mu)^2/2\sigma^2}    (8.10.5)

and leads (as shown in an appendix) to the following statistical moments and functions of moments

\langle X^n \rangle = e^{n\mu + \frac{1}{2} n^2 \sigma^2} \qquad n = 0, 1, 2, \ldots    (8.10.6)

\langle X \rangle = e^{\mu + \frac{1}{2}\sigma^2} = x_0\, e^{\sigma^2/2}    (8.10.7)

\sigma_X^2 = \langle X \rangle^2 \left( e^{\sigma^2} - 1 \right)    (8.10.8)

\mathrm{Sk}_X = \left( e^{\sigma^2} + 2 \right) \left( e^{\sigma^2} - 1 \right)^{1/2}    (8.10.9)

K_X = e^{4\sigma^2} + 2 e^{3\sigma^2} + 3 e^{2\sigma^2} - 3.    (8.10.10)
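As a quick sanity check on (8.10.7) and (8.10.8), a numerical aside of my own: simulate X = e^Y with Y ~ N(\mu, \sigma^2) and compare the sample moments with the closed forms.

```python
# Numerical check of the log-normal moments (8.10.7)-(8.10.8):
# X = exp(Y) with Y ~ N(mu, sigma^2).
import math
import random

random.seed(2)
mu, sigma = 1.0, 0.5
xs = [math.exp(random.gauss(mu, sigma)) for _ in range(400_000)]
m = sum(xs) / len(xs)
v = sum((x - m) ** 2 for x in xs) / len(xs)
print(round(m, 2), round(v, 2))
# theory: <X> = e^{mu + sigma^2/2}, sigma_X^2 = <X>^2 (e^{sigma^2} - 1)
```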
The solid curve in the top panel of Figure 8.9 is the theoretical pdf (8.10.5) with parameters \hat{\mu}, \hat{\sigma} determined from the maximum likelihood (ML) expressions

\hat{\mu} = \frac{1}{n} \sum_{k=1}^{K} n_k \ln X_k \qquad\qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{K} n_k \left( \ln X_k - \hat{\mu} \right)^2    (8.10.11)
Fig. 8.9 Coin experiment of Figure 8.8. Top panel: histogram of count estimates (gray dots)
fitted by a log-normal probability density (black solid) and plotted against class values. Data
were sorted into 51 bins of equal intervals of 140. Bottom panel: the same histogram and
density plotted against the logarithm (to base 10) of the class values.
If the coin data were a sample drawn from a log-normal population, then a plot of
histogram frequencies as a function of the logarithm (to any base) of the bin values
should transform the histogram into the shape of a Gaussian, as shown by the solid
trace in the lower panel of Figure 8.9 calculated from the pdf (8.10.5) with ML
parameters. The points of the transformed histogram follow the theoretical curve
reasonably well. Figure 8.10 compares the observed cumulative distribution,
F_k = \frac{1}{n} \sum_{j=1}^{k} n_j \qquad k = 1 \ldots K,    (8.10.12)
Fig. 8.10 Coin experiment of Figure 8.8. Empirical (gray dots) and log-normal (black solid)
cumulative probabilities plotted against the logarithm (to base 10) of the class values.
which is largely independent of the arbitrary choice of class number and interval,
with the theoretical cumulative probability

F_X(x) = \frac{1}{2} \left[ 1 + \mathrm{erf}\!\left( \frac{\ln x - \hat{\mu}}{\sqrt{2}\,\hat{\sigma}} \right) \right]    (8.10.13)

derived by integrating (8.10.5) with substitution of ML parameters. Agreement of
(8.10.12) and (8.10.13) is satisfyingly close.
The group judgment of the number of coins in a jar, given by the mean and standard error relations (8.10.7) and (8.10.8), is summarized in Table 8.5 for several choices of the number of classes K. Most striking is that the estimates, all in the vicinity of about 900 coins, are considerably poorer, compared to the true value of 1111, than the results of either the Mixed-NU model or the sample mean (982) representing one voice, one vote. Table 8.5 also shows results of mixing a log-normal and uniform distribution (Mixed-LNU model). With parameters determined again by the maximum likelihood method, the resulting mixture had a log-normal component of about 96%, i.e. nearly a pure log-normal pdf, but nevertheless extracted a group judgment significantly closer to the true value than the sample mean. Statistically, however, there would be no motivation at this point for such a model, since the pure log-normal function itself is supposed to provide the long tail that the uniform distribution was adopted to provide in the original Mixed-NU model. I will return to this point later.
Table 8.5

Number of bins K                  36           51           71
Bin width (coins)                 200          140          100
LN model:
  \hat{\mu}                       6.608        6.609        6.591
  \hat{\sigma}                    0.592        0.622        0.591
  LN mean and SE (coins)          883 ± 14     900 ± 15     893 ± 15
  LN percent error                −20.5%       −19.0%       −19.6%
Mixed-LNU model:
  \hat{\mu}                       6.666        6.619        6.578
  \hat{\sigma}                    0.542        0.577        0.585
  Log-normal fraction f           0.958        0.962        0.961
  Mixed-LNU mean and SE (coins)   1018 ± 20    984 ± 20     957 ± 20
  Mixed-LNU percent error         −8.4%        −11.5%       −13.9%

It is a striking feature of many random processes that they give rise to results represented by a log-normal distribution. Among the multitudinous phenomena for
which a log-normal distribution has been claimed are (to cite but a few):19 (1) the concentration of elements in the Earth's crust, (2) the distribution of particles, chemicals, and organisms in the environment, (3) the time to failure of some maintainable system, (4) the concentration of bubbles and droplets in a fluid, (5) coefficients of friction and wear, (6) the latent period of an infectious disease, (7) the abundance of biological species, (8) the taxonomy of biological species, (9) the number of letters per word and number of words per sentence, and (10) the distribution of sizes of cities. In my own laboratory, recent investigations of the mystifying and amusing system of explosive glass droplets known as Rupert's drops found that the size of dispersed glass fragments followed a distribution similar to the log-normal distribution.20
The reader will note that the preceding sampling spans fields of physics, chemistry,
engineering, biology, medicine, linguistics, and more. Surely, the same stochastic
mechanism cannot be operating in all these cases. One cannot help asking: why does
a log-normal distribution turn up so often?
A general explanation for the occurrence of skewed distributions was given at least as far back as the turn of the twentieth century,21 subsequently followed by numerous elaborations. Briefly (and with more attention to content than rigor), the argument goes as follows. Consider a process occurring sequentially that produces some initial element X_0 from which successive elements X_j (j = 1 . . . n) arise by a random action \epsilon_j on the immediately preceding element X_{j-1}, as in the sequence
19 Lists of reported occurrences of the log-normal distribution with corresponding references are given by: (1) E. Limpert, W. Stahel, and M. Abbt, "Log-normal distributions across the sciences: keys and clues," BioScience 51 (2001) 341-352; (2) Wikipedia, "Log-normal distribution," http://en.wikipedia.org/wiki/Log-normal_distribution
20 M. P. Silverman, W. Strange, J. Bower, and L. Ikejimba, "Fragmentation of explosively metastable glass," Physica Scripta 85 (2012) 065403.
21 J. C. Kapteyn, Skew Frequency Curves in Biology and Statistics (P. Noordhoff, Groningen, 1903).
X_0 \rightarrow X_1 = X_0 (1 + \epsilon_1) \rightarrow X_2 = X_1 (1 + \epsilon_2) \rightarrow \cdots \rightarrow X_n = X_{n-1} (1 + \epsilon_n).    (8.10.14)

The outcome of the iterative process (8.10.14) at the nth step is then a product of n factors

X_n = X_0 \prod_{j=1}^{n} (1 + \epsilon_j)    (8.10.15)

of which the logarithm (the base is not important to this demonstration) takes the form

\ln X_n = \ln X_0 + \sum_{j=1}^{n} \ln(1 + \epsilon_j) = \ln X_0 + \sum_{j=1}^{n} \epsilon_j + \text{higher-order terms in } \epsilon.    (8.10.16)
Under the assumption that the random action at each step is small, whereupon neglect of the higher-order terms in the expansion of the logarithm is justifiable, the sum of stochastic variables on the right side of relation (8.10.16) asymptotically approaches, by virtue of the CLT, a Gaussian random variable. In other words, ln X_n follows a normal distribution, and therefore X_n is a log-normal random variable.
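The argument is easy to see in simulation. The sketch below, an illustrative aside with arbitrarily chosen step size and step count, iterates the multiplicative process (8.10.14) and checks that the skewness of ln X_n is near zero, as expected for an approximately normal variate.

```python
# Sketch of the multiplicative process (8.10.14)-(8.10.16): with small
# random actions eps_j, ln X_n is a sum of near-independent increments
# and the CLT drives X_n toward a log-normal form.
import math
import random

random.seed(3)

def run(n_steps=100, x0=1.0, scale=0.05):
    """One realization of X_n = X_0 * prod(1 + eps_j)."""
    x = x0
    for _ in range(n_steps):
        x *= 1.0 + random.uniform(-scale, scale)  # X_j = X_{j-1}(1 + eps_j)
    return x

logs = [math.log(run()) for _ in range(10_000)]
mean = sum(logs) / len(logs)
var = sum((y - mean) ** 2 for y in logs) / len(logs)
skew = sum((y - mean) ** 3 for y in logs) / (len(logs) * var ** 1.5)
print(round(skew, 2))  # near 0: ln X_n is approximately normal
```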
It is difficult to imagine, however, how the foregoing sequential process might
pertain in the mind of a person asked to estimate the number of coins in a jar. Would
the person start with an estimate of the number and then sequentially modify it by
factors proportional to the current estimate until arriving at a satisfactory value?
Most likely not.
I propose, instead, a different mechanism, based on how I, myself, would have
executed the task. In short, I would first estimate the volume of the container and
then multiply that number by my estimate of the number of coins per volume. To see
how this plays out, look again at Figure 8.7, which shows that the glass container
wide at the top and narrow at the base takes the approximate shape of a frustum of
a cone, i.e. the portion of the cone lying between two parallel planes that cut it
perpendicular to the symmetry axis. Now, with r the radius of the (small) circular
base, R the radius of the (wide) circular mouth, and H the vertical distance between
the two planes, it is a straightforward exercise in geometry to show that the volume of
the conical frustum is
V(r, R, H) = \frac{\pi}{3} \left( r^2 + rR + R^2 \right) H.    (8.10.17)
8:10:17
Upon letting C stand for the number of coins per volume, I would then calculate the
number of coins in the jar from the formula
X(r, R, H, C) = \frac{\pi}{3} \left( r^2 + rR + R^2 \right) H C.    (8.10.18)
The crucial point to bear in mind at this stage is that none of the needed numbers
(r, R, H, C) is known; all are random variables whose realizations (i.e. guesses) by
members of a group would be different. After examining Figure 8.7, made from the video given me by Ms. Freeman (who did not specify the dimensions of the glass container), I assigned the following rough values of lengths and uncertainties (in centimeter units)

r \sim N(3, 0.7^2) \qquad R \sim N(5, 1.0^2) \qquad H \sim N(20, 2.0^2) \qquad C \sim N(1, 0.2^2)    (8.10.19)
and assumed, as shown explicitly by the expressions in (8.10.19), that they constituted the mean values and standard deviations of normally distributed variates.
Simple (in contrast to compound) physical quantities are often distributed normally,
so the assumption does not seem unreasonable to me.
Figure 8.11 shows plots (in gray) of the frequency (upper panel) and cumulative
probability (lower panel) of the guesses from a group of 1 000 000 respondents, as
simulated by a Gaussian random number generator (RNG) generating one million
values for each variate in (8.10.19) and multiplying the four realizations of each trial
together in accordance with expression (8.10.18) to arrive at an estimate of the
number of coins in the glass. The data were sorted into 400 bins of width ~10 units.
Because the number of bins is large and the bin width is narrow, the plots are shown
as continuous curves, rather than by a sequence of discrete bars or points. Superposed on the plots are the corresponding theoretical curves (in black) obtained from
a log-normal distribution with parameters calculated, as before, by the method of
maximum likelihood. We will come to the gray dashed curves in due course.
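The simulation is straightforward to reproduce in outline. The sketch below is a reduced-scale version of my own (fewer trials than the 10^6 of the text, and ungrouped ML estimates as in (8.10.21) rather than a binned fit): it draws the four normal variates of (8.10.19), forms the product (8.10.18), and extracts the log-normal mean via (8.10.7).

```python
# Monte Carlo sketch of the simulated coin experiment: each guess is the
# stochastic product (8.10.18) of the normal variates (8.10.19).
import math
import random

random.seed(8)

def simulate_guesses(n):
    """Draw n guesses X = (pi/3)(r^2 + rR + R^2) H C."""
    guesses = []
    for _ in range(n):
        r = random.gauss(3.0, 0.7)    # base radius (cm)
        R = random.gauss(5.0, 1.0)    # mouth radius (cm)
        H = random.gauss(20.0, 2.0)   # height (cm)
        C = random.gauss(1.0, 0.2)    # coins per cm^3
        x = (math.pi / 3.0) * (r * r + r * R + R * R) * H * C
        if x > 0:                     # discard the rare nonphysical draws
            guesses.append(x)
    return guesses

def lognormal_ml(guesses):
    """Ungrouped ML estimates: mu = mean(ln X), sigma^2 = var(ln X)."""
    logs = [math.log(x) for x in guesses]
    mu = sum(logs) / len(logs)
    var = sum((y - mu) ** 2 for y in logs) / len(logs)
    return mu, var

guesses = simulate_guesses(200_000)
mu_hat, var_hat = lognormal_ml(guesses)
ln_mean = math.exp(mu_hat + 0.5 * var_hat)  # log-normal mean, (8.10.7)
print(round(ln_mean))
```

The extracted mean lands close to the sample mean of roughly 1055 coins reported in Table 8.6, and the fitted parameters are near the tabulated \hat{\mu} = 6.890 and \hat{\sigma} = 0.385.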
The first feature to note is the striking visual accord between the data and the log-normal fit. Although the match of a log-normal distribution to the distribution
arising from computer simulation is not perfect, it is so awesomely close (and has
remained that way for numerous repetitions of the experiment) that one must
discount any attribution to coincidence. The explanation is both simple and subtle.
First the simple part. Consider a random variable X to be the product of a large number J of arbitrary non-negative random variables X = V_1 V_2 \cdots V_J with finite moments. Then ln X takes the form of a sum of variates

\ln X = \sum_{j=1}^{J} \ln V_j = \sum_{j=1}^{J} Y_j \xrightarrow{\text{CLT}} N(\mu, \sigma^2)    (8.10.20)
Fig. 8.11 Computer-simulated coin experiment. Top panel: plot of frequency against class
value for (1) 106 guesses (gray) arrived at by the stochastic product (8.10.18) with normal
variates (8.10.19); (2) the theoretical log-normal density (solid black) and (3) the Mixed-NU
density (dashed), all model parameters being determined by maximum likelihood. Bottom
panel: the corresponding cumulative probabilities with the same traces as in the top panel.
Although convergence of a sum to normality can be rapid (we encountered such a case in Chapter 1 with a sum of just three uniform variates), this is not the complete explanation. More to the point is that the product of normal
variates, whether independent (as in the product rRHC) or correlated (as in the
product R2HC) yields a log-normal variate either exactly or approximately, rather
than asymptotically. Thus the number of random variables in the product is not
really key to the outcome. I demonstrate these properties explicitly in the appendix on log-normal variates.

Other items to note concern the numerical details of the experiment, summarized in Table 8.6.

Table 8.6

Sample size n                              1 000 000
Number of bins                             400
Bin width                                  ~10
LN model:
  Gaussian mean \hat{\mu}                  6.890
  Gaussian standard deviation \hat{\sigma} 0.385
  LN mean and SE                           1058 ± 0.6
  LN percent error                         −4.8%
Mixed-NU model:
  Uniform range (a, b)                     (42, 4327)
  Gaussian fraction f                      0.986
  Gaussian mean \mu                        1038
  Gaussian standard deviation \sigma       373.6
  Mixed-NU mean and SE                     1054 ± 0.4
  Mixed-NU percent error                   −5.1%
Sample mean and SE                         1055 ± 0.4
Sample percent error                       −5.1%
First, as seen in Table 8.6, there is little difference (albeit a statistically significant
one given the recorded standard errors) between the sample mean and the mean
arrived at by a log-normal fit with ML parameters. In other words, a sample size of
one million trials approximates an infinite sample well enough that the normalized
set of empirical frequencies {n_k/n} can be taken for all practical purposes as the true
probability function. The log-normal fit reproduces this probability function sufficiently closely to yield nearly the same value for the sample mean, although not
closely enough to pass a chi-square test. Given the objective, however, the latter
circumstance is unimportant. All that matters in the context of this investigation into
the wisdom of crowds is to get the best estimate that a crowd (which, now with
1 million respondents, really is a crowd) can provide. For the assumptions (8.10.19)
that have gone into the simulation, one cannot do better than the sample mean ~1055
in Table 8.6.
A second point concerns the matter of grouped vs ungrouped data. Given the
effectively infinite sample size, it was actually better to calculate the ML parameters
with ungrouped data; i.e. from the relations
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \ln X_i \qquad\qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \ln X_i - \hat{\mu} \right)^2    (8.10.21)
where the sum is over elements, instead of from (8.10.11) in which the sum is over
classes. With grouping of data some information is always lost, but grouping of some
kind is necessary in order to visualize and model an empirical distribution. If all one
wants is the ML mean and standard error, grouping is not necessary but I have
found that model predictions based on parameters derived from ungrouped data
gave less satisfactory results than predictions based on parameters derived from
grouped data when sample size was small.
A third point concerns the calculation of standard error, i.e. the standard deviation of the mean, a topic discussed at some length in Chapter 1. That discussion, however, pertained to a simpler situation, which does not apply now. The variance of the mean of the log-normal model is not simply the variance of a single estimate divided by the size of the sample. This understates the actual uncertainty because it does not take account of the uncertainty in the ML parameters upon which the mean depends. More generally, the variance of a function X = f(\mu, \sigma) of random variables (\mu, \sigma) must be calculated from the conditional expectation and conditional variance of X given the variables (\mu, \sigma):

\mathrm{var}\, X = \langle \mathrm{var}(X \,|\, \mu, \sigma) \rangle + \mathrm{var}\big( \langle X \,|\, \mu, \sigma \rangle \big).    (8.10.22)
A. M. Mood, F. A. Graybill, and D. C. Boes, Introduction to the Theory of Statistics 3rd Edition (McGraw-Hill,
New York, 1974) 159.
For the log-normal model, the variances and covariance of the ML parameter estimates are

\mathrm{var}(\hat{\mu}) = \frac{\hat{\sigma}^2}{n} \qquad \mathrm{var}(\hat{\sigma}) = \frac{\hat{\sigma}^2}{2n} \qquad \mathrm{cov}(\hat{\mu}, \hat{\sigma}) = 0    (8.10.24)

which, substituted into (8.10.22), lead to the variance of the model mean

\mathrm{var}\langle X \rangle = \frac{\langle X \rangle^2}{n} \left[ \left( e^{\hat{\sigma}^2} - 1 \right) + \hat{\sigma}^2 + \frac{\hat{\sigma}^4}{2} \right].    (8.10.25)
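Plugging in the ML parameters of Table 8.6 shows how (8.10.25) reproduces the quoted standard error; this is a small arithmetic check of my own.

```python
# Standard error of the log-normal model mean via (8.10.25), using the
# ML parameters of Table 8.6 (mu = 6.890, sigma = 0.385, n = 10**6).
import math

n = 1_000_000
mu_hat, sigma_hat = 6.890, 0.385

mean_ln = math.exp(mu_hat + 0.5 * sigma_hat ** 2)  # (8.10.7): ~1058 coins
s2 = sigma_hat ** 2
var_mean = (mean_ln ** 2 / n) * ((math.exp(s2) - 1) + s2 + s2 ** 2 / 2)
print(round(mean_ln), round(math.sqrt(var_mean), 1))  # about 1058 and 0.6
```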
I conclude this section with two further observations: one regarding my explanation of the generality of the log-normal distribution in estimation experiments, and
the other concerning the relationship and statistical implications of the two models
(Mixed-NU and LN) for extracting the knowledge of a group.
Although I arrived (theoretically and by computer simulation) at a log-normal distribution by examining step by step how I would myself estimate the number of coins in a jar, as embodied by the stochastic product (8.10.18), I would emphasize that I did not expect all (or perhaps even most) of the respondents emailing their
that I did not expect all (or perhaps even most) of the respondents emailing their
guesses to The One Show to have arrived at their estimates in the same way. Most
participants probably had no idea what a frustum was or how to calculate its
volume. This is, however, an unimportant geometric detail. One could model the
container simply as a box in the shape of a rectangular solid; the independent
variations in the product of height, length, and width would again generate a
distribution resembling a log-normal distribution. The seminal point of my explanation is that many respondents probably reasoned in some analogous way, i.e. they
estimated the number of coins by multiplying several linear dimensions and a coin
density. If the variation in each stochastic variable resembled a normal distribution,
then a log-normal distribution of guesses was bound to emerge.
Now, as to implications. If a log-normal distribution accurately represents the
diverse conjectures of the members of a group, then it might appear (from Tables 8.4
and 8.5) that The One Show coin group was much less adept at numerical estimation than the results of the Mixed-NU model would indicate. The two models,
however, are not necessarily in conflict; they presume different populations, serve
different functions, and provide different information.
The Mixed-NU model is intended to assess the best collective guess of a particular
group responding to a single, specific query. This might be the kind of information
sought, for example, if one wanted to know right away the opinion of a class of
physics students, or the shoppers in a food market, or the studio audience of a
television game show whom the contestant can solicit collectively one time for advice
in answering a question. A repetition of the experiment with a different group of the
same size would probably lead to different statistics. The LN model, presuming that a log-normal distribution actually occurs ubiquitously, assesses the hypothetical collective response of a (practically) infinite-sized group, in effect the parent population of all diversely knowledgeable, independently operating respondents who make
an effort to answer the posed question accurately. This might be the kind of information sought, for example, if one wanted to ascertain the opinion of a large group
on some technical issue to be decided in an approaching referendum. As the number
of respondents grew in time and more data were accumulated through periodic
polling, the distribution of responses would asymptotically become log normal with
a mean approaching the true mean of the entire population.
Regarding the italicized words above, when or why would a log-normal distribution not be expected to occur? Note that a log-normal distribution in my simulated
coin experiment arose repeatedly (i.e. with each simulation of one million estimates)
as a consequence of a knowledgeable calculational effort i.e. an estimating and
multiplying together of various uncertain factors and not as a result of unmotivated
random guessing. In other words, the log-normal distribution arose when the computer-simulated group comprised "informed", rather than "uninformed", members, to use the words that inspired my simple Mixed-NU model in the first
place. If members of a group are largely uninformed in regard to some query, then
I would think that a mixed log-normal-uniform (Mixed-LNU) model would depict
the outcome better, as recorded in Table 8.5 for the BBC-Silverman coin experiment.
For groups of small size, the distributions of responses in my experiments did not
much resemble a log-normal distribution; there were too few samples. For a very
large group of informed respondents, however, one would expect the Mixed-NU
model to classify most of the responses in the informed category i.e. to come up
with a parameter f close to unity and thereby arrive at a mean value close to the
sample mean. Stated somewhat differently, the Mixed-NU, LN, and Mixed-LNU
models should all do about equally well in fact, as well as could theoretically be
expected.
This is precisely what occurred when the Mixed-NU model was fit to the computer-simulated coin experiment, as shown by the dashed traces in Figure 8.11
with the statistical details given in the lower part of Table 8.6. Although the shape
of the resulting probability function does not fit the histogram of guesses as well
as does a log-normal density, in all cases where the ML equations of the model
could be solved numerically the resulting Gaussian fractional component was
within a few points of 100% and the model mean was virtually identical to the
sample mean.
8.11 Conclusions: so how wise are crowds?
In concluding this investigation into the assessment and efficacy of collective
judgment, I will highlight what I believe are useful lessons to be drawn from
the statistical exercises performed with my students and the viewers of BBCs The
One Show.
I began the project with an objective: to find out whether the collective guess of a group in regard to various quantifiable matters (counts, weights, lengths, problem solving, etc.) is better than the best guess by individuals within the group. In virtually every
trial with students, the mean response of the group did not surpass the best response
of one or more individuals. Indeed, in most cases, the judgment of the group was
considerably worse. The same was true with the BBC cake and coin trials. Four of the
shoppers queried in Borough Market guessed a cake mass of 5.3 kg, which was just
15 g below the exact mass. The mean of the 123 samples, while close, was still off the
mark by 101 g. Likewise, four of the BBC respondents emailed the exact number
(1111) of coins in the glass, whereas the mean of the 1706 samples (not counting the
saboteur) was off the mark by 129 coins.
From the perspective of potential utility, however, the key question is whether
those individuals giving the most accurate responses would do so again if the experiment in which they excelled were repeated. In other words, were they experts giving
expert opinion, or merely lucky guessers? If, for some cogent reason, you were
charged with the task to estimate the number of coins in the Royal Mint, would
you prefer to form a committee drawn from passersby in a London street, or hire
those four respondents on The One Show? The experiments on The One Show were
not repeated, so we will never know whether those respondents were experts. However, I did repeat some of my GOG experiments and found that individual best
responses were mostly lucky guesses. But there were exceptions. The individual who
estimated most closely the height of the classroom ceiling was on the college basketball team and apparently had a very good idea of his height and reach. He simply
stood up, raised an arm to the ceiling, and judged the distance accurately. If I were
charged by the Dean of Faculty to estimate the height of classroom ceilings, I would
prefer to hire this student rather than form a decanal committee of students and
faculty selected randomly on the college campus.
The next stage of the project had the objective of determining how best to extract
what information a group might provide. In the absence of any prior suppositions
concerning the composition of the group, the best group judgment would simply be
the sample mean (with associated uncertainty). But for small to moderately sized
groups, a single sample mean could be a poor assessment of the best response that
members of the group could give. Making the assumption that some individuals in
the group are more knowledgeable than others, I then developed a model (Mixed-NU) that could objectively (i.e. without my knowing any of the individuals) weight
the more informed opinions to a greater extent than uninformed, random guesses.
In the few cases where I could try the model on populations of statistically useful
size, the model yielded group means significantly closer to the known true values than
the unweighted sample means. How well such a model might work in other trials
remains to be seen.
In creating and testing other models, the examination of group responses strongly
suggested to me that,
23 Whenever invited to give a lecture on this topic, I am usually asked what fields of study I had in mind in this
line of advice. It would of course be imprudent, if not offensive to some in the audience, to be specific. My reply
to the inquirer and to the reader is the same: you can probably answer that question yourself.
Appendices
In this appendix it is shown that

$$\sum_{j=n_m}^{n}\binom{n}{j}p^j(1-p)^{n-j} \qquad (8.12.1)$$

$$=\frac{\displaystyle\int_0^p x^{n_m-1}(1-x)^{n-n_m}\,dx}{B(n_m,\,n-n_m+1)} \qquad (8.12.2)$$

for some integer $n_m\le n$. This is done by iterated integration by parts of the integral in
(8.12.2), as shown below to the third level of reduction (writing $m\equiv n_m-1$)

$$\int_p^1 x^{m}(1-x)^{n-m-1}\,dx=\frac{p^m(1-p)^{n-m}}{n-m}+\frac{m\,p^{m-1}(1-p)^{n-m+1}}{(n-m)(n-m+1)}+\frac{m(m-1)\,p^{m-2}(1-p)^{n-m+2}}{(n-m)(n-m+1)(n-m+2)}+\frac{m(m-1)(m-2)}{(n-m)(n-m+1)(n-m+2)}\int_p^1 x^{m-3}(1-x)^{n-m+2}\,dx. \qquad (8.12.3)$$

The pattern that unfolds is clear, and it does not take much algebraic rearrangement to
show that (8.12.3), upon multiplication by $n\binom{n-1}{m}=1/B(n_m,\,n-n_m+1)$, leads to the complement of the left side of (8.12.1)

$$\frac{\displaystyle\int_p^1 x^{n_m-1}(1-x)^{n-n_m}\,dx}{B(n_m,\,n-n_m+1)}=\sum_{j=0}^{n_m-1}\binom{n}{j}p^j(1-p)^{n-j}=1-\sum_{j=n_m}^{n}\binom{n}{j}p^j(1-p)^{n-j}, \qquad (8.12.4)$$

which reduces to (8.12.1) upon use of the defining relation for a beta function

$$B(n_m,\,n-n_m+1)=\int_0^1 x^{n_m-1}(1-x)^{n-n_m}\,dx=\int_0^p x^{n_m-1}(1-x)^{n-n_m}\,dx+\int_p^1 x^{n_m-1}(1-x)^{n-n_m}\,dx. \qquad (8.12.5)$$
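The identity (8.12.1)–(8.12.2) is easy to confirm numerically. The sketch below, with arbitrary test values of n, n_m, and p, compares the binomial tail sum against the incomplete beta integral evaluated by a simple midpoint rule (the step count is an illustrative choice).

```python
# Numerical check of (8.12.1)-(8.12.2): the upper tail of Bin(n, p) equals a
# normalized incomplete beta integral. Quadrature parameters are illustrative.
from math import comb, gamma

def binomial_tail(n, n_m, p):
    """Left side: sum of Bin(n, p) probabilities for j = n_m ... n."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n_m, n + 1))

def incomplete_beta_ratio(n, n_m, p, steps=200000):
    """Right side: integral_0^p x^(n_m-1)(1-x)^(n-n_m) dx / B(n_m, n-n_m+1)."""
    a, b = n_m, n - n_m + 1
    B = gamma(a) * gamma(b) / gamma(a + b)
    h = p / steps
    # midpoint rule suffices: the integrand is smooth on [0, p]
    integral = sum(((k + 0.5) * h)**(a - 1) * (1 - (k + 0.5) * h)**(b - 1)
                   for k in range(steps)) * h
    return integral / B

for n, n_m, p in [(10, 4, 0.3), (25, 12, 0.5), (8, 2, 0.7)]:
    assert abs(binomial_tail(n, n_m, p) - incomplete_beta_ratio(n, n_m, p)) < 1e-6
```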
8.13 Solution to logic problem #1: how old are the children?
The product of the three children's ages equals 36. This can be achieved in the
following ways, where (n1, n2, n3) lists the ages in decreasing order.
Ages of three children    Sum of ages
(9, 2, 2)                 13
(9, 4, 1)                 14
(4, 3, 3)                 10
(6, 3, 2)                 11
(6, 6, 1)                 13
(12, 3, 1)                16
(18, 2, 1)                21
The sum of the ages is not specified in the problem, but it cannot determine the ages
uniquely, since B requested further information. Thus the sum must be 13, since that is
the only sum produced by two distributions, (9, 2, 2) and (6, 6, 1). Of these two possibilities, however, only (9, 2, 2) has
an eldest child, there being two older children of the same age in (6, 6, 1). The answer,
therefore, is (9, 2, 2). The extra information of blue eyes was just a red herring.
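The reasoning above can be checked by brute force. This sketch enumerates every age triple with product 36 (in ascending order, so the answer appears as (2, 2, 9)), keeps those whose sum is ambiguous, and then demands a strictly eldest child.

```python
# Brute-force check of logic problem #1. Triples are generated in ascending
# order (a <= b <= c); the enumeration also finds (1, 1, 36), sum 38.
from itertools import combinations_with_replacement
from collections import Counter

triples = [t for t in combinations_with_replacement(range(1, 37), 3)
           if t[0] * t[1] * t[2] == 36]
sums = Counter(sum(t) for t in triples)
ambiguous = [t for t in triples if sums[sum(t)] > 1]   # sum alone insufficient
answer = [t for t in ambiguous if t[1] < t[2]]         # a strictly eldest child
assert answer == [(2, 2, 9)]
```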
Fig. 8.12 Map of college campus showing the solution (location of treasure) to logic
problem #2. [The figure plots, in the complex plane (Re, Im), the Alehouse, the
Metaphysics and Alchemy Buildings at z1 and z2, the two spikes s1 and s2, and the
treasure on the imaginary axis.]
The treasure lies at the midpoint of the two spikes

$$\frac{s_1+s_2}{2}=\frac{z_1+z_2}{2}-i\,\frac{z_1-z_2}{2}=(0,-1), \qquad (8.14.1)$$

which places it one unit below the origin on the imaginary axis. Surprisingly, as it
turns out, as long as the locations of the Metaphysics and Alchemy Buildings are
known, the initial location of the Alehouse does not matter.
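That independence is quickly demonstrated in complex arithmetic. In the sketch below the buildings sit at z1 = +1 and z2 = -1; the rotation senses used to plant the spikes (a quarter-turn one way at z1, the other way at z2) are an assumed convention chosen to reproduce (8.14.1).

```python
# The treasure's location is independent of where the alehouse stands.
# Conventions here (building positions, rotation senses) are assumptions
# for illustration, not taken from the problem statement verbatim.
import random

z1, z2 = 1 + 0j, -1 + 0j          # the two buildings on the real axis
random.seed(1)
for _ in range(5):
    a = complex(random.uniform(-10, 10), random.uniform(-10, 10))  # alehouse
    s1 = z1 + 1j * (a - z1)       # rotate the leg a -> z1 by a quarter-turn about z1
    s2 = z2 - 1j * (a - z2)       # rotate the leg a -> z2 the opposite way about z2
    treasure = (s1 + s2) / 2      # midpoint of the two spikes
    assert abs(treasure - (0 - 1j)) < 1e-12   # always at -i, whatever a is
```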
$$\ln X=Y(\mu,\sigma^2) \qquad (8.15.1)$$

where

$$Y(\mu,\sigma^2)=\mu+\sigma N(0,1) \qquad (8.15.2)$$

is a normal random variable of mean $\mu$ and variance $\sigma^2$. The second expression for
Y, which we already encountered in Chapter 1 (see Eq. (1.10.4)), will be particularly
useful shortly in examining the distribution of a product of normal variates. First,
however, let us consider how to calculate the moments of X.
8.15.1 Moments of a log-normal distribution
The inverse of Eq. (8.15.1), X = eY, provides a far more convenient way to calculate
the moments of a log-normal distribution than direct use of the probability density
(pdf ), moment-generating (mgf ), or characteristic (cf ) functions. Recall that the mgf
of a normal variate is
$$g_Y(t)=\langle e^{Yt}\rangle=e^{\mu t+\frac{1}{2}\sigma^2t^2}. \qquad (8.15.3)$$

It then follows immediately that the moments of X can be obtained directly from
(8.15.3)

$$\langle X^n\rangle=g_Y(n)=e^{n\mu+\frac{1}{2}n^2\sigma^2}\qquad n=0,1,2\ldots \qquad (8.15.4)$$
by setting the argument t equal to the order n of the sought-for moment. This leads to
expressions (8.10.6)–(8.10.10) or, in general, to the following mgf and cf

$$g_X(t)=\langle e^{Xt}\rangle=\sum_{n=0}^{\infty}\frac{t^n}{n!}\langle X^n\rangle=\sum_{n=0}^{\infty}\frac{t^n}{n!}\,e^{n\mu+\frac{1}{2}n^2\sigma^2}\qquad\qquad h_X(t)=\langle e^{iXt}\rangle=\sum_{n=0}^{\infty}\frac{(it)^n}{n!}\,e^{n\mu+\frac{1}{2}n^2\sigma^2}. \qquad (8.15.5)$$

If X is instead a product of n independent normal variates

$$X=\prod_{i=1}^{n}Y_i(\mu_i,\sigma_i^2), \qquad (8.15.6)$$
then, by means of relation (8.15.2), the log of X can be cast into the form of a sum of
logarithms

$$\ln X=\sum_{i=1}^{n}\ln Y_i(\mu_i,\sigma_i^2)=\sum_{i=1}^{n}\ln\bigl[\mu_i\bigl(1+\varepsilon_i N_i(0,1)\bigr)\bigr]=\sum_{i=1}^{n}\ln\mu_i+\sum_{i=1}^{n}\ln\bigl(1+\varepsilon_i N_i(0,1)\bigr) \qquad (8.15.7)$$

where

$$\varepsilon_i\equiv\sigma_i/\mu_i \qquad (8.15.8)$$

and the subscript on the symbol $N_i(0,1)$ emphasizes the independence of each
standard normal variate in the sum.
Upon expansion of the log functions $\ln(1+\varepsilon_i N(0,1))$ in a Taylor series

$$\ln(1+\varepsilon)=\sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}\,\varepsilon^n=\varepsilon-\frac{\varepsilon^2}{2}+\frac{\varepsilon^3}{3}-\cdots \qquad (8.15.9)$$

and truncation at first order in $\varepsilon_i$ (valid for well-localized Gaussian functions) the
exact equation (8.15.7) takes the approximate form of a sum of normal random
variables

$$\ln X\approx\sum_{i=1}^{n}\ln\mu_i+\sum_{i=1}^{n}\varepsilon_i N_i(0,1)=N\!\left(\sum_{i=1}^{n}\ln\mu_i,\;\sum_{i=1}^{n}\varepsilon_i^2\right)=N\!\left(\ln\prod_{i=1}^{n}\mu_i,\;\sum_{i=1}^{n}\varepsilon_i^2\right) \qquad (8.15.10)$$

which is equivalent [again through use of (8.15.2)] to a single normal variate $N(\mu,\sigma^2)$
with mean and variance

$$\mu=\ln\prod_{i=1}^{n}\mu_i\qquad\qquad \sigma^2=\sum_{i=1}^{n}\varepsilon_i^2. \qquad (8.15.11)$$
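A minimal Monte Carlo sketch, with illustrative means and standard deviations, bears out (8.15.10)–(8.15.11): for narrow normal factors, the log of the product is approximately normal with the stated mean and variance.

```python
# Monte Carlo check of (8.15.10)-(8.15.11). Parameters and sample size are
# illustrative; sigma_i << mu_i so the first-order truncation is valid.
import math, random

random.seed(2)
mus = [2.0, 5.0, 3.0]
sigmas = [0.04, 0.10, 0.06]            # eps_i = sigma_i/mu_i = 0.02 throughout
mu_pred = math.log(math.prod(mus))     # ln(prod mu_i)
var_pred = sum((s / m) ** 2 for m, s in zip(mus, sigmas))

samples = []
for _ in range(200000):
    x = 1.0
    for m, s in zip(mus, sigmas):
        x *= random.gauss(m, s)        # product of independent normal variates
    samples.append(math.log(x))

mean = sum(samples) / len(samples)
var = sum((v - mean) ** 2 for v in samples) / len(samples)
assert abs(mean - mu_pred) < 2e-3
assert abs(var - var_pred) < 2e-4
```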
The same expansion applies to the nth power of a single normal variate

$$X=Y(\mu,\sigma^2)^n, \qquad (8.15.12)$$

for which

$$\ln X=n\ln\mu+n\ln\bigl(1+\varepsilon N(0,1)\bigr) \qquad (8.15.13)$$

can likewise be expanded and truncated (for sufficiently small $\varepsilon$) at first order to yield the approximate relation

$$\ln X\approx N\bigl(\ln\mu^n,\;n^2\varepsilon^2\bigr). \qquad (8.15.14)$$
Consider next the sum of two log-normal variates

$$Z=e^{Y_1(\mu_1,\sigma_1^2)}+e^{Y_2(\mu_2,\sigma_2^2)} \qquad (8.15.15)$$

$$=e^{\mu_1+\sigma_1N_1(0,1)}+e^{\mu_2+\sigma_2N_2(0,1)}. \qquad (8.15.16)$$

Under conditions where linearization of the exponential forms in (8.15.16) is justified, the variate Z is representable as a normal variate with respective mean and
variance

$$\mu_Z=e^{\mu_1}+e^{\mu_2}\qquad\qquad \sigma_Z^2=\sigma_1^2e^{2\mu_1}+\sigma_2^2e^{2\mu_2}.$$
9
The random flow of energy
Part I
Power to the people
B. Jones, The Life and Letters of Faraday (1870), Vol. 2, p. 404. (Quotation from Faraday's lecture notes of 1858.)
Fig. 9.1 Discrete time series {x_t} (gray dots) of electrical energy usage (kWh) for a period of
120 consecutive months. The connecting black lines serve to guide the eye.
The sample mean of the record is

$$\bar{x}=\frac{1}{N}\sum_{t=1}^{N}x_t=409.24\ \text{kWh}. \qquad (9.2.1)$$
(As an aside, I note that the average monthly electric energy consumption of a US
residential utility customer in 2011 (the last year for which I have data) was 940
kWh.)2 The fluctuations result in a sample variance
$$s_x^2=\frac{1}{N}\sum_{t=1}^{N}(x_t-\bar{x})^2=(107.49)^2\ \text{kWh}^2. \qquad (9.2.2)$$
Beyond these obvious features, the raw data reveal little about the underlying
pattern . . . if there is one.
The slight asymmetry of the record about the sample mean signifies that the time
series is non-stationary: as the series progresses from left to right, the sample mean
decreases in time. As discussed in previous chapters, it is usually necessary to remove
the mean and slope when mining a time series for information. A time series of non-zero mean produces a large spike in the power spectrum at zero frequency and leads
to slow damping of the autocorrelation. Likewise, a non-zero trend produces low-frequency oscillations in the power spectrum that can obscure important features.
The panels of Figure 9.2 show the results of various operations on {x_t} to prepare
the record for further analysis. The first panel shows the time series {y_t}

$$y_t=x_t-\bar{x}-\dot{\bar{x}}\,\bigl(t-\tfrac{1}{2}(N+1)\bigr)\qquad t=1\ldots N \qquad (9.2.3)$$

transformed to eliminate the mean $\bar{x}$ and slope $\dot{\bar{x}}$

$$\dot{\bar{x}}=\frac{1}{[N/3]\,\bigl(N-[N/3]\bigr)}\left[\sum_{t=N-[N/3]+1}^{N}x_t-\sum_{t=1}^{[N/3]}x_t\right] \qquad (9.2.4)$$

in which the bracketed expression [N/3] signifies the largest integer less than or equal
to N/3. Equation (9.2.4) is the discrete form of the slope of a continuous time record

$$\dot{\bar{x}}=\frac{1}{(T/3)(2T/3)}\left[\int_{2T/3}^{T}x(t)\,dt-\int_{0}^{T/3}x(t)\,dt\right] \qquad (9.2.5)$$

of duration T. A close look at the series {y_t}, in particular its symmetrical fluctuation
about the horizontal baseline (y = 0), shows that it does indeed appear to have zero
mean and zero slope, two statistics readily verified by direct computation, together
with the sample variance
2 http://www.eia.gov/tools/faqs/faq.cfm?id=97&t=3
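The detrending operation of (9.2.3)–(9.2.4) can be sketched in a few lines. The actual energy data are not reproduced here, so the example applies the slope-from-thirds estimator to a synthetic linear-plus-noise record with an assumed trend.

```python
# Sketch of (9.2.3)-(9.2.4): estimate the slope from the difference between the
# means of the last and first thirds of the record, then remove mean and trend.
# The synthetic series (intercept, slope, noise level) is an assumption.
import random

random.seed(3)
N = 120
x = [400.0 - 0.8 * t + random.gauss(0.0, 20.0) for t in range(1, N + 1)]

third = N // 3                                  # [N/3]
xbar = sum(x) / N
slope = (sum(x[N - third:]) - sum(x[:third])) / (third * (N - third))
y = [x[t - 1] - xbar - slope * (t - (N + 1) / 2) for t in range(1, N + 1)]

assert abs(slope + 0.8) < 0.3                   # recovers the imposed trend
assert abs(sum(y) / N) < 1e-6                   # zero mean by construction
```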
Fig. 9.2 Top panel: time series {y_t} of Figure 9.1 adjusted for zero mean and zero trend. The
other panels show the respective difference series: second panel, {∇₁y_t}; third panel, {∇₁₂y_t};
and bottom panel, {∇₁∇₁₂y_t}. [Each panel spans −400 to +400 kWh against time in months.]
$$s_y^2=\frac{1}{N}\sum_{t=1}^{N}y_t^2=(96.36)^2=9284.7\ \text{kWh}^2. \qquad (9.2.6)$$
Unless needed for clarity, physical units like kWh will be omitted in the remainder of
the chapter.
The second panel in Figure 9.2 shows the first-difference time series {u_t} of lag
1 defined by the expression

$$u_t=\nabla_1 y_t=y_t-y_{t-1}\qquad t=2\ldots N. \qquad (9.2.7)$$
The third panel shows the difference series {v_t} of lag 12

$$v_t=\nabla_{12}\,y_t=y_t-y_{t-12}\qquad t=13\ldots N \qquad (9.2.8)$$

and the fourth panel shows the multiplicative difference series at both lags

$$w_t=\nabla_1\nabla_{12}\,y_t=v_t-v_{t-1}=y_t-y_{t-1}-y_{t-12}+y_{t-13}\qquad t=14\ldots N. \qquad (9.2.9)$$
As a matter of convention, the subscript 1 is usually dropped from the nabla (∇) in
the case of a first difference at lag 1. For clarity, and in particular to distinguish
discrete differencing from the gradient operation, I will retain the subscript in
all cases. Another matter of notation: use of the backward-shift (or more simply:
backshift) operator B, introduced in Chapter 6, allows one to express differencing
in a notationally simple way that will later facilitate the algebraic manipulation of
time series

$$u_t=(1-B)\,y_t\qquad\quad v_t=(1-B^{12})\,y_t\qquad\quad w_t=(1-B)(1-B^{12})\,y_t. \qquad (9.2.10)$$
Differencing reduces the number of elements in a time series, but the loss of these first
few elements is usually of no statistical consequence for a long series.
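The differencing operations above can be sketched directly; a single helper applied at lags 1 and 12 reproduces the expansion of the composite operator in (9.2.9). The test series is an arbitrary stand-in.

```python
# Differencing as in (9.2.7)-(9.2.10): nabla(series, s) implements (1 - B^s);
# composing the lag-12 and lag-1 differences gives the w series.
def nabla(series, s=1):
    """Difference at lag s: z_t - z_{t-s}; the first s elements are consumed."""
    return [series[t] - series[t - s] for t in range(s, len(series))]

y = [float(t * t % 17) for t in range(1, 41)]   # any test series
u = nabla(y, 1)
v = nabla(y, 12)
w = nabla(nabla(y, 12), 1)

# direct check of (9.2.9): w_t = y_t - y_(t-1) - y_(t-12) + y_(t-13)
direct = [y[t] - y[t - 1] - y[t - 12] + y[t - 13] for t in range(13, len(y))]
assert w == direct
```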
The utility of the difference series will become apparent when we examine the
sample autocorrelation functions. One can see, however, from looking at Figure 9.2
that a difference series gives a visual appearance of greater randomness compared to
the original series. For example, the series {y_t} of Figure 9.2 (top panel) shows several
peaks at roughly 12-month intervals. These peaks, assuming they represent real
information and are not merely statistical fluctuations, have vanished in the series
{w_t} (bottom panel). One of the strategies employed in solving a finite-difference
equation that results from a particular model of randomness is to operate on the
original time series so as to reduce it to white noise, represented by a random variable
of mean 0 and stationary variance, such as the following

$$\varepsilon_t=N(0,\sigma_\varepsilon^2). \qquad (9.2.11)$$
The sample autocorrelation function of a time series {z_t} is defined by

$$r_z(k)=\frac{\displaystyle\sum_{t=t_0}^{N-k}(z_t-\bar{z})(z_{t+k}-\bar{z})}{\displaystyle\sum_{t=t_0}^{N}(z_t-\bar{z})^2}\qquad k=0\ldots m \qquad (9.2.12)$$

in which $t_0$ marks the first element of the series ($t_0$ = 1 for y, 2 for u, 13 for v, and 14 for w).
Fig. 9.3 Autocorrelation (solid line) of the time series in Figure 9.2. Top panel, r_y(k); second
panel, r_u(k); third panel, r_v(k); bottom panel, r_w(k). The dashed lines represent approximate limits
of plus and minus two standard deviations: ±2N^(−1/2), where N is the length of the time series.
The lag number k, ranging from 0 to maximum lag m, marks the
delay in units of the sampling interval Δt (here equal to one month). As a matter of
notation, the symbol r(k) will be used for the sample autocorrelation and ρ(k) for the
theoretical autocorrelation of a particular model. Also, although it was convenient in
Chapter 6 to express the lag number as a subscript (e.g. r_k), in this chapter lag will be
expressed as an argument if a subscript is being used to identify the time series.
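A direct implementation of (9.2.12) is short. Applied to a pure sinusoid of period 12 (an illustrative stand-in for the energy record), the autocorrelation itself oscillates with period 12, which is the signature seen in r_y(k).

```python
# Sample autocorrelation (9.2.12), with t_0 = 1 (index 0 in the list).
import math

def autocorr(z, m):
    zbar = sum(z) / len(z)
    denom = sum((zi - zbar) ** 2 for zi in z)
    return [sum((z[t] - zbar) * (z[t + k] - zbar) for t in range(len(z) - k)) / denom
            for k in range(m + 1)]

y = [math.cos(2 * math.pi * t / 12) for t in range(120)]
r = autocorr(y, 24)
assert abs(r[0] - 1.0) < 1e-12        # r(0) = 1 by definition
assert r[12] > 0.8 and r[24] > 0.6    # positive peaks at multiples of 12
assert r[6] < -0.8                    # anticorrelation at the half period
```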
The four panels of Figure 9.3 show the variation with k of the autocorrelation
functions defined in (9.2.12). As in the plots of the time series, the autocorrelation functions are discrete functions defined at points (not shown) connected by solid
lines to aid the eye. The autocorrelation r_y(k) (top panel) clearly reveals a pattern
buried in the original time series, namely a slowly decaying periodic correlation of
energy readings with 12-month periodicity. The pattern of correlations is very long
range, continuing beyond the arbitrarily chosen maximum lag number (72). The
periodicity is extraordinarily precise: peaks are seen to occur at exact multiples of
12 months despite the noise content in the corresponding series {y_t}. We have seen in
a previous chapter that Brownian noise gives rise to long-range correlations; the
pattern in r_y(k) is entirely different from that of Brownian noise.
Autocorrelation r_u(k) (second panel) shows that differencing at lag 1 has eliminated nearly all correlations except those at lag numbers k = 1, 12, 24, and 36. The
pair of dashed lines delimit the approximate boundaries of plus and minus two
standard deviations $s_r=1/\sqrt{N}$ (under the assumption that the noise is Gaussian)
by which to decide tentatively whether a particular correlation is statistically significant or not. Although no correlations at multiples of 12 higher than 3 appear to be
significant in the plot of r_u(k), the figure suggests that such correlations have merged
with the noise rather than actually vanished. The distinction between the two
alternatives is that, if the first is correct, one would expect correlations at k = 48,
60, etc. to become significant in longer repetitions (and therefore ones of smaller variance $s_r^2$)
of the same stochastic process.
Autocorrelation r_v(k) (third panel) shows that differencing at lag 12 has eliminated
all correlations except the correlation at k = 12 months. Although the correlation
r_v(5) exceeds the +2s_r boundary, I have no reason to believe there is anything
physically significant about 5-month intervals in my electric energy usage. On the
contrary, the physical significance of 12-month intervals (e.g. January to January to
January, etc.) is comprehensible. If the noise is approximately Gaussian, then one
would expect about 95% of the set of correlations {r(k)} to fall within the ±2s_r
limits. Therefore, 5%, or 1 in 20, should fall outside the limits purely by chance. It
should not be surprising, then, if at least 1 in a plot of 71 correlations (not counting
the constant r(0) = 1) should exceed +2s_r.
Finally, the autocorrelation r_w(k) (bottom panel) shows that after multiplicative
differencing at lags 1 and 12 the correlations remaining in the time series are
primarily at lag values k = 1 and (possibly) 11, 12, and 13. Note the importance
of the algebraic sign (+ or −) to the pattern of autocorrelations. For example,
in the second panel there is a positive correlation at 12 months and a negative
(anti-)correlation at 1 month, whereas in the bottom panel the correlations at
1 and 12 months are both negative.
The autocorrelation function of a difference series is theoretically derivable from
the autocorrelation function of the original series. Consider, for example, a stationary time series {y_t} of mean 0 for which the theoretical autocovariance functions

$$\gamma_y(k)=\langle y_t\,y_{t+k}\rangle\qquad k=0,\pm1,\pm2\ldots \qquad (9.2.13)$$

are known. The angular brackets in (9.2.13) signify an ensemble average. The function

$$\gamma_y(0)=\langle y_t^2\rangle=\sigma_y^2$$

is the variance of {y_t}, and the theoretical autocorrelation function of {y_t} is defined
by the ratio

$$\rho_y(k)=\gamma(k)/\gamma(0). \qquad (9.2.14)$$
Although a time series obtained from an actual experiment has a definite origin
(t = 0) and finite length (N), and the lag numbers of the associated sample {r_y(k)}
terminate at some designated maximum value (m), the time series of the underlying
hypothetical stochastic process is of infinite length with a theoretical autocovariance function {γ_y(k)} extending over the range (−∞ < k < ∞) symmetrically about
k = 0,

$$\gamma_y(k)=\gamma_y(-k). \qquad (9.2.15)$$

In other words, it is to be understood that |k| (rather than k) enters the argument of
(9.2.13) and (9.2.14) although, to keep notation simple, the absolute value sign will
not be employed unless needed for clarity.
The autocovariance of the series $u_t=\nabla_1 y_t$ can be obtained directly from the
expectation values

$$\gamma_u(k)=\langle u_t\,u_{t+k}\rangle=\langle (y_t-y_{t-1})(y_{t+k}-y_{t+k-1})\rangle=2\gamma_y(k)-\gamma_y(k-1)-\gamma_y(k+1) \qquad (9.2.16)$$

which yield

$$\rho_u(k)=\frac{2\gamma_y(k)-\gamma_y(k-1)-\gamma_y(k+1)}{2\bigl(\gamma_y(0)-\gamma_y(1)\bigr)}=\frac{\rho_y(k)-\frac{1}{2}\bigl(\rho_y(k-1)+\rho_y(k+1)\bigr)}{1-\rho_y(1)}. \qquad (9.2.17)$$
In the same way, one can derive the autocorrelation $\rho_v(k)$ of the series $v_t=\nabla_{12}y_t$ and
$\rho_w(k)$ of the series $w_t=\nabla_1\nabla_{12}y_t$

$$\rho_v(k)=\frac{\rho_y(k)-\frac{1}{2}\bigl(\rho_y(k-12)+\rho_y(k+12)\bigr)}{1-\rho_y(12)} \qquad (9.2.18)$$

$$\rho_w(k)=\frac{\rho_y(k)-\frac{1}{2}\bigl[\rho_y(k-1)+\rho_y(k+1)\bigr]-\frac{1}{2}\bigl[\rho_y(k-12)+\rho_y(k+12)\bigr]+\frac{1}{4}\bigl[\rho_y(k-11)+\rho_y(k+11)+\rho_y(k-13)+\rho_y(k+13)\bigr]}{1-\rho_y(1)-\rho_y(12)+\frac{1}{2}\bigl[\rho_y(11)+\rho_y(13)\bigr]}. \qquad (9.2.19)$$
Plots (not shown) of (9.2.17), (9.2.18), and (9.2.19) as functions of k, with the
theoretical ρ_y(k) approximated by the corresponding sample r_y(k), superpose the plots
of r_u(k), r_v(k), and r_w(k) in Figure 9.3 nearly perfectly.
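Relation (9.2.17) is easy to test numerically. The sketch below generates an AR(1)-like series as an arbitrary stand-in for {y_t}, differences it at lag 1, and compares the directly computed autocorrelations of the difference series with the values predicted from the autocorrelations of the parent series.

```python
# Numerical check of (9.2.17): r_u(k) computed directly vs. predicted from r_y(k).
import random

def autocorr(z, m):
    zbar = sum(z) / len(z)
    denom = sum((zi - zbar) ** 2 for zi in z)
    return [sum((z[t] - zbar) * (z[t + k] - zbar) for t in range(len(z) - k)) / denom
            for k in range(m + 1)]

random.seed(4)
y = [0.0]
for _ in range(20000):                       # an AR(1)-like stand-in series
    y.append(0.7 * y[-1] + random.gauss(0, 1))
u = [y[t] - y[t - 1] for t in range(1, len(y))]   # lag-1 difference

ry = autocorr(y, 12)
ru = autocorr(u, 10)
for k in range(1, 10):
    predicted = (ry[k] - 0.5 * (ry[k - 1] + ry[k + 1])) / (1 - ry[1])
    assert abs(ru[k] - predicted) < 0.05     # agreement within sampling error
```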
Equations (9.2.17)–(9.2.19) help explain the structure of the observed autocorrelation plots in Figure 9.3 even before we search for an underlying explanatory stochastic process. For example, look at the plot of r_y(k) (top panel) of Figure 9.3 and
consider the peak at k = 12. The theoretical expression (9.2.17) for r_u(12) calls for
subtracting from r_y(12) the mean of the two flanking values, r_y(11) and r_y(13). Since
r_y(11) ≈ r_y(13) and both are less than r_y(12), the outcome is a statistically significant
positive number. However, now consider the calculation of r_u(11), which calls for
subtracting from r_y(11) the mean of r_y(10) and r_y(12). From the nearly linear slopes of
the hilly waveform, one sees that the mean of r_y(10) and r_y(12) is very nearly equal to
r_y(11), and so the subtraction yields a result close to 0. As this condition pertains on
the left and right slopes of all hills of r_y(k), the resulting structure of r_u(k), apart from
the isolated anti-correlation at k = 1, resembles a comb of Dirac delta functions
located at lag numbers equal to multiples of 12. Similar reasoning can be applied to
account for the structure shown in the other panels.
We are faced, then, with a curious problem. My monthly usage of electrical energy
shows a strong, persistent (i.e. long-range), decaying, 12-month periodic correlation
with a triangular-appearing base waveform. Two of the difference series show anti-correlated energy consumption at intervals of one month (e.g. high in January, low in
February, high in March, low in April, etc.) and either positive or negative correlations at 12-month intervals. What law or process accounts for such a structured
pattern, given that the noise in the corresponding time series masks any overt
periodic structure?
To help answer that question, let us first consider the power spectrum of {y_t}.
9.3 Examining the data: frequency and power spectra
The last stage (for now) in the empirical investigation of the energy time series is to
obtain the power spectrum S(ν), calculable in any of several ways, each of which
illustrates points worth keeping in mind.
First, employing a form of the Wiener–Khinchin (WK) theorem, we can calculate3
S(ν) from the sample autocorrelation r_y(k) as a continuous function of frequency ν
or angular frequency ω = 2πν

$$S(\nu)=1+2\sum_{k=1}^{m-1}r_y(k)\cos(\omega k)+r_y(m)\cos(\omega m). \qquad (9.3.1)$$
3 The expression for S(ν) in (9.3.1) omits a scale factor $2s_y^2\Delta t$, which is unimportant if one is interested only in the location
and relative amplitude of the peaks.
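The smoothed spectrum (9.3.1) can be sketched directly from the sample autocorrelation. Applied to a noisy sinusoid of period 12 months (an illustrative stand-in for the energy record, with the scale factor omitted as in the text), the spectrum peaks near ν = 1/12 per month.

```python
# Power spectrum (9.3.1) from the sample autocorrelation; scale factor omitted.
import math, random

def autocorr(z, m):
    zbar = sum(z) / len(z)
    denom = sum((zi - zbar) ** 2 for zi in z)
    return [sum((z[t] - zbar) * (z[t + k] - zbar) for t in range(len(z) - k)) / denom
            for k in range(m + 1)]

def spectrum(r, nu):
    m = len(r) - 1
    w = 2 * math.pi * nu
    return 1 + 2 * sum(r[k] * math.cos(w * k) for k in range(1, m)) + r[m] * math.cos(w * m)

random.seed(5)
y = [math.cos(2 * math.pi * t / 12) + random.gauss(0, 0.3) for t in range(120)]
r = autocorr(y, 36)
nus = [j / 1000 for j in range(1, 500)]        # 0 < nu <= 0.5 (Nyquist)
peak = max(nus, key=lambda nu: spectrum(r, nu))
assert abs(peak - 1 / 12) < 0.01               # peak at the seasonal frequency
```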
Second, we can evaluate the spectrum at the discrete set of frequencies

$$\nu_j=\frac{j\,\nu_c}{m}\qquad j=0,1,2\ldots m, \qquad (9.3.2)$$

where

$$\nu_c=\frac{1}{2\Delta t}=0.5\qquad\text{for }\Delta t=1 \qquad (9.3.3)$$

is the Nyquist or cut-off frequency (introduced in Chapter 3), leading to m/2 independent (and therefore uncorrelated) spectral estimates

$$S_j=1+2\sum_{k=1}^{m-1}r_y(k)\cos\left(\frac{\pi jk}{m}\right)+(-1)^j\,r_y(m). \qquad (9.3.4)$$
Third, we can calculate the power directly from the Fourier coefficients of the series

$$S_j=a_j^2+b_j^2\qquad j=0,1,2\ldots N/2, \qquad (9.3.5)$$

where

$$a_0=\frac{1}{N}\sum_{t=1}^{N}y_t\qquad a_{j>0}=\frac{2}{N}\sum_{t=1}^{N}y_t\cos\left(\frac{2\pi jt}{N}\right)\qquad b_{j>0}=\frac{2}{N}\sum_{t=1}^{N}y_t\sin\left(\frac{2\pi jt}{N}\right). \qquad (9.3.6)$$

The time interval corresponding to a peak at harmonic j is $T_j=(N/j)\Delta t$. Note that the
harmonic numbers of a given period T are not the same for series (9.3.4), in which
the maximum lag m determines periods, and for (9.3.5), in which the duration
N of the time series determines periods.
And last, we can resort to a fast Fourier transform (FFT) that would execute the
same task as (9.3.6) by a different and much quicker algorithm. In the present
study this was not necessary because the length of the time series (N = 120) is so
short that implementation of all methods took only fractions of a second.
There is yet another method, to be discussed shortly, for obtaining the autocorrelation and power spectrum specific to two fundamental types of linear stochastic
processes that will ultimately be part of the solution to this investigation.
The upper panel of Figure 9.4 shows a panoramic plot of the spectral power S(ν) as
a function of ν computed from (9.3.1) for the entire frequency range (ν_c ≥ ν ≥ 0).
It reveals a large peak at about 0.08 (month)⁻¹, what appear to be at least three
much smaller, but potentially significant, peaks at higher frequencies, and confusing
oscillatory structure at lower frequencies. The lower panel shows in greater detail the
portion of the range from 0 to 0.2 containing the highest peak. Black points mark
calculated values taken at equal intervals whereas, again, the line connecting points
merely helps guide the eye. The more points included in the calculation, the fuller the
Fig. 9.4 Top panel: power spectrum of time series {y_t} of Figure 9.2. Bottom panel: details of
power spectrum (black points) in vicinity of peak at period 12 months (frequency
ν = 1/12 ≈ 0.083). Oscillatory side lobes accompanying the large central peak can overlap
nearby spectral components. Large gray dots mark spectral points at the minimal set of
independent harmonics and avoid spurious side lobes.
peaks would appear to be resolved. In the lower panel, peak frequencies and associated periods (in months) were found to be

ν₁ = 0.008276    T₁ = 120.83
ν₂ = 0.08276     T₂ = 12.083
ν₃ = 0.1676      T₃ = 5.967

corresponding closely to ten years, one year, and one-half year. As remarked previously in the book, the lowest frequency in the Fourier analysis of a discrete time
series corresponds to the duration of the time series and is not indicative of any
underlying physical mechanism. The power spectrum therefore reveals seasonal
periodicities in the usage of electrical energy which, of course, is not unexpected.
Although my home is not directly heated by electricity, electricity is required to run
the furnace, which burns a fossil fuel.
The presence of oscillatory side lobes flanking the central peak in Figure 9.4 also
reveals a potential problem with calculating the power spectrum at so many points.
That, in fact, is one reason for displaying the results of the calculation. Speaking
generally, distortions in the spectrum engendered by overlap of oscillatory side lobes
can be avoided by evaluating the spectrum only at the independent harmonics of
(9.3.2). The minimal set of harmonics (j₁ ... j₇) and associated periods T(j) = N/j =
120/j (in months) is

j₁ = 1     T(1) = 120
j₂ = 10    T(10) = 12
j₃ = 20    T(20) = 6
j₄ = 30    T(30) = 4
j₅ = 40    T(40) = 3
j₆ = 50    T(50) = 2.4
j₇ = 60    T(60) = 2        (9.3.7)

which (disregarding the non-physical peak at j₁) will be recognized as the harmonic
series T(n) = 12/n (n = 1, 2, 3...). The peaks at j₂ and j₃ show up strongly; the others
(not shown in the lower panel) are close to noise level, but still discernible. The peak
at j₇ is only half complete since the spectrum ends at the Nyquist frequency
ν_c = 60/120 = 0.5. A longer time series would be needed to better resolve this portion of
the spectrum.
task is to find the process that generated it. In this endeavor, the guiding principle is
to account for the data by a theory employing the fewest independent parameters: a
parsimonious theory, in the terminology of statistical analysis.
The building blocks of our theoretical construction come in two basic categories
of time series: (1) AR (for autoregressive) and (2) MA (for moving average).
Used alone and in various combinations involving differencing, the two forms permit
analysis of a wide range of linear physical systems.
We will examine briefly the seminal features of each class, starting with the first.
9.5 Autoregressive (AR) time series

The first building block is the autoregressive time series of order n, AR(n), defined by
the master equation

$$y_t=\alpha_1y_{t-1}+\alpha_2y_{t-2}+\cdots+\alpha_ny_{t-n}+\varepsilon_t \qquad (9.5.1)$$

$$=\sum_{j=1}^{n}\alpha_j\,y_{t-j}+\varepsilon_t \qquad (9.5.2)$$

in which {α_j} (j = 1...n) is the set of parameters to be determined from the data
and, as before, $\varepsilon_t=N(0,\sigma_\varepsilon^2)$ is a Gaussian random variable of mean zero and
variance $\sigma_\varepsilon^2$. One is free, of course, to make ε_t a different kind of random variable,
but unless there are cogent reasons for doing so, it is not usually done. The
normal distribution has the desirable property of stability (see Eq. (6.6.9)), which
facilitates the solution of (9.5.1) in ways that could not be applied to most other
distributions.
Two seminal advantages of working with AR models relate to the ease (at least in
principle) of (1) solving the master equation (9.5.2) and (2) determining the autocorrelation function ρ(k) and power spectrum S(ν).
Consider first the stationary solution to (9.5.2). In Chapter 6, we solved the case of
AR(1) by a judicious alignment and subsequent subtraction of time-lagged versions
of the equation. A more general direct approach employing the backshift operator
leads immediately to a solution in the form of an infinite series of random shocks.
Re-express (9.5.2) in the form
$$\left(1-\sum_{j=1}^{n}\alpha_j B^j\right)y_t=\varepsilon_t, \qquad (9.5.3)$$

then solve (9.5.3) as an algebraic equation with the inverse operation interpreted as a
Taylor series expansion4

$$y_t=\left(1-\sum_{j=1}^{n}\alpha_jB^j\right)^{-1}\varepsilon_t=\sum_{k=0}^{\infty}\left(\sum_{j=1}^{n}\alpha_jB^j\right)^{k}\varepsilon_t \qquad (9.5.4)$$

$$=\sum_{k=0}^{\infty}\left[\sum_{\{\lambda\}}\frac{(\lambda_1+\cdots+\lambda_n)!}{\lambda_1!\cdots\lambda_n!}\,\alpha_1^{\lambda_1}\cdots\alpha_n^{\lambda_n}\right]\varepsilon_{t-k} \qquad (9.5.5)$$

where the second sum is over all the partitions {λ} of k such that

$$\sum_{j=1}^{n}j\,\lambda_j=k. \qquad (9.5.6)$$

A necessary
and sufficient condition for the variance to converge to a finite constant (in other
words, for the solution to be stationary) is that the roots of the characteristic
equation

$$1-\sum_{j=1}^{n}\alpha_jB^j\equiv\alpha(B)=0, \qquad (9.5.7)$$

with B regarded as a complex variable, lie outside the unit circle. Satisfaction of
the condition also guarantees that the correlation function ρ(k) tends to 0 with
increasing k.
We have already seen in Eq. (6.6.17) that the correlation function of an AR
process is obtained by solving the set of Yule–Walker (YW) equations

$$\rho(k)=\sum_{j=1}^{n}\alpha_j\,\rho(|k-j|)\qquad k>0. \qquad (9.5.8)$$
We will examine the structure and content of these equations more closely later in the
investigation of particular AR models. For the present, it is to be noted that (9.5.8) is
an infinite set of linear algebraic equations. In practice, one approximates ρ(k) by the
sample function r(k) for as many equations (n in (9.5.8)) as needed to solve for the set
of unknown parameters {α_j}, and then uses these parameters and the YW equations
to generate the correlations at all other lag values. The solution involves inversion of
a matrix of coefficients, which generally must be performed numerically because an
analytical solution, except for a very low order AR process, would otherwise be
cumbersome to obtain and work with.

4 This method of solving a finite-difference equation has its continuous counterpart in the use of the resolvent operator
to solve the Schrödinger equation in quantum mechanics. See M. P. Silverman, Probing The Atom (Princeton University
Press, 2000).
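For low order the inversion can be done by hand. The sketch below simulates an AR(2) process with arbitrary (stationary) coefficients, estimates r(1) and r(2), and solves the 2x2 Yule-Walker system (9.5.8) for the two parameters.

```python
# Yule-Walker estimation for AR(2): rho(1) = a1 + a2*rho(1),
# rho(2) = a1*rho(1) + a2; solve the 2x2 linear system for a1, a2.
import random

def autocorr(z, m):
    zbar = sum(z) / len(z)
    denom = sum((zi - zbar) ** 2 for zi in z)
    return [sum((z[t] - zbar) * (z[t + k] - zbar) for t in range(len(z) - k)) / denom
            for k in range(m + 1)]

a1, a2 = 0.6, 0.25                     # arbitrary coefficients (stationary)
random.seed(6)
y = [0.0, 0.0]
for _ in range(50000):
    y.append(a1 * y[-1] + a2 * y[-2] + random.gauss(0, 1))
y = y[1000:]                           # discard the start-up transient

r = autocorr(y, 2)
det = 1 - r[1] ** 2
a1_hat = (r[1] - r[1] * r[2]) / det
a2_hat = (r[2] - r[1] ** 2) / det
assert abs(a1_hat - a1) < 0.05 and abs(a2_hat - a2) < 0.05
```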
An equivalent, theoretically exact method5 for deriving the autocovariance function γ(k), which therefore provides exact expressions for the autocorrelation ρ(k),
entails use of the autocovariance generating function (agf)

$$G(z)=\sum_{k=-\infty}^{\infty}\gamma(k)\,z^k=\bigl[\alpha(z)\,\alpha(z^{-1})\bigr]^{-1} \qquad (9.5.9)$$

with α(z) defined in (9.5.7). Note that relation (9.5.9) is just a variant (up to a
constant scale factor) of Eq. (6.6.18) derived previously. The power spectrum S(ν)
is then proportional to G(e^{iω})

$$S(\nu)\propto\bigl[\alpha(e^{i\omega})\bigr]^{-1}\bigl[\alpha(e^{-i\omega})\bigr]^{-1}=\bigl|\alpha(e^{i\omega})\bigr|^{-2}. \qquad (9.5.10)$$
As an example, consider the AR(1) process with parameter α6

$$u_t=\alpha u_{t-1}+\varepsilon_t\qquad\qquad \alpha(z)=1-\alpha z, \qquad (9.5.11)$$

for which the agf becomes

$$G(z)=\frac{1}{1-\alpha z}\cdot\frac{1}{1-\alpha z^{-1}}=\sum_{j,\,l=0}^{\infty}\alpha^{j+l}\,z^{\,j-l}=\sum_{k=0}^{\infty}\gamma(k)\,z^k, \qquad (9.5.12)$$

with the symmetry (9.2.15) taken into account so that the resulting power series is in
terms of non-negative powers only. By selecting all pairs of indices j and l to make
j − l = k for a given k ≥ 0, one obtains the series

$$G(z)=z^0(\alpha^0+\alpha^2+\alpha^4+\cdots)+z^1(\alpha^1+\alpha^3+\alpha^5+\cdots)+z^2(\alpha^2+\alpha^4+\alpha^6+\cdots)+\cdots=z^0\gamma_0+z^1\alpha\gamma_0+z^2\alpha^2\gamma_0+\cdots \qquad (9.5.13)$$
5 M. Kendall, A. Stuart, and J. Ord, The Advanced Theory of Statistics, Vol. 3 (Macmillan, New York, 1983) 526.
6 There are actually two parameters, because the variance $\sigma_\varepsilon^2$ is also not known in advance.
where

$$\gamma_0=\frac{1}{1-\alpha^2} \qquad (9.5.14)$$

is the variance of u_t in units of $\sigma_\varepsilon^2$. The autocorrelation function, following immediately from (9.5.13), is

$$\rho(k)=\alpha^k\qquad k=0,1,2\ldots, \qquad (9.5.15)$$
and the power spectrum, obtained by summing the autocorrelations over all
(positive and negative) lags, is

$$S(\nu)\propto\sum_{j=-\infty}^{\infty}\alpha^{|j|}e^{i\omega j}=1+\sum_{j=1}^{\infty}\alpha^j e^{i\omega j}+\sum_{j=1}^{\infty}\alpha^j e^{-i\omega j}=\frac{1}{1-\alpha e^{i\omega}}+\frac{1}{1-\alpha e^{-i\omega}}-1=\frac{1-\alpha^2}{1+\alpha^2-2\alpha\cos\omega}, \qquad (9.5.16)$$

where several algebraic simplifications mediated the transition between the lines.
Next, the same spectrum follows from the autocovariance generating function
(effected in a single line):

$$S(\nu)\propto\bigl[(1-\alpha e^{i\omega})(1-\alpha e^{-i\omega})\bigr]^{-1}=\frac{1}{1+\alpha^2-2\alpha\cos\omega}. \qquad (9.5.17)$$

Both methods yield the same function of ν to within a scale factor, as they must, but
the latter method requires less effort.
9.6 Moving average (MA) time series
The second building block in our search for a solution to the energy problem is the
MA(n) series, which takes the form
$$y_t=\varepsilon_t-\theta_1\varepsilon_{t-1}-\cdots-\theta_n\varepsilon_{t-n} \qquad (9.6.1)$$

$$=\left(1-\sum_{j=1}^{n}\theta_jB^j\right)\varepsilon_t\equiv\theta(B)\,\varepsilon_t \qquad (9.6.2)$$

and describes a stationary random variable of mean 0 and variance

$$\sigma_y^2=\sigma_\varepsilon^2\left(1+\sum_{j=1}^{n}\theta_j^2\right). \qquad (9.6.3)$$
The structure of Eq. (9.6.1) shows that the present value of the function yt depends
on random shocks in the past. Since random events in the future cannot influence the
present, it must follow that
hyt tk i 0
k 1:
9:6:4
However, the ensemble average of yt with a past shock tk does not vanish. To
evaluate an expectation of this kind, as well as to find the autocorrelation function of
the MA time series, consider first the simplest member of this class, MA(1), with
equation
yt t t1 :
9:6:5
9:6:6
9:6:7
from which follow the autocovariances

$$\gamma_y(0)=\langle y_t^2\rangle=(1+\theta^2)\sigma_\varepsilon^2\qquad\quad \gamma_y(1)=\langle y_t\,y_{t+1}\rangle=-\theta\langle\varepsilon_t^2\rangle=-\theta\sigma_\varepsilon^2\qquad\quad \gamma_y(k)=\langle y_t\,y_{t+k}\rangle=0\quad(k>1) \qquad (9.6.8)$$

and the autocorrelations

$$\rho_y(1)=\frac{\gamma_y(1)}{\gamma_y(0)}=\frac{-\theta}{1+\theta^2}\qquad\qquad \rho_y(k>1)=\frac{\gamma_y(k)}{\gamma_y(0)}=0. \qquad (9.6.9)$$
The pattern manifested by MA(1) carries through in the general case MA(n),
namely non-vanishing correlations only for k ≤ n, although it would be somewhat
tedious to demonstrate this by calculations generalizing (9.6.8). Instead, the covariance structure of MA(n) is obtainable with greater facility through use of an
autocovariance generating function analogous to (9.5.9) for AR(n)

$$G(z)=\theta(z)\,\theta(z^{-1}). \qquad (9.6.10)$$
Substitution of θ(z) from (9.6.1) into (9.6.10) leads to

$$G(z)=1-\sum_{j=1}^{n}\theta_jz^j-\sum_{j=1}^{n}\theta_jz^{-j}+\sum_{j,\,l=1}^{n}\theta_j\theta_l\,z^{j-l} \qquad (9.6.11)$$
from which one directly extracts the autocovariance function (in units of $\sigma_\varepsilon^2$)

$$\gamma_y(0)=1+\sum_{j=1}^{n}\theta_j^2$$
$$\gamma_y(1)=-\theta_1+\sum_{j=1}^{n-1}\theta_{j+1}\theta_j$$
$$\gamma_y(2)=-\theta_2+\sum_{j=1}^{n-2}\theta_{j+2}\theta_j \qquad (9.6.12)$$
$$\vdots$$
$$\gamma_y(n)=-\theta_n$$
$$\gamma_y(k>n)=0.$$

Thus, as claimed, the autocovariance vanishes for lag numbers greater than n.
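The MA correlation structure is easy to confirm by simulation. With the convention of (9.6.5) and an illustrative value of θ, the sample autocorrelation should be close to −θ/(1 + θ²) at lag 1 and indistinguishable from zero beyond.

```python
# Simulation check of (9.6.9): for MA(1), rho(1) = -theta/(1 + theta^2) and
# correlations vanish for k > 1. Theta and sample size are illustrative.
import random

def autocorr(z, m):
    zbar = sum(z) / len(z)
    denom = sum((zi - zbar) ** 2 for zi in z)
    return [sum((z[t] - zbar) * (z[t + k] - zbar) for t in range(len(z) - k)) / denom
            for k in range(m + 1)]

theta = 0.5
random.seed(7)
eps = [random.gauss(0, 1) for _ in range(100001)]
y = [eps[t] - theta * eps[t - 1] for t in range(1, len(eps))]   # as in (9.6.5)

r = autocorr(y, 5)
assert abs(r[1] - (-theta / (1 + theta ** 2))) < 0.02   # predicted -0.4
for k in range(2, 6):
    assert abs(r[k]) < 0.02                             # noise level beyond lag 1
```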
Note that the sum over negative powers of z, i.e. the third term on the right side
of (9.6.11), was not needed to determine the autocovariance. All terms, however, are
essential to calculate the power spectrum by a relation analogous to (9.5.10)

$$S(\nu)\propto\theta(e^{i\omega})\,\theta(e^{-i\omega})=\bigl|\theta(e^{i\omega})\bigr|^2 \qquad (9.6.13)$$

which works out to

$$S(\nu)\propto 1+\sum_{j=1}^{n}\theta_j^2-2\sum_{j=1}^{n}\theta_j\cos(j\omega)+2\sum_{j>l}\sum_{l=1}^{n-1}\theta_j\theta_l\cos\bigl((j-l)\omega\bigr). \qquad (9.6.14)$$
9.7 Autoregressive moving average (ARMA) time series

The two building blocks combine in the mixed time series ARMA(p, q)

$$y_t=\sum_{j=1}^{p}\alpha_j\,y_{t-j}+\varepsilon_t-\sum_{j=1}^{q}\theta_j\,\varepsilon_{t-j}, \qquad (9.7.1)$$

expressible compactly in backshift notation as

$$\alpha(B)\,y_t=\theta(B)\,\varepsilon_t. \qquad (9.7.2)$$
To delve comprehensively into the manifold varieties of ARMA time series would go
well beyond the objectives of this chapter, the primary focus of which is on insight
and methods for solving a particular problem of physical interest. Dedicated references exist for a more thorough treatment of the analysis of time series, and several
that I have used are referenced at the end of the book. Let it suffice to say without
demonstration that the covariance function of an ARMA time series is of infinite
extent, comprising damped exponentials and/or damped sine waves after the first
q − p lag numbers.
The solution for y_t is obtained immediately (albeit symbolically) from (9.7.2)

$$y_t=\alpha(B)^{-1}\,\theta(B)\,\varepsilon_t, \qquad (9.7.3)$$

and is readily shown to be of Gaussian form of mean 0, if ε_t is a standard normal
variate. The autocovariance generating function takes the form

$$G(z)=\frac{\theta(z)\,\theta(z^{-1})}{\alpha(z)\,\alpha(z^{-1})} \qquad (9.7.4)$$
and the corresponding power spectrum (up to a scale factor) is, as expected,
S(ω) ∝ G(e^{iω}) = |1 + β(e^{iω})|² / |1 − α(e^{iω})|².        (9.7.5)
Equations (9.7.3)–(9.7.5) tell everything one would want to know about an ARMA(p, q) time series, although the series expansions required to extract this information
will be the more computationally intensive the higher the orders p and q.
With this rudimentary understanding of AR, MA, and ARMA time series, we
now have the tools to model the electric energy problem.
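Before doing so, the defining recursion (9.7.1) can be turned directly into a simulator; a sketch under stated assumptions (the coefficient 0.5 and the burn-in length are illustrative choices, not values from the text):

```python
import numpy as np

def simulate_arma(alpha, beta, N, sigma=1.0, seed=0, burn=500):
    """Generate an ARMA(p, q) series per (9.7.1):
    y_t = sum_j alpha_j y_{t-j} + eps_t + sum_j beta_j eps_{t-j}."""
    rng = np.random.default_rng(seed)
    p, q = len(alpha), len(beta)
    eps = rng.normal(0.0, sigma, N + burn)
    y = np.zeros(N + burn)
    for t in range(N + burn):
        ar = sum(alpha[j] * y[t - 1 - j] for j in range(p) if t - 1 - j >= 0)
        ma = sum(beta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        y[t] = ar + eps[t] + ma
    return y[burn:]                      # discard the start-up transient

# sanity check on the AR(1) special case, whose theoretical
# variance is sigma^2 / (1 - alpha^2)
y = simulate_arma([0.5], [], 100_000, seed=1)
var_theory = 1.0 / (1 - 0.5 ** 2)
```

For a long seeded run the sample variance should sit close to the theoretical value, which is one quick way to validate such a simulator.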
Consider the model

y_t = α_1 y_{t−1} + α_12 y_{t−12} + ε_t        (9.8.1)
that requires only two parameters in addition to the variance σ² of the random noise (or shock) term ε_t.
I label this system AR(12)1,12 because it is actually a reduced form of the twelfth-order autoregressive process, all parameters {α_j} being 0 except for j = 1 and 12. The unknown variance σ² is not numbered among the parameters in labeling the process. The assignment of order, which underlies the nomenclature, is based on a standard method of solving finite difference equations. Given an equation like (9.8.1) without the random shock, one usually makes the ansatz (i.e. an educated guess, or trial solution) y_t ∝ Y^t, which in the present case would lead to a twelfth-order algebraic equation. We will see how this procedure works later in a simpler, algebraically solvable system.
The solution of (9.8.1), given by (9.5.4) and (9.5.6), predicts that yt is a Gaussian
random variable of mean 0 and variance
σ_y² = σ² Σ_{k=0}^{∞} [ Σ_{j=0}^{k} \binom{k}{j}² α_1^{2j} α_12^{2(k−j)} ] = σ² Σ_{k=0}^{∞} (α_12² − α_1²)^k P_k( (α_12² + α_1²)/(α_12² − α_1²) )        (9.8.2)
in which P_k(z) is the Legendre function of order k.⁷ The reduction of the second
expression to the third is not given here, but can be verified directly by use of a
symbolic mathematical application like Maple.
⁷ A Legendre function y_k(x) is an appropriately normalized solution to the second-order differential equation (1 − x²)y″ − 2xy′ + k(k + 1)y = 0 in which the primes signify differentiation with respect to x.
Two methods for estimating the AR parameters entail use of (a) the Yule-Walker (YW) equations relating values of the autocovariance function at different lags, or (b) the principle of maximum likelihood (ML) applied to the adjusted time series.
The first (YW) is quicker and simpler; the second (ML) is computationally more
intensive, but more accurate. Consider the simpler method first.
The YW equations of the AR(12)1,12 model,

ρ(0) = 1,   ρ(k ≠ 0) = α_1 ρ(|k − 1|) + α_12 ρ(|k − 12|),        (9.8.3)

evaluated at k = 1 and k = 12 with empirical autocorrelations r_y(k), yield the matrix equation

⎛ 1        r_y(11) ⎞ ⎛ α_1^YW  ⎞   ⎛ r_y(1)  ⎞
⎝ r_y(11)  1       ⎠ ⎝ α_12^YW ⎠ = ⎝ r_y(12) ⎠        (9.8.4)

whose solution is

α_1^YW = [r_y(1) − r_y(11) r_y(12)] / [1 − r_y(11)²] = 0.358
α_12^YW = [r_y(12) − r_y(1) r_y(11)] / [1 − r_y(11)²] = 0.391.        (9.8.5)
These results must be viewed with some caution since the choice of a different pair of
YW equations can produce a different set of numerical values.
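The two-equation solution (9.8.5) is a one-liner to encode; a sketch in which the sample autocorrelations fed in are hypothetical stand-ins, not the chapter's empirical r_y(k):

```python
def yw_two_param(r1, r11, r12):
    """Solve the pair of Yule-Walker equations of AR(12)_{1,12},
    cf. (9.8.5):  r1 = a1 + a12*r11  and  r12 = a1*r11 + a12."""
    det = 1.0 - r11 ** 2
    a1 = (r1 - r11 * r12) / det
    a12 = (r12 - r1 * r11) / det
    return a1, a12

# with r11 = 0 the two equations decouple: a1 = r1 and a12 = r12
a1, a12 = yw_two_param(0.30, 0.0, 0.40)
```

Feeding in a different pair of lags would, as the text cautions, generally return different numbers.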
The maximum likelihood (ML) method of estimating parameters of a linear time
series is ordinarily preferred, since it makes use of nearly all the elements of the time
series. Under the assumption that the residuals
ε_t = y_t − α_1 y_{t−1} − α_12 y_{t−12} = N(0, σ²)        (9.8.6)

are independent normal variates, the log-likelihood function of the series is

L = −(N/2) ln(2πσ²) − (1/2σ²) Σ_{t=13}^{N} (y_t − α_1 y_{t−1} − α_12 y_{t−12})².        (9.8.7)
One then obtains three coupled equations by setting equal to 0 the first derivative of
L with respect to each of the three parameters. Solving the two equations
⎛ α_1^ML  ⎞   ⎛ Σ_{t=13}^{N} y_{t−1}²          Σ_{t=13}^{N} y_{t−1} y_{t−12} ⎞⁻¹ ⎛ Σ_{t=13}^{N} y_t y_{t−1}  ⎞
⎝ α_12^ML ⎠ = ⎝ Σ_{t=13}^{N} y_{t−1} y_{t−12}   Σ_{t=13}^{N} y_{t−12}²        ⎠   ⎝ Σ_{t=13}^{N} y_t y_{t−12} ⎠        (9.8.8)

yields

α_1^ML = 0.348,   α_12^ML = 0.470,        (9.8.9)
the values of which are then substituted into the third equation for the variance of the
residuals
s² = (1/(N − 12)) Σ_{t=13}^{N} (y_t − α_1^ML y_{t−1} − α_12^ML y_{t−12})² = 73.026².        (9.8.10)
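The normal equations (9.8.8) are just a linear least-squares problem; a sketch on synthetic data (the generating coefficients 0.3 and 0.4 and the series length are illustrations, not the chapter's estimates):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5000
a1_true, a12_true = 0.3, 0.4

# synthesize an AR(12)_{1,12} series in the spirit of (9.8.1)
y = np.zeros(N)
for t in range(12, N):
    y[t] = a1_true * y[t - 1] + a12_true * y[t - 12] + rng.normal(0, 1)

# conditional ML = least squares of y_t on y_{t-1} and y_{t-12}
X = np.column_stack([y[12:-1], y[1:-12]])   # regressors y_{t-1}, y_{t-12}
target = y[13:]                             # y_t for t = 13..N-1
a1_ml, a12_ml = np.linalg.lstsq(X, target, rcond=None)[0]

# residual variance, in the spirit of (9.8.10)
s2 = np.mean((target - X @ np.array([a1_ml, a12_ml])) ** 2)
```

With a few thousand points the estimates land close to the generating values, and the residual variance close to the shock variance used in the simulation.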
The covariance matrix, which is obtained from the second derivatives of the likelihood function, yields the following standard errors and cross-correlation of parameters

s_{α_1} = 0.0786,   s_{α_12} = 0.0813,   r_{α_1 α_12} = cov(α_1, α_12)/(s_{α_1} s_{α_12}) = 0.371.        (9.8.11)
⁸ Loss of one degree each for (a) the completeness relation and (b) estimation of σ² from the data.
Fig. 9.5 Top panel: chronological record of residuals of the AR(12)1,12 model calculated with ML parameters: α_1 = 0.348, α_12 = 0.470, σ² = 73.03². Bottom panel: histogram of residuals enveloped by Gaussian probability density N(0, σ²) (solid).
Inserted into the theoretical variance (9.8.2), the ML parameters predict the ratio

σ_y²/σ² = Σ_{k=0}^{∞} (α_12² − α_1²)^k P_k( (α_12² + α_1²)/(α_12² − α_1²) ) = 1.735,        (9.8.12)
Fig. 9.6 Autocorrelation of the residuals of Figure 9.5. Dashed lines mark 95% confidence limits: ±1.3N^{−1/2} for k = 1; ±1.7N^{−1/2} for k = 2; ±2N^{−1/2} for k > 2.
to be compared with the empirical ratio

s_y²/s² = 1.741,   s² = 5.333 × 10³.        (9.8.13)
It would seem that one could hardly ask for closer agreement between hypothesis and
verification.
It remains to be seen, however, how well the model predicts the full autocorrelation function (up to some specified maximum lag), as well as the power spectrum. For
the first task, we turn again to the YW equations. Setting y(k) equal to the empirical
values ry(k) for k = 1. . .12, permits estimation of y(k) iteratively at all other values of
k from the YW algorithm
ρ_y(k) = 1                                              for k = 0
ρ_y(k) = r_y(k)                                         for k = 1…12        (9.8.14)
ρ_y(k) = α_1^ML ρ_y(|k − 1|) + α_12^ML ρ_y(|k − 12|)    otherwise.
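The recursion (9.8.14) is straightforward to implement; in this sketch the seed autocorrelations are placeholders, not the empirical r_y(k):

```python
def extend_autocorrelation(r, a1, a12, kmax):
    """Extend an autocorrelation sequence by the AR(12)_{1,12}
    Yule-Walker recursion of (9.8.14):
    rho(k) = a1*rho(|k-1|) + a12*rho(|k-12|) for k > 12."""
    rho = [1.0] + list(r)                  # rho(0)=1, rho(1..12) given
    for k in range(13, kmax + 1):
        rho.append(a1 * rho[abs(k - 1)] + a12 * rho[abs(k - 12)])
    return rho

# hypothetical seed values rho(1)..rho(12), then extend to lag 60
r_seed = [0.3, 0.1, 0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.05, 0.1, 0.3, 0.5]
rho = extend_autocorrelation(r_seed, 0.348, 0.470, 60)
```

Each extended value depends only on earlier lags, so the whole curve out to any maximum lag costs a single linear pass.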
The resulting autocorrelation is plotted as the dashed line in the upper panel of
Figure 9.7. Although the function displays a decaying oscillatory waveform in
approximate accord with the sample autocorrelation (gray dots), it deviates increasingly in peak amplitude and location with lag number.
The corresponding power spectrum (dashed line), derived from (9.5.10)
S(ω) ∝ 1 / |1 − α_1^ML e^{iω} − α_12^ML e^{12iω}|²,        (9.8.15)
is compared with the sample power spectrum (gray line) in the lower panel of
Figure 9.7. Each power spectrum in the figure is normalized to its maximum
value. As evident from the figure, the AR(12)1,12 model (dashed trace) does not
Fig. 9.7 Top panel: comparison of empirical autocorrelation r_y(k) (gray dots) with autocorrelation calculated by model AR(12)1,12 (dashed) and model AR(12) (solid) with maximum likelihood (ML) parameters. Bottom panel: comparison of the power spectrum of the empirical time series (solid gray) with power spectra calculated on the basis of models AR(12)1,12 (dashed black) and AR(12) (solid black) with ML parameters.
match well the relative amplitude and location of the fundamental peak (at frequency ν = 1/12 ≈ 0.083).
The two panels of the figure also display plots marked by solid black lines. We will
return to these shortly.
Although the 2-parameter AR(12)1,12 model has passed basic statistical tests of its
premises, one may nevertheless wonder whether it is perhaps a little too simplistic. In
other words, how do we know that eliminating the 10 parameters 2, 3,. . .11 at the
outset may not be responsible for the discrepancies seen in the autocorrelation and
power spectrum? If AR(12)1,12 works reasonably well, perhaps the full 12-parameter
AR(12) model will work even better. This would require, however, that we solve a set
of 12 coupled linear equations in order to determine the 12 unknown parameters.
Fortunately, that is the sort of tedious work for which computers are ideally suited.
Let us suppose that the model AR(12) applies and then estimate the full set of parameters {α_j} (j = 1…12) by solving the YW equations, which can be written compactly in matrix form as Rα = r, symbolizing the (12 × 12) matrix equation
⎛ 1     r_1   r_2   ⋯   r_11 ⎞ ⎛ α_1  ⎞   ⎛ r_1  ⎞
⎜ r_1   1     r_1   ⋯   r_10 ⎟ ⎜ α_2  ⎟   ⎜ r_2  ⎟
⎜ r_2   r_1   1     ⋯   r_9  ⎟ ⎜ α_3  ⎟ = ⎜ r_3  ⎟        (9.8.16)
⎜ ⋮     ⋮     ⋮     ⋱   ⋮    ⎟ ⎜ ⋮    ⎟   ⎜ ⋮    ⎟
⎝ r_11  r_10  r_9   ⋯   1    ⎠ ⎝ α_12 ⎠   ⎝ r_12 ⎠
Formidable as Eq. (9.8.16) may appear, there is a regularity, indeed beauty, to the structure of R. In the terminology of linear algebra R is a Toeplitz matrix, i.e. a square (n × n) matrix of constant diagonals with elements of the form R_ij = r_{|i−j|}. Owing to the symmetry, the matrix has 2n − 1, rather than n², degrees of freedom with the consequence that a linear equation such as (9.8.16) can be solved in a number of operations of the order of n², rather than a higher number for a general matrix (such as n³ in the case of the standard Gaussian elimination method).
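The O(n²) procedure alluded to here is the Levinson-Durbin recursion; a sketch (not the author's code) that recovers the coefficients of a known AR(2) process from its theoretical autocorrelations:

```python
import numpy as np

def levinson_durbin(r):
    """Solve the Toeplitz Yule-Walker system R a = r in O(n^2),
    where r = [rho(1), ..., rho(n)], R_ij = rho(|i-j|), rho(0) = 1."""
    n = len(r)
    a = np.zeros(0)
    err = 1.0
    for k in range(n):
        acc = np.dot(a, r[k - 1::-1]) if k > 0 else 0.0
        lam = (r[k] - acc) / err           # reflection coefficient
        a = np.append(a - lam * a[::-1], lam)
        err *= 1.0 - lam ** 2              # prediction-error update
    return a

# theoretical autocorrelations of AR(2) with a1 = 0.5, a2 = 0.2:
# rho(1) = a1/(1-a2), then rho(k) = a1*rho(k-1) + a2*rho(k-2)
a1, a2 = 0.5, 0.2
rho = [a1 / (1 - a2)]
rho.append(a1 * rho[0] + a2)
for k in range(2, 12):
    rho.append(a1 * rho[k - 1] + a2 * rho[k - 2])

coef = levinson_durbin(np.array(rho))      # coef[:2] ~ (0.5, 0.2), rest ~ 0
```

Because the input autocorrelations satisfy an exact order-2 recursion, all coefficients beyond the second come out (numerically) zero, which is itself a useful diagnostic of model order.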
Substitution into (9.8.16) of the sample autocorrelation function {r_y(k), k = 1…12} at the first 12 lag numbers, followed by matrix inversion, generates the solution vector α^(YW)
α^(YW) = (0.3135, 0.1181, 0.03955, 0.03800, 0.05823, 0.01661, 0.09264, 0.01026, 0.06359, 0.1063, 0.07564, 0.3152)ᵀ

α^(ML) = (0.3025, 0.09344, 0.04619, 0.02498, 0.07602, 0.03391, 0.1227, 0.00055, 0.05437, 0.05548, 0.09571, 0.4089)ᵀ        (9.8.17)
the two largest components of which do not differ greatly from the two-dimensional
YW solution (9.8.5).
Also shown in (9.8.17) is the solution vector α^(ML) obtained by maximizing the
13-parameter conditional log-likelihood function
L = −(N/2) ln(2πσ²) − (1/2σ²) Σ_{t=13}^{N} (y_t − Σ_{j=1}^{12} α_j y_{t−j})²,        (9.8.18)
the details of which are left to an appendix. The two sets of solutions are similar, but, as in the case of the two-dimensional problem, the maximum likelihood approach is to be preferred over solution of the partial set of Yule-Walker equations. The parameters of greatest magnitude are α_12^ML and α_1^ML (which is about 74% of α_12^ML); all other ML parameters are less than 30% of α_12^ML. One may wonder whether the extra work has led to any significant improvement in the explanatory power of the model.
Use of the ML solution in (9.8.17) to estimate the autocorrelation function from a
generalization of the iterative YW algorithm (9.8.14)
8
1
k0
>
>
>
>
< r y k
k 1, 2 . . . 12
y k X
9:8:19
12
>
ML
>
>
jk
jj
k
13
>
y
j
:
j1
generates the solid black trace in the upper panel of Figure 9.7; the corresponding AR(12) power spectrum is

S(ω) ∝ 1 / |1 − Σ_{j=1}^{12} α_j^ML e^{ijω}|².        (9.8.20)

Spectra of an AR(n) process whose only nonvanishing parameter is α_n take the form

S(ω) ∝ [1 + α_n² − 2 α_n cos(nω)]^{−1}        (9.8.21)

and give rise to maxima (i.e. spectral peaks) at frequencies such that ω = 2πk/n (k = 0, 1, 2…), i.e. at periods n/k, which correspond in the present case exactly to the observed harmonic series of 12/k months.
However, suppose that only the two largest parameters, α_n and α_1, are non-zero.
Then S(ω) becomes

S(ω) ∝ { 1 + α_n² + α_1² − 2[ α_n cos(nω) + α_1 cos(ω) − α_n α_1 cos((n − 1)ω) ] }^{−1}        (9.8.22)
Fig. 9.8 Top panel: comparison of empirical autocorrelation (gray dots) with autocorrelation of an AR(12) time series (black solid) simulated with ML parameters and ε_t = N(0, 85²). Bottom panel: corresponding AR(12) power spectrum (black) compared with empirical power spectrum (gray dots and connecting lines).
and the frequency closest to ω = 2π/n at which the bracketed expression vanishes is influenced primarily by the term at frequency (n − 1)ω. For n = 12, and α_12 and α_1 both positive and of comparable magnitude (~0.3), the spectral peak is downshifted from 2π/12 by about 0.015. If the two parameters had opposite signs, the spectral peak would be upshifted (not necessarily by the same amount).
The power of rapid computation afforded by desktop computers and software like
Maple and Mathematica provides a complementary way to explore the intricacies of
any hypothetical model, besides subjecting the one empirical time series to a battery
of statistical tests. One can simulate the stochastic process numerous times (an example of a Monte Carlo method) to create an ensemble of time series by which to judge whether the time series, autocorrelation function, and power spectrum of the
one available sample are plausible representatives.
The top panel of Figure 9.8 shows a comparison of the empirical autocorrelation
ry(k) (gray dots) with the autocorrelation function of one such AR(12) simulation
(black solid) created by the algorithm
y_t^s = y_t                                          13 ≥ t ≥ 1
y_t^s = Σ_{j=1}^{12} α_j^ML y_{t−j}^s + ε_t          t ≥ 14        (9.8.23)
in which ε_t = N(0, σ²). Different values of the variance σ² were tried; the figure shows one of the outcomes for σ² = 85². The lower panel of the figure compares the corresponding power spectrum of the simulated time series with that of the sample.
Overall, the autocorrelation of the simulated series is seen to follow reasonably
closely the pattern of the sample autocorrelation, apart from an initial overshoot
around k = 6 and the weak dephasing of peaks that occurs at higher lag numbers.
Likewise, the power spectrum of the simulation reproduces in magnitude and location both the fundamental and first harmonic peaks of the sample spectrum.
Because the noise t (obtained from a Gaussian pseudo-RNG) varies from simulation to simulation, the resulting autocorrelation and power spectrum of a particular
trial can look different in detail from those displayed in Figure 9.8. The plots in the
figure were intentionally chosen because they represented the corresponding empirical functions well. However, the fact that the AR simulations produced profiles like
these fairly often indicates that such outcomes are not unrepresentative.
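The ensemble procedure just described can be sketched as follows; it is a minimal illustration, with the seed points taken as zeros rather than the first 13 observations, and with placeholder coefficients near (but not equal to) the chapter's ML estimates:

```python
import numpy as np

def simulate_ar12(alpha, sigma, N, seed):
    """One realization in the spirit of (9.8.23): the first 13 points
    seed the recursion, thereafter y_t = sum_j alpha_j y_{t-j} + eps_t."""
    rng = np.random.default_rng(seed)
    y = np.zeros(N)
    for t in range(13, N):
        y[t] = np.dot(alpha, y[t - 1:t - 13:-1]) + rng.normal(0, sigma)
    return y

def sample_autocorr(y, kmax):
    """Biased sample autocorrelation r(k), k = 0..kmax."""
    y = y - y.mean()
    c0 = np.dot(y, y)
    return np.array([np.dot(y[:len(y) - k], y[k:]) / c0
                     for k in range(kmax + 1)])

alpha = np.zeros(12)
alpha[0], alpha[11] = 0.35, 0.47           # placeholder a1, a12
ensemble = [sample_autocorr(simulate_ar12(alpha, 85.0, 1200, s), 60)
            for s in range(20)]            # 20 Monte Carlo trials
r_mean = np.mean(ensemble, axis=0)         # ensemble-average autocorrelation
```

Inspecting the scatter of the ensemble about `r_mean` is what lets one judge whether the single empirical autocorrelation is a plausible member of the family.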
One could, in principle, stop at this point if the intent were primarily model
development for forecasting. But to a physicist used to closer agreement between
his theories and his data, there is a nagging perception that with more effort one
might still do better and learn more.
c = (a² + b²)^{1/2},   φ = tan^{−1}(b/a).        (9.9.2)

The corresponding second-order autoregressive process is

y_t = α_1 y_{t−1} + α_2 y_{t−2} + ε_t        (9.9.3)

with the YW equation for its autocorrelation

ρ_k − α_1 ρ_{k−1} − α_2 ρ_{k−2} = 0   (k = 0, 1, 2…).        (9.9.4)
To solve (9.9.4), make the ansatz ρ_k = s^k and then multiply each term by s^{2−k} to obtain the quadratic equation
s² − α_1 s − α_2 = 0        (9.9.5)
whose roots are s₊ and s₋. The general solution is then of the form

ρ_k = A s₊^k + B s₋^k.        (9.9.6)
Writing the roots in the form

s₊ = λ e^{iω},   s₋ = λ e^{−iω}        (9.9.7)

leads (with A = B = ½) to a damped cosine

ρ_k = λ^k cos(kω)        (9.9.8)

and to the adaptive (co)sine (AdCos) parameters

α_1 = 2λ cos ω,   α_2 = −λ²        (9.9.9)
in (9.9.3) and (9.9.4). For a time series periodic in 12 time units (e.g. months), ω = 2π/12 and the YW equation for AdCos then takes the form

ρ_k − √3 λ ρ_{|k−1|} + λ² ρ_{|k−2|} = 0        (9.9.10)

in which there is one parameter, not two, to be determined from data.
The autocorrelation function (9.9.10) and associated power spectrum
S(ω) ∝ 1 / |1 − √3 λ e^{iω} + λ² e^{2iω}|²        (9.9.11)
are compared with the corresponding empirical functions in Figure 9.9. The amplitude λ = 0.97 was chosen by simulation and visual inspection. The accord is actually rather good. Peaks of the autocorrelation are precisely at integer multiples of 12
Fig. 9.9 Top panel: comparison of empirical autocorrelation (gray dots and connecting lines) with autocorrelation (black) of AdCos time series of amplitude λ = 0.97 (determined by visual inspection). Bottom panel: corresponding AdCos power spectrum (black) compared with empirical power spectrum (gray dots and connecting lines).
(although there is still an overshoot in the vicinity of k = 6), and the fundamental
peak (which is the only peak) of the power spectrum is precisely at 2/12 (by
construction).
It is important to note that, however closely the autocorrelation function of a
hypothetical time series matches the autocorrelation of some empirical time series,
there is no guarantee that the stochastic process upon which the model is based will
describe satisfactorily the empirical series. Such is the case here with AdCos.
A simulated time series with λ less than (but close to) 1 and σ about 80 or so can produce an autocorrelation like that in Figure 9.9, but that simulated time series displays variations too rounded and too regular to be generated by the same stochastic process that produced my electric energy readings.
Another seminal point: although the adaptive (co)sine parameters (9.9.9) lead to a
theoretical autocorrelation function that peaks at integer multiples of 12 when
employed in the YW equation (9.9.10), the autocorrelation of simulated time series
generated by stochastic process (9.9.3) with AdCos parameters does not peak at
integer multiples of 12, but, instead, displays phase shifts that increase with lag number. By contrast, the deterministic (co)sine (DeCos) process

y_t = a cos(2πt/12) + b sin(2πt/12) + ε_t,        (9.9.12)

with parameters

a = 42.21,   b = 68.74        (9.9.13)
obtained by a maximum likelihood fit to the time series of energy readings, gives
results closer to those observed. It is to be noted that parameters (9.9.13) are precisely
the amplitudes a_10 and b_10 corresponding to period T = 12 in the Fourier analysis
(9.3.6) of the empirical time series.
The top panel of Figure 9.10 shows the purely deterministic part (oscillatory black
trace) of (9.9.12) superposed on the empirical series of energy readings (gray trace).
Peaks and valleys line up nearly perfectly, highlighting the seasonal pattern that is
not readily evident amidst the noise. The two traces are displaced upward by 400 units
so that one can see clearly the lower plot (black trace) obtained by simulating the full
stochastic process (9.9.12) with a Gaussian RNG of mean 0 and variance 2 762 .
In contrast to an AdCos simulation, the DeCos simulation, like the empirical series,
does not exhibit any visually striking periodicity. Only by direct comparison with the
deterministic (co)sine wave in the figure does one become aware of the seasonality of
the time series.
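The DeCos process is simple enough to simulate in a few lines; a sketch assuming the form (9.9.12) with the amplitudes of (9.9.13) and the noise level quoted in the figure caption:

```python
import numpy as np

def simulate_decos(a, b, sigma, N, seed=0):
    """DeCos, cf. (9.9.12): deterministic annual (co)sine plus white noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(N)
    det = a * np.cos(2 * np.pi * t / 12) + b * np.sin(2 * np.pi * t / 12)
    return det + rng.normal(0, sigma, N), det

y, det = simulate_decos(42.21, 68.74, 76.0, 120)

# least-squares recovery of the (co)sine amplitudes from the noisy series
t = np.arange(120)
X = np.column_stack([np.cos(2 * np.pi * t / 12), np.sin(2 * np.pi * t / 12)])
a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```

Even at this noise level the two amplitudes are recoverable by regression, which mirrors the maximum-likelihood fit described in the text.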
The lower panel of Figure 9.10 shows the power spectrum of the simulated time
series superposed over the power spectrum of the empirical series. The match in location
and relative amplitude is nearly perfect in the vicinity of the fundamental peak at frequency ν_1 = 1/12. No higher harmonics, however, are apparent in this figure although the peak near frequency ν_2 = 2/12 has shown up in other simulated trials.
The autocorrelation of a deterministic (co)sine is another pure (co)sine, which is
markedly different from the autocorrelation of the stochastic process (9.9.12) shown
in the top panel (black trace) of Figure 9.11, superposed on the autocorrelation of the
empirical series (gray dots). The four panels of Figure 9.11 present a panoply of
correlation functions like those in Figure 9.3 for the empirical series of energy readings
and associated difference series. It is striking how closely the simple process (9.9.12)
reproduces the essential statistical features of the autocorrelation of all four series y,
∇1 y, ∇12 y, and ∇1∇12 y. This does not mean that the DeCos model necessarily represents the true stochastic process that generated the series of electric energy readings.
The question of judging the suitability of a model will be taken up in due course.
Using computer simulation, I have explored the results of combining DeCos and
AR(12) processes. The combination reproduces the autocorrelation function more
satisfactorily than does AR(12) alone, but such a hybrid model has 14 parameters,
Fig. 9.10 Top panel: simulation of stochastic DeCos time series (9.9.12) with parameters a = 42.21, b = 68.74, σ² = 76² (lower black trace); plot of deterministic oscillatory portion (upper black trace) superposed on empirical time series (gray). Bottom panel: power spectrum of simulated DeCos series (black) compared with power spectrum of empirical series (gray).
w_t ≡ ∇1∇12 y_t = (1 − βB)(1 − δB^12) ε_t,        (9.10.1)

classified as MA(1) × MA(1)12 for the two independent parameters. This is a pattern
that often shows up in analysis of linear systems subject to forcings at two different periods, in this case one month and one year. The model systems, MA(1) × MA(1)12
Fig. 9.11 Autocorrelation functions (solid black) of a simulated DeCos series y_t^s (top panel) and corresponding difference series ∇1 y_t^s (second panel), ∇12 y_t^s (third panel), ∇1∇12 y_t^s (bottom panel) with parameters a = 60, b = 30, σ² = 76². The empirical autocorrelation of electric energy readings (gray dots) is also shown in the top panel. Dashed lines delimit the region within approximately ±2 standard deviations.
and AR(12)1,12, give different interpretations to the nature of these forcings. In the language of systems analysis, the AR(12)1,12 process, defined by Eq. (9.8.1), asserts that the current state of the system depends on the state of the system at 1 and 12 time units earlier. The MA(1) × MA(1)12 process, in contrast, asserts that the current state of the system depends on random shocks that occurred at 1 and 12 time units earlier. Both processes also include random shocks at the present moment.
Drawing on the summary description (9.6.12) of MA systems given in Section 9.6, we can say that w_t is a Gaussian random variable of mean 0 and variance

σ_w² ≡ γ_w(0) = σ²(1 + β²)(1 + δ²)        (9.10.2)

and autocorrelation function

ρ_w(k) = −β/(1 + β²)                 k = 1
ρ_w(k) = βδ/[(1 + β²)(1 + δ²)]       k = 11, 13        (9.10.3)
ρ_w(k) = −δ/(1 + δ²)                 k = 12
ρ_w(k) = 0                           all other k > 0.        (9.10.4)

Equating ρ_w(1) and ρ_w(12) to the empirical autocorrelations of the difference series yields

r_w(1) = −0.463 ⇒ β = 0.672,   r_w(12) = −0.453 ⇒ δ = 0.658.        (9.10.5)
Two solutions for each parameter are obtained because the theoretical expressions in
(9.10.4) are quadratic in the parameters. The reason for discarding the solutions
greater than one will be explained shortly. For the moment, however, note that the
discarded solution for each parameter is the reciprocal of the retained solution. In
other words, and to generalize, two MA(q) processes

x_t = ε_t + Σ_{j=1}^{q} β_j ε_{t−j},   y_t = ε_t + Σ_{j=1}^{q} β_j^{−1} ε_{t−j},        (9.10.6)
while they represent different time series, nevertheless give rise to identical autocorrelation functions. Thus, one cannot uniquely characterize a MA process from the
autocorrelation alone.
Interestingly, β and δ are about the same in (9.10.5), namely ~0.7. Given these parameters, we can test the theoretical equality of the autocorrelation at lags 11 and 13, the predicted value of which is ρ_w(11) = ρ_w(13) = 0.213. The empirical values are r_w(11) = 0.132 and r_w(13) = 0.308. However, since these are two realizations of equal random variables within the framework of the model MA(1) × MA(1)12, the more appropriate statistic testing their equivalence is the mean
[r_w(11) + r_w(13)]/2 = 0.220,        (9.10.7)
which looks to be in reasonable accord with the theoretical prediction (although one
cannot be sure without an estimate of the associated uncertainties). A more thorough
analysis of uncertainties will not be made here since the primary objective is merely to
see whether the investigated model provides a plausible explanation of the observed
series of energy readings.
It was noted in a previous section that a necessary and sufficient condition for an
AR model to be stationary is that the roots of the characteristic equation (9.5.7) must
lie outside the unit circle. The physical significance of this condition is that the
variance of the solution yt is finite and the autocorrelation function decreases with
increasing lag, in conformity with causality. An analogous criterion for a MA model
is that it be invertible. By invertible is meant that the system variable x_t in (9.10.6) be expressible in terms of system variables, rather than random shocks, at earlier times. Equivalently, this means that the MA process be expressible as an AR process.
Such an inversion is easily accomplished formally. First, solve for the current
shock
ε_t = (1 + Σ_{j=1}^{q} β_j B^j)^{−1} x_t ≡ [1 + β(B)]^{−1} x_t,        (9.10.8)

and then substitute (9.10.8) into the original expression (9.10.6) for x_t to obtain

x_t = ε_t + β(B)[1 + β(B)]^{−1} x_t.        (9.10.9)
All terms on the right side of (9.10.9), apart from the current random shock, involve
the state of the system at earlier times. However, for the inverted solution to be
executable and give rise to a stationary solution, the roots of the characteristic
equation
1 + β(B) = 0        (9.10.10)
must lie outside the unit circle. Otherwise, the solution (9.10.9), and therefore
(9.10.6), is not physically acceptable. This is the reason for discarding the other set
of solutions for and in (9.10.5). It can be proven (although not here) that for a
given autocorrelation function there is only one set of q parameters for which MA(q)
is invertible.
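Invertibility is easy to check numerically: all roots of 1 + β(z) = 0 must lie outside the unit circle. A sketch (the coefficient magnitude 0.7 is only the order-of-magnitude value found above, used here purely for illustration):

```python
import numpy as np

def is_invertible(beta):
    """True if the MA(q) process x_t = eps_t + sum_j beta_j eps_{t-j}
    is invertible, i.e. every root of
    1 + beta_1 z + ... + beta_q z^q = 0 lies outside |z| = 1."""
    coeffs = list(beta[::-1]) + [1.0]      # highest power first for np.roots
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

inv = is_invertible([-0.7])                # root z = 1/0.7 ~ 1.43: invertible
noninv = is_invertible([-1.0 / 0.7])       # reciprocal coefficient: root z = 0.7
```

The reciprocal pair illustrates the text's point: both coefficient sets produce the same autocorrelation, but only one of them is invertible.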
Assuming that the MA(1) × MA(1)12 model accounts for the difference series w_t,
the task then remains to deduce the stochastic process that accounts for the original
series yt. That task is a difficult one. Working backward from a difference series to the
original series is the finite-difference analogue to integration of a differential equation. Thus, the model represented by the finite-difference equation
w_t = ∇1∇12 y_t = y_t − y_{t−1} − y_{t−12} + y_{t−13} = ε_t − β ε_{t−1} − δ ε_{t−12} + βδ ε_{t−13}        (9.10.11)
is formally a special case of the general ARMA(p, q) process

y_t = Σ_{j=1}^{p} α_j y_{t−j} + ε_t + Σ_{j=1}^{q} β_j ε_{t−j},        (9.10.12)

with the symbolic solution

y_t = [1 − α(B)]^{−1}[1 + β(B)] ε_t = [(1 − βB − δB^12 + βδB^13) / (1 − B − B^12 + B^13)] ε_t.        (9.10.14)
The variance of the difference series is

σ_w² = var(∇1∇12 y_t) = 4γ_y(0)[1 − ρ_y(1) − ρ_y(12) + ½(ρ_y(11) + ρ_y(13))],        (9.10.15)

in which the general relation

var(Σ_{i=1}^{n} X_i) = Σ_{i=1}^{n} var(X_i) + 2 Σ_{i>j} cov(X_i, X_j)        (9.10.16)
was employed to arrive at the second equality. It then follows immediately that
σ_w²/σ_y² = 4[1 − ρ_y(1) − ρ_y(12) + ½(ρ_y(11) + ρ_y(13))]        (9.10.17)

upon identifying σ_y² ≡ var(y) = γ_y(0).
Turning to the empirical series {y_t} and {∇1∇12 y_t}, we find

σ_y² = var(y_t) = 9.285 × 10³,   σ_w² = var(w_t) = 1.207 × 10⁴  ⇒  σ_y²/σ_w² = 0.769,        (9.10.18)

to be compared with the theoretical ratio

σ_y²/σ_w² = 1 / {4[1 − r_y(1) − r_y(12) + ½(r_y(11) + r_y(13))]} = 0.743.        (9.10.19)
The result is reasonably close. The second equality in (9.10.11) also gives us the
relation for the variance of residuals
σ² = var(w_t) / [(1 + β²)(1 + δ²)]  ⇒  σ² = 1.207 × 10⁴ / (1.49)² = 73.74².        (9.10.20)
To establish whether the hypothesized conditions of the model are met, the residuals
{ε_t} were tested for normality and independence. Evaluating the residuals, however,
poses a difficulty not encountered in the case of AR models where there is a single
random shock ε_t acting at the present. Recall that the residual ε_t is the difference between the current state of the system y_t and the theoretical process from which y_t arose. From (9.10.11), the residual for the ARIMA(0,1,1) × (0,1,1)12 process is

ε_t = y_t − y_{t−1} − y_{t−12} + y_{t−13} + β ε_{t−1} + δ ε_{t−12} − βδ ε_{t−13}.        (9.10.21)
The state of the system y_t within the interval N ≥ t ≥ 1 has been observed and is known, but the random noise cannot be directly observed. How then are the variates ε_{t−j} for j > 0 to be evaluated?
There are several ways to do this, including a method of back-forecasting, but
the simplest approach by far is to use the algorithm
ε_t = 0                                                                         13 ≥ t ≥ 1
ε_t = y_t − y_{t−1} − y_{t−12} + y_{t−13} + β ε_{t−1} + δ ε_{t−12} − βδ ε_{t−13}    t ≥ 14.        (9.10.22)
For a long time series, the arbitrary assignment of 0s to the first 13 residuals will have
little consequence for the residuals test, provided these 13 0s are not included in the
resulting histogram. Execution of this test led to a histogram of residuals that was fit to a Gaussian N(0, 73.74²) [(9.10.20)], with χ²_19 = 10.4 and P = 94.1% (the probability of obtaining a higher χ² for subsequent trials arising from the same stochastic process). Similarly, a plot of the autocorrelation of residuals (to test independence) showed no statistically significant outliers.
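The residual algorithm (9.10.22) translates directly into code; a sketch exercised on a synthetic random-walk series standing in for the actual meter readings:

```python
import numpy as np

def arima_residuals(y, beta, delta):
    """Residuals of ARIMA(0,1,1)x(0,1,1)_12 in the spirit of (9.10.22):
    the first 13 residuals are set to 0, thereafter the recursion
    of (9.10.21) is applied."""
    N = len(y)
    e = np.zeros(N)
    for t in range(13, N):
        e[t] = (y[t] - y[t - 1] - y[t - 12] + y[t - 13]
                + beta * e[t - 1] + delta * e[t - 12]
                - beta * delta * e[t - 13])
    return e

rng = np.random.default_rng(5)
y = np.cumsum(rng.normal(0, 1, 400))       # stand-in series for illustration
e = arima_residuals(y, 0.7, 0.7)
```

The arbitrary zeros at the start decay away because the recursion in the ε's is stable for invertible β and δ, which is why the text's caveat about excluding the first 13 residuals suffices.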
So far, then, the ARIMA model has not failed any of the preliminary tests. For a
more thorough investigation of the consequences of the model, we resort again to the
Monte Carlo method. The top panel of Figure 9.12 shows one of numerous time series generated by computer simulation (black) with parameters β = δ = 0.7 and a Gaussian RNG N(0, (80)²), in comparison with the empirical energy series (gray).
Despite the apparent random fluctuations, there is, upon close examination, a
pronounced correlation of peaks at 12-month intervals. The corresponding power
spectra of the simulated and empirical series are shown in the bottom panel of the
Fig. 9.12 Top panel: simulated ARIMA time series (black) compared with the empirical energy series (gray) displaced upward by 500 units for clarity. Parameters are β = δ = 0.7, σ² = 76². Bottom panel: power spectrum of the simulated (black) and empirical (gray) time series.
figure. The ARIMA model matches both the fundamental and first harmonic in
location and amplitude.
Figure 9.13 shows a panoply of autocorrelation functions of the simulated series
corresponding to the set shown in Figure 9.3 for the empirical series. The accord
between corresponding functions is remarkable. Not every simulated trial resulted in
such good agreement; after all, we are dealing with a stochastic process. Nevertheless, the fact that the ARIMA model with maximum likelihood parameters β, δ, σ² generates rather easily patterns like those shown in the figure suggests (. . . but does not prove . . .) that the hypothesized process can account for the record of meter readings representing my electric energy consumption.
As in my exploration of AR models, I also examined by computer simulation the results of combining ARIMA(0,1,1) × (0,1,1)12 and DeCos models, which leads to a 5-parameter stochastic equation. Some of the trials led to autocorrelation functions that matched the empirical functions quite well, but the simulated time series exhibited an oscillatory structure after about the forty-eighth month that was too regular and smooth to represent convincingly the actual series of readings.
Fig. 9.13 Autocorrelation (AC) of the simulated ARIMA time series of Figure 9.12: y_t^s (top panel); ∇1 y_t^s (second panel), ∇12 y_t^s (third panel), ∇1∇12 y_t^s (bottom panel). AC of the empirical energy readings (gray dots) is also shown in the top panel. Dashed lines delimit the region within approximately ±2 standard deviations.
These questions have elicited much discussion in the past . . . and probably still do.
A succinct response to what is perhaps the most basic of the questions is the remark
attributed to G. E. P. Box, a pioneer in time series analysis: "Essentially, all models are wrong, but some are useful."⁹ Like many aphorisms, this one contains some
truth, but needs to be understood in context. I suppose one might consider the
application of Maxwell's equations to some classical electromagnetic system as a "model" of that system, but, given the fundamental role that electrodynamics assumes in the structure of theoretical physics, I can hardly imagine a physicist thinking of the theory as "wrong, but useful." Even if an electromagnetic system
being investigated turned out to be quantum in nature, rather than classical, a
resulting discrepancy between calculation and experiment would more likely draw
a judgment that the limits of validity of the theory were exceeded, rather than that
classical electrodynamics is wrong.
Loosely speaking, then, a model is what one constructs when there is no fundamental overarching theory to draw upon. That is often the case when the process to
be understood has originated in a field of study without self-consistent, reproducibly
testable fundamental laws or principles or, to the contrary, such principles may be
known but the problem is of such complexity that it is uncertain how to implement
them. For the subject matter of this chapter, the uncertain flow of electric energy, where noise in the system reflects human behavior influenced by economics (cost of energy), environmental concerns (aversion to waste and pollution), as well as physical circumstances (seasonality and local temperature), it is probably safe to say that there is no unique law. Under other circumstances, however, where human behavior does not enter significantly (for example, the investigation in the next chapter of the variable flux of solar energy into the ground), an appropriate physical process will emerge as the explanatory mechanism.
The matter of a true law aside, how is the best model (of a set of trial models) to be
determined? Two approaches that are widely relied upon are
(a) the Akaike information criterion (AIC), and
(b) the Bayesian information criterion (BIC).
The first was initially derived from information theory; the second was the outcome
of a Bayesian analysis. Both methods, in fact, can be products of Bayesian reasoning,
the only difference being in the choice of priors. Consider first the AIC.
In Chapter 6, we touched upon the rudiments of information theory as developed
in the late 1940s primarily by Claude Shannon.10 At the core of this theory is the
concept of statistical entropy H, which is a probabilistic measure of the information
contained in some set of symbols. The entropy concept is extraordinarily general and
⁹ S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics 22 (1951) 79–86.
¹⁰ For details, consult S. Kullback, Information Theory and Statistics (Dover, 1968), Chapter 1.
¹⁴ H. Akaike, "Information theory as an extension of the maximum likelihood principle," in Second International Symposium on Information Theory, (Eds.) B. N. Petrov and F. Csaki (Akademiai Kiado, Budapest, 1973) 267–281. If \(g(x|\theta)\) is the pdf of the model and \(\hat\theta\) are the ML parameters determined from a set of data y, then the double averaging referred to in the text is of the form \(E_y\,E_x\big[\log g(x|\hat\theta(y))\big]\), in which x and y are conceptualized as independent random samples from the same distribution.
\[ \mathrm{AIC} = -2\ln L_{\max} + 2K. \tag{9.11.1} \]
The first term of (9.11.1) is the negative of twice the log-likelihood function evaluated at the ML parameters; the second term is twice the number K of free parameters (including the variance of the random shock) characteristic of a particular model. Upon substitution of the ML parameters into the log-likelihood, the AIC reduces to the form (derived in an appendix)
\[ \mathrm{AIC} = N\ln\hat\sigma^2 + 2K. \tag{9.11.2} \]
As we shall see, (9.11.2) can be evaluated fairly easily for all the models we have examined. Equation (9.11.2) is an asymptotic expression, applicable for N/K greater than about 40. A second-order (in K) correction to the AIC, symbolically represented by AICc,
\[ \mathrm{AIC}_c = \mathrm{AIC} + \frac{2K(K+1)}{N-K-1} = N\ln\hat\sigma^2 + \frac{2NK}{N-K-1}, \tag{9.11.4} \]
was derived by N. Sugiura¹⁵ in 1978 for finite sample size. For notational simplicity in the remainder of this section, I will use the symbol AIC to represent relation (9.11.4).
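As a quick illustration (the function names below are mine, not the book's), relations (9.11.2) and (9.11.4) can be coded directly:

```python
import math

def aic(N, K, sigma2_hat):
    """Asymptotic form (9.11.2): AIC = N ln(sigma^2) + 2K."""
    return N * math.log(sigma2_hat) + 2 * K

def aicc(N, K, sigma2_hat):
    """Finite-sample correction (9.11.4): AICc = AIC + 2K(K+1)/(N - K - 1)."""
    return aic(N, K, sigma2_hat) + 2 * K * (K + 1) / (N - K - 1)
```

For N/K greater than about 40 the two values are practically indistinguishable, as the asymptotic claim above implies.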
Given a set of data and various explanatory models (that cannot be eliminated on prior theoretical grounds), the model deemed most suitable is the one that minimizes the AIC. For a given class of model, a larger number of parameters may result in a lower \(\hat\sigma^2\), but the structure of the AIC is such as to penalize models with higher numbers of parameters. As long as one is comparing models for the same set of data, the use of the AIC is not restricted to a set of nested models, i.e. a series of models for which the master equations differ sequentially from one another by an additional term. Moreover, the statistical distribution of the noise \(\varepsilon_t\) need not be the same for all competing models being ranked by the AIC.
We consider next, but more briefly, the Bayesian information criterion (BIC),
proposed in 1978 by G. Schwarz,16 on the basis of a Bayesian probability argument
rather than information theory. Noting that the maximum likelihood principle
invariably leads to choosing the highest possible dimension for the parameter space
of a given class of models, Schwarz sought a modification of the ML procedure by
examining the asymptotic behavior of Bayes estimators employing priors that
¹⁵ N. Sugiura, "Further analysis of the data by Akaike's information criterion and the finite corrections," Communications in Statistics, Theory and Methods A7 (1978) 13–26.
¹⁶ G. Schwarz, "Estimating the dimension of a model," Annals of Statistics 6 (1978) 461–464.
concentrate probability on lower-dimensional subspaces corresponding to the parameter spaces of competing models. The criterion he arrived at takes the form
\[ \mathrm{BIC} = -2\ln L_{\max} + K\ln N = N\ln\hat\sigma^2 + K\ln N, \tag{9.11.5} \]
which levies a greater penalty than does the AIC on models with many free parameters. As in the case of the Akaike criterion, the best model (of a set of competing models) is the one with lowest BIC.
Comparison of (9.11.4) and (9.11.5) shows that the greater the number N of included observations, the more the BIC and AIC will differ in their assessment of a model of given K. Which of the two criteria is conceptually better founded, or at least more useful? In the derivation of the BIC all models being compared were initially presumed to be equally likely; i.e. the prior distribution was uniform. Some analysts have found this to be a poor choice:

  While [a uniform prior] seems reasonable and innocent, it is not always reasonable and is never innocent; that is, it implies that the target model is truth rather than a best approximating model, given that parameters are to be estimated. This is an important and unexpected result.¹⁷

The AIC can likewise be derived from a Bayesian (rather than information-theoretic) argument, but with what is termed a "savvy" prior, i.e. a prior that depends on the number of free parameters and number of observations.
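The different penalty structures can be made explicit in a short sketch (an illustration, not from the text): since the BIC penalty is K ln N against the AIC's 2K, the BIC penalizes extra parameters more heavily whenever ln N > 2, i.e. for any record longer than e² ≈ 7.4 observations.

```python
import math

def aic_penalty(K):
    # AIC adds 2 per free parameter, independent of sample size
    return 2 * K

def bic_penalty(K, N):
    # BIC adds ln(N) per free parameter, growing with sample size
    return K * math.log(N)
```

For the monthly records of this chapter (N of order 100), the BIC penalty per parameter is more than double the AIC's.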
Table 9.1 summarizes the properties of seven models (out of a much larger set I investigated) that best account for the statistical features of my record of electric energy use. For purposes of organization, they are listed as belonging to one of three broad families: AR, ARIMA, and pure Cos (either adaptive or deterministic). Most of the model parameters were estimated by the principle of maximum likelihood; a few resulted from visual inspection of computer simulations to match the calculated and empirical autocorrelation functions.
Table 9.2 summarizes the results of applying the AIC and BIC procedures to the models of Table 9.1. The three best models, according to both sets of criteria, are ranked in Table 9.3 in order of increasing AIC.
The first-ranked (lowest AIC) model is the seasonal ARIMA, which posits that my use of electrical energy at any moment is conditioned upon my use 1, 12, and 13 months previously, as well as by Gaussian random noise at the moment and at 1, 12, and 13 months previously. This is a fairly complicated law. It is not likely to be the process that first occurs to a person looking at the zigs and zags of the empirical time series or its autocorrelation. And yet, it is also a very simple law when understood as the outcome of a multiplicative differencing operation at the intervals of one month (smallest interval between recordings) and one year (period of the Earth's revolution).
¹⁷ K. P. Burnham and D. R. Anderson, "Multimodel inference: understanding the AIC and BIC in model selection," Sociological Methods and Research 33 (2004) 261–304.
Table 9.1. Parameters¹⁸ of the principal models

Parameter   AR(12)      AR(12)1,12   AR(12)1,12 + DeCos
φ̂1          0.3025      0.3476       0.2790
φ̂2          0.093 44    0            0
φ̂3          0.046 19    0            0
φ̂4          0.024 98    0            0
φ̂5          0.076 02    0            0
φ̂6          0.033 91    0            0
φ̂7          0.1227      0            0
φ̂8          0.000 55    0            0
φ̂9          0.054 37    0            0
φ̂10         0.055 48    0            0
φ̂11         0.095 71    0            0
φ̂12         0.4089      0.4700       0.3431
σ̂²          (70.86)²    (73.28)²     (70.00)²
â           ……          ……           10.77
b̂           ……          ……           35.86

Parameter   AdCos       DeCos       ARIMA(0,1,1)×(0,1,1)12
θ̂           ……          ……          0.6721
Θ̂           ……          ……          0.6583
σ̂²          ……          ……          (70.67)²
â           ……          42.205      ……
b̂           ……          68.742      ……

AdCos parameters: 0.98; 1.697; α² = 0.960.

¹⁸ A caret over a parameter signifies that it is a maximum-likelihood estimate. Parameters not marked by a caret were obtained by simulation and visual inspection.
Table 9.2. AIC and BIC for the models of Table 9.1

Model                            Parameters r   Residual var σ̂²   AICc    BIC
AR(12)                           13             (70.9)²            941.7   972.5
AR(12)1,12                       3              (73.3)²            925.2   933.0
AR(12)1,12 + DeCos               5              (70.0)²            919.8   932.6
AdCos                            2              (138.1)²           1059    1064
DeCos                            3              (79.1)²            941.4   949.2
ARIMA(0,1,1)×(0,1,1)12 + DeCos   5              (98.8)²            993.5   1006
ARIMA(0,1,1)×(0,1,1)12           3              (70.7)²            917.4   925.2

Table 9.3. The three best models ranked in order of increasing AIC

Model                    AIC     Rank   ΔAIC   Rel. prob.
ARIMA(0,1,1)×(0,1,1)12   917.4   1      0      1
AR(12)1,12 + DeCos       919.8   2      2.4    31%
AR(12)1,12               925.2   3      7.8    2.0%
such criteria are only guides, not rigid directives. Ultimately, the suitability of a model to explain some physical phenomenon must depend on how compatible that model is with what else is known and on the purpose for which the model is intended. Nevertheless, having the AIC values for a set of models, what can one do with these numbers?
If AIC_min designates the lowest AICc value of a set of models, then the AIC difference of model i is defined by
\[ \Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}, \tag{9.11.6} \]
and the relative likelihood of model i given a fixed set of data is proportional to \(e^{-\Delta_i/2}\). Thus, the AIC provides one way to estimate the probability of one model compared to another. Table 9.3 shows that the second-ranked model is about 31% as probable as the first-ranked, and the third-ranked model is only about 2% as probable as the first-ranked. The other models listed in Table 9.2 have AIC differences sufficiently large (>20) as to justify their exclusion from further consideration.
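The conversion of AIC differences into relative model probabilities is a one-liner; the sketch below (my own illustration) reproduces the relative-probability column of Table 9.3 from the Δ values 0, 2.4, and 7.8:

```python
import math

def relative_likelihood(delta):
    """Likelihood of a model relative to the best one: exp(-Delta_i / 2)."""
    return math.exp(-delta / 2)

# AIC differences of the three best-ranked models (Table 9.3)
rel_prob = [relative_likelihood(d) for d in (0.0, 2.4, 7.8)]
```

The results, 1, 0.30, and 0.020, match the quoted 31% and 2.0% to rounding.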
Fig. 9.14 Segment of energy time series (small gray dots with connecting lines) with superposed 12-month moving average (large gray dots) and least-squares line of regression (dashed) showing positive trend. (Axes: time in months vs. energy in kWh.)
A moving average of period T = 5, for example, replaces \(x_t\) by the mean
\[ y_t = \frac{1}{5}\sum_{j=0}^{4} x_{t-j}. \]
A minor difficulty arises when the period is an even number, as in the present case (T = 12). The midpoint τ = 11/2 falls halfway between five and six time units from the starting time t. The problem is surmounted by centering the average, i.e. by first performing a moving average of 12 units and then a second moving average of two units. In detail, the procedure works as follows
\[
\left.
\begin{aligned}
y_t &= \frac{1}{12}\sum_{j=0}^{11} x_{t-j}\\
y_{t-1} &= \frac{1}{12}\sum_{j=0}^{11} x_{t-1-j} = \frac{1}{12}\sum_{j=1}^{12} x_{t-j}\\
z_{t-6} &= \tfrac{1}{2}\big(y_t + y_{t-1}\big)
\end{aligned}
\right\}
\;\Rightarrow\;
z_{t-6} = \frac{1}{24}\big(x_t + 2x_{t-1} + \cdots + 2x_{t-11} + x_{t-12}\big).
\tag{9.12.1}
\]
The centered moving average \(z_t\) is equivalent to transforming the original series \(x_t\) by a single moving average of 13 time units with weights \(\frac{1}{24}(1, 2, 2, \ldots, 2, 2, 1)\).
The trace of large gray dots in Figure 9.14 shows the results of a centered 12-month moving average on the 51-month portion of the electric energy record that disturbed me. The trace of points does indeed trend upward, as shown by a very well matched least-squares regression line.
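The equivalence asserted in (9.12.1), namely that the two-stage centered average equals a single 13-point weighted average with weights (1, 2, …, 2, 1)/24, is easy to check numerically (an illustrative sketch, not the author's code):

```python
def centered_ma12(x, t):
    """Two-stage centered average z_{t-6}: the mean of the 12-point
    averages ending at t and at t-1 (cf. Eq. 9.12.1); requires t >= 12."""
    y_t = sum(x[t - j] for j in range(12)) / 12
    y_t1 = sum(x[t - j] for j in range(1, 13)) / 12
    return 0.5 * (y_t + y_t1)

def weighted_ma13(x, t):
    """Single 13-point average with weights (1, 2, ..., 2, 1)/24."""
    w = [1] + [2] * 11 + [1]
    return sum(w[j] * x[t - j] for j in range(13)) / 24
```

The two functions agree to machine precision on any input series.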
However, the trend that I could detect statistically is 1.056 kWh out of a mean monthly usage¹⁹ of 333.68 kWh, or
¹⁹ If energy increases in time (months) as E(t) = α + βt, then the mean energy per month is \(\bar E = \alpha + \beta\,\bar t\).
Fig. 9.15 Continuation (after first meter replacement) of energy time series (small gray dots with connecting lines) of Figure 9.14 with superposed 12-month moving average (large gray dots) and least-squares line of regression (dashed black) up to time of second meter replacement. Subsequent 12-month moving average is shown as a descending dashed gray line.
\[ \frac{1.056}{333.68} = 0.00316, \tag{9.12.4} \]
which is a smaller number, and therefore higher sensitivity, than what the power company could measure. I pointed that out to Dan and asked whether the power company ever tested their meters for long-term drift. He said he didn't know, but would get back to me. The answer turned out to be "No."
Dan put in a request for the meter at my home to be replaced. The replacement was duly made, and over the course of the next three years, while carrying on with my other projects, I nevertheless recorded each month the energy reading on my electric bill. A cursory inspection of Figure 9.15 might suggest that all was well. The small gray dots and connecting lines again mark the monthly energy consumption. To my satisfaction, a centered 12-month moving average (large gray dots) of the 39-month time series could be fit to a straight (dashed black) line with a slope so flat that the least-squares trend
\[ \text{Meter \#2:}\qquad \text{Slope} = 0.174 \pm 0.17\ \text{kWh/month}, \qquad \text{Intercept} = 350.95 \pm 4.14\ \text{kWh} \tag{9.12.5} \]
was statistically equivalent to 0.
There was only one problem. Look closely at (9.12.5) and (9.12.2). According to the replaced meter, my average energy consumption per month had suddenly jumped up by close to 351 − 333 = 18 kWh (i.e. by ~5.4%) when, in reality, during all those months I was still living an electrically frugal life. In the terminology of physics, the
power company seems to have substituted a defective meter with bias for a defective
meter with drift.
I contacted the power company again. (I think they remembered me.) Without my
requesting it, a technician came to the house soon afterward and replaced the meter.
That was unusual, so I telephoned the meter department to inquire why, and was told that the company had decided to replace all the residential electric meters in the State with what they termed "smart meters." The feature of a smart meter that made it smart was that it transmitted meter readings at about 900 MHz to a drive-by reader so that no power company employee had to visit the house and actually look at the meter. That may be so, but I could not help thinking that a smart meter might also be one programmed to generate bias and drift at levels more difficult for a statistical physicist to detect.
The dashed gray line at the far right in Figure 9.15 shows a segment of the 12-month moving average of the energy time series for about one year following installation of the third meter, the smart meter. The mean energy consumption has dropped precipitously and the slope appears flat. At the time of writing, this is the current meter in my home.
Hence, the probability that the specific individual does not win twice is \((1-p^2)\). If there are N ticket purchasers, the probability that no one wins twice is \((1-p^2)^N\). Therefore, the probability that at least one person does win twice is
\[ P_2 = P(\text{Win Twice}\,|\,N, p) = 1 - \big(1-p^2\big)^N. \tag{9.13.1} \]
A more recent internet search informed me that the failure rate of smart meters made by a particular company for residential electric customers in California was 1600 out of 2 million, or 0.08%. If this failure rate applied to the company serving the State where I live, the probability of someone getting two defective meters would be 53.6%.
The power company serves about 1.2 million customers. If each were charged the extra amount in (9.13.3), the company would have brought in an unearned profit of more than $15 million. Now we are talking about real money. Moreover, if meter error and therefore overcharge were proportional to consumption, the illicit profit would be considerably higher, since my own monthly consumption is comparatively low.
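The 53.6% figure cited above follows directly from (9.13.1) with the stated failure rate; a sketch with the numbers plugged in:

```python
p = 1600 / 2_000_000      # cited failure rate of the California smart meters
N = 1_200_000             # approximate number of customers served

# Probability that at least one customer draws two defective meters,
# Eq. (9.13.1): P2 = 1 - (1 - p^2)^N
P2 = 1 - (1 - p ** 2) ** N
```

Despite the tiny individual probability p² ≈ 6.4 × 10⁻⁷, the enormous number of customers makes the event better than even odds.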
The relative likelihood of Scenario II to Scenario I can be tested, in principle, by examining the meters of a representative sample of the company's customers to see whether their drift, if any, and bias, if any, are all (or mostly) positive, or whether the odds are 50/50 for a meter to register an extra profit or loss for the company. Such a test would require appropriately sensitive instrumentation (which the meter department apparently did not have at the time and, for all I know, may not have now) or the time and patience to conduct a statistical analysis of a lot of time series.
So here, in summary, are two possible explanations for the unexpected positive trend in energy readings that I found.
One is the near-certain probability that someone would receive two defective meters purely by chance, and I was that someone.
The other is that (let us be imaginative for a moment) somewhere in a subterranean level of the power company's home office building, and unknown to most employees, is the Department of Unearned Profit Enhancement (DUPE), whose assignment is to design meters that indicate excess energy consumption by amounts so low that not even the meter engineers (Dan's group) can detect it. The ruse is potentially highly rewarding and virtually undiscoverable unless a customer keeps careful track of energy usage.
Which is correct? You choose.
Appendices

The geometric series
\[ \sum_{k=0}^{\infty} \lambda^k = \frac{1}{1-\lambda} \qquad (|\lambda| < 1) \tag{9.14.1} \]
permits expansion of the inverse differencing operators,
\[ (1-B)^{-1}\,(1-B^{12})^{-1} = \sum_{k=0}^{\infty}\sum_{j=0}^{k} B^{\,12k-11j}, \]
whereupon
\[ u_t = \big[1-\theta B\big]\big[1-\Theta B^{12}\big] \sum_{k=0}^{\infty}\sum_{j=0}^{k} B^{\,12k-11j}\,\varepsilon_t \tag{9.14.2} \]
\[ \phantom{u_t} = \sum_{k=0}^{\infty}\sum_{j=0}^{k} \gamma_{kj}\,\varepsilon_{t-12k+11j}, \tag{9.14.3} \]
in which the coefficients \(\gamma_{kj}\) collect the powers of \(\theta\) and \(\Theta\) generated by the expansion. From the property of the Gaussian distribution (previously demonstrated in greater generality)
\[ c\,N(0, \sigma^2) = N(0, c^2\sigma^2), \]
it follows from (9.14.2) that
\[ u_t \sim N\!\left(0,\; \sum_{k=0}^{\infty}\sum_{j=0}^{k} \gamma_{kj}^2\,\sigma^2\right). \tag{9.14.4} \]
The evaluation of the sum over j in (9.14.4) will not be given here. The fastest way to confirm the relation in (9.8.2) is to use a symbolic mathematical application like Maple.
For an autoregressive model of order n,
\[ y_t = \sum_{j=1}^{n} \phi_j\, y_{t-j} + \varepsilon_t, \tag{9.15.1} \]
the random shock is
\[ \varepsilon_t = y_t - \sum_{j=1}^{n} \phi_j\, y_{t-j} \qquad (j = 1 \ldots n). \tag{9.15.2} \]
Setting to zero the derivatives of the log-likelihood L with respect to the parameters,
\[ \frac{\partial L}{\partial \phi_j} = 0, \qquad \frac{\partial L}{\partial \sigma^2} = 0, \tag{9.15.4} \]
leads to the system of normal equations
\[ \sum_{j=1}^{n} \hat\phi_j\, M_{jk} = V_k \qquad (k = 1 \ldots n), \tag{9.15.5} \]
where
\[ M_{jk} = \sum_{t=n+1}^{N} y_{t-j}\, y_{t-k}, \qquad V_k = \sum_{t=n+1}^{N} y_t\, y_{t-k}, \tag{9.15.6} \]
and to the ML estimate of the shock variance
\[ \hat\sigma^2 = \frac{1}{N-n} \sum_{t=n+1}^{N} \Big( y_t - \sum_{j=1}^{n} \hat\phi_j\, y_{t-j} \Big)^{2}. \tag{9.15.7} \]
The nonvanishing elements of the associated Hessian matrix are
\[ H_{jk} = \frac{M_{jk}}{\hat\sigma^2}, \qquad H_{n+1,\,n+1} = \frac{N-n}{2\hat\sigma^4}. \tag{9.15.9} \]
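For an AR(1) model the normal equations of the form (9.15.5)–(9.15.6) reduce to a single scalar division; the sketch below (my own illustration, pure Python) recovers a known coefficient from synthetic data:

```python
import random

def fit_ar1(y):
    """ML estimate of phi in y_t = phi*y_{t-1} + eps_t via the normal
    equation M*phi_hat = V (cf. Eqs. 9.15.5-9.15.6 with n = 1)."""
    N = len(y)
    M = sum(y[t - 1] * y[t - 1] for t in range(1, N))
    V = sum(y[t] * y[t - 1] for t in range(1, N))
    phi_hat = V / M
    # Shock-variance estimate in the spirit of Eq. (9.15.7)
    sigma2_hat = sum((y[t] - phi_hat * y[t - 1]) ** 2
                     for t in range(1, N)) / (N - 1)
    return phi_hat, sigma2_hat

# Synthetic AR(1) series with phi = 0.6 and unit shock variance
random.seed(12345)
y = [0.0]
for _ in range(20000):
    y.append(0.6 * y[-1] + random.gauss(0.0, 1.0))
phi_hat, sigma2_hat = fit_ar1(y)
```

With 20 000 points the estimates land within a few percent of the true values.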
The log-likelihood of a Gaussian model is
\[ L = -\frac{N}{2}\ln\big(2\pi\sigma^2\big) - \frac{1}{2\sigma^2}\sum_{t=1}^{N}\varepsilon_t^2, \tag{9.16.1} \]
where the actual expression for \(\varepsilon_t\) in a particular model (such as (9.15.2)) contains the parameters to be estimated. That the summation in (9.16.1) begins with index t = 1 does not lessen the generality, since one can set \(\varepsilon_t = 0\) for some initial range of t.
Factoring N from (9.16.1) and identifying \(\frac{1}{N}\sum_{t=1}^{N}\varepsilon_t^2\) with the ML parameter \(\hat\sigma^2\) leads to
\[ -2\ln L_{\max} = N\ln\hat\sigma^2 + N\big(\ln 2\pi + 1\big), \tag{9.16.2} \]
in which the second term is a constant that can be dropped from the AIC since the models to be compared must all be based on the same data set of length N.
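The step from (9.16.1) to (9.16.2) can be verified numerically: evaluated at σ² = σ̂², the log-likelihood obeys −2 ln L_max = N ln σ̂² + N(ln 2π + 1) identically. A sketch:

```python
import math
import random

random.seed(7)
eps = [random.gauss(0.0, 1.3) for _ in range(500)]
N = len(eps)
sigma2_hat = sum(e * e for e in eps) / N      # ML variance estimate

# Gaussian log-likelihood (9.16.1) evaluated at sigma^2 = sigma2_hat
logL = -0.5 * N * math.log(2 * math.pi * sigma2_hat) \
       - sum(e * e for e in eps) / (2 * sigma2_hat)

# Closed form (9.16.2): -2 ln L_max = N ln sigma2_hat + N (ln 2*pi + 1)
closed = N * math.log(sigma2_hat) + N * (math.log(2 * math.pi) + 1)
```

The two quantities agree to floating-point precision for any sample.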
Of more immediate use here, however, are the relations employed in the analysis of the trend line in Figures 9.14 and 9.15.
The sum of squares of residuals of the line to which the 12-month moving average in Figures 9.14 and 9.15 was fit takes the form
\[ Q(a, b) = \sum_{t}\big(x_t - a - bt\big)^2. \tag{9.17.1} \]
Equating to zero derivatives of (9.17.1) with respect to a and b leads to the formal matrix solution
\[
\begin{pmatrix} \hat a \\ \hat b \end{pmatrix}
=
\begin{pmatrix} \displaystyle\sum_{t=7}^{N} 1 & \displaystyle\sum_{t=7}^{N} t \\[6pt] \displaystyle\sum_{t=7}^{N} t & \displaystyle\sum_{t=7}^{N} t^2 \end{pmatrix}^{\!-1}
\begin{pmatrix} \displaystyle\sum_{t=7}^{N} x_t \\[6pt] \displaystyle\sum_{t=7}^{N} t\,x_t \end{pmatrix}
\tag{9.17.2}
\]
and to the residual variance
\[ V = \frac{1}{N-6} \sum_{t=7}^{N} \big(x_t - \hat a - \hat b t\big)^2. \tag{9.17.3} \]
The elements of the inverse coefficient matrix in (9.17.2) are readily evaluated by computer as sums, but can also be put into closed form
\[
\sum_{t=7}^{N} 1 = N-6, \qquad
\sum_{t=7}^{N} t = \tfrac{1}{2}(N+7)(N-6), \qquad
\sum_{t=7}^{N} t^2 = \tfrac{1}{6}(N-6)\big(2N^2+15N+91\big)
\tag{9.17.4}
\]
to obtain analytical expressions for the regression parameters. The time index t in the sums begins with 7, rather than 1, because the length of the 12-month moving average series must be shorter than the original time series by six elements.
The covariance matrix yields the following expressions for the uncertainties and covariance of the two parameters
\[ \mathrm{var}(a) = \frac{V \sum_{t=7}^{N} t^2}{(N-6)\sum_{t=7}^{N} (t-\bar t\,)^2} \tag{9.17.5} \]
\[ \mathrm{var}(b) = \frac{V}{\sum_{t=7}^{N} (t-\bar t\,)^2} \tag{9.17.6} \]
\[ \mathrm{cov}(a, b) = \frac{-V\,\bar t}{\sum_{t=7}^{N} (t-\bar t\,)^2} \tag{9.17.7} \]
where
\[ \bar t = \frac{1}{N-6}\sum_{t=7}^{N} t = \frac{N+7}{2}. \tag{9.17.8} \]
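The closed forms of the kind given in (9.17.4) are easy to confirm against direct summation (an illustrative check):

```python
def sums_from_7(N):
    """Direct evaluation of the three sums over t = 7 ... N."""
    s0 = sum(1 for t in range(7, N + 1))
    s1 = sum(t for t in range(7, N + 1))
    s2 = sum(t * t for t in range(7, N + 1))
    return s0, s1, s2

def closed_forms(N):
    """Closed-form expressions: N - 6, (N+7)(N-6)/2, (N-6)(2N^2+15N+91)/6."""
    s0 = N - 6
    s1 = (N + 7) * (N - 6) // 2
    s2 = (N - 6) * (2 * N * N + 15 * N + 91) // 6
    return s0, s1, s2
```

Both routines return identical integer triples for every N ≥ 7.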
10
The random flow of energy, Part II:
A warning from the weather under ground
from the photosphere, whose temperature of 5800 K is inferred from the Wien displacement law
\[ \lambda_{\max}\, T = 2.9 \times 10^{6}\ \mathrm{nm\,K} \tag{10.1.1} \]
and the wavelength \(\lambda_{\max} \approx 500\) nm of its peak emission. Given the temperature T of a spherical black body of radius R, the rate of radiant emission follows from the Stefan–Boltzmann law
\[ P_{\mathrm{rad}} = 4\pi R^2\, \sigma_{SB}\, T^4, \tag{10.1.2} \]
where \(\sigma_{SB}\) is the Stefan–Boltzmann constant.
One can derive the relation \(\sigma_{SB} = 2\pi^5 k_B^4 / 15 h^3 c^2\) from quantum statistics, in which h is Planck's constant, c is the speed of light, and \(k_B\) is Boltzmann's constant.
1 tonne = 1000 kg, equivalent to about 2200 lbs. The US ton is 2000 lbs.
of stainless steel. No steel, however, would melt in this ultra-rarefied layer, through which the International Space Station orbits (at about 400 km). Temperature is a thermodynamic or statistical concept that ordinarily loses meaning when applied to individual particles. Beyond the thermopause, the exosphere, in which remaining atoms and molecules move freely along ballistic trajectories, extends to the near-perfect vacuum of space.
The radiant flux of the Sun, i.e. energy transported each second across a unit area normal to the flow, measured at the top of the atmosphere, is close to 1361 W/m², a value referred to as the solar constant S₀ (although it actually varies somewhat in time and location). What precisely is meant by "top of the atmosphere" depends on the method of measurement, the earliest being by high-altitude balloon (floating in the stratosphere at 30 km) and the more recent by satellite (e.g. at about 950 km in the thermosphere). Knowing the inverse-square law of light intensity (for isotropic radiation) and the distance of the Sun from Earth, one can estimate a value for S₀:
\[ S_0 = \frac{P_S}{4\pi r_S^2} \approx \frac{4\times 10^{26}\ \mathrm{W}}{4\pi\,\big(1.5\times 10^{11}\ \mathrm{m}\big)^2} \approx 1415\ \mathrm{W/m^2}. \tag{10.1.4} \]
The rate of energy absorption by a differential patch of area dS with outward unit normal \(\hat{\mathbf n}\) is \(dP_{\mathrm{abs}} = -\,\mathbf j_S \cdot \hat{\mathbf n}\, dS\). Integration over the hemisphere facing the Sun yields \(P_{\mathrm{abs}}\) in (10.1.5), as shown in an appendix.
Given that nearly 31% of incoming solar radiation is reflected back into space (a fraction that defines the Earth's albedo), and yet the planet's mean surface temperature is actually about +15 °C, rather than even lower than −18.4 °C, one may be moved to inquire how such warming is possible. The answer lies within the
Fig. 10.1 Schematic diagram of energy flow in a planetary atmosphere modeled by n discrete layers. \(j_k\) is the flux (upward or downward) from the kth layer. The solar flux \(j_S\) is directly absorbed by the ground, which radiates flux \(j_G\).
At thermal equilibrium, the rates of energy input and energy output must balance within each layer, whereupon the processes schematically shown in Figure 10.1 give rise to the following sequence of equations
\[
\begin{aligned}
j_n &= j_S\\
2j_n &= j_{n-1} \;\Rightarrow\; j_{n-1} = 2j_S\\
2j_{n-1} &= j_n + j_{n-2} \;\Rightarrow\; j_{n-2} = 3j_S\\
2j_{n-2} &= j_{n-1} + j_{n-3} \;\Rightarrow\; j_{n-3} = 4j_S\\
&\;\;\vdots\\
2j_2 &= j_3 + j_1 \;\Rightarrow\; j_1 = n\,j_S\\
2j_1 &= j_2 + j_G \;\Rightarrow\; j_G = (n+1)\,j_S.
\end{aligned}
\tag{10.1.7}
\]
In general,
\[ j_k = (n+1-k)\, j_S \qquad (k = 0 \ldots n), \tag{10.1.8} \]
with \(j_0 \equiv j_G\). The mean surface temperature then follows from the Stefan–Boltzmann law:
\[ T_E = \left(\frac{j_G}{\sigma_{SB}}\right)^{1/4} = \left(\frac{(n+1)\,j_S}{\sigma_{SB}}\right)^{1/4}
\;\Rightarrow\;
\begin{cases} 29.8\ {}^\circ\mathrm{C} & (n = 1)\\[2pt] 15.1\ {}^\circ\mathrm{C} & (n = 0.64). \end{cases} \tag{10.1.9} \]
From relation (10.1.9) it is clear that the assumption of even one layer leads to too high a mean Earth temperature, whereas disregarding the atmosphere leads to too cold a mean temperature. The thermodynamic principles underlying (10.1.9) are correct, although the assumption that the atmosphere absorbs all the IR emission from the ground is not. A non-integer value 0.64 for n leads to a closer prediction of the Earth's mean temperature. The effective number of layers is defined by the empirical relation
\[ n_{\mathrm{eff}} = \left(\frac{T_E}{T_E^{(0)}}\right)^{4} - 1, \tag{10.1.10} \]
where \(T_E^{(0)}\) is the mean surface temperature the Earth would have in the absence of an atmosphere.
0
I discuss optical thickness and processes of absorption and scattering in the atmosphere in the book, M. P. Silverman,
Waves and Grains: Reflections on Light and Learning (Princeton University Press, 1998).
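Relations (10.1.9) and (10.1.10) amount to two short functions. In the sketch below (an illustration; the no-atmosphere temperature 254.75 K, i.e. −18.4 °C, is taken from the text), the one-layer and 0.64-layer cases reproduce the quoted 29.8 °C and 15.1 °C:

```python
def earth_temperature(n, T0=254.75):
    """Mean surface temperature of an Earth wrapped in n fully absorbing
    layers: T_E = T0 * (n + 1)**0.25 (Eq. 10.1.9), with T0 the
    no-atmosphere (n = 0) temperature in kelvins."""
    return T0 * (n + 1) ** 0.25

def n_effective(TE, T0=254.75):
    """Effective number of layers, Eq. (10.1.10): (TE/T0)^4 - 1."""
    return (TE / T0) ** 4 - 1
```

Conversely, feeding the observed mean temperature of about 288 K back through `n_effective` returns the 0.64 layers quoted above.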
Fig. 10.2 Panoramic plots (truncated to 2.5 years) of temperature variations measured hourly during the period 2007–2012 by sensors at depths (in cm) of (a) 10, (b) 20, (c) 40, (d) 80, (e) 160, and (f) 240. The closer to the surface is the sensor location, the greater is the sensitivity to diurnal noise. The temperature record at depth d is designated \(x_d\).
a data logger, which has recorded the six temperatures every hour on the hour since noon of 7 June 2007.
At the time of writing this chapter, the experiment was still in progress. The temperature histories that underlie this narrative each comprise N = 47 240 observations, or about 5.4 years of collection. A panoramic sample of the data taken from all six sensors during the first two and a half years is shown in Figure 10.2. Labeled (a) to (f) according to depth, the temperature variations rise and fall asynchronously in time with what is unmistakably a 12-month period, although we shall examine the frequency content more thoroughly in due course. The thin black trace (f), designated x240(t), comes from the deepest probe, which, at a depth of 240 cm, sat in thermal silence, experiencing for the most part only the slow change of the seasons. In contrast, (a), the noisiest gray trace x10(t), a mere 10 cm below the surface, spent each moment in a (metaphorical) thermal rock concert, responding nervously to each shriek and cry of the weather. With increasing depth, as the time series go from x20 to x40 to x80 to x160, the temperature fluctuations become calmer.
Like the plot of electric energy usage in the previous chapter, the plots of underground temperature variations reveal interesting historical features in addition to whatever scientific content they may have. Wide flat troughs of some waves disclose long periods of apparent stasis above ground. These recall harsh New England winters with the ground heavily laden with snow, bringing a thermal silence to the subterranean landscape. Elsewhere in the record, shown in Figure 10.3, occur two sharp vertical spikes hanging from the troughs of a wave like icicles. These recall relatively brief but intense episodes of rain in which the casing filled with water and sensors rapidly thermalized to the same temperature.
Fig. 10.3 Top panel: large-scale time variation of temperature record x10 (gray), 24-hour time average (thin black) \(\tilde x_{10}\), and 365-day moving average (heavy black) \(\hat x_{10}\); the mean \(\bar x_{10}\) (dashed) is shown as horizontal baseline. Middle panel: small-scale time variation of x10 (solid) with \(\tilde x_{10}\) (dashed) superposed. Bottom panel: large-scale time variation of temperature record x240 (gray), 24-hour time average (heavy black dashed) \(\tilde x_{240}\), and 365-day moving average (heavy black) \(\hat x_{240}\); the mean \(\bar x_{240}\) (dashed) is shown as baseline. At this depth (240 cm) the short-scale record does not show diurnal variations.
Figure 10.3 shows the two records x10(t) and x240(t) in greater detail at both long and short time scales. These two records are particularly interesting because the first is the most sensitive to weather and the second to climate. Superposed over the original hourly record (gray) in the first and third panels are the 24-hour averaged records (black) obtained from the algorithm
\[ \tilde x_d(t) = \frac{1}{24}\sum_{\tau=1}^{24} x_d\big(24(t-1)+\tau\big) \qquad \Big(t = 1, 2, \ldots, \frac{N}{24}\Big) \tag{10.2.1} \]
for depths d = 10 and 240 cm. As seen in the second panel, which shows the time variation of the original record x10(t) (solid trace) and averaged record \(\tilde x_{10}(t)\) (dashed trace) over the course of 10 days, the transformation (10.2.1) has removed all diurnal variation. A corresponding plot for x240(t) and \(\tilde x_{240}(t)\) is not shown because diurnal fluctuations at a depth of 240 cm are so low that there is little difference between the two time series (as is evident from the overlapping traces in the third panel).
The solid black traces nearly parallel to the baselines in the first and third panels are plots of the 365-day moving average calculated from the algorithm
\[ \hat x_d(t) = \frac{1}{365}\sum_{\tau=0}^{364} \tilde x_d(t+\tau) \qquad \Big(t = 1, 2, \ldots, \frac{N}{24}-365\Big). \tag{10.2.2} \]
Because the period (365 days) is an odd number, centering is not necessary. The apparent flatness of the two lines shows that the moving average has removed virtually all annual variation. It may seem, then, from a preliminary examination that if a 24-hour average and a 365-day moving average⁷ have removed all structure from the time series, there is nothing left to explain; but that impression would be mistaken.
We shall begin this study as we have begun previous ones: by examining the autocorrelation and power spectra to see what the eye of analysis can reveal.
⁷ It is worth noting, in case it is not apparent, that the two kinds of averages are structurally different. In effect, the 24-hour average replaces each suite of 24 points by one point, thereby shortening the input series by a factor of 24. In contrast, the 365-day moving average replaces each point by a sum of 365 points, thereby shortening the input series by a length of 365.
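The structural difference noted in the footnote, block averaging versus a sliding window, can be made concrete (an illustrative sketch of the two algorithms, with zero-based indices):

```python
def block_average_24(x):
    """24-point block average in the spirit of Eq. (10.2.1): each day of
    hourly readings becomes one point, shortening the series 24-fold."""
    return [sum(x[24 * t: 24 * (t + 1)]) / 24 for t in range(len(x) // 24)]

def moving_average_365(x_tilde):
    """365-point moving average in the spirit of Eq. (10.2.2): each point
    is replaced by the mean of itself and the following 364 points."""
    return [sum(x_tilde[t: t + 365]) / 365
            for t in range(len(x_tilde) - 364)]
```

Applied in sequence to an hourly record, the first shortens the series by a factor of 24 and the second by a fixed length, exactly as described above.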
Fig. 10.4 Autocorrelation r10(k) vs lag k according to: non-normalized expression (10.3.1) (solid black) and normalized expression (10.3.4) (dashed black). Thin gray trace shows pure cosine function cos(2πk/365).
\[ \tilde r_d(k) = \frac{\displaystyle\sum_{t=1}^{M-k}\big(\tilde x_d(t) - \tilde\mu_d\big)\big(\tilde x_d(t+k) - \tilde\mu_d\big)}{\displaystyle\sum_{t=1}^{M}\big(\tilde x_d(t) - \tilde\mu_d\big)^2}, \tag{10.3.1} \]
where
\[ \tilde\mu_d = \frac{1}{M}\sum_{t=1}^{M}\tilde x_d(t) \tag{10.3.2} \]
is the sample mean. The plot of (10.3.1) for \(\tilde r_{10}(k)\) is shown as the solid black trace in Figure 10.4; a plot of \(\tilde r_{240}(k)\) (not shown) generates a practically identical trace. The trace can be matched nearly perfectly by an exponentially decaying cosine function
\[ \rho(k) = \lambda^{k}\cos(\omega k) \tag{10.3.3} \]
with ω = 2π/365 and λ ≈ 0.9992; slight visual discrepancies (not shown in the figure) between \(\tilde r_{10}(k)\) and (10.3.3) become apparent only around lag 1460.
In contrast to (10.3.1), a plot of the sample autocorrelation (dashed trace)
\[ \tilde r_d(k)_{\mathrm{cor}} = \frac{M}{M-k}\,\tilde r_d(k), \tag{10.3.4} \]
corrected to account for the number of terms in each summation, matches nearly perfectly the superposed cosine function (thin gray trace)
\[ \rho(k)_{\mathrm{cor}} = \cos(\omega k). \tag{10.3.5} \]
Although the numerical difference between λ and 1 is very small, the two theoretical autocorrelations differ significantly at long lags and carry different implications.
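The effect of the M/(M−k) correction in (10.3.4) is easily demonstrated on a synthetic pure cosine, for which the corrected autocorrelation at a lag of one full period returns exactly 1 (an illustrative sketch):

```python
import math

def autocorr(x, k):
    """Sample autocorrelation, Eq. (10.3.1): truncated lag-k sum over
    the mean-adjusted series, normalized by the full sum of squares."""
    M = len(x)
    mu = sum(x) / M
    num = sum((x[t] - mu) * (x[t + k] - mu) for t in range(M - k))
    den = sum((x[t] - mu) ** 2 for t in range(M))
    return num / den

def autocorr_corrected(x, k):
    """Correction of Eq. (10.3.4): rescale by M/(M-k) to compensate
    for the shrinking number of terms in the lag-k sum."""
    M = len(x)
    return M / (M - k) * autocorr(x, k)
```

For ten full periods of cos(2πt/365), the uncorrected value at k = 365 is 0.9 while the corrected value is exactly 1, mirroring the contrast between the solid and dashed traces of Figure 10.4.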
We have seen in the previous chapter that in general the autocorrelation function does not uniquely determine the temporal function that generated the time series. Nevertheless, in the present situation the theoretical time series is uniquely determined from both mathematical and physical circumstances. The temporal function whose autocorrelation is a pure cosine (10.3.5) is, itself, a pure cosine
\[ x_1(t) = \cos(\omega t), \tag{10.3.6} \]
whereas the damped autocorrelation (10.3.3) corresponds instead to a damped cosine with decay constant
\[ \gamma = -2\ln\lambda, \qquad \rho(k) = \lambda^{k}\Big[\cos(\omega k) + \frac{\gamma}{2\omega}\,\sin(\omega k)\Big]. \tag{10.3.7} \]
To analyze the frequency content, form the mean-adjusted series
\[ y_d(t) = x_d(t) - \bar x_d, \tag{10.4.1} \]
where
\[ \bar x_d = \frac{1}{N}\sum_{t=1}^{N} x_d(t) \tag{10.4.2} \]
differs from (10.3.2) in that it is a sum over hourly (not mean daily) observations. The FT of (10.4.1) was calculated from the relations
\[
a_d(j) = \frac{2}{N\big(1+\delta_{j0}\big)}\sum_{t=1}^{N} y_d(t)\cos\!\Big(\frac{2\pi j t}{N}\Big), \qquad
b_d(j) = \frac{2}{N}\sum_{t=1}^{N} y_d(t)\sin\!\Big(\frac{2\pi j t}{N}\Big) \qquad \Big(j = 0, 1, 2, \ldots, \frac{N}{2}\Big)
\tag{10.4.3}
\]
Fig. 10.5 Waves at the fundamental annual frequency (period 1 y) in the Fourier analysis of temperature time series (a) y10, (b) y40, (c) y80, (d) y160, (e) y240. Phase shifts and amplitudes relative to y10 provide information regarding the spatial diffusion of energy through the ground.

given earlier in the book, but repeated here to establish the notation and normalization convention used in this chapter. The symbol \(\delta_{j0}\) is the familiar Kronecker delta. For \(y_d(t)\), the amplitudes \(a_d(0) = b_d(0) = 0\). Later in the chapter, it will also be convenient to make use of the results of the FT in the equivalent form
\[ c_d(j) = \big[a_d(j)^2 + b_d(j)^2\big]^{1/2} \tag{10.4.4} \]
\[ \phi_d(j) = \arctan\!\left(\frac{b_d(j)}{a_d(j)}\right). \tag{10.4.5} \]
Fig. 10.6 Comparison of empirical time series y10 (gray) with Fourier superposition (black) of fundamental waves at periods of one day and one year. Rapid diurnal variation at the time scale of the figure accounts for the thickness of the black trace.
From relation (10.4.5) it follows that the harmonic number of the component at the period \(T_d = 1\ \mathrm{d} = 24\ \mathrm{h}\) is \(j_d = 1095\). The nearest sub-surface waveform
\[
y_{10}(t) \approx a_{10}(3)\cos\!\Big(\frac{2\pi\,(3)\,t}{N}\Big) + b_{10}(3)\sin\!\Big(\frac{2\pi\,(3)\,t}{N}\Big)
+ a_{10}(1095)\cos\!\Big(\frac{2\pi\,(1095)\,t}{N}\Big) + b_{10}(1095)\sin\!\Big(\frac{2\pi\,(1095)\,t}{N}\Big),
\tag{10.4.7}
\]
obtained by superposing the components corresponding to annual and diurnal variations, is plotted (black trace) in Figure 10.6. Although one cannot see directly the rapid 24-hour oscillations at the time scale (years) of the figure, it is evident that this deterministic process, and not just noise due to weather, contributes to the thickness of the empirical record (gray trace).
The power spectral amplitude at any harmonic is the square magnitude of the corresponding Fourier amplitude
\[ S_d(j) = a_d(j)^2 + b_d(j)^2 = c_d(j)^2. \tag{10.4.8} \]
Figure 10.7 shows log S10(j) (black points) obtained by a fast Fourier transform (FFT) of the mean-adjusted series y10 truncated to 2¹⁵ = 32 768 elements. (The input to the FFT algorithm employed by my mathematics software must be an integer power of 2.) The logarithm (to base 10) is plotted to enhance visibility of the most striking feature, which is a series of 11 sharp peaks shown in the upper panel at equally spaced intervals over the harmonic range 0–16 000. The only other statistically significant peak in the spectrum occurs at j = 4, shown in the lower panel, which surveys at higher resolution the harmonic range 0–80.
For a time series of 2¹⁵ elements, a peak at the fundamental period of 1 y or 8760 h would occur at harmonic number 3.74. Harmonic numbers, however, must be integers, and j = 4, corresponding to 8192 h, is the closest achievable approximation.
585
2
0
2
4
6
0
10
12
14
16
6
4
2
0
0
10
20
30
40
50
60
70
80
Harmonic Number
Fig. 10.7 Top panel: power spectrum log S10(j) (black points) as a function of harmonic number j in the range 0–16 000. Peaks occur at frequencies corresponding to the diurnal harmonic series (24 h)/n (n = 1, 2, . . .). Dashed white trace is the least-squares fit (from the double-log plot of Figure 10.9) yielding decay at rate α10 = 2.44. Bottom panel: details of log S10(j) for 0 ≤ j ≤ 80, showing a single peak corresponding to the fundamental period of 1 year.
The word harmonic has two meanings here. Physicists refer to harmonic number as the integer multiple of a fundamental frequency in a Fourier series. In mathematics, the harmonic series is the series of terms 1, 1/2, 1/3, 1/4, . . ..
Table 10.1

Harmonic jd    log S10(jd)    T_exp = 2^15/jd (h)    T_thy = 24/n (h), n = 1, 2, . . .
 1 365          3.619 24       24.006                 24
 2 731          2.076 47       11.999                 12
 4 096          1.255 50        8.000                  8
 5 461          0.434 81        6.000                  6
 6 830         −0.597 78        4.798                  4.8
 8 196         −0.962 25        3.998                  4
 9 561         −1.178 79        3.428                  3.429
10 926         −1.619 40        2.999                  3
12 288         −1.773 12        2.667                  2.667
13 653         −2.133 40        2.400                  2.4
15 026         −2.094 70        2.181                  2.181
There is no reason to believe the series abruptly terminates; rather, with increasing
n the power spectral amplitudes merge with the noise. If such a sequence were to
occur for the annual period, there would be peaks at j = 4, 8, 12, 16 . . ., but only the
fundamental is apparent in the lower panel of the figure. Evidently, the process
generating the harmonic series of peaks involves the rotation of the Earth on its
axis, but not its revolution about the Sun.
Although one expects the power spectrum S10 to reveal periodicities of one year and one day, the question arises as to why frequencies corresponding to higher harmonics of one day also occur. Clearly, such peaks are not noise, but arise from a deterministic process of sharp regularity. I will offer an explanation shortly, but for the present
it is of interest to explore further both the signal and the noise within the S10 and S240
power spectra.
To begin, consider how the power S10 (jd) in the diurnal peaks, recorded in
Table 10.1, relates to harmonic number jd. Figure 10.8 shows that a plot (small circles) of log S10(jd) against log jd generates an unambiguously straight-line pattern with negative slope $-\alpha_{10}^{(d)}$. (The superscript d stands for day or diurnal.) The magnitude of the slope and associated uncertainty, obtained by a linear least-squares fit, is

$$\alpha_{10}^{(d)} = 5.82 \pm 0.55. \qquad (10.4.9)$$

The implication of the plot is that the power spectral amplitude of the process generating the series of diurnal peaks has a power-law dependence on frequency

$$S_{10}(\omega_d) \propto \omega_d^{-\alpha_{10}^{(d)}}. \qquad (10.4.10)$$
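The least-squares extraction of a power-law exponent from a double-log plot, as in (10.4.9), can be illustrated with synthetic data (the harmonic numbers are those of Table 10.1; the noise level and intercept are assumptions made for the sketch):

```python
import numpy as np

# Simulate spectral amplitudes that decay as S(j) ~ j^(-alpha) with a little
# scatter, then recover alpha from the slope of log S against log j.
rng = np.random.default_rng(0)
alpha_true = 5.82                      # slope reported in (10.4.9)
j = np.array([1365, 2731, 4096, 5461, 6830, 8196,
              9561, 10926, 12288, 13653, 15026])
logS = -alpha_true * np.log10(j) + 15.0 + rng.normal(0, 0.2, j.size)

# Linear least-squares fit; the magnitude of the slope estimates alpha.
slope, intercept = np.polyfit(np.log10(j), logS, 1)
alpha_est = -slope
print(abs(alpha_est - alpha_true) < 1.0)   # True: recovered within the noise
```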
Consider, next, the plots (gray) of log Sd (j) against log j in Figure 10.9 for depths
d = 10 cm (upper panel) and 240 cm (lower panel). These plots include the entire
power spectrum, most points of which comprise noise rather than signal (i.e. regularly spaced peaks). Since the sensor at 240 cm is largely shielded from diurnal events at the surface, one would not expect to find a series of diurnal harmonics like those appearing in the upper panel, and indeed S240(j) shows none. The double-log plots
again reveal linear patterns, suggesting a power-law dependence of spectral power on
frequency. The respective slopes, extracted by least-squares analysis, were found to be
$$\alpha_{10} = 2.44 \pm 0.11, \qquad (10.4.11)$$
$$\alpha_{240} = \ldots\,. \qquad (10.4.12)$$
The corresponding lines of regression are shown as black traces in the double-log
plots of Figure 10.9 and as a dashed white trace in the single-log plot of
Figure 10.7.
To put such numbers in perspective, recall that the exponent α in a power-law dependence S(ω) ∝ ω^(−α) is a measure of the kind of randomness displayed by the process generating the power spectrum. The larger the value of α, the less stochastic and more deterministic the process appears to be. We have seen that α ≈ 0 for white noise, characteristic of the decay of radioactive nuclei and to good approximation of the daily change in share price of stocks in a stock market. This is a process in which future events are entirely unpredictable over the time interval between measurements or observations. We have also encountered α ≈ 1.8 for a process describable by Brownian noise, which, as an example of diffusion, is more predictable. If points within the spectral regions of the diurnal peaks were excised from S10 in the upper panel of Figure 10.9, the slope (10.4.11) would become much closer to (10.4.12) of S240.
That the noise in the power spectra Sd (j) corresponds to a diffusive process is not a
coincidence, as we shall see in the next section.
Return now to the question: what is responsible for occurrence of a harmonic
series of peaks in the temperature record x10, rather than just a single peak at the
period 24 h? To be surprised at the occurrence of these other peaks is to have
assumed that, because the Earth rotates at a period of 24 h, the heating of the ground
must vary periodically in the same way. However, this correspondence does not hold
for at least two reasons. First, except for two times of the year (the autumnal and vernal equinoxes), the duration of daylight (when the Earth takes in solar energy) is
not equal to the duration of darkness (when the Earth cools). And second, even if the
periods of daylight and darkness were equal, the processes by which the ground cools
differ from those by which the ground is heated. Thus, although the Earth may rotate
on its axis once every 24 h,9 the thermodynamic processes of heating and cooling do
not occur symmetrically over that period.
9 Although not pertinent to the present discussion, it should be noted for accuracy that the motion of the Earth about its
axis is more complicated than simple rotation. Owing to the non-spherical distribution of mass, the principal moments
of inertia are not all equal, and the axis of the Earth undergoes a low-amplitude precession (Chandler wobble), tracing
out a circle about the North Pole every 400 days. See, for example, H. Goldstein, Classical Mechanics, 2nd Edition
(Addison-Wesley, 1980) 212.
An examination of the detailed physical processes by which the Earth heats and
cools is an undertaking beyond the intention of this chapter. Rather, the objective is
to account for the observed periodic forcings in the power spectrum by a simple
empirical model that incorporates the effects of these processes phenomenologically. As noted at the beginning, the ground is heated during the daylight hours by
absorbing short-wavelength radiation from the Sun and long-wavelength radiation
from the atmosphere. After sunset, there is still an exchange of energy with the
atmosphere; ordinarily, it is the ground that cools the faster. The asymmetry of
these processes can be represented by a piecewise periodic forcing function of the
form

$$f(t) = \begin{cases} \cos(\Omega t - \varphi) & \text{if } \cos(\Omega t - \varphi) \ge \varepsilon_0 \\ \varepsilon_1 \cos(\Omega t - \varphi) + \varepsilon_2 & \text{otherwise} \end{cases} \qquad (10.4.13)$$

in which Ω = 2π/24 h⁻¹ is the angular frequency of Earth's rotation, φ is the phase shift needed to establish a common time origin with the empirical time series, ε0 ≥ 0 is a shift parameter that determines the fraction of daylight during the 24-h period, and ε1, ε2 are parameters that characterize the after-sunset cooling processes.
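A minimal numerical sketch of such a piecewise forcing, assuming one plausible reading of the night-time branch of (10.4.13) and illustrative (not fitted) parameter values, confirms that any periodic but non-sinusoidal 24 h waveform feeds power into the harmonics (24 h)/n:

```python
import numpy as np

# Piecewise day/night forcing; parameter values below are illustrative only.
Omega = 2 * np.pi / 24.0                # Earth's rotational angular frequency (h^-1)
phi, eps0, eps1, eps2 = 0.0, 0.2, 0.4, -0.3

t = np.arange(0, 24 * 365, 1.0)         # one year of hourly samples
c = np.cos(Omega * t - phi)
f = np.where(c >= eps0, c, eps1 * c + eps2)

# Check via the FFT that power appears at 12 h as well as at 24 h.
F = np.abs(np.fft.rfft(f - f.mean()))
j24 = len(t) // 24                      # harmonic of the 24 h fundamental: 365
print(F[j24] > 10 * np.median(F))       # True: strong 24 h peak
print(F[2 * j24] > 10 * np.median(F))   # True: first harmonic at 12 h
```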
The numerical values of the parameters in (10.4.13) are determined by adjusting
the power spectrum of f (t) to agree with S10 (jd). The upper panel of Figure 10.10
shows the match obtained (gray trace) for the observed harmonic series of 11 peaks
(black points). As a check, the resulting time series f (t) in the lower panel is superposed (dashed black trace) over the temporal function (gray trace) due solely to the
observed diurnal harmonics
$$y_{10}^{(d)}(t) = \sum_{j_d} \left[ a_{10}(j_d)\cos\Big(\frac{2\pi j_d t}{N}\Big) + b_{10}(j_d)\sin\Big(\frac{2\pi j_d t}{N}\Big) \right]. \qquad (10.4.14)$$

The comparison, shown for a 96 h period, is very close apart from a characteristic distortion of (10.4.14) from a pure sinusoidal waveform, since $y_{10}^{(d)}(t)$ is, after all, a superposition of 11 harmonics. In a more comprehensive model, which is
not given here, the variation in the forcing f(t) due to ellipticity of the Earth's orbit is also accounted for. The function (10.4.13) would then also contain a harmonic contribution at the angular frequency of the Earth's revolution Ω_y = 2π/8760 h⁻¹.
Fig. 10.10 Top panel: comparison of S10 (jd) of diurnal harmonics (black dots) and the power
spectrum of empirical forcing function f (t) (gray) in (10.4.13). Bottom panel: comparison of
corresponding time variation of f (t) (dashed) and the contribution (10.4.14) to x10 (t) due solely
to the diurnal harmonics (gray).
10 W. T. Grandy, Entropy and the Time Evolution of Macroscopic Systems (Oxford, 2008) 80.
Fig. 10.11 Schematic diagram of heat diffusion through a layer with temperature difference dT
between the top and bottom surfaces. j (x, t) is the flux of heat energy at horizontal level x at
time t.
The flux of heat energy through the layer is proportional to the temperature gradient

$$j(x,t) = -\kappa_T\, \frac{\partial T(x,t)}{\partial x}. \qquad (10.5.1)$$

The heat energy dq accumulated in time dt by a layer of thickness dx (per unit area) is the difference of the fluxes through its faces

$$dq = \big[\, j(x,t) - j(x+dx,t) \,\big]\, dt \qquad (10.5.2)$$

and is also expressible in terms of the temperature rise dT of the layer

$$dq = c\rho\, dx\, dT \qquad (10.5.3)$$

in terms of the specific heat capacity c and mass density ρ of the material of the layer. Equating (10.5.2) and (10.5.3) leads (by standard limit-taking procedures of differential calculus) to the relation

$$\frac{\partial T(x,t)}{\partial t} = -\frac{1}{c\rho}\,\frac{\partial j(x,t)}{\partial x}, \qquad (10.5.4)$$

and substitution of (10.5.1) then yields

$$\frac{\partial T(x,t)}{\partial t} = D\,\frac{\partial^2 T(x,t)}{\partial x^2} \qquad (10.5.5)$$

for the variation of temperature in space and time in terms of a single parameter, the thermal diffusion constant or thermal diffusivity

$$D = \frac{\kappa_T}{c\rho}. \qquad (10.5.6)$$
To help convey the physical meaning of the four quantities in (10.5.6), it is useful to note explicitly their dimensions and units (in the MKS system). These are

mass density ρ = mass/volume → kg/m³
specific heat capacity c = energy/(mass × temperature difference) → J/(kg K)
thermal conductivity κ_T = (energy/time)/(length × temperature difference) → W/(m K)
thermal diffusivity D = length²/time → m²/s.   (10.5.7)
A differential equation of the form (10.5.5), first order in time and second order in space, is a diffusion equation, familiar to physicists. We need to solve this
equation in order to make sense of the pattern of waveforms in Figure 10.2 or
Figure 10.5 in particular, to understand the relative amplitudes and phase shifts
(i.e. peak locations) of the records from the different sensors. Also, once the solution
T (x, t) is known for temperature, then it is a simple matter of spatial differentiation to
arrive at an expression for the energy flux j(x, t). This heat flow into the ground is part
of the energy balance at the surface, which, together with incident solar radiation,
also includes the transfer of heat between ground and atmosphere by conduction
(sensible heat) and the release or absorption of heat resulting from a phase change
(latent heat)
$$\begin{pmatrix}\text{Heat flux}\\ \text{into ground}\end{pmatrix} = \begin{pmatrix}\text{Incident radiation}\\ \text{at ground surface}\end{pmatrix} - \begin{pmatrix}\text{Sensible}\\ \text{heat flux}\end{pmatrix} - \begin{pmatrix}\text{Latent}\\ \text{heat flux}\end{pmatrix}. \qquad (10.5.8)$$
The problem of energy flows between ground and atmosphere and within
the atmosphere, as stated previously, is an undertaking outside the scope of this
chapter.
We turn next to solving the heat equation (10.5.5). To transform from a differential equation in two variables to a differential equation in one variable, express T(x,t) as a Fourier integral

$$T(x,t) = \int_{-\infty}^{\infty} \tilde T(x,\omega)\, e^{i\omega t}\, d\omega = 2\,\mathrm{Re}\int_0^{\infty} \tilde T(x,\omega)\, e^{i\omega t}\, d\omega. \qquad (10.5.9)$$
The transition from the first line to the second requires the identity
$$\tilde T(x,-\omega) = \tilde T(x,\omega)^{*}, \qquad (10.5.10)$$
which must hold if the temperature T (x, t) is to be a real-valued function. The utility
of the second expression in (10.5.9) is that one need find a solution only for
$$\frac{\partial^2 \tilde T(x,\omega)}{\partial x^2} - \frac{i\omega}{D}\,\tilde T(x,\omega) = 0, \qquad (10.5.11)$$

recognizable as the Helmholtz equation, which turns up ubiquitously in physics, particularly in the study of physical optics and acoustics. The solution for non-negative ω, readily verified by substitution into (10.5.11), has the form

$$\tilde T(x,\omega) = \tilde T(\omega)\, \exp\!\left(-\sqrt{\frac{i\omega}{D}}\; x\right) \qquad \omega > 0. \qquad (10.5.12)$$
In the solution to the Helmholtz equation (or to the wave equation from which the
Helmholtz equation is often derived), the factor that multiplies the spatial variable x
in the argument of the complex exponential is referred to as the wave number, a
factor that is real-valued for traveling waves. In (10.5.12), however, the wave number
is complex, the implication of which can be seen from the use of Euler's relation¹¹ to express the square root of the unit imaginary as a sum of real and imaginary parts

$$\sqrt{i} = e^{i\pi/4} = \cos\frac{\pi}{4} + i\sin\frac{\pi}{4} = \frac{1+i}{\sqrt{2}}. \qquad (10.5.13)$$
Inserting (10.5.13) into (10.5.12) yields

$$\tilde T(x,\omega) = \tilde T(\omega)\, e^{-\sqrt{\omega/2D}\,x}\, e^{-i\sqrt{\omega/2D}\,x} \qquad (10.5.14)$$

and therefore the real-valued temperature field

$$T(x,t) = 2\int_0^{\infty} \tilde T(\omega)\, e^{-\sqrt{\omega/2D}\,x} \cos\!\left(\omega t - \sqrt{\frac{\omega}{2D}}\,x\right) d\omega, \qquad (10.5.15)$$

subject to the boundary conditions

$$T(0,t) \equiv T_s(t) = 2\int_0^{\infty} \tilde T(\omega)\cos(\omega t)\, d\omega \qquad (10.5.16)$$
$$T(\infty,t) \to 0.$$

If the surface temperature Ts(t) is known, the function T̃(ω) is then derivable by an inverse Fourier transform

$$\tilde T(\omega) = \frac{1}{\pi}\int_{-\infty}^{\infty} T_s(t)\cos(\omega t)\, dt. \qquad (10.5.17)$$

11 e^{iθ} = cos θ + i sin θ.
If the surface temperature is represented as a superposition of oscillations at angular frequencies ωn,

$$T_s(t) = \bar T_0 + \sum_n \theta_n \cos(\omega_n t - \varphi_n), \qquad (10.5.18)$$

the foregoing relations lead to

$$T(x,t) = \bar T_0 + \sum_n \theta_n\, e^{-\sqrt{\omega_n/2D}\,x}\, \cos\!\left(\omega_n t - \sqrt{\frac{\omega_n}{2D}}\,x - \varphi_n\right). \qquad (10.5.19)$$

For the special case of a harmonic series of frequencies ωn = nω0 (10.5.20), this becomes

$$T(x,t) = \bar T_0 + \sum_n \theta_n\, e^{-\sqrt{n\omega_0/2D}\,x}\, \cos\!\left(n\omega_0 t - \sqrt{\frac{n\omega_0}{2D}}\,x - \varphi_n\right). \qquad (10.5.21)$$
The solutions obtained above describe temperatures that oscillate in time and decay
exponentially with depth. We will apply these wavelike solutions to the subterranean
temperature series shortly, but first it is instructive to examine an approach to the diffusion equation that is obtained in a different way, has a different form, and a different interpretation.
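As a quick numerical illustration of these wavelike solutions, assuming the diffusivity fitted later in (10.7.6), the attenuation and delay of the annual component between the 10 cm and 240 cm sensors follow directly from the factor √(ω/2D):

```python
import numpy as np

# Damped thermal wave of the form (10.5.19): at depth x the annual component is
# attenuated by exp(-k x) and delayed by k x / omega, with k = sqrt(omega/2D).
D = 4.86e-7                      # thermal diffusivity (m^2/s), from (10.7.6)
T_year = 365 * 24 * 3600.0       # annual period (s)
omega = 2 * np.pi / T_year
k = np.sqrt(omega / (2 * D))     # real part of the complex wave number (m^-1)

dx = 2.40 - 0.10                 # separation of the two sensors (m)
attenuation = np.exp(-k * dx)    # amplitude ratio, 240 cm relative to 10 cm
lag_days = k * dx / omega / 86400.0   # phase delay converted to days

print(round(attenuation, 2))     # 0.35: the annual swing at 240 cm is about a third
print(round(lag_days))           # 60: roughly a two-month delay of the annual peak
```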
$$x_{t+1} = \phi_1 x_t + \varepsilon_{t+1}(0, \sigma^2) \qquad (10.6.2)$$

in which φ1 = 1 for Brownian noise, and the error term ε_{t+1} is explicitly shown to be a normal variate of mean 0 and variance σ² associated with the time t+1. Thus x_t and ε_{t+1} are independent and uncorrelated. The two equations, (10.6.1) and (10.6.2), have superficially similar forms if one associates 2D dt of the stochastic differential equation with the variance σ² of residuals of the finite difference equation. However, it is √dt, not dt, that enters the stochastic equation. Since the square root of a differential quantity is not usually encountered in elementary calculus, one may well wonder how to work with it.
In contrast to the diffusion equation (10.5.5) whose solution is a function, the
solution to the Wiener process (10.6.1) is a random variable or, equivalently, a
probability distribution. The equation is solvable by a method very similar to the
one employed in Chapter 6 to solve the AR(1) process (see Eq. (6.6.5)). To simplify
notation in (10.6.1), define

$$\varepsilon(t) \equiv \sqrt{2D\,dt}\; N_{t \to t+dt}(0,1) \qquad (10.6.3)$$

and examine the values of X(t) iteratively, starting with t = 0:

t = 0:     X(dt) = X(0) + ε(0)
t = dt:    X(2dt) = X(dt) + ε(dt) = X(0) + ε(0) + ε(dt)
t = 2dt:   X(3dt) = X(2dt) + ε(2dt) = X(0) + ε(0) + ε(dt) + ε(2dt)
. . .
t = ndt:   X((n+1)dt) = X(ndt) + ε(ndt) = X(0) + Σ_{k=0}^{n} ε(k dt).   (10.6.4)
The sum of variates in the last line of (10.6.4) reduces to a single normal variate

$$\sum_{k=0}^{n} \varepsilon(k\,dt) = \sqrt{2D\,dt}\,\big[N_0(0,1) + N_{dt}(0,1) + N_{2dt}(0,1) + \cdots + N_{ndt}(0,1)\big] = N_t(0,\, 2Dt) \qquad (10.6.5)$$

of variance 2Dt at the finite time $t = \lim_{dt\to 0,\, n\to\infty} n\,dt$, and the solution to (10.6.1) is therefore

$$X(t) = X(0) + N_t(0,\, 2Dt). \qquad (10.6.6)$$
As shown in Chapter 6, the normal probability density function (pdf) associated with the distribution (10.6.6) is

$$p_X(x,t) = \frac{1}{\sqrt{4\pi D t}}\; e^{-(x-x_0)^2/4Dt}, \qquad (10.6.7)$$

which satisfies the diffusion equation

$$\frac{\partial p_X(x,t)}{\partial t} = D\,\frac{\partial^2 p_X(x,t)}{\partial x^2}. \qquad (10.6.8)$$
There is a close connection between the stochastic approach to diffusion and the deterministic approach of the preceding section, which may be seen by solving the thermal diffusion equation (10.5.5) in a different way. Instead of a Fourier transform (10.5.9) with respect to the time variable (i.e. integration over angular frequency), express the solution T(x,t) as a Fourier transform with respect to spatial coordinate (i.e. integration over wave number)

$$T(x,t) = \int_{-\infty}^{\infty} \tilde T(k)\, e^{i(kx - \omega(k)t)}\, dk. \qquad (10.6.9)$$

The dispersion relation, i.e. the explicit k-dependence of ω(k), will be found in the course of the analysis.
For this method of solution, the initial condition

$$T(x,0) \equiv T_i(x) = \int_{-\infty}^{\infty} \tilde T(k)\, e^{ikx}\, dk, \qquad (10.6.10)$$

rather than the surface boundary condition Ts(t), is required. Substitution of the inverse transform of (10.6.10)

$$\tilde T(k) = \frac{1}{2\pi}\int_{-\infty}^{\infty} T_i(x')\, e^{-ikx'}\, dx' \qquad (10.6.11)$$

into (10.6.9) leads to

$$T(x,t) = \frac{1}{2\pi}\int\!\!\int T_i(x')\, e^{ik(x-x') - i\omega(k)t}\, dk\, dx'. \qquad (10.6.12)$$
Operating on (10.6.12) with the time and spatial derivatives of the diffusion equation (10.5.5) yields an equation

$$\omega(k) = -iDk^2, \qquad (10.6.13)$$

which furnishes the requisite dispersion relation. To this point, then, the sought-for solution takes the form

$$T(x,t) = \frac{1}{2\pi}\int\!\!\int T_i(x')\, e^{ik(x-x') - Dk^2 t}\, dk\, dx'. \qquad (10.6.15)$$
Integration over wave number in (10.6.15) by completing the square in the exponent is the final step to obtaining a solution

$$T(x,t) = \int_{-\infty}^{\infty} T_i(x')\, \frac{e^{-(x-x')^2/4Dt}}{\sqrt{4\pi D t}}\, dx'. \qquad (10.6.16)$$
important quantity because of the seminal contribution of thermal diffusion to the net energy balance at the Earth's surface as indicated schematically in (10.5.8).
Re-examining Figure 10.5 in the light of the solution (10.5.19) applied to the single (annual) fundamental frequency ω = 2π/T_y = 7.1726 × 10⁻⁴ h⁻¹

$$T(x,t) = \theta_1\, e^{-\sqrt{\omega/2D}\,x}\, \cos\!\left(\omega t - \sqrt{\frac{\omega}{2D}}\,x - \varphi_1\right) \qquad (10.7.1)$$
suggests that there are two independent ways to obtain D from the time series of
temperatures at depth d relative to the measurements made at depth 10 cm: (a) by
phase shift and (b) by amplitude attenuation.
The peak of the thermal wave advances at the speed

$$v = \frac{\omega}{k}, \qquad (10.7.2)$$

equivalent in form to a phase velocity. Replacement of the wave number by k = √(ω/2D), followed by some algebraic manipulations, leads to the phase-based value for D

$$D_p = \frac{T_y}{4\pi}\left(\frac{x_2 - x_1}{t_2 - t_1}\right)^2 \qquad (10.7.3)$$

in terms of the time interval between temperature peaks recorded by two sensors at different depths. The result (10.7.3) does not depend on the actual temperatures. The attenuation of the peak amplitudes with depth leads to the amplitude-based value

$$D_a = \frac{\pi\,(x_2 - x_1)^2}{T_y\,\big[\ln T_{\max}(x_1) - \ln T_{\max}(x_2)\big]^2}. \qquad (10.7.5)$$
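The two estimators can be checked for mutual consistency: forward-model the annual phase lag and amplitude attenuation from an assumed D, then invert by the phase route and the amplitude route. A sketch, using the fitted value of (10.7.6) as the assumed diffusivity:

```python
import numpy as np

D_true = 4.86e-7                     # m^2/s, from (10.7.6)
T_y = 365 * 24 * 3600.0              # annual period (s)
omega = 2 * np.pi / T_y
k = np.sqrt(omega / (2 * D_true))

x1, x2 = 0.10, 2.40                  # sensor depths (m)
dt_peaks = k * (x2 - x1) / omega     # time between annual peaks (s)
amp_ratio = np.exp(-k * (x2 - x1))   # Tmax(x2)/Tmax(x1)

# Phase-based estimate, as in (10.7.3)
D_p = (T_y / (4 * np.pi)) * ((x2 - x1) / dt_peaks) ** 2
# Amplitude-based estimate, as in (10.7.5)
D_a = np.pi * (x2 - x1) ** 2 / (T_y * np.log(1 / amp_ratio) ** 2)

print(abs(D_p - D_true) / D_true < 1e-9)   # True: phase route recovers D
print(abs(D_a - D_true) / D_true < 1e-9)   # True: amplitude route recovers D
```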
Table 10.2 summarizes the estimates obtained for D by both methods in which
coordinates (x1, t1) in relations (10.7.3) and (10.7.5) pertain to the temperature sensor
Table 10.2

Depth (cm)    Dp (m²/s)          Da (m²/s)
10            ......             ......
20            4.1877 × 10⁻⁷      4.5694 × 10⁻⁷
40            4.3569 × 10⁻⁷      4.7359 × 10⁻⁷
80            3.3906 × 10⁻⁷      3.6500 × 10⁻⁷
160           4.7672 × 10⁻⁷      5.3189 × 10⁻⁷
240           6.4114 × 10⁻⁷      7.2372 × 10⁻⁷
Mean          Dp = 4.623 × 10⁻⁷  Da = 5.102 × 10⁻⁷
Av            D = (4.86 ± 0.49) × 10⁻⁷
Table 10.3

Component              ρ (kg/m³)   c (J/(kg K))   κT (W/(m K))   Diffusion coefficient D (m²/s)
Soil minerals          2650        870            2.5            1.1 × 10⁻⁶
Soil organic matter    1300        1920           0.25           1.0 × 10⁻⁷
Water                  1000        4180           0.56           1.34 × 10⁻⁷
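The diffusion coefficients in Table 10.3 follow directly from D = κT/cρ of (10.5.6); a few lines verify the tabulated values:

```python
# Diffusivity from material constants via D = kappa / (rho * c) of (10.5.6).
# Tuples are (rho in kg/m^3, c in J/(kg K), kappa in W/(m K)), as in Table 10.3.
materials = {
    "soil minerals":       (2650, 870, 2.5),
    "soil organic matter": (1300, 1920, 0.25),
    "water":               (1000, 4180, 0.56),
}
for name, (rho, c, kappa) in materials.items():
    D = kappa / (rho * c)
    print(f"{name}: D = {D:.2e} m^2/s")
# soil minerals: D = 1.08e-06 m^2/s
# soil organic matter: D = 1.00e-07 m^2/s
# water: D = 1.34e-07 m^2/s
```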
at depth 10 cm, and (x2, t2) refer in sequence to the records from the other sensors.
Numerical precision to four decimal places was retained for calculation, but the final
result
$$D = (4.86 \pm 0.49) \times 10^{-7}\ \mathrm{m^2/s} \qquad (10.7.6)$$
12 G. S. Campbell and J. M. Norman, An Introduction to Environmental Biophysics, 2nd Edition (Springer, 1998) 118.
A more critical test, not only of the numerical value of D, but of the diffusion
model for solar energy transport into the ground, is the prediction of the temperature
record xd (t) at depth d from the record x10 (t) of the 10-cm sensor. The theoretical
temperature function, based on special cases (10.5.19) and (10.5.21), can be written as

$$\begin{aligned} T(x,t) = {} & \theta_1 \exp\!\left(-\sqrt{\frac{\pi}{D T_y}}\,(x-10)\right) \cos\!\left(\frac{2\pi t}{T_y} - \sqrt{\frac{\pi}{D T_y}}\,(x-10) - \varphi_1\right) \\ & + \theta_2 \exp\!\left(-\sqrt{\frac{\pi}{D T_d}}\,(x-10)\right) \cos\!\left(\frac{2\pi t}{T_d} - \sqrt{\frac{\pi}{D T_d}}\,(x-10) - \varphi_2\right) \\ & + \theta_3 \exp\!\left(-\sqrt{\frac{2\pi}{D T_d}}\,(x-10)\right) \cos\!\left(\frac{4\pi t}{T_d} - \sqrt{\frac{2\pi}{D T_d}}\,(x-10) - \varphi_3\right), \end{aligned} \qquad (10.7.7)$$
where I have included contributions at the fundamental periods of 365 days and
24 hours and the first diurnal harmonic at 12 hours. The parameters (amplitudes
and phases)
θ1 = c10(3) = 11.562      φ1 = 0.864
θ2 = c10(1095) = 0.900    φ2 = 0.594
θ3 = c10(2190) = 0.162    φ3 = 0.981        (10.7.8)
utilized in (10.7.7) come from the Fourier analysis of the mean-adjusted three-year
temperature record y10 (t).
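A sketch of the three-component prediction, assuming the phase-sign convention of (10.5.19) and treating the (10.7.8) amplitudes and phases as given (their signs are not certain from the printed values):

```python
import numpy as np

D = 4.86e-7 * 3600.0            # diffusivity (10.7.6) converted to m^2/h
T_y, T_d = 8760.0, 24.0         # annual and diurnal periods (h)
amps = [11.562, 0.900, 0.162]   # theta_1..theta_3 from (10.7.8)
phases = [0.864, 0.594, 0.981]  # phi_1..phi_3 from (10.7.8), signs assumed
periods = [T_y, T_d, T_d / 2]

def T_model(x_cm, t):           # depth in cm, time in hours; reference depth 10 cm
    dx = (x_cm - 10) / 100.0    # metres below the 10 cm sensor
    out = 0.0
    for A, ph, P in zip(amps, phases, periods):
        k = np.sqrt(np.pi / (D * P))          # sqrt(omega/2D) for period P
        out += A * np.exp(-k * dx) * np.cos(2 * np.pi * t / P - k * dx - ph)
    return out

# The diurnal terms are damped to insignificance at 240 cm, as observed:
k_d = np.sqrt(np.pi / (D * T_d))
print(0.900 * np.exp(-k_d * 2.3) < 1e-6)      # True
```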
How well the theory predicts the record y240 (t) based on the parameters of the
record y10 (t) is shown in Figure 10.12. Noisy and sharp gray traces respectively
depict the empirical y10 (t) and y240 (t) temperature records; solid and dashed black
Fig. 10.12 Empirical temperature records y10 (noisy gray) and y240 (sharp gray) and temperature
records y10 (solid black) and y240 (dashed black) predicted from the one-dimensional diffusion
equation.
traces respectively depict the theoretical y10 (t) and y240 (t) waveforms. (Note again
that the apparent thickness of the theoretical y10 (t) waveform is due to the diurnal
oscillations, which are absent from the y240 (t) record.) The excellent agreement
between observation and theory in amplitude and phase suggests that the
deterministic features of the temperature records are well accounted for by the
one-dimensional diffusion model.
There is a further test to which the theory can be subjected which brings out, as a
byproduct, a feature of the environment that may be surprising to those who work mainly indoors (as I do) and do not pay much attention to the energy flow outside. Having determined the soil diffusivity D, we can estimate the delay time between
peak temperature at a depth of 10 cm and at the surface. At the interface between the
ground and the air, it is the ground that directly absorbs solar radiation and subsequently heats the layer of air close to the ground, but the delay time is much shorter
than the time required for heat to conduct through the soil to a depth of 10 cm. It is
of interest, therefore, to compare empirical and predicted lag times between peak air
temperature at the surface and peak ground temperature at a depth of 10 cm.
Figure 10.13 shows the variation in time of the surface temperature (gray trace)
during 10 days in July 2007 and the corresponding subterranean temperature record
x10 (t) (black trace). As expected, peaks in the surface temperature occur before peaks
underground, the mean lag time being estimated at
$$\tau_{\rm obs} \approx 3\tfrac{1}{2}\ \text{h}. \qquad (10.7.9)$$
Fig. 10.13 Variation in temperature with time at the surface (gray) and at a depth below
ground of 10 cm (black) for the period 13–31 July 2007. Mean lag time is approximately 3½ h.
Surface temperatures peak at close to 14:00 h local time.
Serendipity is the circumstance of discovering something one is not in quest of. For
example, in the investigation of the previous chapter I did not set out to test whether
the electric meter at my home was accurate and then discover that the power
company could conceivably make a fortune by distributing meters with slow drift.
Rather, I was interested at the outset only in finding a simple statistical model that
described residential electric energy usage. A similar statement could be made for the
research in the present chapter as well. I undertook the project with no agenda other
than to see how temperature varied at different depths below ground and find a
13 See, for example, (1) The Diurnal Cycle by the National Climatic Data Center, of the National Oceanic and Atmospheric Administration, http://www.ncdc.noaa.gov/paleo/ctl/clisci0.html; and (2) P. C. Knappenberger, P. J. Michaels, and P. D. Schwartzman, Observed changes in the diurnal temperature and dewpoint cycles across the United States, Geophysical Research Letters 23 (1996) 2637–2640.
14 Lines from the song Bad Moon Rising.
Fig. 10.14 The 365-day moving average (solid) of the mean daily x240 temperature and the
least-squares line of regression (dashed) with slope of 0.28 ± 0.0032 °C/y.
simple way to model that behavior mathematically. However, it has been my experience over many years that researches undertaken purely out of curiosity (. . . in contrast to the serious work for which one has to spend hours writing grant proposals . . .) often lead to unexpected and significant outcomes.
Let us return to the third panel of Figure 10.3, a plot of the temperature (x240)
recorded at 240 cm below ground at which depth the vicissitudes of the weather are
nearly entirely damped out. Besides the features of the plot described previously,
there is a long, flat, horizontal black line reclining below the dashed baseline like a
snake sleeping under a log. This line is the 365-day moving average (MA-x240) of the
daily-averaged hourly x240 temperature record. As mentioned previously, the flatness
of the record after removal of 24-hour and 365-day periodicities indicates that all
significant time variations in the x240 time series have been accounted for. Nevertheless, far from being of little interest, the MA-x240 time series has its own story to tell.
Figure 10.14 presents the MA-x240 record at higher resolution (black trace)
together with the least-squares line of regression (dashed). The time interval covered
is approximately 5.4 years (shorter by 365 days on the graph because of the moving
average) from 2007 to 2012, during which the averaged temperature has been
increasing at a rate of approximately
$$0.28 \pm 0.0032\ {}^\circ\mathrm{C/y}, \qquad (10.8.1)$$
determined from the slope of the line of regression. To put this number in perspective, there are several comparisons one can make.
Numerous studies employing different measurement techniques have documented the variation in the mean temperature of the Earth's surface. Although at one time a highly contentious issue, more among policy makers than among the scientists who actually performed such studies, the fact that the planet is warming anomalously is
now beyond dispute.15 The precise rate of change may vary somewhat depending on
the source of the information, but most numbers I have seen are close to the figure

(Global mean annual temperature rise over the past 30 years) ≈ 0.02 °C/y   (10.8.2)

reported by the US National Research Council.16 There is no error in decimal point;
the rate (10.8.1) is about 14 times the rate (10.8.2).
The objective in measuring the mean global temperature rise is to have a background figure that is largely independent of the wide variation in local heat distribution. As such, methodology is sought that specifically avoids the heat-island effect,
the phenomenon that urban areas are ordinarily hotter than rural ones by virtue of
fewer trees to provide shade and capture moisture, more asphalt and cement to
absorb heat and produce run-off rather than facilitate ground absorption of precipitation, among other reasons. However, were it simply a matter of a more or less fixed
temperature difference, the heat-island effect would not affect the rate of temperature
increase.
Perhaps a more compatible comparison would be regional, rather than global.
According to a report17 of the Union of Concerned Scientists, the mean temperature
increase for the US Northeast is

(US Northeast annual temperature rise, 1970–2002) ≈ 0.028 °C/y.   (10.8.3)
There is still no decimal point error; the rate (10.8.1) is ten times the rate (10.8.3).
Since my analysis of the underground temperature data covered a relatively short
time span of half a decade, I sought to determine whether a longer history of
temperature measurements would also reveal so striking an anomaly. The upper
panel of Figure 10.15 shows the original time series of above-ground measurements
of the average July temperature for the City of Hartford, in which the college campus
is located, spanning five decades from about 1960 to 2012.18 The data were collected
at the Hartford-Brainard Airport, a distance of about 6 km from where my underground temperature measurements were made. In the lower panel is the time series
resulting from a 10-year moving average, performed to suppress fluctuations and
obtain a graph for comparison with decadal figures reported in the literature.
15 References in the scientific literature and popular news media are legion, but the following news article captures the basic sentiment: J. Gillis, Global Temperatures Highest in 4,000 Years, New York Times (7 March 2013), http://www.nytimes.com/2013/03/08/science/earth/global-temperatures-highest-in-4000-years-study-says.html. The basis of the news article is the report: S. A. Marcott, J. D. Shakun, P. U. Clark, and A. C. Mix, A reconstruction of regional and global temperature for the past 11,300 years, Science 339 (8 March 2013) 1198–1201; available online at http://www.sciencemag.org/content/339/6124/1198.abstract.
16 America's Climate Choices, Committee on America's Climate Choices, National Research Council (National Academy Press, 2011) 15. The figure reported is 0.6 °C over 30 years.
17 Union of Concerned Scientists, Climate Change in the U.S. Northeast (UCS Publications, October 2006) 10. The reported figure is 0.14 °F per decade.
18 http://weatherwarehouse.com/WeatherHistory/PastWeatherData_HartfordBrainardField_Hartford_CT_July.html
Fig. 10.15 Top panel: variation in mean July temperature of Hartford, Connecticut (US) from
1960 to 2012. Bottom panel: 10-year moving average (black points) and least-squares line of
regression (solid) with slope 0.028 ± 0.0026 °C/y. (Lines connecting points serve only to guide
the eye.)
Superposed on the moving average is the least-squares line of regression, the slope of
which again yields a rate
(City of Hartford annual temperature rise, 1960–2012) ≈ 0.028 ± 0.0026 °C/y   (10.8.4)
virtually identical to the regional rate (10.8.3). Clearly, then, the much higher rate of temperature rise is a recent phenomenon.
The anomalously high rate of increase in the Hartford temperature immediately
raises the question of just how anomalous this rate is in particular, whether it
applies to other cities. Hartford is a medium-size city with a population (as of 2012)
of 124 893. As a final comparison, I turned my attention to nearby New York City
(NYC), the largest metropolitan area in the USA by population estimated (as of
2012) to be 8 336 697.
Above-ground temperature records for NYC, measured at a station in Central Park, Manhattan, are available online for the period 1900–2012 as annual averages for each individual month January through December. I downloaded the information into 12 separate data files T_μ,i (μ = 1 . . . 12; i = 0 . . . 112), in which index μ specifies the month (e.g. μ = 1 ⇒ January) and index i specifies the year beyond 1900 (e.g. i = 60 ⇒ 1960), and combined them into a single time-ordered file x_{12i+μ} = T_μ,i of N = 12 × 112 = 1344 Celsius temperatures. (The original files were in Fahrenheit.)
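The bookkeeping of that reshaping (monthly files interleaved into one time-ordered Celsius series) can be sketched with synthetic stand-in data; the array layout below is an assumption, not the author's actual file format:

```python
import numpy as np

# Synthetic Fahrenheit data stand in for the 12 downloaded monthly files:
# rows = years past 1900, columns = months January..December.
rng = np.random.default_rng(2)
years, months = 112, 12                 # 112 years of monthly values, N = 1344
seasonal = 55 + 20 * np.sin(2 * np.pi * (np.arange(months) - 3) / 12)
T_F = seasonal[None, :] + rng.normal(0, 3, (years, months))

# Fahrenheit -> Celsius, then flatten row-major so the series is time-ordered
# (all 12 months of year i precede the months of year i+1).
x = ((T_F - 32) * 5 / 9).reshape(-1)
print(len(x))                            # 1344
```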
Fig. 10.16 Temperature variation over different time intervals for New York City. Top panel: monthly mean temperatures from 1960 to 2012. Middle panel: 12-month moving average covering period 1960–2012. Bottom panel: 12-month moving average covering period 2007–2012. Dashed traces are least-squares lines of regression.
The first panel of Figure 10.16 shows the temperature variation for a truncated sample of the data spanning 1960–2012. To the naked eye the pattern is little more
than a pleasing sinusoid at the expected annual period and with barely perceptible
amplitude variation, much like the subterranean x240 temperature record in
Figure 10.3. But the eye of analysis tells a different story.
The second and third panels show the centered 12-month moving average

$$\langle x \rangle_k = \sum_{j=0}^{12} c_j\, x_{k+j-6} \quad (k = 1 \ldots N); \qquad c_0 = c_{12} = \frac{1}{24}, \quad c_j = \frac{2}{24}\ \ (1 \le j \le 11) \qquad (10.8.5)$$
over the period 19602012 (middle panel) and the period 20072012 (bottom panel),
superposed by least-squares lines of regression (dashed). The moving average has
removed all traces of an annual periodicity, leaving only Gaussian random noise.
Although at first glance ⟨x⟩_k in the middle panel looks noisy compared to the original record in the top panel, that appearance is deceptive due to the different vertical scales. Had ⟨x⟩_k been plotted at the scale of x_k in the top panel, the maximum
fluctuations (of about 3 C) would only marginally have exceeded the thickness of the
line of regression.
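The weights of (10.8.5) can be implemented and sanity-checked in a few lines; the half-weights at the two end months make the filter sum to unity and annihilate a pure 12-month cycle:

```python
import numpy as np

# Centered 12-month moving-average weights of (10.8.5):
# c_0 = c_12 = 1/24 at the ends, c_j = 1/12 for the interior months.
weights = np.array([1/24] + [1/12] * 11 + [1/24])
print(round(weights.sum(), 10))          # 1.0: the weights are normalized

# Apply to a pure annual cycle sampled monthly; the filter removes it exactly.
x = np.cos(2 * np.pi * np.arange(240) / 12)
ma = np.convolve(x, weights, mode="valid")
print(np.max(np.abs(ma)) < 1e-12)        # True: annual periodicity annihilated
```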
Here, then, are the results I obtained for the NYC rate of temperature change per
year for three different spans of time:
Long span      1900–2012      —
Medium span    1960–2012      ≈ 0.02 °C/y
Short span     2007–2012      ≈ 0.38 °C/y      (10.8.6)
The rate of temperature increase, anomalous though it may be, is entirely consistent with what I had found previously for Hartford, i.e. (a) a rate of about 0.02 °C/y, comparable to the regional rate, when taking the period 1960–2012, but (b) a much higher rate of about 0.38 °C/y for the recent period 2007–2012. That the NYC rate is higher than Hartford's (by about 35%) is not surprising given that the population size and, especially, the population density of NYC (10 430 km⁻²) are much higher than Hartford's (2 772 km⁻²). Also, this rate was measured above ground, whereas the rate I deduced for Hartford was from measurements 2.4 m below ground.
Although I began the study of underground temperature with primarily an
academic interest, the numbers that ultimately emerged from this exercise have
far-reaching implications all the more serious because they quantify an aspect of
climate change that is only very infrequently, if at all, given public exposure.
Certainly, there is much discussion of global warming, although from a scientific perspective that label represents only one part of a more complex outcome of climate change (and not all locations may end up warmer). It is the global aspect, however, upon which attention is mostly fixed, that is, the potential large-scale responses of the planet to heating such as melting of glaciers in Greenland or ice
shelves in Antarctica, the subsequent rise in ocean level with attendant flooding, the
disruption of large circulatory currents in the Atlantic with concomitant planetary
redistribution of heat, heightened frequency and intensity of extreme weather events
like hurricanes and tornados, broad regional perturbations of ecosystems leading to
extinctions of species and spread of disease-bearing insects and other vectors.
None of the preceding possibilities is to be dismissed out of hand or taken lightly, but neither do I think at this point that these events will be the ones that undermine a stable society the soonest. Rather, it is the local, not global, heat stress on cities. The cities are where most people live, and, as the human population continues to increase beyond sustainable levels,19 more and more people will be living in cities. As temperatures likewise climb, so too will the demand for means of cooling, a demand that already challenges electric grids in the US and abroad. According to the US Centers for Disease Control and Prevention:20
19 M. P. Silverman, "The Children Keep Coming: How Many Can Live on Earth?", http://www.trincoll.edu/silverma/reviews_commentary/how_many_can_live_on_earth.html [Op-Ed piece submitted to the Wall Street Journal, 2011]
20 http://www.cdc.gov/climateandhealth/effects/heat.htm
Heat waves are already the most deadly weather-related exposure in the U.S., and account for more
deaths annually than hurricanes, tornadoes, floods, and earthquakes combined.
If there is a bright side to this inauspicious forecast, it is that local problems are more readily remedied than global ones. I am not optimistic that meaningful action will be taken anytime soon within the USA, let alone among different nations, to adopt measures for controlling greenhouse gas emissions. Too many participants have a stake in maintaining the status quo. "It is difficult to get a man to understand something," writer Upton Sinclair remarked, "when his salary depends upon his not understanding it." In place of "salary" you can substitute, more generally, "political and financial interests". Fortunately, addressing the problem of heat stress at the city level does not require negotiating an international treaty or reconciling the clashing interests of political parties. Remedies that forestall disaster could include actions as simple as planting trees and installing reflective paneling on rooftops.
***
There was a time, once, although not any longer, when a reputable scientist could dismiss the use of statistical reasoning in the planning and interpretation of experiments. One of Nobel Laureate Ernest Rutherford's famous statements is: "If your experiment needs statistics, then you ought to have done a better experiment." Rutherford, it could be noted, said other things equally ill-considered (such as dismissing the possibility of extracting energy from the atom . . . by which he meant the nucleus . . . as "moonshine"). The fact is: statistical methods in the service of the laws of physics provide a tool for seeing through random noise so that what intelligible patterns of information are present can be recognized, quantified, and used to achieve an end (hopefully a worthy one).
Songwriter Bob Dylan may have had a scientific point (whether he knew it or not) in his oft-quoted line from the song "Subterranean Homesick Blues".21 One probably does not need a weatherman to know which way the wind blows. But if Dylan were living in a sprawling, densely populated city a score of years from now, he might want to amend that lyric to read

You don't need a physicist to know which way the heat flows.

For the time being, however, until the deadly serious matter of urban heat stress is widely recognized and acted upon, you do.
21 Bob Dylan, "Subterranean Homesick Blues", Columbia Records (original release 1965).
Appendices
$$\hat{n} \;=\; \sin\theta\cos\varphi\,\hat{x} \;+\; \sin\theta\sin\varphi\,\hat{y} \;+\; \cos\theta\,\hat{z} \tag{10.9.1}$$

with polar angle $\theta$ and azimuthal angle $\varphi$. From elementary calculus, a differential patch of surface area on the sphere is $dS = R^2 \sin\theta\,d\theta\,d\varphi$.

Suppose the rays of the Sun incident on the sphere are parallel to the $y$ axis. Then the unit solar flux vector is $\hat{\jmath} = \hat{y}$, and the (positive) rate of energy absorption per unit incident energy is given by the integral over the hemisphere facing the Sun

$$\oint \hat{\jmath}\cdot\hat{n}\;dS \;=\; R^2 \int_0^{\pi} \sin\varphi\,d\varphi \int_0^{\pi} \sin^2\theta\,d\theta \;=\; \pi R^2. \tag{10.9.2}$$
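The closed form $\pi R^2$ can be confirmed numerically. A midpoint-rule sketch (my own check, assuming, as in the text, $\hat{\jmath}\cdot\hat{n} = \sin\theta\sin\varphi$ on the sunlit half $\varphi \in (0,\pi)$):

```python
import numpy as np

# Midpoint-rule evaluation of the flux intercepted by the sunlit
# hemisphere: integrand (j.n) dS = sin(theta)sin(phi) * R^2 sin(theta),
# with theta, phi in (0, pi). Exact value: pi R^2.
R = 1.0
n = 1000
theta = (np.arange(n) + 0.5) * np.pi / n    # midpoints of polar angle
phi = (np.arange(n) + 0.5) * np.pi / n      # midpoints of azimuth
TH, PH = np.meshgrid(theta, phi, indexing="ij")

flux = np.sum(np.sin(TH)**2 * np.sin(PH)) * R**2 * (np.pi / n)**2
print(flux / (np.pi * R**2))   # approaches 1 as n grows
```

The result $\pi R^2$ is just the geometric cross-section of the sphere, one quarter of its total surface area $4\pi R^2$, the factor that enters simple planetary energy-balance estimates.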
$$C(k) \;=\; \varepsilon^{k}\!\int e^{-\lambda t}\cos\omega t\,\cos\bigl(\omega(t+k)\bigr)\,dt \tag{10.10.1}$$
$$\;=\; \varepsilon^{k}\Bigl(\cos\omega k\int e^{-\lambda t}\cos^{2}\omega t\,dt \;-\; \sin\omega k\int e^{-\lambda t}\cos\omega t\,\sin\omega t\,dt\Bigr), \tag{10.10.2}$$

in which the second line results from a change of integration variable and use of a trigonometric identity to expand $\cos(\omega(t+k))$. The autocorrelation function then takes the form

$$\rho_k \;=\; \frac{C(k)}{C(0)} \;=\; \varepsilon^{k}\Bigl(\cos\omega k \;-\; \frac{I_2}{I_1}\,\sin\omega k\Bigr), \tag{10.10.3}$$

where

$$\varepsilon \;\equiv\; e^{-\lambda/2} \;<\; 1 \tag{10.10.4}$$
$$\lambda \;=\; 2\ln(1/\varepsilon) \tag{10.10.5}$$

and the two integrals evaluate to

$$I_1 \;=\; \int e^{-\lambda t}\cos^{2}\omega t\,dt \;=\; \frac{(\lambda^{2}+2\omega^{2})(1-\varepsilon^{2})}{\lambda(\lambda^{2}+4\omega^{2})} \tag{10.10.6}$$
$$I_2 \;=\; \int e^{-\lambda t}\cos\omega t\,\sin\omega t\,dt \;=\; \frac{\omega(1-\varepsilon^{2})}{\lambda^{2}+4\omega^{2}}, \tag{10.10.7}$$

so that

$$\rho_k \;=\; \varepsilon^{k}\Bigl(\cos\omega k \;-\; \frac{\lambda\omega}{\lambda^{2}+2\omega^{2}}\,\sin\omega k\Bigr). \tag{10.10.8}$$
Bibliography
[1] Maurice Kendall and Alan Stuart, The Advanced Theory of Statistics, Vol. 1: Distribution Theory, 2nd Edition (Hafner, 1963)
[2] Maurice Kendall and Alan Stuart, The Advanced Theory of Statistics, Vol. 2: Inference and Relationship (Griffin, 1961)
[3] Maurice Kendall, Alan Stuart, and J. Keith Ord, The Advanced Theory of Statistics, Vol. 3: Design and Analysis, and Time-Series, 4th Edition (Macmillan, 1983)
This is a definitive advanced treatment of orthodox statistics: a massive three-volume encyclopaedic work that only the most avid statistics enthusiast is likely to want to read cover to cover. It gives the broadest coverage of any statistics reference I know. (My three volumes are of different editions because they were acquired at widely different times from different places.)
George Box, Gwilym Jenkins, and Gregory Reinsel, Time Series Analysis (Pearson, 2005)
A classic work, much of it developed by George Box, that presents a thorough treatment of the different classes of model stochastic systems for time-series analysis, forecasting, and control.
Thermodynamics and statistical physics
Index
Bessel functions 24
and distribution of autocovariance 192
and distribution of product of normal RVs 289
and Poisson process 173
and Skellam distribution 23
beta function 44, 93
binomial distribution 7, 403
and nuclear decay 123, 128
moment generating function 20
black-body radiation
of the Earth 575
of the Sun 573
Bose–Einstein statistics see statistics
boson 195
boundary layer (of fluid) 411
Cauchy distribution 29
of half-life estimates 318
of median 294
of ratio of standard normal variates 191, 292
Central Limit Theorem 13, 28, 32, 312, 402
and WOC hypothesis 464
Chapman–Kolmogorov equation 351
characteristic function 32
chemical potential 199
of massless particles 260
of neutrino 266–267
relation to Lagrange multiplier 263
thermodynamic definition 263
chi-square distribution 38
and degrees of freedom 40
and gamma distribution 40
and test of hypothesis 65
moment generating function 40
cholesterol ratio 313
circulation 422–423
combinatorial coefficient
binomial 8
multinomial 10
Compton wavelength 196
Condorcet's jury theorem 465, 509
correlation coefficient 25, 57
covariance function 130, 356
and variogram 373
discrete estimate 131, 191
distribution of 192
of autoregressive time series 529
of moving average time series 532
Cramér–Rao theorem 56, 60
Craven, John (and Bayesian search method) 461
cross-correlation function 130, 149
cryptography 237
cumulative distribution function (cdf ) 7, 393
cut-off frequency 134
d'Alembert's paradox 410
degeneracy 200
degrees of freedom
and maximum lag 150
in chi-square distribution 39–40, 60, 64
in time series 147–148
delta function 34, 140
density of states 206
diffusion 350
coefficient (thermal diffusivity) 591, 598
equation of motion 353, 591, 596
of thermal energy 589, 597
digamma function 109
Dirichlet function 268
distributions of products and ratios 277, 325
of Poisson RVs 278
of uniform RVs 281, 325
of normal RVs 287, 326
drag 409, 419, 442
coefficient 410, 449
in Newtonian flow 410
in Stokes flow 410
on airfoil 434
electronpositron annihilation 118, 261
ensemble average 130
entropy 47
and extensivity 205
and prior information 49
and probability 49
and Shannon information 48, 340
and tests of randomness 258
conditional 343
ergodicity 131
error function (erf ) 13, 486
error propagation theory (EPT) 274, 311, 395
of mean 274
of product of RVs 274
of ratio of RVs 275
of variance 274
estimator 55
unbiased 55
Euler generating function 19
Euler's constant 109
expectation (statistical) 351
exponential distribution 14
and lack of memory 15
and nuclear decay 157
and power spectrum 190
extensivity 205
Fermi–Dirac statistics see statistics
fermion 195
Fick's law 591
First Law of Thermodynamics 222, 260
Fourier series 133 (continuous function), 133
(discrete function)
Fourier transform 138
Cooley–Tukey FFT algorithm 146
discrete 189, 524, 582
normalization constant 142
flight (theory of ) 419
fugacity 264
gamma coincidence experiment 117
gamma distribution 40
and Fourier amplitudes 190
gamma function 24
incomplete 108
Gauss multiplication formula 44
Gaussian (or normal) distribution 12
and Central Limit Theorem 13
moment generating function 26
solution to diffusion equation 351
geometric distribution 15, 175
and photon emission probability 210
Gibbs correction 205
Gibbs paradox 205
Gonzales, Federico (fall and survival) 432
harmonic series 585
heat-island effect 604
height of the atmosphere 443
Helmholtz equation 593
histogram 10, 114
and multinomial distribution 10, 154
ideal gas 380
adiabatic transformation 443
and stock market 382
equation of state 443
mean free path 381
rms speed 381
time between collisions 381
information mining
Galton model 480
Silverman model 483
information theory 340, 347, 384
insolation 575
intensity interferometry 215
interval function 281
invariance principle 96
inverse probability 92
Kullback–Leibler information (KL) 556
kurtosis 6, 285, 392
Kutta–Joukowski theorem 423
Lagrange multiplier 49–51
and statistical mechanics 51, 199
latent heat 592
law of averages 372, 387
law of corresponding states 412
law of errors 13
law of large numbers 404, 464
Legendre duplication formula 44
Lévy distribution 356
lift 419
coefficient 427
on a sphere 426
on airfoil 434
light 194
as photons 195
chemical potential 207, 216
correlation coefficient 232
fluctuations 212–214, 220
heat capacity 218, 220
helicity 197
non-classical states 236
photon anti-bunching 236
photon bunching 232, 234, 244
photon correlations 226, 229
pressure 221
likelihood function 4, 55
and chi-square distribution 64–65
and probability 63, 65
conditional 359
of autoregressive time series 358
of multivariable Gaussian 57
ratio 62
log-normal distribution 495–496, 502, 511
Magnus effect 425–426
Markov process 354
martingale 378
maximum entropy principle 46
and statistical mechanics 199
maximum likelihood (ML) method 54
and chi-square test 65
and least squares test 65
ML covariance matrix 56
ML estimator 56
MaxwellBoltzmann statistics see statistics
Maxwell relations 223
mean
maximum likelihood estimate 58
Student's t distribution 41
forms of pdf 44
test of batting equivalence 404, 407
sufficient statistic 56
terminal speed 414
thermal conductivity 592
thermal diffusivity 592
time series models
autocorrelation 519, 521
autoregressive (AR) 353, 527, 534, 595
autoregressive integrated moving average
(ARIMA) 551
autoregressive moving average (ARMA)
533
first-differences 518–519
invertibility 550
ML parameters 385, 569
moving average (MA) 527, 530, 547
power spectral density 385, 523
predictability 373, 376
with oscillations 543
Toeplitz matrix 540
ultra-relativistic plasma 263