
A CERTAIN UNCERTAINTY:

NATURE'S RANDOM WAYS

Based around a series of real-life scenarios, this vivid introduction to statistical


reasoning will teach you how to apply powerful statistical, qualitative, and probabilistic tools in a technical context.
From analysis of electricity bills, baseball statistics, and the movement of stock
markets, through to the physics of fermions and bosons, and the effects of climate
change, each chapter introduces relevant physical, statistical, and mathematical
principles step-by-step in an engaging narrative style, helping to develop practical
proficiency in the use of probability and statistical reasoning.
With numerous illustrations, which make it easy to focus on the most important
information, and full-color figures available online at www.cambridge.org/silverman,
this insightful book is perfect for students and researchers of any discipline interested
in the interwoven tapestry of probability, statistics, and physics.
Mark P. Silverman is the G. A. Jarvis Professor of Physics at Trinity College, Connecticut. He received his Ph.D. in Chemical Physics from Harvard University, and has since pursued a wide range of experimental and theoretical studies concerning the structure of matter, the behavior of light, and the dynamics of stars and galaxies.

A CERTAIN UNCERTAINTY:
NATURE'S RANDOM WAYS
MARK P. SILVERMAN
Trinity College, Connecticut

University Printing House, Cambridge CB2 8BS, United Kingdom


Cambridge University Press is part of the University of Cambridge.
It furthers the University's mission by disseminating knowledge in the pursuit of
education, learning and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9781107032811
© M. P. Silverman 2014
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2014
Printed in the United Kingdom by TJ International Ltd, Padstow, Cornwall
A catalogue record for this publication is available from the British Library
Library of Congress Cataloging in Publication data
Silverman, Mark P., author.
A certain uncertainty : nature's random ways / Mark P. Silverman, G.A. Jarvis Professor
of Physics, Trinity College, Connecticut.
pages cm
Includes bibliographical references.
ISBN 978-1-107-03281-1 (Hardback)
1. Statistical physics. 2. Mathematical physics. I. Title.
QC174.8.S545 2014
530.15′95 – dc23    2014004090
ISBN 978-1-107-03281-1 Hardback
Additional resources for this publication at www.cambridge.org/silverman
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication,
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.

To Sue, Chris and Jen


(the only certainties in my life)

Books by Mark P. Silverman

• And Yet It Moves: Strange Systems and Subtle Questions in Physics (Cambridge University Press, 1993)
• More Than One Mystery: Explorations in Quantum Interference (Springer, New York, 1995)
• Waves and Grains: Reflections on Light and Learning (Princeton University Press, 1998)
• Probing the Atom: Interactions of Coupled States, Fast Beams, and Loose Electrons (Princeton University Press, 2000)
• A Universe of Atoms, an Atom in the Universe (Springer, New York, 2002)
• Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement and Interference (Springer, Heidelberg, 2008)

Contents

Preface
Acknowledgments

1   Tools of the trade
    1.1   Probability: The calculus of uncertainty
    1.2   Rules of engagement
    1.3   Probability density function and moments
    1.4   The binomial distribution: bits [Bin(1, p)] and pieces [Bin(n, p)]
    1.5   The Poisson distribution: counting the improbable
    1.6   The multinomial distribution: histograms
    1.7   The Gaussian distribution: measure of normality
    1.8   The exponential distribution: Waiting for Godot
    1.9   Moment-generating function
    1.10  Moment-generating function of a linear combination of variates
    1.11  Binomial moment-generating function
    1.12  Poisson moment-generating function
    1.13  Multinomial moment-generating function
    1.14  Gaussian moment-generating function
    1.15  Central Limit Theorem: why things seem mostly normal
    1.16  Characteristic function
    1.17  The uniform distribution
    1.18  The chi-square (χ²) distribution
    1.19  Student's t distribution
    1.20  Inference and estimation
    1.21  The principle of maximum entropy
    1.22  Shannon entropy function
    1.23  Entropy and prior information
    1.24  Method of maximum likelihood
    1.25  Goodness of fit: maximum likelihood, chi-square, and P-values
    1.26  Order and extremes
    1.27  Bayes' theorem and the meaning of ignorance
    Appendices
    1.28  Rules of conditional probability
    1.29  Probability density of a sum of uniform variates U(0,1)
    1.30  Probability density of a χ² variate
    1.31  Probability density of the order statistic Y(i)
    1.32  Probability density of Student's t distribution

2   The fundamental problem of a practical physicist
    2.1   Bayes' problem: solution 1 (the uniform prior)
    2.2   Bayes' problem: solution 2 (Jaynes' prior)
    2.3   Comparison of the two solutions
    2.4   The Silverman–Bayes experiment
    2.5   Variations on a theme of Bayes

3   Mother of all randomness
    Part I  The random disintegration of matter
    3.1   Quantum randomness: is the force with us?
    3.2   The gamma coincidence experiment
    3.3   Delusion of layered histograms
    3.4   Elementary statistics of nuclear decay
    3.5   Detrending a time series
    3.6   Time series: correlations and ergodicity
    3.7   Periodicity and the sampling theorem
    3.8   Power spectrum and correlation
    3.9   Spectral resolution and uncertainty
    3.10  The non-elementary statistics of nuclear decay
    3.11  Recurrence, autocorrelation, and periodicity
    3.12  Limits of detection
    3.13  Patterns of randomness: runs
    3.14  Patterns of randomness: intervals
    3.15  Final test: intervals, runs, and histogram shapes
    3.16  Conclusions and surprises: the search goes on
    Appendices
    3.17  Power spectrum completeness relation
    3.18  Distributions of spectral variables and autocorrelation functions

4   Mother of all randomness
    Part II  The random creation of light
    4.1   The enigma of light
    4.2   Quantum vs classical statistics
    4.3   Occupancy and probability functions
    4.4   Photon fluctuations
    4.5   The split-beam experiment: photon correlations
    4.6   Bits, secrecy, and photons
    4.7   Correlation experiment with down-converted photons
    4.8   Theory of recurrent runs
    4.9   Runs and the single photon: lessons and implications
    Appendices
    4.10  Chemical potential of massless particles
    4.11  Evaluation of Bose–Einstein and Fermi–Dirac integrals
    4.12  Variation in thermal photon energy with photon number (∂E/∂N)|T,V
    4.13  Combinatorial derivation of the Bose–Einstein probability
    4.14  Generating function for probability [Pr(Nn ≥ k)] of k successes in n trials

5   A certain uncertainty
    5.1   Beyond the beginning of knowledge
    5.2   Simple rules: error propagation theory
    5.3   Distributions of products and quotients
    5.4   The uniform distribution: products and ratios
    5.5   The normal distribution: products and ratios
    5.6   Generation of negative moments
    5.7   Gaussian negative moments
    5.8   Quantum test of composite measurement theory
    5.9   Cautionary remarks
    5.10  Diagnostic medical indices: what do they signify?
    5.11  Secular equilibrium
    5.12  Half-life determination by statistical sampling: a mysterious Cauchy distribution
    Appendix
    5.13  The distribution of W = XY/Z

6   Doing the numbers – nuclear physics and the stock market
    6.1   The stock market is a casino
    6.2   The details – CREF, AAPL, and GRNG
    6.3   Theory of information H
    6.4   Is there information in a stock market time series?
    6.5   Stock price and molecular diffusion
    6.6   Random walk as an autoregressive process
    6.7   Stocks go UP and UP . . . and DOWN and DOWN
    6.8   What happened to the law of averages?
    6.9   Predicting the future
    6.10  Timing is everything
    Appendices
    6.11  Information inequality H(A|B) ≤ H(A)
    6.12  Power spectral density of an autoregressive time series
    6.13  Exact maximum likelihood estimate of AR(1) parameters
    6.14  Statistics of gambling and law of averages

7   On target: uncertainties of projectile flight
    7.1   Knowing where they come down
    7.2   Distribution of projectile ranges
    7.3   Energy vs speed: a test of hypotheses
    7.4   Play ball! – home runs and steroids
    7.5   Air resistance
    7.6   Theory of flight
    7.7   Fly(ing) ball – spin and lift
    7.8   Falling out of the sky is a drag
    7.9   Descent without power: how to rescue a jumbo jet disabled in flight
    Appendices
    7.10  Distribution and variation of projectile range R(V, θ)
    7.11  Unbiased estimator of skewness

8   The guesses of groups
    8.1   A radical hypothesis
    8.2   A mathematical truism?
    8.3   Condorcet's jury theorem
    8.4   Epimenides' paradox of experts
    8.5   The Silverman GOG experiments
    8.6   Interpretation of the GOG experiments
    8.7   Mining groups for information: Galton's democratic model
    8.8   Mining groups for information: Silverman's Mixed-NU model
    8.9   The BBC–Silverman experiments: the reach of television
    8.10  The log-normal distribution: a fundamental model of group judgment?
    8.11  Conclusions: so how wise are crowds?
    Appendices
    8.12  Derivation of the jury theorem
    8.13  Solution to logic problem #1: how old are the children?
    8.14  Solution to logic problem #2: where is the treasure?
    8.15  Origins and features of a log-normal distribution

9   The random flow of energy
    Part I  Power to the people
    9.1   A different kind of law
    9.2   Examining the data: time and autocorrelations
    9.3   Examining the data: frequency and power spectra
    9.4   Seeking a solution: the construction of models
    9.5   Autoregressive (AR) time series
    9.6   Moving average (MA) time series
    9.7   Combinations: autoregressive moving average time series
    9.8   Phase one: exploration of autoregressive solutions
    9.9   Phase two: adaptive and deterministic oscillations
    9.10  Phase three: exploration of moving average solutions
    9.11  Phase four: judgment – which model is best?
    9.12  Electric shock!
    9.13  Two scenarios: coincidence or conspiracy?
    Appendices
    9.14  Solution of the AR(12)1,12 master equation
    9.15  Maximum likelihood estimate of AR(n) parameters
    9.16  Akaike information criterion and log-likelihood
    9.17  Line of regression to 12-month moving average

10  The random flow of energy
    Part II  A warning from the weather under ground
    10.1  What lies above?
    10.2  What lies beneath?
    10.3  Autocorrelation of underground temperature
    10.4  Fourier transform and power spectrum of underground temperature
    10.5  Energy diffusion: approach I – deterministic
    10.6  Energy diffusion: approach II – stochastic
    10.7  Interpreting the waveforms
    10.8  Climate implications
    Appendices
    10.9  Absorption of solar radiation by a sphere
    10.10 Autocorrelation of a decaying oscillator

Bibliography
Index

How is it possible that mathematics, which is indeed a product of human thought independent
of all experience, accommodates so well the objects of reality?
Here, in my view, is a short answer: In so far as mathematical statements concern reality, they
are not certain, and in so far as they are certain, they do not refer to reality.
Albert Einstein¹

¹ Albert Einstein, from the lecture "Geometrie und Erfahrung" [Geometry and Experience], given in Berlin on 27 January 1921. (Translation from German by M. P. Silverman.)

Preface
An overview – start here

I have heard it said that a preface is the part of a book that is written last, placed first,
and never read. Still, I will take my chances; this is, after all, a book about probability
and uncertainty. The purpose of this preface is to explain what kind of book this is,
why I wrote it, for whom I wrote it, and what I hope the reader will gain by it.
This book is a technical narrative. It is not a textbook (although you can certainly
use it that way); there are no end-of-chapter questions or tests, and the level of
material does not presuppose the reader to have reached some envisioned state of
preparedness. It is not a monograph; it does not survey an entire field of intellectual
activity, and there is no list of references apart from a few key sources that aided me
in my own work. It is not a popularization; the writing does not sensationalize its
subject matter, and explanations may in part be heuristic or analytical, but (I hope)
never shallow and hand-waving.
A narrative is a story, albeit in this book one that is meant to instruct as well as amuse. Each chapter, apart from some background material in the beginning, is an account of a scientific investigation I have undertaken, sometimes because the questions at issue are of utmost scientific importance, other times on a whim out of pure curiosity. The various narratives are different, but through each runs a common
pure curiosity. The various narratives are different, but through each runs a common
thread of probability, uncertainty, randomness, and, often enough, serendipity.
Why, you may be thinking, should my scientific investigations interest you? To this
thought, I can give two answers: one brief, the other longer.
The short answer is that I have written six previous books of the same format
(narrative descriptions of my researches), which have sold well. Many people who
bought (and presumably read) the books found the diversity of subject matter
interesting and the expositions clear and informative, to judge from their unsolicited
correspondence. It seems reasonable to me, therefore, that a Bayesian forecast of a reader's response to this book would employ a favorably biased prior.
The longer answer concerns how people learn things. The principal objective of
this book, after all, is to share with anyone who reads it part of what I have
learned in some 50 years (and still counting) as an experimental and theoretical
physicist.
In the course of a long and somewhat unusual scientific career, my researches have
taken me into nearly every field of physics. In broad outline, I study the structure of
matter, the behavior of light, and the dynamics of stars and galaxies. My investigations of quantum phenomena have employed electron interferometry, radiofrequency and microwave spectroscopy, laser spectroscopy, magnetic resonance, atomic
beams, and nuclear spectroscopy. I have examined the reflection, refraction, diffraction, polarization and scattering of light as a classical wave, and the absorption,
emission, and correlation of light as a quantum particle (photon). I have reported on
the quantum statistics of neutron fluids and Bose–Einstein condensates in exploded, collapsed stars, and the classical statistics of fragments of exploded glass in my laboratory. I have studied the interactions (electromagnetic, nuclear, and gravitational) of real matter on Earth and of dark matter in the cosmos. My interests have embraced projects of high scientific significance (such as tests of quantum electrodynamics, of the theory of nuclear decay, of Newtonian gravity and of general relativity) and projects to understand the workings of physically simple, yet surprisingly complicated, physics toys (such as a motor comprising only an AA battery, a small cylindrical magnet, and a paper clip; or a passive hollow tube that is fed room-temperature air at the center and emits hot air from one end and cold air from the other).
The point of the preceding partial enumeration of research interests is simply this:
I was not trained to do all the above and more; I had to teach myself, and the motivation for learning what I needed to know in each instance derived from the desire to solve a particular problem that interested me. I did not undertake my
physics self-instruction out of a desire to absorb abstract principles!
A narrative (a story) humanizes the starkness of physical principles and the abstraction of mathematical expressions, and thereby helps provide motivation to learn
both. While the personal situations that prompted me to undertake the studies
narrated here are unlikely to pertain to you, the reader, I cannot help but believe
that the issues involved are as relevant to you as they were to me.
Do you travel and fly in an airplane? Then you may want to read my analysis of
the survival of a pilot who fell five miles without a parachute and how, from that,
I developed a protocol for bringing down safely a jumbo jet whose engines all fail.
Do you invest in the stock market to save for retirement? Then you may want to
read my statistical analysis of how common stocks behave and what you can expect
the market to do for you.
Do you take medications of some kind or have an annual physical exam with a
blood test? Then you will be interested in what my statistical analysis reveals about
the reliability of the clinical laboratory reports.
Have you ever served on a jury or a committee or some group required to reach a
collective judgment? Then you will surely be interested in my theoretical analysis
and experimental tests (aided by collaboration with a BBC television show) of the
so-called wisdom-of-crowds phenomenon.


Do you pay a power company each month for use of electric energy? Are
you confident that the meter readings are accurate and that you are being
charged correctly? Before answering the second question, perhaps you should
read the chapter detailing the statistical analysis of my own electric energy
consumption.
Do you enjoy sports, in particular ball games of one kind or another? Then you
may be intrigued by my analysis of the ways in which a baseball can move if struck
appropriately or, perhaps of more practical consequence, how I inferred that a
certain prominent US ballplayer was probably enhancing his performance with drugs
long before the media became aware of it.
Are you concerned about global climate change? Then my statistical study of the
climate under ground will give you a perspective on what is likely to be the most serious consequence to occur soonest: a consequence that has rarely been given public exposure.
And if you are a scientist yourself, especially a physicist, then you may be utterly astounded, as I was initially, to learn of persistent claims in the peer-reviewed physics
literature of processes that, had they actually occurred, would turn nuclear physics (if
not, in fact, all laws of physics) upside down. You should therefore find particularly
interesting the chapter that describes my experiments and analyses that lay these
extraordinary claims to rest.
The foregoing abbreviated descriptions should not disguise the fact that, as mentioned at the outset, this book is a technical narrative. The book can be read,
I suppose, simply for the stories, skipping over the lines of mathematics. However, if
your goal is to develop some proficiency in the use of probability and statistical
reasoning, then you will want to follow the analyses carefully. I start the book with
basic principles of probability and show every step to the conclusions reached in the
detailed explanations of the empirical studies. (Some of the detailed calculations are
deferred to appendices.)
A textbook, in which material is laid out in a linear progression of topics, may teach statistics more efficiently, but this book teaches the application of statistical reasoning in context, i.e. the use of principles as they are needed to solve specific problems. This means there will be a certain redundancy, but that is a good thing. In
many years as a teacher, I have found that an important part of retention and
mastery is to encounter the same ideas more than once but in different applications
and at increasing levels of sophistication.
Virtually every standard topic of statistical analysis is encountered in this book, as
well as a number of topics you are unlikely to find in any textbook. Furthermore, the
book is written from the perspective of a practical physicist, not a mathematician
or statistician and, where useful, my viewpoint is offered, schooled by some five
decades of experimentation and analysis, concerning issues over which confusion or
controversy have arisen in the past: for example, issues relating to sample size and
uncertainty, use and significance of chi-square tests and P-values, the class

xvi

Preface

boundaries of histograms, the selection of Bayesian priors, the relationship between


principles of maximum likelihood and maximum entropy, and others.
As a final point, it should be emphasized that this book is not merely a statistics
book. Rather, the subject matter at root is statistical physics. Every chapter, apart
from the first, involves some experimental aspect, whether measured in a laboratory,
simulated on a computer, or observed in the world at large. The themes of the
narratives concern physical processes from widely different reaches of physics:
dynamics of discrete particles, dynamics of fluids, dynamics of heat flow, statistical
mechanics of bosons and fermions, creation of non-classical forms of light, transformations of radioactive nuclei, and more. In the process of solving particular
problems, there arise (and I will answer) profound questions that are rarely
encountered in physics textbooks. Consider thermodynamics, for example. Why is
the chemical potential of black-body radiation zero? Is it zero for all kinds of
photons? Is it zero because the photon is massless? Would a massless neutrino have
a zero chemical potential? Read this book and find out.
What background do you need to read this book? Clearly, the more mathematics
and physics you know beforehand, the more of the technical details you will be able
to understand. An undergraduate physics major should be able to read all of it by the
time he or she graduates. In fact, some of the content comes from the physics lectures
I give at an undergraduate institution. A person with a knowledge of calculus should
be able to read most of it. But anyone with an interest in probability, statistics, and
physics should be able to take away something useful and thought-provoking from
just the text.
That concludes the short answer, the long answer, and the objectives stated in the first paragraph of the Preface – if you read it.
Note regarding figures: Color figures for this book are available at the Cambridge
University Press website www.cambridge.org/silverman.
Mark P. Silverman

Acknowledgments

I would like to thank my son Chris for his invaluable help in formatting the text of
many of the figures in the book, for designing the beautiful cover of the book, and for
his advice on the numerous occasions when my computers or software suddenly
refused to co-operate. It is also a pleasure to acknowledge my long-time colleague,
Wayne Strange, whose participation in our collaborative efforts to explore the
behavior of radioactive nuclei was essential to the successful outcome of that work.
I very much appreciate the efforts of Dr. Simon Capelin, Elizabeth Horne,
Samantha Richter, and Elizabeth Davey of Cambridge University Press to find
practical solutions to a number of seemingly insurmountable problems in bringing
this project to fruition. And I am especially grateful to my copy-editor, Beverley
Lawrence, for her thorough reading and perceptive comments and advice.

1
Tools of the trade

It is remarkable that a science which began with the consideration of games of chance should have become the most important object of human knowledge.
Pierre-Simon Laplace¹

1.1 Probability: The calculus of uncertainty


All measurements and observations, forecasts and inferences, are subject to uncertainty. These uncertainties reflect a lack of precise knowledge arising from the
limitations of one's time, which restricts the amount of data that can be collected,
or instrumentation, which determines the resolution with which signals or information can be acquired, or the fundamental laws of nature, which give rise to
intrinsically random processes whose exact outcomes cannot be predicted irrespective of the apparatus and observation time. Although a well-ordered world governed
by deterministic laws with no uncertainties may seem desirable at times, such a world
will never be and, in any event, would make for a rather dull place indeed.
To deal with the vagaries of nature one ordinarily must turn to the principles of
mathematics bearing on probability and statistics. I will make no attempt to define
probability. For one thing, innocuous as the subject may sound, it has spawned two
schools of thought whose members have gone after one another (in a manner of
speaking) like Crips and Bloods. So, from a practical standpoint, I would rather not
begin a book with remarks likely to inflame any group of readers. Second, and more
to the point, probability is a sufficiently basic concept that, in trying to capture its
meaning in a few words, one ends up using tautological expressions like "chance" or "odds" or "likelihood" that do not really explain anything. The latter term, in fact, is not even a synonym, but is quite distinct from probability, as will become apparent later when we encounter Bayes' theorem or make use of the method of maximum likelihood.

¹ Quoted by Mark Kac, "Probability," in The Mathematical Sciences (MIT Press, Cambridge, 1969) 239.


Let it suffice, therefore, to say that, if you are reading this book, you are already
familiar with the basic idea of probability in at least two contexts.
(a) The first is as the relative frequency of occurrence of an event. Suppose the sample space (i.e. the list of all possible outcomes of some process) comprises events A, B, C whose frequencies of occurrence in N = 100 observations are respectively NA = 20, NB = 50, and NC = 30. (The total number must sum to N.) Then, assuming a random process generated these events, one can estimate the probability of event A by the ratio P(A) = NA/N = 1/5, with corresponding expressions for the other events. We read this as "one chance in five" or "a probability of 20%".
(b) The second is as a statement of the plausibility of occurrence of an event. Thus,
given meteorological data such as the current temperature, humidity, cloud
cover, wind speed and direction, etc., a meteorologist might pronounce a 40% chance of rain for tomorrow. Tomorrow's weather occurs but once; one cannot replay it one hundred times and construct a table of outcomes and frequencies.
The probability estimate relies in part on prior knowledge of the occurrences of
similar past weather patterns.
The two senses of probability reflect the two schools of thought, referred to usually as "frequentist" and "Bayesian". There are subtle issues connected with both understandings of probability. In the frequentist case (a), for example, a more complete and accurate definition of probability would have N approach infinity, which is no problem for a mathematician, but would pose a crushing burden on an experimental physicist. The Bayesian case (b) avoids resorting to multiple hypothetical replications of an experiment in order to deduce the desired probabilities for a particular experiment, but the method seems to entail a hunch or guess dependent on the analyst's prior knowledge. Since different analysts may have different states of knowledge, the subjectivity of a Bayesian-derived estimate of probability appears to clash with a general expectation that probability should be a well-defined mathematical quantity. (One would hesitate to use calculus if he thought the value of an integral depended on who calculated it.)
At this point I will simply state that both approaches to the calculation of
probability are employed in the sciences (and elsewhere); both are mathematically justifiable; both often lead to the same or comparable results in straightforward cases. For all the philosophical differences between the two
approaches, it may be argued that the frequentist deduction of probability is
actually a special case of the Bayesian method. Thus, when the two methods
lead to significantly divergent outcomes, the underlying cause (if all calculations
were executed correctly) arises from different underlying assumptions regarding
the process or system under scrutiny. With that conclusion for the moment, let
us move on.

1.2 Rules of engagement


Although philosophical differences may persist regarding the estimation or inference
of probabilities, there is no disagreement over the mathematical rules for combining
probabilities once they are known. Suppose A and B are two independent events with respective probabilities P(A) and P(B). Then

(a) the probability that A and B both occur is
\[ P(AB) = P(A)\,P(B); \]
(b) the probability that A or B occurs is
\[ P(A + B) = P(A) + P(B). \]

Note: the simultaneous occurrence of events is expressed symbolically by multiplication (AB); the exclusive occurrence of events is expressed symbolically by addition (A + B).
If A and B are not necessarily independent, one might want to know what is the
probability of A occurring, given that B has occurred. This is the conditional probability of A given B, written as P(A|B) and defined by the relation

\[ P(A|B) \equiv \frac{P(AB)}{P(B)} . \qquad (1.2.1) \]

From a frequentist point of view, the foregoing expression may be interpreted as the
ratio (theoretically, in the limit of an infinitely large number of trials; practically,
for a reasonably large number of trials) of the number of events in which A and
B occur together to the number of events in which B occurred irrespective of the
occurrence of A.
It is common symbolism to represent the non-occurrence of an event by an overbar; thus Ā represents all outcomes that do not include event A. From the foregoing considerations, therefore, we can succinctly express two fundamental rules of conditional probability:


\[ \text{inclusivity:} \qquad P(A|B) + P(\bar{A}|B) = 1 , \qquad (1.2.2) \]
\[ \text{Bayes' theorem:} \qquad P(B|A) = \frac{P(A|B)\,P(B)}{P(A)} . \qquad (1.2.3) \]

The first rule (1.2.2) signifies that, after B occurs, A either occurs or it does not; those are the two mutually exclusive outcomes that exhaust all possibilities. Note that it is not generally true that $P(A|B) + P(A|\bar{B}) = 1$. Rather, given P(A|B) and Bayes' theorem, it is demonstrable that

\[ P(A|B) + P(A|\bar{B}) = \frac{P(A) + P(A|B) - 2P(AB)}{1 - P(B)} , \qquad (1.2.4) \]

as shown in an appendix.
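Because Eqs. (1.2.2) and (1.2.4) involve only elementary arithmetic, they are easy to confirm numerically. The short Python sketch below is my own illustration (not from the text), with an arbitrarily chosen joint distribution of A and B:

# Sketch (illustrative): check Eq. (1.2.2) and Eq. (1.2.4) on an arbitrary joint distribution.
p_AB, p_AnotB, p_notAB, p_notAnotB = 0.10, 0.30, 0.20, 0.40   # must sum to 1

p_A = p_AB + p_AnotB                      # P(A)
p_B = p_AB + p_notAB                      # P(B)
p_A_given_B = p_AB / p_B                  # P(A|B)
p_notA_given_B = p_notAB / p_B            # P(Abar|B)
p_A_given_notB = p_AnotB / (1 - p_B)      # P(A|Bbar)

print(p_A_given_B + p_notA_given_B)                    # 1.0, Eq. (1.2.2)
print(p_A_given_B + p_A_given_notB)                    # about 0.762, not 1
print((p_A + p_A_given_B - 2 * p_AB) / (1 - p_B))      # same 0.762, Eq. (1.2.4)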


The second rule (1.2.3), although called Bayes' theorem, is a logical consequence of the laws of probability accepted by frequentists and Bayesians alike. It is regularly used in the sciences to relate P(H|D), the probability of a particular hypothesis or model, given known data, to P(D|H), the more readily calculable probability that a process of interest produces the known data, given the adoption of a particular hypothesis. In this way, Bayes' theorem is the basis for scientific inference, used to test or compare different explanations of some phenomenon.
The parts of Eq. (1.2.3), relabeled as

\[ P(H|D) = \frac{P(D|H)\,P(H)}{P(D)} , \qquad (1.2.5) \]

are traditionally identified as follows. P(H) is the prior probability; it is what one believes about hypothesis H before doing an experiment or making observations to acquire more information. P(D|H) is the likelihood function of the hypothesis H. P(H|D) is the posterior probability. The flow of terms from right to left is a mathematical representation of how science progresses. Thus, by doing another experiment to acquire more data (let us refer to the outcomes of the two experiments as D1 and D2) one obtains the chain of inferences

\[ P(H|D_2 D_1) = \frac{P(D_2|D_1 H)\,P(D_1|H)\,P(H)}{P(D_2 D_1)} \qquad (1.2.6) \]

with the new posterior on the left and the sequential acquisition of information shown on the right.
As an example, consider the problem of inferring whether a coin is two-headed (i.e. biased) or fair without being able to examine it, i.e. to decide only by means of the outcomes of tosses. Before any experiment is done, it is reasonable to assign a probability of 1/2 to both hypotheses: (a) H0, the coin is fair; (b) H1, the coin is biased. Thus

\[ \text{ratio of priors:} \qquad \frac{P(H_0)}{P(H_1)} = 1 . \]

Suppose the outcome of the first toss is a head h. Then the posterior relative probability becomes

\[ \text{first toss:} \qquad \frac{P(H_0|h)}{P(H_1|h)} = \frac{P(h|H_0)\,P(H_0)}{P(h|H_1)\,P(H_1)} = \frac{\left(\tfrac12\right)\left(\tfrac12\right)}{(1)\left(\tfrac12\right)} = \frac{1}{2} . \]

Let the outcome of the second toss also be h. Assuming the tosses to be independent of one another, we then have

\[ \text{second toss:} \qquad \frac{P(H_0|h_2, h_1)}{P(H_1|h_2, h_1)} = \frac{P(h_2|h_1, H_0)\,P(h_1|H_0)\,P(H_0)}{P(h_2|h_1, H_1)\,P(h_1|H_1)\,P(H_1)} = \frac{\left(\tfrac12\right)\left(\tfrac12\right)\left(\tfrac12\right)}{(1)(1)\left(\tfrac12\right)} = \frac{1}{4} . \]

It is evident, then, that the ratio of posteriors following n consecutive tosses resulting in h would be

\[ \text{nth toss:} \qquad \frac{P(H_0|h_n \ldots h_1)}{P(H_1|h_n \ldots h_1)} = \frac{1}{2^n} . \]

Thus, although without direct examination one could not say with 100% certainty
that the coin was biased, it would be a good bet (odds of H0 over H1: 1:4096) if
12 tosses led to straight heads.
It is important to note, however, that unlikely events can and do occur. No law of
physics prevents a random process from leading to 12 straight heads. Indeed, the
larger the number of trials, the more probable it will be that a succession of heads of
any specified length will eventually turn up. In the nuclear decay experiments we
consider later in the book, the equivalent of 20 h in a row occurred.
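The recursion behind the nth-toss result is easily coded. The following Python sketch is my own illustration (the function name is arbitrary) and reproduces the 1:4096 odds quoted above for 12 consecutive heads:

# Sketch (illustrative): posterior odds P(H0|data)/P(H1|data) after consecutive heads.
def posterior_odds(n_heads, prior_odds=1.0):
    odds = prior_odds
    for _ in range(n_heads):
        odds *= 0.5            # likelihood ratio per head: P(h|H0)/P(h|H1) = (1/2)/1
    return odds

print(posterior_odds(1))       # 0.5, the first-toss result
print(posterior_odds(2))       # 0.25, the second-toss result
print(posterior_odds(12))      # 1/4096, about 0.000244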
The probability of an outcome can be highly counter-intuitive if thought about in the wrong way. Consider a different application of Bayes' theorem. Suppose the probability of being infected with a particular disease is 5 in 1000 and your diagnostic test comes back positive. This test is not 100% reliable, however, but let us say that it registers accurately in 95% of the trials. By that I mean that it registers positive (+) if a person is sick (s) and negative (−) if a person is not sick (s̄). What is the probability that you are sick?

From the given information and the rules of probability, we have the following numerical assignments.

Probability of infection: P(s) = 0.005
Probability of no infection: P(s̄) = 0.995
Probability of correct positive: P(+|s) = 0.95
Probability of false negative: P(−|s) = 1 − P(+|s) = 0.05
Probability of correct negative: P(−|s̄) = 0.95
Probability of false positive: P(+|s̄) = 1 − P(−|s̄) = 0.05.

Then from Bayes' theorem it follows that the probability of being sick, given a positive test, is

\[ P(s|+) = \frac{P(+|s)\,P(s)}{P(+|s)\,P(s) + P(+|\bar{s})\,P(\bar{s})} = \frac{(0.95)(0.005)}{(0.95)(0.005) + (0.05)(0.995)} = 0.087 \]

or 8.7%, which is considerably less worrisome than one might have anticipated on the basis of the high reliability of the test. Bayes' theorem, however, takes account as well of the low incidence of infection.
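Written out as a short Python sketch of my own (variable names are arbitrary), the same arithmetic makes explicit how the low prevalence dominates the result:

# Sketch (illustrative): Bayes' theorem for the diagnostic-test example.
p_sick = 0.005            # P(s), prevalence of 5 in 1000
p_pos_given_sick = 0.95   # P(+|s)
p_pos_given_well = 0.05   # P(+|not s), the false-positive rate

p_pos = p_pos_given_sick * p_sick + p_pos_given_well * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(round(p_sick_given_pos, 3))    # 0.087, i.e. about 8.7%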
1.3 Probability density function and moments
In the investigation of stochastic² (i.e. random) processes, the physical quantity being measured or counted is often represented mathematically by a random variable.

² The word "stochastic" derives from a Greek root for "to aim at," referring to a guess or conjecture.


A random variable is a quantity whose value at each observation is determined by a probability distribution. For example, the number of radioactive nuclei decaying within some specified time interval is a discrete random variable; the length of time between two successive decays is a continuous random variable. Once the probability distribution is known (or at least approximated) the probability for any outcome (or combination of outcomes) can be calculated, as well as any statistical moments (provided they exist).

If we let X stand for a discrete random variable whose set of realizable values $\{x_i,\ i = 1, 2, \ldots N\}$ are the possible outcomes to an experiment with corresponding probability distribution $\{p_i\}$, then the probability that the experiment leads to some outcome in the set is the normalization or completeness requirement $P = \sum_{i=1}^{N} p_i = 1$.

The average, i.e. mean value, of some function of the outcomes, f(X), is expressed symbolically by angular brackets

\[ \langle f(X) \rangle = \sum_{i=1}^{N} f(x_i)\, p_i . \qquad (1.3.1) \]

Thus the nth moment of the distribution of X is defined to be

\[ \mu_n \equiv \langle X^n \rangle = \sum_{i=1}^{N} x_i^n\, p_i . \qquad (1.3.2) \]

Several particularly significant moments or combinations of moments include:

mean:
\[ \mu_X \equiv \mu_1 = \langle X \rangle = \sum_{i=1}^{N} x_i\, p_i , \qquad (1.3.3) \]

variance:
\[ \operatorname{var} X \equiv \sigma_X^2 = \left\langle (X - \mu_X)^2 \right\rangle = \mu_2 - \mu_1^2 , \qquad (1.3.4) \]
from which the standard deviation $\sigma_X$ is calculated. We also have

skewness:
\[ \mathrm{Sk}_X = \frac{\left\langle (X - \mu_X)^3 \right\rangle}{\sigma_X^3} = \frac{\mu_3 - 3\mu_2\mu_1 + 2\mu_1^3}{\sigma_X^3} , \qquad (1.3.5) \]
which is a measure of the asymmetry of a probability distribution about its center, and

kurtosis:
\[ K_X = \frac{\left\langle (X - \mu_X)^4 \right\rangle}{\sigma_X^4} = \frac{\mu_4 - 4\mu_3\mu_1 + 6\mu_2\mu_1^2 - 3\mu_1^4}{\sigma_X^4} , \qquad (1.3.6) \]
which is a measure of the degree of flatness of a distribution near its center. It is ordinarily not necessary to go beyond the fourth moment in applying statistics to experimental distributions.
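For a distribution specified numerically, the definitions (1.3.3)–(1.3.6) translate directly into code. The Python sketch below is my own illustration, applied to a fair six-sided die:

# Sketch (illustrative): mean, variance, skewness, and kurtosis of a discrete distribution.
def moments(values, probs):
    mean = sum(x * p for x, p in zip(values, probs))
    var = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
    sd = var ** 0.5
    skew = sum((x - mean) ** 3 * p for x, p in zip(values, probs)) / sd ** 3
    kurt = sum((x - mean) ** 4 * p for x, p in zip(values, probs)) / sd ** 4
    return mean, var, skew, kurt

vals = [1, 2, 3, 4, 5, 6]          # outcomes of a fair die
probs = [1 / 6] * 6                # equal probabilities
print(moments(vals, probs))        # mean 3.5, variance 35/12, skewness 0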

With regard to notation, the subscript X designating the random variable of interest may be omitted from the symbols for statistical functions where no confusion results.
To a continuous random variable X is associated a probability density function (pdf) p(x), such that the probability that X lies within the range (x, x + dx) is p(x)dx. The normalization requirement and moments of X are now given by integrals rather than sums:

\[ \int_{-\infty}^{\infty} p(x)\,dx = 1 \qquad\qquad m_n = \int_{-\infty}^{\infty} x^n\, p(x)\,dx . \qquad (1.3.7) \]

The range of integration can always be taken to span the full real axis by requiring, if necessary, the pdf to vanish for specific segments. Thus, if X is a non-negative-valued random variable, then one defines p(x) = 0 for x < 0.
The cumulative distribution function (cdf) F(x), sometimes referred to simply as "the distribution", is the probability Pr(X ≤ x), which, geometrically, is the area under the plot of the pdf up to the point x:

\[ \Pr(X \le x) \equiv F(x) = \int_{-\infty}^{x} p(x')\,dx' . \qquad (1.3.8) \]

It therefore follows by use of Leibnitz's equation from elementary calculus

\[ \frac{d}{dx} \int_{a(x)}^{b(x)} F(x, y)\,dy = F(x, b)\,\frac{db}{dx} - F(x, a)\,\frac{da}{dx} + \int_{a(x)}^{b(x)} \frac{\partial F(x, y)}{\partial x}\,dy \qquad (1.3.9) \]

that differentiation of the cdf yields the pdf: p(x) = dF/dx. This is a practical way to obtain the pdf, as we shall see later, under circumstances where it is easier to determine the cdf directly.
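A minimal sketch of my own (requiring the sympy library; the chosen cdf is merely illustrative) shows the relation p(x) = dF/dx and the normalization of Eq. (1.3.7) in action:

# Sketch (illustrative, requires sympy): obtain a pdf by differentiating a cdf.
import sympy as sp

x = sp.symbols('x', positive=True)
F = 1 - sp.exp(-x)                       # an illustrative cdf on x >= 0
p = sp.diff(F, x)                        # pdf from p(x) = dF/dx
print(p)                                 # exp(-x)
print(sp.integrate(p, (x, 0, sp.oo)))    # 1, the normalization of Eq. (1.3.7)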

1.4 The binomial distribution: bits [Bin(1, p)] and pieces [Bin(n, p)]
The binomial distribution, designated Bin(n, p), is perhaps the most widely encountered discrete distribution in physics, and it plays an important role in the research
described in this book. Consider a binomial random variable X with two outcomes
per trial:

\[ X = \begin{cases} \text{success} \equiv 1 & \text{with probability } p \\ \text{failure} \equiv 0 & \text{with probability } q = 1 - p. \end{cases} \qquad (1.4.1) \]
The number of distinct ways of getting k successes in n independent trials, which is represented by the random variable $Y = X_1 + X_2 + \cdots + X_n$, where each subscript labels a trial, is the coefficient of $p^k q^{n-k}$ in the binomial expansion

\[ (p + q)^n = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} \]

with combinatorial coefficient $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$. Thus, the binomial probability function can be written in the form

\[ P(x|n, p) = \binom{n}{x} p^x q^{n-x} \qquad (n \ge x \ge 0), \qquad (1.4.2) \]

which shows explicitly the two parameters of the distribution. It is then straightforward, albeit somewhat tedious, to calculate from (1.3.2) the statistical quantities

\[ \mu = np \qquad \operatorname{var} = npq \qquad \mathrm{Sk} = \frac{q - p}{\sqrt{npq}} \qquad K = \frac{3(n-2)pq + 1}{npq} \qquad (1.4.3) \]

and others as needed. If the probability of obtaining either outcome is the same, p = q = 1/2, the distribution is symmetric and the skewness vanishes. For p < q the skewness is positive, which means the distribution skews to the right as shown in Figure 1.1. In the limit of infinitely large n, the kurtosis approaches 3, which is the value for the standard normal distribution (to be considered shortly). A distribution with high kurtosis is more sharply peaked than one with low kurtosis; the tails are fatter (in statistical parlance), signifying a higher probability of occurrence of outlying events.

[Fig. 1.1 Probability of x successes out of n trials for the binomial distribution (solid) Bin(n, p) = Bin(60, 0.1) and the corresponding approximate normal distribution (dotted) N(μ, σ²) = N(6, 5.4).]

In calculating statistical moments with the binomial probability function, the trick to performing the ensuing summations is to transform them into operations on the binomial expression $(p + q)^n$ whose numerical value is 1. For illustration, consider the steps in calculation of the mean

\[ \langle X \rangle = \sum_{x=0}^{n} x \binom{n}{x} p^x q^{n-x} = p\,\frac{d}{dp} \sum_{x=0}^{n} \binom{n}{x} p^x q^{n-x} = p\,\frac{d}{dp}\,(p+q)^n = np\,(p+q)^{n-1} \;\stackrel{q\,=\,1-p}{\longrightarrow}\; np \]

where only in the final step does one actually substitute the value of the sum: p + q = 1. For higher moments, one applies $p\,\frac{d}{dp}$ the requisite number of times. There is a more convenient way to achieve the same goal (with additional advantages as well) by means of a generating function, which will be introduced shortly.
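The moment formulas (1.4.3) are easy to verify by direct summation. The Python sketch below is my own check for Bin(60, 0.1), the case plotted in Figure 1.1:

# Sketch (illustrative): compare brute-force binomial moments with Eq. (1.4.3).
from math import comb, sqrt

n, p = 60, 0.1
q = 1 - p
probs = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]

mean = sum(x * P for x, P in zip(range(n + 1), probs))
var = sum((x - mean) ** 2 * P for x, P in zip(range(n + 1), probs))
skew = sum((x - mean) ** 3 * P for x, P in zip(range(n + 1), probs)) / var ** 1.5

print(mean, var, skew)                                   # 6.0, 5.4, about 0.344
print(n * p, n * p * q, (q - p) / sqrt(n * p * q))       # the same values from (1.4.3)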

1.5 The Poisson distribution: counting the improbable


The Poisson distribution, symbolized by Poi(μ), is perhaps second on the list of most widely encountered discrete distributions in physics. It is the distribution that one virtually always thinks of in connection with counting particles from disintegrating nuclei or photons from radiating atoms. More generally, it characterizes the statistics of phenomena whereby the probability of an occurrence is very low, but the number of trials is very large. Seen in that light, the Poisson distribution is a special case of the binomial distribution, and one can derive the probability function of a Poisson random variable X

\[ P(x|\mu) = e^{-\mu}\,\frac{\mu^x}{x!} \qquad (x = 0, 1, 2, \ldots) \qquad (1.5.1) \]

directly from P(x|n, p) by appropriately taking limits p → 0 and n → ∞ such that the mean μ = np remains constant. This is a tedious calculation, and a more efficient way is again afforded by use of a generating function.
is again afforded by use of a generating function.
The moments of the Poisson distribution are calculable from relation (1.3.2) with substitution of probability function (1.5.1). The sums are completed by the same device employed in the previous section, except that now one operates with $\mu \frac{d}{d\mu}$ on the expression $\sum_{x=0}^{\infty} \frac{\mu^x}{x!}\, e^{-\mu}$. For example, consider the first and second moments

\[ \langle X \rangle = \sum_{x=0}^{\infty} x\,\frac{\mu^x}{x!}\,e^{-\mu} = e^{-\mu}\,\mu\,\frac{d}{d\mu}\sum_{x=0}^{\infty}\frac{\mu^x}{x!} = e^{-\mu}\,\mu\,e^{\mu} = \mu \]

\[ \left\langle X^2 \right\rangle = \sum_{x=0}^{\infty} x^2\,\frac{\mu^x}{x!}\,e^{-\mu} = e^{-\mu}\left(\mu\,\frac{d}{d\mu}\right)^{2}\sum_{x=0}^{\infty}\frac{\mu^x}{x!} = e^{-\mu}\,\mu\,\frac{d}{d\mu}\left(\mu\,e^{\mu}\right) = \mu^2 + \mu \]

from which follows the equality

\[ \langle X \rangle = \operatorname{var} X = \mu , \qquad (1.5.2) \]

which is a characteristic feature of the Poisson distribution. By analogous manipulations one obtains the skewness and kurtosis

\[ \mathrm{Sk} = \mu^{-1/2} \qquad\qquad K = 3 + \frac{1}{\mu} . \qquad (1.5.3) \]

Since μ is never negative in a Poisson distribution (physically, it is a distribution of counted objects), Sk is also seen to be a non-negative function and therefore the Poisson distribution always skews to the right. Also, since K > 3, the distribution is more sharply peaked and has fatter tails than a standard normal distribution. The above two expressions suggest, however, that as the mean gets larger, the Poisson distribution approaches the shape of the normal distribution. That this is indeed the case will be shown more rigorously by means of generating functions.
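A direct numerical check of Eqs. (1.5.2) and (1.5.3), again my own sketch rather than the author's, takes only a few lines:

# Sketch (illustrative): mean = variance and Sk = mu**(-1/2) for a Poisson distribution.
from math import exp, factorial, sqrt

mu = 4.0
probs = [exp(-mu) * mu**x / factorial(x) for x in range(200)]   # tail beyond x = 200 is negligible

mean = sum(x * p for x, p in enumerate(probs))
var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
skew = sum((x - mean) ** 3 * p for x, p in enumerate(probs)) / var ** 1.5

print(mean, var)             # both equal to mu = 4
print(skew, 1 / sqrt(mu))    # both equal to 0.5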

1.6 The multinomial distribution: histograms


The multinomial distribution is a generalization of the binomial distribution. It is the
theoretical basis for a histogram: the graphical representation of counted or measured data sorted into categories (called classes) of specified value. Consider a random
variable X representing the result of an experiment (i.e. single trial) with a multiplicity
r of possible outcomes $\{x_i,\ i = 1 \ldots r\}$ with corresponding probabilities $\{p_i\}$. Then the probability that in n trials the outcome $x_i$ will occur $n_i$ times is obtained from expansion of the nth power of a multinomial form $(p_1 + p_2 + \cdots + p_r)^n$, which leads to the expression

\[ P(n_1, n_2, \ldots n_r\,|\,n;\ p_1, p_2, \ldots p_r) \equiv P(\{n_i\}|n;\{p_i\}) = \binom{n}{n_1 \ldots n_r}\, p_1^{n_1} p_2^{n_2} \cdots p_r^{n_r} = n! \prod_{i=1}^{r} \frac{p_i^{n_i}}{n_i!} . \qquad (1.6.1) \]
The two-tiered symbol

\[ \binom{n}{n_1 \ldots n_r} \equiv \frac{n!}{\prod_{i=1}^{r} n_i!} \qquad \text{with} \qquad \sum_{i=1}^{r} n_i = n \qquad (1.6.2) \]

defined above is the multinomial combinatorial coefficient.


The form of P(fnigjn;fpig) may be understood in the following way, which
is a generalization of the way one would deduce the binomial probability
distribution.
n
 The probability that ni independent events of type xi occur is pi i .
 Thus, the probability that a particular sequence of n1 x1s, n2 x2s, . . . nr xrs occurs
is pn11 pn22 . . . pnr r since all trials are independent
of one

 another.
n
different ways.
 However, this sequence could occur in
n 1 . . . nr
It is useful to demonstrate this combinatorial statement since the multinomial distribution enters significantly (in the form of a histogram) in all the experimental
investigations to be discussed in the book.
The number of ways one can partition a set of size n into r ordered subsets such that the first has size $n_1$, the second has size $n_2$, etc., and where $n_1 + n_2 + \cdots + n_r = n$, is the product

\[ \Omega(n_1, n_2, \ldots n_r\,|\,n) = \binom{n}{n_1}\binom{n - n_1}{n_2}\binom{n - n_1 - n_2}{n_3}\cdots\binom{n - n_1 - n_2 - \cdots - n_{r-1}}{n_r} . \qquad (1.6.3) \]

(The symbol Ω is often used to represent multiplicity in statistical physics.) Note, however, that the first two factors can be reduced in the following way

\[ \binom{n}{n_1}\binom{n - n_1}{n_2} = \frac{n!}{n_1!\,(n - n_1)!}\;\frac{(n - n_1)!}{n_2!\,(n - n_1 - n_2)!} = \frac{n!}{n_1!\,n_2!\,(n - n_1 - n_2)!} . \qquad (1.6.4) \]

This pattern carries through for all subsequent factors, and by induction one obtains

\[ \Omega(n_1, n_2, \ldots n_r\,|\,n) = \binom{n}{n_1 \ldots n_r} = \frac{n!}{n_1!\,n_2! \cdots n_r!} . \qquad (1.6.5) \]

As an illustration useful to the discussion of histograms later, consider a game in which two dice are tossed simultaneously. Each die has six faces with outcomes $x_i = i$ (i = 1, 2, ... 6). The outcomes of two dice are then $y_i = i$ (i = 2, 3, ... 12). What is the probability of each outcome $y_i$, assuming the dice to be unbiased? Since there are 6 × 6 = 36 possible outcomes, the probability that a toss of two dice yields a particular value of y is the ratio of the number of ways to achieve y, i.e. the multiplicity Ω(y), to the overall multiplicity Ω: P(y_i) = Ω(y_i)/Ω. By direct counting, we obtain Table 1.1.

Table 1.1  Distribution of outcomes of two dice

  y_i   (x1, x2)                                          Ω(y_i)   P(y_i) = Ω(y_i)/Ω
   2    (1,1)                                                1      1/36
   3    (1,2), (2,1)                                         2      2/36 = 1/18
   4    (1,3), (3,1), (2,2)                                  3      3/36 = 1/12
   5    (1,4), (4,1), (3,2), (2,3)                           4      4/36 = 1/9
   6    (1,5), (5,1), (2,4), (4,2), (3,3)                    5      5/36
   7    (1,6), (6,1), (2,5), (5,2), (3,4), (4,3)             6      6/36 = 1/6
   8    (2,6), (6,2), (3,5), (5,3), (4,4)                    5      5/36
   9    (3,6), (6,3), (4,5), (5,4)                           4      4/36 = 1/9
  10    (4,6), (6,4), (5,5)                                  3      3/36 = 1/12
  11    (5,6), (6,5)                                         2      2/36 = 1/18
  12    (6,6)                                                1      1/36
 Total                                                  Ω = 36      Σ P(y_i) = 1

If we were to cast the two dice 100 times, what would be the expected outcome in each category defined by the value $y_i$, and what fluctuations about the expected values would be considered reasonable? We would therefore want to know the theoretical means and variances in order to ascertain whether the dice were in fact unbiased. To determine means, variances and other statistics directly from a multinomial probability function is cumbersome; we will do this rigorously and efficiently by an alternative procedure later. However, a simple and intuitive way to answer the two questions is to recognize that each y-category in Table 1.1 may for the purposes of these questions be considered as the outcome of a binomial random variable because the result of a toss either falls into a specific category $y_i$ or it does not. Thus, we deduce from relations (1.4.3) that the mean frequency of occurrence and variance of each category can be expressed as

\[ \langle n_i \rangle = n\,P(y_i) \qquad\qquad \sigma_{n_i}^2 = n\,P(y_i)\,\bigl(1 - P(y_i)\bigr) , \qquad (1.6.6) \]

as summarized in Table 1.2.

Table 1.2  Expected outcomes of 100 tosses of two unbiased dice

  y_i    ⟨n_i⟩    σ_{n_i}
   2      2.78     1.64
   3      5.56     2.29
   4      8.33     2.76
   5     11.11     3.14
   6     13.89     3.46
   7     16.67     3.73
   8     13.89     3.46
   9     11.11     3.14
  10      8.33     2.76
  11      5.56     2.29
  12      2.78     1.64
 Total  100.00
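As a cross-check of my own (not part of the text), both tables follow from a few lines of Python that count multiplicities and apply Eq. (1.6.6) for n = 100 tosses:

# Sketch (illustrative): regenerate Tables 1.1 and 1.2 for two unbiased dice.
from collections import Counter
from math import sqrt

multiplicity = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
omega_total = sum(multiplicity.values())      # overall multiplicity, 36

n = 100
for y in range(2, 13):
    p = multiplicity[y] / omega_total         # P(y_i) = Omega(y_i)/Omega
    mean = n * p                              # <n_i> of Eq. (1.6.6)
    sigma = sqrt(n * p * (1 - p))             # sigma of Eq. (1.6.6)
    print(y, multiplicity[y], round(mean, 2), round(sigma, 2))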


A plot of the frequency of outcomes (theoretical or observed) of this hypothetical
experiment with two dice as a function of class constitutes a histogram. To know
whether a set of observed frequencies is in accord or not with the expected values can
be ascertained through various statistical tests to be described later in conjunction
with actual experiments.
It is to be noted that the frequencies in a multinomial distribution are not all
independent because they must sum to the fixed number n of trials. Thus, one would
expect an anti-correlation (or negative correlation) between any pair of frequencies
since an increase in one must result on average in a decrease in the other. How such
correlations are to be calculated will also be taken up shortly.
Let us turn next to several continuous distributions of wide usage in physics.
1.7 The Gaussian distribution: measure of normality
The Gaussian or normal distribution, symbolically designated N(μ, σ²), is quite likely the most widely encountered distribution employed in the service of science,

engineering, economics, and any other field of study where random phenomena are
involved. The principal underlying reason for this (not always justified in the application) is the mathematical proposition known as the Central Limit Theorem (CLT), which shows the normal distribution to be the limiting form of numerous other probability distributions used to model the behavior of random phenomena.

In particular, the normal distribution is most often employed as the "law of errors", i.e. the distribution of fluctuations in some measured quantity about its
mean. It has been written in jest (perhaps) that physicists believe in the law of
errors because they think mathematicians have proved it, and that mathematicians
believe in the law of errors because they think physicists have established it experimentally. There is some truth to the first assertion in that the Gaussian distribution
emerges from a general principle of reasoning (referred to as the principle of
maximum entropy) which addresses the question: Given certain information about
a random process, what probability distribution describes the process in the most
unbiased (i.e. least speculative) way? We will examine this question later. Suffice it
to say at this point that the normal distribution does indeed apply widely, but,
when it does not, one can be led astray with disastrous consequences by drawing
conclusions from it.
The Gaussian distribution of a continuous random variable X whose values span
the real axis takes the form
\[ P(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2} \qquad (-\infty \le x \le \infty). \qquad (1.7.1) \]

By evaluating the moments of X one can show (after a not insignificant amount of labor) that the parameters μ and σ² are respectively the mean and variance. From the symmetry of P(x|μ, σ) about the mean, it follows that the skewness is identically zero. Evaluation of the fourth moment leads to a kurtosis of 3.
One can transform any Gaussian distribution to standard normal form N(0, 1) by defining the new dimensionless random variable Z = (X − μ)/σ. The cumulative distribution function (often represented by Φ) then takes the form

\[ \Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-u^2/2}\,du , \qquad (1.7.2) \]

which is related to the error function

\[ \operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_{0}^{z} e^{-u^2}\,du \qquad (1.7.3) \]

in the following way

\[ \Phi(z) - \Phi(-z) = \operatorname{erf}\!\left(\frac{z}{\sqrt{2}}\right) . \qquad (1.7.4) \]
As an academic physicist I am regularly asked by students whether I grade "on a curve". However, few students actually understand what grading on a curve means. The curve is the bell-shaped standard normal pdf, and to grade on it, strictly speaking, means to partition the area under the curve into four segments (z ≥ 1), (1 > z ≥ 0), (0 > z ≥ −1), (−1 > z), such that the passing grades (A, B, C, D) will have (approximate) relative frequencies of 15%, 35%, 35%, 15%. For example, if I assign A to a student whose test score is X ≥ μ + σ, then

\[ \Pr\!\left(\frac{x - \mu}{\sigma} \ge 1\right) = \Pr(z \ge 1) = \frac{1}{\sqrt{2\pi}}\int_{1}^{\infty} e^{-u^2/2}\,du = 0.159 . \]

Thus, if test scores were normally distributed, I would expect about 15% of the class to receive a grade of A. Such an assumption might hold for a class of large enrollment (perhaps 50 or more), but not for small-enrollment classes. If I graded on a curve in an advanced physics class of six bright students, there would be one A, two Bs, two Cs, one D and a great deal of dissatisfaction.
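The four relative frequencies quoted above follow from Eq. (1.7.2); the Python sketch below (my own, using the identity Φ(z) = [1 + erf(z/√2)]/2 implied by Eq. (1.7.4)) reproduces them:

# Sketch (illustrative): grading-on-a-curve segment probabilities from the standard normal cdf.
from math import erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))   # standard normal cdf, Eq. (1.7.2)

print(1 - Phi(1))         # Pr(z >= 1), about 0.159 (grade A)
print(Phi(1) - Phi(0))    # Pr(0 <= z < 1), about 0.341 (grade B)
print(Phi(0) - Phi(-1))   # Pr(-1 <= z < 0), about 0.341 (grade C)
print(Phi(-1))            # Pr(z < -1), about 0.159 (grade D)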

1.8 The exponential distribution: Waiting for Godot


The negative exponential distribution, symbolized by E(λ), is interpretable as a distribution of waiting times between occurrences of random events, although it appears in other contexts in physics as we shall see. If X is a random variable whose realizations span the positive real axis, then the exponential pdf takes the form

\[ P(x|\lambda) = \begin{cases} \lambda\, e^{-\lambda x} & x \ge 0 \\ 0 & x < 0. \end{cases} \qquad (1.8.1) \]
Using the pdf to calculate the moments of X, one can show that $\langle X^n \rangle \equiv \mu_n = n!/\lambda^n$, from which follow the statistics

\[ \mu = 1/\lambda \qquad \sigma^2 = 1/\lambda^2 \qquad \mathrm{Sk} = 2 \qquad K = 9 . \qquad (1.8.2) \]

The significance of the parameter λ is seen to be the inverse of the mean waiting time, which is equivalent to a frequency or rate. Though continuous, the exponential distribution has a direct connection to the discrete Poisson distribution in which the same parameter represents the intrinsic decay rate of a system. For example, if the number of occurrences of some phenomenon in a fixed window of observation time t is described by a Poisson distribution with parameter λt, then the probability that 0 events will be observed in that time interval is $P_{\mathrm{Poi}}(0|\lambda t) = e^{-\lambda t}$, and therefore the probability that at least 1 event will be observed in the time interval is the cumulative probability $F_{\mathrm{Poi}}(t) = \Pr(X \le t) = 1 - e^{-\lambda t}$. The derivative of $F_{\mathrm{Poi}}(t)$ with respect to time

\[ \frac{dF_{\mathrm{Poi}}(t)}{dt} = P_{\exp}(t|\lambda) = \lambda\, e^{-\lambda t} \qquad (1.8.3) \]

then gives the pdf of an exponential distribution of waiting times.


A significant attribute revealed by the variance of the exponential distribution
is that the fluctuation (~ ) about the mean is of the order of the size of the
signal (~ ) itself. This will be seen to have important experimental consequences
when we examine the physics of nuclear decay. The skewness and kurtosis
of the exponential distribution bear no resemblance at all to those of the
normal distribution and there is no limiting case in which the former reduce
to the latter.
Another attribute of considerable interest is that the exponential distribution is the
only continuous distribution with complete lack of memory. If the waiting times of
a sample of decaying particles are described by an exponential distribution, then in a
manner of speaking (to be understood statistically) the particles never get old so long
as they have not yet decayed. To see this, suppose the particles were all created at
time 0. Then the probability that there is no decay before time t is given by the
integral

\[ \Pr(X > t) = \int_{t}^{\infty} \lambda\, e^{-\lambda x}\,dx = e^{-\lambda t} . \qquad (1.8.4) \]

Now let us suppose that T units of time have passed, and we seek the conditional probability that there is no decay before time t + T given that there was no decay before time T

\[ \Pr(X > t + T\,|\,X > T) = \frac{e^{-\lambda(t+T)}}{e^{-\lambda T}} = e^{-\lambda t} . \qquad (1.8.5) \]

The probability is the same independent of the passage of time following creation of the particles. Note, in obtaining the preceding result we used the definition of conditional probability: P(A|B) = P(AB)/P(B). As applied to the case of waiting times, the numerator P(AB) is the probability that the waiting time is longer than both t + T and T. But clearly if the first condition is satisfied, then the second must also be, and so in this case P(AB) = P(A).
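A quick Monte Carlo sketch of my own (rate and times chosen arbitrarily) confirms Eq. (1.8.5) numerically:

# Sketch (illustrative): memoryless property of exponential waiting times.
import random
from math import exp

random.seed(1)
lam, t, T = 2.0, 0.3, 1.0
samples = [random.expovariate(lam) for _ in range(1_000_000)]

p_uncond = sum(x > t for x in samples) / len(samples)          # Pr(X > t)
survivors = [x for x in samples if x > T]
p_cond = sum(x > t + T for x in survivors) / len(survivors)    # Pr(X > t + T | X > T)

print(p_uncond, p_cond, exp(-lam * t))    # all three close to 0.549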
The lack of memory displayed by the exponential distribution has a discrete counterpart in the geometric distribution $P_{\mathrm{geo}}(k|p) = p\,q^{k-1}$ in which an event occurs precisely at the kth trial (with probability p) after having failed to occur k − 1 times (with probability q = 1 − p). The probability of an eventual occurrence is 100%

\[ \Pr(X \ge 1) = \sum_{k=1}^{\infty} q^{k-1} p = p \sum_{k=0}^{\infty} q^k = \frac{p}{1 - q} = \frac{p}{p} = 1 , \qquad (1.8.6) \]

and the mean time between events is 1/p

\[ \langle k \rangle = \sum_{k=1}^{\infty} k\, q^{k-1} p = p\,\frac{d}{dq}\sum_{k=0}^{\infty} q^k = p\,\frac{d}{dq}\,(1 - q)^{-1} = \frac{p}{(1 - q)^2} = \frac{1}{p} , \qquad (1.8.7) \]

where use was made in both calculations of the Taylor series expansion

\[ \frac{1}{1 - x} = \sum_{k=0}^{\infty} x^k . \qquad (1.8.8) \]

There are other continuous distributions that play important roles in the physics
discussed in this book, but we will discuss them as they arise. Let us turn next to the
important topic of generating functions.

1.9 Moment-generating function


Probability is not a directly measurable quantity; there is no such thing as a probability meter. Most commonly, it is the moments of a distribution that are accessible by counting or measurement. Although the moments of a distribution can be calculated from a theoretical probability function or probability density by summation or integration, they can usually be determined far more simply by differentiating a moment-generating function (mgf). Taking derivatives is almost always more easily done than doing summations or integrals.
Besides the ease afforded in calculating moments, there are other advantages to working with an mgf. For one thing, the mgf of a probability distribution is unique because a distribution is uniquely characterized by all its moments. Thus, if you do not know initially how some random variable is distributed (which is frequently the case in statistical physics), but you can by some means establish that its mgf takes the same form as the mgf of a known probability distribution, then you can be certain that the unknown distribution is identical to the recognized one. A second advantage is that generating functions provide an efficient means of determining the statistics of linear superpositions, such as sums and differences, of independent random variables. Such superpositions of random variables occur frequently in physics since they may represent the outcome of a sequence of measurements or the difference of a signal and noise.
An occasional drawback to the use of a moment-generating function is that not every distribution has one. In those instances (or generally, as an alternative method) one can work with the characteristic function (cf), which is equivalent to a Fourier transform of the probability density function (pdf) for a continuous distribution, or with the probability generating function (pgf) for a discrete distribution.
The mgf of a random variable X, symbolized by $g_X(t)$, where t is a dummy variable eventually to be set equal to 0, is defined as the expectation of $e^{Xt}$. Thus, the mgf of a discrete or continuous random variable is calculated, respectively, from the relations

$$g_X(t) \equiv \left\langle e^{Xt}\right\rangle = \begin{cases} \displaystyle\sum_{x=a}^{b} e^{xt}\,p(x) & \text{discrete } X\\[2ex] \displaystyle\int e^{xt}\,p(x)\,dx & \text{continuous } X. \end{cases} \qquad (1.9.1)$$

For clarity, in anticipation of cases involving the generating functions of several random variables, the mgfs (and, when necessary, pdfs) will be labeled by a subscript showing explicitly to which random variable they refer. The origin of the term moment-generating function becomes evident by expanding the exponential in the angular brackets above

$$\left\langle e^{Xt}\right\rangle = \left\langle \sum_{n=0}^{\infty}\frac{(Xt)^n}{n!}\right\rangle = \sum_{n=0}^{\infty}\frac{\langle X^n\rangle\,t^n}{n!} = \sum_{n=0}^{\infty}\frac{\mu_n t^n}{n!}. \qquad (1.9.2)$$

The nth moment is then obtained by taking the nth derivative of $g_X(t)$ with respect to t and setting $t = 0$

$$\mu_n = \left.\frac{d^n g_X(t)}{dt^n}\right|_{t=0}. \qquad (1.9.3)$$

Note that the zeroth moment is just the completeness relation: $g_X(0) = 1$.
In statistical analysis, it is often the case that the moments about the mean are the quantities of interest. Moreover, in my experience, one rarely needs to go beyond the third or fourth moment. In such circumstances the natural log of the generating function is useful to work with because it follows from sequential differentiation that

$$\begin{aligned}
\left.\frac{d\ln g_X(t)}{dt}\right|_{t=0} &= \langle X\rangle \equiv \mu_X\\
\left.\frac{d^2\ln g_X(t)}{dt^2}\right|_{t=0} &= \left\langle X^2\right\rangle - \mu_X^2 \equiv \sigma_X^2\\
\left.\frac{d^3\ln g_X(t)}{dt^3}\right|_{t=0} &= \left\langle X^3\right\rangle - 3\mu_X\sigma_X^2 - \mu_X^3 \equiv \mathrm{Sk}_X.
\end{aligned} \qquad (1.9.4)$$
Regrettably, the progression does not extend to the fourth moment or beyond.
Nevertheless, the expansion of ln g(t) yields useful quantities referred to as the
cumulants of a distribution. We shall not need them in this book.
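Relations (1.9.3) and (1.9.4) are also easy to delegate to symbolic software. The following SymPy sketch is purely illustrative (the choice of SymPy and of the Poisson mgf of Eq. (1.12.1) as the test case is mine, not the text's); it recovers the mean and variance by differentiation.

```python
import sympy as sp

t, mu = sp.symbols('t mu', positive=True)
g = sp.exp(mu*(sp.exp(t) - 1))        # Poisson mgf, Eq. (1.12.1)

# raw moments from Eq. (1.9.3)
m1 = sp.diff(g, t, 1).subs(t, 0)
m2 = sp.diff(g, t, 2).subs(t, 0)
print(sp.simplify(m1))                # mean: mu
print(sp.simplify(m2 - m1**2))        # variance: mu

# same variance directly from the log of the mgf, Eq. (1.9.4)
print(sp.simplify(sp.diff(sp.log(g), t, 2).subs(t, 0)))
```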

1.10 Moment-generating function of a linear combination of variates


Once the mgfs of known types of random variables $\{X_i,\ i = 1\ldots n\}$ have been calculated, it is straightforward to calculate the mgf of a linear superposition of independent random variables composed of these types. Note that the constituents do not have to be identically distributed, just independent. Let $g_{X_i}(t)$ be the mgf of $X_i$.

Then the mgf of $S_n = \sum_{i=1}^{n} a_i X_i$, with constant coefficients $a_i$, is deduced by the chain of steps below

$$g_{S_n}(t) \equiv \left\langle e^{S_n t}\right\rangle = \left\langle e^{\left(\sum_{i=1}^{n} a_i X_i\right)t}\right\rangle = \left\langle \prod_{i=1}^{n} e^{a_i t X_i}\right\rangle = \prod_{i=1}^{n}\left\langle e^{a_i t X_i}\right\rangle = \prod_{i=1}^{n} g_{X_i}(a_i t) \;\xrightarrow{\ \mathrm{iid}\ }\; \left[g_X(at)\right]^n, \qquad (1.10.1)$$

where the third equality is permitted because the random variables are independent. Recall: if A and B are independent, then $\langle AB\rangle = \langle A\rangle\langle B\rangle$. The arrow above shows the reduction of $g_{S_n}(t)$ in the case of independent identically distributed (iid) random variables all combined with the same coefficient a.
Two widely occurring special cases are those involving the sum ($a_1 = a_2 = 1$) or difference ($a_1 = 1,\ a_2 = -1$) of two iid random variables, for which (1.10.1) yields

$$g_{X_1+X_2}(t) = g_X(t)^2 \qquad\qquad g_{X_1-X_2}(t) = g_X(t)\,g_X(-t). \qquad (1.10.2)$$

Another useful set of relations comes from evaluating the variance of the general linear superposition $S_n$ by differentiating $\ln g_{S_n}(t) = \sum_{i=1}^{n}\ln g_{X_i}(a_i t)$

$$\begin{aligned}
\left.\frac{d\ln g_{S_n}(t)}{dt}\right|_{t=0} &= \sum_{i=1}^{n}\left.\frac{a_i\,g'_{X_i}(a_i t)}{g_{X_i}(a_i t)}\right|_{t=0} &&\Rightarrow\; \mu_{S_n} = \sum_{i=1}^{n} a_i\mu_i\\
\left.\frac{d^2\ln g_{S_n}(t)}{dt^2}\right|_{t=0} &= \sum_{i=1}^{n} a_i^2\left.\left(\frac{g_{X_i}(a_i t)\,g''_{X_i}(a_i t) - \left[g'_{X_i}(a_i t)\right]^2}{\left[g_{X_i}(a_i t)\right]^2}\right)\right|_{t=0} &&\Rightarrow\; \sigma^2_{S_n} = \sum_{i=1}^{n} a_i^2\,\sigma^2_{X_i}.
\end{aligned} \qquad (1.10.3)$$

Another special case of particular utility is the equivalence relation for a normal variate X

$$N(\mu, \sigma^2) = \mu + \sigma N(0, 1), \qquad (1.10.4)$$

which will be demonstrated later in the chapter.
A situation may arise (I have encountered it often) in which the mgf of some random variable X is a fairly complicated function of its argument and therefore does not correspond to any of the tabulated forms of known distributions. A useful procedure in that case may be to expand the mgf in a Taylor series to obtain an expression of the form

$$g(t) = e^{\sum_{n=0}^{\infty} a_n t^n}, \qquad (1.10.5)$$

which is not to be confused with a structure like $\left\langle e^{\left(\sum_{i=1}^{n} a_i X_i\right)t}\right\rangle$ and does not necessarily correspond to a linear superposition of random variables. (For example, it may arise from nonlinear operations.) An examination of the first few sequential derivatives of (1.10.5)

$$\begin{aligned}
g^{(1)}\big|_0 &= a_1\\
g^{(2)}\big|_0 &= 2a_2 + a_1^2\\
g^{(3)}\big|_0 &= 6a_3 + 6a_1a_2 + a_1^3\\
g^{(4)}\big|_0 &= 24a_4 + 24a_1a_3 + 12a_2^2 + 12a_2a_1^2 + a_1^4\\
g^{(5)}\big|_0 &= 120a_5 + 120a_1a_4 + 120a_2a_3 + 60a_3a_1^2 + 60a_1a_2^2 + 20a_2a_1^3 + a_1^5
\end{aligned} \qquad (1.10.6)$$

reveals a pattern that suggests a systematic way of calculating the moments of the distribution (and subsequently an approximation to the pdf if so desired). The form of the nth derivative is n! times the sum over all partitions of the integer n weighted by a divisor k! for each term in the partition that occurs k times. A partition of a positive integer n is a set of positive integers that sum to n. We can represent a particular partition $n = \sum_{j=1}^{n} j\alpha_j$ by the notation $\{1^{\alpha_1}\,2^{\alpha_2}\,3^{\alpha_3}\cdots n^{\alpha_n}\}$.
Consider, for example, n = 3. There are three ways to satisfy the integer relation $k + 2l + 3m = 3$, namely

$$3 = 3 = 2 + 1 = 1 + 1 + 1 \;\Rightarrow\; \{3\},\ \{2,1\},\ \{1^3\},$$

which leads to the weighted sum $3!\left(a_3 + a_2a_1 + \dfrac{a_1^3}{3!}\right)$ for the entry $g^{(3)}\big|_0$ in (1.10.6).
There is a graphical technique to construct the partitions of an integer relatively quickly by means of diagrams known as Young's tableaux. Each term in a partition is represented by a horizontal row of square boxes of length equal to the term; the boxes are stacked vertically, starting with the longest row. Thus, considering again the three partitions of n = 3, we have the three diagrams

[Young's tableaux for the partitions (3), (2, 1), and (1³)]

The preceding ideas were drawn from the theory of symmetric groups,3 which tells us that the total number r(n) of partitions of an integer n is the coefficient of $x^n$ in the power series expansion of Euler's generating function

$$E(x) = \prod_{j=1}^{\infty}\left(1 - x^j\right)^{-1} = 1 + x + 2x^2 + 3x^3 + 5x^4 + 7x^5 + 11x^6 + \cdots. \qquad (1.10.7)$$

Examination of the first few terms verifies what could be easily determined by drawing the Young's tableaux. Should one need to know r(n) for large n, there is an asymptotic approximation derived by the renowned mathematicians G. H. Hardy and S. Ramanujan

$$r(n) \approx \frac{1}{4n\sqrt{3}}\,e^{\pi\sqrt{2n/3}}. \qquad (1.10.8)$$

3 J. S. Lomont, Applications of Finite Groups (Academic Press, New York, 1959) 258-261.
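Both the coefficients of (1.10.7) and the accuracy of (1.10.8) are easy to check by machine. The Python sketch below is illustrative only (the dynamic-programming expansion of Euler's product is a standard device, not something taken from the text).

```python
import math

def partition_numbers(nmax):
    """Coefficients r(0)..r(nmax) of Euler's generating function (1.10.7)."""
    r = [1] + [0]*nmax
    for j in range(1, nmax + 1):          # multiply in the factor 1/(1 - x^j)
        for n in range(j, nmax + 1):
            r[n] += r[n - j]
    return r

def hardy_ramanujan(n):
    """Asymptotic estimate (1.10.8)."""
    return math.exp(math.pi*math.sqrt(2*n/3)) / (4*n*math.sqrt(3))

r = partition_numbers(100)
print(r[1:7])                             # [1, 2, 3, 5, 7, 11], as in (1.10.7)
for n in (10, 50, 100):
    # the relative error of the estimate shrinks as n grows
    print(n, r[n], round(hardy_ramanujan(n)))
```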
1.11 Binomial moment-generating function
As an illustration, let us re-examine the binomial distribution (coin-toss problem) from the vantage of its mgf. Define a binary random variable X whose value is 1 with probability p if the outcome is a head h, or 0 with probability $q = 1 - p$ if the outcome is a tail t. Such a variable is termed a Bernoulli random variable, provided p remains constant for all trials. Then

$$g_X(t) = \left\langle e^{Xt}\right\rangle = pe^t + qe^0 = pe^t + q. \qquad (1.11.1)$$

If the coin is tossed n times, or n coins are tossed independently and simultaneously once, the outcome is describable by a random variable $Y = \sum_{i=1}^{n}X_i$ whose mgf follows immediately from relation (1.10.1)

$$g_Y(t) = \left(pe^t + q\right)^n. \qquad (1.11.2)$$

It is now a straightforward matter of taking derivatives either of the mgf or its natural log to confirm the previously given mean, variance, skewness, and kurtosis of the binomial distribution. For example:

$$\begin{aligned}
\langle Y\rangle &= \left.\frac{dg_Y}{dt}\right|_{t=0} = \left.npe^t\left(pe^t + q\right)^{n-1}\right|_{t=0} = np\\
\sigma_Y^2 &= \left.\frac{d^2\ln g_Y}{dt^2}\right|_{t=0} = \left.\left[npe^t\left(pe^t + q\right)^{-1} - n\left(pe^t\right)^2\left(pe^t + q\right)^{-2}\right]\right|_{t=0} = npq.
\end{aligned} \qquad (1.11.3)$$
After the third or fourth derivative, the procedure becomes tedious to do by hand,
but symbolic mathematical software (like Maple or Mathematica) can generate higher
moments nearly instantly.
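Open-source symbolic software serves just as well. As a minimal sketch (the choice of SymPy is mine, not the text's), the fragment below differentiates the binomial mgf (1.11.2) to produce the mean, the variance, and the third cumulant.

```python
import sympy as sp

t, p, n = sp.symbols('t p n', positive=True)
q = 1 - p
g = (p*sp.exp(t) + q)**n              # binomial mgf, Eq. (1.11.2)
lng = sp.log(g)

mean = sp.simplify(sp.diff(lng, t, 1).subs(t, 0))   # n*p
var  = sp.simplify(sp.diff(lng, t, 2).subs(t, 0))   # n*p*(1-p)
c3   = sp.simplify(sp.diff(lng, t, 3).subs(t, 0))   # third cumulant n*p*q*(1-2p)
print(mean, var, sp.factor(c3), sep='\n')
```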
Although we arrived at the binomial mgf by starting with probabilities p and q of the Bernoulli random variable X and then calculating the generating function for the composite random variable Y, we could equally well have begun with the binomial probability function (1.4.2) and calculated the expectation value directly:

$$g_Y(t) = \left\langle e^{Yt}\right\rangle = \sum_{y=0}^{n} e^{yt}\binom{n}{y}p^y q^{n-y} = \sum_{y=0}^{n}\binom{n}{y}\left(pe^t\right)^y q^{n-y} = \left(pe^t + q\right)^n. \qquad (1.11.4)$$
If, however, we already have the mgf from the procedure leading to (1.11.2), but do
not know the binomial probability function, we can derive it from the mgf by a
method to be demonstrated shortly.


A point worth noting about the procedure leading to Eq. (1.11.2) is that the sum of the elemental Bernoulli random variables (the Xs) produces a random variable Y which is also governed by a binomial distribution, or symbolically:

$$\underbrace{\mathrm{Bin}(1, p) + \cdots + \mathrm{Bin}(1, p)}_{n\ \mathrm{terms}} = \mathrm{Bin}(n, p).$$

From the mathematical form of the binomial mgf, one can see generally that the addition of independent random variables of type Bin(n, p) and Bin(m, p) generates a random variable of type Bin(n + m, p). There are relatively few distributions that have the property that a sum of two random variables of a particular kind produces a random variable of the same kind. Moreover, as is easily demonstrated, this property does not hold for the difference of two binomial random variables. If $Y = X_1 - X_2$, where the two variates are independent and of type Bin(n, p), then

$$g_Y(t) = \left(pe^t + q\right)^n\left(pe^{-t} + q\right)^n = \left[1 + 2pq(\cosh t - 1)\right]^n, \qquad (1.11.5)$$

in which the second equality was obtained after some algebraic manipulation employing the identity $p + q = 1$. The resulting mgf differs from that of a binomial random variable and, in fact, does not correspond to any of the standard types ordinarily tabulated in statistics books. Nevertheless, knowing the mgf, one can calculate from it all the moments of the difference of two independent binomial random variables of like kind. Although knowledge of the mgf affords a means to determine the probability function (and we shall examine shortly how to do this in the present case), it is better to proceed differently. We seek the probability $\Pr(X_1 - X_2 = z)$ that the difference is equal to some fixed value $-n \le z \le n$. This can be expressed by the suite of probability statements

$$\Pr(X_1 - X_2 = z) = \sum_{x_2=0}^{n}\Pr(X_1 = x_2 + z\,|\,X_2 = x_2)\,\Pr(X_2 = x_2) = \sum_{x_2=0}^{n}P_{\mathrm{Bin}}(x_2 + z)\,P_{\mathrm{Bin}}(x_2), \qquad (1.11.6)$$

where the second equality is permissible because $X_1$ and $X_2$ are independent. The symbol $P_{\mathrm{Bin}}(x)$ is an abbreviated representation of the complete probability function (1.4.2). It then follows upon substitution of the binomial probability functions that

$$\Pr(X_1 - X_2 = z) = \sum_{y=0}^{n}\binom{n}{y+z}p^{y+z}q^{n-(y+z)}\binom{n}{y}p^{y}q^{n-y} = \left(\frac{p}{q}\right)^{z}\sum_{y=0}^{n-z}\binom{n}{y+z}\binom{n}{y}\left(p^2\right)^{y}\left(q^2\right)^{n-y}. \qquad (1.11.7)$$

Note that the upper limit to the sum over the dummy index y must be $n - z$ since the first coefficient vanishes when its lower index exceeds the upper index. The expression in (1.11.7) can be reduced to closed form in terms of a hypergeometric function $_2F_1$



$$\Pr(X_1 - X_2 = z) = \binom{n}{z}p^{z}(1-p)^{2n-z}\;{}_2F_1\!\left(-n,\ z-n;\ z+1;\ \left(\frac{p}{1-p}\right)^{2}\right), \qquad (1.11.8)$$

but the derivation is beyond the intent of this chapter.4
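The distribution of the difference can nevertheless be tabulated directly, which provides a practical check on (1.11.6) and (1.11.7) without invoking the hypergeometric function. The Python sketch below is illustrative only (the parameters and the SciPy routines are my choices, not the text's); it convolves two binomial probability functions and compares the result with a Monte Carlo histogram.

```python
import numpy as np
from scipy.stats import binom

n, p = 20, 0.3
z = np.arange(-n, n + 1)

# Exact pmf of Z = X1 - X2 from Eq. (1.11.6): sum over x2 of P(x2+z) P(x2)
pmf_x = binom.pmf(np.arange(n + 1), n, p)
pmf_z = np.array([sum(pmf_x[x2 + zz]*pmf_x[x2]
                      for x2 in range(n + 1) if 0 <= x2 + zz <= n)
                  for zz in z])

# Monte Carlo check
rng = np.random.default_rng(0)
samples = rng.binomial(n, p, 500_000) - rng.binomial(n, p, 500_000)
freq = np.array([(samples == zz).mean() for zz in z])

print("max |exact - simulated| =", np.abs(pmf_z - freq).max())
print("exact mean, variance    :", (z*pmf_z).sum(), ((z**2)*pmf_z).sum() - (z*pmf_z).sum()**2)
print("theory: mean 0, var 2npq =", 2*n*p*(1-p))
```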

1.12 Poisson moment-generating function


The moment-generating function of a Poisson random variable X of mean value $\mu$ is also readily obtained

$$g_X(t) = \left\langle e^{Xt}\right\rangle = e^{-\mu}\sum_{x=0}^{\infty}\frac{e^{xt}\mu^x}{x!} = e^{-\mu}\sum_{x=0}^{\infty}\frac{\left(\mu e^t\right)^x}{x!} = e^{\mu\left(e^t - 1\right)}, \qquad (1.12.1)$$

and leads to

$$\left.\frac{d\ln g_X(t)}{dt}\right|_{t=0} = \left.\frac{d^2\ln g_X(t)}{dt^2}\right|_{t=0} = \mu,$$

which confirms the equality of $\langle X\rangle$ and var(X). Moreover, if $X_1$ and $X_2$ are independent Poisson random variables of respective means $\mu_1$ and $\mu_2$, then the mgf of their sum $Y = X_1 + X_2$

$$g_Y(t) = g_{X_1}(t)\,g_{X_2}(t) = e^{\left(\mu_1 + \mu_2\right)\left(e^t - 1\right)}$$

immediately establishes the fact that Y is a Poisson random variable of mean $\mu_Y = \mu_1 + \mu_2$.
If we had not used the mgf, we could have still arrived at the same conclusion by a method of reasoning based on summing over conditional probabilities, but it is a more cumbersome procedure:

$$\begin{aligned}
\Pr(X_1 + X_2 = y) &= \sum_{x_1=0}^{y}\Pr(X_2 = y - x_1\,|\,X_1 = x_1)\,\Pr(X_1 = x_1) = \sum_{x_1=0}^{y}P_{\mathrm{Poi}}(y-x_1|\mu_2)\,P_{\mathrm{Poi}}(x_1|\mu_1)\\
&= \sum_{x_1=0}^{y} e^{-\mu_2}\frac{\mu_2^{\,y-x_1}}{(y-x_1)!}\; e^{-\mu_1}\frac{\mu_1^{\,x_1}}{x_1!} = \frac{e^{-(\mu_1+\mu_2)}}{y!}\sum_{x=0}^{y}\frac{y!}{x!\,(y-x)!}\,\mu_1^{x}\mu_2^{\,y-x} = \frac{e^{-(\mu_1+\mu_2)}}{y!}\sum_{x=0}^{y}\binom{y}{x}\mu_1^{x}\mu_2^{\,y-x}\\
&= \frac{e^{-(\mu_1+\mu_2)}}{y!}\left(\mu_1 + \mu_2\right)^{y}. \qquad (1.12.2)
\end{aligned}$$

4 Hypergeometric functions occur in the solution of second-order differential equations that describe a variety of physical systems. One of the most important examples is the radial part of the wave function of the electron in a hydrogen atom (i.e. the Coulomb problem).


The first step is in effect a statement of the sought-for probability by means of Bayes' theorem. The transition from the first to the second is permitted because the Poisson variates $X_1$ and $X_2$ are assumed independent. In the third step the explicit form of the Poisson probability function is employed. In the fourth step the expression is rearranged so as to take the form of a binomial expansion, which, when summed, yields in the fifth step the Poisson probability function with parameter $\mu_Y = \mu_1 + \mu_2$.
The difference of two independent Poisson random variables, however, is not governed by a Poisson distribution. One could have foreseen this without performing any calculation because a Poisson variate must be non-negative, yet the difference of two such variates can be negative. (An identical argument applies to the difference of two binomial random variables.) Such a difference is encountered fairly often in experimental atomic, nuclear, and particle physics, as well as in other disciplines, whenever it is necessary to subtract a random background noise from a signal of interest. The mgf of the difference $Y = X_1 - X_2$ takes the form

$$g_Y(t) = e^{-(\mu_1+\mu_2)}\,e^{\mu_1 e^{t} + \mu_2 e^{-t}}, \qquad (1.12.3)$$

which identifies a Skellam distribution,5 whose name and probability function are not widely known in physics. Nevertheless, from the mgf one can quickly obtain the mean, variance, or any other desired statistic by differentiation:

$$\langle Y\rangle = \langle X_1 - X_2\rangle = \mu_1 - \mu_2 \qquad\qquad \mathrm{var}(Y) = \mathrm{var}(X_1) + \mathrm{var}(X_2) = \mu_1 + \mu_2. \qquad (1.12.4)$$

To calculate the Skellam probability function, we can proceed as we did in the previous section for the difference of two binomial variates, or (for variety) we can work with the probability generating function (pgf), which is defined as the expectation $f_Y(t) \equiv \langle t^Y\rangle$. It then follows that the probability $\Pr(Y = y)$ is the coefficient of the term $t^y$ in the series expansion of $f_Y(t)$. If one has already calculated the mgf, then the pgf is obtained simply by replacing $e^t$ by t. Thus, from Eq. (1.12.3) we have

$$f_Y(t) = e^{-(\mu_1+\mu_2)}\,e^{\mu_1 t + \mu_2 t^{-1}} = e^{-(\mu_1+\mu_2)}\,e^{\sqrt{\mu_1\mu_2}\left(t\sqrt{\mu_1/\mu_2}\;+\;t^{-1}\sqrt{\mu_2/\mu_1}\right)}, \qquad (1.12.5)$$

where the purpose of the rearrangement in the second equality was to cast the second exponential factor into a form recognizable (to those familiar with some of the uncommon types of Bessel functions) as the generator

$$e^{\,z\left(t + t^{-1}\right)/2} = \sum_{n=-\infty}^{\infty} I_n(z)\,t^n \qquad (1.12.6)$$

of modified Bessel functions $I_n(z) = i^{-n}J_n(iz)$, in which $J_n(z)$ is the more familiar Bessel function of the first kind. The probability

5 J. G. Skellam, "The frequency distribution of the difference between two Poisson variates belonging to different populations", Journal of the Royal Statistical Society: Series A, 109 (1946) 296.


$$\Pr(X_1 - X_2 = y) = e^{-(\mu_1+\mu_2)}\left(\frac{\mu_1}{\mu_2}\right)^{y/2} I_y\!\left(2\sqrt{\mu_1\mu_2}\right) \qquad (1.12.7)$$

then follows from (1.12.5) and (1.12.6).
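Equation (1.12.7) is easily verified numerically. The sketch below is illustrative only (the means and sample size are arbitrary, and the modified Bessel routine is SciPy's, not something prescribed by the text); it compares the formula with a Monte Carlo histogram of Poisson differences.

```python
import numpy as np
from scipy.special import iv           # modified Bessel function I_n(z)

mu1, mu2 = 7.0, 3.0
rng = np.random.default_rng(0)
diff = rng.poisson(mu1, 1_000_000) - rng.poisson(mu2, 1_000_000)

for y in (-3, 0, 2, 5):
    skellam = np.exp(-(mu1 + mu2)) * (mu1/mu2)**(y/2) * iv(y, 2*np.sqrt(mu1*mu2))
    print(y, round(skellam, 5), round((diff == y).mean(), 5))

# mean and variance, Eq. (1.12.4)
print(diff.mean(), mu1 - mu2, diff.var(), mu1 + mu2)
```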


Ordinary and modified Bessel functions satisfy differential equations that differ in the sign of one term

$$x^2\frac{d^2y}{dx^2} + x\frac{dy}{dx} + \begin{cases} \left(x^2 - n^2\right)y = 0 & \Rightarrow\ J_n(x)\\[1ex] -\left(x^2 + n^2\right)y = 0 & \Rightarrow\ I_n(x). \end{cases} \qquad (1.12.8)$$

That difference is a critical one, however. In contrast to the ordinary Bessel function $J_n(z)$, which oscillates as a function of its argument, a modified Bessel function $I_n(z)$ increases exponentially. Both types are finite at the origin. Bessel functions of integer index n can be calculated explicitly by means of the series

$$J_n(x) = \sum_{m=0}^{\infty}\frac{(-1)^m}{m!\,(m+n)!}\left(\frac{x}{2}\right)^{2m+n} \qquad\qquad I_n(x) = \sum_{m=0}^{\infty}\frac{1}{m!\,(m+n)!}\left(\frac{x}{2}\right)^{2m+n}. \qquad (1.12.9)$$

If the index n is not an integer, then the gamma function $\Gamma(m+n+1)$ replaces $(m+n)!$ in the denominator.
The general form of a gamma function $\Gamma(x)$ is defined by the integral

$$\Gamma(x) = \int_0^{\infty} t^{x-1}e^{-t}\,dt \;\xrightarrow{\ \text{integer }x\ }\; (x-1)!\,. \qquad (1.12.10)$$

The factorial value shown by the arrow in the case of an integer argument derives from the property of the gamma function

$$\Gamma(x+1) = x\,\Gamma(x), \qquad (1.12.11)$$

which can be established inductively by integration by parts. The gamma function occurs widely in statistical analyses.
1.13 Multinomial moment-generating function
The last of the discrete distributions we need to reconsider now is the multinomial distribution. Designate by $\{N_i,\ i = 1\ldots r\}$ the set of random variables whose realizations $\{n_i\}$ are the frequencies of outcomes of n trials sorted into r categories or classes $\{x_i\}$ with corresponding probabilities of occurrence $\{p_i\}$. Then, as previously stated, the probability of a particular set of outcomes is

$$\Pr\left(\{N_i = n_i\}\,|\,\{p_i\}\right) = n!\prod_{i=1}^{r}\frac{p_i^{n_i}}{n_i!}. \qquad (1.13.1)$$


The moment-generating function, in which $\mathbf{t}$ now stands for the set of r dummy variables $(t_1 \ldots t_r)$, is the expectation

$$g(\mathbf{t}) = \left\langle e^{\mathbf{N}\cdot\mathbf{t}}\right\rangle = \sum_{\{n\}} n!\,\frac{p_1^{n_1}\cdots p_r^{n_r}}{n_1!\cdots n_r!}\,e^{n_1 t_1}\cdots e^{n_r t_r}, \qquad (1.13.2)$$

subject to $\sum_{i=1}^{r} n_i = n$. Rearrangement of the preceding expression leads to a form recognizable as a multinomial expansion

$$g(\mathbf{t}) = \left(p_1 e^{t_1} + \cdots + p_r e^{t_r}\right)^{n}. \qquad (1.13.3)$$

The set of probabilities $\{p_i\}$ are not all independent because $p_r = 1 - \sum_{i=1}^{r-1}p_i$. The factor $e^{t_r}$, the equivalent of which is absent in the generating function of a binomial distribution, was included for symmetry to permit all classes to be handled equivalently.
In most instances it is considerably simpler to work with the generating function
than to carry out complex summations with the multinomial probability function.
For example, by differentiating Eq. (1.13.3) we immediately obtain the means,
variances, and covariances of the random variables fNig representing the frequencies
of each class:
$$\begin{aligned}
\left.\frac{\partial g}{\partial t_i}\right|_{\mathbf{t}=0} &= \langle N_i\rangle = np_i\\
\left.\frac{\partial^2 g}{\partial t_i^2}\right|_{\mathbf{t}=0} &= \left\langle N_i^2\right\rangle = np_i + n(n-1)p_i^2\\
\left.\frac{\partial^2 g}{\partial t_i\,\partial t_j}\right|_{\mathbf{t}=0} &= \left\langle N_iN_j\right\rangle = n(n-1)p_ip_j
\end{aligned}
\quad\Rightarrow\quad
\begin{aligned}
\mathrm{var}(N_i) &\equiv \sigma_i^2 = \left\langle N_i^2\right\rangle - \langle N_i\rangle^2 = np_i(1-p_i)\\
\mathrm{cov}(N_i, N_j) &\equiv \left\langle N_iN_j\right\rangle - \langle N_i\rangle\langle N_j\rangle = -np_ip_j.
\end{aligned} \qquad (1.13.4)$$

A dimensionless measure of the degree of correlation between outcomes in two classes is provided by the correlation coefficient

$$\rho_{ij} \equiv \frac{\mathrm{cov}(N_i, N_j)}{\sigma_i\sigma_j} = -\sqrt{\frac{p_ip_j}{(1-p_i)(1-p_j)}}. \qquad (1.13.5)$$

As noted before, the negative sign in the covariance or correlation coefficient signifies that on average the change in one frequency results in an opposite change in another frequency because of the constraint on the sum of all frequencies. The binomial distribution, where $p_2 = 1 - p_1$, provides an illustrative special case; Eq. (1.13.5) leads to $\rho_{12} = -1$, i.e. 100% anti-correlation, as would be expected.
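These covariance and correlation formulas are easy to check by simulation. The NumPy sketch below is illustrative only (the trial number, probabilities, and sample size are arbitrary choices); it draws multinomial samples and compares the empirical correlation of two class frequencies with Eq. (1.13.5).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
p = np.array([0.2, 0.3, 0.5])
counts = rng.multinomial(n, p, size=200_000)     # each row: (N1, N2, N3)

emp_cov = np.cov(counts[:, 0], counts[:, 1])[0, 1]
emp_rho = np.corrcoef(counts[:, 0], counts[:, 1])[0, 1]

theory_cov = -n*p[0]*p[1]                                   # Eq. (1.13.4)
theory_rho = -np.sqrt(p[0]*p[1]/((1-p[0])*(1-p[1])))        # Eq. (1.13.5)
print(emp_cov, theory_cov)
print(emp_rho, theory_rho)
```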
A multinomial distribution can arise sometimes in unexpected ways. Consider the
following situation, which will be of interest to us later when we examine means of
judging the credibility of models (also referred to as hypothesis testing) with particular focus on examining the properties of radioactive decay. Suppose a random


process has generated K independent Poisson variates $\{N_k = \mathrm{Poi}(\mu_k),\ k = 1\ldots K\}$. The probability of getting the sequence of outcomes $\{n_1, n_2, \ldots n_K\}$ is then

$$\Pr\left(\{n_k\}\,|\,\{\mu_k\}\right) = \prod_{k=1}^{K} e^{-\mu_k}\frac{\mu_k^{\,n_k}}{n_k!} = e^{-\mu}\prod_{k=1}^{K}\frac{\mu_k^{\,n_k}}{n_k!}, \qquad (1.13.6)$$

where $\mu = \sum_{k=1}^{K}\mu_k$. If, however, a constraint were imposed on the outcomes such that their sum must take a fixed value $\sum_{k=1}^{K} n_k = n$, then the conditional probability of obtaining the outcomes would be


$$\Pr\!\left(\{n_k\}\,\Big|\,\{\mu_k\},\ \sum_{k=1}^{K}n_k = n\right) = \frac{P_{\mathrm{Poi}}\left(\{n_k\}\,|\,\{\mu_k\}\right)}{P_{\mathrm{Poi}}\!\left(\sum_{k=1}^{K}n_k = n\right)} = \frac{e^{-\mu}\displaystyle\prod_{k=1}^{K}\mu_k^{\,n_k}/n_k!}{e^{-\mu}\,\mu^{n}/n!} = n!\prod_{k=1}^{K}\frac{\left(\mu_k/\mu\right)^{n_k}}{n_k!}, \qquad (1.13.7)$$

which is seen to be a multinomial probability function with parameters $p_k = \mu_k/\mu$. The substitution of the Poisson probability function for $\Pr\left(\sum_{k=1}^{K}n_k = n\right)$ is justified because the sum of K independent Poisson variates is itself a Poisson random variable.
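This somewhat unexpected connection can be demonstrated by simulation: generate independent Poisson counts, keep only those realizations whose total equals n, and compare the class frequencies with the multinomial prediction. The fragment below is a minimal sketch (the means, the fixed total, and the sample size are arbitrary choices of mine).

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([2.0, 5.0, 3.0])        # K = 3 Poisson means
n = 10                                # fixed total imposed on the outcomes

draws = rng.poisson(mu, size=(2_000_000, 3))
kept = draws[draws.sum(axis=1) == n]  # condition on the sum

print("conditioned fraction in class 1 :", kept[:, 0].mean()/n)
print("multinomial prediction mu1/mu   :", mu[0]/mu.sum())
```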
1.14 Gaussian moment-generating function
The moment-generating function of the normal or Gaussian distribution is of particular significance in the statistical analysis of physical processes. Besides generating the moments of the distribution, it provides a reliable means of ascertaining how well an unknown probability distribution may be approximated by a normal one. Designate, as before, X to be a Gaussian random variable with mean $\mu$ and variance $\sigma^2$. Calculation of the mgf then leads to the integral

$$g(t) = \left\langle e^{Xt}\right\rangle = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} e^{xt}\,e^{-(x-\mu)^2/2\sigma^2}\,dx, \qquad (1.14.1)$$

which is most easily evaluated by (a) transforming the integration variable to a dimensionless variable $z = (x-\mu)/\sigma$ said to be in standard normal form, (b) completing the square in the exponent, and (c) recognizing the normalization of the resulting Gaussian integral

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-(z-\sigma t)^2/2}\,dz = 1 \qquad (1.14.2)$$

to obtain the expression

$$g(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}. \qquad (1.14.3)$$

We will make frequent use of this function throughout the book.


Using the mgf (1.14.3), we can easily demonstrate the equivalence relation (1.10.4). Define the random variable $X = a + bY$ where a and b are constants and $Y = N(0,1)$ is a standard normal variate. Since a and Y are independent, the mgf of X is expressible as a product

$$g_X(t) = g_a(t)\,g_{bY}(t) = e^{at}\,g_Y(bt). \qquad (1.14.4)$$

In going from the first equality to the second, the mgf of a constant is simply

$$g_a(t) = \left\langle e^{at}\right\rangle = e^{at}, \qquad (1.14.5)$$

and the mgf of a constant times a random variable Y takes the form

$$g_{bY}(t) = \left\langle e^{bYt}\right\rangle = \left\langle e^{Y(bt)}\right\rangle = g_Y(bt). \qquad (1.14.6)$$

However, for $Y = N(0,1)$, the mgf (1.14.3) applied to relation (1.14.6) yields $g_Y(bt) = e^{\frac{1}{2}b^2t^2}$. Thus, the product of the factors in (1.14.4) leads to

$$g_X(t) = e^{at}\,e^{\frac{1}{2}b^2t^2} = e^{at + \frac{1}{2}b^2t^2}, \qquad (1.14.7)$$

which identifies X as a normal random variable. Setting $a = \mu$ and $b = \sigma$ yields precisely relation (1.10.4).
One of the applications of the mgf is to establish the conditions for progressive approximation of one distribution by another. For example, the mgf of a binomial random variable Bin(n, p) is $g_{\mathrm{Bin}}(t) = \left(pe^t + q\right)^n = \left[1 + p\left(e^t - 1\right)\right]^n$. Expansion of $\ln g_{\mathrm{Bin}}(t) = n\ln\left[1 + p\left(e^t - 1\right)\right]$ in powers of $(e^t - 1)$, which may be regarded as a small quantity since t is ultimately set to zero in calculations with the mgf, yields the Taylor series6

$$\ln g_{\mathrm{Bin}}(t) = np\left(e^t - 1\right) - \frac{1}{2}np^2\left(e^t - 1\right)^2 + \cdots.$$

In the limit that $p \to 0$ and $n \to \infty$ so that the product $np \to \mu$, we can truncate the preceding expansion after the first term to obtain a limiting form of the mgf

$$g_{\mathrm{Bin}}(t) \to e^{\mu\left(e^t - 1\right)} = g_{\mathrm{Poi}}(t), \qquad (1.14.8)$$

which identifies a Poisson distribution of mean $\mu$.
Next, consider expansion of $\ln g_{\mathrm{Bin}}(t)$ in powers of t

$$\ln g_{\mathrm{Bin}}(t) = np\left(t + \frac{1}{2}t^2 + \cdots\right) - \frac{1}{2}np^2\left(t^2 + \cdots\right) + \cdots = npt + \frac{1}{2}np(1-p)t^2 + \cdots,$$

6 Recall that: $\ln(1+x) = x - \frac{1}{2}x^2 + \frac{1}{3}x^3 - \frac{1}{4}x^4 + \cdots.$

taking care to include all contributions of the same order in t. For vanishing p, but $np \gg 1$, we truncate the expansion after the quadratic term to obtain the limiting form

$$g_{\mathrm{Bin}}(t) \to e^{npt + \frac{1}{2}npqt^2} = g_{\mathrm{Gaus}}(t), \qquad (1.14.9)$$

recognizable as the mgf of a Gaussian distribution with mean np and variance $\sigma^2 = npq$, where $q = 1 - p \approx 1$.
In summary, one can say that the shape of the probability curve of a binomial distribution approaches in form that of a Poisson distribution for low p and large n, leading to a mean np of arbitrary magnitude. If np is much greater than 1, however, the shape formed by a continuous curve connecting the discrete points of the binomial (or Poisson) distribution takes on the symmetrical shape of a Gaussian distribution with mean and variance equal to np.
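The two limits are easy to quantify numerically. The SciPy sketch below is illustrative only (the parameter regimes are my choices); it measures how closely the binomial probability function approaches its Poisson and Gaussian limits.

```python
import numpy as np
from scipy.stats import binom, poisson, norm

def max_diff(n, p):
    k = np.arange(n + 1)
    b = binom.pmf(k, n, p)
    pois = poisson.pmf(k, n*p)
    gaus = norm.pdf(k, loc=n*p, scale=np.sqrt(n*p*(1-p)))
    return np.abs(b - pois).max(), np.abs(b - gaus).max()

# small p, modest np: the Poisson limit is accurate, the Gaussian is not
print(max_diff(n=1000, p=0.002))
# small p but np >> 1: the Gaussian approximation is now also accurate
print(max_diff(n=100000, p=0.001))
```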

1.15 Central Limit Theorem: why things seem mostly normal


It often occurs in science that one encounters random variables whose probability
distributions are not known. This is particularly the case when the quantity being
sought is inferred from more elemental randomly varying quantities. Then, even if
the probability distributions of the elemental variables are known, it may be very
difficult to calculate exactly the distribution of the composite quantity. For
example, consider the traditional experiment in introductory physics labs to
measure the acceleration g of freefall at the surface of the Earth. This requires
timing a vertically falling object and marking the intermediate locations as a
function of time. The data comprise measurements of time intervals and spatial
intervals with random experimental errors of measurement whose distributions
are not a priori known. The standard statistical procedure of error propagation
analysis lets one estimate a mean value and standard deviation of g, but, without
knowledge of the underlying probability distribution, it is not possible to interpret
the significance of these statistical quantities. This is not merely an academic
problem confined to instructional labs, but an issue that can have potentially
serious consequences in the real world, particularly in science, medicine, and
engineering.
The Central Limit Theorem of statistics often provides a workable solution by
elucidating the circumstances under which a combination of random variables of
different distributions together form a quantity distributed for all practical purposes
like a Gaussian variate. Consider, as an illustration, the special case of a random
n
X
Xi interpretable as the mean of n independent, identically distribvariable X 1n
i1

uted measurements fXi i 1 . . . ng each with mgf gX (t). From Eq.(1.10.1), the mgf of
   n
X takes the form gX t gX nt , the natural log of which can be expressed in terms
of the moments of X by expanding gX (t) in a Taylor series about t 0

1.15 Central Limit Theorem: why things seem mostly normal

ln gX t

 

t
nln gX
n

t=nk
n ln 1
k
k!
k1

29

!
 n ln1 t:

1:15:1


k
gX t
Here k d dt
is the kth moment of X and the term (t) is to be regarded as a
k
t0
small quantity since t will eventually be set to 0. A Taylor series expansion of the
logarithm
h
i
 
t
1
1
1:15:2
n t  t2 t3     ,
ln gX
2

followed by arrangement of all terms in increasing powers of t, then leads to an expression

$$\ln g_{\bar X}(t) = \mu_1 t + \frac{\left(\mu_2 - \mu_1^2\right)t^2}{2n} + \frac{\left(\mu_3 - 3\mu_2\mu_1 + 2\mu_1^3\right)t^3}{6n^2} + \cdots = \mu_1 t + \frac{\sigma_X^2}{2n}t^2 + \frac{\left\langle (X-\mu_1)^3\right\rangle}{6n^2}t^3 + \cdots \qquad (1.15.3)$$

in increasing moments about the mean of X. If the number of observations n, which appears in the denominator of each term to a power of one less than the corresponding moment, is sufficiently large that terms beyond the second moment can be neglected, the truncated series is of the form of a Gaussian mgf of mean $\mu_{\bar X} = \mu_1$ and variance

$$\sigma^2_{\bar X} = \sigma^2_X/n. \qquad (1.15.4)$$

If the condition that the variables $\{X_i\}$ be identically distributed is relaxed, then the foregoing analysis carries through in the same way, albeit with some extra summations, leading to a Gaussian distribution $N\!\left(\mu_{\bar X}, \sigma^2_{\bar X}\right)$ with parameters

$$\mu_{\bar X} = \frac{1}{n}\sum_{i=1}^{n}\mu_{X_i} \qquad\qquad \sigma^2_{\bar X} = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2_{X_i}. \qquad (1.15.5)$$

It is worth noting explicitly that the only requirement on the distributions of the original variables $\{X_i\}$ is the existence of first and second moments. This modest requirement is usually met by the distributions one is likely to encounter in physics, although the Cauchy distribution, which appears in spectroscopy as the Lorentzian lineshape, is an important exception. A Cauchy distribution has a median, but the mean, variance, and higher moments do not exist.
A significant outcome of the foregoing calculation is that the standard deviation of the mean of n observations is smaller than the standard deviation of a single observation by the factor $\sqrt{n}$. This statistical prediction is the justification for repetition and combination of measurements in experimental work. Perhaps it is intuitively obvious to the reader that the greater the number of measurements taken, the greater would be the precision of the result, but historically this was not at all

obvious. An interesting case in point is the contrast in attitude toward aggregation of


data by the eighteenth-century astronomer Johan Tobias Mayer and acclaimed
mathematician Leonhard Euler. Mayer, whose work required a practical bent,
regarded the errors in his observations as random and self-canceling, but Euler
believed that in combining measurements the bad ones would contaminate the
good ones.7 The idea that one achieves better results by combining measurements
has taken a while for acceptance.
Indeed, the implications of the square-root law are subtle and so easily misconstrued when applied in practice, that a concrete illustration is worth discussing here. Consider a Poisson process such as the decay of a homogeneous sample of radioactive nuclei in which decays are counted (e.g. by detecting outgoing particles) within a specified window of time, let us say one second. Each count of one-second duration constitutes one bin of data accumulation. Let the random variable X represent the count in one bin. If we know that X is a Poisson random variable of mean $\mu$, then the variance of X equals $\mu$ and the standard deviation of X is $\sigma_X = \sqrt{\mu}$. An experimentalist, therefore, might report the outcome of a single measurement as $x \pm \sqrt{x}$, where the single count x is used to estimate the mean and variance of the distribution.
At the risk of complicating matters, it is nevertheless necessary to note the distinction between moments of a theoretical distribution (often assigned Greek letters: $\mu$, $\sigma$, etc.) and corresponding empirical moments inferred or estimated from actual measurements (often assigned Roman/italic letters: m, s, etc.). I will discuss later, in conjunction with specific experiments, issues relating to statistical estimation. For the present, the principal point I want to make is this: although we have a value for the variance of x, there is no way actually to ascertain whether that estimate is accurate on the basis of the count in a single bin, since variance (as the name implies) refers to the variation in counts that would occur from multiple measurements (i.e. many bins).
Suppose now the experiment consists of making n sequential counts, each of one second duration, to fill n bins of data. We can estimate (we'll justify this later) the mean and standard deviation of the underlying distribution by the following summations

$$\bar x = \frac{1}{n}\sum_{i=1}^{n}x_i \qquad\qquad s_X^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar x\right)^2 \qquad (1.15.6)$$

and report our measurement as $\bar x \pm s_X$. The empirical value $s_X^2$ should correspond approximately to the value $\bar x$ if the underlying distribution is truly Poissonian. Note that this is still the estimate of the variance of a single trial (the count in one bin); only now we have the variable counts in n bins to verify its value directly.

7 S. M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900 (Harvard University Press, Cambridge MA, 1986) 28.

But the n-bin experiment gives us more. Equation (1.15.4) tells us that the variance of the mean of the n measurements is $s^2_{\bar X} = s^2_X/n$, which for large n is a much smaller variation. The quantity $s_{\bar X}$ is referred to in statistics as the standard error, where "error" connotes uncertainty, not mistake. However, as before, we cannot check this prediction on the basis of a single set of n measurements, which, for need of a word that is short and alliteratively parallels "bin", I shall refer to as one "bag".
Suppose we collect b bags of data, each bag containing n bins, where the content of one bin is the count in one second. Treating the means of each bag $\{\bar x_j,\ j = 1\ldots b\}$ in the way that we formerly treated the bins $\{x_i,\ i = 1\ldots n\}$ in a bag, we calculate the mean of the bag means and the variance of the mean of the bag means

$$m_{\bar X} = \frac{1}{b}\sum_{j=1}^{b}\bar x_j \qquad\qquad s^2_{\bar X} = \frac{1}{b}\sum_{j=1}^{b}\left(\bar x_j - m_{\bar X}\right)^2 \qquad (1.15.7)$$

and report the experimental result in the form $m_{\bar X} \pm s_{\bar X}$. The numerical value of $s_{\bar X}$ should satisfy (approximately) the relation $s_{\bar X} \approx s_X/\sqrt{n}$. But this experiment also gives more: the estimated variance of the mean of b bags of data is $s^2_{m_{\bar X}} = \dfrac{s^2_{\bar X}}{b} = \dfrac{s^2_X}{nb}$. Of course, we cannot actually verify this without performing another set of experiments, each one comprising b bags of data. And so it goes.
However, we could equally well have interpreted the set of b bags of data as a single large bag of nb realizations $\{y_k,\ k = 1\ldots nb\}$ of a random variable Y. Estimates of the mean $\bar y$, variance $s^2_Y$, and variance of the mean $s^2_{\bar Y}$ are then given by the expressions

$$\bar y = \frac{1}{nb}\sum_{k=1}^{nb}y_k \qquad\quad s_Y^2 = \frac{1}{nb}\sum_{k=1}^{nb}\left(y_k - \bar y\right)^2 \qquad\quad s^2_{\bar Y} = \frac{s_Y^2}{nb}. \qquad (1.15.8)$$

Numerically, we would expect

$$\bar y \approx m_{\bar X} \qquad\qquad s_Y \approx s_X \qquad\qquad s_{\bar Y} \approx s_{m_{\bar X}}. \qquad (1.15.9)$$

The greater number of bins per bag does not reduce the variance of the count in a
single bin, but yields a mean whose variance is as small as previously found for the
variance of the means of the b bags of data. The two ways of handling the data give
equivalent overall estimates for the mean and variance of the stochastic process
generating the data. There are advantages, however, to partitioning the data into
bags if the objective, for example, is to test whether the distribution of counts is
actually Poissonian, or to examine whether the mean or variance of the source of
data may be varying in time.
Table 1.3 shows the results of 25 600 outcomes, ordered sequentially into 16 bags of 1600 bins each, of a Poisson random number generator (RNG) set for $\mu = 100$. The table reports the mean $\bar x_j$ of each bag $(j = 1\ldots 16)$ and the variance $s_X^2$ calculated from relations (1.15.6). From the table one calculates directly the mean of all bags $(m_{\bar X})$ and the variance of bag means $\left(s^2_{\bar X}\right)$. Comparing theoretical

Table 1.3  Outcome of Poisson RNG with $\mu = 100$ (1600 bins per bag)

Bag no.   Mean $\bar x_j$   Variance $s_x^2$      Bag no.   Mean $\bar x_j$   Variance $s_x^2$
1         99.6              99.0                  9         99.6              97.5
2         99.9              100.7                 10        100.1             100.0
3         100.1             100.7                 11        99.9              99.7
4         100.4             101.5                 12        99.9              96.7
5         100.1             100.8                 13        99.9              105.1
6         100.0             104.1                 14        100.1             105.6
7         100.4             101.0                 15        100.3             101.5
8         100.1             98.5                  16        100.0             100.6

expectations and empirical outcomes, we find excellent agreement with the principles outlined above.
THEORY:
$$\sigma_X = \sqrt{100} = 10 \qquad \sigma_{\bar X} = \frac{\sigma_X}{\sqrt{n}} = \frac{10}{40} = 0.250 \qquad \sigma_{\bar Y} = \frac{\sigma_X}{\sqrt{nb}} = \frac{10}{160} = 0.0625$$

EMPIRICAL:
$$s_X = 10.040 \qquad s_{\bar X} = 0.251 \qquad s_{\bar Y} = \frac{s_X}{\sqrt{nb}} = 0.0628$$

A final point (for the moment) in regard to Eq. (1.15.4) or Eq. (1.15.5) is that the expression for the variance of the mean is a general property of variances irrespective of the Central Limit Theorem. Without the CLT, however, we would not necessarily know what to do with this information. The theorem tells us, for example, that, if the process generating the particle counts can be approximated by a Gaussian distribution, then we should expect about 68.3% of the bins to contain counts that fall within a range $\pm s_X$ about the observed mean $\bar x$.
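The entire bins-and-bags exercise summarized in Table 1.3 can be reproduced in a few lines of NumPy. The script below is a minimal sketch using the same nominal parameters (μ = 100, 16 bags of 1600 bins); the random seed and the particular generator are incidental choices of mine.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, n_bins, n_bags = 100, 1600, 16

counts = rng.poisson(mu, size=(n_bags, n_bins))   # one row per bag

bag_means = counts.mean(axis=1)                   # bag means, Eq. (1.15.6)
bag_vars = counts.var(axis=1)                     # s_X^2 per bag

s_X = np.sqrt(bag_vars.mean())                    # compare with sigma_X = 10
s_Xbar = bag_means.std()                          # compare with 10/sqrt(1600) = 0.25
s_Ybar = s_X/np.sqrt(n_bins*n_bags)               # compare with 10/160 = 0.0625

print(s_X, s_Xbar, s_Ybar)
```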
1.16 Characteristic function
The characteristic function (cf) of a statistical distribution is closely related to the moment-generating function (mgf) when the latter exists and can be used in its place when the mgf does not exist. It is a complex-valued function defined by

$$h_X(t) \equiv \left\langle e^{iXt}\right\rangle = g_X(it), \qquad (1.16.1)$$

where $i = \sqrt{-1}$ is the unit imaginary number. For a random variable X characterized by a pdf $p_X(x)$, the characteristic function takes the form

$$h_X(t) = \int_{-\infty}^{\infty} e^{ixt}\,p_X(x)\,dx, \qquad (1.16.2)$$

which is recognizable as the Fourier transform of the pdf. In this capacity lies its primary utility, for it permits one to calculate the probability density (or probability function) by an inverse transform

$$p_X(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ixt}\,h_X(t)\,dt, \qquad (1.16.3)$$

which cannot always be done so straightforwardly by means of the mgf itself. One can, of course, also calculate moments of a distribution by expansion of $h_X(t)$ in a Taylor series about $t = 0$ to obtain an alternating progression of real and imaginary valued quantities, but I have found little advantage to using it this way when $g_X(t)$ is available.
As an illustration of the inverse problem of determining the pdf from the cf, consider the standard normal distribution, for which the generating function is $g_X(t) = e^{t^2/2}$ and therefore $h_X(t) = e^{-t^2/2}$. The probability density then follows from the integral

$$p_X(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ixt - t^2/2}\,dt = \frac{e^{-x^2/2}}{\sqrt{2\pi}}\left[\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}(t+ix)^2}\,dt\right] = \frac{e^{-x^2/2}}{\sqrt{2\pi}}. \qquad (1.16.4)$$

The calculation is easily extended to the case of an arbitrary Gaussian distribution $N(\mu, \sigma^2)$ at the expense of a few more algebraic manipulations in completing the square in the exponential.
The method can also be applied to calculate the probability function of a discrete distribution (as an alternative procedure to using a probability generating function). Consider, for example, a binomial distribution Bin(n, p) for which the mgf was found to be $g_X(t) = \left(pe^t + q\right)^n$. The cf is then $h_X(t) = \left(pe^{it} + q\right)^n$ and implementation of the transform (1.16.3) is accomplished through the following steps: (a) binomial expansion of the terms in parentheses, (b) collection of factors containing the integration variable and reversal of the order of summation and integration, (c) collapse of the summation by means of a $\delta$ function:

$$p_X(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ixt}\left(pe^{it} + q\right)^n dt = \sum_{k=0}^{n}\binom{n}{k}q^{\,n-k}\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ixt}\left(pe^{it}\right)^k dt = \sum_{k=0}^{n}\binom{n}{k}p^kq^{\,n-k}\underbrace{\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{i(k-x)t}\,dt}_{\delta(k-x)} = \binom{n}{x}p^xq^{\,n-x}. \qquad (1.16.5)$$

The last step bears some comment. A Dirac delta function $\delta(x)$ is technically not a function, but a mathematical structure with numerous representations whose value is zero everywhere except where its argument is zero, at which point its value is infinite; yet the area under the delta function (that is, the integral of the delta function over the real axis) is 1. The object was introduced into physics by P. A. M. Dirac to the horror of mathematicians (or so I have read) but eventually was legitimized by Laurent Schwartz in a theory of generalized functions (referred to as distribution theory, although the concept of distribution is unrelated to that in statistics). Ordinarily, the delta function has meaning only in an integral where it serves to sift out selected values of the argument of the integrand, for example: $\int f(x)\,\delta(x-a)\,dx = f(a)$. One gets a sense of how this occurs from the integral representation
$$\delta(x) = \lim_{K\to\infty}\left(\frac{1}{2\pi}\int_{-K}^{K} e^{ixt}\,dt\right) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{ixt}\,dt \qquad (1.16.6)$$

identified in (1.16.5) by the horizontal bracket. The second equality expresses the familiar form one usually sees for the representation of the delta function. If the argument is not zero, then the integrand oscillates wildly with average value of 0. The proof that the foregoing representation satisfies the property of unit area is best accomplished by means of contour integration in the complex plane and will not be given here. To perform that integral rigorously, however, one must employ the correct representation of $\delta(x)$ as a limiting process expressed in the first equality.
In the calculation (1.16.5) of the binomial probability function, the Dirac delta function causes the right side of the equation to vanish for all values of the discrete summation index k except for $k = x$. It is therefore assuming the role of the discrete Kronecker delta $\delta_{kx}$, which by definition equals 1 if $k = x$ and zero otherwise. There is no inconsistency here, however, because the inverse transform of the characteristic function is a probability density, and the Dirac delta function, which in general is a dimensioned quantity (with dimension equal to the reciprocal dimension of the integration variable), is required for the left-hand side of (1.16.5) to be a density, even though it is defined only for discrete values of x. In short, the method works, and we shall not worry about mathematical refinements to make the analysis more elegant, only to end up with the same result.
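The inverse transform (1.16.3) can also be carried out numerically, which is often the quickest route when a characteristic function resists analytic inversion. The fragment below is a minimal sketch (the truncation limits and grid are arbitrary choices); it inverts the standard normal cf by direct quadrature and compares with Eq. (1.16.4).

```python
import numpy as np

def cf_standard_normal(t):
    return np.exp(-t**2/2)

def invert_cf(cf, x, t_max=40.0, n_t=20001):
    """Approximate p(x) = (1/2 pi) * integral of e^{-ixt} h(t) dt, Eq. (1.16.3)."""
    t = np.linspace(-t_max, t_max, n_t)
    dt = t[1] - t[0]
    integrand = np.exp(-1j*np.outer(x, t)) * cf(t)
    return np.real(integrand.sum(axis=1)) * dt / (2*np.pi)

x = np.array([0.0, 1.0, 2.0])
print(invert_cf(cf_standard_normal, x))
print(np.exp(-x**2/2)/np.sqrt(2*np.pi))    # exact N(0,1) density, Eq. (1.16.4)
```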

1.17 The uniform distribution


An idea of how rapidly the compounding of non-normal probability distributions can approach normality may be gleaned from examining the extreme case of the uniform distribution U(a, b), in which the probability density of a random variable X

$$p_X(x|b,a) = \begin{cases} \dfrac{1}{b-a} & b \ge x \ge a\\[1.5ex] 0 & \text{otherwise} \end{cases} \qquad (1.17.1)$$

is constant over the entire interval within which the variable can fall. The value of the constant is the reciprocal of the interval, as determined by the completeness relation.
Use of pdf (1.17.1) leads to the moment-generating function

$$g_X(t) = \left\langle e^{xt}\right\rangle = \frac{1}{b-a}\int_a^b e^{xt}\,dx = \frac{e^{bt} - e^{at}}{(b-a)t}. \qquad (1.17.2)$$

The uniform distribution is perhaps one of very few distributions where it is considerably easier to determine statistical moments directly by integrating the pdf than by differentiating the mgf. Performing the integrations, we obtain

$$\begin{aligned}
\mu_X \equiv \langle X\rangle &= \tfrac{1}{2}(b+a) & \sigma_X^2 \equiv \left\langle (X-\mu_X)^2\right\rangle &= \tfrac{1}{12}(b-a)^2\\
\left\langle X^2\right\rangle &= \tfrac{1}{3}\left(b^2 + ab + a^2\right) & \mathrm{Sk} \equiv \left\langle (X-\mu_X)^3\right\rangle/\sigma_X^3 &= 0\\
K \equiv \left\langle (X-\mu_X)^4\right\rangle/\sigma_X^4 &= \tfrac{9}{5} = 1.8. & &
\end{aligned} \qquad (1.17.3)$$

Since the distribution is symmetric (being constant over the entire interval), the skewness is expected to vanish. The kurtosis turns out to be a number independent of the interval boundaries and much smaller than 3 (the value for a normal distribution), signifying a comparatively broader peak about the center, which is one way of looking at a completely flat distribution.
The difficulty with using the mgf for a uniform variate is that substitution of $t = 0$ into $g_X(t)$ and its derivatives leads to an indeterminate expression 0/0. In such cases, we must apply L'Hopital's rule from elementary calculus to differentiate separately the numerator and denominator (more than once, if necessary) before taking the limit. Consider, for example, calculation of the mean

$$\mu_X = \left.\frac{dg_X(t)}{dt}\right|_{t=0} = \left.\left[\frac{be^{bt} - ae^{at}}{(b-a)t} - \frac{e^{bt} - e^{at}}{(b-a)t^2}\right]\right|_{t=0} = \frac{b^2 - a^2}{b-a} - \left.\frac{be^{bt} - ae^{at}}{2(b-a)t}\right|_{t=0} = \frac{b^2 - a^2}{b-a} - \frac{b^2 - a^2}{2(b-a)} = \frac{b+a}{2}. \qquad (1.17.4)$$

To avoid indeterminacy, the numerator and denominator of the second term had to be differentiated twice. Clearly, use of the mgf to determine moments of the uniform distribution is a tedious procedure to be avoided if possible. However, there are other uses, more pertinent to our present focus, in which the mgf is indispensable.
Suppose we want to determine the statistical properties of a random variable $Y = \sum_{i=1}^{n}X_i$, which is a sum of n independent random variables each distributed uniformly over the unit interval, i.e. $X_i = U(0,1)$. Y, therefore, spans the range $(0 \le Y \le n)$. The mgf of Y and correspondingly the characteristic function $h_Y(t) = g_Y(it)$ are immediately deducible from (1.10.1)

$$g_Y(t) = \left(\frac{e^t - 1}{t}\right)^{n} \;\Rightarrow\; h_Y(t) = \left(\frac{e^{it} - 1}{it}\right)^{n}. \qquad (1.17.5)$$
Although at this point we do not have the pdf of Y, we can determine the moments from the derivatives of $g_Y(t)$

$$\left.\begin{aligned}
\langle Y\rangle \equiv \mu_Y &= \frac{n}{2} & \left\langle Y^2\right\rangle &= \frac{n^2}{4} + \frac{n}{12} & \sigma_Y^2 &= \frac{n}{12}\\
\left\langle Y^3\right\rangle &= \frac{n^3}{8} + \frac{n^2}{8} & \left\langle Y^4\right\rangle &= \frac{n^4}{16} + \frac{n^3}{8} + \frac{n^2}{48} - \frac{n}{120} & &
\end{aligned}\right\}
\;\Rightarrow\;
\mathrm{Sk} = 0, \quad K = 3 - \frac{6}{5n}. \qquad (1.17.6)$$

As expected, the skewness vanishes and the kurtosis approaches 3 in the limit of infinite n. Moreover, expansion of $\ln g_Y(t)$ to order $t^3$ leads to an approximate mgf of Gaussian form

$$g_Y(t) \approx e^{\frac{n}{2}t + \frac{n}{24}t^2} = e^{\mu_Y t + \frac{1}{2}\sigma_Y^2 t^2} \qquad (1.17.7)$$

in accordance with the Central Limit Theorem.
The CLT, however, does not tell us how rapidly a distribution approaches normal form. To ascertain this, we need the pdf $p_Y(y)$, which the characteristic function in (1.17.5) allows us to determine, by means of the Fourier transform,

$$p_Y(y) = \frac{1}{2\pi}\int_{-\infty}^{\infty} h_Y(t)\,e^{-iyt}\,dt = \frac{1}{(n-1)!}\sum_{k=0}^{[y]}(-1)^k\binom{n}{k}\left(y - k\right)^{n-1}. \qquad (1.17.8)$$

I have used the symbol [y] in the upper limit of the sum above to represent the greatest integer less than or equal to y. Recall that Y is a continuous random variable over the interval 0 to n, but the numbers in the binomial coefficient must be integers. The calculation leading from the first equality to the second in (1.17.8) is most easily performed by contour integration in the complex plane and will be left to an appendix. To verify that $p_Y(y)$ satisfies the completeness relation, we calculate the cumulative distribution function

$$F_Y(y) = \int_0^{y} p_Y(y')\,dy' = \frac{1}{n!}\sum_{k=0}^{[y]}(-1)^k\binom{n}{k}\left(y - k\right)^{n}. \qquad (1.17.9)$$

Completeness follows from the binomial identity

Fig. 1.2 Probability density of the sum of n uniform variates (solid) with superposed
Gaussian densities (dashed) of corresponding mean n/2 and variance n/12.

 
 
$$F_Y(n) = \frac{1}{n!}\sum_{k=0}^{n}(-1)^k\binom{n}{k}(n-k)^n = \frac{1}{n!}\sum_{k=0}^{n}(-1)^{n-k}\binom{n}{k}k^n = 1. \qquad (1.17.10)$$

The preceding identity is by no means obvious, but it can be proven fairly simply by comparing the nth derivative of the function $\left(e^t - 1\right)^n$ and its binomial expansion, both evaluated at $t = 0$.8
Figure 1.2 shows a sequence of plots of $p_Y(y)$ (solid trace), calculated from Eq. (1.17.8), for sums of two to five uniformly distributed variates over the unit interval. Superposed over each plot is a plot (dashed trace) of the corresponding Gaussian pdf $N\!\left(\frac{n}{2}, \frac{n}{12}\right)$. It is remarkable that the addition of as few as three uniform random variables already generates a probability distribution reasonably well approximated by a normal distribution. In fact, the sum of just two uniform variates, which produces a triangular distribution, is matched very closely by the corresponding Gaussian curve in width, height, and inflection points. Of course, convergence can be much slower for other probability distributions, and some may never approach Gaussian form at all because their first and second moments are undefined.
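Equation (1.17.8) is simple to evaluate directly, so the comparison of Figure 1.2 can be checked numerically. The Python sketch below is illustrative only (the sample points are arbitrary); it implements the sum in (1.17.8) and contrasts it with the Gaussian of the same mean and variance.

```python
import math
import numpy as np

def p_uniform_sum(y, n):
    """Pdf (1.17.8) of the sum of n independent U(0,1) variates."""
    k = np.arange(int(np.floor(y)) + 1)
    terms = (-1.0)**k * np.array([math.comb(n, int(j)) for j in k]) * (y - k)**(n - 1)
    return terms.sum() / math.factorial(n - 1)

def gauss(y, n):
    return np.exp(-(y - n/2)**2/(2*n/12)) / np.sqrt(2*np.pi*n/12)

n = 3
for y in (0.5, 1.0, 1.5, 2.0, 2.5):
    print(y, round(p_uniform_sum(y, n), 4), round(gauss(y, n), 4))
```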
Besides serving as an interesting illustration of the Central Limit Theorem, the uniform distribution is of particular interest in its own right because it is the distribution of cumulative distribution functions (cdf). To see this, suppose $p_X(x)$ to be an arbitrary (but well-behaved) pdf, with cdf F(x)

$$F(x) \equiv \Pr(X \le x) = \int_{-\infty}^{x} p_X(x')\,dx'. \qquad (1.17.11)$$

8 W. Feller, An Introduction to Probability Theory and its Applications, Vol. 1 (Wiley, New York, 1950) 63. The identity arises in the classical occupancy problem (i.e. Maxwell-Boltzmann statistics) of r balls distributed among n cells such that none of the cells is empty.

How, then, is the random variable Y, defined by $Y = F(X)$, distributed? From the following sequence of relations

$$\Pr(Y \le y) = \Pr\left(F(X) \le y\right) = \Pr\left(X \le F^{-1}(y)\right) = F\!\left(F^{-1}(y)\right) = y, \qquad (1.17.12)$$

it follows that Y must be a uniform random variable over the interval 0 to 1, i.e. $Y = U(0,1)$. The fact that a cdf is governed by a distribution U(0,1) plays an important role in statistical tests of significance, such as goodness-of-fit tests to be discussed shortly.
One can also use a uniform distribution to generate random numbers distributed in an arbitrarily desired way. Start by generating n realizations $\{y_i,\ i = 1\ldots n\}$ of $Y = U(0,1)$. If we suppose that Y is the cdf of a random variable X

$$y = F(x) = \int_{-\infty}^{x} f(x')\,dx', \qquad (1.17.13)$$

then the set of numbers obtained by solving the inverse relation

$$x_i = F^{-1}(y_i) \qquad (1.17.14)$$

constitute n realizations of a random variable with pdf f(x). In general, the inversion will have to be done numerically.
Consider, as an example, the exponential distribution for which an analytical solution is easily obtained

$$y_i = \int_0^{x_i}\mu e^{-\mu x}\,dx = 1 - e^{-\mu x_i} \;\Rightarrow\; x_i = -\frac{1}{\mu}\ln\left(1 - y_i\right) \qquad \left(1 \ge y_i \ge 0\right). \qquad (1.17.15)$$

The upper panel of Figure 1.3 shows a histogram of 10 000 numbers $\{y_i\}$ generated by a U(0,1) random number generator, and the lower panel shows the corresponding values $\{x_i\}$ obtained from (1.17.15) for an exponential distribution with parameter $\mu = 3$. The dashed curve superposed on each histogram is the theoretical pdf.

1.18 The chi-square (χ²) distribution


Although the use of a moment generating function or characteristic function allows
us to determine readily the statistical properties of a linear superposition of random
variables, other methods must be sought when dealing with random variables that
arise as a result of nonlinear operations. In such cases, the simplest procedure if it
works would be to transform the pdf or mgf of the resulting distribution into one
that is already recognized and tabulated. Fortunately, this method works for one of
the most commonly occurring cases: the square of a normal random variable.

Fig. 1.3 Top panel: histogram of 10 000 samples from a U(0, 1) random number generator. Lower panel: histogram of exponential variates E(μ) generated by transformation (1.17.15) with parameter μ = 3. Dashed curves are theoretical densities.

To start with, consider a standard normal random variable $Z = N(0,1)$, for which the probability density is $p_Z(z) = (2\pi)^{-1/2}e^{-z^2/2}$. Under a transformation $W = Z^2$, the new pdf can be deduced by the following chain of steps

$$p_W(w)\,dw = \sum_{\pm z} p_Z(z)\,dz = 2p_Z(z)\,dz = 2p_Z\!\left(z(w)\right)\left|\frac{dz}{dw}\right|dw, \qquad (1.18.1)$$

leading to

$$p_W(w) = \frac{2p_Z\!\left(z(w)\right)}{\left|dw/dz\right|} = \frac{2(2\pi)^{-1/2}e^{-w/2}}{2w^{1/2}} = \frac{1}{\sqrt{2\pi}}\,w^{-1/2}e^{-w/2}, \qquad (1.18.2)$$

which is identifiable as the pdf of a chi-square random variable of one degree of freedom, or, symbolically, $W = \chi^2_1$. From the pdf above, the corresponding mgf, $g_W(t) = (1 - 2t)^{-1/2}$, is derivable by algebraically manipulating the integral occurring in the expectation $\left\langle e^{Wt}\right\rangle$ into the form of the gamma function $\Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}$. (See Eqs. (1.12.10) and (1.12.11).)
k
X
Z 2i , each the square of a standard
of k independent random variables, W
i1

normal random variable, yields the mgf


gW t 1  2tk=2

1:18:3

of a chi-square random variable of k degrees of freedom. We will take up the concept


of degrees of freedom at the appropriate point, but for the present let us focus on the
properties of the distribution, designated symbolically by 2k .
From the derivatives of the mgf (1.18.3) one finds that the first four moments of a
2k random variable are
1 k
2 k 2 k
and therefore
mean k

var 2k

3 k3 6k2 8k
4 k4 12k3 44k2 48k
r
8
Sk
k

K 3

1:18:4

12
:
k

1:18:5

With increasing k, the skewness of the distribution function approaches 0 and the kurtosis approaches that of a standard normal variate.
The inverse Fourier transform of the characteristic function $h_W(t) = g_W(it)$ yields the pdf

$$p_W(w|k) = \frac{w^{\frac{k}{2}-1}e^{-w/2}}{2^{k/2}\,\Gamma\!\left(\frac{k}{2}\right)}, \qquad (1.18.6)$$

but this calculation, like that of the integral encountered in the previous section, also entails contour integration in the complex plane, and the demonstration will be left to an appendix. Figure 1.4 shows the variation in the $\chi^2_k$ density function (1.18.6) for a set of low degrees of freedom (k = 1-5) (upper panel) and a set of relatively high degrees of freedom (k = 50-66 in intervals of 4) (lower panel). For k = 1, the pdf is infinite at the origin although the area under the curve is of course finite. For k = 2, the curve is a pure exponential, as can be seen from the expression in (1.18.6). As k increases beyond 2, the plot approaches (although with slow convergence) the shape of a Gaussian pdf with mean k and variance 2k.
Although ubiquitously used in its own right to test how well a set of data is accounted for by a theoretical expression, the chi-square pdf can also be considered a special case of a more general class of gamma distributions $\mathrm{Gam}(\lambda, \kappa)$ with defining probability density

$$p_X(x|\lambda, \kappa) = \frac{\lambda^{\kappa}x^{\kappa-1}e^{-\lambda x}}{\Gamma(\kappa)} \qquad (\lambda, \kappa > 0) \qquad (1.18.7)$$

Fig. 1.4 Probability density of χ²_k (solid) for low k (top panel) and high k (bottom panel). The dashed plot is the density of a normal variate N(k, 2k) for k = 66.

and moment-generating function

$$g_X(t) = \left(1 - \frac{t}{\lambda}\right)^{-\kappa} \qquad (t < \lambda). \qquad (1.18.8)$$

Looked at in this light (e.g. by comparison of mgfs), a chi-square random variable $\chi^2_k$ is a gamma random variable $\mathrm{Gam}\!\left(\tfrac{1}{2}, \tfrac{k}{2}\right)$.
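The identification of χ²_k with a sum of squared standard normal variates invites an immediate numerical check. The sketch below is illustrative only (degrees of freedom, sample size, and binning are arbitrary); it builds χ² samples from Gaussian ones and compares their histogram with the pdf (1.18.6) supplied by SciPy.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
k = 5
w = (rng.standard_normal(size=(200_000, k))**2).sum(axis=1)   # sum of k squared N(0,1)

print("sample mean, var :", w.mean(), w.var())    # theory: k and 2k
print("theory mean, var :", k, 2*k)

# compare an empirical density estimate with Eq. (1.18.6) at a few points
hist, edges = np.histogram(w, bins=60, range=(0, 20), density=True)
centers = 0.5*(edges[:-1] + edges[1:])
for i in (5, 15, 30):
    print(round(centers[i], 2), round(hist[i], 4), round(chi2.pdf(centers[i], k), 4))
```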

1.19 Student's t distribution


The t distribution, published anonymously in 1908 by William Gossett under the pseudonym of "Student" (because his employer, the Guinness Brewery in Dublin, did not permit employees to publish scientific papers), is the distribution of a random variable T constructed to be the ratio of a standard normal variate $U = N(0,1)$ and an independent normalized chi-square variate $V^2 = \chi^2_d$ of d degrees of freedom. Specifically, one defines T by

$$T \equiv \frac{U}{\sqrt{V^2/d}} = \frac{U\sqrt{d}}{V}. \qquad (1.19.1)$$
The motivation for this peculiar arrangement of random variables arises from its statistical application in testing the mean of a sample against a hypothesized mean of a normal distribution, or in comparing two or more sample means to infer whether or not they are statistically equivalent to the mean of the same parent population. We will employ the t distribution in this way later in the book.
When testing a sample mean $\bar x$ against the theoretical mean $\mu$ of a parent population, it is often the case that the population variance $\sigma^2$ is not known although the variance $s^2$ of a sample of size n has been determined. One could, of course, estimate $\sigma^2$ by $s^2$ in implementing the test with a normal distribution, but the error incurred by this approximation can be significant for samples of small size. The Central Limit Theorem validates the ubiquitous occurrence of a normal distribution in the limit of a large (technically, infinite) number of samples. When used to make statistical inferences on small samples, however, the normal distribution gives probabilities that are too small because the tails of the distribution fall off (exponentially) too fast. In other words, the normal distribution can underestimate the probability of occurrence of outlying events that ordinarily have a low probability but which, when they occur, can prove catastrophic. The t distribution allows one to sidestep the problem of an unknown population variance in the following way.
If $X = N(\mu, \sigma^2)$ is a normal variate for which values $\bar x$ and $s^2$ have been obtained for the mean and variance by a random sample of size n, then the quantity

$$u = \frac{\bar x - \mu}{\sigma_{\bar X}} = \frac{\bar x - \mu}{\sigma/\sqrt{n}} \;\Rightarrow\; U = N(0,1) \qquad (1.19.2)$$

is a realization of a random variable $U = N(0,1)$. It is also demonstrable that the quantity

$$v^2 = \frac{ns^2}{\sigma^2} \;\Rightarrow\; V^2 = \chi^2_{n-1} \qquad (1.19.3)$$

is a realization of an independent chi-square random variable $V^2 = \chi^2_{n-1}$. It may seem surprising that the distributions of $s^2$ and $\bar x$ are independent of one another since both quantities are calculated from the same set of data, but the demonstration of both the independence and the type of distribution can be found in advanced statistics books.9 From (1.19.2) and (1.19.3) it follows that the ratio

9 P. G. Hoel, Introduction to Mathematical Statistics (Wiley, New York, 1947) 136-138.

$$t = \frac{u\sqrt{n-1}}{v} = \frac{(\bar x - \mu)\sqrt{n-1}}{s} \;\xrightarrow{\ \mu = 0\ }\; \frac{\bar x\sqrt{n-1}}{s} \qquad (1.19.4)$$

does not contain the unknown population variance . . . or the population mean, as
well, if the parent population is hypothesized to have a mean of 0, a situation
characterizing a null test (e.g. a test that some process has produced no effect
distinguishable from pure chance).
The derivation of the pdf $p_T(t)$ from the component pdfs

$$p_U(u) = \frac{1}{\sqrt{2\pi}}e^{-u^2/2} \qquad\qquad p_{V^2}\!\left(v^2\right) = \frac{\left(v^2\right)^{\frac{d}{2}-1}e^{-v^2/2}}{2^{d/2}\,\Gamma\!\left(\frac{d}{2}\right)} \qquad (1.19.5)$$

proceeds easily if one ignores the constant factors (i.e. just designates all constant factors by a single symbol c) and focuses attention only on the variables. In a subsequent chapter I discuss the distribution of products and quotients of random variables more generally, but for the present the solution can be worked out by a straightforward transformation of variables. The idea is to
(a) start with the joint probability distribution $f_{UV^2}(u, v^2) = p_U(u)\,p_{V^2}(v^2)$,
(b) transform to a new probability distribution $f_{TV}(t, v)$ where $t = u\sqrt{d}/v$,
(c) integrate over v to obtain the marginal distribution $p_T(t)$ of t alone, and
(d) determine the normalization constant c from the completeness relation $\int f_T(t)\,dt = 1$.
Execution of steps (a) and (b) by means of the transformation

$$f_{TV}(t,v) = f_{UV}(u,v)\left|\frac{\partial(u,v)}{\partial(t,v)}\right| = \frac{v}{\sqrt{d}}\,f_{UV}(u,v) \qquad\left(u = \frac{tv}{\sqrt{d}}\right) \qquad (1.19.6)$$

leads to

$$f(t,v) = c\,v^{d}\,e^{-\frac{v^2}{2}\left(1 + \frac{t^2}{d}\right)}, \qquad (1.19.7)$$

which by step (c) results in the marginal probability density

$$f(t) = c\left(1 + \frac{t^2}{d}\right)^{-\frac{d+1}{2}}. \qquad (1.19.8)$$

The integral in step (d) is not elementary, but can be worked out by means of contour
integration in the complex plane with use of the residue theorem. This calculation,
deferred to an appendix, leads to the density



$$p_T(t) = \frac{2^{d-1}\left[\Gamma\!\left(\frac{d+1}{2}\right)\right]^2}{\pi\sqrt{d}\;\Gamma(d)}\left(1 + \frac{t^2}{d}\right)^{-\frac{d+1}{2}} \qquad (-\infty \le t \le \infty). \qquad (1.19.9)$$

To my initial surprise when I first obtained it, relation (1.19.9) is not the Student t pdf

$$p_T(t) = \frac{\Gamma\!\left(\frac{d+1}{2}\right)}{\sqrt{\pi d}\;\Gamma\!\left(\frac{d}{2}\right)}\left(1 + \frac{t^2}{d}\right)^{-\frac{d+1}{2}} \qquad (-\infty \le t \le \infty) \qquad (1.19.10)$$

one gets by keeping track of all the constants in steps (a)-(d) above, and which is the form usually found in statistics textbooks. The two expressions (1.19.9) and (1.19.10) are entirely equivalent, however, although they do not look it. The demonstration of their equivalence, which requires showing that

$$\Gamma\!\left(\tfrac{1}{2}(d+1)\right)\Gamma\!\left(\tfrac{1}{2}d\right) = 2^{1-d}\,\Gamma\!\left(\tfrac{1}{2}\right)\Gamma(d) \qquad (1.19.11)$$

(note: $\Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}$), immerses one in the fascinating, if not bewildering, relationships of the beta function B(x, y), which we define and use in the next chapter in consideration of Bayes' problem (i.e. the problem that divided probabilists into two warring camps). The expression (1.19.11) is one form of an identity referred to as the Legendre duplication formula,10 often seen in the cryptic form


$$x!\left(x + \tfrac{1}{2}\right)! = 2^{-2x-1}\,\pi^{1/2}\,(2x+1)!, \qquad (1.19.12)$$

where fractional factorials are defined by means of the gamma function $x! = \Gamma(x+1)$, or the alternative form

$$\left(x + \tfrac{1}{2}\right)! = 2^{-x-1}\,\pi^{1/2}\,(2x+1)!!, \qquad (1.19.13)$$

where the double-factorial notation defines the products

$$(2n)!! = 2\cdot4\cdot6\cdots(2n) = 2^n\,n! \qquad\qquad (2n+1)!! = 1\cdot3\cdot5\cdots(2n+1) = \frac{(2n+1)!}{2^n\,n!}. \qquad (1.19.14)$$

The Legendre duplication formula is itself a special case of a remarkable identity


known as the multiplication theorem (or Gausss multiplication formula)

 



k1
1
2
k1
 z
2 2 k kx kz
z z z
k
k
k
1
2

1:19:15

for complex number z. Setting k 2 and z d/2 in (1.19.15) leads directly to


(1.19.11).

10

G. B. Arfken and H. J. Weber, Mathematical Methods for Physicists (Elsevier, Amsterdam, 2005) 522523.

1.20 Inference and estimation

45

There is no moment generating function for the t distribution (the integral


diverges), and the characteristic function is not particularly useful. One can calculate
moments directly with the pdf (1.19.10). The range of t is the entire real axis, but the
pdf is an even function of t, and therefore integration need be done only over the
positive real axis. For the first few moments one finds
hT n i 0 n 1, 3, 5 . . .


d
2T T 2
d


4
 2 
T
d2
:
KT 4 3
d4
T

1:19:16

In the limit d ! , which in practical terms means d  ~ 10, the moments approach
those of a standard normal distribution.
An alternative way to compute the moments of the t distribution is to exploit the
fact that the numerator and denominator of the ratio (1.19.1) are independent
variates and therefore the expectation of the quotient is expressible as the product
of two expectations
*
+
d n=2 Un
n
1:19:17
dn=2 hUn i hV n i:
hT i
Vn
Note that (1.19.17) is not equal to dn/2 hUni/hVni. Rather, one must calculate the
negative moments of V, which can be done by integration of the pdf or, as I discuss in
Chapter 4, by integration of the moment generating function. Not every distribution
has finite negative moments. The chi-square distribution is one that does.
In the upper panel of Figure 1.5 Student t distributions with d 3 and 10 degrees
of freedom are compared to normal distributions of the same mean (0) and variances
(3, 5/4). Over the range  d  2, the appearance of the t distribution does not
change greatly. At the scale of the figure, the Gaussian distribution of same variance
looks wider, but the appearance is deceiving. In the lower panel, which shows the
tails of the two distributions for d 3, the tail of the t distribution for coordinates
jtj > 5 is fatter, i.e. decreases more slowly and predicts a higher probability than a
normal distribution for the same t values.

1.20 Inference and estimation


In the study of random processes, two questions usually arise: (I) What probability
distribution characterizes the process in the most objective i.e. unbiased way
consistent with known information? (II) What are the parameters of this probability
distribution? The first question is an important example of inference; the second of
estimation. The principle of maximum entropy provides an answer to the first question.
The method of maximum likelihood addresses the second. Let us take first things first.

46

Tools of the trade


0.4

Probability

0.3

0.2

b
0.1

Outcome
0.02

Probability

0.015

Gaussian
0.01

0.005

Student t
4

4.5

5.5

6.5

7.5

Outcome
Fig 1.5 Top panel: Student t (solid) and Gaussian (dashed) densities for degrees of freedom
d 10 (plots (a), (b)) and d 3 (plots (c), (d)). Bottom panel: tails of the Student t (solid) and
Gaussian (dashed) densities for d 3.

1.21 The principle of maximum entropy


Entropy, together with energy, constitutes one of the two pillars upon which the
discipline of equilibrium thermodynamics the study (broadly speaking) of the
transformation of energy rests. Einstein had once remarked upon the robust nature
of thermodynamics in that if our theoretical understanding of the quantum structure
of matter should ever fail entirely, the principles of thermodynamics would remain

1.21 The principle of maximum entropy

47

valid and unaffected. This is so because thermodynamics is a consistent body of


macroscopic relationships not tied to an underlying model of matter. That attribute
is both its strength and its limitation.
The objective of a subject as vast in scope and application as thermodynamics is
not easily reduced to a few words, but the following statement by Herbert Callen
comes as close as any I have seen: The basic problem of thermodynamics is the
determination of the equilibrium state that eventually results after the removal of
internal constraints in a closed composite system.11 And how is one to determine
that equilibrium state? The solution lies in the concept of entropy, a function of the
extensive (i.e. size-dependent) variables of the system, which is itself additive over
constituent subsystems. In the absence of an internal constraint, the values assumed
by the extensive variables are those that maximize the entropy over the manifold of
all equilibrium states which might have been realized while the constraints were in
place. From this entropy maximum postulate plus a few definitions and some
empirical relations (equations of state) describing how matter behaves, unfolds the
mathematically elegant structure of equilibrium thermodynamics.
There is, however, a more fundamental statistical way to view the content of
thermodynamics. It is, again in Callens words12, the study of the macroscopic
consequences of myriads of atomic coordinates, which, by virtue of the statistical
averaging, do not appear explicitly in a macroscopic description of a system. From
this statistical perspective, the concept of entropy is detached from the workaday
measurable quantities of heat, work, temperature, and the like, and becomes instead
a measure of the distribution of the elemental constitutents of a physical system
over their available states. It is frequently said that entropy is a measure of order
(or disorder) in a system the greater the order, the lower the entropy but this is
an ambiguous relationship at best since there is no thermodynamic or statistical
mechanical order function. Moreover, examples can be adduced that refute the
association.13
In a thoroughly statistical treatment which physicists generally refer to as
statistical mechanics or statistical thermodynamics, depending on emphasis
expressions for the mean values and fluctuations of macroscopic thermal quantities
are derived from the characteristic energies (energy eigenvalues) of the particles
(nuclei, atoms, molecules . . .) of the system and the probability distribution of the
particles over their energy states (referred to as occupation probabilities). Out of this
grand scheme, which does depend on our understanding of the atomic structure of
matter, emerges a most remarkable expression for entropy
X
S kB
pi ln pi ,
1:21:1
i

11
13

12
H. B. Callen, Thermodynamics (Wiley, New York, 1960) 24.
Callen, op. cit. p. 7.
K. G. Denbigh, Note on Entropy, Disorder, and Disorganization, The British Journal for the Philosophy of Science
40 (1989) 323332.

48

Tools of the trade

where the sum is over all states of the system. Apart from a universal constant
(Boltzmanns constant kB) chosen so that corresponding statistically and thermodynamically derived quantities agree, S depends explicitly only on the occupation
probabilities. Implicitly, S is also a function of measurable physical properties of the
system because the equilibrium probabilities themselves depend in general on the
energy eigenvalues, the equilibrium temperature, and the chemical potential (which
itself may be a function of temperature, volume, and number of particles in the
system). Nevertheless, the connection between entropy and probability is striking.
One can in fact interpret the expression for S as proportional to the expectation value
of the logarithm of the occupation probability.
The identical expression, made dimensionless and stripped of all ties to heat, work,
and energy, was proposed by Claude Shannon in 1948 as a measure of the uncertainty in information transmitted by a communications channel.14 This was the key
advance that, nearly ten years later, permitted Ed Jaynes, in one of the most fruitful
and far-reaching reversals of reasoning I have seen, to develop an alternative way15 of
understanding and deriving all of equilibrium statistical mechanics from the concept
of entropy as expressed by Shannons information function
X
pi ln pi :
1:21:2
H
i

As Jaynes described it:


Previously, one constructed a theory based on the equations of motion, supplemented by
additional hypotheses of ergodicity, metric transitivity, or equal a priori probabilities, and the
identification of entropy was made only at the end, by comparison of the resulting equations with
the laws of phenomenological thermodynamics. Now, however, we can take entropy as our
starting concept, and the fact that a probability distribution maximizes the entropy subject to
certain constraints becomes the essential fact which justifies use of that distribution for inference.

The significance of Jaynes perspective was the realization that the structure of
statistical mechanics did not in any way depend on the details of the physics it
described. Rather, it was a consequence of a general form of pure mathematical
reasoning that could be employed on countless problems totally unrelated to thermodynamics. In particular, this mode of reasoning subsequently termed the principle
of maximum entropy (PME) can be used to answer Question I: What is the most
unbiased probability distribution that takes account of known information but
makes no further speculations or hypotheses? We have seen how the Central Limit
Theorem explains the apparently ubiquitous occurrence of the normal distribution.
The PME, as will be demonstrated, provides another reason.
14
15

C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal 27 (1948) 379423,
623656.
E. T. Jaynes, Information Theory and Statistical Mechanics, Physical Review 106 (1957) 620630; Information
Theory and Statistical Mechanics II, Physical Review 108 (1957) 171190.

49

1.23 Entropy and prior information

1.22 Shannon entropy function


Before examining the PME, it is instructive to see how the Shannon (or statistical)
entropy function (1.21.2) satisfies the properties one would expect of both entropy,
which is an extensive physical quantity, and probability. If A and B are two independent physical systems, then the total entropy of the combined system is additive:
H HA HB. By contrast, if pA(i) is the probability of occurrence of state i in
system A and pB(j) the probability of occurrence of state j in system B, then the
probability that the two independent states occur simultaneously is multiplicative:
p(i, j) pA(i)pB(j).
That the statistical entropy of the combined system behaves this way may be seen
as follows
X
X
H
pi, j lnpi, j 
pA i pB j ln pA i pB j
i, j
i, j
X
X
X
X
pB j
pA i ln pA i 
pA i
pB j ln pB j

j

i
1

HA HB ,

j
1

1:22:1

where the completeness relation was used to reduce the sums above the horizontal
brackets to unity. No other functional form has this property.

1.23 Entropy and prior information


To implement the PME to find an unknown probability distribution in a specific
problem one maximizes H subject to constraints posed by any prior information
about the system being studied. In the simplest cases, each constraint is introduced as an algebraic expression multiplied by an unknown factor known as a
Lagrange multiplier. The entire procedure is actually a fairly routine application
of a branch of mathematics known as the calculus of variations. Whereas in
standard calculus one finds the maximum or minimum values of a function,
in the calculus of variations one seeks a function that yields the extremum of
a functional.

1.23.1 No prior information


Consider first the simplest case of a discrete system with n states fxi i 1 . . . ng (or,
equivalently, a stochastic process with n possible outcomes per trial), each with a
probability of occurrence pi. If we have no prior information at all about the
probability distribution, other than that it must satisfy the completeness relation
n
X
pi 1, then the most unbiased entropy functional we can write takes the form
i1

50

Tools of the trade

H

n
X

pi ln pi 

1

i1

n
X

!
pi

1:23:1

i1

in which is a Lagrange multiplier. Seeking the extremum of H by setting the


derivative H/pj (for all j) to zero, leads to the uniform distribution pj e(1 ),
which, upon substitution into the completeness relation, gives pj 1/n. In other
words, if nothing is known beforehand about the system or process, then the most
unbiased distribution is one in which all outcomes are equally probable. This choice,
made intuitively (rather than derived systematically from an overarching principle)
by early developers of probability theory such as Laplace and Bayes, has been termed
the principle of insufficient reason or principle of indifference.
There are subtle, yet profound, issues connected with the question of how to frame
mathematically the proposition that one knows nothing about a system (. . . what
exactly is nothing? . . .) that have led to much of the fireworks between Bayesians
and frequentists. For now, let us sidestep the matter and examine a problem at the
next level of complexity.

1.23.2 Prior information is a single mean value


Consider the same system as before except that now, in addition to the completeness
relation, we have as prior information the mean value F of some function f(x) of the
states
F h fi

n
X

pi f xi 

i1

n
X

pi f i :

Finding the extremum of the entropy functional


!
!
n
n
n
X
X
X
pi ln pi  0 1 
pi  1 1 
pi f i ,
H
i1

1:23:2

i1

i1

1:23:3

i1

which now contains two Lagrange multipliers, one for each constraint, leads to an
exponential distribution
pj e10 e1

fj

e1 f j
,
Z 1

1:23:4

where the second equality, obtained by substitution of the first expression into the
completeness relation, displays the so-called partition function
Z 1

n
X

e1 f i :

1:23:5

i1

The value of the Lagrange multiplier 1 is determined (implicitly) from the second
constraint

51

1.23 Entropy and prior information


n
X

f i e1 fi

i1

F h fi  X
n


e1 f i

ln Z 1
:
1

1:23:6

i1

In most cases an analytical solution for 1 may not be possible.

1.23.3 Prior information is more than one mean value


If prior information consists of a set of m known mean values fFk  hfki k 1 . . . mg,
then the entropy functional will contain m 1 Lagrange multipliers (0, 1 . . . m) to
be determined from the m 1 equations of constraint by an obvious extension of the
previous relation
m
P

j f ji
j1

n
P

f ki e
ln Z 1 . . . m
i1

F k h f k x i 
m
P
k
n  j f ji
P
e j1

1:23:7

i1

with partition function


Z1 . . . m 

n
X

e1

f1i ...m fmi

1:23:8

i1

The term partition function, which a reader versed in physics will instantly recognize, is not misused here. It is, in fact, the partition function encountered in statistical
mechanics the symbol Z standing for the German expression Zustandsumme (sum
over states). In statistical mechanics the Lagrange multipliers have physical significance, being related to the temperature of the system (if the mean energy is part of the
prior information), the chemical potential of the system (if the mean number of
particles is part of the prior information), and other physical quantities depending on
the nature of the system and the assumed prior information. The partition function
contains all the statistical information one can know about a system in equilibrium.
For example, the second moments, cross-correlation, and covariance in a system for
which the mean values of two functions ff1(x), f2(x)g are known and Z Z(1, 2) take
the forms
n
X

fk x2 

f ki 2 e1

f 1i 2 f 2i

i1

n
X
i1

1 f 1i 2 f 2i

2 ln Z
2k

k 1, 2

1:23:9

52

Tools of the trade


n
X


f 1 x f 2 x 

f 1i f 2i e1

f 1i 2 f 2i

1 2 Z
Z 1 2

1:23:10


cov f 1 x f 2 x f 1 x f 2 x  f 1 x f 2 x




1 2 Z
1 Z
1 Z


Z 1 2
Z 1
Z 2
2 ln Z

:
1 2

1:23:11

i1

n
X

1 f 1i 2 f 2i

i1

1.23.4 Two-state system


The case of a two-state system (x1, x2) and single observable
f xi  f i i 1, 2

F  h fi

1:23:12

provides a physically interesting and tractable example in which the Lagrange


multiplier can be determined explicitly. The prior information consist of
(a) normalization
p1 p2 1

1:23:13

F p1 f 1 p2 f 2

1:23:14

(b) mean

and implementation of the PME results in the probabilities


p1

ef 1
ef 1 ef 2

p2

ef 2
ef 1 ef 2

1:23:15

and known mean value


F

f 1 ef 1 f 2 ef 2
:
ef 1 ef 2

1:23:16

The relations above permit one to solve for ef 2 f 1 and hence obtain


f 2 F
ln Ff
1

,
f2  f1

1:23:17

which is positive for f2  f1 > 0 and negative for the reverse. Elimination of then
leads to a partition function expressed directly in terms of F

Z F e

f 1

f 2

f2  F
F  f1

 f

f1
2 f 1

f F
2
F  f1

 f

f2
2 f 1

1:23:18

53

1.23 Entropy and prior information

and to probabilities

p1

f2  F
f2  f1


p2


F  f1
:
f2  f1

1:23:19

Note that once the partition function is expressed in terms of the mean values of
observables, then one cannot calculate moments, as in Eq. (1.23.7), simply by taking
derivatives of Z with respect to the Lagrange multipliers. In that case, the straightforward thing to do is construct the moment-generating function, which in the
present case becomes


1:23:20
gt e f t p1 e f 1 t p2 e f 2 t
and readily generates the moments
h f i
F
2 f 2  h f i2 f 2  FF  f 1 :

1:23:21

1.23.5 Prior information is mean and variance


As a final illustration of the maximum entropy principle, consider the original system
again where now our prior information comprises the completeness relation and both
the first (1) and second (2) moments of the observable quantity, which is itself the
variable X. The three equations of constraint are embedded in the entropy functional
by means of three Lagrange multipliers, leading to
!
!
!
n
n
n
n
X
X
X
X
2
H
pi ln pi  0 1 
pi  1 1 
pi xi  2 2 
pi xi :
i1

i1

i1

i1

1:23:22
However, this is not the most convenient form in which to find the extremum. Often
(perhaps even most often) the analysts interest is in moments about the mean. There
is no loss of generality, then, in defining the Lagrange multipliers differently in order
to rewrite the entropy functional in a way that reflects that interest
!
!
!
n
n
n
n
X
X
X
X
1 0
2
0
2
H
pi ln pi  0 1 
pi  1 0 
pi xi   2 
pi xi  :
2
i1
i1
i1
i1
1:23:23
For notational simplicity I dropped the subscript 1 from the label of the first moment
and combined the prior information to form a variance 2 2  21 . Since the sum in
the second bracket vanishes identically (by virtue of the expression in the first
bracket) irrespective of the probability distribution, it provides no new information
and therefore one loses nothing in simply setting 01 to zero. The procedure to
maximize the reduced entropy functional

54

Tools of the trade

H

n
X
i1

pi ln pi  0 1 

n
X
i1

!
pi

n
X
1
 0 2 2 
p i x i  2
2
i1

immediately yields a discrete probability distribution




2


2
0
2
1
pj / e 2 xj  2 ! p xj, 2 p ex 2
2

!
1:23:24

1:23:25

which, when transformed to an appropriately normalized continuous distribution,


becomes the normal distribution N(, 2).
In summary, illustrations of the principle of maximum entropy show that
(a) a uniform distribution (principle of indifference) results when one has no prior
information beyond the requirement that the total probability is unity;
(b) an exponential distribution, such as those that occur in statistical physics (e.g.
MaxwellBoltzmann, FermiDirac, BoseEinstein), results when the prior information consists of the mean values of functions of some stochastic quantity; and
(c) a Gaussian or normal distribution results when the prior information consists of
the first and second moments (or the first moment and variance) of some
stochastic quantity.
Under the assumed conditions in each case, the use of any other probability distribution would imply that either more information was known at the outset or that the
analyst has incorporated into the analysis an element of unjustified speculation.
1.24 Method of maximum likelihood
Two principal tasks of statistics are to test hypotheses and to estimate physical
quantities from data. Let us suppose that the data referred to in statistics as the
sample are the outcomes of n independent observations, each regarded as an
independent, identically distributed (iid) random variable Xi (i 1. . .n) with probability density (or in the discrete case a probability function) f(xj). In many cases it is
the parameter (or set of parameters) upon which the pdf depends, that is to be
estimated. The task of estimation, then, is to extract from the statistics of a sample
the true values of quantities characteristic of the full population. This population
may be a real one as, for example, in the census of a nation in which the total number
of people is generally too large for each person to be queried; hence a representative
random sample of people is selected for questioning. However, a set of repeated
measurements of the mass of an elementary particle can be imagined to be a sample
drawn from a hypothetical infinitely large population (or ensemble) of potential
measurements executed under equivalent conditions.
The ensemble mode of thinking is the point of view of orthodox statistics and the
basis of statistical mechanics as developed by J. Willard Gibbs, which is the approach
ordinarily taught in statistical mechanics courses. There is an alternative point of

55

1.24 Method of maximum likelihood

view based on Bayes theorem, which dispenses with the philosophical encumbrance
of ensembles and focuses exclusively on the data to hand, not those that did not
materialize. This divergence of thought constitutes one of the battlefronts in the
probability wars alluded to at the beginning of the chapter. Estimates based on the
two approaches do not always turn out to be the same. (Indeed, estimates made by
different orthodox procedures, do not necessarily turn out to be the same either.)
Philosophy aside, the differences between orthodox and Bayesian estimates derive
principally from what one does with the likelihood function. I will come back to this
point later in the chapter.
From the orthodox perspective, the likelihood function of n independent random
variables is defined as their joint probability density. Thus, if fxi i 1. . .ng is a
realization of the set of random variables introduced above, the corresponding
likelihood function would be
Ljfxi g f x1 j f x2 j . . . f xn j

n
Y

f xi j,

1:24:1

i1

where, in the general case, may stand for a set of parameters. The method of
maximum likelihood (ML), due primarily to geneticist and statistician R. A. Fisher16,
may be expressed somewhat casually as follows: The best estimate (usually) of the
parameter is the value ^ that maximizes the likelihood L(jfxig). This immediately
raises the question of what is meant by best.
It is said that a spoken language has many words of varying nuances for something of particular importance in the culture of the people who speak the language. If
that is true, then the concept of estimate is to a statistician what the perception of
snow is to an Eskimo (. . . or perhaps to a meteorologist). To start with, the
statistician distinguishes between an estimator , which is a random variable used
to estimate some quantity, and the estimate , which is a value that the estimator
can take. The orthodox statistician considers the quantity to be estimated to have a
fixed, but unknown, value, whereas the estimates of the estimator are governed by
some probability density function of supposedly finite mean and variance. The goal
of estimation is therefore to find an estimator whose expectation value yields the
sought-for parameter with the least uncertainty possible. With those points in mind:
 An estimator is unbiased if its expectation value hi equals the estimated
parameter .
 An estimator is close if its distribution is concentrated about the true value of the
parameter with small variance.
 An estimator is consistent if the value of the estimation gets progressively closer
to the estimated parameter as the sample size increases.

16

R. A. Fisher, Theory of Statistical Estimation, Proceedings of the Cambridge Philosophical Society 22 (1925) 700725.

56

Tools of the trade

 An estimator is minimum-variance unbiased if the variance of its pdf is the


lowest of all unbiased estimators. There is, in fact, a lower bound, known as the
CramerRao theorem, to the variance of an estimator that meets certain reasonable conditions regarding existence of the first and second derivatives of the
logarithm of the likelihood function.
 An estimator is asymptotically normal if its pdf approaches that of a normal
distribution with increasing sample size.
 An estimator is deemed efficient if, among a set of consistent, asymptotically
normal estimators of the same quantity, it has the minimum variance.
 And last (for our purposes), but of particular utility, is sufficiency, a concept also
due to Fisher. A statistic S is sufficient in regard to an unknown parameter if it
condenses the data (i.e. the sample) so as to contain all the information that the
sample can provide for estimation of that parameter. In other words, having the
single sufficient statistic, one cannot learn anything further about the unknown
parameter by knowing the individual values of the sample or by seeking other
estimators. Clearly, it is desirable that an estimator be a function of sufficient
statistics.
With this basic vocabulary, one can say of ML estimators that some are uniformly
minimum-variance unbiased, while others are not; that a sequence of ML estimators
is consistent and asymptotically normal with a variance equal to the CramerRao
lower bound; and that, if a sufficient statistic exists for the parameter to be estimated
(which is not always the case), the ML estimator must be a function of it. All in all,
for large sample size the ML estimate of is about as good as one may hope to find
although there may be others just as good.
From the perspective of a practical physicist, an especially attractive feature of the
ML method is the facility with which it delivers both the estimate and its uncertainty.
Noting that it is often easier to work with the logarithm of a sequential product of
functions (as in Eq. (1.24.1)) and that a function and its log are maximized at the
same point, we consider
L  ln L

n
X

ln f xi j,

1:24:2

i1

a quantity that some statisticians have termed the support function, but which
I will refer to simply as the log-likelihood. In the general case of m parameters f1 . . .
mg one must then solve the set of equations
n
f xi j=j
L X

0
j
f xi j
i1

j 1 . . . m:

1:24:3

The variance of each ML estimate and covariance of pairs of estimates are given by
the elements of a covariance matrix C H1, where Cjj 2j , Cjk cov (j, k) are
derived from the second derivatives of the log-likelihood

1.24 Method of maximum likelihood


Hjk  Hjk

2 L
j k

57


^

j, k 1 . . . m:

1:24:4

The symbol ^ appended to the bracket signifies that the second derivatives are to be
evaluated by substitution of the ML values of the parameters f^j g:
The preceding method for estimating uncertainty of the parameters follows
straightforwardly from the structure and interpretation of the log-likelihood function
expanded in a Taylor series about the ML values of its argument. For simplicity,
consider the example of two parameters:


2 
2  2
 X
 1X


L 
L 
^
^
^
i  i
i  ^i j  ^j 
LlnL1 ,2 L 1 , 2
i ^
2 i, j1 i j ^
i1


2

 1X



L ^1 , ^2
H ij i  ^i j  ^j 
2 i, j1

 1

 1
L ^1 , ^2 UT HU L ^1 , ^2  UT C1 U:
2
2

1:24:5

In the first line of the expansion, the term involving a sum over first derivatives of
L vanishes by virtue of the ML maximization procedure. The second line shows
the reduced expression with matrix elements of H substituted for the second
derivatives of L. The third line shows the equivalent expression in terms of the
parameter vector


1  ^1
U
1:24:6
2  ^2
(and its transpose UT) and the inverse of the covariance matrix C
 2

1
1 2
,
1:24:7
C
1 2
22
cov^1 , ^2
where the correlation coefficient is defined by  12 1 2 . The matrices H and
C are related as follows




1
H11 H 12
1= 21 = 1 2
1
C
:
1:24:8
H
H 21 H 22
1  2 = 1 2 1= 22
Upon neglect of derivatives higher than second, the likelihood function then becomes
proportional to the negative exponential of a quadratic form


1 T 1
1:24:9
L ^1 , ^2 jD / e2U C U ,
which is recognized
as a multivariable Gaussian function of the ML para
meters ^1 , ^2 and data D. For a single variable, the exponential (1.24.9)

58

Tools of the trade


^ 2 =2 2

^
reduces to the familiar form pjD
/ e
hood becomes


1


L ^1 , ^2 jD / e 2

. For two variables, the likeli-


2

2



H11 1  ^1 H 22 2  ^2 2H12 1  ^1 2  ^2

8
2 
2


9
>
=
< 1  ^1
2  ^2
1  ^1 2  ^2 >
1



>
2 2 2
1 2
2 21
;
:
e 1  2 >

1:24:10

and shows explicitly the connection between second derivatives of the likelihood
function and the uncertainties in parameter estimates. The preceding formalism is
readily generalizable to any number of parameters.
For illustration and later use consider a set of data fxi i 1. . .ng presumed to be a
sample from a Gaussian distribution of unknown mean 1 and variance 2 2.
The likelihood and its log take the forms
L

n 
Y


2 1=2 xi 2 =2 2
e


2 n=2

n
X

xi  2 =2 2

i1

1:24:11

i1
n

n 
1 X
xi  2 ,
L  log L  log 2 2  2
2
2 i1

1:24:12

from which follow the set of ML equations and their solutions


n
L
1X
2
xi  0
i1
n
L
n
1 X

xi  2 0
2
2 2 2 4 i1

^ 2

n
1X
xi
n i1

n
1X
xi  ^ 2 :
n i1

1:24:13
1:24:14

The ML estimator for is therefore the sample mean, a random variable defined by
the expression17
X

n
1X
Xi :
n i1

1:24:15

The expectation of X gives an unbiased estimate of the population mean (i.e. of the
location parameter in the theoretical probability density):
*
+
n
n


1X
1X
1
1:24:16
X 
Xi
hXi i n :
n i1
n i1
n

17

The overbar is used to represent both a sample average (random variable) and negation (hypothesis); the two uses are
very different and should cause no confusion.

1.24 Method of maximum likelihood

59

This is not the case, however, for the ML estimator of population variance defined by
S0
2

whose expectation value is

n1
n

n 
2
1X
Xi  X
n i1

1:24:17

2 . It is the sample variance defined by

S2

n 
2
1 X
Xi  X
n  1 i1

1:24:18

whose expectation gives the unbiased value 2. The bias in S0 2 arises from the
presence in the sum of squares of the statistic X, which is a random variable, in
contrast to the population mean , which is an unknown, but fixed, parameter. To see
this, note that the partition of the sum18
n
X

Xi  2

i1

n 
X

n 
 
2 X
2

2
Xi  X X 

Xi  X  n X 

i1

leads to a relation of expectation values


*
+ *
+
n 
n
D
X
X
2
2 E
2

Xi  X
X i   n X 
i1

1:24:19

i1

1:24:20

i1

that reduces to
 2

n  1 2
n  1 S2 n 2  n
n

2
S 2:

1:24:21

Practically speaking, for a large sample there is no statistically significant difference


between the two estimates S2 and S0 2. However, S2 as a definition of sample variance
makes more sense on logical grounds because a sample size of n 1 can have no
variance. In that case S2 is (correctly) undefined, whereas S0 2 0 is (falsely) suggestive
of no uncertainty.
The mixed second derivatives of L in (1.24.12) lead to the matrix H

9
2 L
n
>
>
H11 2
 2
>
>
1
0
^ , ^ 2
^
>
n
>

>
2
=

0

L
2
C
B
H12 H21
0
) H @ ^
n A
2 ^ , ^ 2
>
0
 4
>
>

>
n h
i
2^

2 L
n
1 X
n >
>
2
>
4 6
xi  ^
 4 ;
H22

2

^ i1
2^

2 ^ , ^ 2 2^
1:24:22
18

The cross term in the binomial expansion of the expression in square brackets vanishes identically as a consequence of
the definition of relation (1.24.16).

60

Tools of the trade

the negative inverse of which yields the covariance matrix whose elements constitute
the variances of the ML parameters
var^

^ 2
n

1:24:23

  2^
4
var ^ 2
n

1:24:24

with zero covariance. This means that the ML estimators derived above are independent, asymptotically normal random variables of the forms




^ 2
4
2 2^
,
1 N ^
2 N ^ ,
:
1:24:25
n
n
Note, as pointed out previously, that the variance of the mean is smaller than the
variance of a single observation by the factor n [a relation also contributing to
Eq. (1.24.21)].
The property of normality and the variance (1.24.23) of the ML estimator X are
actually valid statements irrespective of the size n of the sample. However, the exact
variance of the ML estimator S0 2 can be shown to be 24(n  1)/n2, which asymptotically reduces to the expression in (1.24.24). The explanation for this is that the exact
n 
X
2
Xi  X , which is propordistribution of the variance of the sample mean, 1n
i1

2
n 
X
Xi  X
p
constructed to be the sum of the squares of n standard
tional to a form
= n
i1
normal random variables, is not Gaussian, but a chi-square distribution 2n1 . There
are n  1, rather than n, degrees of freedom because the sample mean X is itself
calculated from the data and, once known, signifies that only n  1 of the set of
variates fXig are independent.
One last point of interest in regard to the variances of the ML estimates for
and 2 is to see how they compare with the lower bound of the CramerRao
theorem, which can take either of the two forms below for an estimate of a
function ().
d=d2
d=d2
E
2
:
var CR D
n log f Xj=2
n log f Xj=2

1:24:26

Since () in this case, the derivative in the numerator becomes 1. Given a


Gaussian pdf with natural logarithm


1
x  2 1
lnf xj, 2  ln 2 
 ln2 ,
2 2
2
2

1:24:27

61

1.25 Goodness of fit

the first equality of (1.24.26) reduces to the expressions


1
2
E
var D
2

n
n X
4

1:24:28

 
1
4 4 =n
2 4
D
E

,
var 2 





2
4
2
n
2
1  2 X
X
n  21 2 X

4
2

1:24:29

where use was made of the expectations hZ2i 1 and hZ4i 3 of the standard
normal variable Z (X  )/. Comparison with (1.24.23) shows that the ML
variances of the Gaussian parameters are as small as theoretically possible. The
same minimum variances would have been obtained had we used the second
equality in (1.24.26).

1.25 Goodness of fit: maximum likelihood, chi-square, and P-values


Suppose we have made n observations of some randomly varying quantity X that at
each observation could take any one of K values fAk k 1. . .Kg. We have, therefore,
a multinomial distribution of frequencies fnkg of outcomes sorted into K classes with
K
X
nk n and probability function
the constraint
k1

Prfnk gjfpk g n!

K 

Y
pn
k

k1

nk !

for the totality of n trials. In general, apart from the completeness relation

1:25:1
K
X

pk 1,

k1

we might not know the probability pk for an outcome to take the value Ak, but we can
do two things: (a) estimate the maximum likelihood (ML) probabilities from the
frequency data, and (b) make a theoretical model of the random process that has
generated the data. Consider first the ML estimate.
In the case of a large sample size n, the log-likelihood function of the multinomial
expression (1.25.1) can be written and simplified as shown below
!
K
K
K
X
X
Y
pnk k

nk ln pk 
ln nk ! ln n!
L ln L ln n!
n!
k1 k
k1
k1
1:25:2
K
K
X
X

nk ln pk 
nk ln nk n ln n,
k1

k1

where we have approximated the natural log of a factorial n! by the two largest terms
[ln n! ~ n ln n  n] in Stirlings approximation

62

Tools of the trade



pnn
1
1

n! e 2n
1
 :
e
12n 288n2

1:25:3

To maximize L with respect to the set of parameters fpkg given only the completeness
relation, we introduce a single Lagrange multiplier to form the functional
L0

K
X

nk ln pk 1 

k1

K
X

!
pk

1:25:4

k1

with omission of all terms not containing the parameters since they would vanish
anyway from the ML equations
L0 nk
0
pk pk

k 1. . . K :

1:25:5

Substitution of the solution ^


p k nk = into the completeness relation leads to n
and therefore to the ML estimates
^
pk

nk
:
n

1:25:6

It is worth stressing that the set of probability parameters f^p k g arrived at by


the foregoing procedure give the largest value to the likelihood function (1.25.2);
no alternative set of probabilities yield a larger value.
Suppose now we were to model the random process by some probability
function f(xj), which depends on parameters whose values may be unknown
at the outset. Let fk  f(Akj) be the hypothesized probability that an observation results in the outcome Ak. We need some way to estimate the optimum set
of parameters for the given model referred to in statistical parlance as the
null hypothesis and then ascertain whether the model credibly accounts for
the data. As before, a suitable way to do this would be to calculate the ML
estimates
^ of the parameters and then compare the likelihood of the model

^
L fnk gjffk g to the maximum likelihood Lfnk gjf^p k g attainable by any alternative model. Substitution of the ML estimates f^p k g of Eq. (1.25.6) into
Eq. (1.25.1) leads to a relatively simple expression for the ratio of the two
likelihood functions
" # nk
"
#nk
K
K
Y
f^k
f^k
L0
Lfnk gjff^k g Y
n


n
Lmax Lfnk gjf^
pk g k1 nk =n
n
k1 k

1:25:7

1.25 Goodness of fit

63

because the products of factorials in numerator and denominator cancel. The log of
the ratio then yields a relation
!
"
!#


K
K
X
X
f^k
f^k
L0
nk ln
nk ln n ln
ln
n ln n

Lmax
nk
nk
k1
k1
!
K
X
nk
 nk ln
nf^k
k1

1:25:8

from which one can calculate how likely the null hypothesis is in comparison to the
maximum likelihood.
An advantage to the use of the likelihood ratio for comparison of two
hypotheses or models is that it is invariant under a transformation of parameters. For example, if you wanted to test whether the parameter 1 or 2
characterized a set of data believed to be drawn from a distribution with pdf
2
/ ex= , the likelihood ratio would be the same if, instead, you transformed the
distribution by 2 and then tested for parameters 1 and 2. The example is a
trivial one, but the conclusion still holds in the general case of more complicated
transformations of a multi-component parameter vector. The reason for the
invariance is that the likelihood ratio is a value at a point, rather than an
integral over a range.
That same asset can become a disadvantage, however, to using Eq. (1.25.8) for
inference because the distribution function associated with the likelihood ratio in
specific cases may be difficult or impossible to determine and so to say that one
model is 50% as likely as another does not tell us how probable either is. The power
of a statistical test of inference is defined to be the probability of rejecting a
hypothesized model when it is correct i.e. when the parameters of the model are
the true but unknown parameters of the distribution from which the data were
obtained. A test is the more powerful if it can reject the null hypothesis with a lower
probability of making a false judgment. In a significance test of a model, an ideal
power function would be 0 if the parameters of the model corresponded to the true
parameters, and 1 otherwise. In general, the likelihood function is not a probability
but a conditional probability density, a fact that is a virtue to some and a liability
to others.
With the adoption of a few approximations and some algebraic rearrangements,
the final expression in (1.25.8) can be worked into a form with a known distribution
irrespective of the null hypothesis. To see this, start by
(a) adding and subtracting 1 in the argument of the logarithm,
(b) adding and subtracting n^f k in the pre-factor, and
(c) dividing and multiplying the entire summand by n^f k

64

Tools of the trade

L0
ln
Lmax

K
X

!
!
nk  nf^k nf^k
nk
ln 1
1
nf^k
nf^k

n^f k

k1
K

1:25:9

n^f k 1 k ln1 k

k1
^

fk
so as to express the log-likelihood ratio in terms of a quantity k  nknn
, expected to
f^
k

be small if the null hypothesis is credible. Expanding (1.25.9) in a Taylor series in k




L0
ln
Lmax




K
X
k1



1 2 1 3
^
nf k k k  k    ,
2
6

1:25:10

recognizing that the linear term vanishes identically


K
X
k1

nf^k k

K 
X

K
X

f^k n  n 0,
nk  nf^k n  n

k1

1:25:11

k1

and truncating after the quadratic term, we obtain an expression




K
L0
1X
nk  nf^k 2
1
  2d
L  ln

^
Lmax
2 k1
2
nf k

1:25:12

identified as a sum of K chi-square random variables of some number d of degrees of


freedom to be specified momentarily.
The justification for the interpretation derives from unstated assumptions that (a)
the number of observations n and classes K are both reasonably large (with n
K), in
which case (b) the probability ^f k of a particular outcome Ak is fairly small and
approximately Poissonian, whereupon (c) n^f k is an acceptable measure of the variance of frequency Nk whose realization is the observed nk. We have seen previously in
Eq. (1.13.7) that a multinomial distribution such as we have begun with in (1.25.1)
results from the conditional probability of observing K independent Poisson variates
whose sum is a fixed quantity, a connection first pointed out by Fisher.
If these assumptions hold, then 2L in Eq. (1.25.12) corresponds to a sum of the
squares of K standard normal variates Z2k N k  hN k i2 = 2Nk , which, if all are independent, would be equivalent to a chi-square variate of K degrees of freedom.
However, because the frequencies Nk are constrained to sum to n, only K  1
can be independent. Moreover, if the data were used to estimate the parameters
fj j 1. . .mg, then the number of degrees of freedom would be reduced by 1 for
each estimate. We may therefore take the statistic
QK 

K
X
n
k1

 n^f k 2
) 2dK1m
n^f k

1:25:13

1.25 Goodness of fit

65

to have an asymptotic chi-square distribution of d K  1  m degrees of freedom.


Equation (1.25.12) now allows us to assign some measure of probability to a level of
likelihood. Recall that the expected value of a chi-square variate 2d , based on chance
alone, is d with a variance 2d. Thus, simply by looking at the value of 2d in
comparison to d, one can get an idea of the credibility of some hypothetical model.
For example, if a hypothesized model, whose parameters are known and not
determined from the data, is 50% as likely as the maximum likelihood when tested
against data grouped into two classes which means one degree of freedom it
follows (given the approximations leading to (1.25.12)) that the chi-square corresponding to this observation is 2obs 2 lnp
L0 =Lmax 2 ln 2 1:386. Since the
expected value is 1 with a standard deviation 2 1:414, it looks like the hypothesized model is not improbable. We will quantify this shortly.
Suppose, however, that the model was found to be 50% as likely as the maximum
likelihood when tested against data grouped into a large number of classes, so that
d
1. In the limit of very large d, the distribution 2d approaches a normal distribution N (d, 2d), as demonstrable theoretically from the mgf and shown graphically in
Figure 1.4. The standard normal q
variable
corresponding to a relative likelihood of

d
p

1,
which
would indicate a highly improbable
50% would then be z j1:386dj

2
2d
result for the hypothetical model. Clearly, the same value of likelihood can lead to
radically different values of probability and therefore to different inferences
regarding statistical significance.
From the foregoing discussion and in particular Eq. (1.25.12) it is seen that the
chi-square test of a theoretical model is not independent of the maximum likelihood
method, but follows from it as an approximation. Maximization of the log-likelihood
ratio corresponds to minimization of the resulting chi-square statistic. If, as an
additional approximation, there was justification in assuming that the variance was
the same for each class, then the denominator could be taken outside the sum in
relation (1.25.13). Moreover, if the parameters had not been estimated by
ML
 the 
method at the outset, then the function fk()  f(Akj) would replace ^f k f Ak j^ in
the sum. Under these conditions, maximizing the log-likelihood ratio corresponds to
K
X
minimizing the statistic Q0K 
nk  nf k 2 with respect to the unknown
k1

parameters. This is the familiar estimation method known as least squares.


Although the line between the two tasks may be blurred at times, statisticians and
scientists recognize a distinction between inference (the testing of models) and
estimation (the determination of parameters for an assumed model). Chi-square tests
of models find widespread use precisely because there is a known distribution19

19

Statisticians may cringe, but, as a physicist, I am using the same symbol 2d for the random variable, its observed value
in a given situation, and the associated distribution. Hopefully this economy of notation will add to the clarity of the
text rather than to confusion.

66

Tools of the trade

(namely, 2d ) by which to gauge whether a discrepancy between predictions of a model


and the specific set of observations has arisen purely by chance.
Perhaps the most widely practiced means of making this judgment, however, is by
calculating the P-value corresponding to the statistic 2d in (1.25.13). This is the
cumulative probability


P Pr >
2

2obs

2obs

xd=2 ex=2
2d=2 d=2

dx

1:25:14

that subsequent sets of observations would yield chi-square values at least as large as
the observed value 2obs for the given number d of degrees of freedom. A small value
of P, corresponding to 2obs
d, is ordinarily interpreted as signifying that the
discrepancy between theory and observation is not likely to have arisen by chance
alone and therefore the proposed model may not be a good one. Fisher had initially
adopted and statisticians have subsequently largely followed a standard of 5%
(i.e. P < 0:05) as a threshold for rejecting a particular model. So entrenched is the use
of P-values in the scientific literature that any manuscript containing a statistical
analysis of experimental results is likely to be rejected by a reviewer or editor if
P-values are not part of the tests of significance.
There are a number of issues, however, surrounding the concept of P-values and the
use of chi-square tests in general, that have generated over the years a vast volume of
commentary by statisticians. I will summarize briefly what to me are the most pertinent
and provocative criticisms, which indeed need to be borne in mind if errors of analysis
and interpretation are to be avoided, and I will append my own commentary at the end.

1.25.1 No significance to low P?


One of the principal critics of P-values has been the geophysicist Harold Jeffreys
whose landmark treatise on probability20 helped put statistical theory on a firm
foundation. In Jeffreys words
If P is small, that means that there have been unexpectedly large departures from prediction. But
why should these be stated in terms of P? The latter gives the probability of departures, measured
in a particular way, equal to or greater than the observed set, and the contribution from the
actual value is nearly always negligible. What the use of P implies, therefore, is that a hypothesis that
may be true may be rejected because it has not predicted observable results that have not occurred.
This seems a remarkable procedure. On the face of it the fact that such results have not occurred
might more reasonably be taken as evidence for the law, not against it. [Italics are Jeffreys.]

At root, the issue raised by Jeffreys is this: In judging the credibility of a model and
its alternative(s) is it more appropriate to take (a) the area under the tail of the
20

H. Jeffreys, Theory of Probability 3rd Edition (Oxford University Press, London, 1961) (1st Edition 1939).

1.25 Goodness of fit

67

chi-square distribution (i.e. the P-value) or (b) a ratio of the ordinates (i.e. pointvalue) of the probability density of the statistic?

1.25.2 No significance to high P?


In the diametrically opposite situation where a significance test of a model has
resulted in a value of 2d considerably less than the expected value d for the assumed
number of degrees of freedom, the corresponding P value is close to 1. Does this
mean we are to reject a model because it accounts for the observations too closely?
The situation has engendered a variety of replies from statisticians, generally to the
effect that in nearly every instance the investigators have done something wrong21
for example, to have made numerical errors in computation or to have biased their
data inadvertently or intentionally and therefore the results are too good to be
true.22 A different interpretation, principally by Edwards,23 is that the chi-square
test is essentially a test concerning the overall variance of a model in contrast to the
mean. According to Edwards
The crucial question the experimenter must ask himself before applying 2 is if I get a very
small value, will it make me suspicious about my null hypothesis? If the answer is no, then his
interest is in means and not variances, and the 2 test is inappropriate.

A low value of 2, therefore, according to Edwards would not be indicative of a fit


that is too good; rather, it would suggest that a model leading to a variation smaller
than Poissonian would be better.

1.25.3 No significance to any P since the whole 2business is arbitrary?


Statisticians have long remarked upon the fact that the number of classes and their
boundaries are arbitrary choices at the disposal of the investigator and that different
choices can result in radically different values of 2 and P for the same data set. How,
then, can a test of significance be significant if you can get any desired outcome? In
the administration of a chi-square test, class boundaries are ordinarily chosen so that
all class intervals are equal with the consequence that the number of samples in each
class diminishes the further the class value is from the mean. As a step towards
rendering the chi-square test less arbitrary, some statisticians have proposed defining
classes of unequal widths with boundaries calculated to lead to equal frequencies.24
However, this modified procedure has its own difficulties.
21
22
23
24

W. G. Cochran, The 2 Test of Goodness of Fit, The Annals of Mathematical Statistics 23 (1952) 337.
G. U. Yule and M. G. Kendall, An Introduction to the Theory of Statistics (Griffin, London, 1940) 423.
A. W. F. Edwards, Likelihood (Johns Hopkins, Baltimore, 1992) 188. Original Cambridge edition 1972.
H. B. Mann and A. Wald, On the choice of the number of class intervals in the application of the chi square test,
Annals of Mathematical Statistics 13 (1942) 306317.

68

Tools of the trade

1.25.4 Why bother with 2 anyway since all models would fail
if the sample is large enough?
The claim has been made that, in testing a null hypothesis which is not expected to be
exactly true, but credible to a good approximation, the hypothesis will always fail a
chi-square test applied to a sufficiently large sample of experimental data. Phrased
provocatively, one statistician wrote25
I make the following dogmatic statement, referring for illustration to the normal curve: If the
normal curve is fitted to a body of data representing any real observations whatever of
quantities in the physical world, then if the number of observations is extremely largefor
instance, on the order of 200,000the chi-square P will be small beyond any usual limit of
significance.

The conclusion, therefore, cited by a second acquiescing statistician,26 was What is


the point of applying a chi-square test to a moderate or small sample if we already
know that a large sample would show P highly significant?. Recall that a highly
significant P means that we can with justification reject the null hypothesis so in a
sense this criticism is the opposite of the third, which ascribes no significance to P.

Before adding my own two cents, first an admission: I have selectively quoted
comments from statisticians so as to frame their remarks in the most confrontational
way to highlight issues that I believe really are important and deserve careful
attention. No statistician, however at least none whose papers I have read
actually recommended discarding the chi-square test. No experimental physicist
would in any event do that because the test is far too useful and easily implemented
(. . . and required for publication).
Much of the confusion that may accompany use of a chi-square test can be
avoided by keeping in mind that the original test statistic followed a multinomial
distribution (1.25.1) from which the chi-square statistic arose in consequence of three
approximations: (1) Stirlings approximation of factorials; (2) Taylor expansion of a
natural logarithm; and (3) substitution of a continuous integral for a discrete summation. So long as each expectation nfk of the tested model f (xj) is reasonably large,
the reduction is reasonably valid, and the chi-square statistic (1.25.13) is distributed
as 2d to good approximation. If necessary, one may combine classes to achieve a
suitable expectation, which for satisfactory testing should be no fewer than about
510 as a rule of thumb. There was nothing in the derivation, as far as I can see, that
subsequently restricted the chi-square test of significance to the variance of a model
to the exclusion of all other attributes.
25
26

J. Berkson, Some difficulties of interpretation encounered in the application of the chi-square test, Journal of the
American Statistical Association 33 (1938) 526536.
W. G. Cochran, op cit. p. 336.

1.25 Goodness of fit

69

The arbitrariness of classes and boundaries arises only in testing the significance of
a continuous distribution, for in the case of a discrete distribution where specific
objects are counted (e.g. photons, electrons, phone calls . . . whatever), there is a
natural, irreducible assignment of classes whereby each class differs in integer value
from the one that comes before or after by one unit. This may not be the most
practical choice for every test, since it may require a very large sample size, but
conceptually, at least, it establishes a non-arbitrary standard.
In the case where data arising from a discrete distribution have been approximated
by (or transformed into) continuous random variables, there is a simple procedure
for avoiding a ridiculously large and statistically unwarranted chi-square. Statisticians have pointed this out long ago,27 but, unaware of their papers, I discovered it
for myself in testing a distribution of counts from a radioactive source. The experience makes for a lesson worth relating. The counts, which were all integers believed
on theoretical grounds to be Poisson variates, decreased (on average) in time as the
experiment progressed because of the diminishing sample of nuclei. In the next
chapter I will discuss in detail the statistics of nuclear decay. For now, however,
suffice it to say that a standard procedure in the analysis of nuclear data is to remove
the negative trend line in order to examine the variation in counts as if the population
of radioactive nuclei were infinite. In de-trending the data, however, the transformed
numbers were no longer integers. Sorted into 90 classes, the data were tested for
goodness of fit by a Poisson distribution of known mean, leading to an astounding
result of 289 > 1600, where a number around 90 was expected. A previous test on
the original (not de-trended) data had given highly satisfactory results. What
went wrong?
The 90 classes fAk k 1. . .90g were labeled by the number of counts obtained in a
specified window of time (one bin of data); thus A1 150, A2 151, A3 152, etc. In the
test on the de-trended data, the frequency of outcomes x for k 1 > x  k was compared
with the Poisson probability for Ak and this gave a very high chi-square, suggesting
that the null hypothesis (namely, the data were Poisson variates) was untenable.
However, if the class values were shifted by 0.5, so that the central value of each class
was an integer i.e. k 12 > x  k  12, the chi-square of the de-trended data became
85.14 for 90 classes, corresponding to P 0.596, which was entirely reasonable.
One must likewise be aware of the circumstances under which a discrete distribution is approximated by a continuous one. Return to the previous example where
data originated as integer counts of particles from a sample of radioactive nuclei. The
mean number of counts x per bin being much larger than 1, the hypothesized Poisson
distribution Poi (), with population mean estimated by the sample mean x, should
have been well approximated by a normal distribution N x, x. However, a chi-square
test of the goodness of fit of N(0, 1) to the data in standard normal form
27

M. G. Kendall and A. Stuart, The Advanced Theory of Statistics Vol. 2: Inference and Relationship (Griffin, London,
1961) 508509.

70

Tools of the trade

p
z x  x= x led to so high a value of 2 that the presumed model would have been
unambiguously rejected. Again, what went wrong?
The problem in this instance lay not with locations of class boundaries, but with
the widths of class intervals. The transformed data z are not integers, but neither are
they continuously distributed. Since the values of the counts x are always integer, the
values of z can have a minimum separation of x1=2 . Thus, if one makes the bin width
smaller than that minimum, there can result numerous bins of 0 count, which causes
failure of the chi-square test. With adequately sized bin widths, a value of chi-square
and associated P-value were obtained that did not justify rejection of the null
hypothesis. Note that there was nothing intrinsically wrong with applying the test
to a continuous distribution so long as one took steps to insure that the data being
tested actually were continuously distributed. Nor does the fact that I could get either
a high P or low P by changing the size of the bins imply that the test outcomes were
arbitrary and therefore meaningless. On the contrary, the low P-value resulted
from executing the test under conditions that were inappropriate in two related ways:
(a) testing goodness of fit of a continuous distribution to quasi-discrete data which
resulted in (b) violation of an approximation leading to the chi-square statistic (i.e.
no empty bins).
The same suite of investigations convinced me that the assertion that any model
fitted to a body of data representing . . . quantities in the physical world would fail a
chi-square test, given a sufficiently large (e.g. > 200 000) number of observations was
entirely without foundation. If the model is a true representation of the body of
data i.e. the model captures the essential features of the stochastic process that
generates the data then a chi-square test can yield a respectable P-value for any
sample size. In testing, for example, 1 000 000 standard normal variates, sorted
into 400 classes, for goodness of fit to N(0, 1), I have obtained 2399 419, giving
P 0.236.
However and here is a point of critical importance that all too often seems
to have been overlooked in the confused wrangle over the meaning or worth of
P-values the quantity P is itself a random variable. As a cumulative probability [see
Eq. (1.25.14)] P is governed by a uniform distribution [see (1.17.12)] with mean 12 and
variance 121 . Therefore, obvious though it may be to state this, one should not expect
too much from a single P, any more than is to be expected from a single nuclear count
or the reply of a single respondent to a poll. That does not mean that either P or 2 is
not useful. Rather, if an inference to be made is important, then it is incumbent upon
the investigator to collect sufficient data even if that means more time-consuming
experiments and fewer publications to determine how the P or 2 is distributed. If
discrepancies between the hypothetical model (null hypothesis) and the data are due
to pure chance, then, although a range of P values from low to high will be obtained
from numerous experimental repetitions, they should nevertheless follow a uniform
distribution. By contrast, if a proposed model is a poor one, the P-values should
nearly all be low.

71

1.25 Goodness of fit

Table 1.4

Chi-square test of Poisson variates

Statistics

289

Mean
Standard Error
Median
Standard Deviation
Skewness
Kurtosis
Minimum
Maximum
Count

87.03
2.22
84.83
15.69
0.172
0.203
53.82
125.87
50

P
0.541
0.046
0.605
0.322
0.219
1.376
0.0062
0.999
50

Consider, as an illustration of the preceding homily, a suite of chi-square tests that


were performed on 50 samples of nuclear decay data, each sample comprising one
million bins of data, presumed to be independent, Poisson-distributed variates (the
null hypothesis) sorted into 90 classes. As shown in Table 1.4, the 50 chi-square tests
yielded the following statistics on both 289 and P.
Note that a minimum Pmin 0.0062 was obtained without there being any
justification for rejecting the null hypothesis; that a maximum Pmax 0.999 was
obtained without any computational errors having been made or my having lied
about the results; that the sample mean P 0.541 and standard error (standard
deviation of the mean) sP 0:046 areqin
excellent agreement with their respective
theoretical values hPi 0.500, hPi 1=12
50 0:041. The upper and lower panels of
Figure 1.6 respectively show histograms of the observed 289 and P-values sorted into
10 bins with the theoretically expected results superposed. This outcome of a series of
50 chi-square tests can itself be tested for significance by a chi-square test (where now
we have nine degrees of freedom). The results
test on chi-square
test on P

2obs 10:33; P 0:324


2obs 9:8; P 0:367

support the null hypothesis that distribution of chi-square values arose through pure
chance. Had I performed only a single test (rather than 50) of the Poisson variates
and obtained a particularly low or high P-value, statisticians (e.g. those writing
cautionary philosophical commentaries) would have had grave doubts about the
randomness of the nuclear decays. And yet, because P is distributed uniformly (see
lower panel of Figure 1.6), a P-value is just as likely to fall between 0.0 and 0.1 as
between 0.4 and 0.5.
The lesson in all this if there is one is that ambiguous or troubling outcomes to
chi-square tests often stem from insufficient data, a problem that can be solved by
experiment, not philosophy.

72

Tools of the trade

Frequency

15

10

0
50

60

70

80

90

100

110

120

130

140

0.9

150

Chi-Square

Frequency

10

0
0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

1.1

P Values
289

Fig. 1.6 Histogram of observed


(top panel) and associated P-values (bottom panel) from
tests of 50 samples of 22 Na decay data, each sample comprising 106 bins. Small circles (top) and
dashed line (bottom) show theoretically expected values.

1.26 Order and extremes


The issues surrounding high and low P-values raise more generally a question of
how to recognize when a particularly large or small value of some experimental
observation or computational test is likely to have arisen by chance. This is a
matter of what is known as order statistics. Suppose X1,X2,. . .Xn are independent, identically distributed random variables, each with cdf F(x) and pdf f(x)
dF(x)/dx, representing some set of observations. The variables can then be
arranged in order of their increasing values Y1  Y2      Yn. Since the ordered
set fYi 1. . .ng consists of the same numbers (only in different order) as the
original set fXig, there can be no difference in the mean and standard error.
However, the Ys are clearly not independent, since if Yi < y, then all (Yi 1, Yi2. . .
Y1) < y, and there is no reason to expect the Ys to be distributed in the same way
as the Xs. Indeed, in contrast to the Xs, which are identically distributed, the pdf
fY i y is different for each order statistic Yi.

1.26 Order and extremes

73

To determine f Yi y, it is convenient to derive first the cumulative distribution


function
FY i y Pr Y i  y
Pr Y i  yjY i1 > y Pr Y i1  yjY i2 > y
Pr Y i2  yjY i3 > y    PrY n  y:

1:26:1

In words, the preceding expression conveys the idea that the total probability for the ith
order statistic to be less than or equal to y is a sum over the probabilities of mutually
exclusive events whereby the condition Yi  y is met at most by Yi or by Yi1 as well, or by
Yi2 as well, . . . or by Yn as well. The occurrence of any one of these events for example,
the event that Yj  y, but Yj1 > y  signifies that j out of the n random variablesfX
ig
n
satisfy this inequality, and nj do not, a condition that could have occurred in
j
different ways. Since the probability that a specific selection of j variates is less than or
equal to y and the remaining nj of them are greater than y is F (y)j [1F(y)]nj, the
probability for all such selections is given by the binomial expression
 


n
Fy j 1  Fy nj ,
1:26:2
Pr Y j  yjY j1 > y
j
whereupon the total probability in (1.26.1) is given by the sum (starting with index i)
n  
X
n
FY i y
1:26:3
Fy j 1  Fy nj :
j
ji
Of particular interest are the extreme order statistics Y1 and Yn, which are deducible
immediately from (1.26.3)
n  
X
n
Fy j 1  Fy nj
FY 1 y
j
j1
 
n  
X
1:26:4
n
n
j
nj
Fy 1  Fy 

Fy 0 1  Fy n
j
0
j0
1  1  F y n
 
n
FY n y
Fy n 1  Fy 0 Fyn :
n

1:26:5

The cdfs for the lowest and highest order statistics, in fact, could have been deduced
immediately. Consider Y1. The probability that a variable Y (one of the Xs) is greater
than or equal to y is 1  F (y), and so the probability that all n variables are greater
than y is (1  F(y))n. Thus, the probability FY1 y that at least one of the variates is
less than or equal to y is 1  (1  F(y))n. We will see this kind of reasoning again when
we examine the elementary theory of nuclear decay. As for Yn, since the probability
that one of the variables is less than or equal to y is F(y), it should be fairly evident
that the probability that all n variables are less than order equal to y is (F(y))n.

74

Table 1.5

Tools of the trade

Extreme order statistics for n variates U(0,1)

Statistic

Density

Expectations

Y1

fY 1 y n 1  yn1

1
hY 1 i n1

2
Y 1 n12n2
q
Y 1 n1n2 n2

Theory
(n 50)
0.0196

Observed
(n 50)
0.006 17

0.000 75
0.0192

z1 jymin  hY 1 i= Y 1 j j  0:699j < 1


Yn

fY n y nyn1

n
hY n i n1

2
n
Y n n2
q
Y n n12n n2

0.9804

0.999

0.9615
0.0192

zn jymax  hY n i= Yn j 0:969 < 1

The probability density corresponding to the general expression (1.26.3) is given


by the derivative f Y i y dFY i y=dy, which can be calculated by either (a) a straightforward, plodding method that calls for tenacity and careful attention to detail, or (b)
a quick, simple method that calls for insight. Both ways are instructive and lead to
f Y i y

n!
Fyi1 1  Fyni f y:
i  1!n  i!

1:26:6

The details are left to an appendix. The pdfs of the extreme order statistics, however,
can be calculated directly and easily from (1.26.4) and (1.26.5).
Consider the circumstance, pertinent to tests of significance, where variates fXig are
distributed uniformly as U(0, 1), in which case the cdf is simply F(x) x. The pdf and
first two moments of the lowest and highest order statistics may then be summarized
in Table 1.5 above. Returning to the example in the previous section of the 50 chisquare variates and corresponding P-values, one sees from Table 1.5 that the observed
lowest and highest Ps fall within one standard deviation of the predicted expectations.
Statistical principles, more than intuition and hunches, provide a better guide for
judging whether extreme events are too extreme to have occurred by chance.

1.27 Bayes theorem and the meaning of ignorance


The use of Bayes theorem for estimation and inference is ordinarily regarded as an
alternative to the maximum likelihood method. However, just as the chi-square test
of significance and least-square method of estimation can be regarded as reductions

1.27 Bayes theorem and the meaning of ignorance

75

of the maximum likelihood method to special cases, I prefer to think of the maximum
likelihood method itself as a particular application of Bayes theorem. For one thing,
this is a friendlier perspective in discussing the matter with other colleagues, since
the use of Bayes theorem has been the source of much contention in the theory of
statistical inference. But more importantly, it is basically accurate to do so since
Bayes theorem, without the accumulated emotional overburden, is an uncontested
fundamental principle in probability theory and therefore a starting point for nearly
all methods of statistical estimation and inference.
Recall the structure of Bayes theorem, Eq. (1.2.5). Given a set of experimental
data D and various models (hypotheses) Hi proposed to account for the data, then
PHi jD

PDjH i PHi
PDjH i PH i
:
X
PD
PDjH PH
i

1:27:1

i1

As discussed earlier,
(1) P(Hi) is the prior probability of a model based on whatever initial information
may be pertinent;
(2) P(DjHi) is the likelihood, i.e. the conditional probability of obtaining the experimental results given a particular model; and
(3) P(HijD) is the posterior probability of a particular model after the results of the
experiment have been taken into account.
In comparing two models H1, H2, one way to use Bayes theorem would be to
evaluate the ratio
PH1 jD PDjH 1 PH 1

PH 2 jD PDjH2 PH 2

1:27:2

and select the model leading to the larger posterior probability.


Different models are usually distinguished by the choice and numerical values of a
set of parameters , whereupon Bayes theorem can be written to show this functional
dependence explicitly:
(a) P(jD) / P(Dj)P() for a discrete parameter or
(b) P(djD) / P(Dj)p()d for a continuous parameter with density p().
A problem of inference (which hypothesis?) then reduces at least in part to a
problem of estimation (which parameter?). There are various, not-necessarily
equivalent, ways to make this estimate. For example, estimate the parameter by
(i) the value ^ that maximizes the posterior probability, i.e. the mode of the posterior
probability function

dPjD
0,
1:27:3
d ^

76

Tools of the trade

or
(ii) the mean value hi

PDjpd
hi

1:27:4

PDjpd


or
(iii) the root-mean-square (rms) value of rms
r
D
E
rms
 hi2 ,

1:27:5

or
(iv) the value ~ that minimizes the squared error
D
2 E
d  ~
0,
d~
the solution of which works out to be hi
2 E
d D
 ~
2hi 2~ 0 ) ~ hi :
d ~

1:27:6

1:27:7

The impediment to using these expressions, however, and the flashpoint for much of
the contention over Bayesian methods of inference, is the prior probability p(). In
particular, what functional form does p() take to represent the condition of no prior
information about i.e. the state of ignorance.28 It is to be stressed and this is
another critical point whose misunderstanding has been the source of much contentious discussion in the past that the prior does not assign probability to the value of
the unknown parameter, which is not a random variable, but to our prior knowledge of
that parameter. There have been other potentially divisive issues as well, such as
repudiation by some statisticians of the very idea that the probability of a hypothesis
makes any sense, but I will dispense with all that here. From my own perspective as a
practical physicist, any set of non-negative numbers summing to unity and conforming
to the rules of probability theory can be considered legitimate probabilities, whether
they arose from frequencies or not. The essential is that the set of numbers be testable,
reproducible (statistically), and help elucidate the problem being investigated.
28

Ignorance derives from a root word meeting not to know and, as used in statistics, does not carry the vernacular
connotations of stupidity or incompetence.

1.27 Bayes theorem and the meaning of ignorance

77

It would seem, at first, that the logical course of action would be to assume a
uniform distribution for unknown parameters in those instances where one has no
prior information about them. There are difficulties with this course, however.
The most serious is that the estimate then depends on an arbitrary choice of how the
model is parameterized. For example, if the random variables of a model are believed
2
2
to be generated by a pdf of the form pxj / ex = , and one assumes a uniform
distribution p() constant for the prior, then one cannot assume the transformed
2
parameter 2 in the pdf pxj / ex = to be uniformly distributed as well because
p

p constant

/ 1=2 :
jd=dj
2

1:27:8

And yet an analyst, having no more prior information about than about , could
have begun the analysis by assuming to be uniformly distributed. Clearly, then,
there is a logical inconsistency here somewhere, since the same state of prior knowledge should lead to the same posterior estimate no matter how one chooses to label
the parameters of a model.
The maximum likelihood (ML) method provides a way around the problem of
priors by disregarding them and basing the estimate on the mode of the likelihood,
i.e. the maximum of the conditional probability P(Dj). The method is invariant to a
transformation of parameters since, by the chain rule of calculus,
d
d
d
0 PDj
PDj ,
1:27:9
d
d
d
d
d
PDj 0 if d
PDj 0, which leads to the same point estimate
and therefore d
whether the model is formulated in terms of or .
A secondary difficulty with assuming that a parameter about which no prior
information is known is distributed uniformly is that Bayes theorem then leads to
some odd results in comparison with corresponding ML estimates. For example,
consider the set of observations fxi i 1. . .ng believed to have arisen from a Poisson
process with unknown parameter . As worked out previously, the parameter
dependence of the likelihood function is en nx , maximization of which gives the
n
X
ML estimate ^ x N1
xi , the mean value of the observations, a reasonable result.
i1

Contrast this with the Bayes estimate obtained by calculating the expectation hi
under assumption of a uniform prior p() constant:





n nx
n nx1
e d
e
d




1 nx 2
1
0

B  hi 0
x :

nx
n


en nx d
en nx d


0

Uniform Prior

Uniform Prior

1:27:10

78

Tools of the trade

Although the asymptotic value of B and ^ are the same, the preceding result is
puzzling and not good for a small sample.
The problem of how in general to determine what prior distribution best represents ignorance is a very important one, as it is crucial to having a completely
consistent and reliable theory of scientific inference. Various ad hoc solutions have
been proposed in the past such as Jeffreys prior, which is to take p(d) / d for a
parameter that ranges infinitely in both directions (   ) and p(d) / d/ for
a parameter that ranges infinitely in one direction (   0). However, there was no
fundamental reason to adopt such rules of thumb or suggest how to deduce the most
suitable prior in other circumstances where neither of the preceding two choices may
be adequate.
As of this writing, there may yet be no universally agreed solution to the problem
of priors for all circumstances, but it seems reasonable to me to expect that the state
of ignorance be defined by a principle of invariance i.e. a group theoretical concept
such as recognized and illustrated by Ed Jaynes29 over four decades ago. Jaynes
starting point, to counter the objection that use of Bayes theorem for inference led to
subjective results dependent on personal judgments of the analysts, was to insist that,
given the same initial information (or lack thereof ), all analysts should be able to
arrive at the same prior probability function for the parameters being estimated. To
achieve a unique prior (and therefore a unique posterior), it was essential for a
problem to be worded precisely so as to make clear what attributes of the parameters
of the underlying model were not specified. It was this lack of clarity that led to some
well-known paradoxes, such as Bertrands paradox,30 in the history of probability
theory.
Bertrands paradox, proposed in 1889, provides a good illustration of how
an ambiguity in the statement of a problem can lead to multiple solutions.
What is the probability P that the length of a chord, selected randomly, will be
greater than the side of the equilateral triangle inscribed in that circle? The
solution depends on what is meant by choosing the chord randomly. Here are
three such ways.
(1) Random length: The linear distance between the midpoint of the chord and the
center of the circle is random. (The solution is P 12.)
(2) Random arc: The arc length between two endpoints of a chord chosen randomly
on the perimeter of the circle is random. (The solution is P 13.)
(3) Random point: The location of the midpoint of the chord anywhere within the
area of the circle is random. (The solution is P 14.)

29
30

E. T. Jaynes, Prior Probabilities, IEEE Transactions On Systems and Cybernetics, Vol. 1 Section 4 (1968) 227241.
E. Parzen, Modern Probability Theory and Its Applications (Wiley, New York, 1960) 302304. At least six solutions were
proposed by a mathematician Czuber in 1908; cited in Parzen, p. 303.

1.27 Bayes theorem and the meaning of ignorance

79

Each of the foregoing interpretations of randomness leads to a different solution


when the corresponding random variable (length, angle, area) is assumed to be
distributed uniformly.
If the chord were actually chosen by an experiment, however, there could be only
one solution, provided the same physical mechanism was repeatedly used. For
example, if a transparent circular disk of radius R were tossed upon a flat table ruled
with parallel lines a distance 2R apart, then only one of these lines would cross the
disk and form a chord. All distances from this chord to the center of the disk would
be equally likely in accordance with method (1) above. On the other hand, if a
spinner (rotatable needle) were placed at a fixed point on the perimeter of the circle,
and the orientation of the randomly spun needle determined the second endpoint of
the chord, then the experiment would correspond to method (2) above. To know
which experiment to perform, it is essential to note what information is not given in
the statement of the problem so that the results will not be biased by unjustified
constraints. The theoretical solution should then be invariant to transformations of
these unspecified attributes. Thus, if the statement of the problem made no mention
of the size or location of the circle, then the solution must be invariant to transformations that displace the center of the circle or scale its radius. Examining the paradox
from this perspective of invariance, Jaynes demonstrated31 that, of the three ways to
interpret randomness above, only method (1) was invariant under both translation
and scaling.32
Let us return to the example of the set of data fxig believed to be distributed by a
Poisson process of unknown, but fixed, parameter to which we must assign a prior
probability density that expresses objectively what we know about the parameter
before the experiment is done so we can perform the Bayesian estimate (1.27.4). In
actual experiments, such as those involving radioactive nuclei which I discuss in the
next chapter, the parameter of interest is the intrinsic rate of decay of the nuclei.
The mean number of decays to be expected in a counting interval (bin) of time t is
then t, and the Poisson probability law can then be expressed in terms of the
parameter
Pxjt et

tx
:
x!

1:27:11

The unit of time, however, is arbitrary, and adopting another unit so that the count
interval becomes t0 and the decay parameter becomes 0 would not change in any way
our prior knowledge about the unknown decay rate. It would then follow that under
a transformation (, t) ! (0 , t0 ) subject to t 0 t0 , that is,
31
32

E. T. Jaynes, The Well-Posed Problem, Foundations of Physics 3 (1973) 477493.


A graphic simulation of the appearance of the circle upon repetitive random selection of chords by each of the three
methods is shown in the Wikipedia entry Bertrand Paradox (probability), http://en.wikipedia.org/wiki/
Bertrand_paradox_(probability). Only in the case of method (1) (random length) does the circular disk look
uniformly covered by the chords. (A plot of just the midpoints of the chords, however, is densest at the center.)

80

Tools of the trade

! 0 q
,
t ! t0 q1 t

1:27:12

the functional form of the density call it f () that represents what we know about
is precisely the same as the density that represents what we know about 0 . Hence
f d f 0 d0 qf qd
or
f qf q:

1:27:13

The functional relation (1.27.13) can be converted into a differential equation by


differentiating both sides with respect to the arbitrary transformation parameter q,
and then setting q 1 to obtain

df
constant
f 0 ) d lnf 0 ) f
:
d

1:27:14

Equation (1.27.14), which corresponds to the Jeffreys prior but now with a
theoretical justification based on symmetry independent of anyones personal
opinion is the only expression that objectively represents a state of prior
knowledge compatible with the mathematical invariance that ignorance of implies.
Used in Eq. (1.27.4) to estimate the value of the unknown parameter , one now
obtains



n nx
e d


0


n nx1
e
d

Uniform Prior



d

n nx

e

0

B  h i


n nx d
e

1 nx 1
x ^
n nx

Uniform Prior

1:27:15
the same result as the estimate by maximum likelihood.
The same kind of reasoning can be applied to distributions with more than one
parameter, such as the important case of the normal distribution where one may be
ignorant of the location parameter and scale parameter . If one has no prior
information about these parameters, then the posterior probability, corresponding to
a set of observations fxi i 1. . .ng
Pd, djD /

n h
Y
i1

2 1=2 exi 

=2 2

f , dd,

1:27:16

1.27 Bayes theorem and the meaning of ignorance

81

that the unknown parameters fall within the ranges (, d), (, d) must
be invariant under simultaneous transformation of location and scale (x, , ) !
(x0 , 0 , 0 )
0 b
0 a

1:27:17

x0  0 ax  :
If we truly have no prior information about their values, then merely relocating the
mean and changing the variance cannot provide new information, and thus the prior
density function f() must have the same dependence on its argument after the
transformation as before, from which it follows that
f , dd f 0 , 0 d0 d 0 :

1:27:18

Substituting expressions (1.27.17) for the transformed parameters into the argument
of the right side and evaluating the Jacobian of the transformation leads to the
functional equation


0 , 0


1:27:19
f , f b, a
af b, a :
,
To solve Eq. (1.27.19) take derivatives of both sides sequentially with respect to each
of the two arbitrary transformation parameters, and then set the parameters to their
values for the identity transformation (a 1, b 0). Start with b:
f ,
f b, a
0a
:
b
b

1:27:20

The vanishing of the right side of (1.27.20) tells us that f cannot be a function of b,
whereupon we can write f (, ) af (, a). But this is the same functional relation
that we encountered before in (1.27.13) with solution (1.27.14). The posterior probability (1.27.16) therefore takes the form
Pd, djD /

n h
Y

2 2

1=2

exi 

i1

/ 2 n=2 n1 e

=2 2

i d d

n
1 X
xi  2
22 i 1

1:27:21

dd

in which no prior knowledge is represented by a uniformly distributed location


parameter and a logarithmic distribution of the scale parameter, again as proposed
by Jeffreys. Jaynes derivation on the basis of an invariance argument shows Jeffreys
choice of prior to be the only function consistent with the assumption of total

82

Tools of the trade

ignorance as defined objectively in terms of symmetry. It should be noted, however,


that it is a mathematical idealization to allow the ranges (   ,   0)
because the prior probability density is then not normalizable. In application to any
actual experiment, the extent of an investigators ignorance of the parameters is never
so great.
To use the posterior probability (1.27.21) for making estimates or inferences, it is
useful first to partition the sum in the exponent in Eq. (1.27.21), as we have done
before, to obtain an expression
Pd, djD  p, jDdd / 2 n=2 n1 e

nx
2 2

nS0

e 22 dd

1:27:22

in terms of sufficient statistics: the sample mean x and (biased) sample variance S0 2.
From Eq. (1.27.22) we can determine the marginal probabilities of each parameter
i.e. the density of one irrespective of the value of the other by integrating over the
undesired parameter

PdjD d p, jD
0

PdjD d

i
d h
2 n=2
/ x  2 S0
d

02

p, jD d / n enS

=2 2

d:

1:27:23

1:27:24

The proportionality constants in the foregoing three equations can be worked out
exactly if needed, but, depending on how the equations are used, they may simply
drop out of the calculation. For example, to estimate the parameter from the data
by using (1.27.22) to calculate the expectation hi one has

e
hi 

2
n x

2 2

e


2
n x
2 2


d

2
x p y ey =2 dy
n

x
ey

=2

1:27:25

dy

x
p and cancellation of factors
upon change of variable to standard form y =
n
common to numerator and denominator. The integral corresponding to the first
moment of y vanishes identically because of symmetry, whereupon the quotient
reduces immediately to x, the same value as the ML estimate.
Similarly, one can use Eq. (1.27.24) to estimate the parameter from the expectation h2i

1.27 Bayes theorem and the meaning of ignorance

2 n nS0 =2 2
2

2 

n enS

02

=2 2

n1 enS

02

n1 enS

02

=2 2

=2 2

n
2
y2 ey dy

83

1:27:26

n

n


1
X
nS 0
nS
2  n S0 2 1
x i  x 2 :

n
2 n  1
2
n2
n  2 i1

2
y2 ey dy
02

02

In the first line above the Jeffreys prior is applied. In the second line substitution of
y nS0 2/22 transforms the integrals into gamma functions. The estimate differs from
the corresponding ML estimate (S 0 2) although the results are asymptotically
equivalent.

Appendices

1.28 Rules of conditional probability


The inclusivity rule follows directly from the definition of conditional probability
according to which
 

 P AB
PAB
PAjB
P AjB
:
1:28:1
P B
P B
Adding the two expressions in (1.28.1) yields
 

 PAB P AB
P B
1

PAjB P AjB
P B
P B

1:28:2

because all possible outcomes involving event


 Beither do or do not involve event A.
The counterpart to (1.28.2) PAjB P AjB does not sum to unity because the
two conditional probabilities do not express the totality of mutually exclusive events.
Rather,
 

 PAB P AB
 
PAjB P AjB
P B
P B
PAB PA  PAB
1:28:3

P B
1  P B
PA PAjB  2PAB

1  PB
in which the second line follows from completeness relations
 
PB P B  1
PAB P AB PA

1:28:4

and the third line from combining the two terms and recognizing the expression for
conditional probability P (A|B).

84

1.29 Probability density of a sum of uniform variates U(0, 1)

85

1.29 Probability density of a sum of uniform variates U(0, 1)


To perform the integral
1
px
2

ixt

hte


1
dt
2




n

eit  1
it

eixt dt

1:29:1

expand the binomial expression and interchange the order of integration and summation to obtain
px

n
X

 
ikxt
e
n 1
i
dt:
k 2
tn

1:29:2

1 eiaz
dz,
2 zn

1:29:3

nk n

1

k0

Next, consider the contour integral


J a

where a > 0 is a constant and the contour C, to be traversed in the positive (i.e.
counterclockwise) sense, is a semicircle of radius R in the upper-half complex plane
with diagonal along the real axis. On the semicircular portion of the contour the
integration variable takes the form z Rei R cos iR sin, and therefore the
magnitude of the integrand vanishes exponentially as eRsin in the limit that R ! .
The integrals in (1.29.2) and (1.29.3) are then related by
1
2

eiat
1 eiaz
1 eiu
n1
dt Lim
dz a Lim
du,
R! 2
R! 2 un
tn
zn
C

1:29:4

where the second equality results from a change of integration variable u az.
The contour integral can be evaluated immediately by means of the residue
theorem of complex analysis

X
f udu 2i
Res f ; ui
1:29:5
C

in which the sum is over the poles of the function in the integrand. The integral in
(1.29.4) has a single pole of order n at u 0. Recall that the residue of a function f(u)
expanded in a Laurent series is the coefficient of the term u1. Thus, expansion of
eiu/un in powers of u generates an infinite sum of terms
2
3

j
n1 n
X
an1 eiu
i
1
n1
jn
4
5 a i
1:29:6
du

a
u
du
2 un
j! 2
n  1!
j0
C

ij, n1

86

Tools of the trade

of which the only nonvanishing term is the one for which j n  1, giving the result
shown above. Substitution of (1.29.6) into (1.29.2) with identification of a k  x
leads to the final expression in (1.17.8).
Note that, if a < 0, then the path of integration from  to would have to be
closed by a semicircle in the lower half complex plane in order for the contribution to
the contour integral along this portion to fall off exponentially as R ! . The
integral must then be multiplied by 1 since the traversal of the contour would then
be in a negative (i.e. clockwise) sense.

1.30 Probability density of a 2 variate


To perform the integral
1
px
2

ixt

hte


1
dt
2

eixt 1  2itk=2 dt

1:30:1

expand the integrand in a negative binomial series and interchange summation and
integration to obtain
2
3



X
k
1
k
k=2 4
2i2j
t2j eixt dt5:
1:30:2
p x
j
2
j0


The integral above has the same general form as the one worked out in Section 1.29
and therefore can likewise be evaluated as a contour integral but along a semicircular contour of radius R in the lower half complex plane because of the negative sign
in the exponent. It then follows from the residue theorem that
2
3

n n1
n1 iu
1
x
e
5 i x :
1:30:3
tn eixt dt Lim 4
du
R!
2
un
n  1!
2


Substitution of the result (1.30.3) for n j into (1.30.2) yields the sum of terms


X
x=2 j
k k
k=2
k

px 22 x21
1:30:4
j
2j1 !
j0
k
2

which can be reduced further by using the identity






nj1
n
1 j
j
j

1:30:5

and expressing the combinatorial coefficient


on the right in terms of its defining

factorials. This leads to cancellation of 2k j  1 ! in the denominator of (1.30.4) and
to the final result

1.31 Probability density of the order statistic Y(i)

x21 X
x=2 j x21 ex=2
k 


k
j!
22 2k  1 ! j0
22 2k
k

p x

87

1:30:6

where the factorial in the denominator was replaced by its equivalent representation
as a gamma function.
The identity in (1.30.5) is easily demonstrable when one recalls that a combinatorial
coefficient can be written as a quotient of two product sequences of equal length.
 
7
765
For example:
123. Applied to a negative binomial coefficient, one has
3
n
j

j terms

nn  1n  2    n  j 1
1  2  3  j

1 j

nn 1n 2    n j  1
1  2  3  j
1 j

1:30:7

n j  1!
j!n  1!

where the final expression, equivalent to the right side of (1.30.5), is obtained by
multiplying both numerator and denominator by (n  1)!.
1.31 Probability density of the order statistic Y(i)
(A) The straightforward plodding method requires taking and simplifying the derivative of the cumulative distribution function
n  
X
n
Fy j 1  Fy nj
FY i y
j
ji
n  
i1  
X
X
n
n
nj
j
F 1  F 
Fj 1  Fnj

j
j
j0
j0
1:31:1
i1  
X
n
nj
j
1
F 1  F
j
j0
i1  
X
n
1  1  Fn
j
j
j0
in which F(y) is the cdf of the unordered random variable Y and (y)  F(y)/(1  F(y)).
(To keep the notation as simple as possible, the functional dependence on y will be
omitted whenever it is not needed for clarity.) The derivative of the defined quantity is
"
#
d
1
F
f
f y

1:31:2
dy
1  F 1  F2
1  F 2
where f(y) dF(y)/dy is the pdf of the unordered random variable Y.

88

Tools of the trade

Taking the derivative of (1.31.1) with insertion of (1.31.2) and a little rearranging
produces
2 0 1
3
0 1
n
n
i1
i1
X
X
1
n1 4
j
j1
@ A 
f Y i y nf 1  F
j @ A 5
n1  F j0
j0
j
j
1:31:3
0
1
2 0 1
3
n
n

1
i1
i2
X
1 X
@
A j1 5:
nf 1  Fn1 4 @ A j 
1

F
j0
j0
j
j
In the transition from the first line to the second, note that

  
j n
n1
and

(a)
j1
n j

i1 
X
n1
(b) the first nonvanishing term in the sum
   must start with j 1.
j1
j0
One can therefore change the summation index so that the first nonvanishing term
actually begins with j 0 by also changing the upper limit of the sum to i  2, leading
to the second term of the second line in (1.31.3).
Substitution into (1.31.3) of the combinatorial identity

 
  
n1
n1
n
1:31:4

j1
j
j
allows one, after some more algebraic manipulation, to subtract the second sum from
the first, yielding the result


n1 n  1
f Y i y nf 1  F
i1 ,
1:31:5
i1
which reduces to the pdf given in (1.26.6).
The combinatorial identity in (1.31.4), known as Pascals triangle, can be demonstrated algebraically by manipulating the sum of factorial expressions represented by
the right side and showing the equivalence to the factorial expression represented by
the left side. However, a simple combinatorial argument avoids such tedious calculation. Suppose we have n distinguishable objects, and we focus attention on one of
them. We then consider the number of ways to select j objects, which of course is


 
n1
n
ways to choose j objects that include the
. However, there are
j1
j


n1
ways to choose j objects that do not include
originally designated one and
j
the designated one. Since the two groups are mutually exclusive and exhaustive, the
identity (1.31.4) follows.

1.32 Probability density of Students t distribution

89

(B) The insightful method33 makes clever use of the multinomial distribution.
Start with the defining relation
f Y i y

dFY i y
FY y y  FY i y
Pry y  Y i > y
Lim i
Lim
:
y!0
y!0
dy
y
y

1:31:6

Then recognize that the probability in the numerator is the product of the probability
of three mutually exclusive and exhaustive events:
(a) (i  1) of the set of originally unordered variates are <y;
(b) one of variates falls in the range (y, y y);
(c) (n  i) of the set are  y y.
The number of ways to achieve this threefold grouping is given by the multinomial
n!
coefficient i1!1!
ni!. It then follows that the derivative in (1.31.6) can be
expressed as


dFY i y
n!
Fy y  Fy
1  Fni

Fi1 Lim
y!0
dy
i  1!n  i!
y
1:31:7
f y


n1
n
f yFi1 1  Fni,
i1
which is precisely the result (1.26.6) previously obtained with considerable effort.

1.32 Probability density of Students t distribution


To evaluate the integral

I0


t2
1
d

d1
2

dt d

1=2

1 x2

d1
 2

dx

1:32:1

consider instead the contour integral [with a (d 1)/2 > 1]



I
C 


2 a

1z


dz
C 

dz
z ia z  ia

1:32:2

over a closed semi-circular contour C of radius R ! in the upper-half complex


plane (i.e. Re(z)  0) with its base along the real axis. The integral over the
33

A. M. Mood, F. A. Graybill, and D. C. Boes, Introduction to the Theory of Statistics 3rd Edition (McGraw-Hill, New
York, 1974) 253.

90

Tools of the trade

semicircular portion vanishes as R(2a1), whereupon the integrals I and I0


are equal.
Within the specified contour, the integrand in (1.32.2) has a pole of order a at
z0 i. In general, the residue of a function of the form f(z) g(z)/(z  z0)n, where
g(z) is analytic at z0 is



g z
1 d n1 gz
gn1 z0

Res
;
z

z

,
1:32:3

0
z  z0 n
n dzn1
n
zz0

which can be readily verified by expanding f(z) in a Taylor or Laurent series about z0
and identifying the coefficient of the term (z  z0)1. Application of (1.32.3) to the
function g(z) (z i)a in (1.32.2) to obtain the residue, followed by use of the
residue theorem (1.29.5), yields the result

I0

1


t
d

d1
 2


1

dt 2

t
d

d1
 2

dt

leading to the normalization constant in (1.19.9).

d 1=2 d
2 d 1=22
d

1:32:4

2
The fundamental problem of a practical physicist

The truth is, the science of Nature has been already too long made
only a work of the brain and the fancy: It is now high time that it
should return to the plainness and soundness of observations on
material and obvious things.
Robert Hooke1

2.1 Bayes problem: solution 1 (the uniform prior)


In 1920 Karl Pearson, one of the principal figures in the creation of modern
statistics and the person who introduced the chi-square test (although he calculated
the number of degrees of freedom incorrectly) published a paper titled The
Fundamental Problem of Practical Statistics.2 Technically, Pearsons paper
addressed a seemingly narrow question: If you toss a coin n times and get a heads,
what is the probability that you will get b heads if you toss the coin m times more?
This was in its essence the problem solved by the Reverend Thomas Bayes (of Bayes
theorem) some two centuries earlier and published posthumously in 1764.3 Bayes
essay was (and is) notoriously difficult to read and has been the focus of attention by
numerous luminaries in mathematical statistics and physics such as Laplace, Boole,
De Morgan, Venn, Pearson, Fisher, Jeffreys, and others. Often enough, that attention took the form of critical commentary that other authors misunderstood some
aspect of Bayes paper. I have read Bayes paper myself and found it wondrously
soporific and make no pretense to having understood it better than anyone else.
When Pearson, whose interest early in his career was in mathematical physics,4
wrote his paper, he thought he had discovered a hitherto unrecognized essential
feature of Bayes paper. Whether this was so or not was debatable, but Pearsons
choice of title clearly indicated at least this: Far from being a peripheral issue of
1
2
3
4

Robert Hooke, Micrographia (first published by the Royal Society 1665) unnumbered page from The Preface.
K. Pearson, The Fundamental Problem of Practical Statistics, Biometrika XIII (1920) 116.
T. Bayes, An Essay towards solving a problem in the doctrine of chances, Philosophical Transactions of the Royal
Society of London 53 (1764) 370418.
T. M. Porter, Karl Pearson: The Scientific Life in a Statistical Age (Princeton University Press, Princeton NJ, 2004) 42,
181183.

91

92

The fundamental problem of a practical physicist

limited interest, Bayes problem served as a surrogate for one of the most important
tasks of statistics and physics: inference or the prediction of future outcomes based
on past observation. In Pearsons words:
None of the early writers on this topicall approaching the subject from the mathematical
theory of games of chanceseem to have had the least inkling of the enormous extension of
their ideas, which would result in recent times from the application of the theory of random
sampling to every phase of our knowledge and experienceeconomic, social, medical, and
anthropologicaland to all branches of observation, whether astronomical, physical or
psychical.

In the history of probability, Bayes method came to be known as inverse probability since, in contrast to the forward direction of calculating probabilities of
outcomes from an assumed model, Bayes theorem could be used to predict the
probability of a model (or hypothesis) from observed outcomes. We have already
examined aspects of this issue in the previous chapter in regard to estimating the
parameters of models. Inference, however, goes beyond estimation for it concerns
not only how to extract values of parameters from data, but what to do with them
afterward. Inference is as much an art as science because there is generally no unique
right answer; different methods can differ in their predictive utility. But all methods
have to deal with the ancient and seemingly intractable problem of ignorance: how
is an unknown probability to be represented mathematically?
Bayes clever solution to that conundrum, reminiscent of physicists practice
(particularly in the nineteenth century) of making mechanical models to help derive
or illustrate mathematical laws, was to devise a gedanken-experiment (thoughtexperiment) involving the motion of a ball on a billiards table. The ball was
presumed to stop with equal probability at any location within the width of the
table, and each instance of rolling the ball was an event independent of preceding
ones. In this way, Bayes arrived at assigning a uniform distribution to a parameter
whose value determined the probability of success for example, a head in the
coin-toss problem defined at the beginning of the section. What Pearson came to
understand, however, was that Bayes solution to Bayes problem led to the same
result whether one assumed a uniform distribution on this parameter or not. This is
an interesting and significant observation, worth looking at in detail.
Define a binary random variable X, where a success (S) corresponds to X 1 and a
failure (F) to X 0, in the following way:

1
for < 0
X
2:1:1
0
for  0
where 0 is the decision variable and is the deciding variable. In other words,
choose a value of randomly from a distribution with continuous pdf f(). If < 0,
the outcome of the coin toss will be S; if  0, the outcome will be F. The
experiment is repeated n times and the result is a successes [aS] and (n  a) failures

2.1 Bayes problem: solution 1 (the uniform prior)

93

[(n  a)F]. The decision variable 0 is chosen at the start from the same pdf f() and
remains fixed for the first n trials as well as the m trials to follow for which the
possible number of successes is in the range (m  b  0).
From Eq. (2.1.1) the probability of a success or failure is given by
0

PrX 1j0  P

f d F0


2:1:2

PrX 0j0 f d 1  P 1  F0
0

where F() not to be confused with the symbol F for Failure is the cumulative
distribution function (cdf ) defined above by the integral of an arbitrary probability
density constrained only by the requirements that it be non-negative and normalizable,

f d 1. It then follows straightforwardly from the rules of probability

that the probability of a successes out of n trials given no prior information


(I ignorance) is

Pra; njI

Prajn, f d



 1


n
n
a
na
f d

Pa 1  Pna dP
P 1  P
a
a
2:1:3
 
0
 

n a 1n  a 1
n
Ba 1, n  a 1

a
a
n 2
n!
a!n  a!
1

a!n  a! n 1!
n1
where the beta function B(x, y) appearing in the third line of (2.1.3) is defined in terms
of gamma functions or an equivalent integral5
1
ab

Ba, b 
xa11  xb1 dx:
a b

2:1:4

Note two important consequences of the above calculation:


(a) the integral in the first line of (2.1.3) is expressible entirely in terms of the
integration variable P [from (2.1.2)], which is a cdf distributed, as we have seen,
as a U(0, 1) variate irrespective of the unknown density function f() of the
deciding variable , and
5

G. Arfken and H. Weber, Mathematical Methods for Physicists (Elsevier, New York, 2005) 520526.

The fundamental problem of a practical physicist

94

1
(b) the final result in line 4 of (2.1.3) is a constant n1
independent of the number of
successes a, which is precisely what one might have expected given the range of
outcomes (n  a  0) comprising n 1 possibilities and the absence of any
prior information on .

Now for the inference part, i.e. to calculate the conditional probability, which I shall
symbolize by Pr((b; m) j (a; n)), of getting b successes out of a further m trials, given
the previous result of a successes out of n trials. The new prior density function is
 
n
Pa 1  Pna dP,
Pr dPja; n
2:1:5
a
and the likelihood is

Pr Pjb; m


m
Pb 1  Pmb :
b

2:1:6

Bayes solution referred to also as the BayesLaplace solution because Laplace


deduced Bayes theorem around 1774 independently of Bayes and applied it to
problems of inference in astronomy6 thus emerges from the following sequence
of steps
 1
 

1  
m b
m
n a
mb
na
Pba 1Pmnba dP
P 1P
P 1P
dP
b
b
a
0
Prb;mja;n 0


1
1  
n a
na
Pa 1Pna dP
P 1P
dP
a
0
0
 
m
Bba1,mnba1
b

Ba1,na1

m!
ab!nmab! n1!
b!mb!
nm1!
a!na!



ab
nmab
a
na



:
nm1
n1

2:1:7

The first line in (2.1.7) expresses the definition of conditional probability under the
two conditions that
(a) the two sets of coin tosses actual and unrealized are independent (hence the
product of the two binomial expressions in square brackets) and
6

S. M. Stigler, The History of Statistics: The Measurement of Uncertainty before 1900 (Harvard University Press,
Cambridge MA, 1986) 103105.

2.1 Bayes problem: solution 1 (the uniform prior)

95

(b) all values of the probability of success P can occur with equal weights (hence the
integral over P).
The resulting integrals are recognized as beta functions in line 2 and expanded into
the defining factorial expressions in line 3. With a little rearrangement, the factorials
can be grouped to form the product of combinatorial coefficients in the final form
in line 4.
If the number of trials (n, m) and successes (a, b) are sufficiently larger than 1 that
the 1s can be omitted in (2.1.7), then the factorials can be rearranged to the approximate expression
    

m
m
n
Nm
b
b
a
Kb

 
Prb; mja; n  
,
2:1:8
nm
N
ab
K
which is the form of a hypergeometric distribution with total number of trials
N n m and total number of successes K a b. This is the distribution that
results from random sampling (e.g. of balls in an urn) without replacement. The
mean and variance of the variate b in distribution in (2.1.8) are respectively


m
m
m N  K
2:1:9
varb K
1
b K
N
N
N N1
and reduce to the corresponding approximate expressions
b Kp
e

varb e Kp1  p

2:1:10

with probability of success p m/N for a binomial distribution in the limit N  K, in


which case the sampling is largely unaffected by lack of replacement.
A further noteworthy feature of (2.1.8) follows from use of the lowest order of
Stirlings approximation (ln n!  n ln n  n) to approximate the factorials in the
combinatorial coefficients, whereupon the solution takes the interesting form
Prb; mja; n 

nn mm a bab n m  a  bnmab
aa bb n  ana m  bmb n mnm

2:1:11

corresponding to a result obtained by R. A. Fisher based on an entirely different way


of reasoning. Fisher compared the likelihood that a successes out of n trials and b
successes out of m trials constituted independent samples from two binomial distributions with ML parameters ^1 a=n and ^2 b=m to the likelihood that the
samples were chosen from the same binomial distribution with parameter
^12 a b=n m.7
7

A. W. F. Edwards, Likelihood (Johns Hopkins University Press, Baltimore, 1992) 216217 gives a justification of
Fishers reasoning.

96

The fundamental problem of a practical physicist

2.2 Bayes problem: solution 2 (Jaynes prior)


The path taken by Bayes, Laplace, and Pearson to solve the fundamental problem
is not the only way to go. Indeed, it elicited criticism on various accounts, much of it
apparently philosophical and not worth revisiting here. As a practical matter to a
practical physicist, criticism is useful primarily if it leads to some other workable
solution that may be better. As we have seen, the major impediment to achieving a
broad consensus on problems of inference using Bayes theorem centered on the
question of how to represent prior information when none was available and an
objective solution to that problem was developed by Jaynes based on considerations
of invariance under symmetry transformations: Whatever quantities are not specified
within the statement of a problem should not affect the solution if their values are
arbitrarily changed. This led Jaynes to derive transformation relations which the
parameters of a model (i.e. the quantities to be estimated from data) must be
expected to obey.
In the approach to Bayes problem under discussion now, suppose to be the
probability of success whose prior distribution f() is unknown. An alternative
to Bayes solution would be to determine the best estimate of , conditioned
on the record of prior successes (a; n), by application of the binomial probability
function
 
n a
1  na f d,
Pr dja; n / pajn, f d
a

2:2:1

and then use that estimate call it ~ to predict the probability of a future outcome
(b; m) as follows
~
Prb; mja;
n


m ~b
~ mb :
1 
b

2:2:2

~ One solution is to use the method of


What, then, constitutes the best estimate ?
maximum likelihood, i.e. to find the mode of the likelihood in (2.2.1). As we have
seen, this leads to ~ ^ an. Another way would be to calculate the expectation hi
from (2.2.1)
1  
n a1
1  na f d
a
hi

1  
n a
1  na f d
a

2:2:3

if there were justification for a particular pdf f(). Pearson circumvented the issue by
using Bayes theorem in such a way that the distribution f() did not appear. Jaynes

2.2 Bayes problem: solution 2 (Jaynes prior)

97

theory of invariance, however, yields a specific functional form for f() through the
following argument.8
Suppose we attempted to estimate the initial probability of success p(S) about
which nothing was known beforehand by performing an experiment that yielded
data D. Then we could relate the posterior probability p(Sj D) to the prior probability
p(S) by Bayes theorem
pSjD

pDjSpS
p S

pDjSp S pDjFp F pS pDjF 1  pS


pDjS

or, in the interest of a less cumbersome notation,


0

,
c1 1 

2:2:4

where 0 and are respectively the posterior and prior probabilities of success and
c  p(Dj S)/p(Dj F) is a ratio of likelihoods. If, however, the experiment was ill-chosen
and did not teach us anything new, then our state of knowledge afterward would be
the same as before, in which case the distributions of 0 and must be the same,
f 0 d 0 f d:

2:2:5

Combined with Eq. (2.2.4), the preceding equation leads to a functional relation


c
2:2:6
1 c  12 f cf
1 c  1
that can be solved in the manner employed in the previous chapter. Take derivatives of
both sides with respect to the parameter (c) and then set the parameter equal to its identity
element (c 1). The calculation is straightforward and leads to the distribution
f d

d
:
1 

2:2:7

Substitution of (2.2.7) into (2.2.3)


1
a 1  na1 d
hi

Ba 1, n  a a 1n a

Ba, n  a
n 1a n

2:2:8

a1 1  na1 d
0

gives a Bayesian estimate ~ an, the same as the ML estimate. Thus the solution,
Eq. (2.2.2), to Bayes problem with the Jaynes prior reduces to
8

E. T. Jaynes, Prior Probabilities, IEEE Transactions on Systems Science and Cybernetics 4 (1968) 227241.

The fundamental problem of a practical physicist

98

~
Prb; mja;
n

m
b

  
a b
n

1

amb
:
n

2:2:9

There are some oddities to the use of (2.2.7) that are worth noting. First, the function
f() becomes singular at the points 0, 1. In other words, the prior in (2.2.7)
weights the endpoints more heavily than the midsection. Jeffreys, whose approach to
probability theory served as inspiration to Jaynes, did not himself find the suggestion
of (2.2.7) (which he attributed to the biologist Haldane) appealing, although his own
logarithmic prior for a scale parameter is also singular at the point 0. Having
recorded his skepticism of the BayesLaplace uniform prior, which might appeal to
a meteorologist. . .but hardly to a Mendelian, he then wrote9
Certainly if we take the BayesLaplace rule right up to the extremes we are led to results that
do not correspond to anybodys way of thinking. The rule dx/x(1  x) goes too far the other
way. It would lead to the conclusion that if a sample is of one type with respect to some
property there is probability 1 that the whole population is of that type.

It would seem therefore, that use of (2.2.7) must entail either exclusion of endpoints
0, 1 with failure of the prior to be complete, or inclusion of the endpoints with
failure of the prior to be normalizable. It can be argued, however, that the density
function of a parameter about which nothing is presumed known beforehand ought,
in fact, to exclude from its range points indicative of certainty.
A second, related peculiarity is that, if (2.2.7) is substituted into Eq. (2.2.1), the
conditional probability
Prdja; n

a1 1  na1
n  1!
d
a1 1  na1 d
Ba, n  a
a  1!n  a  1!
2:2:10

becomes singular or indeterminate for n 1, a 0, 1. Jaynes has offered the


explanation that, until one actually has obtained at least one success and one failure
out of two trials, there is no basis for presuming that the population is binomial, i.e.
that it contains events of two different kinds. Therefore the restriction (n > a  1) is
required as part of prior information in order that use of the binomial probability
function be justified. In this minimum-information case of n 2, a 1, Eq. (2.2.10)
reduces to Pr(d j (1;2)) d, which is the BayesLaplace uniform prior.

2.3 Comparison of the two solutions


Whereas the features of the second solution (2.2.9) with the JaynesHaldane prior
I shall
call it
PJ(b) are fairly obvious. . . it is just a binomial distribution

Bin m, ^ a=n . . . the properties of the first solution (2.1.7) with the BayesLaplace
9

H. Jeffreys, Theory of Probability (Oxford, New York, 1961) 124.

99

2.3 Comparison of the two solutions


0.12

0.1

Probability

0.08

c
0.06

0.04

0.02

0
0

10

15

20

25

30

35

40

45

50

Outcome
Fig. 2.1 Bayes solution for the probability PB(b) of b successes in m 50 trials given a prior
successes in n trials at constant ratio (a/n) equal to (a) 1/2, (b) 5/10, (c) 25/50, (d) 50/100. Solid
curve (e) shows binomial probability distribution Bin50, 12.

uniform prior (I shall call it PB(b)) are not particularly transparent. Knowing that
the asymptotic form approaches that of a hypergeometric distribution is not all that
helpful, since this is a complicated function. Visualization, however, is helpful, and
Figure 2.1 shows the variation in form of PB(b) as a function of b for m = 50 and
different values of a and n such that the ratio a/n is a fixed quantity, which (for no
particular reason except aesthetics) I chose to be 1/2.
The distribution is discrete, so the dashed lines tracing the different curves are
there only to guide the eye. In order not to encumber the figure, I have omitted
symbols showing the discrete point values, except in the tallest curve depicted with a
solid line. This is the binomial distribution corresponding to PJ(b). The behavior of
PB(b) now becomes evident and in a certain sense very reasonable. When the prior
information is deduced from a small sample, for example (a;n) = (1;2), the prediction
of future outcomes (b;m) is highly uncertain, and the plot of PB(b) spreads widely and
with low amplitude over the range (m ≥ b ≥ 0). As the sample from which the prior
estimate of the probability of a success is inferred increases, the predictions become
sharper, and the plot of PB(b) approaches asymptotically in form the binomial distribution PJ(b). Although readily apparent, I note explicitly that the function PJ(b)
depends only on the ratio of a to n and not on the two values separately.
To demonstrate analytically what the figure suggests graphically, it is better to
work directly with the probability function (2.1.7) than with a moment generating
function or characteristic function, neither of which is easily calculable or useful in
the case of a hypergeometric distribution. Since a and n are related by a = θ̂n,
substitute for a in PB(b) to obtain the form





P_B(b) = \frac{\dbinom{x+b}{x}\dbinom{y+m-b}{y}}{\dbinom{z+m}{z}} \qquad (2.3.1)

where I have defined x = θ̂n, y = n(1 − θ̂), z = n + 1, each of which is a monotonic
linear function of n and therefore becomes much larger than the other index in the
combinatorial coefficient as n becomes very large. Now consider the behavior of one
of the combinatorial coefficients, say \binom{x+b}{x}, as x becomes large:

\lim_{x \gg b}\binom{x+b}{x} = \frac{(x+b)!}{x!\,b!} = \frac{(x+1)(x+2)\cdots(x+b)}{b!} \simeq \frac{x^b}{b!}. \qquad (2.3.2)

Approximating each of the three combinatorial coefficients in (2.3.1) with the corresponding form given by (2.3.2) yields the result

\lim_{\text{large }n} P_B(b) \simeq \frac{\dfrac{x^b}{b!}\,\dfrac{y^{m-b}}{(m-b)!}}{\dfrac{z^m}{m!}} = \binom{m}{b}\frac{(\hat\theta n)^b\,[n(1-\hat\theta)]^{m-b}}{(n+1)^m} \simeq \binom{m}{b}\left(\frac{a}{n}\right)^{b}\left(1-\frac{a}{n}\right)^{m-b} = P_J(b) \qquad (2.3.3)

as demonstrated by Figure 2.1.
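The convergence expressed by (2.3.3) is also easy to check numerically. The short Python sketch below is my own illustration, not part of the original text; it evaluates PB(b) in the combinatorial form of (2.3.1) and the binomial PJ(b) with parameter a/n, holding a/n = 1/2 fixed while n grows, and reports the total absolute discrepancy.

# Sketch (not from the text): numerical check that P_B(b) -> P_J(b) as n grows
# at fixed ratio a/n, using the combinatorial form of Eq. (2.3.1).
from math import comb

def P_B(b, m, a, n):
    # Bayes-Laplace predictive probability of b successes in m future trials
    # given a successes in n prior trials (uniform prior on P).
    return comb(a + b, b) * comb(n - a + m - b, m - b) / comb(n + m + 1, m)

def P_J(b, m, a, n):
    # Jaynes-Haldane result: binomial with maximum-likelihood estimate a/n.
    p = a / n
    return comb(m, b) * p**b * (1 - p)**(m - b)

m = 50
for n in (2, 10, 50, 100, 1000):
    a = n // 2                                   # keep a/n = 1/2
    diff = sum(abs(P_B(b, m, a, n) - P_J(b, m, a, n)) for b in range(m + 1))
    print(f"n = {n:5d}:  sum_b |P_B - P_J| = {diff:.4f}")

The printed discrepancy decreases steadily with n, in accord with Figure 2.1.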

2.4 The Silverman–Bayes experiment


To a practical physicist, philosophical argumentation over how to represent the prior
state of ignorance, while sometimes illuminating, can never supplant an experimental
test. And so, intrigued by the problem and always interested in resolving a controversy, I designed one. The idea was to use a random number generator (RNG) to
select the decision variable and subsequent deciding variables that determined the
probability of success P. Taking the Bayes–Laplace and the Jaynes–Haldane results
[(2.1.7) and (2.2.9)] as the two prevailing models to test, I was particularly interested
to see (a) how they compared under conditions where prior information (the a
successes out of n trials) was obtained from samples of different sizes, (b) whether
specific choices of the prior distribution (i.e. choice of the RNG) had any effect on
posterior probabilities, and ultimately (c) which method would be preferable if I had
to wager on the outcomes.
In its essentials, my experimental protocol followed Bayes' posthumous paper except
for replacement of the ball on the billiards table with a computational RNG, which
made it possible to run a large number of sets of trials in the manner described below.


(I) PRIOR INFORMATION

• Choose an RNG.
• Generate a single decision variable θ₀, which remains fixed throughout all trials.
• Generate n initial random variables {θ_i^(0), i = 1...n} from the RNG.
• Determine whether a trial led to a success (X_i^(0) = 1) or failure (X_i^(0) = 0) according to (2.1.1).
• Tally the number of successes a = Σ_{i=1}^{n} X_i^(0).
• Estimate the prior probability of success p₀ = a/n.

(II) POSTERIOR INFORMATION

• Using the same RNG as in the first part, generate N sets of variables with m variables per set {θ_k^(j), k = 1...m; j = 1...N}.
• Determine whether each trial led to a success (X_k^(j) = 1) or failure (X_k^(j) = 0) according to (2.1.1).
• Tally the number of successes in each set: B_j = Σ_{k=1}^{m} X_k^(j).
• Tally the frequency of occurrence N_b of each possible outcome b (m ≥ b ≥ 0), subject to Σ_{b=0}^{m} N_b = N.
• Deduce the empirical mean probability of success p̄ = (1/N) Σ_{j=1}^{N} (B_j/m) and corresponding variance var(p) = (1/N) Σ_{j=1}^{N} (B_j/m − p̄)².

(III) COMPARISON OF EXPERIMENT AND THEORY

• Make a histogram of the frequencies {N_b} or probabilities {N_b/N} with 2m + 1 categories (bins), each bin corresponding to a single value of b.
• Overlay a plot of the theoretical probability PJ(b) (2.2.9) based on the Jaynes–Haldane prior.
• Overlay a plot of the theoretical probability PB(b) (2.1.7) based on the Bayes–Laplace uniform prior.
• Calculate the absolute error

\epsilon_{\mathrm{abs}} = \sum_{b=0}^{m}\left|\frac{N_b}{N} - P_{\mathrm{theory}}(b)\right| \qquad (2.4.1)

or root-mean-square (RMS) error

\epsilon_{\mathrm{rms}} = \sqrt{\sum_{b=0}^{m}\left(\frac{N_b}{N} - P_{\mathrm{theory}}(b)\right)^{2}} \qquad (2.4.2)

in the inferences of both theories.
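A minimal computational sketch of this protocol (my own, not the original code) is given below. It assumes, as criterion (2.1.1), which is not reproduced in this excerpt, that a trial counts as a success when the deciding variable does not exceed θ₀; the particular sample sizes are those used later in Figure 2.2.

# Minimal sketch of the Silverman-Bayes protocol (parts I-III).  It assumes,
# as criterion (2.1.1), that a trial is a "success" when the deciding variable
# does not exceed the fixed decision variable theta_0, so that the probability
# of success is P = F(theta_0).
import numpy as np
from math import comb

rng = np.random.default_rng(1)
draw = rng.standard_normal            # the chosen RNG, here N(0,1)

# (I) Prior information
n, m, N = 4, 20, 5000
theta0 = draw()                       # decision variable, fixed thereafter
a = int(np.sum(draw(n) <= theta0))    # successes in n prior trials
p0 = a / n

# (II) Posterior information: N sets of m trials each
B = np.array([np.sum(draw(m) <= theta0) for _ in range(N)])
freq = np.bincount(B, minlength=m + 1) / N
p_bar = B.mean() / m

# (III) Comparison with the two theoretical predictions
P_J = np.array([comb(m, b) * p0**b * (1 - p0)**(m - b) for b in range(m + 1)])
P_B = np.array([comb(a + b, b) * comb(n - a + m - b, m - b) / comb(n + m + 1, m)
                for b in range(m + 1)])
err_J = np.abs(freq - P_J).sum()      # absolute error (2.4.1)
err_B = np.abs(freq - P_B).sum()
print(f"prior estimate p0 = {p0:.3f}, empirical mean p = {p_bar:.3f}")
print(f"epsilon_abs(J) = {err_J:.3f},  epsilon_abs(B) = {err_B:.3f}")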


In the course of the experiment I tried quite different RNGs such as uniform
U(0,1), normal N(0,1), and exponential E(1) as prior density functions, and all


Fig. 2.2 Silverman–Bayes experiment with randomly chosen decision parameter θ₀ = 0.179.
The histogram shows frequency of successes per set of 20 trials in 5000 sets of trials
implemented by a N(0,1) RNG. (a) Binomial distribution PJ = Bin(20, 1/2) (Eq. (2.2.8)) based
on Jaynes' solution with parameter (a/n) = (2/4). (b) Bayes' distribution PB (Eq. (2.1.7))
based on prior F(θ₀) = 0.571. (c) Binomial distribution Bin(20, 0.570) with observed
mean probability of success p̄ = 0.570. Dashed lines connecting discrete points serve only as
visual guides. [Axes: Number of Successes b vs. Probability of Success b/m.]

led, under otherwise identical conditions, to the same empirical histograms, an
example of which is reproduced in Figure 2.2. This figure shows data obtained
with a N(0,1) RNG. The randomly chosen (but subsequently fixed) decision
parameter turned out to be θ₀ = 0.179, corresponding to a cumulative distribution
(i.e. probability of success) P = F(θ₀) = 0.571. Four tosses (i.e. random selections) were then made, of which two turned up heads, giving the prior information (a;n) = (2;4) and a prior estimate p₀ = 0.500 for the probability of success.
With this information, the probability of getting b = 0, 1, 2, ..., 20 heads in the next
m = 20 tosses could be predicted with PJ(b) and PB(b). The distributions of
discrete plotting symbols (connected by dashed lines to aid visibility) in Figure 2.2
show the predictions.
The histogram (normalized to unit area) in Figure 2.2 represents the distribution in number of successes per set of 20 trials obtained from 5000 sets of trials
with the same RNG. The mean probability of success obtained empirically was
p̄ = 0.570, in very close agreement with the predicted value F(θ₀). The standard
error (i.e. standard deviation of the mean) was σ_p = 0.111, precisely as predicted
theoretically for a binomial distribution, σ_p = √(p(1 − p)/m), as deduced in the
following steps:

\sigma_p^2 = \mathrm{var}\!\left(\frac{B}{m}\right) = \frac{\langle B^2\rangle - \langle B\rangle^2}{m^2} = \frac{\Big\langle\Big(\sum_{i=1}^{m} X_i\Big)^{2}\Big\rangle - \Big\langle\sum_{i=1}^{m} X_i\Big\rangle^{2}}{m^2}
= \frac{m\langle X^2\rangle + m(m-1)\langle X\rangle^2 - m^2\langle X\rangle^2}{m^2} = \frac{\langle X^2\rangle - \langle X\rangle^2}{m}
= \frac{\big[1^2\,F(\theta_0) + 0^2\,(1-F(\theta_0))\big] - \big[1\cdot F(\theta_0) + 0\cdot(1-F(\theta_0))\big]^2}{m} = \frac{F - F^2}{m} \qquad (2.4.3)
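As a quick arithmetic check of (2.4.3) against the value quoted above (my addition, using the empirical mean p̄ = 0.570 and m = 20 trials per set):

\sigma_p = \sqrt{\frac{\bar p\,(1-\bar p)}{m}} = \sqrt{\frac{0.570\times 0.430}{20}} \approx 0.111,

which reproduces the standard error reported for the histogram of Figure 2.2.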

Since the prior information was obtained from a small sample (n = 4), it was to be
expected (as explained in the previous section) that the Bayes–Laplace predictions
would be dispersed much more widely about p₀ than the Jaynes–Haldane predictions.
Figure 2.2 confirms this.
If the prior estimate p₀ were identical to F(θ₀), the curve of PJ(b) would form a
tight envelope about the histogram (as illustrated by the light dashed line), and the
error would be close to zero. Looking at the figure, which summarizes an outcome
with prior estimate p₀ = 0.50 (about which the theoretical curves are centered) not
too far from p̄ = 0.57 (about which the histogram is centered), one would likely
conclude that PJ(b) has made overall better predictions than PB(b). The measures of
error (we will use the absolute error (2.4.1)) confirm this: ε_abs^J = 0.153, ε_abs^B = 0.202.
Since the function PJ(b) takes the form of the empirical distribution irrespective of
prior sample size n, whereas the function PB(b) takes the form of the empirical
distribution only for very large values of n, one might be tempted to conclude that
the better bet would be on the Jaynes–Haldane predictions. This, however, would be
a mistake, and possibly a costly one if a wager were actually involved. I have run the
experiment many thousands of times (it took but a few seconds with a computer),
and in most instances by far the Bayes–Laplace predictions led to smaller overall
errors than the Jaynes–Haldane predictions. This turnabout occurs because the prior
estimate p₀ is more likely not to be sufficiently close to the true probability of
success, F(θ₀), in which case the more diffuse PB(b) distribution overlaps the empirical distribution to a greater extent than does the PJ(b) distribution.
Figure 2.3 illustrates this point quantitatively in the case of a prior sample of size
n = 20 and posterior test of size m = 50, in which the outcomes to be inferred span the
range b = 0, 1, ..., 50. For the example shown, the true probability of success was
taken to be F(θ₀) = 0.5. The top two plots with dashed lines show the absolute errors
(i.e. summed over all outcomes b) incurred by predictions based on PJ and PB as a
function of the prior number of successes a or, equivalently, of the prior estimate
of the probability of success p₀ = a/n. Except for just three values of a (9, 10, 11) out
of the entire range a = 0, 1, ..., 20, the Bayes–Laplace error lies below the Jaynes–
Haldane error. The difference of the two errors ε_abs^J − ε_abs^B as a function of a is shown
by the solid line. This conclusion holds for both small and large prior and posterior
sample sizes.

Fig. 2.3 Silverman–Bayes experiment with fixed decision parameter. Plot of absolute errors as
a function of the number of prior successes a: (a) Jaynes' solution, ε_Abs^J(a) = Σ_{b=1}^{50} | PJ(b, 50 | a, 20) − P_Bin(b | 50, p = 1/2) |;
(b) Bayes' solution, ε_Abs^B(a) = Σ_{b=1}^{50} | PB(b, 50 | a, 20) − P_Bin(b | 50, p = 1/2) |;
and (c) the difference ε_Abs^J(a) − ε_Abs^B(a). [Axes: Number of Prior Successes (0–20) vs. Prediction Error.]

2.5 Variations on a theme of Bayes


Under the conditions previously described, the outcome of each random trial (coin
toss) was determined by reference to a fixed decision variable θ₀, and the distribution
of the number of successes in n trials was binomial: Bin(n, P ≡ F(θ₀)). Suppose,
however, that before each coin toss θ₀ was chosen randomly (the same RNG again
being used for both decision and deciding variables). What then would be the
distribution of successes in n trials?
One way to proceed is to work directly with the binomial probability function as
we have in the previous sections; another way is to make use of the moment-generating function (mgf). Let us choose the latter for variety and because it leads
to interesting results along the way.
Recall that the mgf of a binomial random variable Y = Bin(n, p) has the form

g_Y(t) = (p e^t + q)^n = [1 + p(e^t − 1)]^n, \qquad (2.5.1)

which in the present case can be written to show explicitly the dependence on n and θ₀

g_Y(t | n, θ₀) = [1 + F(θ₀)(e^t − 1)]^n. \qquad (2.5.2)

If the number of trials n is sufficiently large that the values selected randomly for θ₀
induce P = F(θ₀) to cover its range (0,1) evenly, then the population of successes
would be characterized by the P-averaged mgf


Table 2.1  Discrete and continuous uniform distributions

Distribution   U[0, n]                                          U(0, n)
mgf            (e^{(n+1)t} − 1) / [(n+1)(e^t − 1)]              (e^{nt} − 1) / (nt)
⟨Y⟩            n/2                                              n/2
⟨Y²⟩           n²/3 + n/6                                       n²/3
⟨Y³⟩           n³/4 + n²/4                                      n³/4
⟨Y⁴⟩           n⁴/5 + 3n³/10 + n²/30 − n/30                     n⁴/5
σ²_Y           n(n+2)/12                                        n²/12
Sk_Y           0                                                0
K_Y            9(n² + 2n − 4/3) / [5n(n+2)]                     9/5

g_Y(t\,|\,n) = \int_0^1 [1 + P(e^t - 1)]^n\,dP = \frac{e^{(n+1)t} - 1}{(n+1)(e^t - 1)}. \qquad (2.5.3)

The right side of Eq. (2.5.3) actually takes the form of one of the familiar discrete
distributions, but, if we were not aware of this, we could obtain the probability
function {p_j, j = 0, 1, 2, ...} for j successes by replacing e^t by s and examining the
probability generating function (pgf) f(s)

f_Y(s\,|\,n) = \sum_{j \ge 0} p_j\,s^j = \frac{1 - s^{n+1}}{(n+1)(1-s)}. \qquad (2.5.4)

The function of s in (2.5.4) is the truncated geometric series \sum_{j=0}^{n} s^j = \frac{1 - s^{n+1}}{1 - s}, whereupon it follows immediately that

p_j = \begin{cases} \dfrac{1}{n+1} & j = 0, 1, 2, \ldots, n \\[4pt] 0 & \text{otherwise.} \end{cases} \qquad (2.5.5)

Thus, expression (2.5.3) is the mgf of a discrete uniform distribution over the range
(0, n), as shown explicitly by the probability function in (2.5.5). I will distinguish the
discrete uniform distribution symbolically from a continuous uniform distribution
over the same range by writing U[0, n] in contrast to U(0, n) (see Table 2.1). The first
few moments of both distributions can be calculated readily either by summing/
integrating over the probability function or differentiating the moment-generating
function. (As discussed in the previous chapter, the latter operation requires taking
the limit t → 0 by means of L'Hopital's rule.)
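For readers who prefer to let a computer do the differentiation, the following sketch (my addition, using the sympy library) recovers the low-order entries of Table 2.1 directly from the two mgfs.

# Sketch: recover low-order moments of Table 2.1 by differentiating the two mgfs
# and taking the limit t -> 0 (sympy handles the L'Hopital step automatically).
import sympy as sp

t, n = sp.symbols('t n', positive=True)
mgf_discrete = (sp.exp((n + 1) * t) - 1) / ((n + 1) * (sp.exp(t) - 1))  # U[0, n]
mgf_continuous = (sp.exp(n * t) - 1) / (n * t)                          # U(0, n)

for name, g in (("U[0,n]", mgf_discrete), ("U(0,n)", mgf_continuous)):
    moments = [sp.simplify(sp.limit(sp.diff(g, t, k), t, 0)) for k in (1, 2)]
    var = sp.simplify(moments[1] - moments[0]**2)
    print(name, "  <Y> =", moments[0], "  <Y^2> =", moments[1], "  var =", var)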
Figure 2.4 shows a stochastic confirmation of the predicted probability function
(2.5.5). The histogram records the distribution of number of successes per set of

Fig. 2.4 Silverman–Bayes experiment with variable decision parameter θ₀. Histogram of the
number of successes per set of 20 trials obtained from 4000 sets of trials in which θ₀ is a U(0, 1)
random variable realized before each set. The dashed line at 190.48 marks the expected mean
number of successes for the predicted posterior distribution U[0, 20]. [Axes: Number of Successes vs. Frequency.]

20 trials obtained from 4000 sets of trials in which the decision variable θ₀ was
selected randomly before each set of trials from a U(0, 1) random number generator,
followed by the selection of the deciding variables from the same RNG. The dashed
line marks the theoretically expected mean number of events (4000/21 = 190.48) in
each outcome category. A chi-square test of the goodness-of-fit of the distribution
U[0, n] yielded χ² = 20.7 for d = 21 degrees of freedom, corresponding to the P-value 0.48
(not to be confused with the symbol P for probability of success).
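The experiment of Figure 2.4 is easy to re-run in a few lines. The sketch below is my own, not the original code: it redraws θ₀ before each set from the same U(0,1) generator that supplies the deciding variables, and tests the resulting frequencies against the flat expectation of 190.48 events per category, following the d = 21 convention quoted above.

# Sketch of the variable-decision-parameter experiment of Fig. 2.4:
# theta_0 is redrawn before each set of m trials from the same U(0,1) RNG
# that supplies the deciding variables; the number of successes per set is
# compared with the discrete uniform U[0, m].
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
m, n_sets = 20, 4000

theta0 = rng.uniform(size=(n_sets, 1))        # one decision variable per set
trials = rng.uniform(size=(n_sets, m))        # deciding variables
successes = (trials <= theta0).sum(axis=1)    # successes per set: 0..m

observed = np.bincount(successes, minlength=m + 1)
expected = np.full(m + 1, n_sets / (m + 1))   # 4000/21 = 190.48 per category
chi2 = ((observed - expected)**2 / expected).sum()
pval = stats.chi2.sf(chi2, df=m + 1)          # the text quotes d = 21 for the 21 categories
print(f"chi^2 = {chi2:.1f},  P-value = {pval:.2f}")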
Suppose next, as a further variation of the Bayes problem, that the number n of
trials per set was also to be chosen randomly. In other words, before tossing the coin,
the number of trials n is selected randomly by a RNG and then the decision variable
θ₀ for that set is selected randomly by the same RNG. How, under these conditions,
would the number of successes be distributed? In the course of repeating the experiment numerous times, how many heads on the average would you expect to obtain
per set of trials? From a Bayesian perspective, the only prior information that is now
available is the type of random number generator and its associated parameter(s).
To be specific, let us adopt first a Poisson distribution with parameter μ for the
probability of the number of trials n

\Pr(N = n\,|\,\mu) = \frac{e^{-\mu}\,\mu^{n}/n!}{1 - e^{-\mu}} \qquad n = 1, 2, \ldots. \qquad (2.5.6)

Note that n = 0 is not part of the range; each set is required to comprise at least 1
trial. The factor (1 − e^{−μ})^{−1} is therefore required in order for the completeness
relation to be satisfied. Having already averaged the mgf (2.5.2) over P, we must


now average the result (2.5.3) over n, where n = 1, 2, .... The averaging is not difficult
to do, although it is a little tedious. It would be easier, however, to perform the
average over n first and then over θ₀. The final result, as one can demonstrate, does
not depend on the order in which the averages are taken.
Starting again, therefore, with the mgf (2.5.2) and performing the average

g_Y(t\,|\,\theta_0) = (1 - e^{-\mu})^{-1}\sum_{n=1}^{\infty}\frac{\mu^n e^{-\mu}}{n!}\,[1 + F(\theta_0)(e^t - 1)]^n
= \frac{e^{-\mu}}{1 - e^{-\mu}}\left[\sum_{n=0}^{\infty}\frac{\{\mu[1 + F(\theta_0)(e^t - 1)]\}^n}{n!} - 1\right]
= \frac{e^{-\mu}\,e^{\mu[1 + F(\theta_0)(e^t - 1)]} - e^{-\mu}}{1 - e^{-\mu}} = \frac{e^{\mu F(\theta_0)(e^t - 1)} - e^{-\mu}}{1 - e^{-\mu}} \qquad (2.5.7)

we see from the form of the resulting mgf that the random variable describing the
distribution of successes would itself be a Poisson variate with mean parameter
μF(θ₀), except for the probability of getting zero successes. This is an interesting result
in itself, representative of a class of compound distributions that can occur in physics
(for example, in testing photon emissions for randomness, to be discussed in a later
chapter) and in commercial risk assessment (for example, the probability of damage
to structures by lightning strikes, where the number of hits is a Poisson variate to be
folded into the probability of damage per hit).10
The average of (2.5.7) over θ₀ or P produces the mgf

g_Y(t) = \int_0^1 g_Y(t\,|\,\theta_0)\,f(\theta_0)\,d\theta_0 = (1 - e^{-\mu})^{-1}\left[\int_0^1 e^{\mu P(e^t - 1)}\,dP - e^{-\mu}\right]
= \frac{e^{\mu(e^t - 1)} - 1}{\mu(e^t - 1)(1 - e^{-\mu})} - \frac{e^{-\mu}}{1 - e^{-\mu}}, \qquad (2.5.8)

which is not a particularly familiar generating function. Nevertheless, one can
determine the first few moments from (2.5.8), which turn out to be

\langle Y\rangle = \frac{\mu}{2(1 - e^{-\mu})} \qquad
\langle Y^2\rangle = \frac{\mu\left(\tfrac{1}{2} + \tfrac{\mu}{3}\right)}{1 - e^{-\mu}} \qquad
\mathrm{var}(Y) = \frac{\mu\left(\tfrac{1}{2} + \tfrac{\mu}{3}\right)}{1 - e^{-\mu}} - \frac{\mu^2}{4(1 - e^{-\mu})^2}. \qquad (2.5.9)

Substitution of s = e^t in (2.5.8) yields the probability generating function

f_Y(s) = \frac{e^{\mu(s-1)} - 1}{\mu(s - 1)(1 - e^{-\mu})} - \frac{e^{-\mu}}{1 - e^{-\mu}}, \qquad (2.5.10)

10 W. Feller, An Introduction to Probability Theory and its Applications Vol. 1, 2nd Edition (Wiley, New York, 1950)
270–271.


the expansion of which in a power series in s leads to the probability p_j (j = 0, 1, ...)
for j successes per set of indeterminate number of trials

p_j = \frac{e^{-\mu}}{1 - e^{-\mu}}\left[\frac{e^{\mu}}{\mu} - \frac{1}{\mu\,j!}\sum_{k=0}^{j}\frac{d^k}{d\mu^k}\,\mu^{j}\right] - \frac{e^{-\mu}}{1 - e^{-\mu}}\,\delta_{j0}, \qquad (2.5.11)

where δ_{j0} is a Kronecker delta symbol. The sum in (2.5.11)

\sum_{k=0}^{m}\frac{d^k}{dx^k}\,x^m = e^{x}\,\Gamma(m+1, x) \qquad (2.5.12)

is equivalent to the incomplete gamma function defined by the integral

\Gamma(m+1, x) = \int_x^{\infty} t^m e^{-t}\,dt. \qquad (2.5.13)

Thus, the posterior probability distribution (2.5.11) can be expressed succinctly as

p_j = \frac{1}{1 - e^{-\mu}}\left[\frac{1}{\mu}\left(1 - \frac{\Gamma(j+1, \mu)}{\Gamma(j+1)}\right) - e^{-\mu}\,\delta_{j0}\right], \qquad (2.5.14)

where Γ(j + 1) = j! since j is integer.
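Equation (2.5.14), written with the regularized incomplete gamma function, can be checked against a direct simulation of the compound experiment. The sketch below is my own illustration; it assumes μ as the symbol for the Poisson parameter (the Greek letter is lost in this reproduction) and uses the probability integral transform to draw P = F(θ₀) uniformly on (0,1).

# Sketch: Monte Carlo check of the compound result (2.5.14).  For each set,
# the number of trials n is drawn from a zero-truncated Poisson(mu) and the
# probability of success P = F(theta_0) is uniform on (0,1); j is the number
# of successes in that set.
import numpy as np
from scipy.special import gammaincc   # regularized upper incomplete gamma Q(a, x)

rng = np.random.default_rng(3)
mu, n_sets = 10.0, 200_000

n = rng.poisson(mu, size=3 * n_sets)
n = n[n > 0][:n_sets]                         # zero-truncated Poisson sample
P = rng.uniform(size=n_sets)                  # P = F(theta_0) per set
j = rng.binomial(n, P)                        # successes per set

jmax = j.max()
empirical = np.bincount(j, minlength=jmax + 1) / n_sets

jj = np.arange(jmax + 1)
pj = (1.0 / mu) * (1.0 - gammaincc(jj + 1, mu)) / (1.0 - np.exp(-mu))
pj[0] -= np.exp(-mu) / (1.0 - np.exp(-mu))    # Kronecker-delta term for j = 0
print("max |empirical - theory| =", np.abs(empirical - pj).max())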
Consider next the outcome of the preceding Bayes experiment in which the
number of trials n in a set is chosen from a discrete uniform distribution U[1, N],
again eliminating n = 0 from the range, instead of from a Poisson distribution. The
P-averaged mgf (2.5.3) must now be summed over all allowed values of n

g_Y(t) = \frac{1}{N(e^t - 1)}\sum_{n=1}^{N}\frac{e^{(n+1)t} - 1}{n + 1}, \qquad (2.5.15)

an operation that does not lead to a simple recognizable distribution, although we
shall work it out momentarily. In the form of (2.5.15), however, the first few
moments are obtainable without much effort because the expressions resulting from
the derivatives of g_Y(t) are themselves readily summed11 after the limit t → 0 is taken:

\langle Y\rangle = \frac{N+1}{4} \qquad
\langle Y^2\rangle = \frac{(4N+5)(N+1)}{36} \qquad
\mathrm{var}(Y) = \frac{7N^2 + 18N + 11}{144}. \qquad (2.5.16)

To determine the probability distribution, we convert (2.5.15) to the corresponding pgf

f_Y(s) = \sum_{j \ge 0} p_j\,s^j = \frac{1}{N(1-s)}\sum_{n=1}^{N}\frac{1 - s^{n+1}}{n+1} = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{n+1}\sum_{k=0}^{n}s^k \qquad (2.5.17)

11 The relations needed are: \sum_{x=1}^{N} x = \frac{N(N+1)}{2} and \sum_{x=1}^{N} x^2 = \frac{N(N+1)(2N+1)}{6}.


in which it is to be noted that the series in s is finite, the highest-order term being s^N.
The pattern of the set {p_j} generated by (2.5.17) is an interesting one, as seen by
writing it explicitly for N = 3:

p_0 = \frac{1}{3}\left(1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4}\right)
p_0 = \frac{1}{3}\left(\frac{1}{2} + \frac{1}{3} + \frac{1}{4}\right)
p_1 = \frac{1}{3}\left(\frac{1}{2} + \frac{1}{3} + \frac{1}{4}\right)
p_2 = \frac{1}{3}\left(\frac{1}{3} + \frac{1}{4}\right)
p_3 = \frac{1}{3}\cdot\frac{1}{4} \qquad (2.5.18)

The first p_0 (the first line of (2.5.18)) is the probability of no successes under the condition
that n = 0 is included in the number of trials. The other probabilities (the remaining
lines) are pertinent to the problem we are examining and are seen to conform to the
relation

p_j = \frac{1}{N}\sum_{k=j+1}^{N+1}\frac{1}{k} \quad (j \ge 1), \qquad p_0 = p_1. \qquad (2.5.19)

There is, however, a closed-form expression for the sum in (2.5.19) deriving from the
identity

\sum_{k=1}^{N}\frac{1}{k} = \psi(N+1) + \gamma, \qquad (2.5.20)

where ψ(x) is the digamma function

\psi(x) \equiv \frac{d}{dx}\ln\Gamma(x) \qquad (2.5.21)

and γ is Euler's constant12

\gamma \equiv \lim_{N\to\infty}\left(\sum_{n=1}^{N}\frac{1}{n} - \ln N\right) = 0.577\,215\,6649, \qquad (2.5.22)

12 The Euler or Euler–Mascheroni constant is an unending decimal number which (to my knowledge) has not been
proven to be algebraic or transcendental (i.e. not a solution of an algebraic equation with rational coefficients). e and
π are examples of transcendental numbers.

Fig. 2.5 Silverman–Bayes experiment with variable parameters θ₀ (decision) and n (trial size).
Distribution of the number of successes in which n is a random variable (a) U[1,20], (b) U[1,50],
(c) Poi(10), (d) Poi(25); the posterior probability is averaged over allowable values of both n
and θ₀. The mean number of successes is (a) 5.25, (b) 12.75, (c) 5.00, (d) 12.50. [Axes: Number of Successes vs. Probability.]

defined as the limiting difference between the harmonic series and the natural
logarithm. We may therefore express the probability function (2.5.19) in the form

p_j = \frac{1}{N}\left[\psi(N+2) - \psi(j+1)\right] \quad (j \ge 1), \qquad p_0 = p_1. \qquad (2.5.23)

A comparison of the outcomes of using either a Poisson or uniform RNG for the
random selection of n is illustrated in Figure 2.5 for two values of the Poisson mean
(μ = 10, 25) and two values of the uniform upper limit (N = 20, 50), which lead,
correspondingly, to approximately the same mean number of successes (respectively
5 and 12.5). In the first case (Poisson RNG), the probability p_j of obtaining j
successes in a set of trials is maximum for j = 0, drops to approximately half
maximum near j = μ, and is asymptotically approaching 0 by j = 2μ. The larger the
value of μ, the longer the function remains close to its maximum value before
descending rapidly, somewhat like a Fermi–Dirac occupation probability curve.
In the second case (uniform RNG), p_j is maximum at j = 0 and drops monotonically
to 0 at j = N. The two distributions illustrated bear no similarity whatever to a
normal distribution, and therefore it should be no surprise that the significance of the
standard deviation as a measure of uncertainty is very different.
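The four curves of Figure 2.5 and the quoted means follow directly from (2.5.14) and (2.5.23). The sketch below is my own illustration; it evaluates both probability functions (in the forms reconstructed above) and checks that they are normalized and that their means reproduce 5.25, 12.75, 5.00, and 12.50.

# Sketch: evaluate the posterior probabilities p_j for a randomly chosen number
# of trials, using Eq. (2.5.23) for n ~ U[1, N] and Eq. (2.5.14) for n drawn from
# a zero-truncated Poisson(mu), and compare the means with Fig. 2.5.
import numpy as np
from scipy.special import gammaincc, digamma

def p_uniform(N):
    # Eq. (2.5.23): p_j = (1/N)[psi(N+2) - psi(j+1)] for j >= 1, and p_0 = p_1.
    j = np.arange(N + 1)
    p = (digamma(N + 2) - digamma(j + 1)) / N
    p[0] = p[1]
    return p

def p_poisson(mu, jmax):
    # Eq. (2.5.14) written with the regularized incomplete gamma function.
    j = np.arange(jmax + 1)
    p = (1.0 - gammaincc(j + 1, mu)) / (mu * (1.0 - np.exp(-mu)))
    p[0] -= np.exp(-mu) / (1.0 - np.exp(-mu))
    return p

for N in (20, 50):
    p = p_uniform(N)
    print(f"U[1,{N}]:  sum p_j = {p.sum():.4f},  mean = {(np.arange(N+1)*p).sum():.2f}")
for mu in (10, 25):
    p = p_poisson(mu, jmax=4 * int(mu))
    print(f"Poi({mu}): sum p_j = {p.sum():.4f},  mean = {(np.arange(len(p))*p).sum():.2f}")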


One final remark regarding variations on a theme of Bayes is perhaps pertinent
before passing on to other matters. In previous sections we have considered two basic
theoretical models of inference, which I designated B for the Bayes–Laplace
calculation with uniform prior (i.e. uniform on the probability of success P, not on
a deciding variable θ) and J for a calculation with the Jaynes–Haldane prior that
led to the maximum likelihood estimate of the probability of success. Perhaps the
reader may have wondered what one would get by substituting Jaynes' prior f(θ) ∝
[θ(1 − θ)]^{−1} directly into the integral for Bayes' solution, where, contrary to Bayes'
initial formulation of the problem (which Pearson showed was independent of f(θ)), it
is now understood that the symbol θ, and not the cumulative distribution F(θ), is the
probability of success.
The result
\Pr(b;m\,|\,a;n\,|\,\text{Jaynes}) = \frac{\displaystyle\int_0^1\binom{m}{b}\theta^{b}(1-\theta)^{m-b}\,\binom{n}{a}\theta^{a-1}(1-\theta)^{n-a-1}\,d\theta}{\displaystyle\int_0^1\binom{n}{a}\theta^{a-1}(1-\theta)^{n-a-1}\,d\theta}
= \binom{m}{b}\frac{B(a+b,\,n+m-a-b)}{B(a,\,n-a)}
= \frac{\dbinom{a+b-1}{a-1}\dbinom{n+m-a-b-1}{n-a-1}}{\dbinom{n+m-1}{n-1}} \qquad (2.5.24)

when plotted as a function of b for fixed a, n, m is not significantly different from the
Bayes solution (2.1.7) already discussed, except in the case of n = 1 trial, which
produces a completely flat distribution rather than the broad rounded curve of
Figure 2.1a. We may interpret this, as before, by arguing that, until one has executed
a minimum of two trials and obtained one success and one failure, we cannot even
presume that outcomes follow a binomial distribution. With increasing n, the posterior probability of success (2.5.24) approaches the same hypergeometric distribution
(2.1.8) as Bayes' solution.
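The closeness of (2.5.24) to the Bayes–Laplace solution can also be checked numerically. The sketch below (my own, not from the text) compares (2.5.24) with the combinatorial form of the Bayes–Laplace predictive distribution used in Section 2.3, for several prior samples of fixed ratio a/n = 1/2.

# Sketch: compare the Jaynes-prior-in-Bayes'-integral result (2.5.24) with the
# Bayes-Laplace solution in the combinatorial form used in Section 2.3.
from math import comb

def P_jaynes(b, m, a, n):
    # Eq. (2.5.24); requires a >= 1 and n - a >= 1.
    return comb(a + b - 1, a - 1) * comb(n + m - a - b - 1, n - a - 1) / comb(n + m - 1, n - 1)

def P_bayes(b, m, a, n):
    return comb(a + b, b) * comb(n - a + m - b, m - b) / comb(n + m + 1, m)

m = 50
for (a, n) in ((1, 2), (5, 10), (25, 50)):
    diff = sum(abs(P_jaynes(b, m, a, n) - P_bayes(b, m, a, n)) for b in range(m + 1))
    print(f"(a;n) = ({a};{n}):  sum_b |difference| = {diff:.3f}")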

3
Mother of all randomness
Part I
The random disintegration of matter

For repetition is an essential rule and phenomenon throughout the


world: it applies in like manner to the course of the stars as to the
swirling of atoms; to dead as to living particles and substances. . ..
Everything is subject to recurrence and bears this recurrence within
itself. The Law of Seriality links all with the womb of the Universe
from which everything in the world has come.
Paul Kammerer1
3.1 Quantum randomness: is the force with us?
In a section of his book dealing with parapsychology,2 Arthur Koestler examined the
behavior of Austrian biologist Paul Kammerer whose passion was to find and record
coincidences. Kammerer, according to Koestler, would spend hours on park benches,
noting the numbers of passersby in each direction and detailing their age, gender, the
way they dressed and the items they carried. From these observations, leading to his
Law of Seriality, Kammerer concluded that the data revealed irrefutable recurrences of events that defied explanation by coincidence. Kammerers law was
controversial as were also his researches in biology. Depressed, he eventually committed suicide like statistical physicists Boltzmann and Ehrenfest. Perhaps one
should think twice before seriously engaging in statistical physics.
One of the most extraordinary claims I have ever encountered in the modern
peer-reviewed scientific literature a claim that continues to be made, and that to
my knowledge has never been retracted or independently confirmed from the time
of its inception well over a decade ago to the time I published a refutation3
concerns the correlated fluctuations of ostensibly independent random processes.
So bizarre is this claim that were I to begin by paraphrasing it the reader would

1 Paul Kammerer, Das Gesetz der Serie [The Law of Seriality] (Deutsche Verlags-Anstalt, 1919) 456. (Translation from
German by M. P. Silverman).
2 Arthur Koestler, The Roots of Coincidence: An Excursion Into Parapsychology (Vintage, New York, 1972) 85–86.
3 M. P. Silverman and W. Strange, Search for correlated fluctuations in the decay of Na-22, Europhysics Letters 87
(2009) 32001 p1–p5.


immediately suspect an intentional exaggeration, so I will record the authors'
own words:4
It is shown that due to fluctuations, a sequence of discrete values is generated by successive
measurement events whatever the type of the process measured. The corresponding histograms
have much the same shape at any given time and for processes of a different nature and are
very likely to change shape simultaneously for various processes in widely distant laboratories.
For a series of successive histograms, any given one is highly probably similar to its nearest
neighbors and occurs repeatedly with a period of 24 hours, 27 days, and about 365 days, thus
implying that the phenomenon has a very profound cosmophysical (or cosmogonic) origin.

Before commenting on the proffered evidence, let me make the foregoing paragraph
explicitly clear in the context of a field in which I have some expertise (nuclear
physics). Suppose I set up a nuclear counting apparatus to count what in effect
amounts to the number of radioactive nuclei of type X disintegrating in some
specified window of time (let us say one second) and a colleague also sets up
elsewhere (for example in some other part of the building, city, state, country. . .wherever) an apparatus to count decaying radioactive nuclei of type Y which need not be
the same species (or nuclide in the terminology of physics) that I count. The two
sets of apparatus begin their tasks and record chronologically in a long time series
many one-second intervals (bins) of counts. From bin to bin the number of counts
will vary, some bins showing counts greater than the mean, others less, since the
transmutation of nuclei is an archetypical quantum process (a transition between
different quantum states) whose individual occurrences are believed to be random
and unpredictable.
If my colleague's apparatus were to record consistently a greater-than-average
number of decays of nuclide Y whenever my apparatus recorded a greater-than-average number of decays of nuclide X, then statistically we would say that the
fluctuations (variations about the average) of the two stochastic processes were
positively correlated. If his counts were consistently lower than average whenever
mine were greater than average, then the fluctuations would be negatively correlated.
In any event, the two independent decay processes would be correlated, and such
correlations would be contrary to quantum theory as it is currently understood. It
has long been known that spontaneous nuclear decay is virtually insensitive to the
gravitational, chemical, or electromagnetic environment in which the unstable
nucleus finds itself. This broad remark calls for some qualifications, which I will
address later.
A reproducible observation, therefore, of correlated fluctuations in the time series
of identical disintegrating nuclei would call for a most unusual explanation, quite
possibly beyond currently known physical principles. Moreover, for time series of

4 S. E. Shnoll et al., Realization of discrete states during fluctuations in macroscopic processes, Physics–Uspekhi 41 (10)
(1998) 1025–1035. [Uspekhi Fizicheskikh Nauk, Russian Academy of Sciences]


completely dissimilar, ostensibly independent stochastic processes (such as other
types of nuclear decay, chemical diffusion, biochemical reaction kinetics, or whatever) to exhibit cross-correlated fluctuations would be tantamount to the breakdown of all contemporary science rooted in the principles of physics.
Now, in fact, the foregoing scenario was not exactly what the quoted paragraph
claimed, although it would be a necessary consequence. Read the quotation again
carefully. What was actually claimed to have been observed was a correlation in
fluctuations not between time series of events, but between histograms of these time
series. If such were truly the case, this finding would represent an even more peculiar
aberration from current physical principles than what I have just described. Recall
what constitutes a histogram. If, in the hypothetical example I gave, my distant
colleague and I were each to make a plot of the frequencies of occurrence of the
different nuclear counts sorted into an arbitrarily designated number of classes, we
would be graphically representing a multinomial distribution. Now, in partitioning a
time series of nuclear disintegrations into classes defined by the number of counts per
one-second time interval, the temporal information carried by the data is entirely
lost. The events that fell within a bin labeled 100 counts, for example, could have
occurred at the beginning, middle, end. . .anywhere. . .within the period of data
collection. Nevertheless, the authors claimed that fluctuations in the frequencies of
a series of successive histograms showed correlations with a periodicity of approximately one day, one month, and one year. They attribute to this phenomenon an
unknown cause of cosmic origin.
I first learned of these extraordinary claims in requests sent to me by other
physicists expressing their incredulity and asking that I test the claims for validity.
Some years before, and for entirely different reasons, I had published a series of
papers5 examining the most common modes of spontaneous nuclear disintegration
for evidence of nonrandom behavior. These processes included
• alpha decay: emission of a helium nucleus ⁴₂He (the alpha particle),
• beta decay: emission of an energetic electron e⁻ (the beta particle) and an unobserved anti-neutrino,
• electron-capture decay: the capture by a nucleus of an inner-shell electron with
subsequent redistribution of electrons over levels and release of electromagnetic
energy (gamma and X-ray photons). The process is usually designated by the
specific level from which an electron is captured, e.g. K-capture.6
In the physicist's organization of nature, whereby all fundamental forces fall into one
of four categories (gravitational, electromagnetic, weak nuclear, and strong nuclear),
the first one above is an electromagnetic interaction and the next two are examples of

5 See, for example: M. P. Silverman, W. Strange, C. R. Silverman, and T. C. Lipscombe, Tests for Randomness of
Spontaneous Quantum Decay, Physical Review A 61 (2000) 042106 (1–10).
6 For historical reasons the manifolds (shells) of atomic electrons are named (from innermost outward) K, L, M, etc.


the weak nuclear interaction. The researches were undertaken to test one of the most
basic features of quantum physics, namely, that transmutations of unstable nuclei
occur randomly and without memory. A finding that this is not so would have a
twofold significance, at least. First, it would represent a striking violation of quantum
mechanics, the theory that accounts most comprehensively for the structure of matter
and interaction of matter and energy. And second, there would be repercussions for
practical use of nuclear decay as a means of generating true random numbers, in
contrast to pseudo random numbers created by mathematical algorithms run on
computers, for the wide range of applications that require them, such as cryptography, statistical modeling (in science, medicine, economics, etc.), Monte Carlo
methods of simulation, computer gaming, and others.
The series of investigations into the randomness of nuclear decay (which I have
described in my earlier book A Universe of Atoms, An Atom in the Universe)7 was,
I believe, the most comprehensive study of its kind undertaken to that time. The
outcome was to conclude that the data (temporal sequences of nuclear decays) were
thoroughly compatible with what was to be expected on the basis of pure chance
or, in statistical parlance, to say that the results were under statistical control. It is
worth emphasizing at this point that one can never prove that some process is
random, for no matter how many statistical tests the data generated by the process
may satisfy, there is always a possibility of producing yet another test that the data
(or a larger sample of data) may fail. What is ultimately demonstrable is that a
stochastic process may be non-random. A non-random process can furnish information by which future outcomes are predictable to an extent greater than that due
to pure chance alone.
Let us now examine what evidence justified the claims of correlated fluctuations
between stochastic processes and extrapolation to the existence of a new cosmic
force. The observations were of two principal kinds, both based on visual inspection
of the shapes of histograms.
The first observation purportedly manifested what the quoted authors believed to
be discrete states during macroscopic fluctuations. The histograms (exhibited in the
cited article for a variety of nuclear processes such as the alpha decay of plutonium-239 (²³⁹Pu) and K-capture in iron-55 (⁵⁵Fe)) were constructed of layers in which the
first layer recorded frequencies of events i = 1...I, the second recorded frequencies of
events i = 1...2I, the third recorded frequencies of events i = 1...3I, and so on, the jth
layer recording frequencies of events i = 1...jI for some integer I. A striking pattern
of well-defined articulations in the layers resulted, like those shown in the upper
panel of Figure 3.1. I will discuss shortly the significance of this finding, which I have
easily duplicated, but let us first move on to the second piece of evidence.
The second kind of reported observation was a perceived recurrence in time of
histograms of similar shapes. Presented in the article were two composite figures (not
7 M. P. Silverman, A Universe of Atoms, an Atom in the Universe (Springer, New York, 2002).

Fig. 3.1 Twenty-layered histograms (unit intervals between classes) with overlapping (top
panel) and non-overlapping (bottom panel) event histories. Elemental histograms H_a (a = 1,
2, ..., 210) were constructed from 1000 random numbers produced by a Poi(200) RNG.
Correlated-layer histograms L_i (i = 1...20) were formed by superpositions L_i = Σ_{a=1}^{i} H_a.
Uncorrelated-layer histograms were formed by superpositions L_1 = H_1, L_2 = H_2 + H_3, L_3 =
H_4 + H_5 + H_6, etc. [Both panels: Class Value (160–240) vs. Frequency.]

reproduced here) comprising 12 or 18 panels of histograms, each panel a superposition of two histograms separated in time. The histograms, which had been
smoothed to facilitate visual comparison, showed recurrent coarse features such
as broad single peaks, rabbit ears, rolling ridges, and other geometrical structures.
Before examining the matter of correlated nuclear fluctuations rigorously and
comprehensively, I want to stress that the so-called shape of a histogram is an ill-defined geometric feature and not an invariant characteristic of a multinomial
distribution. It can take widely differing forms for a given set of events depending
on the number and widths of the arbitrary classes into which events are assigned.
Irrespective of the validity of the claims made in the article, the visual observation of
patterns is too fraught with human bias to be accepted as evidence of a scientific
phenomenon, especially one at such variance with prevailing physical theory. Indeed,


the branch of mathematics known as Ramsey theory8 virtually guarantees that


almost any sought-for pattern can be found in the distribution of a sufficiently large
set of points. Only a rigorous statistical analysis can reveal whether time series and
frequency distributions actually manifest correlated fluctuations.
The likelihood (in the vernacular sense) that some fundamental part of physics will
be turned on its head is low, to be sure, but it has happened in the past, and who is to
say that it will not happen again? Intrigued by the problem raised by the article,
although not by the evidence itself, I decided eventually to take up the challenge and
see for myself whether some mysterious cosmic force caused periodic correlations
in decaying radioactive nuclei. As an instructive problem, it entails nearly all the
basic probability and statistics that a physicist may need to know.

3.2 The gamma coincidence experiment


To observe whether correlated nuclear decay occurs and to ascertain, presuming it
does, whether the effect is periodic, a particularly clean nuclear process was
needed that is, a well-understood disintegration mode leading to a low background
of competing events. Together with my colleague of many years, Wayne Strange, we
decided to examine the decay of sodium-22 (22Na). This is a form of weak nuclear
interaction in which a positron and neutrino are emitted
22

Na ! 22 Ne e e

3:2:1

rather than an electron and anti-neutrino as in ordinary beta decay.


Macroscopically, the transmutation converts an isotope of sodium into an isotope
of neon, with about 90% of the reactions leading to an excited state of neon, which
subsequently decays to the ground state with emission of a 1274.6 keV gamma
photon. (In standard MKS units, 1 eV = 1.602 × 10⁻¹⁹ J.) The half-life of the
conversion, i.e. the time in which half the sample has decayed, is about 2.603 years
or 950 days. Microscopically, by which I mean the activity taking place within the
²²Na nucleus, a proton has changed its identity to a neutron with release of a positron
(for balance of electric charge) and a neutrino (for balance of what physicists refer to
as lepton number)

p → n + e⁺ + νₑ. \qquad (3.2.2)

(Submicroscopically, an up quark inside a proton changed its identity into a down
quark, but this level of fundamentality is deeper than we need to go here.) Protons
cannot undergo the reaction (3.2.2) in isolation because the mass of a proton is
smaller than that of a neutron. Therefore, in order that energy be conserved, a
sufficient quantity of energy must be provided to the left side of the equation. This

8 R. Graham, B. Rothschild, and J. H. Spencer, Ramsey Theory (John Wiley and Sons, New York, 1990).


takes place within the nucleus when the binding energy (i.e. the energy released when
individual nucleons, protons and neutrons, combine to form a nucleus) of the
mother nucleus (²²Na) is less than that of the daughter nucleus (²²Ne). The energy
difference is then partitioned among the mass and kinetic energies of the product
particles.
There were several reasons for the choice of sodium decay. First, the process
should be governed by Poisson statistics, a hypothesis that I will discuss in more
detail in due course; thus the parent probability function was known and all other
pertinent statistical quantities could be determined analytically. Second, this transmutation is, as mentioned, an example of a weak nuclear interaction with long half-life; thus the time series of decays over the period of our experiment was very nearly
stationary. In other words, the number of radioactive nuclei in the sample was
sufficiently large throughout the duration of the experiment that the mean number
of decays per counting interval (bin) remained nearly constant. In reality, the mean
count decreased slightly in time, but we could detect this and correct for it. Third, the
decay summarized by (3.2.1) yielded a stable nuclide of neon and a single outgoing
positron, which immediately interacted with an ambient electron leading to electron–positron annihilation

e⁺ + e⁻ → 2γ \qquad (3.2.3)

to produce two counter-propagating 511 keV gamma photons. The simplicity of the
final state together with the spatial correlation and narrow energy uncertainty of the γs
permitted us to make gamma photon coincidence measurements with very low
background and high signal-to-noise ratio.
From a philosophical perspective, it would not be an exaggeration to note that the
process (3.2.3) exemplifies the two conceptual pillars upon which rests the entire
aedifice of physics. First is the complete annihilation of matter to pure energy, a
process impossible to imagine before Einstein's theory of special relativity. Each
photon carries the energy equivalent of the mass of one electron, which amounts to
511 keV.9 Second is the entanglement of the two counter-propagating gamma
photons, a quantum mechanical two-particle state that does not factor into products
of single-particle states no matter how great the separation between the particles, and
which subsequently can manifest correlations inexplicable on the basis of classical
physics. Erwin Schrödinger, who developed the form of quantum mechanics initially
known as wave mechanics, coined the term entanglement [Verschränkung], referring to this feature as "the characteristic trait of quantum mechanics, the one that
enforces its entire departure from classical lines of thought."10 Interesting as the

9 The mass m of an electron is approximately 9.109 × 10⁻³¹ kg. The energy equivalent is mc², where c is the speed of light,
2.998 × 10⁸ m/s. This leads to 8.187 × 10⁻¹⁴ J or ~511 × 10³ eV.
10 E. Schrödinger, Discussion of Probability Relations Between Separated Systems, Proceedings of the Cambridge
Philosophical Society 31 (1935) 555–563.


subject is (I discuss it in a previous book11), entanglement plays no role in the nuclear
experiment I am now describing.
The experiment, employing a source of radioactive sodium of initial activity
0.079 μCi, proceeded as follows. (Note: 1 micro-Curie (μCi) of radioactivity equals
3.7 × 10¹⁰ Becquerel, where 1 Becquerel (Bq) is defined to be one decay per second.)
Each disintegration of a sodium nucleus gave rise to one pair of back-to-back gamma
photons. These were detected within a coincidence time interval of 50 ns by a pair of
NaI(Tl) [thallium-activated sodium iodide] scintillation detectors and associated
coincidence electronics accepting only those gamma photons with energy within a
375 keV range from 345 keV to 720 keV, i.e. centered approximately on the electron
mass-energy of 511 keV. Coincidence detection is a highly efficient way of eliminating
noise due to stray particles that happen to enter one's detectors accidentally from
cosmic rays, radioactive inclusions in cinder block walls, or anything else, because the
detection system records a signal only when two particles (in this case gamma
photons) are detected within a specified time interval of one another. The first
particle to arrive at a detector triggered the clock. The number of coincidences within
a sampling window (bin) of Δt = 0.439 s was recorded sequentially for a total
counting time of 167 hours. Under these conditions the coincidence count rate was
about 441 per second, compared with the background rate in absence of the sodium
source of 0.021 counts per second.
The search for correlations in counts from time-separated bins called for examining both the chronological series of counts and the frequency of occurrence of counts
for evidence of non-random behavior, in particular for any indication that the data
exhibited some periodicity or other regularity in time. To this end, the time series of
gamma coincidences was partitioned chronologically into bags, a neologism
defined in Chapter 1, where one bag comprised 8192 bins, or close to one hour's
worth of data. We can represent the time series of coincidences mathematically in
two equivalent ways, depending on whether the focus of interest is the entire series
X = {x_t, t = 1...N}, where

N = 167 bags × 8192 bins/bag = 1 368 064 bins \qquad (3.2.4)

or the partitioned series X = {x_{a,b}, a = 1...167 bags; b = 1...8192 bins}. Next, a
chronological series of histograms {H_a, a = 1...167} was made from the time series of
the bags. Temporal information was lost this way for coincidence events within a
bag, but the sequence of histograms gave a time resolution of about one hour.
The stochastic nature of nuclear decay is vividly depicted in Figure 3.2, which
shows a segment of the time series of coincidences, i.e. a plot of gamma counts
against bin number. (In this figure each bin represents a counting window of 10Δt =
4.39 s.) Note the wide scatter of the counts about the mean. The fascinating aspect of

11 M. P. Silverman, Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement, and Interference
(Springer, Heidelberg, 2008).


Fig. 3.2 Scatter plot of gamma coincidence counts as a function of time (i.e. bin number),
where each plotted bin represents 10 bins, or a time window of 4.39 s. [Axes: Bin (0–1000) vs. Count (~1800–2100).]

a plot such as this one, however, is the seemingly contradictory nature of its two
outstanding features. On the one hand, there is the visual suggestion of pure randomness, the points dotting the plane of the graph like gray snowballs thrown
against a wall. On the other hand, random though the snowball impacts may be,
they are governed by a statistical law, as represented graphically by the three
histograms in Figure 3.3.
The histograms, which respectively summarize the frequencies of occurrence of
gamma coincidences at the very start of the experiment (Bag 1), the middle of the
experiment (Bag 83), and at the end of the experiment (Bag 167), were each subjected to
a chi-square test for goodness of fit to a Poisson distribution of corresponding mean
parameter {μ_a, a = 1...167} obtained for each bag by a maximum likelihood line of
regression fit to the scatter plot. The histograms comprised K = 91 classes of unit
width, each class identified by an integer number of gamma coincidences, spanning a
range centered on the integer closest to the mean. The figure reveals to the eye
graphically how well the histograms conform to a Poisson distribution. However,
the aggregate of chi-square tests, which matches the distribution of resulting values
{χ²_a, a = 1...167} against the theoretical density χ²_{d=89}, shows the mind's eye analytically whether the data support the null hypothesis. A chi-square test of this fit with
14 classes yielded P = 0.26 for χ²_{d=13} = 15.76, which is unambiguously acceptable.
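For readers who wish to see the mechanics of such a per-bag test, the sketch below (my own) applies it to simulated, not actual, coincidence data: one synthetic bag of 8192 Poisson counts is sorted into K = 91 unit-width classes and compared with the Poisson probability function evaluated at the fitted mean, using the same d = 89 reference density mentioned above.

# Sketch of the per-bag chi-square test described above, applied to simulated
# (not real) data: 8192 Poisson counts are sorted into K = 91 unit-width classes
# centred on the mean and compared with a Poisson pmf of the fitted mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
counts = rng.poisson(194, size=8192)               # one simulated "bag"
mu_hat = counts.mean()                             # fitted mean parameter

centre = int(round(mu_hat))
classes = np.arange(centre - 45, centre + 46)      # K = 91 classes of unit width
observed = np.array([(counts == c).sum() for c in classes])
expected = 8192 * stats.poisson.pmf(classes, mu_hat)

chi2 = ((observed - expected)**2 / expected).sum()
dof = len(classes) - 2                             # 91 classes, less 1 constraint and 1 fitted parameter
print(f"chi^2 = {chi2:.1f},  d = {dof},  P = {stats.chi2.sf(chi2, dof):.2f}")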
Thus, although the number of coincidences per bin is randomly scattered about
the mean in Figure 3.2, that scatter falls within fairly well-defined limits set by the
single parameter (mean = variance) of the Poisson probability law. The noted
nuclear physicist turned quantum philosopher, J. A. Wheeler, who (so it seemed to
me) liked to speak and write in riddles, apparently a consequence of his early
exposure to Niels Bohr, conjured up the phrase "law without law"12 to describe

12 J. A. Wheeler and W. H. Zurek, Eds. Quantum Theory and Measurement (Princeton University Press, Princeton NJ,
1983), Chapter I.13 "Law Without Law", 182–213.


Fig. 3.3 Histogram of counts for bags of data collected at the beginning (Bag 1), middle (Bag
83), and end (Bag 167) of the experiment. Each bag represents about one hour of data
accumulation. Superposed on each histogram is a Poisson probability function of
corresponding mean parameter μ. The dashed curve is the Gaussian density N(μ, μ) ≈ Poi(μ),
valid for μ >> 1. [Three panels: Distribution of Counts for Bag 1, Bag 83, Bag 167; Count Class (140–240) vs. Relative Frequency.]

quantum interference phenomena. I cannot say that I ever really fathomed what he
meant by it or by his other cryptic expressions such as "magic without magic",
"higgledy-piggledy universe", "great smokey dragon", and more, but perhaps an
interpretation that gives some sense to the phrase is the highly structured nature of
randomness. The idea of randomness to someone who has not thought about it too
deeply is the occurrence of events willy-nilly, without plan or choice, without organization or direction, without pattern or connection, without identifiable cause, without predictability, to put together some of the common associations I have
encountered. On the contrary, as we proceed with this investigation, it will become
clearer how tightly the patterns of randomness are constrained.

3.3 Delusion of layered histograms


We are now at a point to examine and dispense with the claim that the discrete
structures manifested in layered histograms like the one in Figure 3.1 (top panel)
provide any information whatever about cosmic forces in the universe. The
composite histogram in the figure comprises 20 layers, each of which, in the notation
just developed, can be represented symbolically by L_I = Σ_{a=1}^{I} H_a, where I = 1...20. The
articulated structures, however, merely reflect the increasing (with I) degree of
correlation of layers because of overlapping data sets. The data in the figure, in fact,
were generated mathematically by a Poisson random number generator. The articulations, therefore, are merely an artifact of the mode of data presentation and have
nothing whatever to do with correlated fluctuations arising from any hitherto
unknown physical force. Further evidence of this may be seen in the lower panel of
Figure 3.1, in which each layer comprises a non-overlapping sequence of elemental
histograms: L_1 = H_1, L_2 = H_2 + H_3, L_3 = H_4 + H_5 + H_6, and so forth. The
uncorrelated layers give an impression of a jumble of lines with no discrete structures.
If there is something to be learned from the construction of histograms with
correlated layers, it may be in the domain of art. I have experimented with this genre
of design by using a variety of random number generators to build up the correlated
layers and arrived at patterns pleasing to the eye. The general form of a histogram is
characteristic of the particular RNG employed, but the articulations vary in location
from histogram to histogram because, after all, the underlying sets of numbers are
selected randomly (i.e. pseudo-randomly) by a computer algorithm.
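The construction of Figure 3.1 is simple to reproduce. The following sketch is my own (with matplotlib used only for display); it builds 210 elemental histograms from a Poi(200) generator and superposes them in the two ways described in the caption: cumulatively (correlated layers) and in disjoint groups (uncorrelated layers).

# Sketch of the layered-histogram construction of Fig. 3.1 using a Poi(200) RNG.
# Correlated layers reuse (and therefore overlap) the same elemental histograms;
# uncorrelated layers are built from disjoint groups of elemental histograms.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
classes = np.arange(155, 246)                  # unit-width classes around the mean 200
n_layers, samples = 20, 1000
n_elemental = n_layers * (n_layers + 1) // 2   # 210 elemental histograms

H = np.array([np.bincount(rng.poisson(200, samples), minlength=300)[155:246]
              for _ in range(n_elemental)])

fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
groups = np.cumsum(np.arange(n_layers + 1))    # 0,1,3,6,... boundaries of disjoint groups
for i in range(1, n_layers + 1):
    top.plot(classes, H[:i].sum(axis=0), lw=0.7)                        # L_i = H_1 + ... + H_i
    bottom.plot(classes, H[groups[i-1]:groups[i]].sum(axis=0), lw=0.7)  # disjoint groups
top.set_title('overlapping (correlated) layers')
bottom.set_title('non-overlapping (uncorrelated) layers')
bottom.set_xlabel('Class value')
plt.show()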

3.4 Elementary statistics of nuclear decay


In looking for experimental evidence that nature may not follow the prevailing
statistical model of nuclear decay, one needs to understand first how that model
arose and all that it implies. Let us begin therefore with what I have called the
elementary statistics of nuclear decay, which emerge from several quite different
lines of thought.

3.4.1 Differential one-step-at-a-time method


Our null hypothesis is that the disintegration of an unstable nucleus occurs independently of the prior decays of other nuclei or of the states of nuclei that have yet to
decay. This assumption is justified in the main (. . .there are qualifications, but we
shall pass over them for the present. . .) by quantum mechanical calculations
employing time-dependent perturbation theory that lead to expressions for a time-independent intrinsic decay rate λ. I have discussed comprehensively in a previous
book13 the theory of unstable quantum states. The present emphasis now being on
statistics, not quantum theory, we shall simply regard λ as an empirical constant.
13 M. P. Silverman, Probing the Atom: Interactions of Coupled States, Fast Beams, and Loose Electrons (Princeton
University Press, Princeton NJ, 2000).


The physical content of the null hypothesis is that the probability that a nucleus
decays within a short time interval Δt is proportional to Δt

Pr(1 | λ; Δt) ≡ p = λΔt, \qquad (3.4.1)

where the intrinsic decay rate λ is the constant of proportionality. It then follows that
the probability that this nucleus does not decay within the composite time interval t =
mΔt is

\Pr(0\,|\,\lambda; t) = (1 - \lambda\Delta t)^m = \left(1 - \frac{\lambda t}{m}\right)^{m} \xrightarrow[m\to\infty]{} e^{-\lambda t}, \qquad (3.4.2)

which, in the limit of an infinitely large number of infinitesimally short time intervals,
takes the form of an exponential as shown in (3.4.2). Therefore, the probability that
this nucleus does decay within the time interval t is 1 − e^{−λt}. Under the assumption
that all such decays are mutually independent events, the probability that n out of
N nuclei decay within time t is equivalent to asking for the probability of n successes
out of N independent trials, which immediately calls to mind a binomial distribution

\Pr(n\,|\,N, \lambda; t) = \binom{N}{n}\,(1 - e^{-\lambda t})^{n}\,(e^{-\lambda t})^{N-n}. \qquad (3.4.3)

Equation (3.4.3) gives the exact probability distribution for the disintegration of
radioactive nuclei or, indeed, for the irreversible transition of any quantum system
out of its initial state, subject to the null hypothesis. Although exact, the binomial
expression is not practically useful because the number of nuclei in a macroscopic
sample is astronomically large, even if macroscopic may actually entail a very small
mass. For example, an approximate 0.08 μCi sample of ²²Na has a mass close to
13 μg and contains about 10¹⁷ sodium atoms. I will justify this assertion shortly.
Without further simplification or approximation, no computer can evaluate
combinatorial coefficients \binom{N}{n} with N ~ 10¹⁷. However, the intrinsic decay rate λ
for a weak interaction process is very low. For a process with half-life of ~2.6 years,
the corresponding decay rate is λ ~ 8.4 × 10⁻⁹ transitions per second. This relation
will also be justified in a moment. We have seen, therefore, that when the number of
trials is very large and the probability of success very small, a binomial distribution
Bin(N, p) reduces to a Poisson distribution Poi(μ) of mean parameter μ = Np.
For the case of sodium decay, the binomial-to-Poisson reduction leads to a
mean number of decays at time t given by

μ(t) = N₀(1 − e^{−λt}), \qquad (3.4.4)

where the subscript 0 on N signifies the initial number of radioactive nuclei in the
sample. The mean number of counts N_c that one detects (assuming 100% efficiency
of detectors) in a bin of temporal width Δt, which is equal to the number ΔN of
radioactive nuclei lost from the sample (hence the minus sign) during this interval, is


N_c = -\Delta N = \frac{d\mu(t)}{dt}\,\Delta t = \lambda N_0\,e^{-\lambda t}\,\Delta t \simeq \lambda N_0\,\Delta t = N_0\,p, \qquad (3.4.5)

where the approximation in the second line holds for λt << 1. Treating the discretely
small quantities in (3.4.5) as differentially small leads to the rate equation for loss of
radioactive nuclei

\frac{dN(t)}{dt} = -\lambda N_0\,e^{-\lambda t}, \qquad (3.4.6)

whose solution for N(t) with initial condition N₀ gives the familiar result

N(t) = N_0\,e^{-\lambda t}. \qquad (3.4.7)

The preceding relation makes apparent why λ is termed the decay constant. The
relation of λ to the half-life τ_{1/2} is derivable from (3.4.7) by setting N(τ_{1/2}) = N₀/2 to
obtain

\tau_{1/2} = \frac{\ln 2}{\lambda}. \qquad (3.4.8)
We are now in a position to tie some loose ends together. Adopting the half-life
τ_{1/2} = 2.603 years (8.21 × 10⁷ s) for ²²Na yields λ ≅ 8.44 × 10⁻⁹ s⁻¹. The number of
radioactive nuclei in a sample of activity

|dN/dt| ≅ 0.08 μCi = 2.96 × 10⁹ disintegrations/second \qquad (3.4.9)

is then

N_0 = \frac{|dN/dt|}{\lambda} \simeq \frac{2.96\times 10^{9}\ \text{nuclei s}^{-1}}{8.44\times 10^{-9}\ \text{s}^{-1}} \simeq 3.5\times 10^{17}\ \text{nuclei}. \qquad (3.4.10)

From knowledge of Avogadro's number, N_Av ~ 6.02 × 10²³ atoms per molar mass of
substance, one readily calculates the mass m of N₀ atoms of atomic mass 22 g

m = 22\ \frac{\text{gram}}{\text{mol}}\times\frac{N_0}{N_{\mathrm{Av}}}\times 10^{6}\ \frac{\mu\text{g}}{\text{gram}} = 22\times\frac{3.5\times 10^{17}}{6.02\times 10^{23}}\times 10^{6}\ \mu\text{g} \simeq 12.8\ \mu\text{g}. \qquad (3.4.11)

In summary, a small amount of radioactive sodium can nevertheless contain a lot
of atoms.
Relation (3.4.5) reveals how the mean count per bin will vary in time as a result of
the natural lifetime of the radioactive nucleus. This relation will be put to use later
when it is necessary to detrend the data for subsequent analysis, i.e. to remove the
linear variation with negative slope that is well understood, in order to search the data
for departures from theory that are unexplained. Moreover, if the mean number of
counts per bin, expressible as N_c = N(t)p = λN(t)Δt in view of (3.4.5) and (3.4.7), is much
larger than one throughout the data collection, then, as discussed previously, we are

justified in making a second approximation to the distribution function (3.4.3) by
replacing it with a Gaussian distribution N(μ, μ), where the value of μ will diminish
slowly in time.
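The chain of approximations Bin(N, p) → Poi(Np) → N(μ, μ) is easy to exhibit numerically. The sketch below (my own) uses the decay rate and bin width quoted in the text but a purely illustrative, hypothetical value of N₀, chosen so that the mean count per bin is about 200, comparable to the counts in Figure 3.3.

# Sketch: the chain Bin(N, p) ~ Poi(N p) ~ N(mu, mu) for nuclear counting.
# Illustrative parameters only: lam and dt are the values quoted in the text,
# while N0 is a hypothetical sample size chosen so that mu = N0*p is about 200.
import numpy as np
from scipy import stats

lam, dt, N0 = 8.44e-9, 0.439, 5.4e10        # decay rate (1/s), bin width (s), nuclei
p = lam * dt                                # probability a given nucleus decays in a bin
mu = N0 * p
print(f"mean count per bin  mu = N0*p = {mu:.1f}")

counts = np.arange(150, 251)
poisson = stats.poisson.pmf(counts, mu)
gaussian = stats.norm.pdf(counts, loc=mu, scale=np.sqrt(mu))
binomial = stats.binom.pmf(counts, int(N0), p)
print("max |Poi - Bin|      =", np.abs(poisson - binomial).max())
print("max |Poi - N(mu,mu)| =", np.abs(poisson - gaussian).max())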

3.4.2 Master equation all-at-once method


There is another approach, which goes by the name of a Yule process (named for
George Udny Yule, not for Christmas) or pure birth process14 (so called for a
population that can give birth to new members, none of whom can die), that provides
a functionally different, but equivalent, means of deducing the statistics of nuclear
decay from a set of rate equations. Although we will eventually arrive at the same
final result (3.4.3), the utility of an alternative method of analysis is that it can turn
out to be more suitable for attacking other kinds of problems besides the particular
one under scrutiny.
The method, as I shall adapt it, might be better termed a pure death process,
since it describes irreversible losses of nuclei through disintegration (although the loss
of a nucleus results in the birth of new particles). Designate by P_n(t) the probability of
n disintegrations within the time interval (0, t). N₀ is the initial number of nuclei, and
N(t) is the number remaining at time t. It is assumed here (and justified by quantum
mechanics) that
(a) the probability of nuclear decay is homogeneous in time, i.e. the same for all
time intervals of length Δt, irrespective of the age or history of the sample, and
(b) the rate of decay at any moment is proportional to the number of surviving
nuclei.
That rate, therefore, can be written as
Rn N N 0  n

3:4:12

Rn1 N 0  n  1 Rn :

3:4:13

from which follows

The probability Pn(t t) of n  1 disintegrations within t t is the sum of the


probabilities of two mutually exclusive events: (a) n decays up to time t followed by
0 decay within t, and (b) n  1 decays up to time t followed by 1 decay within t, or
symbolically
Pn t t Pn t1  Rn t Pn1 tRn1 t n  1:

14

3:4:14

W. Feller, An Introduction to Probability Theory and its Applications Vol. 1, (John Wiley & Sons, New York, 1950)
402404.

126

Mother of all randomness I

The interval t is sufficiently narrow that we can ignore the possibility of more than
one decay within it. Rearranging terms in Eq. (3.4.14) and taking the limit t ! 0 leads
to the differential equation

Pn t t  Pn t
dPn t
Lim
3:4:15

Rn Pn t Rn1 Pn1 t:


t!0
t
dt
For n 0, there can be no contribution with n  1 decays, and so (3.4.15) becomes
dP0 t
R0 P0 t:
dt

3:4:16

Upon substitution of expressions (3.4.12) and (3.4.13) for the rates, there results a set
of master equations
dPn t
N  nPn t N  n 1Pn1 t
dt
dP0 t
NP0 t
dt
initial condition: Pn 0 n0

n > 0

3:4:17

that determine completely the exact statistics of nuclear decay.


To solve the preceding set of equations one can solve first for P0(t), which is easily
seen to be an exponential
P0 t eN 0 t ,

3:4:18

and then substitute this result into the differential equation for P1(t). Having found
P1(t), substitute it into the differential equation for P2(t), and so on up the line until
one recognizes a general pattern and can obtain the general solution by induction.
This is a somewhat tedious way to proceed, and I will use a different approach
employing a generating function.
Let us define a probability generating function
Gz, t

N
X

zNn Pn t,

3:4:19

n0

which, in contrast to the single-variable pgfs introduced previously, is now a function


of two variables, z and t. Nevertheless, it is straightforward to demonstrate that, as
before, one calculates the probabilities from (3.4.19) by taking derivatives:

1
Nn Gz, t 
Pn t
:
3:4:20
N  n! zNn z0
Relation (3.4.19) satisfies the initial condition
Gz, 0 zN

3:4:21

since Pn(0) n0 i.e. at the initial instant of time no nucleus has yet decayed (n 0).

3.4 Elementary statistics of nuclear decay

127

We will determine the functional form of G(z,t) by finding the differential equation
that governs it. To this end note the following temporal and spatial derivatives
(actually, z is just an expansion variable, not connected in any way to the spatial
degrees of freedom of the nuclei)
N
N
X
Gz, t X
dPn
zNn N  nPn N  n 1Pn1
zNn

dt
t
n0
n0

z  1

N 1
X

zNn1 N  nPn

3:4:22

n0
N
N 1
X
Gz, t X
N  nzNn1 Pn
N  nzNn1 Pn :

z
n0
n0

3:4:23

In the first line in (3.4.22) the time derivative of Pn(t) was replaced by its equivalent
from the master equation (3.4.17). Note that the two terms in the square brackets
have opposite signs; after some algebraic rearrangement and relabeling of indices the
term in zN drops out and the expression in the second line results.
Examination of the final expressions in (3.4.22) and (3.4.23) reveals the following
equality
G
G
z  1
:
t
z

3:4:24

The simplest approach to solving the differential equation above might be to try separation of variables, i.e. to express the generating function in the form G(z, t) Z(z)T(t).
Although this ansatz15 does not work (the initial condition (3.4.21) cannot be satisfied),
the outcome suggests that G(z, t) might be a function of ln(z  1) t, so let us write
Gz, t Glnz  1  t:

3:4:25

If this ansatz is correct, it would then have to follow that


Gz, 0 zN Glnz  1,

3:4:26

and this, in fact, does work because


h
iN
zN elnz1 1 ,

3:4:27

which now gives us the precise functional form to use for G(z, t) i.e. with timedependence included:
h
iN h

i N
:
3:4:28
Gz, t elnz1t 1 zet 1  et

15

An ansatz, a word borrowed from German, is a mathematical expression assumed to apply in some situation, but
without a rigorous justification for its use at the outset.

128

Mother of all randomness I

One can readily establish that (3.4.28) satisfies Eq. (3.4.24). Moreover, by expanding
the binomial expression in the second equality, it follows immediately that


 N
Gz, t zet 1  et
 

X
N
N
X


N 
Nn
t n t Nn

1e

z
e
z Nn Pn t
3:4:29
n
n0
n0
thereby leading to the same binomial probability distribution Pn(t) as obtained
previously in (3.4.3). In hindsight, replacement of z Nn by sn in the generating
function in (3.4.28) yields precisely what we would have obtained from the binomial
mgf derived in Chapter 1.
Although we already know the conditions under which a binomial distribution
reduces to a Poisson distribution, we can rediscover this connection in a different way
by examining the master equation (3.4.17). If the number N of nuclei in the sample is
enormously greater than the number n decaying within any time interval throughout
the experiment which is assuredly the case in the experiment I am discussing then
one can ignore the dependence of the decay rate Rn on n to obtain a constant decay
rate R N. The master equation (3.4.17) for n  1 then simplifies to
dPn t
N Pn t  Pn1 t RPn t  Pn1 t,
dt

3:4:30

and one can define the generating function


Gz, t

N
X

zn Pn t

3:4:31

n0

where, for all practical purposes, the upper limit of the sum is effectively infinite. It is
then not difficult to establish that G(z, t) satisfies the differential equation
G
Rz  1G,
t

3:4:32

which follows upon neglect (in combining two sums) of a vanishingly small term
/ PN in the limit N ! . The solution to Eq. (3.4.32), subject to initial condition
G(z, 0) 1 is the exponential
Gz, t eRz1t ,

3:4:33

which is recognized to be the pgf of a Poisson distribution with parameter Rt Nt.


3.5 Detrending a time series
Having established that the time sequence and frequency distribution of the
gamma coincidence counts were well accounted for by a Poisson distribution,
the next stage of the investigation was to ascertain whether hidden in the data

3.6 Time series: correlations and ergodicity

129

Fig. 3.4 Scatter plot of the natural log of mean counts per bag (1 bag 8192 bins) as a
function of time. Solid line is the maximum likelihood line of regression for the entire set of
data with slope X 193.8
0.16.

were inexplicable correlations or periodicities. Two powerful analytical tools for


addressing this matter are use of the serial correlation function and the Fourier
transform. Before using these tools, however, it was necessary to remove from the
data the variation in mean count due to the natural lifetime. Although the lifetime
of 22Na is long (~2.6 years) and the duration of the experiment comparatively
short (~167 hours), the instrumentation could readily detect this variation, as
shown in Figure 3.4. The slope of the line of regression corresponds to a
total fractional change in mean count / of only 0.114% i.e. about one part
in 1000.
Each point in Figure 3.4 represents the sample mean of a single bag (8192 bins).
The line through the scattered points is the maximum likelihood (ML) line of
regression for the entire set of data, whose slope furnished the best-fit value of the
stationary mean ^
X 193:8
0:16 counts per bin from which was obtained the
maximum likelihood estimate of the intrinsic decay rate of magnitude
^ 8:27
0:57  109 s1 . The estimated uncertainties in each quantity correspond to
1 standard deviation as calculated by the maximum likelihood method.
Given the ML values ^
X and ^, the full data set {xt} was then transformed to a series
Y fyt g fxt  ^
X ^tg that, in the absence of unknown interactions affecting the
decay of the nuclei, should have a stationary mean of 0.

3.6 Time series: correlations and ergodicity


Let us suppose that we have recorded N time histories {yk (t) k 1. . .N} of a random
signal Y(t) over N sequential intervals of time, each interval ranging from t 0 when
we began recording the particular sample to t T when we stopped. Assuming each

130

Mother of all randomness I

sample function yk to be equally likely, we define the autocorrelation function as the


ensemble average of the set of histories
N
1X
yk tyk t ,
N! N
k1

RY t, t  hY tY t i Lim

3:6:1

where inclusion of the time variable t in the argument of RY signifies that the outcome
may depend on when, within the time history, the ensemble average is taken. The
autocorrelation function describes quantitatively how closely the values of the data
at one time t depend on the values at another time t ; the increment is referred to
as the delay time or lag. In the case of two different random signals X(t) and Y(t), one
can define by an analogous ensemble average the cross-correlation function
N
1X
xk tyk t ,
N! N
k1

RXY t, t  hXtY t i Lim

3:6:2

where, in general, order matters and RXY (t, t ) 6 RYX (t, t ). We shall not be
cross-correlating different functions in this chapter, so as a matter of nomenclature
I will refer simply to the correlation function, which is to be understood to mean
autocorrelation.
If the ensemble mean Y (t) hY(t)i, defined by the same kind of limiting process shown
in (3.6.1), is not zero, then one is usually interested in the covariance function defined by
CY t, t  hY t  Y tY t  Y t i:

3:6:3

For zero delay, Eq. (3.6.3) defines the variance of the stochastic process.
In the special case that the mean and correlation are independent of time translations, the process is said to be weakly stationary, and RY () and CY () depend only on
the delay. If all probability distributions relating to the random process Y(t) are timeindependent, the process is termed strongly stationary. For a Gaussian random
process the two concepts coincide because the mean and covariance determine all
other probability distributions. We shall see in due course that the stochastic model
best fitting nuclear decay is an example of a strongly stationary process. The
autocorrelation function of a stationary process is characterized by the symmetry
RY RY  :

3:6:4

From an experimental standpoint, it may be impractically time-consuming to have to


sample a very large number of time histories of a random process in order to study its
statistical properties (although, in fact, this is what the 167 bags of coincidence-count
data represent). It would be economical, therefore, if the statistics of a random
process could be obtained by a time-average over a single sample history such as

1 T
Y lim
ytdt
3:6:5
T! T 0

131

3.6 Time series: correlations and ergodicity

where the y (t) in the integrand could be any one of the sample functions of the set
{yk (t)}. In general, time and ensemble averages are not equivalent. If, however, a
random process is stationary and the time average does not depend on the specific
sample function yk(t) i.e. it is independent of k then the process is said to be
ergodic from Greek roots signifying path of work or action. For a stationary
ergodic process Y(t), time and ensemble averages are equivalent, and we can express
the correlation function (for Y 0) as
T
1
ytyt dt:
RY  hY tY t i Lim
T! T

3:6:6

There is a long history of investigation of ergodic systems going back to Ludwig


Boltzmann, Henri Poincare, and the Ehrenfests (Paul and Tatyana), discussion
of which is well outside the scope of this chapter. Suffice it to say without further
qualifications, that, according to quantum mechanics, spontaneous nuclear
decay is expected to be an ergodic process. This is an implication of our null
hypothesis.
Because the autocorrelation or covariance is a dimensioned quantity (i.e. it takes
the dimension and units of Y2), it is useful to normalize the expression by dividing by
the variance in order to obtain the dimensionless autocorrelation coefficient
Y

RY hY tY t i

RY 0
hY t2 i

3:6:7

that falls within the range 1  ()  1. Nomenclature is not consistent, and one will
find the normalized expression (3.6.7) referred to as the autocorrelation function, and
the non-normalized expression (3.6.1) as the autocovariance function.
When a random signal being sampled is not continuous, but is discrete as in the
counting of quantum particles, the serial correlation function of lag k defined by16
!
!
Nk
Nk
Nk
1 X
1 X
1 X
Rk
yt 
y0
y0
ytk 
3:6:8
N  k t1
N  k t0 1 t
N  k t0 1 t k
and normalized as follows

rk 2

Nk
Nk
1 X
1 X
yt 
y0
N  k t1
N  k t0 1 t

Nk
Nk
X
1 X
4 1
yt 
y0
N  k t1
N  k t0 1 t

!2

Nk
1 X
y0
ytk 
N  k t0 1 t k

Nk
Nk
1 X
1 X
ytk 
y0
N  k t1
N  k t0 1 t k

!2 312
5
3:6:9

16

M. Kendall, A. Stuart, and J. K. Ord, The Advanced Theory of Statistics Vol. 3, (Macmillan, New York, 1983) 443445.

Mother of all randomness I

132

is an approximate measure of the true correlation coefficient (k), where time is


measured in integral multiples of some unit interval t. The preceding expressions
are rather cumbersome and can be simplified to
Nk


1 X
Rk
yt  y ytk  y
N  k t1



Nk 
1 X
yt  y ytk  y
rk
N  k t1
s0 Y
s0 Y

3:6:10

by using the sample mean and (biased) sample variance


y N 1

N
X

yt

1
s02
Y N

t1

N
X

yt  y2

3:6:11

t1

for the entire series. A disadvantage to the form of rk , however, is that it can
take values greater than 1, in contrast to the behavior of a true correlation
coefficient. Thus, an alternative form often employed for the correlation
function and correlation coefficient is obtained by approximating (N  k) 1
by N1

R0k

Nk
1X
y  yytk  y
N t1 t

Nk
X
yt  yytk  y

r0k

t1
N
X

3:6:12

yt  y

t1

where r 0k now lies strictly within the range (1, 1). The serial correlation function
and coefficient of a stationary random process of zero mean are then
Nk
X

R0k

N k
1X

y y
N t1 t tk

r0k

yt ytk
t1
N
X
yt 2
t1

3:6:13

The adjusted time series obtained in the 22Na gamma-coincidence experiment represents a process of this kind.
For a long time series of data, the calculation of the correlation function
directly from relation (3.6.13) is extraordinarily time consuming, even when a
fast desktop computer is employed. There is, however, a more efficient way to
perform the calculation, based on a relation, known as the WienerKhinchin
theorem, between the correlation function and the power spectrum of the time
series. Beyond facilitating calculation, the power spectrum plays an important
role in this research because it, together with other functions to be described,
reveals what hidden periodicities may lurk within the time series of decaying
nuclei.

3.7 Periodicity and the sampling theorem

133

3.7 Periodicity and the sampling theorem


The variation in time of a continuous real-valued periodic function f (t) with period
T can be represented by a Fourier series in sines and cosines or in complex
exponentials

 X



X
X
2nt
2nt
2nt

f t a0
an cos
bn sin
cn ei T
3:7:1
T
T
n
n1
n1
where, by means of Eulers relation
e
i cos
i sin ,

3:7:2

the two sets of coefficients are related as follows


c0 a0
)
cn>0 12an  ibn
c*n>0 cn 12an ibn

(
,

an>0 12cn cn


bn>0 2i cn  cn :

3:7:3

To determine the coefficients from the original function f (t), it is generally easier to
use the complex form of the series because the basis functions satisfy a very simple
ortho-normalization relation
2

eimm d 2mm0 :

3:7:4

Thus, by multiplying both sides of (3.7.1) by a basis function ei(2kt/T) for integer k
and integrating t over the range (0, T), one obtains
2
T
1
1
c0
f tdt
f d
T
2
0

ck60

3:7:5

1
2kt
1
f tei T dt
f eik d
T
2
0

where the second equality follows from the change of variable 2t/T.
The time series of nuclear disintegrations, which was analyzed for hidden periodic
structure, is not continuous, but a discrete series comprising samples taken every t
seconds (the width of one bin) for a total duration T Nt. To represent a discrete
time sequence in a Fourier series, the expression in (3.7.1) must be modified as
follows
f nt  f n a0

N=2
X
j1


aj cos

 X


N=2
N=2
X
2nj
2nj
2nj

bj sin
cj ei N 3:7:6
N
N
j1
jN=2

Mother of all randomness I

134

with corresponding coefficients




N
1X
N
2ink=N
ck
f e
k 0, 1 . . .
N n1 n
2
N
X
1
f
a0 c 0
N n1 n




N
2X
2nk
N
ak
f cos
k 1...
N n1 n
N
2




N
X
2
2nk
N
k 1...  1 :
f n sin
bk
N n1
N
2

3:7:7

In the discrete series (3.7.6) the ratio t/T became a ratio of integers n/N, the unit t
having canceled from numerator and denominator, and the integrals over t in (3.7.5)
were replaced by sums in (3.7.7). The calculation leading to the Fourier coefficients in
(3.7.7) makes use of a discrete ortho-normalization relation (the sum of a complexvalued geometric series)
N
X
n1

ei

2mn
N

N1m
N

ei

sin m
Nm0
sin m=N

integer jmj N

3:7:8

in place of the integral (3.7.4).


The question Why does the index k, which enumerates the frequencies in the
discrete Fourier spectrum, terminate at N/2 if the index n in the time series goes
up to N ? highlights a subtle, but important, issue known as the Shannon
(or sometimes the NyquistShannon) sampling theorem. 17 There is a simple,
heuristic way to understand the sampling theorem, and a more rigorous,
technical way.
First, the simple way. A time sequence comprising N discrete, non-overlapping
intervals (bins) can manifest no period T longer than its duration. Thus, the longest
period (in units of the bin width t) in the series is T Nt, and correspondingly the
lowest frequency (the fundamental) is 0 1/Nt. Conversely, the sequence can
manifest no period shorter than a single bin in fact, shorter than two bins in
which case the highest frequency contained within the sequence (referred to as the
cut-off frequency c or Nyquist frequency) is c 1/2t. To see that the shortest
period must be two bins (and not one bin), picture two contiguous bins, the first with
a positive-valued sample point and the second with a negative-valued sample point.
One can imagine that these two points sample a sine wave which crosses the (horizontal) time axis at times 0, t, 2t. A single sample point in a single bin provides no
indication of periodicity.
17

C. E. Shannon, Communication in the presence of noise, Proceedings of the Institute of Radio Engineers 37 (1949)
1021. Reprinted in the Proceedings of the IEEE 86 (February 1998) 447457.

3.7 Periodicity and the sampling theorem

135

Now the more technical explanation. Consider a time-varying signal x(t) that is
sampled (measured) periodically every t seconds for a sampling time t. If there
are no gaps in the sampling process, then t t; if, however, t  t, then the
signal is being sampled for only a relatively small fraction of the time. In any
event, since the sampling is periodic, we can represent the sampling function by a
Fourier series
S t

ck ei2kt=t

k

ck eiks t

3:7:9

k

with fundamental period t or sampling angular frequency s 2/t. The functional form of the sampled signal is then xs(t) x(t)S(t), whose spectral content is
given by its Fourier transform

Xs

xs teit dt

X
k

ck

xteiks t dt

3:7:10
ck Xks  :

k

The physical significance of the final expression in (3.7.10) is that the sampling
process has generated an infinite number of replicas of the original signal in
frequency space, the replicated spectral lineshapes being spaced at intervals of
s 2/t or s s/2 1/t.
Let us now suppose that the original signal is band-limited, which means that its
frequency content is confined to a frequency interval 2B about the central frequency.
If the replicas are not to overlap, then the highest frequency of one replica (e.g. B for
the spectrum k 0 centered at the origin) must be less than the lowest frequency of
the succeeding replica (e.g. s  B for k 1), which places a lower limit on the
sampling frequency
s > 2B )

1
1
> 2B ) c 
> B:
t
2t

3:7:11

In other words, as long as the sampling frequency is greater than the bandwidth or,
equivalently, the cut-off frequency exceeds the highest frequency contained in the
frequency spectrum of the signal one can reproduce exactly the original timevarying signal from a single replica of the Fourier transform of the sampled signal,
even if the sampling time t is much shorter than the dead time between samples. This
is actually a remarkable theorem when one thinks about it.
All physical signals are band-limited because the signal must have finite starting
and ending times, but the highest frequencies in the spectrum may exceed a practical
sampling frequency. In that case the replicated spectral lineshapes overlap and an
effect known as aliasing occurs. Signals at frequencies greater than c contribute to

Mother of all randomness I

136

Amplitude

1
0

10

20

30

40

50

60

70

80

90

100

Time
Fig. 3.5 Square pulse of unit amplitude (small circles) of period 100 (arbitrary unit). Fourier
reconstructions (solid) comprise frequencies up to maximum harmonic number n of (a) 1, (b) 5,
(c) 99.

the sampled signal at frequencies below c. Specifically, for any frequency in the
range (c   0), the frequencies (2c
), (4c
),. . .(2nc
) are aliased with ,
as is readily demonstrable from the set of relations below:
cos2 2nc
t
 cos2t
4nc t integer n
1
integer  t 2  integer
4c t 4
2t
cos2 2nc
t cos 2t :

3:7:12

An example that ties these various ideas together is the representation in a Fourier
series of a square pulse of period Tp 2
8
> t > 0
<1
3:7:13
f t 0
t 0, , 2
:
1 2 > t >
sampled discretely in bins of unit width (t 1) over a total time Nt with
2
N t
100. The function is odd over the period, in which case only the coefficients
of the sine series are nonvanishing
an 0

bn>0

21  1n
n

n 0, 1 . . .

3:7:14

as readily determined from (3.7.7). Thus the square pulse can be reconstructed from a
series of the form

 
 

t 1
4
3t
1
5t
sin
:
3:7:15
sin
sin
f t

The plots in Figure 3.5 show the original square wave and Fourier reconstructions
with maximum frequencies n n/Tp marked by harmonic indices n 1, 5, and 99.

137

3.7 Periodicity and the sampling theorem


2

a
Amplitude

2
0

10

20

30

40

50

60

70

80

90

100

Time
Fig. 3.6 Unit square pulse (gray circles) and Fourier amplitudes bn (t) for n equal to (a) 1,
(b) 49, (c) 99. For sampling time t 1, n 49 corresponds to the highest discernible
frequency, whereas n 99 is aliased with the fundamental n 1.

As expected, the greater the number of harmonics included in the series for f (t), the
closer the reconstruction resembles the square wave. The figure also illustrates a
point made in the heuristic explanation of the sampling theorem. If the original
function were sampled once every 50 units of time (half the period), then the cutoff frequency would be the reciprocal of the period c 1/2 1/100, and aliasing
would occur for harmonics with frequencies

 

n
n
1

> c
) n > 1:
n
T p 2
2
In other words, all frequency components beyond the fundamental would be
aliased. This result accords physically with the observation that if we had only
two sample points one in each bin of width 50t we could not tell to which
harmonic a point belonged. However, if the original function were sampled once
every t 1 unit of time, as was actually the case in construction of the figure, then
the cut-off frequency would be c 1/2t 1/2, and aliasing would occur for
harmonics with frequencies

 

n
n
1
n

> c
) n > 50:
T p 100t
2t
In other words, only those terms with harmonic numbers n > 50 would be aliased.
Figure 3.6 illustrates the aliasing phenomenon explicitly. Because the Fourier
spectrum of the square wave pulse (3.7.13) contains only odd-integer sine waves as
represented by (3.7.14), the cut-off frequency c 1/2 actually corresponds to
harmonic n 49 (since there is no contribution from n 50). Superposed over the
square pulse (gray circles), are the sine waves bn(t) corresponding to harmonic
numbers n 1, 49, 99. For a sampling interval t 1 and pulse period 100t, the

138

Mother of all randomness I

49
highest discernible frequency corresponds to harmonic n 49: 49 T49p 100
< 12
99
(oscillatory black). The harmonic n 99 at frequency 99 T99p 100
(dashed black)

1
appears to have the same frequency as the fundamental (solid black), 1 T1p 100
,

with phase shift of 180 or radians. This agrees with relation (3.7.12) that frequency

1
99
1
2c  1  100
100
is aliased with 100
. To distinguish the aliased harmonic

n 99 from the fundamental n 1, it would be necessary to sample the square wave


at twice the rate, i.e. at 2/t.

3.8 Power spectrum and correlation


A real-valued time series x(t) and its Fourier transform

xte2it dt

3:8:1

are random variables18 whose measurable statistical properties are represented by


expectation values. Of particular significance for describing a stationary random
process are the autocorrelation function at lag
RX hxtxt i

3:8:2

and its Fourier transform, the power spectral density SX() at frequency

S X

RX e2i d:

3:8:3

The function SX() is a measure of the energy content or, more accurately, the
rate of energy transfer or power in the frequency range (, d). The term
power calls to mind an electromagnetic wave (think Poynting vector), but the
terminology is applied as well to any stochastic time record such as the record of
gamma coincidence counts obtained from the decay of 22Na. Although defined by
(3.8.2), the autocorrelation function is also deducible from the inverse Fourier
transform of (3.8.3)

RX

SX e2i d:

3:8:4

18

It is common notation in physics to represent a time series by a lower-case letter and its Fourier transform by the
corresponding upper-case letter. This contrasts with our previous notation, also in common usage, of representing a
random variable by an upper-case letter and its realization in a sample by the corresponding lower-case letter. It is
impossible to remain entirely consistent in all matters of notation, as one would soon exhaust the supply of familiar
symbols.

139

3.8 Power spectrum and correlation

From relations (3.8.2) and (3.8.4) it follows that at zero delay the autocorrelation

RX 0 hxt i

SX d

3:8:5

gives the mean square value of a time series, which is equivalent to the integrated
power spectrum. If the mean hx(t)i 0, the integrated power spectrum equals the
variance 2X .
The pair of relations (3.8.3) and (3.8.4) are known as the WienerKhinchin (WK)
theorem. Together, they provide an indispensable set of tools for investigating
correlations and periodicities that may be hidden in a noisy signal. What makes the
WK theorem a theorem and not merely a trivial Fourier transform pair is that it
remains valid even in the case of a non-square-integrable signal x(t) for which the
Fourier transform X() does not exist. We shall be working with signals, however,
that do have a Fourier transform.
An alternative way of arriving at SX() is to substitute the expressions

xt

Xe2it d


*

3:8:6

x t x t

* 2it

X e

into the defining relation (3.8.2) for RX() to obtain

RX

d0 hXX* 0 i e2i t e2i

d


SX 0

SX e2i d:

3:8:7

Consistency between (3.8.7) and (3.8.3) then leads to the relation


SX  0 hXX*0 i

3:8:8

which one might think to simplify to


SX hjXj2 i:
However, the presence of a delta function in (3.8.8) is necessary for dimensional
consistency, since SX() is a density (power per unit frequency) and () has
dimension of inverse frequency as does also X(). The physical content of (3.8.8)
is that different frequency components of a stationary random process are
uncorrelated.

Mother of all randomness I

140

The functions RX() and SX() are even functions of their arguments
RX  RX

SX  SX :

3:8:9

The second symmetry is a consequence of the first, and, as noted in (3.6.4), the first
follows from the hypothesis of a real-valued stationary random process.
Putting the pieces together, one can analyze a stochastic time record in either of
two ways as symbolized by the chain of steps:
A xt ! X ! SX ! RX
B xt ! RX ! SX :
It is instructive to apply the WK relations to two examples (a) white noise and (b) a
purely harmonic process since these examples arise in our search for hidden
correlations and periodicities in nuclear decay (and other spontaneous quantum
processes). The first arises because, if our null hypothesis is true, then the disintegration of nuclei is a white-noise process. And if the null hypothesis is not true, then
the second process may possibly lie hidden in the time record of decays.
White noise refers to a stochastic process with power uniformly distributed over the
entire frequency spectrum. Alternatively, it may be regarded as the ultimate expression
of randomness whereby no two distinct points of a time-varying function are correlated, no matter how close in time they occur. The consistency of these two viewpoints
follows from the WK theorem. Consider the first (constant spectral density)

SWN 

) RWN

e2i d 2 ,

3:8:10

and the second (no correlation for any delay 6 0)

RWN  ) SWN
2

e2i d 2:

3:8:11

The assumption in (3.8.11) that the mean value of the noise is zero identifies the
constant in (3.8.10) as the variance of the noise. We have also made use of the
familiar representation of the delta function
1

2

e


e2i d

3:8:12

in which the absence or presence of the factor 1/2 depends on whether integration is
over frequency or angular frequency.
The opposite of a completely random process is a perfectly deterministic one.
Consider the harmonic function x(t) A cos(20t) of constant amplitude A and
frequency 0. Invoking the ergodic theorem, which equates ensemble and time

3.8 Power spectrum and correlation

141

averages for a stationary random process, we can calculate the autocorrelation


function from the limit
1
R Lim
T! 2T

A2 cos 20 t cos 20 t dt


A2
cos 20 :
2

3:8:13

The correlation coefficient () R()/R(0) cos(20) shows perfect correlation


1 between points separated in time by 10 , 20 , . . . n0 for integer n  1, and
perfect anti-correlation 1 for points separated in time by 210 , 230 , . . . 2n1
20 .
From the WK theorem, the power spectral density
A2
S
2

e


2i

A2
cos 20 d
4


e2i0 e2i0 d

A2

 0 0
2

3:8:14

is seen to comprise only two components at frequencies


0. Allowing the
frequencies of the power spectral density to span the entire real axis is a mathematical
convenience facilitating calculation. The physically significant (i.e. measurable)
power spectrum, sometimes represented symbolically by G() 2S(), comprises
only non-negative frequencies.
To investigate a discrete time record obtained by sampling a random process at
time intervals of t i.e. with a cut-off frequency c 1/2t we need a discrete
counterpart to the WK theorem. A discrete representation of the autocorrelation
function, which maintains the symmetry exhibited in (3.8.9), takes the form
RX / R0

m
X

Rk  kt kt

3:8:15

k1

where the delta functions restrict to integer multiples {k 0,1,. . .m} of the sampling
time t. Substitution of (3.8.15) into expression (3.8.3) for SX() and integration over
leads to the relation
 
m
m
X
X
k
SX / R0 2 Rk cos 2kt R0 2 Rk cos
c
k1
k1
)
(


m
X
k
/ R0 1 2 rk cos
3:8:16
c
k1
in which the definition of the cut-off frequency c was used in the second expression,
and the definition of the discrete correlation coefficient rk Rk/R0 was used in the
third. Because a delta function of time in (3.8.15) is a density, it has units of inverse
time. The proportionality constant in (3.8.16) must then be proportional to t, the
only temporal parameter available. If we restrict the frequency to physically

Mother of all randomness I

142

meaningful positive values only, then the proportionality constant linking the physically realizable power spectral density GX() to the elements of the autocorrelation
function is 2t. Since most applications are concerned only with the content of the
spectrum and relative strengths of the spectral amplitudes, the value of the proportionality constant is usually of no consequence, and, unless otherwise indicated, we
will regard (3.8.16) as an equality.
An alternative way19 to introduce the power spectrum that will prove useful later
is to construct from a time series {yt t 1. . .N} of zero mean the two functions
N
1 X
yt cos t
A p
N t1
N
1 X
B p
yt sin t
N t1

3:8:17

with 2t, which resemble the coefficients of a Fourier series, and define the
power spectrum by
S A2 B2
(
)
N
N X
N
X
1 X
2
0

y 2
yt yt0 cos t  t
N t1 t
t0 1 t1
1

t0 6t

N
X

y2t

t1

N 1 X
Nk
X

yt ytk cos k

k1 t1

R0

N1
X
1 2 r k cos k :

N
N 1
N k
X
1X
1X

y2t 2
yy
N t1
N t1 t tk
k1
R0

!
cos k

Rk

3:8:18

k1

The transition from the second line to the third is made by a change in summation
index t0 t k, where k is the lag (of which the unit t has been absorbed in the
definition of the angle ). The final expression in (3.8.18) is identical in form to
(3.8.16) except that the sum includes all possible correlation coefficients and not
just those up to an arbitrarily set maximum lag m. The question of what value
should be taken for the maximum lag in some particular application will be
discussed later.
p
I call attention to the fact that the normalization constant in (3.8.17) is 1= N and
not 1/N as is usually the case in summing over elements of a statistical set (e.g. in
forming a sample mean or variance). The virtue of this choice, which is a convention
adopted in certain algorithms for rapid computation of Fourier series (to be elaborated later), is that it leads to the following large-sample (N  1) limits (for 6 0, )
19

M. Kendall, A. Stuart, and J. K. Ord, The Advanced Theory of Statistics Vol. 3 (Macmillan, New York, 1983) 510511.

3.8 Power spectrum and correlation


N
1X
1
cos 2 k !
N k1
2

N
1X
1
sin 2 k !
N k1
2

143

N
N
X
1X
cos k
sin k 0 ! 0 3:8:19
N k1
0
k 1

that will facilitate determining how various functions of Fourier amplitudes are
distributed.
Although the discrete power spectrum (3.8.16) can be evaluated for any frequency
up to the cut-off frequency c, the set of discrete frequencies {j jc/m} for j 0,
1,. . .m is particularly convenient20 as it leads to m/2 independent spectral estimates.
This follows from recognizing that points in the discrete time series Rk separated by
intervals less than mt m/2c can be correlated. In the frequency domain, therefore,
points separated by less than the reciprocal interval 2c/m can be correlated. Evaluated at this special set of discrete frequencies, the spectral density becomes
 
 
m
m
X
X
jk
jk
Rk cos
r k cos
/12
3:8:20
Sj / R0 2
m
m
k1
k1
and can be shown to satisfy a kind of completeness relation
!
m1
X
1 1
1
S0
Sj S m R 0 :
m 2
2
j1

3:8:21

The demonstration of (3.8.21), which uses complex exponentials to achieve a remarkable reduction in what appears at first sight to be a complex function (in both senses
of the word) will be left to an appendix.
To appreciate the utility of autocorrelation and power spectral analysis for
recovering information from a noisy signal, consider a sampled signal of the form
 
 
2t
2t
xt A cos
B cos
t
3:8:22
Ta
Tb
where (t) is a random variable of type U(0, 1). In other words, at each instant t at
which the signal of interest is sampled, the measurement includes a randomly
fluctuating component of magnitude between 0 and 1. The objective of the measurement is to confirm the existence of any periodic terms and to estimate their periods.
This would be impossible to do by looking at the observed time series in the top panel
of Figure 3.7, which shows (in gray) the signal sampled at unit intervals t 1 for a
total recording time of N 512 intervals. The parameters of the non-random part of
the theoretical waveform (3.8.22), shown in black and displaced downward by one
unit for visibility, are Ta 10, Tb 20, A 0.20, B 0.15.
The middle panel of Figure 3.7 shows the discrete autocorrelation function (black
points) of the observed signal (3.8.22) calculated directly from the defining relation

20

J. S. Bendat and A. G. Piersol, Measurement and Analysis of Random Data (Wiley, New York, 1966) 292.

Mother of all randomness I

144
1

Signal x(t)

0.5
0
0.5
1
1.5
0

50

100

150

200

250

Time
Correlation r(k)

1
0.5
0
0.5
1
0

10

20

30

40

50

60

70

80

90

100

30

35

40

45

50

Lag
Power G(j)

20

10

0
10
0

10

15

20

25

Harmonic
Fig. 3.7 Upper panel: periodic signal (black) x(t) 0.20 cos (2t/10) 0.15 cos (2t/20)
displaced downward by one unit for clarity; empirical signal (gray) made noisy by
superposition of U(0, 1) noise sampled at intervals of one time unit for a total of 512 time
units. Middle panel: autocorrelation of the periodic signal (gray) and empirical signal (black
points). Lower panel: power spectrum of empirical signal showing harmonics at j 10,
20 corresponding to periods of 20 and 10 time units derived from the autocorrelation
function of maximum lag 100.

145

3.8 Power spectrum and correlation

(3.6.12) up to lag m 100, after first having been transformed to the corresponding
record y(t) of zero mean. Also shown (gray curve) is the theoretical autocorrelation of
the noiseless signal
 
 
R
A2
2
B2
2

,
3:8:23
r
cos
cos
2
R0 A B2
Ta
Tb
A2 B2
where
0

1
T
1
R Lim @ xtxt dtA:
T! T

3:8:24

(Note that the time average of the noiseless signal is zero.) Although it is now clear
from the plot that the noisy signal contains periodic terms, their periods and amplitudes are not evident.
This information is provided by the power spectrum, shown in the bottom panel
of Figure 3.7, which was calculated by (3.8.20) at the special set of discrete frequencies {j jc/m}. The abscissa, labeled by harmonic index j, unambiguously shows
harmonics at j 10 and 20. The period corresponding to a particular harmonic is
obtained from the reciprocal relation
1
jc
j

 j
m
Tj
2mt

T j 2m

:
t
j

3:8:25

Thus, for maximum lag m 100, the power spectrum correctly reveals periods of
T10 20 and T20 10 time units with a ratio of power spectral amplitudes S20/S10
2.3 close to the theoretically exact value A2/B2 (2.0/1.5)2 ~ 1.8.
The necessity or at least advantage of working with a time series y(t) of zero
mean may be seen by considering the relation between the autocorrelation function
Y
X
Rk and the corresponding function Rk of the stationary random process of sample
mean x 6 0
Y

Rk

Nk
Nk
Nk
1X
1X
1X
X
yt ytk
xt  xxtk  x
xt xtk  x2 Rk  x2 :
N t1
N t1
N t1

3:8:26
Y

Recognizing that R0 is the (biased) sample variance s02


Y , we can deduce from (3.8.26)
a relation between correlation coefficients of the two time series

2
Y
rk x=s0Y
X
:
3:8:27
rk
1 x=s0Y 2
From (3.8.27) it is evident that in the limit of large mean for fixed variance the coeffiX
cients r k all approach 1. Had we not worked with a time series of zero mean, but

146

Mother of all randomness I

performed the analysis instead with a series of significant non-zero mean, the autocorrelation coefficient as a function of delay would have been a downward sloping line
weakly modulated by oscillations of low contrast. Correspondingly, the power spectrum
of this autocorrelation would have been dominated by a strong peak at index j 0,
which could have distorted the power distribution at other frequencies.
The necessity of working with a time series of zero mean applies as well to an
alternative approach to recovering information from a noisy signal. We could have
proceeded by calculating first the Fourier spectral amplitudes fan , bn n 0 . . . 12Ng of
the (zero mean) time record y(t) by use of a fast Fourier transform (FFT) algorithm,
and then obtained the discrete autocorrelation function Rk from the inverse FFT of
the power spectrum Sn a2n b2n . Although stating this procedure in words may
make it sound complicated and time-consuming, in practice as applied to long time
series (e.g. of nuclear decay data) it has led to results in seconds that otherwise would
have required hours to compute.
Perhaps the most familiar FFT algorithm is the one developed by Cooley and
Tukey21 in 1965. The description of this and other algorithms goes beyond the
intended scope of this chapter, but several points are worth noting. The Cooley
Tukey method calculates a discrete Fourier transform (DFT) of length N by performing a number of operations of order N log N, rather than the much larger N2
which typifies the calculation of Fourier amplitudes directly from the defining
integrals. Also, the CooleyTukey FFT algorithm relies on a factorization technique
that requires N to be a power of 2 hence the choice N 512 29 in the preceding
illustration. For a time record of 512 points, it matters little in terms of efficiency
which method one employs. However, for a record containing more than 1 million
bins of nuclear decay data, the relative efficiency of using the CooleyTukey FFT
algorithm instead of direct evaluation of the defining sums or integrals goes as
106
1:7  105 . Thus, a one-second calculation by FFT could take more than
log106
40 hours by the direct computation.
3.9 Spectral resolution and uncertainty
There is a connection between the duration of a time series and its spectral bandwidth
analogous to the quantum mechanical uncertainty principle governing the measurement of location and momentum of a particle. Indeed, because the mathematics of
waves describes the statistical behavior of quantum particles, the latter uncertainty
principle may in some ways be regarded as arising from the former.22 This constraint
plays a role in the sampling theorem previously described and has consequences for
calculation of autocorrelation functions and power spectra.
21
22

J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of
Computation 19 (1965) 297301.
M. P. Silverman, Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement, and Interference
(Springer, Heidelberg, 2008).

3.9 Spectral resolution and uncertainty

147

Consider a time series of zero mean y(t) that is nonvanishing only within the
interval (0, T). If this record were repeated multiple times, it would constitute a
periodic function of period T and therefore of fundamental frequency 0 1/T.
A Fourier representation of this series would then take the form

y t

cn e2int=T

 n  X
 n 
an cos 2 t
bn sin 2 t
T
T
n0
n1

n

3:9:1

with coefficients given by (3.7.5)


T
1
n
yte2iTt dt
cn
T

c0 0:

3:9:2

Note the form of (3.9.2); this coefficient is identical to T1 Y n=T , where Y() is the
Fourier transform of y(t). Thus the specific set of samples {Y(n/T)} determines the set
of coefficients {cn} which determines the function y(t) which determines the transform
Y() for all frequencies . Symbolically:
fYn=Tg ) fcn g ) yt ) Y

for all c :

3:9:3

If Y() were confined to the frequency band (B, B), then the minimal number of
discrete samples of Y() needed to describe y(t) a quantity referred to as the number
of degrees of freedom would be

2B
2B
2BT:

0
1=T

3:9:4

An equivalent way to understand relation (3.9.4) is to recognize that the Fourier


series in (3.9.1) must vanish when frequency (n/T) > B for a band-limited signal i.e.
when n > BT. Thus there are n BT independent sine terms and n 1 BT
independent cosine terms for a total of 2BT 1  2BT independent terms needed
to specify the band-limited and time-limited signal y(t). If the mean of y(t) is zero,
then a0 0, and n cosine terms are independent.
Alternatively, we could have performed the previous construction by imagining
multiple repetitions of a function Y() of frequency defined over an interval (B, B)
with corresponding fundamental period 1/2B. This function would be represented by
a Fourier series
Y

1 X
Cn e2inf =2B
2B n

3:9:5

with coefficient
1
Cn
2B

B
B

Y e2i2B d
n

C0 0

3:9:6

Mother of all randomness I

148

1
that is seen to be identical to 2B
yn=2B. By the same reasoning as before, therefore, it
follows that the specific set of samples {y(n/2B)} determines the set of coefficients
{Cn} which determines the function Y() which determines the transform y(t) for all t.
Again, symbolically

fyn=2Bg ) fCn g ) Y ) yt for all t T :

3:9:7

If y(t) were confined to the range (0, T), then the minimal number of discrete
samples of y(t) needed to describe Y() leads to the same number of degrees of
freedom

T
2BT:
1=2B

3:9:8

In short, the sampling theorem links the duration of a signal, the highest frequency
in its spectrum, and the number of samples required to characterize the signal
completely. However, it is, in fact, not possible for a function of finite duration to
have a finite bandwidth. For example, a pure sine wave extends infinitely in time.
Correspondingly, a delta-function pulse, which vanishes at all times except for a
single instant, has an infinite spectral content.
In general, it can be shown by means of Parsevals theorem23

jY j2 d

yt dt


3:9:9

(which gives equivalent expressions for total power in terms of integration over time
or over frequency) and the Schwartz inequality

f t dt gt dt 

2
f tgtdt

3:9:10

(which generalizes to integrals of arbitrary functions a geometric inequality relating


the lengths of the sides of a triangle) that the spread in time T over which a signal is
recorded and the corresponding spread in frequencies of its spectral content must
satisfy the relation24
T 

23

24

1
,
4

3:9:11

Although Parsevals theorem can be interpreted in terms of an integrated power, the functions in the integrand are not
random variables and no ensemble or time averages are involved.
A corresponding relation in quantum mechanics is ET  12
h in which E  2
h is the uncertainty of the energy of
a quantum system whose duration is uncertain by T. The universal constant 
h (pronounced h-bar) is Plancks
constant divided by 2.

149

3.9 Spectral resolution and uncertainty

where

2 Y 2 d

t yt dt
2

T 2  

2  

yt2 dt

3:9:12

Y 2 d


In practice, signals of experimental interest must always be of limited duration and


can be regarded (or, by means of filters, constructed) to have an effective bandwidth.
Consider the important example of band-limited white noise a generalization of
the process of unlimited white noise discussed in the previous section whose spectral
content is uniformly distributed over an interval 2B, as described by the power
density

1
B   B :
S 2B
3:9:13
0
otherwise
The WK theorem then yields the autocorrelation function
B
2
cos 2 d
RWN 2 S cos 2 d
2B

sin 2B

,
2B

3:9:14

whose form is the so-called sinc function [sin(x)/x] commonly encountered in the
analysis of optical diffraction phenomena. The function is maximum RWN(0) 1 at
0, drops to zero at
1/2B, and undulates thereafter with small decreasing
amplitude. Points in the time record separated by a delay equal to 1/2B (and,
practically speaking, for all delays greater than 1/2B) are uncorrelated. Thus, the
number of degrees of freedom expressed in (3.9.8) is the number of statistically
uncorrelated samples within the recorded time T of the signal. These samples are
statistically independent in the special case of band-limited Gaussian white noise.
The matter of correlation and independence is important and somewhat subtle. If
two random variables are independent by which is meant that their joint probability density factors p(x, y) p(x)p(y) then they are uncorrelated according to the
general defining relation (the Pearson correlation coefficient)
X,Y 

hX  X Y  Y i
hX  X i hY  Y i




!
0:
X ,Y independent
X Y
X Y

The converse is not always true: if X and Y are uncorrelated by which is meant
that X,Y 0 they are not necessarily independent. Consider the two functions
X(t) sin t and Y(t) sin2(t). The latter is completely dependent on the former, but

Mother of all randomness I

150
0.6

Max Lag = 100


0.4

Power Spectral Density

0.2

0.2
0

10

12

14

16

18

20

22

24

26

28

30

10

11

12

13

14

15

0.8

Max Lag = 50
0.6
0.4
0.2
0
0.2
0

Harmonic Number
Fig. 3.8 Power spectral density SX() (Eq. (3.9.15)) of the periodic signal in Figure 3.7 as a
function of with step size 0.1 for maximum lag m 100 (upper panel) and m 50 (lower
panel). Peaks occur at harmonics i 2m/Ti (i 1, 2) where T1 10 and T2 20 time units.
Peak width, measured between zero crossings flanking the central maximum, is precisely
2.

X,Y 0 because of the respective odd and even symmetries of the two functions. Zero
correlation implies independence, however, in the case of two jointly normal
variables.
The number of degrees of freedom has implications for the maximum lag m at
which to evaluate the autocorrelation of a time series. Recall that m is also the
number of correlation functions Rk or coefficients rk contributing to the power
spectrum (3.8.20). Figure 3.8 shows the normalized power spectral density
(

)
m
X
1
k
r k cos
SX
12
3:9:15
m
m
k1

3.9 Spectral resolution and uncertainty

151

of the noiseless part of x(t) in the previous example (3.8.22) calculated for maximum
lags of m 50 and 100 time units and signal duration of N 512 time units.
To enhance visibility, the computation was performed at more points than just
the special set of discrete harmonics j previously introduced. The larger the value
of m, the greater is the separation of the peaks. Thus, higher m affords higher
resolution.
As a heuristic explanation of this feature, consider that the autocorrelation
function {Rk k 0, 1. . .m} constitutes a time record of length mt and therefore an
effective bandwidth Be 1/mt. The higher m is, the narrower is Be.
There is a downside, however, to making the maximum lag too large. For a
bandwidth B e 1/mt and record length T Nt, the number of degrees of
freedom (3.9.8) becomes 2N/m. The expression (3.6.13) for the sample
autocorrelation was derived under the condition N  m. As m increases for
fixed N, decreases, and the sample estimate of the correlation function as a
whole may be poor (although estimates of individual points may
remain good).
One also runs up against the uncertainty principle (3.9.11). The variance in
the measurement of S() within any narrow range about some specific frequency
 0 is inversely proportional to 2N/m. I will discuss shortly with greater rigor
the statistical distribution of the power spectrum and other random variables arising
from Fourier analysis of the time series of nuclear decays. For now, however, the
uncertainty principle affords a complementary way to understand why fluctuations
in the measurement of S() become greater with increasing m. The maximum time lag
mt corresponds to an effective frequency bandwidth Be (mt)1. As m is
increased for fixed total record length Nt, and decreases, one eventually violates
the uncertainty principle


1
N
1
 :
T Nt
mt
m 4

3:9:16

Violation of (3.9.16) results in large fluctuations in the measurement of S() because


determining to a precision commensurate with resolution would require a longer
record length than what was available.
A simple computer experiment with the model (3.8.22) illustrates this
important point. With coefficients A and B set to 0 so that the time record
x(t) should be pure noise, five sequential measurements i.e. computer simulations with a U(0,1) RNG were made of the power spectrum
m
G j=2m  Gj at harmonics j corresponding to periods of 10 and 20 time
units respectively for fixed record length N 512 with maximum lags first of
m 100 (harmonics j 20, 10) and next with m 20 (harmonics j 4, 2).
The outcome is tabulated as follows.

Mother of all randomness I

152

100

100

20

20

Trials

G10

G20

G2

G4

1
2
3
4
5

1.04
0.62
2.00
0.25
1.39

0.87
1.61
1.80
0.40
0.88

1.37
0.86
0.66
0.55
1.05

1.06
1.20
0.56
1.04
1.03

Calculation of the mean and standard error (SE) of the five measurements for each
100
choice of j and m confirm empirically upon comparing the ratio SE/Mean of G10 with
20
100
20
G2 and of G20 with G4 the greater degree of uncertainty in the power spectral
amplitudes obtained from autocorrelation functions of higher maximum lag times.
100

100

20

20

Statistic (pure noise)

G10

G20

G2

G4

Mean
Standard error
Ratio (SE/mean)

1.060
0.304
28.6%

1.112
0.259
23.3%

0.898
0.146
16.2%

0.978
0.109
11.1%

Thus the choice of maximum lag requires a compromise to achieve both good
spectral resolution and statistical reliability.
If, however, harmonics of amplitudes significantly above the noise level are actually present in the time series, the choice of maximum lag is less influential.
A repetition of the preceding experiment when the original amplitudes A 0.20,
B 0.15 were retained, led to the following outcome.
100

100

20

20

Statistic (A 0.20, B 0.15)

G10

G20

G2

G4

Mean
Standard error
Ratio (SE/mean)

10.28
0.49
4.8%

16.84
1.03
6.1%

3.36
0.12
3.7%

3.81
0.30
6.8%

This shows relatively little difference in the ratio of standard error to the mean for
corresponding spectral peaks.
In the search for periodicities in a time series of nuclear decays, no such harmonics
are expected to be present.

3.10 The non-elementary statistics of nuclear decay


Although the elementary statistics of nuclear decay, discussed previously, led to time
series of counts described by binomial and Poisson statistics, the use of

3.10 The non-elementary statistics of nuclear decay

153

autocorrelation and power spectral analysis to search for hidden correlations and
periodicities introduces other random variables and their associated statistical distributions. These non-elementary statistics arise in asking how the Fourier amplitudes
(real part, imaginary part, modulus, phase) of the time series and the elements of the
correlation function, correlation coefficient, and power spectrum are distributed.
Since these are random variables, different ensembles (the bags of data) will almost
certainly produce different amplitudes for the same harmonics and different correlation coefficients for the same lag values.
The fact that a Poisson distribution may provide a good description of a time record
of nuclear disintegrations which one expects to be the case in the absence of external
forces is no guarantee that the data will make a good fit to other statistical distributions predictable on the basis of the null hypothesis. Recall that the null hypothesis is
that the probability of a single nuclear decay in a short sampling interval is proportional to that interval and independent of outcomes in previous or subsequent time
intervals. Therefore, a test of these distributions constituted the next step in looking for
evidence of non-random behavior that violated physicals laws.
The time record {xt t 1. . .N} of disintegrations of radioactive 22Na, for which the
stationary mean is X N 0^t, was transformed to a record {yt t 1. . .N} of zero
mean and zero trend in preparation for calculation of the autocorrelation function
and power spectrum. Under the condition X  1, which pertained in these experiments, the Poisson distribution characterizing the time record {yt} is very closely
approximated by the corresponding Gaussian distribution N(0, X). From the null
hypothesis and Gaussian approximation there then follow all the statistical distributions summarized in Table 3.1.
The technical details of the derivations of these distributions, which make use of
generating functions as developed in Chapter 1 as well as relations concerning
products and quotients of random variables to be discussed in subsequent chapters,
will be left to an appendix. Of significance now is the fact that each of the distributions, which tests different facets of the time record and Fourier amplitudes of
the decaying nuclei, is determined exclusively by a single empirical parameter
the mean count per bin X of the original time record fixed at the outset of the
experiment. The means and variances of these distributions, which give perspective to
measurements that will be discussed shortly, are summarized in Table 3.2.
Figure 3.9 shows histograms of four of the statistical quantities in Table 3.1 [real
part of amplitude, spectral power, modulus, and phase (defined by the ratio of
imaginary to real parts of the amplitude)] with corresponding theoretical densities
superposed. It is to be emphasized that the histograms of the figure, which display
virtually no discernible deviations from theory at the scale of viewing, are not
computer simulations, but the actual experimentally derived frequencies. Bear in
mind in examining the figure that, apart from , there are no adjustable parameters.
The excellent agreement sensed by the eye is substantiated by analysis, as shown by
the results of chi-square tests summarized in Table 3.3.

Mother of all randomness I

154

Table 3.1

Distributions of nuclear decay statistics*

Statistic

Distribution

Symbol

Probability density

Counts

Poisson ~
Normal
Normal

X Poi() ~ N(, )

f P x; e x!

fa, bg N 0, 12



2
2
1
f N x; , 2 p
ex =2
2

Gamma



fjaj2 , jbj2 g Gam 12, 1

f G x; r, s sr xr1 esx

Exponential
Rayleigh
Cauchy

jaj2, jbj2 E()


(jaj2 jbj2)1/2 Ray()
b/a Cau(0, 1)

f E x; 1 ex=
2
f R x; 2 xex =
f C x; r, s  1xr 2 
s 1 s

Normal

Rk60 N(0, 2/N)


Rk0 N(0, 22/N)
rk N(0, 1/N)
p 
G FT R E N

Amplitude (real
or imaginary)
Squared
amplitude
Power
Modulus
Amplitude ratio
Autocorrelation
function and
coefficient
Power via WK
theorem

Exponential



2
2
1
ex =2
f N x; , 2 p
2 2
f E x; 1 ex=

* The order of parameters in the density functions is the same as in the symbols identifying the
types of random variables.

Table 3.2

Statistical moments pertinent to nuclear decay

Distribution

Parameters

Mean

Variance

MGF or CF

Poi()
N(, 2)
Gam(r, s)
E()
Ray()
Cau(r, s)

X  0
2
r 12; s 1

r 0; s 1

ee 1
2 2
et t

r
1  st
(1  t)1
2
e t
eirt  sjtj

r
s

Does not exist

1
2

1
2

r
2
s2
2

1  14
Does not exist

1
2

To this point, therefore, there is nothing in the statistics of the decay of 22Na that
would suggest a deviation from the prevailing theory (the null hypothesis). However,
it is possible that a periodic component of weak amplitude could remain undetected
within the histograms of Figure 3.9. Let us examine more closely, therefore, the
matter of recurrence, autocorrelation and periodicity.
3.11 Recurrence, autocorrelation, and periodicity
Recall that a histogram is a graphical representation of a multinomial distribution
M({nk}, {pk})  M(n, p) of outcomes k 1. . .K with frequencies {nk} and probabilities
{pk} governed by the (discrete) probability function

155

3.11 Recurrence, autocorrelation, and periodicity

Table 3.3

2 Test of distributions of Fourier amplitudes

Distribution

2obs

Real part
Imaginary part
Square of real part
Square of imaginary part
Power
Modulus
Ratio: imaginary/real

45.6
50.7
13.3
28.3
38.8
44.7
46.9

45
40
13
26
40
40
40

0.45
0.12
0.43
0.34
0.48
0.28
0.21

Amplitude (Real)

Power Spectrum
0.005

Relative Frequency

Relative Frequency

0.04

0.004

0.03

0.003

0.02

0.002

0.01

0.001

0.00
-40

-20

20

0.000

40

200

Modulus

800

1000

0.30

0.05

Relative Frequency

Relative Frequency

600

Amplitude Ratio (Im/Re)

0.06

0.04
0.03
0.02
0.01
0.00
0

400

0.25
0.20
0.15
0.10
0.05
0.00

10

20

30

40

50

-10

-5

10

Fig. 3.9 Empirical (bars) and theoretically predicted (solid) distributions of Fourier
amplitudes {j j ij} of the 22Na decay time series: (a) Gaussian distribution of real
part {j}; (b) exponential distribution of power spectral density f2j 2j g; (c) Rayleigh

1=2 
distribution of modulus 2j 2j
; (d) Cauchy distribution of amplitude ratio {j/j}.

f M n; p n!
subject to the constraint

K
X
k1

n
Y
p nk
k

k1

nk !

3:11:1

nk n. If the null hypothesis is valid, then the probabil-

ity pk of an event in the kth class is the Poisson probability

156

Mother of all randomness I


2

pk  f P xk ; e

xk
exk  =2
! p
xk ! >>1
2

3:11:2

for the decay of xk nuclei. The mean frequency and variance of the kth class are
respectively
nk npk

3:11:3

varnk npk 1  pk ,

3:11:4

and the covariance of two frequency classes is




cov nj , nk npj pk :

3:11:5

Since the frequency distribution in (3.11.1) depends on the single parameter in


(3.11.2), a periodicity in the chronological sequence of histograms, as has been
claimed in published articles, can occur only if the population mean is periodic in
time. Such a periodicity could be revealed in the power spectrum and autocorrelation
of the record of coincidence counts.
As summarized in Table 3.1, the autocorrelation coefficient rk (for k > 0, since
r0 1 by definition) should be a normally distributed random variable with
variance 2r 1=N if the original time series comprises N elements, each a Poisson
variate. The upper panel of Figure 3.10 shows the autocorrelation coefficients of
the mean-adjusted time record {yt t 1. . .N} of 22Na decays as a function of lag
time up to a maximum delay of 671 units, corresponding
pto about 42 hours. The
ordinate is in units of the standard deviation r 1= N , and visually the preponderance of points falls between
2r as one would expect for a normally
distributed random variable. Better than confidence limits on a single measurement, the entire distribution of correlation coefficients {rk k > 0} is shown in the
lower panel of the figure with the theoretically predicted density for N(0, N1)
superposed. Again, the match could hardly be better. A chi-square test of this
Gaussian fit led to P 0.81 for 219 13:5. In short, the two panels of Figure 3.10
are indicative of white noise; there is no evidence of any statistically significant
correlations.
There are various ways, each instructive, of examining the Fourier spectrum of
the time record {yt}. The top panel in Figure 3.11 shows the power spectral density
(to be referred to simply as the power) as a function of frequency as specified by
the harmonic index. A characteristic of this plot is the foamy appearance of the
upper surface, illustrative of the strong fluctuations relative to the mean. This has
significant consequences, which I shall discuss shortly. The plot would look largely
the same, although with weaker fluctuations, if the modulus, rather than power,
were plotted. The middle panel of Figure 3.11 shows the logarithm (to base 10
although the choice of base is unimportant) of the power plotted against the
logarithm of the harmonic index. This, too, reveals a characteristic pattern: a

3.11 Recurrence, autocorrelation, and periodicity

157

r/r

-2
0

200

400

600

Lag

Frequency

200

100

0
-0.005

0.005

Autocorrelation
Fig. 3.10 Top panel: autocorrelation rk/r as a function of lag k (671  k  0) of the 22Na decay
time series comprising N 218 bins with lag interval
 k 512
 bins ~ 224.77 s. Lower panel:
distribution of rk fit by a Gaussian density (solid) N 0, 2X =N .

triangular wedge of points whose upper surface is more or less flat with zero
slope. I will explain the significance of this plot momentarily. The bottom panel of
Figure 3.11 plots the imaginary part of the Fourier amplitude against the real part.
The most striking feature of this plot is the isotropic distribution of points with
nearly uniform density except for the foamy periphery again indicative of significant fluctuations. The three plots were constructed from the FFT amplitudes of a
trend-adjusted time series of N 105 bins of gamma coincidence counts from
decaying 22Na nuclei, with bin interval of 4.39 s. In keeping with the Shannon
sampling theorem, the FFT amplitudes and derived power spectrum comprise
N/2 5  104 harmonics.
Look carefully at the upper panel of Figure 3.11, in particular at the flecks of
foam, which represent numerous statistical outliers beyond the mean. How is one to
know whether any of these points actually represents a periodic component to the
nuclear decay or whether all are just noise? If the null hypothesis is valid, the
ordinates {Sj j 1. . .N/2} of the power spectrum of {yt} should be distributed
exponentially (see Table 3.1) with a standard deviation equal to the mean:
S S 2X X . In other words, as pointed out earlier, the fluctuations are of
comparable size to the signal.

Mother of all randomness I

158

Power (104)

1.5
1

0.5
0
0

50

100

150

200

250

300

Harmonic ( 100)

Log Power

6
4
2
0
2
4
0

0.5

1.5

2.5

3.5

4.5

Log Harmonic

Im(Amplitude)

100

50

50

100
100

50

50

100

Re(Amplitude)
Fig. 3.11 Three perspectives in the display of the discrete Fourier transform (FFT) of the 22Na
decay time series. Top panel: power spectral density against harmonic number; middle panel:
double-log plot of power against harmonic number; bottom panel: imaginary part against real
part of the complex Fourier amplitude. Plots comprise J 215 harmonics obtained from the
first 216 bins of a time series of length 105 bins with bin interval t 4.39 s. The frequency
corresponding to harmonic j is j j/(Jt).

3.11 Recurrence, autocorrelation, and periodicity

159

The statistical significance of the largest ordinate constitutes a problem in order


statistics, as discussed in Chapter 1. A specific application, which I shall refer to as
the WalkerFisher (WF) harmonic test,25 calculates the probability

N=2
PrS  Smax 1  1  eSmax =S
3:11:6
that at least one element of the set {Sj} exceeds the largest observed value Smax. In
implementing the WF test, one must ignore the harmonic j 1 because it corresponds in every discrete Fourier spectrum to the length of the recorded time series. In
the test performed on the power spectrum of the full time series of 22Na decay data,
the largest ordinate was Smax 2894 for a mean count per bin X 193.8 and highest
harmonic jmax N/2 687 847, corresponding to a maximum period of about
83.5 hours. The test statistic (3.11.6) yielded the probability Pr(S  Smax) 0.201,
which is consistent with pure chance.
In case it may have escaped the readers attention, there is a remarkable aspect to
the preceding numbers that highlights how dangerous it can be to fall into a common
trap of thinking that nearly everything in nature is distributed normally. Note that
the mean X of the recorded time series is also the mean S and standard deviation S
of the power spectrum (see Table 3.1), and therefore the value of the largest ordinate,
expressed in standard normal form, corresponds to (Smax  S)/S 13.9 or about
14 standard deviations from the mean! And yet this value of Smax yields a chi-square
P-value i.e. cumulative probability of S  Smax of about 20%. The explanation for
this counter-intuitively high probability is that the ordinates of the power spectrum
are distributed exponentially, not normally. The corresponding probability for a
Gaussian random variable would be
3N=2
2
Smax 
S =S
2
7
6 1
ez =2 dz5
5:04  1015 :
PrS  Smax 1  4p
2


Another critical point to bear in mind is the distinction between


(a) the probability that a random variable (call it X) exceeds a certain value Xmax
i.e. 1  F(Xmax) where F(x) is the cumulative distribution function, and
(b) the probability that the highest order statistic Yn max(X1. . .Xn) in a sample of
size n exceeds the value Xmax, which is 1  FY n Xmax where FY n x Fx n .
The appropriate test of significance of power spectrum ordinates is based on the
solution to problem (b), not (a), as pointed out first by G. T. Walker26 in a study of

25
26

M. G. Kendall, A. Stuart, and J. K. Ord, The Advanced Theory of Statistics, Design and Analysis and Time-Series, Vol. 3
(Macmillan, New York, 1983) 589590.
G. T. Walker, Correlation in seasonal variation in weather. III: On the criterion for the reality of relationships or
periodicities, Indian Meteorological Department (Simla) Memoirs. 21 (1914), 22.

160

Mother of all randomness I

seasonal variation in weather, and later developed further by R. A. Fisher.27 The


probability obtained from problem (a) in the present case is small for both the
exponential and normal distributions, although much smaller for the latter:
1  FExp Xmax eY max = 3:3  107
Y max=
1
2
1  FGaus Xmax 1  p
ex =2 dx 1:0  1020 :
2


Now consider again the pattern shown in the middle panel of Figure 3.11. The
rationale for constructing the plots in the top and bottom panels is probably clear,
but the reason for the double-log plot may perhaps be less evident. This kind of plot,
however, reveals very useful information about the underlying stochastic process. It
is often the case that the power spectral density of a random process can be
represented by a power law,
 
2  jj  1 ,
S / jj
3:11:7
0
otherwise
within some limited range of frequencies. It then follows that the slope of the double
log plot is a constant: dlnS
dln . The exponent defines the type of stochastic
process and provides a quantitative measure by which to gauge the degree of
predictability of future outcomes. It may seem at first like an oxymoron that the
outcomes of a random process are predictable, but there are, in fact, different degrees
of randomness. We will look into this question later when we consider randomness in
the stock market. For the present, suffice it to say that white noise, defined by 0
is the most random process in the universe. It contains no information at all useful
for prediction.
In summary, neither the power spectrum nor the autocorrelation spectrum gave
evidence of a statistically significant component of period T 83.5 hours in the time
series of coincident counts arising from the decay of 22Na. For all practical purposes,
the analyses so far have shown the nuclear decay of radioactive sodium to be
equivalent to white noise.

3.12 Limits of detection


When one searches for something and does not find it, that does not necessarily mean
it is not there. It is possible, however, to place an approximate limit on the sensitivity
of an experiment to reveal a periodic component of period T0 in the decay of a

27

R. A. Fisher, Tests of significance in harmonic analysis, Proceedings of the Royal Society of London A 125 (1929)
5459.

161

3.12 Limits of detection

radioactive nuclide by simulating the decay time series with a Poisson RNG of timevarying mean
X t X0 1 cos 2t=T 0

3:12:1

and decreasing the amplitude of the harmonic until the presence of the harmonic is
no longer discernible in either the autocorrelation or power spectrum. The information obtainable from such a simulation depends on whether T0 is less than or
greater than the duration of the time series T. Let us examine each in turn.
Figure 3.12 shows the progressive change in the power spectrum (right panels)
and autocorrelation (left panels) as the harmonic amplitude takes on the
8

(a)

3
2

(b)

1
4

-1
-2
-3
0

0
200

400

0.0

Power

0.6

0.8

1.0

(d)

4
2

-1
-2

0
200

400

600

0.0

0.2

0.4

0.6

(e)

3
2

0.8

1.0

0.8

1.0

(f )

6
Power

1
0

4
2

-1
-2
-3
0

0.4

-3
0

0.2

(c)

Autocorrelation

600

0
200

400

Lag

600

0.0

0.2

0.4

0.6

Frequency

Fig. 3.12 Autocorrelation rk/r and power spectrum Sj for Poisson RNG simulated time series with
periodic mean of harmonic amplitude 0.0% (panels a, b); 0.3% (panels c, d); 0.5% (panels e, f ).

Mother of all randomness I

162
1

(b)

(c)

Autocorrelation

0.5

(d)

0.5

(a)
1
0

50

100

150

200

250

300

350

400

450

500

Lag
Fig. 3.13 Autocorrelation function of (a) xt cos(2t/(N/10)), (b) xt cos(2t/25N), (c) xt t
(exact calculation of rk), (d) xt t (linear approximation to rk) for N 29 and t 0. . .N  1.

sequential values 0 (Figures 3.12a,b), 0.003 (Figures 3.12c,d), and 0.005 (Figures 3.12e,f)
for a period T0 less than the duration T of the time series. The top two panels are
indicative of white noise. In the bottom two panels, the periodic waveform in the
autocorrelation and the delta function-like spike in the power spectrum are so
strong as to be practically blinding, even for so weak a relative amplitude of
0.5%. The middle two panels display results at an approximate threshold value
0.3%, at which the power ordinate Smax just passes the WF test, signifying no
departure from statistical control, and the harmonic variation in rk merges with the
noise. Thus, if a harmonic component with amplitude > 0.3% were present in the
time series {yt}, it would have been revealed by statistical analysis even though
visual inspection of the sequence of 167 histograms would show no statistically
significant recurrences.
A time series of duration T does not permit one to measure a period T0 > T.
Nevertheless, the data may contain sufficient information to reveal the possible
presence of a harmonic component even if its period could not be measured. One
can see why immediately from the example illustrated in Figure 3.13. Plot 3.13a
shows the autocorrelation of the periodic time series xt cos(2t/T1) (t 0. . .N  1)
with short period T1 N/10 for N 29 512 time units. Contrast this plot with that
of plot 3.13b, which shows the autocorrelation of the same function with long period
T2 25N. The former oscillates with decreasing amplitude many times over the range
of lag values, whereas the latter diminishes virtually linearly over the same range.
Now consider the discrete linear time series xt at b over the same range, where a
and b are constants. It is not difficult (although a little tedious) to show that the

3.13 Patterns of randomness: runs

163

autocorrelation coefficients rk (k 0. . .m) take the asymptotic form


3
k 2N1 3 k3 in the limit N  (a,b,m). In other words, rk decreases for the
r k ! 1  2N
most part linearly with lag k, although evidence of curvature shows up for sufficiently
high k, as displayed in plot 3.13c for choice of parameters a 1, b 0. Plot 3.13d
records the approximate autocorrelation of the same linear function in the absence of
the cubic term. A partial-period harmonic component in a signal, if present, should
therefore manifest itself in much the same way as a linear trend, which is one of
the primary reasons for adjusting the original time record at the outset to remove the
known negative trend due to the natural lifetime of the source. Indeed, with fine
tuning of the parameters a and b, one could make the autocorrelation coefficients of
the linear function and the partial-period harmonic function nearly indistinguishable.
The effect of a linear trend and therefore of a partial-period harmonic component on the power spectrum {Gj} is to produce low-frequency oscillations, as shown
in the top panel of Figure 3.14 for the original unadjusted time series {xj} of gamma
coincidence counts. The maximum lag in the plot is m 2048 time units of t 36.44 s.
The middle panel shows that transformation to the detrended series {yj} has removed
the oscillations. From computer simulations of low-frequency oscillations in the
power spectrum for partial-period components of various amplitudes and periods,
like the plot in the bottom panel of Figure 3.13, together with the fact that no linear
trend other than that attributable to natural lifetime was manifested by the time
record {xj}, it was possible to conclude that the experiment would have revealed a
trend resulting from an external interaction of period up to approximately 5T, i.e.
~835 hours or about 35 days.

3.13 Patterns of randomness: runs


It is a common misconception sometimes even among scientists that randomness
equates to formlessness, i.e. to the complete absence of regularity or recognizable
patterns. Of course, the opposite error is also common, namely that it is all too easy
to believe one has detected deterministic patterns in some set of data where none
really was present28. . .as I believe was the case with the published claims of recurrent
histograms that motivated the present investigation. Nevertheless, the fact that a time
series generated by a random process must display certain recurrent patterns
without which the process would be non-random can be exploited in various ways
by means of runs tests. Such tests have been employed for purposes of quality
control in manufacturing, cryptographic analysis relating to national security, protection of financial records, and government regulation of games of chance. I have
used these tests myself (as discussed in a previous book29) to examine a variety of
28
29

There is actually a name apophenia given to this state of mind. See Apophenia, http://en.wikipedia.org/wiki/
Apophenia.
M. P. Silverman, A Universe of Atoms, An Atom in the Universe (Springer, New York, 2002) 285294.

Mother of all randomness I

164
20

Original xt

15
10
5
0
-5
-10
0.000

0.005

0.010

0.015

0.020

20

Power

15

Transformed yt

10
5
0
-5

-10
0.000

0.005

0.010

0.015

0.020

20
15

Simulation

10
5
0
-5
-10
0.000

0.005

0.010

0.015

0.020

Relative Frequency
Fig. 3.14 Power spectrum of autocorrelation of experimental gamma coincidence time series
{xt} unadjusted for negative trend due to natural lifetime (top panel); experimental series {yt}
adjusted for zero-trend (middle panel); time series simulated by a Poisson RNG for 1/4-cycle
variation in mean with harmonic amplitude 2.0% (bottom panel). The lag interval k
83 bins ~ 36.44 s. A relative frequency 1.0 corresponds to (k)1.

nuclear decay processes, such as the alpha decay of 214Po, beta decay of 137Cs, and
electron-capture decay of 54Mn, for evidence of non-random behavior.30,31 No
statistically significant evidence was found.
A run is an unbroken sequence of outcomes of the same kind usually of binary
alternatives like (1, 0), or (head H, tail T) whose length is the number of elements
defining the run. There are different ways to count runs, depending on what one

30
31

M. P. Silverman, W. Strange, C. R. Silverman, and T. C. Lipscombe, Tests of alpha-, beta-, and electron capture
decays for randomness, Physics Letters A 262 (1999), 265273.
M. P. Silverman, W. Strange, C. R. Silverman, and T. C. Lipscombe, Tests of spontaneous quantum decay, Physical
Review A 61 (2000) 042106 110.

3.13 Patterns of randomness: runs

165

regards as the unbroken sequence. For example, one might count the run of 1s in the
sequence 01110 in one or more of the following ways:
(a)
(b)
(c)
(d)

1 run of length 3,
3 runs of length 1,
2 runs of length 2 (where the middle 1 contributes to both runs), or
1 run of length 1 and 1 run of length 2.

I will refer to runs as exclusive if a given sequence of identical outcomes leads to a


unique run length, which is the maximum length possible, as represented in the
foregoing example by count (a). An exclusive run of 1s must start immediately after
a 0 and terminate with the occurrence of the next 0. Thus, if one were monitoring in
real time the outcomes of a random process which had so far generated the outcomes
(. . .0111), all one could tell at that point was that the length of the last run of 1s must
be at least 3. If the next outcome was 0, or if the process was terminated so that there
was no subsequent outcome, then the length definitely would be 3. If the process
continued and the next outcome was 1, then the length of the run would again be
ambiguous (i.e. at least 4). I will refer to runs as inclusive if a given sequence of
identical outcomes can contain runs of different lengths. However, if the runs are
non-overlapping, then an outcome cannot be counted twice. Thus the sequence
01110 contains numbers of non-overlapping inclusive runs as determined by (a), (b),
(d), but not (c).
In this section I shall be concerned with exclusive runs which, by definition, are
non-overlapping and independent. If outcome 1 is designated a success and outcome 0 a failure, the runs content of the sequence 0011100001010111 would be
tabulated as follows.
Length

Success [1]

Failure [0]

1
2
3
4

2
0
2
0

2
1
0
1

A stochastic process ideally suited for runs analysis is a coin toss, which, if performed
with an unbiased coin, is a realization of a Bernoulli process. Each trial is independent of the others, and the probability p of a success say head H remains constant
for all trials. Most people think they know what a random sequence of coin tosses
should look like, but they are probably mistaken. As an exercise32 to see whether this

32

M. P. Silverman, W. Strange, C. R. Silverman, and T. C. Lipscombe, On the Run: Unexpected Outcomes of Random
Events, The Physics Teacher 37 (1999) 218225.

166

Mother of all randomness I

was the case with college students, I divided the students in various physics classes of
mine over the years into two groups and assigned one group the task of tossing a coin
256 times and writing down in sequence the outcome of each toss; the other group
was told to write down what they imagined a typical sequence of 256 tosses to be, but
not actually to do the tossing. The students would then turn in their papers without
indicating to which group they belonged and I told them that I could predict with a
success of about 90% or higher which sets of data were obtained experimentally and
which imaginatively.
The key to this apparent feat of clairvoyance is the inherent disbelief of those
unschooled in the properties of randomness that long runs of heads or tails can occur
in a truly random sequence of coin tosses. This disbelief underlies many a false
gambling strategy whereby the gambler, having lost n times in a row in some game
equivalent to a Bernoulli process feels certain that his luck will turn since the event of
yet another loss must be highly improbable. But that is not so if the events are
independent. The probability of a loss on the (n 1)th trial is the same as it was
on each of the previous n trials. Analogous reasoning applies to runs in a Bernoulli
sequence. The probability of a sequence of nH is the same as the probability of any
specified ordering of n binary alternatives that is, (1/2)n provided the coin is
unbiased ( p 1/2). Thus, the trick to distinguishing experiment from imagination is to look for long runs.
A person unaware that a random sequence of coin tosses must include long runs
will almost invariably write a sequence of 256 outcomes like the following
HTHHTTHHHTTHTHH, etc. with a lot of reversals and therefore too many
short runs. A simple binomial argument, however, leads in the case of an unbiased
coin to the approximate expectation
rkH rkT

n
2k2

3:13:1

for the mean number of runs of heads or of tails of exactly length k out of n trials, and
RkH RkT

n
k1

3:13:2

for the corresponding mean number of runs of length k or longer out of n trials, in the
limit of large n  k. To arrive at (3.13.1), one multiplies the number of trials n by the
probability 12k2 of getting a T to start the run, k Hs in a row, and a final T to end
the run. The formula (3.13.2) then follows by summing rkH over all values of run
length  k


n
n
X
X
1
n
1
n
rjH n
k1 1  nk1 ! k1 :
3:13:3
j
nk
2
2
2
jk
jk 2
Thus, I would expect to find at least two runs of 6H or four runs of 5H in a sequence
of 256 coin tosses, with the same statistics for tails. Equations (3.13.1) and (3.13.2)

167

3.13 Patterns of randomness: runs

are approximate in that runs occurring at the start or closure of a sequence have been
ignored. For a long sequence, the contribution of these end runs becomes
negligible.
The exact theory of Bernoulli runs is a more complicated exercise in combinatorial
reasoning than can be taken up here, and so derivations of the theory outlined below
will be left to the literature.33,34,35,36 Given a random sequence of na events of type a
[successes] and nb events of type b [failures], with a total sample size n na nb, the
mean number of success runs of length precisely k (where k  1) is
rak

na !nb nb 1n  k  1!
;
na  k!n!

3:13:4

the mean number of success runs of length k or longer is


Rak

na !nb 1n  k!
;
na  k!n!

3:13:5

and the mean number of total runs of both kinds is


R Ra1 Rb1

n 2na nb
:
n

3:13:6

The theory of runs usually proceeds by determining (3.13.5) first and then (3.13.4)
through the relation
r ck Rck  Rck1

c a, b:

3:13:7

Using the exact relation (3.13.5), I would expect a mean of 1.90 success runs of length
6 or longer and 3.9 success runs of length 5 or longer in a sequence of 256 trials
which is close to the previous expectations obtained from the approximate relations
(3.13.2).
For long sequences n  k and approximate equality na  nb, the variances of the
preceding expectation values are closely approximated by the following expressions
2 r ak  3r ak

2 Rak  Rak

2 Ra 

n1
:
4

3:13:8

Exact expressions for the variances are quite complicated and will not be needed nor
given here.


The exact probability Pn, k  Pr Rak  1jn of occurrence of at least one success
run of length k or longer in a Bernoulli sequence of length n, where p is the

33
34
35
36

A. Wald and J. Wolfowitz, On a test whether two samples are from the same population, The Annals of Mathematical
Statistics 11 (1940) 147162.
A. M. Mood, The distribution theory of runs, The Annals of Mathematical Statistics 11 (1940) 367392.
H. Levene and J. Wolfowitz, Covariance matrix of runs up and down, The Annals of Mathematical Statistics 15 (1944)
5869.
A. Hald, Statistical Theory with Engineering Applications (Wiley, New York, 1959) 338373.

Mother of all randomness I

168

probability of a success (maximum likelihood estimate ^p na =n), can be calculated


from the relatively simple-looking formula37
Pn, k 1  n, k p nk, k
k

n, k 

n
k1

X


1j

j0


n  jk
1  pj pjk
j

3:13:9

h i
n
where the upper limit k1
in the sum signifies the greatest integer less than or equal
to n/(k 1). This formula becomes impractically difficult for large n, however,
because of the factorial products that make up the binomial coefficients. An alternative approach is to employ directly the generating function from which (3.13.9) was
derived
Gk s

X
1  pk sk

1  Pn, k sn
1  s 1  ppk sk1 n0

3:13:10

whose expansion in a Taylor series gives coefficients (1  Pn,k). Powerful symbolic


mathematical software such as Maple or Mathematica allows one to perform the
expansion numerically in seconds for values of n in the thousands. Using Maple, for
example, and setting the probability of success p 1/2, I was able to expand (3.13.10)
to 256 terms in under one second, from which I could determine the probabilities
P256,6 87.5% and P256,5 98.7%, which justified my claim to my students.
Series expansion of a generating function also eventually becomes impractical
for sufficiently large n, whereupon one must resort to suitable methods of approximation. One approach entails a partial fraction expansion of a generator that takes
the form
Gs Us=Vs

3:13:11

where V(s) is an mth order polynomial such that the numerator and denominator do
not have a common root. Recall that a partial fraction expansion converts a product
of factors into a sum of terms
W s

an sn an1 sn1 a0
w1
w2
wm


, 3:13:12
s  s1 s  s2 s  sm s  s1 s  s2
s  sm

where the constant coefficients {wj j 1. . .m} are obtained by cross-multiplying so as


to bring the sum on the right side into the form of the original product, and then
equating terms of like power of s in the numerators of the two expressions. In this
way, a partial fraction expansion of the generator (3.13.11) leads to the expression
   
m

X
X
U sj =V 0 sj
Gs

n sn ,
3:13:13
s

s
j
j1
n1

37

J. V. Uspensky, Introduction to Mathematical Probability (McGraw-Hill, New York, 1965) 7779.

3.13 Patterns of randomness: runs

169

where V 0 (s)dV/ds and {sj j 1. . .m} are the roots of V(s). A series expansion in s of
(3.13.13), such as illustrated in (3.13.10) for a particular generator, is then readily
made, provided one has been able to find the solutions sj of the equation V(s)0.
Actually, for large expansion order n i.e. under the very circumstances for which
exact methods may become impractical it is not necessary to find all the roots, but
only the least positive one sL, in which case the sum over j in (3.13.13) can be
approximated by a single term
Gs 

U sL =V 0 sL
s  sL

3:13:14

whose nth order expansion coefficient is easily shown to be


n

UsL =V 0 sL
:
sn1
L

3:13:15

Consider again the coin-toss task I posed my students. For p 1/2 and run length
k 6 the generator (3.13.10) reduces to
Gs

1 6
1  64
s
U s
,

1 7
V s 1  s 128
s

3:13:16

and one finds (by using the RootOf function of Maple) that
sL 1:008 276 516 723 31
1 7
s 0. The expansion coefficient
is the least positive root to the equation 1  s 128
(3.13.15) then takes the form

n 1  Pn, 6

1 6
1  64
sL

,
7 6
1  128
sL

sn1
L

3:13:17

which leads to probability P256,6 87.45% as before. Equation (3.13.17) allows one
to determine nearly effortlessly the variation in Pn,6 with increasing number of
trials n. Thus, for n 1024 tosses, the probability of finding at least one run
of successes of length 6 or longer is P1024,6 99.98%.
For very large n and run lengths k  ~ 8, the distribution of Rak can be approximated by a Poisson distribution, since, by (3.13.8), the variance of Rak is approximately equal to the mean. Thus, if
p n, k m 

em mm
m!

3:13:18

is the probability of at least m occurrences (with mean number m Rak ) of success


runs of length k out of n trials, then the probability of at least one such occurrence is
Pn, k PrRak  1jn 1  pn, k 0
k1
 1  eRak 1  en=2

3:13:19

170

Mother of all randomness I

in the Poissonian approximation. Equation (3.13.19) yields probabilities P256,6 


85.0% and P256,5  97.9%, which are close to the more exact figures previously given
(even though k is less than 8).
Return now to the matter of nuclear decay. The sequential count of nuclear decays
is not a string of binary alternatives. Nevertheless, one can transform the digital time
record into a binary record suitable for runs analysis in various ways such as the
following:
 Runs with respect to a target value The count xi in each bin is compared with a
target value X (e.g. the median count) and assigned 1 if xi X and 0 if xi > X.
Runs
with respect to parity The count is assigned the symbol e if even and o

if odd.
 Runs up and down The sequential difference xi1  xi is assigned the symbol plus
() if it is positive (a run up) and minus () if it is negative (a run down). We will
consider shortly the null case xi1  xi 0. A sample of n digital counts results in
n  1 differences.
Note that the runs created by each of the ways listed above comprise different
elements of a given time series and therefore provide complementary ways to test
the series for non-randomness.
In contrast to the first two transformations listed, the third producing runs up
and down leads to elements that are neither independent of one another nor of
constant probability. If the parent time record is random, then the more an element
departs from the median, the less likely it will be that the succeeding element will
depart from the median even further. The sequence of differences, therefore, does not
constitute a Bernoulli sequence and is not described by the statistical relations
(3.13.4) through (3.13.10).
Under the assumption that all n! arrangements of a set of n different numbers are
equally probable, the mean number rn, k of runs up and down of precisely length k and
the mean number Rn, k of runs up and down of length k or longer are given by the
expressions38


2 nk2 3k 1  k3 3k2  k  4
r n, k
k n  2
3:13:20
k 3!


2 nk 1  k2 k  1
Rn, k
k n  1
3:13:21
k 2!
R  Rn, 1

38

2n  1
3

3:13:22

H. Levene and J. Wolfowitz, The Covariance Matrix of Runs Up and Down, The Annals of Mathematical Statistics 15
(1944) 5869.

3.13 Patterns of randomness: runs

171

with variances
"
r n, k  2n
2



2 2k5 15k4 41k3 55k2 48k 26
k 3! 2

22k2 9k 12
2k 32k 5k 3!k 1!

#
44k3 18k2 23k 7
2
k2 3k 1


3:13:23
2k 5!
k 3!
2k 1k!2
"
2k 12k2 4k 1
2
2
2


Rn, k  2n
k 2! 2
2k 1k!2 2k 3k 2!k!
#
4k 1
k1


3:13:24
2k 3! k 2!
2 R

16n  29
:
90

3:13:25

Equation (3.13.25) is exact, whereas Eqs. (3.13.23) and (3.13.24) retain only terms
proportional to n in what otherwise are very long expressions. Under the condition
n  k pertinent to the experiments discussed in this chapter, the omitted terms are
insignificantly small. The exact distribution of the cumulative up/down runs is not
easily derived or expressed. To good approximation, however, the total number of
runs R is normally distributed for a sample length n > 20.39
Although the derivations of the preceding relations must be left to the references,
it is again possible, as in the case of partition runs (e.g. runs with respect to the
median) to deduce the n-dependent terms in (3.13.20) and (3.13.21) which are by far
the major contributions by a simple combinatorial argument. Under the same
condition as before (n  k), the mean number of difference runs (both up and down)
of length k or greater in a sequence of n random numbers takes the form
Rn, k 2nPrD  kjn, in which Pr(D  kjn) is the probability of k positive differences D in a sequence of k 1 ascending numbers in a set of k 2 numbers. The set
cannot begin with the lowest number since there would then be k 2 ascending
numbers and k 1 positive differences. Thus, of the (k 2)! ways to order k 2
numbers, only k 1 of the orderings lead to runs up of length k. Hence, Pr(D  kjn) 
(k 1)/(k 2)!, from which follows Rn, k 2nk 1=k 2!, which approximates
(3.13.21). Application
 of the identity (3.13.7) then yields straightforwardly
r n, k 2n k2 3k 1 =k 3!, which approximates (3.13.20).

39

A. Hald, Statistical Theory with Engineering Applications (Wiley, New York, 1952) 354.

Mother of all randomness I

172

Table 3.4

Test of Poisson approximation to up/down runs for n 8192

Run length k

Exact mean Rn, k

Exact var(Rn,k)

2
3
4
5
6
7
8
9
10

2047.6
546.0
113.7
19.5
2.8
0.36
0.041
0.0041
0.000 38

160.9
271.9
83.9
16.2
2.5
0.32
0.036
0.0037
0.000 34

Although perhaps not immediately apparent, the variance (3.13.24) can be shown
to approach in value the cumulative mean (3.13.21) for long runs (k  5) of long
sequences (n  1). To see this, factor (k 1)/(k 2)! from the terms in the square
bracket; only the last term, equal to 1, survives in the limit of large k. Thus, we can
again estimate the probability for cumulative up and down runs by a Poisson
approximation


PrRn, k  1  1  eRn, k
k  e5
3:13:26
as was done previously [(3.13.19)] for target runs. An illustration of this equivalence
is shown in Table 3.4 for a sequence of length n 8192, corresponding to the number
of bins in 1 bag of data. The unshaded portions of the table include values of run
length for which the Poisson approximation is good.
Statistical analyses based on runs were intended originally for examining the
quality control of processes for example, the manufacture of mechanical parts
leading to outcomes describable by continuous, rather than discrete, random
variables. If the observed fluctuations of a specified measurement were predictable by probability theory (i.e. without an assignable cause), then the process
was said to be under statistical control. Mathematical treatments of difference
runs were generally based on the assumption that occurrence of identical adjacent elements in a time series was sufficiently improbable to be disregarded, but
this is not necessarily the case for a series of integers. The mean number of
occurrences of adjacent identical integers 0,1,2. . .n  1 uniformly distributed
over a range n in a sample of length N is N/n, a result that may well be
statistically significant.
In the case of a Poisson process of mean , the probability Pr(kkj) of two identical
adjacent outcomes is

173

3.13 Patterns of randomness: runs

Table 3.5

Runs up and down of


Thy

Length k

Rn, k

1
2
3
4
5
6
7
8

917 129
343 923
91 713
19 107
3275
478
60.7
6.82

22

Na coincidence counts
Exp

theory

Rn, k

experiment

916 214
344 158
92 060
19 263
3383
526
81
8

Prkkj

pkj2

k0
2

I 0 2,


X

e

k0

k
k!



Rn, k

R n, k R n, k

495
330
255
132
56.7
21.8
7.78
2.61

1.85
0.712
1.36
1.18
1.90
2.21
2.61
0.45

Exp

Thy

2
3:13:27

where I0(x) is a modified Bessel function of the first kind. (We encountered this
function in the discussion of the Skellam distribution in Chapter 1.) Although the
sum extends over all non-negative integers, counts tend to cluster about the mean ,
p
thereby creating an effective range k  4 4 . Consider 100 for example,
for which the effective range would comprise ~ 40 integers between 80 and 120:
exact:

X
k0

pkj2 0:028 23 approx:

120
X

pkj2 0:028 12:

k80

In a bag of 8192 bins, one would expect approximately 8192  0.028 229 occurrences of adjacent bins with identical counts.
To handle the case of identical adjacent outcomes, one can consider all runs
distributions that result from replacing each null (0) with each binary value (,).
If all of these distributions are incompatible with statistical control, the hypothesis of
randomness is to be rejected. Alternatively, one can assign to each null a randomly
chosen binary value for example, by using an alternative random process or reliable
pseudo-RNG and analyze the runs of the resulting sequence.
The time series of gamma coincidence counts arising from the decay of 22Na was
analyzed for runs up and down, with results tabulated in Table 3.5 and shown
graphically in Figure 3.15 for a sequence of length n 1.376  106 with mean count
per bin X ~ 194.
For a time record of so many bins, up/down runs of length up to k 8 were
observed. The square plotting symbols in the top panel of the figure mark the (log of
the) observed number of runs of each length. The solid line (to help guide the eye)
is the theoretical prediction of runs for a random distribution of integers (null

Mother of all randomness I

174

Log Cumulative U/D Runs

6
5
4
3
2
1

Run Length
1000

R(Exp) - R(Thy)

500

500

1000
1

Run Length
Fig. 3.15 Upper panel: observed values (squares) and theoretical values (solid) of the log of
cumulative up/down runs Rn,k of length k and record length n 1 375 694 bins. Error bars mark

2 standard deviations. Lower panel: difference (circles) of observed and predicted runs
Rn, kExp  Rn, kThy . Dashed lines mark intervals of
2 standard deviations at each run length.

hypothesis). Error bars mark intervals of


2 standard deviations. A closer match
could hardly be imagined. In the second panel of the figure circular plotting symbols
mark the difference between the observed and predicted number of cumulative runs
of each length. All eight points fall within the interval of
2 standard deviations
delineated by the dashed lines. Since the distribution of Rn, k is approximately
Poissonian for n  k  5, and a Poisson distribution is approximately Gaussian
for  1, the probability of an experimental outcome lying outside
2 standard
deviations about the mean is ~ 4.6%.

175

3.14 Patterns of randomness: intervals

3.14 Patterns of randomness: intervals


Although the number of radioactive nuclei decaying at any moment within some
window of observation t is random, there is a certain regularity not a periodicity,
but a predictability to the recurrence of a given decay count. The time between
occurrences is referred to as the waiting time, and we encountered this concept
before (in Chapter 1) in the discussion of the exponential and geometric distributions.
It is now pertinent to look into the matter further.
Consider a discrete time series of integers representing the number of decaying
nuclei within time intervals (bins) of width t where is the mean count per bin. If
nuclear decay is a Poisson process, then
pX  pxj e

x
x!

3:14:1

is the probability of x decays occurring within a bin irrespective of where in the time
record the bin is located, provided the process is stationary (i.e. is constant). We will
assume the record is stationary or has been so adjusted, as described previously.
To facilitate examination, let us denote by EX the event defined by the occurrence of
x decays. If we designate by 0 the bin in which EX last occurred, then the waiting time in
units of t is the number (call it n) of the bin in which EX next occurs. This bin would
then be designated 0, and the process repeated to determine the subsequent waiting
time. In this manner by working ones way through a long chronological record of
nuclear decays and tallying the times of first occurrence of event EX one can obtain a
sample estimate of the mean waiting time hTXi and associated variance 2T X .
The probability that the waiting time TX n follows a geometric distribution
PrEX PrT X n qn1
X pX

qX  1  pX ,

3:14:2

where EX has failed to occur with probability qX in the first n  1 bins and then occurs
with probability pX in the nth bin. It is then easy to calculate from (3.14.2) the
moment-generating function (mgf )
gT X t heT X t i

nt
pX qn1
X e

n1

pX X
pX e t
n
qX et
,
qX n1
1  q X et

3:14:3

which provides an expedient means of deducing statistical moments (by differentiation with respect to the expansion variable t)

2T X

1
pX

hT 2X i g00T X 0

00 q
hT 2X i  hT X i2 lngT X 0 X2 :
pX

hT X i g0TX 0

1 qX
p2X

3:14:4

The concept of waiting time can be generalized so as to apply to the rth ocurrence
r

(rather than the first occurrence) of a count x in the nth bin. Designate this event EX .

Mother of all randomness I

176
r

Then EX is achieved when a success (a count x) occurs in bin n and a failure (count
other than x) occurs in (n  1)  (r  1) n  r of the remaining n  1 bins. The
r

number of ways in which EX can take place is then given by the binomial coefficient




n1
r
, whereupon it follows that the probability Pr EX in words, the probnr
r

ability that the waiting time T X between rth occurrences of success takes the value
nt is


 


n  1 r nr
r
r
pX q X :
3:14:5
Pr EX Pr T X n
nr
It is more convenient to work with the formula in (3.14.5) if the bin variable n is
replaced by the number of failures k n  r (where k 0,1,2. . .). Then (3.14.5) takes
the form




 
rk1 r k
r r
r
Pr T X r k
3:14:6
pX qX
pX qX k ,
k
k
where the second equality expresses what is called a negative binomial distribution.40
The equivalence between the two binomial expressions in (3.14.6) is established by


r
according to the rule defining a binomial
explicitly writing the factors in
k
coefficient
!
r
r r  1 r  k  1
r r 1 r k  1

1k

1
2
k
k!
k

r  1!
r r 1 r k  1
r k  1!
1k 
1k

r  1!
k!
k!r  1!
!
rk1
1k
k
3:14:7
and then multiplying numerator and denominator by (r  1)! (as shown in the second


rk1
.
line) in order to create the factorials that define
k
Given the negative binomial form of the probability function in (3.14.6), it is
straightforward to demonstrate the completeness relation




X
X
pr
r r
r
k
r
qX k prX 1  qX r Xr 1
3:14:8
pX qX pX
k
k
pX
k0
k0
40

W. Feller, An Introduction to Probability Theory and its Applications Vol. 1 (Wiley, New York, 1957) 155156.

177

3.15 Final test: intervals, runs, and histogram shapes

and to calculate the mgf






X
D r E X
n  1 r nr nt
rk1
k
t r
TX t
pX qX e pe
qX et

gT r t e
n

r
k
X
nr
k0
pet


r
 
X
pet
r
k
qX et
k
1  qX et

3:14:9

k0

from which the statistical moments


  
2
r
r r qX
r
r
TX

hT X i
pX
p2X

2T r
X

rqX
p2X

3:14:10

are obtained.
r
There is a simple, instructive explanation for the form of the mgf (3.14.9) of T X ,
r

which is seen to be the rth power of the mgf (3.14.3) of TX. The random variable T X is
interpretable as the sum
r

TX TX TX TX

3:14:11

r terms

of r independent random variables, each representing the waiting time for the first
occurrence of a bin with count x. Since the mgf of a sum of independent
random
r
variables is the product of the component mgfs, it follows that gT r t gT X t .
X
The entire time record of gamma coincidence counts arising from decay of 22Na
was analyzed for the intervals of count values X 190 through X 198 where the
mean count per bag was approximately 194. Additionally, in order to examine
whether the statistics of the record may have changed throughout the duration of
the experiment, the intervals were examined as well for the first 10 hours of counting,
for a middle period of hours 50 through 60, and for the hours 150 through 160
towards the end of the experiment. In calculating the theoretical mean waiting time
and variance from (3.14.4), it was necessary to take account of the variation in due
to natural lifetime, since this parameter determines the probability pX in (3.14.1).
A histogram of the intervals of recurrences of X 194 counts per bin is shown in
Figure 3.16 with a plot of the theoretical geometric probability function superposed.
This histogram is typical of the results obtained for the other count values as well. So
close is the agreement of experiment with theory (i.e. the null hypothesis) that I again
remind readers they are looking at real data and not a computer simulation. Table 3.6
summarizes the goodness of fit for a range of class values.

3.15 Final test: intervals, runs, and histogram shapes


Although the shape of a histogram has no absolute geometric or statistical meaning, a final set of tests was nevertheless devised specifically to search for the

Mother of all randomness I

178

Table 3.6

2 test of intervals of counts 190198 (full time record)


hT Exp
X i



T Thy
X



T Exp
X

Class
X

dof
d

P
value

hT Thy
X i
(Thy)

(Exp)

(Thy)

(Exp)

190
191
192
193
194
195
196
197
198

153
140
153
148
159
154
176
181
189

168
170
172
171
166
161
178
167
171

0.790
0.955
0.848
0.898
0.638
0.640
0.528
0.216
0.164

36.2
35.6
35.2
35.0
35.0
35.0
35.4
35.9
36.6

36.2
35.6
35.2
34.7
34.9
35.2
35.4
35.9
36.7

35.7
35.1
34.7
34.5
34.4
34.6
34.9
35.4
36.1

35.7
35.0
34.9
34.3
34.4
34.6
34.9
35.5
35.9

Relative Frequency

0.025
0.020
0.015
0.010
0.005
0.000
0

50

100

150

200

Intervals of X = 194 Counts


Fig. 3.16 Histogram of intervals of recurrences of X 194 counts per bin in a time record of
n 1 375 694 bins (approximately 167 hours) with superposition (solid) of theoretically
k1
predicted geometric distribution Pr T 194 k p194 q194
where p194 e194 194194/194!
0.0286, q194 1  p194 0.9714.

recurrence of histogram shapes whose observation was claimed repeatedly in


published articles by examining the decay of radioactive sodium for correlations
in the intervals of different count frequencies. The basic idea behind the test is this:
The recurrence of a histogram shape must require by definition of how a histogram
is constructed a regularity in repetition of the frequencies of the classes of the
histogram, otherwise it would be totally meaningless to say that two histograms had
the same or similar shapes.

179

3.15 Final test: intervals, runs, and histogram shapes

Suppose, for example, that at the end of an initial period of counting nuclear
decays the classes corresponding to count values X . . .80, 90, 100, 110, 120. . . in the
resulting histogram showed a set of frequencies . . .130, 628, 1000, 587, 140. . ., where
the highest frequency (1000) corresponded to the central class, i.e. the center of a
more or less Gaussian-looking shape. If a histogram of similar shape were to occur
again, we would expect the same classes to exhibit frequencies very close to the
preceding set. This idea is illustrated below in tabular form with classes of unit width
and count frequencies arrayed chronologically in bags, where the bag number serves
as a measure of increasing time.
Bags !
Classes

Value

1
2
3
.
.
.
.
.
.
.
K

X1
X2
X3
.
193
194
195
.
.
.
XK

n11
n21
n31
.
85
100
94
.
.
.
nK1

n12

nK2

...

...

...

n1k

85
100
94

nKk

...

n1l

85
100
94

nKl

M
n1M

nKM

The central class defined by X 194 corresponds closest to the mean number of
counts per bin in the time series.
The frequencies nkm (k 1. . .K classes, m 1. . .M bags) entered in the table are
hypothetical and meant only to show the kind of pattern that would occur if the
histograms generated by the bags of frequencies exhibited a recurrent shape with
perfect regularity. Since nuclear decay is a stochastic process whereby the number of
decays fluctuates randomly from bin to bin, one would not expect the sequence of
histograms each histogram corresponding to one bag to manifest a pattern as
striking as the one shown above. The question, therefore, is how to detect amidst
statistical noise an underlying pattern of recurring shapes. . .if such a pattern were
actually present.
It has already been demonstrated in the previous two sections that the observed
runs up and down and the observed waiting times of different count values in the time
series of coincident gamma counts were in complete accord with theory (the null
hypothesis). It is difficult to imagine how the numbers of a time series can pass such
tests of randomness and yet occur with frequencies that display the temporal

Mother of all randomness I

180

Table 3.7

Runs up/down of descending sorted intervals of class C0

Count
class Ck

Sequence
length n

RnObs (observed)

RnThy (theory)

Normalized
residual zR

Probability
Pr(z  jzRj)

190
191
192
193
195
196
197
198
199

37 977
38 606
39 050
39 464
39 029
38 839
38 280
37 430
36 728

25 223
25 800
26 031
26 366
26 099
25 928
25 466
25 015
24 467

25 318
25 737
26 033
26 309
26 019
25 892
25 520
24 953
24 485

1.15
0.76
0.02
0.68
0.96
0.43
0.65
0.76
0.22

0.25
0.45
0.98
0.50
0.34
0.67
0.52
0.45
0.82

regularity illustrated above. Nevertheless, nature has led to surprises before especially in matters involving quantum mechanics with outcomes that seemed counterintuitive,41 if not unreasonable. The test I devised to examine this possibility
employed again but simultaneously the concepts of waiting times (intervals)
and runs up and down.
The conceptual basis of the test was the following. If there is a causal periodicity to
the recurrence of histograms {Ha a 1. . .167}, then not only must the frequency of
occurrence of a particular class value (e.g. the count X 194) recur with some
regularity i.e. manifest intervals whose frequency of repetition is unaccountable
on the basis of pure chance but the intervals for different classes of count values
must be correlated or, again, there would be no meaning to the idea of equivalent
histogram shapes. The test was implemented, therefore, in two stages.
In the first stage, tests of up/down runs were made on the intervals in the
frequencies of a range of classes Ck 194 k (5  k  4) about the central class
C0 194 to establish that the results were all as expected on the basis of pure chance
i.e. under statistical control. This was indeed established.
In the second stage, the intervals of C0 were then arranged in descending order,
and the intervals of the other classes were sorted in the corresponding order. Runs
up/down tests were again performed on the intervals of Ck (k 6 0) to test whether the
sequences of intervals were still under statistical control or whether they were
correlated with the now highly improbable rank ordering of the intervals of C0.
The results, summarized in Table 3.7 for the total number of runs R  Rn,1, confirmed
that the re-ordered intervals of Ck60 still conformed completely to what one would
expect on the basis of pure chance, signifying no correlation with the intervals of C0
or with each one another.
41

M. P. Silverman, Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement, and Interference


(Springer, Heidelberg, 2008).

3.16 Conclusions and surprises: the search goes on

181

Recall that for a sample size n > 20, R is normally distributed to a good approximation with mean and variance given by (3.13.22) and (3.13.25). The fifth column
expresses the observed number of runs in standard normal form as the residual zR
(R(obs)  R(thy))/R, with the corresponding P-value listed in the sixth column.
As is evident by the close agreement of observation with prediction, the runs-ofintervals test showed no evidence whatever that the histograms of a long time series
of 22Na decays gave rise to recurrent shapes. In all likelihood, any such appearance to
the contrary particularly with histograms massaged to generate smooth peaks,
valleys, and rabbit ears only reflects the intrinsic capacity of the human mind
(apophenia) to seek coherent patterns out of random noise.

3.16 Conclusions and surprises: the search goes on


An addition to knowledge is won at the expense of an addition to ignorance. It is hard to empty the
well of Truth with a leaky bucket.
Sir Arthur Eddington42

The research described in this chapter proceeded over a period of more than five
years. For one thing, carrying out a sustained program of research within an
undergraduate institution devoted primarily to teaching and therefore without
the support of graduate students or postdoctoral assistants greatly restricted the
times when work on the project could be done. This is simply a statement of fact, not
a complaint, as there are compensating features to being at a liberal arts college, and
I am there by choice. For another, the project evolved in complexity, as I began to
understand better the experimental and analytical dimensions of what needed to be
done, and learning can be a relatively slow process. Fortunately if one chooses to
think of it this way I had little competition since belief in the randomness of nuclear
decay is so ingrained in the psyche of most physicists that few if any other labs (none
to my knowledge at the outset) probably thought this experimental fact was sufficiently in doubt to be worth checking.
Actually, compared to the plethora of experimental studies of quantum phenomena relating to interference and entanglement, I was aware of only relatively few tests
to examine specifically whether quantum transitions occurred nonrandomly. It was
this surprising paucity that prompted me to investigate nuclear decay in the first place
in the 1990s. Physicists who believed that the statistics of radioactive nuclei had long
ago been established beyond doubt probably conflated such a demonstration with
exponential decay. In fact, quantum theory predicts a non-exponential decay of
quasi-stationary states for times short compared with the coherence time of the

42

A. S. Eddington, The Nature of the Physical World (University of Michigan Press, 1963) 229. [Originally the
1927 Gifford Lectures published by Cambrdige University Press.]

182

Mother of all randomness I

system or long compared with the mean lifetime.43 In the case of 22Na, the former
time domain would be roughly 1018 seconds (time required for a nucleon to cross
the diameter of the nucleus), and the latter about 2.5 years. The time scale of the data
collection was about 167 hours, well outside both time domains. The experimental
conditions therefore were consistent with the null hypothesis, which does lead to
exponential decay as explained earlier in this chapter.
My motivation for examining one particular nuclide (22Na) in such detail and in
the specific ways described in this chapter was, as I have written in the introduction,
primarily due to the repeated extraordinary claims published by certain groups of
researchers. Since much of the daily work in science, as in other professions generally,
does not rise above the mundane, the challenge posed by these published assertions,
although apparently dismissed out of hand by many in the physics community or at
least by those who wrote me to debunk them made my own scientific life more
exciting. In fact, despite harboring the biases that most physicists do in favor of a
prevailing theory that had remained inviolable since its inception in the 1920s,
I secretly wished that the claims were valid, that nuclear decay somehow managed
to disguise a deterministic component beneath an outer appearance of randomness
or more likely (although still highly unlikely) that some external influence of
unknown origin, a cosmogenic force, existed that exerted a subtle control over
what otherwise would be independent, random processes.
Alas or perhaps fortunately the experiments and analyses described here found
no such thing. Rather, all the tests pointed unwaveringly to the following
conclusions.44
 The discrete states in histograms of nuclear decay reflected only correlations
introduced artificially by construction.
 Visual inspection of shapes of histograms provided no reliable test of correlations
in the underlying stochastic processes.
 Nuclear decay (at least that of radioactive sodium)
 was completely consistent with white noise;
 showed no correlations in fluctuations of counts in the time series;
 showed no correlations in fluctuations of frequencies in the histograms;
 showed no periodicity in either nuclear counts or count frequencies for time
intervals ~167 hours;
 showed no unexplained trends over a period ~35 days.
There are, of course, countless other nuclear decays to examine, as well as random
fluctuations arising from electromagnetic waves and chemical reactions. But, if the
point was to search for a cosmic influence of universal reach, then presumably such
43
44

M. P. Silverman, Probing the Atom: Interactions of Coupled States, Fast Beams, and Loose Electrons (Princeton
University Press, Princeton NJ, 2000).
M. P. Silverman and W. Strange, Search for Correlated Fluctuations in the Decay of Na-22, Europhysics Letters
87 (2009) 32001 p1p6.

3.16 Conclusions and surprises: the search goes on

183

an influence did not exist if it had no effect on the decay of 22Na (which is a weak
nuclear interaction), the subsequent ee annihilation into gamma rays (which is a
relativistic quantum electrodynamic interaction), and all the electronic activities that
went on within our detection and data processing instrumentation (which constitute
non-relativistic classical electromagnetic interactions).
Nevertheless, the question of whether it is conceivable within the framework of the
known laws of physics for ostensibly independent nuclear processes to be correlated
by some kind of universal interaction is an interesting one. The Standard Model of
particles and forces predicts an ever-present background field (Higgs field) pervading
all space, which determines particle masses. Similarly, the Standard Cosmological
Model (big bang inflation) requires an all-pervasive field (dark matter) to account
for the cosmic distribution of mass and a second such field (dark energy) to account
for a perceived increase in the expansion rate of the universe. Whether such fields
could lead to correlated fluctuations in nuclear decay is highly dubious. Indeed, the
very existence of such fields in cosmology has come into question, and, despite my
having also contributed to this genre of speculation,45,46,47 I am progressively evolving to the view that these mysterious entities will eventually go the way of the
nineteenth-century aether once the nature of gravity is better understood.
This statistical narrative of what at least to me was a fascinating undertaking
could well have ended with the preceding sentence were it not for a surprising
revelation at least as extraordinary and possibly more believable than the claims of
discrete structures and recurrent histograms. As the project recounted in this chapter
neared conclusion, reports came to my attention of data showing variable nuclear
decay rates correlated with the Earths orbital position about the Sun48 and influenced by variable solar activity such as solar flares.49 The data from which such
drastic conclusions were drawn were not recent, but comprised measurements of the
half-life of radioactive silicon (32Si) made at the Brookhaven National Laboratory
(BNL) over the period 19821986 and of radioactive europium (152Eu, 154Eu) made
at the Physikalisch-Technische Bundesanstalt (PTB) in Germany over a period of
approximately 15 years from 19841999, and a measurement of the decay of radioactive manganese (54Mn) made in 2006. These measurements, I understood, were
undertaken for purposes of metrology and calibration and not with an eye to testing
fundamental principles of nuclear physics. Indeed, I was later to learn that when the
long-duration measurements manifested variable nuclear decay rates, the

45
46
47
48
49

M. P. Silverman and R. L. Mallett, Coherent degenerate dark matter: a galactic superfluid?, Classical and Quantum
Gravity 18 (2001) L103L108.
M. P. Silverman and R. L. Mallett, Dark matter as a cosmic BoseEinstein condensate and possible superfluid,
General Relativity & Gravitation 34 (2002) 633649; Erratum 35 (2002) 335.
M. P. Silverman, A Universe of Atoms, An Atom in the Universe (Springer, New York, 2002) 325385.
J. H. Jenkins et al., Evidence of correlations between nuclear decay rates and Earth-Sun distance, Astroparticle
Physics 32 (2009) 4246.
J. H. Jenkins and E. Fischbach, Perturbation of nuclear decay rates during the solar flare of 2006 December 13,
Astroparticle Physics 31 (2009) 407411.

184

Mother of all randomness I

experimenters largely retired the data out of concern that the experiments had
somehow gone awry.
Once resurrected, however, the claims of variable nuclear decay rates did not go
unchallenged, and evidence against small periodic annual variations modulating the
exponential decay curve was obtained from re-examination of a variety of nuclear
decays obtained both terrestrially50 and from radioisotope thermoelectric generators
aboard the Cassini spacecraft.51
Around the time I learned of this interesting controversy, I was co-organizing a
workshop on the fundamental physics of charged-particle (e.g. electron) and heavyparticle (e.g. neutron) interferometry to be held at the HarvardSmithsonian Center
for Astrophysics in Cambridge, Massachusetts in April 2010. The conference, it
seemed to me, could provide an excellent opportunity to understand better the
conflicting conclusions regarding decay of radioactive nuclei, which, while not
exactly interferometry, nevertheless involved some very heavy quantum particles.
And so I invited two speakers on opposite sides of the issue. Only one accepted the
invitation, and therefore only one side was presented that in support of variable
nuclear decay. I found the presentation thought-provoking, but not convincing.
If a variation in nuclear decay rate were indisputably shown to be correlated with
either the position of the Earth about the Sun or short-term violent activity at the
Suns surface, what possibly could be the cause? The processes that go on within a
nucleus have long been thought to be impervious to non-nuclear activities outside the
nucleus that is, to all local environmental variables like temperature, humidity,
molecular bonding, chemical reactions, ambient light, and even fairly strong laser
fields directed at the nucleus. The only exception I know of are processes that involve
the density of exterior bound electrons at the nucleus.
The most familiar case, previously mentioned, is that of electron-capture decay
whereby a neutron-deficient nucleus can convert a proton to a neutron by capturing
an orbital electron, usually from the innermost (K) shell. This can occur if the
daughter product has a mass greater than the threshold mass for positron emission.
Ordinarily, this effect is very weak. An isotope of beryllium (7Be), for example,
decays with a half-life of about 54 days by capturing a K-shell electron to form an
isotope of lithium (7Li), a process that plays a role in dating geologic samples.
Depending on the molecular bonding by which 7Be is bound in a molecule (hydrated
ion, hydroxide, or oxide), the half-life of 7Be can vary by about 1%.52
A less common case, however, known as bound-state beta decay, can lead to
spectacular modifications. This is a weak-interaction process in which the emitted
electron (beta particle) remains in a bound atomic state rather than being emitted
50

51
52

E. B. Normal et al, Evidence against correlations between nuclear decay rates and EarthSun distance, Astroparticle
Physics 31 (2009) 135137.
P. S. Cooper, Searching for modifications to the exponential radioactive decay law with the Cassini spacecraft,
Astroparticle Physics 31 (2009) 267269.
R. A. Kerr, Tweaking the clock of radioactive decay, Science 286 No. 5441 (29 October 1999) 882883.

3.16 Conclusions and surprises: the search goes on

185

into a continuum of unbound states. For a neutral atom, only weakly unbound states
with low density at the nucleus are available, and the process is insignificant compared to other decay pathways. However, for the fully ionized atom such as can
occur in a stellar interior deeply bound states close to the nucleus are available. One
example, produced terrestrially in a heavy-ion synchrotron, is the bound-state beta
decay of an isotope of rhenium (187Re) for which the half-life of the neutral atom is
42 billion years and that of the fully ionized atom is only 14 years!53
Of various mechanisms that have been proposed for the alleged correlation of
nuclear decay and solar activity, some e.g. novel force fields emanating from the
Sun that cause changes in the magnitudes of fundamental parameters such as the
fine-structure constant involved entirely new physics, whereas others e.g. variations in the flux of neutrinos emitted by the Sun involved known particles, but
interacting with matter with cross sections far greater than those predicted by the
Standard Model. Nevertheless, the neutrino mechanism is of particular interest
because it can be tested more readily than others that require new force fields. If
the neutrino mechanism is applicable, one would not expect to find an annual
variation in the rate of decay of nuclei that disintegrate by emission of alpha
particles, which is an electromagnetic rather than weak nuclear process. As of this
writing, the issue is still ambiguous.
How utterly remarkable it would be if neutrinos from the Sun actually affected the
decay of radioactive elements on Earth, may be glimpsed from the following backof-the-envelope (or front-of-the-computer-screen) estimate of (a) the rate of
absorption of neutrinos by radioactive manganese (54Mn) compared with (b) the
natural decay rate due to electron capture. Manganese-54, whose decay I investigated
years earlier with my colleague Wayne Strange, is one of the nuclides claimed to be
affected by solar flares. The rate of process (a) is given by the expression

neutrino flux
neutrino
number of
number of neutrinos

a
at the Earth
cross section
absorbed per second
nucleons
N n :
3:16:1
The term flux, derived from the Latin root for flow, means number of particles
passing through a unit area in a unit of time. The neutrino flux at the Earth has been
measured to be ~ 7  1010 m2 s1. The cross section is a measure (in terms of area)
of the probability that an interaction occurs. The neutrino cross section increases
with energy, but at the low energies of beta decay (MeV compared with GeV),54 a
representative value of the neutrino cross section is ~ 1047 m2. It is this extremely
small number that underlies the frequently cited remark that a neutrino can pass
53

54

F. Bosch et al., Observation of bound-state  decay of fully ionized 187Re: 187Re187Os cosmochronometry, Physical
Review Letters 77 (1996) 51905193.
The electron volt (eV) is a common measure of particle energy. The MeV, (mega or million eV) is characteristic of lowenergy nuclear phenomena like beta decay, whereas GeV (giga or billion eV) is characteristic of high-energy elementary
particle phenomena.

186

Mother of all randomness I

undeflected through a light-year of lead. The rate of process (b), which was derived
earlier in the chapter, takes the form



intrinsic
number of
number of 54 Mn nuclei
b

decay rate 54 Mn nuclei


decaying per second
3:16:2
N 54 Mn ,
where the decay rate and half-life are related by ln 2= , and the number of
manganese nuclei and total number of nucleons are related by N n 54N 54 Mn .
The ratio of the rate of process (a) to process (b) is then approximately




54 7  1010 1047 2:7  107
neutrino-induced decay 54
27

e 1:510 ,
ln 2
electron-capture decay
0:693
3:16:3
1
2

1
2

where it is to be noted that the number of nucleons drops out. Unless there is
something about neutrinos that physicists really do not understand, it is difficult to
reconcile with current theory how solar neutrinos could affect the rate of weak
nuclear decay processes on Earth.
Because the modulation of nuclear decay by solar activity would have momentous
theoretical and practical consequences (e.g. for nuclear-based geological dating), it
goes without saying (but I will say it anyway) that such claims are unlikely to be
accepted until careful studies are done to understand just how the Sun affects the
apparatus employed in experiments for detecting particles and processing electronic
signals. It has long been known that the Sun emits a veritable wind of charged
particles and a broad spectrum of electromagnetic radiation. Sporadic violent solar
activity has damaged orbiting satellites and affected communications on Earth. What
is perhaps not widely appreciated and only relatively recently investigated in depth
is that the Sun has more than ten million normal modes of oscillation that can affect
virtually every electrical device imaginable. As one recent report on the Suns
ubiquitous influence concluded:55
. . .we have shown a series of examples where data encountered in the engineering environment
is not of the form that most texts prepare engineering students to expect. The majority of this
data is nonstationary on a variety of scales, definitely not white, and contains many discrete
modes. . . .these modes begin as normal modes of the sun, are often further split by Earths
rotation and possibly other causes. These modes are ubiquitous in space physics data, in the
magnetosphere and ionosphere, in barometric pressure data, in induced voltages on ocean
cables, and even in the solid Earth. Although the specific physical coupling mechanism is not
understood, the solar modes appear to be a major driver of dropped calls in cellular phone
systems. We currently do not know how many different kinds of systems are directly or
indirectly affected by phenomena arising from these modes. . .
55

D. J. Thomson, et al., Solar modal structure of the engineering environment, Proceedings of the IEEE 95 (May 2007)
10851132.

3.16 Conclusions and surprises: the search goes on

187

With controversy again come excitement and opportunity. As of this writing I am


again examining the decay of 54Mn and the beta-emitter cesium-137 [137Cs], looking
very carefully for fluctuations within the instrumentation that may correlate with
changes in solar behavior. Whatever the outcome whether some emanation from
the Sun truly influences nuclear decay or only the apparatus used to detect it the
endeavor is bound to reveal something interesting.
And so the search goes on.

Appendices

3.17 Power spectrum completeness relation


The power spectrum evaluated at the special set of discrete frequencies that provide
m/2 independent spectral estimates, where m is the maximum lag of the autocorrelation function, takes the form
 
m
X
jk
Rk cos
j 0 . . . m
3:17:1
Sj R 0 2
m
k1
up to a normalization constant.
It then follows that
1
2

1
2

S0 12 R0
Sm R0
1
2

m
X
Rk
k1
m
X

3:17:2
k

1 Rk

k1

and
 
m1
m
m1
X
X
X
jk
Sj m  1R0 2 Rk
cos
m
j1
k1
j1

m  1R0 

m h
X

C1211k

i
1 1k Rk :

3:17:3

k1

Combining the terms in (3.17.2) and (3.17.3) leads to the identity


!
m1
X
1 1
1
S0
Sj Sm R0 j 0 . . . m:
m 2
2
j1

188

3:17:4

189

3.18 Distributions of spectral variables and autocorrelation functions

Evaluation of the sum of cosines identified by C in (3.17.3) is most easily done


by recognizing that each term in the sum is the real part of the complex exponential rj  (eik/m)j, whereupon C becomes the real part of the sum of a geometric
series
m1
X
j1

rj

m1
X

rj  1

j0

r  rm eik=m  cos k

:
1r
1  eik=m

3:17:5

Taking the real part of (3.17.5) leads to the expression beneath the bracket in (3.17.3).

3.18 Distributions of spectral variables and autocorrelation functions


3.18.1 Detrended time series Y(t)
A time series yi (i 1. . .N) of observations governed by a Poisson distribution with
mean parameter  1 is well described by independent, identically distributed (iid)
Gaussian random variables Yi of mean 0 and variance 2


Y N 0, 2 N 0, :
3:18:1
The moment-generating function (mgf ) of each variate is gY i u heY i u i e u =2 ,
N
X
and the mgf of a linear superposition of normal variates, Z
ci Y i , is the product
N
i1
Y
gY i ci u, which also takes the form of a Gaussian mgf so that
g Z u
N
X
 i12 
Z N 0, Z with variance 2Z 2
c2i .
2 2

i1

3.18.2 Fourier amplitudes F() A() + iB()


The discrete Fourier transform (DFT) of Y
N
1 X
F p
Y i eiti A iB
N i1

3:18:2

with observation times ti i t constitutes two linear superpositions of iid Gaussian


variates
N


1 X
A p
Y i cos ti N 0, 2A
N i1
N


1 X
B p
Y i sin ti N 0, 2B
N i1

with respective variances [see (3.8.19)]

Mother of all randomness I

190

2A

N
2 X
1
cos 2 ti ! 2
2
N i1
N!
t!0

2B

N 1
2 X
1
sin 2 ti ! 2
2
N i0
N!

3:18:3

t!0

that reduce in the limits shown to one-half the variance of the original time series.
Thus, the real and imaginary amplitudes of the DFT of the time series are normally
distributed





A
1
1
3:18:4
N 0, 2 N 0, :
2
2
B
3.18.3 Squared amplitudes A2(), B2()
The probability density function (pdf ) of z y2 is related to the pdf of y by the chain
of steps

1


 

 dy 
pY ydy pY y pY y dy pY y pY y  dz
pZ zdz
dz

3:18:5
from which follows the transformation
pZ z

p
p
pY y pY y pY z pY  z
p
:

jdz=dyj
2 z

3:18:6

Application of (3.18.6) to the pdf pY (y) of a normal variate N 0, 12 leads to the pdf
pZ(z) of a gamma-distributed random variable Gamr, Gam12 , 1. Thus, the
square moduli of the DFT are gamma distributed
 

1 1
A2
3:18:7
Gam , :
2
B2
3.18.4 Power spectrum S()jF()j2 A()2 + B()2
The mgf of a gamma variate Gam(r, ) takes the form g(u) (1  1u)r. It then
 
follows that the mgf of the sum of two iid gamma variates Gam 12 , 1 is the square
g(u)2 (1  u)1, which is identical to the mgf of an exponential distribution with
parameter . Thus, the ordinates of the power spectrum are distributed exponentially
S A2 B2 E:

3:18:8

3.18 Distributions of spectral variables and autocorrelation functions

3.18.5 Modulus jFj

The pdf of w

191

q
A2 B2

p
z is related to the pdf of z by a chain of steps

 dw 


1 pW wdw pW w
dz pZ zdz
dz

3:18:9

leading to the transformation


pW w

 
pZ z
2z pZ z 2wpZ w2 :
jdw=dzj
1
2

3:18:10

Applied to an exponential variate Z E(), Eq. (3.18.10) leads to the pdf of a


Rayleigh distribution with parameter
h
i
2
2
pZ z 1 ez= ) pW w 2w 1 ew = 21 wew = :
3:18:11
Exponential pdf

Rayleigh pdf

Thus the moduli of the power spectral ordinates are distributed as


q
jFj A2 B2 Ray:

3:18:12

3.18.6 Phase: tan () B()/A()


The tangent of the phase of the DFT amplitude is the quotient of two iid normal
variates N(0, ). However, from the general equivalence relation


3:18:13
N , 2 N 0, 1,
demonstrated in Chapter 1 by means of the mgf, it follows that tan is distributed in
the same way as the quotient of two iid standard normal variates N(0, 1). The pdf of the
quotient z x/y, to be discussed in more detail in a later chapter, takes the general form

pZ z pX yzpY yjyjdy:
3:18:14
Equation (3.18.14) reduces to the pdf of a Cauchy Cau(0, 1) distribution centered on
z 0 with unit width parameter when the pdfs of x and y are Gaussian functions of
zero mean and unit variance.
3.18.7 Autocovariance Rk
We employ the discrete estimate
Rk

n
1X
yy
n t1 t tk

k > 0

3:18:15

of the autocovariance of the adjusted time series {yi}, each element of which is
representable by an independent normal random variable Y N(0, 2) N(0, ).

Mother of all randomness I

192

The pdf of the product z xy of two independent random variables, also to be


discussed later in the book, takes the general form

pZ z pX xpY z=xx1 dx,

3:18:16

whereupon Eq (3.18.16) reduces to

 2
1
2 z
1
pZ z 2 e22 x x2 x1 dx

3:18:17

when the pdfs of x and y are densities for N(0, 2). Rather than evaluate the integral (3.18.17)
directly, it is actually more convenient to use it to calculate the corresponding mgf

gZ u

ezu pZ zdz


1
2
2 2
x2 =2 2 1
e
x
dx
ezu ez =2x dz
2
0

p
2
2 xexu =2
r






2 1 12x2 12  2 u2

e
dx 1  4 u2 ! 1  2 u2
2


1
2

3:18:18
1
2

which reduces to a deceptively simple form that is not one of the familiar distributions
to be found in books. At first glance it may resemble the mgf of a gamma distribution
(with r 12), but the latter is a function of u, not of u2.
The pdf resulting from (3.18.17) is in fact a Bessel function a modified Bessel
function of the second kind, (2)1 K0(z/2), to be exact but we do not need to
know or use this information. Everything of relevance is contained in the mgf
(3.18.18)
which it follows that the mgf of a sum of n  k iid products ytytk is

from

, and therefore the mgf of the autocovariance Rk>0 is
1  2 u 2
nk
2

 nk
h uink 
2  2
2u
1 2
:
gRk u gZ
n
n

3:18:19

Expansion of the natural logarithm of (3.18.19) with truncation after the first term
(of order u2) which is entirely adequate for large sample size n  1 leads to the
2
2
mgf gRk u e e =nu of Gaussian form with variance 2Rk 2 =n. Thus, the
autocovariance function of lag k > 0 is distributed as


Rk>0 N 0, 2 =n :
3:18:20
1
2

The autocovariance R0 of lag 0 corresponds to the average of a sum of n squares of iid


Gaussian variates N(0, 2), which in the special case of N(0,1) was previously shown
to be equivalent to a chi-square variate of n degrees of freedom with mgf

3.18 Distributions of spectral variables and autocorrelation functions


un=2
g2n u 1  2
:
n

193

3:18:21

By the same, somewhat tedious, analytical procedure one now arrives at the mgf

un=2
:
3:18:22
gR0 u 1  2 2
n
There is, however, an easier way to arrive at (3.18.22), which shows the connection
n 
X
 2
N i 0, 2
and the chi-square variate
between the random variable Z 1n
n
X
i1
2
2
1
n n
N i 0, 1 . By exploiting the equivalence N(0, 2) N(0,1), we can evalui1

ate the mgf gZ(u) by the following simple chain of steps


gZ u

 2 
 2 
n=2
      
2 2
2 2 u
u
u
e n n u e n n
,
12
g2n
n
n

3:18:23

which is precisely (3.18.22).


Upon expanding the logarithm of gR0 u, truncating as before at the term of order
2
u , and substituting 2 , we reduce mgf (3.18.22) to the form
22 
2
3:18:24
gR0 u e eu n u
1
2

of a Gaussian variate of mean and variance 22/n. Thus, to good approximation,


the autocovariance of lag 0 is distributed as


R0 2n e N , 22 =n :
3:18:25
3.18.8 Autocorrelation coefficients rk0 Rk/R0
The autocorrelation coefficients rk Rk/R0 for lag 1 (since r0 1 is constant) is
distributed as the quotient of two Gaussian variates N(0, 2/n)/N(, 22/n). Substitution of the Gaussian pdfs of the numerator and denominator into Eq. (3.18.14) leads,
like the procedure employed previously, to an integral that can be used more expediently to calculate the characteristic function (cf ) of the distribution than the quotient
pdf itself. It is useful indeed necessary in this case to calculate the cf instead of
the mgf because the integral defining the latter does not converge. I will omit the
intervening steps except to point out that toward the end of the derivation the
condition n  1 is imposed, which reduces an exponential factor to unity whereupon
the resulting integral (an integration of a Gaussian function over all space) itself
reduces to 1, leaving an expression which is the cf of a Gaussian variate of zero mean
and variance 1/n. To good approximation for large n, it therefore follows that the
autocorrelation coefficient is distributed as
r k1 Rk =R0 N 0, 1=n:

3:18:26

4
Mother of all randomness
Part II
The random creation of light

My design in this book is not to explain the properties of light by


hypotheses, but to propose and prove them by reason and experiments.
Sir Isaac Newton1
4.1 The enigma of light
The neologism apophenia, coined by psychiatrist Klaus Conrad, describes a condition of seeing meaningful patterns in meaningless random data. In extreme cases
like that of Kammerer or the character of John Nash in the book and film
A Beautiful Mind the condition is psychopathic. Nevertheless, whether madman or
genius or somewhere in-between, humans appear to be evolutionarily hard-wired to
see patterns in randomness.
One reason, as the previous chapters demonstrated, is that there actually are
patterns in randomness. They do not permit one to predict the occurrence of
individual future events, but neither are they entirely meaningless. Rather, these
patterns signify by their absence if one has the mathematical tools to look for
and interpret them the likelihood that a particular stochastic process is not random.
The nature of light had long remained an enigma to natural philosophers. In the
seventeenth century, Isaac Newton thought of light as a stream of particles.
A century and a half later, the French physicist August Fresnel wrote a theoretical
essay (for a scientific competition) accounting for the uniquely wavelike phenomenon
of light diffraction. Subsequent experimental confirmation of an especially counterintuitive prediction of Fresnels theory namely, that a bright spot should appear at
the center of the dark shadow of an illuminated sphere lent credence to the new
paradigm that light was a wave. Ironically, Newton could have reached the same
conclusion had he known how to interpret his extraordinary, yet even today little
known, experiments on the diffraction of light through a wedge-like aperture.2
1
2

Sir Isaac Newton, Opticks (first printed by The Royal Society, 1704) 1.
See M. P. Silverman and W. Strange, The Newton two-knife experiment: intricacies of wedge diffraction, American
Journal of Physics 64 (1996) 773787. I have reproduced Newtons experiment in my lab using lasers (rather than
sunlight) and a CCD (charged coupled device) camera (rather than the human eye) and analyzed the patterns in detail
using the scalar theory of Fresnel diffraction.

194

4.1 The enigma of light

195

The crowning eighteenth-century achievement, however, was James Clerk


Maxwells synthesis of electricity, magnetism, and optics in one unified field theory
which, nearly forty years after Fresnels essay, predicted the theoretical existence of
electromagnetic waves that propagate through space at the speed of light. The
detection of such waves (in the radiofrequency portion of the spectrum) by the
German physicist Heinrich Hertz in the 1880s conclusively established that light
was a wavelike phenomenon. Conclusively, that is, until the early years of the
twentieth century, when quantum phenomena such as black-body radiation, the
photoelectric effect, and Compton scattering appeared to require for their interpretation a reversion to a particle picture of light behavior.
With the extension of quantum theory to Maxwellian electrodynamics to form a
consistent theory of quantum electrodynamics (QED), the nature of light is no longer
regarded by physicists as an enigma, although that does not make it any the less
interesting. Two kinds of particles constitute the denizens of the quantum world:
(a) fermions, which comprise particles like protons, neutrons, electrons, and neutrinos that make up ordinary matter and obey FermiDirac statistics, and (b) bosons,
which include particles like photons, gluons, and gravitons whose exchange by
fermions mediates the interactions experienced by ordinary matter and which obey
BoseEinstein statistics. Light, then, in all its forms, is made up of photons, a term
coined in 1926 by physical chemist Gilbert Lewis referring to irreducible quanta of
massless, electrically neutral, helicity-1, bosons. Each of these characteristics has farreaching conceptual and practical implications.
(a) Quanta (singular quantum) is the generic term for the discrete, particle-like
units in which some physical entity is produced. The energy and linear momentum p jpj of a quantum of light are respectively proportional to the frequency
and inversely proportional to the wavelength that would be ascribed to the
classical wave comprising a large number of these quanta
h h
h
p hk:

4:1:1

The proportionality constant is Plancks constant h 6.626  1034 Js. The


second set of equalities in (4.1.1) relate energy and momentum to angular
frequency ( 2) and wave number k 2/. The proportionality constant
h h=2 is called the reduced Plancks constant.
(b) Massless means that a photon has no rest mass m, but it still has energy and
momentum, as expressed by Einsteins relativistic formula
p
p2 c2 m2 c4 ! pc:
4:1:2
m0

A photon is never at rest; it either moves through vacuum at the universal


constant speed c or it vanishes when absorbed by matter, which thereby gains

Mother of all randomness II

196

the amount of energy and linear momentum (4.1.1) and undergoes a transition
to a higher-energy quantum state. Excited matter can undergo a transition to
a lower-energy state by emitting a photon that carries away the amount of
energy and momentum (4.1.1). Quantum mechanically, this is how light is
produced, in contrast to the classical mechanism whereby accelerating charged
particles emit electromagnetic radiation. Because photons are massless, their
range is infinite. That is why we can receive electromagnetic signals from
galaxies millions of light years distant.
Combining relations (4.1.1) and (4.1.2) leads to familiar relations linking
frequency, wavelength, and speed

c :
p
k

4:1:3

One might wonder why, if light either moves through vacuum at speed c or
else vanishes, does the speed of light through matter differ from c. The
answer is statistics. From a quantum perspective, the movement of light
through matter is in some ways analogous to the flight of a drunk through a
forest. By continually colliding with trees, falling down, getting up, and
recommencing running, the drunks mean speed is less than his instantaneous speed. Likewise, photons moving through matter are virtually
absorbed and re-emitted in collisions with atoms and molecules but otherwise move at speed c through the interstices of the material. Macroscopically, the effect of a material on the propagation of light is represented by the
index of refraction, a wavelength-dependent function that could also depend
on the polarization and direction of propagation of the light.
(c) The photon has no electric charge. Thus, although the photon is a carrier of the
electromagnetic interaction, it does not interact directly with electric or magnetic
fields. A consequence of this property for classical electromagnetism is the
linearity of Maxwells equations and the fact, therefore, that any linear superposition of solutions to these equations is also a solution. In QED, however, ultrastrong electric or magnetic fields can destabilize the vacuum, leading to virtual
(i.e. transient) electronpositron pairs that scatter photons. From a macroscopic
perspective, the vacuum has acquired a field-dependent refractive index. One
such exotic process, for example, is the magnetic birefringence of the vacuum,
which requires an ultra-strong magnetic field of such strength B that the work
expended in displacing an electron by its Compton wavelength (e h/mec) is at
least equal to the electron rest-mass energy
ecB

relativistic magnetic
force

electron Compton
wavelength

 me c2

or

B

me c2
 109 Tesla,
eh

4:1:4

4.1 The enigma of light

197

where e 1.6  1019 C is the magnitude of charge on an electron or positron.3


The field strength in (4.1.4) is within about a factor of 10 of the strongest known
magnetic fields in the universe (produced by certain kinds of neutron stars) and
millions of times stronger than the largest magnetic fields producible in terrestrial
laboratories.
The interaction of intense light with matter (rather than vacuum) can also
generate a field-dependent response (quantified by the electric susceptibility of
the material) that serves as a source of more light at other frequencies. One
example is the process of spontaneous parametric down-conversion (PDC),
which will be discussed later in connection with experiments to search for nonrandom patterns in sequences of photon polarization measurements.
(d) Quantum particles have an intrinsic angular momentum s referred to as spin,
characterized by a quantum number s constrained to be of integer or half-integer
value, s 0, 12 , 1, 32 . . ., in units of h. The particles with half odd-integer spins are
fermions; those with integer spins are bosons. A photon is usually said to be a
spin-1 particle, but, strictly speaking, that is not correct. A particle of spin s has
2s 1 quantum substates which quantify the projection of angular momentum
on a conveniently designated axis of quantization. Photons, however, because
they are massless, have only two such substates. Rigorously speaking, the photon
has unit helicity magnitude jj, in which the helicity of a particle is the
projection of the spin s (in units of h) onto the direction of linear momentum
p that is,  (s  p)/jpj. The helicity of a photon therefore takes two values: 1
if s is parallel to p, and 1 if s is anti-parallel to p.
The quantum attributes of the photon have corresponding features in the classical
electromagnetic theory of light.
 A classical electromagnetic wave is a composite of a large number of photons in
the same quantum state. Such states are possible because the photon is a boson.
(Corresponding classical-wave states for electrons do not exist.) For example, a
1 W source of red (633nm) light produces photons at the rate
dN Power

3:2  1018 photons=s:


dt
hc=
 The linear momentum p of a photon corresponds to the wave vector k of the light
wave. The wave vector carries imaging information and plays a seminal role in
image processing methods such as spatial filtering and holography.
 The helicity values (1, 1) of the photon correspond to (left, right) circular
polarizations (LCP, RCP) of the light wave. Facing the light source, an observer
3

The calculation in (4.1.4) is of a heuristic nature only, intended to arrive at a dimensionally correct, numerically valid
estimate of the critical field strength. Technically, a magnetic field does no work because it acts perpendicular to
displacement. According to special relativity, however, the magnetic field in the moving frame of a particle results in an
electric field in the instantaneous rest frame of the particle. Electric fields can do work on a particle.

198

Mother of all randomness II

would detect LCP waves if the electric vector of the approaching wave was rotating
counter-clockwise i.e. from the 12:00 position toward the observers left shoulder. Appropriate linear superpositions of LCP and RCP waves produce vertical
and horizontal linearly polarized (LP) light or, more generally, elliptically polarized light in which the tip of the electric vector traces out an ellipse in time. The
polarization of electromagnetic waves is utilized in numerous ways for example,
by optical rotatory dispersion or optical circular dichroism or polarization-based
imaging to study the structure, composition, and concentration of materials.
Every characteristic of lightwave vector, frequency or wavelength, and polarizationcan be exploited in practical ways to provide information about physical
systems.4
Without any desire to get caught up in a controversy of semantics, I will nevertheless express an opinion, based on numerous experimental and theoretical investigations of quantum systems5 over many years, that notwithstanding the oft-repeated
expression waveparticle duality, the building blocks of nature are particles, not
waves. An individual entity with invariant mass, charge, and spin (or helicity) is a
particle. A photon is a quantum particle.
Quantum mechanics is an irreducibly statistical theory. In quantum mechanics the
aspect of waves enters only statistically either in the aggregate behavior of similarly
prepared particles or in the repetitive observations of a single particle. The quantum
wave function is not the physical entity itself, but only a mathematical tool for
calculating probabilities and expectation values. It is a mathematical expression of
the information available about a quantum system.
Statistically, there is a relation between the uncertainty in location and uncertainty
in momentum of a quantum particle. A perfectly monochromatic photon (which, in
reality, does not exist) would be completely delocalized although its linear momentum would be measurable with no uncertainty. Photons produced by real sources are
describable mathematically by a wave packet, i.e. a linear superposition of monochromatic components; they have a finite spectral width and a calculable spatial
uncertainty as characterized by the uncertainty relation derived in Chapter 3
[Eqs. (3.9.11) and (3.9.12)] for a general function and Fourier transform pair.
Although quantum theory has been subjected to many tests since its inception at
the turn of the twentieth century, relatively few have been expressly designed to probe
the underlying randomness of nature. The photon is an ideal particle with which to
conduct such tests. The previous chapter was devoted to the randomness of nuclear
disintegration and a comprehensive set of tests on radioactive sodium in response to
a longstanding nuclear controversy. In this chapter I focus on the randomness of

4
5

I discuss my experiments covering all facets of physical optics in M. P. Silverman, Waves and Grains: Reflections on Light
and Learning (Princeton University Press, New York, 1998).
I discuss my experimental and theoretical investigations of quantum systems in M. P. Silverman, Quantum
Superposition: Counterintuitive Consequences of Coherence, Entanglement, and Interference (Springer, Heidelberg, 2008).

4.2 Quantum vs classical statistics

199

photon creation and interaction at a beam splitter. The discussion will bring to light
(pun intended) important elements of statistical physics such as the concept of a
compound distribution, the relation between Poisson statistics (a hallmark of classical physics) and the distributions of quantum particles, the correlations intrinsic
to the statistics of bosons and fermions, and the statistics of recurrent events, which
play an important part in an experimental test of photon statistics to be described in
due course.

4.2 Quantum vs classical statistics


Standard books on statistical and thermal physics employ various physical assumptions to derive the occupation probability
p E, N

eE
Z T, V, N

 1=kB T

4:2:1

of an N-particle quantum system of fixed volume V in an equilibrium state of energy


E. T is the equilibrium temperature (on the Kelvin scale) of a heat reservoir with
which the system can exchange energy but not particles, and kB is the universal
Boltzmanns constant, which sets the scale of thermal fluctuations. The (infinite) set
of hypothetical identical systems characterized by the probability function (4.2.1) and
partition function
ZT, V, N

eET , V , N

4:2:2

is called a canonical ensemble.


In Chapter 1 we found, by using the principle of maximum entropy (synonymous
with information) that an expression of the form (4.2.1) is the most objective
probability distribution consistent with prior information comprising only (a) the
completeness relation for probabilities, and (b) the mean energy of the system. In
other words, (4.2.1) is a general outcome of purely mathematical origin; physics
enters only in the interpretation of the Lagrange multiplier , made by comparing
functions of Z to their thermodynamic counterparts, leading to the association of
with absolute temperature. If the system were permitted to exchange particles as
well as energy with the reservoir, then N would also be a distributed quantity of
presumed known mean, and a generalization of (4.2.1) would lead to the probability
distribution for a grand canonical ensemble with an additional Lagrange multiplier
to be identified with the chemical potential (actually with ) of the system. For
the most part, we will not need the grand canonical ensemble in this chapter
although we will make use of it later to discuss a paradoxical issue concerning
photon fluctuations.

200

Mother of all randomness II

If the state of the system can be described by a distribution of N identical particles


fnkg over a set of single-particle energy states fkg where k 0, 1, 2. . . enumerates the
ground state, first excited state, second excited state, etc., then the occupation
numbers must satisfy
X

nk N

4:2:3

nk k E

4:2:4

k0, 1, 2...

X
k0, 1, 2...

and the probability function (4.2.1) takes the form




pfnk g

gfnk ge

nk k

k0, 1, 2...

4:2:5

where gfnkg is the degeneracy or statistical weight of the set of occupation numbers.
The numerical values that an occupation number nk can take and the mathematical
form of the degeneracy factor gfnkg depend on the assigned statistics. Here the term
statistics as used by a physicist is different from that of a statistician.
In the lexicon of a statistician, a statistic is an observable random variable
(or function of such variables) that does not contain any unknown parameters;
as a discipline, statistics, according to the American Statistical Association,
is the science of collection, analysis, and presentation of data. To a physicist,
however, the term statistics may also refer to the solution of a particular
kind of occupancy problem: in how many ways can one distribute indistinguishable balls over distinguishable cells? The answer depends on the imposed
constraints.
 If the distribution is inclusive, i.e. no constraints are imposed apart from (4.2.3)
and (4.2.4), then the solution leads to BoseEinstein statistics.
 If the distribution is exclusive, i.e. so constrained that no more than one ball can be
placed in any given cell, then the solution leads to FermiDirac statistics.
These two solutions which, so far as experiment has revealed and relativistic
quantum theory has deduced, are the only two solutions that nature permits6
may be summarized by the relations in Table 4.1. Disregard for the moment the
third column.

Proposals have been made from time to time of the existence of entities that follow other kinds of statistics. I have
myself investigated hypothetical quantum systems comprising particles in bound association with magnetic flux tubes.
These composite quasi-particles manifest statistical behavior that interpolates between that of fermions and bosons,
depending on the value of the magnetic flux. I know of no fundamental particles, however, that behave this way.

201

4.2 Quantum vs classical statistics

Table 4.1

Comparison of particle statistics

BoseEinstein statistics

FermiDirac statistics

MaxwellBoltzmann statistics

nk 0, 1, 2 . . . N

nk 0, 1

nk  0, 1

gBE fnkg 1

gFD fnk g

1
0

nk 0 or 1
otherwise

gMB fnk g Y1
nk !
k

The fundamental lesson of the first and second columns of Table 4.1 is that the
statistical weight of each allowed set of occupancy numbers is 1; if the particles
are identical, then rearranging them among the same quantum states does not
provide any new information since one cannot tell which particle is in which state
by any distinguishing feature.
The partition function in (4.2.5)
X

nk k
X
Z
gfnk ge k0, 1, 2...
4:2:6
fnk g

is the sum over all allowed partitions of N subject to the constraints (4.2.3) and
(4.2.4). For purposes of illustration, consider a system of three spinless particles to be
distributed over three non-degenerate states. The resulting configurations, can be
represented by a triad of numbers (n1 n2 n3) giving the occupancy of states of energy
(1 2 3). Consider each of the three types of occupancy shown in Table 4.1.
4.2.1 BoseEinstein (BE) statistics
The case of three bosons is tallied as shown. Configurations 13 include a 3-particle
state. Configurations 49 include a 2-particle state. Configuration 10 includes only
BoseEinstein

1
2
3
4
5
6
7
8
9
10

3
0
0
0
0
1
2
1
2
1

0
3
0
1
2
0
0
2
1
1

0
0
3
2
1
2
1
0
0
1

202

Mother of all randomness II

1-particle states. The resulting partition function (with temporarily set equal to
1 for simplicity) takes the form
ZBE e31 e32 e33
e1 22 e1 23 e2 21 e2 23 e3 21 e3 22
e1 2 3 :
4:2:7
Each term in the sum (4.2.7) has an exponent equal to the total system energy for the
associated particle configuration. Note, however, that the partition function can also
be factored into the product of three sums in which each sum is over the occupancies
of a single energy state




ZBE 1e1 e21 e31 1e2 e22 e32 1e3 e23 e33 N 3:
4:2:8
The delta-function in (4.2.8) is there to remind us to exclude from the expanded
product any terms with total number of particles N 6 3. Thus, the term
1 e01 e02 e03, which represents no particles in any of the three states would be
excluded; likewise, we would exclude the term e21 e22 e23 , which represents a
6-particle configuration in a system defined to have only three particles.
Generalizing to arbitrary numbers of particles and states, one can write the partition
function (4.2.6) for identical bosons as a product of single-mode partition functions
!
!
Y
X
X
nk k
ZBE
e
nk
N
4:2:9
k1,2... nk 0,1,2...
k
where each single-particle energy state defines a mode. The product index k enumerates modes; the sum index nk specifies the number of particles in mode k.
The total number of ways of distributing N particles over r states with no restriction on number of particles per state is given by the multiplicity


Nr1
:
4:2:10
BEN, r
N
To see this, draw r 1 vertical lines to represent the boundaries of r cells and put in
N dots to represent the N particles. Keeping the first and last lines fixed, one obtains
all possible distributions of dots in cells by making all possible rearrangements of the
N dots and r  1 interior lines. The number of ways to partition N r  1 items into
two groups of size N and r  1 is given by the combinatorial expression (4.2.10). In
the given example
of three bosons distributed over three states, one finds
 
5
BE3, 3
10 as expected.
3
4.2.2 FermiDirac (FD) Statistics
Consider next the distribution of N identical particles over r states with the restriction
that no more than one particle can occupy a state. For the illustrative case of three

203

4.2 Quantum vs classical statistics

particles in three states, there is only one way to distribute the particles.
The corresponding partition function (again with temporarily set to 1) has only
one term
FermiDirac

ZFD e1 2 3 :

4:2:11

In general, the total number of ways of distributing N particles over r states with
exclusion is given by
 
r
,
4:2:12
FDN, r
N
which obviously requires r  N. Expression (4.2.12) is almost self-evident. There are
r ways to choose a state for the first particle, r  1 choices for the second particle,
and on down the line until there remain r  N 1 choices for the Nth particle.
However, since the particles are identical, the N! ways of assigning the N particles to
a given set of states provide no new information. Thus the total number of
arrangements is
rr  1r  2    r  N 1
r!

N!
N!r  N !

FDN, r

4:2:13

as expressed by the binomial combinatorial coefficient in (4.2.12).


As in the case of bosons, one can reorder the terms in the partition function to
arrive at an expression analogous to (4.2.8) and (4.2.9)
Z FD 1 e

1

1 e
P

2

1 e

3

...

1 e

k0, 1, 2...

k

N 

!
nk :

4:2:14

nk N

The constraint on counting (in both the BE and FD cases) posed by a fixed
number of particles is lifted if one works with the grand canonical, rather than
canonical, ensemble although at the expense of introducing an additional system
parameter, the chemical potential, which depends in general on extensive and
intensive variables of the system and cannot be expressed in a closed form for
numerical evaluation.

Mother of all randomness II

204

4.2.3 MaxwellBoltzmann (MB) Statistics


Attempts at a statistical treatment of thermal phenomena were begun long before
physicists became aware that nature restricted all particles to be either fermions or
bosons. Particles, like the atoms or molecules of a gas, were considered distinguishable entities with no restrictions apart from conservation of total energy and
particle identity on how much energy a particle could have. MB statistics are the
statistics (in the language of physicists) of distinguishable particles without exclusion.
From the present perspective of assigning particles to states or balls to cells the
distribution of three particles over three states would generate a table of configurations similar to that of bosons, but with higher degeneracies.
Consider, for example, the BE configuration (3, 0, 0); the degeneracy factor is gBE
(3, 0, 0) 1. The corresponding MB configuration can be represented by (abc, 0, 0),
where the three distinguishable particles a, b, c are all in the state 1. Thus, the MB
degeneracy factor is also given by gMB(3, 0, 0) 1. Now consider the BE configuration (1, 1, 1), for which gBE(1, 1, 1) 1. Corresponding to this configuration,
however, are the following six MB configurations: (a, b, c), (a, c, b), (b, a, c), (b, c,
a), (c, a, b), (c, b, a). They are all equivalent because the particles, although considered distinguishable, are nevertheless all of the same kind, like a set of three red
marbles of the same size, composition, and density; the marbles look the same, but
nevertheless you can tell which one is to the left or right or in the middle. Thus the
degeneracy factor is gMB(1, 1, 1) 6. In general, the MB degeneracy factor for the
configuration fnkg is the multinomial combinatorial coefficient
0
1
N
N!
Y
A
4:2:15
@
gMBfnk g
nk
n1 !n2 ! . . .
where, by definition, it is understood that

nk N. For the example of three

particles, one finds gMB1, 1, 1


6, as expected.
In consequence of (4.2.6) with degeneracy factor (4.2.15), it follows that the MB
partition function for a canonical ensemble takes the form of a multinomial expansion with power N
X
!N

nk k
X
X
N!


0
4:2:16
e k0, 1, 2...

e k
N,
Z MB
!n
!
.
.
.
n
k1
fn g 1 2
3!
1!1!1!

where


X
k1

e k

4:2:17

is the single-particle partition function. The symbol Z 0MB for the partition function is
adorned with a prime because, although derived rigorously, the result (4.2.16) turns

4.2 Quantum vs classical statistics

205

out to be physically inadmissible, a fact well known before the advent of quantum
mechanics. The problem, referred to historically as Gibbs paradox, is that application of Z 0MB to classical statistical thermodynamics led to an entropy function that
did not scale properly with size.
Volume, particle number, entropy, and all the various thermodynamic energies
(internal energy U, Helmholtz free energy F, etc.) are extensive quantities, i.e.
additive over non-interacting subsystems. Mathematically, an extensive function,
e.g. entropy, must satisfy a relation of the form
SnU, nV, nN nSU, V, N

4:2:18

when the energy U, volume V, and particle number N are scaled by a factor n.
A function that obeys relation (4.2.18) is said to be a homogeneous function of first
degree. If the scale factor on the right side was nd, then the function would be
homogeneous of degree d. Thus, an intensive function, one like temperature T or
pressure P, that does not scale with size,
PnU, nV, nN n0 PU, V, N PU, V, N

4:2:19

is a homogeneous function of degree 0.


Z0MB does not lead to appropriately extensive thermodynamic functions of state for
a classical system in which particle energy can take a continuum of values. A quick
way to see this is to note that in the transformation from a sum over discrete singleparticle quantum states to an integration over single-particle classical phase space
(the six-dimensional space of coordinates and linear momenta)

X
1
4 V
3
3
!
d xd p 3
p2 dp
4:2:20
h3
h
k
the single-particle partition function becomes proportional to volume V. Thermodynamic functions of state are calculated from the logarithm (and its derivatives) of
the partition function. However, the relation
ln Z0MB N ln Nln V constants
is immediately seen to fail the test of (4.2.18). For example, if the volume V is
doubled, the left side, to which the Helmholtz free energy is proportional, is not
doubled.
The Gibbs correction rectifying this problem was to insert by hand a factor 1/N!
into the degeneracy function (4.2.15) to give the empirical (rather than theoretically
deducible) degeneracy function shown in Table 4.1. Applied to (4.2.6), the Gibbs
correction leads to the MB partition function
"
#
!
n 
n
X
X
e 1 1 e 2 2
N
 N 
,
4:2:21
nk
Z MB
n1 !
n2 !
N!
n1 , n2 ...
k

206

Mother of all randomness II

where now the transformation from a sum over discrete states to an integral over
phase space leads to
  

  

V
ln Z MB N ln  ln N!  N ln
1 N ln
constants
4:2:22
N
N
upon substitution of Stirlings approximation for N!. If the extensive variables (V, N)
of the system are doubled, then ln ZMB is also doubled, as well as all the thermodynamic functions that derive from ln ZMB.
The purpose of the Gibbs correction is to account for the indistinguishability of
the particles by dividing out the number of permutations of particle distribution over
states that do not lead to new information. As we have just seen, however, this
correction works only for configurations with one particle per state [such as (1, 1, 1)],
but would incorrectly adjust for configurations with more than one particle per state
[such as (3, 0, 0)]. Thus, MB statistics is an approximation to exact quantum statistics
valid for conditions leading to a low mean state occupancy.
It is worth noting at this point that the transformation (4.2.20) defines the statistical density of states ( p)
pdp

4gV 2
p dp,
h3

4:2:23

which is valid if the volume V is sufficiently large that particle energy levels are spaced
closely enough to be treated as a continuous variable. The relation (4.2.23) includes
a degeneracy factor g to account for additional degrees of freedom due to spin (g
2s 1) or helicity (g 2). In the case of ultra-relativistic particles such as the photon,
substitution of pc from (4.1.2) yields density as a function of energy
d

4gV
hc3

2 d,

4:2:24

which is a relation we shall use often in this chapter.

4.3 Occupancy and probability functions


!
X
The constraint posed by N 
nk on the canonical partition function makes
k

evaluation difficult. That is why investigations of the statistical properties of bosons


and fermions ordinarily proceed from the grand canonical partition function. However, if the number of particles in the ensemble is not conserved, then the sum over
the occupancy numbers of each mode can be carried out independently, and the
partition function factors into a product of single-mode partition functions. The
outcome is the same as using a grand canonical partition function with zero chemical
potential. Happily, this is the case with thermal photons. Black-body radiation has

4.3 Occupancy and probability functions

207

zero chemical potential; matter absorbs and emits it in arbitrary numbers of quanta
with a mean state occupancy determined only by the equilibrium temperature. The
questions of (a) whether the chemical potential of photons is always zero, and
(b) whether the chemical potential of massless fermions would likewise be zero, are
intricate ones, which I consider further in an appendix.
Evaluating the BE and MB partition functions [(4.2.9) and (4.2.21)] for photons
leads to the following expressions
!

Y
Y
X 

1
 k nk
e
ZBE
4:3:1

1  e k
k1, 2... nk 0, 1, 2...
k
  nk

Y
YX


e k

Z MB
exp e k
4:3:2
n
!
k
k n 0
k
k

in which the first entails summing a geometric series and the second an exponential
series. Although the chemical potential of fermions in thermal equilibrium is ordinarily not zero, there are circumstances, discussed in the appendix, where it does
vanish. For the sake of comparison, therefore, we evaluate (4.2.14) in the same way
to arrive at the expression
Y

1 e k :
4:3:3
Z FD
k

To facilitate discussion of split-beam experiments in the next section, focus attention on a single mode of energy and occupation number n.7 The three partition
functions and corresponding probability functions then reduce to the following
expressions

1


pn en 1  e
n 0, 1, 2 . . .
4:3:4
zBE 1  e


zMB exp e

zFD 1 e



en
exp e
n 0, 1, 2 . . .
n!
8
1
>
>
n0
<
n
e
1 e
pn

1 e >
1
>
:
n1
e 1

pn

4:3:5

4:3:6

where the large number of other modes may be considered part of the environment
with which the mode of interest is in equilibrium.
The mean occupation number n of the mode, as well as fluctuations about the
mean, can be calculated in at least three different ways.

To keep notation simple and consistent with standard usage in physics, I will not make a distinction, as in previous
chapters, between a random variable for occupation number and its realizations n.

208

Mother of all randomness II

 Differentiation of the partition function. From (4.2.6) one obtains the general
relations

ln z

hni kB T
4:3:7

T , V

2
2 ln z

2
2
n  hn i  hni kB T
:
4:3:8
2

T , V
 Direct evaluation of expectation values from the probability functions
X
gnn en
X
npn n
hni
z
n
X
gnn2 en
X
n
:
n2 pn
hn2 i
z
n

4:3:9

 Differentiation of the moment-generating function (mgf ) represented here by G(t)


(instead of g(t)) to avoid confusion with the degeneracy factor,

dGt

d ln Gt

4:3:10
hni

dt
t0
dt
t0

d 2 ln Gt

2
n
4:3:11

,
dt2
t0
where
X
Gt  hen t i

pn ent

h
in
gn et

4:3:12

and g(n) is the single-mode degeneracy function


gBEn gFDn 1
1
gMBn :
n!

4:3:13

Applying any of the three methods to the three types of physical statistics leads to the
expressions in Table 4.2 for the single-mode mean occupation numbers, variances,
and mgfs.
Figure 4.1 shows plots of n as a function of mode energy for the three physical
statistics. The distinction between classical and quantum phenomena lies in the size
of the ratio /kBT. For thermal energy low in comparison to the quantum
of energy,
1, the means and variances of BE and FD statistics approach
the MB values

209

4.3 Occupancy and probability functions

Table 4.2

Single-mode occupation number statistics

Stats

Partition function
z(, )

Moment-generating
function G(t)

Mean
n

Variance
2n

BE

(1  e)1

e  1
e  et

1
e  1

e

MB

exp(e)

exp[e (et  1)]

e

FD

(1 e)

e et
e 1

1
e 1

e
2

 1
e

e 1

Mean Occupation Number

BE

MB
FD

0.5

1.5

Energy/kT
Fig. 4.1 Mean occupation number n as a function of energy for a canonical ensemble of
particles obeying BoseEinstein (BE), MaxwellBoltzmann (MB), and FermiDirac (FD)
statistics. At high compared with thermal energy kBT, nBE and nFD approach nMB
characteristic of classical particles. At zero , nBE becomes singular, indicative of quantum
condensation and nFD approaches 1/2.

hni
2n

! e
>>1

4:3:14

corresponding to the classical limit. The asymptotic equality of mean and variance in
(4.3.14) suggests a connection between the statistics of classical particles and the
Poisson distribution, a point that will emerge directly in the next step of the discussion. The form of the MB mgf in Table 4.2 also reveals this connection; recall that the

Mother of all randomness II

210

mgf of a Poisson variate X is gX (t) exp((et  1)) in which is the mean X (not to
be confused with chemical potential).
From the relations of Table 4.2, we can express the Boltzmann factor e, and
therefore the three probability functions (4.3.4)(4.3.6), in terms of the mean occupation number n, which is useful because the latter is a measurable experimental
quantity:
BE

e

hni
hni 1

MB

e hni

FD

e

hni
1  hni

pBEn

hnin

4:3:15

hni 1n1

hnin ehni
hni!

hni
pFDn
1  hni

4:3:16

pMBn
)

n1
n 0:

4:3:17

The form of (4.3.16) confirms that MaxwellBoltzmann particles are described


statistically by a Poisson distribution. As applied to particles with mass as in the
case of alpha or beta emissions from a radioactive nucleus a Poisson distribution
expresses the statistical independence of the emitted particles, the creation of one
particle being unaffected by previous production of other particles. The independence also pertains to state occupancy: the presence or absence of a MB particle in a
particular state does not influence the probability of another particle occupying that
state. The Poisson distribution therefore calls to mind grainy systems with discrete
constituents. It was this surprising aspect of the behavior of light as a system of
photons that Einstein discovered in examining the statistical implications of Plancks
radiation law in the first decade of the twentieth century.
Interestingly, the BE probability function (4.3.15) can be interpreted as a
geometric distribution for waiting times, which we examined in Chapter 1. If at each
trial an event occurs with probability p of success and probability q 1  p of failure,
then the probability of a first success at the (n 1)th trial is Pn1 qn p. Suppose q
takes the form q e and p 1  e; then

n

  n

hni
1

1e
4:3:18
pBEn,

Pn1 e
hni 1
hni 1
which is precisely the BE probability function (4.3.15) for emission of n photons. The
preceding derivation is merely an interpretation of the form of the BE distribution,
not an explanation of its origin, which at the most fundamental level derives from the
connection between spin and statistics.
Figure 4.2 shows how the MB and BE distribution functions of a monomode light
source with fixed mean n vary as a function of the actual number of emitted
photons n. From an experimental standpoint, n corresponds to the mean number
of particles received in a counting interval (bin), previously represented by the
symbol when there was no confusion with chemical potential. For a weak (i.e.
classical) light source with n < 1, the MB distribution looks very much like the BE

211

4.3 Occupancy and probability functions


1

Maxwell-Boltzmann
(Poisson)

0.8

Bose-Einstein

Occupation Probability

Occupation Probability

(a)

0.6

(b)

0.4

(c)
0.2

(d)

0.8

(a)

0.6
0.4

(b)

(c)

0.2

(d)
0

10

10

Photon Number n
Fig. 4.2 Emission probability as a function of number n of emitted photons for MB and BE
light sources with mean count n  equal to (a) 0.1, (b) 1, (c) 2, (d) 4.

Emission Probability

0.5

<n> = 0.5

0.4
0.3

MB
0.2

<n> = 4.0

BE

0.1

10

Photon Number n
Fig. 4.3 Comparison of BE (gray) and MB (black) emission probability as a function of
photon number n for a weak source (solid) of mean count 0.5 and a strong source (dashed)
of mean count 4.0.

distribution, a comparison that shows up better in Figure 4.3 where the two distributions are presented together for n 0.5 and 4.0. For mean counts n > 1, the
BE distribution decreases monotonically with n in a fat tail asymptotically
approaching n1, whereas the MB distribution takes a bell-shaped form centered
at n n. One sees, then, that in the classical domain of low n, it is the
monotonically decreasing portion of the MB probability curve that correctly
approximates natures BE distribution.

Mother of all randomness II

212

0.25

Maxwell-Boltzmann
(a) (Poisson)

0.3

(b)
(c)

0.2

(d)
(e)

0.1

0
0

10

Bose-Einstein
(a)

Emission Probability

Emission Probability

0.4

0.20

(c)
0.10

(d)
(e)

0.05
0

15

(b)

0.15

10

20

30

40

Mean Photon Number <n>


Fig. 4.4 Probability of emission of n (a) 1, (b) 2, (c) 3, (d) 4, (e) 5 photons from MB and BE
light sources as a function of mean count n  .

Another perspective is given in Figure 4.4, which shows plots of the MB and BE
distributions as a function of n for fixed numbers n of emitted photons. In an
experiment, one is often interested in certain kinds of events e.g. 1-photon or
2-photon emissions and needs to know how to adjust the source intensity or some
other experimental parameter to minimize contamination by unwanted events. Here
we see from the approximate matching of the pre-peak portion of plots (a) that
the MB approximation to the exact BE distribution is really valid only for singlephoton emission from a source of low n.

4.4 Photon fluctuations


From the BE probability function (4.3.15) expressed in terms of mean occupation
number, we can obtain a form of the mgf that is easier to work with


X
1
hniet n
n1
hni 1 n0 hni 1
n0 hni 1
1
1
1



:
t
hnie
hni 1
1  hniet  1
1
hni 1

GBEt hent i

X
hniet n

4:4:1

Successive differentiation of G(BE)(t) in the usual way leads to the moments of the
occupation number, in particular the first through third from which the variance
and skewness follow

4.4 Photon fluctuations

hn i 2hni hni
2

9
>
=

hn3 i 6hni3 6hni2 hni>


;

213

2N hni1 hni
)

:
2hni 1
Sk p
hni1 hni

4:4:2

The limiting cases of the variance are especially interesting, as they reveal in stark
contrast the wave and particle properties of light from a statistical standpoint.
For a classical light source with n << 1, the variance in occupation number is
proportional to n, and thus
n
1
 p
hni
hni

hni << 1:

4:4:3

This is the kind of fluctuation in a light signal that results in shot noise in a
photodetector. Increasing the number of photons (or strength of the optical field)
will smooth out these statistical fluctuations. An example would be the emission of
red light of wavelength 633nm and frequency c/ 4.74  1014 Hz from an
incandescent filament at temperature T 1000K. The ratio of a quantum of light
energy to the mean thermal energy per mode is8

h
1:96 eV

22:77,
kB T 0:086 eV

and the mean occupation number of this mode is


1
1:31  1010 :
e  1
The foregoing thermal light source would have to be at a temperature of T h/kB ~
22756 K for the mean thermal energy per mode to be equal to a quantum of red light
energy. At this temperature the mean occupation number would be ~ 0.58.
A temperature T h/kB ln 2 32854 K would be required for n 1. Generally
speaking, photons emitted from terrestrial thermal sources behave statistically for the
most part like particles. To produce a two-slit interference pattern, the hallmark of
wavelike behavior, with a classical light field comprising thermal photons requires
special effort, although commercially available instruments now exist where this
phenomenon can be routinely demonstrated in a classroom setting. The trick to
making it work is to reduce the intensity of the thermal source so that it is effectively
producing one photon at a time through the apparatus, in which case it is no longer a
classical field.
A non-thermal quasi-monochromatic light source like a laser can have a high
mean occupation number, in which case
8

An electron volt (eV) is the energy acquired by an electron (or positron) moving through a potential difference of 1 volt.
The conversion to MKS units is 1 eV 1.6  1019 joules.

214

Mother of all randomness II

n
 1 hni >> 1:
hni

4:4:4

This kind of fluctuation is referred to as wave noise. The fluctuation in signal is


comparable in magnitude to the signal itself,9 a statement that applies to any quantity
proportional to photon number such as energy density (energy/volume), power
(energy/time), and intensity (power/area).
There is a heuristic way to see how this comes about. Consider the emission of
light from a large number N of independent monochromatic sources e.g. atoms or
molecules. In a classical picture, the sources radiate electromagnetic waves to which
we can associate an amplitude (of the electric field) and phase. The net amplitude E of
the optical field is obtained by superposing the amplitudes of the component wavelets. The mean intensity of the wave
D E
4:4:5
I jEj2
is obtained by averaging the square of this superposition over phases where angular
brackets signify the operation
1
h f i 
2

f d:

4:4:6

Since the energy or intensity of the optical field is proportional to the square of the
amplitude in the classical picture and to the number of photons in the quantum
picture, we can write
D E
n / jEj2 :
4:4:7
In the explanation that follows, what matters most is the independence of the
phases of the wavelets and so for simplicity I will have all sources emit wavelets of
unit amplitude. If j is the phase of the jth wavelet, then the net complex amplitude
E of the optical field is given by
E

N
X

4:4:8

eij ,

j1

and the mean intensity is

*
+
2
N
N
N



D E
X
X
X

2
ik

cos

I  jEj

1
2


N:

j
k

j1

j
j>k
N

4:4:9

We encountered this characteristic previously in the investigation of nuclear decay with variates (like the power spectral
amplitude) that follow an exponential distribution.

215

4.4 Photon fluctuations

Consider next the average of the square of the instantaneous intensity


N

2 
X
2
2
2 2
1 2 cos j  k
I  hI i hjE j i
j>k


N
N
N





X
X
X
2
14

cos j  k 4
cos j  k cos l  m
j1
N

j>k

j>k>l>m
1
N N  1
2



N 2 N 2  N 2N 2  N ! 2N 2 ,
N>>1

4:4:10
which takes the final value shown above in the limit of a large number of sources.
From relation (4.4.9) and (4.4.10) the fluctuations in intensity of a monochromatic
classical light wave can be expressed as
D
E D E2
2
jE2 j2  jEj2
I2  I
2N 2  N 2
D E

1
4:4:11
I
N2
jEj2
which, in contrast to the result for thermal light, shows that the fluctuations are not
smoothed out as the number of emitters, and therefore the intensity of the light wave,
increases.
The preceding heuristic interpretation of (4.4.4) lies at the heart of a classical
explanation of an experimental procedure devised initially to measure stellar diameters and termed intensity interferometry by its developers, R. Hanbury Brown and
R.Q. Twiss.10 The procedure elicited considerable controversy at its inception
because many physicists could not accept that a nonvanishing time-averaged interference could occur between the intensities of independent light sources. Indeed, the
name of the measurement technique was an unfortunate choice because that was not,
in fact, what was being observed. Rather, the nonvanishing signal was related to the
second term of the middle line of relation (4.4.10), in which the product of components with identical phase differences were averaged. I have subsequently introduced
variations of this procedure into quantum physics to study the nature of entangled
quantum states and the statistics of fermions and bosons in novel ways quite distinct
from those of traditional matter-wave interferometry.11
To conclude this section, it is instructive to examine several aspects of the fluctuations of the full multi-mode field of a thermal light source, which raise some subtle
10

11

R. Hanbury Brown and R. Q. Twiss, A new type of interferometer for use in radio-astronomy, Philosophical Magazine
45 (1954) 663.
(a) M. P. Silverman, More Than One Mystery: Explorations in Quantum Interference (Springer, New York, 1995), and (b)
M. P. Silverman, Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement, and Interference
(Springer, Heidelberg, 2008).

Mother of all randomness II

216

issues, apparent paradoxes, and pitfalls to avoid. Although the various aspects are
related, for clarity of emphasis I will take them up as separate issues.
ISSUE 1

Where did the shot noise and wave noise go?

From relation (4.3.1), which expresses the partition function ZBE of a system of ultrarelativistic BE particles (chemical potential 0) as a product of the partition
functions fzkg of the individual modes, we can write
X 
 X
ln 1  e k
ln zk :
4:4:12
ln ZBE 
k

Applied to the canonical ensemble of photons as a whole, the reasoning by which the
variance (4.3.8) in photon number of a single mode was derived leads to the general
expression for variance in total internal energy

2 ln Z

hEi
2E hE2 i  hEi2
4:4:13

 ,
2
V
a relation obtained previously in Chapter 1 (Eq. (1.23.10)) in the broader context of
the principle of maximum entropy. Substitution of (4.4.12) into (4.4.13) with replacement of the sum over states by integration over the density of states (4.2.24) (and
change of integration variable x ) yields the expression
2E

d 8VkB T 5 x4 ex dx
32 5 VkB T 5

:
e  1
h c3
ex  12
15hc3
0

4:4:14

4 4 =15

Looking at the result (4.4.14) and comparing it with (4.4.2) or the comparable
quantity
2n nn 1

e
e  1

4:4:15

from Table 4.2, one might be inclined to ask: Where in (4.4.14) does one now find
the distinction between shot noise and wave noise?. The answer is nowhere.
The factor (e  1)1 that appears in the first integral on the right side of (4.4.14)
is the mean occupation number n of the mode . Integration over all mode
energies effectively combines the fluctuations from states of low n and states of
high n to produce a net variance in internal energy proportional to (kBT)5.
Nevertheless, comparing the size of the fluctuations (4.4.14) to the mean internal
energy E

ln Z

d 8Vk B T 4 x3 dx
8 5 VkB T 4

hEi 

V
e  1
ex  1
h c 3
15h c3
0

4 =15

4:4:16

217

4.4 Photon fluctuations

yields the ratio


E

hEi

 
15
hc
,
4
2 V k B T
1
2

3
2

3
2

1
2

4:4:17

which does show that the fluctuations in energy diminish as the geometric size or the
temperature of the system increases.
There is, however, another way to interpret the fluctuations (4.4.14) in internal
energy. Although the occupation numbers of each mode of the field of thermal
radiation is unconstrained, the system nonetheless has a mean number of photons
obtained by summing (i.e. integrating) over all occupation numbers




d
kB T 3 x2 dx
kB T 3
8V
hNi
16 3 V
,
e 1
hc
ex  1
hc

4:4:18

23

where the Riemann zeta function12 is defined by the infinite series


n

X
1
n:
k
k1

4:4:19

From relations (4.4.14), (4.4.16), and (4.4.18) we can then make the following
associations
hNi / kB T 3 V
hEi / kB T 4 V

4:4:20

h 2E i / kB T V
and therefore
h E i
1
/ p
hEi
hNi

4:4:21

independent of V. In other words, just as in the case of single-mode shot noise, an


increase in the mean number of photons (from all modes) smoothes out the energy
fluctuations in the total thermal radiation field.
Incidentally, relation (4.4.13), which applies universally and not just to relativistic
bosons, reveals an important general thermodynamic property of the internal energy
fluctuations in a macroscopic system. Substituting (kBT)1 for the Lagrange multiplier yields the equation
12

The Riemann zeta function (over the complex field) has long been an object of fascination to mathematicians and
intimately connected with one of the most fundamental unsolved problems of mathematics. See J. Derbyshire, Prime
Obsession (Penguin, New York, 2003), and the review M. P. Silverman, American Journal of Physics 73 (2005) 287288.

Mother of all randomness II

218

2E

kB T
k B T 2 CV
T
V
2 hEi

4:4:22

showing that the greater the constant-volume heat capacity CV (E/T)jV of


the system, the greater is the variance in energy. If the heat capacity diverges, the
energy fluctuations would in theory become infinite i.e. in the thermodynamic
limit of an infinite system. An example is the transition between normal liquid
He-4 and superfluid He-4 occurring at the lambda-point temperature. A plot of
specific heat capacity against temperature resembles the Greek letter lambda, the
vertical stem of which marks the transition temperature where the heat capacity
diverges.
ISSUE 2

Why does an equivalent calculation give a different energy


fluctuation?

By definition, the variance in internal energy is


E
D
2E  E  hEi2 hE2 i  hEi2 :

4:4:23

Thus, one should get the same result as (4.4.14) by calculating separately the terms
E2, E2 and taking the difference, rather than evaluating directly the second
derivative of the partition function. We have already calculated the mean energy
E, so there remains the task of calculating E2. There is, however, a pitfall to
avoid. If by analogy to

hEi
0

d
e  1

one makes the mistake of writing

2 d
NO!
hE i
e  1
2

or

hE i
2

2
d NO!
e  1

the result will turn out to be incorrect. Indeed, a cursory examination of the foregoing expressions for E2 and E2 shows immediately that the combination (4.4.23)
would lead to two terms with different powers of V and T. In no way could they be
combined to yield (4.4.14).
Return to the basic definition (4.2.4) of the internal energy E as a sum over discrete
modes, and consider the simplest case of just two modes

219

4.4 Photon fluctuations

hE2 i hn1 1 n2 2 2 i

1 X
n1 1 n2 2 2 en1 1 en2 2
z1 z2 n1 , n2

!
!
1X
1X
1X
1X
2 n1 1
2  n2 2
n1 1
 n2 2

n1 1 e

n2 2 e
2
n1 1 e
n2 2 e
z1 n1
z2 n2
z1 n1
z2 n1

hE21 ihE22 i2hE1 ihE2 i:

4:4:24

It then follows that the total variance in energy


E


D

2E E1 E2 2  hE1 i hE2 i2 hE2 1 i  hE1 i2 hE2 2 i  hE2 i2
2E1 2E2

4:4:25

is the sum of the variances of the energy of each mode because


cov E1 , E2 hE1 E2 i  hE1 ihE2 i 0:

4:4:26

The preceding results are readily generalized to an arbitrary number of modes


*
!2 +
X
X
X
2
hE i
Ek
hE2k i 2 hEk i hEl i

k>l

DX
E2 X
X
hEk i
hEk i2 2 hEk i hEl i
hEi2
k

2E

X
2Ek

4:4:27

k>l

where it is seen that cross terms in the mean of the square and the square of the
mean drop out, leading to a total variance that is the sum of all modal variances as
would be expected for a system of independent modes and unconstrained occupation
numbers.
From (4.4.15) for the variance in photon number of a single mode, it follows that
2 2 n1 n

2 e
e  1

4:4:28

is the variance in energy of that mode, and therefore the total variance in energy is
"
#

2
e d
4gVkB T 5
x4 ex dx
2
2
E d

2
hc3
ex  12
e  1
0
0
0
"
#
4:4:29
4gVkB T 5
x3 dx
16 5 gVkB T 5

4
ex  1
hc3
15hc3
4k B ThEi

in agreement with the result (4.4.14) obtained previously (with g 2). The transition
from the first to the second line above is made by integration by parts. Comparing the

Mother of all randomness II

220

form of the variance in energy as expressed in the third line with (4.4.22) identifies the
heat capacity of thermal radiation as
CV

4hEi
,
T

4:4:30

a relation we shall use shortly.


Thus, as long as one begins with the correct expression for E2, it makes no
difference to the final result whether one calculates the system energy variance 2E
directly from the second derivative of the partition function or as a difference of two
expectation values.
ISSUE 3

Is the fluctuation in total photon number 0, , or something inbetween?

We began the statistical analysis of thermal radiation by using a canonical ensemble


in which energy can fluctuate because the system is in equilibrium with a heat
reservoir, but the number of particles is fixed. That would suggest that 2N 0.
However, we recognized that photon number is not conserved, and thereore any
number can be absorbed or re-emitted by the walls of the confining volume. That
would suggest that 2N . Is the variance of N calculable, and in any event does it
matter?
In general the answer to the second question is Yes, it matters. Statistical
analysis employing the partition function ZT, V, of the grand canonical ensemble,
whereby the system is in equilibrium with both heat and particle reservoirs, leads to a
variance in energy


2 ZT , V ,

hEi 2 2
2
2
,
4:4:31
E

k B T CV
V,
hNi T , V N
2
which is a sum of two terms: (a) the variance calculated for a canonical ensemble of
fixed number of particles and (b) the variance due to fluctuation in number of
particles. If the fluctuation in particle number is very large, then by (4.4.31) it appears
that the fluctuation in system energy could be greatly enhanced. We will see shortly
whether or not this is the case for thermal photons.
First, quantifying the suggestions in the first paragraph shows that we really are
in a quandary. The chemical potential of the thermal photon gas is 0. The
fluctuation in particle number, determined directly from the partition function, takes
the form
2 ln Z

1 hNi

4:4:32
2N 2 2

,
T , V T , V
which evaluates immediately to 0 because N, given by (4.4.18), is a function only of
T and V and not . However, by use of various thermodynamic relations, one can also
express 2N as

4.4 Photon fluctuations

2N kT B hNiT
in terms of the experimentally measurable isothermal compressibility

1 V

T  
:
V P
T , N

221

4:4:33

4:4:34

The pressure of a relativistic boson gas, whether calculated by means of a canonical


or grand canonical ensemble, can be shown to be 1/3 of the energy density that is
P

1 hE i 4 5 gkB T4
:

3 V
45hc3

4:4:35

Since P is not a function of V, the derivative (P/V )jT,N vanishes; the compressibility
and therefore 2N are now infinite. A physical quantity cannot be both 0 and .
A third approach to calculating the variance in photon number, is to calculate all
the pertinent thermodynamic quantities for a grand canonical ensemble of non-zero
chemical potential and then take the limit 0. The starting point is the grand
canonical partition function for a Bose gas
 PV
X
ln ZBE  ln 1  e k 
kB T
k

4gVkB T 4
3hc3

x3
dx
e ex  1

4:4:36



kB T 3  
16 V
Li4 e ,
hc
where
Lisz 

n
X
z
n1

ns

z2 z3

2s 3s

4:4:37

defines the polylogarithm, which reduces to elementary functions only for certain
values of the order s. Before proceeding further, it is useful to understand the origin
of the steps leading to the final expression in (4.4.36).


 Each term ln 1  e k  in the sum in the first line resulted from the sum over
occupancy numbers of a particular mode k as in (4.3.1), only now the single-mode
Boltzmann term e k  includes the chemical potential, which permitted one to
sum all modes independently even if the total number of particles is conserved,
because the chemical potential enforces a constraint on the mean number of
particles in the system. The first line is an exact relation for all BE particles.
PV/kBT, is established by
 The second equality in the first line, relating ln ZBE to
P
where
comparing
the
statistical
entropy
S kB E, N pE, N ln pE, N ,

Mother of all randomness II

222

pE, N e EN =Z, with the thermodynamic entropy expressed in the First Law
U TS  PV N. The thermodynamic extensive variables are equated to
expectation values of the corresponding statistical variates, as e.g. internal energy
U  E and particle number N  N.
 The expression in the second line was obtained by replacing the sum over states by
an integral over density of states for ultra-relativistic bosons (kBT >> ) and
performing an integration by parts. The integrand was made dimensionless by
the change of variables x .
 The evaluation of the integral (for degeneracy factor g 2) as an infinite sum in the
third line was performed by a method explained in an appendix, which leads to the
general form

X
xk
an
dx

k!
k 1Lik1a,
4:4:38
a1 ex  1
nk1
n1
0

where the gamma function (k 1) k! for integer k. For a 1, the first few integrals
pertinent to this discussion become

x dx
2 =6 e 1:6449
ex  1
x2 dx
2 3 e 2:4041
ex  1

4:4:39

x3 dx
4 =15 e 6:4939
ex  1
x4 dx
24 5 e 24:8863:
ex  1

All the physical quantities needed are derivable from ln ZBE . Thus, before and after
setting g 2 and taking the limit 0, we have the following relations.
Pressure
P

kB T
8 g kB T 4  
Li4 e
ln ZBE
V
hc3

!
0
g2

Mean particle number



lnZBE

kB T 3  
hNi kB T
8 gV
Li3 e

T , V
hc

8 5 kB T 4
45hc

1 hEi
:
3 V

4:4:40



kB T 3
:
! 16 3V
0
hc
g2

4:4:41

223

4.4 Photon fluctuations

Variance in particle number





2

kB T 3  
8 3 V kB T 3
2 ln ZBE

2
N k B T
8 gV
Li2 e !
: 4:4:42
0
hc
3
hc
2
T , V
g2

It then follows from (4.4.41) and (4.4.42) that the relative fluctuation in particle
number for a system of relativistic bosons in the limit of zero chemical potential
 
1 Li2 e
2 =6 3 1:3684
,

!
0
hNi e hNi
hNi2 hNi Li3e
2N

4:4:43

varies with hNi as expected for a system of independent particles.


We now have three different results for 2N =hNi2 , each presumably obtained by a
legitimate use of the principles of statistical physics. They cannot all be correct. Is any
one of them the correct result? In each of these three calculations the chemical
potential played a key role. Let us re-examine the question from a perspective that
in the end does not directly involve the chemical potential. Starting again from
the basic relation (4.4.32), we will find an alternative expression for the derivative
(N/)T, V by making use of (a) the First Law and (b) the properties of homogeneous functions.
(a) Begin with the First Law expressed in terms of the Helmholtz potential
F(T, V, N). From the exact differential relation (since F is a state function)

dT
dV
dN SdT  PdV dN
4:4:44
dF
T
V , N
V
T , N
N
T , V
1
2

we can write the Maxwell relation (which signifies that second derivatives can be
taken in either order)


:
4:4:45
V
T , N
N
T , V
We will use this relation shortly.
(b) Now consider the defining property of an intensive function i.e. a homogeneous function of degree 0 expressed in (4.2.19). Suppose f (x, y) is an intensive
function of extensive variables x and y. Then



df x, y

df nx, ny

f dnx
f dny
0

dn
n1
dn
nx dn
ny dn n1
n1
4:4:46
f
f
x y :
x
y
With f equated first to the chemical potential and then to the pressure P, and the
replacements x V, y N, the relation (4.4.46) leads to the following two equalities

Mother of all randomness II

224


V
T , N
V N
T , V

V P


N
T , V
N V
T , N

4:4:47

which, when substituted into (4.4.45), yield an alternative expression for the variation
in particle number with chemical potential

N=V

:
4:4:48

T , V P=N jT , V
Replacing, in accordance with standard procedure, the thermodynamic N by the
statistical N and employing (4.4.48) in (4.4.32), we obtain an expression for
the variance in particle number
2N

hNi=V kB T

P=hN i
T , V

hNi


PV

hNi kB T

"

#1
ln Z

lnhNi
T , V

4:4:49

T, V

exclusively in terms of the partition function and mean particle number. The transition from the first equality to the second, where V and T are included in the function
to be differentiated, is legitimate because the partial derivative must hold temperature and volume constant. The third equality made use of relation (4.4.36) expressing
the grand canonical partition function in terms of system variables. As a quick check
of (4.4.49), we can apply the first relation to the equation of state of an ideal gas
9
PV NkB T >
=
hNik B T

2
4:4:50
P

kB T ) N kB T  hNi:
>
V V

N T , V
V
The outcome is as expected for a classical system of grainy constituents.
Now consider the implication of the last equality in (4.4.49). The variance in
particle number is the reciprocal of the slope of a plot of ln Z against lnN. If
circumstances are such that all points in the plot were obtained for the same
temperature and volume i.e. there is at least one additional extensive or intensive
variable X that distinguishes one point from another then the slope exists, as does
the variance in particle number. However, if no additional variable influences the
system e.g. the chemical potential is a constant irrespective of whether that constant
is 0 or not then each point in the plot must represent a different temperature and/or
volume. Since no slope can be associated with a single point, the function 2N is then
neither zero nor infinite nor anything; it is not defined.
If the mass of a photon were nonvanishing, however small, the chemical potential
would not necessarily be zero, and the limiting process represented by (4.4.42) could

4.4 Photon fluctuations

225

be implemented to yield, as deduced above, a variance in particle number proportional to the number of particles, just like an ideal gas. However, if the chemical
potential of the photon gas were identically zero, and the partition function depended
on no other parameters than temperature and volume, then the photon number
distribution would have a first moment, but not a second moment. (We have
encountered distributions before, like the Cauchy distribution, which has no first
or second moment.) This somewhat unusual circumstance perhaps makes one
wonder how, from a physical (rather than purely mathematical) standpoint, there
can be a difference in the properties of a physical system depending on whether a
measurable quantity like the chemical potential is arbitrarily small as opposed to
being identically zero.
The answer is that the transition from arbitrarily small to identically zero is
not always smooth. There is an abrupt difference, for example, in the allowed
polarizations of zero-mass and small-mass particles. A particle with nonvanishing
mass, however small, will have 2s 1 spin substates i.e. directions of polarization,
whereas a particle with zero mass can have no more than two independent states of
polarization (helicity components) irrespective of the spin s. This difference is a
fundamental outcome of any theory of particles invariant under a proper Lorentz
transformation and mirror-reflection.13 Stated differently but equivalently, for a
particle with mass, however small, the polarization depends on the reference frame
of the observer; for a massless particle, the spin direction is a relativistic invariant,
always either parallel or anti-parallel to the velocity.
As a final thought on this matter, I return to the question of whether the
magnitude of photon number fluctuations in a thermal photon gas has an observable
consequence on the energy fluctuations, as given by (4.4.31). The answer depends on
the coefficient
!

hEi

1
P

hEi PV  TV

4:4:51

hNi
T , V hNi
T V , N
the derivation of which I leave to an appendix. Upon substitution of the previously
obtained relations [(4.4.40), (4.4.30)]
1 hEi
P

3 V

1 hEi

3V T

V, N

V, N

1
1 4hEi
CV
,
3V
3V T

4:4:52

one finds that (4.4.51) is identically zero. For a thermal photon gas, therefore, the
fluctuation in energy is given by

13

E. P. Wigner, Relativistic invariance and quantum phenomena in Symmetries and Reflections: Scientific Essays
(Indiana University Press, 1967) 5181.

Mother of all randomness II

226

2E

hEi

kB T
kB T 2 CV
T
V

4:4:53

irrespective of how the number of photons may fluctuate.

4.5 The split-beam experiment: photon correlations


Much can be learned about the statistical properties of light (as well as other kinds of
particles) from a split-beam experiment of general form schematically shown in
Figure 4.5. A source randomly emits particles of specified (or hypothesized) emission
probability function associated with BE, MB, or FD statistics. The particles are then
either randomly reflected from or transmitted through a beam splitter according to a
decision probability function, here assumed to be binomial. The particles received at
detectors A and B, are counted, whereupon the detection events are analyzed by a
correlator. As this is an idealized experiment, it is assumed that no emitted particles
go undetected; what is not received at A arrives at B.
One of the questions interesting from a statistical perspective is: given the probability functions for emission and decision, with what probability do particles arrive
at a detector? From the perspective of physics, the foremost question of interest is
whether particles randomly emitted from a source can somehow be correlated in their
arrivals at separated detectors. The analysis of a split-beam experiment introduces
the important concept of a compound distribution and additional ways of employing
generating functions to determine key experimental quantities. We will consider first
Counter NA

Correlator
Detector A

Splitter

Source

Emission PF
(BE, MB, FD)

NAB

Counter
NB
Detector B

Decision PF
(Bin)

Fig. 4.5 Schematic diagram of split-beam counting and correlation experiments. The source
supplies particles randomly according to a specified (BE, MB, or FD) emission probability
function (PF). The particles are randomly directed to counter A or counter B by the splitter
according to a binomial PF. The numbers of particles (NA, NB) received by detectors
A and B are counted and correlated to obtain the correlation function C(NA, NB) (NANB 
NANB).

4.5 The split-beam experiment: photon correlations

227

a split-beam experiment with classical particles governed by MB statistics, such as


photons in the limit of low mean occupancy numbers, and then experiments with
bosons and fermions governed by the exact BE and FD distributions.

4.5.1 MaxwellBoltzmann (MB) particles


The counter random variable X, defined by

1 ! A PrA p
X
0 ! B PrB q,

4:5:1

registers a 1 if a photon is received at detector A, the probability of which is p, and


a 0 if a photon is received at detector B, the probability of which is q 1  p. The
n
X
Xi , i.e. the number of counts registered by A within a
probability of a signal Sn
i1

specified counting interval (bin) given that the source emitted n photons, is calculable
from the binomial probability function
 
n k nk
PrSn kjN n
pq ,
4:5:2
k
where the random variable N represents the number of particles emitted from the
source. Since N is determined independently by the Poisson probability function of
the source (where we again represent the mean particle count by since there will be
no confusion with chemical potential in this section),
PrN n e

n
,
n!

the probability of detecting k photons at A and n  k photons at B is


 
n n k nk
pn, k e
pq :
n! k

4:5:3

4:5:4

Table 4.3 records explicit values of the probability (4.5.4) for the emission of 0, 1, or
2 photons by the source. Note that different photon emissions can all lead to the
same number of arrivals at detector A. For example, a photon can arrive at
A because 1 or 2 or 3 or more photons were emitted from the source. Unless the
experimenter has taken specific measures to create a photon source that emits a predetermined number of photons,14 the photon number N is not an experimentally
controllable parameter. One needs to know, therefore, the marginal probability of
detecting a photon at A irrespective of the number of emitted photons
14

This, in fact, can be done. A single-atom fluorescence source emits one photon at a time. An atomic hydrogen source
radiating from the 2S1/2 metastable state emits two photons.

Mother of all randomness II

228

Table 4.3 Compound emission probability for MB source and


binomial splitter
Number emitted by source
N

Number detected at A
NA

Probability
Pr(NA)

e

0
1

e( q)
e( p)

0
1
2

*, k

e q2
e (2pq)
2
1 
2 e p
1
2

p n, k :

4:5:5

n0

Substitution of (4.5.4) into (4.5.5)


p

*, k

X
n0

p n, k

 

e p k X
qn
pk X
qn
e
k! q n0 n  k!
k! nk n!

4:5:6

pk
pk
e eq ep
k!
k!
shows that this marginal probability is governed by a Poisson distribution with mean p.
An alternative approach to ascertaining the probability function is to determine
the moment generating function (mgf ) of the compound distribution (4.5.2)
gNAt heNA t i

Pr N n gSnt,

4:5:7

n0

where
n

gSnt pet q

4:5:8

is the mgf of the binomial distribution Bin(n, p). Substitution of (4.5.8) into (4.5.7)



X
X
n
pet q n
n
e
pet q e
gNAt
n!
n!
4:5:9
n0
n0
 pet q
p et 1
e
e e
generates the mgf of a Poisson distribution with mean p, in agreement with the
probability function (4.5.6). From the mgf we can readily determine the mean, mean
square, and variance although we already know what they are in this case and from
the symmetry of probability function (4.5.4) can also immediately determine the
corresponding quantities for detector B:

229

4.5 The split-beam experiment: photon correlations

hN A i 2A p

hN B i 2B q:

4:5:10

An experimental question of interest is whether the photon detections at A and at


B are correlated. We answer this by calculating the covariance function or, equivalently, the correlation function
C N A , N B 

cov N A , N B
h N A  pN B  qi hN A N B i  hN A ihN B i

2
2
A B
2A 2B
2A 2B
4:5:11

where
hN A N B i

X
n
X

pn, k k n  k:

4:5:12

n0 k0

Substituting (4.5.4) into (4.5.12) and employing methods used in Chapter 1 to


evaluate a sum over states, we obtain
!
!
X
n

n
n  d  d 
X
X
n n
n X
k nk


hN A N B i e
k n  kp q e
p
q
pk qnk
n!
n!
dp
dq
k
n0 k0
n0
k0
k


 n




d
d X
p qn
d
d pq
e p
e p
q
q
e
n!
dp
dq n0
dp
dq
pq1
2 pq,

4:5:13

which is equal to the product NANB. (Note that one must not impose the
constraint p q 1 in (4.5.13) until after the derivatives with respect to p and q
have been taken.) Thus, C(NA, NB) 0 for particles subject to MB statistics.
This result is not surprising; one would not expect Poisson-distributed particles
arriving randomly at one location to be synchronized in any way with the random
arrival of such particles at a different location. Were that actually to occur, the fall of
raindrops (whose distribution is claimed to be Poissonian) at two locations under the
same cloud would be correlated.
4.5.2 BoseEinstein (BE) particles (photons)
We proceed as in the previous section except that now the probability for emission
of n particles within some specified time interval is given by the BE function
Pr N n

n
1n1

and the probability of k detections at detector A given n emissions is


 
n
n k nk
pn, k
pq :
n1
k
1

4:5:14

4:5:15

Mother of all randomness II

230

It follows then that the mgf for the distribution of particles at detector A is




X
n
1 X
pet q n
1
pet q 1
n
t
1


pe

g N A t
n1
1 n0
1
1
1
n0 1

1
,
1  p et  1

4:5:16

which by comparison with (4.4.1) is seen to be a BE distribution with mean p. This


tells us immediately the marginal probability of detecting a photon at A irrespective
of the number of photons emitted
p

*, k

X
n0

p n, k

pn
p 1n1

4:5:17

and the means and variances for particle arrivals at each detector
hN A i p

hN B i q

2A p 1 p 2B q 1 q:

4:5:18

As in the previous section, we could also derive the probability p*,k directly by
summing (4.5.15) over the index n. The algebraic steps are a little more involved
than in the previous case (where the sum led to an exponential function) but are
worth examining because the calculation reveals further connections between the BE
distribution and the distribution of waiting times. In implementing the sum (4.5.15)
over n, we soon arrive at the form

n

p=qk X
q
n!
p ,k
,
4:5:19
*
1k! nk 1 n  k!
where the index begins at n k if the factorial in the denominator is to make any
sense. Reset the index to start at n 0 again by defining r  n  k to obtain

r

X
pk
q
k r !
p ,k
:
4:5:20
k1
*
r!
1 k! r0 1
A sum of the form (4.5.20) can be closed as follows



X
X
k r !
k!
kr

ar
ar
k!
r
r!
1  ak1
r0
r0

4:5:21

to yield a final form for p*,k identical to (4.5.17) upon substitution a q/( 1).
The closure in (4.5.21) is associated with the negative binomial distribution
defined by

f kjr, p




r r
rk1 r k
p qk
pq
k
k

k 0, 1, 2 . . .

4:5:22

231

4.5 The split-beam experiment: photon correlations

and so called because of the identity




r
k


1

rk1
k


4:5:23

introduced in an appendix of Chapter 1 and encountered in the discussion of random


intervals in the previous chapter. Thus, in a mathematically rigorous analogy to the
expansion of a binomial to a positive power, the expansion of a binomial to a
negative power is given by

X
1
r

r  1  q
1  q
k0

r
k



X
rk1 k
q
q:
k
k

4:5:24

k0

That the distribution defined by (4.5.22) satisfies the completeness relation is seen
immediately by multiplying both sides of (4.5.24) by pr (1  q)r.
The negative binomial distribution (4.5.22) gives the probability for the rth
occurrence of an event (or success) at the (r k)th trial, where k can be 0, 1, 2,
etc. Thus, it is interpretable as the probability for the waiting time to the rth
success. We have seen that the BE single-mode occupation probability pBE of
(4.3.15) was interpretable as a probability for the first occurrence of success at
the kth trial. The negative binomial distribution solves the more general
problem.


rk1
will also be recognized from (4.2.10) as the BE
The coefficient
k
multiplicity factor BE, i.e. the number of ways that k indistinguishable objects can
be sorted into r cells. The two relations pBE and BE come together in addressing
the combinatorial problem: What is the probability that a set of r out of s specified cells
[e.g. quantum states or photon modes] are filled with exactly k out of m indistinguishable balls [e.g. photons]? The solution is given by
Number of ways to distribute k
indistinguishable balls over r cells

P k, rjm, s

rk1
r1

Number of ways to distribute the remaining


mk balls over the remaining sr cells

ms1
s1

m  k s  r  1
s  r  1


4:5:25

Total number of ways to distribute


m indistinguishable balls over s cells

with appended explanation for each factor. In the limit that the number s of cells and
number m of balls become infinitely large while the ratio m/s becomes the mean
occupation number n of a cell, the right side of (4.5.25) reduces to

Mother of all randomness II

232

P k, rjm, s !
m !
s !





hnik
hnik
rk1
rk1

k
r  1 1 hnikr
1 hnikr

4:5:26

m
s ! hni

which is the appropriate BE probability for this case. Demonstration of this reduction is given in an appendix. In the special case corresponding to photon emission
into a single mode (r 1), relation (4.5.26) yields
 
hnik
hnik
k

,
4:5:27
P k, 1jm, s !
k1
0 1 hni
1 hnik1
which reproduces the probability (4.3.15) obtained previously by means of the
canonical partition function.
There are at least two ways to calculate the correlation of BE particles in a splitbeam experiment. The most direct way is to evaluate the expectation (4.5.12) by
substituting the BE probability function, closing the sum and taking derivatives with
respect to p and q in the manner employed before [(4.5.13)], and then applying the
constraint p q 1.
!
X
n
X
n
n
k n  kpk qnk
hN A N B i
n1
k
n0 k0 1
"
!  #
n n

n
d d
1 X
q X
pk
p q
4:5:28
dp dq 1 n0 1 k0 k
q


d d
1
p q
dp dq 1  p q  1 pq1
22 pq
The resulting correlation in (4.5.28) is twice that for MB particles, leading to a
positive correlation coefficient
r
hN A N B i  hN A ihN B i
pq
C N A , N B 

4:5:29
A B
p 1q 1
indicative of a photon arrival pattern referred to as photon bunching. Figure 4.6
shows the variation in C(NA, NB) as a function of mean count for different values of
the decision probability p.
A less direct but more general and powerful procedure that provides complete
statistical information about individual and correlated responses of the two detectors
is obtainable from the two-variate moment generating function


X

X
n1 n2
n1 n 2
N A t1 N B t2
n1 t 1 n2 t 2
e
i
e e
pn1 qn2 ,
gA, B t1 , t2  he
n1 n2 1
n
2

n1 0 n2 0
4:5:30

233

4.5 The split-beam experiment: photon correlations


1

(a)

Correlation Coecient

0.8

(b)
(c)

0.6

(d)

0.4

0.2

BE Emission PF
Bin Decision PF
0

10

20

30

40

Mean Count
Fig. 4.6 Correlation function in split-beam counting experiment with BE emission probability
function (PF) and binomial decision PF with probability p (a) 1/2, (b) 1/10, (c) 1/20, (d) 1/50
of transmission to detector A.

where each sum now spans the range (0, ) of the number of photons that can arrive
at each detector. By separating the factors according to their summation index

 
 

X
1
pet1 n1 X
qet2 n2 n1 n2
gA, B t1 , t2
n2
1 n 0 1 n 0 1
1

4:5:31

and completing the sums sequentially (which again requires use of the negative
binomial summation identity (4.5.24)), one obtains after a little work the simple
expression
gA, B t1 , t2

1
:
1 1  pet1  qet2

4:5:32

We encountered a structure like that defined in (4.5.30) previously in the discussion of the multinomial distribution (relations (1.13.2 and 1.13.3)). To recapitulate, all desired moments, variances, and cross-correlations can be determined
from partial derivatives of gA,B(t1, t2) of appropriate order with respect to the
two arguments:

gA, B t1 , t2

hN A i
p
4:5:33

t1 0
t1
t2 0

Mother of all randomness II

234

gA, B t1 , t2

hN B i
q

t1 0
t2

4:5:34

t2 0

N A

2

2 gA, B t1 , t2

p 1 2p

t1 0
t21

4:5:35

2 gA, B t1 , t2

q 1 2q

t1 0
t22

4:5:36

t2 0

2

N B

t2 0

2 gA, B t1 , t2

22 pq
hN A N B i

t1 0
t1 t2

4:5:37

t2 0

2A

2 ln gA, B t1 , t2

p 1 p

t1 0
t21

4:5:38

2 ln gA, B t1 , t2

q 1 q:

t1 0
t22

4:5:39

t2 0

2B

t2 0

Moments (4.5.33)(4.5.39) yield the same correlation coefficient (4.5.29).


Continuation of the procedure to higher derivatives would generate higher-order
correlation functions like hNA 2 N B 2 i and their variances if these were needed. Moreover, the method can be generalized to any number of detectors. For example, were
one to design an experiment with particle counting at three detectors with known
source emission probability and splitter decision probability, the moments and
correlations of the counts could be deduced from the three-variate generating function gA, B, C t1 , t2 , t3  heN A t1 eN B t2 eNC t3 i.
The experimental outcome of photon bunching can be manifested in several ways:15
(a) a higher variance in count rate at a single detector than that expected for Poissondistributed particles;
(b) a positive correlation in the detection of particles at two detectors, such as
described above for the split-beam experiment; and
(c) a conditional detection probability at a single detector greater than that expected
for a Poisson distribution.
Outcome (c), which will not be analyzed here, is effectively the measurement of the
waiting time to a second detection event given that a first one has already occurred.
For BE particles, the first detection increases the probability of a second within the
so-called coherence time of the particle source, in contrast to the behavior of MB
15

I discuss photon bunching and fermion anti-bunching in quantitative detail in M. P. Silverman, Quantum Superposition:
Counterintuitive Consequences of Coherence, Entanglement, and Interference (Springer, 2008).

4.5 The split-beam experiment: photon correlations

235

particles where each detection event is independent of any other. It is this third
phenomenon that evokes the image of bunching, since a time record of detection
events would show random regions of enhanced density. Most striking is the fact that
the theoretical conditional probability of a second event with zero time delay after the
first event is twice that for a coincidental second arrival predicted by Poisson statistics. That, in fact, is the implication of relation (4.5.28) compared with (4.5.13).
From a quantum perspective, the bunching of thermal photons, as manifested by a
positive correlation coefficient, is a consequence of BoseEinstein statistics. From a
classical perspective, it is attributable to the wave noise arising from the random
fluctuations in net amplitude of the classical wave comprising independently emitted
wavelets of random phase from numerous atomic or molecular sources. The shot
noise that is always present as a consequence of the grainy nature of photons is
averaged out by the correlator.
There are, however, non-classical states of the optical field i.e. states of light not
described by solutions to Maxwells electromagnetic equations that display a
different type of statistical behavior referred to as anti-bunching. These states play
an important part in the motivation and execution of the experiments to be described
shortly that test sequences of photon measurements for non-randomness. As the
name suggests, the conditional probability of a second detection event, given that a
first has occurred, has lower probability than that predicted by a Poisson distribution. This is the kind of statistical behavior expected for fermions as a consequence of
the Pauli exclusion principle.

4.5.3 FermiDirac (FD) particles


Consider a split-beam experiment with a source emitting FD particles in a single
mode of mean particle number n per bin. Within that interval either 0 or 1
particle can be emitted; if 1 particle is emitted it goes either to detector A or to
detector B with probability p or q respectively. Using relation (4.3.17) we can
summarize the detection probability pn1 , n2 as follows
8
< p 0, 0 1 
pn1 , n2 ) p1, 0 p
4:5:40
:
p0, 1 q
and thereby deduce easily the two-variate generating function for FD particles
analogous to relation (4.5.32) for BE particles
gA, B t1 , t2 p1, 0 et1 p0, 1 et2 p0, 0 1  1  pet1  qet2 :

4:5:41

Relevant counting statistics of the FD split-beam experiment can be calculated


from (4.5.41) in the manner illustrated in the previous section for BE particles to
obtain

Mother of all randomness II

236

hN A i p
2

h N A i p

hN B i q
2

h N B i q

hN A N B i 0

2A p 1  p

2B q 1  q
r
hN A N B i  hN A ihN B i
pq

:
C N A , N B 
A B
1  p1  q

4:5:42
4:5:43
4:5:44
4:5:45

The negative sign of the correlation coefficient C(NA, NB) is indicative of antibunching.
The origin of the negative correlation is perhaps obvious, but nevertheless worth
commenting. Since no more than one particle can be in the emitted mode within any
counting interval, the receipt of a particle at one detector means that the other
detector cannot have received a particle hence the signals at the two detectors are
negatively correlated, as shown explicitly in (4.5.45) because the instantaneous
product NANB (and therefore the mean NANB) is 0.
As mentioned briefly, there are non-classical photon states that manifest antibunching even though the particles are intrinsically governed by BE statistics. One
example is a single-photon particle-number or Fock state, represented in Dirac
notation by j, k, where is the energy, k is a unit vector in the direction of
propagation, and is a two-valued label of the state of light polarization which could
take such forms as (V, H) for vertical and horizontal planes of polarization, (D, D)
for planes of diagonal polarization at angles 45 or 45 to the vertical, (R, L) for
right and left circular polarizations, or simply (1, 2) for two unspecified polarization
states. Single-photon states give rise to anti-bunching in a split-beam experiment for
the same reason that FD particles do namely, because there is at most only one
particle in the emitted mode, the two detectors cannot each receive a particle within
the same counting interval and therefore the product NANB is zero.
However, just as certain kinds of photon states can exhibit statistical behavior
ordinarily attributable to fermions, there are fermionic states that are predicted to
exhibit statistical behavior usually attributable to bosons a finding quite surprising
when first reported.16

4.6 Bits, secrecy, and photons


The exploration of natures random ways is motivated by several objectives. There
is, of course, the desire to discover and confirm the laws of physics i.e. a philosophical quest to satisfy a basic scientific curiosity. Experiments specifically crafted to test
whether fundamental physical processes, like the disintegration of nuclei or the
16

M. P. Silverman, Fermion ensembles that show statistical bunching, Physics Letters A 124 (1987) 2731. Further
information is given in M. P. Silverman, Quantum Superposition, op. cit.

237

4.6 Bits, secrecy, and photons

emission and scattering of light, occur non-randomly are more than just tests of
quantum mechanics. Like other physical theories before it, quantum mechanics may
someday be revised or replaced but it is very unlikely, in my opinion, that any such
modification will reflect a discovery that nature is, after all, less random than we
currently believe. Nevertheless, only experiment can reveal whether or not this is so.
There is another and more practical objective of increasing importance in an era of
increasing digitalization. Although individuals might strive for order and predictability in their personal lives, governments, businesses, and organizations seek randomness in a mathematical manner of speaking. In particular, they need a steady flow of
random numbers to ensure the security of communications and transactions.
Cryptography is the science of rendering a message unintelligible except to an
authorized recipient. To do this, the message of interest is encrypted by combination
with a random message, the key, to form a cryptogram. If the key is truly random, of
length (in bits) no less than that of the message to be protected, and used only once,
then it can be demonstrated mathematically as was done by Claude Shannon in
1949 that the cryptogram is impossible to decipher.17 As an example, consider the
transmission of the secret message MEET TONIGHT.
A
1

B
2

M
E
E
T
T
O
N
I
G
H
T

C
3

13
5
5
20
20
15
14
9
7
8
20

D
4

E
5

F
6

G
7

H
8

I
9

BINARY CIPHER
16 8
4
2
1
0
1
1
0
1
0
0
1
0
1
0
0
1
0
1
1
0
1
0
0
1
0
1
0
0
0
1
1
1
1
0
1
1
1
0
0
1
0
0
1
0
0
1
1
1
0
1
0
0
0
1
0
1
0
0

J
10

K
11

L
12

M
13

N
14

O
15

RANDOM BITS
16 8
4
2
1
1
1
0
0
1
0
1
0
1
0
1
1
1
1
0
0
0
1
0
1
0
1
1
0
0
1
1
1
0
0
0
1
0
1
0
1
0
1
0
0
1
0
0
0
0
0
1
0
0
1
1
0
0
0
1

P
16

Q
17

16
1
0
1
1
1
1
0
1
1
0
0

R
18

S
19

T
U V W X Y Z
20 21 22 23 24 25 26

ENCRYPTION
8
4
2
0
1
0
1
1
1
1
0
1
0
0
0
1
0
0
0
0
1
0
1
0
1
1
0
0
1
1
0
0
0
0
1
0

1
0
1
1
1
0
1
0
1
1
1
1

20 T
15 O
1 A
17 Q
24 X
19 S
4 D
3 C
23 W
1 A
5 E

Start by assigning to each letter with no space between the two words since the
message as so written is quite clear the decimal number of its placement in the
Roman alphabet. Thus, we assign 13 to M, 5 to E, and so on. Our cipher or
algorithm is then to write the decimal number in binary. Recall that if a number is
written in base b, then the first column, starting at the right, represents b0, the second
column b1, the third column b2, etc. Thus the decimal symbol for the letter M is coded
as the binary number 01101. The initial 0 is included so that each letter of a message
will comprise 5 bits (binary digits) since no fewer than 5 bits can express all
26 letters of the Roman alphabet. (Z, for example, is 26 16 8 2 11010.)
Moreover, since 5 bits is sufficient to code 25 32 letters, we will interpret the code as
17

C. Shannon, Communication theory of secrecy systems, Bell Telephone Technical Journal 28 (1949) 656715.

238

Mother of all randomness II

modulo-26. Thus, a binary representation of 29, for example, would correspond to


the decimal number 29  26 3, which stands for the letter C.
In binary, the 11-letter message MEET TONIGHT is 55 bits. To encrypt the
message for perfect secrecy, I generated (using the Maple discrete RNG U[0,1]) a
string of 55 random bits and sequentially added the string to the binary cipher bit
by bit with no carryover. The instruction no carryover means that in adding
two numbers, the numeral (if there is one) to be carried to the next column is
dropped. Thus in binary with carryover the sum 1 1 10, but without carryover, the addition reads 1 1 0. The content of the square template labeled
Encryption shows the results of adding 55 random bits to the message bits
without carryover. To decrypt the message, one simply again adds the key to
the encryption bit by bit without carryover to obtain identically the 55 bits of the
original binary cipher.
An eavesdropper who retrieves the entire 55-bit cryptogram, and who is not aware
of my algorithm for constructing a message of Roman letters from 5-bit binary
numbers, could, depending on the key he tries, decipher the message into almost
any combination of letters because all such outcomes are nearly equally probable.
The reason for this is that the conditional entropy of the original message given the
cipher is equal to the entropy of the original message. In other words, nothing is
learned from eavesdropping. Even if the eavesdropper knew the cipher, but not the
key, he would end up with pure gibberish. The encrypted message expressed in
Roman letters turned out to be TOAQXSDCWAE. The information-theoretic significance and application of entropy will be discussed more thoroughly in Chapter 6
concerning quantum physics and the stock market.
A potentially serious weakness to the above scheme is that the random bits are
not truly random. Having been generated by a computer algorithm, they are only
pseudo-random; the same seed number in the RNG will always generate the same
sequence of bits. Physicists expect that truly random bits should be obtainable from
quantum processes such as the emission of photons or the decay of radioactive
nuclei, but the validity of this hypothesis needs to be tested comprehensively.
The protocol for perfect secrecy is not a practical one. It requires a key with as
many random bits as there are bits in the original message; the key can be used only
once or the eavesdropper may eventually be able to discern patterns that would
facilitate deciphering future messages; and some secure means must be found for
transmitting the key from the sender to the authorized receiver without an eavesdropper learning it. Alternative keys, involving for example the prime factors of a large
(e.g. 128 bit) number have been used instead. The security of such protocols relies on
the computational difficulty of performing the factorization (or other mathematical
operations) within a sufficiently short time interval as to enable an eavesdropper to
take advantage of the content of the message. For example, in the context of an
internet transaction, this content may be a credit-card number but it would do an
eavesdropper no good to obtain that number a hundred years after the transaction.

4.6 Bits, secrecy, and photons

239

Among the newest applications of quantum physics are those obtained by putting
the adjective quantum in front of almost any noun relating to computers and
communication: quantum computing, quantum information, quantum cryptography, quantum key distribution, and the like. These are topics that lie outside the
intended scope of this book except for the following few pertinent remarks. No one,
to my knowledge, has yet made a quantum computer in the sense that physicists
ordinarily understand the term: a device that computes by means of a large number
of entangled quantum states (qubits) to perform multiple calculations in parallel.18
Such a computer, if it existed, could conceivably factor large numbers (and perform
other mathematical operations) fast enough to render present encryption methods
obsolete. The simplest counter-measures would be to adopt longer encryption keys or
a protocol like the one for perfect secrecy.
There would still remain the problem of key distribution. It is in this area that
quantum mechanics is providing practical solutions already available for exploitation. Quantum cryptography is less a matter of encryption than of supplying a
means, based on quantum principles, for securely transmitting a key over a public
channel. Although in theory any elementary particle may be used, in practice the
most suitable means of transmission are single-photon states. Photons are readily
created, have an infinite range, and can be relatively easily transformed into
desired states of polarization. The classical and quantum interpretations of what
occurs when polarized light encounters a polarizing device are profoundly
different.
Consider, for example, a monochromatic classical light wave, which in quantum
terms means an astronomically large number of photons in a narrow range of energymomentum states. If a D-polarized light wave is incident on a linear polarizer that
passes V-polarized light, the relative intensity of the transmitted light is given by the
square of the scalar product of the unit electric vectors of the incident and transmitted waves:
 1
I transmitted
4:6:1
,
j^e D  ^e V j2 cos 2
Iincident
4
2
a relation historically known as Malus law.19 In other words, half the light energy,
regarded as a continuous quantity, is transmitted and half is not. The half that is not
may be absorbed or reflected, depending on the type of polarizer.
From a quantum perspective, a single photon can be described as a linear superposition of quantum states in a (V, H) basis or a (D, D) basis. The two sets of bases
are related by

18

19

The term quantum computing is to be distinguished from biomolecular computing in which massively parallel
computation is provided by individual molecules like DNA to solve problems in graph theory such as the travelingsalesman problem.
A comprehensive discussion of light polarization and the interaction of polarized light with different optical
components is given in M. P. Silverman, Waves and Grains: Reflections on Light and Learning (Princeton, 1998).

Mother of all randomness II

240

9
1
>
jD i p jVi jHi >
>
=
2
>
1
>
jD i p jVi  jHi >
;
2

or the inverse

9
1
>
jVi p jD i jD i >
>
=
2
>
1
>
jHi p jD i  jD i >
;
2

4:6:2

where only the polarization labels are shown since the other characteristics (energy,
momentum) are assumed to be the same for all states. Neither basis is more fundamental than the other. In the symbolism of quantum mechanics, a photon in the state
D arriving at a V or H polarizer has a probability
1
2
1
P HjD jhHjD ij2
2
P VjD jhVjD ij2

4:6:3

of passing or not. For photons, the quantum scalar product expressed in Dirac notation
as AjB is equivalent to the scalar product (4.6.1) of unit electric-field vectors. The
seminal point is that with a single incident photon, a polarizer does not pass a fraction
of a photon; the outcomes are discrete, mutually exclusive, and unpredictable.
The security of quantum key distribution relies on (a) the unpredictability of
outcomes of measurements made on single-photon polarization states like (4.6.2)
and (b) the fact that attempts by an eavesdropper to intercept and copy transmitted
single-photon states would alter, in the aggregate, the correlation between states sent
and received. Thus, when sender and receiver compare a sample string of bits, they
would discover that an intrusion had occurred and could take measures to protect
further compromise of information. The reliance on single-photon states for unconditional security is crucial because these states can thwart a particularly effective
method of eavesdropping referred to as a photon-number splitting attack whereby
the attacker intercepts a pulse, splits off a photon, and makes measurements useful to
deciphering the key. There are no surplus photons in a single-photon Fock state.
To test whether measurements on single-photon states lead to outcomes consistent
with what one expects for a random process, one needs to generate a long sequence of
measurement outcomes and test them for revealing patterns. If the patterns are not
there, then the sequences are not random. One highly effective way to do this is by a
process originally known as parametric fluorescence, but referred to now by the
longer, cryptic name of spontaneous parametric down-conversion.
4.7 Correlation experiment with down-converted photons
In contrast to thermal light, the creation of single-photon states is not a simple matter
of switching on an incandescent light bulb. In the words of one review20
20

S. Scheel, Single-photon sources an introduction, Journal of Modern Optics 56 (2009) 141160 (quotation from
p. 141).

4.7 Correlation experiment with down-converted photons

241

Generating one and only one photon at a well-defined instance . . . proves to be a formidable task. It
amounts to producing a highly nonclassical state of light with strongly nonclassical properties.
Single photons on demand must therefore originate from a source that operates deep in the quantum
regime and that is capable of exerting a high degree of quantum control to achieve sufficient purity
and quantum efficiency of photon production.

Spontaneous parametric down-conversion (PDC) is a quantum optical process in


which a single photon from a pump beam incident on certain non-centrosymmetric
crystals gives rise simultaneously21 to two photons of lower energy that emerge from
the opposite face of the crystal. For this to occur, the sum of the energies and linear
momenta of the emerging photons, conventionally referred to as the signal (S) and
idler (I) photons, must equal the energy and linear momentum of the pump photon.
Conventionally, the signal photon is the one with higher frequency. Rigorously
speaking, the vector sum of the linear momenta of the two photons is not exactly
equal to that of the pump photon because of the finite mass of the crystal, which also
takes up some of the longitudinal momentum. The term down-conversion refers to
the generation of photons with lower frequency than the pump photon. In a sense,
the process is the reverse of the earlier known process of frequency up-conversion
whereby signal and pump photons are combined within an appropriate crystal with
nonlinear susceptibility to generate an output photon at the sum frequency.
Although the pump beam is sufficiently intense to be regarded as a classical
electromagnetic field, the origin of the down-converted photons is nevertheless a
quantum event best thought of as a kind of spontaneous emission (like ordinary
fluorescence) stimulated by fluctuations in the zero-point energy (or vacuum) state of
the S and I modes. There is a heuristic way to visualize the process: the oscillating
electric field of the pump beam modulates the susceptibility (or refractive index) of
the medium at the incident frequency. The incident light wave propagating through
the crystal interacts with the driven electric dipoles of the medium, whose macroscopic moment is itself a function of the incident light field, to create photon pairs
correlated in polarization, frequency, and emission direction. The process is called
spontaneous, rather than stimulated, parametric down-conversion because there
are no S and I photons present initially to stimulate emission into those same modes.
Rather, the pump amplitude serves as a parameter in the time evolution of the
spontaneously (and presumably randomly) appearing down-converted states.
In Type I PDC the signal and idler have the same polarization, which is orthogonal to the polarization of the pump, and are emitted from opposite sides of a conical
surface whose symmetry axis is the direction of the pump beam. The angle between
the propagation directions of the S and I photons i.e. the apex angle of the cone is
determined by the frequency (or, in terms of the directly perceived observable, color)
of the photons. In Type II PDC the signal and idler photons have opposite
21

D. C. Burnham and D. L. Weinberg, Observation of simultaneity in parametric production of optical photon pairs,
Physical Review Letters 25 (1970) 8487.

242

Mother of all randomness II

polarizations and emerge along the surface of two different cones (one for the
ordinary or o-ray; the other for the extraordinary or e-ray) whose intersection
is centered on the pump beam.
Mathematically, the paired (or conjugate) single-photon states of the first type can
be represented by direct products of single-photon states

jViS jViI for pump jHiP
I 1, 2
4:7:1
jHiS jHiI for pump jViP
whereas the conjugate single-photon states of the second type take the form of a
superposition of such products, for example

1 
II 1, 2 p jViI jHiS jHiI jViS :
2

4:7:2

Although the energy (frequency) and linear momentum (wave vector) need not be the
same for the S and I photons, only the polarizations, which are the informationcarrying degrees of freedom of interest here, are displayed in the states of (4.7.1) and
(4.7.2). Nevertheless, the constraints posed by energy and momentum conservation
lead to correlations in the frequency and propagation direction of the S and
I photons. In the special case of Type II PDC in which the S and I photons have
the same frequency and emerge along the two directions corresponding to the
intersection of the two cones, it is not possible to tell which photon is the signal
and which is the idler.
A quantum state of the form (4.7.2) one of four so-called Bell states22 named for
the theorist J. S. Bell is said to show quantum entanglement. The significance of an
entangled state is that it preserves quantum correlations among component states
even when the particles described by those states are far apart. Thus, if the signal
polarization of a PDC state represented by (4.7.2) was measured to be V, then one
would know with 100% certainty (barring non-ideal conditions like imperfect
detector efficiency, detector dark current, and other intrusions of the real world)
that the idler polarization was H, irrespective of the separation between the detectors.
Likewise, if the signal polarization were measured to be H, then without doubt the
idler polarization would have to be V. Since the correlations intrinsic to entangled
states cannot be reproduced by physical models based on classical theories of local
hidden variables (a point proved by Bell), entangled states of various kinds have
been created experimentally for purposes of testing predictions of quantum theory.
To date, I know of no replicable test in which the correlations predicted by quantum
theory were not confirmed.
It is worth mentioning, because the subject is fundamental and still a source of
contention among some philosophically minded physicists, that the properties of

22

The four Bell states are: j i p12 j0i1 j0i2 j1i1 j1i2 , j i p12 j0i1 j1i2 j1i1 j0i2 .

243

4.7 Correlation experiment with down-converted photons

Lens
PDC

Signal

LASER
Idler
/2

Detectors/
Counters
LF A
0

PBS H

LF

LF

Fig. 4.7 Arrangement for measuring single-photon polarizations. A cw laser pump photon is
converted to simultaneous signal and idler photons within a BBO parametric down-conversion
crystal (PDC). The idler photon acquires diagonal polarization (D) after passing through a
half-wave plate (/2), and is either transmitted with horizontal polarization (H) or reflected
with vertical polarization (V) by the polarization beam splitter (PBS). Optical fibers, lenses and
filters (LF) transmit the signal and idler photons to detectors AB or AC, depending on the
outcome of the PBS measurement. The photon pairs are counted in coincidence and each such
event assigned the appropriate binary label.

entangled states have raised questions concerning the compatibility of quantum


mechanics and special relativity theory. In particular, claims have been made that
entangled states violate causality by permitting super-luminal transmission of information. Such claims, in my opinion are entirely incorrect; I see no conflict between
quantum mechanics and special relativity in this or any other circumstance.23 In any
event, entanglement plays no role in the quantum optical experiments that I will now
explain.24
In the experimental configuration schematically shown in Figure 4.7, whose
purpose was to generate sequences of single-photon polarization measurements,
blue-violet light (405nm) from a continuous-wave (cw) laser serving as the pump
irradiated a crystal of beta-barium borate (BBO) from which emerged horizontally
polarized (i.e. Type I) S and I infrared photons (wavelength 810nm) at an angle
to one another of 3 . The polarization of the idler photons was transformed from
H to D by passage through a half-wave (/2) plate. The idler photons then encountered a polarization beam splitter at whose internal surface a (presumably) unpredictable quantum decision was made either to transmit the idler in a state of
H polarization or to reflect the idler in a state of V polarization. Ideally, the
probability of each outcome should be close to 50%. Finally, by means of lenses,
optical fibers, and long-pass filters (to reduce background counts) the idler photon
was transmitted to detector B or C, depending on the outcome at the beam splitter, in
coincidence with arrival of the signal photon at detector A. The registration of a

23
24

For elaboration of this point, see M. P. Silverman, Quantum Superposition: Counterintuitive Consequences of Coherence,
Entanglement, and Interference (Springer, 2008).
D. Branning, A. Katcher, W. Strange, and M. P. Silverman, Search for patterns in sequences of single-photon
polarization measurements, Journal of the Optical Society of America B 28 (2011) 14231430.

244

Mother of all randomness II

coincidence between detectors A and B was symbolically designated 0 and the


coincidence between detectors A and C designated 1. In this way data sequences of
length on the order of tens of millions of bits were collected.
It must be noted that the process of parametric down-conversion is not a perfect
single-photon source. Theoretical analysis of PDC predicts that there should also
occur, albeit with lower probabilities, pairs of multiple-photon states, so that each
mode of the optical field emerging from the nonlinear crystal would actually be
represented by an expansion of the form


1 2
ji 1  jj j0iSI j1iSI 2 j2iSI 3 j3iSI   
4:7:3
2
in which jnSI (n 0, 1, 2 . . .) stands for the direct product of n-photon S and I states
jniSI

1
jni jni ,
n! S I

4:7:4

and is a complex-valued parameter proportional to the electric field amplitude of


the pump beam. Each signalidler pair of multiple-photon states (4.7.4) in the
expansion represents a mutually exclusive measurement outcome with intrinsically
non-local quantum mechanical correlations between the sets of signal and idler
photons. A remarkable property of the state (4.7.3) is that the marginal distribution
of photons in one arm (signal or idler) is indistinguishable from that of thermal
light.25 In other words, the probability of n signal photons arriving at (let us say)
detector A, irrespective of the number of idler photons arriving at detector B, is given
by the BE relation (4.3.15)
Pn

n
1n1

4:7:5

in which the mean number of photons per bin and the effective photon temperature
T of the mode are related to the parameter of the pump by
sinh2 jj

:
kB T
2 ln cothjj

4:7:6
4:7:7

A consequence of (4.7.5), which has been experimentally verified,26 is that the


ensemble of photons within the same arm displays bunching. Perhaps one might have
thought that the signal or idler photons of a PDC source would display the same
statistics as photons in the pump beam, but this is not the case; the cw laser pump is not

25
26

B. Yurke and M. Potasek, Obtainment of thermal noise from a pure quantum state, Physical Review A 36 (1987) 3464
3466.
B. Blauensteiner et al., Photon bunching in parametric down-conversion with continuous-wave excitation, Physical
Review A 79 (2009) 063846 (16).

245

4.7 Correlation experiment with down-converted photons

Table 4.4

Polarization measurements on PDC photons


Lo-

Hi-

Bin duration (ms)

1 0.0111 0.0002
1

2 0.364 0.002
0.1

Pr(0 event per bin)

0.98920 (Obs)
0.98896 (Thy)

0.694 (Obs)
0.695 (Thy)

Pr(1 event per bin)

0.01074 (Obs)
0.01098 (Thy)

0.253 (Obs)
0.253 (Thy)

Pr(> 1 event per bin)

0.00006 (Obs)
0.00006 (Thy)

0.053 (Obs)
0.052 (Thy)

Pr> 1 event per bin


Pr1 event per bin

0.0056 (Obs)
0.0056 (Thy)

0.209 (Obs)
0.206 (Thy)

Sequence length n
Pr (1)  p
Number of bags M
(1 bag 8192 bins)
Sequence length n
Pr (1)  p
Number of bags M
(1 bag 8192 bins)

Single-photon events
8 919 341
0.478 50

16 797 012
0.500 37

1088

2050

All non-null events


8 969 641
0.475 82

20 258 816
0.414 87

1094

2473

a thermal source. The thermal noise arises from amplification of the vacuum fluctuations that lead to the spontaneous emission of signal and idler photons.
Table 4.4 summarizes the characteristics of the sequences of signalidler events
experimentally obtained for two values 1 0.0111 and 2 0.364 of the mean
count per bin which will be referred to as the Lo- and Hi- sequences.
For beams of mean occupation number of < 1 counts per bin, where the bin
width (counting interval) is long compared to the coherence time of the source, the
emission statistics should be well approximated by a Poisson distribution. The theoretical probabilities in Table 4.4 were calculated with Poisson statistics. By far, the
largest number of outcomes is 0 events per bin. Since there are no coincidence counts
to be assigned a 0 or 1, these events were removed from the sequences to be tested.
The singly occupied bins are the ones that form the random number sequences of
interest in polarization-based random number generators. Multiple-photon events
are usually excluded from a random bit sequence because they also cannot be
assigned a binary label in addition to the fact that such events can compromise
the security of the key. As seen in Table 4.4, the fraction of events with two or more

246

Mother of all randomness II

photons within the same data collection time is very low if is sufficiently low, but
can still be less than 1 and lead to a significant presence of multiple-photon events.
To test whether the observed sequences of photon measurements with or without
inclusion of multiple-photon events gave evidence of non-randomness, a statistical
method was needed that could be applied to sequences of data comprising events of
more than two kinds. The theory of recurrent runs provides a mathematically
interesting and statistically effective solution to this need.
4.8 Theory of recurrent runs
A stochastic process generates random outcomes in time or space. Despite their random
occurrence indeed, precisely because of it, as discussed in the previous chapter the
outcomes of a stochastic process will display ordered patterns, which a statistically
na ve observer may mistakenly interpret as predictively useful information. Although it
is not possible to prove with certainty that a particular process is random, various
statistical tests can demonstrate within specified confidence limits that it is not random.
Among these, runs tests are especially useful because they are easily implemented, do
not depend on the form or parameters of the distribution of the sampled population,
and are sensitive to deviations from the statistics expected for a random sample.
Recall that a run was defined to be an unbroken sequence of similar events of a binary
nature, as, for example, a sequence of 0s and 1s. A runs test, then, is a test of randomness
in permutational ordering along a single dimension, either spatial or temporal. The
applicability of runs tests is more general that might be inferred at first glance because the
original data, which can be any discrete or continuous series of real numbers, can be
mapped to a set of binary elements in various ways. The different mappings generally
produce different sets of frequencies of runs of specified length, thereby independently
mining the information inherent in the data. Runs tests are distribution free because they
rely on ordinal or categorical relationships between the elements of the sequence to be
tested, rather than on the exact magnitudes of the elements themselves.
To apply a runs test one must know, or at least be able to approximate closely, the
distribution of the chosen statistic. The statistics of interest have traditionally been
the total number of runs (of both types of symbols) and the frequency of the shortest
and longest runs. However, the data are much more effectively utilized by determining for each run length t the probability pn,k,t for occurrence of k runs in n trials.
Although I discussed certain kinds of runs (exclusive runs) in the previous
chapter, it will be useful to consider the subject again from a broader perspective.
Generally speaking, runs tests are of three types. The first is based on categorical
relationships, by which is meant that a variate is assigned a symbol such as a or b
depending on whether it was greater or lesser than a specified threshold, e.g. the
median. The null hypothesis, against which the resulting series containing na elements


na nb
of one kind and nb elements of the other is compared, is that each of the
na

4.8 Theory of recurrent runs

247

distinguishable arrangements is equally likely prior to sampling. This hypothesis


implies that the probability of an element (a or b) is constant, no matter where in
the series it appears.
The second type of runs analysis, based on ordinal relationships, defines an up
down run as an unbroken sequence of increasing or decreasing values. If n unequal
numbers are generated by a random process, then each of the n! distinguishable
orderings has an equal a priori probability of being observed. A binary series can be
constructed from an observed sequence of real numbers by taking first differences,
i.e. the difference of each pair of contiguous elements, and assigning symbol if
the difference is positive and  if the difference is negative. In this case the
probability of a or  is not constant within a run, but becomes less probable
the farther it appears from the start of the run.
I have introduced these two types of runs tests into nuclear physics to examine a
variety of radioactive nuclides undergoing transformations by different means such
as , , , and electron-capture decay processes. To examine sequences of PDC
photon polarization measurements, however, a third type of runs analysis is especially useful. This type is based on the theory of recurrent runs, which are defined as
follows: A sequence of n symbols A and A (read as not A) contains as many runs of
length t as there are non-overlapping uninterrupted successions of exactly t symbols.27 It is distinguished from exclusive runs in that the concept of run length is so
defined as to be independent of subsequent trials. For example, in the sequence
aaaabaaaaaa there are two runs of length 4 [aaaajbjaaaajaa], three runs of length 3
[aaa jabj aaajaaa], and five runs of length 2 [aajaajbjaajaajaa]. (Analyzed in terms of
exclusive runs, there would have been one run of a of length 4 and one run of a
of length 6, provided the sequence ended at the eleventh trial.) In a sequence of
Bernoulli trials, a recurrent run of length t occurs at the nth trial if the nth trial adds a
new run to the sequence. Thus, the recurrent runs of length 4 occur at positions 4, 9,
and the recurrent runs of length 3 occur at positions 3, 8, 11.
The advantage of this third definition of a run is that runs of fixed length become
recurrent events, and the statistical theory of recurrent events can then be applied to
test data for permutational invariance over a wider variety of patterns than just those
of unbroken sequences of identical binary elements. For example, one may be
interested in testing the recurrence of a pattern abab, which in a quantum optics
experiment might correspond to a sequence of alternate detections of left and right
circularly polarized photons at a single detector or to coincident detections at four
detectors. Besides applications to runs, the same theoretical foundation may be
applied to recurrent events in other forms such as return-to-origin problems (e.g.
instances where a random variable has returned over time to the starting value),
ladder-point problems (instances where a sum of random variables exceeds all
preceding sums), and waiting-time problems.
27

W. Feller, An Introduction to Probability Theory and its Applications (Wiley, 1950) 299300.

Mother of all randomness II

248

Suppose we have a sequence (x1, x2,. . .,xn) of n independent and identically


distributed (i.i.d.) Bernoulli trials, i.e. binary outcomes [success or failure] having
constant probabilities p and q 1  p, respectively. A recurrent event E is defined
here to be a run of successes of length t. The following random variables will produce
the statistics we are seeking in regard to E.
T k number of trials between k  1th and kth occurrence of E 1,
Sr

r
X

T k number of trials up to and including rth occurrence of E,

4:8:1
4:8:2

k1

N n number of occurrences of E in n trials, also referred to as the run count:


4:8:3
In the context of the PDC experiment to measure single-photon polarizations, the
variate of primary interest is (4.8.3), the number of occurrences in a fixed number of
trials. However, to determine that distribution, it is first necessary to obtain the
distribution of waiting times (4.8.2).
The distribution of the variable T is defined by the statement
Pr T n  f n with f 0 0

4:8:4

where fn is the probability that E occurs for the first time at the nth trial. The
generating function of the probabilities of first occurrence is expressed by the series
expansion
F s

4:8:5

f n sn

n0

from which it follows that the generating function of the rth occurrence of E is
!r

X
X
r
r
r n
n
F s
f n s F s
f ns
,
4:8:6
n0

n0

where
Pr Sr n  f nr

4:8:7

is the probability that the rth occurrence of E first takes place at the nth trial. [See
Eq. (3.14.11) and the discussion preceding it.]
Although the derivation of the generating function (4.8.5) is not difficult, it is
somewhat lengthy, and I will simply give the result28
F s

28

pt st 1  ps
1  s qpt st1

4:8:8

W. Feller, Fluctuation theory of recurrent events, Transactions of the American Mathematical Society 67 (1949) 98119.

4.8 Theory of recurrent runs

249

from which the mean and variance of the recurrence times of runs of length t follow
by differentiation

dF s

1  pt
T

4:8:9

qpt
ds s1
(
)



d 2 F s
dF s 2 dF s

1
2t 1 p
2



 2:
4:8:10
T


2
2

ds
ds
ds
qpt
q
qpt
s1

To keep notation as unencumbered as possible, I have suppressed the run length t in


the arguments of the generators, expansion coefficients, and statistical moments. It
should be borne in mind, however, that relations (4.8.5) through (4.8.10) and others
to follow all pertain to a fixed value of t.
In analysis of the PDC experiment for patterns of photon randomness, the
number of trials to the kth occurrence of E is not as useful as knowing the probability
that E occurs k times in a fixed number n of trials. The relation connecting the two
sets of variates is
Pr N n  k Pr Sk  n:

4:8:11

In words: if the total waiting time to the kth success is less than n, then the number of
successes in time n must be at least k. The probability pn,k that exactly k events
E occur in n trials is then expressible as
pn, k Pr N n k Pr Sk  n  Pr Sk1  n

4:8:12

and serves in the construction of two generating functions


Gn z

p n, k z k

4:8:13

k0

Fk s

X
n1

p n, k s n

Fk s1  F s
:
1s

4:8:14

Note that the summation in (4.8.13) is over the number of occurrences k, whereas the
summation in (4.8.14) is over the number of trials n. The second equality in (4.8.14)
follows directly from Eq.(4.8.11), and is derived in an appendix. Multiplying both
sides of (4.8.13) by sn and summing over n leads to the bivariate generating function
"
#

X
X
1  F s
H s, z 
pn, k zk sn
4:8:15

1

s 1  zF s
n1
k0
from which the probabilities pn,k are obtained by series expansion of both sides of the
equality.
A sense of the structure of the formalism can be obtained by considering the case
of recurrent runs of length t 3 for a stochastic process with p 12. Substitution of

Mother of all randomness II

250

these conditions into Eq. (4.8.8) for F(s) yields the following expression for the right
side of Eq. (4.8.15) and its corresponding Taylor-series expansion to order s6


2 s2 2s 4
H s, z 3
s zs3 2s2 4s 8 



1
7
3
13 4
1
3
s
z s3
z
z s5 O s6 :
1 s s2
8
8
16
16
4
4
4:8:16
Recall that the powers of s designate the number of trials, and the powers of z
designate the number of occurrences of runs of length 3. For a fixed power of s,
the sum of the coefficients of the powers of z within each bracketed expression sum to
unity, as they must by the completeness relation for the probability of mutually
exclusive outcomes. Note that the first three terms (s0 s1 s2) are independent of
z i.e. contain only powers z0 since there cannot be runs of length 3 in a sequence of
no more than two trials. For three trials, the probability of zero runs of length 3 is 7/8
and the probability of one run of length 3 is 1/8. For five trials, however, the
probability of zero runs is 3/4 and the probability of one run is 1/4. This pattern
persists: (a) to obtain a run of length t, the sequence of trials must be of length n  t,
and (b) the greater the number of trials, the higher is the probability of obtaining
longer runs.
It is not necessary to know the individual pn,k to determine the mean number of
recurrent runs

X
hN n i
kpn, k :
4:8:17
k0

Multiplying both sides of Eq. (4.8.17) by s and summing n over the range (1, ) leads
to the generating function for the distribution of Nn
n

M1s

X
hN s isn
n1

F s
:
1  s 1  F s

4:8:18

Starting from the relation


hN2n i

4:8:19

k2 pn, k

k0

and following the same procedure that led to (4.8.18) yields the generating function
for the distribution of hN 2n i
M2s

X
hN 2s isn
n1

F s F2 s
1  s 1  F s2

4:8:20

From the expansion coefficients of M1(s) and M2(s) to order sn one obtains the
variance

251

4.8 Theory of recurrent runs

var N n hN 2n i  hN n i2 e

n 2T
,
3T

4:8:21

where the approximate equality holds in the asymptotic limit of large n. The generators for higher-order moments of Nn can be derived in the same manner, but are not
needed in this chapter.
The statistics (probabilities and expectation values) for any physically meaningful
choice of probability of success p, run length t, and number of trials n are deducible
exactly from the generator (4.8.15) and derived generators such as (4.8.18) and
(4.8.20). For many applications, however, particularly where it is possible to accumulate long sequences of data as is often the case in atomic, nuclear and elementary
particle physics experiments or investigations of stock market time series, the tests for
evidence of non-random behavior are best made by examining long runs. Suppose,
for example, one wanted the probability of obtaining the number of occurrences of
runs of length 50 in a sequence of 100 trials. This would require extracting the
hundredth term
p100, 50

1
562 949 953 421 312





1 125 899 906 842 623
1

z z 1
2 251 799 813 685 248 2 251 799 813 685 248

 1:0000 2:3093  1014 z 7:8886  1031 z2


from the Taylor expansion of the generating function (4.8.15). Powerful symbolic
mathematical software such as Maple or Mathematica permits one to do this up to a
certain order limited by the speed and memory of ones computer, but these computational tools become insufficient when one is seeking exact probabilities of runs in
data sequences of thousands to millions of bits.
Explicit expressions for pn,k and Nn for specified run length t


i

1 dn h

1
k
1  s 1  F sF s
4:8:22
pn, k

n! dsn
s0

1
hN n i
n!

dn h

1
1
1  s 1  F s F s

dsn

4:8:23

s0

can be derived from the associated generators, but direct execution of these expressions by differentiation is not in general computationally economic. The computer, in
fact, performs the series expansion of the generators H(s,z) and M1(s) more rapidly
than it performs symbolic differentiation.
As an example that employed Maple and a Mac laptop computer, I calculated the
exact mean number of runs of length t 4 in a sequence of one million trials with
probability of success p 0.5 by the following steps.

252

Mother of all randomness II

 Express M1 (s) as a rational function of s


M 1 s

1
s4
:
4
3
2
2 s s 2s 4s  8s  1

X
hN j isj , but do not
 Convert the rational function numerically to a power series
j0
display the result since, after all, there are 1000001 terms.
 Instead, extract the desired term by summing the series from term n to term n; there
is only one term in the sum. Extraction of this element for the specified conditions
led to N1000000 33333.258 in a fraction of a second.

One can show by application of the Central Limit Theorem to relation (4.8.11) that
for large n, the number Nn of runs of length t produced in n trials is approximately
normally distributed with mean
n
T

4:8:24

n 2T
3T

4:8:25

N 
and variance
2N

in accord with the calculation leading to (4.8.21). (The asymptotic expressions


(4.8.24) and (4.8.25) also follow from the generators (4.8.18) and (4.8.20), but those
alone do not determine the distribution of Nn.) The Gaussian approximation, whose
relative accuracy improves in the limit of increasing n, is actually quite good even for
moderate values of n, as shown in Table 4.5 for n 100. Expansion of the generating
function M1(s) yielded the exact mean value as an integer or fraction, which was then
expressed as a floating-point number to three significant figures for comparison with
the Gaussian approximation. The latter, as indicated by the tabulated results, always
overstates the true mean values, and therefore the probabilities; the absolute error
h
i
h
.
i
Exact
Gauss
Exact
Exact
,
in
contrast
to
the
relative
error
p
p
, increases
pGauss

p

p
n, k
n, k
n, k
n, k
n, k
with run length and number of trials.
The procedure described above for converting the rational function of s into a formal
power series in s did not work with the bivariate generator H(s, z), which required for
conversion the solution of the roots of a high-order (>2) algebraic equation. That
additional complication arose because of the presence of a product of z with a tdependent power of s in the denominator. An alternative procedure to isolate the values
pn,k for fixed n, which still relies on the computational speed of series expansion and
worked well for sequence lengths in the low thousands, entails the following.





For given p and t, express H(s, z) as a rational function of s and z.


Generate a Taylor-series expansion of H(s, z) to order n and n1 in s.
Convert the Taylor-series expansions into polynomials P(n1) and P(n).
Subtract one polynomial from the other to obtain an expression of the form

253

4.8 Theory of recurrent runs

Table 4.5

Mean numbers of runs for n 100 trials with p 0.5

Run
length t

Mean number of
runs (exact)

Mean number of runs


(Gaussian approximation)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
40
60

50
16.556
7.041
3.258
1.562
7.611 (1)
3.738 (1)
1.842 (1)
9.098 (2)
4.496 (2)
2.222 (2)
1.099 (2)
5.433 (3)
2.686 (3)
1.328 (3)
6.561 (4)
3.243 (4)
1.602 (4)
7.916 (5)
3.910 (5)
2.819 (11)
1.820 (17)

50
16.667
7.143
3.333
1.613
7.937 (1)
3.937 (1)
1.961 (1)
9.785 (2)
4.888 (2)
2.443 (2)
1.221 (2)
6.104 (3)
3.052 (3)
1.526 (3)
7.630 (4)
3.815 (4)
1.907 (4)
9.537 (5)
4.768 (5)
4.547 (11)
4.337 (17)

h
i
P n1  P n ! pn, 0 pn, 1 z pn, 2 z2    pn, z sn ,
n
t

n
t


where nt is the largest integer k such that kt  n, and the coefficients pn,k are given as
exact fractions.
 Evaluate the set fpn,kg as floating-point numbers, if desired.
As an example, the procedure led in under ten seconds to the full set p1000,k
fk 0. . .200g for the probability of k occurrences of runs of length 4 in a sequence of
1000 trials. The calculations were again performed with a Mac laptop running Maple.
Using the above methods to obtain exact numerical probabilities becomes impractical for very long sequences and long run lengths since the evaluation time grows
nonlinearly with n. For example, calculation of the distribution for n8192, t6
required more than 550 hours of computation. However, it was realized (by an
undergraduate working on the project) that after Gn(z) is calculated for small n, it
can be used to approximate Gn(z) for larger n by treating the longer sequence as a
concatenation of smaller ones, and applying a correction for loss of runs at the

Mother of all randomness II

254

boundaries. For the distributions arising from the PDC single-photon experiment,
which involved sequences of length 8192 trials, this method yielded values for pn,k
that differed from the exact probabilities by a theoretical bound no greater than 106.
In cases where direct comparison was possible, this discrepancy never exceeded 108.

4.9 Runs and the single photon: lessons and implications


The results of the recurrent runs analyses of the Lo- and Hi- experiments are
summarized in Tables 4.6 and 4.7, which show the observed and predicted numbers
of runs of length 2 through 27. Though improbable, those runs of very long length
were obtainable because the experiments yielded data sequences ranging from about
920 million bits, as shown in Table 4.4.
Table 4.6

Lo- experiment predicted and observed numbers of runs of 1s

Run
Length t

N obs
Single photon
events

Nn
Single photon
events

Nobs
All non-null
events

Nn
All non-null
events

2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

1381157
572187
257254
119840
56511
26863
12878
6125
2966
1452
730
365
157
69
32
16
9
7
4
2
2
1
1
1
0

1381300 900
572300 700
257300 500
119700 300
56500 200
26870 160
12820 110
6120 80
2930 50
1400 30
670 30
321 18
153 12
73 9
35 6
17 4
8 2
3 2
1.8 1.4
0.9 0.9
0.4 0.6
0.2 0.4
0.1 0.3
0.05 0.2
0.02 0.15

1375944
567578
253947
117665
55200
26044
12443
5887
2850
1384
699
343
142
63
27
13
7
4
2
1
1
0

1376000 900
567600 700
257400 500
117500 300
55200 200
26110 160
12390 110
5890 80
2800 50
1330 40
630 30
301 17
143 12
68 8
32 6
15 4
7 2
3.5 1.9
1.7 1.3
0.8 0.9
0.4 0.6
0.2 0.4

255

4.9 Runs and the single photon: lessons and implications

Table 4.7

Hi- experiment predicted and observed numbers of runs of 1s

Run length
t

N obs
Single photon
events

Nn
Single photon
events

Nobs
All non-null
events

Nn
All non-null
events

2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

2802820
1202758
561638
271694
133684
66255
33002
16568
8219
4047
1987
979
493
237
119
63
34
19
9
3
1
1
1
1
1
0

2803000 1300
1202000 900
561300 700
271800 500
133900 400
66400 300
33110 180
16530 130
8270 90
4130 60
2070 50
1030 30
520 20
259 16
130 11
65 8
32 6
16 4
8 3
4 2
2 1,4
1 1
0.5 0.7
0.3 0.5
0.1 0.4
0.06 0.25

2464125
911663
361732
147779
60917
25164
10473
4337
1775
725
273
99
38
13
8
2
1
0

2464000 1300
911500 90
361900 600
147500 400
60800 200
25130 160
10400 100
4320 70
1790 40
740 30
308 18
127 11
53 8
22 5
9 3
3.8 1.9
1.6 1.3
0.7 0.8

Two sets of analyses were performed. In the first, only single-photon events were
retained with the binary classification depicted in Figure 4.7. In the second, all nonnull events were retained and assigned to the binary categories
1 single-photon coincidence AC
8
< 0 single-photon coincidence AB
1

:
:
X multiple-photon coincidence event
Examination of Tables 4.6 and 4.7 show that, as expected, the presence of multiple
photon events led to fewer runs of 1s at each length because a run of 1s could be
terminated by occurrence of either a 0 or X. The empirical values of the probability

Mother of all randomness II

256
140

Frequency

120

Single Photon
Events

= 0.364
Run Length 6

100
80
60
40
20
0
40

50

Frequency

70

80

90

Number of Occurrences

250
200

60

All Non-Null
Events

= 0.364
Run Length 6

150
100
50
0
10

15

20

25

30

35

40

Number of Occurrences
Fig. 4.8 Observed frequencies of runs of length t 6 for 8192-bit bags of the Hi- singlephoton events only (upper panel) and with all non-null events (lower panel). Theoretical
distributions (solid), obtained by means of the concatenation method, are superposed.
Multiple-photon events, which comprise about 17% of the non-null events, shift the
distribution by decreasing the frequency of runs.

of success p obtained for the two sets of analyses for Hi- and Lo- experiments are
given in Table 4.4. Comparison of observed and theoretically expected numbers of
runs for both sets of analyses are in close agreement.
For a thorough comparison of observation with theory, the two data sequences
(Lo- and Hi-) were partitioned into M subsequences (bags) of length n 8192 bits,
as was done with the sequences of nuclear decay counts in the previous chapter. As
summarized in Table 4.4, the number of bags was slightly above 1000 for the Lo-
experiment and above 2000 for the Hi- experiment. Histograms were made of the
number of occurrences of each run length from 2 to 13, an example of which is shown
in Figure 4.8 for runs of length t 6 in the Hi- sequence.
Each of the 12 histograms of run frequencies obtained from the sequences for
each of the two experiments was then tested against the theoretical distributions
Nn,k,t M pn,k,t with a 2 analysis. Recall that the outcome of a 2 analysis is the
cumulative probability, or P-value, of obtaining a value of the tested variate
greater or equal to the observed value. If the null hypothesis (that the tested

4.9 Runs and the single photon: lessons and implications

257

P-Value

0.75

0.50

0.25
0
2

10

12

Run Length
Fig. 4.9 Distribution of P-values as a function of run length for runs of 1s in the Hi- sequence
of bits.

variate is a true random variable) is valid, then P is expected to be a U(0,1)


variate, i.e. to be distributed uniformly over the interval 1P0. On average,
therefore, it is to be expected that one out of 100 sequences from an ideal random
source will fail a 2 test purely by chance at the significance level of 0.01.
If more than 1% fail at this level, the randomness of the source would be
subject to doubt.
An example of the outcome of the 2 analyses is plotted in Figure 4.9, which shows
the distribution of P-values as a function of run length for runs of 1s in the Hi-
sequence of bits. There is nothing in the distribution that would call into question the
validity of the null hypothesis. Moreover, the occurrence of multiple-photon events
does not appear to modify the distribution in any systematic way, even though such
events account for about 17% of all non-null events.
The tests on the events designated 1, i.e. coincidences with vertically polarized
single-photon states, were also performed on the events designated 0, i.e. coincidences with horizontally-polarized states, with statistically equivalent outcomes.
What lessons, then, can be drawn from the PDC photon sequences taken
altogether?
In the context of another detailed examination of the idiosyncratic behavior of
nature at the atomic scale, the experiment revealed no basis for believing that the
emission of a pair of light quanta from a parametric down-conversion source is in
any way a deterministic, predictable event, or that there is any underlying causal
theory to forecast in what state of polarization a single photon will emerge from the
polarization beam splitter. Malus law of classical optics must be understood as a
random binary decision on discrete entities, and not as a mean partition of a
continuous entity. Nor to recall the controversy that motivated the investigations
of the previous chapter was there found any correlation between fluctuations in the

258

Mother of all randomness II

random outcome of PDC polarization measurements and the fluctuations of other


stochastic processes under observation during the same time period.
In the context of secure communication, the experiment described in this chapter
was by no means the first to test a quantum optical source of random bits for
purposes of quantum key distribution. Indeed, the U.S. National Institute of Standards and Technology (NIST) has made available through the internet a suite of
statistical tests for application to random number generators.29 However, from a
practical standpoint, the analysis developed for the present experiment, in contrast to
the NIST tests, did not require numerical unbiasing i.e. there was no need to
ensure that the probabilities p and q for the outcomes 1 and 0 of the data sequences
were equal. This is important because p and q will not likely be equal in many, if not
most, of the applications one encounters, as for example sequences of photon
measurements that include multiple-photon events.
Besides runs tests, there are other statistical tests that could have been used, and it
is worth remarking briefly why they were not. Conventionally, the sensitivity of a
statistical test is gauged by its power, which is defined as the probability of not
making a Type II error i.e. of not wrongly accepting the null hypothesis when it
is false. (A Type I error is to wrongly reject the null hypothesis when it is true.) There
is no simple formula for calculating the power of a runs test under general circumstances, since the power of a statistical test may depend on the specific application.
Nevertheless, there are reasons to believe that runs tests are particularly effective in
comparison with other tests that could have been employed.
For example, NIST tested three pseudo-random number generators with five
statistical tests at a level of significance of 1%. Each generator was used to generate
300 series of one million elements each. The relative effectiveness of the statistical
tests was dependent on the generator, but runs tests were shown to be the most
sensitive in all of the published graphical summaries.30
Another basis for tests of randomness is entropy, which in statistical physics is
related to probability and in communications science is related to information.
Power calculations of one such test, which measured the deviation of the estimated
entropy of a data set of length n from the theoretical maximum of a random series of
the same length, led to the conclusion that the test is more powerful than a runs test
for low n, but less powerful than a runs test for large n.31 The lengths of the data
series generated in the PDC experiment are very large, in which case runs tests would

29

30

31

A. Rukhin et al., A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic
Applications, National Institute of Standards and Technology, Special Publication 80022 (Revised 2010), http://
csrc.nist.gov/publications/nistpubs/800-22-rev1a/SP800-22rev1a.pdf.
J. Soto, Statistical testing of random number generators, Proceedings of the 22nd National Information Systems
Security Conference (National Institute of Standards and Technology, 1999), csrc.nist.gov/grops/ST/toolkit/rng/
documents/nissc-paper.pdf
S. Chatterjee, M. R. Yilmaz, M. Habibullah, and M. Laudato. An approximate entropy test for randomness,
Communications in Statistics Theory and Methods 29 (2000) 655675.

4.9 Runs and the single photon: lessons and implications

259

be preferred over the entropy test. Moreover, the entropy test yields a single statistic,
whereas runs tests yield a statistic for each run length.
Finally, in tests I have carried out on the randomness of first differences of closing
stock prices of a score or more of listed companies of the New York Stock Exchange
an investigation that I will discuss in Chapter 6 a number of the resulting series
passed tests of randomness based on autocorrelation, periodicity (by means of power
spectra), and entropy, but failed runs tests for nearly all values of run length.

Appendices

4.10 Chemical potential of massless particles


In all books I have seen that discussed the statistics of light, the chemical potential of
photons was asserted to be zero. The discussions, however, focused exclusively on
thermal (or black-body) radiation, leaving open the question of whether the chemical
potential of light is always zero. An argument, if one was given, beyond the bland
(and in general incorrect) statement that massless particles can be emitted and
absorbed in arbitrary numbers, was usually grounded in thermodynamics and went
something like the following.
The First Law of Thermodynamics takes the form (among others)
X
dU TdS  PdV 
i dN i ,
4:10:1
i

where the extensive variables of the system are internal energy U, entropy S, volume
V, and number of particles Ni of constituent i, and the intensive variables are absolute
temperature T, pressure P, and chemical potential i of constituent i. In thermal
equilibrium (dS 0) in a system of constant volume (dV 0) and constant energy
(dU0), the First Law reduces to
X
i dN i 0:
4:10:2
i

If the variations dNi are arbitrary, then it must follow that i0.
Suppose, however, that the variation in particle number is not arbitrary, but
governed by a reversible chemical reaction of the form
a1 X1 a2 X2 a3 X3 a4 X4 :

4:10:3

Then, if the reaction progresses by a differential amount d, the population of each


constituent changes in the proportion
dN 1
dN 2
dN 3
dN 4

260

a1 d )
X
a2 d
)
c i i 0
a3 d
i
a4 d


Reactant : ci ai
,
Product :
ci ai

4:10:4

4.10 Chemical potential of massless particles

261

which, upon substitution into (4.10.2), leads to the constraint (4.10.4) on chemical
potentials. This constraint does not pertain only to chemical reactions.
Consider, for example, the creation and annihilation of electronpositron pairs
e e 2;

4:10:5

which, according to (4.10.4) should be governed by the relation


e p 2 :

4:10:6

Is the chemical potential of the gamma photons zero even though now there
appears to be a specific number of photons produced in each reaction? The answer
is yes. In this case the reason given is that the photons escape the system and
therefore do not contribute to the thermodynamic equilibrium of the remaining
particles. (For the same reason, the chemical potential of neutrinos, once thought
to be massless, is usually taken to be zero in weak nuclear processes like the beta
decay of the neutron.) This was the case with the experiment I described in the
previous chapter in which detection of each pair of back-to-back gamma rays from
reaction (4.10.5) provided a signal for the decay of one radioactive sodium nucleus.
The reaction proceeded exclusively in the forward direction because the photon
density within the volume that defined the system was too insignificant to sustain
the reverse reaction.
Suppose, therefore, that we have a system of interacting electrons, positrons,
and gamma rays confined to a fixed volume with 100% reflective boundaries
so the photons cannot be absorbed by the walls. Since photons do not interact
with one another (leaving aside nonlinear QED processes) and do not exchange
energy with the walls of the container, they come into equilibrium with the
electrons and positrons only. The system is a charged or neutral plasma,
depending on whether or not there is an initial imbalance of charged particles.
The reaction (4.10.5) proceeds in both directions, and the three components of the
plasma come to the same equilibrium temperature. Is the number of photons still
uncertain? Is the chemical potential still zero? The answer to both questions is
again yes.
The number of photons is uncertain because the electronpositron annihilation
reaction can actually generate any number of photons consistent with the conservation of total energy, linear momentum, and angular momentum (spin). Thus,
a ee pair in a singlet state32 (state with anti-parallel spins) at rest can decay to
photon pairs that also have total linear momentum 0 and net spin 0. This would
include any integer number (n  1) of pairs of back-to-back photons of opposite
helicities (circular polarizations)

32

The multiplicity of states of a particle of spin quantum number s is 2s 1. Thus s 0 for e and e with anti-parallel
spins, and s 1 for parallel spins.

262

Mother of all randomness II

e e S0 2; 4; 6 . . .  2n n 1;2; . . .

4:10:7

provided that energy is conserved: mec2/n. Correspondingly, in the reverse


direction a pair of sufficiently energetic gamma photons could create an even number
of singlet ee pairs that satisfy the energy and momentum conservation laws.
Likewise, a ee pair in a triplet state (two parallel spins) at rest must decay to at
least three non-collinear photons
e e S1 3; 5 . . .  2n 1 n 1;2; . . .

4:10:8

that preserve energy and have zero net linear momentum. The pair cannot decay to
just one photon because that photon could not have zero linear momentum. The
various decay modes have different probability; the greater the number of photons
produced, the lower is the probability. Thus the half-life of singlet positronium the
bound state of one electron and one positron is much shorter (1.24  1010 s) than
the half-life of triplet positronium (1.39  107 s).
In any event, since the photon number for process (4.10.7) and (4.10.8) is indefinite, the relation (4.10.4) for chemical potentials can be satisfied only if 0, which
also implies that
p e :

4:10:9

Suppose, however, that the confined plasma were actually generated by a process of
the form (4.10.5) in which each annihilation rigorously produced only two gammas.
Would one still have 0 if photons were produced in definite numbers? Again:
yes because at the fundamental level of quantum field theory a particle and its antiparticle have chemical potentials of equal magnitude and opposite sign. Thus relation
(4.10.9) holds generally (where, by convention, the electron is usually taken to be the
particle and the positron the anti-particle). By this argument, the chemical potential
of the photon would have to be zero in any reaction of the form A B
because the photon is its own anti-particle ; thus 2  0.
There have been claims in the research literature33 of non-zero photon chemical
potentials in complex systems such as semiconductors, in which the particle is an
electron in a conduction band, and the anti-particle is a hole in the valance band.
Among the various interactions that can occur is a process like that of (4.10.5) in
which an electron and hole combine to produce radiation. One can then write an
equation like that of (4.10.6), where the claim is made that the chemical potentials
for the electron and hole do not sum to zero, and therefore 6 0. A detailed
analysis is beyond the intended scope of this chapter. Let it suffice to say that,
although the claim may not be incorrect, it may also be more a matter of semantics
than physics. The definition and application of thermo-statistical quantities such as
the chemical potential apply rigorously only to systems in equilibrium that means
33

P. Wurfel, The chemical potential of radiation, Journal of Physics C: Solid State Physics 15 (1982) 39673985.

4.10 Chemical potential of massless particles

263

thermal, chemical, and hydrostatic equilibrium and it is not apparent that electrons, holes, and radiation in a semiconductor satisfy those conditions. There is no
confinement or photon reservoir to supply photons to sustain the reverse reaction
necessary for equilibrium; photons immediately leave the system as in the
case of electronpositron annihilations of the previous chapter. Moreover,
electrons and holes can have different temperatures. Finally, the chemical
potentials of the electrons and holes were equated with corresponding Fermi energies. The two concepts, however, are not synonymous. The chemical potential,
defined by the First Law (4.10.1) and its equivalent representation in terms of other
thermodynamic potentials H (enthalpy), F (Helmholtz free energy), and G (Gibbs
free energy),

4:10:10
i
N i
S;V N i
S;P N i
T;V N i
T;P
is the amount of energy required to add one particle of type i to a system already
containing Ni particles of that kind. In contrast, the Fermi energy F is the highest
occupied energy level in a system of fermions. The chemical potential and Fermi
energy are equivalent only at T 0.
At the most basic level, distinct from all the physical arguments, there is a purely
mathematical setting that defines the nature of the chemical potential. From the
general perspective of the opening chapter of this book i.e. the derivation of
equilibrium statistical physics from the principle of maximum entropy the chemical
potential is at root a Lagrange multiplier for a particular constraint, which in physics
is usually associated with a conservation principle. Where there is no conserved
quantity, there is no meaningful, independent chemical potential. Consider again
electronpositron pair production and annihilation (4.10.5), by which is really meant
all processes (4.10.7) and (4.10.8). The electron, positron, and photon number
densities ne, np, n are not conserved quantities; assigning a chemical potential to
each of these quantities individually is not physically meaningful. However, if a
certain number of electrons and positrons is initially introduced into a fixed volume
with reflective walls (or into a system interacting with a photon reservoir), then the
net electrical charge is a conserved quantity, and a chemical potential  e p
(leading to 0) can be associated with a constraint on the difference in fermion
densities nd ne  np.
It is interesting physically and instructive mathematically to work out the statistical physics of this system (a charged or neutral plasma) a little more deeply in the
case of ultra-relativistic electrons and positrons; the photons, of course, are intrinsically ultra-relativistic. By ultra-relativistic is meant that the total energy of a particle
is sufficiently higher than its rest-mass energy (mec2) that it can be approximated by
the asymptotic form of (4.1.2). With this approximation, the integral for the energy
of electrons or positrons is the same as for photons, apart from the term 1 in the
denominator, which distinguishes fermions from bosons. For specified initial

Mother of all randomness II

264

conditions of net electric charge and mean energy, the equilibrium state subsequently
reached by the plasma is described by equations that can be cast in the form




1
1
nd
hc 3
charge
 x
dx
x2 1 x
4:10:11
4g kB T
z e 1 ze 1
0





1
1
x3
u0
hc 3
x
dx
dx
x3 1 x

; 4:10:12
energy
x
4g kB T
z e 1 ze 1
0
0 e 1
4 =15

where u0 is the initial mean energy density, z exp(/kBT) is termed the fugacity, and
g 2 is the degeneracy factor, the same for both the fermions (spin states 1/2) and
photons (helicity states 1). Solution of coupled equations (4.10.11) and (4.10.12)
then leads to values for the chemical potential and temperature T. The second
integral in (4.10.12), which evaluates exactly to 4/15, derives from the photon energy
density; there is no fugacity factor because the chemical potential of the gamma
photon is 0. Although individual expressions for fermion occupation numbers and
energies can be reduced no further than to infinite series, the combined integrals
above deriving from the difference of fermion occupation numbers and the sum of
fermion energies can be evaluated exactly in closed form to yield the coupled
equations
"


#
h
i
3

3 nd hc3
2
3
2
2
kB T kB T

4g
kB T
kB T
"
k B T

11 30

7 7 2

kB T

2

15
4
7

kB T

4 #

15 u0 hc3
:
14 5

4:10:13

4:10:14

The term 11/7 in (4.10.14) includes a contribution of 1 from the fermion energy
density and contribution 4/7 from the photon energy density.
As a practical point to keep in mind, the coupled equations are nonlinear and the
sought-for parameters and T have vastly different magnitudes: the temperature of
the plasma may be at billions of Kelvin, whereas the chemical potential (expressed in
standard MKS units) could be trillionths of a Joule. Thus, solving directly for and
T may not work unless one starts with initial estimates very close to the correct
solution. A workable strategy in that case is to solve the original set of equations
(4.10.11) and (4.10.12) for the temperature T and the fugacity z, whose value is close
to 1, and then determine the chemical potential from kBT ln z. I have tried both
methods and found that the computer (using Maple) implemented both methods
quickly and arrived at the same solutions, although solving the integral equations
was far less sensitive to initial estimates.
Examples of the results are plotted in Figure 4.10 for both charged and neutral
relativistic plasmas. The upper panel for a neutral plasma ( 0) shows the variation

265

4.10 Chemical potential of massless particles


40

Temperature (109 K)

Neutral Plasma ( = 0)
30

20

10
0
0

100

200

300

400

300

400

Chemical Potential (keV)

500

Charged Plasma
(ne - np = 100 n0)

400
300
200
100
0

100

200

Initial Energy Density (103 u0)


Fig. 4.10 Solutions (black points) to the equations of equilibrium for a neutral (upper panel)
and charged (lower panel) electronpositron plasma. Temperatures obtained in both cases
were nearly the same. The chemical potential of the neutral plasma is 0; the chemical
potential of the charged plasma is the difference e  p. Density is expressed in terms of
n0 3 where is the electron Compton wavelength; energy density is in terms of u0 mc2n0.

in equilibrium temperature as a function of initial mean energy density in units of


u0  mec2n0 5.74  1021 J/m3 where n0  3 7.01  1034 m3 is number density
of one particle per cubic electron Compton wavelength  h/mec 2.43  1012 m.
The lower panel shows the variation in chemical potential with initial mean energy
for a charged plasma with a charged particle density of ne  np 100n0. The
associated temperature plot is not given because it is virtually the same as that shown
in the upper panel. With increasing energy, the equilibrium temperature increases,
the fermion fugacity approaches 1 as the chemical potential approaches 0 (appropriate to a gas of radiation), and the ratio of electron (or positron) to photon density
approaches the value

Mother of all randomness II

266

lim

me c2 =kB T !0

,
  2
n
x
x2
4
dx
dx :
x
x
ne
e 1
e 1
3
0

4:10:15

In contrast to the photon which is a massless boson, the neutrino, of which three
kinds or flavors (electron, muon, and tau) are currently known, is a spin-1/2
fermion initially believed to be massless. Observations of neutrino oscillations, i.e.
periodic transitions between neutrino flavor, strongly indicate that neutrinos have
mass, although exact values are not known.34 As inferred from other experiments
(e.g. maximum electron energy in beta decay), the electron-neutrino mass, if nonzero, must be very low, about 1eV/c2 or less. The mass of the hitherto lowest mass
particle known, the electron, is 511keV/c2.
If we assume for the sake of discussion that neutrinos are massless, then the
question posed at the outset for photons can also be asked of neutrinos: is the
neutrino chemical potential zero? Because the neutrino is a fermion, the answer turns
out to be more intricate and more interesting than the case of a massless boson. First,
as weakly interacting particles whose mean free path through lead is about 1 lightyear
( 1016 m), neutrinos ordinarily escape from reactions in which they are produced
terrestrially and therefore have insignificant influence on the thermodynamic equilibrium of laboratory experiments. In such cases, one can take the chemical potential
to be 0. There are exotic conditions, however, such as in the early stages of the
universe or in supernova explosions or within the interior of a neutron star, where
neutrinos become trapped by dense matter. Does 0 then? Note that neutrino
production and absorption, contrary to analogous processes for photons, are constrained by a conservation principle: conservation of lepton number (in the nuclear
weak interactions). The reaction describing neutron beta decay
n ! p e

4:10:16

into a proton, electron, and anti-neutrino cannot produce an arbitrary number of


anti-neutrinos because the lepton number of the right side must equal the lepton
number of the left side, which is 0. Were the preceding reaction to occur in both
forward and reverse directions in a system closed to particle loss or in equilibrium
with a reservoir of anti-neutrinos, the chemical potentials would have to satisfy the
relation
n p e

4:10:17

(where the p now stands for proton, not positron). In general, one would expect ,
and therefore , to be non-zero.
34

Measurement of the neutrino oscillation parameters permit inference of the difference of the squares of neutrino
masses.

267

4.11 Evaluation of BoseEinstein and FermiDirac integrals

The answer, however, depends on what kind of neutrino a real neutrino is. Theory
admits two possibilities: a Dirac neutrino D 6 D , which is distinct from its antiparticle, or a Majorana neutrino M M , which, like the photon, is identical to its
anti-particle. The chemical potential of a Dirac neutrino  is not necessarily
zero, but the chemical potential of a Majorana neutrino must necessarily be zero for
a fundamental reason (the Pauli exclusion principle) different from that which
applies to photons. In contrast to photons, an arbitrary number of which can occupy
a state, neutrinos, like electrons, fill quantum states pairwise with opposite spins. If
the neutrino is its own anti-particle, then there is a non-zero probability that two
neutrinos in a given quantum state can annihilate one another. The number of
Majorana neutrinos, therefore, cannot be a conserved quantity, unlike the number
of Dirac neutrinos, and consequently there can be no chemical potential to impose
this constraint.

4.11 Evaluation of BoseEinstein and FermiDirac integrals


The BoseEinstein integral takes the general form shown in the first line below

k x
xk dx
x e dx
a
1
x
a e 1
1  aex
0

X
X
k x
x n
n1 k x n1
a x e
ae dx
a
xe
dx
0

n0

n0

X
X
an1
an1
yk ey dy k 1

k1
k1
n0 n 1
n0 n 1

4:11:1

X
an
k 1
nk1
n1

in which the second form is obtained from the first by multiplying numerator and
denominator of the integrand by aex. The integral is then worked as follows.

X
x 1
aex n .
 Replace (1  ae ) by the equivalent infinite series
n0

 Change the integration variable x to y x(n 1) to bring the integral into the form
of a gamma function.
 Relabel the dummy index so the sum begins with n 1 and takes the form of a
polylogarithm.
In the special case a 1 (corresponding to chemical potential 0), the integral is
expressible as a zeta function

Mother of all randomness II

268

X
xk dx
1

k 1 k 1:
x
k1
e 1
n
n1

4:11:2

The FermiDirac integral takes a general form differing from the BoseEinstein
integral only by the 1 (rather than 1) in the denominator

k x
xk dx
x e dx
a
a1 ex 1
1 aex
0

a xk ex
0

x n

1 ae dx k 1

n0

X
an
n1

4:11:3

n 1k

and results, by the same procedure, in a sum with terms of alternating signs. The
special case a 1 leads to the relation

X
xk dx
1n1
k 1 k;

nk
ex 1
n1

4:11:4

where the Dirichlet eta function (k) is defined by the sum with alternating signs.
4.12 Variation in thermal photon energy with photon number (E/N)jT.V
The internal energy U of a system in thermodynamic equilibrium is ordinarily
considered a function of entropy, volume, and particle number, from which the First
Law takes the standard form
dU TdS  PdV dN:

4:12:1

The variation in energy with volume at fixed temperature and particle number is then
given by

T
 P;
4:12:2
V
T; N
V
T; N
where use of the Maxwell relation

V
T; N T
V; N

deriving from (4.12.1) allows us to write

T
 P:
V
T; N
T
V; N

4:12:3

4:12:4

Now consider the internal energy as a function of temperature, volume, and particle
number

4.13 Combinatorial derivation of the BoseEinstein probability

U U T;V;N
V
N;
V
T; N
N
T; V

269

4:12:5

which is expressible as shown on the right side because U, V, N are extensive variables
(homogeneous of degree one) and T is an intensive variable (homogeneous of degree
zero). This is an example of Eulers theorem for homogeneous functions. Rearranging the terms in (4.12.5) to isolate (U/V)jT.N and then substituting into (4.12.4)
leads to the expression

!
U

1
P

;
4:12:6

U PV  TV

N
T; V N
T V; N
which is equivalent to (4.4.51) upon replacement of thermodynamic variables by
equivalent statistical expectations.

4.13 Combinatorial derivation of the BoseEinstein probability


Starting with expression (4.5.25), we expand two of the binomial coefficients to write
the probability of filling a given set of r out of s cells with exactly k out of m
indistinguishable particles where s and m are each much greater than 1



rk1
m  k s  r  1
r1
s  r  1


P k;rjm;s
ms1
s1
4:13:1


r k  1 m s  k r ! s!m!

:
s  r! m  k! m s!
r1
X

Re-ordering the factorials in the factor labeled by the symbol X leads to the equation
X

s!
m! m s  k r !
sr mk
;

s  r ! m  k!
m s!
s mrk

4:13:2

where the approximate final expression was deduced in the following way. Consider
just the first quotient which, by definition of the factorial operation, becomes
s!
s  r 1s  r 2    s  sr :
s  r !

4:13:3

r factors

For s very much larger than r, we can treat each of the r factors as approximately
equal to s, leading to the final result in (4.13.3). Applying the same reasoning to all
factorial ratios in X yields the final expression in (4.13.2), which can be factored as
follows

Mother of all randomness II

270

sr m k
s mrk

s
sm

r 

m
sm

k

1
r

1 ms

m
s

k

1 ms

hvik
1 hvikr

4:13:4

where m/s is the mean number of particles per cell.


4.14 Generating function for probability [Pr(Nn k)] of k successes in n trials
We start with the definition
pn;k  Pr N n k

4:14:1

where it is again understood that a success is the occurrence of a run (let us say
of 1s) of length t. Then from relation (4.8.12), it follows that
n

X
4:14:2
f k  f k1
pn;k
1

and the generator we seek [(4.8.14)] is


Fk s

pn;k sn

X
n
X
n1 1

n1


f k  f k1 sn :

Consider next the evaluation of a double sum of the form


!

n
X
X
h sn ;
Q s
n1

4:14:3

4:14:4

which, when expanded, leads to the following expressions for each value of the index n


n1
h1 s1 s2 s3 s4   


n2
h2
s 2 s3 s4   


s3 s 4   
n3
h3
4:14:5


n4
h4
s4   
..
.
Each of the infinite sums in (4.14.5) is easily closed to yield the following pattern

sn

1
s
1
1s
1s

sn

s
s2
s
1s
1s

sn

s2
s3
 s2
1s
1s

n1

X
n2

X
n3

..
.

X
sr1
sr
 sr1
sn
1s
1s
nr

4:14:6

4.14 Generating function for probability [Pr(Nn k)] of k successes in n trials

271

which, when added together, results in the single sum


Q s

1 X
hn s n :
1  s n1

4:14:7

Applying the result (4.14.7) to the generator (4.14.3) yields


F k s



1 X
F k s  F k1 s
f nk  f nk1 sn
1  s n1
1s

F k s1  F s

;
1s
the expression given in (4.8.14).

4:14:8

5
A certain uncertainty

I often say that when you can measure what you are speaking about
and express it in numbers you know something about it; but when
you cannot measure it, when you cannot express it in numbers, your
knowledge is of a meager and unsatisfactory kind: it may be the
beginning of knowledge, but you have scarcely, in your thoughts,
advanced to the stage of science, whatever the matter may be.
Lord Kelvin (William Thomson)1

5.1 Beyond the beginning of knowledge


Here is a question for you: How would the area of a rectangle be distributed if the
lengths of its sides could vary uniformly over the interval from 0 to 1 cm? Would you
expect, for example, to find that the areas of a large number of rectangles created by
random selection of uniformly distributed side lengths were likewise distributed
uniformly over the interval from 0 to 1 cm2? If not, then how?
Here is another question one perhaps of greater import to readers who may be
at an age when they have to think about their cholesterol intake. You have had a
blood test and your physician informs you that the ratio of your total cholesterol
(TC) to high density lipoprotein cholesterol (HDL) one of the current leading
diagnostic indices for cardiovascular heart disease (CHD) is a certain value to be
concerned about. If measurements of TC and HDL separately are distributed
normally (i.e. in a Gaussian or bell-shaped curve) about known mean values and
with known standard deviations, how well would the ratio be known? Or, to phrase
the question differently, by how much would you expect that ratio to vary if
numerous repetitions of the lipid panel test were made on your blood sample?
The answer to that question would certainly influence how concerned you might
be with the one reported outcome. It is unlikely, however, that you will find the
answer by looking at the report sent by the clinical diagnostic laboratory to your
1

Lord Kelvin (William Thomson), from the lecture Electrical Units of Measurement given to the Institution of Civil
Engineers, 3 May 1883; quoted in S. P. Thomson, The Life of William Thomson Vol. 2 (Macmillan, 1919) 792.

272

5.1 Beyond the beginning of knowledge

273

physician, for these reports, in marked contrast to standard practice in physics,


ordinarily do not reveal measurement uncertainties.
Although the first question has to do with geometry and the second with clinical
medicine, they are both representative of a kind of question central to physics and
any other discipline that entails measurement. This seminal question is this: How are
composite measurements distributed?
Most experimental quantities of interest in science, engineering, and medicine are
not measured directly, but are inferred from products and quotients of direct measurements. For example, the acceleration of gravity (g), which countless undergraduates determine in their instructional physics laboratories by means of some kind of
freefall apparatus, is ultimately deduced from ratios of direct measurements of
spatial displacements and temporal intervals. The Hubble parameter (H0), which
has been very much in the news in recent years as an index of both the age and
accelerating expansion of the universe, is determined from the ratio of distance and
recession speed of nonlocal galaxies. The dynamical regime of fluids is determined by
a variety of indices such as the Reynolds number which comprises products and
quotients of length, speed, density, and viscosity. Mechanical properties of materials
are characterized by indices such as Youngs modulus or bulk modulus that represent
ratios of force to deformation. The list of examples is virtually endless.
Individual measurements entering into composite quantities may be regarded as
random variables representative of parent distributions of which the probability
density function (pdf ), cumulative probability function (cpf ), moment generating
function (mgf ), mean, variance, and other statistics are usually known or ascertainable. However, the corresponding distributions and statistics of the composite measurement are in general different from the parent distributions and rarely determined.
Standard monographs and reviews of statistical methodology for physicists, such as
the statistics sections of the Review of Particle Physics published in The European
Physics Journal or the Review of Particle Properties published in The Physical
Review, did not even mention the subject when I began writing this book. In fact,
prior to my own investigations,2,3 I know of no reports of physical measurements in
which such results were rigorously employed or tested. Kelvins note, which began
this section, gives the impression that expressing something in numbers means that
your knowledge is no longer of a meager and unsatisfactory kind, but it would be
illusory to believe that the numerical outcome of a measurement, divorced from its
statistical distribution, constitutes much beyond the bare beginning of knowledge.
Physical measurements are not complete or of scientific value without some reliable
assessment of their uncertainty and associated probability.

2
3

M. P. Silverman, W. Strange, and T. C. Lipscombe, Quantum test of the distribution of composite physical
measurements, Europhysics Letters 57 (2004) 572578.
M. P. Silverman, W. Strange, and T. C. Lipscombe, The distribution of composite measurements: How to be certain of
the uncertainties in what we measure, American Journal of Physics 72 (2004) 10681081.

274

A certain uncertainty

5.2 Simple rules: error propagation theory


To be sure, there are approximate methods for determining the uncertainties of
arbitrary functions of random variables. Consider, for example, the random variable
Z f (X, Y), which is an arbitrary but well-behaved

 function of two random variables
with known means (X, Y) and variances 2X , 2Y . It is to be assumed because this
situation occurs widely in physics and other disciplines that X and Y are independent and therefore uncorrelated, in which case expectations of their products factor
hX  x mY  Y n i hX  x m ihY  Y n i

5:2:1

and the succeeding analysis is made much more tractable. Therefore, expanding
f (X, Y) in a series about (X, Y) and truncating at the second order leads to the relation
fX, Y  f jx , Y f x jx , YX  x f y jx , Y Y  Y

5:2:2
1
1
f xx jx , Y X  x 2 f yy jx , Y Y  Y 2 f xy jx , Y X  x Y  Y ,
2
2


f
2 f
2 f
where the function f and its partial derivatives f x  x
, f xx  x
are
2 , f xy  xy , etc:
all evaluated at the mean values X x, Y Y. The expectation of f (X,Y ), approximated by (5.2.2), immediately yields
1
1
hZi f jx , Y f xx jx , Y 2X f yy jx , Y 2Y :
2
2

5:2:3

Likewise, the expectation of the square of expression (5.2.2) gives








1



hZ 2 i f 2 
f 2x f f xx 
2X f 2y f f yy 
2Y f xx f yy 2f 2xy 
2X 2Y :
x , Y
x , Y
x , Y
x , Y
2
5:2:4
To the same order of approximation, it then follows that






varZ hZ2 i  hZi2 f 2x 
2X f 2y 
2Y f 2xy 
x , Y

x , Y

x , Y

2X 2Y :

5:2:5

The two most common applications of the preceding theory, particularly relations
(5.2.3) and (5.2.5), are to products Z XY and quotients Z X/Y. For the former, in
fact, exact expressions are derivable irrespective of the distributions of X and Y. We
start with the identity
Z XY X Y X  X Y Y  Y X X  X Y  Y

5:2:6

and take the expectation of both sides


hXYi X Y hX  X Y  Y i X Y covX, Y,

5:2:7

where the covariance vanishes if, as assumed, the variates are independent. Upon
squaring (5.2.6) and taking the expectation, all cross terms linear in X  x or Y  Y
vanish, and the expression reduces to

5.2 Simple rules: error propagation theory

D
E
XY2 x Y 2 2x 2Y 2Y 2X 2X 2Y :
The exact variance of the product of independent variates is therefore
E
D
varXY XY2  hXYi2 2x 2Y 2Y 2X 2X 2Y ,

275

5:2:8

5:2:9

which is precisely the result given by the approximation (5.2.5). The two calculations
agree exactly because partial derivatives of order higher than two of the product XY
vanish. The mean and variance of Z XY can be cast in the form
2Z 2X 2Y 2X 2Y 2X 2Y



,
5:2:10
2Z 2X 2Y 2X 2Y 2X 2Y


which simplifies, as shown, for sharp distributions with X , Y << 1.
X
Y
Exact relations for the mean and variance of the quotient Z X/Y of two
independent random variables do not exist in general. From the approximate expressions derived from the series expansion, it is straightforward to show that



2
Z hX=Yi  X 1 2Y
5:2:11
Y
Y


1
2
1
5:2:12
2Z varX=Y  2 2X 2x 2Y 2 2X 2Y ,
Y
Y
Y
which can also be cast into a compact form
2X 2Y 2X 2Y

2Z 2X 2Y 2X 2Y 2X 2Y


2  2 2
2Z
2Y
X
Y
1 2
Y

5:2:13

that reduces for sharply distributed variates to the same relation (5.2.10) as for the
product.
Generalization of the series expansion (5.2.2) to a function Z f (X1, X2, . . . XN) of
more than two independent variates, each with well-defined mean i and variance 2i ,
leads to a simple estimate for variance
2 
N 
X
Z

2i
5:2:14
2Z 


X
i
i1
fXi i g

(under the condition i << i for each Xi) that one frequently finds in elementary
books of data analysis. If the composite measurement Z is a product of powers of
independent elementary variates
n
Y
Z X11 . . . Xnn
5:2:15
Xi i,
i1

276

A certain uncertainty

then
hZi Z 

n
Y

i i ,

5:2:16

i1

and

ln Z 
i

Xi fi g i
from which follows the relation
N
2Z X
2

2i 2i :
2
Z
i
i1

5:2:17

For a composite expression comprising only products and quotients of the elementary variates, all the powers i are either 1 or 1, whereupon (5.2.17) reproduces
the familiar reduced expression of (5.2.10) or (5.2.13).
The rules of standard error propagation theory, such as (5.2.14), (5.2.16), and
(5.2.17), are perhaps so familiar that one rarely expects to see users justify them, but
in fact they may not be adequate for at least three reasons. First, they are approximations derived from a Taylor-series expansion to first order in the variances that
can fail entirely for certain parent distributions that do not have first and second
moments. Second, without knowledge of the actual distribution of a composite
measurement Z, one cannot rigorously associate some degree of confidence or
probability with the uncertainties derived from these rules. To assume, as is often
done, that the standard deviation Z of a composite measurement for example, Z
XY or Z X/Y represents a two-sided confidence interval of 68% is not a priori valid
unless Z is distributed normally, and this is not generally the case even when X and
Y are normally distributed. And third, the approximations obtained from a truncated
series expansion may fail to provide accurate statistical moments of a composite
measurement.
The case of a linear superposition of independent random variables ordinarily
poses no difficulty, since it is subject to the Central Limit Theorem (CLT) provided
the mean and variance of the distributions of the superposed variates exist and
estimates of uncertainty can be made with a normal distribution irrespective of how
the component variates are distributed. Such simplicity does not automatically
extend to products and quotients of random variables.
The field of nuclear physics provides useful systems by which to examine this
important, yet not widely appreciated aspect of statistical analysis. Radioactive
nuclei ordinarily decay by more than one pathway, and the rate of decay by a
particular mode to the total decay rate is termed the branching ratio of that mode.
An example of this phenomenon is furnished by radioactive bismuth (212
83Bi), which
can undergo transmutation by emission of either an alpha particle to become

5.3 Distributions of products and quotients

277

212
thallium (208
81Tl) or a beta particle to become polonium ( 84Po). The branching of the
alpha and beta decays can be used to create parent populations of random variables
from which experimentally derived product and quotient distributions are generated
and compared with theoretical predictions. Later in the chapter I will discuss this
quantum test of the distribution of composite physical measurements.
The study of the statistics of composite measurements is not just an undertaking
of academic interest. At the end of the chapter I will return to the introductory
question regarding the measurement of diagnostic medical indices to voice a concern that I have with the way such tests have been (and are being) conducted and
reported.

5.3 Distributions of products and quotients


Suppose you throw two ordinary dice and want to know the probability that the
product of the resulting numbers is 12. We discussed the problem of tossing two dice
in Chapter 1, but the probability now being sought is different. The solution may be
deduced in a simple way by reasoning as follows. The dice are distinguishable and
therefore the product can arise from the partitions (6,2), (4,3), (3,4), (2,6) and no
others, where (x, y) gives the outcome of the first and second die respectively. If we
represent these outcomes by random variables X and Y whose product is Z, then the
sought-for probability may be expressed as
PZ12 PX2PY6 PX3PY4 PX4PY3 PX6PY2
or, alternatively,
PZ12

6
X
n2
integer 12n 

 
12
PX
PYn,
n

5:3:1

where the condition below the summation sign restricts the summation index to integer
values for which 12 is factorable. If we assume that the numerical outcome of each die
has the same probability 1/6, then Eq. (5.3.1) leads to PZ (12) 4(1/6)2 1/9.
Equation (5.3.1) can be rewritten to express more generally the probability of any
outcome of the product or ratio of two discrete random variables
 
X 
z
PZXYz
PX x y PYy
5:3:2
y

PZX=Yz

PXx yzj yPYy,

5:3:3

where the sum over y may require some restrictive condition, depending on the sets of
numbers (real, rational, integer . . .) to which X and Y belong and their specified ranges.

278

A certain uncertainty

An experimental case to be examined shortly involves the product or ratio of


nuclear decay counts, the parent statistics of which are governed by the Poisson
distribution
PPoikj e

k
,
k!

5:3:4

which gives the probability of k counts in a specified time interval (bin) from a sample
for which the mean count per bin is . A random variable governed by the probability
law (5.3.4) will be designated by Poi (). From Eqs. (5.3.2) and (5.3.3) the distributions of Poi (1)  Poi (2) and Poi (1)/Poi (2), products and ratios generated from
two independent decay modes with mean parameters 1 and 2, are

PZXYzj1 , 2 e1 2

y1
integer z=y

PZX=Yzj1 , 2 e1 2

y1
integer zy

z=y

1 y2
y!z=y!

5:3:5

y
zy
1 2
:
y!zy!

5:3:6

Note that the elements z in the set of products are positive integers and the elements
in the set of quotients are positive rational numbers (ratios of integers). We will see
the consequence of these conditions in due course.
The reasoning leading to Eqs. (5.3.2) and (5.3.3) applies as well to distributions of
continuous variables although the formalism is a little different for we must work
with probability densities. Because the probability that a continuous random variable
takes on precisely a given value is zero, we consider instead the cpf expressed by an
integral over the probability density pZ (z)
z
FZz  PrZ  z

pZz0 dz0:

5:3:7

Thus, for the product Z XY and quotient Z X/Y of independent variates, the
cumulative probability functions equivalent to (5.3.2) and (5.3.3) are

FZXYz FXz=ypYydy

pYy dy

z=y

pXx dx pYy dy

FZX=Yz FXzypYydy

zy

pYy dy


pXx dx 5:3:8

z=y

pXx dx:

5:3:9

The partitioning of the integral in (5.3.8) comes about because the hyperbola xy z
has two segments, one lying in the first (or NE) quadrant and the other in the third

5.3 Distributions of products and quotients

279

(or SW) quadrant for z > 0. Therefore the condition xy < z is satisfied for all points
below the NE segment and all points above the SW segment. Analogous reasoning
with corresponding change of quadrants can be applied to the case for z < 0.
From the definition of the cpf, it follows that the probability that Z falls within the
differential range (z, z dz) is given by
FZz dz  FZz pZz dz

pZz

dFZz
:
dz

5:3:10

The same expression is obtained by application of Leibnizs equation to (5.3.7), as


shown in Chapter 1. Applying (5.3.10) to Eqs. (5.3.8) and (5.3.9), we derive the
probability density of the composite quantity Z in terms of the probability densities
of the parent distributions

pZXYz

pXz=ypYyjyj1 dy

5:3:11

pZX=Yz

pXzypYyjyjdy:

5:3:12

The absolute magnitude in the integrand reflects the condition that the probability
density must always be non-negative.
There is an alternative way to determine product and quotient pdfs that is more
expedient to employ if one is familiar with the properties of the Dirac delta function,
in particular the identity
ax

1
x,
jaj

5:3:13

where a is constant, which is readily demonstrated by integrating both sides with


some well-behaved test function. Consider the pdf of the product Z XY for
illustration.

pZXYz


pXYx,ydxdy

xyz

pXxdx


h

z i
1
z
pYy x y 
dy pXx dx pYy y  dy 5:3:14
x
jxj
x

z

pXxpY


pXxpYyxy  zdxdy
 

jxj1 dx:

In the first line, use of the delta function restricts the region of integration to only
those points (x, y) satisfying the condition xy z and thereby permits unrestricted

280

A certain uncertainty

upper and lower bounds on the integral over the joint probability density pXY (x, y)
pX (x) pY (y), which factors because X and Y are assumed independent. In the second
line, the argument of the delta function is factored into a factor that depends on
the integration variable y and a factor x which is constant within the integral over y.
The identity (5.3.13) is then invoked, which allows the integral over y to be performed
trivially, leading to the final expression in (5.3.14) that is completely equivalent to
(5.3.11) because of the symmetry between x and y in the relation z xy. Indeed, the
delta function in (5.3.14) could have been factored so as to yield precisely (5.3.11).
Because x and y do not occur symmetrically in the quotient x/y, evaluation of the
integral for the pdf

pZX=Yz

pXYx, ydxdy

pXxpYyx=y  zdxdy

5:3:15

 

x=yz

by factoring the delta function to express its argument in terms of either x or y does
not lead to expressions of the same form as did the pdf of the product. Rather, the
factorization




x
x  yz
z
jyjx  yz
5:3:16
y
y
leads to Eq. (5.3.12), whereas the factorization







x
1 z
1
1 z
z x 

y
y x
jxj y x

5:3:17

leads (after change of integration variable u y1) to the different, but equivalent,
form
1
pZX=Yz 2
z

pXxpYx=zjxjdx:

5:3:18

(Keep in mind that the magnitude of the Jacobian must be used in transforming
integration variables because a pdf must be non-negative.) Equation (5.3.18) is also
derivable by means of the previous geometry-based method which starts with the
cumulative probability function.
Once the pdf for the product or ratio of two independently measured quantities is
known, the pdf for a composite measurement of any combination of products and
quotients of independent direct measurements can be constructed by iteration.
Consider, for example, the random variable W XY/Z, which might represent
measurement of the gas constant R PV/T from determinations of pressure, molar
volume, and absolute temperature. Iterative use of (5.3.11) and (5.3.12) then lead to
the pdf for W

281

5.4 The uniform distribution: products and ratios

pWw

 
wz
jyj1 dy
pZz pXYwzjzj dz pZz jzj dz pY y pX

5:3:19

in terms of the pdfs of the component measurements. An example of this more


complex case will be given in an appendix.
At this point, it is instructive to examine (i.e. re-examine) two familiar distributions.
5.4 The uniform distribution: products and ratios
For an example with instructive implications, consider the independent variates X
U1(0,1) and Y U2(0,1) distributed uniformly over the unit interval. I will refer to
U(0,1) as a standard uniform distribution. Recall that the symbolism U (a, b) designates a uniform random variable over the interval from a to b; the subscripts signify
that the two distributions are independent. With reference to the question at the start
of this chapter, X and Y can represent the length and width of a rectangle. The pdf of
a variate X distributed uniformly over the interval b  X  a, is

1
b  x  a,
pXxja, b b  a
5:4:1
0
otherwise

pXxdx 1.
which satisfies the normalization requirement


There is a simple way to express a relation of the form (5.4.1) by means of the
interval function I(a, b)(x),

1
bxa
,
5:4:2
Ia, bx
0
otherwise
which facilitates the evaluation of integrals over restricted regions. Consider, for

  dx
z
example, the integral
Ia, bxIa, b
where b > a > 0, which occurs in the
x jxj


derivation of the pdf of a product of uniform variates. The first interval function
b
  dx
z
reduces the integral to Ia, b
. The remaining interval function restricts the
x x
 a



range to b  xz  a , which is equivalent to 1b  xz  1a and therefore to
integration

z
z
a  x  b . Comparing these limits with the limits imposed by the first interval
function generates the four conditions below.
Lower limit
z

z
 a ) xmin
b
b
z

 a ) xmin a
b

Upper limit
z

 b ) xmax b
az

z
 b ) xmax
a
a

Condition on z
z  ab
z  ab

282

A certain uncertainty

Once the limits are determined, the integral can be easily evaluated
8
z=a

>
>
>
>
>
> dln x ln z  2 ln a z  ab

<
  dx >
z
a
Ia, bxIa, b

:
b
x jxj
>
>
>

>
>
dln x 2 ln b  ln z z  ab
>
>
:

5:4:3

z=b

Substitution of the standard uniform pdfs


pXx I0, 1x

pYy I0, 1y

into (5.3.8), (5.3.9), (5.3.11) and (5.3.12) with application of the preceding reasoning
for evaluating the integrals leads to the following cpfs and pdfs of the area Z XY
FZXYz z1  ln z
and aspect ratio Z X/Y
8
z
>
>
< 2
FZX=Yz
1
>
>
:1 
2z

pZXYz ln z

z1
z1

0  z  1

8
1
>
>
<
2
pZX=Yz
>
>
: 1
2z2

z1

5:4:4

5:4:5

z1

The area probability relations (5.4.4) are not partitioned over the interval (0,1)
because the partition point ab 0 lies at a boundary. In the more general case
X U1(a, b), Y U2(a, b), the resulting expressions
8
zln z  2 ln a  1 a2
>
>
>
<
z  ab
b  a2
FZXYz
2
>
z2 ln b  ln z 1 a  2ab
>
z  ab
>
:
b  a2
8
ln z  2 ln a
>
>
<
b  a2
pZXYz
2 ln b  ln z
>
>
:
b  a2

z  ab
z  ab

change form at a partition point ab within the interval (a, b).


Plots of the probability densities in (5.4.4) and (5.4.5) are illustrated in Figure 5.1
and compared with histograms of 100 000 products and ratios generated by a
uniform random number generator (RNG). The number of events in a histogram
bin of width z centered on z is pZ (z) z. The statistics of the generated (pseudo)
random numbers for the two sets of samples (X and Y) conformed closely to the
theoretically expected values for U(0,1) variates.

283

5.4 The uniform distribution: products and ratios


Sample

Theory

X 0.500, Y 0.501

1
0:500
2
1
p 0:289
12
cov(X, Y) 0

X 0.289, Y 0.288
cov(X, Y) 2.94  104

Frequency per Bin-Width

Z = U(0,1) x U(0,1)

1
0
0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

10

Product

Frequency per Bin-Width

0.6

Z = U(0,1) / U(0,1)

0.5
0.4
0.3
0.2
0.1
0
0

Quotient
Fig. 5.1 Upper panel: histogram of U1(0,1)  U2(0,1) comprising 105 samples partitioned
among 500 bins over the interval (0,1); bin width 0.002. Lower panel: histogram of
U1(0,1)/U2(0,1) comprising 105 samples partitioned among 104 bins over the interval
(0,1000); bin width 0.1. Dashed traces are theoretical densities.

284

A certain uncertainty

The figure shows excellent agreement between the theoretical probability densities
and the computer-simulated distributions of product and quotient.
It is important to note that, although a square is a degenerate rectangle, the
corresponding area distribution cannot be calculated from the foregoing equations
because the two sides of a square are correlated 100% i.e. selection of the length
completely determines the width. The case can be treated, however, in a manner
previously demonstrated by starting with the cumulative probability Pr (X2  z), or
p
p
p
p
5:4:6
FZX2z PrX z  x   z FX z  FX z,
and taking the derivative
pZX2z

p
p
dFZz
1
p pX z pX z:
dz
2 z

5:4:7

This leads to the expressions


FZX2z

p
z

1
pZX2z p :
2 z

5:4:8

From (5.4.8) and (5.4.4) one can demonstrate that 50% of the squares but nearly
60% of the rectangles whose side lengths fall randomly within the range (0,1) have
areas less than 0.25. For squares, this is obviously consistent with the fact that 50%
of the side lengths are shorter than 0.5 (which generates the area 0.25). The different
fraction of rectangles is due to the fact that there are infinitely many combinations of
lengths and widths that lead to a given area but only one length for a square. It may
seem reasonable therefore that this fraction should be larger for rectangles, but
this is not the case for very small areas with z < 0.081. The reason for this behavior
follows from the geometric circumstance that when x is close to 0, say in the region
0.1  x  0, then z x2 falls in the compressed range 0.01  z  0, whereas when x
is close to 1, say in the region 1  x  0.9, then z x2 falls in the stretched range
1  z  0.81.
In developing the statistics of composite measurements, the general procedure is to
begin with the cpf F (z), calculate the pdf pZ (z) dF (z)/dz, and from the latter derive
the moments

mn hZ n i

zn pZz dz

5:4:9

of which the most significant are ordinarily the first few as used in the combinations
of mean, variance, skewness, and kurtosis.
Mean Z m1 :
Variance

varZ 2Z hZ  m1 2 i m2  m21 :

5:4:10
5:4:11

285

5.4 The uniform distribution: products and ratios

Skewness

SkZ

Kurtosis

KZ

hZ  m1 3 i m3  3m2 m1 2m31

:
3Z
3Z

5:4:12

hZ  m1 4 i m4  4m3 m1 6m2 m21  3m41

:
4Z
4Z

5:4:13

To recapitulate some of the points made in Chapter 1, the standard deviation is


usually adopted as the error or uncertainty of a measurement. Skewness quantifies the asymmetry of a distribution about its mean, whereas kurtosis indicates the
degree of flatness around the mean and, correspondingly, the fatness of the tails of
the distribution.
The moments mn (X) 1/(n 1) of the parent distribution of X U(0,1) follow
straightforwardly from Eqs. (5.4.9) through (5.4.13), leading to:
1
2

X ,

X p ,
12

1
5

SkX 0,

KX ,

5:4:14

with identical expressions for the distribution of Y.


As discussed previously, it is not always necessary to know the exact distribution
to determine the moments. If by some means one can determine the moment
generating function (mgf ), then the moments are calculable simply by taking derivatives. In the case of a product of independent variables, however, the procedure may
be even simpler because the expectation operation factors. Consider the product
Z XY. Although one can use (5.4.9) with pdf given by (5.4.4), it is more expedient to
make use of the already determined moments (5.4.14) of U(0,1). Thus,
mnZ hXn i hY n i

5:4:15

n 12

from which follows the set of statistical quantities


Z

1
4

0:250

p
7
12

 0:220

SkZ

p
18 7
49

 0:972:

5:4:16

The estimates by error propagation theory (EPT), Eqs. (5.2.7) (with zero covariance) and (5.2.9) (neglect product of variances), yield the same mean as does (5.4.15)
p
EPT
126  0:204, which is suitably close. However, what
and a standard deviation Z
exactly would it mean to report the outcome of measurements of the rectangular area
as Zexp Z Z 0.25 0.22? To answer this question, one must resort to the exact
cpf to calculate the probability F(Z Z)  F(Z  Z) 69.2% that a subsequent
measurement would fall within the range Z about Z. To assume if one had not
determined the distribution of the composite measurement beforehand that it was
the same as the parent distribution or that it was a normal distribution, could lead
to a very different and incorrect estimate of measurement uncertainty. Now it
so happens that for a normal distribution, the probability Pr (jZ  Zj  Z) is
68.3%, which is quite close to the value just obtained for a standard uniform

286

A certain uncertainty

distribution. However, this is just a coincidence without any deeper significance. If


one had chosen even a slightly different confidence interval, the two results could
have been wildly different. Thus, for a normal distribution Pr (jZ  Z j  1.2Z)
77.0%, whereas for the product of two standard uniform variates F(Z 1.2Z) 
F(Z  1.2Z) turns out to be a complex number (0.93  0.046i) due to the marked
skewness.
The case of the aspect ratio Z X/Y provides another example with striking
contrast between exact and EPT results. The mean and standard deviation of
Z estimated from the EPT equations
(5.2.11)
and (5.2.12) (neglecting the
p

pproduct of
EPT
EPT
variances) are EPT

4=3,

2=3
,
and
therefore

6=4  0:61.
Z
Z
Z
Z
Here is a case where EPT fails entirely because the exactly determined moments
hZni hXnihYni diverge because hYni diverges. One might think that eliminating
0 from the range of Y will improve matters, but this is not so.
Consider the more general parent distribution U (a, b) (b > a > 0) for X and Y in
which case the ratio now falls within the range ba  Z  ab. The cpf and pdf deduced
from (5.3.9) and (5.3.12) are

FZX=Yz

8
>
>
>
>
<

bz  a2
2

2b  a z
>
az  b2
>
>
>
1

:
2b  a2 z

a
1z
b

pZX=Yz

b
z1
a

8 2
b  a=z2
>
>
>
>
< 2b  a2

1z

>
>
b=z2  a2
>
>
:
2b  a2

b
z1
a

a
b

5:4:17
Calculation of the exact moments of the distribution from (5.4.9) leads to a mean


1 1
ln
5:4:18
Z 
2 1
and variance


2Z

1 2


2
1 1
ln 2 ,

4 1

5:4:19

where  a/b. The corresponding EPT relations


EPT



1 1 2
3 1



 2 EPT 2 1  2
Z

3 1

5:4:20
5:4:21

are functionally quite different from Eqs. (5.4.18) and (5.4.19) although they
approach the corresponding exact expressions in the limit ! 1.

287

5.5 The normal distribution: products and ratios

5.5 The normal distribution: products and ratios


Consider next the important case of products and
of independent
 quotients


 measurements represented by random variables X N 1 1 , 21 and Y N 2 2 , 22 where, as a
reminder, the pdf of the normal distribution is
2
1
2
pxj, p ex =2
2
2

5:5:1

with mean and variance 2. Many of the composite measurements one is likely to
make in science entails multiplying or dividing elementary measurements that are
distributed at least approximately normally. For example, the nuclear decay experiments I discussed in a Chapter 3 involved Poisson-distributed random variables, but
the Poisson distribution is well approximated by a normal distribution for sufficiently
large mean.
Figure 5.2 gives a graphical overview of what typical distributions of the
product and ratio of normal variates might look like. The second and third histograms were constructed respectively from 50 000 independent pairs of N(4,1) and
N(8,1) variates drawn from a Gaussian RNG. The ratios of the pairs make up the
high, narrow first histogram, and the products of the pairs comprise the low, broad
fourth histogram. Solid lines enveloping the histograms were calculated from theoretically exact expressions that will be discussed shortly. Visually, the histogram of
0.8

N(8,1)/N(4,1)

Probability Density

0.7
0.6
0.5

N(4,1)

0.4

N(8,1)

0.3
0.2

N(8,1)xN(4,1)

0.1

10

15

20

25

30

35

40

45

50

55

Outcome
Fig. 5.2 Panoramic display of histograms of X/Y, X  Y, and parent distributions X N(8,1),
Y N(4,1) with superposition (solid) of respective theoretical densities. Parent histograms
comprise 50 000 samples from a Gaussian RNG with observed covariance cov(X, Y)
1.173  103. Outcomes are distributed in 1000 bins over the interval (0, 100).

288

A certain uncertainty

quotients looks sharper and more asymmetric, with marked skewness to the right,
than the parent distributions. The product histogram is much broader than the
parent distributions, but, at the scale of the figure, still resembles a normal distribution with barely any skewness. The quotient and product histograms look centered
more or less at the respective numerical quotient and product of the means of the
parent distributions, namely 8/4 2 and 4  8 32. All four histograms have unit
area in accord with the completeness relation for probability.
We will examine first the product Z XY. Substituting the pdf (5.5.1) for X and
Y into (5.3.11) yields the product pdf
z
2 .

2 22
2
1
2 
2
ex1 =21 e x
jxj1 dx
pZXYz
2 1 2

2 8 (

2 )9 3
>
z  1 2 2 2
1
>
>
>
 w
w w2 >
>7
>
6 >
=
<

7
6
1
2
2
1
1
7  dw  ,
6exp 

2
7 
6
>
2 1 2 4 >
1
>
>
5  1 w
>
>

>
>

w


;
:
1
1
5:5:2
where the expression in the second line results from the transformation w (x  1)/
1. Equation (5.5.2) is exact and cannot, to my knowledge, be reduced further to
some recognized special function for general values of the parameters. However,
under the condition that the parent distributions are sharp
 (i/i) >> 1 (i 1, 2), one
can neglect the integration variable w in each factor 11 w because the entire
integrand, which decreases exponentially with w, will have become negligibly small
when w is of comparable size to 1/1 or 2/2. [Note that the distribution of Z XY is
actually symmetric in the statistical parameters of each factor and one could have
begun the calculation by integrating over y, rather than x in the first line of (5.5.2).]
This approximation allows for completing the square in the numerator of the
exponential to obtain, after some algebraic rearrangement and integration of a
Gaussian density, a product pdf of normal form



2
1
i
2
hZi 1 2
>>
1
5:5:3
pZXYz p ezZ =2 Z
2Z 21 22 22 21
i
2 Z
with mean and variance the same as that of error propagation theory.
In cases where the approximation leading to (5.5.3) does not hold, the pdf of a
product of normal variates can differ markedly from a Gaussian. Consider,
 for

2
example,
the
special
case
of
normal
variates
of
zero
mean,
X

N
1 0, 1 ,
 2
Y N 2 0, 2 , where the variables span the entire real axis. The first expression in
Eq. (5.5.2) yields the density
 
1
 z cosh u
e 1 2
du
5:5:4
pZXYz
1 2
0

289

5.5 The normal distribution: products and ratios

Frequency per Bin-Width

N(0,1) x N(0,1)

0.8

0.6

0.4

0.2
0
4

Frequency per Bin-Width

Product

0.3

N(0,1) / N(0,1)

0.2

0.1

0
6

Quotient
Fig. 5.3 Histogram of N1(0,1)  N2(0,1) (upper panel) and N1(0,1)/N2(0,1) (lower panel)
comprising 50 000 pairs of samples from a Gaussian RNG partitioned among 2000 bins over
the range (20, 20); bin width 0.02. Solid traces are theoretical densities.

after a transformation of variables u ln (y22/z1). The upper panel of Figure 5.3


shows a plot of (5.5.4) for standard normal variates (1 2 1). This plot might
bring to mind a Cauchy distribution, but the function is actually a modified
Bessel function of the second kind (which we encountered before in deriving
the distribution of autocorrelation in Chapter 3) of which one representation is
the integral4

M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions (Dover, New York, 1972) 376.

290

A certain uncertainty

12 z
K z
ez coshu sinh2udu:
12
1
2

5:5:5

Thus, (5.5.4) can be expressed as a modified Bessel function of zeroth order


 
K 0 1z2
pZXYz
5:5:6
1 2
once one recalls that 12 . The pdf diverges at z 0, but gives rise to finite
moments; the mean is 0 and the standard deviation is z 1, again consistent with the
exact relation (5.2.9) for the variance of a product of any two independent, wellbehaved random variables.
For visual comparison, the lower panel of Figure 5.3 shows a true Cauchy
distribution arising from the ratio of standard normal variates, an interesting case
that we will come to shortly. We will also encounter the Cauchy distribution again
later where it is relevant to the novel use of the statistical principles of this chapter in
determining the decay rates of radioactive nuclei and other unstable quantum
systems.




The moments of the distribution of N 1 1 , 21  N 2 2 , 22 can be evaluated
directly and exactly from the moments of the factors

5:5:7
mZn  zn pZXYzdz xn pXxdx yn pYydy mXn mYn ,
1
2

where evaluation of the Gaussian-type integrals of each variate (i 1, 2) yields


8
k0
n   k
< 1
X

n
i
mnXi ni
5:5:8

0
odd k :
k
:
i
k0
k-1!!
even k  2
The first three moments
m1 1 2



m2 21 21 22 22



m3 1 2 21 3 21 22 3 22
lead to the variance
V 21 22 22 21 21 22 ,

5:5:9

(which is precisely Eq. (5.2.9)) and skewness

Sk 

61 2 21 22
21 22 22 21 21 22

3=2 :

5:5:10

5.5 The normal distribution: products and ratios

291

Although Z XY is symmetric in its factors, each of which is distributed according to


a symmetric pdf, the product distribution itself has in general a nonvanishing
In  the case of equal means (
p1 2  ) and equal variances
skewness.
21 22  2 , the skewness reduces to 3= 2= in the limit (/) << 1. The
skewness vanishes exactly only in the case where one (or both) of the product
distributions has zero mean.
Turn next to the quotient pdf pZ X/Y(z), which upon substitution into (5.3.12) of
the pdfs for X and Y normally distributed over the entire real axis,
1
pZX=Yz
2 1 2

ezy1

=2 21 y2 2 =2 22

jyjdy,

5:5:11

yields a closed-form expression


! 1  2z2 !

  2
2
1 22 z 2 21
pZX=Yz p
e 2 2 z 1

3=2
2 2 22 z 21
"
!
!#
1 22 z 2 21
1 22 z 2 21
 erf p

1=2  erf  p

1=2
2 1 2 22 z 21
2 1 2 22 z 21


 2

1 12
1 22

exp  2 2
22 z 21
1 2

5:5:12

that looks rather intimidating. The error function erf(x) a transformation of the
Gaussian cumulative probability function is defined by the expression
x
2
2
erfx p eu du

5:5:13

and has limiting values erf() erf() 1.


To go from (5.5.11) to (5.5.12) requires a fair amount of work, but perhaps the
most economical way to proceed is to re-express the exponent in (5.5.11) simply as
expfay  b2 cy  d2 g. Expanding the terms in the brackets and completing
the square then leads to a form
2

 adbc
a2 c2
2 1 2


2
ab cd
a2 c2 y  2
a c2 jyjdy,
e

which exposes clearly the important combinations of parameters. Next, account for
the absolute-value restriction by re-writing the integral as two integrals both with
range of integration (0, ). And last, transform the integration variable to

q
ab cd
w a2 c2 y 2
,
a c2

292

A certain uncertainty

which again doubles the number of integrals: one involves an exact differential and is
immediately integrable in closed form; the other takes the form of an error function.
The four integrals resulting from (5.5.11) can be combined to yield (5.5.12).
If X and Y are sharply distributed with i /i >> 1 (i 1, 2), the error functions in
(5.5.12) (second line) reduce to 1  (1) 2, the exponential function (third line)
becomes negligible, and Eq. (5.5.12) can be approximated by the non-Gaussian
expression
(
!)
!
1
1 22 z 2 21
1  2 z2

pZX=Yz  p 
,
5:5:14
exp   2

2 2 z 21
2 2 z 2 3=2
2

which actually transforms into a standard normal distribution under a change


2 z1
. The sample mean resulting from N measurements of
22 z2 1
Z X/Y is then a random variable of the type N(0,1/N), a result that shall prove useful
later. A less accurate approximation is to replace z by the empirical ratio 1/2 in the
prefactor and denominator of the exponential in (5.5.14), whereupon pZ X/Y(z)


2
2 2
reduces to the normal distribution N Z , 20Z with Z 1/2 and 20Z 21 1 4 2 ,
2
2
which is the EPT variance.
Figure 5.4 shows in more detail (than Figure 5.2) a test of the exact theoretical
expressions for the ratio (upper panel) and product (lower panel) of normal variates
X N(8,1), Y N(4,1) by a computer simulation of 50 000 samples drawn from a
Gaussian RNG. The solid line enveloping the histogram in each panel is the theoretical prediction; the dashed line is the corresponding Gaussian approximation.
In general, the quotient pdf (5.5.11) leads to distributions markedly different from
a normal
Consider
 distribution.


 the ratio Z X/Y of the same two normal variates
X N 1 0, 21 and Y N 2 0, 22 whose product we calculated earlier. Substitution of
the pdfs for X and Y into Eq. (5.3.12) produces an integral that is readily evaluated to
yield a Cauchy probability density

of variable z 

1
2

pZX=Yzj0,

1
1 z=2

1 = 2

5:5:15

centered on z 0 with width parameter 1/2. The lower panel of Figure 5.3 tests
the theoretical density (5.5.15) for 1 against a histogram generated from 50 000
pairs of samples from a Gaussian RNG.
The Cauchy distribution has been mentioned several times so far under various
circumstances, and it is pertinent at this point to examine its properties more
thoroughly. Of especial interest is the fact that it is a distribution to which the
Central Limit Theorem (CLT) does not apply. Although centered on 0 with a welldefined width parameter , the mean, variance, and moment generating function
(mgf ) resulting from (5.5.15) do not exist. The characteristic function (cf ), which
does exist,

293

5.5 The normal distribution: products and ratios


0.8

Probability Density

N(8,1) / N(4,1)
0.6

0.4

0.2

0.5

1.5

2.5

3.5

4.5

Quotient
0.06

Probability Density

0.05

N(8,1) x N(4,1)

0.04
0.03
0.02
0.01

10

20

30

40

50

60

Product
Fig. 5.4 Histograms of the ratio (upper panel) and product (lower panel) of parent
distributions X N(8,1), Y N(4,1) with respective theoretical densities (solid). Two
theoretical curves, one from the defining integral (5.5.11) and the other from the closed-form
expression (5.5.12), overlap to produce the solid trace in the upper panel. Dashed traces show
Gaussian densities of corresponding mean and variance.

hZt heiZt i e jtj

5:5:16

helps explain why. Recall that the moments are given by the coefficients in the Taylor
expansion of the cf
  t2
  t3  4  t 4
hZt 1 ihZit  Z 2
 i Z3
Z
...,
2!
3!
4!

5:5:17

294

A certain uncertainty

where the nth moment is calculated by taking the nth derivative of hZ (t) and then
setting the expansion variable t 0. However, at t 0 the function (5.5.16) presents a
cusp and the first derivative does not exist (since it is at 0 and  at 0).
If the mean does not exist, then neither does the variance. However, the cf has a
unique second derivative, but this leads to hZ2i 2, which is not physically
meaningful because the mean square of a real-valued variate must be positive. In
fact, the calculation of hZ2i directly by means of the pdf
3
2
6

2
2
2 6
2
u
2
1
6
2
2
z pZzdz
du
du
hZ i
6 du 
1 u2
6
1 u2
40

0
0

7
7
7
7
7
5

5:5:18

leads to an infinite result. Thus, although the cf exists, it does not lead to acceptable
first and second moments.
There are physical as well as mathematical consequences to the fact that there is
no mean or variance to a Cauchy distribution. In contrast to distributions subject to
the CLT in which the variance of the mean of N measurements is smaller by a factor
N1 than the variance of one measurement, one obtains no greater precision by
making N measurements than in making one measurement of a Cauchy-distributed
quantity. I have found that many scientists, including physicists, are surprised to
learn this. A heuristic explanation is that the more samples one draws from a Cauchy
distribution, the more often there will occur values from the fat tails, and the
fluctuations in the mean of the measurements will lead to a resultant of no greater
predictive value than that of a single sample.
Nevertheless, a Cauchy distribution does have a well-defined median, the moments
of which are deducible from the pdf of the corresponding order statistic. Suppose, for
example, that N independent measurements fxi i 1. . .Ng were made of a Cauchydistributed variate, where N is an odd integer. The median is then the middle order
statistic Ym with m (N 1)/2 for which the pdf, as determined in Chapter 1, is
pY my

N!

N1
2

!

fFy1  Fyg

N1
2

py:

5:5:19

(If N were even, then the median would be the average of the two middle order statistics.)
Substituting into (5.5.19) the pdf of a Cauchy random variable Y Cau (,) (see
Table 3.1) of location parameter and scale parameter
1
pCauyj,
5:5:20
1 y  =2
and its associated cpf
y
FCauy


pCauy0 j, dy0



1 1
y
tan1
2

5:5:21

295

5.5 The normal distribution: products and ratios


0.5

(d)

Probability Density

0.4

(c)

0.3

0.2

(b)

0.1

(a)

0
10

10

Median
Fig. 5.5 Probability density of the median of a Cauchy variate Cau(0,a) (black) and
corresponding Gaussian density N(0, (a/2)2 N) (gray) as a function of a for sample size N :
(a) 1, (b) 3, (c) 5, (d) 11.

yields the pdf of the median


"
pY my 

N!
 2
!

N1
2

1 1

4 2


 #N1
2
y 2
tan


:
 2

y
1

1

5:5:22

Complicated though the preceding expression might appear, one sees at once from
plotting it for (odd) sample size N > 1 that (5.5.22) looks very much like a Gaussian


distribution in the limit of large N. See Figure 5.5. Indeed, expansion of ln pY mu ,
where u (y  )/, with truncation at order u2 leads to the simple Gaussian
expression
2
2
1
eu =2 YM
pY mu  p
2 Y M

Y M p
2 N

5:5:23

centered on u 0 with a standard deviation Y M that decreases with sample size as


N1/2 (as expected). Figure 5.5 compares the exact Cauchy and approximate
Gaussian distributions for several low values of N to show how rapidly (5.5.22)
approaches (5.5.23). Beyond N 11, the two distributions are virtually
indistinguishable.

296

A certain uncertainty

5.6 Generation of negative moments


In principle, the moments of Z X/Y can be obtained directly by numerical integration using the exact form (or an approximate form) of the pdf pZ (z), or, since X and
Y are assumed to be independent variates, the moments can be calculated, as in the
case of the product Z XY, by multiplying the moments of the component factors

n
n
5:6:1
mn  z pZX=Yzdz x pXxdx yn pYydy mXn mYn
evaluated with the parent pdfs. The latter case, however, requires calculation of the
negative-power moments of a random variable.
The question of negative-power moments of random variables is one that I have
rarely found mentioned in statistical references. And yet, there are instances in which
the need for such moments arise naturally. For example, in the problem of waiting
times (which constituted one of the tests of the randomness of nuclear decay discussed previously) one may be interested in the statistics of the number N of trials
required to achieve a specified number r of successes for constant probability of
success p at each trial. If not known beforehand, the probability p can be estimated
by its maximum likelihood value ^
p given by the expectation ^p hr=Ni rhN 1 i,
which requires the first negative moment of N. For another example (introduced in
Chapter 1), the variance of a Student t variate, T d1/2U/V, where U N(0,1) and
V 2 2d are independent random variables, is given by 2T dhU 2 ihV 2 i, which
requires the second negative moment of V actually to be calculated as the first
negative moment of V2.
A general procedure for calculating the negative moments of a random variable
Y again makes use of the moment generating function gY (t). Recall that the positive
n
nth moment is obtained by calculating the nth derivative gY 0. This suggests that the
negative nth moment may be obtained by the inverse procedure i.e. an n-fold integral
and, indeed, this can be shown to be the case.5 The exact expression takes the form
n

t1

hY i

dt1


dt2 . . .

tn1

1
gYtn dtn
tn1 gYtdt,
n

5:6:2

where the gamma function (n) (n  1)! for integer argument. Underlying the first
equality in (5.6.2) is an assumption that one can interchange the order of integration
over any t-variable and the variable y occurring in the definition of the mgf
gYt he

Yt

pYyeyt dy:

5:6:3

N. Cressie, et al., The moment-generating function and negative integer moments, The American Statistician 35 (1981)
148150. I learned of this reference in 2010, long after I had worked out the method for myself. Prior to that, I had the
vanity to think I may have been the first to discover it. Such is life . . . aptly expressed in the Ecclesiastic maxim Sub sole
nihil novi est.

297

5.6 Generation of negative moments

The basis for the transition from the multiple integral in the first equality to the
single integral in the second becomes more transparent if one examines a diagram of
the integration region for the simple cases of n 1, 2 and judiciously transforms the
integration variables and limits. It would be seen then how each integration over a
t-variable contributes one factor of t to the integrand of the second integral. The
occurrence of t, rather than t, in the argument of the mgf may be understood by
following the steps in the derivation of the first negative moment
2
3

0
0

1
yt
hY 1 i  y pYydy 4 e dt5pYydy dt eyt pYydy
0

0
gYt

gYtdt ! gYtdt gYtdt:


t!t

5:6:4

0
1

For illustration, consider the expectation hN i where N is the number of


trials (waiting time) to achieve r successes with probability of success p and failure
q 1  p. The variate N follows a negative binomial distribution whose mgf we have
already determined in Chapter 3 to be

r
pet
Nt
gNt he i
:
5:6:5
1  qet
Substitution of (5.6.5) into (5.6.2) leads to the negative nth moment
  

 t r
1
pe
1 p rX
m  1 m n1 mt
n1
q t e dt
hN i
t
dt
n
n q mr m  r
1  qet
0
 r0X



p
1 m1 m

q :
q mr mn m  r
n

5:6:6

The second equality follows from expansion of the parenthetic expression in a


negative binomial series. Recasting the integral in the form of a gamma function
cancels the gamma function in the denominator and leads to the factor mn in the
third equality. For the special case of r 1 (waiting time between first occurrences)
and n 1 (first negative moment), (5.6.6) reduces to the series

 X
p qm p
q2 q3 q4 . . .
q
hN i
2
3
4
q mr m q
p
p
ln p,
 ln1  q 
q
1p
1

5:6:7

which is easily summed by recognizing it as the Taylor series expansion of the natural
logarithm ln(1  q).

298

A certain uncertainty

Generally speaking, one must be careful to avoid the error of confounding the
mean hYni with the reciprocal hYni1. Nevertheless, curious to compare the estimate
^p obtained from solving (5.6.7) with the alternative estimate p 1=hNi, I generated
with an RNG 50 000 Poisson variates of specified mean and determined the
intervals between occurrences of a previously designated target value X. For example,
for X 100, the theoretical Poisson probability is pX 0.039 86. In one
experiment the mean interval between occurrences of X was hNXi 24.9406, from
which followed pX 1=hN X i 0:040 10. The corresponding mean reciprocal interval
was hN 1
X i 0:1377, which when substituted into (5.6.7) yielded the solution
^p X 0:041 47. Interestingly, in trying numerous values of and X, the values of pX
always came out a little closer than ^
p X to the theoretical Poisson value pX irrespective
of whether the specified mean or actual sample mean was used to calculate pX.
Consider next the negative moments of a chi-square variate V 2 2d of d degrees of
freedom for which the moment generating function is
gV 2t 1 2td=2 :

5:6:8

Substitution into (5.6.2) yields the expectation


D
E
V 2 n

1
2n d=2
d=2
n1
t 1 2t
dt
u
u  1n1 du
n
n
0

1 d  n
2n 2 1
2 d

5:6:9

d > 2n

which reduces to
D
E
V 2 1

1
d2

hV 2 2 i

1
d  2d  4

5:6:10

for the lowest two moments.


Combining the positive even moments of a standard normal variate U N(0,1)
r 

E

D
2
2
1
2m
2m x2 =2
m12
p x e
U
dx 2
m
2

5:6:11

(since the odd moments vanish) with the negative moments of a chi-square variate,
1=2
we obtain the nonvanishing (even) moments of the Student t variate T d V U
d m 2m m 12 2m 12 d  m
p


12 d
1
1
m 22 d  m 12 d 12
 1
dm
12 d 12
2 d12

m 1, 2, 3, . . .
Bm 12 , 12 d  m
dm
,
1
1
B2 d, 2
d > 2m

hT 2m i

5:6:12

299

5.7 Gaussian negative moments

p
where the relation 12 was employed in the first line of (5.6.12). Using (5.6.10)
and (5.6.11) we obtain the variance of the t distribution


D ED
E
D E
1
d
2
2
2
2 1
V
d1
T T d U

5:6:13
d2
d2
by a different method than that employed in Chapter 1.

5.7 Gaussian negative moments


Upon substitution in (5.6.2) of the mgf gt et t of the normal variate N (, 2),
one will find that the integral does not converge. Nevertheless, the relation is still
useful for estimating negative moments of normal variates for which /
1, a
situation encountered frequently in measurement and analysis of physical systems.
After all, a finite sample of real-valued physical measurements cannot have infinite
moments.
2 2
Since the lack of convergence is due to the rapid growth of the factor e t =2
overtaking the exponentially decreasing factor et, a sensible procedure is to retain
the latter and expand the former in a Taylor series, which leads to the expression
1
2

2 2

1
1 X
=2k n2k1 z
n1 t 2 t2
hY i
t e
dt n
z
e dz
n
n k0 2k k!
n

1
2

n2k

1X
n 2k  1!
:
n
2k=2k
k0
k!n  1!

5:7:1

Although (5.7.1) still diverges for an upper limit of kmax , it can provide useful and
practically stable estimates for a suitably chosen finite kmax. A practical convergence
criterion, that the (k 1)th term be smaller than the kth term, yields an inequality
=2 <

2k 1
n 2kn 2k 1

5:7:2

that helps gauge how slowly the sum diverges.


It is instructive to look at the variation in first three negative moments hYni as a
function of upper limit kmax for parameters 10, 1 and 2, yielding the two
ratios / 0.1 and 0.2. From the inequality (5.7.2), one would expect non-diverging
values of negative moments n 1, 2, 3 for truncation of the sum at terms kmax  50,
49, 48 for / 0.1 and kmax  13, 12, 10 for / 0.2. Table 5.1 bears out this
expectation. The item hY~ni in the table will be explained shortly.
Clearly, one could expand the sum in (5.7.1) to at least order (/)8 and obtain
consistent results. It is to be noted that the EPT estimate (5.2.11) of hY1i obtained
by setting hXi 1 corresponds to truncation of (5.7.1) at order (/)2.

300

A certain uncertainty

Table 5.1 Negative moments of N (,2)


Moment hYni

kmax 10

kmax 20

kmax 30

kmax 40

hY~ni

/ 0.1 n 1
/ 0.1 n 2
/ 0.1 n 3
/ 0.2 n 1
/ 0.2 n 2
/ 0.2 n 3

0.101
0.0103
0.001 065
0.105
0.0116
0.001 369

0.101
0.0103
0.001 065
0.105
0.0116
0.001 45

0.101
0.0103
0.001 065
0.111
0.0471
0.109

0.101
0.0103
0.001 065
142
1136
4605

0.101
0.0103
0.001 065
0.105
0.0116
0.001 369

We can also attack the problem of negative moments by means of the characteristic function (cf ) which, in contrast to the mgf, leads to useful expressions in terms of
the real part of a convergent integral
9
8
=
<
1 2 2
1
n
n
n1 2 z
~ i
Re
i
hY
,
5:7:3
cos
z

i
sin
zz
e
dz
;
:
n n
0

where the tilde above the symbol Y distinguishes the moments from those calculated
from (5.7.1). Eq. (5.7.3) leads to the following explicit expressions for the first four
negative moments
 

1 2 2
1

z
1
2
~
hY i
sin z e dz

 
1 2 2
z

~ 2 i 1 z cos z e 2
hY
2

dz

1
~ 3 i 1 z2 sin z e 2
hY
23

 2

z2

dz

0
 

1 2 2
1

z
4
3
2
~
hY i 4 z cos z e dz
6

5:7:4

the numerical values of which were included in Table 5.1 as a standard of comparison
with the values obtained by series truncation.
The relation between the moments hYni and hY~ni for the relatively large ratio
/ 0.25, is illustrated in Figure 5.6 as a function of cut-off kmax for n 1, 2, 3. The
two modes of estimating the negative moment of a normal variate yield the same
numerical values (to three decimal places) over a wide range of cut-off limits that
satisfy the convergence criterion.

301

5.7 Gaussian negative moments

Moment m-1

0.16

n=1

0.14
0.12
0.1
0.08
0

10

15

20

15

20

15

20

Moment m-2

0.2

n=2
0.1

0
0.1
0

10

Moment m-3

0.4

n=3
0.2

0
0

10

Cut-O kmax
Fig. 5.6 Plot of the variation in negative moment mn hYni of normal variate Y N(10, (2,5)2)
for n 1 (top panel), 2 (middle panel), and 3 (bottom panel) as a function of cut-off kmax in
Eq. (5.7.1) (solid) and corresponding moment hY~ni calculated from the characteristic function,
Eq. (5.7.3) (dashed).

When I first published these results, a few readers could not get past the point that
a normal random variable theoretically has no finite negative moments, despite this
point having been made explicitly in the published papers. It may therefore be
worthwhile to stress again here, lest a mathematical purist begin to reach for his or
her keyboard, that although mathematical expressions may manifest singularities,
finite real physical systems generally do not. The task of a practically minded
physicist, in contrast to a mathematicians, is to devise ways of drawing information
from the real world of things, not the hypothetical world of numbers. In such cases,

302

A certain uncertainty

expediency may take precedent over rigor. Lord Rayleigh (John William Strutt) said
this aptly in his preface to The Theory of Sound6
In the mathematical investigations I have usually employed such methods as present themselves naturally to a physicist. The pure mathematician will complain . . . of deficient rigour.
But to this question there are two sides. For, however, important it may be to maintain a
uniformly high standard in pure mathematics, the physicist may occasionally do well to rest
content with arguments which are fairly satisfactory and conclusive from his point of view. To
his mind, exercised in a different order of ideas, the more severe procedure of the pure
mathematician may appear not more but less demonstrative.

Substitution of the series (5.7.1), truncated at (/)8, into (5.6.1) yields the
following
for moments of the ratio of two normal variates

 explicit
 estimations

N 1 , 21 and N 2 , 22 for condition (2/2) < 1:
 "
 2
 4
 6
 8 #
1
2
2
2
2
1
m1 
3
15
105

5:7:5
2
2
2
2
2
 2
"
 2
 4
 6
 8 #
1 21
2
2
2
2
13
15
105
945

5:7:6
m2 
2
2
2
2
22
"
  2
 2
 4
 6
 8 #
1 1 3 21
2
2
2
2
45
420
4725

5:7:7
m3 
16
2
2
2
2
32
 4
"
 2
 4
 6
 8 #
1 621 21 3 41
2
2
2
2
m4 
1 10
105
1260
17325
:
2
2
2
2
42
5:7:8
We will see shortly the utility of these estimated negative moments and the values
obtained from the integral expressions (5.7.4).
Table 5.2 compares the values of the first four moments, standard deviation,
skewness, and kurtosis of the quotient distribution N(10,1)/N(5,1) obtained from
50 000 samples of a Gaussian RNG with corresponding statistics calculated by means
of the Gaussian mgf (for positive moments) and cf (for negative moments), which,
according to the convergence criterion, gives results equivalent to the series expansion in (2/2) to eighth order at least. The match between theory and computer
experiment is impressive.
Nevertheless, some cautionary remarks are in order. In the real world of physical
things, the samples one draws are unlikely to be distributed exactly like a Gaussian
random variable, and therefore negative moments or, more generally, the moments
of products and quotients of normal variates that represent composite measurements will ordinarily be finite. To the extent, however, that one is actually dealing
6

J. W. S. Rayleigh, The Theory of Sound Vol. 1 (Dover, New York, 1945) xxxv. [First Edition 1877.]

303

5.7 Gaussian negative moments

Table 5.2 Moments of Z N (10,1)/N (5,1)


Moments

Computer simulation (Gaussian RNG)

Theory (mgf & cf )

m1
m2
m3
m4
Variance
Standard deviation
Skewness
Kurtosis

2.091
4.662
11.237
29.992
0.289
0.538
1.790
11.365

2.092
4.669
11.278
30.101
0.291
0.539
1.856
10.146

Table 5.3 Moments of N (10,1)/N (5,1) as a function of sample size

m1
m2
m3
m4

Theory

50 000

100 000

150 000

200 000

500 000

1 000 000

2 000 000

2.092
4.669
11.278
30.101

2.092
4.662
11.226
29.962

2.093
4.681
10.920
41.690

2.094
4.675
11.284
30.229

2.091
4.657
11.216
30.041

2.092
4.678
11.883
69.686

2.092
4.667
11.32
32.458

2.092
4.666
11.343
33.698

with normally distributed variates drawn, for example, from an acceptable pseudorandom number generator the evidence of divergence of the moments will eventually show up in samples of sufficient size. This occurs because the larger the number
of trials, the more likely there will occur outlying values in the tail of the distribution.
These have low probability, but in the aggregate are responsible for the divergence of
the moments hYni. Moreover, the higher the order n, the more prone is the moment
hYni to diverging.
Table 5.3 summarizes the moments of the quotient N(10,1)/N(5,1) obtained by
drawing two sets of Gaussian variates from samples of increasing size. The ratio
2/2 0.2 of the denominator distribution is small enough that the sample
moments m1, m2, m3 of the quotient distribution remained finite, stable, and close
to the theoretical estimate for sample sizes in the millions. The moment m4, however,
although in agreement with theory for most samples, was too high for sample
sizes of 100 000 and 500 000. There was nothing special about these sample sizes.
The moments of the sample are themselves random numbers formed by quotients of
random numbers. Had samples been drawn again (which, in fact, was done), the
fourth moments might well have disagreed with theory for some other choices of
sample size.

304

A certain uncertainty

Table 5.4 Moments of N (10,1)/N (4,1) as a function of sample size


Theory
2.704
m1
m2
8.240
m3 28.811
m4 107.103

1000

5000

10 000

2.693
8.047
27.383
108.751

2.701
2.684
8.532
8.442
46.156
8.383
868
1319

25 000

50 000

2.701
2.712
8.272
8.763
32.653
82.132
236
6265.000

100 000

150 000

2.515
3,440
6.3  107
1.2  1012

2.702
13.686
2,193
2.4  106

Evidence of divergence of the moments shows up in samples of smaller size for


larger values of the ratio 2/2. Indeed, no finite estimate of the moments exists when
2/2 is sufficiently large to violate the convergence criterion for the cut-off limit to
the sum in (5.7.1). Consider, for example, the moments of the quotient N(10,1)/N(4,1)
for which the parametric ratio of the denominator distribution is now 2/2 0.25.
The computer experiment summarized in Table 5.4 now shows evidence of divergence of moments m2, m3, m4 as the sample size increases from a few thousands to a
few hundred thousands.
The lesson to be gained, therefore, is this: in estimating moments of composite
measurements formed by dividing elemental empirical quantities representable as
independent normal variates, one must be mindful of the sample size, as well as the
mean and variance of the divisor. Nevertheless, in most instances where composite
measurements are involved and they occur ubiquitously usually only the first and
second moments, i.e. the mean and variance (or standard deviation), are of practical
concern. These two moments are least affected by large sample sizes, provided of
course that the means of the denominator distributions are sufficiently well defined
and non-zero. Moreover, a too large sample size is unlikely to be an impediment in
most practical applications (especially in industry or medicine) where economic constraints may lead to sample sizes that, if anything, are too small or barely adequate.

5.8 Quantum test of composite measurement theory


For the truth of the conclusions of physical science, observation is the supreme Court of Appeal. . ..
Every item of physical knowledge must therefore be an assertion of what has been or would be the
result of carrying out a specified observational procedure.
Sir Arthur Stanley Eddington, The Philosophy of Physical Science (1958)

The transmutation of an atomic nucleus is a non-deterministic quantum event that,


as far as numerous experiments (discussed in Chapter 3 and elsewhere7) have been
7

M. P. Silverman, W. Strange, C. R. Silverman, and T. C. Lipscombe, Tests of alpha-, beta-, and electron capture
decays for randomness, Physics Letters A 262 (1999) 265; M. P. Silverman and W. Strange, Experimental tests for
randomness of quantum decay examined as a Markov process, Physics Letters A 272 (2000) 1.

5.8 Quantum test of composite measurement theory

305

able to ascertain, appears to occur entirely randomly and without regard to past
influences. Until shown otherwise, the process of nuclear decay is natures most
perfect random number generator. I recognize, of course, that nuclear processes
clearly have physical causes tied to the weak, electromagnetic, and strong interactions. By non-deterministic I mean that, as a consequence of physical laws and
not merely technologically remediable lack of information, it is not possible to
predict which nuclei of a sample will decay or when.
It is useful, therefore, to turn to nuclear physics to test the statistical distributions
of composite measurements developed in the preceding sections. Consider an experiment8 to obtain the distributions of the product and ratio of radioactive decays
occurring through the two branching decay modes of 212Bi, each of which leads
(directly or secondarily) to an alpha particle:
(a)
(b)

212

Bi ! 212Po ! 208Pb
( branch ratio 64.06%)
Bi ! 208Tl ( branch ratio 35.94%).

212

As symbolized above, radioactive bismuth can transmute to polonium by beta decay


(the predominant mode) or to thallium by alpha decay. The polonium also generates
an alpha particle in its rapid decay to lead.
The experiment was carried out in the following way. A 250 g sample of powdered
thorium dioxide, chemically prepared for other experiments about 40 years before,
provided a source of radium-224 (224Ra) in secular equilibrium. The term secular
equilibrium refers to a condition in which all radioactive members of a series (in
nuclear terminology: the parent and its daughters) have nearly equal activity.
Activity signifies the number of decays per unit time interval. I will discuss the physics
of secular equilibrium later. The source was placed in a sealed aluminum chamber in
which was mounted a silicon surface barrier detector. An electrical potential
of 1000 V with respect to the grounded chamber was applied to the detector,
allowing electrostatic precipitation of ionized 224Ra progeny, particularly 216Po.
The precipitation proceeded until a suitable level of activity was achieved.
A surface barrier detector is a charged-particle detector made from a wafer of ntype silicon treated to create a thin layer of p-type material at the surface over which
is deposited a thin layer of gold. An n-type material has electrons available for
conduction. Likewise, a p-type material has holes i.e. electron deficiencies available for conduction. In a pn junction, however, charge carriers may be depleted, so
there would be no electrical conduction until passage through the junction of a
charged particle. The number of electronhole pairs produced is proportional to
the energy of the incoming charged particle.
After deposition of polonium, the detector was removed from the chamber
and connected to standard nuclear electronics for pulse height analysis and
8

M. P. Silverman, W. Strange, and T. C. Lipscombe, Quantum test of the distribution of composite physical
measurements, Europhysics Letters 57 (2004) 572578.

306

A certain uncertainty

multi-channel scaling. A pulse height analyzer is an instrument used in nuclear


research that accepts electronic pulses of different heights from a particle detector,
digitizes the pulse heights, and saves the number of pulses of each height in the
registers, or channels, of an instrument referred to as a multi-channel scaler.
Polonium-216 decays by alpha emission to lead-212 (212Pb), which in turn decays
212
to Bi, the nuclide of interest. There are two decay pathways: an alpha mode (36%
branch ratio) resulting in 208Tl with emission of a 6.05 Mev or 6.09 MeV alpha
particle; a beta mode (64% branch ratio) resulting in 212Po with a half-life of 0.3 s,
which promptly decays to 208Pb with emission of an 8.78 MeV alpha particle.
The data to be compounded for statistical testing were generated using pulseheight-analysis spectroscopy9 to isolate the peaks produced by 212Bi and 212Po.
This was done in two runs, each run monitoring only one peak and recording the
number of events per time interval via multi-channel scaling. The energy resolution
of the spectrometer was approximately 15 keV per bin, giving a separation of peaks
of about 180 bins, whereas the peak widths were about 30 bins. Thus, the two peaks
were easily isolated and the rates of each branch determined. The measured rate was
about 150 events per second for the Po branch and approximately 100 events per
second for the Bi branch. The dwell time i.e. the delay due to storage of data10 was
0.05 s per bin for the 8196 bins sampled. Each run required about 409.6 s. Runs were
taken sequentially with an interval of approximately 3 min between them to allow
adjustment of the energy window.
By combining the contents of 2, 4, 8, etc., contiguous bins, it was possible to derive
from the same data set parent populations of increasing mean counts for each decay
mode. An element of corresponding type for example, a 2-bin or 8-bin count from
each of the two decay modes was then chosen to form a product and quotient.
Although the parent populations are realizations of Poisson random variables rather
than normal random variables, the respective means can be made sufficiently large
that the envelope of each parent Poisson distribution is well approximated by the
corresponding normal distribution N(i,i) (i 1,2). Recall that the characteristic
feature of the Poisson distribution is that the variance equals the mean (2 ). The
parent distributions are shown in Figure 5.7 for 1 Po 14.7, 2 Bi 9.8,
resulting from 2-bin combinations.
Figure 5.8 shows the quotient distributions for data sets derived from 2-bin (upper
panel) and 8-bin (lower panel) combinations, where in the latter case the mean counts
per bin are now about four times greater, 1 Po 58.93, 2 Bi 39.45. In both
cases, the predicted probability functions, based on the assumption of Gaussian
parent distributions, make a strikingly good fit as envelopes of the experimental
histograms and lead to excellent values for the means, widths, and asymmetries of the
9
10

M. P. Silverman, W. Strange, C. Silverman, and T. C. Lipscombe, Tests for randomness of spontaneous quantum
decay, Physical Review 61 (2000) 042106 (110).
The phrase dwell time originally signified the time cargo remains in a terminals in-transit storage area while awaiting
clearance for shipment.

307

5.8 Quantum test of composite measurement theory


14
212

Relative Frequency x 100

Bi
212

Po

N(9.8,9.8)
7

N(14.7,14.7)

0
0

12

18

24

30

Number of Events per Bin


Fig. 5.7 Comparison of sample 212Bi and 212Po distributions with respective normal
distributions N(14.7,14.7), N(9.8,9.8) (dashed). The number of samples is 4096 in each parent
distribution; bin width 0.125.

quotient distributions, as summarized in Table 5.5. Comparable accord is also


obtained for the product distributions of the two decay modes, as illustrated in
Figure 5.9 for the 2-bin data with associated moments summarized in Table 5.6.
A chi-square test of goodness of fit is not really necessary here, as we know at the
outset that the data derive from discrete rather than continuous parent distributions.
Moreover, direct comparison with the histogram and statistics of the experimental
samples show that the theoretical expressions derived on the basis of normal parent
distributions predicted accurately the basic shape and moments of the distribution of
composite measurements. Nevertheless, the results of a chi-square test are interesting
for what they reveal about the sensitivity of the distribution of composite measurements in relation to the distribution of the component measurements.


Analysis of the quotient 8-bin data yields a P-value P 2 > 2obs 24.3%, where
2obs is the observed chi-square for d 14 degrees of freedom, indicating that rejection
of the theoretical
distribution as a fit to the data would be unwarranted. However,

P 2 > 2obs is close to 0% for the 2-bin data despite the excellent visual match and
moment predictions. This curious feature is due primarily to the fact that the exact
Poisson ratio distribution function, Eq. (5.3.6), generates rational numbers, rather
than the set of all non-negative real numbers. The consequence of this is that the
distribution is actually a kind of line spectrum that exhibits naturally occurring
lacunae, i.e. values of z at which the probability drops suddenly to zero or close to
zero. These appear as fluctuations in the upper panel of Figure 5.8 that contribute
significantly to the chi-square, but they are reproducible and do not vanish with

308

A certain uncertainty

Table 5.5 Experimental and theoretical moments of


Z 212Po( 1 14.7)/212Bi (2 9.8)
Sample

Sample

PDF

m1
m2
m3
Standard deviation
Skewness

1.683
3.660
12.744
0.909
5.069

1.638
3.520
12.349
0.914
5.033

Relative Frequency x 100

212

Po / 212Bi
Gaussian

N(( , )

1 = 14.74
2 = 9.86

RNG Simulation

0
0

2.5

Quotient
1.0

Relative Frequency x 10

212

Po / 212Bi

1 = 58.93
2 = 39.45
0.5

1.75

3.5

Quotient
Fig. 5.8 Distribution of 212Po/212Bi decays. Upper panel: 4096 samples of 2-bin data with
theoretical density (solid) for NPo(14.7,14.7)/NBi(9.8,9.8). Lower panel: 1023 samples of 8-bin
data with superposition of theoretical density (solid) for NPo(58.9,58.9)/NBi(39.4,39.4).

5.8 Quantum test of composite measurement theory

309

Table 5.6 Experimental and theoretical moments of


Z 212Po(1 14.7)  212Bi(2 9.8)
Sample

Sample

PDF

m1
m2
m3
Standard deviation
Skewness

145.322
2.475  104
4.820  106
60.264
0.769

145.307
2.483  104
3.608  106
60.986
0.559

1.0

Relative Frequency x 10

212

Po x 212Bi

1 = 14.74
2 = 9.86
0.5

140

280

Product
Fig. 5.9 Distribution of 212Po  212Bi decays: 4096 samples of 2-bin data with theoretical
density for NPo(14.7,14.7)  NBi(9.8,9.8) (solid).

increasing sample size. This is illustrated in Figure 5.10, which compares the 2-bin
quotient data and the corresponding exact theoretical distribution calculated from
Eq. (5.3.6). To preserve the striking visual identity of the two histograms, they are
plotted in separate panels rather than as superposed figures in one panel.
The appearance of pseudo fluctuations is even more conspicuous in a distribution
of Poisson products in which the samples are all integer although the discrete parent
distributions are enveloped closely by smooth Gaussian functions. Nevertheless, the
spikey pattern in the product distribution is reproduced precisely by the exact
distribution law, Eq. (5.3.5), as illustrated in Figure 5.11 for the case Poi(10)  Poi(5)
simulated by a Poisson RNG. The influence of such pseudo fluctuations on the
statistics of a composite measurement (product or ratio) diminishes, however, with

310

Probability (Experimental)

A certain uncertainty
.12

Experiment
212
Po / 212Bi

.08

1 = 14.74
2 = 9.86

.04

Probability (Poisson)

.12

Theory
Poi(14.7)/Poi(9.8)

.08

.04

0
0

Quotient
Fig. 5.10 Histogram of 4096 samples of experimental 212Po/212Bi 2-bin data (upper panel)
compared with the exact theoretical probability function (lower panel) for Poi1(14.7)/Poi2(9.8).
Apparent fluctuations are not random, but are stable reproducible features resulting from
mathematical constraints on the sum in Eq. (5.3.6).

increasing mean values of the parent distributions, which, as expected, is the condition
under which a Poisson distribution tends toward a normal distribution.

5.9 Cautionary remarks


Let every student of nature take this as his rule that whatever the mind seizes upon with particular
satisfaction is to be held in suspicion.
Francis Bacon, Novum Organum (1620)

I have discussed in this chapter means of obtaining distributions and statistics of


composite measurements and illustrated how these distributions can differ significantly from those of the directly measured physical quantities from which a composite measurement is inferred. I focused, in particular, on component measurements
governed by Gaussian and Poisson statistics because these distributions are widely
encountered in physics as well as cognate sciences, industry, medicine, and everyday
life. And I have described the results of an actual physical experiment the only one
I know of that tests the theory of product and quotient distributions by means of

311

5.9 Cautionary remarks

Probability x 10-2

Simulation
Poi(10) x Poi(5)

4
3
2
1
0

Probability x 10-2

Theory
Poi(10) x Poi(5)

4
3
2
1
0

40

80

120

Product
Fig. 5.11 Histogram of 50 000 samples (bin width 0.125) of Poi1(14.7)  Poi2(9.8) simulated by
a Poisson RNG (upper panel) compared with the exact theoretical distribution (lower panel).
The apparent fluctuations are again stable reproducible features deriving from mathematical
constraints on the sum in Eq. (5.3.5).

the branching decays of a radioactive nuclide. Generally speaking, unless there is


some a priori reason for a composite measurement to follow a normal (or other
known) distribution law, it would be prudent in my opinion for those making
measurements to determine explicitly either by analytical or numerical means, the
distribution and statistics specific to their measurement.
Why is it then that error propagation theory (EPT), whose underlying premise is
that the elemental quantities of a composite measurement are normally distributed, is
used widely and seems to work reasonably well? The answer is that there are, of
course, circumstances for which one might expect elemental measured quantities to
be distributed normally. One such circumstance, as this chapter has shown, is that the
quotient and product distributions of certain non-Gaussian parent distributions
become increasingly better approximated by normal distributions as the ratio of
mean to standard deviation increases i.e. as the distributions of the measured
quantities become sharper. It may be tempting to think that this characteristic would
apply to the distributions of products and quotients of random variables governed by
arbitrary distributions, but this is not the case.

312

A certain uncertainty

Consider, for example, variables X and Y governed by pdfs pX (x) a2xeax and
pY (y) b2yeby, respectively. The exact distribution of the ratio Z X/Y
pZX=Yz 6a2 b2 z=az b4 ,

5:9:1

given by Eq. (5.3.12), does not tend toward a normal distribution for any choice of
parameters a and b. The reason for this may be understood as follows. The parent
distributions are actually a special case of the gamma distribution Gam(, m)
m m1 u
u e
m

5:9:2

am bm
zm1
,
Bm, m az b2m

5:9:3

pujm,

p
for which the mean and standard deviation are respectively m/ and m=,
p
leading to a ratio = m that is independent of the dimensioned
parameter . In
p
the example given leading to (5.9.1), / will always be 2 irrespective of a and b.
The ratio / increases with the index m, however, and the pdf of the general quotient
Z Gam(m, a)/Gam(m, b),
pZz

where B (m, n) (m)(n)/(m n) is the beta function (encountered in Chapter 2), is


well represented in the limit of large m by a normal distribution of mean and variance
determined from the moments
 k
b Bm k, m  k
hZ i
:
a
Bm, m
k

5:9:4

Such a limit, however, might be irrelevant to the study of a specific physical phenomenon whose law entails a particular value of m. For example, Plancks radiation law
in the high-frequency domain takes the form of a gamma distribution (5.9.2) of
radiation frequency with fixed index m 3.
Another such circumstance (for the applicability of error propagation theory) is
the Central Limit Theorem (CLT). It is frequently the case that those who make
measurements do not need to deal with the distribution functions of the quantities
they measure, but only with the distribution functions of the averages of those
quantities over a large number of measurements, N. The basic message of the CLT
is that under certain specified conditions the mean of an infinite number of measurements approaches a Gaussian distribution with a standard error that decreases as
N1/2. Since the ratio of mean to standard deviation of the mean becomes large, we
are back again to the first-mentioned circumstance where the conditions for validity
of customary EPT apply.
The specified conditions for validity of the CLT are almost always taken to
mean the existence of the first and second moments, a condition commonly met in
practice except for a small class of functions like the Cauchy distribution. Often

5.10 Diagnostic medical indices: what do they signify?

313

overlooked, however, is the requirement of an infinite number of measurements. In


practice, no one makes an infinite number of measurements, but the CLT does not
predict how fast an arbitrary distribution approaches Gaussian form. Depending on
the distribution functions of the component measurements, the distribution of the
mean of the composite measurement may approach a normal distribution very slowly
or even not at all. It is the insufficiency of number of measurements, therefore, that
all too often invalidate the use of the CLT and, concomitantly, the estimation of
uncertainties by EPT. In some cases of critical importance the insufficiency is
extreme, an example of which is given in the following section.

5.10 Diagnostic medical indices: what do they signify?


Physical applications of the results in this chapter are numerous, but I will conclude
with the question with which the chapter began, a question with significant public
health implications. Lipid disorders and coronary heart disease (CHD) are common
in the United States, Europe, and other developed countries. A large US cohort study
has suggested that approximately 30% of CHD events in men and women were
attributable to total cholesterol levels greater than 200 mg/dl (milligrams per deciliter).11 Correspondingly, increasing numbers of people in the USA and elsewhere are
screened for risk of cardiovascular disease by means of a lipid panel assay that results
in a single diagnostic index R, the ratio of total cholesterol (TC) to high-density
lipoprotein cholesterol (HDL).
Lipid profile tests are usually performed serologically on whole blood samples by
clinical diagnostic laboratories or by reflectance colorimetry on fingerstick samples.
In either case, the diagnostically definitive result reported to primary-care physicians
or tested individuals is the number R of a single trial on a single sample, usually
performed at most once per year with relative errors in the neighborhood of
TC/TC  11.1% and and HDL/HDL  16.2% for common fingerstick assays
and about 25% smaller for serum analyses.12 In stark contrast to standard scientific
practice, no measure of statistical uncertainty R is ordinarily provided for the reasons
(told to me by the medical director of one of the diagnostic laboratories) that
(a) diagnostic facilities do not know how to determine this value rigorously, and
(b) primary-care physicians would not know what to do with this information, anyway.
Note that under these circumstances, where the number of trials is N 1, the
central limit theorem does not apply, and one cannot assume at
pthe
outset that in the
limit of numerous measurements the quantity R  R = R = N follows a normal

11

12

M. P. Pignone et al., Screening and treating adults for lipid disorders, American Journal of Preventative Medicine
(April 2001) 3S 5369.
Cholestech Corporation Technical Brief: Clinical Performance of the BioScanner 2000TM and the Cholestech LDX
System Compared to a Clinical Diagnostic Laboratory Reference Method for the Determination of Lipid Profiles
(Cholestech Corporation, Haywood, CA, 2001).

314

A certain uncertainty

Table 5.7 TC/HDL ratio and risk of cardiovascular disease


Risk

Men

Women

Very low (1/2 average)


Low risk
Average risk
Moderate risk (2 average)
High risk (3 average)

<3.4
4.0
5.0
9.5
>23

<3.3
3.8
4.5
7.0
>11

distribution where R N1

N
X

Rk is the sample mean and R is the mean parameter of

k1

the composite distribution, in other words the true value.


Consider a patient with total cholesterol of 238 mg/dl, which is regarded as borderline high13 and high density lipoprotein cholesterol of 44 mg/dl, leading to a reported
TC/HDL ratio of R 5.41. If we make the reasonable assumption (based on the
independent procedures for measuring them) that total cholesterol and high density
lipoprotein cholesterol measurements may be treated as independent normally distributed random variables with the preceding means and maximal standard deviations
TC 26.4 mg/dl, HDL 7.1 mg/dL, then we can calculate the exact pdf and cumulative
probability function of R from Eq. (5.3.12) or (5.5.12) and first four moments from
(5.7.5)(5.7.8). The pdf, which resembles in form the upper panel of Figure 5.4, has a
mean of 5.56, standard deviation 1.17, skewness 1.05, and kurtosis 5.87. It differs
significantly from a normal distribution (skewness 0; kurtosis 3) in having a
narrower peak and long fat forward tail. The 5% and 95% confidence limits on
R, deduced from the cumulative probability function, occur at ratios R5% ~ 4.0 and
R95% ~ 7.6. In other words, 90% of repeated TC/HDL measurements of the same sample
would be expected to range from 4 to about 8. Yet nowhere in a typical diagnostic report
to the physician or patient is there likely to be any indication that the numerical outcome
of the single performed test could fall purely by chance within this wide range.
Some reports may include a short summary of coronary heart disease risk of males
and females relative to the general population, such as the one in Table 5.7.14
To the extent that a risk calculation based on the TC/HDL ratio is meaningful, the
purely statistical variation in a single measurement of R between 4 and 8 would imply
that a womans risk of coronary heart disease could range from low (for which no
medical intervention would be considered) to greater than twice that of the general
population (for which a physician might suggest treatment with a statin drug, a
treatment with its own intrinsic risks).
Equivalently, we may ask how confident can a physician be that the true value R
of the patients cholesterol ratio is within 10% of the single reported value R 5.41.
13

14

National Institutes of Health, High blood cholesterol: what you need to know, http://www.nhlbi.nih.gov/health/
public/heart/chol/wyntk.htm
Exercise Prescription on the Net, Blood cholesterol http://www.exrx.net/Testing/LDL%26HDL.html

5.11 Secular equilibrium

315

From the computed cpf we find that Pr(1.05R  R  0.95R) 20.2%, a value that
does not inspire much confidence.
How many measurements, then, would have to be made on a particular blood
sample for the physician to be 90% confident that R is within 10% of the observed
mean value R? To answer this question, we can make use of the transformation
2 R1
1=2 following Eq. (5.5.14) which yielded a standard normal variate
22 R2 21
N0, N 1 . The values of corresponding to the limits (0.95R, 1.05R) are
(min 0.264, max 0.247). The sought-for number of trials, obtained by equating
the integrated probability function of a standard normal variable to 90%

N
2

1=2 max

eN

=2

d 0:90,

min

is N 41. In general, however, only one or two tests are performed per patient
per year.
In short, given the virtually explosive growth in prescriptions and sales of statin
drugs, conceivably on the basis of single annual determinations of a ratio whose
uncertainty is not ordinarily known or understood by physicians, it would seem that
the measurement, reporting, and diagnostic interpretation of lipid panel analytes are
matters for serious reevaluation.
And it is not just lipid panel tests that involve composite diagnostic indices whose
distribution and uncertainty are unknown or incorrectly determined or omitted in the
summary report of results. Other examples may include the ratio of blood urea
nitrogen (BUN) to creatinine, which is used to ascertain the likelihood of prerenal
injury, or the ratio of albumin to globulin, which is an indicator of a potential kidney
or liver disorder. Although test reports often contain what a particular laboratory
considers a range of normals, they do not usually provide any interpretation of what
normal means. Moreover, if by normal is meant that the composite index follows a
normal distribution, that assumption is almost certainly incorrect. And the CLT is of
no help in cases like these because the number of measurements is too few.
To be sure, a competent physician is unlikely to prescribe a life-long medication
on the basis of a single test. Nor is it my intention to sow seeds of distrust of the
blood tests that are performed. Rather, as a physicist I know that no measurement is
significant or interpretable without an understanding of the uncertainty with which
it was obtained. This knowledge is no less applicable to diagnostic medicine as it is
to physics.
5.11 Secular equilibrium
Within the Earths interior, the radioactive nuclei comprising each of three distinct
decay chains beginning with uranium-238 (238U), thorium-232 (232Th), or uranium235 (235U) are in a state of secular equilibrium. This means that the activities of all the

316

A certain uncertainty

radioactive species within a series are nearly equal. The activity of a radionuclide
refers to the product of its decay constant and quantity (or concentration); the decay
constant is equal to ln2 divided by the half-life . Secular equilibrium can occur
under the conditions that
(a) the half-life of the parent nucleus is much longer than the half-life of any of the
daughter products in the series, and
(b) a sufficiently long time has elapsed to allow for the daughter products to develop.
The radium (224Ra) used as a source in the nuclear test of composite measurements
discussed in Section 5.8 was produced over a period of approximately 40 years
through a chain of transmutations starting from an oxide of 232Th and eventually
ending with a stable isotope of lead (208Pb) as shown in part below up to the shortlived isotope of radon (220Rn):
232 Th

14
Gy
!
X0 5:010
11

228 Ra

5:8 y
!
0:12
X1 

228 Ac

6:1 h
!
995:4
X2 

228 Th

1:9 y

!
0:37
X3 

224 Ra

3:7 d
!
68:4
X4 

220 Rn

!
X5 
3:9105
55:6 s

5:11:1

The numbers above the arrows give the half-life i of each transition in a convenient
time unit (s second, h hour, d day, y year, Gy 1 billion years). The
numbers below the arrows give the decay constants
i ln 2= i

5:11:2

in units of decays per year.


The concentration (Xi i 1, 2, 3,. . .n) of the ith daughter product within a decay
series of length n starting from a concentration X0 of the parent nuclide is obtained
from the set of coupled differential equations
dX0
0 X0
dt

dXi
i Xi i1 Xi1
dt

5:11:3

that account, like a financial balance sheet, for the creation and destruction of the
nuclide on the left side. When secular equilibrium occurs, there is no net production
or loss of nuclei i.e. dXi/dt 0 and it then follows from (5.11.3) that
0 X0 i Xi

i 1 . . . n

5:11:4

for all the daughter products of the series.


It is a little tedious, but not difficult, to obtain an exact analytical solution to the
set of rate equations (5.11.3) for the chain of decays in (5.11.1). By setting
Xi Y i ei t ,

5:11:5

one can transform the set of equations in (5.11.3) to


dY 0
0
dt

dY i
ei i1 t Y i1 ,
dt

5:11:6

317

5.11 Secular equilibrium


5

Activity x 1011

4
3
2
1
232

Th

0
0

10

228

228

Ra

20

228

Ac

30

Th

224

40

Ra

50

Time (y)
Fig. 5.12 Approach to secular equilibrium of the first four daughter products in the 232Th
decay series. Initial 232Th activity (dashed) has not perceptibly changed in 50 y. The nuclides
228
Ra, 228Ac (solid gray) and 228Th, 224Ra (solid black) follow two different decay curves,
which merge about 40 years after preparation of the parent sample of 232Th.

which can be integrated sequentially to yield


t

Y it ei i1 t Y i1t0 dt0 ,

5:11:7

starting with the solution Y0 constant (here taken to be 1) and boundary conditions
Yi(0) 0. The method of analysis leading to (5.11.7) is analogous to one I have
devised in quantum mechanics to solve harmonically driven transitions among the
states of a multi-state atom.15 It is interesting how widely disparate physical systems
can be studied by a few fundamental mathematical methods.
A plot (Figure 5.12) of the relative concentrations Xi(t)/X0(t) in (5.11.5) for the
given decay rates shows that the concentration curve of 224Ra (i 4 in the sequence)
flattens at around 30 years and achieves close to 99% of its secular equilibrium value
at 40 years. The thorium-232 half-life is so long that the concentration X0(t) is
effectively constant for the time period under consideration. Note that there are
actually four plots in Figure 5.12, but only two distinct curves are seen because two of
the plots overlap two others. This curious feature is a a fortuitous outcome of the
numerical values of the various decay constants, which reduce the exact solutions for
the activities to the two nearly exact approximate expressions

15

M. P. Silverman, Probing The Atom: Interactions of Coupled States, Fast Beams, and Loose Electrons (Princeton
University Press, Princeton NJ, 2000), Chapter 3.

318

A certain uncertainty

8
< 
Ait
: 0 1 



0 1  e1 t

3
1
e1 t
e3 t
3  1
3  1

i 1, 2
i 3, 4

5:11:8

5.12 Half-life determination by statistical sampling:


a mysterious Cauchy distribution
The transition rate or half-life is one of the most frequently sought pieces of
information concerning the transmutation of radioactive nuclei, the de-excitation
of atoms and molecules, the loss of a chemical reactant by diffusion, and other
examples of stochastic processes that ironically are termed pure birth processes
in statistics even though the preceding examples are all systems that are decaying,
rather than reproducing or rejuvenating.16 The traditional way to obtain the decay
rate is to record some property of the system over time and fit the resulting plot
to a decaying exponential function, such as was described in Chapter 3 in
the measurement of the decay rate of radioactive sodium-22 (22Na). In practice,
the method is straightforward although ideally requires that measurements
be executed at regular time intervals, otherwise the analysis becomes more
complicated.
A relatively recent alternative17 to curve fitting extracts the half-life of a decaying
system by a very different statistical procedure, which illuminates the principles
developed in this chapter and leads to an unexpected empirical curve that might well
puzzle even experienced users of statistics. The measurement procedure, which we
will examine in the context of radioactive decay, is very simple.
Starting at some designated origin of time t0 0, make n observations of the
activity Ai (a random variable) in narrow intervals (bins) t centered at times ti (i
1. . .n). The measurements do not need to be regularly spaced.
Calculate the ratio of activities Zij  Ai/Aj for all intervals tij  tj  ti > 0. Recall
that the activity is the product of the intrinsic decay rate (a constant) and number
of decays per bin (which decreases in time). Thus the ratio of two activities is the
ratio of the corresponding counts i.e. the ratio of two independent Poisson
variates.
Calculate [from Ai / eti for a pure birth process and (5.11.2) relating decay rate
and half-life] the two-point estimate of the half-life

16

17

The statistical appellation connotes a process whereby a state En can change only to the state En1. See W. Feller, An
Introduction to Probability Theory and its Applications Vol. 1 (Wiley, NY, 1957) 402403.
S. Pomme, Problems with the uncertainty budget of half-life measurements, In: T. M. Semkow et al. (Eds.) Applied
Modeling and Computations in Nuclear Science, ACS Symposium Series 945 (American Chemical Society,
Washington, DC, 2007) 282292.

5.12 Half-life determination by statistical sampling

ij

tij ln 2
ln Zij

319

5:12:1

for all N pairs of ratios Zij, where


N

n1 X
n
X
i1

1
1 nn  1:
2
ji1

5:12:2

Note: the requirement that tj occur after ti means that Aj is theoretically smaller than Ai, in
which case Zij > 1 and ln Zij is a positive number. However, because the activities are random
variables some observed values of Zij can in fact turn out to be less than 1, whereupon
Eq. (5.12.1) would yield a physically unacceptable negative value for the corresponding halflife. If such cases occur in the data, do not include them in the analysis.

Make a histogram of the (positive) two-bin half-life samples ij.


Under appropriate circumstances to be discussed shortly, the resulting histogram
looks like a perfectly symmetrical Cauchy probability density for which the location of the maximum gives the true half-life. A Cauchy distribution?
Upon first encountering this procedure in an obscure chemical publication, I could
not help but wonder: How could the true distribution of the random variables ij lead
to a Cauchy distribution? The ratio of activities Zij is the ratio of two Poisson
variates, which, in the limit of large mean count per bin can be approximated
accurately by the ratio of two normal variates. Given the fairly complicated expressions to which products and ratios of random variables can lead, there was little
reason, as far as I could see, why a Cauchy pdf should emerge. If anything, one might
have expected on the basis of the CLT to find some kind of normal distribution, but
that was definitely not the case.
I did a computer simulation (with Poisson RNG) of the specific process [decay of
iron-55 (55Fe)] described in the published article and obtained the same result as the
author. Indeed, in further simulations of hypothetical decay processes where I myself
set the value of the half-life, I found that a Cauchy distribution fit the histograms so
well that the goodness-of-fit would have almost failed a chi-square test because the
residual error was so much smaller than the number of degrees of freedom!
The published article did not include a theoretical analysis, nor did a search of the
internet produce one, so I worked it out myself. In subsequent correspondence with
the author, I learned that he had inferred a Cauchy distribution from the visual
appearance of the histograms, but had not demonstrated it mathematically. Here,
then, is how this remarkable method works.
Assume that the mean count per bin i at time ti is much greater than 1, so that the
n observations are for all practical purposes realizations of independent Gaussian
variates with Poisson variances 2i i . The initial mean count is 0. Then, as shown
in Section 5.5, the pdf of the ratio Zij of activities can be reduced from the exact, but
cumbersome, expression (5.5.12) to the pdf of a Gaussian variate

320

A certain uncertainty

Z ij 



Ni , i
 N ij , 2ij
Nj , j

j > i,

5:12:3

where
^

i  0 eti 0 eti ln 2=^

ij 

with EPT variance


2 2i 2j
1 i
2ij  2i 4
j j
j
j
etj ti 
^

0 e^tj

^tj ti

1e

i
^
etj ti etj ti ln 2=^
j

1 i
j

tij ln 2
^

0 e

5:12:4

tj ln 2
^


1

tij ln 2
e ^


:

5:12:5

The true intrinsic decay rate ^ and half life ^ are constant parameters not to be
confused with the estimates (5.12.1) calculated from pairs of activities. The explicit
time dependences in (5.12.4) and (5.12.5) follow from the Poissonian character of
nuclear decay.
To analyze the histogram of two-point half-life estimates, we need the pdf of the
random variable , whose functional dependence on Z is given by (5.12.1). The
procedure for transforming pdfs should now be familiar. Given the Gaussian pdf
pZ (z), the pdf pT () is calculable from
.dz
pT pZ z  
5:12:6
d
where the transformation function (or Jacobian) for a particular variate Zij is


dZ i j  ti j ln 2 ti j ln 2



:
5:12:7
 d  2 e
The composite pdf representative of the entire sample of n independent measurements is the normalized sum of the pdfs of the individual variates. Note that the sum
of the pdfs is not the pdf of the sum of variates, which would represent an entirely
different quantity namely, a measurement comprising the sum of all two-point halflife estimates.
Putting the pieces of the preceding analysis together leads to the exact expression
8 
2 9

ti j ln 2
>
>
ti j  ti j ln 2
i
>

>
>
 >
e
=
<
n1 X
n
exp
X
2
j

1
ln 2



p

r
, 5:12:8
pT
exp



i
i
>
>
nn  1=2 2 i1 ji1
i
i
>
>
2
1

2
>
>
1
j
j
;
:
2j
j
which looks (and is) quite complicated. The basic structure, however, can be interpreted as follows. The first factor is the constant normalizing the pdf to unit area

5.12 Half-life determination by statistical sampling

321

when integrated over . The second factor contains (in the numerator) the constant
relating decay rate and half-life and (in the denominator) the normalization constant
from a Gaussian distribution. The sums are over all observations such that tj > ti > 0.
The next factor (within the sums) includes factors from the Jacobian (5.12.7) and the
standard deviation from the denominator of the Gaussian distribution. The final
factor is the exponential function of the Gaussian distribution. The exponential
exp (tij ln 2/) appearing within the argument of the Gaussian exponential and as a
prefactor is the functional relation Z ().
Equation (5.12.8) bears no resemblance to a Cauchy distribution. To see how this
extraordinary evolution comes about, I will strip away all inessential factors from
(5.12.8) and express time in units ti i t with t 1. Then, after substitution of the
explicit time-dependent expressions (5.12.4) for i and ij, the function in (5.12.8)
takes the skeletal form
f


n1 X
n 
X
ji
i1 ji1


 ji
2
ji
ji

e
,
exp 0 e
e

5:12:9

where 0 is the initial mean number of counts per bin and ^ is the sought-for true
value of the half-life. The following conditions are then imposed.
(1) and ^ are long compared to the intervals (j  i).
(2) The source is strong: 0 1.
(3) Numerous measurements are made: n and N are 1.
Under these conditions a plot of (5.12.9) generates a curve that is well fit by a Cauchy
probability density.
Condition (1) is the critical step in the deconstruction for it allows us to
approximate
ji
ji
e  e ^  j  i 1  ^ 1

5:12:10

by making a Taylor series expansion of both exponentials to first order.


The better the conditions (2) and (3) are met, the narrower is the resulting lineshape,
whereupon the difference of reciprocals in (5.12.10) can then be approximated by
1  ^ 1

^  ^ 
 2 :
^
^

5:12:11

One will also find that the form of the lineshape is not changed significantly if the
is replaced by the constant ^ in the denominators of the prefactors
variable

ji
ji
exp
2
. At this point, we have transformed the exact function (5.12.9) into a

sum of Gaussians



n1 X
n 
X
j  i ji^
0
2
2
e
:
5:12:12
exp

f
j

i


^

^ 4
^ 2
i1 ji1

322

A certain uncertainty

The exponential function falls off rapidly outside a narrow interval around ^ and has
an argument smaller than 1 close to ^ . Thus, one can further approximate (5.12.12)
by a Taylor series expansion
)
(
0
1
1
2
2
n
o
exp  4 j  i  ^
,
0

2
2
0
^
1 4 j  i2  ^ 2
exp 4 j  i  ^
^

5:12:13
which, apart from a normalization factor, leads directly to the form of a Cauchy
function
f C

1

 2  :

1 ^

5:12:14

Now to this point we have transformed (5.12.9) into a sum of Cauchy functions of
different widths
  ji
ji ^
n1 X
n
e
X
^ 2
f
:
5:12:15
0
2
2
i1 ji1 1 ^ 4 j  i  ^
By ignoring the time-dependent, but non-resonant, exponential in the numerator,
which computer analysis confirms to have little consequence, we can, in fact, go one
step further and judiciously approximate the variable quantities (j  i), (j  i)2 (e.g.
by their means), and thereby collapse the double sum in (5.12.15) to a single Cauchy
function.
To return to the problem with which we began, upon restoration of the physical
constants, the exact expression for the pdf of two-point half-life measurements
(5.12.8) can be accurately represented by a Cauchy density (5.12.14) centered on
the true half-life ^ with approximate width parameter
p
6 ^ 2

5:12:16
p :
n ln 2 0
The greater the number of measurements n, the narrower is the lineshape, and the
better the empirical Cauchy pdf matches the theoretically exact pdf.
Figure 5.13 compares the theoretically exact and empirical Cauchy densities for
different numbers of two-point activity measurements of a hypothetical radioactive
nucleus with half-life of 1000 time units. For a single pair of activity measurements
n 2, the exact pdf (5.12.8) skews markedly to the right and looks nothing like either
a Cauchy or Gaussian function. For a set of 11 activity measurements (N 55 pairs),
the exact pdf begins to resemble a Cauchy function displaced a little to the left.
However, for a set of only 26 activity measurements (N 325 pairs), the exact pdf
and Cauchy densities are indistinguishable over the range of half-life values

323

5.12 Half-life determination by statistical sampling


0.006

(c)

Probability Density

0.005

0.004

0.003

(b)

0.002

0.001

400

(a)

600

800

1000

1200

1400

1600

Half-Life
Fig. 5.13 Plot of exact probability density (solid gray) of the half-life distribution compared
with a single Cauchy density (dashed black) of width given by Eq. (5.12.16) for sample size n:
(a) 2, (b) 10, (c) 25. Parameters of the calculation are: true half-life 0 1000 t, initial mean
count rate 0 107/t, counting interval t 1. The time unit is arbitrary, but 1 day has been
used in application to long-lived radionuclides.

displayed. The Cauchy lineshapes in Figure 5.13 are actually centered on 996.5,
rather than on 1000. The small displacement, however, vanishes in the limit of
increasing N.
In short, location of the center of the histogram of two-point half-life estimates
leads directly to the true value of the half-life without the need for curve fitting. We
can estimate the uncertainty in the value of the half-life by compounded use of the
approximation (5.2.5) for variance of a function of a random variable, starting with
the relation between half-life and activity
!2
t ln 2
t ln 2 varZ

) var
,
5:12:17
Z
Z2
ln Z2
where t is the interval between measurements of the two activities comprising the
ratio Z Ai/Aj. Next, one applies (5.2.5) again to obtain var(Z) in terms of the
variances of the two activities (or counts)
Z

Ai
Aj

varZ

i 2i
,
2j 3j

5:12:18

324

A certain uncertainty

which, by Poisson statistics, are equal to the respective means. Combining (5.12.17)
and (5.12.18) leads to the variance of one two-point estimate


1
,
5:12:19
var i j t2ijln 22 1
i j
whereupon the mean variance for entire set of N samples is obtained by summing
over all pairs
var

n1 X
n


ln 22 X
1
t2i j 1

:
i
j
N i1 ji1

5:12:20

The variance of the mean is obtained in the familiar way


var
,
N

5:12:21

2 ln 22t2
0
3

5:12:22

var
which takes a particularly simple form
var 

in the limit of large N, approximation of the terms in the sum (5.12.20) by



 2t2
1

j  i2 ,
t2i j 1

i
j
0

5:12:23

and use of the identity


n1 X
n
X
i1 ji1

j  i2

1 2 2
n n  1:
12

5:12:24

Appendix

5.13 The distribution of W XY/Z


X, Y, and Z are three independent random variables, not necessarily of the same kind.
The general formula for the pdf pW (w) is most easily derived by repeated use of the
delta function method. Thus, setting U XY and starting with the relation

xy

pWXYZw
pUupZz
 w dz du
5:13:1
z
leads to

pWXYZw pUzwpZzjzjdz:

5:13:2

Repeating the process for the density pU (u)

pXxpYyxy  udx dy,


pXYu
yields the relation

u
pXYu pXxpY
jxj1 dx,
x
which, when substituted into (5.13.2) gives the final result

zw
dx:
pWXYZw jzjpZzdz jxj1 pXxpY
x

5:13:3

5:13:4

5.13.1 Uniform distributions


To illustrate the use of (5.13.4) for a composite measurement comprising both
products and quotients, we consider the directly measured quantities to be governed
by a uniform distribution over the unit interval: W U1(0,1)  U2(0,1)/U3(0,1).
Substitution of the pdf in (5.4.1) into (5.13.4) leads to the composite probability
density
325

326

A certain uncertainty


8 
1 1
>
<
 ln w
pWw
zdz y1 dy pZzpYypXwz=y 2 2
>
1
:


4w2

0<w1
w1
5:13:5

and, by integrating pW (w) from 0 to w, the cumulative probability function


8
3
1
>
w
< w  w ln w
0<w1
0
0
4
2
:
5:13:6
FWw pWw dw
1
>
:
1

w

1
0
4w
The moments of W, calculated by means of the pdf, are unbounded, and one will find
by simulation of the distribution of W with a uniform random number generator that
the moments of a sample indeed increase with sample size.

5.13.2 Normal distributions


Since many of the elemental physical quantities that are measured do follow a
Gaussian distribution, it is instructive to examine the distribution of W for the
general case of normally distributed random variables






Y N 2 2 , 22
Z N 3 3 , 23 :
5:13:7
X N1 1 , 21

X=N(4,1)

Probability Density

0.4

Y=N(10,1)

0.3

Z=N(8,2)

0.2

0.1

10

12

14

Outcome W=XY/Z
Fig. 5.14 Probability density functions of Gaussian variates X N(4,1), Y N(10,1), Z
N(8,2) (dashed) and the composite variate W XY/Z (solid) obtained from Eq. (5.13.8) by
numerical integration.

5.13 The distribution of W XY/Z

327

Table 5.8 Moments of W N(4,1)  N(10,1)/N(8,2)


Moments

Symbol

Value

EPT

Mean of X : hXi
Standard deviation of X
Mean of Y: hYi
Standard deviation of Y
Mean of Z: hZi
Standard deviation of Z
Expectation hWi
Expectation hW2i
Expectation hW3i
Expectation hW4i
Standard deviation of W
Skewness
Kurtosis

1
1
2
2
3
3
m1
m2
m3
m4
W
SkW
KW

10
1
4
1
8
2
5.37
33.99
256.11
2321.23
2.27
696.28
7.76

5.31

1.84

Substitution of the corresponding pdfs into (5.13.4) leads to a relation (the integrations can be performed in either order)
pWw

1
23=2 1 2 3
1
23=2 1 2 3

jzje


z3 2 =2 23

dz

jxj1 ex1


1

jx je


x1 =2 21

dx

jzjez3

zw
2 
2 22
=2 21  x  2
e
dx

zw
2 
2 22
=2 23  x  2
e
dz

5:13:8
which would result in a very unwieldy expression if evaluated further analytically.
Figure 5.14, obtained by performing the integration in (5.13.8) numerically, shows
the density pW(w) for W N(4,1)  N(10,1)/N(8,2). The moments of W, also obtained
by numerical integration using (5.13.8), are summarized in Table 5.8 and compared
with the mean and variance predicted by the approximate relations (5.2.3) and (5.2.5)
of error propagation theory. As seen in the figure, pW(w) is skewed markedly forward,
signifying that the probability of drawing outlying events in samples of large size will
be significantly greater than in the case of a normal distribution. The large value of
kurtosis further quantifies the fat tail and narrow peak of the distribution.

6
Doing the numbers nuclear physics
and the stock market

The determination of these price changes is subject to an infinite


number of factors: it is therefore impossible to hope for a mathematical prediction. Contradictory opinions in regard to these variations are so divergent that at the same instant buyers believe the
price will rise and sellers that it will fall. [The] dynamics of the stock
market will never be an exact science.
Louis Bachelier1

6.1 The stock market is a casino


Well have the details when we do the numbers is one of the ritualized comments of
the hosts of the popular National Public Radio (NPR) show Marketplace. Clearly,
a large segment of the population must regard the numbers as important because the
ritual of reporting them is followed nearly every day, all throughout the day, in
countless newspapers, websites, and other radio and television broadcasts. I have
read that the song NPR plays if the market is mixed that day is It Dont Mean
A Thing (If It Aint Got That Swing). The hosts of the show probably have no idea
how apt the title of the song is regardless of what the stock market does.
I am not an economist. Throughout most of my career as a physicist, I was too
immersed in the details of one project or another, as well as teaching, to gave any
thought to the stock market. Instead, like many others, I left investment decisions to
presumably expert financial advisors. That was a regrettable mistake. Finally, after
years of receiving dubious advice with results that ranged from the disappointing to
the catastrophic, I decided to examine the stock market as I would the decay of
radioactive nuclei. The exercise was quite enlightening. Indeed, the results convinced
me that financial advisors, who insisted that over time the market always yields the
highest returns, grossly misunderstood the dynamics of the stock market.

Louis Bachelier, from Theorie de la Speculation [Theory of Speculation], a thesis presented to the Faculty of Sciences
of the Academy of Paris on 29 March 1900, unnumbered page of the Introduction. (Translation from French by
M. P. Silverman.)

328

6.1 The stock market is a casino

329

As a scientist I am frequently invited to talk about my research to audiences of


physicists and other technical people as well as to general audiences. Although
undertaken primarily for my own edification, the investigation of randomness in
the stock market made an interesting way to conclude the lectures I gave on the
randomness of nuclear decay. Interestingly, many in the audiences were more
surprised and disturbed by the implications of my study of the stock market than
by the prospect of non-random disintegration of nuclei. I suppose that shows that
the ubiquitous repetition of a baseless claim can have greater credibility, even to
physicists, than a fact solidly grounded in the laws of physics.
The stock market is a large and vastly complex stochastic system, and a person can
spend an entire professional life studying it. That was never my intention. I was not
(and am not) interested in the multiplicity of investment options, such as various
kinds of derivatives, that traders, speculators, arbitrageurs, hedge-fund managers,
and other murky denizens of the financial world deal with. I am concerned only with
the fundamental question that a typical more or less financially inexperienced lay
person, planning for an adequate retirement (rather than the chance of spectacular
wealth), might want to ask about a particular stock or portfolio of stocks.
From the record of performance of a particular stock or fund can I tell whether or
when
(a) to buy it?
(b) to sell it?
In my experience, thats really all an average investor wants to know. He doesnt
want to buy shares of something if the share price is likely to go down, and he
doesnt want to sell shares if the share price is likely to go up.
Here in a nutshell is the connection I found between nuclear decay and the stock
market.
Quantum mechanics predicts that radioactive nuclei decay randomly and independently. That means there is no information whatever in a time series of past
disintegrations to determine which or how many nuclei will decay at any subsequent
moment in the future. Nothing in the universe is supposed to be more random than
that. No experimental test can actually prove that radioactive nuclei decay randomly,
but it is possible, as I discussed in Chapter 3, to test whether they decay non-randomly.
If the decay of nuclei at one moment leads with some regularity to decay of nuclei
at a specified later moment, then the decays are correlated, and this apparent lack of
independence may provide useful predictive information. The correlation function
quantifies the correlations for all intervals within the duration of the time series.
If the mean number of disintegrating nuclei varies periodically, then the process
causing the nuclear disintegrations is again not random. The Fourier transform finds
all such periods within a time series.
Having tested various kinds of nuclear decay by these and other measures,
I was able to conclude in every investigation that there was no evidence to suggest

330

Doing the numbers nuclear physics and the stock market

non-random behavior. The results did not overthrow quantum mechanics, but they
were published because in science it is always important to test ones beliefs carefully
and thoroughly.
The daily closing prices of stocks in a stock market provide another time series of
numbers. For a while the series may rise; shareholders are happy and economists will
tell you why the market is doing well. Then things change; the series may fall;
shareholders are unhappy and economists will now tell you why the market is doing
poorly. Economists will always have reasons for why the market is not doing well and
can always propose solutions to fix the problem. Unlike physicists who (for the most
part) are in accord over the fundamental principles of their discipline and can agree
when a problem is solved correctly, no two economists will likely ever agree on a
solution to a problem but, if the stock market is not doing well, they are all sure
there is a problem.
But maybe there isnt a problem.
I have examined the change in daily closing prices of numerous stocks and funds with
the same tests I used to search for non-random behavior in the disintegration of
radioactive nuclei and in the emission of photons from excited atoms. I looked at
these records for periods before the latest financial meltdown as well as afterward.
And here are the results: in no instance did I find convincing evidence of non-random
behavior. Correlations, periodicities, numerical patterns, and other statistical indices
showed that for all practical purposes of prediction,2 each company or fund could
well have been some kind of radioactive nucleus. Physicists call this kind of randomness white noise, a name that refers to a broad, flat spectrum of frequencies, rather
than to the preponderance of Caucasian traders in the New York Stock Exchange.
Moreover, the white-noise character of these track records seemed to be largely
independent of economic, political, or social perturbations. The implications of these
results, if they accurately characterize stock price fluctuations, are consequential.
First, you have undoubtedly read or been told whenever you invest that Past
performance is no guarantee of future results. Believe it! Nevertheless, if your experience is like mine, you can sense the prospectus winking at you as it offers this warning
(usually in very small font size) because the company executives or brokerage firm really
dont want you to believe it. In one flyer I received from one of the largest financial
services company in the USA, the warning was followed by another sentence (in larger
font size) that claimed to offer me a track record of competitive investment performance. Competitive in regard to what other random processes? If the prospectus was
completely truthful, the warning would read more transparently: Our track record is no
more correlated with future results than is the record of decay of radioactive nuclei. Of
course that might not mean much unless the reader was a physicist.

I refer to prediction by an ordinary investor, not by an ultra-fast computer designed to make trades in fractions of a
second. We will come to this point in due course.

6.1 The stock market is a casino

331

Second, if the fluctuations in share price of a stock or fund contain no useful


information, then it evidently matters little what the fund manager does. You might
want to think about that if you are paying management fees. Moreover, if your
financial advisor selects investments for you on the basis of a companys track
record i.e. past performance you are paying for advice of no greater predictive
value than if you selected funds yourself by dartboard, coin toss . . . or nuclear decay.
You might want to think about that too if you are paying for the financial advice.
Third, if the changes in closing prices represent white noise, then the original
record of closing prices likely follows a statistical pattern analogous to the diffusion
of perfume molecules through the air, one characteristic feature of which is persistence. This Brownian noise named for a botanist and not for a color has long
correlation times. Brownian sequences show long upward trends that inexorably
reverse to long downward trends and vice versa, not as a result of any specific cause,
but simply because that is what Brownian noise does.3 Bear that in mind the next
time you are told to buy and hold or that the stock market always gives higher
returns than anything else over the long run.
When all is said and done, there are three ways to do well in the stock market.
 The first is by luck. In any game of chance there will be some people who win
although most usually dont.
 The second is to have information that other investors do not have. This is most
commonly done by insider trading and is considered illegal except, apparently, in
the case of members of the US Congress. A statistical study4 of the annual average
stock performance of US senators in relation to the market has shown that US
senators beat the stock market by 12.3% during a period when the average
American household had a (negative) return of 1.5%.
 The third is to be very wealthy. Mathematical analyses of games of chance (e.g.
gamblers ruin) show that the greater the initial capitalization, the lower is a
gamblers probability of ultimate loss. With regard to the stock market, the more
capital you have, the better you can absorb losses from high-risk ventures that
promise exceptionally high returns.
When, following the recent financial meltdown, business magnate Warren Buffett
wrote a New York Times op-ed piece5 Buy American. I am, he expressed the
irrationally optimistic view that Over the long term, the stock market news will be
good. That expectation may or may not be true. If stocks are like radioactive nuclei,
then there is little reason to believe it. And, if the term is long enough, the investor
will be dead and his interest in stock market news considerably diminished. What
3

4
5

M. P. Silverman, Computers, coins, and quanta: unexpected outcomes of random events, A Universe of Atoms, An
Atom in the Universe (Springer, 2002) 279324.
A. J. Ziobrowski, P. Cheng, J. W. Boyd, and B. J. Ziobrowski, Abnormal returns from the common stock investments
of the U.S. senate, Journal of Financial and Quantitative Analysis 39 (2004) 661676.
W. Buffett, Buy American. I am, http://www.nytimes.com/2008/10/17/opinion/17buffett.html

332

Doing the numbers nuclear physics and the stock market

Buffett neglected to mention, however, is that his wealth enabled him to undertake
risks that would be unwise for the average investor, so that irrespective of the
outcome, he would still end up wealthy whereas those of modest means who followed
his advice might lose most of their savings.
In short, the result of doing the numbers with nuclear physics is to realize with
near mathematical certainty that investing in the stock market is no different from
gambling in a casino but for one important distinction. The latter is done by choice
for amusement with money people can afford to lose (if they gamble responsibly).
However, for the increasing number of workers who must secure their retirement
income from some kind of defined contribution plan, the gambling is done out of
necessity with money they will need to live on.
To the question What can you expect to gain in the long term from investing in
the stock market? the mathematical answer is this: nothing.
Before proceeding further I will make the same disclaimer here that I make at my
lectures: I am not a financial advisor; I do not give (and am not now giving) financial
advice. I am only relating what I learned from a limited study of certain statistical
features of the stock market. Any decision readers may make or forego on the basis
of something they read in this book is their own responsibility.
Now as the Marketplace man says well have the details.
6.2 The details CREF, AAPL, and GRNG
The time record of a stock or stock fund (I will refer to either simply as a stock) can
take an infinite variety of appearances of which three such records are shown in the
upper panel of Figures 6.1, 6.2, and 6.3 for the Stock mutual fund of the College
Retirement Equities Fund (CREF), the Apple Computer Company (AAPL), and the
Grange Information Services Corporation (GRNG). I have examined the records of
many stocks, but have chosen these three for illustration for both instructional and
personal reasons.
Most teachers and researchers in the USA who read this book have probably
invested in CREF Stock and therefore have a personal interest in the statistics of its
track record. However, I chose it also because it is an example of an actively managed
stock fund. In other words, there is a department of financial experts whose fulltime job presumably is to determine which companies and how many shares of each
to include in the fund. One might expect, therefore, that if active management should
lead to results superior to dart throwing or nuclear decay we should assuredly
see this in the performance of CREF Stock.
I chose AAPL because I like Apple computers. All of my books (except the first
when I didnt have an Apple computer) and nearly all of my scientific publications
were written on one kind of Apple computer or another. However, apart from
familiarity bred of long association, I chose AAPL because nearly everyone (I would
think) has heard of the company, which at least up to the death of Steve Jobs had

333

6.2 The details CREF, AAPL, and GRNG

CREF STOCK - Time Series

400
200
0
200
0

500

1000

1500

2000

2500

3000

3500

4000

Time (d)
CREF STOCK - Log Power

10
5
0
5
0

0.5

1.5

2.5

Log Harmonic
CREF STOCK - Autocorrelation

0
1
0

200

400

600

800 1000 1200 1400 1600 1800 2000

Lag (d)
Fig. 6.1 Statistics of CREF STOCK over a 4096-day period from 27 December 1994 to
4 October 2010. Top panel: time series of closing prices: raw data (upper solid); detrended
data (lower solid); lines of regression (dashed). Middle panel: log of the power spectrum of
the detrended series (solid) and line of regression (dashed) with slope characteristic of
Brownian noise. Bottom panel: autocorrelation of the original (solid) and detrended
(dashed) time series.

a global reputation for producing innovative products that captured a large loyal
customer base. By such measure it is a successful company. One might expect,
therefore, that if technological innovation accompanied by artistic flair and a sharp
instinct for what appeals to a device-buying public should lead to results superior to
dart throwing or nuclear decay we should find this in the performance of
AAPL stock.
I choose GRNG for the opposite reasons. Virtually no one but a relatively small
number (compared to the general population) of technical specialists have heard of it,
and it is not actively managed.

Doing the numbers nuclear physics and the stock market

334

AAPL - Time Series

400
300
200
100
0
0

500

1000

1500

2000

2500

3000

3500

4000

Time (d)
AAPL - Log Power

10
5
0
5
0

0.5

1.5

2.5

Log Harmonic
AAPL - Autocorrelation

1
0

200

400

600

800

1000

1200 1400

1600

1800

2000

Lag (d)
Fig. 6.2 Statistics of AAPL stock over a 4096-day period from 21 July 1994 to 22 October
2010. Top panel: time series of closing prices: raw data (upper solid); detrended data (lower
solid); lines of regression (dashed). Middle panel: log of the power spectrum of the detrended
series (solid) and line of regression (dashed) with slope characteristic of Brownian noise.
Bottom panel: autocorrelation of the original (solid) and detrended (dashed) time series.

Each time record in the top panel of Figures 6.16.3 consists of the closing prices
on 212 4096 consecutive days, a period a little longer than 11 years. The reason for
selecting a period as a power of 2 is that it permitted calculation of the discrete
Fourier transform (DFT) by means of a fast algorithm as explained previously.
I have taken the time unit to be t 1 day because records of daily opening and
closing stock prices are readily available at no charge to the ordinary investor from

335

6.2 The details CREF, AAPL, and GRNG

GRNG - Time Series

200
100
0
100
200
0

500

1000

1500

2000

2500

3000

3500

4000

Time (d)
GRNG - Log Power

10
5
0
5
0

0.5

1.5

2.5

Log Harmonic
GRNG - Autocorrelation
1

1
0

200

400

600

800

1000

1200

1400

1600

1800

2000

Lag (d)
Fig. 6.3 Statistics of GRNG stock over a 4096-day period. Top panel: time series of closing
prices: raw data (upper solid); detrended data (lower solid); lines of regression (dashed).
Middle panel: log of the power spectrum of the detrended series (solid) and line of regression
(dashed) with slope characteristic of Brownian noise. Bottom panel: autocorrelation of the
original (solid) and detrended (dashed) time series.

the internet and an ordinary investor (. . . works all day, comes home at night. . .)
would not likely have time or inclination to monitor a stock portfolio throughout the
day anyway.
The CREF time record in Figure 6.1 shows two plots: (a) the upper trace shows
the original time record with calculated trend line; (b) the lower trace shows the

336

Doing the numbers nuclear physics and the stock market

detrended time record i.e. after the mean and trend have been removed. The
original record shows a net increase in share value over the observed period with a
stationary component that appears to exhibit wavelike returns to zero.
The single AAPL time series in Figure 6.2 shows that the AAPL share price
continued largely unchanged for more than 2000 days before beginning (around
mid 2004) a leap upward in value, despite several sharp reversals. Perhaps the surge
in share price followed the release of some spectacular new Apple device.
The two time series of GRNG share prices in Figure 6.3 again show original and
detrended records. For some 700 days following the point of origin, the share price
trended downward until something in the economy (perhaps the release of favorable
statistics by some federal agency) triggered a steady climb upward that lasted more
than 2500 days (~ 7.1 years) or greater than 61% (2500/4096) of the total displayed
record. Alas, the economic climate again triggered a reversal (. . .Another government
report? Refusal of the Chinese government to export rare-earth elements needed for
manufacture of computer chips? Attempted takeover by Google? Indictment of company
executives for securities fraud?. . .) and the share price began another long trek
downward.
Three different stocks three very different time records. Nevertheless, there is a
common dynamic associated with all three, as suggested by a plot of the log of the
power spectrum (of the detrended time series) against the log of the frequency (i.e.
harmonic number) shown in the middle panel of Figures 6.1, 6.2, and 6.3. All three
plots are virtually identical. The maximum likelihood lines of regression to the three
traces have respective slopes of 1.791 (CREF), 1.798 (AAPL), 1.887 (GRNG).
The value of this slope,  1.8, reveals important information concerning how well
one can forecast future price changes based on the past history. I will take up this
matter in due course.
The similarity of the dynamics underlying the three time records also shows up in
the autocorrelation function, as it must since the autocorrelation and power spectrum are related by the WienerKhinchin theorem. The dual traces in the lower panel
of Figures 6.1 6.3 show the sample serial correlations calculated for both the original
and detrended time records. Note that irrespective of whether the original record was
rising, falling, or flat during various parts of its history, the serial correlations are all
very similar in exhibiting a long-range correlation decreasing slowly with delay time.
The serial correlations were calculated, as described in Chapter 3, from the DFT of
the power spectrum. The maximum delay is k N/2 for a time record of length
N, beyond which the serial correlation rk begins to repeat. Leaving aside technical
details such as the accuracy or reliability of values of rk for large lag times k e.g.
values of k for which rk becomes negative and even overlooking the differences in rk
between the original and detrended time series, it is apparent from these figures that
positive correlations persisted for several hundred days.
It is not the original or raw time record of share prices, however, that matters to
the typical investor for retirement, but the record of price changes. After all, who

337

6.2 The details CREF, AAPL, and GRNG

would invest in the stock market if he did not expect the value of the purchased
shares to climb higher? Figures 6.4, 6.5, and 6.6 respectively show the time record
(upper panel), power spectrum (middle panel), and correlation function (lower panel)
of the first difference i.e. the day-to-day price change corresponding to the price
histories in Figures 6.16.3. In other words, if fxi i 1. . .Ng is the time series of daily
closing prices, then fwj xj  xj1 j 2. . .Ng is the time series of first differences.
The pair of dashed lines in the upper and lower panels mark 2 standard deviations

CREF STOCK - First Dierence Time Series

20

20
0

500

1000

1500

2000

2500

3000

3500

4000

Time (d)
CREF STOCK - First Dierence Log Power

0
5
0

0.5

1.5

2.5

Log Harmonic
CREF STOCK - First Dierence Autocorrelation

0.1

0.1
0

50

100

150

200

250

300

350

400

Lag (d)
Fig. 6.4 Daily price change of the CREF time series in Figure 6.1. Top panel: time series
(solid) of first-difference closing prices and 2 limits (dashed). Middle panel: log of the firstdifference power spectrum (black points) and line of regression (dashed) characteristic of white
noise. Bottom panel: autocorrelation function of the first-difference time series (solid) and 2
limits (dashed).

Doing the numbers nuclear physics and the stock market

338

AAPL - First Dierence Time Series

20

20
0

500

1000

1500

2000

2500

3000

3500

4000

Time (d)
AAPL - First Dierence Log Power

5
0

0.5

1.5

2.5

Log Harmonic
AAPL - First Dierence Autocorrelation

0.1

0.1
0

50

100

150

200

250

300

350

400

Lag (d)
Fig. 6.5 Daily price change of the AAPL time series in Figure 6.2. Top panel: time series
(solid) of first-difference closing prices and 2 limits (dashed). Middle panel: log of the firstdifference power spectrum (black points) and line of regression (dashed) characteristic of white
noise. Bottom panel: autocorrelation function of the first-difference time series (solid) and 2
limits (dashed).

about the mean (0) as obtained either empirically from the sample or theoretically
from a model distribution function to be discussed soon; the two calculations lead to
the same standard deviation. The dashed line in the middle panel shows the maximum likelihood line of regression whose slope in all three cases (CREF 0.018,
AAPL 0.355, GRNG 0.056) is close to 0. As I pointed out before, the slope of
the double-log plot of the power spectrum tells us much about the nature of a time
series. Apart from an apparent excess of fluctuations or volatility in the parlance

339

6.2 The details CREF, AAPL, and GRNG

GRNG - First Dierence Time Series

10

10
0

500

1000

1500

2000

2500

3000

3500

4000

Time (d)
GRNG - First Dierence Log Power
0

5
0

0.5

1.5

2.5

Log Harmonic
GRNG - First Dierence Autocorrelation

0.05

0.05
0

50

100

150

200

250

300

350

400

Lag (d)
Fig. 6.6 Daily price change of the GRNG time series in Figure 6.3. Top panel: time series
(solid) of first-difference closing prices and 2 limits (dashed). Middle panel: log of the firstdifference power spectrum (black points) and line of regression (dashed) characteristic of white
noise. Bottom panel: autocorrelation function of the first-difference time series (solid) and 2
limits (dashed).

of economists in the time records of CREF and AAPL the statistical behavior in the
three displays of first differences is very close to what one expects for white noise.
I have a confession to make. There is no company whose stock symbol is GRNG
at least none when I checked the list of market symbols at the time of writing this
chapter. GRNG is my designation of Gaussian Random Number Generator. The time
record in Figure 6.3 and associated statistical panels were obtained by a stochastic
algorithm that simulates nuclear decay. Any mathematical algorithm to generate

340

Doing the numbers nuclear physics and the stock market

random numbers must necessarily be a pseudo-random number generator, but the


one I used is good enough for the purposes of this chapter. As a financial entity, the
Grange Information Services Corporation is an unmanaged company that provides no information whatever, and its white-noise statistical record of price changes
is practically equivalent to that of the highly managed CREF Stock and the highly
innovative Apple Company.
6.3 Theory of information H
Actually, the preceding remark about information is not quite true, since physicists
and communication engineers use the term information in a somewhat different
way than its vernacular meaning. As developed by H. Nyquist, R. V. L. Hartley,
and especially by C. E. Shannon6 all from the once formidably innovative Bell
Laboratories the concept of information is equivalent to statistical entropy to
within a scale-setting constant factor. The greater the number of possible outcomes
to some stochastic process, the greater is the uncertainty, and therefore the entropy,
of that process. By an extension of its meaning, the information content of a sequence
of outcomes or symbols in other words, a message is also greater, because
receipt of a particular message has now reduced the uncertainty. Had there been
only one possible outcome and its occurrence certain, then subsequent notification of
the result would have brought no new information at all.
One rationale for identifying information with entropy is the connection between
the latter and the probability of a transmitted message. Consider a message consisting of elements from a finite set of symbols fAi i 1. . .mg in which pi is the
m
X
probability of occurrence of symbol Ai and
pi 1. Since there are m possible
i1

choices of symbol for each outcome, there are mN conceivable messages of


length N. All these possible messages, however, are not equally probable. The number
of ways (multiplicity) of realizing a message with symbol frequencies fn1, n2. . .nmg in
which ni Npi is given by the multinomial distribution
MN

with constraint

m
X
i1

N!
N!
Y
m
n1 ! . . . nm !
ni !

6:3:1

i1

ni N. Using Stirlings approximation of x! for x  1,


1
2

ln x!  x ln x  x ln 2x,

6:3:2

one can show that the set of probabilities fpig that maximizes MN or, equivalently,
log MN (to any base) is the distribution that maximizes the statistical entropy
6

C. E. Shannon, A mathematical theory of communication, Bell System Technical Journal 27 (1948) 389423, 623656.

6.3 Theory of information H

H

m
X
1
pi log2 pi ,
log2 MN 
N
i1

341

6:3:3

designated H by Shannon, who used base 2. [Note: To go from (6.3.1) to (6.3.3), one
can ignore the third term of the approximation (6.3.2) since it is much smaller than
the first two terms.] The entropy unit is then in bits: 1 bit the information acquired
in a single binary decision. Were H in (6.3.3) to be expressed in terms of natural
logarithms, as is usually the case in statistical mechanics, the unit of information
would be the nat.
If transmission of a particular symbol say Ak were certain ( pi ik i 1. . .m),
the information or uncertainty H in (6.3.3) would vanish. (Recall that ik is the
Kronecker delta symbol.) On the other hand, if the transmission of all symbols was
equally likely ( pi 1/m), then H would assume its maximum value Hmax
log2 m bits.
In the general case between certainty and total uncertainty, the expression (6.3.3)
for H represents the expectation hlog Pi in which P is a probability function with
realizations comprising the set fpig.
An alternative, axiomatic approach7 to defining information employed by
Shannon was to require a function H( p1. . .pm) with the properties that
 H should be a continuous function of the ps,
 H should be a monotonic increasing function of the number m of symbols,
 H should satisfy a certain linearity criterion, best explained by means of an
example.
Suppose a decision is to be made with three possible outcomes with associated
probabilities fp1, p2, p3} that sum to unity. The uncertainty or information inherent
in this triad of choices is H( pi, p2, p3). Now suppose instead that the same decision is
to be made in two stages, where the first entails two choices with probabilities fp1, p4}
and the second, which occurs a fraction p4 of the time, entails two choices with
probabilities fp5, p6}. Overall, there are still three outcomes with respective probabilities fp1, p4 p5, p4 p6} that sum to unity, as illustrated in the decision tree in Figure 6.7.
Equivalence of the two procedures requires that
(a) p4 p2 p3 (since the new second choice replaces the original second and third
choices),
(b) p4 p5 p2,
(c) p4 p6 p3.
According to Shannon, the information inherent in the second procedure must take
the form H( p1, p4) p4H( p5, p6), where the coefficient p4 is a weighting factor
7

C. E. Shannon and W. Weaver, The Mathematical Theory of Communication (Illini Books, Urbana IL, 1964) 49.

342

Doing the numbers nuclear physics and the stock market

Decision Trees
p1

p2

p1

p4

p4 p5
p5
p6

p3

1-Step
Decision

p4 p6

2-Step
Decision

Fig. 6.7 Decision trees for a decision with three possible outcomes of probability pi (i 1, 2, 3)
to be made either in one step (left panel) or in two steps (right panel) in which the outcomes
2 and 3 occur in a fraction p4 of the cases.

introduced because the second choice occurs only that fraction of the time. Shannon
was able to demonstrate that the only function satisfying the equality
Hp1 , p2 , p3 Hp1 , p4 p4 Hp5 , p6

6:3:4

under the specified constraints must have the form


m
X
Hp1 . . . pm 
pi log pi

6:3:5

i1

up to an arbitrary scale factor which determines the units of information. If the units
are in bits, then the base of the logarithm, which was also arbitrary, is chosen to be 2.
I have on occasion been asked by students and colleagues why the letter H (and
not I, for example) was chosen to symbolize information. It is not uncommon in
statistical physics to find quantities symbolized by the first letter of the corresponding
word in German. Thus, for example, one will often see W for probability
(Wahrscheinlichkeit) and almost always Z for partition function (Zustandssumme
sum over states). So what about H? I cannot say with certainty this is, after all, a
book about chance and uncertainty but I would speculate8 that Shannon chose the
letter with Boltzmanns H-theorem in mind. Boltzmanns quantity H was an early
attempt (~1870s) at describing entropy. Why not E, then, since entropy in German
is Entropie? From what I have learned second-hand, Boltzmann did use a script
8

I have also speculated (correctly, I believe) on the origin of James Clerk Maxwells strange choice of electromagnetic
field symbols in my earlier book, Waves and Grains: Reflections on Light and Learning (Princeton University Press,
Princeton NJ, 1998).

6.3 Theory of information H

343

upper-case E to represent entropy (. . .all nouns in German, whether proper or


common, are in upper case. . .), but the symbol was mistakenly taken for the letter
H by an English physicist.9 And so H has ironically come to symbolize information as
a result of incorrect reception of a symbol transmitted over a noisy channel i.e. the
brain of an English physicist. I think Shannon would have been amused at this
adventitious turn of events. On the other hand, the letter H may have been chosen
to honor Hartley, who first defined information quantitatively in one of his papers as
log mN.
The expression (6.3.5), which I will rewrite as
HA 

m
X

pAi log pAi

6:3:6

i1

to indicate explicitly that it is the uncertainty of the events symbolized by the set fAig,
can be generalized to address the matter of conditional information (or conditional
entropy). Circumstances often arise as will be the case in discussing the stock
market in which we may want to know how much information is provided
by events fAig, given that events of another kind represented by the set of symbols
fBj j 1. . .m0 } are known to have occurred. The entropy of the second set takes the
same form as (6.3.6)
HB 

m0
X

pBj log pBj :

6:3:7

j1

Both sets of probabilities are assumed complete


m
X

pAi

i1

m0
X

pBj 1,

6:3:8

j1

and it is to be recalled that the probability of an element Ai conditioned on element


Bj is
pAi jBj

pAi Bj
pBj

6:3:9

where p(AiBj) is the probability of joint occurrence of the two events. The marginal
probabilities p(Ai), p(Bj) of individual events are derived from the joint probabilities
by summation over the irrelevant elements
pAi

m0
X
j1

pAi Bj

pBj

m
X

pAi Bj :

6:3:10

i1

D. Lindley, Boltzmanns Atom: The Great Debate That Launched a Revolution in Physics (The Free Press, New York,
2001) 75.

344

Doing the numbers nuclear physics and the stock market

The conditional entropy H(AjB) is defined in the first relation below


X 

HAjB  
pAi jBj log pAi jBj pBj
i, j

X 
pAi Bj
pAi Bj log

pBj
i, j

6:3:11

in which the quantity in square brackets summed over index i is the entropy of the set
A conditioned on element Bj. When multiplied by p(Bj) and summed over index j, the
resulting expression is the entropy of A conditioned on the full set of symbols
B. Substitution of (6.3.9) leads to the second relation in (6.3.11) in terms of joint
and marginal probabilities.
Entropy (and therefore information), like energy in physics, is additive over
subsystems; that is, the entropy H(A B) of the combined system of m m0 events
fAi, Bj i 1. . .m, j 1. . .m0 } is just the sum of the entropies, H(A) H(B), of the
separate parts. This total entropy should not be confused, however, with the entropy
H(AB) of the mm0 joint events fAiBjg, which is defined in the first relation below
X
HAB  
pAi Bj log pAi Bj
i, j
HA HBjA HB HAjB
6:3:12
and expressible, after substitution of (6.3.9), in either of two equivalent ways. Note
that H(AB) is generally not an additive function since it is the sum of an entropy and a
conditional entropy. As a consequence, H(AB) is always less than or equal to the sum
of the component entropies
HAB  HA HB

6:3:13

HAjB  HA or HBjA  HB:

6:3:14

because

That is, the uncertainty H(AjB) (or H(BjA)) conditioned on the acquisition of new
information is always less than or equal to the unconditional uncertainty H(A) (or
H(B)). The equality in (6.3.13) or (6.3.14) holds only in the case of statistical independence of sets A and B whereby the joint probability factors: p(AiBj) p(Ai)p(Bj).
A formal demonstration of relation (6.3.14) is easily made but will be left to an
appendix.
The difference between the right and left sides of the inequalities in (6.3.13) and
(6.3.14)
HA, B  HA HB  HAB HA  HAjB HB  HBjA


X
pAi Bj

6:3:15
pAi Bj log
pAi pBj
i, j

6.3 Theory of information H

345

provides a measure of the amount of new information acquired about A by


knowledge of the occurrence of events B. If systems A and B are statistically
independent, then (6.3.15) vanishes identically because H(AjB) H(A) (and
H(BjA) H(B)); alternatively, it is readily seen that the argument of the logarithm
becomes unity. In the opposite extreme, if correlation is certain so that occurrence
of Bj always leads to occurrence of Aj (in a system for which the number of A and
B symbols is equal: m m0 ), then the conditional probability must be p(AijBj) ij,
whereupon it follows formally from (6.3.11) that H(AjB) H(BjA) 0. This result
also makes sense physically because there is now no uncertainty and therefore no
information to be gained in any element of A given the occurrence of any
element of B.
If we divide (6.3.15) by H(A), we have a dimensionless non-negative ratio



pAi Bj
pAi Bj log
pAi pBj
HA, B
HAjB
i, j
X
,
1

HA
HA
pAi logAi

X

6:3:16

which gives the fraction or percentage decrease in uncertainty on A as a result of


knowing which events B have occurred.
As an illustration of the preceding ideas which will prove useful when we
consider the information content of stock market time series I was interested to
determine how much information about the local weather is provided by a National
Weather Service (NWS) 13 day forecast. To make matters simple, I just wanted
information on precipitation (rain and snow). By searching the internet (and locating
web pages that most likely will not exist by the time this book is published), I found
the following pertinent figures whose reliability I would never attest to, but which did
not seem unreasonable and will serve in any event for my example. The accuracy of a
NWS forecast of local precipitation was given as 83.95%. I will interpret this to mean
that on the average a forecast of precipitation resulted in precipitation and that
a forecast of no precipitation resulted in no precipitation with an accuracy of
about 84%. Further, I found that in the area where I live there is precipitation on
approximately 114 days of the year, which seemed to me a little high, but I will use
the figure nevertheless.
There are two sets of events, the actual weather and the predicted weather, which
I will represent as follows
A Precipitation
B Forecast of precipitation
A No precipitation B Forecast of no precipitation:

6:3:17

Based on frequency, the probabilities of the weather events are estimated to be


pA

114
0:31
365

pA 1  pA 0:69:

6:3:18

346

Doing the numbers nuclear physics and the stock market

Based on the alleged NWS success rate, the conditional probabilities are taken to be

pAjB p AjB 0:84
pAjB pAjB 1  pAjB 0:16:
6:3:19
The unconditional prior uncertainty about precipitation is therefore10
HA pAlog2 pA  pAlog2 pA 0:893 bits:

6:3:20

This is the information that the NWS forecast would provide if it were 100%
accurate.
To find the actual unconditional information provided by the NWS, we must first
calculate p(B) from the completeness relation
pA pAjBpB pAjBpB,

6:3:21

which, upon substitution of pB 1  pB, leads to


pB

pA  pAjB
0:31  0:16
0:221

pAjB  pAjB 0:84  0:16

6:3:22

and
pB 1  pB 0:779:

6:3:23

The unconditional uncertainty in the NWS forecast is therefore


HB pBlog2 pB  pBlog2 pB 0:761 bits:

6:3:24

We can now calculate the set of four joint probabilities as follows


pAB pAjB pB 0:185
pAB pAjB pB 0:125
pAB pAjB pB 0:035
pAB pAjB pB 0:655:

6:3:25

Using the preceding figures, we obtain from (6.3.11) the uncertainty in the weather
(i.e. precipitation) given the NWS forecast


pAB
pAB
pAB
pA B
pABlog
HAjB  pABlog
pABlog
pA Blog
pB
pB
pB
pB
0:634 bits:

6:3:26

The additional information, from (6.3.15), provided by the NWS


HA, B HA  HAjB 0:259 bits
10

6:3:27

The exact unit is bits per symbol. The additional two words may seem redundant, but this is not the case when a
message consists of a series of symbols.

6.4 Is there information in a stock market time series?

347

amounts to a fractional decrease in uncertainty of


R

HA  HAjB 0:259

29:0%:
HA
0:893

6:3:28

A 29% decrease in uncertainty seems like a respectable number for an 84% success
rate in prediction. In any event, every bit helps.

6.4 Is there information in a stock market time series?


The discrete time series of a stocks closing price is a form of message comprising a
string of digital symbols. The fundamental question for a typical investor is whether,
on the basis of the past record of the stock, it is possible to predict if the share price is
going to rise or fall. To this end, it is the series of first differences, i.e. the daily change
in share price, that is (or should be) of primary concern. It is this information (if there
is information) that motivates the investor to buy or sell shares of a particular stock.
As representative examples of scores of stocks that I have examined, let us look again
at CREF, AAPL, and GRNG.
The autocorrelation functions plotted in the lower panel of Figures 6.1 6.6
showed decreasing long-range correlations of the original time record and no statistically significant correlations in the time record of first differences. It would seem
reasonable, therefore, to confine our question to a time delay of one day. In other
words, we seek to determine how much less uncertainty there is in the variation of the
closing price between today and tomorrow as a result of knowing how the price
changed from yesterday to today.
The situation is now a little more complicated than that of the NWS forecast of
precipitation since there are three possible states (price rise, price fall, price
unchanged) for each of two sets of symbols (future price change, past price change).
Generalizing the scheme of (6.3.17), I represent the states of the system as follows:
A Future price rise
A Future price fall
A0 Future price unchanged

B Past price rise


B Past price fall
B0 Past price unchanged:

6:4:1

Having chosen a portion of the first-difference time series of length N  1 days


(corresponding to an original time record of N days), I then scanned the series to
determine (a) the number of times the price rose, fell, or remained the same, and (b)
the number of times the price changed in a specific way (rose, fell, or remained the
same) following a price change of one of these three specified types. From these
numerical counts I could estimate all the relevant probabilities needed to calculate
information. These include the probabilities of future change p(Ai) (i , , 0) and
previous change p(Bj) (j , , 0) as well as the conditional probabilities p(AijBj)
that the price will change today in a certain way Ai given that it had undergone a
change Bj on the day before.

348

Doing the numbers nuclear physics and the stock market

I started with the full 4096-day time records, which were used to calculate the
autocorrelation and power spectrum statistics of CREF, AARP, and GRNG, and
examined how the information content equivalently, the reduction in uncertainty
varied as I took shorter intervals (32, 16, 8, 4 days) closer to the present day, i.e. the
day on which a hypothetical investor intended to take some action. The longer the
time interval, the more closely the probabilities estimated from frequencies reflected
the true probabilities of the system, but, of course, the more remote was the
preponderance of price variations from the present.
An example of the conditional probability matrix p(AijBj) for CREF (N 4096) is
shown below. To simplify notation, only the price-change symbol (i , , 0) is
shown, it being understood that A occupies the first slot and B the second:
0

pj

B
pAi jBj @ pj
p0j

pj
pj
p0j

pj0

0:539

C B
pj0 A @ 0:446
0:015
p0j0

0:506
0:475
0:019

0:500

C
0:444 A:
0:056

6:4:2

As required by the completeness relation, each column sums to unity. The unconditional price-change probabilities for the same time period were
pB 0:523

pB 0:459

pB0 0:018:

6:4:3

The numbers show that the probability that the price will remain unchanged from
one day to the next is low. The anticipated unconditional price-change probabilities
p(Ai) are obtained by a generalization of (6.3.21)
pAi

3
X

pAi jBj pBj

6:4:4

j1

and turn out to be identical (to three decimal places) to the set p(Bj). Such agreement
is expected; in a sufficiently long time series of events such that the probabilities of
different states are estimated from frequencies of occurrence, it would be odd indeed
if a different set of probabilities were obtained merely by counting the same numbers
partitioned into categories (rise, fall, same). The agreement, however, deteriorates as
the time period over which the statistics are obtained shortens. This is an indication
that the data are too few to provide adequate estimates of probability and
therefore estimates of entropy derived from these frequencies are not statistically
meaningful.
The joint probabilities p(AiBj) follow from (6.4.2) and (6.4.3) by the same relations
employed in (6.3.25)
pAi Bj pAi jBj pBj

6:4:5

and lead to the joint probability matrix (with same symbol convention used
previously)

349

6.4 Is there information in a stock market time series?

Table 6.1

Information content of time series of stock closing prices (4096 days)

Stock

H(AjB)
(bits)

H(A)
(bits)

H(A)  H(AjB)
(bits)

HA  HAjB
HA
(%)

P(B)

P(B)

P(B0)

CREF
AAPL
GRNG

1.105 45
1.140 96
0.999 63

1.107 03
1.141 60
0.999 84

1.581(3)
6.366(4)
2.180(4)

0.143
0.056
0.022

0.523
0.502
0.507

0.459
0.473
0.493

0.018
0.025
0

B
pAi Bj @ p
p0

p

p0

p

C B
p0 A @

p0

p00

0:282

0:232

0:233

0:218

8:7913

C
7:8143 A:

7:8143 8:7963 9:7684


6:4:6

Knowledge of p(AiBj), p(Ai) and p(Bj) permits one to calculate, by means of the
relations given in the previous section, the initial uncertainty of price variation
and the extent to which this uncertainty is diminished by knowing how the price
had varied in the past (with a delay of one day). The results for CREF, AAPL, and
GRNG are summarized in Table 6.1. The fractional acquisition of information (or,
equivalently, decrease in uncertainty) is minute and of no practical statistical use to
an investor.
Although the CREF and AAPL time series appear to provide a miniscule amount
of information more than GRNG, the initial and conditional entropies in the table
are nearly identical to what one would expect statistically for a coin toss with slightly
biased coin. If we ignore the small probability of a share price remaining unchanged
over a one-day interval, then there are just two states (price rise, price fall) of
approximately equal probability. Each symbol (, ) in the time series of first
differences then contributes an uncertainty of 1 bit (log2 2 1), which is precisely
the entropy obtained for GRNG and very close to the entropies of CREF and
AAPL. (Recall that entropy is the uncertainty per symbol in a long sequence of
symbols.) For time series shorter than 4096 days, but long enough to generate valid
estimates of probability (e.g. 100 days), the event 0 did not occur and the initial
and conditional entropies of CREF and AAPL turned out to be 1 bit within a few
parts in 104.
Indeed, the NWS meteorologists forecast of my local weather provided about
200 times more information to reduce my initial uncertainty of precipitation
than did the CREF time series of price changes. If I (still) had a financial
advisor who based his advice on examination of stock time charts, I think
I would do better to consult instead my local weatherman. In any event,
I couldnt do worse.

350

Doing the numbers nuclear physics and the stock market

6.5 Stock price and molecular diffusion


The realization that stock prices follow what physicists refer to as a random walk or
Brownian motion goes back a long time in fact, to the start of the twentieth century
five years before Einstein published his theory of Brownian motion (the experimental
confirmation of which by Jean Perrin helped firmly establish scientific credence in the
atomic constitution of matter). As far as I am aware, the earliest application of
probability theory to market behavior was the doctoral thesis of Louis Bachelier
submitted to the Faculty of Sciences of the Academy of Paris in March 1900 and
subsequently published11 in the scientific annals of the Ecole Normale Superieure, one
of the French grandes ecoles.
Few physicists, I suppose, have ever heard of Bacheliers thesis. I knew nothing of
it myself throughout my stock market investigation until long afterward when
someone in the audience at one of my lectures brought the existence of the document
to my attention. From what I have read since, there has been much discussion among
historians as to whether Bacheliers thesis was duly appreciated by contemporary
French mathematicians, in particular his thesis advisor Henri Poincare, one of the
purest of mathematicians. The thesis subject, as Poincare remarked, was somewhat removed from those which candidates for a mathematics degree in France
ordinarily chose to develop. Nevertheless, Poincares report, which has been published,12 struck me as insightful and favorable. From his vantage point as a pure
mathematician and I would add from my own as a practical physicist Poincare
recognized clearly what constituted the seminal contributions of the thesis: an
original derivation of the law of errors (i.e. the Gaussian probability distribution),
the connection between the evolution in time of the Gaussian probability density
function (pdf ) and Fouriers theory of heat diffusion, and a clever combinatorial
argument that permitted reduction to a simple algebraic expression of a highly
complicated multiple integral arising from applications of probability theory to
random walk problems. All in all, given the time when it was written i.e. well
before the foundations of probability theory and statistics were rigorously laid in the
ensuing several decades Bacheliers thesis is quite a remarkable piece of work.
Although Bachelier focused primarily on specific stock market products like
options and forward contracts, his basic theory of the movement of stock prices is
still relevant today to a typical investor as I described that term earlier. Having stated
as a fundamental principle the seminal fact of stock market dynamics namely,
The mathematical expectation of the speculator is zero. Bachelier went on
to calculate the law of probability governing the uncertainty in stock prices.
His method of analysis, which is brief but incisive, is worth examining. Before doing
11

12

L. Bachelier, Theorie de la Speculation, Annales Scientifiques de lEcole Normale Superieure 3 (1900) 2186.
Translated into English by P. H. Cootner, The Random Character of Stock Market Prices (MIT Press, Cambridge
MA, 1964) 1778.
J.-M. Courtault et al., Louis Bachelier On The Centenary of Theorie De La Speculation, Mathematical Finance 10
(2000) 341353.

6.5 Stock price and molecular diffusion

351

so, however, the preceding italicized fundamental principle, which is italicized in


the thesis, warrants some clarification.
Apart from its vernacular meaning, the term expectation is, of course, a
statistical term referring to the mean of a random variable. Bachelier meant for his
fundamental principle to be interpreted in a global sense embracing all the speculators in the stock market. Poincare stated this meaning actually better than Bachelier
in his evaluation of the thesis.
The buyer believes in a probable rise, otherwise he would not buy, but if he buys, it is because
someone sells to him, and this seller obviously believes in a probable decline. From this results
that the market, considered as a whole, takes the mathematical expectation of all operations
and of all combinations to be zero.

Pictured geometrically, this statement connotes a horizontal averaging over the


activities of all investors taking place in the market within some short interval of time.
However, the statement can also be interpreted and I meant it as such in the
question at the end of Section 6.1 just as written above with a singular speculator.
Pictured geometrically, the statement implies that a vertical averaging over time of
the multifarious stock activity of one investor will yield zero expectation of gain.
How can that be?, you may be thinking, after all, people do make money in the
stock market and they are not all crooks or members of the U.S. Congress (or both).
That may be true, but it does not contradict the fundamental principle as interpreted either horizontally (ensemble averaging of many investors) or vertically (time
averaging of one investor). We shall see in regard to the latter that stock prices follow
a stochastic process closely modeled by a one-dimensional Gaussian random walk,
which gives rise to track records that can persist for long periods of time resulting in
either a net gain or a net loss.
Starting from the principle of zero expectation, Bachelier argued that the variance
in a quoted price was independent of the mean and that the mathematical form of the
probability law was symmetric about the mean. For convenience, therefore, he chose
a coordinate system such that the mean price was zero to express the probability
p(x,t)dx that a price fell in the range (x,x dx) at time t. Referring (without demonstration) to a principle of joint probabilities what we would today call the
ChapmanKolmogorov equation Bachelier then expressed p(x, t) in terms of the
probability densities at earlier moments t1, t2

px, t1 t2

px0 , t1 px  x0 , t2 dx0 :

6:5:1

It is not difficult to show that the Gaussian function


ex =2 t
px, t p
2 2t
2

6:5:2

352

Doing the numbers nuclear physics and the stock market

is a properly normalized solution to (6.5.1) with time-dependent parameter (the


variance) designated by 2t . These are not the symbols or terminology that Bachelier
used, but there is no point losing transparency in analysis for the sake of history.
The question then remained: how did 2t vary in time? Substitution of (6.5.2) into
(6.5.1) leads to the condition that 2t1 t2 2t1 2t2 . The solution to a functional
equation of general form
f 2 u v f 21 u f 22 v

6:5:3

can be approached in the same way I illustrated previously in regard to the functional
equations that arose in determination of Bayesian priors satisfying certain invariance
relations. Differentiate (once) both sides of (6.5.3), first with respect to u and next
with respect to v, to obtain the two first-order differential equations below expressed
as a single statement
0

f u vf 0 u v f 1 uf 1 u f 2 vf 2 v constant,

6:5:4

where the prime signifies differentiation with respect to the indicated argument.
Equation (6.5.4) is readily integrated to yield f(t)2 2Dt, where D is a constant to
be interpreted shortly (and encountered again in Chapter 10). Applied to the variance
in pdf (6.5.2), one obtains the complete expression (in my notation) found by Bachelier
ex =4Dt
px, t p
4Dt
2

6:5:5

for the spatial and temporal dependence of the probability law governing fluctuations in stock market prices. Keep in mind, however, that the coordinate x refers to
price displacement from the mean or true current market price, not physical length.
It is of significance to remark although Bachelier did not take notice of it that
the form of the right side of Eq. (6.5.1) is a convolution integral

f *gx  f ygx  ydy f x  ygydy:


6:5:6
As such, it expresses the independence of variates in non-overlapping time intervals.
The moment generating function (mgf ) gX(u) or characteristic function (cf ) hX(u)
gX(iu) of a random variable X defined by a convolution of two independent variates is
the product of the mgfs or cfs of those two variates. Thus one could write for the
variates in (6.5.1)
gX u;t1 t2 gX u;t1 gX u;t2

or

hX u;t1 t2 hX u;t1 hX u;t2

6:5:7

where u is just an expansion variable. The variates are governed by the same
probability law hence the same function gX or hX but with a time-dependent
parameter. Normalized solutions to the functional equations in (6.5.7) are the exponential functions

6.6 Random walk as an autoregressive process

gX u e t u
1
2

2 2

or

hX u e t u ,
1
2

2 2

353

6:5:8

which the reader will recognize immediately as the mgf and cf of a Gaussian
distribution. Bachelier made no use of generating functions in his thesis.
The probability density (6.5.5) does not represent a stationary stochastic process
since the variance, 2t 2Dt
increases
with time. The root-mean-square spread in
p

price is then given by t 2Dt in analogy to the diffusion of molecules in a fluid or


meandering of a pollen grain on the surface of water, such as described by Einstein
in papers on Brownian motion published several years later.13 The analogy with
Fouriers theory of heat, as remarked by Poincare, is readily established by finding
the differential equation for which (6.5.5) is a solution. The equation can be deduced
systematically from first principles (which will be done in Chapter 10), or, having the
solution (6.5.5), one can simply take spatial and temporal derivatives and find that
they satisfy the heat-flow equation familiar to physicists
px, t
2 px, t
:
D
t
x2

6:5:9

6.6 Random walk as an autoregressive process


The mathematical approach I took to modeling stock price fluctuations, which led to
simulated time histories like that of the Grange Corporation in Figure 6.3, was
different from that of Bachelier. As one of the simplest of a class of discrete (i.e.
finite-difference) equations describing a so-called autoregressive process, the model
embraced a wider range of possibilities against which to compare stock price variations, as well as provided (at least in my opinion) a more physically transparent
explanation of the approximate behavior of stock prices.
Let us designate by xt the stock closing price at time (i.e. day) t and assume that
this price is influenced primarily by the price of the day before xt1 as well as by a
random perturbation t for which there is no causal or deterministic description. The
perturbation is taken to be a random variable of mean hti 0 and variance h2t i 2
and therefore covariance
ht t0 i 2 tt0

6:6:1

where tt0 is the Kronecker delta function symbolizing no correlation between perturbations at different times.


If the perturbation is a Gaussian random variable t N 0, 2 , we can obtain
results equivalent to Bachelier, but it is not necessary to make such a choice at this
13

A. Einstein, (a) On the movement of small particles suspended in stationary liquids required by the molecular-kinetic
theory of heat, Annalen der Physik 17 (1905) 549560; (b) On the theory of Brownian movement, Annalen der Physik
19 (1906) 371381. Einsteins papers on Brownian motion are collected in the book Albert Einstein Investigations on the
Theory of the Brownian Movement, Eds. R. Furth and A. D. Cowper (Dover, New York, 1956).

354

Doing the numbers nuclear physics and the stock market

point. Indeed, one of the objectives of the exercise I undertook was to determine what
kind of random shock adequately describes the random walk of stock prices. My
model, of a type labeled AR(1) for autoregressive process of order 1, was defined by
the master equation
xt xt1 t

6:6:2

where the parameter gauges the influence of the past lag 1 day on the present.
For the model to be useful when theory is matched to data, one must find that jj  1,
otherwise Eq. (6.6.2) results in run-away solutions that diverge exponentially with
time. Once Eq. (6.6.2) is solved, can be estimated from a time series in various ways,
and the value is highly informative in regard to the nature of the stochastic process.
I will describe shortly what resulted for the rather different looking CREF, AAPL,
and GRNG time series.
Equation (6.6.2) is an example of a Markov process, i.e. a stochastic process in
which the future depends only on the present, a characteristic experimentally tested
in the decay of radioactive nuclei.14 A more formal way of expressing this point is to
state that the conditional probability of obtaining the present state xt given all the
past values of the random variable X
PrXt xt jfxt0 g for all t0 < t PrXt xt jxt1

6:6:3

equals the probability of the present variate conditioned on only the most recent past
variate i.e. with time lag 1. In a more general autoregressive model AR(n)
xt 1 xt1 2 xt2 3 xt3 . . . n xtn t ,

6:6:4

the influence on the present reaches further into the past and there are more parameters to be determined from the available time series data.
Time series analysis by means of equations of the autoregressive type, as well
as other defined types that go by names (ARMA, ARIMA, ARCH, GARCH, etc.)
that sound either like a government agency or a person choking have been widely
investigated15 for their utility as forecasting tools. From the preceding section, one
would not expect stock price forecasts to provide useful information. Actually, the
model (6.6.2) does tell us something immediately: the mean price expected for
tomorrow is times the price today. We shall see that 1 to within statistical
uncertainty.
Equation (6.6.2) can be solved formally by writing a column of time-lagged
versions of the equation, each row multiplied by ,

14
15

M. P. Silverman and W. Strange, Experimental tests for randomness of quantum decay examined as a Markov
process, Physics Letters A 272 (2000) 19.
G. E. P. Box, G. M. Jenkins, and G. C. Reinsel, Time Series Analysis: Forecasting and Control (Prentice-Hall, New
York, 1994).

6.6 Random walk as an autoregressive process

xt
 xt1
xt1  2 xt2
2 xt2  3 xt3
3 xt3  4 xt4
..
.

t
t1
2 t2
3 t3

355

6:6:5

k xtk  k1 xtk1 k tk


..
.
and then summing the rows. All terms on the left side except the first and last (defined
to be x0 0) drop out, and one obtains either of the following two expressions
(depending on index labeling)
xt

t
X

tk t

k1

t1
X

k tk :

6:6:6

k0

Use of the so-called backward-shift operator B defined by the operation


Bt t1

6:6:7

transforms the linear combination of perturbations in (6.6.6) into a summable


geometric series
xt

t1
X

Bk t

k0

1  Bt
t  Bt
1  B

6:6:8

of the operator B acting on the present perturbation. The generating function (B)
will prove very useful shortly.
To model the time series of stocks, such as those shown in Figures 6.16.3, I chose
the perturbations to be independent, identically distributed (iid) normal variates of
mean 0 and variance 2 . Recall that one of the properties of the normal distribution is
its strict stability a term signifying that a linear combination of iid normal variates
also results in a normal variate according to the relations
n
X

n

X


ai N i i , 2i
N i ai i , a2i 2i N, 2

i1

n
X
i1

i1

ai i

n
X

6:6:9
a2i 2i :

i1

Thus, the linear combination in (6.6.6) can be collapsed to a single normal variate




1  2t
2t
2
xt N 0, 2t
6:6:10
1  2
of mean 0 and time-dependent variance. Note that the form of the variance in
(6.6.10), which follows rigorously from (6.6.9), is also obtained simply from the
operator expression (B) in (6.6.8) by replacing the backshift operator B with .

356

Doing the numbers nuclear physics and the stock market

Strict stability is not a general property of statistical distributions.16 For example,


we have seen that the sum of two iid Poisson or binomial distributions also yields a
Poisson or binomial distribution, but the difference does not. The sum of two iid
uniform distributions yields a triangular distribution, as was shown in Chapter 1.
Another strictly stable distribution, however, which we have encountered various
times, is the Cauchy distribution. In principle, it is easy to ascertain whether a
distribution is strictly stable or not: If the sum of two iid variates yields a variate of
the same kind, then the product of the associated mgfs or cfs yields an mgf or cf of the
same functional form up to location and scale parameters. The family of stable
distributions is known generally as Levy distributions for the French mathematician
Paul Levy. The general Levy distribution is defined by its characteristic function, as it
is not possible to write a closed-form expression for the probability density. The
significance of stability in the analysis of stock prices is that, although stock prices
may randomly walk in time, one does not expect the functional form of the distribution of the relative change in price to change.
The time-dependent variance in (6.6.10) does not look like the variance derived by
Bachelier, whose method of analysis did not allow for the extra degree of freedom
represented by the parameter . However, in the limit that approaches 1


 2
1  2t
2
Lim t Lim
2 t,
6:6:11
!1
! 1 1  2
one recovers (by use of LHopitals rule) the Bachelier model in which price variance
increases linearly with the number of time units or, equivalently, the root-meansquare price (taken about the mean) increases with the square root of the number
of time units. This is precisely the behavior expected for a random walk. The value
1 in an AR(1) process defines what physicists refer to as Brownian motion or
Brownian noise.
Consider next the theoretical covariance k hxtxtki (k 0, 1, 2. . .) where time
extends infinitely into the past and future although, of course, a sampled time series
has a definite beginning and end. Given the infinite extent of the hypothetical parent
series, assumed stationary, the covariance function has the symmetry
k  hxt xtk i hxtk xt i hxt xtk i k

6:6:12

where the second equality follows from time-translation invariance and the third
equality merely states that the expectation value does not depend on the order of the
variates in the brackets. Multiplying the two sides of the master equation (6.6.2) by
xtk and then taking the expectation leads to the relation
hxt xtk i hxt1 xtk i ht xtk i
16

k k1 ht xtk i:

6:6:13

A stable distribution is one characterized by the linear relation a1X1 a2X2 a3X a4 where the as are constant
coefficients and the Xs are variates of the same kind (e.g. Gaussian). For a strictly stable distribution a4 0.

6.6 Random walk as an autoregressive process

357

which reduces to the following,


k k1 k
1
0 hx2t i 2t 1 ht xt i 1 2 :

6:6:14

For lag k
1, the expectation htxtki vanishes because the random perturbation t
occurs later than the variate xtk and therefore cannot influence it. For lag k 0, one
makes use of the master equation (6.6.2) and perturbation covariance (6.6.1) to find
that the expectation
ht xt i ht xt1 t i h2t i 2 ,

6:6:15

reduces to the perturbation variance, as expressed in the last line of (6.6.14).


Dividing both sides of the equations in (6.6.14) by 0 leads to an equation
connecting the autocorrelation coefficients
k  k =0 k1 ) k k k 0, 1, 2 . . .

6:6:16

that is readily solved by iteration, starting from 1 0 with 0 1, to yield the


exponential solution shown. The set of relations for autocorrelation coefficients in a
general AR(n) model (6.6.4),
9
1 1 0 2 1 . . . n jn1j
>
>
>
2 1 1 2 0 3 1 . . . n jn2j >
n
=
X
)

j jjkj
..
k60
>
>
.
j1
>
>
;
n 1 jn1j 2 jn2j . . . n 0

6:6:17

known as the YuleWalker equations, affords one way of estimating the model
parameters from the sample autocorrelation coefficients of an observed time series.
(The coefficient 0, although equal to 1 by construction, is shown explicitly in (6.6.17)
to maintain the pattern of indices.)
From the closed-form solution of the AR(1) process in (6.6.6) the power spectrum
S() at frequency c

0, where c 1/2t is the cut-off frequency, can be deduced
in a simple way. As a matter of notation, however, note that the dimensionless
product t falls in the range 12
t
0. Because t 1, I will omit writing it in
the ensuing mathematical expressions, whereupon a dimensionless frequency (actually a phase) within the above range will be designated by . The simple procedure for
obtaining the power spectrum then consists of replacing the backward shift operator
B in (B) by the phase factor e2i, whereupon one finds
S 2 2 je2 i j2 :

6:6:18

An explanation of why this works is left to an appendix.


t
Applying (6.6.18) to the AR(1) generating function B 1B
1B leads to the
power spectrum

358

Doing the numbers nuclear physics and the stock market


 e2 i 2t 2

!
2
i
t!
1  e
!1

1
2 2

2
,
1  cos 2

6:6:19

which is stationary in the asymptotic limit, since it is assumed that < 1, even if only
by an infinitesimal amount. For frequencies well below cut-off, a Taylor series
expansion of the denominator results in the relation
S /

1
,
2

6:6:20

which reproduces the inverse-square dependence of Brownian noise. Thus, to good


approximation a plot of ln S() against ln should be linear with slope 2. Recall that
the corresponding plots of the CREF, AAPL, and GRNG power spectra in Figures
6.16.3 had slopes (~ 1.8) close to this predicted value. The magnitude of this slope
is an indicator of the predictability of a time series.
Let us now examine the time series of stock prices in light of the foregoing model.
The solution (6.6.10) contains two unknown parameters: the lag parameter and
shock variance 2 . Since the first difference
wt xt  xt1 t
 2
is a normal variate N 0, with pdf


1
2
2
p wt j 2 p ewt =2 ,
2
2

6:6:21

6:6:22

the log-likelihood function of the sampled time series fxt t 1. . .Ng takes the form
L  ln L 

N

N
1 X
ln 2  2
xt  xt1 2 constant,
2
2 t2

6:6:23

where the constant is unimportant. The parameters obtained by solving the


equations
L= 0

L= 2 0

6:6:24

that maximize L are readily found to be given by the expressions


N
X

xt xt1

t1
N
X

^ 2
x2t1

N 1
1X
^ t1 2 :
xt  x
N t1

6:6:25

t1

The corresponding covariance matrix derived from the mixed second derivatives of
the log-likelihood is diagonal

359

6.6 Random walk as an autoregressive process

2 L
2

B
B
C B
B 2 L
@ 
2

11
2 L
 C
2 C
C
2 L C
A
 2
2

var
0

!
0
 ,
var 2

6:6:26

and yields the variances


var

2
N 2t

 2 4
var 2
N

6:6:27

with zero covariance as quantified dimensionlessly by the correlation coefficient


C12
 12 p :
C11 C22

6:6:28

Actually, Eq. (6.6.23) is an approximation to the exact log-likelihood function


because it is conditioned on the first value x1 of the time series. More generally, had
we been seeking the n lag parameters of an AR(n) model, the expression corresponding to (6.6.23) would be conditioned on the first n values x1, x2, . . . xn of the time
series. Thus, the solution we arrived at in expressions (6.6.25) and (6.6.26) was
obtained from the conditional likelihood function. This solution is equivalent to a
least squares approximation to the maximum likelihood method. Ordinarily, one
would not expect the statistics of a long time series to be significantly affected, if at
all, by the value of the first element (or first few elements) in the series, and therefore
the use of the conditional likelihood function should be entirely satisfactory.
A difficulty can arise, however, if the parameter is very close to the limit beyond
which run-away solutions arise, in which case one should use the exact unconditional
likelihood function. However, maximization of the unconditional likelihood function
usually leads to coupled nonlinear equations for the parameters, which must then be
solved by numerical methods. A discussion of the exact likelihood function and its
application to the present problem is left to an appendix, except to note here that the
resulting numerical values that I obtained for the parameters agreed closely with
those estimated from (6.6.25) and (6.6.27).
Table 6.2 summarizes the (exact) maximum likelihood estimates of the CREF,
AAPL, and GRNG time series parameters, for both the long term (4096 days) and
short term (512 days), obtained from both the original and detrended data. The
values in Table 6.2 reveal in each case a Brownian stochastic process ( 1) of
comparable shock variance for the same time period. The AAPL and CREF time
series parameters and associated uncertainties were nearly identical for the original
and detrended series. For the GRNG time series illustrated in Figure 6.3 the estimated value of was very close to the parameter 2 supplied to the random
number generator.

360

Table 6.2

Doing the numbers nuclear physics and the stock market

Maximum likelihood estimate of AR(1) parameters


Period N

AAPL
Lag parameter
Shock parameter
Correlation

Original series

Detrended series

1.001  3.5(4)
1.986  0.087
2.9(3)

1.001  6.7(4)
1.988  0.087
6.7(3)

1.003  8.2(4)
3.640  0.829
9.4(3)

0.966  0.012
3.630  0.825
0.011

1.000  1.6(4)
2.064  0.094
8.1(3)

0.998  9.5(4)
2.064  0.094
5.0(3)

1.002  7.7(4)
3.339  0.697
0.011

0.974  0.010
3.343  0.699
0.012

1.000  3.3(4)
1.981  0.087
8.1(3)

1.001  5.6(4)
1.982  0.087
7.8(3)

1.002  1.3(3)
1.923  0.23
0.020

0.999  5.5(3)
1.919  0.23
0.017

4096

512
Lag parameter
Shock parameter
Correlation
CREF
Lag parameter
Shock parameter
Correlation

4096

512
Lag parameter
Shock parameter
Correlation
GRNG
Lag parameter
Shock parameter
Correlation

4096

512
Lag parameter
Shock parameter
Correlation

Since the lag parameter is equal to 1 within statistical uncertainty, we should


expect the sample variance
s2Xt

t
t
1X
1X
xn  xn 2 with xt
xn
t n1
t n1

6:6:29

of the time series to increase approximately linearly with t according to (6.6.11), and
the sample variance
s2Wt

t
t
1X
1X
wn  wn 2 where wt
wn
t n2
t n2

6:6:30

of the first-difference series fwn yn  yn1} (n 2. . .t) to remain largely constant.


The variance of the (detrended) AAPL time series, shown in Figure 6.8, illustrates
this feature.

361

6.6 Random walk as an autoregressive process


2000

1500

AAPL Variance

(a)
1000

500

x 100

0
500

500

1000

1500

2000

2500

(b)
3000

3500

4000

Time (d)

Fig. 6.8 Time variation of the variance of the AAPL (a) detrended time series in Figure 6.1
(solid) and corresponding line of regression (dashed line); (b) first-difference of the detrended
time series multiplied by 100 for visibility (solid). The variance of pure Brownian noise
increases linearly in time, whereas that of white noise is constant.

In short, taking account of all the comparisons made so far, the CREF and AAPL
time series both of which typify other stock time records I have examined appear
to derive from stochastic processes largely characterizable as a Gaussian random
walk.
If you are wondering whether practically useful departures from a random-walk
process might have been discerned by analyzing the stock time series with a AR(n)
model with n > 1 or some other more complicated model that permitted a deeper slice
of the past to influence the present, the answer is almost assuredly no. The basic
strategy of deciding which, if any, of many possible linear models to apply to a nonstationary time series consists of transforming the series and examining the resulting
difference series until one attains a difference series whose autocorrelation function
resembles that of white noise. One then reverses direction putting all the components
together (integrating instead of differencing) to arrive at the identity of the
process characterizing the original time series. It is by such means that the strange
names of the time series were derived (e.g. ARIMA Autoregressive Integrated
Moving Average process).
The salient feature to all the stock market time series I have examined is the
immediate arrival at white noise with the first difference. There would be little point,
therefore, to employ a model more complicated than AR(1) if one were simply a
typical investor saving for retirement rather than a speculator or hedge-fund manager faced with the statistical uncertainties of some futures contract or other

Doing the numbers nuclear physics and the stock market

362

CREF STOCK Daily Price Change


0.5

Cau(,)

Frequency per Bin

0.4

0.3

0.2

N(,2)
0.1

0
6

Class
Fig. 6.9 Histogram of first differences of the CREF STOCK time series of Fig 6.1 (black bars)
superposed by a Cauchy distribution (solid) Cau(,) Cau(0.2,0.76) with visually fit
location and scale parameters, and a normal distribution (dashed) Nw, var w
N5:19 103 , 4:26 determined by the sample mean and variance. Excluded from the w 0
bin of the histogram are contributions in which the price change was null because of market
closure.

derivative product. Over-modeling a time series does not provide new or more precise
information.
Nevertheless, there is one striking difference between actual stock time series and
Gaussian-simulated time series when one looks more closely at the distribution of
first differences of the raw or detrended time series as illustrated by the CREF data in
Figure 6.9. Although the GRNG first-difference series is well characterized by a
Gaussian distribution by virtue of its construction, histograms of CREF and AAPL
first differences have much narrower peaks and fatter tails that more closely resemble
a Cauchy distribution (although not to an extent that passes a chi-square test). The
statistical implication of fat tails, in comparison to the exponentially decreasing tails
of a Gaussian distribution, is greater volatility i.e. higher probability of the
occurrence of outlying or extreme events.
This greater volatility is apparent in the top panel of Figures 6.4 and 6.5 in the
rare, but not negligible, occurrence of first-difference excursions extending beyond
the region bounded by 5 standard deviations. For a random variable strictly
governed by a normal distribution, the probability of attaining by pure chance a
value of at least five standard deviations beyond the mean is about 6 107. Thus the
mean number of occurrences of such events in a time period of 4096 days (with one

363

6.6 Random walk as an autoregressive process

Displacement

Gaussian Random Walk


100

N(0,22)
0
100
200
0

500

1000

1500

2000

2500

3000

3500

4000

3000

3500

4000

Steps
Cauchy Random Walk
Displacement

5000

Cau(0,2)
0
5000
0

500

1000

1500

2000

2500

Steps
Fig. 6.10 Comparison of one-dimensional Gaussian (upper panel) and Cauchy (lower panel)
random walks (RW) with location parameter 0 and width parameter 2. The N(0,22) path is
characterized by numerous small fluctuations; the Cau(0,2) path shows a relatively smooth
evolution punctuated intermittently by large fluctuations.

trial per day) would be 0.0023. Nevertheless, the autocorrelation of the first differences, shown in the bottom panel of Figures 6.4 and 6.5 is well characterized by white
noise and both original time series look very much more like a Gaussian random
walk than a Cauchy random walk, as illustrated in Figure 6.10.
The upper panel in Figure 6.10 shows the cumulative displacements in 4096 steps
of a normal variate N(0,4) i.e. with width parameter 2 simulated by a
Gaussian RNG. The lower panel shows the corresponding displacements of a Cauchy variate Cau(0,2) with width parameter 2 simulated by a Cauchy RNG. In
contrast to the Gaussian random walk, which appears to fluctuate rapidly on a fine
scale but proceed more or less smoothly on a coarse scale, the Cauchy random walk
appears to proceed with little change on a fine scale for long intervals and then
undergo large changes in scale suddenly. Stock market time series are approximate
Levy processes that fall somewhere between Gaussian and Cauchy random walks. In
appearance they look much more like the former than the latter, but the higher-thannormal incidence of stock market melt-downs serves as a graphic reminder of the
impact of extreme events residing in those fat tails.

364

Doing the numbers nuclear physics and the stock market

6.7 Stocks go UP and UP and DOWN and DOWN


Nothing that is can pause or stay;
The moon will wax, the moon will wane,
The mist and cloud will turn to rain,
The rain to mist and cloud again,
Tomorrow be today.
Henry Wadsworth Longfellow, Kerimos

We have seen that there is no information (in the Shannon sense) in a time series or
record of past performance of stock prices. Nevertheless, to know that stock price
movements are characterized reasonably well by a Gaussian random walk enables
one to understand how that movement exemplifies one of the most astonishing,
counter-intuitive properties of a random walk a behavior responsible for much
delusion in regard to long-term returns of the stock market.
On a number of occasions I have asked diverse groups of people audiences at my
seminars, students in my classes, associates at work or at other activities what they
thought would be the cumulative gain (positive or negative) of tossing a fair coin a
few hundred times and receiving a dollar for every head and paying a dollar for every
tail. The reply almost invariably was to suggest that the net accumulation would be
close to zero since, after all, if the coin were unbiased, then there was a 5050 chance
of getting either a head or tail. The cumulative gain of a coin toss with a fair coin is
equivalent to the net displacement of a Bernoulli random walk of equal step size (let
us say 1 unit) with probability p to step right equal to the probability q to step left.
Even physicists who recognize this equivalence may not be fully aware of how
awesomely wrong the usual reply is.17
The four panels in Figure 6.11 show four realizations of a one-dimensional 1000step random walk simulated by a N(0, 1) Gaussian RNG. A Gaussian random walk is
different from a Bernoulli random walk in that the step size is continuous over an
infinite range with a probability determined by a Gaussian distribution. Nevertheless,
there are remarkable connections between the two types of random walk. Note first
how much time the random walker spends either above the zero ordinate (breakeven line) or below it. In the top panel, for example, the random walker remained at
or below the breakeven line for 978 trials out of 1000 that is, for 97.8% of the time.
What is the theoretical probability that a Gaussian random walk will remain nonpositive (or, equivalently, non-negative) for at least 97.8% of the time?
To answer that question generally, represent the displacement of the nth step by
Xn Nn(0, 2) each step being independent of preceding steps and the cumulative
n
X
Xj N0, n 2 . Then
displacement at the conclusion of the nth step by Sn
j1

17

See M. P. Silverman, Computers, coins, and quanta: unexpected outcomes of random events in my book A Universe
of Atoms, An Atom in the Universe (Springer, New York, 2002) 279324.

6.7 Stocks go UP and UP . . . and DOWN and DOWN

365

1D Gaussian Random Walk


20
0
20
40
60
0

100

200

300

400

500

600

700

800

900

1000

100

200

300

400

500

600

700

800

900

1000

100

200

300

400

500

600

700

800

900

1000

100

200

300

400

500

600

700

800

900

1000

40

Cumulative Displacement

20
0
20

40
20
0
20

40
20
0
20
40

Number of Steps
Fig. 6.11 Four realizations of the cumulative displacement of a one-dimensional Gaussian
random walk of 1000 steps simulated by a N (0,1) Gaussian RNG. Note the large fraction of
each trajectory spent either above or below the origin.

366

Doing the numbers nuclear physics and the stock market

Pr(S1 > 0, S2 > 0. . .Sn > 0) is the probability that all n steps lie above the breakeven
line. For n 1 we obtain the expected result
2 1=2

Pr S1 > 0 Pr X1 > 0 2

x21 =2 2

dx1 2

1=2

1
2
ez1 =2 dz1 ,
2

6:7:1
where the transformation z x/ was made to make the integral dimensionless.
For n 2
Pr S1 > 0, S2 > 0 Pr X1 X2 > 0, X1 > 0 Pr X2 >  X1 , X1 > 0

2
2
22=2 ez1 =2 dz1 ez2 =2 dz2 :
6:7:2
z1

By induction, the probability for the nth step


 n=2

Pr S1 > 0, . . . Sn > 0 2

z21 =2

dz1

z22 =2

z1

dz2

e
z1  z2

z23 =2

dz2 . . .


ezn =2 dzn
2

n1
X
zj
j1

2n=2

n
Y
k1


kX
1

e zk =2 dzk ,
2

6:7:3

zj

j1

results in a formidably looking multiple integral of Gaussian functions with constraints on the lower limits.
Although it may seem incredible, the value of the integral in (6.7.3) is given by a
very simple combinatorial formula
 
1 2n
1
6:7:4
Pr S1 > 0, S2 > 0 . . . Sn > 0 2n
! p n 1, 2 . . .
n
n1
n
2
derived for a one-dimensional Bernoulli random walk of 2n steps of equal size and
probability p q.18 The asymptotic (large n) reduction to the right of the arrow is
obtained by applying Stirlings approximation (6.3.2), this time including the squareroot term which is often omitted in statistical thermodynamics.
The probability of a non-negative cumulative displacement for k steps out of n and
a non-positive cumulative displacement for the remaining n  k steps is then the
product of two factors of the form of (6.7.4)
18

W. Feller, An Introduction to Probability Theory and Its Applications Vol. 1 (Wiley, New York, 1950) 7475. Feller
derives the formula for a Bernoulli random walk; he does not discuss at all the Gaussian random walk.

6.7 Stocks go UP and UP . . . and DOWN and DOWN

367

Binomial Random Walk

0.18

Probability

0.16
0.14
0.12
0.1
0.08
0.06
0

10

Number of Steps to the Right


Fig. 6.12 Probability of cumulative displacement in a binomial random walk Bin10, 12 as
calculated exactly (circles) and approximately (diamonds) by Eq. (6.7.5) and its asymptotic
limiting expression. Lines connecting circles are to guide the eye.

1
Pr kjn 2n
2

2k
k



2n  2k
nk

1
! p :
n1 kn  k

6:7:5

Apart from the end points (k 0, n) where it becomes singular, the asymptotic
expression is quite good even for relatively low n as shown in Figure 6.12 for n 10
steps. Although powerful mathematical software like Maple or Mathematica permits
one to evaluate (6.7.5) numerically for arbitrarily large n, the asymptotic formula
shows more transparently that a plot of Pr(kjn) as a function of k is concave upward
with a minimum at k n/2. In other words, the probability that the random walker
remains above (or below) the breakeven line for most of the n steps is much higher
than the probability of it spending time nearly equally in both domains. This
outcome may seem obvious when one thinks of the random walker as a molecule
diffusing from a bottle of perfume, but it is much less obvious when the image in
mind is the cumulative gain of tossing a fair coin.
The probability that the random walker remains in the positive domain for at least
k out of n time units is obtained by summing (6.7.5) over the range from k to n


n 
1 X
2j
2n  2j
Pr K
kjn 2n
:
6:7:6
j
nj
2 jk
One can approximate (6.7.6) by integrating the asymptotic expression in (6.7.5) to
obtain a form of the so-called arc sine law

Doing the numbers nuclear physics and the stock market

368

1
Pr K
kjn 1 

k=n

r!



dx
1
2
2k
2
k
p 1  sin 1
,
1
1 sin 1
2

n
x1  x
6:7:7

where the last equality in (6.7.7) follows from the trigonometric identity cos 2
1  2sin2 . The answer to the question of how probable it is for a Gaussian random
walker to remain in the positive domain for 978 steps out of 1000 is then close to
9.64% by (6.7.6) or (approximately) 9.48% by the arc-sine law (6.7.7). In other
words, what may have seemed initially to be an extremely improbable event can
occur by pure chance in approximately 1 out of 10 time records.
The appearance, therefore, of a more or less upward trending movement of a stock
price over a substantial period of time is no reason necessarily to believe that it is the
consequence of anything other than pure chance i.e. the unpredictable outcome of a
myriad of uncontrollable influences. The trend will eventually reverse and a correspondingly persistent downward trending price movement will ensue. Such patterns of
persistence and reversal can lead to waves in a price track record like those
appearing in the raw or detrended CREF time series in Figure 6.1. The undulations
give the illusion of a predictable cyclical movement but that is only an illusion, as
established, for example, by examination of the power spectrum. There is no deterministic periodicity only the random outcomes of a stochastic process.
To justify the preceding assertion, consider the probability that a Gaussian
random walker, which began at the origin (location 0 at time 0) and remained in
the positive domain for n  1 steps then crosses the zero line into the negative domain
for the first time at the nth step. We are seeking the probability of first passage
through the origin
Pr S1
0, . . . Sn1
0, Sn < 0

n=2

z21 =2

dz1

z22 =2

dz2
z1 z2

z1

z23 =2

dz2 . . .

n2
X

j1

z2n =2

e


n1
X
zj

dzn1

ezn =2 dzn
2


zj

j1

n=2

n1
X
 zj

n1
Y
k1


j1

z2k =2

k1
X

dzk

ezn =2 dzn
2

6:7:8

zj

j1

which differs critically from the expression in (6.7.3) by the limits on variable zn i.e.
n
X
on the rightmost integral where the condition Sn
Xj < 0 poses the constraint
j1

6.7 Stocks go UP and UP . . . and DOWN and DOWN

n1
X

369

Xj > Xn > . The relation in (6.7.8) involves another horrendous suite of

j1

Gaussian integrals, but it can be reduced to a simple closed-form expression, again


connected to a Bernoulli random walk, by rewriting the last integral in the form of
those that preceded it:


n1
X

zj

j1

ezn =2 dzn
2

ezn =2 dzn 
2

p
2

n1
X

ezn =2 dzn :
2

6:7:9

zj

j1

Since the first term on the right side is just the Gaussian normalization constant,
substitution of (6.7.9) into (6.7.8) reduces the probability of first passage through the
origin into the difference of two probabilities of non-negative displacement, the first
of n  1 steps and the second of n steps (as one might have anticipated)
Pr S1
0, . . . Sn1
0, Sn < 0 Pr S1
0, . . . Sn1
0
Pr S1
0, . . . Sn1
0, Sn
0:

6:7:10

Substitution of the combinatorial expressions for a Bernoulli random walk leads to


  
 
1
2n  2
2n
Pr S1
0, . . . Sn1
0, Sn < 0 2n 4

n 2, 3, . . .:
n1
n
2
6:7:11
The probability that the first passage of a Gaussian random walk occurs somewhere
within the first n steps is then the sum



  
 
  
n
X
1
1
n1
2n 2
2n
2k
2k  2
1

1  2n 4

4
F n
2k
n1
n
k
k1
2
2
k2 2
6:7:12
which takes the asymptotic form
F1 n !
n1



1
2
1  p :
2
n

6:7:13

As the number of steps becomes infinitely large, the probability approaches 50% that
the random walker will cross from the positive into the negative domain (or from the
negative into the positive domain). Looking at the waves in the detrended CREF
time record, one can estimate a period of about 1100 days between the point where
the record initially crossed the breakeven axis into the positive domain and the first
return crossing into the negative domain. From either (6.7.12) or (6.7.13) the probability of this event is 48.3%. On average, the probability is about 35% that a

Doing the numbers nuclear physics and the stock market

370

Gaussian random walker will make a first passage through the origin within 15 time
units i.e. about two weeks in terms of a record of daily stock prices. Note that for a
stationary time series there is nothing special about the point of origin. Any moment
in a long time series can be taken to be the origin of time, and the subsequent stock
prices measured with respect to that starting value.
At first thought, it may seem contradictory that (a) the probability of first passage
through the origin (which in the context of stock prices means loss of any accumulated gain) approaches 50% as the time period gets longer and (b) a Gaussian
random walker is far more likely to spend most of its time either in the positive
domain or negative domain and return only rarely to the breakeven point. However,
there is no inconsistency. The explanation is that the probability of second, third . . .
and all higher passages through the origin decreases rapidly. We can see how this
comes about by examining a relatively simple relation derived for a one-dimensional
Bernoulli random walk for which cumulative displacements from the origin are
integer multiples of a fixed step size, but which nonetheless leads to a general
property also exhibited by a one-dimensional Gaussian random walk. The probability that the random walker returns to the origin k times in a period of 2n time units
(which is expressed as such because it must be an even integer) is19


1
2n  k
pk

:
6:7:14
n
n
22nk
The probability function (6.7.14) satisfies the completeness relation when the
variable k is summed over the range (0, n) since a Bernoulli random walker can make
at most n returns to the origin in 2n steps. (A minimum of two steps is required to
leave the origin and return.)
The mean number hRi and mean square number hR2i of returns to the origin can
be calculated exactly from (6.7.14)
r


n
X
p
k
2 n 32
n
2n  k
p
hRi
 1:128 n
6:7:15
 1 ! 2
2nk
n
n1
n 1

k0 2
hR2 i



n
X
k2
6 n 32
2n  k
2n 3  p
! 2n
2nk
n
n 1 n1
k0 2

6:7:16

and leads to the asymptotic expressions to the right of the arrows upon substitution
of the asymptotic formula (Stirlings approximation) for the gamma functions. There
are several unusual features to this distribution. Note first that hRi is proportional to
the square root of the time period. Since the return of a Bernoulli random walker to
the origin is tantamount to a tie score in the coin-tossing game, one might have
supposed that the number of such ties would be proportional to the duration of the

19

W. Feller, op.cit. p. 82.

6.7 Stocks go UP and UP . . . and DOWN and DOWN

371

game i.e. if the playing time were twice as long, the number of ties would be
doubled. But that is not the case. The number of ties increases only as the square
root of the playing time. This curious fluctuation behavior virtually ensures that the
random walker will make long excursions in either the positive or negative domain
and return only relatively rarely to the origin. However, the random walker does
return eventually with a 100% probability. Second, the standard deviation


q s
p
2
2
2
R hR i  hRi 2 1 
n  0:853 n

6:7:17

is also proportional to the square root of the time period, so that the fluctuation in
number of returns to the origin is of the same order as the mean a property we
encountered before (in Chapter 3) with the exponential distribution.
Because the distribution (6.7.14) is not Gaussian, one must be careful to interpret
correctly the statistical implications of the standard deviation (6.7.17). Since the
distribution p
is
one-sided,
the probability of an observation R falling between 0 and
hRi= R 1= =2  1 1:324 is
r hRi=
r 1:324
R

2
2
2
u2 =2
e
du
eu =2 du 0:814

6:7:18

and not 0.407 as in the case of a Gaussian distribution.


To apply the proceeding theory of return-to-the-origin of a Bernoulli random
walk to passage-through-the-origin of a Gaussian random walk the time of displacement n in the latter becomes 2n in the probability function of the former, as in
relation (6.7.4). In a Bernoulli random walk, there is a nonzero probability of the
walker returning precisely to the origin, whereas step size varies continuously in a
Gaussian random walk and the probability of exact return to the origin is zero. Thus,
in a Bernoulli random walk a return to the origin results in a change in lead in half the
cases, in contrast to a Gaussian random walk in which passage through the origin
always results in a change in lead.
Consider, for illustration the detrended CREF time series. I (i.e. my computer)
counted 27 passages across the origin (y0 $37.30) in Figure 6.1 within a period of
4096 days. Equations (6.7.15) and (6.7.17) predict for these conditions an expected
value
hRi  R 72:2  54:6 ) f126:8
hRi
17:6g
consistent with the observed value of 27. Using (6.7.14), one finds that the observed
number of crossings should fall within the range jhRi  Rj with a probability of
68.7%. The important point is that the appearance of waves in the CREF time series
is completely consistent with what one would expect for a random process.

372

Doing the numbers nuclear physics and the stock market

6.8 What happened to the law of averages?


How can it be that in a fair game, as for example a coin toss with an unbiased coin,
one player can lead (or lag) for the preponderance of trials? Does this not violate the
law of averages? Actually, nothing has happened to the law of averages, or what is
more appropriately referred to as the law of large numbers in probability theory. The
problem is simply that the correct interpretation of the law is widely misunderstood.
Suppose you are gambling in a coin-toss game where you win $1 with probability
p if a head turns up and you lose $1 with probability q 1  p if a tail turns up. Your
expected gain at the end of N games is then hGi N( p  q). However, the difference
D G  hGi between your actual gain G (where a loss is a negative gain) and
p
expectation
hGi
fluctuates
with
a
root-mean-square
value

2
Npq that increases
D
p
with N . Thus, if the game is fair and you bet $100 at $1 per game, your capital at the
end of 100 games would lie between $90 and $110 with a probability of about 73% or
between $80 and $120 with a probability of about 96%. Details are left to an
appendix.
Many people think that the law of averages guarantees that the more games they
play the closer their gain G will be to their expectation hGi, but that idea is wrong.
Rather, the law states that as the number of games increases the closer will be the
fraction G/N to the fraction hGi/N although the gap as measured by D gets wider.
pIt
is the ratio (D/N) or (D/N) that tends toward 0 with increasing N (as 1= N ),
not D. Moreover, there is nothing in the law of averages to suggest a tendency for
the leads of two players to equalize within any specific game. The law implies only
that if a game is fair, then at the outset each player has equal chance of winning it,
and in the long run should win about half of the games.
In the context of the stock market, the law of averages tells you this: if you want to
be assured of a positive gain over the course of a lifetime of investing for retirement,
you will have to invest in stocks with a positive expectation. No company, of course,
can guarantee that.

6.9 Predicting the future


Its tough to make predictions, especially about the future, allegedly said Yogi
Berra, legendary catcher of the New York Yankees, to whom numerous malapropisms have been attributed. Predicting future stock prices when the past provides no
useful information makes that task even more difficult. It may seem like a pointless
endeavor, therefore, to try to do so, but bear in mind that apparently few people
who invest in the stock market would agree that stock movements behave randomly
(. . . otherwise why would they risk their retirement savings in the stock market?). In
any event, I was curious to see what came of it. Also, procedures for forecasting are
interesting in themselves and worth examining because they can be applied successfully when random noise in a time series does not constitute the entire signal.

6.9 Predicting the future

373

Suppose we know the elements of a time series fxt0 t0 1 . . . tg up through time


t Nt, and it is desired to predict future elements ~x t 1, 2, . . . where the lead
(antonym to lag) marks intervals in terms of a specified time unit t. (In this section
the unit will be 1 day.) We suppose further that the value ~x t1 is most significantly
influenced by the present value xt, the trend (or slope) t xt  xt1, and the
curvature 2t t  t1 xt  2xt1 xt2 , in which case the forecasting equation
takes the form
~x t xt xt  xt xt  2xt xt2

6:9:1

with three unknown constants , and . Less geometrically, but equivalently,


one can regard (6.9.1) simply as a series in sequential price differences truncated
at lag 2
~x t xt xt  xt xt  xt2 :
c0

c1

6:9:2

c2

Since the elements of the series on the right-hand side of (6.9.1) are known, the
predicted element ~x t1 of lead time units can be calculated once the constants have
been determined. In general, we would expect that the prediction is best for 1 and
gets progressively more uncertain the further one tries to forecast the future so we
will consider 1. To be able to predict the performance of stocks one day in
advance would be a marvelous boon to an average investor who knew how to do it.
We shall determine the constants by minimizing the mean square error of the
prediction


2p  ~x t1  xt1 2


 1xt xt  xt1  xt1  xt2  xt1  xt 2 ,

6:9:3

where the expression in the second line, obtained by substitution of (6.9.1), has been
expressly arranged to contain differences of pairs of elements. The expectation of the
square of such a difference, known as a variogram,20


k  xt  xtk 2 ,
6:9:4
is expressible as either a difference of covariance elements k hxt xtki or an integral
of the power spectrum S over a specified frequency range (1, 2)
k hx2t i hx2tk i  2hxt xtk i 20  k 20 1  k
2

4 S 1  cos 2kd:

6:9:5

20

The variogram is usually represented by the Greek letter gamma. However, to avoid confusion with either the gamma
function (x) or the covariance function k, I represent the variogram by a gamma with an overbar: x.

374

Doing the numbers nuclear physics and the stock market

The second line in (6.9.5) is obtained from the defining relation (6.9.4) in two steps by
(a) substitution of the Fourier transforms

xt

e2it d

xtk x*tk

* e2i tk d

6:9:6

and
(b) use of the identify
h *0 i S 0

6:9:7

which defines the power spectral amplitudes and recognizes that Fourier amplitudes
of different frequencies are uncorrelated. In the first line of (6.9.5) the assumption
was made that the time series is stationary so that hx2t i hx2tk i. For reasons
explained previously, one usually detrends i.e. transforms to zero overall mean
and slope a discrete, finite time series.
If the time series is continuous and of infinite extent, the integral in (6.9.5)
theoretically extends over the range (0, ), provided it converges. However, a finite
time series is necessarily band-limited and therefore the actual upper limit is the
cut-off frequency c 1/2t and the lower limit (fundamental) is the reciprocal of the
time period 0 1/Nt. For a long time series (N  1), one can take 0 0, provided
the integral converges. Upon substitution in (6.9.5) of the dimensionless variable
u the theoretical variogram of lag kt 2k/c becomes
4c
k
k

k
Su 1  cos u du:

6:9:8

2k=N

Determined empirically from the power spectrum of a discrete time series, the sample
estimate of the variogram is


N=2 
X
2j k
Xk 4
Sj 1  cos
,
N
j0

6:9:9

where the phase of the cosine is the product of harmonic j j/Nt and lag k t.
The subscript on X kdenotes the time series.
Evaluation of (6.9.3) with use of the relation
1
2

k 0  k

6:9:10

from (6.9.5) leads to the mean square error


2p  12 0  11 21  2 2 1 21  2
41  2 31  32 3 1

6:9:11

6.9 Predicting the future

375

expressed entirely in terms of the variance 0 hx2t i and variograms. The unknown
constants are determined by solving the set of equations
2p = 2p = 2p = 0:

6:9:12

However, before undertaking this task, we can save much work at the outset by
examining qualitatively the behavior of the variograms in two limiting cases characterizing the kinds of stochastic processes we know have arisen.
In the first limiting case, that of quasi-white noise, applicable to nuclear decay
and first differences of stock prices, the correlation coefficients k are all close to 0 for
k 6 0, and therefore the variograms k  20 . The mean square error then reduces to
2p =0  1 2 2 2 3 3

6:9:13

for which (6.9.12) leads to a set of homogeneous algebraic equations (i.e. with no
constant term) with unique solution 0. The forecast equation (6.9.1) is
then ~x t1 0. The past record provides no useful information; the process is
unpredictable.
In the opposite limiting case, that of quasi-Brownian noise, applicable to molecular diffusion and stock price movement, the correlation coefficients k are all close to
unity, i.e. 0 1, 1 1  , 2 1  2, and so on for some small quantity . Then
the variograms k  2k0 . Neglecting terms linear in yields a mean square error
2p =0   12 ,

6:9:14

which attains its minimum value for 1. If we set 1 in (6.9.11), the first two
terms vanish and the values of and that solve the second and third equations in
(6.9.12) are
1

3  2
1
2

22 212  213
:
241  2

6:9:15

Had curvature been neglected in the forecasting equation (6.9.1) then the slope
parameter minimizing the mean square error would be
0

2
 1:
21

6:9:16

The conditions for pre-selecting , which determine the nature of the solution, can
be given a more rigorous, general basis by examining quantitatively the properties of
the integral that defines the variogram in (6.9.8). It is often the case that the power
spectrum of a random process takes the form of a power law S / jj for some
exponent over an applicable frequency range. The exponent defines the character of
the stochastic process and can be estimated empirically by fitting a line of regression
to a plot of log S vs log , the slope of which is independent of the choice of

376

Doing the numbers nuclear physics and the stock market

logarithm base. Examination of the variogram of a continuous time series for


different values of has established that21
(a) over the range 1 > > 0 the process behaves like white noise and is essentially
unpredictable,
(b) over the range 3 > > 1 the process exhibits forms of Brownian noise (referred to
as fractal Brownian motion if is non-integral) with long-range correlations, and
(c) for > 3 the process gives rise to a smooth time history adequately describable by
simple extrapolation such as based on a Taylor series expansion.
The threshold value 1, which defines 1/f noise, also called pink noise or flicker
noise (from its initially observed occurrence in thermionic vacuum tubes), marks the
boundary between the predictable and unpredictable. For reasons not entirely
understood, 1/f noise seems to arise in a wide variety of random processes including
earthquakes, avalanches, ocean currents, heart rhythms, music, financial records,
and more.22
The range of pertinent to the time series of stock prices is 3 > > 1 for which
solution (6.9.15) or (6.9.16) applies. Since these solutions are expressible in terms of
ratios of k to 1, we need consider only the quantity
2 k

Ik,

1 k
2

1  cos 2u
du
u
1  cos 2u
du
u

!
1 !0
2 !

k1

6:9:17

which leads to theoretical prediction parameters


1 21 31  1  1

1 8 22  8 22 31


8 2  1

0 222  1
6:9:18

for Eq. (6.9.1). These parameters are plotted as a function of in Figure 6.13. From
the figure, one discerns a qualitative difference in predicted behavior depending on
whether is larger than 2 or less than 2. If the former pertains, then 0 and 1 are
positive, and a time series that was increasing in the (recent) past is predicted to
increase further in the (near) future a property referred to as persistence. If the
latter pertains, then 0 and 1 are negative, and a time series that was increasing is
now predicted to reverse and decrease a property termed anti-persistence.

21
22

S. Hergarten, Self-Organized Criticality in Earth Systems (Springer, Berlin, 2002) 6466.


E. Milotti, Linear processes that produce 1/f or flicker noise, Physical Review E 51 (1995) 30873103; 1/f noise:
a pedagogical review, arXiv:physics/0204033v1 [physics.class-ph] (12 April 2002).

377

6.9 Predicting the future

Prediction Coecients

0.5

1
0

0
0.5

1
1

1.5

2.5

Power Spectral Exponent


Fig. 6.13 Variation in prediction coefficients with power spectral exponent . Parameters 1
and 1 quantify contributions to the forecasting equation (6.9.1) due to trend and curvature of
the time series. Parameter 0 is the parameter resulting from trend, alone. All three coefficients
are null at spectral power 2, which represents a pure one-dimensional Gaussian
random walk.

The curvature parameter 1, which decreases from a positive value as a function of ,


acts in the opposite sense.
The value 2 of pure Brownian noise marks the threshold between persistence and anti-persistence. The significance of this threshold is seen in this highly
unintuitive property of a one-dimensional Bernoulli random walk (e.g. cumulative
gain in an unbiased coin-toss game): the probability of the random walker to
return to the origin an infinite number of times is 100% but with an infinite
mean recurrence time.
The power spectral exponent of the detrended CREF, AAPL, and GRNG time
series (as well as the exponents of other stock time series that I have examined) were
found to be close to 1.8, very near the threshold of value of a pure random walk.
(See Figures 6.16.3.) This would indicate that stock price movements and a simulated Gaussian random walk should be mildly anti-persistent i.e. long leads (or
lags) with eventual reversals. Empirical variogram ratios
J Xk Xk=X1

6:9:19

calculated from (6.9.9) are well reproduced theoretically for 1.8 as summarized in
Table 6.3 together with the estimated forecast parameters.
Prediction parameters 0, 1 and 1 all close to 0 mean that contributions from
trend alone or from trend and curvature vanish to any practically useful degree,
again signifying that the records of price changes were essentially white noise. Thus

378

Table 6.3

Doing the numbers nuclear physics and the stock market

Forecast parameters and variogram ratios

Time series

J(2)

J(3)

CREF
AAPL
GRNG

1.933
2.021
2.016

2.786
3.042
3.036

0.034
0.011
7.81(3)

0.0102
0.0103
0.0102

0.0410
3.753(4)
2.378(3)

I(2;1.8)
I(3;1.8)

2.118

3.062

the prediction equation (6.9.1) reduces simply to ~x t1 xt , re-confirming that the best
prediction for the price tomorrow is the price today.
A stochastic process with the property that the conditional expectation of the next
value, given the current and preceding values, is always the current value is known as
a martingale, a term originally referring to a betting strategy whereby the gambler
doubles his bet after every loss in the expectation of regaining the losses plus a profit
equal to the original bet. The betting strategy is unsound; the gambler has no longterm advantage over any other betting strategy, including the placement of bets of
random amounts.
The movement of stock prices in a stock market is a martingale. The expectation
of tomorrows price is todays price. When I wrote at the end of Section 6.1 that an
investor can expect to gain nothing from the stock market in the long run, I was not
making a snide comment, but summarizing accurately what a statistical analysis of
stock time series taught me.

6.10 Timing is everything


People who invest in the stock market probably do not believe that the price of stocks
moves randomly. They evidently are convinced that there is an underlying rationale
to how the market behaves, which a skilled financial advisor or manager can exploit
to make them a profit. Legions of financial advisors and managers clearly must
believe it too. From what I have read,23 those who believe stock performance can
be forecast divide into two principal schools of thought: the technical analysts (or
chartists) and the fundamental analysts. The former study stock performance
records to look for signs indicating the direction of future change. As narrated in
this chapter, that was precisely what I did by means of a battery of statistical tests
originally employed to investigate quantum processes for evidence of nonrandomness. I found no such signs. The latter group do not focus on past time
records, but attempt to predict the future value of a companys stock based on a
variety of economic criteria related to the companys prospects of growth. This
23

B. G. Malkiel, A Random Walk Down Wall Street (Norton, New York, 2003), First Edition 1973.

6.10 Timing is everything

379

strategy may seem to be a reasonable one, provided that such meaningful criteria can
actually be established and pertinent data acquired with the result that ensuing
assessments are found to be empirically (i.e. predictively) valid. Whether this is the
case or not is controversial.
Princeton University professor and former member of the Council of Economic
Advisors, Burton Malkiel challenged both schools of analysts by writing (page 24 in
the 1996 edition of his book) a blindfolded monkey throwing darts at a newspapers
financial pages could select a portfolio that would do just as well as one carefully
selected by the experts. The Wall Street Journal (WSJ) took up the challenge in
1988 with a contest to match the performance of four stocks selected by pros
against four stocks chosen by WSJ staff who threw darts at a stock table. Competitions, initiated at one-month intervals, ran initially for one month, later extended to
six months, at the end of which the price appreciation of stock picks in a given
contest were compared. In 1998 the WSJ published the results of the hundredth
dartboard contest.
Of the 100 competitions, according to the WSJ, experts beat dartboarders 61 to 39.
However, according to Malkiel and other academic researchers, the competition was
seriously flawed as a test of the dartboard hypothesis, and the outcomes were
erroneously interpreted. The most serious shortcoming was that the contests structured primarily for entertainment than for research were not double-blind. As the
gold standard of statistical testing of a product (e.g. new drug) or hypothesis, a
double-blind method is one in which neither the administrators nor subjects of a test
know the evolutionary development and results until the test is completed. Otherwise, human nature being what it is, personal biases, conscious or unintended, may
strongly influence the outcome. The WSJ, apparently ignorant of proper protocols,
published the experts stock picks and explanations of these selections at the start of
each contest, thereby biasing the subsequent investments of their many readers and
inflating the returns of the selected stocks. After the contests, the values of the expertselected stocks fell, whereas the dartboard-selected stocks continued to do well. All in
all, detailed analysis of the contests showed that the experts did not outperform either
the dartboarders or the market.24
There is a curious irony to the randomness of stock prices that reflects the
fundamental hypothesis of market behavior: the so-called efficient-market hypothesis. As interpreted by Malkiel
The efficient-market theory does not . . . state that stock prices move aimlessly and erratically
and are insensitive to changes in fundamental information. On the contrary, the reason prices
move in a random walk is just the opposite: The market is so efficientprices move so quickly
when new information does arisethat no one can consistently buy or sell quickly enough to

24

B. Liang, The Dartboard Column: The Pros, the Darts, and the Market, http://papers.ssrn.com/sol3/papers.cfm?
abstract_id=1068.

380

Doing the numbers nuclear physics and the stock market

benefit. And real news develops randomly, that is, unpredictably. It cannot be predicted by
studying either past technical or fundamental information.

In a way, this is the same reasoning that underlies the application of statistical
physics to multi-particle systems like a (classical) gas. The molecules in a macroscopic
quantity of gas move in direct response to the forces acting on them. Metaphorically
speaking, these forces, arising from both the environment and intermolecular interactions, are like the new information in the quotation from Malkiel. They change
rapidly (due to temperature fluctuations, variations in intermolecular distances, etc.)
and the molecular responses occur rapidly. No computer, however large its memory,
could keep track of these changes, nor could any human or computer make use of
this astronomically vast amount of information in a timely way even if it were
possible (which it isnt) to catalog it. So the whole system behaves for all practical
purposes randomly, describable by coarse-grained statistical state variables (temperature, pressure, density, etc.) and not by the Newtonian coordinates and momenta
of each molecule.
The financial lesson that Malkiel and others draw from acceptance of the efficientmarket hypothesis is that it is fruitless to time the market i.e. to try to guess an
optimal time to get in or out of a particular stock. The best strategy, we are told, is to
buy an index stock and hold it for the long term. In that way the investor can do as
well as the market does, since there is no viable strategy for beating the market. In
support of this conclusion, Malkiel cites25 H. N. Seybun of the University of
Michigan who
. . .found that 95 percent of the significant market gains over the thirty-year period from the
mid-1960s through the mid-1990s came on 90 of the roughly 7,500 trading days. If you
happened to miss those 90 days, just over 1 percent of the total, the generous long-run stock
market returns of the period would have been wiped out. The point is that market timers risk
missing the infrequent large sprints that are the big contributors to performance.

As a physicist, however, I draw a different conclusion: Timing is everything! No, this


conclusion is not inconsistent with what I have written detailing the randomness of
stock prices and the efficiency of the market.
Consider again, a little more quantitatively this time, a classical ideal gas in
equilibrium with its environment. The macroscopic thermodynamic state of the gas
is characterized uniquely by pressure P, absolute temperature T, and number density
n through the ideal gas law
P nkT

6:10:1

in which k 1.38 1023 J/K is the universal Boltzmanns constant. Using (6.10.1)
one readily finds that there are approximately n 2.45 1019 molecules/cm3 at 1 atm
25

Malkiel, op. cit. 2003 Edition, page 169.

6.10 Timing is everything

381

pressure and room temperature (about 300 K). To a good approximation therefore,
the mean distance d between molecules is d n1/3 or about 3.4 107 cm under the
presumed conditions. The molecules move about like a swarm of mad bees with a
root-mean-square (rms) speed derivable from the classical equipartition theorem (i.e.
mean energy 12 kT for each translational degree of freedom)
p
vrms 3kT=m:
6:10:2
Under the stated conditions, this speed is about 515 m/s.
In an equilibrium state the gas molecules are uniformly distributed macroscopically, but because of statistical fluctuations in the exchange of energy and momentum
with the environment there will occur randomly from time to time and place to place
pockets of higher or lower density or pressure than the mean. For a system to be in
stable equilibrium means that fluctuations of this kind subsequently damp out, rather
than intensify. How quickly do they damp out? Fluctuations in the gas dissipate on a
time scale comparable to the time between collisions of gas molecules because it
is these collisions that are responsible in the first place for creating equilibrium
conditions throughout the gas. From the kinetic theory of gases we find that the
mean free path (i.e. the mean distance between collisions) of molecules of diameter
a is given by the expression
1
p
2a2 n

6:10:3

and therefore the mean time between collisions can be estimated from the relation
c

vrms

6:10:4

Evaluating (6.10.4) in the case of nitrogen gas (N2) for which the molecular mass (28
atomic mass units) is m 4.7 1026 kg and molecular size is a  d/8.74  3.9
108 cm (inferred from the ratio of densities of nitrogen in the liquid state, where the
molecules are assumed contiguous, and the gaseous state, where they are separated
by d), we find that c  1.2 1010 seconds.
In other words, as long as we observe the sample of nitrogen gas on time scales long
compared to fractions of a nanosecond, departures from equilibrium damp out
sufficiently rapidly that our apparatus is insensitive to their presence. A few tenths
of a nanosecond is a very short time interval; most laboratories do not have the means
to measure or in any way exploit processes on such a short time scale. But some do.
The 1999 Nobel Prize in Chemistry was awarded to a scientist (Ahmed Zewail of
CalTech) for showing that it is possible with rapid laser techniques to see how atoms
in a molecule move during a chemical reaction.26 The technique is known as

26

http://nobelprize.org/nobel_prizes/chemistry/laureates/1999/press.html

382

Doing the numbers nuclear physics and the stock market

femtosecond spectroscopy and represents a kind of stop-motion photography on the


scale of 1015 seconds. The Nobel press release gushed: The contribution for which
Zewail is to receive the Nobel Prize means that we have reached the end of the road;
no chemical reactions take place faster than this. It is always risky in science to
believe that one has reached the end of the road. Although atoms may move
(i.e. oscillate) on a femtosecond time scale, electron processes can occur a thousand
times faster, and there are now laboratories where such processes in gas phase or
condensed matter are being studied with lasers on an attosecond (1018 s) time
scale.27 If it is fast information that one wants, then the person with the fastest
apparatus laser, computer, whatever will get that information first.
The efficient stock market is like an ideal gas in equilibrium with its environment.
At the macroscopic coarse-grained level the molecules of the gas and the stock prices of
the market both undergo stochastic processes reasonably modeled by a random walk
of one kind or another. Nevertheless, pockets of disequilibrium form in the market just
like density fluctuations in the gas. In an efficient market, the momentary disequilibria
will vanish as investors discover opportunities to buy something from one market at
a lower price than which they can sell it in another. This practice, known as arbitrage,
is the only way besides being a crook or US Congressman to beat the market.
However, you do need to be fast. And to do that, you need to be rich.
Ordinary investors (as I defined them) cannot beat the market this way because
their response times are much too slow. Arbitrage opportunities may exist for
seconds or less, but it takes hours to days for an ordinary investor to place an order
with a broker to buy or sell shares of stock. For the 3.6 million teachers, researchers,
and medical personnel with retirement accounts at TIAA-CREF,28 for example, an
order to buy or sell shares of CREF Stock placed today will not be executed until
5:00 pm tomorrow. Moreover, TIAA-CREF asserts the right to refuse to implement
any stock transactions that they deem to be ordered too frequently.
Besides femtosecond and attosecond lasers, physicists also created highfrequency trading,29 i.e. the use of supercomputers fed a continuous stream of
financial data from multiple markets to search for arbitrage advantages. These
computers then buy and nearly simultaneously sell tens of millions of shares of
stocks a day without ever glancing at a prospectus to see if there is any value in what
is being bought or sold. The high-frequency trading companies try to capture
favorable disequilibria (inefficiencies) that exist for only fractions of a second by
programming their computers to make a profit of 1 cent or less about 40 million
times a day30 which would make an overall tidy sum of $400 000. However, the cost

27
28
29
30

A. L. Cavalieri et al., Attosecond spectroscopy in condensed matter, Nature 449 (2007) 10291032.
TIAA-CREF, http://en.wikipedia.org/wiki/TIAA-CREF.
See T. A Bass, The Predictors (Henry Holt and Co., New York, 1999) for the fascinating story of How a band of
maverick physicists used chaos theory to trade their way to a fortune on Wall Street (blurb on front cover).
S. Kroft, CBS News, How Speed Traders Are Changing Wall Street, http://www.cbsnews.com/stories/2010/10/07/
60minutes/main6936075.shtml.

6.10 Timing is everything

383

of software development or acquisition (e.g. by buying a prediction company),


computer purchase, access to data and brokerage services, and a high level of capital
to invest puts this activity far out of the reach of an ordinary investor. Principal
players are firms like Goldman Sachs, Barclays, Credit-Suisse, and Morgan Stanley,
among others.
Since speed determines advantage, I would expect that competition in highfrequency trading will eventually lead (if it has not already done so) to data transfer
rates approaching the theoretical limit derived by Shannon for nearly lossless communication channels. Further, I would expect that, as more robot traders enter the
game, market efficiency will rise in high-frequency trading and exploitable inefficiencies will become increasingly difficult to find.
As for controversies over whether the stock market is random, whether there is
information in a stock performance record, or whether hyperactive robot traders can
predict the future, I would say that nothing so far has refuted the conclusions
I summarized at the beginning of the chapter. Like the density fluctuations in an
ideal gas in equilibrium, the price fluctuations in an efficient market can be observed
and exploited only if one is fast enough. No human can be and so fat tails and
excess volatility notwithstanding, the prediction of stock price variations remains in
my view a futile hope for all practical purposes of the average investor.

Appendices

6.11 Information inequality H(AjB)  H(A)


Starting with the entropy (6.3.6) for system A
HA 

m
X

pAi log pAi 

i1

m X
m0
X

pAi Bj log pAi

6:11:1

i1 j1

and the conditional entropy (6.3.11) for system A given knowledge of events in
system B
HAjB 

pAi Bj log pAi jBj 

i, j

pAi Bj log

i, j

pAi Bj
,
pBj

6:11:2

we take the difference




X
pAi Bj
 log pAi
HAjB  HA 
pAi Bj log
pBj
i, j
X
pAi pBj

:
pAi Bj log
pAi Bj
i, j

6:11:3

Use in (6.11.3) of the general property of logarithms,


log x  x  1,

6:11:4

derivable from the Taylor series expansion, leads to the inequality


HAjB  HA 


 X
X
pAi pBj
1
pAi Bj
pAi
pBj  1 0
pAi Bj
i, j
i
j

from which follows


HAjB  HA:
384

6:11:5

6.13 Exact maximum likelihood estimate of AR(1) parameters

385

6.12 Power spectral density of an autoregressive time series


An AR(n) time series (of zero mean) can be expressed as a series of random shocks
!

X
X
j tj
j Bj t  Bt
6:12:1
y t xt 
j0

j0

with coefficients j determinable from the AR(n) master equation. Use of the backshift operator B, permits one to represent the series succinctly by means of the
generator (B) defined above.
The covariance elements of the series can likewise be written in terms of the
expansion coefficients
k  hyt ytk i 2

j jk

6:12:2

j0

where hj k i 2 j k . By combining the elements (6.12.2) with B we can create a


covariance generating function (B)
B 

k Bk 2

k

X
X
X
X
j
jk Bk 2
j h Bhj
kj

j0

j0

h0

Set hjk

j0

h0

j Bj

h Bh 2 BB1

6:12:3

as a product of the series generating function and its inverse.


Replacement of B in (6.12.3) by e2i produces the expression
2 je2i j2 0 2

1
k cos 2k S ,
2
k1

6:12:4

which is merely a rearrangement of the WienerKhinchin theorem for the power


spectral density S in terms of the covariance function.
6.13 Exact maximum likelihood estimate of AR(1) parameters
Consider a time series fxt t 1. . .ng described by the master equation


t N 0, 2
xt  xt1 t

6:13:1

for which the two parameters and 2 are to be estimated by the principal of
maximum likelihood. Thus, the complete likelihood function (i.e. posterior probability in a Bayesian analysis) is





Pr fxt gj, 2 Pr ft t 2 . . . ngj, 2 , x1 Pr x1 j, 2 ,
6:13:2

Doing the numbers nuclear physics and the stock market

386

where


1=2 x2 =2 2
Pr x1 j, 2 2 2t
e 1 t

6:13:3

is the conditional probability that the first element of the series is x1 and


Pr ft t 2 . . . ngj, 2 , x1


2 n1=2

n
P
t 2

2t =2 2

6:13:4

is the conditional probability of the n  1 subsequent shocks given x1. Note that the
variance 2 appearing in (6.13.4) is that of the shock t, whereas the variance 2t in
(6.13.3) is that of xt. The two variances were shown to be related to one another and
to the covariance element 0 by
0 hx2t i 2t

2
:
1  2

6:13:5

Upon combining relations (6.13.3) through (6.13.5), the complete likelihood function
(6.13.2) becomes


n
P
2
2 2
 1 x1
xt  xt1
2 2



2 1=2
2
2 n=2
t2
Pr fxt gj, 2
1  e
6:13:6
from which follows the log likelihood function
n  1
L  ln 2 ln 1  2 
2
2

1  2 x21

n
X

xt  xt1 2

t2

2 2

6:13:7

apart from constant terms.


For general information, it is worth noting that, if the time series were of the
form AR(m), then the second term on the right side of (6.13.7) would be the
square root of the determinant of a symmetric m m matrix M, and the numerator of the third term on the right would take the form xTMx, where xT is the
transpose of the vector x of series elements. M is a matrix of covariance elements
defined by31
0
B
B
M 2 B
@

0
1
..
.

m1

31

1
0
..
.

m2

E. P. Box, G. M. Jenkins, and G. C. Reinsel, op. cit. 296297.

11
m1
m2 C
C
.. C :
. A
0

6:13:8

387

6.14 Statistics of gambling and law of averages

Maximization of the log-likelihood function (6.13.7) leads to coupled nonlinear


equations
n
X
xt xt1
t2
^
X
6:13:9
n
^ 2
x2t1  x21 ^2

^ 2


2
1  ^
n

1

t2

x21

"
#
n
n
n
X
1 X
2X 2
2

x  2^
xt xt1 ^
xt1
n t2 t
t2
t2

6:13:10

in the parameters, which must be solved numerically. These numerical solutions for
the parameters of the CREF, AAPL, and GRNG time series and the values of the
parameters obtained previously from the conditional likelihood function are in
excellent agreement.
The matrix H of second derivatives of L with elements
!
!
2
n
2 L
1 ^
1 X
2
2
xt1  x1
H 11 2 
 2
^2

^ t2
1 
"
!#
n
n
X
2 L
1 X
2
2
^
  4
6:13:11
xt xt1 
xt1  x1
H 12 H21
2
^ t2
t2
"
#
n
2 L
n
1 X
2
2 2
^
xt  xt1  1  x1
H 22  2 4  6
2^
^ t2
2
yields the covariance matrix of the parameters through the relation C H1.

6.14 Statistics of gambling and law of averages


In Section 6.8 a coin-toss game was described in which a gambler wins $1 with
probability p if a head turns up and loses $1 with probability q if a tail turns up. Let
X be the gain of a single toss

1 with probability p
X
6:13:12
1 with probability q 1  p
for which the moment generating function (mgf ) is
 
gX t eXt pet qet :

6:13:13

The gain G after N independent tosses is the sum


G

N
X
i1

Xi

6:13:14

388

Doing the numbers nuclear physics and the stock market

with corresponding mgf

 
gG t eGt pet qet N :

6:13:15

The expectation and variance of the gain are readily calculated from the mgf

dg t
hGi G Np  q
6:13:16
dt t0

d 2 ln gGt
var G var G  hGi
6:13:17
4 Npq:
dt2
t0
It then follows from Eq. (6.13.17)
the standard deviation D of the difference
pthat

D G  hGi is proportional to N and therefore


D
1
/ p :
N
N

6:13:18

Alternatively, one can calculate the mgf of the random variable D/N
gD=Nt heDt=N i ehGit=N pet=N qet=N N

6:13:19

or
ln gD=Nt 

hGit
N lnpet=N qet=N N ln1 pe2t=N  1,
N

6:13:20

which, when expanded in a power series to third order in t, yields a mgf of Gaussian
form leading to the asymptotic identification


D
2pq
:
6:13:21
N 0,
N
N
p
Thus D/N tends toward 0 as 1= N .
The probability of losing k games and winning N  k games to achieve a gain of
N  2k is given exactly by the binomial expression

PrkjN, p


N k Nk
pq :
k

6:13:22

Under conditions of a fair game p q 12, the probability that ones gain after
100 games lies in the range (110
G
90) or in the range (120
G
80) is obtained
from the following sums of (6.13.22) over k (performed by Maple)
Pr110
G
90
Pr120
G
80

55
X

2100 k 45
1

60
X

2100 k 40

100
k
100
k

!
0:729
6:13:23

!
0:965:

6.14 Statistics of gambling and law of averages

389

The corresponding probabilities in the Gaussian approximation with mean 50 and


variance 10 are
1
Pr110
G
90 p
2
1
Pr120
G
80 p
2

1
1
2

2

ex

=2

dx 0:683
6:13:24

ex

=2

dx 0:954:

7
On target: uncertainties of projectile flight

A projectile, while it moves by a motion compounded of a uniform


horizontal motion and a motion naturally accelerated downward,
describes under this movement a semi-parabolic line.
Galileo Galilei1
7.1 Knowing where they come down
2

In his satirical song about WW II German rocket engineer Werner von Braun,
songwriter and erstwhile mathematician Tom Lehrer sings (in a German accent):
Vonce ze rocketts are up, who cares vere zey come down? Physicists, of course, care
very much where they come down. Indeed, the study of projectile motion is ordinarily a fundamental part of any study of classical dynamics, introductory or advanced,
where it serves primarily to illustrate the laws of motion applied to objects in free-fall
in a uniform gravitational field. In this context, as evidenced by numerous textbooks
beginning with Galileos own, first published in 1638 students are taught to solve
problems that fall into certain standard categories such as
(1) ground-to-ground targeting (e.g. a missile is fired with speed v at an angle to the
horizontal at a target a horizontal distance d away),
(2) air-to-ground targeting (e.g. a package is dropped at a height h above the ground
from an airplane traveling horizontally at speed v),
(3) ground-to-air targeting (e.g. a projectile is launched at speed v and angle to the
horizontal at a pie plate simultaneously dropped from a height h and horizontal
distance d),
and possibly others.
In the commonly encountered textbook and classroom examples the specified
dynamical variables are exact, rather than distributed, quantities, and the objective
1

Galileo Galilei, from Discorsi e Dimostrazioni Matematiche Intorno a Due Nuove Scienze [Discourses and Mathematical
Demonstrations relating to Two New Sciences] (Elzevir, 1638), Fourth Day: Theorem I, Proposition I. (Translation
from Italian by M P Silverman)
Tom Lehrer song, Werner von Braun, performed in 1965 in San Francisco CA. A recorded performance is available
at http://www.youtube.com/watch?v=QEJ9HrZq7Ro.

390

7.1 Knowing where they come down

391

is to determine exactly some other dynamical quantity. For example, a typical


problem in category (1) above might be to calculate the range of the missile from v
and , or to determine the value of leading to the maximum range. Similarly,
a typical problem in category (2) might be to determine how far in advance (in terms
of horizontal distance or of time) the package must be released for it to land at a
specified ground location.
In the real world, however, the dynamical variables of a projectile are never
exactly known and it therefore becomes necessary to employ probability and
statistical reasoning besides Newtonian mechanics. The statistical study of projectile
motion raises a number of thought-provoking questions beyond the familiar ones
already cited. Consider the following.
 If one knows the mean values and uncertainties (i.e. variances) of the initial
conditions (e.g. speed and angle of launch), within what range of distances will
the range of the projectile likely be? (The fact that the word range has a
particular dynamical meaning distinct from its statistical meaning is somewhat
unfortunate for both euphony and clarity, but context should make its usage here
unambiguous.)
 How precisely must the initial dynamical variables be known if the projectile is to
land within a specified distance of the target with a probability of, let us say, 95%?
 If the mean and variance of these initial variables are known, how many times
must one launch a projectile in order to be reasonably certain (let us say with a
confidence of 95%) that one of them will hit the target?
 If the mathematical rule (i.e. probability density function or pdf ) governing the
distribution of initial velocities is known, how will the range of the projectiles be
distributed? Can this theoretical prediction be tested experimentally? If so, what
kinds of statistical tests would be useful, approximately how many projectile
launchings would be required in a sample, and how does this number depend on
the uncertainties of the initial dynamical variables?
All too frequently, the application of statistical reasoning in physics courses, including those with a laboratory component, entails not much more than the calculation
of means and standard deviations of measured quantities with little attention paid to
the actual distributions, which can differ significantly from a normal distribution, or
to the specific confidence limits implied by the estimated standard deviations. Facets
of this problem were discussed previously in Chapter 5.3 The present chapter goes
further by applying statistical reasoning to projectile problems of diverse interest for
reasons from recreational to practical.
I begin by considering the questions posed above for cases of ground-to-ground
targeting. I examine theoretically the distributions of ranges that result from different
3

See also, M.P. Silverman, W. Strange, and T.C. Lipscombe, The distribution of composite measurements: How to be
certain of the uncertainties in what we measure, American Journal of Physics 72 (2004) 10681081.

392

On target: uncertainties of projectile flight

hypothesized distributions of the initial dynamical variables. Since sports have


always provided a fertile ground for the application of physics, the theoretical results
are used to analyze the ranges of a sample of home runs hit by three acclaimed
American baseball players (Mark McGwire, Sammy Sosa, and Barry Bonds) with an
objective to ascertain whether great hitters are more or less equivalent statistically, or
whether there are differences that suggest some unusual facet to their training.

7.2 Distribution of projectile ranges


The horizontal distance (range) R traveled by a projectile launched with speed V at an
angle to the ground is readily shown in any introductory physics textbook to be
R

V 2 sin 2
,
g

7:2:1

where the acceleration of gravity is g ~9.8 m/s2 near the Earths surface. If V and
vary randomly from launch to launch as governed by pdfs pV (v) and p(), then the
range will also follow a pdf pR(r) which can be deduced from pV (v) and p() by
methods developed in Chapter 5. Unless otherwise noted, we will again employ
standard notation where an upper-case letter represents a random variable and the
corresponding lower-case letter signifies its realization in a sample.
Once the pdf pX (x) of a random variable X (where b  X  a) is known, all the
statistical moments mk,
b
mk  hx i xk pX x dx,
k

7:2:2

and functions of these moments can in principle be calculated. To recapitulate, the


most widely used are (a) the mean m1, which when appropriate will also be designated
by the symbol X, (b) the variance and standard deviation
p


varX  X  X 2 m2  m21
X varX,
7:2:3
(c) the skewness
*
SkX

X  X
X

3 +

m3  3m2 m1 2m31
,
3X

7:2:4

which is a measure of the asymmetry of the distribution about the mean, and (d) the
kurtosis
*
+
X  X 4
KX
,
7:2:5
X

393

7.2 Distribution of projectile ranges

which is a measure of the flatness of the distribution near the mean, and the variances
in these (and other) functions of X.
The pdf, which must satisfy the completeness relation
b
pX xdx 1,

7:2:6

dFX x
dx

7:2:7

is the derivative
pX x

of the cumulative probability function (cpf ) defined by


x

FX x  PrX  x pX x0 dx0 :

7:2:8

In considering the distribution of projectile range, we will examine the case in which
the angle of launch is sufficiently well-defined that for all practical purposes can be
taken to be a constant 0. This not only makes the resulting mathematics more
tractable, but corresponds reasonably well to the conditions of many experiments for
which the launch angle can be set precisely and the uncertainty in the force of discharge
leads to a spread in projectile speed. In that case, the range takes the form R cV2 with
constant c sin 20/g. The more general case will be deferred to an appendix.
The pdf pR(r) is deducible by first determining the distribution FR(r) in terms of the
distribution of V
p

FR r PrR  r PrcV 2  r Pr jVj  r=c
p
p


Pr V  r=c
Pr V   r=c ,
7:2:9
V>0

V<0

and then differentiating with respect to r


p
 p
r=c pV  r=c
dFR r pV
p
pR r

:
dr
2c r=c

7:2:10

Let us suppose for the present that the speed is distributed as a normal random
variable V NV 0 , 2V with mean V0 and standard deviation V. As a reminder, the
designation NV 0 , 2V signifies a pdf of the form
2
1
2
pX x p exX =2X
2 X

7:2:11

where  x  . Thus, substitution of (7.2.11) into (7.2.10) leads to the pdf


p 2

p 2

2
2
1
e r=cV 0 2 V e r=cV 0 2V
p
pR r p
,
7:2:12
2 2 c V
r=c

394

On target: uncertainties of projectile flight

which, after some algebraic rearrangement becomes

pR r

p

p2
2
r  R0
2 p

p

e
p
2 2r pR
R

p2
2
r R0
2 p

7:2:13

where
R0

V 20

sin 20
g

pR

s
sin 2 0
 V
g

7:2:14

are conveniently defined quantities and not a mean and standard deviation of the
non-Gaussian distribution (7.2.13). For (V0/V)  1 the second exponential in (7.2.12)
is numerically insignificant although its presence is required theoretically for correct
normalization.4
Since R cV2, the expectation values of functions of R can be evaluated using
(7.2.13) or, equivalently, as expectations of functions of V using (7.2.11). Either way,
one obtains the following useful statistical relations:
 


mean R hRi c V 2 c V 20 2V
 
 


second moment m2 R2 c2 V 4 c2 V 40 6V 20 2V 3 4V


variance varR c2 varV 2 2c2 2V 2V 20 2V
h
i
p
1
standard deviation R varR 2cV 0 V 1 V =V 0 2
2

*
8
 + * 2
3 +
>
R  R 3
V  V 2
>
>
SkR

>
>
R
V2
>
>
>
>
>
2
<
V 3 V =V 0

h
i3=2
skewness
V0
2
>
1
>
1

=V

>
V
0
2
>
>
>
 
 
>
>
>
V 5 V 3 21 V 5 . . .
>
:
3 

V0 4 V0
32 V 0

7:2:15
7:2:16
7:2:17
7:2:18

7:2:19

in which the series expansion of the exact expression for skewness is truncated at
O((V/V0)6). We will not need the kurtosis or moments mk for k > 3 in this chapter.
The mode of a distribution is the argument x at which the pdf pX(x) is maximum.
By ignoring the second term in (7.2.13) under the assumption that (V0/V)  1, and
solving the equation d ln pR(r)/dr 0, one obtains the modal range Rm

For zero mean and unit variance relation (7.2.12) reduces to the pdf of a chi-square distribution of one degree of
freedom, as expected for the square of a standard normal variate.

7.2 Distribution of projectile ranges

2
3
s
"

 2
 2 #
1 24
2 V 2 5
V
:
Rm cV 0 1 1 
 cV 20 1  2
V0
V0
4

395

7:2:20

The approximate second relation results from a Taylor series expansion to first
nonvanishing order of (V/V0)2.
The upper and lower
panels of Figure 7.1 respectively show histograms of the

speed V N V 0 , 2V and range R cV2 for parameters V0 100 m/s, V 15 m/s
(and therefore V/V0 0.15), and 0 50o. The histogram of speeds was compiled
from a sample of 50000 trials with a Gaussian RNG. The superposed pdf pR(r) in
(7.2.13) closely delineates the envelope of the corresponding histogram of ranges,
which yields sample moments in excellent agreement with values theoretically predicted by relations (7.2.15) (7.2.19),

 as shown in Table 7.1. For comparison, the
2
Gaussian approximated pdf N R , R is also shown. The broader the distribution of
speed (as gauged by the ratio V/V0), the greater the departure of the distribution of
range from a normal distribution, with concomitantly increasing skewness. Because
of skewness, the mean R and mode Rm depart significantly from the kinematic range
R0 (which is not a stochastic variable). For a distribution of speeds sufficiently
narrow that all powers of V/V0 beyond the first can be neglected, the expressions
for R and R reduce to the expressions of standard error propagation theory (EPT)
R0

V 20 sin 20
g

R0

2V 0 V sin 20
:
g

7:2:21

To this point we have been considering the projectile discharge speed to be


a normal random variable. However, an alternative hypothesis, which perhaps
may be better justified by the physics of the launching process, is to regard the
kinetic energy of the projectile and therefore the square of the discharge speed as
a normally distributed
quantity. In that case, we can define the random

 variable

Y  V 2 N Y , 2Y , from which it follows that the range R cY cN Y , 2Y
N cY , c2 2Y is also a normal variate with mean cY, variance c2 2Y , and skewness 0.5
For comparison with the range distribution of the previous hypothesis, we make the
associations
Y hYi hV 2 i V 20 2V

7:2:22

from (7.2.15) and




1 2
varY 2Y hY 2 i  hYi2 hV 4 i  hV 2 i2 4V 20 2V 1 V2
2 V0

7:2:23

from (7.2.17).
5

From the identity N (a, b2) a bN (0, 1) proven in Chapter 1, it follows that cN (a, b2) c[a bN(0, 1)] ca cbN (0, 1)
N(ca, c2b2).

396

On target: uncertainties of projectile flight


0.03

V = N(V0 ,V2)

Frequency per Bin

0.025

0.02

0.015

0.01

0.005
0
40

60

80

100

120

140

160

Speed (m/s)
0.0014

R = cV2

Frequency per Bin

0.0012
0.001
8 10 4
6 10 4
4 10 4
2 10 4
0
0

500

1000

1500

2000

Range (m)
Fig. 7.1 Top panel: histogram of 50000 samples of speed V drawn from a Gaussian RNG V
N(100,152) and sorted into 50 bins (bin width W 2.4). Bottom panel: corresponding
histogram of range R cV2 (bin width W 48.5) calculated for launch angle 0 50 .
Proportionality constant c sin (20)/g 0.1005; kinematic range R0 cV 20 1005 m;
standard deviation R 303.2 m. Solid curves trace
 the exact probability density functions;
the dashed curve is the Gaussian approximation N R0 , 2R .

397

7.2 Distribution of projectile ranges

Table 7.1 Statistics of projectile range for Gaussian


speed R cV 2 cNV 0 , 2V
2
Statistic

Sample

Theory

Mean (m) R
Mode (m) Rm
Kinematic range (m) R0
Standard deviation (m) R
Skewness SkR

1028
956
(not a statistic)
302.7
0.422

1028
960
1005
303.2
0.446

For the case to be considered here in which V / V0 < 1, we can approximate the
preceding mean and standard deviation by the simpler expressions
Y  V 20

Y  2V 0 V :

7:2:24

The range would then be distributed according to the Gaussian pdf


2
1
2 2
ercY =2c Y
pR r p
2 c Y

7:2:25

with c sin 20/g as before were it not for the fact that r must be non-negative.
There arises, therefore, the matter of proper normalization so that

pR rdr 1:

7:2:26

The correct normalization constant is found by integrating the Gaussian exponential


over the range  r  0

exa =2b

dx

r



a
1 erf p
2
2b

to obtain a truncated Gaussian pdf


p
2= 1=b xa2 =2b2

e
I0, x,
pX x
a
1 erf p
2b

7:2:27

7:2:28

where I(0,)(x) is the interval function introduced in Chapter 5. In the limit (a/b) ! ,
the error function in the denominator approaches 1, and (7.2.28) reduces to the
familiar Gaussian pdf. Applied to (7.2.25), we arrive at the pdf of the range

p
2 2
2 2 2
2=
ercV 0 8c V 0 V


pR r
I0, r
7:2:29
V0
2cV 0 V
1 erf p
2 2 V

398

On target: uncertainties of projectile flight

Table 7.2 Statistics of


for Gaussian kinetic
 projectile range

energy R cV 2 cN V 20 , 2V 0 V 2
Statistic

Sample

Theory

Mean (m) R R0
Standard deviation (m) R
Skewness SkR

1003
301.48
0.013

1005
301.47
0

after substitution of relations (7.2.24) into (7.2.28).


Table 7.2 summarizes
the
 results of 50000 samples drawn from the Gaussian

RNG R cN V 20 , 2V 0 V 2 for the same launch angle and speed parameters that
applied to Table 7.1. Since the skewness is 0, the mean, modal, and median ranges are
all identical and very close to the (non-stochastic) kinematic range.
If the range is distributed normally, how then is the projectile speed, calculated
from (7.2.1), distributed? Expressed as a relation between random variables, the
q


 q

launch speed is V N Y , 2Y  N V 20 , 2V 0 V 2 . The cpf is obtained by a chain
of steps similar to that of (7.2.9), leading to the distribution

 

FV v PrV  v Pr V 2  v2 FV 2 v2 ,

7:2:30

which, when differentiated, yields the non-Gaussian pdf


 
 
d
pV v FV 2 v2 2vpV 2 v2
dv

p v2 V 2 2
8V 2 2
0
0 V
2= ve


 :
V0
V 0 V 1 erf p
2 2 V

7:2:31

The upper and lower panels of Figure 7.2 respectively show histograms of the range
q




R N cV 20 , c2 2V 2 and speed V N V 20 , 2V 2 corresponding to the hypothesis that
the square of the projectile launch speed is normally distributed; the parameters of
the distribution are, as before, V0 100 m/s, V 15 m/s, and 0 50o. The
histograms
 again compiled from samples of 50000 trials with a Gaussian
 2 were
2
RNG N V 0 , V 2 sorted into 50 bins. In both panels the theoretical pdfs are superposed and fit the envelopes of the histograms like a glove. Also shown for comparison in the lower panel is the pdf of a non-truncated Gaussian distribution centered
on V0 with standard deviation V. This Gaussian function, although unskewed,
approximates the center of the histogram reasonably well for the parameter ratio
V / V0 0.15.
Because the distribution of V is skewed, the mean and modal speeds differ.
Calculating the modal speed Vm by the same procedure used to calculate the modal
range in (7.2.20) leads to the expression

399

7.2 Distribution of projectile ranges


0.0015

Frequency per Bin

R = cN(V02 , 4V02V2)
0.001

5 10

0
0

500

1000

1500

2000

Range (m)
0.03

Frequency per Bin

0.025

V = N(V02 , 4V02V2)

0.02
0.015
0.01
0.005
0
40

60

80

100

120

140

160

Speed (m/s)
Fig. 7.2 Top panel: histogram of range R cN(1002, 30002) with theoretical Gaussian density
q


(solid). Bottom panel: histogram of corresponding speed V N 1002 , 30002 with exact
density function (solid) and Gaussian approximation (dashed). The kinematic range and
sample size are the same as for Figure 7.1.

s
"
 2
 2 #
1 V
V
 V0 1
Vm V0 1
:
V0
2 V0

7:2:32

Comparison of sample and theoretical statistics for the speed V summarized in


Table 7.3 again shows very close agreement.

400

On target: uncertainties of projectile flight

Table 7.3 Statistics of projectile speed V for Gaussian kinetic


energy: V 2 NV 20 , 2V 0 V
Statistic

Sample

Theory

Mean (m/s) V
Mode (m/s) Vm
Parameters (m/s)
V0 100 V 15
Standard deviation (m) R
Skewness SkR

97.4
101.5

98.7
102.2

15.7
0.543

15.7
0.549

Ro

Fig. 7.3 Schematic diagram of the uncertainty region of size 2d about the range R0 within
which a projectile launched at variable speed V and sharp angle 0 may land.

Since the truncation correction differs insignificantly from unity,


p
erfV 0 =2 2 V erf2:357 0:99914,
we will use the distribution (7.2.25) of range to study the question of targeting precision.
How well controlled must the launch speed be in order that the projectile strike within a
specified distance of its target with a probability of at least 95%?

Let R0 cV 20 R be the location of the target and 2d the distance within which the
projectile must land, as illustrated in Figure 7.3. The probability that the projectile
lands within specified tolerances is calculable by the chain of steps



R  R
 d  0:95, 7:2:33
Prd  R  R0  d PrjR  R0 j  d Pr
R R
in which the quantity ZR  (R  R)/R is a standard normal variate, i.e. ZR N(0,1).
The requirement in (7.2.33) is then met numerically by d/R  1.96  2, and
substitution of R 2cV 20 V =V 0 2R0 V =V 0 leads to the inequality

7.3 Energy vs speed: a test of hypotheses

V =V 0  d=4R0 :

401

7:2:34

Suppose, for example, the projectile is a small ball shot from a spring-loaded
launcher to hit a 10 cm diameter pie plate lying on the floor a distance 3 m away.
Then d 5 cm, R0 300 cm, and from (7.2.34) one must have V/V0  1/24 0.042.
By contrast, if the projectile were a ground-to-ground missile intended to hit a 5 m
wide target a distance 1 km away, (7.2.34) would require V/V0  1/800 0.00125.
Inequality (7.2.34) helps make clear why a ballistic missile is a guided missile, i.e.
requires a guidance system.
Alternatively, for a given precision V/V0 we can ask the following question.
How many projectiles must be launched in order to be 95% confident that the
target is hit?
To answer this question, we utilize the previously demonstrated result that the mean
ZR of n samples of a normal random variable ZR N(0,1) is distributed normally with
mean 0 and standard deviation n1/2, i.e. Z R N0, n1 . The desired condition


d
p  0:95
7:2:35
Pr jZ R j 
R = n
p
then leads to the requirement that nd= R  1:96  2 or, equivalently,
n  16R0 =d2 V =V 0 2 :

7:2:36

For the precision V/V0 0.15 used in the examples illustrated in Figures 7.1 and 7.2,
the minimum number of strikes needed to assure 95% confidence of landing within
5 m of a target 1 km away is n 14,400. Improving the precision by a factor of
10 would reduce the number of trials by a factor of 100.

7.3 Energy vs speed: a test of hypotheses


In the preceding section we calculated the pdfs and lowest statistical moments of two
conceivable range distributions that may apply to the motion of projectiles in free
fall. In the first case, the launch speed of the projectile was hypothesized to follow a
normal distribution, a circumstance that might occur if the launch mechanism
entailed a constant force acting over a normally distributed time interval. In the
second case, the square of the launch speed was hypothesized to follow a normal
distribution, a circumstance that might occur if the launch mechanism endowed the
projectile with a normally distributed kinetic energy. Since the two distributions can
give useful physical information, it is of interest to be able to distinguish between
them. How can this be done? What statistic or statistics would be useful? How many
samples would be needed, and how does this number depend on the precision with
which the launch speed is known (i.e. on the width of the distribution)?

402

On target: uncertainties of projectile flight

We take as our null hypothesis H0 the statement that the square of the speed and
therefore the range of the projectile is distributed normally. One way to proceed is
to do a chi-square analysis, which, as discussed previously, entails sorting measurements of the range among k 1, 2. . .n classes of specified width and calculating
2obs

n
X
Ok  Ek 2
k1

Ek

7:3:1

where Ok is the observed frequency of elements of the kth class and Ek is the
corresponding theoretically expected frequency. The sum in (7.3.1) approaches a 2d
distribution for d n  1 degrees of freedom (dof ) if the number of elements in each
bin is reasonably large, a criterion usually taken to mean greater than 5. From chisquare tables or use of computational software one then ascertains a P-value, i.e.
probability Pr 2d  2obs .
Although widely used, the 2 test may not be incisive enough to distinguish
between a normal distribution and a distribution that closely approximates a normal
distribution over much of the range centered on the mean. In that case, which is the
case we are faced with, an alternative and possibly better approach is to examine a
statistic that is sensitive to some critical symmetry that distinguishes the two distributions. If H0 is true, then the distribution should have a vanishing skewness. The
skewness of the alternative distribution considered here is given by (7.2.19) and is
approximately 3V/V0.
Consider the random variable W Z3 where Z ((X  X)/X)N(0,1). By
symmetry (i.e. odd parity), as well as by direct calculation, we know that hWi 0.
To use W in a statistical test, we need also to determine its variance, which is given by
varW hW 2 i  hWi2 hZ 6 i  hZ 3 i2 hZ 6 i 15
from which follows the standard deviation
p
W 15 3:87:

7:3:2

7:3:3

A variance of 15 signifies a relatively wide distribution. Indeed, it is precisely because


the variance of a moment increases with the order of the moment that only the
lowest-order moments are usually the most useful for statistical tests. Nevertheless,
one can improve upon the result (7.3.3) by considering the distribution of W n , the
mean of a sample of n independent values
of W, which, by the Central Limit

15
Theorem, is a normal variate W n  N 0, n with standard deviation
r
15
Wn
7:3:4
n
to a good approximation for large n, provided the first two moments exist. Note that
W itself is not distributed normally. Rather, by means of the steps previously illustrated for transformation of a probability density function, one can derive the pdf

7.3 Energy vs speed: a test of hypotheses

1
2=3
pW w p w2=3 e w :
3 2
1
2

403

7:3:5

What, approximately, must be the minimum size n of a sample of values of W Z3


if we are to be 95% confident that the observed value of W n falsifies the hypothesis
H0, which predicts W n 0? Recall that for the alternative hypothesis the skewness in
the distribution of the projectile range is approximately 3V/V0. Therefore, we must
determine n in the relation
!
!




3 V =V 0
Wn
 1:96  Pr p  2  0:05,
7:3:6
Pr


Wn
15=n
which leads to
n

20
3 V =V 0 2

7:3:7

In the cases we examined previously where V/V0 0.15, Eq. (7.3.7) yields n  296.
With a little imaginative thinking, however, we can do better than (7.3.7) by
creating another distribution (a binomial distribution) with a variance smaller than
that of (7.3.4). Consider a sample of n independent variates Zi (Xi  X)/X where
(i 1. . .n) and assign Bi(Z) 1 for Zi > 0 and Bi(Z) 1 for Zi < 0. Since Z is a
continuous random variable, the probability is 0 that it assumes precisely the value
0 and indeed this was found to be the case in the empirical study described in the
next section. Bi(Z) follows a binomial distribution with a mean p  q and variance
4pq where p is the probability Pr (Zi  0) and q 1  p is the probability Pr (Zi < 0).
n
X
Bi is distributed approximately as a variate
Thus, by the CLT the mean Bn 1n
i1

N( p  q, 4pq/n) for sufficiently large n. [Note that, statistically, this test is identical to
the coin-toss game of the preceding chapter (Sections (6.8), (6.14).] If the hypothesis
H0 is true, then p q, hBn i 0, and 2B 1=n.
n

The probability


Pr jBn = Bn j  1:96  0:05

7:3:8

that we can be 95% confident an observed value of Bn  V =V 0 (from skewness


(7.2.19)) exceeds 0 by two standard deviations requires a minimum sample size of
approximately
n

4
V =V 0 2

7:3:9

For V/V0  0.15, (7.3.9) yields n  178, a reduction in sample size of about 40%
compared to (7.3.7).

404

On target: uncertainties of projectile flight

It is useful to note here a general relation


Prf X  k  hf Xi=k,

7:3:10

where f(X) is a non-negative function of a random variable X with domain the real
line and k > 0, that permits estimation of sample size irrespective of the exact
distribution of X. The Weak Law of Large Numbers, derivable from (7.3.10), takes
the form6
PrjX  X j   for n  2X =2 ,

7:3:11

with > 0 and 0 < < 1 any two specified numbers within their respective ranges.
Applying this relation to X Bn with 0.05, 0.15, and 2B 1, leads to the
inequality n  889. Although consistent with (7.3.8) and (7.3.9), the Weak Law can
be highly inefficient.

7.4 Play ball! home runs and steroids


Baseball, apart from being a popular spectator sport in the USA and elsewhere,
provides an excellent laboratory for the investigation of projectiles. The interest in
this section is not on the mechanics of the batball collision, which have been well
studied previously, but principally on statistical matters and the unanticipated inferences one may draw from them.
Table 7.4 lists the home run distances in feet of three exceptional hitters: Mark
McGwire (1998 season), Sammy Sosa (1998 season), and Barry Bonds (2001
season).7 The three sets of data respectively comprise n1 70 samples, n2 66
samples, and n3 73 samples, making a total sample size n 209. The question
arose in my mind as a person who ordinarily paid scant attention to baseball of
whether exceptional hitters were all basically equivalent or whether performance
statistics of some players might be sufficiently outstanding (literally: stand out) as
to indicate some unusual training activity (such as illicit drug use). The Student t-test,
which was discussed in Chapter 1, provides one way to examine this hypothesis.
Let the random variables (X1i i 1. . .n1), (X2i i 1. . .n2), (X3i i 1. . .n3) represent
respectively the home run distances of McGwire, Sosa, and Bonds. Then the mean,
variance of the mean, and third moment (used in calculation of skewness) of the three
distributions can be estimated from the relations
Xk

6
7

nk
1X
Xki
nk i1

k 1, 2, 3

7:4:1

A.M. Mood, F.A. Graybill, and D. C. Boes, Introduction to the Theory of Statistics 3rd Edition (McGraw-Hill, New
York, 1974) 71, 232233.
Data are available at http://www.aw-bc.com/info/triola/tes09_02_eoc.pdf

405

7.4 Play ball! home runs and steroids

Table 7.4

Home run distances (feet)

McGwire (1998)
360
370
380
360
425
370
450
350
510
430
369
460
430
341

370
350
480
450
450
390
385

430
527
390
430
452
510
410

420
380
430
461
420
500
420

340
550
388
430
380
450
380

460
478
423
470
470
470
400

410
420
410
440
398
430
440

440
390
360
400
409
458
377

410
420
410
390
385
380
370

Sosa (1998)
371
350
420
460
350
420
390
400
400
380
388
440
480
480

430
400
410
364
380
414
434

420
430
415
430
400
482
344

430
410
430
450
370
364
410

434
370
380
440
420
370
420

370
370
380
365
360
400

420
410
366
420
368
405

440
380
500
350
430
433

410
340
380
420
433
390

Bonds (2001)
420
417
370
420
415
436
410
450
320
360
410
380
488
361
442
404

440
400
430
320
375
430
394
385

410
360
410
430
370
415
410

390
410
400
380
440
380
411

417
420
390
375
400
375
365

420
391
420
375
405
400
360

410
416
410
347
430
435
440

380
440
420
380
350
420
435

430
410
410
429
396
420
454

S2X
k

nk
X
S2Xk
1

Xk, i  Xk 2
nk
nk nk  1 i1
3

Mk

k 1, 2, 3

nk
X
n
Xk, i  Xk 3
n  1n  2 i1

7:4:2
7:4:3

and lead to the results in Table 7.5.


As a rough estimate, based on (7.2.21) and the uncertain assumption that all the
home runs were hit at close to the optimal angle 0 45 , one can show that the
relative uncertainty sX =x  0:1 in range implies a relative uncertainty in
V =V 0  sX =2x ~ 0.05 in the speed at which the ball left the bat.
At this point a brief reminder may be warranted regarding the normalizing
constants in (7.4.2) and (7.4.3) for the second and third moments about the sample
mean. The constants are not simply the reciprocal of the sample size, which one might

406

Table 7.5

On target: uncertainties of projectile flight

Home run statistics

Player

Home run range


x sX (ft)

Sample SD
sXk (ft)

Sample
skewness

Sample
size nk

McGwire
Sosa
Bonds

418.5 5.4
404.8 4.7
403.7 4.0

45.5
38.3
34.1

0.572
0.293
0.296

70
66
73

have expected and which in fact results from a maximum likelihood estimation.
Rather, these are the values that lead to unbiased estimations

 3  


MX x  X 3 :
7:4:4
hS2X i x  X 2
The origin of bias is the appearance in the sums of the sample mean X, which is a
random variable, and not the population mean X, which is a theoretical parameter.
Derivation of the estimator for the third moment is given in an appendix.
If we make the reasonable assumption that the projectile range is normally
distributed, how probable is it that these three sample means could have arisen from
random sampling of the same normal population? Suppose variates X1 and X2 are
distributed normally with means 1 and 2 and with the same population variance 2X .
Random samples of size n1 and n2 are taken from these two populations, and we
denote the sample means and sample variances by X1 , X2 , S21 , S22 . Then the sample
means are also normal variates


2
X 1 N 1 , X
n1



2
X2 N 2 , X
n2

7:4:5

and their difference is a normal variate


r


2 2
1
1
X2  X1 N 2  1 , X X 2  1 X
N0, 1
7:4:6
n1 n2
n1 n2
q
of mean (2  1) and standard deviation X n11 n12 . It then follows that the quantity
U

X2  X1  2  1
q
X n11 n12

7:4:7

is a standard normal variate N(0,1). Correspondingly, the quantity


V2

n1 S21 n2 S22
2 2n1 1 2n2 1 2n1 n2 2
2X
X

is a chi-square variate 2n1 n2 2 of d n1 n2  2 degrees of freedom.

7:4:8

407

7.4 Play ball! home runs and steroids

Table 7.6

Comparison of player mean home run distances

Player pair

dof d n1 n2  2

T value

Pr (T  tobs)

Z value

Pr (Z  zobs)

McGwireSosa
McGwireBonds
SosaBonds

134
141
137

1.890
2.214
0.192

0.030
0.014
0.424

1.914
2.216
0.192

0.028
0.013
0.424

Thus the quantity [see Eq. (1.19.1)]


s
p
u d x2  x1  2  1 n1 n2 n1 n2  2
q

t
v
n1 n 2
n s2 n s2
1 1

7:4:9

2 2

is a realization of a Student t variate. To test the equivalence of two means (1 2)


we must calculate the statistic
s
p
u d
x2  x1
n1 n2 n1 n2  2
tsample
,
7:4:10
q
v
n1 n2
n1 s21 n2 s22
which contains only quantities obtained from sampling i.e. no theoretical population parameters. (Recall the definition of a statistic: a random variable which is a
function of observable random variables and contains no unknown parameters.)
Employing the Students t test on each of the three pairs of home run distributions,
I obtained the results of Table 7.6 for the probabilities Pr (T  tsample) that the
differences in means could have arisen by pure chance:
The last two columns of the table also show the results based on the distribution of

.q
s2X
s2Y
a standard normal variate Z X  Y

n1
n2 applicable to large samples (which,
to a good approximation, is what we have). Columns 4 and 6 show that the normal
distribution leads to slightly lower probabilities than the Student t distribution in
confirmation of the statement made in Chapter 1 based on plots of these two pdfs.
Thus, while it is probable that Sosa and Bonds are equivalent hitters, McGwire is
clearly in a different class. If the data of Table 7.4 are representative, then the
hypothesis that McGwires mean home run distance is statistically equivalent to
those of the other two hitters could be rejected with a probability better than 97%.
My comparative analysis of home run distances was originally completed in
2003 and submitted to the American Journal of Physics. Seven years later, in January
2010 news services reported
Mark McGwire admitted Jan. 11 that he used steroids on and off for nearly a decade, including
during the 1998 season when he broke the then single-season home run record.8
8

ESPN.com news services, McGwire apologizes to La Russa, Selig (January 12, 2010, 2:01 PM ET), http://sports.espn.
go.com/mlb/news/story?id=4816607

408

On target: uncertainties of projectile flight

Table 7.7 Chi-square test of normal distribution of


home run range
Player

2obs



Pr 2d4  2obs

McGwire
Sosa
Bonds

2.77
3.51
5.52

59.8%
47.7%
23.8%

In his tearful apology, McGwire is reported to have said


Its very emotional, its telling family members, friends and coaches . . . Its the first time theyve
ever heard me . . . talk about this. I hid it from everybody.

Well . . . not everybody. Although correlation is not proof of cause and effect,
statistical analysis of sports-related projectile ranges such as distances of home
runs, shot puts, golf drives, long jumps, and the like, may nevertheless help identify
instances where superior athletic performance was achieved through the aid of
performance-enhancing drugs. Of course this will become less helpful if nearly every
participant in some popular projectile sport is on drugs.
There remains a loose end to tie: use of the Student t test presumed that the home
run distances were distributed normally. To test the normality of the random variable
Xk  Xk =SXk , where k 1, 2, 3 refers to McGwire, Sosa, and Bonds respectively,
a chi-square analysis was made in which the three samples of size nk were grouped in
N 7 classes of width equal to SXk . The number of degrees of freedom of the statistic
N
X
Oj  Ej 2 =Ej , in which Oj is the observed frequency and Ej is the theoret2
j1

ically expected frequency of the jth class, is d N  1  r where r is the number of


parameters that depend on the data. In the present case, r 2 for each player because
X and SX appear in the theoretical expression
XjN2 1SX

n
Ej p
2 SX

exX =2SX dx
2

7:4:11

XjN2 SX

and therefore d 4. The results in Table 7.7 provide no reason to reject the
hypothesis that the range of the home runs was distributed normally.
An examination of the skewness of each hitters home run distribution using the
estimator in (7.4.3) for the third moment about the mean and (7.4.2) for the variance
leads to the results in Table 7.8, which are again consistent with the zero skewness of
a normal distribution.

409

7.5 Air resistance

Table 7.8 Skewness test of distribution of


home run range
Player

Skobs

Pr(Sk  jSkobsj)

McGwire
Sosa
Bonds

0.572
0.293
0.296

10.4%
25.9%
25.7%

7.5 Air resistance


The considerations of the previous section were based on kinematics of a projectile in
vacuum. While this assumption is adequate for many purposes, to obtain a realistic
estimate of the maximum range likely to be achievable by a home run, one must take
account of the resistive force or drag of the air. Besides the interest this may hold for
sports enthusiasts, a discussion of projectile motion in air introduces critical elements
of the theory of flight, a subject of enormous importance in its own right and the
focal point of two sections to follow.
The drag on an object moving through a stationary fluid or, equivalently, the
drag on a stationary object immersed in the moving fluid depends on the relative
speed v, as expressed by a dimensionless parameter (one of many in fluid dynamics)
called the Reynolds number
Re

v
,

7:5:1

in which is a characteristic length of the object (e.g. a radius or diameter), and is


the density and the viscosity of the fluid. The viscosity itself is the proportionality
constant in Newtons law of fluid friction
xy

dvx
,
dy

7:5:2

relating the shear stress (i.e. frictional force per unit area) xy at the interface between
contiguous layers of fluid moving horizontally, let us say along the x axis, to the rate
dvx/dy at which the velocity vx decreases in a direction (along the y axis) transverse to
the flow. A fluid that obeys (7.5.2) is a Newtonian fluid. Air and water are good
Newtonian fluids; treacle (molasses) is not. From the definition (7.5.1) and the
dimension of inferred from (7.5.2), it is not difficult to show that Reynolds number
is indeed a dimensionless parameter
h i  
h i  
1 1
M
M
L
TL
L
TL
ML T
L3
L3
Re
h i.  h i. 
1
,
F
V
ML
1
ML1 T 1
2
2 2
L

T L

where [M], [L], [T] commonly signify the fundamental dimensions of mass, length, and time.

410

On target: uncertainties of projectile flight

For fluids at very low Reynolds number, the drag force is linearly proportional to
the relative speed, a situation referred to as creeping flow or Stokes flow. Viscosity
dominates inertia in this regime, and the drag force is large. Small particles in
air (such as nascent rain droplets or the tiny oil droplets in Millikans renowned
experiment to determine the charge of an electron) and aquatic micro-organisms
ordinarily experience creeping flow. Baseballs, motor cars, and aircraft experience
Newtonian flow.
For fluids at high Reynolds number but nevertheless at subsonic speeds, the drag
on an object is proportional to the square of the relative speed and usually written in
the form
1
Fd v2 Cd A
2

7:5:3

in which A is the cross-sectional area (or planform in aerodynamic terminology)


presented by the object to the flow, Cd is a dimensionless drag coefficient, and 12 v2 is
the so-called dynamic pressure. Dynamic pressure is seen immediately to have the
dimension of (kinetic) energy per unit of volume, which is dimensionally identical to
force per unit of area, or pressure. From Bernoullis principle relating pressure p and
speed v in fluid flow along a horizontal streamline
p

1 2
v constant,
2

7:5:4

the dynamic pressure is interpretable as the gauge pressure (i.e. pressure above the
ambient static value) exerted by a moving fluid at a point on a stationary surface
where the fluid is brought to rest.
In general, the drag coefficient is not constant, but varies with Reynolds number
in a manner dependent on the shape of the object. For a sphere, the drag coefficient
has a very interesting behavior, remaining approximately constant at Cd ~ 0.47 over
the wide range 105  Re  103 and then precipitously dropping to Cd ~ 0.1 over a
narrow range 23 105.9 The sudden decrease in drag is due to a kind of phase
transition in the flow pattern about the sphere.
In the ideal (and totally unrealistic) case of an inviscid (frictionless) fluid passing in
steady flow around a smooth sphere, the particles of fluid follow precise trajectories
or streamlines in well-defined layers. This steady laminar flow gives rise to zero drag;
a sphere placed at rest in such a flow would remain at rest. Prior to a more complete
understanding of fluid dynamics in the early years of the twentieth century, which
incorporated the effects of viscosity in a thin boundary layer over the surface of an
object, this strikingly false prediction was known as dAlemberts paradox. The
paradox results from the front-to-back symmetry of the streamlines. By Bernoullis
principle (7.5.4), the faster a fluid moves (in constant gravitational potential) over an
9

P. Wegener, What Makes Airplanes Fly? (Springer, New York, 1991) 9094.

411

7.5 Air resistance

object, the lower is the pressure it exerts on the object. The flow speed is greatest and
the pressure is least over the circumference of the midsection (defining the boundary
between the front and rear hemispheres) normal to the flow. In ideal steady inviscid
flow, the speed of a fluid particle at some location in front of the sphere is exactly the
same speed at its mirror-reflected location behind the sphere. Collectively, therefore,
the pressure distribution in the hemisphere facing the flow is the same as in the rear
hemisphere, and no net momentum is transferred to the sphere by the fluid. Hence,
the sphere remains at rest.
More realistically, however, at Reynolds numbers below the transition, air flow
around a sphere remains more or less laminar until it reaches the boundary of the
midsection, after which it separates from the sphere and degrades into turbulent eddies
behind the sphere. Now the front-to-back pressure distribution is asymmetric, and the
pressure over the rear hemisphere remains lower than that over the forward hemisphere.
Looked at from a reference frame in which the air is at rest and the sphere is moving,
the sphere experiences a backward push or drag by the air. At Reynolds numbers above
the transition, however, the layer of air closely enveloping the sphere (i.e. the so-called
boundary layer) becomes turbulent and air flow remains attached to the sphere over a
greater portion of the rear surface. The pressure behind the sphere rises to a greater
extent than was the case when flow separated at the midsection, and the pressure
differential between the two hemispheres is consequently reduced, leading to lower drag.
Generalizing (7.5.3) to two-dimensional motion along direction v within a plane
with the x axis horizontal and the y axis vertical as shown in Figure 7.4 we can write
Newtons second law of motion
vy

aL

Altitude

aD

vx

Horizontal Displacement
Fig. 7.4 Two-dimensional projectile motion in the presence of air drag. Drag deceleration aD
opposes the velocity v; lift acceleration aL is normal to the velocity; the acceleration of gravity g
acts vertically downward. Solution of the two-dimensional equations of motion (without lift)
leads to a trajectory that has a shorter ascent than descent time. As a consequence, the
projectile covers a greater horizontal distance in the ascent than in the descent.

412

On target: uncertainties of projectile flight

8


>
Cd A v2x v2y vx
>
>
dv
x
>
>

0
<
dt
2m


>
>
2
2
>
C
A
v

v
vy
>
d
x
y
>
: dvy
g
dt
2m
1
2

dv 1
Cd Av2v mg
dt 2

1
2

7:5:5

for a non-rotating sphere of mass m in freefall through air of density with gravitational acceleration g. Problems in fluid dynamics (or specifically aerodynamics) are
almost always facilitated by transforming dynamical laws into relations of dimensionless quantities. From the material parameters that occur in (7.5.5) the following
characteristic velocity, time, and displacement parameters can be constructed. (Later
it will be useful to construct another characteristic velocity v associated with lift.)


2mg
:
7:5:6
Velocity
vd
Cd A


2m
:
7:5:7
Time
td
gCd A
1
2

1
2

Displacement

d vd t d

2m
:
Cd A

7:5:8

By rescaling velocity and time and, for later use, displacement in the following way
Vx 

vx
vd

Vy 

vy
vd

t
td

x
d

y
d

7:5:9

we can write (7.5.5) in a dimensionless form independent of dynamic and material


parameters

dV x 2
V x V 2y V x 0
dT

dV y 2
V x V 2y V y 1:
dT
1
2

1
2

7:5:10

It is worth noting that such rescaling, when it can be done, is of wide utility in
physics, not only because it simplifies the appearance of equations to be solved, but
because it permits experimental results from different physical systems to be fit to a
single mathematical expression. This, in fact, is the basis for simulating forces on
large-scale aerodynamic or hydrodynamic structures by small-scale models testable
in wind tunnels or water basins. One of the earliest examples of such a procedure in
the study of fluids is the Law of Corresponding States (LCS) pertaining to application of van der Waals equation to the thermodynamics of real gases. The LCS, which
is actually more general than van der Waals equation, holds that all gases, when
compared at the same reduced temperature and reduced pressure i.e. temperature

413

7.5 Air resistance

and pressure scaled by suitable parameters have the same compressibility and
deviate from ideal gas behavior to about the same extent.
Solution of the two-dimensional (2D) projectile problem in vacuum is quite simple
because the equations of horizontal and vertical motion are uncoupled. The resulting two
one-dimensional (1D) equations of motion are immediately integrable to yield expressions
8
< vx t vx0 constant vy t vy0  gt
7:5:11
no drag
1
: sx t vx0 t
sy t vy0 t  gt2
2
for horizontal and vertical velocity and displacement. It is the coupling of horizontal
and vertical components because drag depends on relative speed of the air flow
that makes (7.5.10) difficult to solve analytically. Nevertheless, an analytical solution
to (7.5.10) is achievable by first transforming the set of equations to polar form
9
8
q
< V V 2 V2
V x V cos =
x
y
7:5:12
,
: tan V =V
V y V sin ;
y

and then algebraically manipulating the resulting pair of equations to exploit the
trigonometric identity cos2 sin2 1 and thereby to obtain the transformed pair
dV
V 2 sin
dT
d
V
cos
dT

7:5:13

where is the angle relative to the (horizontal) ground. Next, by using the second
relation in (7.5.13) to express the differential dT in terms of d, one can combine the
two equations into a single differential equation as a function of angle
d
V cos  V 3 0:
d

7:5:14

And last, the variable change U 1/(V cos ) permits one to separate variables and
integrate separately over U and . The final result is the somewhat complicated but
useful closed-form expression
1
,
V q
1 sin 
C0 ln cos cos 2  sin
where C0 incorporates the initial conditions of speed V0 and angle 0


1 V 20 sin 0
1 sin 0
:
C0

ln
cos 0
V 20 cos 2 0

7:5:15

7:5:16

The corresponding components of velocity along the x and y axes can then be
calculated from (7.5.12).

414

On target: uncertainties of projectile flight

Alternatively, the original Cartesian form of Eqs. (7.5.10) is particularly convenient for an iterative numerical solution by computer and generates solutions directly
as a function of time, rather than angle. I have used a LevenbergMarquardt
algorithm (developed by the Argonne National Laboratory), which is incorporated
in the Mathcad symbolic computational software. Both approaches lead to the same
numerical results, which we shall examine shortly. Before doing so, it is instructive to
consider the stationary or asymptotic behavior of the drag equations in Cartesian
form. Setting the time derivatives in (7.5.10) to zero, one obtains solutions for
velocity
V x T ! 0
V y T ! 1

)
)

vx 0
vy vd

7:5:17

in which the horizontal component vanishes and the magnitude of the vertical
component, referred to as the terminal speed, equals the scale factor vd in (7.5.6).
Under stationary conditions, therefore, the projectile is descending vertically downward at a constant rate vd. For a baseball, however, the ground may intervene well
before it reaches terminal velocity.
Corresponding horizontal and vertical displacements cannot be reduced to closedform expressions but must be obtained numerically from the integrals
T

Sx T V x T dT
0

Sy T V y T 0 dT 0 :

7:5:18

Consequently, the range of the projectile R i.e. the value of Sx for which Sy 0
must be worked out numerically or graphically for each case of interest.
An alternative approximate measure worth examining at this point is to decouple
the equations (7.5.10) by neglecting Vy in the horizontal motion and Vx in the vertical
motion. Then, as in the case of no drag, the resulting equations can be integrated
relatively easily, leading to

8
tan T H  T T  T H
V x0
>
>
V y T
V x T
>
>
>
V
T

1
tanhT H  T T  T H
x0
<
(  cos T H  T
drag 1D
>
ln
T  T H
>
>
cos T H
Sy t
Sx t lnV x0 T 1
>
>
:
ln cos T H coshT H  T T  T H
7:5:19
in which
T H tan 1 V y0 tan 1 V 0 sin 0

7:5:20

is the time for the projectile to reach maximum height H. Prior to TH, the ball is
traveling upward, Vy > 0, and both air drag and gravity act in the same direction

7.5 Air resistance

415

(downward) as described by the equation of motion dV y =dT V 2y 1. After TH,


however, the ball is traveling downward, Vy < 0, and air drag and gravity act in
opposite directions (air drag upward) as described by the equation of motion
dV y =dT V 2y 1. The complete one-dimensional vertical motion is governed by
the equation
dV y
jV y jV y 1
dT

7:5:21

in which the difference in sign of Vy in the two time periods must be borne in mind in
setting the appropriate limits of integration. One does not need to take account
explicitly of this transition point in the exact analytical solution to the coupled
equations (7.5.10) because the horizontal component Vx never vanishes.
To estimate what might be close to the ultimately achievable home run distance in
air, let us assume an atmosphere at about room temperature 20 C and 1 atm pressure
for which the air density is approximately air 1.204 kg/m3 and air viscosity is air
1.85 105 kg/ms.10 The fastest pitched baseballs have been clocked at a little over
100 miles per hour (mph), so it is perhaps not unreasonable to assume that such a ball
leaving the bat of a powerful hitter may be launched with an initial speed of about
110 mph or 49.2 m/s. The corresponding Reynolds number (7.5.1) with ball diameter
d 0.075 m as the characteristic length is Re 2.40 105, a value occurring very
close to the transition region for the drag coefficient of a sphere. I will adopt,
therefore, the lower value Cd ~ 0.2 for the region of higher Reynolds numbers. Given
a baseball mass of 0.145 kg, the scale factors in (7.5.6) then become vd 51.71 m/s,
td 5.27 s, and d 272.60 m.
Figure 7.5 shows the variation with (scaled) time T of the (scaled) velocity
components Vx, Vy, and speed V for a ball hit at an initial angle of 45 to the ground,
which in vacuum would yield the greatest range for a given initial speed. The solid
and dashed black lines trace the variation in speed and velocity components derived
from the exact analytical two-dimensional (2D) solution. The traces bear out the
previously deduced asymptotic limits Vx ! 0, Vy ! 1, shown in the figure as
horizontal light dashed lines. Vx and Vy decrease monotonically as expected and, in
practical terms, reach 98% of their asymptotic limits after about 3.0 (for Vy) and 4.0
(for Vx) time units td. The dashed gray lines, lying just above the dashed black lines,
trace the exact 1D solutions to the decoupled 2D equations. It is perhaps surprising
how closely the 1D solutions replicate the physically more realistic coupled 2D
solutions for the specified initial conditions. The largest apparent discrepancy occurs
in the calculation of Vx for which 2D coupling leads to a more rapid decrease in time.
I have simulated projectile motions for a wide range of initial conditions and obtain

10

The MKS unit of viscosity (kg/ms) comes directly from Eq. (7.5.2) and can be written equivalently as Pascal-second
(Pa  s). The CGS counterpart is the poise: 1 Ps 0.1 Pa  s.

416

On target: uncertainties of projectile flight


1.5

Velocity (scaled)

0.5

Vx
0

0.5

1.5

2.5

3.5

0.5

Vy
1

1.5

Time (scaled)
Fig. 7.5 Speed (solid) and velocity components (dashed) of a baseball (mass 0.145 kg, radius
0.0375 m) plotted against time as obtained from solution of coupled 2D equations (dark black)
and uncoupled 1D equations (gray). Light black dashed traces show solutions in absence of air
drag. The drag coefficient is Cd 0.2 (for Reynolds number Re 2.4 105). Variables
are scaled by parameters vd 51.71 m/s, td 5.27 s, d 272.60 m. Initial conditions are
v0 49.17 m/s, 0 45 .

comparable outcomes. The light dashed lines trace the velocity components in the
absence of drag in which Vx is constant and Vy decreases linearly in time.
The trajectory of the ball, i.e. plot of vertical against horizontal displacement, is
shown in the upper trace of Figure 7.6 for the three model solutions: no drag,
uncoupled 1D and coupled 2D equations with drag. In vacuum, the path of the ball
is symmetric about the midpoint (point of greatest altitude), but in the resistive
medium of air the ball covers a greater horizontal distance in its ascent than its
descent. This gives the visual impression that the ball descends more quickly than it
rises, but this is a spurious inference as made apparent in the lower trace of Figure 7.6
in which the vertical displacements are plotted against time. Solid lines depict the
coupled 2D solution and dotted lines represent the uncoupled 1D solutions. Again
the latter shadow closely the former. The rise-time of the 2D solution to reach
maximum altitude H is about Tr 0.55 time units, whereas the fall-time to descend
from H to ground is Tf 1.16  0.55 0.61 units. Thus, the ball takes more time on
the way down than on the way up because the speed of the ball is greatest at launch,
as is likewise the drag which is proportional to the square of speed.

417

7.5 Air resistance


0.25

Altitude (scaled)

0.2

No
Drag
1D

0.15

0.1

2D
0.05

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Horizontal Displacement (scaled)


0.25

Altitude (scaled)

0.2

0.15

0.1

0.05

0.2

0.4

0.6

0.8

1.2

1.4

Time (scaled)
Fig. 7.6 Upper panel: plot of trajectory: altitude against horizontal displacement. Lower
panel: plot of altitude against time. Parameters are the same as for Figure 7.5. Traces are
derived from: solution of coupled 2D equations with drag (solid), solution of uncoupled 1D
equations with drag (dotted), solution of uncoupled 1D equations without drag (dashed).

The reverse asymmetry in the trajectory i.e. shorter horizontal displacement on


the descent comes about because air drag decelerates Vx, which in the absence of
drag is constant, with the consequence that a plot of Sx vs T has negative curvature,
i.e. is concave down like the plot of Sy vs T. Thus, the projectile covers progressively
less horizontal distance in each unit of time. As shown in the upper panel of
Figure 7.6, the ball has covered a distance of about Sx 0.30 units at the point of

418

On target: uncertainties of projectile flight

maximum height, and the distance Sx 0.55  0.30 0.25 units from the point of
maximum height to the point of impact with the ground.
Although the total time of flight T 1.16 units is short compared to the time
required to approach terminal velocity, air drag has nevertheless made a huge impact
on the trajectory of the ball. As shown in the upper panel, drag has reduced the
theoretically achievable range by about 39% from R 0.90 in vacuum to R 0.55 in
air. In MKS units, obtained by multiplying the preceding numbers by the associated
scale factor d 272.6 m, the range is respectively 246.7 m in vacuum and 150.7 m in
air. Drag has also reduced the theoretically achievable height of the trajectory from
H 0.23 (62.7 m) in vacuum to H 0.17 (46.3 m) in air.
The 1D drag solutions provide a way to estimate the range of the ball with
reasonable accuracy. From (7.5.19) one can find the time T0 which satisfies the
relation Sy(T0) 0, i.e. the time at which the projectile has returned to ground


1  sin T H
T 0 T H  ln
,
7:5:22
cos T H
whereupon it follows that the range is then
R1D Sx T 0 lnV x0 T 0 1:

7:5:23

For the illustrative initial conditions (v0 110 mph, 0 /4), Eqs. (7.5.22) and
(7.5.23) yield T0 1.223, R1D 0.599 in comparison with results of the 2D calculation T0 1.160, R1D 0.553. The 1D solution provides an upper limit of range, as is
apparent from Figure 7.6.
Besides leading to an asymmetric trajectory and reducing the altitude and range of
the ball, air drag also decreases the launch angle at which the longest range results.
For the specified set of air parameters and launch speed (110 mph), computer
simulation of the trajectories resulting from different launch angles led to a maximum range at 0 ~ 41 , rather than 45 . The results are summarized in Table 7.9,
Table 7.9

Scaled ranges of home runs for v0 110 mph (V0 0.951)

0 (deg)

Rvac

Rdrag

Rlift (spin 15 Hz)

15
20
25
30
35
40
41
42
45
50

0.452
0.582
0.693
0.783
0.850
0.891
0.896
0.900
0.904
0.891

0.351
0.425
0.481
0.521
0.545
0.556
0.557
0.556
0.553
0.536

0.664
0.706
0.719
0.709
0.679
0.633
0.622
0.611
0.573
0.500

7.6 Theory of flight

419

which includes, besides the conditions of vacuum and air drag, a third column of
numbers to be discussed after we take up the case of the flying ball that is a fly
ball that literally flies.11 The longest range under each set of flight conditions (in
vacuum, with air drag, or with air drag and lift) is set in bold font.

7.6 Theory of flight


A baseball struck by a bat in such a way that it spins about an axis while in
translational motion creates a condition necessary for heavier-than-air flight, such
as produced by the wings of an aircraft. If such a condition occurs, then a force
referred to as lift, in addition to gravity and drag, can act upon the ball to keep it
aloft longer and permit it to travel farther. Lift, as shown by the dashed arrow in
Figure 7.4, acts perpendicular to the direction of motion. Drag acts opposite the
direction of motion. To the present, we have been considering the effects of frictional
drag i.e. drag resulting from the resistive force of the medium. However, as is
apparent from Figure 7.4, the lift has a vertical component which (as the name
suggests) causes the object to rise, as well as a horizontal component which contributes to retardation of the translational motion. This latter component is referred to as
induced drag.
Isaac Newton was perhaps the first person to consider the frictional drag of a fluid
on a sphere (a globe in Newtons words) in Book 2 of his great work, Principia.12
I have read an English translation of Principia and am not aware that Newton was
ever concerned with the subject of flight, only with the properties of fluids. Nevertheless, Newtons statement of the principles of fluid drag has been employed in more
modern times to give a heuristic explanation of the mechanism of flight that, while
not entirely wrong, is not correct either. In brief outline, Newton recognized that the
drag on a blunt object (like a sphere or cylinder) moving at uniform velocity v
through a fluid is proportional to the density of the fluid, the square of the speed
v2 v  v, and the square of the diameter (i.e. cross-sectional area A) of the object.
With these principles and Newtons Second Law of motion relating force on an
object to the rate at which the object acquires or loses linear momentum, one can
calculate the vertical force or lift experienced by a long flat object at rest like a
wooden board of weight W, long dimension (span) and short dimension (chord) b,
inclined at an acute angle (incidence or angle of attack) to the air stream. Such a
board is a crude simulacrum of an airfoil such as the wing of an aircraft. Fluid
particles striking the board rebound (in some model-dependent way) downward,
thereby impressing an upward force to the board, which, in the simple case of fluid
leaving parallel to the surface, takes the form
11
12

For the reader unfamiliar with baseball terminology, a fly ball is a ball hit high into the air.
I. Newton, Principia Vol. 1: The Motion of Bodies (University of California Press, Berkeley, 1966) 334336. [Mottes
translation of 1729 revised by Cajori.]

420

On target: uncertainties of projectile flight

Table 7.10
Jumbo Jet

Characteristics of Boeing 747100

Fuselage length (m)


Fuselage diameter (m)
Wingspan (m)
Wing area (m2)
Empty mass (kg)
Loaded mass (kg)
Aspect ratio (AR)
Frictional drag Cd
Cruising speed (m/s)

74.2
6.5
59.6
511
162386
333390
6.9
0.031
252.0

FN v2 b sin2

7:6:1

where the subscript N refers to Newton. Thus, if FN equals W, the board flies,
provided there is a means to keep it moving forward.
To my knowledge, Newton never made such a calculation. However, I have read
or listened to explanations of powered flight in terms of such a model. The basic idea
that a heavier-than-air object can be made to rise if (by Newtons Third Law) there is
a counter-motion of the medium downward is correct, but the details are not. An
airplane does not fly because the lower surface of its wings bats the air downward.
Such a picture leads quantitatively to a lift proportional to sin2, which, for small
angles of inclination required to avert stall, would be so small a quantity as to require
wings of impractically great length for powered flight to be achievable. Or, equivalently, for wings of practical length, the Newtonian model would require an
unacceptably high cruising speed v (in the frame of reference where the undisturbed
air is at rest). Rather, an airplane wing in virtue of its shape and orientation relative
to the air stream serves more like a huge pump than a bat, capturing air flow from
above the upper surface and directing it downward. Worked out correctly, the theory
of air flow over a long airfoil leads to a lift proportional to sin . . . not sin2.
An appreciation of the remarkable capacity of an airfoil to deflect air and generate
lift can come only from consideration of some pertinent numbers. Consider what is
required for steady level flight of a Boeing 747 [B747] jet airliner cruising at v 907 km/h
with an air stream incidence of 5 . The relevant characteristics (in MKS units) of
the aircraft are summarized in Table 7.10 for the earliest model (Boeing 747100).13
If the lift generated by the wings is to sustain the maximum weight upon take-off
W mg 333 400 kg9:81 m=s2 3 270 654 N,

13

7:6:2

Boeing 747 Family-Technical Specifications http://www.boeing.com/commercial/747family/pf/pf_classics.html

421

7.6 Theory of flight

then by Eq. (7.6.1) and Newtons Second Law14


F dmv=dt vdm=dt

7:6:3

the rate of mass transport of air vertically downward at speed v sin must be
dm
W
3 270 654 N
149:0 tonnes=s:

dt
v sin 252 m=s sin 5

7:6:4

(Note: 1 metric ton or 1 tonne 1000 kg ~ 1.1023 tons.) This is an enormous


quantity of air to be deflected downward (downwash) each second and most of
it comes from above the wings. The two wings of a B747 together make a tapered
V-shaped silhouette i.e. each wing is widest at the root (junction with fuselage) and
narrowest at the tip. Nevertheless, let us model the wings simply as a rectangle of
length equal to the wingspan and mean chord b (distance from the leading edge to
trailing edge) equal to the wing area divided by the wingspan. Then the mass of air
(dm/dt)t thrust downward by the wings in time t 1 s would occupy the volume of
a rectangular solid extending above the wings for a height
h

dm=dtt
149:0 103 kg=s1 s

242:1 m:
air b
1:204 kg=m3 8:6 m59:6 m

7:6:5

To accentuate my point: the long thin wings of the B747 are deflecting air from
within at least one hundred meters above its surface to create the reaction that lifts
the aircraft. The term deflecting actually does not conjure up the appropriate
image. A better term perhaps would be sucking . . . like a pump. It is the angle of
inclination of the wing, more than the wing shape, that is responsible for this sucking
action. Parcels of air passing over the wing would leave behind a vacuum were it not
the case that more air is drawn down and over the wing. This continues so long as
there is relative motion of the wing in air.
If we were to use the Newtonian model (7.6.1) of lift which, as I noted previously, was not used in this way by Newton himself arising from deflection of air
from the underside of an airfoil, the cruising speed vN required to sustain weight
W would be

vN




W
3:3 106 N

836:6 m=s,
air b sin 2
1:2 kg=m3 8:6 m59:6 m sin 2 5
1
2

1
2

which greatly exceeds the speed of sound in air: vs 343 m/s at 20 C. Interestingly,
supersonic aerodynamics leads to a pattern of air flow for air speeds much in excess of the
speed of sound that is similar to what Newton imagined in his study of air resistance.15

14

15

This is an example where the time rate of change of momentum mv arises from the variation in mass (dm/dt) at constant
velocity rather than acceleration (dv/dt) of a constant mass. There are subtleties to the use of such a relation, which
I have overlooked now to avoid distraction.
T. von Karman, Aerodynamics (McGraw-Hill, New York, 1954) 122.

422

On target: uncertainties of projectile flight

Besides the erroneous Newtonian explanation, I have also seen or heard misleading explanations of flight based on Bernoullis principle as applied to a stationary
airfoil in a uniformly moving fluid. Again, the basic idea is correct, but not the
details. The argument goes as follows. Air rushes over the cambered upper surface
faster than over the flatter lower surface of the wing in order that the two flows join at
the trailing edge. By (7.5.4) the faster air stream exerts a lower pressure on the upper
wing surface and therefore the plane is pushed upward by the greater pressure on the
lower surface.
This explanation fails on several accounts. First, there is no aerodynamic principle
requiring parcels of air to time their flow so as to meet at the trailing edge if they
simultaneously arrived at the leading edge. In fact, it is an essential condition of flight
that the part of a divided air stream passing over the top surface arrive at the trailing
edge before the part passing under the lower surface. This creates a nonvanishing
circulation about the wing to be discussed shortly. Second, the primary impulse for
the wing to rise comes from a low gauge pressure topside (near the leading edge) than
from a high gauge pressure under the wing. As I pointed out previously, it is more
accurate to think of the wing as being sucked up than pushed up. Finally, this
Bernoulli explanation has its cause and effect backward. It is the low pressure
created by deflection of the air stream downward that leads to a higher air speed over
the wing than under the wing . . . not the reverse statement.
Although Newtons laws and Bernoullis principle (which is derived from
Newtons Second Law) are essential ingredients to understanding flight, the correct
way in which they come together to give a quantitative account of lift is through two
seemingly remote and abstract concepts: circulation and vorticity. Consider, as
before, a reference frame with the airfoil at rest and the air stream moving to the
right with a uniform upstream (i.e. initially undisturbed) velocity v0. In a nutshell, lift
arises from a bound vortex (i.e. tornado-like whirlwind) of air induced at the initial
moments of relative motion between the wing and the air. During these moments, air
moving over the top surface of the wing does not flow smoothly over the entire
surface, but separates turbulently at some point above, but close to, the trailing edge
due to friction in a thin boundary layer encompassing the wing. Airflow along
the undersurface of the wing passes around the trailing edge and up over the top
surface (upwash) to the separation region. This counter-flow generates a starting
vortex, i.e. an anticlockwise circulation of air for an initial air stream directed to the
right, which the wing sheds and the air stream carries away. As a result of angular
momentum conservation in the fluid, a clockwise-circulating bound vortex is induced
around the wing.
The (clockwise) circulatory motion of the bound vortex, superposed on the
original horizontal air stream moving to the right at velocity v0, generates a faster
air stream over the top surface and a slower air stream over the bottom surface,
which, by Bernoullis principle, produces the differential pressure leading to a lifting
force sustained as long as relative motion of the wing in air continues. Theoretically,

7.6 Theory of flight

423

for an airfoil of infinite span in an initially two-dimensional irrotational16 flow


which serves as a starting point for many aerodynamic calculations the lift per unit
of span fl is given by a very simple relation (i.e. simple to write, not necessarily simple
to derive or apply) known as the KuttaJoukowski theorem17
f l v0 ,

7:6:6

where is the density of the medium (air), v0 is the uniform wind speed of the
undisturbed air far in front of the airfoil, and

 v  ds  ndS r v,
7:6:7
C

termed the circulation, is a line integral of the net wind velocity v over an arbitrarily
shaped planar contour C about the airfoil. For an airfoil of infinite span, the location
of the plane of the contour does not matter. For a finite airfoil, however, circulation
can vary with location and the total lift will require integrating fl over the span,
which we will do shortly when we apply these results to a baseball in the following
section.
The equivalent second expression in (7.6.7), resulting from use of Stokes theorem
of vector calculus, is an integral over an open surface bound by C (with appropriate
orientation of the outward normal to correspond to positive traversal of the contour)
of the vorticity , defined as the curl of the fluid velocity. The expressions in (7.6.7)
are suggestive of Amperes law in electromagnetism relating the current I (analogous
to ), magnetic induction B (analogous to v), and vector potential A (analogous to ).
Indeed, there is a BiotSavart law in aerodynamics by means of which the velocity
field associated with curved vortex lines can be calculated (although we shall not need
to do so in this book).
A general derivation of the KuttaJoukowski theorem can be found in advanced
aerodynamics references, but the basic ingredients can be understood from examining the lift on a long thin board of width (chord) b, such as illustrated in Figure 7.7.
Let boldfaced letters with carets x , y , z signify unit vectors along the corresponding
Cartesian axes. The span of the board is normal to the page (along the z axis).
Flowing to the right over and under the board is a steady horizontal wind of velocity
v0 v0 x , and circulating around the board in a clockwise sense is a bound vortex
moving with velocity ux over the top and velocity ux over the bottom. Because
the board is thin (in principle, infinitesimally thin), we can neglect the velocity of the
vortex at the leading and trailing edges. The total velocity at any point is the vector
sum v v0 u. From Eq. (7.6.7), the circulation is then
16

17

Irrotational flow does not mean that the fluid cannot rotate. Rather, it signifies that an object immersed in the fluid
does not change its orientation relative to fixed axes as it is carried by the fluid. Illustrative of such motion would be the
passenger cars on a Ferris wheel at an amusement park.
L.M. Milne-Thomson, Theoretical Aerodynamics (Macmillan, London, 1958) 9192.

424

On target: uncertainties of projectile flight


y

v0

-u
v0

Fig. 7.7 Schematic diagram of air flow over a stationary airfoil with bound vortex. The farfield air stream and circulating air are largely parallel above the airfoil and anti-parallel below.
Thus, the net air speed (v0 u) is greater above the airfoil than below (v0  u), giving rise to lift
as described by the KuttaJoukowski theorem.

v  ds v0 uds v0  udx 2ub:


0

7:6:8

The vertical component of the force on the board due to air pressure p takes the form
of a closed surface integral


Fl  p n dS  y plower  pupper b
7:6:9
in which p is the pressure on a patch of differential area dS dxdz with outward
normal unit vector n y for the upper and lower surfaces, respectively. (As before,
we ignore the surfaces at the leading and trailing edges, but the force on them does
not contribute anyway because the scalar product of the outward unit normals
n x with y vanishes.) The minus sign in (7.6.9) shows that the direction of the
pressure force on each patch is along the inward normal. By Bernoullis formula
(7.5.4), we can replace pressure in (7.6.9) by



1 2
1
1
p p0 v0  v2 constant  v2
2
2
2
upstream static and
dynamic pressure

to obtain a force per unit of span

7:6:10

7.7 Fly(ing) ball spin and lift

fl

i
Fl
1 h
plower  pupper b b 2upper  2uower
2

i
1 h
b v0 u2  v0  u2
2
2buv0 v0 ,

425

7:6:11

which is recognized as the KuttaJoukowski theorem (7.6.6). The constant term in


(7.6.10) does not contribute to the difference of two pressures. More generally, it
drops out of the closed surface integral irrespective of the shape of the object.
7.7 Fly(ing) ball spin and lift
A sphere spinning about its stationary center of gravity in a uniformly moving fluid
replicates the conditions albeit with different geometry of Figure 7.7, and by the
KuttaJoukowski theorem one can expect the ball to be subject to an aerodynamic
force perpendicular to the spin axis (i.e. direction of the angular velocity vector). This
phenomenon, called the Magnus effect for the physicist18 who systematically investigated the effect experimentally in the 1850s, was apparently also known to Newton in
the 1670s who thought (erroneously but then this period long predated the wave
theory of light) that it might provide an explanation of light refraction19. Remembering that he had often seen the deflection of a tennis ball struck obliquely by the
racket, Newton speculated that
. . . if the Rays of light should possibly be globular bodies, and by their oblique passage out of
one medium into another acquire a circulating motion, they ought to feel the greater resistance
from the ambient Aether, on that side, where the motions conspire, and thence be continually
bowed to the other.

Contemporary explanations of the Magnus force on a spinning object in air often


resort to a comparable explanation invoking Bernoullis principle. The argument is
that on the side of the object whose surface is moving in the direction of the main air
stream the fluid particles adjacent to the surface move faster, and consequently exert
lower pressure, than the fluid particles near the opposite side of the object whose
surface is moving against the main air stream. The argument is not wrong, but
insufficient because the attached boundary layer of fluid is too thin to generate the
observed pressure difference. A more complete explanation is that the boundary
layer on the side opposite the main stream detaches sooner than the boundary layer
on the side moving with the main stream and is shed (e.g. as eddies) thereby deflecting
the object laterally in the direction of the side moving with the main stream. A highly
18

19

G. Magnus, Uber die Abweichung der Geschosse, Abhandlungen der Koninglichen Akademie der Wissenschaften zu
Berlin (1852) 123. [Concerning the deviation of projectiles.]
Letter of Isaac Newton reproduced by I.B. Cohen, Isaac Newtons Papers and Letters on Natural Philosophy and Related
Documents (Harvard University Press, Cambridge MA, 1958). Republished as Isaac Newton, A new theory about
light and colors, American Journal of Physics 61 (1993) 108112.

426

On target: uncertainties of projectile flight


FL
Air ow
(faster)
a

Backspin

Air ow
(slower)

Fig. 7.8 Magnus effect on a baseball backspinning at angular frequency about an axis
perpendicular to its translational velocity v, thereby producing a vertical lift force FL. The
effect is due to the greater airspeed (relative to the surface of the ball) and consequently lower
pressure at points in the upper hemisphere compared to corresponding points in the lower
hemisphere.

readable (if you read German) and comprehensive discussion of the Magnus effect
was published in 1925 by the aerodynamicist Ludwig Prandtl,20 creator of boundarylayer theory (and therefore in effect the father of the science of aerodynamics) to
explain the Windkraftschiff (windpower ship) invented by a German engineer Anton
Flettner, which employed two vertical rotating cylinders in place of sails.
To calculate the circulation rigorously, and therefore the lift, of a sphere spinning
in a viscous fluid, it is necessary to determine the velocity field of the fluid. In general,
this is very difficult to do analytically since it entails solving the nonlinear Navier
Stokes equation. We can avoid the necessity of doing so, however, by making a few
assumptions, adequate for the present purposes, that give insight into the quantities
that matter most.
Figure 7.8 shows a schematic diagram of a sphere of radius a backspinning at
angular frequency i.e. with angular velocity such that the cross product v,
where v is the velocity of the center of mass through the air (or v is the velocity of
the air stream in the rest frame of the sphere), is in the direction of the aerodynamic
lift FL. A point
on the
sphere at a horizontal distance x from the origin and radial
p
distance r a2  x2 from the rotation axis moves with a linear speed v(r) r. The
no-slip condition requires that air molecules adhere to the surface of the sphere and
move with it at the same angular frequency. If we consider only 2D flow within

20

L. Prandtl, Magnuseffekt und Windkraftschiff, Die Naturwissenschaften 6 (1925) 93104. An English translation is
available online as a NASA Technical Report: NACA Technical Memorandum 367, http://ntrs.nasa.gov/search.jsp

7.7 Fly(ing) ball spin and lift

427

planar sections perpendicular to the rotation axis and assume that molecules at the
surface entrain those in successive layers within a thin boundary layer likewise to
follow the motion of the spheres surface (i.e. neglect effects of viscosity in the bulk
fluid outside the boundary layer), then the circulation about a contour of radius r
would be
r 2r 2 :

7:7:1

The circulation (7.7.1) contributes a vertically upward force dFl (x) on a section of
width dx about x of
dFl x 2rx
2 v dx:

7:7:2

The total lift on the sphere, obtained by integrating (7.7.2) over the range −a ≤ x ≤ a, is then easily shown to be

F_l = \frac{8\pi}{3}\rho\omega a^3 v = \frac{1}{2}\rho v^2 C_l\,\pi a^2 \qquad\Longrightarrow\qquad C_l = \frac{16\,\omega a}{3v},    (7.7.3)

where the second expression defines the coefficient of lift Cl by a relation analogous to the one defining the coefficient of drag Cd. Note that Cl is a dimensionless constant expressed, as might be expected, as a number of order unity times the ratio of rotational and translational velocities. Based on dimensional considerations alone, fluid flow about a smooth rotating sphere in translational motion should be characterized by two parameters: the Reynolds number

Re = \frac{\rho v\,(2a)}{\eta}    (7.7.4)

and what in fluid dynamical terminology is known as the roll parameter

J \equiv \frac{\omega a}{v}.    (7.7.5)

The essential feature of (7.7.3) is a lift proportional to the first power of the
relative speed of the ball and medium. Although early investigators of the transverse
force on a spinning sphere reported a force proportional to the square of the speed,
more recent systematic investigations of golf balls21 and baseballs22 are more or less
consistent with a linear force for high Reynolds number and low roll parameter.
Nevertheless, the problem is complex and experiments are not fully in agreement with
one another or with theory. In arriving at Eq. (7.7.3), I have made assumptions that
are not rigorously self-consistent. The velocity field, obtained by superposing a
uniform free-stream flow and the field of an ideal vortex centered on the sphere,
neglects viscosity even though viscosity is what engendered the fluid circulation.

21 P. W. Bearman and J. K. Harvey, Aeronautical Quarterly 27 (1976) 112.
22 R. G. Watts and R. Ferrer, "The lateral force on a spinning sphere: Aerodynamics of a curveball," American Journal of Physics 55 (1987) 40–44.


More realistically, the flow around a spinning sphere at high Reynolds numbers
undoubtedly produces turbulence and eddies behind the sphere, which cause drag
and affect lift.
The same inconsistency can be found in attempts at a tractable theoretical analysis
of a rotating cylinder for which the validity of the result is easier to estimate (or at
least speculate).23 To the extent that inferences drawn from a cylinder may have
relevance to a sphere, one may conclude the following. For sufficiently large values
of J, there is a surface dividing the fluid into an irrotational part that flows past the
cylinder in the main stream and a part trapped near the cylinder that co-rotates with
it. If the boundary layer within which viscosity is important is small compared with
the thickness of this surface of entrapment, then vorticity cannot readily diffuse into
the main stream, whereupon the circulation given by (7.7.1) (with constant r) is
believed accurate to a good approximation. The criterion for validity of (7.7.1) can
be shown to be J ≫ Re^(−1/3), from which it follows from (7.7.5) that the spin frequency ω/2π must satisfy

\frac{\omega}{2\pi} \gg \frac{Re^{-1/3}\, v}{2\pi a}.    (7.7.6)

Thus, for a baseball traveling through room-temperature air at 110 miles per hour (49.2 m/s), we have seen that Re ~ 2.4 × 10⁵, whereupon (7.7.6) yields ω/2π ≫ 3.6 Hz, which is easily achievable for a spinning baseball. In the analysis of cylindrical flow, however, the condition J ≫ 1 was assumed, and this is not the case for a batted baseball. For example, J = 0.072 for a baseball translating at 110 mph and spinning at 15 Hz. There are, as well, other features of a baseball, such as the seam that meanders over the surface, that may (or may not) contribute further complexity and uncertainty to a rigorous analysis of lift. For purposes of illustration, therefore, I will adopt (7.7.3) as our working relation, since it embodies the maximum circulation for a fixed ω (i.e. the circulation arising from rigid-body rotation) and should presumably lead to the most optimistic estimates of achievable home-run length.
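
The quantities in (7.7.3)–(7.7.6) are easily checked numerically. The short sketch below does so for a 110 mph pitch; the ball radius (a ≈ 0.037 m) and the dynamic viscosity of air (η ≈ 1.85 × 10⁻⁵ Pa s) are assumed standard values, not quantities taken from this section.

```python
# Numerical check of Eqs. (7.7.3)-(7.7.6) for a batted/pitched baseball.
# The radius a and viscosity eta are assumed standard values.
import math

rho = 1.204        # air density (kg/m^3)
eta = 1.85e-5      # dynamic viscosity of air (Pa s), assumed
a   = 0.037        # baseball radius (m), assumed
v   = 49.2         # translational speed (m/s), 110 mph
f   = 15.0         # spin frequency (Hz)
omega = 2 * math.pi * f

Re = rho * v * (2 * a) / eta                 # Reynolds number, Eq. (7.7.4)
J  = omega * a / v                           # roll parameter, Eq. (7.7.5)
Cl = 16 * omega * a / (3 * v)                # lift coefficient, Eq. (7.7.3)
f_min = Re**(-1/3) * v / (2 * math.pi * a)   # spin criterion, Eq. (7.7.6)

print(f"Re = {Re:.3g}, J = {J:.3f}, Cl = {Cl:.3f}, f_min = {f_min:.2f} Hz")
# Yields Re ~ 2.4e5 and J ~ 0.07, with f_min of a few hertz (the text quotes ~3.6 Hz).
```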
Taking account of the lift and induced drag resulting from the Magnus effect (7.7.3) on a sphere, as depicted geometrically in Figure 7.4, leads to a new characteristic speed vs associated with spin

v_s = \frac{mg}{\tfrac{8\pi}{3}\rho\omega a^3} = \frac{3mg}{8\pi\rho\omega a^3}    (7.7.7)

and thus to a modification of the (dimensionless) equations of motion (7.5.10)

\frac{dV_x}{dT} + \left(V_x^2 + V_y^2\right)^{1/2} V_x + \lambda V_y = 0
                                                                            (7.7.8)
\frac{dV_y}{dT} + \left(V_x^2 + V_y^2\right)^{1/2} V_y - \lambda V_x = -1

23 T. E. Faber, Fluid Dynamics for Physicists (Cambridge University Press, New York, 1995) 279–283.


[Figure 7.9: plot of Altitude (scaled) versus Horizontal Displacement (scaled), comparing curves labeled Drag+Spin (ω/2π = 15 Hz), Drag, and No Drag/No Spin.]
Fig. 7.9 Trajectories of a baseball (same features as in Figure 7.5) launched with speed 110 miles/hour [49.17 m/s] at an angle of 25° (gray) or 45° (black) with lift due to a moderate backspin of 15 Hz (solid). For comparison are shown trajectories for air drag without lift (dotted) and for no air drag or spin (dashed). Aerodynamic parameters are: air density 1.204 kg/m³, air viscosity 1.85 × 10⁻⁵ Pa s, drag coefficient 0.2. The acceleration of gravity is 9.81 m/s².

for the velocity of the ball. The additional dimensionless parameter λ ≡ vd/vs quantifies the relative influence of lift and drag. For a non-spinning ball, vs → ∞ and λ → 0. In contrast to the equations of motion without spin, the set (7.7.8) cannot be solved analytically, but is readily solvable numerically by the Levenberg–Marquardt algorithm to which I have referred before.
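
For readers who wish to reproduce trajectories like those of Figure 7.9, the following minimal sketch integrates the set (7.7.8) with a standard ODE solver (not the author's own procedure); the launch speed, angle, spin, and drag coefficient are taken from the figure caption, while the ball mass (≈0.145 kg) and radius (≈0.037 m) are assumed values.

```python
# Minimal sketch: numerical integration of the scaled equations of motion (7.7.8)
# with quadratic drag plus Magnus lift.  Ball mass and radius are assumed values.
import numpy as np
from scipy.integrate import solve_ivp

g, rho = 9.81, 1.204            # gravity (m/s^2), air density (kg/m^3)
m, a   = 0.145, 0.037           # baseball mass (kg) and radius (m), assumed
Cd     = 0.2                    # form drag coefficient (Fig. 7.9 caption)
v0, theta0 = 49.17, np.radians(25.0)   # launch speed and angle
omega  = 2*np.pi*15.0           # backspin of 15 Hz

vd  = np.sqrt(2*m*g/(rho*Cd*np.pi*a**2))    # drag characteristic speed
vs  = 3*m*g/(8*np.pi*rho*omega*a**3)        # spin characteristic speed, Eq. (7.7.7)
lam = vd/vs                                 # lift/drag parameter lambda

def rhs(T, u):
    Vx, Vy, Sx, Sy = u
    V = np.hypot(Vx, Vy)
    return [-V*Vx - lam*Vy,            # Eq. (7.7.8), horizontal component
            -V*Vy + lam*Vx - 1.0,      # Eq. (7.7.8), vertical component
            Vx, Vy]

u0 = [v0*np.cos(theta0)/vd, v0*np.sin(theta0)/vd, 0.0, 0.0]
hit = lambda T, u: u[3]                # ground contact: scaled altitude Sy = 0
hit.terminal, hit.direction = True, -1
sol = solve_ivp(rhs, [0, 50], u0, events=hit, max_step=0.01)
print(f"scaled range = {sol.y[2, -1]:.3f}")   # compare with Table 7.11 / Fig. 7.9
```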
Figure 7.9 shows a comparative illustration of the trajectories (solid lines) of balls launched at 110 mph at initial angles θ0 of 25° (gray lines) and 45° (black lines) and a moderate backspin of ω/2π = 15 Hz. To put this in perspective, dashed lines mark the flight of the ball in vacuum (no drag, no spin) and dotted lines mark the flight in air (drag, no spin) with a form drag coefficient again chosen to be Cd ~ 0.2 for Reynolds numbers beyond the transition region. As we have already seen, without spin the range in vacuum for fixed initial speed is always greatest for θ0 = 45°, and the range in air for the given parameters was greatest for θ0 ~ 41°. With a backspin of 15 Hz under the same conditions, the longest home-run range was obtained for a much lower initial angle, θ0 ~ 25.3°. Now that spin has been discussed, a re-examination of the results in Table 7.9, showing the launch angles that lead to maximum ranges for all three sets of conditions (no drag/no spin; drag/no spin; drag/spin), would be informative.
Figure 7.10 shows a comparison of trajectories for balls launched at 110 mph at initial angles θ0 of 0° (gray line) and 25° (black lines) and a higher backspin of ω/2π = 30 Hz. At θ0 = 25° the trajectory looks like the cone of a volcano, the ascending and


Table 7.11  Effect of spin on the range of home runs

Frequency ω/2π (Hz)    Range (scaled), θ0 = 0°    Range (scaled), θ0 = 25°
15                     0                          0.719
20                     0.389                      0.765
25                     0.804                      0.755
30                     0.930                      0.694
35                     0.902                      0.607
40                     0.821                      0.518
45                     0.741                      0.447

[Figure 7.10: plot of Altitude (scaled) versus Horizontal Displacement (scaled), comparing curves labeled Drag+Spin (ω/2π = 30 Hz), Drag, and No Drag/No Spin.]
Fig. 7.10 Trajectories of a baseball launched at an angle of 25° (black) or 0° (gray) with lift due to a high backspin of 30 Hz (solid). Aerodynamic parameters and comparison cases are the same as for Figure 7.9. For launch at 0°, the trajectories (with or without drag) of the non-spinning ball immediately drop below the launch level and are not seen in the figure.

descending segments both showing mild concave-upward curvatures, leading to nearly equal ranges for spinless flight through vacuum and spinning flight through air. At θ0 = 0°, however, the ball takes off parallel to the ground and literally flies up into the air before making a sharper descent. No plots are visible for spinless flight because the ball immediately drops below the level of the launch. Simulations of trajectories as a function of spin frequency at the preceding launch angles and speed, summarized in Table 7.11, lead to maximum home-run ranges when the horizontally launched ball is spinning at about 30 Hz and the ball launched at 25° is spinning at about 20 Hz.
Increasing the spin rate beyond those recorded in Table 7.11 resulted in bizarre trajectories with cusps and loops, such as illustrated in Figure 7.11 for a ball launched


[Figure 7.11: plot of Altitude (scaled) versus Horizontal Displacement (scaled) showing looping Drag+Spin trajectories for ω/2π = 100 Hz.]
Fig. 7.11 Phugoid looping motion of a baseball launched at an angle of 0° (dashed gray) or 25° (solid black) and very high backspin of 100 Hz. For comparison the trajectory of the ball launched at 25° without drag or spin is shown (dashed black). Aerodynamic parameters are the same as for Figure 7.9.

at 110 mph at 0° (gray plot) or 25° (black plot) and spinning at 100 Hz. At this high spin rate the ball is undergoing a sustained, oscillatory flight, reminiscent of (and indeed related to) the behavior referred to as phugoid motion24 by F. W. Lanchester, one of the first to understand the principles of heavier-than-air flight. The term "phugoid" is actually a linguistic blunder. Lanchester sought a Greek word for flight in the sense of flying, but chose a word that had the sense of fleeing. For perspective, the figure also shows the vacuum trajectory (dashed black plot) of the ball launched at 25° without drag or spin.
To see clearly the mathematical origin of the looping motion, consider the equations of motion (7.7.8) in the limit of very high spin parameter λ such that the drag terms quadratic in velocity are omitted

\frac{dV_x}{dT} + \lambda V_y = 0
                                            (7.7.9)
\frac{dV_y}{dT} - \lambda V_x = -1.

The set (7.7.9) can again be solved exactly, leading to components of velocity

V_x(T) = \left(V_{x0} - \lambda^{-1}\right)\cos\lambda T - V_{y0}\sin\lambda T + \lambda^{-1}
                                                                                        (7.7.10)
V_y(T) = V_{y0}\cos\lambda T + \left(V_{x0} - \lambda^{-1}\right)\sin\lambda T
and of displacement
24 R. von Mises, Theory of Flight (Dover, New York, 1959) 539–545.




S_x(T) = V_{x0}\,\frac{\sin\lambda T}{\lambda} - V_{y0}\,\frac{1-\cos\lambda T}{\lambda} + \frac{\lambda T - \sin\lambda T}{\lambda^2}
                                                                                                                                       (7.7.11)
S_y(T) = V_{y0}\,\frac{\sin\lambda T}{\lambda} + V_{x0}\,\frac{1-\cos\lambda T}{\lambda} - \frac{1-\cos\lambda T}{\lambda^2}

with initial conditions Vx0 = V0 cos θ0 and Vy0 = V0 sin θ0.
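
As a quick check of the algebra, the closed-form solution (7.7.10)–(7.7.11) can be evaluated directly. The sketch below uses an illustrative λ and launch state (not the exact values behind Figure 7.11) and confirms two properties of the drag-free phugoid: the velocity traces a circle about the drift point (Vx, Vy) = (λ⁻¹, 0), and the mean horizontal drift rate equals λ⁻¹.

```python
# Evaluate the drag-free phugoid solution, Eqs. (7.7.10)-(7.7.11).
# lam and the launch state are illustrative values only.
import numpy as np

lam = 10.0                                  # spin (lift) parameter lambda = vd/vs
V0, theta0 = 1.0, np.radians(25.0)
Vx0, Vy0 = V0*np.cos(theta0), V0*np.sin(theta0)

T  = np.linspace(0.0, 4*np.pi/lam, 400)     # two full loops of period 2*pi/lam
Vx = (Vx0 - 1/lam)*np.cos(lam*T) - Vy0*np.sin(lam*T) + 1/lam
Vy = Vy0*np.cos(lam*T) + (Vx0 - 1/lam)*np.sin(lam*T)
Sx = Vx0*np.sin(lam*T)/lam - Vy0*(1 - np.cos(lam*T))/lam + (lam*T - np.sin(lam*T))/lam**2
Sy = Vy0*np.sin(lam*T)/lam + Vx0*(1 - np.cos(lam*T))/lam - (1 - np.cos(lam*T))/lam**2

R = np.hypot(Vx - 1/lam, Vy)                # should be constant (circle in velocity space)
print(f"velocity-space radius: min {R.min():.6f}, max {R.max():.6f}")
print(f"mean drift dSx/dT = {(Sx[-1]-Sx[0])/(T[-1]-T[0]):.4f} (expect 1/lam = {1/lam})")
```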


It is interesting to note that the phugoid motion can be simulated by a simple toy
constructed from a light-weight cardboard cylinder (obtained from a roll of paper
towels) with cardboard end caps of greater radius (which gives the appearance of a
pair of wheels). The cylinder is wrapped with a piece of string to generate backspin
when it is hand-thrown into the wind. In a report to the US Navy, the author
describes almost rapturously the device, referred to as a Rotorang:25
The Rotorang will climb rapidly upward to its maximum altitude then drift downwind and
earthward, executing a perfect loop. It will continue to fly until its speed of rotation becomes
equal to the wind at which time its resistance disappears. The glider will then hover several feet
above the ground for an astonishingly long period. As the spin decays further the Rotorang
will again assume a shallow glide path and will land some distance away.

I recall playing with such a toy as a child, although I am not familiar with the name
Rotorang.
Incidentally, the principal feature of interest to the author of the report was not
the phugoid (i.e. looped) motion of the Rotorang, but the fact that at a particular
spin rate its drag dropped so low that the toy hung motionless in the wind for an
unusually long period of time. This, the author claimed, was illustrative of what he
called the Barkley phenomenon (named for a man who pointed the characteristic
out to him during model-basin tests of rotary rudders): the drop in drag on a rotor
just prior to its reaching equality of surface and flow velocities i.e. a roll parameter
J 1. Recall that one of the earliest failures of the theory of ideal fluids was the
dAlembert paradox in effect, the false prediction that an object placed at some
location in a rapidly flowing stream of ideal fluid would just remain there at rest.
A cylinder spinning in air at the appropriate rate actually appears to do that,
although for complex reasons related to the spin and boundary layer within which
the fluid is not ideal, but viscous. Moreover, the end caps on the Rotorang were not meant to serve principally as wheels, but to block air flow between the external and internal regions that would otherwise reduce the circulation Γ.
7.8 Falling out of the sky is a drag
In January 1945, First Lieutenant Federico Gonzales of the US 8th Air Force was
piloting the lead B-17 Flying Fortress in an air raid over Germany when half of the
25 J. Borg, The Magnus Effect: An Overview of its Past and Future Practical Applications, Vol. 1, Report AD-A165 902 (Department of the Navy, Washington DC, 1986) 21–22.


left wing of his aircraft was shot off by ground fire. The plane, spinning rapidly,
split amidships and plunged 27000 feet (8.23 km) to the ground with the unconscious
pilot wedged under the instrument panel. Although severely injured, the pilot
survived, alone of the ten-man crew. So began the narrative of a book26 written
by the pilot's son that, a few years after its publication, came to my attention
in a serendipitous way: as a book-sale discard that my wife brought home to me.
It was a chance happening that re-directed my research focus for several years
afterward.
The pilot, having received medical attention in captivity, was eventually able,
despite some permanent injury, to resume a normal life after the war and, interestingly enough, became a professor of biophysics. When I learned these spare details of
Gonzales's extraordinary fall and subsequent recovery, two questions immediately
came to mind. First, how was it possible for any human to survive a fall of ~8 km
without a parachute? And second, feeling a certain kinship with the man through our
common pursuit of physics as a profession, I could not help wondering whether he,
himself, ever wondered about his survival, other than to regard it as an exceptionally
lucky outcome, if not a miracle.
The maximum acceleration that a human can endure has long been a subject of
fascination, as well as practical interest, particularly to insurance companies, automotive safety agencies, national space agencies, the military, and the like. Estimates
have ranged from about 10g to a little over 100g, depending on the duration and
orientation of impact, where g is the acceleration of gravity: g = 9.81 m/s². Particularly striking was the case of racing driver David Purley, who survived a crash
estimated to have produced 178g as he decelerated from 173 km/h to 0 in a distance
of 66 cm.27 The most comprehensive study I am aware of concerning human impact
tolerance is a 324-page report prepared for the Insurance Institute for Highway
Safety.28 Among the findings of the authors, who investigated vertical falls up to
about 275 feet (. . . the height of the Golden Gate Bridge in San Francisco, California,
from which numerous suicide attempts have been made . . .) as a proxy for horizontal
car crashes, was that 350g for 2.5–3.0 ms was the approximate survival limit of
children under age 8 subject to head impacts. From such data it seems likely that a
few hundred g over a period of a few seconds would be a liberal upper limit to human
impact tolerance under most circumstances.
Straightforward application of the kinematics of uniform acceleration, such as is
taught in elementary mechanics, would tell us that, starting from rest, an object
(irrespective of its mass) falling a vertical distance h = 8.23 km through vacuum (or
26 L. Gonzales, Deep Survival (W. W. Norton, New York, 2005) 9–15. Other accounts of the mission in which Lt. Gonzales's plane was shot down were recorded in diaries of various members of the squadron, excerpted online at the website of the 398th Bomb Group Memorial Association http://www.398th.org/Missions/Dates/1945/January/MIS_450123.html
27 David Purley, http://en.wikipedia.org/wiki/David_Purley
28 R. G. Snyder, D. R. Foust, and B. M. Bowman, Study of Impact Tolerance Through Free-Fall Investigations (December 1977), Highway Safety Research Institute of the University of Michigan, Ann Arbor, Michigan.


atmosphere tenuous enough to be regarded as vacuum) would strike the ground in 41 seconds with a speed of 402 m/s (~900 mph). If the object came to rest within a distance d, let us say of 50 cm, assuming that either the ground was not entirely rigid or that the object was compressed upon impact, the crash deceleration ac in units of g

a_c/g = h/d    (7.8.1)

would amount to ac = 16 460g. This resulting g-force is orders of magnitude beyond any that a human body could possibly withstand. So how could the pilot have survived?
Although luck and willpower may play a significant role in human destiny, there
are no miracles in physics. To understand unusual occurrences we must always
assume that at no time or place were the laws of physics ever suspended. And so
the simple kinematic calculation is not valid. What has been omitted, obviously, are
the resistive effect of the air and the greater range of deceleration afforded by impact
compression of the aircraft. That insects and other small creatures can reach a nonfatal terminal speed in the presence of air, rather than continue to accelerate while
falling, is a familiar fact. Baseballs, as we have just discussed, do not reach a terminal
speed before the ground intervenes. That a 25-tonne aircraft with a human being in
the front end might have done so is a circumstance that far transcends ordinary
experience and requires a felicitous combination of physical law, physical parameters, and geometry.
Intrigued by survival of the fallen airman, I set about to estimate the two numbers
that most defined his descent, i.e. his terminal velocity and rate of spin. However,
before discussing the extraordinary flight of a damaged B-17, it is useful to consider
the ordinary flight of an undamaged one whose pertinent features are summarized in
Table 7.12.29
An airplane in steady horizontal flight at speed v0 is an example of perfect balance.
The downward weight W of the plane of mass m is equal to the upward lift Fl of the
air reaction on the wings
F_l = \tfrac{1}{2}\rho v_0^2 C_l S = W = mg,    (7.8.2)

and the combined forward thrust Ft = Pe/v0 of all the engines of combined power Pe is balanced by the rearward drag of air resistance

F_d = \tfrac{1}{2}\rho v_0^2 C_d S = F_t = \frac{P_e}{v_0}    (7.8.3)

so that there is no net acceleration. Besides balance of forces, there is the absence of
moments. The upward lift of the port wing (i.e. on the pilot's left), which would roll
29 http://en.wikipedia.org/wiki/Boeing_B-17_Flying_Fortress

Table 7.12  B-17 Flying Fortress

Fuselage length (m)           23
Fuselage diameter (m)         2.4
Wing length (m)               13.6
Wingspan ℓ (m)                32
Wing area S (m²)              131.9
Nose-to-cockpit length (m)    3.4
Empty mass (kg)               16 391
Loaded mass (kg)              24 495
Aspect ratio AR = ℓ²/S        7.58
Maximum speed (m/s)           128.3
Cruising speed (m/s)          81.4
Engine power (kW)             895
Number of engines             4

the plane clockwise about the long axis of the fuselage, is balanced by the upward lift of the starboard wing (to the pilot's right), which would roll the plane anti-clockwise. The upward lift of both wings, which could pitch the nose of the plane upward in a rotation about a horizontal axis through the wings, is balanced by the upward lift of the rear horizontal stabilizers (the winglets in the tail assembly or empennage), which would pitch the nose downward. And the thrust of the port engines, which would yaw the plane to starboard about a vertical axis through the craft's center of gravity, is balanced by the counter-thrust of the starboard engines, which would yaw the plane to port.
In the expressions (7.8.2) and (7.8.3), the forces of lift and drag are expressed in standard Newtonian form, i.e. proportional to the square of the relative wind speed, in which ρ is again the air density, S is again a reference area defined as the projected area of the planform (which for all practical purposes may be taken to be the wing area), and the engine power Pe is equal to the product of velocity and thrust. The lift-to-drag ratio then follows simply as

\frac{F_l}{F_d} = \frac{C_l}{C_d} = \frac{mg\, v_0}{P_e}.    (7.8.4)

The drag on a subsonic airplane arises from various sources. Skin (or friction)
drag results from viscous shearing stresses over the surface. Pressure (or form) drag
results from the integrated effect of static pressure normal to the surface. Together,
skin drag and pressure drag constitute profile drag. Other forms of drag not pertinent
to the present discussion arise from shock waves associated with relative speeds at or
beyond the speed of sound. These can be ignored now. For a sleek aerodynamically
shaped object like an airplane cruising horizontally (and subsonically) through the
air at high Reynolds number, the profile drag is primarily skin drag, and the drag


coefficient is small. The zero-lift profile drag of a B-17 is reported by NASA to be Cd ~ 0.0302.30 Contrast that with an airplane falling vertically through the air with its
wings horizontal; the profile drag is primarily pressure drag, and the drag coefficient
corresponds more or less to that of a flat plate with its broad surface to the wind.
From values given for plates of comparable aspect ratio,31 I would estimate Cd ~ 1.3
for a B-17 falling under such conditions. Besides profile drag, we have already
encountered induced drag, which is associated with a lifting force that tilts backward from the vertical so as to have a horizontal component anti-parallel to thrust.
Physically, the induced drag arises primarily from dissipation of translational kinetic
energy through shedding of wing-tip vortices. The total drag coefficient is the sum of
the drag coefficients for profile and induced drag.
The cambered wings of an airplane ordinarily make an angle α (the incidence or angle of attack) with the wind even if the center of gravity of the plane is moving parallel to the ground. Under normal circumstances, this angle is a small number when expressed in radians, thereby permitting an approximate equivalence between the angle and its sine. For a range of angles below which stall occurs, i.e. separation of air flow from the wing surface resulting in decreased lift, the lift and drag coefficients are found to vary respectively with the first and second powers of the incidence, and therefore to be governed approximately by a parabolic law32

C_d = \underbrace{C_{d0}}_{\text{Profile Drag}} + \underbrace{\frac{C_l^2}{\pi\,\ell^2/S}}_{\text{Induced Drag}}    (7.8.5)

where the profile drag (first term) depends primarily on shape, and the induced drag (second term) depends on lift and aspect ratio AR = ℓ²/S.
Given the aerodynamic complexities of a real aircraft in flight, the coefficients of lift and drag are generally not amenable to theoretical prediction, but must be obtained empirically, which we can do from the data in Table 7.12. Thus, solving for Cl and Cd from (7.8.2) and (7.8.3) and substituting the appropriate quantities from the table lead to

C_l = \frac{2mg}{\rho v_0^2 S} = \frac{2(24\,495\ \mathrm{kg})(9.81\ \mathrm{m/s^2})}{(1.29\ \mathrm{kg/m^3})(81.4\ \mathrm{m/s})^2(131.9\ \mathrm{m^2})} = 0.426    (7.8.6)

C_d = \frac{2P_e}{\rho v_0^3 S} = \frac{2(4\times 895\times 10^3\ \mathrm{W})}{(1.29\ \mathrm{kg/m^3})(81.4\ \mathrm{m/s})^3(131.9\ \mathrm{m^2})} = 0.078    (7.8.7)

30 L. K. Loftin, Jr, Quest for Performance: The Evolution of Modern Aircraft, NASA SP-468 (NASA Scientific and Technical Information Branch, Washington DC, 1985) Appendix A, Table II, Characteristics of Illustrative Aircraft 1918–1939, http://www.hq.nasa.gov/pao/History/SP-468/app-a.htm
31 The aspect ratio AR of a rectangular airfoil is the ratio of the wingspan ℓ to the chord length b. For wings of variable width, the aspect ratio is defined by AR = ℓ²/S, where S is the wing area.
32 R. von Mises, Theory of Flight (Dover, New York, 1959) 140–142, 165.


and therefore to a ratio Cl/Cd = 5.5 for a fully loaded B-17 cruising at about 80 m/s in level flight. NASA reports a maximum ratio (Cl/Cd)max = 12.7.
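
The arithmetic of (7.8.6) and (7.8.7) is easily scripted; the short check below uses only the Table 7.12 entries.

```python
# Empirical lift and drag coefficients of a cruising B-17, Eqs. (7.8.6)-(7.8.7).
rho, g = 1.29, 9.81          # air density (kg/m^3), gravity (m/s^2)
m, S   = 24495.0, 131.9      # loaded mass (kg), wing area (m^2), Table 7.12
v0     = 81.4                # cruising speed (m/s)
Pe     = 4 * 895e3           # combined power of four engines (W)

Cl = 2*m*g / (rho * v0**2 * S)     # Eq. (7.8.6) -> ~0.426
Cd = 2*Pe / (rho * v0**3 * S)      # Eq. (7.8.7) -> ~0.078
print(f"Cl = {Cl:.3f}, Cd = {Cd:.3f}, Cl/Cd = {Cl/Cd:.1f}")
```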
Since the temperature and density of the atmosphere are not homogeneous, the value of the air density employed above warrants comment. It is the density of dry air at 0 °C and 1 atm pressure, a set of conditions referred to as Standard Temperature and Pressure (STP). Within the troposphere, i.e. the first ~11 km of the atmosphere above sea level, the temperature of dry air rising adiabatically (i.e. without heat exchange) decreases linearly with altitude at approximately 10 °C/km. This variation, which characterizes convective isentropic equilibrium in the atmosphere, is known as the (dry air) adiabatic lapse rate. As the air temperature changes, so too do the density and pressure according to the ideal gas law and the adiabatic expansion equation, which relate all three variables. Additionally, even in an isothermal atmosphere, the pressure, and therefore the density, decrease exponentially with altitude in accordance with the barometric equation. Given that Lt. Gonzales's B-17 descended over Germany in winter (January), it may be reasonable to assume that the ground-level temperature was about 0 °C, and therefore his descent from 27 000 feet (8.2 km) began at an ambient temperature of about −80 °C. The variation in density affects the descent rate, but I will deal with the complexities of a thermodynamically inhomogeneous atmosphere in the next section where it is more pertinent to the content. For the present purpose, however, of accounting for Lt. Gonzales's survival, it is sufficient simply to adopt the STP value of air density.
With the sudden destruction of the port wing on Lt. Gonzales's fateful day, this
perfect balance was instantly shattered. The lift on the starboard wing, now
unopposed by the lift of its counterpart, generated a torque about the long axis,
rolling the starboard wing upward and the remnant of the port wing downward. The
roll, according to the narrative I read, was violent enough to invert the plane. With a
port engine missing, the uncompensated torque of the starboard engines about the
vertical axis through the plane's center of gravity yawed the nose of the plane to
the port side, initiating a spin. The lift of the horizontal stabilizers, now exceeding the
lift of the wings, forced the nose of the plane downward. Rolling, yawing, pitching,
the doomed B-17, its weight no longer supported by lift, plunged earthward, quickly
settling into a stable helical spiral described in aeronautical terms as a flat spin.
In a spinning descent, the aerodynamic variables are all out of kilter. With the
nose declined below horizontal and the aircraft falling downward, the relative wind is
primarily vertically upward flowing over the wings at an incidence (angle between air
stream and wing chord) a little below 90°, in fact, for a flat spin, far above the
stalling angle. Under such conditions, the net aerodynamic force on the plane is
perpendicular to the wing chord. Drag, which in steady flight is horizontal, opposing
thrust, is now vertically upward, opposing weight. Lift, which in steady flight is
vertically upward, opposing weight, is now horizontal and radially inward, creating
the centripetal acceleration of the spin. Of the two kinds, steep spin (the extreme form of which is a spinning nose dive) and flat spin (the extreme form of which


resembles the descent of a Frisbee), the flat spin is the more dangerous because it is
stable. The wings are stalled, the control surfaces, particularly the rudder, cease to
function, and, once transients of the motion have decayed away, the flat spin persists at a
steady rate. Nevertheless, hazardous and irrecoverable as it is reputed to be, I believe
that a flat spin saved the life of Lt. Gonzales.
In modeling the violent transition from steady horizontal flight to a helical flat
spin descent, I will consider first an intact B-17 (because its geometry is unambiguous) and assume that
(a) the cruising speed v0 of the aircraft became (at least approximately) the tangential
speed of the plane's center of gravity about the axis of the helix, and
(b) the vertical descent started with an initial axial component vz0 = 0.
It then follows from the equation for lift as the source of centripetal acceleration

F_l = \frac{m v_0^2}{r}    (7.8.8)

and the steady-state equation for drag

F_d = mg    (7.8.9)

that the lift-to-drag ratio

\frac{F_l}{F_d} = \frac{v_0^2}{gr} = \frac{a_{\mathrm{cent}}}{g}    (7.8.10)

gives what in the pilot's rest frame would be interpreted as the centrifugal force, once the radius r of the helical trajectory is known. The radius for extreme flat spin is ordinarily not larger than one-half of the wingspan,33 which from Eq. (7.8.10) and Table 7.12 would lead to a centrifugal force of about 43g. This is a little below the maximum acceleration experienced by a human on a rocket sled. Beyond 50g, sustained spin could lead to death or serious injury.
In the narrative about his father's descent, the author wrote that the plane was spinning "hard enough to suck your eyeballs out,"34 a literary embellishment that might well apply to a centrifugal force of about 43g. The corresponding spin rate for a radius ½ℓ = 15.8 m is 0.82 Hz, or very close to 50 revolutions per minute (rpm). As an interesting comparison, the Guinness World Record for spinning on ice skates was (at the time of writing) 308 rpm, set by Natalia Kanounnikova of Russia on 27 March 2006.35 However, the maximum centrifugal force experienced by a point on her body, if she were modeled as a vertical cylinder of radius ~25 cm, would be ~26g.
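
These spin numbers follow from (7.8.10) and elementary circular kinematics; a quick check using the cruising speed of Table 7.12 and a spin radius of half the wingspan is given below.

```python
# Centrifugal load and spin rate of the helical flat spin, Eq. (7.8.10).
import math

g  = 9.81
v0 = 81.4          # cruising speed (m/s), Table 7.12
r  = 15.8          # spin radius ~ half the wingspan (m)

a_cent = v0**2 / r                  # centripetal acceleration (m/s^2)
f_spin = v0 / (2 * math.pi * r)     # revolutions per second

print(f"centrifugal load ~ {a_cent/g:.0f} g")                   # ~43 g
print(f"spin rate ~ {f_spin:.2f} Hz ~ {60*f_spin:.0f} rpm")     # ~0.82 Hz, ~50 rpm

# Comparison: skater at 308 rpm modeled as a cylinder of radius 0.25 m
omega_sk = 308 * 2 * math.pi / 60
print(f"skater rim acceleration ~ {omega_sk**2 * 0.25 / g:.1f} g")   # ~26 g
```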
33 B. N. Pamadi, Performance, Stability, Dynamics, and Control of Airplanes (AIAA, Reston VA, 1998) 650.
34 L. Gonzales, op. cit. p. 271.
35 Guinness World Records, http://community.guinnessworldrecords.com/_GUINNESS-WORLD-RECORDSHOLDING-FIGURE-SKATERS-GO-FOR-THE-GOLD-IN-VANCOUVER/blog/1866731/7691.html


The assumption underlying (7.8.9) that the plane fell at a steady terminal speed
will now be justified as we examine the descent. Although the nonlinear Newtonian
drag force couples vertical and horizontal components of velocity, we have seen in
the analysis of a baseball trajectory that treating horizontal and vertical motions
independently led to results in surprisingly good agreement with those obtained by
solution of the exact equations of motion. With adoption of the same approximate
procedure, the equation of motion for freefall from height h in a retarding atmosphere takes the dimensionless form
\frac{dV_z}{dT} + V_z^2 = 1 \qquad\left(V_z \equiv v_z/v_d;\ T \equiv t/t_d\right)    (7.8.11)

where z designates the vertical axis whose origin is at the initial location of the plane. The (scaled) vertical displacement (Sz ≡ sz/sd) is measured from this origin. For a loaded B-17 (m = 24 495 kg) falling like a flat plate (Cplate = 1.3), the scale factors in terms of which dynamical variables are expressed take values


velocity        v_d = \left[\frac{2mg}{\rho\, C_{\mathrm{plate}} S}\right]^{1/2} = 46.61\ \mathrm{m/s}    (7.8.12)

time            t_d = \frac{v_d}{g} = 4.75\ \mathrm{s}    (7.8.13)

displacement    s_d = v_d t_d = 221.48\ \mathrm{m}.    (7.8.14)

Two significant features distinguish Eq. (7.8.11) from the equation employed previously for the vertical motion of a baseball. The first is the +1, rather than −1, on the right-hand side, the sign of which reflects that positive displacement along the vertical axis occurs downward (in the direction of g) rather than upward. The second is the initial condition vz0 = 0, instead of vz0 = v0 sin θ0. With these two differences taken into account, the equation can be integrated to yield expressions

V_z(T) \equiv \frac{v_z(t)}{v_d} = \frac{\tanh T + V_{z0}}{1 + V_{z0}\tanh T} \;\xrightarrow[V_{z0}=0]{}\; \tanh T    (7.8.15)

V_z(S_z) \equiv \frac{v_z(s_z)}{v_d} = \left[1 - \left(1 - V_{z0}^2\right)e^{-2S_z}\right]^{1/2} \;\xrightarrow[V_{z0}=0]{}\; \left(1 - e^{-2S_z}\right)^{1/2}    (7.8.16)

S_z(T) \equiv \frac{s_z(t)}{s_d} = \ln\left(\cosh T + V_{z0}\sinh T\right) \;\xrightarrow[V_{z0}=0]{}\; S_z(T) = \ln(\cosh T)
                                                                                                                            (7.8.17)
T(S_z) = \ln\left(e^{S_z} + \sqrt{e^{2S_z} - 1}\,\right)

for the dynamical variables in terms of (scaled) time or vertical displacement.


From (7.8.14), one finds that the dimensionless displacement H corresponding to h = 8.2 km is H = 37.16. For an exponent this large, the expression (7.8.16) for the vertical speed of impact of the B-17 at the ground reduces to Vz(H) = 1, or v(h) = vd, which is also the steady-state speed obtained by setting dVz/dT = 0 in (7.8.11). From the asymptotic form of (7.8.17) for H ≫ 1, the total descent time is simply given by

T \approx \ln\left(2e^{H}\right) \;\Rightarrow\; T \approx H + \ln 2 = 37.85,

which is what one gets by integrating Vz = dSz/dT = 1. In standard units, t ~ 180 s or about 3 min.
Were the airman to have hit solid ground directly at a vertical speed of ~47 m/s, he would have perished. However, ensconced within a metal fuselage of diameter 2.4 m (see Table 7.12) that would have collapsed upon flat impact, his deceleration length could well have been somewhere in the vicinity of 1–2 m. Applying in this instance the standard kinematic relations of uniform deceleration yields impact decelerations as a function of crash length lc

\frac{a_c}{g} = \frac{v_d^2}{2g\, l_c} = \begin{cases} 110.7 & l_c = 1\ \mathrm{m} \\ 73.8 & l_c = 1.5\ \mathrm{m} \\ 55.4 & l_c = 2\ \mathrm{m} \end{cases}    (7.8.18)

which, though severe, are within past precedents of survival.
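
The chain of numbers in (7.8.12)–(7.8.18) is easy to reproduce; the sketch below evaluates the scale factors, the descent time, and the impact decelerations for the flat-falling B-17, with the air density held at its STP value as in the text.

```python
# Flat-falling loaded B-17: scale factors (7.8.12)-(7.8.14), descent time from
# the asymptotic form of (7.8.17), and impact decelerations from (7.8.18).
import math

g, rho  = 9.81, 1.29        # gravity (m/s^2), STP air density (kg/m^3)
m, S    = 24495.0, 131.9    # loaded mass (kg), wing (plate) area (m^2)
C_plate = 1.3               # form drag coefficient of a broadside flat plate
h = 8230.0                  # fall height (m)

vd = math.sqrt(2*m*g / (rho*C_plate*S))   # terminal velocity, ~46.6 m/s
td = vd / g                               # time scale, ~4.75 s
sd = vd * td                              # length scale, ~221 m
H  = h / sd                               # scaled drop height, ~37.2

T_total = H + math.log(2.0)               # scaled descent time for H >> 1
print(f"vd = {vd:.1f} m/s, descent time ~ {T_total*td:.0f} s")

for lc in (1.0, 1.5, 2.0):                # crash (compression) lengths (m)
    print(f"lc = {lc} m: ac/g = {vd**2/(2*g*lc):.1f}")
```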
It is to be noted that the element critical to Lt. Gonzales's survival was that the plane came down flat like a Frisbee and not steep like an arrow. Replacement of the form drag of a flat plate broadside to the wind with the friction drag of a B-17 yields a value for ac/g in the thousands for any reasonable impact length, whereupon the pilot's future son would not have been around to write the narrative.
The question remains as to whether the results of the preceding analysis for an
intact plane are valid given the extensive damage (with loss of structures) to the
aircraft. As seen from (7.8.12), the terminal freefall velocity is reduced by a loss of
mass but is increased by a reduction in wing area. Thus, to estimate reliably the
dynamical effects of damage requires some detailed anatomical information which
may no longer be available. In the narrative, half of one wing was shot away and
shortly afterward the plane broke in two amidships. To my knowledge there is
no photographic record of Lt. Gonzales's downed B-17, but from photographs
I have seen of other damaged B-17 aircraft, I would speculate that the fuselage
fractured just fore of the empennage at about three-quarters the distance from
the nose. Since the bombs on a B-17 were stored in racks in a bomb bay behind
the cockpit, the loss of the rear quarter of the fuselage did not mean loss of the
principal load.
To determine the mass of a B-17 missing one-half a wing and one-quarter the
fuselage requires knowing the masses of the wings and fuselage separately. Since no
technical specifications of the B-17 available to me gave these data, I estimated them


statistically from a NASA Technical Memorandum36 that provided fuselage and wing masses of eight different transport aircraft of total weight varying from 5000 to about 55 000 lbs (i.e. 2268–24 948 kg). Interestingly, despite the wide variation in total weight of the aircraft, the ratio of wing mass to fuselage mass did not differ greatly among the included models; the mean ratio and standard deviation were found to be 1.068 ± 0.413, which I took to be simply a 1:1 ratio. Thus, from the known empty mass of the B-17 and the preceding fuselage–wings mass ratio, I estimated the mass of the unloaded fuselage to be 8195.5 kg and the mass of a single wing to be 4097.8 kg, and thereby inferred that the mass of the downed B-17 was m ~ 20 397 kg when the 8104 kg load was included. Carrying out a dynamical analysis with this value of m and a value for wing area 3/4 that of the undamaged plane resulted in a terminal freefall velocity (49.1 m/s) and a range of impact decelerations [(ac/g) = 61–123] slightly larger than before, but still within limits that people have somehow survived.
This is not to say that Lt. Gonzales was not injured grievously. Curious to know
whether the man who became a biophysicist ever investigated the physics of how the
pilot of his youth survived a fall of more than 5 miles, I looked at those publications
of his that I could find, but found none related to my inquiry. Nevertheless, I believe
that the consequences of his fall may have played a significant role in the choice of his
research. Having suffered, according to his son's account, multiple fractures of his
hands, feet, limbs, and ribs, it is probably no coincidence that he became interested in
the healing of fractures, which he studied as a biophysicist by means of electron
microscopy.37
Dr. Federico Gonzales died in 2007 at the age of 86. I did not know the man
personally, but in an indirect way his experience rekindled my own interest in
aeronautics. And the fact that a significant part of my investigations of quantum
phenomena, reported in previous books,38 also involved electron microscopy (as a
means of generating electron interference patterns) further added to a sense of
familiarity. I was glad to read that overall, despite his injuries, he had a satisfying
and productive life. This section is dedicated to his memory.

7.9 Descent without power: how to rescue a jumbo jet disabled in flight
As a physicist conducting research in laboratories all over the world, I have spent a
lot of time in airplanes some 10 km above the ground. In all my travels, I have yet to
36 M. D. Ardema, M. C. Chambers, A. P. Patron, A. S. Hahn, H. Miura, and M. D. Moore, Analytical Fuselage and Wing Weight Estimation of Transport Aircraft, NASA Technical Memorandum 110392 (May 1996), pp. 19, 22. The eight aircraft whose fuselage and wing weights were given are: B-720, B-727, B-737, B-747, DC-8, MD-11, MD-83, and L-1011, where B = Boeing, DC = Douglas, MD = McDonnell-Douglas, L = Lockheed.
37 F. Gonzales and M. J. Karnovsky, "Electron microscopy of osteoclasts in healing fractures of rat bone," Journal of Biophysical and Biochemical Cytology 9 (1961) 299–316.
38 M. P. Silverman, Quantum Superposition: Counterintuitive Consequences of Coherence, Entanglement, and Interference (Springer, Heidelberg, 2008).


meet a fellow air traveler who has not, at least for a moment, reflected on the
possibility that the plane may go down. The trend in design of modern commercial
aircraft, driven in part by rising costs of fuel, construction materials, and labor, is to
larger, heavier planes that transport ever greater numbers of passengers. Aerodynamicists now routinely contemplate design models capable of carrying 800 or more
people.39 Although air travel is presently considered very safe, no human-made
machine is 100% reliable, and it is therefore certain that at least one of these
airplanes would eventually fail in service with a huge number of fatalities. It is of
prime interest therefore to investigate how the laws of physics may be used to avert
such a catastrophe.
The fall and survival of Lt. Gonzales prompted me to consider more generally the
controlled descent of fragile loads, a topic of vital concern to space agencies, cargo
transporters, and general aviation. In regard to the latter, in particular, I was able to
demonstrate40 analytically how a large passenger airliner, having suffered total loss of
power, may be brought to ground by means of a sequentially released parachute-assisted descent with impact deceleration below 10g. The idea of protecting an entire
aircraft, rather than individual persons, with a parachute, unusual as it may seem, has
in fact been implemented commercially since about 1980 for small craft with maximum masses in the range of 270–1410 kg and deploy speeds of about 65–85 m/s.41
For large general aviation aircraft, however, the greater weights, speeds, and altitudes
are believed to make in-air recovery virtually impossible. Nevertheless, I have found
that in-air recovery of large general-aviation aircraft should be aerodynamically
feasible with decelerators of a size that currently exist and without necessarily requiring new materials.
The air resistance (drag force) on an object descending through an atmosphere
depends, as discussed in the previous section, on the air density, square of the relative
air speed, effective area presented to the air stream, and drag coefficient. The air density
in turn is a function of altitude and air temperature. In an atmosphere in isentropic
equilibrium, such as characterizes the Earth's troposphere (depth of 8–16 km
from poles to tropics), the density varies adiabatically with altitude. The drag coefficient is largely independent of size, but depends weakly on Reynolds number (for
high Reynolds numbers) and sensitively on shape and origin (i.e. from pressure or
friction).
Although it is usually an acceptable approximation to regard air as an incompressible fluid for horizontal flight at subsonic speeds, the effect of compressibility on air
resistance will be significant at any speed for a sufficiently large vertical excursion.

39 A. Bowers (Senior Aerodynamicist for NASA), The Wing is The Thing (TWITT) Meeting, NASA Dryden Flight Research Center, Edwards AFB, California USA (16 September 2000). Presentation available at http://www.twitt.org/BWBBowers.html
40 M. P. Silverman, "Two-dimensional descent through a compressible atmosphere: Sequential deceleration of an unpowered load," Europhysics Letters 89 (2010) 48002 p1–p6.
41 Ballistic Recovery Systems, http://www.usairborne.com/brs_parachute.htm


Air is a poor conductor of heat. To say that air density varies adiabatically means
that over the brief time interval that a parcel of air expands or contracts in an
environment at different temperature, there is no heat flow into it from the immediate
surroundings. Thus, the work done in adiabatic expansion or contraction comes
from the internal energy of the parcel, which subsequently must cool (for expansion)
or become warmer (for compression). Combined application of the equation of state
of an ideal gas of molar mass M
p = \rho R T / M,    (7.9.1)

which relates pressure p, absolute (or Kelvin) temperature T, and density ρ, with the equation for an adiabatic transformation of an ideal gas

p^{\,1-\gamma}\, T^{\,\gamma} = \text{constant}    (7.9.2)

derived from (7.9.1) with use of the Second Law of Thermodynamics, and the barometric equation

\frac{dp}{dz} = -\rho g    (7.9.3)

that governs the decrease in pressure with altitude in a uniform gravitational field, leads to expressions for the adiabatic lapse rate

\frac{dT}{dz} = -\frac{Mg}{R}\left(\frac{\gamma - 1}{\gamma}\right) \equiv -\frac{T_0}{h_{\mathrm{atm}}} \;\Rightarrow\; T(z) = T_0\left(1 - \frac{z}{h_{\mathrm{atm}}}\right)    (7.9.4)

and the adiabatic variation of density with altitude

\rho(z) = \rho_0\left(1 - \frac{z}{h_{\mathrm{atm}}}\right)^{\frac{1}{\gamma - 1}}    (7.9.5)

in which

h_{\mathrm{atm}} \equiv \frac{\gamma R T_0}{(\gamma - 1)\, M_{\mathrm{air}}\, g}    (7.9.6)

is defined as the adiabatic height of the atmosphere.


In the preceding equations, is the ratio of the specific heat at constant pressure to
the specific heat at constant volume, R 8.314 J/mol-K is the universal gas constant,
and 0, T0 are respectively the ground-level density and temperature. For (dry) air,
which is primarily a mixture of two diatomic gases, 78% N2 and 21% O2 by volume,
the mean molar mass is M 28.97 g, and the specific heat ratio ~ 1.40 is very close
to the value 7/5 predicted by quantum mechanics for a system of unexcited
diatomic molecules. The adiabatic height of the atmosphere (7.9.6) is hatm ~ 28 km for
a ground temperature T0 273 K (i.e. 0 C), and the theoretical (dry air) adiabatic
lapse rate constant in (7.9.4) is calculated to be (T0/hatm) 10 C/km. Since air in the
environment is not ordinarily dry, an average empirical lapse rate is approximately


6.5 °C/km, but I will use the theoretical value for maximal influence of altitude on air
temperature and density.
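
The thermodynamic relations (7.9.4)–(7.9.6) translate into a few lines of code; the sketch below computes hatm, the theoretical lapse rate, and the fractional density at a cruising altitude of 10 km for a 273 K ground temperature.

```python
# Adiabatic (isentropic) model of the troposphere, Eqs. (7.9.4)-(7.9.6).
R, g  = 8.314, 9.81       # gas constant (J/mol K), gravity (m/s^2)
gamma = 1.40              # specific heat ratio of dry air
M_air = 28.97e-3          # molar mass of dry air (kg/mol)
T0    = 273.0             # ground-level temperature (K)

h_atm = gamma * R * T0 / ((gamma - 1) * M_air * g)   # adiabatic height, ~28 km
lapse = T0 / h_atm                                    # lapse rate, ~10 K/km

def density_ratio(z):
    """rho(z)/rho_0 for the adiabatic atmosphere, Eq. (7.9.5)."""
    return (1.0 - z / h_atm) ** (1.0 / (gamma - 1.0))

print(f"h_atm = {h_atm/1e3:.1f} km, lapse rate = {lapse*1e3:.1f} K/km")
print(f"rho(10 km)/rho_0 = {density_ratio(10e3):.2f}")   # ~0.33
```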
The z axis in Eqs. (7.9.3) and (7.9.4) is oriented vertically upward with the origin at
ground level. However, when (in due course) we consider the descent of an aircraft, it
will also prove useful to employ a vertical axis oriented downward with the origin at
the initial height h of the falling object. Displacements measured downward from the
initial location will then be represented, as before, by sz (or dimensionless scaled
equivalent Sz) and displacements measured upward from the ground will be represented by z (or a scaled equivalent Z). The two sets of vertical coordinates are related
by sz + z = h or, as scaled variables, Sz + Z = H. With attention to symbolism, there
should be no confusion.
Consider now the application of Newton's Second Law of Motion with air drag,
Eq. (7.5.5), applied to a structure of total mass m comprising several separate but
attached plates, as in Figure 7.12, each of which contributes drag independently of

[Figure 7.12: schematic showing plates of plan areas S0, S1, S2, the x axis, and the relative wind −v with components −vx, −vy.]
Fig. 7.12 Schematic diagram of an airfoil with horizontal and vertical decelerators, modeled as plates with respective plan areas S0, S1, S2 moving relative to the air stream with velocity v and incidence α (as seen from the rest frame of the airfoil).


the others, with a surface either perpendicular to the horizontal (x axis) or facing downward (negative direction along the z axis). In the rest frame of the structure the wind blows with velocity of magnitude v at an incidence α to the x axis. Decomposing the equation into its horizontal and vertical components, one obtains a set of first-order nonlinear equations

\frac{dv_x}{dt} + \beta_x v_x^2 + \beta_z v_x v_z = 0
                                                          (7.9.7)
\frac{dv_z}{dt} + \beta_z v_z^2 + \beta_x v_x v_z = g

with velocity components

v_x = \frac{dx}{dt} = v\cos\alpha \qquad\qquad v_z = \frac{ds_z}{dt} = -\frac{dz}{dt} = v\sin\alpha.    (7.9.8)

The β-coefficients (which have dimension of inverse length)

\beta_x = \frac{\rho}{2m}\sum_{x\text{-plates}} C_i S_i \qquad\qquad \beta_z = \frac{\rho}{2m}\sum_{z\text{-plates}} C_i S_i    (7.9.9)

are the drag parameters (distinct from drag coefficients, which are dimensionless) of the x- and z-oriented plates.
The structure in Figure 7.12 is a plate model of a falling unpowered aircraft, in other words, just an elaborate projectile comprising only the essential components of wings (w), a single horizontal (or drogue) parachute (hp), and one or more vertical parachutes (vp). The components are characterized aerodynamically as plates of projective area and drag coefficient (S0, Cw), (S1, Chp), (S2, Cvp), respectively. Air resistance on these decelerators is due primarily to form drag (pressure) rather than skin drag (friction). For such a configuration, the drag parameters (7.9.9) take the simplified form

\beta_x = \frac{\rho\, C_{hp} S_1}{2m} \qquad\qquad \beta_z = \frac{\rho\left(C_w S_0 + n_p C_{vp} S_2\right)}{2m}    (7.9.10)
in which np is the number of vertical parachutes employed.


Insertion of expression (7.9.5) for the air density into (7.9.10), followed by substitution of the latter into (7.9.7), leads to equations of motion

v_z\frac{dv_x}{ds_z} + \left(\beta_{x0} v_x^2 + \beta_{z0} v_x v_z\right)\left[1 - \frac{h - s_z}{h_{\mathrm{atm}}}\right]^{\frac{1}{\gamma - 1}} = 0
                                                                                                                                               (7.9.11)
v_z\frac{dv_z}{ds_z} + \left(\beta_{z0} v_z^2 + \beta_{x0} v_x v_z\right)\left[1 - \frac{h - s_z}{h_{\mathrm{atm}}}\right]^{\frac{1}{\gamma - 1}} = g

in which the altitude-dependence is shown explicitly and (βx0, βz0) are the ground-level drag parameters defined by (7.9.10) for density ρ0. Equations (7.9.11) are

expressed in terms of a single independent variable sz, the time derivatives in (7.9.7)
having been eliminated by use of the chain rule
\frac{d}{dt} = \frac{ds_z}{dt}\frac{d}{ds_z} = v_z\frac{d}{ds_z}.    (7.9.12)

The final step in expressing the equations of motion (and their eventual solutions) is to transform them, as was done previously, into dimensionless form (V = v/vd, Z = z/sd, H = h/sd)

V_z\frac{dV_x}{dS_z} + \left(\lambda V_x^2 + V_x V_z\right)\left[1 - \mu\left(H - S_z\right)\right]^{\frac{1}{\gamma - 1}} = 0
                                                                                                                            (7.9.13)
V_z\frac{dV_z}{dS_z} + \left(V_z^2 + \lambda V_x V_z\right)\left[1 - \mu\left(H - S_z\right)\right]^{\frac{1}{\gamma - 1}} = 1

whereby velocity, time, and displacement are scaled by factors

v_d = \left[\frac{2mg}{\rho_0\left(C_w S_0 + n_p C_{vp} S_2\right)}\right]^{1/2} \qquad t_d = \frac{v_d}{g} \qquad s_d = v_d t_d = \frac{v_d^2}{g}    (7.9.14)

and

\lambda \equiv \frac{\beta_{x0}}{\beta_{z0}} = \frac{C_{hp} S_1}{C_w S_0 + n_p C_{vp} S_2} \qquad\qquad \mu \equiv \frac{s_d}{h_{\mathrm{atm}}} = \frac{v_d^2}{g\, h_{\mathrm{atm}}}    (7.9.15)

are dimensionless parameters. The aero-thermodynamic parameter μ can be interpreted as the ratio of the distance fallen from rest to about 93% of vertical terminal velocity in a homogeneous atmosphere to the adiabatic height of the atmosphere.42 In full generality Eqs. (7.9.13) require numerical solution. They can be solved analytically, however, for several important special cases.
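
A compact way to explore the general case is direct numerical integration of the coupled equations in scaled time; the sketch below does this for an unpowered descent with airfoil and drogue, using the drag parameters and aero-thermodynamic parameter quoted in the caption of Figure 7.13. It is a sketch with a standard ODE solver, not a reproduction of the author's own computation.

```python
# Minimal sketch: unpowered 2D descent through an adiabatic atmosphere,
# equivalent to Eqs. (7.9.13) but integrated in scaled time T = t/td.
# Parameters are those quoted in the caption of Figure 7.13.
import numpy as np
from scipy.integrate import solve_ivp

g = 9.81
beta_x0, beta_z0 = 0.0722, 0.1125   # ground-level drag parameters (s^-1), Fig. 7.13
mu, gamma = 0.0278, 1.40            # aero-thermodynamic parameter, specific heat ratio
vd = g / beta_z0                    # velocity scale (m/s), ~87
td, sd = vd / g, vd**2 / g          # time and length scales
lam = beta_x0 / beta_z0             # ratio of horizontal to vertical drag parameters

h = 10e3                            # initial altitude (m)
H = h / sd                          # scaled initial altitude
Vx0, Vz0 = 250.0 / vd, 1.0 / vd     # scaled initial velocities (Fig. 7.13)

def rhs(T, u):
    Vx, Vz, Sz = u
    dens = (1.0 - mu * (H - Sz)) ** (1.0 / (gamma - 1.0))   # rho(z)/rho_0, Eq. (7.9.5)
    return [-(lam * Vx**2 + Vx * Vz) * dens,
            1.0 - (Vz**2 + lam * Vx * Vz) * dens,
            Vz]

drop4km = lambda T, u: u[2] - 4e3 / sd          # event: plane has dropped 4 km
drop4km.terminal, drop4km.direction = True, 1
sol = solve_ivp(rhs, [0.0, 200.0], [Vx0, Vz0, 0.0], events=drop4km, max_step=0.05)

t4 = sol.t[-1] * td
vx4, vz4 = sol.y[0, -1] * vd, sol.y[1, -1] * vd
print(f"after a 4 km drop: t = {t4:.1f} s, vx = {vx4:.1f} m/s, vz = {vz4:.1f} m/s")
# Compare with the 'Freefall w. drogue' row of Table 7.14 (~40 s, ~13 m/s, ~118 m/s).
```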

7.9.1 Stationary solution


Setting dVx/dSz = dVz/dSz = 0 leads to velocity components

V_{s,x} = \frac{v_{s,x}}{v_d} = 0    (7.9.16)

V_{s,z} = \frac{dS_z}{dT} = \left[1 - \mu\left(H - S_z\right)\right]^{-\frac{1}{2(\gamma - 1)}}.    (7.9.17)
In the cases previously treated of projectile motion through a homogeneous fluid medium (equivalent to setting μ = 0 in (7.9.17)), the stationary vertical solution was Vs,z = 1, corresponding to a terminal velocity equal to the velocity scale factor vd. Note that Vs,z in (7.9.17) is not a terminal velocity because it depends on the altitude,
42 In the approximation of uncoupled 1D motion, the distance Sz ≡ sz/sd = 1 fallen vertically from rest leads to vertical velocity Vz(Sz = 1) = (1 − e⁻²)^(1/2) ≈ 0.93.


which itself varies in time, and one cannot take a limit t → ∞ because of the restriction Sz ≤ H. Rather, we will see from exact numerical solutions of the equations that the vertical velocity Vz can reach a maximum magnitude greater than 1 before decreasing toward the limit Vs,z = 1.

7.9.2 Vertical descent through a linear compressible atmosphere


When the altitude of a projectile is low in comparison with hatm, the expression for air density (7.9.5) can be expanded in a Taylor series truncated at first order in z/hatm. Then, in the absence of a horizontal velocity component, Eq. (7.9.13) reduces to a linear, first-order differential equation in V_z^2(Z)

\frac{dV_z^2}{dZ} - 2\left(1 - \mu_0 Z\right)V_z^2 = -2 \qquad\qquad \mu_0 \equiv \frac{\mu}{\gamma - 1}    (7.9.19)

expressible directly in terms of altitude Z. Eq. (7.9.19) can be solved exactly by means of an integrating factor to yield the expression

V_z^2(Z) = V_{z0}^2\, e^{\mu_0\left(H^2 - Z^2\right) - 2\left(H - Z\right)} + \frac{2\, e^{-1/\mu_0}}{\sqrt{\mu_0}}\, e^{2Z - \mu_0 Z^2} \int_{\sqrt{\mu_0}\left(Z - \mu_0^{-1}\right)}^{\sqrt{\mu_0}\left(H - \mu_0^{-1}\right)} e^{u^2}\, du    (7.9.20)

with initial condition Vz(H) = Vz0. The relation between velocity and time must be obtained by integration, T = \int_Z^H \left[V_z(u)\right]^{-1} du.
Z

7.9.3 One-dimensional horizontal and vertical descent through a homogeneous atmosphere
Uncoupling the x and z components of (7.9.13) with μ = 0 leads to integrable
equations whose solutions with initial velocities Vx0, Vy0, although given in part
previously, are summarized in their totality in Table 7.13 for scaled variables.
Figure 7.13 shows the variation in velocities Vx, Vz with time T for the exact 2D
theory (7.9.13) (solid) and decoupled 1D approximation (dashed) (Table 7.13) for the
unpowered descent of an aircraft cruising horizontally with parameters pertinent to
the freefall of a Boeing 747, which will be discussed shortly. Two notable features are
(a) the faster decline of the horizontal velocity (black) with time in the 2D theory, and
(b) the rise of the vertical velocity (gray) above the terminal limit Vz = 1, with subsequent decline to Vz = 1 in the 2D theory. Also shown in the figure are the 2D velocity profiles in the case of zero horizontal drag (λ = 0) (dotted). In marked
contrast to the 1D case for which there would be no horizontal deceleration, the


Table 7.13  Solutions to drag equations for homogeneous density

Component                  Scaled variables
Horizontal velocity        V_x(T) \equiv \frac{v_x(t)}{v_d} = \frac{V_{x0}}{V_{x0} T + 1} \qquad\quad V_x(S_x) \equiv \frac{v_x(s_x)}{v_d} = V_{x0}\, e^{-S_x}
Horizontal acceleration    A_x(T) \equiv \frac{a_x}{g} = -\frac{V_{x0}^2}{\left(V_{x0} T + 1\right)^2}
Horizontal displacement    S_x(T) \equiv \frac{s_x}{s_d} = \ln\left(V_{x0} T + 1\right)
Vertical velocity          V_z(T) \equiv \frac{v_z(t)}{v_d} = \frac{\tanh T + V_{z0}}{1 + V_{z0}\tanh T} \qquad\quad V_z(S_z) \equiv \frac{v_z(s_z)}{v_d} = \left[1 - \left(1 - V_{z0}^2\right)e^{-2S_z}\right]^{1/2}
Vertical acceleration      A_z(T) \equiv \frac{a_z}{g} = \frac{1 - V_{z0}^2}{\left(\cosh T + V_{z0}\sinh T\right)^2}
Vertical displacement      S_z(T) \equiv \frac{s_z}{s_d} = \ln\left(\cosh T + V_{z0}\sinh T\right)

coupling of Vx and Vz in the exact 2D analysis generates a horizontal deceleration


comparable to that achievable with a drogue parachute.
The theory developed above, which facilitates realistic modeling of the impact of
temperature and density variations on air drag and serves as a model for extension to
more general polytropic atmospheres, permits one to examine protocols to bring to
ground an unpowered general aviation aircraft with decelerations at all stages of the
descent within a range of passenger survivability, i.e. ~10g. For illustrative purposes, I consider a plane comparable to a Boeing 747–100 Jumbo Jet, whose relevant features are recorded in Table 7.10.
For in-air recovery of a crippled B747 a horizontal parachute would be deployed first from the rear to reduce cruising speed from ~250 m/s to ~50 m/s, followed by symmetrical deployment, from ports along the upper surface of the fuselage, of one or more vertical parachutes to decelerate the rate of descent to a survivable terminal velocity vt,z. If lc is the impact deceleration length at the ground, then the objective of the rescue protocol is to insure an impact deceleration ac/g = v²t,z/(2g lc) below 10. Vertical decelerators comparable to the commercially available G-11 cargo parachute43 of nominal radius R = 15.24 m, surface area Sp = πR² = 729.7 m², and mass
43 G-11 Cargo Parachute Assembly Technical Data Sheet, Mills Manufacturing Corporation, http://www.millsmanufacturing.com/files/G-11%20Tech%20Data%20Sheet.pdf/view



[Figure 7.13: plot of Velocity (scaled) versus Time (scaled), showing the Horizontal and Vertical velocity components.]
Fig. 7.13 Time variation of horizontal (black) and vertical (gray) components of velocity for an unpowered B747–100 with air drag provided by: (1) airfoil and drogue parachute calculated by the exact 2D theory (solid) and uncoupled 1D theory (dashed); (2) airfoil without drogue calculated by exact 2D theory (dotted). The initial altitude is h = 10 km; initial velocity components (m/s) are vx = 250, vy = 1. Plan areas (m²) are Sairfoil = 511, Sdrogue = 182.4. Drag parameters (s⁻¹) are βx0 = 0.0722, βy0 = 0.1125 with aero-thermodynamic parameter μ = 0.0278. The dashed line at the ordinate 1 marks the terminal vertical velocity.

113.4 kg would suffice, with a corresponding parachute of radius R/2 for the drogue. At the time I first looked into the matter, the manufacturer packaged these parachutes in clusters up to 8. Sequential deployment symmetrically over the fuselage makes it possible to reduce vertical impact with the ground to a level below that of individual military parachutists, (10–15)g.44
The drag coefficient of a parachute Cp depends on shape and venting, and the spread of values I found in the literature ranged from about 1.3 to 2.4 depending on the mode of ascent.45 For illustrative purposes, I adopted Cp = 1.5, which is a little larger than the drag coefficient Cplate = 1.3 of a plate (the wings) of aspect ratio 7.0 at high Reynolds number.46 Given the maximum take-off mass in Table 7.10 and STP

44 J. R. Davis, R. Johnson, and J. Stepanek, Fundamentals of Aerospace Medicine (Lippincott Williams and Wilkins, Philadelphia, 2008) 675–676.
45 See, for example, (a) P. Wegener, What Makes Airplanes Fly?: History, Science, and Applications of Aerodynamics (Springer, New York, 1991) 107; (b) Parachute Descent Calculations, http://my.execpc.com/~culp/rockets/descent.html#Velocity
46 R. W. Fox and A. T. McDonald, Introduction to Fluid Mechanics, 4th Edition (Wiley, New York, 1992) 442, 468.


ground-level values for air density (ρ0 = 1.294 kg/m³) and temperature (273 K), the parameters in (7.9.13) become βx0 = 0.0722 s⁻¹ and βz0 = (0.0126 + 0.0208 np)^(1/2) s⁻¹. Since Cd/Cp ~ 0.021, we can ignore the contribution of the frictional drag on the aircraft when treating the horizontal deceleration.
Upon solving the drag equations (7.9.13) with use of the foregoing parameters, one finds that an unpowered B747 in unaided freefall would decelerate horizontally to 50 m/s in 25.3 s while descending 2.01 km from an initial height of 10 km, and attain a vertical velocity of 117 m/s. Deployment at that point of 24 G-11 parachutes would bring the plane to a terminal velocity of 13.7 m/s, thereby subjecting passengers to an initial deceleration a0/g ~ 33.4, which, while not necessarily life-threatening, is nevertheless beyond the assumed level of tolerance. In a safe recovery, the parachutes must be deployed sequentially and in a manner that keeps the wings parallel to the ground (flat descent) so as to avoid unduly large initial accelerations.
An example of such a protocol, again obtained from numerical solution of Eqs. (7.9.13), might unfold as follows. A B747, cruising at 250 m/s at 10 km, becomes disabled; all engines fail or are shut off intentionally to effect the recovery. The drogue is deployed while the plane drops 4 km, which reduces the horizontal velocity to 12.6 m/s and increases the vertical velocity to 118 m/s in about 40.6 s, at which time six vertical parachutes are deployed symmetrically in three groups of two along the fuselage. These decelerate the aircraft vertically to 30.5 m/s and horizontally to nearly 0 m/s, with peak deceleration amax/g ~ 10, which decreases rapidly in time; the 5-second time-averaged deceleration is aav(5 s) ~ 1.6g. Then 18 more G-11 parachutes are deployed symmetrically in three groups of six, the total of 24 decelerating the aircraft (amax ~ 2.7g; aav(5 s) ~ 0.3g) to a terminal velocity of 13.7 m/s, at which it falls the remaining distance to ground. The plane strikes the ground flat, compressing the cargo hold 2 m to produce an impact deceleration of less than 5g. Table 7.14 summarizes the kinematic details of the vertical descent from an initial altitude of 10 km both with and without use of a drogue. The two cases result in nearly the same maximum decelerations and a difference in cumulative horizontal displacement of less than 2 km.
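As an illustration of how such a staged protocol can be integrated numerically, the sketch below steps a simple drag model through successive deployments. Since Eqs. (7.9.13) are not reproduced in this excerpt, the drag law (linear in velocity, with ground-level coefficients scaled by an exponential atmosphere), the scale height, and the staging logic are all assumptions made for illustration; the code shows the structure of the calculation rather than reproducing Table 7.14 exactly.

```python
# Minimal sketch of a staged parachute-descent integration (illustrative only).
# Assumed model: dvx/dt = -bx(y)*vx, dvy/dt = g - by(y)*vy, with the drag
# coefficients scaled by an exponential atmosphere; NOT the book's Eqs. (7.9.13).
import math

G, H = 9.8, 8000.0                         # gravity (m/s^2), assumed scale height (m)

def rho_ratio(y):
    """Air density at altitude y relative to its ground-level value (assumed)."""
    return math.exp(-y / H)

def descend(y0, betas, y_deploy, dt=0.01):
    """Integrate until touchdown, stepping the drag up at each deployment altitude."""
    y, vx, vy, t, stage = y0, 250.0, 1.0, 0.0, 0
    while y > 0.0:
        if stage < len(y_deploy) and y <= y_deploy[stage]:
            stage += 1                     # next group of parachutes opens
        bx, by = betas[stage]
        r = rho_ratio(y)
        vx += -bx * r * vx * dt            # horizontal deceleration
        vy += (G - by * r * vy) * dt       # vertical acceleration (down positive)
        y -= vy * dt
        t += dt
    return t, vx, vy                       # descent time and touchdown velocities

# Ground-level drag parameters (s^-1) loosely patterned on the text's figures.
betas = [(0.072, 0.112), (0.072, 0.371), (0.072, 0.716)]
print(descend(10000.0, betas, y_deploy=[6000.0, 3000.0]))
```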
The preceding summary does not take account of the opening time of the parachute canopy, for which the mean delay t of a G-11 is about 5.3 s.47 In numerous simulations, however, I found that taking account of the delay by including suitable time-dependent opening functions in Eqs. (7.9.13) did not change perceptibly the numerical results of Table 7.14, since the delay is very much less than the descent time at each deployment stage.
It is worth noting that calculations were also performed for lower initial altitudes.
At lower altitudes the density of the air and therefore the drag on the parachutes

47 W. R. Lewis, Minimum Airdrop Altitudes for Mass Parachute Delivery of Personnel and Material Using Existing Standard Parachute Equipment, ADED Report 642 (US Army Natick Laboratories, Natick, Massachusetts, April 1964) 11.

Table 7.14  Parachute-assisted descent of a disabled B747-100 aircraft

Action                  np   βx (s⁻¹)  βy (s⁻¹)  yinitial (km)  vx,initial (m/s)  vx,final (m/s)  vy,initial (m/s)  vy,final (m/s)  ΔT (s)   sx (km)  sy (km)  a0/g
Freefall w. drogue       0   0.072     0.112     10             250               12.6            1.0               118             40.6     3.4      4        1.5
Deploy 6 (2, 2, 2)       6   0.072     0.371     6              12.6              0               118               30.5            88.0     0        3        10.0
Deploy 18 (6, 6, 6)      24  0.072     0.716     3              0                 0               30.5              13.7            203.0    0        3        2.7
Accumulated intervals                                                                                                               331.6    3.4      10
Freefall w/o drogue      0   0         0.112     10             250               26.9            1.0               123             38.7     5.1      4        1.0
Deploy 6 (2, 2, 2)       6   0         0.371     6              26.9              0               123               30.6            87.8     0.1      3        11.1
Deploy 18 (6, 6, 6)      24  0         0.716     3              0                 0               30.5              13.7            203      0        3        2.7
Accumulated intervals                                                                                                               329.5    5.2      10

Decel. length      ac/g
Lc = 2 m           4.8
Lc = 3 m           3.2


is greater, but the distance for recovery is of course shorter. Nevertheless, I found that the recovery protocol is still sufficient. A recovery initiated at an altitude of only 4 km, with deployments at 2.5 km and 1.5 km, also led to landings with peak deployment accelerations and impact decelerations below 10g.
The computer simulation of numerous airplane recoveries bolstered my confidence in the idea that sequential, symmetric deployment of vertical and horizontal decelerators with drag parameters comparable to those of available parachutes can bring a large general aviation airplane down safely in flat descent without subjecting passengers to accelerations exceeding ~10g. Current barriers to such recovery are not aerodynamic, but at most material. The peak horizontal drag exerted by an air stream at 10 km altitude with relative velocity of 250 m/s on a 7.62 m radius drogue is ~3.7 MN, which amounts to a tension of 30.5 kN in each of the 120 suspension lines of diameter about 3.175 mm (1/8 inch), thereby requiring a tensile strength of about 3.9 GPa. Although a drogue may in fact be dispensable, peak drag on a G-11 vertical parachute corresponding to a relative vertical wind speed of ~123 m/s is ~3.5 MN, thereby requiring nearly the same tensile strength of 3.7 GPa. The tensile strength of the currently used Type III nylon cord is about 309 MPa.48 (Pressure constraints on the canopies are much less severe; peak drag overpressure on the drogue was ~0.20 atm in the preceding analysis.)
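The load figures above follow from elementary estimates of the form F = ½ρCpAv². The short check below uses a standard-atmosphere density at 10 km (≈0.41 kg/m³), which is my assumption rather than a value taken from the text, so the numbers agree only to within roughly 10%.

```python
# Order-of-magnitude check of drogue drag, per-line tension, and required
# tensile strength, assuming the quadratic drag law F = (1/2)*rho*Cp*(pi R^2)*v^2.
import math

rho_10km = 0.41                  # kg/m^3, assumed standard-atmosphere value at 10 km
Cp, R, v = 1.5, 7.62, 250.0      # drag coefficient, drogue radius (m), airspeed (m/s)
n_lines, d_line = 120, 3.175e-3  # suspension lines and their diameter (m)

F = 0.5 * rho_10km * Cp * math.pi * R**2 * v**2    # total drag force (N)
T = F / n_lines                                    # tension per suspension line (N)
sigma = T / (math.pi * (d_line / 2)**2)            # tensile stress in a line (Pa)
print(f"drag {F/1e6:.1f} MN, tension {T/1e3:.1f} kN, stress {sigma/1e9:.1f} GPa")
```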
There exist other materials, however, whose tensile strength is already within the range needed and which may serve as precursors to suitable replacements for Nylon, such as
(a) Vectran (2.9–3.3 GPa), an aromatic polyester spun from a liquid-crystal polymer,49
(b) Zylon (5.8 GPa), a thermoset liquid crystalline polybenzoxazole,50 and
(c) fiber glasses such as E-Glass (3.5 GPa) and S-Glass (4.7 GPa).
Potentially new materials of extraordinary tensile strength may eventually be fabricated from allotropes of carbon with cylindrical nanostructure (C-nanotubes), which have the highest tensile strength of any known material (composites 2.3–14.2 GPa; single fibers of 22.2 GPa).51 Successful implementation of the recovery protocols may also call for distributing the reaction force of the suspension lines over space or time to avoid structural damage at sites of attachment. This should be achievable by appropriate design and canopy shapes, controlled timing of canopy opening, and use of extensible materials.

48 Nylon Cord PIA-C-5040/MilC-5040 Technical Data Sheet, Mills Manufacturing Corporation, http://www.millsmanufacturing.com/files/Miltex-Tech%20Sheet.pdf/view.
49 R. B. Fette and M. F. Sovinski, Vectran Fiber Time-Dependent Behavior and Additional Static Loading Properties (NASA/TM-2004-212773) 13.
50 Tensile strength, http://en.wikipedia.org/wiki/Tensile_strength
51 F. Li, H. M. Cheng, S. Bai, G. Su, and M. S. Dresselhaus, Tensile strength of single-walled carbon nanotubes directly measured from their macroscopic ropes, Applied Physics Letters 77 (2000) 3161–3163.

Appendices

7.10 Distribution and variation of projectile range R(V, Θ)

In the general case, the projectile range

R = \frac{V^2}{g}\,\sin 2\Theta \qquad (7.10.1)

is a function of two random variables V and Θ, which are assumed here to be independently distributed with densities pV(v) and pΘ(θ). Thus

p_{V,\Theta}(v,\theta) = p_V(v)\,p_\Theta(\theta). \qquad (7.10.2)

To express pR(r) directly in terms of r, set R = XY where X = V²/g and Y = sin(2Θ) and apply the rules for transforming pdfs to obtain

p_R(r) = \int_0^{\infty} \frac{1}{|x|}\, p_X(x)\, p_Y\!\left(\frac{r}{x}\right) dx = \int_0^{\infty} \frac{1}{|x|}\, p_X(x)\, \frac{p_\Theta\!\left(\tfrac{1}{2}\sin^{-1}(r/x)\right)}{2\sqrt{1-(r/x)^2}}\, dx, \qquad (7.10.3)

where

p_Y(y) = \frac{p_\Theta\!\left(\tfrac{1}{2}\sin^{-1} y\right)}{2\sqrt{1-y^2}}. \qquad (7.10.4)

This procedure can be applied again if it is desired to express pX(x) in terms of the density pV(v). Substitution of specific densities for speed and angle into (7.10.3) and (7.10.4) will generally lead to complicated mathematical expressions and, depending on the choice of parameters, may also raise subtle issues regarding the range of the angle variable.
It is not necessary, however, to use (7.10.3) to calculate the variance of the range

\sigma_R^2 = g^{-2}\left[\langle V^4\rangle\langle\sin^2 2\Theta\rangle - \langle V^2\rangle^2\langle\sin 2\Theta\rangle^2\right]. \qquad (7.10.5)

Let us suppose as an illustration that the speed and angle are distributed normally according to

p_V(v) = N(V_0,\, \sigma_V^2) \qquad\quad p_\Theta(\theta) = N(\theta_0,\, \sigma_\Theta^2) \qquad (7.10.6)

where σV/V0 and σΘ/θ0 are both small enough that we need not be concerned with the occurrence of nonphysical negative values of the variables. Then, as has been shown previously, the second and fourth moments of speed lead to the relations

\langle V^2\rangle = V_0^2\left[1 + \left(\frac{\sigma_V}{V_0}\right)^2\right] \qquad (7.10.7)

\langle V^4\rangle = V_0^4\left[1 + 6\left(\frac{\sigma_V}{V_0}\right)^2 + 3\left(\frac{\sigma_V}{V_0}\right)^4\right]. \qquad (7.10.8)

To evaluate the Gaussian integral of a sine or cosine function, make use of the characteristic function hΘ(t) of the random variable Θ, defined by the expectation ⟨e^{iΘt}⟩. With a range of integration −∞ ≤ θ ≤ ∞, the Gaussian integral yields the closed-form expression

\langle e^{i\Theta t}\rangle = \frac{1}{\sqrt{2\pi}\,\sigma_\Theta}\int_{-\infty}^{\infty} e^{i\theta t}\, e^{-(\theta-\theta_0)^2/2\sigma_\Theta^2}\, d\theta = e^{i\theta_0 t - \frac{1}{2}\sigma_\Theta^2 t^2}. \qquad (7.10.9)

From Eq. (7.10.9) and the Euler relations

\sin n\theta = \frac{1}{2i}\left(e^{in\theta} - e^{-in\theta}\right) \qquad\quad \cos n\theta = \frac{1}{2}\left(e^{in\theta} + e^{-in\theta}\right) \qquad (7.10.10)

then follow the expectations

\langle \sin n\Theta\rangle = e^{-\frac{1}{2}n^2\sigma_\Theta^2}\,\sin n\theta_0 \qquad (7.10.11)

\langle \cos n\Theta\rangle = e^{-\frac{1}{2}n^2\sigma_\Theta^2}\,\cos n\theta_0 \qquad (7.10.12)

\langle \sin^2 n\Theta\rangle = \frac{1}{2}\left[1 - e^{-2n^2\sigma_\Theta^2}\,\cos 2n\theta_0\right]. \qquad (7.10.13)
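The Gaussian expectation values (7.10.11)–(7.10.13) are easy to confirm numerically; the parameter values in the following sketch are arbitrary choices made only for illustration.

```python
# Monte Carlo spot-check of (7.10.11)-(7.10.13) for Theta ~ N(theta0, s^2).
import numpy as np

n, theta0, s = 2, 0.6, 0.05                     # illustrative values (radians)
theta = np.random.default_rng(3).normal(theta0, s, 10**6)
print(np.mean(np.sin(n*theta)), np.exp(-0.5*n**2*s**2) * np.sin(n*theta0))
print(np.mean(np.cos(n*theta)), np.exp(-0.5*n**2*s**2) * np.cos(n*theta0))
print(np.mean(np.sin(n*theta)**2), 0.5*(1 - np.exp(-2*n**2*s**2) * np.cos(2*n*theta0)))
```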

Equations (7.10.5) and (7.10.11)–(7.10.13) lead to the following exact expression for the variance of the range

\sigma_R^2 = \frac{V_0^4}{g^2}\left\{\left[1 + 6\left(\frac{\sigma_V}{V_0}\right)^2 + 3\left(\frac{\sigma_V}{V_0}\right)^4\right]\frac{1}{2}\left(1 - e^{-8\sigma_\Theta^2}\cos 4\theta_0\right) - \left[1 + \left(\frac{\sigma_V}{V_0}\right)^2\right]^2 e^{-4\sigma_\Theta^2}\sin^2 2\theta_0\right\}. \qquad (7.10.14)

Upon retaining terms to first order in the variances \sigma_\Theta^2 and \sigma_V^2, Eq. (7.10.14) reduces to

\sigma_R^2 \approx \frac{4V_0^4}{g^2}\left[\left(\frac{\sigma_V}{V_0}\right)^2\sin^2 2\theta_0 + \sigma_\Theta^2\cos^2 2\theta_0\right], \qquad (7.10.15)

which could also have been obtained more simply by taking the differential of (7.10.1) and then arbitrarily combining the two terms in quadrature, a procedure


that is not rigorous, but frequently justified by heuristic arguments. The derivation given here, however, leads directly to the correct combination of component variances. Note, too, that σR², in the absence of the corresponding distribution function, gives no information about confidence limits, i.e. the probability that a measurement of R falls within some specified interval (e.g. ±σR) about the true mean (i.e. population mean). The Central Limit Theorem can be used to estimate the sample mean R̄ of a large number (in principle, infinite number) of measurements, and one could employ the Weak Law of Large Numbers, as we have done in Section 7.3, to make such an estimate, but it will usually lead to a broader inequality than necessary, as was shown in regard to skewness.
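A quick Monte Carlo experiment confirms that (7.10.14) and its first-order form (7.10.15) reproduce the sample variance of simulated ranges; the launch parameters below are arbitrary illustrative choices, not values taken from the text.

```python
# Monte Carlo check of the range variance (7.10.14) and its approximation (7.10.15).
import numpy as np

g, V0, th0 = 9.8, 100.0, np.radians(35.0)      # illustrative launch parameters
sV, sth = 2.0, np.radians(1.0)                 # speed and angle spreads

rng = np.random.default_rng(1)
V = rng.normal(V0, sV, 10**6)
th = rng.normal(th0, sth, 10**6)
R = V**2 / g * np.sin(2 * th)                  # simulated ranges, Eq. (7.10.1)

r2, s2 = (sV / V0)**2, sth**2
exact = (V0**4 / g**2) * ((1 + 6*r2 + 3*r2**2) * 0.5 * (1 - np.exp(-8*s2)*np.cos(4*th0))
                          - (1 + r2)**2 * np.exp(-4*s2) * np.sin(2*th0)**2)
approx = (4 * V0**4 / g**2) * (r2 * np.sin(2*th0)**2 + s2 * np.cos(2*th0)**2)
print(np.var(R), exact, approx)                # the three values should nearly coincide
```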

7.11 Unbiased estimator of skewness

The problem is to estimate from n samples Zi (i = 1 . . . n) of the same population with density pZ(z) the expectation of the third moment of Z about the mean

\int (z - \langle Z\rangle)^3\, p_Z(z)\, dz = f(n)^{-1}\left\langle \sum_{i=1}^{n}\left(Z_i - \bar{Z}\right)^3\right\rangle \qquad (7.11.1)

where the unbiased estimator of the mean \bar{Z} is

\bar{Z} = \frac{1}{n}\sum_{i=1}^{n} Z_i, \qquad (7.11.2)

and f(n) is a function of the sample size to be determined.


Expansion of the right side of (7.11.1) leads to four terms, three of which contain Z̄ in the argument of the expectation operator. Unlike ⟨Z⟩, which is a constant parameter of the distribution, Z̄, as defined in (7.11.2), is a sum of random variables. Consider one of the three terms

\left\langle \sum_{i=1}^{n} \bar{Z}^3 \right\rangle = n\langle \bar{Z}^3\rangle = n\,\frac{1}{n^3}\sum_{i,j,k}^{n}\langle Z_i Z_j Z_k\rangle = \frac{1}{n^2}\sum_{i,j,k}^{n}\langle Z_i Z_j Z_k\rangle. \qquad (7.11.3)

The sum in (7.11.3) can be decomposed into three terms in which (a) three indices are equal, (b) two indices are equal, and (c) no indices are equal. There are three pairs of equal indices in (b): i = j, i = k, j = k. Thus, the sum becomes

\sum_{i,j,k}^{n}\langle Z_i Z_j Z_k\rangle = \sum_{i}^{n}\langle Z_i^3\rangle + 3\sum_{i\ne j}^{n}\langle Z_i^2\rangle\langle Z_j\rangle + \sum_{i\ne j\ne k}^{n}\langle Z_i\rangle\langle Z_j\rangle\langle Z_k\rangle, \qquad (7.11.4)

which is reducible to

\sum_{i,j,k}^{n}\langle Z_i Z_j Z_k\rangle = n\langle Z^3\rangle + 3n(n-1)\langle Z^2\rangle\langle Z\rangle + n(n-1)(n-2)\langle Z\rangle^3. \qquad (7.11.5)


Substitution of (7.11.5) into (7.11.3), and evaluation and combination in analogous manner of the other terms in (7.11.1), lead to the expression

\left\langle \sum_{i=1}^{n}\left(Z_i - \bar{Z}\right)^3\right\rangle = \frac{(n-1)(n-2)}{n}\left[\langle Z^3\rangle - 3\langle Z^2\rangle\langle Z\rangle + 2\langle Z\rangle^3\right] = f(n)\left\langle\left(Z - \langle Z\rangle\right)^3\right\rangle \qquad (7.11.6)

from which we obtain the function f(n) = (n − 1)(n − 2)/n by which

\left\langle \sum_{i=1}^{n}\left(Z_i - \bar{Z}\right)^3\right\rangle

must be divided in order to be an unbiased estimator, as expressed in (7.4.3).
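The result is easy to test empirically: dividing the sum of cubed deviations by f(n) recovers the population's third central moment on average, whereas dividing by n does not. The exponential population and sample size below are arbitrary choices for illustration.

```python
# Numerical check that sum((Z_i - Zbar)^3) / f(n), with f(n) = (n-1)(n-2)/n,
# is an unbiased estimator of the third central moment (here = 2 for Exp(1)).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
samples = rng.exponential(1.0, size=(trials, n))
zbar = samples.mean(axis=1, keepdims=True)
s3 = ((samples - zbar)**3).sum(axis=1)

fn = (n - 1) * (n - 2) / n
print((s3 / fn).mean())    # ~2.0, the true third central moment
print((s3 / n).mean())     # naive estimate, noticeably biased low
```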

8
The guesses of groups

Modern statisticians are familiar with the notion that any finite
body of data contains only a limited amount of information on
any point under examination; that this limit is set by the nature of
the data themselves, and cannot be increased by any amount of
ingenuity expended in their statistical examination: that the statistician's task, in fact, is limited to the extraction of the whole of the
available information on any particular issue.
R. A. Fisher1

8.1 A radical hypothesis


Not long after it was published, I came across, quite by chance, a book2 that argued
in support of a bizarrely radical idea that I found increasingly disturbing the more
I read of it. Briefly, the selling point, which made the idea radical rather than merely
interesting, and which undoubtedly helped propel the book onto the best-seller lists,
was the claim that a group of randomly chosen ordinary people will give in the
aggregate a more accurate answer to some question (virtually any question) than
experts with specialized knowledge in that area.
Could that be? How could that be? Being a physicist, I resolved to do experiments
to find out for myself.
This chapter relates the outcomes of these experiments and the theoretical insights
I drew from them.3 Far from being merely an amusing (if not initially irritating)
narrative, the book actually raised in my mind a fundamental statistical question of
broad significance: how can useful information (if there is any) be mined from the
disparate responses of a group largely, but perhaps not entirely, comprised of nonspecialists? Indeed, by what means can one decide whether there actually is information in the sample of responses? I will answer these questions in due course.

1 R. A. Fisher, The Design of Experiments (Oliver and Boyd, 1935) 44–45.
2 J. Surowiecki, The Wisdom of Crowds (Random House, New York, 2004). Quotations cited from pages xii, xiii,
3 M. P. Silverman, Review of Wisdom of Crowds by J. Surowiecki, American Journal of Physics 75 (2007) 190–192.


Richard Feynman, as the reader probably knows, was one of the most colorful
American physicists of the twentieth century. Creator of his own version of quantum
mechanics based on path integrals, and seminal contributor to the formulation of
quantum electrodynamics, Feynman was also an entertaining raconteur of his life's experiences. In one of his narratives4 describing the tribulations of serving on a California state commission charged with the selection of high school mathematics textbooks, he related a brief fable about the length of the Emperor of China's nose. So exalted was the Emperor of China that no one was permitted to see him, and the question in people's minds was: how long is the Emperor's nose? To find out,
someone (according to the narrative) asked people all over China what they thought
was the length and then averaged all the results. Evidently, this average was considered to be accurate because the sample was large and representative.
Feynman's message, however, which would seem the embodiment of common
sense, was that averaging a lot of uninformed guesses does not provide reliable
information. Yet, in a nutshell, this was exactly what the book I read appeared to
advocate as the most reliable way to acquire information.
The book, a New York Times Business Bestseller titled The Wisdom of Crowds (to
be abbreviated in this essay as WOC), was not concerned with finding the length of
the Emperor's nose. It began instead with an anecdote relating to the weight of a
dressed ox, which the visitors to the annual West of England Fat Stock and Poultry
Exhibition could bet on for a sixpence ticket. The 1906 competition is noteworthy in
that it was attended by the English polymath and statistical innovator, Francis
Galton, well known for his anthropometric studies of human physical and mental
characteristics and their correlation with good breeding. Galton's experiments did not give him a high opinion of the average person, whose "stupidity and wrongheadedness . . . [was] . . . so great as to be scarcely credible." Not to miss an opportunity to reconfirm his opinion, Galton borrowed the tickets after the awarding of prizes and made a simple statistical analysis to determine the shape of the distribution (a bell-shaped curve? . . . we are not told) and the mean value of the participants' guesses. According to WOC, "The crowd had guessed that the ox, after it had been slaughtered and dressed, would weigh 1197 pounds. After it had been slaughtered and dressed, the ox weighed 1198 pounds. In other words, the crowd's judgment was essentially perfect."
What is one to make of that agreement: that the story was apocryphal, an
exaggeration, a coincidence? Curious about the authenticity of the event,
I researched Galton's published articles and indeed I found that he described his experiment at the Exhibition in a short paper published in Nature in 1907 under the title "Vox Populi", i.e. the voice of the people.5 Galton began his account with the
words
4 R. P. Feynman, Surely You're Joking, Mr. Feynman! (W. W. Norton, New York, 1985) 295–296.
5 F. Galton, Vox Populi, Nature 75, No. 1949 (March 7, 1907) 450–451.


In these democratic days, any investigation into the trustworthiness and peculiarities of
popular judgments is of interest.

Galton then decided that


According to the democratic principle of one vote one value, the middlemost estimate
expresses the vox populi, every other estimate being condemned as too low or high by a
majority of the voters.

In less descriptive and more modern terminology, Galton had tallied the guesses and
found the median, which turned out to be 1207 lbs, a value too high by a mere 0.8%.
Surprised and impressed, he concluded
This result is, I think, more creditable to the trustworthiness of a democratic judgment than
might have been expected.

Having accepted, therefore, the WOC account of Galton at the fair to be a more
or less accurate description of an actual incident, it seemed to me not unreasonable
at first to believe that few people at the Exhibition were likely to have had any
experience in slaughtering and dressing oxen. Many were probably tradesmen or
professionals from town (carpenters, blacksmiths, coopers, lawyers, bankers, physicians, and the like) or maybe vegetable or poultry farmers. Thus, one might have
expected – as perhaps Galton did – the group average to deviate widely from the true weight. In the words of WOC, ". . . mix a few very smart people with some mediocre people and a lot of dumb people, and it seems likely you'd end up with a dumb answer."
On further reflection, however, the reasonableness of the assumption vanished,
replaced by the question: why should it be assumed that few of the ticket purchasers
knew anything about the dressed weight of an ox? This was, after all, an exhibition
of fat stock, and it took place annually as presumably did the contest. So maybe a
substantial number of visitors were well-informed about the size, shape, and weight
of oxen. Maybe they frequented regional exhibitions or farms or slaughter houses
or butcher shops. Maybe they participated before in this contest. This was rural
England in 1906, not urban England in the twenty-first century. People were accustomed either to growing their own food or to purchasing it, one step removed, at
the shops, markets, and farms where the food was produced and processed. I was
not around then, but my great grandmother was, and she did not purchase her food
at a local West of England supermarket where hundreds of small cuts of meat lay
neatly packaged on Styrofoam slabs wrapped with cellophane having traveled by
refrigerated railcars or airplane hundreds or thousands of miles from wherever it
was that the animals were raised. So maybe the 1906 Exhibition crowd was not
dumb.
I returned to researching Galton's papers to see whether he had further thoughts on the matter. He did – but even more interestingly, so did one of his contemporaries

460

The guesses of groups

who wrote his objections to the editor of Nature.6 The perceptive gentleman was a
Mr. F. Perry-Coste of Cornwall, who, nearly one hundred years before I put pen to
paper (or fingers to keyboard) on the subject, had apparently had the identical
thought:
. . . Mr Galton says that the average competitor was probably as well fitted for making a just
estimate of the dressed weight of the ox as an average voter is of judging the merits of most
political issues on which he votes. . . . I do not think that Mr. Galton at all realizes how large a
percentage of the votersthe great majority, I should suspectare butchers, farmers, or men
otherwise occupied with cattle. To these men the ability to estimate the meat-equivalent weight
of a living animal is an essential part of their business . . . Now the point of all this is that, in so
far as this state of things prevails, we have to deal with, not a vox populi, but a vox expertorum.
[The] majority of such competitors know far more of their business, are far better trained, and
are better fitted to form a judgment, than are the majority of voters of any party, and of either
the uneducated or the so-called educated classes. I heartily wish that the case were otherwise.

A noteworthy point, although not one of particular importance at this juncture, is


that Galton, in replying to another letter-writer, acknowledged that the mean, rather
than the median, of the Exhibition competitors, gave a result closer to the true value.
I shall have more to say about this distinction later.
To the WOC author, however, Galton had stumbled upon a powerful truth, namely that "under the right circumstances, groups . . . are often smarter than the smartest people in them." I will refer to that statement as the WOC hypothesis. Is the hypothesis valid? And if it is valid, what are the "right circumstances"? Is the WOC hypothesis obvious, or is it instead a radical idea at variance with centuries,
perhaps millennia, of human experience? Therein lay my dilemma as I pondered this
disconcerting, but thought-provoking book.
Whenever I learn of yet another egregiously unwise action taken by a faculty
committee, I cannot help but think of a humorous website initially known to me as
Demotivators.com.7 Under an inspirational image of many hands reaching from all
directions toward a common center to touch one another in group solidarity is the
unexpected aphorism: "None of us is as dumb as all of us." The sentiment must be deeply rooted because acid comments about group intelligence go back ages. Nietzsche, for example, wrote "I do not believe in the collective wisdom of individual ignorance." Or consider Bernard Baruch's comment: "Anyone taken as an individual is tolerably sensible . . . as a member of a crowd, he at once becomes a blockhead." Or the remark by the psychologist (and amateur physicist) Gustave Le Bon: "In crowds it is stupidity and not mother wit that is accumulated." And the most succinct version of all – my favorite – by Tommy Lee Jones's character Kay, one of the Men in Black, protecting Earth from the scum of the universe: "A person is smart. People are dumb . . ." I could go on and on, but I won't.
6 The Ballot Box, Nature 75, No. 1952 (March 28, 1907) 509. [Letters to the Editor from Galton and others.]
7 The website has since been changed to http://www.despair.com/


Running counter to a vast literature in favor of individual intelligence and


expertise is WOC. Indeed the very title of the book is an ideological play on words, a rejoinder to Charles Mackay's influential 1841 book called Extraordinary Popular Delusions and the Madness of Crowds. A curious reader may wonder why WOC should hold such interest for me, a busy physicist, as anything other than a weekend's amusement, since neither the author, who is a business columnist for the New Yorker, nor, to the best of my knowledge, any of the people whose work he cites are physicists. True enough – but the WOC hypothesis has
far-reaching implications for the acquisition, verification, and implementation of
information, activities that physicists are continually engaged in. Moreover,
physicists by virtue of their training are so-called experts, and the thought that
their individual judgments on a scientific matter may actually be less useful than
the average opinion of, let us say, 50 people chosen at random off the street, is
disturbing.
Whether intentional or coincidental, the author seemed to have physicists and
mathematicians particularly in mind when he chose his opening examples of the
superiority of group over individual intelligence. In 1968 the US Navy lost the
nuclear submarine Scorpion – by which I mean the vessel simply vanished at sea – whereupon the task of finding it was assigned to John Craven, USN chief scientist for special projects. One of Craven's special projects shortly before the Scorpion disappeared had been to locate a thermonuclear (hydrogen) bomb that a US Air Force
plane accidentally dropped off the coast of Spain. (The WOC hypothesis aside, these
two incidents alone raise incisive questions about the wisdom of groups in military
uniforms.) Craven, according to WOC, assembled a team of diverse specialists,
listened to each man's speculative account of how the submarine was lost and where
it was conjectured to be, and then by means of a Bayesian search procedure compiled
a group solution that correctly located the submarine to within 220 yards at a spot no
individual expert had picked.
Impressive? Perhaps. But does this example really support the WOC hypothesis?
A Bayesian search technique is a strategy employing Bayes theorem (discussed in
Chapters 1 and 2), whereby prior information is continually updated by new data to
narrow the range of probable outcomes (the posterior probability). But the information must be informative if the posterior probability is to be more accurate than
the prior. In the search for the lost submarine, note that Craven did not seek the
opinion of 50 people off the street or from the white pages of the telephone directory,
or some such random sample. On the contrary, Craven sought opinions from a team
of specialists including mathematicians, submarine experts, and salvage men, and
used the data they provided to locate the submarine by methods of his own devising.
Is this not, as Galton's respondent, Perry-Coste of Cornwall, phrased it, a vox expertorum, not a vox populi?
Moreover, Craven pioneered the use of Bayesian search methods. He held a PhD
in ocean engineering as well as a law degree, and was instrumental in developing the


Polaris missile program, an extraordinarily complex system of nuclear defense.


According to Wired Magazine,8
In fact, most deep-ocean activities – saturation diving, exploring with submersibles, searching for tiny objects on the ocean floor – owe their origins to top secret, cold war-era Navy projects in which Craven had a hand.

Craven is recognized as a genius9 – an individual who outshines the group – so, if


anything, this example, as I see it, tends to refute the WOC hypothesis.
In another physics-related example, WOC revisited the Space Shuttle Challenger
disaster of 1986. Given the wide press attention, most readers, whether they are
physicists or not, have probably heard of Feynman's dramatic ice-water and O-ring
experiment during the Congressional hearings following the long inquiry into the
cause of the explosion. The cause of the disaster was ultimately attributed to a
leaking O-ring on the solid rocket booster, but this was not apparent at the outset
(although it was reported that some engineers expressed concerns well before the
launch, but were ignored). Nevertheless, according to WOC, by the end of this
catastrophic day when financial markets closed, the stock of Rockwell International
had fallen 3%, that of Lockheed 3%, that of Martin Marietta 3%, but the stock of
Morton Thiokol was down by 12%. All were companies that contributed to the
construction of the Challenger, but Morton Thiokol manufactured the solid rocket
booster.
WOC cites sources affirming that there were no clues at the time to the cause of
the accident, that there was no evidence of insider trading such as the dumping of
stock by Thiokol executives, or finger-pointing and stock-dumping by executives of Thiokol's competition. So how did the stock market know within hours who
was responsible for the Challenger disaster when the definitive report by the
Presidential Commission on the Challenger did not appear until six months
afterward? The book's answer: It was "all those investors – most of them relatively uninformed – who simply refused to buy the stock." In other words, by voting with their dollars, the traders as a group were certain of Thiokol's responsibility, even if individual traders hadn't a clue. Strong support in favor of the WOC
hypothesis?
Perhaps – but also perhaps not. Who can be sure that the traders of Thiokol stock
saw no prior indications of difficulties within the company? Maybe there were earlier
incidents of inadequate quality-control that prompted traders to sell Thiokol stock in
preference to stock of the other manufacturers. Is it really reasonable (from an
economic standpoint) to assume that those people who acquire stocks in companies
supporting as risky an endeavor as the US manned space program are uninformed

8 C. Hoffman, The mad genius from the bottom of the sea, Wired Magazine, http://www.wired.com/wired/archive/13.06/craven_pr.html
9 http://en.wikipedia.org/wiki/John_Pina_Craven


naïve traders, as opposed to technology specialists who study this segment of the


market carefully, watching for any hint of poor product performance? Besides, we
have seen (Chapter 5) that the time series of stock indices resemble a random walk, so
perhaps the timing of the 12% drop and shuttle explosion were more or less
coincidental.
It would be interesting to see whether traders again behaved as predicted by
the WOC hypothesis in 2003 in the immediate aftermath of the Space Shuttle
Columbia disaster. This time, however, the cause of the problem was known even
before the disaster inevitably occurred. Upon re-entry into the searing gases of
the Earth's atmosphere, the spacecraft disintegrated as a result of damage to its
thermal protection system wrought during launch by a large block of foam that
had detached from the main propellant tank and struck the leading edge of the
left wing. From what I have been able to learn, the external fuel tanks were
covered by about ten types of foam insulation manufactured by six different
companies and installed by Lockheed-Martin (symbol LMT on the New York
Stock Exchange). The news report I read10 stated that Lockheed-Martin would
not name the company responsible for the foam block that disintegrated the
Columbia. If the WOC claim about the wisdom of all those investors . . . is to
be believed, perhaps a review of market activity for 1 February 2003 (the day of
the disaster) of all contributing foam manufacturers will reveal which company
the market designated as culpable. One might have expected investors to have
punished Lockheed-Martin itself, since, after all, this was the company that
installed the thermal tiles, but an examination of the 3-week period from 21 January 2003 through 10 February 2003 shows the LMT share price to be nearly
perfectly flat.

8.2 A mathematical truism?


From those intriguing beginnings of dressed oxen, lost submarines, and exploded
spacecraft, WOC compiled numerous other cases drawn from a broad spectrum of
human activity requiring critical decisions – science, engineering, business, sports, government – illustrating the greater reliability of judgments by crowds than by
experts. By now it should be clear, however, that we are not really dealing with the
wisdom of crowds, but with the guesses of groups. The issue is not wisdom, but
information, which is quite a different thing.11 Nevertheless, it is important to
ascertain whether instances of superior collective insights are merely coincidences
or actual manifestations of some kind of law of human behavior.
10 W. Allison, A. Kumar, and C. Pittman, Foam insulation has history of damaging shuttle, St. Petersburg Times (4 February 2003).
11 From a publisher's standpoint, however, a book titled The Information of Groups would probably not sell as well – except perhaps to physicists and mathematicians who mistook the meaning of the title.


The WOC explanation is relatively simple. Each person's guess contains information and error. In the average of a large number of diverse, independent estimates or predictions, the errors effectively cancel and, according to WOC, you're left with information. The information is useful because we are all products of evolution and therefore equipped to make sense of the world. In short, WOC tells us that the answer rests on a mathematical truism, but none is explicitly indicated. Since the author of WOC was not a mathematician, I could think of only two such truisms that might have come to his attention – and a third one of which he was unlikely to
be aware.
The first is the law of large numbers in which the mean m of a sample of independent observations from a given population approaches the population mean as the sample size increases. Consider, for example, an election poll. In a population of 1 million people, a sample of 10 000 will give a more accurate representation of opinion than a sample of 100. Indeed, the spread about the mean (the standard deviation of the mean) varies inversely with the square root of the sample size, so that the result of the larger sample would be more sharply defined (and presumably more reliable if the samples were representative) than the result of the smaller sample by a factor √(10 000/100) = 10. True though this statistical law is, its relevance to the WOC hypothesis is questionable. In this case, the collectivity of opinion of the population determines the election. This is not the case for a group estimate of a pre-existing external fact like the location of a missing boat or the culpability of a company.
The second truism is the Central Limit Theorem (CLT) in which the mean of a
large number of independent samples of a random variable is approximately
distributed in a Gaussian or normal distribution, irrespective of how the samples
themselves are distributed. The qualifying phrase "a large number" rigorously means infinitely large, but practically speaking the CLT can sometimes hold very
well for a sample size of about 10 or even fewer, depending on the specific probability density. (We have seen in Chapter 1, for example, that the mean of three
uniform variates is distributed like a Gaussian variate to a surprisingly good
approximation.) Here, again, we have a statistically valid statement whose relevance to the WOC hypothesis is dubious. There is no a priori scientific or mathematical reason that I know of as to why the distribution of the mean of guesses
from a large group of diversely (un)informed people should be centered on the
correct answer. Recall Feynman's fable.
The third potential truism is a somewhat obscure (except perhaps to mathematically adept scholars of political science), but fascinating, mathematical theorem known as Condorcet's jury theorem. I have never encountered a
discussion of this theorem in textbooks of probability and statistics. Nevertheless, the jury theorem is worth examining in some detail because of its interesting mathematical structure as well as its potential relevance to the WOC
hypothesis.


8.3 Condorcet's jury theorem


First expressed (to my knowledge) by the Marquis de Condorcet in a 1785 essay on
the probability of majority decisions,12 the eponymous jury theorem in its simplest
version (. . . it has since been generalized and debated in the research literature by
others . . .) goes as follows. Suppose that a group of n voters, each of whom acts
independently, are to cast votes on an issue with a binary outcome, one of which is
correct and the other wrong. The probability that a voter casts a correct vote is p,
and therefore 1 − p is the probability of casting a vote for the wrong answer. The
decision of the group, which will be either right or wrong, is decided by a simple
majority.
Now here – stated in practical terms – is the amazing feature of the theorem: If
the probability that an individual voter makes a correct decision is only marginally
greater than 50%, the probability that the group makes a correct decision
approaches 100% as the size of the group increases. Reciprocally, for p marginally
smaller than 50%, the probability of a correct group decision approaches 0 as the
size n increases. How can a large group of barely informed individuals reach the
right decision with near certainty? We will look at this theorem first mathematically, and then by a simple heuristic argument that shows how the theorem
makes sense.
Let us represent the decision of the ith voter by a Bernoulli random variable Xi (i = 1 . . . n) with associated probability function

X_i = \begin{cases} 1 \ (\text{vote is correct}), & \Pr(X_i = 1) = p \\ 0 \ (\text{vote is wrong}), & \Pr(X_i = 0) = 1 - p \end{cases} \qquad (8.3.1)

whereupon the group decision is represented by the binomial variate S_n = \sum_{i=1}^{n} X_i. The probability that the majority decision is correct is then given by

\Pr(S_n \ge n_m \mid n) = \sum_{j=n_m}^{n} \binom{n}{j} p^j (1-p)^{n-j}, \qquad (8.3.2)

where n_m is the threshold number for a majority, i.e.

n_m = \begin{cases} \tfrac{1}{2}(n+1) & n \text{ odd} \\ \tfrac{1}{2}n + 1 & n \text{ even.} \end{cases} \qquad (8.3.3)

For n odd, n_m is the median number. Consider, for example, a group of seven voters. Then the minimum number of people needed for a majority is n_m = 4, in agreement with (8.3.3).
12 Marquis de Condorcet, Essai sur l'application de l'analyse à la probabilité des décisions à la pluralité des voix (L'Imprimerie Royale, Paris, 1785), reproduced online by the Bibliothèque Nationale de France: http://gallica.bnf.fr/ark:/12148/bpt6k417181
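For a numerical feel of (8.3.2), and as a check of the beta-integral form derived as (8.3.4) below, the probability can be evaluated directly; the sketch assumes scipy is available, and the values of n and p are arbitrary choices for illustration.

```python
# Majority probability (8.3.2) evaluated as a binomial sum, checked against the
# regularized incomplete beta function that expresses (8.3.4).
from math import comb
from scipy.special import betainc          # betainc(a, b, x) = I_x(a, b)

def majority_prob(n, p):
    n_m = (n + 1) // 2 if n % 2 else n // 2 + 1   # majority threshold (8.3.3)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n_m, n + 1))

n, p = 101, 0.51
n_m = (n + 1) // 2
print(majority_prob(n, p))                 # direct sum, Eq. (8.3.2)
print(betainc(n_m, n - n_m + 1, p))        # incomplete-beta form, Eq. (8.3.4)
```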


Although it is by no means apparent, the sum in (8.3.2) is expressible as a beta distribution

\Pr(S_n \ge n_m \mid n) = \frac{1}{B(n_m,\, n - n_m + 1)} \int_0^{p} x^{n_m - 1}(1-x)^{n-n_m}\, dx = n\binom{n-1}{n_m - 1} \int_0^{p} x^{n_m - 1}(1-x)^{n-n_m}\, dx. \qquad (8.3.4)

This equivalence can be established by a straightforward analytical calculation, which is relegated to an appendix, or by a more insightful combinatorial argument based on order statistics, which proceeds in two parts as follows.

First part  Let y[1] ≤ y[2] ≤ · · · ≤ y[n] be the order statistics of a set of n independent samples {y_i, i = 1 . . . n} of a uniform random variable Y = U(0,1) with cumulative distribution function

F(y) \equiv \Pr(Y \le y) = \int_0^{y} dy' = y. \qquad (8.3.5)

It has already been established in Chapter 1 that the cumulative distribution function of the kth order statistic is

F_{Y_{[k]}}(y) = \Pr(Y_{[k]} \le y) = \sum_{j=k}^{n} \binom{n}{j} F(y)^j \left[1 - F(y)\right]^{n-j}, \qquad (8.3.6)

whereupon the probability that at least k of the set {y_i, i = 1 . . . n} is less than or equal to some number p, where 1 ≥ p ≥ 0, is given by

\Pr(Y_{[k]} \le p) = \sum_{j=k}^{n} \binom{n}{j} p^j (1-p)^{n-j}, \qquad (8.3.7)

which is precisely the sum that appears in (8.3.2) if one sets k = n_m.

Second part  There is, however, another way to arrive at the probability Pr(Y[k] ≤ p) by clever use of the multinomial distribution as employed in Section 1.31 of Chapter 1. The probability that the order statistic y[k] lies between x and x + dx, where x ≤ p, is the probability that
(a) k − 1 elements of the set {y_i, i = 1 . . . n} fall between 0 and x, and
(b) 1 element falls in the range (x, x + dx), and
(c) n − k elements exceed x + dx.
Because all the elements were drawn independently from a uniform distribution, this probability is proportional to the product x^{k−1}(dx)(1 − x − dx)^{n−k}. In the limit of an infinitesimal interval dx, the sum (i.e. integral) over all values of x ≤ p yields the probability

\Pr(y_{[k]} \le p) \propto \int_0^{p} x^{k-1}(1-x)^{n-k}\, dx = C \int_0^{p} x^{k-1}(1-x)^{n-k}\, dx. \qquad (8.3.8)

The constant of proportionality C can be obtained in either of two ways. The first way is by a combinatorial argument: the total number of ways to partition n distinguishable elements into three categories respectively containing k − 1, 1, and n − k elements is given by the multinomial coefficient

C = \binom{n}{k-1,\, 1,\, n-k} = \frac{n!}{(k-1)!\, 1!\, (n-k)!} = \frac{n(n-1)!}{(k-1)!\,(n-k)!} = n\binom{n-1}{k-1}. \qquad (8.3.9)

The second way is to normalize the integral in (8.3.8), which is known as an incomplete beta function. The normalization constant will then be the reciprocal of a beta function

C = \left[\int_0^{1} x^{k-1}(1-x)^{n-k}\, dx\right]^{-1} = \frac{1}{B(k,\, n-k+1)} = \frac{\Gamma(n+1)}{\Gamma(k)\,\Gamma(n-k+1)} = \frac{n!}{(k-1)!\,(n-k)!} = n\binom{n-1}{k-1} \qquad (8.3.10)

and yields the same result as (8.3.9).


In summary, we have established by the two-part combinatorial argument that

\Pr(S_n \ge n_m \mid n) = n\binom{n-1}{n_m - 1}\int_0^{p} x^{n_m - 1}(1-x)^{n - n_m}\, dx \qquad (8.3.11)

is the probability that a majority decision of the group will be correct if p is the probability that an individual in the group votes correctly.
For the sake of illustration, consider the case of an odd number of jurors, i.e. n = 2m + 1, where n_m = m + 1 is the median. Eq. (8.3.11) then simplifies to

\Pr(S_n \ge m+1) = (2m+1)\binom{2m}{m}\int_0^{p} \left[x(1-x)\right]^m dx, \qquad (8.3.12)

where m = (n − 1)/2. In the limit of large n (or m), the integrand in (8.3.12) becomes a sharply peaked function of x with a width inversely proportional to √n. This, in a nutshell, is the reason why the probability (8.3.2) depends sensitively on whether p exceeds ½ or not. To see in detail how this occurs, let us evaluate the integral – which cannot be reduced to an exact closed-form expression – by the method of steepest descent. This entails


(a) expanding the integrand, expressed as an exponential, in a Taylor series about the point x = ½ at which it becomes maximum

\left[x(1-x)\right]^m = e^{m[\ln x + \ln(1-x)]} = e^{-2m\ln 2 - 4m\left(x-\frac{1}{2}\right)^2 - 8m\left(x-\frac{1}{2}\right)^4 - \frac{64}{3}m\left(x-\frac{1}{2}\right)^6 - 64m\left(x-\frac{1}{2}\right)^8 - \frac{1024}{5}m\left(x-\frac{1}{2}\right)^{10} - \cdots} \qquad (8.3.13)

(b) truncating the expansion at the order (x − ½)², and
(c) algebraically manipulating the resulting expression into the form of a Gaussian integral to obtain

\int_0^{p}\left[x(1-x)\right]^m dx \cong \frac{1}{2^{2m+1}}\sqrt{\frac{\pi}{m}}\left[\frac{1}{\sqrt{2\pi}}\int_{-\sqrt{2m}}^{\sqrt{2m}\,(2p-1)} e^{-z^2/2}\, dz\right], \qquad (8.3.14)

where z = (x − ½)/σ with width (standard deviation) σ = 1/√(8m).
Now consider the combinatorial factor multiplying the integral in (8.3.12) where m is sufficiently large to justify use of Stirling's approximation for factorials

m! \cong \sqrt{2\pi m}\,(m/e)^m, \qquad (8.3.15)

which is the leading factor in an infinite Stirling series

n! = \sqrt{2\pi n}\,(n/e)^n\left[1 + \frac{1}{12\,n} + \frac{1}{288\,n^2} - \frac{139}{51840\,n^3} - \frac{571}{2488320\,n^4} + \cdots\right]. \qquad (8.3.16)

The expression (8.3.16) is surprisingly accurate even for values of m as low as 1.13 It then follows that

(2m+1)\binom{2m}{m} = \frac{(2m+1)\,(2m)!}{(m!)^2} \cong \frac{(2m+1)\left[\sqrt{4\pi m}\,(2m/e)^{2m}\right]}{\left[\sqrt{2\pi m}\,(m/e)^m\right]^2} = \frac{(2m+1)\,2^{2m}}{\sqrt{\pi m}} \cong 2^{2m+1}\sqrt{\frac{m}{\pi}}. \qquad (8.3.17)
Combining (8.3.17) and (8.3.14) leads to a Gaussian cumulative distribution function

\Pr(S_n \ge m+1) \cong \frac{1}{\sqrt{2\pi}}\int_{-\sqrt{2m}}^{\sqrt{2m}\,(2p-1)} e^{-z^2/2}\, dz \;\longrightarrow\; \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{2\varepsilon\sqrt{n}} e^{-z^2/2}\, dz. \qquad (8.3.18)

Upon substitution of p = ½ + ε, where ε is a positive or negative number of arbitrarily small magnitude, Eq. (8.3.18) takes the form shown above to the right of the arrow, where I have also approximated n ≅ 2m. Thus, as the group size approaches infinity, it is immediately apparent that the probability of a correct majority
13 The error E(n) = n! − fac(n), where fac(n) is the Stirling series to the order shown above, is on the order of 10⁻⁴, 10⁻⁵, 10⁻⁶ respectively for n = 1, 2, 3. As n increases, the absolute error eventually becomes much larger than 1, but the relative error RE(n) = [n! − fac(n)]/n! decreases rapidly. Thus, for n = 1, 10, 100, RE(n) = 10⁻⁴, 10⁻⁸, 10⁻¹².

\lim_{n\to\infty}\Pr\!\left(S_n > \frac{n}{2}\right) =
\begin{cases}
\dfrac{1}{\sqrt{2\pi}}\displaystyle\int_{-\infty}^{+\infty} e^{-z^2/2}\,dz \;\to\; 1, & \varepsilon > 0 \\[8pt]
\dfrac{1}{\sqrt{2\pi}}\displaystyle\int_{-\infty}^{-\infty} e^{-z^2/2}\,dz \;\to\; 0, & \varepsilon < 0
\end{cases} \qquad (8.3.19)

becomes either 100% or 0% depending on the sign of ε, i.e. on whether p is greater or less than ½ by an arbitrarily small amount.
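How quickly the limit (8.3.19) is approached can be seen by comparing the Gaussian form (8.3.18) with the exact binomial sum for modest (odd) group sizes; scipy is assumed available, and the chosen values of n are arbitrary.

```python
# Exact majority probability versus the Stirling-Gauss approximation (8.3.18).
from scipy.stats import binom, norm

def exact(n, p):
    k = (n + 1) // 2                      # majority threshold for odd n
    return binom.sf(k - 1, n, p)          # P(S_n >= k)

def gauss(n, p):
    return norm.cdf(2 * (p - 0.5) * n**0.5)   # right-hand side of (8.3.18)

for n in (101, 1001, 10001):
    print(n, exact(n, 0.51), gauss(n, 0.51))
```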
Although the mathematical demonstration of the jury theorem (8.3.4) is not trivial, the consequence (8.3.19), signifying a near certainty that a sufficiently large group will be correct (or incorrect) even if all of its members have a probability only slightly greater (or lesser) than 50% of being correct, can be simply understood. Note first that for a sample size n ≫ 1 the distribution of the binomial variate Sn is very nearly Gaussian

S_n \equiv \sum_{i=1}^{n} X_i = \mathrm{Bin}(n,p) \;\xrightarrow{\,n \gg 1\,}\; N\bigl(np,\; np(1-p)\bigr) \qquad (8.3.20)

with mean np and standard deviation σ = √(np(1−p)). This can be readily proven by use of the moment generating function, or by invoking the Central Limit Theorem.
Suppose the group size to be 10 000 and each individual in the group to be only 51% likely to answer a particular question correctly. The mean number of correct voters is 5100 with a standard deviation σ ≅ 50. Thus, the number of group members voting correctly will fall within a 2σ range 5100 ± 100 = (5000, 5200) with a probability of 95%. In other words, the majority decision will be correct about 95% of the time.
Suppose, however, the group size to be 1 000 000 and each individual, as before, has a 51% chance of being correct. The mean number of correct voters is then 510 000 with a standard deviation σ ≅ 500. Now the number of group members voting correctly will fall within a 20σ range 510 000 ± 10 000 = (500 000, 520 000) with a probability very close to 100%. (The exact value is 1 − 5.5 × 10⁻⁸⁹.) Thus, the majority decision is likely to be correct 100% of the time.
In short, the larger the sample size n, the wider is the range (in units of σ) of voters beyond the sample median who give correct answers (for p > 0.5). How large, in fact, must a group be (considering for illustration an odd number of members) in order that the majority decision be correct 99% of the time if individual members have a probability of being correct only 51% of the time? Comparison of the Stirling–Gauss approximation (8.3.18) with a more exact higher-order calculation based on the Stirling series (8.3.16) and expansion (8.3.13) of the incomplete beta integral to tenth order leads to the results shown in Table 8.1.

Table 8.1  Size of Condorcet jury for 99% group accuracy with p = 0.51

Group size n    High-order calculation    Stirling–Gauss approximation
13 525          0.989 995 99              0.989 986 85
13 527          0.990 000 57              0.989 991 44
13 529          0.990 005 15              0.989 996 02
13 531          0.990 009 73              0.990 000 06

Under the assumptions of the jury theorem a minimum group size of 13 527 members would be needed to assure a 99% chance of the majority decision being correct. (The Stirling–Gauss approximation yielded a group larger by four members.)
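The threshold group size can also be found by a direct search over the exact binomial probability, which should land close to the figures in Table 8.1 (scipy assumed available).

```python
# Smallest odd group size whose majority is correct 99% of the time at p = 0.51.
from scipy.stats import binom

p, target = 0.51, 0.99
n = 3
while binom.sf((n + 1) // 2 - 1, n, p) < target:
    n += 2                                # odd group sizes only
print(n)                                  # close to the 13 527 of Table 8.1
```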
Interesting as the jury theorem may be as a mathematical exercise, the key
question remains as to whether this theorem can serve as the truism upon which
to assert the inevitability of the WOC hypothesis. The answer in my opinion is a
negative one. Real-world decision making rarely, if ever, conforms to the conditions of the theorem. Most decisions are not of a binary nature, but require
qualitative judgments to be made from among many choices or numerical estimates that can fall within a wide range of real numbers. Moreover, it is entirely
unrealistic to expect all the members of a group to be equally informed (or
uninformed) so that a single probability p represents their state of knowledge. As
mentioned previously, extensions of Condorcet's theorem have been published, but
I have seen none that would serve to justify the broad claims of WOC. Perhaps the
author of WOC had some other mathematical truism in mind, but, if so, I do not
know what it was.
8.4 Epimenides paradox of experts
There is a glaring, if not humorous, logical inconsistency in the indiscriminate
debunking of experts and expertise that one finds in WOC. For example, one reads
that
. . . expertise and accuracy are unrelated.
. . . experts' decisions are seriously flawed.
. . . experts' judgments [are] neither consistent with the judgments of other experts in the field nor internally consistent.
. . . experts are surprisingly bad at . . . calibrating their judgments.
. . . experts . . . routinely overestimate the likelihood that they're right.
From whom, one may ask, has the author acquired these professional insights on
experts?
WOC acknowledges, among others, James Shanteau ". . . one of the country's leading thinkers on the nature of expertise . . ." That would make Shanteau an


expert, right? And if he is an expert, then would not the above objections apply to
him too, which would mean that his critique of experts is not reliable, which would
then imply that the opinions of experts can be trusted? This is one of those self-referential paradoxes like the well-known logical paradox enunciated by Epimenides the Cretan that "All Cretans are liars." If you ask a Cretan whether he is lying and he
replies truthfully in the affirmative, then he was actually not lying, which meant that
he did not reply truthfully. And so on.
In any event, the fact that the author's sources, by virtue of their expert status, may be questionable does not necessarily mean that they were wrong, as evidenced by a few past examples of expert opinion:
"Who the hell wants to hear actors talk" – Harry Warner (1927)
"I think there is a world market for maybe 5 computers" – Thomas Watson (1943)
"Computers in the future may weigh no more than 1.5 tons" – Popular Mechanics (1949)
and one of my favorites:
"640 K ought to be enough for anybody" – Bill Gates (1981)
(I found these in WOC and in lists of quotes on the internet and cannot vouch for their authenticity, but they do sound good).
Personally, my own experiences with experts – especially those providing advice on financial, legal, or educational matters – were much in accord with the critical
remarks above. It is, in fact, one of the character traits of good physicists to be
skeptical of authority and to try to find things out for themselves. And so I initiated a
set of experiments to investigate the Guesses Of Groups or GOG.

8.5 The Silverman GOG experiments


If the WOC hypothesis is valid, the societal ramifications of this radical proposition
are almost unimaginable. Why, for example, should the US criminal justice system
rely on the verdict of a jury of 12 randomly chosen people, when the facts of a case
could be put on a Department of Justice internet website where the average
verdict could be compiled much more reliably from millions of internet responses?
Why rely on some fallible expert to run the Federal Reserve System (the Fed)
when the direction of federal economic policy could be determined, likewise by
internet, from the composite opinions of millions of citizens? If the WOC hypothesis is valid, then the collective opinion of the citizenry ought to provide more
satisfactory solutions to current problems of health care such as which diseases to
research, what drugs to make, how much to charge, and how to pay for it all. Or,
regarding national security, why not have a betting market where bettors from all
over the world can lay odds on the next terrorist attack? Come to think of it, the


US Defense Advanced Research Projects Agency (DARPA) already proposed that


in 2003, but quickly scuttled the idea in face of vehement criticism from horrified
members of Congress and the news media. Should the Defense Department have
persisted? Is the hypothesis valid?
As a physicist interested in stochastic phenomena – although ordinarily involving atoms, nuclei, and photons, rather than people – I decided to see for myself with a
series of trials involving several undergraduate physics classes of about 30 students
each. The experiment was implemented in four phases.
Phase 1 involved the estimation of physical quantities more or less to hand. Thus,
for example, students were shown a transparent glass jar of steel shot and were
asked to estimate the number of shot (Trial 1). At a later time each student held
the jar and was asked again to estimate the number of shot (Trial 2) as well as
the combined weight or mass (depending on the units a student chose to use) of
shot and jar (Trial 3). The classroom in which we met had a false ceiling of
acoustic tiles; students were asked to estimate the distance between floor and
tile ceiling (Trial 4).
Phase 2 involved the estimation of things not to hand. These were Fermi-type
problems in which students were asked to estimate the periphery around the
college (Trial 5), the number of banks in the city (Trial 6), and the number of
restaurants in the city (Trial 7).
Enrico Fermi, whose name outside the physics community is less well known
than that of Einstein or Feynman, is the twentieth-century physicist whom I admire
most because of his exceptional achievements in both experimental and theoretical
physics. He was an inspiration for my own career as both an experimentalist and
theoretician. Physicists ordinarily specialize in one activity or the other, and so it is
not common to find a physicist who does both with equal facility. A colorful figure
in his own right, Fermi was said to have challenged his colleagues with questions such as "How many piano tuners are there in New York City?" It may seem at first
glance that one either knows the answer to a question like that or does not, but in
any event would be unable to deduce an answer without access to further information. Fermi, however, was masterful at estimating. I did not expect students
individually to have that ability, but I was curious to see whether the class, as a
group, could produce an accurate estimate even if no one (or perhaps a few) within
the group knew the correct answer. The correct answer, by the way, could be a
somewhat fluid number. In contrast to questions in Phase 1 in which the true value
could be established by a direct count or measurement, for Phase 2 I relied on an
internet yellow pages listing of all the restaurants or banks within my specified
region (City of Hartford). Submitting the identical query to the computer several
times in succession could produce slightly different numbers. In such cases, I took
the mean.


Phase 3 involved making predictions. Students were asked first to predict something connected with an activity (taking tests) with which they, as a group, were
thoroughly familiar: the class mean score on a quiz to be taken at the end of the
week (Trial 8). At another time they were asked to predict something in regard
to an activity (the stock market) with which I assumed few, if any, were
familiar: the change in the Dow Jones Industrial Average (DJIA) by the close
of day at the end of the week (Trial 9).
The preceding exercises took place one trial per class meeting, usually during the first
or last ten minutes of the period. Each student received from me a sheet of paper
stating what was to be estimated or predicted and asking for a numerical response as
well as a qualitative estimate of the student's confidence in his or her answer (None, Low, Medium, High). The purpose of the latter was to see whether there was any correlation between accuracy and confidence. (There wasn't.) Since one of the
conditions alleged to be necessary for the validity of the WOC hypothesis is
the independence of individual guesses, students were instructed not to discuss the
exercise with their neighbors or to glance at their neighbors' answers.
To avoid purely random guessing, in which case the situation would degenerate to
the one in Feynman's fable about the length of the Emperor of China's nose, it was
necessary that students gain something personally from answering accurately. In this
regard, my policy was to offer whoever came closest to the exact answer some extra-credit points toward their cumulative course score. The amount offered was quite
modest, but, if you have ever taught at a college or university, you probably have a
good idea of what students would do for almost any amount of extra credit. Suffice it
to say that my students were satisfied with the offer. It should also be noted that
students were not required to put their names on the response sheets they turned in
(in case some would have felt embarrassed at submitting a wildly incorrect estimate),
but obviously they could earn extra credit only if I knew to whom to award the
points. Every participant revealed his or her identity.
Phase 4, the final phase, entailed an exercise of a kind different from the preceding
in which the participants merely had a few moments to view, hold, or think
about something before writing down their responses. Bearing in mind
Cravens search for the lost Scorpion, I wanted to see for myself whether
laymen working in groups or experts working individually were more
successful at a problem-solving activity. Having defined (privately14) a set of
criteria on the basis of which to identify student experts in the class, I divided
the class into five groups of six students each. In four of the groups comprising the
non-experts, the students within each group were to solve the problem as a team;
14 The criteria were not announced to the class since that could have seriously affected their attitude toward the exercise and their performance.


in the fifth group, however, the six chosen experts were to sit far apart and work
on the problem individually. The five groups were set to work simultaneously in
different classrooms for the same amount of time (15 minutes).
Here is the first problem they were given (Trial 10):
Two former high school friends A and B, who had not seen one another for many years, met by chance.
A: Do you have any children?
B: Three boys.
A: How old are they?
B: Here is a clue. The product of their ages is 36.
A: There are many possibilities.
B: Here's another clue: See that building across the street? The number of storeys equals the sum of my children's ages.
A: I still don't have enough information.
B: My eldest son has blue eyes.
A: Thanks, now I know the ages of your children.
What are the children's ages? ________ _________ ________
How did A deduce this? Give explanation on reverse side.

I had initially intended for the foregoing problem to be the final trial, but the idea of
actually having the students search for something missing, as Craven did, strongly
appealed to me. Since (to my knowledge) the US military services had not lost
another boat or bomb in the intervening years since Craven was assigned to look
for such things, I settled on a simpler problem of lost treasure (Trial 11). Here it is.
During the early nineteenth century a wealthy professor buried his fortune on the Trinity College
campus, and you have come into possession of the following instructions found in a book once
belonging to his personal library.
1. Count thy steps from the door of the College Alehouse to the door of the Metaphysics Building,
turn left by a right angle, take the same number of steps and place a spike in the ground.
2. Count thy steps from the door of the College Alehouse to the door of the Alchemy Building, turn
right by a right angle, take the same number of steps and place a second spike in the ground.
3. At the point halfway between the two spikes dig for treasure.
Now the old Metaphysics Building still exists (it became McCook and presently houses the Physics,
Philosophy, and Religion departments), and the old Alchemy Building still exists (it became
Clement and houses the Chemistry Department and College Cinema), but the College Alehouse
has long since been demolished and no one alive today remembers where it used to be, although
there is much speculation.
Draw a map showing the Metaphysics and Alchemy Buildings and place a small cross (×) precisely
where on the map you believe the treasure is located. Explain your reasoning on the back of this
page.


Table 8.2 Summary of group judgment tests

Phase                       No.  Description                                 Correct value   Closest value(s)   Group mean   Group median
I. Estimation (nearby)       1   Number of shot in a jar (viewed only)       229             230                178.1        152.5
                             2   Number of shot in a jar (viewed & held)     229             230                199          152
                             3   Mass of shot in a jar (viewed & held)       1016.8 g        1000.0 g           2012.1 g     1818.2 g
II. Estimation (distant)     4   Height of ceiling                           263.0 cm        264.0 cm           287.6 cm     279.7 cm
                             5   Periphery of campus                         1.81 miles      1.80 miles         2.07 miles   1.86 miles
                             6   Number of banks                             39              35, 42             54.9         34
                             7   Number of restaurants                       311             300 [2]            127.3        102–103
III. Prediction              8   Mean quiz score                             78.2%           78.5%, 78.0%       78.9%        78.3%
                             9   Change in DJIA                              26.4 pts        21 pts, 32 pts     14.9 pts     5 pts
IV. Deduction               10   Age of children
                            11   Treasure hunt

I should explain that the two contemporary buildings mentioned above really do exist
and for reasons of political or financial expediency really do house the irrationally
eclectic combination of departments. Thus the term "Metaphysics Building", by which I have long referred to the edifice within whose basement my research laboratory is located, is an apt appellation. Furthermore, the conditions of this trial differed
from those of the preceding trial in that I did not attempt to identify experts, but
decided to let the students themselves partition their number into groups and individuals. This turned out to be a mistake (perhaps) in that no students organized
themselves into groups, and the mathematically most adept students in the class
submitted their results as individuals. Nevertheless, this outcome was itself
informative.
A summary of the 11 trials and their outcomes is displayed in Table 8.2. I leave the
solutions to the two logic problems to appendices.
To interpret the significance of the outcomes, I would like to reiterate, even at
the risk of redundancy, one of the key components to the WOC hypothesis.
Among the assertions defended in The Wisdom of Crowds was the statement by economist Kenneth Arrow that the average opinion of a group is frequently more accurate than the opinions of most individuals in the group. Since all the participating students in a particular course made up the group in my experiments, I first looked at the outcomes to see how the average group response compared with the best estimates of individuals.
A quick scan of Table 8.2 shows that the responses of my physics students did not
appear to bear out the WOC hypothesis. In every instance, there were individual
answers that beat the group average. The group average, in fact, was not particularly
close to the true value in most of the trials.
The word "average" is an ambiguous term; it can refer to the three distinct notions
of mean, median, and mode. Recall that Galton initially calculated the median of the
crowds estimates of the weight of a dressed ox, but later found that the mean
provided closer agreement to the true value in the case that he published. Although
not specified in WOC, I inferred from context that the book probably also meant
mean. Since the sample mean is the sum of all responses divided by the number of
responses, it is sensitive to outliers; one wild estimate can displace the mean substantially. By contrast, the median, which is the value of the middle item (or the mean of
the two middle items) when the samples are arranged in increasing or decreasing
order of magnitude, is unaffected by the exact location of outliers. Table 8.2 shows
that the median was superior to the mean in a number of trials. The mode of a
distribution is the most probable value and would correspond here to the most
frequent student response in each trial. This statistic is not useful in samples of small
size because the submission of identical answers would be rare. If the numerical
answers were part of a continuum, then the mode would depend to some extent on
how the responses were binned, i.e. placed in mutually exclusive classes, a statistical
procedure discussed in earlier chapters.
Table 8.3 summarizes the trial outcomes as tests of the WOC hypothesis, i.e.
whether the group mean and/or median yielded values superior to the best individual responses. There are no "Yes" entries in the table, but I indicated by "OK" those few instances where a group statistic was close to the true value and almost as good as the best individual responses. The fractional error of the mean, shown as a percentage in the first column, is defined by the ratio (group mean − true value)/(true value); a corresponding ratio defines the fractional error of the median. With the exception of the prediction of the class mean quiz score, which was extraordinarily
close, and estimation of the height of the classroom ceiling, which was moderately
close, the fractional errors of the group hardly encouraged confidence in the
wisdom of the crowd, at least for small to moderately sized groups. It is to be
noted, however, that in no case could I have predicted which individual student
would submit the best estimate.

8.6 Interpretation of the GOG experiments


So what can we learn from the GOG experiments? In particular, do they cast doubt
on the WOC hypothesis? To answer these questions we need to consider not just a


Table 8.3 Comparison of individual vs group results

Test                    Fractional error   Fractional error    Superiority of   Superiority of
                        of the mean        of the median       group mean?      group median?
Estimate shot           22.3%/13.1%        33.6%               No               No
Estimate mass           97.9%              78.8%               No               No
Estimate height         9.4%               6.3%                No               No
Estimate periphery      14.4%              2.8%                No               OK
Estimate banks          40.8%              12.8%               No               OK
Estimate restaurants    59.1%              67.0%               No               No
Predict quiz average    0.90%              0.1%                OK               OK
Predict DJIA change     43.56%             81.1%               No               No
Solve logic problem     50.00%             n/a                 50–50            n/a

single statistic like the mean or median, but the actual sample distributions, two of
which are shown as histograms in the upper and lower panels of Figure 8.1. Recall
that a histogram is an approximate graphical representation of the probability
distribution of the parent population from which a sample is taken. One divides
the range of sample outcomes into non-overlapping classes or bins into which the
outcomes are distributed. The histogram is then a plot of the frequency (i.e. number)
of outcomes in each bin as a function of the bin value.
Underlying the WOC hypothesis is an implicit assumption that group responses are distributed more or less normally, i.e. in a bell-shaped curve with the preponderance of samples clustered symmetrically about the mean and decreasing in frequency fairly rapidly in the wings. In keeping with Galton's democratic principle of "one vote, one value", WOC identified the collective judgment (wisdom) of a group with the sample mean

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,     (8.6.1)

where {x_i}, i = 1…n, is the set of individual responses. Although the assumption of normality may seem reasonable (since, after all, the ubiquity of the Gaussian distribution, as Galton had doggedly revealed, is precisely why one refers to it as "normal"), the GOG experiments suggested to me that it is a flawed assumption.

Fig. 8.1 Top panel: histogram of the estimates of the perimeter of Trinity College (in miles) by physics students given the point of departure and connecting streets. Bottom panel: histogram of responses by the same students asked to estimate the mass or weight of a jar of steel shot (in grams). For all such exercises, respondents could use whatever units they preferred; the estimates were subsequently converted into standard units.


The sample distribution in each of the GOG experiments did not resemble a
normal distribution. True, the blockish quality of the resulting histograms, like the
two shown in Figure 8.1, reflected in part the fact that a sample of about 30 to 60
participating students (depending on the particular trial) did not comprise a large
population, but the sample was nonetheless large enough to draw meaningful conclusions. A much larger sample, it seemed to me, might smooth the envelope of the
histograms, but not necessarily eliminate the marked skewness in some or conspicuously small kurtosis in others.15
Having examined numerous histograms of the GOG trials, I reached a conclusion
very different from what Galton would have believed. The samples were not drawn from
a hypothetical normal population. Rather, in regard to each specific question or task put
to the group, the members fell roughly into a subgroup of those who were more or less
informed (i.e. had at least some idea of what constituted a reasonable response) and a
subgroup of those who were essentially clueless about the matter. Replies from the
informed subgroup would be distributed approximately normally (although not necessarily centered on the correct answer), whereas the widely scattered replies from the
uninformed subgroup would be better modeled by a uniform distribution.
For example, virtually all college and university students have taken numerous
tests in their lives, and so when asked to predict the class mean of a forthcoming quiz,
their replies could be expected to follow more or less a normal distribution centered
on a correct prediction. And this was the case. On the other hand, most students
probably ate at the college dining facility rather than at city restaurants, and therefore when asked to estimate the number of restaurants their replies were all over the
board, so to speak, ranging from about 25 to greater than 250.
The upper and lower panels of Figure 8.1 reflect this categorical division for two
other trials. Few students in my classes, even though they may have taken science
courses before, had background experiences preparing them to estimate the weight of
an object by holding it. There may have been a few students, perhaps, who went
grocery shopping at home or who cooked or baked in their kitchens and lifted so
many ounces of this or a half pound of that. But most had little kinesthetic sense of
weight or mass; the estimates of the mass of a jar of steel shot were all over the chart.
Contrast that with the quasi-Gaussian-looking histogram of estimates of the college
perimeter, attributable, I believe, to the fact that many students jogged that route
regularly or traversed it by car. Thus, the histogram took the shape of a normal
distribution from the informed group skewed to the right by a long tail (outliers of
excessive length) from those who simply guessed erratically. In general, then, the
distribution of group replies should be a mixture in varying proportions of the
informed and uninformed distributions.

15 Recall that kurtosis K (from the Greek root for "bulging") is a measure of the fourth moment about the mean. It is a gauge of the sharpness of the peak and heaviness of the tails. For a normal distribution K = 3. A distribution with lower kurtosis has a more rounded peak, narrower shoulders, and shorter tails.


Before continuing with this thought, however, a brief word of explanation is called
for in view of my previous criticism (in Chapter 3) of inferences about nuclear decay
drawn by certain researchers from the shapes of histograms. What, then, would
justify at this point my own deductions based on histogram shape? First, it is most
certainly the case that a probability density function (pdf ), when plotted against the
variable upon which it depends, has a definite shape. An observer with appropriate
experience would surely recognize the bell shape of a Gaussian distribution, the Eiffel
Tower-like shape of a Cauchy distribution, the skewed ski-slope shape of a Rayleigh
distribution, and other more or less familiarly shaped pdfs. A histogram approximates the shape of a pdf, provided that the bins are not so numerous as to result in a
statistically insignificant number of events in each, nor so few as to reveal no shape at
all. When the resulting form of a histogram is not essentially altered by varying the
number and boundaries of the bins within a statistically permissible range, then it is
meaningful to speak of the histogram shape as an empirical approximation to the
true underlying pdf. What is not meaningful (as I demonstrated in Chapter 3) is to
assign significance to the shape of fluctuations in the numbers of events in these
(arbitrarily designated) bins, since such secondary spatial features (e.g. "rabbit ears") can change radically (they are, after all, fluctuations) with a change in the
number and value of the bins. And now, let us return to my GOG deductions.
The lesson I drew from the GOG experiments with my physics students was that in
seeking to optimize the information one can extract from a group, one should not
weight equally everyone's response. Rather, a better strategy would be to give more
weight to the members of the informed group and less to those of the uninformed
group. Yet how could that be done, given that the members of each subgroup are not
individually identifiable? How could one tell whether a nearly correct response
actually came from the completely random guess of someone in the uninformed
group, or that an outlying incorrect response came from careful consideration by
someone in the tail of the informed subgroup? What was needed was a completely
objective statistical model that utilized only the sample of data without making any
attempt to assess the knowledge of individual respondents.
I will present such a model shortly, but first let us examine under what circumstances the sample mean is justified as an expression of the wisdom of the crowd.

8.7 Mining groups for information: Galton's democratic model


While there may be no mathematical truism that validates the judgment of a crowd
over that of individual experts, a mathematical argument can nevertheless be made
for justifying, in the absence of prior information,16 Galton's democratic model of "one voice, one vote". Recall that Galton initially chose to identify the judgment of
16 Technically, from a Bayesian point of view, there is always prior information, even if it consists of total ignorance. We considered in Chapter 2 the statistical representation of ignorance.


the crowd at the 1906 Fat Stock and Poultry Exhibition with the sample median (most likely because the median is not sensitive to outliers) and then later favored the group mean when it was pointed out (at least in that one instance) that it gave a more
accurate prediction. Although Galton presumably knew nothing of entropy at the
time, he made a statistically reasonable choice. In the absence of prior information
concerning the distribution of guesses, this choice can in fact be justified by the
principle of maximum entropy, introduced in Chapter 1.
Let us designate by {x_i}, i = 1…n, the set of independent guesses submitted by a group of n members in response to some query. Group the guesses into K categories (bins) {X_k} with frequencies {n_k}, k = 1…K. It then follows that

\sum_{k=1}^{K} n_k = n     (8.7.1)

\sum_{k=1}^{K} n_k X_k = \sum_{i=1}^{n} x_i,     (8.7.2)

and the sample mean can be expressed in either of two equivalent ways

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} \sum_{k=1}^{K} n_k X_k.     (8.7.3)

Suppose p_k to be the (unknown) probability that an outcome (guess) falls in the kth bin. The sorting of n items into K bins constitutes a multinomial distribution for which the probability P({n_k} | {p_k}) of an observed configuration of outcomes {n_k} is given by

P(\{n_k\} \mid \{p_k\}) = n! \prod_{k=1}^{K} \frac{p_k^{n_k}}{n_k!}     (8.7.4)

and the Shannon entropy of the configuration is therefore

H = -P(\{n_k\} \mid \{p_k\}) \ln P(\{n_k\} \mid \{p_k\})     (8.7.5)

subject only to the completeness relation

\sum_{k=1}^{K} p_k = 1.     (8.7.6)

As discussed in Chapter 1, the most objective (least biased) assignment of probabilities {p_k}, given the set {n_k}, is obtained by maximizing the entropy (8.7.5). In other words, one must solve the equations

\frac{\partial H}{\partial p_k} = -\frac{\partial P}{\partial p_k}\left( \ln P + 1 \right) = 0 \qquad k = 1 \ldots K,     (8.7.7)


or, since P ≠ 0, 1,

\frac{\partial P}{\partial p_k} = 0 \;\Rightarrow\; \frac{\partial \ln P}{\partial p_k} = 0.     (8.7.8)

The second relation in (8.7.8) follows because a function and its logarithm are extremized at the same points.
Substitution of (8.7.4) into (8.7.8) leads to a set of equations

\frac{\partial}{\partial p_j} \sum_{k=1}^{K} n_k \ln p_k \equiv \frac{\partial L}{\partial p_j} = 0 \qquad j = 1 \ldots K,     (8.7.9)

which is the same set of equations that would follow from application of the method of maximum likelihood (ML). The log-likelihood function L of this system is

L = \sum_{k=1}^{K} n_k \ln p_k,     (8.7.10)

but, because of constraints (8.7.1) and (8.7.6), only K − 1 of the K terms in L are independent. We can deal with the situation, as we have before, by use of a Lagrange multiplier or more simply in this instance by rewriting L in terms of independent quantities only, in the following way

L = \sum_{k=1}^{K-1} n_k \ln p_k + n_K \ln p_K = \sum_{k=1}^{K-1} n_k \ln p_k + \left( n - \sum_{k=1}^{K-1} n_k \right) \ln\!\left( 1 - \sum_{k=1}^{K-1} p_k \right).     (8.7.11)

Substitution of (8.7.11) into (8.7.9) leads to the set of relations

\frac{n_j}{p_j} = \frac{n - \sum_{k=1}^{K-1} n_k}{1 - \sum_{k=1}^{K-1} p_k} = \frac{n_K}{p_K} = \text{constant} \qquad j = 1 \ldots K-1.     (8.7.12)

The constant is determined to be n from the completeness relation (8.7.6), and one thereby obtains the maximum-entropy (ME) set of probabilities

p_k^{\mathrm{ME}} = \frac{n_k}{n} \qquad k = 1 \ldots K     (8.7.13)

for all values of k.
The sample mean in (8.7.3) is then seen to be the first moment

m_1^{\mathrm{ME}} = \sum_{k=1}^{K} p_k^{\mathrm{ME}} X_k = \sum_{k=1}^{K} \frac{n_k}{n} X_k = \frac{1}{n} \sum_{i=1}^{n} x_i     (8.7.14)

of the distribution obtained from the principle of maximum entropy for the case where no prior information (other than completeness) is known about how the


guesses of a group are distributed. This is the most unbiased value one can obtain by querying the group once, but there is no predictive value to it. Put the same question to the same group again (as I have done with students) and, as individuals change their estimates or guesses, a different set of frequencies {n_k}, and therefore probabilities {p_k}, will likely emerge.
Although the frequencies may fluctuate from trial to trial, examination of the
resulting histograms of responses suggests a discernable pattern or form which, if
real, would constitute information beyond pure ignorance. Knowing that form (i.e.
probability function) would permit one to ascertain more reliably the information
contained in the collective response of a group. It would approximate the collective
judgment of a group of infinite number, or, equivalently, the mean response of a
finite-size group to the same question posed an infinite number of times. This does
not mean, of course, that the calculated wisdom of the crowd would necessarily lie
closer to the true answer to the question put to the crowd, but that the value obtained would be more stable and therefore consistent.

8.8 Mining groups for information: Silverman's Mixed-NU model


The problem posed in the preceding sections is of general importance, transcending its origin as a source of amusement and statistical instruction for several
generations of my students. At its core is the question: Given the anonymous
independent responses of a large group of randomly selected people to some
question (whose correct answer would bring personal gain or avoid personal loss
in order that they reply thoughtfully and not guess wildly), how can one reliably
extract what information the total sample may contain? The simplest solution I conceived, the Mixed-NU model, is based on a mixed Normal-Uniform distribution.
Suppose that an unknown fraction f of the group responses come from individuals belonging to the informed subgroup. Each response from this subgroup is an independent random variable X_1 = N(μ, σ²) drawn from a Gaussian distribution of unknown mean μ and variance σ². The remainder of the responses, constituting a fraction 1 − f, come from the uninformed subgroup, each reply of which is an independent random variable X_2 = U(a, b) drawn from a uniform distribution over the range from the minimum reply a to the maximum reply b. The pdf characterizing the group response is then expressible as a combination of independent Gaussian and uniform distributions

p_X(x) = f\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2} + \frac{1-f}{b-a}\, I_{[a,b]}(x)     (8.8.1)

referred to as a mixed distribution. Recall that the interval function I_[a,b](x) appearing in (8.8.1) restricts its argument to the range a ≤ x ≤ b

I_{[a,b]}(x) = \begin{cases} 1 & a \le x \le b \\ 0 & x > b \text{ or } x < a \end{cases}     (8.8.2)

Fig. 8.2 Bottom panel: scatter plot of N = N_G + N_U = 10 000 samples from a mixed distribution of Gaussian N(200, 50²) and Uniform U(40, 600) variates. The population comprises N_G = 7500 Gaussian (dense points) and N_U = 2500 uniform (diffuse points) samples. Top panel: histogram of all samples enveloped by the theoretical probability density (black). The sample mean and standard deviation are respectively x̄ = 230.8, s_x = 105.1, in close agreement with the population mean and standard deviation μ_X = 230.0, σ_X = 105.4.

Figures 8.2 and 8.3 show examples of mixed distributions (8.8.1) obtained empirically
by drawing a total of 10 000 samples of which a fraction f came from a Gaussian
RNG and 1 f came from a uniform RNG. The fractions f in the two figures are
respectively 0.75 and 0.25. As expected, the shape in Figure 8.2 looks predominantly
Gaussian with a long tail skewed to the right, as in the GOG histogram for college
perimeter, whereas the shape in Figure 8.3 resembles more what the histogram for
mass estimation might have been if the number of samples were closer to 7500, rather
than 60.
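For readers who want to experiment, the figures are easy to reproduce in outline. The following sketch (my own illustration, not code from the book) draws 10 000 samples from the mixed Normal-Uniform distribution with the parameter values quoted in the caption of Figure 8.2 and compares the sample mean and standard deviation with the stated population values:

import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
N, f = 10_000, 0.75                            # total samples; Gaussian fraction (Fig. 8.2)
mu, sigma, a, b = 200.0, 50.0, 40.0, 600.0     # N(200, 50^2) and U(40, 600)

# Each simulated respondent is EITHER "informed" (normal) OR "uninformed" (uniform)
informed = rng.random(N) < f
x = np.where(informed, rng.normal(mu, sigma, N), rng.uniform(a, b, N))

print(x.mean(), x.std(ddof=1))                 # sample statistics, roughly 230 and 105
print(f*mu + (1 - f)*(a + b)/2)                # population mean 230.0
print(np.sqrt(f*sigma**2 + (1 - f)*(b - a)**2/12
              + f*(1 - f)*(mu - (a + b)/2)**2))  # population SD 105.4, cf. Eq. (8.8.4) below

Setting f = 0.25 instead reproduces the population values quoted in the caption of Figure 8.3.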
I will explain shortly how the unknown parameters (f, μ, σ²) can be determined from data. For now, assuming they are known, I identify the collective judgment of


Fig. 8.3 Bottom panel: scatter plot of N = N_G + N_U = 10 000 samples from a mixed distribution of Gaussian N(200, 50²) and Uniform U(40, 600) variates. The population comprises N_G = 2500 Gaussian (dense points) and N_U = 7500 uniform (diffuse points) samples. Top panel: histogram of all samples enveloped by the theoretical probability density (black). The sample mean and standard deviation are respectively x̄ = 289.6, s_x = 151.2, in close agreement with the population mean and standard deviation μ_X = 290.0, σ_X = 151.4.

the group with the expectation of X calculated with pdf (8.8.1), which is readily shown to be

\mu_X \equiv \langle X \rangle = f\mu + (1-f)\,\frac{a+b}{2}.     (8.8.3)

Correspondingly, the uncertainty in group response is taken to be \sigma_X = \sqrt{\mathrm{var}(X)}, where the variance

\mathrm{var}(X) \equiv \sigma_X^2 = \langle X^2 \rangle - \mu_X^2
 = f\left( \mu^2 + \sigma^2 \right) + (1-f)\,\frac{b^2 + ab + a^2}{3} - \left[ f\mu + (1-f)\,\frac{a+b}{2} \right]^2
 = f\sigma^2 + (1-f)\,\frac{(b-a)^2}{12} + f(1-f)\left[ \mu - \tfrac{1}{2}(a+b) \right]^2     (8.8.4)
The guesses of groups

is neatly expressible as a weighted sum of the variance of the two components in the
mix and the square of the difference of their means.
The mixed model raises a subtle, but essential, distinction regarding the statistical description of the group that may have escaped the reader's attention and should be clarified even though it entails a brief digression. Given the hypothesized pdf (8.8.1), it would be wrong to think that the random variable X itself takes the mixed form

X_{\mathrm{mixed}} = f\, N(\mu, \sigma^2) + (1-f)\, U(a, b).     (8.8.5)
Although the expectation of the variate defined by (8.8.5) yields precisely the same
result (8.8.3), the pdf of Xmixed is not that of a mixed distribution and the theoretical
variance (and higher moments) differ substantially from those calculable from
(8.8.1). A mixed random variable is not the same as a mixed distribution.
To see this, one can employ the methods introduced earlier in the book to show that the pdf of X_mixed in (8.8.5) takes the form

p_{X_{\mathrm{mixed}}}(x) = \frac{1}{\sqrt{2\pi\sigma^2}\,(1-f)(b-a)} \int_{[x - b(1-f)]/f}^{[x - a(1-f)]/f} e^{-(u-\mu)^2/2\sigma^2}\, du
 = \frac{ \mathrm{erf}\!\left( \frac{f\mu + b(1-f) - x}{\sqrt{2}\, f\sigma} \right) - \mathrm{erf}\!\left( \frac{f\mu + a(1-f) - x}{\sqrt{2}\, f\sigma} \right) }{ 2(1-f)(b-a) },     (8.8.6)

where

\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-z^2}\, dz     (8.8.7)

is the error function. The variance of (8.8.5) is

\mathrm{var}(X_{\mathrm{mixed}}) = f^2 \sigma^2 + (1-f)^2\, \frac{(b-a)^2}{12}.     (8.8.8)

Figure 8.4 shows a progression of forms taken by plots of p_{X_mixed}(x) as a function of x for increasing values of the mixing parameter f. The shapes do not resemble at all the histograms of the mixed Gaussian-Uniform distribution.
The distinction between a mixed distribution and a mixed random variable, as
highlighted by the suite of plots in Figure 8.4, is that in the former each sample (i.e.
individual response) is either from a normal distribution or from a uniform distribution, whereas in the latter, each sample has fractional characteristics of both normal
and uniform distributions. As a quantum mechanical analogy, a mixed distribution is
like a superposition of probabilities, whereas a mixed random variable is like a
superposition of amplitudes. Thus, instead of a state vector like (8.8.5), a mixed
distribution comprising n random variables V_i(θ^(i)) of parameters θ^(i) with mixing coefficients f_i can be uniquely described by a density matrix [f_1 V_1(θ^(1)); f_2 V_2(θ^(2)); … ; f_n V_n(θ^(n))], or simply [f; V_1(θ^(1)); V_2(θ^(2))] for a binary mixed distribution, where it is understood that the fraction f refers to the first variate in the bracket and the complementary coefficient must be 1 − f.

Fig. 8.4 Variation in shape of the pdf of a mixed random variable X_mixed = f N(200, 40²) + (1 − f) U(40, 600) as a function of mixing coefficient f: (a) 0.05, (b) 0.25, (c) 0.50, (d) 0.75, (e) 0.95.
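The distinction is easy to see numerically. In the following sketch (an illustration of mine, not taken from the book), samples of a mixed distribution and of the mixed random variable (8.8.5) are generated from the same ingredients; their means agree, but their variances differ in just the way Eqs. (8.8.4) and (8.8.8) predict:

import numpy as np

rng = np.random.default_rng(0)
N, f = 100_000, 0.75
mu, sigma, a, b = 200.0, 50.0, 40.0, 600.0

# Mixed DISTRIBUTION: each draw comes from one component or the other
informed = rng.random(N) < f
mix_dist = np.where(informed, rng.normal(mu, sigma, N), rng.uniform(a, b, N))

# Mixed RANDOM VARIABLE, Eq. (8.8.5): every draw blends both components
mix_var = f*rng.normal(mu, sigma, N) + (1 - f)*rng.uniform(a, b, N)

print(mix_dist.mean(), mix_var.mean())   # both close to 230, Eq. (8.8.3)
print(mix_dist.var(ddof=1))              # roughly 11 100, Eq. (8.8.4)
print(mix_var.var(ddof=1))               # roughly 3 040, Eq. (8.8.8)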
There are various ways to estimate the parameters (f, μ, σ²) in the mixed model [f; N(μ, σ²); U(a, b)] of group judgment. I have generally employed the method of maximum likelihood, which follows directly from Bayes' theorem (for a uniform prior) and possesses a number of statistically desirable properties, as discussed in Chapter 1. Note that the range values (a, b) of the uniform distribution are not unknown parameters to be solved for, but can be established well enough at the outset from the sample of responses. The data x_i (i = 1…n) must be grouped, i.e. the n samples sorted into K bins X_k (k = 1…K) with a frequency of n_k samples in the kth bin. The log-likelihood function L of n samples drawn from a population governed by the probability density (8.8.1) then takes the form (discussed in Chapter 1)

L = \sum_{k=1}^{K} n_k \ln p(X_k \mid \theta_1, \theta_2, \theta_3) = \sum_{k=1}^{K} n_k \ln p(X_k \mid f, \mu, \sigma^2),     (8.8.9)

from which the maximum likelihood (ML) set of parameters (\hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3) = (\hat{f}, \hat{\mu}, \hat{\sigma}) are then obtained by solving the coupled equations that maximize L

\frac{\partial L}{\partial \theta_j} = \sum_{k=1}^{K} n_k\, \frac{\partial p(X_k \mid \{\theta\})/\partial \theta_j}{p(X_k \mid \{\theta\})} = 0 \qquad j = 1, 2, 3.     (8.8.10)

Note that the third parameter to be solved for can be either σ or σ². Because the resulting equations (8.8.10) are highly nonlinear and require a numerical procedure that calls for an initial guess, one choice may be less sensitive to the starting values than the other and thereby lead more readily to convergence. As a general guideline, choose parameters that are neither too large nor too small. In applications to be discussed shortly (the BBC-Silverman experiments), rapidly convergent solutions were obtained for either choice when σ was of order 1; σ was the preferable parameter, however, when its value was of order 10². Taking θ_3 = σ, the ML equations of the Mixed-NU Model become

\frac{\partial L}{\partial f} = \sum_{k=1}^{K} n_k \left[ \frac{e^{-(X_k-\mu)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} - \frac{1}{b-a} \right] p(X_k \mid f, \mu, \sigma^2)^{-1} = 0

\frac{\partial L}{\partial \mu} = 0 \;\Rightarrow\; \sum_{k=1}^{K} n_k\, (X_k - \mu)\, e^{-(X_k-\mu)^2/2\sigma^2}\, p(X_k \mid f, \mu, \sigma^2)^{-1} = 0     (8.8.11)

\frac{\partial L}{\partial \sigma} = 0 \;\Rightarrow\; \sum_{k=1}^{K} n_k \left[ \frac{(X_k - \mu)^2}{\sigma^2} - 1 \right] e^{-(X_k-\mu)^2/2\sigma^2}\, p(X_k \mid f, \mu, \sigma^2)^{-1} = 0

with

p(X_k \mid f, \mu, \sigma^2) = f\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(X_k-\mu)^2/2\sigma^2} + \frac{1-f}{b-a}\, I_{[a,b]}(X_k).     (8.8.12)

The solutions must be found numerically.
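In practice one need not manipulate Eqs. (8.8.11) directly; it is usually simpler to maximize the binned log-likelihood (8.8.9) with a general-purpose optimizer. The sketch below is one way to do that (my own illustration using scipy.optimize.minimize, not the author's code); it generates synthetic grouped data with known parameters so that the fit can be checked:

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, Xk, nk, a, b):
    # Negative of the binned log-likelihood (8.8.9) with the Mixed-NU pdf (8.8.12)
    f, mu, sigma = params
    gauss = np.exp(-(Xk - mu)**2 / (2.0*sigma**2)) / np.sqrt(2.0*np.pi*sigma**2)
    unif = 1.0 / (b - a)                  # every bin centre lies inside [a, b]
    p = f*gauss + (1.0 - f)*unif
    return -np.sum(nk * np.log(p))

# Synthetic "group" with known parameters, so the fit can be checked
rng = np.random.default_rng(2)
n, f_true, mu_true, sig_true, a, b = 2000, 0.8, 5.0, 1.5, 1.0, 15.0
informed = rng.random(n) < f_true
x = np.where(informed, rng.normal(mu_true, sig_true, n), rng.uniform(a, b, n))

nk, edges = np.histogram(x, bins=28, range=(a, b))   # grouped data
Xk = 0.5*(edges[:-1] + edges[1:])                    # bin centres

res = minimize(neg_log_likelihood, x0=[0.5, x.mean(), x.std()],
               args=(Xk, nk, a, b), method="L-BFGS-B",
               bounds=[(0.01, 0.99), (a, b), (0.1, b - a)])
f_hat, mu_hat, sig_hat = res.x
group_mean = f_hat*mu_hat + (1.0 - f_hat)*(a + b)/2.0   # Eq. (8.8.3)
print(f_hat, mu_hat, sig_hat, group_mean)

The recovered (f̂, μ̂, σ̂) should lie close to the generating values 0.8, 5.0, and 1.5, and Eq. (8.8.3) then converts them into a group judgment.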

8.9 The BBC–Silverman experiments: the reach of television


Because class size was relatively small (one of the touted advantages of studying
physics at a liberal arts college), the GOG experiments did not provide large enough
populations to test my models of group judgment. Then, one day in 2007, while
pondering how to find larger groups, I received an email inquiry from a reporter
(Alexandra Freeman) of the British Broadcasting Corporation (BBC). Having read
my brief account of group experiments with physics students, she and her colleagues
associated with The One Show were interested in testing live on the air whether
crowds really were wise and asked whether I could advise them. I was delighted
to help!
I suggested that two trials be performed, virtually identical to the initial two trials
of my GOG experiments. The objective of the first would be to test the crowd's ability to estimate the weight (or mass) of something familiar; the objective of the second would be to test the crowd's ability to estimate the number of some set of
things under circumstances not so familiar. My expectations were that (a) the second
trial should reveal a larger contribution from the uniform component than the first,
and, (b) the mean of the Mixed-NU model should yield a closer estimate to the true

value than the sample mean (Galton's model), provided the informed subgroup of respondents (the Gaussian component) made up a sufficiently large fraction of the group.

Fig. 8.5 Top panel: fruit cake used by the BBC's The One Show in 2007 to test the ability of a crowd in London's Borough Market to guess the weight of a cake. The winner received the cake as a prize. Bottom panel: scale used at Borough Market to weigh the cake, showing a true mass of 5.315 kg.
The first trial took place in the large outdoor Borough Market, one of London's major food markets located in the Borough of Southwark close to the famed London
Bridge. Carrying a large rectangular fruit cake with a bright red question mark in the
center of the icing, as shown in Figure 8.5, BBC reporter Michael Mosley randomly
queried 123 shoppers for their estimates of the weight of the cake. As incentive to
guess accurately, the person coming closest to the true value Mcake = 5.315 kg, as
shown on the scale in Figure 8.5, would receive the cake as a prize. Truth be told,
I would not, myself, want to eat a cake that was carried around Borough Market all
day and held by more than a hundred people . . . but perhaps that's being too
fastidious.
In any event, like Galton at the West of England Exhibition, I received from Ms.
Freeman the complete record of guesses. Also like Galton, I found the crowd's
judgment to be surprisingly good. Estimates ranged from 1.000 kg to 14.700 kg with

a sample mean M_sample = 5.416 kg, a median of 5.100 kg, standard deviation S_sample = 2.471 kg, and standard error of 0.2228 kg. In short, the crowd missed the exact mass with a fractional error (M_sample − M_cake)/M_cake = 1.90%, or about 1 part in 50.

Fig. 8.6 BBC–Silverman cake experiment in Borough Market, London. Histogram of N = 123 estimates of the mass of a cake sorted into 29 bins of width 0.5 kg over the range 1–15 kg. Superposed is the theoretical NU-Mixed model probability density (solid) with maximum likelihood parameters (f, μ, σ²) given in Table 8.4, and a normal density (dashed) based on the unweighted group mean and variance. The two outcomes are respectively M_MNU = 5.345 ± 0.239 kg and M_Gaus = 5.416 ± 0.223 kg. The true mass was M_cake = 5.315 kg.
Figure 8.6 shows a histogram of the mass estimates, overlaid by a Gaussian distribution N(M_sample, S²_sample), reflecting Galton's (and WOC's) democratic belief that the collective wisdom of a crowd resides in the sample mean of normally distributed independent guesses. However, in keeping with the results of my GOG experiments with physics students, the Borough Market histogram, with its concentrated density in the vicinity of 5 kg and a long flat tail skewed to the right, again strongly resembled a mixed Gaussian-Uniform distribution. The Mixed-NU distribution (8.8.12), with parameters determined by the method of maximum likelihood
(8.8.11) provided a better match to the data, as also shown in Figure 8.6 and
summarized in Table 8.4.
Having watched a video recording of the experiment in progress that Ms.
Freeman sent me, I thought this outcome made perfect sense. The population
queried by Mr. Mosley included seasoned housewives, who undoubtedly made
and lifted many cakes over the years, as well as some young people in their teens
or twenties, who probably never lifted, let alone made, a fruit cake. Nevertheless,

Table 8.4 The BBC–Silverman wisdom of crowds experiments

                                      Cake experiment                          Coin experiment
Population size n                     123                                      1706
Number of bins                        29                                       71
Bin width                             0.5 kg                                   100 coins
Mixed-NU parameters
  Uniform range (a, b)                (1, 15) kg                               (0, 7000) coins
  Gaussian fraction f                 0.806                                    0.868
  Gaussian mean μ                     4.705                                    736.40
  Gaussian standard deviation σ       1.627                                    354.62
Exact value                           M_cake = 5.315 kg                        N_coin = 1111 coins
Crowd (sample) mean value             M_sample = 5.416 ± 0.223 kg              N_sample = 982 ± 39 coins
Silverman Mixed-NU mean               M_MNU = 5.345 ± 0.239                    N_MNU = 1100 ± 30
Crowd (sample) percent error          |(M_sample − M_cake)/M_cake| = 1.90%     |(N_sample − N_coin)/N_coin| = 11.60%
Silverman Mixed-NU percent error      |(M_MNU − M_cake)/M_cake| = 0.56%        |(N_MNU − N_coin)/N_coin| = 0.99%

since the venue was a major food market, it is not unreasonable to expect the
sampled population to comprise more mature, knowledgeable food preparers than
clueless youths. If so, that would account for why the mean judgment of the crowd
was quite accurate, and why a normal distribution captured the information of the
crowd almost as well (percent error 1.90%) as my Mixed-NU Model (percent error
0.56%). Nevertheless, a normal distribution alone fails to account for the high-end
fat tail.
Of particular interest to me was the second trial, which took place in the BBC
studio and entailed an exercise that was probably not part of the experiences of many
viewers who emailed in their guesses: to estimate the number of £1 coins in a large open, transparent glass, as shown in Figure 8.7. The true value was N_coin = 1111. The 1706 guesses received were all over the board, ranging from a low of 42 to a high of 43 200, with a sample mean N_sample = 982, a median of 695, standard deviation S_sample = 1593, and standard error of 39. Although the judgment of the crowd was worse than in the cake experiment, it was not terribly bad, missing the exact value with a fractional error of |(N_sample − N_coin)/N_coin| = 11.6%, or about 1 part in 9.
Actually, the largest value submitted was £25 million, but was excluded since there was reason to believe (as Ms. Freeman wrote me) that it was intended to sabotage the experiment. Indeed, as a matter of common sense, where would The One Show get £25 000 000 or more than $40 000 000 to put in a jar for the purpose of a brief

infotainment exercise?17

Fig. 8.7 Glass full of £1 coins used by BBC's The One Show in 2007 to test the ability of viewers to estimate the number of items in a set.

On rare occasions, I have encountered such irrationally extreme guesses myself in the responses from students. This is one reason, apart from
mathematical tractability, why the range parameters (a, b) of the uniform component
in my model are not treated as unknown quantities to be solved for, but as limits
established at the outset from the data on the basis of reasonableness.
Figure 8.8 shows a histogram of the responses overlaid with the Gaussian N(N_sample, S²_sample) and the Mixed-NU density with ML-determined parameters, as also
summarized in Table 8.4. The high-end tail of the histogram extends far to the right,
and for aesthetic purposes, i.e. to avoid a plot comprising mostly empty space, the bin
axis is curtailed at 3000. Nevertheless, it is seen that the assumption of a normally
distributed population is unsustainable. In contrast, the Mixed-NU model captured
the distribution of guesses more faithfully and yielded a group mean estimate of 1100
coins, designated NMNU in Table 8.4, i.e. a fractional error of about 1%.
In other words, it would appear that the Mixed-NU model showed that this
particular group of independent respondents were better able (for whatever reason)
to estimate the number of coins than one would have been led to believe on the basis
of the unweighted (i.e. maximum-entropy) sample mean. This interpretation is
consistent with the resulting ML parameters, which predicted a Gaussian component

17 This is about 15% of the entire BBC One network annual budget.

Fig. 8.8 BBC–Silverman coin experiment. Histogram of N = 1706 estimates of the number of £1 coins in the glass of Figure 8.7 sorted into 71 bins of width 100 over the range 0–7000. Superposed is the NU-Mixed model probability density (solid) with maximum likelihood parameters given in Table 8.4, and a normal density (dashed) based on the sample mean and variance. The two outcomes are respectively N_Mixed = 1065 ± 30 and N_Sample = 982 ± 39. The true count was N_coin = 1111.

(the informed subgroup) of 86.8%, a little higher, in fact, than the 80.6% Gaussian component of the crowd of cake-weight estimators. The use of the term "informed" does not necessarily imply that most individuals in the group were especially skilled at estimating numbers, only that they had a sense of what might be a reasonable number in contrast to a preposterous one. Recall the lesson of the
Condorcet jury theorem: a group, if sufficiently large, can produce a correct majority
vote with near certainty, even if individual members were correct only marginally
more than half the time.
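That lesson is easy to verify by simulation. The short sketch below (an aside of mine, not an analysis from the text) assumes each member votes correctly with probability p = 0.55 and tallies how often the majority is right as the group grows:

import numpy as np

rng = np.random.default_rng(7)
p = 0.55                            # assumed individual accuracy, barely above chance
for n in (11, 101, 1001):           # odd group sizes, so a strict majority always exists
    votes = rng.random((20_000, n)) < p               # True = a correct vote
    majority_correct = (votes.sum(axis=1) > n/2).mean()
    print(n, round(float(majority_correct), 3))
# The fraction of correct majority decisions rises toward 1 as n grows.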
It is also possible that the closeness with which the two group estimates, as
extracted by the Mixed-NU model, matched the true values is partly fortuitous. In
this regard, several aspects of the analysis bear brief commentary.
First, the match between frequencies calculated from the model pdf and the data
would fail a chi-square test for goodness of fit, i.e. give rise to a P-value smaller
than the conventionally set threshold of 5%. That failure does not in itself constitute a failure of the model because the sole purpose of the model was to provide a
better rationale than pure ignorance for gauging information (as expressed
through the mean and variance) available in the collective judgment of a group.
Unlike physical systems like atoms or stars, which are subject to well-founded
theories grounded in quantum mechanics and yield predictable line shapes of
one kind or another, there may well be no general theory for the response of a
group to some query. In that case an empirical model would be the best one can do.


Mathematical physicist Harold Jeffreys' comment on the law of gravity is pertinent here: "There has not been a single date in the history of the law of gravitation when a modern significance test would not have rejected all laws and left us with no law."18
Second, it is evident from Figure 8.8 (and to a lesser extent in Figure 8.6) that a
small, but not insignificant portion of the area under the Mixed-NU distribution lies
above the extension of the bin axis into the negative domain, i.e. into a non-physical
region, since the minimum coin count (or cake weight) cannot be below 0. This is a
problem similar to one encountered by high-energy physicists who, in fitting a
resonance profile to particle data, may find a nonzero probability that the mass of
the particle is negative. The problem can arise when the data manifest a high
uncertainty, whereupon the ratio of mean to standard deviation of the theoretical
fit is relatively low; for the coin experiment the ratio of ML Gaussian parameters is μ/σ ≈ 1.8. To avoid the nonphysical domain, yet still satisfy the completeness relation, one must work with a pdf normalized over the range (∞ > x ≥ 0).
I discuss two such approaches in the next section.
Third, because modeling a distribution, in contrast to just calculating a sample
mean, requires that data be grouped, the question arises as to what effect the number
and boundaries of the bins may have on final results. We have discussed this question
previously. A thorough examination of the cake and coin experiments (which I have
done) would go beyond the scope of this chapter. Let it suffice to reiterate that,
provided the number of bins is not so few as to provide little insight into the nature of
the distribution, nor so many that many bins are empty (or nearly so), the end result
is not greatly affected. For example, partitioning the data of the coin experiment into
141 bins (instead of 71) so that the bin width is 50 coins (instead of 100) led to a
Mixed-NU mean of 1089 ± 30, i.e. a fractional error of 1.99%. Likewise, the choice of 51 bins with a corresponding bin width of 140 coins led to a Mixed-NU mean of 1121 ± 30, i.e. a fractional error of 0.90%. The three group estimates are practically
equivalent.
A fourth and final point to note about the cake and coin experiments is that the
Mixed-NU model found in both cases that roughly 80% of the group were
"informed" and 20% were "uninformed". Perhaps this outcome is general for large, diverse groups, but without more experiments, this tantalizing result is purely speculative. Upon completion of the cake and coin trials, I had, in fact, proposed additional experiments (e.g. concerning prediction as well as estimation tasks) to the producer of The One Show, which we discussed at length by telephone; but, in the end, the show's objective was entertainment, not science, and so, to my knowledge,
the suggestions were never pursued.

18 H. Jeffreys, Theory of Probability, 3rd Edition (Oxford, London, 1961) 391.


8.10 The log-normal distribution: a fundamental model of group judgment?


To avoid a non-physical negative domain in the application of the Mixed-NU model, one can replace the pdf (8.8.1) with a modified density function

p_X(x) = f\, C(\mu, \sigma)\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}\, I_{[0,\infty)}(x) + \frac{1-f}{b-a}\, I_{[a,b]}(x)     (8.10.1)

constructed from a truncated Gaussian defined over the range (∞ > x ≥ 0). Correct normalization (to obtain unit area under the pdf) is obtained by insertion of the normalization constant

C(\mu, \sigma) = \frac{2}{1 + \mathrm{erf}\!\left( \mu / \sqrt{2}\,\sigma \right)}.     (8.10.2)

The mean (8.8.3) and variance (8.8.4) are then replaced by the relations

\mu_X = f \left[ \mu + \sqrt{\frac{2}{\pi}}\, \frac{\sigma\, e^{-\mu^2/2\sigma^2}}{1 + \mathrm{erf}\!\left( \mu/\sqrt{2}\,\sigma \right)} \right] + (1-f)\, \frac{a+b}{2}     (8.10.3)

\sigma_X^2 = f \sigma^2 \left[ 1 - \sqrt{\frac{2}{\pi}}\, \frac{(\mu/\sigma)\, e^{-\mu^2/2\sigma^2}}{1 + \mathrm{erf}\!\left( \mu/\sqrt{2}\,\sigma \right)} - \frac{2}{\pi}\, \frac{e^{-\mu^2/\sigma^2}}{\left[ 1 + \mathrm{erf}\!\left( \mu/\sqrt{2}\,\sigma \right) \right]^2} \right] + (1-f)\, \frac{(b-a)^2}{12}
 + f(1-f) \left[ \mu + \sqrt{\frac{2}{\pi}}\, \frac{\sigma\, e^{-\mu^2/2\sigma^2}}{1 + \mathrm{erf}\!\left( \mu/\sqrt{2}\,\sigma \right)} - \frac{1}{2}(a+b) \right]^2     (8.10.4)

derived from (8.10.1). Applied to the BBC–Silverman coin experiment with data sorted into 51 bins and other parameters as listed in Table 8.4, Equations (8.10.3) and (8.10.4) lead to 1114 ± 30 coins, yielding a fractional error of 0.31%.
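Readers who wish to check such numbers can evaluate the truncated-Gaussian mixed mean (8.10.3) directly; the short sketch below (mine, not the book's) plugs in the coin-experiment parameters of Table 8.4 together with the range (a, b) = (0, 7000):

import numpy as np
from scipy.special import erf

def mixed_truncated_mean(f, mu, sigma, a, b):
    # Eq. (8.10.3): the informed component is a Gaussian truncated to x >= 0
    lam = np.sqrt(2.0/np.pi) * sigma * np.exp(-mu**2 / (2.0*sigma**2)) \
          / (1.0 + erf(mu / (np.sqrt(2.0)*sigma)))
    return f*(mu + lam) + (1.0 - f)*(a + b)/2.0

# Parameters from Table 8.4 (71-bin fit); the text's value 1114 used the 51-bin fit
print(mixed_truncated_mean(f=0.868, mu=736.40, sigma=354.62, a=0.0, b=7000.0))
# roughly 1116 coins for these inputs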
Although the modified Mixed-NU model with correct normalization apparently
extracted an even closer (still fortuitous?) estimate of the true number of coins, there
remain some undesirable features. For one thing, the model pdf does not vanish at
the origin as it must, since no respondent who takes the exercise seriously would look
at a jar full of coins and estimate its number to be zero. Second, and perhaps more
important from an aesthetic perspective, is what may appear to some as an artificial
distinction between informed and uninformed subgroups. In mathematical
terms, it would be more satisfying if one could find a single pure density function
that generated the statistical features of a mixed-distribution model without assuming that respondents actually comprised two discretely different distributions.
This second desideratum raises a general (and, if you think about it, profound)
question of whether or not there may be a universal probability function for the
responses from a large (in principle, infinitely large) group of independent (i.e. noncoordinating) participants, each with a unique set of background experiences and


variable (in kind and amount) knowledge. This hypothetical group is, of course, an
idealization, but perhaps one that might be realized in a practical way by means of the
internet. Under the foregoing conditions, it would seem that orthodox statistical procedure already tells us what this universal probability function should be. If each
respondent's guess is represented by a random variable of arbitrary kind (provided that
its first and second moments exist), the Central Limit Theorem (CLT), as we have seen in
Chapter 1, asserts that the variate representing the sum or mean of the set be distributed
normally. However, this prediction is not supported by the results of either my GOG or
BBC experiments. The CLT is a rigorous statistical law, but it can lead to less familiar
results under unusual circumstances. I will return to this point at the end of the section.
To find and test a hypothetical universal probability function, it is crucial that the
group to be sampled be large. What one might expect from a truly large and variable
group could perhaps be anticipated from the data set for the coin experiment, which,
for want of a larger data set at the time of writing, will have to serve as a proxy for an
ideal infinite group of respondents.
The top panel of Figure 8.9 plots the histogram frequencies as points (rather than
as bars) as a function of bin value. Having examined numerous shapes assumed by
various skewed distributions for different parametric choices, I found that nearly all
of them failed to depict convincingly either the central concentration of points or the
fat tail or both. One striking exception, however, was the log-normal distribution, the
pdf of which takes the forms

p_X(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-[\ln(x/x_0)]^2/2\sigma^2} = \frac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-(\ln x - \mu)^2/2\sigma^2}     (8.10.5)

and leads (as shown in an appendix) to the following statistical moments and functions of moments

\langle X^n \rangle = e^{n\mu + \frac{1}{2} n^2 \sigma^2} \qquad n = 0, 1, 2, \ldots     (8.10.6)

\langle X \rangle = e^{\mu + \frac{1}{2}\sigma^2} = x_0\, e^{\sigma^2/2}     (8.10.7)

\sigma_X^2 = \langle X \rangle^2 \left( e^{\sigma^2} - 1 \right)     (8.10.8)

\mathrm{Sk}_X = \left( e^{\sigma^2} + 2 \right) \left( e^{\sigma^2} - 1 \right)^{1/2}     (8.10.9)

K_X = e^{4\sigma^2} + 2 e^{3\sigma^2} + 3 e^{2\sigma^2} - 3.     (8.10.10)

The solid curve in the top panel of Figure 8.9 is the theoretical pdf (8.10.5) with parameters \hat{\mu}, \hat{\sigma} determined from the maximum likelihood (ML) expressions

\hat{\mu} = \frac{1}{n} \sum_{k=1}^{K} n_k \ln X_k \qquad\qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{K} n_k \left( \ln X_k - \hat{\mu} \right)^2     (8.10.11)

applied to grouped data with K = 51 bins.
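The grouped-data estimates (8.10.11) and the log-normal mean (8.10.7) take only a few lines of Python. In the sketch below (my illustration; the bin centres and counts are placeholders, not the actual BBC responses), the standard error is taken as σ_X/√n, which is consistent with the ± values quoted in Table 8.5:

import numpy as np

def lognormal_group_judgment(Xk, nk):
    # ML parameters (8.10.11) from grouped data, then the mean (8.10.7)
    n = nk.sum()
    mu_hat = np.sum(nk * np.log(Xk)) / n
    sig2_hat = np.sum(nk * (np.log(Xk) - mu_hat)**2) / n
    mean_ln = np.exp(mu_hat + 0.5*sig2_hat)                          # Eq. (8.10.7)
    se_ln = mean_ln * np.sqrt(np.exp(sig2_hat) - 1.0) / np.sqrt(n)   # sigma_X/sqrt(n), from (8.10.8)
    return mu_hat, np.sqrt(sig2_hat), mean_ln, se_ln

# Placeholder grouped data (illustrative only)
Xk = np.array([250.0, 550.0, 850.0, 1150.0, 1450.0, 2050.0, 3050.0])   # bin centres
nk = np.array([180, 420, 390, 260, 150, 120, 60])                      # bin counts
print(lognormal_group_judgment(Xk, nk))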



Fig. 8.9 Coin experiment of Figure 8.8. Top panel: histogram of count estimates (gray dots) fitted by a log-normal probability density (black solid) and plotted against class values. Data were sorted into 51 bins of equal intervals of 140. Bottom panel: the same histogram and density plotted against the logarithm (to base 10) of the class values.

If the coin data were a sample drawn from a log-normal population, then a plot of
histogram frequencies as a function of the logarithm (to any base) of the bin values
should transform the histogram into the shape of a Gaussian, as shown by the solid
trace in the lower panel of Figure 8.9 calculated from the pdf (8.10.5) with ML
parameters. The points of the transformed histogram follow the theoretical curve
reasonably well. Figure 8.10 compares the observed cumulative distribution,
F_k = \frac{1}{n} \sum_{j=1}^{k} n_j \qquad k = 1 \ldots K,     (8.10.12)


Fig. 8.10 Coin experiment of Figure 8.8. Empirical (gray dots) and log-normal (black solid) cumulative probabilities plotted against the logarithm (to base 10) of the class values.

which is largely independent of the arbitrary choice of class number and interval,
with the theoretical cumulative probability

F_X(x) = \frac{1}{2} \left[ \mathrm{erf}\!\left( \frac{\ln x - \hat{\mu}}{\sqrt{2}\,\hat{\sigma}} \right) + 1 \right]     (8.10.13)
derived by integrating (8.10.5) with substitution of ML parameters. Agreement of
(8.10.12) and (8.10.13) is satisfyingly close.
The group judgment of the number of coins in a jar, given by the mean and
standard error relations (8.10.7) and (8.10.8), is summarized in Table 8.5 for several
choices of the number of classes K. Most striking is that the estimates (all in the vicinity of about 900 coins) are considerably poorer, compared to the true value of 1111, than the results of either the Mixed-NU model or the sample mean (982) representing "one voice, one vote". Table 8.5 also shows results of mixing a log-normal and uniform distribution (Mixed-LNU model). With parameters determined again by the maximum likelihood method, the resulting mixture had a log-normal component of about 96%, i.e. nearly a pure log-normal pdf, but nevertheless extracted a group judgment significantly closer to the true value than the sample mean.
Statistically, however, there would be no motivation at this point for such a model,
since the pure log-normal function itself is supposed to provide the long tail that the
uniform distribution was adopted to provide in the original Mixed-NU model. I will
return to this point later.
It is a striking feature of many random processes that they give rise to results
represented by a log-normal distribution. Among the multitudinous phenomena for

Table 8.5 Coin experiment log-normal models

Pure log normal
Number of bins               36            51            71
Bin width                    200           140           100
ML parameter μ̂              6.608         6.609         6.591
ML parameter σ̂              0.592         0.622         0.591
LN mean and SE               883 ± 14      900 ± 15      893 ± 15
LN percent error             20.5%         19.0%         19.6%

Mixed log normal-uniform
ML parameter μ̂              6.666         6.619         6.578
ML parameter σ̂              0.542         0.577         0.585
ML parameter f̂              0.958         0.962         0.961
Mixed-LNU mean and SE        1018 ± 20     984 ± 20      957 ± 20
Mixed-LNU percent error      8.4%          11.5%         13.9%

which a log-normal distribution has been claimed are (to cite but a few):19 (1) the
concentration of elements in the Earth's crust, (2) the distribution of particles,
chemicals, and organisms in the environment, (3) the time to failure of some maintainable system, (4) the concentration of bubbles and droplets in a fluid, (5) coefficients of friction and wear, (6) the latent period of an infectious disease, (7) the
abundance of biological species, (8) the taxonomy of biological species, (9) number of
letters per word and numbers of words per sentence, and (10) the distribution of sizes
of cities. In my own laboratory, recent investigations of the mystifying and amusing
system of explosive glass droplets known as "Rupert's drops" found that the size
of dispersed glass fragments followed a distribution similar to the log-normal
distribution.20
The reader will note that the preceding sampling spans fields of physics, chemistry,
engineering, biology, medicine, linguistics, and more. Surely, the same stochastic
mechanism cannot be operating in all these cases. One cannot help asking: why does
a log-normal distribution turn up so often?
A general explanation for the occurrence of skewed distributions was given at least
as far back as the turn of the twentieth century,21 subsequently followed by numerous
elaborations. Briefly (and with more attention to content than rigor), the argument
goes as follows. Consider a process occurring sequentially that produces some initial
element X_0 from which successive elements X_j (j = 1…n) arise by a random action ε_j on the immediately preceding element X_{j−1}, as in the sequence
19 Lists of reported occurrences of the log normal distribution with corresponding references are given by: (1) E. Limpert, W. Stahel, and Markus Abbt, "Log-normal distributions across the sciences: keys and clues," BioScience 51 (2001) 341–352; (2) Wikipedia, "Log-normal distribution," http://en.wikipedia.org/wiki/Log-normal_distribution
20 M. P. Silverman, W. Strange, J. Bower, and L. Ikejimba, "Fragmentation of explosively metastable glass," Physica Scripta 85 (2012) 065403 (1–9).
21 J. C. Kapteyn, Skew Frequency Curves in Biology and Statistics (P. Noordhoff, Groningen, 1903).


X_j = X_{j-1} + \epsilon_j X_{j-1} = X_{j-1}\left( 1 + \epsilon_j \right).     (8.10.14)

The outcome of the iterative process (8.10.14) at the nth step is then a product of n factors

X_n = X_0 \prod_{j=1}^{n} \left( 1 + \epsilon_j \right)     (8.10.15)

of which the logarithm (the base is not important to this demonstration) takes the form

\ln X_n = \ln X_0 + \sum_{j=1}^{n} \ln\!\left( 1 + \epsilon_j \right) = \ln X_0 + \sum_{j=1}^{n} \epsilon_j + \text{higher-order terms in } \epsilon.     (8.10.16)
Under the assumption that the random action at each step is small, whereupon
neglect of the higher-order terms in the expansion of the logarithm is justifiable,
the sum of stochastic variables in the right side of relation (8.10.16) asymptotically
approaches, by virtue of the CLT, a Gaussian random variable. In other words,
ln Xn follows a normal distribution, and therefore Xn is a log-normal random
variable.
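The argument is easily checked numerically. The sketch below (an illustration of the general mechanism, not a computation from the text) builds X_n from many small random multiplicative steps as in (8.10.15) and confirms that ln X_n is nearly Gaussian:

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(5)
trials, steps = 50_000, 200
eps = rng.uniform(-0.05, 0.05, size=(trials, steps))   # small random actions
X = 100.0 * np.prod(1.0 + eps, axis=1)                 # Eq. (8.10.15) with X0 = 100

logX = np.log(X)
print(skew(X), skew(logX))            # X is visibly right-skewed; ln X is nearly symmetric
print(kurtosis(logX, fisher=True))    # excess kurtosis of ln X close to 0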
It is difficult to imagine, however, how the foregoing sequential process might
pertain in the mind of a person asked to estimate the number of coins in a jar. Would
the person start with an estimate of the number and then sequentially modify it by
factors proportional to the current estimate until arriving at a satisfactory value?
Most likely not.
I propose, instead, a different mechanism, based on how I, myself, would have
executed the task. In short, I would first estimate the volume of the container and
then multiply that number by my estimate of the number of coins per volume. To see
how this plays out, look again at Figure 8.7, which shows that the glass container (wide at the top and narrow at the base) takes the approximate shape of a frustum of a cone, i.e. the portion of the cone lying between two parallel planes that cut it perpendicular to the symmetry axis. Now, with r the radius of the (small) circular base, R the radius of the (wide) circular mouth, and H the vertical distance between the two planes, it is a straightforward exercise in geometry to show that the volume of the conical frustum is

V(r, R, H) = \frac{\pi}{3}\left( r^2 + rR + R^2 \right) H.     (8.10.17)

Upon letting C stand for the number of coins per volume, I would then calculate the number of coins in the jar from the formula

X(r, R, H, C) = \frac{\pi}{3}\left( r^2 + rR + R^2 \right) H C.     (8.10.18)


The crucial point to bear in mind at this stage is that none of the needed numbers (r, R, H, C) is known; all are random variables whose realizations (i.e. guesses) by members of a group would be different. After examining Figure 8.7, made from the video given me by Ms. Freeman (who did not specify the dimensions of the glass container), I assigned the following rough values of lengths and uncertainties (in centimeter units)

r = N(3, 0.7^2) \qquad R = N(5, 1.0^2) \qquad H = N(20, 2.0^2) \qquad C = N(1, 0.2^2)     (8.10.19)

and assumed, as shown explicitly by the expressions in (8.10.19), that they constituted the mean values and standard deviations of normally distributed variates.
Simple (in contrast to compound) physical quantities are often distributed normally,
so the assumption does not seem unreasonable to me.
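A few lines of Python reproduce the spirit of this simulation (my own re-creation; the book's computation may differ in detail). Each simulated respondent draws (r, R, H, C) from the normal variates (8.10.19) and reports the count (8.10.18):

import numpy as np

rng = np.random.default_rng(11)
n = 1_000_000                                   # simulated respondents

# Normal variates of Eq. (8.10.19), centimeter units
r = rng.normal(3.0, 0.7, n)                     # base radius
R = rng.normal(5.0, 1.0, n)                     # mouth radius
H = rng.normal(20.0, 2.0, n)                    # height
C = rng.normal(1.0, 0.2, n)                     # coins per unit volume

X = (np.pi/3.0) * (r**2 + r*R + R**2) * H * C   # Eq. (8.10.18)
print(X.mean(), np.median(X), X.std(ddof=1))    # mean close to the ~1055 of Table 8.6

# Histograms of X and of log10(X) can then be compared with a fitted
# log-normal density, as in Figure 8.11.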
Figure 8.11 shows plots (in gray) of the frequency (upper panel) and cumulative
probability (lower panel) of the guesses from a group of 1 000 000 respondents, as
simulated by a Gaussian random number generator (RNG) generating one million
values for each variate in (8.10.19) and multiplying the four realizations of each trial
together in accordance with expression (8.10.18) to arrive at an estimate of the
number of coins in the glass. The data were sorted into 400 bins of width ~10 units.
Because the number of bins is large and the bin width is narrow, the plots are shown
as continuous curves, rather than by a sequence of discrete bars or points. Superposed on the plots are the corresponding theoretical curves (in black) obtained from
a log-normal distribution with parameters calculated, as before, by the method of
maximum likelihood. We will come to the gray dashed curves in due course.
The first feature to note is the striking visual accord between the data and the log-normal fit. Although the match of a log-normal distribution to the distribution
arising from computer simulation is not perfect, it is so awesomely close (and has
remained that way for numerous repetitions of the experiment) that one must
discount any attribution to coincidence. The explanation is both simple and subtle.
First the simple part. Consider a random variable X to be the product of a large
number J of arbitrary non-negative random variables X = V1V2. . .VJ with finite
moments. Then log X takes the form of a sum of variates
log X

J
X
j1

ln V j 

J
X

CLT

Y j ! N, 2

8:10:20

j1

which, by the CLT, asymptotically approaches a normal distribution. Under these


circumstances, X would then be log-normally distributed.
Now the subtle part. The variate X for the number of coins in the glass, defined by
Eq. (8.10.18), does not involve a large number of factors; just a sum of three terms,
each with four factors. Although it is conceivable that the sum of a small number of
random variables can yield a distribution very close to Gaussian we have already

502

The guesses of groups


1.4 10 4

Log Normal

1.2 10 4

Simulation of
Coin Experiment

Frequency

1 10 4

Mixed-NU

8000
6000
4000
2000

500

1000

1500

2000

2500

3000

2000

2500

3000

Bin

Cumulative Probability

0.8

0.6

0.4

0.2

500

1000

1500

Bin
Fig. 8.11 Computer-simulated coin experiment. Top panel: plot of frequency against class
value for (1) 106 guesses (gray) arrived at by the stochastic product (8.10.18) with normal
variates (8.10.19); (2) the theoretical log-normal density (solid black) and (3) the Mixed-NU
density (dashed), all model parameters being determined by maximum likelihood. Bottom
panel: the corresponding cumulative probabilities with the same traces as in the top panel.

encountered such a case in Chapter 1 with a sum of just three uniform variates this
is not the complete explanation. More to the point is that the product of normal
variates, whether independent (as in the product rRHC) or correlated (as in the
product R2HC) yields a log-normal variate either exactly or approximately, rather
than asymptotically. Thus the number of random variables in the product is not

8.10 The log-normal distribution: a fundamental model of group judgment?

503

Table 8.6 Coin experiment computer simulation


(number of coins in actual experiment = 1111)
Number of samples
Number of bins
Bin width

1 000 000
400
~10

LN model
Gaussian mean ^
Gaussian standard deviation ^
LN mean and SE
LN percent error

6.890
0.385
1058 0.6
4.8%

Mixed-NU model
Uniform range (a, b)
Gaussian fraction f
Gaussian mean
Gaussian standard deviation
Mixed-NU mean and SE
Mixed-NU percent error

(42, 4327)
0.986
1038
373.6
1054 0.4
5.1%

Simulated sample (1 million trials)


Sample mean and SE
Sample percent error

1055 0.4
5.1%

really key to the outcome. I demonstrate these properties explicitly in the appendix
on log-normal variates.
Other items to note concern the numerical details of the experiment, summarized
in Table 8.6.
First, as seen in Table 8.6, there is little difference (albeit a statistically significant
one given the recorded standard errors) between the sample mean and the mean
arrived at by a log-normal fit with ML parameters. In other words, a sample size of
one million trials approximates an infinite sample well enough that the normalized
set of empirical frequencies fnk/ng can be taken for all practical purposes as the true
probability function. The log-normal fit reproduces this probability function sufficiently closely to yield nearly the same value for the sample mean, although not
closely enough to pass a chi-square test. Given the objective, however, the latter
circumstance is unimportant. All that matters in the context of this investigation into
the wisdom of crowds is to get the best estimate that a crowd (which, now with
1 million respondents, really is a crowd) can provide. For the assumptions (8.10.19)
that have gone into the simulation, one cannot do better than the sample mean ~1055
in Table 8.6.
A second point concerns the matter of grouped vs ungrouped data. Given the
effectively infinite sample size, it was actually better to calculate the ML parameters
with ungrouped data; i.e. from the relations

504

The guesses of groups

n
1X
lnXi
n i1

n 
2
1X
lnXi  ^
^
n i1

8:10:21

where the sum is over elements, instead of from (8.10.11) in which the sum is over
classes. With grouping of data some information is always lost, but grouping of some
kind is necessary in order to visualize and model an empirical distribution. If all one
wants is the ML mean and standard error, grouping is not necessary but I have
found that model predictions based on parameters derived from ungrouped data
gave less satisfactory results than predictions based on parameters derived from
grouped data when sample size was small.
A third point concerns the calculation of standard error i.e. the standard
deviation of the mean a topic discussed at some length in Chapter 1. That discussion, however, pertained to a simpler situation, which does not apply now. The
variance of the mean of the log-normal model is not simply the variance of a single
estimate divided by the size of the sample. This understates the actual uncertainty
because it does not take account of the uncertainty in the ML parameters upon which
the mean depends. More generally, the variance of a function X = f(, ) of random
variables (, ) must be calculated from the conditional expectation and conditional
variance of X given the variables (, )
varX hvarXj , i varhXj , i:

8:10:22

In words: the variance of X is the sum of


(a) the expectation of the variance of X, given (, ), and
(b) the variance of the expectation of X, given (, ).
The demonstration of (8.10.22) is not difficult and can be found in statistics texts.22
Application of (8.10.22) (with use of approximation (5.2.5)) to the mean X of
n log-normal variates with ML parameters ^
, ^ , yields the expression
" 2
#
^
2^
^
2 e  1
2
var X e
cov ^
, ^
8:10:23
var ^
^ var ^
n
in which the first term (including the exponential prefactor) corresponds to component (a) above. To evaluate the next three terms which are part of component (b), turn
again to the calculation in Chapter 1 of the covariance matrix of a two-parameter
Gaussian distribution. The only differences between that calculation and the present
one are that (i) now the pdf is a function of lnX, rather than X, and (ii) the scale
22

A. M. Mood, F. A. Graybill, and D. C. Boes, Introduction to the Theory of Statistics 3rd Edition (McGraw-Hill,
New York, 1974) 159.

8.10 The log-normal distribution: a fundamental model of group judgment?

505

parameter is now taken to be , rather than 2. Thus, in place of Eqs. (1.24.22),


(1.24.23), and (1.24.24), we obtain
var^

^ 2
n

var^

^ 2
2n

cov^
, ^ 0

8:10:24

whereupon (8.10.23) reduces to


e2^ ^
varX
n


^ 4
:
e  1 ^
2
^ 2

8:10:25

I conclude this section with two further observations: one regarding my explanation of the generality of the log-normal distribution in estimation experiments, and
the other concerning the relationship and statistical implications of the two models
(Mixed-NU and LN) for extracting the knowledge of a group.
Although I arrived (theoretically and by computer simulation) at a log-normal
distribution by examining step by step how I would myself estimate the number of
coins in a jar as embodied by the stochastic product (8.10.18) I would emphasize
that I did not expect all (or perhaps even most) of the respondents emailing their
guesses to The One Show to have arrived at their estimates in the same way. Most
participants probably had no idea what a frustum was or how to calculate its
volume. This is, however, an unimportant geometric detail. One could model the
container simply as a box in the shape of a rectangular solid; the independent
variations in the product of height, length, and width would again generate a
distribution resembling a log-normal distribution. The seminal point to my explanation is that many respondents probably reasoned in some analogous way i.e. they
estimated the number of coins by multiplying several linear dimensions and a coin
density. If the variation in each stochastic variable resembled a normal distribution,
then a log-normal distribution of guesses was bound to emerge.
Now, as to implications. If a log-normal distribution accurately represents the
diverse conjectures of the members of a group, then it might appear (from Tables 8.4
and 8.5) that The One Show coin group was much less adept at numerical estimation than the results of the Mixed-NU model would indicate. The two models,
however, are not necessarily in conflict; they presume different populations, serve
different functions, and provide different information.
The Mixed-NU model is intended to assess the best collective guess of a particular
group responding to a single, specific query. This might be the kind of information
sought, for example, if one wanted to know right away the opinion of a class of
physics students, or the shoppers in a food market, or the studio audience of a
television game show whom the contestant can solicit collectively one time for advice
in answering a question. A repetition of the experiment with a different group of the
same size would probably lead to different statistics. The LN model presuming that
a log-normal distribution actually occurs ubiquitously assesses the hypothetical

506

The guesses of groups

collective response of a (practically) infinite sized group, in effect the parent population of all diversely knowledgeable, independently operating respondents who make
an effort to answer the posed question accurately. This might be the kind of information sought, for example, if one wanted to ascertain the opinion of a large group
on some technical issue to be decided in an approaching referendum. As the number
of respondents grew in time and more data were accumulated through periodic
polling, the distribution of responses would asymptotically become log normal with
a mean approaching the true mean of the entire population.
Regarding the italicized words above, when or why would a log-normal distribution not be expected to occur? Note that a log-normal distribution in my simulated
coin experiment arose repeatedly (i.e. with each simulation of one million estimates)
as a consequence of a knowledgeable calculational effort i.e. an estimating and
multiplying together of various uncertain factors and not as a result of unmotivated
random guessing. In other words, the log-normal distribution arose when the
computer-simulated group comprised informed, rather than uninformed,
members to use the words that inspired my simple Mixed-NU model in the first
place. If members of a group are largely uninformed in regard to some query, then
I would think that a mixed log-normal-uniform (Mixed-LNU) model would depict
the outcome better, as recorded in Table 8.5 for the BBCSilverman coin experiment.
For groups of small size, the distributions of responses in my experiments did not
much resemble a log-normal distribution; there were too few samples. For a very
large group of informed respondents, however, one would expect the Mixed-NU
model to classify most of the responses in the informed category i.e. to come up
with a parameter f close to unity and thereby arrive at a mean value close to the
sample mean. Stated somewhat differently, the Mixed-NU, LN, and Mixed-LNU
models should all do about equally well in fact, as well as could theoretically be
expected.
This is precisely what occurred when the Mixed-NU model was fit to the
computer simulated coin experiment, as shown by the dashed traces in Figure 8.11
with the statistical details given in the lower part of Table 8.6. Although the shape
of the resulting probability function does not fit the histogram of guesses as well
as does a log-normal density, in all cases where the ML equations of the model
could be solved numerically the resulting Gaussian fractional component was
within a few points of 100% and the model mean was virtually identical to the
sample mean.
8.11 Conclusions: so how wise are crowds?
In concluding this investigation into the assessment and efficacy of collective
judgment, I will highlight what I believe are useful lessons to be drawn from
the statistical exercises performed with my students and the viewers of BBCs The
One Show.

8.11 Conclusions: so how wise are crowds?

507

I began the project with an objective to find out whether the guesses of a group in
regard to various quantifiable matters (counts, weights, lengths, problem solving,
etc.) is better than the best guess by individuals within the group. In virtually every
trial with students, the mean response of the group did not surpass the best response
of one or more individuals. Indeed, in most cases, the judgment of the group was
considerably worse. The same was true with the BBC cake and coin trials. Four of the
shoppers queried in Borough Market guessed a cake mass of 5.3 kg, which was just
15 g below the exact mass. The mean of the 123 samples, while close, was still off the
mark by 101 g. Likewise, four of the BBC respondents emailed the exact number
(1111) of coins in the glass, whereas the mean of the 1706 samples (not counting the
saboteur) was off the mark by 129 coins.
From the perspective of potential utility, however, the key question is whether
those individuals giving the most accurate responses would do so again if the experiment in which they excelled was repeated. In other words, were they experts giving
expert opinion, or merely lucky guessers? If, for some cogent reason, you were
charged with the task to estimate the number of coins in the Royal Mint, would
you prefer to form a committee drawn from passersby in a London street, or hire
those four respondents on The One Show? The experiments on The One Show were
not repeated, so we will never know whether those respondents were experts. However, I had repeated some of my GOG experiments and found that individual best
responses were mostly lucky guesses. But there were exceptions. The individual who
estimated most closely the height of the classroom ceiling was on the college basketball team and apparently had a very good idea of his height and reach. He simply
stood up, raised an arm to the ceiling, and judged the distance accurately. If I was
charged by the Dean of Faculty to estimate the height of classroom ceilings, I would
prefer to hire this student rather than form a decanal committee of students and
faculty selected randomly on the college campus.
The next stage of the project had the objective of determining how best to extract
what information a group might provide. In the absence of any prior suppositions
concerning the composition of the group, the best group judgment would simply be
the sample mean (with associated uncertainty). But for small to moderately sized
groups, a single sample mean could be a poor assessment of the best response that
members of the group could give. Making the assumption that some individuals in
the group are more knowledgeable than others, I then developed a model (MixedNU) that could objectively (i.e. without my knowing any of the individuals) weight
the more informed opinions to a greater extent than uninformed, random guesses.
In the few cases where I could try the model on populations of statistically useful
size, the model yielded group means significantly closer to the known true values than
the unweighted sample means. How well such a model might work in other trials
remains to be seen.
In creating and testing other models, the examination of group responses strongly
suggested to me that,

508

The guesses of groups

if the size of the group is large enough, and


if the task demanded of the group involved some cogitative effort resulting in a
product of uncertain factors,
then individual responses would be distributed in an approximately log-normal
way. The log-normal probability function, with its concentration of density and
long tail skewed to the right (large positive values), automatically weighted opinions closest to the mean more heavily than outlying opinions, but without making
an artificial distinction between informed and uninformed respondents. However,
the log-normal mean (calculated with parameters obtained by maximum likelihood), could be markedly lower than the sample mean. Depending on the
true value of the sought-for quantity, this could indicate that the group judgment
was either considerably worse or better than one might have thought on the
basis of no model at all. (In the case of the coin experiment, the log-normal
estimate suggested a less reliable group judgment.) A discordance in this case does
not necessarily pose a contradiction, once it is understood that the log-normal
distribution should be interpreted as pertaining to a trial with an infinite population or infinite trials with a finite population, but not a single trial with a finite
population.
Recognition of the previously stated conditions under which a log-normal distribution would be expected to occur, likewise informs us of the circumstances under
which it is not likely to represent the judgment of a group well. For example, the
greater the proportion of a group that responds to a specify query with random
guessing, the more the distribution of responses would be characterized by a flat
(uniform) or mixed distribution rather than a log normal distribution.
All in all, here is my advice on the wisdom of crowds.
In matters (e.g. scientific) for which true expertise (and therefore true experts) can
be identified consult experts, not crowds.
In matters (e.g. popular culture; fields of study whose principles are ambiguous,
contentious, or untestable)23 for which no expertise or training is genuinely
involved consult a crowd.
The distribution of responses, and not merely the mean response, contains the
wisdom of the crowd.
The mean response of the crowd may be good, but not necessarily better than the
best response of individuals within the crowd. If you can identify these individuals,
find out if they are truly experts.
Good luck.

23

I am usually asked, whenever invited to give a lecture on this topic, what fields of study I had in mind in this line of
advice. It would of course be imprudent, if not offensive to some in the audience, to be specific. My reply to the inquirer
and to the reader is the same: You can probably answer that question yourself.

Appendices

8.12 Derivation of the jury theorem


To prove the relation
p
PrSn  nm jn

n  
X
n

jnm

pj 1  pnj

xnm 1 1  xnnm dx

Bnm , n  nm 1

by direct analysis, we must first show that



 1
m  
X
n1
n
xm 1  xnm1 dx
pj 1  pnj n
m
j
j0

8:12:1

8:12:2

for some integer m  n. This is done by iterated integration by parts of the integral in
(8.12.2), as shown below to the third level of reduction
1
xm 1  xnm1 dx
p

pm 1  pnm
m pm1 1  pnm1

nm
n  mn  m 1
1

2
nm2

mm  1 p 1  p
n  mn  m 1n  m 2
m2

3

1
mm  1m  2

xm3 1  xnm2 :
n  mn  m 1n  m 2
p

8:12:3
The pattern that unfolds is clear, and it does not
 takemuch algebraic rearrangement to
n1
, leads to the left side of (8.12.2).
show that (8.12.3), upon multiplication by n
m
509

510

The guesses of groups

Having established relation (8.12.2), we next recognize that


PrSn  nm jn 1  Pr Sn < nm jn
nm 1 
X
n j
1
p 1  pnj
j
j0
1

1n

n1
nm  1

 1

xnm 1 1  xnnm dx 1 

xnm 1 1  xnnm dx

B nm , n  nm 1

8:12:4
which reduces to (8.12.1) upon use of the defining relation for a beta function
p

1
Bnm , n  nm 1 xnm 1 1  xnnm dx xnm 1 1  xnnm dx:
0

8:12:5

8.13 Solution to logic problem #1: how old are the children?
The product of the three childrens ages equals 36. This can be achieved in the
following ways, where (n1, n2, n3) lists the ages in decreasing order.
Ages of three children

Sum of ages

(9, 2, 2)
(9, 4, 1)
(4, 3, 3)
(6, 3, 2)
(6, 6, 1)
(12, 3, 1)
(18, 2, 1)

13
14
10
11
13
16
21

The sum of the ages is not specified in the problem, but it must not be unique since
B requested further information. Thus, that sum must be 13 since it results from the
two distributions (9, 2, 2) and (6, 6, 1). Of these possibilities, however, only (9, 2, 2) has
an eldest child, there being two older children of the same age in (6, 6, 1). The answer,
therefore, is (9, 2, 2). The extra information of blue eyes was just a red herring.

8.14 Solution to logic problem #2: where is the treasure?


The problem can be worked by ordinary Euclidian geometry, or, more interestingly,
by the use of complex numbers. I shall use the latter. Recall that multiplying a
complex number z = x + iy = rei by the imaginary unit i = ei/2 rotates the vector

511

8.15 Origins and features of a log-normal distribution

Im
Alehouse

s1

s2
Alchemy

Metaphysics

Re

z2
Spike 2

z1

TREASURE

Spike 1

Fig. 8.12 Map of college campus showing the solution (location of treasure) to logic
problem #2.

(x,y) counter-clockwise by 90 . Likewise, multiplying by i = ei/2 rotates the


vector clockwise by 90 .
Place the Alehouse at some initially unknown location (x,y) on an Argand
diagram with the axes so chosen that the origin falls on the Real axis halfway
between the Metaphysics and Alchemy Buildings, as shown in Figure 8.12. The
complex-valued vectors s1 = (1 x, y) and s2 = (1 x, y) respectively locate
the Metaphysics and Alchemy Buildings relative to the Alehouse. Then the vectors
z1 = is1 = (y, 1x) and z2 = is2 = (y, 1 + x) respectively locate the first and
second spikes.
The location of the treasure is given by
z

z 1 z2
0,  1,
2

8:14:1

which places it one unit below the origin on the Imaginary axis. Surprisingly, as it
turns out, as long as the locations of the Metaphysics and Alchemy Buildings are
known, the initial location of the Alehouse does not matter.

8.15 Origins and features of a log-normal distribution


A log-normal variate X, to be symbolized by (, 2), can be defined by the relation
Y ln X

8:15:1

512

The guesses of groups

where
Y  N, 2 N0, 1

8:15:2

is a normal random variable of mean and variance 2. The second expression for
Y, which we already encountered in Chapter 1 (see Eq. (1.10.4)), will be particularly
useful shortly in examining the distribution of a product of normal variates. First,
however, let us consider how to calculate the moments of X.
8.15.1 Moments of a log-normal distribution
The inverse of Eq. (8.15.1), X = eY, provides a far more convenient way to calculate
the moments of a log-normal distribution than direct use of the probability density
(pdf ), moment-generating (mgf ), or characteristic (cf ) functions. Recall that the mgf
of a normal variate is
gY t heYt i et2 t :
1 2 2

8:15:3

It then follows immediately that the moments of X can be obtained directly from
(8.15.3)
hXn i gY n en2

1 2 2

n 0, 1, 2 . . .:

8:15:4

by setting the argument t equal to the order n of the sought-for moment. This leads to
expressions (8.10.6) (8.10.10), or, in general, to the following mgf and cf
gX t heXt i

n
X
t

hXn i
n!
n0

hX t he i
iXt

X
itn
n0

n!

n
X
t

n!
n0

n12n2 2

1
2

en n

2 2

8:15:5

8.15.2 Product of independent normal variates


If X is a product of n independent normal random variables,
X Y 1 1 , 21 Y 2 2 , 22 . . . Y n n , 2n

n
Y

Y i i , 2i

8:15:6

i1

then, by means of relation (8.15.2), the log of X can be cast into the form of a sum of
logarithms
ln X

n
X

lnY i i , 2i

i1

n
X
i1


X

ln i 1 i N i 0, 1
i1

n
X


ln i
ln 1 i N i 0, 1
i1

8:15:7

8.15 Origins and features of a log-normal distribution

513

where the -coefficients are defined by


i  i =i

8:15:8

and the subscript on the symbol Ni (0,1) emphasizes the independence of each
standard normal variate in the sum.
Upon expansion of the log functions ln (1+ i N(0,1)) in a Taylor series
ln 1

1n1

n1

n
2 3
  
n
2
3

8:15:9

and truncation at first order in i valid for well-localized Gaussian functions the
exact equation (8.15.7) takes the approximate form of a sum of normal random
variables
ln X

n
X

ln i

i1

n
X

i N i 0, 1

i1
n
X

n
X
ln i ,
2i

i1

i1

n
n
X
Y
i ,
2i
N ln
i1

8:15:10

i1

which is equivalent [again through use of (8.15.2)] to a single normal variate N(, 2)
with mean and variance
ln

n
Y
i1

i ,

n
X

2i :

8:15:11

i1

Thus, to good approximation (under the specified conditions) X is a log-normal


variate X (, 2).

8.15.3 Product of correlated normal variates


Geometric properties of symmetric figures may involve second and higher powers of
some dimensioned quantity, which can turn out to be a stochastic variable. We
examined in Chapter 5 the square of a random variable, in particular that of a normal
variate. We consider now an approximate, but nevertheless very useful, method for
determining the distribution of a normal variate raised to an arbitrary power.
Starting again with the expression (8.15.2) for a normal variate Y, we define the
variate X = Yn


X Y n 1 N0, 1 n :
8:15:12
The logarithm of (8.15.12) can be expanded in a Taylor series as before
ln X n lnY n ln n "ln1 N0, 1
#
2 N0, 12
3 N0, 13
n ln n N0, 1 


2
3

8:15:13

514

The guesses of groups

and truncated (for sufficiently small n at first order to yield the approximate relation
ln X Nln n , n2 2 :

8:15:14

In other words, X approximates a log-normal variate X (ln n, n22).

8.15.4 Product and sum of independent log-normal random variables


If X1 1 1 , 21 and X2 2 2 , 22 are two independent log-normal variates, then
their product Z = X1X2 can be reduced to the form of a log-normal variate through
the following chain of steps
Z 1 1 , 21 2 1 , 22 e1 1 N1 0, 1 e2 2 N2 0, 1
e1 2 eN1 0, 1 N 2 0, 2
2

e1 2 eN0, 1 2
2
2
eN1 2 , 1 2 1 2 , 21 22 :
2

8:15:15

There is no simple exact reduction for the sum Z = X1 + X2 of two log-normal


variates, but the following approximate reduction to Gaussian form can be useful.
Z 1 1 , 21 2 1 , 22 e1 1 N1 0, 1 e2 2 N2 0, 1




e1 1 1 N 1 0, 1 e2 1 2 N 2 0, 1
e1 e2 N 1 0, 21 e21 N 2 0, 22 e22
Ne1 e2 , 21 e21 22 e22 :

8:15:16

Under conditions where linearization of the exponential forms in (8.15.16) is justified, the variate Z is representable as a normal variate with respective mean and
variance
Z e1 e2

2Z 21 e21 22 e22

where the -coefficients are defined by relation (8.15.8).

9
The random flow of energy
Part I
Power to the people

The beauty of electricity . . . is not that the power is mysterious, and


unexpected, . . . but that it is under law, and that the taught intellect
can even now govern it largely.
Michael Faraday1
Models should be as simple as possible, but not more so.
attributed to Albert Einstein

9.1 A different kind of law


I never met anyone who actually checked the reliability of his electric meter readings.
I did not, myself, begin this research with that intention in mind, but, like other
projects undertaken over the years, came to this one serendipitously. The outcome
was unexpected and unsettling but I am getting ahead of myself.
Somewhere, years ago, I read that mathematician Kurl Godel, acclaimed for his
incompleteness theorems, had retained to his death virtually every receipt, invoice,
bank statement, cheque stub, etc., accumulated when alive. I had no way of knowing
whether the alleged eccentricity was true or not, but at the time the thought resonated
uncomfortably as I looked round my study at file cabinets filled nearly to capacity
with much of the same items and more. The acquisition of a shredder helped alleviate
the burden of paper, but among the various folders were useful files of data, including
a record of monthly electrical energy utilization spanning a period of more than
thirty years.
There is much to learn from a record of energy consumption. (Of course, the
energy is not consumed, but transformed into heat and work.) At a purely personal
level, the series of random-looking numbers (in kilowatt hours, kWh) recalled
memories of significant events. A long interval of low readings brought to mind a
period of travel or a power failure from a major storm; a short interval of high
readings marked a family gathering and celebration. The course ones life can be read
to some extent in a time series of kilowatt hours.
1

B Jones, The Life and Letters of Faraday (1870), Vol. 2, p. 404. (Quotation from Faradays lecture notes of 1858.)

515

516

The random flow of energy I

Energy (kWh)

800
600
400
200
0

12

24

36

48

60

72

84

96

108

120

Time (months)
Fig. 9.1 Discrete time series fxtg (gray dots) of electrical energy usage (kWh) for a period of
120 consecutive months. The connecting black lines serve to guide the eye.

To a physicist, however, the numbers also convey a scientific narrative of hidden


order beneath a surface of apparent randomness. When Faraday, in the epigraph
above, remarked on the beauty of electricity under law, he was undoubtedly
referring to James Clerk Maxwells then recent construction of his eponymous
equations. But I will offer for the purposes of this chapter a different interpretation,
one rooted in statistics than in electromagnetism. Though the use of electric energy
may vary indeterminately from month to month, in the aggregate I was sure it
concealed a pattern governed by statistical law a law that the taught intellect,
in Faradays words, might not necessarily govern, but nonetheless employ to useful
ends. I set about, therefore, to find that law.

9.2 Examining the data: time and autocorrelations


By law, what is really meant is a mathematical model. There is no physical
theory from which to derive a persons consumption of electric energy. That usage
is subject to random influences (like the weather) and deterministic influences
(like the seasons). We have seen, however, that randomness comes in different
varieties that is, in patterns with different degrees of predictability. The task,
then, was to find what model best described the patterns hidden in my record of
energy usage.
Given any time series of observations, it is informative to begin by plotting the
record and adjusting it, as necessary, to facilitate further analysis. The top panel
of Figure 9.1 shows a sample of the time series fxtg (t = 1. . .N) of energy readings
at my home for a period of N = 120 consecutive months. The actual data
(in kWh) consist of discrete points (gray dots) taken at monthly intervals; the
lines in the figure are there merely to guide the eye. The plot shows jagged peaks
at irregular intervals and fluctuates asymmetrically about the sample mean
(dashed line) at

9.2 Examining the data: time and autocorrelations

N
1X
xt 409:24 kWh:
N t1

517

9:2:1

(As an aside, I note that the average monthly electric energy consumption of a US
residential utility customer in 2011 (the last year for which I have data) was 940
kWh.)2 The fluctuations result in a sample variance
s2x

N
1X
xt  x2 107:492 kWh2 :
N t1

9:2:2

Beyond these obvious features, the raw data reveal little about the underlying
pattern . . . if there is one.
The slight asymmetry of the record about the sample mean signifies that the time
series is non-stationary: as the series progresses from left to right, the sample mean
decreases in time. As discussed in previous chapters, it is usually necessary to remove
the mean and slope when mining a time series for information. A time series of nonzero mean produces a large spike in the power spectrum at zero frequency and leads
to slow damping of the autocorrelation. Likewise, a non-zero trend produces lowfrequency oscillations in the power spectrum that can obscure important features.
The panels of Figure 9.2 show the results of various operations on fxtg to prepare
the record for further analysis. The first panel shows the time series fytg


N1
y t x t  x  x t 
t 1 . . . N
9:2:3
2
transformed to eliminate the mean x and slope x
2
3
N=3
N
X
X
1
4
x
xt 
xt 5
N=3N  N=3 tNN=31
t1

9:2:4

in which the bracketed expression [N/3] signifies the largest integer less than or equal
to N/3. Equation (9.2.4) is the discrete form of the slope of a continuous time record
3
2
T=3
T

1
7
6
x
xtdt 
xtdt5
9:2:5
4
T=32T=3
2T=3

of duration T. A close look at the series fytg, in particular its symmetrical fluctuation
about the horizontal baseline (y = 0), shows that it does indeed appear to have zero
mean and zero slope, two statistics readily verified by direct computation, together
with the sample variance

http://www.eia.gov/tools/faqs/faq.cfm?id=97&t=3

518

Dierence 112

Dierence 12

Dierence 1

Energy (kWh)

The random flow of energy I


400
200
0

- 200
- 400

12

24

36

48

60

72

84

96

108

120

12

24

36

48

60

72

84

96

108

120

12

24

36

48

60

72

84

96

108

120

12

24

36

48

60

72

84

96

108

120

400
200
0

- 200
- 400
400
200
0
-200
-400

400
200
0
-200
-400

Time (months)
Fig. 9.2 Top panel: time series fytg of Figure 9.1 adjusted for zero mean and zero trend. The
other panels show the respective difference series: second panel, fr1 yt g; third panel, fr12 yt g;
and bottom panel, fr1 r12 yt g.

s2y

N
1X
u2 96:362 9284:7 kWh2 :
N i1 i

9:2:6

Unless needed for clarity, physical units like kWh will be omitted in the remainder of
the chapter.
The second panel in Figure 9.2 shows the first-difference time series futg of lag
1 defined by the expression
ut r1 yt  yt  yt1

t 2 . . . N:

9:2:7

9.2 Examining the data: time and autocorrelations

519

Similarly, the third panel shows the first-difference series at lag 12


t r12 yt  yt  yt12

t 13 . . . N

9:2:8

and the fourth panel shows the multiplicative difference series at both lags
wt r1 r12 yt  t  t1 yt  yt1  yt12 yt13

t 14 . . . N:

9:2:9

As a matter of convention, the subscript 1 is usually dropped from the nabla (r) in
the case of first-difference lag 1. For clarity and, in particular, to distinguish
discrete differencing from the gradient operation I will retain the subscript in
all cases. Another matter of notation: use of the backward-shift (or more simply:
backshift) operator B, introduced in Chapter 6, allows one to express differencing
in a notationally simple way that will later facilitate the algebraic manipulation of
time series
ut 1  B yt

t 1  B12 yt

wt 1  B1  B12 yt :

9:2:10

Differencing reduces the number of elements in a time series, but the loss of these first
few elements is usually of no statistical consequence for a long series.
The utility of the difference series will become apparent when we examine the
sample autocorrelation functions. One can see, however, from looking at Figure 9.2
that a difference series gives a visual appearance of greater randomness compared to
the original series. For example, the series fytg of Figure 9.2 (top panel) show several
peaks at roughly 12-month intervals. These peaks, assuming they represent real
information and are not merely statistical fluctuations, have vanished in the series
fwtg (bottom panel). One of the strategies employed in solving a finite-difference
equation that results from a particular model of randomness is to operate on the
original time series so as to reduce it to white noise, represented by a random variable
of mean 0 and stationary variance, such as the following
t N0, 2 :

9:2:11

Differencing provides one way to try to do this.


We examine next the autocorrelation of the preceding time series. Various forms
of the autocorrelation function were discussed in Chapter 3. For the present, it
suffices to adopt the simplest form, given by the following expression for the four
series
Nk
X
zt  zztk  z

r z k

t
N
X

k 0 . . . m
zt  z

9:2:12

where the subscript symbol (z = y, u, v, w) identifies the particular time series,


the index stands for the appropriate first element of the series (e.g. = 1 for y;

520

The random flow of energy I

AC of y

ry(k)

0.5
0
-0.5

12

24

36

48

60

72

60

72

60

72

AC of 1 y

ru(k)

0.5
0
-0.5

12

24

36

48

AC of 112 y

AC of 12 y

rv(k)

0.5
0
-0.5
-1

12

24

36

48

rw(k)

0.5
0
-0.5

12

24

36

48

60

72

Lag (months)
Fig. 9.3 Autocorrelation (solid line) of the time series in Figure 9.2. Top panel, ry(k); second
panel, ru(k); third panel, rv(k); bottom panel, rw(k). The dashed lines represent approximate limits
of plus and minus two standard deviations: 2N1/2, where N is the length of the time series.

= 14 for w), and the lag number k, ranging from 0 to maximum lag m, marks the
delay in units of the sampling interval t (here equal to one month). As a matter of
notation, the symbol r(k) will be used for the sample autocorrelation and (k) for the
theoretical autocorrelation of a particular model. Also, although it was convenient in
Chapter 6 to express the lag number as a subscript (e.g. rk), in this chapter lag will be
expressed as an argument if a subscript is being used to identify the time series.
The four panels of Figure 9.3 show the variation with k of the autocorrelation
functions defined in (9.2.12). As in the plots of the time series, the autocorrelation functions are discrete functions defined at points (not shown) connected by solid
lines to aid the eye. The autocorrelation ry(k) (top panel) clearly reveals a pattern
buried in the original time series namely a slowly decaying periodic correlation of

9.2 Examining the data: time and autocorrelations

521

energy readings with 12-month periodicity. The pattern of correlations is very long
range, continuing beyond the arbitrarily chosen maximum lag number (72). The
periodicity is extraordinarily precise: peaks are seen to occur at exact multiples of
12 months despite the noise content in the corresponding series fytg. We have seen in
a previous chapter that Brownian noise gives rise to long-range correlations; the
pattern in ry(k) is entirely different from that of Brownian noise.
Autocorrelation ru(k) (second panel) shows that differencing at lag 1 has eliminated nearly all correlations except those at lag numbers k = 1, 12, 24, and 36. The
pair of dashed lines delimitpthe
approximate boundaries of plus and minus two
standard deviations sr  1= N (under the assumption that the noise is Gaussian)
by which to decide tentatively whether a particular correlation is statistically significant or not. Although no correlations at multiples of 12 higher than 3 appear to be
significant in the plot of ru(k), the figure suggests that such correlations have merged
with the noise rather than actually vanished. The distinction between the two
alternatives is that, if the first is correct, one would expect correlations at k = 48,
60, etc. to become significant in longer repetitions (. . . therefore smaller variance . . .)
of the same stochastic process.
Autocorrelation r(k) (third panel) shows that differencing at lag 12 has eliminated
all correlations except the correlation at k = 12 months. Although the correlation
r(5) exceeds the +2sr boundary, I have no reason to believe there is anything
physically significant about 5-month intervals in my electric energy usage. On the
contrary, the physical significance of 12-month intervals e.g. January to January to
January, etc. is comprehensible. If the noise is approximately Gaussian, then one
would expect about 95% of the set of correlations fr(k)g to fall within the 2sr
limits. Therefore, 5%, or 1 in 20, should fall outside the limits purely by chance. It
should not be surprising, then, if at least 1 in a plot of 71 correlations (not counting
the constant r(0)=1), should exceed +2sr.
Finally, the autocorrelation rw(k) (bottom panel) shows that after multiplicative
differencing at lags 1 and 12 the correlations remaining in the time series are
primarily at lag values k = 1 and (possibly) 11, 12, and 13. Note the importance
of the algebraic sign (+ or ) to the pattern of autocorrelations. For example,
in the second panel there is a positive correlation at 12 months and a negative
(anti-)correlation at 1 month, whereas in the bottom panel the correlations at
1 and 12 months are both negative.
The autocorrelation function of a difference series is theoretically derivable from
the autocorrelation function of the original series. Consider, for example, a stationary time series fytg of mean 0 for which the theoretical autocovariance functions
y k  hyt ytk i

k  . . . 0, 1, 2 . . .

9:2:13

are known. The angular brackets in (9.2.13) signify an ensemble average. The function
y 0  hy2t i 2y

522

The random flow of energy I

is the variance of fytg, and the theoretical autocorrelation function of fytg is defined
by the ratio
y k k=0:

9:2:14

Although a time series obtained from an actual experiment has a definite origin
(t = 0) and finite length (N), and the lag numbers of the associated sample fry(k)g
terminate at some designated maximum value (m), the time series of the underlying
hypothetical stochastic process is of infinite length with a theoretical autocovariance function fy(k)g extending over the range ( > k > ) symmetrically about
k = 0,
y k y k:

9:2:15

In other words, it is to be understood that jkj (rather than k) enters the argument of
(9.2.13) and (9.2.14) although, to keep notation simple, the absolute value sign will
not be employed unless needed for clarity.
The autocovariance of the series ut = r1yt can be obtained directly from the
expectation values
u k hut utk i hyt  yt1 ytk  ytk1 i
2y k  y k  1  y k 1

9:2:16

which yield
u k

2y k  y k 1  y k  1



2 y 0  y 1



y k  12 y k 1 y k  1
1  y 1

9:2:17
In the same way, one can derive the autocorrelation (k) of the series t = r12yt and
w(k) of the series wt = r1 r12yt


y k  12 y k 12 y k  12
9:2:18
k
1  y 12
3
2

1
7
6 y k  2 y k 1 y k 12 y k  1 y k  12
4
5
1
y k 11 y k 13 y k  11 y k  13
4


:
w k
1  y 1  y 12 12 y 11 y 13
9:2:19
Plots (not shown) of (9.2.17), (9.2.18), and (9.2.19) as functions of k, with the
theoretical y(k) approximated by the corresponding sample ry(k), superpose the plots
of ru(k), r(k), and rw(k) in Figure 9.3 nearly perfectly.

9.3 Examining the data: frequency and power spectra

523

Equations (9.2.17)(9.2.19) help explain the structure of the observed autocorrelation plots in Figure 9.3 even before we search for an underlying explanatory stochastic process. For example, look at the plot of ry(k) (top Panel) of Figure 9.3 and
consider the peak at k = 12. The theoretical expression (9.2.17) for ru(12) calls for
subtracting from ry (12) the mean of the two flanking values, ry(11) and ry(13). Since
ry(11)  ry(13) and both are less than ry(12), the outcome is a statistically significant
positive number. However, now consider the calculation of ru(11), which calls for
subtracting from ry(11) the mean of ry(10) and ry(12). From the nearly linear slopes of
the hilly waveform, one sees that the mean of ry(10) and ry(12) is very nearly equal to
ry(11), and so the subtraction yields a result close to 0. As this condition pertains on
the left and right slopes of all hills of ry(k), the resulting structure of ru(k), apart from
the isolated anti-correlation at k = 1, resembles a comb of Dirac delta functions
located at lag numbers equal to multiples of 12. Similar reasoning can be applied to
account for the structure shown in the other panels.
We are faced, then, with a curious problem. My monthly usage of electrical energy
shows a strong, persistent (i.e. long range), decaying, 12-month periodic correlation
with a triangular-appearing base waveform. Two of the difference series show anticorrelated energy consumption at intervals of one month (e.g. high in January, low in
February, high in March, low in April, etc.) and either positive or negative correlations at 12-month intervals. What law or process accounts for such a structured
pattern, given that the noise in the corresponding time series masks any overt
periodic structure?
To help answer that question, let us first consider the power spectrum of fytg.
9.3 Examining the data: frequency and power spectra
The last stage (for now) in the empirical investigation of the energy time series is to
obtain the power spectrum S(), calculable by any of several ways, each of which
illustrates points worth keeping in mind.
 First, employing a form of the WienerKhinchin (WK) theorem, we can calculate3
S() from the sample autocorrelation ry(k) as a continuous function of frequency
or angular frequency = 2
S 1 2

m1
X

ry k cos k ry m cos m:

9:3:1

k1

The time interval corresponding to a peak at frequency p is T p 1


p .
 An alternative use of (9.3.1) is to evaluate S() at the discrete set of m + 1 special
frequencies
3

The expression for S() in (9.3.1) omits a scale factor 2s2y t, which is unimportant if one is interested only in the location
and relative amplitude of the peaks.

524

The random flow of energy I

jc
m

j 0, 1, 2 . . . m,

9:3:2

where
c

1
0:5
2 t

for t 1

9:3:3

is the Nyquist or cut-off frequency (introduced in Chapter 3), leading to m/2 independent (and therefore uncorrelated) spectral estimates

j k
1j ru m:
ru k cos
Sj 1 2
m
k1
m1
X

9:3:4

The time interval corresponding to a peak at harmonic j is Tj = (2m/j)t.


 Next, we can obtain S() directly from the amplitudes of the discrete Fourier
transform of yt
Sj a2j b2j

j 0, 1, 2 . . . N=2,

9:3:5

where
a0

N
1X
y
N t1 t

aj>0



N
2X
2j t
yt cos
N t1
N

bj0

N
2X
y sin
N t1 t



2j t
: 9:3:6
N

The time interval corresponding to a peak at harmonic j is Tj = (N/j)t. Note that the
harmonic numbers of a given period T are not the same for series (9.3.4), in which
the maximum lag m determines periods, and for (9.3.5), in which the duration
N of the time series determines periods.
 And last, we can resort to a fast digital transform (FFT) that would execute the
same task as (9.3.6) by a different and much quicker algorithm. In the present
study this was not necessary because the length of the time series (N = 120) is so
short that implementation of all methods took only fractions of a second.
There is yet another method, to be discussed shortly, for obtaining the autocorrelation and power spectrum specific to two fundamental types of linear stochastic
processes that will ultimately be part of the solution to this investigation.
The upper panel of Figure 9.4 shows a panoramic plot of the spectral power S() as
a function of computed from (9.3.1) for the entire frequency range (c   0).
It reveals a large peak at about 0.08 (month)1 and what appears to be at least three
much smaller, but potentially significant, peaks at higher frequencies, and confusing
oscillatory structure at lower frequencies. The lower panel shows in greater detail the
portion of the range from 0 to 0.2 containing the highest peak. Black points mark
calculated values taken at equal intervals whereas, again, the line connecting points
merely helps guide the eye. The more points included in the calculation, the fuller the

525

9.3 Examining the data: frequency and power spectra


20
15
10
5
0
-5

0.1

0.2

0.3

0.4

0.5

Power S()

0.5

0
0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

Frequency (month )
-1

Fig. 9.4 Top panel: power spectrum of time series fytg of Figure 9.2. Bottom panel: details of
power spectrum (black points) in vicinity of peak at period 12 months (frequency
1=12 0:083). Oscillatory side lobes accompanying the large central peak can overlap
nearby spectral components. Large gray dots mark spectral points at the minimal set of
independent harmonics and avoid spurious side lobes.

peaks would appear to be resolved. In the lower panel, peak frequencies and associated periods (in months) were found to be
1 0:008276
2 0:08276
3 0:1676

T 1 120:83
T 2 12:083
T 3 5:967

corresponding closely to 12 years, one year, and one-half year. As remarked previously in the book, the lowest frequency in the Fourier analysis of a discrete time
series corresponds to the duration of the time series and is not indicative of any
underlying physical mechanism. The power spectrum therefore reveals seasonal
periodicities in the usage of electrical energy which, of course, is not unexpected.
Although my home is not directly heated by electricity, electricity is required to run
the furnace which burns a fossil fuel.
The presence of oscillatory side lobes flanking the central peak in Figure 9.4 also
reveals a potential problem with calculating the power spectrum at so many points.
That, in fact, is one reason for displaying the results of the calculation. Speaking
generally, distortions in the spectrum engendered by overlap of oscillatory side lobes

526

The random flow of energy I

of neighboring peaks can make it difficult to determine the location of a true


peak accurately or even discern the difference between a true peak and enhanced
side lobe.
The gray points in the lower panel of Figure 9.4 mark the spectral amplitudes Sj
calculated from (9.3.5) and (9.3.6). The peaks, now delineated by a minimal set of
three points each, occur at discrete harmonic numbers. When extended up to the cutoff (Nyquist) frequency, ct = 1/2, the plot reveals peaks at the following set of
harmonics and corresponding periods (months)
j1
j2
j3
j4
j5
j6
j7

1
10
20
30
40
50
60

T 1 120
T 10 12
T 20 6
T 30 4
T 40 3
T 50 2:4
T 60 2

9:3:7

which, (disregarding the non-physical peak at j1) will be recognized as the harmonic
series T(n)=12/n (n = 1, 2, 3. . .). The peaks at j2 and j3 show up strongly; the others
(not shown in the lower panel) are close to noise level, but still discernible. The peak
at j7 is only half complete since the spectrum ends at the Nyquist frequency
60
12. A longer time series would be needed to better resolve this portion of
c 120
the spectrum.

9.4 Seeking a solution: the construction of models


To recapitulate briefly, all the pertinent data (in the form of time series) have been
presented, and the latent patterns of temporal correlation and frequency composition
exposed to view and analysis. The task, then, is to make sense of it all and find the
stochastic process that accounts for these patterns. Clearly, though the time series in
Figure 9.1 appear to be random, they reveal a different kind of randomness than the
decay of radioactive elements (white noise) or the movement of stock prices (Brownian noise to a good approximation). Since the consumption of residential electrical
energy depends on personal lifestyle, the seasonal effects on temperature, and perhaps other uncontrollable variables, one may wonder, as a matter of principle,
whether such a solution exists, whether it is unique, and how it would be recognized
and verified.
The approach to a problem of this kind is to construct a mathematical model that
leads to the kinds of correlations observed in the data. In a manner of speaking, this
is the reverse of how one ordinarily proceeds in physics. It is usually the case as
exemplified by mechanics and electromagnetism that the physical laws are known,
and the required task is to find an appropriate solution for some specified experimental situation. In the present case, we have the solution (Figures 9.19.4), and the

9.5 Autoregressive (AR) time series

527

task is to find the process that generated it. In this endeavor, the guiding principle is
to account for the data by a theory employing the fewest independent parameters: a
parsimonious theory in the terminology of statistical analysis.
The building blocks of our theoretical construction come in two basic categories
of time series: (1) AR (for autoregressive) and (2) MA (for moving average).
Used alone and in various combinations involving differencing, the two forms permit
analysis of a wide range of linear physical systems.
We will examine briefly the seminal features of each class, starting with the first.

9.5 Autoregressive (AR) time series


Linear regression in statistics is the expression of a dependent variable in terms of
explanatory independent variables, the objective being to determine the proportionality coefficients which constitute the parameters of the model. The term autoregressive connotes a model with no distinction between independent and dependent
variables; the sought-for function depends on itself at earlier times. We encountered
the simplest autoregressive process AR(1) in Chapter 6.
To recall and extend some basic ideas presented in Chapter 6, a stationary
autoregressive series of mean x and order n symbolized by AR(n) can be defined
by a master equation of the form
xt x 1 xt1  x 2 xt2  x 3 xt3  x n xtn  x t

9:5:1

or, more simply, using our previous notation


yt

n
X

j ytj t

9:5:2

j1

in which fjg (j = 1. . .n) is the set of parameters to be determined from the data
and, as before, t N0, 2 is a Gaussian random variable of mean zero and
variance 2 . One is free, of course, to make t a different kind of random variable,
but unless there are cogent reasons for doing so, it is not usually done. The
normal distribution has the desirable property of stability (see Eq. (6.6.9)), which
facilitates the solution of (9.5.1) in ways that could not be applied to most other
distributions.
Two seminal advantages of working with AR models relate to the ease (at least in
principle) of (1) solving the master equation (9.5.2) and (2) determining the autocorrelation function (k) and power spectrum S().
Consider first the stationary solution to (9.5.2). In Chapter 6, we solved the case of
AR(1) by a judicious alignment and subsequent subtraction of time-lagged versions
of the equation. A more general direct approach employing the backshift operator
leads immediately to a solution in the form of an infinite series of random shocks.
Re-express (9.5.2) in the form

528

The random flow of energy I

1

n
X

!
j B

yt t ,

9:5:3

j1

then solve (9.5.3) as an algebraic equation with the inverse operation interpreted as a
Taylor series expansion4
!1
!k
n

n
X
X
X
j
j
yt 1 
j B
t
j B t :
9:5:4
j1

k0

j1

It then follows that yt is a Gaussian random variable of mean 0


yt N0, 2y :

9:5:5

A formal theoretical expression for the variance 2y is derived by expanding


X
k
n
j
j B
in a multinomial series and summing the squares of the coefficients of
j1

each independent random shock to obtain


#
"

X 
X
k
21 42 63
2
2nn
1 2 3 n
2 ,
y 0 y

.
.
.

1
n
k0
fg

where the second sum is over all the partitions of k such that

n
X

9:5:6

j k. A necessary

j1

and sufficient condition for the variance to converge to a finite constant in other
words for the solution to be stationary is that the roots of the characteristic
equation
1

n
X

j Bj  1  B 0,

9:5:7

j1

with B regarded as a complex variable, lie outside the unit circle. Satisfaction of
the condition also guarantees that the correlation function (k) tends to 0 with
increasing k.
We have already seen in Eq. (6.6.17) that the correlation function of an AR
process is obtained by solving the set of YuleWalker (YW) equations
k > 0

n
X

j jj  kj:

9:5:8

j1

We will examine the structure and content of these equations more closely later in the
investigation of particular AR models. For the present, it is to be noted that (9.5.8) is
4

This method of solving a finite-difference equation has its continuous counterpart in the use of the resolvent operator
to solve the Schrodinger equation in quantum mechanics. See M. P. Silverman, Probing The Atom (Princeton University
Press, 2000).

529

9.5 Autoregressive (AR) time series

an infinite set of linear algebraic equations. In practice, one approximates (k) by the
sample function r(k) for as many equations (n in (9.5.8)) as needed to solve for the set
of unknown parameters fjg, and then uses these parameters and the YW equations
to generate the correlations at all other lag values. The solution involves inversion of
a matrix of coefficients, which generally must be performed numerically because an
analytical solution, except for a very low order AR process, would otherwise be
cumbersome to obtain and work with.
An equivalent, theoretically exact method5 for deriving the autocovariance function (k) which therefore provides exact expressions for the autocorrelation (k)
entails use of the autocovariance generating function (agf )

X
1

1 

k zk
1  z1
Gz  1  z

9:5:9

k

with (z) defined in (9.5.7). Note that relation (9.5.9) is just a variant (up to a
constant scale factor) of Eq. (6.6.18) derived previously. The power spectrum S()
is then proportional to G(ei)
S / 1  ei 1 1  ei 1 j1  ei j2 :

9:5:10

In many, if not most, instances, a proportional relationship is all that is needed.


To see how this procedure works, consider again the single-parameter process,6
AR(1),
ut ut1 t :

9:5:11

The autocorrelation generating function (agf ) (in units of 2 ) is



Gz

1
1  z

 

1
1  z1

X
j, l0

jl z jjlj

k zk ,

9:5:12

k0

with the symmetry (9.2.15) taken into account so that the resulting power series is in
terms of non-negative powers only. By selecting all pairs of indices j and to make
j + l = k for a given k, one obtains the series
Gz z0 0 2 4  z1 1 3 5  z2 2 4 6 
z0 0 z1 0 z2 2 0
9:5:13

5
6

M. Kendall, A. Stuart, and J. Ord, The Advanced Theory of Statistics Vol. 3 (Macmillan, NY, 1983) 526.
There are actually two parameters because the variance 2 is also not known in advance.

530

The random flow of energy I

where
0

1
1  2

9:5:14

is the variance of ut in units of 2 . The autocorrelation function, following immediately from (9.5.13),
k k

k 0, 1, 2 . . .,

9:5:15

expresses an exponential decrease with lag number.


As a matter of practicality, it is worth noting that how one implements the expansion
of G(z) in (9.5.12) is crucial to getting useful results. In particular, the expansion in a
double sum (one in powers of z, the other in powers of z1) is necessary to obtaining the
correct series multiplying each power of z in the final product. If, in contrast, the two
factors [1(z)]1 and [1(z1)]1 are first multiplied together to form a single expression which is then expanded in a power series in z e.g. by a symbolic mathematics
application like Maple the coefficients would not be correct. The error arises because
the computer expands [1(z1)]1 in powers of z to obtain (z/)(z/)2(z/)3. . .,
instead of in powers of (z1) to obtain 1 + (z1)+(z1)2+(z1)3+. . ..
Consider next the power spectrum and contrast the effort required to calculate
S() by the standard method employing the WienerKhinchin (WK) theorem and by
a more efficient approach employing the agf. First, WK (with complex exponentials
rather than cosines):
WienerKhinchin
S /

X
j

jjj ei j 1

X
j1

j ei j

j ei j

j1

1
1
1

1


1
1  ei j
1  ei j

9:5:16

1  2
,
1  2 cos
2

where several algebraic simplifications mediated the transition between the lines.
Next, the autocorrelation generating function (effected in a single line):
AGF
1
1

:
9:5:17
S /
1  ei j 1  ei j 1 2  2 cos
Both methods yield the same function of to within a scale factor, as they must, but
the latter method requires less effort.
9.6 Moving average (MA) time series
The second building block in our search for a solution to the energy problem is the
MA(n) series, which takes the form

9.6 Moving average (MA) time series

yt t  1 t1   n tn


n
X
1
j Bj t  1  B t

531

9:6:1

j1

where a random perturbation (or shock) t at time t is independent of any other


random perturbation t, at a different time t 0. Thus, it follows that
ht t0 i 2 t t0

9:6:2

where tt0 is the familiar Kronecker delta symbol.


The designation moving average is somewhat ambiguous and confusing,
especially as the usage differs from the more familiar statistical transformation,
also termed moving average, for removing a known periodicity from a time
series. The nomenclature is entrenched, however, and unlikely to be changed. We
shall be using both types of moving average in this chapter, but the context should
make clear what usage is intended. Since the series (9.6.1) is already in the form of
a linear combination of Gaussian variates, it again follows that yt is itself a
Gaussian random variable of form (9.5.5), but with variance
y 0 2y 1

n
X

j2j :

9:6:3

j1

The structure of Eq. (9.6.1) shows that the present value of the function yt depends
on random shocks in the past. Since random events in the future cannot influence the
present, it must follow that
hyt tk i 0

k  1:

9:6:4

However, the ensemble average of yt with a past shock tk does not vanish. To
evaluate an expectation of this kind, as well as to find the autocorrelation function of
the MA time series, consider first the simplest member of this class, MA(1), with
equation
yt t  t1 :

9:6:5

In contrast to an AR system, the autocovariance of a MA system requires evaluation


of two types of covariances:
(1) cov(variable, variable)
hyt ytk i ht ytk i  ht1 ytk i,

9:6:6

hyt tk i ht tk i  ht1 tk i:

9:6:7

(2) cov(variable, shock)

532

The random flow of energy I

Substituting explicit values of k into these two equations leads to


k0
k1

k>1

2  ht1 yt i
hy2t i
hyt yt1 i ht yt1 i  2

y 0
y 1

hyt t1 i

ht t1 i  2

hyt ytk i

ht ytk i  ht1 ytk i 0

9:6:8

y 1
y k

from which we find that the only nonvanishing correlations are at k = 0, 1


y 0 1
y 1


y 1
y 0 1 2
y k
0:
y k > 1
y 0

9:6:9

The pattern manifested by MA(1) carries through in the general case MA(n),
namely nonvanishing correlations only for k
n, although it would be somewhat
tedious to demonstrate this by calculations generalizing (9.6.8). Instead, the covariance structure of MA(n) is obtainable with greater facility through use of an
autocovariance generating function analogous to (9.5.9) for AR(n)



Gz 1  z 1  z1 :
9:6:10
Substitution of (z) from (9.6.1) into (9.6.10) leads to
Gz 1 

n
X

j zj 

n
X

j1

j zj

n
X

j l zjl

9:6:11

j, l1

j1

from which one directly extracts the autocovariance function (in units of 2 )
y 0 1

n
X

2j

j1

y 1 1
y 2 2
..
.
y n n
y k > n 0:

n
X
j1
n
X

j1 j
j2 j

9:6:12

j1

Thus, as claimed, the autocovariance vanishes for lag numbers greater than n.

9.7 Combinations: autoregressive moving average time series

533

Note that the sum over negative powers of z i.e. the third term on the right side
of (9.6.11) was not needed to determine the autocovariance. All terms, however, are
essential to calculate the power spectrum by a relation analogous to (9.5.10)
S / 1  ei  1  ei  j1  ei j2 :

9:6:13

Substitution of (z) for MA(n) into (9.6.13) yields the spectrum



2
n
X


i j

S / 1 
j e
j1

n
X
j1

2j

n
n X
n1


X
X
 2 j cos j 2
j l cos j  l :

9:6:14

j > l l1

j1

9.7 Combinations: autoregressive moving average time series


The two classes of building blocks, AR and MA, can be combined to yield a hybrid
class of linear stochastic systems generally referred to as ARMA whose general
master equation for the system ARMA( p, q) is of the form
yt 

p
X
j1

j ytj t 

q
X

j tj ,

9:7:1

j1

or more compactly in the notation of the previous two sections


!
!
p
q




X
X
j
j
1
j B y t 1 
j B t ) 1  B yt 1  B t : 9:7:2
j1

j1

To delve comprehensively into the manifold varieties of ARMA time series would go
well beyond the objectives of this chapter, the primary focus of which is on insight
and methods for solving a particular problem of physical interest. Dedicated references exist for a more thorough treatment of the analysis of time series, and several
that I have used are referenced at the end of the book. Let it suffice to say without
demonstration that the covariance function of an ARMA time series is of infinite
extent, comprising damped exponentials and/or damped sine waves after the first
q p lag numbers.
The solution for yt is obtained immediately (albeit symbolically) from (9.7.2)


1 
yt 1  B
1  B t ,
9:7:3
and is readily shown to be of Gaussian form of mean 0, if t is a standard normal
variate. The autocovariance function takes the form



1  z 1  z1


Gz 
9:7:4
1  z 1  z1

534

The random flow of energy I

and the corresponding power spectrum (up to a scale factor) is, as expected,
S / Gei

j1  ei j2
j1  ei j2

9:7:5

Equations (9.7.3) (9.7.5) tell everything one would want to know about an ARMA
( p, q) time series although the series expansions required to extract this information
will be the more computationally intensive the higher the orders p and q.
With this rudimentary understanding of AR, MA, and ARMA time series, we
now have the tools to model the electric energy problem.

9.8 Phase one: exploration of autoregressive solutions


To start simply (in the hope of also ending that way), we can reason from the
autocorrelation of meter readings, first panel of Figure 9.3, that readings one month
apart and 12 months apart are strongly correlated. Perhaps that is all that is needed
to explain the full pattern of autocorrelation values and power spectral amplitudes
characterizing normal residential use of electric energy in the USA. Let us begin,
therefore, with a master equation for the mean- and slope-adjusted time series
yt 1 yt1 12 yt12 t

9:8:1

that requires only two parameters in addition to the variance 2 of the random noise
(or shock) term t.
I label this system AR(12)1,12 because it is actually a reduced form of the twelfth
order autoregressive process, all parameters fjg being 0 except for j = 1 and 12. The
unknown variance 2 is not numbered among the parameters in labeling the process.
The assignment of order, which underlies the nomenclature, is based on a standard
method of solving finite difference equations. Given an equation like (9.8.1) without
the random shock, one usually makes the ansatz (i.e. an educated guess, or trial
solution) yt / Yt, which in the present case would lead to a twelfth order algebraic
equation. We will see how this procedure works later in a simpler, algebraically
solvable system.
The solution of (9.8.1), given by (9.5.4) and (9.5.6), predicts that yt is a Gaussian
random variable of mean 0 and variance
#
"
!#
"
X
k  2

k
X
X
212 21
k
2j 2kj
2
2
2
2
12  1 Pk 2
y

2
1 12
9:8:2
2
j

12
1
k0 j0
k0
in which Pk(z) is the Legendre function of order k.7 The reduction of the second
expression to the third is not given here, but can be verified directly by use of a
symbolic mathematical application like Maple.
7

A Legendre function yk(x) is an appropriately normalized solution to the second-order differential equation (1  x2) y00 
2xy0 + k(k + 1) y = 0 in which the primes signify differentiation with respect to x.

9.8 Phase one: exploration of autoregressive solutions

535

Two methods for estimating the AR parameters entail use of (a) the YuleWalker
(YW) equations relating values of the autocovariance function at different lags, or
(b) the principle of maximum likelihood (ML) applied to the adjusted time series.
The first (YW) is quicker and simpler; the second (ML) is computationally more
intensive, but more accurate. Consider the simpler method first.
The YW equations of the AR(12)1,12 model,
0 1
k60 1 jk1j 12 jk12j

9:8:3

with substitution of the empirical autocorrelation values, take the form


!
1
r 11

r 11
1

YW

YW

12

r1
r12


:

9:8:4

and lead to the solution


YW

YW

12

ry 1  r y 11r y 12
1  ry 112
ry 12  r y 1r y 11
1  ry 112

0:358
9:8:5
0:391:

These results must be viewed with some caution since the choice of a different pair of
YW equations can produce a different set of numerical values.
The maximum likelihood (ML) method of estimating parameters of a linear time
series is ordinarily preferred, since it makes use of nearly all the elements of the time
series. Under the assumption that the residuals
t yt  1 yt1  12 yt12 N0, 2

9:8:6

are independent Gaussian variates, the conditional log-likelihood function L of the


AR(12)1,12 model takes the form (apart from irrelevant constants)
L

N
N
1 X
ln 2  2
y  1 yt1  12 yt12 2 :
2
2 t13 t

9:8:7

One then obtains three coupled equations by setting equal to 0 the first derivative of
L with respect to each of the three parameters. Solving the two equations
X
0 X 2
0X
1
1
yt1
yt1 yt12
!
yt yt1
ML
B t13
B t13
C 1
C
t13
BX
C
C
9:8:8
X
X
B
@
@
A
A
ML
yt1 yt12
y2t12
yt yt12
12
t13

t13

t13

that contain only the two AR parameters leads to the ML solution

536

The random flow of energy I

ML

0:348

ML
12

0:470

9:8:9

the values of which are then substituted into the third equation for the variance of the
residuals
s2

N 
2
1 X
ML
ML
yt  1 yt1  12 yt12 73:0262 :
N  12 t13

9:8:10

The covariance matrix, which is obtained from the second derivatives of the likelihood function, yield the following standard errors and cross-correlation of
parameters
s1 0:0786

s12 0:0813

r1 12

cov 1 2
0:371:
s1 s12

9:8:11

It is to be noted that the foregoing ML procedure, which is contingent on the first


12 values of the time series fytg, is equivalent to a least-squares solution. For a time
series extending N = 120 months, the dependence of results on the first 12 meter
readings is statistically unimportant unless the values of the parameters were to have
fallen very close to the unit circle, which is not the case. (See Section 6.13.)
Having estimated the parameters of the model, we can now investigate how well
AR(12)1,12 accounts for the observed statistics of electric energy consumption. To
begin with, the model makes two readily testable predictions:
 first, the residuals should be independent standard normal variates with variance s2
given by (9.8.10), and
 second, the elements of the stationary time series fytg should be standard normal
variates with variance proportional to s2 , as given by (9.8.2).
Visual inspection of the residuals, calculated from (9.8.6) with ML parameters and
plotted as a function of time in the upper panel of Figure 9.5, gives no indication that
the values are distributed non-randomly about the horizontal baseline t = 0.
The lower panel of Figure 9.5 shows a histogram of the residuals, distributed over
20 bins of approximate width 24.2, with superposed Gaussian curve of mean 0 and
variance s2 . A chi-square of 7.12 for 16 degrees of freedom8 yielded a P-value of
97.1%. Recall that this is the probability of obtaining a chi-square larger than the
observed value in a subsequent trial of the same stochastic process. Despite malaise
expressed by some statistical specialists regarding goodness-of-fit results that are too
good a point addressed in Chapter 1 I conclude that the data do not refute the
hypothesis of normally distributed residuals.

Loss of one degree each for (a) completeness relation and (b) estimation of 2 from the data.

537

9.8 Phase one: exploration of autoregressive solutions

Residual

200

200
10

20

30

40

50

60

70

80

90

100

110

120

Time (months)

20

Frequency

15

10

5
0

300

200

100

100

200

300

Bin
Fig. 9.5 Top panel: chronological record of residuals of the AR(12)1,12 model calculated with
ML parameters: 1 = 0.348, 12 = 0.470, 2 73:032 . Bottom panel: histogram of residuals
enveloped by Gaussian probability density N0, 2 (solid).

To test the hypothesis of independence, the autocorrelation of the residuals is


displayed in Figure 9.6. Dashed lines mark the 95% confidence limits, i.e. 2 standard
deviations
(sd), where an approximate upper bound on the standard deviation is
p
1= N . Slightly smaller values pertain at k = 1, 2 as indicated in the figure. For
independent Gaussian variates, one expects about 95% of values to fall within 2
sd. Thus, of the first 60 coefficients, one in 20 or about three coefficients may occur
outside the limits purely by chance. The figure shows one such point outside the limits
and three very close. The outcome of the test is again supportive of the null hypothesis.
Consider next the second prediction concerning the normal distribution of the
stationary time series. From relation (9.8.2), one would expect the ratio of var(yt) and
var(t) to be given by
"
!#

X
2y
212 21
2
2 k

12  1 Pk 2
1:735
9:8:12
2
12  21
k0

538

Autocorrelation r(k)

The random flow of energy I


0.2

0.2
0

10

15

20

25

30

35

40

45

50

55

60

Lag k (months)
Fig. 9.6 Autocorrelation of the residuals of Figure 9.5. Dashed lines mark 95% confidence
limits: 1.3N1/2 for k = 1; 1.7N1/2 for k = 2; 2N1/2 for k > 2.
ML

ML

upon substitution of the values of 1


and 12 from (9.8.9). For comparison, the
ratio of sample variances from (9.2.6) and (9.8.10) is
s2y 9:285 103

1:741:
s2 5:333 103

9:8:13

It would seem that one could hardly ask for closer agreement between hypothesis and
verification.
It remains to be seen, however, how well the model predicts the full autocorrelation function (up to some specified maximum lag), as well as the power spectrum. For
the first task, we turn again to the YW equations. Setting y(k) equal to the empirical
values ry(k) for k = 1. . .12, permits estimation of y(k) iteratively at all other values of
k from the YW algorithm
8
>
< 1 for k 0
9:8:14
y k r y k for k 1. . .12
>
: ML
ML
1 y jk  1j 12 y jk  12j otherwise:
The resulting autocorrelation is plotted as the dashed line in the upper panel of
Figure 9.7. Although the function displays a decaying oscillatory waveform in
approximate accord with the sample autocorrelation (gray dots), it deviates increasingly in peak amplitude and location with lag number.
The corresponding power spectrum (dashed line), derived from (9.5.10)
1
S
2 ,


ML i
ML
1  1 e  12 e12 i

9:8:15

is compared with the sample power spectrum (gray line) in the lower panel of
Figure 9.7. Each power spectrum in the figure is normalized to its maximum
value. As evident from the figure, the AR(12)1,2 model (dashed trace) does not

539

9.8 Phase one: exploration of autoregressive solutions

Autocorrelation

AR(12)
0.5
0
-0.5

12

24

36

48

60

Lag (months)

Power S()

AR(12)

0.5

0
0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

Frequency (month )
-1

Fig. 9.7 Top panel: comparison of empirical autocorrelation ry(k) (gray dots) with
autocorrelation calculated by model AR(12)1,12 (dashed) and model AR(12) (solid) with
maximum likelihood (ML) parameters. Bottom panel: comparison of the power spectrum
of the empirical time series (solid gray) with power spectra calculated on the basis of models
AR(12)1,2 (dashed black) and AR(12) (solid black) with ML parameters.

match well the relative amplitude and location of the fundamental peak (at frequency
= 1/12 0.083).
The two panels of the figure also display plots marked by solid black lines. We will
return to these shortly.
Although the 2-parameter AR(12)1,12 model has passed basic statistical tests of its
premises, one may nevertheless wonder whether it is perhaps a little too simplistic. In
other words, how do we know that eliminating the 10 parameters 2, 3,. . .11 at the
outset may not be responsible for the discrepancies seen in the autocorrelation and
power spectrum? If AR(12)1,12 works reasonably well, perhaps the full 12-parameter
AR(12) model will work even better. This would require, however, that we solve a set
of 12 coupled linear equations in order to determine the 12 unknown parameters.
Fortunately, that is the sort of tedious work for which computers are ideally suited.
Let us suppose that the model AR(12) applies and then estimate the full set of
parameters fjg (j = 1. . .12) by solving the YW equation, which can be written
compactly in matrix form as R = r, symbolizing the (12 12) matrix equation

540

The random flow of energy I

1
B r1
B
B r2
B
B r3
B
B r4
B
B r5
B
B r6
B
B r7
B
B r8
B
B r9
B
@ r 10
r 11

r1
1
r1
r2
r3
r4
r5
r6
r7
r8
r9
r 10

r2
r1
1
r1
r2
r3
r4
r5
r6
r7
r8
r9

r3
r2
r1
1
r1
r2
r3
r4
r5
r6
r7
r8

r4
r3
r2
r1
1
r1
r2
r3
r4
r5
r6
r7

r5
r4
r3
r2
r1
1
r1
r2
r3
r4
r5
r6

r6
r5
r4
r3
r2
r1
1
r1
r2
r3
r4
r5

r7
r6
r5
r4
r3
r2
r1
1
r1
r2
r3
r4

r8
r7
r6
r5
r4
r3
r2
r1
1
r1
r2
r3

r9
r8
r7
r6
r5
r4
r3
r2
r1
1
r1
r2

r10
r9
r8
r7
r6
r5
r4
r3
r2
r1
1
r1

1 0 1
10
r11
1
r1
B 2 C B r 2 C
r10 C
C B C
CB
B
C B C
r9 C
C B 3 C B r 3 C
B
B C
C
r 8 C B 4 C
C B r4 C
B
C
C
C
r 7 C B 5 C B
B r5 C
B
B
C
C
r 6 C B 6 C B r 6 C
C
B
C B C:
r5 C
C B 7 C B r 7 C
B
C B C
r4 C
C B 8 C B r 8 C
B
B C
C
r 3 C B 9 C
C B r9 C
B
C
C
C
r 2 C B 10 C B
B r10 C
@
@
A
A
r1
11
r11 A
1
12
r12
9:8:16

Formidable as Eq. (9.8.16) may appear, there is a regularity indeed beauty to the
structure of R. In the terminology of linear algebra R is a Toeplitz matrix, i.e. a
square (n n) matrix of constant diagonal with elements of the form Rij = Rjijj.
Owing to the symmetry, the matrix has 2n  1, rather than n2, degrees of freedom
with the consequence that a linear equation such as (9.8.16) can be solved in a
number of operations of the order of n2, rather than a higher number for a general
matrix (such as n3 in the case of the standard Gaussian elimination method).
Substitution into (9.8.16) of the sample autocorrelation function fry(k) k =
1. . .12g at the first 12 lag numbers, followed by matrix inversion generates the
solution vector (YW)
0
0
1
1
0:3135
0:3025
B 0:1181 C
B 0:09344 C
B
B
C
C
B 0:03955 C
B 0:04619 C
B
B
C
C
B 0:03800 C
B 0:02498 C
B
B
C
C
B 0:05823 C
B 0:07602 C
B
B
C
C
B 0:01661 C
B 0:03391 C
YW
ML
B
B
C
C

B
B

9:8:17
C,
C
B 0:09264 C
B 0:1227 C
B 0:01026 C
B 0:00055 C
B
B
C
C
B 0:06359 C
B 0:05437 C
B
B
C
C
B 0:1063 C
B 0:05548 C
B
B
C
C
@ 0:07564 A
@ 0:09571 A
0:3152
0:4089
the two largest components of which do not differ greatly from the two-dimensional
YW solution (9.8.5).
Also shown in (9.8.17) is the solution vector (ML) obtained by maximizing the
13-parameter conditional log-likelihood function
!2
N
12
X
N
1 X
2
y
,
9:8:18
L  ln  2
y 
2
2 t13 t j1 j tj

9.8 Phase one: exploration of autoregressive solutions

541

the details of which are left to an appendix. The two sets of solutions are similar, but,
as in the case of the two-dimensional problem, the maximum likelihood approach is
to be preferred over solution of the partial set of YuleWalker equations. The
ML
ML
and 1
(which is about 74% of
parameters of greatest magnitude are 12
ML
ML
12 ); all other ML parameters are less than 30% of 12 . One may wonder
whether the extra work has led to any significant improvement in the explanatory
power of the model.
Use of the ML solution in (9.8.17) to estimate the autocorrelation function from a
generalization of the iterative YW algorithm (9.8.14)
8
1
k0
>
>
>
>
< r y k
k 1, 2 . . . 12
y k X
9:8:19
12
>
ML
>
>

jk

jj
k

13
>
y
j
:
j1

leads to the solid black curves in both panels of Figure 9.7.


The AR(12) autocorrelation function in the top panel matches the amplitudes of
the peaks and valleys of the sample autocorrelation better, but still deviates in phase
with increasing lag number. The corresponding AR(12) power spectrum in the lower
panel now matches the amplitudes of the fundamental peak quite well although not
the first harmonic but is still downshifted by a small amount. The explanation of
this frequency shift (and indirectly the autocorrelation phase shift as well) is no
mystery, but is intrinsic to the general form of the theoretical AR(n) power spectrum
derived from (9.5.10)
1
S
2 :


n
X


j ei j
1 


j1

9:8:20

If all parameters but n were zero, then S() would become


S

2n

1
 2j cos n

9:8:21

and give rise to maxima (i.e. spectral peaks) at frequencies such that = 2k/n (k =
0,1,2. . .) i.e. at periods n/k, which correspond in the present case exactly to the
observed harmonic series of 12/k months.
However, suppose that only the two largest parameters, n and 1, are non-zero.
Then S() becomes
S

2n

21

1
h

i
 2 n cos n 1 cos  n 1 cos n  1

9:8:22

542

Autocorrelation

The random flow of energy I


1

Simulated AR(12)

0.5
0
-0.5
0

12

24

36

48

60

72

Lag

Power Spectrum

Simulated AR(12)
0.5

0
0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

Frequency
Fig. 9.8 Top panel: comparison of empirical autocorrelation (gray dots) with autocorrelation
of an AR(12) time series (black solid) simulated with ML parameters and t = N(0, 852).
Bottom panel: corresponding AR(12) power spectrum (black) compared with empirical power
spectrum (gray dots and connecting lines).

and the frequency closest to = 2/n at which the bracketed expression vanishes is
influenced primarily by the term at frequency (n  1). For n = 12, and 12 and 1
both positive and of comparable magnitude (~0.3), the spectral peak is downshifted
from 2/12 by 0.015. If the two parameters had opposite signs, the spectral peak
would be upshifted (not necessarily by the same amount).
The power of rapid computation afforded by desktop computers and software like
Maple and Mathematica provides a complementary way to explore the intricacies of
any hypothetical model, besides subjecting the one empirical time series to a battery
of statistical tests. One can simulate the stochastic process numerous times an
example of a Monte Carlo method to create an ensemble of time series by which
to judge whether the time series, autocorrelation function, and power spectrum of the
one available sample are plausible representatives.
The top panel of Figure 9.8 shows a comparison of the empirical autocorrelation
ry(k) (gray dots) with the autocorrelation function of one such AR(12) simulation
(black solid) created by the algorithm

9.9 Phase two: adaptive and deterministic oscillations

yt

8
y 13  t  1
>
< t
12
X
ML s
>
j
ytj t
:

t  14

543

9:8:23

j1

in which t N0, 2 . Different values of the variance 2 were tried; the figure shows
one of the outcomes for 2 852 . The lower panel of the figure compares the
corresponding power spectrum of the simulated time series with that of the sample.
Overall, the autocorrelation of the simulated series is seen to follow reasonably
closely the pattern of the sample autocorrelation, apart from an initial overshoot
around k = 6 and the weak dephasing of peaks that occurs at higher lag numbers.
Likewise, the power spectrum of the simulation reproduces in magnitude and location both the fundamental and first harmonic peaks of the sample spectrum.
Because the noise t (obtained from a Gaussian pseudo-RNG) varies from simulation to simulation, the resulting autocorrelation and power spectrum of a particular
trial can look different in detail from those displayed in Figure 9.8. The plots in the
figure were intentionally chosen because they represented the corresponding empirical functions well. However, the fact that the AR simulations produced profiles like
these fairly often indicates that such outcomes are not unrepresentative.
One could, in principle, stop at this point if the intent were primarily model
development for forecasting. But to a physicist used to closer agreement between
his theories and his data, there is a nagging perception that with more effort one
might still do better and learn more.

9.9 Phase two: adaptive and deterministic oscillations


Since the peaks in the sample power spectrum occur sharply at harmonics of 12
months, and the sample autocorrelation function looks somewhat like a decaying
cosine with peaks precisely at 12-month intervals, it is sensible to inquire whether the
observed time series may be explained at least in part by a sine or cosine function to
be noted generally as (co)sin. There are two fundamentally different ways to
model this.
One way is simply to insert a deterministic (co)sin function designated DeCos of
specified frequency or period = /2 = T1 into the model time series, for example
yt

12
X

j ytj a cos t b sin t t ,

9:9:1

j1

which increases the number of parameters by two. The additional parameters,


together with the original set of parameters, can be estimated by a maximumlikelihood fit. The advantage of the form used in (9.9.1) over an equivalent, but more
compact, representation c cos(t + ) in terms of amplitude and phase

544

The random flow of energy I


1=2
c a2 b2

tan b=a

9:9:2

is that the sought-for parameters in (9.9.1) are all linear.


The second way is to employ an adaptive (co)sin function designated AdCos.
This is a type of AR(2) process in which the amplitude and phase self-adjust in
response to the evolution of the time series. Consider, for example, the AR(2) series
xt 1 xt1 2 xt2 t ,

9:9:3

which gives rise to the YuleWalker (YW) equation


k  1 jk1j  2 jk2j 0

k 0, 1, 2 :

9:9:4

To solve (9.9.4), make the ansatz k = sk and then multiply each term by s2k to
obtain the quadratic equation
s2  1 s  2 0

9:9:5

whose roots are s+ and s. The general solution is then of the form
k Ask B sk

9:9:6

with constants A, B to be determined from initial conditions.


In order for the autocorrelation to take the form of a decaying cosine


k e jkj ei k ei k

9:9:7

with < 1, Eq. (9.9.5) must factor in the following way





s  ei s  ei s2  2 cos 2

9:9:8

which, in comparison with (9.9.5), leads to the AR(2) parameters


1 2 cos

2 2

9:9:9

in (9.9.3) and (9.9.4). For a time series periodic in 12 time units (e.g. months), = 2/12
and the YW equation for AdCos then takes the form
p
k  3jk1j 2 jk2j 0
9:9:10
in which there is one, not two, parameters to be determined from data.
The autocorrelation function (9.9.10) and associated power spectrum
1
S

p
1  3ei 2 e2i 2

9:9:11

are compared with the corresponding empirical functions in Figure 9.9. The amplitude = 0.97 was chosen by simulation and visual inspection. The accord is actually
rather good. Peaks of the autocorrelation are precisely at integer multiples of 12

545

9.9 Phase two: adaptive and deterministic oscillations

Autocorrelation

AdCos
0.5
0
-0.5

12

24

36

48

60

72

Lag

Power Spectrum

0.5

0
0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

Frequency
Fig. 9.9 Top panel: comparison of empirical autocorrelation (gray dots and connecting lines)
with autocorrelation (black) of AdCos time series of amplitude = 0.97 (determined by visual
inspection). Bottom panel: corresponding AdCos power spectrum (black) compared with
empirical power spectrum (gray dots and connecting lines).

(although there is still an overshoot in the vicinity of k = 6), and the fundamental
peak (which is the only peak) of the power spectrum is precisely at 2/12 (by
construction).
It is important to note that, however closely the autocorrelation function of a
hypothetical time series matches the autocorrelation of some empirical time series,
there is no guarantee that the stochastic process upon which the model is based will
describe satisfactorily the empirical series. Such is the case here with AdCos.
A simulated time series with less than (but close to) 1 and 2 about 80 or so can
produce an autocorrelation like that in Figure 9.9, but that simulated time series
displays variations too rounded and too regular to be generated by the same stochastic process that produced my electric energy readings.
Another seminal point: although the adaptive (co)sin parameters (9.9.9) lead to a
theoretical autocorrelation function that peaks at integer multiples of 12 when
employed in the YW equation (9.9.10), the autocorrelation of simulated time series
generated by stochastic process (9.9.3) with AdCos parameters does not peak at
integer multiples of 12, but, instead, displays phase shifts that increase with lag

546

The random flow of energy I

number. In other words, an AdCos function either alone or in combination with


some AR(12) process does not likely account for the energy time series and autocorrelation shown in the first panels of Figures 9.2 and 9.3.
I have found by computer simulation that a simple DeCos process
 
 
2t
2t
s
yt a cos
b sin
t ,
9:9:12
12
12
with parameters
a 42:21

b 68:74

9:9:13

obtained by a maximum likelihood fit to the time series of energy readings, gives
results closer to those observed. It is to be noted that parameters (9.9.13) are precisely
the amplitudes a10 and b10 corresponding to period T = 12 in the Fourier analysis
(9.3.6) of the empirical time series.
The top panel of Figure 9.10 shows the purely deterministic part (oscillatory black
trace) of (9.9.12) superposed on the empirical series of energy readings (gray trace).
Peaks and valleys line up nearly perfectly, highlighting the seasonal pattern that is
not readily evident amidst the noise. The two traces are displaced upward by 400 units
so that one can see clearly the lower plot (black trace) obtained by simulating the full
stochastic process (9.9.12) with a Gaussian RNG of mean 0 and variance 2 762 .
In contrast to an AdCos simulation, the DeCos simulation, like the empirical series,
does not exhibit any visually striking periodicity. Only by direct comparison with the
deterministic (co)sin wave in the figure does one become aware of the seasonality of
the time series.
The lower panel of Figure 9.10 shows the power spectrum of the simulated time
series superposed over the power spectrum of the empirical series. The match in location
and relative amplitude is nearly perfect in the vicinity of the fundamental peak at
frequency 1=1/12. No higher harmonics, however, are apparent in this figure although
the peak near frequency 2=2/12 has shown up in other simulated trials.
The autocorrelation of a deterministic (co)sin is another pure (co)sin, which is
markedly different from the autocorrelation of the stochastic process (9.9.12) shown
in the top panel (black trace) of Figure 9.11, superposed on the autocorrelation of the
empirical series (gray dots). The four panels of Figure 9.11 present a panoply of
correlation functions like those in Figure 9.3 for the empirical series of energy readings
and associated difference series. It is striking how closely the simple process (9.9.12)
reproduces the essential statistical features of the autocorrelation of all four series y,
r1y, r12y, and r1r12y. This does not mean that the DeCos model necessarily represents the true stochastic process that generated he series of electric energy readings.
The question of judging the suitability of a model will be taken up in due course.
Using computer simulation, I have explored the results of combining DeCos and
AR(12) processes. The combination reproduces the autocorrelation function more
satisfactorily than does AR(12) alone, but such a hybrid model has 14 parameters,

547

9.10 Phase three: exploration of moving average solutions


800

DeCos
Energy (kWh)

600
400
200
0
-200

12

24

36

48

60

72

84

96

108

120

0.14

0.16

0.18

0.2

Power S()

Time (months)

20
10
0
0

0.02

0.04

0.06

0.08

0.1

0.12

Frequency (months )
-1

Fig. 9.10 Top panel: simulation of stochastic DeCos time series (9.9.12) with parameters a =
42.21, b = 68.74, 2 762 (lower black trace); plot of deterministic oscillatory portion
(upper black trace) superposed on empirical time series (gray). Bottom panel: power spectrum of
simulated DeCos series (black) compared with power spectrum of empirical series (gray).

which hardly meets the criterion of parsimony. To return to simplicity, we must


explore a different direction.

9.10 Phase three: exploration of moving average solutions


With only two prominent peaks in the correlation function at lags k = 1 and 12, the
autocorrelation of the difference series r1r12yt in the fourth panel of Figure 9.3 is a
particularly simple pattern which, to an experienced analyst, suggests a moving
average time series of the form
wt 1  B1  B12 t t  t1  t2 t13

9:10:1

classified as MA(1) MA(1)12 for the two independent parameters. This is a pattern
that often shows up in analysis of linear systems subject to forcings at two different
periods, in this case one month and one year. The model systems, MA(1) MA(1)12

548

The random flow of energy I

AC of y

DeCos

0.5
0
-0.5

12

24

36

48

60

72

12

24

36

48

60

72

12

24

36

48

60

72

12

24

36

48

60

72

AC of 1 y

1
0.5
0
-0.5

AC of 12 y

1
0.5
0
-0.5

AC of 112 y

1
0.5
0
-0.5
-1

Lag
s

Fig. 9.11 Autocorrelation functions (solid black) of a simulated DeCos series yt (top panel)
s
s
s
and corresponding difference series r1 yt (second panel), r12 yt (third panel), r1 r12 yt
2
2
(bottom panel) with parameters a = 60, b = 30, 76 . The empirical autocorrelation
of electric energy readings (gray dots) is also shown in the top panel. Dashed lines delimit the
region within approximately 2 standard deviations.

and AR(12)1,12, give different interpretations to the nature of these forcings. In the
language of systems analysis, the AR(12)1,12 process, defined by Eq. (9.8.1), asserts
that the current state of the system depends on the state of the system at 1 and 12 time
units earlier. The MA(1) MA(1)12 process, in contrast, asserts that the current state
of the system depends on random shocks that occurred at 1 and 12 time units earlier.
Both processes also include random shocks at the present moment.
Drawing on the summary description (9.6.12) of MA systems given in
Section 9.6, we can say that wt is a Gaussian random variable of mean 0 and
variance

9.10 Phase three: exploration of moving average solutions



2w w 0
2 1 2 2 2 1 2 1 2
2

with a covariance function whose only nonvanishing components are




w 1= 2  1 2
w 11= 2 w 13= 2


w 13= 2  1 2 :
Relations (9.10.2) and (9.10.3) yield the autocorrelation function
8
1
k0
>
>
>

>
>
>


k1
>
>
1 2
>
>
>
>
<


k 11, 13
w k
2
1 2
1

>
>
>
>
>

>
>


k 12
>
>
>
1
2
>
>
:
0
all other k

549

9:10:2

9:10:3

9:10:4

which vanishes at all but five specified lag numbers.


Since the expressions for w(1) and w(12) separately involve the two parameters,
we can estimate and directly from the empirical autocorrelation of r1r12yt at
lags 1 and 12

1:488



1:519








r w 1 0:463 ) 0:672,
rw 12 0:453 ) 0:658:

9:10:5

Two solutions for each parameter are obtained because the theoretical expressions in
(9.10.4) are quadratic in the parameters. The reason for discarding the solutions
greater than one will be explained shortly. For the moment, however, note that the
discarded solution for each parameter is the reciprocal of the retained solution. In
other words and to generalize two MA(q) processes
xt t 

q
X
j1

j tj

yt t 

q
X

1
j tj ,

9:10:6

j1

while they represent different time series, nevertheless give rise to identical autocorrelation functions. Thus, one cannot uniquely characterize a MA process from the
autocorrelation alone.
Interestingly, and are about the same in (9.10.5), namely ~0.7. Given these
parameters, we can test the theoretical equality of the autocorrelation at lags 11 and
13, the predicted value of which is w(11) = w(13) = 0.213. The empirical values are
rw(11) = 0.132 and rw(13) = 0.308. However, since these are two realizations of
equal random variables within the framework of the model MA(1) MA(1)12,
the more appropriate statistic testing their equivalence is the mean

550

The random flow of energy I


rw 11 r w 13 2 0:220,

9:10:7

which looks to be in reasonable accord with the theoretical prediction (although one
cannot be sure without an estimate of the associated uncertainties). A more thorough
analysis of uncertainties will not be made here since the primary objective is merely to
see whether the investigated model provides a plausible explanation of the observed
series of energy readings.
It was noted in a previous section that a necessary and sufficient condition for an
AR model to be stationary is that the roots of the characteristic equation (9.5.7) must
lie outside the unit circle. The physical significance of this condition is that the
variance of the solution yt is finite and the autocorrelation function decreases with
increasing lag, in conformity with causality. An analogous criterion for a MA model
is that it be invertible. By invertible is meant that the system variable xt in (9.10.6)
be expressible in terms of systems variables, rather than random shocks, at earlier
times. Equivalently, this means that the MA process be expressible as an AR process.
Such an inversion is easily accomplished formally. First, solve for the current
shock

t 1 

q
X

j B

1
j

xt  1  B1 xt ,

9:10:8

j1

and then substitute (9.10.8) into the original expression (9.10.6) for xt to obtain
xt t  B1  B1 xt :

9:10:9

All terms on the right side of (9.10.9), apart from the current random shock, involve
the state of the system at earlier times. However, for the inverted solution to be
executable and give rise to a stationary solution, the roots of the characteristic
equation
1  B 0

9:10:10

must lie outside the unit circle. Otherwise, the solution (9.10.9), and therefore
(9.10.6), is not physically acceptable. This is the reason for discarding the other set
of solutions for and in (9.10.5). It can be proven (although not here) that for a
given autocorrelation function there is only one set of q parameters for which MA(q)
is invertible.
Assuming that the MA(1) MA(1)12 model accounts for the difference series wt,
the task then remains to deduce the stochastic process that accounts for the original
series yt. That task is a difficult one. Working backward from a difference series to the
original series is the finite-difference analogue to integration of a differential equation. Thus, the model represented by the finite-difference equation
wt  r1 r12 yt yt  yt1  yt12 yt13 t  t1  t12 t13 9:10:11

551

9.10 Phase three: exploration of moving average solutions

is a form of ARMA model referred to as an Autoregressive Integrated Moving


Average or ARIMA model or, specifically, ARIMA (0,1,1) (0,1,1)12 for process
(9.10.11). The sequence of labels ( p, d, q) signifies a form
rd1 yt  1  Bd yt t

p
X

q
X

j ytj 

j1

j tj

9:10:12

j1

whereas the sequence ( p, d, q)12 signifies a form


p
q
X
X

d
j yt  12j 
j t  12j :
rd12 yt  1  B12 yt t
j1

9:10:13

j1

From the general solution (9.7.3) to an ARMA model,


yt 1  B

1

1  B t

!
1  B  B12 B13
t ,
1  B  B12 B13

9:10:14

in which the inverse operator is to be interpreted in terms of a series expansion, one


can infer that yt is a Gaussian random variable of mean 0 representable as an infinite
series of random shocks. To find var(yt), start with the defining relation between wt
and yt in (9.10.11)
varwt  varyt  yt1  yt12 yt13

1
4y 0 1  y 1  y 12 y 11 y 13 ,
2

9:10:15

where the statistical identity


varX1 X2 Xn

n
X
i1

varXi 2

covXi Xj

9:10:16

i >j

was employed to arrive at the second equality. It then follows immediately that
2w
2y


1
4 1  y 1  y 12 y 11 y 13
2
upon identifying 2y  vary y 0.
Turning to the empirical series fytg and fr1r12ytg, we find

2y
varyt 9:285 103
) 2 0:769,
4
varwt 1:207 10
w

9:10:17

9:10:18

to be compared with the theoretical prediction (9.10.17), which can be estimated by


substituting the empirical values of the autocorrelation function

552

The random flow of energy I

2y
2w

1


0:743:
1
4 1  ry 1  r y 12 ry 11 ry 13
2

9:10:19

The result is reasonably close. The second equality in (9.10.11) also gives us the
relation for the variance of residuals
2

var wt


1 2 1 2


) 2

1:207 104
1:492

73:742 :

9:10:20

To establish whether the hypothesized conditions of the model are met, the residuals
ftg were tested for normality and independence. Evaluating the residuals, however,
poses a difficulty not encountered in the case of AR models where there is a single
random shock t acting at the present. Recall that the residual t is the difference
between the current state of the system yt and the theoretical process from which yt
arose. From (9.10.11), the residual for the ARIMA(0,1,1) (0,1,1)12 process is
t yt  yt1  yt12 yt13 t1 t12  t13 :

9:10:21

The state of the system yt within the interval N  t  1 has been observed and is
known, but the random noise cannot be directly observed. How then are the variates
tj for j > 0 to be evaluated?
There are several ways to do this, including a method of back-forecasting, but
the simplest approach by far is to use the algorithm

t

0
13  t  1
yt  yt1  yt12 yt13 t1 t12  t13

t  14:

9:10:22

For a long time series, the arbitrary assignment of 0s to the first 13 residuals will have
little consequence for the residuals test, provided these 13 0s are not included in the
resulting histogram. Execution of this test led to a histogram of residuals that was fit
to a Gaussian N(0, 73.742) [(9.10.20)], with 219 10:4 and P = 94.1% (the probability of obtaining a higher 2 for subsequent trials arising from the same stochastic
process). Similarly, a plot of the autocorrelation of residuals (to test independence)
showed no statistically significant outliers.
So far, then, the ARIMA model has not failed any of the preliminary tests. For a
more thorough investigation of the consequences of the model, we resort again to the
Monte Carlo method. The top panel of Figure 9.12 shows one of numerous time
series generated by computer simulation (black) with parameters = = 0.7 and a
Gaussian RNG N(0,(80)2), in comparison with the empirical energy series (gray).
Despite the apparent random fluctuations, there is, upon close examination, a
pronounced correlation of peaks at 12-month intervals. The corresponding power
spectra of the simulated and empirical series are shown in the bottom panel of the

553

9.10 Phase three: exploration of moving average solutions

Energy (kWh)

110 3

Simulated ARIMA

500

0
-500

12

24

36

48

60

72

84

96

108

120

0.14

0.16

0.18

0.2

Time (months)

Power S()

20
10
0
0

0.02

0.04

0.06

0.08

0.1

0.12

Frequency (month-1)
Fig. 9.12 Top panel: simulated ARIMA time series (black) compared with the empirical
energy series (gray) displaced upward by 500 units for clarity. Parameters are = = 0.7,
2 762 . Bottom panel: power spectrum of the simulated (black) and empirical (gray) time
series.

figure. The ARIMA model matches both the fundamental and first harmonic in
location and amplitude.
Figure 9.13 shows a panoply of autocorrelation functions of the simulated series
corresponding to the set shown in Figure 9.3 for the empirical series. The accord
between corresponding functions is remarkable. Not every simulated trial resulted in
such good agreement; after all, we are dealing with a stochastic process. Nevertheless,
the fact that the ARIMA model with maximum likelihood parameters , , 2
generates rather easily patterns like those shown in the figure suggests (. . . but does
not prove . . .) that the hypothesized process can account for the record of meter
readings representing my electric energy consumption.
As in my exploration of AR models, I also examined by computer simulation
the results of combining ARIMA(0,1,1) (0,1,1)12 and DeCos models, which leads to
a 5-parameter stochastic equation. Some of the trials led to autocorrelation functions
that matched the empirical functions quite well, but the simulated time series
exhibited an oscillatory structure after about the fortyeighth month that was too
regular and smooth to represent convincingly the actual series of readings.

554

The random flow of energy I

AC of y

Simulated ARIMA

0.5
0
-0.5

12

24

36

48

60

72

12

24

36

48

60

72

12

24

36

48

60

72

12

24

36

48

60

72

AC of 1 y

1
0.5
0
-0.5
-1

AC of 12 y

1
0.5
0

AC of 112 y

-0.5
1
0.5
0
-0.5
-1

Lag (months)
s

Fig. 9.13 Autocorrelation (AC) of the simulated ARIMA time series of Figure 9.12: yt (top
s
s
s
panel); r1 yt (second panel), r12 yt (third panel), r1 r12 yt (bottom panel). AC of the
empirical energy readings (gray dots) is also shown in the top panel. Dashed lines delimit the
region within approximately 2 standard deviations.

9.11 Phase four: judgment which model is best?


In seeking a unique physical process that explains my long record of personal electric
energy usage, we have looked at autoregressive, moving average, and pure harmonic
(adaptive and deterministic) mathematical models. Each has accounted for the empirical time variation, autocorrelation, and power spectrum satisfactorily to a certain
degree. In this matter and others like it where a judgment is required, how can an
analyst tell which model is best? Indeed, is there one true process compared to which
other models are approximations? And if that were actually the case, by what means
could one ascertain whether the true process was among the set investigated?

9.11 Phase four: judgment which model is best?

555

These questions have elicited much discussion in the past . . . and probably still do.
A succinct response to what is perhaps the most basic of the questions is the remark
attributed to G. E. P. Box, a pioneer in time series analysis: Essentially, all models
are wrong, but some are useful.9 Like many aphorisms, this one contains some
truth, but needs to be understood in context. I suppose one might consider the
application of Maxwells equations to some classical electromagnetic system as a
model of that system, but, given the fundamental role that electrodynamics
assumes in the structure of theoretical physics, I can hardly imagine a physicist
thinking of the theory as wrong, but useful. Even if an electromagnetic system
being investigated turned out to be quantum in nature, rather than classical, a
resulting discrepancy between calculation and experiment would more likely draw
a judgment that the limits of validity of the theory were exceeded, rather than that
classical electrodynamics is wrong.
Loosely speaking, then, a model is what one constructs when there is no fundamental overarching theory to draw upon. That is often the case when the process to
be understood has originated in a field of study without self-consistent, reproducibly
testable fundamental laws or principles or, to the contrary, such principles may be
known but the problem is of such complexity that it is uncertain how to implement
them. For the subject matter of this chapter the uncertain flow of electric energy
where noise in the system reflects human behavior influenced by economics (cost of
energy), environmental concerns (aversion to waste and pollution), as well as physical circumstances (seasonality and local temperature), it is probably safe to say that
there is no unique law. Under other circumstances, however, where human behavior
does not enter significantly for example, the investigation in the next chapter of the
variable flux of solar energy into the ground an appropriate physical process will
emerge as the explanatory mechanism.
The matter of a true law aside, how is the best model (of a set of trial models) to be
determined? Two approaches that are widely relied upon are
 the Akaike information criterion (AIC), and
 the Bayesian information criterion (BIC).
The first was initially derived from information theory; the second was the outcome
of a Bayesian analysis. Both methods, in fact, can be products of Bayesian reasoning,
the only difference being in the choice of priors. Consider first the AIC.
In Chapter 6, we touched upon the rudiments of information theory as developed
in the late 1940s primarily by Claude Shannon.10 At the core of this theory is the
concept of statistical entropy H, which is a probabilistic measure of the information
contained in some set of symbols. The entropy concept is extraordinarily general and

9
10

Referenced by Wikiquotes [http://en.wikiquote.org/wiki/George_E._P._Box] to: G. E. P. Box and N. R. Draper,


Empirical Model Building and Response Surfaces (Wiley, NY, 1987) 424.
C. E. Shannon, A mathematical theory of communication, Bell System Technical Journal 27 (1948) 379423.

556

The random flow of energy I

can be applied to a wide variety of systems such as the letters transmitted in a


message (the problem that interested Shannon), the quantum states of a collection
of atoms (the problem that is addressed in statistical mechanics), and others too
numerous to mention. A few years after Shannons seminal paper on the theory of
communication, mathematicians Kullback and Leibler introduced a concept referred
to as the KullbackLeibler (KL) information entropy or KL divergence,11 which is a
quantity resembling a difference in entropy that provides a measure of the information lost when a model is used to approximate reality.
Although in most instances of application (outside physics) there is no true theory
of reality, adherence to this point of reference does not diminish the utility of the KL
information since it can be employed as a measure of the divergence of two competing models from the hypothetical true theory and therefore of the information loss
of the one model relative to the other. The mathematical expression for the KL
divergence will not be given here because it would require explanation of technical
points that are not needed for what is to follow.12 Let it suffice to say that it has all
the properties of a metric as defined in topology, except the triangle inequality
property, and therefore cannot be considered a true measure of distance.
The concept of KL information entropy would probably have remained an
interesting abstraction appreciated only by information theorists were it not for the
realization by H. Akaike, nearly 20 years later, of a formal relationship between
information theory and the principle of maximum likelihood (ML).13 Consider the
question: can one compare different models by estimating the parameters of each by
the principle of maximum likelihood, and then favoring the model with the largest
value of the maximized log-likelihood function? In general the answer is No
because a fit is likely to be better, and therefore the maximized log likelihood function
greater, the more parameters a model has. Standard ML procedure alone does not
favor the desired characteristic of parsimony.
Akaike proposed instead to choose the model that minimized the KL divergence,
which was approximately equivalent to maximizing the mean expected log-likelihood
functiona quantity involving two averaging procedures.14 The familiar maximized
log-likelihood is a biased estimate of the mean expected log-likelihood, for which,
asymptotically, the bias is the number of parameters of the model. Akaikes discovery
that an unbiased estimate of the KL information entropy could be expressed in terms
of the maximized log-likelihood function with an additional term to remove the bias
has led to a surprisingly simple expression the Akaike information criterion (AIC)
by which to judge quantitatively the effectiveness of a model:
11
12
13

14

S. Kullback and R. A. Leibler, On information and sufficiency, The Annals of Mathematical Statistics 22 (1951) 7986.
For details, consult S. Kullback, Information Theory and Statistics (Dover, 1968), Chapter 1.
H. Akaike, Information theory as an extension of the maximum likelihood principle, in Second International
Symposium on Information Theory (Eds.) B. N. Petrov and F. Csaki (Akademiai Kiado, Budapest, 1973) 267281.
^
If g(x) is the pdf of the model and y
are the ML parameters

 determined from a set of data y, then the double
^
averaging referred to in the text is of the form Ey Ex log g xjy
, in which x and y are conceptualized as independent
random samples from the same distribution.

9.11 Phase four: judgment which model is best?

AIC 2 log Lmax 2K:

557

9:11:1

The first term of (9.11.1) is twice the log-likelihood function evaluated at the ML
parameters; the second term is twice the number K of free parameters (including the
variance of the random shock) characteristic of a particular model. Upon substitution of the ML parameters into the log-likelihood, the AIC reduces to the form
(derived in an appendix)
AIC N log b2 2K

9:11:2

where the ML estimate of the variance of N residuals is


N
1X
2 :
b2
N t1 t

9:11:3

As we shall see, (9.11.2) can be evaluated fairly easily for all the models we have
examined. Equation (9.11.2) is an asymptotic expression, applicable for N/K greater
than about 40. A second-order (in K) correction to the AIC, symbolically represented
by AICc,
AICc AIC

2KK 1
2NK
N log b2
,
NK1
NK1

9:11:4

was derived by N. Sugiura15 in 1978 for finite sample size. For notational simplicity
in the remainder of this section, I will use the symbol AIC to represent relation
(9.11.4).
Given a set of data and various explanatory models (that cannot be eliminated on
prior theoretical grounds), the model deemed most suitable is the one that minimizes
the AIC. For a given class of model, a larger number of parameters may result in a
lower b2 , but the structure of AIC is such as to penalize models with higher numbers
of parameters. As long as one is comparing models for the same set of data, the use of
AIC is not restricted to a set of nested models i.e. a series of models for which the
master equations differ sequentially from one another by an additional term. Moreover, the statistical distribution of the noise t need not be the same for all competing
models being ranked by the AIC.
We consider next, but more briefly, the Bayesian information criterion (BIC),
proposed in 1978 by G. Schwarz,16 on the basis of a Bayesian probability argument
rather than information theory. Noting that the maximum likelihood principle
invariably leads to choosing the highest possible dimension for the parameter space
of a given class of models, Schwarz sought a modification of the ML procedure by
examining the asymptotic behavior of Bayes estimators employing priors that
15
16

N. Sugiura, Further analysis of the data by Akaikes information criterion and the finite corrections, Communications
in Statistics Theory and Methods A7 (1978) 1326.
G. Schwarz, Estimating the dimension of a model, Annals of Statistics 6 (1978) 461464.

558

The random flow of energy I

concentrate probability on lower-dimensional subspaces corresponding to the parameter space of competing models. The criterion he arrived at takes the form
BIC 2 log Lmax K log N N log b2 K log N,

9:11:5

which levies a greater penalty than does the AIC on models with many free parameters. As in the case of the Akaike criterion, the best model (of a set of competing
models) is the one with lowest BIC.
Comparison of (9.11.4) and (9.11.5) shows that the greater the number N of
included observations, the more the BIC and AIC will differ in their assessment of a
model of given K. Which of the two criteria is conceptually better founded or at
least more useful? In the derivation of the BIC all models being compared were
initially presumed to be equally likely; i.e. the prior distribution was uniform. Some
analysts have found this to be a poor choice:
While [a uniform prior] seems reasonable and innocent, it is not always reasonable and is never
innocent; that is, it implies that the target model is truth rather than a best approximating
model, given that parameters are to be estimated. This is an important and unexpected result.17

The AIC can likewise be derived from a Bayesian (rather than information theoretic)
argument, but with what is termed a savvy prior i.e. a prior that depends on the
number of free parameters and number of observations.
Table 9.1 summarizes the properties of seven models (out of a much larger set
I investigated), that best account for the statistical features of my record of electric
energy use. For purposes of organization, they are listed as belonging to one of the
three broad families: AR, ARIMA, and pure Cos (either adaptive or deterministic).
Most of the model parameters were estimated by the principle of maximum likelihood; a few resulted from visual inspection of computer simulations to match the
calculated and empirical autocorrelation functions.
Table 9.2 summarizes the results of applying the AIC and BIC procedures to the
models of Table 9.1 The three best models, according to both sets of criteria, are
ranked in Table 9.3 in order of increasing AIC.
The first-ranked (lowest AIC) model is the seasonal ARIMA, which posits that my
use of electrical energy at any moment is conditioned upon my use 1, 12, and
13 months previously, as well as by Gaussian random noise at the moment and at
1, 12, and 13 months previously. This is a fairly complicated law. It is not likely to be
the process that first occurs to a person looking at the zigs and zags of the empirical
time series or its autocorrelation. And yet, it is also a very simple law when understood as the outcome of a multiplicative differencing operation at the intervals of one
month (smallest interval between recordings) and one year (period of the Earths
revolution).
17

K. P. Burnham and D. R. Anderson, Multimodal inference: understanding the AIC and BIC in model selection,
Sociological Methods and Research 33 (2004) 261304.

9.11 Phase four: judgment which model is best?

Table 9.1
Parameters18

Summary of competing models


AR(12)

AR(12)1,12

AR(12)1,12 DeCos

^1
^2
^3
^4
^5
^6
^7
^8
^9
^10
^11
^12

0.3025
0.093 44
0.046 19
0.024 98
0.076 02
0.033 91
0.1227
0.000 55
0.054 37
0.055 48
0.095 71
0.4089

0.3476
0
0
0
0
0
0
0
0
0
0
0.4700

0.2790
0
0
0
0
0
0
0
0
0
0
0.3431

b2
a^
b^

(70.86)2
. . .. . .
. . .. . .

(73.28)2
. . .. . .
. . .. . .

(70.00)2
10.77
35.86

a^
b^

1
2

AdCos
. . .. . .
. . .. . .
0.98
p
3 1:697
2 = 0.960

^
^

b2

a
b

559

ARIMA(0,1,1) (0,1,1)12
0.6721
0.6583
(70.67)2
. . .. . .
. . .. . .

DeCos
42.205
68.742
. . .. . .

ARIMA(0,1,1) (0,1,1)12 DeCos


0.6721
0.6583
(98.81)2
18
15

The third-ranked model is a two-dimensional sub-class of AR(12), which posits that,


apart from current random noise, my use of electrical energy is conditioned only upon
my use at one month and one year previously. With three free parameters, it was the
simplest and most obvious model to start with on physical grounds. The second-ranked
model is the same as the third, but with an explicit low-amplitude 12-month periodic
variation not dependent on previous usage. This, too, could be justified physically.
There is a point to reflecting on the physical plausibility of each model even after it
has been judged by the Akaike or Bayes information criterion. It is to emphasize that
18

A caret over a parameter signifies that it is a maximum likelihood estimate. Parameters not marked by a caret were
obtained by simulation and visual inspection.

560

Table 9.2

The random flow of energy I

AIC and BIC tests of model quality (series length: N = 107)


AR(12)

AR(12)1,12

AR(12)1,12 DeCos

Parameters r
Residual var 2
AICc
BIC

13
(70.9)2
941.7
972.5

3
(73.3)2
925.2
933.0

5
(70.0)2
919.8
932.6

Parameters r
Residual var 2
AICc
BIC

AdCos
2
(138.1)2
1059
1064

DeCos
3
(79.1)2
941.4
949.2

ARIMA(0,1,1) (0,1,1)12
Parameters r
Residual var 2
AICc
BIC

Table 9.3

ARIMA(0,1,1) (0,1,1)12
DeCos
5
(98.8)2
993.5
1006

3
(70.7)2
917.4
925.2

Ranking and relative probability of competing models

Model

AIC

Rank

AIC

Rel. prob.

ARIMA(0,1,1) (0,1,1)12
AR(12)1,12 DeCos
AR(12)1,12

917.4
919.8
925.2

1
2
3

0
2.4
7.8

1
31%
2.0%

such criteria are only guides, not rigid directives. Ultimately, the suitability of a
model to explain some physical phenomenon must depend on how compatible that
model is with what else is known and on the purpose for which the model is intended.
Nevertheless, having the AIC values for a set of models, what can one do with these
numbers?
If AICmin designates the lowest AICc value of a set of models, then the AIC
difference of model i is defined by
i AICi  AICmin ,

9:11:6

and the relative likelihood of model i given a fixed set of data is proportional to
ei =2 . Thus, the AIC provides one way to estimate the probability of one model
compared to another. Table 9.3 shows that the second-ranked model is about 31% as
probable as the first-ranked, and the third-ranked model is only about 2% as
probable as the first-ranked. The other models listed in Table 9.2 have AIC differences sufficiently large (>20) as to justify their exclusion from further consideration.

561

9.12 Electric shock!

Generally speaking, a model with


4 is a significant contender, but is highly
improbable for  10. The AIC weightings can be useful for purposes of forecasting.
Rather than retain only the single best model, an analyst may want to use several
potentially applicable models but weight the forecast with the relative probabilities
calculated from the AIC differences.
I, myself, had no design to use the models of Table 9.3 to forecast my electric
energy usage. To do so would have required that I accept the energy values provided
by the power company over the years to be accurate. In this regard, however, I was in
for a rude awakening.

9.12 Electric shock!


You know my method. It is founded upon the observation of trifles.
Sherlock Holmes
The Boscombe Valley Mystery

As a physicist whose research involves using and measuring energy in a variety of


forms, I also keep track carefully of energy usage at home. I have long ago replaced
old windows with thermal ones, attached weather stripping to doors, and connected
all computer-controlled electrical devices to surge protectors that are turned off when
the device is not in use (because the devices, themselves, are hidden drains of electric
energy even when their onoff switch is off ). There is no television in the house to
dissipate time and energy, nor is there central heating and air conditioning or largescale appliances like an electric dish washer and rubbish disposal to do the work that a
human can easily do. (I am the family dish-washer.) What appliances are in the house
(e.g. stove, refrigerator and washer-dryer) have been there for years. Electrically, it is a
simple, stationary, and frugal existence, but one that suits my family and me very well.
It is also an existence that permits straightforward monitoring of ones use of
electric energy. You may understand, therefore, the disquieting feeling that descended on me when some years ago I examined the then recent record of monthly
electric bills and noticed what seemed like a small but steady trend upward in energy
use for the previous period of at least 51 months. The small gray dots (connected by
lines to guide the eye) plotted in Figure 9.14 shows the actual energy values in kWh.
The noise coupled with seasonality make it difficult to discern a trend with certainty,
but my instinct told me that something was not as it should be.
In the analyses of time series until now, it had always been convenient, indeed
largely necessary, to transform away the mean and slope in order to see more clearly
the hidden patterns of randomness. But there are circumstances, and here is one of
them, when it is precisely the mean and slope that are of critical interest. To remove
periodicity from a time series believed to vary at precisely the period T, one can
perform an operation referred to as a moving average. As stated earlier in the
chapter, this expression refers to a markedly different procedure than the stochastic

562

The random flow of energy I


700

Energy (kWh)

600
500
400
300
200
0

10

15

20

25

30

35

40

45

50

Time (months)
Fig. 9.14 Segment of energy time series (small gray dots with connecting lines) with
superposed 12-month moving average (large gray dots) and least-squares line of regression
(dashed) showing positive trend.

processes previously designated by MA. In the present context, a moving average is a


transformation from xt to yt effected by replacing each point xt by the mean of the
T values xt, xt+1. . .xt+T1. If T is odd, then the mean represents the value of yt+ at the
midpoint 12 T  1 of the range. For example, if T = 5, then yt2 15

4
X

xtj .

j0

A minor difficulty arises when the period is an even number, as in the present case
(T = 12). The midpoint = 11/2 falls halfway between five and six time units from
the starting time t. The problem is surmounted by centering the average i.e. by first
performing a moving average of 12 units and then a second moving average of two
units. In detail, the procedure works as follows
9
11
>
1 X
>
>
yt
xtj
>
>
12 j0
>
>
>
=
11
12
1
X
X
1
1
) zt6 xt 2xt1 . . . 2xt11 xt12
yt1
xt1j
xtj >
24
>
12 j0
12 j1
>
>
>
>
>
1
>
;
zt6 yt yt1
2
9:12:1
The centered moving average zt is equivalent to transforming the original series xt by
1
a single moving average of 13 time units with weights 24
1, 2, 2 . . . 2, 2, 1.
The trace of large gray dots in Figure 9.14 shows the results of a centered 12-month
moving average on the 51-month portion of the electric energy record that disturbed
me. The trace of points does indeed trend upward as shown by a very well matched

9.12 Electric shock!

563

straight (dashed) line obtained by a least-squares analysis (given in an appendix). The


slope and intercept of the line together with associated uncertainties are respectively

Slope
1:056  0:11
kWh=month
METER #1
9:12:2
Intercept 333:15  3:12 kWh:
It should be noted that a moving average transformation is highly sensitive to the
assumed period. Had a different period been chosen for example, even so close a
period as 11 time units the transformed series would show noisy oscillations. The
fact that the transformation at T = 12 has removed virtually all the structure from
the time series apart from the upward trend indicates that there are no further
features remaining to complicate the interpretation of the trend line.
What was I living in a house where electric energy usage has either been reduced or
remained stationary for years to make of an upward trend persisting for more than
four years before statistics helped me see it? What could I possibly be doing to increase
my energy usage each month? Small though it was, the ratio of mean to standard error
was 9.6. To elementary particle physicists hunting for the Higgs boson, a signal-tonoise ratio like that would be an unambiguous mark of success (and a Nobel Prize).
I wrote the power company, asked for their advice, and eventually received a letter
informing me that my meter was tested and showed a meter accuracy of 99.546%.
Disregarding the letter-writers probable confusion of accuracy and precision,
I wondered what that cryptic number actually meant. By what standard could the
power company determine that my meter (any meter!) was accurate to three decimal
places. What instrument could they have used to measure its precision to three
decimal places? After all, this is a commercial utility company, not the old US
Bureau of Standards (renamed the National Institutes of Science and Technology).
I wrote the power company again and was given a telephone number to speak with
Dan, the electric meter engineer. Dan was very helpful, although it took a bit to
translate engineer-speak into language meaningful to a physicist. Dan explained
that the weighted accuracy of my meter was 100.36, which means it can read
100 360 kWh if exactly 100 000 kWh were consumed. The test of my meter showed
that it read 99 546 kWh when exactly 100 000 kWh were used. This is where the
number 99.546% came from. (I refrained from asking the question of how the meter
tester knew that exactly 100 000 kWh were consumed in the test.)
The figures that Dan gave me show that the ratio of the meter uncertainty to the
mean reading is
 
x
100000  99 546
0:004 54:
9:12:3
x Power e
100000
Company

However, the trend that I could detect statistically is 1.056 kWh out of a mean
monthly usage19 of 333.68 kWh, or
19

If energy increases in time (months) as E(t) = + t, then the mean energy per month is E 12 .

564

The random flow of energy I


700

Energy (kWh)

600
500
400
300
200
0

12

15

18

21

24

27

30

33

36

39

42

45

48

Time (months)
Fig. 9.15 Continuation (after first meter replacement) of energy time series (small gray dots
with connecting lines) of Figure 9.14 with superposed 12-month moving average (large gray
dots) and least squares line of regression (dashed black) up to time of second meter
replacement. Subsequent 12-month moving average is shown as a descerding dashed gray line.

 
x
1:056
0:00316,
e
x MPS 333:68

9:12:4

which is a smaller number, and therefore higher sensitivity, than what the power
company could measure. I pointed that out to Dan and asked whether the
power company ever tested their meters for long-term drift? He said he didnt know,
but would get back to me. The answer turned out to be No.
Dan put in a request for the meter at my home to be replaced. The replacement
was duly made and over the course of the next three years, while carrying on with my
other projects, I nevertheless recorded each month the energy reading on my electric
bill. A cursory inspection of Figure 9.15 might show that all was well. The small gray
dots and connecting lines again mark the monthly energy consumption. To my
satisfaction, a centered 12-month moving average (large gray dots) of the 39-month
time series could be fit to a straight (dashed black) line with a slope so flat that the
least-squares trend

Slope
0:174  0:17 kWh=month
METER #2
9:12:5
Intercept 350:95  4:14 kWh
was statistically equivalent to 0.
There was only one problem. Look closely at (9.12.5) and (9.12.2). According to
the replaced meter, my average energy consumption per month had suddenly jumped
up by close to 351 333 = 18 kWh (i.e. by ~5.4%) when, in reality, during all those
months I was still living an electrically frugal life. In the terminology of physics, the

9.13 Two scenarios: coincidence or conspiracy?

565

power company seems to have substituted a defective meter with bias for a defective
meter with drift.
I contacted the power company again. (I think they remembered me.) Without my
requesting it, a technician came to the house soon afterward and replaced the meter.
That was unusual, so I telephoned the meter department to inquire why, and was told
that the company had decided to replace all the residential electric meters in the State
with what they termed smart meters. The feature of a smart meter that made it
smart was that it transmitted meter readings at about 900 MHz to a drive-by reader
so that no power company employee had to visit the house and actually look at the
meter. That may be so, but I could not help thinking that a smart meter might also be
one programmed to generate bias and drift at levels more difficult for a statistical
physicist to detect.
The dashed gray line at the far right in Figure 9.15 shows a segment of the 12-month
moving average of the energy time series for about one year following installation of
the third the smart meter. The mean energy consumption has dropped precipitously
and the slope appears flat. At the time of writing, this is the current meter in my home.

9.13 Two scenarios: coincidence or conspiracy?


In my initial conversation with Dan the meter engineer, I inquired whether anyone
besides me had ever brought to his departments attention a meter defect that could
not be detected by the company, but which could be revealed by statistical analysis.
He said he was not aware of any such report. After the problem with the second
meter, the question passed through my mind: how likely is it for a residential customer
to receive two defective meters in succession? Two scenarios occurred to me.

Scenario I The lottery


Suppose you have entered a lottery. Your prize, if you win, is a defective electric
meter. What is the probability of winning the lottery twice? To keep matters simple,
there is only one prize and one winner.
If you have ever bought a ticket for a State lottery, you know that the probability of
winning is low. So low, in fact, that it has been said facetiously that the probability
of winning is the same whether you buy a ticket or not. That, of course, is not strictly
true. However, if the probability of someone winning is one in a million (106), then you
may be thinking that the probability of someone winning twice would be 1 in a million
million (1012). That is not correct. There are several ways to calculate the correct
probability, but the simplest is to make use of reasoning that, in a previous chapter,
served to test the power spectral amplitudes of a nuclear decay process for randomness.
 Let p be the probability of winning the lottery.
 Then the probability of the same person winning the lottery twice is p2.

566

The random flow of energy I

2
 Hence, the probability that the specific individual does not win twice is (1  p ).
2 N
 If there are N ticket purchasers, the probability that no one wins twice is (1  p ) .
 Therefore, the probability that at least one person does win twice is

P2  PWin TwicejN, p 1  1  p2 N :

9:13:1

For p 1 and N  1, the expression in (9.13.1) can be approximated by


PWin TwicejN, pe 1  eNp :
2

9:13:2

If p = 106 and N = 106, then


P2 1  1  1012 1 000 000 106 ,
which, though low, is nevertheless one million times more probable that originally
imagined. Relation (9.13.1) correctly answers the question of the probability that
someone not necessarily you wins twice.
According to the internet search I made a few years ago, the power company that
serves my home has about 1.2 million customers. From what I was able to find out at
the time, the failure rate of tested electric meters was about 0.5%. Although this is
not necessarily the probability that a customer will receive such a meter, it is the only
number I have, and so I will use it. Also, given that the power company cannot
always tell whether a meter works correctly or not, the number may well understate
the failure rate. In any event, substituting p = 0.005 and N = 1.2 106 into (9.13.1)
leads to a probability P2 ~ 1  9.4 1014.
In other words, it is virtually certain that at least one customer will receive two
defective meters entirely by chance. I suppose I may have been that customer.20

Scenario II The Department of Unearned Profit Enhancement (DUPE)


Consider the broader consequences of the small positive trend in Figure 9.14, which
shows that I was being charged each month for an additional 1.056 kWh that I most
likely did not use. The excess payment for a period of about 10 years at a rate
increasing (let us say linearly) from $0.08 to $0.12 per kWh is


1:056 kWh
0:08 0:12
$
120 months

$12:67:
9:13:3
month
2
kWh
For an individual this is perhaps not a particularly noticeable loss over so long a span
of time. What makes it worthy of interest, however, is that the power company had
20

A more recent internet search informed me that the failure rate of smart meters made by a particular company for
residential electric customers in California was 1600 out of 2 million, or 0.08%. If this failure rate applied to the
company serving the State where I live, the probability of someone getting two defective meters would be 53.6%.

9.13 Two scenarios: coincidence or conspiracy?

567

1.2 million customers. If each were charged the extra amount in (9.13.3), the company would have brought in an unearned profit of more than $15 million. Now we are
talking about real money. Moreover, if meter error and therefore overcharge were
proportional to consumption, the illicit profit would be considerably higher, since my
own monthly consumption is comparatively low.
The relative likelihood of Scenario II to Scenario I can be tested, in principle, by
examining the meters of a representative sample of the companys customers to see
whether their drift, if any, and bias, if any, are all (or mostly) positive or whether
the odds are 5050 for a meter to register an extra profit or loss for the company.
Such a test would require appropriately sensitive instrumentation (which the meter
department apparently did not have at the time and, for all I know, may not have
now) or the time and patience to conduct a statistical analysis of a lot of time series.
So here, in summary, are two possible explanations for the unexpected positive
trend in energy readings that I found.
One is the near certain probability that someone would receive two defective
meters purely by chance and I was that someone.
The other is that let us be imaginative for a moment somewhere in a subterranean level of the power companys home office building and unknown to most
employees is the Department of Unearned Profit Enhancement (DUPE) whose
assignment is to design meters that indicate excess energy consumption by amounts
so low that not even the meter engineers (Dans group) can detect it. The ruse is
potentially highly rewarding and virtually undiscoverable unless a customer keeps
careful track of energy usage.
Which is correct? You choose.

Appendices

9.14 Solution of the AR(12)1,12 master equation


By use of the identity

X
1
k

1
k0

for < 1

9:14:1

yt can be expressed as a linear sum of an infinite number of Gaussian random


variables
"
yt

"

"

1 B 12 B

k0
X
k
X

k0 j0

X
k
X

k0 j0


12 k

#
t
#

kj 12k11j
B
t
j1 12

!
kj
j1 12

9:14:2
"

t12 k11j 

#
X
k
X
k j t12 k11j
k0 j0

each of form N0, 2 multiplied by the factor


 
k
kj
j1 12
k j
j

9:14:3

defined in the third line. From the property of the Gaussian distribution (previously
demonstrated in greater generality)
cN0, 2 N0, c2 2 ,
it follows from (9.14.2) that
0 "
#2 1
X
k
X
ut N @0,
k j 2 A:
k0 j0

568

9:14:4

9.15 Maximum likelihood estimate of AR(n) parameters

569

The evaluation of the sum over j in (9.14.4) will not be given here. The fastest way to
confirm the relation in (9.8.2) is to use a symbolic mathematical application like
Maple.

9.15 Maximum likelihood estimate of AR(n) parameters


Although the problem of electric energy usage specifically involved the model
AR(12), the maximum likelihood (ML) method can be easily formalized to apply
to any order n. For a time series of sufficient length N, there is little information lost
in neglecting an N-independent contribution to the log-likelihood function and
starting the sum of residuals with index value n+1. This approximation to the true
ML method is equivalent to a conditional least-squares estimate.
The AR(n) master equation
yt

n
X

j ytj t ,

9:15:1

j1

in which it is assumed that t N0, 2 , contains n + 1 parameters to be determined.


Apart from an irrelevant constant, the log-likelihood function, under the conditional
least-squares approximation, is
2
N 
n
X
N
1 X
2
y
,
L  ln  2
y 
2
2 tn1 t j1 j tj

9:15:2

from which follows the set of n + 1 ML equations


L
0
j

j 1 . . . n

9:15:3

L
0,
2

9:15:4

which can be cast in matrix form with solution [from (9.15.3)]


^ M1 V,

9:15:5

where
Mj k

N
X
tn1

ytj ytk

Vk

N
X

yt ytk

9:15:6

tn1

and ^T ^ 1 . . . ^ n is the (transposed) vector of ML parameters. Relation (9.15.4),


with substitution of the ML parameters, yields an expression for the error (or
residual) variance

570

The random flow of energy I

b2

2
N 
n
X
1 X
^ j ytj :
yt 
N  n tn1
j1

9:15:7

In general, the inversion operation of (9.15.5) would be performed numerically by


computer.
The covariance matrix C yielding the standard errors and covariances of the
parameters is given by
C H1 ,

9:15:8

where H is a (n + 1) (n + 1) matrix with elements


N
1 X
y y
i, j 1 . . . n
2 tn1 ti tj
"
#

N
n
X
1 X
^
k ytk yti
yt 
Hn i 4
tn1
k1

Hi j
Hi n

Hn n

9:15:9

N  n
:
2 4

9.16 Akaike information criterion and log-likelihood


Under the assumption that the error term t in a model master equation is a Gaussian
random variable, the log-likelihood has the structure
L

N
N
1 X
2 ,
ln2 2  2
2
2 t1 t

9:16:1

where the actual expression for t in a particular model (such as (9.15.2)) contains the
parameters to be estimated. That the summation in (9.16.1) begins with index t = 1
does not lessen the generality, since one can set t = 0 for some initial range of t.
N
X
2t with the ML parameter b2 leads
Factoring N from (9.16.1) and identifying N1
t1

to the approximate result for the maximum value of L


L max  

N
N
ln b2  ln2 1
2
2

9:16:2

in which the second term is a constant that can be dropped from the AIC since the
models to be compared must all be based on the same data set of length N.

9.17 Line of regression to 12-month moving average


The linear least squares analysis can be found in almost any statistics textbook, and
there would be little point in reproducing the derivations here. Recorded below,

9.17 Line of regression to 12-month moving average

571

however, are the relations employed in the analysis of the trend line in Figures 9.14
and 9.15.
The sum of squares of residuals of the line to which the 12-month moving average
in Figures 9.14 and 9.15 was fit takes the form
X
Qa, b
xt  at  b2 :
9:17:1
t

Equating to zero derivatives of (9.17.1) with respect to a and b leads to the formal
matrix solution
11 0 N
1
0 N
N
X
X
X
0 1
1
tC B
xt C
B
a^
C B 7
C
B 7
7
@ AB
C B
C
C
C
BX
B
N
N
X
X
A
@ N
b^
2A @
t
t
t xt
7

9:17:2

and variance of residuals


V

N
1 X
^ 2:
xt  a^  bt
N6 7

9:17:3

The elements of the inverse coefficient matrix in (9.17.2) are readily evaluated by
computer as sums, but can also be put into closed form
N
X
1N6
7
N
X
1
t N 7N  6
2
7

9:17:4

N
X

1
t2 N  6 2N 2 15N 91
6
7

to obtain analytical expressions for the regression parameters. The time index t in the
sums begins with 7, rather than 1, because the length of the 12-month moving
average series must be shorter than the original time series by six elements.
The covariance matrix yields the following expressions for the uncertainties and
covariance of the two parameters

N
X

t2

vara
N  6

N
X
7

9:17:5
2

t  t

572

The random flow of energy I

varb

V
N
X

9:17:6
2

t  t

cov a, b

V t

9:17:7

N
1 X
N7
:
t
N6 7
2

9:17:8

N
X

t  t

where
t

10
The random flow of energy
Part II
A warning from the weather under ground

Profound study of nature is the most fertile source of mathematical


discoveries . . . We see, for example, that the same expression whose
abstract properties geometers had considered . . . determines the
laws of diffusion of heat in solid matter, and enters into all the chief
problems of the theory of probability.
Joseph Fourier1

10.1 What lies above?


No this chapter is not about terrorists. The warning is not from The Weather
Underground2 but from the weather under ground. Or, more precisely, from the
climate under ground . . . but we will come to that in due time. Before taking up what
is under, let us look at what lies above.
The Sun lies above at a distance rs of approximately 150 million km from Earth.
It is a large, nearly spherical, gravitationally controlled, thermonuclear fusion reactor
in the sky, radiating electromagnetic energy isotropically at a rate of about 4  1026
watts. To put this number in perspective, the Sun radiates more than one million
times more energy in one second than is consumed globally by all nations on Earth in
a year.3 For purposes of modeling, it is a good approximation to consider the energy
output of the Sun as black-body radiation i.e. radiant energy in thermal equilibrium
with the matter that emitted it. Characteristics of black-body radiation are particularly simple to understand because they depend only on the thermodynamic parameter temperature and not on the structural details of atoms and molecules. Quantum
statistical aspects of thermal radiation were discussed in Chapter 4.
Although the core temperature of the Sun must be about 15 million K for fusion
of hydrogen into helium to occur, the solar energy that bathes the Earth emerges

1
2
3

J. Fourier, The Analytical Theory of Heat (Cambridge University Press, 1878) 7.


http://en.wikipedia.org/wiki/Weatherman_(organization) The Weather Underground was an American radical
organization that conducted a campaign of bombings throughout the 1970s.
The global consumption of energy is about 500 exa-joules (1 EJ = 1018 J).

573

574

The random flow of energy II

from the photosphere, whose temperature of 5800 K is inferred from the Wien
displacement law
max T 2:9  106 nm  K

10:1:1

and the wavelength max  500 nm of its peak emission. Given the temperature T of a
spherical black-body of radius R, the rate of radiant emission follows from the
StefanBoltzmann law
Prad 4R2 SB T 4 ,

10:1:2

SB 5:67  108 WK4 m2

10:1:3

where

is the StefanBoltzmann constant.4 Substituting into (10.1.2) the solar radius Rs 


700 000 km and surface temperature 5800 K yields the previously cited numerical
value 4  1026 watts for total radiant power.
As a consequence of the equivalence of mass and energy expressed in Einsteins
famous equation E = mc2, the mass of the Sun is decreasing by nearly 4.4 million
tonnes each second.5 There is no need for concern, however, for it would take more
than 10 trillion years for the Sun to vanish at that steady rate of depletion. The Sun,
of course, will not vanish. In less than 10 billion years from now, internal nuclear
reactions will have ceased, and the Sun will eventually collapse to an Earth-size white
dwarf star of a million times the density of normal earthly matter. It is uncertain
whether the Earth itself would survive such a stellar transformation, but if it does, no
terrestrial organism will likely be around to complain of the cold. An interesting
question to ponder is whether sentient organisms will still be around long before then
to complain of the heat.
The atmosphere also lies above. It is a multilayered spherical shell of gases in
which transitions between layers (so-called pauses) are defined by a change in
temperature gradient.
Human activity takes place for the most part in the troposphere, the lowest 10 km
or so for which atmospheric temperature decreases with altitude (due to adiabatic
convection) up to the tropopause. Thereafter begins the stratosphere, in which
temperature rises with altitude (because of ultraviolet heating of ozone) to about
50 km. Beyond the stratopause to about 100 km is the mesosphere, technically the
coldest place on Earth (despite being up in the air), where temperatures again
decrease with altitude reaching values as low as 100  C. Following the mesopause
to about 800 km is the thermosphere, in which temperature is rising again with
altitude and can reach values as high as 1500  C, which is close to the melting point
4
5

One can derive the relation SB 2 5 k4B =15h3 c2 from quantum statistics, in which h is Plancks constant, c is the speed
of light, and kB is Boltzmanns constant.
1 tonne = 1000 kg, equivalent to about 2200 lbs. The US ton is 2000 lbs.

10.1 What lies above?

575

of stainless steel. No steel, however, would melt in this ultra-rarefied layer through
which the International Space Station orbits (at about 400 km). Temperature is a
thermodynamic or statistical concept that ordinarily loses meaning when applied to
individual particles. Beyond the thermopause, the exosphere, in which remaining
atoms and molecules move freely along ballistic trajectories, extends to the nearperfect vacuum of space.
The radiant flux of the Sun i.e. energy transported each second across a unit area
normal to flow measured at the top of the atmosphere, is close to 1361 W/m2, a
value referred to as the Solar constant S0 (although it actually varies somewhat in
time and location). What precisely is meant by top of the atmosphere depends on
the method of measurement, the earliest being by high-altitude balloon (floating
in the stratosphere at 30 km) and the more recent by satellite (e.g. at about
950 km in the thermosphere). Knowing the inverse-square law of light intensity
(for isotropic radiation) and the distance of the Sun from Earth, one can estimate a
value for S0
S0

PS
4  1026 W

1415 W=m2
4r 2S 4 1:5  1011 m2

10:1:4

that approximates (within 4%) the empirical number.


If all the incoming solar radiation (termed insolation) were absorbed by
the Earth, the equilibrium temperature of the planet would be too cold to permit the
existence of surface-dwelling water-based living organisms since all water would be
frozen. To see this, bear in mind that the Earth itself is approximately a spherical
black-body radiator of radius RE  6400 km. For temperature to be stationary, the
rate of thermal emission must equal the rate of absorption that is, Prad = Pabs where
)

1
Prad 4 R2E SB T 4E
S0 4
) TE
255 K 18:4  C:
10:1:5
4 SB
Pabs R2E S0
In equating the two processes in (10.1.5), the radius of the Earth drops out. Note,
however, that emission occurs from the full surface area of a sphere, whereas the
effective area of absorption is that of a circular disk of the same radius. Perhaps this
is obvious geometrically, since the projection of a sphere onto a plane normal to the
incoming solar rays is a great circle. It is also readily demonstrable analytically. If the
incident solar flux is represented by a vector js, the rate of energy absorption by a

Intedifferential patch of area dS with outward unit normal n is dPabs js  ndS.
gration over the hemisphere facing the Sun yields Pabs in (10.1.5), as shown in an
appendix.
Given that nearly 31% of incoming solar radiation is reflected back into space a
fraction that defines the Earths albedo and yet the planets mean surface
temperature is actually about +15  C, rather than even lower than 18.4  C, one
may be moved to inquire how such warming is possible. The answer lies within the

576

The random flow of energy II

jS

Vacuum
jn

n
n-1

jn-1

j2
jG

j1

Ground
Fig. 10.1 Schematic diagram of energy flow in a planetary atmosphere modeled by n discrete
layers. jk is the flux (upward or downward) from the kth layer. The solar flux js is directly
absorbed by the ground, which radiates flux jG.

atmosphere. The atmosphere is largely transparent to the insolation, which, for


purposes of this explanation, lies primarily in the ultraviolet (UV) spectral range.
The molecules of the air, however, absorb the infrared (IR) thermal emission from
the ground and subsequently re-radiate it in all directions. Thus, the ground is
warmed not only by direct absorption of sunlight, but also by absorption of the
thermal radiation from the atmosphere, which, itself, is warmed by the ground.
To see analytically how this plays out suppose for the sake of generality that the
atmosphere comprises n layers, and that each is thick enough to absorb the thermal
radiation it receives from the nearest layer above and below it, but not so thick as to
trap its own emission. For the first layer, there is the ground below, and for the
last layer there is vacuum above, as schematically shown in Figure 10.1. Designate by
jk (k = 1 . . . n), jS, jG, respectively, the magnitude of the thermal flux from layer k, the
Sun, and the ground, in which
js 1  S0 939:1 Wm2 :

10:1:6

At thermal equilibrium, the rates of energy input and energy output must balance
within each layer, whereupon the processes schematically shown in Figure 10.1 give
rise to the following sequence of equations
jn jS
2jn jn1
2jn1 jn jn2
2jn2 jn1 jn3
..
.
2j2 j3 j1
2j1 j2 jG

) jn1 2jS
) jn2 3jS
) jn3 4jS
) j1 njS
) jG n 1 jS :

10:1:7

577

10.2 What lies beneath?

The general solution, obtained inductively, can be expressed as


jk n  k 1 jS

k 0 . . . n

10:1:8

with the identification j0  jG .


The sought-for quantity is the temperature TE that replaces (10.1.5)

TE

jG
4 SB

14

n 1 jS
4 SB

14


)

29:8  C
15:1  C

n1
:
n 0:64

10:1:9

From relation (10.1.9) it is clear that assumption of even one layer leads to too high a
mean Earth temperature, whereas disregarding the atmosphere leads to too cold a
mean temperature. The thermodynamic principles underlying (10.1.9) are correct
although the assumption that the atmosphere absorbs all the IR emission from the
ground is not. A non-integer value 0.64 for n leads to a closer prediction of the
Earths mean temperature. The effective number of layers, defined by the empirical
relation
neff

TE
0

TE

 1,

10:1:10
0

in which TE is the observed mean temperature and T E is the predicted temperature in


absence of an atmosphere, is designated the optical thickness of the atmosphere.6
The distribution of solar energy among the solid land, liquid water, and gaseous
atmosphere of the Earth entails numerous complex processes such as reflection,
particle scattering, evaporation, condensation, and more that need not be pursued
here. All processes considered, the Earths surface takes in about 50% of the incident
solar radiation. My curiosity aroused, I decided to investigate where that energy
went, how it got there, how it varied over time, and what it could teach us.

10.2 What lies beneath?


Buried in a quiescent part of a college campus off to one side of the athletic field is a
vertical section of casing through which snakes a cable bearing a sequence of
temperature probes positioned at 10, 20, 40, 80, 160, and 240 cm beneath the surface
(give or take a cm). Each probe measures ambient subterranean temperature by
means of a thermistor a portmanteau word for thermal resistor whose resistance is a sensitive, albeit nonlinear, function of the temperature. The devices are
sturdy and versatile with an operating range of 35  C to +50  C and measurement
error (according to the manufacturer) below 0.4  C. The probes communicate with

I discuss optical thickness and processes of absorption and scattering in the atmosphere in the book, M. P. Silverman,
Waves and Grains: Reflections on Light and Learning (Princeton University Press, 1998).

578

The random flow of energy II

Temperature (oC)

30
(a)
(b)
(c)
(d)
(e)
(f)

20

10

0
0

0.5

1.5

2.5

Time (y)
Fig. 10.2 Panoramic plots (truncated to 2.5 years) of temperature variations measured hourly
during the period 20072012 by sensors at depths (in cm) of (a) 10, (b) 20, (c) 40, (d) 80, (e) 160,
and (f ) 240. The closer to the surface is the sensor location, the greater is the sensitivity to
diurnal noise. The temperature record at depth d is designated xd.

a data logger, which has recorded the six temperatures every hour on the hour since
noon of 7 June 2007.
At the time of writing this chapter, the experiment was still in progress. The
temperature histories that underlie this narrative each comprise N = 47 240 observations or about 5.4 years of collection. A panoramic sample of the data taken
from all six sensors during the first two and a half years is shown in Figure 10.2.
Labeled (a) (f ) according to depth, the temperature variations rise and fall
asynchronously in time with what is unmistakably a 12-month period although
we shall examine the frequency content more thoroughly in due course. The thin
black trace (f ), designated x240(t), comes from the deepest probe, which, at a depth of
240 cm, sat in thermal silence experiencing for the most part only the slow change of
the seasons. In contrast, (a) the noisiest gray trace x10(t), a mere 10 cm below the
surface, spent each moment in a (metaphorical) thermal rock concert, responding
nervously to each shriek and cry of the weather. With increasing depth as the
time series go from x20 to x40 to x80 to x160 the temperature fluctuations become
calmer.
Like the plot of electric energy usage in the previous chapter, the plots of underground temperature variations reveal interesting historical features in addition to
whatever scientific content they may have. Wide flat troughs of some waves disclose
long periods of apparent stasis above ground. These recall harsh New England
winters with the ground heavily laden with snow, bringing a thermal silence to the
subterranean landscape. Elsewhere in the record, shown in Figure 10.3, occur two
sharp vertical spikes hanging from the troughs of a wave like icicles. These recall
relatively brief, but intense episodes of rain in which the casing filled with water and
sensors rapidly thermalized to the same temperature.

579

10.2 What lies beneath?


40
30

x10

20
10
0
10

200

400

600

800

1000

1200

1400

1600

1800 2000

x10

30

25

20
800

801

802

803

804

805

806

807

808

200

400

600

800

1000

1200

1400

1600

809

810

20

x240

15
10
5
0
1800 2000

Time (d)
Fig. 10.3 Top panel: large-scale time variation of temperature record x10 (gray), 24-hour time
_
average (thin black) ~x 10 , and 365-day moving average (heavy black) x 10 ; the mean x10 (dashed)
is shown as horizontal baseline. Middle panel: small-scale time variation of x10 (solid) with ~x 10
(dashed) superposed. Bottom panel: large-scale time variation of temperature record x240
(gray), 24-hour time average (heavy black dashed) ~x 240 , and 365-day moving average (heavy
_
black) x 240 ; the mean x240 (dashed) is shown as baseline. At this depth (240 cm) the short-scale
record does not show diurnal variations.

Figure 10.3 shows the two records x10 (t) and x240 (t) in greater detail at both long
and short time scales. These two records are particularly interesting because the first
is the most sensitive to weather and the second to climate. Superposed over the
original hourly record (gray) in the first and third panels are the 24-hour averaged
records (black) obtained from the algorithm
24


1 X
~x d t
xd 24 t  1
24 1



N
t 1, 2, . . .
24

10:2:1

580

The random flow of energy II

for depths d = 10 and 240 cm. As seen in the second panel, which shows the time
variation of the original record x10(t) (solid trace) and averaged record ~x 10 t
(dashed trace) over the course of 10 days, the transformation (10.2.1) has removed
all diurnal variation. A corresponding plot for x240(t) and ~x 240 t is not shown
because diurnal fluctuations at a depth of 240 cm are so low that there is little
difference between the two time series (as is evident from the overlapping traces in
the third panel).
The solid black traces nearly parallel to the baselines in the first and third panels
are plots of the 365-day moving average calculated from the algorithm
_
x d t

364
1 X
~x d t

365 0




N
 365 :
t 1, 2, . . .
24

10:2:2

Because the period (365 days) is an odd number, centering is not necessary. The
apparent flatness of the two lines shows that the moving average has removed
virtually all annual variation. It may seem, then, from a preliminary examination
that if a 24-hour average and 365-day moving average7 have removed all structure
from the time series, there is nothing left to explain but that impression would be
mistaken.
We shall begin this study as we have begun previous ones by examining the
autocorrelation and power spectra to see what the eye of analysis can reveal.

10.3 Autocorrelation of underground temperature


In the previous chapters the exact expression chosen for calculating the sample
autocorrelation did not significantly affect the eventual interpretation of the data,
especially if the original time series was sufficiently long. In particular, it did not
matter much whether the numerator and denominator were each divided by the
respective number of summed elements, since the difference between the total
length N of the series and the length shortened by the lag number k was statistically unimportant. In the present situation, however, even though we are working
with a time series of 47 236 elements or a 24-hour averaged series of 1968
elements, the form of the sample autocorrelation function has significant
implications.
Since the present focus of attention is not on hourly fluctuations in the weather,
but on elucidating and utilizing the general physical law that governs energy flow
through the ground, consider the autocorrelation function of the 24-hour averaged
time series as first estimated from the usual form
7

It is worth noting, in case it is not apparent, that the two kinds of averages are structurally different. In effect, the
24-hour average replaces each suite of 24 points by one point, thereby shortening the input series by a factor of 24. In
contrast, the 365-day moving average replaces each point by a sum of 365 points, thereby shortening the input series by
a length of 365.

581

10.3 Autocorrelation of underground temperature

AC r10(k)

1
0

365

730

1095

1460

1825

Lag k (d)
Fig. 10.4 Autocorrelation r10 (k) vs lag k according to: non-normalized expression (10.3.1)
(solid black) and normalized expression (dashed black) (10.3.4). Thin gray trace shows pure
cosine function cos(2 k/365).
Mk
X



~x d t  ~
d ~x d t k  ~ d

~r d k

t1
M 
X
2
~x d t  ~ d

10:3:1

t1

in which M is the largest integer not exceeding N/24, and


d
~

M
1X
~x d t
M t1

10:3:2

is the sample mean. The plot of (10.3.1) for ~r 10 k is shown as the solid black trace in
Figure 10.4; a plot of ~r 240 k (not shown) generates a practically identical trace. The
trace can be matched nearly perfectly by an exponentially decaying cosine function
k k cos k

10:3:3

with = 2/365 and  0.9992; slight visual discrepancies (not shown in the figure)
between ~r 10 k and (10.3.3) become apparent only around lag 1460.
In contrast to (10.3.1), a plot of the sample autocorrelation (dashed trace)
~r d kcor

M
~r d k,
Mk

10:3:4

corrected to account for the number of terms in each summation, matches nearly
perfectly the superposed cosine function (thin gray trace)
kcor cos k:

10:3:5

Although the numerical difference between and 1 is very small, the two theoretical
autocorrelations differ significantly at long lags and carry different implications.

582

The random flow of energy II

We have seen in the previous chapter that in general the autocorrelation function
does not uniquely determine the temporal function that generated the time series.
Nevertheless, in the present situation the theoretical time series is uniquely determined from both mathematical and physical circumstances. The temporal function
whose autocorrelation is a pure cosine (10.3.5) is, itself, a pure cosine
x1 t cos t,

10:3:6

whereas the function whose autocorrelation closely approximates a decaying cosine


(10.3.3) and which reduces to (10.3.5) as ! 1 can be shown to be also an exponentially decaying cosine
x2 t t cos t:
The exact autocorrelation of (10.3.7), derived in an appendix, is



2ln
k k cos k  2
sin k

10:3:7

10:3:8

with close to 0 if is close to 1.


The physical implication of (10.3.7), irrespective of how closely approaches 1
from below, is that the average daily temperature described by this record is decreasing in time. Given that the thermal energy entering the ground comes from the Sun,
assumed for this investigation to be a stable, periodic source, one would expect and
Figures 10.2 and 10.3 tend to support that the temperature should oscillate but not
decay. The appropriate form of the autocorrelation, then, would be (10.3.4).

10.4 Fourier transform and power spectrum of underground temperature


In preparation for performing a Fourier transform (FT), the time series of temperatures xd (t) were first transformed to corresponding time series yd (t) of mean 0
yd t xd t  xd ,

10:4:1

where
xd

N
1X
xd t
N t1

10:4:2

differs from (10.3.2) in that it is a sum over hourly (not mean daily) observations. The
FT of (10.4.1) was calculated from relations
9
8


N
N
X
X
>
>
1
2
2
j
t
>
>
>
> ad j
1  k 0 >
yd t k 0
yd t cos
>
>
>

= 
<
N t1
N t1
N
N
j 0, 1, 2 . . .


N
>
>
2
>
>
2X
2 j t
>
>
>
>
j

y
t
sin
b
>
>
d
d
;
:
N
N
t1

10:4:3

583

10.4 Fourier transform and power spectrum of underground temperature


15

(a)
Temperature (oC)

10

(b)
(d)

(e)

0
5

(c)
10
15
0

0.25

0.5

0.75

1.25

1.5

1.75

2.25

2.5

2.75

Time (y)
Fig. 10.5 Waves at the fundamental frequency =2 Ty1 1 y1 in the Fourier analysis of
temperature time series (a) y10, (b) y40, (c) y80, (d) y160, (e) y240. Phase shifts and amplitudes
relative to y10 provide information regarding the spatial diffusion of energy through the
ground.

given earlier in the book, but repeated here to establish the notation and normalization convention used in this chapter. The symbol k0 is the familiar Kronecker delta
function. For yd (t), the amplitudes ad (0) = bd (0) = 0. Later in the chapter, it will
also be convenient to make use of the results of the FT in the equivalent form
cd j ad j2 bd j2 1=2


bd j
d j arctan
:
ad j

10:4:4

The value of the harmonic number jT corresponding to a particular period T in a time


series of length N is given by
jT N=T:

10:4:5

The time-variation of that harmonic component alone is then expressible as






2jT t
2jT t
j
yd T t ad jT cos
bd jT sin
:
10:4:6
N
N
For time series truncated to N = 3 y, the harmonic corresponding to Ty  1 y =
3
8760 h is jy = 3. In Figure 10.5 are shown the six waves yd t, which in their aesthetic
totality resemble a kind of ribbon blowing in the breeze. The six waves at fundamental Ty capture the large-scale variation of the time series in Figure 10.2 without
accompanying noise, and thereby clearly render the phase shifts and amplitude
differences among the different series. We will use this information later to study
the dynamics of energy propagation in the ground.

584

The random flow of energy II


20

y10 (oC)

10
0
10
20
0

0.5

1.5

2.5

Time (y)
Fig. 10.6 Comparison of empirical time series y10 (gray) with Fourier superposition (black) of
fundamental waves at periods of one day and one year. Rapid diurnal variation at the time
scale of the figure accounts for the thickness of the black trace.

From relation (10.4.5) it follows that the harmonic number of the component at
the period Td  1 d = 24 h is jd = 1095. The nearest sub-surface waveform




2 t
2 t
y, d
b10 3 sin 3
y10 t a10 3 cos 3
N
N



10:4:7
2 t
2 t
a10 1095 cos 1095
b10 1095 sin 1095
N
N
obtained by superposing the components corresponding to annual and diurnal
variations is plotted (black trace) in Figure 10.6. Although one cannot see directly
the rapid 24-hour oscillations at the time scale (years) of the figure, it is evident that
this deterministic process, and not just noise due to weather, contributes to the
thickness of the empirical record (gray trace).
The power spectral amplitude at any harmonic is the square magnitude of the
corresponding Fourier amplitude
Sd j ad j2 bd j2 cd j2 :

10:4:8

Figure 10.7 shows log S10 (j) (black points) obtained by a fast Fourier transform
(FFT) of the mean-adjusted series y10 truncated to 215 = 32 768 elements. (The input
to the FFT algorithm employed by my mathematics software must be an integer
power of 2.) The logarithm (to base 10) is plotted to enhance visibility of the most
striking feature, which is a series of 11 sharp peaks shown in the upper panel at
equally spaced intervals over the harmonic range 016 000. The only other statistically significant peak in the spectrum occurs at j = 4, shown in the lower panel which
surveys at higher resolution the harmonic range 080.
For a time series of 215 elements, a peak at fundamental period of 1 y or 8760 h
would occur at harmonic number 3.74. Harmonic numbers, however, must be integers, and j = 4, corresponding to 8192 h, is the closest achievable approximation.

585

10.4 Fourier transform and power spectrum of underground temperature

Log Power S10

x10 Temperature Record

2
0
2
4
6
0

10

12

14

16

Harmonic Number (1000)

Log Power S10

x10 Temperature Record

6
4
2
0
0

10

20

30

40

50

60

70

80

Harmonic Number
Fig. 10.7 Top panel: power spectrum log S10(j) (black points) as a function of harmonic
number j in the range 016 000. Peaks occur at frequencies corresponding to the diurnal
harmonic series (24 h)/n (n = 1, 2, . . .). Dashed white trace is the least-squares fit (from doublelog plot of Figure 10.9) yielding exponential decay at decay rate 10 = 2.44. Bottom panel:
details of log S10(j) for 80
j
0, showing a single peak corresponding to the fundamental
period of 1 year.

The diurnal period of 24 h corresponds to the harmonic number j = 1365, which


is the largest peak seen in the upper panel of Figure 10.7. The sequence of peaks
in the upper panel corresponds precisely to a harmonic8 series T24 (n) = 24/n with
n = 1,2,. . .11, as summarized in Table 10.1.

The word harmonic has two meanings here. Physicists refer to harmonic number as the integer multiple of a
fundamental frequency in a Fourier series. In mathematics, the harmonic series is the series of terms 1, 12 , 13 , 14 , . . ..

586

Table 10.1

The random flow of energy II

Diurnal harmonics of record x10


15

Harmonic jd

log S10 (jd)

T exp jd 2j

1365
2371
4096
5461
6830
8196
9561
10 926
12 288
13 653
15 026

3.619 24
2.076 47
1.255 50
0.434 81
0.597 78
0.962 25
1.178 79
1.619 40
1.773 12
2.133 40
2.094 70

24.006
11.999
8.000
6.000
4.798
3.998
3.428
2.999
2.667
2.400
2.181

T thy n 24
n n 1, 2 . . .
24
12
8
6
4.8
4
3.429
3
2.667
2.4
2.181

There is no reason to believe the series abruptly terminates; rather, with increasing
n the power spectral amplitudes merge with the noise. If such a sequence were to
occur for the annual period, there would be peaks at j = 4, 8, 12, 16 . . ., but only the
fundamental is apparent in the lower panel of the figure. Evidently, the process
generating the harmonic series of peaks involves the rotation of the Earth on its
axis, but not its revolution about the Sun.
Although one expects the power series S10 to reveal periodicities of one year and
one day, the question arises as to why frequencies corresponding to higher harmonics
of one day also occur. Clearly, such peaks are not noise, but arise from a deterministic process of sharp regularity. I will offer an explanation shortly, but for the present
it is of interest to explore further both the signal and the noise within the S10 and S240
power spectra.
To begin, consider how the power S10 (jd) in the diurnal peaks, recorded in
Table 10.1, relates to harmonic number jd. Figure 10.8 shows that a plot (small
circles) of log S10 (jd) against log jd generates an unambiguously straight-line pattern
d
with negative slope 10 . (The superscript d stands for day or diurnal.) The
magnitude of the slope and associated uncertainty, obtained by a linear least-squares
fit, is
d

10 5:82  0:55:

10:4:9

The implication of the plot is that the power spectral amplitudes of the process
generating the series of diurnal peaks has a power-law dependence on frequency
d

S10 d / d 10 :

10:4:10

Consider, next, the plots (gray) of log Sd (j) against log j in Figure 10.9 for depths
d = 10 cm (upper panel) and 240 cm (lower panel). These plots include the entire

587

10.4 Fourier transform and power spectrum of underground temperature


4

Log Power S10

x10 Diurnal Harmonics


2

3.1

3.2

3.3

3.4

3.5

3.6

3.7

3.8

3.9

4.1

4.2

Log Harmonic Numbers


Fig. 10.8 Plot of log S10(jd) (open circles) for just the diurnal harmonics jd, superposed by leastsquare line of regression (dashed).

Log Power S10

10

Temperature Record x10


5
0
5
0

0.5

1.5

2.5

3.5

Log Power S240

Temperature Record x240


0

5
0

0.5

1.5

2.5

3.5

Log Harmonic Number


Fig. 10.9 Loglog plots of full power spectra (gray points) log S10 (j) (top panel) and log S240 (j)
(bottom panel) superposed by least-square lines of regression (solid). The slopes of both lines
(upon removal of points associated with diurnal harmonics) are close to  1.8, indicative of
Brownian noise.

588

The random flow of energy II

power spectrum, most points of which comprise noise rather than signal (i.e.
regularly spaced peaks). Since the sensor at 240 cm is largely shielded from diurnal
events at the surface, one would not expect to find and S240 (j) does not show a series
of diurnal harmonics like those appearing in the upper panel. The double-log plots
again reveal linear patterns, suggesting a power-law dependence of spectral power on
frequency. The respective slopes, extracted by least-squares analysis, were found to be
10 2:44  0:11

10:4:11

240 1:83  0:068:

10:4:12

The corresponding lines of regression are shown as black traces in the double-log
plots of Figure 10.9 and as a dashed white trace in the single-log plot of
Figure 10.7.
To put such numbers in perspective, recall that the exponent β in a power-law
dependence S(ν) ∝ ν^{-β} is a measure of the kind of randomness displayed by the
process generating the power spectrum. The larger the value of β, the less stochastic
and more deterministic the process appears to be. We have seen that β ≈ 0 for white
noise, characteristic of the decay of radioactive nuclei and, to good approximation, of
the daily change in share price of stocks in a stock market. This is a process in which
future events are entirely unpredictable over the time interval between measurements
or observations. We have also encountered β ≈ 1.8 for a process describable by
Brownian noise, which, as an example of diffusion, is more predictable. If points
within the spectral regions of the diurnal peaks were excised from S10 in the upper
panel of Figure 10.9, the slope (10.4.11) becomes much closer to (10.4.12) of S240.
That the noise in the power spectra Sd (j) corresponds to a diffusive process is not a
coincidence, as we shall see in the next section.
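For readers who wish to reproduce this kind of analysis, the following minimal Python sketch (not the computation used in this chapter) estimates a discrete power spectrum and its noise exponent; the hourly series here is synthetic Brownian noise standing in for a measured record, and the spectral normalization is one possible convention.

    import numpy as np

    # Discrete power spectrum S(j) of an hourly record x (mean removed),
    # of the kind plotted in Figures 10.7-10.9.
    rng = np.random.default_rng(0)
    x = np.cumsum(rng.normal(size=3 * 365 * 24))      # stand-in hourly series
    x = x - x.mean()

    N = x.size
    a = np.fft.rfft(x)                                # Fourier amplitudes
    S = np.abs(a)**2 / N**2                           # one convention for spectral power
    j = np.arange(1, N // 2)                          # harmonic numbers (skip j = 0)

    # Slope of log S vs log j estimates the noise exponent (near 2 for Brownian noise)
    beta = -np.polyfit(np.log10(j), np.log10(S[1:N // 2]), 1)[0]
    print(f"spectral exponent ~ {beta:.2f}")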
Return now to the question: what is responsible for occurrence of a harmonic
series of peaks in the temperature record x10, rather than just a single peak at the
period 24 h? To be surprised at the occurrence of these other peaks is to have
assumed that, because the Earth rotates at a period of 24 h, the heating of the ground
must vary periodically in the same way. However, this correspondence does not hold
for at least two reasons. First, except for two times of the year (the autumnal and
vernal equinoxes), the duration of daylight (when the Earth takes in solar energy) is
not equal to the duration of darkness (when the Earth cools). And second, even if the
periods of daylight and darkness were equal, the processes by which the ground cools
differ from those by which the ground is heated. Thus, although the Earth may rotate
on its axis once every 24 h,9 the thermodynamic processes of heating and cooling do
not occur symmetrically over that period.
9 Although not pertinent to the present discussion, it should be noted for accuracy that the motion of the Earth about its
axis is more complicated than simple rotation. Owing to the non-spherical distribution of mass, the principal moments
of inertia are not all equal, and the axis of the Earth undergoes a low-amplitude precession (Chandler wobble), tracing
out a circle about the North Pole every 400 days. See, for example, H. Goldstein, Classical Mechanics, 2nd Edition
(Addison-Wesley, 1980) 212.


An examination of the detailed physical processes by which the Earth heats and
cools is an undertaking beyond the intention of this chapter. Rather, the objective is
to account for the observed periodic forcings in the power spectrum by a simple
empirical model that incorporates the effects of these processes phenomenologically. As noted at the beginning, the ground is heated during the daylight hours by
absorbing short-wavelength radiation from the Sun and long-wavelength radiation
from the atmosphere. After sunset, there is still an exchange of energy with the
atmosphere; ordinarily, it is the ground that cools the faster. The asymmetry of
these processes can be represented by a piecewise periodic forcing function of the
form

\[
f(t) = \begin{cases} \cos(\omega t + \varphi) & \text{if } \cos(\omega t + \varphi) \ge \varepsilon_0 \\ \varepsilon_1 \cos(\omega t + \varphi) + \varepsilon_2 & \text{otherwise} \end{cases} \qquad (10.4.13)
\]

in which ω = 2π/24 h⁻¹ is the angular frequency of Earth's rotation, φ is the phase
shift needed to establish a common time origin with the empirical time series, ε₀ is
a shift parameter that determines the fraction of daylight during the 24-h period,
and ε₁, ε₂ are parameters that characterize the after-sunset cooling processes.
The numerical values of the parameters in (10.4.13) are determined by adjusting
the power spectrum of f (t) to agree with S10 (jd). The upper panel of Figure 10.10
shows the match obtained (gray trace) for the observed harmonic series of 11 peaks
(black points). As a check, the resulting time series f (t) in the lower panel is superposed (dashed black trace) over the temporal function (gray trace) due solely to the
observed diurnal harmonics




\[
y_{10}^{d}(t) = \sum_{j_d} \left[ a_{10}(j_d) \cos\!\left(\frac{2\pi j_d t}{N}\right) + b_{10}(j_d) \sin\!\left(\frac{2\pi j_d t}{N}\right) \right]. \qquad (10.4.14)
\]

The comparison, shown for a 96 h period, is very close apart from a characteristic
distortion of (10.4.14) from a pure sinusoidal waveform, since y₁₀^d(t) is, after
all, a superposition of 11 harmonics. In a more comprehensive model, which is
not given here, the variation in the forcing f(t) due to ellipticity of the Earth's
orbit is also accounted for. The function (10.4.13) would then also contain
a harmonic contribution at the angular frequency of the Earth's revolution,
Ω = 2π/8760 h⁻¹.
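A minimal Python sketch of the reconstruction (10.4.14) is given below; it is not the code used for the figures, and the Fourier amplitudes are placeholders rather than the measured coefficients. For a three-year hourly record the 24-h period corresponds to harmonic 1095, so the diurnal harmonics fall at multiples of 1095.

    import numpy as np

    # Rebuild the diurnal contribution to x10(t) from cosine/sine amplitudes
    # a10(j_d), b10(j_d) at the diurnal harmonics alone (placeholder values).
    N = 3 * 365 * 24                                   # record length in hours
    t = np.arange(0, 96)                               # a 96 h window, as in Fig. 10.10
    jd = 1095 * np.arange(1, 12)                       # diurnal harmonics of a 3-year record
    a = 0.9 * np.arange(1, 12, dtype=float)**-2.9      # placeholder a10(j_d)
    b = 0.1 * np.arange(1, 12, dtype=float)**-2.9      # placeholder b10(j_d)

    y_d = sum(a_j * np.cos(2 * np.pi * j * t / N) + b_j * np.sin(2 * np.pi * j * t / N)
              for j, a_j, b_j in zip(jd, a, b))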

10.5 Energy diffusion: approach I deterministic


The noise in the power spectrum Sd (for any depth d, although only S10 and S240 were
shown in figures) suggests that heat energy propagates through the ground in a
diffusive process. This is a process that can be treated deterministically by means
of a differential equation of motion as well as probabilistically by a less familiar


Fig. 10.10 Top panel: comparison of S10(jd) of the diurnal harmonics (black dots) and the power spectrum of the empirical forcing function f(t) (gray) in (10.4.13). Bottom panel: comparison of the corresponding time variation of f(t) (dashed) and the contribution (10.4.14) to x10(t) due solely to the diurnal harmonics (gray).

mathematical structure known as a stochastic differential equation. Consider first the more familiar approach.

Heat energy incident upon a surface of area A and temperature T will spontaneously flow a distance dx to a surface of lower temperature T - dT, in accordance with the Second Law of Thermodynamics (Figure 10.11). At a given moment t, the mean flux of energy entering the hotter surface is j(x, t), and the mean flux leaving the cooler surface is j(x + dx, t). Under quasi-steady-state conditions (i.e. under circumstances, usually applicable, in which rapid decay of correlations destroys any influence of past behavior on the present10) the mean flux is proportional to the
temperature gradient

10 W. T. Grandy, Entropy and the Time Evolution of Macroscopic Systems (Oxford, 2008) 80.


Fig. 10.11 Schematic diagram of heat diffusion through a layer of thickness dx with temperature difference dT between the top (T) and bottom (T - dT) surfaces. j(x, t) is the flux of heat energy at horizontal level x at time t.


\[
j(x,t) = -\kappa_T\, \frac{\partial T(x,t)}{\partial x}, \qquad (10.5.1)
\]

a relation ordinarily referred to as Fick's Law (originally applied to mass transport under a concentration gradient). The proportionality coefficient κ_T is the coefficient of thermal conductivity.
During the time interval dt the amount of energy dq transported through the layer of thickness dx (by virtue of the definition of flux) is

\[
dq = -\left[\, j(x+dx,t) - j(x,t) \,\right] A\, dt \qquad (10.5.2)
\]

which, by the First Law of Thermodynamics, is also expressible as

\[
dq = c\,\rho\, A\, dx\, dT \qquad (10.5.3)
\]

in terms of the specific heat capacity c and mass density ρ of the material of the layer. Equating (10.5.2) and (10.5.3) leads (by standard limit-taking procedures of differential calculus) to the relation

\[
\frac{\partial T(x,t)}{\partial t} = -\frac{1}{c\rho}\, \frac{\partial j(x,t)}{\partial x} \qquad (10.5.4)
\]

equivalent to the conservation of energy. Upon substitution of the constitutive relation (10.5.1) for j(x, t), (10.5.4) becomes a differential equation

\[
\frac{\partial T(x,t)}{\partial t} = D\, \frac{\partial^2 T(x,t)}{\partial x^2} \qquad (10.5.5)
\]

for the variation of temperature in space and time in terms of a single parameter, the thermal diffusion constant or thermal diffusivity

\[
D = \frac{\kappa_T}{c\,\rho}. \qquad (10.5.6)
\]


To help convey the physical meaning of the four quantities in (10.5.6), it is useful to note explicitly their dimensions and units (in the MKS system). These are

\[
\begin{aligned}
\text{mass density } \rho &= \frac{\text{mass}}{\text{vol}} & &\left[\frac{\text{kg}}{\text{m}^3}\right] \\
\text{specific heat capacity } c &= \frac{\text{energy}}{\text{mass} \times \text{temperature difference}} & &\left[\frac{\text{J}}{\text{kg K}}\right] \\
\text{thermal conductivity } \kappa_T &= \frac{\text{energy/time}}{\text{length} \times \text{temperature difference}} & &\left[\frac{\text{W}}{\text{m K}}\right] \\
\text{thermal diffusivity } D &= \frac{\text{length}^2}{\text{time}} & &\left[\frac{\text{m}^2}{\text{s}}\right].
\end{aligned} \qquad (10.5.7)
\]

A differential equation of the form (10.5.5), first order in time and second order in space, is a diffusion equation, familiar to physicists. We need to solve this equation in order to make sense of the pattern of waveforms in Figure 10.2 or Figure 10.5, in particular, to understand the relative amplitudes and phase shifts (i.e. peak locations) of the records from the different sensors. Also, once the solution T(x, t) is known for temperature, then it is a simple matter of spatial differentiation to arrive at an expression for the energy flux j(x, t). This heat flow into the ground is part of the energy balance at the surface, which, together with incident solar radiation, also includes the transfer of heat between ground and atmosphere by conduction (sensible heat) and the release or absorption of heat resulting from a phase change (latent heat)

\[
\begin{pmatrix}\text{Heat flux}\\ \text{into ground}\end{pmatrix} = \begin{pmatrix}\text{Incident radiation}\\ \text{at ground surface}\end{pmatrix} - \begin{pmatrix}\text{Sensible}\\ \text{heat flux}\end{pmatrix} - \begin{pmatrix}\text{Latent}\\ \text{heat flux}\end{pmatrix}. \qquad (10.5.8)
\]
The problem of energy flows between ground and atmosphere and within
the atmosphere, as stated previously, is an undertaking outside the scope of this
chapter.
We turn next to solving the heat equation (10.5.5). To transform from a differential equation in two variables to a differential equation in one variable, express T(x, t) as a Fourier integral

\[
T(x,t) = \int_{-\infty}^{\infty} \tilde{T}(x,\omega)\, e^{i\omega t}\, d\omega = 2\,\mathrm{Re} \int_{0}^{\infty} \tilde{T}(x,\omega)\, e^{i\omega t}\, d\omega. \qquad (10.5.9)
\]

The transition from the first expression to the second requires the identity

\[
\tilde{T}(x,\omega)^{*} = \tilde{T}(x,-\omega), \qquad (10.5.10)
\]

which must hold if the temperature T(x, t) is to be a real-valued function. The utility of the second expression in (10.5.9) is that one need find a solution only for


non-negative ω. Substitution of (10.5.9) into the diffusion equation (10.5.5) leads to the relation

\[
\frac{\partial^2 \tilde{T}(x,\omega)}{\partial x^2} - \left(\frac{i\omega}{D}\right) \tilde{T}(x,\omega) = 0, \qquad (10.5.11)
\]

recognizable as the Helmholtz equation, which turns up ubiquitously in physics, particularly in the study of physical optics and acoustics. The solution for non-negative ω, readily verified by substitution into (10.5.11), has the form

\[
\tilde{T}(x,\omega) = \tilde{T}(\omega)\, \exp\!\left(-\sqrt{\frac{i\omega}{D}}\; x\right) \qquad (\omega > 0). \qquad (10.5.12)
\]

In the solution to the Helmholtz equation (or to the wave equation from which the Helmholtz equation is often derived), the factor that multiplies the spatial variable x in the argument of the complex exponential is referred to as the wave number, a factor that is real-valued for traveling waves. In (10.5.12), however, the wave number is complex, the implication of which can be seen from the use of Euler's relation11 to express the square root of the unit imaginary as a sum of real and imaginary parts

\[
\sqrt{i} = e^{i\pi/4} = \cos\frac{\pi}{4} + i \sin\frac{\pi}{4} = \frac{1}{\sqrt{2}}\,(1 + i). \qquad (10.5.13)
\]

Inserting (10.5.13) into (10.5.12) yields

\[
\tilde{T}(x,\omega) = \tilde{T}(\omega)\, e^{-\sqrt{\omega/2D}\,x}\, e^{-i\sqrt{\omega/2D}\,x} \qquad (10.5.14)
\]

which, when imported into (10.5.9), generates the solution

\[
T(x,t) = 2 \int_{0}^{\infty} \tilde{T}(\omega)\, e^{-\sqrt{\omega/2D}\,x}\, \cos\!\left(\omega t - \sqrt{\frac{\omega}{2D}}\, x\right) d\omega \qquad (10.5.15)
\]

subject to boundary conditions

\[
T(0,t) \equiv T_s(t) = 2 \int_{0}^{\infty} \tilde{T}(\omega) \cos(\omega t)\, d\omega \qquad (10.5.16)
\]
\[
T(\infty, t) \to 0.
\]

If the surface temperature T_s(t) is known, the function T̃(ω) is then derivable by an inverse Fourier transform

\[
\tilde{T}(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} T_s(t) \cos(\omega t)\, dt. \qquad (10.5.17)
\]

11 e^{iθ} = cos θ + i sin θ.


Alternatively, T̃(ω) itself may be known from measurement or theory. Two examples of particular utility are the following.

(A) Forcings at a set of different fundamental frequencies (e.g. 24 hours and 365 days)

\[
\tilde{T}(\omega) = \sum_{n} \Theta_n\, \delta(\omega - \omega_n)\, e^{i\varphi_n} \qquad (10.5.18)
\]

leading to

\[
T(x,t) = \Theta_0 + \sum_{n} \Theta_n\, e^{-\sqrt{\omega_n/2D}\,x}\, \cos\!\left(\omega_n t - \sqrt{\frac{\omega_n}{2D}}\, x + \varphi_n\right). \qquad (10.5.19)
\]

(B) Forcings at a set of harmonics nω₀ of a single fundamental frequency ω₀

\[
\tilde{T}(\omega) = \sum_{n} \Theta_n\, \delta(\omega - n\omega_0)\, e^{i\varphi_n} \qquad (10.5.20)
\]

leading to

\[
T(x,t) = \Theta_0 + \sum_{n} \Theta_n\, e^{-\sqrt{n\omega_0/2D}\,x}\, \cos\!\left(n\omega_0 t - \sqrt{\frac{n\omega_0}{2D}}\, x + \varphi_n\right). \qquad (10.5.21)
\]

The solutions obtained above describe temperatures that oscillate in time and decay
exponentially with depth. We will apply these wavelike solutions to the subterranean
temperature series shortly, but first it is instructive to examine an approach to the
diffusion equation that is obtained in a different way, has a different form, and
different interpretation.
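As an illustration only, the damped travelling-wave form of (10.5.19) for a single annual forcing can be evaluated numerically as in the following Python sketch; the amplitude, phase, and diffusivity are placeholders, not the fitted values.

    import numpy as np

    # Damped, phase-delayed annual temperature wave of the form (10.5.19).
    D     = 4.86e-7                      # thermal diffusivity (m^2/s), cf. (10.7.6)
    T_y   = 365.25 * 86400.0             # annual period (s)
    omega = 2 * np.pi / T_y
    Theta, phi = 11.6, 0.86              # placeholder amplitude (deg C) and phase

    k = np.sqrt(omega / (2 * D))         # attenuation constant sqrt(omega/2D)

    def T(x, t):
        """Oscillatory part of the temperature at depth x (m) and time t (s)."""
        return Theta * np.exp(-k * x) * np.cos(omega * t - k * x + phi)

    t = np.linspace(0, T_y, 1000)
    print(T(0.0, t).max(), T(2.3, t).max())   # attenuation over a 2.3 m depth interval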

10.6 Energy diffusion: approach II stochastic


When I was an undergraduate student, one of my mathematics professors once made
an off-hand remark about pathological functions, an example of which was a curve
that was everywhere continuous, but nowhere differentiable. Such a function had no
gap and no slope at any point. At the time, I could not imagine what a pathological
curve could look like or why mathematicians would bother with them. Much later,
I was to realize that, far from being of academic interest only, these functions
represent a widely occurring physical process: diffusion.
We encountered diffusion in Chapter 6 in the study of stock prices as a random
walk. At any instant the different directions a random walker can take are determined stochastically. The path followed by a random walker is continuous, but
because of the fluctuations in direction, there is no meaningful limit to the ratio of
displacement to time interval. In other words, a random walker has a trajectory, but
no velocity (or speed). There is a statistical sense, however, to the time required to


achieve a certain mean-square displacement. It should not be unexpected, therefore, to find that the parameter characterizing the rate of a diffusive process has dimensions of length²/time, as in the thermal diffusion coefficient D of (10.5.7), and not length/time.
An approach to the study of diffusion that generalizes to continuous time evolution the discrete time steps of an autoregressive process is the stochastic differential equation. The simplest such equation, referred to as a Wiener process, has the form

\[
X(t+dt) - X(t) = \sqrt{2D\,dt}\; N_{t}^{t+dt}(0,1) \qquad (10.6.1)
\]

in which X(t) is a random variable and N_t^{t+dt}(0,1) is a standard normal variate of mean 0 and variance 1 associated with the time interval (t, t + dt). The significance of the time interval is that N_t^{t+dt}(0,1) is independent of (and therefore uncorrelated with) any other normal variate falling outside that interval.
Compare (10.6.1) to the (by now familiar) finite difference equation for an AR(1) process

\[
x_t - \phi_1 x_{t-1} = \varepsilon_t \sim N_t(0,\sigma^2) = \sigma\, N_t(0,1) \qquad (10.6.2)
\]

in which φ₁ = 1 for Brownian noise, and the error term ε_t is explicitly shown to be a normal variate of mean 0 and variance σ² associated with the time t. Thus ε_t and ε_{t+1} are independent and uncorrelated. The two equations, (10.6.1) and (10.6.2), have superficially similar forms if one associates 2D dt of the stochastic differential equation with the variance σ² of residuals of the finite difference equation. However, it is √dt, not dt, that enters the stochastic equation. Since the square root of a differential quantity is not usually encountered in elementary calculus, one may well wonder how to work with it.
In contrast to the diffusion equation (10.5.5) whose solution is a function, the
solution to the Wiener process (10.6.1) is a random variable or, equivalently, a
probability distribution. The equation is solvable by a method very similar to the
one employed in Chapter 6 to solve the AR(1) process (see Eq. (6.6.5)). To simplify
notation in (10.6.1), define

\[
\varepsilon_t \equiv \sqrt{2D\,dt}\; N_{t}^{t+dt}(0,1) \qquad (10.6.3)
\]

and examine the values of X(t) iteratively, starting with t = 0:

\[
\begin{aligned}
t &= 0: & X(dt) &= X(0) + \varepsilon_0 \\
t &= dt: & X(2\,dt) &= X(dt) + \varepsilon_{dt} = X(0) + \varepsilon_0 + \varepsilon_{dt} \\
t &= 2\,dt: & X(3\,dt) &= X(2\,dt) + \varepsilon_{2dt} = X(0) + \varepsilon_0 + \varepsilon_{dt} + \varepsilon_{2dt} \\
&\;\;\vdots \\
t &= n\,dt: & X((n+1)\,dt) &= X(n\,dt) + \varepsilon_{ndt} = X(0) + \sum_{k=0}^{n} \varepsilon_{k\,dt}. \qquad (10.6.4)
\end{aligned}
\]


The sum of variates in the last line of (10.6.4) reduces to a single normal variate

\[
\sum_{k=0}^{n} \varepsilon_{k\,dt} = \sqrt{2D\,dt}\,\left[ N_{0}^{dt}(0,1) + N_{dt}^{2dt}(0,1) + \cdots + N_{n\,dt}^{(n+1)dt}(0,1) \right] = N_{0}^{t}(0,\, 2Dt) \qquad (10.6.5)
\]

of variance 2Dt at the finite time t = \lim_{dt \to 0,\, n \to \infty} n\,dt, and the solution to (10.6.1) is therefore

\[
X(t) - X(0) = N_{0}^{t}(0,\, 2Dt). \qquad (10.6.6)
\]

As shown in Chapter 6, the normal probability density function (pdf) associated with the distribution (10.6.6)

\[
p_X(x,t) = \frac{1}{\sqrt{4\pi D t}}\, e^{-(x-x_0)^2/4Dt} \qquad (10.6.7)
\]

satisfies the diffusion equation

\[
\frac{\partial p(x,t)}{\partial t} = D\, \frac{\partial^2 p(x,t)}{\partial x^2}. \qquad (10.6.8)
\]
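A minimal simulation of the Wiener process (10.6.1), not taken from this chapter, illustrates the result (10.6.6): the variance of X(t) - X(0) grows as 2Dt.

    import numpy as np

    # Iterate (10.6.1) over many independent realizations and compare the
    # sample variance of X(t) - X(0) with the theoretical value 2*D*t.
    rng = np.random.default_rng(42)
    D, dt, n_steps, n_paths = 0.5, 1e-3, 2000, 5000

    steps = np.sqrt(2 * D * dt) * rng.normal(size=(n_paths, n_steps))
    X = np.cumsum(steps, axis=1)            # X(t) - X(0) for each realization

    t = n_steps * dt
    print(X[:, -1].var(), 2 * D * t)        # the two numbers should nearly agree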

There is a close connection between the stochastic approach to diffusion and the
deterministic approach of the preceding section, which may be seen by solving the
thermal diffusion equation (10.5.5) in a different way. Instead of a Fourier transform
(10.5.9) with respect to the time variable (i.e. integration over angular frequency),
express the solution T (x, t) as a Fourier transform with respect to spatial coordinate
(i.e. integration over wave number)

\[
T(x,t) = \int_{-\infty}^{\infty} \tilde{T}(k)\, e^{i k x - i\omega(k) t}\, dk. \qquad (10.6.9)
\]

The dispersion relation, i.e. the explicit k-dependence of ω(k), will be found in the course of the analysis. For this method of solution, the initial condition

\[
T(x,0) \equiv T_i(x) = \int_{-\infty}^{\infty} \tilde{T}(k)\, e^{i k x}\, dk, \qquad (10.6.10)
\]

rather than the surface boundary condition T_s(t), is required. Substitution of the inverse transform of (10.6.10)

\[
\tilde{T}(k) = \frac{1}{2\pi} \int_{-\infty}^{\infty} T_i(x')\, e^{-i k x'}\, dx', \qquad (10.6.11)
\]


into (10.6.9) leads to the double integral

\[
T(x,t) = \frac{1}{2\pi} \iint T_i(x')\, e^{i k (x - x') - i\omega(k) t}\, dk\, dx'. \qquad (10.6.12)
\]

Operating on (10.6.12) with the time and spatial derivatives of the diffusion equation (10.5.5) yields an equation

\[
\frac{1}{2\pi} \iint T_i(x') \left[ D k^2 - i\omega \right] e^{i k (x - x') - i\omega(k) t}\, dk\, dx' = 0 \qquad (10.6.13)
\]

that can be satisfied irrespective of initial condition only if

\[
\omega(k) = -i D k^2, \qquad (10.6.14)
\]

which furnishes the requisite dispersion relation. To this point, then, the sought-for solution takes the form

\[
T(x,t) = \frac{1}{2\pi} \iint T_i(x')\, e^{i k (x - x') - D k^2 t}\, dk\, dx'. \qquad (10.6.15)
\]

Integration over wave number in (10.6.15), by completing the square in the exponent, is the final step to obtaining a solution

\[
T(x,t) = \int_{-\infty}^{\infty} T_i(x')\, \frac{e^{-(x - x')^2/4Dt}}{\sqrt{4\pi D t}}\, dx' \qquad (10.6.16)
\]

in which the space-time variation of temperature is representable as the outward diffusion from an initial location by a Gaussian random walk, in analogy to the diffusion of particles by Brownian motion. The kernel of the integral (10.6.16) is the same probability density (10.6.7) that emerged from solving the stochastic differential equation for a Wiener process.
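As a numerical illustration (not part of the original analysis), the Gaussian kernel of (10.6.16) can be applied by direct quadrature to an arbitrary initial profile.

    import numpy as np

    # Propagate an initial temperature profile T_i(x) forward in time by
    # discretizing the integral in (10.6.16).
    D  = 4.86e-7                                   # thermal diffusivity (m^2/s)
    x  = np.linspace(-2.0, 2.0, 401)               # spatial grid (m)
    dx = x[1] - x[0]
    T_i = np.exp(-x**2 / 0.01)                     # an arbitrary initial profile

    def T(xq, t):
        """Temperature at position xq (m) and time t (s) from (10.6.16)."""
        kernel = np.exp(-(xq - x)**2 / (4 * D * t)) / np.sqrt(4 * np.pi * D * t)
        return np.sum(T_i * kernel) * dx           # discretized integral over x'

    print(T(0.0, 86400.0))                         # profile centre after one day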

10.7 Interpreting the waveforms


In the experimental investigation of underground temperature variation described in
this chapter, measurements were accumulated as a function of time at fixed depths,
rather than as a function of depth at fixed times. Thus, the oscillatory solution
(10.5.15) and its special cases provide the more useful relations for theoretical analysis. Among the objectives of the study, apart from testing the temperature diffusion
model, is to determine a numerical value for the thermal diffusivity D. This is an


important quantity because of the seminal contribution of thermal diffusion to the net energy balance at the Earth's surface, as indicated schematically in (10.5.8).

Re-examining Figure 10.5 in the light of the solution (10.5.19) applied to the single (annual) fundamental frequency ω = 2π/T_y = 7.1726 × 10⁻⁴ h⁻¹

\[
T(x,t) = \Theta_1\, e^{-\sqrt{\omega/2D}\,x}\, \cos\!\left(\omega t - \sqrt{\frac{\omega}{2D}}\, x + \varphi_1\right) \qquad (10.7.1)
\]
suggests that there are two independent ways to obtain D from the time series of
temperatures at depth d relative to the measurements made at depth 10 cm: (a) by
phase shift and (b) by amplitude attenuation.

10.7.1 Method of phase shift


A harmonic function like cos(ωt - kx + φ) attains its maximum value of 1 when its argument satisfies ωt - kx + φ = 0. If the next maximum occurs at coordinates x + Δx and t + Δt, then these coordinates must also satisfy ω(t + Δt) - k(x + Δx) + φ = 0. By subtracting the first equation from the second, one obtains a relation

\[
\frac{\Delta x}{\Delta t} = \frac{\omega}{k} \qquad (10.7.2)
\]

equivalent in form to phase velocity. Replacement of wave number by k = √(ω/2D), followed by some algebraic manipulations, leads to the phase-based value for D

\[
D_p = \frac{T_y}{4\pi} \left( \frac{x_2 - x_1}{t_2 - t_1} \right)^{2} \qquad (10.7.3)
\]

in terms of the time interval between temperature peaks recorded by two sensors at different depths. The result (10.7.3) does not depend on the actual temperatures.

10.7.2 Method of amplitude attenuation


The ratio of peak temperatures recorded by two sensors at different depths is seen from (10.7.1) to be

\[
\frac{T_{\max}(x_1)}{T_{\max}(x_2)} = e^{\,k (x_2 - x_1)} \qquad (10.7.4)
\]

where the attenuation parameter is again k = √(ω/2D). Inversion of (10.7.4) leads to the amplitude-based value for D

\[
D_a = \frac{\pi\,(x_2 - x_1)^2}{T_y \left[\, \ln T_{\max}(x_1) - \ln T_{\max}(x_2) \,\right]^{2}}. \qquad (10.7.5)
\]
Table 10.2 summarizes the estimates obtained for D by both methods in which
coordinates (x1, t1) in relations (10.7.3) and (10.7.5) pertain to the temperature sensor


Table 10.2  Thermal diffusion constant D of soil

Depth (cm)   Phase method D_p (m²/s)    Amplitude method D_a (m²/s)
10           ......                     ......
20           4.1877 × 10⁻⁷              4.5694 × 10⁻⁷
40           4.3569 × 10⁻⁷              4.7359 × 10⁻⁷
80           3.3906 × 10⁻⁷              3.6500 × 10⁻⁷
160          4.7672 × 10⁻⁷              5.3189 × 10⁻⁷
240          6.4114 × 10⁻⁷              7.2372 × 10⁻⁷
Mean         D̄_p = 4.623 × 10⁻⁷         D̄_a = 5.102 × 10⁻⁷
Av           D̄ = (4.86 ± 0.49) × 10⁻⁷

Table 10.3  Thermal properties of soil components12

Component             Mass density ρ (kg/m³)   Specific heat c (J/kg·K)   Thermal conductivity κ_T (W/m·K)   Diffusion coefficient D (m²/s)
Soil minerals         2650                     870                        2.5                                1.1 × 10⁻⁶
Soil organic matter   1300                     1920                       0.25                               1.0 × 10⁻⁷
Water                 1000                     4180                       0.56                               1.34 × 10⁻⁷

at depth 10 cm, and (x2, t2) refer in sequence to the records from the other sensors. Numerical precision to four decimal places was retained for calculation, but the final result

\[
D = (4.86 \pm 0.49) \times 10^{-7}\ \text{m}^2/\text{s} \qquad (10.7.6)
\]

was rounded to a precision commensurate with the original data.
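A minimal Python sketch of the two estimates, not the computation used for Table 10.2 and with placeholder peak times and amplitudes rather than the measured values, might read:

    import numpy as np

    # Phase-shift (10.7.3) and amplitude-attenuation (10.7.5) estimates of D
    # from the annual wave observed at two depths.
    T_y = 365.25 * 86400.0                  # annual period (s)

    x1, x2 = 0.10, 2.40                     # sensor depths (m)
    t1, t2 = 0.0, 60.0 * 86400.0            # placeholder times of annual peak (s)
    A1, A2 = 11.5, 4.1                      # placeholder peak amplitudes (deg C)

    D_phase = (T_y / (4 * np.pi)) * ((x2 - x1) / (t2 - t1))**2
    D_ampl  = np.pi * (x2 - x1)**2 / (T_y * (np.log(A1) - np.log(A2))**2)
    print(f"D_p = {D_phase:.3e} m^2/s,  D_a = {D_ampl:.3e} m^2/s")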


After calculating the numerical value of some physical quantity, the first question
that should come to mind is whether the outcome is reasonable or not. To check the
reasonableness of (10.7.6), I looked up the thermal properties of soil minerals, soil
organic matter, and water (which is a variable, but important, component of soil).
The data are summarized in columns 24 of Table 10.3 with the diffusion coefficients
that I deduced from (10.5.6) in the fifth column. Since the averaged mean thermal
diffusion coefficient (10.7.6) deduced from the underground temperature data falls
about midway between the highest value (for soil minerals) and lowest values (for soil
organic matter and water), there is no reason to think that the estimated value D is
unreasonable.

12 G. S. Campbell and J. M. Norman, An Introduction to Environmental Biophysics, 2nd Edition (Springer, 1998) 118.


A more critical test, not only of the numerical value of D, but of the diffusion
model for solar energy transport into the ground, is the prediction of the temperature
record xd (t) at depth d from the record x10 (t) of the 10-cm sensor. The theoretical
temperature function, based on special cases (10.5.19) and (10.5.21), can be written as

\[
\begin{aligned}
T(x,t) = {} & \Theta_1 \exp\!\left(-\sqrt{\tfrac{\pi}{D T_y}}\,(x - 10)\right) \cos\!\left(\tfrac{2\pi t}{T_y} - \sqrt{\tfrac{\pi}{D T_y}}\,(x - 10) + \varphi_1\right) \\
{} + {} & \Theta_2 \exp\!\left(-\sqrt{\tfrac{\pi}{D T_d}}\,(x - 10)\right) \cos\!\left(\tfrac{2\pi t}{T_d} - \sqrt{\tfrac{\pi}{D T_d}}\,(x - 10) + \varphi_2\right) \\
{} + {} & \Theta_3 \exp\!\left(-\sqrt{\tfrac{2\pi}{D T_d}}\,(x - 10)\right) \cos\!\left(\tfrac{4\pi t}{T_d} - \sqrt{\tfrac{2\pi}{D T_d}}\,(x - 10) + \varphi_3\right), \qquad (10.7.7)
\end{aligned}
\]

where I have included contributions at the fundamental periods of 365 days and 24 hours and the first diurnal harmonic at 12 hours. The parameters (amplitudes and phases)

\[
\begin{aligned}
\Theta_1 &= c_{10}(3) = 11.562 & \varphi_1 &= 0.864 \\
\Theta_2 &= c_{10}(1095) = 0.900 & \varphi_2 &= 0.594 \\
\Theta_3 &= c_{10}(2190) = 0.162 & \varphi_3 &= 0.981
\end{aligned} \qquad (10.7.8)
\]

utilized in (10.7.7) come from the Fourier analysis of the mean-adjusted three-year temperature record y10(t).
How well the theory predicts the record y240 (t) based on the parameters of the
record y10 (t) is shown in Figure 10.12. Noisy and sharp gray traces respectively
depict the empirical y10 (t) and y240 (t) temperature records; solid and dashed black

Fig. 10.12 Empirical temperature records y10 (noisy gray) and y240 (sharp gray) and temperature records y10 (solid black) and y240 (dashed black) predicted from the one-dimensional diffusion equation.


traces respectively depict the theoretical y10 (t) and y240 (t) waveforms. (Note again
that the apparent thickness of the theoretical y10 (t) waveform is due to the diurnal
oscillations, which are absent from the y240 (t) record.) The excellent agreement
between observation and theory in amplitude and phase suggests that the
deterministic features of the temperature records are well accounted for by the
one-dimensional diffusion model.
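For illustration only, the prediction (10.7.7) can be evaluated as in the following Python sketch, which uses the quoted parameters (10.7.8) and the diffusivity (10.7.6); the unit conversions are my own choice, not taken from the original analysis.

    import numpy as np

    # Predict the mean-adjusted record at depth x (cm) from the 10-cm Fourier
    # parameters via (10.7.7), with the amplitudes and phases of (10.7.8).
    D   = 4.86e-7 * 1e4          # diffusivity converted to cm^2/s for depth in cm
    T_y = 365.0 * 86400.0        # annual period (s)
    T_d = 86400.0                # diurnal period (s)

    terms = [(11.562, T_y,     0.864),   # annual
             (0.900,  T_d,     0.594),   # diurnal
             (0.162,  T_d / 2, 0.981)]   # first diurnal harmonic (12 h)

    def T_pred(x_cm, t_s):
        total = 0.0
        for theta, period, phi in terms:
            k = np.sqrt(np.pi / (D * period))          # attenuation constant (cm^-1)
            total += theta * np.exp(-k * (x_cm - 10)) * np.cos(
                2 * np.pi * t_s / period - k * (x_cm - 10) + phi)
        return total

    t = np.arange(0, 3 * 365) * 86400.0
    y240 = T_pred(240, t)        # predicted mean-adjusted record at 240 cm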
There is a further test to which the theory can be subjected which brings out, as a
byproduct, a feature of the environment that may be surprising to those who work
mainly indoors (like I do) and do not pay much attention to the energy flow outside.
Having determined the soil diffusivity D, we can estimate the delay time between
peak temperature at a depth of 10 cm and at the surface. At the interface between the
ground and the air, it is the ground that directly absorbs solar radiation and subsequently heats the layer of air close to the ground, but the delay time is much shorter
than the time required for heat to conduct through the soil to a depth of 10 cm. It is
of interest, therefore, to compare empirical and predicted lag times between peak air
temperature at the surface and peak ground temperature at a depth of 10 cm.
Figure 10.13 shows the variation in time of the surface temperature (gray trace)
during 10 days in July 2007 and the corresponding subterranean temperature record
x10 (t) (black trace). As expected, peaks in the surface temperature occur before peaks
underground, the mean lag time being estimated at
\[
\Delta t_{s10}^{\,\mathrm{obs}} = 3.4 \pm 0.6\ \text{h}. \qquad (10.7.9)
\]

Fig. 10.13 Variation in temperature with time at the surface air (gray) and at a depth below ground of 10 cm (x10, black) for the period 13-31 July 2007. Mean lag time is approximately 3½ h. Surface temperatures peak at close to 14:00 h local time.


where s in the subscript stands for surface. The time delay can be predicted from relation (10.7.2), in which Δx = 0.1 m, k = √(ω/2D), and ω = 2π/T_d is the diurnal (rather than annual) angular frequency. The result

\[
\Delta t_{s10}^{\,\mathrm{thy}} = \frac{\Delta x}{2} \sqrt{\frac{T_d}{\pi D}} = 3.5 \pm 0.2\ \text{h}, \qquad (10.7.10)
\]
with uncertainty due primarily to the uncertainty in the value of D from (10.7.6), is
statistically equivalent to the observed delay.
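The arithmetic behind (10.7.10) is compactly illustrated by the following short sketch (not the original computation):

    import numpy as np

    # Predicted surface-to-10-cm lag for the diurnal wave, per (10.7.10).
    D, dx, T_d = 4.86e-7, 0.10, 86400.0            # m^2/s, m, s
    lag = 0.5 * dx * np.sqrt(T_d / (np.pi * D))    # seconds
    print(f"predicted lag = {lag / 3600:.1f} h")   # compare 3.5 +/- 0.2 h in (10.7.10)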
The time axis of Figure 10.13 shows the number of hours from the origin, but not
the absolute time of day, which of course was recorded together with the temperature. Had the latter been shown, the reader would have seen that the peaks in surface
temperature each day fell approximately at 16:00 h or 4 p.m. local time, a fact that
apparently mystifies many people, to judge from internet inquiries and replies from
climate laboratories. The temperature and time observations in Figure 10.13 were
recorded accurately. Although the most intense solar radiation occurs close to local
noon, the lag time for the surface to reach peak temperature is such that the warmest
part of a summer day occurs in late afternoon.13

10.8 Climate implications


I see the bad moon arisin'
I see trouble on the way
I see earthquakes and lightnin'
I see bad times today . . .
I hear hurricanes a-blowin' . . .
I fear rivers overflowin' . . .
Looks like we're in for nasty weather . . .
Creedence Clearwater Revival14

Serendipity is the circumstance of discovering something one is not in quest of. For
example, in the investigation of the previous chapter I did not set out to test whether
the electric meter at my home was accurate and then discover that the power
company could conceivably make a fortune by distributing meters with slow drift.
Rather, I was interested at the outset only in finding a simple statistical model that
described residential electric energy usage. A similar statement could be made for the
research in the present chapter as well. I undertook the project with no agenda other
than to see how temperature varied at different depths below ground and find a
13 See, for example, (1) The Diurnal Cycle by the National Climatic Data Center of the National Oceanic and Atmospheric Administration, http://www.ncdc.noaa.gov/paleo/ctl/clisci0.html; and (2) P. C. Knappenberger, P. J. Michaels, and P. D. Schwartzman, Observed changes in the diurnal temperature and dewpoint cycles across the United States, Geophysical Research Letters 23 (1996) 2637-2640.
14 Lines from the song Bad Moon Rising.



Fig. 10.14 The 365-day moving average (solid) of the mean daily x240 temperature and the least-squares line of regression (dashed) with slope of 0.28 ± 0.0032 °C/y.

simple way to model that behavior mathematically. However, it has been my experience over many years that researches undertaken purely out of curiosity (. . . in contrast to the serious work for which one has to spend hours writing grant proposals . . .) often lead to unexpected and significant outcomes.
Let us return to the third panel of Figure 10.3, a plot of the temperature (x240)
recorded at 240 cm below ground at which depth the vicissitudes of the weather are
nearly entirely damped out. Besides the features of the plot described previously,
there is a long, flat, horizontal black line reclining below the dashed baseline like a
snake sleeping under a log. This line is the 365-day moving average (MA-x240) of the
daily-averaged hourly x240 temperature record. As mentioned previously, the flatness
of the record after removal of 24-hour and 365-day periodicities indicates that all
significant time variations in the x240 time series have been accounted for. Nevertheless, far from being of little interest, the MA-x240 time series has its own story to tell.
Figure 10.14 presents the MA-x240 record at higher resolution (black trace)
together with the least-squares line of regression (dashed). The time interval covered
is approximately 5.4 years (shorter by 365 days on the graph because of the moving
average) from 2007 to 2012, during which the averaged temperature has been
increasing at a rate of approximately
\[
0.28 \pm 0.0032\ {}^{\circ}\text{C/y}, \qquad (10.8.1)
\]

determined from the slope of the line of regression. To put this number in perspective, there are several comparisons one can make.
Numerous studies employing different measurement techniques have documented
the variation in the mean temperature of the Earth's surface. Although at one time a
highly contentious issue (primarily among policy makers rather than among scientists who


actually performed such studies) the fact that the planet is warming anomalously is
now beyond dispute.15 The precise rate of change may vary somewhat depending on
the source of the information, but most numbers I have seen are close to the figure

\[
\begin{pmatrix}\text{Global mean annual temperature}\\ \text{rise over the past 30 years}\end{pmatrix} \approx 0.02\ {}^{\circ}\text{C/y} \qquad (10.8.2)
\]
reported by the US National Research Council.16 There is no error in decimal point;
the rate (10.8.1) is about 14 times the rate (10.8.2).
The objective in measuring the mean global temperature rise is to have a background figure that is largely independent of the wide variation in local heat distribution. As such, methodology is sought that specifically avoids the heat-island effect,
the phenomenon that urban areas are ordinarily hotter than rural ones by virtue of
fewer trees to provide shade and capture moisture, more asphalt and cement to
absorb heat and produce run-off rather than facilitate ground absorption of precipitation, among other reasons. However, were it simply a matter of a more or less fixed
temperature difference, the heat-island effect would not affect the rate of temperature
increase.
Perhaps a more compatible comparison would be regional, rather than global.
According to a report17 of the Union of Concerned Scientists, the mean temperature
increase for the US Northeast is
\[
\begin{pmatrix}\text{US Northeast annual}\\ \text{temperature rise } 1970\text{-}2002\end{pmatrix} \approx 0.028\ {}^{\circ}\text{C/y}. \qquad (10.8.3)
\]
There is still no decimal point error; the rate (10.8.1) is ten times the rate (10.8.3).
Since my analysis of the underground temperature data covered a relatively short
time span of half a decade, I sought to determine whether a longer history of
temperature measurements would also reveal so striking an anomaly. The upper
panel of Figure 10.15 shows the original time series of above-ground measurements
of the average July temperature for the City of Hartford, in which the college campus
is located, spanning five decades from about 1960 to 2012.18 The data were collected
at the Hartford-Brainard Airport, a distance of about 6 km from where my underground temperature measurements were made. In the lower panel is the time series
resulting from a 10-year moving average, performed to suppress fluctuations and
obtain a graph for comparison with decadal figures reported in the literature.
15 References in the scientific literature and popular news media are legion, but the following news article captures the basic sentiment: J. Gillis, Global Temperatures Highest in 4,000 Years, New York Times (7 March 2013), http://www.nytimes.com/2013/03/08/science/earth/global-temperatures-highest-in-4000-years-study-says.html. The basis of the news article is the report: S. A. Marcott, J. D. Shakun, P. U. Clark, and A. C. Mix, A reconstruction of regional and global temperature for the past 11,300 years, Science 339 (8 March 2013) 1198-1201; available online at http://www.sciencemag.org/content/339/6124/1198.abstract.
16 America's Climate Choices, Committee on America's Climate Choices, National Research Council (National Academy Press, 2011) 15. The figure reported is 0.6 °C over 30 years.
17 Union of Concerned Scientists, Climate Change in the U.S. Northeast (UCS Publications, October 2006) 10. The reported figure is 0.14 °F per decade.
18 http://weatherwarehouse.com/WeatherHistory/PastWeatherData_HartfordBrainardField_Hartford_CT_July.html


Fig. 10.15 Top panel: variation in mean July temperature of Hartford, Connecticut (US) from 1960 to 2012. Bottom panel: 10-year moving average (black points) and least-squares line of regression (solid) with slope 0.028 ± 0.0026 °C/y. (Lines connecting points serve only to guide the eye.)

Superposed on the moving average is the least-squares line of regression, the slope of
which again yields a rate
\[
\begin{pmatrix}\text{City of Hartford annual}\\ \text{temperature rise } 1960\text{-}2012\end{pmatrix} \approx 0.028 \pm 0.0026\ {}^{\circ}\text{C/y} \qquad (10.8.4)
\]

virtually identical to the regional rate (10.8.3). Clearly, then, the much higher rate of
temperature rise is a recent phenomenon.
The anomalously high rate of increase in the Hartford temperature immediately
raises the question of just how anomalous this rate is in particular, whether it
applies to other cities. Hartford is a medium-size city with a population (as of 2012)
of 124 893. As a final comparison, I turned my attention to nearby New York City
(NYC), the largest metropolitan area in the USA by population estimated (as of
2012) to be 8 336 697.
Above-ground temperature records for NYC, measured at a station in Central Park, Manhattan, are available online for the period 1900-2012 as annual averages for each individual month January through December. I downloaded the information into 12 separate data files T_{μ,i} (μ = 1 . . . 12; i = 0 . . . 112), in which index μ specifies the month (e.g. μ = 1 ⟹ January) and index i specifies the year beyond 1900 (e.g. i = 60 ⟹ 1960), and combined them into a single time-ordered file x_{12i+μ} = T_{μ,i} of N = 12 × 112 = 1344 Celsius temperatures. (The original files were in Fahrenheit.)


Fig. 10.16 Temperature variation over different time intervals for New York City. Top panel: monthly mean temperatures from 1960 to 2012. Middle panel: 12-month moving average covering period 1960-2012. Bottom panel: 12-month moving average covering period 2007-2012. Dashed traces are least-squares lines of regression.

The first panel of Figure 10.16 shows the temperature variation for a truncated
sample of the data spanning 1960-2012. To the naked eye the pattern is little more
than a pleasing sinusoid at the expected annual period and with barely perceptible
amplitude variation, much like the subterranean x240 temperature record in
Figure 10.3. But the eye of analysis tells a different story.
The second and third panels show the centered 12-month moving average {f_k, k = 1 . . . N}

\[
f_k = \sum_{j=0}^{12} c_j\, x_{k-j}, \qquad c_0 = c_{12} = \frac{1}{24}; \quad c_j = \frac{2}{24} \;\; (j = 1, 2, \ldots, 11) \qquad (10.8.5)
\]

over the period 1960-2012 (middle panel) and the period 2007-2012 (bottom panel), superposed by least-squares lines of regression (dashed). The moving average has removed all traces of an annual periodicity, leaving only Gaussian random noise. Although at first glance f_k in the middle panel looks noisy compared to the original record in the top panel, that appearance is deceptive due to the different vertical scales. Had f_k been plotted at the scale of x_k in the top panel, the maximum

fluctuations (of about 3 °C) would only marginally have exceeded the thickness of the
line of regression.
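A minimal implementation of the centered 12-month moving average (10.8.5), applied here to a synthetic monthly series rather than the Central Park record, might read:

    import numpy as np

    # Centered 12-month moving average with end weights 1/24 and interior
    # weights 1/12, as in (10.8.5); the annual cycle is removed exactly.
    rng = np.random.default_rng(3)
    months = np.arange(12 * 112)
    x = 12 + 10 * np.cos(2 * np.pi * months / 12) + rng.normal(0, 2, months.size)

    c = np.full(13, 2 / 24)               # weights c_0 ... c_12
    c[0] = c[12] = 1 / 24
    f = np.convolve(x, c, mode='valid')   # smoothed series (annual cycle removed)
    print(x.std(), f.std())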
Here, then, are the results I obtained for the NYC rate of temperature change per year for three different spans of time:

\[
\begin{aligned}
\text{Long span} & \quad 1900\text{-}2012 & r_1 &= (1.48 \pm 0.083) \times 10^{-2}\ {}^{\circ}\text{C/y} \\
\text{Medium span} & \quad 1960\text{-}2012 & r_2 &= (2.11 \pm 0.074) \times 10^{-2}\ {}^{\circ}\text{C/y} \\
\text{Short span} & \quad 2007\text{-}2012 & r_3 &= (38.3 \pm 5.0) \times 10^{-2}\ {}^{\circ}\text{C/y}. \qquad (10.8.6)
\end{aligned}
\]

The rate of temperature increase, anomalous though it may be, is entirely consistent with what I had found previously for Hartford, i.e. (a) a rate of about 0.02 °C/y, comparable to the regional rate, when taking the period 1960-2012, but (b) a much higher rate of about 0.38 °C/y for the recent period 2007-2012. That the NYC rate is higher than Hartford's (by about 35%) is not surprising given that the population size and, especially, the population density of NYC (10 430 per km²) is much higher than Hartford's (2 772 per km²). Also, this rate was measured above ground, whereas the rate I deduced for Hartford was from measurements 2.4 m below ground.
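For completeness, a sketch of the straight-line trend estimate (with standard error) behind rates such as (10.8.6) is given below; the monthly series is synthetic, not the Central Park record.

    import numpy as np

    # Least-squares trend in degrees C per year, with its standard error.
    rng = np.random.default_rng(7)
    years = 1900 + np.arange(12 * 112) / 12.0
    temp = 12 + 0.015 * (years - 1900) + rng.normal(0, 1.5, years.size)

    coef, cov = np.polyfit(years, temp, 1, cov=True)
    rate, rate_err = coef[0], np.sqrt(cov[0, 0])
    print(f"trend = {rate:.4f} +/- {rate_err:.4f} deg C / y")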
Although I began the study of underground temperature with primarily an
academic interest, the numbers that ultimately emerged from this exercise have
far-reaching implications, all the more serious because they quantify an aspect of climate change that is only very infrequently, if at all, given public exposure. Certainly, there is much discussion of "global warming", although from a scientific perspective that label represents only one part of a more complex outcome of climate change (and not all locations may end up warmer). It is the global aspect, however, upon which attention is mostly fixed: that is, the potential large-scale responses of the planet to heating, such as melting of glaciers in Greenland or ice shelves in Antarctica, the subsequent rise in ocean level with attendant flooding, the
disruption of large circulatory currents in the Atlantic with concomitant planetary
redistribution of heat, heightened frequency and intensity of extreme weather events
like hurricanes and tornados, broad regional perturbations of ecosystems leading to
extinctions of species and spread of disease-bearing insects and other vectors.
None of the preceding possibilities is to be dismissed out of hand or taken lightly, but neither do I think at this point that these events will be the ones that undermine a
stable society the soonest. Rather, it is the local, not global, heat stress to cities. The
cities are where most people live, and, as the human population continues to increase
beyond sustainable levels,19 more and more people will be living in cities. As temperatures likewise climb, so too will the demand for means of cooling, a demand that
already challenges electric grids in the US and abroad. According to the US Centers
for Disease Control and Prevention20

19 M. P. Silverman, The Children Keep Coming: How Many Can Live on Earth?, http://www.trincoll.edu/silverma/reviews_commentary/how_many_can_live_on_earth.html [Op-Ed piece submitted to the Wall Street Journal, 2011]
20 http://www.cdc.gov/climateandhealth/effects/heat.htm


Heat waves are already the most deadly weather-related exposure in the U.S., and account for more
deaths annually than hurricanes, tornadoes, floods, and earthquakes combined.

If there is a bright side to this inauspicious forecast, it is that local problems are more
readily remedied than global ones. I am not optimistic that meaningful action will be
taken anytime soon within the USA, let alone among different nations, to adopt
measures for controlling greenhouse gas emissions. Too many participants have a
stake in maintaining the status quo. "It is difficult to get a man to understand something," writer Upton Sinclair remarked, "when his salary depends upon his not understanding it." In place of "salary" you can substitute, more generally, "political and financial interests". Fortunately, addressing the problem of heat stress at the city
level does not require negotiating an international treaty or reconciling the clashing
interests of political parties. Remedies that forestall disaster could include actions as
simple as planting trees and installing reflecting paneling on rooftops.
***
There was a time once (although not any longer) when a reputable scientist could dismiss the use of statistical reasoning in the planning and interpretation of experiments. One of Nobel Laureate Ernest Rutherford's famous statements is: "If your experiment needs statistics, then you ought to have done a better experiment." Rutherford, it could be noted, said other things equally ill-considered (such as regarding the possibility of extracting energy from the atom . . . by which he meant nucleus . . . as "moonshine"). The fact is: statistical methods in the service of the laws of physics provide a tool for seeing through random noise so that what intelligible patterns of information are present can be recognized, quantified, and used to achieve an end (hopefully a worthy one).

Songwriter Bob Dylan may have had a scientific point (whether he knew it or not) in his oft-quoted line from the song Subterranean Homesick Blues.21 One probably does not need a weatherman to know which way the wind blows. But if Dylan were living in a sprawling, densely populated city a score of years from now, he might want to amend that lyric to read

You don't need a physicist to know which way the heat flows.

For the time being, however, until the deadly serious matter of urban heat stress is widely recognized and acted upon, you do.

21 Bob Dylan, Subterranean Homesick Blues, Columbia Records (original release 1965).

Appendices

10.9 Absorption of solar radiation by a sphere


Place a right-handed Cartesian coordinate system with vertical z axis at the center of a sphere of radius R. The outward unit normal n̂ (same as the radial unit vector) is

\[
\hat{n} = \hat{x} \sin\theta \cos\phi + \hat{y} \sin\theta \sin\phi + \hat{z} \cos\theta \qquad (10.9.1)
\]

with polar angle θ and azimuthal angle φ. From elementary calculus, a differential patch of surface area on the sphere is dS = R² sinθ dθ dφ.

Suppose the rays of the Sun incident on the sphere to be parallel to the y axis. Then the unit solar flux vector is ĵ = -ŷ, and the (positive) rate of energy absorption per unit incident energy is given by the integral over the hemisphere facing the Sun

\[
I = -\int_{\frac{1}{2}S} \hat{j} \cdot \hat{n}\, dS = R^2 \int_{0}^{\pi} \sin^2\theta\, d\theta \int_{0}^{\pi} \sin\phi\, d\phi = \pi R^2. \qquad (10.9.2)
\]
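A quick numerical check of (10.9.2), not in the original appendix, confirms that the integral equals the geometric cross section πR²:

    import numpy as np

    # Riemann-sum evaluation of the absorbed-flux integral over the sunlit hemisphere.
    R = 1.0
    theta = np.linspace(0, np.pi, 400)
    phi = np.linspace(0, np.pi, 400)              # hemisphere facing the Sun
    dtheta, dphi = theta[1] - theta[0], phi[1] - phi[0]
    TH, PH = np.meshgrid(theta, phi, indexing='ij')

    integrand = R**2 * np.sin(TH)**2 * np.sin(PH)
    I = integrand.sum() * dtheta * dphi
    print(I, np.pi * R**2)                        # the two numbers should agree closely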

10.10 Autocorrelation of a decaying oscillator


Consider a time-varying function

\[
x(t) = e^{-\gamma t/2} \cos \omega t \qquad (10.10.1)
\]

with decay constant γ and angular frequency ω, such as is commonly encountered in mechanics (damped oscillator), atomic physics (dipole radiator), electronics (RLC circuits), and many other contexts. The autocovariance function is

\[
\begin{aligned}
C(k) \equiv \langle x(t)\, x(t+k) \rangle &= \frac{e^{-\gamma k/2}}{T} \int_{0}^{T} e^{-\gamma t} \cos\omega t\, \cos\omega(t+k)\, dt \\
&= \frac{e^{-\gamma k/2}}{\omega T} \int_{0}^{\omega T} e^{-\gamma\theta/\omega} \left( \cos^2\theta\, \cos\omega k - \cos\theta \sin\theta\, \sin\omega k \right) d\theta \qquad (10.10.2)
\end{aligned}
\]


in which the second line results from a change of integration variable and use of a trigonometric identity to expand cos(ω(t+k)). The autocorrelation function then takes the form

\[
\rho(k) \equiv \frac{C(k)}{C(0)} = \eta^{k} \left[ \cos\omega k - \frac{I_2}{I_1} \sin\omega k \right], \qquad (10.10.3)
\]

where

\[
\eta \equiv e^{-\gamma/2} < 1 \qquad (10.10.4)
\]
\[
\gamma = -2 \ln \eta \qquad (10.10.5)
\]

and the two integrals in (10.10.2) are evaluated to be

\[
I_1 = \frac{1}{2\pi} \int_{0}^{2\pi} e^{-\gamma\theta/\omega} \cos^2\theta\, d\theta = \frac{\omega\,(\gamma^2 + 2\omega^2)\,(1 - e^{-2\pi\gamma/\omega})}{2\pi\gamma\,(\gamma^2 + 4\omega^2)} \qquad (10.10.6)
\]
\[
I_2 = \frac{1}{2\pi} \int_{0}^{2\pi} e^{-\gamma\theta/\omega} \cos\theta \sin\theta\, d\theta = \frac{\omega^2\,(1 - e^{-2\pi\gamma/\omega})}{2\pi\,(\gamma^2 + 4\omega^2)}. \qquad (10.10.7)
\]

Substitution of (10.10.6) and (10.10.7) into (10.10.3) leads to

\[
\rho(k) = \eta^{k} \left[ \cos\omega k - \frac{\gamma\omega}{\gamma^2 + 2\omega^2} \sin\omega k \right]. \qquad (10.10.8)
\]
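As a numerical check (not part of the original appendix), the sample autocorrelation of a finely sampled damped-oscillator record can be compared with (10.10.8):

    import numpy as np

    # Sample lag-k autocorrelation of x(t) = exp(-gamma*t/2) cos(omega*t)
    # versus the closed form (10.10.8).
    gamma, omega, dt = 0.05, 2 * np.pi / 7.0, 0.01
    t = np.arange(0.0, 400.0, dt)
    x = np.exp(-gamma * t / 2) * np.cos(omega * t)

    def rho_theory(k):
        """Autocorrelation (10.10.8) at time lag k."""
        eta = np.exp(-gamma / 2)
        return eta**k * (np.cos(omega * k)
                         - gamma * omega / (gamma**2 + 2 * omega**2) * np.sin(omega * k))

    lag = 500                                          # 500 steps = 5 time units
    sample = np.sum(x[:-lag] * x[lag:]) / np.sum(x * x)
    print(sample, rho_theory(lag * dt))                # approximate agreement expected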

Bibliography

Since A Certain Uncertainty is a narrative of personal scientific investigations, I list in this bibliography only those few books, all from my own library accumulated throughout the years, that I consulted during the course of these researches. It is quite possible that some of the books are long out of print and that others have more recent editions. Nevertheless, my purpose is to comment briefly on books that actually helped me and that I believe are still among the most informative references for physicists and other physical scientists, rather than to provide a long list of titles that I never used. The books commented on below are for the most part mathematically intensive; one will not find long verbal descriptions and colorful figures. Also, be aware that the word "Introduction" in a title does not necessarily mean that the book is elementary.

Probability and statistics

Harold Jeffreys, Theory of Probability 3rd Edition (Oxford, 1961)


 A classic, compendious treatment of probability in the vanguard of Bayesian
methodology.
William Feller, An Introduction to Probability Theory and Its Applications 2nd Edition,
Vol. 1 (Wiley, 1950) [Vol. 2 (Wiley, 1966)]
 An advanced two-volume set (of which I found the first volume to be the more
useful), which discusses the application of probability theory, largely from a
frequentist perspective, to a variety of problems of physical interest.
Alexander Mood, Franklin Graybill, and Duane Boes, Introduction to the Theory of
Statistics 3rd Edition (McGraw-Hill, 1974)
 An intermediate-level statistics textbook with particularly good discussion of
different probability distributions and methods of estimation and hypothesis
testing.


[1] Maurice Kendall and Allan Stuart, The Advanced Theory of Statistics, Vol. 1
Distribution Theory, 2nd Edition (Hafner, 1963)
[2] Maurice Kendall and Allan Stuart, The Advanced Theory of Statistics, Vol. 2
Inference and Relationship (Griffin, 1961)
[3] Maurice Kendall, Alan Stuart, and J Keith Ord, The Advanced Theory of Statistics,
Vol. 3 Design and Analysis and Time-Series 4th Edition (Macmillan, 1983)
 This is a definitive advanced treatment of orthodox statistics: a massive three-volume encyclopaedic work that only the most avid statistics enthusiast is likely to
want to read cover to cover. It gives the broadest coverage of any statistics
reference I know. (My three volumes are of different editions because they were
acquired at widely different times from different places.)
George Box, Gwilym Jenkins, and Gregory Reinsel, Time Series Analysis (Pearson,
2005)
 A classic work, much of it developed by George Box, that presents a thorough
treatment of the different classes of model stochastic systems for time-series
analysis, forecasting, and control.
Thermodynamics and statistical physics

Herbert Callen, Thermodynamics (Wiley, 1960)


 This is a graduate-level physics textbook that provides the clearest exposition I have
seen of the fundamental principles of equilibrium thermodynamics and the relation
between alternative, but equivalent, formulations of these principles by means of
Legendre transformations. There is a second edition, which includes more material
on statistical thermodynamics, but it is marred (at least my copy is) by printing errors.
E. T. Jaynes, Papers on Probability, Statistics and Statistical Physics, edited by R. Rosenkrantz (Kluwer, 1989)
 This volume includes Jaynes' eye-opening papers on the principle of maximum entropy
as a logical basis for deducing the fundamental relations of equilibrium statistical
mechanics, as well as other papers extending the work into the non-equilibrium
domain. The papers in their totality provide the most comprehensive and vigorously
argued support for a Bayesian approach to probability that I have read.
Information theory

Claude Shannon and Warren Weaver, The Mathematical Theory of Communication


(University of Illinois, 1964)
 This is a clear, concise, introductory account on the relation between entropy,
information, and communication by one (Shannon) who helped create the theory
and by another (Weaver) noted for his expository ability.

Index

adiabatic lapse rate 437, 443


Akaike information criterion (AIC) 555556, 570
albedo (of Earth) 575
aliasing 135
anti-bunching 235236
apophenia 181, 194
atmosphere (of Earth)
layers 574
thermal equilibrium 576
autocorrelation
coefficient 131, 357
distribution of 193
Fourier transform of 138
function 130
of beta-decaying nuclei 156
of decaying oscillator 609
of electric energy utilization 516
of stationary process 130, 138
of stock prices 336
of underground temperature 580
Bacheliers thesis 350
backward-shift operator 355, 519
barometric equation 443
baseball statistics 404
Bayes problem 91
solution with uniform prior 94
solution with Jaynes prior 97
Bayes theorem 3
and chain of inferences 4
and meaning of ignorance 74, 76
and parameter estimation 75
and maximum likelihood 77, 80, 82
and determination of the prior 78
Bayesian information criterion (BIC) 555, 557
BBCSilverman experiments 488
Bell states 242
Bernoulli process 20, 165
Bernoullis principle 410
Bertrands paradox 78

Bessel functions 24
and distribution of autocovariance 192
and distribution of product of normal RVs 289
and Poisson process 173
and Skellam distribution 23
beta function 44, 93
binomial distribution 7, 403
and nuclear decay 123, 128
moment generating function 20
black-body radiation
of the Earth 575
of the Sun 573
BoseEinstein statistics see statistics
boson 195
boundary layer (of fluid) 411
Cauchy distribution 29
of half-life estimates 318
of median 294
of ratio of standard normal variates 191, 292
Central Limit Theorem 13, 28, 32, 312, 402
and WOC hypothesis 464
ChapmanKolmogorov equation 351
characteristic function 32
chemical potential 199
of massless particles 260
of neutrino 266267
relation to Lagrange multiplier 263
thermodynamic definition 263
chi-square distribution 38
and degrees of freedom 40
and gamma distribution 40
and test of hypothesis 65
moment generating function 40
cholesterol ratio 313
circulation 422423
combinatorial coefficient
binomial 8
multinomal 10
Compton wavelength 196

Condorcets jury theorem 465, 509
correlation coefficient 25, 57
covariance function 130, 356
and variogram 373
discrete estimate 131, 191
distribution of 192
of autoregressive time series 529
of moving average time series 532
CramerRao theorem 56, 60
Craven, John (and Bayesian search method) 461
cross-correlation function 130, 149
cryptography 237
cumulative distribution function (cdf ) 7, 393
cut-off frequency 134
dAlemberts paradox 410
degeneracy 200
degrees of freedom
and maximum lag 150
in chi-square distribution 3940, 60, 64
in time series 147148
delta function 34, 140
density of states 206
diffusion 350
coefficient (thermal diffusivity) 591, 598
equation of motion 353, 591, 596
of thermal energy 589, 597
digamma function 109
Dirichlet function 268
distributions of products and ratios 277, 325
of Poisson RVs 278
of uniform RVs 281, 325
of normal RVs 287, 326
drag 409, 419, 442
coefficient 410, 449
in Newtonian flow 410
in Stokes flow 410
on airfoil 434
electronpositron annihilation 118, 261
ensemble average 130
entropy 47
and extensivity 205
and prior information 49
and probability 49
and Shannon information 48, 340
and tests of randomness 258
conditional 343
ergodicity 131
error function (erf ) 13, 486
error propagation theory (EPT) 274, 311, 395
of mean 274
of product of RVs 274
of ratio of RVs 275
of variance 274
estimator 55
unbiased 55

Euler generating function 19
Eulers constant 109
expectation (statistical) 351
exponential distribution 14
and lack of memory 15
and nuclear decay 157
and power spectrum 190
extensivity 205
FermiDirac statistics see statistics
fermion 195
Ficks law 591
First Law of Thermodynamics 222, 260
Fourier series 133 (continuous function), 133
(discrete function)
Fourier transform 138
CooleyTukey FFT algorithm 146
discrete 189, 524, 582
normalization constant 142
flight (theory of ) 419
fugacity 264
gamma coincidence experiment 117
gamma distribution 40
and Fourier amplitudes 190
gamma function 24
incomplete 108
Gauss multiplication formula 44
Gaussian (or normal) distribution 12
and Central Limit Theorem 13
moment generating function 26
solution to diffusion equation 351
geometric distribution 15, 175
and photon emission probability 210
Gibbs correction 205
Gibbs paradox 205
Gonzales, Federico (fall and survival) 432
harmonic series 585
heat-island effect 604
height of the atmosphere 443
Helmholtz equation 593
histogram 10, 114
and multinomial distribution 10, 154
ideal gas 380
adiabatic transformation 443
and stock market 382
equation of state 443
mean free path 381
rms speed 381
time between collisions 381
information mining
Galton model 480
Silverman model 483
information theory 340, 347, 384
insolation 575

intensity interferometry 215
interval function 281
invariance principle 96
inverse probability 92
KullbackLeibler information (KL) 556
kurtosis 6, 285, 392
KuttaJoukowski theorem 423
Lagrange multiplier 4951
and statistical mechanics 51, 199
latent heat 592
law of averages 372, 387
law of corresponding states 412
law of errors 13
law of large numbers 404, 464
Legendre duplication formula 44
Levy distribution 356
lift 419
coefficient 427
on a sphere 426
on airfoil 434
light 194
as photons 195
chemical potential 207, 216
correlation coefficient 232
fluctuations 212214, 220
heat capacity 218, 220
helicity 197
non-classical states 236
photon anti-bunching 236
photon bunching 232, 234, 244
photon correlations 226, 229
pressure 221
likelihood function 4, 55
and chi-square distribution 6465
and probability 63, 65
conditional 359
of autoregressive time series 358
of multivariable Gaussian 57
ratio 62
log-normal distribution 495496, 502, 511
Magnus effect 425426
Markov process 354
martingale 378
maximum entropy principle 46
and statistical mechanics 199
maximum likelihood (ML) method 54
and chi-square test 65
and least squares test 65
ML covariance matrix 56
ML estimator 56
MaxwellBoltzmann statistics see statistics
Maxwell relations 223
mean
maximum likelihood estimate 58

sample 30, 404


theoretical 6
mode (of a distribution) 75, 394
models (mathematical) 516, 554555
moment generating function (mgf) 16
moments of a distribution 56, 17, 392
negative 45, 296, 299
theoretical vs empirical 30
moving average (transformation) 561
365-day 580
multinomial distribution 10
and Condorcets jury theorem 466
and entropy 340
moment generating function 24
multiplicity 11
of BoseEinstein statistics 202, 231
of FermiDirac statistics 203
negative binomial distribution 176, 230
neutrino (Dirac and Majorana) 267
Newtons law of fluid friction 409
noise
1/f (or pink or flicker) 376
Brownian 377
shot 213
wave 214
white 140, 149, 160, 182, 340, 361, 375376
normal variate
equivalence relation 18
standard form 13, 26
nuclear decay processes 114
and solar neutrinos 185
bismuth-212 branching decays 305
bound-state beta decay 184
electron-capture decay 184
half-life 124, 318
neutron decay 266
probability generating function 126
proton decay 117
pseudo fluctuations 309
secular equilibrium 305, 315
sodium-22 decay 117
statistics of 122, 152
thorium-232 decay chain 316
null hypothesis 62
order statistics 72
and Condorcets jury theorem 466
and WalkerFisher test 159
lowest and highest 73
median 294
parametric down-conversion (PDC)
197, 241
Type I 241
Type II 241
Parsevals theorem 148


partition function 5051, 199


partition of an integer 19
Pascals triangle 88
photon-number splitting attack 240
phugoid motion 431
Poisson distribution 9
and nuclear decay 120, 128
and photon emission 245
and theory of runs 169
difference of Poisson variables 23
moment generating function 22
relation to exponential distribution 14
sum of Poisson variables 22
polylogarithm 221
power (of statistical test) 63
power law 160, 588
power spectral density 138, 142
and variogram 373
as a power law 160, 586
completeness relation 188
discrete 143
of AR time series 529
of autoregressive process 357
of harmonic function 141
of MA time series 533
of stock prices 336
of underground temperature 582
of white noise 140
principle of insufficient reason
(or indifference) 50
probability
as frequency 2
as plausibility 2
Bayes theorem 34
conditional 3, 84
density function (pdf) 5, 7, 393
generating function (pgf) 23, 105
posterior 4
prior 4
P-value 66
and uniform distribution 70
quantum computer 239
quantum key distribution 240
quantum mechanics 198
random variable (RV) 5
random walk 350, 353
arc sine law 367
Bernoulli 364, 370
Cauchy 363
cumulative displacement 366
first passage 368
Gaussian 363364
range (of projectile) 392394, 397
Rayleigh distribution 191 (See Tables 3.1 and 3.2)
Reynolds number 409, 427
Riemann zeta function 217

roll parameter 427


runs (statistical) 163
and digital time series 170
definition of 164
exclusive and inclusive 165
mean number of 167, 170
probability of 167
recurrent 246247
sampling theorem 134, 148
Schwartz inequality 148
sensible heat 592
serial correlation function 131
Silverman Guesses of Groups (GOG) experiments
471
SilvermanBayes experiment 100
sinc function 149
single-photon states 239240, 242
Skellam distribution 23
skewness
sample 404
theoretical 6, 285, 392
unbiased estimator 455
Solar constant 575
specific heat 591592
stability (of a distribution) 355
Standard Cosmological Model 183
standard error 31, 71, 504
and Central Limit Theorem 312
Standard Model of particle physics 183
standard temperature and pressure (STP) 437
stationary process (weak and strong) 130, 132
statistic (sing.: random variable) 200
statistical mechanics 47
canonical ensemble 199
chemical potential 199
grand canonical ensemble 220221
Helmholtz potential 223
mean occupation number 207
occupation probability 199
partition function 199
variance in particle number 224
statistics (plur.: occupation problem) 200
BoseEinstein 200201
FermiDirac 200, 202
MaxwellBoltzmann 204
StefanBoltzmann law 574
Stirlings
approximation 61
series 468
stock market 328
and ideal gas 382
closing prices 332
fundamental principle (Bacheliers) 350
information content 347
price changes 336
price dynamics 350
WSJ dartboard contest 379

Students t distribution 41
forms of pdf 44
test of batting equivalence 404, 407
sufficient statistic 56
terminal speed 414
thermal conductivity 592
thermal diffusivity 592
time series models
autocorrelation 519, 521
autoregressive (AR) 353, 527, 534, 595
autoregressive integrated moving average
(ARIMA) 551
autoregressive moving average (ARMA)
533
first-differences 518519
invertibility 550
ML parameters 385, 569
moving average (MA) 527, 530, 547
power spectral density 385, 523
predictability 373, 376
with oscillations 543
Toeplitz matrix 540
ultra-relativistic plasma 263


uncertainty relation (time-frequency) 148


uniform distribution 34
and generation of random numbers 38
and P-values 70
moment generating function 35
relation to cumulative distribution 37
variance
conditional 504
maximum likelihood estimate 59
sample 30, 404
theoretical 6, 392
variogram 373
vorticity 422423
waiting time 175
to multiple occurrence 175
WalkerFisher (WF) test 159
Wien displacement law 574
Wiener process 595
WienerKhinchin (WK) theorem 132, 139, 523
WOC (wisdom of crowds) hypothesis 460
Yule (or pure birth) process 125
YuleWalker equation 357, 528
