Frequentism and Bayesianism - A Practical Introduction

Pythonic Perambulations
(https://jakevdp.github.io/)
Musings and ramblings through the world of
Python and beyond
Atom (/atom.xml)
Search
Navigate
Archives (/archives.html)
Home Page (http://www.astro.washington.edu/users/vanderplas)
Frequentism and Bayesianism: A

Practical Introduction
Mar 11, 2014
This post is part of a 5-part series: Part I (http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-apractical-intro/)
Part
II
(http://jakevdp.github.io/blog/2014/06/06/frequentism-and-bayesianism-2-when-results-
differ/) Part III (http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/) Part

IV
(http://jakevdp.github.io/blog/2014/06/14/frequentism-and-bayesianism-4-bayesian-in-python/)
Part
(http://jakevdp.github.io/blog/2015/08/07/frequentism-and-bayesianism-5-model-selection/)
See also Frequentism and Bayesianism: A Python-driven Primer (http://arxiv.org/abs/1411.5018), a peer-reviewed
article partially based on this content.
One of the first things a scientist hears about statistics is that there is are two different approaches: frequentism and
Bayesianism. Despite their importance, many scientific researchers never have opportunity to learn the distinctions
between them and the different practical approaches that result. The purpose of this post is to synthesize the
philosophical and pragmatic aspects of the frequentist and Bayesian approaches, so that scientists like myself
might be better prepared to understand the types of data analysis people do.
I'll start by addressing the philosophical distinctions between the views, and from there move to discussion of how
these ideas are applied in practice, with some Python code snippets demonstrating the difference between the
approaches.
Frequentism vs. Bayesianism: a Philosophical Debate

Fundamentally, the disagreement between frequentists and Bayesians concerns the definition of probability.
For frequentists, probability only has meaning in terms of a limiting case of repeated measurements. That is, if I
measure the photon flux F from a given star (we'll assume for now that the star's flux does not vary with time), then
measure it again, then again, and so on, each time I will get a slightly different answer due to the statistical error of
my measuring device. In the limit of a large number of measurements, the frequency of any given value indicates the
probability of measuring that value. For frequentists probabilities are fundamentally related to frequencies of
events. This means, for example, that in a strict frequentist view, it is meaningless to talk about the probability of
the true flux of the star: the true flux is (by definition) a single fixed value, and to talk about a frequency distribution
for a fixed value is nonsense.
For Bayesians, the concept of probability is extended to cover degrees of certainty about statements. Say a
Bayesian claims to measure the flux
of a star with some probability
P(F)
: that probability can certainly be
estimated from frequencies in the limit of a large number of repeated experiments, but this is not fundamental. The
probability is a statement of my knowledge of what the measurement reasult will be. For Bayesians, probabilities
are fundamentally related to our own knowledge about an event. This means, for example, that in a Bayesian
view, we can meaningfully talk about the probability that the true flux of a star lies in a given range. That probability
codifies our knowledge of the value based on prior information and/or available data.
The surprising thing is that this arguably subtle difference in philosophy leads, in practice, to vastly different
approaches to the statistical analysis of data. Below I will give a few practical examples of the differences in
approach, along with associated Python code to demonstrate the practical aspects of the resulting methods.
Frequentist and Bayesian Approaches in Practice: Counting

Photons
Here we'll take a look at an extremely simple problem, and compare the frequentist and Bayesian approaches to
solving it. There's necessarily a bit of mathematical formalism involved, but I won't go into too much depth or
discuss too many of the subtleties. If you want to go deeper, you might consider please excuse the shameless
plug taking a look at chapters 4-5 of our textbook (http://www.amazon.com/dp/0691151687/).
The Problem: Simple Photon Counts

Imagine that we point our telescope to the sky, and observe the light coming from a single star. For the time being,
we'll assume that the star's true flux is constant with time, i.e. that is it has a fixed value
Ftrue
(we'll also ignore
effects like sky noise and other sources of systematic error). We'll assume that we perform a series of
measurements with our telescope, where the i
th
measurement reports the observed photon flux Fi and error ei .
The question is, given this set of measurements D
= {Fi , ei }
, what is our best estimate of the true flux Ftrue ?
(Gratuitous aside on measurement errors: We'll make the reasonable assumption that errors are Gaussian. In a Frequentist perspective,
ei
is the standard deviation of the results of a single measurement event in the limit of repetitions of that event. In the Bayesian
perspective,
ei
is the standard deviation of the (Gaussian) probability distribution describing our knowledge of that particular
measurement given its observed value)
Here we'll use Python to generate some toy data to demonstrate the two approaches to the problem. Because the
measurements are number counts, a Poisson distribution is a good approximation to the measurement process:
In[1]: # Generating some simple photon count data
import numpy as np
from scipy import stats
np.random.seed(1) # for repeatability
F_true = 1000 # true flux, say number of photons measured in 1 second
N = 50 # number of measurements
F = stats.poisson(F_true).rvs(N) # N measurements of the flux
e = np.sqrt(F) # errors on Poisson counts estimated via square root
Now let's make a simple visualization of the "measured" data:
In[2]: %matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.errorbar(F, np.arange(N), xerr=e, fmt='ok', ecolor='gray', alpha=0.5)
ax.vlines([F_true], 0, N, linewidth=5, alpha=0.2)
ax.set_xlabel("Flux");ax.set_ylabel("measurement number");
These
measurements
each
have
different
error
ei
which
is
estimated
from
Poisson
statistics
(http://en.wikipedia.org/wiki/Poisson_distribution) using the standard square-root rule. In this toy example we

already know the true flux
Ftrue
, but the question is this: given our measurements and errors, what is our best
estimate of the true flux?

Let's take a look at the frequentist and Bayesian approaches to solving this.
Frequentist Approach to Simple Photon Counts
We'll start with the classical frequentist maximum likelihood approach. Given a single observation
we can compute the probability distribution of the measurement given the true flux
Ftrue
Di = (Fi , ei )
given our assumption of
Gaussian errors:
(F F
P(Di |Ftrue ) =
exp
2
true )
2e
This should be read "the probability of
Di
given Ftrue equals ...". You should recognize this as a normal distribution
with mean Ftrue and standard deviation ei .

We construct the likelihood function by computing the product of the probabilities for each data point:
N
(D|Ftrue ) =
P(Di |Ftrue )
i=1
Here
D = {Di }
represents the entire set of measurements. Because the value of the likelihood can become very
small, it is often more convenient to instead compute the log-likelihood. Combining the previous two equations and
computing the log, we have
log =
log(2
2 [
2
i
)+
(Fi
2
true )
i=1
What we'd like to do is determine
Ftrue
such that the likelihood is maximized. For this simple problem, the
maximization can be computed analytically (i.e. by setting d log /d Ftrue
= 0
). This results in the following observed
estimate of Ftrue :
Fest =
w F
w
i
; wi = 1/e
Notice that in the special case of all errors ei being equal, this reduces to
N
1
Fest =
Fi
i=1
That is, in agreement with intuition, Fest is simply the mean of the observed data when errors are equal.
We can go further and ask what the error of our estimate is. In the frequentist approach, this can be accomplished
by fitting a Gaussian approximation to the likelihood curve at maximum; in this simple case this can also be solved
analytically. It can be shown that the standard deviation of this Gaussian approximation is:
1/2
est
=
(
i=1
wi
)
These results are fairly simple calculations; let's evaluate them for our toy dataset:
In[3]: w = 1. / e ** 2
print("""
F_true = {0}
F_est = {1:.0f} +/- {2:.0f} (based on {3} measurements)
""".format(F_true, (w * F).sum() / w.sum(), w.sum() ** -0.5, N))
F_true = 1000
F_est = 998 +/- 4 (based on 50 measurements)
We find that for 50 measurements of the flux, our estimate has an error of about 0.4% and is consistent with the
input value.
Bayesian Approach to Simple Photon Counts

The Bayesian approach, as you might expect, begins and ends with probabilities. It recognizes that what we
fundamentally want to compute is our knowledge of the parameters in question, i.e. in this case,
P(Ftrue |D)
Note that this formulation of the problem is fundamentally contrary to the frequentist philosophy, which says that
probabilities have no meaning for model parameters like
Ftrue
. Nevertheless, within the Bayesian philosophy this is
perfectly acceptable.
To compute this result, Bayesians next apply Bayes' Theorem (http://en.wikipedia.org/wiki/Bayes'_theorem), a
fundamental law of probability:
P(Ftrue |D) =
P(D|Ftrue )P(Ftrue )
P(D)
Though Bayes' theorem is where Bayesians get their name, it is not this law itself that is controversial, but the
Bayesian interpretation of probability implied by the term P(Ftrue |D) .
Let's take a look at each of the terms in this expression:
P(Ftrue |D)
: The posterior, or the probability of the model parameters given the data: this is the result we
want to compute.
P(D|Ftrue )
P(Ftrue )
: The likelihood, which is proportional to the (D|Ftrue ) in the frequentist approach, above.
: The model prior, which encodes what we knew about the model prior to the application of the
data D .
P(D)
: The data probability, which in practice amounts to simply a normalization term.
If we set the prior P(Ftrue )

P(Ftrue |D)
(a flat prior), we find
(D|F
true )
and the Bayesian probability is maximized at precisely the same value as the frequentist result! So despite the
philosophical differences, we see that (for this simple problem at least) the Bayesian and frequentist point estimates
are equivalent.
But What About the Prior?
You'll noticed that I glossed over something here: the prior,
P(Ftrue )
. The prior allows inclusion of other information
into the computation, which becomes very useful in cases where multiple measurement strategies are being
combined to constrain a single model (as is the case in, e.g. cosmological parameter estimation). The necessity to
specify a prior, however, is one of the more controversial pieces of Bayesian analysis.
A frequentist will point out that the prior is problematic when no true prior information is available. Though it might
seem straightforward to use a noninformative prior like the flat prior mentioned above, there are some surprisingly
subtleties
(http://normaldeviate.wordpress.com/2013/07/13/lost-causes-in-statistics-ii-noninformative-
priors/comment-page-1/) involved. It turns out that in many situations, a truly noninformative prior does not exist!
Frequentists point out that the subjective choice of a prior which necessarily biases your result has no place in
statistical data analysis.
A Bayesian would counter that frequentism doesn't solve this problem, but simply skirts the question. Frequentism
can often be viewed as simply a special case of the Bayesian approach for some (implicit) choice of the prior: a
Bayesian would say that it's better to make this implicit choice explicit, even if the choice might include some
subjectivity.
Photon Counts: the Bayesian approach

Leaving these philosophical debates aside for the time being, let's address how Bayesian results are generally
computed in practice. For a one parameter problem like the one considered here, it's as simple as computing the
posterior probability
parameter
Ftrue
P(Ftrue |D)
as a function of
Ftrue
: this is the distribution reflecting our knowledge of the
. But as the dimension of the model grows, this direct approach becomes increasingly intractable.
For this reason, Bayesian calculations often depend on sampling methods such as Markov Chain Monte Carlo
(MCMC) (http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo).
I won't go into the details of the theory of MCMC here. Instead I'll show a practical example of applying an MCMC
approach using Dan Foreman-Mackey's excellent emcee (http://dan.iel.fm/emcee/current/) package. Keep in mind
here that the goal is to generate a set of points drawn from the posterior probability distribution, and to use those
points to determine the answer we seek.
To perform this MCMC, we start by defining Python functions for the prior
the posterior
P(Ftrue |D)
P(Ftrue )
, the likelihood
P(D|Ftrue )
, and
, noting that none of these need be properly normalized. Our model here is one-
dimensional, but to handle multi-dimensional models we'll define the model in terms of an array of parameters ,
which in this case is
= [Ftrue ]
In[4]: def log_prior(theta):

return 1 # flat prior
def log_likelihood(theta, F, e):
return -0.5 * np.sum(np.log(2 * np.pi * e ** 2)
+ (F - theta[0]) ** 2 / e ** 2)
def log_posterior(theta, F, e):
return log_prior(theta) + log_likelihood(theta, F, e)
Now we set up the problem, including generating some random starting guesses for the multiple chains of points.
In[5]: ndim = 1 # number of parameters in the model

nwalkers = 50 # number of MCMC walkers
nburn = 1000 # "burn-in" period to let chains stabilize
nsteps = 2000 # number of MCMC steps to take
# we'll start at random locations between 0 and 2000
starting_guesses = 2000 * np.random.rand(nwalkers, ndim)
import emcee
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[F, e])
sampler.run_mcmc(starting_guesses, nsteps)
sample = sampler.chain # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].ravel() # discard burn-in points
If this all worked correctly, the array sampleshould contain a series of 50000 points drawn from the posterior. Let's
plot them and check:
In[6]: # plot a histogram of the sample
plt.hist(sample, bins=50, histtype="stepfilled", alpha=0.3, normed=True)
# plot a best-fit Gaussian
F_fit = np.linspace(975, 1025)
pdf = stats.norm(np.mean(sample), np.std(sample)).pdf(F_fit)
plt.plot(F_fit, pdf, '-k')
plt.xlabel("F"); plt.ylabel("P(F)")
Out[6]: <matplotlib.text.Text at 0x1075c7510>
We end up with a sample of points drawn from the (normal) posterior distribution. The mean and standard deviation
of this posterior are the corollary of the frequentist maximum likelihood estimate above:
In[7]: print("""
In[7]: print("""
F_true = {0}
F_est = {1:.0f} +/- {2:.0f} (based on {3} measurements)
""".format(F_true, np.mean(sample), np.std(sample), N))
F_true = 1000
F_est = 998 +/- 4 (based on 50 measurements)
We see that as expected for this simple problem, the Bayesian approach yields the same result as the frequentist
approach!
Discussion
Now, you might come away with the impression that the Bayesian method is unnecessarily complicated, and in this
case it certainly is. Using an Affine Invariant Markov Chain Monte Carlo Ensemble sampler to characterize a onedimensional normal distribution is a bit like using the Death Star to destroy a beach ball, but I did this here because
it demonstrates an approach that can scale to complicated posteriors in many, many dimensions, and can provide
nice results in more complicated situations where an analytic likelihood approach is not possible.
As a side note, you might also have noticed one little sleight of hand: at the end, we use a frequentist approach to
characterize our posterior samples! When we computed the sample mean and standard deviation above, we were
employing a distinctly frequentist technique to characterize the posterior distribution. The pure Bayesian result for a
problem like this would be to report the posterior distribution itself (i.e. its representative sample), and leave it at
that. That is, in pure Bayesianism the answer to a question is not a single number with error bars; the answer is the
posterior distribution over the model parameters!
Adding a Dimension: Exploring a more sophisticated model

Let's briefly take a look at a more complicated situation, and compare the frequentist and Bayesian results yet
again. Above we assumed that the star was static: now let's assume that we're looking at an object which we
suspect has some stochastic variation that is, it varies with time, but in an unpredictable way (a Quasar is a good
example of such an object).
We'll propose a simple 2-parameter Gaussian model for this object:
= [, ]
where is the mean value, and
is
the standard deviation of the variability intrinsic to the object. Thus our model for the probability of the true flux at
the time of each observation looks like this:
Ftrue
(F )
exp
Now, we'll again consider N observations each with their own error. We can generate them this way:
In[8]: np.random.seed(42) # for reproducibility
N = 100 # we'll use more samples for the more complicated model
mu_true, sigma_true = 1000, 15 # stochastic flux model

F_true = stats.norm(mu_true, sigma_true).rvs(N) # (unknown) true flux
F = stats.poisson(F_true).rvs() # observed flux: true flux plus Poisson er
rors.
e = np.sqrt(F) # root-N error, as above
Varying Photon Counts: The Frequentist Approach

The resulting likelihood is the convolution of the intrinsic distribution with the error distribution, so we have
N
(D| ) =
i=1
(F )
exp
+e )
i
[ 2(
2
+e ) ]
i
Analogously to above, we can analytically maximize this likelihood to find the best estimate for :
est
w F
w
i
; wi =
+e
And here we have a problem: the optimal value of
depends on the optimal value of . The results are correlated,
so we can no longer use straightforward analytic methods to arrive at the frequentist result.
Nevertheless, we can use numerical optimization techniques to determine the maximum likelihood value. Here we'll
use
the
optimization
routines
available
within
Scipy's
optimize
(http://docs.scipy.org/doc/scipy/reference/optimize.html) submodule:
In[9]: def log_likelihood(theta, F, e):
return -0.5 * np.sum(np.log(2 * np.pi * (theta[1] ** 2 + e ** 2))
+ (F - theta[0]) ** 2 / (theta[1] ** 2 + e ** 2))
# maximize likelihood <--> minimize negative likelihood
def neg_log_likelihood(theta, F, e):
return -log_likelihood(theta, F, e)
from scipy import optimize
theta_guess = [900, 5]
theta_est = optimize.fmin(neg_log_likelihood, theta_guess, args=(F, e))
print("""
Maximum likelihood estimate for {0} data points:
mu={theta[0]:.0f}, sigma={theta[1]:.0f}
""".format(N, theta=theta_est))
Optimization terminated successfully.
Current function value: 502.839505
Iterations: 58
Function evaluations: 114
Maximum likelihood estimate for 100 data points:
Maximum likelihood estimate for 100 data points:

mu=999, sigma=19
This maximum likelihood value gives our best estimate of the parameters
and
governing our model of the
source. But this is only half the answer: we need to determine how confident we are in this answer, that is, we need
to compute the error bars on and .
There are several approaches to determining errors in a frequentist paradigm. We could, as above, fit a normal
approximation to the maximum likelihood and report the covariance matrix (here we'd have to do this numerically
rather than analytically). Alternatively, we can compute statistics like
and
2
dof
to and use standard tests
(http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test) to determine confidence limits, which also depends

on strong assumptions about the Gaussianity of the likelihood. We might alternatively use randomized sampling
approaches
such
as
Jackknife
(http://en.wikipedia.org/wiki/Jackknife_(statistics))
or
Bootstrap
(http://en.wikipedia.org/wiki/Bootstrapping_(statistics)), which maximize the likelihood for randomized samples of

the input data in order to explore the degree of certainty in the result.
All of these would be valid techniques to use, but each comes with its own assumptions and subtleties. Here, for
simplicity, we'll use the basic bootstrap resampler found in the astroML (http://astroML.org) package:
In[10]: from astroML.resample import bootstrap
def fit_samples(sample):
# sample is an array of size [n_bootstraps, n_samples]
# compute the maximum likelihood for each bootstrap.
return np.array([optimize.fmin(neg_log_likelihood, theta_guess,
args=(F, np.sqrt(F)), disp=0)
for F in sample])
samples = bootstrap(F, 1000, fit_samples) # 1000 bootstrap resamplings
Now in a similar manner to what we did above for the MCMC Bayesian posterior, we'll compute the sample mean
and standard deviation to determine the errors on the parameters.
In[11]: mu_samp = samples[:, 0]
sig_samp = abs(samples[:, 1])
print " mu
= {0:.0f} +/- {1:.0f}".format(mu_samp.mean(), mu_samp.std())
print " sigma = {0:.0f} +/- {1:.0f}".format(sig_samp.mean(), sig_samp.st

d())
mu
= 999 +/- 4
sigma = 18 +/- 5
I should note that there is a huge literature on the details of bootstrap resampling, and there are definitely some
subtleties of the approach that I am glossing over here. One obvious piece is that there is potential for errors to be
correlated or non-Gaussian, neither of which is reflected by simply finding the mean and standard deviation of each
model parameter. Nevertheless, I trust that this gives the basic idea of the frequentist approach to this problem.
Varying Photon Counts: The Bayesian Approach

The Bayesian approach to this problem is almost exactly the same as it was in the previous problem, and we can
set it up by slightly modifying the above code.
In[12]: def log_prior(theta):
# sigma needs to be positive.
if theta[1] <= 0:
return -np.inf
else:
return 0
def log_posterior(theta, F, e):
return log_prior(theta) + log_likelihood(theta, F, e)
# same setup as above:
ndim, nwalkers = 2, 50
nsteps, nburn = 2000, 1000
starting_guesses = np.random.rand(nwalkers, ndim)
starting_guesses[:, 0] *= 2000 # start mu between 0 and 2000
starting_guesses[:, 1] *= 20
# start sigma between 0 and 20
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[F, e])

sampler.run_mcmc(starting_guesses, nsteps)
sample = sampler.chain # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].reshape(-1, 2)
Now that we have the samples, we'll use a convenience routine from astroML to plot the traces and the contours
representing one and two standard deviations:
In[13]: from astroML.plotting import plot_mcmc
fig = plt.figure()
ax = plot_mcmc(sample.T, fig=fig, labels=[r'$\mu$', r'$\sigma$'], color
s='k')
ax[0].plot(sample[:, 0], sample[:, 1], ',k', alpha=0.1)
ax[0].plot([mu_true], [sigma_true], 'o', color='red', ms=10);
The red dot indicates ground truth (from our problem setup), and the contours indicate one and two standard
deviations (68% and 95% confidence levels). In other words, based on this analysis we are 68% confident that the
model lies within the inner contour, and 95% confident that the model lies within the outer contour.
Note here that
=0
is consistent with our data within two standard deviations: that is, depending on the certainty
threshold you're interested in, our data are not enough to confidently rule out the possibility of a non-varying source!
The other thing to notice is that this posterior is definitely not Gaussian: this can be seen by the lack of symmetry in
the vertical direction. That means that the Gaussian approximation used within the frequentist approach may not
reflect the true uncertainties in the result. This isn't an issue with frequentism itself (i.e. there are certainly ways to
account for non-Gaussianity within the frequentist paradigm), but the vast majority of commonly applied frequentist
techniques make the explicit or implicit assumption of Gaussianity of the distribution. Bayesian approaches
generally don't require such assumptions.
(Side note on priors: there are good arguments that a flat prior on
necessarily non-informative in the case of scale factors like
subtley biases the calculation in this case: i.e. a flat prior is not
There are interesting arguments to be made that the Jeffreys Prior
(http://en.wikipedia.org/wiki/Jeffreys_prior) would be more applicable. Here I believe the Jeffreys prior is not suitable, because is not a
true scale factor (i.e. the Gaussian has contributions from
ei
as well). On this question, I'll have to defer to others who have more
expertise. Note that subtle some would say subjective questions like this are among the features of Bayesian analysis that
frequentists take issue with).
Conclusion
I hope I've been able to convey through this post how philosophical differences underlying frequentism and
Bayesianism lead to fundamentally different approaches to simple problems, which nonetheless can often yield
similar or even identical results.
To summarize the differences:
Frequentism considers probabilities to be related to frequencies of real or hypothetical events.
Bayesianism considers probabilities to measure degrees of knowledge.
Frequentist analyses generally proceed through use of point estimates and maximum likelihood
approaches.
Bayesian analyses generally compute the posterior either directly or through some version of MCMC
sampling.
In simple problems, the two approaches can yield similar results. As data and models grow in complexity, however,
the two approaches can diverge greatly. In a followup post, I plan to show an example or two of these more
complicated situations. Stay tuned!
Update:
see
the
followup
post:
Frequentism
and
Bayesianism
II:
When
Results
Differ
(http://jakevdp.github.io/blog/2014/06/06/frequentism-and-bayesianism-2-when-results-differ/)
This
post
was
written
entirely
in
(http://jakevdp.github.io/downloads/notebooks/FreqBayes.ipynb)
the
IPython
this
notebook.
notebook,
or
see
You
a
can
static
download
view
here
(http://nbviewer.ipython.org/url/jakevdp.github.io/downloads/notebooks/FreqBayes.ipynb).
Posted by Jake Vanderplas Mar 11, 2014

Tweet
396
150
Comments
22 Comments
Pythonic Perambulations
Share
Recommend 1
Login
Sort by Best
Join the discussion

Kjell Swedin
a year ago
Thanks for the well-crafted article. I'm working my way through the excellent CS109 Data
Science lectures from Harvard, and as chance would have it, today I listened to "Lecture 10 How to think like a Baysian" - much of the same material with a more theoretical bent.
For those who might be interested, CS109 is a very modern take (first taught fall 2013) on an
introductory data science class. It uses Python, the Ipython notebook, and video lectures to
cover a representitive set of data science topics. Materials are free and available here:
http://cs109.org/
1
yono
Reply Share
a year ago
Hi,
I am somewhat confused by what you mean by "true flux" and "error" when stating the photon
counting problem at the beginning. The emission process of photons from an object such as a
star (or say an idealized black body with constant temperature) is in itself a poisson process, so
the numebr of emitted photons will vary by the square root law without introducing any
measurement errors. i.e. - even if you could capture the photons emitted in all directions from
the start with 100% efficiency, you would still get a 1/sqrt(N) variability in the results.
So when you say "true flux", do you mean exactly 1000 photons arrive at my detector during
each sample, but my detector has some measurement error (e.g. electronic shot noise), or that
only on average I get 1000 photons, but actually this number varies and that is the source of
the error?
Reply Share
jakevdp
Mod
> yono a year ago
I guess that is a bit sloppy... by "true flux" I guess I mean (in a frequentist sense) the
value you would measure in a large number of trials, when statistical errors drop out.
For example, if you measured 10^12 photons after 10^9 seconds, you can be pretty
dang sure the flux is 1000 photons per second (assuming our constant flux model is
correct, of course). In a Bayesian sense, the true flux is some parameter F that
characterizes our model, and our observations are drawn according to some distribution
about that true parameter.
As you mention, the variance around that true flux could be due to statistical or
systematic errors, but for simplicity here I've just considered random errors.
Reply Share
astraw > yono
a year ago
I have a related question. The motivation for the likelihood function starts with
"Gratuitous aside on measurement errors: We'll make the reasonable assumption that
errors are Gaussian." (And this seems to be the statement leading to @yono's question.)
My question is: is it reasonable to state that the central limit theorem specifies that the
Gaussian likelihood function is expected here because each F_i is itself containing a
large number of observations? So, in this case, each measurement F_i is then the result
of a large number of observations. But how would one justify that? The way the problem
is phrased, I thought each F_i is a single observation from a detector (with some error
e_i).
Otherwise, my question boils down the following: why do we have a Gaussian likelihood
function here even though the generative model is Poisson?
Reply Share
jakevdp
Mod
> astraw a year ago
There are two questions I see here:

1) Why should we assume Gaussian errors when we know that the errors are
actually Poisson?
2) Why should we assume Gaussian errors when we have no other information?
The answer to number 1 is quite easy: in the large N limit, the Poisson
distribution approaches a Gaussian. And by large N, I mean that the two become
very close (to the point where an analysis like this one would not be affected) for
N as small as ~10 or so.
The answer to number 2 is a bit deeper: it's possible to show via maximum
entropy considerations that the least-binding assumption you can make about a
distribution given some measure of width is to say it's a normal distribution with
distribution given some measure of width is to say it's a normal distribution with
that width. So in the zero-information case, when some assumption is required,
the Gaussian is the best assumption to make.

yankov
Reply Share
a year ago
Thanks for the great post. Offtop, but what did you use to embed pieces of ipython notebook to
an octopress' post? I couldn't find anything workable.
Reply Share
jakevdp
Mod
> yankov a year ago
This blog is Pelican, not Octopress, and I use https://github.com/getpelican/...

Reply Share
yankov > jakevdp
a year ago
Got it. The theme of your blog is the default theme for octopress, but apparently
it's not specific to it.

WinVector LLC
Reply Share
a year ago
Nice article. In this direction my group has been trying to help teach that you tend to need to be
familiar with both frequentist and Bayesian thought (you can't always choose one or the other:
http://www.win-vector.com/blog... ) and that Bayesianism only appears to be the more
complicated of the two ( http://www.win-vector.com/blog... ).
Reply Share
aloctavodia
a year ago
Hi, great post, thanks!

I have a doubt. You say "the contours indicate one and two standard deviations (68% and 95%
confidence levels). In other words, based on this analysis we are 68%
confident that the model lies within the inner contour, and 95%
confident that the model lies within the outer contour." But the 68-95-99.7 rule is only valid for
gaussian distributions. I think the contours represents the 0.68 and 0.95 percentiles, hence you
are right when you say "we are 68%
confident that the model lies within the inner contour, and 95%
confident that the model lies within the outer contour", but is not true that those contours
represent 1 and 2 standard deviations.
Reply Share
jakevdp
Mod
> aloctavodia a year ago
Good point - the contours do contain 68%/95% of the points, and this is often
colloquially referred to as 1 and 2 standard deviations in analogy with gaussians, but
that language is a bit sloppy.
Reply Share
Shayne Hodge
a year ago
I've found this series helpful, though pointedly pointing out how rusty my prob/stats are. Two
questions - (1) good printed resources - I have my old stochastic systems books (EE-centric)
and a newer econometrics book (fond of using notation I've never seen, lots of superscripted *),
but neither is particularly on point. Don't really want a textbook, but something better than
pop-sci.
Second, when you go to the 2 factor model, I want to say intuitively this is just a joint-Gaussian
distribution. Is that the case under the frequentist approach? What I don't get on the Bayesian
is not just why it's not Gaussian, but why it's not symmetrical. Our two component r.v.'s are
gaussian (right?), so if absolutely nothing else I'd expect to see left-right and top-bottom
symmetry. Seems like a complete breakdown of my intuition. (Again:
https://www.youtube.com/watch?...
see more
Reply Share
mvuorre
a year ago
Thanks for the great series of posts.
Reply Share
Daniel Halperin
a year ago
So what about when the star's flux is time-varying, so there is no "one true flux"?
Reply Share
jakevdp
Mod
> Daniel Halperin a year ago
The second problem is an example of this (the true flux varies with time in a stochastic
manner). We fit for mu and sigma which describe the mean and standard deviation of
this time-varying flux.
But it sounds like you're thinking of a situation in which the flux varies more
predictably: say, periodically with time. In this case, you just need to construct an
appropriate model that can account for the periodicity. I'm hoping to cover something
along those lines in a followup post (when I can find the time...)
Reply Share
Daniel Halperin > jakevdp
a year ago
Right. So if the model can't be compressed to one or a few "true" parameters do

frequentists just give up?
Maybe I should come up with an example before I actually ask this question :p.

Royi
Reply Share
a year ago
Hi,
Great post.
Could you an option to download posts as PDF?
Thank You.
Reply Share
Shannon Quinn
a year ago
Awesome post. Love statistics, love Python, love the intersection of both. A few comments,
mostly to sharpen my own chops on the subject.
At the beginning you defined the fundamental philosophies of frequentist vs Bayesian
statistics, in particular focusing on whether or not the approach attempts to characterize the
true value with the estimator it computes. However, my understanding is that it's the reverse of
what you said: frequentist statistics attempt to determine the true value, while Bayesian
statistics attempt to reflect your own beliefs. This post (also by Larry Wasserman, who I've had
the incredible honor of taking classes with) is in line with that.
I appreciate that you mention the subtleties involved in attempting to choose noninformative
priors. But it may also be prudent to point out that as the problem becomes more complex, it
simultaneously becomes much more difficult to choose noninformative priors. As the data
becomes higher dimensional, unless you're sampling exponentially more points (unlikely in the
real world) your data is intrinsically becoming more sparse and, therefore, becoming much
more sensitive to the choice of prior; even small variations in the prior can drastically alter the
posterior.
This isn't to stomp on Bayesian statistics or to herald the frequentist approach. I'm just trying
to see, from someone obviously knowledgeable on these topics, if my own knowledge is entirely
off base :) Thanks again!
Reply Share
jakevdp
Mod
> Shannon Quinn a year ago
I think you misunderstood me... I didn't mean to imply that frequentists are not
concerned with true parameter values, only that they think it's nonsense to talk about
probability distributions of parameter values (by the very definition of probability!). The
final result and confidence estimate is most certainly an attempt to constrain the true
value.
As to Bayesianism reflecting "beliefs" in opposition to "truth", I think that's a case of
taking the Bayesian definition of probability and running just a little too far with it :)
And here, I'm afraid, my own bias toward Bayesianism is starting to show...
Reply Share
Shannon Quinn > jakevdp
a year ago
Ah yes, in the case of distributions of parameter values, you are absolutely

correct in that distinction.
I don't necessarily mean to say beliefs are in opposition to truth--I would agree
that's running too far with it. I just mean to say that both approaches have their
applications and are clearly useful under certain conditions (as your post does a
great job of pointing out), but they both come with assumptions and potential
pitfalls that make their applications questionable under other circumstances.

Olivier Grisel
Reply Share
a year ago
Great post Jake! Some comments:

> That is, in pure Bayesianism the answer to a question is not a single
number with error bars; the answer is the posterior distribution over
the model parameters!
You can also report a credible interval for each parameter. It might be easier to interpret by a
human being than the full estimate of the posterior distribution (as you did on the plot in the
second example):
http://en.wikipedia.org/wiki/C....
About the bootstrap, it's a great way to estimate the confidence in your frequentist estimates
without making further parametric assumptions on the data. However we should keep in mind
that the bootstrap only provides an asymptotic estimates of the standard error of your
estimated parameters. In practice, if you have less than 50 original data points you might run
the risk of underestimating the true width of the confidence interval and therefore get mislead
to be overconfident in your estimates. I ran some simulation in this notebook to highlight this:
http://nbviewer.ipython.org/gi...
The Bayesian approach on the other hand uses priors to inject uncertainty in your modeling. It
is therefore my understanding that Bayesian statistics will rather tell you "you don't have
enough data points to be reasonably confident in that hypothesis (e.g. that sigma is non-zero in
your example)" while bootstrap confidence intervals can fail to reject the null hypothesis if you
don't have enough data points to capture enough of the true variability.
Reply Share
jakevdp
Mod
> Olivier Grisel a year ago
Thanks for the feedback and links, Olivier!
Reply Share
WHAT'S THIS?
ALSO ON PYTHONIC PERAMBULATIONS
Why Python is Slow: Looking Under the
Frequentism and Bayesianism III:
Why Python is Slow: Looking Under the

Hood
55 comments a year ago
Frequentism and Bayesianism III:

Confidence, Credibility, and why
Frequentism
77
comments a year
andago
Maciek Dems Efficient programming in
Nikratio Alright, you got me convinced. But
Python needs different thinking. All the

operations should be vectorized. This is
especially true when using
unfortunately your article lacks the one thing

neccessary to make more people follow your
The Hipster Effect: An IPython Interactive

Exploration
Frequentism and Bayesianism V: Model

Selection
23 comments 10 months ago
9 comments 24 days ago
Recent Posts
Out-of-Core Dataframes in Python: Dask and OpenStreetMap (https://jakevdp.github.io/blog/2015/08/14/out-ofcore-dataframes-in-python/)
Frequentism and Bayesianism V: Model Selection (https://jakevdp.github.io/blog/2015/08/07/frequentism-andbayesianism-5-model-selection/)
Learning Seattle's Work Habits from Bicycle Counts (Updated!)
(https://jakevdp.github.io/blog/2015/07/23/learning-seattles-work-habits-from-bicycle-counts/)
The Model Complexity Myth (https://jakevdp.github.io/blog/2015/07/06/model-complexity-myth/)
Fast Lomb-Scargle Periodograms in Python (https://jakevdp.github.io/blog/2015/06/13/lomb-scargle-in-python/)
Follow @jakevdp
6,395 followers
Copyright 2012-2015 - Jake Vanderplas - Powered by Pelican (http://getpelican.com)

Frequentism and Bayesianism - A Practical Introduction

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Frequentism and Bayesianism - A Practical Introduction

Загружено:

Авторское право:

Доступные форматы

Pythonic Perambulations

Frequentism and Bayesianism: A

differ/) Part III (http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/) Part

Frequentism vs. Bayesianism: a Philosophical Debate

of a star with some probability

: that probability can certainly be

Frequentist and Bayesian Approaches in Practice: Counting

The Problem: Simple Photon Counts

(we'll also ignore

measurement reports the observed photon flux Fi and error ei .

The question is, given this set of measurements D

, what is our best estimate of the true flux Ftrue ?

measurement given its observed value)

(http://en.wikipedia.org/wiki/Poisson_distribution) using the standard square-root rule. In this toy example we

estimate of the true flux?

Frequentist Approach to Simple Photon Counts

given our assumption of

This should be read "the probability of

with mean Ftrue and standard deviation ei .

What we'd like to do is determine

maximization can be computed analytically (i.e. by setting d log /d Ftrue

). This results in the following observed

Bayesian Approach to Simple Photon Counts

. Nevertheless, within the Bayesian philosophy this is

: The data probability, which in practice amounts to simply a normalization term.

If we set the prior P(Ftrue )

(a flat prior), we find

But What About the Prior?

You'll noticed that I glossed over something here: the prior,

. The prior allows inclusion of other information

Photon Counts: the Bayesian approach

: this is the distribution reflecting our knowledge of the

In[4]: def log_prior(theta):

In[5]: ndim = 1 # number of parameters in the model

Adding a Dimension: Exploring a more sophisticated model

where is the mean value, and

mu_true, sigma_true = 1000, 15 # stochastic flux model

Varying Photon Counts: The Frequentist Approach

And here we have a problem: the optimal value of

depends on the optimal value of . The results are correlated,

Maximum likelihood estimate for 100 data points:

governing our model of the

to and use standard tests

(http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test) to determine confidence limits, which also depends

(http://en.wikipedia.org/wiki/Bootstrapping_(statistics)), which maximize the likelihood for randomized samples of

= {0:.0f} +/- {1:.0f}".format(mu_samp.mean(), mu_samp.std())

print " sigma = {0:.0f} +/- {1:.0f}".format(sig_samp.mean(), sig_samp.st

Varying Photon Counts: The Bayesian Approach

# start sigma between 0 and 20

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[F, e])

There are interesting arguments to be made that the Jeffreys Prior

Posted by Jake Vanderplas Mar 11, 2014

Join the discussion

> yono a year ago

astraw > yono

> astraw a year ago

There are two questions I see here:

> yankov a year ago

This blog is Pelican, not Octopress, and I use https://github.com/getpelican/...

yankov > jakevdp

Hi, great post, thanks!

> aloctavodia a year ago

Thanks for the great series of posts.