
Pythonic Perambulations (https://jakevdp.github.io/)
Musings and ramblings through the world of Python and beyond

Frequentism and Bayesianism: A Practical Introduction
Mar 11, 2014
This post is part of a 5-part series: Part I (http://jakevdp.github.io/blog/2014/03/11/frequentism-and-bayesianism-a-practical-intro/), Part II (http://jakevdp.github.io/blog/2014/06/06/frequentism-and-bayesianism-2-when-results-differ/), Part III (http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/), Part IV (http://jakevdp.github.io/blog/2014/06/14/frequentism-and-bayesianism-4-bayesian-in-python/), Part V (http://jakevdp.github.io/blog/2015/08/07/frequentism-and-bayesianism-5-model-selection/)
See also Frequentism and Bayesianism: A Python-driven Primer (http://arxiv.org/abs/1411.5018), a peer-reviewed
article partially based on this content.
One of the first things a scientist hears about statistics is that there are two different approaches: frequentism and Bayesianism. Despite their importance, many scientific researchers never have the opportunity to learn the distinctions between them and the different practical approaches that result. The purpose of this post is to synthesize the philosophical and pragmatic aspects of the frequentist and Bayesian approaches, so that scientists like myself might be better prepared to understand the types of data analysis people do.
I'll start by addressing the philosophical distinctions between the views, and from there move to discussion of how
these ideas are applied in practice, with some Python code snippets demonstrating the difference between the
approaches.

Frequentism vs. Bayesianism: a Philosophical Debate


Fundamentally, the disagreement between frequentists and Bayesians concerns the definition of probability.

For frequentists, probability only has meaning in terms of a limiting case of repeated measurements. That is, if I
measure the photon flux F from a given star (we'll assume for now that the star's flux does not vary with time), then
measure it again, then again, and so on, each time I will get a slightly different answer due to the statistical error of
my measuring device. In the limit of a large number of measurements, the frequency of any given value indicates the
probability of measuring that value. For frequentists, probabilities are fundamentally related to frequencies of
events. This means, for example, that in a strict frequentist view, it is meaningless to talk about the probability of
the true flux of the star: the true flux is (by definition) a single fixed value, and to talk about a frequency distribution
for a fixed value is nonsense.
For Bayesians, the concept of probability is extended to cover degrees of certainty about statements. Say a Bayesian claims to measure the flux $F$ of a star with some probability $P(F)$: that probability can certainly be estimated from frequencies in the limit of a large number of repeated experiments, but this is not fundamental. The probability is a statement of my knowledge of what the measurement result will be. For Bayesians, probabilities are fundamentally related to our own knowledge about an event. This means, for example, that in a Bayesian view, we can meaningfully talk about the probability that the true flux of a star lies in a given range. That probability codifies our knowledge of the value based on prior information and/or available data.
The surprising thing is that this arguably subtle difference in philosophy leads, in practice, to vastly different
approaches to the statistical analysis of data. Below I will give a few practical examples of the differences in
approach, along with associated Python code to demonstrate the practical aspects of the resulting methods.

Frequentist and Bayesian Approaches in Practice: Counting Photons
Here we'll take a look at an extremely simple problem, and compare the frequentist and Bayesian approaches to solving it. There's necessarily a bit of mathematical formalism involved, but I won't go into too much depth or discuss too many of the subtleties. If you want to go deeper, you might consider (please excuse the shameless plug) taking a look at chapters 4-5 of our textbook (http://www.amazon.com/dp/0691151687/).

The Problem: Simple Photon Counts


Imagine that we point our telescope to the sky, and observe the light coming from a single star. For the time being, we'll assume that the star's true flux is constant with time, i.e. that it has a fixed value $F_{\rm true}$ (we'll also ignore effects like sky noise and other sources of systematic error). We'll assume that we perform a series of measurements with our telescope, where the $i^{\rm th}$ measurement reports the observed photon flux $F_i$ and error $e_i$.

The question is, given this set of measurements $D = \{F_i, e_i\}$, what is our best estimate of the true flux $F_{\rm true}$?

(Gratuitous aside on measurement errors: we'll make the reasonable assumption that errors are Gaussian. In a frequentist perspective, $e_i$ is the standard deviation of the results of a single measurement event in the limit of repetitions of that event. In the Bayesian perspective, $e_i$ is the standard deviation of the (Gaussian) probability distribution describing our knowledge of that particular measurement given its observed value.)

Here we'll use Python to generate some toy data to demonstrate the two approaches to the problem. Because the
measurements are number counts, a Poisson distribution is a good approximation to the measurement process:
In[1]: # Generating some simple photon count data
import numpy as np
from scipy import stats
np.random.seed(1) # for repeatability
F_true = 1000 # true flux, say number of photons measured in 1 second
N = 50 # number of measurements
F = stats.poisson(F_true).rvs(N) # N measurements of the flux
e = np.sqrt(F) # errors on Poisson counts estimated via square root
Now let's make a simple visualization of the "measured" data:
In[2]: %matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.errorbar(F, np.arange(N), xerr=e, fmt='ok', ecolor='gray', alpha=0.5)
ax.vlines([F_true], 0, N, linewidth=5, alpha=0.2)
ax.set_xlabel("Flux"); ax.set_ylabel("measurement number");

These measurements each have a different error $e_i$, which is estimated from Poisson statistics (http://en.wikipedia.org/wiki/Poisson_distribution) using the standard square-root rule. In this toy example we already know the true flux $F_{\rm true}$, but the question is this: given our measurements and errors, what is our best estimate of the true flux?


Let's take a look at the frequentist and Bayesian approaches to solving this.

Frequentist Approach to Simple Photon Counts

We'll start with the classical frequentist maximum likelihood approach. Given a single observation $D_i = (F_i, e_i)$, we can compute the probability distribution of the measurement given the true flux $F_{\rm true}$, given our assumption of Gaussian errors:

$$P(D_i\,|\,F_{\rm true}) = \frac{1}{\sqrt{2\pi e_i^2}} \exp\left[\frac{-(F_i - F_{\rm true})^2}{2 e_i^2}\right]$$

This should be read "the probability of $D_i$ given $F_{\rm true}$ equals ...". You should recognize this as a normal distribution with mean $F_{\rm true}$ and standard deviation $e_i$.


We construct the likelihood function by computing the product of the probabilities for each data point:

$$\mathcal{L}(D\,|\,F_{\rm true}) = \prod_{i=1}^{N} P(D_i\,|\,F_{\rm true})$$

Here $D = \{D_i\}$ represents the entire set of measurements. Because the value of the likelihood can become very small, it is often more convenient to instead compute the log-likelihood. Combining the previous two equations and computing the log, we have

$$\log\mathcal{L} = -\frac{1}{2} \sum_{i=1}^{N} \left[ \log(2\pi e_i^2) + \frac{(F_i - F_{\rm true})^2}{e_i^2} \right]$$
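(A quick aside of my own, not in the original post: the point above about small likelihood values is easy to verify with the toy data already generated. The raw product of per-point probabilities is vanishingly small, while the sum of logs is perfectly manageable; double-precision floats underflow to zero below roughly 1e-308, which a few hundred data points would reach.)

# Sketch of why we work with logs; assumes F, e, F_true from the cell above.
probs = stats.norm(F_true, e).pdf(F)  # per-point values P(D_i | F_true)
print(np.prod(probs))                 # raw product: ~1e-106, nearly underflowed
print(np.sum(np.log(probs)))          # log-likelihood: a modest negative number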

What we'd like to do is determine $F_{\rm true}$ such that the likelihood is maximized. For this simple problem, the maximization can be computed analytically (i.e. by setting $d\log\mathcal{L}/dF_{\rm true} = 0$). This results in the following observed estimate of $F_{\rm true}$:

$$F_{\rm est} = \frac{\sum w_i F_i}{\sum w_i};\quad w_i = 1/e_i^2$$

Notice that in the special case of all errors $e_i$ being equal, this reduces to

$$F_{\rm est} = \frac{1}{N}\sum_{i=1}^N F_i$$

That is, in agreement with intuition, $F_{\rm est}$ is simply the mean of the observed data when errors are equal.
We can go further and ask what the error of our estimate is. In the frequentist approach, this can be accomplished by fitting a Gaussian approximation to the likelihood curve at maximum; in this simple case this can also be solved analytically. It can be shown that the standard deviation of this Gaussian approximation is:

$$\sigma_{\rm est} = \left(\sum_{i=1}^N w_i\right)^{-1/2}$$
These results are fairly simple calculations; let's evaluate them for our toy dataset:
In[3]: w = 1. / e ** 2
print("""
F_true = {0}
F_est = {1:.0f} +/- {2:.0f} (based on {3} measurements)
""".format(F_true, (w * F).sum() / w.sum(), w.sum() ** -0.5, N))

F_true = 1000
F_est = 998 +/- 4 (based on 50 measurements)

We find that for 50 measurements of the flux, our estimate has an error of about 0.4% and is consistent with the
input value.

Bayesian Approach to Simple Photon Counts


The Bayesian approach, as you might expect, begins and ends with probabilities. It recognizes that what we fundamentally want to compute is our knowledge of the parameters in question, i.e. in this case, $P(F_{\rm true}\,|\,D)$.

Note that this formulation of the problem is fundamentally contrary to the frequentist philosophy, which says that probabilities have no meaning for model parameters like $F_{\rm true}$. Nevertheless, within the Bayesian philosophy this is perfectly acceptable.
To compute this result, Bayesians next apply Bayes' Theorem (http://en.wikipedia.org/wiki/Bayes'_theorem), a fundamental law of probability:

$$P(F_{\rm true}\,|\,D) = \frac{P(D\,|\,F_{\rm true})\, P(F_{\rm true})}{P(D)}$$

Though Bayes' theorem is where Bayesians get their name, it is not this law itself that is controversial, but the Bayesian interpretation of probability implied by the term $P(F_{\rm true}\,|\,D)$.
Let's take a look at each of the terms in this expression:

$P(F_{\rm true}\,|\,D)$: The posterior, or the probability of the model parameters given the data: this is the result we want to compute.
$P(D\,|\,F_{\rm true})$: The likelihood, which is proportional to the $\mathcal{L}(D\,|\,F_{\rm true})$ in the frequentist approach, above.
$P(F_{\rm true})$: The model prior, which encodes what we knew about the model prior to the application of the data $D$.
$P(D)$: The data probability, which in practice amounts to simply a normalization term.

If we set the prior $P(F_{\rm true}) \propto 1$ (a flat prior), we find

$$P(F_{\rm true}\,|\,D) \propto \mathcal{L}(D\,|\,F_{\rm true})$$

and the Bayesian probability is maximized at precisely the same value as the frequentist result! So despite the philosophical differences, we see that (for this simple problem at least) the Bayesian and frequentist point estimates are equivalent.

But What About the Prior?

You'll notice that I glossed over something here: the prior, $P(F_{\rm true})$. The prior allows inclusion of other information into the computation, which becomes very useful in cases where multiple measurement strategies are being combined to constrain a single model (as is the case in, e.g., cosmological parameter estimation). The necessity to specify a prior, however, is one of the more controversial pieces of Bayesian analysis.
A frequentist will point out that the prior is problematic when no true prior information is available. Though it might seem straightforward to use a noninformative prior like the flat prior mentioned above, there are some surprising subtleties (http://normaldeviate.wordpress.com/2013/07/13/lost-causes-in-statistics-ii-noninformative-priors/comment-page-1/) involved. It turns out that in many situations, a truly noninformative prior does not exist! Frequentists argue that the subjective choice of a prior which necessarily biases your result has no place in statistical data analysis.
A Bayesian would counter that frequentism doesn't solve this problem, but simply skirts the question. Frequentism
can often be viewed as simply a special case of the Bayesian approach for some (implicit) choice of the prior: a
Bayesian would say that it's better to make this implicit choice explicit, even if the choice might include some
subjectivity.

Photon Counts: the Bayesian approach


Leaving these philosophical debates aside for the time being, let's address how Bayesian results are generally computed in practice. For a one-parameter problem like the one considered here, it's as simple as computing the posterior probability $P(F_{\rm true}\,|\,D)$ as a function of $F_{\rm true}$: this is the distribution reflecting our knowledge of the parameter $F_{\rm true}$. But as the dimension of the model grows, this direct approach becomes increasingly intractable. For this reason, Bayesian calculations often depend on sampling methods such as Markov Chain Monte Carlo (MCMC) (http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo).
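(Before turning to MCMC, here is a minimal sketch of that direct approach; this is my addition, not the original post's. Evaluate the unnormalized log-posterior on a grid of $F_{\rm true}$ values, exponentiate, and normalize numerically; the grid bounds below are an arbitrary choice. With a flat prior, the grid posterior peaks at the weighted mean derived in the frequentist section.)

# Direct grid evaluation of the 1-parameter posterior; assumes F, e as above.
F_grid = np.linspace(950, 1050, 1000)
log_post = np.array([-0.5 * np.sum(np.log(2 * np.pi * e ** 2)
                                   + (F - Fg) ** 2 / e ** 2)
                     for Fg in F_grid])   # flat prior only adds a constant
post = np.exp(log_post - log_post.max())  # subtract max to avoid underflow
post /= np.trapz(post, F_grid)            # normalize numerically
print(F_grid[np.argmax(post)])            # peaks near the weighted mean, ~998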
I won't go into the details of the theory of MCMC here. Instead I'll show a practical example of applying an MCMC
approach using Dan Foreman-Mackey's excellent emcee (http://dan.iel.fm/emcee/current/) package. Keep in mind
here that the goal is to generate a set of points drawn from the posterior probability distribution, and to use those
points to determine the answer we seek.
To perform this MCMC, we start by defining Python functions for the prior $P(F_{\rm true})$, the likelihood $P(D\,|\,F_{\rm true})$, and the posterior $P(F_{\rm true}\,|\,D)$, noting that none of these need be properly normalized. Our model here is one-dimensional, but to handle multi-dimensional models we'll define the model in terms of an array of parameters $\theta$, which in this case is $\theta = [F_{\rm true}]$:

In[4]: def log_prior(theta):
    return 1  # flat prior

def log_likelihood(theta, F, e):
    return -0.5 * np.sum(np.log(2 * np.pi * e ** 2)
                         + (F - theta[0]) ** 2 / e ** 2)

def log_posterior(theta, F, e):
    return log_prior(theta) + log_likelihood(theta, F, e)
Now we set up the problem, including generating some random starting guesses for the multiple chains of points.

In[5]: ndim = 1  # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 1000  # "burn-in" period to let chains stabilize
nsteps = 2000  # number of MCMC steps to take

# we'll start at random locations between 0 and 2000
starting_guesses = 2000 * np.random.rand(nwalkers, ndim)

import emcee
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[F, e])
sampler.run_mcmc(starting_guesses, nsteps)

sample = sampler.chain  # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].ravel()  # discard burn-in points
If this all worked correctly, the array sample should contain a series of 50000 points drawn from the posterior. Let's plot them and check:
In[6]: # plot a histogram of the sample
plt.hist(sample, bins=50, histtype="stepfilled", alpha=0.3, normed=True)
# plot a best-fit Gaussian
F_fit = np.linspace(975, 1025)
pdf = stats.norm(np.mean(sample), np.std(sample)).pdf(F_fit)
plt.plot(F_fit, pdf, '-k')
plt.xlabel("F"); plt.ylabel("P(F)")
Out[6]: <matplotlib.text.Text at 0x1075c7510>

We end up with a sample of points drawn from the (normal) posterior distribution. The mean and standard deviation
of this posterior are the corollary of the frequentist maximum likelihood estimate above:
In[7]: print("""
F_true = {0}
F_est = {1:.0f} +/- {2:.0f} (based on {3} measurements)
""".format(F_true, np.mean(sample), np.std(sample), N))

F_true = 1000
F_est = 998 +/- 4 (based on 50 measurements)

We see that as expected for this simple problem, the Bayesian approach yields the same result as the frequentist
approach!

Discussion
Now, you might come away with the impression that the Bayesian method is unnecessarily complicated, and in this case it certainly is. Using an Affine Invariant Markov Chain Monte Carlo Ensemble sampler to characterize a one-dimensional normal distribution is a bit like using the Death Star to destroy a beach ball, but I did this here because it demonstrates an approach that can scale to complicated posteriors in many, many dimensions, and can provide nice results in more complicated situations where an analytic likelihood approach is not possible.
As a side note, you might also have noticed one little sleight of hand: at the end, we use a frequentist approach to
characterize our posterior samples! When we computed the sample mean and standard deviation above, we were
employing a distinctly frequentist technique to characterize the posterior distribution. The pure Bayesian result for a
problem like this would be to report the posterior distribution itself (i.e. its representative sample), and leave it at
that. That is, in pure Bayesianism the answer to a question is not a single number with error bars; the answer is the
posterior distribution over the model parameters!
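(A common middle ground, sketched here as my own addition rather than part of the original post: report a credible region computed directly from the posterior samples, for example the central 68% via percentiles. This summarizes the posterior without assuming it is Gaussian.)

# Percentile-based 68% credible interval from the posterior sample.
lo, hi = np.percentile(sample, [16, 84])
print("68% credible interval: [{0:.0f}, {1:.0f}]".format(lo, hi))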

Adding a Dimension: Exploring a more sophisticated model


Let's briefly take a look at a more complicated situation, and compare the frequentist and Bayesian results yet again. Above we assumed that the star was static: now let's assume that we're looking at an object which we suspect has some stochastic variation; that is, it varies with time, but in an unpredictable way (a quasar is a good example of such an object).
We'll propose a simple 2-parameter Gaussian model for this object: $\theta = [\mu, \sigma]$, where $\mu$ is the mean value, and $\sigma$ is the standard deviation of the variability intrinsic to the object. Thus our model for the probability of the true flux at the time of each observation looks like this:

$$p(F_{\rm true}\,|\,\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{-(F_{\rm true} - \mu)^2}{2\sigma^2}\right]$$
Now, we'll again consider N observations each with their own error. We can generate them this way:
In[8]: np.random.seed(42)  # for reproducibility
N = 100  # we'll use more samples for the more complicated model
mu_true, sigma_true = 1000, 15  # stochastic flux model

F_true = stats.norm(mu_true, sigma_true).rvs(N)  # (unknown) true flux
F = stats.poisson(F_true).rvs()  # observed flux: true flux plus Poisson errors
e = np.sqrt(F)  # root-N error, as above

Varying Photon Counts: The Frequentist Approach


The resulting likelihood is the convolution of the intrinsic distribution with the error distribution, so we have

$$\mathcal{L}(D\,|\,\theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi(\sigma^2 + e_i^2)}} \exp\left[\frac{-(F_i - \mu)^2}{2(\sigma^2 + e_i^2)}\right]$$

Analogously to above, we can analytically maximize this likelihood to find the best estimate for $\mu$:

$$\mu_{\rm est} = \frac{\sum w_i F_i}{\sum w_i};\quad w_i = \frac{1}{\sigma^2 + e_i^2}$$

And here we have a problem: the optimal value of $\mu$ depends on the optimal value of $\sigma$. The results are correlated, so we can no longer use straightforward analytic methods to arrive at the frequentist result.
Nevertheless, we can use numerical optimization techniques to determine the maximum likelihood value. Here we'll use the optimization routines available within SciPy's optimize (http://docs.scipy.org/doc/scipy/reference/optimize.html) submodule:
In[9]: def log_likelihood(theta, F, e):
    return -0.5 * np.sum(np.log(2 * np.pi * (theta[1] ** 2 + e ** 2))
                         + (F - theta[0]) ** 2 / (theta[1] ** 2 + e ** 2))

# maximize likelihood <--> minimize negative likelihood
def neg_log_likelihood(theta, F, e):
    return -log_likelihood(theta, F, e)

from scipy import optimize
theta_guess = [900, 5]
theta_est = optimize.fmin(neg_log_likelihood, theta_guess, args=(F, e))
print("""
Maximum likelihood estimate for {0} data points:
mu={theta[0]:.0f}, sigma={theta[1]:.0f}
""".format(N, theta=theta_est))

Optimization terminated successfully.
         Current function value: 502.839505
         Iterations: 58
         Function evaluations: 114

Maximum likelihood estimate for 100 data points:
mu=999, sigma=19

This maximum likelihood value gives our best estimate of the parameters $\mu$ and $\sigma$ governing our model of the source. But this is only half the answer: we need to determine how confident we are in this answer, that is, we need to compute the error bars on $\mu$ and $\sigma$.
There are several approaches to determining errors in a frequentist paradigm. We could, as above, fit a normal approximation to the maximum likelihood and report the covariance matrix (here we'd have to do this numerically rather than analytically). Alternatively, we can compute statistics like $\chi^2$ and $\chi^2_{\rm dof}$ and use standard tests (http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test) to determine confidence limits, which also depends on strong assumptions about the Gaussianity of the likelihood. We might alternatively use randomized sampling approaches such as Jackknife (http://en.wikipedia.org/wiki/Jackknife_(statistics)) or Bootstrap (http://en.wikipedia.org/wiki/Bootstrapping_(statistics)), which maximize the likelihood for randomized samples of the input data in order to explore the degree of certainty in the result.
All of these would be valid techniques to use, but each comes with its own assumptions and subtleties. Here, for
simplicity, we'll use the basic bootstrap resampler found in the astroML (http://astroML.org) package:
In[10]: from astroML.resample import bootstrap

def fit_samples(sample):
    # sample is an array of size [n_bootstraps, n_samples]
    # compute the maximum likelihood for each bootstrap.
    return np.array([optimize.fmin(neg_log_likelihood, theta_guess,
                                   args=(F, np.sqrt(F)), disp=0)
                     for F in sample])

samples = bootstrap(F, 1000, fit_samples)  # 1000 bootstrap resamplings
Now in a similar manner to what we did above for the MCMC Bayesian posterior, we'll compute the sample mean
and standard deviation to determine the errors on the parameters.
In[11]: mu_samp = samples[:, 0]
sig_samp = abs(samples[:, 1])
print " mu    = {0:.0f} +/- {1:.0f}".format(mu_samp.mean(), mu_samp.std())
print " sigma = {0:.0f} +/- {1:.0f}".format(sig_samp.mean(), sig_samp.std())

mu    = 999 +/- 4
sigma = 18 +/- 5

I should note that there is a huge literature on the details of bootstrap resampling, and there are definitely some subtleties of the approach that I am glossing over here. One obvious piece is that there is potential for errors to be correlated or non-Gaussian, neither of which is reflected by simply finding the mean and standard deviation of each model parameter. Nevertheless, I trust that this gives the basic idea of the frequentist approach to this problem.
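(One such subtlety can be sidestepped cheaply; the following is my sketch, not part of the original post. Instead of a mean and standard deviation, one can report percentile-based bootstrap intervals, which make no Gaussianity assumption about the bootstrap distribution.)

# Percentile bootstrap intervals; assumes mu_samp, sig_samp from above.
mu_lo, mu_hi = np.percentile(mu_samp, [16, 84])
sig_lo, sig_hi = np.percentile(sig_samp, [16, 84])
print("mu:    68% interval = [{0:.0f}, {1:.0f}]".format(mu_lo, mu_hi))
print("sigma: 68% interval = [{0:.0f}, {1:.0f}]".format(sig_lo, sig_hi))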

Varying Photon Counts: The Bayesian Approach


The Bayesian approach to this problem is almost exactly the same as it was in the previous problem, and we can
set it up by slightly modifying the above code.
In[12]: def log_prior(theta):
    # sigma needs to be positive.
    if theta[1] <= 0:
        return -np.inf
    else:
        return 0

def log_posterior(theta, F, e):
    return log_prior(theta) + log_likelihood(theta, F, e)

# same setup as above:
ndim, nwalkers = 2, 50
nsteps, nburn = 2000, 1000

starting_guesses = np.random.rand(nwalkers, ndim)
starting_guesses[:, 0] *= 2000  # start mu between 0 and 2000
starting_guesses[:, 1] *= 20    # start sigma between 0 and 20

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[F, e])
sampler.run_mcmc(starting_guesses, nsteps)

sample = sampler.chain  # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].reshape(-1, 2)
Now that we have the samples, we'll use a convenience routine from astroML to plot the traces and the contours
representing one and two standard deviations:
In[13]: from astroML.plotting import plot_mcmc

fig = plt.figure()
ax = plot_mcmc(sample.T, fig=fig, labels=[r'$\mu$', r'$\sigma$'], colors='k')
ax[0].plot(sample[:, 0], sample[:, 1], ',k', alpha=0.1)
ax[0].plot([mu_true], [sigma_true], 'o', color='red', ms=10);

The red dot indicates ground truth (from our problem setup), and the contours indicate one and two standard
deviations (68% and 95% confidence levels). In other words, based on this analysis we are 68% confident that the
model lies within the inner contour, and 95% confident that the model lies within the outer contour.
Note here that $\sigma = 0$ is consistent with our data within two standard deviations: that is, depending on the certainty threshold you're interested in, our data are not enough to confidently rule out the possibility of a non-varying source!
The other thing to notice is that this posterior is definitely not Gaussian: this can be seen by the lack of symmetry in
the vertical direction. That means that the Gaussian approximation used within the frequentist approach may not
reflect the true uncertainties in the result. This isn't an issue with frequentism itself (i.e. there are certainly ways to
account for non-Gaussianity within the frequentist paradigm), but the vast majority of commonly applied frequentist
techniques make the explicit or implicit assumption of Gaussianity of the distribution. Bayesian approaches
generally don't require such assumptions.
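(To make that concrete, here is a short check of my own, not from the original post: the asymmetric uncertainty on $\sigma$ can be read straight off the posterior samples, with no Gaussian approximation anywhere.)

# Asymmetric credible bounds for sigma from the MCMC samples.
sigma_samples = sample[:, 1]
print("sigma 68% interval:", np.percentile(sigma_samples, [16, 84]))
print("sigma 95% interval:", np.percentile(sigma_samples, [2.5, 97.5]))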
(Side note on priors: there are good arguments that a flat prior on $\sigma$ subtly biases the calculation in this case, i.e. a flat prior is not necessarily non-informative in the case of scale factors like $\sigma$. There are interesting arguments to be made that the Jeffreys Prior (http://en.wikipedia.org/wiki/Jeffreys_prior) would be more applicable. Here I believe the Jeffreys prior is not suitable, because $\sigma$ is not a true scale factor (i.e. the Gaussian has contributions from $e_i$ as well). On this question, I'll have to defer to others who have more expertise. Note that subtle (some would say subjective) questions like this are among the features of Bayesian analysis that frequentists take issue with.)

Conclusion
I hope I've been able to convey through this post how philosophical differences underlying frequentism and
Bayesianism lead to fundamentally different approaches to simple problems, which nonetheless can often yield
similar or even identical results.
To summarize the differences:
Frequentism considers probabilities to be related to frequencies of real or hypothetical events.
Bayesianism considers probabilities to measure degrees of knowledge.
Frequentist analyses generally proceed through use of point estimates and maximum likelihood
approaches.
Bayesian analyses generally compute the posterior either directly or through some version of MCMC
sampling.

In simple problems, the two approaches can yield similar results. As data and models grow in complexity, however,
the two approaches can diverge greatly. In a followup post, I plan to show an example or two of these more
complicated situations. Stay tuned!
Update: see the followup post, Frequentism and Bayesianism II: When Results Differ (http://jakevdp.github.io/blog/2014/06/06/frequentism-and-bayesianism-2-when-results-differ/).
This post was written entirely in the IPython notebook. You can download this notebook (http://jakevdp.github.io/downloads/notebooks/FreqBayes.ipynb), or see a static view here (http://nbviewer.ipython.org/url/jakevdp.github.io/downloads/notebooks/FreqBayes.ipynb).

Posted by Jake Vanderplas Mar 11, 2014


Comments

Kjell Swedin · a year ago

Thanks for the well-crafted article. I'm working my way through the excellent CS109 Data Science lectures from Harvard, and as chance would have it, today I listened to "Lecture 10 - How to think like a Bayesian" - much of the same material with a more theoretical bent.

For those who might be interested, CS109 is a very modern take (first taught fall 2013) on an introductory data science class. It uses Python, the IPython notebook, and video lectures to cover a representative set of data science topics. Materials are free and available here: http://cs109.org/

yono · a year ago

Hi,
I am somewhat confused by what you mean by "true flux" and "error" when stating the photon counting problem at the beginning. The emission process of photons from an object such as a star (or say an idealized black body with constant temperature) is in itself a Poisson process, so the number of emitted photons will vary by the square root law without introducing any measurement errors, i.e. even if you could capture the photons emitted in all directions from the star with 100% efficiency, you would still get a 1/sqrt(N) variability in the results.
So when you say "true flux", do you mean exactly 1000 photons arrive at my detector during each sample, but my detector has some measurement error (e.g. electronic shot noise), or that only on average I get 1000 photons, but actually this number varies and that is the source of the error?

jakevdp (Mod) > yono · a year ago

I guess that is a bit sloppy... by "true flux" I guess I mean (in a frequentist sense) the value you would measure in a large number of trials, when statistical errors drop out. For example, if you measured 10^12 photons after 10^9 seconds, you can be pretty dang sure the flux is 1000 photons per second (assuming our constant flux model is correct, of course). In a Bayesian sense, the true flux is some parameter F that characterizes our model, and our observations are drawn according to some distribution about that true parameter.

As you mention, the variance around that true flux could be due to statistical or systematic errors, but for simplicity here I've just considered random errors.

astraw > yono · a year ago

I have a related question. The motivation for the likelihood function starts with "Gratuitous aside on measurement errors: We'll make the reasonable assumption that errors are Gaussian." (And this seems to be the statement leading to @yono's question.) My question is: is it reasonable to state that the central limit theorem specifies that the Gaussian likelihood function is expected here because each F_i is itself containing a large number of observations? So, in this case, each measurement F_i is then the result of a large number of observations. But how would one justify that? The way the problem is phrased, I thought each F_i is a single observation from a detector (with some error e_i).

Otherwise, my question boils down to the following: why do we have a Gaussian likelihood function here even though the generative model is Poisson?

jakevdp (Mod) > astraw · a year ago

There are two questions I see here:
1) Why should we assume Gaussian errors when we know that the errors are actually Poisson?
2) Why should we assume Gaussian errors when we have no other information?

The answer to number 1 is quite easy: in the large N limit, the Poisson distribution approaches a Gaussian. And by large N, I mean that the two become very close (to the point where an analysis like this one would not be affected) for N as small as ~10 or so.

The answer to number 2 is a bit deeper: it's possible to show via maximum entropy considerations that the least-binding assumption you can make about a distribution given some measure of width is to say it's a normal distribution with that width. So in the zero-information case, when some assumption is required, the Gaussian is the best assumption to make.
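(A quick numerical check of point 1, added as a sketch and not part of the original comment thread: compare the Poisson pmf with the matching Gaussian pdf at modest counts.)

# Maximum pointwise difference between Poisson(mu) and N(mu, sqrt(mu)).
import numpy as np
from scipy import stats
for mu in (10, 1000):
    k = np.arange(int(mu - 3 * np.sqrt(mu)), int(mu + 3 * np.sqrt(mu)) + 1)
    diff = stats.poisson(mu).pmf(k) - stats.norm(mu, np.sqrt(mu)).pdf(k)
    print(mu, np.max(np.abs(diff)))  # shrinks rapidly as mu grows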


yankov · a year ago

Thanks for the great post. Off-topic, but what did you use to embed pieces of an IPython notebook in an Octopress post? I couldn't find anything workable.

jakevdp (Mod) > yankov · a year ago

This blog is Pelican, not Octopress, and I use https://github.com/getpelican/...

yankov > jakevdp · a year ago

Got it. The theme of your blog is the default theme for Octopress, but apparently it's not specific to it.

WinVector LLC · a year ago

Nice article. In this direction my group has been trying to help teach that you tend to need to be
familiar with both frequentist and Bayesian thought (you can't always choose one or the other:
http://www.win-vector.com/blog... ) and that Bayesianism only appears to be the more
complicated of the two ( http://www.win-vector.com/blog... ).


aloctavodia · a year ago

Hi, great post, thanks!

I have a doubt. You say "the contours indicate one and two standard deviations (68% and 95% confidence levels). In other words, based on this analysis we are 68% confident that the model lies within the inner contour, and 95% confident that the model lies within the outer contour." But the 68-95-99.7 rule is only valid for Gaussian distributions. I think the contours represent the 0.68 and 0.95 percentiles, hence you are right when you say "we are 68% confident that the model lies within the inner contour, and 95% confident that the model lies within the outer contour", but it is not true that those contours represent 1 and 2 standard deviations.

jakevdp (Mod) > aloctavodia · a year ago

Good point - the contours do contain 68%/95% of the points, and this is often colloquially referred to as 1 and 2 standard deviations in analogy with Gaussians, but that language is a bit sloppy.

Shayne Hodge · a year ago

I've found this series helpful, though pointedly pointing out how rusty my prob/stats are. Two questions - (1) good printed resources - I have my old stochastic systems books (EE-centric) and a newer econometrics book (fond of using notation I've never seen, lots of superscripted *), but neither is particularly on point. Don't really want a textbook, but something better than pop-sci.

Second, when you go to the 2-factor model, I want to say intuitively this is just a joint-Gaussian distribution. Is that the case under the frequentist approach? What I don't get on the Bayesian is not just why it's not Gaussian, but why it's not symmetrical. Our two component r.v.'s are Gaussian (right?), so if absolutely nothing else I'd expect to see left-right and top-bottom symmetry. Seems like a complete breakdown of my intuition. (Again: https://www.youtube.com/watch?...)

mvuorre · a year ago

Thanks for the great series of posts.

Daniel Halperin · a year ago

So what about when the star's flux is time-varying, so there is no "one true flux"?

jakevdp (Mod) > Daniel Halperin · a year ago

The second problem is an example of this (the true flux varies with time in a stochastic manner). We fit for mu and sigma, which describe the mean and standard deviation of this time-varying flux.

But it sounds like you're thinking of a situation in which the flux varies more predictably: say, periodically with time. In this case, you just need to construct an appropriate model that can account for the periodicity. I'm hoping to cover something along those lines in a followup post (when I can find the time...)

Daniel Halperin > jakevdp · a year ago

Right. So if the model can't be compressed to one or a few "true" parameters, do frequentists just give up?

Maybe I should come up with an example before I actually ask this question :p.

Royi · a year ago

Hi,
Great post.
Could you add an option to download posts as PDF?
Thank you.

Shannon Quinn · a year ago

Awesome post. Love statistics, love Python, love the intersection of both. A few comments, mostly to sharpen my own chops on the subject.

At the beginning you defined the fundamental philosophies of frequentist vs Bayesian statistics, in particular focusing on whether or not the approach attempts to characterize the true value with the estimator it computes. However, my understanding is that it's the reverse of what you said: frequentist statistics attempt to determine the true value, while Bayesian statistics attempt to reflect your own beliefs. This post (also by Larry Wasserman, who I've had the incredible honor of taking classes with) is in line with that.

I appreciate that you mention the subtleties involved in attempting to choose noninformative priors. But it may also be prudent to point out that as the problem becomes more complex, it simultaneously becomes much more difficult to choose noninformative priors. As the data becomes higher dimensional, unless you're sampling exponentially more points (unlikely in the real world) your data is intrinsically becoming more sparse and, therefore, becoming much more sensitive to the choice of prior; even small variations in the prior can drastically alter the posterior.

This isn't to stomp on Bayesian statistics or to herald the frequentist approach. I'm just trying to see, from someone obviously knowledgeable on these topics, if my own knowledge is entirely off base :) Thanks again!

jakevdp (Mod) > Shannon Quinn · a year ago

I think you misunderstood me... I didn't mean to imply that frequentists are not concerned with true parameter values, only that they think it's nonsense to talk about probability distributions of parameter values (by the very definition of probability!). The final result and confidence estimate is most certainly an attempt to constrain the true value.

As to Bayesianism reflecting "beliefs" in opposition to "truth", I think that's a case of taking the Bayesian definition of probability and running just a little too far with it :) And here, I'm afraid, my own bias toward Bayesianism is starting to show...

Shannon Quinn > jakevdp · a year ago

Ah yes, in the case of distributions of parameter values, you are absolutely correct in that distinction.

I don't necessarily mean to say beliefs are in opposition to truth--I would agree that's running too far with it. I just mean to say that both approaches have their applications and are clearly useful under certain conditions (as your post does a great job of pointing out), but they both come with assumptions and potential pitfalls that make their applications questionable under other circumstances.

Olivier Grisel · a year ago

Great post Jake! Some comments:


> That is, in pure Bayesianism the answer to a question is not a single number with error bars; the answer is the posterior distribution over the model parameters!

You can also report a credible interval for each parameter. It might be easier to interpret by a human being than the full estimate of the posterior distribution (as you did on the plot in the second example): http://en.wikipedia.org/wiki/C....

About the bootstrap, it's a great way to estimate the confidence in your frequentist estimates without making further parametric assumptions on the data. However we should keep in mind that the bootstrap only provides an asymptotic estimate of the standard error of your estimated parameters. In practice, if you have less than 50 original data points you might run the risk of underestimating the true width of the confidence interval and therefore get misled into being overconfident in your estimates. I ran some simulations in this notebook to highlight this: http://nbviewer.ipython.org/gi...

The Bayesian approach on the other hand uses priors to inject uncertainty into your modeling. It is therefore my understanding that Bayesian statistics will rather tell you "you don't have enough data points to be reasonably confident in that hypothesis (e.g. that sigma is non-zero in your example)" while bootstrap confidence intervals can fail to reject the null hypothesis if you don't have enough data points to capture enough of the true variability.

jakevdp (Mod) > Olivier Grisel · a year ago

Thanks for the feedback and links, Olivier!


Copyright 2012-2015 - Jake Vanderplas - Powered by Pelican (http://getpelican.com)
