Frequentism and Bayesianism: A Practical Introduction

This post is the first in a series; later installments:
Part II: When Results Differ (http://jakevdp.github.io/blog/2014/06/06/frequentism-and-bayesianism-2-when-results-differ/)
Part IV: Bayesian in Python (http://jakevdp.github.io/blog/2014/06/14/frequentism-and-bayesianism-4-bayesian-in-python/)
Part V: Model Selection (http://jakevdp.github.io/blog/2015/08/07/frequentism-and-bayesianism-5-model-selection/)
See also Frequentism and Bayesianism: A Python-driven Primer (http://arxiv.org/abs/1411.5018), a peer-reviewed
article partially based on this content.
One of the first things a scientist hears about statistics is that there are two different approaches: frequentism and
Bayesianism. Despite their importance, many scientific researchers never have the opportunity to learn the distinctions
between them and the different practical approaches that result. The purpose of this post is to synthesize the
philosophical and pragmatic aspects of the frequentist and Bayesian approaches, so that scientists like myself
might be better prepared to understand the types of data analysis people do.
I'll start by addressing the philosophical distinctions between the views, and from there move to discussion of how
these ideas are applied in practice, with some Python code snippets demonstrating the difference between the
approaches.
Frequentism vs. Bayesianism: a Philosophical Debate
For frequentists, probability only has meaning in terms of a limiting case of repeated measurements. That is, if I
measure the photon flux F from a given star (we'll assume for now that the star's flux does not vary with time), then
measure it again, then again, and so on, each time I will get a slightly different answer due to the statistical error of
my measuring device. In the limit of a large number of measurements, the frequency of any given value indicates the
probability of measuring that value. For frequentists probabilities are fundamentally related to frequencies of
events. This means, for example, that in a strict frequentist view, it is meaningless to talk about the probability of
the true flux of the star: the true flux is (by definition) a single fixed value, and to talk about a frequency distribution
for a fixed value is nonsense.
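As a concrete (if toy) illustration of this limiting-frequency idea, not part of the original argument: we can simulate a large number of repeated measurements and check that the empirical frequency of landing in any range of values approaches the corresponding model probability. The flux value, error, and bin below are arbitrary choices:

import numpy as np
from scipy import stats

# Simulate many repeated measurements of a fixed quantity with Gaussian error
rng = np.random.RandomState(0)
true_flux, error = 1000, 10
measurements = true_flux + error * rng.randn(1000000)

# Empirical frequency of falling in a given range vs. the model probability
in_range = (measurements >= 995) & (measurements < 1005)
print(in_range.mean())  # empirical frequency; ~0.38
print(stats.norm(true_flux, error).cdf(1005)
      - stats.norm(true_flux, error).cdf(995))  # model probability; ~0.383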
For Bayesians, the concept of probability is extended to cover degrees of certainty about statements. Say a
Bayesian claims to measure the flux F of a star with some probability P(F): that probability can certainly be
estimated from frequencies in the limit of a large number of repeated experiments, but this is not fundamental. The
probability is a statement of my knowledge of what the measurement result will be. For Bayesians, probabilities
are fundamentally related to our own knowledge about an event. This means, for example, that in a Bayesian
view, we can meaningfully talk about the probability that the true flux of a star lies in a given range. That probability
codifies our knowledge of the value based on prior information and/or available data.
The surprising thing is that this arguably subtle difference in philosophy leads, in practice, to vastly different
approaches to the statistical analysis of data. Below I will give a few practical examples of the differences in
approach, along with associated Python code to demonstrate the practical aspects of the resulting methods.
The Problem: Simple Photon Counts

Imagine that we point our telescope at the sky and observe the light coming from a single star. For simplicity, we'll
assume that the star's true flux is constant with time, i.e. that it has a fixed value Ftrue (we'll also ignore
effects like sky noise and other sources of systematic error). We'll assume that we perform a series of N
measurements with our telescope, where the i-th measurement reports an observed flux Fi and error ei. The
question is: given this set of measurements D = {Fi, ei}, what is our best estimate of the true flux Ftrue?

(Gratuitous aside on measurement errors: We'll make the reasonable assumption that errors are Gaussian. In a frequentist perspective,
ei is the standard deviation of the results of a single measurement event in the limit of repetitions of that event. In the Bayesian
perspective, ei is the standard deviation of the (Gaussian) probability distribution describing our knowledge of that particular
measurement.)
Here we'll use Python to generate some toy data to demonstrate the two approaches to the problem. Because the
measurements are number counts, a Poisson distribution is a good approximation to the measurement process:
In[1]: # Generating some simple photon count data
import numpy as np
from scipy import stats
np.random.seed(1) # for repeatability
F_true = 1000 # true flux, say number of photons measured in 1 second
N = 50 # number of measurements
F = stats.poisson(F_true).rvs(N) # N measurements of the flux
e = np.sqrt(F) # errors on Poisson counts estimated via square root
Now let's make a simple visualization of the "measured" data:
In[2]: %matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.errorbar(F, np.arange(N), xerr=e, fmt='ok', ecolor='gray', alpha=0.5)
ax.vlines([F_true], 0, N, linewidth=5, alpha=0.2)
ax.set_xlabel("Flux"); ax.set_ylabel("measurement number");
These measurements each have a different error ei, which is estimated from Poisson statistics using the
square-root rule. In this toy example we already know the true flux Ftrue, but the question is this: given our
measurements and errors, what is our best estimate of the true flux?
Frequentist Approach to Simple Photon Counts

We'll start with the classical frequentist maximum likelihood approach. Given a single observation Di = (Fi, ei),
we can compute the probability distribution of the measurement given the true flux Ftrue, given our assumption of
Gaussian errors:

$$P(D_i~|~F_{\rm true}) = \frac{1}{\sqrt{2\pi e_i^2}} \exp\left[\frac{-(F_i - F_{\rm true})^2}{2 e_i^2}\right]$$

This should be read "the probability of Di given Ftrue equals ...". You should recognize this as a normal distribution
with mean Ftrue and standard deviation ei. We construct the likelihood function by computing the product of the
probabilities for each data point:

$$\mathcal{L}(D~|~F_{\rm true}) = \prod_{i=1}^{N} P(D_i~|~F_{\rm true})$$
Here D = {Di} represents the entire set of measurements. Because the value of the likelihood can become very
small, it is often more convenient to instead compute the log-likelihood. Combining the previous two equations and
computing the log, we have
$$\log\mathcal{L} = -\frac{1}{2} \sum_{i=1}^{N} \left[ \log(2\pi e_i^2) + \frac{(F_i - F_{\rm true})^2}{e_i^2} \right]$$

What we'd like to do is determine the value of Ftrue such that the likelihood is maximized. For this simple problem, the
maximization can be computed analytically (i.e. by setting d log L / d Ftrue = 0), which yields the following
estimate of Ftrue:
$$F_{\rm est} = \frac{\sum w_i F_i}{\sum w_i};~~~ w_i = 1/e_i^2$$
Notice that in the special case of all errors ei being equal, this reduces to
$$F_{\rm est} = \frac{1}{N}\sum_{i=1}^{N} F_i$$
That is, in agreement with intuition, Fest is simply the mean of the observed data when errors are equal.
We can go further and ask what the error of our estimate is. In the frequentist approach, this can be accomplished
by fitting a Gaussian approximation to the likelihood curve at maximum; in this simple case this can also be solved
analytically. It can be shown that the standard deviation of this Gaussian approximation is:
$$\sigma_{\rm est} = \left(\sum_{i=1}^{N} w_i\right)^{-1/2}$$
These results are fairly simple calculations; let's evaluate them for our toy dataset:
In[3]: w = 1. / e ** 2
print("""
F_true = {0}
F_est = {1:.0f} +/- {2:.0f} (based on {3} measurements)
""".format(F_true, (w * F).sum() / w.sum(), w.sum() ** -0.5, N))
F_true = 1000
F_est = 998 +/- 4 (based on 50 measurements)
We find that for 50 measurements of the flux, our estimate has an error of about 0.4% and is consistent with the
input value.
Bayesian Approach to Simple Photon Counts

The Bayesian approach, as you might expect, begins and ends with probabilities. It recognizes that what we
fundamentally want to compute is our knowledge of the parameters in question, i.e. in this case the probability
P(Ftrue | D). Note that this formulation of the problem is fundamentally contrary to the frequentist philosophy, which
says that probabilities have no meaning for model parameters like Ftrue; in the Bayesian view, however, this is
perfectly acceptable.
To compute this result, Bayesians next apply Bayes' Theorem (http://en.wikipedia.org/wiki/Bayes'_theorem), a
fundamental law of probability:
$$P(F_{\rm true}~|~D) = \frac{P(D~|~F_{\rm true})\,P(F_{\rm true})}{P(D)}$$
Though Bayes' theorem is where Bayesians get their name, it is not this law itself that is controversial, but the
Bayesian interpretation of probability implied by the term P(Ftrue |D) .
Let's take a look at each of the terms in this expression:

- P(Ftrue | D): The posterior, or the probability of the model parameters given the data: this is the result we want to compute.
- P(D | Ftrue): The likelihood, which is proportional to the L(D | Ftrue) in the frequentist approach, above.
- P(Ftrue): The model prior, which encodes what we knew about the model prior to the application of the data D.
- P(D): The data probability, which in practice amounts to simply a normalization term.

If we set the prior P(Ftrue) ∝ 1 (a flat prior), we find P(Ftrue | D) ∝ L(D | Ftrue),
and the Bayesian probability is maximized at precisely the same value as the frequentist result! So despite the
philosophical differences, we see that (for this simple problem at least) the Bayesian and frequentist point estimates
are equivalent.
But what about the prior, P(Ftrue)? The prior allows the inclusion of other information
into the computation, which becomes very useful in cases where multiple measurement strategies are being
combined to constrain a single model (as is the case in, e.g. cosmological parameter estimation). The necessity to
specify a prior, however, is one of the more controversial pieces of Bayesian analysis.
A frequentist will point out that the prior is problematic when no true prior information is available. Though it might
seem straightforward to use a noninformative prior like the flat prior mentioned above, there are some surprising
subtleties (http://normaldeviate.wordpress.com/2013/07/13/lost-causes-in-statistics-ii-noninformative-priors/comment-page-1/)
involved. It turns out that in many situations, a truly noninformative prior does not exist!
Frequentists point out that the subjective choice of a prior which necessarily biases your result has no place in
statistical data analysis.
A Bayesian would counter that frequentism doesn't solve this problem, but simply skirts the question. Frequentism
can often be viewed as simply a special case of the Bayesian approach for some (implicit) choice of the prior: a
Bayesian would say that it's better to make this implicit choice explicit, even if the choice might include some
subjectivity.
For the simple one-parameter problem above, the posterior can be computed directly: with the flat prior, we can
simply evaluate P(Ftrue | D) as a function of Ftrue on a grid of candidate values. But as the dimension of the model
grows, this direct approach becomes increasingly intractable. For this reason, Bayesian calculations often depend
on sampling methods such as Markov Chain Monte Carlo
(MCMC) (http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo).
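Before moving on, here is a minimal sketch of that direct grid evaluation for our one-parameter problem. This cell is an added illustration, not from the original notebook; it assumes the flat prior (so the posterior is proportional to the likelihood), and the grid bounds are chosen by hand:

# Evaluate the (unnormalized) log-posterior on a grid of F_true values
F_grid = np.linspace(950, 1050, 1000)
log_post = np.array([-0.5 * np.sum(np.log(2 * np.pi * e ** 2)
                                   + (F - F_t) ** 2 / e ** 2)
                     for F_t in F_grid])
post = np.exp(log_post - log_post.max())  # subtract max for numerical stability
post /= post.sum() * (F_grid[1] - F_grid[0])  # normalize to unit area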
I won't go into the details of the theory of MCMC here. Instead I'll show a practical example of applying an MCMC
approach using Dan Foreman-Mackey's excellent emcee (http://dan.iel.fm/emcee/current/) package. Keep in mind
here that the goal is to generate a set of points drawn from the posterior probability distribution, and to use those
points to determine the answer we seek.
To perform this MCMC, we start by defining Python functions for the prior P(Ftrue), the likelihood P(D | Ftrue), and
the posterior P(Ftrue | D), noting that none of these need be properly normalized. Our model here is one-dimensional,
but to handle multi-dimensional models we'll define the model in terms of an array of parameters θ, which in this
case is θ = [Ftrue].
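The code cells that define and run the sampler are not preserved in this extract; a minimal sketch consistent with the description above might look like the following. The flat prior, the walker count, and the burn-in length are assumed values, and the get_chain call uses the emcee 3 API:

import emcee

def log_prior(theta):
    return 0  # flat prior: log of a constant

def log_likelihood(theta, F, e):
    return -0.5 * np.sum(np.log(2 * np.pi * e ** 2)
                         + (F - theta[0]) ** 2 / e ** 2)

def log_posterior(theta, F, e):
    return log_prior(theta) + log_likelihood(theta, F, e)

ndim = 1       # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 1000   # "burn-in" period to let chains stabilize
nsteps = 2000  # number of MCMC steps to take

# start walkers at random locations between 0 and 2000
starting_guesses = 2000 * np.random.rand(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[F, e])
sampler.run_mcmc(starting_guesses, nsteps)
sample = sampler.get_chain(discard=nburn, flat=True)[:, 0]  # discard burn-in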
We end up with a sample of points drawn from the (normal) posterior distribution. The mean and standard deviation
of this posterior are the analog of the frequentist maximum likelihood estimate above:
In[7]: print("""
In[7]: print("""
F_true = {0}
F_est = {1:.0f} +/- {2:.0f} (based on {3} measurements)
""".format(F_true, np.mean(sample), np.std(sample), N))
F_true = 1000
F_est = 998 +/- 4 (based on 50 measurements)
We see that as expected for this simple problem, the Bayesian approach yields the same result as the frequentist
approach!
Discussion
Now, you might come away with the impression that the Bayesian method is unnecessarily complicated, and in this
case it certainly is. Using an Affine Invariant Markov Chain Monte Carlo Ensemble sampler to characterize a
one-dimensional normal distribution is a bit like using the Death Star to destroy a beach ball, but I did this here because
it demonstrates an approach that can scale to complicated posteriors in many, many dimensions, and can provide
nice results in more complicated situations where an analytic likelihood approach is not possible.
As a side note, you might also have noticed one little sleight of hand: at the end, we use a frequentist approach to
characterize our posterior samples! When we computed the sample mean and standard deviation above, we were
employing a distinctly frequentist technique to characterize the posterior distribution. The pure Bayesian result for a
problem like this would be to report the posterior distribution itself (i.e. its representative sample), and leave it at
that. That is, in pure Bayesianism the answer to a question is not a single number with error bars; the answer is the
posterior distribution over the model parameters!
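In that spirit, rather than quoting a mean and standard deviation we might simply plot the posterior sample itself (a quick added illustration, not from the original post):

fig, ax = plt.subplots()
ax.hist(sample, bins=50, histtype='stepfilled', alpha=0.5, density=True)
ax.set_xlabel("F_true")
ax.set_ylabel("posterior density")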
Adding a Dimension: Exploring a More Sophisticated Model

Let's briefly take a look at a more complicated situation: an object whose flux varies stochastically with time.
We'll propose a simple two-parameter Gaussian model for this object: θ = [μ, σ], where μ is the true mean flux of
the source and σ is the standard deviation of the variability intrinsic to the object. Thus our model for the
probability of the true flux at the time of each observation looks like this:

$$p(F_{\rm true}~|~\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{-(F_{\rm true} - \mu)^2}{2\sigma^2}\right]$$
Now, we'll again consider N observations each with their own error. We can generate them this way:
In[8]: np.random.seed(42) # for reproducibility
N = 100 # we'll use more samples for the more complicated model
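The rest of this code cell is cut off in the extract; a completion consistent with the model above (the particular values mu_true=1000, sigma_true=15 are an assumption, chosen to be consistent with the results quoted below) would be:

mu_true, sigma_true = 1000, 15  # assumed stochastic flux model parameters
F_true = stats.norm(mu_true, sigma_true).rvs(N)  # (unknown) true flux per observation
F = stats.poisson(F_true).rvs()  # observed flux: adds Poisson measurement noise
e = np.sqrt(F)  # root-N error estimate, as before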
The resulting likelihood must now account for both the intrinsic variability σ and the measurement error ei of each
point; the two Gaussians convolve, giving

$$\mathcal{L}(D~|~\theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi(\sigma^2 + e_i^2)}} \exp\left[\frac{-(F_i - \mu)^2}{2(\sigma^2 + e_i^2)}\right]$$
Analogously to above, we can analytically maximize this likelihood to find the best estimate for μ:

$$\mu_{\rm est} = \frac{\sum w_i F_i}{\sum w_i};~~~ w_i = \frac{1}{\sigma^2 + e_i^2}$$

Notice, though, that here the weights wi depend on σ, one of the very parameters we are trying to estimate,
so we can no longer use straightforward analytic methods to arrive at the frequentist result.
Nevertheless, we can use numerical optimization techniques to determine the maximum likelihood value. Here we'll
use the optimization routines available within SciPy's optimize
(http://docs.scipy.org/doc/scipy/reference/optimize.html) submodule:
In[9]: def log_likelihood(theta, F, e):
    return -0.5 * np.sum(np.log(2 * np.pi * (theta[1] ** 2 + e ** 2))
                         + (F - theta[0]) ** 2 / (theta[1] ** 2 + e ** 2))

# maximize likelihood <--> minimize negative likelihood
def neg_log_likelihood(theta, F, e):
    return -log_likelihood(theta, F, e)

from scipy import optimize
theta_guess = [900, 5]
theta_est = optimize.fmin(neg_log_likelihood, theta_guess, args=(F, e))
print("""
Maximum likelihood estimate for {0} data points:
mu={theta[0]:.0f}, sigma={theta[1]:.0f}
""".format(N, theta=theta_est))
Optimization terminated successfully.
Current function value: 502.839505
Iterations: 58
Function evaluations: 114
Maximum likelihood estimate for 100 data points:
This maximum likelihood value gives our best estimate of the parameters μ and σ governing our model of the
source. But this is only half the answer: we need to determine how confident we are in this answer, that is, we need
to compute the error bars on μ and σ.
There are several approaches to determining errors in a frequentist paradigm. We could, as above, fit a normal
approximation to the maximum likelihood and report the covariance matrix (here we'd have to do this numerically
rather than analytically). Alternatively, we can compute statistics like χ² and χ²/dof to characterize the quality
of the fit, or use randomized resampling approaches such as Jackknife
(http://en.wikipedia.org/wiki/Jackknife_(statistics)) or Bootstrap
(http://en.wikipedia.org/wiki/Bootstrapping_(statistics)), which maximize the likelihood for randomized samples of
the input data in order to explore the degree of certainty in the result. Here we'll take the bootstrap route.
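The bootstrap code cell itself is not preserved in this extract; a minimal plain-NumPy sketch of the idea (resample the data with replacement, re-fit the maximum likelihood for each resample, and report the spread of the fits; the number of resamplings is an assumed value) might be:

rng = np.random.RandomState(42)
n_bootstraps = 1000

def fit_sample(F_samp):
    # re-fit the maximum likelihood estimate on one resampled dataset
    return optimize.fmin(neg_log_likelihood, theta_guess,
                         args=(F_samp, np.sqrt(F_samp)), disp=False)

indices = rng.randint(0, N, (n_bootstraps, N))  # resample with replacement
samples = np.array([fit_sample(F[i]) for i in indices])

print("mu    = {0:.0f} +/- {1:.0f}".format(samples[:, 0].mean(), samples[:, 0].std()))
print("sigma = {0:.0f} +/- {1:.0f}".format(samples[:, 1].mean(), samples[:, 1].std()))

Running something along these lines produced the result quoted in the post: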
mu = 999 +/- 4
sigma = 18 +/- 5
I should note that there is a huge literature on the details of bootstrap resampling, and there are definitely some
subtleties of the approach that I am glossing over here. One obvious piece is that there is potential for errors to be
correlated or non-Gaussian, neither of which is reflected by simply finding the mean and standard deviation of each
model parameter. Nevertheless, I trust that this gives the basic idea of the frequentist approach to this problem.
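The Bayesian analysis of this two-parameter model proceeds just as before; the corresponding code cell and the resulting contour figure are not preserved in this extract. A sketch consistent with the text, reusing the two-parameter log_likelihood from In[9] (the flat prior restricted to sigma > 0 and the starting ranges are assumptions):

def log_prior(theta):
    # flat prior on theta = [mu, sigma], restricted to sigma > 0
    return 0 if theta[1] > 0 else -np.inf

def log_posterior(theta, F, e):
    return log_prior(theta) + log_likelihood(theta, F, e)

ndim, nwalkers = 2, 50
nburn, nsteps = 1000, 2000
starting_guesses = np.random.rand(nwalkers, ndim)
starting_guesses[:, 0] *= 2000  # mu: uniform in [0, 2000]
starting_guesses[:, 1] *= 20    # sigma: uniform in [0, 20]

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[F, e])
sampler.run_mcmc(starting_guesses, nsteps)
sample = sampler.get_chain(discard=nburn, flat=True)  # shape (n_samples, 2)

Plotting these samples in the (μ, σ) plane with 68% and 95% contours gives the figure described next.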
The red dot indicates ground truth (from our problem setup), and the contours indicate one and two standard
deviations (68% and 95% confidence levels). In other words, based on this analysis we are 68% confident that the
model lies within the inner contour, and 95% confident that the model lies within the outer contour.
Note here that σ = 0 is consistent with our data within two standard deviations: that is, depending on the certainty
threshold you're interested in, our data are not enough to confidently rule out the possibility of a non-varying source!
The other thing to notice is that this posterior is definitely not Gaussian: this can be seen by the lack of symmetry in
the vertical direction. That means that the Gaussian approximation used within the frequentist approach may not
reflect the true uncertainties in the result. This isn't an issue with frequentism itself (i.e. there are certainly ways to
account for non-Gaussianity within the frequentist paradigm), but the vast majority of commonly applied frequentist
techniques make the explicit or implicit assumption of Gaussianity of the distribution. Bayesian approaches
generally don't require such assumptions.
(Side note on priors: there are good arguments that a flat prior on σ subtly biases the calculation in this case: i.e. a
flat prior is not necessarily non-informative in the case of scale factors like σ, and a prior such as the Jeffreys prior
(http://en.wikipedia.org/wiki/Jeffreys_prior) would arguably be more applicable. Here I believe the Jeffreys prior is not
suitable, because σ is not a true scale factor (i.e. the Gaussian has contributions from ei as well). On this question,
I'll have to defer to others who have more expertise. Note that subtle (some would say subjective) questions like this
are among the features of Bayesian analysis that frequentists take issue with.)
Conclusion
I hope I've been able to convey through this post how philosophical differences underlying frequentism and
Bayesianism lead to fundamentally different approaches to simple problems, which nonetheless can often yield
similar or even identical results.
To summarize the differences:
- Frequentism considers probabilities to be related to frequencies of real or hypothetical events.
- Bayesianism considers probabilities to measure degrees of knowledge.
- Frequentist analyses generally proceed through use of point estimates and maximum likelihood approaches.
- Bayesian analyses generally compute the posterior either directly or through some version of MCMC sampling.
In simple problems, the two approaches can yield similar results. As data and models grow in complexity, however,
the two approaches can diverge greatly. In a followup post, I plan to show an example or two of these more
complicated situations. Stay tuned!
Update: see the followup post: Frequentism and Bayesianism II: When Results Differ
(http://jakevdp.github.io/blog/2014/06/06/frequentism-and-bayesianism-2-when-results-differ/)
This post was written entirely in the IPython notebook. You can download this notebook
(http://jakevdp.github.io/downloads/notebooks/FreqBayes.ipynb), or see a static view here
(http://nbviewer.ipython.org/url/jakevdp.github.io/downloads/notebooks/FreqBayes.ipynb).
Comments
A commenter, a year ago:

Thanks for the well-crafted article. I'm working my way through the excellent CS109 Data
Science lectures from Harvard, and as chance would have it, today I listened to "Lecture 10 -
How to think like a Bayesian" - much of the same material with a more theoretical bent.
For those who might be interested, CS109 is a very modern take (first taught fall 2013) on an
introductory data science class. It uses Python, the Ipython notebook, and video lectures to
cover a representative set of data science topics. Materials are free and available here:
http://cs109.org/
yono, a year ago:
Hi,
I am somewhat confused by what you mean by "true flux" and "error" when stating the photon
counting problem at the beginning. The emission process of photons from an object such as a
star (or say an idealized black body with constant temperature) is in itself a Poisson process, so
the number of emitted photons will vary by the square-root law without introducing any
measurement errors, i.e. even if you could capture the photons emitted in all directions from
the star with 100% efficiency, you would still get a 1/sqrt(N) variability in the results.
So when you say "true flux", do you mean exactly 1000 photons arrive at my detector during
each sample, but my detector has some measurement error (e.g. electronic shot noise), or that
only on average I get 1000 photons, but actually this number varies and that is the source of
the error?
jakevdp (Mod):
I guess that is a bit sloppy... by "true flux" I guess I mean (in a frequentist sense) the
value you would measure in a large number of trials, when statistical errors drop out.
For example, if you measured 10^12 photons after 10^9 seconds, you can be pretty
dang sure the flux is 1000 photons per second (assuming our constant flux model is
correct, of course). In a Bayesian sense, the true flux is some parameter F that
characterizes our model, and our observations are drawn according to some distribution
about that true parameter.
As you mention, the variance around that true flux could be due to statistical or
systematic errors, but for simplicity here I've just considered random errors.
Another commenter, a year ago:
I have a related question. The motivation for the likelihood function starts with
"Gratuitous aside on measurement errors: We'll make the reasonable assumption that
errors are Gaussian." (And this seems to be the statement leading to @yono's question.)
My question is: is it reasonable to state that the central limit theorem specifies that the
Gaussian likelihood function is expected here because each F_i is itself containing a
large number of observations? So, in this case, each measurement F_i is then the result
of a large number of observations. But how would one justify that? The way the problem
is phrased, I thought each F_i is a single observation from a detector (with some error
e_i).
Otherwise, my question boils down the following: why do we have a Gaussian likelihood
function here even though the generative model is Poisson?
jakevdp (Mod):

One answer: the maximum-entropy way to encode a
distribution given some measure of width is to say it's a normal distribution with
that width. So in the zero-information case, when some assumption is required,
the Gaussian is the best assumption to make.
yankov, a year ago:
Thanks for the great post. Off-topic, but what did you use to embed pieces of an IPython
notebook in an Octopress post? I couldn't find anything workable.
jakevdp (Mod), a year ago:
Got it. The theme of your blog is the default theme for octopress, but apparently
it's not specific to it.
WinVector LLC, a year ago:
Nice article. In this direction my group has been trying to help teach that you tend to need to be
familiar with both frequentist and Bayesian thought (you can't always choose one or the other:
http://www.win-vector.com/blog... ) and that Bayesianism only appears to be the more
complicated of the two ( http://www.win-vector.com/blog... ).
jakevdp (Mod), replying to aloctavodia:
Good point - the contours do contain 68%/95% of the points, and this is often
colloquially referred to as 1 and 2 standard deviations in analogy with gaussians, but
that language is a bit sloppy.
Shayne Hodge, a year ago:
I've found this series helpful, though pointedly pointing out how rusty my prob/stats are. Two
questions - (1) good printed resources - I have my old stochastic systems books (EE-centric)
and a newer econometrics book (fond of using notation I've never seen, lots of superscripted *),
but neither is particularly on point. Don't really want a textbook, but something better than
pop-sci.
Second, when you go to the 2 factor model, I want to say intuitively this is just a joint-Gaussian
distribution. Is that the case under the frequentist approach? What I don't get on the Bayesian
is not just why it's not Gaussian, but why it's not symmetrical. Our two component r.v.'s are
gaussian (right?), so if absolutely nothing else I'd expect to see left-right and top-bottom
symmetry. Seems like a complete breakdown of my intuition. (Again:
https://www.youtube.com/watch?...)
Daniel Halperin, a year ago:
So what about when the star's flux is time-varying, so there is no "one true flux"?
jakevdp (Mod):
The second problem is an example of this (the true flux varies with time in a stochastic
manner). We fit for mu and sigma which describe the mean and standard deviation of
this time-varying flux.
But it sounds like you're thinking of a situation in which the flux varies more
predictably: say, periodically with time. In this case, you just need to construct an
appropriate model that can account for the periodicity. I'm hoping to cover something
along those lines in a followup post (when I can find the time...)
A reply, a year ago:
Maybe I should come up with an example before I actually ask this question :p.
Royi, a year ago:
Hi,
Great post.
Could you add an option to download posts as PDF?
Thank You.
Shannon Quinn, a year ago:
Awesome post. Love statistics, love Python, love the intersection of both. A few comments,
mostly to sharpen my own chops on the subject.
At the beginning you defined the fundamental philosophies of frequentist vs Bayesian
statistics, in particular focusing on whether or not the approach attempts to characterize the
true value with the estimator it computes. However, my understanding is that it's the reverse of
what you said: frequentist statistics attempt to determine the true value, while Bayesian
statistics attempt to reflect your own beliefs. This post (also by Larry Wasserman, who I've had
the incredible honor of taking classes with) is in line with that.
I appreciate that you mention the subtleties involved in attempting to choose noninformative
priors. But it may also be prudent to point out that as the problem becomes more complex, it
simultaneously becomes much more difficult to choose noninformative priors. As the data
becomes higher dimensional, unless you're sampling exponentially more points (unlikely in the
real world) your data is intrinsically becoming more sparse and, therefore, becoming much
more sensitive to the choice of prior; even small variations in the prior can drastically alter the
posterior.
This isn't to stomp on Bayesian statistics or to herald the frequentist approach. I'm just trying
to see, from someone obviously knowledgeable on these topics, if my own knowledge is entirely
off base :) Thanks again!
jakevdp (Mod):
I think you misunderstood me... I didn't mean to imply that frequentists are not
concerned with true parameter values, only that they think it's nonsense to talk about
probability distributions of parameter values (by the very definition of probability!). The
final result and confidence estimate is most certainly an attempt to constrain the true
value.
As to Bayesianism reflecting "beliefs" in opposition to "truth", I think that's a case of
taking the Bayesian definition of probability and running just a little too far with it :)
And here, I'm afraid, my own bias toward Bayesianism is starting to show...