

What is the Philosophy of Statistics?



At one level of analysis at least, statisticians and philosophers
of science ask many of the same questions:

- What should be observed and what may justifiably be
inferred from the resulting data?

- How well do data confirm or fit a model?

- What is a good test?

- Must predictions be novel in some sense? (selection
effects, double counting, data mining)

- How can spurious relationships be distinguished from
genuine regularities? from causal regularities?

- How can we infer more accurate and reliable
observations from less accurate ones?

- When does a fitted model account for regularities in the
data?

That these very general questions are entwined with long-standing
debates in philosophy of science helps to explain
why the field of statistics tends to cross over so often into
philosophical territory.

Statistics → Philosophy

3 ways statistical accounts are used in philosophy of
science

(1) Model Scientific Inference to capture either the
actual or rational ways to arrive at evidence and inference
(2) Resolve Philosophical Problems about scientific
inference, observation, experiment;
(problem of induction, objectivity of observation,
reliable evidence, Duhem's problem,
underdetermination).
(3) Perform a Metamethodological Critique
-scrutinize methodological rules, e.g., accord special
weight to "novel" facts, avoid ad hoc hypotheses, avoid
"data mining", require randomization.

Philosophy → Statistics
Central job: to help resolve the conceptual, logical, and
methodological discomforts of scientists as to how to
make reliable inferences despite uncertainties and errors.

Philosophy of statistics and the goal of a philosophy of
science relevant for philosophical problems in scientific
practice

Fresh methodological problems arise in practice
surrounding a panoply of methods and models relied on
to learn from incomplete, and often non-experimental,
data.
Examples abound:
Disputes over hypothesis-testing in psychology (e.g., the
recently proposed significance test ban);
Disputes over the proper uses of regression in applied
statistics;
Disputes over dose-response curves in estimating risks;
Disputes about the use of computer simulations in
observational sciences;
Disputes about external validity in experimental
economics; and,
Across the huge landscape of fields using the latest, high-
powered, computer methods, there are disputes about
data-mining, algorithmic searches, and model validation.
Equally important are the methodological
presuppositions that are not, but perhaps ought to be,
disputed, debated, or at least laid out in the open;
often, ironically, in the very fields in which philosophers
of science immerse themselves.


I used to teach a course in this department: philosophy of
science and economic methodology.
We read how many economic methodologists questioned
the value of philosophy of science:
"If philosophers and others within science theory can't
agree about the constitution of the scientific method (or
even whether asking about a 'scientific method' makes
any sense), doesn't it seem a little dubious for
economists to continue blithely taking things off the shelf
and attempting to apply them to economics?" (Hands,
2001, p. 6).
Deciding that it is, methodologists of economics
increasingly look to sociology of science, rhetoric,
evolutionary psychology.
The problem is not merely how this cuts philosophers of
science out of being engaged in methodological practice;
equally serious, is how it encourages practitioners to
assume there are no deep epistemological problems with
the ways they collect and base inferences on data.
Professional agreement on statistical philosophy is not
on the immediate horizon, but this should not stop us
from agreeing on methodology, as if what is correct
methodologically does not depend on what is correct
philosophically (Berger, 2003, p. 2).
In addition to the resurgence of the age-old
controversies (significance tests vs. confidence
intervals, frequentist vs. Bayesian measures), the
latest statistical modeling techniques have introduced
brand new methodological issues.
High-powered computer science packages offer a
welter of algorithms for automatically selecting among
this explosion of models, but as each boasts different,
and incompatible, selection criteria, we are thrown back
to the basic question of inductive inference: what is
required to severely discriminate among well-fitting
models, such that when a claim (or hypothesis or model)
survives a test, the resulting data count as good evidence
for the claim's correctness or dependability or adequacy?

A romp through 4 "waves in philosophy of statistics"


History and philosophy of statistics is a huge territory
marked by 70 years of debates widely known for reaching
unusual heights both of passion and of technical
complexity.


Wave I ~ 1930-1955/60
Wave II ~ 1955/60-1980
Wave III ~ 1980-2005 & beyond
Wave IV ~ 2006 and beyond




A core question: What is the nature and role of
probabilistic concepts, methods, and models in making
inferences in the face of limited data, uncertainty and
error?

1. Two Roles For Probability:
Degrees of Confirmation and Degrees of Well-Testedness

a. To provide a post-data assignment of degree of
probability, confirmation, support or belief in a
hypothesis;
b. To assess the probativeness, reliability,
trustworthiness, or severity of a test or inference
procedure.

These two contrasting philosophies of the role of
probability in statistical inference are very much at the
heart of the central points of controversy in the three
waves of philosophy of statistics


Having conceded loss in the battle for justifying induction,
philosophers appeal to logic to capture scientific method

Inductive Logics (Carnap, C(H,e))

Confirmation Theory:
Rules to assign degrees of probability or confirmation to
hypotheses given evidence e.

Inductive Logicians:
we can build and try to justify inductive logics;
straight rule: assign degrees of confirmation/credibility.

Statistical affinity:
Bayesian (and likelihoodist) accounts.


Logic of falsification (Popper)

Methodological falsification:
Rules to decide when to prefer or accept hypotheses.

Deductive Testers:
we can reject induction and uphold the rationality of
preferring or accepting H if it is well tested.

Statistical affinity:
Fisherian, Neyman-Pearson methods: probability enters to
ensure reliability and severity of tests with these methods.




I. Philosophy of Statistics: The First Wave

WAVE I: circa 1930-1955:
Fisher, Neyman, Pearson, Savage, and Jeffreys.

Statistical inference tools use data $x_0$ to probe aspects of the
data generating source:
In statistical testing, these aspects are in terms of statistical
hypotheses about parameters governing a statistical distribution.
H tells us the probability of x under H, written $P(x; H)$
(probabilistic assignments under a model).
It is important to avoid confusion with conditional probabilities in
Bayes's theorem, $P(x \mid H)$.
Testing model assumptions is extremely important, though I will
not discuss it here.

Modern Statistics Begins with Fisher:
Simple Significance Tests

Example. Let the sample $X = (X_1, \ldots, X_n)$ be IID from a
Normal distribution (NIID) with $\sigma = 1$.

1. A null hypothesis $H_0$: $\mu = 0$,
e.g., 0 mean concentration of lead, no difference in mean
survival in a given group, in mean risk, mean deflection of
light.

2. A function of the sample, $d(X)$, the test statistic, which
reflects the difference between the data $x_0 = (x_1, \ldots, x_n)$ and $H_0$.
The larger $d(x_0)$, the further the outcome is from what is
expected under $H_0$, with respect to the particular question being
asked.

3. The p-value is the probability of a difference larger than
$d(x_0)$, under the assumption that $H_0$ is true:
$p(x_0) = P(d(X) > d(x_0); H_0)$

The observed significance level (p-value) with observed
$\bar{X} = .1$:
$p(x_0) = P(d(X) > d(x_0); H_0)$.
The relevant test statistic $d(X)$ is:
$d(X) = (\bar{X} - \mu_0)/\sigma_x$,
where $\bar{X}$ is the sample mean with standard deviation
$\sigma_x = \sigma/\sqrt{n}$. That is,
$d(X) = \dfrac{\text{Observed} - \text{Expected (under } H_0\text{)}}{\sigma_x}$.
Since $\sigma_x = \sigma/\sqrt{n} = 1/5 = .2$ (so $n = 25$), the difference
$\bar{X} - \mu_0 = .1 - 0$ in units of $\sigma_x$ yields
$d(x_0) = .1/.2 = .5$.
Under the null, $d(X)$ is distributed as standard Normal,
denoted by $d(X) \sim N(0,1)$.
(Area to the right of .5) ≈ .3, i.e. not very significant.




Logic of Simple Significance Tests: Statistical Modus
Tollens

"Every experiment may be said to exist only in order to
give the facts a chance of disproving the null hypothesis"
(Fisher, 1956, p. 160).

Statistical analogy to the deductively valid pattern modus
tollens:
If the hypothesis $H_0$ is correct then, with high
probability, $1 - p$, the data would not be statistically
significant at level $p$.
$x_0$ is statistically significant at level $p$.
____________________________
Thus, $x_0$ is evidence against $H_0$, or $x_0$ indicates the falsity of
$H_0$.

Fisher described the significance test as a procedure
for rejecting the null hypothesis and inferring that the
phenomenon has been "experimentally demonstrated"
once one is able to generate at will a statistically
significant effect (Fisher, 1935a, p. 14).

The Alternative or Non-Null Hypothesis
Evidence against $H_0$ seems to indicate evidence for
some alternative.
Fisherian significance tests strictly consider only
$H_0$.
Neyman and Pearson (N-P) tests introduce an
alternative $H_1$ (even if only to serve as a direction of
departure).
Example. $X = (X_1, \ldots, X_n)$, NIID with $\sigma = 1$:

$H_0$: $\mu = 0$ vs. $H_1$: $\mu > 0$
Despite the bitter disputes with Fisher that were to
erupt soon after ~1935, Neyman and Pearson at first saw
their work as merely placing Fisherian tests on firmer
logical footing.
Much of Fisher's hostility toward N-P methods
reflects professional and personality conflicts more than
philosophical differences.

Neyman-Pearson (N-P) Tests

An N-P hypothesis test maps each outcome $x = (x_1, \ldots, x_n)$
into either the null hypothesis $H_0$ or an alternative
hypothesis $H_1$ (where the two exhaust the parameter
space), to ensure the probabilities of erroneous rejections
(type I errors) and erroneous acceptances (type II errors)
are controlled at prespecified values, e.g., 0.05 or 0.01, the
significance level of the test.
Test $T(\alpha)$: $X = (X_1, \ldots, X_n)$, NIID with $\sigma = 1$,
$H_0$: $\mu = \mu_0$ vs. $H_1$: $\mu > \mu_0$
if $d(x_0) > c_\alpha$, "reject" $H_0$ (or declare the result
statistically significant at the $\alpha$ level);
if $d(x_0) < c_\alpha$, "accept" $H_0$,
e.g., $c_\alpha = 1.96$ for $\alpha = .025$.

"Accept/Reject" are uninterpreted parts of the mathematical
apparatus.

Type I error probability: $P(d(X) > c_\alpha; H_0) \leq \alpha$.
The Type II error probability:
$P(\text{Test } T(\alpha) \text{ does not reject } H_0; \mu = \mu_1)$
$= P(d(X) < c_\alpha; \mu = \mu_1) = \beta(\mu_1)$, for any $\mu_1 > \mu_0$.
The "best" test at level $\alpha$ at the same time minimizes the
value of $\beta$ for all $\mu_1 > \mu_0$, or equivalently, maximizes the
power:

$\mathrm{POW}(T(\alpha); \mu_1) = P(d(X) > c_\alpha; \mu_1)$

$T(\alpha)$ is a Uniformly Most Powerful (UMP) level $\alpha$ test.
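
As a rough illustration of these error probabilities, the sketch below computes the power of the one-sided Normal test. The defaults ($\mu_0 = 0$, $\sigma = 1$, $n = 25$, $c_\alpha = 1.96$) are taken from the running example, and the function names are illustrative only.

```python
import math

def normal_sf(z: float) -> float:
    """Upper-tail area P(Z > z) for a standard Normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def power(mu1: float, mu0: float = 0.0, sigma: float = 1.0,
          n: int = 25, c_alpha: float = 1.96) -> float:
    """POW(T(alpha); mu1) = P(d(X) > c_alpha; mu = mu1).

    Under mu = mu1, d(X) is Normal with mean (mu1 - mu0)/(sigma/sqrt(n))
    and unit variance, so the power is an upper-tail Normal area.
    """
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))
    return normal_sf(c_alpha - shift)

type_one = power(mu1=0.0)        # rejection probability when mu = mu0
type_two = 1.0 - power(mu1=0.2)  # beta(mu1) at the alternative mu1 = .2
print(f"type I = {type_one:.3f}, beta(.2) = {type_two:.3f}, "
      f"power(.2) = {power(0.2):.3f}")
```

Evaluating the power at $\mu_1 = \mu_0$ recovers the type I error probability, which is why a single function covers both error types.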

Inductive Behavior Philosophy

Philosophical issues and debates arise once one begins to
consider the interpretations of the formal apparatus

"Accept/Reject" are identified with deciding to take
specific actions, e.g., publishing a result, announcing a
new effect.

The justification for optimal tests is that
"it may often be proved that if we behave according to
such a rule ... we shall reject H when it is true not more,
say, than once in a hundred times, and in addition we may
have evidence that we shall reject H sufficiently often
when it is false."

Neyman: Tests are not rules of inductive inference but rules of
behavior:
The goal is not to adjust our beliefs but rather to adjust our
behavior to limited amounts of data.

Is he just drawing a stark contrast between N-P tests and
Fisherian as well as Bayesian methods? Or is the behavioral
interpretation essential to the tests?

The "inductive behavior vs. inductive inference" battle
commingles philosophical, statistical and personality
clashes.
Fisher (1955) denounced the way that Neyman and
Pearson transformed his significance tests into
"acceptance procedures".

- They've turned my tests into mechanical rules or
recipes for deciding to accept or reject statistical
hypothesis $H_0$.

- The concern has more to do with speeding up
production or making money than in learning about
phenomena.

N-P followers are like:
"Russians (who) are made familiar with the ideal
that research in pure science can and should be geared
to technological performance, in the comprehensive
organized effort of a five-year plan for the nation."
(1955, 70)

"In the U.S. also the great importance of
organized technology has I think made it easy to
confuse the process appropriate for drawing correct
conclusions, with those aimed rather at speeding
production, or saving money."

Pearson distanced himself from Neyman's
"inductive behavior" jargon, calling it "Professor
Neyman's field rather than mine."

But the most impressive mathematical results were in
the decision-theoretic framework of Neyman-Pearson-Wald.

Many of the qualifications by Neyman and Pearson
in the first wave are overlooked in the philosophy of
statistics literature.

Admittedly, these evidential practices were not
made explicit*. (Had they been, the subsequent waves of
philosophy of statistics might have looked very different.)

*Mayo's goal in ~1978

The Second Wave: ~1955/60 -1980

Post-data criticisms of N-P methods:
Ian Hacking (1965) framed the main lines of criticism by
philosophers: Neyman-Pearson tests as suitable for before-trial
betting, but not for after-trial evaluation (p. 99).
Battles: initial precision vs. final precision,
before-data vs. after-data.
After the data, he claimed, the relevant measure of support is
the (relative) likelihood.
Two data sets x and y may afford the same "support"
to H, yet warrant different inferences [on
significance test reasoning] because x and y arose
from tests with different error probabilities.

- This is just what error statisticians want!

- But (at least early on) Hacking (1965) held to the

Law of Likelihood: $x_0$ supports hypothesis $H_1$ more
than $H_2$ if

$P(x_0; H_1) > P(x_0; H_2)$.

Yet, as Barnard notes, there always is such a rival
hypothesis: that things just had to turn out the way they
actually did.
Since such a maximally likely alternative $H_2$ can
always be constructed, $H_1$ may always be found less well
supported, even if $H_1$ is true: no error control.
Hacking soon rejected the likelihood approach on such
grounds; likelihoodist accounts are advocated by others.
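
Barnard's point can be seen numerically in the Normal example. In the sketch below (illustrative names, with $\sigma = 1$ and $n = 25$ assumed from the earlier slides), the hypothesis that sets $\mu$ equal to the observed mean is never less likely than any fixed $H_1$, whatever the data turn out to be.

```python
import math

def mean_likelihood(xbar: float, mu: float,
                    sigma: float = 1.0, n: int = 25) -> float:
    """Density of the sample mean at xbar when the true mean is mu."""
    se = sigma / math.sqrt(n)
    z = (xbar - mu) / se
    return math.exp(-0.5 * z * z) / (se * math.sqrt(2.0 * math.pi))

def likelihood_ratio(xbar: float, mu_a: float, mu_b: float) -> float:
    """Likelihood of mu_a relative to mu_b given the observed mean xbar."""
    return mean_likelihood(xbar, mu_a) / mean_likelihood(xbar, mu_b)

# H1: mu = 0 (taken to be true here); H2 is built after the fact as mu = xbar.
for xbar in (0.1, -0.25, 0.39):
    ratio = likelihood_ratio(xbar, mu_a=xbar, mu_b=0.0)
    print(f"xbar = {xbar:+.2f}: L(H2)/L(H1) = {ratio:.3f}")
```

Every ratio is at least 1, so on the Law of Likelihood alone the data always "support" the constructed $H_2$ at least as well as $H_1$.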

Perhaps THE key issue of controversy in the
philosophy of statistics battles

The (strong) likelihood principle: likelihoods suffice to
convey "all that the data have to say".

"According to Bayes's theorem, $P(x \mid \mu)$ ... constitutes
the entire evidence of the experiment, that is, it tells all
that the experiment has to tell. More fully and more
precisely, if y is the datum of some other experiment, and
if it happens that $P(x \mid \mu)$ and $P(y \mid \mu)$ are proportional
functions of $\mu$ (that is, constant multiples of each other),
then each of the two data x and y have exactly the same
thing to say about the values of $\mu$" (Savage 1962, p. 17).
The error probabilist needs to consider, in addition, the
sampling distribution of the likelihoods.

Significance levels and other error probabilities all
violate the likelihood principle (Savage 1962).

Paradox of Optional Stopping

Instead of fixing the sample size n in advance, in some tests n is
determined by a stopping rule:
In Normal testing, 2-sided $H_0$: $\mu = 0$ vs. $H_1$: $\mu \neq 0$

Keep sampling until $H_0$ is rejected at the .05 level

(i.e., keep sampling until $|\bar{X}| > 1.96\,\sigma/\sqrt{n}$).

Nominal vs. actual significance levels: with n fixed the type I
error probability is .05.
With this stopping rule the actual significance level differs
from, and will be greater than, .05.

By contrast, since likelihoods are unaffected by the stopping
rule, the LP follower denies there really is an evidential
difference between the two cases (i.e., n fixed and n determined
by the stopping rule).

Should it matter if I decided to toss the coin 100 times and
happened to get 60% heads, or if I decided to keep tossing until
I could reject at the .05 level (2-sided) and this happened to
occur on trial 100?
Should it matter if I kept going until I found statistical
significance?

Error statistical principles: Yes! penalty for perseverance!
The LP says NO!
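
The gap between nominal and actual significance levels under this stopping rule can be seen in a small simulation. This is a sketch under stated assumptions: sampling is capped at a hypothetical maximum of 100 observations, $H_0$ is true throughout, and the trial count and seed are arbitrary.

```python
import math
import random

def stops_with_rejection(rng: random.Random, max_n: int,
                         sigma: float = 1.0, z_cut: float = 1.96) -> bool:
    """Draw from N(0, sigma^2), so H0: mu = 0 is true, and check after each
    observation whether |mean| > z_cut * sigma / sqrt(n).
    Return True if the nominal .05 rejection is reached by max_n."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, sigma)
        if abs(total / n) > z_cut * sigma / math.sqrt(n):
            return True
    return False

def actual_type1_rate(max_n: int, trials: int = 2000, seed: int = 0) -> float:
    """Relative frequency of rejecting a true H0 when trying up to max_n times."""
    rng = random.Random(seed)
    hits = sum(stops_with_rejection(rng, max_n) for _ in range(trials))
    return hits / trials

print("fixed n (one look):      ", actual_type1_rate(max_n=1, trials=4000))
print("try and try again to 100:", actual_type1_rate(max_n=100))
```

With a single look the rejection rate stays near the nominal .05; allowing up to 100 looks pushes it far higher, which is the "penalty for perseverance" the error statistician wants registered.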

Savage Forum 1959: Savage audaciously declares that
the lesson to draw from the optional stopping effect is that
"optional stopping is no sin", so the problem must lie with
the use of significance levels. But why accept the
likelihood principle (LP)? (simplicity and freedom?)

"The likelihood principle emphasized in Bayesian statistics
implies that the rules governing when data collection stops
are irrelevant to data interpretation. It is entirely appropriate to
collect data until a point has been proved or disproved" (p.
193). "This irrelevance of stopping rules to statistical inference
restores a simplicity and freedom to experimental design that
had been lost by classical emphasis on significance levels (in
the sense of Neyman and Pearson)" (Edwards, Lindman, Savage
1963, p. 239).
For frequentists this only underscores the point raised years
before by Pearson and Neyman:
"A likelihood ratio (LR) may be a criterion of relative fit
but it is still necessary to determine its sampling distribution
in order to control the error involved in rejecting a true
hypothesis, because a knowledge of [LR] alone is not adequate
to insure control of this error" (Pearson and Neyman, 1930, p.
106).

The key difference: likelihood fixes the actual outcome,
i.e., just d(x), while error statistics considers outcomes other
than the one observed in order to assess the error properties.
The LP implies irrelevance of, and no control over, error
probabilities.
("why you cannot be just a little bit Bayesian", EGEK
1996)
Update: A famous argument (Birnbaum, 1962)
purports to show that plausible error statistical principles
entail the LP!
"Radical!" "Breakthrough!" (since the LP entails the
irrelevance of error probabilities!)
But the "proof" is flawed! (Mayo 2010; see blog.)
The Statistical Significance Test Controversy
(Morrison and Henkel, 1970): contributors chastise social
scientists for slavish use of significance tests
- Focus on simple Fisherian significance tests
- Philosophers direct criticisms mostly to N-P tests.

Fallacies of Rejection: Statistical vs. Substantive Significance
(i) Take statistical significance as evidence of
substantive theory that explains the effect
(ii) Infer a discrepancy from the null beyond what the test
warrants

(i) Paul Meehl: It is fallacious to go from a statistically
significant result, e.g., at the .001 level, to infer that "one's
substantive theory T, which entails the [statistical] alternative
$H_1$, has received ... quantitative support of magnitude around
.999".

A statistically significant difference (e.g., in child rearing) is
not automatically evidence for a Freudian theory.

T is subjected to only "a feeble risk", violating Popper.
Fallacies of rejection:
(i) Take statistical significance as evidence of
substantive theory that explains the effect

(ii) Infer a discrepancy from the null beyond what the
test warrants.
Finding a statistically significant effect, $d(x_0) > c_\alpha$ (cut-off
for rejection), need not be indicative of large or
meaningful effect sizes: the test may be too sensitive.

Large n Problem: an $\alpha$-significant rejection of $H_0$ can be
very probable, even with a substantively trivial discrepancy
from $H_0$.
This is often taken as a criticism because it is assumed that
statistical significance at a given level is more evidence
against the null the larger the sample size (n): a fallacy!
"The thesis implicit in the [NP] approach [is] that a hypothesis
may be rejected with increasing confidence or reasonableness
as the power of the test increases" (Howson and Urbach 1989
and later editions).
In fact, it is indicative of less of a discrepancy from the null
than if it resulted from a smaller sample size.

(analogy with smoke detector: an alarm from one that often
goes off from merely burnt toast (overly powerful or sensitive),
vs. alarm from one that rarely goes off unless the house is
ablaze)

Comes also in the form of the "Jeffreys-Good-Lindley"
paradox:
Even a highly statistically significant result can, with n
sufficiently large, correspond to a high posterior probability to
a null hypothesis.
Fallacy of Non-Statistically Significant Results

Test $T(\alpha)$ fails to reject the null when the test statistic
fails to reach the cut-off point for rejection, i.e., $d(x_0) \leq c_\alpha$.

A classic fallacy is to construe such a negative result as
evidence FOR the correctness of the null hypothesis (common
in risk assessment contexts).
"No evidence against" is not "evidence for".
Merely surviving the statistical test is too easy, occurs too
frequently, even when the null is false.

The fallacy results from tests lacking sufficient sensitivity or
power.
The Power Analytic Movement of the 60s in psychology
Jacob Cohen: By considering ahead of time the power of
the test, select a test capable of detecting discrepancies of
interest.
- pre-data use of power (for planning).


A multitude of tables were supplied (Cohen, 1988), but
until his death he bemoaned their all-too-rare use.

(Power is a feature of N-P tests, but apparently the
prevalence of Fisherian tests in the social sciences, coupled,
perhaps, with the difficulty in calculating power, resulted in
ignoring power. There was also the fact that they were not able
to get decent power in psychology; they turned to meta-
analysis)

Post-data use of power to avoid fallacies of insensitive tests
If there's a low probability of a statistically significant
result, even if a non-trivial discrepancy $\delta_{\text{non-trivial}}$ is present (low
power against $\delta_{\text{non-trivial}}$), then a non-significant difference is not
good evidence that a non-trivial discrepancy is absent.
Still too coarse: power is always calculated relative to the cut-off
point $c_\alpha$ for rejecting $H_0$.
Consider test $T(\alpha = .025)$, $\sigma = 1$, $n = 25$, and let
$\delta_{\text{non-trivial}} = .2$.
No matter what the non-significant outcome, power to detect
$\delta_{\text{non-trivial}}$ is only .168!
So we'd have to deny the data were good evidence that $\mu < .2$.
This suggested to me (in writing my dissertation around
1978) that rather than calculating
(1) $P(d(X) > c_\alpha; \mu = .2)$ [power]
one should calculate
(2) $P(d(X) > d(x_0); \mu = .2)$ [observed power (severity)]

Even if (1) is low, (2) may be high. We return to this in
the developments of Wave III.


III. The Third Wave: Relativism, Reformulations,
Reconciliations ~1980-2005+


(skip) Rational Reconstruction and Relativism in
Philosophy of Science
Fighting Kuhnian battles over the very idea of a unified method of
scientific inference, statistical inference became less prominent in
philosophy.

It was largely used in rational reconstructions of scientific episodes,
in appraising methodological rules,
in classic philosophical problems, e.g., Duhem's
problem: reconstruct a given assignment of blame so as to
be warranted by Bayesian probability assignments.
No normative force.

With the recognition that science involves subjective judgments and
values, reconstructions often appeal to a subjective Bayesian
account (Salmon's "Tom Kuhn Meets Tom Bayes").
(Kuhn thought this was confused: no reason to suppose an
algorithm remains through theory change.)

Naturalisms, HPS


Wave III in Scientific Practice

Statisticians turn to eclecticism.

Non-statistician practitioners (e.g., in psychology,
ecology, medicine) bemoan "unholy hybrids":
a mixture of ideas from N-P methods, Fisherian tests, and
Bayesian accounts that is inconsistent from both perspectives
and burdened with conceptual confusion (Gigerenzer, 1993,
p. 323).

- Faced with foundational questions, non-statistician
practitioners raise anew the questions from the first and
second waves.

- Finding the automaticity and fallacies still rampant, most,
if they are not calling for an outright "ban" on significance
tests in research, insist on reforms and reformulations of
statistical tests.
Task Force to consider Test Ban in Psychology: 1990s

Reforms and Reinterpretations Within Error Probability
Statistics
Any adequate reformulation must:
(i) Show how to avoid classic fallacies (of acceptance
and of rejection) on principled grounds,

(ii) Show that it provides an account of inductive
inference
Avoiding Fallacies


To quickly note my own recommendation (for test $T(\alpha)$):
Move away from the coarse accept/reject rule; use the specific result
(significant or insignificant) to infer those discrepancies from
the null that are well ruled out, and those which are not.
E.g., interpretation of non-significant results:

If d(x) is not statistically significant, and the
test had a very high probability of a more
statistically significant difference if $\mu > \mu_0 + \gamma$,
then d(x) is good grounds for inferring $\mu \leq \mu_0 + \gamma$.

Use the specific outcome to infer an upper bound
$\mu \leq \mu^*$ (values beyond are ruled out by given
severity).

If d(x) is not statistically significant, but the test
had a very low probability of a more
statistically significant difference if $\mu > \mu_0 + \gamma$,
then d(x) is poor evidence for inferring $\mu \leq \mu_0 + \gamma$.
The test had too little probative power to have
detected such discrepancies even if they existed!

Takes us back to the post-data version of power:

Rather than construe "a miss as good as a mile", parity of
logic suggests that the post-data power assessment should
replace the usual calculation of power against $\mu_1$:

$\mathrm{POW}(T(\alpha), \mu_1) = P(d(X) > c_\alpha; \mu = \mu_1)$,

with what might be called the power actually attained or, to
have a distinct term, the severity (SEV):

$\mathrm{SEV}(T(\alpha), \mu_1) = P(d(X) > d(x_0); \mu = \mu_1)$,

where $d(x_0)$ is the observed (non-statistically significant)
result.





Figure 1 compares power and severity for different
outcomes


Figure 1. $\mathrm{POW}(T(.025), \mu_1 = .2) = .168$, irrespective of the value
of $d(x_0)$; solid curve: the severity evaluations are data-specific.
The severity for the inference $\mu < .2$:

Both $\bar{X} = .39$ and $\bar{X} = -.2$ fail to reject $H_0$, but
with $\bar{X} = .39$, SEV($\mu < .2$) is low (.17);
with $\bar{X} = -.2$, SEV($\mu < .2$) is high (.97).
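
The contrast between the fixed power and the data-specific severity can be sketched as follows, assuming $\sigma = 1$, $n = 25$, $\mu_0 = 0$ as in the running example (function names are illustrative).

```python
import math

def normal_sf(z: float) -> float:
    """Upper-tail area P(Z > z) for a standard Normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def severity_upper(xbar: float, mu1: float,
                   sigma: float = 1.0, n: int = 25) -> float:
    """SEV(mu < mu1) = P(d(X) > d(x0); mu = mu1) for a non-significant
    result: the chance of a larger observed difference were mu equal to mu1."""
    se = sigma / math.sqrt(n)
    return normal_sf((xbar - mu1) / se)

for xbar in (0.39, -0.2):
    sev = severity_upper(xbar, mu1=0.2)
    print(f"xbar = {xbar:+.2f}: SEV(mu < .2) = {sev:.3f}")
```

The two outcomes give severities close to the low and high values quoted in the figure, even though the power against $\mu_1 = .2$ is the same for both.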



Fallacies of Rejection: The Large n-Problem

While with a nonsignificant result the concern is erroneously
inferring that a discrepancy from $\mu_0$ is absent,
with a significant result $x_0$ the concern is erroneously inferring
that it is present.

Utilizing the severity assessment, an $\alpha$-significant
difference with $n_1$ passes $\mu > \mu_1$ less severely than with $n_2$, where
$n_1 > n_2$.

Figure 2 compares test $T(\alpha)$ with three different sample
sizes:
n = 25, n = 100, n = 400, denoted by $T(\alpha, n)$;
where in each case $d(x_0) = 1.96$ (reject at the cut-off
point).

In this way we solve the problems of tests too sensitive or not
sensitive enough, but there's one more thing ... showing how it
supplies an account of inductive inference.

Many argue in wave III that error statistical methods cannot
supply an account of inductive inference because error
probabilities conflict with posterior probabilities.




Figure 2. In test $T(\alpha)$ ($H_0$: $\mu \leq 0$ against $H_1$: $\mu > 0$, and $\sigma = 1$),
$\alpha = .025$, $c_\alpha = 1.96$ and $d(x_0) = 1.96$.

The severity for the inference $\mu > .1$:
n = 25, SEV($\mu > .1$) is .93
n = 100, SEV($\mu > .1$) is .83
n = 400, SEV($\mu > .1$) is .5
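
The severities listed for Figure 2 can be reproduced with a short sketch. It assumes that, for a significant result, SEV($\mu > \mu_1$) is computed as $P(d(X) \leq d(x_0); \mu = \mu_1)$; the slides do not spell this formula out, but it matches the three tabulated values. Function names are illustrative.

```python
import math

def normal_cdf(z: float) -> float:
    """Lower-tail area P(Z <= z) for a standard Normal Z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def severity_lower(d_obs: float, mu1: float, n: int,
                   mu0: float = 0.0, sigma: float = 1.0) -> float:
    """Assumed form: SEV(mu > mu1) = P(d(X) <= d(x0); mu = mu1)."""
    se = sigma / math.sqrt(n)
    xbar = mu0 + d_obs * se  # observed mean implied by d(x0)
    return normal_cdf((xbar - mu1) / se)

for n in (25, 100, 400):
    sev = severity_lower(d_obs=1.96, mu1=0.1, n=n)
    print(f"n = {n:3d}: SEV(mu > .1) = {sev:.2f}")
```

The same just-significant $d(x_0) = 1.96$ corresponds to a smaller observed mean as n grows, so the inference $\mu > .1$ passes with lower severity.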


P-values vs. Bayesian Posteriors
A statistically significant difference from $H_0$ can correspond
to large posteriors in $H_0$. From the Bayesian perspective, it
follows that p-values come up short as a measure of inductive
evidence;
- the significance testers balk at the recommended priors
resulting in highly significant results being construed as no
evidence against the null, or even evidence for it!
The conflict often considers the two-sided $T(2\alpha)$ test
$H_0$: $\mu = 0$ vs. $H_1$: $\mu \neq 0$.
(The differences between p-values and posteriors are far
less marked with one-sided tests.)

Assuming a prior of .5 to $H_0$, with n = 50 one can classically
"reject $H_0$ at significance level p = .05", although $P(H_0 \mid x) = .52$
(which would actually indicate that the evidence favors $H_0$).

This is taken as a criticism of p-values only because it is
assumed the .52 posterior is the appropriate measure of the
beliefworthiness.

As the sample size increases, the conflict becomes
more noteworthy.

If n = 1000, a result statistically significant at the
.05 level leads to a posterior to the null of .82!
SEV($H_1$) = .95 while the corresponding posterior has gone
from .5 to .82. What warrants such a prior?


Posterior $P(H_0 \mid x)$ by sample size n (prior of .5 to $H_0$)
______________________________________________________
p      t       n=10   n=20   n=50   n=100  n=1000

.10    1.645   .47    .56    .65    .72    .89
.05    1.960   .37    .42    .52    .60    .82
.01    2.576   .14    .16    .22    .27    .53
.001   3.291   .024   .026   .034   .045   .124
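
The pattern in the table can be sketched in code under an assumption the slides do not state: that under $H_1$ the mean is given a Normal prior centered at $\mu_0$ with variance $\sigma^2$ (a Jeffreys-type choice), with prior probability .5 on $H_0$. With that assumed prior, the Bayes factor for $H_0$ is $\sqrt{1+n}\,\exp\!\big(-t^2 / (2(1 + 1/n))\big)$.

```python
import math

def posterior_null(t: float, n: int, prior_null: float = 0.5) -> float:
    """P(H0 | x) for H0: mu = mu0 vs. H1: mu != mu0.

    ASSUMPTION (not given in the slides): under H1, mu ~ N(mu0, sigma^2).
    t is the observed standardized difference d(x0)."""
    bayes_factor = math.sqrt(1.0 + n) * math.exp(-0.5 * t * t / (1.0 + 1.0 / n))
    posterior_odds = (prior_null / (1.0 - prior_null)) * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)

for t in (1.645, 1.960, 2.576, 3.291):
    row = "  ".join(f"{posterior_null(t, n):.3f}" for n in (10, 20, 50, 100, 1000))
    print(f"t = {t:.3f}: {row}")
```

Under this assumed prior the output tracks the tabulated posteriors to within about .02, and it shows the same trend: for a fixed p-value, the posterior on the null climbs as n increases.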

(1) Some claim the prior of .5 is a warranted frequentist
assignment:
$H_0$ was randomly selected from an urn in which 50% are
true
(*) Therefore $P(H_0) = p$

$H_0$ may be 0 change in extinction rates, 0 lead
concentration, etc.
What should go in the urn of hypotheses?

For the frequentist: either $H_0$ is true or false; the
probability in (*) is fallacious and results from an
unsound instantiation.

We are very interested in how false it might be, which is
what we can assess by means of a severity assessment.
(2) Subjective degree of belief assignments will not ensure
the error probability, and thus the severity, assessments we
need.

(3) Some suggest an "impartial" or "uninformative" Bayesian
prior gives .5 to $H_0$, the remaining .5 probability being spread
out over the alternative parameter space (Jeffreys).

This "spiked concentration of belief in the null" is at odds with
the prevailing view "we know all nulls are false".

The Bayesian recently co-opts "error probability" to describe a
posterior, but it is not a frequentist error probability, which is
measuring something very different.


Fisher: The Function of the p-Value Is Not Capable of
Finding Expression

Faced with conflicts between error probabilities and Bayesian
posterior probabilities, the error probabilist would conclude
that the flaw lies with the latter measure.
Fisher: Discussing a test of the hypothesis that the stars
are distributed at random, Fisher takes the low p-value (about 1
in 33,000) to "exclude at a high level of significance any theory
involving a random distribution" (Fisher, 1956, page 42).
Even if one were to imagine that $H_0$ had an extremely high
prior probability, Fisher continues, never minding "what
such a statement of probability a priori could possibly mean",
the resulting high a posteriori probability to $H_0$, he thinks, would
only show that "reluctance to accept a hypothesis strongly
contradicted by a test of significance" (ibid, page 44) ... "is
not capable of finding expression in any calculation of
probability a posteriori" (ibid, page 43).

Wave IV? 2006+ The Reference Bayesians Abandon
Coherence, the LP, and strive to match frequentist error
probabilities!
Contemporary Impersonal Bayesianism

Because of the difficulty of eliciting subjective priors, and
because of the reluctance among scientists to allow
subjective beliefs to be conflated with the information
provided by data, much current Bayesian work in practice
favors conventional "default", "uninformative", or
"reference" priors.

1. What do reference posteriors measure?

- A classic conundrum: there is no unique
noninformative prior. (Supposing there is one
leads to inconsistencies in calculating posterior
marginal probabilities).
- Any representation of ignorance or lack of
information that succeeds for one parameterization
will, under a different parameterization, entail having
knowledge.

Contemporary "reference" Bayesians seek priors that are
simply conventions to serve as weights for reference
posteriors:
- not to be considered expressions of uncertainty,
ignorance, or degree of belief.
- may not even be probabilities; flat priors may not
sum to one (improper prior). If priors are not
probabilities, what then is the interpretation of a
posterior? (a serious problem I would like to see
Bayesian philosophers tackle).


2. Priors for the same hypothesis change according to
what experiment is to be done! Bayesian incoherence.
If the prior is to represent information, why should it be
influenced by the sample space of a contemplated
experiment?

Violates the likelihood principle, the cornerstone of
Bayesian coherency.

Reference Bayesians: it is "the price of objectivity".
This seems to wreak havoc with basic Bayesian
foundations, but without the payoff of an objective,
interpretable output; even subjective Bayesians
object.


3. Reference posteriors with good frequentist
properties
Reference priors are touted as having some good
frequentist properties, at least in one-dimensional
problems.

They are deliberately designed to match frequentist error
probabilities.
If you want error probabilities, why not use techniques
that provide them directly?

Note: using conditional probability, which is part and
parcel of probability theory, as in "Bayes nets", does not
make one a Bayesian:
no priors to hypotheses.