At one level of analysis at least, statisticians and philosophers of science ask many of the same questions:
- What should be observed and what may justifiably be inferred from the resulting data?
- How well do data confirm or fit a model?
- What is a good test?
- Must predictions be novel in some sense? (selection effects, double counting, data mining)
- How can spurious relationships be distinguished from genuine regularities? from causal regularities?
- How can we infer more accurate and reliable observations from less accurate ones?
- When does a fitted model account for regularities in the data?
That these very general questions are entwined with long standing debates in philosophy of science helps to explain why the field of statistics tends to cross over so often into philosophical territory.
Statistics → Philosophy
3 ways statistical accounts are used in philosophy of science
(1) Model scientific inference: capture either the actual or rational ways to arrive at evidence and inference.
(2) Resolve philosophical problems about scientific inference, observation, and experiment (the problem of induction, objectivity of observation, reliable evidence, Duhem's problem, underdetermination).
(3) Perform a metamethodological critique: scrutinize methodological rules, e.g., accord special weight to "novel" facts, avoid ad hoc hypotheses, avoid "data mining", require randomization.
Philosophy → Statistics
Central job: to help resolve the conceptual, logical, and methodological discomforts of scientists as to how to make reliable inferences despite uncertainties and errors.
Philosophy of statistics and the goal of a philosophy of science relevant for philosophical problems in scientific practice
Fresh methodological problems arise in practice surrounding a panoply of methods and models relied on to learn from incomplete, and often non-experimental, data. Examples abound:
- disputes over hypothesis-testing in psychology (e.g., the recently proposed significance test ban);
- disputes over the proper uses of regression in applied statistics;
- disputes over dose-response curves in estimating risks;
- disputes about the use of computer simulations in observational sciences;
- disputes about external validity in experimental economics; and,
- across the huge landscape of fields using the latest, high-powered computer methods, disputes about data-mining, algorithmic searches, and model validation.
Equally important are the methodological presuppositions that are not, but perhaps ought to be, disputed, debated, or at least laid out in the open, often, ironically, in the very fields in which philosophers of science immerse themselves.
I used to teach a course in this department: philosophy of science and economic methodology. We read how many economic methodologists questioned the value of philosophy of science: "If philosophers and others within science theory can't agree about the constitution of the scientific method (or even whether asking about a scientific method makes any sense), doesn't it seem a little dubious for economists to continue blithely taking things off the shelf and attempting to apply them to economics?" (Hands, 2001, p. 6). Deciding that it is, methodologists of economics increasingly look to sociology of science, rhetoric, and evolutionary psychology. The problem is not merely how this cuts philosophers of science out of being engaged in methodological practice; equally serious is how it encourages practitioners to assume there are no deep epistemological problems with the ways they collect and base inferences on data.

"Professional agreement on statistical philosophy is not on the immediate horizon, but this should not stop us from agreeing on methodology" (Berger, 2003, p. 2), as if what is correct methodologically does not depend on what is correct philosophically. In addition to the resurgence of the age-old controversies (significance tests vs. confidence intervals, frequentist vs. Bayesian measures), the latest statistical modeling techniques have introduced brand new methodological issues. High-powered computer science packages offer a welter of algorithms for automatically selecting among this explosion of models, but as each boasts different, and incompatible, selection criteria, we are thrown back to the basic question of inductive inference: what is required to severely discriminate among well-fitting models such that, when a claim (or hypothesis or model) survives a test, the resulting data count as good evidence for the claim's correctness or dependability or adequacy?
A romp through 4 "waves in philosophy of statistics"
History and philosophy of statistics is a huge territory marked by 70 years of debates widely known for reaching unusual heights both of passion and of technical complexity.
Wave I: ~1930–1955/60
Wave II: ~1955/60–1980
Wave III: ~1980–2005 & beyond
Wave IV: ~2006 and beyond
A core question: What is the nature and role of probabilistic concepts, methods, and models in making inferences in the face of limited data, uncertainty and error?
1. Two Roles For Probability: Degrees of Confirmation and Degrees of Well-Testedness
a. To provide a post-data assignment of degree of probability, confirmation, support, or belief in a hypothesis;
b. To assess the probativeness, reliability, trustworthiness, or severity of a test or inference procedure.
These two contrasting philosophies of the role of probability in statistical inference are very much at the heart of the central points of controversy in the three waves of philosophy of statistics.
Having conceded loss in the battle for justifying induction, philosophers appeal to logic to capture scientific method
Inductive Logics (Carnap: C(H,e)):
- Confirmation theory: rules to assign degrees of probability or confirmation to hypotheses given evidence e.
- Inductive logicians hold we can build and try to justify inductive logics; the straight rule: assign degrees of confirmation/credibility.
- Statistical affinity: Bayesian (and likelihoodist) accounts.

Logic of Falsification (Popper):
- Methodological falsification: rules to decide when to prefer or accept hypotheses.
- Deductive testers hold we can reject induction and uphold the rationality of preferring or accepting H if it is well tested.
- Statistical affinity: Fisherian and Neyman-Pearson methods; probability enters to ensure the reliability and severity of tests.
I. Philosophy of Statistics: The First Wave
WAVE I: circa 1930-1955: Fisher, Neyman, Pearson, Savage, and Jeffreys.
Statistical inference tools use data x₀ to probe aspects of the data-generating source. In statistical testing, these aspects are framed as statistical hypotheses about parameters governing a statistical distribution. H tells us the probability of x under H, written P(x; H) (a probabilistic assignment under a model). It is important to avoid confusion with conditional probabilities in Bayes's theorem, P(x|H). Testing model assumptions is extremely important, though it will not be discussed here.
Modern Statistics Begins with Fisher: Simple Significance Tests
Example. Let the sample X = (X₁, …, Xₙ) be IID from a Normal distribution (NIID) with σ = 1.
1. A null hypothesis H₀: μ = μ₀, e.g., a mean concentration of lead of 0, no difference in mean survival in a given group, in mean risk, or in mean deflection of light.
2. A function of the sample, d(X), the test statistic, which reflects the difference between the data x₀ = (x₁, …, xₙ) and H₀. The larger d(x₀), the further the outcome is from what is expected under H₀, with respect to the particular question being asked.
3. The p-value is the probability of a difference as large as or larger than d(x₀), under the assumption that H₀ is true: p(x₀) = P(d(X) ≥ d(x₀); H₀).
The observed significance level (p-value) with observed X̄ = .1:
p(x₀) = P(d(X) ≥ d(x₀); H₀).
The relevant test statistic is
d(X) = (X̄ − μ₀)/σₓ = (Observed − Expected under H₀)/σₓ,
where X̄ is the sample mean with standard deviation σₓ = σ/√n.
With n = 25, σₓ = 1/√25 = 1/5 = .2, so d(X) measures X̄ − μ₀ in units of σₓ, which yields d(x₀) = .1/.2 = .5.
Under the null, d(X) is distributed as standard Normal, denoted d(X) ~ N(0,1). The area to the right of .5 is ~.3, i.e., not very significant.
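A quick sketch of this computation in Python (assuming scipy; the n = 25, σ = 1, X̄ = .1 of the running example):

```python
# Fisher's simple significance test for the worked example:
# X NIID(mu, sigma=1), n = 25, H0: mu = 0, observed sample mean .1.
from scipy import stats
import numpy as np

sigma, n, mu_0, xbar = 1.0, 25, 0.0, 0.1
se = sigma / np.sqrt(n)              # sigma_x = 1/5 = .2
d_obs = (xbar - mu_0) / se           # d(x0) = .1/.2 = .5
p_value = 1 - stats.norm.cdf(d_obs)  # area to the right of .5 under N(0,1)
print(d_obs, round(p_value, 2))      # 0.5 0.31 -- not very significant
```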
Logic of Simple Significance Tests: Statistical Modus Tollens
"Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis" (Fisher, 1956, p. 160).
Statistical analogy to the deductively valid pattern modus tollens:

If the hypothesis H₀ is correct then, with high probability 1 − p, the data would not be statistically significant at level p.
x₀ is statistically significant at level p.
____________________________
Thus, x₀ is evidence against H₀, or x₀ indicates the falsity of H₀.
Fisher described the significance test as a procedure for rejecting the null hypothesis and inferring that the phenomenon has been "experimentally demonstrated" once one is able to generate at will a statistically significant effect (Fisher, 1935a, p. 14).
The Alternative or Non-Null Hypothesis

Evidence against H₀ seems to indicate evidence for some alternative. Fisherian significance tests strictly consider only H₀.
Neyman and Pearson (N-P) tests introduce an alternative H₁ (even if only to serve as a direction of departure). Example. X = (X₁, …, Xₙ), NIID with σ = 1:
H₀: μ = 0 vs. H₁: μ > 0. Despite the bitter disputes with Fisher that were to erupt soon after ~1935, Neyman and Pearson at first saw their work as merely placing Fisherian tests on firmer logical footing. Much of Fisher's hostility toward N-P methods reflects professional and personality conflicts more than philosophical differences.
Neyman-Pearson (N-P) Tests
An N-P hypothesis test maps each outcome x = (x₁, …, xₙ) into either the null hypothesis H₀ or an alternative hypothesis H₁ (where the two exhaust the parameter space), so as to ensure the probabilities of erroneous rejections (type I errors) and erroneous acceptances (type II errors) are controlled at prespecified values, e.g., 0.05 or 0.01, the significance level of the test.

Test T(α): X = (X₁, …, Xₙ), NIID with σ = 1, H₀: μ = 0 vs. H₁: μ > 0.
If d(x₀) ≥ cα, "reject" H₀ (or declare the result statistically significant at the α level); if d(x₀) < cα, "accept" H₀; e.g., cα = 1.96 for α = .025.
"Accept"/"Reject" are uninterpreted parts of the mathematical apparatus.
Type I error probability: P(d(X) ≥ cα; H₀) ≤ α.
Type II error probability: β(μ₁) = P(test T(α) does not reject H₀; μ = μ₁) = P(d(X) < cα; μ = μ₁), for any μ₁ > μ₀.
The "best" test at level α at the same time minimizes the value of β(μ₁) for all μ₁ > μ₀, or equivalently maximizes the power:
POW(T(α); μ₁) = P(d(X) ≥ cα; μ = μ₁).
T(α) is a Uniformly Most Powerful (UMP) level α test.
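The same running example as a sketch of the N-P apparatus (again assuming scipy; reject and power are just the rule and POW defined above):

```python
# Sketch of N-P test T(alpha): H0: mu = 0 vs H1: mu > 0, X NIID(mu, sigma=1).
from scipy import stats
import numpy as np

alpha, sigma, n = 0.025, 1.0, 25
se = sigma / np.sqrt(n)                      # sigma_x = .2
c_alpha = stats.norm.ppf(1 - alpha)          # 1.96

def reject(xbar, mu_0=0.0):
    """Accept/reject rule: reject iff d(x0) >= c_alpha."""
    return (xbar - mu_0) / se >= c_alpha

def power(mu_1):
    """POW(T(alpha); mu_1) = P(d(X) >= c_alpha; mu = mu_1)."""
    return 1 - stats.norm.cdf(c_alpha - mu_1 / se)

print(reject(0.1))           # False: d(x0) = .5 falls short of 1.96
print(round(power(0.2), 3))  # 0.168 (this number recurs in Figure 1 below)
```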
Inductive Behavior Philosophy
Philosophical issues and debates arise once one begins to consider the interpretations of the formal apparatus
Accept/Reject are identified with deciding to take specific actions, e.g., publishing a result, announcing a new effect.
The justification for optimal tests is that "it may often be proved that if we behave according to such a rule ... we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false."
Neyman: Tests are not rules of inductive inference but rules of behavior: The goal is not to adjust our beliefs but rather to adjust our behavior to limited amounts of data
Is he just drawing a stark contrast between N-P tests and Fisherian as well as Bayesian methods? Or is the behavioral interpretation essential to the tests?
The inductive behavior vs. inductive inference battle commingles philosophical, statistical, and personality clashes. Fisher (1955) denounced the way that Neyman and Pearson transformed his significance tests into "acceptance procedures".
- They've turned my tests into mechanical rules or "recipes" for deciding to accept or reject statistical hypotheses H₀;
- The concern has more to do with speeding up production or making money than with learning about phenomena.
N-P followers are like "Russians (who) are made familiar with the ideal that research in pure science can and should be geared to technological performance, in the comprehensive organized effort of a five-year plan for the nation" (1955, p. 70).
"In the U.S. also the great importance of organized technology has I think made it easy to confuse the process appropriate for drawing correct conclusions, with those aimed rather at speeding production, or saving money."
Pearson distanced himself from Neyman's "inductive behavior" jargon, calling it "Professor Neyman's field rather than mine."
But the most impressive mathematical results were in the decision-theoretic framework of Neyman-Pearson-Wald.
Many of the qualifications by Neyman and Pearson in the first wave are overlooked in the philosophy of statistics literature.
Admittedly, these evidential practices were not made explicit *. (Had they been, the subsequent waves of philosophy of statistics might have looked very different).
*Mayo's goal in ~1978.
The Second Wave: ~1955/60–1980
Post-data criticisms of N-P methods: Ian Hacking (1965) framed the main lines of criticism by philosophers: Neyman-Pearson tests are suitable "for before-trial betting, but not for after-trial evaluation" (p. 99).
Battles: initial precision vs. final precision, before-data vs. after-data.
After the data, he claimed, the relevant measure of support is the (relative) likelihood.
Two data sets x and y may afford the same "support" to H, yet warrant different inferences [on significance test reasoning] because x and y arose from tests with different error probabilities.
- This is just what error statisticians want!
- But (at least early on) Hacking (1965) held to the Law of Likelihood: x₀ supports hypothesis H₁ more than H₂ if P(x₀; H₁) > P(x₀; H₂).
Yet, as Barnard notes, there always is such a rival hypothesis: that things just had to turn out the way they actually did. Since such a maximally likely alternative H₂ can always be constructed, H₁ may always be found less well supported, even if H₁ is true: no error control. Hacking soon rejected the likelihood approach on such grounds; likelihoodist accounts are advocated by others.
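A minimal sketch of Barnard's point, using a coin-tossing example of my own (the 60-heads-in-100 case discussed below): the rival tailored to the data always wins the likelihood comparison.

```python
# Law of likelihood with a "just had to turn out this way" rival:
# H1: fair coin (theta = .5) vs H2: theta tailored to the observed data.
from scipy import stats

n, heads = 100, 60
lik_H1 = stats.binom.pmf(heads, n, 0.5)         # likelihood under H1
lik_H2 = stats.binom.pmf(heads, n, heads / n)   # maximally likely rival

# H2 fits at least as well by construction, so H1 loses the comparison
# even when H1 is true: the law of likelihood alone gives no error control.
print(lik_H1 < lik_H2)                      # True
print(round(lik_H1, 4), round(lik_H2, 4))   # ~0.0108 vs ~0.0812
```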
Perhaps THE key issue of controversy in the philosophy of statistics battles
The (strong) likelihood principle (LP): likelihoods suffice to convey all that the data have to say.
"According to Bayes's theorem, P(x|θ) ... constitutes the entire evidence of the experiment, that is, it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that P(x|θ) and P(y|θ) are proportional functions of θ (that is, constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the values of θ" (Savage 1962, p. 17). The error probabilist needs to consider, in addition, the sampling distribution of the likelihoods.

Significance levels and other error probabilities all violate the likelihood principle (Savage 1962).
Paradox of Optional Stopping
Instead of fixing the sample size n in advance, in some tests n is determined by a stopping rule. In Normal testing, with two-sided H₀: μ = 0 vs. H₁: μ ≠ 0:
Keep sampling until H₀ is rejected at the .05 level
(i.e., keep sampling until |X̄| ≥ 1.96 σ/√n).
Nominal vs. actual significance levels: with n fixed, the type I error probability is .05. With this stopping rule, the actual significance level differs from, and will be greater than, .05.
By contrast, since likelihoods are unaffected by the stopping rule, the LP follower denies there really is an evidential difference between the two cases (i.e., n fixed and n determined by the stopping rule).
Should it matter if I decided to toss the coin 100 times and happened to get 60% heads, or if I decided to keep tossing until I could reject at the .05 level (2-sided) and this happened to occur on trial 100? Should it matter if I kept going until I found statistical significance?
Error statistical principles: Yes! penalty for perseverance! The LP says NO!
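A simulation sketch of the penalty (my illustration, assuming a look after every observation up to a cap max_n):

```python
# Optional stopping: sample until |Xbar| >= 1.96*sigma/sqrt(n) (or give up
# at max_n) and count how often a TRUE null is rejected.
import numpy as np

rng = np.random.default_rng(1)
trials, max_n, rejections = 2000, 1000, 0

for _ in range(trials):
    x = rng.normal(0.0, 1.0, max_n)     # H0 true: mu = 0, sigma = 1
    n = np.arange(1, max_n + 1)
    xbar = np.cumsum(x) / n
    if np.any(np.abs(xbar) >= 1.96 / np.sqrt(n)):
        rejections += 1                 # rejected at some interim look

# Well above the nominal .05 (roughly .5 with max_n = 1000), and it grows
# toward 1 as max_n increases: try-and-try-again eventually "succeeds".
print(rejections / trials)
```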
Savage Forum 1959: Savage audaciously declares that the lesson to draw from the optional stopping effect is that optional stopping is "no sin", so the problem must lie with the use of significance levels. But why accept the likelihood principle (LP)? (Simplicity and freedom?)
"The likelihood principle emphasized in Bayesian statistics implies ... that the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proved or disproved" (p. 193). "This irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson)" (Edwards, Lindman, and Savage 1963, p. 239).

For frequentists this only underscores the point raised years before by Pearson and Neyman: a likelihood ratio (LR) may be a criterion of relative fit, but "it is still necessary to determine its sampling distribution in order to control the error involved in rejecting a true hypothesis, because a knowledge of [LR] alone is not adequate to insure control of this error" (Pearson and Neyman, 1930, p. 106).
The key difference: likelihood fixes the actual outcome, i.e., just d(x₀), while error statistics considers outcomes other than the one observed in order to assess the error properties. The LP entails the irrelevance of, and no control over, error probabilities ("why you cannot be just a little bit Bayesian", EGEK 1996).

Update: A famous argument (Birnbaum, 1962) purports to show that plausible error statistical principles entail the LP! "Radical!" "Breakthrough!" (since the LP entails the irrelevance of error probabilities). But the "proof" is flawed! (Mayo 2010; see blog.)

The Statistical Significance Test Controversy (Morrison and Henkel, 1970): contributors chastise social scientists for slavish use of significance tests.
- The focus is on simple Fisherian significance tests.
- Philosophers direct criticisms mostly to N-P tests.
Fallacies of Rejection: Statistical vs. Substantive Significance
(i) Take statistical significance as evidence of a substantive theory that explains the effect.
(ii) Infer a discrepancy from the null beyond what the test warrants.
(i) Paul Meehl: It is fallacious to go from a statistically significant result, e.g., at the .001 level, to infer that one's substantive theory T, which entails the [statistical] alternative H₁, has received "quantitative support of magnitude around .999".
A statistically significant difference (e.g., in child rearing) is not automatically evidence for a Freudian theory.
T is subjected to only a feeble risk, violating Popper.

(ii) Infer a discrepancy from the null beyond what the test warrants: finding a statistically significant effect, d(x₀) ≥ cα (the cut-off for rejection), need not be indicative of a large or meaningful effect size; the test may be too sensitive.
Large n problem: an α-significant rejection of H₀ can be very probable, even with a substantively trivial discrepancy from H₀. This is often taken as a criticism because it is assumed that statistical significance at a given level is more evidence against the null the larger the sample size (n): fallacy! "The thesis implicit in the [N-P] approach [is] that a hypothesis may be rejected with increasing confidence or reasonableness as the power of the test increases" (Howson and Urbach 1989 and later editions). In fact, an α-significant result is indicative of less of a discrepancy from the null when it comes from a larger sample size than from a smaller one.
(Analogy with smoke detectors: an alarm from one that often goes off at merely burnt toast (overly powerful or sensitive), vs. an alarm from one that rarely goes off unless the house is ablaze.)
This also comes in the form of the Jeffreys-Good-Lindley paradox: even a highly statistically significant result can, with n sufficiently large, correspond to a high posterior probability on the null hypothesis.

Fallacy of Non-Statistically Significant Results
Test T(α) fails to reject the null when the test statistic fails to reach the cut-off point for rejection, i.e., d(x₀) < cα.
A classic fallacy is to construe such a negative result as evidence FOR the correctness of the null hypothesis (common in risk assessment contexts). "No evidence against" is not "evidence for". Merely surviving the statistical test is too easy, and occurs too frequently even when the null is false, with results from tests lacking sufficient sensitivity or power.

The Power Analytic Movement of the 60s in psychology. Jacob Cohen: by considering ahead of time the power of the test, select a test capable of detecting discrepancies of interest: a pre-data use of power (for planning).
A multitude of tables were supplied (Cohen, 1988), but until his death he bemoaned their all-too-rare use.
(Power is a feature of N-P tests, but apparently the prevalence of Fisherian tests in the social sciences, coupled, perhaps, with the difficulty of calculating power, resulted in ignoring power. There was also the fact that they were not able to get decent power in psychology; they turned to meta-analysis.)
Post-data use of power to avoid fallacies of insensitive tests

If there is a low probability of a statistically significant result even when a non-trivial discrepancy δ is present (low power against δ), then a non-significant difference is not good evidence that a non-trivial discrepancy is absent.

Still too coarse: power is always calculated relative to the cut-off point cα for rejecting H₀. Consider test T(α = .025), σ = 1, n = 25, and let δ = .2. No matter what the non-significant outcome, the power to detect δ is only .16! So we'd have to deny the data were good evidence that μ < .2.

This suggested to me (in writing my dissertation around 1978) that rather than calculating
(1) P(d(X) ≥ cα; μ = .2)   [power],
one should calculate
(2) P(d(X) ≥ d(x₀); μ = .2)   [observed power (severity)].
Even if (1) is low, (2) may be high. We return to this in the developments of Wave III.
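A sketch of the contrast for test T(.025), σ = 1, n = 25 (the outcomes are illustrative non-significant results; the last two reappear in Figure 1 below):

```python
# (1) Power is fixed once the cut-off is fixed; (2) "observed power"
# (severity) varies with the actual outcome d(x0).
from scipy import stats

se, c_alpha, mu = 0.2, 1.96, 0.2        # sigma/sqrt(n), cut-off, discrepancy

power = 1 - stats.norm.cdf(c_alpha - mu / se)   # (1): ~.168, outcome-free
print(round(power, 3))

for xbar in (-0.2, 0.1, 0.39):          # three non-significant outcomes
    d_obs = xbar / se                   # d(x0) with mu_0 = 0
    sev = 1 - stats.norm.cdf(d_obs - mu / se)   # (2): data-specific
    print(xbar, round(sev, 3))          # .977, .691, .171
```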
III. The Third Wave: Relativism, Reformulations, Reconciliations ~1980-2005 +
(skip) Rational Reconstruction and Relativism in Philosophy of Science: with Kuhnian battles over the very idea of a unified method of scientific inference, statistical inference became less prominent in philosophy.
Statistical ideas were largely used in rational reconstructions of scientific episodes, in appraising methodological rules, and in classic philosophical problems, e.g., Duhem's problem: reconstruct a given assignment of blame so as to be warranted by Bayesian probability assignments. No normative force.
Given the recognition that science involves subjective judgments and values, reconstructions often appeal to a subjective Bayesian account (Salmon's "Tom Kuhn Meets Tom Bayes"). (Kuhn thought this was confused: no reason to suppose an algorithm remains through theory change.)
Naturalisms, HPS
Wave III in Scientific Practice
Statisticians turn to eclecticism.
Non-statistician practitioners (e.g., in psychology, ecology, medicine) bemoan unholy hybrids: "a mixture of ideas from N-P methods, Fisherian tests, and Bayesian accounts that is inconsistent from both perspectives and burdened with conceptual confusion" (Gigerenzer, 1993, p. 323).
- Faced with foundational questions, non-statistician practitioners raise anew the questions from the first and second waves.
- Finding the automaticity and fallacies still rampant, most, if they are not calling for an outright ban on significance tests in research, insist on reforms and reformulations of statistical tests. (Task Force to consider a test ban in psychology: 1990s.)
Reforms and Reinterpretations Within Error Probability Statistics

Any adequate reformulation must:
(i) show how to avoid classic fallacies (of acceptance and of rejection) on principled grounds;
(ii) show that it provides an account of inductive inference.

Avoiding Fallacies
To quickly note my own recommendation (for test T(α)): move away from the coarse accept/reject rule; use the specific result (significant or insignificant) to infer those discrepancies from the null that are well ruled out, and those which are not. E.g., interpretation of non-significant results:
If d(x₀) is not statistically significant, and the test had a very high probability of a more statistically significant difference if μ > μ₀ + γ, then d(x₀) is good grounds for inferring μ ≤ μ₀ + γ.
Use the specific outcome to infer an upper bound μ* (values beyond μ* are ruled out at a given severity).
If d(x₀) is not statistically significant, but the test had a very low probability of a more statistically significant difference if μ > μ₀ + γ, then d(x₀) is poor evidence for inferring μ ≤ μ₀ + γ. The test had too little probative power to have detected such discrepancies even if they existed!
Takes us back to the post-data version of power:
Rather than construe "a miss as good as a mile", parity of logic suggests that the post-data power assessment should replace the usual calculation of power against μ₁:

POW(T(α), μ₁) = P(d(X) ≥ cα; μ = μ₁),

with what might be called the power actually attained or, to have a distinct term, the severity (SEV):

SEV(T(α), μ₁) = P(d(X) ≥ d(x₀); μ = μ₁),

where d(x₀) is the observed (non-statistically significant) result.
Figure 1 compares power and severity for different outcomes
Figure 1. POW(T(.025), μ₁ = .2) = .168, irrespective of the value of d(x₀); the severity evaluations are data-specific. The severity for the inference μ < .2: both X̄ = .39 and X̄ = −.2 fail to reject H₀, but with X̄ = .39, SEV(μ < .2) is low (.17), while with X̄ = −.2, SEV(μ < .2) is high (.97).
Fallacies of Rejection: The Large n-Problem
While with a non-significant result the concern is erroneously inferring that a discrepancy from μ₀ is absent, with a significant result x₀ the concern is erroneously inferring that one is present.
Utilizing the severity assessment: an α-significant difference with sample size n₁ passes μ > μ₁ less severely than with n₂, where n₁ > n₂.
In this way we solve the problems of tests that are too sensitive or not sensitive enough, but there's one more thing: showing how this supplies an account of inductive inference.
Many argue in Wave III that error statistical methods cannot supply an account of inductive inference because error probabilities conflict with posterior probabilities.
Figure 2 compares test T(α) with three different sample sizes, n = 25, n = 100, n = 400, denoted T(α, n); in each case d(x₀) = 1.96, a rejection at the cut-off point.
Figure 2. In test T(α) (H₀: μ ≤ 0 against H₁: μ > 0, with σ = 1), α = .025, cα = 1.96, and d(x₀) = 1.96. The severity for the inference μ > .1:
- n = 25: SEV(μ > .1) = .93
- n = 100: SEV(μ > .1) = .83
- n = 400: SEV(μ > .1) = .5
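A sketch reproducing these values under the same conventions (X̄ sits exactly at the cut-off in each case):

```python
# Severity for inferring mu > .1 when d(x0) = 1.96 at n = 25, 100, 400
# (sigma = 1): the just-significant result warrants less, the larger n is.
from scipy import stats
import numpy as np

for n in (25, 100, 400):
    se = 1.0 / np.sqrt(n)
    xbar = 1.96 * se                         # outcome exactly at the cut-off
    sev = stats.norm.cdf((xbar - 0.1) / se)  # SEV = P(d(X) <= d(x0); mu = .1)
    print(n, round(sev, 2))                  # 25 -> .93, 100 -> .83, 400 -> .48 (~.5)
```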
P-values vs. Bayesian Posteriors

A statistically significant difference from H₀ can correspond to a large posterior in H₀. From the Bayesian perspective, it follows that p-values come up short as a measure of inductive evidence; the significance testers balk at the recommended priors, which result in highly significant results being construed as no evidence against the null, or even evidence for it! The conflict typically considers the two-sided test T(2α), H₀: μ = 0 vs. H₁: μ ≠ 0. (The differences between p-values and posteriors are far less marked with one-sided tests.)
Assigning a prior of .5 to H₀, with n = 50 one can classically reject H₀ at significance level p = .05, although P(H₀|x) = .52 (which would actually indicate that the evidence favors H₀).
This is taken as a criticism of p-values only because it is assumed that the .52 posterior is the appropriate measure of belief-worthiness.
As the sample size increases, the conflict becomes more noteworthy.
If n = 1000, a result statistically significant at the .05 level leads to a posterior on the null of .82! SEV(H₁) = .95, while the corresponding posterior has gone from .5 to .82. What warrants such a prior?
[Table: posterior probability of H₀, for a just-significant result (fixed p and t), across sample sizes n = 10, 20, 50, 100, 1000.]
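A sketch of where such numbers come from, assuming the usual textbook setup (spiked prior of .5 on H₀, the remaining .5 spread as N(0, σ²) over the alternative, i.e., τ = σ; other spreads give different values):

```python
# Jeffreys-Lindley effect: posterior on H0 given a just-significant result
# (z = 1.96, two-sided p = .05), P(H0) = .5, mu ~ N(0, sigma^2) otherwise.
import numpy as np

z = 1.96
for n in (10, 20, 50, 100, 1000):
    r = n                               # n * tau^2 / sigma^2, with tau = sigma
    bf_01 = np.sqrt(1 + r) * np.exp(-0.5 * z**2 * r / (1 + r))  # Bayes factor
    post_H0 = bf_01 / (1 + bf_01)       # posterior with prior odds 1
    print(n, round(post_H0, 2))         # .37, .42, .52, .6, .82
```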
(1) Some claim the prior of .5 is a warranted frequentist assignment: H₀ was randomly selected from an urn in which 50% are true; (*) therefore P(H₀) = .5.
H₀ may be 0 change in extinction rates, 0 lead concentration, etc. What should go in the urn of hypotheses?
For the frequentist, either H₀ is true or false; the probability in (*) is fallacious and results from an unsound instantiation.
We are very interested in how false it might be, which is what we can probe by means of a severity assessment. (2) Subjective degree-of-belief assignments will not ensure the error probabilities, and thus the severity assessments, we need.
(3) Some suggest an "impartial" or "uninformative" Bayesian prior gives .5 to H₀, the remaining .5 probability being spread out over the alternative parameter space (Jeffreys).
This spiked concentration of belief in the null is at odds with the prevailing view that "we know all nulls are false."
Bayesians have recently co-opted "error probability" to describe a posterior, but it is not a frequentist error probability, which measures something very different.
Fisher: The Function of the p-Value Is Not Capable of Finding Expression
Faced with conflicts between error probabilities and Bayesian posterior probabilities, the error probabilist would conclude that the flaw lies with the latter measure. Discussing a test of the hypothesis that the stars are distributed at random, Fisher takes the low p-value (about 1 in 33,000) to "exclude at a high level of significance any theory involving a random distribution" (Fisher, 1956, p. 42). Even if one were to imagine that H₀ had an extremely high prior probability, Fisher continues, never minding "what such a statement of probability a priori could possibly mean", the resulting high posterior probability to H₀, he thinks, would only show that "reluctance to accept a hypothesis strongly contradicted by a test of significance" (ibid., p. 44) ". . . is not capable of finding expression in any calculation of probability a posteriori" (ibid., p. 43).
Wave IV? 2006+: The Reference Bayesians
They abandon coherence and the LP, and strive to match frequentist error probabilities! Contemporary impersonal Bayesianism.
Because of the difficulty of eliciting subjective priors, and because of the reluctance among scientists to allow subjective beliefs to be conflated with the information provided by data, much current Bayesian work in practice favors conventional "default", "uninformative", or "reference" priors.
1. What do reference posteriors measure?
- A classic conundrum: there is no unique noninformative prior. (Supposing there is one leads to inconsistencies in calculating posterior marginal probabilities.)
- Any representation of ignorance or lack of information that succeeds for one parameterization will, under a different parameterization, entail having knowledge.
Contemporary reference Bayesians seek priors that are simply conventions to serve as weights for reference posteriors:
- not to be considered expressions of uncertainty, ignorance, or degree of belief;
- may not even be probabilities; flat priors may not sum to one (improper priors).
If priors are not probabilities, what then is the interpretation of a posterior? (A serious problem I would like to see Bayesian philosophers tackle.)
2. Priors for the same hypothesis change according to what experiment is to be done! Bayesian incoherence: if the prior is to represent information, why should it be influenced by the sample space of a contemplated experiment?
This violates the likelihood principle, the cornerstone of Bayesian coherency.
Reference Bayesians: it is the price of objectivity. It seems to wreak havoc with basic Bayesian foundations, but without the payoff of an objective, interpretable output; even subjective Bayesians object.
3. Reference posteriors with good frequentist properties
Reference priors are touted as having some good frequentist properties, at least in one-dimensional problems.
They are deliberately designed to match frequentist error probabilities. If you want error probabilities, why not use techniques that provide them directly?
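A sketch of "matching" in a simple one-parameter case of my own choosing: the Jeffreys Beta(1/2, 1/2) prior for a binomial proportion, whose 95% credible intervals have roughly 95% frequentist coverage.

```python
# Frequentist coverage of Jeffreys-prior credible intervals for binomial theta.
from scipy import stats
import numpy as np

rng = np.random.default_rng(7)
theta, n, trials, covered = 0.3, 50, 4000, 0

for _ in range(trials):
    x = rng.binomial(n, theta)
    post = stats.beta(0.5 + x, 0.5 + n - x)   # Jeffreys posterior
    covered += post.ppf(0.025) <= theta <= post.ppf(0.975)

print(covered / trials)   # close to .95: posterior probability ~ error rate
```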
Note: using conditional probability, which is part and parcel of probability theory (as in Bayes nets), does not make one a Bayesian: no priors are assigned to hypotheses.