
INSIGHTS | PERSPECTIVES

STATISTICS

Aligning statistical and scientific reasoning


Misunderstanding and misuse of statistical significance impede science

By Steven N. Goodman

Departments of Medicine and of Health Research and Policy, Stanford University School of Medicine, Division of Epidemiology, Meta-research Innovation Center at Stanford (METRICS), Stanford, CA 94305, USA. Email: steve.goodman@stanford.edu

1180 3 JUNE 2016 VOL 352 ISSUE 6290 sciencemag.org SCIENCE
Published by AAAS
Downloaded from http://science.sciencemag.org/ on June 12, 2016

Imagine the American Physical Society convening a panel of experts to issue a missive to the scientific community on the difference between weight and mass. And imagine that the impetus for such a message was a recognition that engineers and builders had been confusing these concepts for decades, making bridges, buildings, and other components of our physical infrastructure much weaker than previously suspected.

That, in a sense, is what happened with the recent release of a statement from the American Statistical Association (ASA), with the deceptively innocuous title, "ASA statement on statistical significance and p-values" (1). The scientific measure in need of clarification was the P value, perhaps the most ubiquitous statistical index used in scientific research to help decide what is true and what is not. The ASA saw misunderstanding and misuse of statistical significance as a factor in the rise in concern about the credibility of many scientific claims (sometimes called the "reproducibility crisis") and is hoping that its official statement on the matter will help set scientists on the right course.

The formal definition of P value is the probability of an observed data summary (e.g., an average) and its more extreme values, given a specified mathematical model and hypothesis (usually the null). The problem is that this index by itself is not of particular interest. What scientists want is a measure of the credibility of their conclusions, based on observed data. The P value neither measures that nor is it part of a formula that provides it.

This confusion between the index we have and the measure we want produces misconceptions that the P value is the probability that the null hypothesis is true or that the observed data occurred by chance, different ways of saying the same thing (2, 3). This pernicious error creates the illusion that the P value alone measures the credibility of a conclusion, which opens the door to the mistaken notion that the dividing line between scientifically justified and unjustified claims is set by whether the P value has crossed the bright line of significance, to the exclusion of external considerations like prior evidence, understanding of mechanism, or experimental design and conduct.

Bright-line thinking, coupled with attendant publication and promotion incentives, is a driver behind selective reporting: cherry-picking which analyses or experiments to report on the basis of their P values. This in turn corrupts science and fills the literature with claims likely to be overstated or false. We cannot solve these problems without understanding how we got to this point.

R. A. Fisher revolutionized statistical inference and experimental design in the 1920s and '30s by establishing a comprehensive framework for statistical reasoning and writing the first statistical best-seller for experimenters. He formalized an approach to inference involving P values and assessment of significance, based on the frequentist notion of probability, defined in terms of verifiable frequencies of repeatable events. He wanted to avoid the subjectivity of the Bayesian approach, in which the probability of a hypothesis (inverse probability), neither repeatable nor observable, was central.

Fisher was a champion of P values as one of several tools to aid the fluid, inductive process of scientific reasoning, not to substitute for it. Fisher used significance merely to indicate that an observation was worth following up, with refutation of the null hypothesis justified only if further experiments rarely failed to achieve significance (4). This is in stark contrast to the modern practice of making claims based on a single demonstration of statistical significance.

FIGURE
Changing questions, changing answers. Three randomized trials show response rates of 20% in the control arm and rates in the treatment arms of (1) 26% (n = 900), (2) 40% (n = 100), and (3) 26% (n = 500). The effect deemed clinically important is 10%. (A) Each statistical approach asks a different question, hence interpretations are different. Scientists must decide which statistical question best matches their scientific question. (B) Likelihood functions, proportional to the probability of the observed data (vertical axis) under each possible true effect (horizontal axis), measure how strongly the observed effects support different true effects (which cannot be directly observed).

A. How can these data be interpreted?

Statistical approach: Hypothesis test (bright line)
Question: Should we act as though the observed effect is nonzero (given prespecified error rates)?
Analysis: 1. P ≤ 0.05; 2. P ≤ 0.05; 3. NS
Interpretation: Studies 1 and 2 indicate action based on a nonzero true effect is justified. Study 3 indicates it is not.

Statistical approach: Fisherian P value
Question: How much evidence is there that the true effect is different from zero?
Analysis: 1. P = 0.03; 2. P = 0.05; 3. P = 0.11
Interpretation: Studies 1 and 2 provide moderate, statistically significant evidence that the new treatment is better. Study 3 supplies weak but insufficient evidence to say the treatment is effective.

Statistical approach: Estimation
Question: What range of true effects is statistically consistent with the observed effects?
Analysis (effect, 95% confidence interval, %): 1. 6, 0.5 to 12; 2. 20, 2.5 to 38; 3. 6, -1.4 to 13
Interpretation: Studies 1 and 3 indicate the new treatment had a small to moderate effect. Study 2 is consistent with either small or large effects.

Statistical approach: Bayes factor
Question: How strongly do the data support a large, clinically important effect (10 to 25%) versus a small, unimportant one (0 to 10%)?
Analysis (Bayes factor, large:small effect): 1. 1:14; 2. 3:1; 3. 1:7
Interpretation: Studies 1 and 3 together decrease odds of an important effect 98-fold (1/7 × 1/14 = 1/98); Study 2 increases odds 3-fold, for net 33-fold decrease (3 × 1/98 = 1/33).

B. Likelihood functions
[Plot: likelihood curves for Trial 1, Trial 2, and Trial 3; horizontal axis from -0.05 to 0.45; vertical axis from 0 to 16.]

GRAPHIC: C. SMITH/SCIENCE

In their development of hypothesis testing in the 1930s, Jerzy Neyman and Egon
Pearson went where Fisher was unwilling to go (5). In a hypothesis test, one specifies a null statistical hypothesis and an alternative, and is to reject the null and accept the alternative (or vice versa) on the basis of whether an estimate falls into a prespecified region defined by two error rates: type I (alpha, false positive) and type II (beta, false negative). Once these error rates are set, scientific reasoning is effectively out of the picture (see the figure). Judgment ideally enters through customization of the alternative hypothesis and the error rates, contingent on the seriousness of each kind of error.

The Neyman-Pearson method did not use P values, but was combined with the Fisherian P-value approach in textbooks and research articles (6, 7). Without foundational justification, this created the illusion that quantitative inference could be automated, with hypothesis rejection determined by whether the P value is less than the type I error, set at 5% in most sciences today. This combination did violence to both approaches, particularly to Fisher's. He vehemently opposed using P values for automatic inference, referring to hypothesis tests disparagingly as "decision functions" or "acceptance procedures." His dismay was pointed and prescient:

"[N]o scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he [examines] each particular case in the light of ... evidence and ... ideas [and]" [p. 42 of (8)].

"The concept that the scientific worker can regard himself as an inert item in a vast co-operative concern working according to accepted rules, is encouraged by directing attention away from his duty to form correct scientific conclusions, ... and by stressing his supposed duty to mechanically make a succession of automatic decisions.... The idea that this responsibility can be delegated to a giant computer programmed with Decision Functions belongs to a phantasy of circles, rather remote from scientific research" [pp. 104-105 (8)].

Sixty years later, we have the ASA expressing the same sentiment:

"Researchers should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis. The widespread use of statistical significance (generally interpreted as p ≤ 0.05) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process" (1).

The concordance of these statements, separated by over half a century, underscores lack of progress in approaches to statistical inference in the applied literature, despite advances in statistical methodology. This is due in part to the way statistical inference is taught to scientists: not as a variety of named, competing approaches, each with strengths and weaknesses, but as anonymized procedures, universally applicable, seemingly without controversy or alternatives (6, 7).

Contrast this situation with other sciences. In any high-school physics textbook, one will find theories and models by Copernicus, Galileo, Newton, Einstein, and so on. Students are trained to understand the incomplete explanatory power of each theory, the controversies, why new theories were accepted (or not), and what questions they raised.

Theories of statistical inference are no less nuanced or contested, as evidenced by the 23 commentaries that followed the ASA statement (1). But such controversy, rarely taught in applied courses or texts, is unappreciated by most who use statistical tools. This seeming absence of controversy about the foundations of these methods has fostered growth of social-scientific structures reifying those values (enshrined in journal practices, promotion, and funding criteria, as well as in the standard discourse of science), which makes them extraordinarily difficult to change.

Another reason these practices persist is that, until the recent rise of concern about research reproducibility (9), the scientific community has perceived few adverse consequences from their use. Many papers over the past century have issued cautions similar to those of the ASA, but have largely been ignored by the general scientific community. Benefits of having seemingly objective rules have outweighed theoretical cavils (6, 10).

The ASA suggested several ways to improve statistical interpretation, including more complete reporting of all analyses performed, and a number of alternative inferential approaches. One of these, Bayes factors (11, 12), is a measure derived from Bayes' theorem indicating how strongly the data should shift belief toward one hypothesis versus another. If we were told that the experimental results lowered the prestudy odds of the null hypothesis by a factor of 4, this would lead to a far different reasoning process than does P = 0.03, which is difficult to combine with external knowledge (11) (see the chart).

Bayes factors and fully Bayesian analyses are not without their own complications (10, 12-15), as are all other recommended approaches. But, if they were more widely used, rules would evolve. That said, no P-value alternatives will solve the problems noted by the ASA if they are used in bright-line fashion, such as applying a confidence interval only to see if it includes the null value.

P values are unlikely to disappear, and the ASA did not recommend their elimination but, rather, a change in how they are interpreted and used. But how can scientists follow the ASA (and Fisher's) dictates to combine them with contextual factors? There are few examples in the scientific literature. How many papers explain why, in one context, a finding with P = 0.006 is insufficient to make a claim, whereas, in another, P = 0.08 might be all that is needed (11)? Any attempt to do that in an individual research paper would likely meet resistance from reviewers or editors.

The field of genomics has shown us that evidential thresholds are changeable within disciplines, with P ≤ 10^-8 now sought for claiming relations derived from genome-wide scanning. Thresholds in physics are far lower than the P ≤ 0.05 level used in biomedicine and the social sciences. Whether such thresholds can or should be modified by design, by discipline, or by individual study are rich areas for future exploration (16).

Science has progressed dramatically over the past 90 years, despite these issues. How much faster and more efficiently can it proceed if new statistical approaches to inference are adopted, and if incentive structures are aligned with optimal statistical and scientific practices? The ASA has posed a challenge to all who use statistical measures to justify their claims. Let us hope the next century will see as much progress in the inferential methods of science as in its substance.

REFERENCES
1. R. L. Wasserstein, N. A. Lazar, Am. Stat. 10.1080/00031305.2016.1154108 (2016).
2. S. Goodman, Semin. Hematol. 45, 135 (2008).
3. D. R. Cox, Br. J. Clin. Pharmacol. 14, 325 (1982).
4. R. A. Fisher, J. Min. Agric. Great Britain 33, 503 (1926).
5. J. Neyman, E. S. Pearson, Philos. Trans. R. Soc. Lond. A 231, 289 (1933).
6. G. Gigerenzer et al., The Empire of Chance (Cambridge Univ. Press, Cambridge, 1989).
7. G. Gigerenzer, J. N. Marewski, J. Manage. 41, 421 (2015).
8. R. A. Fisher, Statistical Methods and Scientific Inference (Hafner, New York, ed. 1, 1956).
9. F. S. Collins, L. A. Tabak, Nature 505, 612 (2014).
10. B. Efron, Am. Stat. 40, 1 (1986).
11. S. N. Goodman, Ann. Intern. Med. 130, 1005 (1999).
12. R. E. Kass, A. E. Raftery, J. Am. Stat. Assoc. 90, 773 (1995).
13. S. Greenland, C. Poole, Epidemiology 24, 62 (2013).
14. H. Hoijtink, P. van Kooten, K. Hulsker, Multivariate Behav. Res. 51, 2 (2016).
15. R. D. Morey, E. J. Wagenmakers, J. N. Rouder, Multivariate Behav. Res. 51, 11 (2016).
16. V. E. Johnson, Proc. Natl. Acad. Sci. U.S.A. 110, 19313 (2013).

10.1126/science.aaf5406
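
A worked sketch of the figure's numbers may help readers see how the same three trials yield different answers under different statistical questions. The Python below is an illustration only, not the author's method, and it rests on assumptions the article does not state: each reported n is taken as the total sample size split evenly between arms, and the interval and test are an unpooled normal-approximation (Wald) calculation. The function names are hypothetical. Under these assumptions the 95% intervals come out close to the figure's for all three trials, as do the P values for trials 1 and 3. For trial 2 this test gives roughly 0.025 rather than the reported 0.05, so the article presumably used a different test. The Bayes factors are copied from the figure rather than derived, and only the odds arithmetic is reproduced. The 50% prior in the usage lines is an arbitrary value chosen for illustration.

```python
from fractions import Fraction
from math import erfc, sqrt


def wald_summary(p_control, p_treat, n_total):
    """Difference in response rates with a 95% Wald interval and two-sided P.

    ASSUMPTION (not stated in the article): n_total is the total sample
    size, split evenly between the control and treatment arms.
    """
    n_arm = n_total / 2
    diff = p_treat - p_control
    se = sqrt(p_control * (1 - p_control) / n_arm
              + p_treat * (1 - p_treat) / n_arm)
    z = diff / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail area of a standard normal
    return diff, (diff - 1.96 * se, diff + 1.96 * se), p_value


# Treatment-arm rates and sample sizes from the figure caption;
# the control-arm rate is 20% in every trial.
trials = {1: (0.26, 900), 2: (0.40, 100), 3: (0.26, 500)}
summaries = {k: wald_summary(0.20, rate, n) for k, (rate, n) in trials.items()}

# Bayes factors (large:small effect) copied from the figure, not derived here.
bayes_factors = {1: Fraction(1, 14), 2: Fraction(3, 1), 3: Fraction(1, 7)}
bf_1_and_3 = bayes_factors[1] * bayes_factors[3]  # 1/98
bf_all = bf_1_and_3 * bayes_factors[2]            # 3/98, roughly 1/33


def update_probability(prior_prob, bayes_factor):
    """Turn a prior probability into odds, apply the Bayes factor, convert back."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    for k, (diff, (lo, hi), p) in summaries.items():
        print(f"Trial {k}: effect {diff:.0%}, "
              f"95% CI {lo:.1%} to {hi:.1%}, P = {p:.3f}")
    print("Combined Bayes factor (large:small):", bf_all)
    # Hypothetical 50% prior probability of a large effect, for illustration only.
    posterior = update_probability(Fraction(1, 2), bf_all)
    print("Posterior probability of a large effect:", float(posterior))
```

The point of the sketch is the contrast the article draws: a P value stops at the tail area, whereas a Bayes factor is built to be multiplied into prior odds, so external knowledge enters the calculation explicitly.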

Aligning statistical and scientific reasoning
Steven N. Goodman (June 2, 2016)
Science 352 (6290), 1180-1181. [doi: 10.1126/science.aaf5406]


This copy is for your personal, non-commercial use only.

Article Tools: Visit the online version of this article to access the personalization and article tools:
http://science.sciencemag.org/content/352/6290/1180

Permissions: Obtain information about reproducing this article:
http://www.sciencemag.org/about/permissions.dtl

Science (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week
in December, by the American Association for the Advancement of Science, 1200 New York
Avenue NW, Washington, DC 20005. Copyright 2016 by the American Association for the
Advancement of Science; all rights reserved. The title Science is a registered trademark of AAAS.
