P values and hypothesis testing methods are frequently misused in clinical research. Much of
this misuse appears to be owing to the widespread, mistaken belief that they provide simple,
reliable, and objective triage tools for separating the true and important from the untrue or
unimportant. The primary focus in interpreting therapeutic clinical research data should be
on the treatment (oomph) effect, a metaphorical force that moves patients given an
effective treatment to a different clinical state relative to their control counterparts. This
effect is assessed using 2 complementary types of statistical measures calculated from the
data, namely, effect magnitude or size and precision of the effect size. In a randomized trial,
effect size is often summarized using constructs, such as odds ratios, hazard ratios, relative
risks, or adverse event rate differences. How large a treatment effect has to be to be
consequential is a matter for clinical judgment. The precision of the effect size (conceptually
related to the amount of spread in the data) is usually addressed with confidence intervals.
P values (significance tests) were first proposed as an informal heuristic to help assess how
unexpected the observed effect size was if the true state of nature was no effect or no
difference. Hypothesis testing was a modification of the significance test approach that
envisioned controlling the false-positive rate of study results over many (hypothetical)
repetitions of the experiment of interest. Both can be helpful but, by themselves, provide
only a tunnel vision perspective on study results that ignores the clinical effects the study was
conducted to measure.
JAMA Cardiol. doi:10.1001/jamacardio.2016.3312
Published online October 12, 2016.
jamacardiology.com
[Figure 1. Components of the observed treatment effect: the true treatment effect, measurement error, and patient-to-patient variability, the last comprising variation in therapeutic opportunity space, variations caused by background therapies, and variations in biological ability to respond to therapy (see Figure 1B).]
Measurement error of the remeasure type has quantitatively much less influence on the precision of estimated outcomes than patient-to-patient variability. In
clinical research, when we refer to the precision of the effect size,
we are using a concept that blends the effects of uncertainty due
to patternless measurement errors with the uncertainty (often nonrandom) resulting from variations among patients. The most commonly used precision assessment tools in clinical research are confidence intervals. All other things being equal, studies with greater
precision in the estimation of a nonzero effect size are also more likely
to have a smaller P value, but the relationship between precision and
P values is indirect, as will be discussed below.
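The two complementary measures described above can be made concrete with a small sketch. The code below computes an adverse event rate difference (effect size) together with a Wald 95% confidence interval (precision); the event counts are hypothetical and the normal approximation is an assumption of this sketch, not the article's method.

```python
# Hypothetical event counts; Wald normal approximation assumed for the 95% CI.
from math import sqrt

def risk_difference_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Risk difference (control minus treatment) with a Wald 95% CI."""
    p_t, p_c = events_t / n_t, events_c / n_c
    diff = p_c - p_t  # effect size: absolute event rate difference
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return diff, (diff - z * se, diff + z * se)

diff, (lo, hi) = risk_difference_ci(80, 1000, 100, 1000)
print(f"rate difference {diff:.3f}, 95% CI {lo:.3f} to {hi:.3f}")
# → rate difference 0.020, 95% CI -0.005 to 0.045
```

Here the interval spans zero, so the data are compatible with no effect, but the point estimate and the width of the interval carry far more clinical information than that single fact.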
The treatment effects we report in clinical research thus reflect the admixture of at least 4 distinct but intertwined factors
(Figure 1A), including the true size of the treatment effect under ideal
circumstances (which is unknowable) and the following 3 sources
of uncertainty: (1) the selection (almost always nonrandom) of participants into the trial from the universe of nominally eligible individuals, (2) the amount of measurement error in the effect size estimation, and (3) the patient-to-patient variability in response to therapy.
[Figure. Quantifiable vs unquantifiable (ambiguity) components of uncertainty.]
[Table 2. Common misconceptions about the P value (columns: Misconception, Comment).]
learned how to use the experimental intervention to get a predictable and desired result. One or 2 experiments usually could not settle
any complex scientific question. This insight is as relevant to modern clinical trials as it was to Fisher's small agricultural experiments.
In some areas of cardiovascular medicine, such as secondary prevention trials with β-blockers, angiotensin-converting enzyme inhibitors, antiplatelet agents, and statins, the ability to perform multiple, reasonably similar trials has been pivotal in providing the
consistent evidence necessary to achieve widespread clinical acceptance and to support incorporation into clinical guidelines. However, even repeated huge modern clinical trials sometimes do not
provide everything we need to know to understand a therapy. One
of the best examples involves the thrombolytic therapy megatrials
that laid the foundations of modern reperfusion therapy for acute
myocardial infarction.20,21 Two independent trials (Gruppo Italiano
per lo Studio della Sopravvivenza nell'Infarto Miocardico [GISSI 2]22
with 12 490 patients and the Third International Study of Infarct Survival [ISIS 3]20 with 41 299 patients) failed to find any of the expected clinical advantages suggested by earlier mechanistic investigations for tissue plasminogen activator over streptokinase.20 While
many considered the matter settled based on this substantial body
of trial evidence, the unwillingness of some others to dismiss the cognitive dissonance produced by this result led to a third megatrial
(Global Utilization of Streptokinase and Tissue Plasminogen Activator for Occluded Coronary Arteries [GUSTO]21 with 41 021 patients), which by modifying some aspects of the tissue plasminogen activator therapy from the earlier trials demonstrated the precise
mortality benefit of tissue plasminogen activator over streptokinase predicted at the outset of the trial.23
The P value has inspired an impressive amount of passionate editorializing in the scientific literature.5,24-35 Much of the negativity expressed about P values seems due to 2 problems. First, the P value
has nothing to say about the issue of how different the 2 cohorts are
(the treatment oomph in the data).10,36 Fisher likely would have argued that it is the responsibility of the investigator to make that assessment and that it is outside the intended scope of significance
testing. Second, the P value is sample size sensitive.37,38 With a large
enough sample, any nonzero treatment difference will yield a significant P value.39 Conversely, with small samples, only very large
treatment effects are likely to generate significant P values.
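This sample-size sensitivity is easy to demonstrate numerically. The sketch below uses a pooled two-proportion z test (an illustrative choice, with made-up event counts, not an analysis from any trial discussed here) and holds the event rates fixed at 10% vs 8% while varying only the number of patients per arm:

```python
# Illustrative only: identical effect size, different sample sizes.
from math import sqrt, erfc

def two_proportion_p(events_a, n_a, events_b, n_b):
    """Two-sided P value from a pooled two-proportion z test."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(p_a - p_b) / se
    return erfc(z / sqrt(2))  # two-sided normal tail probability

# Same 10% vs 8% event rates; only the sample size changes.
p_small = two_proportion_p(10, 100, 8, 100)          # 100 per arm
p_large = two_proportion_p(1000, 10000, 800, 10000)  # 10 000 per arm
print(f"P = {p_small:.2f} at n = 100 per arm; P = {p_large:.1e} at n = 10 000")
```

The first comparison is far from "significant" and the second is overwhelmingly so, yet the treatment oomph, a 2-percentage-point absolute difference, is identical in both.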
If the P value has such important limitations, why is it still so incredibly popular with researchers? We believe that several factors help
explain this phenomenon. First, the P value appears to offer a simple,
objective triage tool: P < .05 means pay attention, while P ≥ .05 is
safe to ignore.40 Second, as this triage use of the P value illustrates,
misinterpretation and misapplication of the P value are common
(Table 2).5,32 The most frequently offered statistical textbook definition is not much help, namely, that the P value is the probability of observing a difference (effect size) as large as was measured or larger
under the assumption that the null hypothesis (no difference) is actually correct. Viewing the data from the perspective of P values is a
form of statistical tunnel vision (Figure 3), focusing on the tail (of the
test statistic distribution where it crosses the null point) and ignoring
the dog (the effect size the experiment was conducted to measure).
The hazard of this practice is well illustrated by the Surgical Treatment for Ischemic Heart Failure (STICH) trial,42 which randomized
1212 patients with an ejection fraction of 35% or less and coronary artery disease amenable to coronary artery bypass grafting (CABG) to
[Figure 3. Comparison of the P Value View of Data With the Effect Size or Precision View. The P value (y-axis, 0 to 1.0) is plotted against relative risk (x-axis, log scale, 0.01 to 100) for the effect in men (RR, 0.61; 95% CI, 0.38-0.98) and the effect in women (RR, 0.60; 95% CI, 0.33-1.10), with the measured estimates of effect magnitude, a 95% CI reading line, and insets contrasting the effect size plus precision-of-estimate view with the P value view.]
The results are shown from a hypothetical study comparing the effect of a
treatment in men vs women. The y-axis shows the full range of possible P values
(0 to 1.0), and the x-axis shows relative risk (RR) (values <1 indicate better
outcomes with treatment relative to control). This type of display is called a
P value function (and can also be shown as a confidence interval function by
changing the y-axis to 1 minus the P value). It is explained in greater detail in the
Epidemiology textbook by Rothman.41 Typically, we are most interested in the
P values calculated against the null hypothesis (in this case represented by an
RR of 1, shown as a solid vertical line). Three key concepts are shown. First, the
P value view of the data, shown in the inset on the right, focuses completely on
where the tails of the P value functions cross the null position (RR of 1). The
P value view does not include any direct assessment of the size of the treatment
effect produced. Second, even within the 95% CIs, the possible values of RR are
not all equally likely. The values most compatible with the data collected are
those at or close to the point estimates of effect size. Third, the narrowness or
wideness of the P value function, reflecting precision of the estimate of effect
size, can perversely affect interpretation using P values. In this example, it is
clear that the effect sizes are essentially identical, but there is less precision in
the estimate for the women, leading to a nonsignificant result. A P value-centric
interpretation might lead to the misguided conclusion that the therapy works in
men but not in women. The figure was generated using a program created by
Kenneth Rothman, DMD, DrPH (http://www.krothman.org/episheet.xls) and
used with his permission.
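The P value function in the figure can be sketched directly from the reported summaries. Assuming the CIs arise from a normal approximation on the log-RR scale (an assumption of this sketch, not a statement about how the hypothetical data were generated), each curve follows from its point estimate and 95% CI:

```python
# Sketch of a P value function (after Rothman), reconstructed from RR and CI.
from math import log, sqrt, erfc

def p_value_function(rr, ci_lo, ci_hi):
    """Map a hypothesized RR to its two-sided P value (normal log-RR scale)."""
    se = (log(ci_hi) - log(ci_lo)) / (2 * 1.96)  # SE backed out of the 95% CI
    def p(rr_null):
        z = (log(rr) - log(rr_null)) / se
        return erfc(abs(z) / sqrt(2))
    return p

p_men = p_value_function(0.61, 0.38, 0.98)
p_women = p_value_function(0.60, 0.33, 1.10)

# Nearly identical effect sizes, but only the men's result crosses P < .05
print(f"men: P = {p_men(1.0):.3f}; women: P = {p_women(1.0):.3f}")
# The function peaks (P = 1) at the point estimate itself
print(f"at RR = 0.61, P = {p_men(0.61):.1f}")
```

Run against the null (RR of 1), the men's curve gives P ≈ .04 and the women's P ≈ .10, reproducing the figure's point: the only material difference between the two groups is precision, not effect size.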
CABG or medical therapy alone. With a median follow-up of 56 months,
the CABG-medicine treatment effect on all-cause mortality (the primary end point) was a hazard ratio of 0.86 with P = .12, and the trial
was interpreted as negative. After an additional 5 years of follow-up,
the CABG effect size for all-cause mortality was unchanged (0.84 vs
0.86); however, the precision of the estimate was improved (95% CI,
0.73-0.97 vs 0.72-1.04), and the P value was now significant (P = .02),
a positive trial.43 All that really changed between the first (negative
trial) report and the second (positive trial) was the position of the upper tail of the test statistic relative to the null position. Such overreliance on P values (the tails) relative to the oomph effect of treatment
(the dog) is common4 but makes no sense.
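The STICH arithmetic can be approximated from the published hazard ratios and CIs alone. The trial's own P values came from its prespecified survival analyses, so the Wald-style back-calculation below (a normal approximation on the log-HR scale) is only a rough check, but it reproduces the qualitative flip from nonsignificant to significant with an essentially unchanged effect size:

```python
# Rough Wald-test reconstruction from reported HR and 95% CI (approximation;
# the trial's published P values used its prespecified analyses).
from math import log, sqrt, erfc

def wald_p_from_ci(hr, ci_lo, ci_hi):
    """Approximate two-sided P value against HR = 1 from a 95% CI."""
    se = (log(ci_hi) - log(ci_lo)) / (2 * 1.96)
    z = log(hr) / se
    return erfc(abs(z) / sqrt(2))

p_first = wald_p_from_ci(0.86, 0.72, 1.04)     # median 56-month follow-up
p_extended = wald_p_from_ci(0.84, 0.73, 0.97)  # extended 10-year follow-up
print(f"first report P ≈ {p_first:.2f}; extended follow-up P ≈ {p_extended:.2f}")
```

The hazard ratio barely moves between the two reports; only the narrowing CI pushes the upper tail below the null and the P value across the .05 threshold.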
Conclusions
In this article, we have presented a view of clinical research centered on measuring and understanding something of clinical or scientific importance, the treatment oomph effect of interest. Figuring out how best to conceptualize and measure clinical effects, as
well as finding the most insightful ways to analyze and interpret the
data using modern statistical tools, is difficult, complex, and often
messy. The primary goal of the research process is not to generate
P values or perform hypothesis tests. Those tools, like the many others in the statistical toolbox, can be helpful but must be applied
thoughtfully and with full appreciation for their assumptions and
limitations. Unfortunately, no statistical tool or technique can
guarantee access to the shortest path to the truth. Repeated experiments, as Fisher recognized years ago, may be the best way to
tame much of the messiness. If that is not feasible, and it often is
not, medicine can still accomplish much by making pragmatic, well-reasoned use of the evidence it does have.
ARTICLE INFORMATION
Accepted for Publication: July 22, 2016.
Published Online: October 12, 2016.
doi:10.1001/jamacardio.2016.3312
Author Contributions: Dr Mark had full access to
all the data in the study and takes responsibility for
the integrity of the data and the accuracy of the
data analysis.
Study concept and design: All authors.
Drafting of the manuscript: Mark.
Critical revision of the manuscript for important
intellectual content: All authors.
Administrative, technical, or material support:
Mark, Lee.
Conflict of Interest Disclosures: All authors have
completed and submitted the ICMJE Form for
Disclosure of Potential Conflicts of Interest, and
none were reported.
Funding/Support: This publication was supported
in part by award UL1TR000445 from the Clinical