
A P-value Ain’t What You Think It Is

Al M Best, PhD
Professor, Periodontics, School of Dentistry
Professor, Biostatistics, School of Medicine
Outline
• Idea for the editorial
• A history of significance testing
• A guide to misinterpretation
• Using a dental example
• My practice as a collaborator

Best AM, Greenberg BL, Glick M. From tea tasting to t test: A P value ain’t
what you think it is. Journal of the American Dental Association. 2016
Jul;147(7):527-9. PMID: 27350642.
7-Mar-2016 retractionwatch.com blog

http://retractionwatch.com/2016/03/07/were-using-a-common-statistical-test-all-wrong-statisticians-want-to-fix-that/
TAS

http://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108
Metrics

amstat.tandfonline.com/doi/citedby/10.1080/00031305.2016.1154108
Supplemental Material
• Greenland, S, Senn, SJ, Rothman, KJ, Carlin, JB, Poole, C, Goodman, SN and Altman, DG: “Statistical Tests, P-values, Confidence Intervals, and Power: A Guide to Misinterpretations”
• Altman, Naomi: Ideas from multiple testing of high dimensional data provide insights about reproducibility and false discovery rates of hypothesis supported by p-values
• Benjamin, Daniel J, and Berger, James O: A simple alternative to p-values
• Benjamini, Yoav: It’s not the p-values’ fault
• Berry, Donald A: P-values are not what they’re cracked up to be
• Carlin, John B: Comment: Is reform possible without a paradigm shift?
• Cobb, George: ASA statement on p-values: Two consequences we can hope for
• Gelman, Andrew: The problems with p-values are not just with p-values
• Goodman, Steven N: The next questions: Who, what, when, where, and why?
• Greenland, Sander: The ASA guidelines and null bias in current teaching and practice
• Ioannidis, John PA: Fit-for-purpose inferential methods: abandoning/changing P-values versus abandoning/changing research
• Johnson, Valen E: Comments on the “ASA Statement on Statistical Significance and P-values” and marginally significant p-values
• Lavine, Michael, and Horowitz, Joseph: Comment
• Lew, Michael J: Three inferential questions, two types of P-value
• Little, Roderick J: Discussion
• Mayo, Deborah G: Don’t throw out the error control baby with the bad statistics bathwater
• Millar, Michele: ASA statement on p-values: some implications for education
• Rothman, Kenneth J: Disengaging from statistical significance
• Senn, Stephen: Are P-Values the Problem?
• Stangl, Dalene: Comment
• Stark, PB: The value of p-values
• Ziliak, Stephen T: The significance of the ASA statement on statistical significance and p-values
Supplemental Material

Greenland, S, Senn, SJ, Rothman, KJ, Carlin, JB, Poole, C, Goodman, SN and
Altman, DG: “Statistical Tests, P-values, Confidence Intervals, and Power:
A Guide to Misinterpretations” Eur J Epidemiol. 2016 Apr;31(4):337-50.
The Lady Tasting Tea

● Classical example

Salsburg D. The Lady Tasting Tea. New York, NY: WH Freeman and Co; 2001.
Fisher RA. Statistical Methods and Scientific Inference. 3rd ed. New York, NY: Hafner
Press; 1973.
Coke vs Pepsi
● Say I poured, hidden from you, two soft-drink cups:
one with Coke and one with Pepsi.
Then I ask you: “Which is Coke? And which is Pepsi?”
● What are the possible outcomes?
Actual outcome   Number correct
1                2
2                0
From: Maita Levine and Raymond H. Rolwing (1993).
Teaching Statistics, 15, 4-5.
Likelihood of outcomes

● Look at the exact distribution of the number correct.
Calculate the probability of each result.
Number correct   Frequency   Proportion   As or more extreme
0                1           0.5          1.00
2                1           0.5          0.50
● Would this experiment be convincing?
Coke vs Pepsi: 4 cups
● Assuming an equal number of Cokes and
Pepsis, the next larger experiment would
be 4 cups.
● What are the possible outcomes?
Actual outcome   Number correct
1                0
2                2
3                2
4                2
5                2
6                4
Likelihood of Outcomes
● With each outcome equally likely, we
calculate the p-values for all the possibilities:
Number correct   Frequency   Proportion   As or more extreme
0                1           0.1667       1.0000
2                4           0.6667       0.8333
4                1           0.1667       0.1667
● Would this experiment be convincing?
– So if someone got all 4 right, we would be able to
conclude that this person could “… tell the
difference between Coke and Pepsi,
p-value = .1667.” Would this be convincing?
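The 4-cup table can be reproduced by brute force. The following is a minimal sketch, assuming (as the slides do) that the taster knows exactly half the cups are Coke and that, under the null hypothesis, every labelling is equally likely. The function name `exact_distribution` is made up for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def exact_distribution(n_cups):
    """Enumerate every way a taster can label half of n_cups as Coke.

    Assumption: the taster knows half the cups are Coke, and under the
    null hypothesis every labelling is equally likely.
    """
    half = n_cups // 2
    truth = set(range(half))  # by convention, cups 0..half-1 really are Coke
    counts = Counter()
    for guess in combinations(range(n_cups), half):
        coke_hits = len(truth & set(guess))
        # Each correctly labelled Coke implies one correctly labelled Pepsi.
        counts[2 * coke_hits] += 1
    total = sum(counts.values())
    rows = []
    for correct in sorted(counts):
        proportion = Fraction(counts[correct], total)
        # Probability, under the null, of a result at least this good.
        tail = sum(Fraction(counts[c], total) for c in counts if c >= correct)
        rows.append((correct, counts[correct], float(proportion), float(tail)))
    return rows


for row in exact_distribution(4):
    print("%d correct: frequency %d, proportion %.4f, as or more extreme %.4f" % row)
```

Calling `exact_distribution(2)` gives the two-cup table in the same way.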
Fisher’s tea lady used 8 cups
● All the possible outcomes
Outcome  Correct    Outcome  Correct    Outcome  Correct    Outcome  Correct
1        0          18       4          36       4          54       6
2        2          19       4          37       4          55       6
3        2          20       4          38       4          56       6
4        2          21       4          39       4          57       6
5        2          22       4          40       4          58       6
6        2          23       4          41       4          59       6
7        2          24       4          42       4          60       6
8        2          25       4          43       4          61       6
9        2          26       4          44       4          62       6
10       2          27       4          45       4          63       6
11       2          28       4          46       4          64       6
12       2          29       4          47       4          65       6
13       2          30       4          48       4          66       6
14       2          31       4          49       4          67       6
15       2          32       4          50       4          68       6
16       2          33       4          51       4          69       6
17       2          34       4          52       4          70       8
                    35       4          53       4
Likelihood of Outcomes
● We calculate the p-values
Number correct   Frequency   Proportion   As or more extreme
0                1           0.0143       1.0000
2                16          0.2286       0.9857
4                36          0.5143       0.7571
6                16          0.2286       0.2429
8                1           0.0143       0.0143
● If someone got all 8 right, we could conclude
that this person could “… tell the difference
between Coke and Pepsi, p-value = .0143.”
Would this be convincing?
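For 8 cups, listing all 70 outcomes by hand is tedious, and the same counts follow from a closed form. This is a sketch under the same assumptions as the slides (half the cups are Coke, all labellings equally likely under the null); `tea_tasting_table` is a name invented here.

```python
from math import comb


def tea_tasting_table(n_cups):
    """Closed-form counts for a tasting design with n_cups/2 cups of each drink.

    Getting k Cokes right means choosing k of the true Cokes and
    (half - k) of the true Pepsis to call "Coke".
    """
    half = n_cups // 2
    total = comb(n_cups, half)  # number of equally likely labellings
    freqs = {2 * k: comb(half, k) * comb(half, half - k) for k in range(half + 1)}
    table = []
    for correct, freq in freqs.items():
        # Probability, under the null, of a result at least this good.
        tail = sum(f for c, f in freqs.items() if c >= correct) / total
        table.append((correct, freq, freq / total, tail))
    return table


for correct, freq, prop, tail in tea_tasting_table(8):
    print(f"{correct} correct: frequency {freq:2d}, proportion {prop:.4f}, "
          f"as or more extreme {tail:.4f}")
```

The same function shows that 6 cups is the smallest design in which a perfect score reaches 1/20 = 0.05, and 8 cups the smallest in which it falls below that.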
Inference?

● “Statistical analysis of medical studies is
based on the key idea that we make
observations on a sample of subjects and
then draw inferences about the population
of all such subjects from which the sample
is drawn.”
Altman D, Machin D., Bryant T, & Gardner M (Eds.) (2013) Statistics with confidence: confidence intervals and
statistical guidelines. John Wiley & Sons. ISBN 0-7279-1375-1. Page 3.
Gardner MJ, Altman DG. (1988) Estimating with confidence. Br Med J. 30;296(6631):1210-1. PMID: 3133015; PubMed
Central PMCID: PMC2545695.
Jerzy Neyman & Egon Pearson

● Viewed Fisher’s work as mathematically
fuzzy and heuristic
● Instead of focusing on what a scientist
thinks about the evidence, an experiment
should tell the scientist what to do.
● Out of this came the alternative hypothesis (Ha),
type I and type II error rates, and power
Greenland’s “Guide to Misinterpretations”

● Lapidus et al. “Effect of premedication to provide
analgesia as a supplement to inferior alveolar
nerve block in patients with irreversible pulpitis.”
JADA 2016 147(6):427-37.
● CONCLUSIONS: There is moderate evidence to
support the use of oral NSAIDs (in particular,
ibuprofen) 1 hour before the administration of
IANB local anesthetic to provide additional
analgesia to the patient.

Greenland et al. “Statistical Tests, P-values, Confidence Intervals, and Power:
A Guide to Misinterpretations” Eur J Epidemiol. 2016 Apr;31(4):337-50.
Severely infected irreversible pulpitis
Tom Hanks (2000) A FedEx executive must
transform himself physically and emotionally to
survive a crash landing on a deserted island
Ibuprofen versus placebo, frequency of
participants in each group having “little or no
pain during endodontic treatment.”

“The probability of … is .020.”


Benzodiazepine versus placebo, frequency of
participants in each group having “little or no pain
during endodontic treatment.”

“The probability of … is .954.”


True or False?
The p-value is the probability that
the null hypothesis is true.

● For example, the test of the ibuprofen null
hypothesis gave P = 0.02, so the null
hypothesis has only a 2% chance of being
true.

The p-value is the probability that
the null hypothesis is true.
No!
The p-value simply indicates the degree to
which the data conform to the pattern
predicted by the null hypothesis
and all the other assumptions used in the
test (the underlying statistical model).
Backwards

● The absurdity of the common backwards
interpretation might be appreciated
by pondering how the p-value,
which is a probability deduced from a set
of assumptions,
can possibly refer to the probability of
those assumptions.
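One way to see the gap: the probability that the null hypothesis is true, given a significant result, also depends on how plausible the null was beforehand and on the study's power, and neither of these enters the p-value. The sketch below applies Bayes' rule to the event "p < alpha". The prior probabilities and the 80% power are made-up illustrative inputs, not values from the ibuprofen trial, and the function name is hypothetical.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.80):
    """P(null true | test significant) by Bayes' rule.

    prior_null, alpha and power are illustrative assumptions.
    """
    false_pos = alpha * prior_null         # null true and test significant
    true_pos = power * (1.0 - prior_null)  # null false and test significant
    return false_pos / (false_pos + true_pos)


# The same significance threshold gives very different answers
# depending on the assumed prior plausibility of the null.
for prior in (0.10, 0.50, 0.90):
    post = prob_null_given_significant(prior)
    print(f"prior P(null) = {prior:.2f} -> P(null | p < 0.05) = {post:.3f}")
```

The output changes with the assumed prior even though the threshold stays fixed, which is why a p-value alone cannot be read as the probability of the null.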
True or False?
The p-value is the probability that chance
alone produced the observed association.

● For example, the p-value for the ibuprofen
null hypothesis is 0.02.
And so there is a 2% probability that
chance alone produced the association.
The p-value for the null hypothesis is the
probability that chance alone produced the
observed association.

No! To say this is to assert that every
assumption used to compute the p-value is
correct, including the null hypothesis.
Greenland et al.’s Guide

● 14 misinterpretations of a single study’s
p-value(s)
● 4 misinterpretations of p-values across
studies or in subgroups
● 5 misinterpretations of confidence intervals
● 2 misinterpretations of power
p < .05 does NOT mean …

● Ho is false, should be rejected


● Ha is true
● Scientifically important effect detected
● Substantially important relationship
demonstrated
● Chance of false positive finding is 5%
p > .05 does NOT mean …

● Ho is true, should be accepted


● Ha is false
● Evidence in favor of Ho
● There is no effect
● The effect size is small
Greenland et al.’s Conclusions included:

● The probability, likelihood, certainty, etc.
for a hypothesis cannot be derived from
statistical methods alone.
● Significance tests and confidence intervals
do not by themselves provide a logically
sound basis for concluding an effect is
present or absent with a given probability.
Not even scientists can
easily explain p-values
● You can get it right,
or you can make it intuitive,
but it’s all but impossible to do both.
ASA: Conclusion
● Good statistical practice, as an essential component of
good scientific practice, emphasizes:
– principles of good study design and conduct,
– a variety of numerical and graphical summaries of
data,
– understanding of the phenomenon under study,
– interpretation of results in context,
– complete reporting and
– proper logical and quantitative understanding of
what data summaries mean.
● No single index should substitute for scientific
reasoning.
Study Design and Conduct

● PICO-T
● Bias, Confounding, Contamination

● And, eventually, chance


Publication Bias
1. Bias of rhetoric
2. All’s well literature bias
3. Reference bias
4. Positive results bias
5. Hot stuff bias
6. Pre-publication bias
7. Post-publication bias
8. Sponsorship bias
9. Meta-analysis bias

Selection Bias (susceptibility bias)
1. Popularity bias
2. Centripetal bias
3. Referral filter bias
4. Diagnostic access bias
5. Diagnostic suspicion bias
6. Unmasking bias
7. Mimicry bias
8. Previous opinion bias
9. Wrong sample size bias
10. Admission rate bias (Berkson)
11. Prevalence-incidence bias (Neyman)
12. Diagnostic vogue bias
13. Diagnostic purity bias
14. Procedure selection bias
15. Missing clinical data bias
16. Non-contemporaneous control bias
17. Starting time bias
18. Unacceptable disease bias
19. Migrator bias
20. Membership bias
21. Nonrespondent bias
22. Volunteer bias
23. Allocation bias
24. Vulnerability bias
25. Authorization bias

Exposure Bias (performance bias)
1. Contamination bias
2. Withdrawal bias
3. Compliance bias
4. Therapeutic personality bias
5. Bogus control bias
6. Misclassification bias
7. Proficiency bias

Detection Bias (measurement bias)
1. Insensitive measure bias
2. Underlying cause bias (rumination bias)
3. End-digit preference bias
4. Apprehension bias
5. Unacceptability bias
6. Obsequiousness bias
7. Expectation bias
8. Substitution game bias
9. Family information bias
10. Exposure suspicion bias
11. Recall bias
12. Attention bias
13. Instrument bias
14. Surveillance bias
15. Comorbidity bias
16. Nonspecification bias
17. Verification bias (work-up bias)

Analysis Bias (Transfer Bias)
1. Post-hoc significance bias
2. Data dredging bias
3. Scale degradation bias
4. Tidying-up bias (deliberate elimination bias)
5. Repeated peeks bias

Interpretation Bias
1. Mistaken identity bias
2. Cognitive dissonance bias
3. Magnitude bias
4. Significance bias
5. Correlation bias
6. Under-exhaustion bias
The Dunning-Kruger effect

Hartman JM, Forsen JW Jr, Wallace MS, Neely JG. “Tutorials in clinical research:
part IV: recognizing and controlling bias.” Laryngoscope. 2002 Jan;112(1):23-31.
Expanded from: Sackett DL. “Bias in analytic research.” J Chronic Dis.
1979;32(1-2):51-63.
Cognitive Bias Codex
Context

● David Moore:
“Data are numbers,
but they are not ‘just numbers.’
They are numbers with a context.”

Moore and Notz 2006, Statistics: Concepts and Controversies,
NY: Freeman, p xxi
Context

Tonight we’re going to let the statistics
speak for themselves
Ed Koren, © The New Yorker, 9 December 1974
Words Matter

● CONSORT 2010
● How to Report Statistics in Medicine
● AMA Manual of Style

CONsolidated Standards of Reporting Trials

● The CONSORT Statement comprises a 25-item
checklist and a flow diagram. The checklist
items focus on reporting how the trial was
designed, analysed, and interpreted; the flow
diagram displays the progress of all participants
through the trial.
● The CONSORT “Explanation and Elaboration”
document explains and illustrates the principles
underlying the CONSORT Statement.
www.consort-statement.org
Specialized CONSORT
● Harms (safety)
● Non-inferiority
● Cluster randomized trials
● Herbal, Acupuncture
● Non-pharmacologic agents
● Pragmatic trials
● Parent reported outcomes
● N-of-1 trials
● Orthodontic trials
● Pilot and feasibility trials
Enhancing the QUAlity and Transparency
of health Research
● STROBE – Observational studies
● PRISMA – Systematic reviews
● CARE – Case reports
● SRQR – Qualitative research
● STARD – Diagnostic/prognostic studies
● SQUIRE – Quality improvement studies
… a total of 358 reporting guidelines

http://www.equator-network.org/
Dedication
● Lang: To anyone who has encountered the
frustration of what I call “Statistical Buddhism”
● To those who know, no explanation is necessary.
To those who do not know, no explanation is possible.
● Glossary
– P value: probability of obtaining the observed
data (or data that are more extreme) if the null
hypothesis were exactly true.

● www.amamanualofstyle.com
Everitt BS. The Cambridge Dictionary of Statistics in the Medical Sciences.
Cambridge, England: Cambridge University Press; 1995.
Al’s Conclusion
● Good statistical practice is an essential component of
good scientific practice
– Data are information in context.
– Insist on a full and complete description of the
context of a study.
– A p-value is calculated from a set of numbers
encased in certain assumptions.
– Viewed alone, the p-value may be meaningless.
● No single index can substitute for scientific
reasoning.
Thank you
ASA: Six Principles
● P-values can indicate how incompatible the data are with a
specified statistical model.
● P-values do not measure the probability that the studied
hypothesis is true, or the probability that the data were
produced by random chance alone.
● Scientific conclusions and business or policy decisions
should not be based only on whether a p-value passes a
specific threshold.
● Proper inference requires full reporting and transparency.
● A p-value, or statistical significance, does not measure the
size of an effect or the importance of a result.
● By itself, a p-value does not provide a good measure of
evidence regarding a model or hypothesis.
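The principle that a p-value does not measure effect size can be illustrated with a small calculation. The sketch below uses a pooled two-proportion z-test (normal approximation) written with the standard library only. The counts are hypothetical, chosen so that both studies show the same 10-percentage-point difference; they are not data from any trial mentioned in this talk.

```python
from math import erfc, sqrt


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Hypothetical data: the same observed difference (60% vs 50%) at two sample sizes.
small = two_proportion_p_value(30, 50, 25, 50)        # n = 50 per group
large = two_proportion_p_value(600, 1000, 500, 1000)  # n = 1000 per group
print(f"n = 50 per group:   difference 0.10, p = {small:.3f}")
print(f"n = 1000 per group: difference 0.10, p = {large:.6f}")
```

The observed effect is identical in both runs; only the sample size, and therefore the p-value, changes.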
George Cobb—Looking Ahead:
Five Imperatives
● George Cobb (2015) Mere Renovation is Too Little Too Late: We Need to Rethink our Undergraduate Curriculum from the Ground Up, The American Statistician, 69:4, 266-282, DOI:
10.1080/00031305.2015.1093029

● Flatten prerequisites
– Calc I → Calc II → Calc III → Probability → Math
Stat → Biostatistics
● Strip away technical formalism and formulas
● Embrace computation
● Exploit context
– Interpretation, motivation, direction
● Teach through research